A Compiler-Enabled Model- and Measurement-Driven Adaptation Environment for Dependability and Performance
Next-generation parallel and distributed computing must be dependable and have predictable performance in order to meet the requirements of increasingly complex scientific and commercial applications. The large-scale nature and changing user requirements of such applications, coupled with the changing fault environment and workloads in which they must operate, dictate that their dependability and performance must be managed in an on-line fashion, reacting to changes in anticipated and observed faults, demands placed on the system, and changes in specified dependability, performance, and/or functional requirements. In this project, we created a compiler-enabled model- and measurement-driven adaptation environment that allows distributed applications to perform as expected despite faults that may occur. Achieving the desired capabilities required fundamental advances in, and synergistic combinations of, 1) compiler-based flexible dependability mechanisms, 2) efficient online model-based prediction and control, and 3) measurement-driven and compiler-enabled early error detection. We validated the adaptation environment by using it for two important applications from the scientific and commercial domains: the CARMA (Combined Array for Research in Millimeter-Wave Astronomy) data pipeline, which is a data-intensive Grid application for radio astronomy, and iMobile, which is an enterprise-scale mobile services platform.
Static and dynamic compiler transformations were used to create novel dependability mechanisms that require fewer system resources than traditional mechanisms (minimizing their performance impact), and that can detect classes of errors (such as programming errors) that cannot be detected by traditional replication mechanisms. The new mechanisms can be used by the online adaptation engine to achieve a specified dependability and performance objective. Examples of the new mechanisms include variant replicas of several kinds (viz., replicas that differ from the original process to provide better error detection, lower overhead, or both), and combinations of variant replicas with traditional checkpointing (for more efficient rollback recovery). To minimize static code expansion, replicas can be generated on demand under the control of the adaptation engine, via a transparent dynamic compilation framework. We also explored compiler-based techniques that allow the middleware to coordinate distributed adaptations between processes more intelligently and more efficiently.
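The idea that a variant replica can expose an error that identical replicas would miss can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `checked_call`, `sum_buggy`, `sum_variant`, and `ReplicaMismatch` are hypothetical, and the hand-written variant merely stands in for a compiler-generated one. It is not the project's mechanism.

```python
class ReplicaMismatch(Exception):
    """Raised when the original and the variant disagree (hypothetical name)."""


def checked_call(original, variant, *args):
    """Run both versions on the same input and compare their results."""
    expected = original(*args)
    observed = variant(*args)
    if expected != observed:
        raise ReplicaMismatch(f"original={expected!r}, variant={observed!r}")
    return expected


def sum_buggy(values):
    # Deliberate programming error for illustration: skips the last element.
    total = 0
    for i in range(len(values) - 1):
        total += values[i]
    return total


def sum_variant(values):
    # Hand-written stand-in for a variant of the same computation.
    return sum(values)


if __name__ == "__main__":
    data = [1, 2, 3]
    # Two identical replicas agree with each other, so the bug goes unnoticed.
    print(checked_call(sum_buggy, sum_buggy, data))
    # A variant replica disagrees with the original, which exposes the error.
    try:
        checked_call(sum_buggy, sum_variant, data)
    except ReplicaMismatch as err:
        print("mismatch detected:", err)
```

In this toy setting, exact duplication only detects faults that make the two copies diverge, whereas a replica that differs from the original can also disagree with it when the original's logic is wrong.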
The online model-based prediction and control engine makes use of compiler-assisted deterministic and stochastic models, together with input from compiler-guided performance and error measurements, to adapt a system's configuration to achieve the best combination of performance and dependability under existing conditions. Use of different types of models allows the configuration changes to be either 1) algorithm changes that choose among the available mechanisms, including possible dynamically generated versions, or 2) parameter changes that tune a particular algorithm to work more efficiently. Models incorporate compiler-synthesized model components that capture properties of existing and potential versions of generated code. To provide such prediction and control capabilities, we used a combination of reactive feedback control techniques, along with predictive, state-space-based stochastic modeling techniques (e.g., Markov decision processes). To ensure rapid decision-making and quick solution times for the models, we used a combination of approximation techniques (such as state-space reduction through decomposition and finite horizon computations) along with partial offline generation of controllers via symbolic solution methods.
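A finite-horizon computation on a Markov decision process can be sketched as follows. This is a toy example under stated assumptions, not the project's controller: the two states, the two mechanisms (`light`, `replicated`), and every probability and cost are invented for illustration, and `finite_horizon_policy` is a hypothetical name.

```python
# Hypothetical two-state model; all names and numbers are illustrative assumptions.
STATES = ["normal", "faulty"]
ACTIONS = ["light", "replicated"]  # two candidate dependability mechanisms

# TRANSITIONS[action][state] maps each next state to its probability.
TRANSITIONS = {
    "light": {
        "normal": {"normal": 0.9, "faulty": 0.1},
        "faulty": {"normal": 0.2, "faulty": 0.8},
    },
    "replicated": {
        "normal": {"normal": 0.99, "faulty": 0.01},
        "faulty": {"normal": 0.7, "faulty": 0.3},
    },
}

# COST[action][state] is the one-step cost (overhead plus failure penalty).
COST = {
    "light": {"normal": 1.0, "faulty": 10.0},
    "replicated": {"normal": 2.0, "faulty": 6.0},
}


def finite_horizon_policy(horizon):
    """Backward induction: minimize expected total cost over `horizon` steps.

    Returns (value, policy), where value[s] is the optimal expected cost from
    state s and policy[t][s] is the action to take in state s at step t.
    """
    value = {s: 0.0 for s in STATES}  # no cost remains after the last step
    policy = []
    for _ in range(horizon):
        new_value, step_policy = {}, {}
        for s in STATES:
            best_action, best_cost = None, float("inf")
            for a in ACTIONS:
                expected = COST[a][s] + sum(
                    p * value[nxt] for nxt, p in TRANSITIONS[a][s].items()
                )
                if expected < best_cost:
                    best_action, best_cost = a, expected
            new_value[s] = best_cost
            step_policy[s] = best_action
        value = new_value
        policy.insert(0, step_policy)  # steps are computed last-to-first
    return value, policy


if __name__ == "__main__":
    value, policy = finite_horizon_policy(horizon=3)
    print(value)
    print(policy[0])
```

Bounding the horizon keeps the cost of each decision proportional to the horizon length times the number of state-action pairs, which is one reason such computations suit online use; the trade-off is that effects beyond the horizon are ignored.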
Sophisticated error and performance measurement techniques were used to characterize system error behavior, enable early error detection, guide online adaptation models, and work with the compiler to improve error detection and tolerance. In particular, measurements on operational systems help characterize real issues in the field, including correlated errors and error propagation, that often escaped earlier detection mechanisms. On-line analysis helps extract error symptoms for early error detection and thus minimize the performance impact of failures. Such analysis also provides the parameters and distribution characteristics needed to adapt system models and thus ensure effective on-line control. We used compiler support to develop application-specific, preemptive detection techniques to improve coverage and minimize error propagation while maintaining performance.
The larger impact of this research was the production and distribution of a practical, integrated compiler and middleware system that uses online models and measurement techniques to achieve performance and dependability in a scalable manner under a wide variety of changing conditions. The techniques we developed could ultimately impact many diverse and critical applications, including those in the electric power distribution, aerospace, healthcare, and financial services sectors.
This material is based upon work supported by the National Science Foundation under Grant No. 0406351. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Papers generated by the project:
- V. S. Adve, A. Agbaria, M. A. Hiltunen, R. K. Iyer, K. R. Joshi, Z. Kalbarczyk, R. M. Lefever, R. Plante, W. H. Sanders, and R. D. Schlichting, "A Compiler-Enabled Model- and Measurement-Driven Adaptation Environment for Dependability and Performance," Proceedings of the Next Generation Software (NGS) Workshop at the International Parallel & Distributed Processing Symposium (IPDPS), Denver, Colorado, April 4, 2005 (CD-ROM). [IEEE Xplore entry]
- S. Chen, K. R. Joshi, M. A. Hiltunen, R. D. Schlichting, and W. H. Sanders, "Using Link Gradients to Predict the Impact of Network Latency on Multitier Applications," IEEE/ACM Transactions on Networking, vol. 19, no. 3, June 2011, pp. 855-868. [IEEE Xplore entry] [ACM DOI: http://dx.doi.org/10.1109/TNET.2010.2098044]
- D. Dhurjati and V. Adve, "Backwards-Compatible Array Bounds Checking for C with Very Low Overhead," Proceedings of the International Conference on Software Engineering (ICSE), Shanghai, China, May 2006.
- D. Dhurjati and V. Adve, "Efficiently Detecting All Dangling Pointer Uses in Production Servers," Proceedings of the International Conference on Dependable Systems and Networks (DSN), Philadelphia, USA, June 25-28, 2006, pp. 269-280. [IEEE Xplore entry]
- D. Dhurjati, S. Kowshik, and V. Adve, "Enforcing Alias Analysis for Weakly Typed Languages," Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Ottawa, Canada, June 2006, pp. 144-157.
- S. Gaonkar, E. Rozier, A. Tong, and W. H. Sanders, "Scaling File Systems to Support Petascale Clusters: A Dependability Analysis to Support Informed Design Choices," Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2008), Anchorage, Alaska, June 24-27, 2008, pp. 386-391. [IEEE Xplore entry]
- K. R. Joshi, M. Hiltunen, W. H. Sanders, and R. Schlichting, "Automatic Model-Driven Recovery in Distributed Systems," Proceedings of the 24th IEEE Symposium on Reliable Distributed Systems (SRDS 2005), Orlando, Florida, October 26-28, 2005, pp. 25-36. [IEEE Xplore entry]
- K. R. Joshi, M. A. Hiltunen, W. H. Sanders, and R. D. Schlichting, "Automatic Recovery Using Bounded Partially Observable Markov Decision Processes," Proceedings of the International Conference on Dependable Systems and Networks (DSN-2006), Philadelphia, PA, USA, June 25-28, 2006, pp. 445-456. [IEEE Xplore entry]
- K. R. Joshi, M. Hiltunen, R. Schlichting, W. H. Sanders, and A. Agbaria, "Online Model-Based Adaptation for Optimizing Performance and Dependability," Proceedings of the Workshop on Self-Managed Systems (WOSS 2004), Newport Beach, CA, October 31-November 1, 2004 (CD-ROM).
- V. V. Lam, P. Buchholz, and W. H. Sanders, "A Component-Level Path Composition Approach for Efficient Transient Analysis of Large CTMCs," Proceedings of the International Conference on Dependable Systems and Networks (DSN-2006), Philadelphia, PA, USA, June 25-28, 2006, pp. 485-494. [IEEE Xplore entry]
- R. M. Lefever, V. S. Adve, and W. H. Sanders, "Diverse Partial Memory Replication," Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010), Chicago, Illinois, June 28-July 1, 2010, pp. 71-80. [IEEE Xplore entry]
- K. Pattabiraman, Z. Kalbarczyk, and R. K. Iyer, "Application-Based Metrics for Strategic Placement of Detectors," Proc. of the 11th Pacific Rim International Symposium on Dependable Computing, PRDC'05, Dec. 12-14, 2005. [IEEE Xplore entry]
- K. Pattabiraman, Z. Kalbarczyk, and R. Iyer, "Automated Derivation of Application-aware Error Detectors using Static Analysis," Fast Abstract at the International Conference on Dependable Systems and Networks, DSN-06, June 2006.
- K. Pattabiraman, G. P. Saggese, D. Chen, Z. Kalbarczyk, and R. K. Iyer, "Automated Derivation and Hardware Implementation of Application-Specific Error Detectors," Proc. HPCRI: 2nd Workshop on High Performance Computing: Reliability Issues; held in conjunction with the 12th International Symposium on High Performance Computer Architecture (HPCA-12), Austin, 2006.
- K. Pattabiraman, G.-P. Saggese, D. Chen, Z. Kalbarczyk, and R. Iyer, "Dynamic Derivation of Application-Specific Error Detectors and Their Implementation in Hardware," Proceedings of the 6th European Dependable Computing Conference (EDCC '06), Oct. 18-20, 2006, pp. 97-108. [IEEE Xplore entry]
- H. V. Ramasamy, A. Agbaria, and W. H. Sanders, "Parsimony-Based Approach for Obtaining Resource-Efficient and Trustworthy Execution," Dependable Computing: Proceedings of the 2nd Latin-American Symposium (LADC 2005), Salvador, Brazil, October 25-28, 2005, LNCS vol. 3747, Springer-Verlag, pp. 206-225.
- P. Sousa, N. F. Neves, P. Veríssimo, and W. H. Sanders, "Proactive Resilience Revisited: The Delicate Balance Between Resisting Intrusions and Remaining Available," Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems (SRDS 2006), Leeds, UK, October 2-4, 2006, pp. 71-82. [IEEE Xplore entry]
- L. Wang, K. Pattabiraman, L. Votta, C. Vick, A. Wood, Z. Kalbarczyk, and R. K. Iyer, "Modeling Coordinated Checkpointing for Large-Scale Supercomputers," Proceedings of the International Conference on Dependable Systems and Networks (DSN), Yokohama, Japan, 2005, pp. 812-821. [IEEE Xplore entry]
COPYRIGHT NOTICES: The above electronic files are presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
The following copyright notice applies to all of the above items that appear in IEEE publications: "Personal use of this material is permitted. However, permission to reprint/publish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from IEEE."
With respect to items published by the ACM: Items are © by the authors listed and by the ACM, in the year listed on each item. The files posted here are the authors' versions of the work. They are posted here for your personal use and are not for redistribution. The definitive Version of Record was published as indicated in the bibliographic citations provided, and is available from the ACM Digital Library from the ACM Digital Object Identifier listed for each publication.