NSF CNS-0406351
Next-generation parallel and distributed computing must be dependable and have predictable performance in order to meet the requirements of increasingly complex scientific and commercial applications. The large-scale nature and changing user requirements of such applications, coupled with the changing fault environment and workloads in which they must operate, dictate that their dependability and performance must be managed in an on-line fashion, reacting to changes in anticipated and observed faults, demands placed on the system, and changes in specified dependability, performance, and/or functional requirements. In this project, we created a compiler-enabled model- and measurement-driven adaptation environment that allows distributed applications to perform as expected despite faults that may occur. Achievement of the desired capabilities required fundamental advances in and synergistic combinations between 1) compiler-based flexible dependability mechanisms, 2) efficient online model-based prediction and control, and 3) measurement-driven and compiler-enabled early error detection. We validated the adaptation environment by using it for two important applications from the scientific and commercial domains: the CARMA (Combined Array for Research in Millimeter-Wave Astronomy) data pipeline, which is a data-intensive Grid application for radio astronomy, and iMobile, which is an enterprise-scale mobile services platform.
Static and dynamic compiler transformations were used to create novel dependability mechanisms that require less system resources than traditional mechanisms (minimizing their performance impact), and that can detect classes of errors (such as programming errors) that cannot be detected by traditional replication mechanisms. The new mechanisms can be used by the online adaptation engine to achieve a specified dependability and performance objective. Examples of the new mechanisms include variant replicas of several kinds (viz., replicas that differ from the original process to provide better error detection, lower overhead, or both), and combinations of variant replicas with traditional checkpointing (for more efficient rollback recovery). To minimize static code expansion, replicas can be generated on demand under the control of the adaptation engine, via a transparent dynamic compilation framework. We also explored compiler-based techniques that allow the middleware to coordinate distributed adaptations between processes more intelligently and more efficiently.
The online model-based prediction and control engine makes use of compiler-assisted deterministic and stochastic models, together with input from compiler-guided performance and error measurements, to adapt a system’s configuration to achieve the best combination of performance and dependability under existing conditions. Use of different types of models allows the configuration changes to be either 1) algorithm changes that choose among the available mechanisms, including possible dynamically generated versions, or 2) parameter changes that tune a particular algorithm to work more efficiently. Models incorporate compiler-synthesized model components that capture properties of existing and potential versions of generated code. To provide such prediction and control capabilities, we used a combination of reactive feedback control techniques, along with predictive, state-space-based stochastic modeling techniques (e.g., Markov decision processes). To ensure rapid decision-making and quick solution times for the models, we used a combination of approximation techniques (such as state-space reduction through decomposition and finite-horizon computations) along with partial offline generation of controllers via symbolic solution methods.
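As a rough illustration of the finite-horizon idea mentioned above, the following sketch runs backward induction on a toy Markov decision process. It is not the project's controller: the two states, the two actions, the transition probabilities, and the costs are all hypothetical values chosen only to make the example small enough to check by hand.

```python
# Toy finite-horizon MDP solved by backward induction.
# All states, actions, probabilities, and costs are hypothetical.

STATES = ["ok", "degraded"]
ACTIONS = ["noop", "restart"]

# TRANSITIONS[state][action] -> list of (next_state, probability)
TRANSITIONS = {
    "ok": {
        "noop": [("ok", 0.9), ("degraded", 0.1)],
        "restart": [("ok", 1.0)],
    },
    "degraded": {
        "noop": [("degraded", 1.0)],
        "restart": [("ok", 1.0)],
    },
}

# COST[state][action] -> immediate cost of taking the action in that state
COST = {
    "ok": {"noop": 0.0, "restart": 5.0},
    "degraded": {"noop": 10.0, "restart": 5.0},
}


def finite_horizon_policy(horizon):
    """Return (value, policy) for the given number of decision steps.

    value[s] is the minimum expected total cost from state s.
    policy[t][s] is the action to take in state s at step t,
    where t = 0 is the first decision (full horizon remaining).
    """
    value = {s: 0.0 for s in STATES}  # cost-to-go after the last step
    policy = []
    for _ in range(horizon):
        new_value = {}
        step_policy = {}
        for s in STATES:
            best_action, best_cost = None, float("inf")
            for a in ACTIONS:
                expected = COST[s][a] + sum(
                    p * value[nxt] for nxt, p in TRANSITIONS[s][a]
                )
                if expected < best_cost:
                    best_action, best_cost = a, expected
            new_value[s] = best_cost
            step_policy[s] = best_action
        value = new_value
        policy.append(step_policy)
    policy.reverse()  # steps were computed from last to first
    return value, policy


if __name__ == "__main__":
    value, policy = finite_horizon_policy(3)
    print(policy[0], value)
```

Limiting the computation to a fixed number of steps bounds the solution time, which is one reason a finite horizon can suit online decision-making. A realistic model would have a far larger state space and would need the decomposition and offline techniques described above.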
Sophisticated error and performance measurement techniques were used to characterize system error behavior, enable early error detection, guide online adaptation models, and work with the compiler to improve error detection and tolerance. In particular, measurements on operational systems help characterize real issues in the field, including correlated errors and error propagation, that often escaped earlier detection mechanisms. On-line analysis helps extract error symptoms for early error detection and thus minimizes the performance impact of failures. Such analysis also provides the parameters and distribution characteristics needed to adapt system models and thus ensure effective on-line control. We used compiler support to develop application-specific, preemptive detection techniques to improve coverage and minimize error propagation while maintaining performance.
The larger impact of this research was the production and distribution of a practical, integrated compiler and middleware system that uses online models and measurement techniques to achieve performance and dependability in a scalable manner under a wide variety of changing conditions. The techniques we developed could ultimately impact many diverse and critical applications, including those in the electric power distribution, aerospace, healthcare, and financial services sectors.
People
- Prof. Vikram S. Adve, UIUC
- Dr. Matti Hiltunen, AT&T Labs Research
- Prof. Ravishankar K. Iyer, UIUC
- Dr. Raymond L. Plante, UIUC/NCSA
- Prof. William H. Sanders (PI), UIUC
- Dr. Richard D. Schlichting, AT&T Labs Research
This material is based upon work supported by the National Science Foundation under Grant No. 0406351. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Papers generated by the project:
- V. S. Adve, A. Agbaria, M. A. Hiltunen, R. K. Iyer, K. R. Joshi, Z. Kalbarczyk, R. M. Lefever, R. Plante, W. H. Sanders, and R. D. Schlichting, “A Compiler-Enabled Model- and Measurement-Driven Adaptation Environment for Dependability and Performance,” Proceedings of the Next Generation Software (NGS) Workshop at the International Parallel & Distributed Processing Symposium (IPDPS), Denver, Colorado, April 4, 2005 (CD-ROM). [IEEE Xplore entry]
- A. Agbaria and W. H. Sanders, “Application-Driven Coordination-Free Distributed Checkpointing,” Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, Columbus, Ohio, June 6-10, 2005, pp. 177-186. [IEEE Xplore entry]
- S. Chen, Link Gradients: Predicting the Impact of Link Latency on Multi-tier Applications. Master’s Thesis, University of Illinois at Urbana-Champaign, 2008.
- S. Chen, K. R. Joshi, M. A. Hiltunen, R. D. Schlichting, and W. H. Sanders, “Using Link Gradients to Predict the Impact of Network Latency on Multitier Applications,” IEEE/ACM Transactions on Networking, vol. 19, no. 3, June 2011, pp. 855-868. [IEEE Xplore entry] [ACM DOI: http://dx.doi.org/10.1109/TNET.2010.2098044]
- D. Dhurjati and V. Adve, “Backwards-Compatible Array Bounds Checking for C with Very Low Overhead,” Proceedings of the International Conference on Software Engineering (ICSE), Shanghai, China, May 2006.
- D. Dhurjati and V. Adve, “Efficiently Detecting All Dangling Pointer Uses in Production Servers,” Proceedings of the International Conference on Dependable Systems and Networks (DSN), Philadelphia, USA, June 25-28, 2006, pp. 269-280. [IEEE Xplore entry]
- D. Dhurjati, S. Kowshik, and V. Adve, “Enforcing Alias Analysis for Weakly Typed Languages,” Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Ottawa, Canada, June 2006, pp. 144-157.
- S. Gaonkar, K. Keeton, A. Merchant, and W. H. Sanders, “Designing Dependable Storage Solutions for Shared Application Environments,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 4, October-December 2010, pp. 366-380. [IEEE Xplore entry]
- S. Gaonkar, E. Rozier, A. Tong, and W. H. Sanders, “Scaling File Systems to Support Petascale Clusters: A Dependability Analysis to Support Informed Design Choices,” Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2008), Anchorage, Alaska, June 24-27, 2008, pp. 386-391. [IEEE Xplore entry]
- S. Gaonkar, E. Rozier, A. Tong, and W. H. Sanders, “Scaling File Systems to Support Petascale Clusters: A Dependability Analysis to Support Informed Design Choices,” University of Illinois at Urbana-Champaign Coordinated Science Laboratory technical report UILU-ENG-08-2202 (CRHC-08-01), January 2008.
- S. Gaonkar and W. H. Sanders, “G-SSASC: Simultaneous Simulation of System Models with Bounded Hazard Rates,” Proceedings of the 2009 Winter Simulation Conference, Austin, Texas, USA, December 13-16, 2009, pp. 663-673. [IEEE Xplore entry]
- S. Gaonkar and W. H. Sanders, “Analysis of the Reliability/Availability of Distributed File Systems in Large-Scale Systems: A Case Study Using Simultaneous Simulation,” Proceedings of the 8th International Workshop on Performability Modeling of Computer and Communication Systems, Edinburgh, UK, Sept. 20-21, 2007.
- S. Gaonkar and W. H. Sanders, Simultaneous Simulation of Alternative System Configurations of Markovian System Models. University of Illinois at Urbana-Champaign Coordinated Science Laboratory technical report UILU-ENG-09-2203 (CRHC-09-02), March 2009.
- K. R. Joshi, Stochastic-Model-Driven Adaptation and Recovery in Distributed Systems. Doctoral Dissertation, University of Illinois, 2007.
- K. R. Joshi, M. A. Hiltunen, and W. H. Sanders, “Performability Optimization Using Linear Bounds of Partially Observable Markov Decision Processes,” Proceedings of the 7th International Workshop on Performability Modeling of Computer and Communication Systems (PMCCS-7), Turin, Italy, September 23-24, 2005, pp. 73-76.
- K. R. Joshi, M. Hiltunen, W. H. Sanders, and R. Schlichting, “Automatic Model-Driven Recovery in Distributed Systems,” Proceedings of the 24th IEEE Symposium on Reliable Distributed Systems (SRDS 2005), Orlando, Florida, October 26-28, 2005, pp. 25-36. [IEEE Xplore entry]
- K. R. Joshi, M. A. Hiltunen, W. H. Sanders, and R. D. Schlichting, “Automatic Recovery Using Bounded Partially Observable Markov Decision Processes,” Proceedings of the International Conference on Dependable Systems and Networks (DSN-2006), Philadelphia, PA, USA, June 25-28, 2006, pp. 445-456. [IEEE Xplore entry]
- K. R. Joshi, M. Hiltunen, R. Schlichting, W. H. Sanders, and A. Agbaria, “Online Model-Based Adaptation for Optimizing Performance and Dependability,” Proceedings of the Workshop on Self-Managed Systems (WOSS 2004), Newport Beach, CA, October 31-November 1, 2004 (CD-ROM).
- V. V. Lam, P. Buchholz, and W. H. Sanders, “A Component-Level Path Composition Approach for Efficient Transient Analysis of Large CTMCs,” Proceedings of the International Conference on Dependable Systems and Networks (DSN-2006), Philadelphia, PA, USA, June 25-28, 2006, pp. 485-494. [IEEE Xplore entry]
- R. M. Lefever, Diverse Partial Memory Replication, Ph.D. thesis, University of Illinois at Urbana-Champaign, 2011.
- R. M. Lefever, V. S. Adve, and W. H. Sanders, “Diverse Partial Memory Replication,” Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010), Chicago, Illinois, June 28-July 1, 2010, pp. 71-80. [IEEE Xplore entry]
- R. M. Lefever, V. S. Adve, and W. H. Sanders, “A Mirrored Data Structures Approach to Diverse Partial Memory Replication,” Proceedings of the 9th European Dependable Computing Conference (EDCC-2012), Sibiu, Romania, May 8-11, 2012, pp. 61-72. [IEEE Xplore entry]
- K. Pattabiraman, Z. Kalbarczyk, and R. K. Iyer, “Application-Based Metrics for Strategic Placement of Detectors,” Proc. of the 11th Pacific Rim International Symposium on Dependable Computing, PRDC’05, Dec. 12-14, 2005. [IEEE Xplore entry]
- K. Pattabiraman, Z. Kalbarczyk, and R. Iyer, “Automated Derivation of Application-aware Error Detectors using Static Analysis,” Fast Abstract at the International Conference on Dependable Systems and Networks, DSN-06, June 2006.
- K. Pattabiraman, G. P. Saggese, D. Chen, Z. Kalbarczyk, and R. K. Iyer, “Automated Derivation and Hardware Implementation of Application-Specific Error Detectors,” Proc. HPCRI: 2nd Workshop on High Performance Computing: Reliability Issues; held in conjunction with the 12th International Symposium on High Performance Computer Architecture (HPCA-12), Austin, 2006.
- K. Pattabiraman, G.-P. Saggese, D. Chen, Z. Kalbarczyk, and R. Iyer, “Dynamic Derivation of Application-Specific Error Detectors and Their Implementation in Hardware,” Proceedings of the 6th European Dependable Computing Conference (EDCC ’06), Oct. 18-20, 2006, pp. 97-108. [IEEE Xplore entry]
- H. V. Ramasamy, Parsimonious Service Replication for Tolerating Malicious Attacks in Asynchronous Environments, Ph.D. thesis, University of Illinois at Urbana-Champaign, 2005.
- H. V. Ramasamy, A. Agbaria, and W. H. Sanders, “A Parsimonious Approach for Obtaining Resource-Efficient and Trustworthy Execution,” IEEE Transactions on Dependable and Secure Computing, vol. 4, no. 1, January-March 2007, pp. 1-17. [IEEE Xplore entry]
- H. V. Ramasamy, A. Agbaria, and W. H. Sanders, “Parsimony-Based Approach for Obtaining Resource-Efficient and Trustworthy Execution,” Dependable Computing: Proceedings of the 2nd Latin-American Symposium (LADC 2005), Salvador, Brazil, October 25-28, 2005, LNCS vol. 3747, Springer-Verlag, pp. 206-225.
- H. V. Ramasamy and C. Cachin, “Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast,” Proceedings of the 9th International Conference on Principles of Distributed Systems (OPODIS), Pisa, Italy, Dec. 12-14, 2005.
- H. Ramasamy, M. Seri, and W. H. Sanders, “The CoBFIT Toolkit,” Proceedings of the 26th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC 2007), Portland, Oregon, Aug. 12-15, 2007, pp. 350-351. [ACM DOI: http://dx.doi.org/10.1145/1281100.1281167]
- P. Sousa, N. F. Neves, P. Veríssimo, and W. H. Sanders, “Proactive Resilience Revisited: The Delicate Balance Between Resisting Intrusions and Remaining Available,” Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems (SRDS 2006), Leeds, UK, October 2-4, 2006, pp. 71-82. [IEEE Xplore entry]
- L. Wang, K. Pattabiraman, L. Votta, C. Vick, A. Wood, Z. Kalbarczyk, and R. K. Iyer, “Modeling Coordinated Checkpointing for Large-Scale Supercomputers,” Proceedings of the International Conference on Dependable Systems and Networks (DSN), Yokohama, Japan, 2005, pp. 812-821. [IEEE Xplore entry]