A Compiler-Enabled Model- and Measurement-Driven Adaptation Environment for Dependability and Performance

NSF CNS-0406351

Next-generation parallel and distributed computing must be dependable and have predictable performance in order to meet the requirements of increasingly complex scientific and commercial applications. The large-scale nature and changing user requirements of such applications, coupled with the changing fault environment and workloads in which they must operate, dictate that their dependability and performance must be managed in an on-line fashion, reacting to changes in anticipated and observed faults, demands placed on the system, and changes in specified dependability, performance, and/or functional requirements. In this project, we created a compiler-enabled model- and measurement-driven adaptation environment that allows distributed applications to perform as expected despite faults that may occur. Achievement of the desired capabilities required fundamental advances in and synergistic combinations between 1) compiler-based flexible dependability mechanisms, 2) efficient online model-based prediction and control, and 3) measurement-driven and compiler-enabled early error detection. We validated the adaptation environment by using it for two important applications from the scientific and commercial domains: the CARMA (Combined Array for Research in Millimeter-Wave Astronomy) data pipeline, which is a data-intensive Grid application for radio astronomy, and iMobile, which is an enterprise-scale mobile services platform.

Static and dynamic compiler transformations were used to create novel dependability mechanisms that require fewer system resources than traditional mechanisms (minimizing their performance impact) and that can detect classes of errors (such as programming errors) that traditional replication mechanisms cannot detect. The new mechanisms can be used by the online adaptation engine to achieve a specified dependability and performance objective. Examples of the new mechanisms include variant replicas of several kinds (viz., replicas that differ from the original process to provide better error detection, lower overhead, or both), and combinations of variant replicas with traditional checkpointing (for more efficient rollback recovery). To minimize static code expansion, replicas can be generated on demand under the control of the adaptation engine, via a transparent dynamic compilation framework. We also explored compiler-based techniques that allow the middleware to coordinate distributed adaptations between processes more intelligently and more efficiently.
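
To illustrate why a variant replica can catch a programming error that an identical replica cannot, the following is a minimal sketch in Python. It is not the project's compiler-generated mechanism: the variant here is written by hand, and the names `mean_original`, `mean_variant`, `run_with_replica`, and `ReplicaMismatch` are hypothetical. The original deliberately contains a bug (truncating division), so two identical copies agree on the wrong answer, while the variant disagrees and exposes the error.

```python
class ReplicaMismatch(Exception):
    """Raised when the primary and its replica disagree (hypothetical name)."""


def mean_original(values):
    # Deliberately buggy "original": integer division truncates the mean.
    return sum(values) // len(values)


def mean_variant(values):
    # Hand-written variant: computes the same quantity with a different
    # algorithm (running mean), so it does not share the original's bug.
    mean = 0.0
    for count, value in enumerate(values, start=1):
        mean += (value - mean) / count
    return mean


def run_with_replica(primary, replica, values, tolerance=1e-9):
    # Run both versions on the same input and compare their outputs.
    primary_result = primary(values)
    replica_result = replica(values)
    if abs(primary_result - replica_result) > tolerance:
        raise ReplicaMismatch(
            f"primary={primary_result!r} replica={replica_result!r}"
        )
    return primary_result


if __name__ == "__main__":
    data = [1, 2]
    # Identical replicas agree, so the bug goes unnoticed.
    print("identical replicas:", run_with_replica(mean_original, mean_original, data))
    # A variant replica disagrees, so the bug is detected.
    try:
        run_with_replica(mean_original, mean_variant, data)
    except ReplicaMismatch as err:
        print("variant replica detected an error:", err)
```

In this sketch the comparison happens only on the final output; where and how often replicas are compared is a design choice that trades detection latency against overhead.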

The online model-based prediction and control engine makes use of compiler-assisted deterministic and stochastic models, together with input from compiler-guided performance and error measurements, to adapt a system’s configuration to achieve the best combination of performance and dependability under existing conditions. Use of different types of models allows the configuration changes to be either 1) algorithm changes that choose among the available mechanisms, including possible dynamically generated versions, or 2) parameter changes that tune a particular algorithm to work more efficiently. Models incorporate compiler-synthesized model components that capture properties of existing and potential versions of generated code. To provide such prediction and control capabilities, we combined reactive feedback control techniques with predictive, state-space-based stochastic modeling techniques (e.g., Markov decision processes). To ensure rapid decision-making and quick solution times for the models, we combined approximation techniques (such as state-space reduction through decomposition and finite-horizon computations) with partial offline generation of controllers via symbolic solution methods.
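
As a rough illustration of a finite-horizon computation on a Markov decision process, the sketch below chooses between two mechanisms in a toy two-state model. Everything in it is an assumption made for illustration: the state names, the action names, and all probabilities and rewards are invented, and the project's actual models are not reproduced here. Transitions are taken to be independent of the chosen action, which models a fault environment that the controller observes but does not influence.

```python
# Hypothetical two-state, two-action MDP; all numbers are illustrative.
STATES = ["low_fault", "high_fault"]
ACTIONS = ["checkpoint_only", "variant_replica"]

# TRANSITIONS[state][action] -> list of (next_state, probability).
TRANSITIONS = {
    "low_fault": {
        "checkpoint_only": [("low_fault", 0.9), ("high_fault", 0.1)],
        "variant_replica": [("low_fault", 0.9), ("high_fault", 0.1)],
    },
    "high_fault": {
        "checkpoint_only": [("low_fault", 0.3), ("high_fault", 0.7)],
        "variant_replica": [("low_fault", 0.3), ("high_fault", 0.7)],
    },
}

# REWARDS[state][action]: assumed one-step reward, e.g. useful throughput
# minus the expected cost of undetected or slowly recovered failures.
REWARDS = {
    "low_fault": {"checkpoint_only": 10.0, "variant_replica": 7.0},
    "high_fault": {"checkpoint_only": 2.0, "variant_replica": 6.0},
}


def finite_horizon_policy(horizon):
    """Backward induction over `horizon` decision steps.

    Returns (value, policy), where value[s] is the optimal expected total
    reward from state s and policy[k][s] is the best action in state s at
    decision step k (k = 0 is the first decision).
    """
    value = {state: 0.0 for state in STATES}  # value after the last step
    policy = []
    for _ in range(horizon):
        new_value = {}
        step_policy = {}
        for state in STATES:
            best_action, best_q = None, float("-inf")
            for action in ACTIONS:
                expected_future = sum(
                    prob * value[nxt] for nxt, prob in TRANSITIONS[state][action]
                )
                q = REWARDS[state][action] + expected_future
                if q > best_q:
                    best_action, best_q = action, q
            new_value[state] = best_q
            step_policy[state] = best_action
        value = new_value
        policy.append(step_policy)
    policy.reverse()  # computed last step first; put the first decision at index 0
    return value, policy


if __name__ == "__main__":
    value, policy = finite_horizon_policy(horizon=2)
    print(value)
    print(policy[0])
```

Bounding the horizon keeps the cost of each decision proportional to the horizon length times the size of the model, which is one reason such computations suit online use.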

Sophisticated error and performance measurement techniques were used to characterize system error behavior, enable early error detection, guide online adaptation models, and work with the compiler to improve error detection and tolerance. In particular, measurements on operational systems help characterize real issues in the field, including correlated errors and error propagation, that often escaped earlier detection mechanisms. On-line analysis helps extract error symptoms for early error detection and thus minimizes the performance impact of failures. Such analysis also provides the parameters and distribution characteristics needed to adapt system models and thus ensure effective on-line control. We used compiler support to develop application-specific, preemptive detection techniques to improve coverage and minimize error propagation while maintaining performance.
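
One simple way to turn a stream of error measurements into an early warning is a sliding-window rate check, sketched below in Python. This is a generic technique chosen for illustration, not the project's detection method; the class name `ErrorRateDetector` and its parameters are hypothetical. The detector raises an alarm when at least `threshold` error events fall within the most recent `window_seconds`, which is a crude proxy for the bursts that correlated errors tend to produce.

```python
from collections import deque


class ErrorRateDetector:
    """Sliding-window alarm on error-event timestamps (hypothetical sketch)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # timestamps of recent error events, oldest first

    def observe(self, timestamp):
        """Record one error event; return True if the alarm condition holds.

        Timestamps are assumed to arrive in non-decreasing order.
        """
        self.events.append(timestamp)
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events) >= self.threshold


if __name__ == "__main__":
    detector = ErrorRateDetector(window_seconds=10, threshold=3)
    for t in [0, 20, 21, 25]:
        print(t, detector.observe(t))
```

The same event counts could also feed the parameters of a system model, for example as an estimate of the current error rate.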

The larger impact of this research was the production and distribution of a practical, integrated compiler and middleware system that uses online models and measurement techniques to achieve performance and dependability in a scalable manner under a wide variety of changing conditions. The techniques we developed could ultimately impact many diverse and critical applications, including those in the electric power distribution, aerospace, healthcare, and financial services sectors.


Papers generated by the project:

This material is based upon work supported by the National Science Foundation under Grant No. 0406351. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.