An Automated Approach for Fault Recovery Planning in Science Clouds

AUTHOR(S): Vitalian A. Danciu, Dieter Kranzlmüller, Feng Liu, Johannes Watzl, Pavlo Kerestey, Maximilian Ahrens

Failures and anomalies are inevitable rather than exceptional in a large-scale cloud computing infrastructure. Its multi-layered architecture and sheer scale lead to a high fault frequency, which makes purely manual recovery approaches impractical. In our Science Cloud scenario, the criticality of the scientific applications running on top of the cloud infrastructure requires that, once a fault occurs, recovery solutions be planned quickly to reduce the outage period and any other negative impact. To address this challenge, we propose an automated fault recovery architecture with an AI-based planning algorithm at its core. As its main contributions, this poster presents an algorithm for automated recovery plan composition, data models for the recovery planning knowledge, and an architecture to facilitate planning operations. Evaluation results from a preliminary prototype implementation are also presented.

Chair/Author Details:

Vitalian A. Danciu - Ludwig-Maximilians-University Munich

Dieter Kranzlmüller - Ludwig-Maximilians-University Munich

Feng Liu - Ludwig-Maximilians-University Munich

Johannes Watzl - Ludwig-Maximilians-University Munich

Pavlo Kerestey - Technical University Munich

Maximilian Ahrens - Zimory GmbH

