An Evaluation of Fault-Tolerance Techniques for Exascale Systems
SESSION: Research Poster Reception
EVENT TYPE: Poster
TIME: 5:15PM - 7:00PM
AUTHOR(S): Kurt B. Ferreira, Rolf Riesen
ABSTRACT: As High-End Computing machines continue to grow, issues such as fault tolerance and reliability limit application scalability. In this work we evaluate the suitability of three well-known techniques that allow applications to ensure progress in the presence of a wide variety of faults: coordinated checkpointing, node-level replication, and uncoordinated checkpointing with sender-side message logging. For each method, we outline its limitations and overheads and point out scenarios in which it would be suitable for exascale systems.