SC is the International Conference for
 High Performance Computing, Networking, Storage and Analysis

SCHEDULE: NOV 13-19, 2010

An Evaluation of Fault-Tolerance Techniques for Exascale Systems

SESSION: Research Poster Reception


TIME: 5:15PM - 7:00PM

AUTHOR(S): Kurt B. Ferreira, Rolf Riesen

ROOM: Main Lobby

As high-end computing machines continue to grow, issues such as fault tolerance and reliability limit application scalability. In this work we evaluate the suitability of three well-known techniques that allow applications to make progress in the presence of a wide variety of faults: coordinated checkpointing, node-level replication, and uncoordinated checkpointing with sender-side message logging. For each method we outline the technique's limitations and overheads and point out scenarios in which the method would be suitable for exascale systems.

Chair/Author Details:

Kurt B. Ferreira - Sandia National Laboratories

Rolf Riesen - Sandia National Laboratories

