SC is the International Conference for
 High Performance Computing, Networking, Storage and Analysis

SCHEDULE: NOV 13-19, 2010

Scalable Fault Tolerance in PGAS Programming Models

SESSION: Research Poster Reception


TIME: 5:15PM - 7:00PM

AUTHOR(S): Nawab Ali, Sriram Krishnamoorthy, Niranjan Govind, Oreste Villa, James Dinan, Robert Harrison

ROOM: Main Lobby

Recent trends in high-performance computing point toward extremely large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean time between failures (MTBF) ranging from a few days to a few hours. For long-running applications, the ability to recover efficiently from frequent failures is therefore essential. We have developed three fault-tolerance techniques to compensate for the high failure rates in HPC systems. These techniques employ redundant communication, VM-based checkpointing, and selective restart to provide applications with a high degree of fault tolerance.
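
The inverse relationship between machine size and MTBF can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the abstract, that nodes fail independently with exponentially distributed lifetimes and that any single node failure counts as a system failure. The function name and the five-year per-node MTBF are hypothetical values chosen only for illustration.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """MTBF of a system that fails whenever any one node fails.

    Assumes independent nodes with exponentially distributed lifetimes,
    so failure rates add and the system MTBF is node MTBF / node count.
    """
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mtbf_hours and node_count must be positive")
    return node_mtbf_hours / node_count


# Hypothetical figure: each node averages 5 years between failures.
NODE_MTBF_HOURS = 5 * 365 * 24  # 43,800 hours

for nodes in (1_000, 10_000, 100_000):
    mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
    print(f"{nodes:>7} nodes: {mtbf:.2f} h between failures")
```

Under these assumptions, the system-wide MTBF drops from roughly 44 hours at 1,000 nodes to about 4.4 hours at 10,000 nodes, which is the days-to-hours range the abstract describes.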

Chair/Author Details:

Nawab Ali - Pacific Northwest National Laboratory

Sriram Krishnamoorthy - Pacific Northwest National Laboratory

Niranjan Govind - Pacific Northwest National Laboratory

Oreste Villa - Pacific Northwest National Laboratory

James Dinan - Ohio State University

Robert Harrison - Oak Ridge National Laboratory

