SC is the International Conference for High Performance Computing, Networking, Storage and Analysis

SCHEDULE: NOV 13-19, 2010

Fault Tolerance for Extreme Scale Computing through Algorithm-Based Recovery

SESSION: Research Poster Reception

EVENT TYPE: Poster

TIME: 5:15PM - 7:00PM

AUTHOR(S): Teresa Davies, Christer Karlsson, Hui Liu, Zizhong Chen

ROOM: Main Lobby

ABSTRACT:
In today's high-performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique that can be used in many types of systems and applied to a wide range of applications, it often introduces considerable overhead, especially when computations reach petascale and beyond. In this poster, we design algorithm-based recovery techniques for selected linear algebra operations to tolerate fail-stop failures without checkpointing. Because no periodic checkpoint is necessary during the computation and no rollback is necessary during recovery, the proposed algorithm-based recovery scheme is often highly scalable and has good potential to scale to extreme-scale computing and beyond. Experimental results demonstrate that the proposed fault tolerance technique introduces much less overhead than checkpointing on Kraken, currently the world's fourth-fastest supercomputer.
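The abstract does not spell out the recovery scheme, so the following is only a minimal sketch of the general idea behind algorithm-based recovery, assuming a simple row-checksum encoding of a matrix product. The function names (`matmul`, `add_checksum_row`, `recover_row`) and the single-failure scenario are hypothetical illustrations, not the poster's implementation. Each row of the product stands in for the data held by one process; an extra checksum row lets a lost row be rebuilt from the survivors without a checkpoint or a rollback.

```python
def matmul(a, b):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_checksum_row(a):
    """Append one extra row holding the column-wise sum of the rows of a."""
    return a + [[sum(col) for col in zip(*a)]]


def recover_row(c_full, lost):
    """Rebuild row `lost` of the product from the checksum row.

    c_full is the product of the checksum-encoded left operand with the
    right operand, so its last row equals the sum of all data rows.
    Subtracting the surviving data rows leaves the lost row.
    """
    checksum = c_full[-1]
    survivors = [row for i, row in enumerate(c_full[:-1]) if i != lost]
    return [checksum[j] - sum(row[j] for row in survivors)
            for j in range(len(checksum))]


# Assumed toy data: integer matrices, so the recovery is exact.
A = [[1, 2], [3, 4], [5, 6]]
B = [[2, 1], [0, 3]]

# The checksum relationship survives the multiplication:
# (checksum row of A) * B == sum of the rows of A * B.
C_full = matmul(add_checksum_row(A), B)

# Simulate a fail-stop failure of the process holding row 1.
expected = C_full[1]
C_full[1] = None

C_full[1] = recover_row(C_full, lost=1)
print(C_full[1] == expected)
```

In floating-point arithmetic the rebuilt row would match only up to round-off, and tolerating more than one simultaneous failure would need additional, independent checksum rows.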

Chair/Author Details:

Teresa Davies - Colorado School of Mines

Christer Karlsson - Colorado School of Mines

Hui Liu - Colorado School of Mines

Zizhong Chen - Colorado School of Mines

Sponsors: IEEE, ACM