SC is the International Conference for High Performance Computing, Networking, Storage and Analysis

SCHEDULE: NOV 13-19, 2010

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

SESSION: Scalable System Software

EVENT TYPE: Paper

TIME: 3:30PM - 4:00PM

SESSION CHAIR: Franck Cappello

AUTHOR(S): Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski

ROOM: 393

ABSTRACT:
As high-performance computing systems increase in size, checkpointing to the parallel file system becomes prohibitively expensive. Multi-level checkpointing may solve this challenge by using lightweight checkpoints to handle the most common failures and relying on parallel file system checkpoints only for less common but more severe failures. To evaluate this approach in a large-scale production system context, we developed the Scalable Checkpoint/Restart library, which checkpoints to storage on the compute nodes in addition to the parallel file system. Through experiments and modeling, we show that multi-level checkpointing benefits existing systems, and we find that the benefits increase on larger systems. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. Our approach improves machine efficiency by up to 35%, while reducing the load on the parallel file system by a factor of two.
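
The multi-level idea in the abstract can be illustrated with a minimal sketch. It is not the Scalable Checkpoint/Restart library's API, and the abstract does not say how checkpoints are scheduled across levels; the fixed ratio used here (one parallel file system checkpoint every `pfs_every` checkpoints) and all names (`Checkpoint`, `choose_level`, `latest_usable`) are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical level numbers: the abstract only distinguishes storage on the
# compute nodes from the parallel file system.
NODE_LOCAL = 1    # cheap, but lost in more severe failures
PARALLEL_FS = 2   # expensive, survives more severe failures


@dataclass
class Checkpoint:
    step: int
    level: int


def choose_level(checkpoint_index: int, pfs_every: int) -> int:
    """Pick the level for the checkpoint_index-th checkpoint (1-based).

    Assumed policy: every pfs_every-th checkpoint goes to the parallel
    file system, all others stay on node-local storage.
    """
    if pfs_every < 1:
        raise ValueError("pfs_every must be at least 1")
    return PARALLEL_FS if checkpoint_index % pfs_every == 0 else NODE_LOCAL


def latest_usable(history: List[Checkpoint], required_level: int) -> Optional[Checkpoint]:
    """Return the most recent checkpoint whose level is at least required_level."""
    for ckpt in reversed(history):
        if ckpt.level >= required_level:
            return ckpt
    return None


# Ten checkpoints, one every 100 steps, with every 4th written to the parallel file system.
history = [Checkpoint(step=i * 100, level=choose_level(i, pfs_every=4)) for i in range(1, 11)]
```

Under this assumed policy, a failure that leaves node-local storage intact restarts from the newest checkpoint of any level, while a failure that destroys node-local data falls back to the newest parallel file system checkpoint and loses more work.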

Chair/Author Details:

Franck Cappello (Chair) - INRIA and UIUC

Adam Moody - Lawrence Livermore National Laboratory

Greg Bronevetsky - Lawrence Livermore National Laboratory

Kathryn Mohror - Lawrence Livermore National Laboratory

Bronis R. de Supinski - Lawrence Livermore National Laboratory

The full paper can be found in the ACM Digital Library and IEEE Computer Society

Sponsors: IEEE, ACM