SC is the International Conference for High Performance Computing, Networking, Storage and Analysis

SCHEDULE: NOV 13-19, 2010

Data-Aware Inter-Process Checkpoint Compression

SESSION: Research Poster Reception

EVENT TYPE: Poster

TIME: 5:15PM - 7:00PM

AUTHOR(S): Tanzima Z Islam, Kathryn Mohror, Adam Moody, Bronis de Supinski, Saurabh Bagchi, Rudolf Eigenmann

ROOM: Main Lobby

ABSTRACT:
Storing checkpoints to stable storage is a common solution for handling failures that occur during executions of HPC applications. However, this approach suffers from scalability problems. Transferring terabytes of checkpoint data over a shared network to a shared file system results in high overheads, potentially on the order of tens of minutes. Consequently, application developers may choose to checkpoint less frequently, risking the loss of more work in the event of a failure. Additionally, large-scale applications that write per-process checkpoints may overwhelm a shared parallel file system with the large number of metadata requests they generate. To alleviate these problems, we present a novel inter-process checkpoint compression technique that capitalizes on the similarity in checkpoint data across ranks and reduces the file count by aggregating per-process checkpoints into groups. Our preliminary experiments with a synthetic benchmark show that this method achieves checkpoint size reductions of up to 79%.
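
The general idea of exploiting cross-rank similarity can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a hypothetical synthetic checkpoint (identical payload on every rank plus a 4-byte rank header), uses zlib as a stand-in compressor, and simply concatenates the checkpoints of one group before compressing. The function names, checkpoint size, and group size are all assumptions made for illustration.

```python
import random
import zlib

CHECKPOINT_SIZE = 4096  # assumed size; small enough to fit zlib's 32 KiB window
GROUP_SIZE = 8          # assumed number of ranks aggregated into one file


def make_checkpoint(rank, size=CHECKPOINT_SIZE, seed=0):
    """Hypothetical synthetic checkpoint: shared payload plus a rank header."""
    rng = random.Random(seed)  # same seed, so every rank gets the same payload
    data = bytearray(rng.getrandbits(8) for _ in range(size))
    data[0:4] = rank.to_bytes(4, "little")  # the only rank-specific bytes
    return bytes(data)


def compress_per_process(checkpoints):
    """Baseline: one compressed file per process."""
    return [zlib.compress(c) for c in checkpoints]


def compress_group(checkpoints):
    """Concatenate a group's checkpoints and compress them as one file,
    letting the compressor find data repeated across ranks."""
    return zlib.compress(b"".join(checkpoints))


def decompress_group(blob, size=CHECKPOINT_SIZE):
    """Recover the per-rank checkpoints from one grouped file."""
    raw = zlib.decompress(blob)
    return [raw[i:i + size] for i in range(0, len(raw), size)]


checkpoints = [make_checkpoint(rank) for rank in range(GROUP_SIZE)]
per_process_total = sum(len(b) for b in compress_per_process(checkpoints))
grouped_blob = compress_group(checkpoints)
grouped_total = len(grouped_blob)
restored = decompress_group(grouped_blob)

print("files written:", GROUP_SIZE, "per-process vs 1 grouped")
print("bytes written:", per_process_total, "per-process vs", grouped_total, "grouped")
```

In this sketch the grouped file is smaller because the payload repeats across ranks, and the file count drops from one per process to one per group. The savings depend entirely on how similar the ranks' data actually is, so the numbers printed here say nothing about the 79% figure reported for the poster's benchmark.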

Chair/Author Details:

Tanzima Z Islam - Purdue University

Kathryn Mohror - Lawrence Livermore National Laboratory

Adam Moody - Lawrence Livermore National Laboratory

Bronis de Supinski - Lawrence Livermore National Laboratory

Saurabh Bagchi - Purdue University

Rudolf Eigenmann - Purdue University


Sponsors: IEEE, ACM