AUTHOR(S): Tanzima Z Islam, Kathryn Mohror, Adam Moody, Bronis de Supinski, Saurabh Bagchi, Rudolf Eigenmann
ABSTRACT: Storing checkpoints to stable storage is a common solution for handling failures that occur during the execution of HPC applications. However, this approach suffers from scalability problems. Transferring terabytes of checkpoint data over a shared network to a shared file system incurs high overhead, potentially on the order of tens of minutes. Consequently, application developers may choose to checkpoint less frequently, risking the loss of more work in the event of a failure. Additionally, large-scale applications that write per-process checkpoints may overwhelm a shared parallel file system with the resulting flood of metadata requests. To alleviate these problems, we present a novel inter-process checkpoint compression technique that capitalizes on the similarity in checkpoint data across ranks and reduces the file count by aggregating the checkpoints of groups of processes into fewer files. Our preliminary experiments with a synthetic benchmark show that this method achieves checkpoint size reductions of up to 79%.
Tanzima Z Islam - Purdue University
Kathryn Mohror - Lawrence Livermore National Laboratory
Adam Moody - Lawrence Livermore National Laboratory
Bronis de Supinski - Lawrence Livermore National Laboratory
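The abstract only outlines the idea at a high level. The toy Python sketch below illustrates one way cross-rank similarity could be exploited when checkpoints are aggregated into a group: each rank's buffer is delta-encoded (XOR) against a reference rank before the aggregated buffer is compressed. The function names, the XOR delta scheme, and the use of zlib are illustrative assumptions for this sketch, not the authors' actual implementation.

```python
import zlib

def compress_group(checkpoints):
    # checkpoints: equal-length byte buffers, one per rank in the group
    # (hypothetical input layout; real checkpoints vary in size and structure).
    reference = checkpoints[0]
    blob = bytearray(reference)
    for ckpt in checkpoints[1:]:
        # XOR against the reference rank; similar data leaves long runs of
        # zeros that the byte-level compressor removes cheaply.
        blob.extend(a ^ b for a, b in zip(ckpt, reference))
    return zlib.compress(bytes(blob))

# Toy usage: four ranks whose 4 KiB checkpoints differ in only one byte each.
base = bytes(b"\x42" * 4096)
group = []
for rank in range(4):
    ckpt = bytearray(base)
    ckpt[rank] = rank              # small per-rank difference
    group.append(bytes(ckpt))

grouped = len(compress_group(group))
separate = sum(len(zlib.compress(c)) for c in group)
print(f"grouped: {grouped} bytes, compressed separately: {separate} bytes")
```

Because the group is written as a single compressed stream, the number of files (and hence metadata operations) on the parallel file system drops by roughly the group size, which mirrors the file-count reduction described in the abstract.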