AUTHOR(S): Milo Polte, John Bent, Garth Gibson, Gary Grider, Ben McClelland, James Nunez, Meghan Wingate
ABSTRACT: Parallel applications running across thousands of processors on high-performance computing clusters rely on checkpointing to protect themselves from failures. Writing a checkpoint must complete quickly so that applications can return to useful work. We present the Parallel Log-structured Filesystem (PLFS), a middleware layer for accelerating application checkpoints. PLFS transparently decouples concurrent checkpoint writes into a filesystem-efficient access pattern of independent writes to individual log files. By decoupling writes in this manner, PLFS dramatically decreases the time required to perform the checkpoint. Our evaluation demonstrates that PLFS provides 2x-150x speedups for application checkpointing, with greater benefits at larger scale. PLFS has been implemented as a FUSE-based filesystem that requires no changes to either application code or the underlying parallel filesystem, and it is being put into production at Los Alamos National Laboratory. This poster also presents new performance numbers for accessing PLFS directly as a library or through MPI-IO.
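The decoupling described in the abstract can be illustrated with a small, purely hypothetical sketch. It is not the PLFS implementation, on-disk format, or API: the class `ToyLogContainer`, the `data.<writer>` file names, and the single in-memory index are all assumptions made for illustration. Each writer appends to its own log file regardless of the logical offset it targets, and an index entry records where each piece belongs so that the logical file can be reassembled on read.

```python
import os
import tempfile


class ToyLogContainer:
    """Toy model of a log-structured container (illustrative, not PLFS itself)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        # Assumption: one shared in-memory index; a real system would persist it.
        # Entries: (sequence, logical_offset, length, writer_id, physical_offset)
        self.index = []
        self.seq = 0

    def _log_path(self, writer_id):
        return os.path.join(self.root, f"data.{writer_id}")

    def write(self, writer_id, logical_offset, data):
        # Every write is a sequential append to the writer's private log,
        # no matter where it lands in the logical file.
        with open(self._log_path(writer_id), "ab") as log:
            physical_offset = log.seek(0, os.SEEK_END)
            log.write(data)
        self.index.append(
            (self.seq, logical_offset, len(data), writer_id, physical_offset)
        )
        self.seq += 1

    def read_all(self):
        # Rebuild the logical file by replaying index entries in write order,
        # so later writes overwrite earlier ones at the same logical offset.
        size = max((off + n for _, off, n, _, _ in self.index), default=0)
        out = bytearray(size)
        for _, off, n, writer_id, phys in sorted(self.index):
            with open(self._log_path(writer_id), "rb") as log:
                log.seek(phys)
                out[off:off + n] = log.read(n)
        return bytes(out)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        container = ToyLogContainer(os.path.join(tmp, "checkpoint"))
        # Three writers issue interleaved (strided) 4-byte writes to one logical file.
        for round_no in range(2):
            for writer in range(3):
                offset = (round_no * 3 + writer) * 4
                container.write(writer, offset, bytes([ord("A") + writer]) * 4)
        return container.read_all(), sorted(os.listdir(container.root))


if __name__ == "__main__":
    data, files = demo()
    print(data)
    print(files)
```

In this sketch the three writers never touch the same file, yet a reader still sees a single logical file with the interleaved layout the writers intended. The underlying filesystem only ever services per-writer appends, which is the access pattern the abstract calls filesystem-efficient.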
Milo Polte - Carnegie Mellon University
John Bent - Los Alamos National Laboratory
Garth Gibson - Carnegie Mellon University / Panasas Inc.