PLFS: A Fast Checkpoint Filesystem

AUTHOR(S): Milo Polte, John Bent, Garth Gibson, Gary Grider, Ben McClelland, James Nunez, Meghan Wingate

Parallel applications running across thousands of processors on high-performance computing clusters rely on checkpointing to protect themselves from failures. A checkpoint must be written quickly so that the application can return to useful work. We present the Parallel Log-structured Filesystem (PLFS), a middleware layer for accelerating application checkpoints. PLFS transparently decouples concurrent checkpoint writes into a filesystem-efficient access pattern of independent writes to individual log files. By decoupling writes in this manner, PLFS dramatically decreases the time required to perform the checkpoint. Our evaluation demonstrates that PLFS provides 2x-150x speedups for application checkpointing, with greater benefits at larger scale. PLFS has been implemented as a FUSE-based filesystem that requires no changes to either application code or the underlying parallel filesystem, and it is being put into production at Los Alamos National Laboratory. This poster also presents new performance numbers for accessing PLFS directly as a library or through MPI-IO.
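
The decoupling idea described above can be illustrated with a small sketch. It is not the PLFS implementation or its on-disk format; the class name `LogStructuredFile`, the `data.<writer>` and `index.<writer>` file names, and the text index layout are all assumptions made for illustration. Each writer appends to its own log file and records where each logical extent landed, and a reader later rebuilds the logical file from the index entries. The sketch assumes writes from different writers do not overlap.

```python
import os
import tempfile


class LogStructuredFile:
    """Toy container (hypothetical): one data log and one index log per writer."""

    def __init__(self, container_dir):
        self.container_dir = container_dir
        os.makedirs(container_dir, exist_ok=True)

    def _path(self, kind, writer_id):
        return os.path.join(self.container_dir, f"{kind}.{writer_id}")

    def write(self, writer_id, logical_offset, data):
        # Append the payload to this writer's private log, regardless of
        # where it belongs in the logical file.
        with open(self._path("data", writer_id), "ab") as log:
            physical_offset = log.seek(0, os.SEEK_END)
            log.write(data)
        # Record the mapping: logical offset -> (physical offset, length).
        with open(self._path("index", writer_id), "a") as index:
            index.write(f"{logical_offset} {physical_offset} {len(data)}\n")

    def read_all(self):
        # Gather every writer's index entries.
        records = []
        for name in sorted(os.listdir(self.container_dir)):
            if not name.startswith("index."):
                continue
            writer_id = name.split(".", 1)[1]
            with open(os.path.join(self.container_dir, name)) as index:
                for line in index:
                    logical, physical, length = map(int, line.split())
                    records.append((logical, writer_id, physical, length))
        # Rebuild the logical file by copying each extent into place.
        size = max((logical + length for logical, _, _, length in records), default=0)
        out = bytearray(size)
        for logical, writer_id, physical, length in records:
            with open(self._path("data", writer_id), "rb") as log:
                log.seek(physical)
                out[logical:logical + length] = log.read(length)
        return bytes(out)


def demo():
    # Three writers share one logical file in a strided pattern, as in a
    # checkpoint where every process writes into the same file.
    block = 4
    writers = 3
    with tempfile.TemporaryDirectory() as tmp:
        f = LogStructuredFile(os.path.join(tmp, "checkpoint"))
        for round_no in range(2):
            for w in range(writers):
                offset = (round_no * writers + w) * block
                f.write(w, offset, bytes([ord("A") + w]) * block)
        return f.read_all()


if __name__ == "__main__":
    print(demo())
```

In this sketch every writer only ever appends to its own files, so the underlying filesystem sees independent sequential streams instead of interleaved writes into one shared file. The cost moves to the read path, which must consult all index files to reassemble the logical view.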

Milo Polte - Carnegie Mellon University

John Bent - Los Alamos National Laboratory

Garth Gibson - Carnegie Mellon University / Panasas Inc.

Gary Grider - Los Alamos National Laboratory

Ben McClelland - Los Alamos National Laboratory

James Nunez - Los Alamos National Laboratory

Meghan Wingate - Los Alamos National Laboratory

