Scale and Concurrency of GIGA+: File System Directories with Millions of Files
SESSION: Student Poster Reception
EVENT TYPE: Poster, ACM Student Poster
TIME: 5:15PM - 7:00PM
ABSTRACT: PROBLEM --
Most scalable file systems enable highly concurrent access to large files, but not to large directories -- a growing requirement among the users and vendors of HPC file systems, who must efficiently handle new workloads [hecfsio:tr06, hpcs-io:2008]. Large data-intensive applications are generating a challenging new workload that uses the file system as a fast, lightweight ``database'' composed of a large number of small files in a single directory. Such applications, including parallel checkpointing, astrophysics simulations and gene sequencing, generate metadata-intensive workloads of KB-sized accesses by executing concurrently on clusters of hundreds of thousands of nodes, each with multiple CPU cores. At this level of massive concurrency, even simple parallel output file creation -- one output directory, one file per thread -- can induce intense metadata workloads that require concurrent directories (and a concurrent metadata service in general).
Unfortunately, most file systems, including those running on some of the largest clusters (Panasas's PanFS, PVFS2 and GoogleFS), manage an entire directory through a single metadata server, limiting the scalability of directory operations. Some file systems, such as GoogleFS and HDFS, avoid handling many small files altogether by defining new, non-VFS semantics that require applications to be written (or rewritten) against custom interfaces. Only two parallel file systems support distributed directories: LustreFS and IBM GPFS. Lustre's clustered metadata is not yet supported in a stable release [lustrefs], while IBM GPFS enables concurrent directory access through distributed locking and cache consistency -- mechanisms that incur significant overhead under concurrent create workloads, especially when many clients work in one directory [gpfs:schmuck02].
Our research goal is to understand the tradeoffs involved in pushing the scalability limits of FS directories, to store billions of files and sustain more than hundreds of thousands of mutations per second, while maintaining the traditional UNIX VFS interface and POSIX semantics. Our primary contribution is an indexing technique, called GIGA+, that handles large mutating directories in a parallel and decentralized manner, while still meeting the desired load-balancing and incremental growth requirements. The central tenet of GIGA+ is to enable higher concurrency for mutations (particularly inserts) by eliminating system-wide serialization and synchronization.
GIGA+ incrementally partitions a directory across all servers, and then allows each server to repartition its portion of the directory without any central coordination or globally shared state. GIGA+ uses a dense bitmap encoding that maps filenames to directory partitions and to a specific server. Servers use these bitmaps to identify the server that will hold a new partition created by splitting a local partition, and to update a client's often stale view of the directory's partition index. Clients use bitmaps to look up the (possibly former) server that stores the partition associated with a given filename. Because servers split partitions asynchronously, clients frequently hold incomplete bitmaps. Unlike most prior work, GIGA+ tolerates stale, inconsistent bitmaps at the client without affecting the correctness of any operation (applications still get strong consistency semantics). A client's bitmap is updated lazily by any server that detects it has been incorrectly addressed by that client. GIGA+ also facilitates index reconfiguration when servers are added or removed, with minimal re-mapping overhead.
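The split-tree indexing described above can be illustrated with a small sketch. This is not the authors' implementation; it is an illustrative Python model under stated assumptions: an MD5-based filename hash, a small fixed maximum depth, and a bitmap represented as a set of partition ids. It follows the numbering rule implied by incremental splitting (a partition i split at tree depth r produces a new child i + 2^r), and shows why a lookup from the deepest candidate upward remains correct even with an incomplete bitmap.

```python
import hashlib

MAX_DEPTH = 8  # illustrative cap: up to 2^8 = 256 partitions


def hash_bits(filename: str) -> int:
    """Hash a filename to an integer; its low-order bits pick the partition."""
    digest = hashlib.md5(filename.encode()).digest()
    return int.from_bytes(digest[:4], "little")


def locate_partition(bitmap: set, filename: str) -> int:
    """Find the deepest partition for `filename` that `bitmap` says exists.

    `bitmap` is a (possibly stale, incomplete) set of known partition ids.
    We try the candidate at the maximum depth first (h mod 2^MAX_DEPTH) and
    walk up toward the root. This is safe with stale bitmaps: if a deeper
    split is missing from the client's view, the lookup resolves to an
    ancestor partition, and the server that receives the misdirected
    request can correct the client lazily.
    """
    h = hash_bits(filename)
    for depth in range(MAX_DEPTH, 0, -1):
        candidate = h % (2 ** depth)
        if candidate in bitmap:
            return candidate
    return 0  # partition 0 always exists


def split(bitmap: set, partition: int, depth: int) -> int:
    """Split `partition` at tree depth `depth`, creating child partition
    `partition + 2^depth`. Returns the new partition id. In the real
    system the server would also move the matching half of its entries
    to the new partition's server; here we only update the bitmap."""
    child = partition + 2 ** depth
    bitmap.add(child)
    return child
```

As a usage example: starting from `bitmap = {0}`, `split(bitmap, 0, 0)` creates partition 1 (filenames with odd hashes), and `split(bitmap, 1, 1)` then creates partition 3, leaving files with hash ≡ 1 (mod 4) in partition 1. A client still holding the stale bitmap `{0}` maps every filename to partition 0; correctness is preserved because the addressed server detects the error and returns a fresher bitmap.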
Our FUSE-based user-level distributed directory implementation provides UNIX/VFS semantics and can be layered on commodity file systems, including local file systems (ext3 and ReiserFS) and cluster file systems. In our preliminary evaluation, on configurations of up to 32 servers, GIGA+ delivered a throughput of more than 98,000 file creates per second -- exceeding some of the most extreme HPC scalability requirements [hpcs-io:2008]. In addition, GIGA+ achieved a well-balanced load distribution with two orders of magnitude less partitioning than consistent hashing, and incurred very low overhead -- only 0.01% additional messages -- from its use of weakly consistent mappings.
[lustrefs] LustreFS Clustered Metadata. http://arch.lustre.org/index.php?title=ClusteredMetadata.
[hecfsio:tr06] ROSS, R., FELIX, E., LOEWE, B., WARD, L., NUNEZ, J., BENT, J., SALMON, E., and GRIDER, G. "High End Computing Revitalization Task Force (HECRTF), Interagency Working Group (HECIWG) File Systems and I/O Research Guidance Workshop 2006".
[hpcs-io:2008] NEWMAN, H. "HPCS Mission Partner File I/O Scenarios. Revision 3", Released on Nov 17, 2008, Nov. 2008.
[gpfs:schmuck02] SCHMUCK, F., and HASKIN, R. "GPFS: A Shared-Disk File System for Large Computing Clusters". In USENIX FAST 2002.