Scaling Highly-Parallel Data-Intensive Supercomputing Applications on a Parallel Clustered Filesystem
SESSION: Storage Challenge Presentations
EVENT TYPE: Storage Challenge
TIME: 10:30AM - 11:00AM
SESSION CHAIR: Alan Sussman
ROOM: 388
ABSTRACT: A new class of data-intensive supercomputing applications involves processing massive amounts of data, with a greater focus on semantically transforming the data. This class of applications is embarrassingly parallel and well suited to the MapReduce programming framework, which lets users perform large-scale data analysis while the runtime handles the system architecture, data partitioning and task scheduling. In this paper, we demonstrate a business intelligence application running on GPFS over a cluster of commodity machines and direct-attached storage. The architecture maximizes storage performance by using five innovative optimizations: (a) Locality awareness to allow compute jobs to be scheduled close to data; (b) Metablocks that allow large and small block sizes to co-exist in the same file system; (c) Write affinity that allows applications to dictate the layout of files on different nodes; (d) Pipelined replication to maximize use of network bandwidth; (e) Distributed recovery to minimize the effect of failures.
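ILLUSTRATION: As a rough sketch of optimization (a), the Python below shows one way a scheduler might place tasks on nodes that already hold a replica of their input block, falling back to any free node otherwise. It is not the GPFS or MapReduce implementation; the function name schedule_by_locality, the block-to-node map, and the per-node slot counts are all hypothetical and chosen only for illustration.

```python
def schedule_by_locality(block_locations, free_slots):
    """Greedy locality-aware placement (illustrative assumption, not GPFS code).

    block_locations: dict mapping block id -> list of nodes holding a replica
    free_slots:      dict mapping node name -> number of tasks it can accept
    Returns a dict mapping block id -> (chosen node, True if the node is local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, replicas in block_locations.items():
        # Prefer a node that stores the block and still has a free slot.
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])
            is_local = True
        else:
            # No local capacity: fall back to any node with a free slot.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            node = max(candidates, key=lambda n: slots[n])
            is_local = False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n3"]}
    capacity = {"n1": 2, "n2": 1, "n3": 0, "n4": 1}
    for block, (node, is_local) in schedule_by_locality(blocks, capacity).items():
        print(block, node, "local" if is_local else "remote")
```

In this sketch a task reads remotely only when every node holding its block is full, which is the behavior locality awareness aims to make rare.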