ABSTRACT: As HPC systems grow to petascale and exascale proportions, so does the complexity of monitoring them. Many supercomputing sites have evolved over time using a mixture of task-specific monitoring applications, monitoring protocols, and many homegrown scripts. While these methods often get the job done, there is no "one size fits all" solution to monitoring HPC systems, and development efforts to fill in the gaps are often duplicated across sites.
The purpose of this BOF is to bring together HPC system administrators to discuss monitoring across all aspects of the cluster. These monitoring needs include hardware health and failures, environmental conditions, node and cluster status, queue status, and job performance. We will share ideas that work at the various sites to help spread best practices among the group, and then spend time on what is lacking in our tools and methods.
Moving beyond the BOF, a group of HPC site administrators organized this year to share best practices and collaborate on building better monitoring systems. We will be looking to expand this collaboration to interested attendees.
Session Leader Details:
Corey Shields (Primary Session Leader) - Indiana University
Randal Rheinheimer (Secondary Session Leader) - Los Alamos National Laboratory