Validation of an HPC Cluster: A Sometimes Neglected Aspect of System Administration
SESSION: S02: Validation of an HPC Cluster: A Sometimes Neglected Aspect of System Administration
EVENT TYPE: Tutorial
TIME: 8:30AM - 12:00PM
Presenter(s): Michael Hebenstreit, Bob Hayes
ABSTRACT: More often than not, an HPC cluster gets a lot of attention during installation, but keeping it alive, healthy, and performing well is viewed as “standard system administration”. People often forget that in a cluster a single suboptimal or even defective component can have a far deeper impact than just “losing a node”. In Intel’s benchmarking HPC datacenter, we have found regular revalidation to be a key factor in battling system degradation.
In this tutorial, Intel’s Customer Response Team proposes to train users to foresee potential problems, to point to various sources that help diagnose incorrect behavior, and to develop script-based routine checks for Linux- and Windows-based clusters that help maintain a cluster’s performance.