System: 22 HPC cluster systems
Duration: December 1996 through November 2005
Data Type: Records of cluster node outages, workload logs and error logs.

 

About the data:

The data spans 22 high-performance computing systems that were in production use at Los Alamos National Laboratory (LANL) between 1996 and November 2005. Most of these systems are large clusters of either NUMA (Non-Uniform Memory Access) nodes, or 2-way and 4-way SMP (Symmetric Multi-Processing) nodes. In total, the systems include 4750 nodes and 24101 processors.

The data contains an entry for any failure leading to a node outage that occurred during the 9-year time period and that required the attention of a system administrator. For each failure, the data includes start time and end time, the system and node affected, as well as categorized root cause information.

The workloads run on those systems are large-scale long-running 3D scientific simulations, e.g. for nuclear stockpile stewardship. These applications perform long periods (often months) of CPU computation, interrupted every few hours by a few minutes of I/O for checkpointing. Simulation workloads are often accompanied by scientific visualization of large-scale data. Visualization workloads are also CPU-intensive, but involve more reading from storage than compute workloads. Finally, some nodes are used purely as front-end nodes, and others run more than one type of workload, e.g. graphics nodes often run compute workloads as well.

At LANL, failure tolerance is frequently implemented through periodic checkpointing. When a node fails, any job running on it is stopped and restarted on a different set of nodes, either from the most recent checkpoint or, if no checkpoint exists, from scratch.
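
The restart rule described above can be illustrated with a minimal sketch. It is not LANL's scheduler logic; the function names, the use of hours as the time unit, and the assumption that checkpoint times are sorted and not earlier than the job start are all illustrative choices made for this example.

```python
from bisect import bisect_right


def restart_point(job_start, checkpoints, failure_time):
    """Return the time from which a failed job has to redo its work.

    checkpoints is assumed to be a sorted list of checkpoint completion
    times (here in hours), none earlier than job_start.
    """
    # Index of the first checkpoint strictly after the failure.
    i = bisect_right(checkpoints, failure_time)
    # No checkpoint before the failure: restart from scratch.
    return checkpoints[i - 1] if i > 0 else job_start


def lost_work(job_start, checkpoints, failure_time):
    """Computation time that is lost and must be repeated after a failure."""
    return failure_time - restart_point(job_start, checkpoints, failure_time)


# A job checkpointing every 4 hours that fails 9.5 hours in loses 1.5 hours.
print(lost_work(0.0, [4.0, 8.0], 9.5))  # 1.5
# With no checkpoint, everything since the job start is lost.
print(lost_work(0.0, [], 3.0))  # 3.0
```

The sketch only captures the choice of restart point; it ignores the time needed to write checkpoints and to reschedule the job on other nodes.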

The data is based on a "remedy" database created at LANL in June 1996. At that time, LANL introduced a site-wide policy that requires system administrators to enter a description of every failure they take care of into the remedy database. Consequently, the database contains a record for every failure that occurred in LANL's HPC systems since June 1996 and that required the intervention of a system administrator.

A failure record contains the time when the failure started, the time when it was resolved, the system and node affected, the type of workload running on the node, and the root cause. The workload is either compute for computational workloads, graphics for visualization workloads, or fe for front-end. Root causes fall into one of the following five high-level categories: Human error; Environment, including power outages or A/C failures; Network failure; Software failure; and Hardware failure. In addition, more detailed information on the root cause is captured, such as the particular hardware component affected by a Hardware failure. The failure classification and the rules for assigning failures to categories were developed jointly by hardware engineers, administrators and operations staff.
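
As an illustration of the record structure described above, the following is a minimal sketch of how a failure record might be modeled when analyzing the data. The class and field names are hypothetical and do not reflect the actual file layout, which is documented on the LANL web page; only the workload labels (compute, graphics, fe) and the five root-cause categories are taken from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Workload(Enum):
    COMPUTE = "compute"
    GRAPHICS = "graphics"
    FRONT_END = "fe"


class RootCause(Enum):
    HUMAN = "Human error"
    ENVIRONMENT = "Environment"
    NETWORK = "Network failure"
    SOFTWARE = "Software failure"
    HARDWARE = "Hardware failure"


@dataclass(frozen=True)
class FailureRecord:
    # Field names are assumptions for this sketch, not the released column names.
    system: int
    node: int
    start: datetime
    end: datetime
    workload: Workload
    root_cause: RootCause
    detail: str = ""  # e.g. the hardware component affected

    def repair_time(self) -> timedelta:
        """Time between the start of the failure and its resolution."""
        return self.end - self.start


def downtime_by_cause(records):
    """Sum the repair time of all records per high-level root cause."""
    totals = {}
    for r in records:
        totals[r.root_cause] = totals.get(r.root_cause, timedelta()) + r.repair_time()
    return totals
```

With records in this form, quantities such as the total downtime per root cause follow directly from the start and end times, for example via `downtime_by_cause(records)[RootCause.HARDWARE]`.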

The LANL web page and a recent paper provide more detailed information about the systems the data was collected on and about the data format. The web page also includes a FAQ.

 

Downloads:

The data can be downloaded directly from the LANL web page, which also provides more detailed information on the data format and the systems the data was collected on, as well as a FAQ.

 

Papers using this data:

Bianca Schroeder and Garth A. Gibson. "A large scale study of failures in high-performance-computing systems." International Symposium on Dependable Systems and Networks (DSN 2006).

Gonzalo Zarza, Diego Lugones, Daniel Franco and Emilio Luque. "Fault-tolerant Routing for Multiple Permanent and Non-permanent Faults in HPC Systems." 2010 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'10), Las Vegas, Nevada, USA, July 2010.

If you are using this data in a paper, please send an e-mail with the paper reference to the moderators and we will add it to this page.

 

Acknowledgments:

We thank Gary Grider, Laura Davey, James Nunez, and the Computing, Communications, and Networking Division at LANL for their efforts in collecting the data and clearing it for public release and for their help with background information and interpretation.

If you use these data in your work, please use a similar acknowledgment.