CFDR Data

The LANL data

System:	22 HPC cluster systems
Duration:	December 1996 thru November 2005
Data Type:	Records of cluster node outages, workload logs and error logs.

About the data

The data spans 22 high-performance computing systems that have been in production use at Los Alamos National Lab (LANL) between 1996 and November 2005. Most of these systems are large clusters of either NUMA (Non-Uniform-Memory-Access) nodes, or 2-way and 4-way SMP (Symmetric-Multi-Processing) nodes. In total the systems include 4750 nodes and 24101 processors.

The data contains an entry for any failure leading to a node outage that occurred during the 9-year time period and that required the attention of a system administrator. For each failure, the data includes start time and end time, the system and node affected, as well as categorized root cause information.

The workloads run on those systems are large-scale long-running 3D scientific simulations, e.g. for nuclear stockpile stewardship. These applications perform long periods (often months) of CPU computation, interrupted every few hours by a few minutes of I/O for checkpointing. Simulation workloads are often accompanied by scientific visualization of large-scale data. Visualization workloads are also CPU-intensive, but involve more reading from storage than compute workloads. Finally, some nodes are used purely as front-end nodes, and others run more than one type of workload, e.g. graphics nodes often run compute workloads as well.

At LANL, failure tolerance is frequently implemented through periodic checkpointing. When a node fails, any job running on it is stopped and restarted on a different set of nodes, either starting from the most recent checkpoint or from scratch if no checkpoint exists.

The data is based on a "remedy" database created at LANL in June 1996. At that time, LANL introduced a site-wide policy that requires system administrators to enter a description of every failure they take care of into the remedy database. Consequently, the database contains a record for every failure that occurred in LANL's HPC systems since June 1996 and that required intervention of a system administrator.

A failure record contains the time when the failure started, the time when it was resolved, the system and node affected, the type of workload running on the node and the root cause. The workload is either compute for computational workloads, graphics for visualization workloads, or fe for front-end. Root causes fall in one of the following five high-level categories: Human error; Environment, including power outages or A/C failures; Network failure; Software failure; and Hardware failure. In addition, more detailed information on the root cause is captured, such as the particular hardware component affected by a Hardware failure. The failure classification and rules for assigning failures to categories were developed jointly by hardware engineers, administrators and operations staff.
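The record layout described above can be sketched as a small Python data structure. This is only an illustration: the class name, the field names, and the sample values are hypothetical, and the actual column names and encodings are those documented on the LANL web page.

```python
from dataclasses import dataclass
from datetime import datetime

# Workload labels and high-level root-cause categories as described in the text.
WORKLOADS = {"compute", "graphics", "fe"}
ROOT_CAUSES = {"Human error", "Environment", "Network", "Software", "Hardware"}


@dataclass(frozen=True)
class FailureRecord:
    """One node outage. Field names are illustrative, not the real column names."""
    system: int
    node: int
    started: datetime
    resolved: datetime
    workload: str
    root_cause: str
    detail: str = ""  # finer-grained cause, e.g. the hardware component affected

    def __post_init__(self):
        # Reject values outside the categories listed in the description.
        if self.workload not in WORKLOADS:
            raise ValueError(f"unknown workload: {self.workload!r}")
        if self.root_cause not in ROOT_CAUSES:
            raise ValueError(f"unknown root cause: {self.root_cause!r}")
        if self.resolved < self.started:
            raise ValueError("failure resolved before it started")

    def repair_hours(self) -> float:
        """Time from failure start to resolution, in hours."""
        return (self.resolved - self.started).total_seconds() / 3600.0


# Made-up sample record, for illustration only.
example = FailureRecord(
    system=1, node=17,
    started=datetime(2003, 5, 4, 8, 0),
    resolved=datetime(2003, 5, 4, 14, 30),
    workload="compute", root_cause="Hardware", detail="memory",
)
print(example.repair_hours())  # 6.5
```

Keeping the start and end times as timestamps, rather than storing a duration, makes it straightforward to derive both repair times and times between failures from the same records.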

The LANL web page and a recent paper provide more detailed information about the systems the data was collected on and about the data format. A FAQ is also available.

Downloads

The data can be downloaded directly from the LANL web page. The LANL web page also provides more detailed information on the data format and the systems the data was collected on. There is also a FAQ available.

Papers using this data

Bianca Schroeder and Garth A. Gibson. "A large scale study of failures in high-performance-computing systems." International Symposium on Dependable Systems and Networks (DSN 2006).
Gonzalo Zarza, Diego Lugones, Daniel Franco and Emilio Luque. "Fault-tolerant Routing for Multiple Permanent and Non-permanent Faults in HPC Systems." 2010 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'10), Las Vegas, Nevada, USA, July 2010.

If you are using this data in a paper, please send an e-mail with the paper reference to the USENIX Production Department and we will add it to this page.

Acknowledgments

We thank Gary Grider, Laura Davey, James Nunez, and the Computing, Communications, and Networking Division at LANL for their efforts in collecting the data and clearing it for public release and for their help with background information and interpretation.

If you use these data in your work, please use a similar acknowledgment.

The HPC1 data

System:	A 765-node HPC cluster with 64 filesystem nodes
Duration:	August 2001 thru May 2006
Data Type:	Hardware replacement log

About the data

HPC1 is a five-year log of hardware replacements collected from a 765-node high-performance computing cluster. Each of the 765 nodes is a 4-way SMP with 4 GB of memory and three to four 18GB 10K rpm SCSI drives. Of these nodes, 64 are used as filesystem nodes containing, in addition to the three to four 18GB drives, 17 36GB 10K rpm SCSI drives. The applications running on this system are typically large-scale scientific simulations or visualization applications. For each hardware replacement recorded during the five-year lifetime of this system, the data contains the time the problem started, the node and hardware component affected, and a brief description of the corrective action.

Downloads

The data set is not available for download.

Papers using this data

A first analysis of the HPC1 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson. "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" 5th USENIX Conference on File and Storage Technologies (FAST 2007).

If you are using this data in a paper, please send an e-mail with the paper reference to the USENIX Production Department and we will add it to this page.

Acknowledgments

We would like to thank Katie Vargo, J. Ray Scott and Robin Flaus from the Pittsburgh Supercomputing Center for collecting and providing us with data and helping us to interpret the data.

If you use these data in your work, please use a similar acknowledgment.

The HPC2 data

System:	A 256-node HPC cluster
Duration:	January 2004 thru July 2006
Data Type:	Hardware replacement log

About the data

HPC2 is a record of disk replacements observed on the compute nodes of a 256-node HPC cluster at Los Alamos National Lab (LANL). Each node is a 4-way SMP with 16 GB of memory and contains two 36GB 10K rpm SCSI drives, except for eight of the nodes, which contain eight 36GB 10K rpm SCSI drives each. The applications running on this system are typically large-scale scientific simulations or visualization applications. For each disk replacement, the data set records the number of the affected node, the start time of the problem, and the slot number of the replaced drive.

Papers using this data

A first analysis of the HPC2 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson. "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" 5th USENIX Conference on File and Storage Technologies (FAST 2007).

If you are using this data in a paper, please send an e-mail with the paper reference to the USENIX Production Department and we will add it to this page.

Acknowledgments

If you use these data in your work, please use a similar acknowledgment.

The HPC3 data

System:	1,532 node HPC cluster
Duration:	December 2005 thru November 2006
Data Type:	Hard drive replacement log

About the data

HPC3 is a record of disk replacements observed on a 1,532-node HPC cluster. Each node is equipped with eight CPUs and 32GB of memory. Each node, except for four login nodes, has two 146GB 15K rpm SCSI disks. In addition, 11,000 7200 rpm 250GB SATA drives are used in an external shared filesystem and 144 73GB 15K rpm SCSI drives are used for the filesystem metadata. The applications running on this system are typically large-scale scientific simulations or visualization applications. For each disk replacement, the data set records the day of the replacement.

Downloads

The data set is not available for download.

Papers using this data

A first analysis of the HPC3 data is presented in the following paper:
Bianca Schroeder and Garth A. Gibson. "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" 5th USENIX Conference on File and Storage Technologies (FAST 2007).

If you are using this data in a paper, please send an e-mail with the paper reference to the USENIX Production Department and we will add it to this page.

Acknowledgments

We would like to thank the people at the organization that provided us with the data, and who would like to remain unnamed, for collecting the data and helping us to interpret it.

The HPC4 data

System:	Five HPC systems with 512 to 131072 processors
Duration:	Between 215 and 558 days in 2004 - 2006
Data Type:	Event logs

About the data

For a detailed description of the data, see the DSN'07 paper by Adam Oliner and Jon Stearley. A very brief summary follows. The data consists of five event logs collected between 2004 and 2006 on five different supercomputing systems: Blue Gene/L, Thunderbird, RedStorm, Spirit, and Liberty. The logs contain alert and non-alert messages identified by alert category tags, and are therefore more amenable to alert detection and prediction research than failure modeling. All five systems are ranked on the Top500 Supercomputers List as of June 2006, spanning a range from #1 to #445. They vary by two orders of magnitude in the number of processors (ranging from 512 processors in Liberty to 131072 processors in Blue Gene/L) and by one order of magnitude in the amount of main memory. The various machines are produced by IBM, Dell, Cray and HP. All systems are installed at Sandia National Labs (SNL) in Albuquerque, NM, with the exception of BG/L, which is at Lawrence Livermore National Labs (LLNL) in Livermore, CA.

Downloads

Thunderbird (1.9 GB) | Spirit (864 MB) | Liberty (641 MB) | BlueGene/L (60 MB)

Papers using this data

A description and analysis of the HPC4 data is presented in the following paper:

A. Oliner and J. Stearley. "What Supercomputers Say: A Study of Five System Logs". Proceedings of the International Conference on Dependable Systems and Networks (DSN), Edinburgh, UK, 2007.

The data is also used in the following two papers:

A. J. Oliner, A. Aiken, and J. Stearley. "Alert Detection in System Logs". Proceedings of the International Conference on Data Mining (ICDM), Pisa, Italy, 2008.

J. Stearley and A. J. Oliner. "Bad Words: Finding Faults in Spirit’s Syslogs". Workshop on Resiliency in High-Performance Computing (Resilience-2008), Lyon, France, 2008.

If you are using this data in a paper, please send an e-mail with the paper reference to the USENIX Production Department and we will add it to this page.

Acknowledgments

We would like to thank Jon Stearley and Adam Oliner for collecting the data and making it publicly available.

The PNNL data

System:	980 node HPC cluster
Duration:	November 2003 thru September 2007
Data Type:	Log of hardware failures

About the data

This data set is a record of hardware failures recorded on the High Performance Computing System-2 (MPP2) operated by the Environmental and Molecular Science Laboratory (EMSL), Molecular Science Computing Facility (MSCF) at Pacific Northwest National Laboratory (PNNL).

The MPP2 computing system has the following equipment and capabilities:

HP/Linux Itanium-2
980 node/1960 Itanium-2 processors (Madison, 1.5 GHz) configured as follows:
- 574 nodes are "fat" compute nodes with 10 Gbyte RAM and 430 Gbyte local disk
- 366 nodes are "thin" compute nodes with 10 Gbyte RAM and 10 Gbyte local disk
- 34 nodes are Lustre server nodes (32 OSS, 2 MDS)
- 2 nodes are administrative nodes
- 4 nodes are login nodes
Quadrics QsNetII interconnect
11.8 TFlops peak theoretical performance
9.7 terabytes of RAM
450 terabytes of local scratch disk space
53 terabytes shared cluster file system (Lustre).
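As a quick consistency check on the configuration listed above, the following Python sketch adds up the node counts and estimates the peak performance from the processor count and clock rate. The figure of 4 floating-point operations per cycle per processor is an assumption that is not stated in the description; the other numbers are taken from the list.

```python
# Node counts by role, taken from the configuration list above.
nodes = {
    "fat compute": 574,
    "thin compute": 366,
    "Lustre server": 34,
    "administrative": 2,
    "login": 4,
}
total_nodes = sum(nodes.values())
processors = total_nodes * 2  # 1960 processors over 980 nodes, i.e. two per node

# ASSUMPTION: 4 floating-point operations per cycle per processor
# (two fused multiply-add units); this figure is not stated in the text.
FLOPS_PER_CYCLE = 4
CLOCK_HZ = 1.5e9  # 1.5 GHz, from the list above

peak_tflops = processors * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
print(total_nodes, processors, round(peak_tflops, 2))
```

Under that assumption the node counts sum to 980 and the estimate comes to roughly 11.76 TFlops, which is in line with the quoted 11.8 TFlops peak.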

The applications running on this system are typically large-scale scientific simulations or visualization applications.

For each hardware failure, the data set includes a timestamp for when the failure happened, a hardware identifier, the component that failed, a description of the failure, and the repair action taken.

Downloads

The data is available for download here.

Papers using this data

Gonzalo Zarza, Diego Lugones, Daniel Franco and Emilio Luque. "Fault-tolerant Routing for Multiple Permanent and Non-permanent Faults in HPC Systems." 2010 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'10), Las Vegas, Nevada, USA, July 2010.

If you are using this data in a paper, please send an e-mail with the paper reference to the USENIX Production Department and we will add it to this page.

Acknowledgments

We would like to thank Evan Felix and David Brown from PNNL for collecting the data and sharing it. The data was collected and made available using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the U.S. Department of Energy's Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory, operated for the Department of Energy by Battelle.

If you use these data in your work, please use a similar acknowledgment.

The NERSC data

System:	Various HPC clusters
Duration:	2001 - 2006
Data Type:	Database with I/O specific failures

About the data

This data was collected to provide failure specifics for I/O related systems and components in as much detail as possible, so that analysis might produce useful findings. Data were collected for storage, networking, computational machines, and file systems in production use at NERSC during the 2001-2006 timeframe. The data was extracted from a database used for tracking system troubles, called Remedy, and is currently stored in a MySQL database and available for export to Excel format. There are also some basic query and graph capabilities available. For more information on the data, please contact the PDSI researcher at NERSC: Akbar Mokhtarani or the Principal Investigator for PDSI at NERSC: Bill Kramer.

Downloads

The data and more information are available for download here.

Papers using this data

This data has not yet been reported on in any paper.

If you are using this data in a paper, please send an e-mail with the paper reference to the USENIX Production Department and we will add it to this page.

Acknowledgments

We would like to thank Bill Kramer and Akbar Mokhtarani from NERSC for collecting the data and sharing it.

If you use these data in your work, please use a similar acknowledgment.

The COM1 data

System:	Internet services clusters
Duration:	May 2006
Data Type:	Hardware replacement log

Blue Gene/P data from Intrepid

System:	Blue Gene/P
Duration:	Jan 09 - Aug 09
Data Type:	RAS log

About the data

The data consists of RAS log messages collected over a period of 6 months on the Blue Gene/P Intrepid system at. Each message in the log contains 15 fields as follows: RECID, MSG_ID, COMPONENT, SUBCOMPONENT, ERRCODE, SEVERITY, EVENT_TIME, FLAGS, PROCESSOR, NODE, BLOCK, LOCATION, SERIALNUMBER, ECID, MESSAGE.
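To show how the 15 fields listed above might be handled in practice, here is a minimal Python sketch that maps one log line onto named fields. The field names and their order come from the description; everything else is an assumption, in particular the one-message-per-line layout, the `|` separator, the hypothetical helper `parse_ras_line`, and the made-up sample line. The actual file layout should be checked against the download.

```python
from collections import namedtuple

# The 15 field names listed in the text, in the order given.
RAS_FIELDS = [
    "RECID", "MSG_ID", "COMPONENT", "SUBCOMPONENT", "ERRCODE", "SEVERITY",
    "EVENT_TIME", "FLAGS", "PROCESSOR", "NODE", "BLOCK", "LOCATION",
    "SERIALNUMBER", "ECID", "MESSAGE",
]
RasMessage = namedtuple("RasMessage", RAS_FIELDS)


def parse_ras_line(line: str, sep: str = "|") -> RasMessage:
    """Split one log line into the 15 named fields.

    ASSUMPTION: one message per line with fields separated by `sep`.
    The real file layout may differ; check the download before relying on this.
    """
    # MESSAGE is last, so cap the number of splits to keep any separators
    # that occur inside the free-text message.
    parts = [p.strip() for p in line.rstrip("\n").split(sep, len(RAS_FIELDS) - 1)]
    if len(parts) != len(RAS_FIELDS):
        raise ValueError(f"expected {len(RAS_FIELDS)} fields, got {len(parts)}")
    return RasMessage(*parts)


# Made-up line for illustration; not taken from the actual log.
sample = ("1|MSG_0001|KERNEL|sub|code_01|FATAL|2009-01-05 10:00:00|-|0|n1|"
          "blk|loc|sn|ecid|text with | inside")
msg = parse_ras_line(sample)
print(msg.SEVERITY, "/", msg.MESSAGE)
```

Accessing fields by name, for example `msg.SEVERITY` or `msg.EVENT_TIME`, keeps later filtering code readable and independent of column positions.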

More details about the data and system it comes from will be made available soon. In the meantime, please refer to the paper by Zheng et al. listed below which provides an analysis of the data.

Downloads

RAS log

Papers using this data

Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner, "Co-Analysis of RAS Log and Job Log on Blue Gene/P," in Proc. of IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), Anchorage, AK, USA, 2011.

Acknowledgments

We thank Ziming Zheng and Zhiling Lan from the Illinois Institute of Technology for making this data set available.
