Continuous Improvement Using Comprehensive Root Cause Analysis
Susan Coghlan, Argonne National Laboratory
At the Argonne Leadership Computing Facility, we operate Mira, a 786K-core supercomputer built for scalable, tightly coupled scientific workloads. In operating this system and its predecessor, we have developed a process for continuous system improvement based on root cause analysis of every failed job. Each week, failed jobs are analyzed, correlated with hardware and software failures, and categorized by area and type. The data is rolled up into monthly statistics and analyzed for trends, which then serve as a basis for prioritizing system development, hardware and software procurements, and major projects.
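The weekly categorization and monthly rollup described above might be sketched as follows. This is an illustration only, not the ALCF tooling: the FailedJob record, its field names, the example category values, and the monthly_rollup and trend functions are all assumptions introduced here.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FailedJob:
    """One analyzed failed job (hypothetical record layout)."""
    job_id: str
    failed_on: date
    area: str          # assumed category, e.g. "hardware" or "software"
    failure_type: str  # assumed category, e.g. "node board" or "filesystem"


def monthly_rollup(jobs):
    """Count failed jobs per (area, failure_type) for each calendar month."""
    rollup = defaultdict(Counter)
    for job in jobs:
        month = (job.failed_on.year, job.failed_on.month)
        rollup[month][(job.area, job.failure_type)] += 1
    return dict(rollup)


def trend(rollup, category):
    """Return the month-by-month count for one (area, failure_type) pair."""
    # Counter yields 0 for categories with no failures in a given month.
    return [(month, rollup[month][category]) for month in sorted(rollup)]
```

Given a list of weekly-analyzed FailedJob records, calling trend(monthly_rollup(jobs), ("hardware", "node board")) would produce the kind of per-category monthly series that could feed a prioritization discussion.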
Over the last five years, this process has contributed to improved reliability (e.g., a 5x improvement in Mira’s mean time to interrupt, or MTTI) and utilization (a 10% increase), and it has changed the way we decide which problems to tackle. The tools created and the process refinements made have also halved the time spent on failure analysis each week.
In this talk, I'll describe how we've implemented this process, what infrastructure we needed in order to make it work smoothly, and changes to the management process that have taken place as a result of this work.
Susan Coghlan is the Deputy Division Director for the Argonne Leadership Computing Facility (ALCF) and the project director for the facility’s powerful supercomputing systems, including Mira, the fifth-fastest supercomputer in the world. She is responsible for overseeing the installation of the supercomputers and ensuring they meet the U.S. Department of Energy’s mission needs.
In her previous roles as Associate Division Director and Director of Operations for the ALCF, she was responsible for the installation and operation of the world's fastest open science computer (TOP500 List, June 2008), the ALCF's 557-teraflops Blue Gene/P production system. Susan has worked on parallel and distributed computers for over 25 years, from developing scientific applications, including her work on a model of the human brain at the Center for NonLinear Studies at Los Alamos National Laboratory, to managing ultra-scale supercomputers like ASCI Blue Mountain, a 6,144-processor system at Los Alamos. In 2000, she co-founded a research laboratory in Santa Fe (sponsored by Turbolinux, Inc.) that developed the world's first dynamic provisioning system for large clusters and data centers. Susan is well known within the high-performance computing community and has presented numerous tutorials, lectures, and papers on her work.