A State-Machine Approach to Disambiguating Supercomputer Event Logs


Jon Stearley, Robert Ballance, and Lara Bauman, Sandia National Laboratories


Supercomputer components are inherently stateful and interdependent, so accurately assessing an event on one component often requires knowledge of previous events on that component or on others. Administrators who monitor and interact with the system daily generally have enough operational context to interpret events accurately, but researchers working only from historical logs risk drawing incorrect conclusions. To address this risk, we present a state-machine approach for tracing context in event logs, a flexible implementation in Splunk, and an example of its use to disambiguate a frequently occurring event type on an extreme-scale supercomputer. Specifically, of 70,126 heartbeat stop events over three months, we identify only 2% as indicating failures.
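The idea of tracing context with a state machine can be illustrated with a minimal Python sketch. This is not the authors' Splunk implementation: the state names, event names, transition table, and the assumption that an unseen node starts in the "up" state are all hypothetical, chosen only to show how the same heartbeat stop event can be read as a failure or as expected depending on what preceded it.

```python
from dataclasses import dataclass

# Hypothetical transition table: (current state, event kind) -> next state.
# These names are illustrative assumptions, not taken from the paper.
TRANSITIONS = {
    ("up", "admin_shutdown"): "admin_down",
    ("up", "heartbeat_stop"): "failed",
    ("admin_down", "boot_complete"): "up",
    ("failed", "boot_complete"): "up",
}


@dataclass
class Event:
    time: int   # timestamp used for ordering
    node: str   # component the event was logged on
    kind: str   # event type


def classify_heartbeat_stops(events):
    """Label each heartbeat stop using the node's state just before it."""
    state = {}      # node -> current state
    results = []
    for ev in sorted(events, key=lambda e: e.time):
        # Assumption: a node with no prior events is treated as "up".
        current = state.get(ev.node, "up")
        if ev.kind == "heartbeat_stop":
            # A stop on a node that should be running is suspicious;
            # a stop after an administrative shutdown is expected.
            label = "failure" if current == "up" else "expected"
            results.append((ev.time, ev.node, label))
        # Pairs missing from the table leave the state unchanged.
        state[ev.node] = TRANSITIONS.get((current, ev.kind), current)
    return results


log = [
    Event(1, "n1", "admin_shutdown"),
    Event(2, "n1", "heartbeat_stop"),
    Event(3, "n2", "heartbeat_stop"),
    Event(4, "n1", "boot_complete"),
    Event(5, "n1", "heartbeat_stop"),
]
labels = classify_heartbeat_stops(log)
```

In this sketch the heartbeat stop on n1 at time 2 follows an administrative shutdown and is labeled expected, while the stops on n2 at time 3 and on n1 at time 5 occur on nodes believed to be up and are labeled failures. A real rule set would also need to account for events on other, interdependent components.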


@inproceedings {179443,
author = {Jon Stearley and Robert Ballance and Lara Bauman},
title = {A {State-Machine} Approach to Disambiguating Supercomputer Event Logs},
year = {Submitted},
url = {https://www.usenix.org/conference/mad12/workshop-program/presentation/Stearley},
publisher = {USENIX Association}
}
