J. Lowell Wofford, Kevin Pelzel, and Travis Cotton, Los Alamos National Laboratory
The overarching design of cluster system management stacks has not changed in decades. Most existing tooling works the same way: set up netboot, configure some system "images," power on, and hope for the best. This set-it-and-leave-it approach is inadequate as systems grow in size and complexity. Modern systems need robust ways to automate systems management and enforce system states over time.
We have been rethinking the tooling for clustered systems. We introduce a new framework for distributed system automation, "Kraken," as well as a Kraken-based provisioning toolkit, "Layercake." Together they provide distributed, stateful provisioning and automation across clustered systems. Immediate advantages include: scalably and reliably initializing clusters from bare metal; self-healing capabilities for (some) failures; continuous system state enforcement; and automated changes to configurations, personalities, and node images (often in seconds), all while being declarative, idempotent, modular, and extensible. We will present the Kraken/Layercake tooling and outline its core design principles.
J. Lowell Wofford is a scientist at Los Alamos National Laboratory in the HPC Design group. Over the past couple of decades, he has dabbled in many aspects of High-Performance Computing, from scientific algorithms to system design. Lowell's current work is on Cluster and Supercomputer design, including system hardware, high-speed networks, and system software architecture. Most recently, he has focused on novel ways to automate the management of very large distributed systems.
Kevin is a scientist at Los Alamos National Laboratory. He graduated from the University of Wisconsin-Stout in 2018 with a bachelor's degree in Computer and Electrical Engineering and immediately started work in LANL's HPC division, first as a post-baccalaureate student, then as a staff scientist. Since then, he has been working in the HPC environments group, developing tools for system management such as automated system bring-up and maintenance, syslog analysis, and data transfer utilities.
Travis is a scientist at Los Alamos National Laboratory in the HPC division. He graduated from New Mexico State University in 2013 with a master's degree in Computer Science and has worked in HPC in various roles throughout his career, starting as a research assistant during his master's program. He joined LANL as a scientist in 2018 in the HPC systems group, where he focuses on production computing, configuration management, and cluster image building.