More Performant Cluster State Management Using Open Source Firmware and a Kraken

Note: Presentation times are in Pacific Daylight Time (PDT).

Tuesday, June 01, 2021 - 12:45 pm–1:30 pm

Devon Bautista and J. Lowell Wofford, Los Alamos National Laboratory

Abstract: 

Vendor-provided firmware is often proprietary and closed, which can present hurdles in high-performance computing (HPC). Vendor firmware usually provides a generic way of bootstrapping systems because it has to accommodate many situations, but purpose-built clusters would benefit from more purpose-built firmware. The ability to customize system initialization at a finer granularity would provide more control over the hardware. This could increase boot efficiency and reduce boot times by eliminating unused features and introducing more useful ones, but proprietary firmware tends to limit how much fine-tuning is possible. This talk will demonstrate a use case for open firmware in HPC through integration with Kraken, a distributed state management tool focused on managing stateless HPC clusters. It will show how open firmware can be leveraged to eliminate nonessentials from the node boot process and to provision nodes more reliably.

Devon Bautista, Los Alamos National Laboratory

Devon is a post-master's student at Los Alamos National Laboratory working under the New Mexico Consortium. He completed his Bachelor of Science in Computer Systems Engineering in 2019 and his Master of Science in Computer Engineering in 2020, both at Arizona State University, and started working at LANL as a summer intern in 2019. He currently works in LANL's HPC Design group, focusing on system initialization, management, and provisioning from a low-level perspective.

J. Lowell Wofford, Los Alamos National Laboratory

J. Lowell Wofford is a scientist at Los Alamos National Laboratory in the HPC Design group. Over the past couple of decades, he has worked on many aspects of high-performance computing, from scientific algorithms to system design. Lowell's current work is on cluster and supercomputer design, including system hardware, high-speed networks, and system software architecture. Most recently, he has focused on novel ways to automate the management of very large distributed systems.

BibTeX
@conference {272797,
author = {Devon Bautista and J. Lowell Wofford},
title = {More Performant Cluster State Management Using Open Source Firmware and a Kraken},
year = {2021},
publisher = {{USENIX} Association},
month = jun,
}
