Challenges of Making Large AI Clusters Reliable

Tuesday, 7 October, 2025 - 09:45–10:30

John Looney and Panos Christeas, Crusoe.ai

High Performance Computing clusters are very different from typical datacenter hardware, and they do not fit the usual SRE approach of treating servers as cattle, not pets. This also affects how SREs build on top of the lower infrastructure layer.

John Looney has been a full stack SRE for 20 years, working at every layer from hardware design to 100 million RPC/s revenue booking services. He has spent the last year building the fastest AI training clusters possible, and learning that they are very different from typical datacenters.

BibTeX
@conference {311802,
author = {John Looney and Panos Christeas},
title = {Challenges of Making Large {AI} Clusters Reliable},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
