Lerna Ekmekcioglu, Clockwork.io
A Formula 1 car at high speeds maintains a firm grip on the track, yet a piece of debris can force it to slow down and even bring it to a halt. Similarly, AI workloads rely on network fabrics such as RoCE and InfiniBand not only for high throughput but also for reliable communication paths.
When the path is littered with "debris" like NIC and link flaps, the results are just as frustrating: performance degrades, jobs crash, and costly rollbacks erode ROI.
Join me as we peek under the hood at the key networking challenges that hold back AI workloads. Through demos, we’ll see how these problems impact the performance and reliability of AI jobs. Just as a pit crew ensures the race car thrives despite the obstacles on the track, let’s dive into these networking challenges to ensure AI workloads power through to the finish line at peak performance, just like a Formula 1 champion’s car!

Lerna is a Sr. Solutions Engineer at Clockwork Systems, where she helps customers meet performance and reliability goals with software solutions built on Clockwork.io’s foundational research. Prior to this, she was a Sr. Solutions Architect at AWS for 3 years. Before that, Lerna spent 17 years as an infrastructure engineer in large financial services companies, focused on problems of scale such as centralized authentication systems, distributed caching, and multi-region cloud-native deployments, to name a few. In her spare time, she enjoys hiking, sightseeing, and backyard astronomy.

author = {Lerna Ekmekcioglu},
title = {Resilience for {AI} Workloads at Scale: The Fast and the Finicky!},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
