Rob Durst, Spring Health
Scaling a site-reliability culture from the ground up at a hyper-growth, resource-constrained startup is a uniquely challenging endeavor: plenty of playbooks for scaling SRE teams exist, yet every startup’s socio-technical reality is its own puzzle. And while "reliability is the most important feature" rings true, tight deadlines and shifting priorities often sideline proactive reliability initiatives. Because the cycles that do go to reliability work are a precious commodity, ensuring they succeed is paramount.
By retracing our SLO adoption journey, highlighting failures, missteps, and near wins en route to our eventual breakthrough, we uncover an effective litmus test for gauging a team’s SLO readiness (or recognizing when a team isn’t quite there yet).
Today this readiness framework guides how we assess the timing of reliability investments at Spring Health. It also serves as a practical tool for teams in fast-growing engineering orgs still early in their reliability journey, especially those navigating similar constraints.

Rob is a Site Reliability Engineer at Spring Health, where he leads the engineering organization’s SLO rollout and all things perf lab. He transitioned into his current role from the software engineering side, bringing with him experience across a range of domains, from edge networking to blockchain systems. On the side, he is also a programming language enthusiast who has spent entirely too much time tinkering with a declarative DSL for defining service expectations.
Rob lives at the foot of the beautiful Wasatch Front, where he and his wife spend their time outside of work cheering on the Mammoth and paddleboarding with their watermelon-obsessed Yorkie.

author = {Rob Durst},
title = {Run, Walk, Crawl, or How We Failed Our Way to {SLO} Readiness},
year = {2025},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
