Alex Hidalgo and Alex Lee, Squarespace
We've all read the SRE books and heard stories of a magical land of engineering organizations with functioning SRE, where following SRE best practices leads to a better reality for both you and your users. But how do we get there? And what does that road look like?
This talk presents a case study on how our team, stuck in a deep reliability hole maintaining our company's centralized logging platform, adopted many SRE best practices to resolve a several-months-long incident. It's the story of how we took the highest-trafficked system in our infrastructure from being reliable ~85% of the time to a trusted and documented 99.9%.
Alex Hidalgo has been a Site Reliability Engineer since 2011. During that time he has developed a deep love for sustainable operations, metrics and monitoring, and using error budgets to drive almost every decision. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, he can be found scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.
Alex Lee is an SRE at Squarespace, where he has spent the past 5 years working on systems and processes that enable more reliable engineering. He currently leads the Observability Team, building and maintaining the tools that monitor Squarespace. Based in New York City, Alex is passionate about work that empowers others to achieve their own goals more effectively.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.