“Disorganizing” Your SRE Organization

Tuesday, December 08, 2020 - 2:10 pm–2:50 pm

Leonid Belkind, StackPulse


The COVID-driven WFH/all-remote model has amplified the challenges remote teams traditionally face with incident response and reliability: silos and reduced information exchange, difficulty onboarding or cross-training engineers, and increased noise and toil, to name a few. All of these make it harder for teams to continue delivering reliable services at a time when reliable software is what's keeping the world connected.

Instead of trying to translate existing roles and responsibilities, processes, and methods to the new normal, we'll share how to improve reliability by 'dis-organizing' and democratizing the SRE function, empowering the entire engineering team to own reliability and adopt an SRE mindset. We'll cover goals for SREs/SWEs, training for SWEs, automating knowledge management and sharing, getting started with code-based incident response playbooks, and the role of the SRE in orchestrating it all.

Leonid Belkind is a Co-Founder and CTO at StackPulse, a Site Reliability Engineering orchestration platform. Prior to StackPulse, Leonid co-founded (and was CTO of) Luminate, where he guided this enterprise-grade service from inception, through widespread Fortune 500 adoption, to acquisition by Symantec. Before Luminate, Leonid managed software development organizations at CheckPoint.

Throughout his career, Leonid has witnessed modern software engineering practices replace traditional ones, first around Continuous Integration and Delivery pipelines, then Infrastructure Management and Monitoring, and onwards as software services have replaced on-premises products. Throughout this journey, Leonid has become passionate about building reliability-first architectures, methodologies, and organizational culture.

