Failure Happens: Improving Incident Response in Large-Scale Organizations

Friday, November 03, 2017 - 11:45 am–12:30 pm

Damon Edwards, Rundeck, Inc.

Abstract: 

Deployment is a solved problem. Yes, there is still work to be done, but the operations community has successfully proven that we can both scale deployment automation and distribute the capability to execute deployments. Now, we have to turn our attention to the next critical constraint: What happens after deployment?

We all know that failure is inevitable and is coming our way at any moment. How do we respond quickly and effectively to those failures? What works when there is just a small set of teams or an isolated system to manage will quickly break down when the organization grows in size and complexity. But on the other hand, what has been commonly practiced in large-scale enterprises is proving to be too cumbersome, too silo-dependent, and simply too slow for today's business needs.

How do we rapidly respond to incidents and recover complex interdependent systems while working within an equally complex and interdependent organization? How does operations embrace the DevOps- and Agile-inspired demand for speed and self-service while maintaining quality and control?

This talk examines the trial-and-error lessons learned by some forward-thinking enterprises that are currently streamlining how they:

  • Resolve incidents
  • Reduce friction between teams
  • Divide up operational responsibilities
  • Improve the quality of their ongoing operations.

See how these companies are rethinking how and where operations happens by applying Lean and DevOps principles mixed with modern tooling practices.

This talk will:

  • Dissect examples of operational incidents from inside actual large enterprises
  • Identify the common organizational and technical anti-patterns that prevent quick and effective incident resolution and interfere with organizational learning
  • Discuss emerging design patterns and techniques that remove the friction and bottlenecks while empowering teams (highlighting publicly referenceable work shared with the DevOps community)

Damon Edwards, Rundeck, Inc.

Damon Edwards is a Co-Founder of Rundeck, Inc., the makers of Rundeck, the open orchestration and scheduling platform. Damon Edwards was previously a Managing Partner at DTO Solutions, a DevOps and IT Operations improvement consultancy. Damon has spent over 15 years working with both the technology and business ends of IT operations and is noted for being a leader in porting cutting-edge DevOps techniques to large enterprise organizations. Damon is also a frequent conference speaker and writer who focuses on DevOps and operations improvement topics. Damon is active in the international DevOps community, including being a co-host of the DevOps Cafe podcast with John Willis, an early core organizer of the DevOps Days conference series, and a content chair for Gene Kim’s DevOps Enterprise Summit.

BibTeX
@conference {207179,
author = {Damon Edwards},
title = {Failure Happens: Improving Incident Response in {Large-Scale} Organizations},
year = {2017},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = oct
}
