Debugging and Extending Distributed Coordination Systems
Raúl Gutiérrez Segalés, Twitter
At Twitter, we use Apache ZooKeeper as a cornerstone for nearly all of our distributed systems. SREs run ZooKeeper clusters and write tools for diagnosing bad actors. They also extend the service to make it more reliable.
In this talk, we'll go over the tools we've written (and open sourced!) to monitor our systems and debug problems seen at large scale. In addition, as part of our work to rate-limit our myriad clients, we've developed several protocol extensions for ZooKeeper without breaking backwards compatibility. We'll discuss the design considerations involved when extending a service with a multitude of client library versions, as well as how this approach has improved our ability to experiment and iterate quickly.
Raúl is a Staff Site Reliability Engineer at Twitter, with experience on the Coordination Team as well as the Traffic Team. He is the primary author of ZKTraffic and several of the extensions that we use to keep ZooKeeper sane at Twitter.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.