Honey, I Broke the Things: Debugging Gray Failures in Production!

Wednesday, 26 October, 2022 - 17:15-17:30 CEST

Radha Kumari, Slack Technologies


Migrations are one of the most challenging tasks we do as infrastructure engineers. At Slack, we switched from HAProxy to Envoy Proxy. Overall, this migration was a success, and did not cause any downtime, but even so, we ran into several interesting edge cases that caused minor problems.

Troubleshooting these sorts of 'gray' failures can be a difficult technical challenge. This talk will discuss some of those facepalm moments: how they were detected, the steps taken to investigate them, and how they were solved.

Takeaways from this talk include a specific set of approaches for debugging such problems with Envoy Proxy and other web proxies, which we learnt through these events, along with some engineering practices that ease the stress of a large migration.

Radha is a Staff Software Engineer for the Demand Engineering team at Slack (Ireland) where she focuses on ensuring "bytes" move in and out of Slack as expected.

Outside work, she loves travelling around the world and has been to over 25 countries since 2013. She also has a passion for collecting shoes.

