Slack's DNSSEC Rollout: Third Time's the Outage

Wednesday, 26 October, 2022 - 14:45–15:30 CEST

Rafael Elvira, Slack

Abstract: 

We all have to manage DNS. DNS changes are inherently high-blast-radius and high-visibility.

We present a case study of what happened when a large SaaS company enabled DNSSEC. We did significant planning and testing beforehand. The rollout went smoothly for most of our domains, but one domain caused problems. We attempted three times to enable DNSSEC on this domain. Twice we rolled back after a partial rollout because of actual (or suspected) customer impact.

On the third occasion, we rolled out DNSSEC fully and then determined that the change had broken a small subset of our customers. While attempting to roll back… we made it worse. This talk will describe what happened.

Main Takeaways

  1. A better appreciation of DNSSEC’s workings, including how various DNS TTLs work between root, TLD name servers and recursive resolvers
  2. Strategies for mitigating risk of DNS changes to critical/high impact zones (and some areas we missed)
  3. An appreciation of some of the long-tail problems with DNS that are difficult to de-risk entirely with current tooling
  4. An entertaining outage story

Rafael Elvira, Slack

Rafael is a Staff Software Engineer for the Demand Engineering team at Slack based in Madrid, Spain. The Demand Engineering team enables fast and reliable delivery of Slack to our 12M+ globally distributed daily active users.

Outside work, Rafa enjoys traveling, cooking and spending time in the mountains: climbing, hiking, mountain biking or skiing with friends.

BibTeX
@conference {284631,
author = {Rafael Elvira},
title = {Slack{\textquoteright}s {DNSSEC} Rollout: Third Time{\textquoteright}s the Outage},
year = {2022},
address = {Amsterdam},
publisher = {USENIX Association},
month = oct
}
