Tuesday, May 23, 2017 - 3:00pm–3:25pm

Zehua Liu, Zendesk


Disaster Recovery (DR) is an important area in SRE. A simplified scenario is recovery from a full data centre failure. The Zendesk Chat backend infrastructure operates in a single data centre. The only way to be sure that DR works is to perform a real failover. Past failover attempts were full of surprises and unexpected issues, most of them involving applications that failed to work after failover, for various reasons. These issues led to failed failover tests and/or extended maintenance windows, because extra effort was required to bring things back to order, and that in turn caused a poor customer experience.

How can we increase our confidence that the system will work after failing over? We want to declare confidently that it should work, instead of typing the fingers-crossed emoji. In this talk, we share our experiment with setting up testing for the DR environment. The biggest question is whether we would accept writes to the DR DBs other than those coming from replication from production. The trade-off is between the risk to production stability and the risk of a failed DR failover. We will discuss the alternatives we considered and the approach we finally adopted.

Zehua Liu, Zendesk Singapore

Zehua established and leads the tooling team at Zendesk, where he works on making sure that developers are happy developing what they want to develop and that the quality of the products they deliver is great. He is currently based in Singapore.

