Sherry Xiao, Facebook
Deploying a service across multiple continents is difficult, especially when the service is stateful. Facebook now has multiple datacenters across the US and Europe, while Instagram's infrastructure remains only in the US. How can we scale Instagram across the ocean? What problems do we need to solve?
One of the databases Instagram uses heavily is Cassandra. Running Cassandra with too many replicas makes the database more complex to maintain, not to mention that having quorum requests travel across the ocean is just... slow. So, we partitioned our dataset! The idea is to have a European Cassandra partition and a US partition, and to send users to their nearest partition.
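As a rough illustration of the two ideas in the paragraph above, the sketch below picks the nearest partition for a user and computes how many replicas a quorum request must reach. The region names, round-trip times, and function names are hypothetical and do not come from the talk; only the quorum formula (half the replication factor, rounded down, plus one) is standard Cassandra behavior.

```python
# Hypothetical round-trip times in milliseconds from a user's region to each
# partition. These numbers are illustrative assumptions, not measurements.
RTT_MS_BY_REGION = {
    "north-america": {"us": 20, "eu": 110},
    "europe": {"us": 110, "eu": 20},
}


def nearest_partition(user_region: str) -> str:
    """Return the partition with the lowest round-trip time for this region."""
    rtts = RTT_MS_BY_REGION[user_region]
    return min(rtts, key=rtts.get)


def quorum_size(replication_factor: int) -> int:
    """Number of replicas that must respond to a QUORUM request."""
    return replication_factor // 2 + 1


if __name__ == "__main__":
    print(nearest_partition("europe"))
    # More replicas means more nodes must acknowledge every quorum request.
    for rf in (3, 5, 6):
        print(rf, quorum_size(rf))
```

Under these assumptions, a user in Europe is served by the European partition, so quorum requests stay on one side of the ocean. The quorum also grows with the number of replicas, which is part of why adding copies makes requests more expensive.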
When we started to put together the plan for deploying Instagram in the current European datacenters, we encountered several problems. How do we make sure users have all the data they need stored in the same partition? When one of the European datacenters fails, how do we fail over, and where do we send that traffic?
This talk will cover:
- The challenges we had during the infrastructure design and disaster recovery planning
- How we use social hash to keep all the data belonging to one user in the same partition as much as possible, and how it helps reduce the cache miss rate
- The failover plan when one European datacenter fails, including how we shift the traffic around
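To make the traffic-shifting question concrete, here is a minimal sketch of one possible policy: when a datacenter fails, its load is spread over the surviving datacenters in proportion to their spare capacity. This is an assumption made for illustration, not the failover plan described in the talk, and the datacenter names, loads, and capacities are hypothetical.

```python
def shift_traffic(load: dict, capacity: dict, failed: str) -> dict:
    """Return the per-datacenter load after draining the failed datacenter.

    Displaced traffic is split among survivors in proportion to their
    headroom (capacity minus current load). This policy is an assumption.
    """
    displaced = load[failed]
    if displaced == 0:
        return dict(load)

    survivors = [dc for dc in load if dc != failed]
    headroom = {dc: capacity[dc] - load[dc] for dc in survivors}
    total_headroom = sum(headroom.values())
    if displaced > total_headroom:
        raise RuntimeError("not enough spare capacity to absorb the failed datacenter")

    new_load = {
        dc: load[dc] + displaced * headroom[dc] / total_headroom
        for dc in survivors
    }
    new_load[failed] = 0.0
    return new_load


if __name__ == "__main__":
    # Hypothetical datacenters; load and capacity are in arbitrary units.
    load = {"eu-1": 40.0, "eu-2": 30.0, "eu-3": 50.0}
    capacity = {"eu-1": 100.0, "eu-2": 100.0, "eu-3": 100.0}
    print(shift_traffic(load, capacity, failed="eu-1"))
```

Splitting by headroom means no survivor is pushed past its capacity as long as the total spare capacity covers the displaced load. If it does not, the sketch raises an error, which is the situation capacity planning is meant to rule out.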
I'm a Production Engineer working on scaling Instagram infrastructure. My team supports all engineering teams at Instagram and is involved in many areas, including rapidly scaling infrastructure, capacity planning, and designing and practicing disaster recovery plans for Instagram.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.