How Instagram is scaling its infrastructure across the ocean
Scaling up your infrastructure is especially challenging when the next data center is on another continent.
In 2014, two years after Instagram joined Facebook, Instagram's engineering team moved the company's infrastructure from Amazon Web Services (AWS) servers to Facebook's data centers. Facebook has multiple data centers across the United States and Europe but, until recently, Instagram used only U.S. data centers.
The main reason Instagram wants to scale its infrastructure to the other side of the ocean is that we have run out of space in the United States. As the service continues to grow, Instagram has reached a point in which we need to consider leveraging Facebook's data centers in Europe. An added bonus: Local data centers will mean lower latency for European users, which will create a better user experience on Instagram.
In 2015, Instagram scaled its infrastructure from one to three data centers to deliver much-needed resiliency—our engineering team didn't want to relive the AWS disaster of 2012 when a huge storm in Virginia brought down nearly half of its instances. Scaling from three to five data centers was trivial; we simply increased the replication factor and duplicated data to the new regions; however, scaling up is more difficult when the next data center lives on another continent.
Infrastructure can generally be separated into two types:
- Stateless service is usually used as computing and scales based on user traffic (on an as-needed basis). The Django web server is one example.
- Stateful service is usually used as storage and must be consistent across data centers. Examples include Cassandra and TAO.
Everyone loves stateless services—they're easy to deploy and scale, and you can spin them up whenever and wherever as you need them. The truth is we also need stateful services like Cassandra to store user data. Running Cassandra with too many copies not only increases the complexity of maintaining the database; it's also a waste of capacity, not to mention having quorum requests travel across the ocean is just…slow.
Instagram also uses TAO, a distributed data store for the social graph, as data storage. We run TAO as a single master per shard, and no slave updates the shard for any write request. It forwards all writes to the shard's master region. Because all writes happen in the master region (which lives in the United States), the write latency is unbearable in Europe. You may notice that our problem is basically the speed of light.
Can we reduce the time it takes a request to travel across the ocean (or even make the round trip disappear)? Here are two ways we can solve this problem.
To prevent quorum requests from going across the ocean, we're thinking about partitioning our dataset into two parts: Cassandra_EU and Cassandra_US. If European users' data stores are in the Cassandra_EU partition, and U.S. users' data stores are in the Cassandra_US partition, users' requests won't need to travel long distances to fetch data.
For example, imagine there are five data centers in the United States and three data centers in the European Union. If we deploy Cassandra in Europe by duplicating the current clusters, the replication factor will be eight and quorum requests must talk to five out of eight replicas.
If, however, we can find a way to partition the data into two sets, we will have a Cassandra_US partition with a replication factor of five and a Cassandra_EU partition with a replication factor of three—and each can operate independently without affecting the others. In the meantime, a quorum request for each partition will be able to stay in the same continent, solving the round-trip latency issue.
Restrict TAO to write to local
To reduce the TAO write latency, we can restrict all EU writes to the local region. It should look almost identical to the end user. When we send a write to TAO, TAO will update locally and won't block sending the write to the master database synchronously; rather it will queue the write in the local region. In the write's local region, the data will be available immediately from TAO, while in other regions, the data will be available after it propagates from the local region. This is similar to regular writes today, which propagate from the master region.
Although different services may have different bottlenecks, by focusing on the idea of reducing or removing cross-ocean traffic, we can tackle problems one by one.
As in every infrastructure project, we've learned some important lessons along the way. Here are some of the main ones.
Don't rush into a new project. Before you start to provision servers in a new data center, make sure you understand why you need to deploy your service in a new region, what the dependencies are, and how things will work when the new region comes into play. Also, don't forget to revisit your disaster recovery plans and make any necessary changes.
Don't underestimate complexity. Always build into your schedule enough time to make mistakes, find unplanned blockers, and learn new dependencies that you didn't know about. You may find yourself on a path that would inadvertently restructure how your infrastructure was built.
Know your trade-offs. Things always come with a price. When we partitioned our Cassandra database, we saved lots of storage space by reducing the replication factor. However, to make sure each partition was still ready to face a disaster, we needed more Django capacity in the front to accept traffic from a failing region because now partitions can't share capacity with each other.
Be patient. Along the way of turning up the European data centers, I don't remember how many times we said, "Oh, crud!" But things always get sorted out, eventually. It might take longer than you expect, but have patience and work together as a team—it's a super-fun journey.
Sherry Xiao will present Cross Atlantic: Scaling Instagram Infrastructure from US to Europe at LISA18, October 29-31 in Nashville, Tenn. Read more about Sherry.