How Robust Monitoring Powers High Availability for LinkedIn Feed

Monday, March 13, 2017 - 3:20pm-3:45pm

Rushin Barot, LinkedIn


It is common practice for networking services like LinkedIn to introduce a new feature to a small subset of users before deploying it to all members. Even with rigorous testing and tight processes, bugs are inevitably introduced with a new deploy. Unit and integration tests in a development environment cannot completely cover all use cases. As a result, the users subject to this treatment get an inferior experience. In addition, these bugs cause service outages, which hurt service availability and health metrics and increase the on-call burden.

Although frameworks exist for testing new features in development clusters, it is not possible to completely simulate all aspects of a production environment. Read requests for the Feed service have enough variation to trigger large classes of unforeseen errors. So we introduced a production instance of the service, called ‘Dark Canary’, that receives live production traffic tee’d from a production source node but does not return any responses upstream. New experimental code is deployed to the dark canary, and its health is compared against the current version running on the production node to detect failures. This helps us reduce the number of bad sessions that users experience and also gives us a better understanding of the service's capacity requirements.
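
The tee-and-compare idea described above can be illustrated with a minimal sketch. This is not LinkedIn's implementation: every name here (`HealthStats`, `tee_request`, `canary_is_healthy`, the `tolerance` parameter) is hypothetical, handlers are modeled as plain Python callables, and the copy is sent synchronously for simplicity, whereas a real tee would most likely forward traffic asynchronously so the canary cannot slow down the production path.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical handler type: takes a request dict, returns a response dict.
Handler = Callable[[Dict], Dict]


@dataclass
class HealthStats:
    """Per-node counters used to compare the canary against production."""
    requests: int = 0
    errors: int = 0

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def call_and_record(handler: Handler, request: Dict,
                    stats: HealthStats) -> Optional[Dict]:
    """Invoke a handler, count the outcome, and return its response (None on failure)."""
    stats.requests += 1
    try:
        return handler(request)
    except Exception:
        stats.errors += 1
        return None


def tee_request(request: Dict, production: Handler, canary: Handler,
                prod_stats: HealthStats, canary_stats: HealthStats) -> Optional[Dict]:
    """Serve the request from production and send a copy to the dark canary.

    The canary's response is discarded: nothing it produces (or any
    exception it raises) is returned upstream.
    """
    response = call_and_record(production, request, prod_stats)
    # Pass a copy so experimental code cannot mutate the production request.
    call_and_record(canary, dict(request), canary_stats)
    return response


def canary_is_healthy(prod_stats: HealthStats, canary_stats: HealthStats,
                      tolerance: float = 0.01) -> bool:
    """Assumed health rule: canary error rate may exceed production's by at most `tolerance`."""
    return canary_stats.error_rate() <= prod_stats.error_rate() + tolerance
```

In this sketch the caller only ever sees the production response, so a crash in the experimental build shows up in `canary_stats` rather than in a member's session. Error rate is only one possible health signal; latency or resource usage could be compared in the same way.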

Rushin Barot is a Site Reliability Engineer at LinkedIn, where he contributes to Feed infrastructure. Prior to LinkedIn, he worked on Search infrastructure at Yahoo. He holds an MS in Computer Science from San Jose State University.

