CRISP: Critical Path Analysis of Large-Scale Microservice Architectures

Authors: 

Zhizhou Zhang, UC Santa Barbara; Murali Krishna Ramanathan, Prithvi Raj, and Abhishek Parwal, Uber Technologies Inc.; Timothy Sherwood, UC Santa Barbara; Milind Chabbi, Uber Technologies Inc.

Abstract: 

Microservice architectures have become the lifeblood of modern service-oriented software systems. Remote Procedure Calls (RPCs) among microservices are deeply nested, asynchronous, and large in number, thus making it very hard to identify the underlying service(s) that contribute to the overall end-to-end latency experienced by a top-level request. State-of-the-art RPC tracing tools collect a deluge of data but provide little insight. We need sophisticated tools to bubble-up signals from a myriad of RPC traces to assist developers in identifying optimization opportunities, pinpointing common bottlenecks, setting appropriate time outs, diagnosing error conditions, and planning and managing compute capacity, to name a few.

In this paper, we present \tool{} --- a tool to perform critical path analysis (CPA) over a large number of traces collected from RPCs in microservices environments. \tool{} provides three critical performance analysis capabilities: a) a \emph{top-down} CPA of any specific endpoint, which is tailored for service owners to drill down the root causes of latency issues, b) a \emph{bottom-up} CPA over all endpoints in the system --- tailored for infrastructure and performance engineers --- to bubble up those (interior) APIs that have a high impact across many endpoints, and c) an on-the-fly anomaly detection to alert potential problems.

We have applied \tool{}'s capabilities on \company{}'s entire backend system composed of sim 40K endpoints that cater to real-time requests from more than 100 million active daily users worldwide. Using the critical path as the basis of performance analysis has a) helped us identify five performance issues and optimization opportunities across two business-critical microservices, b) guided us in our future hardware choice that reduces end-to-end latencies, and c) reduced the false positives in anomaly detection by up to 50% while speeding up the training and inference by up to 28 times and up to 67 times, respectively, over the state of the art.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@inproceedings {280762,
author = {Zhizhou Zhang and Murali Krishna Ramanathan and Prithvi Raj and Abhishek Parwal and Timothy Sherwood and Milind Chabbi},
title = {{CRISP}: Critical Path Analysis of {Large-Scale} Microservice Architectures},
booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
year = {2022},
isbn = {978-1-939133-29-50},
address = {Carlsbad, CA},
pages = {655--672},
url = {https://www.usenix.org/conference/atc22/presentation/zhang-zhizhou},
publisher = {USENIX Association},
month = jul
}

Presentation Video