Metis: Robustly Tuning Tail Latencies of Cloud Systems

Authors: 

Zhao Lucis Li, USTC; Chieh-Jan Mike Liang, Microsoft Research; Wenjia He, USTC; Lianjie Zhu, Wenjun Dai, and Jin Jiang, Microsoft Bing Ads; Guangzhong Sun, USTC

Abstract: 

Tuning configurations is essential for operating modern cloud systems, but the difficulty arises from the cloud system's diverse workloads, large system scale, and vast parameter space. The systems community has recently demonstrated the potential of predictive regression in minimizing the cost of searching for the optimal system configuration. However, we argue that cloud systems introduce challenges to the robustness of auto-tuning. First, system evaluation metrics such as tail latencies are typically sensitive to non-trivial noise. Second, while treating target systems as a black box promotes applicability, it complicates the process of selectively sampling the unknown configuration-vs-performance space for modeling. To this end, Metis is an auto-tuning service used by several Microsoft services, and it implements customized Bayesian optimization to robustly improve auto-tuning: (1) a diagnostic model to find potential data outliers for re-sampling, and (2) a mixture of acquisition functions to balance exploitation, exploration, and re-sampling. This paper uses the Bing Ads key-value store cluster as the running example; production results show that Metis has helped to lower the 99th-percentile query lookup latency by more than 20.4%. In addition, Metis-tuned configurations outperform expert-tuned configurations, while reducing the tuning time from weeks to hours.
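
To make the three sampling actions named in the abstract concrete, the following is a minimal sketch, not the Metis implementation. It assumes a single configuration parameter scaled to [0, 1], a Gaussian process with a unit-amplitude squared-exponential kernel as the regression model, leave-one-out prediction error as a stand-in for the diagnostic model, and a simple threshold rule in place of the paper's mixture of acquisition functions. All function names, parameters, and thresholds (`suggest`, `outlier_threshold`, `explore_threshold`, and so on) are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of configurations."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-2):
    """Posterior mean and standard deviation of a GP with a constant mean.

    The kernel has unit amplitude, so the returned standard deviation is
    relative (between 0 and 1), not in the units of the measured metric.
    """
    offset = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = offset + k_qt @ np.linalg.solve(k_tt, y_train - offset)
    var = 1.0 - np.sum(k_qt * np.linalg.solve(k_tt, k_qt.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def loo_residuals(x_train, y_train):
    """Absolute leave-one-out prediction error for each observed point."""
    residuals = np.zeros(len(x_train))
    for i in range(len(x_train)):
        keep = np.arange(len(x_train)) != i
        mean, _ = gp_posterior(x_train[keep], y_train[keep], x_train[i:i + 1])
        residuals[i] = abs(y_train[i] - mean[0])
    return residuals


def suggest(x_train, y_train, candidates,
            outlier_threshold=15.0, explore_threshold=0.5):
    """Pick the next action: re-sample, explore, or exploit.

    Assumed rule (for illustration only): re-sample the observation the
    model disagrees with most if the disagreement is large; otherwise
    explore the most uncertain candidate; otherwise exploit the candidate
    with the lowest predicted latency.
    """
    if len(x_train) >= 2:
        residuals = loo_residuals(x_train, y_train)
        worst = int(np.argmax(residuals))
        if residuals[worst] > outlier_threshold:
            return "resample", float(x_train[worst])
    mean, std = gp_posterior(x_train, y_train, candidates)
    if std.max() > explore_threshold:
        return "explore", float(candidates[np.argmax(std)])
    return "exploit", float(candidates[np.argmin(mean)])


if __name__ == "__main__":
    # Hypothetical tail-latency measurements (ms); the middle one is a spike.
    x_obs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    y_obs = np.array([10.0, 10.0, 30.0, 10.0, 10.0])
    print(suggest(x_obs, y_obs, np.linspace(0.0, 1.0, 21)))
```

The thresholds here are arbitrary and expressed in different units (`outlier_threshold` in the metric's units, `explore_threshold` as a relative standard deviation); a real tuner would need a principled way to compare the expected benefit of the three actions, which is the role the abstract assigns to the mixture of acquisition functions.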

BibTeX
@inproceedings {216019,
author = {Zhao Lucis Li and Chieh-Jan Mike Liang and Wenjia He and Lianjie Zhu and Wenjun Dai and Jin Jiang and Guangzhong Sun},
title = {Metis: Robustly Tuning Tail Latencies of Cloud Systems},
booktitle = {2018 USENIX Annual Technical Conference (USENIX ATC 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Boston, MA},
pages = {981--992},
url = {https://www.usenix.org/conference/atc18/presentation/li-zhao},
publisher = {USENIX Association},
month = jul
}
