SelfTune: Tuning Cluster Managers

Ajaykrishna Karthikeyan; Nagarajan Natarajan; Gagan Somashekar; Lei Zhao; Ranjita Bhagwan; Rodrigo Fonseca; Tatiana Racheva; Yogesh Bansal

Authors:

Ajaykrishna Karthikeyan and Nagarajan Natarajan, Microsoft Research; Gagan Somashekar, Stony Brook University; Lei Zhao, Microsoft; Ranjita Bhagwan, Microsoft Research; Rodrigo Fonseca, Tatiana Racheva, and Yogesh Bansal, Microsoft

Abstract:

Large-scale cloud providers rely on cluster managers for container allocation and load balancing (e.g., Kubernetes), VM provisioning (e.g., Protean), and other management tasks. These cluster managers use algorithms or heuristics whose behavior depends on multiple configuration parameters. Currently, operators set these parameters manually, using a combination of domain knowledge and limited testing. In very large-scale and dynamic environments, these manually set parameters may lead to sub-optimal cluster states, adversely affecting important metrics such as latency and throughput.

In this paper, we describe SelfTune, a framework that automatically tunes such parameters in deployment. SelfTune piggybacks on the iterative nature of cluster managers, which drive a cluster to a desired state over multiple iterations. Using a simple interface, developers integrate SelfTune into the cluster manager code, which then uses a principled reinforcement learning algorithm to tune important parameters over time. We have deployed SelfTune on tens of thousands of machines that run a large-scale background task scheduler at Microsoft. SelfTune has improved throughput by as much as 20% in this deployment by continuously tuning a key configuration parameter that determines the number of jobs concurrently accessing CPU and disk on every machine. We also evaluate SelfTune with two Azure FaaS workloads, the Kubernetes Vertical Pod Autoscaler, and the DeathStar microservice benchmark. In all cases, SelfTune significantly improves cluster performance.
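
The iterative tuning loop described in the abstract can be illustrated with a small sketch. This is a toy under stated assumptions, not SelfTune's interface or algorithm: the names `ToyTuner`, `predict`, `set_reward`, and `simulated_throughput` are hypothetical, the throughput curve is made up, and a simple finite-difference hill climber stands in for the reinforcement learning algorithm the paper uses.

```python
from dataclasses import dataclass


@dataclass
class ToyTuner:
    """Hypothetical tuner: probes value +/- delta on alternate iterations, then steps uphill."""
    value: float               # current centre of the search
    low: float                 # smallest allowed parameter value
    high: float                # largest allowed parameter value
    delta: float = 1.0         # probe offset around the centre
    step: float = 0.25         # learning rate for the update
    _sign: int = 1             # side currently being probed (+1 or -1)
    _plus_reward: float = 0.0  # reward observed at value + delta

    def _clip(self, x: float) -> float:
        return max(self.low, min(self.high, x))

    def predict(self) -> float:
        """Return the parameter value to use for the next manager iteration."""
        return self._clip(self.value + self._sign * self.delta)

    def set_reward(self, reward: float) -> None:
        """Report the metric observed while the last predicted value was in use."""
        if self._sign == 1:
            # First probe of the pair: remember it and probe the other side next.
            self._plus_reward = reward
            self._sign = -1
            return
        # Both probes seen: finite-difference slope estimate, then one uphill step.
        slope = (self._plus_reward - reward) / (2 * self.delta)
        self.value = self._clip(self.value + self.step * slope)
        self._sign = 1


def simulated_throughput(concurrency: float) -> float:
    """Made-up metric that peaks at a concurrency of 8 (an assumption for the demo)."""
    return 100.0 - (concurrency - 8.0) ** 2


def run(iterations: int = 200) -> float:
    """Drive the tuner from inside a simulated cluster-manager loop."""
    tuner = ToyTuner(value=2.0, low=1.0, high=16.0)
    for _ in range(iterations):
        concurrency = tuner.predict()                        # ask for a value
        tuner.set_reward(simulated_throughput(concurrency))  # report the outcome
    return tuner.value
```

In this sketch the tuner is called once per manager iteration, so tuning rides along with work the manager already does rather than requiring a separate experiment phase. Calling `run()` moves the parameter from its starting value of 2 toward the peak of the simulated curve.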

BibTeX

@inproceedings {285192,
author = {Ajaykrishna Karthikeyan and Nagarajan Natarajan and Gagan Somashekar and Lei Zhao and Ranjita Bhagwan and Rodrigo Fonseca and Tatiana Racheva and Yogesh Bansal},
title = {{SelfTune}: Tuning Cluster Managers},
booktitle = {20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)},
year = {2023},
isbn = {978-1-939133-33-5},
address = {Boston, MA},
pages = {1097--1114},
url = {https://www.usenix.org/conference/nsdi23/presentation/karthikeyan},
publisher = {USENIX Association},
month = apr
}
