Hydra: a federated resource manager for data-center scale analytics

Authors: 

Carlo Curino, Subru Krishnan, and Konstantinos Karanasos, Microsoft; Sriram Rao, Facebook; Giovanni M. Fumarola, Botong Huang, Kishore Chaliparambil, Arun Suresh, Young Chen, Solom Heddaya, Roni Burd, Sarvesh Sakalanaga, Chris Douglas, Bill Ramsey, and Raghu Ramakrishnan, Microsoft

Abstract: 

Microsoft's internal data lake processes exabytes of data over millions of cores daily on behalf of thousands of tenants. Scheduling this workload requires 10x to 100x more decisions per second than existing, general-purpose resource management frameworks are known to handle. In 2013, we were faced with a growing demand for workload diversity and richer sharing policies that our legacy system could not meet. In this paper, we present Hydra, the resource management infrastructure we built to meet these requirements.

Hydra leverages a federated architecture, in which a cluster comprises multiple, loosely coordinating sub-clusters. This allows us to scale by delegating placement of tasks on machines to each sub-cluster, while centrally coordinating only to ensure that tenants receive the right share of resources. To adapt to changing workload and cluster conditions promptly, Hydra's design features a control plane that can push scheduling policies across tens of thousands of nodes within seconds. This feature, combined with the federated design, allows for great agility in developing, evaluating, and rolling out new system behaviors.
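
The division of labor described above can be illustrated with a minimal sketch. It is not Hydra's or YARN's actual code: the names SubCluster, split_quota, and schedule are hypothetical, and the proportional-split and most-free-node policies are assumptions chosen only to keep the example small. The central step decides how much of a tenant's quota each sub-cluster may use; each sub-cluster then picks machines on its own.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SubCluster:
    """Hypothetical sub-cluster that places tasks on its own nodes."""
    name: str
    free_slots: dict[str, int]  # node name -> free task slots
    placements: list[tuple[str, str]] = field(default_factory=list)

    def place(self, task: str) -> bool:
        """Local decision: use the node with the most free slots (assumed policy)."""
        if not self.free_slots:
            return False
        node = max(self.free_slots, key=self.free_slots.get)
        if self.free_slots[node] == 0:
            return False
        self.free_slots[node] -= 1
        self.placements.append((task, node))
        return True


def split_quota(quota: int, subclusters: list[SubCluster]) -> dict[str, int]:
    """Central decision: divide a tenant's slot quota across sub-clusters
    in proportion to their free capacity (assumed policy)."""
    capacity = {sc.name: sum(sc.free_slots.values()) for sc in subclusters}
    total = sum(capacity.values())
    if total == 0:
        return {name: 0 for name in capacity}
    quota = min(quota, total)  # never promise more than is free
    shares = {name: quota * cap // total for name, cap in capacity.items()}
    # Hand out slots lost to integer rounding, largest sub-cluster first.
    leftover = quota - sum(shares.values())
    for name in sorted(capacity, key=capacity.get, reverse=True):
        if leftover == 0:
            break
        if shares[name] < capacity[name]:
            shares[name] += 1
            leftover -= 1
    return shares


def schedule(tasks: list[str], quota: int, subclusters: list[SubCluster]) -> list[str]:
    """Place as many tasks as the quota allows; return the tasks left pending."""
    shares = split_quota(quota, subclusters)
    pending = list(tasks)
    for sc in subclusters:
        for _ in range(shares[sc.name]):
            if not pending:
                break
            if sc.place(pending[0]):
                pending.pop(0)
    return pending
```

In this sketch the central coordinator never looks at individual machines, only at per-sub-cluster totals, which is the property that lets the local placement step run independently in each sub-cluster.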

We built Hydra by leveraging, extending, and contributing our code to Apache Hadoop YARN. Hydra is currently the primary big-data resource manager at Microsoft. Over the last few years, Hydra has scheduled nearly one trillion tasks that manipulated close to a zettabyte of production data.

NSDI '19 Open Access Sponsored by NetApp

BibTeX
@inproceedings {226010,
author = {Carlo Curino and Subru Krishnan and Konstantinos Karanasos and Sriram Rao and Giovanni M. Fumarola and Botong Huang and Kishore Chaliparambil and Arun Suresh and Young Chen and Solom Heddaya and Roni Burd and Sarvesh Sakalanaga and Chris Douglas and Bill Ramsey and Raghu Ramakrishnan},
title = {Hydra: a federated resource manager for data-center scale analytics},
booktitle = {16th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 19)},
year = {2019},
isbn = {978-1-931971-49-2},
address = {Boston, MA},
pages = {177--192},
url = {https://www.usenix.org/conference/nsdi19/presentation/curino},
publisher = {{USENIX} Association},
month = feb,
}
