Sangeetha Abdu Jyothi, Microsoft and University of Illinois at Urbana–Champaign; Carlo Curino, Ishai Menache, and Shravan Matthur Narayanamurthy, Microsoft; Alexey Tumanov, Microsoft and Carnegie Mellon University; Jonathan Yaniv, Technion—Israel Institute of Technology; Ruslan Mavlyutov, Microsoft and University of Fribourg; Íñigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao, Microsoft
Modern resource management frameworks for large-scale analytics leave unresolved the tension between high cluster utilization and predictable job performance, which are coveted by operators and users, respectively. We address this in Morpheus, a new system that: 1) codifies implicit user expectations as explicit Service Level Objectives (SLOs), inferred from historical data, 2) enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and 3) mitigates inherent performance variance (e.g., due to failures) by means of dynamic reprovisioning of jobs. We validate these ideas against production traces from a 50k-node cluster, and show that Morpheus can lower the number of deadline violations by 5x to 13x, while retaining cluster utilization and lowering cluster footprint by 14% to 28%. We demonstrate the scalability and practicality of our implementation by deploying Morpheus on a 2700-node cluster and running it against production-derived workloads.