AI4DL: Mining Behaviors of Deep Learning Workloads for Resource Management

Authors: 

Josep L. Berral, Barcelona Supercomputing Center, Universitat Politecnica de Catalunya; Chen Wang and Alaa Youssef, IBM Research

Abstract: 

The more we know about the resource usage patterns of workloads, the better we can allocate resources. Here we present a methodology to discover resource usage behaviors for the training workloads of Deep Learning (DL) models. From monitoring, we observe repeating patterns and similarities in resource usage among containers running the training workloads of different DL models. The scheduler or the resource autoscaler can leverage these repeating patterns to reduce resource fragmentation and overall resource utilization in a dedicated DL cluster. Specifically, our approach combines Conditional Restricted Boltzmann Machines (CRBMs) and clustering techniques to discover common sequences of behaviors (phases) of containers running the model training workloads in clusters providing IBM Deep Learning Services. By studying the resource usage pattern at each phase and the typical sequences of phases among different containers, we can discover a reduced set of prototypical executions representing most executions. We use statistical information from each phase to refine resource provisioning by dynamically tuning the amount of resources each container requires at each phase of its execution. Evaluation of our method shows that container resource usage displays typical patterns that can help reduce CPU and memory consumption by 30% relative to reactive policies, which is close to having a priori knowledge of resource usage, while fulfilling resource demand over 95% of the time.
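As a rough illustration of the per-phase provisioning idea in the abstract, the sketch below sizes a container's allocation from the usage observed in each phase and compares it with a single peak-sized allocation. It is not the authors' implementation: the phase labels are assumed to be given (standing in for the output of the CRBM and clustering step), and the function names `phase_limits` and `evaluate`, the quantile rule, and the toy CPU trace are all hypothetical.

```python
import math
from collections import defaultdict


def phase_limits(usage, phases, quantile=0.95):
    """Per-phase provisioning level: the given quantile of observed usage.

    `phases` is assumed to come from some phase-discovery step; here the
    labels are simply supplied by hand.
    """
    by_phase = defaultdict(list)
    for u, p in zip(usage, phases):
        by_phase[p].append(u)
    limits = {}
    for p, values in by_phase.items():
        values.sort()
        # Nearest-rank quantile: smallest value covering `quantile` of samples.
        rank = max(1, math.ceil(quantile * len(values)))
        limits[p] = values[rank - 1]
    return limits


def evaluate(usage, phases, limits):
    """Return (fraction of steps where demand is met, total provisioned)."""
    met = sum(1 for u, p in zip(usage, phases) if u <= limits[p])
    provisioned = sum(limits[p] for p in phases)
    return met / len(usage), provisioned


# Toy CPU trace (in cores) with hypothetical phase labels.
usage = [0.5, 0.6, 0.5, 3.8, 4.0, 3.9, 4.1, 1.0, 1.1, 0.9]
phases = ["load"] * 3 + ["train"] * 4 + ["save"] * 3

limits = phase_limits(usage, phases, quantile=1.0)
fulfilled, dynamic_total = evaluate(usage, phases, limits)

# Baseline: one peak-sized allocation held for the whole run.
static_total = max(usage) * len(usage)
```

In this toy trace the per-phase allocation covers every sample while provisioning roughly half of what the peak-sized baseline does. Lowering `quantile` trades some unmet demand for a smaller allocation, which is the kind of trade-off the abstract's 95% fulfillment figure refers to.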

BibTeX
@inproceedings {254114,
author = {Josep L. Berral and Chen Wang and Alaa Youssef},
title = {AI4DL: Mining Behaviors of Deep Learning Workloads for Resource Management},
booktitle = {12th {USENIX} Workshop on Hot Topics in Cloud Computing (HotCloud 20)},
year = {2020},
url = {https://www.usenix.org/conference/hotcloud20/presentation/berral},
publisher = {{USENIX} Association},
month = jul,
}
