Rammer: Enabling Holistic Deep Learning Compiler Optimizations with {rTasks}

Lingxiao Ma; Zhiqiang Xie; Zhi Yang; Jilong Xue; Youshan Miao; Wei Cui; Wenxiang Hu; Fan Yang; Lintao Zhang; Lidong Zhou

Lingxiao Ma, Peking University and Microsoft Research; Zhiqiang Xie, ShanghaiTech University and Microsoft Research; Zhi Yang, Peking University; Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou, Microsoft Research

Performing Deep Neural Network (DNN) computation on hardware accelerators efficiently is challenging. Existing DNN frameworks and compilers often treat the DNN operators in a data flow graph (DFG) as opaque library functions and schedule them onto accelerators to be executed individually. They rely on another layer of scheduler, often implemented in hardware, to exploit the parallelism available in the operators. Such a two-layered approach incurs significant scheduling overhead and often cannot fully utilize the available hardware resources. In this paper, we propose Rammer, a DNN compiler design that optimizes the execution of DNN workloads on massively parallel accelerators. Rammer generates an efficient static spatio-temporal schedule for a DNN at compile time to minimize scheduling overhead. It maximizes hardware utilization by holistically exploiting parallelism through inter- and intra- operator co-scheduling. Rammer achieves this by proposing several novel, hardware neutral, and clean abstractions for the computation tasks and the hardware accelerators. These abstractions expose a much richer scheduling space to Rammer, which employs several heuristics to explore this space and finds efficient schedules. We implement Rammer for multiple hardware backends such as NVIDIA GPUs, AMD GPUs, and Graphcore IPU. Experiments show Rammer significantly outperforms state-of-the-art compilers such as TensorFlow XLA and TVM by up to 20.1X. It also outperforms TensorRT, a vendor optimized proprietary DNN inference library from NVIDIA, by up to 3.1X.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@inproceedings {258921,
author = {Lingxiao Ma and Zhiqiang Xie and Zhi Yang and Jilong Xue and Youshan Miao and Wei Cui and Wenxiang Hu and Fan Yang and Lintao Zhang and Lidong Zhou},
title = {Rammer: Enabling Holistic Deep Learning Compiler Optimizations with {rTasks}},
booktitle = {14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)},
year = {2020},
isbn = {978-1-939133-19-9},
pages = {881--897},
url = {https://www.usenix.org/conference/osdi20/presentation/ma},
publisher = {USENIX Association},
month = nov
}

Download

Ma PDF

View the slides

Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks

Open Access Media

Presentation Video