ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

Hongyu Zhu; Ruofan Wu; Yijia Diao; Shanbin Ke; Haoyu Li; Chen Zhang; Jilong Xue; Lingxiao Ma; Yuqing Xia; Wei Cui; Fan Yang; Mao Yang; Lidong Zhou; Asaf Cidon; Gennady Pekhimenko

Hongyu Zhu, University of Toronto and Microsoft Research; Ruofan Wu, Renmin University of China and Microsoft Research; Yijia Diao, Shanghai Jiao Tong University and Microsoft Research; Shanbin Ke, UCSD and Microsoft Research; Haoyu Li, Columbia University and Microsoft Research; Chen Zhang, Tsinghua University and Microsoft Research; Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, and Lidong Zhou, Microsoft Research; Asaf Cidon, Columbia University; Gennady Pekhimenko, University of Toronto

Despite recent advances in tensor compilers, it often takes hours to generate an efficient kernel for an operator, a compute-intensive sub-task in a deep neural network (DNN), on various accelerators (e.g., GPUs). This significantly slows down DNN development cycles and places a heavy burden on the development of general kernel libraries and custom kernels, especially for new hardware vendors. The slow compilation process is due to the large search space formulated by existing DNN compilers, which have to use machine learning algorithms to find good solutions.

In this paper, we present ROLLER, which takes a different, construction-based approach to generating kernels. At the core of ROLLER is rTile, a new tile abstraction that encapsulates tensor shapes that align with the key features of the underlying accelerator, thus achieving efficient execution by limiting the shape choices. ROLLER then adopts a recursive rTile-based construction algorithm to generate rTile-based programs (rProgram), whose performance can be evaluated efficiently with a micro-performance model, without running them on a real device. As a result, ROLLER can generate efficient kernels in seconds, with performance comparable to state-of-the-art solutions on popular accelerators like GPUs, while offering better kernels on less mature accelerators like IPUs.
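
The construction-based idea can be illustrated with a toy sketch. It is not ROLLER's actual algorithm or API: the `ToyDevice` numbers, the function names, and the cost formula are all assumptions made up for illustration. The sketch only shows the general pattern the abstract describes: restrict tile shapes to hardware-aligned sizes, grow a tile step by step, and rank candidates with an analytic cost model instead of running them on a device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDevice:
    # Hypothetical values for illustration only, not real hardware specs.
    align: int = 32                  # tile dimensions must be multiples of this
    fast_mem_bytes: int = 48 * 1024  # capacity of the fast memory level
    bytes_per_elem: int = 4          # e.g. 32-bit floats


def tile_footprint(tm, tn, tk, dev):
    """Bytes needed to hold one A tile (tm x tk), one B tile (tk x tn),
    and one C tile (tm x tn) in fast memory."""
    return (tm * tk + tk * tn + tm * tn) * dev.bytes_per_elem


def traffic_per_mac(tm, tn, tk, dev):
    """Assumed analytic cost: bytes loaded from slow memory per multiply-add.
    Lower is better; nothing is executed or measured."""
    loaded = (tm * tk + tk * tn) * dev.bytes_per_elem
    return loaded / (tm * tn * tk)


def construct_tile(m, n, k, dev):
    """Greedily grow an aligned tile for an (m x k) @ (k x n) matmul.

    Starts from the smallest aligned tile and repeatedly enlarges one axis
    by `dev.align`, keeping the candidate with the lowest modeled cost that
    still fits in fast memory. Stops when no candidate improves the cost.
    """
    tile = (dev.align, dev.align, dev.align)
    limits = (m, n, k)
    while True:
        best = None
        for axis in range(3):
            cand = list(tile)
            cand[axis] += dev.align
            if cand[axis] > limits[axis]:
                continue  # tile may not exceed the problem size
            if tile_footprint(*cand, dev) > dev.fast_mem_bytes:
                continue  # tile must fit in fast memory
            cost = traffic_per_mac(*cand, dev)
            if best is None or cost < best[0]:
                best = (cost, tuple(cand))
        if best is None or best[0] >= traffic_per_mac(*tile, dev):
            return tile
        tile = best[1]


if __name__ == "__main__":
    dev = ToyDevice()
    print(construct_tile(1024, 1024, 1024, dev))
```

Because every candidate is scored by a closed-form expression, the search visits only a handful of aligned shapes and needs no profiling runs, which is the property the abstract attributes to the construction-based approach.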

BibTeX

@inproceedings {280896,
author = {Hongyu Zhu and Ruofan Wu and Yijia Diao and Shanbin Ke and Haoyu Li and Chen Zhang and Jilong Xue and Lingxiao Ma and Yuqing Xia and Wei Cui and Fan Yang and Mao Yang and Lidong Zhou and Asaf Cidon and Gennady Pekhimenko},
title = {{ROLLER}: Fast and Efficient Tensor Compilation for Deep Learning},
booktitle = {16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)},
year = {2022},
isbn = {978-1-939133-28-1},
address = {Carlsbad, CA},
pages = {233--248},
url = {https://www.usenix.org/conference/osdi22/presentation/zhu},
publisher = {USENIX Association},
month = jul
}
