QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach

Shouyang Dong, University of Science and Technology of China, Cambricon Technologies, and Institute of Computing Technology, Chinese Academy of Sciences; Yuanbo Wen, Jun Bi, Di Huang, and Jiaming Guo, Institute of Computing Technology, Chinese Academy of Sciences; Jianxing Xu and Ruibai Xu, University of Science and Technology of China, Cambricon Technologies, and Institute of Computing Technology, Chinese Academy of Sciences; Xinkai Song and Yifan Hao, Institute of Computing Technology, Chinese Academy of Sciences; Ling Li, Institute of Software, Chinese Academy of Sciences, and University of Chinese Academy of Sciences; Xuehai Zhou, University of Science and Technology of China; Tianshi Chen, Cambricon Technologies; Qi Guo, Institute of Computing Technology, Chinese Academy of Sciences; Yunji Chen, Institute of Computing Technology, Chinese Academy of Sciences, and University of Chinese Academy of Sciences

Heterogeneous deep learning systems (DLS) such as GPUs and ASICs have been widely deployed in industrial data centers, which requires developing multiple low-level tensor programs for different platforms. An attractive solution to relieve the programming burden is to transcompile the legacy code of one platform to others. However, current transcompilation techniques struggle with either tremendous manual effort or functional incorrectness, leaving “Write Once, Run Anywhere” for tensor programs an open question.

We propose a novel transcompiler, QiMeng-Xpiler, for automatically translating tensor programs across DLS via both large language models (LLMs) and symbolic program synthesis, i.e., neural-symbolic synthesis. The key insight is to leverage the powerful code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable. Concretely, we propose multiple LLM-assisted compilation passes driven by pre-defined meta-prompts for program transformation. During each program transformation, efficient symbolic program synthesis is employed to repair incorrect code snippets of limited scale. To attain high performance, we propose a hierarchical auto-tuning approach that systematically explores both the parameters and the sequences of transformation passes. Experiments on 4 DLS with distinct programming interfaces, i.e., Intel DL Boost with VNNI, NVIDIA GPU with CUDA, AMD MI with HIP, and Cambricon MLU with BANG, demonstrate that QiMeng-Xpiler correctly translates different tensor programs with an average accuracy of 95%.
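
The propose-then-repair idea described above can be illustrated with a toy sketch. This is not QiMeng-Xpiler's implementation: the function names (`llm_propose`, `symbolic_repair`, `transcompile_step`), the tiled-loop example, and the search domain are all hypothetical, and the LLM call is replaced by a stub that returns a deliberately buggy guess. The sketch only shows why a neural proposal can shrink the symbolic search to a few small "holes" that are checked against unit tests.

```python
from itertools import product

def reference(xs):
    """Semantics of the source program: elementwise doubling."""
    return [2 * x for x in xs]

def make_tiled(tile, stride):
    """Candidate target program with two holes: tile size and loop stride."""
    def run(xs):
        out = [0] * len(xs)
        for start in range(0, len(xs), stride):
            for i in range(start, min(start + tile, len(xs))):
                out[i] = 2 * xs[i]
        return out
    return run

def llm_propose():
    """Stub standing in for an LLM-assisted pass (assumption: no real model is called).
    The guess is plausible but wrong: stride > tile skips elements."""
    return {"tile": 4, "stride": 8}

def passes_tests(program, tests):
    """Unit-test check of a candidate against the source semantics."""
    return all(program(t) == reference(t) for t in tests)

def symbolic_repair(tests, domain=(1, 2, 4, 8)):
    """Enumerative search over the small hole space left by the proposal."""
    for tile, stride in product(domain, repeat=2):
        if passes_tests(make_tiled(tile, stride), tests):
            return {"tile": tile, "stride": stride}
    return None

def transcompile_step(tests):
    """One pass: accept the neural proposal if it passes, else repair it symbolically."""
    guess = llm_propose()
    if passes_tests(make_tiled(**guess), tests):
        return guess
    return symbolic_repair(tests)

if __name__ == "__main__":
    tests = [list(range(10)), [], [5], list(range(1, 17))]
    print(transcompile_step(tests))
```

Here the search space is only 16 candidates because the proposal already fixes the program's structure. Under the same assumption, a search over whole programs would be far larger, which is the tractability argument the abstract makes.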

OSDI '25 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {308724,
author = {Shouyang Dong and Yuanbo Wen and Jun Bi and Di Huang and Jiaming Guo and Jianxing Xu and Ruibai Xu and Xinkai Song and Yifan Hao and Ling Li and Xuehai Zhou and Tianshi Chen and Qi Guo and Yunji Chen},
title = {{QiMeng-Xpiler}: Transcompiling Tensor Programs for Deep Learning Systems with a {Neural-Symbolic} Approach},
booktitle = {19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25)},
year = {2025},
isbn = {978-1-939133-47-2},
address = {Boston, MA},
pages = {239--255},
url = {https://www.usenix.org/conference/osdi25/presentation/dong},
publisher = {USENIX Association},
month = jul
}
