Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training

Authors: 

Jie Sun, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China; Li Su, Alibaba Group; Zuocheng Shi, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China; Wenting Shen, Alibaba Group; Zeke Wang, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China; Lei Wang, Alibaba Group; Jie Zhang, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China; Yong Li, Wenyuan Yu, and Jingren Zhou, Alibaba Group; Fei Wu, Collaborative Innovation Center of Artificial Intelligence, Zhejiang University, China and Shanghai Institute for Advanced Study of Zhejiang University, China

Abstract: 

Graph neural networks (GNNs) have been widely applied in real-world applications, such as product recommendation in e-commerce platforms and risk control in financial management systems. Several cache-based GNN systems have been built to accelerate GNN training on a single machine with multiple GPUs. However, these systems fail to train billion-scale graphs efficiently, which is a common challenge in the industry. In this work, we propose Legion, a system that automatically pushes the envelope of multi-GPU systems for accelerating billion-scale GNN training. First, we design a hierarchical graph partitioning mechanism that significantly improves the multi-GPU cache performance. Second, we build a unified multi-GPU cache that helps to reduce the PCIe traffic incurred by accessing both graph topology and features. Third, we develop an automatic cache management mechanism that adapts the multi-GPU cache plan according to the hardware specifications to maximize the overall training throughput. Evaluations on various GNN models and multiple datasets show that Legion supports training billion-scale GNNs on a single machine and significantly outperforms the state-of-the-art cache-based systems.

USENIX ATC '23 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {288719,
author = {Jie Sun and Li Su and Zuocheng Shi and Wenting Shen and Zeke Wang and Lei Wang and Jie Zhang and Yong Li and Wenyuan Yu and Jingren Zhou and Fei Wu},
title = {Legion: Automatically Pushing the Envelope of {Multi-GPU} System for {Billion-Scale} {GNN} Training},
booktitle = {2023 USENIX Annual Technical Conference (USENIX ATC 23)},
year = {2023},
isbn = {978-1-939133-35-9},
address = {Boston, MA},
pages = {165--179},
url = {https://www.usenix.org/conference/atc23/presentation/sun},
publisher = {USENIX Association},
month = jul
}
