Teaching The Old Dog New Tricks: Building Efficient Data Pipelines for Large-Scale LLM Pre-training

Luofan Chen and Chenhan Wang, University of Science and Technology of China; W. Zhang, Jinxin Chi, Hequan Zhang, Zanbo Wang, Chenyuan Wang, Lishu Luo, Sijin Wu, J. Hu, Jun Wang, and Cheng Chen, ByteDance Seed; Lixin Huang, Liyang Zhao, Yong Tian, and Jun Guo, ByteDance; Youhui Bai, University of Science and Technology of China; Wencong Xiao, ByteDance Seed; Kang Chen, Tsinghua University; Cheng Li, University of Science and Technology of China

Category: Operational Systems Paper