Understanding the Workload Characteristics of Large Language Model Development

March 19, 2024

Deployed System

Authors:

Article shepherded by:

Rik Farrow

Large Language Models (LLMs) have presented impressive performance across several transformative tasks, such as chatbot and code generation. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs, often riddled with numerous challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. A thorough analysis of cluster workloads is essential for comprehending challenges and uncovering opportunities in designing systems tailored for LLMs.

To this end, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme of Shanghai AI Laboratory. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures.

Background: The Paradigm Transition in Model Development

Prior DL workloads generally follow a task-specific paradigm that trains the model on domain-specific data to tackle a particular task (e.g., translation). In contrast, LLMs follow an emerging paradigm that performs self-supervised training on broad data to generate a foundation model and further adapts to a wide range of downstream tasks. This shift signifies a substantial divergence in the model development pipeline and workload characteristics from prior DL workloads. Additionally, there are more unique characteristics and requirements of LLMs.

1. Unified Model Architecture. Prior DL workloads usually employ various model architectures (e.g., CNN, LSTM, GNN) to address diverse tasks. In contrast, LLMs commonly embrace the Transformer [1] architecture, like OpenAI GPT and Meta LLaMA. The architectural homogeneity suggests a high level of uniformity in the LLM development pipeline and similarity across different datacenters.

2. Requirements of Tailored Software Stack. To accommodate the enormous model size of LLMs, a series of systems implement advanced techniques to optimize the execution of LLMs. For instance, Microsoft DeepSpeed [2] and NVIDIA MegatronLM [3] accelerate the training via hybrid parallelism or state-sharding optimizer. As for model serving, vLLM [4] improves throughput via iteration scheduling and memory management.

Workloads of LLM Development Pipeline

The development of LLMs necessitates the use of extensive computational infrastructure due to their substantial model size (comprising billions of parameters) and the vast amount of training data involved. Figure 1 depicts the comprehensive LLM development pipeline, encompassing five distinct stages (blue blocks) that span from scratch to service (follow blue arrows). The grey circular arrow indicates that the pretraining stage enables periodical alignment and evaluation to assess intermediate models and adjust configuration on the fly. We explain each stage as follows:

Figure 1: Overview of the LLM development pipeline.

Data Preparation. The initial stage involves gathering and preprocessing the training data, which can be categorized into two parts: (1) pretraining data, consisting of extensive unlabeled corpora obtained from public or private sources and curated through processes like detoxification and deduplication; (2) alignment data, comprising a smaller set of high-quality labeled corpora used to align the model with specific tasks. This data is typically acquired through expensive human annotation or labeling. Besides, all the data must be tokenized to ensure compatibility with the model's input.
Pretraining. It involves self-supervised training on large-scale curated data, demanding a majority of resources within the overall development workflow. Training LLMs efficiently at scale necessitates various system innovations, such as state-sharding optimizers [2], meticulous model placement using data, pipeline, and tensor parallelisms [3].
Alignment. This stage aims to adapt LLMs with user intent on a wide range of downstream tasks. Two primary aligning paradigms are commonly used: (1) prompt engineering, specifying prompts (i.e., inputs) without modifying model parameters. For example, in text summarization, appending a prompt “TL; DR” to the input article can improve model performance; (2) fine-tuning, updating model parameters on a task-specific dataset to improve performance in a particular domain.
Evaluation. Given the vast application scenarios of LLM, it may be inaccurate to assess model quality solely based on a single metric like training loss. There are numerous factors to consider, such as accuracy, fairness, and toxicity. Consequently, it is crucial to account for a diverse set of criteria and measure performance across multiple tasks. Furthermore, regular evaluation is essential during the pretraining stage to provide timely feedback on model quality.
Deployment. To meet the strict cost and latency constraints of LLM applications, several advanced techniques have been developed to achieve efficient model serving, including quantization, distillation, CUDA kernel optimization, model parallelism and memory management [4].

LLMs versus Prior DL Workloads

Developing LLMs is closely intertwined with the support of GPU clusters in various aspects. However, many conclusions and implications from existing DL workload analysis work [5], [6], [7], conducted before the rise of LLMs, are not applicable to LLM development. Consequently, conducting a comprehensive analysis of cluster workloads becomes crucial for understanding the unique challenges and identifying potential opportunities in designing systems optimized for LLMs.

In this work, we share our operational experiences in the datacenter Acme of Shanghai AI Laboratory. It houses two distinct clusters, Seren and Kalos, dedicated to LLM development and equipped with 4,704 A100 GPUs in total. Table 1 summarizes the configurations of these two homogeneous LLM clusters. Each node is equipped with 8x NVIDIA A100-SXM 80GB GPUs, which are interconnected by NVLink.

Cluster	#CPUs	GPUs	Mem(GB)	Network	#Nodes	Total #GPUs
Seren	128 Cores	8 x A100	1,024	1 x 200Gb/s	286	2,288
Kalos	128 Cores	8 x A100	2,048	5 x 200Gb/s	302	2,416

Table 1: Summary of per-node specification and cluster scale for two independent LLM clusters in Acme.

Our characterization study is based on traces collected from these two LLM clusters, spanning 6 months from March to August 2023. Note that Acme does not involve any serving jobs (i.e., workloads in the deployment stage), as our LLMs are deployed on a separate cluster. We compare our trace with prior DL traces, including Microsoft Philly [5], SenseTime Helios [6], Alibaba PAI [7]. Unlike Acme, which is solely dedicated to LLM development, these datacenters encompass a mixture of general DL workloads from various domains. We highlight several key findings from our analysis.

Figure 2: Overview of different cluster characteristics. (a) Workload: CDF of the GPU job duration. (b) Infrastructure: CDF of GPU utilization.

1. Shorter Job Duration. As shown in Figure 2 (a), contrary to the prevailing stereotype that LLM-related jobs are typically long-running, we find the workloads in our clusters (blue and orange lines) exhibit shorter GPU job durations (i.e., job runtime, excluding queuing delay) compared to the DL workloads observed in previous job traces (dotted lines). Specifically, both the Seren and Kalos have a median job duration of 2 minutes, which is 1.7~7.2x shorter than the median job durations of other clusters.

To provide an explanation for this observation, we outline several potential factors:

(1) Advancements in hardware. The evolution of GPU and networking delivers substantial efficiency improvement. Moreover, there is a trend of users requesting more GPU resources than what was typical in previous clusters, which can markedly speed up the training process.

(2) Extensive associated workloads. LLM development pipeline involves numerous small-scale associated jobs, such as evaluation. We will delve into this aspect later.

(3) High rate of incompletion. Upon examining the final statuses of jobs, as depicted in Figure 3, we find that approximately 40% of jobs fail, with completed jobs consuming only 20~30% of GPU resources. This underscores the urgent need for a fault-tolerant system.

Figure 3: Final statuses of jobs in terms of (a) quantity and (b) utilized GPU resources.

2. Polarized GPU Utilization. Figure 2 (b) shows cluster-wide GPU utilization distributions across various clusters. It is evident that the GPU utilization in our two clusters exhibits a polarized pattern, primarily concentrated in two distinct states: 0% and 100%. This polarization mainly stems from the fact that the workloads in our clusters share similar model architectures, i.e., transformer-based LLMs. In contrast, Philly and PAI encompass a broader range of utilization. Besides, when comparing the median GPU utilization, Seren and Kalos exhibit significantly higher values at 97% and 99%, respectively, in contrast to 48% and 4% observed in Philly and PAI. This observation aligns with the common understanding that LLMs are computationally intensive. It also implies that GPU-sharing-based scheduling techniques may not be suitable for LLM development.

Analysis of LLM Workload Categories

To strive for a deeper understanding of the characteristics of different workloads in the LLM development pipeline, we further categorize jobs into specific types according to their production division and metadata.

Figure 4: Distribution of different workload types in Seren (a, b) and Kalos (c, d).

1. Highly-skewed Workload Distribution. Figure 4 presents the distribution of job counts and GPU time across various workload types, where only Seren contains SFT and MLLM workloads (SFT: Supervised Fine-Tuning for model alignment. MLLM: Multimodal Large Language Model. Other: Unclassified jobs). Note that CPU jobs are excluded. Besides, MLLM jobs incorporate their own development pipeline (e.g., pretraining) and adopt smaller model scales for exploration purposes. Our analysis primarily focuses on LLM jobs. It is obvious that evaluation jobs constitute the majority of the total job count in both clusters, yet they consume a relatively small portion of resources (0.8% in Kalos). In contrast, pretraining jobs only account for 0.9% and 3.2% of the total job count but consume 69.5% and 94.0% of the total GPU time in Seren and Kalos respectively.

Figure 5: CDF of GPU job duration and queuing delay for different workload types in Seren (a, b) and Kalos (c, d).

2. Similar Temporal Distribution. Figure 5 shows the distribution of job duration and queuing delay across different workloads. In terms of job duration, although pretraining jobs have the longest duration, they surpass other workloads within an order of magnitude in the median, and less than 5% of jobs last for over 1 day in both clusters. This can be attributed to frequent failures during pretraining.

Regarding job queuing delay, contrary to previous reports [5], [6], [7] suggesting that larger-scale jobs experience longer wait times, we observe that evaluation jobs have the longest queuing delay despite having the lowest GPU demands and shortest job duration. This discrepancy is due to the majority of resources being reserved for pretraining jobs to minimize their queuing delays. Evaluation jobs are typically submitted as a batch simultaneously with lower priority, utilizing the limited spare resources.

Our System Evolution for Pretraining Workload

We further conduct fine-grained analysis for pretraining jobs, as they are the most resource-intensive workloads. To enhance training efficiency, our pretraining framework, InternEvo [8], undergoes continuous refinement and iteration in its system design. As presented in Figure 6, the initial version of InternEvo (adopted by our early jobs) is denoted as (a) primarily utilizes 3D parallelism akin to that of NVIDIA MegatronLM [3], and (b) employs a hierarchical ZeRO mechanism [8] that implements selective redundant sharding of model states, achieving optimal trade-off between communication overhead and GPU memory footprint.

Figure 6: GPU SM utilization of pretraining a 123B LLM using different strategies of InternEvo over 2048 GPUs.

To provide a detailed example, we profile an LLM with 123 billion parameters across 2048 GPUs. Figure 5 illustrates the GPU SM (Streaming Multiprocessor) utilization for the same LLM under various training strategies. Both versions maintain the same global batch size and are optimized according to their respective configurations. It is evident that InternEvo V2 presents superior peak SM utilization and exhibits reduced idle periods compared to InternEvo V1, achieving around 16% acceleration. The relatively low utilization of 3D parallelism is mainly due to the impact of communication introduced by hybrid parallelism on the critical path, such as bubbles in pipeline parallelism. Note that the different inter- and intra-node communication hardware settings may lead to different optimal configurations.

Failure Analysis

To understand the root cause of failures, we conduct a comprehensive analysis of job errors, primarily relying on runtime logs and hardware monitor data from our two clusters.

Basically, they can be classified into three categories as follows. Note that these classifications may overlap, and the primary criterion for classifying a specific type of error is its most frequent occurrence.

Infrastructure. Infrastructure-related failures arise from issues within the underlying computational platform or remote storage.

Framework. Framework errors can be associated with tensor operations, shapes, data types, or unexpected behaviors. They are often observed in the initial phases of jobs and are typically resolved by fixing the configurations.
Script. Script errors typically stem from programming errors or user oversights. They constitute the majority of failures and are often addressed by revising codes.

Category	Reason	Num	Avg. GPU Demands	Avg. Time to Failure (mins)	Total %
Infrastructure	NVLinkError	54	800	868.1	30.25%
	CUDAError	21	847	923.2	15.77%
	NodeFailure	16	712	1288.8	14.30%
	ECCError	12	680	1303.4	11.00%
	NetworkError	12	758	549.6	4.53%
	ConnectionError	147	29	51.9	3.44%
	S3StorageError	10	422	2317.8	2.12%
	NCCLTimeoutError	6	596	159.7	0.50%
	NCCLRemoteError	3	1152	50.5	0.15%
Framework	DataloaderKilled	6	445	1580.6	4.38%
	AttributeError	67	228	67.8	3.90%
	OutOfMemoryError	14	572	323.8	3.28%
	RuntimeError	65	441	66.4	1.72%
	AssertionError	105	413	41.7	1.24%
	ValueError	33	387	9.9	0.16%
	ZeroDivisionError	5	499	14.5	0.03%
	ModelLoadingError	104	8	2.6	0.00%
	DatasetLoadingError	5	1	1.6	0.00%
Script	FileNotFoundError	568	21	14.2	2.83%
	OSError	266	8	9.6	0.28%
	TypeError	620	18	0.9	0.06%
	NameError	18	247	3.2	0.02%
	PermissionError	7	438	4.3	0.01%
	ImportError	111	93	1.1	0.01%
	KeyError	260	7	3	0.01%
	SyntaxError	10	391	0.7	0.00%
	ArgumentError	3	344	0.7	0.00%
	CalledProcessError	4	256	0.2	0.00%
	IndexError	23	6	1.6	0.00%

Table 2: Job failure statistics. It is sorted based on Total% (i.e., the percentage of GPU time summation in different categories). Num: Number of Occurrences.

We highlight several key observations from our failure analysis.

1. Infrastructure Failures Cause Most Severe Impact. As shown in Table 2, jobs that fail because of infrastructure issues often use a substantial number of GPUs (GPU Demand). They take over 82% GPU resources (GPU Time) with only 11% failed job quantity (Num). Most of these jobs are long-term pretraining tasks that can experience hardware failures multiple times, such as issues with GPU (e.g., CUDAError, ECCError), NVLink, and network system (e.g., NCCLRemoteError, S3StorageError). Addressing these infrastructure failures requires meticulous diagnostic efforts to pinpoint the source of the problems, often leading to the maintenance or replacement of defective hardware, which results in significant restart costs.

2. Failures Caused by High Temperature. Another noteworthy observation is that training 7B models in Kalos tends to result in GPU overheating, which can cause NVLinkError or ECCError. This phenomenon is largely due to the highly optimized communication cost, resulting in an exceptionally low GPU idle rate. We observe that the overall temperature in the cluster server room increased by approximately 5°C when training these models. Besides, we find most of these jobs occurred in July 2023, which is the hottest month on record. This anomalous climate may be a potential cause of these failures, which is aligned with the finding recently reported by Microsoft [9]. Subsequently, our team enhanced the cooling capabilities of the cluster, leading to a significant reduction in the frequency of such failures.

Conclusion

In summary, we analyze LLM workloads and resource utilization in our datacenter Acme, revealing unique features and challenges of LLM development, such as resource inefficiencies and failure impacts. We believe that our lessons and insights have broad applicability and can well benefit subsequent research.

For more information on this work, such as our system optimization for pretraining and evaluation workloads, please refer to our publication: ‘Characterization of Large Language Model Development in the Datacenter’, in USENIX Symposium on Networked Systems Design and Implementation (NSDI) 2024 [10, Paper].

Appendix

References:

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, NeurIPS ’17, 2017.

[2] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20, 2020.

[3] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, 2021.

[4] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, SOSP ’23, 2023.

[5] Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In 2019 USENIX Annual Technical Conference, USENIX ATC ’19, 2019.

[6] Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, and Tianwei Zhang. Characterization and prediction of deep learning workloads in large-scale gpu datacenters. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, 2021.

[7] Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, ChengWang, Jian He, Yong Li, Liping Zhang,Wei Lin, and Yu Ding. MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters. In 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI ’22, 2022.

[8] Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, and Peng Sun. Internevo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding. arXiv, abs/2401.09149, 2024.

[9] Yifan Xiong, Yuting Jiang, Ziyue Yang, Lei Qu, Guoshuai Zhao, Shuguang Liu, Dong Zhong, Boris Pinzur, Jie Zhang, Yang Wang, Jithin Jose, Hossein Pourreza, Jeff Baxter, Kushal Datta, Prabhat Ram, Luke Melton, Joe Chau, Peng Cheng, Yongqiang Xiong, and Lidong Zhou. Anubis: Towards reliable cloud ai infrastructure via proactive validation. arXiv, abs/2402.06194, 2024.

[10] Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang. Characterization of large language model development in the datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI ’24, 2024.

Article Categories:

AI/ML

Last updated March 19, 2024

Authors:

Qinghao Hu is a research assistant professor at Nanyang Technological University, Singapore. His research focuses on machine learning systems and datacenter scheduling.

[email protected]

Peng Sun is a senior research scientist in Shanghai AI Laboratory and SenseTime. His research interests include cloud computing, computer networking, data center, big data and large-scale cluster computing systems for machine learning.

[email protected]

Tianwei Zhang is an assistant professor at Nanyang Technological University. His research focuses on computer system security. He is particularly interested in security threats and defenses in machine learning systems, autonomous systems, computer architecture and distributed systems.

[email protected]