Toppings: CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference

Suyi Li; Hanfeng Lu; Tianyuan Wu; Minchen Yu; Qizhen Weng; Xusheng Chen; Yizhou Shan; Binhang Yuan; Wei Wang

Suyi Li, Hanfeng Lu, and Tianyuan Wu, Hong Kong University of Science and Technology; Minchen Yu, The Chinese University of Hong Kong, Shenzhen; Qizhen Weng, TeleAI, China Telecom; Xusheng Chen and Yizhou Shan, Huawei Cloud; Binhang Yuan and Wei Wang, Hong Kong University of Science and Technology

Low-Rank Adaptation (LoRA) is a popular approach that adapts a base large language model (LLM) to domain-specific tasks by adding lightweight trainable adapters. In this paper, we present Toppings, a system that efficiently serves many LoRA adapters derived from a common base model. Toppings pins the base model on GPUs and dynamically loads the requested LoRA adapters from host memory as new requests arrive. Loading an adapter onto the GPU incurs high overhead: it delays the time-to-first-token of the newly arrived request and, when continuous batching is in use, also interrupts the ongoing decoding of all in-flight queries. To address this, Toppings proposes a CPU-assisted LoRA serving approach. While the requested LoRA adapter is being loaded onto the GPUs, Toppings uses CPUs to compute the lightweight adaptation for prefilling; once loading completes, it switches to the GPUs for the remaining computation. Toppings develops a highly optimized synchronization mechanism and a pipelined loading scheme to efficiently coordinate LoRA computation on the CPUs and GPUs. Toppings further designs a rank-aware scheduling algorithm that optimally schedules heterogeneous LoRA requests to maximize SLO attainment. Compared with state-of-the-art LoRA serving systems, Toppings improves the average request serving latency by up to 1.7× and achieves an SLO attainment of up to 99%.
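The CPU-assisted idea relies on a general property of LoRA: the adapter's contribution is an additive low-rank term that can be computed separately from the base model's output and summed afterwards. The sketch below illustrates only that standard LoRA arithmetic in plain Python. It is not the Toppings implementation, and the names `matvec`, `lora_delta`, and `forward`, along with the toy matrices, are hypothetical choices made for illustration.

```python
def matvec(m, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_delta(A, B, x, scale=1.0):
    """Adapter path: scale * B (A x), where A is r x d and B is d x r."""
    return [scale * d for d in matvec(B, matvec(A, x))]


def forward(W, A, B, x, scale=1.0):
    """Base path plus adapter path; the two terms are independent."""
    base = matvec(W, x)                  # needs only the base weights W
    delta = lora_delta(A, B, x, scale)   # needs only the adapter (A, B)
    return [b + d for b, d in zip(base, delta)]


# Toy example (assumed values): hidden size d = 3, adapter rank r = 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
A = [[1.0, 2.0, 0.0]]        # r x d
B = [[0.5], [0.0], [1.0]]    # d x r
x = [1.0, 1.0, 1.0]

y = forward(W, A, B, x)
```

Because `lora_delta` never touches `W`, the adapter term could in principle be evaluated on a different device from the base model, which is the property the abstract's CPU-assisted prefilling builds on. The cost of the adapter path grows with the rank r, which suggests why requests with different ranks are heterogeneous from a scheduler's point of view.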

BibTeX

@inproceedings {308486,
author = {Suyi Li and Hanfeng Lu and Tianyuan Wu and Minchen Yu and Qizhen Weng and Xusheng Chen and Yizhou Shan and Binhang Yuan and Wei Wang},
title = {Toppings: {CPU-Assisted}, {Rank-Aware} Adapter Serving for {LLM} Inference},
booktitle = {2025 USENIX Annual Technical Conference (USENIX ATC 25)},
year = {2025},
isbn = {978-1-939133-48-9},
address = {Boston, MA},
pages = {613--629},
url = {https://www.usenix.org/conference/atc25/presentation/li-suyi-toppings},
publisher = {USENIX Association},
month = jul
}
