Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference

Minchen Yu; Ao Wang; Dong Chen; Haoxuan Yu; Xiaonan Luo; Zhuohao Li; Wei Wang; Ruichuan Chen; Dapeng Nie; Haoran Yang; Yu Ding

Minchen Yu, The Chinese University of Hong Kong, Shenzhen; and Hong Kong University of Science and Technology; Ao Wang, Alibaba Group; Dong Chen, Haoxuan Yu, Xiaonan Luo, Zhuohao Li, and Wei Wang, Hong Kong University of Science and Technology; Ruichuan Chen, Nokia Bell Labs; Dapeng Nie, Haoran Yang, and Yu Ding, Alibaba Group

Serverless computing offers a compelling cloud model for online inference services. However, existing serverless platforms lack efficient support for GPUs, hindering their ability to deliver high-performance inference. In this paper, we present Torpor, a serverless platform for GPU-efficient, low-latency inference. To enable efficient sharing of a node’s GPUs among numerous inference functions, Torpor maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding with model swapping). Torpor uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to minimize latency overhead caused by model swapping. Additionally, we design an interference-aware request scheduling algorithm that utilizes high-speed GPU interconnects to meet latency service-level objectives (SLOs) for individual inference functions. We have implemented Torpor and evaluated its performance in a production environment. Utilizing late binding and model swapping, Torpor can concurrently serve hundreds of inference functions on a worker node with 4 GPUs, while achieving latency performance comparable to native execution, where each model is cached exclusively on a GPU. Pilot deployment in a leading commercial serverless cloud shows that Torpor reduces the GPU provisioning cost by 70% and 65% for users and the platform, respectively.
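The late-binding idea in the abstract (models stay in main memory and are swapped onto a GPU only when a request arrives) can be illustrated with a toy simulation. This is a minimal sketch, not Torpor's implementation: the class `GpuModelCache`, the method `serve`, the model names and sizes, and the least-recently-used eviction policy are all assumptions made for illustration. The abstract does not state which eviction policy Torpor uses.

```python
from collections import OrderedDict


class GpuModelCache:
    """Toy model of one GPU's memory (hypothetical, for illustration only).

    Models are assumed to live in host memory at all times; this class only
    tracks which of them currently occupy GPU memory.
    """

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        # Model name -> size in MB, ordered from least to most recently used.
        self.resident = OrderedDict()

    def used_mb(self):
        return sum(self.resident.values())

    def serve(self, name, size_mb):
        """Bind a request to this GPU, swapping the model in if needed.

        Returns "hit" if the model was already resident, else "swap".
        """
        if size_mb > self.capacity_mb:
            raise ValueError(f"model {name!r} does not fit on this GPU")
        if name in self.resident:
            # Already on the GPU: just refresh its recency.
            self.resident.move_to_end(name)
            return "hit"
        # Assumed policy: evict least-recently-used models until there is room.
        while self.used_mb() + size_mb > self.capacity_mb:
            self.resident.popitem(last=False)
        self.resident[name] = size_mb
        return "swap"


# Hypothetical models kept in host memory (name -> size in MB).
host_models = {"resnet": 100, "bert": 400, "gpt2": 500}

gpu = GpuModelCache(capacity_mb=900)
requests = ["resnet", "bert", "resnet", "gpt2", "bert"]
events = [gpu.serve(name, host_models[name]) for name in requests]
print(events)
```

In this sketch every "swap" stands in for a host-to-GPU model transfer, which is the latency overhead the abstract says Torpor works to minimize; the simulation counts such events but does not model their cost.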

BibTeX

@inproceedings {308484,
author = {Minchen Yu and Ao Wang and Dong Chen and Haoxuan Yu and Xiaonan Luo and Zhuohao Li and Wei Wang and Ruichuan Chen and Dapeng Nie and Haoran Yang and Yu Ding},
title = {Torpor: {GPU-Enabled} Serverless Computing for {Low-Latency}, {Resource-Efficient} Inference},
booktitle = {2025 USENIX Annual Technical Conference (USENIX ATC 25)},
year = {2025},
isbn = {978-1-939133-48-9},
address = {Boston, MA},
pages = {597--612},
url = {https://www.usenix.org/conference/atc25/presentation/yu},
publisher = {USENIX Association},
month = jul
}
