FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees

Gabriele Oliaro; Xupeng Miao; Xinhao Cheng; Vineeth Kada; Mengdi Wu; Ruohan Gao; Yingyi Huang; Remi Delacourt; April Yang; Yingcheng Wang; Colin Unger; Zhihao Jia

Gabriele Oliaro, Carnegie Mellon University; Xupeng Miao, Purdue University; Xinhao Cheng, Carnegie Mellon University; Vineeth Kada, Anthropic PBC; Mengdi Wu, Ruohan Gao, and Yingyi Huang, Carnegie Mellon University; Remi Delacourt, Mistral AI; April Yang, Carnegie Mellon University; Yingcheng Wang, Purdue University; Colin Unger, Stanford University; Zhihao Jia, Carnegie Mellon University and Amazon Web Services

Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters, which wastes resources and under-utilizes hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations, dependent parallelization and graph pruning, significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s and improves finetuning throughput by 1.9-4.8× under heavy inference workloads and 2.5-6.8× under light loads, preserving over 76% of peak finetuning progress even at peak demand. FlexLLM is publicly available at https://flexllm.github.io.
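
To make the idea of interleaving inference and finetuning tokens within one iteration concrete, the following is a minimal illustrative sketch and not FlexLLM's actual scheduler. It assumes a simple policy in which inference tokens have strict priority and finetuning tokens fill whatever capacity is left. The names `IterationPlan`, `plan_iteration`, and `token_budget` are hypothetical, and the budget is assumed to be a per-iteration token cap small enough that an iteration of that size still meets the latency SLO.

```python
from dataclasses import dataclass


@dataclass
class IterationPlan:
    """Hypothetical per-iteration token split (not a FlexLLM type)."""
    inference_tokens: int   # tokens from latency-sensitive inference requests
    finetuning_tokens: int  # finetuning tokens filling the leftover capacity


def plan_iteration(pending_inference: int, pending_finetuning: int,
                   token_budget: int) -> IterationPlan:
    """Split one iteration's token budget, giving inference strict priority.

    Assumption: token_budget is a cap on tokens processed per iteration,
    chosen so that an iteration of that size still meets the latency SLO.
    """
    if min(pending_inference, pending_finetuning, token_budget) < 0:
        raise ValueError("token counts must be non-negative")
    # Inference is served first, up to the full budget.
    inference = min(pending_inference, token_budget)
    # Finetuning only consumes capacity that inference left unused.
    finetuning = min(pending_finetuning, token_budget - inference)
    return IterationPlan(inference, finetuning)


if __name__ == "__main__":
    # Light, moderate, and saturating inference load against a fixed budget.
    for load in (32, 200, 600):
        print(load, plan_iteration(load, pending_finetuning=1024, token_budget=512))
```

Under this assumed policy, finetuning progress shrinks as inference load grows and stops entirely once inference alone saturates the budget. A production scheduler would need a richer policy than this sketch to keep finetuning progressing at peak demand.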


BibTeX

@inproceedings {316058,
author = {Gabriele Oliaro and Xupeng Miao and Xinhao Cheng and Vineeth Kada and Mengdi Wu and Ruohan Gao and Yingyi Huang and Remi Delacourt and April Yang and Yingcheng Wang and Colin Unger and Zhihao Jia},
title = {{FlexLLM}: {Token-Level} {Co-Serving} of {LLM} Inference and Finetuning with {SLO} Guarantees},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {1359--1379},
url = {https://www.usenix.org/conference/nsdi26/presentation/oliaro},
publisher = {USENIX Association},
month = may
}

