Libra: Flexible Request Partitioning and Scheduling for Serving Unbalanced and Dynamic LLM Workloads

Chaoyi Ruan; Yinhe Chen; Dongqi Tian; Yandong Shi; Yongji Wu; Jialin Li; Cheng Li

Chaoyi Ruan, National University of Singapore; Yinhe Chen, Dongqi Tian, and Yandong Shi, University of Science and Technology of China; Yongji Wu, UC Berkeley; Jialin Li, National University of Singapore; Cheng Li, University of Science and Technology of China and Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

LLM inference must meet strict latency SLOs while maximizing throughput. Yet, real-world variability in prompt and response lengths skews compute-intensive prefill and memory-bound decode phases, making both colocated (even with chunked prefill) and disaggregated deployments unable to simultaneously deliver low tail latency and high throughput.

We introduce Libra, a high-performance LLM serving system that maximizes goodput under SLO constraints even when handling imbalanced and dynamic workloads. At the core of Libra is a micro-request-based flexible partitioning and scheduling (FPS) abstraction, which splits each request at any token boundary into multiple cooperating segments. Libra builds a two-level scheduling framework on this abstraction to balance micro-request load across unified GPU instances: a global scheduler selects per-request split points, and a local scheduler on each GPU instance forms SLO-aware batches. Finally, Libra uses chunked KV cache transfers to support cross-instance micro-request execution. On real-world traces with strict SLOs on A100/H100 GPUs, Libra improves goodput by up to 1.91× and 1.61×, increases serving capacity by 1.15× to 3.07×, and improves serving performance by up to 74.2% in a hybrid workload, compared to state-of-the-art colocated and disaggregated baselines.
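The idea of splitting a request at a token boundary can be illustrated with a minimal sketch. This is not Libra's implementation: the names `MicroRequest` and `split_request` are hypothetical, and the output length is assumed to be known in advance purely for illustration (a real serving system does not know it when the request arrives). The sketch only shows how a split point divides a request's prompt (prefill) and generated (decode) tokens among segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroRequest:
    """Hypothetical segment of a request, covering token positions [start, end)."""
    request_id: str
    start: int           # first token position covered (inclusive)
    end: int             # last token position covered (exclusive)
    prefill_tokens: int  # prompt tokens processed in this segment
    decode_tokens: int   # output tokens generated in this segment


def split_request(request_id, prompt_len, output_len, split_points):
    """Split a request at the given token boundaries (illustrative helper).

    Positions [0, prompt_len) are prompt tokens; positions
    [prompt_len, prompt_len + output_len) are generated tokens.
    output_len is assumed known here, which is a simplification.
    """
    total = prompt_len + output_len
    bounds = [0] + sorted(split_points) + [total]
    if any(b <= a for a, b in zip(bounds, bounds[1:])):
        raise ValueError("split points must be distinct and strictly inside the request")
    segments = []
    for start, end in zip(bounds, bounds[1:]):
        # Prompt tokens in this segment: overlap of [start, end) with [0, prompt_len).
        prefill = max(0, min(end, prompt_len) - start)
        decode = (end - start) - prefill
        segments.append(MicroRequest(request_id, start, end, prefill, decode))
    return segments
```

For example, `split_request("r1", 1000, 200, [600])` yields two segments: one that prefills the first 600 prompt tokens, and one that prefills the remaining 400 and decodes all 200 output tokens. The two segments could then run on different GPU instances, which is where KV cache transfer between instances becomes necessary.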

BibTeX

@inproceedings {316696,
author = {Chaoyi Ruan and Yinhe Chen and Dongqi Tian and Yandong Shi and Yongji Wu and Jialin Li and Cheng Li},
title = {Libra: Flexible Request Partitioning and Scheduling for Serving Unbalanced and Dynamic {LLM} Workloads},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {1243--1258},
url = {https://www.usenix.org/conference/nsdi26/presentation/ruan-libra},
publisher = {USENIX Association},
month = may
}
