SYMPHONY: Enabling Compute-Memory Disaggregation in LLM Serving Systems

Saurabh Agarwal; Bodun Hu; Anyong Mao; Aditya Akella; Shivaram Venkataraman

Saurabh Agarwal and Bodun Hu, UT-Austin; Anyong Mao, UW-Madison; Aditya Akella, UT-Austin; Shivaram Venkataraman, UW-Madison

Large Language Models (LLMs) power AI applications such as chatbots and agents, which maintain conversational state across multiple turns. Serving these workloads is inherently stateful: each request generates a KV cache storing token-level state. Existing systems either recompute caches or offload them to host memory—both approaches incur high latency, cause load imbalance, and limit scalability. We present SYMPHONY, a disaggregated memory management layer that decouples compute from KV cache storage while meeting strict latency requirements. To enable disaggregation, SYMPHONY employs advisory requests—prefetching hints derived from user interactions or workload structure—to move caches off the critical path and enable fine-grained, request-level load balancing. Since these predictive signals are often unreliable, SYMPHONY introduces two key techniques: priority-based KV cache management, which allocates memory based on neural network structure and request priority, and cooperative memory management, which dynamically coordinates GPU memory with the serving framework. Evaluations on LLaMA models with ShareGPT and Burst-GPT workloads show that SYMPHONY reduces end-to-end latency by 2.4× over vLLM and serves 4× more requests with minimal latency increase.
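
To make the advisory-request idea concrete, the following is a minimal sketch, not SYMPHONY's implementation. It assumes a hypothetical scheduler in which an advisory hint for a session triggers a KV cache prefetch onto the currently least-loaded node, so that the real request later finds its cache already resident. The names `Node`, `Scheduler`, `advise`, and `route` are illustrative only, and the "prefetch" is modeled as adding a session id to a set rather than moving real tensors.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical serving node; kv_cache holds session ids with resident caches."""
    name: str
    active_requests: int = 0
    kv_cache: set = field(default_factory=set)


class Scheduler:
    """Illustrative scheduler: advisory hints place caches before requests arrive."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.placement = {}  # session id -> node chosen by the latest advisory

    def _least_loaded(self):
        return min(self.nodes, key=lambda n: n.active_requests)

    def advise(self, session_id):
        """Handle a hint that session_id is likely to send a request soon."""
        target = self._least_loaded()
        # Stand-in for an asynchronous KV cache transfer off the critical path.
        target.kv_cache.add(session_id)
        self.placement[session_id] = target
        return target

    def route(self, session_id):
        """Route the real request; returns (node, cache_hit)."""
        node = self.placement.pop(session_id, None)
        if node is None:
            # No advisory was seen (hints are unreliable): fall back to load only.
            node = self._least_loaded()
        hit = session_id in node.kv_cache
        node.kv_cache.add(session_id)  # cache is resident after serving
        node.active_requests += 1
        return node, hit


if __name__ == "__main__":
    a, b = Node("a", active_requests=3), Node("b", active_requests=1)
    sched = Scheduler([a, b])
    sched.advise("chat-1")            # prefetch onto the less loaded node
    node, hit = sched.route("chat-1")
    print(node.name, hit)
```

In this sketch a request with no preceding advisory still gets served, but it misses the cache and would have to recompute or fetch its state on the critical path, which is the case the abstract's priority-based and cooperative memory management techniques are meant to soften.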

BibTeX

@inproceedings {316572,
author = {Saurabh Agarwal and Bodun Hu and Anyong Mao and Aditya Akella and Shivaram Venkataraman},
title = {{SYMPHONY}: Enabling {Compute-Memory} Disaggregation in {LLM} Serving Systems},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {2027--2041},
url = {https://www.usenix.org/conference/nsdi26/presentation/agarwal},
publisher = {USENIX Association},
month = may
}
