Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional Computation–Storage Awareness

Shipeng Hu; Guangyan Zhang; Yuqi Zhou; Yaya Wei; Ziyan Zhong; Jike Chen

Shipeng Hu and Guangyan Zhang, Tsinghua University; Yuqi Zhou, China University of Geosciences Beijing; Yaya Wei and Ziyan Zhong, China Telecom Omni-channel Operation Center; Jike Chen, Tsinghua University

In interactive LLM serving, historical key–value tensors (KVs) of multi-round conversations are often cached in a two-tier storage system consisting of host memory and SSDs, which provides large capacity at low cost. However, loading KVs from two-tier storage in existing approaches increases serving latency by up to 3.8× and decreases throughput by up to 2.0× compared to an ideal large-memory setting on our interactive conversation workload. This inefficiency arises from poor coordination between the compute engine and the two-tier storage.

This paper proposes Bidaw, an efficient KV caching approach with two-tier storage that enables bidirectional awareness between compute and storage. Bidaw introduces two key mechanisms. First, the compute engine schedules requests with KV-loading latency awareness by separating requests whose KVs reside in different storage layers and reordering them by KV size to reduce blocking. Second, the storage system improves host memory hit rates by leveraging LLM-generated responses to predict user access patterns during KV eviction. For further optimization, Bidaw balances storage footprint against computational savings by selectively caching storage-efficient history tensors.
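The first mechanism can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Tier`, `Request`, and `split_and_order` are hypothetical, and ordering each queue smallest-KV-first is an assumption (the abstract says only that requests are reordered by KV size to reduce blocking).

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Storage layer where a request's historical KVs currently reside."""
    HOST_MEMORY = "host_memory"
    SSD = "ssd"


@dataclass
class Request:
    request_id: str
    kv_tier: Tier   # layer holding this conversation's cached KVs
    kv_bytes: int   # size of the KVs that must be loaded before serving


def split_and_order(requests):
    """Separate requests by storage layer, then order each queue by KV size.

    Assumption: smallest KVs first, so a short load is not stuck behind
    a long one. The source does not state the sort direction.
    """
    queues = {tier: [] for tier in Tier}
    for req in requests:
        queues[req.kv_tier].append(req)
    for queue in queues.values():
        queue.sort(key=lambda r: r.kv_bytes)
    return queues


pending = [
    Request("a", Tier.SSD, 300),
    Request("b", Tier.HOST_MEMORY, 200),
    Request("c", Tier.SSD, 100),
    Request("d", Tier.HOST_MEMORY, 50),
]
queues = split_and_order(pending)
```

Keeping one queue per layer means requests whose KVs are already in host memory are never queued behind slower SSD loads, which is the kind of blocking the abstract describes.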

Experiments on our interactive conversation workload and a public multi-round conversation workload of interactive LLM serving show that Bidaw reduces response latency by up to 3.58× and improves throughput by up to 1.83× over state-of-the-art approaches, approaching the theoretical upper bound achieved when all KVs reside entirely in host memory.

BibTeX

@inproceedings {315971,
author = {Shipeng Hu and Guangyan Zhang and Yuqi Zhou and Yaya Wei and Ziyan Zhong and Jike Chen},
title = {Bidaw: Enhancing {Key-Value} Caching for Interactive {LLM} Serving via Bidirectional {Computation{\textendash}Storage} Awareness},
booktitle = {24th USENIX Conference on File and Storage Technologies (FAST 26)},
year = {2026},
isbn = {978-1-939133-53-3},
address = {Santa Clara, CA},
pages = {101--116},
url = {https://www.usenix.org/conference/fast26/presentation/hu-shipeng},
publisher = {USENIX Association},
month = feb
}
