SolidAttention: Low-Latency SSD-based Serving on Memory-Constrained PCs

Xinrui Zheng, Dongliang Wei, Jianxiang Gao, Yixin Song, Zeyu Mi, and Haibo Chen, Shanghai Jiao Tong University

AI personal computers (AIPCs) enable local deployment of large language model (LLM) inference, offering stronger privacy guarantees and customizable serving. However, such deployments are constrained by limited memory capacity, primarily because of the substantial key-value (KV) cache overhead. This paper introduces SolidAttention, an LLM inference engine that addresses these limitations through a tight co-design of dynamic attention sparsity algorithms and SSD-based storage management. Specifically, to maximize SSD bandwidth utilization, SolidAttention consolidates multiple KV pairs into coarse-grained blocks and implements speculative prefetching mechanisms that exploit temporal locality in sparse attention. Through fine-grained orchestration of computation and I/O operations and the reuse of synchronization points, SolidAttention further minimizes SSD-induced blocking latency. With a 128k-token context, SolidAttention improves inference speed by up to 3.1× and reduces the KV cache memory footprint by up to 98% without compromising inference accuracy.
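
The two ideas named in the abstract, block-granular KV selection and speculative prefetching based on temporal locality, can be illustrated with a toy sketch. This is not the authors' implementation: the mean-key block summaries, the top-k selection rule, the "reuse the previous step's blocks" predictor, and every name below (`block_summaries`, `select_blocks`, `PrefetchingBlockCache`, `decode`) are assumptions made for illustration, and SSD reads are simulated with in-memory sets.

```python
import heapq


def block_summaries(keys, block_size):
    """Average the key vectors in each full block of `block_size` tokens.

    A trailing partial block is ignored in this sketch.
    """
    summaries = []
    for start in range(0, len(keys) - block_size + 1, block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        summaries.append([sum(k[d] for k in block) / block_size for d in range(dim)])
    return summaries


def select_blocks(query, summaries, k):
    """Return the indices of the k blocks whose summaries score highest."""
    scores = [sum(q * s for q, s in zip(query, summary)) for summary in summaries]
    return set(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))


class PrefetchingBlockCache:
    """Toy stand-in for an SSD-backed KV block store (hypothetical)."""

    def __init__(self):
        self.resident = set()  # block indices currently held in memory
        self.hits = 0
        self.misses = 0

    def prefetch(self, predicted):
        # A real engine would issue asynchronous SSD reads here and
        # overlap them with computation; this sketch only records the set.
        self.resident = set(predicted)

    def fetch(self, needed):
        """Count hits and misses for `needed`; return the blocks that missed."""
        missing = needed - self.resident
        self.hits += len(needed & self.resident)
        self.misses += len(missing)
        self.resident = set(needed)
        return missing


def decode(queries, summaries, k):
    """Run one selection per decoding step and return (hits, misses)."""
    cache = PrefetchingBlockCache()
    previous = set()
    for query in queries:
        # Speculate that this step needs the same blocks as the last one.
        cache.prefetch(previous)
        needed = select_blocks(query, summaries, k)
        cache.fetch(needed)
        previous = needed
    return cache.hits, cache.misses
```

Calling `decode(queries, summaries, k)` with slowly drifting queries shows the intended effect: the more the selected blocks overlap between consecutive steps, the more of them are already resident when the step begins, so fewer blocking reads remain.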

FAST '26 Open Access Sponsored by NetApp

BibTeX
@inproceedings {315939,
author = {Xinrui Zheng and Dongliang Wei and Jianxiang Gao and Yixin Song and Zeyu Mi and Haibo Chen},
title = {{SolidAttention}: {Low-Latency} {SSD-based} Serving on {Memory-Constrained} {PCs}},
booktitle = {24th USENIX Conference on File and Storage Technologies (FAST 26)},
year = {2026},
isbn = {978-1-939133-53-3},
address = {Santa Clara, CA},
pages = {67--82},
url = {https://www.usenix.org/conference/fast26/presentation/zheng},
publisher = {USENIX Association},
month = feb
}
