Inference in the Shadows: Taming Memory Bandwidth Contention in Mobile LLM Inference with Sereno

Tong Xin, Xinrui Shi, Mingkai Dong, and Zeyu Mi, Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University

The proliferation of large language models (LLMs) on mobile devices introduces a new performance challenge: resource contention between compute-intensive inference and latency-sensitive foreground applications. We identify a severe and asymmetric interference where concurrent LLM inference substantially degrades foreground applications’ quality-of-service (QoS)—increasing the aggregate jank rate (the fraction of frames that appear as visible stutters) by 153%. In contrast, LLM throughput degrades by only 1.01% and 1.64% during prefill and decode stages, respectively. This imbalance arises because the hardware prioritizes NPU memory traffic—originally to guarantee critical media tasks (e.g., video recording)—a privilege that best-effort LLM inference inherits, causing aggressive bandwidth contention. To address this asymmetric degradation, we present SERENO, a foreground-QoS-friendly LLM inference framework that resolves bandwidth contention between foreground applications and background LLM inference, without hardware modification. SERENO repurposes speculative decoding to introduce fine-grained yield points for preemptible execution, letting the system detect memory contention and dynamically yield bandwidth to the foreground without losing inference progress. Extensive evaluations on commercial smartphones across diverse categories of popular applications demonstrate that SERENO reduces the foreground jank rate by up to 92.6% (58.5% on average) while boosting LLM throughput by up to 67.9% (26.4% on average). Compared with vanilla speculative decoding, SERENO can reduce the foreground jank rate by up to 72.1% while incurring only a 6.2% performance degradation.
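The yield-point idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes that each speculative-decoding round (draft, then verify) ends at a point where accepted tokens are already committed, so the loop can pause there without losing progress. The names `decode_with_yield_points`, `draft_step`, `verify_step`, `contended`, and `wait` are hypothetical, and how SERENO actually detects memory contention is not described in the abstract.

```python
from typing import Callable, List


def decode_with_yield_points(
    draft_step: Callable[[List[int]], List[int]],
    verify_step: Callable[[List[int], List[int]], List[int]],
    contended: Callable[[], bool],
    wait: Callable[[], None],
    prompt: List[int],
    max_new_tokens: int,
) -> List[int]:
    """Speculative-decoding loop with a yield point before every round.

    All callables are hypothetical stand-ins:
      draft_step(tokens)            -> proposed continuation tokens
      verify_step(tokens, proposal) -> accepted prefix of the proposal
      contended()                   -> True while the foreground needs bandwidth
      wait()                        -> back off (e.g. sleep) before re-checking
    """
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Yield point: tokens accepted so far are already committed, so
        # pausing here gives up bandwidth without discarding any progress.
        while contended():
            wait()
        proposal = draft_step(tokens)
        accepted = verify_step(tokens, proposal)
        if not accepted:
            # Assumption of this sketch: verification always yields at
            # least one token, as in standard speculative decoding.
            raise RuntimeError("verify_step accepted no tokens")
        tokens.extend(accepted)
    # The last round may overshoot the budget; trim to the requested length.
    return tokens[:target_len]


def toy_draft(tokens: List[int]) -> List[int]:
    """Toy draft model: propose the next three consecutive integers."""
    last = tokens[-1]
    return [last + 1, last + 2, last + 3]


def toy_verify(tokens: List[int], proposal: List[int]) -> List[int]:
    """Toy target model: accept only the first two proposed tokens."""
    return proposal[:2]
```

In this sketch the granularity of preemption is one draft/verify round; a smaller draft length would give more frequent yield points at the cost of fewer tokens per round.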

BibTeX

@inproceedings {318624,
author = {Tong Xin and Xinrui Shi and Mingkai Dong and Zeyu Mi},
title = {Inference in the Shadows: Taming Memory Bandwidth Contention in Mobile {LLM} Inference with Sereno},
booktitle = {20th USENIX Symposium on Operating Systems Design and Implementation (OSDI 26)},
year = {2026},
isbn = {978-1-939133-55-7},
address = {Seattle, WA},
pages = {2299--2318},
url = {https://www.usenix.org/conference/osdi26/presentation/xin},
publisher = {USENIX Association},
month = jul
}
