Cortex: Achieving Low-Latency, Cost-Efficient Remote Data Access For LLM via Semantic-Aware Knowledge Caching

Chaoyi Ruan; Chao Bi; Kaiwen Zheng; Ziji Shi; Xinyi Wan; Jialin Li

Chaoyi Ruan, National University of Singapore; Chao Bi, University of Science and Technology of China; Kaiwen Zheng, University of Toronto; Ziji Shi, National University of Singapore; Xinyi Wan, Sea AI Lab and National University of Singapore; Jialin Li, National University of Singapore

Large Language Model (LLM) agents tackle data-intensive tasks such as deep research and code generation. However, their effectiveness depends on frequent interactions with knowledge sources across remote clouds or regions. Such interactions can create non-trivial latency and cost bottlenecks. Existing caching solutions focus on exact-match queries, limiting their effectiveness for semantic knowledge reuse.

To address this challenge, we introduce Cortex, a novel cross-region knowledge caching architecture for LLM agents. At its core are two abstractions: Semantic Element (SE) and Semantic Retrieval Index (Seri). A semantic element captures the semantic embedding representation of an LLM query together with performance-aware metadata such as latency, cost, and staticity. Seri then provides two-stage retrieval: a vector similarity index over semantic embeddings for fast candidate selection, and a lightweight LLM-powered semantic judger for precise validation. Atop these primitives, Cortex builds a new cache interface that includes a new semantic-aware cache hit definition, a cost-efficient eviction policy, and proactive prefetching. To reduce overhead, Cortex co-locates the small LLM judger with the main LLM using adaptive scheduling and resource sharing. Our evaluation demonstrates that Cortex delivers substantial performance improvements without compromising correctness. On representative search workloads, Cortex achieves up to a 3.6× increase in throughput by maintaining cache hit rates of over 85%, while preserving accuracy virtually identical to non-cached baselines. Cortex also improves throughput for coding tasks by 20%, showcasing its versatility across diverse agentic workloads.
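The two-stage retrieval described above can be illustrated with a minimal sketch. This is not the Cortex implementation: the class and function names (`SemanticElement`, `TwoStageCache`, `toy_embed`, `toy_judge`), the bag-of-words embedding, the 0.8 similarity threshold, and the keyword-based judge are all assumptions made for illustration. In a real deployment the embedding would come from an embedding model and the judge would be a small LLM.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SemanticElement:
    """Hypothetical stand-in for an SE: query embedding plus metadata."""
    query: str
    embedding: List[float]
    value: str
    latency_ms: float  # observed latency of the remote fetch
    cost: float        # monetary cost of the remote fetch
    static: bool       # whether the answer is expected to stay valid


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TwoStageCache:
    def __init__(self, embed: Callable[[str], List[float]],
                 judge: Callable[[str, str], bool],
                 threshold: float = 0.8, top_k: int = 3):
        self.embed, self.judge = embed, judge
        self.threshold, self.top_k = threshold, top_k
        self.elements: List[SemanticElement] = []

    def put(self, query: str, value: str, latency_ms: float = 0.0,
            cost: float = 0.0, static: bool = True) -> None:
        self.elements.append(SemanticElement(
            query, self.embed(query), value, latency_ms, cost, static))

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        # Stage 1: cheap candidate selection by embedding similarity.
        scored = sorted(((cosine(q, e.embedding), e) for e in self.elements),
                        key=lambda pair: pair[0], reverse=True)
        candidates = [e for s, e in scored[:self.top_k] if s >= self.threshold]
        # Stage 2: the judge validates each candidate before declaring a hit.
        for e in candidates:
            if self.judge(query, e.query):
                return e.value
        return None  # miss: the caller would fall back to the remote source


# Toy components (assumptions): a fixed-vocabulary bag-of-words embedding
# and a judge that compares content words after dropping stop words.
VOCAB = ["capital", "france", "germany", "population",
         "of", "what", "is", "the"]
STOP = {"of", "what", "is", "the"}


def _words(text: str) -> List[str]:
    return text.lower().replace("?", "").split()


def toy_embed(text: str) -> List[float]:
    words = _words(text)
    return [float(words.count(w)) for w in VOCAB]


def toy_judge(new_query: str, cached_query: str) -> bool:
    return set(_words(new_query)) - STOP == set(_words(cached_query)) - STOP


cache = TwoStageCache(toy_embed, toy_judge)
cache.put("What is the capital of France?", "Paris", latency_ms=120.0, cost=0.002)
hit = cache.get("The capital of France is what?")
near_miss = cache.get("What is the capital of Germany?")
```

In this sketch the reworded France query is a hit, while the Germany query clears the similarity threshold in stage 1 but is rejected by the judge in stage 2. That case is why a validation step follows the similarity search: embedding similarity alone can return a cached answer for a question that is close in wording but different in meaning.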

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@inproceedings {316694,
author = {Chaoyi Ruan and Chao Bi and Kaiwen Zheng and Ziji Shi and Xinyi Wan and Jialin Li},
title = {Cortex: Achieving {Low-Latency}, {Cost-Efficient} Remote Data Access For {LLM} via {Semantic-Aware} Knowledge Caching},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {2407--2421},
url = {https://www.usenix.org/conference/nsdi26/presentation/ruan-cortex},
publisher = {USENIX Association},
month = may
}
