USENIX | The Advanced Computing Systems Association

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Agrawal A, Kedia N, Panwar A, Mohan J, Kwatra N, Gulavani B, Tumanov A, Ramjee R. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 117–134.

ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

Fu Y, Xue L, Huang Y, Brabete A-O, Ustiugov D, Patel Y, Mai L. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 135–153.

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

Lee W, Lee J, Seo J, Sim J. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 155–172.

Llumnix: Dynamic Scheduling for Large Language Model Serving

Sun B, Huang Z, Zhao H, Xiao W, Zhang X, Li Y, Lin W. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 173–191.

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Zhong Y, Liu S, Chen J, Hu J, Zhu Y, Liu X, Jin X, Zhang H. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 193–210.

ACCL+: an FPGA-Based Collective Engine for Distributed Applications

He Z, Korolija D, Zhu Y, Ramhorst B, Laan T, Petrica L, Blott M, Alonso G. 2024. ACCL+: an FPGA-Based Collective Engine for Distributed Applications. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 211–231.

Beaver: Practical Partial Snapshots for Distributed Cloud Services

Yu L, Zhang X, Zhang H, Sonchack J, Ports D, Liu V. 2024. Beaver: Practical Partial Snapshots for Distributed Cloud Services. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 233–249.

Fast and Scalable In-network Lock Management Using Lock Fission

Zhang H, Cheng K, Chen R, Chen H. 2024. Fast and Scalable In-network Lock Management Using Lock Fission. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 251–268.

Chop Chop: Byzantine Atomic Broadcast to the Network Limit

Camaioni M, Guerraoui R, Monti M, Roman P-L, Vidigueira M, Voron G. 2024. Chop Chop: Byzantine Atomic Broadcast to the Network Limit. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 269–287.

Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning

Zhai Y, Yang S, Pan K, Zhang R, Liu S, Liu C, Ye Z, Ji J, Zhao J, Zhang Y et al. 2024. Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). pp. 289–305.
