Tetris: Memory-efficient Serverless Inference through Tensor Sharing

Authors: 

Jie Li, Laiping Zhao, and Yanan Yang, Tianjin University; Kunlin Zhan, 58.com; Keqiu Li, Tianjin University

Abstract: 

Executing complex, memory-intensive deep learning inference services poses a major challenge for serverless computing frameworks, which must densely deploy and maintain inference models at high throughput. We observe that serverless inference systems suffer from excessive memory consumption, caused by large model sizes and high data redundancy across instances.

We present Tetris, a serverless platform tailored to inference services that reduces their memory footprint by an order of magnitude. Tetris's design centers on extensive memory sharing of both runtimes and tensors: it minimizes runtime redundancy through a combined optimization of batching and concurrent execution, and it eliminates tensor redundancy across instances of the same or different functions using a lightweight and safe tensor mapping mechanism. Our comprehensive evaluation demonstrates that Tetris saves up to 93% of the memory footprint of inference services and increases function density by 30× without impairing latency.
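To make the tensor-sharing idea concrete, the sketch below illustrates one common way such a mechanism can be realized: model weights are serialized once, and every instance maps the same file read-only, so the operating system backs all mappings with a single set of physical pages. The file name, tensor shape, and helper functions here are hypothetical illustrations of the general technique, not Tetris's actual implementation.

import mmap
import numpy as np

WEIGHT_FILE = "model_weights.bin"        # hypothetical shared weight blob
SHAPE, DTYPE = (1024, 1024), np.float32  # hypothetical tensor layout

def publish_weights():
    # Done once, e.g., by the first instance of a function: serialize
    # the read-only model weights to a file visible to all instances.
    weights = np.random.rand(*SHAPE).astype(DTYPE)
    weights.tofile(WEIGHT_FILE)

def map_weights():
    # Each instance maps the same file read-only. The OS page cache
    # backs every mapping with one copy of the data, so N instances
    # pay the tensor's memory cost roughly once instead of N times.
    with open(WEIGHT_FILE, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return np.frombuffer(buf, dtype=DTYPE).reshape(SHAPE)

if __name__ == "__main__":
    publish_weights()
    w1 = map_weights()  # "instance 1"
    w2 = map_weights()  # "instance 2" shares physical pages with w1
    print(w1[0, 0] == w2[0, 0])  # same underlying data

Because the mapping is read-only, a misbehaving instance cannot corrupt the weights seen by others, which is the essence of making this kind of cross-instance sharing safe.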

BibTeX
@inproceedings{280704,
author = {Jie Li and Laiping Zhao and Yanan Yang and Kunlin Zhan and Keqiu Li},
title = {Tetris: Memory-efficient Serverless Inference through Tensor Sharing},
booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
year = {2022},
address = {Carlsbad, CA},
url = {https://www.usenix.org/conference/atc22/presentation/li-jie},
publisher = {USENIX Association},
month = jul
}