Chijun Sima, Tencent; Yao Fu and Man-Kit Sit, The University of Edinburgh; Liyi Guo, Xuri Gong, Feng Lin, Junyu Wu, Yongsheng Li, and Haidong Rong, Tencent; Pierre-Louis Aublin, IIJ research laboratory; Luo Mai, The University of Edinburgh
Deep Learning Recommender Systems (DLRSs) need to update models at low latency, thus promptly serving new users and content. Existing DLRSs, however, fail to do so. They train/validate models offline and broadcast entire models to global inference clusters. They thus incur significant model update latency (e.g. dozens of minutes), which adversely affects Service-Level Objectives (SLOs).
This paper describes Ekko, a novel DLRS that enables low-latency model updates. Its design idea is to allow model updates to be immediately disseminated to all inference clusters, thus bypassing long-latency model checkpoint, validation and broadcast. To realise this idea, we first design an efficient peer-to-peer model update dissemination algorithm. This algorithm exploits the sparsity and temporal locality in updating DLRS models to improve the throughput and latency of updating models. Further, Ekko has a model update scheduler that can prioritise, over busy networks, the sending of model updates that can largely affect SLOs. Finally, Ekko has an inference model state manager which monitors the SLOs of inference models and rollbacks the models if SLO-detrimental biased updates are detected. Evaluation results show that Ekko is orders of magnitude faster than state-of-the-art DLRS systems. Ekko has been deployed in production for more than one year, serves over a billion users daily and reduces the model update latency compared to state-of-the-art systems from dozens of minutes to 2.4 seconds.
OSDI '22 Open Access Sponsored by NetApp
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Chijun Sima and Yao Fu and Man-Kit Sit and Liyi Guo and Xuri Gong and Feng Lin and Junyu Wu and Yongsheng Li and Haidong Rong and Pierre-Louis Aublin and Luo Mai},
title = {Ekko: A {Large-Scale} Deep Learning Recommender System with {Low-Latency} Model Update},
booktitle = {16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22)},
year = {2022},
isbn = {978-1-939133-28-1},
address = {Carlsbad, CA},
pages = {821--839},
url = {https://www.usenix.org/conference/osdi22/presentation/sima},
publisher = {USENIX Association},
month = jul
}