Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training

Authors: 

Mark Zhao, Stanford University and Meta; Satadru Pan, Niket Agarwal, Zhaoduo Wen, David Xu, Anand Natarajan, Pavan Kumar, Shiva Shankar P, Ritesh Tijoriwala, Karan Asher, Hao Wu, Aarti Basant, Daniel Ford, Delia David, Nezih Yigitbasi, Pratap Singh, and Carole-Jean Wu, Meta; Christos Kozyrakis, Stanford University

Abstract: 

Tectonic-Shift is the storage fabric for Meta’s production machine learning (ML) training infrastructure. Industrial storage fabrics for ML need to meet both the intensive IO and high-capacity storage demands of training jobs. Our prior storage fabric, Tectonic, used hard disk drives (HDDs) to store training data. However, HDDs provide poor IO-per-watt performance. This inefficiency hindered the scalability of our storage fabric, and thus limited our ability to keep pace with rapidly growing training IO demands.

This paper describes our journey to build and deploy Tectonic-Shift, a composite storage fabric that efficiently serves the needs of our training infrastructure. We begin with a deep workload characterization that guided an extensive hardware and software design space exploration. We then present the principled design of Tectonic-Shift, which maximizes storage power efficiency by combining Shift, a flash storage tier, with Tectonic. Shift improves efficiency by absorbing reads using IO-efficient flash, reducing required HDD capacity. Shift maximizes IO absorption via novel application-aware cache policies that infer future access patterns from training dataset specifications. Shift absorbs 1.51-3.28x more IO than an LRU flash cache and reduces power demand in a petabyte-scale production Tectonic-Shift cluster by 29%.

USENIX ATC '23 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

This content is available to:

BibTeX
@inproceedings {288721,
author = {Mark Zhao and Satadru Pan and Niket Agarwal and Zhaoduo Wen and David Xu and Anand Natarajan and Pavan Kumar and Shiva Shankar P and Ritesh Tijoriwala and Karan Asher and Hao Wu and Aarti Basant and Daniel Ford and Delia David and Nezih Yigitbasi and Pratap Singh and Carole-Jean Wu},
title = {{Tectonic-Shift}: A Composite Storage Fabric for {Large-Scale} {ML} Training},
booktitle = {2023 USENIX Annual Technical Conference (USENIX ATC 23)},
year = {2023},
isbn = {978-1-939133-35-9},
address = {Boston, MA},
pages = {433--449},
url = {https://www.usenix.org/conference/atc23/presentation/zhao},
publisher = {USENIX Association},
month = jul
}

Presentation Video