A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

Authors: 

Nils Blach and Maciej Besta, ETH Zürich; Daniele De Sensi, ETH Zürich and Sapienza University of Rome; Jens Domke, RIKEN Center for Computational Science (R-CCS); Hussein Harake, Swiss National Supercomputing Centre (CSCS); Shigang Li, ETH Zürich and BUPT, Beijing; Patrick Iff, ETH Zürich; Marek Konieczny, AGH-UST; Kartik Lakhotia, Intel Labs; Ales Kubicek and Marcel Ferrari, ETH Zürich; Fabrizio Petrini, Intel Labs; Torsten Hoefler, ETH Zürich

Abstract: 

Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, or linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF while the associated (open-source) routing architecture is fully portable and applicable to accelerate any low-diameter interconnect.

NSDI '24 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {295581,
author = {Nils Blach and Maciej Besta and Daniele De Sensi and Jens Domke and Hussein Harake and Shigang Li and Patrick Iff and Marek Konieczny and Kartik Lakhotia and Ales Kubicek and Marcel Ferrari and Fabrizio Petrini and Torsten Hoefler},
title = {A {High-Performance} Design, Implementation, Deployment, and Evaluation of The Slim Fly Network},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {1025--1044},
url = {https://www.usenix.org/conference/nsdi24/presentation/blach},
publisher = {USENIX Association},
month = apr
}