Measuring Congestion in High-Performance Datacenter Interconnects

Authors: 

Saurabh Jha and Archit Patke, University of Illinois at Urbana-Champaign; Jim Brandt and Ann Gentile, Sandia National Lab; Benjamin Lim, University of Illinois at Urbana-Champaign; Mike Showerman and Greg Bauer, National Center for Supercomputing Applications; Larry Kaplan, Cray Inc.; Zbigniew Kalbarczyk, University of Illinois at Urbana-Champaign; William Kramer, University of Illinois at Urbana-Champaign and National Center for Supercomputing Applications; Ravi Iyer, University of Illinois at Urbana-Champaign

Abstract: 

While it is widely acknowledged that network congestion in High Performance Computing (HPC) systems can significantly degrade application performance, there has been little to no quantification of congestion on credit-based interconnect networks. We present a methodology for detecting, extracting, and characterizing regions of congestion in networks. We have implemented the methodology in a deployable tool, Monet, which can provide such analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world's largest 3D torus network of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of production congestion at a scale that has never been evaluated before.

NSDI '20 Open Access Sponsored by NetApp

BibTeX
@inproceedings{246494,
author = {Saurabh Jha and Archit Patke and Jim Brandt and Ann Gentile and Benjamin Lim and Mike Showerman and Greg Bauer and Larry Kaplan and Zbigniew Kalbarczyk and William Kramer and Ravi Iyer},
title = {Measuring Congestion in {High-Performance} Datacenter Interconnects},
booktitle = {17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)},
year = {2020},
isbn = {978-1-939133-13-7},
address = {Santa Clara, CA},
pages = {37--57},
url = {https://www.usenix.org/conference/nsdi20/presentation/jha},
publisher = {USENIX Association},
month = feb
}
