SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks

Authors: 

Yilong Geng, Shiyu Liu, and Zi Yin, Stanford University; Ashish Naik, Google Inc.; Balaji Prabhakar and Mendel Rosenblum, Stanford University; Amin Vahdat, Google Inc.

Abstract: 

Network measurement and monitoring have been key to understanding the inner workings of computer networks and debugging the performance problems of distributed applications. Despite many products and much research on these topics, performing accurate measurement at scale in near real-time has remained elusive in data centers. On the one hand, switch-based telemetry can give accurate per-packet views, but these must be assembled across the network and across packets to get network- and application-level insight, which is not scalable. On the other hand, purely end-host-based measurement is naturally scalable but so far has only provided partial views of in-network operation.

In this paper, we set out to push the boundary of edge-based measurement by scalably and accurately reconstructing the full queueing dynamics in the network with data gathered entirely at the transmit and receive network interface cards (NICs). We begin with a signal processing framework for quantifying a key trade-off: reconstruction accuracy versus the amount of data gathered. Based on this, we propose SIMON, an accurate and scalable measurement system for data centers that reconstructs key network state variables such as packet queueing times at switches, link utilizations, and queue and link compositions at the flow level. We then demonstrate that the function approximation capability of multi-layered neural networks can speed up SIMON by a factor of 5,000 to 10,000, enabling it to run in near real-time. We deployed SIMON in three testbeds with different link speeds, numbers of switching layers, and numbers of servers; evaluations with NetFPGAs and a cross-validation technique show that SIMON reconstructs queue lengths to within 3-5 KB and link utilizations to within 1% of their actual values. The accuracy and speed of SIMON enable sensitive A/B tests, which greatly aid the real-time development of algorithms, protocols, network software and applications.
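To make the idea of reconstructing in-network queueing from edge measurements concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the estimator, so the sketch substitutes plain non-negative least squares solved by projected gradient descent. The topology, the function name `reconstruct_queue_delays`, and all numeric values are hypothetical. It assumes each probe's queueing delay, measured between transmit and receive NICs, is the sum of the delays of the queues on its path.

```python
def reconstruct_queue_delays(paths, path_delays, num_queues, steps=5000, lr=0.05):
    """Estimate per-queue delays q >= 0 that best explain edge-measured path delays.

    Minimizes sum over probes of (sum of q[i] on the probe's path - measured delay)^2
    using projected gradient descent (an assumed, illustrative estimator).
    """
    q = [0.0] * num_queues
    for _ in range(steps):
        grad = [0.0] * num_queues
        for path, measured in zip(paths, path_delays):
            residual = sum(q[i] for i in path) - measured
            for i in path:
                grad[i] += 2.0 * residual
        # Gradient step, then project onto q >= 0 (delays cannot be negative).
        q = [max(0.0, qi - lr * gi) for qi, gi in zip(q, grad)]
    return q


# Hypothetical network with 3 queues and 4 probe paths (lists of queue indices).
true_q = [10.0, 0.0, 25.0]          # assumed ground-truth delays, in microseconds
paths = [[0], [0, 1], [1, 2], [0, 2]]
# What the NICs would observe: total queueing delay along each path.
path_delays = [sum(true_q[i] for i in p) for p in paths]

estimate = reconstruct_queue_delays(paths, path_delays, num_queues=3)
```

In this toy case the probe paths cover the queues well enough that the per-queue delays are uniquely determined. With fewer probes the system becomes underdetermined, which is one way to picture the accuracy-versus-data trade-off the abstract mentions.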

NSDI '19 Open Access Sponsored by NetApp

BibTeX
@inproceedings {227621,
author = {Yilong Geng and Shiyu Liu and Zi Yin and Ashish Naik and Balaji Prabhakar and Mendel Rosenblum and Amin Vahdat},
title = {{SIMON}: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks},
booktitle = {16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)},
year = {2019},
isbn = {978-1-931971-49-2},
address = {Boston, MA},
pages = {549--564},
url = {https://www.usenix.org/conference/nsdi19/presentation/geng},
publisher = {USENIX Association},
month = feb
}
