SRNIC: A Scalable Architecture for RDMA NICs

Authors: 

Zilong Wang, Hong Kong University of Science and Technology; Layong Luo and Qingsong Ning, ByteDance; Chaoliang Zeng, Wenxue Li, and Xinchen Wan, Hong Kong University of Science and Technology; Peng Xie, Tao Feng, Ke Cheng, Xiongfei Geng, Tianhao Wang, Weicheng Ling, Kejia Huo, Pingbo An, Kui Ji, Shideng Zhang, Bin Xu, Ruiqing Feng, and Tao Ding, ByteDance; Kai Chen, Hong Kong University of Science and Technology; Chuanxiong Guo

Abstract: 

RDMA is expected to be highly scalable: to perform well in large-scale data center networks where packet losses are inevitable (i.e., high network scalability), and to support a large number of performant connections per server (i.e., high connection scalability). Commercial RoCEv2 NICs (RNICs) fall short on scalability as they rely on a lossless, limited-scale network fabric and support only a small number of performant connections. Recent work, IRN, improves network scalability by relaxing the lossless network requirement, but leaves the connection scalability issue unaddressed.

In this paper, we aim to address the connection scalability challenge while maintaining the high performance and low CPU overhead of commercial RNICs and the high network scalability of IRN, by designing SRNIC, a Scalable RDMA NIC architecture. Our key insight in SRNIC is that on-chip data structures and their memory requirements in RNICs can be minimized with careful protocol and architecture co-design to improve connection scalability. Guided by this insight, we analyze all data structures involved in an RDMA conceptual model and remove as many of them as possible through RDMA protocol header modifications and architectural innovations, including a cache-free QP scheduler and memory-free selective repeat. We implement a fully functional SRNIC prototype using an FPGA. Experiments show that SRNIC achieves 10K performant connections on chip and outperforms commercial RNICs by 18x in terms of normalized connection scalability (i.e., the number of performant connections per 1 MB of memory), while achieving 97 Gbps throughput and 3.3 μs latency with less than 5% CPU overhead, and maintaining high network scalability.
