Automatic, Application-Aware I/O Forwarding Resource Allocation

Authors: 

Xu Ji, Tsinghua University; National Supercomputing Center in Wuxi; Bin Yang and Tianyu Zhang, National Supercomputing Center in Wuxi; Shandong University; Xiaosong Ma, Qatar Computing Research Institute, HBKU; Xiupeng Zhu, National Supercomputing Center in Wuxi; Shandong University; Xiyang Wang, National Supercomputing Center in Wuxi; Nosayba El-Sayed, Emory University; Jidong Zhai, Tsinghua University; Weiguo Liu, National Supercomputing Center in Wuxi; Shandong University; Wei Xue, Tsinghua University; National Supercomputing Center in Wuxi

Abstract: 

The I/O forwarding architecture is widely adopted on modern supercomputers, with a layer of intermediate nodes sitting between the many compute nodes and the backend storage nodes. This allows compute nodes to run more efficiently and stably with a leaner OS, offloads I/O coordination and communication with the backend from the compute nodes, maintains fewer concurrent connections to storage systems, and provides additional resources for effective caching, prefetching, write buffering, and I/O aggregation. However, on many existing machines, these forwarding nodes are assigned to serve a fixed set of compute nodes.

We explore an automatic mechanism, DFRA, for application-adaptive dynamic forwarding resource allocation. Using I/O monitoring data that proves affordable to acquire in real time and to maintain for long-term history analysis, DFRA conducts a history-based study upon each job's dispatch to determine whether the job should be granted more forwarding resources or given dedicated forwarding nodes. Such customized I/O forwarding lets the small fraction of I/O-intensive applications achieve higher I/O performance and scalability, while effectively isolating disruptive I/O activities. We implemented, evaluated, and deployed DFRA on Sunway TaihuLight, the current No. 2 supercomputer in the world. It improves applications' I/O performance by up to 16.0x, eliminates most of the inter-application I/O interference, and has saved over 200 million core-hours during its deployment on TaihuLight over the past 8 months. Finally, our proposed DFRA design is not platform-dependent, making it applicable to the management of existing and future I/O forwarding or burst buffer resources.
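To make the idea of a history-based decision at job dispatch concrete, the following is a minimal Python sketch. It is not DFRA's actual algorithm, which the abstract does not specify. The record schema (`JobRecord`), the constants (`DEFAULT_RATIO`, `FWD_NODE_CAPACITY_MBPS`, `DISRUPTIVE_APPS`), and the function names are all hypothetical and chosen only for illustration. The sketch contrasts a fixed compute-to-forwarding mapping with a decision that scales forwarding nodes to an application's past I/O demand and flags known-disruptive applications for dedicated nodes.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """One past run of an application (hypothetical monitoring schema)."""
    app: str
    compute_nodes: int
    peak_io_mbps: float  # peak aggregate I/O bandwidth observed for the run


# Hypothetical constants; none of these values come from the paper.
DEFAULT_RATIO = 512                      # compute nodes per forwarding node (static mapping)
FWD_NODE_CAPACITY_MBPS = 2500.0          # assumed throughput of one forwarding node
DISRUPTIVE_APPS = {"metadata_heavy_app"}  # apps assumed to interfere with neighbours


def static_forwarding_nodes(compute_nodes: int) -> int:
    """Forwarding nodes a job receives under the fixed mapping (ceiling division)."""
    return max(1, math.ceil(compute_nodes / DEFAULT_RATIO))


def decide_allocation(app: str, compute_nodes: int, history: list) -> dict:
    """Pick a forwarding allocation for a job from its application's history."""
    default = static_forwarding_nodes(compute_nodes)
    runs = [r for r in history if r.app == app]
    if not runs:
        # No history for this application: fall back to the static mapping.
        return {"forwarding_nodes": default, "dedicated": False}

    # Estimate demand by scaling the mean per-compute-node bandwidth to this job's size.
    per_node = mean(r.peak_io_mbps / r.compute_nodes for r in runs)
    demand = per_node * compute_nodes
    needed = max(1, math.ceil(demand / FWD_NODE_CAPACITY_MBPS))

    return {
        # Never allocate fewer nodes than the static mapping would.
        "forwarding_nodes": max(default, needed),
        # Isolate applications flagged as disruptive on dedicated nodes.
        "dedicated": app in DISRUPTIVE_APPS,
    }


history = [
    JobRecord("io_app", 1024, 10000.0),
    JobRecord("io_app", 1024, 10000.0),
    JobRecord("metadata_heavy_app", 512, 100.0),
]
io_app_decision = decide_allocation("io_app", 1024, history)
new_app_decision = decide_allocation("new_app", 100, history)
disruptive_decision = decide_allocation("metadata_heavy_app", 512, history)
```

Under these assumed numbers, the I/O-intensive job is granted more forwarding nodes than the static mapping would provide, an application with no history keeps the default, and the application flagged as disruptive is marked for dedicated nodes. A real deployment would derive such thresholds from measured system behaviour rather than fixed constants.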

FAST '19 Open Access Sponsored by NetApp

BibTeX
@inproceedings {227796,
author = {Xu Ji and Bin Yang and Tianyu Zhang and Xiaosong Ma and Xiupeng Zhu and Xiyang Wang and Nosayba El-Sayed and Jidong Zhai and Weiguo Liu and Wei Xue},
title = {Automatic, {Application-Aware} {I/O} Forwarding Resource Allocation},
booktitle = {17th USENIX Conference on File and Storage Technologies (FAST 19)},
year = {2019},
isbn = {978-1-939133-09-0},
address = {Boston, MA},
pages = {265--279},
url = {https://www.usenix.org/conference/fast19/presentation/ji},
publisher = {USENIX Association},
month = feb
}
