Sayed Hadi Hashemi, University of Illinois at Urbana-Champaign and National Center for Supercomputing Applications; Paul Rausch; Benjamin Rabe, University of Illinois at Urbana-Champaign and National Center for Supercomputing Applications; Kuan-Yen Chou, University of Illinois at Urbana-Champaign; Simeng Liu, University of Illinois at Urbana-Champaign and National Center for Supercomputing Applications; Volodymyr Kindratenko, National Center for Supercomputing Applications; Roy H Campbell, University of Illinois at Urbana-Champaign
The growing popularity of Deep Neural Networks (DNN) within the mainstream \cite{gartnerhype} has had a rapid transformative effect on clusters and data centers.
DNN training jobs are becoming one of the largest tenants within clusters, and often take hours to weeks to complete; and even a slight performance improvement can save substantial runtime costs. Despite this fact, the DNN specific performance tuning tools are yet to keep up with the needs of the new changes in production environments.
On one hand, the existing application-agnostic resource-level tools such as top, Nvidia Nsight (for GPU utilization), IPM (for MPI network monitoring) are too limited to predict or explain the behavior and performance of a job accurately. In DNN applications, there exists a complex relationship among resources. Even though measuring coarse metrics such as bandwidth, latency, and GPU/CPU utilization can draw an overall picture of cluster performance, these metrics are not easily translatable to application-level metrics and do not provide actionable insights on how to handle performance bottlenecks.
On the other hand, the short list of application-aware tools, such as MLModelScope \cite{dakkak2018mlmodelscope}, TensorBoard \cite{tensorboard}, and \texttt{tf.RunOptions} \cite{tensorflow-trace}, while able to provide actionable insights, are mainly designed for the need of application developers and are not intended for production use. Such tools require substantial modification to applications, and early planning as to what, when and how data should be collected.
In this article, we introduce \texttt{tensorflow-tracing}~to fill the gap between these two classes of performance tuning tools. To achieve this goal, \texttt{tensorflow-tracing}~addresses the following technical challenges:
\begin{itemize}[noitemsep,topsep=0pt,leftmargin=*] \item Collecting the application-level runtime metrics, such as the timing of each operation or the iteration time, needs explicitly expressed in the training job source code. To makes it possible to trace ML jobs without requiring any application modification, \texttt{tensorflow-tracing}~ \textit{monkeypatches} the \texttt{tensorflow} library at the system level. \item Collecting some metrics is expensive and have a significant overhead on the runtime. \texttt{tensorflow-tracing}~treats metrics differently; it collects low-overhead metrics automatically, while expensive ones are collected on demand through an admin interface. \item There is no easy way to exchange runtime metrics among users and admins --- \texttt{tensorflow-tracing}~facilities this through a portable file format and supporting tools to explore these metrics offline. \end{itemize}
The \texttt{tensorflow-tracing}~is publicly available under \texttt{Apache-2.0} license\footnote{\url{https://github.com/xldrx/tensorflow-tracer}}. It supports native TensorFlow \cite{tensorflow}, Horovod \cite{horovod}, and IBM PowerAI \cite{powerai} applications.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Sayed Hadi Hashemi and Paul Rausch and Benjamin Rabe and Kuan-Yen Chou and Simeng Liu and Volodymyr Kindratenko and Roy H Campbell},
title = {tensorflow-tracing: A Performance Tuning Framework for Production},
booktitle = {2019 USENIX Conference on Operational Machine Learning (OpML 19)},
year = {2019},
isbn = {978-1-939133-00-7},
address = {Santa Clara, CA},
pages = {31--33},
url = {https://www.usenix.org/conference/opml19/presentation/hashemi},
publisher = {USENIX Association},
month = may
}