Interactive Debugging for Data-Intensive Scalable Computing using Data Provenance

Tyson Condie, UCLA


Data-Intensive Scalable Computing (DISC) systems are widely used to analyze large datasets. DISC programs are written in a domain-specific language and submitted as jobs for execution on a cluster of machines. Today, DISC users have limited visibility into the logical operations of their jobs during execution. As a result, DISC programmers must resort to rudimentary methods, such as trial-and-error debugging, to debug their program logic. BigDebug is our effort to fill this visibility gap by providing an interactive debugging toolkit for Apache Spark. Interestingly, many of the features in BigDebug stem from the use of data provenance. In this talk, I will present BigDebug and Titian, which augments Apache Spark with interactive data provenance query capabilities. More information on the BigDebug project can be found here:

@conference {204304,
title = {Interactive Debugging for {Data-Intensive} Scalable Computing using Data Provenance},
year = {2017},
address = {Seattle, WA},
publisher = {USENIX Association},
month = jun