Cybersecurity Research Datasets: Taxonomy and Empirical Analysis


Muwei Zheng, Hannah Robbins, Zimo Chai, Prakash Thapa, and Tyler Moore, The University of Tulsa


We inspect 965 cybersecurity research papers published between 2012 and 2016 to better understand how datasets are used, produced, and shared. We construct a taxonomy of the types of data created and shared, informed and validated by the examined papers. We then analyze the information gathered about these datasets. Three quarters of existing datasets used as input to research are publicly available, but just 20% of datasets created by researchers are publicly shared. Furthermore, the rate of public sharing has remained flat over time. Using a series of linear regressions, we demonstrate that researchers who make their created datasets public are rewarded with more citations to the associated papers. Hence, we conclude that an underappreciated incentive exists for researchers to share the datasets they create with the broader research community.

