Cybersecurity Research Datasets: Taxonomy and Empirical Analysis


Muwei Zheng, Hannah Robbins, Zimo Chai, Prakash Thapa, and Tyler Moore, The University of Tulsa


We inspect 965 cybersecurity research papers published between 2012 and 2016 to better understand how datasets are used, produced, and shared. We construct a taxonomy of the types of data created and shared, informed and validated by the examined papers. We then analyze the data gathered on these datasets. Three quarters of existing datasets used as input to research are publicly available, but just 20% of datasets created by researchers are publicly shared. Furthermore, the rate of public sharing has remained flat over time. Using a series of linear regressions, we demonstrate that researchers who make the datasets they create public are rewarded with more citations to the associated papers. Hence, we conclude that an under-appreciated incentive exists for researchers to share their created datasets with the broader research community.

@inproceedings {220245,
author = {Muwei Zheng and Hannah Robbins and Zimo Chai and Prakash Thapa and Tyler Moore},
title = {Cybersecurity Research Datasets: Taxonomy and Empirical Analysis},
booktitle = {11th USENIX Workshop on Cyber Security Experimentation and Test (CSET 18)},
year = {2018},
address = {Baltimore, MD},
url = {},
publisher = {USENIX Association},
month = aug