AutoLabel: Automated Fine-Grained Log Labeling for Cyber Attack Dataset Generation

Yihao Peng and Tongxin Zhang, Tsinghua University; Jieshao Lai, University of Science and Technology of China; Yuxuan Zhang, Yiming Wu, Hai Wan, and Xibin Zhao, Tsinghua University

High-quality labeled log datasets are essential for log-based cyber-security research, such as anomaly detection and forensic analysis. However, such datasets are scarce and generally not publicly accessible. Existing methods for generating labeled log datasets have several limitations: they are labor-intensive, require specialized expertise, provide inadequate support for multi-source logs, and produce coarse-grained labels.

This paper presents AutoLabel, which automates fine-grained log labeling by reducing the labeling problem to obtaining an accurate attack subgraph in a provenance graph. It modifies the environment, applications, and attack tools to generate auxiliary information during attacks. Then, from the resulting audit logs, it builds a provenance graph and leverages the auxiliary information to correlate application and traffic logs with audit logs, identify key attack-related edges, refine the graph to mitigate dependency explosion, and ultimately extract an attack subgraph for precise labeling.
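The reduction from log labeling to subgraph extraction can be illustrated with a minimal sketch. It is not AutoLabel's implementation: the record format, the entity names, the `build_graph` and `label_events` helpers, and the choice of a plain forward traversal from a known attack-related edge are all assumptions made for illustration. In particular, it omits the auxiliary information, the multi-source log correlation, and the graph refinement that the abstract describes for mitigating dependency explosion.

```python
from collections import defaultdict, deque

# Hypothetical audit records: (event_id, source_entity, target_entity, action).
# Real audit logs carry far more fields; this shape is assumed for the sketch.
AUDIT_EVENTS = [
    (1, "proc:nginx", "proc:sh", "fork"),
    (2, "proc:sh", "file:/tmp/payload", "write"),
    (3, "proc:sh", "proc:payload", "exec"),
    (4, "proc:cron", "file:/var/log/syslog", "write"),
    (5, "proc:payload", "sock:10.0.0.5:4444", "connect"),
]


def build_graph(events):
    """Build an adjacency list: entity -> list of (event_id, target_entity)."""
    graph = defaultdict(list)
    for event_id, src, dst, _action in events:
        graph[src].append((event_id, dst))
    return graph


def label_events(events, seed_event_ids):
    """Label every event reachable forward from the seed edges as malicious.

    The seed edges stand in for the "key attack-related edges" mentioned in
    the abstract; how they are identified is outside this sketch.
    """
    graph = build_graph(events)
    by_id = {event[0]: event for event in events}
    malicious = set(seed_event_ids)
    # Start the traversal at the target entity of each seed edge.
    queue = deque(by_id[event_id][2] for event_id in seed_event_ids)
    visited = set(queue)
    while queue:
        node = queue.popleft()
        for event_id, dst in graph[node]:
            malicious.add(event_id)
            if dst not in visited:
                visited.add(dst)
                queue.append(dst)
    return {
        event[0]: ("malicious" if event[0] in malicious else "benign")
        for event in events
    }


labels = label_events(AUDIT_EVENTS, seed_event_ids=[1])
for event_id, label in sorted(labels.items()):
    print(event_id, label)
```

In this toy input, the events descending from the seed edge are labeled malicious and the unrelated cron write stays benign. An unrefined traversal like this one over-labels on real systems, where long-lived processes link unrelated activity; that over-labeling is the dependency explosion the paper's refinement step targets.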

Experiments in 29 scenarios, including 25 real CVE vulnerabilities across 12 widely used applications (spanning 5 programming languages) plus a Sandworm threat simulation by MITRE CTID, show that AutoLabel achieves 100% labeling accuracy, substantially reduces manual log-analysis effort, and produces labeled datasets in no more than 96 minutes per scenario. AutoLabel has generated over 580 datasets that can serve as benchmarks.

Category: Short Presentation

BibTeX
@inproceedings {309536,
author = {Yihao Peng and Tongxin Zhang and Jieshao Lai and Yuxuan Zhang and Yiming Wu and Hai Wan and Xibin Zhao},
title = {{AutoLabel}: Automated {Fine-Grained} Log Labeling for Cyber Attack Dataset Generation},
booktitle = {34th USENIX Security Symposium (USENIX Security 25)},
year = {2025},
isbn = {978-1-939133-52-6},
address = {Seattle, WA},
pages = {547--566},
url = {https://www.usenix.org/conference/usenixsecurity25/presentation/peng-yihao},
publisher = {USENIX Association},
month = aug
}
