Fabrício Ceschin, Federal University of Paraná (UFPR), Brazil
Machine Learning (ML) has been widely applied to cybersecurity and is currently considered state-of-the-art for solving many open issues in that field. However, it is challenging to evaluate how good the produced solutions are, because security poses challenges that may not appear in other areas and that can make a solution infeasible for real-world applications. For instance, a phishing detection model that does not account for a non-stationary distribution would not work, given that 68% of the phishing emails blocked by Gmail differ from day to day. In this talk, I will discuss some of the challenges of applying ML to cybersecurity, which include: (i) dataset problems, such as dataset definition, where choosing the right size is key to creating a model that is representative of the task being performed, and class imbalance, where the number of samples differs substantially between classes; (ii) adversarial machine learning and concept drift/evolution, where attackers constantly develop adversarial samples to avoid detection, which changes the concept in the data and, given the volatility of security data, renders defense solutions obsolete; and (iii) evaluation problems, such as delayed labels, where new data do not have ground-truth labels available right after collection, producing a gap between data collection, labeling, and model training/testing. My goal is to point out directions for future cybersecurity researchers and practitioners applying ML to their problems. Finally, for each challenge described, I will show how existing solutions may fail under certain circumstances and, when appropriate, propose possible fixes.
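As a rough illustration of the delayed-label gap mentioned in (iii), the Python sketch below partitions a stream of samples by collection day. It is not taken from the talk: the names `Sample` and `delayed_label_split`, and the parameters `train_day` and `label_delay`, are hypothetical and chosen only for this example. It assumes every label becomes available a fixed number of days after collection, which real security data rarely guarantees.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    collected_day: int  # day on which the sample was collected
    label: int          # ground truth, assumed known only label_delay days later


def delayed_label_split(
    samples: List[Sample], train_day: int, label_delay: int
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Partition samples for a model that is trained on train_day.

    train:     samples whose labels have already arrived by train_day
    unlabeled: samples collected by train_day whose labels are still pending
    test:      samples collected after train_day (the model's future)
    """
    train = [s for s in samples if s.collected_day + label_delay <= train_day]
    unlabeled = [
        s for s in samples
        if s.collected_day <= train_day < s.collected_day + label_delay
    ]
    test = [s for s in samples if s.collected_day > train_day]
    return train, unlabeled, test


# Toy stream: one sample per day for ten days, with alternating labels.
samples = [Sample(collected_day=day, label=day % 2) for day in range(10)]
train, unlabeled, test = delayed_label_split(samples, train_day=6, label_delay=3)

print([s.collected_day for s in train])      # days usable for training
print([s.collected_day for s in unlabeled])  # the delayed-label gap
print([s.collected_day for s in test])       # days used for evaluation
```

An evaluation that ignores the middle group and trains on everything collected up to `train_day` would implicitly assume labels that were not yet available at that point, which is one way results can look better than they would in deployment.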
author = {Fabr{\'\i}cio Ceschin},
title = {Spotting the Differences: Quirks of Machine Learning (in) Security},
year = {2023},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = jan
}