Check out the new USENIX Web site. next up previous
Next: Data Set Up: MEF: Malicious Email Filter Previous: Monitoring the Propagation of

Methodology for Building Data Mining Detection Models

We gathered a large set of programs from public sources and defined a learning problem with two classes: malicious and benign executables. Each example in the data set is a Windows or MS-DOS format executable, although the framework we present is applicable to other formats. To standardize our data-set, we used MacAfee's [5] virus scanner and labeled our programs as either malicious or benign executables.

We split the dataset into two subsets: the training set and the test set. The data mining algorithms used the training set while generating the rule sets, and after training we used a test set to test the accuracy of the classifiers on a set of unseen examples.


Matthew G. Schultz