The classifier we incorporated into Procmail was a Naive Bayes classifier . A naive Bayes classifier computes the likelihood that a program is malicious given the features that are contained in the program. We assumed that there were similar byte sequences in malicious executables that differentiated them from benign programs, and the class of benign programs had similar sequences that differentiated them from the malicious executables.
The model output by the naive Bayes algorithm labels examples based on the byte strings that they contain. For instance, if a program contained a significant number of malicious byte sequences and a few or no benign sequences, then it labels that binary as malicious. Likewise, a binary that was composed of many benign features and a smaller number of malicious features is labeled benign by the system.
The naive Bayes algorithm computed the probability that a given feature was malicious, and the probability that a feature was benign by computing statistics on the set of training data. Then to predict whether a binary, or collection of hex strings was malicious or benign, those probabilities were computed for each hex string in the binary, and then the Naive Bayes independence assumption was used. The independence assumption was applied in order to efficiently compute the probability that a binary was malicious and the probability that the binary was benign.
One draw back of the naive Bayes method was that it requires more than 1 GB of RAM to generate its detection model. To make the algorithm more efficient we divided the problem into smaller pieces that would fit in memory and generated a classifier for each of the subproblems. The subproblem was to classify based on every 16 subsets of the data organized according to the first letter of the hex string.
For this we trained sixteen Naive Bayes classifiers so that all hex strings were trained on. For example, one classifier trained on all hex strings starting with an ``A'', and another on all hex strings starting with a ``0''. This was done 16 times and then a voting algorithm was then used to combine their outputs.
A more thorough description along with an example, can be found in a companion paper .