A serious security risk today is the propagation of malicious executables through email attachments. A malicious executable is defined to be a program that performs a malicious act, such as compromising a system's security, damaging a system, or obtaining sensitive information without the user's permission. Recently there have been some high-profile incidents involving malicious email attachments, such as the ILOVEYOU virus and its clones. These malicious attachments caused significant damage in a short time. The Malicious Email Filter (MEF) project provides a tool for the protection of systems against malicious email attachments.
An email filter that operates within a mail server to detect malicious Windows binaries has many advantages. Operating from a mail server, the email filter could automatically filter the email each host receives. The mail server could either wrap the malicious email with a warning addressed to the user, or it could block the email, depending upon the server's settings. All of this could be done without the server's users having to scan attachments themselves or having to download updates for their virus scanners. This way the system administrator can be responsible for updating the filter instead of relying on end users. We present such a system on a UNIX implementation of sendmail using procmail.
The standard approach to protecting against malicious emails is to use a virus scanner. Commercial virus scanners can effectively detect known malicious executables, but unfortunately they cannot reliably detect unknown malicious executables. The reason for this is that most of these virus scanners are signature-based. For each known malicious binary, the scanner contains a byte sequence that identifies the malicious binary. However, an unknown malicious binary, one without a pre-existing signature, will most likely go undetected.
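To make this limitation concrete, the following minimal sketch illustrates the general idea of signature matching. It is an illustration only and not the implementation of any commercial scanner; the signature names and byte values are invented for the example.

```python
# Hypothetical signature database: one identifying byte sequence per
# known malicious binary. Names and byte values are made up.
SIGNATURES = {
    "example-virus-a": bytes.fromhex("deadbeef0102"),
    "example-virus-b": b"\x4d\x5a\x90\x00EVIL",
}


def scan(data: bytes, signatures=SIGNATURES) -> list:
    """Return the names of all signatures whose byte sequence occurs in data."""
    return [name for name, sig in signatures.items() if sig in data]


# A binary containing a known signature is flagged.
known = b"header" + bytes.fromhex("deadbeef0102") + b"tail"

# A variant that differs in a single byte of the signature is missed,
# because no stored byte sequence matches it exactly.
variant = b"header" + bytes.fromhex("deadbeef0103") + b"tail"

known_hits = scan(known)
variant_hits = scan(variant)
```

In this sketch the known binary is matched, while the one-byte variant produces no match at all, which is the failure mode described above.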
We built upon preliminary research at Columbia University on data mining methods to detect malicious binaries. The idea is that by using data mining, knowledge of known malicious executables can be generalized to detect unknown malicious executables. Data mining methods are ideal for this purpose because they detect patterns in large amounts of data, such as byte code, and use these patterns to detect future instances in similar data along with detecting known instances. Our framework used classifiers to detect malicious executables. A classifier is a rule set, or detection model, generated by the data mining algorithm that was trained over a given set of training data.
The goal of this paper is to describe a data mining-based filter which integrates with Procmail's pre-existing security filter to detect malicious executables. The MEF system is an application of more theoretical research into this problem. The data mining-based detection system within MEF is a preliminary system that will become more accurate and efficient as our research progresses and new data sets are analyzed. It uses a scoring system based on a data mining classifier to determine whether or not an attachment may be malicious. If an attachment's score is above a certain threshold, it is considered malicious.
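The threshold rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the MEF implementation: the names `Action` and `decide`, the `block_malicious` setting, and the numeric values are hypothetical, and the score is assumed to have been computed already by the classifier.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"  # attachment is passed through unchanged
    WRAP = "wrap"        # email is delivered wrapped in a warning
    BLOCK = "block"      # email is not delivered


def decide(score: float, threshold: float, block_malicious: bool) -> Action:
    """Map a classifier score to a filter action.

    A score above the threshold marks the attachment as malicious; the
    (hypothetical) server setting then chooses between wrapping and blocking.
    """
    if score > threshold:
        return Action.BLOCK if block_malicious else Action.WRAP
    return Action.DELIVER


# Example values only; the real score range and threshold are not assumed here.
wrapped = decide(score=0.9, threshold=0.5, block_malicious=False)
blocked = decide(score=0.9, threshold=0.5, block_malicious=True)
delivered = decide(score=0.2, threshold=0.5, block_malicious=True)
```

Keeping the threshold as a parameter rather than a constant reflects that it is a tunable setting of the filter.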
This work expanded upon Procmail's pre-existing filter, which already defangs active-content HTML tags to protect users who read their mail from a web browser or HTML-enabled mail client. Also, if the attachment is labeled as malicious, the system ``mangles'' the attachment name to prevent the mail client from automatically executing the attachment. It also has built-in security filters that check for conditions such as long filenames in attachments and long MIME headers, which may crash some clients or allow them to be exploited. This filter lacks the ability to automatically update its list of known malicious executables, leaving the system vulnerable to attacks by new and unknown viruses. Furthermore, its evaluation of an attachment is based solely on the name of the executable and not the contents of the attachment itself. We replaced this signature-based detection algorithm with our data mining classifier, which added the ability to detect both the set of known malicious binaries and a set of previously unseen, but similar, malicious binaries. Although the MEF implementation was designed for the data mining-based detection system, any method to evaluate binaries, including a standard signature-based scanner, can be used.
Since the methods and classifier models we describe are probabilistic, we provide a means of determining whether a binary is borderline. A borderline binary is a program that has similar probabilities for both classes (i.e., it could be either a malicious executable or a benign program). As a parameter of the filter, the system administrator may specify what a borderline case is. Guidelines on how to set this parameter are described in detail later. For a borderline case, in addition to the option of wrapping it as a malicious program, the network filter offers an option to send a copy of the executable to a central repository such as CERT, where it can be examined by human experts. After analysis by virus experts, the model can be updated to be more accurate by including these borderline cases.
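One simple way to express "similar probabilities for both classes" is a margin test, sketched below. This is an assumption made for illustration, not the definition used by MEF: the function name `is_borderline`, the `margin` parameter, and the premise that the two class probabilities sum to one are all hypothetical.

```python
def is_borderline(p_malicious: float, margin: float) -> bool:
    """Return True when the two class probabilities differ by at most margin.

    Assumes a two-class model whose probabilities sum to 1, so that
    p_benign = 1 - p_malicious. The margin stands in for the
    administrator-specified parameter and is a hypothetical choice.
    """
    p_benign = 1.0 - p_malicious
    return abs(p_malicious - p_benign) <= margin


# With a margin of 0.1, a binary scored 0.52 malicious / 0.48 benign is
# borderline, while one scored 0.90 malicious / 0.10 benign is not.
close_call = is_borderline(0.52, margin=0.1)
clear_case = is_borderline(0.90, margin=0.1)
```

A larger margin forwards more binaries for expert review; a smaller one forwards fewer.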
The detection model generation works as follows. The binaries are first statically analyzed to extract byte-sequences, and then the classifiers are generated by analyzing a subset of the data. Then the classifier (or detection model) is tested on a set of previously unseen data. We implemented a traditional, signature-based algorithm to compare its performance with the data mining algorithms. Using standard statistical cross-validation techniques, the data mining-based framework for malicious binary detection had a detection rate of 97.76%, over double the detection rate of a signature-based scanner.
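The first and last steps of this pipeline can be sketched as follows. The sketch is illustrative only: the sequence length `n`, the number of folds `k`, and the function names are assumptions not taken from our system, and the classifier training step itself is omitted.

```python
def byte_ngrams(data: bytes, n: int = 2) -> set:
    """Return the set of overlapping n-byte sequences found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def k_fold_splits(items: list, k: int):
    """Yield (train, test) partitions so that each item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


# Static feature extraction from a toy "binary" (no execution involved).
features = byte_ngrams(b"abcab", n=2)

# Cross-validation over six placeholder sample identifiers with k = 3:
# a model would be generated from each train list and evaluated on the
# matching test list, which it has not seen during training.
splits = list(k_fold_splits(list(range(6)), k=3))
```

Each sample appears in exactly one test partition, so every reported detection is made on data the corresponding model was not trained on.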
The organization of the paper is as follows. In Section 2, we present the system features and their integration with Procmail. In Section 3, we detail the methods that are employed to track the propagation of malicious attachments. Section 4 describes how the detection algorithms work, and their results. Section 5 discusses the system's performance, and Sections 6 and 7 conclude the paper and discuss future work.