Borderline binaries are binaries whose estimated probabilities of being benign and malicious are similar (e.g., a 50% chance of being malicious and a 50% chance of being benign). These binaries are important to track because they are likely to be mislabeled, so they should be included in the training set. To facilitate this, the system archives the borderline cases, and at periodic intervals the system administrator sends the collected borderline binaries back to a central server.
Once at the central repository, these binaries can be analyzed by experts to determine whether they are malicious, and then included in future versions of the detection models. Any binary that is determined to be a borderline case is forwarded to the repository and wrapped with a warning as though it were a malicious attachment.
A simple metric for detecting borderline cases and redirecting them to an evaluation party is to define a borderline case as one where the difference between the probability that it is malicious and the probability that it is benign is below a threshold. This threshold is set according to the policies of the host.
For example, in a secure setting the threshold could be set at 20%. In this case, all binaries with a 60/40 split or closer are labeled as borderline. In other words, a binary with a 60% chance (according to the model) of being malicious and a 40% chance of being benign would be labeled borderline, and vice versa. The threshold can be set by the system administrator or left at the default of 2.5%, which corresponds to a 51.25/48.75 split.
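The threshold test can be sketched in a few lines of Python. This is only an illustration, not the system's actual implementation: the function name `is_borderline`, the constant `DEFAULT_THRESHOLD`, and the assumption that the two probabilities sum to 1 are all introduced here. A binary is treated as borderline when the gap between the two probabilities is at or within the threshold, which matches the 60/40 example.

```python
import math

# Hypothetical name; 0.025 mirrors the default 51.25/48.75 split in the text.
DEFAULT_THRESHOLD = 0.025


def is_borderline(p_malicious: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True if the model's verdict is too close to call.

    Assumes the model outputs a single probability of maliciousness and
    that the probability of being benign is its complement.
    """
    if not 0.0 <= p_malicious <= 1.0:
        raise ValueError("p_malicious must be in [0, 1]")
    p_benign = 1.0 - p_malicious
    gap = abs(p_malicious - p_benign)
    # isclose guards against floating-point noise exactly at the boundary.
    return gap <= threshold or math.isclose(gap, threshold)


# Secure setting from the text: a 20% threshold flags a 60/40 split.
print(is_borderline(0.60, threshold=0.20))
# Under the default threshold the same binary is not borderline.
print(is_borderline(0.60))
```

A stricter host policy corresponds to a larger threshold, which sends more binaries to expert review at the cost of more warnings shown to users.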
Receiving borderline cases and updating the detection model is an important aspect of the data mining approach. The larger the data set used to generate the models, the more accurate the detection models will be. Borderline cases in particular are executables that could lower the detection and accuracy rates by being misclassified, so the models should be trained over them.