MutantX-S: Scalable Malware Clustering Based on Static Features


Xin Hu, IBM T.J. Watson Research Center; Sandeep Bhatkar and Kent Griffin, Symantec Research Labs; Kang G. Shin, University of Michigan


The current lack of automatic and speedy labeling of a large number (thousands) of malware samples seen everyday delays the generation of malware signatures and has become a major challenge for anti-virus industries. In this paper, we design, implement and evaluate a novel, scalable framework, called MutantX-S, that can efficiently cluster a large number of samples into families based on programs’ static features, i.e., code instruction sequences. MutantX-S is a unique combination of several novel techniques to address the practical challenges of malware clustering. Specifically, it exploits the instruction format of x86 architecture and represents a program as a sequence of opcodes, facilitating the extraction of N-gram features. It also exploits the hashing trick recently developed in the machine learning community to reduce the dimensionality of extracted feature vectors, thus significantly lowering the memory requirement and computation costs. Our comprehensive evaluation on a MutantX-S prototype using a database of more than 130,000 malware samples has shown its ability to correctly cluster over 80% of samples within 2 hours, achieving a good balance between accuracy and scalability. Applying MutantX-S on malware samples created at different times, we also demonstrate that MutantX-S achieves high accuracy in predicting labels for previously unknown malware.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

@inproceedings {178990,
author = {Xin Hu and Kang G. Shin and Sandeep Bhatkar and Kent Griffin},
title = {MutantX-S: Scalable Malware Clustering Based on Static Features},
booktitle = {2013 {USENIX} Annual Technical Conference ({USENIX} {ATC} 13)},
year = {2013},
isbn = {978-1-931971-01-0},
address = {San Jose, CA},
pages = {187--198},
url = {},
publisher = {{USENIX} Association},
month = jun,

Presentation Video

Download Video

Presentation Audio


0 dislikes