Code is not Natural Language: Unlock the Power of Semantics-Oriented Graph Representation for Binary Code Similarity Detection

Authors: 

Haojie He, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; Xingwei Lin, Ant Group; Ziang Weng and Ruijie Zhao, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; Shuitao Gan, Laboratory for Advanced Computing and Intelligence Engineering; Libo Chen, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; Yuede Ji, University of North Texas; Jiashui Wang, Ant Group; Zhi Xue, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University

Abstract: 

Binary code similarity detection (BCSD) has garnered significant attention in recent years due to its crucial role in various binary code-related tasks, such as vulnerability search and software plagiarism detection. Current BCSD systems are typically based on either instruction streams or control flow graphs (CFGs), and both approaches have limitations. Instruction stream-based approaches treat binary code as natural language, overlooking its well-defined semantic structures. CFG-based approaches exploit only the control flow structures, neglecting other essential aspects of code. Our key insight is that, unlike natural languages, binary code has well-defined semantic structures, including intra-instruction structures, inter-instruction relations (e.g., def-use, branches), and implicit conventions (e.g., calling conventions). Motivated by this, we carefully examine the relations and structures required to express the full semantics and expose them directly to the deep neural network through a novel semantics-oriented graph representation. Furthermore, we propose a lightweight multi-head softmax aggregator to effectively and efficiently fuse multiple aspects of the binary code. Extensive experiments show that our method significantly outperforms the state-of-the-art (e.g., in the x64-XC retrieval experiment with a pool size of 10,000, our method achieves a recall score of 184%, 220%, and 153% over Trex, GMN, and jTrans, respectively).
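To make the notion of inter-instruction relations concrete, the following is a minimal sketch of how def-use and branch edges could be collected from a short instruction listing. It is an illustration only, not the paper's graph representation: the `Insn` tuple, the `build_edges` helper, and the four-instruction program are hypothetical, and the linear scan ignores definitions that reach an instruction through a branch, which a real data-flow analysis would handle.

```python
from collections import namedtuple

# Hypothetical toy instruction record (not the paper's format):
# index, mnemonic, registers written, registers read, optional branch target index.
Insn = namedtuple("Insn", "idx op defs uses target", defaults=(None,))


def build_edges(insns):
    """Return (def_use_edges, control_edges) for a toy instruction listing.

    A def-use edge (d, u, reg) links the most recent writer d of reg to a later
    reader u. A control edge (b, t) links a branch b to its target t.
    Simplification: the scan is linear, so values flowing along branch edges
    are not tracked.
    """
    last_def = {}   # register name -> index of the latest instruction writing it
    def_use = []
    control = []
    for insn in insns:
        # Reads are resolved before writes, so "add rax, rsi" uses the old rax.
        for reg in insn.uses:
            if reg in last_def:
                def_use.append((last_def[reg], insn.idx, reg))
        for reg in insn.defs:
            last_def[reg] = insn.idx
        if insn.target is not None:
            control.append((insn.idx, insn.target))
    return def_use, control


# Assumed example program; the flags register is modeled as an explicit operand.
program = [
    Insn(0, "mov", ("rax",), ("rdi",)),
    Insn(1, "add", ("rax", "flags"), ("rax", "rsi")),
    Insn(2, "cmp", ("flags",), ("rax", "rdx")),
    Insn(3, "jne", (), ("flags",), 0),
]
def_use_edges, control_edges = build_edges(program)
```

In this sketch the conditional jump depends on `cmp` through the flags register, a relation that neither a plain token sequence nor a CFG alone makes explicit.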
