Code is not Natural Language: Unlock the Power of Semantics-Oriented Graph Representation for Binary Code Similarity Detection

The best binary code similarity detection have treated code as if it were natural language;

December 22, 2023

Research

Authors:

Article shepherded by:

Rik Farrow

Machine Learning has been used in many research products that analyze the semantic similarity between binary functions, an analysis often used for security purposes. State-of-the-art methods use either linear stream or control flow graph representations when preprocessing code for ML. We feel these approaches treat binary code too similarly to natural language, overlooking the well-defined semantic structures of binary code. We thus suggest instead using a semantically aware preprocessing. Extensive experiments show that our method significantly outperforms the state-of-the-art.

Binary code similarity detection (BCSD) is a fundamental task that determines the semantic similarity between two binary functions. It serves as a crucial component in addressing various important challenges in the field of software security. For instance, consider a scenario where we have a bunch of vulnerable binary function samples from closed-source libraries or operating systems (e.g., VxWorks) at hand, and we want to see if these vulnerabilities are present in any routers or other relevant devices. We can identify these potential 1-day vulnerabilities by first computing the semantics similarities between these vulnerable functions and all functions in the device firmwares, and then examining the matches with particularly high similarity scores.

However, detecting the semantic similarity between binary functions is challenging. On one hand, two semantically identical functions may exhibit entirely different syntax representations, particularly in cross-architecture scenarios. On the other hand, two semantically different functions can possess similar syntax representations. Thus, it is essential for a BCSD solution to understand the code semantics.

In this article, we will step towards enabling BCSD solutions to understand the semantics of binary code better.

How We Perform BCSD

Understanding the semantics of binary code is not a well-defined task that can be solved by traditional hand-written algorithms. Thus, with the development of artificial intelligence techniques, deep learning (DL) based algorithms are naturally introduced [6,10,11]. To ensure scalability, most DL-based BCSD solutions adopt a two-step framework for computing the similarities, as depicted in Figure 1. In the first step, the system learns how to perform a semantic similarity-preserving mapping from binary functions to numerical vectors (which we call embeddings). Subsequently, the similarity between these embeddings is calculated as an approximation to the similarity of functions.

Figure 1: A popular framework for binary code similarity detection (BCSD).

The distinctions among these BCSD solutions primarily lie in their approaches to generating embeddings. Existing embedding methods can generally be classified into two directions, i.e., instruction stream-based and control flow graph (CFG)-based, as illustrated in Figure 2. The former relies on natural language processing (NLP) techniques and generates the embedding directly from the entire instruction stream of the target function. The latter partitions the function into basic blocks, generates embeddings from the instruction streams of these basic blocks, and then fuses the block embeddings into the final code embedding through graph neural networks on the CFG.

Figure 2: Two lines of works on code semantics learning.

Is That Good Enough?

We notice that the stream representation with NLP-based methods that is adopted by the state-of-the-art works in both directions. The underlying assumption behind this prevalence is that natural languages and assembly languages (used by binary code) exhibit fundamental similarities. However, it is noteworthy that despite the resemblance between natural language sentences and instruction streams, they implicitly differ in the following three aspects:

Natural languages are designed for exchanging information while binary code is designed to match the machine architecture exactly. Natural language sentences are ambiguous and weakly structured while the binary code has well-defined structures, semantics, and conventions. For instance, accurately parsing the syntactic dependency in natural language sentences is challenging, while it is easy in binary code.
Reordering instructions or moving instructions between basic blocks is feasible, which suggests that binary code should be represented in a more flexible representation rather than sequences of instructions. In fact, instructions are often reordered and moved around basic blocks by the compiler for optimization. However, there is no well-defined way to reorder words or sentences in natural languages while preserving the semantics.
Binary code can be thought of as consisting only of verbs (instruction codes) and pronouns (register/memory references); natural languages, on the other hand, are rich in syntax. Moreover, when instructions use registers or memory slots, they indeed use the value temporarily cached inside them. The exact registers or memory slots used to cache the value are semantics-independent.

We summarize the differences in the following table:

	Natural Languages	Assembly Languages
Purpose	Exchanging information	Machine Execution
Structure	Weakly structured	Fully structured
Linearity	Naturally linear	More flexible
Composition	Abundant	Full of pronouns and verbs

These differences suggest that treating code as natural languages may be suboptimal, providing an opportunity, one we explored.

Case Study: Transformer + Instruction Stream

Let us take one more step to see how the differences between natural languages and binary code may impede the semantics learning of binary code with NLP-based methods. NLP-based methods consume a sequence of tokens. Abstractly, each token contains two aspects of information, i.e., its token content and its position in the sequence. Take the most famous Transformer-based [9] models as an example. For each input token, its content as well as its position in the sequence are first transformed into two embedding vectors respectively and then summed, as shown in Figure 3a. And the resulting embedding set of tokens is fed to the subsequent networks. Thus, modifying either token content or token position results in different inputs of NLP-based models. If such modification does not change code semantics, these models need to learn from it.

Figure 3: Three semantically equivalent variants of a simple instruction stream snippet.

Figure 3 shows three trivial semantically equivalent transformations that the sequence representation imposes on NLP-based models to learn:

• Transformation-1, the first two instructions of this code snippet can be swapped without modifying the code semantics. However, this transformation results in changes in the produced embedding sets. For instance, the token embedding of the token LOAD is now added with the position embedding of the index 5 (as shown in Figure 3b) while it is originally added with the position embedding of the index 2. Thus, the model needs to learn that the order of some instructions can be adjusted without modifying the semantics while others cannot.

• Transformation-2, the entire code snippet is placed at a different position in the sequence, e.g., due to the insertion of some dummy instructions at the beginning of the sequence. In figure 3c, the position embeddings of all tokens are changed, resulting in a totally different embedding set. The model needs to learn that the same sub-sequence of tokens in different positions are semantically equal.

• Transformation-3, all the uses of the register r0 are substituted by a previously unused register r2. The model needs to learn that the choice of exact registers used are semantics-independent. On the contrary, the data flow passing through registers is an integral aspect of code semantics.

For these transformation rules to be learned by models and extended to real-world cases, a lot of related samples need to be learned. For instance, if only code snippets in Figure 3a and Figure 3c are given, models may learn that the given code snippet at position 0 and position k are semantically equivalent, while it is not necessary for them to learn if similar rules hold when another code snippet is given or the snippet is placed at another position. In addition, it costs extra network parameters and layers. Overall, learning these additional semantics makes the NLP-based approaches more expensive and more difficult to generalize.

Our Ideal Representation for Semantic Learning

The semantics-preserving transformations listed above provide several hints on properties that an ideal binary code representation should possess:

The representation should not assign a position identifier to each instruction or token, as the semantics is independent of position. Therefore, a graph representation would be more appropriate.
The representation should not fully adopt the execution order to restrict instructions since a significant portion of execution order restrictions is machine-dependent but semantics-independent, which suggests a more flexible control flow representation.
Some token contents are independent of semantics, while the relations between instructions or tokens carry the semantics. We thus aim to eliminate the semantics-independent tokens and model semantic relations.

To sum up, we want a flexible graph representation for binary code with its edges (representing relations between tokens) emphasized and semantics-independent elements purged. We name such a graph representation a semantics-oriented graph (SOG). We will try to provide an intuitive explanation of what a SOG is by how it is constructed. If you are interested in its technical details, please refer to [2].

Before continuing, we would like to introduce a toy intermediate representation, which is mostly self-explanatory. To avoid unnecessary complexity, we present ideas based on the toy IR. As shown in Figure 4, the toy IR includes four types of tokens, i.e. registers which are of the form ri (e.g. r1), instruction operators which appear as the first token of the instruction or the first token on the right of the equation symbol (e.g. STORE, CALL), integer literals and labels (e.g. L1). Labels mark the start of basic blocks and are used by branch instructions. The exact semantics of toy IR instructions are not our current concern.

Figure 4: A proof-of-concept example of the step-by-step lifting of code from the linear representation to the Semantics-Oriented Graph.

We model the construction of SOG in three steps:

First, we lift the sequence of instructions into a graph representation and reveal the relations between instructions as edges. Figure 4b shows an example of such a representation lifted from the code snippet in Figure 4a. This graph can be obtained by enhancing the data flow graph (DFG) with additional control flow (execution flow) and effect flow (necessary execution order) edges. We name this representation the instruction-based semantics-complete Graph (ISCG), as it carries exactly the same semantic information as the original linear representation and takes instructions as nodes. One interesting thing is that we do not mark which basic block an instruction belongs to, because instructions can actually float over basic blocks without changing the semantics, as long as these three types of relations are respected.

Second, we reveal intra-instruction structures by splitting instructions into tokens and translating the structures into edges between tokens. This step simplifies the generation of node embeddings and eases the elimination of semantics-independent elements. The intra-instruction structures can be interpreted as another kind of data relations. For example, we can interpret the instruction r0 = LOAD 0x1000 as the LOAD token using the 0x1000 token and the r0 token using the processed result of the LOAD token. By further refining the inter-instruction relations on the exact instruction tokens which introduce them (i.e., control and effect relations are defined on instruction operators, while data relations are defined on the register tokens or others that temporarily cache data), and labeling the positions of the operands on the edges, we obtain the graph shown in Figure 4c. Edges with omitted labels in this graph have the default label 1 (i.e., these relations are built on their first operands). Figure 4c reveals semantically related inter-instruction relations and intra-instruction structures. We name this representation the token-based semantics-complete graph (TSCG).

Third, since the choice of temporary stores (e.g., registers or stack slots) to cache data between instructions is semantically independent, we further remove these nodes and connect their inputs and outputs directly, forming Figure 4d, the final representation that this study targets.

Ablation Study

To demonstrate the efficiency of the SOG representation for BCSD, we select several promising representations from previous work that are not surpassed by others:

• CFG-opc200 is a CFG based binary function representation using the opc200 manually crafted features as basic block attributes. This representation is proposed by Marcelli et al. [4] and shows advantages over previous Word2vec [7] based methods [5].

• CFG-PalmTree aggregates unsupervised instruction embeddings generated by the PalmTree model [3] as the basic block embedding. PalmTree is the state-of-the-art unsupervised instruction embedding network. We follow the practice of the original paper to use mean pooling as the aggregator to obtain basic block embedding. This baseline only supports the x86 architecture.

• CFG-HBMP use HBMP model [8] to compute the block embedding end-to-end. This method is proposed by Yu et al. [13] and performs better than previous pre-training methods [5,12] using Bert [1] or Word2vec [7].

In addition, we compare the SOG representation with straw-man representations proposed in the last section:

• P-DFG (the `P-' prefix refers to `Pcode-based') takes instructions as nodes and def-use relations between instructions as edges. We build the DFG based on Ghidra's Pcode IR and generate instruction embeddings through a bidirectional GRU layer end-to-end.

• P-CDFG is similar to the P-DFG representation except that it integrates the control flow relations as additional edges.

• P-ISCG is the first semantics-complete graph proposed in the last section. The difference between ISCG and CDFG lies in the introduction of effect flow edges. An example of ISCG can be found in Figure 4b.

• P-TSCG is the second semantics-complete graph proposed. Compared to the ISCG, it additionally reveals intra-instruction structures. Figure 4c shows an example of TSCG.

To compare with those baselines, we keep other parts of our solution, tune only the hyper-parameters tightly related to these methods, and report the best results found. Specifically, we tune only the hyper-parameters of graph encoders and the batch sizes within the constraint of GPU memory (10 GiB). Observing that the result scores of some baselines are close, we repeat experiments 10 times and report mean values to mitigate the effect of randomness.

Following the common practice of previous work [4,10,12], we evaluate our method on the binary function retrieval task, i.e. given a query function f and a pool of N different binary functions, our goal is to retrieve the only ground truth function that is compiled from the same source code as the query function.

Table 1 shows the results of the ablation study. The first section of the table contains the results of baseline representations, in which the CFG-HBMP method stably outperforms the other two methods. The second section shows the results of the straw-man representations. The last section shows the results of the SOG representation.

	XA	XM		x64-XC

	100	100	10000	100	10000

CFG-OPC200	92.8/95.6	62.9/69.3	32.0/38.7	64.2/70.3	36.8/42.9
CFG-PalmTree	-	-	-	66.1/72.3	36.3/43.0
CFG-HBMP	94.3/96.6	65.3/71.9	33.7/40.6	67.2/73.0	39.0/45.5

P-DFG	93.4/96.2	69.7/76.0	37.4/44.7	69.9/75.7	42.8/49.3
P-CDFG	94.6/96.9	71.1/77.3	38.6/45.9	72.4/77.8	43.6/50.5
P-ISCG	95.1/97.2	71.3/77.2	40.1/47.2	71.9/77.1	44.2/50.9
P-TSCG	95.5/97.4	73.2/78.9	41.3/48.8	73.5/78.7	46.4/52.9

P-SOG	95.5/97.5	74.5/80.2	43.8/50.8	75.6/80.7	48.1/54.6

Table 1: Results of the ablation study. The first row lists the names of the eval- uated subtasks, where XA denotes cross-architecture and bitness retrieval, XM denotes cross-architecture, bitness, compiler, compiler version, and optimiza- tion level retrieval, and x64-XC denotes cross-compiler, compiler version, and optimization level retrieval on the dataset consisting of binary functions from the x64 architecture only. The second row lists the pool sizes (the task is to retrieve a ground truth function from a pool consisting of N negative functions and 1 ground truth function, where N is the pool size). The scores (%) are RECALL@1/MRR (Mean Reciprocal Rank).

Compared to the baseline representation in the first section of Table 1, the proposed SOG representation improves the recall by nearly 10 percent in all subtasks except XA. XA is the easiest subtask for the GNN and the graph representation-based methods, in which even the CFG-opc200 method can achieve a recall rate as high as 92.8%. And we observe that the relative performance of the SOG over the CFG-HBMP becomes better as the pool size increases from 100 to 10000 in both the XM subtask (from 9.2% to 10.1% in RECALL@1) and the x64-XC subtask (from 8.4% to 9.1% in RECALL@1).

In the second section, the RECALL@1 and MRR scores of P-DFG, P-CDFG, P-ISCG, and P-TSCG generally increase successively, demonstrating that the reveal of control flow relations, effect flow relations, and intra-instruction structures can indeed improve the efficacy. The introduction of effect flow edges slightly hurt the performance in the XC subtask when setting the pool size as 100. This may be attributed to the dirty effect problem which we discuss in Section 6 of our paper [2]. The performance in all subtasks benefits from the introduction of control flow edges and the reveal of the intra-instruction structures.

The P-DFG method outperforms the CFG-HBMP method in nearly all subtasks except XA, which demonstrates the superiority of CFGs in the cross architectures scenario and the superiority of DFGs in the cross-optimization and the cross-compiler subtasks. In addition, the superiority of the SOG over the TSCG supports the hypothesis that keeping the semantics-independent elements in the representation hurts the generalization ability of the model.

Discussion and Prospect

Recently, breakthroughs have been made in natural language understanding. Large language models have emerged as frontrunners and have been adapted and specialized for comprehending binary code. Nevertheless, we believe that assembly languages, which are designed for machine execution, are fundamentally different from natural languages. And thus, they deserve special treatment.

In this article, we propose the use of semantics-oriented graph representation, which is a concise and semantics-complete graph representation for binary code. Although we only focus on utilizing this representation for BCSD, it should also be applied to other binary code-related tasks that require understanding the code semantics. Additionally, we have also made efforts to improve the graph neural networks for this representation. However, we are of the opinion that current graph neural networks fall short of fully harnessing the potential of the SOG representation.

For more information on this work, please refer to our publication in USENIX Security 2024 [2].

Appendix

References:

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. https://arxiv.org/abs/1810.04805

[2] Haojie He, Xingwei Lin, Ziang Weng, Ruijie Zhao, Shuitao Gan, Libo Chen, Yuede Ji, Jiashui Wang, and Zhi Xue. Code is not natural language: Unlock the power of semantics-oriented graph representation for binary code similarity detection. In 33rd USENIX Security Symposium (USENIX Security 24), PHILADELPHIA, PA, August 2024. USENIX Association. https://www.usenix.org/conference/usenixsecurity24/presentation/he.

[3] Xuezixiang Li, Yu Qu, and Heng Yin. PalmTree: Learning an Assembly Language Model for Instruction Embedding. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 3236–3251, Virtual Event Republic of Korea, November 2021. ACM. https://dl.acm.org/doi/10.1145/3460120.3484587.

[4] Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, and Davide Balzarotti. How machine learning is solving the binary function similarity problem. In 31st USENIX Security Symposium (USENIX Security 22), pages 2099–2116, Boston, MA, August 2022. USENIX Association. https://www.usenix.org/conference/usenixsecurity22/presentation/marcelli.

[5] Luca Massarelli, Giuseppe A. Di Luna, Fabio Petroni, Leonardo Querzoni, and Roberto Baldoni. Investigating Graph Embedding Neural Networks with Unsupervised Features Extraction for Binary Analysis. In Proceedings 2019 Workshop on Binary Analysis Research, San Diego, CA, 2019. Internet Society. https://www.ndss-symposium.org/ndss-paper/auto-draft-40/.

[6] Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Roberto Baldoni, and Leonardo Querzoni. SAFE: Self-Attentive Function Embeddings for Binary Similarity. In Roberto Perdisci, Cl ́ementine Maurice, Giorgio Giacinto, and Magnus Almgren, editors, Detection of Intrusions and Malware, and Vulnerability Assessment, volume 11543, pages 309–329. Springer International Publishing, Cham, 2019. http://link.springer.com/10.1007/978-3-030-22038-9_15.

[7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c492....

[8] Aarne Talman, Anssi Yli-Jyr ̈a, and J ̈org Tiedemann. Sentence Embeddings in NLI with Iterative Refinement Encoders. Natural Language Engineering, 25(4):467–482, July 2019. http://arxiv.org/abs/1808.08762.

[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547de....

[10] Hao Wang, Wenjie Qu, Gilad Katz, Wenyu Zhu, Zeyu Gao, Han Qiu, Jianwei Zhuge, and Chao Zhang. jTrans: jump-aware transformer for binary code similarity detection. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1–13, Virtual South Korea, July 2022. ACM. https://dl.acm.org/doi/10.1145/3533767.3534367.

[11] Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 363–376, Dallas Texas USA, October 2017. ACM. https://dl.acm.org/doi/10.1145/3133956.3134018.

[12] Zeping Yu, Rui Cao, Qiyi Tang, Sen Nie, Junzhou Huang, and Shi Wu. Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):1145–1152, April 2020. https://aaai.org/ojs/index.php/AAAI/article/view/5466.

[13] Zeping Yu, Wenxin Zheng, Jiaqi Wang, Qiyi Tang, Sen Nie, and Shi Wu. CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching. NeurIPS, page 12, 2020. https://proceedings.neurips.cc/paper/2020/hash/285f89b802bcb2651801455c8....

Article Categories:

Security

AI/ML

Last updated January 26, 2024

Authors:

Haojie He is a master's student at the School of Electronic Information and Electrical Engineering at Shanghai Jiao Tong University (SJTU). He has a broad interest in software security and programming languages. His recent research has focused on binary analysis.

[email protected]

Ziang Weng is a master's student at the School of Electronic Information and Electrical Engineering at Shanghai Jiao Tong University (SJTU). His research focuses on software security and AI for security.

[email protected]

Libo Chen has been a senior engineer with the School of Electronic Information and Electrical Engineering at Shanghai Jiao Tong University (SJTU) since 2018. His research mainly focuses on software security, embedded system security, and AI for security.

[email protected]