Disassembly Engine Implementation

Next: Binary Instrumentation Up: Binary Disassembly Previous: Disassembly Challenges

Disassembly Engine Implementation

Our disassembly engine is built on the x86 instruction set parsing and disassembly capabilities of an existing disassembler [13]. We use a combination of well-known disassembly techniques, viz. recursive traversal and linear sweep (described briefly in the previous subsection 3.1.1) and complement them with compiler-independent pattern matching heuristics. We assume that the data expected in a code section are typically dispatch tables (address bytes), strings and compiler alignment bytes. Since the goal of this project is to insert protection code into every procedure of the input binary, we should identify as many code bytes as possible; otherwise the transformed binary may have security holes. However, we place maintenance of original program semantics at a higher priority than security, so whenever in doubt we mark bytes as data instead of code, thus avoiding unsafe binary instrumentation. The following is a step-by-step description of the disassembly process:

Identify potential address bytes for dispatch table discovery and strings. Dispatch tables typically contain code section addresses. Since we know the address range of the code section, we can mark any sequence of 4 bytes, which have a value that lies within this address range as 'potential address' bytes. Sequences of printable characters that have a certain minimum length and are terminated by a null character are marked as 'potential strings.'
Starting from the program's main entry point, which is obtained from the input binary's PE header [17], we perform a control flow analysis on the binary to traverse the paths of the program's control flow graph. All code bytes identified in this step are marked as 'definitely code' and all associated data bytes marked as 'definitely data'. We also identify targets of CALL instructions as function entry points and targets of conditional and unconditional jump instructions as jump targets. Since this step can distinguish data from code with 100% accuracy, it overrides analysis results from other steps whenever there is a conflict. For example, the following byte sequence will be identified as a call instruction because the result from Step (2) overrides that of Step (1).
```
Identified as instruction 
'call dword[0x30001344]' in Step (2) 
<--------------------------->  
0xFF 0x15 0x44 0x13 0x00 0x30
          <-----------------> 		
          Identified as 'potential 
          address' (0x30001344) in 
          Step (1)
```
To identify the entry points of potential callback functions, for which there are no explicit call sites, we look for instruction sequences such as:
```
    push imm32
    mov reg32, imm32
```
Typically the target address of a callback function is usually passed as an argument to some function, with which the callback function is registered. Such an argument could be passed through the stack as an immediate value ( push imm32) or through a register, which contains the address value ( mov reg32, imm32; push reg32). If the byte at imm32 has not been identified as a 'potential address' or a 'potential string' in Step (1), and if it looks like a legitimate instruction starting byte, we consider it as a function entry point (although despite being a legitimate code byte it may not actually be a function entry point) and proceed disassembling the subsequent bytes as instructions.
To identify other types of functions for which there are no explicit call sites, we next look for bytes in the code section that have not been identified as code or data yet. Every time such a byte is located, we start instruction parsing if it looks like a legitimate instruction starting byte. In both Steps (3) and (4), the point where such instruction parsing begins is called a 'reset point'. Instruction parsing continues until an unconditional branch instruction ( ret or jmp) is encountered. If the result of an instruction parsing procedure is inconsistent with any previously identified byte or leads to an illegal instruction byte, the result since the reset point is revoked and all the bytes from the 'reset point' to the current position are marked as data, thus avoiding any potentially unsafe binary rewriting. After an unconditional branch we look for the next suitable 'reset point' to start the next instruction parsing attempt.
Because any sequence of instruction bytes should end with an unconditional branch instruction ( jmp or ret), we look for code sequences that end without an unconditional branch ( jmp or ret) instruction and mark such code sequences as data. This check provides a final line of defense to eliminate any potentially incorrectly identified instruction sequences. For example, in the following code sequence, Bytes 1 to 3 will be marked as data.
```
    1: mov eax, ebx 
    3: push eax
    4: data
```
Also, the byte next to an unconditional branch has to be either a data byte or if it is a code byte, it must be a branch target (as the previous instruction, being an unconditional branch, doesn't fall through). Therefore, in the case that the byte next to an unconditional branch is a code byte and has not been marked as a function entry point or a jump target, we mark it as a function entry point, even though they could just as well be targets of a branch instruction inside the same function. The motivation behind this "optimistic" identification of functions, as seen in steps 3) and 5) will be explained in subsection 4.3.

Next: Binary Instrumentation Up: Binary Disassembly Previous: Disassembly Challenges

Manish Prasad
2003-04-05