To find the valid instructions of a function (i.e., the instructions
that belong to the program), we attempt to reconstruct the function's
intra-procedural control flow graph. A control flow graph (CFG) is
defined as a directed graph
in which vertices
represent basic blocks and an edge
represents a possible flow of control from
to
. A basic block
describes a sequence of instructions without any jumps or jump targets
in the middle. More formally, a basic block is defined as a sequence
of instructions where the instruction in each position dominates, or
always executes before, all those in later positions, and no other
instruction executes between two instructions in the
sequence. Directed edges between blocks represent jumps in the control
flow, which are caused by control transfer instructions (CTIs) such as
calls, conditional and unconditional jumps, or return instructions.
The traditional approach to reconstruct the control flow graph of a function works similar to a recursive disassembler. The analysis commences at the function's start address and instructions are disassembled until a control transfer instruction is encountered. The process is then continued recursively at all jump targets that are local to the procedure and, in case of a call instruction or a conditional jump, at the address following the instruction. In case of an obfuscated binary, however, the disassembler cannot continue directly after a call instruction. In addition, many local jumps are converted into non-local jumps to addresses outside the function to blur local control flow. In most cases, the traditional approach leads to a control flow graph that covers only a small fraction of the valid instructions of the function under analysis. This claim is supported by the experimental data shown in Section 6 that includes the results for a state-of-the-art recursive disassembler.
We developed an alternative technique to extract a more complete control flow graph. The technique is composed of two phases: in the first phase, an initial control flow graph is determined. In the following phase, conflicts and ambiguities in the initial CFG are resolved. The two phases are presented in detail in the next two sections.