Next: Run-Time Overhead Up: Macro-Benchmark Results Previous: Macro-Benchmark Results

Disassembly Accuracy

The binary-rewriting RAD prototype uses control flow analysis together with a set of heuristics to distinguish code from embedded data. In general, control flow analysis is quite effective at identifying the code regions of non-interactive applications, which usually contain few call-back functions. For interactive GUI applications, however, such as those in the Microsoft Office suite, control flow analysis alone is less effective because of hidden call-back functions; these applications therefore represent the most challenging test programs for a disassembler. Table 3 shows the disassembly accuracy for three such programs: MS PowerPoint, MS FrontPage, and MS Publisher. In each case the accuracy is above 99%. We measure disassembly accuracy by manually inspecting the resulting assembly code and judging whether the instructions look ``reasonable.'' In our experience, instructions disassembled from data tend to appear out of place and are therefore easy to spot.

Table 3: Disassembly accuracy achieved as measured through manual inspection of the resulting assembly code. Higher accuracy means that more bytes are successfully disassembled.
Application     Code section size   No. of incorrectly decoded   Accuracy
                                    bytes (approximation)
MS PowerPoint   4.059 MB            2500                         99.93%
MS Publisher    2.314 MB            50                           99.99%
MS FrontPage    983 KB              900                          99.91%


Since the manual inspection method used above may not seem rigorous enough, we further verified our disassembly results against Unix applications (with sources available) that have been ported to Windows via Cygwin. We compiled them with gcc's profiling options and then analyzed them offline with gprof, comparing the functions reported by gprof against those identified by the disassembler.

Table 4: Evaluation of disassembly results by comparison with original program sources. The numbers in parentheses in the third column represent the number of ``over-identified'' functions.
Application   No. of functions   No. of functions   Miss %   No. of returns   % of returns
              (source code)      (disassembly)               unprotected      unprotected
Gzip          234                234 (80)           0        2                0.85
Wget          626                626 (140)          0        3                0.48
Apache        1191               1159 (350)         2.69     38               3.19
Whois         148                148 (15)           0        0                0
OpenSSL       2820               2812 (780)         0.283    7                0.248


Thus the percentages of both missed functions and unprotected returns in the test applications appear reasonable. The functions missed in Apache were typically functions with no absolute address references in the code section: they were invoked through a table of function pointers to which their addresses were assigned statically. The table, being a static array variable, resides in the .data section, and so do the function addresses it contains. The static call graph generated by gprof likewise shows the parents of these functions as 'unidentified'. Because the results obtained from control flow analysis are guaranteed to be correct, it is useful to measure the percentage of instructions that can be identified through control flow analysis alone; this indicates how much the other heuristics contribute, especially for GUI-intensive interactive applications. Table 5 shows, for each test application, the percentage of code-section bytes that control flow analysis can successfully detect. As expected, for non-interactive applications, which rarely use call-back functions, control flow analysis achieves a very high detection accuracy, above 97%. For interactive applications, the coverage ranges from roughly 80% to 94%. For the three programs MS PowerPoint, MS FrontPage, and MS Publisher, the difference between the coverage percentages in Tables 3 and 5 represents the contribution of the pattern-based heuristics that binary-rewriting RAD employs to the total code region coverage.

Table 5: Column 2 shows the percentage of a program's code section bytes that is detected purely through control flow analysis. Column 3 shows the percentage of indirect branch instructions among all the branch instructions. RET instructions are not included in this count.
Application % of code section covered by control flow analysis % of indirect branch instructions
WFtpd (Ftp server) 97.13% 5.81%
BIND (DNS server) 97.32% 5.42%
MS Access 84.57% 8.29%
Notepad 97.54% 1.73%
MS Powerpoint 80.11% 7.54%
Windows Help 99.67% 1.41%
MS FrontPage 87.20% 8.98%
MS Publisher 93.86% 8.94%


Finally, because control flow analysis plays such an important role in the disassembly process, it is instructive to investigate why it cannot detect all the instructions in a program. Other than functions without explicit call sites, indirect branch instructions are the main culprit. We measured the percentage of indirect branch instructions in the test applications; the results are shown in Table 5. Again, interactive applications such as MS PowerPoint and MS Access tend to have a higher percentage of indirect branches than the others, which reflects their event-driven programming style and, correspondingly, a more extensive use of function pointers and switch statements. In summary, our disassembly results appear to be better than the previous best reported in the literature [22], which claims 99.9% precision on binaries with relocation information. However, most of those experiments were on smaller programs, all of them plain command-line programs without the GUI call-back functions that make disassembly harder. Furthermore, the presence of symbol table information in binaries (possibly left in inadvertently) eliminates the problem of function boundary identification. The symbol table format, however, varies: it can be the generic COFF symbol table supported by the PE/COFF binary format, or a compiler-specific format such as VC++'s .pdb. Compilers appear to favor their proprietary symbol table formats over generic COFF; notably, the default compilation options for generating debug information produce proprietary symbol tables rather than COFF ones.
Manish Prasad
2003-04-05