ATOM, A Flexible Interface for Building High Performance Program Analysis Tools Alan Eustace and Amitabh Srivastava Digital Equipment Corporation, Western Research Laboratory Abstract Code instrumentation is a powerful mechanism for understanding program behavior. Unfortunately, code instrumentation is extremely difficult, and therefore has been mostly relegated to building special purpose tools for use on standard industry benchmark suites. ATOM (Analysis Tools with OM) provides a very flexible and efficient code instrumentation interface that allows powerful, high performance program analysis tools to be built with very little effort. This paper illustrates this flexibility by building five complete tools that span the interests of application programmers, computer architects, and compiler writers. This flexibility does not come at the expense of performance. Because ATOM uses procedure calls as the interface between the application and the analysis routines, the performance of each tool is similar to or greatly exceeds the best known hand-crafted implementations. 1. Introduction Program analysis tools are extremely important for understanding program behavior. Computer architects need such tools to evaluate how well programs will perform on new architectures. Software writers need tools to analyze their programs and identify critical pieces of code. Compiler writers often use such tools to determine how well their instruction scheduling or branch prediction algorithm is performing or to provide input for profile-driven optimizations. Over the past decade three classes of tools for different machines and applications have been developed. The first class consists of basic block counting tools like Pixie[Pixie], Epoxie[Epoxie] and QPT[QP]. The second class consists of data and instruction address tracing tools. Pixie and QPT can also generate address traces. They communicate these traces to analysis routines through inter-process communication. Tracing on the WRL Titan[Borg] communicated with analysis routines using shared memory, but this required operating system modifications. MPTRACE[MPTRACE] is similar to Pixie but it collects traces for multiprocessors by instrumenting assembly code. ATUM [ATUM] generates address traces by modifying microcode and saves a compressed trace in a file that is analyzed offline. The third class of tools consists of simulators. Tango Lite [TANGO] supports multiprocessor simulation by instrumenting assembly language code. PROTEUS[PROTEUS] also supports multiprocessor simulation but instrumentation is done by the compiler. g88[g88] simulates Motorola 88000 using threaded interpreter techniques. Shade[Shade] uses instruction level simulation to selectively generate traces. This technique offers considerable flexibility at the expense of much lower performance. Figure 1: The ATOM Process The important features that distinguish ATOM [atompldi,atomref,atomuser] from previous systems are listed below. ATOM is a tool-building system. A diverse set of tools ranging from basic block counting to cache modeling can be easily built. ATOM provides the common infrastructure in all code-instrumenting tools, which is the cumbersome part. The user simply specifies the tool details. ATOM allows selective instrumentation. The user specifies the points in the application to be instrumented, the procedure calls to be made, and the arguments to be passed. The communication of data is through procedure calls. Information is directly passed from the application to the specified analysis routine with a procedure call instead of through interprocess communication, files on disk, or a shared buffer with central dispatch mechanism. ATOM tool overhead is proportional to the complexity of the underlying analysis. Many interesting tools can be built that have little or no impact on application performance. Even though the analysis routines run in the same address space as the application, precise information about the application is presented to analysis routines at all times. ATOM is independent of compiler and language systems. To illustrate the power and flexibility of this approach, this paper fully implements a variety of custom program analysis tools, including input/output, instruction profiling, cache simulation, dynamic memory allocation, procedure inlining profile driven optimizations, and evaluating the quality of compiled code. None of these tools takes more than 60 lines of code to implement. These tools form the basis of many of the tools that are distributed as part of the standard ATOM distribution. Digital provides ATOM as an Advanced Development Kit in OSF/1, Version 3. To illustrate the performance of these tools, each was applied to the SPEC92 tool suite. The instrumented application times are compared to the uninstrumented applications using wall clock times. 2. Implementation of ATOM ATOM is built using OM[OM], a link-time code modification system. OM takes as input a fully linked executable that has been annotated with relocation records. This file is used to build a symbolic intermediate representation which contains no hard coded addresses. This framework provides an ideal platform for inserting, rearranging, or deleting instructions. Previous papers [Addr64,UR] have explored the usefulness of this approach at performing a wide range of program optimizations. ATOM uses this same intermediate representation to implement program instrumentation. A complete description of the ATOM implementation was presented at PLDI[atompldi]. Figure 2. Read Tool Implementation Figure 2 describes the implementation. First, the OM generic object modification library is linked with a tool specific instrumentation file to produce a custom instrumenting tool. This program reads in the user application, and modifies it by adding calls to tool specific analysis procedures. ATOM completes the process by linking the instrumented application with the tool specific analysis file. The output of ATOM is a custom instrumented application executable that is run in exactly the same manner as the original application. 3. A Simple Example From a user perspective, applying an ATOM tool to an application Status: RO is done by executing the following command. atom appl.rr read.inst.c read.anal.c -o appl.read The first argument is the application program, which has been specially linked to include relocation records. The second and third arguments are the instrumentation and analysis files. In this example, we instrument the application to count and write to a file the total number of bytes read each time the instrumented application is executed. The instrumentation file is shown of the left side of Figure 2. Line 2 includes the instrument.h file which defines the ATOM primitives for manipulating application programs. Line 3 defines the Instrument procedure, which is linked with the OM object code modification library to produce a custom instrumenting tool. All analysis procedures that are called from the application program are declared and placed by the Instrument procedure. Line 5 makes use of the AddCallProto ATOM primitive to declare the name and arguments to the RecordRead analysis procedure. This procedure takes a single argument of type REGV. Arguments of type REGV are used to pass the contents of a specific processor register. Line 6 declares the PrintResult analysis procedure, which does not take any arguments. Line 7 calls the GetNamedProc primitive to return a pointer to the read procedure. Line 8 checks to see if this value is NULL, indicating that the procedure read is not defined in this application. All communication between the application program and the analysis procedures is done through procedure calls. Line 9 uses the ATOM primitive AddCallProc to add a call to the read procedure. The first argument, p is a pointer to the read procedure. The second argument, ProcAfter, specifies that the call is to be inserted after the read procedure is executed. The third argument indicates that the call is to the RecordRead analysis procedure. The remainder of the arguments are used to determine what values ATOM passes to RecordRead. In this case, the final argument passes the contents of the register RET_RES_1 to the analysis procedure. In the Alpha AXP calling convention, this register contains the contents of the value returned by the read procedure. The RecordRead procedure is shown on the right side of Figure readtool. This procedure simply adds this size to a total. Line 10 calls the AddCallProgram primitive, which adds a call to the PrintResults procedure after the application finishes executing. The corresponding analysis procedure opens a file, prints out the result, and closes the file. The definitions for both these procedures are shown on the right side of Figure 2. A sample output file is shown in the Appendix. Notice that it is important that the analysis procedure does not call the instrumented version of the read procedure, since reads that occur inside the analysis procedure must not increment the application totals. To guarantee this, library procedures that would normally be shared between the application and the analysis procedures are linked into the instrumented application twice. Only the version that is linked into the application is instrumented. This guarantees that calls made to read by the analysis procedures do not influence the statistics gathered by the read tool. Although this tool is relatively simple, it is straightforward to extend the read tool into a general input/output tool. The first extension is to add calls to analysis procedures before and after for open system calls. This allows the analysis procedures to record the name of the file opened and the file descriptor returned. By instrumenting both read and write procedures and passing the first (file descriptor) and third arguments (size in bytes), the read and write totals can be accumulated for each open file. The final extension is to use the Alpha AXP cycle counter to maintain fine grain times of how long each operation takes. This allows the tool to determine the rate of read and write operations. This extended tool is called io and is distributed with ATOM as part of the standard tool set. Figure 3. Profiling Tool Implementation 3. ATOM Primitives ATOM tools traverse an application, find interesting places to add calls to analysis procedures, and pass arguments that correspond to data or events in the application. To provide these functions, ATOM provides three types of primitives: navigation, information, and instrumentation. Navigation primitives traverse the application. The simple example presented above used the GetNamedProc primitive to find a specific procedure. Other navigational primitives traverse procedures, basic blocks within procedures, and instructions within basic blocks. A basic block is a set of sequential assembly language instructions that are not interrupted by branches or jumps. Information primitives provide static information about instructions, basic blocks, procedures, or the program. For example, given an instruction, ATOM primitives can return the program counter, the opcode, the instruction class, address displacements, the source line number, a mask of the registers used or set by the instruction, etc. Given a basic block, primitives are provided to find the number of instructions in the basic block and the starting program counter of the block. Given a procedure, primitives are provided to find the file name, stack frame size, register save and restore masks, etc. General program information includes the sizes of text and data sections, along with general statistics on the number of procedures, basic blocks and procedures in the application. Instrumentation primitives allow calls to analysis procedures to be inserted into the application before or after instructions, basic blocks, procedures. The arguments to these procedures can include any value computed by the instrumentation routine or provided by ATOM primitives. The arguments of these procedures can be constants, processor registers, effective addresses, branch condition values, arguments to application procedures, file names, line numbers, or character strings. Although not shown in these examples, ATOM also allows command line arguments to be passed to instrumentation routines. Parameters can also be passed to analysis procedures through setenv variables. 4. Instruction Profiling In this section we implement an instruction profiler based on counting the number of instructions executed in each procedure. Although it is possible to implement this tool by placing a call to an analysis procedure before every instruction in the application, ATOM's selective instrumentation can significantly reduce this overhead by instrumenting only basic blocks. For example, if a set of 10 sequential executed instructions are inside of a loop, we can keep track of the total number of instructions executed by adding 10 each time we enter the loop body. Figure 3 defines the instrumentation and analysis files for the profile tool. As in the previous section, lines 6 through 9 of the instrumentation file declares the interface to the OpenFile, ProcedureCount, ProcedurePrint, and CloseFile analysis procedures. In line 10, the AddCallProgram primitive is used to add a call to OpenFile before the application begins execution. The GetProgramInfo ATOM primitive, when passed the ProgramNumberProcs argument, returns the number of procedures in the application. The corresponding analysis procedure uses this argument to allocate sufficient memory to accumulate a count for each procedure in the application. Lines 12 through 21 navigate each procedure in the application. Within each procedure, lines 14 through 18 process each basic blocks. Line 16 calls the AddCallBlock ATOM primitive to add a call to the ProcedureCount analysis procedure. The two arguments passed are a procedure index n, and the number of instructions in the basic block. This value is returned by the GetBlockInfo primitive. The corresponding analysis procedure uses these arguments to increment the number of instructions executed by this procedure. Figure 4. Cache Tool Implementation For each procedure in the application, line 19 adds a call to the ProcedurePrint analysis procedure. ProcedurePrint is passed the unique procedure index and the name of the procedure. This name is returned by the ProcName ATOM primitive. The corresponding analysis file uses these two parameters to determine if the procedure was executed, and if so, prints the procedure name, number of instructions, and percentage of instructions executed in this procedure to a file. Notice that the effect of line 19 is to add hundreds of calls to analysis procedures to the end of the program, each with a different index and character string. Line 22 adds a call to the CloseFile analysis procedure after the application completes executing. A sample output file is included in the Appendix. Although this is a very simple profiling tool, many more interesting tools can be built using the same principles. Russell Kao built an ATOM based version of the popular tool Gprof. This tool adds procedure calls at the start of each procedure to push the name of the procedure on a procedure call stack. This stack is popped by adding a similar analysis procedure call to the procedure exit. Gprof reports the percentage of time spent in a procedure and the procedures descendants. The instrumentation procedure was also expanded to use the Alpha AXP dual issue rules to compute cycles rather than instructions executed. Many other profile based tools have also been developed. One such tool records the value of the Alpha AXP cycle counter at the start of the procedure and at the end of the procedure and computes the wall clock time spent in each procedure. 4. Cache Simulator Processor cycle times are getting faster at a much greater rate than main memory access times. This disparity has led computer architects to place a subset of main memory into one or more levels of fast, expensive cache memory[comparch]. The effectiveness of this technique is application dependent. Applications that reference the same address multiple times or that use nearby data items benefit most from the data cache. Figure 5. Dynamic Memory Tool Implementation Although it is clear that cache memory plays an increasingly important role in application performance, measuring cache performance has been relegated to a few industrial and university research reports. Almost all of these studies have focused primarily on the performance of the SPEC92 benchmark suite. This section presents a simple tool that simulates the execution of the application running in a 64K-byte direct mapped data cache with 32-byte blocks. The tool computes the total number of data cache references, the number of misses, and the miss rate. The miss rate is the number of misses divided by the number of references. The strategy used in this tool is to instrument all load and store instructions with a call to an analysis procedure called Reference which is passed the effective address. This effective address is used to simulate the application running in the cache. The cache tool implementation is shown in Figure[cachetool]. Line 5 of the instrumentation file declares the Reference analysis procedure. The type VALUE indicates that the argument does not live in a processor register, but must be computed by ATOM prior to passing the value to the analysis procedure. Lines 11 through 17 examine each instruction. Lines 13 and 14 determine if the instruction is a load or a store. If so, the AddCallInst ATOM primitive adds a call to instruction i. The InstBefore argument adds the call before the instruction. The name of the analysis procedure to call is Reference, and the argument passed is the EffAddrValue, which ATOM computes by adding the contents of the base register plus the sign extended displacement. Line 20 completes the tool by adding a call to the PrintResults procedure after the application completes execution. The analysis procedure is shown on the right side of Figure 4 This is a simple implementation of a direct mapped cache. Lines 4 defines the cache data structure, which is used to hold a cache tag for each 32 byte block in the cache. Line 5 defines the reference and miss counters. Lines 7 through 9 compute the cache tag and index, and line 11 probes the cache. If the tags do not match, a miss is recorded in line 12, and the tag is updated in line 13. In either case, the number of references is incremented. A sample output file is shown in the Appendix. To guarantee that the results properly reflect the reference pattern of the uninstrumented program, ATOM guarantees that all data items referenced in the original program are placed in exactly the same locations when the program is instrumented. To guarantee this accuracy for instruction cache simulations, ATOM converts all references to the program counter to those of the uninstrumented program before passing the contents to the analysis procedures. Although the number of hits and misses is useful to computer architects, this information has rarely been presented in a form that is useful to application programmers. By combining the instruction profile tool shown in the previous section with the cache modeling tool shown above, ATOM can create a hybrid tool that shows cache misses in a profile like format. This tool is called memsys and it is included with the standard ATOM distribution. 7. Dynamically Allocated Memory Many programs make extensive use of dynamically allocated memory. Such memory is typically allocated using the malloc system call, and deallocated using the free system call. These procedures are called thousands of times by application programs, allocating, deallocating, and reallocating the same piece of memory many times. This section presents a tool that computes the total number of bytes allocated and freed over the course of the application's execution. The implementation is shown in Figure 5. Lines 4 through 6 of the instrumentation file are used to search for procedures with the names malloc and free. If these procedures are present in the application, these library functions are replaced in lines 6 and 7 by the procedures my_malloc and my_free. The ReplaceProcedure semantics require the type and arguments of the new procedure to be identical to the original procedure calls. Figure 6. Compiler Auditing Implementation The analysis procedures prepend the size of allocated objects to each dynamically allocated element. Line 5 calls the standard version of malloc, but requests additional memory to prepend the object size. Line 6 adds this size to the total amount of allocated memory. Line 7 saves this size in the first location in the dynamically allocated memory. The pointer to the start of the requested memory is returned in line 8. Each call to free was replaced in the application by a call to my_free. In line 12, this procedure uses a negative index to access the size of the object, which it adds to the total amount deallocated by the application. Line 14 calls the standard free procedure to deallocate the memory. A sample output file is shown in the appendix. The ability to replace procedures and monitor data references is fundamental to an emerging set of tools that monitor allocations, deallocations and references to memory[purify]. Jeremy Dion and Louis Monier[third] recently completed an ambitious ATOM based tool called Third Degree, that finds and reports many kinds of reads of uninitialized memory, reads and writes to unallocated memory, array bound errors, and freeing the same object more than once. The technique used is to replace all calls to allocate and free library procedures with versions that keep track of the ranges of valid heap locations. Symbolic interpretation in the instrumentation procedures is used to significantly reduce the number of memory references that must be instrumented. The result is a very effective and efficient tool for testing the validity of memory operations. This tool is also included in the standard ATOM distribution. 8. Compiler Auditing Modern compilers implement a long list of optimizations: loop unrolling, reductions in strength, software pipelining, global register allocation, instruction rearrangement. Unfortunately, these techniques are complicated and interact in non-trivial ways. The resulting code often misses simple optimizations. Tools that evaluate the quality of the compiled code and isolate potential performance problems are called compiler auditors[audit]. This section presents a simple compiler auditing tool that adds a procedure call before each load instruction to save the contents of the destination register. Another procedure call is added after each load instruction that checks to see if the destination register was modified by the instruction. If not, the instruction loaded a value that was already in the register. These loads are termed redundant. The implementation is shown in Figure 6. This tool is similar to previous tools, with the exception of lines 17 through 24. Line 17 checks if the instruction is a load operation. If so, line 18 adds a call to the SaveLoad procedure before the instruction and passes the contents of the destination register, as returned by the GetInstRegEnum ATOM primitive. Line 20 adds a matching call to CheckLoad after the load instruction. The arguments are a unique index of the load instruction, and the new contents of the destination register. CheckLoad compares this value to the value saved by the SaveLoad analysis procedure and increments the appropriate counters. The output file contains a count of the number or redundant times each load is executed along with the number of times the load was redundant. A sample output file is shown in the appendix. This tool is very effective at detecting loop invariant instructions and unnecessary spilling and restoring of registers. In one very early version of the compiler, this tool found 8 identical sequential load instructions from the same memory location to the same destination register! Redundant loads are often caused by redundant data, and therefore are not be indicative of compiler performance bugs. This is the case in the SPEC92 benchmark hydro2d where an amazing 42 percent of the loads are redundant. This is caused by large sparsely populated arrays. 9. Performance The performance of ATOM tools is a function of the number of analysis procedure calls that are executed and the amount of work done by each call. Figure 7 shows the performance of each tool over the SPEC92 benchmark suite. Each entry reflects the wall clock time of the instrumented program divided by the wall clock time of the uninstrumented program. DynMem Read Prof Cache Audit alvinn 0.987 0.989 3.91 8.71 11.93 compress 1.007 0.980 4.14 9.37 7.52 doduc 0.990 1.008 2.91 8.30 12.05 ear 1.011 1.005 5.75 6.61 9.40 eqntott 0.995 0.997 4.18 8.12 10.84 espresso 1.050 1.006 8.92 10.66 14.49 fpppp 0.994 0.994 1.76 12.97 21.83 gcc1 1.016 1.006 5.65 8.60 10.54 hydro2d 0.987 0.991 2.40 7.42 10.51 li 1.021 1.054 6.20 11.06 12.48 mdljdp2 0.996 1.000 2.93 4.32 5.20 mdljsp2 1.020 1.027 3.18 6.13 9.37 nasa7 0.997 1.002 1.69 8.19 11.38 ora 0.931 0.931 4.22 7.88 11.65 sc 1.033 1.030 6.01 6.96 8.50 spice 1.012 1.018 3.93 7.28 9.54 su2cor 0.985 0.992 2.46 7.73 9.64 swm256 0.987 0.990 1.47 9.54 13.96 tomcatv 0.998 0.990 1.82 6.40 13.55 wave5 0.995 0.993 3.15 9.28 10.69 Average 1.000 1.004 3.61 8.27 11.23 Figure 7. Performance of Atom Tools The Dynamic Memory and Read tools have a minimal affect on application performance, since both have relatively few instrumentation points. Contrast this with the compiler auditing tool, which adds two calls to analysis procedures for each load instruction. Also notice that there is considerable variation between benchmarks for a single tool. For example, the profile tool slows down application by as little as 1.472 for swm256 and as much as 8.919 for espresso. Both instrument at basic blocks, but since the basic block size of espresso is much smaller, the instrumented application spends a larger percentage of time in the analysis procedures. These slowdowns have been acceptable for the vast majority of applications. However, in real time applications, ATOM's selective instrumentation interface can be used to minimize the instrumentation effects on program behavior. When comparing these times to other tools reported in the literature, it is important to include the time necessary to gather the data and to analyze the results. For example, many cache instrumentation tools studies report competitive times for gathering trace data into in-memory buffers, but do not include the times to empty the buffer, simulate the cache, and report the results. There are many ways to substantially increase the performance of ATOM based tools. One approach is to reduce or eliminate the analysis procedure call overhead either through inlining or other compiler optimizations. Another approach is to make use of the flexibility of the instrumentation interface to reduce the frequency of analysis procedure calls. For example, the profile tool instrumentation routine can be easily rewritten to eliminate adding calls to analysis procedures for those blocks where data flow analysis determines that the count is identical to another block that has already been instrumented. Another example is instruction translation buffer simulation. Here, ATOM based tools need only instrument branches or sequential execution that crosses page boundaries. Since these are relatively infrequent, these tools are very efficient. Another example are tools that simulate branch prediction algorithms. Rather than infer branch behavior by sifting through instruction address traces, ATOM tools instrument only conditional branches. 10. Conclusions ATOM is a unique tool for understanding program performance. The flexible interface allows a diverse set of tools to be built with minimal effort. Without the support ATOM provides, these tools would be extremely difficult to build. The performance of these tools compares favorably with hand-crafted implementations, since instrumentation is inserted only when necessary to gather statistics. Communication of data to the analysis procedures is accomplished through procedure calls, rather than relying on expensive interprocess communication. The analysis routines are always presented with information about the application program as if it was executing uninstrumented. ATOM has been applied to many commercial applications with text sizes of up to 100MB. Hundreds of tools have been written by both industrial and university users to evaluate the performance of caches, garbage collection algorithms, branch prediction, compiler optimizations, input/output, system calls, novel CPU architectures, as well as many other aspects of system performance. Brad Chen of Harvard University used modified version of ATOM to instrument the OSF/1 kernel. Acknowledgments Many people have helped us to bring ATOM to its current form. Jim Keller, Mike Burrows, Roger Cruz, John Edmondson, Mike McCallig, Dirk Meyer, Richard Swan and Mike Uhler were our first internal users, and Dirk Grunwald and Brad Calder provided our first external field test site. Jeremy Dion, Ramsey Haddad, Russell Kao, Greg Lueck, Mike McCallig, and Louis Monier contributed exciting new tools. Many, many others, provided help, support, encouragement, bug reports, flames, and endorsements! Also, Brad Chen, Ted Romer, Ramsey Haddad, Louis Monier, and anonymous USENIX reviewers provided helpful suggestions on the content of this paper. We thank you all. Appendix: Sample Outputs The Read Tool instruments an application program to determine the total number of bytes read by an application program. A sample output file is shown below. 64916 bytes In this example, the SPEC92 nasa7 benchmark read 64,916 bytes. The Profile Tool instruments an application program to compile a standard instruction profile. A sample output file is shown below. Procedure Instructions Percentage __start 148 0.000 main 107 0.000 compress 58040713 66.487 output 28907385 33.114 rindex 130 0.000 ... ... ... Total 87296502 For the SPEC92 compress benchmark, 28,907,385 instructions were executed inside the output procedure. This represented 33% of all the instructions executed in the application program. The Cache Tool instruments an application program to determine the number of references, misses, and miss rate if the application was run in a 64KB direct mapped, read/write allocate cache with 32 byte lines. A sample output file is shown below. 5387822402 620855884 11.5233 When running the SPEC92 spice benchmark, the application would have made 5,387,822,402 references, had 620,855,884 misses, for an overall miss rate of 11.5 percent. The Dynamic Memory Tool instruments an application program to the total amount of memory allocated and freed. A sample output file is shown below. 96107 8208 For the SPEC92 xlisp benchmark, 96,107 bytes were allocated and 8,208 bytes were freed. The Compiler Auditing Tool instruments an application program to find potentially redundant instructions. Here are the results when running the tool on the ear benchmark. A sample output file is shown below. PC Count Wasted ... ... ... 0x12000a2d4 188 1 0x12000a4e0 188 2 0x12000a4fc 188 188 0x12000a538 561 188 ... ... ... The instruction at 0x12000a4fc was executed 188 times and never changed the contents of the destination register. This might be worth investigating further. ATUM: Anant Agarwal, Richard Sites, and Mark Horwitz. ATUM: A New Technique for Capturing Address Traces Using Microcode. Proceedings of the 13th International Symposium on Computer Architecture, June 1986. g88: Robert Bedichek. Some Efficient Architectures Simulation Techniques. Winter 1990 USENIX Conference, January 1990. Borg: Anita Borg, R.E. Kessler, Georgia Lazana, and David Wall. Long Address Traces from RISC Machines: Generation and Analysis, Proceedings of the 17th Annual Symposium on Computer Architecture, May 1990, also available as WRL Research Report 89/14, Sep 1989. PROTEUS: Eric A. Brewer, Chrysanthos N. Dellarocas, Adrian Colbrook, and William E. Weihl. PROTEUS: A High-Performance Parallel-Architecture Simulator. MIT/LCS/TR-516, MIT, 1991. Shade: Robert F. Cmelik and David Keppel, Shade: A Fast Instruction-Set Simulator for Execution Profiling. Technical Report UWCSE 93-06-06, University of Washington. MPTRACE: Susan J. Eggers, David R. Keppel, Eric J. Koldinger, and Henry M. Levy. Techniques for Efficient Inline Tracing on a Shared-Memory Multiprocessor. SIGMETRICS Conference on Measurement and Modeling of Computer Systems, vol 8, no 1, May 1990. TANGO: Stephen R. Goldschmidt and John L. Hennessy, The Accuracy of Trace-Driven Simulations of Multiprocessors. CSL-TR-92-546, Computer Systems Laboratory, Stanford University, September 1992. purify: Reed Hastings and Bob Joyce. Fast Detection of Memory Leaks and Access Errors. Winter 1992 USENIX Conference, January 1992. comparch: John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, pp. 408-425, Morgan Kaufmann, 1990. cprog: Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language, Prentice-Hall, 1978. QP: James R. Larus and Thomas Ball. Rewriting executable files to measure program behavior. Software, Practice and Experience, vol 24, no. 2, pp 197-218, February 1994. audit: James R. Larus and Satish Chandra. Using Tracing and Dynamic Slicing to Tune Compilers. University of Wisconsin Computer Sciences Department Technical Report #1174. August, 1993 Pixie: MIPS Computer Systems, Inc. Assembly Language Programmer's Guide, 1986. third: Digital Equipment Corporation. Third Degree Reference Manual, 1993 atomref: Digital Equipment Corporation. ATOM Reference Manual, 1993 atomuser: Digital Equipment Corporation. ATOM User Manual, 1993 SRM: Richard L. Sites, ed. Alpha Architecture Reference Manual Digital Press, 1992. atompldi: Amitabh Srivastava and Alan Eustace. ATOM: A System for Building Customized Program Analysis Tools. Proceedings of the SIGPLAN'94 Conference on Programming Language Design and Implementation, June, 1994. OM: Amitabh Srivastava and David W. Wall. A Practical System for Intermodule Code Optimization at Link-Time. Journal of Programming Language, 1(1), pp 1-18, March 1993. Also available as WRL Research Report 92/6, December 1992. Addr64: Amitabh Srivastava and David W. Wall. Link-Time Optimization of Address Calculation on a 64-bit Architecture. Proceedings of the SIGPLAN'94 Conference on Programming Language Design and Implementation, to appear. Also available as WRL Research Report 94/1, February 1994. UR: Amitabh Srivastava. Unreachable procedures in object-oriented programming, ACM LOPLAS, Vol 1, #4, pp 355-364, December 1992. Also available as WRL Research Report 93/4, August 1993. Epoxie: David W. Wall. Systems for late code modification. In Robert Giegerich and Susan L. Graham, eds, Code Generation - Concepts, Tools, Techniques, pp. 275-293, Springer-Verlag, 1992. Also available as WRL Research Report 92/3, May 1992. OSF/1 is a registered trademark of the Open Software Foundation, Inc. Alpha, AXP, and DECstation are trademarks of Digital Equipment Corporation. Alan Eustace received B.S., M.S, and Ph.D. degrees in Computer Science from the University of Central Florida. Since 1986, he has been a researcher at the Digital Equipment Corporation Western Research Laboratory. His interests include VLSI design, computer architecture, and tools for understanding and improving program performance and correctness. Amitabh Srivastava received a B.Tech. in Electrical Engineering from Indian Institute of Technology, Kanpur, and his M.S. in Computer Science from Pennsylvania State University. Since 1991, he has been a researcher at the Digital Equipment Corporation Western Research Laboratory, working in compilers and link-time code modification. He is the architect of the OM link-time technology which he used to build OM and ATOM systems. Prior to that he was researcher at the Texas Instruments Central Research Labs working on Lisp machines, compilers and object-oriented programming extensions to Scheme. He is the author of the SCOOPS system. Address for correspondence: Digital Equipment Corporation Western Research Laboratory, 250 University Avenue, Palo Alto, California, 94301, or by electronic mail at eustace@wrl.dec.com or amitabh@wrl.dec.com.