Optimizing the Performance of Dynamically-Linked Programs W. Wilson Ho Wei-Chau Chang Lilian H. Leung Silicon Graphics, Inc. Abstract Dynamically-linked programs in general do not perform as well as statically-linked programs. This paper identifies three main areas that account for the performance loss. First, symbols are referenced indirectly and thus extra instructions are required. Second, the overhead in run-time symbol resolution is signifi- cant. Third, poor locality of functions in shared libraries and data structures maintained by the run-time linker may result in poor memory utilization. This paper presents new optimization techniques we developed that address these three areas and sig- nificantly improve the performance of dynamically-linked pro- grams. Also, we provide measurements of the performance improve- ment achieved. Most importantly, we show that all desirable fea- tures of shared libraries can be achieved without sacrificing performance. 1. Introduction More and more UNIX systems support dynamic linking of shared libraries [MIPS90, Arno86, Ausl90, Cout92, Ging87] because they provide many desirable features. For example, both disk space and physical memory utilization are reduced due to increased sharing, and shared libraries can be replaced transparently with- out re-linking all user programs. More in-depth discussions can be found in [Ging89, Saba90]. However, the use of dynamic shared libraries does incur a performance penalty. Dynamically-linked programs generally run slower than statically-linked programs because they incur extra run-time overhead. This overhead includes the execution of extra instructions resulted from indi- rect addressing and run-time symbol resolutions, and extra memory requirement due to poor locality of functions in shared libraries and data structures used by the run-time linker. Several optimizations have been proposed to improve the per- formance of shared libraries. Run-time overhead in indirect function calls can be improved by reducing the number of instruc- tions used in the calling sequence [Kepp93]. Symbol resolutions can be deferred and carried out on-demand to improve start-up time [Saba90]. Loading and fixing-up of shared libraries can be cached to reduce the amount of work for subsequence invocations [Nels93, Orr93]. These methods are effective in speeding up the execution of dynamically-linked programs from their corresponding initial implementations. However, there still remains signifi- cant performance degradation of these programs when compared with their statically-linked counterparts. Informal discussions on the Usenet news group comp.arch seem to suggest that 10-25% degradation is commonly observed. This paper describes a number of techniques we developed that can bring the performance of dynamically-linked programs much closer to the level of that of statically-linked programs. With these optimizations, we have built the entire IRIX system (SGI's implementation of the UNIX operating system) based on shared libraries without sacrificing performance. In fact, for most libraries, only the shared versions are provided. All opti- mization techniques described in this paper have been implemented and are available in version 5.3 of the IRIX system. There are three main areas that account for the performance loss of dynamically-linked programs. First, text and data sym- bols are referenced indirectly. As a result, more instructions need to be executed. Our approach takes advantage of the infor- mation derived from the cross-module optimization phase of the MIPS compiler [Chow86, Hime87] and converts as many indirect address references as possible to direct references. Second, symbol resolution takes time, even with deferred binding. Furthermore, since trapping data references is expen- sive, binding of data symbols is carried out at start-up time. This causes large C++ applications to suffer a long start-up delay because they usually contain a large number of function pointer tables. We developed a scheme called quickstart that virtually eliminates any need for run-time symbol binding. Third, despite the sharing of library text among multiple processes, the use of shared libraries can sometimes require more memory. This extra memory requirement comes from both the data structures used by the run-time linker for symbol resolution for each process and the poor access locality of functions in a shared library. We solved the first problem by pre-computing most of the information required by the run-time linker at static-link time and putting them in sharable, read-only segments of shared libraries. To improve the access locality of functions in a shared library, we developed cord, a tool that repositions these functions based on either profiling information or explicit user-specification. The rest of this paper is organized as follows: Section 2 provides an overview of the first implementation of the IRIX shared libraries. Section 3, 4, and 5 present techniques we employed to improve the shared libraries performance in the three respective areas described above. Each section is followed by an analysis of the gains in performance. All measurements are taken on an Iris Indigo, which has a 100 MHz MIPS R4000 processor. Section 6 summarizes the results and offers some conclusions. 2. Background In this paper, we concentrate on the performance of SVR4-style shared libraries [Syst90], i.e., implementations that support dynamic symbol binding, sharing of text and read-only data among multiple processes, and mapping of the same library to different address space in different processes. In other words, less flexible implementations such as those that require fixed address assignment [Arno86, Hobb87] are not considered. A full description of the basic implementation of IRIX's shared libraries can be found in [MIPS90]. This section describes only those implementation details that are relevant to the optimizations presented in this paper. Under the IRIX implementation of shared libraries, the com- piler always generates position-independent code (PIC). Thus, any object file created by the compiler can be part of a shared library. PIC is implemented by turning all address references into indirect references through a Global Offset Table (GOT). A GOT is created at static-link time and there is one for each shared library as well as the main executable itself. It is a table of addresses of all symbols that are referenced, and the content of this table is updated by the run-time linker on demand, i.e., deferred binding. Function calls are implemented by first loading the address of the callee into a register from the corresponding entry in the GOT, followed by a jump-and-link- register instruction. By updating its GOT with correct values, the run-time linker can relocate a shared library to any virtual address. Furthermore, a dedicated register is used to hold the address pointing to the GOT corresponding to the function being executed. When control is transferred between different shared libraries, the address of the GOT is computed by the correspond- ing callee and the new value is put in this GOT register. How- ever, when control is passed between functions within the same shared library, the content of the GOT register needs not be changed and can be used immediately. Using PIC allows the text segment to be shared by all pro- cesses because it never needs to be modified. The GOT approach localizes all the run-time fix-up of addresses and cuts down the number of copy-on-write memory pages. When compared to the jump table approach in the SunOS compiler [Ging87], our implementation of function calls requires only one control transfer instead of two. This difference is of particular importance in modern com- puter architecture [Hein93, Hsu94], where jumps are in general expensive. Furthermore, the instruction to load addresses from the GOT can often be scheduled to be executed earlier so that other useful work can be done while the address is being fetched from memory. 3. Eliminating Indirect Addressing As stated in Section 2, all object files generated by our com- piler are PIC. This feature allows any object file created by the compiler to be put into a shared library. It also relieves programmers from the burden of deciding at compile-time whether a particular object should become part of a shared library or an executable. However, if the main executable is never relocated at run- time, we can assign absolute addresses to it and replace all indirect references for symbols defined in the main executable by direct references. This can be achieved if the system guarantees that no shared library is mapped to the address space already used by the executable itself. Unfortunately, under the traditional separate compilation model, the compiler cannot determine if references of undefined global symbols will be resolved with a definition from a shared library or from the main executable itself. Hence, it has to assume the worst case and generate indirect references for such symbols. _________________________________________________________________ +--------+----------+------------------------------------+ | |statical- | dynamically linked | |program |ly linked +-------------------+----------------+ | | | direct addressing | PIC | +--------+----------+-------------------+----------------+ |espresso| 43.2s | 45.2s (4.6%) | 47.4s (9.7%) | +--------+----------+-------------------+----------------+ |lisp | 89.0s | 95.9s (7.8%) | 114.8s (29.0%) | +--------+----------+-------------------+----------------+ |eqntott | 13.9s | 13.8s (-0.7%) | 14.0s (0.7%) | +--------+----------+-------------------+----------------+ |compress| 64.3s | 67.0s (4.2%) | 71.8s (11.7%) | +--------+----------+-------------------+----------------+ |sc | 95.8s | 102.7s (7.2%) | 107.3s (12.0%) | +--------+----------+-------------------+----------------+ |gcc | 120.9s | 121.8s (0.7%) | 138.8s (14.8%) | +--------+----------+-------------------+----------------+ |average | | 3.97% | 12.98% | +--------+----------+-------------------+----------------+ Table 1. Performance Degradation of Dynamically-Linked Programs _________________________________________________________________ Our goal is to derive a scheme that can automatically deter- mine if direct addressing is possible for any given symbol, with- out requiring any specification from the programmer. The approach is to extends the cross-module optimization phase of the compiler to collect information about where symbols are defined. Under the cross-module optimization model, the MIPS compiler combines separately-compiled intermediate-code objects and per- forms function inlining, interprocedural register allocation, dead function removal, and many other optimizations that are not supported in the traditional model [Hime87]. All these optimiza- tions are done before the final machine code generation. Note that during this cross-module optimization phase, the compiler collects information from all the files that compose the program, including the shared libraries that it links with, and thus has complete knowledge of where each symbol is defined. By simply recording this information in the symbol table and passing it down to the code generator, direct references can be generated for symbols defined in the executable module and indirect refer- ences can be generated for symbols defined in the shared libraries. This optimization improves the performance in several ways. First, function call overhead within the main executable is reduced. There is no more memory reference for loading the address of the callee. Also, with the absolute address embedded as an immediate value in the jump-and-link instruction, the pro- cessor can potentially pre-fetch the next instruction without delay. Second, overhead for indirect references of data symbols is also eliminated. Very few other implementations, if any, can support dynamically-linked programs with direct data references. This optimization can have a significant effect on performance, especially when the symbols are referenced within a loop. Third, all symbols defined inside the main executable can be removed from the GOT, and thus the amount of work performed by the run- time linker is reduced. Furthermore, since the size of the GOT is reduced, the compiler can put more global variables in the data area close to the GOT and access these variables through a 16-bit immediate offset from the GOT register [Chow87]. Under the MIPS architecture, accessing data via an immediate offset from a register takes only one instruction, and is even faster than the normal direct addressing, which takes two instructions instead of one [Hein93]. Table 1 compares the performance of several dynamically- linked programs against the statically-linked versions. These programs are from the SPECint benchmark suite [SPEC91]. Three versions of each programs were compared: the statically-linked version, the dynamically-linked version with the direct- addressing optimization described in this section, and the default PIC version. The performance of each dynamically-linked version is compared to that of the statically-linked counterpart. Numbers in parenthesis are the percentage degradation. By eliminating indirect addressing, the performance of these programs is consistently better than the PIC version. The most significant improvement is found in lisp and gcc, both of which contains a large number of function calls. When compared with the statically-linked version, the worst performance degradation is cut down to only 7.8%, while eqntott and gcc show almost no degradation. On average, dynamically-linked programs are now only 3.97% slower than statically-linked programs, which is a great improvement over what is commonly observed in other imple- mentations. Note that only the performance of single-process runs are compared. In general, the use of shared libraries provides a better system throughput because their text pages are shared among different processes, resulting in less memory consumption system-wide. Our results show that such an improved system throughput can be achieved with minimal penalty to the perfor- mance of individual process. We discuss the memory requirement of shared libraries in more detail in Section 5.1. 4. Quickstart One of the major performance degradation in executing dynami- cally-linked programs is due to run-time symbol resolutions. The performance is affected in two ways. First, more instructions are executed because of the work performed by the run-time linker. Second, if the run-time linker needs to update the GOT's, more copy-on-write pages are needed and thus the residence size of the program is also increased. Since it is too expensive to trap data references, data symbols, including function point- ers, are resolved by the run-time linker at program start-up time. As a result, in cases where function pointers are fre- quently used, as in many C++ programs, deferred binding cannot alleviate the run-time linker's work to a large extent. Further- more, in some cases, deferred binding of symbols can cause the program to run even slower than immediate binding of all symbols at start-up time. These suggests that minimizing symbol resolu- tion by the run-time linker is crucial in achieving better run- time performance. A new mechanism called quickstart is developed to address this problem. Under quickstart, the static linker attempts to resolve all symbol references at static link-time, and if none of these references need to be updated at run-time, the run-time linker needs to perform virtually nothing other than the mapping of the required shared libraries. However, to achieve quick- start, the run-time linker must verify that the following two conditions are satisfied before running the user program [Hime90]. First, the shared libraries used at run-time must be the same as those used at static link--time, and second, all libraries are mapped into their pre-assigned default locations and none of the libraries in the list overlaps in virtual addresses. To assist the run-time linker in verifying the first condi- tion, each shared library is uniquely identified with a name, a time-stamp ( the time when it was built), a checksum (the sum of the external interfaces and sizes of all common symbols), and a version string. At static link-time, all the information about the shared libraries used in building a certain binary are recorded in the binary itself, regardless of whether the binary is an executable or another shared library. In addition, symbols that have multiple definitions are recorded in the binary to enable the run-time linker to correctly resolve references to these symbols. To satisfy the second condition, we support a registry file that the static linker uses to determine the default virtual addresses for each individual shared library, which ensures that the shared libraries are assigned non- overlapping addresses. As soon as the run-time linker detects that any of those two conditions is not met, it has to perform all the necessary symbol resolutions and relocations before running the program. This is known as non-quickstart. Even with quickstart, the run-time linker still needs to perform symbol resolutions on those multiply-defined symbols, if they exist. In order to reduce the amount of work, we imple- mented hidden symbols. Hidden symbols can only be referenced within the shared library that they are defined in. Since these symbols cannot be referenced by other binaries, they directly contribute to making the list of multiply-defined symbols smaller, effectively alleviating the work of the run-time linker. To handle those cases when the quickstart conditions are violated, e.g., when some of the shared libraries used in an exe- cutable are replaced, we developed a system that traverses the entire list of installed system binaries, performs necessary sym- bol resolutions and relocations, and at the end rewrites those modified binaries in place. All of these are typically done transparently during new software installation. We call this system re-quickstart. This scheme turns out to be very success- ful in maintaining quickstart status without re-linking any exe- cutable or shared library. It should be pointed out that quickstart is only an opti- mization feature, and has no bearing on the correct execution of a program. Without quickstart, the run-time linker performs deferred binding by default, while immediate binding of symbols is also provided as an option. Table 2 shows the performance improvement of quickstart and deferred binding over immediate binding. For each program, the total time spent in the run-time linker is measured. The per- centage improvement is given in parenthesis. These five programs are chosen mainly because of their popularity of use by program- mers. The other reason is their differences in characteristics such as the number of shared libraries used and the number of data symbols that needs to be resolved versus that of text sym- bols. DiskView, a huge X-window application written in C++, is a graphical interface for disk and file system manipulation used frequently by system administrators. Insight, a large multi- media application, is used for searching and browsing through on- line documents. Gdiff, a medium-sized X-window program, is a graphical differential file comparator. Dbx, a small C++ pro- gram, is SGI's source-level debugger. Last but not least, ls, listing the contents of a directory, is a widely used UNIX com- mand. _________________________________________________________________ +---------+-------+------------------------+-------------+ | | | non-quickstart | | |program | # of +---------+--------------+ quickstart | | | libs. | immed. | deferred | | | | | binding | binding | | +---------+-------+---------+--------------+-------------+ |diskview | 17 | 3.66s | 3.43s (6%) | 2.06s (44%) | +---------+-------+---------+--------------+-------------+ |insight | 14 | 3.36s | 1.59s (53%) | 0.13s (96%) | +---------+-------+---------+--------------+-------------+ |gdiff | 6 | 0.28s | 0.31s (-11%) | 0.04s (84%) | +---------+-------+---------+--------------+-------------+ |dbx | 3 | 0.32s | 0.32s (0%) | 0.02s (94%) | +---------+-------+---------+--------------+-------------+ |ls | 1 | 0.04s | 0.04s (0%) | 0.00s (99%) | +---------+-------+---------+--------------+-------------+ |average | | | 10% | 83% | +---------+-------+---------+--------------+-------------+ Table 2. Performance Improvement of Quickstart and Deferred Binding over Immediate Binding. _________________________________________________________________ Overall, programs that are quickstarted spend at least 40% less time in the run-time linker than those with immediate bind- ing. On average, there is an impressive 83% improvement. Most programs show that deferred binding of symbols has advantage over immediate binding [Saba90], and it is more apparent in cases where the number of shared libraries used is larger. This is probably because in general most library functions in a shared library are not used and thus need not be resolved at all. In the case of gdiff, however, the run-time linker was invoked many times over the course of execution, owing to many function calls that required symbol resolution, resulting in longer time spent in the run-time linker. In conclusion, whether deferred binding is used or not, it is far better to maintain the quickstart sta- tus, requiring the run-time linker to perform the least amount of work. It should be mentioned that even though only five programs are shown here, we have done substantial testing on hundreds of applications and the results are unanimously in favor of quick- start. 5. Reducing Memory Requirements One of the goals of shared libraries is to reduce the system-wide physical memory requirement through increased sharing of text and read-only data of shared libraries among multiple processes. By reducing the system-wide memory requirement, the number of page faults is generally reduced, thus improving the overall perfor- mance. While the use of shared library does reduce memory require- ment, this reduction is sometimes offset by other overhead of dynamic linking. The first problem is that locality of functions within a shared library might be poor. A shared library is cre- ated by combining all functions it provides into a single module. When a shared library is linked with a process, the entire module is mapped into the process' address space and every function defined in the library is available to the process. Only those memory pages corresponding to functions that are actually invoked are loaded on demand. Typically, a process uses only a subset of the functions provided by a shared library. Unless these func- tions are all packed closely together within the shared library, many memory pages that are loaded might be partially filled with functions that are never or rarely invoked. As a result, memory utilization is sub-optimal. The second problem is that data structures used by the run- time linker for symbol resolutions may sometimes be large. For example, the IRIX implementation of xterm, a typical X applica- tion, links with seven shared libraries containing a total of 5856 symbols. Note that these data structures are private to each process and are not shared. The effect of this hidden cost of dynamic linking has rarely been studied. Section 5.1 describes cord, a tool we developed to improve function locality within a shared library. Sections 5.2 presents an optimization to reduce the amount of memory required by the run-time linker. 5.1. Procedure Repositioning To reduce system-wide memory usage of instruction text space, a post-linking binary optimization tool called cord is implemented to reposition procedures in executables and shared libraries. Cord attempts to reduce run-time memory usage and achieve better instruction cache mapping by grouping frequently used procedures on the same page or adjacent pages, thus reducing the total num- ber of memory pages needed during program execution. The proce- dure repositioning is controlled by a feedback file generated by profiling tools. The feedback file can also be hand-tuned to directly control the procedure placement. Cord and the profiling system can be used to reposition pro- cedures based on three different criteria: procedure invocation count, procedure cycle count and procedure density (procedure cycle count divided by procedure size). Using procedure invoca- tion count to guide procedure repositioning improves memory usage by grouping frequently invoked procedures into adjacent pages. Procedure repositioning based on procedure density is intended to give smaller procedures a heavier weight in deciding the order of repositioning. Earlier experiments show that the effects on per- formance and memory usage of repositioning based on these three factors varies, but show very little difference. Our results indicate that cycle count yields a small advantage in the overall performance. Therefore, it is chosen as the default and is used in the measurement presented in this section. The fact that libraries may be shared by different processes makes the performance modeling and measurement of shared libraries more difficult. In contrast, the benefit of procedure repositioning is more apparent for executables, because their execution pattern is more tractable. As a result, the benefit of procedure repositioning for shared libraries is an interesting issue worth investigating. There has been a number of researches on binary reposition- ing [Hwu89, McFa89, Pett90], including procedure and basic block repositioning. However, none of those addresses memory usage and shared libraries performance, which is one of the important bene- fits of using cord. To study the effect of procedure repositioning on memory usage, several key applications, including Xsgi, 4Dwm and xwsh, are selected as targets for memory usage optimization. Xsgi is an X window system server on IRIX. 4Dwm is the IRIX extended Motif window manager. Xwsh is an IRIX terminal emulator. Four libraries are chosen for this measurement because they tend to use more memory during most X sessions. The libraries are libXm, libX11, libc, and libXt. The executables are profiled by a typical run. For libraries, the repositioning is based on profile data from 4Dwm and xwsh. The details of the profile run for each library are listed in Table 3. 4Dwm uses libXm, libX11, libc, libXt and oth- ers. Libraries used in xwsh include libX11, libc, and others. The measurement is done after re-booting IRIX followed by a login session with 4Dwm, clock, two copies of xwsh, and a number of other X applications. It is important to note that this experiment is carried out on a loaded system. Many other pro- cesses are sharing those four libraries in addition to 4Dwm and xwsh. For example, libX11 and libXt is used by xdm, the X display manager, and libc is used by all dynamically-linked processes. The improvement of instruction memory usage for "cord'ed" executables is listed in Table 4. The figures show the total amount of memory used by each program. The results show that the instruction memory usage has been reduced by an average of 20.1%. These programs typically have long life spans and are frequently run throughout the entire X session. Therefore, the amount of memory usage reduction is quite significant. _________________________________________________________________ +--------+---------------+ | | profiled with | |library +-------+-------+ | | 4Dwm | xwsh | +--------+-------+-------+ |libXm | X | | +--------+-------+-------+ |libX11 | X | X | +--------+-------+-------+ |libc | X | X | +--------+-------+-------+ |libXt | X | | +--------+-------+-------+ Table 3. How Libraries are Profiled. _________________________________________________________________ _________________________________________________________________ +-----------+-----------+-----------+--------------+ |executable | original | cord'ed | memory usage | | | (K bytes) | (K bytes) | reduction | +-----------+-----------+-----------+--------------+ |xwsh | 383 | 312 | 18.5% | +-----------+-----------+-----------+--------------+ |Xsgi | 466 | 410 | 12.0% | +-----------+-----------+-----------+--------------+ |4Dwm | 308 | 216 | 29.8% | +-----------+-----------+-----------+--------------+ |average | | | 20.1% | +-----------+-----------+-----------+--------------+ Table 4. Total Instruction Memory Usage for Procedure Repositioned Executables. _________________________________________________________________ _________________________________________________________________ +--------+-----------+-----------+--------------+ |library | original | cord'ed | memory usage | | | (K bytes) | (K bytes) | reduction | +--------+-----------+-----------+--------------+ |libXm | 1212 | 914 | 24.6% | +--------+-----------+-----------+--------------+ |libX11 | 561 | 503 | 10.3% | +--------+-----------+-----------+--------------+ |libc | 600 | 526 | 12.3% | +--------+-----------+-----------+--------------+ |libXt | 351 | 329 | 6.2% | +--------+-----------+-----------+--------------+ |average | | | 13.35% | +--------+-----------+-----------+--------------+ Table 5. Total Instruction Memory Usage for Procedure Repositioned Shared Libraries. _________________________________________________________________ The improvement of instruction memory usage for cord'ed shared libraries is listed in Table 5. Taken into account that those libraries are shared by other processes during the measure- ment, the average instruction memory reduction of 13.35% is quite impressive. 5.2. Pre-compute Run-time Information As stated in Section 4, a successfully quickstarted process vir- tually eliminates any processing by the run-time linker. How- ever, when quickstart fails, the run-time linker needs to perform symbol resolution. Typically, for each unresolved symbol, the run-time linker goes through the symbol table of each shared library and searches for a definition. Defined symbols in each symbol table can be accessed via a hash table, which is pre- computed and placed in the read-only segment of a shared library. To further reduce symbol search time, the run-time linker caches each hash value it computes and keeps these values in a table called the msym table, which has a one-to-one corresponding relationship with the symbol table itself. Hence, the hash value of each symbol needs to be computed only once. Given an unde- fined symbol, the run-time linker first extracts the correspond- ing msym table for the hash value. It then goes through the hash tables of the main executable and each shared library, and looks for a definition of the symbol using this hash value. As a result, only simple table look-up is needed and run-time symbol resolution is very efficient. _________________________________________________________________ +---------+----------------+-----------------+ |program | No. of Symbols | Size of msym | | | | tables (Kbytes) | +---------+----------------+-----------------+ |diskview | 22505 | 87.9 | +---------+----------------+-----------------+ |insight | 19468 | 76.0 | +---------+----------------+-----------------+ |gdiff | 5745 | 22.4 | +---------+----------------+-----------------+ |dbx | 5324 | 20.8 | +---------+----------------+-----------------+ |ls | 2567 | 10.0 | +---------+----------------+-----------------+ Table 6. Size of Msym Tables. _________________________________________________________________ Table 6 shows the sizes of the msym tables for the same set of programs listed in Table 2. For each program, the total num- ber of symbol table entries from the program and the shared libraries is shown. The msym tables contain a four-byte hash value for each symbol table entry. The table shows that the amount of memory used by the msym tables is very large. Further- more, the msym tables are private to each process. Therefore, in a typical workstation environment with 50 processes, the msym tables can easily take up several megabytes of memory. For exam- ple, the libc on IRIX has 2368 symbols. Since every program links with libc, the size of the msym tables for libc alone for 50 processes is about half a megabyte. A simple X application such as xterm requires 23 kilobytes for the msym tables per pro- cess. Fortunately, the memory requirement of the msym tables can be greatly reduced without sacrificing the speed of symbol reso- lution. This can be achieved by pre-computing the hash value for each symbol and putting the entire msym table into the shared library when it is built. As a result, the sizes of the shared libraries increase, but the msym tables are shared among all pro- cesses and so the system-wide memory requirement is reduced. Furthermore, these tables are only loaded on demand. If a pro- cess is quickstarted, none of the symbol tables, hash tables, and msym tables is even touched. 6. Conclusion This paper identifies three main areas that account for the per- formance loss of dynamically-linked programs, and presents the optimization techniques that we developed to solve these problems. All optimizations described in this paper have been implemented and the results have been proved to be very positive. In fact, we applied these optimizations to build the entire IRIX system based on shared libraries and no noticeable performance degradation can be observed. We show that the cross-module optimization can eliminate indirect addressing for data references and function calls, and greatly reduces the run-time overhead of dynamically-linked pro- grams. The quickstart mechanism virtually eliminates all run- time symbol resolution. By repositioning the procedures in a shared library and pre-computing the run-time data structures for symbol resolution, the system-wide memory requirement can be greatly reduced. Most important of all, we show that the flexibility, usabil- ity, and other advantages of shared libraries can be achieved without sacrificing performance. Acknowledgment We would like to thank Sun Chan, Fred Chow, Seema Hiranandani, and Raymond Lo for their valuable comments on earlier drafts of this paper. References [MIPS90] MIPS Processor Supplement for the System V Application Binary Interface, Prentice Hall, Englewood Cliffs, NJ (1990). [Syst90] System V Application Binary Interface, Prentice Hall, Englewood Cliffs, NJ (1990). [SPEC91] SPEC Newsletter. December 1991. [Arno86] Arnold, J. Q., "Shared Libraries on UNIX System V," Proc. Summer Usenix, pp. 1-10 (1986). [Ausl90] Auslander, M. A., "Managing Programs and Libraries in AIX Version 3 for RISC System/6000 processors," IBM Journal of Research and Development Vol. 34(1) pp. 98-104 (January 1990). [Chow86] Chow, F., M. Himelstein, E. Killian, and L. Weber, "Engineering a RISC Compiler System," Proc. Compcon., pp. 132-137 (March 1986). [Chow87] Chow, F., S. Correll, M. Himelstein, E. Killian, and L. Weber, "How Many Addressing Modes are Enough?," Proc. of the 2nd Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 117-121 (October 1987). [Cout92] Coutant, Cary A. and Michelle A. Ruscetta, "Shared Libraries for HP-UX," Hewlett-Packard Journal, pp. 46-53 (June 1992). [Ging87] Gingell, R. A., M. Lee, X. T. Dang, and M. S. Weeks, "Shared Libraries in SunOS," Proc. Summer Usenix, pp. 131-145 (Summer 1987). [Ging89] Gingell, Robert A., "Shared Libraries," UNIX Review Vol. 7(8) pp. 56-66 (August 1989). [Hein93] Heinrich, Joseph, MIPS R4000 User's Manual, Prentice Hall, Englewood Cliffs, NJ (1993). [Hime87] Himelstein, Mark I., Fred C. Chow, and Kevin Enderby, "Cross-Module Optimizations: Its Implementation and Benefits," Proc. Summer Usenix, pp. 347-356 (June 1987). [Hime90] Himelstein, Mark I., "An Implementation of Dynamic Shared Objects," Unpublished paper, MIPS Computer Sys- tems, Inc., (August 1990). [Hobb87] Hobbs, J., "Installed Shareable Images," Dec. Profes- sional Vol. 6(4) pp. 78-82 (April 1987). [Hsu94] Hsu, Peter Yan-Tek, "Designing the TFP Microprocessor," IEEE Micro, pp. 23-33 (April 1994). [Hwu89] Hwu, Wen-mei and Pohua Chang, "Achieving High Instruc- tion Cache Performance with an Optimizing Compiler," 16th ISCA, pp. 242-251 (1989). [Kepp93] Keppel, David and Stephen Russell, "Faster Dynamic Linking for SPARC V8 and System V.4," Technical Report 93-12-08, University of Washington (1993). [McFa89] McFarling, Scott, "Program Optimization for Instruction Caches," International Symposium on Architecture Sup- port for Programming Languages and Operating Systems, pp. 183-191 (April 1989). [Nels93] Nelson, Michael. N. and Graham Hamilton, "Higher Per- formance Dynamic Linking Through Caching," Proc. Summer Usenix, pp. 253-266 (June 1993). [Orr93] Orr, Douglas B., John Bonn, Jay Lepreau, and Robert Mecklenburg, "Fast and Flexible Shared Libraries," Proc. Summer Usenix, pp. 237-251 (June 1993). [Pett90] Pettis, Karl and Robert Hansen, "Profile Guided Code Positioning," Proceedings of the SIGPLAN Programming Language Design and Implementation, pp. 16-27 (June 1990). [Saba90] Sabatella, Marc, "Issues in Shared Libraries Design," Proc. Summer Usenix, pp. 11-23 (June 1990). W. Wilson Ho received his M.S. and Ph.D. degree in computer sci- ence from the University of California at Davis. He has been working in the compiler group in Mips, Inc. and Silicon Graphics, Inc. since 1992. His research interests include inter-procedural analysis, dynamic linking, link-time optimization, and debugging of distributed programs. He is also the author the GNU dynamic loader "dld". His email address is ho@sgi.com. Wei-Chau Chang has been working in the compiler group in Mips, Inc. and Silicon Graphics, Inc. since 1992. His interests include compiler optimization for instruction cache, program instrumentation, and execution-driven simulation. He received his M.S. degree in computer science from the University of Illinois at Urbana-Champaign. Lilian H. Leung received her M.S. degree in Computer Science from Stanford University in 1986. She has been developing compiler at Mips, Inc, and Silicon Graphics, Inc. since 1988. Before that, she was at Informix Software, Inc. designing database languages. Her most recent work focus is on static and run-time linking. She was the original writer of the SGI/Mips' run-time loader.