The following paper was originally published in the Proceedings of the
USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Call Path Profiling of Monotonic Program Resources in UNIX

by Robert J. Hall and Aaron J. Goldberg
AT&T Bell Laboratories
600 Mountain Ave.
Murray Hill, NJ 07974

****************************************************************
NOTE: This plain text version has been cannibalized from the
original LaTeX source. Potentially important font changes have
necessarily been removed. Also, a postscript figure has been
removed, and some mathematical formatting has been turned into
straight text. The pseudo-code in the appendix has particularly
suffered from this transformation. The reader is strongly urged
to obtain either a paper copy or else use a postscript-viewing
program (such as ghostscript) to view the postscript form. If
you want postscript, send email to hall@research.att.com.
****************************************************************

Abstract

Practical performance improvement of a complex program must be guided by
empirical measurements of its resource usage. Essentially, the programmer
wants to know where in the source code the program is inefficient and why
this is so. The process interface of UNIX System V (proc(4)) provides
access to the raw data (e.g. time, faults, traps, and system calls)
necessary to answer the ``why'' question, but gives no guidance in
answering the ``where'' question.
This paper describes a novel approach to the latter, Call Path Profiling,
which is both more informative and more closely tied to the process of
program optimization than either trace-based or prof/gprof-like
approaches. In addition, by viewing consumption of a resource as the
ticking of a clock, we generalize the interval-based sampling approach of
time profilers to arbitrary monotonic resources. The approach is embodied
in several prototypes, including CPPROF, which operates under System V.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1. INTRODUCTION

To improve the performance of a complex program effectively, one must
understand both why the program is too slow or big (i.e., what resources
it uses) and where it consumes those resources significantly. This allows
one to focus time and attention on re-examining critical design
decisions, while avoiding insignificant optimizations that needlessly
destroy the program's original clear structure. While there has been some
progress in automated static analysis of program
performance\cite{lemetayer88,flajoletetal91}, prospects for a practical
tool are dim. Thus, in practice one still turns to profiling: measuring
resource usage empirically\cite{knuth71}.

Previous research (see Section 3) has treated the ``where'' question only
crudely. In particular, most profilers (e.g. UNIX's prof(1)) report only
the costs of executing function bodies. While this is useful in locating
inefficiently coded function bodies, it is rare that inefficiency is
localized to one routine. Far more common is the case where a general
routine, efficiently coded for a general task, is called in a specialized
context, doing more work than necessary. The programmer can optimize by
creating a specialized version which is only called in the optimizable
context. A profiler should report resource usage in terms of these
optimizable program contexts, because they are the contexts most likely
to be directly optimizable.
Call path profiling\cite{hall92} is a technique for describing ``where''
in terms of optimizable program contexts. This paper reports on an
implementation of call path profiling of monotonic program resources
(e.g. time, faults, traps, and system calls; see Section 2) under the
process filesystem (proc(4)) abstraction\cite{faulkneretal91} of UNIX
System V. A first pass collects raw resource data about the target
process via proc(4), implementing an ``interval timer'' for each resource
of interest. Specifically, we view consumption of a resource as the
ticking of a clock. Each time the resource clock sweeps out a full
interval, we pause the target process and record in a cpprofdata file a
stack sample: the program counter and the sequence of return addresses
stored on the current program call stack. This does not incur excessive
overhead, primarily because we are sampling (recording infrequent but
randomly distributed data) instead of tracing (recording all data); we
trade perfect accuracy for low overhead. The second pass analyzes the
stack samples in the cpprofdata file and puts the user in a simple
command loop that allows interactive exploration of call path profiles of
various resources viewed from different root functions.

While call path profiling has previously been used in Lisp environments
for time and space, the proc(4) filesystem provides a unique opportunity
for profiling a wide variety of interesting resources in a uniform and
portable way. In addition, our research points to the need for a
``generalized interval timer'' service, which we believe should be
provided by the UNIX kernel for all monotonic system resources, analogous
to the existing interval timer (setitimer(2)).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2. MONOTONIC RESOURCES AND proc(4)

People usually complain about performance when programs are too slow.
Thus, the typical profiler reports on the consumption of a single
resource: execution time.
However, in many cases one wishes to know why the program spends time in
a given place, and this requires understanding its usage of other
resources. For example, a particular call to a floating point routine may
take much longer than expected due to floating point exception
processing. A time profiler does not help one understand why the call is
slow, but a profile of where the floating point exceptions occur in a
program informs the user directly. Similarly, one might wish to profile
any or all of the resources listed below.

    user time
    real time
    floating point exceptions
    page faults
    input/output time
    bytes read or written
    system or function calls
    ...

The key property these all share is that their effects increase
monotonically over time; that is, the total cost over the program's
execution is equal to the sum of the costs incurred by individual
subroutines. For example, each page fault adds to the total run-time cost
of the program. By contrast, some resources, such as stack space, do not
have this property. Just because a given subroutine invocation uses k
bytes of stack space does not imply that the whole program execution must
use an additional k bytes; it may actually reuse the same k bytes
allocated by some previous subroutine invocation. A similar example is
parallel execution time. Our approach does not apply directly to these
non-monotonic resources, so by ``resource'' we will always mean
``monotonic resource.''

Another key property of the resources listed above is that a monitoring
process can acquire raw data about them for another process through the
proc(4) filesystem\footnote{To monitor time resources, CPPROF uses the
kernel's interval timer facilities (setitimer(2)).}. For example, the
monitoring process can arrange to be notified of every page fault, trap,
or system call encountered by the monitored process. Our System V
prototype, CPPROF, exploits this functionality.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3. RELATED WORK

To illustrate call path profiling and contrast it with other profilers,
it will be useful to refer to the (contrived) example program,
process_db, shown below. Section 6 will discuss more realistic programs
to which we have applied our profiler.

    main()
    {
        DB db = read_db();
        print_salary_stats(uniquify_db(db));
    }

    DB_RECORD **uniquify_db(DB db)
    {
        DB_RECORD **dbptrs = build_db_ptrs(db);
        qsort(dbptrs, name_field_lt);
        merge_adjacent_records();
        return(dbptrs);
    }

    print_salary_stats(DB_RECORD **dbptrs)
    {
        int *salaries = extract_salaries();
        qsort(salaries, integer_lt);
        stat_summary(salaries);
    }

Essentially, this program reads a database of employee records, sorts it
into order by name field eliminating duplicate records, and then extracts
and sorts the salaries. Finally it prints various statistics about them,
such as range, median, etc. Note that the programmer realizes that
sorting the database directly would be inefficient, because qsort would
have to swap all the bytes of the records, so uniquify_db creates,
manipulates, and returns an array of pointers to database records.

Suppose now that we run this program on a database too large to fit in
the physical memory of our machine. UNIX's prof(1) reports the amount of
time spent in each function body. For process_db, it would report that
essentially all the time is spent in qsort. While true, this doesn't help
the optimizer, because qsort(3), being a library routine, figures to be
rather efficient. It is left to the programmer to infer whether something
about the program is inefficient and, if so, how to optimize it.

UNIX's gprof(1)\cite{grahametal82}, on the other hand, attempts to
compute a ``call graph'' profile which, when accurate, attributes the
time spent in a procedure and its descendants to the procedure. It also
attempts to report the fraction of time spent in each child of each
function.
Unfortunately, gprof's implementation makes the incorrect assumption that
every call to a given routine takes the same amount of time. Thus, a
gprof profile of process_db would report half of the time in the first
qsort call and half in the second. This turns out to be wrong; far more
time is spent in the first than the second. Even when a correct call
graph profile is available, as in the Franz Allegro Common Lisp
environment\cite{franz90}, it is only a simple special case (length 1 and
2 call paths) of a call path profile (Section 4). One can easily show (by
example) that these profiles do not give as much information as a full
call path profile. Franz's profiler is based on stack sampling in a way
conceptually similar to our own (see Section 5.1).

A completely different approach to profiling operates by tracing resource
usage. truss(1) provides a general mechanism for tracing a UNIX command
via the proc(4) filesystem. The user specifies which classes of resource
usage are of interest (page faults, traps, or system calls) and then
truss prints out a line on the terminal for each page fault, trap, or
system call of the selected type that occurs during execution of the
trussed process. While it does give a general idea about what resources
are being used, tracing generates a large amount of data which typically
must be summarized to be understood. For process_db above, truss simply
puts out a line for each of the several thousand page faults encountered,
giving little indication of which optimizable context in the program
caused the page fault.

The System V version of the SunPro Sparcworks Analyzer\cite{sun92}
provides a rich interface for tracing and profiling program resources.
For example, it gives the user a prof(1)-like profile of any of the
proc(4)-accessible resources, gives a graphical display of which text and
data pages were touched during execution (including a tool for laying out
a program to reduce text page faults), and provides many other features.
It even has an option that enables collecting periodic stack samples (the
same sampling technique we use in the call path profiler). However, it
samples only with respect to the time resource, and instead of a call
path profile, abstracts to the less informative gprof-style call graph
profile.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

4. CALL PATH PROFILING

Here is a call path profile of process_db:

Downward Call Path Profile for resource: Real Time
2676 samples, sampled every 50 gticks
format: usage_fraction (call_path) [#raw_samples]
-----------------------------------------------------
1.00000 (main) [2676]
0.88004 (main uniquify_db) [2355]
0.68012 (main uniquify_db qsort) [1820]
...
0.11323 (main print_salary_stats) [303]
0.11323 (main print_salary_stats extract_salaries) [303]
...
0.00000 (main print_salary_stats qsort) [0]
-----------------------------------------------------

The header information shows that this profile was collected by sampling
Real Time once every 50 generalized ticks (gticks). The gtick is a
polymorphic unit denoting a hundredth of a second for time resources, one
page fault for text and data page faults, one byte for space allocation,
etc. Each line after the header shows the resource usage of one call path
of the program. Intuitively, a call path represents a subset of the
program's execution corresponding to a class of program call stacks
(Section 4.1 defines this notion precisely). The first field is the
fraction of the program's total Real Time attributable to the call path,
the second field (in parentheses) names the call path, and the integer in
brackets indicates the raw number of samples for the call path.
From this profile, we can conclude that main calling uniquify_db calling
qsort consumes 68% of the time, while the other call to qsort (within
print_salary_stats) takes insignificant time (no samples hit there). This
``where'' description directly relates the resource usage to an
optimizable context, because the programmer can easily change the program
to call a specialized qsort within a specialized uniquify_db within main.

The question remains how to specialize qsort in this case. The
measurement seems anomalous, since the arrays sorted by the two qsort
calls have approximately the same number of elements (assuming few
duplications in the database). By using CPPROF to profile other
resources, we find

Downward Call Path Profile for resource: Page Faults
9950 samples, sampled every 10 gticks
format: usage_fraction (call_path) [#raw_samples]
-----------------------------------------------------
0.99889 (main) [9939]
0.71317 (main uniquify_db) [7096]
0.56995 (main uniquify_db qsort) [5671]
...
0.13467 (main print_salary_stats) [1340]
0.13457 (main print_salary_stats extract_salaries) [1339]
...
0.00000 (main print_salary_stats qsort) [0]
-----------------------------------------------------

This gives a clear indication of why the first sort takes so much more
time: the array to be sorted resides on so many different pages that they
cannot all be held in memory at once, causing a superlinear number of
data page faults (due to the superlinear number of key comparisons
required by sorting, where each key comparison must actually touch the
page to get the name field). Armed with this information, the user can
immediately focus on optimizing the critical bottleneck: since the sort
key fields are small, the program could localize the keys first
(incurring only linearly many page faults) and then sort without causing
further page faults. By optimizing process_db in this way, we cut the
total run time from 25 minutes to about 8.
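The key-localization optimization just described can be sketched in C as
follows. This is our own illustrative code, not the actual process_db
source; the record layout, sizes, and names are assumptions chosen only
to show the idea of copying the small keys into a compact array before
sorting.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical employee record: the sort key (name) is small
 * relative to the whole record, so an array of records spans
 * many pages while an array of keys spans few. */
typedef struct {
    char name[32];
    char payload[4064];        /* rest of the large record */
} DB_RECORD;

/* Compact key: the name plus a pointer back to its record. */
typedef struct {
    char name[32];
    DB_RECORD *rec;
} SORT_KEY;

static int key_cmp(const void *a, const void *b)
{
    return strcmp(((const SORT_KEY *)a)->name,
                  ((const SORT_KEY *)b)->name);
}

/* Copy the keys out (linearly many page faults over the big
 * records), then sort the compact array; the comparisons no
 * longer touch the record pages at all. */
SORT_KEY *sort_by_localized_keys(DB_RECORD *db, size_t n)
{
    SORT_KEY *keys = malloc(n * sizeof *keys);
    size_t i;
    for (i = 0; i < n; i++) {
        memcpy(keys[i].name, db[i].name, sizeof keys[i].name);
        keys[i].rec = &db[i];
    }
    qsort(keys, n, sizeof *keys, key_cmp);
    return keys;
}
```

The sorted key array can then drive duplicate elimination or output in
name order, dereferencing each rec pointer only once.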
This improvement matches well with what the profile predicts, since we
have effectively made the call path's cost zero. The optimization is
unnecessary for the call to qsort within print_salary_stats, because the
salaries are localized when they are extracted into a separate array.
This illustrates how CPPROF provides a unique combination of highly
detailed ``where'' descriptions with a rich variety of ``why''
explanations.

%%%%%%%%%%%%%%%%

4.1 WHAT IS A CALL PATH

A call path is a sequence of function pairs ((f_1, g_1), (f_2, g_2), ...,
(f_n, g_n)) such that each f_i executes a subroutine call to g_i, and g_i
= f_{i+1}. In view of the latter condition, we will always abbreviate
such a call path by (f_1, f_2, ..., f_n, g_n), as we did in the call path
profiles above. By this definition, it is easy to see why a call path
corresponds to an optimizable program context: any nested sequence of
calls can be rendered separately optimizable by duplicating (with a
different name) each subroutine (except the first) in the list so that it
is only called by its predecessor. For example, in our program we could
make the call path (main uniquify_db qsort) optimizable by creating
qsort_1 and uniquify_db_1 (which calls it) and altering main to call
uniquify_db_1. Note that for recursive programs, one must be careful to
maintain an isomorphic call graph by renaming recursive calls to the new
name of the recursive function. A general algorithm for contextualizing a
call path which works for arbitrarily recursive program structures is
given in an earlier paper\cite{hall92}; space constraints prohibit
reproducing it here.

%%%%%%%%%%%%%%%%

4.2 WHAT THE PROFILE REPORTS

The resource fraction f_p reported for a call path p = (a b ... n) in a
profile has a simple interpretation. Suppose one constructs the
optimizable program for p by contextualizing it as above, replacing the
call in a to b with a call to b_1, which in turn calls c_1, and so on,
which finally calls n_1.
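The duplicating transformation of Section 4.1 can be made concrete with a
schematic sketch. The function bodies here are stand-ins of our own (call
counters instead of real work); only the call structure matters, and it
mirrors the qsort_1 / uniquify_db_1 example above.

```c
/* Before contextualization, both call paths shared one general
 * qsort-like routine.  After it, the chain
 * main -> uniquify_db_1 -> qsort_1 has private copies, so
 * qsort_1 can be specialized without affecting other callers. */

static int general_calls, specialized_calls;

static void qsort_general(void) { general_calls++; }
static void qsort_1(void)       { specialized_calls++; /* specialize here */ }

static void uniquify_db_1(void)      { qsort_1(); }
static void print_salary_stats(void) { qsort_general(); }

/* main now calls the duplicated chain; the other call path
 * still reaches the unmodified general routine. */
static void run_main(void)
{
    uniquify_db_1();
    print_salary_stats();
}
```

After this mechanical step, any optimization applied inside qsort_1 is
visible only to the one call path being targeted.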
We can interpret f_p simply as the fraction of resource usage that would
be saved if n_1 were somehow magically transformed to use none of the
resource. For example, the time profile for process_db predicts that if
one were to make qsort take no time at all in computing its output when
called from uniquify_db, then the total run time of the resulting program
would be only 1 - 0.68 = 0.32 as much as the original. This was achieved
when we actually carried out the optimization and measured the result.

%%%%%%%%%%%%%%%%

4.3 UPWARD AND DOWNWARD

CPPROF provides two complementary types of call path profiles: upward and
downward. Both are defined in terms of a distinguished subroutine, called
the root function. In our example above, the root function is main, but
it can be any function in the program.

Given a root function, the downward call path profile is simply a sorted,
filtered list of the resource usages of all call paths whose first
function name is the root. (The filtering is based on a user-specifiable
significance threshold. Any profile entry whose usage fraction falls
below the threshold is suppressed from the output.) Both profiles above
are downward call path profiles. Downward profiles provide a hierarchical
breakdown of the resource consumed during the execution of the root
function.

An upward call path profile is a sorted, filtered list of the resource
usages of all call paths whose last function name is the root function.
This provides a breakdown of which call ancestors of the root function
are incurring the most costs due to it. An important use for this is when
one knows that the program is spending most of its time in, say, memory
allocation, but doesn't know which calls to it cost the most.

gprof(1)\cite{grahametal82}, when its assumption about constant procedure
execution times is accurate, provides the information in both upward and
downward call path profiles if we restrict call path lengths to at most
two.

[FIGURE 1 HERE]

CAPTION: Figure 1.
An abstracted call graph for the forms program. main calls a specialized
processing function for each type of form. These routines call (among
others) a utility routine, address_information, that returns full name,
address, phone, etc., information on an individual in the database. The
intermediate utility db_get_property extracts a property value from a
database entry. The low level routine db_read_record performs a disk read
to find the record for a given individual. Some subroutines and calls
have been left out for brevity.

%%%%%%%%%%%%%%%%

4.4 A USEFUL TECHNIQUE

One useful technique for using CPPROF is to ask first for a function
profile, which is essentially just a sorted, filtered list of profile
entries for all length 1 call paths in the program (i.e., the resource
fraction consumed in a function and on its behalf). Next, ask for upward
and/or downward call path profiles rooted at functions shown to be
significant in the function profile. This assures that one will not be
prematurely swamped with a large amount of irrelevant data.

To illustrate this technique, consider the program shown abstractly in
Figure 1. This program fills out forms for users by looking up
information in an on-line database. To see why this program is
inefficient, we first ask CPPROF for a function profile.

Body+Descendants Function Profile for resource: Real Time
614 samples, sampled every 20 gticks
format: usage_fraction function_name [#raw_samples]
-----------------------------------------------------
1.0000 main [614]
1.0000 db_read_record [614]
0.8632 db_get_property [530]
0.6873 address_information [422]
0.1743 invoice [107]
0.1726 form_US_1040 [106]
0.1726 envelope [106]
0.1726 loan_application [106]
0.1710 form_NJ_1040 [105]
0.1368 db_update_record [84]
-----------------------------------------------------

This shows us that the most significant routines are db_read_record,
db_get_property, and address_information.
At this point, we know that the most significant efficiency gains (if
there are any) are to be had in optimizable contexts within these
routines. For more specific information, we set the root to the most
significant of these, db_read_record, and request an upward call path
profile for it.

Upward Call Path Profile for resource: Real Time
614 samples, sampled every 20 gticks
format: usage_fraction (call_path) [#raw_samples]
-----------------------------------------------------
1.0000 (db_read_record) [614]
0.8632 (db_get_property db_read_record) [530]
0.6873 (address_information db_get_property db_read_record) [422]
0.1384 (envelope address_information db_get_property db_read_record) [85]
0.1384 (invoice address_information db_get_property db_read_record) [85]
...
-----------------------------------------------------

This profile directs us to consider first three particular optimizable
contexts, each ending in a call to db_read_record. By examining these in
turn, we can discover that the routine address_information calls
db_get_property four times to retrieve four fields of the same database
entry. This results in four successive disk reads of the same record. By
specializing db_get_property for use only within address_information (to
cache the record after the first read), we can save 75% of the 0.6873
total run time fraction, reducing the overall run time by about 51%. Note
that we could not simply optimize db_get_property itself, because other
calls to it (from outside of address_information) require its full
generality.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

5. IMPLEMENTATION

%%%%%%%%%%%%%%%%

5.1 STACK SAMPLING

CPPROF is implemented by a monitor process which controls the target
process via proc(4) (the time resources are handled slightly differently,
using the UNIX interval timers (setitimer(2))). The user specifies which
resources are of interest and the monitor process arranges to be notified
each time one gtick is consumed.
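The monitor can turn these per-gtick notifications into samples with a
small accumulator per resource. The following is a minimal sketch of such
a ``generalized interval timer'' in our own illustrative code (the names
and struct are assumptions, not CPPROF's actual internals):

```c
/* Accumulate gticks for one resource and report how many
 * sampling-interval boundaries the latest consumption crossed;
 * each boundary crossed credits one hit to the stack sample
 * taken at that point. */
typedef struct {
    long interval;      /* gticks per sample, e.g. 1024 bytes read */
    long accumulated;   /* gticks seen since the last boundary */
} gtimer;

static long gtimer_consume(gtimer *t, long gticks)
{
    long hits;
    t->accumulated += gticks;
    hits = t->accumulated / t->interval;    /* boundaries crossed */
    t->accumulated %= t->interval;          /* carry the remainder */
    return hits;        /* hits to credit to this stack sample */
}
```

With a 1024-gtick interval, a single read(2) of 10240 bytes credits 10
hits at once, just as ten separate 1024-byte reads would.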
We then implement our own ``generalized interval timer'' for each
resource by counting gticks until the sampling interval\footnote{To avoid
problems with correlation between periodic sampling and periodic program
behavior, one can randomly perturb the sampling interval each time.}
passes. For example, we examine each read(2) system call and record the
number of bytes requested, each byte being a gtick. Each time a general
resource clock passes a sampling interval boundary (e.g. 1024 bytes
read), we halt the target process and record a stack sample for the
resource in a cpprofdata file. When more gticks pass than a single sample
boundary, we credit that many hits to the sample. For example, reading
10240 bytes will result in 10 hits for the corresponding stack sample. A
stack sample is simply a list of the number of sample hits, the current
program counter, the return address of the current stack frame, the
return address of the caller, that of the caller's caller, etc., until we
reach the root procedure (typically main for C programs).

%%%%%%%%%%%%%%%%

5.2 INTERACTIVE PROFILE ANALYSIS

After the target and monitoring processes finish executing, the cpprof
command analyzes the cpprofdata file and puts the user in a simple
command loop allowing one to view upward and downward call path profiles
from user-determined root functions, with user-determined thresholds.
Also available are a function profile and a body profile (a list of the
resource usage consumed in the body of each function; this generalizes
prof(1) to arbitrary resources). Thus, rather than producing a single
static profile, CPPROF is an interactive exploratory environment for
understanding the performance of a program empirically.

%%%%%%%%%%%%%%%%

5.3 COMPUTING CALL PATH PROFILES

We compute a downward call path profile in two steps. First, CPPROF reads
the cpprofdata file once and builds a sample tree. Second, it computes
the call path profile for the chosen root from the sample tree.
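In outline, the sample tree and its weight computation can be sketched as
follows. This is a simplified illustration with names of our own
choosing; the real implementation also converts raw addresses to function
names via the symbol table before insertion.

```c
#include <stdlib.h>
#include <string.h>

/* One node per distinct call-stack prefix: a function name, the
 * hits recorded exactly at this stack, and a sibling-linked
 * list of children. */
typedef struct node {
    const char *name;
    long hits;                  /* samples whose stack ends here */
    long weight;                /* hits here + in all descendants */
    struct node *child, *sibling;
} node;

static node *find_or_add_child(node *parent, const char *name)
{
    node *c;
    for (c = parent->child; c; c = c->sibling)
        if (strcmp(c->name, name) == 0)
            return c;
    c = calloc(1, sizeof *c);   /* install a new node on demand */
    c->name = name;
    c->sibling = parent->child;
    parent->child = c;
    return c;
}

/* Insert one symbolic stack sample (outermost frame first),
 * crediting `hits` to the node for the innermost frame. */
static void insert_sample(node *root, const char **frames, int n, long hits)
{
    node *cur = root;
    int i;
    for (i = 1; i < n; i++)     /* frames[0] is the root itself */
        cur = find_or_add_child(cur, frames[i]);
    cur->hits += hits;
}

/* Post-pass: weight = own hits plus the weights of all children. */
static long compute_weights(node *n)
{
    node *c;
    n->weight = n->hits;
    for (c = n->child; c; c = c->sibling)
        n->weight += compute_weights(c);
    return n->weight;
}
```

In the simple nonrecursive case, the fraction for a call path is then the
weight of its last node divided by the weight of the root node.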
In this section, we explain how to compute a downward call path profile
from a given root function for a single resource. Function profiles,
upward profiles, and handling multiple resources are all straightforward
extensions of the basic technique.

To build the sample tree, each sample is first converted from a list of
addresses into a list of function names by using symbol table
information. Once we have the symbolic form of the sample, we insert it
into the sample tree as follows. Each node of the sample tree is labeled
with a function name, and contains fields for the number of sample hits
corresponding to that call stack and a list of children. The root node
has the name of the root function, typically main. For each sample, we
locate the unique path in the tree whose node labels are the same as the
function names in the sample (if no such path yet exists, new nodes are
installed to create it). The sample hits field of the last node on this
path is then incremented by the number of sample hits corresponding to
the sample. Once all samples are recorded in the tree, we compute for
each node in the tree the sum of its own sample hits and those of all its
descendants; this number is termed the ``weight'' of the node.

In the special case of a nonrecursive program for which the user's choice
of root function is the same as the label of the root node of the sample
tree, the resource fraction for any call path may be read directly from
the sample tree. One locates the unique path in the sample tree labeled
with the same list of function names and divides the last node's weight
by the weight of the root node. To handle the general case, including
recursive programs and non-main root functions, we use the more general
algorithm described in the Appendix.

%%%%%%%%%%%%%%%%

5.4 PORTABILITY ISSUES

The great majority of the code for CPPROF is completely portable ANSI C.
One small module, the stack trace sampler, is architecture- (and
sometimes compiler-)dependent.
Stack and register conventions vary across machines; however, once these
conventions are understood, walking the call stack using the proc(4)
interface is straightforward to implement. Obviously, the part of CPPROF
that monitors via proc(4) is portable only among UNIX System V systems. A
limited version of CPPROF (time only) is implemented under BSD UNIX. The
technique has also been implemented via stack sampling for time and space
resources in Franz Allegro Common Lisp v4.1 and under Macintosh Common
Lisp v2.0. Thus, it appears that the technique generalizes easily to
other operating systems and machines, limited only by how well the
environment supports access to resource data.

System V's proc(4) appears to be the richest environment so far. It
could, we believe, be further improved by adding a ``generalized interval
timer'' feature (analogous to the time version), whereby instead of
receiving a signal on every resource usage, the user could set an
interval (in terms of gticks) at which to be signaled. This would allow
monitoring with even less overhead, avoiding the context switch necessary
to handle each resource signal.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

6. EXPERIENCE/CASE STUDIES

While the System V implementation of CPPROF is still in testing, we have
nevertheless applied it to several large programs, including six of the
Spec benchmarks. Though we are not familiar enough with the benchmark
programs to do detailed optimization, an example of what can be found
quickly by call path profiling is illustrated below.

Downward Call Path Profile for resource: User Time
1871 samples, sampled every 20 gticks
format: usage_fraction (call_path) [#raw_samples]
-----------------------------------------------------
1.00000 (xleval) [1871]
1.00000 (xleval evform) [1871]
1.00000 (xleval evform xprog) [1871]
1.00000 (xleval evform evfun) [1871]
...
0.27579 (xleval xlgetvalue) [516]
...
0.09139 (xleval evform xlevlist consa) [171]
...
-----------------------------------------------------

This is a downward call path profile for user time of the 022.li Spec
benchmark, a Lisp interpreter (itself coded in C) solving the 9-Queens
problem. Since the code implements an interpreter, one expects to find
inefficiency compared to running compiled code. Given an understanding of
the structure of the program, the call path profile highlights these
inefficiencies. For example, the call path (xleval evform xlevlist consa)
uses 9.1% of the total run time. Looking at the code, the xlevlist
function is the interpreter's routine for evaluating the arguments to a
Lisp function call and returning a list of the evaluated actuals; consa
allocates a Lisp cons cell. Thus, this profile entry reports that simply
consing up the list of function call actuals (not including the time to
evaluate each actual) is using 9.1% of the total time. Compiled code
would avoid this inefficiency by passing values at fixed locations on the
stack. Note that a prof(1) profile does not show this at all, because
most of the work done by consa is done in its call descendants. Also,
there is more than one caller of consa in the program, so we could not
simply have deduced this information from a function profile alone.

In addition to testing on the Spec benchmarks, Lisp-based implementations
of the call path profiler have been used on several large programs to
find significant optimizations. In one case (involving a system of
several thousand lines of Lisp source code), call path profiling exposed
the fact that one of the program modules had mistakenly been left
uncompiled, and hence was run interpreted. This was shown simply by the
presence of ``extra'' calls in call paths to stack blocks established by
the interpreter that are normally compiled out. Simply compiling the
module resulted in a 10% savings in total run time.
In another large application, call path profiling detected the fact that
redundant calls to a data structure canonicalization procedure slowed the
program significantly, pointing out the need to canonicalize only when
necessary.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

7. DISCUSSION

%%%%%%%%%%%%%%%%

7.1 LIMITATIONS

The sampling approach to profiling implemented in virtually all profilers
to date has several limitations, centering on statistical significance
and unwanted correlations between the sampling and the executing code.
While time sampling using setitimer(2) suffers from these problems,
sampling other resources fares better. By randomizing the general timer
interval, we can avoid the correlation problem, and by reducing the mean
of the timer interval, we can increase the statistical significance as
much as desired (though this may further perturb the program's unsampled
behavior).

Call path profiling implemented by sampling has some other limitations.
As discussed in a previous paper\cite{hall92}, a sampled call path
profile can only give information on dynamic call paths, because we
sample the actual call stack at run time. The user is actually interested
in lexical call paths, i.e. those in the source code, because those are
the ones that are directly optimizable. In most cases, the two sets of
call paths are the same, but they can differ in programs where a function
is stored or passed as a first-class data object and actually called in a
context different from where it was created or referred to. For example,
a call path profile may report a call path having a call from f to g
when, in the program text, there is no explicit call of g from f;
instead, a pointer to g is passed somehow into f and is then called.
Ideally, one would like the call path to reflect the lexical originator
of g, instead of the run-time caller.
This problem is fundamental to the sampling-based approach and can only be solved by instrumenting the source code.

Another limitation of our implementation is that we have ignored the fact that one function may call another from more than one call site in its body. Ideally, the user would like to see call paths that distinguish the different call sites, since different call sites can have widely different costs. It is straightforward to incorporate this into CPPROF, because the stack samples are recorded as program addresses; we simply have not yet added this level of detail to the prototype.

%%%%%%%%%%%%%%%%
7.2 OTHER RELEVANT WORK

Some profilers, such as the Franz Common Lisp graphical profiling tool\cite{franz90} (time and space resources) and the Plan 9 profiler\cite{plan9} (time only), collect and display a tree-based, hierarchical representation of what we have called the sample tree: time spent in individual program stack traces. We term this a sample tree profile. For non-recursive programs, it gives the same information as a downward call path profile rooted at the main routine; however, the call path profiler provides better summarization and attention-focusing tools for complex programs.
For example, a sample tree profile of the forms program of Figure 1 looks like this:

Sample Tree Profile for resource: Real Time
614 samples, sampled every 20 gticks
format: frame_designator (usage_fraction) [#raw_samples]
-----------------------------------------------------
main (1.0000) [614]
  invoice (0.1743) [107]
    address_information (0.1384) [85]
      db_get_property (0.1384) [85]
        db_read_record (0.1384) [85]
    db_get_property (0.0358) [22]
      db_read_record (0.0358) [22]
  loan_application (0.1726) [106]
    address_information (0.1368) [84]
      db_get_property (0.1368) [84]
        db_read_record (0.1368) [84]
    db_get_property (0.0358) [22]
      db_read_record (0.0358) [22]
  envelope (0.1726) [106]
    address_information (0.1384) [85]
      db_get_property (0.1384) [85]
        db_read_record (0.1384) [85]
    db_get_property (0.0342) [21]
      db_read_record (0.0342) [21]
  form_US_1040 (0.1726) [106]
    address_information (0.1368) [84]
      db_get_property (0.1368) [84]
        db_read_record (0.1368) [84]
    db_get_property (0.0358) [22]
      db_read_record (0.0358) [22]
  form_NJ_1040 (0.1710) [105]
    address_information (0.1368) [84]
      db_get_property (0.1368) [84]
        db_read_record (0.1368) [84]
    db_get_property (0.0342) [21]
      db_read_record (0.0342) [21]
  ...
-----------------------------------------------------

We have used indentation to represent depth in the tree. Even though this mass of data is complete, in the sense that any other profile can be derived from it, it does not focus one's attention on worthwhile optimization opportunities. Nothing looks very significant, because every node below main has a usage fraction less than 0.2; in particular, every entry for db_read_record is less than 0.14. In order to find that db_read_record itself accounts for essentially all the time in the program, one must add up fractions that are widely distributed over the tree. Furthermore, to find the usage fraction for any optimizable context not rooted at main, one must perform a similar addition.
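To make the contrast concrete, the sample tree above can be fed to the downward call path profiling algorithm given in the Appendix. The following Python sketch is ours, not CPPROF's (the tuple representation and helper names are our own, and the entries elided by ``...'' in the figure are omitted, so the transcribed tree covers 530 of the 614 samples); it aggregates the scattered db_read_record fractions into a single call path entry:

```python
def canonicalize(path):
    # Collapse recursion: paths are stored newest-frame-first, so if the
    # newest frame name occurs again deeper in the path, continue from
    # that earlier occurrence (Procedure Canonicalize-name).
    for i in range(1, len(path)):
        if path[i] == path[0]:
            return path[i:]
    return path

def cpp_rec(node, parent_path, records):
    # Procedure CPP-Rec: add each node's (inclusive) weight to the record
    # of its call path, using a lock bit so that under recursion the same
    # ticks are not attributed to one call path twice.
    name, weight, children = node
    path = (name,) + parent_path
    record = records.setdefault(path, [0, False])   # [ticks, locked?]
    was_locked = record[1]
    if not was_locked:
        record[1] = True
        record[0] += weight
    next_parent = canonicalize(path)
    for child in children:
        cpp_rec(child, next_parent, records)
    if not was_locked:
        record[1] = False

def downward_cp_profile(tree, root_fn):
    # Procedures Downward-CP-Profile and Summarize-below-root: profile
    # every subtree rooted at root_fn, then sort by decreasing ticks.
    records = {}
    def below_root(node):
        if node[0] == root_fn:
            cpp_rec(node, (), records)
        else:
            for child in node[2]:
                below_root(child)
    below_root(tree)
    return sorted(((p, r[0]) for p, r in records.items()),
                  key=lambda entry: -entry[1])

# Transcription of the sample tree above; a node is
# (name, inclusive_sample_count, [children]).
def form(name, w, addr, direct):
    return (name, w,
            [("address_information", addr,
              [("db_get_property", addr,
                [("db_read_record", addr, [])])]),
             ("db_get_property", direct,
              [("db_read_record", direct, [])])])

tree = ("main", 614,
        [form("invoice", 107, 85, 22),
         form("loan_application", 106, 84, 22),
         form("envelope", 106, 85, 21),
         form("form_US_1040", 106, 84, 22),
         form("form_NJ_1040", 105, 84, 21)])

# Rooted at db_read_record, the ten scattered entries collapse into one
# call path accounting for 530 of the 614 samples (about 86%).
profile = downward_cp_profile(tree, "db_read_record")
```

Because node weights are inclusive, the lock bit is all that prevents a recursive path from being charged more than once; on this non-recursive tree it never triggers, and the summation the reader would otherwise do by hand falls out of the profile directly.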
The call path profiling methodology discussed in Section 4.4, on the other hand, leads the user directly to considering the three most worthwhile optimizable contexts.

%%%%%%%%%%%%%%%%
7.3 FUTURE WORK

We believe call path profiling of general resources should be incorporated into any profiling environment that is sophisticated enough to record stack samples, such as Sparcworks\cite{sun92} or the Franz profiler\cite{franz90}. This involves two extensions. First, the environment must add the capability to sample monotonic resources on an interval basis, ideally with kernel support. Second, the environment must provide an interactive capability to compute and display call path profiles based on user-determined root functions and thresholds. While the first extension may be somewhat involved, the second could be implemented quite quickly for whichever resources are already supported, such as elapsed time or allocated space.

%%%%%%%%%%%%%%%%
7.4 CONCLUSION

Call path profiling provides an informative description of where a program consumes its resources. It is closely tied to optimization, because it reports the costs of optimizable program contexts rather than simply the context-free costs of function bodies. This technique for describing ``where'' is equally applicable to any monotonic resource (such as time, space, page faults, traps, and system calls), so by profiling via System V's proc(4) filesystem, CPPROF is able to answer the ``why'' profiling question in a richly detailed way as well.

Our research also indicates a way in which proc(4) (or any process monitoring facility) could be improved: provide kernel support for ``generalized interval timers.'' That is, allow the user to select a sampling interval for each type of resource and signal the monitoring process only as each interval elapses. The current proc(4) signals on every resource gtick.
This can needlessly increase the monitoring overhead of a run, both lengthening the overall execution time of the profile run and increasing the uncorrectable perturbation of the program's true behavior (due to extra context switches, cache flushes, etc.). On the other hand, adding this general service to the kernel should be inexpensive and simple to do. (We have implemented the generalized interval timers at the user level, but one expects that kernel-level support would sharply reduce the overhead of the mechanism.)

We believe that too few programmers profile their programs. Perhaps this is due, in part, to the difficulty of connecting previous forms of profiler output with optimizable program contexts. We hope that call path profiling can remove much of the detective work previously required, making profiling more attractive to programmers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A. APPENDIX

This appendix presents pseudo-code for a general algorithm for computing downward call path profiles from a previously constructed sample tree. It works for arbitrarily recursive programs and for an arbitrary choice of root function by the user. A separate output routine takes the output of this program and formats it for readability; in particular, it divides the number of sample hits in each call path's profile entry by the total number of sample hits incurred during the run, in order to report a resource usage fraction.

Call paths are represented as lists of strings, each string being a function name. The functions cons, first, and rest operate on lists: cons constructs a new list by adding an element at the beginning of an old list; first returns the first element of a list; and rest returns the list obtained by removing the first element of a list.

The basic idea of the algorithm is to recursively descend the sample tree and associate each node with a canonical call path that represents it.
Then, we add the weight of the node to the ticks field of that call path's record, unless the record is locked. We must use a locking protocol in the presence of recursion to avoid attributing the same ticks to a given call path multiple times.

PSEUDO-CODE FIGURE:

Let PATH-HASH be a hash table whose keys are call paths and whose
values are records containing a ticks field and a locked? bit.

Procedure Downward-CP-Profile (SAMPLE-TREE, ROOT-FN)
  (Returns a sorted list of call path resource usage records)
  Initialize PATH-HASH to be empty.
  Summarize-below-root(SAMPLE-TREE, ROOT-FN)
  Let RESULT be a list of every (key, value) pair in PATH-HASH.
  Sort RESULT in decreasing order of ticks field.
  Return RESULT.

Procedure Summarize-below-root (TREE-NODE, ROOT-FN)
  If name(TREE-NODE) = ROOT-FN
  Then CPP-Rec(TREE-NODE, ())
  Else For each child CHILD of TREE-NODE
         Summarize-below-root(CHILD, ROOT-FN)

Procedure CPP-Rec (TREE-NODE, PARENT-CALL-PATH)
  Let NAME = name(TREE-NODE)
      CP = cons(NAME, PARENT-CALL-PATH)
      NEXT-PARENT = Canonicalize-name(CP, NAME)
      RECORD = get-record(CP, PATH-HASH)
      LOCKED? = record-locked?(RECORD)
  If not LOCKED?
  Then lock-record(RECORD)
       add-ticks(weight(TREE-NODE), RECORD)
  For each child CHILD of TREE-NODE
    CPP-Rec(CHILD, NEXT-PARENT)
  If not LOCKED? Then unlock-record(RECORD)

Procedure Canonicalize-name (CALL-PATH, THIS-NAME)
  Set PTR = rest(CALL-PATH)
  While PTR is not null
    If first(PTR) = THIS-NAME
    Then Return PTR
    Else Set PTR = rest(PTR)
  Return CALL-PATH

ACKNOWLEDGEMENTS

The authors would like to thank Mark Plotnick for his assistance. Thanks also to Rob Pike and Ken Thompson for stimulating discussions of this work.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
REFERENCES

\bibitem{plan9} The Plan 9 Programmer's Manual. Murray Hill, NJ: AT&T Bell Laboratories, 1993.

\bibitem{faulkneretal91} R. Faulkner & R. Gomes, The process file system and process model in UNIX System V, In Proc. Winter 1991 USENIX Conference.
Berkeley, CA: USENIX Assoc., 1991, pp. 243--252.

\bibitem{flajoletetal91} P. Flajolet, B. Salvy, & P. Zimmerman, Automatic average-case analysis of algorithms, Theoretical Computer Science 79, pp. 37--109. Amsterdam: Elsevier Science Publishers, 1991.

\bibitem{franz90} Franz, Inc., Allegro COMPOSER User Guide, version 1.0, Chapter 6. Berkeley, CA: Franz, Inc., 1990.

\bibitem{grahametal82} S.L. Graham, P.B. Kessler, & M.K. McKusick, Gprof: a call graph execution profiler, ACM SIGPLAN Notices 17(6), pp. 120--126. June 1982.

\bibitem{hall92} R.J. Hall, Call path profiling, In Proc. 14th International Conf. on Software Engineering. New York, NY: Assoc. for Computing Machinery, 1992, pp. 296--306.

\bibitem{knuth71} D.E. Knuth, An empirical study of FORTRAN programs, Software--Practice and Experience 1, pp. 105--133. 1971.

\bibitem{lemetayer88} D. Le Metayer, ACE: An Automatic Complexity Evaluator, ACM Trans. on Programming Languages and Systems 10(2), pp. 248--266. 1988.

\bibitem{sun92} SunPro, Performance Tuning an Application. Sun Microsystems, Inc., 1992.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Robert J. Hall (hall@research.att.com) has been a Member of Technical Staff at AT&T Bell Laboratories, Murray Hill, since 1991. Prior to that, he received his S.M., E.E., and Ph.D. degrees from the Massachusetts Institute of Technology, where his thesis research was performed under Charles Rich in the Artificial Intelligence Laboratory's Programmer's Apprentice Group. He received the B.S. degree summa cum laude in E.E.C.S. from the University of California at Berkeley.

Aaron J. Goldberg (aaron@research.att.com) has been a Member of Technical Staff at AT&T Bell Laboratories, Murray Hill, since 1992. Prior to that, he received his Ph.D. degree from Stanford University, where his thesis research was performed under John Hennessy in the DASH group. He received the A.B. degree summa cum laude in Applied Mathematics from Harvard University.