The following paper was originally published in the Proceedings of the
USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Call Path Profiling of Monotonic Program Resources in UNIX

by Robert J. Hall and Aaron J. Goldberg
AT&T Bell Laboratories
600 Mountain Ave.
Murray Hill, NJ 07974

****************************************************************
NOTE: This plain text version has been cannibalized from the
original LaTeX source. Potentially important font changes have
necessarily been removed. Also, a postscript figure has been
removed, and some mathematical formatting has been turned into
straight text. The pseudo-code in the appendix has particularly
suffered from this transformation. The reader is strongly urged
to obtain either a paper copy or else use a postscript-viewing
program (such as ghostscript) to view the postscript form. If
you want postscript, send email to hall@research.att.com.
****************************************************************

Abstract

Practical performance improvement of a complex program must be guided by
empirical measurements of its resource usage. Essentially, the programmer
wants to know where in the source code the program is inefficient and why
this is so. The process interface of UNIX System V (proc(4)) provides
access to the raw data (e.g. time, faults, traps, and system calls)
necessary to answer the ``why'' question, but gives no guidance in
answering the ``where'' question.
This paper describes a novel approach to the latter, Call Path Profiling,
which is both more informative and more closely tied to the process of
program optimization than either trace-based or prof/gprof-like
approaches. In addition, by viewing consumption of a resource as the
ticking of a clock, we generalize the interval-based sampling approach of
time profilers to arbitrary monotonic resources. The approach is embodied
in several prototypes, including CPPROF, which operates under System V.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1. INTRODUCTION

To improve the performance of a complex program effectively, one must
understand both why the program is too slow or big (i.e., what resources
it uses) and where it consumes those resources significantly. This allows
one to focus time and attention on re-examining critical design
decisions, while avoiding insignificant optimizations that needlessly
destroy the program's original clear structure. While there has been some
progress in automated static analysis of program
performance\cite{lemetayer88,flajoletetal91}, prospects for a practical
tool are dim. Thus, in practice one still turns to profiling: measuring
resource usage empirically\cite{knuth71}.

Previous research (see Section 3) has treated the ``where'' question only
crudely. In particular, most profilers (e.g. UNIX's prof(1)) report only
the costs of executing function bodies. While this is useful in locating
inefficiently coded function bodies, it is rare that inefficiency is
localized to one routine. Far more common is the case where a general
routine, efficiently coded for a general task, is called in a specialized
context, doing more work than necessary. The programmer can optimize by
creating a specialized version which is only called in the optimizable
context. A profiler should report resource usage in terms of these
optimizable program contexts, because they are the contexts most likely
to be directly optimizable.
Call path profiling\cite{hall92} is a technique for describing ``where''
in terms of optimizable program contexts. This paper reports on an
implementation of call path profiling of monotonic program resources
(e.g. time, faults, traps, and system calls; see Section 2) under the
process filesystem (proc(4)) abstraction\cite{faulkneretal91} of UNIX
System V. A first pass collects raw resource data about the target
process via proc(4), implementing an ``interval timer'' for each resource
of interest. Specifically, we view consumption of a resource as the
ticking of a clock. Each time the resource clock sweeps out a full
interval, we pause the target process and record in a cpprofdata file a
stack sample: the program counter and the sequence of return addresses
stored on the current program call stack. This does not incur excessive
overhead, primarily because we are sampling (recording infrequent but
randomly distributed data) instead of tracing (recording all data); we
trade perfect accuracy for low overhead. The second pass analyzes the
stack samples in the cpprofdata file and puts the user in a simple
command loop that allows interactive exploration of call path profiles of
various resources viewed from different root functions.

While call path profiling has previously been used in Lisp environments
for time and space, the proc(4) filesystem provides a unique opportunity
for profiling a wide variety of interesting resources in a uniform and
portable way. In addition, our research points to the need for a
``generalized interval timer'' service, which we believe should be
provided by the UNIX kernel for all monotonic system resources, analogous
to the existing interval timer (setitimer(2)).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2. MONOTONIC RESOURCES AND proc(4)

People usually complain about performance when programs are too slow.
Thus, the typical profiler reports on the consumption of a single
resource: execution time.
However, in many cases one wishes to know why the program spends time in
a given place, and this requires understanding its usage of other
resources. For example, a particular call to a floating point routine may
take much longer than expected due to floating point exception
processing. A time profiler does not help one understand why the call is
slow, but a profile of where the floating point exceptions occur in a
program informs the user directly. Similarly, one might wish to profile
any or all of the resources listed below.

    user time
    real time
    floating point exceptions
    page faults
    input/output time
    bytes read or written
    system or function calls
    ...

The key property these all share is that their effects increase
monotonically over time; that is, the total cost over the program's
execution is equal to the sum of the costs incurred by individual
subroutines. For example, each page fault adds to the total run-time cost
of the program. By contrast, some resources, such as stack space, do not
have this property. Just because a given subroutine invocation uses k
bytes of stack space does not imply that the whole program execution must
use an additional k bytes; it may actually reuse the same k bytes
allocated by some previous subroutine invocation. A similar example is
parallel execution time. Our approach does not apply directly to these
non-monotonic resources, so by ``resource'' we will always mean
``monotonic resource.''

Another key property of the resources listed above is that a monitoring
process can acquire raw data about them for another process through the
proc(4) filesystem\footnote{To monitor time resources, CPPROF uses the
kernel's interval timer facilities (setitimer(2)).}. For example, the
monitoring process can arrange to be notified of every page fault, trap,
or system call encountered by the monitored process. Our System V
prototype, CPPROF, exploits this functionality.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

3. RELATED WORK

To illustrate call path profiling and contrast it with other profilers,
it will be useful to refer to the (contrived) example program,
process_db, shown below. Section 6 will discuss more realistic programs
to which we have applied our profiler.

    main()
    {
        DB db = read_db();
        print_salary_stats(uniquify_db(db));
    }

    DB_RECORD **uniquify_db(DB db)
    {
        DB_RECORD **dbptrs = build_db_ptrs(db);
        qsort(dbptrs, name_field_lt);
        merge_adjacent_records();
        return(dbptrs);
    }

    print_salary_stats(DB_RECORD **dbptrs)
    {
        int *salaries = extract_salaries();
        qsort(salaries, integer_lt);
        stat_summary(salaries);
    }

Essentially, this program reads a database of employee records, sorts it
into order by name field eliminating duplicate records, and then extracts
and sorts the salaries. Finally it prints various statistics about them,
such as range, median, etc. Note that the programmer realizes that
sorting the database directly would be inefficient, because qsort would
have to swap all the bytes of the records, so uniquify_db creates,
manipulates, and returns an array of pointers to database records.

Suppose now that we run this program on a database too large to fit in
the physical memory of our machine. UNIX's prof(1) reports the amount of
time spent in each function body. For process_db, it would report that
essentially all the time is spent in qsort. While true, this doesn't help
the optimizer, because qsort(3), being a library routine, figures to be
rather efficient. It is left to the programmer to infer whether something
about the program is inefficient and, if so, how to optimize it.

UNIX's gprof(1)\cite{grahametal82}, on the other hand, attempts to
compute a ``call graph'' profile which, when accurate, attributes the
time spent in a procedure and its descendants to the procedure. It also
attempts to report the fraction of time spent in each child of each
function.
Unfortunately, gprof's implementation makes the incorrect assumption that
every call to a given routine takes the same amount of time. Thus, a
gprof profile of process_db would report half of the time in the first
qsort call and half in the second. This turns out to be wrong; far more
time is spent in the first than the second. Even when a correct call
graph profile is available, as in the Franz Allegro Common Lisp
environment\cite{franz90}, it is only a simple special case (length 1 and
2 call paths) of a call path profile (Section 4). One can easily show (by
example) that these profiles do not give as much information as a full
call path profile. Franz's profiler is based on stack sampling in a way
conceptually similar to our own (see Section 5.1).

A completely different approach to profiling operates by tracing resource
usage. truss(1) provides a general mechanism for tracing a UNIX command
via the proc(4) filesystem. The user specifies which classes of resource
usage are of interest (page faults, traps, or system calls) and then
truss prints out a line on the terminal for each page fault, trap, or
system call of the selected type that occurs during execution of the
trussed process. While it does give a general idea about what resources
are being used, tracing generates a large amount of data which typically
must be summarized to be understood. For process_db above, truss simply
puts out a line for each of the several thousand page faults encountered,
giving little indication of which optimizable context in the program
caused the page fault.

The System V version of the SunPro Sparcworks Analyzer\cite{sun92}
provides a rich interface for tracing and profiling program resources.
For example, it gives the user a prof(1)-like profile of any of the
proc(4)-accessible resources, gives a graphical display of which text and
data pages were touched during execution (including a tool for laying out
a program to reduce text page faults), and provides many other features.
It even has an option that enables collecting periodic stack samples (the
same sampling technique we use in the call path profiler). However, it
samples only with respect to the time resource, and instead of a call
path profile, abstracts to the less informative gprof-style call graph
profile.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

4. CALL PATH PROFILING

Here is a call path profile of process_db:

Downward Call Path Profile for resource: Real Time
2676 samples, sampled every 50 gticks
format: usage_fraction (call_path) [#raw_samples]
-----------------------------------------------------
1.00000 (main) [2676]
0.88004 (main uniquify_db) [2355]
0.68012 (main uniquify_db qsort) [1820]
...
0.11323 (main print_salary_stats) [303]
0.11323 (main print_salary_stats extract_salaries) [303]
...
0.00000 (main print_salary_stats qsort) [0]
-----------------------------------------------------

The header information shows that this profile was collected by sampling
Real Time once every 50 generalized ticks (gticks). The gtick is a
polymorphic unit denoting a hundredth of a second for time resources, one
page fault for text and data page faults, one byte for space allocation,
etc. Each line after the header shows the resource usage of one call path
of the program. Intuitively, a call path represents a subset of the
program's execution corresponding to a class of program call stacks
(Section 4.1 defines this notion precisely). The first field is the
fraction of the program's total Real Time attributable to the call path,
the second field (in parentheses) names the call path, and the integer in
brackets indicates the raw number of samples for the call path.
From this profile, we can conclude that main calling uniquify_db calling
qsort consumes 68% of the time, while the other call to qsort (within
print_salary_stats) takes insignificant time (no samples hit there). This
``where'' description directly relates the resource usage to an
optimizable context, because the programmer can easily change the program
to call a specialized qsort within a specialized uniquify_db within main.

The question remains how to specialize qsort in this case. The
measurement seems anomalous, since the arrays sorted by the two qsort
calls have approximately the same number of elements (assuming few
duplications in the database). By using CPPROF to profile other
resources, we find

Downward Call Path Profile for resource: Page Faults
9950 samples, sampled every 10 gticks
format: usage_fraction (call_path) [#raw_samples]
-----------------------------------------------------
0.99889 (main) [9939]
0.71317 (main uniquify_db) [7096]
0.56995 (main uniquify_db qsort) [5671]
...
0.13467 (main print_salary_stats) [1340]
0.13457 (main print_salary_stats extract_salaries) [1339]
...
0.00000 (main print_salary_stats qsort) [0]
-----------------------------------------------------

This gives a clear indication of why the first sort takes so much more
time: the array to be sorted resides on so many different pages that they
cannot all be held in memory at once, causing a superlinear number of
data page faults (due to the superlinear number of key comparisons
required by sorting, where each key comparison must actually touch the
page to get the name field). Armed with this information, the user can
immediately focus on optimizing the critical bottleneck: since the sort
key fields are small, the program could localize the keys first
(incurring only linearly many page faults) and then sort without causing
further page faults. By optimizing process_db in this way, we cut the
total run time from 25 minutes to about 8.
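The key-localization optimization just described can be sketched in C as
follows. This is our own illustrative code, not the actual process_db
source; the record layout, sizes, and names are assumptions chosen only
to show the idea of copying the small keys into a compact array before
sorting.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical employee record: the sort key (name) is small
 * relative to the whole record, so an array of records spans
 * many pages while an array of keys spans few. */
typedef struct {
    char name[32];
    char payload[4064];        /* rest of the large record */
} DB_RECORD;

/* Compact key: the name plus a pointer back to its record. */
typedef struct {
    char name[32];
    DB_RECORD *rec;
} SORT_KEY;

static int key_cmp(const void *a, const void *b)
{
    return strcmp(((const SORT_KEY *)a)->name,
                  ((const SORT_KEY *)b)->name);
}

/* Copy the keys out (linearly many page faults over the big
 * records), then sort the compact array; the comparisons no
 * longer touch the record pages at all. */
SORT_KEY *sort_by_localized_keys(DB_RECORD *db, size_t n)
{
    SORT_KEY *keys = malloc(n * sizeof *keys);
    size_t i;
    for (i = 0; i < n; i++) {
        memcpy(keys[i].name, db[i].name, sizeof keys[i].name);
        keys[i].rec = &db[i];
    }
    qsort(keys, n, sizeof *keys, key_cmp);
    return keys;
}
```

The sorted key array can then drive duplicate elimination or output in
name order, dereferencing each rec pointer only once.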
This improvement matches well with what the profile predicts, since we
have effectively made the call path's cost zero. The optimization is
unnecessary for the call to qsort within print_salary_stats, because the
salaries are localized when they are extracted into a separate array.
This illustrates how CPPROF provides a unique combination of highly
detailed ``where'' descriptions with a rich variety of ``why''
explanations.

%%%%%%%%%%%%%%%%

4.1 WHAT IS A CALL PATH

A call path is a sequence of function pairs ((f_1, g_1), (f_2, g_2), ...,
(f_n, g_n)) such that each f_i executes a subroutine call to g_i, and g_i
= f_{i+1}. In view of the latter condition, we will always abbreviate
such a call path by (f_1, f_2, ..., f_n, g_n), as we did in the call path
profiles above. By this definition, it is easy to see why a call path
corresponds to an optimizable program context: any nested sequence of
calls can be rendered separately optimizable by duplicating (with a
different name) each subroutine (except the first) in the list so that it
is only called by its predecessor. For example, in our program we could
make the call path (main uniquify_db qsort) optimizable by creating
qsort_1 and uniquify_db_1 (which calls it) and altering main to call
uniquify_db_1. Note that for recursive programs, one must be careful to
maintain an isomorphic call graph by renaming recursive calls to the new
name of the recursive function. A general algorithm for contextualizing a
call path which works for arbitrarily recursive program structures is
given in an earlier paper\cite{hall92}; space constraints prohibit
reproducing it here.

%%%%%%%%%%%%%%%%

4.2 WHAT THE PROFILE REPORTS

The resource fraction f_p reported for a call path p = (a b ... n) in a
profile has a simple interpretation. Suppose one constructs the
optimizable program for p by contextualizing it as above, replacing the
call in a to b with a call to b_1, which in turn calls c_1, and so on,
which finally calls n_1.
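The duplicating transformation of Section 4.1 can be made concrete with a
schematic sketch. The function bodies here are stand-ins of our own (call
counters instead of real work); only the call structure matters, and it
mirrors the qsort_1 / uniquify_db_1 example above.

```c
/* Before contextualization, both call paths shared one general
 * qsort-like routine.  After it, the chain
 * main -> uniquify_db_1 -> qsort_1 has private copies, so
 * qsort_1 can be specialized without affecting other callers. */

static int general_calls, specialized_calls;

static void qsort_general(void) { general_calls++; }
static void qsort_1(void)       { specialized_calls++; /* specialize here */ }

static void uniquify_db_1(void)      { qsort_1(); }
static void print_salary_stats(void) { qsort_general(); }

/* main now calls the duplicated chain; the other call path
 * still reaches the unmodified general routine. */
static void run_main(void)
{
    uniquify_db_1();
    print_salary_stats();
}
```

After this mechanical step, any optimization applied inside qsort_1 is
visible only to the one call path being targeted.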
We can interpret f_p simply as the fraction of resource usage that would
be saved if n_1 were somehow magically transformed to use none of the
resource. For example, the time profile for process_db predicts that if
one were to make qsort take no time at all in computing its output when
called from uniquify_db, then the total run time of the resulting program
would be only 1 - 0.68 = 0.32 as much as the original. This was achieved
when we actually carried out the optimization and measured the result.

%%%%%%%%%%%%%%%%

4.3 UPWARD AND DOWNWARD

CPPROF provides two complementary types of call path profiles: upward and
downward. Both are defined in terms of a distinguished subroutine, called
the root function. In our example above, the root function is main, but
it can be any function in the program.

Given a root function, the downward call path profile is simply a sorted,
filtered list of the resource usages of all call paths whose first
function name is the root. (The filtering is based on a user-specifiable
significance threshold. Any profile entry whose usage fraction falls
below the threshold is suppressed from the output.) Both profiles above
are downward call path profiles. Downward profiles provide a hierarchical
breakdown of the resource consumed during the execution of the root
function.

An upward call path profile is a sorted, filtered list of the resource
usages of all call paths whose last function name is the root function.
This provides a breakdown of which call ancestors of the root function
are incurring the most costs due to it. An important use for this is when
one knows that the program is spending most of its time in, say, memory
allocation, but doesn't know which calls to it cost the most.

gprof(1)\cite{grahametal82}, when its assumption about constant procedure
execution times is accurate, provides the information in both upward and
downward call path profiles if we restrict call path lengths to at most
two.

[FIGURE 1 HERE]

CAPTION: Figure 1.
An abstracted call graph for the forms program. main calls a specialized
processing function for each type of form. These routines call (among
others) a utility routine, address_information, that returns full name,
address, phone, etc., information on an individual in the database. The
intermediate utility db_get_property extracts a property value from a
database entry. The low level routine db_read_record performs a disk read
to find the record for a given individual. Some subroutines and calls
have been left out for brevity.

%%%%%%%%%%%%%%%%

4.4 A USEFUL TECHNIQUE

One useful technique for using CPPROF is to ask first for a function
profile, which is essentially just a sorted, filtered list of profile
entries for all length 1 call paths in the program (i.e., the resource
fraction consumed in a function and on its behalf). Next, ask for upward
and/or downward call path profiles rooted at functions shown to be
significant in the function profile. This assures that one will not be
prematurely swamped with a large amount of irrelevant data.

To illustrate this technique, consider the program shown abstractly in
Figure 1. This program fills out forms for users by looking up
information in an on-line database. To see why this program is
inefficient, we first ask CPPROF for a function profile.

Body+Descendants Function Profile for resource: Real Time
614 samples, sampled every 20 gticks
format: usage_fraction function_name [#raw_samples]
-----------------------------------------------------
1.0000 main [614]
1.0000 db_read_record [614]
0.8632 db_get_property [530]
0.6873 address_information [422]
0.1743 invoice [107]
0.1726 form_US_1040 [106]
0.1726 envelope [106]
0.1726 loan_application [106]
0.1710 form_NJ_1040 [105]
0.1368 db_update_record [84]
-----------------------------------------------------

This shows us that the most significant routines are db_read_record,
db_get_property, and address_information.
At this point, we know that the most significant efficiency gains (if
there are any) are to be had in optimizable contexts within these
routines. For more specific information, we set the root to the most
significant of these, db_read_record, and request an upward call path
profile for it.

Upward Call Path Profile for resource: Real Time
614 samples, sampled every 20 gticks
format: usage_fraction (call_path) [#raw_samples]
-----------------------------------------------------
1.0000 (db_read_record) [614]
0.8632 (db_get_property db_read_record) [530]
0.6873 (address_information db_get_property db_read_record) [422]
0.1384 (envelope address_information db_get_property db_read_record) [85]
0.1384 (invoice address_information db_get_property db_read_record) [85]
...
-----------------------------------------------------

This profile directs us to consider first three particular optimizable
contexts, each ending in a call to db_read_record. By examining these in
turn, we can discover that the routine address_information calls
db_get_property four times to retrieve four fields of the same database
entry. This results in four successive disk reads of the same record. By
specializing db_get_property for use only within address_information (to
cache the record after the first read), we can save 75% of the 0.6873
total run time fraction, reducing the overall run time by about 51%. Note
that we could not simply optimize db_get_property itself, because other
calls to it (from outside of address_information) require its full
generality.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

5. IMPLEMENTATION

%%%%%%%%%%%%%%%%

5.1 STACK SAMPLING

CPPROF is implemented by a monitor process which controls the target
process via proc(4) (the time resources are handled slightly differently,
using the UNIX interval timers (setitimer(2))). The user specifies which
resources are of interest and the monitor process arranges to be notified
each time one gtick is consumed.
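The monitor can turn these per-gtick notifications into samples with a
small accumulator per resource. The following is a minimal sketch of such
a ``generalized interval timer'' in our own illustrative code (the names
and struct are assumptions, not CPPROF's actual internals):

```c
/* Accumulate gticks for one resource and report how many
 * sampling-interval boundaries the latest consumption crossed;
 * each boundary crossed credits one hit to the stack sample
 * taken at that point. */
typedef struct {
    long interval;      /* gticks per sample, e.g. 1024 bytes read */
    long accumulated;   /* gticks seen since the last boundary */
} gtimer;

static long gtimer_consume(gtimer *t, long gticks)
{
    long hits;
    t->accumulated += gticks;
    hits = t->accumulated / t->interval;    /* boundaries crossed */
    t->accumulated %= t->interval;          /* carry the remainder */
    return hits;        /* hits to credit to this stack sample */
}
```

With a 1024-gtick interval, a single read(2) of 10240 bytes credits 10
hits at once, just as ten separate 1024-byte reads would.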
We then implement our own ``generalized interval timer'' for each
resource by counting gticks until the sampling interval\footnote{To avoid
problems with correlation between periodic sampling and periodic program
behavior, one can randomly perturb the sampling interval each time.}
passes. For example, we examine each read(2) system call and record the
number of bytes requested, each byte being a gtick. Each time a general
resource clock passes a sampling interval boundary (e.g. 1024 bytes
read), we halt the target process and record a stack sample for the
resource in a cpprofdata file. When more gticks pass than a single sample
boundary, we credit that many hits to the sample. For example, reading
10240 bytes will result in 10 hits for the corresponding stack sample. A
stack sample is simply a list of the number of sample hits, the current
program counter, the return address of the current stack frame, the
return address of the caller, that of the caller's caller, etc., until we
reach the root procedure (typically main for C programs).

%%%%%%%%%%%%%%%%

5.2 INTERACTIVE PROFILE ANALYSIS

After the target and monitoring processes finish executing, the cpprof
command analyzes the cpprofdata file and puts the user in a simple
command loop allowing one to view upward and downward call path profiles
from user-determined root functions, with user-determined thresholds.
Also available are a function profile and a body profile (a list of the
resource usage consumed in the body of each function; this generalizes
prof(1) to arbitrary resources). Thus, rather than producing a single
static profile, CPPROF is an interactive exploratory environment for
understanding the performance of a program empirically.

%%%%%%%%%%%%%%%%

5.3 COMPUTING CALL PATH PROFILES

We compute a downward call path profile in two steps. First, CPPROF reads
the cpprofdata file once and builds a sample tree. Second, it computes
the call path profile for the chosen root from the sample tree.
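In outline, the sample tree and its weight computation can be sketched as
follows. This is a simplified illustration with names of our own
choosing; the real implementation also converts raw addresses to function
names via the symbol table before insertion.

```c
#include <stdlib.h>
#include <string.h>

/* One node per distinct call-stack prefix: a function name, the
 * hits recorded exactly at this stack, and a sibling-linked
 * list of children. */
typedef struct node {
    const char *name;
    long hits;                  /* samples whose stack ends here */
    long weight;                /* hits here + in all descendants */
    struct node *child, *sibling;
} node;

static node *find_or_add_child(node *parent, const char *name)
{
    node *c;
    for (c = parent->child; c; c = c->sibling)
        if (strcmp(c->name, name) == 0)
            return c;
    c = calloc(1, sizeof *c);   /* install a new node on demand */
    c->name = name;
    c->sibling = parent->child;
    parent->child = c;
    return c;
}

/* Insert one symbolic stack sample (outermost frame first),
 * crediting `hits` to the node for the innermost frame. */
static void insert_sample(node *root, const char **frames, int n, long hits)
{
    node *cur = root;
    int i;
    for (i = 1; i < n; i++)     /* frames[0] is the root itself */
        cur = find_or_add_child(cur, frames[i]);
    cur->hits += hits;
}

/* Post-pass: weight = own hits plus the weights of all children. */
static long compute_weights(node *n)
{
    node *c;
    n->weight = n->hits;
    for (c = n->child; c; c = c->sibling)
        n->weight += compute_weights(c);
    return n->weight;
}
```

In the simple nonrecursive case, the fraction for a call path is then the
weight of its last node divided by the weight of the root node.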
In this section, we explain how to compute a downward call path profile
from a given root function for a single resource. Function profiles,
upward profiles, and handling multiple resources are all straightforward
extensions of the basic technique.

To build the sample tree, each sample is first converted from a list of
addresses into a list of function names by using symbol table
information. Once we have the symbolic form of the sample, we insert it
into the sample tree as follows. Each node of the sample tree is labeled
with a function name, and contains fields for the number of sample hits
corresponding to that call stack and a list of children. The root node
has the name of the root function, typically main. For each sample, we
locate the unique path in the tree whose node labels are the same as the
function names in the sample (if no such path yet exists, new nodes are
installed to create it). The sample hits field of the last node on this
path is then incremented by the number of sample hits corresponding to
the sample. Once all samples are recorded in the tree, we compute for
each node in the tree the sum of its own sample hits and those of all its
descendants; this number is termed the ``weight'' of the node.

In the special case of a nonrecursive program for which the user's choice
of root function is the same as the label of the root node of the sample
tree, the resource fraction for any call path may be read directly from
the sample tree. One locates the unique path in the sample tree labeled
with the same list of function names and divides the last node's weight
by the weight of the root node. To handle the general case, including
recursive programs and non-main root functions, we use the more general
algorithm described in the Appendix.

%%%%%%%%%%%%%%%%

5.4 PORTABILITY ISSUES

The great majority of the code for CPPROF is completely portable ANSI C.
One small module, the stack trace sampler, is architecture- (and
sometimes compiler-)dependent.
Stack and register conventions vary across machines; however, once these
conventions are understood, walking the call stack using the proc(4)
interface is straightforward to implement. Obviously, the part of CPPROF
that monitors via proc(4) is portable only among UNIX System V systems. A
limited version of CPPROF (time only) is implemented under BSD UNIX. The
technique has also been implemented via stack sampling for time and space
resources in Franz Allegro Common Lisp v4.1 and under Macintosh Common
Lisp v2.0. Thus, it appears that the technique generalizes easily to
other operating systems and machines, limited only by how well the
environment supports access to resource data.

System V's proc(4) appears to be the richest environment so far. It
could, we believe, be further improved by adding a ``generalized interval
timer'' feature (analogous to the time version), whereby instead of
receiving a signal on every resource usage, the user could set an
interval (in terms of gticks) at which to be signaled. This would allow
monitoring with even less overhead, avoiding the context switch necessary
to handle each resource signal.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

6. EXPERIENCE/CASE STUDIES

While the System V implementation of CPPROF is still in testing, we have
nevertheless applied it to several large programs, including six of the
Spec benchmarks. Though we are not familiar enough with the benchmark
programs to do detailed optimization, an example of what can be found
quickly by call path profiling is illustrated below.

Downward Call Path Profile for resource: User Time
1871 samples, sampled every 20 gticks
format: usage_fraction (call_path) [#raw_samples]
-----------------------------------------------------
1.00000 (xleval) [1871]
1.00000 (xleval evform) [1871]
1.00000 (xleval evform xprog) [1871]
1.00000 (xleval evform evfun) [1871]
...
0.27579 (xleval xlgetvalue) [516]
...
0.09139 (xleval evform xlevlist consa) [171]
...
-----------------------------------------------------

This is a downward call path profile for user time of the 022.li Spec
benchmark, a Lisp interpreter (itself coded in C) solving the 9-Queens
problem. Since the code implements an interpreter, one expects to find
inefficiency compared to running compiled code. Given an understanding of
the structure of the program, the call path profile highlights these
inefficiencies. For example, the call path (xleval evform xlevlist consa)
uses 9.1% of the total run time. Looking at the code, the xlevlist
function is the interpreter's routine for evaluating the arguments to a
Lisp function call and returning a list of the evaluated actuals; consa
allocates a Lisp cons cell. Thus, this profile entry reports that simply
consing up the list of function call actuals (not including the time to
evaluate each actual) is using 9.1% of the total time. Compiled code
would avoid this inefficiency by passing values at fixed locations on the
stack. Note that a prof(1) profile does not show this at all, because
most of the work done by consa is done in its call descendants. Also,
there is more than one caller of consa in the program, so we could not
simply have deduced this information from a function profile alone.

In addition to testing on the Spec benchmarks, Lisp-based implementations
of the call path profiler have been used on several large programs to
find significant optimizations. In one case (involving a system of
several thousand lines of Lisp source code), call path profiling exposed
the fact that one of the program modules had mistakenly been left
uncompiled, and hence was run interpreted. This was shown simply by the
presence of ``extra'' calls in call paths to stack blocks established by
the interpreter that are normally compiled out. Simply compiling the
module resulted in a 10% savings in total run time.
In another large application, call path profiling detected the fact that
redundant calls to a data structure canonicalization procedure slowed the
program significantly, pointing out the need to canonicalize only when
necessary.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

7. DISCUSSION

%%%%%%%%%%%%%%%%

7.1 LIMITATIONS

The sampling approach to profiling implemented in virtually all profilers
to date has several limitations, centering on statistical significance
and unwanted correlations between the sampling and the executing code.
While time sampling using setitimer(2) suffers from these problems,
sampling other resources fares better. By randomizing the general timer
interval, we can avoid the correlation problem, and by reducing the mean
of the timer interval, we can increase the statistical significance as
much as desired (though this may further perturb the program's unsampled
behavior).

Call path profiling implemented by sampling has some other limitations.
As discussed in a previous paper\cite{hall92}, a sampled call path
profile can only give information on dynamic call paths, because we
sample the actual call stack at run time. The user is actually interested
in lexical call paths, i.e. those in the source code, because those are
the ones that are directly optimizable. In most cases, the two sets of
call paths are the same, but they can differ in programs where a function
is stored or passed as a first-class data object and actually called in a
context different from where it was created or referred to. For example,
a call path profile may report a call path having a call from f to g
when, in the program text, there is no explicit call of g from f;
instead, a pointer to g is passed somehow into f and is then called.
Ideally, one would like the call path to reflect the lexical originator
of g, instead of the run-time caller.
This problem is fundamental to the sampling-based approach and can only be solved by instrumenting the source code.

Another limitation of our implementation is that we have ignored the fact that one function may call another from more than one call site in its body. Ideally, the user would like to see call paths that distinguish the different call sites, since different call sites can have widely different costs. It is straightforward to incorporate this into CPPROF, because the stack samples are recorded as program addresses; we simply have not yet added this level of detail to the prototype.

%%%%%%%%%%%%%%%%
7.2 OTHER RELEVANT WORK

Some profilers, such as the Franz Common Lisp graphical profiling tool\cite{franz90} (time and space resources) and the Plan 9 profiler\cite{plan9} (time only), collect and display a tree-based, hierarchical representation of what we have called the sample tree: time spent in individual program stack traces. We term this a sample tree profile. For non-recursive programs, it gives the same information as a downward call path profile rooted at the main routine; however, the call path profiler provides better summarization and attention-focusing tools for complex programs.
For example, a sample tree profile of the forms program of Figure 1 looks like this:

Sample Tree Profile for resource: Real Time
614 samples, sampled every 20 gticks
format: frame_designator (usage_fraction) [#raw_samples]
-----------------------------------------------------
main (1.0000) [614]
  invoice (0.1743) [107]
    address_information (0.1384) [85]
      db_get_property (0.1384) [85]
        db_read_record (0.1384) [85]
    db_get_property (0.0358) [22]
      db_read_record (0.0358) [22]
  loan_application (0.1726) [106]
    address_information (0.1368) [84]
      db_get_property (0.1368) [84]
        db_read_record (0.1368) [84]
    db_get_property (0.0358) [22]
      db_read_record (0.0358) [22]
  envelope (0.1726) [106]
    address_information (0.1384) [85]
      db_get_property (0.1384) [85]
        db_read_record (0.1384) [85]
    db_get_property (0.0342) [21]
      db_read_record (0.0342) [21]
  form_US_1040 (0.1726) [106]
    address_information (0.1368) [84]
      db_get_property (0.1368) [84]
        db_read_record (0.1368) [84]
    db_get_property (0.0358) [22]
      db_read_record (0.0358) [22]
  form_NJ_1040 (0.1710) [105]
    address_information (0.1368) [84]
      db_get_property (0.1368) [84]
        db_read_record (0.1368) [84]
    db_get_property (0.0342) [21]
      db_read_record (0.0342) [21]
  ...
-----------------------------------------------------

We have used indentation to represent depth in the tree. Even though this mass of data is complete, in the sense that any other profile can be derived from it, it does not focus one's attention on worthwhile optimization opportunities. Nothing looks very significant, because every node below main has a usage fraction less than 0.2; in particular, every entry for db_read_record is less than 0.14. In order to find that db_read_record itself accounts for essentially all the time in the program, one must add up fractions that are widely distributed over the tree. Furthermore, to find the usage fraction for any optimizable context not rooted at main, one must perform a similar addition.
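To make the contrast concrete, the sample tree above can be fed to the downward call path profiling algorithm given in the Appendix. The following Python sketch is ours, not CPPROF's (the tuple representation and helper names are our own, and the entries elided by ``...'' in the figure are omitted, so the transcribed tree covers 530 of the 614 samples); it aggregates the scattered db_read_record fractions into a single call path entry:

```python
def canonicalize(path):
    # Collapse recursion: paths are stored newest-frame-first, so if the
    # newest frame name occurs again deeper in the path, continue from
    # that earlier occurrence (Procedure Canonicalize-name).
    for i in range(1, len(path)):
        if path[i] == path[0]:
            return path[i:]
    return path

def cpp_rec(node, parent_path, records):
    # Procedure CPP-Rec: add each node's (inclusive) weight to the record
    # of its call path, using a lock bit so that under recursion the same
    # ticks are not attributed to one call path twice.
    name, weight, children = node
    path = (name,) + parent_path
    record = records.setdefault(path, [0, False])   # [ticks, locked?]
    was_locked = record[1]
    if not was_locked:
        record[1] = True
        record[0] += weight
    next_parent = canonicalize(path)
    for child in children:
        cpp_rec(child, next_parent, records)
    if not was_locked:
        record[1] = False

def downward_cp_profile(tree, root_fn):
    # Procedures Downward-CP-Profile and Summarize-below-root: profile
    # every subtree rooted at root_fn, then sort by decreasing ticks.
    records = {}
    def below_root(node):
        if node[0] == root_fn:
            cpp_rec(node, (), records)
        else:
            for child in node[2]:
                below_root(child)
    below_root(tree)
    return sorted(((p, r[0]) for p, r in records.items()),
                  key=lambda entry: -entry[1])

# Transcription of the sample tree above; a node is
# (name, inclusive_sample_count, [children]).
def form(name, w, addr, direct):
    return (name, w,
            [("address_information", addr,
              [("db_get_property", addr,
                [("db_read_record", addr, [])])]),
             ("db_get_property", direct,
              [("db_read_record", direct, [])])])

tree = ("main", 614,
        [form("invoice", 107, 85, 22),
         form("loan_application", 106, 84, 22),
         form("envelope", 106, 85, 21),
         form("form_US_1040", 106, 84, 22),
         form("form_NJ_1040", 105, 84, 21)])

# Rooted at db_read_record, the ten scattered entries collapse into one
# call path accounting for 530 of the 614 samples (about 86%).
profile = downward_cp_profile(tree, "db_read_record")
```

Because node weights are inclusive, the lock bit is all that prevents a recursive path from being charged more than once; on this non-recursive tree it never triggers, and the summation the reader would otherwise do by hand falls out of the profile directly.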
The call path profiling methodology discussed in Section 4.4, on the other hand, leads the user directly to considering the three most worthwhile optimizable contexts.

%%%%%%%%%%%%%%%%
7.3 FUTURE WORK

We believe call path profiling of general resources should be incorporated into any profiling environment that is sophisticated enough to record stack samples, such as Sparcworks\cite{sun92} or the Franz profiler\cite{franz90}. This involves two extensions. First, the environment must add the capability to sample monotonic resources on an interval basis, ideally with kernel support. Second, the environment must provide an interactive capability to compute and display call path profiles based on user-determined root functions and thresholds. While the first extension may be somewhat involved, the second could be implemented quite quickly for whichever resources are already supported, such as elapsed time or allocated space.

%%%%%%%%%%%%%%%%
7.4 CONCLUSION

Call path profiling provides an informative description of where a program consumes its resources. It is closely tied to optimization, because it reports the costs of optimizable program contexts rather than simply the context-free costs of function bodies. This technique for describing ``where'' is equally applicable to any monotonic resource (such as time, space, page faults, traps, and system calls), so by profiling via System V's proc(4) filesystem, CPPROF is able to answer the ``why'' profiling question in a richly detailed way as well.

Our research also indicates a way in which proc(4) (or any process monitoring facility) could be improved: provide kernel support for ``generalized interval timers.'' That is, allow the user to select a sampling interval for each type of resource and signal the monitoring process only as each interval elapses. The current proc(4) signals on every resource gtick.
This can needlessly increase the monitoring overhead of a run, both lengthening the overall execution time of the profile run and increasing the uncorrectable perturbation of the program's true behavior (due to extra context switches, cache flushes, etc.). On the other hand, adding this general service to the kernel should be inexpensive and simple to do. (We have implemented the generalized interval timers at the user level, but one expects that kernel-level support would sharply reduce the overhead of the mechanism.)

We believe that too few programmers profile their programs. Perhaps this is due, in part, to the difficulty of connecting previous forms of profiler output with optimizable program contexts. We hope that call path profiling can remove much of the detective work previously required, making profiling more attractive to programmers.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A. APPENDIX

This appendix presents pseudo-code for a general algorithm for computing downward call path profiles from a previously constructed sample tree. It works for arbitrarily recursive programs and for an arbitrary choice of root function by the user. A separate output routine takes the output of this program and formats it for readability; in particular, it divides the number of sample hits in each call path's profile entry by the total number of sample hits incurred during the run, in order to report a resource usage fraction.

Call paths are represented as lists of strings, each string being a function name. The functions cons, first, and rest operate on lists: cons constructs a new list by adding an element at the beginning of an old list; first returns the first element of a list; and rest returns the list obtained by removing the first element of a list.

The basic idea of the algorithm is to recursively descend the sample tree and associate each node with a canonical call path that represents it.
Then, we add the weight of the node to the ticks field of that call path's record, unless the record is locked. We must use a locking protocol in the presence of recursion to avoid attributing the same ticks to a given call path multiple times.

PSEUDO-CODE FIGURE:

Let PATH-HASH be a hash table whose keys are call paths and whose
values are records containing a ticks field and a locked? bit.

Procedure Downward-CP-Profile (SAMPLE-TREE, ROOT-FN)
  (Returns a sorted list of call path resource usage records)
  Initialize PATH-HASH to be empty.
  Summarize-below-root(SAMPLE-TREE, ROOT-FN)
  Let RESULT be a list of every (key, value) pair in PATH-HASH.
  Sort RESULT in decreasing order of ticks field.
  Return RESULT.

Procedure Summarize-below-root (TREE-NODE, ROOT-FN)
  If name(TREE-NODE) = ROOT-FN
  Then CPP-Rec(TREE-NODE, ())
  Else For each child CHILD of TREE-NODE
         Summarize-below-root(CHILD, ROOT-FN)

Procedure CPP-Rec (TREE-NODE, PARENT-CALL-PATH)
  Let NAME = name(TREE-NODE)
      CP = cons(NAME, PARENT-CALL-PATH)
      NEXT-PARENT = Canonicalize-name(CP, NAME)
      RECORD = get-record(CP, PATH-HASH)
      LOCKED? = record-locked?(RECORD)
  If not LOCKED?
  Then lock-record(RECORD)
       add-ticks(weight(TREE-NODE), RECORD)
  For each child CHILD of TREE-NODE
    CPP-Rec(CHILD, NEXT-PARENT)
  If not LOCKED? Then unlock-record(RECORD)

Procedure Canonicalize-name (CALL-PATH, THIS-NAME)
  Set PTR = rest(CALL-PATH)
  While PTR is not null
    If first(PTR) = THIS-NAME
    Then Return PTR
    Else Set PTR = rest(PTR)
  Return CALL-PATH

ACKNOWLEDGEMENTS

The authors would like to thank Mark Plotnick for his assistance. Thanks also to Rob Pike and Ken Thompson for stimulating discussions of this work.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
REFERENCES

\bibitem{plan9} The Plan 9 Programmer's Manual. Murray Hill, NJ: AT&T Bell Laboratories, 1993.

\bibitem{faulkneretal91} R. Faulkner & R. Gomes, The process file system and process model in UNIX System V, In Proc. Winter 1991 USENIX Conference.
Berkeley, CA: USENIX Assoc., 1991, pp. 243--252.

\bibitem{flajoletetal91} P. Flajolet, B. Salvy, & P. Zimmerman, Automatic average-case analysis of algorithms, Theoretical Computer Science 79, pp. 37--109. Amsterdam: Elsevier Science Publishers, 1991.

\bibitem{franz90} Franz, Inc., Allegro COMPOSER User Guide, version 1.0, Chapter 6. Berkeley, CA: Franz, Inc., 1990.

\bibitem{grahametal82} S.L. Graham, P.B. Kessler, & M.K. McKusick, Gprof: a call graph execution profiler, ACM SIGPLAN Notices 17(6), pp. 120--126. June 1982.

\bibitem{hall92} R.J. Hall, Call path profiling, In Proc. 14th International Conf. on Software Engineering. New York, NY: Assoc. for Computing Machinery, 1992, pp. 296--306.

\bibitem{knuth71} D.E. Knuth, An empirical study of FORTRAN programs, Software--Practice and Experience 1, pp. 105--133. 1971.

\bibitem{lemetayer88} D. Le Metayer, ACE: An Automatic Complexity Evaluator, ACM Trans. on Programming Languages and Systems 10(2), pp. 248--266. 1988.

\bibitem{sun92} SunPro, Performance Tuning an Application. Sun Microsystems, Inc., 1992.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Robert J. Hall (hall@research.att.com) has been a Member of Technical Staff at AT&T Bell Laboratories, Murray Hill, since 1991. Prior to that, he received his S.M., E.E., and Ph.D. degrees from the Massachusetts Institute of Technology, where his thesis research was performed under Charles Rich in the Artificial Intelligence Laboratory's Programmer's Apprentice Group. He received the B.S. degree summa cum laude in E.E.C.S. from the University of California at Berkeley.

Aaron J. Goldberg (aaron@research.att.com) has been a Member of Technical Staff at AT&T Bell Laboratories, Murray Hill, since 1992. Prior to that, he received his Ph.D. degree from Stanford University, where his thesis research was performed under John Hennessy in the DASH group. He received the A.B. degree summa cum laude in Applied Mathematics from Harvard University.