The following paper was originally published in the
Proceedings of the USENIX 1996 Annual Technical Conference
San Diego, California, January 1996

For more information about the USENIX Association contact:
1. Phone:   510 528-8649
2. FAX:     510 548-5738
3. Email:   office@usenix.org
4. WWW URL: https://www.usenix.org


Process-labeled kernel profiling: a new facility to profile system activities

Shingo Nishioka, Atsuo Kawaguchi, and Hiroshi Motoda
Advanced Research Laboratory, Hitachi Ltd.


Abstract

Profiling tools that empirically measure the resource usage of a program have been widely used in program development. These tools have traditionally focused on the behavior of the target program. In a modern system environment, however, the target program actually performs its job in collaboration with other programs, such as servers and an operating system kernel. Process-labeled kernel profiling is a novel facility that measures kernel resource consumption and attributes it to the programs that benefit from it. This facility, in conjunction with a conventional profiler, enables a programmer to grasp the resource consumption of programs from an overall system point of view. Using this information, the programmer is better able to reduce overall resource consumption.


1. Introduction

Profiling is one of a number of important techniques used to improve the performance of a complex program. Many profiling tools, such as prof[1], gprof[2], and cprof[3], have been developed and are widely used in application program development. A programmer uses these tools to measure the resource usage (e.g. CPU time) of a program empirically, and to analyze where and why it consumes resources during execution. The measurement results provide clues for finding the critical portions of the program that affect its performance, and they also provide evidence that the programmer's code modifications actually improve program performance.

A typical example of a complex program is an operating system kernel. Profiling is useful not only in improving user programs, but also in improving and tuning kernel code. By empirically measuring resource usage in the kernel over a sufficiently long period of time, a kernel developer is able to locate problems affecting the overall performance of a system. Some problems may need kernel code modification, while others can be eased by changing kernel parameters[4].

User program profiling and kernel profiling are used independently since the kernel and user programs are separate. One of the kernel's important roles is offering a well-defined, abstract, and convenient program interface that hides the system's internal operations. Consequently, programmers are able to develop portable, clearly organized application programs without any special knowledge of these internal operations. Moreover, expert programmers understand these operations better than novices, and develop programs that access the program interface in a manner that can be processed efficiently by the kernel. Such implicit optimization often brings about better performance than expected.
Providing kernel profiling data as well as user program profiling results should help expert programmers write more efficient code from an overall system point of view.

Recent operating systems have relied on many user-level programs (servers) to provide system services. The micro-kernel architecture is a typical example of such an approach. Providing both user program and kernel profiling data simultaneously is extremely useful in the development of operating systems: to establish system structures, define server interfaces, write code, tune prototypes, and so on.

Process-labeled kernel profiling (pkprof) is a novel profiling facility that measures which resources are used in an operating system kernel by which user program (process). It keeps track of which process the kernel is serving, and labels each piece of resource consumption with the particular process that benefits from it. Utility tools are provided to control measurements, to manage profile data, and to analyze the data.

Pkprof, in conjunction with a user-level profiler, enables a programmer to understand a program's behavior in both the user and kernel space. It reveals hidden system activities that the programmer is unaware of. Such information is extremely beneficial to both application program and operating system development.


2. Pkprof: Process-labeled kernel profiling

An operating system kernel offers a set of system calls to provide services to user programs. The services are implemented as kernel routines that run in the kernel space. A user program triggers execution of the routines by issuing a system call. Resources consumed by the execution, therefore, can be attributed to that program.

A kernel also handles external asynchronous events. Resources consumed to handle such events are likewise attributed to the programs benefiting from that consumption. For example, when a network packet arrives, the network interface hardware interrupts the normal flow of program execution and invokes the kernel's interrupt handling routines. The interrupt handling routines copy the received packet to a network data buffer and the system then resumes normal execution. The received packet is eventually read by a user program. Consequently, the resources consumed by packet reception are attributed to that program when it turns out to have benefited from the reception.

A process-labeled kernel profiler monitors kernel activities and reports where and for what process (program) a kernel consumes resources such as CPU time. In other words, it measures the kernel resources consumed by:

o processes that request kernel services, and
o asynchronous events that are later attributed to processes.

It then reports the extent to which each process benefits from the use of these resources.

Figure 2.1 shows a report on process-labeled kernel profiling. Figure 2.1(a) shows flat profiling results of a kernel running over a period of time. Each line represents how much time is spent in a kernel routine. we_bcopy_in(), for example, was called 6622 times (calls column) and ran for 4.33 seconds (self seconds); that accounts for 4.6% of the total time that elapsed in the kernel (% time) during measurement. On average, each call took 0.65 ms (self ms/call) within we_bcopy_in() itself and 0.65 ms (total ms/call) for itself and its descendant functions. Figure 2.1(b) shows the same information with respect to a process executing a cat program over the same period. That is, it shows how much CPU time is consumed by each kernel routine in processing cat's requests.
A process identifier and the symbol `k' are prefixed to the name of each function, e.g. 137k:in_cksum[9]. The `137' denotes the process identifier of the cat process and the `k' denotes that the function is the kernel's. We can see that some kernel routines, such as ovbcopy(), are executed mainly for the cat process.

A typical pkprof application is seamless profiling, which simultaneously measures the resource consumption of a program in both the user and kernel space. A listing of seamless profiling is shown in Figure 2.2. The listing is in the same format that gprof[2] generates, except that the process identifier and the symbol `k' or `u' have been prefixed to the name of each function, e.g. 137k:read[9]. The `137' denotes the process identifier and the `k' denotes that the function is the kernel's. The symbol `u' is shown if the function is the user program's. The listing shows, at index [8] for example, that process 137's read() function accounted for 30.2% of the total CPU time during measurement (% time column). It spent 0.00 seconds within 137u:read() itself (self), and the descendants of 137u:read(), i.e. 137k:read() in this case, needed 2.86 seconds of CPU time to execute.

From the listing, programmers can grasp a program's impact on kernel CPU usage as well as the CPU time usage of the program itself. They may be able to lower the total CPU time usage by changing the program's algorithm, the system calls it issues, or their arguments.

Pkprof is also applicable to a group of processes. Programmers can examine the kernel usage of, and the impact on system performance by, a group of processes as well as each single process. They may be able to improve the overall performance of the system by carefully examining the listings, finding actual performance bottlenecks, and fixing them.

Applying pkprof to a group of processes is particularly useful in programming based on a client-server scheme. To improve performance in this type of scheme it is not sufficient to profile only the client program, since a great deal of the work for a job is done by the server. Profiling the server may often be sufficient to improve performance, but applying pkprof to the server and clients together may help programmers dramatically improve performance through: changing the scheme itself, changing how work is shared between the server and clients, changing communication methods, or changing the protocols used for communication.

The pkprof facility offers programmers utilities to:

o turn the profiler on and off,
o specify the processes the profiler monitors,
o specify the resources to profile, and
o report on measurements.


3. Basic structure

Pkprof monitors the resource consumption in a kernel, attributes the consumption to the process benefiting from it, and records the consumption together with its attribution. Pkprof uses the same techniques to monitor resource consumption as conventional profiling tools; it uses embedded code and sampling techniques. Pkprof switches profile data buffers when the kernel switches processes or serves asynchronous events (interrupts). It tries to determine which process will later benefit from each asynchronous event.

3.1 Attributing resource consumption

Sources of resource consumption in a modern, multi-tasking operating system kernel can be classified into the following two types of events:

o events caused by a running process, and
o events caused by a device.

The former occur synchronously with respect to the execution of the process. The process asks for kernel services explicitly (e.g. system calls) or implicitly (e.g. page faults) through these events.
Therefore, attribution of this consumption is performed by keeping track of process switching in the kernel: the currently running process is the one being served by the kernel.

The latter, however, occur asynchronously and often have nothing to do with the running process. An asynchronous event is usually only partially processed. The partial processing results are stored in a temporary buffer, and the remaining processing is performed when a process makes a request that requires those results. Therefore, attribution of the resource consumption caused by the event must be deferred until the process that benefits from the event is identified.

Assigning a unique identifier to every asynchronous event is one technique for handling this situation. The identifier serves as a classification tag under which the resources consumed in processing the asynchronous event are recorded. Once a process benefits from the tagged resource, all the resource consumption recorded with the same tag is attributed to that process.

The address of a data buffer allocated to handle an asynchronous event can be used as such an identifier. The buffer is usually passed from routine to routine during partial processing. The address, therefore, can be used to tag the resources consumed by the processing. Note that the asynchronous event handling routines may allocate another buffer and use it during partial processing. The address of a newly allocated buffer should also be treated as another identifier in such cases.

3.2 Recording profile data

Pkprof logically allocates a profile log buffer for every profiled process. It also allocates a buffer for every asynchronous event. Access routines are provided for the profiler routines so that they can efficiently record resource consumption in the buffers.

It is preferable to construct the buffer structures dynamically and on demand, because the number of processes to be profiled is not known in advance. Static allocation of buffers may restrict profile measurements and impose a performance penalty while the profiling facility is turned off.

It should be noted that profile data buffers for asynchronous events are allocated and de-allocated much more frequently than those for profiled processes. When a process that benefits from an asynchronous event is found, the contents of the data buffer for that event can be merged into the profile data buffer for the found process. The buffer for the asynchronous event can then be de-allocated and reused for another event. Quick allocation and reuse of the buffers are the keys to making asynchronous event attribution practical.

3.3 Implementing pkprof

Implementation of a pkprof facility, unlike that of conventional kernel profilers, depends heavily on the structure of the target operating system. It requires higher-level knowledge of the target, such as:

o events caused by a process,
o handling of synchronous events,
o the location and manner of process switching,
o the way asynchronous events are caught and processed,
o the data structures used to process events,

and so on.

The part of pkprof concerned with synchronous events can be implemented on top of a conventional kernel profiling facility[4][5]. Since served process tracking and profile log buffer management are functionally independent of a conventional profiler, we can add both functions without modifying the existing profiler. Once the functions are added, we can modify the existing profiler routines so that they refer to the currently served process and record resource consumption properly. Of course, we also need to adapt the existing utilities to the new profile facility.

Implementation of the other part, i.e. the attribution of asynchronous events, requires much more modification of the target operating system. It generally involves:

o enumerating the places where asynchronous events are caught and handled,
o embedding code that assigns identifiers and profile data buffers to events,
o embedding code, where appropriate, to track asynchronous event handling, and
o embedding code to resolve process-event correspondence and to merge event profile data into process profile data.

Note that asynchronous event processing occupies a relatively small fraction of kernel CPU usage in some environments. Implementing only the attribution of synchronous events might prove adequate in such cases.
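To make the attribution scheme of this section concrete, the toy program below sketches the whole life cycle in user-level C: synchronous consumption is charged directly to the served process, asynchronous consumption is charged to a per-event buffer keyed by the address of the data buffer being handled, and the event buffer is merged into the process's buffer once the beneficiary is known. All names (charge_process(), event_buf(), and so on) and the fixed-size tables are hypothetical simplifications for illustration; they are not part of the PKPROF implementation described in Section 4.

    /* attribution_sketch.c -- toy model of Section 3; compile with: cc attribution_sketch.c */
    #include <stdio.h>
    #include <string.h>

    #define MAX_PROCS  8              /* toy limits; the real facility allocates on demand */
    #define MAX_EVENTS 8

    struct prof_buf { long ticks; };  /* stands in for the count and sampling logs */

    static struct prof_buf procs[MAX_PROCS];      /* per-process profile buffers            */
    static void           *event_id[MAX_EVENTS];  /* event identifier = data buffer address */
    static struct prof_buf events[MAX_EVENTS];    /* per-event profile buffers              */

    /* Look up (or create) the per-event buffer tagged by identifier `id'. */
    static struct prof_buf *event_buf(void *id)
    {
        int i, free_slot = 0;
        for (i = 0; i < MAX_EVENTS; i++) {
            if (event_id[i] == id)
                return &events[i];
            if (event_id[i] == NULL)
                free_slot = i;
        }
        event_id[free_slot] = id;
        memset(&events[free_slot], 0, sizeof events[free_slot]);
        return &events[free_slot];
    }

    /* Synchronous consumption: charged directly to the served process. */
    static void charge_process(int pid) { procs[pid].ticks++; }

    /* Asynchronous consumption: charged to the event until a beneficiary is known. */
    static void charge_event(void *id) { event_buf(id)->ticks++; }

    /* Deferred attribution: merge the event's buffer into the process's buffer. */
    static void attribute_event(int pid, void *id)
    {
        int i;
        for (i = 0; i < MAX_EVENTS; i++)
            if (event_id[i] == id) {
                procs[pid].ticks += events[i].ticks;
                event_id[i] = NULL;       /* the buffer can now be reused */
                break;
            }
    }

    int main(void)
    {
        char pkt[64];                     /* pretend this is a network data buffer   */
        charge_process(3);                /* process 3 issues a system call          */
        charge_event(pkt);                /* interrupt handling, beneficiary unknown */
        charge_event(pkt);
        attribute_event(3, pkt);          /* process 3 later reads the packet        */
        printf("process 3: %ld ticks\n", procs[3].ticks);    /* prints 3 */
        return 0;
    }

Section 4 describes how PKPROF realizes this scheme inside the 4.4BSD-Lite kernel.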
4. Implementation

This section describes a sample implementation of pkprof on the 4.4BSD-Lite UNIX operating system. In what follows, we refer to this sample implementation as PKPROF. Although PKPROF is experimental and incomplete, it provides the minimum functionality needed to evaluate pkprof. Many of the techniques used in the sample implementation, such as the monitor, the sampler, and profile data management, are independent of the 4.4BSD-Lite kernel architecture. They should be applicable to implementations of pkprof on other operating system kernels. The descriptions of process switch tracking and asynchronous event tracking, however, depend heavily on the design of the 4.4BSD-Lite kernel.

4.1 Overview

The 4.4BSD-Lite kernel offers a configurable kernel profiling facility called KPROF. KPROF measures:

o how often each kernel routine is called, and
o how much time is spent executing each kernel routine.

The former is implemented by inserting monitoring code (the "monitor") into the prologue of each kernel routine so that the monitor is executed on every invocation of the routine. The latter is measured by sampling the address at which the CPU is executing an instruction; execution times for kernel routines are inferred from the distribution of the samples within the total execution time of the kernel. This implementation scheme used by KPROF is the same scheme used by gprof.

PKPROF is an extension of KPROF; the monitor and sampler scheme is basically the same as that of gprof. We have added to and modified the KPROF code so that the monitor and sampler are able to classify the profile data and store them properly in the per-process and per-event profile data buffers. PKPROF stores the profile data in buffers in kernel memory space.
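For readers unfamiliar with this style of profiler, the fragment below is a simplified sketch, in the spirit of gprof's mcount() monitor, of what the routine-prologue monitor does on each call: it records one caller-callee arc in whichever profile data buffer is currently selected. The code is hypothetical, not the PKPROF source; in particular, the linear arc search and the file-scope current_buf pointer (which Section 4.4 describes being switched between per-process and per-event buffers) are simplifications.

    /* Simplified routine-entry monitor (hypothetical; modeled on gprof's mcount()). */
    #define NARCS 4096                        /* toy limit on recorded arcs */

    struct arc {
        unsigned long frompc;                 /* caller's return address   */
        unsigned long selfpc;                 /* callee's entry address    */
        long          count;                  /* number of calls observed  */
    };

    struct prof_buf {
        struct arc arcs[NARCS];               /* execution count log       */
        int        narcs;
    };

    static struct prof_buf kernel_buf;        /* buffer for system-wide kernel work    */
    static struct prof_buf *current_buf = &kernel_buf;   /* switched as in Section 4.4 */

    /* Called from every kernel routine's prologue (the compiler inserts the call). */
    void monitor(unsigned long frompc, unsigned long selfpc)
    {
        struct prof_buf *b = current_buf;
        int i;

        for (i = 0; i < b->narcs; i++)        /* linear search; the real monitor is faster */
            if (b->arcs[i].frompc == frompc && b->arcs[i].selfpc == selfpc) {
                b->arcs[i].count++;
                return;
            }
        if (b->narcs < NARCS) {               /* record a new caller-callee arc */
            b->arcs[b->narcs].frompc = frompc;
            b->arcs[b->narcs].selfpc = selfpc;
            b->arcs[b->narcs].count  = 1;
            b->narcs++;
        }
    }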
The major implementation issues include:

o where the kernel switches one running process to another,
o how asynchronous events are tagged and tracked,
o how the monitor and sampler keep track of process switching and event processing,
o how profile data buffers are allocated and organized, and
o how measurements are controlled and how profile data is accessed from outside the kernel.

Limiting the performance impact was one of the key criteria in designing the mechanisms and data structures of PKPROF (Section 5.1 discusses the overheads).

The 4.4BSD-Lite operating system enters kernel mode when one of the following events occurs:

1. a process issues a system call,
2. a process accesses an address to which no physical memory page is assigned at the time (i.e. a page fault), or
3. a device interrupts the execution of the current process.

When the first event occurs, the kernel starts processing the process's request immediately. If a requisite resource is unavailable at that time, the kernel switches to another process. Similar processing also occurs for the second event. The third event, however, starts processing that may be irrelevant to the current process. For example, a process may be interrupted to handle the arrival of a network packet even if this process never accesses the network.

We have added two global variables: svdproc indicates which process the kernel is serving, and svdevent indicates which event the kernel is handling. The kernel updates the value of each variable when a process switch or an asynchronous event occurs. The monitor and the sampler refer to these variables and store profile data in the proper profile data buffer. Profile data for asynchronous event handling are merged into the profile data buffer of a process when that process is found to benefit from the handling.

4.2 Served process tracking

The 4.4BSD-Lite kernel maintains a global variable named curproc. This points to a proc structure that contains the state of the currently running process. Although the name stands for "current process", the value of curproc cannot be used to track the "currently served process", because its value often remains unchanged while the kernel is serving other processes. For example, the interrupt handler never changes its value. The value of curproc is also left untouched while the kernel does system-wide jobs such as process priority calculations.

We have added a new variable named svdproc (which stands for "served process") that indicates the process the kernel is actually serving. Code that we have embedded in the context-switching-related kernel routines updates the value of svdproc.

Figure 4.1 shows the call flow diagram for the kernel routines that are responsible for the execution of context switching. Each node shows a kernel routine and each solid arrow indicates a caller-callee relationship between routines. For example, the figure shows that exit1() calls cpu_exit(). Each gray arrow shows a transition from user mode to kernel mode. Interrupt() is invoked by hardware interrupts such as interval timer, disk, and network interrupts. Only the kernel routines marked with `*' switch the currently served process. Switch() switches the current process to another. Exit1() handles process termination; the value of svdproc is set to NULL when the process terminates, which means no process is being served by the kernel. Interrupt() also changes svdproc to NULL because it invokes the kernel's interrupt handling activities.
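The embedded code amounts to a handful of one-line updates at the routines marked in Figure 4.1. The fragment below sketches where those updates would sit; the hook-style function names are invented for the illustration, and the real updates are placed inline in switch(), exit1(), and interrupt() rather than in separate functions.

    /* Sketch of served-process tracking (simplified; not the literal 4.4BSD-Lite code). */
    struct proc;                             /* 4.4BSD process structure */

    struct proc *curproc;                    /* existing: currently running process  */
    struct proc *svdproc;                    /* added: process the kernel is serving */

    void switch_hook(struct proc *newproc)   /* where switch() installs the new process */
    {
        curproc = newproc;
        svdproc = newproc;                   /* kernel now works on behalf of newproc   */
    }

    void exit1_hook(void)                    /* where exit1() finishes process termination */
    {
        svdproc = NULL;                      /* no process is being served                 */
    }

    void interrupt_hook(void)                /* on entry to interrupt()                    */
    {
        svdproc = NULL;                      /* interrupt work is not attributed via svdproc */
    }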
4.3 Asynchronous event tracking

Figure 4.2 depicts a generic view of asynchronous event processing in the 4.4BSD-Lite kernel. A device interrupt is caught by the lowest-level interrupt handler, which then invokes the appropriate interrupt handling routine of the device driver. The interrupt handling routine and other related routines eventually finish processing and call wakeup() to mark the processes that are waiting for that event as runnable. Internal data objects are often created during this processing. These objects are passed from routine to routine, and are eventually stored in a queue or a buffer so that the process can receive their contents. We use the address of each internal data object as the identifier of an event.

Assigning an identifier to each interrupt does not work properly, because one interrupt is often accompanied by multiple events. The reception of multiple packets from a network is a typical example of such a case. Internal data objects are always created for the events even in these cases, so their addresses can be used to track asynchronous event processing.

We have added a global variable named svdevent (which stands for "served event") that holds the identifier of the event being served. We have embedded code in all the places where new internal data objects are created, so that the addresses of the objects are recorded and a profile data buffer is allocated and associated with each object. We have also added code at the places where internal data objects are discarded. This code maintains the value of svdevent. We have also modified the interrupt handlers so that the value of svdevent is properly saved and restored across interrupts.

4.4 Switching profile data buffer

The monitor and sampler switch the profile data buffer based on the values of svdproc and svdevent. Table 4.1 summarizes the switching of the profile data buffer. The value of svdproc takes precedence over that of svdevent. If svdproc indicates a process that is to be profiled, the monitor and sampler store profile data in the profile buffer of that process. Svdproc is OTHER when the kernel is serving processes that are not being profiled; the monitor and sampler aggregate profile data in a special profile buffer for those processes. When both svdproc and svdevent are NULL, the kernel is serving system-wide jobs such as scheduling; the monitor and sampler store profile data in a profile data buffer for the kernel itself.

4.5 Structure of profile data buffer

Because a process usually runs over a long period of time, we designed the profile data buffer that is allocated for each process to be space efficient. The data structures for asynchronous events, on the other hand, are designed to be time efficient. This is because each event buffer is expected to be short-lived and its contents need to be merged quickly into the profile data buffer for a process.

Figure 4.3 shows how we organized the data structures that store the profile data for each process. PKPROF associates each buffer entry with its process through the proc structure, to give the monitor and sampler access to the buffer entries. Svdproc points to the proc structure of the currently served process. Each proc structure contains a pointer to the buffer entry for that process, and the monitor and the sampler refer to the entry using that pointer. The entries themselves form a singly-linked list so that each entry can still be accessed after the termination of its process. Each buffer entry contains two pointers: a pointer to the address sampling log buffer and a pointer to the execution count log buffer. The structures of both log buffers are identical to those of KPROF and are used to record gprof-compatible profile data.

Figure 4.4 shows the data structures that store the profile data for asynchronous event processing. In what follows, we refer to the profile data buffer for an asynchronous event as a PDB. Internal data objects, such as mbuf structures, are associated with PDBs through a hash table. Each PDB is composed of a list of statistics arrays. Statistical entries are stored in an array sequentially. An entry can hold a caller-current PC (program counter) pair for the monitor or the current PC for the sampler. The sequential structure of the array enables the profiler routines to access its contents quickly.
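The declarations below sketch one plausible shape for these structures and for the buffer selection rule of Table 4.1. The field names, the entry layout, and the helpers (pkprof_select(), proc_is_profiled()) are illustrative assumptions rather than the actual PKPROF definitions.

    #include <stddef.h>

    struct proc;                              /* 4.4BSD process structure */

    /* Per-process buffer entry (Figure 4.3): space-efficient and long-lived.
     * Each proc structure points to its entry; the entries form a singly-linked list. */
    struct pkprof_proc_entry {
        struct pkprof_proc_entry *next;       /* list survives process termination   */
        void *samples;                        /* address sampling log (KPROF format) */
        void *counts;                         /* execution count log (KPROF format)  */
    };

    /* One statistical entry: caller/current-PC pair (monitor) or current PC (sampler). */
    struct pkprof_stat {
        unsigned long frompc;
        unsigned long selfpc;
    };

    /* A PDB (Figure 4.4): short-lived, sequentially filled, chained when merged. */
    struct pkprof_pdb {
        struct pkprof_pdb *next;
        int nused;
        struct pkprof_stat stat[1020];        /* 1020 entries per array in the paper;
                                                 the entry layout here is illustrative */
    };

    extern struct proc       *svdproc;        /* process being served, or NULL          */
    extern struct pkprof_pdb *svdevent;       /* PDB of the event being served, or NULL */

    int proc_is_profiled(const struct proc *p);   /* assumed predicate */

    /* Which buffer the monitor and sampler should use, following Table 4.1:
     * svdproc takes precedence over svdevent. */
    enum pkprof_target { TO_PROCESS, TO_OTHER, TO_EVENT, TO_KERNEL };

    enum pkprof_target pkprof_select(void)
    {
        if (svdproc != NULL)
            return proc_is_profiled(svdproc) ? TO_PROCESS : TO_OTHER;
        if (svdevent != NULL)
            return TO_EVENT;                  /* charge the event's PDB for now      */
        return TO_KERNEL;                     /* system-wide jobs such as scheduling */
    }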
Note that svdevent actually points to a PDB, so the monitor and sampler can quickly store profile data in the PDB without looking up the hash table.

For some kinds of event processing, such as network packet reception, internal data objects are collected and assembled into other objects. In these cases, we also collect the PDBs associated with all the collected objects and aggregate their contents into a PDB associated with the newly created object. The data structures make this aggregation efficient, since it only needs to link the PDBs together by pointers.

4.6 Asynchronous event attribution

A process is found to benefit from an asynchronous event when it uses the output of the asynchronous event processing. An asynchronous event handler stores internal data objects, e.g. mbuf structures, in an appropriate place, e.g. a socket queue, and calls wakeup() (see Figure 4.2). When a process removes the internal data objects from such a place, PKPROF reads the contents of the PDBs attached to the objects and merges them into the profile data buffer for that process. All the resources consumed for the internal data objects are consequently attributed to that process.

4.7 Profile utilities

PKPROF provides two utility programs named pkgmon and pkgprof. Pkgmon is used to conduct profiling. Pkgprof is a profile report generator.

We modified the kgmon program to implement pkgmon. The operating system was originally equipped with kgmon to control KPROF through the sysctl() library routine, which offers miscellaneous services for controlling kernel behavior. The controlled functions include resetting KPROF, starting it, stopping it, and reading the profiled data. We have added new services to the sysctl() library routine (and the __sysctl system call) and modified the kgmon command program so that a user can control PKPROF-specific functionality such as specifying the processes to be tracked and reading process-labeled profile data. Table 4.2 lists the services that have been added to the sysctl() library routine. Figure 4.5 summarizes the command syntax of pkgmon.

Pkgprof is a front-end processor for gprof; we have made no modifications to gprof itself. Pkgprof collects profile data files and passes them to gprof. It supports seamless profiling by merging the user-level and kernel-level profile data of a process into one profile data file (in gmon.out format) and passing it to gprof. Figure 4.6 summarizes the command syntax and functionality of pkgprof.

4.8 Example Measurements

Figure 4.7 shows a typical procedure for measuring seamless profiles. The first command (line 1) executes a cat program and starts profiling it. Pkgmon reports the process identifier of the executed program (line 2). When the measured program terminates, pkgmon also terminates. Reports are generated by running pkgprof for the cat process (line 3). Lines 6 through 8 show the files created by the measurement. Gmon.out.137k.cat and gmon.out.137u.cat are process 137's kernel-level and user-level profile data files, respectively. Gmon.out is another user-level profile data file; its contents are the same as those of gmon.out.137u.cat. Gmon.out.137.cat contains information on the measurement, such as the program name and process identifier; pkgprof refers to this information when processing the profile data files. Gmon.out.-k.kernel is a profile data file for the kernel's system-wide jobs. Gmon.out.-k.other is a kernel-level profile data file for the processes that were not profiled.
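As a rough illustration of how a pkgmon-like tool drives the facility from user level, the fragment below issues sysctl() requests in the style of kgmon. CTL_KERN, KERN_PROF, GPROF_STATE, and GMON_PROF_ON are existing 4.4BSD names; the third-level name GPROF_PKTRACKPID, used here to register a process to be tracked, is purely hypothetical and merely stands in for the PKPROF-specific services listed in Table 4.2.

    /* pkgmon_sketch.c -- hypothetical user-level control sequence, kgmon style. */
    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <sys/gmon.h>
    #include <stdio.h>
    #include <unistd.h>

    #define GPROF_PKTRACKPID 100        /* assumed PKPROF-specific third-level name */

    int main(void)
    {
        int mib[3];
        int state;
        int pid = getpid();             /* e.g. the process to be tracked */

        mib[0] = CTL_KERN;
        mib[1] = KERN_PROF;

        mib[2] = GPROF_PKTRACKPID;      /* hypothetical: register the pid with PKPROF */
        if (sysctl(mib, 3, NULL, NULL, &pid, sizeof pid) < 0)
            perror("sysctl(track pid)");

        mib[2] = GPROF_STATE;           /* existing KPROF service */
        state = GMON_PROF_ON;           /* start kernel profiling */
        if (sysctl(mib, 3, NULL, NULL, &state, sizeof state) < 0)
            perror("sysctl(state)");

        /* ... run the workload, then set GMON_PROF_OFF and read out the buffers ... */
        return 0;
    }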
Lines 10 through 21 show another example of measurement. A server process, nfsiod, is profiled in this example. Process 63 is specified to be profiled for 120 seconds at line 15. Lines 19 through 21 show the files created by the measurement. Note that the file gmon.out is not generated by this measurement, since the profiled nfsiod process is still running after the measurement.

5. Discussion

5.1 Overhead

The overheads of pkprof depend on many factors, such as machine architecture, kernel implementation, sampling rate, and event rate. The goal of this section is to discuss the overheads as generally as possible. We examine KPROF and PKPROF closely and evaluate the overheads of PKPROF using those of KPROF as a baseline. Since the implementation of KPROF is fairly general and portable, this evaluation should give the reader clues for estimating the overheads of pkprof in her environment. In the following discussion, measurements were undertaken on a DECstation 5000/200 (MIPS R3000, 25 MHz clock) with 16 MBytes of main memory and an RZ56 600 MByte SCSI disk drive.

Execution time overheads

PKPROF adds four kinds of code to KPROF:

o monitor: code to consult svdproc and svdevent and to switch the profile data buffer,
o sampler: (same as the above),
o process tracking code, and
o event tracking code.

Table 5.1 shows the machine instruction counts of the monitor and the sampler for both KPROF and PKPROF. It also shows the average execution time of the monitor per call, in ticks. The time was measured while a test program read files and wrote them to /dev/null. The time measurement results reflect the dynamic environment of the platform, such as cache misses and bus availability, as well as the monitor's typical execution paths. The monitor suffers a 33% text size and 16% execution time increase in implementing PKPROF. Note that the increase accounts for 3.1% of the total kernel CPU time. The sampler suffers a 31% increase in text size.

Table 5.2 shows the results of static analysis and execution time measurements of the process tracking code. The average execution time of each routine was measured under the same conditions as in Table 5.1. We modified the 8 kernel routines listed in the table to maintain the svdproc value. These functions suffer a 2.4% text size and 0.04% execution time increase in total. Note that the total execution time of these routines accounts for only a small fraction of the total kernel CPU time, about 0.02%. The process tracking code, therefore, incurs only a small overhead on system performance.

Asynchronous event tracking code needs to be embedded in each event handling module. The modification depends heavily on the specific kernel implementation, so the overheads of such code are difficult to evaluate in a general manner. In the following discussion we selected UDP packet receive processing as an example of event handling. Table 5.3 shows the results of static analysis and execution time measurements of the kernel routines related to Ethernet UDP packet receiving. The execution time of each routine was measured while receiving UDP packets carrying 4 KBytes of data. A UDP packet bigger than 1500 bytes received via Ethernet arrives as fragments that must be collected. The measured CPU time overheads, therefore, include those for concatenating profile data buffers.

Memory requirements

Both KPROF and PKPROF add code to the kernel. The text size of the kernel with no profiling on our platform is 775 262 bytes. The sizes with KPROF and PKPROF are 803 022 bytes and 816 142 bytes, respectively.

PKPROF allocates a profile data buffer for each process (Figure 4.3) when the process is specified to be profiled.
It also allocates two buffers of the same size to store the profile data attributed to the kernel itself and to the processes not being profiled. The size required for each buffer depends on the address range of the kernel text: PKPROF allocates 1/2 the size of the address range for the execution count data buffer (caller-callee counts) and about 3/4 of it for the address sampling data buffer. The address range size of the sample kernel, for example, is 897 048 bytes. The sample implementation first allocates two 1 112 328 byte buffers for the kernel and the not-profiled processes. If three processes are profiled, PKPROF allocates 3 336 984 bytes of memory for the three additional buffers.

Recording the profile data for each asynchronous event, on the other hand, needs much less memory. Each asynchronous event is generally processed quickly; that is, fewer kernel routines are called and fewer samples are taken during event processing than during a process's lifetime. PKPROF, therefore, allocates only a small amount of memory for each event. Since such memory is recycled upon process-event attribution, the total amount of memory is limited by the throughput of the system's event processing. The maximum memory theoretically needed for profiling asynchronous events depends on the number of internal data objects that the kernel can allocate simultaneously, the number of functions that participate in processing each event, and the time that processing each event takes.

We again selected UDP packet processing as an example in order to discuss the PKPROF memory requirements more concretely. PKPROF allocates a statistics array (Figure 4.4) for each mbuf chain. Each statistics array is 4096 bytes long and can hold 1020 statistical entries. The actual number of statistical entries used in storing the profile data depends on both the implementation of UDP packet processing and the sampling rate. Measurements on our platform have shown that, for 4 KByte-data UDP packets received at the maximum processing rate (149 packets/s, see the "Overall performance degradation" section below), the maximum (average) numbers of caller-callee entries and current PC entries per mbuf list are 533 (94) and 2 (0.2), respectively. The 1020 entries are sufficient to hold them. While receiving, 127 mbufs and 333 PDBs exist simultaneously in the kernel space on average. These PDBs form 106 PDB lists on average, and they account for 1 363 968 bytes in total.

Overall performance degradation

Table 5.4 shows the performance impact of PKPROF on file read throughput. It also shows KPROF's impact for comparison. We measured the read throughput of files of various sizes on a disk. We arranged the measurements carefully so that the buffer cache never hit. The results show that both KPROF and PKPROF cause less than a 3% decrease in file read performance.

Table 5.5 shows the performance impact of both PKPROF and KPROF on UDP packet receive throughput. We used two communicating processes for the measurement: a receiver running on the platform received UDP packets continuously, and a sender running on another computer sent UDP packets to the receiver via the Ethernet. We determined the maximum throughput by changing the packet transmit rate until the receiver began dropping packets. The results show that KPROF causes about a 60% decrease in 64 byte-data UDP packet receive performance, and PKPROF about an 80% decrease.

5.2 Server process

Server processes that do work for other user programs in place of the kernel are common in many of today's operating systems.
PKPROF is able to show to what extent kernel resources are consumed by such servers, but it cannot tell for which processes the servers have worked. Seamless profiling can be extended to deal with such cases. The kernel places a mark on every data packet passed via interprocess communication. When a server process is scheduled, the scheduler knows which data packet has triggered the server, and it can identify which user program the server is working for by referring to the mark. As this method may cause degradation in server performance, the markers (e.g. the address of the passed data) and the reference methods should be carefully selected.

5.3 Call-graph support

PKPROF reports seamless profiling results in the same format as gprof. The current implementation can report call-graph profiles across the user-kernel boundary crossed via system calls. We have modified the monitor code of the system call trap handler so that it properly stores the addresses of the caller (in user space) and the callee (i.e. the trap handler) in the profile data buffer. The gprof report generator constructs a call graph across the boundary based on this caller-callee information. Supporting call graphs for other services, such as page faults, requires modification of interrupt().

5.4 Related tools

Many tools that report kernel statistics, such as sar of System V and the various xxstat programs of BSD, have been used to monitor and tune systems[6]. These tools could also be used to measure the kernel resource consumption of a program if it were possible to control the system environment precisely.

If a target operating system kernel clearly organizes process switching and accesses internal data objects in a consistent manner, pkprof can be implemented in a mechanical way. For example, 4.4BSD-Lite uses mbuf structures for network handling in a consistent fashion; similar code patterns appear in many places. In such cases, modifying the kernel to implement pkprof also amounts to embedding the same pattern of code in many places. ATOM[7] might be useful in such cases.

6. Conclusion

We have proposed a new profiling facility called process-labeled kernel profiling (pkprof). Pkprof measures which resources are used in an operating system kernel for which user program. Seamless profiling, one of pkprof's applications, measures the resource consumption of a program in both user space and kernel space. A programmer is able to understand the impact of a program on kernel CPU usage from the output of seamless profiling. Applying pkprof to a group of processes is useful in understanding system behavior and in improving its overall performance.

The implementation of pkprof depends heavily on the structure of the target operating system, and it might be difficult to add a pkprof facility to an existing operating system. We believe that it is necessary for operating system developers to support pkprof in their operating systems and to consider its implementation during the early stages of development.

Appendix: Some details on asynchronous event profiling

PKPROF functions

Table A.1 summarizes the PKPROF functions used for profiling asynchronous event processing. When an interrupt handler allocates a new internal data object, it calls pkprof_hash() to attach a PDB to the object. When the handler allocates a new internal data object and discards the old one, it calls pkprof_pass() to detach the PDB from the old object and attach it to the new object. When the handler concatenates two internal data objects, it calls pkprof_merge() to merge the contents of the PDB of one object into the PDB of the other object. If a process is found to benefit from an internal data object, pkprof_attribute() is called to merge the contents of the PDB of the object into the profile data buffer for that process. If the process is not being profiled, the contents of the PDB are merged into the profile data buffer for "OTHER" processes.
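The function names above are those of Table A.1; the declarations below are only a guess at their shape, added to make the calling convention easier to follow. The parameter types and order are assumptions, not the actual PKPROF prototypes.

    /* Sketch of the Table A.1 interface; the signatures are assumed, not actual. */
    struct pkprof_pdb;                  /* per-event profile data buffer (Section 4.5) */
    struct proc;

    /* Attach a (newly allocated) PDB to a new internal data object. */
    struct pkprof_pdb *pkprof_hash(void *obj);

    /* Detach the PDB from an old object and attach it to its replacement. */
    void pkprof_pass(void *oldobj, void *newobj);

    /* Two objects are concatenated: merge (chain) one object's PDB into the other's. */
    void pkprof_merge(void *dstobj, void *srcobj);

    /* A process has consumed the object: merge its PDB into that process's profile
     * data buffer, or into the "OTHER" buffer if the process is not being profiled. */
    void pkprof_attribute(struct proc *p, void *obj);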
Profiling disk read

Figure A.1 shows the call flow of the kernel routines for a SCSI disk read operation. The process issues a read system call. The kernel routine named read() handles this request and starts setting up a disk read data transfer operation. Bread() allocates a buf structure and a data buffer for the buf structure. Biowait() calls tsleep() on the buf structure to wait for the data to become available.

The SCSI interface eventually interrupts normal execution after the DMA transfer from the disk to the DMA buffer, and interrupt() is called. Interrupt() saves the last svdevent value and sets it to NULL. Asc_intr() is then called to copy the data in the DMA buffer to the data buffer passed from bread(). Asc_intr() allocates a PDB, associates it with the data buffer, sets svdevent, and starts copying the data from the DMA buffer to the data buffer. Rzdone() detaches the PDB from the data buffer and re-attaches it to the buf structure. Biodone() then wakes up the processes sleeping on the buf structure.

When a process sleeping on the buf structure is resumed, the contents of the PDB associated with the buf structure are merged into the profile data buffer for that process. In this way, the resources consumed in processing the read operation for that buf structure are attributed to the process.

Profiling UDP packet receiving

Figure A.2 shows the call flow of the kernel routines for a UDP packet receive operation. The process issues a recvfrom system call. The kernel routine named recvfrom() handles the system call. Tsleep() is finally called from sbwait() to wait for a packet to arrive.

When Ethernet packets arrive, the network interface interrupts normal execution and interrupt() is called. Interrupt() saves the last svdevent value and sets it to NULL. The device driver routines (leintr(), lerint(), leread(), leget()) copy the received Ethernet packets from regions of the network interface buffer to mbuf chains. Lerint() first attaches a profile data buffer to a region of the network interface buffer and sets svdevent; that is, the event identifier for a packet is the address of the region. Leread() then copies the contents of the region to an mbuf chain by calling leget(). After copying, leread() detaches the profile data buffer from the region and re-attaches it to the mbuf chain. Ethernet_input() places the mbuf chain in the IP packet input queue and posts a software interrupt. This sequence is repeated until all the Ethernet packets are processed. Then interrupt() restores the last value of svdevent and resumes normal execution.

When the software interrupt is acknowledged, the network and transport layer routines (ipintr(), udpinput(), etc.) process the mbuf chains to assemble, in this case, a UDP packet. Ipintr() sets svdevent when it removes an mbuf chain from the IP packet input queue. When ipintr() concatenates two mbuf chains, it merges the contents of the profile data buffer for the second mbuf chain into those for the first mbuf chain. This merging actually just chains the two profile data buffers together because of their structure (see Figure 4.4).
The assembled packets are put into the receive queue of a socket, and the processes sleeping on that queue are woken up. When the process waiting for the packet is resumed, it removes the packet from the receive queue of the socket. Soreceive() merges the contents of the profile data buffer attached to the packet into the profile data buffer of the running process pointed to by svdproc. In this way, the resource consumption caused by processing the packet is attributed to the process receiving it.

References

[1] S. L. Graham, P. B. Kessler, and M. K. McKusick, "An Execution Profiler for Modular Programs", Software - Practice and Experience, Vol. 13, 671-685 (1983).
[2] S. L. Graham, P. B. Kessler, and M. K. McKusick, "gprof: a Call Graph Execution Profiler", ACM SIGPLAN Notices 17(6), 120-126 (1982).
[3] R. J. Hall and A. J. Goldberg, "Call Path Profiling of Monotonic Program Resources in UNIX", 1993 Summer USENIX Technical Conference Proceedings, 1-13 (1993).
[4] M. K. McKusick, "Using gprof to Tune the 4.2BSD Kernel (May 21, 1984)", distributed with 4.4BSD UNIX (1993).
[5] M. K. McKusick, "Measuring and Improving the Performance of Berkeley UNIX (DRAFT April 17, 1991)", distributed with 4.4BSD UNIX (1993).
[6] M. Loukides, System Performance Tuning, O'Reilly & Associates, Inc. (1990).
[7] A. Eustace and A. Srivastava, "ATOM: A Flexible Interface for Building High Performance Program Analysis Tools", 1995 USENIX Technical Conference Proceedings, 303-314 (1995).

Author Information

Shingo Nishioka is a research scientist at the Advanced Research Laboratory, Hitachi, Ltd. His interests include programming languages and operating systems. He received his B.S., M.S., and Ph.D. degrees from Osaka University. He can be reached at nis@harl.hitachi.co.jp.

Atsuo Kawaguchi is a research scientist at the Advanced Research Laboratory, Hitachi, Ltd. His interests include file systems, memory management systems, and microprocessor design. He earned his B.S., M.S., and Ph.D. degrees at Osaka University. He can be reached at atsuo@harl.hitachi.co.jp.

Hiroshi Motoda has been with Hitachi since 1967 and is currently a senior chief research scientist at the Advanced Research Laboratory, where he heads the AI group. His current research includes machine learning, knowledge acquisition, visual reasoning, information filtering, intelligent user interfaces, and AI-oriented computer architectures. He holds B.S., M.S., and Ph.D. degrees from the University of Tokyo. He was on the board of trustees of the Japan Society of Software Science and Technology and on the board of the Japanese Society for Artificial Intelligence. He was also on the editorial boards of Knowledge Acquisition and IEEE Expert. He can be reached at motoda@harl.hitachi.co.jp.

Alternatively, all authors can be contacted by mail: Advanced Research Laboratory, Hitachi, Ltd., Hatoyama, Saitama, 350-03 Japan.

Availability

PKPROF and related documents are available at https://www.harl.hitachi.co.jp/~nis.

Figure 2.1: Example report of pkprof.
Figure 2.2: Example listing of seamless profiling.
Figure 4.1: Call flow of kernel routines.
Figure 4.2: Asynchronous event processing.
Figure 4.3: Data structures of profile data buffers for processes.
Figure 4.4: Data structures of profile data buffers for asynchronous events.
Figure 4.5: Command syntax of pkgmon.
Figure 4.6: Command syntax of pkgprof.
Figure 4.7: Sample run of profile measurement.
Figure A.1: Call flow of the kernel routines for a SCSI disk read operation.
Figure A.2: Call flow of the kernel routines for a UDP packet receive operation.
Table 4.1: Switching of the profile data buffer.
Table 4.2: Added sysctl() commands.
Table 5.1: Results of static analysis and execution time measurements of the monitor and sampler.
Table 5.2: Results of static analysis and execution time measurements of context switching routines.
Table 5.3: Results of static analysis and execution time measurements of kernel routines for UDP packet receiving.
Table 5.4: Performance impact on file read throughput.
Table 5.5: Performance impact on UDP packet receive throughput.
Table A.1: PKPROF functions used for profiling asynchronous events.