
 

NFS Tracing By Passive Network Monitoring (Extended Abstract)

Matt Blaze

Department of Computer Science
Princeton University
mab@cs.princeton.edu

Traces of filesystem activity have proven to be useful for a wide variety of purposes, ranging from quantitative analysis of system behavior to trace-driven simulation of filesystem algorithms. Such traces can be difficult to obtain, however, usually entailing modification of the filesystems to be monitored and runtime overhead for the period of the trace. Largely because of these difficulties, a surprisingly small number of filesystem traces have been conducted, and few sample workloads are available to filesystem researchers.

This paper describes a portable toolkit for deriving approximate traces of NFS [1] activity by non-intrusively monitoring the Ethernet traffic to and from the file server. The toolkit uses a promiscuous Ethernet listener interface (such as the Packetfilter[2]) to read and reconstruct NFS-related RPC packets intended for the server. It produces traces of the NFS activity as well as a plausible set of corresponding client system calls. The tool is currently in use at Princeton and other sites, and is available via anonymous ftp.

Motivation

Traces of real workloads form an important part of virtually all analysis of computer system behavior, whether it is program hot spots, memory access patterns, or filesystem activity that is being studied. In the case of filesystem activity, obtaining useful traces is particularly challenging. Filesystem behavior can span long time periods, often making it necessary to collect huge traces over weeks or even months. Modification of the filesystem to collect trace data is often difficult, and may result in unacceptable runtime overhead. Distributed filesystems exacerbate these difficulties, especially when the network is composed of a large number of heterogeneous machines. As a result of these difficulties, only a relatively small number of traces of Unix filesystem workloads have been conducted, primarily in computing research environments. [3], [4] and [5] are examples of such traces.

Since distributed filesystems work by transmitting their activity over a network, one can collect traces of such systems by placing a "tap" on the network and observing the network traffic. Ethernet[6] based networks lend themselves to this approach particularly well, since traffic is broadcast to all machines connected to a given subnetwork. General-purpose network monitoring tools (such as [7]) that "promiscuously" listen to the Ethernet are useful for observing (and collecting statistics on) specific types of packets, but the information they provide is at too low a level to be useful for building filesystem traces. Filesystem operations may span several packets, and may be meaningful only in the context of other, previous operations.

Previous work on distributed filesystem network traffic has focused on characterizing the network load itself (e.g., [8]). While useful for understanding traffic patterns and developing a queueing model of NFS loads, these previous studies do not use the network traffic to analyze the file access traffic patterns of the system, focusing instead on developing a statistical model of the individual packet sources, destinations, and types. Higher-level studies of file access patterns have traditionally involved installing a trace package directly on the client and/or server machines.

This paper describes a toolkit for collecting traces of NFS file access activity by monitoring Ethernet traffic. A "spy" machine with a promiscuous Ethernet interface is connected to the same network as the file server. Each NFS packet is analyzed and a trace is produced at an appropriate level of detail.

We partition the problem of deriving NFS activity from raw network traffic into two fairly distinct subproblems: decoding the low-level NFS operations from the packets on the network, and translating these low-level commands back into user-level system calls. Hence, the toolkit consists of two basic parts, an "RPC decoder" (rpcspy) and an "NFS analyzer" (nfstrace). rpcspy communicates with a low-level network monitoring facility ([2], [9]) to read and reconstruct the RPC transactions (call and reply) that make up each NFS command. nfstrace takes the output of rpcspy and reconstructs an approximation of the system calls that triggered the activity.
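The call/reply reconstruction that rpcspy performs can be sketched as follows. This is an illustration only, not rpcspy's actual code; the packet dictionary keys and the record layout are our own hypothetical labels. The essential idea is that each RPC call carries a transaction ID (XID) that its reply echoes back, so pending calls can be held in a table keyed by XID until the matching reply arrives.

```python
# Sketch of RPC call/reply pairing as a passive monitor must do it.
# Packet field names here are hypothetical labels, for illustration.

pending = {}  # XID -> call packet awaiting its reply


def handle_packet(pkt):
    """Match RPC calls to replies by transaction ID (XID); return a
    vertical-bar-separated record when a transaction completes."""
    if pkt["type"] == "call":
        pending[pkt["xid"]] = pkt
        return None
    call = pending.pop(pkt["xid"], None)
    if call is None:
        return None  # reply with no matching call (e.g., a lost packet)
    # Execution time is the call-to-reply gap, in microseconds.
    exec_us = round((pkt["time"] - call["time"]) * 1_000_000)
    return (f'{pkt["time"]:.6f} | {exec_us} | {call["server"]} | '
            f'{call["client"]}.{call["uid"]} | {call["proc"]} | '
            f'{call["args"]} | {pkt["status"]}')
```

Keying on the XID also makes retransmitted calls harmless: a duplicate call simply overwrites its own table entry, and only the first matching reply produces a record.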

NFS Protocol Overview

It is well beyond the scope of this paper to describe the protocols used by NFS; for a detailed description of how NFS works, the reader is referred to [10], [11], and [12]. This section will give a very brief overview of how NFS activity translates into Ethernet packets and the problems a monitor program might have reconstructing the activity. In particular, we discuss the stateless nature of NFS (no open or close calls) and the way files are represented by handles.

The rpcspy Program

rpcspy is the interface to the system-dependent Ethernet monitoring facility; it produces a trace of the RPC calls issued between a given set of clients and servers. This section describes the overall function of rpcspy in detail.

For each RPC transaction monitored, rpcspy produces an ASCII record containing a timestamp, the name of the server, the client, the length of time the command took to execute, the name of the RPC command executed, and the command-specific arguments and return data. Currently, rpcspy understands and can decode the 17 NFS RPC commands, and there are hooks to allow other RPC services (for example, NIS) to be added reasonably easily. The output may be read directly or piped into another program for further analysis; the format is designed to be reasonably friendly both to human readers and to other programs (such as nfstrace or awk).

Since each RPC transaction consists of two messages, a call and a reply, rpcspy waits until it receives both these components and emits a single record for the entire transaction. The basic output format is seven vertical-bar-separated fields:

timestamp | execution-time | server |
    client | command-name | arguments | reply-data

where timestamp is the time the reply message was received, execution-time is the time (in microseconds) that elapsed between the call and reply, server is the name (or IP address) of the server, client is the name (or IP address) of the client followed by the userid that issued the command, command-name is the name of the particular program invoked (read, write, getattr, etc.), and arguments and reply-data are the command dependent arguments and return values passed to and from the RPC program, respectively.

The exact format of the argument and reply data is dependent on the specific command issued and the level of detail the user wants logged. For example, a typical NFS command is recorded as follows:

690529992.167140 | 11717 | paramount | merckx.321 | read |
     {"7b1f00000000083c", 0, 8192} | ok, 1871 

In this example, uid 321 at client "merckx" issued an NFS read command to server "paramount". The reply was issued at (Unix time) 690529992.167140 seconds; the call command occurred 11717 microseconds earlier. Three arguments are logged for the read call: the file handle from which to read (represented as a hexadecimal string), the offset from the beginning of the file, and the number of bytes to read. In this example, 8192 bytes are requested starting at the beginning (byte 0) of the file whose handle is "7b1f00000000083c". The command completed successfully (status "ok"), and 1871 bytes were returned. Of course, the reply message also included the 1871 bytes of data from the file, but that field of the reply is not logged by rpcspy.
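Since the format is deliberately machine-friendly, a downstream consumer of rpcspy output needs only to split on the vertical bars. The sketch below shows one way to do this; the dictionary keys are our own labels for the fields, not anything emitted by the tool itself.

```python
def parse_rpcspy_record(line):
    """Split one rpcspy output line into its seven fields.
    The key names are our own labels, for illustration only."""
    fields = [f.strip() for f in line.split("|")]
    timestamp, exec_us, server, client, command, args, reply = fields
    # The client field is the client name followed by ".userid".
    client_host, _, uid = client.rpartition(".")
    return {
        "timestamp": float(timestamp),
        "exec_us": int(exec_us),
        "server": server,
        "client": client_host,
        "uid": int(uid),
        "command": command,
        "arguments": args,
        "reply": reply,
    }


# Parse the example record from the text.
rec = parse_rpcspy_record(
    '690529992.167140 | 11717 | paramount | merckx.321 | read | '
    '{"7b1f00000000083c", 0, 8192} | ok, 1871')
```

Splitting on "|" is safe here because none of the fields, including the brace-delimited argument list, contains a vertical bar.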

Implementation Issues

This section describes the implementation of rpcspy and some of the less obvious problems encountered in making it work correctly. In particular, we discuss the representation of file handles across various NFS implementations, caching of IP address/name translations, memory management, and packet fragmentation.

nfstrace: The Filesystem Tracing Package

nfstrace is a filter for rpcspy that produces a log of a plausible set of user level filesystem commands that could have triggered the monitored activity. A record is produced each time a file is opened, giving a summary of what occurred. This summary is detailed enough for analysis or for use as input to a filesystem simulator.

The output format of nfstrace consists of 7 fields:

timestamp | command-time | direction |
     file-id | client | transferred | size

where timestamp is the time the open occurred, command-time is the length of time between open and close, direction is either read or write, file-id identifies the server and the file handle, client is the client and user that performed the open, transferred is the number of bytes of the file actually read or written, and size is the size of the file (in bytes).

An example record might be as follows:

690691919.593442 | 17734 | read | basso:7b1f00000000400f |
     frejus.321 | 0 | 24576 
Here, userid 321 at client frejus read the file with handle 7b1f00000000400f on server basso. The file is 24576 bytes long, and since zero bytes were transferred over the network, it was read entirely from the client's cache. The command started at Unix time 690691919.593442 and took 17734 microseconds at the server to execute.
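The cache-hit inference in this example follows directly from the record's fields: a read that transferred zero bytes over the network must have been satisfied by the client's cache. A parser for nfstrace records and this inference can be sketched as follows; the key names are our own illustrative labels.

```python
def parse_nfstrace_record(line):
    """Split one nfstrace record into its seven fields.
    The key names are our own labels, for illustration only."""
    ts, cmd_us, direction, file_id, client, transferred, size = (
        f.strip() for f in line.split("|"))
    # The file-id field is "server:handle".
    server, _, handle = file_id.partition(":")
    return {
        "timestamp": float(ts),
        "command_us": int(cmd_us),
        "direction": direction,
        "server": server,
        "handle": handle,
        "client": client,
        "transferred": int(transferred),
        "size": int(size),
    }


def served_from_cache(rec):
    # A read that moved no bytes over the network was satisfied
    # entirely by the client's cache.
    return rec["direction"] == "read" and rec["transferred"] == 0


# The example record from the text: a whole-file cache hit.
rec = parse_nfstrace_record(
    '690691919.593442 | 17734 | read | basso:7b1f00000000400f | '
    'frejus.321 | 0 | 24576')
```

Comparing transferred against size also distinguishes partial reads (some bytes cached, some fetched) from whole-file misses.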

The output of nfstrace is an approximation of the underlying user activity. This section will describe the heuristics used by nfstrace to approximate the original system calls. We discuss the detection of cache hits versus cache misses, file name translation, and related issues.
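Because NFS is stateless, the monitor never sees an open or close; these must be inferred. One plausible heuristic (an illustration of the general approach, not necessarily the exact rules nfstrace applies) is to attribute reads on the same client/file-handle pair that fall within a short inactivity window to a single inferred open...close sequence. The window length and record shapes below are our own assumptions.

```python
WINDOW = 2.0  # assumed: seconds of inactivity that closes an inferred open


def group_opens(ops):
    """Group read operations into inferred opens.
    ops: list of (time, client, handle, nbytes) tuples, sorted by time.
    Returns a list of (start_time, client, handle, total_bytes)."""
    open_files = {}  # (client, handle) -> [start_time, last_time, total]
    result = []
    for t, client, handle, nbytes in ops:
        key = (client, handle)
        cur = open_files.get(key)
        if cur is not None and t - cur[1] <= WINDOW:
            # Same inferred open: extend it and accumulate bytes.
            cur[1] = t
            cur[2] += nbytes
        else:
            if cur is not None:
                # Gap exceeded the window: emit the previous open.
                result.append((cur[0], client, handle, cur[2]))
            open_files[key] = [t, t, nbytes]
    # Flush any opens still outstanding at end of trace.
    for (client, handle), cur in open_files.items():
        result.append((cur[0], client, handle, cur[2]))
    return result
```

Any such windowing scheme trades accuracy for simplicity: two genuinely distinct opens in quick succession are merged, and a single open with a long idle period is split, which is one reason the resulting trace is only an approximation.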

Using rpcspy and nfstrace for Filesystem Tracing

This section describes the applications of rpcspy and nfstrace. Clearly, nfstrace is not suitable for producing highly accurate traces; cache hits are only estimated, the timing information is imprecise, and data from lost (and duplicated) network packets are not accounted for. When such a highly accurate trace is required we must resort to more traditional tracing approaches.

We compare nfstrace with other approaches to filesystem tracing, and describe the test suite that was used to validate the accuracy of the trace results. We will also briefly discuss some of the social and ethical issues arising out of research based on trace data collected from real users.

A Trace of Filesystem Activity in the Princeton C.S. Department

In a previous paper [14] presented at USENIX, we analyzed a five-day trace of filesystem activity conducted on 112 research workstations at DEC-SRC. The paper identified a number of file access properties that affect filesystem caching performance; it is difficult, however, to know whether these properties were unique artifacts of that particular environment or are more generally applicable. This section describes how we used rpcspy and nfstrace to conduct a week-long trace of filesystem activity in the Princeton University Computer Science Department. Approximately 500,000 file opens were recorded.

We will compare the results of the Princeton nfstrace trace with the DEC-SRC trace of the previous paper. We describe the environment in which the traces were collected. Measurements of the Princeton data were remarkably similar to those taken on the SRC data in the previous paper.

In particular, we will examine observed hit rate, file write sharing and file "entropy". Data will be described graphically and analytically.

Conclusions

Although not as accurate as direct, kernel-based tracing, a passive network monitor such as the one described in this paper can permit tracing of distributed systems relatively easily. The ability to limit the data collected to a high-level log of only the data required can make it practical to conduct traces over several months. Such a long-term trace is presently being conducted at Princeton as part of the author's research on filesystem caching. The non-intrusive nature of the data collection makes traces possible at facilities where kernel modification is impractical or unacceptable.

Availability

The toolkit is available for anonymous ftp over the Internet from samadams.princeton.edu, in the compressed tar file nfstrace/nfstrace.tar.Z.

References

[1] Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., & Lyon, B. "Design and Implementation of the Sun Network File System." Proc. USENIX, Summer, 1985.

[2] Mogul, J., Rashid, R., & Accetta, M. "The Packet Filter: An Efficient Mechanism for User-Level Network Code." Proc. 11th ACM Symp. on Operating Systems Principles, 1987.

[3] Ousterhout, J., et al. "A Trace-Driven Analysis of the Unix 4.2 BSD File System." Proc. 10th ACM Symp. on Operating Systems Principles, 1985.

[4] Floyd, R. "Short-Term File Reference Patterns in a UNIX Environment." TR-177, Dept. Comp. Sci., U. of Rochester, 1986.

[5] Baker, M., et al. "Measurements of a Distributed File System." Proc. 13th ACM Symp. on Operating Systems Principles, 1991.

[6] Metcalfe, R. & Boggs, D. "Ethernet: Distributed Packet Switching for Local Computer Networks." CACM, July, 1976.

[7] "Etherfind(8) Manual Page." SunOS Reference Manual, Sun Microsystems, 1988.

[8] Gusella, R. "Analysis of Diskless Workstation Traffic on an Ethernet." TR-UCB/CSD-87/379, University of California, Berkeley, 1987.

[9] "NIT(4) Manual Page." SunOS Reference Manual, Sun Microsystems, 1988.

[10] "XDR Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[11] "RPC Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[12] "NFS Protocol Specification." Networking on the Sun Workstation, Sun Microsystems, 1986.

[13] Postel, J. "User Datagram Protocol." RFC 768, Network Information Center, 1980.

[14] Blaze, M. & Alonso, R. "Long-Term Caching Strategies for Very Large Distributed File Systems." Proc. Summer 1991 USENIX, 1991.
