
Configuration Debugging as Search: Finding the Needle in the Haystack

Andrew Whitaker, Richard S. Cox, and Steven D. Gribble
University of Washington
{andrew,rick,gribble}@cs.washington.edu

Abstract

This work addresses the problem of diagnosing configuration errors that cause a system to function incorrectly. For example, a change to the local firewall policy could cause a network-based application to malfunction. Our approach is based on searching across time for the instant the system transitioned into a failed state. Based on this information, a troubleshooter or administrator can deduce the cause of failure by comparing system state before and after the failure. We present the Chronus tool, which automates the task of searching for a failure-inducing state change. Chronus takes as input a user-provided software probe, which differentiates between working and non-working states. Chronus performs "time travel" by booting a virtual machine off the system's disk state as it existed at some point in the past. By using binary search, Chronus can find the fault point with effort that grows logarithmically with log size. We demonstrate that Chronus can diagnose a range of common configuration errors for both client-side and server-side applications, and that the performance overhead of the tool is not prohibitive.

1  Introduction

Continual change is a fact of life for software systems. For desktop machines, users can install new applications, apply software upgrades, change security policies, and alter system configuration options. Servers and other infrastructure services are also subject to frequent changes in functionality and administrative settings.
The ability to change is what gives software its vibrancy and relevance. At the same time, change has the potential to disrupt existing functionality. For example, software patches can break existing applications [5]. Seemingly unrelated applications can conflict - for example, by corrupting Windows registry keys or shared configuration options. Changes to security policies, while often necessary to respond to emerging threats, can disrupt functionality. For server-side applications, administrator actions and other "operator errors" [17] are a substantial contributor to overall downtime.
In most cases, these change-induced failures are diagnosed by human experts such as system administrators. This approach suffers on a variety of fronts: trained experts are expensive, they are in short supply, and they are faced with escalating system complexity and change. As a consequence, system administration costs are approaching 60-80% of the total cost of ownership of information technology [12].
Figure 1: Searching through time for a configuration error: Chronus reveals configuration errors by pinpointing the instant in time the system transitioned to a failed state.
The goal of this work is to reduce the burden on human experts by partially automating problem diagnosis. In particular, we analyze the applicability of search techniques for diagnosing configuration errors. Our insight is that although computers cannot compete with human intuition, they are very effective at exploring a large configuration space. Our diagnosis tool, which we call Chronus, uses search to identify the specific time in the past when a system transitioned from a working to a non-working state, as shown in Figure 1. Using this information, an administrator can more easily diagnose why the system stopped working, for example, by comparing the file system state immediately before and after the fault point to determine the configuration change that "broke" the system.

1.1  Existing Approaches

In this work, we focus on automated problem diagnosis. For the sake of completeness, we briefly survey other approaches, arguing that the approach embodied by Chronus represents an advance for a significant class of configuration errors.
The best approach to dealing with configuration errors is prevention. Unfortunately, the complexity of today's systems makes it difficult to reason a priori about all possible side effects of a configuration change. One problem is that modern systems are built from components from many vendors, and there are few global mechanisms that are capable of understanding the effects of configuration changes in the large. The situation is further exacerbated by the inadequacy of analysis tools. For example, determining whether a software patch results in "equivalent" system behavior is intractable.
Recovery tools such as Windows XP Restore [24] create occasional state checkpoints, allowing users to "undo" [8] the effects of bad configuration changes. While effective in some situations, this approach faces several limitations. First, it requires the user to choose an appropriate state snapshot, which assumes that some form of problem diagnosis has already occurred. Second, recovery itself can corrupt system state, either by undoing "good" changes or restoring "bad" changes. Problem diagnosis in Chronus does not modify system state, and can therefore be safely employed in more situations.
Expert system diagnosis tools have a goal similar to that of Chronus, in that they attempt to map from symptoms to a root cause. A widely used (though rudimentary) example is the Windows "Help and Support Center." Expert systems typically rely on a static rule database, and are therefore only effective for known configuration errors. Arguably, known configuration errors would be better handled by improvements in software design or user interfaces. In addition, as systems grow more complex, static rule databases grow increasingly incomplete.
When all else fails, the last recourse is manual diagnosis by an expert. People have intuition and experience, letting them reason about unexpected situations. Unfortunately, human resources are scarce and costly, and mastering the complexity of today's software systems represents a significant hurdle to effective diagnosis.

1.2  The Chronus Approach

Chronus is a troubleshooting tool whose goals are to simplify the task of diagnosing a configuration error and to reduce the need for costly human expertise. Rather than requiring troubleshooters to answer the difficult question "why is the system not working?", our tool instead requires them to supply a software probe (i.e., a script or program) that answers the simpler question "is the system currently working?" Given a probe, Chronus searches through time for the instant that the system transitioned from a working to a non-working state. As we will demonstrate, many common configuration errors can be diagnosed with simple shell scripts.
Chronus relies on several components. A time-travel disk [25] captures the progression of the system's durable state over time by logging disk block writes. Chronus uses the mDenali virtual machine monitor [35] to instantiate, boot, and test historical snapshots of the system, including the complete operating system and application state. Chronus executes the user-supplied software probe to test whether a given historical state works correctly. Finally, Chronus relies on a search strategy to efficiently educe the failure-inducing state change from a large sequence of historical states. In many cases, Chronus can use binary search, allowing for diagnosis time that scales logarithmically with log length.
Figure 2: A Chronus debugging session: Given a user-supplied software probe, Chronus reveals when the system began failing. Based on this information, it is possible to understand the cause of failure using higher-level analysis tools.
The output from Chronus is the time of the fault point. Based on this timing information (the "when"), the troubleshooter can then use OS- or application-specific tools to diagnose the cause of the failure (the "why"). One simple but useful technique is to compare the complete file system state immediately before and after the failure using an invocation of the UNIX diff command. Figure 2 depicts the stages of a typical Chronus session.

1.3  Outline

In the remainder of this paper, we describe the design and implementation of Chronus, and we demonstrate its ability to help a troubleshooter diagnose significant configuration errors. In Section 2, we describe the challenges we faced and the design decisions that we made. Section 3 discusses the Chronus implementation. We describe our debugging experience in Section 4 and evaluate Chronus quantitatively in Section 5. After discussing related work in Section 6, we describe open problems and future work in Section 7, and we conclude in Section 8.

2  Challenges and Design Tradeoffs

In this section, we drill down into the major components of Chronus, identifying the key challenges and describing the design tradeoffs we faced.

2.1  Time travel

Chronus relies on a time travel mechanism to instantiate previous system states. Traditional checkpointing systems capture the complete state of a system, including both persistent state (e.g., disk contents) and transient state (e.g., memory and CPU state). This approach recreates previous states with high fidelity, but imposes a heavy overhead to continually flush memory state to disk. Approaches based on incremental logging (e.g., ReVirt [15]) reduce overhead during normal operation, but require more time to recreate a previous system state.
Instead of taking full checkpoints, Chronus only records updates to persistent storage. This allows for reasonable performance during both normal operation and problem diagnosis. As we demonstrate in Section 5.1, the overhead of our versioning storage system is primarily limited to disk space (which is plentiful) rather than degraded performance.
A drawback of disk-only state capture is that we sacrifice completeness: only errors that persist across system restarts are recorded by the time travel layer. Note, however, that some configuration changes require system restarts to take effect - for example, changes to shared libraries or the OS kernel typically require system reboots. For this type of "delayed release" configuration change, the on-disk state is more meaningful than the instantaneous characteristics of the running system.

2.1.1  Time-travel disks

Time-travel or versioning storage systems have been extensively studied. Proposed systems include versioning file systems [30,31], source code repositories [14], time-travel databases [32], and the Peabody time-travel disk [25]. Taken as a whole, these systems demonstrate a tradeoff between completeness and high-level semantics (Figure 3). At one extreme, the time-travel disk offers the most completeness, in that it captures all state changes without requiring support from operating systems or applications. At the other extreme, relational databases offer strong data consistency semantics, but require applications to utilize a particular API.
Figure 3: Time travel storage layer tradeoff: Chronus uses a time travel disk, which achieves completeness while forfeiting high-level semantics.
For Chronus, we chose a storage system based on a time-travel disk. One of our goals was to avoid making assumptions about how and where configuration errors arise. Because of its low-level interface, a time-travel disk captures all local configuration changes, without regard to application or OS functionality. Chronus is to some degree "future-proof," in that it can diagnose configuration errors for systems that have yet to be written.
A drawback of a time-travel disk is that it offers poor data consistency semantics. In some cases, the on-disk state may be corrupt, causing Chronus to discover a spurious error unrelated to the true cause of failure. More commonly, Chronus may discover the correct error, but the granularity of a block change is too fine to make a useful diagnosis. For example, configuration files can temporarily disappear while the text editor's "save" operation is in progress. Because such inconsistencies are short-lived, it often suffices to "zoom out" by computing state changes over a slightly longer interval.

2.2  Instantiating a historical state

Another key design decision is the technique used to instantiate previous system configurations. A simple strategy would be to use application-layer restarts, in which the user-mode processes of interest are restarted after each configuration change. Unfortunately, many relevant configuration changes require whole-system reboots, including changes to system software (the kernel, shared libraries) or configuration options (TCP/IP parameters, firewall policy).
In this work, we use a virtual machine monitor (VMM) [13,33,35] to perform "virtual reboots" in software. Because VMMs emulate the hardware layer, they provide a more complete representation of whole-system behavior. In addition, VM restarts offer several advantages over physical machine restarts. VMs can be rebooted faster, because they avoid re-initializing physical I/O devices; for Chronus, this translates into faster problem diagnosis. VMMs provide robust mechanisms for terminating failed tests and reclaiming state changes, and they enable debugger-like functionality, allowing the user to inspect or modify VM state.
There are disadvantages to using VMMs. Virtualization imposes performance overhead; this can be minimized [4], but may still be significant in some settings. VMMs tend to reduce virtual device interfaces to the lowest common denominator, and thus may mask or perturb some configuration errors. A VMM might not expose a bleeding-edge graphics card, for example. Finally, a VMM-based implementation of Chronus cannot diagnose configuration errors within the virtualization layer itself, such as updates to physical device drivers.

2.3  Testing a historical state

Chronus's automated diagnosis capability relies on a user-supplied software probe to test whether the system is functioning correctly. Testing a system is often easier than performing a full failure diagnosis. Nevertheless, testing itself can be a non-trivial task, and probe authorship represents a hurdle to utilizing Chronus.
In our current prototype, probes are written on the fly in response to specific failure conditions. We assume that troubleshooters have knowledge of shell scripts and basic command-line tools. Given these, many configuration errors are testable, including application crashes, a Web browser that fails to load pages, or a remote execution service that refuses access to valid clients.
For errors that are beyond the scope of shell scripts, Chronus supports a manual testing mode, in which the human troubleshooter performs some or all of the testing process by hand. We have found manual testing particularly useful to evaluate errors that involve sequences of GUI actions or that require the user to interpret a visual image. Manual testing can be used with more configuration errors than probes, but it imposes a heavier burden.
In the future, we plan to explore techniques to simplify probe creation. One option is to create static libraries of probes, which could be used to test generic forms of application behavior. For example, a generic web server probe might attempt to download the system home page. For graphical applications, Chronus could leverage point-and-click tools for capturing and replaying sequences of GUI actions [20].
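As an illustration, a library version of the web-server probe just mentioned might look like the following sketch. It is hypothetical: the server address is illustrative, and we assume an HTTP client such as curl is available on the parent VM.

#!/bin/sh
# Sketch of a generic web-server probe: fetch the home page
# and report whether the server responded at all. The address
# and the use of curl are illustrative assumptions.
SERVER=10.19.13.17

if (curl -s -o /dev/null http://${SERVER}/)
    then echo "HTTPD UP"
else echo "HTTPD DOWN"
fi

exit 0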
Regardless of testing strategy, there are some configuration errors that Chronus cannot diagnose. Non-deterministic errors (or Heisenbugs [17]) that cannot be reliably reproduced are beyond the scope of our tool.

2.4  Searching over time

Given a probe, a naïve approach to finding a fault point is to sequentially examine every historical state of the system. Of course, this is impractical, as it would require instantiating, booting, and testing a virtual machine for each disk block write that occurred in the past.
A more intelligent approach is to use a binary search through time. If the troubleshooter can identify a past instance in time at which the system worked, and assuming there is a single transition from that working state to the current non-working state (as in Figure 1), then binary search will find the fault point in logarithmic time.
However, in some cases, a system may make multiple transitions from a working to a non-working state, as shown in Figure 4. Most of these additional state transitions are spurious, in that they are not related to the true source of the current configuration error. For example, because software is typically unavailable during a software upgrade, Chronus may mistakenly implicate a past upgrade that is unrelated to the current configuration error. Other sources of spurious errors include configuration changes that have already been fixed, and short-term inconsistencies due to corrupt file system state.
Figure 4: A spurious search result: Chronus may detect an error that is unrelated to the current cause of failure.
A simple strategy for dealing with multiple failures is to run Chronus multiple times. By choosing different time ranges for each search, Chronus can be made to explore different regions of the system time-line. This is philosophically similar to simulated annealing search, which uses random choices to escape local minima [29]. The troubleshooter can then analyze all returned state transitions to determine which one is the likely source of failure.
An alternate strategy is to construct probes that are less likely to exhibit spurious errors. One useful technique is to construct error-directed probes, which search for changes in the system's observable symptoms, regardless of whether the behavior is "correct." The key insight is that different failure causes often produce different failure modes. For example, one error might cause an application to hang, whereas another produces an identifiable error message. Therefore, probes that search for a particular symptom are less likely to reveal spurious errors unrelated to the true cause of failure. We explore such a complex error scenario in Section 4.3.

2.5  Going from "when" to "why"

The output from Chronus is the instant in time the transition to a failing state occurred. Using this, the troubleshooter can determine the state change that induced the failure. In many cases, this information alone is sufficient to diagnose the configuration error.
In other cases, however, the individual state change revealed by Chronus may be insufficient to diagnose the error. For binary configuration data, there is no universal differencing mechanism that reveals the "meaning" of a state change. Another limitation is that Chronus cannot uncover the broader context in which a state change was carried out. For example, Chronus cannot associate a modification to a dynamic library with the act of installing a particular application. In these cases, reversing the single state change revealed by Chronus may be insufficient to remedy the problem.
The solution to this "semantic gap" [11] between hardware-level events and higher-level semantics lies in combining Chronus with other debugging tools. The UNIX diff utility, which reveals changes to ASCII files, is one such tool, but others may be more appropriate in certain contexts. For example, the Windows regdiff tool reveals changes between two snapshots of the Windows Registry. The Backtracker tool [22] performs root-cause analysis by mapping from a low-level state event to a high-level user action. Another approach is to leverage existing system logs. Currently, the sheer volume of this logging makes it difficult to use, but the timing information provided by Chronus can be used to quickly zoom in on a small cross-section of system log entries.
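For example, once Chronus reports the fault point, a troubleshooter could restrict attention to the log entries recorded near that time. A sketch of such a filter over an attached snapshot follows; the timestamp pattern is illustrative.

# Extract syslog entries from a one-minute window around the
# fault point (the date and time shown are illustrative).
grep '^Mar 12 14:3[23]' /child1/var/log/messages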

2.6  Summary

The Chronus tool maps from a user-provided software probe to the instant the system transitioned to a failing state. This information, in conjunction with higher-level analysis tools like diff, allows a troubleshooter to diagnose the cause of failure.
The design of Chronus was guided by a few basic goals. Unlike programming language debuggers, Chronus strives for low overhead during normal operation. To achieve this, our snapshot mechanism only captures storage updates rather than complete memory checkpoints. Chronus also strives to capture as broad a class of configuration errors as possible. We achieve this by using a time-travel disk (which captures all persistent state changes) and virtual machine monitors (which reproduce the entire system boot sequence). Finally, Chronus strives for fast problem diagnosis. Binary search provides for diagnosis time that scales logarithmically with log size. Also, our use of virtual machines enables individual tests to execute significantly faster than would be possible on physical hardware.

3  Implementation

In this section, we describe our prototype implementation of Chronus. Our prototype consists of roughly 2600 commented lines of C code, approximately half of which is dedicated to the time-travel disk. The other half comprises the search, testing, and diagnosis functionality. Figure 5 shows a high-level view of Chronus.
Figure 5: Chronus software architecture: During normal operation, the parent VM records the child's disk writes to a time-travel disk (TTDisk). During debugging, a software probe is used to determine the correctness of a given state. Chronus uses the probe to implement a search strategy (such as binary search) across the system time-line.
Chronus makes heavy use of the mDenali VMM [35]. Presently, mDenali (and hence Chronus) only supports the NetBSD guest OS. The mDenali VMM allows a "parent" virtual machine to exert control over its "child" virtual machines. In addition to being able to create, destroy, and boot child VMs, the parent can interpose on and respond to its children's virtual hardware device events. For example, if a child issues a virtual disk write, that event is passed to the parent via the "lib_interpose" interposition library. In Chronus, the child executes normal user programs, while the parent implements the Chronus debugging functionality. Chronus itself runs as a normal user process with permission to access the interposition and control APIs described by Whitaker et al. [35].
Chronus exposes a command-line interface to the troubleshooter. The search command initiates a diagnosis session. The command's arguments include the name of a time-travel disk, the beginning and end of a search range (expressed as log indexes), and a probe configuration file, which defines the executable probe routine and other probe meta-data. If the search range limits are omitted, Chronus defaults to the beginning and end of the log. After Chronus has identified the instant of failure, the attach command is used to mount the child disk into the parent's local file system before and after the failure. The troubleshooter can then use commands such as diff to extract meaningful state changes. In addition to the search and attach commands, Chronus provides a set of command line utilities for interacting with time-travel disks. See Table 1 for details.
Table 1: Chronus command-line utilities: Automatic commands perform time-travel searches given a search probe. Manual commands allow the troubleshooter to instantiate a time-travel disk at some point in the past. Administrative commands perform TTDisk creation and maintenance.
Beyond the mDenali VMM, the major components of Chronus's implementation are a time-travel disk for recreating previous states, testing infrastructure for evaluating individual states, and binary search for efficiently localizing the failure across many previous states.

3.1  Time-travel Disk

The Chronus time-travel disk (or TTDisk) maintains a log of the child VM's disk writes. The TTDisk implements the mDenali disk interface [35], a C API that allows the programmer to implement custom functionality for disk reads and writes. The TTDisk functionality is hidden behind the hardware disk interface, so the child's guest OS requires no modifications.
The TTDisk uses two helper disks to maintain state. A checkpoint disk contains the initial disk contents. All disk writes are recorded to a log disk. The implementation of both disks is abstracted away behind the mDenali disk interface. In our current implementation, checkpoint/log disks can be backed by either physical disk partitions or by files in the parent's local file system. We disable write caching to ensure that disk writes are synchronously flushed to disk. Periodically, the log can be trimmed by flushing old entries back to the checkpoint.
In addition to the data disks, the TTDisk requires a meta-data region to map a given disk block to a location in either the checkpoint or the log. The TTDisk meta-data is similar to the checkpoint region of the log-structured file system [28], except that it preserves all previous disk writes, not merely those that are still active. For each TTDisk block, the meta-data region maintains a sorted list of the log writes that modified the given block. The meta-data region is backed by a file in the parent's local file system. As with the log-structured file system, we alternate between two meta-data regions (files) to ensure consistency in the face of failure [28].

3.1.1  Design details

The TTDisk uses a block size larger than the disk sector size to reduce the amount of meta-data. For NetBSD, the correct choice for this parameter is not the file system block size, but rather the file system fragment size. BSD systems typically use a large block size and rely on smaller fragments to efficiently store small files [23]. By choosing the TTDisk block size to match the file system fragment size, we avoid degrading performance for small writes.
A general problem for log-structured storage systems is maintaining consistency without synchronously writing log meta-data. The design of the TTDisk avoids synchronous meta-data writes by appending a recovery sector to each block written to the log. The recovery sector contains two fields: the virtual block index that the log write corresponds to, and a 64-bit counter, which is used to indicate the last log entry. During recovery, we roll forward the log starting from the last meta-data checkpoint until we reach a recovery sector that does not contain a valid counter.
The implementation of TTDisk crash recovery is not complete in our current prototype. We have implemented a version of TTDisk that writes recovery sectors, but this version exhibits poor performance because the mDenali disk interface currently supports only 4 KB block operations (as opposed to 512 byte sector operations).

3.2  Testing infrastructure

Chronus relies on user-supplied software probes to indicate whether a given time step corresponds to a "correct" system state. Given such a probe, the testing infrastructure automates the task of instantiating and evaluating a previous system state. After the test has completed, any state changes made during the test are discarded.
Chronus supports two styles of software probes. Internal probes run inside the child virtual machine being tested. External probes run on the parent virtual machine conducting the test. Generally, external probes are used for diagnosing server failures. Running a probe internally on the server could yield incorrect results, since the local loopback network device is configured separately from the external interface. Internal probes are used for all other types of applications, including network clients and non-networked applications.
The steps for executing a probe differ for internal versus external probes. In both cases, the first step is to wrap the time-travel disk with a copy-on-write (COW) disk. This provides a convenient mechanism for discarding state changes made during probe execution. For internal probes, the parent virtual machine then executes a pre-processing routine, which mounts the COW disk into the parent's file system, and configures the child's file system to execute the probe routine on boot. By convention, the probe output is stored in a particular file for later extraction. Once the probe has executed, the child VM performs a halt operation, causing the parent VM to terminate it. Alternately, a timeout mechanism is used for tests that hang or stall. After termination, the parent VM once again mounts the COW disk, and executes a post-processing routine to extract the probe result.
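The following sketch summarizes this cycle for an internal probe; the mount_cow, boot_vm, and destroy_vm commands are hypothetical stand-ins for the corresponding mDenali control operations.

#!/bin/sh
# Sketch of the internal-probe test cycle (helper commands are
# hypothetical stand-ins for mDenali control operations).

mount_cow cowdisk /child               # pre-processing
cp probe.sh /child/etc/rc.d/probe.sh   # arrange to run probe at boot
umount /child

boot_vm cowdisk                        # runs until halt or timeout

mount_cow cowdisk /child               # post-processing
cat /child/TTOUTPUT                    # extract the probe result
umount /child
destroy_vm cowdisk                     # discard all state changes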
The steps for executing external probes are similar, but simpler. The pre-processing and post-processing phases are omitted. The probe runs in the parent virtual machine while the child virtual machine is running. Once the probe terminates or times out, the child VM is garbage collected.

3.3  Binary search

Chronus uses binary search to quickly find the fault point along the system time-line. We assume the system exhibits a single transition from a working to a non-working state, as shown in Figure 1. Chronus begins by running the probe at the limits of the user-provided search range. Assuming the limits exhibit different probe results, Chronus then tests the midpoint; if the midpoint's output matches that of the later (failing) endpoint, Chronus recursively tests the earlier half of the time line. If the midpoint's output differs from that endpoint's, Chronus recursively tests the later half of the time line. In some cases, a probe may fail to execute or may produce non-binary results. To handle this, Chronus treats all results that differ from the failing endpoint's as equivalent. This tends to work because probe failures often coincide with a non-working system, and we are generally interested in the last transition from a working to a non-working state.
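The following shell sketch conveys the search loop; the run_probe command is a hypothetical stand-in for booting a snapshot at a given log index and executing the probe.

#!/bin/sh
# Binary search over log indexes [LO, HI], where the probe
# output differs at the two limits. run_probe is a hypothetical
# stand-in for instantiating and probing one snapshot.
LO=$1; HI=$2
BAD=`run_probe ${HI}`       # output at the failing endpoint

while [ `expr ${HI} - ${LO}` -gt 1 ]
do
    MID=`expr \( ${LO} + ${HI} \) / 2`
    OUT=`run_probe ${MID}`
    # any result that differs from the failing endpoint
    # is treated as "working"
    if [ "${OUT}" = "${BAD}" ]
        then HI=${MID}
    else LO=${MID}
    fi
done

echo "fault point between log entries ${LO} and ${HI}"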
Chronus requires the troubleshooter to specify a search range whose limits exhibit different probe results. Because the troubleshooter might not know an appropriate range a priori, Chronus provides a test command, which allows the troubleshooter to guess-and-check individual time steps. In our experience, this mechanism has proven sufficient to quickly discover a valid search range for most failure cases.
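For example, a troubleshooter might bracket the failure by hand before invoking the full search. The session below is illustrative, and the test command syntax shown is an assumption modeled on the search command:

>>> test netbsd andrew.time 1000
SSHD UP
>>> test netbsd andrew.time 5000
SSHD DOWN
>>> search netbsd andrew.time 1000 5000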

4  Debugging Experience

In this section, we describe our experience using the Chronus tool. For each experiment, we used binary search to locate the failure in time and the UNIX diff utility to extract the state change. In some cases, it was necessary to compute the state difference over a time range larger than a single block. As a result, diff sometimes detects spurious changes such as changes to emacs backup files or modifications to the system lost+found directory. In some cases, we have sanitized the results for brevity, but we never removed more than eight lines of output. All probes are written as UNIX shell scripts.
#!/bin/sh

TEMPFILE=./QXB50.tmp
rm -f ${TEMPFILE}

ssh root@10.19.13.17 'date' > ${TEMPFILE}

if (test -s ${TEMPFILE})
    then echo "SSHD UP"
else echo "SSHD DOWN"
fi

exit 0

Figure 6: sshd probe: This is the complete version of a shell script that diagnosed a configuration fault in the ssh daemon.
>>> search netbsd andrew.time
0000: SSHD UP  5267: SSHD DOWN  2633: SSHD UP
3950: SSHD UP  4608: SSHD UP    4937: SSHD DOWN
4772: SSHD UP  4854: SSHD UP    4895: SSHD UP
4916: SSHD UP  4926: SSHD DOWN  4921: SSHD DOWN
4918: SSHD UP  4919: SSHD UP    4920: SSHD DOWN

# attach ttdisk before and after fault
>>> attach andrew.time 4919 4920

# use recursive diff to find what changed
>>> diff -r /child1 /child2
Binary file /etc/ssh/ssh_host_key differs

Figure 7: Diagnosing the sshd failure: This terminal log shows Chronus's output for a binary search using the sshd probe. We have added comments to the raw output, preceded by '#'. After pinpointing the failure instant, we attach the time-travel disk before and after the fault, and use recursive diff to elicit the failure cause.
# Probe
#!/bin/sh

rm -f /TTOUTPUT
echo 'SUCCESS' > /TTOUTPUT

# Console output

% search netbsd andrew2.time

0000: SUCCESS  1607: FAILURE  0803: SUCCESS
1205: SUCCESS  1406: SUCCESS  1506: FAILURE
1456: FAILURE  1431: FAILURE  1418: FAILURE
1412: FAILURE  1409: FAILURE  1407: SUCCESS
1408: FAILURE

% attach andrew2.time 1407 1408
% diff -r --exclude '*dev*' /child1 /child2

file: /child1/etc/rc.d/bootconf.sh differs
<       conf=${_DUMMY}
>       conf=${$DUMMY}

Figure 8: Boot failure probe and console output: The probe writes a string to a file, but only if the boot process completes successfully. Using this probe, Chronus diagnosed the failure as resulting from a change to the file bootconf.sh.

4.1  Randomly injected failures

We wrote a fault-injection tool called etc-smasher that creates typos in key system configuration files. Such errors can be difficult to diagnose because they often do not take effect until after the machine is rebooted. Once per second, etc-smasher chooses a random file from the /etc directory (which contains system and application configuration files). 90% of the time, etc-smasher writes back the file without modifying it; this creates "background noise" in the system. For the remaining 10%, the program changes the file in a small way, by either removing, adding, or modifying a character. To generate a sample run, we ran the program for several minutes, and observed the most obvious failure symptom.
The first two runs of this program induced the following configuration errors:
Configuration Fault #1: sshd failure. The child VM's sshd daemon does not respond to remote login requests.
Configuration Fault #2: boot failure. The child VM does not boot correctly. Instead of a login prompt, the user is asked to enter a shell name.
To diagnose the sshd failure, we wrote a probe that attempts to log in via ssh and execute the UNIX date command. This probe (shown in Figure 6) is an external probe: it runs on the parent VM. Notice that the probe only deals with the observable symptoms of ssh, and not with any of its potential failure causes (TCP/IP mis-configurations, authentication failure, failure of the ssh daemon itself, etc.). Figure 7 shows the output of running a Chronus binary search for this error. The ssh fault was introduced between disk block writes 4919 and 4920 within the log. The output from diff indicates the error resulted from a change in the ssh_host_key file.
To diagnose the boot failure, we crafted a probe that writes a string into a file (see Figure 8). The probe runs internally (within the child VM), but only executes after the boot sequence has completed. As a result, the existence of the file /TTOUTPUT indicates a successful trial. If the boot process hangs, Chronus eventually terminates the virtual machine, and the trial constitutes a failure. As shown in Figure 8, Chronus correctly identified the source of the error as a small typo in the file /etc/rc.d/bootconf.sh.

4.2  Debugging Mozilla errors

To understand Chronus's behavior for graphical applications, we analyzed a list of frequently asked questions for the Mozilla Web browser [26]. The questions fall into two categories: 1) customization questions such as "how can I make Mozilla my default browser?" and 2) errors/problems. The latter category comprises 24 out of a total of 53 questions.
In Table 2, we indicate which Mozilla errors could be diagnosed with Chronus. To qualify for Chronus support, an error must be both easily reproducible and result from a state change from Mozilla's default configuration. Overall, 15 of the 24 errors (63%) in the Mozilla FAQ satisfy these criteria.
We further break down the errors captured by Chronus according to the best available testing strategy. For 7 error cases, it would be possible to construct a shell-script probe to elicit the failure condition. From a script, it is possible to direct Mozilla to a specific page and extract the returned result. Also, Mozilla supports a "ping" command, which is useful for determining if the application has crashed or hung. The 8 remaining error cases require manual control over some or all of the testing process; typically, these errors involve GUI interactions that are difficult to script. In the future, it may be possible to automate more diagnoses using graphical capture/replay tools [20].
The "connection refused" error requires further explanation. The error arises when a local firewall prevents the Mozilla executable from establishing out-bound connections. This error has a subtle dependence on the order that the firewall and Mozilla are installed. If Mozilla is installed first, then the installation of the firewall will trigger a failure, which Chronus can detect. If the firewall is installed first, then Mozilla will never work correctly. Nevertheless, it is still possible to diagnose this error with Chronus by using a probe that first installs Mozilla, and then tests the application.
Table 2: The applicability of Chronus for Mozilla errors: 15 of these 24 errors could be captured by Chronus. This means that they are both repeatable and result from a state change. In 7 of these cases, the testing could be conducted automatically given a shell-script probe. For the other 8 cases, the testing process requires assistance from a human operator, either to manipulate Mozilla or interpret its visual output.
Beyond studying applicability, we also used Chronus to diagnose several of the Mozilla errors. For each trial, we synthetically injected the error condition based on the description in the Mozilla FAQ. We then wrote a probe to diagnose the behavior, and ran Chronus to pinpoint the offending state transition. We now describe two such trials in more depth.

4.2.1  JavaScript error

JavaScript is used by some web sites to provide enhanced functionality beyond static content. JavaScript is also a security concern, and Mozilla allows users to limit the functionality of scripts, or to disable JavaScript completely. In some cases, JavaScript-enabled sites may demonstrate strange behavior if JavaScript is not enabled. For example, the user may be unable to follow hyperlinks for a particular page [26].
To model this error, we installed Mozilla in a virtual machine and disabled JavaScript through the preferences menu. To test for the error, we wrote a probe that directs Mozilla to fetch a web page that requires JavaScript support. The probe asks the user whether the resulting display output is correct. The probe and console output are shown in Figure 9.
# Probe
#!/bin/sh

ssh -X root@10.19.13.79 'mozilla  $WEBSITE' &

echo -n 'RESULT: '

read result
echo $result

# Console output

169904: RESULT:  GOOD 222044: RESULT: BAD
195974: RESULT: BAD 182939: RESULT: BAD
176421: RESULT: BAD 173162: RESULT:  GOOD
174791: RESULT: BAD 173976: RESULT: BAD
173569: RESULT: BAD 173365: RESULT:  GOOD
173467: RESULT: BAD 173416: RESULT:  GOOD
173441: RESULT:  GOOD 173454: RESULT: BAD
173447: RESULT:  GOOD 173450: RESULT:  GOOD
173452: RESULT: BAD 173451: RESULT: BAD

>>> diff -r  /child1 /child2
file /root/.mozilla/default/zc1u3kp2.slt/prefs.js differs:
> user_pref("browser.download.dir", "/root");
> user_pref("browser.startup.homepage", "https://www.mozilla.org/start/");
> user_pref("javascript.enabled", false);

Figure 9: Mozilla JavaScript probe and console output: This probe, combined with user input, diagnosed a Mozilla rendering problem related to JavaScript. The probe runs externally on the parent VM, so that X-windows ssh forwarding is set up properly. Spurious information exists because Mozilla atomically saves all preference changes made during a user session.

4.2.2  A misbehaved extension

Mozilla allows developers to provide new functionality via an extensibility API. These extensions are not well-isolated, and a misbehaved extension can cause the overall browser to malfunction. To model this error, we installed a set of extensions from the Web. After quitting and restarting the program, we discovered that one of these extensions had introduced a malfunction, such that Mozilla would hang before displaying a page.
To diagnose this error, we wrote a probe that uses the Mozilla "ping" command to indicate whether a previously-launched browser is functioning correctly. Figure 10 shows the output of the diff utility. Although more verbose than previous examples, the state change reveals that the "StockTicker" extension caused Mozilla to malfunction.
>>> diff -r /child1 /child2
file /root/.mozilla/default/zc1irw5u.slt/chrome/chrome.rdf differs:

> <RDF:Description about="urn:mozilla:package:stockticker"
> c:baseURL="jar:file:///root/.mozilla/default/zc1irw5u.slt/chrome/stockticker.jar!/content/"
> c:locType="profile"
> c:author="Jeremy Gillick"
> c:authorURL="https://jgillick.nettripper.com/"
> c:description="Shows your favorite stocks in a customized ticker."
> c:displayName="StockTicker 0.4.2"
> c:extension="true"
> c:name="stockticker"
> c:settingsURL="chrome://stockticker/content/options.xul" />

Figure 10: Console output for a buggy Mozilla extension: Chronus traced the failure to the "StockTicker" extension.

4.3  A complex Apache error

As discussed in Section 2.4, binary search can fail in the presence of multiple faults in a single time-line. To explore this phenomenon, we introduced a sequence of configuration events inside an Apache web server, as shown in Figure 11a. The "true failure" is a mis-configuration of the Apache suexec command, which allows an administrator to run CGI scripts as a different user than the overall Web server. suexec is a common source of configuration errors, especially when scripts require special privileges [7]. In our example, the CGI script must connect to a back-end database, which only permits access from the user www. As a result, Web requests for this script return an HTTP error message.
In addition to the suexec error, we performed two actions that affect the Web server's functionality. Near the start of the trace, we changed the server's IP address. Because DNS mappings are not captured in our time-travel layer, any attempt to connect to the server before the IP address change will not succeed. Subsequently, we upgraded the version of Apache running on the server. This new build was necessary to support the suexec command. During the installation of the upgrade, the Web server is unavailable to Chronus probes.
There are two strategies one could take in analyzing this failure. First, one could write a success-directed probe, which tests whether the system successfully handles requests. We wrote such a probe by testing for a successful HTTP response. The drawback of such a probe is that it may detect any transition from a working to a failing state, as shown in Figure 11a. In the worst case, the user must decipher two spurious results before revealing the true source of the error.
An alternate approach is to construct a failure-directed probe. Instead of looking for successful completion of a request, a failure-directed probe searches for the precise error behavior exhibited by the application. In this example, the suexec failure returned a distinctive error message. Because different errors often exhibit different symptoms, a failure-directed probe can result in fewer state transitions over an equivalent system time-line (see Figure 11b). Using a failure-directed probe, we discovered the source of the suexec failure with a single Chronus search invocation.
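A sketch of a failure-directed probe for this scenario follows; the URL, the error string matched, and the use of curl are illustrative assumptions:

#!/bin/sh
# Sketch of a failure-directed probe: report only whether the
# distinctive suexec symptom is present (URL and error string
# are illustrative).
if (curl -s http://10.19.13.17/cgi-bin/query.cgi \
        | grep -q 'Internal Server Error')
    then echo "SUEXEC ERROR"
else echo "NO SUEXEC ERROR"
fi

exit 0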
Figure 11: Apache suexec error, as seen by two different probes: A success-directed probe searches for transitions from a working to a failing state. This may return spurious results when the system contains multiple such transitions. A failure-directed probe searches for changes in the specific error symptom exhibited by the application. For this example, a failure-directed probe revealed the configuration error with a single Chronus invocation.

4.4  Reverse debugging

Although we intended Chronus as a tool for finding configuration bugs, an alternate use is to search for configuration fixes. This is especially useful in cases when the "fix" was applied serendipitously. For example, application X might install a dynamic library that fortuitously allows application Y to work correctly. In practice, the issue is even more subtle, because the order in which packages are installed can affect the system's final configuration [19]. Given a failing machine and a correct machine, an administrator can use Chronus to find the fix from the correct machine, and then apply the fix to the failing machine.
We used reverse debugging to elicit the correct configuration for the NetBSD Network Time Protocol (NTP) daemon. Initially, the system's NTP configuration was incorrect, causing the system's time to be set to an incorrect value. Although we fixed the problem in one particular VM, the change was not propagated back to the base disk image. To locate the fix, we wrote a probe that searches for unusual behavior from the make utility; make relies on a correct clock, and may force unnecessary recompilation when the clock is mis-configured.
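The following sketch conveys the idea behind that probe; the warning text matched is illustrative, and we assume a small source tree at a known location:

#!/bin/sh
# Sketch of the clock-sanity probe: run make over a small,
# already-built tree and look for clock-related complaints
# (the warning text matched is illustrative).
cd /tmp/clockprobe

if (make 2>&1 | grep -q 'in the future')
    then echo "CLOCK BAD"
else echo "CLOCK OK"
fi

exit 0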

5  Quantitative Evaluation

In this section, we provide quantitative measurements of Chronus. We analyze time-travel disk performance, log growth, and debugging execution time. All tests were run on a uniprocessor 3.2GHz Pentium 4 with hyperthreading disabled. The test machine had 2 GB of RAM, but the virtual machines (both the parent and the child) were configured to use at most 512 MB. The machine contained a single 80 GB, 7200 RPM Maxtor DiamondMax Plus IDE drive, and an Intel PRO/1000 PCI gigabit Ethernet card.
All of the following experiments were run without appending recovery sectors to log writes. Therefore, the results model a system that uses some other mechanism for ensuring meta-data consistency (e.g., non-volatile RAM). An implementation with recovery sectors would require 12.5% more disk space (one 512 byte recovery sector is appended to each 4 KB block). The performance overhead would likely be similar.

5.1  Runtime Overhead

To evaluate time-travel disk performance, we ran the set of workloads shown in Table 3. We generated the sequential read and write workloads with the UNIX dd command in 32 KB block increments. We also ran an "adversarial" sequential read, in which we read over a disk region that was previously written in reverse order in 32 KB increments. Finally, we ran untar and grep over the Mozilla 1.6 source tree. Mozilla 1.6 contains 35,186 files in 2,454 directories, and has a total size of 300 MB. The "native disk" data series shows the performance of a child VM using a physical disk partition. The time-travel disk log was backed by a physical partition. No disk operations were processed by the checkpoint disk, and swapping was disabled for these tests.
Table 3: Time-travel disk performance: The time-travel disk is competitive with the native disk for all workloads, except for the "adversarial" workload designed to exhibit poor locality in the time-travel log.
For most workloads, the performance of the time-travel disk is competitive with the native disk. The one exception is the adversarial sequential read workload. Because blocks are written out in reverse log order, this style of workload generates poor performance from a log-structured storage layer. Most files are processed sequentially [2], suggesting this style of workload occurs rarely in practice.

5.2  Measuring log inflation

Chronus relies on excess storage capacity to maintain the time-travel log. This is reasonable, given that storage capacity is growing at an annual rate of 60% [18] and shows no signs of abating. Other researchers have noted that users can already go years without reclaiming storage [16].
One remaining concern is log inflation, which arises from file system meta-data operations. Applications that heavily modify the directory structure can generate excessive log growth. Table 4 shows the amount of log growth required for various operations on the Mozilla 1.6 archive. As expected, simply copying the tar file does not generate undue log inflation. However, untarring Mozilla causes log growth that is more than six times larger than the growth in the underlying file system. Even worse, deleting the Mozilla directory tree (with rm -Rf) generates 1432 MB of log data! The source of this log growth is repeated, synchronous updates to file system structures such as free block lists, inodes, and directory contents.
Table 4: Log inflation: Operations that greatly modify the file system directory structure generate a large number of log writes. Fortunately, the writes are highly redundant and amenable to compression.
We have considered two possibilities for combating log inflation. The first is compression: the contents of meta-data operations are highly redundant, and would therefore exhibit significant size reductions (as shown in Table 4). The second is to temporarily deactivate versioning - for example, using heuristics similar to those employed by the Elephant file system [30]. We have not yet implemented or experimented with either of these strategies.

5.3  Debug execution time

Because Chronus uses binary search, it can discover configuration errors in a logarithmic number of steps. Figure 12 shows Chronus's convergence time for logs of various sizes. The test uses an internal probe that tests for the existence of a particular file. Chronus currently requires roughly 20 seconds to conduct a single probe; for example, a log of one million block writes would require about 20 probes (the base-2 logarithm of 10^6), or roughly seven minutes of search time. More than half of each probe's time is devoted to file system consistency check (fsck) operations, which we must perform twice per probe - once before installing the probe, and once to extract the result. Moving to a journaling file system would substantially reduce this overhead.
Figure 12: Debug execution time: The runtime grows logarithmically with log size.

6  Related Work

We now discuss related work in problem diagnosis and resolution. We first discuss history-based resolution techniques, and then we discuss other techniques.

6.1  History-based problem resolution

Researchers have proposed versioning storage systems at various levels of abstraction [25,30,31,32,14]. In some cases, recovery from configuration errors has been cited as a driving application. The VMWare virtual machine monitor [33] also supports checkpointing to enable safe recovery. Unlike Chronus, these systems do not perform failure diagnosis. As a result, the user is forced to undo all state changes that occurred after the error. Chronus helps to reveal the specific failure cause, enabling recovery with minimal lost state.
The Operator Undo work [8] attempts to recover lost state by invoking an application-specific replay procedure. In a similar vein, Windows XP Restore [24] allows developers to exert some control over which state is included in state snapshots. Both of these approaches, being application-specific, are less general than Chronus. Also, these techniques have side effects, which can further corrupt system state. For example, Windows Restore may inadvertently re-introduce a virus into the system.
The Backtracker tool [22] maintains an operating system causal history log. Such a tool could address one of Chronus's current shortcomings, which is its inability to extract semantically relevant debugging information from the child virtual machine. For example, Chronus might discover that an application failure was caused by an update to a particular dynamic library. Given this starting point, a Backtracker-like tool could determine that the library change was caused by the installation of an unrelated application.
Several research efforts have extended programming language debuggers with the ability to perform time-travel or backwards execution [6]. These systems tend to have high overhead or long replay times, depending on the extent to which they rely on checkpointing or logging. In addition, these systems are tied to a particular language or runtime environment. Chronus detects configuration errors that span applications and the OS, and it does so with tolerable overhead by recording only those changes that reach stable storage.
Delta-debugging [36] applies search techniques to the problem of localizing source code edits that induced a failure. Delta-debugging does not assume changes are ordered, and much of the system's complexity derives from having to prune an exponentially large search space. The challenges for Chronus relate to capturing and replaying complete system states using time-travel disks and virtual machines.
The STRIDER [34] project uses periodic snapshots of the Windows registry to reveal configuration errors. Unlike Chronus, STRIDER monitors a single execution of a failing program, during which it records the registry keys that are accessed by the faulty process. STRIDER requires registry-specific heuristics to prune the search space: for example, registry keys that differ across machines are less likely to be at fault. STRIDER does not detect indirect dependencies that result from interactions with helper processes or the operating system. For example, STRIDER cannot reveal errors related to TCP/IP parameters or firewall policy.

6.2  Other problem resolution techniques

A direct strategy for automated debugging is to construct a software agent that embodies the knowledge of a human expert [3]. The limitation of such systems is that they are only as good as their initial diagnosis heuristics. Complex systems generate unexpected errors. Chronus can capture these errors by operating beneath the layer of operating system and application semantics.
The No-Futz [21] computing initiative advocates a principled approach to maintaining configuration state. For example, the authors advocate making individual configuration parameters orthogonal, limiting unintended side effects. While this is a worthwhile goal, the tight integration of today's application and system functionality suggests that debugging techniques will still be necessary when inevitable failures occur.
Redstone et al. proposed a model of automated debugging that extracts relevant system state and symptoms to serve as a query against a database of known problems [27]. A challenge for such a system is constructing a database and query format that yield meaningful results. Chronus avoids using databases by directly "querying" the system state at a previous instant in time. The results returned by our system may be more relevant because they pertain exclusively to the system under consideration.
Several recent projects have investigated path-based debugging of distributed systems [10,1]. These systems log the interactions between components or nodes of a distributed system. By applying statistical techniques to these traces, it is possible to extract information of interest, e.g., localizing performance problems or detecting an incipient system failure. These systems depend on the ability to extract large volumes of trace data showing the interaction between distributed components. Chronus is useful in situations where these assumptions are not satisfied, e.g., desktop personal computers.

7  Future Work

Although functional, our Chronus prototype could be extended in numerous ways. One area of interest is extending mDenali and Chronus beyond a UNIX environment. In particular, systems based on Microsoft Windows are likely to exhibit qualitatively different configuration errors. We are also interested in extending Chronus with different time-travel storage mechanisms. For example, some administrators use CVS to maintain a log of configuration changes. Chronus could use CVS check-ins to reconstruct previous system states.
Chronus is not a fully automatic tool: the troubleshooter must supply a software probe and interpret the state change that induced the failure. It may be possible to reduce this manual effort by combining Chronus with related research efforts. For example, capture/replay tools could automate probe creation [20], and Backtracker [22] could simplify end-to-end diagnosis by mapping from a low-level state change to a high-level action.
A final area for future work is to perform a more complete evaluation of Chronus. Our work to date has focused on a small number of case studies representing "common" configuration errors. Although our initial results are promising, we do not have enough data about configuration errors in the wild to make strong claims about the applicability of Chronus. An even harder challenge is to measure the "usefulness" of our tool. In the end, a complete evaluation of Chronus will likely require a user study, since simulating a human operator is intractable. Work by Brown et al. provides a starting point for such an effort [9].

8  Conclusions

Software systems often break. When they do, diagnosing the cause of failure can be difficult, especially when the application depends on a wide range of system-level and user-level functionality. Existing automated approaches based on expert systems can only handle error cases that are known in advance. Human experts can leverage intuition to solve unforeseen problems, but manual diagnosis requires significant expertise, which ultimately translates into substantial cost.
This paper has described Chronus, a tool for automating the diagnosis of configuration errors caused by a state change. Chronus represents a novel synthesis of existing techniques: versioning storage systems, virtual machine monitors, testing, and search. Chronus reduces the burden on human experts from complete diagnosis ("why is the system not working?") to testing for correctness ("is the system working?"). Our experience to date suggests that Chronus is a valuable tool for a significant class of configuration errors.
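To make this synthesis concrete, the sketch below shows the search loop at the heart of the approach: a binary search over disk versions that boots each candidate in a virtual machine and applies the probe. The boot_vm_at and run_probe helpers are hypothetical stand-ins for the VM and probe machinery, not the actual Chronus implementation.

    # Sketch: binary search across time for the failure-inducing change.
    # Assumes the oldest version works and the newest fails; boot_vm_at()
    # and run_probe() are hypothetical stand-ins for the real machinery.
    def find_fault_point(versions, boot_vm_at, run_probe):
        lo, hi = 0, len(versions) - 1      # invariant: lo works, hi fails
        while lo + 1 < hi:
            mid = (lo + hi) // 2
            vm = boot_vm_at(versions[mid]) # boot off the old disk state
            if run_probe(vm):              # probe reports "working"
                lo = mid                   # the fault happened later
            else:
                hi = mid                   # the fault happened by mid
        # lo and hi are now adjacent: the last working and first failing
        # states, whose difference reveals the failure-inducing change.
        return versions[lo], versions[hi]

The invariant that lo works and hi fails ensures the loop converges to two adjacent versions straddling the fault point, using a number of VM boots that is logarithmic in the length of the version history.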

9  Acknowledgments

We thank our shepherd, Timothy Roscoe, for his guidance, and Ed Lazowska, Neil Spring, Marianne Shaw, and Andrew Schwerin for their insightful comments. We also thank Phil Levis for giving us pointers to valuable related work. This research was supported in part by NSF Career award ANI-0132817, funding from Intel Corporation, and a gift from Nortel Networks.

References

[1]
M.K. Aguilera, J.C. Mogul, J.L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003.
[2]
M.G. Baker, J.H. Hartman, M.D. Kupfer, K.W. Shirriff, and J.K. Ousterhout. Measurements of a distributed file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, 1991.
[3]
G. Banga. Auto-diagnosis of field problems in an appliance operating system. In Proceedings of the USENIX Annual Technical Conference, June 2000.
[4]
P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), Bolton Landing, NY, October 2003.
[5]
S. Beattie, S. Arnold, C. Cowan, P. Wagle, C. Wright, and A. Shostack. Timing the application of security patches for optimal uptime. In Proceedings of the Sixteenth USENIX LISA Conference, November 2002.
[6]
B. Boothe. Efficient algorithms for bidirectional debugging. In Proceedings of the ACM Conference on Programming Language Design and Implementation, 2000.
[7]
R. Bowen and K. Coar. Apache Cookbook. O'Reilly and Associates, November 2003.
[8]
A.B. Brown and D.A. Patterson. Undo for operators: Building an undoable e-mail store. In Proceedings of the 2003 USENIX Annual Technical Conference, June 2003.
[9]
A.B. Brown, L. Chung, W. Kakes, C. Ling, and D.A. Patterson. Experiences with evaluating human-assisted recovery processes. In Proceedings of the International Conference on Dependable Systems and Networks, June 2004.
[10]
M.Y. Chen, A. Accardi, E. Kiciman, D. Patterson, A. Fox, and E. Brewer. Path-based failure and evolution management. In Proceedings of the First Symposium on Networked Systems Design and Implementation, March 2004.
[11]
Peter M. Chen and Brian D. Noble. When virtual is better than real. In Proceedings of the Workshop on Hot Topics in Operating Systems, May 2001.
[12]
Final Report of the CRA Conference on Grand Research Challenges in Information Systems. https://www.cra.org/reports/gc.systems.pdf, 2003.
[13]
R.J. Creasy. The origin of the VM/370 time-sharing system. IBM Journal of Research and Development, 25(5), 1981.
[14]
Version management with CVS. https://www.cvshome.org/docs/manual/.
[15]
G.W. Dunlap, S.T. King, S. Cinar, M. Basrai, and P.M. Chen. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 2002 Symposium on Operating Systems Design and Implementation (OSDI 2002), Boston, MA, December 2002.
[16]
J. Gemmell, G. Bell, R. Lueder, S. Drucker, and C. Wong. MyLifeBits: Fulfilling the Memex vision. In ACM Multimedia, December 2002.
[17]
Jim Gray. Why do computers stop and what can be done about it? In Proceedings of the 5th Symposium on Reliability in Distributed Software and Database Systems, January 1986.
[18]
E. Grochowski. Emerging trends in data storage on magnetic hard disk drives. Datatech, 1998.
[19]
J. Hart and J. D'Amelia. An analysis of RPM validation drift. In Proceedings of the USENIX LISA Conference, 2002.
[20]
J.H. Hicinbothom and W.W. Zachary. A tool for automatically generating transcripts of human-computer interaction. In Proceedings of the Human Factors and Ergonomics Society 37th Annual Meeting, 1993.
[21]
D.A. Holland, W. Josephson, K. Magoutis, M. Seltzer, C.A. Stein, and A. Lim. Research issues in no-futz computing. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, May 2001.
[22]
Samuel T. King and Peter M. Chen. Backtracking intrusions. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), Bolton Landing, NY, October 2003.
[23]
M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems, 2(3), 1984.
[24]
Microsoft, Inc. Windows XP system restore. https://msdn.microsoft.com/library/default.asp?URL=/library/techart/windowsxpsystemrestore.htm, April 2001.
[25]
C.B. Morrey and D. Grunwald. Peabody: The time travelling disk. In 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003.
[26]
Mozilla FAQ: Using Mozilla. https://mozilla.gunnars.net/mozfaq_use.html.
[27]
Joshua A. Redstone, Michael M. Swift, and Brian N. Bershad. Using computers to diagnose computer problems. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems, 2003.
[28]
Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, 1991.
[29]
S.J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, December 2002.
[30]
D.S. Santry, M.J. Feeley, N.C. Hutchinson, A.C. Veitch, R.W. Carton, and J. Ofir. Deciding when to forget in the Elephant file system. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP'99), December 1999.
[31]
C.A.N. Soules, G.R. Goodson, J.D. Strunk, and G.R. Ganger. Metadata efficiency in versioning file systems. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies, March 2003.
[32]
M. Stonebraker. The design of the POSTGRES storage system. In Proceedings of the 13th International Conference on Very Large Data Bases, September 1987.
[33]
VMware, Inc. VMware virtual machine technology. https://www.vmware.com/.
[34]
Y. Wang, C. Verbowski, J. Dunagan, Y. Chen, H.J. Wang, C. Yuan, and Z. Zhang. STRIDER: A black-box, state-based approach to change and configuration management and support. In Proceedings of the USENIX LISA Conference, October 2003.
[35]
Andrew Whitaker, Richard S. Cox, Marianne Shaw, and Steven D. Gribble. Constructing services with interposable virtual hardware. In Proceedings of the First Symposium on Networked Systems Design and Implementation, March 2004.
[36]
A. Zeller. Yesterday, my program worked. Today, it does not. Why? In Proceedings of the 7th European Software Engineering Conference, September 1999.


