Idleness is not sloth Richard Golding, Peter Bosch, Carl Staelin, Tim Sullivan, and John Wilkes Hewlett-Packard Laboratories, Palo Alto, CA Abstract Many people have observed that computer systems spend much of their time idle, and various schemes have been proposed to use this idle time productively. The commonest approach is to off-load activity from busy periods to less-busy ones in order to improve system responsiveness. In addition, speculative work can be performed in idle periods in the hopes that it will be needed later at times of higher utilization, or non-renewable resource like battery power can be conserved by disabling unused resources. We found opportunities to exploit idle time in our work on storage systems, and after a few attempts to tackle specific instances of it in ad hoc ways, began to investigate general mechanisms that could be applied to this problem. Our results include a taxonomy of idle-time detection algorithms, metrics for evaluating them, and an evaluation of a number of idleness predictors that we generated from our taxonomy. 1. Introduction Resource usage is often bursty: periods of high utilization alternate with periods when there is little external load. If work can be delayed from the busy periods to the less-busy ones, resource contention during the busy periods can be reduced, and perceived system performance can be improved. The low-utilization periods can also be exploited for other purposes-for example, power conservation in a portable system by shutting down parts of it, or eagerly performing work that might be needed in the future. We call the periods of sufficiently-low system utilization idle periods. The definition of "sufficiently low" utilization is application specific; we use the term "idle" even if this utilization is non-zero. During these times the system can execute an idle task without affecting time-critical work too much. The overall structure of the idle-detection framework is shown in Figure 1. External work requests arrive and are executed, requiring resources. Potentially useful idle tasks also consume the same resources. An idleness detector monitors the external-work arrivals and the state of the server. When the detector believes the system will be idle enough for long enough, it starts up the idle task. This executes until it completes, or until the detector signals it to stop-typically when new foreground work arrives. The goal of the detector is to make sufficiently good predictions that the net effect to the system of running the idle task is positive. The best predictions exploit all the idle time, while making no mistaken predictions of idle periods when the system is not, in fact, idle. There are two basic ways to measure how good an idle-time processing system is. External measures quantify the interference between the idle task and an outside application, and the benefits from running the idle tasks. These measures use units such as additional operation latency or power consumption. Internal measures are based solely on how accurate the detector's predictions are. The external measures are what really matter, but internal measures are useful for guiding the choice of detection mechanism. The rest of this paper is organized as follows. We begin with a look at several aspects of the problem of making good predictions, including a discussion of external measures and a survey of existing work in this field. We then present an architecture and taxonomy for idleness detectors, and discuss internal measures for evaluating their efficiency. We then use this taxonomy as a tool to generate a suite of detectors, and evaluate their effectiveness under realistic, measured workloads. Some thoughts about opportunities for future work and our conclusions wrap up the paper. 2. Idle-time processing The purpose of idle-time processing is to improve the system's overall performance during non-idle times. Measuring the improvement requires metrics for the costs and benefits of executing each idle task. These measures vary from one system or application to another. Even within one system, more than one measurement may be appropriate. Often these measures are not directly comparable, or may be subjective in nature. For example: * The value of powering down a disk drive is the power saved; the cost is that ordinary work may be delayed, extra power consumed during the "recovery" task of spinning up the drive, and the disk lifetime reduced by repeated start-stop cycles. * The value of disk shuffling (reorganizing data layout on disk) is faster access to frequently-used data. The costs include delaying disk accesses behind a data-move, and buffer-cache pollution. In addition, the detector itself may intrude on normal system operation: some detectors require significant computation resources to identify idle periods. The work required to determine what to do in the idle time is also a potential resource consumer (for example, disk shuffling is typically based on collected access pattern data; delayed operations are put into a queue that itself consumes resources). To understand the benefits and costs of idle-time processing, we must first understand how idle tasks can benefit the system, and the ways in which they can create costs: both when executing in an idle period, and if they end up still executing when the system stops being idle. The rest of this section discusses these issues. 2.1. Characterizing idle tasks Useful idle-time operations fall into a few different categories: * Required work that can be delayed. Examples include delayed cache writes, migrating objects in a storage hierarchy, rebuilding a RAID array, or cleaning a log-structured file system. * Work that will probably be requested later. Examples include disk readahead, eager function evaluation, collapsing chains of forwarding addresses for mobile objects, and eager make. * Work that is not necessary, but will improve system behavior. Examples include rearranging disk layout, shutting down parts of a system to conserve power, checking system integrity, or compressing unused data. * Shifting work from a busy to an idle resource. Examples include choosing the least-loaded network path, or compressing data when the processor is idle to reduce disk traffic. Idle tasks can also be characterized by how they react to being stopped and started (we call these granularity properties): * Interruptability: some idle tasks can be interrupted at any time, and will stop immediately. Others must complete a fixed granule of work before they can relinquish the system resources they are consuming. For example, a disk write operation must run to completion, while a powered-down device can be restarted at any time. * Work loss: when some idle tasks are interrupted, they will lose or must discard some of the work that they have performed. (For example a log-structured file system cleaner may have to abandon work on the current segment.) This cleanup process itself may need resources, and some idle tasks have to be followed by a recovery task to put the system back to a consistent state. * Resource use: most idle tasks block foreground work from making progress to some degree. In the extreme, they may completely deny foreground-work access to a resource (e.g., a disk that has been spun down); in other cases, foreground activity simply slows down while the idle task is executing. Each of these properties affects applications in different ways. For example, high degrees of multiprogramming will probably make a workload more resilient to an idle task that blocks access to a single resource, since there is probably something else useful that can be done while the idle task has the resource. 2.2. Executing idle tasks Once it has been determined that an idle task should be executed, a number of operations occur, as illustrated by the time line in Figure 2. Processing begins when the detector signals the idle task to start executing. (Note that there may be a delay between the end of the last piece of ordinary work and the detector emitting its prediction.) The initial activity of the idle task may be to run an initialization step that prepares the system for the main idle task that follows. (For example, it may determine which log-structured file system segments are to be cleaned.) This is followed by one or more executions of the idle task proper. Each of these might take a different amount of time or require a different resource. (Breaking the idle task up in this fashion reduces its granularity, which reduces the cost of a bad prediction.) Eventually, regular work will again enter the system. If the detector's prediction was accurate, the idle task will have completed execution. If not, the detector signals the idle task to stop, and the task interrupts its activity if possible. It may be necessary to execute a recovery task to bring the system back to normal operation-for example, a powered-down disk must be spun up, or a partially-completed update may need to be undone. For eager or delayed activity, one must also consider the steps required to delay work, or to detect and use eagerly-performed work later. 2.3. Detecting idle periods Recall that the idleness detector monitors external work requests in order to find idle periods. Figure 3 illustrates the process for one kind of idleness detector. At the beginning, the system is doing useful work, but then the offered external load decreases below a predetermined threshold. Some time after this happens, the detector makes a prediction about the idle period and signals the idle task to start. Later, as the load increases past the threshold, it signals the task to stop. More precisely, the detector's problem is to make a series of predictions, each of which identifies the start time and duration of an idle period. The detector cannot be late with a prediction-otherwise it isn't a prediction. A good prediction will neither start earlier than the actual start time (so there is no collision between idle processing and ordinary work) nor start much later (which would waste time the idle task could spend doing work); nor will the duration last beyond the end of the actual idle period. It is usually good to make conservative predictions since the actual cost of incorrectly starting an idle task is usually higher than the opportunity cost of missing a chance to start it successfully. 2.4. Idle task examples The model of idle-time processing that we have presented so far is rather abstract. To make matters more concrete, we now introduce three examples of idle-time processing in storage systems. Delayed writeback The read-write bandwidth of a disk is a scarce resource during a burst of requests. As write buffers increase in size, synchronous read accesses will come to dominate performance in realistic systems because the amount of memory needed to absorb peak write rates is (much) smaller than the quantity needed to cache all reads [Ruemmler93, Bosch94]. The delays seen by reads can be reduced by delaying writes until idle periods, possibly with the help of non-volatile memory [Baker91b, Carson92a]. This is an example of delayed work. When a write operation arrives, it is saved in the cache rather than written to disk. This consumes buffer-cache space, which may reduce the read hit rate in the cache, or require more memory. When the system is idle, these accumulated writes are flushed to disk in groups of N data sectors at a time. (The larger the value of N, the better the requests can be scheduled at the disk [Seltzer90b, Jacobson91].) This flush can potentially delay foreground reads that arrive during the flush. In practice, reads should be given priority over writes [Carson92a]; however, we'll explore the effect of scheduling the reads and writes with identical priority here. This idle task requires no special initialization or recovery actions. A good delayed writeback system minimizes read latency and cache utilization. Table 1 summarizes these characteristics. Initial idle task (none) Idle task flush dirty disk blocks Recovery idle task (none) Granularity unit: fixed (N sectors) loss: none resources: ties up disk Measures max cache space needed change in read latency Table 1: characteristics of delayed writeback Eager LFS segment cleaning In a segmented log-structured file system, blocks are appended to the end of a log as they are written [Rosenblum92, Carson92a, Seltzer93]. The disk is divided up into a number of segments, which are linked together to form the log. As blocks are re-written and their new values appended to the log, earlier copies become garbage that must be reclaimed. A cleaner task selects a segment that contains some garbage blocks and copies the remaining valid blocks out of the segment. The segment is then marked as being empty so it can be written over later. The cleaning operation causes a significant amount of disk traffic, and consumes operating system buffer space [Seltzer93]. The disruptiveness can be minimized if cleaning is performed when there is little ordinary disk traffic. However, segments must be cleaned promptly enough that the system does not run out of clean segments, which would force a segment-clean in the foreground. The cleaning task is characterized by the delay it imposes on ordinary traffic and how often the system runs out of clean segments. Minimizing the first is best done with an interruptible cleaner that can discard partially-completed operations. Table 2 summarizes these characteristics. Initial idle task (none) Idle task clean one segment Recovery idle task discard partially-cleaned segment Granularity unit: fixed (1 segment) loss: up to 1 segment, resource: ties up disk Measures foreground cleaning time change in read latency Table 2: characteristics of LFS segment cleaning Disk power-down Several people have investigated powering down disk drives on portable computers to conserve power (e.g., [Cáceres93, Douglis94, Greenawalt94, Marsh93, Wilkes92b]). Using the taxonomy we have developed, the "idle task" is keeping the disk powered off; this is an "optional" task, whose goal is to decrease power consumption; the initial task is to spin the disk down; the recovery task is to spin it back up (Table 3). Initial idle task spin down disk Idle task (none): saves 1.5-1.7W Recovery idle task spin up disk: takes 1.5s Granularity unit: can be aborted at any time loss: power cost of spinning up disk (3.3J) resource: excludes any other disk accesses Measures power saved delay caused to I/O operations Table 3: characteristics of disk power-down For example, during normal operation, an HP Kittyhawk disk [HPKittyhawk92] consumes 1.5-1.7W. When it is spun down, it enters a "sleep" mode that consumes very little power. When a disk I/O request arrives, the disk must be powered up, which uses 2.2 W for 1.5 s (i.e., 3.3J). Power consumption will decrease if the savings from the powered-down mode outweigh the power cost of spinning it up again. For this disk, it can be achieved if the disk is spun down for as little as 2.2s; larger disks take somewhat longer to recoup the spin-up cost. However, spinning the disk down too often will increase the latency of disk requests and increase the chance of disk failure. A good idle-time mechanism will balance these conflicts. Other examples There are many possible uses for idle time beyond the three that we have mentioned. Storage, compilation, user interfaces, and distributed systems all exhibit highly variable workloads-a clue that a system could benefit from idle-time processing. Here are a few examples. Large data structures such as database indices can often be updated quickly at the cost of gradually-worsening access characteristics. For example, nodes can be inserted into or deleted from a binary tree, unbalancing its branches. However, the initial performance can be recovered by a reorganization step (in this case, rebalancing the tree). Such reorganization tasks are candidates for idle-time processing, thereby moving the cost of rebalancing out of the critical path for updates. Likewise, some indexing systems in databases segregate the index into two parts: one for "stable" information, for which an efficient hash or tree structure has been built; and one for "unstable" information that has been added or modified recently [Herrin91]. The unstable portion is periodically merged into the stable portion, which involves significant calculation. Doing this in an idle task can make it appear as if it happened "for free". Compilation is a particularly rich source of idle-time work. One approach that has been explored is to use an eager make facility [Bubenik89], which detects file changes and rebuilds binaries as soon as possible, in the hope that this work will reduce the latency of performing a make when it is asked for later. Costs here include both the processing and disk-resource contention generated by the background compilations, and also the complexity of hiding intermediate results (which may fail) from the user. Languages such as Cecil, Self, and Smalltalk involve dynamic code generation [Chambers90a]. In many cases this code can be optimized using information generated at run-time, such as the probability of taking a particular branch of a conditional or the inferred type of an object. Modules can be re-optimized when the processor is not busy with normal work during idle time. There are many instances of distributed system load-balancing schemes that look for "idle workstations" and attempt to use the large number of spare cycles that can be claimed in this way. To keep the workstation owners' happy (and thus the workstations eligible for being used in this fashion), good predictions of when the users are idle are needed [Litzkow88]. Designers of such schemes have used process migration as a mechanism for avoiding loss of work when an idle task is preempted [Litzkow88]. More closely-coupled systems can perform eager data transmission in periods when the network is lightly loaded. For example, release consistency [Carter91] mechanisms can pre-transmit updates to reduce the time to complete a release, which is often on the critical performance path. The choice of whether to compress data for network transmission can depend on the utilization of both the network and the processor: if the processor is busy, but the network is not, it may make sense to transmit data uncompressed. Likewise, the rate or quality at which multimedia data are played can be varied to take advantage of extra network or processor resource beyond a guaranteed minimum. These are examples of policy changes as a result of idleness detection. Forwarding-address chains in a distributed system allow clients to find the current location of a moving object [Shapiro92]. When the object moves, it leaves a pointer to its new location at its old one. Locating the object is done by starting at a place that the object has visited, and following the chain of pointers. Forwarding address chains can be compressed when the system is idle, thereby reducing the time to do future locations. Disk and file readahead is a common example of eager activity in a storage system [McVoy91, Ruemmler94]. Transparent file compression can improve effective storage capacity, and compressing during idle periods ensures that it does not interfere with ordinary operations. Disk shuffling [McDonald89, Vongsathorn90, Ruemmler91, Akyürek93] involves rearranging the blocks or cylinders on a disk so that frequently-accessed information is located near the middle of the disk with related data close together. Shuffling has non-trivial initial and recovery tasks. An anonymous reviewer of this paper suggested using idle cycles to perform graphical window-system operations eagerly. For example, hidden or obscured portions of windows could be rendered in idle periods so that exposing them would be quick. In summary, the existence of burstiness and under-utilized resources in a system is an indication that some form of idle-time processing could be exploited. The opportunities are limited mostly by the complexity of deciding what to do when, and by the potential downside of resource contention with foreground activities. Dealing with both of these requires efficient, accurate idleness detectors, whose construction is the subject of the next section. 3. An idle-detector architecture Having presented an overall framework for idle-time processing, we now turn our attention to how to build idleness detectors. We have found it useful to decompose the problem into a number of separate components, each implementing just one part of the detection algorithm. By combining the parts in different ways we can build detectors on a mix-and-match basis to explore a much wider range of design alternatives than would otherwise be the case. Figure 4 shows the overall scheme. An idleness detector is composed of a number of predictors and skeptics, along with an actuator: * A predictor monitors its environment-the arrival process, the server, and possibly other variables such as the time of day-and issues a stream of predictions. Each prediction is a tuple (start time, duration). * A skeptic [Rodeheffer91] filters and smooths these predictions, possibly combining the results from several predictors. * The actuator obeys the sequence of predictions it gets as input to start and stop the idle task; its purpose is to isolate the interactions with the idle task from the predictors. The predictors and skeptics form a directed acyclic graph or DAG with predictors as the leaves and skeptics as the internal nodes. Streams of predictions flow toward the root node, which emits predictions to the actuator. 3.1. Predictors As with so many problems of this type, optimal idleness detection requires off-line analysis. This is not as useless as it might seem: if usage patterns are stable enough, then a one-time off-line analysis may provide an excellent prediction. For example, "weekends from 1-6 a.m." is a common time to perform system maintenance. In practice, on-line detectors are of more interest. They can be classified according to the approach used to predict the idle period start time and its duration. 3.1.1. Idle period start time Simple predictors use little information to make their predictions; more sophisticated ones take advantage of knowledge about the arrival process to make better decisions. We present them here in roughly increasing-complexity order. * Timer-based: whenever the system runs out of work, the predictor begins a timer. If no work comes in before the timer expires, the predictor declares that an idle period has begun. The timer period can be fixed, variable, or adaptive. A fixed period does not change. A variable period is computed as a function of some values in the environment, such as time of day. An adaptive timeout period is increased if predictions are too eager, and decreased if they are too conservative. * Rate-based: the predictor maintains an estimate of the rate at which work is arriving, and declares an idle period when its rate estimate falls below a threshold. Different threshold rates can be used for "start of idle period" and "end of idle period", thereby providing some hysteresis. Methods for maintaining the estimate include: - moving average. The rate is periodically sampled, and the predictor computes a moving average of the samples. - event window. The predictor maintains the times of the last n arrivals, and estimates the rate as n divided by the age of the oldest arrival. This is similar to leaky bucket rate-control schemes for high-speed networks [Cruz92]. - time window. The predictor maintains a list of arrival times more recent than t seconds, and estimates the rate as the length of the list divided by t. This is a variation on the event window method. * Adaptive rate-based: like rate-based, but with an arrival-rate load threshold that is adapted according to the quality of the predictions. * Rate-change-based: these predictors maintain an estimate of the first derivative of the arrival rate to predict in advance when the arrival rate will fall below a threshold. * Periodic: if the workload contains periodic work, a digital phase locked loop or DPLL [Massalin89a, Lindsey81] can be used to synchronize predictions to periodic events in the workload, such as the unix syncer daemon. By knowing when work will arrive, it is also known when the system is idle. 3.1.2. Idle period duration Duration predictors can use a wide range of techniques to adapt to a changing workload. We list them here according to the amount of information they use about the arrival process. * No duration: no prediction is made (alternately, the prediction is "forever"). Variants on this include schemes that try to predict the end of an idle period as this happens, rather than at its beginning. These are most useful when the definition of "idle" allows some residual foreground work. For example: rate-based; adaptive rate-based; or rate-change-based. * Fixed duration: a fixed duration is predicted. The simplest form of this is "enough time to run the idle task once." * Moving average: the predictor keeps a moving (possibly weighted) average of the actual durations. The usual average is the mean, but a geometric average or median can also be used. * Filtered moving average: as for moving average, but only idle periods greater than some lower-bound are considered during the averaging process. * Backoff: after each prediction is used, the predictor uses the feedback to determine whether the actual duration was longer or shorter than predicted. If it was longer, the next prediction is increased; if it was shorter, the next prediction is decreased. The increases can be arithmetic, increasing by a constant each time, or geometric, increasing by a constant factor. The skeptic in Autonet [Rodeheffer91] and round-trip timers in TCP [Postel80a, Comer91, Karn91] used geometric increase and arithmetic decrease to maintain a prediction slightly longer than the actual, while an idleness predictor works to keep its predictions slightly shorter. The backoff algorithm can be applied either at the end of the prediction period, or at the end of the idle period. The first gives a chance for the algorithm to be much more aggressive in extending its estimates, the latter provides more information, but potentially causes the period to be adjusted much less often. * Filtered backoff: backoff schemes that only consider actual idle periods longer than a given lower-bound during their backoff calculations. * Autocorrelation: the autocorrelation on the work arrival process gives the probability of an event arriving or the rate of arrival as a function of time into the future. The predicted duration is the period during which the probability of arrival is below some threshold. The autocorrelation is somewhat expensive to compute, so it might be recomputed periodically rather than continuously. It might also be used to predict the beginning of multiple idle periods. * Conditional autocorrelation: like a simple autocorrelation, except that multiple autocorrelation functions are computed based on some property of arriving events. For example, the expected future might be different following read requests or write requests. * Ad-hoc rules: finally, as with predicting the beginning of an idle period, many systems can take advantage of other specific features of the arrival process, such as periodicity. 3.2. Skeptics A skeptic takes in one or more prediction streams, and emits a new one. They are used to filter out bad predictions and to combine the results from several predictors into a single prediction stream. Single-stream (filtering) skeptics include: * low-pass: discards predicted periods that are shorter than some threshold (e.g., the duration of the idle task). * quality: discards predictions from a predictor that is consistently wrong. The skeptic can compute a measure of the predictor's accuracy, perhaps using a moving average of the measures we propose in Section 4, and only pass along predictions when the accuracy is above some threshold. * environmental: discards or modifies predictions according to some external event (e.g., time of day). This can allow idleness predictions to be restricted to times when nobody is around, for example. The time-of-day input can be derived from moving averages of workloads over long periods of time, so this skeptic can be made adaptive. Perhaps the most important use for skeptics is to combine several prediction streams. For example, a periodic-work detector will not handle non-periodic bursts, while another predictor might. A skeptic could combine the two, only reporting a prediction when both agree. More generally, a skeptic can combine a number of input streams by weighted voting. Each stream is given a weight, and the skeptic produces a prediction only when the combined weights are greater than some threshold. When the weights are equal and fixed, this becomes simple voting. Alternately, the weights can be varied according to the accuracy of each predictor [CesaBianchi94]. This approach has been shown to yield near-optimal prediction in many cases. 4. Evaluating idleness detectors A good idleness detector produces a stream of predictions that waste little time at either the beginning or end of an idle period, while rarely making a prediction that starts too early, ends too late, or overlaps much real work. We now describe several measures used to quantify how well a particular detector meets these goals. Remember that their interpretation depends on an application-specified level of acceptable idleness. We start with several primitive measures: * The predicted time is the total amount of time a detector declared idle. * The actual time is the total time the best possible off-line detector could produce. * The overflow time measures the amount of time that the detector declared idle when the system was not idle. * The violations count the number of operations that overlapped the declared idle periods. * The delay is the sum of the delay imposed on operations, assuming that any operation that starts during a period declared idle must be delayed until the end of the period. * The intrusiveness of an idleness detector measures how much extra load the detector itself imposes on a non-idle system. From these basic measures we compute two derived measures: * The efficiency of a detector: this is a measure of how good the detector is at finding idle periods in the workload. It is defined as efficiency = predicted / actual * A detector's incompetence evaluates how much of the predicted idle time is not idle, penalizing over-eager detectors. It is defined as incompetence = overflow / actual A good idleness detector will have a high efficiency and a low incompetence. Ideally, we'd like to have a single value that give the overall goodness of a detector. However, this is impossible, because the weighting of the value of idle task processing and the costs of incorrect idleness predictions is highly application-specific, and sometimes even subjective at a personal level. Nonetheless, we found it useful to consider three main candidate figures of merit: 1. merit1 = efficiency * (1 - incompetence) This merit figure considers only efficiency and incompetence, penalizing efficient detectors that are also incompetent. 2. merit2 = efficiency / violations This merit figure penalizes a detector heavily for each operation that occurs during a predicted idle period. 3. merit3 = efficiency - alpha * delay This merit figure penalizes a detector for the total time that operations are delayed. The relative importance of delay and efficiency can be scaled by the factor a. The choice is a function of the application. When spinning down a disk, for example, the power saved is related to detector efficiency, while the cost is the delay to operations. The high subjective impact of the cost suggests a metric that penalizes delays more, such as merit3. Delayed writes, on the other hand, have a less dramatic cost of violations, so merit1 is probably more appropriate. 5. Performance results To get quantitative measures of the effectiveness of idle-time processing, we used the taxonomy presented in section 3 to design a large number of possible idleness predictor networks. We then implemented the component parts and evaluated how well the networks did in practice. This section reports on what we learned. 5.1. Implementation We implemented the detectors and idle tasks in the TickerTAIP simulation system [Ruemmler94, Golding94]. In particular, we simulated a host system issuing read and write requests to a set of disks. We used calibrated disk models [Ruemmler94], and exercised our detectors using I/O access traces taken from real systems [Ruemmler93] to avoid making simplifying assumptions about access rates. We used 1week subsets of these traces. We added a few new component classes to the TickerTAIP system. These included an overall framework; multiplexers to distribute events to multiple predictors; the predictors themselves; skeptics; and actuators. Our disk and device driver models were changed to accept detectors. The idleness detectors were connected to the disk devices for taking internal measurements, and to the disk device drivers for external measures. To perform the evaluations, we introduced the notion of a busyness detector to determine when the system should be considered busy (i.e., not idle). This was used to evaluate the idle detectors. Busyness detectors can be application-specific. In practice, the simplest busyness detector was the one used for the results we report here: it declared the system busy if there was at least one request anywhere in it. We looked at several hundred idle detector networks derived from our taxonomy, evaluating them against our internal measures. We also implemented three skeptics: one that filters out predictions during the busiest six hours of the day (the TODSkeptic); one that filters out predictions from a predictor that has a high rate of violations (MeritSkeptic); and one that combines predictions from several predictors, outputting only predictions on which a quorum agree (QuorumSkeptic). 5.2. Start-time predictors To make the presentation of the results less intractable, we picked a fixed duration prediction of 25s and combined this with different start-time prediction algorithms. The most efficient detectors under this test are: * Event window, when busy 10% or less over last 5 operations (efficiency = 1.01) * A 1s fixed timer (1.01) * Event window with rate <10 IO/s over last 5-25 operations (1.00-0.998) * Moving average rate of 10 IOs/s or less (0.993) * Adaptive timer with Geom+/Arith- or vice versa with 100ms increments (0.990) * Event window, busy 1% or less over last 5 operations (0.985) The lowest violation rates (violations per second of predicted idle period) come from: * Event window, rate <0.1 IO/s over 25 operations * Same, over last 5 operations * Moving average, 0.1 IO/s or less * Adaptive timer Arith+/Arith-, 10s increment * Adaptive timer Arith+/Geom-, 10s increment * 10 second fixed timer 5.3. Duration prediction The next step is to fix the start-time predictor (we used a moving-average rate-based start-time predictor with a threshold rate of 10 IO/s because that gave pretty high efficiencies). The most efficient predictors under this test are: * Fixed 25s period (efficiency = 0.993) * Backoff using Arith+/Arith- in 10 ms, 100 ms and 1 s increments (0.992, 0.990 and 0.988 respectively) * Backoff using Arith+/Geom- and a 1 s increment (0.975) * A moving average over the most recent durations (0.971) * Backoff Arith+/Geom- with 100 ms increments (0.959), or 10ms increments (0.945) * Moving average + TOD skeptic (0.759) From the point of view of violation rates, the following were the best ones: * Moving average + violation rate skeptic (violation rate = 0.358s) * Moving average + TOD skeptic (0.722) * Moving average (0.821) * Backoff Arith+/Geom- with increments of 10ms (1.02); 100ms (3.61); or 1s (14.6) * Fixed 25s timer (35.1) * Backoff Arith+/Arith- with increments of 10ms (36.9), 100ms (44.6), or 1s (48.0). In some circumstances (such as log-structured file system cleaning), it's useful to have idleness detectors that can reliably predict idle periods of long duration. Under this metric (the mean duration of the idle period predicted), we find the following are best: * Backoff Arith+/Arith- with increments of 1s (mean duration = 44.8s), 100ms (40.4), or 10ms (34.4) * Fixed 25s timer * Backoff Arith+/Geom- with increments of 1s (17.9); 100ms (5.82); or 10ms (2.19) * Average duration + the merit skeptic (0.982) * Average duration alone (0.952) * Average duration + the TOD skeptic (0.944) 5.4. Other combinations Needless to say, picking the best stand-alone start-time and duration prediction algorithms and combining them doesn't produce the best overall result. Instead, we found (by exhaustive combinatorial analysis!) that the following are among the best pair-wise combinations for efficiency: * EventWindow, busy 10% or less over 25 events + fixed 25s duration (efficiency = 1.0098) * The same algorithm over the last 5 events (1.0095) * Busy 10% or less over 25 events + fixed 25s timer (1.0087) * Busy 10% or less over 5 events + Backoff Arith+/Arith- with 10ms increment (1.0087) * Busy 10% or less over 25 events + Backoff Arith+/Arith- with 10ms increment (1.0087) * Fixed 1s timer, + fixed 25s duration * Busy 10% or less over 25 events + Backoff Arith+/Arith- with 100 ms increments Generally, the results were dominated by Event Window predictors with a threshold of busy 10% or less, or 5 IO/s or less, combined with arithmetic backoff. For durations, the best result came from combining these with adaptive timers with 100 ms arithmetic increase and geometric decrease (or vice versa), or by backoff predictors with 10--100 ms increment arithmetic increases and geometric decrease. The best violation rates results came from: * Adaptive timer Geom+/Geom- + Average duration (violation rate = 0.214/s) * Moving average 10 IO/s + Average duration + a violation rate skeptic (0.358) * Event window over 25 events with rate < 0.1 IO/s + Average duration (0.362) * Event window with rate < 10 IO/s and a moving average, fed into a Quorum skeptic (0.373) * Event window over last 5 events with a rate < 10 IO/s, + Average duration, fed into a violation rate skeptic (0.431) Generally, rate-based or adaptive Arith+/Arith- timers with 10s increments combined with Average duration predictors or Backoff Arith+/Geom- and 10 ms increments dominate. Looking at mean duration predictions, we find the longest values come from: * Adaptive timers, Arith+/Geom- or vice versa, with Backoff Arith+/Arith- (1s increment) duration (mean duration = 340s). * Adaptive timer Arith+/Arith- and increments of 1s (306s), 10s (288s) or 100ms (256s) + Backoff Arith+/Arith- with 1s increments. We learned that there can sometimes be unfortunate interactions between adaptive start-time predictors and their duration counterparts. These feedback loops are best minimized by running the start-time prediction algorithm once per actual idle period, and the duration prediction once per predicted idle period, if this is shorter than the actual idle period, but longer than some minimum threshold. (The first prevents inter-policy conflicts; the second allows duration-predictors to respond gracefully to alternating bursts of high activity and long quiet periods.) 5.5. Skeptics Skeptics proved more important for busy disks, where the workload is more variable than on quiet disks. They are also best at reducing the mean violation rate. On a busy disk, the TOD skeptic reduces violation rate; in the samples, usually to about half what the same predictor yielded without the skeptic, except when the rate was already low. This generally causes a loss of about 20-30% efficiency, indicating that much of the time that could successfully be declared idle was not in the busy hours of the day anyway. The Merit skeptic likewise reduces the violation rate. For predictors with a high violation rate, the Merit skeptic reduces violations to less than half of the rate of using the TOD skeptic, but generally at the cost of about half the efficiency. When the violation rate is already low, the Merit skeptic does not give appreciable advantage over the TOD skeptic, and in one case (MovingAverage 10/s, Average duration) increased the violation rate by about 20% For the Quorum Skeptic, we combined EventWindow and MovingAverage predictors (both at rates < 10/s); the EventWindow and a PLL; and the Moving Average and a PLL. For these combinations, both predictors had to agree on idleness. Using quorums like this leads to short predictions because they all have to agree. We also combined four predictors using a quorum size of two for agreement: two EventWindows with different queue sizes, the EventWindow, and a PLL. The combination of an event window and a moving average predictor them agree gives a low violation rate (0.23 on a busy disk). The combination of all four gives 99% efficiency and one of the lowest violation rates. Overall, our top three picks for workloads such as delayed writeback and segment cleaning are: * EventWindow 10 IO/s + MovingAverage 10 IO/s + Quorum skeptic (both agreeing); gives 0.98 efficiency, 0.23 violations, 0.52s mean duration. * Four predictors + Quorum skeptic; gives 0.99 efficiency, 0.48 violations, 0.52s mean duration. * Moving average start-rate 10/s + average duration; gives 0.98 efficiency, 0.49 violations, 1.05s mean duration. 6. Conclusions The high degree of burstiness observed in many real computer systems gives many opportunities for doing useful work at low apparent cost during idle periods. Many people have observed this, and applied this idea to several specialized domains. The major contribution of this work is to put the previous approaches into a common framework. In turn, the framework suggested opportunities for combining analysis techniques to improve the quality of idle-time prediction. It also allowed us to propose some figures-of-merit for evaluating idle-time detection algorithms, which we then used to evaluate many different idle-time detectors. Our framework should be helpful to those looking to exploit low-utilization periods in computer systems, regardless of the precise details of the problem, since the framework itself is independent of the particular domain. Developing the taxonomy was helpful to us in two ways: it improved our understanding of the problem, and it helped us systematize the generation of a large number of potentially interesting detection and prediction algorithms. Without it, we would have had a much harder time exploring the design space. We were gratified to learn that simple predictors work remarkably well. This is good news: it means that these techniques can be applied successfully in the real world with only moderate effort. Nonetheless, we found that you can't have all three of high efficiency, low violation and long durations, but you can generally get any two. Acknowledgments George Neville-Neil suggested predictors based on the first derivative of arrival rate. Fred Douglis's interest in disk power conservation spurred much of our early thinking. Availability An extended version of this paper, with more detailed results than we had space for here, is available by anonymous ftp from the directory pub/wilkes on the system ftp.hpl.hp.com. References [Akyurek93] Sedat Akyurek and Kenneth Salem. Adaptive block rearrangement. Technical report CS-TR-2854.1. University of Maryland, November 1993. [Baker91b] Mary Baker, Satoshi Asami, Etienne Deprit, John Ousterhout, and Margo Seltzer. Non-volatile memory for fast, reliable file systems. Proceedings of 5th International Conference on Architectural Support for Programming Languages and Operating Systems (Boston, MA, 12-15 October 1992). Published as Computer Architecture News, 20(special issue):10-22, October 1992. [Bosch94] Peter Bosch. A cache odyssey. M.Sc. thesis, published as Technical Report SPA-94-10. Faculty of Computer Science/SPA, Universiteit Twente, Netherlands, 23 June 1994. [Bubenik89] Rick Bubenik and Willy Zwaenepoel. Performance of optimistic make. Proceedings of 1989 ACM SIGMETRICS and Performance '89 International Conference on Measurement and Modeling of Computer Systems (Berkeley, CA). Published as Performance Evaluation Review, 17(1):39-48, May 1989. [Caceres93] Ramon Caceres, Fred Douglis, Kai Li, and Brian Marsh. Operating system implications of solid-state mobile computers. Technical report MITL-TR-56-93. Matsushita Information Technology Laboratory, Princeton, NJ, May 1993. [Carson92a] Scott Carson and Sanjeev Setia. Optimal write batch size in log-structured file systems. USENIX Workshop on File Systems (Ann Arbor, MI), pages 79-91, May 1992. [Carter91] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and performance of Munin. Proceedings of 13th ACM Symposium on Operating Systems Principles (Asilomar, CA). Published as Operating Systems Review, 25(5):152-64, 13-16 October 1991. [CesaBianchi94] N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, and M. Warmuth. On-line prediction and conversion strategies. Technical report UCSC-CRL-94-28. Computer and Information Sciences Board, University of California at Santa Cruz, August 1994. [Chambers90a] Craig Chambers, David Ungar, and Elgin Lee. An efficient implementation of Self, a dynamically-typed object-oriented language based on prototypes. In Urs Hoelzle, editor, The Self Papers. The Self Group, CIS 209, Stanford University, Stanford CA 94305, 1990. [Comer91] Douglas E. Comer and David L. Stevens. Internetworking with TCP/IP: design, implementation, and internals, volume II. Prentice-Hall, 1991. [Cruz92] Rene L. Cruz. Service burstiness and dynamic burstiness measures: a framework. Journal of High Speed Networks, 2:105-27. IOS press, Amsterdam, 1992. [Douglis87] Fred Douglis and John Ousterhout. Process migration in the Sprite operating system. Proceedings of 7th International Conference on Distributed Computing Systems (Berlin, 21-25 September, 1987), pages 18-25, R. Popescu-Zeletin, G. Le Lann, and K. H. Kim, editors. IEEE Computer Society Press, 1987. [Douglis94] Fred Douglis, P. Krishnan, and Brian Marsh. Thwarting the power-hungry disk. Proceedings of USENIX Winter 1994 Technical Conference (San Francisco, CA), pages 292-306. USENIX Association, Berkeley, CA, 17-21 January 1994. [Golding94] Richard Golding, Carl Staelin, Tim Sullivan, and John Wilkes. "Tcl cures 98.3%of all known simulation configuration problems" claims astonished researcher! Proceedings of Tcl/Tk Workshop, New Orleans, LA, June 1994. Available as Technical report HPL-CCD-94-11, Concurrent Computing Department, Hewlett-Packard Laboratories, Palo Alto, CA. [Greenawalt94] Paul M. Greenawalt. Modeling power management for hard disks. 2nd International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS '94) (Durham, NC), pages 62-66. IEEE Computer Society Press, 31 January-2 February 1994. [Herrin91] Eric H. Herrin II and Raphael A. Finkel. An implementation of service rebalancing. Technical report 191-91. University of Kentucky, Department of Math Sciences, July 1991. Proc. of the XI Intl. Conf. of the Chilean Computer Science Society, 1991. [HPKittyhawk92] Hewlett-Packard Company, Boise, Idaho. HP Kittyhawk Personal Storage Module: product brief, Part number 5091-4760E, 1992. [Jacobson91] David M. Jacobson and John Wilkes. Disk scheduling algorithms based on rotational position. Technical report HPL-CSP-91-7. Hewlett-Packard Laboratories, 24 February 1991. [Karn91] Phil Karn and Craig Partridge. Improving round-trip time estimates in reliable transport protocols. ACM Transactions on Computer Systems, 9(4):364-73, November 1991. [Lindsey81] William C. Lindsey and Chak Ming Chie. A survey of digital phase-locked loops. In William C. Lindsey, editor, Phase Locked Loops, pages 296-317. Institute of Electrical and Electronics Engineers, April 1981. [Litzkow88] Michael J. Litzkow, Miron Livny, and Matt W. Mutka. Condor-a hunter of idle workstations. Proceedings of 8th International Conference on Distributed Computing Systems (San Jose, CA), pages 104-11. IEEE Computer Society Press, 13-17 June 1988. [Marsh93] Brian Marsh, Fred Douglis, and P. Krishnan. Flash memory file caching for mobile computers. Technical report MITL-TR-59-93. Matsushita Information Technology Laboratory, Princeton, NJ, 18 June 1993. [Massalin89a] Henry Massalin and Calton Pu. Fine-grain scheduling. Proceedings of Workshop on Experience in Building Distributed and Multiprocessor Systems (Ft. Lauderdale, FL), pages 91-104. USENIX Association, October 1989. [McDonald89] M. Shane McDonald and Richard B. Bunt. Improving file system performance by dynamically restructuring disk space. Proceedings of Phoenix Conference on Computer and Cmm. (Scottsdale, AZ), pages 264-9. IEEE, 22-24 March 1989. [McVoy91] L. W. McVoy and S. R. Kleiman. Extent-like performance from a UNIX file system. Proceedings of Winter 1991 USENIX (Dallas, TX), pages 33-43, 21-25 January 1991. [Postel80a] J. Postel. Transmission Control Protocol, Technical report RFC-761. USC Information Sciences Institute, January 1980. [Rodeheffer91] Thomas L. Rodeheffer and Michael D. Schroeder. Automatic reconfiguration in Autonet. Proceedings of 13th ACM Symposium on Operating Systems Principles (Asilomar, CA). Published as Operating Systems Review, 25(5):183-97, 13-16 October 1991. [Rosenblum92] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26-52, February 1992. [Ruemmler91] Chris Ruemmler and John Wilkes. Disk shuffling. Technical report HPL-91-156. Hewlett-Packard Laboratories, October 1991. [Ruemmler93] Chris Ruemmler and John Wilkes. UNIX disk access patterns. Proceedings of Winter 1993 USENIX (San Diego, CA), pages 405-20, 25-29 January 1993. [Ruemmler94] Chris Ruemmler and John Wilkes. An introduction to disk drive modeling. IEEE Computer, 27(3):17-28, March 1994. [Seltzer90b] Margo Seltzer, Peter Chen, and John Ousterhout. Disk scheduling revisited. Proceedings of Winter 1990 USENIX Conference (Washington, D.C.), pages 313-23, 22-26 January 1990. [Seltzer93] Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin. An implementation of a log-structured file system for UNIX. Proceedings of Winter 1993 USENIX (San Diego, CA), pages 307-26, January 1993. [Shapiro92] Marc Shapiro, Peter Dickman, and David Plainfossé. SSP chains: robust, distributed references supporting acyclic garbage collection. Technical report 1799. INRIA, France, November 1992. [Vongsathorn90] Paul Vongsathorn and Scott D. Carson. A system for adaptive disk rearrangement. Software---Practice and Experience, 20(3):225-42, March 1990. [Wilkes92b] John Wilkes. Predictive power conservation. Technical report HPL-CSP-92-5. Concurrent Systems Project, Hewlett-Packard Laboratories, 14 February 1992. Author information Richard Golding received his M.S. degree in 1991 and his Ph.D. degree in 1992, both in Computer and Information Sciences from the University of California, Santa Cruz. Following a stint at the Vrije Universiteit Amsterdam, he joined Hewlett-Packard Laboratories as a researcher. His research interests include distributed operating systems, wide-area networking, and storage systems. Peter Bosch graduated with degrees in Electrical Engineering (1988), and Computer Science (M.Sc., 1994) from the University of Twente, Netherlands. He has worked since 1991 for the University of Twente as an research assistant. His current work involves research into file systems and implementation of a high performance file system for the ESPRIT Pegasus project. His hobbies are traveling and spending evenings in nice restaurants. Carl Staelin received his Ph.D. in Computer Science from Princeton University in 1992. He has worked since 1992 as a researcher for Hewlett-Packard Laboratories. As part HP's Berkeley Science Center he is currently working on the NOW and Mariposa projects at U.C. Berkeley. His current research interests include {high-performance, tertiary, distributed, reliable} storage systems, distributed systems, and electronic libraries. For fun he is also working to provide a library of binary packages of public domain software for HP workstations. Tim Sullivan has been a member of the technical staff at Hewlett-Packard Laboratories since 1984, when he was an undergraduate at Stanford University. He worked on software, workload characterization, and performance modeling tools for experimental I/O devices while obtaining his BS (1985) and MS (1988) degrees in Electrical Engineering. He spent two years at the Hewlett-Packard Laboratories Pisa Science Center in Pisa, Italy designing parallel languages with researchers at the University of Pisa. He has also done research on operating systems and databases for distributed and parallel systems and on high-performance, high-availability storage systems. John Wilkes graduated with degrees in Physics (BA 1978, MA 1980), and a Diploma (1979) and Ph.D. (1984) in Computer Science from the University of Cambridge. He has worked since 1982 as a project manager and researcher at Hewlett-Packard Laboratories in Palo Alto, CA. His current research area is high-performance, high-availability storage systems, and he has interests in multicomputer interconnects, operating system design, and performance modelling. He also enjoys learning about Renaissance art and architecture and interacting with the academic research community. The authors can be contacted through Richard Golding: golding@hpl.hp.com, Hewlett-Packard Laboratories, mailstop 1U13, PO Box 10490, Palo Alto, CA 94303-0969.