USENIX, The Advanced Computing Systems Association

OSDI '06 Paper

Pp. 117–130 of the Proceedings

Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management


Chad Verbowski, Emre Kıcıman, Arunvijay Kumar, Brad Daniels,

Shan Lu, Juhan Lee*, Yi-Min Wang, Roussi Roussev
Microsoft Research,Florida Institute of Technology, U. of Illinois at Urbana-Champaign, *Microsoft MSN

Abstract

Mismanagement of the persistent state of a system—all the executable files, configuration settings and other data that govern how a system functions—causes reliability problems, security vulnerabilities, and drives up operation costs.  Recent research traces persistent state interactions—how state is read, modified, etc.—to help troubleshooting, change management and malware mitigation, but has been limited by the difficulty of collecting, storing, and analyzing the 10s to 100s of millions of daily events that occur on a single machine, much less the 1000s or more machines in many computing environments.

We present the Flight Data Recorder (FDR) that enables always-on tracing, storage and analysis of persistent state interactions.  FDR uses a domain-specific log format, tailored to observed file system workloads and common systems management queries.  Our lossless log format compresses logs to only 0.5-0.9 bytes per interaction.  In this log format, 1000 machine-days of logs—over 25 billion events—can be analyzed in less than 30 minutes.  We report on our deployment of FDR to 207 production machines at MSN, and show that a single centralized collection machine can potentially scale to collecting and analyzing the complete records of persistent state interactions from 4000+ machines. Furthermore, our tracing technology is shipping as part of the Windows Vista OS.

 


1.     Introduction

Misconfigurations and other persistent state (PS) problems are among the primary causes of failures and security vulnerabilities across a wide variety of systems, from individual desktop machines to large-scale Internet services.  MSN, a large Internet service, finds that, in one of their services running a 7000 machine system, 70% of problems not solved by rebooting were related to PS corruptions, while only 30% were hardware failures.  In [24], Oppenheimer et al. find that configuration errors are the largest category of operator mistakes that lead to downtime in Internet services.  Studies of wide-area networks show that misconfigurations cause 3 out of 4 BGP routing announcements, and are also a significant cause of extra load on DNS root servers [4,22]. Our own analysis of call logs from a large software company’s internal help desk, responsible for managing corporate desktops, found that a plurality of their calls (28%) were PS related.[1]  Furthermore, most reported security compromises are against known vulnerabilities—administrators are wary of patching their systems because they do not know the state of their systems and cannot predict the impact of a change [1,26,34].

PS management is the process of maintaining the “correctness” of critical program files and settings to avoid the misconfigurations and inconsistencies that cause these reliability and security problems.  Recent work has shown that selectively logging how processes running on a system interact with PS (e.g., read, write, create, delete) can be an important tool for quickly troubleshooting configuration problems, managing the impact of software patches, analyzing hacker break-ins,  and detecting malicious websites exploiting web browsers [17,35-37].  Unfortunately, each of these techniques is limited by the current infeasibility of collecting and analyzing the complete logs of 10s to 100s of millions of events generated by a single machine, much less the 1000s of machines in even a medium-sized computing and IT environments.

There are three desired attributes in a tracing and analysis infrastructure.  First is low performance overhead on the monitored client, such that it is feasible to always be collecting complete information for use by systems management tools.  The second desired attribute is an efficient method to store data, so that we can collect logs from many machines over an extended period to provide a breadth and historical depth of data when managing systems.  Finally, the analysis of these large volumes of data has to be scalable, so that we can monitor, analyze and manage today’s large computing environments.  Unfortunately, while many tracers have provided low-overhead, none of the state-of-the-art technologies for “always-on” tracing of PS interactions provide for efficient storage and analysis.

We present the Flight-Data Recorder (FDR), a high-performance, always-on tracer that provides complete records of PS interactions.  Our primary contribution is a domain-specific, queryable and compressed log file format, designed to exploit workload characteristics of PS interactions and key aspects of common-case queries—primarily that most systems management tasks are looking for “the needle in the haystack,” searching for a small subset of PS interactions that meet well-defined criteria.  The result is a highly efficient log format, requiring only 0.47-0.91 bytes per interaction, that supports the analysis of 1000 machine-days of logs, over 25 billion events, in less than 30 minutes.

We evaluate FDR’s performance overhead, compression rates, query performance, and scalability.  We also report our experiences with a deployment of FDR to monitor 207 production servers at various MSN sites.  We describe how always-on tracing and analysis improve our ability to do after-the-fact queries on hard-to-reproduce incidents, provide insight into on-going system behaviors, and help administrators scalably manage large-scale systems such as IT environments and Internet service clusters.

In the next section, we discuss related work and the strengths and weaknesses of current approaches to tracing systems.  We present FDR’s architecture and log format design in sections 3 and 4, and evaluate the system in Section 5.  Section 6 presents several analysis techniques that show how PS interactions can help systems management tasks like troubleshooting and change management.  In Section 7, we discuss the implications of this work, and then conclude.

Throughout the paper, we use the term PS entries to refer to files and folders within the file system, as well as their equivalents within structured files such as the Windows Registry.  A PS interaction is any kind of access, such as an open, read, write, close or delete operation.

2.     Related Work

In this section, we discuss related research and common tools for tracing system behaviors.  We discuss related work on analyzing and applying these traces to solve systems problems in Section 6.  Table 1 compares the log-sizes and performance overhead of FDR and other systems described in this section for which we had data available [33,11,21,20,40].

 

Table 1: Performance overhead and log sizes for related tracers.

 

Performance Overhead

Log size

(B/event)

Log Size

(MB/machine-day)

FDR

<1%

0.7

20MB

VTrace

5-13%

3-20

N/A

Vogel

0.5%

105

N/A

RFS

<6%

N/A

709MB

ReVirt

0-70%

N/A

40MB-1.4GB

Forensix

6-37%

N/A

450MB

The tools closest in mechanics to FDR are file system workload tracers.  While, to our knowledge, FDR is the first attempt to analyze PS interactions to improve systems management, many past efforts have analyzed file system workload traces with the goal of optimizing disk layout, replication, etc. to improve I/O system performance [3,9,12,15,25,28,29,33]. Tracers based on some form of kernel instrumentation, like FDR and DTrace [30], can record complete information.  While some tracers have had reasonable performance overheads, their main limitation has been a lack of support for efficient queries and the large log sizes.  Tracers based on s

niffing network file system traffic, such as NFS workload tracers [12,29] avoid any client-perceived performance penalties, but sacrifice visibility into requests satisfied by local caches as well as visibility of the process making a request.

Developer tools for tracing program behavior, such as strace, Filemon Regmon and similar tools, for recording system call traces and file system interactions on Linux, Unix and Windows systems have high performance overheads, as well as log formats more suited to manual inspection than automated querying.  These tools are generally more suited to selective debugging, rather than always-on monitoring.

Complete versioning file systems, such as CVFS [31] and WayBack [8] record separate versions of files for every write to the file system.  While such file systems have been used as a tool in configuration debugging [39], they do not capture file reads, or details of the processes and users that are changing files.  The Repairable File Service (RFS) logs file versioning information and also tracks information-flow through files and processes to analyze system intrusions [40]. 

In [33], Vogels declares analysis of his 190M trace records to be a “significant problem,” and uses data warehousing techniques to analyze his data.  The Forensix project, tracing system calls, also records logs in a standard database to achieve queryability [13].  However, Forensix’s client-side performance overhead and their query performance (analyzing 7 machine-days of logs in 8-11 minutes) make it an unattractive option for large-scale production environments.

A very different approach to tracing a system’s behavior is to record the nondeterministic events that affect the system, and combine this trace with virtual machine-based replay support.  While this provides finer-grained and more detailed information about all the behaviors of a system than does FDR, this extra information can come at a high cost: ReVirt reports workload-dependent slowdowns up to 70% [11].  More significantly, arbitrary queries are not supported without replaying the execution of the virtual machine, taking time proportional to its original execution.

While, to our knowledge, we are the first to investigate domain-specific compression techniques for PS interaction or file system workload traces, there has been related work in the area on optimizing or compressing program CPU instruction traces [5,19], as well as work to support data compression within general-purpose databases [6].

3.     Flight Data Recorder Architecture

In this section, we present our architecture and implementation for black-box monitoring, collecting, and analysis of PS interactions.  Our architecture consists of (1) a low-level driver that intercepts all PS interactions with the file system and the Windows Registry, calls to the APIs for process creation and binary load activity, and exposes an extensibility API for receiving PS interaction events from other specialized stores; and (2) a user mode daemon that collects and compresses the trace events into log files and uploads them to a central server, (3) a central server that aggregates the log files and, (4) an extensible set of query tools for analyzing the data stream.  Our implementation does not require any changes to the core operating system or applications running atop it. We provide detailed discussion of our domain-specific queryable log format in Section 4.

3.1     FDR Agent Kernel-Mode Driver

Our low-level instrumentation is handled by a kernel mode boot driver[2], which operates in real-time and, for each PS interaction, records the current timestamp, process ID, thread ID, user ID, interaction type (read, write, etc.), and hashes of data values where applicable.  For accesses to the file system, the driver records the path and filename, whether the access is to a file or a directory and, if applicable, the number of bytes read or written.  For accesses to the registry, the driver records the name and location of the registry entry as well as the data it contains.  The driver sits above the file system cache, but below the memory mapping manager.  This driver also records process tree information, noting when a binary module is loaded, or when a process spawns another.

The largest performance impact from the driver stems from I/O related to log writing, memory copies related to logging events, and latency introduced by doing this work on the calling application’s thread. We mitigate this by only using the application’s thread to write the relevant records directly into the user-mode daemon’s memory space, and doing the processing on the user-mode daemon’s thread.  Caches for user names and file names that need to be resolved for each interaction also help to minimize lookup costs.

Our kernel driver is stable and suitable for use in production environments, and will be available for public use as part of Windows Vista.

3.2     FDR Agent User-Mode Daemon

The user-mode daemon is responsible for receiving records of PS interactions from the kernel driver, compressing them into our log format in-memory, and periodically uploading these logs to a central server. 

To avoid impacting the performance of the system, we configure our daemon to run at lowest-priority, meaning it will be scheduled only if the CPU is otherwise idle.  If the daemon does fall behind, the driver can be configured to either block until space is available or drop the event.  However, in practice, we have found that a 4MB buffer is sufficient to avoid any loss on even our busiest server machines.

The daemon throttles its overall memory usage by monitoring the in-memory compressed log size, and flushing this to disk when it reaches a configurable threshold (typically 20MB to 50MB). The daemon will also periodically flush logs to disk to ensure reliable log collection in the event of agent or system failure.   These logs are uploaded to a central server using a standard SMB network file system protocol.  If a failure occurs during upload the daemon will save the log locally and periodically retry the upload.

The daemon also manages its own operation, for example, by automatically update its binaries and configuration settings when indicated on the central server, and monitoring its disk space and memory usage.  Setting up FDR tracing on a new machine is simple:  a user only needs to run a single binary on the machine and configure the log upload location.

3.3     FDR Collection Server

The collection server is responsible for organizing FDR log files as they are uploaded, triggering relevant query tools to analyze the files as they arrive, and pruning old log files from the archive.  It also sets the appropriate access privileges and security on the collected files and processed data.

3.4     FDR Query Tools

The final pieces of our framework are the query tools that analyze log files as they arrive.  Each query tool is specialized to answer a specific type of query for a systems management task.  Simple example queries include “what files were modified today?”, or “which programs depend on this configuration setting?”  As all our log files are read-only, we do not require complicated transactional semantics or other coordination between our query tools.  Each query tool reads the log files it is interested in scanning and implements its own query plan against the data within.  While future work might investigate benefits of caching, sharing intermediate results across multiple concurrent queries, or other optimization techniques from the database literature, we found that allowing uncoordinated reads simplified the process of building new query tools as required.

4.     Designing the Log Format

The key requirements we have for FDR’s log format are that 1) logs are compact, so that their size does not overly burden client resources, network bandwidth or server-side scalability; and 2) the log format efficiently supports common-case queries.  To meet these requirements, we built a preliminary version of FDR with a straightforward, flat format, and collected 5000 machine-days of traces from a wide variety of machines.  We can personally attest to the difficulty of collecting, storing and analyzing this scale of data without support for compression and queryability.  Based on our analysis of these traces, and a survey of how previous work applies such traces to systems management tasks, we designed an optimized log file format that takes advantage of three aspects of PS interaction workloads that we saw across our collected traces.

First, most PS interactions repeat many times during a day—93-99% of daily activity is a duplicate of an earlier event.  For queries that care only about what happened, rather than when or how often, we can improve query performance by separating the definitions of this small number of distinct interactions from the details of when they occur.

Secondly, we observe that PS interactions are highly bursty, with many interactions occurring almost simultaneously and long idle periods between bursts.  This allows us to save significant storage space by amortizing timestamp information across a burst.

Finally, we find that sequences of PS interactions are also highly repetitious; if we see a sequence of PS reads and writes, we are very likely to see the same sequence again in the future.  This leads us to apply standard compression schemes to the time-ordered traces of PS interactions, achieving a high compression rate.

In the rest of this section, we describe relevant attributes of common-case queries, present the results and implications of our survey of PS interaction traces, and then describe the details of our log format.

4.1     Common Queries

Today, systems administrators deal with large-scale, complicated systems.  According to surveys [9,28,33,36], an average Windows machine has 70k files and 200k registry settings.  Faced with the task of managing these systems, a systems administrator’s job is often a problem of “finding the needle in the haystack.”  For example, troubleshooting is the task of finding the few configuration settings or program files that are causing a problem; and to test a software upgrade or patch, the administrator needs to know what subset of the system might be affected by the change.  To be useful, FDR must help systems administrators quickly identify the small set of relevant state and events out of all the state existing and events occurring across the many machines of a computing or IT environment.  We describe the details of how systems management tasks use PS interaction traces in Section 6.  Here, we briefly describe the aspects of common-case queries that informed our log format design.

We find that most common systems management queries of PS interaction traces search for a subset of events, identified by the responsible process or user, the file or registry entry being accessed, or another aspect of the interaction (“Who changed this configuration?” or “What did I change yesterday?”).  This means that, by organizing or indexing our log format around such attributes, we can quickly identify the subset of interactions of interest.  Common queries are also often restricted by time range, looking only at events that occurred during a specific period, implying that our logs should support random access over time, not just sequential access.

Many systems management tasks only involve the existence (or absence) of a particular PS interaction, and not when or how often the interaction occurred.  For example, finding all loads of a shared library, regardless of when they occurred, can identify the processes that depend on that library and help assess the impact of a software upgrade.  Other times, queries do care about when a PS interaction occurred, but only need to know an interaction’s relative-ordering vis-à-vis other PS interactions on a given thread, e.g., to determine potential causalities like loading a binary after reading its name from the Registry.  In both cases, the implication is that some queries need not read timestamps at all.

4.2     PS Workloads and Log Optimizations

For our survey, we monitored the PS interactions of over 324 machines during one year across a variety of computing environments and collected over 5000 machine-days of PS interactions in total.  We worked with MSN to instrument 207 of their machines, across 4 different services with different workloads, including CPU-bound systems with heavy disk workloads, a large storage service for external users, and web notifications publish/subscribe service.  In our own research lab, we monitored 72 laboratory machines used for various data collection, analysis and simulation experiments.  We also monitored 35 corporate desktops and laptops, used by researchers and engineers, primarily for activities such as software development and word processing.  Finally, we monitored 7 home machines, used for entertainment and work-related activities by researchers, engineers, and their families.  As a control, we also collected traces from 3 idle systems, running within virtual machines with no user workload.

4.2.1     Scale and repeated interactions

The primary challenge to efficiently tracing the PS interactions of a machine is the volume of events that occur.  In our survey, we found that the average number of daily PS interactions was O(107) ranging from 9M on desktop machines to 70M on the busiest workloads, as shown in Table 2.  Not surprisingly, servers tended to have a stable workload from day-to-day, while our lab, corporate desktop and home machines had varied workloads. The highest number of daily interactions we saw was 264M events, on an MSN server that collected application logs from 1000s of other machines.

Table 2: The average per machine daily total  and distinct interactions, entries, and processes

Environment

Total PS Interactions

Distinct PS Interactions

Distinct PS Entries  Accessed

Distinct

Processes

Svc. 1

70M

0.2%

>0.1%

40-60

Svc. 4

29M

3.3%

0.4%

30-70

Svc. 2

22M

0.6%

0.1%

30

Svc. 3

19M

1.1%

0.1%

30-70

Home

17M

4.2%

0.6%

30-40

Desktop

9M

5.4%

1.0%

20-100

Lab

9M

1.5%

0.3%

17-40

Average

25M

1.6%

0.2%

43

Idle

965K

2.1%

0.5%

14

However, we found several factors that mitigate the large volume of PS interactions in all these workloads.  First, the number of distinct files and registry entries read or written every day is much smaller than the total number of interactions.  Secondly, the number of distinct processes that run on each machine is very small, O(102) processes on the busiest desktops, and fewer on production servers.  Overall, we found that most PS entries are only accessed by a small number of processes, and that the total number of distinct interactions (i.e., distinct <user, process, operation-type, PS entry> tuples) was O(105), only 0.2% to 5.4% of the total interactions per day.

This implies that we can improve the performance of queries not interested in timestamps or counts of PS interaction occurrences by separating the unique definitions of observed interactions from the time-ordered traces of when they occur.  Effectively, this allows many common queries to ignore 94.6-99.8% of the log. This also provides the possibility of compressing our logs, by replacing repeated descriptions of an interaction with a single unique ID.

4.2.2     Bursts of Activity

Several studies of I/O traffic and file system activities have shown that server and desktop I/O workloads demonstrate bursty or self-similar behavior [14,16].  We observe this in our traces as well, where it manifests as many interactions arriving together with long idle periods in between.

The primary implication of these bursts for our log format is that, when many events occur together, there is a clear opportunity to merge their associated time information, storing a single timestamp for all the events that occur during the same timestamp bucket.  This is a significant observation because per-event timestamps are a major limiting factor to achieving high compression rates.  To help us choose an appropriate bucket duration, we look to the requirements of common-case systems management queries.  We find that fine-grained timestamps are rarely necessary, instead what is most important is the relative ordering of events and the ability to map event occurrences to human activities (i.e., wall-clock time).  This leads us to choose a relatively coarse-grained 48-bit or 6ms granularity timestamp.  Note that this still provides a granularity finer than Windows’ time-scheduling quantum of 10-15ms.  While one might worry that a coarse-grained timestamp would mean every bucket would have at least one event in it, in practice, even our busiest observed machine-day, with 264M daily events, showed no PS interactions during 60% of its timestamp buckets.  Of course, this does not mean the machine as a whole was idle—it might have been busy with CPU calculations during the times it was not doing PS interactions.

4.2.3     Repeated Sequences of Interactions

Our final key observation is that many sequences of PS interactions repeat over time.  This is not a surprise, as we would expect that most of the file system and registry activities performed by a system are standard, repetitive tasks, including process start-up and shutdown, background activities, document auto-saves, and logging.  We perform a detailed analysis of repeating “activity bursts,” in [32] and, for space considerations, provide only a summary here.

In our analysis in [32], we define an “activity burst” as the set of PS interactions occurring in one thread, where each interaction occurs no more than some small time separation apart.  Formally, we define an activity burst as a group of events {e| i  t  j} occurring within a single thread, where gap(et,et+1) < k, for all i ≤ t < j; gap(ei-1,ei) ≥ k; gap(ej,ej+1) ≥ k; gap(x,y) is the time between two interactions x and y; and k is the threshold gap between bursts.  We call an activity burst a “repeat” if it is identical to an earlier activity burst in every attribute of its interactions except for timestamps.  Otherwise, we call it a “distinct” burst.  In our survey, we find that most activity bursts in a day are repeated bursts.  On desktops, we see 2K-5K distinct bursts out of 20K-40K total and, on servers, we see 3K-4K distinct bursts out of 40K-70K.

This repetition of PS interaction sequences indicates that simple byte compression schemes, applied to the time-ordered event sequences, should detect and compress these repeating patterns. Since our analysis of activity burst repetition focuses on bursts within a single-thread, storing PS interactions in a system-wide timestamp sequence runs the risk of allowing concurrent I/O from multiple threads to interfere with the compressibility of each other’s patterns.  However, because of the relatively large CPU time slice of 10-15ms on the Windows OS, and the knowledge that most PS interactions are handled quickly by file caches, we still expect to gain significant compression of repeating patterns in a cross-thread trace of PS interactions.

4.3     Log Format Details

Based on the machines observed in our survey, our log format has two key facets to its logical design, shown in Figure 1.  First, given the repetition of accessed files, observed processes, etc., we normalize the attributes of our persistent state interactions into separate sections, essentially following a standard procedure of database normalization.[3] We create one section for distinct PS interactions, which point to other sections containing distinct names, user context, process information, file data hashes and values.  The primary benefit of this normalization is a reduction of repetitive information.  In addition, we find that grouping attributes into their own sections improves the performance of byte compression algorithms as we compress these log sections later.  This separation also improves query efficiency by transforming queries that involve multiple attributes and expensive string comparisons into inexpensive comparisons of integer IDs as we scan through traces of PS interactions.

Text Box:  
 Figure 4: Time taken for each query compared with the number of attributes scanned

Figure 1: Logical design of our log format

 

Our second design choice in our logical log format is to represent the trace of events itself as two parallel, but connected, streams of data.  The first stream is an ordered list of events as they are captured by our kernel driver and reported to the user daemon.  The second stream contains timestamp information for groups of events.  This amortizes the size of timestamp information across multiple events, reducing the overall size of the log, as well as improving byte compression of the event stream by better exposing patterns in the sequences of events.  Both logical streams are stored in a single physical file to ease management of log archives.

We created a file structure that contains a large logical address space split into 32 sections.  Each of the normalized attribute sections, the section of distinct PS interactions, as well as the ordered event stream and timestamp stream, are mapped to a section in our log file.  Each section is composed of individually compressed 64k pages.  Compressing in blocks, rather than using a streaming compression format allows random access within a data section.

To simultaneously optimize our log for random access and compression, our physical log file layout consists of a three-layer addressing scheme of block table, block number, and block offset, shown in Figure 2.  This three-layer addressing scheme is important because we want to compress individual blocks and store the start and end offsets of each block in a table for fast lookup, and as a further optimization, compress these tables as well.  With 448 block tables, 8192 blocks per table and a 64k uncompressed block size, this provides a maximum addressable storage size of 234 GB of uncompressed data within each log file.  While we find this is many times more than a single machine-day of logs, this format gives us the flexibility of joining many days of logs into a single file for improved compression, and gives us flexibility if PS interaction workloads grow in the future.

Text Box:  
Figure 2: The physical layout of our log format

Figure 2: The physical layout of our log format.

Each log file starts with a 4k uncompressed header.  The first 256 bytes consist of versioning and other miscellaneous information.  Next are 32 section descriptions, each 8 bytes long.  Each of the logical sections, described earlier, is laid out contiguously over our three-layer addressing scheme, aligned at block boundaries.  These section descriptors provide the block table and number of the start and end of each section.  The rest of the 4k header is filled by 448 block table descriptors, that point to the start and end offsets of a compressed block table.  The block table, in turn, contains 8192 block entries, each pointing to the start and end offset of a compressed 64k block.

The timestamp section is maintained as a 16 byte entry containing 6 bytes (48 bits) to represent a 6ms time resolution, 2 bytes to count missing events within that region, and two 4 byte offsets pointing to the first and last consecutive event with that time resolution.  While almost all events are received by the daemon in time sorted order, we correctly handle timestamp information for any events that appear out of order. This can happen when a context switch occurs just after an I/O activity completes, but before the kernel driver reports it, but this delays the reporting of an event by a few scheduling quantums, and never affect the intra-thread ordering of PS interactions.

The user daemon first creates log files in-memory.  As it receives raw events from the kernel driver, the daemon normalizes the events, replacing attribute values with indexes into the appropriate sections.  The normalized event is then compared to the table of distinct normalized events using an O(1) hash lookup, and added to the table if necessary.  Finally, the normalized event is added to the ordered event stream section, along with new timestamp information, if necessary.  Each of the log file’s data sections is append-only in memory.  When a log file is closed and flushed to disk the daemon writes each data section contiguously to disk while applying a standard compression algorithm to each 64K-byte block.

4.4     Querying Logs

When analyzing these log files, our tools tend to restrict their queries based on one or more attributes in the PS interaction record, based on a time-range of interest, or based on both.  To restrict a query by attribute, a query tool scans the appropriate section, looking for all values matching the given criteria.  From these values, the tool then generates a filter to apply against the section of distinct PS interactions, resulting in a set of unique IDs, one for each PS interaction matching the original attribute restriction.  For example, to return only PS interactions that access a particular file, a tool would first scan the string section to find the ID of the filename, and then scan the section of distinct PS interactions to find the IDs of all distinct PS interactions that accessed this filename ID.  If a tool is not interested in when or how many times an interaction occurred then it can stop here, without scanning the much larger event stream sections.  Otherwise, the tool can scan through the ordered event list and timestamp stream to find the details of the occurrences of these PS interactions.

To restrict a query by a time range, a query tool applies a binary search to the timestamp stream, searching for the start of the desired time range.  Once this timestamp is found, it can skip to the appropriate 64K block of the ordered list of events, and begin scanning the ordered list from that point on, until the end of the time range.

Common-case queries tend to extract sparse information from the extremely large data set of PS interactions.  Our log format enables efficient queries by allowing query tools to focus on the relevant subsets of data, and expanding their scope to larger and larger portions of the data as necessary.  For example, a query to find a list of all files modified during a day of 25M PS interactions requires only one pass over a distinct event table with 318Kentries to identify the string attribute id of modified files, and then scanning over the string attribute section with 100K entries to discover the full filenames of each modified file, avoiding ever scanning the full list of 25M events that occurred during the day.

5.     Experimental Evaluation

In this section, we evaluate FDR and find that on average, our logs use only 0.7 bytes/PS interaction.  We find that query performance scales linearly with the number of events in a log. All of our queries can be processed in a single pass against an entire machine-day of logs in just 3.2 seconds.  FDR’s client-side performance overhead is less than 1%, and calculating the load on bottleneck resources in our central server indicates that a single machine could scale to collecting and analyzing all the PS interactions from 4300 machines, keeping all logs for over 3 months.

5.1     Log File Compression

The efficiency of FDR’s log file format affects both the runtime load on clients’ memory and network, and the long-term storage costs of PS interaction logs.

In our survey of 5000 machine-days of logs, described in Section 4.2, the average raw event size is 140 bytes and daily per machine raw logs are 7GB, compressing to 700MB with GZIP. After converting our collected logs to our new format, we found, depending on workload, each PS interaction takes between 0.5 to 0.9 bytes of storage, and 0.7 bytes on average.  One machine-day of PS interactions can be stored in 6 to 71MB, 20MB on average, with a maximum observed machine-day size of 179MB. Table 3 shows the results across our environments.

Table 3: Daily storage requirements for machines across environments

Role

Avg. Bytes/Event

Avg. MB/day

Max MB/day

Svc. 1

0.91

71

179

Svc. 4

0.78

57

103

Svc. 2

0.71

19

22

Svc. 3

0.66

10

53

Home

0.58

17

51

Desktop

0.80

13

21

Lab

0.47

6

43

Average

0.70

20

49

Idle

0.85

0.76

0.88


In addition to PS interaction workload, the efficiency of our log format’s compression is sensitive to the frequency at which the logs are uploaded to the central server.  More frequent uploads of logs reduces the latency between when an event occurs and when it can be analyzed.  However, uploading logs less frequently allows the logs to collect more repeated events and achieve better overall compression.

 

Figure 3: Compression Ratio variation with log file interval for each category of machines

Figure 3: Compression Ratio variation with log file interval for each category of machines

 

Figure 3 shows how storage size varies with the event collection period. It shows that 80% of the compression efficiency is typically achieved with log files an hour long, and that improvement ceases after 1 week of data.  Based on this result, our current deployments upload log files every 2 hours, and our central server reprocesses and merges them into 1-week long log files to save long-term storage space.  We envision that future versions of FDR will support latency-critical queries by moving them to the client-side agent.

Table 4: Average section size as a percentage of total log file size for each machine role

Role

Event

Time

Reg.

File

String

Data

Other

Svc 1

77%

19%

1%

0.3%

0.2%

1.9%

0.6%

Svc 4

57%

17%

7%

13.9%

1.8%