Misconfigurations and other persistent state (PS) problems
are among the primary causes of failures and security vulnerabilities across a wide
variety of systems, from individual desktop machines to large-scale Internet
services. MSN, a large Internet service, finds that, in one of their services
running a 7000 machine system, 70% of problems not solved by rebooting were
related to PS corruptions, while only 30% were hardware failures. In [24], Oppenheimer
et al. find that configuration errors are the largest category of
operator mistakes that lead to downtime in Internet services. Studies of
wide-area networks show that misconfigurations cause 3 out of 4 BGP routing
announcements, and are also a significant cause of extra load on DNS root
servers [4,22]. Our own analysis of call logs from a large software company’s
internal help desk, responsible for managing corporate desktops, found that a
plurality of their calls (28%) were PS related.
Furthermore, most reported security compromises are against known
vulnerabilities—administrators are wary of patching their systems because they
do not know the state of their systems and cannot predict the impact of a
change [1,26,34].
PS management is the process of maintaining the
“correctness” of critical program files and settings to avoid the
misconfigurations and inconsistencies that cause these reliability and security
problems. Recent work has shown that selectively
logging how processes running on a system interact with PS (e.g., read, write,
create, delete) can be an important tool for quickly troubleshooting
configuration problems, managing the impact of software patches, analyzing
hacker break-ins, and detecting malicious websites exploiting web browsers
[17,35-37]. Unfortunately, each of these techniques is limited by the current
infeasibility of collecting and analyzing the complete logs of 10s to 100s of
millions of events generated by a single machine, much less the 1000s of
machines in even a medium-sized computing and IT environments.
There are three desired attributes in a tracing and analysis
infrastructure. First is low performance overhead on the monitored client,
such that it is feasible to always be collecting complete information for use
by systems management tools. The second desired attribute is an efficient
method to store data, so that we can collect logs from many machines over an
extended period to provide a breadth and historical depth of data when managing
systems. Finally, the analysis of these large volumes of data has to be
scalable, so that we can monitor, analyze and manage today’s large computing
environments. Unfortunately, while many tracers have
provided low-overhead, none of the state-of-the-art technologies for
“always-on” tracing of PS interactions provide for efficient storage and
analysis.
We present the Flight-Data Recorder (FDR), a high-performance,
always-on tracer that provides complete records of PS interactions. Our
primary contribution is a domain-specific, queryable and compressed log file
format, designed to exploit workload characteristics of PS interactions and key
aspects of common-case queries—primarily that most systems management tasks are
looking for “the needle in the haystack,” searching for a small subset of PS
interactions that meet well-defined criteria. The result is a highly efficient
log format, requiring only 0.47-0.91 bytes per interaction, that supports the
analysis of 1000 machine-days of logs, over 25 billion events, in less than 30
minutes.
We evaluate FDR’s performance
overhead, compression rates, query performance, and scalability. We
also report our experiences with a deployment of FDR to monitor 207 production
servers at various MSN sites. We describe how always-on tracing and
analysis improve our ability to do after-the-fact queries on hard-to-reproduce
incidents, provide insight into on-going system behaviors, and help
administrators scalably manage large-scale systems such as IT environments and
Internet service clusters.
In the next section, we discuss related work and the
strengths and weaknesses of current approaches to tracing systems. We present
FDR’s architecture and log format design in sections 3 and 4, and evaluate the system in Section 5. Section 6 presents several analysis techniques that show
how PS interactions can help systems management tasks like troubleshooting and
change management. In Section 7, we discuss the implications
of this work, and then conclude.
Throughout the paper, we use the term PS entries to
refer to files and folders within the file system, as well as their equivalents
within structured files such as the Windows Registry. A PS interaction
is any kind of access, such as an open, read, write, close or delete operation.
2.
Related Work
In this section, we discuss related research and common tools for tracing
system behaviors. We discuss related work on analyzing and applying these
traces to solve systems problems in Section 6. Table 1 compares the log-sizes and performance overhead of FDR and other systems
described in this section for which we had data available [33,11,21,20,40].
The tools closest in mechanics to FDR are file system
workload tracers. While, to our knowledge,
FDR is the first attempt to analyze PS interactions to improve systems
management, many past efforts have analyzed file system workload traces with
the goal of optimizing disk layout, replication, etc. to improve I/O system
performance [3,9,12,15,25,28,29,33]. Tracers based on
some form of kernel instrumentation, like FDR and DTrace [30], can record
complete information. While some tracers have had reasonable performance
overheads, their main limitation has been a lack of support for efficient
queries and the large log sizes. Tracers based on
s
niffing network file system traffic, such as NFS workload tracers [12,29]
avoid any client-perceived performance penalties, but sacrifice visibility into
requests satisfied by local caches as well as visibility of the process making
a request.
Developer tools for tracing program
behavior, such as strace, Filemon Regmon and similar tools, for recording
system call traces and file system interactions on Linux, Unix and Windows
systems have high performance overheads, as well as log formats more suited to
manual inspection than automated querying. These tools are generally more
suited to selective debugging, rather than always-on monitoring.
Complete versioning file systems,
such as CVFS [31] and WayBack [8] record
separate versions of files for every write to the file system. While such file systems have been used as a tool in
configuration debugging [39], they do not
capture file reads, or details of the processes and users that are
changing files. The Repairable File Service (RFS) logs file versioning
information and also tracks information-flow through files and processes to
analyze system intrusions [40].
In [33], Vogels declares analysis of his 190M trace records
to be a “significant problem,” and uses data warehousing techniques to analyze
his data. The Forensix project, tracing system calls, also records logs in a
standard database to achieve queryability [13]. However, Forensix’s client-side
performance overhead and their query performance (analyzing 7 machine-days of
logs in 8-11 minutes) make it an unattractive option for large-scale production
environments.
A very different approach to tracing
a system’s behavior is to record the nondeterministic events that affect the system, and combine this trace with virtual machine-based replay support. While this provides finer-grained
and more detailed information about all the behaviors of a system than does
FDR, this extra information can come at a high cost: ReVirt reports workload-dependent
slowdowns up to 70% [11]. More significantly, arbitrary queries are not
supported without replaying the execution of the virtual machine, taking time
proportional to its original execution.
While, to our knowledge, we are the first to investigate
domain-specific compression techniques for PS interaction or file system
workload traces, there has been related work in the area on optimizing or
compressing program CPU instruction traces [5,19], as well as work to support data
compression within general-purpose databases [6].
In this section, we present our architecture and
implementation for black-box monitoring, collecting, and analysis of PS
interactions. Our architecture consists of (1) a low-level
driver that intercepts all PS interactions with the file system and the
Windows Registry, calls to the APIs for process creation and binary load
activity, and exposes an extensibility API for receiving PS interaction events
from other specialized stores; and (2) a user mode daemon that collects and
compresses the trace events into log files and uploads them to a central
server, (3) a central server that aggregates the log files and, (4) an
extensible set of query tools for analyzing the data stream. Our
implementation does not require any changes to the core operating system or
applications running atop it. We provide detailed discussion of our
domain-specific queryable log format in Section 4.
3.1
FDR Agent Kernel-Mode Driver
Our low-level instrumentation is handled by a kernel mode boot
driver,
which operates in real-time and, for each PS interaction, records the current timestamp,
process ID, thread ID, user ID, interaction type (read, write, etc.), and hashes
of data values where applicable. For accesses to the file system, the driver
records the path and filename, whether the access is to a file or a directory
and, if applicable, the number of bytes read or written. For accesses to the
registry, the driver records the name and location of the registry entry as
well as the data it contains. The driver sits above the file system cache, but
below the memory mapping manager. This driver also records process tree
information, noting when a binary module is loaded, or when a process spawns
another.
The largest performance impact from the driver stems from I/O
related to log writing, memory copies related to logging events, and latency
introduced by doing this work on the calling application’s thread. We mitigate this
by only using the application’s thread to write the relevant records directly
into the user-mode daemon’s memory space, and doing the processing on the
user-mode daemon’s thread. Caches for user names and file names that need to
be resolved for each interaction also help to minimize lookup costs.
Our kernel driver is stable and suitable for use in
production environments, and will be available for public use as part of
Windows Vista.
3.2
FDR Agent User-Mode Daemon
The user-mode daemon is responsible for receiving records of
PS interactions from the kernel driver, compressing them into our log format
in-memory, and periodically uploading these logs to a central server.
To avoid impacting the performance of the system, we
configure our daemon to run at lowest-priority, meaning it will be scheduled
only if the CPU is otherwise idle. If the daemon does fall behind, the driver can
be configured to either block until space is available or drop the event.
However, in practice, we have found that a 4MB buffer is sufficient to avoid any
loss on even our busiest server machines.
The daemon throttles its overall memory usage by monitoring the
in-memory compressed log size, and flushing this to disk when it reaches a
configurable threshold (typically 20MB to 50MB). The daemon will also
periodically flush logs to disk to ensure reliable log collection in the event
of agent or system failure. These logs are uploaded to a central server using
a standard SMB network file system protocol. If a failure occurs during upload
the daemon will save the log locally and periodically retry the upload.
The daemon also manages its own operation, for example, by automatically
update its binaries and configuration settings when indicated on the central
server, and monitoring its disk space and memory usage. Setting up FDR tracing
on a new machine is simple: a user only needs to run a single binary on the machine
and configure the log upload location.
3.3
FDR Collection Server
The collection server is responsible for organizing FDR log
files as they are uploaded, triggering relevant query tools to analyze the
files as they arrive, and pruning old log files from the archive. It also sets
the appropriate access privileges and security on the collected files and
processed data.
3.4 FDR Query Tools
The final pieces of our framework
are the query tools that analyze log files as they arrive. Each query tool is specialized to answer a specific type
of query for a systems management task. Simple example queries include “what
files were modified today?”, or “which programs depend on this configuration
setting?” As all our log files are read-only, we do not require complicated
transactional semantics or other coordination between our query tools. Each
query tool reads the log files it is interested in scanning and implements its
own query plan against the data within. While future work might investigate
benefits of caching, sharing intermediate results across multiple concurrent
queries, or other optimization techniques from the database literature, we
found that allowing uncoordinated reads simplified the process of building new
query tools as required.
The key requirements we have for FDR’s log format are that
1) logs are compact, so that their size does not overly burden client
resources, network bandwidth or server-side scalability; and 2) the log format
efficiently supports common-case queries. To meet these requirements, we built
a preliminary version of FDR with a straightforward, flat format, and collected
5000 machine-days of traces from a wide variety of machines. We can personally
attest to the difficulty of collecting, storing and analyzing this scale of
data without support for compression and queryability. Based on our analysis
of these traces, and a survey of how previous work applies such traces to
systems management tasks, we designed an optimized log file format that takes
advantage of three aspects of PS interaction workloads that we saw across our
collected traces.
First, most PS interactions repeat many times during a
day—93-99% of daily activity is a duplicate of an earlier event. For queries
that care only about what happened, rather than when or how often, we can
improve query performance by separating the definitions of this small number of
distinct interactions from the details of when they occur.
Secondly, we observe that PS interactions are highly bursty,
with many interactions occurring almost simultaneously and long idle periods
between bursts. This allows us to save significant storage space by amortizing
timestamp information across a burst.
Finally, we find that sequences of PS interactions are also
highly repetitious; if we see a sequence of PS reads and writes, we are very
likely to see the same sequence again in the future. This leads us to apply
standard compression schemes to the time-ordered traces of PS interactions,
achieving a high compression rate.
In the rest of this section, we describe relevant attributes
of common-case queries, present the results and implications of our survey of
PS interaction traces, and then describe the details of our log format.
4.1
Common Queries
Today, systems administrators deal with large-scale,
complicated systems. According to surveys [9,28,33,36], an average Windows
machine has 70k files and 200k registry settings. Faced with the task of
managing these systems, a systems administrator’s job is often a problem of
“finding the needle in the haystack.” For example, troubleshooting is the task
of finding the few configuration settings or program files that are causing a
problem; and to test a software upgrade or patch, the administrator needs to
know what subset of the system might be affected by the change. To be useful,
FDR must help systems administrators quickly identify the small set of relevant
state and events out of all the state existing and events occurring across the many
machines of a computing or IT environment. We describe the details of how
systems management tasks use PS interaction traces in Section 6. Here, we
briefly describe the aspects of common-case queries that informed our log
format design.
We find that most common systems management queries of PS
interaction traces search for a subset of events, identified by the responsible
process or user, the file or registry entry being accessed, or another aspect
of the interaction (“Who changed this configuration?” or “What did I change yesterday?”).
This means that, by organizing or indexing our log format around such
attributes, we can quickly identify the subset of interactions of interest. Common
queries are also often restricted by time range, looking only at events that
occurred during a specific period, implying that our logs should support random
access over time, not just sequential access.
Many systems management tasks only involve the existence (or
absence) of a particular PS interaction, and not when or how often
the interaction occurred. For example, finding all loads of a shared library,
regardless of when they occurred, can identify the processes that depend on that
library and help assess the impact of a software upgrade. Other times, queries
do care about when a PS interaction occurred, but only need to know an
interaction’s relative-ordering vis-à-vis other PS interactions on a given thread,
e.g., to determine potential causalities like loading a binary after
reading its name from the Registry. In both cases, the implication is that some
queries need not read timestamps at all.
For our survey, we monitored the PS interactions of over 324
machines during one year across a variety of computing environments and
collected over 5000 machine-days of PS interactions in total. We worked with
MSN to instrument 207 of their machines, across 4 different services with
different workloads, including CPU-bound systems with heavy disk workloads, a
large storage service for external users, and web notifications
publish/subscribe service. In our own research lab, we monitored 72 laboratory
machines used for various data collection, analysis and simulation
experiments. We also monitored 35 corporate desktops and laptops, used by
researchers and engineers, primarily for activities such as software
development and word processing. Finally, we monitored 7 home machines, used
for entertainment and work-related activities by researchers, engineers, and
their families. As a control, we also collected traces from 3 idle systems,
running within virtual machines with no user workload.
4.2.1
Scale and repeated interactions
The primary challenge to efficiently tracing the PS
interactions of a machine is the volume of events that occur. In our survey,
we found that the average number of daily PS interactions was O(107)
ranging from 9M on desktop machines to 70M on the busiest workloads, as shown
in Table 2. Not surprisingly, servers tended to have a stable workload from
day-to-day, while our lab, corporate desktop and home machines had varied
workloads. The highest number of daily interactions we saw was 264M events, on
an MSN server that collected application logs from 1000s of other machines.
Table
2:
The average per machine daily total and distinct interactions, entries,
and processes
|
Environment |
Total PS Interactions |
Distinct PS Interactions |
Distinct PS Entries
Accessed |
Distinct
Processes |
|
Svc. 1 |
70M |
0.2% |
>0.1% |
40-60 |
|
Svc. 4 |
29M |
3.3% |
0.4% |
30-70 |
|
Svc. 2 |
22M |
0.6% |
0.1% |
30 |
|
Svc. 3 |
19M |
1.1% |
0.1% |
30-70 |
|
Home |
17M |
4.2% |
0.6% |
30-40 |
|
Desktop |
9M |
5.4% |
1.0% |
20-100 |
|
Lab |
9M |
1.5% |
0.3% |
17-40 |
|
Average |
25M |
1.6% |
0.2% |
43 |
|
Idle |
965K |
2.1% |
0.5% |
14 |
However, we found several factors that mitigate the large volume of PS interactions
in all these workloads. First, the number of distinct files and registry
entries read or written every day is much smaller than the total number of
interactions. Secondly, the number of distinct processes that run on each
machine is very small, O(102) processes on the busiest desktops, and
fewer on production servers. Overall, we found that most PS entries are only
accessed by a small number of processes, and that the total number of distinct
interactions (i.e., distinct <user, process, operation-type, PS
entry> tuples) was O(105), only 0.2% to 5.4% of the total
interactions per day.
This implies that we can improve the performance of queries
not interested in timestamps or counts of PS interaction occurrences by separating
the unique definitions of observed interactions from the time-ordered traces of
when they occur. Effectively, this allows many common queries to ignore 94.6-99.8%
of the log. This also provides the possibility of compressing our logs, by
replacing repeated descriptions of an interaction with a single unique ID.
4.2.2
Bursts of Activity
Several studies of I/O traffic and file system activities
have shown that server and desktop I/O workloads demonstrate bursty or
self-similar behavior [14,16]. We observe this in our traces as well, where it
manifests as many interactions arriving together with long idle periods in between.
The primary implication of these bursts for our log format
is that, when many events occur together, there is a clear opportunity to merge
their associated time information, storing a single timestamp for all the
events that occur during the same timestamp bucket. This is a significant observation
because per-event timestamps are a major limiting factor to achieving high
compression rates. To help us choose an appropriate bucket duration, we look
to the requirements of common-case systems management queries. We find that
fine-grained timestamps are rarely necessary, instead what is most important is
the relative ordering of events and the ability to map event occurrences to
human activities (i.e., wall-clock time). This leads us to choose a
relatively coarse-grained 48-bit or 6ms granularity timestamp. Note that this
still provides a granularity finer than Windows’ time-scheduling quantum of
10-15ms. While one might worry that a coarse-grained timestamp would mean
every bucket would have at least one event in it, in practice, even our busiest
observed machine-day, with 264M daily events, showed no PS interactions during
60% of its timestamp buckets. Of course, this does not mean the machine as a
whole was idle—it might have been busy with CPU calculations during the times
it was not doing PS interactions.
4.2.3
Repeated Sequences of Interactions
Our final key observation is that many sequences of PS
interactions repeat over time. This is not a surprise, as we would expect that
most of the file system and registry activities performed by a system are
standard, repetitive tasks, including process start-up and shutdown, background
activities, document auto-saves, and logging. We perform a detailed analysis
of repeating “activity bursts,” in [32] and, for space considerations, provide
only a summary here.
In our analysis in [32], we define an “activity burst” as
the set of PS interactions occurring in one thread, where each interaction
occurs no more than some small time separation apart. Formally, we define an
activity burst as a group of events {et | i ≤ t ≤ j}
occurring within a single thread, where gap(et,et+1) < k,
for all i ≤ t < j; gap(ei-1,ei) ≥ k;
gap(ej,ej+1) ≥ k;
gap(x,y) is the time between two interactions x and y;
and k is the threshold gap between bursts. We call an activity burst a
“repeat” if it is identical to an earlier activity burst in every attribute of
its interactions except for timestamps. Otherwise, we call it a “distinct”
burst. In our survey, we find that most activity bursts in a day are repeated
bursts. On desktops, we see 2K-5K distinct bursts out of 20K-40K total and, on
servers, we see 3K-4K distinct bursts out of 40K-70K.
This repetition of PS interaction sequences indicates that
simple byte compression schemes, applied to the time-ordered event sequences, should
detect and compress these repeating patterns. Since our analysis of activity
burst repetition focuses on bursts within a single-thread, storing PS
interactions in a system-wide timestamp sequence runs the risk of allowing
concurrent I/O from multiple threads to interfere with the compressibility of each
other’s patterns. However, because of the relatively large CPU time slice of 10-15ms
on the Windows OS, and the knowledge that most PS interactions are handled
quickly by file caches, we still expect to gain significant compression of
repeating patterns in a cross-thread trace of PS interactions.
4.3
Log Format Details
Based on the machines observed in our survey, our log format
has two key facets to its logical design, shown in Figure 1. First, given the
repetition of accessed files, observed processes, etc., we normalize the attributes
of our persistent state interactions into separate sections, essentially
following a standard procedure of database normalization. We create one section
for distinct PS interactions, which point to other sections containing distinct
names, user context, process information, file data hashes and values. The
primary benefit of this normalization is a reduction of repetitive information.
In addition, we find that grouping attributes into their own sections improves
the performance of byte compression algorithms as we compress these log
sections later. This separation also improves query efficiency by transforming
queries that involve multiple attributes and expensive string comparisons into inexpensive
comparisons of integer IDs as we scan through traces of PS interactions.

Figure 1:
Logical design of our log format
Our second design choice in our logical log format is to represent
the trace of events itself as two parallel, but connected, streams of data.
The first stream is an ordered list of events as they are captured by our
kernel driver and reported to the user daemon. The second stream contains
timestamp information for groups of events. This amortizes the size of
timestamp information across multiple events, reducing the overall size of the
log, as well as improving byte compression of the event stream by better exposing
patterns in the sequences of events. Both logical streams are stored in a
single physical file to ease management of log archives.
We created a file structure that
contains a large logical address space split into 32 sections. Each of the
normalized attribute sections, the section of distinct PS interactions,
as well as the ordered event stream and timestamp stream, are mapped to a section in our log file. Each section is composed of individually compressed 64k
pages. Compressing in blocks, rather than using a streaming compression
format allows random access within a data section.
To simultaneously optimize our log for random access and
compression, our physical log file layout consists of a three-layer addressing
scheme of block table, block number, and block offset, shown in Figure 2.
This three-layer addressing scheme is important because we want to compress
individual blocks and store the start and end offsets of each block in a table
for fast lookup, and as a further optimization, compress these tables as well.
With 448 block tables, 8192 blocks per table and a 64k uncompressed block size,
this provides a maximum addressable storage size of 234 GB of uncompressed data
within each log file. While we find this is many times more than a single
machine-day of logs, this format gives us the flexibility of joining many days
of logs into a single file for improved compression, and gives us flexibility
if PS interaction workloads grow in the future.
|

|
|
Figure 2: The physical layout of our log format.
|
Each log file starts with a 4k uncompressed header. The
first 256 bytes consist of versioning and other miscellaneous information.
Next are 32 section descriptions, each 8 bytes long. Each of the logical
sections, described earlier, is laid out contiguously over our three-layer
addressing scheme, aligned at block boundaries. These section descriptors
provide the block table and number of the start and end of each section. The
rest of the 4k header is filled by 448 block table descriptors, that point to
the start and end offsets of a compressed block table. The block table, in
turn, contains 8192 block entries, each pointing to the start and end offset of
a compressed 64k block.
The timestamp section is
maintained as a 16 byte entry containing 6 bytes (48 bits) to represent a 6ms
time resolution, 2 bytes to count missing events within that region, and two 4
byte offsets pointing to the first and last consecutive event with that time
resolution. While almost all events are received by the daemon in time sorted
order, we correctly handle timestamp information for any events that
appear out of order. This can happen when a context
switch occurs just after an I/O activity completes, but before the
kernel driver reports it, but this delays the
reporting of an event by a few scheduling quantums, and never affect the intra-thread ordering of PS interactions.
The user daemon first creates log files in-memory. As it receives raw events from the kernel driver,
the daemon normalizes the events, replacing attribute values with indexes into
the appropriate sections. The normalized event is then compared to the table
of distinct normalized events using an O(1) hash lookup, and added to the table
if necessary. Finally, the normalized event is added to the ordered event
stream section, along with new timestamp information, if necessary. Each of
the log file’s data sections is append-only in memory. When a log file is
closed and flushed to disk the daemon writes each data section contiguously to
disk while applying a standard compression algorithm to each 64K-byte block.
4.4 Querying Logs
When analyzing these log files,
our tools tend to restrict their queries based on one or more attributes
in the PS interaction record, based on a time-range of interest, or based on
both. To restrict a query by attribute, a query tool scans the appropriate
section, looking for all values matching the given criteria. From these
values, the tool then generates a filter to apply against the section of
distinct PS interactions, resulting in a set of unique IDs, one for each PS
interaction matching the original attribute restriction. For example, to
return only PS interactions that access a particular file, a tool would first
scan the string section to find the ID of the filename, and then scan the
section of distinct PS interactions to find the IDs of all distinct PS
interactions that accessed this filename ID. If a tool is not interested in when
or how many times an interaction occurred then it can stop here, without
scanning the much larger event stream sections. Otherwise, the tool can scan
through the ordered event list and timestamp stream to find the details of the
occurrences of these PS interactions.
To restrict a query by a time range, a query tool applies a
binary search to the timestamp stream, searching for the start of the desired
time range. Once this timestamp is found, it can skip to the appropriate 64K
block of the ordered list of events, and begin scanning the ordered list from
that point on, until the end of the time range.
Common-case queries tend to extract sparse information from
the extremely large data set of PS interactions. Our log format enables
efficient queries by allowing query tools to focus on the relevant subsets of
data, and expanding their scope to larger and larger portions of the data as
necessary. For example, a query to find a list of all files modified during a
day of 25M PS interactions requires only one pass over a distinct event table
with 318Kentries to identify the string attribute id of modified files, and
then scanning over the string attribute section with 100K entries to discover the full filenames of each
modified file, avoiding ever scanning the full list of 25M events that occurred
during the day.
In this section, we evaluate FDR and find that on average,
our logs use only 0.7 bytes/PS interaction. We find that query performance
scales linearly with the number of events in a log. All of our queries can be
processed in a single pass against an entire machine-day of logs in just 3.2
seconds. FDR’s client-side performance overhead is less than 1%, and
calculating the load on bottleneck resources in our central server indicates
that a single machine could scale to collecting and analyzing all the PS
interactions from 4300 machines, keeping all logs for over 3 months.
5.1
Log File Compression
The efficiency of FDR’s log file format affects both the
runtime load on clients’ memory and network, and the long-term storage costs of
PS interaction logs.
In our survey of 5000 machine-days of logs, described in
Section 4.2, the average raw event size is 140 bytes and daily per machine raw
logs are 7GB, compressing to 700MB with GZIP. After converting our collected
logs to our new format, we found, depending on workload, each PS interaction
takes between 0.5 to 0.9 bytes of storage, and 0.7 bytes on average. One
machine-day of PS interactions can be stored in 6 to 71MB, 20MB on average,
with a maximum observed machine-day size of 179MB. Table 3 shows the results
across our environments.
Table 3: Daily storage
requirements for machines across environments
|
Role |
Avg. Bytes/Event |
Avg. MB/day |
Max MB/day |
|
Svc. 1 |
0.91 |
71 |
179 |
|
Svc. 4 |
0.78 |
57 |
103 |
|
Svc. 2 |
0.71 |
19 |
22 |
|
Svc. 3 |
0.66 |
10 |
53 |
|
Home |
0.58 |
17 |
51 |
|
Desktop |
0.80 |
13 |
21 |
|
Lab |
0.47 |
6 |
43 |
|
Average |
0.70 |
20 |
49 |
|
Idle |
0.85 |
0.76 |
0.88 |
In addition to PS interaction workload, the efficiency of our log format’s
compression is sensitive to the frequency at which the logs are uploaded to the
central server. More frequent uploads of logs reduces the latency between
when an event occurs and when it can be analyzed. However, uploading logs
less frequently allows the logs to collect more repeated events and achieve
better overall compression.

|
|
Figure 3:
Compression Ratio variation with log file interval for each category of
machines |
Figure 3 shows how storage
size varies with the event collection period. It shows that 80% of the compression
efficiency is typically achieved with log files an hour long, and that
improvement ceases after 1 week of data. Based on this result, our current
deployments upload log files every 2 hours, and our central server
reprocesses and merges them into 1-week long
log files to save long-term storage space. We
envision that future versions of FDR will support latency-critical queries by
moving them to the client-side agent.
Table 4: Average
section size as a percentage of total log file size for each machine
role
|
Role |
Event |
Time |
Reg. |
File |
String |
Data |
Other
|
|
Svc
1 |
77% |
19% |
1% |
0.3% |
0.2% |
1.9% |
0.6% |
|
Svc 4 |
57% |
17% |
7% |
13.9% |
1.8% |
| |