Discoverer : Automatic
Protocol Reverse Engineering
from Network Traces
|
Weidong Cui
Microsoft Research
wdcui@microsoft.com
|
Jayanthkumar Kannan
UC Berkeley
kjk@cs.berkeley.edu
|
Helen J. Wang
Microsoft Research
helenw@microsoft.com
|
Abstract:
Application-level protocol specifications are useful for many security
applications, including intrusion prevention and detection that performs deep
packet inspection and traffic normalization, and penetration testing that
generates network inputs to an application to uncover potential
vulnerabilities. However, current practice in deriving protocol specifications
is mostly manual. In this paper, we present Discoverer, a tool for
automatically reverse engineering the protocol message formats of an application
from its network trace. A key property of Discoverer is that it operates in a
protocol-independent fashion by inferring protocol idioms commonly seen in
message formats of many application-level protocols. We evaluated the efficacy
of Discoverer over one text protocol (HTTP) and two binary protocols (RPC and
CIFS/SMB) by comparing our inferred formats with true formats obtained from
Ethereal [5]. For all three protocols,
more than 90% of our inferred formats correspond to exactly one true format; one
true format is reflected in five inferred formats on average; our inferred
formats cover over 95% of messages, which belong to 30-40% of true formats
observed in the trace.
Application-level protocol specifications are useful for many security
applications. Penetration testing can leverage protocol specifications to
generate network inputs to an application to uncover potential vulnerabilities.
For network management, protocol specifications can also be used to identify
protocols and tunnelings in monitored network traffic. Generic protocol
analyzers (GAPA [1] and binpac [16]) are important mechanisms for intrusion
detection or firewall systems to perform deep packet inspection. These
analyzers take protocol specifications as input for their analyses.
To date, protocol specifications for the above applications are specified
from documentation or reverse engineered manually. Such efforts are
painstakingly time-consuming and error-prone. It took the open-source SAMBA
project 12 years to manually reverse engineer the Microsoft SMB protocol [18]. In another example, the Yahoo
messenger protocol has also been persistently reverse engineered, despite
which, the open source clients [6] regularly
require patching to support proprietary changes in the Yahoo protocol.
Sometimes, the period between the availability of an official client and an
open-source client has been a month, with some open-source projects simply
abandoning the effort due to the frequent changes initiated by Yahoo.
To address this pain, we tackle the problem of automatic protocol reverse
engineering. There can be two sources of given input for the
reverse-engineering task: network traces and application code. In this paper,
we present our tool, Discoverer , which performs automatic reverse
engineering from network traces. We leave application-code-based reverse
engineering as future work.
In Discoverer, we focus on reverse engineering the message format
specification and leave the protocol state machine inference to our future
work. To automatically reverse engineer message formats for a wide range of
protocols, we face three main challenges: (1) We have very few hints from the
network trace. The only evident information from the trace is the
directionality of byte streams. (2) Protocols are significantly different from
each other. (3) Protocol message formats are often context-sensitive where
earlier fields dictate the parsing of the subsequent part of the message.
To make our tool general, we base our design on inferring protocol idioms
commonly seen in message formats of many protocols. To cope with the few hints,
we dissect the formless byte streams into text and binary segments or tokens as
a starting point for clustering messages with similar patterns, where each
cluster approximates a message format. By comparing messages in a cluster and
observing the characteristics of known cross-field dependencies (such as a
length field followed by a string of the length), we infer additional
properties for the tokens, which in turn can be leveraged to refine and divide
the clusters of messages, where each subcluster approximates a more precise
format. This process continues recursively until we can no longer divide up any
message clusters based on the newly finished inference. After this recursive
clustering phase, we look at all message clusters globally through a type-based
sequence alignment algorithm, and merge similar clusters into one. This way, we
can produce more concise message formats.
We have evaluated Discoverer over traces of a representative set of
protocols consisting of one text protocol (HTTP) and two binary protocols (RPC
and CIFS/SMB). We calibrated our design over some of these traces, and used the
remaining for validation. The three main metrics for our tool are correctness
(``does one inferred format correspond to exactly one true format?''), conciseness
(``how many inferred formats is a single true format reflected in?''), and coverage
(``how many messages are covered by the inferred formats?''). Across all
protocols we tested, more than 90% inferred formats correspond to exactly one
true format; one true format is reflected in five inferred formats on average;
our inferred formats cover over 95% messages, which belong to 30-40% of true
formats observed in the trace. Such significant difference between message and
format coverage is due to the heavy-tail distribution of message format
popularity commonly seen in practice.
Although our reverse-engineered message formats are imperfect, we anticipate
them to be still practical for the aforementioned applications. For instance,
penetration testing guided by our reverse-engineered formats is likely to be
much more effective than that with random inputs. Protocol fingerprinting and
tunneling detection probably do not require perfect protocol specifications.
For applications like firewalls which would err with imperfect specifications,
our tool could still serve as a help to ease the manual protocol specification
process.
We organize the rest of the paper as follows. We discuss common protocol
idioms and the scope of Discoverer in Section 2.
We describe the design of Discoverer in detail in Section 3. We present our evaluation methodology and results in
Section 4. We discuss related work in
Section 5, and limitations and future work in
Section 6. Finally, we summarize the paper
in Section 7.
2 Problem Statement
Many application-level protocols share common protocol idioms which
correspond to the essential components in a protocol specification. To make our
reverse-engineering algorithm applicable to many protocols, we base our design
on inferring the common protocol idioms. In this section, we first describe
these idioms and then explain the scope of Discoverer.
2.1 Common Protocol Idioms
Most application-level protocols involve the concept of an application
session, which consists of a series of messages (also known as
Application-level Data Units or ADUs) between two hosts that accomplishes a
specific task. The structure of an application session is determined by the
application's protocol state machine, an essential component in a
protocol specification that characterizes all possible legitimate sequences of
messages. The structure of an application message is determined by the
application's message format specification, another essential
component in a protocol specification. A message format specifies a sequence of
fields and their semantics. Common field semantics include length
(reflecting the size of a subsequent field with a variable length), offset
(determining the byte offset of another field from a certain point like the
start of the message), pointer (a special offset that specifies the
index of a field in an array of arbitrary items), cookie
(session-specific opaque data that appears in messages from both sides of the
application session; session IDs are an example of cookie fields), endpoint-address
(encoding IP addresses or port numbers in some form), and set (a group
of fields that can be put in an arbitrary order).
One particular type of fields is the Format Distinguisher (FD)
field. The value of this field serves to differentiate the format of the subsequent
part of the message, which reflects the context-sensitive nature in the grammar
of many application-level protocols. A message may have a sequence of FD
fields, particularly when multiple protocols are encapsulated. For instance, a
CIFS/SMB message consists of a NetBIOS header encapsulating an SMB header,
which in turn may encapsulate a RPC message. This implies that the applications
need to scan a message from left-to-right, decoding a FD field before
parsing the subsequent part of the message.
In this paper, we focus on deriving the message format specification and
leave protocol state machine inference to our future work. We assume
synchronous protocols to identify message boundaries. A message is a
consecutive chunk of application-level data sent in one direction. It spans one
or more packets in a TCP or UDP connection, where a UDP connection is a pair of
unidirectional UDP flows that are matched on source/destination IP address/port
number. We only aim to deal with applications that do not obfuscate their
payloads. We do not aim to capture timing semantics (e.g., ``message 1 usually
follows message 2 within 10 seconds'').
3 Design
In this section, we first present an overview of Discoverer, then describe
the three main phases of Discoverer in detail, and finally give a concrete
example of a message format inferred by Discoverer.
3.1 Overview
|
![\includegraphics[width=6in]{figures/discoverer-architecture.eps}](img1.png)
|
|
Figure 1: Overview of Discoverer's
architecture. In the example, we assume there is a single true message format
which has two fields: the first binary field of a single byte represents the
length of the second text field. There are two token patterns because, when
the text field is shorter than a threshold, it is treated as binary. In the
merging phase, this kind of tokenization errors is corrected.
|
The basic idea of Discoverer is to cluster messages with the same format
together and infer the message format by comparing messages in a single
cluster. We achieve this in three main phases (illustrated in Figure 1).
- Tokenization and Initial Clustering: This phase operates on
the raw packets, and helps in identifying field boundaries in a message
and giving the first order structure to the unlabeled messages. We first
reassemble the packets into messages, and then break up a message into a
sequence of tokens which is an approximation to a sequence of fields.
Tokens belong to one of two token classes: binary or text. We
then classify messages into various clusters based on each message's token
pattern, which is simply represented by the message direction and classes
of its tokens.
- Recursive Clustering: Since messages with the same token pattern do not
necessarily have the same format, this phase further divides clusters of
messages so that messages in each cluster have the same format, and infers
the message format by comparing messages in each single cluster. To do so,
we mimic the left-to-right recursive parsing of applications processing
messages by recursively repeating the following steps. We first infer the
message format that captures the content of all messages in a cluster.
Then we identify the first FD field (which decides the format of the
subsequent part of the message) in a left-to-right scan and use the values
of this FD field to divide the cluster into subclusters.
- Merging: This phase mitigates the over-classification problem,
namely, messages of the same format may be scattered into multiple
clusters. To do so, we merge similar message formats by using a type-based
sequence alignment algorithm that compares the field structure of two
inferred message formats.
A key design rationale for Discoverer is to be conservative: it may
scatter messages of the same format into more than one cluster, but it should
not collate messages of different formats into the same cluster. This rationale
is to ensure the correctness of inferred formats because, if there are messages
of more than one format in a cluster, the inferred format might be too general
by trying to capture multiple message formats at once.
3.2.1 Tokenization
A token is a sequence of consecutive bytes likely to belong to the same
application-level field. We require that the tokenization process works without
any particular distinction between text and binary protocols, since our tool is
intended to be fully automatic and we wish to spare the user from the manual
effort required to distinguish between text and binary protocols. Further, it
is hard to declare a protocol as purely text or purely binary, since text
protocols can contain binary bytes (e.g., an image file transferred over HTTP)
and most binary protocols contain a few text fields (e.g., the name of a file).
Our tokenization procedure generates two classes of tokens: text
and binary. A text token is intended to span the several bytes of a single
message field representing some text (such as ``GET'' in an HTTP request). Our
procedure for finding text tokens is as follows: we first identify text bytes
by comparing them with the ASCII values of printable characters, and then
consider a sequence of text bytes sandwiched between two binary bytes as a text
segment. To avoid mistaking binary bytes for text bytes, we require this
sequence to have a minimum length. Then we use a set of delimiters (e.g., space
and tab) to divide a text segment into tokens. We also look for Unicode
encodings in messages. For binary fields, identifying field boundaries is very
hard; so we instead simply declare a single binary byte to be a binary token in
its own right. Note that this procedure can admit errors: consecutive binary
bytes with ASCII values of printable characters are wrongly marked as a text
token; a text string shorter than the minimum length is wrongly marked as
binary tokens; a text field consisting of some white space characters is
wrongly divided into multiple text tokens. We correct this kind of errors in
the merging phase (see Section 3.4).
Byte-wise sequence alignment based on the Needleman-Wunsch algorithm [15] has been used in previous
studies [3,11,17]
for aligning and comparing messages. We find that byte-wise sequence alignment,
while ideally suited to align messages with similar byte patterns, is
not suitable for aligning messages with the same format. For instance,
fields with variable lengths may lead to mis-alignment of two messages of the
same format. Further, parameter selection for sequence alignment is also hard
as shown in [3].
To avoid aligning messages, we cluster messages based on their token
patterns. The token pattern assigned to a message is a tuple, (dir, class_of_token_1,class_of_token_2,…),
where
is the direction of the
message (client to server or vice versa), followed by the classes of all tokens
in the message. We consider the message direction because messages in opposite
directions tend to have different formats. An example of a token pattern is (client_to_server,text,binary,text).
Note that this initial clustering is coarse-grained since messages with
different formats may have the same token pattern. For instance, SMTP commands
typically have two text tokens (``MAIL receiver'', ``RCPT sender'', ``HELO
server-name''). In the recursive clustering phase, we improve the granularity
of this clustering by recursively identifying FD tokens and dividing clusters.
Our recursive clustering relies on identifying format distinguisher (FD)
tokens. To find FD tokens, we need to invoke both format inference and format
comparison. In this section, we first explain these procedures before
describing how we recursively identify FD tokens and divide clusters.
The format inference phase takes as input a set of messages and infers a
format that succinctly captures the content of the set of messages. Our
inferred message format is defined to be a sequence of token
specifications which include not only token semantics but also token
properties. We introduce token properties because we cannot infer the
semantic meaning for every token and certain token properties are useful for
describing the message format. Token properties currently cover two
perspectives: binary vs. text and constant vs. variable.
The first property reflects the token class, and the second decides if a token
takes the same value across all messages of the same format (i.e., constant
token) or different values in different application sessions (i.e., variable
token). We also define the type of a token to be the sum of its
semantic and property.
We now describe how token properties and semantics are derived.
Property Inference: Token class is already identified during the
tokenization phase. Constant or variable tokens can also be easily identified.
Since the set of messages come from a single token-pattern cluster, tokens in
one message can be directly compared against their counterparts in another
message by simply using the token offset. Thus, constant tokens are those that
take the same value across the entire set of messages, and variable tokens are
those that take more than one value.
Semantic Inference: We currently support three semantics: length,
offset, and cookie (see Section 2.1 for
definitions). We will discuss how it may be possible to support other semantics
in Section 6. We identify cookie fields at
the end of the merging phase since it requires correlating multiple messages in
the same session. We employ the heuristics in RolePlayer [3] for doing this. Our heuristics for
identifying length and offset fields are an extension of those in
RolePlayer [3]. The intuition for
identifying length fields is that, for a specific pair of messages, the difference
in the values of potential length fields (at most four consecutive binary
tokens or a text token in the decimal or hex format) reflects the difference
of the sizes of the messages or some subsequent tokens. We thus simply check
for a match between the value difference and the size difference. If a match
holds for all pairs of messages in the cluster, the potential length field is
declared to be a length field. For offset fields, we compare the value
difference with the difference of the offsets of some subsequent tokens.
3.3.2 Format Comparison
The goal of this procedure is to decide if two inferred message formats are
the same. Given two formats, it scans these two formats token-by-token from
left-to-right and matches the inferred type (i.e., semantic and property) of a
token from one format against its counterpart from the other. If all tokens
match, the two formats are considered to be the same.
Ideally, two tokens can be considered to match if their semantics match.
However, since there are always tokens that we do not have semantics for, we
need to compare their values (they have the same token class since the two
formats have the same token pattern). We allow a constant token to match with a
variable token if the latter takes the value of the former at least once. We
also allow a variable token to match with another if there is an overlap in the
two sets of values taken by them. Note that these policies are conservative,
which is in line with our design rationale.
3.3.3 Recursive Clustering by
Format Distinguishers
We identify FD tokens with the following algorithm. First, we invoke format
inference on the set of messages in a cluster. Then, we scan the format token
by token from left to right to identify FD tokens. We use three criteria in
determining if a token is a FD:
- We first check if the number of unique values taken by
this token across the set of messages is less than a threshold, referred
to as the maximum distinct values for a FD token. This is because a FD
token typically takes a few values corresponding to the number of
different formats.
- For tokens satisfying the first criterion, we perform a
second test as follows. The cluster is divided into subclusters, one for
each unique value taken by this token. Each subcluster consists of
messages where the candidate FD token takes a specific value. We then
require that the size of the largest subcluster exceeds a threshold,
referred to as the minimum cluster size. This is to guarantee that we can
make a meaningful format inference in at least one subcluster. Otherwise,
we gain nothing by continuing this splitting.
- If the potential FD token passes the second criterion,
we invoke format comparison across subclusters to see if their formats are
different from each other. We then merge those that manifest the same
formats and leave others intact.
This process is recursively performed on each of the subclusters because a
message may have more than one FD token. We find the next FD token by scanning
further down the message towards the right (end) of the message. It is
necessary to scan all the way to the end because we need to recognize all FDs
to obtain a good clustering and format inference.
When looking for the next FD token, the format inference is invoked again on
the set of messages in each subcluster. This is because the inferred token
properties and semantics might change because the set of messages has become
smaller, and it is possible for stronger properties to hold. For instance, a
previously variable token might now be a constant token; a previously variable
token might now be identified as a length field.
3.4 Merging with Type-Based Sequence Alignment
In the tokenization and recursive clustering phases, we are conservative to
ensure that the format inference procedure operates correctly on a set of
messages of the same format. However, this leads to a new problem of
over-classification, namely, messages of the same format may be scattered into
more than one cluster. This problem can be quite severe; for instance, over a
CIFS/SMB trace of almost four million messages, there are about 7,000
clusters/formats as input to this phase, while the total number of true formats
is 130. The goal of the merging process is to coalesce similar formats from different
clusters into a single one.
The key observation behind our merging phase is that, while sequence
alignment [15] cannot be used for
clustering messages of the same format, it can be used to align formats
for identifying similar ones across different clusters. This is because we can
leverage the rich token types (i.e., semantics and properties) inferred in the
recursive clustering phase. For instance, knowing that a particular token is a
length field in a format necessitates that its counterpart in another format is
also a length field for these two formats to be considered a match. We refer to
our algorithm for aligning formats as type-based sequence alignment.
In our type-based sequence alignment, we only allow two tokens of the same
class (binary or text) to align with each other. We claim two aligned tokens
are matched if they either have the same semantic or share at least one value
(see Section 3.3.2 for details).
To compensate for tokenization errors, we allow gaps in our type-based
sequence alignment. In addition to using gap penalties to control gaps, we
introduce extra constraints to avoid excessive gaps. First, consecutive binary
tokens in one message format are allowed to align with gaps if they precede or
follow a text token in the other message format in the alignment, and the
number of binary tokens is at most the size of the text token if the text token
is aligned with a gap, or the size difference if it is aligned with another
text token. This constraint is for handling the case of mistaking a sequence of
binary tokens to be a text token or vice versa. Second, a text token is allowed
to align with a gap, but we allow at most two gaps of this kind. This
constraint is for handling the case that a text field consisting of some white
space characters is mistakenly divided into multiple tokens.
When we align and compare two message formats to decide whether to merge
them, we first check if the gap constraints can be satisfied. If no, we stop
and claim the two formats are mismatched; otherwise, we continue to check the
number of mismatches. If there is at most one pair of aligned tokens
mismatched, we claim the two formats are matched and merge them. Note that this
is conservative because the mismatched token can be treated as a variable token
that takes values from a new set covering both formats.
Since we use the gap constraints and the number of mismatches to decide
whether to merge two message formats, our merging performance is insensitive to
sequence alignment parameters--scores for match, mismatch and gap.
|

|
|
Figure 2: Ethereal's XML output of
an example SMB ``Tree Connect AndX Request'' message (edited for better
presentation).
|
|

|
|
Figure 3: The ``name'' of the true
format for the example message in Figure 2
concatenates the human readable names of all the fields.
|
|
Table 1: Discoverer 's inferred
format for the true format in Figure 3.
For C(x,y), C means constant, x means binary (``b'') or text (``t''; in text
tokens, ``u'' means Unicode and ``n'' means it is null terminated), y is the
hex value or string of the token; for V(x,z), z is the number of different
values for the token.
|
|
Token
|
True Field
|
Token
|
True Field
|
Token
|
True Field
|
Token
|
True Field
|
|
C(b,00)
|
nbss.type
|
C(b,00)
|
smb.pid.high
|
C(b,00)
|
smb.tid
|
C(b,00)
|
|
|
C(b,00)
|
Length
|
C(b,00)
|
|
C(b,00)
|
|
V(b,2)
|
smb.connect.flags
|
|
C(b,00)
|
|
V(b,256)
|
smb.signature
|
C(b,ff)
|
smb.pid
|
C(b,00)
|
|
|
L(b)
|
|
V(b,256)
|
|
C(b,fe)
|
|
C((b,01)
|
smb.pwlen
|
|
C(b,ff)
|
Server Component
|
V(b,256)
|
|
V(b,33)
|
smb.uid
|
C(b,00)
|
|
|
C(tn,SMBu)
|
smb.cmd
|
V(b,256)
|
|
V(b,32)
|
|
L(b)
|
smb.bcc
|
|
C(b,00)
|
smb.nt_status
|
V(b,256)
|
|
V(b,13)
|
smb.mid
|
C(00)
|
|
|
C(b,00)
|
|
V(b,256)
|
|
V(b,211)
|
|
C(00)
|
smb.password
|
|
C(b,00)
|
|
V(b,256)
|
|
C(b,04)
|
smb.wct
|
V(tun,664)
|
smb.path
|
|
C(b,18)
|
smb.flags
|
V(b,256)
|
|
C(b,ff)
|
AndXCommand
|
C(tn,?????)
|
smb.service
|
|
C(b,07)
|
smb.flags2
|
C(b,00)
|
smb.reserved
|
C(b,00)
|
smb.reserved
|
|
|
|
C(b,c8)
|
|
C(b,00)
|
|
L(b)
|
smb.andxoffset
|
|
|
|
For better understanding, here we present a concrete example based on the
SMB ``Tree Connect AndX Request'' message format to explain the design and
output of Discoverer. We obtain the true message format from Ethereal (see
Figure 2 and Figure 3). The final inferred format by Discoverer is
shown in Table 1.
We can see that the inferred format is a sequence of tokens with token
properties (binary vs. text, constant vs. variable) and semantics (e.g., length
fields). For tokens with unknown semantics, their possible values are also
taken into account in the format. Before the merging step, messages of this
true format were scattered into 24 clusters in 18 different token patterns.
Different token patterns are due to the ``smb.signature'' field. Since this
field may take any random values, we will have a different token pattern when
more than three consecutive bytes at a different offset take values from the
printable ASCII range and are wrongly treated as a text token. Messages in some
token patterns were further split into fine-grained clusters in the recursive
clustering phase due to our conservative approach. Our merging technique mitigates
this over-classification problem effectively. At the end, all of the 24
clusters were merged into a single one.
This example also shows the possibility of imprecise field boundaries. For
example, the first null byte of the field ``smb.nt_status'' was treated as the
null terminator for the text token before it. However, we believe this kind of
imprecision will not affect the effectiveness of the inferred format but instead
create some extra inferred formats with different values for ``smb.nt_status''.
4 Evaluation
|
Table 2: Summary of network traces
used in the evaluation.
|
|
Protocol
|
Source
|
Size (B)
|
#
Messages
|
# True
Formats
|
|
HTTP
|
Enterprise
|
4.6G
|
5,950,453
|
2,696
|
|
RPC
|
Enterprise
|
179M
|
351,818
|
50
|
|
CIFS/SMB
|
Enterprise
|
1.0G
|
3,818,267
|
301
|
|
CIFS/SMB
|
Honeyfarm
|
1.1G
|
1,439,744
|
1220
|
|
We implemented Discoverer in 5,700 lines of C++ code on Windows. The tool
takes a network capture file either in the libpcap [12] or Netmon [14]
format as input and outputs inferred message formats: a sequence of tokens with
the inferred properties and semantics. Our current un-optimized implementation
takes about 6-12 hours for a trace of several million messages (the merging
procedure is the slowest due to the need of pairwise comparisons of all
inferred formats). Before discussing the experimental results, we first
describe our data sets and evaluation methodology.
We tested Discoverer on traces from two different sites: a honeyfarm
site [2] (which responds to
un-solicited, mostly malicious traffic) and a busy enterprise (which has
diverse and high-volume traffic). The honeyfarm trace consists of CIFS/SMB
only. The enterprise trace includes HTTP, CIFS/SMB, and RPC. The honeyfarm
trace and the HTTP trace were used as the calibration data to help
guide the design process and set tunable parameters. Our results are presented
based on the output of our tool on the traces from the enterprise site, which
served as the evaluation data. Thus, CIFS/SMB can be seen as the
evaluation case where the tool was trained on the trace from a different site,
whereas RPC is the case where the tool is evaluated over a new protocol. Though
CIFS/SMB messages may encapsulate the RPC layer, the RPC trace consists of RPC
traffic exclusively. The HTTP trace was used for both calibration and
evaluation, but we hardly tailored our tool to HTTP.
In our experiment, the CIFS/SMB and RPC trace from the enterprise site
contains traffic in one direction only. This will not affect our evaluation
because the protocol formats in both directions are equally complicated based
on Ethereal's parsing results of the honeyfarm CIFS/SMB trace. This is not to
say that if we can infer the format in one direction, we are guaranteed to
infer the format in the other direction; but the performance in one direction
does give an indication of the performance in the other direction. In addition,
since we do not put messages in different directions into the same cluster,
uni-directional traffic does not make the problem any easier.
For the HTTP trace, our tool reassembled consecutive data sent in one
direction into a message. For the CIFS/SMB and RPC traces, we leveraged
Ethereal to parse them and identify message boundaries. A summary of these
traces is shown in Table 2.
4.2 Evaluation Methodology
Our evaluation methodology is to compare the quality of our output with the
set of true message formats. To obtain the true format, instead of
trying to manually extract it from documentation and RFCs, we used the protocol
analyzers in Ethereal [5]. Ethereal can
parse a network trace and produce, for each message in the trace, an XML output
that includes the list of fields in the message, the values of those fields,
some human readable names and their sizes. Based on this output, we assign
every message a true format ``name'', which is simply the concatenation of the
human readable names of all the fields. An example of Ethereal's XML output and
the true format name is shown in Figure 2
and Figure 3.
We characterize the performance of our tool and highlight the results in the
following metrics:
- Correctness: If a cluster contains messages from more than one
true format, then Discoverer will make incorrect inference. Thus we
measure the correctness by checking the number of different true formats
followed by the messages in a cluster. For all three protocols, over 90%
clusters contain messages from a single true format.
- Conciseness: Our conservative clustering may cause multiple
inferred formats to cover subsets of a single true format. A large number
of redundant formats will affect the conciseness of the protocol
specifications generated by our tool. Thus we measure conciseness by the
ratio from the number of inferred formats to the number of true formats
followed by their messages. For all three protocols, we achieved a low 5
to 1 ratio.
- Coverage: We measure the trace coverage from two perspectives:
the fraction of messages covered by our inferred formats and the fraction
of true formats followed by covered messages. For all the three protocols,
our message coverage is above 95% while our format coverage is around
30-40%.
As for the semantic inference, all the length fields inferred by Discoverer
are correct; certain length fields are missed due to the trace limitation. For
instance, some true formats in CIFS/SMB have a fixed message size. In this
case, Discoverer will treat the length fields that reflect the message size as
constant tokens, and it will not affect parsing messages of these formats in
practice.
Discoverer has just a few tunable parameters (see Table 3). For a message larger than 2048 bytes, we only
consider the first 2048 bytes, referred to as the maximum message prefix. The
minimum length of text segments controls the tokenization procedure
(Section 3.2.1). The minimum cluster
size and the maximum distinct values for FD are used in the recursive
clustering phase (see Section 3.3.3). The
match/mismatch/gap scores are parameters for sequence alignment [15]. We observed that the performance of
Discoverer is not sensitive to the settings of these parameters. For instance,
we saw similar performance when we changed the maximum prefix size from 2048
bytes to 1024 bytes or changed the minimum cluster size from 20 messages to 10
messages. In addition, our type-based sequence alignment is not sensitive to
the match/mismatch/gap scores as we discussed in Section 3.4. Thus we take the same parameters for our
evaluations on all three protocols.
|
Table 3: Summary of parameters.
|
|
Parameter
|
Value
|
|
Maximum message prefix
|
2048
bytes
|
|
Minimum length of text segments
|
3
letters
|
|
Minimum cluster size
|
20
messages
|
|
Maximum distinct values for FD
|
10
|
|
Alignment match score
|
1
|
|
Alignment mismatch score
|
0
|
|
Alignment gap score
|
-2
|
|
In the rest of this section, we present the experimental results on the
enterprise traces for HTTP, RPC, and CIFS/SMB. Note that we use the inferred format
and cluster interchangeably because we infer one format from each cluster.
|
![\includegraphics[width=2.8in]{results/gnuplot-http-format-mesg.eps}](img7.png)
|
|
Figure 4: Heavy-tail distribution of
message format popularity in HTTP.
|
|
|
|
![\includegraphics[width=2.8in]{results/gnuplot-http-cluster-format-cdf.eps}](img8.png)
|
|
Figure 5: Correctness for HTTP: CDF
of the number of true formats followed by messages of a cluster for all
clusters before the merging phase.
|
The HTTP protocol allows an arbitrary number of ``parameter: value'' pairs
in an arbitrary order. We refer this to be the ``set'' semantic. Currently we are
unable to identify this set semantic. So we treat each ordering of the set
elements as a distinct format. By doing so, we observed 2,696 formats from the
parsing results of Ethereal. We leave the identification of set semantic to be
future work (see Section 6).
In Figure 4, we show the number of
messages of each true format in the HTTP trace. Note that the y-axis is in
logarithmic scale. This clearly reveals the heavy-tail distribution; most
messages (more than 99%) fall in the first top 1000 true formats. We observed a
similar trend in the RPC and CIFS/SMB trace as well. The implication for our
tool is that the format coverage and message coverage are likely to be very
different; the latter will be much higher compared to the former. In HTTP, we
inferred 3,926 formats, which covered 5,938,511 out of 5,950,453 messages
(99.8%). The covered messages belong to 865 out of 2,696 true formats (32%).
Since we have a hard requirement on the minimum size of a cluster, we
conjecture that the coverage ratio in terms of true formats will improve when
the trace grows and each format has more messages.
Figure 5 plots the CDF of the number
of true formats followed by messages of a cluster for all clusters before the
merging phase. This reflects the correctness of our tool. This figure shows
that about 90% of our inferred clusters are correct. They correspond to only
one true format. The number rises to over 95% if we include inferred formats
that match two true formats.
By manually inspecting the results, we found that the clustering errors are
mainly due to the inaccuracy in Ethereal parsing. For example, in some message
formats, Discoverer infers that there is a token that can be either
``Connection'' or ``Proxy-Connection''. Discoverer does not treat it as a FD
because both may be followed by the same set of values such as ``Close'' or
``Keep-Alive''. However, Ethereal does not recognize ``Proxy-Connection'' as a
parameter for HTTP, and returns a null string for this field in its parsing
result, while it returns ``http.connection'' for ``Connection''. So we will
have two true formats for a cluster that contains both ``Connection'' and
``Proxy-Connection''. Thus, our conciseness number may improve if Ethereal has
more accurate parsing.
The merging phase reduced 4,465 clusters to 3,926 clusters. Since the
covered messages belong to 865 true formats, this gives us a 5 to 1 ratio. In
fact, almost 80% true formats are scattered into at most five clusters. On one
hand, our conservative strategy eliminated false positives (i.e., wrongly
merging two clusters that correspond to two different true formats). On the
other hand, it did not help much in merging clusters for HTTP. The reason is as
follows. HTTP allows many parameters in the form of ``parameter: value''. We
treat the ``parameter:'' and ``value'' as separate tokens because of the space
in between. Since the ``value'' token for certain parameters such as ``PROXY'' may
be arbitrary strings, it is likely for such ``value'' tokens in two clusters to
not have overlapped values. In this case, we will treat them as a mismatch. If
two clusters happen to have more than one such mismatch, we will not merge them
based on our conservative policy.