


NAF: The NetSA Aggregated Flow Tool Suite

Brian Trammell - CERT/NetSA, Carnegie Mellon University
Carrie Gates[Note 1] - CA Labs

Pp. 221-231 of the Proceedings of LISA '06: 20th Large Installation System Administration Conference
(Washington, DC: USENIX Association, December 3-8, 2006).

Abstract

In this paper we present a new suite of tools - NAF (for NetSA Aggregated Flow) - that accepts network flow data in multiple formats and flexibly processes it into time-series aggregates represented in an IPFIX-based data format. NAF also supports both unidirectional and bidirectional flow data by matching uniflows into biflows where sufficient information is available. These tools are designed for generic aggregation of flow data with a focus on security applications. They can be used to reduce flow data for long-term storage, to summarize it as the first step in numerical analysis, or to serve as a back-end to flow data visualization processes.

Introduction

Many organizations, including universities, private industry, Internet Service Providers (ISPs) and government organizations, are monitoring and storing network traffic information. Due to the overwhelming volume of network traffic, many of these organizations have chosen flow formats over full packet capture or packet header information. This information is then used for a variety of purposes, such as billing or capacity planning, with security monitoring being particularly common.

A number of tools have been produced in the past few years for processing flow data for both network usage and security analysis. For example, OSU FlowTools [4], argus [10] and SiLK [5] each have a significant user base. However, each of these programs produces flow data in a different format, the tools from each are not interoperable, and there are no programs that can perform consistent analysis across each of the different formats.

In this paper we present the NAF (NetSA Aggregated Flow) suite of tools. These tools have been designed for interoperability with a variety of flow collection systems and analysis tools, with a focus on common security analysis tasks using time-series aggregated flow data. To that end, the nafalize flow aggregator can read raw flow data from multiple input formats (e.g., SiLK, argus) as well as from data sources supporting the IPFIX [1] standard for flow export. To bridge the gap between unidirectional and bidirectional flow data formats, nafalize can match uniflows into biflows if both directions are available. The nafscii flow printer converts aggregate flow data into whitespace-delimited text for simple import into a variety of numerical analysis tools.

nafalize was designed to be a general-purpose flow aggregator. It provides a flexible aggregation facility for generating time-series aggregated flow data, supporting aggregation by any set of flow key fields, and counting of octets, packets, and flows in the aggregate, as well as by distinct source and destination addresses. It includes a facility for sorting aggregated flow data and limiting output to produce time-series Top-N and watch lists. The nafilter tool allows further filtering and sorting of this aggregated flow data.

We begin with an overview of related work, focusing on different suites of flow tools and the functionality they provide. We then describe the NAF suite of tools, including the functionality of each of the tools along with a description of the options available. We present details of the design and implementation of the nafalize and nafilter tools and subsequently provide some usage examples that have been deployed in an operational context. The last section includes concluding comments and plans for future work.

Related Work

The conversion and aggregation of flow data for analysis purposes is certainly not a new area of work, and there exist several other tool sets which address some of the problems the NAF suite was created to address. Most of NAF's functionality can indeed be duplicated by stringing together combinations of these tools; we believe that what NAF adds to the state of the art is the combination of flexible, time-series aggregation and sorting of flow data with multiple input format support in a single, relatively easy to use package. We examine some of these other tools below.

In its default mode of operation, the ipaggcreate tool from the ipsumdump [3] suite, maintained by Eddie Kohler at UCLA, works much like nafalize and nafscii in series, with two important differences. First, ipaggcreate is primarily designed to read various packet trace formats; ``flow-like'' input support is limited to Cisco NetFlow summary files and Bro connection summaries. Second, ipaggcreate has no notion of time series; it is analogous to running NAF with a bin size larger than the entire scope of its input; using ipaggcreate to generate time-series aggregates would require multiple invocations.

The OSU FlowTools [4] flow-report tool can produce the same type of aggregate reports, with the same caveat that it does not natively support time series data.

The SiLK [5] flow analysis suite can be used to do many of the aggregation tasks performed by nafalize and nafilter, though somewhat less easily.[Note 2] A set of rwfilter invocations can be used to filter flows, simulating flow key masking for limited key spaces. The rwcount tool provides only simple time binning and aggregation. nafalize also natively supports biflow data and unique host and port counting, which are not presently provided by SiLK.

The FlowScan [2] tool, maintained by Dave Plonka of the University of Wisconsin, is another time-series flow analysis environment. Unlike NAF, it is focused specifically on visualization of summary time-series flow data, and as such uses the round-robin database provided by RRDtool [13] for data storage and aggregation. However, it does support import of flow data in multiple formats, including CAIDA's cflowd, the OSU flow-tools package, QoSient's argus, and the Lightweight Flow Accounting Protocol.

NCSA's CANINE [8] tool performs largely the same function as nafalize's first stage, transcoding among a variety of flow formats as a conversion front-end for NCSA's flow data visualization tools. CANINE is also capable of IP address and time sequence anonymization. However, it does not itself provide any support for aggregation of flow data.

Description

The NAF suite presently consists of four tools: nafalize, which aggregates flow data into a common aggregated flow data format (the NAF data format, based upon IPFIX); nafilter, which sorts and filters NAF data; nafscii, which prints NAF data as whitespace-delimited ASCII text; and nafload, which loads NAF data into a relational database. The NAF data format itself is handled by libnaf, a library installed with the NAF tools; this arrangement allows the easy creation of tools that interoperate with NAF.

All the tools support common options for input and output routing. They can each run as command-line tools or as daemons. In the latter case, the tools can watch directories for input files, processing them as they appear and forwarding the output to other directories; in this manner, chains of daemons can be built to support specific workflows.

nafalize reads raw flow data in a variety of formats, filtering, aggregating, and sorting it into the NAF data format. The aggregation function performed by nafalize is specified on the nafalize command line by an aggregation expression. This expression maps relatively tightly to the design of nafalize; that is, each ``clause'' in the expression corresponds to a particular process within nafalize's data flow. The aggregation expression is described by the following pattern:

bin [uniform | start | end] <n>
     (sec | min | hr)
[uniflow] [perimeter <perimeter-ranges>]
[<filter-expression>]
aggregate [sip[/<mask>]] [dip[/<mask>]]
            [sp] [dp] [proto] [sid]
   count [hosts] [ports] [total]
         [flows] [packets] [octets]
   [<filter-expression>]
   [<sort-expression>]
   [label <output-label>]
[aggregate ...]

Note that multiple aggregate clauses are supported; this is used to specify fanout, as described in the ``Fanout'' section below. The label phrase serves to differentiate output files in this case.

The first filter expression is applied to the entire data set so that all later processing is performed only on this filtered data (and is described more fully in the ``Stage 2'' section, below). The second filter expression applies solely to the individual aggregation (see the ``Stage 3'' section). Thus each individual aggregation can be on data that has been filtered differently. The filter expression itself is described by the following pattern:

filter
[time <time-rangelist>]
[sip [not] <ipaddr-rangelist>]
[dip [not] <ipaddr-rangelist>]
[sp [not] <unsigned-rangelist>]
[dp [not] <unsigned-rangelist>]
[proto [not] <unsigned-rangelist>]
[flows <unsigned-rangelist>]
[packets <unsigned-rangelist>]
[octets <unsigned-rangelist>]
[shosts <unsigned-rangelist>]
[dhosts <unsigned-rangelist>]
[sports <unsigned-rangelist>]
[dports <unsigned-rangelist>]

The sort expression defines the ordering of records in the output. The output is always sorted in time order first. Within each bin, output records are given in arbitrary order by default. If a sort expression is present, output records within each bin are sorted by the fields given in ascending or descending order. The limit phrase limits the output to the specified number of records per time bin; it can be used to generate top-N lists. The sort expression is described by the following pattern:

[sort (flows | packets | octets |
        shosts | dhosts | sports |
        dports | sip | sp | dip |
        dp | proto)
    [asc | desc]]
[sort ...]
[limit <n>]

nafilter reads NAF aggregated flow data, filters and sorts it, and writes the result as NAF aggregated flow data. As with nafalize, nafilter's operation is specified by a filter/sort expression, which has the same mapping to nafilter's internals as the aggregation expression above. The filter/sort expression is simply a filter expression followed by a sort expression, as defined above.

Each of the clauses in the expressions corresponds to a process in the data flow of each application; see below for more.

nafscii is the NAF flow printer. It reads NAF aggregated flow data and writes it out as whitespace-delimited text. It is used for simple aggregated flow display, and for exporting NAF aggregated flow data to other analysis tools (e.g., the R [12] statistical computing and visualization environment) which can handle whitespace-delimited data.

nafload is an online loader for inserting NAF aggregated flow data into a relational database. It was built largely to support the internal workflow of a specific project at the CERT Network Situational Awareness Group.

Both nafscii and nafload are extremely simple in design, reading in NAF aggregate flow records and writing them out after transforming them; they will therefore not be considered in further detail in this paper.

Refer to the manual pages in the NAF distribution (available at https://www.cert.org/netsa/tools/naf) for detailed usage instructions.

Design and Implementation

In this section, we describe the common data model and storage format these tools use, then explore the design of nafalize and nafilter in detail.

Data Model

NAF's internal data model is based on time-binned, aggregated, bidirectional flows. Each NAF record represents a collection of flows sharing a given flow key or subkey within a given finite time period (bin). Flow keys (type NAFlowKey in the NAF source) are composed of some combination of source and destination IP address and prefix length, IP protocol, source and destination port, and source (or sensor) identifier. Flow values (type NAFlowVal) are composed of six aggregate counters (for octets, packets, and flows, each in the forward and reverse directions) and four unique value counters (for distinct source and destination IP addresses, and distinct source and destination ports).
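
To make this data model concrete, the following sketch expresses roughly equivalent structures in Python. It is purely illustrative: NAF itself is written in C, and the field names below approximate NAFlowKey and NAFlowVal rather than reproducing the actual source definitions; in particular, sets stand in here for the distinct-value counters.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class FlowKey:
    # Approximation of NAFlowKey: the fields a NAF record may be keyed on.
    sip: int = 0          # source IPv4 address
    sip_prefix: int = 32  # source prefix length after masking
    dip: int = 0          # destination IPv4 address
    dip_prefix: int = 32  # destination prefix length after masking
    proto: int = 0        # IP protocol number
    sp: int = 0           # source transport port
    dp: int = 0           # destination transport port
    sid: int = 0          # sensor (source) identifier

@dataclass
class FlowVal:
    # Approximation of NAFlowVal: six aggregate counters (octets, packets,
    # and flows, forward and reverse) and four distinct-value counters.
    octets: int = 0
    packets: int = 0
    flows: int = 0
    rev_octets: int = 0
    rev_packets: int = 0
    rev_flows: int = 0
    shosts: set = field(default_factory=set)  # distinct source addresses
    dhosts: set = field(default_factory=set)  # distinct destination addresses
    sports: set = field(default_factory=set)  # distinct source ports
    dports: set = field(default_factory=set)  # distinct destination ports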

NAF uses an IPFIX-based external data format, as described in the IPFIX Protocol Specification [1] and the IPFIX Information Model [11]. Bidirectional flow information is represented as in Biflow Implementation Support in IPFIX [14]. IPFIX was chosen for its self-describing nature; the ability to dynamically define record templates, useful because nafalize can produce output with any subset of key and value fields; and for ease of present and future interoperability with IPFIX-based flow metering and collection tools. Each NAF data file is a serialized IPFIX message stream containing the IPFIX templates required to interpret the records contained within the file.

nafalize

nafalize is the NAF flow aggregator. It is an input-driven application, capable of converting a variety of flow formats. It can read flow data from files, function as an IPFIX Collecting Process (reading IPFIX messages over TCP or UDP), or capture packets from an interface via libpcap.

nafalize is roughly organized into three stages, with well-defined interfaces between them. These three stages are raw flow input, binning and matching, and aggregation and output; they are described below. A schematic diagram of nafalize appears in Figure 1.



Figure 1: Schematic diagram of nafalize.

It should be noted that presently, all of these stages run in series - for example, when a packet or flow received from the first stage causes a new bin table to be enqueued and an old bin table to be dequeued, the dequeued bin table is aggregated and flushed (multiple times, in the case of fanout) before the next packet or flow is processed. This may lead to dropped data when NAF is collecting data from a live interface or via IPFIX over UDP. The strict layering of these stages was chosen with future performance enhancement in mind; each stage could, for example, be split into its own thread to run on its own processor in a multiprocessor system.

Stage 1: Raw Flow Input

nafalize supports three raw flow input types: multi-format flow input from files; IPFIX over TCP, UDP, or serialized streams via libfixbuf; and packet capture via libpcap [7]. A given nafalize invocation can only read input from a single type.

The file input facility includes flow format drivers for reading from QoSient Argus [10] 2.0.6, CERT/NetSA SiLK [5], and NAF files themselves for re-aggregation. New flow format drivers are relatively simple to add, but at this time require integration into the NAF source code; there is no support for dynamically-loaded flow format drivers.

The IPFIX input facility can read from serialized IPFIX message stream files, and IPFIX messages via UDP or TCP. It is capable of reading both unidirectional and bidirectional flows.

The packet capture facility can read from pcap dumpfiles (as produced by tcpdump -w), or from a live ethernet or loopback interface via libpcap. It does partial fragment reassembly - enough to ensure subsequent fragments of a fragmented datagram are accounted to the correct source and destination UDP or TCP port. When capturing packets, nafalize simulates counting TCP ``flows'' by looking for SYN or SYN+ACK packets. Each packet, SYN or not, is then treated as a separate raw flow for purposes of binning and matching, as below.
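
To illustrate just the SYN heuristic (fragment handling and the pcap machinery are omitted), a minimal Python sketch follows; the packet dictionary and its field names are hypothetical, not part of NAF.

def packet_to_raw_flow(pkt):
    # Treat each captured packet as a one-packet raw flow.  Count it as the
    # start of a TCP "flow" only if the SYN flag is set (covering both SYN
    # and SYN+ACK packets), approximating flow counts from packets alone.
    SYN = 0x02
    is_flow_start = pkt["proto"] == 6 and bool(pkt["tcp_flags"] & SYN)
    return {"sip": pkt["sip"], "dip": pkt["dip"], "sp": pkt["sp"],
            "dp": pkt["dp"], "proto": pkt["proto"],
            "packets": 1, "octets": pkt["length"],
            "flows": 1 if is_flow_start else 0}

print(packet_to_raw_flow({"sip": "10.0.0.1", "dip": "10.0.0.2", "sp": 1025,
                          "dp": 80, "proto": 6, "tcp_flags": 0x02,
                          "length": 60}))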

Each of these facilities successively reads flow or packet records from their respective sources, and normalizes them into NAF raw flow records (type NAFlowRaw), which are then passed to the second stage.

Stage 2: Bin and Match

The second stage of nafalize consists of four processes: binning, perimeter reversal, prefiltering, and matching. Perimeter reversal and prefiltering are optional, depending on configuration.

Each raw flow is first split into time bins. If a raw flow's start and end times fit entirely within a single bin, that raw flow is assigned to the bin in which it fits. Otherwise, the flow is assigned to bins according to a user-selectable bin selection strategy. Three of these strategies are presently supported: start, end, and uniform. The ``start'' strategy bins the raw flow completely into the bin containing the flow's start time; conversely, the ``end'' strategy bins the raw flow completely into the bin containing the flow's end time. The ``uniform'' bin selection strategy divides the flow's values equally into each bin covered by the time span between the flow's start and end time, preserving remainders so that values are robust across re-aggregation.
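
The ``uniform'' strategy is the least obvious of the three, so a minimal Python sketch of it follows. This reflects our reading of the description above; the exact rounding and remainder placement in nafalize may differ.

def uniform_bin(start, end, value, bin_size):
    # Split one counter value evenly across the bins covered by the flow's
    # time span, handing out the remainder one unit at a time so that the
    # per-bin values still sum to the original value.
    first_bin = start - (start % bin_size)
    last_bin = end - (end % bin_size)
    nbins = (last_bin - first_bin) // bin_size + 1
    share, remainder = divmod(value, nbins)
    return {first_bin + i * bin_size: share + (1 if i < remainder else 0)
            for i in range(nbins)}

# A flow from t=30 to t=120 with 10 packets, binned at 60 seconds, touches
# bins 0, 60, and 120; the packet count is preserved across the split.
print(uniform_bin(30, 120, 10, 60))   # {0: 4, 60: 3, 120: 3}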

Note that, as a packet has only a single timestamp, every packet-derived ``raw flow'' will always fit into a single bin. Therefore, packet capture input has the effect of ``start'' binning no matter what bin selection strategy is employed.

NAF supports optional perimeter reversal of flows. If the user specifies a network perimeter based on a set of IP address ranges and/or CIDR blocks, all raw flows are conditionally reversed such that addresses external to the perimeter are ``source'' addresses, and addresses internal to the perimeter are ``destination'' addresses. Flows not crossing the perimeter are dropped. This is compatible with the semantics of some security tools, such as snort and QRadar, which typically assign source addresses to the ``attacker.'' This facility is provided for users of such tools, who are more comfortable thinking of networks in terms of ``interior'' and ``exterior.''
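
A rough Python sketch of perimeter reversal follows, with flows represented as simple dictionaries; the perimeter shown is a hypothetical example, and in the real tool the forward and reverse value counters would be swapped along with the key fields.

import ipaddress

def perimeter_reverse(flow, perimeter):
    # Orient a flow so the external endpoint is the source and the internal
    # endpoint the destination; drop flows that do not cross the perimeter.
    def internal(addr):
        a = ipaddress.ip_address(addr)
        return any(a in net for net in perimeter)
    src_in, dst_in = internal(flow["sip"]), internal(flow["dip"])
    if src_in == dst_in:
        return None                      # both inside or both outside: drop
    if src_in:                           # internal source: reverse the flow
        flow = dict(flow, sip=flow["dip"], dip=flow["sip"],
                    sp=flow["dp"], dp=flow["sp"])
    return flow

perimeter = [ipaddress.ip_network("192.168.128.0/17")]
print(perimeter_reverse({"sip": "192.168.190.1", "dip": "241.1.2.3",
                         "sp": 51234, "dp": 80}, perimeter))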

After perimeter reversal, each binned flow is then subjected to an optional prefilter. See the ``Filtering'' section below for a detailed description of filtering in nafalize. Prefiltering is provided for performance improvement. Though each flow can also be filtered after matching, matching requires a flow to be held in memory until it is ready for aggregation. Therefore, early elimination of irrelevant flows before matching will increase nafalize's performance.

The binned flows are then inserted into the multibin (type NAFMultiBin). This structure is a bin-indexed, time-ordered queue of flow tables (type NAFBinTable). The appropriate bin table is accessed by time bin; if no table exists yet for a given bin (because a binned flow is the first one assigned to the given bin), new bin tables are created and inserted at the head of the queue.

The multibin's queue length is determined by the horizon. This horizon h is chosen such that the processing of a flow of start time t enables the assumption that no flows with a start time before t - h will be processed subsequently. When processing raw flow or IPFIX data, this is generally set to the active timeout interval of the flow metering process; for packet capture input, the horizon can be the same as the bin size, keeping only one bin table active at once. This design implies that NAF's input must be at least roughly sorted by time.

When new bin tables are enqueued at the head, old bin tables expire from the tail. These bin tables are dequeued, and passed on to the third stage.
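
The following Python sketch captures the queue-and-horizon idea; the class name, the simple integer counters, and the flush callback are ours, not the NAFMultiBin interface.

from collections import defaultdict

class MultiBin:
    # Sketch of a bin-indexed, time-ordered collection of flow tables.  Bins
    # that end more than `horizon` seconds before the newest flow seen are
    # expired and handed to the third stage via the flush callback.
    def __init__(self, bin_size, horizon, flush):
        self.bin_size, self.horizon, self.flush = bin_size, horizon, flush
        self.bins = {}                   # bin start time -> flow table

    def add(self, bin_start, key, value):
        table = self.bins.setdefault(bin_start, defaultdict(int))
        table[key] += value
        self._expire(bin_start)

    def _expire(self, now):
        # Relies on input being at least roughly time-ordered: once a flow
        # binned at `now` is seen, no flow before now - horizon will follow.
        for bin_start in sorted(self.bins):
            if bin_start + self.bin_size <= now - self.horizon:
                self.flush(bin_start, self.bins.pop(bin_start))
            else:
                break

# e.g. MultiBin(300, 1800, flush=lambda t, tbl: print(t, dict(tbl)))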

Stage 3: Aggregate and Output

The third stage of nafalize consists of five processes: filtering, masking, aggregation, sorting, and output. As with prefiltering in the second stage, filtering is optional, and is described in the ``Filtering'' section below.

Each binned flow that passes the optional filter is masked. Masking consists of projecting each record's flow key into a subset of the flow key. Subkeys are derived either by dropping fields from the key or by masking IP addresses by a more restrictive prefix length. The masking process is illustrated in Figure 2.



Figure 2: Illustration of mask operation.

This subkey is then used to create and update aggregate flow records in an aggregation table. The values of each binned flow are added to the corresponding aggregate flow record, and distinct values for flow key fields dropped from the subkey (such as hosts or ports) are counted. Once all the binned flows in a bin table have been aggregated, the aggregation table is optionally sorted as described in the ``Sorting'' section below, then written to the output file.
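
A condensed Python sketch of masking and aggregation for a single bin table follows (cf. Figure 2); the field names and the particular counters kept are illustrative, not the nafalize internals.

def mask_key(key, keep, prefixes=None):
    # Project a full flow key (a dict) onto a subkey: keep only the fields in
    # `keep`, masking IPv4 addresses down to the requested prefix lengths.
    prefixes = prefixes or {}
    sub = {}
    for f in keep:
        v = key[f]
        if f in ("sip", "dip") and f in prefixes:
            v &= ~((1 << (32 - prefixes[f])) - 1)
        sub[f] = v
    return tuple(sorted(sub.items()))     # hashable subkey

def aggregate_bin(binned_flows, keep, prefixes=None):
    table = {}
    for key, val in binned_flows:
        agg = table.setdefault(mask_key(key, keep, prefixes),
                               {"flows": 0, "octets": 0, "dhosts": set()})
        agg["flows"] += val["flows"]      # aggregate counters are summed
        agg["octets"] += val["octets"]
        agg["dhosts"].add(key["dip"])     # dropped key fields are counted distinctly
    return table

Here an expression such as ``aggregate sip/24 count flows octets hosts'' would correspond roughly to keep=("sip",) with a 24-bit source prefix.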

Fanout

NAF is capable of fanout; that is, it can run multiple third stages off a single second stage. Each of these third stages has a different filter and mask, and a separate aggregation table, and writes to a different file. This can lead to significant performance improvement over processing the same data twice, because in many cases the second stage is much more memory- and CPU-intensive than the third.

Filtering

As noted above, nafalize may optionally filter raw flows before binning and matching, and binned flows before aggregation. The filtering facility is identical in either case. Each filter is built from a user-supplied filter expression, and contains one or more field specifiers (e.g., source IP, protocol) and ranges of acceptable values for the given field. If a field specifier is present in a filter, then that field must have one of the values in the associated range in order for the flow to pass the filter. Flows that do not pass the filter are simply dropped.

This filtering algorithm does not support arbitrary boolean predicates; instead, it can be viewed as the intersection of a set of unions (or ``AND-of-OR'').

When filtering on time ranges, a flow matches a time range if the flow's start time falls within the time range. For purposes of filtering, the start time of a binned flow is the bin's start time.
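
Conceptually, a filter can be represented as a mapping from field to a set of acceptable ranges, as in this Python sketch; the structure of the actual filter objects in libnaf may differ.

def passes(flow, filt):
    # AND-of-OR: for every field named in the filter, the flow's value must
    # fall within one of that field's ranges (or outside all of them, when
    # the field is negated with ``not'').
    for fld, (ranges, negated) in filt.items():
        matched = any(lo <= flow[fld] <= hi for lo, hi in ranges)
        if matched == negated:
            return False
    return True

# e.g. the expression ``filter proto 6,17 dp not 0-1023'' as a structure:
filt = {"proto": ([(6, 6), (17, 17)], False),
        "dp":    ([(0, 1023)], True)}
print(passes({"proto": 6, "dp": 8080}, filt))   # True
print(passes({"proto": 6, "dp": 80}, filt))     # False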

Sorting

As noted above, nafalize may optionally sort aggregated flows on output. If a sort expression is supplied by the user for a given aggregation, all aggregated flows in each time bin are placed into an array on output, then sorted using a sort comparator derived from the sort expression. Note that this design constrains the output to be sorted in ascending time order.

The sorting function also supports a limit, which will output only the first N flows per bin in sort order. In this way nafalize can be used to build time-series Top-N lists.
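
In Python terms, sorting and limiting one bin's records might look like the sketch below; a stable multi-pass sort is one reasonable way to realize a sort expression, though not necessarily the way nafalize implements it.

def top_n(bin_records, sort_fields, limit=None):
    # bin_records: list of aggregate flow records (dicts) for one time bin.
    # sort_fields: list of (field, descending) pairs, most significant first.
    out = list(bin_records)
    for fld, descending in reversed(sort_fields):
        out.sort(key=lambda r: r[fld], reverse=descending)   # stable sort
    return out[:limit] if limit is not None else out

# e.g. ``sort octets desc limit 2'': the two highest-volume records per bin
records = [{"sip": "10.0.0.1", "octets": 10},
           {"sip": "10.0.0.2", "octets": 99},
           {"sip": "10.0.0.3", "octets": 42}]
print(top_n(records, [("octets", True)], limit=2))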

nafilter

nafilter is the NAF aggregated flow filter/sorter. Like nafalize, it is an input-driven application, though it is limited to only reading files written by nafalize. It is roughly organized into two stages: filtering and optional sorting. A schematic is shown in Figure 3.



Figure 3: Schematic diagram of nafilter.

Filtering operates as described in the ``Filtering'' section; indeed, the filter implementation is shared between nafalize and nafilter. It is important to note that the filter operates on raw flows in nafalize but on aggregated flows in nafilter, so filtering on value fields will have different results in each application. Consider the example of a filter that passes only records with a packet count of one. In nafalize, this filter will build aggregates of single-packet raw flows, while in nafilter, it will only pass aggregate flows that themselves only have a single packet.

Sorting operates as described above; the sorting implementation is also shared with nafalize.

Usage Examples

Here we describe two examples of actual NAF deployments in operational contexts. The first is as part of an internal data collection project at the CERT Network Situational Awareness Group. The second deployment occurred on the network of a large computing conference as part of a security support effort using SiLK and NAF.

NetSA Preanalysis

NAF is used in the preprocessing of Argus 2.0.6 flow data from a distributed collection infrastructure operated by the Network Situational Awareness group at CERT. The generated aggregate flows support ``at-a-glance'' visualization and statistical anomaly detection.

Raw Argus flow files (generated by argus and themselves preprocessed through ragator) are aggregated into four separate labeled files (using the fanout feature described above). These files are then loaded via nafload into a relational database, from which further analyses are done.[Note 3] The nafalize command line for this output is:

nafalize --daemon --lock --intype argus2
 --nextdir argus-cache --faildir naf-fail
 --in "argus/*.rag" --out naf-out
 bin 5 min
 aggregate count flows packets octets
 label volume
 aggregate count hosts
 label talkers
 aggregate sip count flows octets
 label sources
 aggregate dp proto count hosts octets
         packets flows
 filter proto 6,17
 label pdb

The four labeled files contain, in 5-minute time series: total flow, packet and octet volume (example nafscii output in Figure 4); total distinct source and destination hosts (Figure 5); total flows and octets per source IP address (Figure 6); and total distinct source and destination hosts, flow, packet, and octet counts per destination port and protocol (Figure 7).[Note 4]

    date     time    flo rflo  pkt  rpkt    oct     roct
2006-03-03 13:20:00  363  12  4873  5552  739388  5956007
2006-03-03 13:25:00  279  16  7026  7612 1337665  8042156
2006-03-03 13:30:00  343  11  2599  2208  639824  1616504
2006-03-03 13:35:00  190   9  1355  1077  311763   729521
2006-03-03 13:40:00  223   6  1631  1422  320408  1040939
2006-03-03 13:45:00  223   7  5031  6147  653908  7736319

Figure 4: Argus preprocessing example volume output.

    date     time   shosts dhosts
2006-03-03 13:20:00     35     62
2006-03-03 13:25:00     37     60
2006-03-03 13:30:00     37     70
2006-03-03 13:35:00     31     38
2006-03-03 13:40:00     34     50
2006-03-03 13:45:00     28     48

Figure 5: Argus preprocessing example talkers output.

    date     time          sip    flo  rflo   oct    roct
2006-03-03 13:20:00  10.146.0.13    1     0     64       0
2006-03-03 13:20:00  10.146.0.73   27     2  91604  433619
2006-03-03 13:20:00  10.146.0.74   14     0   1436     837
2006-03-03 13:20:00  10.146.0.77   23     0   9286   15266
2006-03-03 13:20:00  10.146.0.82   27     3   7766    5544
2006-03-03 13:20:00  10.146.0.83   14     0   4647   22963
2006-03-03 13:20:00  10.146.0.91   11     0  23202   31724
2006-03-03 13:20:00  10.146.0.95    8     0   3618   35124
2006-03-03 13:20:00  10.146.0.99    2     0     56       0

Figure 6: Argus preprocessing example sources output.

    date     time     dp proto shosts dhosts flo rflo pkt rpkt   oct    roct
2006-03-03 13:20:00   22   6     1       1    4   0    42   47   3910    7894
2006-03-03 13:20:00   80   6     5      15   81   0  1033 1141 107152 1120657
2006-03-03 13:20:00  143   6     1       1    2   0    48   49   2534   34490
2006-03-03 13:20:00  443   6     4       7   40   0   423  431  64404  282673
2006-03-03 13:20:00  445   6     3       1    3   0     3    0    144       0
2006-03-03 13:20:00  993   6     1       1    2   0     4    2    320     266
2006-03-03 13:20:00   53  17     8       6   53   0    91   55   6411   11104
2006-03-03 13:20:00   67  17     5       2    5   0     9    4   3024    1312
2006-03-03 13:20:00  137  17     7       3    9   0    54    0   4203       0
2006-03-03 13:20:00  138  17     5       1    6   0     7    0   1615       0
2006-03-03 13:20:00 5353  17     3       1    6   0    11    0   1365       0

Figure 7: Argus preprocessing example pdb output.

The relational database into which the aggregate flow data is loaded is presently used for two purposes.

First, periodic queries are run against the relational database, and the results are fed into Tobias Oetiker's rrdtool [13] to generate time-series graphs for each of the variables produced (e.g., total data volume per sensor in octets, talkers per sensor, etc.). This provides a simple ``dashboard'' visualization for the distributed collection system.

Periodic queries are also used to select all the variables available for a given time bin; these are used as independent-variable input into a statistical anomaly detection process based upon Mahalanobis distance [6], which compares each bin to a baseline derived from a larger window of recent aggregate data, and detects time bins which deviate significantly and therefore may benefit from further analysis of the full-flow data. The result of this process is a single time-series ``deviance'' metric, which is itself fed into rrdtool as above.
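
As an illustration of the deviance metric only, the sketch below computes a Mahalanobis distance with numpy; the choice of variables, the baseline window, and the random example data are placeholders rather than the production analysis.

import numpy as np

def deviance(bin_vector, baseline):
    # Mahalanobis distance of one time bin's aggregate variables from the
    # mean of a baseline window (each row of `baseline` is an earlier bin).
    mu = baseline.mean(axis=0)
    cov = np.cov(baseline, rowvar=False)
    diff = bin_vector - mu
    # the pseudo-inverse guards against a singular covariance matrix
    return float(np.sqrt(diff @ np.linalg.pinv(cov) @ diff))

# e.g. flows, packets, octets, and talkers for the last 24 five-minute bins
rng = np.random.default_rng(0)
baseline = rng.normal([300, 5000, 1e6, 40], [30, 500, 1e5, 5], size=(24, 4))
print(deviance(np.array([900, 20000, 5e6, 120]), baseline))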

Note that we store time-series summary flow data in the relational database, and keep raw flow data in its native (Argus) format for a period of time to permit detailed flow analysis. NAF was deployed in this environment to replace a system which inserted raw flow data into the database for intermediate-term storage, and where all aggregation was done via SQL queries. This system did not scale adequately for our needs; a more detailed look at the use of relational database technology in raw flow storage is given in [9].

Conference Security

SiLK [5] and NAF[Note 5] were deployed as part of an effort to provide security support for a large computing conference in late 2005. The example usage and output results presented here were all gathered from this conference. We have represented IP addresses internal to the security conference facility as 192.168.128.0/17, while external addresses have been randomly chosen from 241.0.0.0/8.

NAF is used here as a post-processor for SiLK raw flow data; the per-key binning provided by nafalize is more convenient to use than the equivalent operations using the SiLK rw tools alone, and SiLK did not at the time of this deployment support unique host or port counting as in the third example.

In our first example, we first use the SiLK tools (rwfilter) to filter the data to include only those flows that originated from within the internal network, but that were not destined for an internal address. One hour of such TCP data is piped into the following command:

nafalize -t silk bin 1 hr
 aggregate sip
 count hosts | \
nafscii | \
gawk '{if ($1 !~ /date/) { split($3, a, ".");
 printf "%d | %d\n", a[1]*256*256*256 +
 a[2]*256*256 + a[3]*256 + a[4], $4}}'

The nafalize command aggregates the results from the SiLK commands into one hour bins by source IP address and counts the total number of hosts contacted. These results are converted into ASCII (via nafscii). The gawk command takes the IP address, which is provided in dotted quad notation, and converts it back to its integer form, printing this value along with the number of hosts contacted by that IP. The results from this command have the following form:

2363326977 | 1
2363326978 | 1
2363326980 | 2
2363327150 | 32
2363327161 | 7

These results are then presented graphically (hence the requirement for an integer representation of the addresses) so that a user can gain a sense over time of what was considered normal activity for the network. We present an example graph in Figure 8. This figure indicates five outliers that are potentially worth further investigation.

Figure 8: Hosts Contacted Per Hour.

Figure 9 demonstrates a second command run periodically on the conference network. In order to detect unusual TCP activity, we select TCP traffic that was inbound to the network and that did not originate from within it over a one day period, restricted to only those flows representing failed connection attempts (i.e., where the SYN flag was set but not the ACK flag). This selection is done via the SiLK rwfilter command. We again nafalize this into one hour bins by source IP address, counting packets and bytes. nafscii then converts the output for processing by gawk to print the average number of bytes per packet observed, the number of packets, the start date and time for the record and the source IP address in dotted quad. As the selected flows represent failed connection attempts, we would expect the majority of traffic to be 40, 44, 48 or 52-byte single-packet flows. However, here we observe unusual activity. We expect that the cases where there are a large number of packets that average 40 or 44 bytes per packet are indications of scanning activity. For example, the case where IP address 241.194.61.230 has 14,374 packets with an average of 59 bytes per packet might indicate someone who was scanning and trying an exploit against those hosts that responded. This would indicate an IP address whose traffic warrants further investigation.


% rwfilter --syn=1 --ack=0 --daddr=192.168.128.0/17 \
    --not-saddr=192.168.128.0/17 \
    --pass=stdout --proto=6 --start=2005/11/15 | \
    nafalize -t silk bin 1 hr aggregate sip count packets octets | \
    nafscii | \
    gawk '{if ($1 !~ /date/) printf "%d | %d | %s %s %s \n", \
        $5/$4, $4, $1, $2, $3}'
 40 |     1 | 2005-11-15 00:00:00 241.192.13.14
 52 |     2 | 2005-11-15 00:00:00 241.10.21.189
 48 |     1 | 2005-11-15 00:00:00 241.11.246.197
 44 |  1929 | 2005-11-15 00:00:00 241.34.98.164
 59 | 14374 | 2005-11-15 00:00:00 241.194.61.230
 40 | 23347 | 2005-11-15 16:00:00 241.192.100.123
740 |     1 | 2005-11-15 23:00:00 241.71.1.154

Figure 9: Inbound bytes-per-packet by source.

In the third example (Figure 10), we examine all TCP traffic for a single day that is incoming to the conference network and that did not originate from within it. Again we aggregate in one-hour bins by source IP address. We convert to ASCII and process the results, printing out the integer value for the IP address, the number of destination hosts contacted, the start date and time for the record, and the dotted-quad version of the IP address. As there is little reason for an external IP address to connect to a large number of internal IP addresses, this analysis indicates likely scanning activity. Note that IP 241.194.61.230 appears again in this data, lending credence to our hypothesis above regarding their activity.

% rwfilter --daddr=192.168.128.0/17 --not-saddr=192.168.128.0/17 \
    --pass=stdout --proto=6 --start=2005/11/15 | \
    nafalize -t silk bin 1 hr aggregate sip count hosts | \
    nafscii | \
    gawk '{if ($1 !~ /date/) { \
        split($3, a, "."); printf "%d | %d | %s %s %s \n", \
       a[1]*256*256*256 + a[2]*256*256 + a[3]*256 + a[4], $4, $1, $2, $3}}'
1000079635 | 1 | 2005-11-15 21:00:00 241.156.1.19
1006778434 | 2 | 2005-11-15 23:00:00 241.2.56.66
1022911850 | 3 | 2005-11-15 03:00:00 241.248.101.106
3514611848 | 283 | 2005-11-15 20:00:00 241.124.184.136
3703717350 | 14509 | 2005-11-15 00:00:00 241.194.61.230
1022903300 | 15881 | 2005-11-15 16:00:00 241.248.68.4

Figure 10: Inbound destination host count by source.

Our last example demonstrates commands run periodically to detect potentially compromised internal machines. In this case we select TCP flows from an external host to an internal host on port 445, lasting more than thirty seconds, representing completed connections where more than 1400 bytes were sent. This should extract those hosts that might have compromised the SMB port on a Windows machine. We then use NAF to bin on an hourly basis, extracting the start date and hour, the source and destination IP addresses, the number of flows, and the number of bytes. The results from running this command are provided in Figure 11; in this case we find two IP addresses that have potentially compromised a single internal host each.


% rwfilter --bytes=1400-99999999 --dur=30-3600 --dport=445 --ack=1 \
    --flags-initial=S/SA --start=2005/11/15 --proto=6 --pass=stdout \
    --not-saddr=192.168.128.0/17 --daddr=192.168.128.0/17 | \
    nafalize -t silk bin 1 hr aggregate sip dip count total flows octets | \
    nafscii | \
    gawk '{if ( $1 !~ /date/) { if ($6 > 10000) \
        printf "445 | %s %s | %s %s | %d | %d\n", $1, $2, $3, $4, $5, $6}}'
445 | 2005-11-15 10:00:00 | 241.146.88.212 192.168.190.233  | 18 | 1209855
445 | 2005-11-15 11:00:00 | 241.146.88.212 192.168.190.233  |  1 | 44824
445 | 2005-11-15 16:00:00 | 241.214.211.244 192.168.190.248 | 3  | 231372

Figure 11: Inbound potential SMB compromise detection.

During the conference, we performed a similar analysis on ports 135 and 22. While much of the traffic destined for port 22 was legitimate, we examined the number of internal IP addresses to which different external IP addresses had connected. We also examined where the external IP addresses were registered, to ensure that they matched known participants rather than likely compromisers.

Conclusions and Future Work

We have introduced a new flow aggregation tool suite, the focus of which is interoperability with multiple flow sensor technologies and the reduction of flow data for network security analysis purposes. These tools are designed to be reasonably generic, and apply to a wide variety of analysis tasks. We have explored the design of two of these tools in detail, and provided examples of their usage in two real-world applications.

NAF is under continuing active development at the CERT Network Situational Awareness Group. We have planned several enhancements for the tool suite that will appear in future releases:

NAF's internal data model and aggregation operations presently only support flows with IPv4 addresses; we plan to add support for IPv6 addresses, as well.

To support NAF's use in data sharing applications, future releases of the tool suite will include support for data anonymization, when the aggregation operations do not discard sufficient information to meet an organization's dissemination policy needs. Indeed, the extent to which aggregation operations obfuscate data needs to be better quantified; this is an area for future research.

The use of a single data format at the core of the NAF tools' design also makes it reasonably easy to build new consumers for that data; currently planned is a NAF-to-SVG ``printer'' analogous to nafscii or nafload. This would allow the generation of time-series graphs from NAF data without the present need to convert the data into rrdtool round-robin databases.

Likewise, the layered architecture of nafalize eases the addition of new types of flow input to the tool. Additional flow input drivers will continue to be added to the distribution on an ``on-demand'' basis.

Furthermore, as NAF's output format is based upon IPFIX, it will be reasonably simple to add support for using nafalize as an IPFIX Exporting Process; that is, to allow it to send output over the IPFIX Protocol. When combined with existing Collecting Process support, nafalize will be deployable as a ``drop-in'' aggregating IPFIX proxy.

Present experience with nafalize suggests that its performance is bound by I/O (how fast records can be read from disk or the network) and the size of the active flow table. While the performance is obviously dependent on both the data and the aggregation expressions used, nafalize run on a stock Dell 1850 tends to process between 100 k and 250 k records per second with between 5 k and 10 k concurrent flows in the second-stage NAFMultiBin. We plan on performing detailed profiling and optimization in order to increase this throughput.

One performance enhancement suggested by nafalize's three-stage design would be to split the stages into separate threads. This may increase throughput on multicore hardware, but more importantly, it would isolate output delay from input processing, which is especially important in the aggregating IPFIX proxy case above.

The NAFBinTable is presently limited to the size of available memory. nafalize may be extensible to work with extremely large datasets by replacing the underlying bintable implementation with one that can utilize on-disk storage when necessary, sacrificing performance for flexibility as needed.

Author Biographies

Brian Trammell is the Engineering Technical Lead at CERT Network Situational Awareness Group. He designs, builds and maintains software systems for the collection and analysis of security-relevant measurement data for large-scale networks. He is also an active contributor to internet-measurement related working groups in the Internet Engineering Task Force. He received his bachelor's degree in computer science in 2000 from the Georgia Institute of Technology, where he also worked as the UNIX systems administrator for the School of Civil Engineering for three years. He can be reached at .

Carrie Gates is a Research Staff Member with CA Labs where she performs research in enterprise security. She received her Ph.D. in May, 2006, from Dalhousie University. While completing her dissertation, she spent three years working with CERT Network Situational Awareness at Carnegie Mellon University doing security research for large-scale networks. Previous to her doctoral studies, Carrie was a systems administrator for six years. She can be reached at .

Bibliography

[1] Claise, B., IPFIX Protocol Specification, June, 2006, Internet-Draft draft-ietf-ipfix-proto-22 (work in progress).
[2] Plonka, Dave, ``FlowScan: A network traffic flow reporting and visualization tool,'' Proceedings of the 14th Large Installation Systems Administration Conference (LISA 2000), New Orleans, Louisiana, USENIX Organization, pp. 305-317, December, 2000.
[3] Kohler, Eddie, ipsumdump and ipaggcreate, 2006, https://www.cs.ucla.edu/kohler/ipsumdump/, (Last Visited: 11 May 2006).
[4] Fullmer, M., and S. Romig, ``The OSU flow-tools package and Cisco Netflow logs,'' Proceedings of the 14th Systems Administration Conference (LISA 2000), New Orleans, Louisiana, USENIX, pp. 291-303, December, 2000.
[5] Gates, C., M. Collins, M. Duggan, A. Kompanek, and M. Thomas, ``More NetFlow tools: For performance and security,'' Proceedings of the 18th Large Installation Systems Administration Conference (LISA 2004), Atlanta, Georgia, USENIX, pp. 121-132, November, 2004.
[6] Lazarevic, A., L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, ``A comparative study of anomaly detection schemes in network intrusion detection,'' Proceedings of SIAM International Conference on Data Mining, San Francisco, California, Society for Industrial and Applied Mathematics, May, 2003.
[7] LBNL Network Research Group, TCPDUMP public repository, 2005, https://www.tcpdump.org, (Last Visited: 9 May 2006).
[8] Luo, K., Y. Li, A. Slagell, and W. Yurcik, ``CANINE: A NetFlows converter/anonymizer tool for format interoperability and secure sharing,'' FloCon 2005: Proceedings, Pittsburgh, Pennsylvania, CERT Network Situational Awareness Group, September, 2005, https://www.cert.org/flocon/2005/proceedings.html.
[9] Navarro, J.-P., B. Nickless, and L. Winkler, ``Combining Cisco NetFlow exports with relational database technology for usage statistics, intrusion detection and network forensics,'' Proceedings of the 14th Large Installation Systems Administration Conference (LISA 2000), New Orleans, Louisiana, USENIX, pp. 285-290, December, 2000.
[10] QoSient, LLC, argus: network audit record generation and utilization system, 2004, https://www.qosient.com/argus/, (Last Visited: 9 May 2006).
[11] Quittek, J., S. Bryant, B. Claise, and J. Meyer, Information Model for IP Flow Information Export, June, 2006, Internet-Draft draft-ietf-ipfix-info-12 (work in progress).
[12] R Project, The R Project for Statistical Computing, 2006, https://www.r-project.org, (Last Visited: 12 May 2006).
[13] Oetiker, Tobias, RRDtool, 2006, https://oss.oetiker.ch/rrdtool/, (Last Visited: 11 May 2006).
[14] Trammell, B., and E. Boschi, Bidirectional Flow Export using IPFIX, August, 2006, Internet-Draft draft-ietf-ipfix-biflow-00 (work in progress).
Footnotes:
Note 1: This work was performed while with the CERT Network Situational Awareness Group at Carnegie Mellon University.
Note 2: Indeed, one of the initial motivations behind NAF's creation was to provide an easier method of producing time-series aggregates and unique counts from SiLK data.
Note 3: This is not precisely how this works in production. First, the nafalize command line is slightly different due to deployment concerns. Second, nafalize is run twice for site-specific reasons, with the second nafalize instance processing the output of the first instance, using NAF's ability to reaggregate its own output. Third, the aggregation expressions in production use the srcid field to aggregate data from multiple sensors.
Note 4: The example data was not generated from production data from the distributed collection system; it is presented as an example of output from the command-line shown.
Note 5: This deployment used an earlier version of the NAF tools which did not support fanout and consequently used a slightly different aggregation expression; the examples have been corrected to use aggregation expressions that will operate with NAF as it is presently available.