Both the monitoring and the response components of pH use ideas introduced in our earlier work on system-call-based anomaly detection. What follows is a description of our original testing methodology, with which we gathered on-line data for off-line analysis. Subsequent sections explain how these techniques were modified to create pH.
To review, we monitored all the system calls (without arguments) made by an executing program on a per-process basis. That is, each time a process was invoked, we began a new trace, logging all the system calls for that process. Thus, for every process the trace consists of an ordered list (a time-series) of the system calls it made during its execution. For commonly executed programs, especially those that run with privilege, we collected such traces over many invocations of the program, when it was behaving normally. We then used the collection of all such traces (for one program) to develop an empirical model of its normal behavior.
Once the system had been trained on a sufficient number of normal program executions, the model was tested on subsequent invocations of the program. The hope was that the model would recognize most normal behavior as ``normal'' and most attacks as ``abnormal.'' Our method thus falls into the category of anomaly intrusion detection.
Given a collection of system call traces, how do we use them to construct a model? This is an active area of research in the field of machine learning, and there are literally hundreds of good methods available to choose from, including hidden Markov models, decision trees, neural networks, and a variety of methods based on deterministic finite automata (DFAs). We chose the simplest method we could think of within the following constraints. First, the method must be suitable for on-line training and testing. That is, we must be able to construct the model ``on the fly'' in one pass over the data, and both training and testing must be efficient enough to be performed in real time. Next, the method must be suitable for large alphabet sizes. Our alphabet consists of all the different system calls--typically about 200 for UNIX systems. Finally, the method must create models that are sensitive to common forms of intrusion. Traces of intrusions are often 99% the same as normal traces, with very small, temporally clumped deviations from normal behavior. In the following, we describe a simple method, which we call ``time-delay embedding.'' Warrender et al. compared time-delay embedding with several other common machine learning algorithms and found that it is remarkably accurate and efficient in this domain.
We define normal behavior in terms of short n-grams of system calls. Conceptually, we define a small, fixed-size window and ``slide'' it over each trace, recording which calls precede the current call within the sliding window. The current call and a call at a fixed preceding window position form a ``pair,'' with the contents of a window of length w being represented by w - 1 pairs. The collection of unique pairs over all the traces for a single program constitutes our model of normal behavior for the program.1
More formally, let w be the window size, let C be the alphabet of possible system calls, and let T = s_1, s_2, ..., s_tau (s_i in C) be a trace of system calls. The pairs recorded for T are

P(T) = { (d, s_{i-d}, s_i) : 1 <= d < w, d < i <= tau },

and the normal profile for a program is the union of P(T) over all of its normal traces.
For example, suppose we had as normal the following sequence of calls:
execve, brk, open, fstat, mmap, close, open, mmap, munmap, with a window size of 4. We slide the window across the sequence, and for each call we encounter, we record which call precedes it at each position within the window, numbering the positions from 0 to w - 1, with 0 being the current system call. So, for this trace, we get the following windows:
| current | position 1  | position 2   | position 3 |
|---------|-------------|--------------|------------|
| execve  |             |              |            |
| brk     | execve      |              |            |
| open    | brk, close  | execve, mmap | fstat      |
| fstat   | open        | brk          | execve     |
| mmap    | fstat, open | open, close  | brk, mmap  |
| close   | mmap        | fstat        | open       |
| munmap  | mmap        | open         | close      |
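The pair table above can be reproduced with a short sketch (our own Python, not the authors' implementation), representing each lookahead pair as a (position, preceding call, current call) tuple:

```python
def build_profile(trace, w):
    """Return the set of lookahead pairs for one trace with window size w."""
    profile = set()
    for i, current in enumerate(trace):
        # Positions 1..w-1 look back at the calls preceding the current one.
        for d in range(1, w):
            if i - d >= 0:
                profile.add((d, trace[i - d], current))
    return profile

trace = ["execve", "brk", "open", "fstat", "mmap",
         "close", "open", "mmap", "munmap"]
profile = build_profile(trace, 4)

# The entries for current call "open" match the table above:
assert (1, "brk", "open") in profile and (1, "close", "open") in profile
assert (2, "execve", "open") in profile and (2, "mmap", "open") in profile
assert (3, "fstat", "open") in profile
```

In a real monitor the profile would be accumulated incrementally as calls arrive, one pass over the data, which is what makes the method suitable for on-line training.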
This table can be stored using a fixed-size bit array. If A is the size of the alphabet (the number of different possible system calls) and w is the window size, then we can store the complete model in a bit array of size A x A x (w - 1). Because w is small (6 is our standard default), our current implementation uses a byte array, with masks to access the individual bits.
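One plausible layout for this byte array is sketched below (the names and the round alphabet size of 256 are our assumptions): each byte holds the window-position bits for one (preceding call, current call) combination, so the w - 1 positions of a window of size 6 fit comfortably in a single byte.

```python
NUM_CALLS = 256   # assumed alphabet size A; UNIX systems have roughly 200 calls
W = 6             # window size, the paper's standard default

# A x A bytes; bit d-1 of entry (prev, cur) is set when call `prev` has
# been seen at window position d before call `cur`.
profile = bytearray(NUM_CALLS * NUM_CALLS)

def set_pair(prev, cur, d):
    profile[prev * NUM_CALLS + cur] |= 1 << (d - 1)

def has_pair(prev, cur, d):
    return bool(profile[prev * NUM_CALLS + cur] & (1 << (d - 1)))

set_pair(5, 45, 3)       # e.g. call 5 seen three positions before call 45
assert has_pair(5, 45, 3)
assert not has_pair(5, 45, 2)
```

With these numbers the whole model is 64 KB per program, and both training (setting a bit) and testing (checking a bit) are constant-time operations, which satisfies the real-time constraint above.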
At testing time, system call pairs from test traces are compared against those in the normal profile. Any system call pair (the current call and a preceding call within the current window) not present in the normal profile is called a mismatch. Any individual mismatch could indicate anomalous behavior (a true positive), or it could be a sequence that was not included in the normal training data (a false positive). The current system call is defined as anomalous if there are any mismatches within its window.
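Testing can then be sketched as follows (a simplified, set-based version of the lookup, with a toy normal profile of our own invention):

```python
# Toy normal profile: pairs of (position, preceding call, current call).
normal = {(1, "open", "read"), (1, "read", "close"), (2, "open", "close")}

def mismatches(window, profile):
    """Count pairs formed by the current call (last entry of `window`)
    that are absent from `profile`; the current call is anomalous if
    the count is nonzero."""
    current = window[-1]
    return sum(
        1
        for d in range(1, len(window))
        if (d, window[-1 - d], current) not in profile
    )

# "write" was never seen one position before "close" during training,
# so that pair mismatches; the position-2 pair (open, close) is normal.
assert mismatches(["open", "read", "close"], normal) == 0
assert mismatches(["open", "write", "close"], normal) == 1
```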
To date, all of the intrusions we have studied produce anomalous sequences in temporally local clusters. To facilitate the detection of these clusters, we record recent anomalous system calls in a fixed-size circular array, which we refer to as a locality frame. More precisely, let n be the size of our locality frame, and let lf_i be the i-th entry of the locality frame array, with 0 <= i < n and lf_i in {0, 1}. Then, for system call s_j (j >= 0) with m_j mismatches, lf_{j mod n} = 1 iff m_j > 0, and is 0 otherwise. Thus, the locality frame implicitly stores the number of the past n system calls which were anomalous. We call this total of recent anomalies, the sum of the lf_i, the locality frame count (LFC).2 For the experiments described below, we used a locality frame of size 128.
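The circular locality frame can be sketched as follows (class and method names are ours); a tiny frame of size 4 makes the wrap-around behavior easy to see:

```python
class LocalityFrame:
    def __init__(self, size=128):
        self.lf = [0] * size   # lf[j mod size] records call j's status
        self.j = 0             # index of the next system call
        self.lfc = 0           # running locality frame count

    def record(self, num_mismatches):
        """Record one system call's mismatch count; return the LFC."""
        slot = self.j % len(self.lf)
        self.lfc -= self.lf[slot]              # drop the overwritten entry
        self.lf[slot] = 1 if num_mismatches > 0 else 0
        self.lfc += self.lf[slot]
        self.j += 1
        return self.lfc

frame = LocalityFrame(size=4)
counts = [frame.record(m) for m in [0, 1, 1, 0, 0, 2]]
# After six calls, only the last four remain in the frame
# (statuses 1, 0, 0, 1), so the LFC ends at 2.
assert counts == [0, 1, 2, 2, 2, 2]
```

Maintaining the count incrementally, rather than re-summing the array, keeps the per-call cost constant regardless of the frame size.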