
Introduction

Intrusion detection has played an important role in computer security research. Two general approaches to intrusion detection are currently popular: misuse detection and anomaly detection. Misuse detection is basically a pattern matching method: a user's activities are compared with the known signature patterns of intrusive attacks, and activities that match are labeled as intrusive. That is, misuse detection is essentially a model-reference procedure. While misuse detection can be effective in recognizing known intrusion types, it tends to give less than satisfactory results in detecting novel attacks.

Anomaly detection, on the other hand, looks for patterns that deviate from normal behavior (for example, [1,2]). Despite their ability to detect unknown attacks, anomaly detection systems suffer from a basic difficulty in defining what is ``normal''. Methods based on anomaly detection tend to produce many false alarms because they cannot discriminate between abnormal patterns triggered by an otherwise authorized user and those triggered by an intruder [3].

Regardless of the approach used, almost all intrusion detection methods rely on some sort of usage tracks left behind by users. People trying to outsmart an intrusion detection system can deliberately cover their malicious activities by slowly changing their behavior patterns. Some examples of obvious features that a user can manipulate are the time of log-in and the command set used [4]. This, coupled with factors emanating from privacy issues, makes the modeling of user activities a less attractive option.

Learning program behavior and building program profiles is another possibility. Indeed, building program profiles, especially those of privileged programs, has become a popular alternative to building user profiles in intrusion detection [5,6,7,8]. Capturing the system call history associated with the execution of a program is one way of creating its execution profile. Program profiles appear to have the potential to provide concise and stable descriptions of intrusion activity. To date, almost all the research in this area has focused on short sequences of system calls generated by individual programs. The local ordering of these system call sequences is examined and classified as normal or intrusive. This approach has one theoretical and one practical problem. Theoretically, no justification has been provided for this definition of ``normal'' behavior. Practically, the procedure is tedious and costly. Although some automated tools may help to capture system call sequences, it is difficult and time-consuming to learn the behavior profile of every program individually (i.e., system programs and application programs). While system programs are generally not updated as often as application programs, their execution traces are also likely to be dynamic, which makes it difficult to characterize ``normality''.
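
The short-sequence idea described above can be illustrated with a minimal sketch. It is not the specific procedure of [5,6,7,8]; the function names, the window length k, and the example traces are all hypothetical and chosen only for illustration. The sketch slides a fixed-length window over a trace and reports windows that never occurred in a database built from normal runs.

```python
def sliding_windows(trace, k):
    """Return every run of k consecutive system calls in the trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def unseen_windows(trace, normal_db, k):
    """Return the windows of the trace that are absent from the normal database."""
    return [w for w in sliding_windows(trace, k) if w not in normal_db]


# Hypothetical traces; real traces would come from audit data.
normal_trace = ["open", "read", "mmap", "close"]
test_trace = ["open", "read", "exec", "close"]

k = 3  # assumed window length
normal_db = set(sliding_windows(normal_trace, k))
mismatches = unseen_windows(test_trace, normal_db, k)
```

Note that a separate database of this kind has to be built for each program, which is the practical cost discussed above.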

This paper treats the system calls differently. Instead of looking at the local ordering of the system calls, our method uses the frequencies of system calls to characterize program behavior for intrusion detection. This stratagem allows long stretches of system calls to be treated as one unit, and thus bypasses the need to build separate databases and learn individual program profiles. Using the text processing metaphor, each system call is treated as a ``word'' in a long document, and the set of system calls generated by a process is treated as the ``document''. This analogy makes it possible to bring the full spectrum of well-developed text processing methods [9] to bear on the intrusion detection problem. One such method is the k-Nearest Neighbor classification method [10,11].
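
As a rough illustration of the ``document'' analogy, the sketch below turns each process's system call trace into a vector of raw call counts over a shared vocabulary. The process names, traces, and helper function are hypothetical, and raw counts are only one possible choice; any weighting scheme actually used in the experiments is described in later sections, not here.

```python
from collections import Counter


def call_frequencies(trace, vocabulary):
    """Count how often each vocabulary entry (system call) occurs in the trace."""
    counts = Counter(trace)
    return [counts[name] for name in vocabulary]


# Hypothetical traces, one per process; each trace plays the role of a "document".
traces = {
    "proc_a": ["open", "read", "read", "close"],
    "proc_b": ["open", "mmap", "mmap", "mmap", "close", "exit"],
}

# The vocabulary is the sorted set of distinct system calls (the "words").
vocabulary = sorted({call for trace in traces.values() for call in trace})
vectors = {pid: call_frequencies(trace, vocabulary) for pid, trace in traces.items()}
```

Because ordering is discarded, two traces containing the same calls in a different order map to the same vector, so a single representation can serve processes from any program.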

The rest of this paper is organized as follows. In Section 2 we review some related work. Section 3 is a brief introduction to the kNN text categorization method. Section 4 describes details of our experiments with the 1998 DARPA data. We summarize our results in Section 5, and Section 6 contains further discussions.


Yihua Liao 2002-05-13