Home About USENIX Events Membership Publications Students
Security 2001 Paper    [Security '01 Tech Program Index]

Pp. 177–190 of the Proceedings

 

Statically Detecting Likely Buffer Overflow Vulnerabilities

 

David Larochelle

larochelle@cs.virginia.edu

University of Virginia, Department of Computer Science

David Evans

evans@cs.virginia.edu

University of Virginia, Department of Computer Science

 

Abstract

 

Buffer overflow attacks may be today’s single most important security threat.  This paper presents a new approach to mitigating buffer overflow vulnerabilities by detecting likely vulnerabilities through an analysis of the program source code.  Our approach exploits information provided in semantic comments and uses lightweight and efficient static analyses.   This paper describes an implementation of our approach that extends the LCLint annotation-assisted static checking tool.  Our tool is as fast as a compiler and nearly as easy to use.  We present experience using our approach to detect buffer overflow vulnerabilities in two security-sensitive programs.

 

 


1.   Introduction

 

Buffer overflow attacks are an important and persistent security problem.  Buffer overflows account for approximately half of all security vulnerabilities [CWPBW00, WFBA00].  Richard Pethia of CERT identified buffer overflow attacks as the single most im­por­tant security problem at a recent software engineering conference [Pethia00]; Brian Snow of the NSA predicted that buffer overflow attacks would still be a problem in twenty years [Snow99].   

 

Programs written in C are particularly susceptible to buffer overflow attacks.  Space and performance were more important design considerations for C than safety.  Hence, C allows direct pointer manipulations without any bounds checking.  The standard C library includes many functions that are unsafe if they are not used carefully.  Nevertheless, many security-critical pro­grams are written in C.

 

Several run-time approaches to mitigating the risks associated with buffer overflows have been proposed.  Despite their availability, these techniques are not used widely enough to substantially mitigate the effectiveness of buffer overflow attacks.  The next section describes representative run-time approaches and speculates on why they are not more widely used.  We propose, instead, to tackle the problem by detecting likely buffer overflow vulnerabilities through a static analysis of program source code.  We have im­ple­ment­ed a prototype tool that does this by extending LCLint [Evans96].  Our work differs from other work on static detection of buffer overflows in three key ways: (1) we exploit semantic comments added to source code to enable local checking of interprocedural properties; (2) we focus on lightweight static checking techniques that have good performance and scalability characteristics, but sacrifice soundness and completeness; and (3) we introduce loop heuristics, a simple approach for efficiently analyzing many loops found in typical programs. 

 

The next section of this paper provides some background on buffer overflow attacks and previous attempts to mitigate the problem.  Section 3 gives an overview of our approach.  In Section 4, we report on our experience using our tool on wu-ftpd and BIND, two security-sensitive programs.  The following two sec­tions provide some details on how our analysis is done.  Section 7 compares our work to related work on buffer overflow detection and static analysis. 

2.      Buffer Overflow Attacks and Defenses

The simplest buffer overflow attack, stack smashing [AlephOne96], overwrites a buffer on the stack to replace the return address.  When the function returns, instead of jumping to the return address, control will jump to the address that was placed on the stack by the attacker.  This gives the attacker the ability to execute arbitrary code.  Programs written in C are particularly susceptible to this type of attack.  C provides direct low-level memory access and pointer arithmetic without bounds checking.   Worse, the standard C library provides unsafe functions (such as gets) that write an unbounded amount of user input into a fixed size buffer without any bounds checking [ISO99].  Buffers stored on the stack are often passed to these functions.  To exploit such vulnerabilities, an attacker merely has to enter an input larger than the size of the buffer and encode an attack program binary in that input.  The Internet Worm of 1988 [Spafford88, RE89] exploited this type of buffer overflow vulnerability in fingerd.  More so­phis­ti­ca­ted buffer overflow attacks may exploit unsafe buffer usage on the heap.  This is harder, since most programs do not jump to addresses loaded from the heap or to code that is stored in the heap.

 

Several run-time solutions to buffer overflow attacks have been proposed.  StackGuard [CPMH+98] is a com­pi­ler that generates binaries that incorporate code designed to prevent stack smashing attacks.  It places a special value on the stack next to the return address, and checks that it has not been tampered with before jumping.  Baratloo, Singh and Tsai describe two run-time approaches: one replaces unsafe library func­tions with safe implementations; the other modifies executables to perform sanity checking of return ad­dress­es on the stack before they are used [BST00].

 

Software fault isolation (SFI) is a technique that inserts bit mask instructions before memory operations to prevent access of out-of-range memory [WLAG93].  This alone does not offer much protection against typical buffer overflow attacks since it would not prevent a program from writing to the stack address where the return value is stored.  Generalizations of SFI can insert more expressive checking around potentially dangerous operations to restrict the behavior of programs more generally.  Examples include Janus, which observes and mediates behavior by monitoring system calls [GWTB96]; Naccio [ET99, Evans00a] and PSLang/PoET [ES99­, ES00] which transform object programs accord­ing to a safety policy; and Generic Software Wrappers [FBF99] which wraps system calls with security checking code. 

 

Buffer overflow attacks can be made more difficult by modifications to the operating system that put code and data in separate memory segments, where the code segment is read-only and instructions cannot be executed from the data segment.  This does not eliminate the buffer overflow problem, however, since an attacker can still overwrite an address stored on the stack to make the program jump to any point in the code segment.  For programs that use shared libraries, it is often possible for an attacker to jump to an address in the code segment that can be used maliciously (e.g., a call to system).  Developers decided against using this approach in the Linux kernel since it did not solve the real problem and it would prevent legitimate uses of self-modifying code [Torvalds98, Coolbaugh99].

 

Despite the availability of these and other run-time approaches, buffer overflow attacks remain a persistent problem.  Much of this may be due to lack of awareness of the severity of the problem and the availability of practical solutions.  Nevertheless, there are legitimate reasons why the run-time solutions are unacceptable in some environments.  Run-time solutions always incur some performance penalty (StackGuard reports performance overhead of up to 40% [CBDP+99]).  The other problem with run-time solutions is that while they may be able to detect or prevent a buffer overflow attack, they effectively turn it into a denial-of-service attack.  Upon detecting a buffer overflow, there is often no way to recover other than terminating execution. 

 

Static checking overcomes these problems by detecting likely vulnerabilities before deployment.  Detecting buffer overflow vulnerabilities by analyzing code in general is an undecidable problem.[1]  Nevertheless, it is possible to produce useful results using static analysis.  Rather than attempting to verify that a program has no buffer overflow vulnerabilities, we wish to have reasonable confidence of detecting a high fraction of likely buffer overflow vulnerabilities.  We are willing to accept a solution that is both unsound and incomplete.  This means that our checker will sometimes generate false warnings and sometimes miss real problems.  Our goal is to produce a tool that produces useful results for real programs with a reasonable effort.  The next section describes our approach.  We compare our work with other static approaches to detecting buffer overflow vulnerabilities in Section 7.

3.      Approach

Our static analysis tool is built upon LCLint [EGHT94, Evans96, Evans00b], an annotation-assisted lightweight static checking tool.  Examples of problems detected by  LCLint include violations of information hiding, inconsistent modifications of caller-visible state or uses of global variables, misuses of possibly NULL references, uses of dead storage, memory leaks and problems with parameters aliasing.  LCLint is actually used by working programmers, especially in the open source development community [Orcero00, PG00].

 

Our approach is to exploit semantic comments (henceforth called annotations) that are added to source code and standard libraries.  Annotations describe programmer assumptions and intents. They are treated as regular C comments by the compiler, but recognized as syntactic entities by LCLint using the @ following the /* to identify a semantic comment.  For example, the annotation /*@notnull@*/ can be used syntactically like a type qualifier.  In a parameter declaration, it indicates that the value passed for this parameter may not be NULL.  Although annotations can be used on any declaration, for this discussion we will focus exclusively on function and parameter declarations.  We can also use annotations similarly in declarations of global and local variables, types and type fields.

 

Annotations constrain the possible values a reference can contain either before or after a function call.  For example, the /*@notnull@*/ annotation places a constraint on the parameter value before the function body is entered.  When LCLint checks the function body, it assumes the initial value of the parameter is not NULL.  When LCLint checks a call site, it reports a warning unless it can determine that the value passed as the corresponding parameter is never NULL.

 

Prior to this work, all annotations supported by LCLint classified references as being in one of a small number of possible states.  For example, the annotation /*@null@*/ indicated that a reference may be NULL, and the annotation /*@notnull@*/ indicated that a reference is not NULL.  In order to do useful checking of buffer overflow vulnerabilities, we need annotations that are more expressive.  We are concerned with how much memory has been allocated for a buffer, something that cannot be adequately modeled using a finite number of states.  Hence, we need to extend LCLint to support a more general annotation language.  The annotations are more expressive, but still within the spirit of simple semantic comments added to programs.

 

The new annotations allow programmers to explicitly state function preconditions and postconditions using requires and ensures clauses.[2]  We can use these clauses to describe assumptions about buffers that are passed to functions and constrain the state of buffers when functions return.  For the analyses described in this paper, four kinds of assumptions and constraints are used: minSet, maxSet, minRead and maxRead.[3]

 

When used in a requires clause, the minSet and maxSet annotations describe assumptions about the lowest and highest indices of a buffer that may be safely used as an lvalue (e.g., on the left-hand side of an assignment).  For example, consider a function with an array parameter a and an integer parameter i that has a pre­condition requires maxSet(a) >= i.  The analysis assumes that at the beginning of the function body, a[i] may be used as an lvalue.  If a[i+1] were used before any modifications to the value of a or i, LCLint would generate a warning since the function preconditions are not sufficient to guarantee that a[i+1] can be used safely as an lvalue.  Arrays in C start with index 0, so the declaration

         char buf[MAXSIZE]

generates the constraints

         maxSet(buf) = MAXSIZE – 1 and minSet(buf) = 0.

Similarly, the minRead and maxRead constraints indicate the minimum and maximum indices of a buffer that may be read safely.  The value of maxRead for a given buffer is always less than or equal to the value of maxSet.  In cases where there are elements of the buffer have not yet been initialized, the value of maxRead may be lower than the value of maxSet.

 

At a call site, LCLint checks that the preconditions implied by the requires clause of the called function are satisfied before the call.  Hence, for the requires maxSet(a) >= i example, it would issue a warning if it cannot determine that the array passed as a is allocated to hold at least as many elements as the value passed as i.  If minSet or maxSet is used in an ensures clause, it indicates the state of a buffer after the function returns.  Checking at the call site proceeds by assuming the postconditions are true after the call returns. 

 

For checking, we use an annotated version of the standard library headers.  For example, the function strcpy is annotated as[4]:

char *strcpy (char *s1, const char *s2)    

 /*@requires maxSet(s1) >= maxRead(s2)@*/

 /*@ensures maxRead(s1) == maxRead(s2) /\ result == s1@*/;

The requires clause specifies the precondition that the buffer s1 is allocated to hold at least as many char­acters as are readable in the buffer s2 (that is, the number of characters up to and including its null terminator).  The postcondition reflects the behavior of strcpy – it copies the string pointed to by s2 into the buffer s1, and returns that buffer.  In ensures clauses, we use the result keyword to denote the value returned by the function.

 

Many buffer overflows result from using library functions such as strcpy in unsafe ways.  By annotating the standard library, many buffer overflow vulnerabilities can be detected even before adding any annotations to the target program.  Selected annotated standard library functions are shown in Appendix A.  

4.      Experience

In order to test our approach, we used our tool on wu-ftpd, a popular open source ftp server, and BIND (Berkeley Internet Name Domain), a set of domain name tools and libraries that is considered the reference implementation of DNS.  This section describes the process of running LCLint on these applications, and illustrates how our checking detected both known and unknown buffer overflow vulnerabilities in each  appli­cation.

4.1 wu-ftpd

We analyzed wu-ftp-2.5.0[5], a version with known se­cur­ity vulnerabilities. 

Running LCLint is similar to running a compiler.  It is typically run from the command line by listing the source code files to check, along with flags that set checking parameters and control which classes of warnings are reported.  It takes just over a minute for LCLint to analyze all 17 000 lines of wu-ftpd.  Running LCLint on the entire unmodified source code for wu-ftpd without adding any annotations resulted in 243 warnings related to buffer overflow checking.

Consider a representative message[6]:

ftpd.c:1112:2: Possible out-of-bounds store.  Unable to

   resolve constraint:

      maxRead ((entry->arg[0] @ ftpd.c:1112:23)) <= (1023)

   needed to satisfy precondition:

      requires maxSet ((ls_short @ ftpd.c:1112:14))

                  >= maxRead ((entry->arg[0] @ ftpd.c:1112:23))

   derived from strcpy precondition:

      requires maxSet (<param 1>) >= maxRead (<param 2>)

Relevant code fragments are shown below with line 1112 in bold: 

char  ls_short[1024];

extern struct aclmember * getaclentry(char *keyword, struct aclmember **next);

int main(int argc, char **argv, char **envp)

{

  

   entry = (struct aclmember *) NULL;

   if (getaclentry("ls_short", &entry)

       && entry->arg[0]

       && (int)strlen(entry->arg[0]) > 0)

      {

        strcpy(ls_short,entry->arg[0]);

       

This code is part of the initialization code that reads configuration files.  Several buffer overflow vul­ner­a­bil­i­­ties were found in the wu-ftpd initialization code.  Although this vulnerability is not likely to be exploited, it can cause security holes if an untrustworthy user is able to alter configuration files.

 

The warning message indicates that a possible out-of-bounds store was detected on line 1112 and contains information about the constraint LCLint was unable to resolve.  The warning results from the function call to strcpy.  LCLint generates a pre­con­dit­ion constraint corresponding to the strcpy requires clause maxSet(s1) >= maxRead(s2) by substituting the actual parameters:

   maxSet (ls_short @ ftpd.c:1112:14) >= maxRead (entry->arg[0] @ ftpd.c:1112:23). 

Note that the locations of the expressions passed as actual parameters are recorded in the constraint.  Since values of expressions may change through the code, it is important that constraints identify values at particular program points. 

The global variable ls_short was declared as an array of 1024 characters.  Hence, LCLint determines maxSet (ls_short) is 1023.  After the call to getaclentry, the local entry->arg[0] points to a string of arbitrary length read from the configuration file.  Because there are no annotations on the getaclentry function, LCLint does not assume anything about its behavior.  In particular, the value of maxRead (entry->arg[0]) is unknown. LCLint reports a possible buffer misuse, since the constraint derived from the strcpy requires clause may not be satisfied if the value of maxRead (entry->arg[0]) is greater than 1023.

 

To fix this problem, we modified the code to handle these values safely by using strncpy.  Since ls_short is a fixed size buffer, a simple change to use strncpy and store a null character at the end of the buffer is sufficient to ensure that the code is safe.[7] 

 

In other cases, eliminating a vulnerability involved both changing the code and adding annotations.  For example, LCLint generated a warning for a call to strcpy in the function acl_getlimit:

int acl_getlimit(char *class, char *msgpathbuf) {

 int limit;

 struct aclmember *entry = NULL;

 

 if (msgpathbuf) *msgpathbuf = '\0';

 while (getaclentry("limit", &entry)) {

  

   if (!strcasecmp(class, entry->arg[0]))

   {

    

     if (entry->arg[3]

         && msgpathbuf != NULL)

       strcpy(msgpathbuf, entry->arg[3]);

    

If the size of msgputhbuf is less than the length of the string in entry->arg[3], there is a buffer overflow.  To fix this we replaced the strcpy call with a safe call to strncpy:

 strncpy(msgpathbuf, entry->arg[3], 199);

 msgpathbuf[199] = '\0';        

and added a requires clause to the function declaration:

 /*@requires maxSet(msgpathbuf) >= 199@*/

The requires clause documents an assumption (that may be incorrect) about the size of the buffer passed to acl_getlimit.  Because of the constraints denoted by the requires clauses, LCLint does not report a warning for the call to strncpy.

 

When call sites are checked, LCLint produces a warn­ing if it is unable to determine that this requires clause is satisfied.  Originally, we had modified the function acl_getlimit by adding the precondition maxSet (msgpathbuf) >= 1023.  After adding this precondition, LCLint produced a warning for a call site that passed a 200-byte buffer to acl_getlimit.  Hence, we re­placed the requires clause with the stronger constraint and used 199 as the parameter to strncpy.

 

This vulnerability was still present in the current ver­sion of wu-ftpd.  We contacted the wu-ftpd developers who acknowledged the bug but did not consider it security critical since the string in question is read from a local file not user input [Luckin01, Lundberg01].

 

In addition to the previously unreported buffer overflows in the initialization code, LCLint detected a known buffer overflow in wu-ftpd.  The buffer overflow occurs in the function do_elem shown below, which passes a global buffer and its parameters to the library function strcat.  The function mapping_chdir calls do_elem with a value entered by the remote user as its parameter.  Because wu-ftpd fails to perform sufficient bounds checking, a remote user is able to exploit this vulnerability to overflow the buffer by carefully creating a series of directories and executing the cd command.[8]

 

char mapped_path [200];

void do_elem(char *dir) {

  

   if (!(mapped_path[0] == '/'

       && mapped_path[1] == '\0'))

     strcat (mapped_path, "/");

   strcat (mapped_path, dir);

}

 

LCLint generates warnings for the unsafe calls to strcat.  This was fixed in latter versions of wu-ftpd by calling strncat instead of strcat.

Because of the limitations of static checking, LCLint some­­times generates spurious error messages.  If the user believes the code is correct, annotations can be added to precisely suppress spurious messages. 

Often the code was too complex for LCLint to analyze correctly.  For example, LCLint reports a spurious warning for this code fragment since it cannot determine that ((1.0*j*rand()) / (RAND_MAX + 1.0)) always produces a value between 1 and j:

  i = passive_port_max

      – passive_port_min + 1;

  port_array = calloc (i, sizeof (int));

  for (i = 3; … && (i > 0); i--) {

    for (j = passive_port_max

             – passive_port_min + 1;

         … && (j > 0); j--) {

       k = (int) ((1.0 * j * rand())

                  / (RAND_MAX + 1.0));

       pasv_port_array [j-1]

          = port_array [k];

Determining that the port_array[k] reference is safe would require far deeper analysis and more precise specifications than is feasible within a lightweight static checking tool.

 

Detecting buffer overflows with LCLint is an iterative process.   Many of the constraints we found involved functions that are potentially unsafe.  We added function preconditions to satisfy these constraints where possible.  In certain cases, the code was too convoluted for LCLint to determine that our preconditions satisfied the constraints.  After convincing ourselves the code was correct, we added annotations to suppress the spurious warnings.

 

Before any annotations were added, running LCLint on wu-ftpd re­sulted in 243 warn­ings each corresponding to an unresolved constraint.   We added 22 annotations to the source code through an iterative process similar to the examples described above.  Nearly all of the annotations were used to indicate preconditions constraining the value of maxSet for function parameters. 

 

After adding these annotations and modifying the code, running LCLint produced 143 warnings.  Of these, 88 reported unresolved constraints involving maxSet.  While we believe the remaining warnings did not indicate bugs in wu-ftpd, LCLint’s analyses were not sufficiently powerful to determine the code was safe.  Although this is a higher number of spurious warnings than we would like, most of the spurious warnings can be quickly understood and suppressed by the user.  The source code contains 225 calls to the potentially buffer overflowing functions strcat, strcpy, strncat, strncpy, fgets and gets.  Only 18 of the unresolved warnings resulted from calls to these functions.  Hence, LCLint is able to determine that 92% of these calls are safe automatically.  The other warnings all dealt with classes of problems that could not be detected through simple lexical techniques.

4.2 BIND

BIND is a key component of the Internet infrastructure.  Recently, the Wall Street Journal iden­ti­fied buffer overflow vulnerabilities in BIND as a critical threat to the Internet [WSJ01].  We focus on named, the DNS sever portion of BIND, in this case study.  We analyzed BIND version 8.2.2p7[9], a version with known bugs.  BIND is larger and more complex than wu-ftpd. The name server portion of BIND, named, contains approximately 47 000 lines of C including shared li­bra­ries.  LCLint took less than three and a half minutes to check all of the named code. 

 

We limited our analysis to a subset of named because of the time required for human analysis. We focused on three files: ns_req.c and two library files that contain functions which are called extensively by ns_req.c:  ns_name.c and ns_sign.c.  These files contain slightly more than 3 000 lines of code.

 

BIND makes extensive use of functions in its internal library rather than C library functions. In order to accurately analyze individual files, we needed to annotate the library header files.  The most accurate way to annotate the library would be to iteratively run LCLint on the library and add annotations.  However, the library was extremely large and contains deeply nested call chains.  To avoid the human analysis this would require, we added annotations to some of the library functions without annotating all the dependent functions.  In many cases, we were able to guess preconditions by using comments or the names of function parameters.  For example, several functions took a pointer parameter (p) and another parameter encoding it size (psize), from which we inferred a precondition MaxSet(p) >= (psize – 1).  After annotating selected BIND library functions, we were able to check the chosen files without needing to fully annotate all of BIND.

 

LCLint produces warnings for a series of unguarded buffer writes in the function req_query.  The code in question is called in response to a specific type of query which requests information concerning the domain name server version. BIND appends a response to the buffer containing the query that includes a global string read from a configuration file.  If the default configuration is used, the code is safe because this function is only called with buffers that are large enough to store the response.  However, the restrictions on the safe use of this function are not obvious and could easily be overlooked by someone modifying the code.  Additionally, it is possible that an administrator could reconfigure BIND to use a value for the server version string large enough to make the code unsafe.  The BIND developers agreed that a bounds check should be inserted to eliminate this risk [Andrews01].

 

BIND uses extensive run time bounds checking. This type of defensive programming is important for writing secure programs, but does not guarantee that a program is secure.  LCLint detected a known buffer overflow in a function that used run time checking but specified buffer sizes incorrectly.[10]

 

The function ns_req examines a DNS query and gen­er­ates a response.  As part of its message processing, it looks for a signature and signs its response with the function ns_sign.  LCLint reported that it was unable to satisfy a precondition for ns_sign that requires the size of the message buffer be accurately described by a size parameter.  This precondition was added when we initially annotated the shared library.  A careful hand analysis of this function reveals that to due to careless modification of variables denoting buffer length, it is possible for the buffer length to be specified incorrectly if the message contains a signature but a valid key is not found.  This buffer overflow vulnerability was introduced when a digital signature feature was added to BIND (ironically to increase security).  Static analysis tools can be used to quickly alert programmers to assumptions that are broken by incremental code changes.

                               

Based on our case studies, we believe that LCLint is a useful tool for improving the security of programs.  It does not detect all possible buffer overflow vulnerabilities, and it can generate spurious warnings.  In practice, however, it provides programmers concerned about security vulnerabilities with useful assistance, even for large, complex programs.  In addition to aiding in the detection of exploitable buffer overflows, the process of adding annotations to code encourages a disciplined style of programming and produces programs that include reliable and precise documentation.

5.      Implementation

Our analysis is implemented by combining traditional compiler data flow analyses with constraint generation and resolution.  Programs are analyzed at the function level; all interprocedural analyses are done using the information contained in annotations. 

 

We support four types of constraints corresponding to the annotations introduced in Section 2: maxSet, minSet, maxRead, and minRead.  Constraints can also contain constants and variables and allow the arithmetic operations: + and -.  Terms in constraints can refer to any C expression, although our analysis will not be able to evaluate some C expressions statically.

 

The full constraint grammar is:

constraint Ž (requires | ensures)

constraintExpression relOp constraintExpression

relationalOp Ž == | > | >= | < | <=

constraintExpression Ž

    constraintExpression binaryOp constraintExpresion

  | unaryOp ( constraintExpression )

  | term

binaryOp Ž + | -

unaryOp Ž maxSet | maxRead | minSet | minRead

term Ž variable | C expression | literal | result

 

Source-code annotations allow arbitrary constraints (as defined by our constraint grammar) to be specified as the preconditions and postconditions of functions.  Constraints can be conjoined (using /\), but there is no support for disjunction.  All variables used in constraints have an associated location.  Since the value stored by a variable may change in the function body, it is important that the constraint resolver can distinguish the value at different points in the program execution.

 

Constraints are generated at the expression level and stored in the corresponding node in the parse tree.  Constraint resolution is integrated with the checking by resolving constraints at the statement level as checking traverses up the parse tree.  Although this limits the power of our analysis, it ensures that it will be fast and simple. The remainder of this section describes briefly how constraints are represented, generated and resolved.

 

Constraints are generated for C statements by traversing the parse tree and generating constraints for each subexpression.  We determine constraints for a statement by conjoining the constraints of its subexpressions.  This assumes subexpressions cannot change state that is used by other subexpressions of the same expression.  The semantics of C make this a valid assumption for nearly all expressions – it is undefined behavior in C for two subexpressions not separated by a sequence point to read and write the same data.  Since LCLint detects and warns about this type of undefined behavior, it is reasonable for the buffer overflow checking to rely on this assumption.  A few C expressions do have intermediate sequence points (such as the comma operator which specifies that the left operand is always evaluated first) and cannot be analyzed correctly by our simplified assumptions.  In practice, this has not been a serious limitation for our analysis.

 

Constraints are resolved at the statement level in the parse tree and above using axiomatic semantics techniques.  Our analysis attempts to resolve constraints using postconditions of earlier statements and function preconditions.  To aid in constraint resolution, we simplify constraints using standard algebraic techniques such as combining constants and substituting terms.  We also use constraint-specific simplification rules such as maxSet(ptr + i) = maxSet(ptr) - i.  We have similar rules for maxRead, minSet, and minRead.

 

Constraints for statement lists are produced using normal axiomatic semantics rules and simple logic to combine the constraints of individual statements.  For example, the code fragment

1      t++;

2      *t = ‘x’;

3      t++;

leads to the constraints:

requires maxSet(t @ 1:1) >= 1,

ensures maxRead(t @ 3:4) >= -1 and

ensures (t @ 3:4) = (t @ 1:1) + 2.

The assignment to *t on line 2 produces the constraint requires maxSet(t @ 2:2) >= 0.  The increment on line 1 produces the constraint ensures (t@1:4) = (t@1:1) + 1.  The increment constraint is substituted into the maxSet constraint to produce requires maxSet (t@1:1 + 1) >= 0.  Using the constraint-specific simplification rule, this simplifies to requires maxSet (t@1:1) - 1 >= 0 which further simplifies to requires maxSet(t @ 1:1) >= 1.

6.      Control Flow

Statements involving control flow such as while and for loops and if statements, require more complex analysis than simple statement lists.  For if statements and loops, the predicate often provides a guard that makes a possibly unsafe operation safe.  In order to analyze such constructs well, LCLint must take into account the value of the predicate on different code paths.  For each predicate, LCLint generates three lists of postcondition constraints: those that hold regardless of the truth value of the predicate, those that hold when the predicate evaluates to true, and those that hold when the predicate evaluates to false. 

 

To analyze an if statement, we develop branch specific guards based on our analysis of the predicate and use these guards to resolve constraints within the body.  For example, in the statement

   if (sizeof (s1) > strlen (s2))

    strcpy(s1, s2);

if s1 is a fixed-size array, sizeof (s1) will be equal to maxSet(s1) + 1.  Thus the if predicate allows LCLint to determine that the constraint maxSet(s1) >= maxRead(s2) holds on the true branch.  Based on this constraint LCLint determines that the call to strcpy is safe.

 

Looping constructs present additional problems.  Previous versions of LCLint made a gross simplification of loop behavior: all for and while loops in the program were analyzed as though the body executed either zero or one times.  Although this is clearly a ridiculous assumption, it worked surprisingly well for the types of analyses done by LCLint.  For the buffer overflow analyses, this simplified view of loop semantics does not provide satisfactory results – to determine whether buf[i] is a potential buffer overflow, we need to know the range of values i may represent.  Analyzing the loop as though its body executed only once would not provide enough information about the possible values of i.

 

In a typical program verifier, loops are handled by requiring programmers to provide loop invariants.  Despite considerable effort [Wegbreit75, Cousot77, Collins88, IS97, DLNS98, SI98], no one has yet been able to produce tools that generate suitable loop invariants automatically.  Some promising work has been done towards discovering likely invariants by executing programs [ECGN99], but these techniques require well-constructed test suites and many problems remain before this could be used to produce the kinds of loop invariants we need.  Typical programmers are not able or willing to annotate their code with loop invariants, so for LCLint to be effective we needed a method for handling loops that produces better results than our previous gross simplification method, but did not require expensive analyses or programmer-supplied loop invariants.

 

Our solution is to take advantage of the idioms used by typical C programmers.  Rather than attempt to handle all possible loops in a general way, we observe that a large fraction of the loops in most C programs are written in a stylized and structured way.  Hence, we can develop heuristics for identifying and analyzing loops that match certain common idioms.  When a loop matches a known idiom, corresponding heuristics can be used to guess how many times the loop body will execute.  This information is used to add additional preconditions to the loop body that constrain the values of variables inside the loop. 

 

To further simplify the analysis, we assume that any buffer overflow that occurs in the loop will be apparent in either the first or last iterations.  This is a reasonable assumption in almost all cases, since it would be quite rare for a program to contain a loop where the extreme values of loop variables were not on the first and last iterations.  This allows simpler and more efficient loop checking. To analyze the first iteration of the loop, we treat the loop as an if statement and use the techniques described above.  To analyze the last iteration we use a series of heuristics to determine the number of loop iterations and generate additional constraints based on this analysis.

 

An example loop heuristic analyzes loops of the form

  for (index = 0; expr; index++) body

where the body