Check out the new USENIX Web site. next up previous
Next: Related Work Up: CMC: A Pragmatic Approach Previous: The AODV model


Table 4 summarizes the set of bugs found using CMC in the three AODV implementations. The bugs range from simple memory errors to protocol invariant violations. We found a total of $40$ bugs of which $34$ were unique. The Kernel AODV implementation has $5$ bugs (shown in parenthesis in the table) that are instances of the same bug in mad-hoc. Also, the AODV specification bug causes a routing loop in all three implementations.

Currently, CMC stops after finding the first bug in the model. It prints the failed assertion and a trace of events starting from the initial state to the error state. After a bug is fixed, CMC is run again to find bugs iteratively. Most bugs were found within minutes of model checking time; the longest took roughly 40 minutes.

We describe the bugs below at a high level to give a feel for the breadth of coverage and focus on four of the more interesting bugs to give a feel for its depth.

Memory errors. The first three error classes illustrate the mishandling of dynamically allocated memory: not checking for allocation failure (12 errors), not freeing allocated memory (8 errors), and using memory after freeing it (2 errors).

Figure 4: Mishandled malloc failure: if malloc fails, the loop will exit after allocating less than rerrhdr_msg.dst_cnt buffers. The two errors are in code that assumes rerrhdr_msg.dst_cnt buffers were allocated. Both lead to segmentation faults.

All implementations checked that the pointer returned by malloc was not null. However, functions that call malloc can also indirectly return null pointers when allocations fail. The code only erratically checked such cases. Since CMC directly executes the implementation, such errors were manifested as segmentation faults.

Most of the memory-related bugs were straightforward. However, there were several interesting errors where the code would correctly check for allocation failure, but its recovery code was broken. Figure 4 gives a representative error. Here, the code attempts to allocate rerrhdr_msg.dst_cnt temporary message buffers. It correctly checks for malloc failure and breaks out of the loop. However, the code after the loop assumes that rerrhdr_msg.dst_cnt list entries were indeed allocated. This assumption leads to two bugs. The first (intraprocedural) error attempts to dequeue rerrhdr_msg.dst_cnt buffers off of the rerrhdr_msg.unr_dst list in order to free them. Since the list has fewer entries than expected, the code will attempt to use a null pointer and get a segmentation fault. The second (interprocedural) error, in rec_rerr, similarly tries to walk over the rerrhdr_msg.dst_cnt list entries and seg faults because the list is too short.

Most of the memory leaks were similarly caused by mishandled allocation failures. Commonly, code would attempt to do two memory allocations and, if the first allocation succeeded but the second failed, would return with an error, leaking the first pointer.

Unexpected messages. CMC detected two places where unexpected messages would cause mad-hoc to crash with a segmentation violation. Figure 5(a) shows one of the errors. The error happens because AODV encodes state in its messages. In this error:

  1. The current node n receives a Route Request (RREQ) message from node req requesting a route to node dst.

  2. Node n inserts a reverse route to req in its routing table.

  3. Then, n looks up the route to dst in its routing table.

  4. If the route is not there, n re-broadcasts the RREQ message. The RREQ message contains the IP address of both the destination node dst and the requesting node req.

  5. The response to this request, a Route Response (RREP) message, includes both the route to dst and the IP address of req.

  6. Node n inserts the new route to dst in its routing table. It then attempts to relay this route to req by looking up the route to req. In the normal case, this lookup will return the reverse route inserted in Step 2.

This last step causes the error. The code assumes the normal case and uses the result of the routing table lookup for req without checking for null. However, the lookup could fail for two reasons. First, if the machine has rebooted, the implementation will start with an empty routing table. If an old RREP message arrives after the reboot, the lookup of req will return a null pointer. Second, an attacker could send a bogus RREP with a node address that does not exist, crashing the router.

Figure 5: Two bugs: an unexpected message and an invalid route response. (a) An unexpected route-response (RREP) message causes getentry to return null, crashing the machine. (b) If a route returned by getentry has been invalidated the hopcount will be 255. However, the code does not check for this and sends the message.
\begin{figure}\begin{verbatim}/* madhoc:rrep.c:rec_rrep */
/* If I'm not t...
...gram(my_info, &my_rrep,
sizeof(my_rrep)) == -1)

Invalid messages. There were $4$ cases of invalid packets being created, $2$ cases of using uninitialized variables (these could not be detected by gcc -Wall), and $2$ cases where invalid routes were used to send routing updates, violating the AODV specification ( Figure 5(b) gives a representative example). CMC also detected $2$ instances of integer overflow which resulted in program assertion failures. The implementations use an 8 bit integer to store the hop counts and use $255$ to represent a hopcount of infinity. In these error cases, an infinite hopcount was erroneously incremented to $0$.

Routing loops. CMC found three routing loops. Two of these bugs are caused by implementation errors. The third routing loop is due to an error in the AODV protocol specification.

The first routing loop is caused when the implementation fails to increment a sequence number while processing specific RERR messages.

Another loop is caused when the implementation performs a sequence number comparison before a subsequent increment, while the AODV specification requires the comparison to be done after the increment.

The specification bug. This bug involved the handling of RERR (``route error'') messages. When a node receives an RERR from its next hop, it sets the sequence number of its route to the sequence number in an RERR message. Under normal conditions this is the right thing to do. However, when the underlying link layer can reorder messages, the RERR message might have an outdated sequence number resulting in the node setting its sequence number to an older version. This can ultimately result in a routing loop. This bug was mentioned to the authors of the protocol with a suggested fix. Both the bug and the fix were accepted by the protocol authors[26]. Figure 6 gives both the error and the fix.

Figure 6: The specification bug: the sequence number from an incoming message is used without validation, causing ``time'' to go backwards when messages are reordered. Fortunately, while the error was not obvious (surviving 6 rounds of specification revisions) the fix is trivial.
\begin{figure}\begin{verbatim}/* madhoc:rerr.c:rec_rerr */
// Get pointer ...
... -1;
tmp_rtentry->dst_seq = tmp_unr_dst->unr_dst_seq;\end{verbatim}\end{figure}

The specification bug was found by running $4$ AODV nodes using a depth-first search of the state space. CMC came up with an error trace of length $93$. Using best-first search, it was possible to find traces as short as $27$. Performing a breadth-first search of the state space would give the shortest trace. However, breadth-first search on AODV ran out of resources without finding the bug. A carefully hand-crafted simulation of the bug required at least $20$ transitions. Such a complex error would be very difficult to catch using conventional means of testing.

next up previous
Next: Related Work Up: CMC: A Pragmatic Approach Previous: The AODV model
Madanlal Musuvathi 2002-10-08