
Details of the programming interface

An application might not always be interested in events arriving on all of its open file descriptors. For example, as mentioned in Section 8.1, the Squid proxy server temporarily ignores data arriving in dribbles; it would rather process large buffers, if possible.

Therefore, our API includes a system call allowing a thread to declare its interest (or lack of interest) in a file descriptor:

  #define    EVENT_READ        0x1
  #define    EVENT_WRITE       0x2
  #define    EVENT_EXCEPT      0x4

  int declare_interest(int fd,
                       int interestmask,
                       int *statemask);

The thread calls this procedure with the file descriptor in question. The interestmask indicates whether the thread is interested in reading from or writing to the descriptor, or in exception events. If interestmask is zero, then the thread is no longer interested in any events for the descriptor. Closing a descriptor implicitly removes any declared interest.
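The interest-mask bookkeeping described above can be illustrated with a small user-space model. This is only a sketch, not the kernel implementation: the names model_declare_interest(), model_close(), model_interest and MODEL_MAX_FD are hypothetical, and the choice to return -1 for an out-of-range descriptor or an unknown mask bit is an assumption, since the text does not specify error behavior.

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_READ   0x1
#define EVENT_WRITE  0x2
#define EVENT_EXCEPT 0x4

#define MODEL_MAX_FD 64   /* hypothetical table size for this toy */

/* Hypothetical stand-in for one thread's interest set, indexed by fd. */
int model_interest[MODEL_MAX_FD];

/* Models only the interest bookkeeping of declare_interest().
 * The statemask argument is accepted but never filled in here. */
int model_declare_interest(int fd, int interestmask, int *statemask)
{
    (void)statemask;
    if (fd < 0 || fd >= MODEL_MAX_FD)
        return -1;                       /* assumed error convention */
    if (interestmask & ~(EVENT_READ | EVENT_WRITE | EVENT_EXCEPT))
        return -1;                       /* assumed: unknown bits rejected */
    model_interest[fd] = interestmask;   /* a zero mask withdraws interest */
    return 0;
}

/* Closing a descriptor implicitly removes any declared interest. */
void model_close(int fd)
{
    assert(fd >= 0 && fd < MODEL_MAX_FD);
    model_interest[fd] = 0;
}
```

A thread interested in incoming data and exceptions on descriptor 5 would pass EVENT_READ | EVENT_EXCEPT, and would later pass 0 to withdraw that interest.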

Once the thread has declared its interest, the kernel tracks event arrivals for the descriptor. Each arrival is added to a per-thread queue. If multiple threads are interested in a descriptor, a per-socket option selects between two ways to choose the proper queue (or queues). The default is to enqueue an event-arrival record for each interested thread, but by setting the SO_WAKEUP_ONE flag, the application indicates that it wants an event arrival delivered only to the first eligible thread.
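The two delivery policies can be contrasted with a toy model. This is a sketch under stated assumptions: it treats "eligible" as simply "has declared interest" and picks the lowest-numbered such thread, which the text does not specify, and the names model_deliver(), interested, queued and MODEL_THREADS are hypothetical.

```c
#include <assert.h>

#define MODEL_THREADS 4   /* hypothetical number of threads in the toy */

int interested[MODEL_THREADS];  /* 1 if thread i declared interest in the socket */
int queued[MODEL_THREADS];      /* event records currently on thread i's queue */

/* Delivers one event arrival. With wakeup_one == 0 every interested thread
 * gets a record (the default); with wakeup_one != 0 only the first
 * interested thread does (modeling SO_WAKEUP_ONE).
 * Returns the number of queues that received a record. */
int model_deliver(int wakeup_one)
{
    int delivered = 0;
    for (int i = 0; i < MODEL_THREADS; i++) {
        if (!interested[i])
            continue;
        queued[i]++;
        delivered++;
        if (wakeup_one)
            break;          /* assumed: "first eligible" == lowest index */
    }
    assert(delivered <= MODEL_THREADS);
    return delivered;
}
```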

If the statemask argument is non-NULL, then declare_interest() also reports the current state of the file descriptor. For example, if the EVENT_READ bit is set in this value, then the descriptor is ready for reading. This feature avoids a race in which a state change occurs after the file has been opened (perhaps via an accept() system call) but before declare_interest() has been called. The implementation guarantees that the statemask value reflects the descriptor's state before any events are added to the thread's queue. Otherwise, to avoid missing any events, the application would have to perform a non-blocking read or write after calling declare_interest().
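The use of statemask right after accepting a connection might look like the following sketch. Because declare_interest() is not available outside the prototype kernel, the example runs against a hypothetical user-space stand-in (model_declare_interest(), with readiness kept in the array model_state); register_connection() and its return convention are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_READ   0x1
#define EVENT_WRITE  0x2
#define EVENT_EXCEPT 0x4

#define MODEL_MAX_FD 64            /* hypothetical table size */

int model_state[MODEL_MAX_FD];     /* hypothetical current readiness per fd */
int model_interest[MODEL_MAX_FD];  /* hypothetical declared interest per fd */

/* Stand-in for declare_interest(): records the interest and, if asked,
 * reports the descriptor's current state in the same call. */
int model_declare_interest(int fd, int interestmask, int *statemask)
{
    if (fd < 0 || fd >= MODEL_MAX_FD)
        return -1;                 /* assumed error convention */
    model_interest[fd] = interestmask;
    if (statemask != NULL)
        *statemask = model_state[fd];
    return 0;
}

/* Called right after accept() returns fd.
 * Returns 1 if data is already pending and the caller should read now,
 * 0 if the caller can simply wait for an event, -1 on failure. */
int register_connection(int fd)
{
    int state = 0;
    if (model_declare_interest(fd, EVENT_READ, &state) < 0)
        return -1;
    assert((state & ~(EVENT_READ | EVENT_WRITE | EVENT_EXCEPT)) == 0);
    return (state & EVENT_READ) != 0;
}
```

Without the statemask result, the caller would have to issue a speculative non-blocking read on every new connection to be sure that no data had arrived in the window between accept() and declare_interest().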

To wait for additional events, a thread invokes another new system call:

    typedef struct {
      int fd;
      unsigned mask;
    } event_descr_t;

    int get_next_event(int array_max,
                       event_descr_t *ev_array,
                       struct timeval *timeout);

The ev_array argument is a pointer to an array, of length array_max, of values of type event_descr_t. If any events are pending for the thread, the kernel dequeues, in FIFO order, up to array_max events. It reports these dequeued events in the ev_array result array. The mask bits in each event_descr_t record, with the same definitions as used in declare_interest(), indicate the current state of the corresponding descriptor fd. The function return value gives the number of events actually reported.

Because an application can request an arbitrary number of event reports in one call, it can amortize the cost of this call over multiple events. However, if at least one event is queued when the call is made, it returns immediately; we do not block the thread simply to fill up its ev_array.

If no events are queued for the thread, then the call blocks until at least one event arrives, or until the timeout expires.
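The dequeue semantics of get_next_event() (FIFO order, at most array_max records, immediate return when anything is queued) can be modeled in user space as below. This is a sketch, not the kernel code: model_enqueue(), model_get_next_event(), model_queue and MODEL_QUEUE_LEN are hypothetical names, and the toy returns 0 on an empty queue instead of blocking, so the timeout argument is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>   /* struct timeval */

typedef struct {
    int fd;
    unsigned mask;
} event_descr_t;

#define MODEL_QUEUE_LEN 32   /* hypothetical capacity of the toy queue */

event_descr_t model_queue[MODEL_QUEUE_LEN];  /* circular buffer */
int model_head;    /* index of the oldest queued record */
int model_count;   /* number of queued records */

/* Appends one event record; returns 0, or -1 if the toy queue is full. */
int model_enqueue(int fd, unsigned mask)
{
    if (model_count == MODEL_QUEUE_LEN)
        return -1;
    int tail = (model_head + model_count) % MODEL_QUEUE_LEN;
    model_queue[tail].fd = fd;
    model_queue[tail].mask = mask;
    model_count++;
    return 0;
}

/* Dequeues up to array_max records in FIFO order and returns how many
 * were copied out. Unlike the real call, this toy never blocks. */
int model_get_next_event(int array_max, event_descr_t *ev_array,
                         struct timeval *timeout)
{
    (void)timeout;
    assert(ev_array != NULL);
    int n = 0;
    while (n < array_max && model_count > 0) {
        ev_array[n++] = model_queue[model_head];
        model_head = (model_head + 1) % MODEL_QUEUE_LEN;
        model_count--;
    }
    return n;
}
```

With three events queued and array_max set to 2, the first call returns two records and the second returns the remaining one; neither call waits for the array to fill.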

Note that in a multi-threaded application (or in an application where the same socket or file is simultaneously open via several descriptors), a race could make the descriptor unready before the application reads the mask bits. The application should use non-blocking operations to read or write these descriptors, even if they appear to be ready. The implementation of get_next_event() attempts to report the current state of a descriptor, rather than simply reporting the most recent state transition, and internally suppresses any reports that are no longer meaningful; this should reduce the frequency of such races.
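One way to follow this advice with standard POSIX calls is sketched below. The helper names set_nonblocking() and read_if_ready(), and the convention of returning -2 for a stale notification, are hypothetical; only fcntl(), read() and the errno values are standard.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Puts fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Reads from a descriptor that was reported as ready.
 * Returns the byte count (0 means end of file), -2 if the descriptor
 * turned out not to be ready after all, or -1 on any other error. */
ssize_t read_if_ready(int fd, void *buf, size_t len)
{
    ssize_t n;
    assert(buf != NULL);
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);   /* retry if interrupted by a signal */
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;                       /* stale report: another reader won */
    return n;
}
```

A caller that sees -2 simply goes back to waiting for events; because the descriptor is non-blocking, the stale report costs one system call rather than stalling the thread.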

The implementation also attempts to coalesce multiple reports for the same descriptor. This may be of value when, for example, a bulk data transfer arrives as a series of small packets. The application might consume all of the buffered data in one system call; it would be inefficient if the application had to consume dozens of queued event notifications corresponding to one large buffered read. However, it is not possible to entirely eliminate duplicate notifications, because of races between new event arrivals and the read, write, or similar system calls.
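The idea of coalescing can be illustrated with a toy queue that keeps at most one record per descriptor. This is a simplified sketch and not the kernel's data structure: it merges reports by OR-ing their mask bits, whereas the implementation described here reports the descriptor's current state, and the names coalescing_enqueue(), cq, cq_len, cq_slot and MODEL_MAX_FD are hypothetical.

```c
#include <assert.h>

#define EVENT_READ   0x1
#define EVENT_WRITE  0x2
#define EVENT_EXCEPT 0x4

#define MODEL_MAX_FD 16   /* hypothetical number of descriptors in the toy */

typedef struct {
    int fd;
    unsigned mask;
} event_descr_t;

event_descr_t cq[MODEL_MAX_FD];  /* at most one record per fd, so this cannot overflow */
int cq_len;                      /* number of records currently queued */
int cq_slot[MODEL_MAX_FD];       /* index + 1 of fd's record in cq, or 0 if none */

/* Queues an event for fd. Returns 1 if a new record was added,
 * 0 if the event was merged into an existing record, -1 on a bad fd. */
int coalescing_enqueue(int fd, unsigned mask)
{
    if (fd < 0 || fd >= MODEL_MAX_FD)
        return -1;
    if (cq_slot[fd] != 0) {
        cq[cq_slot[fd] - 1].mask |= mask;   /* merge into the pending record */
        return 0;
    }
    assert(cq_len < MODEL_MAX_FD);
    cq[cq_len].fd = fd;
    cq[cq_len].mask = mask;
    cq_slot[fd] = ++cq_len;
    return 1;
}
```

In this model, a burst of small packets on one connection produces a single queued record, so the application sees one notification for what it will consume in one read.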

Gaurav Banga