################################################
	   #                                              #
	   # ##   ## ###### ####### ##    ## ## ##     ## #
	   # ##   ## ##  ## ##      ###   ## ##  ##   ##  #
	   # ##   ## ##     ##      ####  ## ##   ## ##   #
	   # ##   ## ###### ######  ## ## ## ##    ###    #
	   # ##   ##     ## ##      ##  #### ##   ## ##   #
	   # ##   ## ##  ## ##      ##   ### ##  ##   ##  #
	   # ####### ###### ####### ##    ## ## ##     ## #
	   #                                              #
	   ################################################


	 The following paper was originally published in the
      Proceedings of the USENIX 1996 Annual Technical Conference
		San Diego,  California,  January 1996


	For more information about USENIX Association contact:

		   1. Phone:	510 528-8649
		   2. FAX:	510 548-5738
		   3. Email:	office@usenix.org
		   4. WWW URL:  https://www.usenix.org


		LINUX DEVICE DRIVER EMULATION IN MACH

		    Shantanu Goel and Dan Duchamp

		     Computer Science Department
		         Columbia University


			      ABSTRACT

We describe the design and performance of code added to the Mach
microkernel (Mach 4.0, version UK02p21) that permits one to build a
Mach kernel that includes unmodified Linux device drivers.  We have
written emulation code to support all Linux 1.3.35 network and SCSI
drivers for the ISA and PCI I/O buses.  Emulation increases latency,
but very little.  The degree depends on both device and operation, and
varies from 2 microseconds for receiving small (60 byte) network
packets up to 197 microseconds for writing 16KB to an ISA SCSI device.


1. INTRODUCTION

We describe the design and performance of code that permits one to
build a Mach microkernel (Utah release 4.0, version UK02p21) that
includes the completely unmodified source for device drivers that have
been written for Linux (version 1.3.35).  Our code, which consists of
some changes and additions to Mach as well as run-time emulation of
Linux calls, handles all of Linux's block drivers, network drivers,
and SCSI host adapters for the ISA or PCI buses---53 drivers in all
(block drivers for floppy, IDE hard disk, and SCSI; 30 ISA network
devices; 4 PCI network devices; 10 ISA SCSI host adapters; and 5 PCI
SCSI host adapters).

The motivation for this work is to improve the usefulness of the Mach
microkernel on Intel x86 platforms.  We are wedded to Mach because
some of the research in our laboratory is dependent on its unique
features.  Our research also needs to incorporate new peripheral
devices on a regular basis.  Unhappily, because of its small user
base, Mach has always had relatively few device drivers compared to
more popular operating systems.  Furthermore, many of these drivers
are old and do not accommodate recent generations of I/O chips, either
by not running them at all or else by failing to take advantage of
advanced new features.  In part because of developments in multimedia
and wireless networking, new I/O peripherals are being invented at a
remarkable rate, and Mach's set of device drivers has been steadily
falling further behind the hardware base available in the PC world.

This problem eventually became acute for us.  We knew we could not
obtain either the time or the information to write all the Mach device
drivers we wanted; writing new device drivers is often difficult
because of the need to obtain access legally to proprietary hardware
specifications and/or software, and because one must have a sound
understanding of the hardware in order to write a high quality driver.
Accordingly, we wondered whether it would be practical to implement
Linux device driver emulation within Mach.  Because of its relatively
large following, Linux has many more device drivers and the rate at
which new drivers are added to it outstrips the rate at which new
drivers are added to Mach.  Linux has driver call-in and call-out
interfaces that are well defined and that change slowly.  We thought
that if we could emulate these interfaces, then we could tap into the
current base of Linux drivers, and---possibly with limited further
effort to update the emulation---future Linux drivers as well.

Our goal was to be able to compile completely unmodified Linux device
drivers into the Mach kernel.  We achieved this goal for network,
block, and SCSI devices that attach to either the ISA or PCI bus.
This paper reports our design, how certain aspects of both Mach and
Linux constrained our design, and how our code performs.


2. BACKGROUND

This section provides necessary background about how device drivers
operate in both Mach 4.0 and Linux 1.3.35.

In Linux, drivers may be either statically or dynamically linked into
the kernel.  Mach supports only statically linked drivers.
Consequently, we emulate only statically linked Linux drivers.

Linux has five classes of devices: network, SCSI, block, sound, and
character.  At this writing, we have tackled only network, block, and
SCSI drivers that attach to the ISA and PCI I/O buses.

Linux and Mach view device access differently.  Mach is a microkernel
designed with portability in mind.  It has a single kernel interface
for all device types, and this interface is accessed by messages.
Within the kernel, a device access request first passes through
machine independent code and then to machine dependent code.

Linux is a monolithic operating system that was originally targeted to
only the Intel x86 architecture.  Its device interfaces are accessed
by other parts of the operating system via procedure call, and there
is no attempt at machine independence.  Each type of device has its
own interface.

Figures 1, 2, and 3 give the Mach device interface and Linux
interfaces to network and SCSI devices, respectively.

------------------------------------------------------------------------
device_open
device_close

device_read         /* synchronous */
device_write
device_read_inband  /* small variant */
device_write_inband

/* these two replace ioctl */
device_get_status   
device_set_status

/* set network packet filter */
device_set_filter

device_map
/*
 * Also, asynchronous two-message
 * (request & reply) versions of
 * each of these calls: device_open,
 * device_read, device_write,
 * device_read_inband, and
 * device_write_inband.
 */


------------------------------------------------------------------------
	    Figure 1: Mach Kernel Interface to All Devices
------------------------------------------------------------------------

Linux regards SCSI devices as block devices, so the interface to SCSI
is the vnode interface; this explains the large number of
unimplemented calls in Figure 3.  Of course in both systems a device
driver must also have an interface that is accessed from below: one
procedure per type of interrupt.  In the case of network drivers,
there are two types of interrupt: transmit done and packet received.
The SCSI interrupts are: command complete and bus reset by device.

------------------------------------------------------------------------
probe
open
close
send_packet
ioctl                 /* unimplemented */
get_stats
set_multicast_list    /* effectively, this is ioctl; link address
                         and promiscuous mode can be set too */

------------------------------------------------------------------------
	    Figure 2: Linux Interface to Network Devices
------------------------------------------------------------------------

------------------------------------------------------------------------
open
release	     /* same as close */
read
write
ioctl

/* obscure */
change       /* has media changed?  e.g., CD-ROM removal */
validate     /* flush and re-read disk partition tables */
fsync

/* not implemented by any driver */
lseek
readdir
select
mmap
fasync       /* asynchronous sync */
------------------------------------------------------------------------
	    Figure 3: Linux Interface to SCSI Devices
------------------------------------------------------------------------

Network drivers have essentially no internal structure.  There is one
procedure to handle each entry point and a few utility routines.
However, in both Mach and Linux SCSI drivers have an internal
structure.  The top layer is ``target specific;'' example targets are
tape and disk/CD-ROM.  The target specific layer knows about the
physical structure and constraints of one type of device.  This layer
translates device-specific operations (e.g., read a disk block) into
one or more SCSI commands.  The middle layer performs bookkeeping
chores like queueing and timeouts.  The bottom layer is the ``host
adapter,'' which knows how to send one or more SCSI commands to a
specific controller chip and return the result(s).  The interface to
the host adapter is given in Figure 4.  This interface is the same in
both Mach and Linux.

------------------------------------------------------------------------
detect          /* probe for device */
command         /* synchronous */
queue_command   /* asynchronous */
reset
------------------------------------------------------------------------
	    Figure 4: Interface to SCSI Driver Host Adapter Layer
------------------------------------------------------------------------


3. DESIGN

This section presents the details of our design.  Section 3.1
discusses the relative power of the two device interfaces; i.e., is it
possible to emulate the Mach device interface using Linux drivers?
The remaining subsections, 3.2 through 3.6, discuss the specific
modules of emulation code that we wrote.  The implementation consists
of about 2000 lines of C.


3.1  Device Interfaces

At the most abstract level of consideration, emulation raises two
issues:
	1. Emulating any procedure call or variable reference that a
	   Linux driver might make.

	2. Ensuring that the combination of emulation code plus a
	   Linux driver are together able to implement all Mach device
	   entry points. 

The first concern---how Mach can provide the facilities needed by
Linux drivers---is addressed in the sections below.  This section
shows that Linux drivers can implement the Mach interface as well as
Mach drivers implement it.  In the next few paragraphs we argue that
the Mach kernel interface to devices can be implemented by Linux
drivers plus a small amount of simple emulation code.

Figures 2 and 3 show that Linux has obvious analogues for most Mach
device entry points: open, close, read, write, and ioctl.  Of course,
it is possible that certain Mach ioctl arguments are not implemented
by Linux drivers.  In the case of network devices, Mach's ioctl calls
(device_get_status and device_set_status) map not to ioctl but to the
non-obvious set_multicast_list.  With the right arguments,
set_multicast_list can be used to read or write the link address,
multicast addresses, and promiscuous read mode.  These are the major
actions of Mach's ioctl.

There is no Linux analogue for the Mach device_set_filter
entry point.  This entry point is used to associate a packet filter
[PacketFilter] with a driver.  However, in Mach neither this
entry point nor actual filtering of packets is implemented by device
drivers.  Instead, Mach has generic routines for installing and
executing packet filters, and all device drivers call the generic
installation routine.  Once installed, a packet filter is called after
the network driver has retrieved the incoming packet and copied it
into a Mach message.  The filter operates on the message contents.  At
that point, all device-specific functions have been performed, so
packet filtering really takes place above the driver level.

The device_map entry point permits devices like frame buffers and
disks to map their contents into virtual memory.  The Linux analogue
is mmap.  As a practical matter, the only drivers that use these calls
in either system are frame buffers, so emulation is presently a moot
point.  If someday it becomes important for a Linux driver to service
Mach device_map calls, emulation would be easy because of the
presence of a generic Mach routine (blockio_map) that converts paging
activity of the mapped area into device read/write requests.


3.2  In-Kernel Device Interface

Below Mach's kernel device interface and above the device drivers is a
layer of ``machine independent'' code.  This code unpacks request
messages into a generic I/O request structure, maps or moves pages
between the calling process and the kernel, and sends a reply message
when the device driver is finished with the operation.  Also, during a
device_open operation, this code looks up the device's name in a list
called dev_name_list.  Device drivers add entries to dev_name_list.
Such entries include the device's name and pointers to routines in the
device driver that implement the calls in the kernel's device
interface.  The idea is that the machine independent code performs all
kernel actions that are not truly device-specific.

After acquiring some experience with the machine independent code, we
decided to eliminate it because it interferes with the goal of using
device drivers from other operating systems.

For example, Linux names IDE drives ide0, ide1, and so on, but Mach's
UNIX server refers to IDE drives as hd0, hd1, etc.  To use Linux
drivers in a Mach microkernel that is running the UNIX server, it
would be necessary to update the machine independent layer to
translate a device_open kernel call with an argument of hd0 into a
call to the Linux driver.  More generally, extra effort is needed to
convert between several conventions defined by Mach's machine
independent layer and the corresponding conventions of Linux drivers.

Since the machine independent layer would have to change anyway, and
since we thought the type of translation mentioned above is more
properly done by the emulation code associated with the Linux driver,
we decided to eliminate the machine independent layer.  The result is
a new implementation of all Mach device interface calls that
recognizes emulation as a possibility.  For example, when a
device_open request is received, the kernel calls the open routine of
each emulation module.  An emulation module that recognizes the device
name returns a structure that contains pointers to routines that
implement each call in the kernel interface.  For calls other than
device_open, the proper routine pointer in the structure is
dereferenced immediately upon entry into the kernel.  The emulation
module is responsible for all actions needed to service a device
interface call, including mapping pages and sending a reply.  Three
emulation modules currently exist: Linux block, Linux network, and old
Mach.  The old Mach module is the original Mach device code, hacked a
bit to conform to the new way of doing things.


3.3  Initialization

Linux assumes that the clock is running at the time drivers configure,
but in Mach this is not true.  This creates a problem since Linux
drivers use clock interrupts to timeout device probe commands.  In
particular, the initializing driver (which is the only activity
running on the machine at the time) initiates a probe command and then
polls the clock variable in a loop until the probed device responds
with an interrupt or until the clock variable reaches some timeout
value.  The clock variable is incremented by the handler for clock
interrupts.  (Mach drivers use a similar method except that, because
no time facility is available, the delay loop runs for an
``estimated'' amount of time.)

We solved this problem by changing Mach to start the clock earlier,
and by writing a special clock interrupt handler that is used only
during Linux driver configuration.  We have also renamed the clock
variable that Linux drivers refer to; Mach calls the variable
``elapsed_ticks'' while Linux calls it ``jiffies.''  The special
handler increments this variable, and we have changed the standard
Mach clock handler to do so also.  Mach's clock handler could not be
used during driver initialization because it manipulates the ``current
thread'' data structure which doesn't exist at that point.  We did not
fix Mach device drivers after the fact to use measured time rather
than estimated time in their delay loops.


3.4  Memory Usage

Mach and Linux make different use of the kernel's address space in two
ways: addressing and memory allocation.  Both of these differences
impact driver emulation.


3.4.1 Addressing

Both Mach and Linux map the kernel into the upper gigabyte of the
32-bit address space.  However, Mach sets kernel segment register
values to zero, and is linked so as to generate virtual addresses in
the range [0xC0000000 .. 0xFFFFFFFF].  In contrast, Linux sets its
segment registers to 0xC0000000, and generates virtual addresses in
the range [0x0 .. 0x3FFFFFFF].  Linux does this to ease kernel
programming: kernel virtual addresses and physical addresses are
identical and interchangeable.  Consequently, Linux drivers have no
provision for translating between physical addresses and kernel
virtual addresses, and operations that require physical addresses
(e.g., DMA) would not work if a Linux driver were simply compiled and
linked with Mach.

To resolve this difference, we changed Mach's machine-specific memory
management module (pmap) and linking instructions so that Mach also
generates virtual addresses in the range [0x0 .. 0x3FFFFFFF].


3.4.2 Kernel Memory Allocation

An initialization call to a DMA-capable Linux driver includes two
parameters which are the start and end addresses of a segment of
contiguous, DMA-able [1] physical memory.  The driver uses this memory
for DMA buffers and for storing its data structures.  In Mach, driver
initialization occurs after virtual memory is enabled, so the virtual
memory system cannot be avoided when searching for memory for driver
initialization.  The emulation code searches Mach's free page list,
looking for a sequence of contiguous pages that lie below the 16MB
boundary.  The boundary addresses of this segment are passed to the
Linux driver.

To service dynamic requests made by Linux drivers for extra DMA
buffers, we implemented a new memory allocator that Linux drivers
(only) use to share 64KB of DMA-able memory set aside at
initialization.  The reason for adding yet another kernel memory
allocator is that no existing Mach facility provided the right
combination of being able to run during interrupt service, [2]
allocating physically contiguous DMA-able memory, and being
space-efficient for small allocations.


3.4.3 I/O Blocking

Linux uses a small block size for I/O operations.  For most block
devices, the block size is 1KB.  For CD-ROM, it is 2KB.  In contrast,
Mach's block size is 4KB, the page size.  The negative effect of a
small block size is ameliorated in part by the fact that the Linux
block cache code ``clusters'' the pieces of a multi-block I/O
operation.  To cluster means to coalesce physically contiguous blocks
into a single ``segment,'' and to form a list of segments into a
single I/O command.  Such a list is suited to devices that have
scatter/gather ability. [3] List formation is done without extra
copies, since the segment list itself consists only of pointers to
segments.

Clustering reduces the number of I/O operations whenever the device
provides scatter/gather.  (If the device does not support
scatter/gather, each segment is a separate operation.)  For this
reason, we have ported the clustering code from Linux's block cache
into the emulation code.


3.5 Synchronization


3.5.1 Between Driver and Processes

Mach and Linux embody fundamentally different design decisions
regarding the (a)synchrony of interaction between device drivers and
the higher level software that invokes I/O operations.

In Linux, it is assumed that some process is waiting for each I/O
operation.  The process formats an I/O request, initiates the I/O
operation, then sleeps on the I/O request buffer.  When the I/O
operation completes, the device driver performs a wakeup on the I/O
request buffer, and the process resumes.  In contrast, in Mach the
interaction between the driver and the process that initiated I/O is
asynchronous.  Mach maintains an ``I/O-done list'' and an ``I/O-done
thread.''  When an I/O operation completes, note is made in the
I/O-done list.  At some later time, the I/O-done thread is scheduled,
removes the completed operation from the I/O-done list, and sends a
message to the initiator of the operation.

We made the Linux synchronization method compatible with Mach as
follows.  The I/O request block is passed to the driver locked.  When
the Linux driver completes the I/O operation, it calls routine unlock.
This routine unlocks the I/O request block and calls wakeup.  We
replaced unlock with code that unlocks the I/O request block then
manipulates the I/O-done list to indicate that the operation is
finished.  It is guaranteed that the I/O-done list already has a
record for the operation because the Mach I/O code placed one there
before invoking the Linux driver to do the operation.


3.5.2 Between Driver and Hardware

Mach and Linux differ substantially in how they disable interrupts.
Mach masks classes of interrupts by using spl to set the CPU to one
of 8 different interrupt priority levels.  Linux does not vary CPU
priority.  Instead, it will disable individual devices by turning them
off at the PIC (programmable interrupt controller); Linux also uses
the x86 CLI and STI instructions to disable and enable interrupts
entirely.  Turning off individual devices is a fine grain of control
not available in Mach, and changing Mach to use such control would be
prohibitively difficult.  Consequently, we emulate a Linux driver
turning off a single device by using spl to turn off the entire class
of devices.  We have not observed any problems from this over-masking.

Linux is not designed to run on multiprocessors, so Linux device
drivers are not concurrency-safe.  Mach, on the other hand, contains
locking to permit execution on multiprocessors.  To use Linux drivers
safely within Mach, the emulation code implements a per-device lock
that ensures that calls to any single device are serialized.


3.6 Machine Resources

Linux implements an organized approach to the notorious problem of
binding interrupt lines (IRQs), DMA channels (DRQs), and I/O port
ranges to devices.  There is a central allocator that keeps track of
each resource.  Well behaved device drivers request these resources
and release them when they are finished.

Central allocation prevents conflicts, and is a good idea.  Since Mach
had no such facility, we simply ported this code into Mach for use by
the Linux drivers.  We did not bother to install similar code in the
Mach drivers, so they are ill behaved with respect hardware resource
allocation, and they should not be used at the same time as Linux
drivers.

Another worthy facility used by Linux drivers that we ported from
Linux into Mach is ``automatic IRQ detection.''  The purpose of this
facility is to automatically discover which interrupt line a
jumper-configured controller card is using.  It works as follows:

	1. The device driver calls the ``autoIRQ'' facility, which
	   installs special interrupt handlers for every unallocated
	   IRQ.

	2. The device driver forces the device to interrupt.

	3. The device driver calls the autoIRQ facility to receive a
	   report.  A timeout is given as an argument.

	4. If the device was configured to use one of the available
	   IRQs, then one of the special interrupt handlers was invoked.
	   The autoIRQ allocates this IRQ to the device and indicates
	   so when the device reports asks for the report.

	5. If the device was not configured to use one of the available
	   IRQs, then the interrupt was handled by some other device's
	   interrupt handler and presumably treated as an error.  The
	   device driver's call to autoIRQ for a report times out and
	   indicates that no IRQ was allocated.

	6. A side effect of autoIRQ giving its report to the device
	   driver is that the special interrupt handlers are uninstalled.


4.  EVALUATION

Of the 53 emulated drivers, we tested seven and measured the
performance of two, the SMC ``Ultra'' Ethernet controller and the
Adaptec 1542C SCSI controller.  Both devices attach to the ISA bus.
All the remaining Linux drivers compile.  This is not a trivial
statement, given that compilation means that emulation code exists for
every external name in all drivers.

In the two cases, we compared the performance of the emulated driver
versus the native Mach driver.


4.1 Experimental Setting

Our test platform was a Pentium 90MHz system with 16MB of DRAM, ISA
and PCI I/O buses, a Fast SCSI-2 disk, and an unloaded Ethernet.
Artificial workloads were generated by simple programs we wrote to
open a specific device (network or disk) and then access the device
via direct calls to the microkernel.  In fact, the exact
characteristics of the hardware platform and load generation software
are not important, for a few reasons.  First, we are concerned with
latency rather than throughput, so there is no need for the workload
generator to be able to run the timed sections often.  Second, we are
concerned with the relative latencies of two drivers; absolute times
are not important.  Third, the timed code sections are short and
contain few sources of variability; page faults are guaranteed not to
happen during a timed section, and no I/O operation occurs in three of
the four tests.  It is not worrisome if the system calls generated by
the workload exhibit variable latency; what matters are the short
timed sections deep within the kernel.

In order to generate precise timing numbers for short code sections,
we used the Pentium's RDMSR instruction.  This instruction can read
any two of about 40 registers that count the number of certain actions
(e.g., cache misses, time ticks) since CPU reset.  The time register
is a 64-bit counter that is incremented on the edge of every CPU
(on-chip) clock signal; i.e., for a 90MHz processor the counter is
incremented 90 million times a second.  In order to interpret the
counter one must know the CPU clock rate; we took code from FreeBSD
2.0.5 that figures this out.  The RDMSR instruction itself takes 6
clock ticks.  The time to move the returned value to a memory location
pushes the overall time to sample the counter up to 8 clock ticks if
the memory location is cached; the time could be considerably longer
if the location is not cached.  On a 90MHz Pentium, 8 clock ticks is
about one tenth of a microsecond.  Since the shortest code paths we
measured is 2 microseconds, we deem the error introduced by counter
sampling to be negligible.

Below are two sections, one for each comparison.  In each section, we
present latency figures in a table and point out and explain any
results that are unexpected or interesting.


4.2 Network

------------------------------------------------------------------------
    DRIVER	SIZE		MIN	VARIANCE

    Linux	60 bytes 	74 usec	    9

    Mach	  60		 62	    3

    Linux	 256		159	   22

    Mach	 256		214	    3

    Linux	1500		712	   31

    Mach	1500		582	   31

------------------------------------------------------------------------
	    Table 1:  Network Transmit Latency (vs. Mach)
------------------------------------------------------------------------

------------------------------------------------------------------------
------------------------------------------------------------------------
    DRIVER	SIZE		MIN	VARIANCE

    Linux	60 bytes 	86 usec	   12

    Mach	  60		 84	    2

    Linux	 256		265	    9

    Mach	 256		295	   18

    Linux	1500		655	   54

    Mach	1500		749	  185

------------------------------------------------------------------------
	    Table 2:  Network Receive Latency (vs. Mach)
------------------------------------------------------------------------

We measured the minimum latency and latency variance of the two
performance-critical operations: network transmit and receive.

The first time sample was taken at the point where the Linux and Mach
drivers first differ.  Similarly, the second time sample was taken
where the drivers first re-converge.  For transmit, the first point is
where the packet is queued and a software interrupt scheduled.  The
second point is the routine to deallocate a packet once transmission
is complete.  For receive, the first point is in the packet received}
interrupt handler, while the second point is in the routine that
delivers the packet to the kernel.  The transmit path includes the I/O
to the Ethernet chip.  I/O is finished by the time the receive path
begins, but the receive path includes copying the packet from the
controller.

For three of the six comparisons (transmit 256, receive 256, and
receive 1500), we have the puzzling result that the emulated Linux
driver is faster than the native Mach driver.  This should not happen
since, despite the differences between the drivers, the emulated Linux
case always performs strictly more work than the native Mach case.  In
the two receive cases, we have determined that the entire time
difference is due to the single instruction that copies the packet
from the I/O controller to DRAM.  However, we have been unable to
determine what hardware effect is causing the difference or why this
hardware effect occurs in the native driver but not in the emulated
Linux driver.


4.3 SCSI

------------------------------------------------------------------------
    DRIVER	SIZE	PATH	MIN	VAR

    Linux	512	CMD	 86 us	  1

    Mach	512	CMD	 29	  4

    Linux	512	INT	 43	  4

    Mach	512	INT	  2	  1

    Linux	 4K	CMD	119	  2

    Mach	 4K	CMD	 30	  4

    Linux	 4K	INT	 52	  1

    Mach	 4K	INT	  2	  1

    Linux	 8K	CMD	141	  2

    Mach	 8K	CMD	 35	  5

    Linux	 8K	INT	 58	  2

    Mach	 8K	INT	  3	  0

    Linux	16K	CMD	160	  2

    Mach	16K	CMD	 39	  4

    Linux	16K	INT	 74	  3

    Mach	16K	INT	  3	  0

------------------------------------------------------------------------
	    Table 3:  SCSI Read Latency (vs. Mach)
------------------------------------------------------------------------

------------------------------------------------------------------------
    DRIVER	SIZE	PATH	MIN	VAR

    Linux	512	CMD	85 us	  2

    Mach	512	CMD	 29	  3

    Linux	512	INT	 43	  4

    Mach	512	INT	  2	  0

    Linux	 4K	CMD	120	  1

    Mach	 4K	CMD	 30	  4

    Linux	 4K	INT	 51	  2

    Mach	 4K	INT	  2	  0

    Linux	 8K	CMD	143	  0

    Mach	 8K	CMD	 35	  4

    Linux	 8K	INT	 59	  1

    Mach	 8K	INT	  3	  0

    Linux	16K	CMD	162	  0

    Mach	16K	CMD	 37	  5

    Linux	16K	INT	 74	  3

    Mach	16K	INT	  2	  0

------------------------------------------------------------------------
	    Table 4:  SCSI Write Latency (vs. Mach)
------------------------------------------------------------------------

We measured the minimum latency and latency variance of the two
performance-critical operations: SCSI read and write.  In these tests,
two code paths were timed.  Path CMD is from the write system call
until the command is issued to the host adapter.  Path INT is from the
command complete interrupt until the I/O request block is retired
from the driver's queue of commands.  The I/O operation takes place
between the first and second timed paths, and hence is not included in
any of the timings.  As with the network tests, the beginning and
ending of the timed paths are the points at which code paths in the
Linux and Mach drivers diverge and converge, respectively.

Unlike the network timings, the SCSI results are easily explained.
First, for a common driver and data size, the times for read and write
are virtually identical.  This is to be expected because the two
operations are the same except for the direction of data transfer.
Second, the numbers for the emulated Linux driver rise with increasing
data size.  (In contrast, the CMD numbers for the native Mach driver
do rise with data size, but very slowly; the INT numbers don't rise at
all.)  The explanation is that two forms of emulation processing are
proportional to transfer size.  One is that the emulation code tries
to coalesce a multi-block transfer into ``segments'' with physically
contiguous pages.  This affects both CMD and INT paths.  The other is
that the CMD path checks if a DMA operation is scheduled for a memory
location beyond 16MB and, if so, allocates ``bounce buffers'' to
compensate. [6] The Mach driver makes no such check, requiring that
DMA be to/from addresses below 16MB.


4.4 Conclusion

Although it is disappointing that we are presently unable to explain
the network timings, one point is clear, and that is that the cost of
emulation is very low.  The case where emulation imposes the highest
cost is writing 16KB to SCSI, and here the cost is less than 0.2
milliseconds [7] for an I/O operation that requires several
milliseconds.

In fact, we added considerably more function and overhead to a
re-implementation after seeing, in our initial implementation, how
little cost was imposed.  I/O drivers seem to be an especially
appropriate place for emulation since, so long as the I/O bus is slow
and the CPU is fast, a considerable number of instructions can be
executed by an emulator without having noticeable impact on
performance.


5.  SUMMARY

Reactions to this work at its outset included expressions that a
practical emulation of Linux drivers in Mach would be
quasi-miraculous.  About quasi-miracles Samuel Johnson once wrote A
dog's walking on his hind legs [is] not done well; but you are
surprised to find it done at all.''  In our case, the dog actually
walks quite well.

We have demonstrated the practicality of incorporating unmodified
device driver source code of one operating system into another.  Both
Mach and Linux are intellectual descendants of UNIX, but they do not
share code ancestry.  Mach's device drivers are for the most part
descended from 4.3 BSD, while Linux device drivers were written from
scratch using different assumptions about some important aspects of
the surrounding operating system.  Nevertheless, the performance
penalty for emulation is very limited.  In the tests we did, the
degree depends on both device and operation, and varies from 2 to 197
microseconds.

We are aware of no related work that shares the essential feature of
our work: incorporating unmodified source code of one operating system
into another.  Of course within the last few years there has been a
good deal of work in the UNIX community on emulating the system calls
of the far more popular DOS and Windows interfaces; examples include
Mach's DOS server [MachDOS], Linux ``DOSemu'' and ``Wine'' projects,
and a number of commercial efforts such as ``DOSMerge'' and ``WABI.''
The difference between those efforts and ours is that we include
another system's source code into our kernel, so we are emulating
intra-kernel variables and interfaces rather than the more
standardized and carefully considered system call interface.

The obvious direction for future work in this area is to extend the
emulation to include more drivers from Linux and to include drivers
from other operating systems.  The true mother lode of device drivers
in the PC world is of course the DOS or Windows binaries that ship
with nearly every piece of peripheral hardware.  Emulating these would
be enormously more difficult than what we have done so far because of
lack of access to source code and because of the huge range of actions
taken by these drivers, based on the assumption that they can control
the whole mahcine.  Nevertheless, a reasonable starting point might be
to emulate network drivers because they are so simple and because so
many conform to the NDIS specification [NDIS].


FOOTNOTES

[1] In the PC architecture, DMA-able memory must be below 16MB,
because the ISA bus has only 24 address lines.

[2] Mach assumes that interrupts do not change the page list.

[3] A scatter/gather I/O operation consists of an indication of the
direction in which the data should move and a list of segments aligned
to some boundary.  The device either ``scatters'' a write over the
segments or ``gathers'' a read from the segments.  Different devices
have different limits on how many segments can be accommodated in a
single I/O command.

[4] SCSI read and write, and network reception.  The timed segment for
network transmission includes the I/O.

[5] We suspect a cache effect, but even with the considerable help of
the RDMSR instruction we have not been able to pinpoint the cause.

[6] A bounce buffer is a DMA-able region of memory used as a
waystation by a DMA operation whose target address is beyond 16MB.

[7] 197 microseconds = 125 on the CMD path plus 72 on the INT path.


REFERENCES

NDIS		S. Dhawan.
		Networking Device Drivers.
		Van Nostrand Reinhold, New York, 1995.

MachDOS		G. Malan, R. Rashid, D. Golub, and R. Baron.
		DOS as a Mach 3.0 Application.
		In Proc. USENIX Mach Symp., USENIX, pp. 27-40, November 1991.

PacketFilter	M. Yuhara, B.N. Bershad, C. Maeda, and J.E.B. Moss.
		Efficient Packet Demultiplexing for Multiple Endpoints
		and Large Messages.
		In Proc. 1994 Winter USENIX Conf., USENIX, pp. 153-165,
		January 1994.


ACKNOWLEDGEMENTS

This work was supported in part by the Advanced Research Projects
Agency, ARPA order number B094, under contract N00014-94-1-0719,
monitored by the Office of Naval Research; and in part by the Center
for Telecommunications Research, an NSF Engineering Research Center
supported by grant number ECD-88-11111.


AUTHOR INFORMATION

Shantanu Goel is an MS candidate in the Computer Science Department
at Columbia University.  His research interest is operating systems.

Dan Duchamp is an Associate Professor of Computer Science at
Columbia University.  His current research interest is the various
issues in mobile computing.  For his initial efforts in this area, he
was named an Office of Naval Research Young Investigator for the
period 1993-1996.

Mailing address:
   450 Computer Science Bldg.
   Columbia University
   500 West 120th St.
   New York NY 10027
Email addresses: {goel,djd}@cs.columbia.edu