The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

Mether-NFS: A Modified NFS Which Supports Virtual Shared Memory

Ronald G. Minnich
Supercomputing Research Center

*Abstract

Mether-NFS (MNFS) is a modified NFS which supports virtual shared memory. The model it supports is based on the Mether model. It is compatible in most ways with standard NFS. The major difference is in the way mapped files are handled. Mapped files, rather than adhering to standard NFS memory consistency semantics, instead obey rules such that writes are globally ordered. This allows MNFS to be used to support shared memory programs.

MNFS has been used on a 16-node SPARCStation ELC cluster to support shared memory programs. On appropriate applications we have realized supercomputer-level performance, which, given the modest cost of the cluster, implies a correspondingly better price-performance ratio. MNFS performance is superior to NFS performance in every case, but particularly in the case of write performance. Writes in MNFS are at least twice as fast as NFS.

MNFS provides an extended memory operator, called Conditional Fetch-and-op (CFOP). CFOP is used to support memory synchronization. Programs using CFOP can atomically test and set a variable in 1.3 milliseconds. Our target performance goal for CFOP is 500 microseconds. In order to achieve this performance we have had to highly optimize the path through the SunOS kernel on both the client and server sides.

Introduction

Virtual Shared Memory, also called Distributed Shared Memory (DSM), provides the shared memory abstraction on a set of machines connected via a network. DSMs can also be implemented in hardware, e.g. MemNet, but we are restricting our discussion in this paper to software DSMs. Implementations to date include Ivy, Mether, and others.

A significant problem to date has been the absence of an ``industrial strength'' implementation of DSM that would allow ordinary users access to it for their applications. Despite the efforts of researchers, no implementation of software DSM that we know of has made it far beyond the labs which have developed it (including our own) and into general use. User evaluation and feedback on the idea of DSM is essential in order to further develop it.

We have developed a DSM at SRC called Mether. The source has been made available since 1991 and it has been installed at a number of universities in the U.S. and overseas for both research and applications purposes. While researchers have studied it for their own work, it has proven to be too unfamiliar an environment for normal users and the result has been little applications development using Mether outside SRC. In order to provide a system more usable to non-researchers we rebuilt Mether using NFS as the support base. The result is Mether-NFS, or MNFS.
Our goals for MNFS were:

- Provide ease-of-use for DSM users by fitting the Mether model into a Unix file system framework. The result is a value-added NFS that offers standard NFS semantics for read and write operations, and offers a useful DSM model for memory-mapped files.
- Provide a tool for sysadmins that was easily installed, requiring no kernel recompilation or expertise beyond that needed to use NFS.
- Provide a system that was observable with standard tools found on Sun systems.
- Provide compatibility, to the extent possible, with standard NFS. This implies no changes to NFS Remote Procedure Call types or External Data Representation structures.
- Provide a reliable system, at least as reliable as NFS.

These goals were met. MNFS:

- looks to users like NFS
- is dynamically loadable after the computer has been booted, using Sun's ``modload'' utility
- does not modify RPC or XDR components of NFS in any major way, with the result that MNFS packets can be monitored with all the tools used to monitor NFS
- supports a memory model that makes it possible to use shared-memory programming techniques on memory-mapped files
- provides users with a ``recoverable'' virtual memory, in the sense that virtual memory segments are backed by files and thus can be examined during an application's execution and as part of post-mortem debugging. Memory segments backed by MNFS can even be integrated into a checkpoint/restart scheme.

In addition to operating as an NFS file system, and supporting the Mether model for memory-mapped files, MNFS adds an extension called Conditional Fetch-and-op (or CFOP). CFOP is used for low-latency synchronization. Using CFOP, processes can perform P and V operations on shared variables in 1.3 milliseconds. This performance is far better than we are able to achieve via other mechanisms.

In the following sections we provide an overview of the MNFS applications interface, followed by a description of the implementation of MNFS and the CFOP support, and then a discussion of performance.

Applications Programming Interface

The applications programming interface for MNFS is for the most part identical to that offered by NFS-based Unix file systems. There are a few areas in which the two differ:

- Behaviour of the mmap system call. In standard SunOS, the mmap system call is used to map a file in. The user may specify an mmap area that is larger than the file; if a process references a mapped file area beyond the current end of file, the process will receive a segmentation violation signal and in the usual case will terminate. In MNFS, the file size is grown as needed by the mmap support code to match the size of the mmap request. Thus no segmentation violation will occur if the user references an area of the file beyond the initial end-of-file.
- Semantics of virtual addresses. As described below, there are two different page sizes supported in MNFS. One bit of the virtual address is used to indicate whether the reference is to the long-size page space or the short-size page space.
- No read-ahead. MNFS does not currently read ahead. The traditional NFS read-ahead plays havoc with memory-mapped files.
- CFOP system call. The CFOP system call is available to users of MNFS. CFOP operates on mapped-in MNFS files.
- Swap instructions work. Atomic memory instructions such as the SPARC SWAP instruction work properly (assuming, of course, that the SPARC implementation in question executes the instruction correctly; not all do). Users used to traditional NFS semantics may be surprised.
- File locking is not necessary. Given that swap instructions work, and given the availability of CFOP, we have found no need to use the standard fcntl system call to lock regions of a file for writing.

As mentioned above, the memory model for MNFS mapped files is based on our earlier work with the Mether distributed shared memory. Mether presents users with a virtual address space partitioned into pages. A Mether operation on any part of a page applies to the whole page as well. For example, a reference to part of a page will cause a whole page to be fetched.

The operations that Mether supports on pages differ in several crucial ways from some of the DSMs mentioned above. On a memory-fetch by memory-fetch basis, programs can determine the type of service needed for that fetch. The types of service supported are:

[Exclusive Write Access] When a process is writing to a page, it has the only writeable copy of that page. However, unlike systems such as Ivy, Mether does not invalidate all readable copies of the page. The exclusive access for the writeable page guarantees global serialization of all writes to the page. It also eliminates the need for cache invalidation protocols with their attendant overhead.

[Latency Tolerance] Latency tolerance is supported in several ways. First, there is an operation that is used by processes to fetch items that may take a long period of time. This operation is used by processes to fetch data without a page fault, but to also indicate that a high latency is both acceptable and expected. Second, there are pre-fetch operations that allow processes to cause the system to acquire pages before they are used.

[Network Refresh] Network refresh is a process by which an application can deliver the latest copy of the writeable page to holders of a readonly copy. Recall that there is only one writeable copy. This operation is called ``post store'' by Kendall Square in their machine and has identical semantics.

[User Purge] In user purge, an application can purge all the readonly data in a given address range. Thus, a program can ensure that it is reading the most recently written data. The user purge and network refresh can be used in a way similar to the memory barriers on the DEC Alpha microprocessor.

[Support for different-size objects] Programs can, on a fetch-by-fetch basis, choose to access pages that have a very low transmission cost but do not move much data (32 bytes, a size chosen with ATM in mind); or they may choose to access large pages that have a higher transmission cost but are useful for moving large amounts of data (currently 8192 bytes). The small pages (``short pages'') are useful for synchronization and small objects; the large pages are useful for bulk data transfer. These sizes are well matched to the sizes most commonly seen in networking environments.

[Event-Driven Memory Synchronization] Event-Driven Memory Synchronization (EDMS) allows a process to pause waiting for another process to issue a network refresh. Thus the resumption of execution of a process is a direct consequence of an action taken by a process on a remote machine. This synchronization mechanism was found to minimize the network and host load and hence to minimize the contribution made to the overall latency by host and network load. EDMS is not a primitive operation but is constructed from two primitives: the reader performs the operation to fetch in the high-latency mode, and at a later time the writer performs a network refresh.
[CFOP] CFOP is described below, but is a high-performance operator for shared variables. Our target goal for CFOP is 500 microseconds round-trip time. We currently can perform the operation in 1300 microseconds.

Shown in Figure is a portion of program text. The program shown opens a file and maps it in. It then invokes different MNFS operations. Each type of MNFS operation invoked and what it does are described in the following sections.

*User Purge

The user purge of pages is accessed via the SunOS mctl call. We use an mctl interface function called msync, which is part of the C library. In the example, the program is purging a cached copy of a page that was accessed via reads only. Of course we could guarantee that the page would always be read-only via options to the mmap call; specifically, specifying PROT_READ instead of both PROT_READ and PROT_WRITE as the protection.

*Network Refresh

The next code fragment demonstrates network refresh. The key difference is that the page is present at the local machine and writeable. In this case, at a minimum, MNFS will return a copy of the page to the server machine. In addition, for each machine holding a read-only copy of the page (the readholders mentioned above) a copy of the short page will be sent to that machine. When the short page is received by a readholder it will be paged in, thereby accomplishing two things:

- replacing the short page currently at the machine, if any, and allowing processes to examine variables on the short page without going through a page-fault cycle;
- invalidating any copies of the long page currently at the machine, requiring that further references to the long page be satisfied with a page fault for the new copy.

In the sample code there are two variations on the network refresh. In the first call to msync for the writeable page, the MS_INVALIDATE option is set. This will invalidate the writeable copy of the page held at the client. For the client to access the page again it must fault it in by storing into the variable x. In the second msync call, the invalidate option is not set, so that a writeable copy of the page remains at the client. Both of these scenarios are useful in different applications.

*Short and Long Pages

In the next mmap call the program converts the normal address, x, to a pointer to a short page. It then maps the short page in. In the short page space, only the first 32 bytes of the page are transferred from and to the server. Short pages are useful for small variables that move frequently, such as synchronization variables.

Short pages represent the first 32 bytes of data on a long page. Thus a page can be treated as a union of a 32-byte entity and an 8192-byte entity. The 32-byte entity is accessed via the short page space; the 8192-byte entity via the long-page space. The two address spaces have very different latencies and bandwidths by design. Short page latency is determined by the latency of the protocol stack; the cost is not measurably different from that of moving a zero-byte packet. Long page latency, on the other hand, is dominated by the time cost of moving 8192 bytes of data through the networking software and over the wire. Currently it takes about 5 milliseconds to get a short page on SPARCStation ELCs over the Ethernet; long pages take 14 milliseconds in the best case. Bandwidth is very low for short pages at just 640 bytes/second; long pages run at 585 Kbytes/second. The long page bandwidth is lower than we would like; we should be able to get close to 1 Mbyte/second.
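The following sketch illustrates the user purge and network refresh calls walked through above; it is not the figure's actual program text. The file name is hypothetical; PROT_READ, PROT_WRITE, MAP_SHARED and MS_INVALIDATE are the standard SunOS flags named in the text. Error checks and the short-page address conversion are omitted.

    /* Sketch of msync-based user purge and network refresh on a mapped
     * MNFS file; illustrative only.  The path is hypothetical. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define LONGPAGE 8192

    int main(void)
    {
        int fd = open("/mnfs/shared/data", O_RDWR);
        int *x = (int *)mmap(0, LONGPAGE, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

        /* User purge: discard a locally cached, read-only copy of the page
           so the next reference fetches the most recently written data. */
        (void)*x;                                    /* page was only read */
        msync((caddr_t)x, LONGPAGE, MS_INVALIDATE);

        /* Network refresh, first variation: write the page, return it to the
           server (which refreshes the readholders), and invalidate our copy. */
        *x = 1;
        msync((caddr_t)x, LONGPAGE, MS_INVALIDATE);

        /* Second variation: refresh the readholders but keep a writeable
           copy of the page at this client. */
        *x = 2;
        msync((caddr_t)x, LONGPAGE, 0);

        munmap((caddr_t)x, LONGPAGE);
        close(fd);
        return 0;
    }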
*CFOP

The next operation shown is a Conditional Fetch-and-op. The operator in this case is Compare and Swap on a single word. The target is a virtual address in the mapped-in file; the source values are provided by the program. The target is compared to the first value and, if they are equal, is set to the second value. The value of the target after the operation is returned. The implementation actually performs the operation at the server, and performs it in such a way that the test is indivisible from the assign. There are other operators that can span up to eight words; whether the operation is performed is conditioned on the first word of the target being equal to the first word in the source. Thus a lock can be set and values assigned to a locked variable in one operation. Compare and Swap on a word is a subset of this more general operation.

*Load Incoherent

RISC processors such as the R4000 have a way of loading read-only or writeable cache lines without actually reading them in from memory. The operation is called Load Incoherent. Load Incoherent is useful for MNFS programs as well. The next mctl operation shown is a LOADINCOHERENT operation. The set of pages starting at the address given by the pointer x and extending for sizeof(*x) bytes will be created at the client without communicating with the server at all. This greatly increases the effective bandwidth to the MNFS server, since no data need move.

*Event Driven Memory Operations

Finally we show an example of Event Driven Memory Operations (EDMS). In this case the program will wait for up to 30 seconds for another program on another host to perform a network refresh of a page. Note that the 30-second time is selected by the programmer for this instance; any time may be selected, including infinity. If x were a variable on a short page, and there had been a network refresh, x would contain the new value after the refresh. Since x is a variable on a long page, the next reference will cause a page fault, as network refresh invalidates long pages. If there has not been a network refresh, the next reference to x will cause a fault only if the program has previously purged the page containing x.

*Summary

This interface doubtless seems very low level compared to other systems. That is precisely the intent. The goal of Mether, and hence MNFS, is to provide a completely exposed set of cache operations to applications. In the long term our goal is to have a high-level language use these operations, but we have also found that in some cases programmers need to have direct access to the low-level interface. Thus the operations described in this section should be considered a set of building blocks for higher-level operators. An attendant goal of Mether from the beginning has also been to prototype the API for a hardware-based system. Of necessity, therefore, the API of this system is far more constrained than what we can do in a system that will only ever be implemented in software. A hardware-based system being designed now holds out the promise of far better performance than any software-based system extant.
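To make the shape of these calls concrete, a sketch follows. The paper does not give the C-level names of the CFOP/EDMS entry point or the mctl operation code, so cfop(), MNFS_CAS, MNFS_EDMS_WAIT and MNFS_LOADINCOHERENT below are hypothetical placeholders; only the argument pattern (virtual address, operation, values) comes from the text, and the mctl prototype follows the SunOS 4 manual page.

    /* Sketch only: cfop() and the operation codes are hypothetical stand-ins
     * for MNFS's dynamically loaded system call. */
    #include <sys/types.h>
    #include <stddef.h>

    extern long cfop(caddr_t va, long op, long val1, long val2); /* hypothetical */
    extern int  mctl(caddr_t addr, size_t len, int function, void *arg);

    #define MNFS_CAS            1   /* placeholder opcodes, not MNFS's values */
    #define MNFS_EDMS_WAIT      2
    #define MNFS_LOADINCOHERENT 3

    void example(long *x)
    {
        long my_id = 42;            /* any nonzero value unique to this client */
        long seen;

        /* CFOP compare-and-swap: if *x == 0, set it to my_id, atomically at
           the server; the returned word tells us whether we got the lock. */
        seen = cfop((caddr_t)x, MNFS_CAS, 0, my_id);
        if (seen == my_id) {
            /* the word was 0 and is now my_id: we hold the lock */
        }

        /* Load Incoherent: create the page(s) for x locally, without moving
           any data from the server. */
        mctl((caddr_t)x, sizeof(*x), MNFS_LOADINCOHERENT, 0);

        /* EDMS: wait up to 30 seconds for a remote network refresh of the
           page containing x. */
        cfop((caddr_t)x, MNFS_EDMS_WAIT, 30, 0);
    }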
Implementation

Before discussing MNFS implementation details we will first provide an overview of what is lacking in the NFS handling of memory-mapped files.

*NFS Limitations

Shown in Figure is an example of how NFS handles memory reads and writes to a mapped file. Two client programs on different machines have mapped in a file. The file is mapped in such a way that both reads and writes are possible. A client reads a memory location, causing a writeable copy of the page to be loaded. A writeable copy is loaded, and not a read-only copy as might be expected, because the file is mapped with write permission. The program then performs a SPARC SWAP instruction, which performs an indivisible memory operation. At this point, in a normal memory system, all processes accessing the memory would see the same value, or in the worst case, would see the old value for some time before seeing the new value. In NFS, however, as the illustration shows, the other process has also executed a SWAP, and the processes in fact see different values. For every process involved in using this memory-mapped file, there can be a different value of the shared location.

More important, there is no way to determine what value the variable will eventually have. At some point the many writeable copies of the page will be returned to the server. Depending on which page returns last, the variable will end up with one of many different values. A process writing a different variable on the page may in fact restore the old value from before the swap instructions were performed. The situation is further complicated by any retries that may occur as part of the NFS protocol. Thus, the first write may be retried, making it the last write, and it may be retried many seconds later, so that programming techniques that attempt to work around this problem by waiting for a time for the value to ``settle down'' will not work.

Thus, in NFS mapped files:

- time can move backwards-- a variable can take on an old value, then a new value, then an old value
- writes are not ordered-- at different nodes, the order of changes to a page can be different
- the behavior is highly non-deterministic
- it is not possible to use the mapped variables for synchronization
- it is not possible for a process holding a writeable copy of a page to deliver it to the processes holding a readonly copy of the page, so that they may see the changed value of a variable

There are other problems as well. Only one page size (4096 bytes on the Sun 4c and 4m SPARC implementations) is supported, while programs frequently need small data pages for synchronization and large data pages for bulk transfer. There is no support for direct application-to-application synchronization via NFS. Thus the problems are several:

- NFS client code does not differentiate between read and write faults
- NFS server code does not ensure that only one process is writing a page at a time
- the only coherency mode supported is completely incoherent, while programs can usefully use both the incoherent mode of NFS and strongly coherent modes offered by other systems such as Ivy
- more than one page size is needed
- a mechanism for delivering a writeable copy of a page to holders of readonly copies is needed
- a mechanism for direct application-to-application synchronization is needed

We had built a memory model that incorporated many of the properties we need in an earlier system called Mether. We determined that the SunOS virtual memory system and NFS implementation were flexible enough to accommodate the Mether model as part of NFS. We will now describe the changes we made.
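A sketch of the failure just described follows; it is illustrative only and not from the paper. The swap_word() wrapper stands in for the SPARC SWAP instruction the text refers to, written here with a GCC builtin, the file name is hypothetical, and error checks are omitted.

    /* Two clients running this code against the same NFS-mapped file can
     * both "acquire" the lock, because each swaps against its own writeable
     * copy of the page. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int swap_word(volatile int *addr, int newval)
    {
        /* Atomically store newval and return the previous contents. */
        return __sync_lock_test_and_set(addr, newval);
    }

    int main(void)
    {
        int fd = open("/mnfs/shared/lockfile", O_RDWR);
        volatile int *lock = (volatile int *)mmap(0, 8192,
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        while (swap_word(lock, 1) != 0)   /* spin until the old value was 0 */
            ;
        /* critical section ... */
        *lock = 0;                        /* release */

        /* Under standard NFS two clients can both see 0 above and both enter
           the critical section; whichever copy of the page reaches the server
           last wins.  Under the MNFS rules described next there is only one
           writeable copy, so the swaps are serialized. */
        munmap((void *)lock, 8192);
        close(fd);
        return 0;
    }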
MNFS Changes

The single most important MNFS change is the fact that only one writeable page of a mapped-in file may be present in the network at a time. When a process faults on a page, and that page is not present, the host supporting that process must request the page from the server. If the page is not present at the server it must be located by the server and brought back to the server.

The server might have several ways to locate a page. The naive way would be to broadcast for a page every time it is needed. This would preserve statelessness. It would also limit scaling, as broadcast-based systems do not scale well. We decided that in no case would MNFS use broadcast. Our choice was to have the server track ownership of pages. For each page, the server records who has a writeable copy and who has read-only copies.

The tracking of pages has many implications for NFS. The most important is that there is now server state related to a client's file activity. This further implies that a server crash and reboot may not be transparent to a client, as it is with standard NFS. The stateless model is one casualty of our change to MNFS.

This change is one we thought about carefully. NFS has from the beginning relied on its stateless model to reduce complexity and increase reliability. For the environment and time in which NFS was created, such a model made sense. At the time it was not unusual for servers to crash on a daily or weekly basis. For the present, however, our experience with servers is that they don't crash. We have servers that stay up for months at a time, going down only when we need to add a device or change the OS. When the servers do go down it is in an orderly fashion, so a controlled clean-up of server state is possible. In the one case in which we have had a server crash in the last year it was because disk heads and disk arms parted company; it no longer mattered that NFS had reliability because the disk was physically unreadable. There is no point in staying with the stateless model in this environment-- its restrictions are too constraining, and it is buying us reliability we don't need. Statelessness provides a constant performance penalty, for reliability that is not necessary, in order to cover for an event that in a well-run environment does not occur.

At the same time, we advocate taking care in what state is added to the server. We have been careful to add a very small amount of state per file, and to add it in a way that recovery from server crashes and client crashes is possible. A distributed application may not survive all crash scenarios, however. There is a possibility that a client process may change a page, sync it back to the server, and have the changes not be preserved. If the server crashes at just the right instant, such that the client is satisfied that the server has the page and has therefore discarded its own copy, the new page may be lost. Thus the client has an invalid view of the contents of the page, and the correct view can not be restored.

Creating this particular failure scenario is extremely difficult. The easiest way is to halt the client machine (literally halt it-- toggle the run/halt switch, or on Suns use the L1-A key sequence), power off the server machine without shutting it down in an orderly way (note that unplanned power failure is also very unlikely, given that any server of any importance is backed by both uninterruptible power supplies and non-volatile memory), then continue the client. Once the server is back up, it will have lost its records of clients holding pages, and will reject returned writeable pages from the client. At that point the newest copy of the writeable page is lost, and the application which wrote it is in a state predicated on the server having that writeable page.
Since there is no return from this scenario, we have code which will send the client a segmentation violation signal in this case. In testing, the code has been exercised and works; in practice this situation has never happened.

The state we added to the server was related to pages from files. On a per page basis we build a record which keeps track of:

- the time the state record was created
- the name of the current host holding a writeable copy of the page, if any (the writeholder). There is only one writeholder allowed.
- a list of the current hosts holding read-only copies of the page, if any. We call these hosts readholders.

Given that we can track read- and writeholders of a page, there remains the question of how to get a page back, or how to deliver new copies of a page to a readholder. It so happens that the types of operations we need to do fit within the domain of the read and write NFS operations. Getting a page back or asynchronously delivering a page to a client requires that an NFS server be able to perform a Remote Procedure Call (RPC) to a client. In the case of getting a page back we issue an NFS read to the client. For delivering the new page we issue an NFS write. This technique requires that an MNFS client be able to respond to MNFS operations that before only a server needed to respond to: an MNFS client must be able to support server operations. Our solution is to make every client capable of handling these RPCs. We accomplish this by running an NFS server on every client. The server's functions are very limited, but it is a server nonetheless. The burden of running a server has not proven to be a problem.

For all the other operations save CFOP and EDMS, calls to the SunOS mctl function are used. The CFOP operation required a new system call; this call is dynamically loaded as part of the MNFS module.

Shown in Figure is a server with four MNFS clients. One of the clients is accessing the page as a writeable page; three others are accessing the page as a readable page, one of them in the short page space. As clients request readonly pages the server returns the copy of the page at the server; the server does not request copies of the most-recent page from the writeholder.

Shown in Figure is the sequence of events that occurs when a client requests a writeable page. The server must request the writeable page from the client holding it; the server then invalidates any long page copies held at other clients, and refreshes short page copies; and finally, the server returns the page to the client requesting it. The sequence of events is the same whether a long page or short page is accessed by the client. The message sent to invalidate the long pages and refresh the short pages is actually the same; clients interpret it differently depending on the type of page they are holding. The principle is that changes to a page only become visible to other clients when the page transits the network from the current writeholder to the server.
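A sketch of the per-page state record described above follows; the field names and types are illustrative assumptions, not the MNFS source.

    /* Illustrative per-page state record, one per tracked page. */
    #include <time.h>

    #define MNFS_MAXHOST 64

    struct mnfs_readholder {
        char                    host[MNFS_MAXHOST]; /* host with a read-only copy */
        struct mnfs_readholder *next;
    };

    struct mnfs_page_state {
        time_t                  created;            /* when the record was made  */
        char                    writeholder[MNFS_MAXHOST];
                                                    /* host with the one
                                                       writeable copy, if any    */
        struct mnfs_readholder *readholders;        /* list of readholders       */
        int                     short_page;         /* client asked for the
                                                       32-byte page (see below)  */
    };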
*User Purge

User purge is a mechanism already supported in the SunOS kernel. In SunOS, an msync call will invalidate a page if that page is part of the process's address space and it is not modified, which in terms of MNFS means it is read-only. No changes needed to be made to NFS to support user purge.

*Network Refresh

Network refresh also uses existing SunOS mechanisms for support. In SunOS, when an msync is issued for a writeable page, and that page is a mapped file backed by an NFS file system, the page is flushed back to the server. On the server, MNFS extends the handling of returned pages from clients by sending out a write of the short page to all clients holding read-only copies of the long page or the short page. This is the same mechanism shown above in Figure . A network refresh sequence is shown in Figure .

For network refresh of long pages we require that the program use a different system call than msync. This is because network refresh of long pages is very expensive in network and server load, and should not be called unless the programmer absolutely intended to call it. Nevertheless network refresh for long pages is available, and is used.

*Short and Long Pages

Short and long page access is determined by a bit in the virtual address. This technique is used as it is compatible with a hardware implementation and also turns out to be the most portable across different operating systems. The choice of which bit to use turns out to be sensitive to the implementation of the memory management architecture found in most SPARCStations. In these architectures there is a hole in the virtual address space of the processor in the range 0x20000000 -- 0xDFFFFFFF. It is not possible to make a reference in this part of the processor's virtual address space. The implication is that the maximum virtual address space of a process's mapped files for MNFS purposes is 512 Megabytes. The use of a bit in the virtual address space for long and short pages reduces this to 256 Megabytes. This range is smaller than we would like, but usable.

For process page fault handling, the MNFS support code examines the faulting virtual address. If the short page bit is set, the client only requests 32 bytes of data for the page. The server also records what type of page the client requested, and compares the amount of data the client later returns to what was earlier handed out. Any difference is logged as an error. Note that if a short page is present then faults on the long page are still possible. In the event of a fault on a long page, the client first returns the short page if it is held writeable and then requests the long page. It would be possible, at the cost of added complexity and client/server interaction, to simply ask for the remaining part of the page. That sequence is not followed in order to avoid potential deadlock scenarios that can occur when the server is very busy. Typically programs use a page as either long or short only; repeated access in the two different modes is not used by any programs currently running.
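The client-side size decision can be sketched as follows; the bit position and names are assumptions (0x10000000 is merely consistent with the 512 MB to 256 MB reduction mentioned above), not the MNFS source.

    /* Illustrative fault-size decision for a faulting MNFS virtual address. */
    #define MNFS_SHORT_BIT  0x10000000UL    /* short-page space selector bit */
    #define MNFS_SHORT_SIZE 32
    #define MNFS_LONG_SIZE  8192

    /* How much data to request from the server for a faulting address. */
    static unsigned long mnfs_fault_size(unsigned long faulting_va)
    {
        return (faulting_va & MNFS_SHORT_BIT) ? MNFS_SHORT_SIZE : MNFS_LONG_SIZE;
    }

    /* On a long-page fault the client first returns the short page, if it is
       held writeable, and then requests the long page (see text). */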
*CFOP

CFOP was an interesting case where a great deal of optimization at all levels was required before performance was acceptable. Our goal for CFOP, still not achieved, is to perform the operation in a measured round-trip time of 500 microseconds, from the initiation of the call in the program to its return to the program. This is approximately 100 times as fast as an equivalent operation can be performed via NFS. The time was chosen because our applications programmers have been demanding it; also, this 500 microsecond time puts us in the company of several commercial MPPs.

CFOP came about as part of our attempt to provide high-performance synchronization. The goal was to be able to set a lock variable to a value and then make that variable visible to other processes. Before CFOP, in order to set the variable it had to be fetched back to the process trying to set it. The process performed a SWAP instruction, and then returned the page containing the variable back to the server. This sequence is shown in Figure . The cost of the fetch for a short page is 5 milliseconds. The cost of the msync is 5 milliseconds, plus 2 milliseconds per host that needs a network refresh. Thus this sequence costs at a minimum 10 milliseconds. That is about 20 times our desired performance of 500 microseconds.

It became clear that some sort of operator that performed the memory operation at the server was needed. Performing the operation at the server would make the change immediately visible to all applications and, more importantly, would reduce the number of messages needed to two, the minimum number possible. We defined the CFOP operator to work the following way. Given an operation CFOP(virtualaddress, function, val1, val2), CFOP compares the word at virtualaddress with val1 and, if they are equal, atomically replaces it with the result of applying function to the current contents, val1, and val2; the word at virtualaddress is then returned in either case. Note that val2 can be more than one word, although in the only function we currently have (compare and swap) it is one word. For compare and swap the function simply evaluates to the value of val2; the current value at virtualaddress and val1 are ignored. The return value from the CFOP function to the program calling it is the value of the word at virtualaddress; thus the program can see if the operation succeeded or not. If the operation did not succeed, the value at virtualaddress can be used to compute a new value that might succeed.

CFOP is implemented on the client side as a system call. The system call verifies that the virtual address is a valid mapped region of an MNFS file; gets a file handle for the file; builds an MNFS request; and sends it out. The program does not actually reference the variable directly, and hence performing a CFOP does not imply that the page will be faulted in. By implementing CFOP this way we avoid any costs of virtual memory software that would be incurred as part of a page fault sequence.

On the server side CFOP is implemented as an idempotent NFS operation, one example of which is the NFS read operation. Making CFOP idempotent eliminates the checking for duplicate operations that NFS does for non-idempotent operations such as NFS write, thus speeding the operation up.
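The server-side operation, as defined above, can be sketched as follows; this is illustrative only, not the MNFS server code.

    /* Compare-and-swap CFOP, performed atomically at the server. */
    typedef long (*cfop_fn)(long current, long val1, long val2);

    /* For compare-and-swap the function simply evaluates to val2; the
       current contents and val1 are ignored (see text). */
    static long cfop_compare_and_swap(long current, long val1, long val2)
    {
        (void)current;
        (void)val1;
        return val2;
    }

    /* If the word at target equals val1, replace it with fn(...); in every
       case return the resulting word so the caller can tell whether the
       operation succeeded. */
    static long cfop_apply(volatile long *target, cfop_fn fn, long val1, long val2)
    {
        if (*target == val1)
            *target = fn(*target, val1, val2);
        return *target;
    }

    /* Example: cfop_apply(word, cfop_compare_and_swap, 0, lock_value) */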
The initial performance of the CFOP operator was 5 milliseconds, or ten times as slow as we wanted it to run. A series of optimizations to the server and client sides resulted in a four-fold improvement in performance, to the current 1.3 milliseconds.

The first improvement, on the server side, was to eliminate unnecessary calls to the local disk file system from NFS when the page for the file was in memory and addressable. NFS servers are implemented as a layered architecture over an underlying Unix File System, or UFS. Calls to NFS servers result in calls to UFS. Thus, an NFS read, received at the server, will result in a read call to the UFS. The CFOP operator, to operate on the page, must obtain a pointer to a copy of the page in memory on the server. Thus, to gain access to a pointer to the page, the CFOP operator initially used the standard server technique of calling the UFS read function. Much of the time this extra call to the UFS layer is unnecessary; the page to be read is in memory and accessible, having been accessed in a previous call to NFS. We wrote a kernel function (to our surprise, such a function does not exist in SunOS) to determine whether the page was available and mapped in, and to return a pointer to the page if it was found. In this case we could avoid the call to the UFS read function. This enhancement saved 1.5 milliseconds.

The next performance improvement involved eliminating Sun's implementation of eXternal Data Representation, or XDR, code. XDR is used to allow hosts of different types to transfer data structures back and forth in a standard format. As practiced by Sun, XDR involves a large number of function calls. In the case of CFOP, we were encoding four longs on the client for the request and decoding them on the server; the result was encoded in three longs and decoded as such on the client. We got rid of the Sun XDR code entirely and turned the decode into an inline decode using the traditional conversion functions used in, e.g., TCP/IP. This simple encoding saved 1.5 milliseconds, which was astonishing. Storing and retrieving seven words from memory should take at most 500 nanoseconds. That it takes 3000 times as long using XDR implies that XDR itself needs to be reexamined as a technique for encoding data, or at the very least Sun's implementation of XDR.

We next inlined some of the more complex kernel RPC functions into our operator, which saved 150 microseconds; we then recompiled with the Gnu C Compiler, which saved another 150 microseconds. The total time was now 1.9 milliseconds for CFOP.

The final optimization, which brought us down to 1.3 milliseconds, was to handle the CFOP in the interrupt handler code for the network when possible. The checks for whether a CFOP can be handled at interrupt level are fairly straightforward. Given a packet that has been delivered from the network interface to the generic interrupt handler, if:

- the packet is not a fragment of a larger packet,
- the value of the data at the offset in the packet for an NFS op is the CFOP operator,
- the packet is an RPC version 2, NFS version 3 packet,
- the protocol control block can be located, and
- the page for the file handle at the offset requested is in memory,

then we perform the op in interrupt mode. If these tests fail, the packet is passed into the normal network soft interrupt queue for normal handling. Note that most packets fail the first two tests immediately, and those tests involve at most four word compares; the overhead of this test is not measurable.

There were a number of lessons learned with CFOP:

- there are many unexploited opportunities for optimization in the NFS code
- the SunOS Virtual Memory (VM) architecture makes many optimizations in NFS server code possible, as the VM system and the buffer cache are unified
- Sun's implementation of XDR, with its attendant XDR generation tool, saves a programmer work one time, but penalizes users on every NFS operation; we note here that the 4.4 BSD implementation of NFS uses the more traditional XDR techniques used in TCP/IP and achieves much better performance. In fact we used pieces of the 4.4 BSD NFS code for decoding our CFOP packets.
- given a packet interface with a large enough maximum packet size, many NFS operations could be done at interrupt level
- inlining did not buy nearly as much as it appeared it would
- layering, while an attractive technique from the point of view of ease-of-implementation, is often very costly
- as usual, optimizing the common case is worth the effort

*Load Incoherent

NFS already had a mechanism for the equivalent of Load Incoherent. In NFS, when a write occurs to pages past the end-of-file, NFS simply extends the file and creates the pages locally on the machine. We simply adapted this mechanism for Load Incoherent; we create local pages for writing even when they are within the current size of the file.

*Event Driven Memory Operations

EDMS operates via a system call. It is in fact the same system call used for CFOP, with a different opcode; CFOP takes a 32-bit opcode as one of its parameters. In EDMS, the system call checks to make sure the virtual address is valid and corresponds to an MNFS file. It then checks to see if the page for the virtual address is present, and if so returns immediately. If not, it goes into a kernel sleep on the virtual address. When the client code receives a network refresh for a given virtual address, it will issue a kernel wakeup call for that virtual address. Any waiting processes will be woken up at that point.
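A sketch of this wait/wakeup logic follows, written against the classic SunOS 4.x kernel sleep()/wakeup() channel primitives that the text alludes to. The helper name, the prototypes, and the omission of the timeout handling are assumptions; this is not the MNFS source.

    /* Illustrative EDMS wait and wakeup paths (kernel side). */
    #include <sys/types.h>

    extern void sleep(caddr_t chan, int pri);   /* kernel sleep/wakeup        */
    extern void wakeup(caddr_t chan);           /* (prototypes assumed)       */
    extern int  mnfs_page_present(caddr_t va);  /* hypothetical helper        */
    #define PRI_EDMS 30                         /* placeholder priority       */

    /* Called from the EDMS opcode of the CFOP system call, after the virtual
       address has been validated as part of a mapped MNFS file. */
    void mnfs_edms_wait(caddr_t va)
    {
        if (mnfs_page_present(va))      /* page already here: return at once */
            return;
        sleep(va, PRI_EDMS);            /* sleep on the virtual address      */
    }

    /* Called by the client-side code when a network refresh arrives for va. */
    void mnfs_edms_refresh(caddr_t va)
    {
        wakeup(va);                     /* wake every process waiting on va  */
    }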
Performance

The current version of MNFS extends NFS server and client code in very limited ways. These limitations are intentional; our goal is to make MNFS and NFS work alike as much as possible. Nevertheless the changes to date have made for marked improvements in performance, and in no case have made performance worse. We will cite some numbers here. The numbers were derived from experiments run on SPARCStation ELC systems, which have 33 MHz SPARC CPUs. These systems in our experience have the ability to drive the Ethernet at close to 1 Mbyte/second. We will cite numbers for 8 kilobyte block size file systems. This is the default NFS block size, and the size of an MNFS long page.

For NFS writes, the best-case NFS write performance is 72 milliseconds; MNFS performs them in 35 milliseconds. NFS reads run at 14 milliseconds; MNFS reads take the same time. Alternating NFS reads and writes run at 67 milliseconds. This time is slightly better than the all-write case as the read is causing some caching to occur. In the case of MNFS they run at 35 milliseconds.

The intermixed reads and writes highlight a difference between MNFS and NFS. NFS does not differentiate read and write faults; when a read fault on a page is handled in NFS, and the page is found to be mapped writeable, it is loaded as a writeable page. MNFS differentiates read and write faults; hence a read fault followed by a write fault requires two NFS reads. Thus the MNFS case requires twice as many reads as the NFS case. Nevertheless the performance is twice as fast.

For short pages, the write time is 5 milliseconds, and the read time is 5 milliseconds. Read followed by write takes 10 milliseconds. Note that if the cost were absolutely linear in packet size, the short pages would only take 140 microseconds. We could therefore grow the short pages somewhat. Their size has been determined by a number of factors, however, one of which is our goal that they be able to be contained in an ATM packet. Thus we leave their size at 32 bytes.

CFOP is our highest-performance operator, clocking in at 1.3 milliseconds. Once we finish the move to Solaris 2.1 we will be trying to double the performance.

EDMS provides very precise synchronization. We have a very graphic demonstration at SRC of just how fine this synchronization is. One of our demos sets up 15 SPARCStations in an EDMS wait state on a synchronization variable. When the variable is changed and a network refresh performed, the SPARCStations all play one sound. It is not possible to distinguish any one machine's individual sound from any other's; the first time we ran the demo we thought it was broken, since it only sounded like one machine.

Applications

We will now discuss applications for which MNFS has been used. Their use of MNFS is very different, and highlights the different demands that applications make of such a system.
*Monte Carlo

We first used a Monte Carlo simulation of radiative heat transfer among the surfaces of arbitrary 2-D enclosures. This choice was made for several reasons:

- we have studied this problem in a number of computing environments and are familiar with its performance and behavior;
- the problem is inherently very parallel, but parallelism must be extracted in different ways to exploit different architectures;
- the problem, while only 2-D, is nevertheless of real-world interest-- in particular, we are interested in modeling the geometry of a laser isotope separation (LIS) unit, for which the accurate determination of radiant exchange factors among the surfaces is an important component in the larger simulation of the isotope separation process.

For the LIS geometry, we modeled the trajectories of 37 million photons, one million emissions per surface. All floating point computations were done with 64-bit numbers on both machines. The time to complete this run on the Cray 2 was 262 minutes (CPU time; 10 hours 20 minutes wall-clock time, but that is because we were not the only users on the Cray), while the time for the ELC farm was 28 minutes (wall-clock time).

It is possible that with a sufficient expenditure of effort the Cray code could be vectorized. If we make the very optimistic estimate that this could be done, and that we could get dedicated time on a Cray-2, the problem might be made to run in 12 minutes. Typically, however, Crays are run in such a way that the wall-clock time of a job is at a minimum four times the CPU-seconds, and more typically seven times; we expect that in the best case in a real environment we could get this problem to run in 48 minutes or so. Thus this multi-million dollar system could run the problem, in the best case, more slowly than our $70K ELC farm. Additionally, we could add processors (cheaply) to the ELC farm and expect speedup into at least the 32-processor range.

The newest supercomputer from Cray is the C90. We have estimated that given dedicated access to a 16-processor system, and assuming optimal vectorization, this problem could be run in a minute. Given more typical supercomputer environments the problem would probably take 5-10 minutes. Thus this machine, costing approximately $30 million, could outperform our ELC farm.

Finally, a comment on speedup. We observed almost completely linear speedup for the 37-million photon case. That is, in the 16-machine case, where each machine did the computation for 65536 photons, the problem ran 16 times as fast as the one-machine case-- minus the 40 or so seconds it took to start up the tasks. This 40-second overhead was nearly constant across all problem sizes, and became a serious performance problem for problems which only take a few minutes to run (note once again that these small problems are not of interest for this application, but are of interest for measurement purposes). For example, with only 6534 photons per processor, the total time for 16 processors was linear speedup (150 seconds) plus the startup time for the 15 additional tasks (40 seconds). Obviously in the case of a farm with 60 or 70 processors, process initiation is an issue. We have done some recent work, of which more in another paper, which has brought process startup time down to the 50 ms range for a process on a remote machine. This fast process startup will improve the speedup for small problem sets, since the processes can be up and running in under a second.
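To put a rough number on this startup effect, using only the figures quoted above (150 seconds of near-perfectly-parallel compute time plus 40 seconds of task startup across 16 machines), the effective speedup for the small case works out to approximately

    S = \frac{16 \times 150\,\mathrm{s}}{150\,\mathrm{s} + 40\,\mathrm{s}} \approx 12.6

that is, roughly 79 percent efficiency rather than the near-linear speedup seen on the large runs.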
*16 MegaPixel Display

The 16 Megapixel display is composed of 16 SPARCStation ELCs positioned in racks as a 4 by 4 grid of monitors. The display racks were designed and fabricated at SRC. They hold the monitors in place with no space between them. The display is programmed as one large frame buffer. The programming library is the pixrect programming library from Sun, which is a memory-based programming library. The programmer calls a function named displaycreate, which takes as an argument a file name, which should be a file in an MNFS file system. This file is used as the shared memory segment for the 16 megapixel area. From that point on the programmer need only use Sun's pixrect operations in the normal way. A function called syncmpr is provided which will perform an msync on the mapped-in file and deliver changed pages to clients holding them. Each client in turn performs raster operations to copy a fragment of the frame buffer file to its local frame buffer. A layout of the structure of this system is shown in Figure .

We can display one new frame a second. This is surprising given the unoptimized nature of this implementation, which was done in a day or two by an undergraduate. The hardest part of this job for the undergraduate was learning how to use the pixrect functions; the second hardest part was setting up the demo, which required us to scan many bitmaps until we could find enough G-rated bitmaps for display. The display is not as fast as we would like, but is as fast as it could be considering we are running on Ethernet and are shipping bitmaps around. We are working on a version which uses a display list architecture so that the descriptors for objects are what is shipped, not the bitmaps for the objects. Nevertheless the large display has some lessons for us:

- Using virtual memory to support data sharing is effective. In this application the user never needed to know what pages were used by which displays, which bits in the file corresponded to which bits in the display, and so on. In fact, until we did the diagram for Figure we never thought about the correspondence of pages to display areas at all. By using the msync operation to sync pages to displays we freed ourselves of the task of tracking damage regions; the tracking of changed data is an inherent part of the VM architecture, and using it saves a lot of effort. Note that any object can span any number of displays, up to 16. Tracking and computing which objects fit where and which data structures to send to which device would have been very complex, as would have been setting up the communications channels.
- X11 is not required for network graphics programming. This demonstration is a large bitmap displayed on 16 networked machines. X11 was not used for any part of the program. That is one reason why it was so easy.
- Page fault is necessary but not sufficient as a way for a process to get a new copy of changed data. In this case, the renderer program is causing the newest copy of the modified pages to be delivered to the clients. Other DSMs such as Ivy or Mach external pagers would require that the pages be invalidated and that the processes fault them in when they are referenced, which we have measured at a five-fold impact on performance. DSMs which do not support some form of network refresh will not run well in a high-latency network.
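The programming flow described above can be sketched as follows; the paper names displaycreate and syncmpr but does not give their signatures, so the prototypes and the MNFS file name here are assumptions.

    /* Illustrative renderer-side flow for the 16-megapixel display. */
    #include <pixrect/pixrect_hs.h>          /* Sun pixrect library */

    extern Pixrect *displaycreate(char *mnfs_file); /* maps the MNFS file as the
                                                       16-megapixel frame buffer */
    extern void     syncmpr(Pixrect *pr);           /* msync the file and deliver
                                                       changed pages to clients  */

    int main(void)
    {
        Pixrect *screen = displaycreate("/mnfs/display/framebuf"); /* hypothetical path */

        /* ... ordinary pixrect operations (pr_rop, pr_vector, pr_text, ...)
           draw into screen exactly as they would into a memory pixrect ... */

        syncmpr(screen);    /* push the modified pages out to the 16 ELCs */
        return 0;
    }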
*Rarefied Gas Model

An application under development is a rarefied gas model. In this application particles traverse a region, divided into a grid, and collide with complex objects and other particles. The communications-intensive part of this application involves transferring particles from one grid region to another. The communication is N-squared and hence challenging. Our communications structures will be FIFO-like. The only coherent variables will be pointers to free areas; noncoherent loads and stores will be used to transfer particle information from processor to processor. This application is interesting because, while it makes almost no use of the coherency MNFS makes available, without that coherency for the FIFO control variables it would not be possible to use the file system at all over the network.
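One way such a FIFO might be laid out is sketched below; the field names, sizes, and the placement of the control words on the short page are assumptions, not the application's code.

    /* Illustrative FIFO-like transfer structure: only the control indices
     * are kept coherent (they fit in the 32-byte short page at the start of
     * a long page and can be advanced with CFOP), while the particle slots
     * are written with ordinary, incoherent loads and stores. */
    struct particle {
        double x, y;     /* position */
        double vx, vy;   /* velocity */
    };

    #define FIFO_SLOTS 512

    struct particle_fifo {
        /* Coherent control words: indices of the free area, kept within the
           first 32 bytes so that they live on the short page. */
        volatile long head;
        volatile long tail;
        /* Bulk particle data, transferred with noncoherent loads and stores
           on long pages. */
        struct particle slot[FIFO_SLOTS];
    };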
Summary

Mether-NFS, or MNFS, is an ``industrial strength'' DSM implemented as a modified NFS. It can be dynamically loaded into a running SunOS kernel; it can be measured and controlled with standard NFS software tools; it supports traditional NFS semantics on files, which users know how to work with; but for memory-mapped files, the behavior of writes is improved so that they are globally ordered. It has been installed and is in use at a number of universities and research institutions.

MNFS performance is at least as good as NFS performance in almost every category, and is superior to it in most. A major improvement of MNFS over NFS is in the area of memory-mapped files. Memory-mapped files have globally ordered writes, allowing the use of atomic test-and-set operators. MNFS also supports high-performance synchronization operators which allow very precise control of the activities of many computers. MNFS has a number of characteristics, including two different page sizes, program control of invalidation, and network refresh, which make it flexible and adaptable to the needs of different programs.

MNFS is currently running on SPARC-based systems running SunOS version 4.1.1 or higher. No kernel recompilation is required; MNFS is a dynamically loadable system. MNFS is being ported to SGI systems running IRIX 5.0 or later. MNFS is also being ported to AIX 3.2 on the IBM RS/6000. A limited version of the system, running without short pages, is working now. The AIX work is being done at the University of Pennsylvania by John Shaffer.

We have found that MNFS supports a number of very different applications very effectively. Three of the ones described in this paper are a Monte Carlo simulation; a 16-machine, 16 Megapixel display; and a rarefied gas application. MNFS allowed us to move the Monte Carlo program transparently from a shared-memory system to a network of SPARCStations. The use of MNFS made many of the aspects of programming the large display transparent to the user, and allowed us to exploit the virtual memory architecture of SunOS to manage which data had to move, and to which machines. Doing all the work ourselves would have made the process very difficult. The rarefied gas application makes very little use of MNFS memory consistency, but without the little it does use it could not be run on the SPARCStation cluster at all.

Future Work

There are a number of areas which we are pursuing. We are working on getting the applications running on the 48-node cluster, recently assembled, 32 of whose machines were supplied by Sun Microsystems as part of a research collaboration. The rarefied gas application is the most interesting case, as it has a large amount of communications and requires high-performance synchronization. CFOP was originally motivated by our planning for this application.

We are currently porting the system to Solaris 2.1, which has a newer Virtual Memory architecture and other changes. Fortunately the MNFS changes to NFS are quite restricted in scope, so starting with NFS and creating MNFS is a straightforward process. Work continues on CFOP, toward our goal of 500 microsecond and better round-trip times.

We are also working with MNFS on high-bandwidth networks such as ATM and FDDI. These networks feature larger maximum transmission units, or MTUs; in the case of ATM, the MTU is large enough to send an entire NFS packet in one transmission. Thus these nets hold out the possibility of doing an entire MNFS operation in interrupt mode, greatly enhancing performance. Finally, hardware implementations using this memory model as a guide are under consideration.

Acknowledgements

Tim Thomas performed the work which made MNFS a mod-loadable module. Scott Zettlemoyer has done a considerable amount of testing and bug-finding for MNFS. Kevin Smith is not a co-author of this paper, but he is a co-author of MNFS. Without Kevin's contributions to the design and implementation of MNFS it would not exist.