The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

Mether-NFS: A Modified NFS Which Supports Virtual Shared Memory

Ronald G. Minnich
Supercomputing Research Center

*Abstract

Mether-NFS (MNFS) is a modified NFS which supports virtual shared memory. The model it supports is based on the Mether model. It is compatible in most ways with standard NFS. The major difference is in the way mapped files are handled. Mapped files, rather than adhering to standard NFS memory consistency semantics, instead obey rules such that writes are globally ordered. This allows MNFS to be used to support shared memory programs.

MNFS has been used on a 16-node SPARCStation ELC cluster to support shared memory programs. On appropriate applications we have realized supercomputer-level performance, which, given the modest cost of the cluster, implies a correspondingly better price-performance ratio. MNFS performance is superior to NFS performance in every case, but particularly in the case of write performance. Writes in MNFS are at least twice as fast as NFS.

MNFS provides an extended memory operator, called Conditional Fetch-and-op (CFOP). CFOP is used to support memory synchronization. Programs using CFOP can atomically test and set a variable in 1.3 milliseconds. Our target performance goal for CFOP is 500 microseconds. In order to achieve this performance we have had to highly optimize the path through the SunOS kernel on both the client and server sides.

Introduction

Virtual Shared Memory, also called Distributed Shared Memory (DSM), provides the shared memory abstraction on a set of machines connected via a network. DSMs can also be implemented in hardware, e.g. MemNet, but we are restricting our discussion in this paper to software DSMs. Implementations to date include Ivy, Mether, and others.

A significant problem to date has been the absence of an ``industrial strength'' implementation of DSM that would allow ordinary users access to it for their applications. Despite the efforts of researchers, no implementation of software DSM that we know of has made it far beyond the labs which have developed it (including our own) and into general use. User evaluation and feedback on the idea of DSM is essential in order to further develop it.

We have developed a DSM at SRC called Mether. The source has been made available since 1991 and it has been installed at a number of universities in the U.S. and overseas for both research and applications purposes. While researchers have studied it for their own work, it has proven to be too unfamiliar an environment for normal users and the result has been little applications development using Mether outside SRC. In order to provide a system more usable to non-researchers we rebuilt Mether using NFS as the support base. The result is Mether-NFS, or MNFS.
Our goals for MNFS were:

- Provide ease-of-use for DSM users by fitting the Mether model into a Unix file system framework. The result is a value-added NFS that offers standard NFS semantics for read and write operations, and offers a useful DSM model for memory-mapped files.
- Provide a tool for sysadmins that was easily installed, requiring no kernel recompilation or expertise beyond that needed to use NFS.
- Provide a system that was observable with standard tools found on Sun systems.
- Provide compatibility, to the extent possible, with standard NFS. This implies no changes to NFS Remote Procedure Call types or External Data Representation structures.
- Provide a reliable system, at least as reliable as NFS.

These goals were met. MNFS:

- looks to users like NFS
- is dynamically loadable after the computer has been booted, using Sun's ``modload'' utility
- does not modify RPC or XDR components of NFS in any major way, with the result that MNFS packets can be monitored with all the tools used to monitor NFS
- supports a memory model that makes it possible to use shared-memory programming techniques on memory-mapped files
- provides users with a ``recoverable'' virtual memory, in the sense that virtual memory segments are backed by files and thus can be examined during an application's execution and as part of post-mortem debugging. Memory segments backed by MNFS can even be integrated into a checkpoint/restart scheme.

In addition to operating as an NFS file system, and supporting the Mether model for memory-mapped files, MNFS adds an extension called Conditional Fetch-and-op (or CFOP). CFOP is used for low-latency synchronization. Using CFOP, processes can perform P and V operations on shared variables in 1.3 milliseconds. This performance is far better than we are able to achieve via other mechanisms.

In the following sections we provide an overview of the MNFS applications interface, followed by a description of the implementation of MNFS and the CFOP support, and then a discussion of performance.

Applications Programming Interface

The applications programming interface for MNFS is for the most part identical to that offered by NFS-based Unix file systems. There are a few areas in which the two differ:

- Behaviour of the mmap system call. In standard SunOS, the mmap system call is used to map a file in. The user may specify an mmap area that is larger than the file; if a process references a mapped file area beyond the current end of file, the process will receive a segmentation violation signal and in the usual case will terminate. In MNFS, the file size is grown as needed by the mmap support code to match the size of the mmap request. Thus no segmentation violation will occur if the user references an area of the file beyond the initial end-of-file.
- Semantics of virtual addresses. As described below, there are two different page sizes supported in MNFS. One bit of the virtual address is used to indicate whether the reference is to the long-size page space or the short-size page space.
- No read-ahead. MNFS does not currently read ahead. The traditional NFS read-ahead plays havoc with memory-mapped files.
- CFOP system call. The CFOP system call is available to users of MNFS. CFOP operates on mapped-in MNFS files.
- Swap instructions work. Atomic memory instructions such as the SPARC SWAP instruction work properly (assuming, of course, that the SPARC implementation in question executes the instruction correctly; not all do). Users used to traditional NFS semantics may be surprised.
- File locking is not necessary. Given that swap instructions work, and given the availability of CFOP, we have found no need to use the standard fcntl system call to lock regions of a file for writing.

As mentioned above, the memory model for MNFS mapped files is based on our earlier work with the Mether distributed shared memory. Mether presents users with a virtual address space partitioned into pages. A Mether operation on any part of a page applies to the whole page as well. For example, a reference to part of a page will cause a whole page to be fetched.

The operations that Mether supports on pages differ in several crucial ways from some of the DSMs mentioned above. On a memory-fetch by memory-fetch basis, programs can determine the type of service needed for that fetch. The types of service supported are:

[Exclusive Write Access] When a process is writing to a page, it has the only writeable copy of that page. However, unlike systems such as Ivy, Mether does not invalidate all readable copies of the page. The exclusive access for the writeable page guarantees global serialization of all writes to the page. It also eliminates the need for cache invalidation protocols with their attendant overhead.

[Latency Tolerance] Latency tolerance is supported in several ways. First, there is an operation that is used by processes to fetch items that may take a long period of time. This operation is used by processes to fetch data without a page fault, but to also indicate that a high latency is both acceptable and expected. Second, there are pre-fetch operations that allow processes to cause the system to acquire pages before they are used.

[Network Refresh] Network refresh is a process by which an application can deliver the latest copy of the writeable page to holders of a readonly copy. Recall that there is only one writeable copy. This operation is called ``post store'' by Kendall Square in their machine and has identical semantics.

[User Purge] In user purge, an application can purge all the readonly data in a given address range. Thus, a program can ensure that it is reading the most recently written data. The user purge and network refresh can be used in a way similar to the memory barriers on the DEC Alpha microprocessor.

[Support for different-size objects] Programs can, on a fetch-by-fetch basis, choose to access pages that have a very low transmission cost but do not move much data (32 bytes, a size chosen with ATM in mind); or they may choose to access large pages that have a higher transmission cost but are useful for moving large amounts of data (currently 8192 bytes). The small pages (``short pages'') are useful for synchronization and small objects; the large pages are useful for bulk data transfer. These sizes are well matched to the sizes most commonly seen in networking environments.

[Event-Driven Memory Synchronization] Event-Driven Memory Synchronization (EDMS) allows a process to pause waiting for another process to issue a network refresh. Thus the resumption of execution of a process is a direct consequence of an action taken by a process on a remote machine. This synchronization mechanism was found to minimize the network and host load and hence to minimize the contribution made to the overall latency by host and network load. EDMS is not a primitive operation but is constructed from two primitives: the reader performs the operation to fetch in the high-latency mode, and at a later time the writer performs a network refresh.
[CFOP] CFOP is described below, but is a high-performance operator for shared variables. Our target goal for CFOP is 500 microseconds round-trip time. We currently can perform the operation in 1300 microseconds.

Shown in Figure is a portion of program text. The program shown opens a file and maps it in. It then invokes different MNFS operations. Each type of MNFS operation invoked and what it does are described in the following sections.

*User Purge

The user purge of pages is accessed via the SunOS mctl call. We use an mctl interface function called msync, which is part of the C library. In the example, the program is purging a cached copy of a page that was accessed via reads only. Of course we could guarantee that the page would always be read-only via options to the mmap call; specifically, specifying PROT_READ instead of both PROT_READ and PROT_WRITE as the protection.

*Network Refresh

The next code fragment demonstrates network refresh. The key difference is that the page is present at the local machine and writeable. In this case, at a minimum, MNFS will return a copy of the page to the server machine. In addition, for each machine holding a read-only copy of the page (the readholders mentioned above) a copy of the short page will be sent to that machine. When the short page is received by a readholder it will be paged in, thereby accomplishing two things:

- replacing the short page currently at the machine, if any, and allowing processes to examine variables on the short page without going through a page-fault cycle;
- invalidating any copies of the long page currently at the machine, requiring that further references to the long page be satisfied with a page fault for the new copy.

In the sample code there are two variations on the network refresh. In the first call to msync for the writeable page, the MS_INVALIDATE option is set. This will invalidate the writeable copy of the page held at the client. For the client to access the page again it must fault it in by storing into the variable x. In the second msync call, the invalidate option is not set, so that a writeable copy of the page remains at the client. Both of these scenarios are useful in different applications.

*Short and Long Pages

In the next mmap call the program converts the normal address, x, to a pointer to a short page. It then maps the short page in. In the short page space, only the first 32 bytes of the page are transferred from and to the server. Short pages are useful for small variables that move frequently, such as synchronization variables.

Short pages represent the first 32 bytes of data on a long page. Thus a page can be treated as a union of a 32-byte entity and an 8192-byte entity. The 32-byte entity is accessed via the short page space; the 8192-byte entity via the long-page space. The two address spaces have very different latencies and bandwidths by design. Short page latency is determined by the latency of the protocol stack; the cost is not measurably different from that of moving a zero-byte packet. Long page latency, on the other hand, is dominated by the time cost of moving 8192 bytes of data through the networking software and over the wire. Currently it takes about 5 milliseconds to get a short page on SPARCStation ELCs over the Ethernet; long pages take 14 milliseconds in the best case. Bandwidth is very low for short pages at just 640 bytes/second; long pages run at 585 Kbytes/second. The long page bandwidth is lower than we would like; we should be able to get close to 1 Mbyte/second.
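The following sketch illustrates the user purge and network refresh calls walked through above; it is not the figure's actual program text. The file name is hypothetical; PROT_READ, PROT_WRITE, MAP_SHARED and MS_INVALIDATE are the standard SunOS flags named in the text. Error checks and the short-page address conversion are omitted.

    /* Sketch of msync-based user purge and network refresh on a mapped
     * MNFS file; illustrative only.  The path is hypothetical. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define LONGPAGE 8192

    int main(void)
    {
        int fd = open("/mnfs/shared/data", O_RDWR);
        int *x = (int *)mmap(0, LONGPAGE, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

        /* User purge: discard a locally cached, read-only copy of the page
           so the next reference fetches the most recently written data. */
        (void)*x;                                    /* page was only read */
        msync((caddr_t)x, LONGPAGE, MS_INVALIDATE);

        /* Network refresh, first variation: write the page, return it to the
           server (which refreshes the readholders), and invalidate our copy. */
        *x = 1;
        msync((caddr_t)x, LONGPAGE, MS_INVALIDATE);

        /* Second variation: refresh the readholders but keep a writeable
           copy of the page at this client. */
        *x = 2;
        msync((caddr_t)x, LONGPAGE, 0);

        munmap((caddr_t)x, LONGPAGE);
        close(fd);
        return 0;
    }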
*CFOP

The next operation shown is a Conditional Fetch-and-op. The operator in this case is Compare and Swap on a single word. The target is a virtual address in the mapped-in file; the source values are provided by the program. The target is compared to the first value and, if they are equal, is set to the second value. The value of the target after the operation is returned. The implementation actually performs the operation at the server, and performs it in such a way that the test is indivisible from the assign. There are other operators that can span up to eight words; whether the operation is performed is conditioned on the first word of the target being equal to the first word in the source. Thus a lock can be set and values assigned to a locked variable in one operation. Compare and Swap on a word is a subset of this more general operation.

*Load Incoherent

RISC processors such as the R4000 have a way of loading read-only or writeable cache lines without actually reading them in from memory. The operation is called Load Incoherent. Load Incoherent is useful for MNFS programs as well. The next mctl operation shown is a LOADINCOHERENT operation. The set of pages starting at the address given by the pointer x and extending for sizeof(*x) bytes will be created at the client without communicating with the server at all. This greatly increases the effective bandwidth to the MNFS server, since no data need move.

*Event Driven Memory Operations

Finally we show an example of Event Driven Memory Operations (EDMS). In this case the program will wait for up to 30 seconds for another program on another host to perform a network refresh of a page. Note that the 30-second time is selected by the programmer for this instance; any time may be selected, including infinity. If x were a variable on a short page, and there had been a network refresh, x would contain the new value after the refresh. Since x is a variable on a long page, the next reference will cause a page fault, as network refresh invalidates long pages. If there has not been a network refresh, the next reference to x will cause a fault only if the program has previously purged the page containing x.

*Summary

This interface doubtless seems very low level compared to other systems. That is precisely the intent. The goal of Mether, and hence MNFS, is to provide a completely exposed set of cache operations to applications. In the long term our goal is to have a high-level language use these operations, but we have also found that in some cases programmers need to have direct access to the low-level interface. Thus the operations described in this section should be considered a set of building blocks for higher-level operators. An attendant goal of Mether from the beginning has also been to prototype the API for a hardware-based system. Of necessity, therefore, the API of this system is far more constrained than what we can do in a system that will only ever be implemented in software. A hardware-based system being designed now holds out the promise of far better performance than any software-based system extant.
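To make the shape of these calls concrete, a sketch follows. The paper does not give the C-level names of the CFOP/EDMS entry point or the mctl operation code, so cfop(), MNFS_CAS, MNFS_EDMS_WAIT and MNFS_LOADINCOHERENT below are hypothetical placeholders; only the argument pattern (virtual address, operation, values) comes from the text, and the mctl prototype follows the SunOS 4 manual page.

    /* Sketch only: cfop() and the operation codes are hypothetical stand-ins
     * for MNFS's dynamically loaded system call. */
    #include <sys/types.h>
    #include <stddef.h>

    extern long cfop(caddr_t va, long op, long val1, long val2); /* hypothetical */
    extern int  mctl(caddr_t addr, size_t len, int function, void *arg);

    #define MNFS_CAS            1   /* placeholder opcodes, not MNFS's values */
    #define MNFS_EDMS_WAIT      2
    #define MNFS_LOADINCOHERENT 3

    void example(long *x)
    {
        long my_id = 42;            /* any nonzero value unique to this client */
        long seen;

        /* CFOP compare-and-swap: if *x == 0, set it to my_id, atomically at
           the server; the returned word tells us whether we got the lock. */
        seen = cfop((caddr_t)x, MNFS_CAS, 0, my_id);
        if (seen == my_id) {
            /* the word was 0 and is now my_id: we hold the lock */
        }

        /* Load Incoherent: create the page(s) for x locally, without moving
           any data from the server. */
        mctl((caddr_t)x, sizeof(*x), MNFS_LOADINCOHERENT, 0);

        /* EDMS: wait up to 30 seconds for a remote network refresh of the
           page containing x. */
        cfop((caddr_t)x, MNFS_EDMS_WAIT, 30, 0);
    }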
Implementation

Before discussing MNFS implementation details we will first provide an overview of what is lacking in the NFS handling of memory-mapped files.

*NFS Limitations

Shown in Figure is an example of how NFS handles memory reads and writes to a mapped file. Two client programs on different machines have mapped in a file. The file is mapped in such a way that both reads and writes are possible. A client reads a memory location, causing a writeable copy of the page to be loaded. A writeable copy is loaded, and not a read-only copy as might be expected, because the file is mapped with write permission. The program then performs a SPARC SWAP instruction, which performs an indivisible memory operation. At this point, in a normal memory system, all processes accessing the memory would see the same value, or in the worst case, would see the old value for some time before seeing the new value. In NFS, however, as the illustration shows, the other process has also executed a SWAP, and the processes in fact see different values. For every process involved in using this memory-mapped file, there can be a different value of the shared location.

More important, there is no way to determine what value the variable will eventually have. At some point the many writeable copies of the page will be returned to the server. Depending on which page returns last, the variable will end up with one of many different values. A process writing a different variable on the page may in fact restore the old value from before the swap instructions were performed. The situation is further complicated by any retries that may occur as part of the NFS protocol. Thus, the first write may be retried, making it the last write, and it may be retried many seconds later, so that programming techniques that attempt to work around this problem by waiting for a time for the value to ``settle down'' will not work.

Thus, in NFS mapped files:

- time can move backwards-- a variable can take on an old value, then a new value, then an old value
- writes are not ordered-- at different nodes, the order of changes to a page can be different
- the behavior is highly non-deterministic
- it is not possible to use the mapped variables for synchronization
- it is not possible for a process holding a writeable copy of a page to deliver it to the processes holding a readonly copy of the page, so that they may see the changed value of a variable

There are other problems as well. Only one page size (4096 bytes on the Sun 4c and 4m SPARC implementations) is supported, while programs frequently need small data pages for synchronization and large data pages for bulk transfer. There is no support for direct application-to-application synchronization via NFS. Thus the problems are several:

- NFS client code does not differentiate between read and write faults
- NFS server code does not ensure that only one process is writing a page at a time
- the only coherency mode supported is completely incoherent, while programs can usefully use both the incoherent mode of NFS and strongly coherent modes offered by other systems such as Ivy
- more than one page size is needed
- a mechanism for delivering a writeable copy of a page to holders of readonly copies is needed
- a mechanism for direct application-to-application synchronization is needed

We had built a memory model that incorporated many of the properties we need in an earlier system called Mether. We determined that the SunOS virtual memory system and NFS implementation were flexible enough to accommodate the Mether model as part of NFS. We will now describe the changes we made.
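A sketch of the failure just described follows; it is illustrative only and not from the paper. The swap_word() wrapper stands in for the SPARC SWAP instruction the text refers to, written here with a GCC builtin, the file name is hypothetical, and error checks are omitted.

    /* Two clients running this code against the same NFS-mapped file can
     * both "acquire" the lock, because each swaps against its own writeable
     * copy of the page. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int swap_word(volatile int *addr, int newval)
    {
        /* Atomically store newval and return the previous contents. */
        return __sync_lock_test_and_set(addr, newval);
    }

    int main(void)
    {
        int fd = open("/mnfs/shared/lockfile", O_RDWR);
        volatile int *lock = (volatile int *)mmap(0, 8192,
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        while (swap_word(lock, 1) != 0)   /* spin until the old value was 0 */
            ;
        /* critical section ... */
        *lock = 0;                        /* release */

        /* Under standard NFS two clients can both see 0 above and both enter
           the critical section; whichever copy of the page reaches the server
           last wins.  Under the MNFS rules described next there is only one
           writeable copy, so the swaps are serialized. */
        munmap((void *)lock, 8192);
        close(fd);
        return 0;
    }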
MNFS Changes

The single most important MNFS change is the fact that only one writeable page of a mapped-in file may be present in the network at a time. When a process faults on a page, and that page is not present, the host supporting that process must request the page from the server. If the page is not present at the server it must be located by the server and brought back to the server.

The server might have several ways to locate a page. The naive way would be to broadcast for a page every time it is needed. This would preserve statelessness. It would also limit scaling, as broadcast-based systems do not scale well. We decided that in no case would MNFS use broadcast. Our choice was to have the server track ownership of pages. For each page, the server records who has a writeable copy and who has read-only copies.

The tracking of pages has many implications for NFS. The most important is that there is now server state related to a client's file activity. This further implies that a server crash and reboot may not be transparent to a client, as it is with standard NFS. The stateless model is one casualty of our change to MNFS.

This change is one we thought about carefully. NFS has from the beginning relied on its stateless model to reduce complexity and increase reliability. For the environment and time in which NFS was created, such a model made sense. At the time it was not unusual for servers to crash on a daily or weekly basis. For the present, however, our experience with servers is that they don't crash. We have servers that stay up for months at a time, going down only when we need to add a device or change the OS. When the servers do go down it is in an orderly fashion, so a controlled clean-up of server state is possible. In the one case in which we have had a server crash in the last year it was because disk heads and disk arms parted company; it no longer mattered that NFS had reliability because the disk was physically unreadable. There is no point in staying with the stateless model in this environment-- its restrictions are too constraining, and it is buying us reliability we don't need. Statelessness provides a constant performance penalty, for reliability that is not necessary, in order to cover for an event that in a well-run environment does not occur.

At the same time, we advocate taking care in what state is added to the server. We have been careful to add a very small amount of state per file, and to add it in a way that recovery from server crashes and client crashes is possible. A distributed application may not survive all crash scenarios, however. There is a possibility that a client process may change a page, sync it back to the server, and have the changes not be preserved. If the server crashes at just the right instant, such that the client is satisfied that the server has the page and has therefore discarded its own copy, the new page may be lost. Thus the client has an invalid view of the contents of the page, and the correct view can not be restored.

Creating this particular failure scenario is extremely difficult. The easiest way is to halt the client machine (literally halt it-- toggle the run/halt switch, or on Suns use the L1-A key sequence), power off the server machine without shutting it down in an orderly way (note that unplanned power failure is also very unlikely, given that any server of any importance is backed by both uninterruptible power supplies and non-volatile memory), then continue the client. Once the server is back up, it will have lost its records of clients holding pages, and will reject returned writeable pages from the client. At that point the newest copy of the writeable page is lost, and the application which wrote it is in a state predicated on the server having that writeable page.
Since there is no return from this scenario, we have code which will send the client a segmentation violation signal in this case. In testing, the code has been exercised and works; in practice this situation has never happened.

The state we added to the server was related to pages from files. On a per page basis we build a record which keeps track of:

- the time the state record was created
- the name of the current host holding a writeable copy of the page, if any (the writeholder). There is only one writeholder allowed.
- a list of the current hosts holding read-only copies of the page, if any. We call these hosts readholders.

Given that we can track read- and writeholders of a page, there remains the question of how to get a page back, or how to deliver new copies of a page to a readholder. It so happens that the types of operations we need to do fit within the domain of the read and write NFS operations. Getting a page back or asynchronously delivering a page to a client requires that an NFS server be able to perform a Remote Procedure Call (RPC) to a client. In the case of getting a page back we issue an NFS read to the client. For delivering the new page we issue an NFS write. This technique requires that an MNFS client be able to respond to MNFS operations that before only a server needed to respond to: an MNFS client must be able to support server operations. Our solution is to make every client capable of handling these RPCs. We accomplish this by running an NFS server on every client. The server's functions are very limited, but it is a server nonetheless. The burden of running a server has not proven to be a problem.

For all the other operations save CFOP and EDMS, calls to the SunOS mctl function are used. The CFOP operation required a new system call; this call is dynamically loaded as part of the MNFS module.

Shown in Figure is a server with four MNFS clients. One of the clients is accessing the page as a writeable page; three others are accessing the page as a readable page, one of them in the short page space. As clients request readonly pages the server returns the copy of the page at the server; the server does not request copies of the most-recent page from the writeholder.

Shown in Figure is the sequence of events that occurs when a client requests a writeable page. The server must request the writeable page from the client holding it; the server then invalidates any long page copies held at other clients, and refreshes short page copies; and finally, the server returns the page to the client requesting it. The sequence of events is the same whether a long page or short page is accessed by the client. The message sent to invalidate the long pages and refresh the short pages is actually the same; clients interpret it differently depending on the type of page they are holding. The principle is that changes to a page only become visible to other clients when the page transits the network from the current writeholder to the server.
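A sketch of the per-page state record described above follows; the field names and types are illustrative assumptions, not the MNFS source.

    /* Illustrative per-page state record, one per tracked page. */
    #include <time.h>

    #define MNFS_MAXHOST 64

    struct mnfs_readholder {
        char                    host[MNFS_MAXHOST]; /* host with a read-only copy */
        struct mnfs_readholder *next;
    };

    struct mnfs_page_state {
        time_t                  created;            /* when the record was made  */
        char                    writeholder[MNFS_MAXHOST];
                                                    /* host with the one
                                                       writeable copy, if any    */
        struct mnfs_readholder *readholders;        /* list of readholders       */
        int                     short_page;         /* client asked for the
                                                       32-byte page (see below)  */
    };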
*User Purge

User purge is a mechanism already supported in the SunOS kernel. In SunOS, an msync call will invalidate a page if that page is part of the process's address space and it is not modified, which in terms of MNFS means it is read-only. No changes needed to be made to NFS to support user purge.

*Network Refresh

Network refresh also uses existing SunOS mechanisms for support. In SunOS, when an msync is issued for a writeable page, and that page is a mapped file backed by an NFS file system, the page is flushed back to the server. On the server, MNFS extends the handling of returned pages from clients by sending out a write of the short page to all clients holding read-only copies of the long page or the short page. This is the same mechanism shown above in Figure . A network refresh sequence is shown in Figure .

For network refresh of long pages we require that the program use a different system call than msync. This is because network refresh of long pages is very expensive in network and server load, and should not be called unless the programmer absolutely intended to call it. Nevertheless network refresh for long pages is available, and is used.

*Short and Long Pages

Short and long page access is determined by a bit in the virtual address. This technique is used as it is compatible with a hardware implementation and also turns out to be the most portable across different operating systems. The choice of which bit to use turns out to be sensitive to the implementation of the memory management architecture found in most SPARCStations. In these architectures there is a hole in the virtual address space of the processor in the range 0x20000000 -- 0xDFFFFFFF. It is not possible to make a reference in this part of the processor's virtual address space. The implication is that the maximum virtual address space of a process's mapped files for MNFS purposes is 512 Megabytes. The use of a bit in the virtual address space for long and short pages reduces this to 256 Megabytes. This range is smaller than we would like, but usable.

For process page fault handling, the MNFS support code examines the faulting virtual address. If the short page bit is set, the client only requests 32 bytes of data for the page. The server also records what type of page the client requested, and compares the amount of data the client later returns to what was earlier handed out. Any difference is logged as an error. Note that if a short page is present then faults on the long page are still possible. In the event of a fault on a long page, the client first returns the short page if it is held writeable and then requests the long page. It would be possible, at the cost of added complexity and client/server interaction, to simply ask for the remaining part of the page. That sequence is not followed in order to avoid potential deadlock scenarios that can occur when the server is very busy. Typically programs use a page as either long or short only; repeated access in the two different modes is not used by any programs currently running.
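The client-side size decision can be sketched as follows; the bit position and names are assumptions (0x10000000 is merely consistent with the 512 MB to 256 MB reduction mentioned above), not the MNFS source.

    /* Illustrative fault-size decision for a faulting MNFS virtual address. */
    #define MNFS_SHORT_BIT  0x10000000UL    /* short-page space selector bit */
    #define MNFS_SHORT_SIZE 32
    #define MNFS_LONG_SIZE  8192

    /* How much data to request from the server for a faulting address. */
    static unsigned long mnfs_fault_size(unsigned long faulting_va)
    {
        return (faulting_va & MNFS_SHORT_BIT) ? MNFS_SHORT_SIZE : MNFS_LONG_SIZE;
    }

    /* On a long-page fault the client first returns the short page, if it is
       held writeable, and then requests the long page (see text). */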
*CFOP

CFOP was an interesting case where a great deal of optimization at all levels was required before performance was acceptable. Our goal for CFOP, still not achieved, is to perform the operation in a measured round-trip time of 500 microseconds, from the initiation of the call in the program to its return to the program. This is approximately 100 times as fast as an equivalent operation can be performed via NFS. The time was chosen because our applications programmers have been demanding it; also, this 500 microsecond time puts us in the company of several commercial MPPs.

CFOP came about as part of our attempt to provide high-performance synchronization. The goal was to be able to set a lock variable to a value and then make that variable visible to other processes. Before CFOP, in order to set the variable it had to be fetched back to the process trying to set it. The process performed a SWAP instruction, and then returned the page containing the variable back to the server. This sequence is shown in Figure . The cost of the fetch for a short page is 5 milliseconds. The cost of the msync is 5 milliseconds, plus 2 milliseconds per host that needs a network refresh. Thus this sequence costs at a minimum 10 milliseconds. That is about 20 times our desired performance of 500 microseconds.

It became clear that some sort of operator that performed the memory operation at the server was needed. Performing the operation at the server would make the change immediately visible to all applications and, more importantly, would reduce the number of messages needed to two, the minimum number possible. We defined the CFOP operator to work the following way. Given an operation CFOP(virtualaddress, function, val1, val2), CFOP compares the word at virtualaddress with val1 and, if they are equal, atomically replaces it with the result of applying function to the current contents, val1, and val2; the word at virtualaddress is then returned in either case. Note that val2 can be more than one word, although in the only function we currently have (compare and swap) it is one word. For compare and swap the function simply evaluates to the value of val2; the current value at virtualaddress and val1 are ignored. The return value from the CFOP function to the program calling it is the value of the word at virtualaddress; thus the program can see if the operation succeeded or not. If the operation did not succeed, the value at virtualaddress can be used to compute a new value that might succeed.

CFOP is implemented on the client side as a system call. The system call verifies that the virtual address is a valid mapped region of an MNFS file; gets a file handle for the file; builds an MNFS request; and sends it out. The program does not actually reference the variable directly, and hence performing a CFOP does not imply that the page will be faulted in. By implementing CFOP this way we avoid any costs of virtual memory software that would be incurred as part of a page fault sequence.

On the server side CFOP is implemented as an idempotent NFS operation, one example of which is the NFS read operation. Making CFOP idempotent eliminates the checking for duplicate operations that NFS does for non-idempotent operations such as NFS write, thus speeding the operation up.
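The server-side operation, as defined above, can be sketched as follows; this is illustrative only, not the MNFS server code.

    /* Compare-and-swap CFOP, performed atomically at the server. */
    typedef long (*cfop_fn)(long current, long val1, long val2);

    /* For compare-and-swap the function simply evaluates to val2; the
       current contents and val1 are ignored (see text). */
    static long cfop_compare_and_swap(long current, long val1, long val2)
    {
        (void)current;
        (void)val1;
        return val2;
    }

    /* If the word at target equals val1, replace it with fn(...); in every
       case return the resulting word so the caller can tell whether the
       operation succeeded. */
    static long cfop_apply(volatile long *target, cfop_fn fn, long val1, long val2)
    {
        if (*target == val1)
            *target = fn(*target, val1, val2);
        return *target;
    }

    /* Example: cfop_apply(word, cfop_compare_and_swap, 0, lock_value) */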
The initial performance of the CFOP operator was 5 milliseconds, or ten times as slow as we wanted it to run. A series of optimizations to the server and client sides resulted in a four-fold improvement in performance, to the current 1.3 milliseconds.

The first improvement, on the server side, was to eliminate unnecessary calls to the local disk file system from NFS when the page for the file was in memory and addressable. NFS servers are implemented as a layered architecture over an underlying Unix File System, or UFS. Calls to NFS servers result in calls to UFS. Thus, an NFS read, received at the server, will result in a read call to the UFS. The CFOP operator, to operate on the page, must obtain a pointer to a copy of the page in memory on the server. Thus, to gain access to a pointer to the page, the CFOP operator initially used the standard server technique of calling the UFS read function. Much of the time this extra call to the UFS layer is unnecessary; the page to be read is in memory and accessible, having been accessed in a previous call to NFS. We wrote a kernel function (to our surprise, such a function does not exist in SunOS) to determine whether the page was available and mapped in, and to return a pointer to the page if it was found. In this case we could avoid the call to the UFS read function. This enhancement saved 1.5 milliseconds.

The next performance improvement involved eliminating Sun's implementation of eXternal Data Representation, or XDR, code. XDR is used to allow hosts of different types to transfer data structures back and forth in a standard format. As practiced by Sun, XDR involves a large number of function calls. In the case of CFOP, we were encoding four longs on the client for the request and decoding them on the server; the result was encoded in three longs and decoded as such on the client. We got rid of the Sun XDR code entirely and turned the decode into an inline decode using the traditional conversion functions used in, e.g., TCP/IP. This simple encoding saved 1.5 milliseconds, which was astonishing. Storing and retrieving seven words from memory should take at most 500 nanoseconds. That it takes 3000 times as long using XDR implies that XDR itself needs to be reexamined as a technique for encoding data, or at the very least Sun's implementation of XDR.

We next inlined some of the more complex kernel RPC functions into our operator, which saved 150 microseconds; we then recompiled with the Gnu C Compiler, which saved another 150 microseconds. The total time was now 1.9 milliseconds for CFOP.

The final optimization, which brought us down to 1.3 milliseconds, was to handle the CFOP in the interrupt handler code for the network when possible. The checks for whether a CFOP can be handled at interrupt level are fairly straightforward. Given a packet that has been delivered from the network interface to the generic interrupt handler, if:

- the packet is not a fragment of a larger packet,
- the value of the data at the offset in the packet for an NFS op is the CFOP operator,
- the packet is an RPC version 2, NFS version 3 packet,
- the protocol control block can be located, and
- the page for the file handle at the offset requested is in memory,

then we perform the op in interrupt mode. If these tests fail, the packet is passed into the normal network soft interrupt queue for normal handling. Note that most packets fail the first two tests immediately, and those tests involve at most four word compares; the overhead of this test is not measurable.

There were a number of lessons learned with CFOP:

- there are many unexploited opportunities for optimization in the NFS code
- the SunOS Virtual Memory (VM) architecture makes many optimizations in NFS server code possible, as the VM system and the buffer cache are unified
- Sun's implementation of XDR, with its attendant XDR generation tool, saves a programmer work one time, but penalizes users on every NFS operation; we note here that the 4.4 BSD implementation of NFS uses the more traditional XDR techniques used in TCP/IP and achieves much better performance. In fact we used pieces of the 4.4 BSD NFS code for decoding our CFOP packets.
- given a packet interface with a large enough maximum packet size, many NFS operations could be done at interrupt level
- inlining did not buy nearly as much as it appeared it would
- layering, while an attractive technique from the point of view of ease-of-implementation, is often very costly
- as usual, optimizing the common case is worth the effort

*Load Incoherent

NFS already had a mechanism for the equivalent of Load Incoherent. In NFS, when a write occurs to pages past the end-of-file, NFS simply extends the file and creates the pages locally on the machine. We simply adapted this mechanism for Load Incoherent; we create local pages for writing even when they are within the current size of the file.

*Event Driven Memory Operations

EDMS operates via a system call. It is in fact the same system call used for CFOP, with a different opcode; CFOP takes a 32-bit opcode as one of its parameters. In EDMS, the system call checks to make sure the virtual address is valid and corresponds to an MNFS file. It then checks to see if the page for the virtual address is present, and if so returns immediately. If not, it goes into a kernel sleep on the virtual address. When the client code receives a network refresh for a given virtual address, it will issue a kernel wakeup call for that virtual address. Any waiting processes will be woken up at that point.
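A sketch of this wait/wakeup logic follows, written against the classic SunOS 4.x kernel sleep()/wakeup() channel primitives that the text alludes to. The helper name, the prototypes, and the omission of the timeout handling are assumptions; this is not the MNFS source.

    /* Illustrative EDMS wait and wakeup paths (kernel side). */
    #include <sys/types.h>

    extern void sleep(caddr_t chan, int pri);   /* kernel sleep/wakeup        */
    extern void wakeup(caddr_t chan);           /* (prototypes assumed)       */
    extern int  mnfs_page_present(caddr_t va);  /* hypothetical helper        */
    #define PRI_EDMS 30                         /* placeholder priority       */

    /* Called from the EDMS opcode of the CFOP system call, after the virtual
       address has been validated as part of a mapped MNFS file. */
    void mnfs_edms_wait(caddr_t va)
    {
        if (mnfs_page_present(va))      /* page already here: return at once */
            return;
        sleep(va, PRI_EDMS);            /* sleep on the virtual address      */
    }

    /* Called by the client-side code when a network refresh arrives for va. */
    void mnfs_edms_refresh(caddr_t va)
    {
        wakeup(va);                     /* wake every process waiting on va  */
    }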
Performance

The current version of MNFS extends NFS server and client code in very limited ways. These limitations are intentional; our goal is to make MNFS and NFS work alike as much as possible. Nevertheless the changes to date have made for marked improvements in performance, and in no case have made performance worse. We will cite some numbers here. The numbers were derived from experiments run on SPARCStation ELC systems, which have 33 MHz SPARC CPUs. These systems in our experience have the ability to drive the Ethernet at close to 1 Mbyte/second. We will cite numbers for 8 kilobyte block size file systems. This is the default NFS block size, and the size of an MNFS long page.

For NFS writes, the best-case NFS write performance is 72 milliseconds; MNFS performs them in 35 milliseconds. NFS reads run at 14 milliseconds; MNFS reads take the same time. Alternating NFS reads and writes run at 67 milliseconds. This time is slightly better than the all-write case as the read is causing some caching to occur. In the case of MNFS they run at 35 milliseconds.

The intermixed reads and writes highlight a difference between MNFS and NFS. NFS does not differentiate read and write faults; when a read fault on a page is handled in NFS, and the page is found to be mapped writeable, it is loaded as a writeable page. MNFS differentiates read and write faults; hence a read fault followed by a write fault requires two NFS reads. Thus the MNFS case requires twice as many reads as the NFS case. Nevertheless the performance is twice as fast.

For short pages, the write time is 5 milliseconds, and the read time is 5 milliseconds. Read followed by write takes 10 milliseconds. Note that if the cost were absolutely linear in packet size, the short pages would only take 140 microseconds. We could therefore grow the short pages somewhat. Their size has been determined by a number of factors, however, one of which is our goal that they be able to be contained in an ATM packet. Thus we leave their size at 32 bytes.

CFOP is our highest-performance operator, clocking in at 1.3 milliseconds. Once we finish the move to Solaris 2.1 we will be trying to double the performance.

EDMS provides very precise synchronization. We have a very graphic demonstration at SRC of just how fine this synchronization is. One of our demos sets up 15 SPARCStations in an EDMS wait state on a synchronization variable. When the variable is changed and a network refresh performed, the SPARCStations all play one sound. It is not possible to distinguish any one machine's individual sound from any other's; the first time we ran the demo we thought it was broken, since it only sounded like one machine.

Applications

We will now discuss applications for which MNFS has been used. Their use of MNFS is very different, and highlights the different demands that applications make of such a system.
*Monte Carlo

We first used a Monte Carlo simulation of radiative heat transfer among the surfaces of arbitrary 2-D enclosures. This choice was made for several reasons:

- we have studied this problem in a number of computing environments and are familiar with its performance and behavior;
- the problem is inherently very parallel, but parallelism must be extracted in different ways to exploit different architectures;
- the problem, while only 2-D, is nevertheless of real-world interest-- in particular, we are interested in modeling the geometry of a laser isotope separation (LIS) unit, for which the accurate determination of radiant exchange factors among the surfaces is an important component in the larger simulation of the isotope separation process.

For the LIS geometry, we modeled the trajectories of 37 million photons, one million emissions per surface. All floating point computations were done with 64-bit numbers on both machines. The time to complete this run on the Cray 2 was 262 minutes (CPU time; 10 hours 20 minutes wall-clock time, but that is because we were not the only users on the Cray), while the time for the ELC farm was 28 minutes (wall-clock time).

It is possible that with a sufficient expenditure of effort the Cray code could be vectorized. If we make the very optimistic estimate that this could be done, and that we could get dedicated time on a Cray-2, the problem might be made to run in 12 minutes. Typically, however, Crays are run in such a way that the wall-clock time of a job is at a minimum four times the CPU-seconds, and more typically seven times; we expect that in the best case in a real environment we could get this problem to run in 48 minutes or so. Thus this multi-million dollar system could run the problem, in the best case, more slowly than our $70K ELC farm. Additionally, we could add processors (cheaply) to the ELC farm and expect speedup into at least the 32-processor range.

The newest supercomputer from Cray is the C90. We have estimated that given dedicated access to a 16-processor system, and assuming optimal vectorization, this problem could be run in a minute. Given more typical supercomputer environments the problem would probably take 5-10 minutes. Thus this machine, costing approximately $30 million, could outperform our ELC farm.

Finally, a comment on speedup. We observed almost completely linear speedup for the 37-million photon case. That is, in the 16-machine case, where each machine did the computation for 65536 photons, the problem ran 16 times as fast as the one-machine case-- minus the 40 or so seconds it took to start up the tasks. This 40-second overhead was nearly constant across all problem sizes, and became a serious performance problem for problems which only take a few minutes to run (note once again that these small problems are not of interest for this application, but are of interest for measurement purposes). For example, with only 6534 photons per processor, the total time for 16 processors was linear speedup (150 seconds) plus the startup time for the 15 additional tasks (40 seconds). Obviously in the case of a farm with 60 or 70 processors, process initiation is an issue. We have done some recent work, of which more in another paper, which has brought process startup time down to the 50 ms range for a process on a remote machine. This fast process startup will improve the speedup for small problem sets, since the processes can be up and running in under a second.
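To put a rough number on this startup effect, using only the figures quoted above (150 seconds of near-perfectly-parallel compute time plus 40 seconds of task startup across 16 machines), the effective speedup for the small case works out to approximately

    S = \frac{16 \times 150\,\mathrm{s}}{150\,\mathrm{s} + 40\,\mathrm{s}} \approx 12.6

that is, roughly 79 percent efficiency rather than the near-linear speedup seen on the large runs.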
*16 MegaPixel Display

The 16 Megapixel display is composed of 16 SPARCStation ELCs positioned in racks as a 4 by 4 grid of monitors. The display racks were designed and fabricated at SRC. They hold the monitors in place with no space between them. The display is programmed as one large frame buffer. The programming library is the pixrect programming library from Sun, which is a memory-based programming library. The programmer calls a function named displaycreate, which takes as an argument a file name, which should be a file in an MNFS file system. This file is used as the shared memory segment for the 16 megapixel area. From that point on the programmer need only use Sun's pixrect operations in the normal way. A function called syncmpr is provided which will perform an msync on the mapped-in file and deliver changed pages to clients holding them. Each client in turn performs raster operations to copy a fragment of the frame buffer file to its local frame buffer. A layout of the structure of this system is shown in Figure .

We can display one new frame a second. This is surprising given the unoptimized nature of this implementation, which was done in a day or two by an undergraduate. The hardest part of this job for the undergraduate was learning how to use the pixrect functions; the second hardest part was setting up the demo, which required us to scan many bitmaps until we could find enough G-rated bitmaps for display. The display is not as fast as we would like, but is as fast as it could be considering we are running on Ethernet and are shipping bitmaps around. We are working on a version which uses a display list architecture so that the descriptors for objects are what is shipped, not the bitmaps for the objects. Nevertheless the large display has some lessons for us:

- Using virtual memory to support data sharing is effective. In this application the user never needed to know what pages were used by which displays, which bits in the file corresponded to which bits in the display, and so on. In fact, until we did the diagram for Figure we never thought about the correspondence of pages to display areas at all. By using the msync operation to sync pages to displays we freed ourselves of the task of tracking damage regions; the tracking of changed data is an inherent part of the VM architecture, and using it saves a lot of effort. Note that any object can span any number of displays, up to 16. Tracking and computing which objects fit where and which data structures to send to which device would have been very complex, as would have been setting up the communications channels.
- X11 is not required for network graphics programming. This demonstration is a large bitmap displayed on 16 networked machines. X11 was not used for any part of the program. That is one reason why it was so easy.
- Page fault is necessary but not sufficient as a way for a process to get a new copy of changed data. In this case, the renderer program is causing the newest copy of the modified pages to be delivered to the clients. Other DSMs such as Ivy or Mach external pagers would require that the pages be invalidated and that the processes fault them in when they are referenced, which we have measured at a five-fold impact on performance. DSMs which do not support some form of network refresh will not run well in a high-latency network.
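The programming flow described above can be sketched as follows; the paper names displaycreate and syncmpr but does not give their signatures, so the prototypes and the MNFS file name here are assumptions.

    /* Illustrative renderer-side flow for the 16-megapixel display. */
    #include <pixrect/pixrect_hs.h>          /* Sun pixrect library */

    extern Pixrect *displaycreate(char *mnfs_file); /* maps the MNFS file as the
                                                       16-megapixel frame buffer */
    extern void     syncmpr(Pixrect *pr);           /* msync the file and deliver
                                                       changed pages to clients  */

    int main(void)
    {
        Pixrect *screen = displaycreate("/mnfs/display/framebuf"); /* hypothetical path */

        /* ... ordinary pixrect operations (pr_rop, pr_vector, pr_text, ...)
           draw into screen exactly as they would into a memory pixrect ... */

        syncmpr(screen);    /* push the modified pages out to the 16 ELCs */
        return 0;
    }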
*Rarefied Gas Model

An application under development is a rarefied gas model. In this application particles traverse a region, divided into a grid, and collide with complex objects and other particles. The communications-intensive part of this application involves transferring particles from one grid region to another. The communication is N-squared and hence challenging. Our communications structures will be FIFO-like. The only coherent variables will be pointers to free areas; noncoherent loads and stores will be used to transfer particle information from processor to processor. This application is interesting because, while it makes almost no use of the coherency MNFS makes available, without that coherency for the FIFO control variables it would not be possible to use the file system at all over the network.
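One way such a FIFO might be laid out is sketched below; the field names, sizes, and the placement of the control words on the short page are assumptions, not the application's code.

    /* Illustrative FIFO-like transfer structure: only the control indices
     * are kept coherent (they fit in the 32-byte short page at the start of
     * a long page and can be advanced with CFOP), while the particle slots
     * are written with ordinary, incoherent loads and stores. */
    struct particle {
        double x, y;     /* position */
        double vx, vy;   /* velocity */
    };

    #define FIFO_SLOTS 512

    struct particle_fifo {
        /* Coherent control words: indices of the free area, kept within the
           first 32 bytes so that they live on the short page. */
        volatile long head;
        volatile long tail;
        /* Bulk particle data, transferred with noncoherent loads and stores
           on long pages. */
        struct particle slot[FIFO_SLOTS];
    };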
Summary

Mether-NFS, or MNFS, is an ``industrial strength'' DSM implemented as a modified NFS. It can be dynamically loaded into a running SunOS kernel; it can be measured and controlled with standard NFS software tools; it supports traditional NFS semantics on files, which users know how to work with; but for memory-mapped files, the behavior of writes is improved so that they are globally ordered. It has been installed and is in use at a number of universities and research institutions.

MNFS performance is at least as good as NFS performance in almost every category, and is superior to it in most. A major improvement of MNFS over NFS is in the area of memory-mapped files. Memory-mapped files have globally ordered writes, allowing the use of atomic test-and-set operators. MNFS also supports high-performance synchronization operators which allow very precise control of the activities of many computers. MNFS has a number of characteristics, including two different page sizes, program control of invalidation, and network refresh, which make it flexible and adaptable to the needs of different programs.

MNFS is currently running on SPARC-based systems running SunOS version 4.1.1 or higher. No kernel recompilation is required; MNFS is a dynamically loadable system. MNFS is being ported to SGI systems running IRIX 5.0 or later. MNFS is also being ported to AIX 3.2 on the IBM RS/6000. A limited version of the system, running without short pages, is working now. The AIX work is being done at the University of Pennsylvania by John Shaffer.

We have found that MNFS supports a number of very different applications very effectively. Three of the ones described in this paper are a Monte Carlo simulation; a 16-machine, 16 Megapixel display; and a rarefied gas application. MNFS allowed us to move the Monte Carlo program transparently from a shared-memory system to a network of SPARCStations. The use of MNFS made many of the aspects of programming the large display transparent to the user, and allowed us to exploit the virtual memory architecture of SunOS to manage which data had to move, and to which machines. Doing all the work ourselves would have made the process very difficult. The rarefied gas application makes very little use of MNFS memory consistency, but without the little it does use it could not be run on the SPARCStation cluster at all.

Future Work

There are a number of areas which we are pursuing. We are working on getting the applications running on the 48-node cluster, recently assembled, 32 of whose machines were supplied by Sun Microsystems as part of a research collaboration. The rarefied gas application is the most interesting case, as it has a large amount of communications and requires high-performance synchronization. CFOP was originally motivated by our planning for this application.

We are currently porting the system to Solaris 2.1, which has a newer Virtual Memory architecture and other changes. Fortunately the MNFS changes to NFS are quite restricted in scope, so starting with NFS and creating MNFS is a straightforward process. Work continues on CFOP, toward our goal of 500 microsecond and better round-trip times.

We are also working with MNFS on high-bandwidth networks such as ATM and FDDI. These networks feature larger maximum transmission units, or MTUs; in the case of ATM, the MTU is large enough to send an entire NFS packet in one transmission. Thus these nets hold out the possibility of doing an entire MNFS operation in interrupt mode, greatly enhancing performance. Finally, hardware implementations using this memory model as a guide are under consideration.

Acknowledgements

Tim Thomas performed the work which made MNFS a mod-loadable module. Scott Zettlemoyer has done a considerable amount of testing and bug-finding for MNFS. Kevin Smith is not a co-author of this paper, but he is a co-author of MNFS. Without Kevin's contributions to the design and implementation of MNFS it would not exist.