################################################
	   #                                              #
	   # ##   ## ###### ####### ##    ## ## ##     ## #
	   # ##   ## ##  ## ##      ###   ## ##  ##   ##  #
	   # ##   ## ##     ##      ####  ## ##   ## ##   #
	   # ##   ## ###### ######  ## ## ## ##    ###    #
	   # ##   ##     ## ##      ##  #### ##   ## ##   #
	   # ##   ## ##  ## ##      ##   ### ##  ##   ##  #
	   # ####### ###### ####### ##    ## ## ##     ## #
	   #                                              #
	   ################################################


	 The following paper was originally published in the
      Proceedings of the USENIX 1996 Annual Technical Conference
		San Diego,  California,  January 1996


	For more information about USENIX Association contact:

		   1. Phone:	510 528-8649
		   2. FAX:	510 548-5738
		   3. Email:	office@usenix.org
		   4. WWW URL:  https://www.usenix.org


         Implementation  of  a  Reliable


               Remote  Memory  Pager


             Evangelos P. Markatos and George Dramitinos

            Computer Architecture and VLSI Systems Group

                    Institute of Computer Science (ICS)

  Foundation for Research & Technology - Hellas (FORTH), Crete


                               ABSTRACT


           Traditional operating systems use magnetic disks

      as paging devices, even though the cost of a disk trans-

      fer measured in processor cycles continues to increase.

           In this paper we explore the use of remote main

      memory  for  paging.   We  describe  the  design,  imple-

      mentation and evaluation of a pager that uses main

      memory of remote workstations as a faster-than-disk

      paging  device  and  provides  reliability  in  case  of  sin-

      gle  workstation  failures.  Our  pager  has  been  imple-

      mented  as  a  block  device  driver  linked  to  the  DEC

      OSF/1  operating  system,  without  any  modifications

      to the kernel code.  Using several test applications we

      measure  the  performance  of  remote  memory  paging

      over an Ethernet interconnection network and find it

      to be faster than traditional disk paging.  We evalu-

      ate the performance of various reliability policies and

      prove  their  feasibility  even  over  low  bandwidth  net-

      works, like Ethernet.

           We  conclude  that  the  benefits  of  reliable  remote

      memory paging in workstation clusters are significant

      today and will probably increase in the near future.


1   Introduction

      Applications like multimedia, windowing systems, scientific compu-
tations, engineering simulations, etc.  running on workstation clusters (or
networks of PCs) require an ever increasing amount of memory, usually
more than any single workstation has available.  To alleviate the memory
shortage problem, an application could use the virtual memory paging pro-
vided by the operating system, and have some of its data in main memory
________________________________
    The authors are also with the University of Crete.

and the rest on the disk. Unfortunately, as the disparity between processor
and disk speeds becomes ever increasing, the cost of paging to a magnetic
disk becomes unacceptable. Faster swap disks would only temporarily rem-
edy the situation, because processor speeds are improving at a much higher
rate than disk speeds [14 ].  Clearly, if paging is going to have reasonable
overhead, a new paging device is needed. This device should provide high
bandwidth and low latency. Fortunately, a device with these characteristics
exists in most distributed systems and it is not used most of the time. It is
the collective memory of all computers in the distributed system, hereafter
called remote memory.

      Remote memory provides high transfer rates which are mainly dic-
tated by the interconnection network.  Fortunately, most of the time re-
mote main memory is unused and thus can be exploited by remote memory
paging systems. To verify this claim, we profiled the unused memory of the
workstations in our lab1 for the duration of one week: 16 workstations with
a total of 800 MBytes of main memory. Figure 1 plots the free memory as
a function of the day of the week.  We see that for significant periods of
time more than 700 Mbytes are unused, especially during the nights, and
the weekend.  Although during business hours the amount of free memory
falls, it is rarely lower than 400 Mbytes!

      Architecture and software developments suggest that the use of re-
mote memory for paging purposes is desirable, possible and efficient:


    fflMemory to memory transfer rates between workstations have
      increased sharply in the last few years:  Local Area Networks
      (like ATM and FDDI) have a high throughput and (usually) low la-
      tency. This increase in communication bandwidth implies a dramatic
      decrease in network transfer time for large messages (like operating
      system pages). On the other hand, the disk technology has not  shown
      a similar increase in transfer rates. Moreover, disk accesses suffer from
      seek and rotation latency which is not expected to be reduced from
      advances in semiconductor technology.

    fflApplication's working sets have increased dramatically over
      the last few years: Modern processors provide 64-bit address spaces,
      which  make  it  possible  for  the  processor  to  address  an  enormous
      amount of memory. Thus, software that takes advantage of a large ad-
      dress space is being developed: memory-mapped files and databases,
      sophisticated window interfaces, and multimedia, are a few examples
      that require an enormous amount of main memory.

    fflModern architectures provide low latency remote memory
      accesses:  Modern distributed systems provide a variety of efficient
      access operations to remote memories.  The SCI-to-SBUS interface
      provides SPARC workstations with the ability to access the memories
      of other workstations in a network using simple load and store oper-
      ations [23 ].  Similar ability is provided by Telegraphos [19 ], Hamlyn
      [5], Memory Channel [13 ], and SHRIMP [4]. Fast remote memory ac-
      cesses have also been implemented in software using Active Messages
      [26 , 2], programmed network interfaces [16 ], and trap-based remote
      invocation [25 ]. The ability to perform single remote memory accesses
________________________________1
    We expect that more main memory will be available in places that have lighter load.
Our workstations are heavily used running VERILOG simulations for most of the time.


Figure 1: Unused memory in a workstation cluster. The figure plots
the idle memory during a typical week in the workstations of our lab:  a
total of 16 workstations with about 800 Mbytes of total memory.  We see
that memory usage was at each peak (and thus free memory was scarce) at
noon and afternoon of working days.  In all times though, more than 300
Mbytes of main memory were unused.


      efficiently will enhance the performance of a remote memory paging
      policy, since the application can use them to access infrequently used
      pages.


      In this paper we show that it is both possible and beneficial to use
remote memory as a reliable paging device by building the systems soft-
ware that transparently transfers operating system pages across worksta-
tion memories within a workstation cluster. We describe a pager built as a
device driver of the DEC OSF/1 operating system. Our pager is completely
portable to any system that runs DEC OSF/1, because we didn't modify
the operating system kernel. More important, by running real applications
on top of our memory manager, we show that even on top of low band-
width interconnection networks (like Ethernet), it is efficient to use remote
memory as backing store. Our performance results suggest that paging to
remote memory over Ethernet, rather than paging to a local disk of com-
parable bandwidth, results in up to 96% faster execution times for real
applications. Moreover, we show that reliability and redundancy comes at
no significant extra cost.  We describe the implementation and evaluation
of several reliability policies that keep some form of redundant information,
which enables the application to recover its data in case a workstation in
the distributed system crashes.  Finally, we use extrapolation to find the
performance of paging to remote memory over faster than Ethernet net-
works like FDDI and ATM. Our extrapolated results suggest that paging
over a 100 Mbits/sec interconnection network, reduces paging overhead to
less than 17% of the execution time of the application running over such a
network. Faster networks will reduce this overhead even more.

      The rest of the paper is organized as follows:  Section 2 presents the
design of a remote memory pager and the issues involved. Section 3 presents
the implementation of the pager as a device driver. Section 4 presents our
performance results which are very encouraging.  Section 5 presents some
aspects that we plan to explore as part of our future work.   Section 6
presents related work. Finally, section 7 presents our conclusions.

2   The Design of a Remote Memory Pager

2.1    Selection of Workstations

      All workstations, that participate in remote memory paging are reg-
istered in a common file. These workstations are known as remote memory
servers, while the workstations that run applications that use remote mem-
ory for swapping are called clients. Depending on its workload, a worksta-
tion may act either as a server, or as a client.

      All server workstations run a remote memory server that handles re-
quests for pageins, pageouts, as well as for swap space allocation. When a
client wants to swap out memory it picks the most promising server, asks
for a number of page frames and starts sending requests to it.  When a
server runs out of memory, it denies further swap space allocation requests.
When native memory-demanding processes start on a server workstation,
part of the server's memory is swapped out to disk. Future requests will be
serviced from the disk, and a note will be sent to the client, advising it to
send no more pages to this server. On reception of this message, the client
will try to find another server having enough free memory and migrate the
pages that were stored by the loaded server to the new one.  If no server
having enough free memory can be found the client's local disk will be used
to house these pages. Whenever the client's local disk is used to store some
of it's paged out pages, the client periodically checks the memory load of
all possible remote memory servers. If a server having enough free memory
is found, some of the pages stored at the local disk are replicated to this
server. Future requests concerning these pages will be served by the remote
memory server rather than the disk.


2.2    Reliability

      In a distributed system, a workstation may crash at any time.  If
the crashed workstation acts as a server, it will lose the pages of several
clients.  Clearly, it is not acceptable for applications running on the client
workstation to crash due to remote server crash. Instead, we would like to
be able to recover their pages. Otherwise a remote server crash will cause
a client crash as well, since all programs that have some of their pages
swapped out (including programs like init and system daemons) will not be
able to continue execution.

      There are many types of crashes.  First of all there may be machine
crashes due to a black out.  This situation is not addressed by this paper,
since most computer buildings are equipped with UPSs. Another cause of
failure may be a network problem (e.g. network partitioning due to a bridge
failure). In this case, the client can not retrieve its pages from the servers.
As a result it remains blocked waiting for the network to recover. The most
frequent cause of crash is a software crash, followed by a hardware error.
To avoid loss of data due to a server crash, some systems write all network
memory pages to the disk as well ([1, 11]). Instead we implement a reliable
remote memory paging system that is able to reconstruct the lost pages.

      To provide this level of reliability, some form of redundancy must be
used. The main issues that must be taken into account regarding the form
of redundancy used are:


    fflThe runtime overhead introduced must be minimal since it is a cost
      paid even when no server crashes.

    fflThe memory overhead introduced must be as low as possible because
      the memory reserved for reliability could be used in order to store
      memory pages of other workstations.

    fflThe crash recovery overhead, that is the time it takes to recover from
      a server crash. This overhead is not as important as the previous two,
      since it is affordable to devote a few more seconds whenever a server
      crashes, which happens rather rarely.


      We explore three different policies: mirroring, parity, and parity log-
ging.


Mirroring:   The simplest form of redundancy is mirroring. In mirroring,
there exist two copies of each page. When the client swaps out a page, the
page is sent to two different servers. Even when one of the servers crashes,
the application is able to complete its execution, because all pages of the
crashed server exist on the mirror servers.  Obviously the crash recovery
overhead, in case of mirroring, is minimal. However, the runtime overhead
is rather high, since each pageout requires two page transfers.  To make
matters worse, mirroring wastes half of the remote memory used.


Parity:   To reduce the main memory waste caused by mirroring, we can
use parity-based redundancy schemes much like the ones used in RAIDS
[6].  Suppose, for example, that we have S servers, each having P  pages.
Page (i; j) is the jth page that resides on server i. Assume, that we have P
parity pages, where parity page j is formed by taking the XOR of all the
jth pages in all servers. We say that all these jth pages belong to the same
parity group.  If a server crashes, all its pages can be restored by XORing
all pages within each parity group.

      When the client swaps out a page it has to update the parity to reflect
the change. This update is done in two steps:


   1. The client sends the swapped out page to the server, which computes
      the XOR of the old and the new page.

   2. The server sends the just computed XOR to the parity server, which
      XORs it with the old parity, forming the new parity.


Unfortunately, this method involves two page transfers:  one from client
to server, and one from server to parity.  Moreover, the client should not
discard the page just swapped out, because the server may crash before the
new parity is computed, thus, making it impossible to restore the swapped
out page. This parity method increases the amount of remote main memory
only by a factor of (1 + 1=S) minimizing the memory overhead, but it still
imposes a significant runtime overhead.


Parity Logging:   To avoid the additional page transfers induced by the
basic parity method, we have developed a parity logging scheme. The key
idea is that a given page need not be bound to a particular server or parity
group.  Instead, every time a page is paged out, a new server and a new
parity group may be used to host the page.

      Suppose the client uses S servers.  Each paged out page is XORed
with a page size buffer maintained by the client (which is initially filled
with zeros) and then is transfered to a server following a round robin policy.

Whenever S pages have been transfered, the buffer is also transfered to a
parity server. Using this technique, the runtime overhead is minimal, since
for each paged out page 1 + 1=S page transfers are required. When a server
crashes, all of its pages can be restored by XORing the pages in their group
with the corresponding parity page. 2

      Every time a page is repaged out, it is marked in the old parity group
containing it as inactive. 3 When all the pages of a parity group are marked
as inactive, all the memory server pages and the corresponding parity page
can be reused.  It is obvious that each memory server must have some
extra overflow memory to support parity logging since many versions of a
given page may be present simultaneously at the servers' memory.  Also,
due to this situation, it is possible that some server runs out of memory.
In this case, one has to perform garbage collection freeing parity sets by
combining their active pages to new ones.  In our experiments, 4 servers
were used devoting 10% more memory to support parity logging and we
never had to perform garbage collection.


3   Implementation

      The proposed system has been built and is in everyday use. It consists
of a client issuing paging requests and servers satisfying these requests. It is
also able to use the local disk for paging and may support either mirroring
or parity logging.  The client side has been linked with the DEC OSF/1
kernel of a DEC-Alpha 3000 model 300 with 32 MB main memory as a
block device driver that handles all pagein and pageout requests. In order
to service these requests, it may forward them either to user level servers
running on other hosts, or to the local disk. The DEC OSF/1 kernel is not
even aware that we use remote main memory instead of magnetic disk as
a paging device.  It just performs ordinary paging activities using a block
device.  This design minimizes the modifications needed in order to port
the system to another operating system and avoids modifications to the
operating system kernel.


3.1    The Remote Memory Pager

      Normally the Remote Memory Pager (RMP for short) is a client which
forwards the paging requests to a remote server using sockets over an Eth-
ernet. The RMP connects to the remote memory servers using sockets over
TCP/IP. One dedicated paging daemon issues pagein and pageout requests
to the server and receives the data sent by them. When mirroring is used, it
is responsible for selecting two servers for each paged out page and transfer
the data to them.  When parity logging is used, it maintains all the data
structures related to page and parity group management and computes the
parity pages.  Security is ensured by allowing access to our device only to
the superuser and by using privileged ports for the communication among
the client and the servers.

      RMP is also capable of forwarding the requests to the local disk using
either a specified partition or a file. In the former case, it invokes a routine
that places the request in the disk queue. In the later case it issues a read
________________________________2
    Note that since the parity page is computed by the client, it is not necessary to wait
for acknowledgments from the servers before transfering the parity page in order to be
able3to recover from a single server crash.
    However, the old version of the page is not deleted from the server's memory, because
if it were, the old parity page should be updated, leading to more page transfers.

Figure 2: Performance of applications using either the disk, or the
remote memory as paging device.  We see that for all applications,
the use of remote memory results in significantly faster execution.   All
applications were run on a DEC-Alpha 3000 model 300 workstation.  The
input sizes for QSORT was 3000 records, for GAUSS, a 17001700 matrix, for
MVEC, a 21002100 matrix, for FFT an array with 700 K elements, for
FILTER a 12 MB image, and the whole DEC OSF/1 V3.2 kernel for CC.


or write operation through the VFS layer routines. When no server can be
found in order to satisfy the client's requests, paging to local disk is used.

      Although the current implementation runs on top of a low bandwidth
10 Mbps Ethernet, remote paging is up to 2 times faster than using a local
disk of the same bandwidth.  It takes about 8.4 ms to transfer an 8KB
page through the network, while transferring a page to/from the local disk
takes about 17 ms.  Faster networks such as ATM, or FDDI should offer
even more promising performance, especially when faster communication
protocols are used [26 ].


3.2    The Remote Memory Server

      The server is a user level program listening to a socket and accepting
connections from clients.  Each client is served by a new instance of the
server which uses portion of the local workstation's main memory to store
the client's pages.  When the client requests a pagein, the server transfers
the requested page(s) over the socket. When the client requests a pageout,
the server reads the incoming pages from the socket, and stores them in its
main memory. The server is also responsible for swap space allocation and
for providing periodically information to the client concerning the memory
load of its host.  A parity server is by no means different than a memory
server. It just performs pageins and pageouts responding to client requests
without knowing whether it stores memory pages or parity pages.


4   Performance Results

      To evaluate the performance of our remote memory pager, and com-
pare it to traditional disk paging, we conducted a series of performance
measurements using a number of representative applications that require a
large amount of memory. Our applications include GAUSS, a gaussian elim-
ination, QSORT, a quicksort program, FFT, a Fast-Fourier Transform, MVEC,
a matrix-vector multiplication, FILTER, a two pass separable image sharp-
ening filter described in [20 ] and CC, a kernel build after modifying the code
of our device driver.  All applications were executed on the DEC-Alpha
3000 model 300, and were compiled with the standard C compiler with the
optimization enabled.  All workstations that contributed their main mem-
ory for paging purposes were DEC-Alpha 3000 model 300, connected via
a standard 10Mbits/sec Ethernet.  In all experiments the amount of idle
memory was larger than the amount of memory needed for paging and was
equally distributed among all workstations.  The local disk that was used
for paging is a DEC RZ55, providing 10Mbits/sec bandwidth, and average
seek time of 16 msec.


4.1    Performance of Remote Memory Paging Over the Ethernet

      In our first experiment we evaluate four methods for paging:

    fflNO_RELIABILITY, which uses only main memory of other workstations
      as a paging device.  In this experiment two remote memory servers
      were used. The measurements were done on an (almost) idle Ethernet
      to ensure repeatability.

    fflPARITY_LOGGING, which uses 4 servers plus a parity server, all devot-
      ing 10% overflow memory.

    fflMIRRORING, which uses one primary memory server and one mirror
      memory server.

    fflDISK, which uses the local DEC RZ55 disk for paging.  In this case
      the page transfer requests go directly from the DEC OSF/1 kernel to
      the disk driver without the intervention of our pager.


The completion time of the applications is plotted in figure 2. We see that
in all cases the use of remote memory results in significant performance im-
provements.  For example, for the GAUSS application, the NO_RELIABILITY
results in 96% faster execution time than DISK. Even for the MVEC  appli-
cation which performed very little paging, NO_RELIABILITY results in 32%
faster execution time.  The reliability methods induce some runtime over-
head as expected but still perform much better than DISK. PARITY_LOGGING
results in 40.4% faster execution time for QSORT and in 59.86% faster time
for GAUSS. MIRRORING also performs better than DISK for all applications
except MVEC, since MVEC performs many pageouts and almost no pageins.

      In order to evaluate the use of remote memory for a more realis-
tic  application,  we  measured  the  completion  time  of  a  kernel  build  af-
ter modifying the code of our device driver.  As can be seen in figure 2,
NO_RELIABILITY performs 26.56% better than disk, PARITY_LOGGING per-
forms 24.65% better and MIRRORING performs just 9.7% better. We see that
PARITY_LOGGING performs very close to NO_RELIABILITY. As the number of
the remote memory servers used increases, the difference in performance
between NO_RELIABILITY and PARITY_LOGGING becomes lower.

      Our performance results suggest that paging to remote memory over
an Ethernet interconnection network is simply faster than paging to the
disk. Even though both the disk and the Ethernet have similar data transfer
rates, remote memory does not suffer from seek and rotational latency as
DISK does.

      Our  experimental  results  verify  that  even  when  the  network  data
transfer rate is as low as the disk transfer rate, the performance of remote
memory is significantly higher than the performance of disk. Moreover the
performance requirements of reliability are surprisingly small.  Since ar-
chitecture trends suggest that modern high speed networks provide much
higher data transfer rates than modern disks, the performance improve-
ments of remote memory over disk are bound to increase.


4.2    Scaling the Input

      To understand the impact of the working set size on the paging policy,
we measure the execution time of one of our applications (FFT), as a function
of its input size. The completion time of FFT both under PARITY_LOGGING
and under DISK is plotted in figure 3. We see that as soon as the working
set size exceeds 18 MBytes, the paging starts, and the completion time of
the application rises sharply.  Most users would not be willing to tolerate


Figure 3:  Performance of FFT as a function of input size when
either the disk, or remote memory are used as backing store.


such a high overhead in order to run an application that does not fit in main
memory. Fortunately, remote memory reduces this overhead substantially.


4.3    Scaling the Network Bandwidth

      Although figure 3 suggests that the performance of remote memory
(parity logging) is significantly better than the performance of disk, the
completion time of an application even under remote memory may be un-
acceptably high. Hopefully, the performance of remote memory will be im-
proved as soon as the Ethernet interconnection network is substituted by a
faster one (e.g. FDDI, ATM, FCS, etc.). To evaluate the performance of the
applications on top of faster networks we make detailed performance mea-
surements that separate the completion time of the application into the fol-
lowing factors: (i) user time (utime), (ii) system time (systime) (iii) initial-
ization time (inittime) (iv) page transfer time (ptime). Using the provided
time command we measure the utime, systime, and elapsed time (etime)
for each application.  Subtracting the utime and systime from the etime
for instances of the applications that perform no paging we calculate the
inittime, that is the time it takes the operating system to load and start ex-
ecuting the application. The ptime consists of the protocol processing time
(pptime) and the bandwidth dependent blocking time (btime).  We mea-
sured the pptime and found it to be equal to 1.6 ms per page for TCP/IP.
We calculate the btime using the formula :  btime = (etime  utime

systime  inittime  no_of _page_transf ers  pptime): Assuming that a
network with X times higher bandwidth will decrease btime by a factor of
X, we can predict the etime of the application over this high bandwidth
network.  Thus, the formula used is :  Expected_elapsed_time = utime +
systime + inittime + number_of _page_transf ers  pptime + btime=X:

      We made all these measurements on our FFT application, and pre-
dict its performance on a system with an interconnection network which
provides ten times more bandwidth than the Ethernet. We also predict its
completion time on a system that has enough memory to hold all the work-
ing set of the application (ALL_MEMORY) by adding the utime, systime and
inittime.  The predicted execution times, along with the measured execu-
tion times of DISK and PARITY_LOGGING are plotted in figure 4. We see that
ETHERNET*10 performs very close to ALL_MEMORY, and significantly better
than both ETHERNET and DISK.

      To understand the results shown in figure 4, we analyze the execution
time of FFT with 24MBytes of input when PARITY_LOGGING is used.  The
measured elapsed time is 130.76 seconds, consisting of 66.138 sec of useful
user time, 3.133 sec of system time, 0.21 sec of initialization time and 61.279
sec of page transfer time.  During the same run, the application suffered
2718 pageouts and 2055 pageins.  Since 4 servers were used plus a parity
server the number of page transfers was equal to 3397 + 2055  =  5452.
Thus the protocol overhead was equal to 5452  0:0016, or about 8.723 sec.
The bandwidth dependent blocking time was equal to 61:279  8:723, or
about 52.556 sec.  Using a ten times faster interconnection network, the
bandwidth dependent waiting time will be reduced to 5.255 sec. Thus, the
total completion time of FFT would be 66:138 + 3:133 + 0:21 + 8:723 + 5:255
sec, or 83.459 sec, divided as follows:  79.246% in user time, 3.754% in
system time, 0.252% in initialization time and 16.748% in page transfer
time.  We see that a 100 Mbit/sec interconnection network reduces the
total paging overhead to less than 17% of the total application execution
time. We believe that most users would be willing to pay such an overhead
in order to run an application that does not fit in main memory. After all,
the only other option they have is to suffer from disk thrashing.


4.4    The Latency of Remote Memory Paging

      As explained previously, the paging latency for FFT with input size
equal to 24 MB is 61.279 sec, or 11.24 ms per page transfer.  From these,
1.6 ms were spent during protocol processing and 9.64 ms were spent trans-
ferring each page on the Ethernet.

      Previous  measurements  have  reported  that  a  4  KByte  page  takes
about 45 ms over an Ethernet for each pagein [22 ].  Of those 45 ms, 19
ms were spent on TCP overhead, 4 ms were spent on Mach IPC overhead,
7.2 ms were spend on the Ethernet, and the rest were spent on the com-
puter's I/O bus. The total software latency of our implementation, is only
1.6 ms. The reason for this significant difference in performance is threefold:


    fflThe I/O bus of the DEC-Alpha 3000 model 300 we use is significantly
      faster and does not pose a problem in performance.

    fflWe use a DEC-Alpha processor, which is 3-4 times faster than the
      386 processor used in [22 ].

    fflFinally, our pager is implemented as a block device driver, while in [22 ]


Figure 4:  Performance of FFT for various Architecture Alterna-
tives. DISK is the measured completion time when paging to a local disk.
ETHERNET is the measured completion time of parity logging to remote mem-
ory on top of the Ethernet. ETHERNET*10 is the predicted completion time
when using remote memory as a paging device, on top of a network that
provides ten times more bandwidth than the Ethernet interconnection net-
work.  ALL_MEMORY is the predicted completion time of FFT when we use
the same workstation but with enough memory to hold its entire working
set.

      it was implemented as a user-level memory manager on top of Mach.
      Although user-level memory management gives increasing flexibility
      it induces large overhead.


      In general, although our approach may have less flexibility than a full-
fledged user-level pager, it has much better performance.  Moreover, our
device-driver implementation provides better performance than traditional
(local) disk paging, while user-level implementations have not reported per-
formance results to support similar claims [22 ].


4.5    Using Busy Workstations as Servers

      In all our experiments so far, the remote memory servers run on idle
workstations. However, workstations that are able to donate their memory
for paging purposes may not be completely idle, as they may run interac-
tive applications. Thus, we would like to investigate how our performance
figures change when a non-idle workstation is used as a memory server. So,
we conducted the following experiment:


      On each server workstation we started an X-window environ-
      ment, and an instance of the vi editor which was continuously
      used for editing.  Then, we run the applications of the experi-
      ment in figure 2.  The same inputs, and the same clients were
      used.  The only difference was that the remote memory server
      processes were run on busy instead of idle workstations.4


We were surprised to see that for the FFT, GAUSS, and MVEC applications,
their completion times were within 1 sec of their completion times when the
server ran on an idle workstation.  Only QSORT suffered a 7% overhead in
its completion time: probably the kernel swapped out some of the remote
memory server's pages on the disk. However, in order to find out how the
completion time of our applications changes with server load, we ran FFT
and QSORT under NO_RELIABILITY using two remote memory servers.  On
one of them a cpu bound program (performing a "while(1);" loop) was ini-
tiated. To our surprise, even then the completion times of our applications
were within 7% of their completion times when the server ran on an idle
workstation.

      Our performance figures suggest that most of the time the remote
memory servers were able to satisfy the client's requests immediately, even
on busy workstations.  Our results agree with the measurements in figure
1 which report that a significant portion of all workstation's memory is
unused even at business hours, thus no overhead is expected to be seen
when some other server process uses the extra pages.

      In the same course of experiments, we would like to see what is the
overhead that remote paging induces on the server workstation. Thus, we
measured the CPU utilization of the (otherwise idle) remote memory server
for all our experiments, and found it always to be less than 15%. Thus, the
computational overhead imposed on the remote workstation is so low that
will not be noticed by the workstation's owner.
________________________________4
    One could argue that an X-window environment and an editor, induce almost no
load on the workstation. But, this is exactly the point: a typical workstation, even when
it is used, it is very lightly loaded. The rest of the workstations that are heavily loaded
do not donate their main memory for remote paging.

4.6    Using Remote Memory Paging over a Loaded Ethernet

      All the experiments presented so far were done over an almost idle
Ethernet to ensure repeatability of our results. However, we would like to
find out how the performance of remote memory paging is affected by the
load of the network.  That is why we repeated our experiments using an
already loaded Ethernet.  The results showed a performance degradation
even when the Ethernet was lightly loaded. This situation is by no means
surprising since the paging itself uses all the bandwidth it can get. Adding
more sources of traffic leads to an enourmous demand for bandwidth causing
repeated collisions and lowering the effective bandwidth of the network,
leading to throughput collapse.

      Fortunately, this inefficiency is not inherent to remote memory paging
but rather to the CSMA/CD protocol employed by the Ethernet [24 ]. This
means that it is still beneficial to use remote memory paging over networks
that employ other technologies (e.g.  token ring), as long as they are able
to provide to remote memory paging an effective bandwidth of 10 or more
Mbps.


4.7    Using the Local Disk to Increase Reliability

      In our work we use remote main memory to store redundant infor-
mation that will be used to recover from workstation crashes.  Another
approach would be to store all remote pages to the local disk as well [11 ],
effectively treating remote memory as a write-through cache of the disk.
We will now compare the two approaches to find out the circumstances
under which the one approach is preferable to the other.

      Both approaches use the remote memory to satisfy the read requests.
This means that both approaches perform reads at the same speed and
avoid disk head movements due to reads, thus outperforming the local disk.
Parity logging transfers 1+1=N pages per paged out page, due to the parity
computation (in our experiments N was equal to 4).  On the other hand,
write through transfers each paged out page both to disk to the remote
memory.  These two page transfers are executed in parallel.  This means
that the choice of the right approach depends on the effective bandwidth
offered by the disk and the network.  If the network bandwidth is much
higher than the disk bandwidth, then the disk will be the bottleneck for
write through making it an unwise choice.  If however the effective band-
width offered by the disk is comparable to the bandwidth offered by the
network and the system can overlap disk transfers with network transfers
then it is unclear which method is best to use. In our experimental environ-
ment the disk and network bandwidth are both equal to 10 Mbps.  When
write through is used the efective disk bandwidth is close to 10 Mbps, since
there are no head movements for reads and writes are performed in large
chunks. In this environment write through performs better than parity log-
ging and slightly worse than our no-reliability implementation in most cases,
as shown in figure 5 . However, when a modern high bandwidth network is
used, parity logging will probably be the best approach, since write through
will eventually be limited by the local disk bandwidth.


5   Discussion - Future work

      Our implementation suggests that it is possible to build a reliable
efficient  remote memory pager without making any modifications to the


Figure 5:  Performance  of  parity  logging  and  write  through  for
various applications.  The input sizes for QSORT was 3000 records, for
GAUSS, a 17001700 matrix, for MVEC a 21002100 matrix, and for FFT
an array with 700 K elements.

operating system kernel. Although our system contains all necessary mech-
anisms to support remote memory paging,  there are a few more issues
concerning the overall policy that deserve further investigation.  Some of
these issues are discussed below.


Network load:   Although remote paging is faster than using the local
disk, sometimes the network traffic may be so high that the bandwidth
used by RMP will be limited.  In this case the cost of using the network,
especially in the case of old low bandwidth networks like Ethernet, may
become higher than the cost of using the local disk. Such a situation could
be handled by the RMP by measuring the time it takes to satisfy a request
and using a threshold to determine whether it should continue to use the
network to route pageout requests or it would be better to switch to the
local disk.


Heterogeneous networks:   The current implementation assumes a net-
work of workstations that all have the same order of magnitude of physical
memory and are interconnected by a local area network.  It would be in-
teresting to explore the requirements that heterogeneous networks pose to
the design of the remote pager. For example, on a wider area network the
time it takes to transfer a page may not be identical for each server. In this
case there may be more than three levels in the memory hierarchy (local
memory, remote memory, disk), depending on the variance of the cost of
communication among the hosts of the network. Connecting machines that
have an enormous amount of memory (e.g. a supercomputer) to a network
of workstations also poses some problems. When the supercomputer mem-
ory is idle, it may not always be easy to find enough free remote workstation
memory in order to be able to use reliability policies.  In this case, a no
reliability policy can be used, since all remote memory will be provided by
a single host (the supercomputer).


6   Related Work

      Several research groups have studied the issues in using remote mem-
ory in a workstation cluster to improve paging performance [2, 12 , 7, 15 ,
22 , 3].

      Felten and Zahorjian [12 ] have implemented a remote paging system
on top of a traditional Ethernet based system, and presented an analytical
model to predict its performance. Their performance results, although pre-
liminary, are encouraging towards using remote memory paging systems.
Schilit and Duchamp [22 ] have implemented a remote memory paging sys-
tem on top of Mach 2.5 for portable computers.  Their remote memory
paging system has performance similar to local disk paging. The cost of a
single remote memory pagein over an Ethernet, they quote, is about 45 ms
for a 4Kbyte page, which is rather high. According to their measurements,
a significant percentage of this time (close to 16 ms) is spend executing
Mach IPC and TCP code.  Comer and Griffoen [7] have implemented and
compared remote memory paging vs. remote disk paging, over NFS, on an
environment with diskless workstations. Their results suggest that remote
memory paging can be 20% to 100% faster than remote disk paging, de-
pending on the disk access pattern. Anderson et. al. have proposed the use
of network memory as backing store [2].  Their simulation results suggest
that using remote memory over a 155Mbits/s ATM network "is 5 to 10
times faster than thrashing to disk" [2]. In their subsequent work [18 ], they

outline the implementation of a remote memory pager on top of an ATM
based network.

      Our work differs from previous approaches to remote memory paging
in the following aspects: (i) we use a variety of real applications to evalu-
ate and demonstrate the feasibility of remote memory paging, and (ii) we
explore the issues in building a reliable remote memory system that is re-
silient to individual workstation failures. Previous approaches either ignore
workstation failures, or write dirty pages both to the disk and the remote
memory, limiting their performance by the available disk throughput.

      Recently, research groups start to explore the issue of using remote
memory to improve file system performance [11 , 1, 8].  Feeley et.  al.  have
implemented a global memory management system in a workstation clus-
ter, using the idle memory in the cluster to store clean pages of memory
loaded workstations [11 ]. Anderson et. al. have implemented xFS, a server-
less network file system [1, 9].  Both network memory systems have been
incorporated inside the kernel of existing operating systems and their per-
formance has been demonstrated.  Although improvements in file system
performance may ultimately lead to paging performance improvements, so-
lutions developed for file systems may be cumbersome, or too general for
remote memory paging systems:  (i) in file systems, client processes may
share file data, leading to cooperative remote memory management poli-
cies. In paging instead, clients never share their swap spaces. Thus, policies
developed to optimize a client-server approach to file I/O, and facilitate co-
operation among client processes that share data, do not necessarily apply
to a paging system where no single paging server is used, and no sharing
(of swap spaces) between client processes takes place.  Finally, we use the
network memory for storing both clean and dirty pages using our novel
parity-based approach. Thus, page out (write) operations can be acknowl-
edged at the speed of remote memory, while in [11 , 1] page out operations
are acknowledged at the speed of disk.

      Although the area of reliability in network memory systems is new,
it shares several of the ideas developed for other areas of reliable memory
management.  For example, parity based methods have been extensively
used for Redundant Arrays of Inexpensive Disks (RAIDs) [6].

      Log based methods have been used for Log based file systems, that
send all updates to a file to be written in sequential blocks of the disk [21 ].
Thus, the head of the disk does not make random seek movements, and the
effective data transfer rate of the disk increases.  Log based file systems,
alike our LOGGING methods, create a fragmented space that needs to be
cleaned.  Although the general ideas may be similar, there are substantial
differences between a log based file system and the log based reliable net-
work memory we propose. For example, (i) Fragmentation in log based file
systems occurs in large chunks (several Mbytes), while fragmentation in log
based reliable network memory occurs in small parity groups, and (ii) Log
based reliable network memory systems may use parity groups as soon as
they are emptied, but log based file systems may not used emptied disk
blocks, because this would require a head movement.  (iii) Cleaning in log
based file system is much more infrequent than it is in network memory,
thus it must be made more efficient, and (iv) the objective of log based
network memory systems is to reduce page transfers, while the objective
of log based file systems is to reduce disk head movements. For the above
reasons, methods developed for log based file systems do not necessarily

apply "as is" to network memory systems.

      Our work bears some similarity with distributed shared memory sys-
tems [17 , 10 ] in that both approaches use remote memory to store an ap-
plication's data.  Our main difference is that we focus on sequential appli-
cations where pages are not (or rarely) shared, while distributed-shared-
memory projects deal with parallel applications, where the main focus is
to reduce the cost of page sharing.


7   Conclusions

      In this paper we explore the use of remote main memory for paging.
We describe our prototype implementation of a remote memory pager im-
plemented on top of the DEC OSF/1 operating system as a device driver.
No modifications were made to the kernel of the (monolithic) DEC OSF/1
operating system.  We run several applications that use our pager on top
of a DEC-Alpha-based workstation cluster to measure the performance of
the system. The contributions of this paper are:


    fflWe describe how to build a reliable remote memory paging system;
      we propose a novel parity-based policy that is resilient to single work-
      station failures.

    fflWe show that reliable paging to remote memory results in substantial
      performance improvements over local disk paging.


Based on our implementation and our performance results we conclude:


    fflPaging to remote memory results in significant performance improve-
      ment over paging to disk. Applications that use our pager even when
      running on top of traditional Ethernet technology show performance
      improvements of up to 96% (see figure 2).  Extrapolating from our
      results, we show that on top of a faster interconnection network even
      higher performance improvements are realizable!

    fflPaging to remote memory is an inexpensive way to let applications use
      more main memory than a single workstation provides. Remote mem-
      ory paging provides good performance with almost no extra hardware
      support. The only way for magnetic disks to provide comparable per-
      formance is to use expensive disk arrays.

    fflReliability in remote memory paging comes at low cost.  Parity log-
      ging based paging provides reliability at low runtime and memory
      overhead, performs very close to NO_RELIABILITY and much faster
      than disk paging.

    fflThe benefits of paging to remote memory will only increase with time.
      Current  architecture  trends  suggest  that  the  gap  between  proces-
      sor and disk speed continues to widen.  Disks are not expected to
      provide the bandwidth needed by paging unless a breakthrough in
      disk technology occurs. On the other hand, interconnection network
      bandwidth keeps increasing at a much higher rate than (single) disk
      bandwidth, thereby increasing the performance benefits of paging to
      remote memory.

      Based on our performance measurements we believe that remote mem-
ory  paging  is  a  cost-effective  and  performance-effective  way  to  execute
memory-limited applications on a network of workstations.


Acknowledgments

      This work was developed in the ESPRIT/HPCN project "SHIPS",
and will form a test application for the ACTS project "ASICCOM", funded
by the European Union (DG III and DG XIII). We deeply appreciate this
financial support, without which this work would have not existed.

      We  would  like  to  thank  Manolis  Katevenis,  Sotiris  Ioannidis  and
Kosmas Papachristos for useful comments in earlier drafts of this paper.
Michael J. Feeley and Henry M. Levy pointed out useful references.  Fi-
nally,  we deeply appreciate the thoughtful comments of the anonymous
referees.


References

 [1] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S.
     Roselli, and R. Y. Wang.  Serverless Network File Systems.  In Proc.
     15-th Symposium on Operating Systems Principles, December 1995.

 [2] Thomas E. Anderson, David E. Culler, and David A. Patterson.  A
     Case for NOW (Networks of Workstations).  IEEE Micro, February
     1995.

 [3] G. Bernard and S. Hamma.  Remote Memory Paging in Networks of
     Workstations.  In Proceedings of the SUUG International Conference
     on Open Systems: Solutions for Open Word, April 1994.

 [4] M. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. Felten, and J. Sand-
     berg.  Virtual Memory Mapped Network Interface for the SHRIMP
     Multicomputer.  In Proceedings of the Twenty-First Int. Symposium
     on Computer Architecture, pages 142-153, Chicago, IL, April 1994.

 [5] Greg Buzzard,  David Jacobson,  Scott Marovich,  and John Wilkes.
     Hamlyn:   a  high-performance  network  interface  with  sender-based
     memory management. In Proceedings of the Hot Interconnects III Sym-
     posium, August 1995.

 [6] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and
     David A. Patterson.  RAID: High-Performance,  Reliable Secondary
     Storage. ACM Computing Surveys, 26(2):145-185, June 1994.

 [7] D. Comer and J. Griffoen.  A new design for Distributed Systems:
     the Remote Memory Model.  In Proceedings of the USENIX Summer
     Conference, pages 127-135, 1990.

 [8] T.  Cortes,  S.  Girona,  and  J.  Labarta.    PACA:  A  Distributed
     File System Cache for Parallel MAchines. Performance under Unix-
     like  workload.   Technical  Report  UPC-DAC-1995-20,  Departament
     d'Arquitectura de computadors, Universitat Politecnica de Catalunya
     (UPC), June 15 1995.

 [9] M.D. Dahlin, R.Y. Wang, T.E. Anderson, and D.A. Patterson. Coop-
     erative Cahing: Using Remote Client Memory to Improve File System
     Performance. In First USENIX Symposium on Operating System De-
     sign and Implementation, pages 267-280, 1994.

[10] G. Delp.  The Architecture and implementation of Memnet:  A High-
     Speed Shared Memory Computer Communication Network. PhD thesis,
     University of Delaware, 1988.

[11] M. J. Feeley, W. E. Morgan, F. H. Pighin, A. R. Karlin, H. M. Levy,
     and C. A. Thekkath. Implementing Global Memory Management in a
     Workstation Cluster. In Proc. 15-th Symposium on Operating Systems
     Principles, December 1995.

[12] E. W. Felten and J. Zahorjan.  Issues in the Implementation of a Re-
     mote Memory Paging System. Technical Report TR 91-03-09, Univer-
     sity of Washington, November 1991.

[13] R. Gillet.  Memory Channel.  In Proceedings of the Hot Interconnects
     III Symposium, August 1995.

[14] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quan-
     titative Approach. Morgan Kaufmann Publishers, Inc., 1990.

[15] L. Iftode, K. Li, and K. Petersen. Memory Servers for Multicomputers.
     In Proceedings of COMPCON 93, 1993.

[16] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Ghara-
     chorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta,
     M. Rosenblum,  and J. Hennessy.  The FLASH Multiprocessor.  In
     Proc. 21-th International Symposium on Comp. Arch., pages 302-313,
     Chicago, IL, April 1994.

[17] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Mem-
     ory Systems. ACM Transactions on Computer Systems, 7(4):321-359,
     November 1989.

[18] A. Mainwaring, C. Yoshikawa, and K. Wright.  NOW White Paper:
     Network              RAM              Prototype,                 1994.
     https://now.cs.berkeley.edu/Nram/network-ram.html.

[19] Evangelos  P.  Markatos  and  Manolis  G.H.  Katevenis.   Telegraphos:
     High-Performance Networking for Parallel Processing on Workstation
     Clusters.  In Proceedings of the Second International Symposium on
     High-Performance  Computer  Architecture,  (HPCA),  San  Jose,  CA,
     USA, February 1996.

[20] Gary Newman. Organizing Arrays for Paged Memory Systems. Com-
     munications of the ACM, 38(7):93-110, July 1995.

[21] Mendel Rosenblum and John Ousterhout. The Design and Implemen-
     tation of a Log-Structured File System. In Proc. 13-th Symposium on
     Operating Systems Principles, pages 1-15, October 1991.

[22] B.N. Schilit and D. Duchamp.  Adaptive Remote Paging for Mobile
     Computers.  Technical Report CUCS-004-91, University of Columbia,
     1991.

[23] Dolphin Interconnect Solutions.  DIS301 SBus-to-SCI Adapter User's
     Guide.

[24] A. S. Tanenbaum. Computer Networks, chapter 3, page 128. Prentice
     Hall International, 1989.

[25] C.A. Thekkath, H.M. Levy, and E.D. Lazowska. Efficient Support for
     Multicomputing on ATM Networks.  Technical Report 93-04-03, De-
     partment of Computer Science and Engineering, University of Wash-
     ington, April 12 1993.

[26] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active
     Messages: a Mechanism for Integrated Communication and Computa-
     tion. In Proc. 19-th International Symposium on Comp. Arch., pages
     256-266, Gold Coast, Australia, May 1992.


Biographical information

Evangelos P. Markatos   is an Assistant professor at ICS-FORTH and at
the University of Crete. He received his diploma in Computer Engineering
from the University of Patras in 1988, and the MS and Ph.D. degrees from
the University of Rochester in 1990 and 1993 respectively.  His interestes
include parallel and distributed systems, operating systems and computer
architecture.


George Dramitinos   is a graduate student in Computer Science at the
University of Crete, where he received a B.Sc.  degree.  He has worked at
A.C.R.I. in Lyon, France, participating in the design and implementation
of an OSF/1 based operating system for the company's supercomputer.
He joined ICS-FORTH in 1993.  His interests include operating systems,
parallel and distributed programming and computer architecture.

      The authors can be contacted at fmarkatos,  dramitg@ics.forth.gr.
or at their postal address at Institute  of  Computer  Science  (ICS),
FORTH,  Science  and  Technology  Park  of  Crete,  Vassilika  Vouton,
P.O.  Box  1385,  GR  711  10  Heraklion,  Crete,  Greece.


Availability

      The most recent version of the pager along with the test programs are
freely distributed using ftp from ftp.ics.forth.gr:pub/pager. More informa-
tion about the project can be found at https://www.ics.forth.gr/proj/arch-vlsi/os.

From markatos@ics.forth.gr  Mon Nov  6 03:46:34 1995
Received: from relay1.UU.NET by usenix.ORG (4.1/1.29-emg890317)
	id AA12951; Mon, 6 Nov 95 03:46:34 PST
Received: from nemesis.ics.forth.gr by relay1.UU.NET with SMTP 
	id QQzore20262; Mon, 6 Nov 1995 06:44:56 -0500 (EST)
Received: from saronis.csi.forth.gr by nemesis.ics.forth.gr (ICS mailhost);
	on Mon, 6 Nov 1995 13:15:59 +0200 (EET DST); with id AA15898
Date: Mon, 6 Nov 1995 13:15:59 +0200
From: Evangelos Markatos <markatos@ics.forth.gr>
Received: 	by saronis.csi.forth.gr (4.1/SMI-4.1)
	id AA05994; Mon, 6 Nov 95 13:15:53 +0200
Message-Id:   <9511061115.AA05994@saronis.csi.forth.gr>
Organization: FORTH - ICS, P.O.Box 1385, Heraklio, Crete, Greece 711 10
	      tel: +30(81)221171, 229368,02 fax: +30(81)229342,3 tlx: 262389 CCI
To: sandiegoascii@usenix.ORG
Subject: ASCII version of "Implementation of a Reliable Remote Memory Pager" paper 
Cc: markatos@ics.forth.gr
Status: RO

Dear USENIX representative, 
please find enclosed the ASCII version of 
the paper 
Implementation of a Reliable Remote Memory Pager
coauthored by 
Evangelos P. Markatos and George Dramitinos
that will appear in the USENIX 1996 Technical Conference 
Sincerely, 
Evangelos Markatos 
------------------------------------------------
Evangelos Markatos         markatos@ics.forth.gr
ICS, FORTH, Science and Technology Park of Crete
Vassilika Vouton, P.O. Box 1385
GR 711 10  Heraklion, Crete, Greece
Tel: +30 (81) 39 16 55,   Fax: +30 (81) 39 16 71
------------------------------------------------


         Implementation  of  a  Reliable


               Remote  Memory  Pager


             Evangelos P. Markatos and George Dramitinos

            Computer Architecture and VLSI Systems Group

                    Institute of Computer Science (ICS)

  Foundation for Research & Technology - Hellas (FORTH), Crete


                               ABSTRACT


           Traditional operating systems use magnetic disks

      as paging devices, even though the cost of a disk trans-

      fer measured in processor cycles continues to increase.

           In this paper we explore the use of remote main

      memory  for  paging.   We  describe  the  design,  imple-

      mentation and evaluation of a pager that uses main

      memory of remote workstations as a faster-than-disk

      paging  device  and  provides  reliability  in  case  of  sin-

      gle  workstation  failures.  Our  pager  has  been  imple-

      mented  as  a  block  device  driver  linked  to  the  DEC

      OSF/1  operating  system,  without  any  modifications

      to the kernel code.  Using several test applications we

      measure  the  performance  of  remote  memory  paging

      over an Ethernet interconnection network and find it

      to be faster than traditional disk paging.  We evalu-

      ate the performance of various reliability policies and

      prove  their  feasibility  even  over  low  bandwidth  net-

      works, like Ethernet.

           We  conclude  that  the  benefits  of  reliable  remote

      memory paging in workstation clusters are significant

      today and will probably increase in the near future.


1   Introduction

      Applications like multimedia, windowing systems, scientific compu-
tations, engineering simulations, etc.  running on workstation clusters (or
networks of PCs) require an ever increasing amount of memory, usually
more than any single workstation has available.  To alleviate the memory
shortage problem, an application could use the virtual memory paging pro-
vided by the operating system, and have some of its data in main memory
________________________________
    The authors are also with the University of Crete.

and the rest on the disk. Unfortunately, as the disparity between processor
and disk speeds becomes ever increasing, the cost of paging to a magnetic
disk becomes unacceptable. Faster swap disks would only temporarily rem-
edy the situation, because processor speeds are improving at a much higher
rate than disk speeds [14 ].  Clearly, if paging is going to have reasonable
overhead, a new paging device is needed. This device should provide high
bandwidth and low latency. Fortunately, a device with these characteristics
exists in most distributed systems and it is not used most of the time. It is
the collective memory of all computers in the distributed system, hereafter
called remote memory.

      Remote memory provides high transfer rates which are mainly dic-
tated by the interconnection network.  Fortunately, most of the time re-
mote main memory is unused and thus can be exploited by remote memory
paging systems. To verify this claim, we profiled the unused memory of the
workstations in our lab1 for the duration of one week: 16 workstations with
a total of 800 MBytes of main memory. Figure 1 plots the free memory as
a function of the day of the week.  We see that for significant periods of
time more than 700 Mbytes are unused, especially during the nights, and
the weekend.  Although during business hours the amount of free memory
falls, it is rarely lower than 400 Mbytes!

      Architecture and software developments suggest that the use of re-
mote memory for paging purposes is desirable, possible and efficient:


    fflMemory to memory transfer rates between workstations have
      increased sharply in the last few years:  Local Area Networks
      (like ATM and FDDI) have a high throughput and (usually) low la-
      tency. This increase in communication bandwidth implies a dramatic
      decrease in network transfer time for large messages (like operating
      system pages). On the other hand, the disk technology has not  shown
      a similar increase in transfer rates. Moreover, disk accesses suffer from
      seek and rotation latency which is not expected to be reduced from
      advances in semiconductor technology.

    fflApplication's working sets have increased dramatically over
      the last few years: Modern processors provide 64-bit address spaces,
      which  make  it  possible  for  the  processor  to  address  an  enormous
      amount of memory. Thus, software that takes advantage of a large ad-
      dress space is being developed: memory-mapped files and databases,
      sophisticated window interfaces, and multimedia, are a few examples
      that require an enormous amount of main memory.

    fflModern architectures provide low latency remote memory
      accesses:  Modern distributed systems provide a variety of efficient
      access operations to remote memories.  The SCI-to-SBUS interface
      provides SPARC workstations with the ability to access the memories
      of other workstations in a network using simple load and store oper-
      ations [23 ].  Similar ability is provided by Telegraphos [19 ], Hamlyn
      [5], Memory Channel [13 ], and SHRIMP [4]. Fast remote memory ac-
      cesses have also been implemented in software using Active Messages
      [26 , 2], programmed network interfaces [16 ], and trap-based remote
      invocation [25 ]. The ability to perform single remote memory accesses
________________________________1
    We expect that more main memory will be available in places that have lighter load.
Our workstations are heavily used running VERILOG simulations for most of the time.


Figure 1: Unused memory in a workstation cluster. The figure plots
the idle memory during a typical week in the workstations of our lab:  a
total of 16 workstations with about 800 Mbytes of total memory.  We see
that memory usage was at each peak (and thus free memory was scarce) at
noon and afternoon of working days.  In all times though, more than 300
Mbytes of main memory were unused.


      efficiently will enhance the performance of a remote memory paging
      policy, since the application can use them to access infrequently used
      pages.


      In this paper we show that it is both possible and beneficial to use
remote memory as a reliable paging device by building the systems soft-
ware that transparently transfers operating system pages across worksta-
tion memories within a workstation cluster. We describe a pager built as a
device driver of the DEC OSF/1 operating system. Our pager is completely
portable to any system that runs DEC OSF/1, because we didn't modify
the operating system kernel. More important, by running real applications
on top of our memory manager, we show that even on top of low band-
width interconnection networks (like Ethernet), it is efficient to use remote
memory as backing store. Our performance results suggest that paging to
remote memory over Ethernet, rather than paging to a local disk of com-
parable bandwidth, results in up to 96% faster execution times for real
applications. Moreover, we show that reliability and redundancy comes at
no significant extra cost.  We describe the implementation and evaluation
of several reliability policies that keep some form of redundant information,
which enables the application to recover its data in case a workstation in
the distributed system crashes.  Finally, we use extrapolation to find the
performance of paging to remote memory over faster than Ethernet net-
works like FDDI and ATM. Our extrapolated results suggest that paging
over a 100 Mbits/sec interconnection network, reduces paging overhead to
less than 17% of the execution time of the application running over such a
network. Faster networks will reduce this overhead even more.

      The rest of the paper is organized as follows:  Section 2 presents the
design of a remote memory pager and the issues involved. Section 3 presents
the implementation of the pager as a device driver. Section 4 presents our
performance results which are very encouraging.  Section 5 presents some
aspects that we plan to explore as part of our future work.   Section 6
presents related work. Finally, section 7 presents our conclusions.

2   The Design of a Remote Memory Pager

2.1    Selection of Workstations

      All workstations, that participate in remote memory paging are reg-
istered in a common file. These workstations are known as remote memory
servers, while the workstations that run applications that use remote mem-
ory for swapping are called clients. Depending on its workload, a worksta-
tion may act either as a server, or as a client.

      All server workstations run a remote memory server that handles re-
quests for pageins, pageouts, as well as for swap space allocation. When a
client wants to swap out memory it picks the most promising server, asks
for a number of page frames and starts sending requests to it.  When a
server runs out of memory, it denies further swap space allocation requests.
When native memory-demanding processes start on a server workstation,
part of the server's memory is swapped out to disk. Future requests will be
serviced from the disk, and a note will be sent to the client, advising it to
send no more pages to this server. On reception of this message, the client
will try to find another server having enough free memory and migrate the
pages that were stored by the loaded server to the new one.  If no server
having enough free memory can be found the client's local disk will be used
to house these pages. Whenever the client's local disk is used to store some
of it's paged out pages, the client periodically checks the memory load of
all possible remote memory servers. If a server having enough free memory
is found, some of the pages stored at the local disk are replicated to this
server. Future requests concerning these pages will be served by the remote
memory server rather than the disk.


2.2    Reliability

      In a distributed system, a workstation may crash at any time.  If
the crashed workstation acts as a server, it will lose the pages of several
clients.  Clearly, it is not acceptable for applications running on the client
workstation to crash due to remote server crash. Instead, we would like to
be able to recover their pages. Otherwise a remote server crash will cause
a client crash as well, since all programs that have some of their pages
swapped out (including programs like init and system daemons) will not be
able to continue execution.

      There are many types of crashes.  First of all there may be machine
crashes due to a black out.  This situation is not addressed by this paper,
since most computer buildings are equipped with UPSs. Another cause of
failure may be a network problem (e.g. network partitioning due to a bridge
failure). In this case, the client can not retrieve its pages from the servers.
As a result it remains blocked waiting for the network to recover. The most
frequent cause of crash is a software crash, followed by a hardware error.
To avoid loss of data due to a server crash, some systems write all network
memory pages to the disk as well ([1, 11]). Instead we implement a reliable
remote memory paging system that is able to reconstruct the lost pages.

      To provide this level of reliability, some form of redundancy must be
used. The main issues that must be taken into account regarding the form
of redundancy used are:


    fflThe runtime overhead introduced must be minimal since it is a cost
      paid even when no server crashes.

    fflThe memory overhead introduced must be as low as possible because
      the memory reserved for reliability could be used in order to store
      memory pages of other workstations.

    fflThe crash recovery overhead, that is the time it takes to recover from
      a server crash. This overhead is not as important as the previous two,
      since it is affordable to devote a few more seconds whenever a server
      crashes, which happens rather rarely.


      We explore three different policies: mirroring, parity, and parity log-
ging.


Mirroring:   The simplest form of redundancy is mirroring. In mirroring,
there exist two copies of each page. When the client swaps out a page, the
page is sent to two different servers. Even when one of the servers crashes,
the application is able to complete its execution, because all pages of the
crashed server exist on the mirror servers.  Obviously the crash recovery
overhead, in case of mirroring, is minimal. However, the runtime overhead
is rather high, since each pageout requires two page transfers.  To make
matters worse, mirroring wastes half of the remote memory used.


Parity:   To reduce the main memory waste caused by mirroring, we can
use parity-based redundancy schemes much like the ones used in RAIDS
[6].  Suppose, for example, that we have S servers, each having P  pages.
Page (i; j) is the jth page that resides on server i. Assume, that we have P
parity pages, where parity page j is formed by taking the XOR of all the
jth pages in all servers. We say that all these jth pages belong to the same
parity group.  If a server crashes, all its pages can be restored by XORing
all pages within each parity group.

      When the client swaps out a page it has to update the parity to reflect
the change. This update is done in two steps:


   1. The client sends the swapped out page to the server, which computes
      the XOR of the old and the new page.

   2. The server sends the just computed XOR to the parity server, which
      XORs it with the old parity, forming the new parity.


Unfortunately, this method involves two page transfers:  one from client
to server, and one from server to parity.  Moreover, the client should not
discard the page just swapped out, because the server may crash before the
new parity is computed, thus, making it impossible to restore the swapped
out page. This parity method increases the amount of remote main memory
only by a factor of (1 + 1=S) minimizing the memory overhead, but it still
imposes a significant runtime overhead.


Parity Logging:   To avoid the additional page transfers induced by the
basic parity method, we have developed a parity logging scheme. The key
idea is that a given page need not be bound to a particular server or parity
group.  Instead, every time a page is paged out, a new server and a new
parity group may be used to host the page.

      Suppose the client uses S servers.  Each paged out page is XORed
with a page size buffer maintained by the client (which is initially filled
with zeros) and then is transfered to a server following a round robin policy.

Whenever S pages have been transfered, the buffer is also transfered to a
parity server. Using this technique, the runtime overhead is minimal, since
for each paged out page 1 + 1=S page transfers are required. When a server
crashes, all of its pages can be restored by XORing the pages in their group
with the corresponding parity page. 2

      Every time a page is repaged out, it is marked in the old parity group
containing it as inactive. 3 When all the pages of a parity group are marked
as inactive, all the memory server pages and the corresponding parity page
can be reused.  It is obvious that each memory server must have some
extra overflow memory to support parity logging since many versions of a
given page may be present simultaneously at the servers' memory.  Also,
due to this situation, it is possible that some server runs out of memory.
In this case, one has to perform garbage collection freeing parity sets by
combining their active pages to new ones.  In our experiments, 4 servers
were used devoting 10% more memory to support parity logging and we
never had to perform garbage collection.


3   Implementation

      The proposed system has been built and is in everyday use. It consists
of a client issuing paging requests and servers satisfying these requests. It is
also able to use the local disk for paging and may support either mirroring
or parity logging.  The client side has been linked with the DEC OSF/1
kernel of a DEC-Alpha 3000 model 300 with 32 MB main memory as a
block device driver that handles all pagein and pageout requests. In order
to service these requests, it may forward them either to user level servers
running on other hosts, or to the local disk. The DEC OSF/1 kernel is not
even aware that we use remote main memory instead of magnetic disk as
a paging device.  It just performs ordinary paging activities using a block
device.  This design minimizes the modifications needed in order to port
the system to another operating system and avoids modifications to the
operating system kernel.


3.1    The Remote Memory Pager

      Normally the Remote Memory Pager (RMP for short) is a client which
forwards the paging requests to a remote server using sockets over an Eth-
ernet. The RMP connects to the remote memory servers using sockets over
TCP/IP. One dedicated paging daemon issues pagein and pageout requests
to the server and receives the data sent by them. When mirroring is used, it
is responsible for selecting two servers for each paged out page and transfer
the data to them.  When parity logging is used, it maintains all the data
structures related to page and parity group management and computes the
parity pages.  Security is ensured by allowing access to our device only to
the superuser and by using privileged ports for the communication among
the client and the servers.

      RMP is also capable of forwarding the requests to the local disk using
either a specified partition or a file. In the former case, it invokes a routine
that places the request in the disk queue. In the later case it issues a read
________________________________2
    Note that since the parity page is computed by the client, it is not necessary to wait
for acknowledgments from the servers before transfering the parity page in order to be
able3to recover from a single server crash.
    However, the old version of the page is not deleted from the server's memory, because
if it were, the old parity page should be updated, leading to more page transfers.

Figure 2: Performance of applications using either the disk, or the
remote memory as paging device.  We see that for all applications,
the use of remote memory results in significantly faster execution.   All
applications were run on a DEC-Alpha 3000 model 300 workstation.  The
input sizes for QSORT was 3000 records, for GAUSS, a 17001700 matrix, for
MVEC, a 21002100 matrix, for FFT an array with 700 K elements, for
FILTER a 12 MB image, and the whole DEC OSF/1 V3.2 kernel for CC.


or write operation through the VFS layer routines. When no server can be
found in order to satisfy the client's requests, paging to local disk is used.

      Although the current implementation runs on top of a low bandwidth
10 Mbps Ethernet, remote paging is up to 2 times faster than using a local
disk of the same bandwidth.  It takes about 8.4 ms to transfer an 8KB
page through the network, while transferring a page to/from the local disk
takes about 17 ms.  Faster networks such as ATM, or FDDI should offer
even more promising performance, especially when faster communication
protocols are used [26 ].


3.2    The Remote Memory Server

      The server is a user level program listening to a socket and accepting
connections from clients.  Each client is served by a new instance of the
server which uses portion of the local workstation's main memory to store
the client's pages.  When the client requests a pagein, the server transfers
the requested page(s) over the socket. When the client requests a pageout,
the server reads the incoming pages from the socket, and stores them in its
main memory. The server is also responsible for swap space allocation and
for providing periodically information to the client concerning the memory
load of its host.  A parity server is by no means different than a memory
server. It just performs pageins and pageouts responding to client requests
without knowing whether it stores memory pages or parity pages.


4   Performance Results

      To evaluate the performance of our remote memory pager, and com-
pare it to traditional disk paging, we conducted a series of performance
measurements using a number of representative applications that require a
large amount of memory. Our applications include GAUSS, a gaussian elim-
ination, QSORT, a quicksort program, FFT, a Fast-Fourier Transform, MVEC,
a matrix-vector multiplication, FILTER, a two pass separable image sharp-
ening filter described in [20 ] and CC, a kernel build after modifying the code
of our device driver.  All applications were executed on the DEC-Alpha
3000 model 300, and were compiled with the standard C compiler with the
optimization enabled.  All workstations that contributed their main mem-
ory for paging purposes were DEC-Alpha 3000 model 300, connected via
a standard 10Mbits/sec Ethernet.  In all experiments the amount of idle
memory was larger than the amount of memory needed for paging and was
equally distributed among all workstations.  The local disk that was used
for paging is a DEC RZ55, providing 10Mbits/sec bandwidth, and average
seek time of 16 msec.


4.1    Performance of Remote Memory Paging Over the Ethernet

      In our first experiment we evaluate four methods for paging:

    fflNO_RELIABILITY, which uses only main memory of other workstations
      as a paging device.  In this experiment two remote memory servers
      were used. The measurements were done on an (almost) idle Ethernet
      to ensure repeatability.

    fflPARITY_LOGGING, which uses 4 servers plus a parity server, all devot-
      ing 10% overflow memory.

    fflMIRRORING, which uses one primary memory server and one mirror
      memory server.

    fflDISK, which uses the local DEC RZ55 disk for paging.  In this case
      the page transfer requests go directly from the DEC OSF/1 kernel to
      the disk driver without the intervention of our pager.


The completion time of the applications is plotted in figure 2. We see that
in all cases the use of remote memory results in significant performance im-
provements.  For example, for the GAUSS application, the NO_RELIABILITY
results in 96% faster execution time than DISK. Even for the MVEC  appli-
cation which performed very little paging, NO_RELIABILITY results in 32%
faster execution time.  The reliability methods induce some runtime over-
head as expected but still perform much better than DISK. PARITY_LOGGING
results in 40.4% faster execution time for QSORT and in 59.86% faster time
for GAUSS. MIRRORING also performs better than DISK for all applications
except MVEC, since MVEC performs many pageouts and almost no pageins.

      In order to evaluate the use of remote memory for a more realis-
tic  application,  we  measured  the  completion  time  of  a  kernel  build  af-
ter modifying the code of our device driver.  As can be seen in figure 2,
NO_RELIABILITY performs 26.56% better than disk, PARITY_LOGGING per-
forms 24.65% better and MIRRORING performs just 9.7% better. We see that
PARITY_LOGGING performs very close to NO_RELIABILITY. As the number of
the remote memory servers used increases, the difference in performance
between NO_RELIABILITY and PARITY_LOGGING becomes lower.

      Our performance results suggest that paging to remote memory over
an Ethernet interconnection network is simply faster than paging to the
disk. Even though both the disk and the Ethernet have similar data transfer
rates, remote memory does not suffer from seek and rotational latency as
DISK does.

      Our  experimental  results  verify  that  even  when  the  network  data
transfer rate is as low as the disk transfer rate, the performance of remote
memory is significantly higher than the performance of disk. Moreover the
performance requirements of reliability are surprisingly small.  Since ar-
chitecture trends suggest that modern high speed networks provide much
higher data transfer rates than modern disks, the performance improve-
ments of remote memory over disk are bound to increase.


4.2    Scaling the Input

      To understand the impact of the working set size on the paging policy,
we measure the execution time of one of our applications (FFT), as a function
of its input size. The completion time of FFT both under PARITY_LOGGING
and under DISK is plotted in figure 3. We see that as soon as the working
set size exceeds 18 MBytes, the paging starts, and the completion time of
the application rises sharply.  Most users would not be willing to tolerate


Figure 3:  Performance of FFT as a function of input size when
either the disk, or remote memory are used as backing store.


such a high overhead in order to run an application that does not fit in main
memory. Fortunately, remote memory reduces this overhead substantially.


4.3    Scaling the Network Bandwidth

      Although figure 3 suggests that the performance of remote memory
(parity logging) is significantly better than the performance of disk, the
completion time of an application even under remote memory may be un-
acceptably high. Hopefully, the performance of remote memory will be im-
proved as soon as the Ethernet interconnection network is substituted by a
faster one (e.g. FDDI, ATM, FCS, etc.). To evaluate the performance of the
applications on top of faster networks we make detailed performance mea-
surements that separate the completion time of the application into the fol-
lowing factors: (i) user time (utime), (ii) system time (systime) (iii) initial-
ization time (inittime) (iv) page transfer time (ptime). Using the provided
time command we measure the utime, systime, and elapsed time (etime)
for each application.  Subtracting the utime and systime from the etime
for instances of the applications that perform no paging we calculate the
inittime, that is the time it takes the operating system to load and start ex-
ecuting the application. The ptime consists of the protocol processing time
(pptime) and the bandwidth dependent blocking time (btime).  We mea-
sured the pptime and found it to be equal to 1.6 ms per page for TCP/IP.
We calculate the btime using the formula :  btime = (etime  utime

systime  inittime  no_of _page_transf ers  pptime): Assuming that a
network with X times higher bandwidth will decrease btime by a factor of
X, we can predict the etime of the application over this high bandwidth
network.  Thus, the formula used is :  Expected_elapsed_time = utime +
systime + inittime + number_of _page_transf ers  pptime + btime=X:

      We made all these measurements on our FFT application, and pre-
dict its performance on a system with an interconnection network which
provides ten times more bandwidth than the Ethernet. We also predict its
completion time on a system that has enough memory to hold all the work-
ing set of the application (ALL_MEMORY) by adding the utime, systime and
inittime.  The predicted execution times, along with the measured execu-
tion times of DISK and PARITY_LOGGING are plotted in figure 4. We see that
ETHERNET*10 performs very close to ALL_MEMORY, and significantly better
than both ETHERNET and DISK.

      To understand the results shown in figure 4, we analyze the execution
time of FFT with 24MBytes of input when PARITY_LOGGING is used.  The
measured elapsed time is 130.76 seconds, consisting of 66.138 sec of useful
user time, 3.133 sec of system time, 0.21 sec of initialization time and 61.279
sec of page transfer time.  During the same run, the application suffered
2718 pageouts and 2055 pageins.  Since 4 servers were used plus a parity
server the number of page transfers was equal to 3397 + 2055  =  5452.
Thus the protocol overhead was equal to 5452  0:0016, or about 8.723 sec.
The bandwidth dependent blocking time was equal to 61:279  8:723, or
about 52.556 sec.  Using a ten times faster interconnection network, the
bandwidth dependent waiting time will be reduced to 5.255 sec. Thus, the
total completion time of FFT would be 66:138 + 3:133 + 0:21 + 8:723 + 5:255
sec, or 83.459 sec, divided as follows:  79.246% in user time, 3.754% in
system time, 0.252% in initialization time and 16.748% in page transfer
time.  We see that a 100 Mbit/sec interconnection network reduces the
total paging overhead to less than 17% of the total application execution
time. We believe that most users would be willing to pay such an overhead
in order to run an application that does not fit in main memory. After all,
the only other option they have is to suffer from disk thrashing.


4.4    The Latency of Remote Memory Paging

      As explained previously, the paging latency for FFT with input size
equal to 24 MB is 61.279 sec, or 11.24 ms per page transfer.  From these,
1.6 ms were spent during protocol processing and 9.64 ms were spent trans-
ferring each page on the Ethernet.

      Previous  measurements  have  reported  that  a  4  KByte  page  takes
about 45 ms over an Ethernet for each pagein [22 ].  Of those 45 ms, 19
ms were spent on TCP overhead, 4 ms were spent on Mach IPC overhead,
7.2 ms were spend on the Ethernet, and the rest were spent on the com-
puter's I/O bus. The total software latency of our implementation, is only
1.6 ms. The reason for this significant difference in performance is threefold:


    fflThe I/O bus of the DEC-Alpha 3000 model 300 we use is significantly
      faster and does not pose a problem in performance.

    fflWe use a DEC-Alpha processor, which is 3-4 times faster than the
      386 processor used in [22 ].

    fflFinally, our pager is implemented as a block device driver, while in [22 ]


Figure 4:  Performance of FFT for various Architecture Alterna-
tives. DISK is the measured completion time when paging to a local disk.
ETHERNET is the measured completion time of parity logging to remote mem-
ory on top of the Ethernet. ETHERNET*10 is the predicted completion time
when using remote memory as a paging device, on top of a network that
provides ten times more bandwidth than the Ethernet interconnection net-
work.  ALL_MEMORY is the predicted completion time of FFT when we use
the same workstation but with enough memory to hold its entire working
set.

      it was implemented as a user-level memory manager on top of Mach.
      Although user-level memory management gives increasing flexibility
      it induces large overhead.


      In general, although our approach may have less flexibility than a full-
fledged user-level pager, it has much better performance.  Moreover, our
device-driver implementation provides better performance than traditional
(local) disk paging, while user-level implementations have not reported per-
formance results to support similar claims [22 ].


4.5    Using Busy Workstations as Servers

      In all our experiments so far, the remote memory servers run on idle
workstations. However, workstations that are able to donate their memory
for paging purposes may not be completely idle, as they may run interac-
tive applications. Thus, we would like to investigate how our performance
figures change when a non-idle workstation is used as a memory server. So,
we conducted the following experiment:


      On each server workstation we started an X-window environ-
      ment, and an instance of the vi editor which was continuously
      used for editing.  Then, we run the applications of the experi-
      ment in figure 2.  The same inputs, and the same clients were
      used.  The only difference was that the remote memory server
      processes were run on busy instead of idle workstations.4


We were surprised to see that for the FFT, GAUSS, and MVEC applications,
their completion times were within 1 sec of their completion times when the
server ran on an idle workstation.  Only QSORT suffered a 7% overhead in
its completion time: probably the kernel swapped out some of the remote
memory server's pages on the disk. However, in order to find out how the
completion time of our applications changes with server load, we ran FFT
and QSORT under NO_RELIABILITY using two remote memory servers.  On
one of them a cpu bound program (performing a "while(1);" loop) was ini-
tiated. To our surprise, even then the completion times of our applications
were within 7% of their completion times when the server ran on an idle
workstation.

      Our performance figures suggest that most of the time the remote
memory servers were able to satisfy the client's requests immediately, even
on busy workstations.  Our results agree with the measurements in figure
1 which report that a significant portion of all workstation's memory is
unused even at business hours, thus no overhead is expected to be seen
when some other server process uses the extra pages.

      In the same course of experiments, we would like to see what is the
overhead that remote paging induces on the server workstation. Thus, we
measured the CPU utilization of the (otherwise idle) remote memory server
for all our experiments, and found it always to be less than 15%. Thus, the
computational overhead imposed on the remote workstation is so low that
will not be noticed by the workstation's owner.
________________________________4
    One could argue that an X-window environment and an editor, induce almost no
load on the workstation. But, this is exactly the point: a typical workstation, even when
it is used, it is very lightly loaded. The rest of the workstations that are heavily loaded
do not donate their main memory for remote paging.

4.6    Using Remote Memory Paging over a Loaded Ethernet

      All the experiments presented so far were done over an almost idle
Ethernet to ensure repeatability of our results. However, we would like to
find out how the performance of remote memory paging is affected by the
load of the network.  That is why we repeated our experiments using an
already loaded Ethernet.  The results showed a performance degradation
even when the Ethernet was lightly loaded. This situation is by no means
surprising since the paging itself uses all the bandwidth it can get. Adding
more sources of traffic leads to an enourmous demand for bandwidth causing
repeated collisions and lowering the effective bandwidth of the network,
leading to throughput collapse.

      Fortunately, this inefficiency is not inherent to remote memory paging
but rather to the CSMA/CD protocol employed by the Ethernet [24 ]. This
means that it is still beneficial to use remote memory paging over networks
that employ other technologies (e.g.  token ring), as long as they are able
to provide to remote memory paging an effective bandwidth of 10 or more
Mbps.


4.7    Using the Local Disk to Increase Reliability

      In our work we use remote main memory to store redundant infor-
mation that will be used to recover from workstation crashes.  Another
approach would be to store all remote pages to the local disk as well [11 ],
effectively treating remote memory as a write-through cache of the disk.
We will now compare the two approaches to find out the circumstances
under which the one approach is preferable to the other.

      Both approaches use the remote memory to satisfy the read requests.
This means that both approaches perform reads at the same speed and
avoid disk head movements due to reads, thus outperforming the local disk.
Parity logging transfers 1+1=N pages per paged out page, due to the parity
computation (in our experiments N was equal to 4).  On the other hand,
write through transfers each paged out page both to disk to the remote
memory.  These two page transfers are executed in parallel.  This means
that the choice of the right approach depends on the effective bandwidth
offered by the disk and the network.  If the network bandwidth is much
higher than the disk bandwidth, then the disk will be the bottleneck for
write through making it an unwise choice.  If however the effective band-
width offered by the disk is comparable to the bandwidth offered by the
network and the system can overlap disk transfers with network transfers
then it is unclear which method is best to use. In our experimental environ-
ment the disk and network bandwidth are both equal to 10 Mbps.  When
write through is used the efective disk bandwidth is close to 10 Mbps, since
there are no head movements for reads and writes are performed in large
chunks. In this environment write through performs better than parity log-
ging and slightly worse than our no-reliability implementation in most cases,
as shown in figure 5 . However, when a modern high bandwidth network is
used, parity logging will probably be the best approach, since write through
will eventually be limited by the local disk bandwidth.


5   Discussion - Future work

      Our implementation suggests that it is possible to build a reliable
efficient  remote memory pager without making any modifications to the


Figure 5:  Performance  of  parity  logging  and  write  through  for
various applications.  The input sizes for QSORT was 3000 records, for
GAUSS, a 17001700 matrix, for MVEC a 21002100 matrix, and for FFT
an array with 700 K elements.

operating system kernel. Although our system contains all necessary mech-
anisms to support remote memory paging,  there are a few more issues
concerning the overall policy that deserve further investigation.  Some of
these issues are discussed below.


Network load:   Although remote paging is faster than using the local
disk, sometimes the network traffic may be so high that the bandwidth
used by RMP will be limited.  In this case the cost of using the network,
especially in the case of old low bandwidth networks like Ethernet, may
become higher than the cost of using the local disk. Such a situation could
be handled by the RMP by measuring the time it takes to satisfy a request
and using a threshold to determine whether it should continue to use the
network to route pageout requests or it would be better to switch to the
local disk.


Heterogeneous networks:   The current implementation assumes a net-
work of workstations that all have the same order of magnitude of physical
memory and are interconnected by a local area network.  It would be in-
teresting to explore the requirements that heterogeneous networks pose to
the design of the remote pager. For example, on a wider area network the
time it takes to transfer a page may not be identical for each server. In this
case there may be more than three levels in the memory hierarchy (local
memory, remote memory, disk), depending on the variance of the cost of
communication among the hosts of the network. Connecting machines that
have an enormous amount of memory (e.g. a supercomputer) to a network
of workstations also poses some problems. When the supercomputer mem-
ory is idle, it may not always be easy to find enough free remote workstation
memory in order to be able to use reliability policies.  In this case, a no
reliability policy can be used, since all remote memory will be provided by
a single host (the supercomputer).


6   Related Work

      Several research groups have studied the issues in using remote mem-
ory in a workstation cluster to improve paging performance [2, 12 , 7, 15 ,
22 , 3].

      Felten and Zahorjian [12 ] have implemented a remote paging system
on top of a traditional Ethernet based system, and presented an analytical
model to predict its performance. Their performance results, although pre-
liminary, are encouraging towards using remote memory paging systems.
Schilit and Duchamp [22 ] have implemented a remote memory paging sys-
tem on top of Mach 2.5 for portable computers.  Their remote memory
paging system has performance similar to local disk paging. The cost of a
single remote memory pagein over an Ethernet, they quote, is about 45 ms
for a 4Kbyte page, which is rather high. According to their measurements,
a significant percentage of this time (close to 16 ms) is spend executing
Mach IPC and TCP code.  Comer and Griffoen [7] have implemented and
compared remote memory paging vs. remote disk paging, over NFS, on an
environment with diskless workstations. Their results suggest that remote
memory paging can be 20% to 100% faster than remote disk paging, de-
pending on the disk access pattern. Anderson et. al. have proposed the use
of network memory as backing store [2].  Their simulation results suggest
that using remote memory over a 155Mbits/s ATM network "is 5 to 10
times faster than thrashing to disk" [2]. In their subsequent work [18 ], they

outline the implementation of a remote memory pager on top of an ATM
based network.

      Our work differs from previous approaches to remote memory paging
in the following aspects: (i) we use a variety of real applications to evalu-
ate and demonstrate the feasibility of remote memory paging, and (ii) we
explore the issues in building a reliable remote memory system that is re-
silient to individual workstation failures. Previous approaches either ignore
workstation failures, or write dirty pages both to the disk and the remote
memory, limiting their performance by the available disk throughput.

      Recently, research groups start to explore the issue of using remote
memory to improve file system performance [11 , 1, 8].  Feeley et.  al.  have
implemented a global memory management system in a workstation clus-
ter, using the idle memory in the cluster to store clean pages of memory
loaded workstations [11 ]. Anderson et. al. have implemented xFS, a server-
less network file system [1, 9].  Both network memory systems have been
incorporated inside the kernel of existing operating systems and their per-
formance has been demonstrated.  Although improvements in file system
performance may ultimately lead to paging performance improvements, so-
lutions developed for file systems may be cumbersome, or too general for
remote memory paging systems:  (i) in file systems, client processes may
share file data, leading to cooperative remote memory management poli-
cies. In paging instead, clients never share their swap spaces. Thus, policies
developed to optimize a client-server approach to file I/O, and facilitate co-
operation among client processes that share data, do not necessarily apply
to a paging system where no single paging server is used, and no sharing
(of swap spaces) between client processes takes place.  Finally, we use the
network memory for storing both clean and dirty pages using our novel
parity-based approach. Thus, page out (write) operations can be acknowl-
edged at the speed of remote memory, while in [11 , 1] page out operations
are acknowledged at the speed of disk.

      Although the area of reliability in network memory systems is new,
it shares several of the ideas developed for other areas of reliable memory
management.  For example, parity based methods have been extensively
used for Redundant Arrays of Inexpensive Disks (RAIDs) [6].

      Log based methods have been used for Log based file systems, that
send all updates to a file to be written in sequential blocks of the disk [21 ].
Thus, the head of the disk does not make random seek movements, and the
effective data transfer rate of the disk increases.  Log based file systems,
alike our LOGGING methods, create a fragmented space that needs to be
cleaned.  Although the general ideas may be similar, there are substantial
differences between a log based file system and the log based reliable net-
work memory we propose. For example, (i) Fragmentation in log based file
systems occurs in large chunks (several Mbytes), while fragmentation in log
based reliable network memory occurs in small parity groups, and (ii) Log
based reliable network memory systems may use parity groups as soon as
they are emptied, but log based file systems may not used emptied disk
blocks, because this would require a head movement.  (iii) Cleaning in log
based file system is much more infrequent than it is in network memory,
thus it must be made more efficient, and (iv) the objective of log based
network memory systems is to reduce page transfers, while the objective
of log based file systems is to reduce disk head movements. For the above
reasons, methods developed for log based file systems do not necessarily

apply "as is" to network memory systems.

      Our work bears some similarity with distributed shared memory sys-
tems [17 , 10 ] in that both approaches use remote memory to store an ap-
plication's data.  Our main difference is that we focus on sequential appli-
cations where pages are not (or rarely) shared, while distributed-shared-
memory projects deal with parallel applications, where the main focus is
to reduce the cost of page sharing.


7   Conclusions

      In this paper we explore the use of remote main memory for paging.
We describe our prototype implementation of a remote memory pager im-
plemented on top of the DEC OSF/1 operating system as a device driver.
No modifications were made to the kernel of the (monolithic) DEC OSF/1
operating system.  We run several applications that use our pager on top
of a DEC-Alpha-based workstation cluster to measure the performance of
the system. The contributions of this paper are:


    fflWe describe how to build a reliable remote memory paging system;
      we propose a novel parity-based policy that is resilient to single work-
      station failures.

    fflWe show that reliable paging to remote memory results in substantial
      performance improvements over local disk paging.


Based on our implementation and our performance results we conclude:


    fflPaging to remote memory results in significant performance improve-
      ment over paging to disk. Applications that use our pager even when
      running on top of traditional Ethernet technology show performance
      improvements of up to 96% (see figure 2).  Extrapolating from our
      results, we show that on top of a faster interconnection network even
      higher performance improvements are realizable!

    fflPaging to remote memory is an inexpensive way to let applications use
      more main memory than a single workstation provides. Remote mem-
      ory paging provides good performance with almost no extra hardware
      support. The only way for magnetic disks to provide comparable per-
      formance is to use expensive disk arrays.

    fflReliability in remote memory paging comes at low cost.  Parity log-
      ging based paging provides reliability at low runtime and memory
      overhead, performs very close to NO_RELIABILITY and much faster
      than disk paging.

    fflThe benefits of paging to remote memory will only increase with time.
      Current  architecture  trends  suggest  that  the  gap  between  proces-
      sor and disk speed continues to widen.  Disks are not expected to
      provide the bandwidth needed by paging unless a breakthrough in
      disk technology occurs. On the other hand, interconnection network
      bandwidth keeps increasing at a much higher rate than (single) disk
      bandwidth, thereby increasing the performance benefits of paging to
      remote memory.

      Based on our performance measurements we believe that remote mem-
ory  paging  is  a  cost-effective  and  performance-effective  way  to  execute
memory-limited applications on a network of workstations.


Acknowledgments

      This work was developed in the ESPRIT/HPCN project "SHIPS",
and will form a test application for the ACTS project "ASICCOM", funded
by the European Union (DG III and DG XIII). We deeply appreciate this
financial support, without which this work would have not existed.

      We  would  like  to  thank  Manolis  Katevenis,  Sotiris  Ioannidis  and
Kosmas Papachristos for useful comments in earlier drafts of this paper.
Michael J. Feeley and Henry M. Levy pointed out useful references.  Fi-
nally,  we deeply appreciate the thoughtful comments of the anonymous
referees.


References

 [1] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S.
     Roselli, and R. Y. Wang.  Serverless Network File Systems.  In Proc.
     15-th Symposium on Operating Systems Principles, December 1995.

 [2] Thomas E. Anderson, David E. Culler, and David A. Patterson.  A
     Case for NOW (Networks of Workstations).  IEEE Micro, February
     1995.

 [3] G. Bernard and S. Hamma.  Remote Memory Paging in Networks of
     Workstations.  In Proceedings of the SUUG International Conference
     on Open Systems: Solutions for Open Word, April 1994.

 [4] M. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. Felten, and J. Sand-
     berg.  Virtual Memory Mapped Network Interface for the SHRIMP
     Multicomputer.  In Proceedings of the Twenty-First Int. Symposium
     on Computer Architecture, pages 142-153, Chicago, IL, April 1994.

 [5] Greg Buzzard,  David Jacobson,  Scott Marovich,  and John Wilkes.
     Hamlyn:   a  high-performance  network  interface  with  sender-based
     memory management. In Proceedings of the Hot Interconnects III Sym-
     posium, August 1995.

 [6] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and
     David A. Patterson.  RAID: High-Performance,  Reliable Secondary
     Storage. ACM Computing Surveys, 26(2):145-185, June 1994.

 [7] D. Comer and J. Griffoen.  A new design for Distributed Systems:
     the Remote Memory Model.  In Proceedings of the USENIX Summer
     Conference, pages 127-135, 1990.

 [8] T.  Cortes,  S.  Girona,  and  J.  Labarta.    PACA:  A  Distributed
     File System Cache for Parallel MAchines. Performance under Unix-
     like  workload.   Technical  Report  UPC-DAC-1995-20,  Departament
     d'Arquitectura de computadors, Universitat Politecnica de Catalunya
     (UPC), June 15 1995.

 [9] M.D. Dahlin, R.Y. Wang, T.E. Anderson, and D.A. Patterson. Coop-
     erative Cahing: Using Remote Client Memory to Improve File System
     Performance. In First USENIX Symposium on Operating System De-
     sign and Implementation, pages 267-280, 1994.

[10] G. Delp.  The Architecture and implementation of Memnet:  A High-
     Speed Shared Memory Computer Communication Network. PhD thesis,
     University of Delaware, 1988.

[11] M. J. Feeley, W. E. Morgan, F. H. Pighin, A. R. Karlin, H. M. Levy,
     and C. A. Thekkath. Implementing Global Memory Management in a
     Workstation Cluster. In Proc. 15-th Symposium on Operating Systems
     Principles, December 1995.

[12] E. W. Felten and J. Zahorjan.  Issues in the Implementation of a Re-
     mote Memory Paging System. Technical Report TR 91-03-09, Univer-
     sity of Washington, November 1991.

[13] R. Gillet.  Memory Channel.  In Proceedings of the Hot Interconnects
     III Symposium, August 1995.

[14] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quan-
     titative Approach. Morgan Kaufmann Publishers, Inc., 1990.

[15] L. Iftode, K. Li, and K. Petersen. Memory Servers for Multicomputers.
     In Proceedings of COMPCON 93, 1993.

[16] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Ghara-
     chorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta,
     M. Rosenblum,  and J. Hennessy.  The FLASH Multiprocessor.  In
     Proc. 21-th International Symposium on Comp. Arch., pages 302-313,
     Chicago, IL, April 1994.

[17] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Mem-
     ory Systems. ACM Transactions on Computer Systems, 7(4):321-359,
     November 1989.

[18] A. Mainwaring, C. Yoshikawa, and K. Wright.  NOW White Paper:
     Network              RAM              Prototype,                 1994.
     https://now.cs.berkeley.edu/Nram/network-ram.html.

[19] Evangelos  P.  Markatos  and  Manolis  G.H.  Katevenis.   Telegraphos:
     High-Performance Networking for Parallel Processing on Workstation
     Clusters.  In Proceedings of the Second International Symposium on
     High-Performance  Computer  Architecture,  (HPCA),  San  Jose,  CA,
     USA, February 1996.

[20] Gary Newman. Organizing Arrays for Paged Memory Systems. Com-
     munications of the ACM, 38(7):93-110, July 1995.

[21] Mendel Rosenblum and John Ousterhout. The Design and Implemen-
     tation of a Log-Structured File System. In Proc. 13-th Symposium on
     Operating Systems Principles, pages 1-15, October 1991.

[22] B.N. Schilit and D. Duchamp.  Adaptive Remote Paging for Mobile
     Computers.  Technical Report CUCS-004-91, University of Columbia,
     1991.

[23] Dolphin Interconnect Solutions.  DIS301 SBus-to-SCI Adapter User's
     Guide.

[24] A. S. Tanenbaum. Computer Networks, chapter 3, page 128. Prentice
     Hall International, 1989.

[25] C.A. Thekkath, H.M. Levy, and E.D. Lazowska. Efficient Support for
     Multicomputing on ATM Networks.  Technical Report 93-04-03, De-
     partment of Computer Science and Engineering, University of Wash-
     ington, April 12 1993.

[26] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active
     Messages: a Mechanism for Integrated Communication and Computa-
     tion. In Proc. 19-th International Symposium on Comp. Arch., pages
     256-266, Gold Coast, Australia, May 1992.


Biographical information

Evangelos P. Markatos   is an Assistant professor at ICS-FORTH and at
the University of Crete. He received his diploma in Computer Engineering
from the University of Patras in 1988, and the MS and Ph.D. degrees from
the University of Rochester in 1990 and 1993 respectively.  His interestes
include parallel and distributed systems, operating systems and computer
architecture.


George Dramitinos   is a graduate student in Computer Science at the
University of Crete, where he received a B.Sc.  degree.  He has worked at
A.C.R.I. in Lyon, France, participating in the design and implementation
of an OSF/1 based operating system for the company's supercomputer.
He joined ICS-FORTH in 1993.  His interests include operating systems,
parallel and distributed programming and computer architecture.

      The authors can be contacted at fmarkatos,  dramitg@ics.forth.gr.
or at their postal address at Institute  of  Computer  Science  (ICS),
FORTH,  Science  and  Technology  Park  of  Crete,  Vassilika  Vouton,
P.O.  Box  1385,  GR  711  10  Heraklion,  Crete,  Greece.


Availability

      The most recent version of the pager along with the test programs are
freely distributed using ftp from ftp.ics.forth.gr:pub/pager. More informa-
tion about the project can be found at https://www.ics.forth.gr/proj/arch-vlsi/os.