The following paper was originally published in the Proceedings of the USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993.

Anonymous RPC: Low-Latency Protection in a 64-Bit Address Space

Curtis Yarvin, Richard Bukowski, and Thomas Anderson
Computer Science Division, University of California at Berkeley

Abstract:

In this paper, we propose a method of reducing the latency of cross-domain remote procedure call (RPC). Traditional systems use separate address spaces to provide memory protection between separate processes, but even with a highly optimized RPC system, the cost of switching between address spaces can make cross-domain RPC prohibitively expensive. Our approach is to use _anonymity_ instead of hardware page tables for protection. Logically independent memory segments are placed at random locations in the same address space and protection domain. With 64-bit virtual addresses, it is unlikely that a process will be able to locate any other segment by accidental or malicious memory probes; it is impossible to corrupt a segment without knowing its location. The benefit is that a cross-domain RPC need not involve a hardware context switch. Measurements of our prototype implementation show that a round-trip null RPC takes only 7.7us on an Intel 486-33.

Support: This work was supported in part by the National Science Foundation (CDA-8722788), the Digital Equipment Corporation (the Systems Research Center and the External Research Program), and the AT&T Foundation. Yarvin was supported by a California MICRO Fellowship, Bukowski by an NSF Graduate Fellowship, and Anderson by a National Science Foundation Young Investigator Award.

1. Introduction

A traditional function of operating systems is providing _protection domains_, or areas of memory accessible only by the process which owns them. UNIX, for example, keeps every process in an entirely separate address space; other systems use a shared address space, but have a different page protection map for each process. Either way, keeping separate processes in separate protection domains provides _safety_ and _security_: protection from buggy processes which accidentally touch memory locations they don't own, and protection from malicious processes trying to read or alter the memory of other processes. Safety and security are necessary in any modern operating system.

Using virtual memory hardware to enforce protection is flexible and powerful. However, it is also expensive. Giving each process its own address space increases the amount of context-specific state, and thus the cost of context switches. Context-switch cost contributes little to system overhead on normal UNIX systems, but it sets a strong lower bound on the cost of IPC. At a minimum, two context switches are required for each round-trip interprocess communication. This can limit the feasibility of splitting logically independent, but closely cooperating, modules into separate protection domains.
Software overhead in context switching once dominated the hardware cost. But, as Anderson et al. [And91] discuss, the former has decreased with processor improvements while the latter has not. The result is that context-switch times for conventionally protected systems have remained static and large, and now pose a significant impediment to extremely fine-grained IPC.

One solution is lightweight threads which share a single protection domain. Thread switching and communication are fast, but threads are not safe or secure, and even when used in a trusted environment are vulnerable to byzantine memory-corruption bugs. Our goal is to build a system that combines the protection of UNIX processes and the speed of threads, to make IPC cheap enough for heavy use. This could allow application interactions of a much finer grain than are now feasible, and large software systems could be structured as groups of cooperating processes instead of single monolithic entities.

The key to our approach is Druschel and Peterson's observation [DP92] that, in a very large, sparse address space, virtual address mappings can act as _capabilities_ [DVH66]. If a process knows a segment's position in its address space, accessing it is trivial; without this knowledge, access is impossible. Protection can be accomplished by restricting the knowledge of segment mappings. We refer to this approach as _anonymity_.

Anonymity was not feasible before the advent of 64-bit architectures. A 32-bit address space is small enough that every valid page in the address space can be easily found through exhaustive search. But searching a 64-bit address space in this way is nontrivial. Thus, on a 64-bit machine, it is possible to randomly map unprotected pages in a shared region and use their virtual addresses as capabilities. This provides a fast and flexible way of sharing memory. Druschel and Peterson use their approach to pass buffers between user-level protocol layers in their x-kernel system. We believe that anonymity can be used as a general-purpose protection mechanism, providing reasonable safety and security to independent processes sharing the same address space. The protection this provides is _probabilistic_, but effective. Between processes in the same address space, a context switch is a simple matter of swapping registers and stacks, and can be performed with the same efficiency as with lightweight threads. We have developed a simple prototype of anonymous protection; our implementation can perform a round-trip null procedure call between two protected domains in only 7.7us on an Intel 486-33.

The rest of this paper discusses these topics in more detail. Section 2 shows how we can use a large address space to implement probabilistic protection; Section 3 shows how we preserve anonymity during cross-domain communication. Section 4 outlines some potential uses of anonymous protection, while Section 5 outlines some limitations of our approach. Section 6 presents performance results; Section 7 considers related work. Section 8 summarizes our experiences.

2. Implementing Anonymous Protection

Safe and secure anonymity is not difficult to implement, but requires some care. First we need a way to assign addresses. When a segment (any piece of memory that must be mapped contiguously) is loaded, it needs a virtual address.
If the segment is to remain anonymous, no process that is not explicitly given this address may be allowed to discover or derive it. So we must select the address randomly. If the address is truly random, then no better algorithm exists to discover it than brute-force search. Unfortunately, there is no such thing as a truly random number generator. The best we can do is a cryptosystem, such as DES [NBS77]; if the key is a secure password or code, and the plaintext an allocation sequence number, then the resulting encrypted text will be securely random. We transform the encrypted text into a virtual address and map the segment at that position (unless it would overlap another segment, in which case we compute another address).

This ensures that the most efficient algorithm for finding other processes' data will be brute-force search: iterating through virtual space and dereferencing every page (or every segment width, if segments are larger than pages), intentionally ignoring segmentation faults until a valid page is found. (The UNIX signal handling machinery, for example, allows processes to catch and ignore segmentation fault signals; for some applications this is necessary semantics.) If a process is allowed to indefinitely ignore segmentation faults, this procedure will eventually find all segments in the address space. Security is compromised when it finds the first one. We analyze how long this will take. The analysis has the following parameters:

  V, the size of the virtual address space.
  M, the amount of physical memory (not the amount of the virtual address space that is in use; when faulting in a page not in primary memory, the operating system can explicitly check the segment permissions at the time of the page fault).
  D, the delay a process incurs on a segmentation fault.

To simplify the formula, both V and M are in units of the maximum segment size. The result of the analysis is T, the expected time for a malicious process to find a segment for which it does not have the capability.

We first define P(n), the probability that n memory probes will not find a mapped segment, or in other words, that the search is still unsuccessful after nD seconds. P(n) is equal to the probability that, in a sequence of V references, all of the references to the M mapped segments are after the first n references:

  P(n) = ((V-n) choose M) / (V choose M)

To find T, we set P(n) = 0.50 and solve for nD. Table 1 presents some numerically calculated values for T, as a function of the delay and the amount of mapped memory, assuming a 64-bit address space and 8KByte segments. (The segment size has little effect on the results.)

+-------------+-------------+------------+------------+
| Memory Size | 1s Delay    | 1ms Delay  | 1us Delay  |
+-------------+-------------+------------+------------+
| 16MB        | 24162 years | 24 years   | 1.25 weeks |
| 256MB       | 1510 years  | 1.5 years  | 13.2 hours |
| 2GB         | 188 years   | 2.3 months | 1.7 hours  |
| 16GB        | 23 years    | 8.5 days   | 12 minutes |
+-------------+-------------+------------+------------+
Table 1: Time To Breach Anonymous Security Through Brute-Force Search

So the processor's natural fault-handling delay is unlikely to suffice, and the system must impose an additional delay penalty. The penalty should not only put the faulting process to sleep; to keep a malicious user from using multiple processes to search the address space, the penalty must also apply to all processes forked by the process's owner. However, this would still allow groups of users to divide the penalty.
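The entries in Table 1 can be checked with a short calculation: when M is much smaller than V, P(n) is closely approximated by (1 - n/V)^M, so setting P(n) = 0.5 gives n = V(1 - 2^(-1/M)) probes and T = nD. The C program below is our own rough check under that approximation; it is not part of the prototype.

    /* approx_search_time.c: rough check of the Table 1 estimates.
     * Assumes P(n) ~= (1 - n/V)^M, valid when M << V; solving P(n) = 0.5
     * gives n = V * (1 - 2^(-1/M)) probes, and T = n * D.
     * Build with: cc approx_search_time.c -lm
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double V = ldexp(1.0, 64) / 8192.0;       /* 64-bit space, 8KB segments */
        const double mem_mb[] = { 16, 256, 2048, 16384 };
        const double delay[]  = { 1.0, 1e-3, 1e-6 };    /* D, seconds per fault */

        for (int i = 0; i < 4; i++) {
            double M = mem_mb[i] * 1024 * 1024 / 8192;  /* mapped segments */
            double n = V * (1.0 - pow(0.5, 1.0 / M));   /* probes until P(n) = 0.5 */
            for (int j = 0; j < 3; j++)
                printf("%5.0f MB, D = %5g s:  T = %.3g years\n",
                       mem_mb[i], delay[j], n * delay[j] / (365.0 * 86400));
        }
        return 0;
    }

With D = 1 second and 16MB of mapped memory this yields roughly 24,000 years, in line with the first row of Table 1; smaller delays and larger memories scale the result down proportionally.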
The penalty time is best set as local policy. Constraints are the desired level of security, the number of users, and the added difficulty of working with faulty programs. For a typical current system, a constant delay of 1 second would seem acceptable on all fronts. As the memory size or user base of a system increases, it may be necessary to switch to adaptive delay functions which examine recent fault history. Fault history can also be examined to report suspicious patterns to the administrator.

Thus, the security we provide is not perfect; it is only probabilistic. However, we believe that the probabilities involved are low enough to make them of little concern when compared to external security issues. Faulty software is also a threat; a faulty program may accidentally reference data it does not own. Normally, though, a failing program will stop once it causes a segmentation fault, and the likelihood that any individual accidental misreference will be to an otherwise unknown segment is quite small. The safety we provide is again probabilistic, not guaranteed, but the probabilities involved are minimal next to the chances of damage from external sources.

3. Anonymous RPC

Running separate processes in the same protection domain will provide fast context switching; to take advantage of this, we must devise an interprocess communication mechanism that preserves anonymity. Traditional UNIX paradigms like pipes and sockets are not easy to use for fine-grained communication; they involve considerable system and application overhead. A better scheme for our purposes is remote procedure call (RPC) [Birrell84]. In RPC, a _server_ process exports an interface to one of its procedures; any _client_ process can then bind to the procedure as if it were linked directly into the client. Local RPC has been extensively studied [Bersh90, Bersh91, SB90], and seems to be the most convenient communication paradigm for integrating software systems across domains. Druschel and Peterson optimized their RPC system mainly for data throughput; we feel that this goal has been achieved, and optimize our anonymous RPC system -- `ARPC' -- for round-trip latency.

3.1. Maintaining Anonymity During Communication

In principle, there is nothing to stop two processes running in the same address space from communicating directly via procedure calls. Unfortunately, this cannot be used as a protected communication protocol. In a normal procedure call, the caller must know the procedure's address in the callee's code segment, and must tell the callee its own address, to allow return. This is incompatible with anonymity. Even if code segments are write-protected and all data segregated, a malicious process can trace through code to find data.

To preserve anonymity, the path of control must flow through some _intermediary_: an entity which is itself protected, aware of the RPC binding, and able to manage control and data flow without revealing either party's address to the other. The most obvious intermediary is the kernel. Once a process traps to the kernel, it loses control of its execution, and can be shifted to any domain without having learned where that domain is. This is simple; unfortunately, it is slow. Kernel traps are typically an order of magnitude more expensive than procedure calls.

A better system can be devised if the host architecture supports execute-only page protection. Execute-only code lends itself well to anonymity; jumping to an execute-only entry point is anonymous for both sides.
The caller knows the address of the callee's text, but cannot damage that text or discover where the data might be. We cannot give the caller the actual entry point in the callee's code, however; a jump into an arbitrary point in callee code might compromise data. Instead, we use an execute-only jump table, synthesized to contain the entry point as an immediate operand. The cost of anonymity is an extra jump. (Execute-only anonymity also requires an architecture on which branches are atomic; that is, they fully commit the processor to execution at the branch point. Some RISC architectures allow the processor to take a branch, execute an instruction, and then have the branch nullified by another branch in the delay slot.)

The data transfer protocol must also be modified to preserve anonymity. In a normal procedure call, the ``server'' uses the same stack as the ``client'' for local storage. It is therefore easy for either to corrupt the other by accidentally or maliciously writing into the wrong stack frame. Thus our RPC protocol must include a stack switch on call and on return. We must also clear registers which may contain data that can compromise anonymity. Otherwise, anonymous RPC bears a close resemblance to ordinary procedure call, and can in some cases be performed with comparable efficiency.

3.2. ARPC Protocols

In this section we describe our ARPC protocols in considerable detail. A local RPC protocol can be constructed using either intermediary scheme. As an optimization, we design not one protocol but several, dividing RPC into cases based on the level of _trust_ between the client and server processes. If process A trusts that it is communicating with a non-malicious, but possibly buggy, process B, then process A can rely on the compiler and the RPC stub generator, instead of the kernel, to preserve its anonymity. This allows a more efficient implementation of cross-domain RPC. Malicious users can circumvent these utilities, but benign users are unlikely to do so accidentally.

3.2.1. Binding

Before any RPC calls can be performed, the client must _bind_ to the server procedure. Binding is only performed once for each client, server, and procedure; thus it need not be optimized as heavily as the call sequence itself. The server initiates the binding sequence by registering an _entry point_; it gives the RPC manager the name and address of the procedure. Once the entry point is present, the client can _connect_, telling the RPC manager the name of the procedure it wants to access. A permission check may be imposed on the connection; if it succeeds, the RPC manager generates the binding.

At bind time, the RPC manager creates any necessary intermediaries, and reports the RPC entry (be it an execute-only jump table, a kernel trap, or a direct entry into the server) to the client. The manager also creates an execution stack for use in executing calls through the new binding. All our ARPC protocols statically allocate one stack per procedure binding. This may seem wasteful of memory, but stacks can be cached and unmapped when not in frequent use. The static approach eliminates the need for a dynamic stack allocation on every call.

3.2.2. ARPC, Mutual Distrust

Our first protocol is for the most general case, when neither client nor server trusts the other. In this case we cannot jump directly from client to server, even anonymously; we also have to save the client's stack and return address. This must be done in the intermediary.
If execute-only page protection is providing the anonymity, we jump to the intermediary through an execute-only jump table. The protocol is outlined in Figure 1:

    push arguments and return address on client stack
    enter intermediary
    save return address
    save registers
    clear registers
    save address of client stack
    copy arguments to server stack
    switch to server stack
    leave intermediary to server procedure
    execute server procedure
    enter intermediary
    copy return data to client stack
    switch to client stack
    restore saved registers
    clear unsaved registers
    leave intermediary to client
    return

Figure 1: ARPC Protocol, Mutual Distrust Between Client and Server

For cross-domain procedure calls with few arguments, the dominant cost is saving, clearing, and restoring the registers. This is particularly true on modern processors with large register sets. For instance, on machines with register windows, the entire register set, not only the current window, must be saved, cleared, and restored on each call and return.

Marshaling of indirect parameters may not be necessary if the client organizes its memory properly. Since all addresses in the client are valid in the server, the client can keep data structures to be exported in a segment mapped separately from its private data, and pass pointers. Only if this is infeasible will explicit marshaling be needed. Direct parameters are marshaled on the client stack by the ordinary procedure-call protocol. Note that in the absence of marshaling and the presence of execute-only code, no client stub is required. The client's view of the remote call can simply be a function pointer whose target address is the jump instruction in the anonymity table, and a generator-supplied header file can define its dereference as the function call. The ordinary argument-pushing semantics of a function call are exactly what we want.

3.2.3. ARPC, Server Not Malicious

The second protocol applies to a much more common case, when the server is trusted but the client is not. This might be the case, for example, in a small-kernel operating system. The protocol is shown in Figure 2.

    save live registers
    push arguments and return address on stack
    call through intermediary to server
    push address of client stack
    switch to server stack
    execute body of server procedure
    pop address of client stack
    push return data onto client stack
    clear sensitive registers
    return through intermediary to client procedure
    restore live registers

Figure 2: ARPC Protocol, Server Not Malicious

Trusting the server not to be malicious has a considerable performance advantage. The client can anonymously jump (through the kernel or a jump table) directly to the server. In the simple case, the jump would be to a stub which would then marshal the client arguments onto the server stack. But we can achieve better performance by using an RPC generator to do a simple source code transformation of the server procedure so that it reads its arguments directly off the client stack; in this case, no copying is required. The RPC generator parses the server procedure, converts all argument references into client stack references, adds the instruction to save the client stack, and replaces ordinary returns with RPC returns.
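To make the transformation concrete, here is a C sketch of our own; the names are hypothetical, and since the real rewrite operates on stack references below the C level, the client's argument area is modeled as an explicit pointer.

    /* The programmer writes an ordinary procedure:
     *     int sum(int a, int b) { return a + b; }
     * Conceptually, the generator emits a version that reads its arguments
     * in place on the client stack rather than copying them.
     */
    struct sum_args { int a; int b; };               /* layout of the client's frame */

    int sum_arpc(const struct sum_args *client_frame)
    {
        /* every reference to 'a' or 'b' becomes a client-stack reference */
        return client_frame->a + client_frame->b;
    }

    int main(void)
    {
        struct sum_args frame = { 2, 3 };            /* as pushed by the client */
        return sum_arpc(&frame) == 5 ? 0 : 1;
    }

The const qualifier stands in for the rule, discussed below, that the server may read its arguments but may not take their addresses, and so cannot smash the client stack through them.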
This protocol is considerably more efficient than the mutual-distrust version. However, some additional work is required to make the protocol safe and secure. Safety could be compromised if a register containing a client pointer became the initial value of a server variable. The solution is to ensure that all automatic server variables are initialized before use. A program which relies on the value of an uninitialized variable is unlikely to be correct, and most modern compilers can at least warn of such errors.

Likewise, security could be compromised by server data passing through registers back to the client. This is a more serious problem. The server must clear all sensitive registers before returning; if compiler analysis cannot identify which registers are sensitive, all registers must be cleared.

Another safety hole is the client stack. Although we allow the server to access its arguments on the client stack (to eliminate copying), we do not want the server to be able to inadvertently `smash' the client stack. The solution is to prevent the server procedure from taking the address of any of its arguments; this can be enforced in the RPC generator.

3.2.4. ARPC, Neither Side Malicious

Finally, our third protocol covers the case in which both client and server are trusted to be non-malicious; it is shown in Figure 3.

    save live registers
    push arguments and return address on stack
    call directly to server procedure
    push address of client stack
    switch to server stack
    execute body of server procedure
    pop address of client stack
    push return data onto client stack
    return directly to client procedure
    restore live registers

Figure 3: ARPC Protocol, Neither Side Malicious

The only operation being performed here which is not part of a normal procedure call (assuming the caller-save register protocol) is the stack switch. And the protocol is safe; data may cross in registers, but a correctly-compiled program will never allow it to be addressed by a variable.

3.3. Service Management

Like LRPC [Bersh90], ARPC is implemented by running the server procedure in the client thread. Any alternative would involve a slow interaction with operating system data structures. The logical semantics of RPC, however, imply a sleeping service thread which awakes to run the procedure and returns to sleep when it finishes. Reconciling these models requires some juggling.

One problem arises when the server crashes. We do not want the client thread to die with the server, because it was not the client who caused the error. Instead, the RPC call should return with an error condition. Our solution is expensive, but not inappropriate given that such faults should be rare. We check all the server's incoming intermediaries for saved stacks, and restore the client threads from those; if the server has any outgoing calls, we zero the stack in the intermediary so that the call faults when it returns. This does not work for the mutual-trust protocol, which has no intermediary and saves the client stack address on the server stack. If separate crash recovery is required, a separate word must be reserved for the client stack. The separate-recovery model, in any case, is often not the preferred semantics for mutual-trust uses.

Synchronizing stack allocation is another problem. All of these routines assume one proviso: that only one thread of control, including past levels of recursion, is using the same RPC binding at any time. The assumption is convenient because it allows stacks to be statically allocated. If multiple threads must use the same RPC binding, however, they will have to synchronize externally, as they would for any single-consumer resource.
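To make the per-binding bookkeeping concrete, the following C sketch shows one plausible shape for this state; the names, layout, and use of C11 atomics are our own assumptions, and a real implementation would perform the actual stack switch in assembly.

    #include <stdatomic.h>
    #include <stdio.h>

    #define ARPC_STACK_BYTES 8192           /* one statically sized stack per binding */

    struct arpc_binding {
        unsigned char *stack;               /* server execution stack for this binding */
        void *saved_client_sp;              /* client stack pointer, saved across the call */
        atomic_flag in_use;                 /* guards the stack against concurrent entry */
    };

    /* An untrusted-client entry point would take this guard before switching
       stacks; a second thread arriving through the same binding is refused. */
    static int arpc_enter(struct arpc_binding *b)
    {
        return atomic_flag_test_and_set(&b->in_use) ? -1 : 0;
    }

    static void arpc_leave(struct arpc_binding *b)
    {
        atomic_flag_clear(&b->in_use);
    }

    int main(void)
    {
        static unsigned char stack_mem[ARPC_STACK_BYTES];
        struct arpc_binding b = { stack_mem, NULL, ATOMIC_FLAG_INIT };

        if (arpc_enter(&b) == 0) {
            printf("binding held; the real call would switch to b.stack here\n");
            arpc_leave(&b);
        }
        return 0;
    }

Because the stack is statically allocated per binding, nothing needs to be allocated on the call path; the guard flag matters only when several threads, or untrusted clients, might enter the same binding.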
Also, in systems which support threads, a simple lock is necessary at the beginning of untrusted-client entry points, to prevent malicious clients from sending multiple threads through the same binding and causing a stack collision.

4. Uses of Anonymous RPC

ARPC can perform many functions within a system. It can be installed at the user level, to provide additional safety within large applications that are externally protected by traditional protection domains; or it can be installed at the system level, as the main interprocess protection system; or it can be used only for some specialized tasks which require especially low latency. We consider each.

4.1. User-Level ARPC

Many large programs are written in languages, such as C, which do not guarantee the internal safety of memory. That is, code in one module may inadvertently corrupt the unrelated data of another, creating a bug which is hard to find in single-threaded code, and almost intractable in a multithreaded program. This is an unpleasant possibility, but software designers accept it because of the high performance penalty a safe implementation would incur. In conventional systems, safety can only be ensured by actively checking each pointer dereference, or by placing separate modules in separate hardware protection domains. Either solution incurs a substantial performance penalty and neither is widely used.

ARPC offers a convenient middle ground. When neither side is malicious, the cost of an ARPC call is little greater than the cost of an ordinary procedure call. In most cases, replacing procedure calls with ARPC calls will have little impact on performance. We can use this principle to increase internal safety in large software systems. Logically separate modules, and their heaps, can be placed in separate ARPC domains, providing strong probabilistic safety at minimal cost. Such a transformation can be installed transparently in a C compiler and linker, providing increased safety invisibly to the programmer. It can also be used in compilers for languages that do guarantee memory safety, allowing them to maintain their safe semantics while providing performance competitive with C. Protecting small objects can waste memory: an anonymous object smaller than a page must still take up an entire page of physical memory.

4.2. ARPC as an Operating System Base

With some care, ARPC can be used as the sole process protection system in a traditional UNIX-style operating system. This would accelerate communication between user-level processes; it might, for example, allow fast, fine-grained drawing interaction between imaging software and a display manager.

It can also be used, in a more radical design, to improve the internal structure of the operating system. ARPC allows considerable kernel decomposition, even beyond the usual microkernel level of [Young87]; all traditional system services -- filesystem, communication, scheduling, memory management, and device drivers -- can be performed at user level. This is possible because, in an ARPC system, all common operating system primitives can be executed without supervisor privilege. Context switches do not require manipulation of the virtual memory hardware, and process authentication can be performed by giving system servers separate ARPC entry points for each process. Even page tables and other hardware-accessed data structures can be mapped into the user address space and used directly by user-level servers. Memory-mapped devices can be treated the same way.
The only functions that must be performed at supervisor level are system initialization and execution of privileged instructions. A very small kernel can handle the latter, verifying that requests come from genuine system servers by checking the source address of the trap.

This model has some practical deficiencies. UNIX semantics require separate address spaces for sibling processes after a fork(); although the extra space can be discarded upon the first exec(), handling it may prove a considerable design nuisance. We also caution against using ARPC as the sole protection device in applications where local security is mission-critical. In an ARPC-based system, it seems too easy for small software errors in implementation -- the accidental release of capabilities -- to leave obscure security holes that could be exploited by dedicated intruders. This is true in any system that uses cryptographic capabilities -- e.g. Amoeba [Mull90] -- but especially so in ARPC, where every pointer is a capability.

4.3. Limited Uses of ARPC

To be useful, ARPC need not be the main protection device in a system. Its use can be restricted to special circumstances where speed is important, with traditional hardware domains used in all other places where protection is needed. One such use might be network communications. Some new networks [And92] offer latency on the order of microseconds, a useful feature which traditional software architectures have difficulty exporting to the user. An ARPC solution would be to map the network device anonymously and give its address to a user-level device manager, which would in turn accept ARPC calls. Network-critical processes would have to share an address space with the device manager and the device; all other tasks could have their own spaces.

5. Some Problems and Disadvantages of ARPC

One potential problem with an ARPC-based environment is that image initialization is slower; since programs must be remapped randomly every time they are instantiated, they must be relocated on execution. This is a concern, but the latency of initialization is already great, and relocation can be performed as part of copying the program into memory.

An anonymous address space may be very sparse and difficult to manage. If logical domains become as small as one page, a traditional multilevel page-table scheme will be prohibitively expensive. However, inverted page tables [IBM90] will work, as will a software-loaded TLB [Kane87] backed by a simple search scheme such as a binary tree or hash table (a sketch of the latter appears at the end of this section).

Tagging TLBs and virtual caches with process identifiers would decrease the cost of context switches and make conventional protection more competitive with ARPC. Tagged TLBs and caches eliminate the need to flush state on context switch, one of the most expensive components of a traditional domain switch. However, not many architectures support tagged TLBs; even if they do, the switch must still be performed in supervisor mode.

It is also uncertain whether the imbalance of virtual and physical memory will continue. One cannot predict whether or not the industry will move to a 128-bit bus before available physical memory approaches 64 bits. It is worth noting, however, that in the past bus size has increased faster than physical memory size. Also, ARPC precludes other uses for the 64-bit address space which may be superior. Large databases [Ston91] or distributed systems [Carter92] can use most of a 64-bit space.
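As an illustration of the kind of simple search scheme meant above, a software TLB miss handler could keep the sparse mappings in a chained hash table keyed by virtual page number. The sketch below is our own; the structure, hash constant, and sizes are assumptions rather than details of any particular system.

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT  13                        /* 8KB pages, as in Section 2 */
    #define PTE_BUCKETS 4096                      /* power of two */

    struct mapping {
        uint64_t vpn;                             /* virtual page number */
        uint64_t pfn;                             /* physical frame number */
        struct mapping *next;                     /* collision chain */
    };

    static struct mapping *buckets[PTE_BUCKETS];

    static unsigned bucket_of(uint64_t vpn)
    {
        /* multiplicative hash; any mixing function over 64 bits will do */
        return (unsigned)((vpn * 0x9E3779B97F4A7C15ULL) >> 52) & (PTE_BUCKETS - 1);
    }

    /* Called on a software TLB miss: find the mapping for vaddr, if any. */
    static struct mapping *lookup(uint64_t vaddr)
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        for (struct mapping *m = buckets[bucket_of(vpn)]; m != NULL; m = m->next)
            if (m->vpn == vpn)
                return m;                         /* refill the TLB from m->pfn */
        return NULL;                              /* unmapped: a genuine fault */
    }

    int main(void)
    {
        static struct mapping m = { 0x123456789ULL >> PAGE_SHIFT, 42, NULL };
        buckets[bucket_of(m.vpn)] = &m;
        return lookup(0x123456789ULL) != NULL ? 0 : 1;   /* 0 means "found" */
    }

A lookup that finds no entry is a genuine fault, which is where the delay penalty of Section 2 would be applied.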
6. Performance Results

We implemented a test prototype of anonymity on an Intel 486-33 machine, capable of about 15 SPECint. The base operating system was Linux 0.98.4, a copylefted POSIX clone for the 386 architecture [Torv92]. The 486 is a 32-bit machine, and as such a truly functional implementation was impossible; however, we did our best to assure that practical concerns were treated as realistically as possible. We did not convert the entire system to use a shared virtual address space; only specially-flagged relocatable executables were run in the same address space. Even on the 32-bit 486, it would have been feasible to run all processes in the same space, but Linux is distributed largely in binary form, and finding and rebuilding source for all our utilities to make them relocatable would have been a task not worth the benefit.

Under this system we chose to test the slowest possible version of ARPC: the protocol for two mutually untrusted processes, implemented without the use of execute-only code. Although the intermediary was accessed by traps, it ran in user mode. Our time for a user-to-user round-trip null ARPC in C was 7.7 microseconds. Of this, 3.4us were due to the user-level traps; 2.4us to segment peculiarities associated with the Linux implementation; 0.5us to save, clear, and restore registers; and 1.8us for overhead in the intermediary. That overhead was 29 instructions; of them, 15 were involved in loading static parameters, and could be eliminated were we to synthesize the intermediary for each bind [MP89]. This is somewhat more overhead than we had hoped for, but we consider it acceptable.

It is difficult to find comparable performance figures for other RPC systems, since we do not know of any other optimized local RPC results on the same architecture. Instead, Table 2 compares our performance to that of four other RPC implementations running on other hardware: Mach RPC [Bersh92], SRC RPC [SB90], LRPC [Bersh90], and URPC [Bersh91]. These are all optimized RPC implementations; commercial RPC implementations are frequently another order of magnitude slower.

+----------+-----------+-----------+------+--------+
| Protocol | Time (us) | Processor | MIPS | us/MIP |
+----------+-----------+-----------+------+--------+
| SRC RPC  |       454 | C-VAX     |  2.7 |   1226 |
| Mach RPC |        95 | R2000     |   10 |    950 |
| LRPC     |       157 | C-VAX     |  2.7 |    424 |
| URPC     |        93 | C-VAX     |  2.7 |    251 |
| ARPC     |       7.7 | i486-33   |   15 |    116 |
+----------+-----------+-----------+------+--------+
Table 2: Null RPC Performance Results

7. Related Work

The most closely related work to ours is Druschel and Peterson's _fbufs_, packet buffers mapped in a shared anonymous space [DP92]. User-level protocols communicate via fbufs to exchange packets at very high speed. Fbufs are a much more robust and conservative use of anonymity; the user-level protocol layers which use fbufs to communicate each have their own protection domain. Performance improvements are achieved by increased flexibility and efficiency in buffer management. Buffers do not need to be remapped in the middle of a transfer, nor must they be assigned to specific protocol pairs beforehand. Fbufs achieve data transfer rates near the theoretical limits of memory.

Other systems, like Pilot [Red80], use a single address space without separate hardware domains, and rely for protection on languages which restrict the use of pointers.
Pilot was designed for installations which do not need security, like some personal computers, and thus benefits from the ability to optimize its protection mechanisms only for safety.

Wahbe et al. [Wah93] propose a different approach to enforcing protection. Instead of relying on hardware, or on anonymity, they enforce protection in software by installing extra instructions into the object code to prevent out-of-bounds pointer dereferences. Unlike Pilot, Wahbe et al.'s approach is language-independent because the protection is enforced at runtime, not in the compiler. The benefit is that separate processes can run in the same address space and communicate efficiently, at the expense of a slight slowdown in memory operations.

The Opal system [Chase92] also uses a 64-bit shared address space, but each Opal task has its own protection domain; the goal is uniformity of naming. Using identical addresses to reference the same objects in all domains allows increased flexibility in sharing complex data structures. A context switch, however, still involves switching page tables. Other systems also follow this pattern [Carter92].

8. Conclusion

Anonymous RPC exploits a simple property of memory systems: that it is impossible to address data whose address is unknown. With the advent of 64-bit machines, it is now possible to take advantage of this property, by placing segments at random locations in a very large, sparse address space. This allows efficient interprocess communication without expensive hardware domains.

We believe that ARPC is a useful technique for high-speed communication between closely coupled domains. Even if it is not used to organize an entire operating system, it can be used to protect individual subsystems which use especially fine-grained communication. We suggest that software designers consider it as an option, and that hardware designers consider including features, such as execute-only page protection, which help its implementation.

9. Acknowledgements

We would like to thank Peter Druschel, Dave Keppel, and Ed Lazowska for their helpful comments on this paper.

10. Bibliography

[And91] Thomas Anderson, Henry Levy, Brian Bershad, and Edward Lazowska, ``The Interaction of Architecture and OS Design.'' _Fourth Conference on Architectural Support for Programming Languages and Operating Systems_, April 1991.

[And92] T. Anderson, S. Owicki, J. Saxe, and C. Thacker, ``High Speed Switch Scheduling for Local Area Networks.'' _Proc. Fifth International Conference on Architectural Support for Programming Languages and Operating Systems_, October 1992.

[Bersh90] Brian Bershad, Thomas Anderson, Edward Lazowska, and Henry Levy, ``Lightweight Remote Procedure Call.'' _ACM Transactions on Computer Systems 8_, 1 (February 1990).

[Bersh91] Brian Bershad, Thomas Anderson, Edward Lazowska, and Henry Levy, ``User-Level Interprocess Communication for Shared-Memory Multiprocessors.'' _ACM Transactions on Computer Systems 9_, 2 (May 1991).

[Bersh92] Brian Bershad, Richard Draves, and Alessandro Forin, ``Using Microbenchmarks to Evaluate System Performance.'' _Third Workshop on Workstation Operating Systems_, April 1992.

[Chase92] Jeff Chase, Henry Levy, Miche Baker-Harvey, and Ed Lazowska, ``Opal: A Single Address Space System for 64-bit Architectures.'' _Third Workshop on Workstation Operating Systems_, April 1992.
[Carter92] John B. Carter, Alan L. Cox, David B. Johnson, and Willy Zwaenepoel, ``Distributed Operating Systems Based on a Protected Global Address Space.'' _Third Workshop on Workstation Operating Systems_, April 1992.

[DP92] Peter Druschel and Larry L. Peterson, ``High Performance Cross-Domain Data Transfer.'' TR92-11, Department of Computer Science, University of Arizona.

[IBM90] IBM Corporation, Advanced Workstation Division, Austin, Texas. _POWER Processor Architecture_. IBM Corporation, 1990.

[Kane87] G. Kane. _MIPS R2000 RISC Architecture_. Prentice Hall, 1987.

[MP89] Henry Massalin and Calton Pu, ``Threads and I/O in the Synthesis Kernel.'' _Twelfth ACM Symposium on Operating Systems Principles_, 1989.

[SB90] Michael Schroeder and Michael Burrows, ``Performance of Firefly RPC.'' _Twelfth ACM Symposium on Operating Systems Principles_, 1989.

[Ston91] Michael Stonebraker and Jeff Dozier, ``Sequoia 2000: Large Capacity Object Servers to Support Global Change Research.'' _Sequoia 2000 Technical Report 91/1_, July 1991.

[Torv92] Linus Torvalds, ``Free Unix for the 386.'' finger torvalds@kruuna.helsinki.fi.

[Wulf74] W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack, ``HYDRA: The Kernel of a Multiprocessor Operating System.'' _Communications of the ACM_, Vol. 17, No. 6, June 1974.

[Young87] M. Young, A. Tevanian, R. Rashid, D. Golub, J. Eppinger, J. Chew, W. Bolosky, D. Black, and R. Baron, ``The Duality of Memory and Communication in the Implementation of a Multiprocessor Operating System.'' _Operating Systems Review_, Vol. 21, No. 5.

[Mull90] S. Mullender, G. van Rossum, A. Tanenbaum, R. van Renesse, and H. van Staveren, ``Amoeba: A Distributed Operating System for the 1990s.'' _IEEE Computer_, May 1990.

[DVH66] Jack Dennis and Earl Van Horn, ``Virtual Memory, Processes, and Sharing in Multics.'' _Communications of the ACM_, March 1966.

[NBS77] National Bureau of Standards, ``Data Encryption Standard.'' _Federal Information Processing Standards Publication 46_, Government Printing Office.

[Red80] D. Redell, Y. Dalal, T. Horsley, H. Lauer, W. Lynch, P. McJones, H. Murray, and S. Purcell, ``Pilot: An Operating System for a Personal Computer.'' _Communications of the ACM_, February 1980.

[Wah93] Robert Wahbe, Steven Lucco, Thomas Anderson, and Susan Graham, ``Low Latency RPC Via Software-Enforced Protection Domains.'' Submitted to _Sixteenth ACM Symposium on Operating Systems Principles_, 1993.

11. Author Information

Curtis Yarvin is a first-year graduate student in the Computer Science Division at the University of California at Berkeley. He graduated from Brown University in 1992. His e-mail address is ``curtis@cs.berkeley.edu''.

Richard Bukowski is a first-year graduate student in the Computer Science Division at the University of California at Berkeley. He graduated from Cornell University in 1992. His e-mail address is ``bukowski@cs.berkeley.edu''.

Thomas Anderson is an Assistant Professor in the Computer Science Division at the University of California at Berkeley. He received his A.B. in philosophy from Harvard University in 1983 and his M.S. and Ph.D. in computer science from the University of Washington in 1989 and 1991, respectively. He won an NSF Young Investigator Award in 1992, and he co-authored award papers at the 1989 SIGMETRICS Conference, the 1989 and 1991 Symposia on Operating Systems Principles, the 1992 ASPLOS Conference, and the 1993 Winter USENIX Conference.
His interests include operating systems, computer architecture, multiprocessors, high speed networks, massive storage systems, and computer science education. His e-mail address is ``tea@cs.berkeley.edu''.