The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association contact:

1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

An Implementation of the Shared Data Formats Standard for Distributed Shared Memories

Maya B. Gokhale
Ronald G. Minnich
Supercomputing Research Center
17100 Science Drive
Bowie, Md. 20715

*Abstract

Distributed Shared Memory (DSM) is a mechanism by which processes can share data over a network using the memory abstraction. The processors may be heterogeneous; one difference is that they may use incompatible data formats for such basic types as integers. Programmers need a programming mechanism for dealing with these differences. In this paper we describe a compiler which supports IEEE 1596.5, a machine-independent set of types specified so as to allow the use of shared data in heterogeneous DSMs. Programs that use these types for the DSM can share data regardless of the processor they run on; data can be shared in heterogeneous environments. The use of this compiler converts the run-time handling of eXternal Data Representation (XDR) to compile time, and introduces the opportunity for using optimizing compiler technology in handling run-time conversion of data types in a heterogeneous environment. It also gives the programmer a high degree of control over when translation occurs, in contrast to the XDR approach of always translating everything.
Finally, it promotes run-time sharing of data instead of the copy-in copy-out semantics of XDR.

*Introduction

Distributed Shared Memory (DSM) is a mechanism by which processes can share data over a network using the memory abstraction. Distributed shared memories have been implemented both in hardware and software. Hardware DSMs (or HDSMs) can provide much higher bandwidth and much lower latency than networks such as Ethernet, which in turn can magnify the cost of any software overhead attached to use of the DSM.

A problem which has not been effectively solved in DSMs, but which is growing in importance, is dealing with heterogeneity. That is, given a hardware DSM such as SysTran or IEEE 1596, how may heterogeneous machines share data? Heterogeneous in this sense refers to data format heterogeneity; for our purposes Sun-3 and Sun-4 machines are not heterogeneous, even though one uses an MC68020 architecture base and the other uses a SPARC architecture base. The most important aspects of the data format are common to the two architectures. Given the low latency and high performance inherent in these HDSMs, it is essential that any software mechanisms provided for managing data heterogeneity also have low overhead, so as to preserve the low latency and high bandwidth provided by the HDSM.

In this paper we describe a compiler which supports IEEE 1596.5, a machine-independent set of types specified so as to allow use of shared data in a DSM by a set of heterogeneous machines. Programs that use these types for the DSM can share data regardless of the processor they run on; data can be shared in heterogeneous environments. The types were first implemented via C++, but that path had poor performance due to inherent limitations in providing new scalar types via C++. The current implementation uses a compiler.
One effect of the use of this compiler is to eliminate much of the software executed as part of runtime eXternal Data Representation (XDR) encoding and decoding of data. XDR is currently used to manage heterogeneity in networks of machines. The typical approach is to encode the data in a canonical format at the transmitting end and to decode it at the receiving end. Once a message is encoded via XDR and sent, it must be decoded at the receiving end. All the data sent in a message is decoded, regardless of whether only a piece of the message is used or the whole message is used. The reason is that the process of using the message data in a program occurs separately from the process of decoding the data; the decode software cannot determine which pieces of the message will be used, and hence must process the entire message.

Another limitation of XDR is that the generality of the runtime encoding and decoding causes very slow execution. In our work we have measured times of several hundred microseconds to encode data that could be encoded in less than 10 microseconds. The longer time was measured using the encoding functions used to encode NFS requests and replies in the SunOS kernel; these same functions are also used by many other vendors in their NFS software. The shorter time is possible if compile-time encoding of the data is done. Thus, this work moves the processing of data representation, sometimes called presentation, from the domain of runtime work to the domain of compile-time work. Much of the work of the decoding and encoding software is done only once, at compile time, rather than every time data is processed.

*Related Work

The problem of accommodating heterogeneity in a DSM has been addressed a number of times over the last ten years.
The approaches have in general fallen into two basic strategies:

1. Tag every data item. As the data moves over the network, convert it as needed for the destination machine, either at the source or at the destination. Approaches used here have involved picking a machine-defined addressing unit, such as a page, and only allowing one type of data on that page; using XDR-style descriptors and requiring the user to explicitly request that data be moved; or storing the data in memory in a self-describing format, converting the data as it is moved.

2. Use object-oriented programming techniques. Tag each data item and, at run time, determine on a per-access basis the format of the item and whether it needs to be converted.

A system called Agora supported a heterogeneous data environment. The mechanism was to limit the set of scalar types to those supported by DEC's VAX computer, and to always convert the types when they were moved via the network to some other machine. This approach penalizes sharing of data and limits the data types which can be shared. In Mermaid, Li et al. demonstrated heterogeneous data support. The data are always converted and the set of types is limited. The Mermaid approach is to overlay XDR on the distributed shared memory. In the Ethernet environment the overhead of Mermaid is not a dominating factor. However, in a high-performance network the overhead would be a dominant cost, usually well over a millisecond per page. When HDSMs become available, we do not believe access to the peak performance of the HDSM will be possible with this type of software.

These approaches have a number of problems that have not been solved satisfactorily:

Space overhead. The cost of storing attributes per data item can be excessive for arrays of structures that consume, e.g., several tens of megabytes. One reason for blocking same-scalar-type items on a page is to avoid this sort of overhead, but such blocking runs counter to the type of structures used by C programmers. In C, data structures group together dissimilar components, which in turn are grouped into arrays or larger structures. Often these structures and arrays are dynamically allocated. Theoretically a compiler could translate structures into page-sized blocks of same scalar types and then relate individual scalar items on these pages back to individual structure components; getting such a system correct in practice is extremely difficult and does not address the issue of dynamically allocated structures.

Time overhead. In an HDSM such as the SysTran or SCI systems, the available network bandwidth is the bus bandwidth of the interface, which in the worst case (SysTran on an old Sun 3) is four megabytes per second, and in the best case is several hundred megabytes per second; latency is measured in microseconds. Converting every element of every structure of a shared array every time one word is accessed is prohibitively expensive. The cost of the run-time encode and decode common to these approaches appears manageable only on an Ethernet, which is a low-bandwidth, high-latency network. A several-hundred-microsecond cost of run-time encode can appear negligible compared to a several-millisecond cost of moving the message; this same overhead becomes the dominating factor when the message transmission time is measured in microseconds.

Programming model. Because of the cost of the encoding, these models discourage programmers from writing peer-to-peer parallel programs which communicate frequently, instead forcing them into the client-server model in which data is handed out infrequently. Many parallel programs do not fit such a client-server model.

In our approach we use the IEEE 1596.5 standard, which provides a basic set of data types from which to choose.
Because the types are explicit, the option is open for the programmer to copy the types in, i.e. the programmer may explicitly force the conversion of the data. Thus the programmer has control over how and when the conversion is done, if needed. Finally, peer-to-peer parallel programming becomes easier, since the programs can access the networks with much less software in the way, and hence can take advantage of low-latency HDSMs.

*IEEE 1596.5

IEEE 1596.5 is a standard which defines a set of data types for use in a heterogeneous computing environment. The standard considers four parameters when defining a type:

1. Size in eight-bit bytes.
2. Bit-ordering: the position of the most-significant bit.
3. Whether the type is floating-point, signed integer, or unsigned integer.
4. Whether the type is aligned on four-byte or one-byte boundaries.

The set of types is rich enough that there is an efficient encoding for almost any machine. Exceptions are machines such as the IBM 7090, the Digital Equipment Corporation DEC-10 series, and the Unisys A- and B-series, which do not have word sizes that are a power of two.

The name of a type is composed by describing its attributes. The ordering of attributes in a name is:

1. Aligned or Unaligned
2. Big or Little endian
3. Signed integer, Unsigned integer, Real number, or Multifield (Multifield is used for bit-field support; bit numbers are defined in a host-independent format)
4. Byte, Doublet, Quadlet, Octlet, or Hexlet (1, 2, 4, 8, or 16 bytes respectively)

Thus an aligned 32-bit big-endian unsigned integer is an AlignedBigUnsignedQuadlet. Because these names are rather long, a standard abbreviation is defined by using the first initial of each keyword. Hence, AlignedBigRealQuadlet and ABRQ are equivalent. We expect in practice that users would define their own desired names as types with these names, i.e. a user might do the following in C: typedef ABRQ sparcfloat.
The 1596.5 standard, because it specifies types known at compile time, opens the door for compile-time optimization in a way that is not possible for run-time decoding of data streams.

*Using 1596.5

The 1596.5 types make it easy for programmers to modify their programs to operate in a heterogeneous environment. To show how such a modification can be done, we provide an example. The example is a program which models quantum photon transport in a complex chamber, using Monte Carlo techniques. In this program there are structures which hold problem state. The program uses shared memory, and has been modified to use the MNFS software DSM for the shared-memory support. MNFS is a modified NFS developed at SRC which provides distributed shared memory. Thus the program already used a software DSM, and needed only two changes to run on many different types of machines. The changes are simple: formerly, we had a typedef double photonfloat in a structure; it becomes typedef ABRQ photonfloat. Also, in the structure, we change declarations of int to declarations of AlignedBigSignedQuadlet, or ABSQ for short. At this point, the program may be compiled on any machine which supports 1596.5; it will run and interoperate with any other machine, regardless of architecture. If the machines in question have MNFS, our software DSM, or use a hardware DSM such as the SysTran system, then the program is ready to run. With two simple textual changes we have converted a program for homogeneous machines to a program which will run on a set of heterogeneous machines.

To make this change with systems such as the Open Software Foundation's Distributed Computing Environment (DCE) or Sun's Remote Procedure Call (SunRPC) we would need to:

1. Learn a new language which is used to define the data types for the DCE or SunRPC system.
2. Using this language, define the structures to be transferred among the programs in some separate file. Note that every time we change the structures used by the program to share data with other programs, we also have to change this separate file.
3. Rewrite our program to use the client/server model of SunRPC or DCE; currently the program has peer-to-peer semantics, not client-server semantics. In the case of this program such a change is possible, but in many programs the client-server model is not a correct one for the structure of the program.
4. Find those places in the program where shared data is used, and modify them so that the sharing is done by calling the proper functions, not just referencing the data. This can be very difficult to do efficiently. Taking a set of array references which are compile-time expressions and turning them into a set of RPC interactions is an effort fraught with difficulty and has great potential for introducing bugs into a program.
5. Write the main driver program which will cause all the other RPC calls to occur.

*C++ Implementation

We initially implemented the IEEE 1596.5 types in C++. The C++ classes were compiled with several versions of G++, all version 2.0 and later. The goal was to build a complete set of 1596.5 scalar types as C++ classes, to the point that we could run the 1596.5 validation suite included as part of the standard. We encapsulated each type in a C++ object. To reduce the amount of effort involved, we developed a tool that, given a 1596.5 type name, could generate a C++ class for that type. The tool generated code which was tested on SPARC, IBM RS/6000, and Intel 386 processors. An example is shown in Figure . For the example we have generated code for an AlignedBigRealQuadlet, which is a word-aligned, big-endian, floating-point number using four bytes. We will also refer to this type as ABRQ. The type is representable as a float on the native machine (in this case a SPARC), so the private variable for the class is a float.
The alignment keyword allows us (in GNU C and C++) to specify how the variable is aligned; this may not be available in non-GNU C++ implementations. If alignment specification is not available, then a complete set of compliant 1596.5 types cannot be generated. There are four operators for +, *, -, and /. The native operator always returns the value of the variable absent any interpretation; this can be used to examine the variable directly, which is useful in debugging, and can also be used to optimize the evaluation process in the class methods when the machine representation and the 1596.5 representation are compatible. The operators with the name AlignedBigRealQuadlet are the constructors, which are invoked when the class is instantiated, as at function entry and exit or program initiation. The = operators are for assignment; finally, the float() operator returns the value of the variable when used in an expression.

The code for non-native variable classes is quite different, as shown in Figure . In this class, the variable must be converted from and to little-endian format prior to use. Conversion is performed by the littlefloat function. In all other respects this class is identical to the previous one. There is a subtlety in the four arithmetic operators: the cast will cause the float() operator to be invoked, causing the littlefloat function to be called.

The bytesize() function illustrates a problem with the C++ sizeof() operator. There is a basic confusion in C++ about whether sizeof() should take the size of the class, or the size of the private data in the class. In some cases these can be different. For example, if the private data area is a byte-sized variable, and we specify that the variable be word-aligned, sizeof() will return 4, not the expected 1.
In the case of the 1596.5 validation suite, the code used sizeof() on a type which was a class, and got errors because it depended on the semantics of sizeof yielding the exact number of bytes for the data type, which in our case was the size of the private data. Bytesize() removes this ambiguity, returning for all 1596.5 classes the size of the private data area.

Promotion rules for scalar types are another source of concern in C++. For example, consider the code fragment shown in Figure . Multiplication of the float b and the int 2 will result in conversion of the int to a float, and a resultant multiplication of two floats. On the following line, multiplication of the ABRQ c and the int 2 will result in one of two possible outcomes: if an int method is provided, the ABRQ will be converted to an int and then multiplied by 2, and the result will be erroneous if c had a fractional part; if no int method is provided, the C++ compiler we use (G++) will produce an error message. There is a work-around for the problem, as shown in the next line, but it requires the programmer to explicitly introduce casts, which in turn may require a large number of textual changes to the program.

These classes were made functional to the point of passing the 1596.5 validation suite. Performance was very disappointing, however. Our goal is that for a 1596.5 type which is native, there be no penalty for using that type. Our expectation was that for a 1596.5 type which was native, the C++ optimizer would simply remove all the scaffolding and be as efficient as the native type. For example, in the ABRQ case shown above, we expected that at the highest optimization level the code generated would simply store directly to and from the float. Such was not the case. In fact, use of the ABRQ on a SPARC instead of a native float type caused a near order-of-magnitude performance penalty.
We should note here that these experiences with C++ are based on our use of G++ version 2.0 and higher. It may be that a different C++ compiler would exhibit different behaviour on promotions and be more efficient. Different behaviour on promotions would be some cause for concern: it would mean that from compiler to compiler, our types might not pass the validation suite. We did some work with the AT&T C++ compiler version 3.0. It had two serious limitations, which ruled out its use:

1. It cannot handle SCI type names -- they are too long.
2. There is no mechanism for specifying alignment of types.

Another problem with the C++ approach relates to optimization. Many of the values used in a program are constants. The conversion of these constants can be done at compile time if the types are known to the compiler. In C++ no such optimization is easily available. Our conclusion was that C++, while it may have its uses, is not useful for providing new scalar types to a language when performance or correct promotion behaviour is an issue. We had to integrate the types into a compiler. We therefore decided to investigate building a compiler which supports 1596.5 types correctly and efficiently.

*C Compiler Implementation

We modified an ANSI C compiler to make the SCI types built-in and usable in arithmetic and assignment operations, to generate the necessary conversions between SCI and native types, and to unparse SCI-augmented C programs into standard C. An example of the use of SCI types and the operation of the SCI compiler is shown in Figure . The SCI C compiler recognizes the type AlignedBigRealQuadlet as built-in. It permits operations between variables of that type as well as with native types, and generates calls to conversion routines. The result is shown in Figure . The meaning of the data type ABRQ is defined in a header file sci.h, and is different for machines with different bit orderings. On the SPARC, this type is defined to be a float. On a little-endian machine it is a structure containing an array of bytes. Similarly, realABRQ is a macro whose definition is platform-dependent. On the SPARC, it is a no-op, since ABRQ is the native ``real'' type. On a little-endian machine, realABRQ is a function which actually performs conversion from the little-endian floating-point format.

The functionality of the compiler-generated code is the same as the C++ implementation's. However, on the native machine for a particular SCI type, the performance is the same as if the native type had been used, which is an order of magnitude better than the C++ version. With this basic SCI C compiler in place, we are planning to add optimizations which can improve performance on non-native types, such as operations on ABRQ on a little-endian machine. We would like to eliminate calls to conversion routines where they can be avoided. One example from the code fragment above is the == operator. If the two operands are of the same type, then a bitwise compare suffices. At the moment, the compiler converts each operand to the native type (via realABRQ) and then does the compare. We plan to make each SCI type a union of a primitive type and an array of bytes, and generate == and != compares on the primitive-type variant, as shown in Figure .

*Status and Performance

On machines which directly support a given 1596.5 type, using the 1596.5 type (e.g. ABRQ) is as efficient as using the native type (e.g. float). Thus we have met one of our goals. On machines on which the 1596.5 and native types are incompatible, the cost is higher. The two major differences from one machine to the next are support of alignment on non-word boundaries, which on RISC machines such as SPARC or MIPS can cause a substantial penalty, and byte ordering, which for different machines requires that all the bytes in the word be reversed. For unaligned types with the same byte order the cost is the cost of producing an aligned type. The issue of byte reordering is more difficult.
We timed a byte-reordering algorithm we developed on several machines to get some idea of the relative penalty. We used the highest optimization level of the compilers. GCC was used on the SPARC and 386 machines; the standard IBM compiler was used on the RS/6000. The program loaded a variable, reordered the bytes, and then stored it to a different variable. The variables were declared volatile to ensure that the store was done. We examined the assembly code to make sure it was doing what we expected. For testing purposes we ran 100 million iterations. We have determined for a SPARCstation 2 that the reordering for an unsigned long or float (i.e. 32-bit numbers) takes twice as long as a simple load or store (21 seconds for 100 million iterations vs. 43 seconds). The numbers also apply to a SPARCstation ELC (25 seconds vs. 52 seconds). On the RS/6000, it takes almost six times as long (12 seconds vs. 73 seconds); it could be that gcc on the RS/6000 would have much better results. For the 386 systems we used, it takes less than twice as long (118 seconds vs. 203 seconds). Given a reasonable compiler and architecture, the reordering should not be a significant penalty. For the type of applications we have run in a distributed environment, we don't feel it will be an issue.

*Future Work

We would like to implement a predicate called native() which, when given a 1596.5 type name, will return 1 or 0 according to whether that type has an efficient native representation on the native machine. This predicate should function even at the C preprocessor stage, so that ifdef directives work. Another useful built-in would be called bestfit. Bestfit would, given a 1596.5 type, return a native type that best approximates the 1596.5 type. For example, on a little-endian machine, the best-fit type for a big-endian float would be a little-endian float. Finally, a 1596.5 structure copy would be useful.
This would do pairwise copies from a structure containing non-native types to a structure containing bestfit types for each structure member. An example of such support is shown in Figure .

Although we have demonstrated the use of the 1596.5 types with DSMs, the C compiler we have described could also be used in RPC systems to replace XDR. For example, all the NFS structures could be specified using 1596.5 types, eliminating the need for the current layer of XDR software. This opens the way for much more efficient encoding and decoding of NFS requests. The cost of encoding and decoding software is becoming a concern in high-bandwidth networks. We plan to create a set of SCI-based structures defined by the NFS Version 3 protocol when that protocol is finally released.

*Summary

IEEE 1596.5 is a standard specifying a set of data types for use in Hardware Distributed Shared Memories. We have built a C compiler which supports these types as first-class objects. The types may be freely intermixed with the six standard C types. For types which are compatible with native machine types, performance is equivalent to what would be expected with the native machine type. For types which have no native representation, the cost of a load or store for a 1596.5 type ranges from 2 -- 6 times the cost of a native type, depending on the machine and the compiler. We are working on compile-time optimizations to further reduce the cost of using non-native types.

Using the IEEE 1596.5 compiler, users can make simple modifications to the declarations in their programs and create programs which will work in heterogeneous environments. The programmers do not need to learn another language or set of tools, as they must with standard systems such as DCE or SunRPC. The IEEE 1596.5 types, while intended for use in HDSMs, can also be used in networking software currently in use.
As part of an evaluation of this type of use, we are writing 1596.5-based definitions of the structures used for NFS. The use of these structures will eliminate all the NFS XDR code currently in place.

*Acknowledgements

Jason Kassoff implemented the program that translated 1596.5 type names to C++ structures. He did a very good job.