The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association contact:

1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

An Implementation of the Shared Data Formats Standard for Distributed Shared Memories

Maya B. Gokhale
Ronald G. Minnich
Supercomputing Research Center
17100 Science Drive
Bowie, Md. 20715

*Abstract

Distributed Shared Memory (DSM) is a mechanism by which processes can share data over a network using the memory abstraction. The processors may be heterogeneous; one difference is that they may use incompatible data formats for such basic types as integers. Programmers need a programming mechanism for dealing with these differences. In this paper we describe a compiler which supports IEEE 1596.5, a machine-independent set of types specified so as to allow the use of shared data in heterogeneous DSMs. Programs that use these types for the DSM can share data regardless of the processor they run on; data can be shared in heterogeneous environments. The use of this compiler converts the run-time handling of eXternal Data Representation (XDR) to compile time, and introduces the opportunity for using optimizing compiler technology in handling run-time conversion of data types in a heterogeneous environment. It also gives the programmer a high degree of control over when translation occurs, in contrast to the XDR approach of always translating everything.
Finally, it promotes run-time sharing of data instead of the copy-in copy-out semantics of XDR.

*Introduction

Distributed Shared Memory (DSM) is a mechanism by which processes can share data over a network using the memory abstraction. Distributed shared memories have been implemented both in hardware and software. Hardware DSMs (or HDSMs) can provide much higher bandwidth and much lower latency than networks such as Ethernet, which in turn can magnify the cost of any software overhead attached to use of the DSM.

A problem which has not been effectively solved in DSMs, but which is growing in importance, is dealing with heterogeneity. That is, given a hardware DSM such as SysTran or IEEE 1596, how may heterogeneous machines share data? Heterogeneous in this sense refers to data format heterogeneity; for our purposes Sun-3 and Sun-4 machines are not heterogeneous, even though one uses an MC68020 architecture base and the other uses a SPARC architecture base. The most important aspects of the data format are common to the two architectures. Given the low latency and high performance inherent in these HDSMs, it is essential that any software mechanisms provided for managing data heterogeneity also have low overhead, so as to preserve the low latency and high bandwidth provided by the HDSM.

In this paper we describe a compiler which supports IEEE 1596.5, a machine-independent set of types specified so as to allow use of shared data in a DSM by a set of heterogeneous machines. Programs that use these types for the DSM can share data regardless of the processor they run on; data can be shared in heterogeneous environments. The types were first implemented via C++, but that path had poor performance due to inherent limitations in providing new scalar types via C++. The current implementation uses a compiler.
One effect of the use of this compiler is to eliminate much of the software executed as part of runtime eXternal Data Representation (XDR) encoding and decoding of data. XDR is currently used to manage heterogeneity in networks of machines. The typical approach is to encode the data in a canonical format at the transmitting end and to decode it at the receiving end. Once a message is encoded via XDR and sent, it must be decoded at the receiving end. All the data sent in a message is decoded, regardless of whether only a piece of the message is used or the whole message is used. The reason is that the process of using the message data in a program occurs separately from the process of decoding the data; the decode software cannot determine which pieces of the message will be used, and hence must process the entire message.

Another limitation of XDR is that the generality of the runtime encoding and decoding causes very slow execution. In our work we have measured times of several hundred microseconds to encode data that could be encoded in less than 10 microseconds. The longer time was measured using the encoding functions used to encode NFS requests and replies in the SunOS kernel; these same functions are also used by many other vendors in their NFS software. The shorter time is possible if compile-time encoding of the data is done. Thus, this work moves the processing of data representation, sometimes called presentation, from the domain of runtime work to the domain of compile-time work. Much of the work of the decoding and encoding software is done only once, at compile time, rather than every time data is processed.

*Related Work

The problem of accommodating heterogeneity in a DSM has been addressed a number of times over the last ten years.
The approaches have in general fallen into two basic strategies:

1. Tag every data item. As the data moves over the network, convert it as needed for the destination machine, either at the source or at the destination. Approaches used here have involved picking a machine-defined addressing unit, such as a page, and only allowing one type of data on that page; using XDR-style descriptors and requiring the user to explicitly request that data be moved; or storing the data in memory in a self-describing format, converting the data as it is moved.

2. Use object-oriented programming techniques. Tag each data item and, at run time, determine on a per-access basis the format of the item and whether it needs to be converted.

A system called Agora supported a heterogeneous data environment. The mechanism was to limit the set of scalar types to those supported by DEC's VAX computer, and to always convert the types when they were moved via the network to some other machine. This approach penalizes sharing of data and limits the data types which can be shared. In Mermaid, Li et al. demonstrated heterogeneous data support. The data are always converted and the set of types is limited. The Mermaid approach is to overlay XDR on the distributed shared memory. In the Ethernet environment the overhead of Mermaid is not a dominating factor. However, in a high-performance network the overhead would be a dominant cost, usually well over a millisecond per page. When HDSMs become available, we do not believe access to the peak performance of the HDSM will be possible with this type of software.

These approaches have a number of problems that have not been solved satisfactorily:

Space overhead. The cost of storing attributes per data item can be excessive for arrays of structures that consume, e.g., several tens of megabytes. One reason for blocking same-scalar-type items on a page is to avoid this sort of overhead, but such blocking runs counter to the type of structures used by C programmers. In C, data structures group together dissimilar components, which in turn are grouped into arrays or larger structures. Often these structures and arrays are dynamically allocated. Theoretically a compiler could translate structures into page-sized blocks of same scalar types and then relate individual scalar items on these pages back to individual structure components; getting such a system correct in practice is extremely difficult and does not address the issue of dynamically allocated structures.

Time overhead. In an HDSM such as the SysTran or SCI systems, the available network bandwidth is the bus bandwidth of the interface, which in the worst case (SysTran on an old Sun 3) is four megabytes per second, and in the best case is several hundred megabytes per second; latency is measured in microseconds. Converting every element of every structure of a shared array every time one word is accessed is prohibitively expensive. The cost of the run-time encode and decode common to these approaches appears manageable only on an Ethernet, which is a low-bandwidth, high-latency network. A several-hundred-microsecond cost of run-time encode can appear negligible compared to a several-millisecond cost of moving the message; this same overhead becomes the dominating factor when the message transmission time is measured in microseconds.

Programming model. Because of the cost of the encoding, these models discourage programmers from writing peer-to-peer parallel programs which communicate frequently, instead forcing them into the client-server model in which data is handed out infrequently. Many parallel programs do not fit such a client-server model.

In our approach we use the IEEE 1596.5 standard, which provides a basic set of data types from which to choose.
Because the types are explicit, the option is open for the programmer to copy the types in, i.e. the programmer may explicitly force the conversion of the data. Thus the programmer has control over how and when the conversion is done, if needed. Finally, peer-to-peer parallel programming becomes easier, since the programs can access the networks with much less software in the way, and hence can take advantage of low-latency HDSMs.

*IEEE 1596.5

IEEE 1596.5 is a standard which defines a set of data types for use in a heterogeneous computing environment. The standard considers four parameters when defining a type:

1. Size in eight-bit bytes.
2. Bit-ordering: the position of the most-significant bit.
3. Whether the type is floating-point, signed integer, or unsigned integer.
4. Whether the type is aligned on four-byte or one-byte boundaries.

The set of types is rich enough that there is an efficient encoding for almost any machine. Exceptions are machines such as the IBM 7090, the Digital Equipment Corporation DEC-10 series, and the Unisys A- and B-series, which do not have word sizes that are a power of two.

The name of a type is composed by describing its attributes. The ordering of attributes in a name is:

1. Aligned or Unaligned
2. Big or Little endian
3. Signed integer, Unsigned integer, Real number, or Multifield (Multifield is used for bit-field support; bit numbers are defined in a host-independent format)
4. Byte, Doublet, Quadlet, Octlet, or Hexlet (1, 2, 4, 8, or 16 bytes respectively)

Thus an aligned 32-bit big-endian unsigned integer is an AlignedBigUnsignedQuadlet. Because these names are rather long, a standard abbreviation is defined by using the first initial of each keyword. Hence, AlignedBigRealQuadlet and ABRQ are equivalent. We expect in practice that users would define their own desired names as types with these names, i.e. a user might do the following in C: typedef ABRQ sparcfloat.
The 1596.5 standard, because it specifies types known at compile time, opens the door for compile-time optimization in a way that is not possible for run-time decoding of data streams.

*Using 1596.5

The 1596.5 types make it easy for programmers to modify their programs to operate in a heterogeneous environment. To show how such a modification can be done, we provide an example. The example is a program which models quantum photon transport in a complex chamber, using Monte Carlo techniques. In this program there are structures which hold problem state. The program uses shared memory, and has been modified to use the MNFS software DSM for the shared-memory support. MNFS is a modified NFS developed at SRC which provides distributed shared memory. Thus the program already used a software DSM, and needed only two changes to run on many different types of machines. The changes are simple: formerly, we had a typedef double photonfloat in a structure; it becomes typedef ABRQ photonfloat. Also, in the structure, we change declarations of int to declarations of AlignedBigSignedQuadlet, or ABSQ for short. At this point, the program may be compiled on any machine which supports 1596.5; it will run and interoperate with any other machine, regardless of architecture. If the machines in question have MNFS, our software DSM, or use a hardware DSM such as the SysTran system, then the program is ready to run. With two simple textual changes we have converted a program for homogeneous machines to a program which will run on a set of heterogeneous machines.

To make this change with systems such as the Open Software Foundation's Distributed Computing Environment (DCE) or Sun's Remote Procedure Call (SunRPC) we would need to:

1. Learn a new language which is used to define the data types for the DCE or SunRPC system.
2. Using this language, define the structures to be transferred among the programs in some separate file. Note that every time we change the structures used by the program to share data with other programs, we also have to change this separate file.
3. Rewrite our program to use the client/server model of SunRPC or DCE; currently the program has peer-to-peer semantics, not client-server semantics. In the case of this program such a change is possible, but in many programs the client-server model is not a correct one for the structure of the program.
4. Find those places in the program where shared data is used, and modify them so that the sharing is done by calling the proper functions, not just referencing the data. This can be very difficult to do efficiently. Taking a set of array references which are compile-time expressions and turning them into a set of RPC interactions is an effort fraught with difficulty and has great potential for introducing bugs into a program.
5. Write the main driver program which will cause all the other RPC calls to occur.

*C++ Implementation

We initially implemented the IEEE 1596.5 types in C++. The C++ classes were compiled with several versions of G++, all version 2.0 and later. The goal was to build a complete set of 1596.5 scalar types as C++ classes, to the point that we could run the 1596.5 validation suite included as part of the standard. We encapsulated each type in a C++ object. To reduce the amount of effort involved, we developed a tool that, given a 1596.5 type name, could generate a C++ class for that type. The tool generated code which was tested on SPARC, IBM RS/6000, and Intel 386 processors. An example is shown in Figure . For the example we have generated code for an AlignedBigRealQuadlet, which is a word-aligned, big-endian, floating-point number using four bytes. We will also refer to this type as ABRQ. The type is representable as a float on the native machine (in this case a SPARC), so the private variable for the class is a float.
The alignment keyword allows us (in GNU C and C++) to specify how the variable is aligned; this may not be available in non-GNU C++ implementations. If alignment specification is not available, then a complete set of compliant 1596.5 types cannot be generated. There are four operators for +, *, -, and /. The native operator always returns the value of the variable absent any interpretation; this can be used to examine the variable directly, which is useful in debugging, and can also be used to optimize the evaluation process in the class methods when the machine representation and the 1596.5 representation are compatible. The operators with the name AlignedBigRealQuadlet are the constructors, which are invoked when the class is instantiated, as at function entry and exit or program initiation. The = operators are for assignment; finally, the float() operator returns the value of the variable when used in an expression.

The code for non-native variable classes is quite different, as shown in Figure . In this class, the variable must be converted from and to little-endian format prior to use. Conversion is performed by the littlefloat function. In all other respects this class is identical to the previous one. There is a subtlety in the four arithmetic operators: the cast will cause the float() operator to be invoked, causing the littlefloat function to be called.

The bytesize() function illustrates a problem with the C++ sizeof() operator. There is a basic confusion in C++ about whether sizeof() should take the size of the class, or the size of the private data in the class. In some cases these can be different. For example, if the private data area is a byte-sized variable, and we specify that the variable be word-aligned, sizeof() will return 4, not the expected 1.
In the case of the 1596.5 validation suite, the code used sizeof() on a type which was a class, and got errors because it depended on the semantics of sizeof yielding the exact number of bytes for the data type, which in our case was the size of the private data. Bytesize() removes this ambiguity, returning for all 1596.5 classes the size of the private data area.

Promotion rules for scalar types are another source of concern in C++. For example, consider the code fragment shown in Figure . Multiplication of the float b and the int 2 will result in conversion of the int to a float, and a resultant multiplication of two floats. On the following line, multiplication of the ABRQ c and the int 2 will result in one of two possible outcomes: if an int method is provided, the ABRQ will be converted to an int and then multiplied by 2, and the result will be erroneous if c had a fractional part; if no int method is provided, the C++ compiler we use (G++) will produce an error message. There is a work-around for the problem, as shown in the next line, but it requires the programmer to explicitly introduce casts, which in turn may require a large number of textual changes to the program.

These classes were made functional to the point of passing the 1596.5 validation suite. Performance was very disappointing, however. Our goal is that for a 1596.5 type which is native, there be no penalty for using that type. Our expectation was that for a 1596.5 type which was native, the C++ optimizer would simply remove all the scaffolding and be as efficient as the native type. For example, in the ABRQ case shown above, we expected that at the highest optimization level the code generated would simply store directly to and from the float. Such was not the case. In fact, use of the ABRQ on a SPARC instead of a native float type caused a near order-of-magnitude performance penalty.
We should note here that these experiences with C++ are based on our use of G++ version 2.0 and higher. It may be that a different C++ compiler would exhibit different behaviour on promotions and be more efficient. Different behaviour on promotions would be some cause for concern: it would mean that from compiler to compiler, our types might not pass the validation suite. We did some work with the AT&T C++ compiler version 3.0. It had two serious limitations, which ruled out its use:

1. It cannot handle SCI type names -- they are too long.
2. There is no mechanism for specifying alignment of types.

Another problem with the C++ approach relates to optimization. Many of the values used in a program are constants. The conversion of these constants can be done at compile time if the types are known to the compiler. In C++ no such optimization is easily available. Our conclusion was that C++, while it may have its uses, is not useful for providing new scalar types to a language when performance or correct promotion behaviour is an issue. We had to integrate the types into a compiler. We therefore decided to investigate building a compiler which supports 1596.5 types correctly and efficiently.

*C Compiler Implementation

We modified an ANSI C compiler to make the SCI types built-in and usable in arithmetic and assignment operations, to generate the necessary conversions between SCI and native types, and to unparse SCI-augmented C programs into standard C. An example of the use of SCI types and the operation of the SCI compiler is shown in Figure . The SCI C compiler recognizes the type AlignedBigRealQuadlet as built-in. It permits operations between variables of that type as well as with native types, and generates calls to conversion routines. The result is shown in Figure . The meaning of the data type ABRQ is defined in a header file sci.h, and is different for machines with different bit orderings. On the SPARC, this type is defined to be a float. On a little-endian machine it is a structure containing an array of bytes. Similarly, realABRQ is a macro whose definition is platform-dependent. On the SPARC, it is a no-op, since ABRQ is the native ``real'' type. On a little-endian machine, realABRQ is a function which actually performs conversion from the little-endian floating-point format.

The functionality of the compiler-generated code is the same as the C++ implementation's. However, on the native machine for a particular SCI type, the performance is the same as if the native type had been used, which is an order of magnitude better than the C++ version. With this basic SCI C compiler in place, we are planning to add optimizations which can improve performance on non-native types, such as operations on ABRQ on a little-endian machine. We would like to eliminate calls to conversion routines where they can be avoided. One example from the code fragment above is the == operator. If the two operands are of the same type, then a bitwise compare suffices. At the moment, the compiler converts each operand to the native type (via realABRQ) and then does the compare. We plan to make each SCI type a union of a primitive type and an array of bytes, and generate == and != compares on the primitive-type variant, as shown in Figure .

*Status and Performance

On machines which directly support a given 1596.5 type, using the 1596.5 type (e.g. ABRQ) is as efficient as using the native type (e.g. float). Thus we have met one of our goals. On machines on which the 1596.5 and native types are incompatible, the cost is higher. The two major differences from one machine to the next are support of alignment on non-word boundaries, which on RISC machines such as SPARC or MIPS can cause a substantial penalty, and byte ordering, which for different machines requires that all the bytes in the word be reversed. For unaligned types with the same byte order the cost is the cost of producing an aligned type. The issue of byte reordering is more difficult.
We timed a byte-reordering algorithm we developed on several machines to get some idea of the relative penalty. We used the highest optimization level of the compilers. GCC was used on the SPARC and 386 machines; the standard IBM compiler was used on the RS/6000. The program loaded a variable, reordered the bytes, and then stored it to a different variable. The variables were declared volatile to ensure that the store was done. We examined the assembly code to make sure it was doing what we expected. For testing purposes we ran 100 million iterations. We have determined for a SPARCstation 2 that the reordering for an unsigned long or float (i.e. 32-bit numbers) takes twice as long as a simple load or store (21 seconds for 100 million iterations vs. 43 seconds). The numbers also apply to a SPARCstation ELC (25 seconds vs. 52 seconds). On the RS/6000, it takes almost six times as long (12 seconds vs. 73 seconds); it could be that gcc on the RS/6000 would have much better results. For the 386 systems we used, it takes less than twice as long (118 seconds vs. 203 seconds). Given a reasonable compiler and architecture, the reordering should not be a significant penalty. For the type of applications we have run in a distributed environment, we don't feel it will be an issue.

*Future Work

We would like to implement a predicate called native() which, when given a 1596.5 type name, will return 1 or 0 according to whether that type has an efficient native representation on the native machine. This predicate should function even at the C preprocessor stage, so that ifdef directives work. Another useful built-in would be called bestfit. Bestfit would, given a 1596.5 type, return a native type that best approximates the 1596.5 type. For example, on a little-endian machine, the best-fit type for a big-endian float would be a little-endian float. Finally, a 1596.5 structure copy would be useful.
This would do pairwise copies from a structure containing non-native types to a structure containing bestfit types for each structure member. An example of such support is shown in Figure .

Although we have demonstrated the use of the 1596.5 types with DSMs, the C compiler we have described could also be used in RPC systems to replace XDR. For example, all the NFS structures could be specified using 1596.5 types, eliminating the need for the current layer of XDR software. This opens the way for much more efficient encoding and decoding of NFS requests. The cost of encoding and decoding software is becoming a concern in high-bandwidth networks. We plan to create a set of SCI-based structures defined by the NFS Version 3 protocol when that protocol is finally released.

*Summary

IEEE 1596.5 is a standard specifying a set of data types for use in Hardware Distributed Shared Memories. We have built a C compiler which supports these types as first-class objects. The types may be freely intermixed with the six standard C types. For types which are compatible with native machine types, performance is equivalent to what would be expected with the native machine type. For types which have no native representation, the cost of a load or store for a 1596.5 type ranges from 2 -- 6 times the cost of a native type, depending on the machine and the compiler. We are working on compile-time optimizations to further reduce the cost of using non-native types.

Using the IEEE 1596.5 compiler, users can make simple modifications to the declarations in their programs and create programs which will work in heterogeneous environments. The programmers do not need to learn another language or set of tools, as they must with standard systems such as DCE or SunRPC. The IEEE 1596.5 types, while intended for use in HDSMs, can also be used in networking software currently in use.
As part of an evaluation of this type of use, we are writing 1596.5-based definitions of the structures used for NFS. The use of these structures will eliminate all the NFS XDR code currently in place.

*Acknowledgements

Jason Kassoff implemented the program that translated 1596.5 type names to C++ structures. He did a very good job.