USENIX ;login: - Single UNIX Specification, Version 2

Data Size Neutrality and 64-bit Support

Andrew Josey <a.josey@opengroup.org> of The Open Group reports on changes in the Single UNIX Specification, Version 2.

The Single UNIX Specification, Version 2, provides enhanced support for 64-bit programming models by being n-bit clean and data size neutral. This article is a brief introduction to 64-bit programming models, data size neutrality, and application porting issues.

Introduction

When the UNIX operating system was first created in 1969 it was developed to run on a 16-bit computer architecture. The C language of the time supported 16-bit integer and pointer data types and also supported a 32-bit integer data type that could be emulated on hardware that did not support 32-bit arithmetic operations.

When 32-bit computer architectures, which supported 32-bit integer arithmetic operators and 32-bit pointers, were introduced in the late 1970s, the UNIX operating system was quickly ported to this new class of hardware platforms. The C language data model developed to support these 32-bit architectures quickly evolved to consist of a 16-bit short-integer type, a 32-bit integer type, and a 32-bit pointer. During the 1980s, this was the predominant C data model available for 32-bit UNIX platforms.

To describe these two data models in modern terms, the 16-bit UNIX platforms used an IP16 data model, while 32-bit UNIX platforms use the ILP32 programming model. The notation describes the width assigned to the basic data types; for example, ILP32 denotes that int (I), long (L), and pointer (P) types are all 32-bit entities. This notation is used extensively throughout this article.

The first UNIX standardization effort was begun in 1983 by a /usr/group committee. This work was merged into the work program of the IEEE POSIX committees in 1985. By 1988, both POSIX and X/Open committees had developed detailed standards and specifications that were based upon the predominant UNIX implementations of the time. These committees endeavored to develop architecture-neutral definitions that could be implemented on any hardware architecture.

The transition from 16-bit to 32-bit processor architectures happened quite rapidly just before the UNIX standardization work was begun. Since the specifications were based on existing practice and the predominant data model did not change during this gestation period, some dependencies upon the ILP32 data model were inadvertently incorporated into the final specifications.

Most of today's 32-bit UNIX platforms use the ILP32 data model. However, another data model, the LP32 model, is also very popular on other operating systems. The majority of C-language programs written for Microsoft Windows 3.1 are written for the Win-16 API, which uses the LP32 data model. The Apple Macintosh also uses the LP32 data model.

32-bit platforms have a number of limitations that are increasingly a source of frustration to developers of large applications, such as databases, who wish to take advantage of advances in computer hardware. There is much discussion today in the computer industry about the barrier presented by 32-bit addresses. 32-bit pointers can address only 4GB of virtual address space. There are ways of overcoming this limitation, but application development is more complicated and performance is significantly reduced. Until recently the size of a data file could not exceed 4GB. However, the 4GB file size limitation was overcome by the Large File Summit extensions which are included in XSH, Issue 5.

Disk storage has been improving in areal density at the rate of 70% compounded annually, and drives of 4GB and larger are readily available. Memory prices have not dropped as sharply, but 16MB chips are readily available, with 64MB chips in active development. CPU processing power continues to increase by about 50% every 18 months, providing the power to process ever larger quantities of data. This conjunction of technological forces, along with the continued demand for systems capable of supporting ever-larger databases, simulation models, and full-motion video, has generated requirements for support of larger addressing structures.

A number of 64-bit processors are now available, and the transition from 32-bit to 64-bit architectures is rapidly occurring among all the major hardware vendors. 64-bit UNIX platforms do not suffer from the file size or flat address space limitations of 32-bit platforms. Applications can access files that occupy terabytes of disk space because 64-bit file offsets are possible. Similarly, applications can now theoretically access terabytes of memory because pointers can be 64 bits. More physical memory results in faster operations. The performance of memory-mapped file access, caching, and swapping, is greatly improved. 64-bit virtual addresses simplify the design of large applications. All the major database vendors now support 64-bit platforms because of dramatically improved performance for very large database applications available on very large memory (VLM) systems.

The world is currently dominated by 32-bit computers, a situation that is likely to continue for the near future. These computers run 16- or 32-bit applications or some mixture of the two. Meanwhile, 64-bit computers will run 32-bit code, 64-bit code, or mixtures of the two (and perhaps even some 16-bit code). New 64-bit applications and operating systems must integrate smoothly into this environment. Key issues facing the computing industry are the interchange of data between 64- and 32-bit systems (in some cases on the same system) and the cost of maintaining software in both environments. Such interchange is especially needed for large application suites such as database systems, where one may want to distribute most of the applications as 32-bit binaries that run across a large installed base, but be able to choose 64-bits for a few crucial applications.

64-bit Data Models

Prior to the introduction of 64-bit platforms, it was generally believed that the introduction of 64-bit UNIX operating systems would naturally use the ILP64 data model. However, this view was too simplistic and overlooked optimizations that could be obtained by choosing a different data model.

Unfortunately, the C programming language does not provide a mechanism for adding new fundamental data types. Thus, providing 64-bit addressing and integer arithmetic capabilities involves changing the bindings or mappings of the existing data types or adding new data types to the language.

ISO/IEC 9899:1990, Programming Languages - C (ISO C) left the definition of the short int, the int, the long int, and the pointer deliberately vague to avoid artificially constraining hardware architectures that might benefit from defining these data types independently from one another. The only constraints were that ints must be no smaller than shorts, and longs must be no smaller than ints, and size_t must represent the largest unsigned type supported by an implementation. It is possible, for instance, to define a short as 16 bits, an int as 32 bits, a long as 64 bits and a pointer as 128 bits. The relationship between the fundamental data types can be expressed as:

sizeof(char) <= sizeof(short) <= sizeof(int)
<= sizeof(long) = sizeof(size_t)

Ignoring non-standard types, all three of the following 64-bit pointer data models satisfy the above relationship:

LP64 (also known as 4/8/8)
ILP64 (also known as 8/8/8)
LLP64 (also known as 4/4/8).

The differences between the three models lie in the non-pointer data types. The table below details the data types for the above three data models and includes LP32 and ILP32 for comparison purposes.

Widths are in bits; a dash means the type is not part of that model.

Data Type            LP32   ILP32   ILP64   LLP64   LP64
char                    8       8       8       8      8
short                  16      16      16      16     16
int32                   -       -      32       -      -
int                    16      32      64      32     32
long                   32      32      64      32     64
long long (int64)       -       -       -      64      -
pointer                32      32      64      64     64

When the width of one or more of the C data types is changed, applications may be affected in various ways. These effects fall into two main categories:

Data objects, such as a structure, defined with one of the 64-bit data types will be different in size from those declared in an identical way on a 16 or 32-bit system.
Common assumptions about the relationships between the fundamental data types may no longer be valid in a 64-bit data model. Applications which depend on those relationships often cease to work properly when compiled on a 64-bit platform. A typical assumption made by many application developers is that:
sizeof(int) = sizeof(long) = sizeof(pointer)
This relationship is not codified in any C programming standard, but it is valid for the ILP32 data model. However, it is not valid for two of the three 64-bit data models described above, nor is it valid for the LP32 data model.

The ILP64 data model attempts to maintain the relationship between int, long, and pointer which exists in the ILP32 model by making all three types the same size. Assignment of a pointer to an int or a long will not result in data loss.

The downside of this model is that it depends on the addition of a new 32-bit data type such as int32 to handle true 32-bit quantities. There is thus a potential for conflict with existing typedefs in applications. An application which was developed on an ILP32 platform, and subsequently ported to an ILP64 platform, may be forced to make frequent use of the int32 data type to preserve the size and alignment of data because of interoperability requirements or binary compatibility with existing data files.

The LLP64 data model preserves the relationship between int and long by leaving both as 32-bit types. Data objects, such as structures, which do not contain pointers will be the same size as on a 32-bit system. This model is sometimes described as a 32-bit model with 64-bit addresses. Most of the run-time problems associated with the assumptions between the sizes of the data types are related to the assumption that a pointer will fit in an int. To solve this class of problems, int or long variables which should be 64 bits in length are changed to long long (or int64), a non-standard data type. This data model is thus again dependent on the introduction of a new data type. Again there is potential for conflict with existing typedefs in applications.

The LP64 data model takes the middle road. 8, 16, and 32-bit scalar types (char, short, and int) are provided to support objects that must maintain size and alignment with 32-bit systems. A 64-bit type, long, is provided to support the full arithmetic capabilities, and is available to use in conjunction with pointer arithmetic. Applications that assign addresses to scalar objects need to specify the object as long instead of int.

In the LP64 data model, data types are natural. Each scalar type is larger than the preceding type. No new data types are required. As a language design issue, the purpose of having long in the language is to anticipate cases where there is an integral type longer than int. Having int and long represent data types of different widths is a natural and common-sense approach, and is the standard in the PC world, where int is 16 bits and long is 32 bits.

A major test for any C data model is its ability to support the very large existing UNIX applications code base. The investment in code, experience, and data surrounding these applications is the largest determiner of the rate at which new technology is adopted and spread. In addition, it must be easy for an application developer to build code that can be used in both existing and new environments.

The UNIX development community is driven technically by a set of API agreements embodied in standards and specifications documents from groups such as X/Open, IEEE, ANSI, and ISO. These documents were developed over many years to codify existing practice and define agreement on new capabilities. As a result these specifications are of major value to the system developers, application developers, and end-users. There are numerous test suites that verify that implementations correctly embody the details of a specification and certify that fact to interested parties. Any 64-bit data model cannot invalidate large portions of these specifications and expect to achieve wide adoption.

A number of vendors have extensive experience with the LP64 data model. By far, the largest body of existing 32-bit code already modified for 64-bit environments runs on LP64 platforms. Experience has shown that it is relatively easy to modify existing code so that it can be compiled on either a 32-bit or 64-bit platform. Interoperability with existing ILP32 platforms is well proven and is not an issue. At least one LP64-based operating system (Digital UNIX V4.0) has met and passed the majority of existing verification suites and has obtained the UNIX 95 brand.

A small number of ILP64-based platforms have also shipped. These have demonstrated that it is feasible to complete the implementation of an ILP64 environment. However, as of early 1997, no LLP64 or ILP64-based systems had achieved the same level of standards conformance or met the requirements of the UNIX 95 brand.

Although the number of applications written in C requiring a large virtual address space is growing rapidly, there has not been a requirement to date for a 64-bit int data type. The majority of existing 64-bit applications previously ran only on 32-bit platforms, and had no expectation of a greater range for the int data type. The extra 32 bits of data space in a 64-bit int would appear to be wasted. Any future applications that require a larger scalar data type can use the long type.

Nearly all applications moving from a 32-bit platform require some minor modifications to handle 64-bit pointers, especially where erroneous assumptions about the relative size of int and pointer data types were made. Common assumptions about the relative sizes of int, char, short, and float data types generally do not cause problems on LP64 platforms (since the sizes of those data types are identical to those on an ILP32 platform), but do so on an ILP64 platform.

Other language implementations will continue to support a 32-bit int type. For example, the FORTRAN-77 standard requires that the type INTEGER be the same size as REAL, which is half the size of DOUBLE PRECISION. This, together with customer expectations, means that FORTRAN-77 implementations will generally leave INTEGER as a 32-bit type, even on 64-bit platforms. A significant number of applications use C and FORTRAN together, either calling each other or sharing data files. Such applications have been among the first to move to 64-bit environments. Experience has shown that it is usually easier to modify the data sizes and types on the C side than the FORTRAN side of such applications. These applications will continue to require a 32-bit integer data type in C, regardless of the size chosen for int.

In 1995, a number of major UNIX vendors agreed to standardize on the LP64 data model for a number of reasons:

Experience suggests that neither the LP64 nor the ILP64 data model provides a painless porting path from a 32-bit platform, but that, all other things being equal, the smaller data types in the LP64 data model enable better application performance.
A crucial investment for end-users is the existing data built up over decades in their computer systems. Any proposed solution must make it easy to utilize such data on a continuing basis. Unfortunately, the ILP64 data model does not provide a natural way to describe 32-bit data types, and must resort to non-portable constructs such as int32 to describe such types. This is likely to cause practical problems in producing code which can run on both 32- and 64-bit platforms without numerous #ifdef constructions. It has been possible to port large quantities of code to LP64 platforms without the need to make such changes, while maintaining the investment made in data sets, even in cases where the typing information was not made externally visible by the application.
Most ints in existing applications can remain as 32 bits in a 64-bit environment; only a small number are expected to be the same size as pointer or long. Under the ILP64 data model, most ints will need to change to int32. However, int32 does not behave like a 32-bit int. Instead, int32 is like short in that all operations have to be converted to int (64-bits, sign extended) and performed in 64-bit arithmetic. Thus, int32 in the ILP64 data model is not exactly the same as int in the ILP32 data model. These differences may cause subtle and hard-to-find bugs.
Instruction cycle penalties are incurred whenever additional cycles are required to properly implement the semantics of the intended data model. For example, in the LP64 data model it is only necessary to perform sign extension on int when you have a mixed expression including longs. However, most integral expressions do not include longs and compilers can be made smart enough to only sign extend when necessary.
int is by far the most frequent data type to be found (statically and sometimes dynamically) within C and C++ programs. 64-bit integers require twice as much space as 32-bit integers. Applications using 64-bit integers consume additional memory and CPU cycles transporting that memory throughout the system. Furthermore, the latency penalty of 64-bit integers can be enormous, especially to disk, where it can exceed 1,000,000 CPU cycles (3 nsec to 3 msec). The memory size penalty for unneeded 64-bit integers could therefore be very high for some applications.
The LP64 data model enhances portability, especially for combined FORTRAN and C applications, and the most common types of problems that can occur are susceptible to automatic detection.
Interoperability is improved by the ability to use a standard data type to declare data structures that can be used in both 32-bit and 64-bit environments.
Standards conformance has been demonstrated both in the practical sense by the porting of many programs and in the formal sense of compliance with industry standards through verification test suites.
Transition from the current industry practice is smooth and direct following a path grooved with experience and demonstrated success.
No new non-portable data types are required. The data model makes natural use of the C fundamental data types.

Data Size Neutrality

When it was understood that the Single UNIX Specification was constraining system implementations that were other than ILP32, the relevant specifications were reviewed and recommendations drafted to make these specifications data size- and architecture-neutral. These recommendations were incorporated into the Single UNIX Specification, Version 2 published in 1997.

Porting Issues

Porting an application to a 64-bit UNIX system can be accomplished with a minimal amount of effort if the application was developed using good modern software engineering practices such as:

ISO C function prototypes
consistent and careful use of data types
all declarations are in headers

First of all, determine which data model is available on the platform to which you are porting. This data model will have a major impact on the amount of work required to achieve a successful port.

Then take the time to create and use ISO C function prototypes if they are absent from the source code. Unfortunately large quantities of perfectly good legacy code developed in the days before portability was a major issue may not have function prototypes. Fortunately many compilers have an option to generate ISO C function prototypes.

The remainder of this article assumes that you are porting to an LP64 platform since this is the data model of choice amongst major vendors, but the issues raised are equally valid on some or all of the other 64-bit data models.

General

Use utilities such as grep to locate and check all instances of the following:

Shift and complement operators; that is, "<<", ">>", "~". If used with long, add an "L" suffix to the value being shifted to avoid an incorrect result.
Addresses of objects ("&") should not be stored in an int.
Declarations of type long. Many of these can be converted to type int to save space. This is particularly true for network code.
The functions lseek(), fseek(), ftell(), fgetpos(), and so on. Use either off_t or fpos_t as appropriate for offset arguments. Do not use int or long to store file offsets.
All (int *) and (long *) casts.
Use of (char *) 0 for zero or (char *) comparisons. Use NULL instead.
Hard-coded byte counts or memory sizes. These will be wrong if they assume longs or pointers are 32 bits. Applications should use the sizeof() operator to avoid such problems.

Declarations

To enable application code to work on both 32-bit and 64-bit platforms, check all int and long declarations. Declare integer constants using "L" or "U" as appropriate. Ensure an unsigned int is used where appropriate to prevent sign extension. If you have specific variables that need to be 32 bits on both platforms, define the type to be int. If the variable should be 32 bits on an ILP32 platform and 64 bits on an LP64 platform, define the variables to be long.

Declare numeric variables as int or long for alignment and performance. Don't worry about trying to save bytes by using char or short. Remember that if the type specifier is missing from a declaration, it defaults to an int. Declare character pointers and character bytes as unsigned to avoid sign extension problems with 8-bit characters.

Assignments and Function Parameters

All assignments require checking. Since pointer, int, and long are no longer the same size on LP64 platforms, problems may arise depending on how the variables are assigned and used within an application.

Do not use int and long interchangeably because of the possible truncation of significant digits, as shown in the following example:

int iv; long lv; iv = lv;

Do not use int to store a pointer. The following example works on an ILP32 platform but fails on an LP64 platform because a 32-bit integer cannot hold a 64-bit pointer:

unsigned int i, *p;

i = (unsigned) p;

The converse of the above example is sign extension:

int *p; int i;

p = (int *)i;

Do not pass long arguments to functions expecting int arguments. Avoid assignments similar to the following:

int foo(int);

int iv; long lv; iv = foo( lv );

Do not freely exchange pointers and ints. Assigning a pointer to an int, assigning back to a pointer, and dereferencing the pointer may result in a segmentation fault. Avoid assignments similar to the following example:

int iv; char *buffer;

buffer = (char *) malloc ((size_t)MAX_LINE );

iv = (int) buffer; ... buffer = (char *) iv;

Do not pass a pointer to a function expecting an int as this will result in lost information. For example, avoid assignments similar to the following:

void f(); char *cp;

f(cp);

Use of ISO C function prototypes should avoid this problem. Use the void* type if you need to use a generic pointer type. This is preferable to converting a pointer to type long.

Examine all assignments of a long to a double as this can result in a loss of accuracy. On an ILP32 platform, an application can assume that a double contains an exact representation of any value stored in a long (or a pointer). On LP64 platforms this is no longer a valid assumption.

External Interfaces

An external interface mismatch occurs when an external interface requires data in a particular size or layout, but the data is not supplied in the correct format.

For example, an external interface may expect a 64-bit quantity, but receive instead a 32-bit quantity. Another example is an external interface which expects a pointer to a structure with 2 ints (8 bytes) but instead receives a pointer to a structure with an int and a long (16 bytes, 12 of data, 4 of alignment padding). External interface mismatching is a major cause of porting problems.

Format Strings

The function printf() and related functions can be a major source of problems. For example, on 32-bit platforms, using "%d" to print either an int or long will usually work, but on LP64 platforms "%ld" must be used to print a long. Use the modifier "l" with the d, u, o, and x conversion characters to specify an argument of type long or unsigned long. When printing a pointer, use "%p". If you wish to print the pointer in a specific representation, the pointer should be cast to an appropriate integer type before using the desired format specifier. For example, to print a pointer as an unsigned long decimal number, use %lu:

char *p;

printf( "%p %lu\n", (void *)p, (unsigned long)p );

As a rule, to print an integer of arbitrary size, cast the integer to long or unsigned long and use the "%ld" or "%lu" conversion specification respectively.

Constants

The results of arithmetic operations on a 64-bit platform can differ from those obtained using the same code on a 32-bit platform. Differing results are often caused by sign extension problems. These are generally the result of mixing signed and unsigned types and the use of hexadecimal constants. Consider the following code example:

long lv = 0xFFFFFFFF;

if ( lv < 0 ) {

On an ILP32 platform, lv is interpreted as -1 and the if condition succeeds. On an LP64 platform lv is interpreted as 4294967295 and the if condition fails.

Pointers

On ILP32 platforms, an int and a pointer are the same size (32 bits) and application code can generally use them interchangeably. For example, a structure could contain a field declared as an int, and most of the time contain an integer, but occasionally be used to store a pointer.

Another example, which most 32-bit lint utilities will not catch, is the following:

int iv, *pv;

iv = (int) pv; pv = (int *) iv;

This code fails on an LP64 platform. Not only do you lose the high 4 bytes of "pv", but by default these high bytes are significant.

Sizeof()

On ILP32 platforms sizeof(int) = sizeof(long) = sizeof(ptr *). Using the wrong sizeof() operand does not cause a problem. On LP64 platforms, however, using the wrong sizeof() will almost certainly cause a problem. For example, the following 32-bit code which copies an array of pointers to ints:

memcpy((char *)dest, (char *)src, number * sizeof(int))

must be changed to use sizeof(int *):

memcpy((char *)dest, (char *)src, number * sizeof(int *))

on an LP64 platform.

Note that the result of the sizeof() operation is type size_t which is an unsigned long on LP64 platforms.

Structures and Unions

The size of structures and unions on 64-bit platforms can be different from those on 32-bit platforms. For example, on ILP32 platforms the size of the following structure is 8 bytes:

struct Node { struct Node *left; struct Node *right; };

but on an LP64 platform its size is 16 bytes.

If you are sharing data defined in structures between 32-bit and 64-bit platforms, be careful about using longs and pointers as members of shared structures. These data types introduce sizes that are not generally available on 32-bit platforms. Avoid storing structures with pointers in data files. This code then becomes non-portable between 32-bit and 64-bit platforms.

To increase the portability of your code, use typedef'd types for the fields in structures to set up the types as appropriate for the platform, and use the sizeof() operator to determine the size of a structure. If necessary, use the #pragma pack statement to avoid compiler structure padding (This is not portable and is not a general solution). This is important if data alignment cannot change (network packets, and so on).

Structures are aligned according to their most strictly aligned member. Padding may be added to ensure proper alignment. This padding may be added within the structure, or at the end of the structure so that it terminates on the same alignment boundary on which it started.

Problems can occur when the use of a union is based on an implicit assumption, such as the size of member types.

Consider the following code fragment which works on ILP32 platforms. The code assumes that an array of two unsigned long overlays a double.

union double_union { double d; unsigned long ul[2]; };

To work on an LP64 platform, ul must be changed to an unsigned int type:

union double_union { double d; unsigned int ul[2]; };

This problem also occurs when building unions between ints and pointers since they are not the same size on LP64 platforms.

Beware of all aliasing of different multiple definitions of the same data. For example, assume the following two structures refer to the same data in different ways:

struct node { int src_addr, dst_addr; char *name; };

struct node { struct node *src, *dst; char *name; };

This works on an ILP32 platform, but fails on an LP64 platform. The two structure definitions should be replaced with a union declaration to ensure portability.

More Information

This article is derived from The Open Group Source Book, Go Solo 2 The Authorized Guide to Version 2 of the Single UNIX Specification. This is published herein with permission of The Open Group. More information on the Single UNIX Specification, Version 2, can be obtained from the following sources:

The online version of the Single UNIX Specification can be found at: <https://www.opengroup.org/unix/online.html>.

The Open Group Source Book Go Solo 2 The Authorized Guide to Version 2 of the Single UNIX Specification, 600 pages, ISBN 0-13-575689-8. This book provides complete information on what's new in Version 2, with technical papers written by members of the working groups that developed the specifications, and a CD-ROM containing the complete 3000-page specification in both HTML and PDF formats (including PDF reader software). For more information on the book, see <https://www.opengroup.org/unix/gosolo2>.

Additional information on the Single UNIX Specification can be obtained at The Open Group Web site, <https://www.opengroup.org/unix/>.