System Issues in Implementing High Speed 
Distributed Parallel Storage Systems

Brian Tierney (bltierney@lbl.gov),
Bill Johnston (wejohnston@lbl.gov),
Hanan Herzog, Gary Hoo, Guojun Jin, Jason Lee 

Imaging Technologies Group 
Lawrence Berkeley Laboratory
Berkeley, CA 94720

Abstract

In this paper we describe several aspects of implementing a high 
speed network-based distributed application. We describe the 
design and implementation of a distributed parallel storage system 
that uses high speed ATM networks as a key element of the archi-
tecture. The architecture provides what amounts to a collection of 
network-based disk block servers, and an associated name server 
that provides some file system functionality. The implementation 
approach is that of user level software that runs on UNIX worksta-
tions. Both the architecture and the implementation are intended to 
provide for easy and economical scalability in both performance 
and capacity. We describe the software architecture, the implemen-
tation and operating system overhead issues, and our experiences 
with this approach in an IP-over-ATM WAN. Although most of the 
paper is specific to a distributed parallel data server, we believe 
many of the issues we encountered are generally applicable to any 
high speed network-based application.

1.0  Introduction

In recent years many technological advances have made distributed 
multimedia servers a reality, and people now desire to put "on-line" 
large amounts of information, including images, videos, and hyper-
media databases. Increasingly, there are applications that demand 
high-bandwidth access to this data, either in single user streams 
(e.g., large image browsing, uncompressible scientific and medical 
video, and multiple coordinated multimedia streams) or, more com-
monly, in aggregate for multiple users. Here we describe a network 
distributed data server, called the Image Server System (ISS). The 
ISS is being used to supply data to a terrain visualization applica-
tion that requires 300-400 Mbits/s of data to provide a realistic visu-
alization. Both the ISS and the application have been developed in 
the context of the MAGIC Gigabit testbed ([5] and [11]). To address 
a point frequently raised, compression is not practical in the case of 
terrain visualization because the cost of decompression is prohibi-
tive in the absence of suitable hardware implementations.

To comment briefly on the relevant technologies, current disk tech-
nology delivers about 4 Mbytes/s (32 Mbits/s), a rate that has 
improved at about 7% each year since 1980 [10], and there is reason 
to believe that it will be some time before a single disk is capable of 
delivering streams at the rates needed for the application men-
tioned. While RAID [10] and other parallel disk array technologies 
can deliver higher throughput, they are still relatively expensive, 
and do not scale well (at least economically), especially in the sce-
nario of multiple network-distributed users (where we assume that 
the sources of data, as well as the multiple users, will be widely dis-
tributed). Wide area Asynchronous Transfer Mode (ATM) networks 
are being built on a SONET infrastructure, which has the character-
istic that bandwidth aggregates upward (contrary to our current net-
work hierarchy, where the slowest networks are at the top of the 
hierarchy). This characteristic makes it possible to use the network 
to aggregate many low-speed sources into a single high-speed 
stream.

The approach described here differs in many ways from RAID, and 
should not be confused with it. RAID uses a particular data layout 
and redundancy strategy to secure reliable data storage and parallel 
disk operation. Our approach, while using parallel disks and serv-
ers, imposes no particular layout strategy (in fact, this is deliber-
ately left to the application domain), and is implemented entirely in 
software (though the data redundancy idea of RAID might be 
applied across servers).

At the present state of development and experience, the system we 
describe is used primarily as a large, fast "cache" for image or mul-
timedia data. In our approach, reliability with respect to data cor-
ruption is provided by the usual OS and disk controller-level 
mechanisms, and reliability of the overall system is a function of 
user-level strategies of data replication. The data of interest (tens to 
hundreds of GBytes) is typically loaded from archival tertiary stor-
age, or written into the system from live video sources. (In the latter 
case, the data is also archived to bulk storage in real-time.)

The Image Server System is an example of distributed parallel data 
storage technology. It is a multimedia file server that is distributed 
across a wide area network to supply data to applications located 
anywhere in the network. This system is not a general purpose, dis-
tributed file system in that the data units ("files") are assumed to be 
large and relatively static. The approach allows data organization to 
be determined by the user as a function of data type and access pat-
terns. For most applications, the goal of data organization is that 
data should be declustered across both disks and servers (that is, 
dispersed in such a way that as many system elements as possible 
can operate simultaneously to satisfy a given request). This declus-
tering allows a large collection of disks to seek in parallel, and 
allows all servers to send the resulting data to the application in par-
allel, enabling the ISS to perform as a high-speed image server. 
Architecturally, the ISS is a distributed, parallel mass storage sys-
tem that uses UNIX workstation technology to provide a low-cost, 
scalable implementation. The data transport is via TCP/IP or 
RTP/IP[13]. The scalability arises from the high degree of indepen-
dence among the servers: both performance and capacity may be 
increased, in essentially linear fashion, by adding servers. (Ulti-
mately this is limited by the parallelism inherent in the data.) The 
general idea is illustrated in Figure 1 \x11(Data Streams Aggregated by 
ATM Switches).

In our prototypes, the typical ISS consists of several (four - five) 
UNIX workstations (e.g. Sun SPARCStation, DEC Alpha, SGI 
Indigo, etc.), each with several (four - six) fast-SCSI disks on multi-
ple (two - three) SCSI host adaptors. Each workstation is also 
equipped with an ATM network interface. An ISS configuration 
such as this can deliver an aggregated data stream to an application 
at about 400 Mbits/s (50 Mbytes/s) using these relatively low-cost, 
"off the shelf" components by exploiting the parallelism provided 
by approximately five hosts, twenty disks, ten SCSI host adaptors, 
and five network interfaces.

2.0  Related Work

In some respects, the ISS resembles the Zebra network file system, 
developed by John H. Hartman and John K. Ousterhout at the Uni-
versity of California, Berkeley [4]. Both the ISS and Zebra can sep-
arate their file access and management activities across several 
hosts on a network. Both try to maintain the availability of the sys-
tem as a whole by building in some redundancy, allowing for the 
possibility that a disk or host might be unavailable at a critical time. 
The goal of both is to increase data throughput despite the current 
limits on both disk and host throughput. 

However, the ISS and the Zebra network file system differ in the 
fundamental nature of the tasks they perform. Zebra is intended to 
provide traditional file system functionality, ensuring the consis-
tency and correctness of a file system whose contents are changing 
from moment to moment. The ISS, on the other hand, tries to pro-
vide extremely high-speed, high-throughput access to a relatively 
static set of data. It is optimized to retrieve data, requiring only min-
imum overhead to verify data correctness and no overhead to com-
pensate for corrupted data. 

There are other research groups working on solving problems 
related to distributed storage and fast multimedia data retrieval. For 
example, Ghandeharizadeh, Ramos, et al., at USC are working on 
declustering methods for multimedia data [3], and Rowe, et al., at 
UCB are working on a continuous media player based on the 
MPEG standard [12].

3.0  Applications

There are several target applications for the initial implementation 
of the ISS. These applications fall into two categories: image serv-
ers and multimedia / video file servers.

3.1  Image Server

The initial use of the ISS is to provide data to a terrain visualization 
application in the MAGIC testbed. This application, known as Ter-
raVision [5], allows a user to navigate through and over a high reso-
lution landscape represented by digital aerial images and elevation 
models. TerraVision is of interest to the U.S. Army because of its 
ability to let a commander "see" the battlefield. TerraVision is very 
different from a typical "flight simulator"-like program in that it 
uses large quantities of high resolution aerial imagery for the visual-
ization instead of simulated terrain. TerraVision requires large 
amounts of data, transferred at both bursty and steady rates. The ISS 
is used to supply image data at hundreds of Mbits/s rates to TerraVi-
sion. We are not using any data compression for this application 
because the bandwidth requirements for TerraVision are such that 
real-time decompression is not possible without using special pur-
pose hardware.

In the case of a large-image browsing application like TerraVision, 
the strategy for using the ISS is straightforward: the image is tiled 
(broken into smaller, equal-sized pieces), and the tiles are scattered 
across the disks and hosts of the ISS. The order of images delivered 
to the application is determined by the application predicting a 
"path" through the image (landscape), and requesting the tiles 
needed to supply a view along the path. The actual delivery order is 
a function of how quickly a given server can read the tiles from disk 
and send them over the network. Tiles will be delivered in roughly 
the requested order, but small variations from the requested order 
will occur. These variations must be accommodated by buffering or 
other strategies in the client application.

This approach can supply data to any sort of large-image browsing 
application, including applications for displaying large aerial-photo 
landscapes, satellite images, X-ray images, scanning microscope 
images, and so forth.

3.2  Video Server

Examples of video server applications include video players, video 
editors, and multimedia document browsers. A video server might 
contain several types of stream-like data, including conventional 
video, compressed video, variable time base, multimedia hypertext, 
interactive video, and others. Several users may be accessing the 
same video data at the same time, but may be at different frames in 
the stream. This affects many factors in an image server system, 
including the layout of the data on disks. Since there is a predict-
able, sequential pattern to the requests for data, the data would be 
placed on disk in a like manner. Large commercial concerns such as 
Time Warner and U.S. West are building large-scale commercial 
video servers such as the Time Warner / Silicon Graphics video 
server [6]. Our approach may address a wider scale, as well as a 
greater diversity, of data organization strategies so as to serve the 
diverse needs of schools, research institutions, and hospitals for 
video-image servers in support of various educational and 
research-oriented digital libraries.

3.3  Application to ISS Interface

Application access to the ISS is through a client-side library that 
accepts requests for data, and returns data to the application. The 
client library obtains from the ISS Master a list of ISS Disk Servers 
(q.v.) that have data for the area of interest. The client library estab-
lishes connections to all ISS Disk Servers containing that data set. 
The application specifies the location of a memory buffer to receive 
incoming data. 

The current implementation provides access to large images, in 
which the unit of data is a tile. The application requests data in 
terms of lists of tiles, and the tiles sent by the ISS servers are placed 
into the application's buffer. (See Figure 2 \x11(Application Architec-
ture).)

3.4  Data Prediction

Data prediction is important to ensure that the ISS is utilized as effi-
ciently as possible. By always requesting more tiles than the ISS 
can actually deliver before the next tile request, we ensure that no 
component of the ISS is ever idle. For example, if most of a request 
list's images were on one server, the other servers could still be 
reading and sending or caching images that may be needed in the 
future, instead of idly waiting. The point of the path prediction is to 
provide a rational basis for pre-requesting tiles.

As an example of path prediction, consider an interactive video 
database with a finite number of possible branch points. (A "branch 
point" occurs where a user might select one of several possible play 
clips.) As a branch point is approached by the player, the predictor 
(without knowledge of which branch will be taken) will start 
requesting images (frames) along both branches. These images are 
cached first at the disk servers, then at the receiving application. As 
soon as a branch is chosen, the predictor ceases to send requests for 
images from the other branch. Any cached but unsent images are 
flushed as better predictions fill the cache. 

For MAGIC's TerraVision, prediction is based on geometric charac-
teristics of the path being followed, the limitations of the mode of 
simulated transport (that is, walking, driving, flying, etc.), the 
intended destination, and so on. The prediction results in a priority 
ordered list of tile requests being sent to the ISS. The ISS has no 
knowledge of the prediction strategy (or even if one has been used).

The client will keep asking for an image until it shows up, or until it 
is no longer needed (e.g. the application may have "passed" the 
region of landscape that involves the image that was requested, but 
never received.) Applications will have different strategies to deal 
with images that do not arrive in time. For example, TerraVision 
keeps a local low resolution data set to fill in for missing tiles.

Prediction is transparent to the ISS, and is manifested only in the 
order and priority of images in the request list. The prediction algo-
rithm is mostly a function of the client application, and typically 
runs on the client. Prediction could also be done by a third-party 
application. The aforementioned interactive video database, for 
example, might use a third-party application for prediction.

4.0  Design

4.1  Goals 

The following are some guidelines we have followed in designing 
the ISS:

   + The ISS should be capable of being geographically distrib-
uted. In a future environment of large-scale, mesh-connected 
national networks, network-distributed storage should be 
capable of providing an uninterruptable stream of data, in 
much the same way that a power grid is resilient in the face 
of source failures, and tolerant of peak demands because of 
multiple sources multiply interconnected.

  + The ISS approach should be scalable in all dimensions, 
including data set size, number of users, number of server 
sites, and aggregate data delivery speed.

  + The ISS should deliver coherent image streams to an applica-
tion, given that the individual images that make up the 
stream are scattered (by design) all over the network. In this 
case, "coherent" means "in the order needed by the applica-
tion." No one disk server will ever be capable of delivering 
the entire stream. The network is the server. 

  + The ISS should be affordable. While something like a 
HIPPI-based RAID device might be able to provide the 
functionality of the ISS, this sort of device is very expensive, 
is not scalable, and is a single point of failure.

4.2  Research Issues

The design goals present several issues that need to be addressed. 
These include:

  + How to store and retrieve image data at gigabit speeds using a 
storage system whose servers are widely distributed;

  + How to place data blocks (tiles) such that image data will be 
distributed across many storage servers in a way that ensures 
parallel operation;

  + How to handle high-speed IP transport over ATM networks to 
provide the parallel data paths needed to aggregate 
medium-speed disk servers into a logically integrated, 
high-speed image storage server. (Although ATM will prob-
ably become the Ethernet of the future, end-to-end networks 
will be heterogeneous for a long time to come, necessitating 
the use of an internetwork protocol, of which IP is the clear 
choice);

  + Assessing how an ATM network will behave (or misbehave) 
under the conditions of multiple, coordinated, parallel data 
streams.

4.3  Approach: A Distributed, Parallel Server

The ISS design is based on the use of multiple low-cost, 
medium-speed disk servers which use the network to aggregate 
server output into a single high-bandwidth stream. To achieve high 
performance we exploit all possible levels of parallelism, including 
that available at the level of the disks, controllers, processors/mem-
ory banks, servers, and the network. Proper data placement strategy 
is also key to exploiting system parallelism. For placement of image 
tile data, an application-related program declusters tiles so that all 
ISS disks are evenly accessed by tile requests, but clusters tiles that 
are on the same disk[1]. This strategy is a function of both the data 
structure (tiled images) and the geometry of the access (e.g., paths 
through the landscape). Currently we are working on extending 
these methods to handle video-like data.

At the individual server level, the approach is that of a collection of 
disk managers that move requested data to memory cache. Depend-
ing on the nature of the data and its organization, the disk managers 
may have a strategy for moving other closely located and related 
data from disk to memory as a side effect of a particular data 
request. However, in general, we have tried to keep the implementa-
tion of data prediction (determining what data will be needed in the 
near future) separate from the basic data moving function of the 
server. Prediction might be done by the application, or it might be 
done be a third party that understands the data usage patterns. In 
any event, the server sees only lists of requested blocks. 

As explained below, the dominant bottlenecks for this type of appli-
cation in a typical UNIX workstation are, first, memory copy speed, 
and second, network access speed. For these reasons, an important 
design criterion is to use as few memory copies as possible, and to 
keep the network interface operating at full bandwidth all the time. 
Our implementation uses only three copies to get data from disk to 
network, so maximum server throughput is about 
(mem_copy_speed / 3).

Another important aspect of the design is that all components are 
instrumented for timing and data flow monitoring in order to char-
acterize the ISS implementation and the network performance. To 
do this, all communications between ISS components are times-
tamped. We are using GPS (Global Positioning System) receivers 
and NTP (Network Time Protocol) [8] [9] to synchronize the clocks 
of all ISS servers and of the client application in order to accurately 
measure network throughput and latency. 

5.0  ISS Architecture and Implementation

The following is a brief overview of a typical ISS operation. A data 
set must first be loaded across a given set of ISS hosts and disks, 
and a table containing disk/offset locations for each block of data is 
stored on each host. The application sends requests for data 
(images, video, sound, etc.) to the Name Server process on each 
Disk Server host, which does a lookup to determine the location 
(server - disk - offset) of the requested data. If the data is not stored 
on that host, the request is discarded with the assumption that 
another host will handle it; otherwise the list of locations is passed 
to the ISS Disk Server. Each Disk Server then checks to see if the 
data is already in its cache, and if not, fetches the data from disk and 
transfers it to the cache. Once the data is in the cache, it is sent to 
the requesting application. 

In the following sections, we describe the basic software modules, 
their functions, how they relate to each other, and some of the terms 
and models that were used in the design of the ISS. Figure 3 \x11(ISS 
Server Architecture) shows how the components of the ISS relate to 
each other. 

5.1  ISS Master

The ISS Master process is responsible for application-to-ISS startup 
communication, Disk Server process initialization, performance 
monitoring, and coordination between multiple ISS Disk Servers. 
This includes the ability to collect performance and usage statistics 
of all ISS components. In the future, we plan to extend the function-
ality of the Master to dynamically reconfigure ISS Disk Server 
usage to avoid network or ISS Disk Server bottlenecks.

5.2  Name Server

The Name Server listens for tile request lists from the application. 
After receiving a list, the Name Server does a table lookup to deter-
mine where the data is located (i.e. which server, which disk, and 
the disk offset). The Name Server then passes this list to the ISS 
Disk Server.

5.3   ISS Disk Server

There is one ISS Disk Server process for each ISS host. It is respon-
sible for all ISS memory buffer and request list management on that 
host. The Disk Server receives image requests from the Master pro-
cess, and determines if the image is already in its buffer cache. If it 
is already in the buffer cache (which is kept entirely in available 
memory), the request is added to the "to send" list. Otherwise, it is 
added to a "to read" list. Tile requests that have not been satisfied by 
the time the next list from the Master process arrives are "flushed" 
(discarded) from the lists. All requests that haven't been either read 
off of disk or written to the network interface are removed from all 
request lists, and any memory buffers waiting to be written are 
returned to the hash table. Note that if a tile read has completed, but 
the tile has not yet been sent to the network, the data stays in the 
cache, so that if that tile is in the next request list it will be sent first. 
Those buffers that were waiting to be filled with data from the disk 
are put at the head of an LRU (Least Recently Used) list so they 
may be used for requests in the newly arrived list. The Disk Server 
process also periodically sends status information to the Master. 

 ISS buffer management is very similar to that of the UNIX operat-
ing system, and many of the ideas for lists, hashing, and the format 
of the headers have been adopted from UNIX for use within the ISS 
[7]. A buffer can be freed from the hash table in one of two ways. If 
a buffer was allocated to a list (read/send) and that list was flushed, 
the buffer is returned to the head of the LRU list so that it is the next 
buffer to be reused. A buffer may also naturally progress through 
the LRU list until it has reached the end of the list, at which time it 
is recycled. 

The Disk Server handles three request priority levels: 

   * high: send first, with an implicit priority given by order 
within the list.

   * medium: send if there is time. 

   * low: fetch into the cache if there is time, but don't send.

The priority of a particular request is set by the requesting applica-
tion. The application's prediction algorithm should use these prior-
ity levels to keep the ISS fully utilized at all times without 
requesting more data than the application can process. For example, 
the application should send low priority requests to pull data into 
the ISS cache that the application will need in the near future; this 
data is not sent to the application until the application is ready. 
Another example is an application that plays back a movie with a 
sound track, where audio might be high priority requests, and video 
medium priority requests. 

5.4  ISS Reader

The ISS Reader process reads data off of disk and puts it into the 
buffer cache that is managed by the Disk Server process. There is 
one Reader per physical disk. This process continually checks for 
requests in the "to read" list, starts a read operation on that disk if a 
request is pending, then waits for the data to be read. Once the data 
is read off of disk the request is moved into the list of data that is to 
be written out. There are two distinct lists of data that are to be writ-
ten out, one for each of the high and medium priority levels 
described above. 

5.5   ISS Sender

The ISS Sender process sends all data in the "to send" list out to the 
application that requested it. There is one sender per network inter-
face. This process continually checks the list of data that is ready to 
be written out, looking for data that is of high or medium priority 
(as described above). Note that data of medium priority will only be 
sent if there is no data of high priority in the cache. However, it is 
possible for medium priority data to be written out before higher 
priority data, as in the case where the medium priority data is in the 
memory cache, and higher priority data is resident on disk. 

6.0  Current ISS Status

All ISS software is currently tested and running on Sun worksta-
tions (SPARCstations and SPARCserver 1000's) running SunOS 
4.1.3 and Solaris 2.3, DEC Alpha's running OSF/1, and SGI's run-
ning IRIX 5.x. Demonstrations of the ISS with the MAGIC Terrain 
Visualization application TerraVision have been done using several 
WAN configurations in the MAGIC testbed [11]. Using enough 
disks (4-8, depending on the disk and system type), the ISS soft-
ware has no difficulty saturating current ATM interface cards. We 
have worked with 100 Mbit and 140Mbit TAXI S-Bus and VME 
cards from Fore systems, and OC-3 cards from DEC, and in all 
cases ISS throughput is only slightly less than ttcp speeds. 

Table 1 below shows various system ttcp speeds and ISS speeds. 
The first column is the maximum ttcp speeds using TCP over a 
ATM LAN with a large TCP window size. In this case, ttcp just 
copies data from memory to the network. For the values in the sec-
ond column, we ran a program that continuously reads from all ISS 
disks simultaneously with ttcp operation. This gives us a much 
more realistic value for what network speeds the system is capable 
or while the ISS is running. The last column is the actual throughput 
values measured from the ISS. These speeds indicate that the ISS 
software adds a relatively small overhead in terms of maximum 
throughput.

7.0  Operation System Issues

7.1  Threads

Currently, the ISS Disk Server is implemented as a group of 
loosely-coordinated UNIX processes. We believe performance can 
be enhanced by transforming these processes into threads. Most of 
the gains arise from bypassing the overhead of the interprocess 
communication mechanisms needed to guarantee consistency of 
resources shared by the processes, e.g., the semaphores needed to 
ensure non-simultaneous access to the to-read and to-send lists. The 
same functionality can be achieved using thread-based mechanisms 
that are designed to be much faster, e.g., mutual exclusion locks.

The ISS Disk Server requires separate processes to receive from the 
network, read from disk, and send to the network. These processes 
must share certain resources, namely, the to-read lists, the to-send 
list, and the data cache. To ensure fair access to each of these 
resources, we force some processes to sleep for a short time: by this 
mechanism, we guarantee that the operating system will perform a 
context switch. When any Disk Server process accesses a list or the 
cache, it first obtains a semaphore to guarantee exclusive access for 
the duration of the time it needs to perform its task. If other pro-
cesses attempt to access the data, they are rejected and must, after a 
sleep-induced wait, try again.

Instead of the expensive semaphore mechanism, multiple threads 
guarantee exclusive access by using mutual-exclusion locks. The 
overhead of mutex locks is much less than that of semaphores, and 
checking mutex locks is much faster. Threads which cannot obtain a 
needed resource enter into a state of conditional waiting: this state 
eliminates the cycle of checking for the available resources, sleep-
ing, and checking again, which characterizes processes attempting 
to gain a shared resource. Threads in conditional wait are simply 
put to sleep and signaled when the resource is available. Interthread 
communication is much faster than interprocess communication 
and threads consume fewer resources, since threads share the same 
text space with one another. 

7.2  Real-Time Scheduling

An application like the Image Server System could benefit from real 
time scheduling. The ISS currently must attempt to coerce the 
UNIX scheduler to context-switch between the various competing 
ISS processes: trying to promote such context switching wastes 
time and reduces efficiency.  A real-time operating system allows 
fine-grained control of the scheduler by means of thread prioritiza-
tion and conditional waiting. Effectively, threads can take more or 
less processor time as necessary instead of arbitrarily taking a fixed 
slice of CPU time, and reducing competition with kernel-level or 
other user-level threads. This ability to vary the amount of time 
used by each thread is especially useful given that the ISS is driven 
by external events (the requests for the images) and must deliver the 
images back to the driving application within a predetermined time. 

8.0  Experience

8.1  ATM Networking Issues

The design of the ISS is based upon the ability to use ATM network 
switches to aggregate cells from multiple physical data streams into 
a single high-bandwidth stream to the application. Figure 1 \x11(Data 
Streams Aggregated by ATM Switches) shows multiple ISS servers 
being used to form a single high-speed data stream to the applica-
tion.

Below is a list of what we have learned from our experience using 
ATM networks. Most of the experience reflected here comes from 
our work in the MAGIC gigabit testbed. 

	I)	Hardware and Physical Layer:

	+ delivery of most ATM hardware has been delayed;

	+ the "infant mortality" rate has been high (several ATM 
	interfaces and the ATM switch died in the first 60 
	days);

	+ it now seems clear that the workstation vendors have 
	adopted multimode OC-3 as their preferred physical 
	layer (we bought several 140 Mb/s TAXI, which were 
	available six months ago);

	+ several of the multimode interfaces will drive single 
	mode equipment (e.g., SONET terminals) by carrying 
	the multimode fiber to the single mode interface (all 
	examples of this have had active elements immediate-
	ly behind the single mode interface);

	II)	 Link Layer:

	+ not all of the ATM cell definitions (especially with re-
	spect to the Quality of Service (QOS) field) are 
	uniform among manufacturers (QOS = 0 seems to be 
	the point of agreement);

	+ there are many places where cell loss is apparent: these 
	include switch output ports (next generation switches 
	have much more buffering), and workstation interfac-
	es, which are easily overrun for reasons not yet clear 
	(it could be failure of kernel code to empty buffers fast 
	enough);

	+ ATM device drivers are still fairly crude and buggy. 
	This is especially true on multiprocessor systems, 
	where the device drivers don't yet fully take advan-
	tage of the multi-threading capabilities of the 
	operating system;

	III)	 Network:

	Except for homogeneous switch environments, 
	switched virtual circuits are not yet standardized, and 
	in a large scale environment, use of PVCs makes set 
	up and reconfiguration tedious and prone to errors;

	IV)	Transport:

	There are many throughput anomalies that are being 
	investigated;

	Our work with timer-driven RTP shows some promise 
	of it being a little more immune to cell loss than TCP.

One of the things that is becoming apparent in our work with this 
architecture is that the conventional notion of QOS is not a good 
method for regulating tightly coupled applications like the ISS, and 
(for similar reasons) may not be good for distributed-parallel com-
pute server systems. Problems frequently occur when several serv-
ers that normally operate asynchronously to provide data to a single 
source, suddenly synchronize to produce a burst of data that over-
loads the switch and interface on the single receiver, thereby caus-
ing everybody to slow down and retransmit, leading to severe 
throughput degradation. This is very similar to the problem of rout-
ing message synchronization described by Floyd and Jacobson[2].

8.2  Memory Copy Speed

The main bottleneck for the ISS Disk Server is the speed of moving 
data into and out of memory.   A SPARCStation 10, for example, 
has memory copy speed of about 22 Mbytes/s (176 Mbits/s). When 
writing to the network, the situation is even worse because data is 
moved to the interface via UNIX "mbufs" [7], adding additional 
overhead. We have measured the speed of an mbuf copy as 19 
Mbytes/s (152 Mbits/s), and there are two mbuf copies required to 
send a packet to the network. Along with the other overhead 
required to assemble packets, this limits the speed with which we 
can write to the network to 9.2 Mbytes/s (74 Mbits/s). 

If the network sends were faster, e.g., 19.4 Mbytes/s (155 Mbits/s - 
the OC-3 rate, ignoring ATM overhead), the next bottleneck would 
be the disk reading speed, which in this configuration is 9 Mbytes/s 
(72 Mbits/s). This bottleneck is trivially removed by adding more 
disks. This brings us back to the "memcpy" limit of 22 Mbytes/s as 
the key bottleneck. The other bottlenecks are not likely to be rele-
vant in the near future. Increasing the speed of workstation memory 
is the key to increased performance for this application. 

9.0  Conclusions

The emergence of wide area high-speed networks enables many 
types of new systems, include distributed parallel data servers. We 
have designed and implemented a special purpose high-speed data 
server, called the ISS. The ISS is designed to be distributed across 
multiple hosts with multiple disks on each host, and should be capa-
ble of scaling to gigabit per second rates. Moreover, we believe that 
the core design is flexible enough so that only minor modifications 
need be made to adapt the ISS to different data types and access pat-
terns.

 In the process of implementing and using this system, we have 
learned many things about workstation and operating system bottle-
necks, and using ATM networks. The main things we discovered 
are: 

  * memory to memory copy speed is the main I/O bottleneck on 
today's workstations;

  * ATM networks still have many problems to be worked out 
before they are ready for general use.

10.0  References

	[1]	Chen, L. T. and Rotem, D., "Declustering Objects for Visu-
alization", Proc. of the 19th VLDB (Very Large Data-
base) Conference, 1993.

	[2]	Floyd, S. and Jacobson, V., "The Synchronization of Peri-
odic Routing Messages", Proceedings of SIGCOMM 
`93, 1993.

	[3]	Ghandeharizadeh, S. and Ramos, L, "Continuous Retrieval 
of Multimedia Data Using Parallelism, IEEE Transac-
tions on Knowledge and Data Engineering, Vol 5, No 4, 
August 1993.

	[4]	Hartman, J. H. and Ousterhout, J. K., "Zebra: A Striped 
Network File System", Proceedings of the USENIX 
Workshop on File Systems, May 1992.

	[5]	Leclerc, Y.G. and Lau, S.Q., Jr.,"TerraVision: A Terrain 
Visualization System", SRI International, Technical Note 
#540, Menlo Park, CA, 1994.

	[6]	Langberg, M., "Silicon Graphics Lands Cable Deal with 
Time Warner Inc.", San Jose Mercury News, June 8, 
1993.

	[7]	Leffler, S.J., McKusick, M.K., Karels, M. J. and Quarter-
man, J.S., "The Design and Implementation of the 
4.3BSD UNIX Operating System", Addison-Wesley, 
Reading, Mass., 1989.

	[8]	Mills, D., "Network Time Protocol (Version 3) specifica-
tion, implementation and analysis", RFC 1305, Univer-
sity of Delaware, March 1992.

	[9]	Mills, D., "Simple Network Time Protocol (SNTP)", RFC 
1361, University of Delaware, August 1992.

	[10]	Patterson, D., Gibson, G., and Katz, R., "A Case for Redun-
dant Arrays of Inexpensive Disks (RAID)," in Proc. 
1988 SIGMOD Conf., June 1988.

	[11]	Richer, I. and Fuller, B.B., "An Overview of the MAGIC 
Project," M93B0000173, The MITRE Corp., Bedford, 
MA, 1 Dec. 1993.

	[12]	Rowe, L. and Smith, B.C., "A Continuous Media Player", 
Proc. 3rd International Workshop on Network and Oper-
ating System Support for Digital Audio and Video, San 
Diego, CA, Nov. 1992.

	[13]	Schulzrinne, H. and Casner, S., "RTP: A Real-Time Trans-
port Protocol", Internet Draft, Audio/Video Transport 
Working Group of the IETF, 1993.