The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996.

Calliope: A Distributed, Scalable Multimedia Server

Andrew Heybey, Mark Sullivan, Paul England
Bell Communications Research
Morristown, NJ 07960

Abstract

Calliope is a distributed multimedia server constructed from personal computers. Preliminary performance measurements indicate that Calliope can be scaled from a single PC producing about 22 MPEG-1 video streams to hundreds of PCs producing thousands of streams. The system can store both variable- and constant-rate video and audio encodings and can deliver them over any network supported by the underlying operating system. Calliope is cost-effective because it requires only commodity hardware and portable because it runs under Unix.

1 Introduction

A multimedia server records and plays data that has real-time delivery requirements. Examples of multimedia data include video and audio streams; some typical applications that might use a multimedia server are video mail, video bulletin boards, and video databases. Two crucial requirements of a multimedia server are that it be low cost and that it scale up as demands on the system increase. This paper describes the design and performance of Calliope, a distributed multimedia server constructed from personal computers. Calliope achieves low cost by running on commodity PCs and is easy to scale up by virtue of its distributed architecture.

A traditional network file server cannot be used as a multimedia server for several reasons. First, multimedia data has timeliness constraints. Clients have limited buffering, so data that arrives too late will result in an interruption in audio or a still frame; data that arrives too early will overflow the buffer and be discarded. Traditional file servers make no delivery guarantees. Second, a traditional file system is not optimized for a multimedia workload. Multimedia clients are willing to accept relatively long delays when a read or write begins, but tolerate very little jitter while a large file is being accessed.

Calliope uses several strategies to solve these problems. Timeliness constraints are addressed by dividing the system into real-time and non-real-time components so that data delivery deadlines can be met without using a hard real-time operating system. One or more Multimedia Storage Units (MSUs) handle the real-time services while a single Coordinator machine handles the non-real-time functions (see Figure 1). The MSUs send and receive multimedia data, manage disk storage, and process simple VCR commands from clients. The Coordinator maintains bookkeeping information, serves as an initial point of contact for users, and schedules requests at MSUs. For very small installations, the Coordinator and MSU software may run on the same machine. Larger Calliope installations still have a single Coordinator, but add more MSUs as storage requirements or user bandwidth requirements increase.
To achieve as much performance as possible from relatively low-performance machines, the MSUs are optimized to handle multimedia files. The MSU has a large-block file system to store large, sequentially-accessed files, and its process architecture is oriented toward efficiently moving large quantities of data through the system from disk to network interface (or vice versa for recording). Calliope can record and play back in real time both constant bit-rate streams, such as MPEG, and variable bit-rate ones, such as those used by the MBone tools [5]. Since there are many different network protocols and audio/video coding standards, Calliope is extensible. Simple modules can be added if necessary to handle different network packet formats and to extract timing information from new encodings.

Calliope does not rely on special-purpose hardware or a hard real-time operating system. The major advantage of a software-based approach running under a standard operating system is portability. As the performance of commodity computers improves, Calliope can continue to run on the most cost-effective platform.

The performance results presented in this paper show that twenty-two 1.5 Mbit/sec MPEG-1 streams can be transmitted to an FDDI network from a single Pentium PC running the Calliope MSU software. By combining MSUs, our measurements suggest that Calliope could scale to hundreds of PCs and thousands of customers. We think that such a distributed architecture is appropriate for large-scale multimedia servers, but low-end machines have performance and engineering difficulties that we had not anticipated. In the remainder of the paper, we describe the system architecture of Calliope, measurements of server performance, and some of the PC-related problems we encountered while building Calliope.

2 System Architecture

Figure 1 shows a Calliope installation with three MSUs. The intra-server network connecting the Calliope components is a relatively low-bandwidth network such as Ethernet. The multimedia delivery network that connects Calliope to its clients is a higher-bandwidth network such as FDDI or ATM. If necessary, a Calliope installation could eliminate the intra-server network and use the multimedia delivery network to carry both intra-server and client-server traffic. This would reduce the number of clients served by each MSU, since some network interface bandwidth would have to be dedicated to MSU control messages.

In Calliope, the Coordinator and MSUs communicate using TCP connections. Clients use TCP connections to communicate control information to Calliope and UDP for real-time data delivery. MSUs do not communicate with one another.

The subsections that follow describe Calliope's components and some of its applications. We start with a section on the applications to show what kind of services the system supports. Next, we describe the ways in which the Coordinator allocates resources to provide those services. Finally, we describe the storage management and real-time features of the MSU.

2.1 The Client Interface

We have experimented with several different kinds of client applications for Calliope's multimedia services. Some are fairly simple video-on-demand programs that browse Calliope's table of contents, select content to play, and issue simple VCR commands. We have also used Calliope in a simple video mail application and have used it to record MBone presentations. Finally, we have experimented with an application that keeps simple indices of recorded seminars.
Users can examine the index and skip to the portion of the seminar that interests them.

Our primary clients have been Unix(tm) workstations and PCs running the MBone teleconferencing tools. We have also used Windows(tm) PCs with hardware-assisted MPEG and JPEG decoders. In one application, we have treated a PC-based MPEG client as a dumb set-top box that accepts raw MPEG over UDP. So far, we have not programmed any commercial set-top boxes to act as Calliope clients.

To begin using Calliope, a client establishes a session with the Calliope Coordinator. The client can then request a listing of available content, play existing content, or record new content. With appropriate permissions, the client can delete an item of content or make other administrative changes to the server configuration.

The Calliope Coordinator associates a content type with each item in its table of contents. Calliope uses the content type to determine the rate at which the content is to be played. The content type also tells whether the content is played at a constant or variable rate. Types may be composite; for example, we have a VAT [17] audio type, an RTP (the Internet Real-time Transport Protocol [13]) video type, and a Seminar type composed of one VAT and one RTP stream. In the current implementation, clients may not define new types without the help of a system administrator to update the Coordinator's internal databases.

Before sending or receiving multimedia content, the client must create a UDP socket and register that socket with Calliope as a display port. In Calliope, a display port associates a string name and a content type with the socket's IP address and port number. The display port need not be on the same machine as the client program that creates it. The software that owns the port socket can be a software encoder/decoder that is part of the client application or a simple driver for a hardware device. Display ports for composite types can be constructed from previously-registered display ports of the component types. In our example above, a Seminar display port is composed of display ports for RTP and VAT. A client can register many display ports of the same type and unregister them as necessary, but all ports are associated with a single client-Coordinator session. When this session is dropped, the Coordinator deallocates its local representation of the ports.

To read an item of Calliope's content, the client sends a request containing the content name and the name of a display port. Calliope checks that the port and the content have the same type and assigns the resources required to play the content. When it is ready to play the content stream, Calliope creates a control connection to the client on which the client can send VCR commands: pause, play, seek, and quit. For some content items, clients will also be able to issue fast forward and fast backward commands (this is described in more detail in Section 2.3.1). Recording on a Calliope server is similar to reading, but the client request must also contain an estimate of the recording length.

2.2 Coordinator Architecture

The Coordinator is the global resource manager for Calliope. It maintains a small administrative database and a set of scheduling queues. The database contains information about customers, content stored on Calliope, and resources owned by the system. The Coordinator uses the database to tell which MSUs are available, how many disks each one has, and how much disk space remains unused. The Coordinator also keeps track of which items of content are stored on each disk and what content types are available.
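In outline, the records the Coordinator keeps might look like the following C sketch. The field names, sizes, and limits here are illustrative inventions, not Calliope's actual schema:

    /* Illustrative Coordinator bookkeeping records (names and sizes
     * are our own simplifications, not Calliope's actual schema). */

    typedef struct ContentType {
        char   name[32];        /* e.g. "MPEG1", "RTP", "VAT", "Seminar"   */
        double bw_rate;         /* bandwidth consumption rate (bytes/sec)  */
        double disk_rate;       /* disk space consumption rate (bytes/sec) */
        int    nsubtypes;       /* > 0 for composite types such as Seminar */
        struct ContentType *subtypes[4];
    } ContentType;

    typedef struct Disk {
        double bw_capacity;     /* total sustainable bandwidth (bytes/sec) */
        double bw_used;         /* bandwidth allocated to active streams   */
        long   space_free;      /* unused disk space (bytes)               */
        int    ncontent;        /* items of content stored on this disk    */
        int    content_ids[256];
    } Disk;

    typedef struct Msu {
        int  up;                /* cleared when the TCP connection breaks  */
        int  ndisks;
        Disk disks[8];
    } Msu;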
As the previous section stated, each item of content has a type. The content type entry contains a bandwidth consumption rate, which gives the expected rate at which content of this type is played and recorded. The Coordinator uses its databases to authenticate clients, satisfy their requests for the table of contents, and allocate resources as clients play or record content.

When Calliope receives a read request, the Coordinator finds an MSU with a disk that both contains the requested content and has enough bandwidth available to satisfy the request. As the Coordinator assigns resources to clients, it keeps track of load by processor and disk. If a client's request cannot be satisfied, the Coordinator queues the request until an MSU with the necessary resources becomes available.

The Coordinator informs an MSU of a scheduling decision by sending a message describing the client's request. Once the request is scheduled, the client does not communicate with the Coordinator until the stream is terminated. VCR commands between the client and Calliope go directly to the MSU. As soon as it is ready to deliver the content stream, the MSU establishes a control stream (TCP connection) with the client for these commands. After a ``quit'' command from the client, the MSU informs the Coordinator that the stream has been terminated.

If a client requests a composite content type, the Coordinator creates a stream group to play the content. A stream group has one member for each of the atomic subtypes of the composite type. For example, if a client played an item of the composite Seminar type described in the previous section, the RTP video and VAT audio subitems would become a two-item stream group. All streams in a stream group are controlled by the same VCR commands. Calliope assigns all streams in a group to the same MSU so that the client's commands can start and stop all streams simultaneously. Synchronizing the streams would be difficult if streams from the same group were assigned to different machines.

When a client records content on a Calliope server, the Coordinator must allocate disk storage as well as bandwidth. The client write request includes an estimate of the length of the recording (in seconds). The Coordinator uses this estimate and the content type information to determine how much disk space the recording will consume. It must schedule the request on an MSU that has both disk space and bandwidth available. If the client overestimates the length of the recording, the unused space is returned to the system once the recording session has completed.

To support variable-rate content, the content type table contains separate rates for disk space and bandwidth consumption. A constant-rate stream will consume disk bandwidth and disk space at the same rate. For data with a variable-rate encoding, the system should allocate bandwidth more conservatively than disk space: the bandwidth consumption rate should be closer to the stream's peak rate and the storage consumption rate closer to the average rate.

Calliope's Coordinator has some support for fault tolerance. The Coordinator detects that an MSU has failed by the break in the TCP connection between the Coordinator and that MSU. When an MSU is down, the Coordinator marks it as unavailable in the scheduling database. When the MSU becomes available again, it contacts the Coordinator and is restored to the scheduling database. Calliope does not recover from Coordinator failures.
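Read admission then reduces to a scan of these records. The following sketch uses the illustrative record types above; disk_holds_content and the bandwidth arithmetic are our own simplifications of the decision described in this section:

    /* Sketch of read admission, assuming the illustrative records above.
     * Returns the chosen disk, or NULL if the request must be queued.  */
    static int disk_holds_content(const Disk *d, int content_id)
    {
        for (int i = 0; i < d->ncontent; i++)
            if (d->content_ids[i] == content_id)
                return 1;
        return 0;
    }

    Disk *schedule_read(Msu *msus, int nmsus, int content_id, double bw_rate)
    {
        for (int m = 0; m < nmsus; m++) {
            if (!msus[m].up)
                continue;                      /* MSU failed; skip it */
            for (int d = 0; d < msus[m].ndisks; d++) {
                Disk *disk = &msus[m].disks[d];
                if (disk_holds_content(disk, content_id) &&
                    disk->bw_capacity - disk->bw_used >= bw_rate) {
                    disk->bw_used += bw_rate;  /* reserve the bandwidth */
                    return disk;
                }
            }
        }
        return NULL;  /* caller queues the request until resources free up */
    }

A write request would additionally check space_free against the client's length estimate multiplied by the type's disk-space consumption rate.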
2.2.1 Disk and Network Scheduling

Calliope's MSUs use double buffering and careful disk and network device scheduling to ensure that data is delivered at a regular rate and that peripheral devices are fully utilized. In the case of a read, double buffering means the network process is sending data to a client from one main memory buffer while the disk process is loading the other. When the network process empties its buffer, the disk process should already have filled the other buffer. The two processes swap buffers for the next cycle.

To allocate the bandwidth of a single disk, we give the disk a duty cycle which is divided into slots. Each slot is long enough to read or write a single disk block for one client stream. The number of slots in a cycle is the maximum number of block transfers that can be accomplished during the time it takes for a single stream to transmit its block. In other words, the cycle must be short enough that each client's disk transfer completes before the network transfer runs out of data to transmit.

In order to replay variable-rate data packets at the correct times, the network process constructs a delivery schedule as the data is recorded. The schedule associates delivery times with data packets. When variable-rate data is replayed, the network process uses the stored delivery schedules to determine when packets are delivered to the network. For constant bit-rate streams, the delivery schedule is calculated rather than stored. The delivery times in schedules are not absolute; they are offsets from the beginning of the recording session.

When it stores the delivery schedule and data on disk, Calliope interleaves them in a single file using a data structure similar to a primary B-tree [4]. In a primary B-tree, the file blocks contain both data and a search tree. The top parts of the search tree are stored in internal pages containing only keys and pointers to lower-level pages. The lower parts of the tree are stored in leaf pages with the data. In our case, the key for the search tree is delivery time. A sequential scan of the B-tree gives the data packets in the order they must be delivered to the network.

Calliope's variant on the B-tree is called the Integrated B-tree (IB-tree) because it integrates the internal pages into the data pages. Calliope's large disk blocks allow it to make internal pages smaller than leaf pages without increasing the height of the tree in most cases. We use 28 KByte internal pages (with 1024 keys) and 256 KByte data pages. As the B-tree is constructed, Calliope creates the internal pages and data pages in a manner similar to normal B-tree construction. When an internal page fills up, it is copied into the current data page instead of being written separately on disk. During seeks, Calliope traverses the internal pages of the search tree in the usual way. During sequential reads, the internal pages are read in as part of the data page but ignored. They are small and appear in only about 0.1% of the data pages, so they do not affect read bandwidth appreciably. On writes, the IB-tree writes both the data page and the internal page using a single disk transfer and seek. If the two pages were stored separately, the internal page writes would add slots to Calliope's disk duty cycle and the extra seeks would reduce disk utilization.
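Replaying a stored schedule reduces to a simple paced loop. The sketch below is illustrative (the entry format and names are our simplifications); delivery times are millisecond offsets from the start of the session, as described above:

    /* Simplified sketch of schedule-driven replay; entry format and
     * names are illustrative, not the network process's actual code. */
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    struct sched_entry { unsigned offset_ms; size_t len; };  /* one packet */

    static long elapsed_ms(const struct timeval *start)
    {
        struct timeval now;
        gettimeofday(&now, NULL);
        return (now.tv_sec - start->tv_sec) * 1000L +
               (now.tv_usec - start->tv_usec) / 1000L;
    }

    void replay(const struct sched_entry *sched, int n,
                const char *data, int sock)
    {
        struct timeval start;
        gettimeofday(&start, NULL);
        for (int i = 0; i < n; i++) {
            long wait = sched[i].offset_ms - elapsed_ms(&start);
            if (wait > 0)
                usleep(wait * 1000);     /* sleep granularity is coarse */
            send(sock, data, sched[i].len, 0);  /* late packets go ASAP */
            data += sched[i].len;
        }
    }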
Finally, Calliope does not use a real-time operating system and FreeBSD timers have only 10 ms granularity, so delivery times are only approximate. While timer granularity will introduce jitter, clients will have to be able to handle the jitter introduced by the multimedia delivery network anyway. We assume that clients have enough buffer space to smooth any jitter introduced by either the approximate scheduling or the intervening network. A 200 KByte buffer will hold more than one second of 1.5 Mbit/sec video. Calliope will not add more than 150 milliseconds of jitter in the worst case (see Section 3.2.1) and any network that introduces more than 850 milliseconds of jitter is probably not usable for video delivery.

2.3 MSU Software Architecture

Each MSU is a PC with a set of disks, an interface to Calliope's intra-server network, and an interface to the external high-speed network. The MSU runs a simple multi-process control program that assigns a process to each network device and disk while a central process handles RPCs from the Coordinator and from clients. The MSU processes must communicate in order to share resources and to start and stop streams. Instead of using expensive semaphore operations, the MSU processes communicate using a shared memory queue structure that relies on the atomicity of memory read and write instructions to produce atomic enqueue and dequeue operations (a sketch of such a queue appears below).

When the client starts a read stream, the MSU's disk process loads data from disk into a shared memory buffer. The network process then packetizes the buffer and sends it out through the high speed interface. The network process ensures that packet delivery proceeds on schedule. The disk process makes sure that the network process always has buffered data ready to send. When data is recorded, the network process fills buffers and the disk process writes full ones to disk. The description of Calliope's performance in Section 3 contains more detail about the MSU data path.
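A minimal sketch of one way such a queue can work, assuming a single producer and a single consumer per queue and word-sized indices whose loads and stores are atomic; the layout is our illustration, not Calliope's actual structure:

    /* Single-producer/single-consumer queue in shared memory.  Only the
     * producer writes `head' and only the consumer writes `tail', so the
     * word-sized loads and stores are the only atomicity required.      */
    #define QSIZE 64                       /* power of two */

    struct shm_queue {
        volatile unsigned head;            /* next slot to fill  (producer) */
        volatile unsigned tail;            /* next slot to drain (consumer) */
        void *slot[QSIZE];                 /* e.g. pointers to data buffers */
    };

    int enqueue(struct shm_queue *q, void *msg)      /* producer only */
    {
        if (q->head - q->tail == QSIZE)
            return 0;                      /* full */
        q->slot[q->head % QSIZE] = msg;    /* fill the slot, then publish */
        q->head = q->head + 1;             /* consumer sees slot only     */
        return 1;                          /* after head advances         */
    }

    void *dequeue(struct shm_queue *q)               /* consumer only */
    {
        if (q->tail == q->head)
            return 0;                      /* empty */
        void *msg = q->slot[q->tail % QSIZE];
        q->tail = q->tail + 1;
        return msg;
    }

This simple version depends on stores becoming visible in program order, which holds on the Pentium's strongly-ordered memory system; a processor that reorders writes would also need memory barriers between filling the slot and advancing the index.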
2.3.1 Fast Forward and Backward

The MSU cannot send a data stream to clients at a higher rate than the one at which it was received. To implement fast forward and fast backward scans, we used an offline filtering program. Thus far, we have only implemented filtering for MPEG streams. The filtering program reads the recorded stream, selects every fifteenth video frame, recompresses the filtered stream, and loads it into the server. For the fast-backward version, the frames are stored in the filtered stream in reverse order. This filtering procedure is not automatic in the current implementation; an administrator has to produce the fast forward and fast backward versions of the content. An administrative interface is used to load the fast forward and fast backward files into the server in a way that allows the server to associate the files with the fast forward and fast backward VCR commands.

The MSU remembers which files contain the normal rate, fast forward, and fast backward versions of the same content. If a client issues a command to switch from normal rate to fast forward, the MSU seeks to the frame in the fast forward file corresponding to the current frame of the normal rate file. The user will experience a few seconds of delay as the MSU waits for the client's next disk slot, reads the required block of the fast forward file into memory, and starts delivering frames to the network device. Switching back to normal rate follows the same procedure.

We briefly considered implementing a dynamic fast forward and fast backward mechanism that extracted the fast rate streams directly from the normal rate stream. In such a scheme, the MSU would use the schedule information to skip over frames, delivering only selected ones to the network. This scheme was impractical for two reasons. First, if the data stream uses inter-frame compression, some frames may not be safely skipped. In inter-frame compression, decoding a given frame requires information in other frames around it in the stream. If the stream used intra-frame compression or no compression, the MSU could send selected frames to the user. MPEG uses a combination of inter-frame and intra-frame compression. Most frames are inter-encoded, but intra-encoding is used for every N-th frame, where N is a parameter determined at the time of encoding (typically, fifteen to thirty). The MSU could implement fast forward for MPEG streams by sending only selected intra-encoded frames. However, the MPEG encoders that we have produce an opaque stream with no framing information. While recording, the MSU would have to search the stream to find the intra-coded frames, and parsing the MPEG stream is too expensive to do in real time.

The second reason we did not implement this scheme is that it would make disk scheduling hard. Skipping frames allows the MSU to send a fast forward stream using the same network resources as a normal-rate stream. However, fast forward delivery has a larger impact on disk usage than normal rate delivery. If the MSU reads from disk only the frames that it will send in the fast forward stream, it has to issue many small read requests instead of a few 256 KByte ones. This will significantly worsen disk performance. A more practical approach is to read all of the stream's frames from the disk and then skip over the unneeded frames once they are in memory. However, in this case, the MSU must read fast forward streams from disk at several times the normal stream rate. Allowing users to switch back and forth between high-rate and low-rate disk reads would complicate our scheduler.

2.3.2 MSU Extensibility

The MSU is designed so that support for new protocols can be added to the system easily. A ``protocol'' in this context is something no more complex than (for example) RTP---essentially a header definition and a few control messages. The MSU currently supports RTP [13], VAT [17] audio, some home-grown protocols, and any protocol and/or encoding which can be handled by transmitting fixed-size packets at a constant rate.

An MSU protocol extension module consists of two functions. The first performs any operations required by the protocol beyond the normal sending or receiving of data packets. For example, the RTP protocol uses two ports---one for control messages and one for data. The RTP module for the MSU manages the control socket. During recording, the RTP module interleaves the control messages with the rest of the data stream before the data is given to the disk process. On playback, the opposite process is performed. The MSU calls the second extension function during recording to construct a delivery schedule. This function creates a delivery time to use when the incoming packet is replayed. By default, the MSU derives the delivery time from the packet's arrival time. If there is a timestamp in the protocol's header, then a protocol extension function may derive the delivery time from the timestamp instead. Using the sender-generated protocol timestamp rather than the packet's arrival time has the advantage that it does not include the effects of network-induced jitter.
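A plausible shape for such a module, in C; the structure and names below are a sketch of the interface described above, not the MSU's actual code:

    /* Sketch of an MSU protocol extension module (names are ours). */
    #include <stddef.h>
    #include <sys/time.h>

    struct packet {
        struct timeval arrival;      /* when the MSU received the packet  */
        size_t         len;
        unsigned char  data[8192];   /* raw packet, protocol header first */
    };

    struct protocol_module {
        const char *name;
        /* 1: protocol-specific work beyond moving data packets, e.g. the
         *    RTP module interleaving control-port messages into the stream. */
        void (*aux_work)(struct packet *pkt, int recording);
        /* 2: called while recording; returns the delivery time for this
         *    packet as a millisecond offset from the start of the session. */
        unsigned (*delivery_time)(const struct packet *pkt,
                                  const struct timeval *session_start);
    };

    /* Default schedule function: derive delivery time from arrival time.
     * An RTP-style module could instead parse a timestamp out of pkt->data,
     * keeping network-induced jitter out of the stored schedule.           */
    static unsigned arrival_offset_ms(const struct packet *pkt,
                                      const struct timeval *start)
    {
        return (pkt->arrival.tv_sec - start->tv_sec) * 1000 +
               (pkt->arrival.tv_usec - start->tv_usec) / 1000;
    }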
2.3.3 MSU File System

The MSU has to manage files that are often large (a two hour MPEG-1 movie is 1.35 GByte) and are usually read and written sequentially. Instead of the BSD fast file system, the MSU uses a simple user-level file system tuned to the multimedia workload. The MSU file system does its own memory management and uses raw disk I/O to avoid user-to-kernel-space copying. Instead of a block cache, available main memory is organized into large buffers to support read ahead and write behind. Large disk blocks and large buffers mean that large amounts of data can be transferred per disk access, lessening the performance impact of disk seeks. With 256 KByte transfers, the MSU achieves 70% of the maximum disk transfer bandwidth. The large file block size also decreases the size of the file system meta-data to the point that it can be entirely cached in main memory.

An LRU block cache would impair performance because there is not enough data locality or sharing to make the cache effective. Different clients may view the same data at approximately the same time, but a 256 KByte buffer contains only about one second of 1.5 Mbit/sec MPEG-1 video. To share data in a block cache, clients would have to be synchronized to within a second of one another. Without some application-level coordination, this is unlikely to occur. Locality of reference by a single client is also unlikely. Clients tend to access multimedia files sequentially, so a client would rarely re-reference its own cached data.

The current implementation of the MSU does not employ disk head scheduling. The MSU services the customers for each disk in a round-robin fashion, resulting in random seeks between disk transfers. Further gains in disk performance could be realized by ordering the I/Os to minimize seek distances. We may eventually implement disk head scheduling, but we do not expect large performance improvements, for two reasons. First, disk rotation and head settling times contribute a substantial fraction of the total seek time; disk head scheduling will not have any effect on these. Second, the MSU's large block size already limits the effects of seeks on disk performance. Using a simple program that simulated 24 concurrent users reading random 256 KByte disk blocks, we found that an elevator scheduling algorithm improves throughput by only about 6% for our disks.

In the current implementation, Calliope's MSU does not stripe files over its disks. When a client writes a file, all blocks go to a single disk. It would be easy to lay out a file so that consecutive blocks are on ``adjacent'' disks. The disk process in this case would read or write blocks from its disks in a round-robin fashion. Disk bandwidth management is a little more complex in this case. The duty cycle must cover all n disks and contain n*s slots, where s is the number of slots in a single disk's duty cycle. When a client arrives at the MSU, it is allocated a disk slot and must wait at most n*s slots before the MSU begins to deliver data. The advantage of this kind of disk management is that the disks can still be utilized well even if the workload is unpredictable. If an MSU has n items of content striped across n identical disks, all of the system's customers can access any of the n items. If each of the n items were on a separate disk, only 1/n of the system's customers could access any one item of content. In the non-striped case, we can make copies of popular content on several disks, but we must anticipate usage trends in order to choose the content to copy. We must also use additional disk space to get additional disk bandwidth.
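The striped layout and its worst-case start-up delay can be sketched as follows; the block size is from the text, the names are ours:

    /* Striped layout: consecutive 256 KByte blocks of a file go to
     * "adjacent" disks, so every stream uses all n disks in turn.     */
    #define BLOCK_SIZE (256 * 1024)

    struct stripe_map {
        int n_disks;      /* n identical disks                  */
        int start_disk;   /* disk holding block 0 of this file  */
    };

    /* Disk that holds logical block `b' of the file. */
    int block_disk(const struct stripe_map *m, long b)
    {
        return (int)((m->start_disk + b) % m->n_disks);
    }

    /* Worst-case start-up delay, in slots: the duty cycle must cover
     * all n disks, so a new client waits at most n*s slots, where s
     * is the number of slots in one disk's duty cycle.               */
    long max_wait_slots(int n, int s)
    {
        return (long)n * s;
    }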
One disadvantage of striping is that the client must wait for a disk slot to become available every time it issues a VCR command. As in the non-striped case, the client waits at most as long as the duty cycle; if the MSU file system used striping, this delay would be n times as long as it is in the non-striped case. We initially felt that this delay would be unacceptable to our users. In retrospect, we were probably wrong.

The second disadvantage is that management of streams with different fixed rates, or a combination of variable and fixed rates, is harder in a striped environment. When all streams play at the same rate, controlling the rate at which users enter the system ensures that none of the disks becomes overloaded. If different files are consumed at different rates, the block size must vary by file, or clients will progress across the disks at different rates depending on the content they are playing or recording. Variable block sizes make the file system implementation more challenging, and allowing users to move from disk to disk at different rates would allow disks to become oversubscribed.

3 Performance

To demonstrate the scalability and capacity of the system, we have run several kinds of performance measurements. First, we measured the throughput capacity of our hardware by running simple programs that move data through the same data path as the MSU. These measurements give an approximate upper bound on the performance of the MSU for our hardware. Second, we estimated the MSU's throughput by observing how well packet delivery deadlines are met when the system delivers varying numbers of constant-rate and variable-rate streams. Finally, we measured the load on the internal network and Coordinator due to client requests in order to determine how many MSUs can be combined into a single system.

In our experiments, the MSU software runs under the FreeBSD [7] operating system version 2.0.5 on a 66 MHz Pentium PC from Micron Computer Corporation. The Micron has two busses: a fast PCI bus and a slower EISA bus. Each MSU contains one or more BusLogic EISA-bus fast-differential SCSI host adaptors, each having one or more 2 GByte Seagate Barracuda disks; 32 MBytes of main memory; a single SMC ISA-bus Ethernet interface to the internal network; and a DEC DEFPA PCI-bus FDDI interface to the high-speed network.

3.1 Baseline Measurements

In order to estimate the maximum potential throughput of Calliope, we measured the performance of several simple programs exercising memory, disks, and the network interface. Since these programs perform almost no computation, they measure the speed at which the PC hardware and operating system can move data from disk through memory and out the FDDI interface. The baseline tests use one process per device, as the MSU does. The disk process is a simple program that performs 256 KByte reads of the raw disk device at random offsets.
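In outline, the disk test looks like this (a simplified sketch rather than the exact program we ran):

    /* Random 256 KByte reads from a raw disk device; a simplified
     * sketch of the baseline disk test, not the exact program used. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define XFER    (256 * 1024)
    #define NREADS  500
    #define NBLOCKS (2000L * 1024 * 1024 / XFER)   /* ~2 GByte disk */

    int main(int argc, char **argv)
    {
        char *buf = malloc(XFER);
        struct timeval t0, t1;

        if (argc < 2) { fprintf(stderr, "usage: disktest rawdev\n"); return 1; }
        int fd = open(argv[1], O_RDONLY);          /* e.g. /dev/rsd0c */
        if (fd < 0 || buf == NULL) { perror("setup"); return 1; }

        gettimeofday(&t0, NULL);
        for (int i = 0; i < NREADS; i++) {
            off_t off = (off_t)(random() % NBLOCKS) * XFER;
            lseek(fd, off, SEEK_SET);               /* random seek     */
            if (read(fd, buf, XFER) != XFER) { perror("read"); return 1; }
        }
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.2f MByte/sec\n", NREADS * (double)XFER / secs / 1e6);
        return 0;
    }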
The network process is a slightly modified version of the ttcp [16] program, which sends data from memory to a peer process on a machine across the FDDI network. We changed ttcp so it did not touch the data before sending (since the MSU network I/O process (IOP) does not touch its data). We also changed it to send successive segments from a large buffer instead of repeatedly sending a smaller buffer. If a small buffer is used, ttcp reports artificially high performance because the buffer stays in the processor cache, speeding up the user-to-kernel-space copy. The arguments to ttcp were:

    ttcp -t -u -s -l 4096 -L 1m -n 100000

    -t         Transmit mode.
    -u         Use UDP instead of TCP.
    -l 4096    Send 4k packets (default is 8k).
    -n 100000  Send 100000 packets.
    -s         Send from memory, not stdin.
    -L 1m      Step through a 1 MB buffer while sending (we added this option).

Note that, while ttcp misreports network bandwidth for transmitting UDP packets under some versions of Unix, it is accurate when run on FreeBSD. Some systems silently discard packets when the interface output queue is full, leading ttcp to report inflated performance numbers. FreeBSD returns ENOBUFS when the interface output queue is full; ttcp then sleeps briefly and tries to send the packet again.

========================================================================
                 |  FDDI  |     Disks only    |    Disks and FDDI
                 |  only  | Disk1 Disk2 Disk3 | FDDI Disk1 Disk2 Disk3
------------------------------------------------------------------------
0 disk           |  8.5   |                   |
1 disk (one HBA) |        |  3.6              |  5.9  3.4
2 disk (one HBA) |        |  2.8   2.8        |  4.7  2.4   2.4
2 disk (two HBA) |        |  2.9   2.9        |  2.3  2.7   2.7
3 disk (two HBA) |        |  2.2   2.2   2.7  |  1.4  1.9   1.9   2.5
------------------------------------------------------------------------
Table 1: Baseline Performance Measurements. The measurements show the
throughputs of simple test programs writing to the FDDI interface and
reading from the disks (in MBytes/sec). The table is divided across the
top into three groups of experiments: FDDI only, disks only, and both
FDDI and disks running simultaneously. Within each group, each row of
the table shows the performance of each component that took part in a
particular run. Tests labeled ``one HBA'' (one SCSI Host Bus Adaptor)
had all disks on the same SCSI chain; those labeled ``two HBA'' had at
least one disk on each of two SCSI chains. See the text for a detailed
description of the test programs used.
========================================================================

Table 1 displays the results of the baseline measurements. According to the measurements, the highest throughput attainable with our particular hardware and operating system combination is 4.7 MByte/sec. (All of the measurements in this section are in 10^6 bytes/sec units.) This corresponds to the ``2 disk (one HBA)'' experiment, in which the FDDI wrote 4.7 MByte/sec and the two disks each read 2.4 MByte/sec. With only one disk running, the FDDI can go faster, but the disk then does not produce enough data to keep up with the network.

It is surprising that we could not achieve better performance with multiple SCSI host bus adaptors (HBAs) running simultaneously. As Table 1 shows, the FDDI performance is dramatically lower when running two disks on two separate HBAs versus two disks on the same HBA, while the throughput achieved by the disks in this case is only slightly improved. We believe that this effect is a hardware problem in either our motherboard or the HBAs. We first became aware of the problem when we noticed that the system clock slowed down if multiple HBAs were active. Upon further investigation, we discovered that the system was missing clock interrupts. It turned out that ``in'' and ``out'' instructions (which are needed both to service interrupts and to read the hardware timer used in FreeBSD to keep time) could take a very long time when two HBAs were running. Specifically, the sequence of instructions needed to read the hardware timer took approximately 4 microseconds with no disk activity; it occasionally took a millisecond with one HBA running, and often took 20 milliseconds with two HBAs running. To measure this phenomenon, we disabled interrupts and timed the code using the Pentium microprocessor's internal cycle counter. We worked around the bug by changing FreeBSD to keep time using the Pentium cycle counter (so that missed clock interrupts would not affect the time of day, though they might result in timer interrupts being late).
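In outline, the measurement works as follows; this is a simplified sketch of the technique using GCC-style inline assembly and the standard 8254 timer ports, not the exact code used (reading the I/O ports requires kernel or I/O privilege):

    /* Time an 8254 timer read with the Pentium cycle counter (RDTSC);
     * a simplified sketch of the measurement, not the exact code used. */
    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    static inline void outb(unsigned short port, unsigned char v)
    {
        __asm__ __volatile__("outb %0, %1" : : "a"(v), "Nd"(port));
    }

    static inline unsigned char inb(unsigned short port)
    {
        unsigned char v;
        __asm__ __volatile__("inb %1, %0" : "=a"(v) : "Nd"(port));
        return v;
    }

    /* With interrupts disabled, count the cycles one timer read costs. */
    unsigned long long time_timer_read(void)
    {
        unsigned long long t0 = rdtsc();
        outb(0x43, 0x00);            /* latch counter 0                 */
        (void)inb(0x40);             /* low byte                        */
        (void)inb(0x40);             /* high byte                       */
        return rdtsc() - t0;         /* cycles; at 66 MHz, 66 per usec  */
    }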
3.2 MSU Throughput

3.2.1 Constant-Rate Streams

Graph 1 shows how closely an MSU with 22, 23, and 24 constant-rate streams keeps to the real-time packet delivery schedule. In each experiment, the MSU delivered streams from two disks on one host bus adaptor for six minutes, and generated approximately 16480 four-KByte FDDI packets per stream. The graph's X-axis is the number of milliseconds a packet was sent after its deadline. The Y-axis gives the cumulative percentage of all packets delivered in the experiment that fall in each one-millisecond bin.

The measurements show that an MSU can deliver up to 22 1.5 MBit/sec streams with very good performance for each stream. In this case, only 0.4 percent of the packets are delivered more than 50 milliseconds late and no packets are more than 150 milliseconds late. As more streams are added, the quality first degrades gradually and then dramatically. With 24 streams, only 38 percent of the packets are delivered within 50 milliseconds of their deadline. This performance is approximately 90% of the baseline performance, indicating that the overhead introduced by the MSU for constant-rate streams is not excessive. The biggest performance problem of the system is the one discussed above in Section 3.1; the current baseline performance might be considerably improved with different hardware.

3.2.2 Variable-Rate Streams

Graph 2 shows how closely an MSU with 15, 16, and 17 variable-rate streams keeps to the real-time packet delivery schedule. For these tests, three different files encoded by NV [6] were used (so the files were played multiple times to produce the required number of streams). The three files used in the test had average rates of 650, 635, and 877 KBit/sec.

Note that the performance for variable-rate streams (by this metric) is substantially worse than that of the constant-rate streams. There are several reasons for this difference. First, the packet size of the variable-rate streams is much smaller than the four-KByte packets used for the constant-rate tests. Most of the packets in the streams are about one KByte long, so there is four times as much processing overhead for both the MSU code and the kernel networking code. Second, the video is quite bursty. NV encodes a frame and then sends it out as quickly as possible, resulting in bursts of back-to-back packets. Measured using a 50 millisecond sliding window, the peak rates of the files ranged from 2.0 to 5.4 MBit/sec. Due to the non-real-time nature of the MSU, it is impossible to preserve the exact timings for many streams with bursty traffic on this time scale; packets at the tails of the bursts will be delayed relative to their deadlines.

In addition, the clients in the variable-rate tests viewed only three different files, and all of the streams in the tests were started simultaneously. This means that every time a data file contained a burst of packets, one third of the streams in the test transmitted the burst at the same time. Thus, the MSU often had more packets to deliver simultaneously than the apparent data rates would suggest, and some packets were necessarily delivered late. Furthermore, when tested while transmitting only a single file, the MSU could only produce 11 streams instead of 15. This unrealistic scenario is a limitation of our automated test setup; in practice, clients do not start streams in synchrony, nor would there be such a limited selection of content.
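The post-processing that produces these cumulative distributions is straightforward; in outline (a simplified sketch, assuming a log with one per-packet lateness value in milliseconds per line):

    /* Cumulative lateness distribution: for each 1 ms bin, the percentage
     * of packets sent no more than that many ms after their deadline.
     * Sketch of the post-processing; the log format is our assumption.   */
    #include <stdio.h>

    #define MAXLATE 1000                     /* bins: 0..999 ms late */

    int main(void)
    {
        long bin[MAXLATE] = {0}, total = 0, late_ms;

        while (scanf("%ld", &late_ms) == 1) { /* one lateness value/line */
            if (late_ms < 0) late_ms = 0;     /* early/on-time packets   */
            if (late_ms >= MAXLATE) late_ms = MAXLATE - 1;
            bin[late_ms]++;
            total++;
        }
        if (total == 0) return 1;

        long cum = 0;
        for (int i = 0; i < MAXLATE; i++) {
            cum += bin[i];
            printf("%d %.2f\n", i, 100.0 * cum / total);
        }
        return 0;
    }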
3.2.3 Bottlenecks

At present, the bottleneck in our system is that we cannot make use of more than one SCSI host bus adaptor simultaneously, limiting the data rate to 4.7 MBytes/sec. The next bottleneck is memory bandwidth. Our system can read memory at 53 MByte/sec, write it at 25 MByte/sec, and copy at 18 MByte/sec. As the MSU reads a file from disk and sends it to a client, the data traces the following path through the memory of the MSU PC:

1. Write (DMA from disk to user memory in the raw disk read).
2. Copy (user space buffer to kernel mbuf in network send).
3. Read (UDP checksum).
4. Read (DMA to FDDI interface).

Therefore, the fastest rate at which our test system could move data along this path (assuming that DMA accesses memory at the same speed as the processor) is:

    1 / (1/25 + 1/18 + 2/53) = 7.5 MByte/sec

To measure this disk-less data path, we replaced the disk I/O process in our baseline measurements with a process that simply wrote constant values into memory buffers. Using this process setup, the system moved data at about 6.3 MByte/sec (that is, it sent UDP packets at that rate while another process simultaneously wrote memory buffers at the same rate). The difference between the maximum 7.5 MByte/sec and the measured value is due to other memory accesses occurring in addition to the data movement. In particular, we believe that instruction fetches account for much of the difference. All the code necessary for the MSU (and kernel) to move the data will not fit in the first-level instruction cache, and the second-level cache will be flushed by all the data movement.

3.3 Scalability

The Coordinator and internal network are the only shared resources in the system, so their capacity will eventually limit system size. During normal operation, most of the load on the Coordinator and network consists of client requests for new streams and stream termination notifications. In a real system, the maximum request rate depends on the number of MSUs in the system and the length of the average viewing session. To measure the effect of scheduling requests on shared resource loads, we created a fake MSU which, when scheduled, delays for 50 ms and then reports that the user has terminated the stream. We started two of these MSUs on different machines and two clients that together sent 10,000 requests to the Coordinator at a rate of about 60 requests per second. We measured the Coordinator's CPU utilization at 14% and the network utilization at 6%, both relatively insignificant loads. Even if sessions are as short as one minute, a large-scale installation of Calliope serving 3000 simultaneous streams (150 MSUs at 20 streams each) would need to service only 50 requests per second.

4 Related Work

Commercial network-based video and multimedia servers have become quite common in recent years.
Several large-scale distributed video servers --- including ones by DEC, Hewlett-Packard, Oracle, IBM, and Microsoft --- are in production or currently available. Starlight sells a smaller-scale, Ethernet-based centralized video server. While some of the commercial systems have designs similar to ours, the focus of this paper is on performance issues and implementation experience, which the commercial vendors have not published. Furthermore, many of the commercial systems rely on custom hardware and operating systems to achieve acceptable performance, while Calliope uses commodity hardware and OS software for greater portability and lower cost.

We are familiar with three research multimedia servers comparable to ours. Freedman and DeWitt [8] describe simulations of a video-on-demand system in which clients request data as their buffers empty rather than having the server deliver the data at a pre-determined rate. Smith [14] focuses on coding and protocol issues rather than on the architecture and scalability of the server itself. Gelman et al. [9] describe a multimedia server built from custom hardware. Their system includes an end-to-end network distribution system for video-on-demand that uses buffering in switches and specialized telephone line-cards.

Other groups have considered specific aspects of server design, most notably disk layout strategies. Lougher and Shepherd [11] describe a file system for striping multimedia data across several disks with file system meta-data on a separate disk. They concentrate on constant-rate streams but can support streams of different rates by varying disk I/O sizes so that reads happen at constant intervals. Kalns and Hsu [10] describe a video server implemented using the Vesta parallel file system on an IBM SP-1 parallel computer. It uses JPEG encoding at a constant frame rate but variable frame size, so some mapping from timestamp or frame to byte offset would be required in order to seek to a given frame or time offset, though the paper does not mention how seeking is implemented. Chang and Zakhor [3] describe a disk layout scheme for storing video encoded with a scalable, constant-rate coding on RAID disks in such a way as to serve as many clients as possible from the array. Vin and Rangan [18] analyze a method of interleaving data on disk to make the most of disk bandwidth, but do not address variable-rate streams.

Another class of systems with video support comprises several research and commercial DBMSs [15] [2] [1]. Some of these systems support sophisticated query-by-image-content features, but they do not support real-time data delivery --- the essential feature of Calliope.

5 Conclusions

In this paper, we have sketched the architecture and performance characteristics of Calliope---a distributed, scalable real-time storage system for multimedia data. Calliope consists of a collection of off-the-shelf PCs running Unix. One machine, designated the Coordinator, maintains a database of the stored content and serves as the point of contact for new clients. The rest of the machines are Multimedia Storage Units (MSUs) which record and play back real-time content for clients. Calliope is extensible in order to support many types of networks, protocols, and audio or video encodings. We have measured the performance of Calliope running on Pentium-based PCs under the FreeBSD operating system.
Our test system can support 22 simultaneous 1.5 MBit/sec constant-rate streams or 15 variable-rate streams (at average rates of 635 to 877 KBit/sec and peaks up to 5.4 MBit/sec). We have shown that inexpensive hardware and a general-purpose operating system can be used to transmit many audio and video streams with good performance.

Acknowledgements

We would like to thank Bob Gray, Peter Bates, Sid Devadhar, Ravi Jain, Ben Melamed, Andy Ogielski, Marc Pucci, Norman Ramsey, and Yatin Saraiya for comments and suggestions.

References

[1] A. Biliris. The performance of three database storage structures for managing large objects. In Proc. ACM SIGMOD Conference, San Diego, California, June 1992.

[2] M. J. Carey et al. Shoring up persistent applications. In Proc. ACM SIGMOD Conference, Minneapolis, MN, May 1994.

[3] E. Chang and A. Zakhor. Scalable video data placement on parallel disk arrays. In Storage and Retrieval for Image and Video Databases II, volume SPIE 2185, 1994.

[4] D. Comer. The ubiquitous B-Tree. ACM Computing Surveys, 11(4), 1979.

[5] H. Eriksson. MBone: The multicast backbone. Communications of the ACM, 37:54--60, 1994.

[6] R. Frederick. Experiences with real-time software video compression. In Proceedings of the Sixth International Workshop on Packet Video, pages F1.1--1.4, Portland, Oregon, September 1994.

[7] FreeBSD operating system. Information and source code available via HTTP from www.freebsd.org.

[8] C. Freedman and D. DeWitt. The SPIFFI scalable video-on-demand system. In Proc. ACM SIGMOD Conference, pages 352--363, 1995.

[9] A. Gelman, H. Hobrinski, L. Smoot, S. Weinstein, M. Fortier, and D. Lemay. A store and forward architecture for video on demand service. In Proceedings of the IEEE ICC'91, Denver, Colorado, 1991.

[10] E. Kalns and Y. Hsu. Video on demand using the Vesta parallel file system. In Third Annual Workshop on Input/Output in Parallel and Distributed Systems, April 1995.

[11] P. Lougher and D. Shepherd. The design of a storage server for continuous media. The Computer Journal, 36(1):32--42, 1993.

[12] Coding of moving pictures and associated audio (MPEG). Int'l Organization for Standardization/Int'l Electrotechnical Institute, September 1990. ISO/IEC JTC1/SC2/WG11.

[13] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. RTP: A transport protocol for real-time applications. Internet draft: draft-ietf-avt-rtp-07, Internet Engineering Task Force, March 1995.

[14] B. Smith. Implementation techniques for continuous media systems and applications. Technical Report UCB/CSD 94/845, Computer Science Division---EECS, U.C. Berkeley, December 1994.

[15] M. Stonebraker and M. Olson. Large object support in POSTGRES. In Ninth International Conference on Data Engineering, 1993.

[16] TTCP network performance measurement program. Available for anonymous ftp from ftp.sgi.com:sgi/src/ttcp.

[17] VAT audioconferencing program. Available for anonymous ftp from ftp.ee.lbl.gov:conferencing/vat.

[18] H. Vin and P. V. Rangan. Designing a multiuser HDTV storage server. IEEE Journal on Selected Areas in Communications, 11(1):153--164, January 1993.

Biographies

Andrew Heybey (ath@bellcore.com) received his BS and MS in EECS from the Massachusetts Institute of Technology. He worked for several years at MIT before joining Bellcore in 1993. His interests include networks, real-time protocols, operating systems, and computer architecture.
Mark Sullivan (sullivan@bellcore.com) received his BS in Math Sciences from Stanford University and MS/PhD in Computer Science from the University of California at Berkeley. Since 1992, he has worked as a parallel computing researcher at Bellcore. His interests include database management systems, networks, fault tolerance, and poetry.

Paul England (england@bellcore.com) holds a BS in physics from the University of Birmingham, and a PhD in physics from Imperial College London. He has been a Research Scientist at Bellcore since 1986 and has worked extensively on the design and characterization of novel semiconducting, superconducting and optoelectronic devices. His current interests include applications and architectures for networked multimedia.