The following paper was originally published in the Proceedings of the USENIX Summer 1993 Technical Conference, Cincinnati, Ohio, June 21-25, 1993.

sfs: A Parallel File System for the CM-5*

Susan J. LoVerso, Marshall Isman, Andy Nanopoulos,
William Nesheim, Ewan D. Milne, Richard Wheeler

CM-5 Operating System Group
Thinking Machines Corporation
245 First Street
Cambridge, MA 02142-1264

* CM-5 is a registered trademark of Thinking Machines Corporation. Unix is a registered trademark of Unix System Laboratories. NFS, SunOS, SPARC and SPARCstation are trademarks of Sun Microsystems, Inc.

Abstract

This paper describes the creation of a Unix-compatible file system with highly scalable performance and size. The file system runs on the CM-5 and is backed by a scalable array of disks. Using the Unix file system (UFS) from the SunOS 4.1.2 kernel as a base and modifying it to support Connection Machine (CM) operations, we have created a new file system, the scalable file system, or sfs. We discuss the CM operations we support, such as parallel reads and writes to the processing nodes of the Connection Machine, the use of NFS to support many partitions of processing nodes on the CM, support for very large (64-bit) file sizes, and support for odd numbers of disk drives. We also present the tradeoffs and decisions made during the course of this project, as well as performance data for varying numbers of disk drives.

1 Introduction

The data parallel programming model is the predominant programming paradigm on the CM-5. Typical applications deal with tremendously large data sets. When hundreds or thousands of SPARC processors require access to this data simultaneously, a high bandwidth I/O mechanism is necessary. The CM-5 requires a mass storage system that can support terabytes of data and transfer speeds of hundreds of megabytes per second. With today's technology, no single disk can fulfill those requirements; by using multiple disks and striping the data across all of them, we can meet the data rates required. We have created a highly scalable disk array (SDA) that provides great flexibility in I/O capabilities as well as flexibility in file system organization. Systems with the hardware and software described in this paper are currently running at dozens of customer sites.

CMOST, the operating system that controls the CM-5, derives directly from SunOS 4.1.2. It runs on the control processor (CP), a SPARCstation that controls access to and use of the Connection Machine processing nodes (PNs). The control processor regulates access to I/O devices from the PNs. Since the CP runs a derivative of SunOS, we wanted our file system to be Unix-compatible. Using UFS[5, 3] as a starting point, we created our own file system, called sfs, which is backed by the SDA. One of the primary functions that sfs serves is to provide the CP with the information necessary to allow parallel I/O from the processing nodes to the SDA.
UFS was inadequate for our needs in several ways:

1. It does not support file system block sizes and fragment sizes that are not a power of 2.
2. It does not support our parallel calls.
3. It does not support files larger than 2 gigabytes.
4. For files that are many megabytes in size, it produces highly fragmented files.

Due to the volume of changes required to support these four items, we chose to create a new file system type instead of making the changes in UFS.

In this paper, Section 2 presents a general hardware overview of the CM-5 and Section 3 gives an overview of the SDA architecture. These give the reader an understanding of how the hardware we are dealing with impacts the decisions we made. Section 4 discusses the basic mechanics of parallel I/O and how it interacts with the rest of the system. Section 5 continues with a discussion of the support and design changes made in the sfs file system, and Section 6 covers the corresponding changes to NFS. Section 7 presents performance data demonstrating the capability of the system. Section 8 concludes with a summary and future work.

2 CM-5 Architecture Overview

2.1 Hardware Architecture

The CM-5 may have tens to thousands of processing nodes. Each node has its own memory and may be executing programs written in either SIMD style or MIMD style. Access to the nodes is supervised by the Control Processor.

Figure 1: CM-5 Hardware Overview

Two scalable communication networks connect all CPs and PNs, as shown in Figure 1. The Control Network handles communications that affect or involve all the nodes, such as broadcasts or synchronization operations. The Data Network handles data traffic - generally point-to-point operations. The SDA is also attached to these networks. The PNs and CPs communicate via the Control Network, and all the I/O related activity occurs over the Data Network. The network design provides low latency for transmissions to near neighboring addresses, while preserving a high, predictable bandwidth for more distant communications[10]. The CP on which the SDA file system resides is designated as the IOCP. High speed external networks, such as HIPPI, are connected directly to the Data Network. Slower external networks, such as Ethernet or FDDI, connect into the CM-5 via the control processor.

In order to provide a scalable high performance disk subsystem, one must spread data across multiple disks. For protection of data against disk failures, the SDA uses a Redundant Array of Inexpensive Disks (RAID)[7, 9] model. The Data Network within the CM-5 carries 20-byte packets for I/O; a 4-byte header precedes 16 bytes of data. A RAID level 3 implementation implies that a small amount of data is written to each disk, and parity data is written for each disk stripe of data. With our packet size, it is natural to choose 16 bytes as the amount of data written to each disk before moving on to the next. With each stripe of data, 16 bytes of parity information is stored on an additional parity disk. The parity disk allows reconstruction of any data which is missing due to a disk failure. This RAID-3 implementation is geared toward reads and writes of large amounts of data.

Figure 2: Partitioning the CM-5

The processing nodes of a CM-5 system can be configured into one or more partitions, as shown in Figure 2. Each partition is assigned a partition manager - a CP that bears the responsibility for managing the processes executing in that partition.
The operating system configures the Control Network and Data Network to match the partition structure. The I/O devices and IOCPs reside in areas of the network address space which are outside any specific partition. Therefore, the I/O devices appear as a global shared resource. There is always additional network capacity for carrying data between partitions and I/O devices. Hence, system-wide data traffic does not interfere with or impede traffic that stays within a partition.

2.2 Software Architecture

A parallel program runs within a partition consisting of a CP and a set of PNs. When performing file system operations, the CP is the client requesting the I/O operation and the IOCP is the file server. The client/server model is well suited to this environment because there can be many partitions sharing the same file system. The control part of the transaction is handled by the NFS protocol. (See Section 6 for more details.) The data, however, flows directly between the PNs and the I/O devices over the Data Network. The separation of control and data is done because of the very different latency versus bandwidth tradeoffs required by the two types of data movement. The separation of control and data is similar to the mechanism used on the CM-2 I/O system[2] and the IEEE Mass Storage System Reference Model[1].

A program called the time sharing daemon runs on the CP and controls processes requesting CM-5 resources. It works in conjunction with the Unix kernel and schedules processes to run on the PNs. In some ways, it can be considered an extension of the kernel. When a CM-5 process requests an I/O operation to the SDA, the time sharing daemon acts on its behalf. The time sharing daemon, using the user's file descriptor, makes a parallel I/O system call to the kernel to access the SDA via the file system. However, the time sharing daemon and the PNs are wholly responsible for the actual I/O operation. This file system transaction will be covered in more detail in Section 4 and Section 5. Now let's take a closer look at the architecture of the SDA.

3 SDA Architecture

One or more units known as Disk Storage Nodes (DSN) comprise the CM-5 disk subsystem. As illustrated in Figure 3, each node contains eight 1.2 gigabyte SCSI disk drives, four SCSI channels, an 8 Mb data buffer, a 20 Mb/second Network Interface and a SPARC processor with its own 8 Mb memory. Each DSN runs a controller microkernel that mediates the I/O between the CM-5 and the drives. DSNs are packaged in units of three as backplanes. Therefore, systems have disks in quantities of 24, 48, 72, etc. Typical configurations might use 22 or 46 or 70 data disks, one parity disk, and one spare disk. By adding more DSNs, you add a balanced amount of bandwidth from the disk through the network interface. By increasing the number of Data Network addresses, which happens automatically when you add more DSNs, you increase the overall Data Network bandwidth as well. As shown in Figure 1 in Section 2, the SDAs are connected to the Data Network. Each DSN occupies enough network address space to guarantee its sustained 20 Mb/second transfer rate to anywhere in the CM-5.

3.1 Logical Devices

The file system interacts with an abstract virtual disk implemented on top of a set of disks. This set of disks is termed a logical device. A Logical Device (LD) consists of an arbitrary number of physical disks grouped together into a RAID 3 disk system. Data is striped 16 bytes per disk.
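To make this concrete, the fragment below is a minimal sketch of the striping and parity arithmetic just described, written with invented helper names (ld_map, ld_parity); it is not the DSN controller code, and it ignores buffering, spare disks and error handling.

    /*
     * Sketch only: map a logical byte offset onto a 16-byte-per-disk
     * stripe, and compute the parity unit as the bytewise XOR of the
     * data units in one stripe.  Names and layout are assumptions.
     */
    #include <stdint.h>
    #include <string.h>

    #define STRIPE_UNIT 16                     /* bytes of data per disk per stripe */

    /* Which data disk, and which offset on that disk, holds a logical byte. */
    static void ld_map(uint64_t logical_off, unsigned ndata_disks,
                       unsigned *disk, uint64_t *disk_off)
    {
        uint64_t unit = logical_off / STRIPE_UNIT;        /* 16-byte unit number */
        *disk     = (unsigned)(unit % ndata_disks);       /* round-robin across disks */
        *disk_off = (unit / ndata_disks) * STRIPE_UNIT + logical_off % STRIPE_UNIT;
    }

    /* Parity unit for one disk stripe: XOR of the 16-byte units on each data disk. */
    static void ld_parity(const uint8_t units[][STRIPE_UNIT], unsigned ndata_disks,
                          uint8_t parity[STRIPE_UNIT])
    {
        memset(parity, 0, STRIPE_UNIT);
        for (unsigned d = 0; d < ndata_disks; d++)
            for (unsigned i = 0; i < STRIPE_UNIT; i++)
                parity[i] ^= units[d][i];
    }

A missing disk's 16-byte unit can be rebuilt the same way, by XORing the parity unit with the surviving data units.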
The LD appears as one large disk to the file system. The system administrator edits an ASCII file to select which physical disks become members of a logical device. There may currently be up to 255 data disks and an optional parity disk in the LD. The configuration of the logical devices is downloaded to the CMOST kernel and to all the DSN kernels running on each of the DSN controllers.

The configuration information describes each logical device. There are four components of the system that require this information:

1. The device driver in the CMOST kernel.
2. The DSN microkernel.
3. The time sharing daemon.
4. The PN microkernel.

The "master" version of the configuration information resides in the CMOST kernel. All other components make queries to the kernel to retrieve it. We use a timestamp mechanism[4] to maintain consistency of this information across the system. An example of some of the information maintained in the configuration tables includes:

o The number of disks in the logical device.
o The network addresses of those disks.
o Whether this logical device has a parity drive.
o The timestamp associated with the logical device.
o A flag indicating the device's state (such as online).
o The network address of the IOCP for this device.
o The dev_t representing this device.

Logical devices are global across the CM-5. Lookups of this information can be done using the dev_t.

Figure 3: DSN board

3.2 DSN Microkernel

The SPARC processor on the DSN board runs code which serves as the disk controller firmware for the drives connected to that DSN board. The controller microkernel translates I/O requests for a logical device into specific SCSI controller commands. If a logical device spans more than one DSN, then one DSN is the master of the LD and the other DSNs are slaves. An I/O request is directed to the master DSN, which forwards the command to the slaves.

For a read operation, the master and slaves transfer their data asynchronously. As the slaves finish, they send a done message to the master. The master sends a done message to the source when all DSNs finish. For a write operation, the master and each slave enable their hardware buffers to receive the data. Each slave sends a message to the master indicating how much data it can receive. The master sums this information and sends the total to the source. The source then transmits the data to each DSN. This process iterates until all the data is transmitted. If any of the DSNs detect a disk error, then the done message sent to the source indicates the logical disk in error and the block where the error occurred.

4 Parallel I/O

This section describes parallel I/O in general. The current implementation, described below, uses the time sharing daemon to dispatch a user's parallel I/O request. Scheduling constraints dictated that the time sharing daemon handle parallel I/O. Our eventual goal is to move parallel I/O completely into the kernel and remove some of the latency and overhead. There are several other issues, such as supervisor access to the Control Network, that make this work more than trivial.

Figure 4: Beginning of Parallel I/O
Figure 5: Transferring the Data
Figure 6: Finishing the I/O

A parallel I/O request really contains three distinct operations. The first, illustrated in Figure 4, shows the time sharing daemon asking the file system for the list of disk blocks relevant to the I/O request.
The second portion, shown in Figure 5, indicates that once the I/O is set up, the time sharing daemon tells the PNs to commence the I/O, and a high bandwidth transfer takes place directly between the SDA and the PNs. The final operation, seen in Figure 6, cleans up the remainder of the I/O. The SDA sends an acknowledgment to the time sharing daemon, which then finishes the I/O with the file system. The file system unlocks the inode, increments the file offset and cleans up from the I/O.

The major drawback in this scheme is the latency involved in initiating a parallel I/O operation. This overhead is due to the setup time involved in passing the time sharing daemon the user's file descriptor, and to the time spent by the time sharing daemon retrieving the block information from the kernel and then coordinating the transfer between the DSNs and the processing nodes. Once the data starts flowing between the SDA and the processing nodes, the disks perform quite reasonably, as described in Section 7.

5 Scalable File System

Since we began with a goal of a Unix-compatible file system, it seemed logical to begin with the UFS sources and modify them to suit the needs of our system. Basically, four major areas needed support and modification in our new file system:

1. Support for large files.
2. Support for our parallel read and write calls.
3. Handling of file system block sizes that are not a power of two.
4. Slight modifications to block allocation.

Our intent was to preserve the current performance of UFS and not impact the standard read or write paths with our changes. We can boot an sfs file system on a Sun workstation on a SCSI disk. Timing tests show that our file system performs the same as UFS when running to the same disk. While not surprising, it was pleasing to know we had no negative impact on general performance in the file system.

5.1 Large Files

Supercomputer applications typically deal with very large files. Data sets of tens of gigabytes are not uncommon. Therefore, file system support for very large files is imperative. While the Berkeley Fast File System itself does nothing to restrict file sizes, higher level file I/O support within SunOS limits file sizes and offsets to sizes representable within 32 bits. In order to support very large files, several changes were required to SunOS above the file system level. One change was to add a vfs flag, VFS_BIGFILES, indicating that the file system supports large files. Other changes relate to maintenance of position within a file, and to representation of the absolute size of a file.

We address file offset maintenance first. In order to keep track of file offsets not representable in 32 bits, we added a new fundamental system data type which we called a longlong_t. This structure is a composite type of two 32-bit integers laid out most significant word first, for compatibility with the GNU C compiler (gcc). An additional type, offset_t, is defined simply as a longlong_t for clarity when dealing with file offsets. By including an ANSI long long within the type, user space applications that include kernel header files can be compiled with an ANSI compiler, allowing them to use long long arithmetic directly. We did not feel comfortable switching the compiler used to build the Unix kernel itself at this point, so kernel operations on longlong_t data structures are handled explicitly in the code.
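The paper describes the type but does not show its declaration; the following is a minimal sketch of one plausible layout, assuming a 32-bit long (as on the SPARC targets of the day) and invented member and guard names. Only the names longlong_t and offset_t come from the text.

    /*
     * Illustrative sketch, not the actual SunOS/CMOST header.  Assumes
     * long is 32 bits; the guard macro and member names are invented.
     */
    typedef union longlong {
    #ifdef __GNUC__                     /* compilers that provide long long */
        long long     _ll;              /* 64-bit arithmetic for ANSI/gcc user code */
    #endif
        struct {
            long          val_hi;       /* most significant 32-bit word first */
            unsigned long val_lo;       /* least significant 32-bit word */
        } _words;
    } longlong_t;

    typedef longlong_t offset_t;        /* 64-bit file offsets, for clarity */

Kernel code, still built with the pre-ANSI compiler, manipulates the two words explicitly, while user programs compiled with gcc can use the long long member directly.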
Only two system data structures required changes to support large file offsets. First, the file offset field within the file structure was changed from an off_t to an offset_t. The uio_offset field within struct uio was also increased to 64 bits. By adding macros to each modified header file defining the old structure element names to refer to the low order word of the new 64-bit offset fields, we could minimize the changes required to the system outside the system call layer and the SDA file system itself.

In addition to extending the data structures, minor changes were required to several routines at the file and vnode layers of the kernel. Any routines above the file system layer which manipulated file offsets needed to be taught about the new 64-bit types. Specific examples of code requiring slight changes include uiomove() and seek(). We also added a new llseek() system call to allow programs to manipulate file position beyond 2 gigabytes.

Slight changes to the core system were also necessary to handle very large files. In SunOS, file system attribute information is contained in a struct vattr, which carries stat information in a file system independent format. The file size component of this structure was extended to 64 bits. As with file offsets, we took care here to allow existing kernel code to continue to access the low order portion of the file size field with the same name, eliminating the need for extensive changes within non-sfs file system code.

Special enhancements for sfs were required in performing serial I/O when the current seek position was not representable in 32 bits. SunOS performs serial I/O operations through the VM system. The high level file system entrypoints (the VOP_RDWR routines) perform any necessary block allocation, and then copy data between user space and a kernel virtual memory segment which is backed by the file system. For each page within this segment, the VM system tracks the inode and file offset mapped by this page. Unfortunately, this file offset is stored as a 32-bit quantity. Extending this field to 64 bits required much more extensive surgery on the SunOS VM system than we cared to undertake. Instead, we chose to handle serial file I/O differently for offsets requiring 64-bit representation. The sfs serial I/O routine (SFS_rwip) handles serial I/O for file offsets up to 2 Gb exactly as is done within the SunOS UFS file system. By following this path we can support mapped files and text paging within the sfs file system just as within UFS, as long as the file offset remains representable within 32 bits. For large files a different path is taken. I/O is done through the buffer cache rather than through the VM system. The buffer cache tracks blocks by their block number in DEV_BSIZE (512 byte) units. By maintaining the cache by block number we effectively get 8 additional bits of file offset maintenance in the cache, allowing sfs to support read and write operations on file offsets up to 2**39-1. Mmap operations on files at offsets not representable in 32 bits are not supported.

5.2 Parallel I/O

One of the key features of the SDA file system is its ability to support data parallel I/O operations. Parallel I/O is implemented as a combination of code within the time sharing daemon and file system support code. The file system provides the mapping between file name and block lists, and handles all serial I/O. The time sharing daemon actually performs all parallel I/O operations. Two new system calls were added to support this function. CM_read_raw() and CM_write_raw() are very similar in that both calls return lists of blocks to the time sharing daemon, with only CM_write_raw() performing block allocation. Each of these calls is executable by the time sharing daemon process only.

The time sharing daemon performs parallel I/O operations on behalf of user programs. Using a file descriptor passed from the application, the time sharing daemon calls into sfs through the CM_read_raw() and CM_write_raw() system calls to obtain a list of blocks involved in the I/O operation. The first call to CM_read_raw() or CM_write_raw() places a lock on the inode involved. This lock is held for the duration of the entire parallel I/O operation, being released only when the time sharing daemon finishes the parallel I/O by calling CM_read_raw() or CM_write_raw() again.

A parallel I/O operation generally involves a large amount of data. Since it uses the existing UFS on-disk structure, sfs represents files as lists of individual blocks. But unlike standard UFS disk drivers, the SDA is optimized for transferring large contiguous block ranges, not for individual block reads or writes. The CM_read_raw() and CM_write_raw() system calls support this by coalescing the blocks returned by the file system block lookup routines into lists. The block lists returned to the time sharing daemon are thus effectively a set of extents to be used in performing the I/O operations[6].
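The paper does not give the declarations behind these calls, but a short sketch suggests the shape of the data involved: an extent is just a starting block and a length, and a file's block list can be coalesced into extents in a single pass. The structure, field names and helper below are assumptions for illustration, not the sfs interfaces.

    /*
     * Illustrative only: one plausible form for the extent lists handed
     * back to the time sharing daemon, and a loop that collapses a list
     * of individual block numbers into contiguous extents.
     */
    #include <stddef.h>
    #include <stdint.h>

    typedef int64_t blkno_t;                /* 64-bit file system block number */

    struct sfs_extent {
        blkno_t start;                      /* first block of a contiguous run */
        size_t  nblocks;                    /* length of the run in blocks */
    };

    /* Returns the number of extents written to ext[] (at most maxext). */
    static size_t coalesce_blocks(const blkno_t *blocks, size_t nblocks,
                                  struct sfs_extent *ext, size_t maxext)
    {
        size_t next = 0;

        for (size_t i = 0; i < nblocks; i++) {
            if (next > 0 &&
                blocks[i] == ext[next - 1].start + (blkno_t)ext[next - 1].nblocks) {
                ext[next - 1].nblocks++;    /* extends the current run */
            } else if (next < maxext) {
                ext[next].start   = blocks[i];
                ext[next].nblocks = 1;      /* start a new extent */
                next++;
            } else {
                break;                      /* extent array full */
            }
        }
        return next;
    }

A file whose blocks were allocated contiguously thus collapses to a single extent, which is exactly the case the SDA handles best.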
5.3 Odd File System Block Sizes

We use the term block stripe to refer to one 512 byte block on each data disk. The term disk stripe is used to describe a logical chunk of data spread evenly with 16 bytes of data on each disk in the array. Therefore, the block stripe size of a 46 data drive logical device is 23 Kbytes, and the disk stripe size of such a configuration is 736 bytes.

UFS was designed for disk systems where the file system logical blocksize is some power of two multiple of the fundamental disk block size. With sfs we were faced with the situation where the fundamental disk block size could be substantially larger than the desired logical file system blocksize, and could under many conditions bear no reasonable mathematical relationship to it whatsoever. A number of changes were required to support this.

Adding support for odd blocksizes involved changes in two areas: file system macros, and serial read and write processing. First, all the file system macros which converted bytes to blocks, blocks to fragments, and so on required modification to use division and modulus instead of shifts and masks. These changes were fairly straightforward and did not involve major changes to the file system code itself. In benchmarking the system on a Sun 4/300 with a SCSI disk, we found that this had little or no impact on file system performance.

Enhancing the serial I/O routines to handle odd block sizes required more effort. As mentioned earlier, SunOS performs serial I/O operations through the VM system. The high level file system entrypoints (the VOP_RDWR routines) perform any necessary block allocation, and then copy data between user space and a kernel virtual memory segment which is backed by the file system. The copy causes a fault, resulting in the file system page handling routines (VOP_GETPAGE or VOP_PUTPAGE) being called to perform the actual I/O. But by this time the I/O operation has been broken up into some sequence of PAGESIZE units, regardless of any effort by the user application to do I/O in file system blocksize units! Pages do not map nicely onto file system blocks which are not a power-of-two number of bytes in size. The challenge in supporting serial I/O efficiently with odd blocksizes is to group page I/O's together to form some integral number of full block I/O's. The sfs getpage and putpage routines attempt to do disk I/O in chunks of the least common multiple of the machine pagesize and the file system blocksize.
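A short sketch of the arithmetic involved follows; the macro and helper names are modeled on the UFS style but are invented for this example, not taken from the sfs sources.

    /*
     * Illustrative only: byte-to-block conversions written with division
     * and modulus rather than the UFS shift/mask forms, plus the least
     * common multiple used to size getpage/putpage I/O chunks.
     */
    #include <stdint.h>

    #define SFS_LBLKNO(bsize, off)  ((off) / (bsize))   /* logical block containing off */
    #define SFS_BLKOFF(bsize, off)  ((off) % (bsize))   /* offset within that block */

    static uint64_t gcd(uint64_t a, uint64_t b)
    {
        while (b != 0) {
            uint64_t t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /* Smallest size that is both a whole number of pages and a whole
     * number of file system blocks. */
    static uint64_t sfs_iochunk(uint64_t pagesize, uint64_t blocksize)
    {
        return pagesize / gcd(pagesize, blocksize) * blocksize;
    }

With a 4 Kbyte page, for example, and the 23 Kbyte block of a 46 data disk logical device, sfs_iochunk() returns 92 Kbytes: 23 pages, or four file system blocks.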
A second problem is that a single page can span several logical blocks within a file. A single page may require two separate disk I/O's to complete. By limiting the minimum file system blocksize to be at least as large as the system pagesize, we avoid the situation where three I/O's may be required to handle a page.

5.4 Block Allocation

The sfs uses the UFS on-disk structure and allocation policies. When rotational delay is zero, the UFS policies reduce approximately to a cylinder group based extent allocator. The file system will attempt to allocate new blocks immediately following the last block in an existing file. The UFS policies generally allocate blocks for different files from different cylinder groups, with the exception of files within the same directory, which are allocated together.

This policy, in combination with extent coalescing in the parallel I/O routines, suits sfs fairly well. Problems arise, however, when simultaneous file extensions are done to files being allocated within the same cylinder group. Under this situation the files are not allocated as extents but instead have their blocks interleaved with one another. For parallel I/O operations this results in a single I/O requiring many extents, greatly reducing performance. Although work in this area is not yet complete, an approach involving an allocation lock applied on a cylinder group basis for the duration of a file extension operation has shown some potential for increasing the average size of extents allocated under these conditions.

6 Network File System

The CM-5 has several SPARCstations acting as partition managers. Therefore, it seemed natural to have partition managers mount sfs file systems over NFS[8]. Several reasons motivated this choice over implementing our own mechanism for remotely accessing the file system. Foremost, NFS is a widely used and well understood method. The extensions necessary to support our parallel calls would be minimal. Also, since sfs is Unix-compatible, users on any workstation could remotely mount and access the SDA.

At mount time, the partition manager, as a client, determines whether the remote file system is a CM-5 file system. This information is maintained in a bit in the mntinfo structure. If the bit is set, the parallel I/O system calls are allowed. Otherwise an error condition is returned to the user.

Two new RPC calls were added to support the CM_read_raw() and CM_write_raw() parallel I/O system calls from the time sharing daemon. Since the parallel I/O calls split into a "begin" and an "end" half, the new RPCs, RFS_PARIOB and RFS_PARIOE, are functionally split along those same lines. The first returns the disk block information to the time sharing daemon, and the second completes the operation. The same locking issues that are present in sfs apply here as well for the rnode. The rnode is locked when the "begin" call completes, and it does not become unlocked until the "end" call. Additional RPC calls were necessary to handle file sizes greater than 2 gigabytes due to increased sizes in some of the message fields.
We needed to support 64-bit size fields while maintaining compatibility with existing NFS protocols. Five calls required these changes:

1. Getting attributes, RFS_GETATTRL.
2. Setting attributes, RFS_SETATTRL.
3. Reading, RFS_READL.
4. Writing, RFS_WRITEL.
5. Lookup operations, RFS_LOOKUPL.

We could not change the existing versions of those calls, because then our NFS would not be compatible with the rest of the world for standard Unix calls. So, we added new 64-bit versions of those calls. The need for 64-bit versions of the first four calls is straightforward. Lookups needed to support 64-bit files because the result of a lookup operation returns attributes about the file it looked up, including the file size. As mentioned in Section 5, each mount point may have a flag set indicating it supports "big" files. If that flag is set, then the client uses the 64-bit versions of those calls where appropriate.

7 Performance

Our goal was to provide a scalable file system capable of running on top of the SDA within a CM-5 system. The performance goal for the first release of the sfs software was to obtain a rate of 1.5 Mb/second per disk on read operations and a rate of 1 Mb/second per disk on write operations across a wide range of disk configurations. We have achieved these performance goals on file systems spanning from 16 to 118 disk drives. Although a system with 118 data drives was the largest we had available to run the tests, the hardware and software systems can support much larger configurations.

Our test measured the amount of time it took to execute the parallel read and write system calls. Each performance test was run ten times for each data point. The highest and lowest performance results were discarded and the other eight trials were averaged and plotted here. All the performance numbers presented in this paper were measured reading and writing regular files, not using the raw device. Each test was run on a partition with 128 nodes. The system was up and running in multi-user mode, although our program was the only parallel process running.

A logical device generally contains data drives, one parity drive, and one spare drive to use in case of a disk failure. Therefore, one to five backplanes of SDAs typically contain 22, 46, 70, 94 and 118 data disks, respectively. We also show performance numbers for power of two data disk configurations containing 16 and 32 data disks. The disks are SCSI-based IBM Model 0663E15 drives, which are able to sustain raw transfers at approximately 2 Mb/second on reads and about 1.8 Mb/second on writes.

Figure 7: Read performance with 128 nodes

Figure 7 shows how the effective transfer rate for read operations varies according to the size of the transfer for file systems utilizing various disk configurations. The graph shows that read performance improves as we continue to add disk drives over a range from 16 to 118 drives. The highest read transfer rate achieved was over 185 Mb/second when using 118 data drives. Each curve periodically shows a slight drop in performance at transfer sizes that are somewhere between 64 and 256 Mb. This slight decrease in performance at those specific values is due to the additional overhead involved in switching to a different indirect block. File systems with a power of 2 number of disks have a file system block size of 16Kb. This allows for 4096 block addresses in each indirect block. Each indirect block can thus address 16Kb x 4096 = 64Mb of data. This is most clearly shown in the line for 32 disk drives. For the file systems with a non power of two number of data disks, the file system block size is somewhere between 16Kb and 32Kb, depending on the number of disks.
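The arithmetic behind those dips is easy to reproduce. The short program below assumes 4-byte block addresses in an indirect block (so the address count is the block size divided by four) and uses invented helper names; it prints the stripe sizes from Section 5.3 and the span of file data covered by one indirect block.

    /*
     * Rough arithmetic only: stripe sizes for a given number of data
     * disks, and the file data addressed by one indirect block,
     * assuming 4-byte block addresses.
     */
    #include <stdio.h>
    #include <stdint.h>

    static void stripe_sizes(unsigned data_disks)
    {
        printf("%u data disks: disk stripe %u bytes, block stripe %u bytes\n",
               data_disks, 16 * data_disks, 512 * data_disks);
    }

    static void indirect_coverage(uint64_t bsize)
    {
        uint64_t naddr = bsize / 4;                 /* 4-byte block addresses */
        printf("block size %llu Kb: one indirect block maps about %llu Mb\n",
               (unsigned long long)(bsize >> 10),
               (unsigned long long)((bsize * naddr) >> 20));
    }

    int main(void)
    {
        stripe_sizes(46);           /* 736-byte disk stripe, 23 Kbyte block stripe */
        indirect_coverage(16384);   /* 16 Kb blocks: 4096 addresses, 64 Mb */
        indirect_coverage(23552);   /* 23 Kb blocks: 5888 addresses, about 132 Mb */
        return 0;
    }

Each time a transfer crosses one of these indirect block boundaries, the extra lookup shows up as the periodic dip in Figures 7 and 8.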
Notice that for small transfers, larger disk configurations perform more slowly. This is shown by the crossing lines of the graphs near the origin. This behavior reflects additional latency in the larger configurations, due to their having additional DSN controllers and hence needing to send more messages. The next release of the system software will decrease this latency, with the goal of eliminating any differences in performance at the small transfer sizes.

Figure 8: Write performance with 128 nodes

Figure 8 shows how the effective transfer rate for write operations varies according to the size of the transfer for file systems utilizing various disk configurations. Results are presented for the same disk configurations as reads. The shapes of the curves are similar between reads and writes, with the write performance topping out around 1 Mb/second per disk while the read performance tops out around 1.5 Mb/second per disk. Write performance is less than read performance for a number of reasons, including the fact that the physical disks perform reads faster than writes, and also the buffer size limitations on the DSN boards. Again notice the slight dips due to indirect block access. Latency for large disk configurations causes the lines to cross near the origin for writes as well.

Figure 9: Per-disk read and write performance with 128 nodes

Figure 9 shows how the maximum effective transfer rate per disk for read and write operations varies with the number of disks in the file system. The amount of data transferred per disk is held constant at 2 Mb for this graph (i.e., 32Mb for 16 disks, 44Mb for 22 disks, etc.). An optimum line for this graph would be a perfectly flat horizontal line. Our data shows nearly optimum scaling. This graph clearly shows that we have achieved scalable file system performance across almost an order of magnitude change in the number of disk drives.

The performance of the SDA file system is determined by several factors, including the time required to execute the file system code on the CP and the time required to process the request on the DSN controllers. These times remain relatively constant regardless of the partition size and disk configuration. The number of disks used and the number of PNs in the partition determine the effective transfer rate, which allows the performance of the file system to scale according to the size of the machine. By allocating blocks contiguously we are able to sustain transfer rates which are a significant percentage of the raw disk transfer rate.

8 Conclusions and Future Work

We have shown that you can build a Unix-compatible file system which is truly scalable in both size and performance. For large data transfers, 118 disks have been used in parallel to achieve extremely high transfer rates on a single I/O. Modifying the disk block allocation algorithms to give us large contiguous extents gave us excellent performance results while not adversely affecting the standard read and write paths. We have shown that we have met our performance goals for reads and writes.

We recognize some limitations and latency issues in our implementation. We would like to move the parallel I/O system calls into the kernel and remove the overhead of having the time sharing daemon coordinate the I/O. We would like to address the latencies involved in coordinating many DSN boards.
Obviously, we are also investigating new hardware - faster and bigger disk drives - to increase the performance of the system.

9 Acknowledgments

We would like to thank John Smith, Tom Moser, Eric Rowe, Soroush Shakib, Eric Sharakan, Andre Vignos, Roger Lee, Mark Bromley and David Taylor. Their help and contributions to this project were key to its success.

References

[1] S. Coleman and S. Miller, editors. Mass Storage System Reference Model: Version 4. IEEE Technical Committee on Mass Storage Systems and Technology, May 1990.

[2] B. Kahle, W. Nesheim, and M. Isman. Unix and the Connection Machine Operating System. In Workshop Proceedings, USENIX Workshop on Unix and Supercomputers, pages 93-107, Pittsburgh, PA, 1988. USENIX Association.

[3] S. Leffler, M. K. McKusick, M. Karels, and J. Quarterman. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, 1989.

[4] S. LoVerso, N. Paciorek, A. Langerman, and G. Feinberg. The OSF/1 Unix Filesystem (UFS). In Conference Proceedings, 1991 Winter USENIX Technical Conference, pages 207-218, El Toro, CA, 1991. USENIX Association.

[5] M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems, 2:181-197, August 1984.

[6] L. W. McVoy and S. R. Kleiman. Extent-like Performance from a UNIX File System. In Conference Proceedings, 1991 Winter USENIX Technical Conference, pages 33-43, El Toro, CA, 1991. USENIX Association.

[7] D. Patterson, G. Gibson, and R. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Conference Proceedings, 1988 SIGMOD Conference, pages 109-116. Association for Computing Machinery (ACM), 1988.

[8] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and R. Lyon. Design and Implementation of the Sun Network Filesystem. In Conference Proceedings, 1985 Summer USENIX Technical Conference, pages 119-130, Berkeley, CA, 1985. USENIX Association.

[9] M. Schulze. Considerations in the Design of a RAID Prototype. Master's Report, University of California at Berkeley, Berkeley, CA, 1988.

[10] Thinking Machines Corporation, Cambridge, Massachusetts 02142-1264. The Connection Machine CM-5 Technical Summary, 1991.

10 Author Information

Sue LoVerso is a senior software engineer at Thinking Machines Corporation working on the CM-5 Operating System. Prior to joining Thinking Machines, she was employed at Encore Computer Corporation for several years as a member of the Mach Operating System group. She received her Master's degree in Computer Science from the State University of New York at Buffalo. Sue is a member of the IEEE Computer Society and the Society of Women Engineers. She can be contacted at sue@think.com.

Marshall Isman has been with Thinking Machines Corporation since 1986 and is a manager of Operating Systems. He has been involved in the architecture, design and implementation of the CM-2 and CM-5 I/O systems and operating systems. Prior to joining Thinking Machines Corporation, Marshall worked at Computer Consoles and Bell Telephone Laboratories. Marshall holds a BS from the State University of New York at Albany and an MS in Computer Science from Harvard University. He can be contacted at marshall@think.com.

Andy Nanopoulos is a senior software engineer at Thinking Machines Corporation and was responsible for the design and implementation of the DSN software on the SDA. He has a BS in Computer Engineering from Boston University.
Before coming to TMC he was president of Yorfrenz Development Corporation. He can be contacted at andyn@think.com.

Bill Nesheim holds a BS from Cornell University. He has been with Thinking Machines since 1986, where he is currently manager of I/O system software. Prior to joining Thinking Machines he was a research programmer at the Cornell University Computer Science department. He can be contacted at nesheim@think.com.

Ewan Milne is an OS engineer working on CMOST development. He is currently working on the SDA project. Prior to joining Thinking Machines in October of 1992, he worked for 8 years at Prime Computer Inc., where he was a Principal Engineer in the OS group. Ewan studied Computer Science at Rensselaer Polytechnic Institute and is currently completing his Bachelor's degree at Boston University. He can be contacted at milne@think.com.

Rick Wheeler has a BA from Brandeis University and an MSc from Hebrew University. Before joining Thinking Machines in 1990, he was a member of the staff at the Israel Air Force Institute and Hebrew University's Computer Science Department. He can be contacted at ric@think.com.