1
Introduction
Farsite [1] is a serverless, distributed file system that
logically functions as a centralized file server but is physically distributed
among a network of desktop workstations. Farsite provides the
location-transparent file access and reliable data storage commonly provided by
a central file server, without a central file server’s infrastructure costs,
susceptibility to geographically localized faults, and reliance on the
error-freedom and trustworthiness of system administrators.
Farsite replaces the physical security of a locked room with
the virtual security of cryptography and Byzantine-fault tolerance (BFT) [7]. It replaces high-reliability server hardware with
distributed redundant storage, computation, and communication provided by the
unused resources of existing desktop machines. And it replaces central system
administrators with autonomic processes that transparently migrate replicas of
file content and file-system metadata to adapt to variations in applied load,
changes in machines’ availability, and the arrival and departure of machines.
In 2002, we published a paper [1] describing a Farsite system that provides a Windows
file-system interface via a file-system driver, that stores encrypted replicas
of file content on remote machines, that autonomically rearranges these
replicas to maintain file availability [11], that provides transparent access to remotely stored
file content, that caches content on clients for efficiency, and that lazily propagates
file updates to remote replicas. This implementation also includes a directory
service that manages file-system metadata, issues leases to clients, and
receives metadata updates from clients. The directory service tolerates
malicious clients by validating metadata updates before applying them to the
persistent metadata state. In addition, the directory service tolerates
failures or malice in the machines on which it runs, by replicating its
execution on a BFT group of machines [7].
However, because every machine in a BFT group redundantly
executes the directory-service code, the group has no more throughput than a
single machine does, irrespective of the group size. Since this directory
service runs on a single BFT group, the rate at which the group can process
file-system metadata governs the maximum scale of the system. By contrast,
every other Farsite subsystem avoids scalability limitations either by implicit
parallelism or by relaxed consistency that is tolerable because it is concealed
by the directory service.
This paper presents a new design, implementation, and
evaluation of a directory service that is scalable and seamlessly distributed.
This work involves several novel techniques for partitioning file-system
metadata, maintaining name-space consistency, coordinating multi-machine
operations, and mitigating metadata hotspots. Experiments in a modest-sized
configuration show that our techniques enable the system to sustain throughput
that is proportional to system scale.
Our contributions include the following techniques:
·
tree-structured file identifiers
·
multi-machine operation protocol
·
recursive path leases
·
file-field leases
·
disjunctive leases
Our contributions also include the
end result, namely:
·
a file system that seamlessly distributes and
dynamically repartitions metadata among multiple machines
Section 2 overviews the Farsite system. Section 3 describes
the design decisions that drove the techniques detailed in Section 4 on
metadata partitioning and Section 5 on hotspot mitigation. Section 6 reports on
design lessons we learned, and Section 7 describes our implementation. Section
8 experimentally evaluates the system. Sections 9 and 10 describe future work
and related work. Section 11 summarizes and concludes.
2
Farsite overview
This section briefly summarizes the
goals, assumptions, and basic system design of Farsite. Many more details can
be found in our 2002 paper [1].
2.1
Goals
Farsite is Windows-based file system
intended to replace the central file servers in a large corporation or
university, which can serve as many as ~105 desktop clients [4]. Of Farsite’s many design goals, the key goal for the
present work is minimizing human system administration, which is not only a
significant cost for large server systems but also a major cause of system
failure [26]. In this paper, we address the specific sub-goal of automated load balancing.
2.2
Assumptions
Farsite is intended for wired
campus-area networks, wherein all machines can communicate with each other
within a small number of milliseconds, and network topology can be mostly
ignored. Farsite expects that machines may fail and that network connections
may be flaky, but it is not designed to gracefully handle long-term
disconnections.
Farsite’s intended
workload is that of normal desktop systems, which exhibit high access
locality, a low rate of updates that survive subsequent overwrites, and
infrequent, small-scale read/write sharing [20, 41]. It is not intended for high-performance parallel
I/O.
2.3
Basic system design
This section briefly overviews the
aspects of the Farsite system relevant to the metadata service, completely
ignoring issues relating to file content.
Each machine functions in two roles, a client that performs operations for its local user and a server that provides directory service
to clients. To perform an operation, a client obtains a copy of the relevant
metadata from the server, along with a lease [15] over the metadata. For the duration of the lease,
the client has an authoritative value for the metadata. If the lease is a write
lease, then the server temporarily gives up authority over the metadata,
because the lease permits the client to modify the metadata.
After the client performs the metadata operation, it lazily
propagates its metadata updates to the server. To make it easier for servers to
validate a client’s updates, the updates are described as operations rather
than as deltas to metadata state.
When the load on a server becomes excessive, the server
selects another machine in the system and delegates
a portion of its metadata to this machine.
From the perspective of the directory service, there is
little distinction between directories and data files, other than directories
have no content and data files may not have children. For simplicity, we refer
to them both as “files” except when clarity dictates otherwise.
3
Directory service design decisions
This section motivates our design
decisions and describes why we did not follow through with many of the design
ideas enumerated in our 2002 paper.
3.1
Fully functional rename
A fundamental decision was to support a fully functional rename
operation. Decades of experience by Unix and Windows users have shown that
fully functional rename is part of what makes a hierarchical file system a
valuable tool for organizing data. In fact, rename is the only operation that
makes non-trivial use of the name space’s hierarchy, by atomically changing the
paths of every file in a subtree while preserving open handles. Without rename,
the file-system structure is only slightly more involved than a flat name space
wherein the path separator (‘/’ in Unix, ‘\’ in Windows) is just another
character. It is slightly more involved because, for example, one can create a
file only if its parent exists, so in the flattened name space, one could
create a new name only if a constrained prefix of the name already exists.
Some prior distributed file systems have divided the name
space into user-visible partitions and disallowed renames across partitions;
examples include volumes in AFS [19], shares in Dfs [25], and domains in Sprite [27]. Anecdotal evidence suggests this is quite tolerable
as long as the system administrator exercises care when selecting files for each
partition: “[T]hey need to be closely
related, in order for rename to be useful, and there should be enough of them
so that the restriction on rename is not perceived by users as a problem [34].” However, in keeping with our goal of minimizing
system administration, we shied away from user-visible partitions, which demand
such careful organization [6]. Instead, we designed our metadata service to
present a semantically seamless name space.
3.2
Name-space consistency
Given a fully functional rename
operation, name-space consistency is necessary to avoid permanently
disconnecting parts of the file system, as a simplified example illustrates.
Fig. 1a shows a file-system tree in which every file is managed by a different
server. Client X renames file C to be
a child of file G, as shown in Fig. 1b, and Client Y independently renames file F to be a child of file D, as shown in
Fig. 1c. No single server is directly involved in both rename operations, and
each independent rename is legal. Yet, if both renames were allowed to proceed,
the result would be an orphaned loop, as shown in Fig. 1d.
Once we had a mechanism for scalable name-space consistency,
it seemed reasonable to use this mechanism to provide a consistent name space
for all path-based operations, not merely for renames.


3.3
Long-term client disconnection
Although Farsite is not intended to
gracefully support long-term client disconnections, we still need to decide
what to do when such disconnections occur. Since we employ a lease-based
framework, we attach expiration times – typically a few hours – to all leases
issued by servers. When a lease expires, the server reclaims its authority over
the leased metadata.
If a client had performed operations under the authority of
this lease, these operations are effectively undone when the lease expires.
Farsite thus permits lease expiration to cause metadata loss, not merely file
content loss as in previous distributed file systems. However, these situations
are not as radically different as they may first appear, because the
user-visible semantics depend on how applications use files.
For example, Microsoft's email program Outlook [28] stores all of its data, including application-level
metadata, in a single file. By contrast, maildir [40] uses the file system’s name space to maintain email
folder metadata. Under Outlook, content-lease expiration can cause email folder
metadata updates to get lost; whereas under maildir, expiring a file-system
metadata lease can cause the same behavior.
3.4
Prior design ideas
Our 2002 paper [1] envisioned a distributed directory service, for
which it presented several ideas that seemed reasonable in abstract, but which
proved problematic as we developed a detailed design.
Our intent had been
to partition file metadata among servers according to file path names. Each
client would maintain a cache of mappings from path names to their
managing servers, similar to a Sprite prefix table [43]. The client could verify the authority of the server
over the path name by evaluating a chain of delegation certificates extending
back to the root server. To diffuse metadata hotspots, servers would issue
stale snapshots instead of leases when the client load got too high, and
servers would lazily propagate the result of rename operations throughout the
name space.
Partitioning by path name complicates renames across
partitions, as detailed in the next section. In the absence of path-name
delegation, name-based prefix tables are inappropriate. Similarly, if
partitioning is not based on names, consistently resolving a path name requires
access to metadata from all files along a path, so delegation certificates are
unhelpful for scalability. Stale snapshots and lazy rename propagation allow
the name space to become inconsistent, which can cause orphaned loops, as
described above in Section 3.2. For these reasons, we abandoned these design
ideas.
4
Metadata partitioning
In building a distributed metadata
service, the most fundamental decision is how to partition the metadata among
servers. One approach is to partition by file path name, as in Sprite [27] and the precursor to AFS [32], wherein each server manages files in a designated
region of name space. However, this implies that when a file is renamed, either
the file’s metadata must migrate to the server that manages the destination
name, or authority over the destination name must be delegated to the server
that manages the file’s metadata. The former approach – migration without
delegation – is insufficient, because if a large subtree is being renamed, it may
be not be manageable by a single server. The latter approach may be plausible,
but coupling delegation to rename restricts the system’s ability to use
delegation for its primary purpose of load balancing. It also precludes clean
software layering, wherein a file’s semantic attributes, including its name,
are built on an abstraction that hides a file’s mobility.
The problems with partitioning by path name arise from a
name’s mutability. We avoid these problems by partitioning according to file
identifiers, which are not mutable. Our file identifiers have a tree structure,
which supports arbitrarily fine-grained partitioning (§ 4.1). This
partitioning is not user-visible, because operations can span multiple servers
(§ 4.2). Despite partitioning, the name space is kept consistent – in a
scalable fashion – by means of recursive path leases (§ 4.3).
4.1
Tree-structured file identifiers
Partitioning by file identifier is
not original to Farsite; it is also the approach taken by AFS [19] and xFS [2]. All three systems need to efficiently store
and retrieve information on which server manages each identifier. AFS addresses
this problem with volumes [34], and xFS addresses it with a similar unnamed
abstraction.
An AFS file identifier is the concatenation of two
components: The “volume identifier,”
which indicates the volume to which the file belongs, and the “file number,”
which uniquely identifies a file within the volume. All files in a volume
reside on the same server. Volumes can be dynamically
relocated among servers, but files cannot be dynamically repartitioned among
volumes without reassigning file identifiers. An xFS “file index number”
is similar; we defer more detailed discussion to Section 10 on related work.
4.1.2 File
identifier design issues
A main design goal for Farsite’s
file identifiers was to enable metadata repartitioning without reassigning file
identifiers. Specifically, we considered four issues:
- To maximize delegation-policy freedom, regions of
identifier space should be partitionable with arbitrary granularity.
- To permit growth, each server should manage an
unbounded region of file-identifier space.
- For efficiency, file identifiers should have a
compact representation.
- Also for efficiency, the (dynamic) mapping from file
identifiers to servers should be stored in a time- and space-efficient
structure.
4.1.3 Abstract
structure
Abstractly, a file identifier is a
sequence of positive integers, wherein the null sequence identifies the
file-system root. Each server manages identifiers beginning with a specified
prefix, except for those it has explicitly delegated away to other servers.
Fig. 2 shows a system of six servers, A
– F. The root server, A, manages all files except those whose
identifiers are prefixed by á1ñ, á3ñ, or á4ñ; server B
manages all files with identifiers prefixed by á1ñ but not by á1.3ñ; and
so forth. The file identifier space is thus a tree, and servers manage subtrees
of this space.

At any moment, some portion of each file identifier
determines which server manages the file; however, the size of this portion is
not fixed over time, unlike AFS file identifiers. For example, in the partitioning
of Fig. 2, a file with identifier á1.3.3.1ñ is managed by server D, because the longest delegated prefix of á1.3.3.1ñ is á1.3ñ. If
server D later delegates prefix á1.3.3ñ, the
file’s managing server will be determined by the first three integers in the
file identifier, namely