Clue Tables: A Distributed, Dynamic-Binding Naming Mechanism Cheng-Zen Yang, Chih-Chung Chen, and Yen-Jen Oyang Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan, R.O.C. ABSTRACT This paper presents a distributed, dynamic naming mechanism called clue tables for building highly scalable, highly available distributed file systems. The clue tables naming mechanism is distinctive in three aspects. First, it is designed to cope well with the hierarchical structure of the modern large-scale computer networks. Second, it implicitly carries out load balancing among servers to improve system scalability. Third, it supports file replication and dynamically designates a primary copy to resolve possible data inconsistency. This paper also reports a performance evaluation of the clue tables mechanism when compared with NFS, a popular distributed file system. 1 Introduction Distributed file systems are the backbone of the modern network computing environment. The naming mechanism in a distributed file system maps the logical name of each individual file to its physical location. In the design of a modern distributed file system, availability and scalability are two essential concerns [1 , 2]. In order to build a highly available, highly scalable distributed file system, a designer must incorporate a naming mechanism that can cope with these concerns. To meet the demands, a naming mechanism must be distributed in nature, and support file replication and dynamic binding. The naming mechanism must be distributed in nature because centralized naming mechanisms suffer low scalability due to limited capacity of the central naming server. The naming mechanism must support file replication because file replication improves both availability and scalability of the system. The presence of replicated file copies prevents service disruption due to failure of a single file server and thus improves system availability. With file replication, the scalability of the system is improved because clients can access replicated file copies on different servers to avoid congestion of a particular file server. To maximize the benefits of file replication, the naming mechanism must support dynamic binding. With dynamic binding, the clients that initially turn to a crashed server for file service can establish new connections on-the-fly to other servers that have replicated copies of the files. Also, clients can dynamically select a server for binding to achieve a good load balancing among servers. In this paper, we propose a new distributed, dynamic-binding naming mechanism called clue tables. The clue tables mechanism offers the basis to build a highly available, highly scalable distributed file system and is distinctive in three aspects: 1. It is designed to cope with the hierarchical structure of modern computer networks - In a modern computer network, particularly a large-scale computer network, bridges are commonly installed _____________________________________________________ This research was sponsored by National Science Council of R.O.C. under grant NSC 83-0408-E-002-002 1 to partition the network into a number of clusters. For example, the network in a research institute may be partitioned so that the computers in each laboratory form a local cluster. The hierarchy of network partitions may extend over several levels. The major distinction of the clue tables mechanism is that it was designed to cope with the hierarchical network structure. With clue tables, we can make clients turn first to local file servers to locate a file. If a client can not find the file on local servers, or the local servers that store the file are unavailable, e.g. crashed, then the client will automatically go one level up in the network hierarchy to locate the file on remote servers. The main reason behind adopting this practice is to make most bindings among clients and servers occur in conformity with network hierarchy design. In reality, this is the main distinction between the clue tables mechanism and the similar prefix tables mechanism [3 , 4]. With prefix tables, a client that is bound to a remote server for file service due to failure of local servers will not automatically switch back to local servers upon successful recovery of local servers. As a result, a client may still rely heavily on a remote server for file service for a long period of time after a crashed local server is back to work again. On the other hand, with clue tables, a client will always turn to local servers first to locate a file when it starts a new file session, i.e. opens a file. This guarantees that the bindings among clients and servers occur in conformity with network hierarchy design whenever possible. 2. It implicitly carries out load balancing among servers to improve system scalability - The clue tables mechanism implements dynamic binding between clients and servers upon file open to achieve load balancing among servers. This is the main distinction of the clue tables mechanism when compared with other distributed file systems such as AFS[1 , 5 ], Coda[6 ], Locus[7 ], V kernel[8 ], Amoeba[9 ], Ficus[10 ], and Deceit[11 ] that also implement dynamic binding. With clue tables, a client, upon opening a file, multicasts access requests to servers that have a replica of the file. If more than one server acknowledges the request, the client always chooses the server that responds fastest for binding. With this practice, the clue tables mechanism implicitly carries out load balancing among servers since a server with a lighter load has a better chance to respond faster than a server with a heavier load. Though it is not guaranteed that the server with the lightest load among multiple candidates always responds fastest, a near-optimum load balancing situation should persist most of time. 3. It supports file replication and dynamically designates a primary file copy to resolve possible data inconsistency - One crucial issue with file replication is how to maintain data consistency. The clue tables mechanism resolves this issue by dynamically designating a primary file copy. It is the dynamic nature and granularity of the mechanism that distinguishes the clue tables mechanism from other distributed file systems that also employ a primary copy based approach, e.g. Locus[7 ]. The main reason behind employing the dynamic approach with file-level granularity is to distribute servers' load. With clue tables, when one or more clients attempt to write to a file, one server is dynamically designated as the primary server and all accesses to the file are temporarily forwarded to the primary server. Once the situation that could cause data inconsistency no longer exists, the primary server broadcasts the new version of the file to other servers. In the following part of the paper, section 2 describes the basic structure of clue tables. Section 3 elaborates on the system operations with clue tables. Section 4 addresses the implementation and performance issues. Section 5 concludes this paper. 2 Basic Structure of Clue Tables The clue tables mechanism implements a global naming space. The primitive entities in the clue tables mechanism are file collections termed domains. A domain is a subtree in the integrated file system of a distributed system. A file server can contain one or more domains while a domain cannot spread over multiple file servers. A domain must be entirely stored on one file server. However, we may have replicated copies of a domain stored in a number of servers. Fig. 1 illustrates the naming architecture with clue tables. Each file in a domain is an object composed of two attributes: logical file name and physical location. Each file is uniquely identified by its logical file name in the integrated file system. The physical location attribute specifies where the file resides. Clue tables are the directories that the clients refer to for locating a file based on its logical file name. Fig. 2 shows an example of clue tables. A clue table contains a number of entries, each of which corresponds to a domain in the integrated distributed file system. Figure 1: The naming architecture. Domain X is stored on servers A and B. Domain Y is stored on server C. Domain Z is stored on server B. ___________________________________________________________________________________________________________________________________* *__ For example, in Fig. 2, there are two entries. The first entry corresponds to the domain with root "/usr" and the second entry is corresponds to the domain with root "/usr/bin". A clue table entry specifies the file servers that have a copy of the domain. For example, the domain with root "/usr" has three replicated copies on file servers solar , earth , and global . With file replication, the availability and scalability of the system is significantly enhanced. The multiple servers listed in a clue table entry are grouped and prioritized. For example, in the first entry of Fig. 2, servers solar and earth form the first group, which is separated from the second group by a semicolon. The second group contains only one server, global . When trying to locate a file, a client first turns to the servers in the first group. If all the servers in this group are unavailable, e.g. crashed or unreachable due to network failure, the client then turns to the second group and so on. The motivation to group and prioritize servers in the list is to make the bindings among clients and servers occur in conformity with network hierarchy design whenever possible. A clue table entry is overridden by another entry when the second entry is with root a subdirectory of the first domain. For example, in Fig. 2, the second entry, corresponding to the domain rooted by "/usr/bin", overrides the first entry. When a client tries to locate a file, it first searches the clue table for the longest prefix of the domain that matches the filename. The client then sends requests to the servers in the list. A clue aliasing mechanism is incorporated to provide more flexibility in system integration. The second entry in Fig. 2 shows an example of clue aliasing. When the client accesses directory "/usr/bin" and sends a request to server csm (see Fig. 3), it is actually accessing directory "/usr/sparc/bin" on csm . Through clue aliasing, a directory on a server can appear to have different names on different clients. This adds desirable flexibility to system integration. Certain rules apply to the creation of clue tables in a distributed file system: 1. Each node, a client or server, in a distributed system should have a clue table. A clue table can be shared by two or more nodes but each node must have access to one clue table. The clue table of a node can be stored in a node's local disk if it has one. If the node is diskless, then the clue table is stored in a remote server and is ________________________________________________________________________________|||| | | | /usr :rep=3: # three replicated copies. | | | | solar, # server name list. | | earth; | | | | global. # second group. | | | | /usr/bin :rep=3: # three replicated copies. | | earth, | | | | solar, | | csm=/usr/sparc/bin.# clue aliasing. | | | | .. | | . | | | |_______________________________________________________________________________ | Figure 2: An example of clue tables and clue aliasing. ___________________________________________________________________________________________________________________________________* *__ cached by the node in the memory while it is operating. 2. Each node may have some private entries in its own clue table. One good use of this flexibility is the creation of the private "/tmp" directory. If a client has local disks, it may be more appropriate, from the performance aspect, to place temporary files created by this client on its local disks rather than on remote servers. Even if the client does not have a local disk, it may share a localized "/tmp" directory with other diskless clients in the local cluster. 3. If a domain is shared by multiple nodes, all the nodes must contain the same set of servers in their clue table entries corresponding to this domain. The grouping and prioritizing of these servers may be different, reflecting each node's physical position in the network hierarchy. However, the set of servers in the entries must be identical. Otherwise, data consistency among replicas of the domain cannot be maintained. (Detailed discussion on data consistency guarantees is presented in next section.) 4. When creating a clue table for a node, we should group and prioritize the servers according to their proximity to the node in the network hierarchy. By doing so, the node will always find a file in the nearest available server and its interference with remote nodes in the network will be minimized. 3 System Operations with Clue Tables This section discusses how the system operates with clue tables. Since the clue tables mechanism implements dynamic binding, the bindings among clients and servers can change on-the-fly. A client establishes a binding to a server upon the start of a file session. A file session is a series of file operations to a file enclosed by open and close operations. When a client starts a file session, it first searches the clue table for the domain that includes the file and sends requests to the servers according to the grouping and priority specified in the clue table. The servers that receive the request respond by locating the file in their own storage and returning a succeeded or failed message. If replicated copies exist on several servers, the client will choose the server that responds most quickly to a succeeded message for service of this file session. Fig. 3 illustrates the access request flow. As mentioned earlier, through implementing dynamic binding upon file opening, the clue tables mechanism implicitly carries out load balancing among servers. It is conceivable that a server with a lighter load has a better chance to respond faster than a server with a heavier load. Though it is not guaranteed that the server with the lightest load among multiple candidates always responds fastest, a near-optimum load balancing situation should persist most of time. When a client selects a particular server for a file session, the client caches the attribute block of the file, the server identification, and a file handle returned by the server to speed up following file operations. The file handle is a unique file index assigned by the server for speedily identifying and locating an opened file. Note that the bindings among clients and servers are per file session basis. A client may turn to different servers for service of different file sessions. Due to network proximity and server load, two clients that are concurrently accessing the same file may be Figure 3: Flow of access requests. In this case, server csm is the fastest to reply a succeeded message. Client dfs1 will establish a binding to csm 's /usr/sparc/bin/foo. ___________________________________________________________________________________________________________________________________* *__ served by two different servers. As a result, access load is distributed over servers and the scalability of the system is significantly upgraded. The bindings among clients and servers may change during a file session. One reason is to elude access disruption caused by server or network failure. When such a failure occurs, the client will search the clue table for another server that has a replicated copy of the file. If this search succeeds, the client will establish a binding with the second server and the user will not observe service disruption except that the latency of some file operations is longer. Another occasion in which rebinding is invoked is to maintain data consistency among replicated file copies. With file replication, the system must be able to resolve potential data inconsistency when concurrent writing or read-write sharing occurs. We addressed the data consistency issue by introducing a primary copy based preventive mechanism. When one or more clients open a file for writing, a server that holds a replica of the file is designated as the primary server. All other servers that also have a replica of the file invalidate their copies. Meanwhile, those clients that are initially bound with the servers that have temporarily invalidated copies will carry out rebinding operations to connect to the primary server. The primary server will provide all accessing services to the file as long as the writing operations continue. Upon termination of the situation, the primary server will broadcast the new version of the file to other servers. An interesting issue here is how the primary server is selected. As mentioned earlier, the clue tables mechanism dynamically designates the primary server to distribute servers' load. In the situation that only one client attempts to write to the file, the server that receives the write request will become the primary server. If two or more clients want to open the file for writing at the same time, all the servers that receive a write request will compete to become the primary server. A simple arbitration mechanism based on a pre-assigned priority is used to determine the primary server. 4 Implementation and Performance Evaluation Fig. 4 shows the structure of an implementation of the clue tables mechanism. This implementation is based on Mach 2.6 operating system [12 ] and the VFS (Virtual File System) [13 , 14 ]. On the client side, the VFS(Virtual File System) forwards file accesses that invoke the clue tables mechanism to the CluFS interface. The CluFS interface checks whether the file access hits the local disk/file cache. If not, the file access request is forwarded to a user-level process called the client daemon. The client daemon looks up the local clue table and multicasts the request to the servers Figure 4: An implementation of the clue tables mechanism. ___________________________________________________________________________________________________________________________________* *__ according to the grouping and priority in the matched clue table entry. On the server side, the incoming requests are processed by a user-level process called the server daemon. The server daemon interfaces with the VFS to locate the file and returns a succeeded or failed message to the requesting client. To study the performance with the clue tables mechanism, we have conducted an experiments and compared the results with NFS[15 ]. The experimental system consists of five Intel 80486 based personal computers connected by an ethernet network. All machines run Mach 2.6 operating system and three out of the five act as servers while the remaining two act as clients. Fig. 5 shows the results from the experiment. In the experiment, we repeatedly open and close 30 files to test the overhead of dynamic binding operations. The horizontal axis gives the number of times the operations are repeated. The vertical axis is time the operations take in seconds. Fig. 5 shows that the clue tables mechanism performs slightly better than NFS in file opens. This is due to the use of the socket mechanism[16 ] in the clue tables file system. The socket mechanism induces less overhead than the RPC (Remote Procedure Call) mechanism used in NFS. For file closes, NFS virtually takes no time. The reason is that NFS uses a stateless file cache coherence protocol and, as a result, does not invoke a remote procedure call when a file is closed. 5 Conclusion In this paper, we presented a distributed, dynamic naming mechanism called clue tables for building highly scalable, highly available distributed file systems. The clue tables naming mechanism is distinctive in three aspects. First, it is designed to cope well with the hierarchical structure of modern large-scale computer networks. Second, it implicitly carries out load balancing among servers. Third, it supports file replication and dynamically designates a primary copy to resolve possible data inconsistency caused by concurrent accesses to multiple file replicas. The clue tables naming mechanism is incorporated in the Azalea distributed file system currently being developed at National Taiwan University. Acknowledgments We wish to thank Mary Baker, the shepherd of this paper, and the reviewers for many helpful suggestions and comments. (a) Each loop opens and closes 30 read-only files. (b) Each loop opens and closes 30 write-able files. Figure 5: 30 different files are repetitively opened and closed. References [1] Mahadev Satyanarayanan. Scalable, Secure, and Highly Available Distributed File Access. IEEE Computer, 23(5):9-21, May 1990. [2] Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff, and John Ousterhout. Measurements of a Distributed File System. In Proc. of the 13th Symposium on Operating Systems Principles, pages 198-212. ACM, 1991. [3] Brent Welch and John Ousterhout. Prefix Tables: A Simple Mechanism for Locating Files in a Distributed System. In The 6th Intl. Conference on Distributed Computing Systems, pages 184-189. IEEE, May 1986. [4] J. Ousterhout, A. Cherenson, F. Douglis, M. Nelson, and B. Welch. The Sprite Network Operating System. IEEE Computer, 21(2):23-36, Feb. 1988. [5] Mahadev Satyanarayanan, John H. Howard, David A Nichols, Robert N. Sidebotham, Alfred Z. Spector, and Michael J. West. The ITC Distributed File System: Principles and Design. In Proc. of the 10th Symposium on Operating Systems Principles, pages 35-50. ACM, Dec. 1985. [6] Mahadev Satyanarayanan, James J. Kistler, Puneet Kumar, Maria E. Okasaki, Ellen H. Siegel, and David C. Steere. Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Tans. on Computer, 39(4):447-459, Apr. 1990. [7] Bruce Walker, Gerald Popek, Robert English, Charles Kline, and Greg Thiel. The LOCUS Distributed Operating System. In Proc. of the 9th Symposium on Operating Systems Principles, pages 49-70. ACM, Oct. 1983. [8] David R. Cheriton. The V Distributed System. Communications of the ACM, 31(3):314-333, March 1988. [9] S. J. Mullender and A. S. Tanenbaum. A Distributed File Service Based on Optimistic Concurrency Control. In Proc. of the 10th Symposium on Operating Systems Principles, pages 51-62. ACM, Dec. 1985. [10] Richard G.Guy, John S. Heidemann, Wai Mak, Thomas W. Page, Gerald J. Popek, and Dieter Rothmeier. Implementation of the Ficus Replicated File System. In USENIX Summer '90, pages 63-71, 1990. [11] Alex Siegel, Kenneth Birman, and Keith Marzullo. Deceit: A Flexible Distributed File System. In USENIX Summer '90, pages 51-61, 1990. [12] J. Boykin and A. Langerman. Mach/4.3BSD: A Conservative Approach to Parallelization. Computer Systems, 3:69-99, 1990. [13] Steve R. Kleiman. Vnode: An Architecture for Multiple File System Types in SUN UNIX. In Summer USENIX Conference proceedings, pages 238-247, Atlanta, GA, June 1986. [14] David S. H. Rosenthal. Evolving the Vnode Interface. In Summer USENIX Conference proceedings, pages 107-118, Anaheim, CA, 1990. [15] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and Implementation of the Sun Network File System. In USENIX Conference proceedings, pages 119-130, June 1985. [16] S. J. Leffler, M. K. McKusick, M. J. Karels, and J. S. Quaterman. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley Publishing Company, 1989. Author Information Cheng-Zen Yang is currently a Ph.D. student at the Dept. of Computer Science and Information Engineering of National Taiwan University, Taipei, Taiwan. He received the B.S. degree in Computer Engineering from National Chiao Tung University in 1988, and M.S. degree in Computer Science and Information Engineering from the same university in 1990. His major research interests include distributed computing systems and operating systems. He can be reached by e-mail at dennis@solar.csie.ntu.edu.tw. Chih-Chung Cheng was born in Tainan, Taiwan, R.O.C. He received the B.S. degree in Computer Science and Information Engineering from National Taiwan University in 1992. He is currently in the master program of the same department and expected to complete his degree in 1994. His research interests are distributed file systems and computer networks. He can be reached by e-mail at chung@solar.csie.ntu.edu.tw. Yen-Jen Oyang received the B.S. degree in Information Engineering from National Taiwan University in 1982, the M.S. degree in Computer Science from the California Institute of Technology in 1984, and the Ph.D. degree in Electrical Engineering from Stanford University in 1988. He is currently an Associate Professor in the Department of Computer Science and Information Engineering, National Taiwan University. His research interests include computer architecture, distributed systems, and VLSI system design. He can be reached by e-mail at yjoyang@csie.ntu.edu.tw.