The following paper was originally published in the Proceedings of the USENIX Conference on Object-Oriented Technologies (COOTS) Monterey, California, June 1995 For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: office@usenix.org 4. WWW URL: http://www.usenix.org Software Configuration Management in an Object Oriented Database Mick Jordan mjj@Eng.Sun.COM Michael L. Van De Vanter mlvdv@Eng.Sun.COM Sun Microsystems Laboratories 2550 Garcia Avenue Mountain View, CA 94043 USA Abstract The task of configuration management for software development environments is not well supported by conventional files, directories, and ad hoc persistence mechanisms. Typed, immutable objects combined with ubiquitous versioning provide a much more sound basis. A prototype configuration management system cur- rently under development relies upon persistent objects provided by a commercial object-oriented database sys- tem. Mechanisms and policies developed for this proto- type simplify programming with persistent objects; results include simplicity of both design and implemen- tation, as well as flexibility, and extensibility. Initial measurements suggest that performance is acceptable. 1 Introduction A practical software system consists of hundreds or thousands of separate but related components. A soft- ware development environment (SDE) must help man- age this collection by providing configuration management services. Closely related is version man- agement, often associated with tracking the evolution of individual components. However, configurations natu- rally exist as versions, and it is the effective manage- ment of such configurations that distinguishes a truly useful SDE for large, multi-user projects. An SDE typically relies on the persistent storage facili- ties provided by the underlying operating system, usu- ally in the form of a file system with two basic mechanisms: files and directories. Unfortunately the weak functionality of these primitives does not permit adequate modelling of configurations. This has led to the suggestion that an SDE rely instead on a database management system (DBMS) [1]. Experience with SDEs built using traditional DBMS technology has not been encouraging. Indeed such sys- tems have acquired a reputation for being resource hun- gry, and they have not found their way into widespread use. However, in recent years there has been a surge of interest and development in object-oriented database (OODB) technology. While the relational database com- munity might argue that OODBs are fated to repeat the perceived failure of the network databases of the 1970's, there is no doubt that the better impedance match with everyday programming has a great appeal to many developers. Indeed one can argue that OODBs are becoming successful because they are delivering, in part, the promise of persistent programming languages [2]. This paper describes our experience building a proto- type configuration management system using Object- Store® [3], a commercial database system produced by Object Design Inc.® We begin by reviewing the advantages of an OODB as infrastructure for an SDE and then provide an overview of the essential facilities provided by ObjectStore. Next we briefly describe the structure and operation of our configuration management system and explain how objects help simplify both design and implementation. We then turn to the mechanisms and policies that we have developed to exploit ObjectStore successfully in our prototype, many of which have wide applicability. After presenting some preliminary performance mea- surements we describe related work and discuss our conclusions and future plans. 2 Why a Database? Traditionally an SDE is built on top of a file system. The inertia caused by existing file-based tools makes it quite difficult to change this situation, and any database solu- tion must provide a way to integrate legacy tools. Mean- while progress in tools continues, and the limits of file- based systems are being pushed. In particular, it is increasingly common to find ad hoc "databases" being added to file based SDEs. Examples include data to sup- port browsing, C++ [4] template instantiation, and build state for make [5]. Managing these ad hoc persistent stores, particularly in the face of concurrent access, is difficult and error prone. Yet, there is no shortage of object-based systems offering solutions, for example CORBA [6] and Microsoft OLE 2 [7]. However, although these systems provide varying degrees of sup- port for persistent data, they currently do not scale to the task of supporting an SDE efficiently. In contrast, OODBs have matured to the point that we believe it is practical to deploy them as an alternative to file systems and ad hoc persistence mechanisms. An OODB offers two main advantages over a file sys- tem as infrastructure for an SDE: · A transactional store, and · Precise modelling of application data. 2.1 Transactional Store All updates to a database take place inside a transaction; they either complete in their entirety or not at all. There- fore the evolution of data in a database proceeds through a sequence of well-defined states. Furthermore, data modified in a transaction by one process is not visible to other processes until the transaction commits. Providing such a guarantee for a file-based system is substantially harder. While the transactional property might be seen as a minor feature in a single user programming envi- ronment, it is extremely important in large, multi-user projects. Even in single user environments, the use of multiple processes can easily put data into an inconsis- tent state, for example should a process abort at an inop- portune moment. 2.2 Precise Data Modelling We believe that the capability to design objects that closely model the application domain is the more signif- icant benefit of an OODB. File based systems are ham- pered by the performance and modelling discontinuities that occur at the file and directory boundaries. An OODB can provide a more consistent solution. In partic- ular, there are substantial advantages to be gained from utilizing fine-grain objects: objects too small to be stored efficiently as individual files. That all objects are also typed, significantly improves the readability and reliability of programs that manipulate them. Traditional solutions to storing fine-grain objects in a file system involve pickling schemes [8] [9], whereby programming language objects are preserved in some way inside one or more files. Although this can work well for simple structures, it rapidly becomes unwieldy for the complex, linked structures that prevail in compilers and associ- ated tools [10]. 3 ObjectStore ObjectStore is a commercially available OODB that runs under unix® and Microsoft® Windows. It is prima- rily targeted to the C++ language, but support for Small- talk [11] has been added recently. Our experience is with the C++ system for the Solaris® variant of unix®. The impedance match between ObjectStore and C++ is generally very good. Objects are allocated in a database by overloaded variants of the C++ operator new. In consequence all objects in the database can be refer- enced by C++ pointers. ObjectStore transparently maps portions of the database into the address space of the C++ process, using virtual memory mapping primitives of the underlying operating system. Access to objects in the database must occur inside a transaction, and any pointers that are mapped within that transaction are not valid after it terminates. Multiple processes may access the same database, and ObjectStore enforces a multiple reader-single writer locking policy. Within these constraints, and inside a transaction, manipulation of database objects is no different from manipulating virtual memory (heap) objects, and it is this feature that makes use of the database transparent. However, to avoid lock conflicts with other processes it is necessary to limit the time spent in each transaction. A lock conflict occurs when one process holds a write lock and another process requests a read or write lock, or vice versa; this delays the requesting process until the lock is relinquished. Since locks are acquired only as data is accessed, deadlock can occur when access pat- terns produce cyclic lock dependencies. In case of dead- lock one transaction is automatically aborted and subsequently retried up to a user-defined maximum number of times. In order to retain access to an object between transac- tions, it is necessary to use an ObjectStore reference: a kind of heavyweight pointer. Unless some care is used to minimize the use of references, much of the transpar- ency is lost. We will discuss this issue in more detail in section 5.4. ObjectStore is based on a client-server model. Data- bases are managed by a server process on a designated server machine. Client processes, typically on separate machines, communicate with the server to read and write data. Client-side caching is heavily used to mini- mize communication overhead. A process can access multiple ObjectStore databases within a transaction, and database objects can contain references to objects in other databases. An update transaction that modifies data in multiple databases managed by different servers is implemented by a two- phase commit protocol. We currently do not use ObjectStore's other features, such as querying, collections library, and generic ver- sion management. Our primary interest is in exploiting the shared, distributed, and transacted virtual memory model provided by the basic infrastructure. 4 Configuration Management in Forest Our configuration management system is part of a larger project, Forest, that aims to provide an integrated, multi- user, software development environment for medium to large scale systems programming. A major goal of For- est is to discover whether current OODB technology is up to the task. The goal of the configuration management system is to support the precise specification of a software system that may occur in many versions and variants. It must be capable of dealing with a spectrum of object types, from very small objects that might represent a fragment of a source program through to structured objects that might represent a complete system build containing thousands of objects. Intermediate in this spectrum are objects that represent object modules, programs, etc. In file-based systems the structure of these kinds of objects is typi- cally defined by a file format. In object-based systems we expect such objects to be defined increasingly as abstract object types. Figure 1 suggests how services are provided by different architectural layers of the prototype. 4.1 Components Since the term object is rather overloaded, we use the term component to denote the objects that are manipu- lated by the Forest configuration management system. Although ObjectStore can store instances of any C++ type, we restrict components to be a subclass of an abstract C++ class Component, which defines a com- mon set of operations and attributes. It is possible to construct arbitrarily complex collections of components, connected in an OODB, much more eas- ily than one can with a file system. In practice, we enforce some familiar high level structure, partly to meet the goals of the configuration management system and partly for administrative and engineering reasons. 4.2 Projects Components are allocated in and belong for their life- time to a project. A project physically groups related components and, as the name implies, is intended to be shared by a group of collaborating programmers. A project is similar to a disk volume or mounted file sys- tem and is implemented as a single ObjectStore data- base. Components in one project may refer to those in another, but only through special components called links. At the outer level a project looks very like a conven- tional file system. A root folder (analogous to a file sys- tem directory) can contain other folders or arbitrary components. Two particular component types, Package and BuildableFolder, are the essence of the configura- tion management system; we will now describe these in more detail. Figure 2 suggests how these are related in practice. 4.3 Packages and BuildableFolders Forest follows the Vesta [12] approach to configuration management. A project contains a collection of hetero- geneous source components, organized into families called packages, each of which is arranged into a tree of versions. Source components are so-called because they cannot be generated automatically by the system. Users typically create source components using tools such as editors. However, a binary program that is imported into the database from elsewhere is also a source component in this context. All other components, i.e. those that can be generated automatically by the system, are derived. Components, once created, are immutable. Tools used to construct derived components are themselves modelled as components and versioned in the same way. Systems are built by executing a functional program in a system modelling language. These properties provide the basis for reliable and repeatable system builds. Unlike Vesta, in which the package structure is hard- wired, a Forest package is just a particular subtype of the Component class. The operations available on a package are specified as ordinary C++ methods, and each executes as a transaction. A package version can bind any subtype of Component, although to date we have not exploited this flexibility. Variations of the package type could be defined (for example to imple- ment different organizations of versions), as indeed could quite different structures, and all could coexist in the same database. This flexibility and extensibility is one of the clear benefits of the OODB approach. Following Vesta, the component type that we currently bind to a package version is a special kind of container that we call a BuildableFolder. This is similar to a file system directory, in that it contains a set of heteroge- neous (source) components. However, it also plays a key role in the system building process in that it is also input to the system builder. In this role a Buildable- Folder acts as a module in the system modelling lan- guage. The source components act as constant values that are the inputs to the algorithm that computes (builds) the system. Large systems are built up by con- necting package versions together using an import mechanism. The essential feature of the import mecha- nism is that it is not intensional; each Buildable- Folder precisely and permanently identifies the other folders that it imports. The details of the system model- ling process are beyond the scope of this paper; the mechanisms are similar to those described for Vesta, but the details are rather different. Development typically proceeds by the user checking out the latest version of a package. This produces a new branch on the version tree and a new Buildable- Folder, which initially shares the components con- tained in the folder that was the subject of the checkout operation. The package components, which can be of arbitrary type, are then evolved by appropriate editors. We use the term evolved rather than modified to empha- size that any editing must take place on copies. We expect editors to exploit the immutability provided by the system and share as much as possible of original components when constructing copies. Coordination of component editors is overseen by a spe- cial editor for a BuildableFolder; this editor ensures, amongst other things, that the entire folder is saved as a group, thus versioning the entire configura- tion. Once a version is saved, which involves commit- ting any new components to the database, it can be built. Therefore, it is important that the group save operation be fast, in user time, and our experience with the data- base is positive in this regard. Eventually a version of this checkout branch, usually the last, is checked in by creating a new version on the main branch. This involves no copying or building, since it merely serves to associate a new version name with the same folder. 4.4 Fine-grain Components Some components in a BuildableFolder have internal structure that is made visible through the inter- face to the component. Since this structure corresponds to the sub-file granularity in a file-based system, we refer to such components as fine-grain. Most current program editors are text-based, so typical source com- ponents are not structured, but we expect this to change as object-based systems and compound documents become more widespread. In contrast, it is somewhat easier to represent derived components, those created by tools during a system build, as structured objects by way of wrappers and similar techniques. In Forest, these are defined as appropriate subtypes of the Component class, tailored to model the abstract objects they repre- sent. For example, we specify the input to the parser for a programming language as a TokenSequence compo- nent that represents a sequence of tokens in that lan- guage. This component might be produced by a lexer from a text component or it might be manipulated directly by a language sensitive editor. Either way, the parser, as well as all downstream tools in the compila- tion pipeline, are only interested in the elements of the token sequence, and are unconcerned that it might be derived from or be part of another component. This pre- cise modelling of the compilation pipeline provides opportunities to avoid certain phases. For example, a change to a comment or other annotation, such as a graphic, in the original source component, would only require re-lexing. 4.5 Uids and Abstract Values Two particularly important attributes of a component are its uid and its value. Each component is assigned a globally unique identifier (uid) that denotes it for all time. The uid acts much like a pointer or reference in a programming language. Forest uids are 128 bits wide, part of which encodes the database in which the compo- nent is allocated. The uid provides a guaranteed way to identify a component, in contrast to the facilities offered by the unix file system. The uid also provides a sure way to determine if two references are to the same object. However, uid comparison sometimes draws too fine a distinction. Two components with different uids may in fact represent the same abstract value, for example the same sequence of tokens in the TokenSequence component of the previous section. The abstract value attribute is represented by a fingerprint [13]. A finger- print is an opaque bit sequence, similar to a hash code, that encodes value probabilistically. Two components with the same value have the same fingerprint. Con- versely, if two components have different values, then, with extremely high probability, they have different fin- gerprints. Each component implementation can provide its own definition of the value attribute; the default, inherited, implementation uses the uid, which is safe but conservative. Fingerprints can be combined efficiently, which is how the abstract value of structured compo- nents is computed. Component implementations are free to cache the fingerprint or compute it on demand. The system builder deals with components through their abstract values because, in the system modelling lan- guage, object equality is defined as equality of abstract values. 4.6 Copy or Share? The combination of immutability and abstract value provides many benefits, not least the freedom to copy or share components at will. For example, a system that is built in a database at a developer site can be copied to a database at a client site. Provided that client and devel- oper share the same environment with respect to the dependencies of the copied system, no rebuilding is nec- essary because, even though the copied components will have different uids, the abstract values of the compo- nents will not have changed. If, on the other hand, the client uses a different version of the compiler, then the affected components will be rebuilt transparently; the system takes full responsibility for the construction and location of derived components. The ability to share components reliably has two major benefits. First, it permits components to be constructed as incremental modifications to existing components. This is similar to the way in which versioning systems such as RCS [14] use "diffs" to encode changes to text files, but it differs by being proactive rather than retroac- tive. In addition, arbitrary component types can be con- structed incrementally in the Forest environment, leaving details to component implementations. The sec- ond benefit is the ability to share large derived compo- nents such as programs, libraries, or entire subsystems among many versions and many users. This can dramat- ically reduce overall storage requirements, making the system competitive or better than current file-based environments. 5 Storage Management and Locking We turn now to the implementation of the configuration management system, and in particular to the impact ObjectStore has on the Forest/C++ programmer. This is felt in two main areas: storage management and data locking. We will describe the mechanisms and policies that we have developed to minimize this impact. 5.1 Allocation Mechanisms As noted earlier, objects are allocated in an ObjectStore database using overloaded variants of the C++ opera- tor new. The C++ language permits optional place- ment arguments to new; ObjectStore uses these to control where in the database an object is allocated, using two basic placement abstractions: segments and clusters. Segments support large scale structuring of the database and provide hooks for various policies to improve per- formance. For example, whole segments can be trans- ferred from the server in one request. Segments can be of unlimited size. A segment can be divided into clusters, which are essen- tially abstract pages. A cluster has a minimum and a maximum size, the minimum being one page. Object- Store's locking is actually page-based, so objects in dif- ferent clusters cannot cause lock conflicts. Clusters also provide for locality of reference for groups of small objects. Since clusters have a maximum size, a client must always be prepared for an allocation request to fail, requiring the cluster to be extended or a new one to be allocated. Objects larger than the maximum cluster size cannot be allocated in a cluster at all. It would be unworkable for each allocation site to deal with all these issues, so in Forest we abstract placement data into a C++ class called DBNewPlace. Although this successfully encapsulates the ObjectStore place- ment information, a client must still acquire or generate an instance of this class in order to allocate any objects. Fortunately ObjectStore provides enquiry methods that return the cluster or segment in which an existing object is allocated. Provided such an object is at hand, it is easy to generate a DBNewPlace that will cause allocation with the same placement. In practice, almost all objects are associated with some container object, and alloca- tion with the same placement as the container is usually the right thing to do. Since such allocations typically take place inside a method of the container object, the C++ this pointer can be used to generate the place- ment. In cases where a this pointer is not available, such as static member functions, a DBNewPlace instance must be passed as an explicit argument. 5.2 Allocation Policy The DBNewPlace class provides the mechanism for abstracting the placement information, but who provides the policy? Our experience bears out Object Design's advice: unless this is carefully planned, lock conflicts, poor locality of reference, and consequent inadequate performance are inevitable. The versioned, mostly immutable nature of a Forest database is of considerable help in establishing appro- priate policy. By design, conflict can only occur on the mutable components that represent the leading edge of development. Recall that development by an individual user takes place on a package. Often two or more pack- ages may be involved, for example a package that pro- vides a library of code and an application package which uses the library. Different individuals will typi- cally be working on different packages, ultimately inte- grating their work by way of packages that denote subsystems. Each Forest package is allocated in a separate segment, so all source components in all versions of the package are allocated together. Lock conflicts thus cannot arise from modifications to different packages. In addition, when a particular version of a package is built, all new derived components are allocated in a fresh segment. So concurrent builds of different versions of the same pack- age also cannot conflict. Should two programmers be working on different branches (versions) of the same package, conflict can occur while creating new source components. However, this happens in user-time, and the lock is only held for the time it takes to write the new component. If such conflicts proved to be problem- atic, it would be relatively straightforward to allocate extra segments, for example, for each version branch. Only the code that creates new package versions would be affected by this change; editors and other mutators would simply inherit the appropriate placement. Owing to the size constraints, we find clusters much less useful than segments; we use clusters only for compo- nents whose size we can predict with confidence. 5.3 Uid Generation The most significant hot spot in the database is the data structure that deals with uids. At a minimum the next- uid value must be updated each time a component is cre- ated. The data structures that provide a mapping from a uid to the associated component are also potential hot spots. Evidently this data must be clustered separately, otherwise read-only navigation of the database will immediately conflict with mutating. In fact, as the measurements in section 7 show, the con- flicts that arise when multiple processes are mutating the database quickly cause serious performance problems. To address this problem we have made it possible for uid generation to be distributed amongst components in the database. Rather than hard-wire a particular distribu- tion, any component type can opt to be a uid-generator by inheriting and implementing the UidGenerator interface. This interface defines a protocol that permits a hierarchy of components to collectively manage the uid space in one database. Initially the only uid generator is at the root component, which also serves as the entry point to a database. A component can register with the root and acquire a por- tion of the uid space that it then manages. Any compo- nent can be asked for its uid-generator. Since allocation and uid generation go together, we extend the DBNew- Place class to carry both the database placement and uid-generator instance. There are no constraints on how a component implements the UidGenerator inter- face; the protocol provides for the enumeration of the components under its control. In principle this hierarchy could be of any depth. In practice, we have only used a depth of two, by making each package be a uid genera- tor that registers with the root component. The contention caused by the uid structures could be avoided by binding our implementation more tightly to ObjectStore, since each persistent C++ object has a unique persistent address. However, an initial goal of the project was to avoid too much dependence on imple- mentation details of any one database, hence our own uid layer. Since the uid representation is hidden from Forest clients, we should be able to perform this optimi- zation in the future. Nevertheless, the uid structure is just a particular example of an index, which occur repeatedly in database applications. A consequence of the distributed shared memory model is that such indi- ces are certain hot spots if the update frequency is high. 5.4 Locking Careful clustering and distributed uid generation does much to alleviate the potential for lock conflicts in For- est, but it does not permit the programmer to ignore the issue completely. In particular, any process that involves user-interaction or indeterminate delay cannot be under- taken inside a transaction without the risk of locking out another process. In Forest, this issue manifests itself most obviously in the user interface to the configuration management system. Much user activity involves browsing the package structure, using read-only transac- tions, with occasional bursts of editing and building that require update transactions. In order to avoid holding a read lock that might prevent an update, user-interface code must use short transac- tions for database navigation. Pointers that represent navigational context must be converted to ObjectStore references, which were alluded to in section 3. Object- Store provides a variety of reference types, with varying costs and functionality. All are represented in C++ as template types and behave like smart pointers to the real object type. For the most part this fiction succeeds, but the C++ type system sometimes falls short. For exam- ple, if B is a derived class of A, then a B* is assignable to an A*. However, there is no such relation between a DBRef and a DBRef, where DBRef is a reference to a type T. One particular trap for the unwary involves closing a transaction within a method of an object whose this pointer becomes invalid when the transaction ends. This is analogous to sawing off the branch on which one is sitting. The solution is to use functions or transient objects to scope transactions. 5.5 How much to Store Persistently? One reason that database-based environments have a reputation for storage excess is that the data structures associated with compilers and related tools can be very large. A good example is a program represented as an Abstract Syntax Tree (AST). Another is the debugging information in typical unix object files. Forest solves this problem with two distinct mecha- nisms: · Separation of interface and implementation; and · Aggressive sharing of substructure. The second mechanism, discussed earlier, deserves fur- ther emphasis. The intrinsic immutability of Forest com- ponents enables common structure to be shared, confident in the knowledge that it cannot change. The value of physically separating interface and imple- mentation is well understood in the abstract data type programming communities. In the C++ community physical separation by way of abstract base classes is uncommon, perhaps because of efficiency concerns and poor language and compiler support. In a database con- text, efficiency concerns are clearly secondary to the flexibility that abstract interfaces provide. Forest com- ponents are defined and accessed through interfaces defined as C++ abstract base classes. These abstract interfaces show no implementation detail: all methods are pure virtual, and no data members are permitted. The implementor enjoys complete freedom to represent the abstract state of the component in the most conve- nient way. The actual implementation takes the form of a class that is derived from the abstract class and con- tains the physical data members. When coupled with immutability, this separation of interface and implementation becomes a powerful tool to control database size. Consider the AST example mentioned previously. An AST, when annotated with semantic information, is a large data structure that is not obviously worth storing persistently in its entirety. For- tunately, we can easily establish the minimum amount that must be stored: the source component from which the AST was generated, any ASTs that correspond to imported (included) source components, the tool responsible for the AST generation, and any environ- mental information, such as the target machine, upon which the generation depended. Since, by definition, these dependent components are immutable, references to these components are all that need be stored persis- tently. The full AST can be generated in transient mem- ory on demand, and cached in a table keyed by its abstract value (fingerprint). At the other extreme we might choose to store the entire AST persistently. Abstract base classes, combined with ObjectStore's pointer transparency, leave clients unaware of this transition, other than possible delays when the AST must be regenerated in transient memory. Evidently we have reduced this problem to a classic space-time trade-off. Indeed, it is possible to create both kinds of components in the same database and therefore measure the trade-off for a given processor/network/ disk combination, all other factors being held equal. Given the continuing divergence in processor and disk speeds, generating infrequently used data on demand in transient memory seems a sound strategy. Note that in either case the same amount of virtual memory is required for the AST; either it is mapped from the data- base or allocated transiently by the process. 5.6 Garbage Collection The high degree of sharing and the concurrent nature of data access makes manual storage management imprac- tical in a Forest database. Fortunately the configuration management framework makes automatic garbage col- lection a practical possibility. First we must explain the different ways in which garbage can occur in a database. Although most components are immutable, there are exceptions. · Derived components that result from building a package version can be discarded at will since, by definition, they can be recreated reliably from source components when needed. · Building typically creates garbage in the form of intermediaries, for example components analogous to compiler intermediate files. These become gar- bage because they are not reachable from any pack- age. · Users can delete old package versions, causing some source components, modulo sharing, to become garbage. Since this is a potentially danger- ous operation, and since derived components domi- nate storage costs, it is not clear that source deletion is really necessary. Garbage collection is currently supported by reference counts on components. Reference counting of derived components is mostly handled automatically by the sys- tem builder. However, editors and other tools that manipulate structured components, such as the version- ing system, are required to handle reference counting manually. The reference counting interface is part of the Component class, but the implementation is private and could be altered without affecting clients. Since components may reference components in other databases, we have, in general, a distributed garbage collection problem. Each database occupies a distinct piece of the 128-bit wide address space, so there is no conceptual difficulty in garbage collecting a large enough portion of this address space to bound all inter- database references. Note that manipulation of reference counts and garbage collection are the only ways in which a distributed update transaction can occur in For- est. Given that the system builder is based on the abstract value of immutable components rather than uids, and that we can copy any package version from one database to another at will, it is not clear that inter- database references should be encouraged. We expect that a group of collaborating programmers will share a single database, and we postulate that sharing among larger groups might be served best by copying. How- ever, this a matter for further research. 6 Safety The shared memory model creates the potential for cor- ruption by rogue applications of critical data structures, for example, those that underpin the folder structure and versioning system,. The solutions are either to abandon the shared memory model or to rely on the safety of the programming language in which the system is imple- mented. Cedar [15] is a classic example of the latter approach. The potential for corruption is real, since C++ is unsafe, but ObjectStore provides some mechanisms to mitigate this. For example, a program fault is caught and trans- lated into an exception that aborts the transaction, leav- ing the database unscathed. It is also possible to leverage the type information stored with the database to check the validity of pointers prior to committing a modified page. In the long term we would prefer a safer implementation language. Compiler support for safe C++ [16], with gar- bage collection for transient objects, would be a viable alternative. 7 Performance Measurements Although we have not yet put the prototype system into everyday use, early indications are encouraging. With no tuning at all, the prototype performs quite ade- quately. We intend to develop the prototype into a work- ing system that we will use ourselves for everyday development. Meanwhile, in order to estimate the performance of the environment in real use, we have developed a test pro- gram that simulates a user performing a set of editing operations. The test database, characteristic of a typical medium-sized project, contains approximately 2500 text components, mostly C++ source files, imported from a unix file system. These are distributed among 216 pack- ages, each of which corresponds to a directory in the file system. The initial size of the database is approximately 18Mb. The program simulates a user checking out a package at random, choosing a random number of components of that package to edit, editing them (by making copies), checking the package in, and repeating this process for a given number of iterations. Unlike real users, the test program performs edits in zero time and so provides a somewhat more stressful test than would occur in real use. Each instance of the program simulates one user. By running the program on several machines, all accessing a shared database on the server, we can discover how the system scales as users are added. In particular we can measure the effect of making each package a uid generator. All times are wall clock since, in a distributed system, that is the only measurement of relevance to users. The experiments were run on dual-processor SPARCsta- tion` 10 client machines with 64Mb of memory, con- nected by a 10 Megabit Ethernet® to a six-processor SPARCcenter` 2000 running a beta version of the ObjectStore 4.0 database server. The database was stored in a unix file, a convenient feature of Object- Store, but one that obviously limits performance in com- parison to storing the database on a raw disk. Client cache size, the amount of virtual memory available for mapping in database pages, was set to the default value of 8Mb. For this benchmark, which only accesses about 0.5Mb of data, the cache size is not an issue. The results break down the run time into six activities: initial scan, checkout, checkin, analyzing and construct- ing a new BuildableFolder, reading the text com- ponent data, and writing it back in the new component. Each of these operations is implemented as an indepen- dent transaction to maximize the potential for concur- rency. The initial scan walks the top-level folder structure and counts the number of packages. As a side- effect this partially warms each client cache. Table 1 shows the results when the database contains a single uid-generator. These clearly confirm that contention for write access to data is disastrous in the distributed shared memory model. For operations that require object creation, it is as if each new client adds its load to every other client machine. In other words, no benefit is accrued from distributing the processing among multi- ple client machines. Table 2 shows the effect of distributing the uid genera- tion amongst the packages. These results are much more encouraging. In particular the sum of the checkout and write times only doubles as we go from one to eight cli- ents, because of the much lower contention for locks. The consistent times for the scanning and reading phases indicates that the database server is affected only slightly by read-only clients and that the client side caching is effective. Clearly, as the number of clients increases, the probability that the same package will be chosen for editing by more than one client increases, and this does indeed occur during this experiment. This causes more communication between the server and cli- ents in order to reclaim locks, and therefore increases the real-time delay for the affected client. As noted earlier, the critical measure is the time taken to write back modified components and check in a new package version. In this experiment an average of five components are modified in each checkout/edit/checkin cycle. The time to commit the updates ranges from approximately five to twelve seconds. For the common case, a modification to one file, the results suggest a range of two to five seconds. While we would prefer an upper bound on the order of one second, these figures are acceptable given the prototype nature of the system. 8 Related Work The Vesta system [12] [17] [18] [19] clearly demon- strated that configuration management could be placed on a firm foundation through the use of immutable com- ponents and modular system modelling. The Vesta repository, however, remained essentially file-based, and was implemented on top of an existing file system. Tools (bridges in Vesta parlance) were still required to pickle language level objects into repository files and the caches for the builder were also stored in this way. Much of the implementation was concerned with imple- menting key transactional operations in terms of the underlying file-system. The OODB infrastructure pro- vides this directly, while also supplying more flexibility and extensibility, and a consistent object model from the top-level structure of the repository to the fine-grained structure used by tools. To illustrate the leverage of the OODB, it is stated in [18] that the Vesta repository took less than one person year to implement. In contrast the Forest configuration management subsystem took less than one person month to build. Vesta was itself influenced by the Cedar System Mod- eler [20]. The Cosmos [21] project also bases its config- uration management on immutable objects. PCTE [1] includes a database management system, OMS [22], which provides an entity-relationship model with object-oriented extensions, but it was not designed to support fine-grained components. The PACT project [23] developed configuration management services on top of OMS. ClearCase® [24] and DSEE [25] both support a ver- sioned file system with views to select particular ver- sions in individual user contexts. Both support integrated system building. Clearcase is a hybrid sys- tem, storing some data in conventional files but storing versioning information and other attributes, some of which can be user-defined, in a database. Magnusson et. al. [26] describe a revision control sys- tem for fine-grained, tree-structured, documents imple- mented with a client-server model and a specialized database. Their storage model for documents is very similar to our scheme for BuildableFolders, except that their documents are strictly tree-structured, whereas our import components produce graph-struc- tured configurations. Because our versioning system is independent of component type, we believe that it can be re-used to represent arbitrary versioned documents. We share their view that a change to an internal node in a document requires the entire document to be revised. ObjectStore also provides generic version management for arbitrary C++ objects, using a binary differencing mechanism for deltas between versions. References (pointers) to other objects float to an appropriate version as pages are mapped in, based on a selection mechanism that is similar to the view selection of ClearCase. This contrasts with the immutable configuration bindings that characterize the Vesta approach. In addition, checkout/ checkin is tied to a hierarchy of workspaces, with the understanding that changes will eventually propagate from the leaves to the root of the hierarchy. This approach, often referred to as the copy-modify-merge model, is also taken by Teamware [27], which uses a combination of whole directory trees and SCCS [28] files to represent a configuration. In contrast the Vesta approach tries to minimize the need for copy-modify- merge by using modularization and interfaces between subsystems. We conjecture that both approaches have a role to play in configuration management, but with copy-modify-merge best restricted to fine-grained com- ponents for which modularization is not applicable. Onodera [29] describes experience representing fine- grain C++ program information in a database, also using ObjectStore. The results show that a more global approach to storing information on multiple programs does reduce overall space requirements. It is also noted that pointer-based structures are inherently less space efficient than the tight encodings in current file formats. We believe that our approach of separating interface and implementation and the judicious use of transient objects, can maintain an interface that is convenient for programming yet space efficient in the database. Baker [30] has argued that functional objects can sim- plify the construction of distributed and parallel compu- tations, and his conclusions echo many of the ideas of the Vesta approach. 9 Conclusions We have described the design and implementation of a prototype modern configuration management system. It follows the approach pioneered by Vesta, but is imple- mented on a commercially available object-oriented database. Our experience suggests that transacted data, combined with typed, immutable components, provide an excel- lent basis for a software development environment. Many problems of scale that were experienced by ear- lier efforts can be solved by the pervasive application of incremental techniques. These are made possible by the combination of immutability and abstract interfaces to objects. Our performance measurements suggest that it is no longer necessary to build a special purpose repository to support an SDE, and that the current generation of object-oriented databases can provide an adequate infra- structure, provided applications pay appropriate atten- tion to object clustering. 10 Future Plans Our future plans are mainly in two areas. First, we intend to evolve the prototype into a system that we can use for everyday programming, including the develop- ment of the system itself. Our current prototype supports an experimental evolution of C++ [31], which afforded us the luxury of designing the compiler to exploit the database directly. A working system for ANSI C++ will be somewhat more constrained, but we expect at least to replace the use of ad hoc file databases for derived infor- mation. For unmodifiable tools, Vesta's bridge tech- niques [17] are directly applicable. The second area of research is to expand and exploit the use of fine-grained components in all areas of the soft- ware development environment. We plan to investigate the integration of structured document editors with the configuration management system. We are also inter- ested in exploiting component abstract values to achieve build-avoidance at a finer grain, using the basic mecha- nisms of the system builder. 11 Acknowledgments Jon Gibbons designed and implemented a prototype ver- sion of the Forest component system. Ted Goldstein designed the DBNewPlace interface. Roy Levin pro- vided helpful insight and discussions on Vesta. 12 Trademarks Ethernet is a registered trademark of Xerox Corporation. Microsoft is a registered trademark and Windows is a trademark of Microsoft Corporation. Object Design Inc. and ObjectStore are registered trademarks of Object Design Inc. Solaris is a trademark of Sun Microsystems Inc. SPARCstation and SPARCcenter are trademarks of SPARC International, Inc., licensed exclusively to Sun Microsystems Inc. unix is a registered trademark in the United States and other countries, exclusively licensed through X/Open, Ltd. ClearCase and Atria are regis- tered trademarks of Atria Software, Inc. 13 References [1] An Overview of PCTE and PCTE+, Gerard Boud- ier, Ferdinando Gallo, Regis Minot and Ian Tho- mas, Proceedings of the ACM/SIGSOFT Software Engineering Symposium on Practical Software Development Environments, Boston, Massachu- setts, November 1988, 248-257. [2] Types and Persistence in Database Programming Languages, Malcom P. Atkinson and O. Peter Bun- eman, ACM Computing Surveys 19,2 (June 1987) 105-190. [3] The ObjectStore Database System, Charles Lamb. Jack Orenstein and Dan Weinreb, Communications of the ACM 4,10 (October 1991) 50-63. [4] The Annotated C++ Reference Manual, Margaret A. Ellis and Bjarne Stroustrop, Addison-Wesley, Reading, Massachusetts, 1990. [5] Make-A Program for Maintaining Computer Pro- grams, Stuart I. Feldman, Software--Practice & Experience 9,3 (March 1979) 255-265. [6] The Common Object Request Broker: Architecture and Specification, Object Management Group, Document No. 91.12.1, 1991. [7] Inside OLE 2, Kraig Brockschmidt, Microsoft Press, ISBN 1-55615-618-9, 1994. [8] A Simple and Efficient Implementation for Small Databases, Birrel et al., Proceedings of the Elev- enth ACM Symposium on Operating System Prin- ciples, August 1987. [9] A Study of Pickling Emphasizing C++, Daniel Craft, Olivetti Software Technology Laboratory Technical Report STL-89-2, September 1989. [10] An Extensible Programming Environment for Modula-3, Mick Jordan, Proceedings of the Fourth ACM SIGSOFT Symposium on Software Develop- ment Environments, Irvine, California, December 1990, 66-76. [11] Smalltalk-80, The Language and its Implementa- tion, Adele Goldberg and David Robson, Addison- Wesley, Reading, Massachusetts, 1983. [12] The Vesta Approach to Configuration Manage- ment, Roy Levin and Paul McJones, DEC Systems Research Center TR 105, June 1993. [13] Some applications of Rabin's fingerprinting method, Capocelli et al. (ed), Sequences II: Meth- ods in Communication, Security and Computer Science, Springer-Verlag, New York, 1991. [14] Design, Implementation, and Evaluation of a Revi- sion Control System, Walter F. Tichy, Proceedings 6th International Conference on Software Engi- neering, Tokyo, Japan, September 1982, 58-67. [15] A Structural View of the Cedar Programming Environment, Daniel C. Swinehart, Polle T. Zell- weger, Richard J. Beach and Robert B. Hagmann, ACM Transactions on Programming Languages and Systems 8,4 (October 1986) 419-490. [16] Safe, Efficient Garbage Collection for C++, John R. Ellis and David L. Detlefs, USENIX C++ Con- ference Proceedings, Cambridge Massachusetts, April 1994, 143-177. [17] Bridges: Tools to Extend the Vesta Configuration Management System, Mark R. Brown and John R. Ellis, DEC Systems Research Center TR 108, June 1993. [18] The Vesta Repository: A File System Extension for Software Development, Sheng-Yang Chin and Roy Levin, DEC Systems Research Center TR 106, June 1993. [19] The Vesta Language for Configuration Manage- ment, Christine B. Hanna and Roy Levin, DEC Systems Research Center TR 107, June 1993. [20] Practical Use of a Polymorphic Applicative Lan- guage, Butler W. Lampson, and Eric E. Schmidt, Conference Record of the Tenth ACM Symposium on Principles of Programming Languages, Austin, Texas, January 1983, 237-255. [21] A Unifying Model for Consistent Distributed Soft- ware Development Environments, J. Walpole, G. S. Blair, J. Malik and J. R. Nicol, Proceedings of the ACM/SIGSOFT Software Engineering Sympo- sium on Practical Software Development Environ- ments, Boston, Massachusetts, November 1988, 183-190. [22] The Object Management System of PCTE as a Software Engineering Database Management Sys- tem, Ferdinando Gallo, Regis Minot and Ian Tho- mas, Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, Palo Alto, California, December 1986, 12-15. [23] Configuration Management in the PACT Software Engineering Environment, Ian Simmonds, Pro- ceedings of the 2nd International Workshop on Software Configuration Management, Princeton, New Jersey, October 1989, 118-121. [24] ClearCase Concepts Manual, Atria Software, 1992. [25] Computer-Aided Software Engineering in a Dis- tributed Workstation Environment, David B. Leb- lang and Robert P. Chase, Jr., Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, Pittsburgh, Pennsylvania, April 1984, 104-112. [26] Fine-grained Revision Control for Collaborative Software Development. B. Magnusson, U. Ask- lund, S. Minör, Lund Institute of Technology, Department of Computer Science, LU-CS-TR:93- 112. [27] CodeManager User's Guide, Sun Microsystems Inc., Part No. 801-2169-11. [28] The Source Code Control System, Marc J. Roch- kind, IEEE Transactions on Software Engineering, SE-1,4 (December 1975) 364-370. [29] Experience with Representing C++ Program Infor- mation in an Object-Oriented Database, T. Onodera, Proceedings Object-Oriented Program- ming Systems, Languages, and Applications, Port- land, Oregon, October 1994, 403-413. [30] Equal Rights for Functional Objects or, The More Things Change, The More They Are the Same, Henry G. Baker, ACM OOPS Messenger 4,4 (October 1993) 2-27. [31] The Clarity Language Definition, Mick Jordan et. al. (ed) Sun Microsystems Laboratories, Technical Report: in preparation. Table 1: Simulation Results with a Single UID Generator Number of Concurrent Users Operation 1 2 3 4 5 6 7 8 Initial Scan 36 40 39 44 45 53 57 106 Checkout 21 39 94 114 155 157 195 226 Checkin 16 36 83 92 123 172 179 217 Folder Edit 24 60 92 141 152 153 180 242 Read 29 31 29 30 30 32 30 33 Write 105 162 372 526 682 814 940 1145 Table 2: Simulation Results with Multiple UID Generators Number of Concurrent Users Operation 1 2 3 4 5 6 7 8 Initial Scan 44 43 42 44 48 52 60 65 Checkout 21 26 27 37 49 51 60 76 Checkin 16 16 21 24 22 26 28 31 Folder Edit 21 23 26 30 34 37 37 44 Read 30 30 30 31 32 34 33 37 Write 98 109 127 134 136 181 186 222