Garbage Collection and DSM Consistency* Paulo Ferreira and Marc Shapiro INRIA - Projet SOR Abstract property of persistence by reachability [2 ]; this prop- This paper presents the design of a copying garbage erty states that only objects reachable from the persis- collector for persistent distributed shared objects in tent root should be persistent. In other words, objects a loosely coupled network with weakly consistent dis- that are no longer reachable from the persistent root tributed shared memory (DSM). should not be stored on disk. Even in a persistent 64- The main goal of the design for this garbage collec- bit address space, there is a need for memory reorga- tor is to minimize the communication overhead due to nization and address recycling, otherwise the address collection between nodes of the system, and to avoid space gets severely fragmented and secondary storage any interference with the DSM memory consistency fills with garbage. protocol. In addition, distributed shared memory systems Our design is based on the observation that, in a have become popular because they support a simpler weakly consistent DSM system, the memory consis- programming model for distributed applications than tency requirements of the garbage collector are less RPC-based systems [3 ]. Furthermore, weak consis- strict than those of the applications. Thus, the tency protocols seem to offer the best performance garbage collector reclaims objects independently of when compared to sequential consistency [4 ]. other copies of the same objects without interfering For these reasons, we have designed and imple- with the DSM consistency protocol. Furthermore, our mented a platform, called BMX [8 ], that provides design does not require reliable communication sup- persistent weakly consistent shared distributed virtual port, and is capable of reclaiming distributed cycles of memory and copying garbage collection. We chose dead objects. a copying garbage collector because it can improve an application's locality [13 ], it contributes to reduce 1 Introduction memory fragmentation, and it provides sufficient sup- port to reclaim cycles of unreachable objects. Garbage collection (GC) is a fundamental component The paper focuses on the issue of how the garbage for supporting persistent objects in distributed sys- collector copies shared objects without interfering with tems. The importance of garbage collection in such the DSM consistency protocol. This is an impor- systems is twofold: first, the object graphs of appli- tant systems problem, since interference between the cations, like financial or design databases, cooperative garbage collector and the consistency protocol could work and exploratory tools similar to the World-Wide- potentially nullify the advantages of using a weakly Web, are very intricate, which makes manual storage consistent DSM system. For example, when updating management increasingly difficult and error-prone, of- a reference inside an object, to reflect the new location ten resulting in dangling pointers and storage leaks. of a live descendent that has already been copied, the Second, garbage collection is necessary to support the garbage collector should not require exclusive write- _____________________________* access to modify the object. If exclusive write-access This work has been done within the framework of the ES- PRIT Basic Research Action Broadcast 6360, and was partially was needed, read-access to all other replicas of the ob- supported by Digital Equipment Corporation. ject would have to be invalidated, therefore nullifying yFull time Ph.D. student at Universit'e Pierre et Marie Cur- the advantage of using weak consistent DSM. Current rie (Paris VI). Supported by a JNICT Fellowship of Program distributed GC algorithms do not handle this problem; Ci^encia (Portugal). Email: Paulo.Ferreira@inria.fr. Tel: +33 (1) 39 63 52 08. Fax: +33 (1) 39 63 53 30. Address: INRIA - they implicitly assume the existence of a single object Rocquencourt, B.P. 105 - 78153 Le Chesnay Cedex, FRANCE. copy, which is not the case in a DSM system. Besides the problem of avoiding interference with memory pages with a constant size. BMX ensures that the DSM consistency protocol, the other two problems segments have non-overlapping addresses. that need to be addressed in the design of a garbage Segments are logically grouped into bunches because collector for distributed shared objects are: (i) collec- a single segment is not flexible enough to support solu- tion of acyclic distributed dead objects, and (ii) col- tions for situations like segment overflow. Each bunch lection of distributed cycles of garbage. Due to space has an associated owner, and protection attributes like limitations, we only present a general overview of our the usual Unix read, write, and execute permissions. solution to these two issues (see Ferreira[9 ] for more BMX supports recovery for operations on bunches. detail). As mentioned above, this paper focuses on Recovery is based on the recoverable virtual mem- the techniques used by the garbage collector to avoid ory techniques proposed by Satyanarayanan et al. [19 ]. interference between the collector and the DSM con- Therefore, after a bunch is mapped into memory, ev- sistency protocol. ery modification performed on the bunch's range of Our GC algorithm is orthogonal to DSM consis- addresses has an associated log entry and can be re- tency, that is, it tolerates inconsistent objects, there- covered after a system failure. fore it is generally applicable to other consistency pro- In summary, the user program, called the mutator in tocols. Furthermore, our design is application- and the GC literature [7 ], operates on a single, shared, per- language-independent. However, the compiler must be sistent, possibly large graph of objects allocated from a instrumented in order to interface with the algorithm number of bunches. These bunches can be simultane- support layer. ously replicated on several nodes in the system and are The paper is organized as follows. The next sec- kept weakly consistent by the DSM system described tion presents the basic aspects of the BMX platform. below. Section 3 gives a global overview of the GC design. The collection of shared objects and its relation to the DSM consistency protocol is described in Sections 4 2.2 DSM Support and 5. Sections 6 and 7 briefly describe acyclic and cyclic distributed garbage collection. Section 8 gives The BMX system supports weakly consistent dis- an overview of the implementation, and finally, the tributed shared memory based on the entry consis- paper ends with related work and some conclusions. tencymprotocole[4s].saApplicationsgdoenotssend.explicitThey o* *nly use the distributed @ ory paradigm for communication. 2 BMX Overview The entry consistency protocol provides the tradi- tional model of multiple readers and a single writer. This section presents the aspects of the BMX platform Thus, there can either be several read tokens, or that are relevant to the GC design described in this one exclusive write token associated with each object. paper. Nodes holding a read token are ensured to be reading a consistent version of the corresponding object. The 2.1 Objects, Segments, and Bunches possessionooftthehwriteetokenrmeanscthatotherenissnoistent co* *py of the object at any @ BMX offers a 64-bit single address space spanning all the network. The entry consistency protocol therefore the nodes of a network, including secondary storage. guarantees that an object is consistent, with respect The object, which consists of a contiguous sequence to previous operations on the object, as long as a node of bytes, is the basic unit of identification and invo- holds the corresponding read or write token. Other- cation. An object is represented by its address; ob- wise the observed state of the object is undefined. ject references are therefore ordinary pointers. Each Every object has an owner, which is either the node object has an header that precedes the object's data, currently holding the object's write token, or the node which includes system information such as the object's that last held the write token. A write token can only size. We assume that objects are passive and generally be obtained from the object's owner, while a read to- small, that is, the size of most objects is much smaller ken can be obtained from any node already holding a than a virtual memory page. read token. A token is obtained by performing a read Objects can become persistent by reachability, that or write token acquire operation and is freed by the is, they are persistent if reachable from the persistent corresponding release. root. Once mapped in main memory, such objects are Tokens are managed with an algorithm similar to shared through a DSM mechanism, just like any other Li's dynamic distributed manager with distributed non-persistent object. copy sets [16 ]. Thus, the copy-set of an object (list For clustering purposes, objects are allocated within of nodes with a read token) is not centralized by its segments. A segment is a set of contiguous virtual owner; rather, it is distributed among the owner and those nodes that have transitively granted a read token algorithms support an integral GC solution for a DSM to other nodes. system providing a high degree of scalability and par- In addition, there is a forwarding pointer mech- allelism. anism indicating which node is the current object's This paper focuses on the garbage collection of a owner. We call such a forwarding pointer an ownerPtr. replicated bunch and how the BGC interacts with Therefore, for each bunch there is a set of entering DSM consistency. However, for the sake of complete- ownerPtrs that originate at nodes with non-owned ness, we will include an overview of both the scion replicas for the objects in this bunch, and a set of cleaner and the group garbage collector (Sections 6 exiting ownerPtrs pointing to the owner node of each and 7). of the objects from this bunch. 3.1 Stubs and Scions 3 Garbage Collection Design Figure 1 illustrates the use of stubs and scions. These stub-scion pairs are simpler than the ones used in The premise of our garbage collection algorithm is that RPC-based distributed systems [20 ], because: (i) they an application's object graph can be enormous and are not used for indirections, that is, they are just widely distributed among the nodes of the system. It auxiliary data structures that describe relevant refer- would therefore not be feasible to collect all objects ences, and (ii) stubs/scions do not perform any kind of an application at the same time. Our algorithm of marshaling/un-marshaling. collects each bunch of objects independently of any There are two kinds of slightly different SSPs: an other bunch. inter-bunch SSP describes references that cross bunch To be able to collect a bunch independently, it must boundaries, while an intra-bunch SSP records relevant be isolated from all other bunches. In other words, dependencies between copies of the same bunch. every copy of a bunch has to contain enough local in- Inter-bunch SSPs have the same direction of the formation to independently make all reachability deci- cross-bunch reference they represent (for example, sions for its objects, that is, without requesting infor- O3!O5 in Figure 1). When an object becomes cached mation from any other bunch, nor from other copies of on multiple nodes, the inter-bunch stubs that repre- the same bunch. For this purpose each cached copy of sent the object's links to objects in other bunches do a bunch holds two tables (in addition to the tables of not have to be replicated. This is not problematic be- entering and exiting ownerPtrs): the stub table and cause a single SSP is enough to keep the target object the scion table. The stub table contains information alive in the whole system, as will be described further about outgoing links, that is, which objects referenced on. Figure 1 illustrates this situation: in spite of the from within the bunch are allocated in some other fact that O3 is cached on N1 and N2, there is only one bunch and the bunches to which they correspond. The inter-bunch stub due to O3!O5 that is kept at N2. scion table contains information about incoming refer- Intra-bunch SSPs have the opposite direction of the ences, that is, which local objects are referenced from corresponding ownerPtr, and are used to preserve an other bunches and from where these references orig- object's replica at a node that stores inter-bunch stubs inate. Thus, for each stub there is a corresponding created when the node was previously the owner of the scion, these two entities form a stub-scion pair (SSP). object, but is no longer so. Intra-bunch SSPs are nec- The bunch isolation provided by stubs and scions af- essary because inter-bunch SSPs do not move with fects how inter-bunch reachability is propagated, and the ownership of an object, instead the intra-bunch therefore how inter-bunch garbage is collected. For SSP serves as a forwarding link. For example, in spite this reason, our GC design is based on three sub- of being unreachable by the mutator at N2, object O3 algorithms that perform complementary tasks: the must be kept alive at this node because: (i) there is an first component, called the bunch garbage collector inter-bunch reference in O3 for which the correspond- (BGC), executes the collection on a local replica of a ing stub is allocated at N2, and (ii) a copy of O3* * is bunch, independently from the collection of any other reachable by a mutator at N1. Object O3 can be col- bunch and other replicas of the same bunch; the sec- lected in N2 only after becoming unreachable in N1. ond component, called the scion cleaner, uses informa- tion generated by the bunch garbage collectors of other 3.2 Creation of Stub-Scion Pairs bunches (stub tables and lists of outgoing ownerPtrs) to recognize which objects are no longer reachable from An inter-bunch SSP is automatically constructed im- remote bunches or remote copies of the same bunch; mediately after detecting the creation of the corre- and finally, the last component is called the group sponding inter-bunch reference. This detection is done garbage collector (GGC), which is in charge of reclaim- with a write-barrier [10 ] associated with every write ing inter-bunch cycles of garbage. Together, these sub- performed by an application. Figure 1: Bunch B1 is mapped on nodes N1 and N2, and bunch B2 is mapped only on N3; mapped bunches are represented with a solid line, unmapped bunches are represented with a dashed line. Stub and scion tables contain inter-bunch and intra-bunch SSPs. For each object, the state of its token is indicated as follows: letters r and w indicate that the node has a read or a write token respectively; o means that the node is the object's owner (thicker objects); i is used for inconsistent copies. The local root includes mutator stacks. No GC has taken place on any node. When an inter-bunch reference is created, either inter-bunch reference was created, to N1, the corre- both the source and target bunches are already sponding intra-bunch SSP from N1 to N2 is created. mapped on the local node, or only the target bunch We decided to use intra-bunch SSPs, instead of repli- has not yet been locally mapped. In the first case, the cating inter-bunch SSPs, in order to reduce the num- corresponding stub and scion get created locally. The ber of scion messages and the amount of memory con- second case requires that a message be sent to a node sumed for GC purposes. In fact, if inter-bunch SSPs where the target bunch is mapped. This message is were replicated, each time object ownership changes, a called a scion-message and is used to inform the tar- new inter-bunch SSP would have to be created, which get bunch about the necessity of creating the scion would imply sending the corresponding scion-message. corresponding to the the new cross-node inter-bunch By using intra-bunch SSPs, no extra messages are reference. needed, because the information is piggy-backed onto Figure 1 shows a cross-node inter-bunch SSP re- consistency protocol messages. In addition, an inter- quired by the link O3!O5 from N2 to N3. Node N3 gets bunch SSP occupies more memory than an intra-bunch a scion-message from N2 when the inter-bunch refer- SSP. ence O3!O5 is created and creates the matching scion for the stub on N2. Intra-bunch scions are created when the ownership 4 Bunch Garbage Collection of an object moves from one node to another and the old owner holds an inter-bunch stub for this object. This section describes the garbage collection of a In other words, the old owner created a link to an bunch with multiple, possibly inconsistent, cached object in a different bunch and the information re- copies on several nodes. We first present the main quired by the garbage collector for this purpose, that aspects of the algorithm; then, we describe the algo- is the inter-bunch stub, needs to be preserved after rithm in more detail, focusing on how live objects are the object's ownership is transferred. The intra-bunch copied and scanned, how reachability information is SSP takes care of creating a forwarding link between regenerated by the BGC, how references are updated the new owner and the inter-bunch stub at the old and how the from-space is reused. Each of these issues owner. Therefore, in the example illustrated by Fig- is discussed in light of what is needed for the collector ure 1, when O3's write token goes from N2, where the not to interfere with the DSM consistency protocol. 4.1 Outline could disrupt the application's working-set. For exam- ple, each readable copy would be invalidated. The BGC is based on the algorithm by O'Toole et. Our solution avoids this drawback by only copy- al [17 ] for three main reasons: (i) the time to flip1 ing those objects that are locally owned. Thus, if is very small and therefore not disruptive to applica- the node holds the write token or was the last one to tions, (ii) portability (no virtual memory manipula- hold it, the corresponding object is copied to to-space tions), and (iii) objects are non-destructively copied and scanned, and a forwarding pointer is written into (suitable for recovery purposes). However, any other the object's header, which is left in from-space. This copying algorithm could be used. header modification is strictly local and does not im- When a bunch has several copies on different nodes, ply acquiring the object's write token because, at this a separate local BGC operates on each copy. The local time, only the local node has to be aware of the ob- collection of a bunch proceeds independently of the col- ject's new location. As explained later in this section, lection of other bunches, and of the collection of copies other nodes will eventually be informed of the object's of the same bunch at other nodes. The roots of the new location. BGC are located in the mutators stack, intra-bunch and inter-bunch scions, and list of entering ownerPtrs. Other live objects, that is, those not locally owned, Each time the local BGC is executed, it reconstructs are simply scanned. An important aspect of this de- a new version of the bunch's stub table, and a new sign is that these objects can be scanned, even though set of exiting ownerPtrs. The new stubs and exiting their copy might not be consistent with the owner's. ownerPtrs will later be sent to the scion cleaner of In fact, an inconsistent copy of the object is sufficient, all nodes that either have cached copies of the same because scanning an old version results in making a bunch or contain the scions corresponding to the stubs more conservative decision about the referenced ob- of both old and reconstructed stub tables. On those jects reachability, ensuring that they will not be erro- nodes, the scion cleaner discovers and removes all local neously collected if not dead. intra-bunch and inter-bunch scions that are no longer The header with the forwarding pointer, left in place reachable from any stub, and all incoming ownerPtrs of an object copied to to-space, is deleted only when for local copies of objects that are no longer live re- every reference to that object has been updated with motely. the new address. These references can be local or re- To avoid synchronizing all nodes in an object's copy- mote, as shown in Figure 2: object O2 has been copied set, a local BGC only copies the objects that are locally to the to-space segment by the BGC in N2; thus, point- owned. Non-locally owned objects are simply scanned. ers inside O1 and O3 must be updated accordingly, on This asynchronous collection of different copies of the both nodes. same bunch can cause inconsistent views of object's After updating the local references to a copied ob- addresses across cached replicas. However, as ex- ject, and before performing the same operation on re- plained in the following sections, the asynchronous col- mote references to the object, the copied object will be lection of object replicas is not a problem, because at different addresses on different nodes. However, this addresses on other nodes can be updated when these is not a problem. The data inside the copied object is nodes synchronize on behalf of application's DSM con- kept consistent from the applications point of view, as sistency needs. Thus, applications designed for weakly guaranteed by the consistency protocol. Thus, remote consistent DSM systems will work correctly. references can be updated lazily. As an example, con- sider Figure 2: after updating the references to O2 in N2, and before performing the same operation in N1, 4.2 Copying/Scanning Live Objects object O2 will exist in different addresses on these two nodes. However, mutators in both nodes continue to A local BGC copies live objects from the from-space work correctly. segment to the to-space segment, independently of other BGCs of the same bunch. The first challenge of Applications perform correctly because they are im- such an algorithm is to avoid that two BGCs, simul- plemented for a weakly consistent DSM. Remember taneously executing on different replicas of a bunch, that applications do not send explicit messages that move the same live object to different memory loca- could, for example, reference objects with a new ad- tions. One obvious solution to this problem would dress not known to the receiver. Furthermore, to ac- be to acquire the write token of every live object be- count for the existence of forwarding pointers, a spe- fore copying it. However, this solution is undesirable, cial operation is provided to perform pointer compar- since it would trigger memory consistency actions that ison. The conditions that need to be upheld at DSM _____________________________1 synchronization points (read or write token acquires) Time during which an application is stopped due to garbage to ensure the proper execution of applications are de- collection. scribed in Section 5. Figure 2: Zooming into Figure 1, we show more detail of bunch B1 on nodes N1 and N2. The BGC on N2 only copies locally-owned live objects, that is, O2. The update of pointers to O2 is represented by dashed arrows. Node N1 has not yet been informed of the O2's new address, and the local BGC of B1 has not been executed. 4.3 Creating new Stub and Outgoing o if neither the inter-bunch reference has been cre- ownerPtr Tables ated locally, nor the local node is the object's owner, than no stub is added to the new stub list. As mentioned earlier, a local execution of the BGC scans all objects that are reachable from the mutators Furthermore, for objects that can only be accessed stacks, scions, and entering ownerPtrs, and creates via an intra-bunch scion the exiting ownerPtr will be a new table of inter-bunch and intra-bunch stubs, as omitted from the new outgoing list (see Section 6.2). well as a new list of outgoing ownerPtrs. The new The new intra-bunch stubs and the new set of exit- stub table and list of outgoing ownerPtrs represents ing ownerPtrs constructed by the local BGC are even- all objects in other bunches or in other nodes that are tually sent to other nodes where the corresponding ob- accessible from the local copy of the bunch. An ob- jects are mapped. ject that has been locally garbage collected will neither add a stub nor an outgoing ownerPtr to the new ta- 4.4 Updating References bles. An inter-bunch stub will not be added to the new stub table if the corresponding local object no longer The next challenge in designing our garbage collec- includes the inter-bunch reference associated with the tor is how to update all local and remote references stub. When the BGC scans a live object containing to an object that has been copied to to-space, with- an inter-bunch reference, three actions may be taken out interfering with the DSM consistency protocol and (see Figure 1): without incurring in high communication overhead be- tween nodes. o if the inter-bunch reference has been created at At first, it may seem impossible to update a refer- the local node (as is the case of O3 at N2), than ence without interfering with the consistency proto- the corresponding inter-bunch stub is added to col, because it implies updating the object that con- the new stub table, tains the reference. Normally to update an object it is necessary to acquire the write token for the object, o if the inter-bunch reference has not been created which would in turn make outstanding readable copies locally, but the scanned object is locally owned (as stale. However, the same way it was possible to copy is the case of O3 at N1), than the corresponding an object without exclusive-access, the object can also intra-bunch stub is added to the new stub list, be modified without acquiring the corresponding write and token, because the modification is visible only to the local node, and does not affect the applications behav- and that the forwarding pointers are no longer neces- ior on other nodes, which might be accessing another sary. copy of the object. Hence, reference updating can be It is worthy to note that a from-space segment will done without interfering with the consistency protocol. be reused for allocating new objects only after the For example, in Figure 2, updating O1 in N2 with O2's corresponding to-space segment becomes full. Until new address can be done without acquiring O1's write then, non-owned objects remaining in the from-space token. In spite of the difference between O1 on both segment may either die or be copied by their owners. nodes, mutators at N1 and at N2 still see a copy of O1 Furthermore, all the space that was occupied by dead that is consistent with the application's requirements. objects could be completely reused. For the purpose of updating remote references to Nevertheless, suppose that we want to fully reuse or locally copied objects we want to avoid sending an discard a from-space segment. In that case, we must explicit message from the node where an object was ensure that it contains neither forwarding pointers to copied to the node holding the remote reference. Fur- already copied objects nor non-locally owned live ob- thermore, we want to avoid blocking an application jects. Both conditions are guaranteed by informing while such an update is taking place. all other nodes affected by the address changes in this Because remote references can be updated lazily, an segment about these changes, and by asking the owner object's new address can be communicated to other nodes to copy those live objects still allocated in the nodes by piggy-backing such information onto mes- from-space segment. Once the local node receives the sages due to the consistency protocol, which are per- replies to the above messages, the from-space segment formed on behalf of applications. Thus, no extra mes- can be fully reused or freed. sage is used. For instance (see Figure 2), O2's new Since the address-change messages are exchanged in address can be sent from N2 to N1 in a message due to the background, applications can make progress with- the consistency protocol exchanged between these two out having to process them immediately. From the nodes. After N1 receives O2's new address, O2 is copied point of view of the application, processing these mes- to the indicated address, and all the local references sages is part of the garbage collection overhead, and is are updated accordingly without requiring any token. no more disruptive than garbage collection itself. Reference updating does not have to be done imme- For example, consider Figure 2: node N2 informs N1 diately after receiving the corresponding message with of O2's new address, asks the BGC in N1 to copy its an object's new location. In fact, it can be postponed locally owned live objects (O1 and O3), and updates its until a bunch garbage collection takes place at the local local references to O1's and O3's new location. After node. Therefore, there is a tradeoff on how consistent that, B1's from-space segment can be freed or reused the addresses are going to be and the overhead of im- in its entirety. mediately executing the updates at the remote nodes. Note that the list of nodes where an object's refer- It is important to note, that if there is no com- ence must be updated is already kept in the object's munication between nodes on behalf of applications, owner node for the purpose of the DSM mechanism: then there is no need for updating references unless nodes from where the set of entering ownerPtrs origi- the from-space needs to be reused. In this case, as ex- nate. This implies that there is no extra memory over- plained in the next section, explicit messages must be head due to the GC. used. 4.5 Reusing From-Space 5 DSM Acquires and the BGC Since during bunch garbage collection only locally As explained before, since remote object references are owned objects are copied to to-space, it may happen updated lazily, different replicas of the same object can that some live objects and forwarding pointers remain be located at different addresses on different nodes af- in the from-space after the local BGC has completed. ter the execution of a BGC. We also mentioned that Thus, immediately after a local bunch garbage collec- mutators on these nodes will operate correctly, because tion, the from-space segment might not be fully reused the data held by the objects is consistent with respect nor freed. This is the case of B1's from-space segment to the guarantees made by the DSM consistency pro- after the BGC on N2 has finished (see Figure 2): ob- tocol. However, after synchronization points in the jects O1 and O3 are live and remain in the from-space application, that is, read or write token acquires, the segment. These objects will eventually be copied by system has to ensure that the synchronizing nodes ref- B1's local BGC running on node N1. Before reusing or erence objects with the same addresses. In this section freeing a from-space segment, it is therefore necessary we describe three invariants that the garbage collector to ensure that no live objects remain in the segment has to ensure for this purpose. 1. The acquisition of a read or write token for an ob- stubs or an intra-bunch stub for the object. Re- ject can complete only after ensuring that the ob- member that the intra-bunch SSPs serve as for- ject's address and all references inside it are valid warding pointers from the new owner to inter- at the acquiring node. bunch stubs located at previous owners of the ob- This invariant avoids erroneous situations such as ject. the following: suppose that object O1 points to The three invariants can be maintained without in- O2, both objects are allocated from bunch B and curring in extra communication overhead, by taking nodes N1 and N2 hold replicas of the objects. advantage of the messages transferred by the DSM system on behalf of the applications at synchroniza- (a) Node N1 is the owner of both objects. tion points, that is, replies to acquire messages. For (b) Bunch B is collected at N1, therefore O1's a DSM system that does not require applications to copy at N1 will point to the new location of synchronize on accesses to shared objects, the invari- object O2. antsncanobedguaranteedebyfensuringathatuwheneverlats on the acc* *ess to an object O, th@ (c) Node N2 issues a read or write token acquire supplies O also sends all the necessary location updates request for O1 to N1. and intra-bunch SSP information. (d) If invariant 1 was not maintained, the new Now, let us present a detailed example of the specific copy of O1 at N2 could point to the new ad- operations that are executed to satisfy the invariants dress of O2, while O2 was still at the old ad- presented above before a write token acquire is com- dress. pleted (see Figure 3). Suppose that a write token for object O1 is requested by node N2 from node N1: This first invariant prevents the situation de- o If neither O1 nor any of the objects referenced scribed above, by ensuring that N2 is aware of directly by O1 have been copied to to-space at O2's new location before the acquire request com- either node (situation represented by case (a) and pletes. The invariant is easily maintained by (c)), no special operation has to be performed. piggy-backing information with the new locations of the object being acquired and of every object o On the other hand, if either O1 or an object directly referenced from it, onto the reply to the pointed at by O1 have been copied to to-space at acquire message. N1 (situation represented by case (b) and (c)), their new locations are piggy-backed in the mes- 2. A node that receives a message with the new loca- sage granting the token to N2 and are processed at tion for an object forwards this information to all N2 before the application returns from the token the nodes that are in the local copy-set for the ob- request. ject, that is, to which it has granted a read token. In addition, the new locations of O1 and of the ob- jects directly referenced by O1 are forwarded by N2 This second invariant is necessary because a dis- to all nodes to which it had granted a read token tributed copy-set algorithm is used for token man- for those objects. Nodes to which N2 granted a agement. Remember that a read token for an ob- read token are listed in the corresponding object's ject can be obtained from a node already holding copy-set. a read token. Therefore, a node with a read to- ken for an object, that is not the object's owner, o If any of the objects pointed at by O1 gets copied is responsible for forwarding messages containing to to-space at N2 prior to the write token acquire an object's new location. The mechanism, to for- operation (situation represented by case (d)), ward new location information to all nodes with when N2 receives the valid copy of O1 from N1 with a read token, is similar to the one used to invali- the token, it updates all references in O1 pointing date all read copies of an object when some node to forwarding pointers in from-space, to point to acquires the object's write token. the new addresses in to-space directly. 3. The acquisition of a write token for an object com- o Ifranyeinter-bunchpstubsrexisteforsO1eatnN1t(noted in Fig* *ure 3), before the wri@ pletes only after all necessary intra-bunch SSPs acquire can complete, an intra-bunch SSP for O1 have been created. has to be created, pointing from N2 to N1. N1 cre- This invariant ensures that the appropriate intra- ates the intra-bunch scion before it replies with bunch SSP is created between the new owner of the token-grant message, and piggy-backs a re- the object and an old owner, when ownership is quest for N2 to create the appropriate intra-bunch transferred from a node holding either inter-bunch stub upon reception of the message. Figure 3: Bunch B is mapped on nodes N1 and N2 (before and after the execution of the BGC). Thicker arrows represent references locally updated by the BGC. Dashed lines are used to represent forwarding pointers left in place of copied objects. This set of operations, executed when a write token ability information must be received in FIFO order. is acquired, ensures that, after synchronizing, a pair of Otherwise, the scion cleaner may use an old stub ta- nodes sees consistent addresses for objects in the DSM ble that does not match the scions being considered. system, by guaranteeing the three invariants described Such an inconsistency could result in the erroneous earlier. deletion of a scion. Since the messages with stub and ownerPtr information are exchanged between a pair of nodes (point-to-point communication), FIFO 6 Scion Cleaner ordering is easily guaranteed by numbering the mes- This section presents an overview of the scion cleaner sages.thInaFerreira[9t]cweadescribensomeoracecsituationscur b* *etween stub table messag@ and then describes in detail how intra-bunch SSPs are messages; however, the description of these race con- deleted when no longer necessary. ditions is beyond the scope of this paper. Messages with the reachability information can be 6.1 Overview piggy-backed onto messages used by the DSM consis- The scion cleaner locally processes the reachability in- tencytprotocol,horeexchangedyinatherbackground.eThus,not disr* *uptive in the sense that@ formation (stubs and outgoing ownerPtrs) that has do not depend on the scion cleaner having finished to been constructed by the execution of a BGC on other continue their execution. Note that the scion cleaner bunches. This process is identical for information re- does not have to process each new message it receives ceived from a bunch with or without a local replica. immediately; messages can be accumulated and their There is a scion cleaner service per node that operates processing can be postponed until the start of the next on all bunches of that node. The cleaner recognizes local BGC. and removes all scions that are no longer reachable from any stub and all entering ownerPtrs that corre- spond to remote replicas that have been garbage col- 6.2 Deletion of Intra-bunch SSPs lected. In essence, the scion cleaner, updates the roots for the next execution of the local BGC. This section describes how an object with copies on The main advantage of sending messages with ta- several nodes is collected and the corresponding intra- bles containing all the reachability information, over bunch SSPs and ownerPtrs are deleted, once all repli- sending increment/decrement messages [5 ], is that the cas become unreachable. former are idempotent. In case of message loss they An object that becomes unreachable from all muta- can be resent without the need for a reliable com- tors at all nodes will end up not being referenced by munication protocol. However, messages with reach- any inter-bunch scion at any node and by any entering Figure 4: O1 is cached on nodes N1, N2, and N3 and is reachable from a single mutator in N1. ownerPtr at its owner node. Thus, when the BGC is rithm used by the GGC is identical to the one used by executed on the owner node, the mentioned object will the BGC, only that it operates on a group of bunches, not be seen. Should the object have any intra-bunch rather than on one bunch at a time. stub(s) on the owner node, these will not become part The root of the GGC includes: mutator stacks, of the new stub table. Transitively, the scion cleaner intra-bunch scions, entering ownerPtrs, and inter- running on the other nodes will delete the unreachable bunch scions identifying source bunches that are not intra-bunch scion(s) and the object will be reclaimed members of the group being collected. The inter-bunch by the next BGC at those nodes as well. scions corresponding to SSPs that originate within the For example, consider Figure 4 and suppose that group that is being collected are not part of the root. the BGC of B is executed at N3; the new set of exiting Therefore, objects in the group, which are not reach- ownerPtrs will not include the one from N3 to N2, able from any sources other than these SSPs, will be because O1 is not reachable from the mutator at N3. collected. In particular, objects that form an inter- However, O1 remains alive at N3 due to the intra-bunch bunch cycle, but are non-reachable from bunches out- scion. Note that if the ownerPtr from N3 to N2 was side the group or the mutator's stack, will be collected, included in the list of outgoing pointers, O1 could never because they are not artificially held over by SSPs from be reclaimed because of the following cycle: O1 on N2 within the group. ! intra-bunch SSP ! O1 on N3 ! ownerPtr from The significance of the group garbage collector is N3 to N2 ! O1 on N2. By not including the outgoing that it can collect an arbitrary subset of the distributed ownerPtr from N3 the cycle is broken and the scion- and persistent objects on a single site, independently cleaner at N2 deletes the entering ownerPtr for N3. of the rest of the address space. Bunches are grouped The BGC running on N2 considers O1 alive because based on a heuristic that maximizes the amount of of the entering ownerPtr, which originates at N1. Now, inter-bunch garbage that is collected and minimizes imagine that O1 becomes unreachable at N1, that is, the cost of performing the collection. Currently, we the reference to O1 is deleted from the local root of the use a locality-based heuristic, that is, we collect a* *ll mutator at N1. Then, a BGC is executed for B on N1. bunches that are in memory at the site where the GGC Object O1 can be reclaimed at N1, and the ownerPtr is going to run. This heuristic avoids disk input-output pointing from N1 to N2 will not be part of the new set overheads. of ownerPtrs exiting B on N1. This locality-based heuristic does not collect cycles Finally, when N2 receives the new information gener- of garbage that partially reside on disk, that is, cy- ated by the BGC at N1, the scion cleaner at N2 deletes cles with objects allocated in bunches not currently the last entering ownerPtr for O1. Therefore, during mapped in memory. Collecting such a cycle involves the next execution of B's BGC at N2, object O1 is no input-output costs that need to be balanced against longer reachable, which in turn will drop the intra- the expected gain. In addition, if an application does bunch stub pointing to O1 at N3 from the new stub not move bunches around the nodes there is a possi- table. Thereafter, when N3 receives this new informa- bility that some dead cycles may not ever be removed tion from N2 and runs its own BGC on B, object O1 at all. We believe that some of these cycles can be col- will no longer be reachable on N3 either, and will also lected by improving the grouping heuristic. However, be garbage collected there. we intend to do that only after having experimented with the locality-based heuristic. If experimental re- sults mandate it, we will explore more complex heuris- 7 Group Garbage Collector tics. Dynamic grouping of collection spaces was first proposed by Lang[14 ]. However, our solution is much The GGC is used to reclaim inter-bunch cycles of simpler because a group collection occurs at a single garbage. There is one GGC per node and it operates site, instead of spanning multiple nodes. Therefore, on groups of bunches local to that node. The algo- we expect our solution to be more scalable. 8 Implementation location of objects inside the bunch, and a reference- map, which indicates where pointers are located inside The current prototype of BMX is implemented in C++ each object. These data structures are implemented for a network of DEC Alpha workstations. The pro- as bit arrays; each bit describes the contents of a 4- totype implementation was simplified by placing the byte address range inside the bunch. A set bit in the following constraints on the system: first, a bunch is object-map means that at the corresponding address is shared only by processes on different nodes, in other an object; a set bit in the reference-map means that at words, there is only one process per node accessing a the corresponding address is a pointer to some object. bunch; second, the copy-set of an object is centralized One of the performance goals of our design is to at the object's owner node, instead of being distributed support replication of bunches on several nodes with- among those nodes that have transitively granted a out increasing the cost of the bunch garbage collection, read token to other nodes; and finally, persistence is when compared to a system without support for bunch supported by associating each segment with a Unix replication. From the point of view of the applica- file. tion, the cost of the BGC should be the same whether The current prototype is based on BMX-servers and the bunch is replicated or not. We believe we can BMX-clients. A BMX-server runs on every node in the ensure this cost property by avoiding the interference system and provides basic services, such as allocation between the garbage collector and the DSM mecha- of non-overlapping segments. The BMX-client is a li- nisms. This expectation is based on two observations: brary that is linked with each application and is used (i) the BGC never acquires a token for any object, and to interact with the BMX system internals, in partic- consequently does not interfere with the DSM consis- ular with the BMX-server. tency protocol, and (ii) information exchanged among Bunches are mapped into shared virtual memory. nodes is either piggy-backed onto messages due to the The BGC runs as a thread inside the process that is consistency protocol, or exchanged in the background. accessing the corresponding bunch. This particular implementation of the BGC is facilitated by our sim- plification of allowing only one process per node to 9 Related Work have access to a bunch. The scion cleaner and the GGC each run as a privileged process on each node. A large amount of literature exists in the area of con- Because these processes are privileged they have access current GC either for multiprocessors [1 , 6], or for to all bunches local to the node. RPC-based distributed systems (see Plainfoss'e[18 ] for Recovery is based on the recoverable virtual mem- a survey). On the contrary, to our knowledge, little ory (RVM) techniques proposed by Satyanarayanan work has been done on garbage collecting objects in a et al. [19 ]. RVM provides simple recoverable transac- loosely coupled network with weakly consistent DSM. tions with no support for nesting, distribution, or con- The fundamental difference between our system and currency control. Recovery in RVM is implemented a multiprocessor system, with respect to GC, is that with a disk-based log. In our prototype we use the of scale and synchronization overhead. This difference approach proposed by O'Toole et al. [17 ], in which implies that, if we apply a GC algorithm designed for the from-space and the to-space are each supported multiprocessors (for instance, Appel[1 ]) to our case, by a file. Changes to mapped segments are atomically the overhead will be unacceptable due to communica- transferred to disk by RVM. tion and synchronization costs. These costs are due As previously mentioned, inter-bunch pointers are to the fact that current multiprocessor GC algorithms described by inter-bunch SSPs that are automatically implicitly assume the existence of strongly consistent allocated whenever an inter-bunch reference is created. objects. In fact, communication and synchronization We detect the creation of cross-bunch references using overhead arises because of the necessity of providing a write barrier technique, that is, write barriers are strongly consistent objects and the interference with inserted into applications by instrumenting every write applications' consistency needs. Note that the over- with a C++ macro. Another macro is provided to head is not due to the synchronization between the perform pointer comparison. This macro is necessary mutator and the GC algorithm, as is usually the case to account for the use of forwarding pointers left by in non-distributed settings. the execution of a garbage collection. Currently, the Furthermore, GC in DSM is more difficult than programmer must include these macros explicitly. In in distributed RPC-based systems (for example, the future, we expect to modify the pre-processor to Juul[11 ]) due to the existence of multiple copies of the insert these macros automatically. same object on several nodes, and the problem of con- The contents of a bunch are described by two special sistency interference. data structures that contain information needed for Le Sergent[15 ] describes an extension of a copying garbage collection: an object-map, which describes the garbage collector first developed for a multiprocessor, to a DSM system. Objects are kept strongly consis- References tent, the entire address space is collected at the same time, which is not scalable, and the garbage collector [1] AndrewcW.oAppel,nJohncR.uEllis,randrKaieLi.ntReal-timec* *ollection on stock multi@ locks pages while scanning, which interferes with the PLAN'88 - Conference on Programming Language Design consistency protocol. and Implementation, pages 11-20, Atlanta (USA), June Kordale's GC design for distributed shared mem- 1988. ory [12 ] is very complex and relies on a large amount [2] M.sP.hAtkinson,oP.tJ.tBailey,,K.aJ.nChisholm,dP.RW..Coc* *k-Morrison. An approach@ of auxiliary information. This GC algorithm is based gramming. Computer, 26(4):360-365, 1983. on the mark & sweep technique, and objects are kept [3] John K. Bennett, John B. Carter, and Willy Zwaenepoel. strongly consistent. Munin: Distributed shared memory based on type-specific memory coherence. In Proc. 2nd Annual Symp. on Prin- ciples and Practice of Parallel Programming, Seattle, WA (USA), March 1990. ACM SIGPLAN. In SIGPLAN No- 10 Conclusion tices 25(3). [4] Brian N. Bershad and Matthew J. Zekauskas. The Midway We have presented a copying garbage collector algo- distributed shared memory system. In Proceedings of the rithm for objects accessed via DSM in a loosely cou- COMPCON'93 Conference, pages 528-537, February 1993. pled network. Objects are allocated from bunches [5] D. I. Bevan. Distributed garbage collection using refer* *ence of non-overlapping segments in a single 64-bit ad- counting.gInuPARLE'87_ParallelaArchitecturesgandeLan-s * *Europe, number 259 in Le@ dress space spanning the whole network, including sec- Science, pages 117-187, Eindhoven (the Netherlands), Ju* *ne ondary storage. Objects are kept weakly consistent by 1987. Springer-Verlag. the entry consistency protocol. [6] Hans-J. Boehm, Alan J. Demers, and Scott Shenker. Our design goals were that the garbage collector nei- Mostly parallel garbage collection. In Proc. of the SI* *G- ther interfere with the consistency protocol, nor intro- PLAN'91ImConf.plonemProgrammingenLanguagetaDesigntiando* *n, pages 157-164, Toront@ duce high communication overheads. Therefore, in our 1991. ACM. garbage collection design: (i) a cached copy of a bunch [7] E. Dijkstra, L. Lamport, A. J. Martin, C. S. Scholten, * *and can be collected independently of any other copy of E. F. M. Steffens. On-the-fly garbage collection: an e* *xer- that same bunch on other nodes, (ii) only locally- cise in cooperation. Comm. of the ACM, 21(11):966-975, owned live objects are copied by a bunch garbage col- November 1978. lector; not owned live objects are simply scanned, and [8] Paulo Ferreira and Marc Shapiro. Distribution and per- (iii) references to copied objects are lazily updated, ei- sistencePinrmultipleoandcheterogeneous.addressospaces.f* * Inthe International Wor@ ther by taking advantage of messages sent on behalf in Operating Systems, Ashville, North Carolina, (USA), of the consistency protocol (piggy-backing), or in the December 1993. IEEE Comp. Society Press. background. In any circumstance, the garbage collec- [9] Paulo Ferreira and Marc Shapiro. Garbage collection of tor acquires neither a read nor a write token. persistent objects in distributed shared memory. In Pro* *c. of The fundamental observation that guided our design thet6theInternationalmWorkshopson,PersistentTObjectaSys* *-rascon (France), Septem@ is that GC consistency needs are less strict than ap- [10] Antony L. Hosking, J. Eliot B. Moss, and Darko Stefanov* *i`e. plications'. Thus, the garbage collector can work with A comparative performance evaluation of write barrier i* *m- objects that are inconsistent from the point of view of plementations. In Conf. on Object-Oriented Programming the consistency protocol. This allows the GC to be Systems, Languages, and Applications, volume 27 of SIG- performed without interfering with applications' con- PLANbNotices,epagesr92-109,1Vancouver9(Canada),9Octo-2.* * ACM Press. sistency needs and requiring very little synchronization [11] Niels C. Juul. Comprehensive, Concurrent, and Robust or communication. Garbage Collection in the Distributed, Object-Based Sys- We are currently in the process of evaluating the tem Emerald. PhD thesis, Dept. of Computer Science, performance of BMX. In future work, we hope to gen- Univ. of Copenhagen, Denmark, February 1993. eralize our design to other consistency protocols and [12] R. Kordale, M. Ahamad, and J. Shilling. Dis- other GC algorithms, in addition, to evaluating the im- tributed/concurrentshgarbagearcollectionedinmedistribut* *edmory systems. In P@ pact of the consistency granularity on our approach. Workshop on Object Orientation and Operating Systems, We are also extending the current GC design to incor- Ashville, North Carolina (USA), December 1993. IE* *EE porate a weakly consistent distributed shared memory Comp. Society Press. system with full support for transactions. [13] M. S. Lam, P. R. Wilson, and T. G. Moher. Object type directed garbage collection to improve locality. In Pr* *oc. Int. Workshop on Memory Management, number 637 in Lecture Notes in Computer Science, pages 404-425, Saint- Acknowledgments: We are grateful to our shep- Malo (France), September 1992. herd, Karin Petersen, and to the anonymous referees [14] Bernard Lang, Christian Queinnec, and Jos'e Pique* *r. for their help with improving this paper. Garbage collecting the world. In Proc. of the 19th Annu* *al ACM SIGPLAN-SIGACT Symp. on Principles of Pro- gramming Lang., Albuquerque, New Mexico (USA), Jan- uary 1992. [15] T. Le Sergent and B. Berthomieu. Incremental multi- threaded garbage collection on virtually shared memory architectures. In Proc. Int. Workshop on Memory Man- agement, number 637 in Lecture Notes in Computer Sci- ence, pages 179-199, Saint-Malo (France), September 1992. Springer-Verlag. [16] Kai Li and Paul Hudak. Memory coherence in shared vir- tual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989. [17] James O'Toole, Scott Nettles, and David Gifford. Con- current compacting garbage collection of a persistent heap. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, pages 161-174, Asheville, NC (USA), December 1993. [18] David Plainfoss'e. Distributed Garbage Collection and Reference Management in the Soul Object Support Sys- tem. PhD thesis, Universit'e Paris-6, Pierre-et-Marie-Curie, Paris (France), June 1994. Available from INRIA as TU- 281, ISBN-2-7261-0849-0. [19] M. Satyanarayanan, Henry H. Mashburn, Puneet Kumar, David C. Steere, and James J. Kistler. Lightweight recov- erable virtual memory. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, pages 146- 160, Asheville, NC (USA), December 1993. [20] Marc Shapiro, Peter Dickman, and David Plainfoss'e. SSP chains: Robust, distributed references supporting acyclic garbage collection. Rapport de Recherche 1799, Institut National de la Recherche en Informatique et Automatique, Rocquencourt (France), nov 1992. Also available as Broad- cast Technical Report #1. ----------- cut here (postscript, gziped and uuencoded--------------- begin 664 OSDI-098-CRC-FERREIRA.ps.gz M'XL("',MD2X``T]31$DM,#DX+4-20RU&15)214E202YP
LY1<_=`(HO%Z;DL)A`1\'"WAYH^CZK^X__VNV^_WE[ -,K!37)^0F:)L%V
MA.QUJDE13BZ&B(6@62;P2@[^$PDS>6.2P P\4`Z P^6`\OF#Y*`<5(YC]!ZNB.94N^5@YCA
MGJ-H*-RT_,Z<2G^+2ARQWV$@<_3P23<67C01I6BVRV<+O=&UQ:_SB&T%N6/=
ME3D=^/EO:!&2-:R$Z/ES]/329*#>ZLG"*7S.IA@=B867[TV#ZU5>KHB%:&C^
MUIG"4? _M?$?;8UMZR$I9A9X9X!@[X\!7%XNM[6SLMJ-V,PD%/**%-
MS'*R+)P$KU\=?OQP-_7U,O+JU5WA,
MO2]2L`M24?3%KJFW]5B/L<%M.@QJ&'AW^]U/PRZE+MG3QZ+J6AMO.V[+9JAW
M^%<.:1_:<3>D]-:68?_/FD81BS:FR7Y'>PANU>+WA(=%6Z^JB_5@/PM3>F,:
MB]07-K1@UT_[SF;:C3O[652]_6ZGKK+_;*RIZNHM/[.1Q;*UVV&
S5F]?3CV%`VBF3-24JBTU_`(&"E8[4=#:
MP&=]2U,#,S&".Y70D6ALVB9H_M2E!;MGLM*C0M+#=46A+KR&TW=9#[;OP&
M%W?++IYNXIY+18/!F*JM#\YP12U<5EE'9;Y+4EU!9)([(KC.I0^HTY8!Y-7Q
MZ3,W'3DYFRE/=X'G@G/1KML6SJ%,VI?4P\DUTF@Z*`^!'9P8H+#AT,-*!6?L
ML6315)`17-.^#6.R,TNGIV(A71!V%YFT:8+VJ+(J[81/OC1903ILJF0*4DD%
MJ4P\DSOJC"YC91KNC;M$6Q2H=';$C`=2=[)U-.Y<%O:U:)P8GYE6W<>VW>Y,
M?K]-]IA_QLFD%;DK:"2!"4$7JW
N8F6@JX2)&YNH?CVIDV,7A/LX/1NJ_+\F8\$4ADQ"T@T.(@6H
=.S^1@0$].8'`I30OR-&6@AN,LL=A(
M/Y!'A^D'%
H$I!V*)Q)3@[Z"=3,L-+E2W0UVKT.#:^TLI:&,'F;:K,O.,0M''%8X0R)^1Z"HXGW(W
M1D+EV:"IR%VP,EI^IW;7+3JXR8?KUE%*:'0W[?I4O(U1_1<+,WFEL?!L
M"/LM!T#$HMULR^V-'H3R$D;DK>I&`YGSB3X;3"T?QOK[PW&!2:!CR7^23V:KKB)Y$>
M9G9\T4%LUZL$(XS%,EL=1^85H7?=$+'@$/[RCWJN"'KM%F3K=D+G0D2VCR16
M]EK))=%Z1A2FTYNEWG=O#'/>`1Y!;['GV"'+#K&(_%?3V/B=.#I$+0=NQ%^"
M4@J%LE=F'XL%T`J&^!_
(%_GWT&1JLC:_LL?H1VK
MQ@9Q0^J^Z8GZXA,$&2J%GTB^P,(S;39?F&+0#EZ$1M8N$V-M4GNU&X4&4PF\
M2B6MK,NVV\?0I+<-U$15:T%4;U(ENRW2+*:QK93LC[/EU9X(D=TKWH9&E9[I
MT[)$#?^QF$BU_.L:_NRZQO^U$?F=J:U5O33N`>[#.R2>3754L5-TR33R9/;$
M-O5L;LE*%U5C_VID6);\614=L$!XW;=R4,-5;\H*AH>?R'!"1S3WX@Q05WJ6
MZ-%$1V2O(CFCAVEB:S/L8\.6/2DHW86G`-!P/P6JB3J?@D%G0KAMG(B^Q"&`
M,PP;BD,0>!Z64]`I>J9S@)SZ2EV:2J4?P.:@B<5C`)'6LNZ%K34H/Y\'0(WS
MNY=/@HX`R-X627&U7D?`3TBQG(9N?1H0(T3-+SA+$")L1[2]I@U`J$WGN?)(
M&$/E!`(_"0Y3NL`^_X-9S`(RHPKU@]`%>)