Pedro Fonseca, Rodrigo Rodrigues, Petros Maniatis
MPI-SWS, Rice University, TU Berlin/Deutsche Telekom Laboratories, Intel Research Berkeley
Many distributed services are hosted at large, shared, geographically diverse data centers, and they use replication to achieve high availability despite the unreachability of an entire data center. Recent events show that non-crash faults occur in these services and may lead to long outages. While Byzantine-Fault Tolerance (BFT) could be used to withstand these faults, current BFT protocols can become unavailable if a small fraction of their replicas are unreachable. This is because existing BFT protocols favor strong safety guarantees (consistency) over liveness (availability).
This paper presents a novel BFT state machine replication protocol called Zeno that trades consistency for higher availability. In particular, Zeno replaces strong consistency (linearizability) with a weaker guarantee (eventual consistency): clients can temporarily miss each other's updates but when the network is stable the states from the individual partitions are merged by having the replicas agree on a total order for all requests. We have built a prototype of Zeno and our evaluation using micro-benchmarks shows that Zeno provides better availability than traditional BFT protocols.
Data centers are becoming a crucial computing platform for large-scale Internet services and applications in a variety of fields. These applications are often designed as a composition of multiple services. For instance, Amazon's S3 storage service and its e-commerce platform use Dynamo as a storage substrate, and Google's indices are built using the MapReduce parallel processing framework, which in turn can use GFS for storage.
Ensuring correct and continuous operation of these services is critical, since downtime can lead to loss of revenue, bad press, and customer anger. Thus, to achieve high availability, these services replicate data and computation, commonly at multiple sites, to be able to withstand events that make an entire data center unreachable, such as network partitions, maintenance events, and physical disasters.
When designing replication protocols, assumptions have to be made about the types of faults the protocol is designed to tolerate. The main choice lies between a crash-fault model, where it is assumed nodes fail cleanly by becoming completely inoperable, or a Byzantine-fault model, where no assumptions are made about faulty components, capturing scenarios such as bugs that cause incorrect behavior or even malicious attacks. A crash-fault model is typically assumed in most widely deployed services today, including those described above; the primary motivation for this design choice is that all machines of such commercial services run in the trusted environment of the service provider's data center .
Unfortunately, the crash-fault assumption is not always valid even in trusted environments, and the consequences can be disastrous. To give a few recent examples, Amazon's S3 storage service suffered a multi-hour outage, caused by corruption in the internal state of a server that spread throughout the entire system; an outage in Google's App Engine was triggered by a bug in datastore servers that caused some requests to return errors; and a multi-day outage at the Netflix DVD mail-rental service was caused by a faulty hardware component that triggered a database corruption event.
Byzantine-fault-tolerant (BFT) replication protocols are an attractive solution for dealing with such faults. Recent research advances in this area have shown that BFT protocols can perform well in terms of throughput and latency, they can use a number of replicas equal to their crash-fault counterparts [37,9], and they can be used to replicate off-the-shelf, non-deterministic, or even distinct implementations of common services [29,36].
However, most proposals for BFT protocols have focused on strong semantics such as linearizability, where intuitively the replicated system appears to the clients as a single, correct, sequential server. The price to pay for such strong semantics is that each operation must contact a large subset of the replicas to conclude (more than 2/3 of them, or in some cases more than 4/5), which can cause the system to halt if more than a small fraction (1/3 or 1/5, respectively) of the replicas are unreachable due to maintenance events, network partitions, or other non-Byzantine faults. This contrasts with the philosophy of systems deployed in corporate data centers [15,34,21], which favor availability and performance, possibly sacrificing the semantics of the system, so they can provide continuous service and meet tight SLAs.
In this paper we propose Zeno, a new BFT replication protocol designed to meet the needs of modern services running in corporate data centers. In particular, Zeno favors service performance and availability, at the cost of providing weaker consistency guarantees than traditional BFT replication when network partitions and other infrequent events reduce the availability of individual servers.
Zeno offers eventual consistency semantics , which intuitively means that different clients can be unaware of the effects of each other's operations, e.g., during a network partition, but operations are never lost and will eventually appear in a linear history of the service--corresponding to that abstraction of a single, correct, sequential server--once enough connectivity is re-established.
In building Zeno we did not start from scratch, but instead adapted Zyzzyva , a state-of-the-art BFT replication protocol, to provide high availability. Zyzzyva employs speculation to conclude operations fast and cheaply, yielding high service throughput during favorable system conditions--while connectivity and replicas are available--so it is a good candidate to adapt for our purposes. Adaptation was challenging for several reasons, such as dealing with the conflict between the client's need for a fast and meaningful response and the requirement that each request is brought to completion, or adapting the view change protocols to also enable progress when only a small fraction of the replicas are reachable and to merge the state of individual partitions when enough connectivity is re-established.
The rest of the paper is organized as follows. Section 2 motivates the need for eventual consistency. Section 3 defines the properties guaranteed by our protocol. Section 4 describes how Zeno works and Section 5 sketches the proof of its correctness. Section 6 evaluates how our implementation of Zeno performs. Section 7 presents related work, and Section 8 concludes.
Various levels and definitions of weak consistency have been proposed by different communities, so we need to justify why our particular choice is adequate. We argue that eventual consistency is both necessary for the guarantees we are targeting, and sufficient from the standpoint of many applications.
Consider a scenario where a network partition causes half of the replicas in a given replica group to be on one side of the partition and the other half on the other side. This is plausible given that replicated systems often spread their replicas over multiple data centers for increased reliability, and that Internet partitions do occur in practice. In this case, eventual consistency is necessary to offer high availability to clients on both sides of the partition, since it is impossible for both sides to make progress while simultaneously achieving a consistency level that provides a total order on the operations seen by all client requests. Intuitively, the closest approximation to that idealized consistency that can be offered is eventual consistency: clients on each side of the partition agree on an ordering (one that only orders their operations with respect to each other), and, when enough connectivity is re-established, the two divergent states are merged, meaning that a total order between the operations on both sides is established and subsequent operations reflect that order.
Additionally, we argue that eventual consistency is sufficient from the standpoint of the properties required by many services and applications that run in data centers. This has been clearly stated by the designers of many of these services [15,34,21,3,13]. Applications that use an eventually consistent service have to be able to work with responses that may not include some previously executed operations. For applications that use Dynamo, for example, this means that customers may not get the most up-to-date sales ranks, or may even see some items they deleted reappear in their shopping carts, in which case the delete operation may have to be redone. However, those events are much preferable to a slow or unavailable service.
Beyond data-center applications, many other examples of eventually consistent services have been deployed in common-use systems, for example, DNS. Saito and Shapiro provide a more thorough survey of the topic.
We now informally specify the safety and liveness properties of a generic eventually consistent BFT service. Due to lack of space, the formal definitions appear in a separate technical report.
Informally, our safety properties say that an eventually consistent system behaves like a centralized server whose service state can be modelled as a multi-set. Each element of the multi-set is a history (a totally ordered subset of the invoked operations), which captures the intuitive notion that some operations may have executed without being aware of each other, e.g., on different sides of a network partition, and are therefore only ordered with respect to a subset of the requests that were executed. We also limit the total number of divergent histories: since every history must be sustained by at least f + 1 replicas, of which only the t currently failed servers can sustain several histories at once, the number of histories in the multi-set cannot exceed, at any time, ⌊(n - t)/(f + 1 - t)⌋, where t is the current number of failed servers, n is the total number of servers, and f is the maximum number of servers that can fail.
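The counting argument above can be sketched in a few lines. This is an illustrative reconstruction: the bound follows from the argument in the text, and the variable names n, f, and t are ours.

```python
def max_divergent_histories(n, f, t):
    """Upper bound on concurrent histories.

    Each history needs at least f + 1 sustaining replicas; the t
    currently-failed replicas may take part in every history at once, so
    each history additionally consumes at least f + 1 - t of the n - t
    remaining servers.
    """
    assert 0 <= t <= f and n >= 3 * f + 1
    return (n - t) // (f + 1 - t)
```

For n = 4, f = 1, the bound is 2 with no failures and 3 when one server has failed (a faulty server can join every weak quorum).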
We also specify that certain operations are committed. Each history has a prefix of committed operations, and the committed prefixes are related by containment. Hence, all histories agree on the relative order of their committed operations, and the order cannot change in the future. Aside from this restriction, histories can be merged (corresponding to a partition healing) and can be forked, which corresponds to duplicating one of the sets in the multi-set.
Given this state, clients can execute two types of operations, weak and strong, as follows. Any operation begins its execution cycle by being inserted at the end of any non-empty subset of the histories. At this and any subsequent time, a weak operation may return, with the corresponding result reflecting the execution of all the operations that precede it. In this case, we say that the operation is weakly complete. Strong operations, in contrast, must wait until they are committed (as defined above) before they can return, with the result computed in the same way. We assume that each correct client is well-formed: it never issues a new request before its previous (weak or strong) request is (weakly or strongly, respectively) complete.
The merge operation takes two histories and produces a new history, containing all operations in both histories and preserving the ordering of committed operations. The weak operations, however, can appear in an arbitrary order in the merged history, as long as the causal order of operations invoked by the same client is preserved. This implies that weak operations may commit in a different order than the one in which they were weakly completed.
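As an illustration of these constraints, the sketch below merges two histories whose operations are (client, timestamp) pairs. The sorting rule is a hypothetical deterministic order, not Zeno's actual function, but it satisfies the two requirements: the committed prefix is preserved, and each client's operations keep their causal (issue) order because clients use increasing timestamps.

```python
# Hypothetical sketch of a history merge, assuming each operation is a
# (client_id, client_timestamp) pair and both histories share the same
# committed prefix of length `committed_len`.

def merge_histories(h1, h2, committed_len):
    """Merge two divergent histories into one total order."""
    assert h1[:committed_len] == h2[:committed_len]
    merged = list(h1[:committed_len])      # committed prefix keeps its order
    seen = set(merged)
    # Collect uncommitted weak operations from both sides, dropping duplicates.
    tail = [op for op in h1[committed_len:] + h2[committed_len:]
            if not (op in seen or seen.add(op))]
    # Sorting by (client_id, timestamp) is one deterministic order that keeps
    # each client's operations in their causal order.
    merged.extend(sorted(tail))
    return merged
```

Note that the merged position of a weak operation may differ from the position in which it weakly completed, matching the caveat in the text.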
On the liveness side, our service guarantees that a request issued by a correct client is processed and a response is returned to the client, provided that the client can communicate with enough replicas in a timely manner.
More precisely, we assume a default round-trip delay Δ, and we say that a set S of servers is eventually synchronous if there is a time after which every two-way message exchange within S takes at most Δ time units. We also assume that every two correct servers or clients can eventually reliably communicate. Now our progress requirements can be put as follows:

(L1) If there exists an eventually synchronous set of f + 1 correct servers, then every weak request issued by a correct client eventually weakly completes.

(L2) If there exists an eventually synchronous set of 2f + 1 correct servers, then every request issued by a correct client, as well as every weakly complete request, is eventually committed.
In particular, (L1) and (L2) imply that if there is an eventually synchronous set of 2f + 1 correct replicas, then each (weak or strong) request issued by a correct client will eventually be committed.
As we will explain later, ensuring (L1) in the presence of partitions may require unbounded storage. We will present a protocol addition that bounds the storage requirements at the expense of relaxing (L1).
Zeno is a BFT state machine replication protocol. It requires 3f + 1 replicas to tolerate up to f Byzantine faults, i.e., we make no assumption about the behavior of faulty replicas. Zeno also tolerates an arbitrary number of Byzantine clients. We assume no node can break cryptographic techniques like collision-resistant digests, encryption, and signing. The protocol we present in this paper uses public key digital signatures to authenticate communication. In a separate technical report, we present a modified version of the protocol that uses more efficient symmetric cryptography based on message authentication codes (MACs).
The protocol uses two kinds of quorums: strong quorums, consisting of any group of 2f + 1 distinct replicas, and weak quorums, consisting of any f + 1 distinct replicas.
The system easily generalizes to any n ≥ 3f + 1 replicas, in which case the size of strong quorums becomes ⌈(n + f + 1)/2⌉, while weak quorums remain f + 1 replicas, independent of n. Note that one can apply our techniques in very large replica groups (where n ≫ 3f + 1) and still make progress as long as f + 1 replicas are available, whereas traditional (strongly consistent) BFT systems can be blocked unless at least ⌈(n + f + 1)/2⌉ replicas, a number that grows with n, are available.
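A minimal sketch of the quorum arithmetic, assuming only the sizes given above:

```python
import math

def quorum_sizes(n, f):
    """Strong and weak quorum sizes for n >= 3f + 1 replicas.

    Strong quorums must pairwise intersect in at least f + 1 replicas,
    hence size ceil((n + f + 1) / 2); weak quorums only need to contain
    one correct replica, hence size f + 1 regardless of n.
    """
    assert n >= 3 * f + 1
    strong = math.ceil((n + f + 1) / 2)
    weak = f + 1
    return strong, weak
```

For the minimal configuration n = 3f + 1 this yields the familiar 2f + 1 strong quorum; growing n to 10 with f = 1 pushes strong quorums to 6 replicas while weak quorums stay at 2, which is the availability gap the text highlights.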
Like most traditional BFT state machine replication protocols, Zeno has three components: sequence number assignment (Section 4.4) to determine the total order of operations, view changes (Section 4.5) to deal with leader replica election, and checkpointing (Section 4.8) to deal with garbage collection of protocol and application state.
The execution goes through a sequence of configurations called views. In each view, a designated leader replica (the primary) is responsible for assigning monotonically increasing sequence numbers to clients' operations. The replica with identifier i is the primary for the view numbered v iff i = v mod n.
At a high level, normal case execution of a request proceeds as follows. A client first sends its request to all replicas. A designated primary replica assigns a sequence number to the client request and broadcasts this proposal to the remaining replicas. Then all replicas execute the request and return a reply to the client.
Once the client gathers sufficiently many matching replies--replies that agree on the operation result, the sequence number, the view, and the replica history--it returns this result to the application. For weak requests, it suffices that a single correct replica returned the result, since that replica will not only provide a correct weak reply by properly executing the request, but will also eventually commit that request to the linear history of the service. Therefore, the client need only collect matching replies from a weak quorum of f + 1 replicas. For strong requests, the client must wait for matching replies from a strong quorum, that is, a group of at least 2f + 1 distinct replicas. This implies that Zeno can complete many weak operations in parallel across different partitions when only weak quorums are available, whereas it can complete strong operations only when strong quorums are available.
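The client-side completion rule can be sketched as follows. The reply tuple layout is hypothetical, but the matching rule (same result, view, sequence number, and history digest) and the f + 1 versus 2f + 1 thresholds follow the text.

```python
from collections import Counter

def try_complete(replies, f, strong):
    """Return the agreed result once enough matching replies arrive.

    `replies` is a list of (replica_id, result, view, seq, history_digest)
    tuples. A weak request completes with f + 1 matching replies from
    distinct replicas; a strong request needs 2f + 1. Returns None if the
    request is not yet complete.
    """
    needed = 2 * f + 1 if strong else f + 1
    votes = Counter()
    counted = set()
    for replica, *match in replies:
        if replica in counted:
            continue                     # count each replica at most once
        counted.add(replica)
        votes[tuple(match)] += 1         # group replies that fully match
    for match, count in votes.items():
        if count >= needed:
            return match[0]              # the agreed-upon result
    return None
```

With f = 1, two matching replies weakly complete a request, while the same two replies leave a strong request pending until a third arrives.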
Whenever operations do not make progress, or if replicas agree that the primary is faulty, a view change protocol tries to elect a new primary. Unlike in previous BFT protocols, view changes in Zeno can proceed with the agreement of only a weak quorum. This can allow multiple primaries to coexist in the system (e.g., during a network partition), which is necessary to make progress with eventual consistency. However, as soon as these multiple views (with possibly divergent sets of operations) detect each other (Section 4.6), they reconcile their operations via a merge procedure (Section 4.7), restoring consistency among replicas.
In what follows, a message subscript of the form σ_p denotes a public-key signature by principal p. In all protocol actions, malformed or improperly signed messages are dropped without further processing. We interchangeably use the terms ``non-faulty'' and ``correct'' for system components (e.g., replicas and clients) that follow our protocol faithfully. Table 1 collects our notation.
We start by explaining the protocol state at the replicas. Then we present details about the three protocol components. We used Zyzzyva  as a starting point for designing Zeno. Therefore, throughout the presentation, we will explain how Zeno differs from Zyzzyva.
A prefix of the ordered history up to sequence number n is called committed when a replica gathers a commit certificate (denoted CC and described in detail in Section 4.4) for n; each replica only remembers the highest CC it witnessed.
To prevent the history of requests from growing without bounds, replicas assemble checkpoints at fixed intervals of sequence numbers. For every checkpoint sequence number n, a replica first obtains the CC for n and executes all operations up to and including n. At this point, the replica takes a snapshot of the application state and stores it (Section 4.8).
Replicas remember the set of operations received from each client in their request[c] buffer and only the last reply sent to each client in their reply[c] buffer. The request buffer is flushed when a checkpoint is taken.
To describe how sequence number assignment works, we follow the flow of a request.
First, Zeno clients only need matching replies from a weak quorum, whereas Zyzzyva requires at least a strong quorum; this leads to a significant increase in availability when, for example, only between f + 1 and 2f replicas are available. It also allows for slightly lower overhead at the client, due to reduced message processing requirements, and for lower request latency when inter-node latencies are heterogeneous.
Second, Zeno requires clients to use sequential timestamps instead of timestamps that are monotonically increasing but not necessarily sequential (the norm in comparable systems). This is required for garbage collection (Section 4.8), but raises the issue of how to deal with clients that reboot or otherwise lose the information about their latest sequence number. Our current implementation does not store this sequence number persistently before sending each request. We chose this because the guarantees we obtain are still quite strong: requests that were already committed remain in the system, requests from other clients are unaffected, and the worst that can happen is that the client loses some of its oldest uncommitted requests after rebooting. As future work, we will devise protocols that improve these guarantees further, or that store sequence numbers efficiently using SSDs or NVRAM.
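A minimal sketch of the per-client replica state this scheme assumes: because client timestamps are sequential, keeping only the last reply suffices to detect retransmissions and gaps. The class and method names below are ours, not the paper's.

```python
class ClientSlot:
    """Per-client state at a replica under sequential client timestamps."""

    def __init__(self):
        self.last_ts = 0        # highest client timestamp executed so far
        self.last_reply = None  # reply[c]: only the most recent reply is kept

    def on_request(self, ts, execute):
        """Handle a request with client timestamp `ts`.

        `execute` is a callback that runs the operation; it is invoked only
        when the request is the next one in the client's sequence.
        """
        if ts == self.last_ts:
            return self.last_reply      # retransmission: resend cached reply
        if ts != self.last_ts + 1:
            return None                 # stale timestamp or gap: drop
        self.last_reply = execute()     # next-in-sequence: execute and cache
        self.last_ts = ts
        return self.last_reply
```

With non-sequential timestamps this single-slot cache would not work, since the replica could not distinguish a gap from a legitimately skipped timestamp; that is what ties this design to the garbage-collection scheme of Section 4.8.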
Third, whereas Zyzzyva offers a single-phase performance optimization, in which a request commits in only three message steps under some conditions (when all replicas operate roughly synchronously and are all available and non-faulty), Zeno disables that optimization. The rationale behind this removal is based on the view change protocol (Section 4.5) so we defer the discussion until then. A positive side-effect of this removal is that, unlike with Zyzzyva, Zeno does not entrust potentially faulty clients with any protocol step other than sending requests and collecting responses.
Finally, clients in Zeno send the request to all replicas, whereas clients in Zyzzyva send the request only to the primary replica. This change is required only in the MAC version of the protocol, but we present it here to keep the protocol description consistent. At a high level, the change ensures that a faulty primary cannot prevent a correct request that has weakly completed from committing: the faulty primary may manipulate a few of the MACs in an authenticator present in the request before forwarding it to the others, so that during the commit phase not enough correct replicas can verify the authenticator, causing them to drop the request. Interestingly, we find that the implementations of both the PBFT and Zyzzyva protocols also require clients to send requests directly to all replicas.
Our protocol description omits some of the pedantic details such as handling faulty clients or request retransmissions; these cases are handled similarly to Zyzzyva and do not affect the overheads or benefits of Zeno when compared to Zyzzyva.
We now turn to the election of a new primary when the current primary is unavailable or faulty. The key point behind our view change protocol is that it must be able to proceed when only a weak quorum of replicas is available, unlike view change algorithms in strongly consistent BFT systems, which require the availability of a strong quorum to make progress. The reason is the following: strongly consistent BFT systems rely on the quorum intersection property to ensure that if a strong quorum decides to change view and another strong quorum decides to commit a request, there is at least one non-faulty replica in both quorums, ensuring that view changes do not ``lose'' requests committed previously. This implies that the sizes of strong quorums are at least ⌈(n + f + 1)/2⌉, so that the intersection of any two contains at least f + 1 replicas, including--since no more than f of those can be faulty--at least one non-faulty replica. In contrast, Zeno does not require view change quorums to intersect; a weak request missing from a view change will eventually be committed when the correct replica executing it manages to reach a strong quorum of correct replicas, whereas strong requests missing from a view change will cause a subsequent provable divergence and application-state merge.
If a replica does not receive an ORDER-REQ message for a pending request before it times out, it broadcasts an I-HATE-THE-PRIMARY message for the current view to all replicas, but continues to participate in that view. If a replica receives such accusations from a weak quorum, it stops participating in the current view and sends a VIEW-CHANGE message for the next view to the other replicas, containing CC, the highest commit certificate it holds, and its ordered request history since that commit certificate, i.e., all ORDER-REQ messages for requests with sequence numbers higher than the one in CC. It then starts the view change timer.
The primary replica for the new view starts a timer with a shorter timeout value, called the aggregation timer, and waits until it collects a set of VIEW-CHANGE messages for that view from a strong quorum, or until the aggregation timer expires. If the aggregation timer expires and the primary replica has collected f + 1 or more such messages, it sends a NEW-VIEW message to the other replicas, containing the set of VIEW-CHANGE messages it gathered (we call this a weak view change, as opposed to a strong view change, in which a strong quorum of replicas participates). If a replica does not receive the NEW-VIEW message before the view change timer expires, it starts a view change into the next view number.
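The aggregation logic at the new primary can be summarized as a small decision function. This is an illustrative sketch; the actual protocol also validates each VIEW-CHANGE message before counting it.

```python
def view_change_kind(num_viewchange_msgs, f, timer_expired):
    """Decide whether the new primary can issue a NEW-VIEW message.

    Returns "strong" if a strong quorum (2f + 1, for n = 3f + 1) of
    VIEW-CHANGE messages was gathered, "weak" if only a weak quorum
    (f + 1) arrived before the aggregation timer expired, and None if the
    primary must keep waiting.
    """
    if num_viewchange_msgs >= 2 * f + 1:
        return "strong"
    if timer_expired and num_viewchange_msgs >= f + 1:
        return "weak"
    return None
```

The aggregation timer is what lets a partitioned primary eventually proceed with a weak view change instead of blocking forever on an unreachable strong quorum.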
Note that waiting for VIEW-CHANGE messages from a strong quorum is not needed to meet our eventual consistency specification, but it helps avoid a situation where some operations are not immediately incorporated into the new view, which would later create a divergence that would need to be resolved using our merge procedure. Thus it improves the availability of our protocol.
Each replica locally calculates the initial state for the new view by executing the requests contained in the NEW-VIEW message, thereby updating both its local history and its history chain digest. The order in which these requests are executed, and how the initial state for the new view is calculated, are related to how we merge divergent states from different replicas, so we defer this explanation to Section 4.7. Each replica then sends a VIEW-CONFIRM message to all others, and once it receives such messages, matching in view number, sequence number, and history digest, from a weak or a strong quorum (for weak or strong view changes, respectively), the replica becomes active in the new view and stops processing messages for any prior views.
The view change protocol allows a set of correct but slow replicas to initiate a global view change even if there is a set of synchronized correct replicas, which may affect our liveness guarantees (in particular, the ability to eventually execute weak requests when there is a synchronous set of correct servers). We avoid this by prioritizing client requests over view change requests as follows. Every replica maintains a set of client requests that it has received but that have not been processed (put in an ordered request) by the primary. Whenever a replica receives a message related to the view change protocol (I-HATE-THE-PRIMARY, VIEW-CHANGE, NEW-VIEW, or VIEW-CONFIRM) for a higher view, it first forwards the outstanding requests to the current primary and waits until the corresponding ORs are received or a timer expires. For each pending request, if a valid OR is received, the replica sends the corresponding response back to the client. Only then does the replica process the original view change messages according to the protocol described above. This guarantees that the system makes progress even in the presence of continuous view changes caused by slow replicas in such pathological situations.
The following sections describe additions to the view change protocols to incorporate functionality for detecting and merging concurrent histories, which are also exclusive to Zeno.
Concurrent histories (i.e., divergence in the service state) can be formed for several reasons. This can occur when the view change logic leads to the presence of two replicas that simultaneously believe they are the primary, and there is a sufficient number of other replicas that also share that belief and complete weak operations proposed by each primary. This could be the case during a network partition that splits the set of replicas into two subsets, each of them containing at least f + 1 replicas.
Another possible reason for concurrent histories is that the base history decided during a view change may not have the latest committed operations from prior views. This is because a view change quorum (a weak quorum) may not share a non-faulty replica with prior commitment quorums (strong quorums) and remaining replicas; as a result, some committed operations may not appear in the VIEW-CHANGE messages and, therefore, may be missing from the new starting state in the NEW-VIEW message.
Finally, a misbehaving primary can also cause divergence by proposing the same sequence numbers to different operations, and forwarding the different choices to disjoint sets of replicas.
For clarity, we first describe how we detect divergence within a view and then discuss detection across views. We also defer details pertaining to garbage collection of replica state until Section 4.8.
Suppose replica i is in view v, has executed requests up to sequence number n, and receives from replica j a properly authenticated protocol message referring to sequence number n′.
If n′ > n, i.e., j has executed a request with sequence number n′ that i has not, then the fill-hole mechanism is started, and i obtains from j the ORDER-REQ messages for the sequence numbers it is missing, up to and including n′.
Otherwise, if n′ ≤ n, both replicas have executed a request with sequence number n′, and i must therefore have the corresponding ORDER-REQ message in its own log.
If the two history digests for n′ match (the one i holds locally or obtained through the fill-hole mechanism, and the one received in j's message), then the two histories are consistent and no concurrency is deduced.
If instead the two history digests differ, the histories must differ as well. If the two ORDER-REQ messages are authenticated by the same primary, together they constitute a proof of misbehavior (POM): through an inductive argument it can be shown that the primary must have assigned different requests to the same sequence number n′. Such a POM is sufficient to initiate a view change and a merge of histories (Section 4.7).
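The within-view check that yields a POM can be sketched as a simple predicate over two already signature-verified ORDER-REQ messages; the field names below are ours.

```python
# Hypothetical proof-of-misbehavior check: two ORDER-REQ messages signed by
# the same primary, for the same view and sequence number, but carrying
# different history digests prove the primary assigned that number twice.

def is_pom(or1, or2):
    """Each OR is a dict with 'primary', 'view', 'seq', and 'digest' fields
    (public-key signatures are assumed to have been verified already)."""
    return (or1["primary"] == or2["primary"]
            and or1["view"] == or2["view"]
            and or1["seq"] == or2["seq"]
            and or1["digest"] != or2["digest"])
```

If the two messages instead name different primaries, the predicate is false and the conflict is treated as cross-view divergence, as discussed next.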
The case when the two messages are authenticated by different primaries indicates the existence of divergence caused, for instance, by a network partition; we discuss how to handle it next.
Now assume that replica i receives a message from replica j indicating that j is in a higher view v′ > v. This could happen due to a partition, during which different subsets changed views independently, or due to other network and replica asynchrony. Replica i requests the NEW-VIEW message for v′ from j. (The case where j is in a lower view is similar, with the exception that i pushes its own NEW-VIEW message to j instead.)
When node i receives and verifies the NEW-VIEW message, issued by the primary of view v′, it compares its local history to the sequence of ORDER-REQ messages obtained after ordering the requests present in the NEW-VIEW message (according to the procedure described in Section 4.7). Let n_min and n_max be the lowest and highest sequence numbers of those ORDER-REQ messages, respectively.
Like traditional view change protocols, a replica does not enter view v′ if the NEW-VIEW message for that view did not include all of the replica's committed requests. This is important for the safety properties that provide guarantees for strong operations, since it excludes a situation where requests could be committed in v′ without reflecting previously committed requests.
Once concurrent histories are detected, we need to merge them in a deterministic order. The solution we propose is to extend the view change protocol, since many of the functionalities required for merging are similar to those required to transfer a set of operations across views.
We extend the view change mechanism so that view changes can be triggered not only by timeouts but also by POMs, PODs, and POAs (proofs of misbehavior, divergence, and absence, respectively). When a replica obtains a POM, a POD, or a POA after detecting divergence, it multicasts the corresponding proof message in addition to the VIEW-CHANGE message for the new view. Note that the new view number in a POM or POD is one higher than the highest view number present in the conflicting ORDER-REQ messages, and, in the case of a POA, one higher than the view number in the NEW-VIEW component.
Upon receiving an authentic and valid POM, POD, or POA, a replica broadcasts a VIEW-CHANGE message along with the triggering POM, POD, or POA message.
The view change mechanism will eventually lead to the election of a new primary, which is supposed to multicast a NEW-VIEW message. When a node receives such a message, it needs to compute the start state for the next view based on the information contained in that message. The new start state is calculated by first identifying the highest CC present among all the VIEW-CHANGE messages; this determines the new base history digest for the start sequence number of the new view.
But nodes also need to determine how to order the different ORDER-REQ messages that are present in the NEW-VIEW message but not yet committed. These messages (potentially including concurrent requests) are ordered using a deterministic function of the requests that produces a total order for them. Having a fixed function allows all nodes receiving the NEW-VIEW message to easily agree on the final order for the concurrent requests present in that message. Alternatively, we could let the primary replica propose an ordering and disseminate it as an additional parameter of the NEW-VIEW message.
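A sketch of this start-state computation, assuming each VIEW-CHANGE carries the sequence number of its highest commit certificate plus its uncommitted operations, and using lexicographic sorting as a stand-in for the deterministic ordering function:

```python
def new_view_start_state(viewchanges):
    """Compute the start state for a new view.

    `viewchanges` is a list of (cc_seq, uncommitted_ops) pairs, one per
    VIEW-CHANGE message. The highest commit certificate fixes the committed
    prefix; the uncommitted operations are then ordered by a fixed
    deterministic function (here: plain sorting, with duplicates removed)
    so that every replica derives exactly the same order.
    """
    base_seq = max(cc_seq for cc_seq, _ in viewchanges)
    pending = sorted({op for _, ops in viewchanges for op in ops})
    return base_seq, pending
```

Because the function is fixed, replicas need no extra agreement round on the order of the concurrent requests; they all execute `pending` starting from `base_seq` and arrive at the same history digest.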
Replicas receiving the NEW-VIEW message then execute the requests in the ORDER-REQ messages according to that fixed order, updating their histories and history digests. If a replica has already executed some weak operations in an order that differs from the new ordering, it first rolls back the application state to the state of the last checkpoint (Section 4.8) and re-executes all operations after the checkpoint, starting with the committed requests and continuing with the weak requests ordered by the NEW-VIEW message. Finally, the replica broadcasts a VIEW-CONFIRM message. As mentioned, when a replica collects matching VIEW-CONFIRM messages, agreeing on the view number, sequence number, and history digest, it becomes active in the new view.
Our merge procedure re-executes the concurrent operations sequentially, without running any additional or alternative application-specific conflict resolution procedure. This makes the merge algorithm slightly simpler, but requires the application upcall that executes client operations to contain enough information to identify and resolve concurrent operations. This is similar to the design choice made by Bayou, where special concurrency detection and merge procedures are part of each service operation, enabling servers to automatically detect and resolve conflicts.
Zeno's view changes motivate our removal of the single-phase Zyzzyva optimization for the following reason: suppose a strong client request was executed (and committed) at some sequence number at 2f+1 replicas. Now suppose there was a weak view change, the new primary is faulty, and only f+1 replicas are available. A faulty replica among those has the option of reporting the request in a different order in its VIEW-CHANGE message, which enables the primary to order the request arbitrarily in its NEW-VIEW message; this is possible because only a single--potentially faulty--replica need report any given request during a Zeno view change. This means that linearizability is violated for this strong, committed request. Although it may be possible to design a more involved view change that preserves such orderings, we chose to keep things simple instead. As our results show, in many settings where eventual consistency is sufficient for weak operations, our availability under partitions trumps any benefits from increased throughput due to Zyzzyva's optimized single-phase request commitment.
The protocol we have presented so far has two important shortcomings: the protocol state grows unboundedly, and weak requests are never committed unless they are followed by a strong request.
To address these issues, Zeno periodically takes checkpoints, garbage collecting its logs of requests and forcing weak requests to be committed.
When a replica receives an ORDER-REQ message from the primary for sequence number n, it checks whether n is a multiple of the checkpoint interval. If so, it broadcasts the COMMIT message corresponding to n to the other replicas. Once a replica receives 2f+1 COMMIT messages that match in view number, sequence number, and history digest, it creates the commit certificate for sequence number n. It then sends a CHECKPOINT message to all other replicas, containing a snapshot of the application state after executing requests up to and including n. When it receives f+1 matching CHECKPOINT messages, it considers the checkpoint stable, stores this proof, and discards all ordered requests with sequence numbers lower than n, along with their corresponding client requests.
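The trigger and garbage-collection steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the checkpoint interval of 128 is an arbitrary value (the paper leaves it configurable), f+1 matching CHECKPOINT messages suffice for stability as described in the text, and the class name and fields are ours.

```python
CHECKPOINT_INTERVAL = 128   # assumed value; configurable in practice
F = 1                       # faults tolerated; N = 3F + 1 replicas

class CheckpointState:
    def __init__(self):
        self.log = {}           # sequence number -> ordered request
        self.stable_seqno = 0   # sequence number of highest stable checkpoint

    def should_checkpoint(self, n):
        # The commit step of the checkpoint protocol is triggered when
        # the sequence number is a multiple of the checkpoint interval.
        return n % CHECKPOINT_INTERVAL == 0

    def on_matching_checkpoints(self, n, count):
        # With f+1 matching CHECKPOINT messages the checkpoint becomes
        # stable: discard every ordered request at or below n.
        if count >= F + 1:
            self.log = {s: r for s, r in self.log.items() if s > n}
            self.stable_seqno = n
```

Discarding the log prefix is what bounds storage, and is precisely what makes the per-client timestamps of the next subsection necessary for detecting divergence behind the checkpoint.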
In addition, if the checkpoint procedure has not run within a configurable interval of time and a replica holds ordered requests that are not yet committed, the replica also initiates the commit step of the checkpoint procedure. This ensures that pending ordered requests are committed even when the service is rarely used by other clients and sequence numbers grow very slowly.
The checkpoint procedure described so far poses a challenge to the protocol for detecting concurrent histories. Once old requests have been garbage-collected, there is no way to verify, when a slow replica (or a malicious replica pretending to be slow) presents an old request, whether that request was committed at that sequence number or whether there is divergence.
To address this, clients attach sequential timestamps that uniquely identify each of their own operations, and we add a list of per-client timestamps to the checkpoint messages, representing the latest operation of each client executed up to the checkpoint. This is in contrast with previous BFT replication protocols, including Zyzzyva, where clients identified operations using timestamps obtained by reading their local clocks. Concretely, a replica's CHECKPOINT message carries a vector of per-client tuples, where the entry for a client c holds the timestamp of the last committed operation from c.
This allows us to detect concurrent requests even if some of the replicas have garbage-collected the request in question. Suppose a replica receives an ordered request with sequence number n that corresponds to client c's request with timestamp t. The replica first obtains the timestamp t_c of the last executed operation of c recorded in the highest stable checkpoint. If t_c >= t, then there is no divergence, since the client request with timestamp t has already been committed. But if t_c < t, then we need to check whether some other request was assigned n, which would provide a proof of divergence. If n is no higher than the sequence number of the stable checkpoint, then the ordered request and the checkpoint form a POD, since some other request must have been assigned n. Otherwise, we perform the regular conflict detection procedure to identify concurrency (see Section 4.6).
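The case analysis above can be sketched directly. The function and parameter names below are illustrative, not from the paper: `ckpt_timestamps` maps each client to the timestamp of its last committed operation in the highest stable checkpoint, and `ckpt_seqno` is that checkpoint's sequence number.

```python
def classify_request(n, client, t, ckpt_timestamps, ckpt_seqno):
    """Classify an ordered request that may predate the stable checkpoint.

    n      -- sequence number assigned to the request
    client -- issuing client's id
    t      -- the client's sequential timestamp for the request
    """
    t_c = ckpt_timestamps.get(client, 0)
    if t_c >= t:
        # The request was already committed at or before the checkpoint.
        return "committed"
    if n <= ckpt_seqno:
        # The checkpoint covers n, yet this request is not committed:
        # some other request was assigned n, so the ordered request
        # plus the checkpoint form a proof of divergence (POD).
        return "POD"
    # n lies beyond the checkpoint: fall back to the regular
    # conflict detection procedure (Section 4.6).
    return "check-conflicts"
```

The key point is that the checkpoint's per-client timestamps substitute for the garbage-collected log entries: they let a replica distinguish "old but committed" from "old and divergent" without retaining the requests themselves.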
Note that our checkpoints become stable only when at least 2f+1 replicas are able to agree. In the presence of partitions or other unreachability situations where only weak quorums can talk to each other, it may not be possible to gather a checkpoint, which implies that Zeno must either allow the state concerning tentative operations to grow without bounds, or weaken its liveness guarantees. In our current protocol we chose the latter: replicas stop participating once they reach a maximum number of tentative operations, which could be determined based on their available storage resources (memory as well as disk space). Garbage-collecting weak operations, and the resulting impact on conflict detection, is left as future work.
In this section, we sketch the proof that Zeno satisfies the safety properties specified in Section 3. A proof sketch for liveness properties is presented in a separate technical report .
In Zeno, a (weak or strong) response is based on identical histories of at least f+1 replicas, and thus at least one of these histories belongs to a correct replica. Hence, in the case that our garbage collection scheme is not initiated, we can reformulate the safety requirements as follows: (S1) the local history maintained by a correct replica consists of a prefix of committed requests extended with a sequence of speculative requests, where no request appears twice; (S2) a request associated with a correct client c appears in a history at a correct replica only if c has previously issued the request; (S3) the committed prefixes of histories at every two correct replicas are related by containment; and (S4) at any time, the number of conflicting histories maintained at correct replicas does not exceed (N-t)/(f+1-t), where t is the number of currently failed replicas and N = 3f+1 is the total number of replicas required to tolerate a maximum of f faulty replicas. Here we say that two histories are conflicting if none of them is a prefix of the other.
Properties (S1) and (S2) are implied by the state maintenance mechanism of our protocol and the fact that only properly signed requests are put in a history by a correct replica. The special case when a prefix of a history is hidden behind a checkpoint is discussed later.
A committed prefix of a history maintained at a correct replica can only be modified by the commitment of a new request or by a merge operation. The sub-protocol of Zeno responsible for committing requests is analogous to the two-phase conservative commitment in Zyzzyva , and it similarly guarantees that all committed requests are totally ordered. When two histories are merged at a correct replica, the resulting history adopts the longer committed prefix of the two. Thus, inductively, the committed prefixes of all histories maintained at correct replicas are related by containment (S3).
Now suppose that, at a given time, the number of conflicting histories maintained at correct replicas exceeds (N-t)/(f+1-t). Our weak quorum mechanism guarantees that each history maintained at a correct process is supported by at least f+1 distinct processes (through the ordering and response messages they send). A correct process cannot concurrently acknowledge two conflicting histories. But when t replicas are faulty, there can be at most (N-t)/(f+1-t) sets of f+1 replicas that are pairwise disjoint in the set of correct ones. Thus, at least one correct replica acknowledged two conflicting histories -- a contradiction that establishes (S4).
Checkpointing. Note that our garbage collection scheme may affect property (S1): the sequence of tentative operations maintained at a correct replica may potentially include a committed but already garbage-collected operation. This, however, cannot happen: each round of garbage collection produces a checkpoint that contains the latest committed service state and the logical timestamp of the latest committed operation of every client. Since no correct replica agrees to commit a request from a client unless its previous requests are already committed, the checkpoint implies the set of timestamps of all committed requests of each client. If a replica receives an ordered request of a client corresponding to a sequence number preceding the checkpoint state, and the timestamp of this request is no later than that of the client's last committed request, then the replica simply ignores the request, concluding that the request is already committed. Hence, no request can appear in a local history twice.
We have implemented a prototype of Zeno as an extension to the publicly available Zyzzyva source code .
Our evaluation tries to answer the following questions: (1) Does Zeno incur more overhead than existing protocols in the normal case? (2) Does Zeno provide higher availability than existing protocols when more than f nodes are unreachable? (3) What is the cost of merges?
We generate a workload with a varying fraction of strong and weak operations. If each client issued both strong and weak operations, then most clients would block soon after network partitions started. Instead, we simulate two kinds of clients: (i) weak clients issue only weak requests, and (ii) strong clients always pose strong requests. This allows us to vary the fraction of weak operations in the total workload with a limited number of clients in the system and long network partitions. We use a micro-benchmark that executes a no-op when the execute upcall for the client operation is invoked.
We have also built a simple application on top of Zeno, emulating a shopping cart service with operations to add, remove, and checkout items based on a key-value data store. We also implement a simple conflict detection and merge procedure. Due to lack of space, the design and evaluation of this service are presented in the technical report .
Our results, presented in Table 2, show that Zeno and Zyzzyva achieve similar throughput, with Zyzzyva performing slightly (3-6%) better than Zeno for weak operations. The results also show that, with batching, Zeno's throughput for strong operations is close to Zyzzyva's peak throughput: Zyzzyva has 7% higher throughput when the single-phase optimization is employed. However, when a single replica is faulty or slow, Zyzzyva cannot achieve the single-phase throughput, and Zeno's throughput for strong operations is identical to Zyzzyva's performance with a faulty replica.
In this experiment, all clients reside in the first LAN. We initiate a partition at 90 seconds, and it lasts for one minute. Since there are no clients in the second LAN, no requests are processed there and hence there is no concurrency, which avoids the cost of merging. Replicas with ids 0 (the primary for the initial view 0) and 1 reside in the first LAN, while replicas with ids 2 and 3 reside in the second LAN. We also present the results of Zyzzyva to compare performance both in the normal case and under the given failure.
Second, weak operations continue to be processed and completed during the partition, because Zeno requires (for f=1) only 2 non-faulty replicas to complete them. The fraction of total requests completed increases with the fraction of weak operations in the workload, essentially improving the availability of such operations despite network partitions.
Third, when the replicas in the other LAN become reachable again, they need to obtain the missing requests from the first LAN. Since the number of weak operations performed in the first LAN grows with the fraction of weak operations in the workload, the time to update the lagging replicas in the other partition also goes up; this puts a temporary strain on the network, evidenced by the dip in the throughput of weak operations when the partition heals. However, this dip is brief compared to the duration of the partition. We explore the impact of the duration of partitions next.
Figure 2 presents the results. We observe that weak operations are always available in this experiment, since all weak operations were completed in the first LAN and the replicas there are up-to-date with each other, allowing them to process the next weak operation. Strong operations are unavailable for the entire duration of the partition, due to the unreachability of the replicas in the second LAN, with additional unavailability introduced by Zeno's operation transfer mechanism. However, this additional delay is within 4% of the partition duration (12 seconds for a 5-minute partition). Our current prototype is not yet optimized, and we believe the delay could be further reduced.
In this experiment, we keep half the clients on each side of a partition. This ensures that both partitions observe a steady load of weak operations that will cause Zeno to first perform a weak view change and later merge the concurrent weak operations completed in each partition. Hence, this microbenchmark additionally evaluates the cost of weak view changes and the merge procedure. As before, the primary for the initial view resides in the first LAN. We measure the overall throughput of weak and strong operations completed in both partitions. Again, we compare our results to Zyzzyva.
When the workload contains no weak operations, Zeno provides no additional benefits. As soon as the partition starts, strong operations block, resuming only after the partition heals. As above, Zyzzyva provides greater throughput thanks to its single-phase execution of client requests, but in the face of strong operations only it is as powerless as Zeno to make progress during partitions.
With a single client sending weak operations, all weak operations originate in one LAN. Since there are no conflicts, this graph matches that of Figure 1.
With a higher fraction of weak operations, we have at least two weak clients, at least one in each LAN. When a partition starts, the throughput of weak operations first drops; this happens because weak clients in the second partition cannot complete operations while they are partitioned from the current primary. Once the necessary view changes complete in the second LAN, these clients resume processing weak operations; this is observed as an increase in the overall throughput of weak operations, since both partitions can now complete weak operations in parallel -- in fact, faster than before the partition, due to decreased cryptographic and message overheads and the reduced round-trip delay between clients in the second partition and the primary in their partition. The duration of weak-operation unavailability in the non-primary partition is proportional to the number of view changes required. In our experiment, since replicas with ids 2 and 3 reside in the second LAN, two view changes were required (to make replica 2 the new primary).
When the partition heals, replicas in the first LAN detect the existence of concurrency and construct a POD, since replicas in the second LAN are in a higher view. At this point, they request the NEW-VIEW message from the primary of view 2, move to view 2, and then propagate their locally executed weak operations to the primary of view 2. Next, replicas in the first LAN need to fetch the weak operations completed in the second LAN and execute them before strong operations can make progress. This results in additional delay before strong operations can complete, as observed in the figure.
Next, we simulate partitions of varying duration, as before with half of the clients on each side. Again, we measure the unavailability of both strong and weak operations using the earlier definition: unavailability is the duration for which the throughput in either partition was less than 10% of the average throughput before the failure. With a longer partition, the cost of the merge procedure increases, since the weak operations from both partitions have to be transferred before new client operations can complete.
Figure 4 presents the results. We observe that weak operations experience some unavailability in this scenario, whose duration increases with the length of the partition. The unavailability for weak operations is within 9% of the total time of the partition.
The unavailability of strong operations is at least the duration of the network partition plus the merge cost (similar to that for weak operations). The additional unavailability due to the merge operation is within 14% of the total time of the partition.
We simulate a partition lasting 60 seconds and calculate the number of clients blocked and the length of time they were blocked during the partition. Figure 6 plots the cumulative distribution, across clients, of the maximum duration for which a client was blocked. This metric allows us to see how clients were affected by the partition. With Zyzzyva, all clients are blocked for the entire duration of the partition. With Zeno, however, a large fraction of clients observe no wait time at all, because they exit the system after performing a few weak operations. For example, more than 70% of clients observe no wait time as long as the probability of performing a strong operation is less than 15%. In summary, this result shows that Zeno significantly improves the user experience and masks failure events from the user as long as the workload contains few strong operations.
The trade-off between consistency, availability, and tolerance to network partitions in computing services became folklore long ago .
Most replicated systems are designed to be ``strongly'' consistent, i.e., provide clients with consistency guarantees that approximate the semantics of a single, correct server, such as single-copy serializability  or linearizability .
Weaker consistency criteria, which allow for better availability and performance at the expense of letting replicas temporarily diverge and users see inconsistent data, were later proposed in the context of replicated services tolerating crash faults [33,17,38,30]. We improve on this body of work by considering the more challenging Byzantine-failure model, where, for instance, it may not suffice to apply an update at a single replica, since that replica may be malicious and fail to propagate it.
There are many examples of Byzantine-fault tolerant state machine replication protocols, but the vast majority of them were designed to provide linearizable semantics [8,4,11,23]. Similarly, Byzantine-quorum protocols provide other forms of strong consistency, such as safe, regular, or atomic register semantics . We differ from this work by analyzing a new point in the consistency-availability tradeoff, where we favor high availability and performance over strong consistency.
There are very few examples of Byzantine-fault tolerant systems that provide weak consistency.
SUNDR  and BFT2F  provide similar forms of weak consistency (fork and fork*, respectively) in a client-server system that tolerates Byzantine servers. While SUNDR is designed for an unreplicated service and is meant to minimize the trust placed on that server, BFT2F is a replicated service that tolerates a subset of Byzantine-faulty servers. A system with fork consistency might conceal users' actions from each other, but if it does, users get divided into groups and the members of one group can no longer see any of another group's file system operations.
These two systems propose quite different consistency guarantees from those provided by Zeno, because the weaker semantics in SUNDR and BFT2F serve very different purposes than our own. Whereas we are trying to achieve high availability and good performance with up to f Byzantine faults, the goal in SUNDR and BFT2F is to provide the best possible semantics in the presence of a large fraction of malicious servers. In the case of SUNDR, this means the single server can be malicious; in the case of BFT2F, this means tolerating arbitrary failures of up to 2f of the servers. Thus they associate client signatures with updates so that, when such failures occur, all the malicious servers can do is conceal client updates from other clients. This makes the approach of these systems orthogonal and complementary to our own.
Another example of a system that provides weak consistency in the presence of some Byzantine failures can be found in . However, the system aims at achieving extreme availability but provides almost no guarantees and relies on a trusted node for auditing.
To our knowledge, this paper is the first to consider eventually-consistent Byzantine-fault tolerant generic replicated services.
In this paper we presented Zeno, a BFT protocol that privileges availability and performance, at the expense of providing weaker semantics than traditional BFT protocols. Yet Zeno provides eventual consistency, which is adequate for many of today's replicated services, e.g., that serve as back-ends for e-commerce websites. Our evaluation of an implementation of Zeno shows it provides better availability than existing BFT protocols, and that overheads are low, even during partitions and merges.
Zeno is only a first step towards liberating highly available but Byzantine-fault tolerant systems from the expensive burden of linearizability. Our eventual consistency may still be too strong for many real applications. For example, the shopping cart application does not necessarily care in what order cart insertions occur, now or eventually; this is probably the case for all operations that are associative and commutative, as well as operations whose effects on system state can easily be reconciled using snapshots (as opposed to merging or totally ordering request histories). Defining required consistency per operation type and allowing the replication protocol to relax its overheads for the more ``best-effort'' kinds of requests could provide significant further benefits in designing high-performance systems that tolerate Byzantine faults.
We would like to thank our shepherd, Miguel Castro, the anonymous reviewers, and the members of the MPI-SWS for valuable feedback.