Quorum in Amazon Aurora

I recently read a series of posts about the quorum mechanism in Amazon Aurora, a distributed relational database. These posts are:

  • post1: quorums and correlated failure.
  • post2: quorum reads and mutating state.
  • post3: reducing costs using quorum sets.
  • post4: quorum membership.

There is also a SIGMOD’17 paper about Amazon Aurora, which can be found here. I only skimmed that paper, which spends most of its words on the basic architecture.

I like this series of posts; it is a very good tutorial if you want to learn how quorums are used in practice. The posts define a quorum model as follows:

Formally, a quorum system that employs V copies must obey two rules. First, the read set, Vr, and the write set, Vw, must overlap on at least one copy.

Second, you need to ensure that the quorum used for a write overlaps with prior write quorums, which is easily done by ensuring that Vw > V/2.

The heart of this model is that any read set and any write set share at least one node, and so do any two write sets. A read therefore always reaches at least one copy that holds the latest write.
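To make the two rules concrete, here is a minimal sketch that checks them for a given configuration and then confirms the overlap property by brute force. The function names are my own, and the example sizes (V = 6, Vw = 4, Vr = 3) are, as far as I know, the ones Aurora uses; treat them as an illustration rather than something taken from the quoted text.

```python
from itertools import combinations


def is_valid_quorum(v: int, v_r: int, v_w: int) -> bool:
    """Return True if the quorum sizes obey both rules quoted above."""
    read_overlaps_write = v_r + v_w > v   # rule 1: every read set meets every write set
    writes_overlap = v_w > v / 2          # rule 2: every write set meets every other write set
    return read_overlaps_write and writes_overlap


def always_overlap(v: int, size_a: int, size_b: int) -> bool:
    """Brute force: does every size_a subset of v nodes intersect every size_b subset?"""
    nodes = range(v)
    return all(
        set(a) & set(b)
        for a in combinations(nodes, size_a)
        for b in combinations(nodes, size_b)
    )


print(is_valid_quorum(6, 3, 4))  # True
print(is_valid_quorum(6, 2, 4))  # False: a read of 2 copies can miss a write of 4
```

The brute-force helper is only there to show that the arithmetic rules and the "at least one shared node" statement say the same thing; for real cluster sizes the two inequalities are all you need.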

While the replication benefits of the quorum model are attractive, they come at a cost for both reads and writes. For a read, a client may need to consult multiple nodes (i.e., the read set) to be sure it sees the latest state. For a write, multiple copies need to be materialized to maintain the quorum. The author introduces the basic ideas for solving these two problems in post2 and post3. In particular, to avoid the read penalty, the master maintains a cache of the status of all replicas that have successfully acknowledged writes, along with a latency estimate for each. A client therefore only needs to ask the master where the latest state can be read, rather than querying a full read set.
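The read shortcut can be sketched as follows. This is only my reading of the idea, not Aurora's actual code: the `ReplicaStatus` record, the use of a log sequence number (`acked_lsn`) to describe how far a replica has caught up, and the `pick_read_replica` helper are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaStatus:
    name: str
    acked_lsn: int      # highest write (hypothetical sequence number) this replica acknowledged
    latency_ms: float   # latency estimate the master keeps for this replica


def pick_read_replica(
    replicas: List[ReplicaStatus], read_point: int
) -> Optional[ReplicaStatus]:
    """Pick the fastest replica known to hold every write up to read_point."""
    candidates = [r for r in replicas if r.acked_lsn >= read_point]
    if not candidates:
        return None  # nothing known to be current; the caller would fall back to a quorum read
    return min(candidates, key=lambda r: r.latency_ms)


cache = [
    ReplicaStatus("a", acked_lsn=90, latency_ms=1.0),   # fast but behind
    ReplicaStatus("b", acked_lsn=100, latency_ms=2.5),
    ReplicaStatus("c", acked_lsn=100, latency_ms=4.0),
]
print(pick_read_replica(cache, read_point=100).name)  # b
```

The point of the sketch is that the master already learns, from write acknowledgements, which replicas are up to date, so a single read to one of them is enough.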

Membership management is discussed in post4, where overlapping quorums are used to handle node failures. One nice feature is that this approach stays robust even if new failures occur while an earlier failure is still being handled.
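As I understand the overlapping-quorum idea, while a suspect node is being replaced a write must satisfy the write quorum of both the old and the new membership. Below is a minimal sketch under that assumption; the node names, the threshold of 4, and both function names are illustrative choices of mine.

```python
def meets_quorum(acks, members, needed):
    """True if at least `needed` of `members` have acknowledged the write."""
    return len(set(acks) & set(members)) >= needed


def write_ok_during_change(acks, old_members, new_members, needed=4):
    """During a membership change, require the write quorum in BOTH member sets."""
    return meets_quorum(acks, old_members, needed) and meets_quorum(
        acks, new_members, needed
    )


old = "ABCDEF"   # F is suspected to have failed
new = "ABCDEG"   # G is the replacement being brought up

print(write_ok_during_change("ABCD", old, new))  # True: A-D belong to both sets
print(write_ok_during_change("ABCF", old, new))  # False: only A, B, C count in the new set
```

Because every accepted write is durable in both memberships, the change can later be committed (keep G) or rolled back (keep F) without losing data, which is why a further failure in the middle of the process does not break anything.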

Finally, I’d like to end with the following passage from the posts:

State is often considered a dirty word in distributed systems—it is hard to manage and coordinate consistent state as you scale nodes and encounters faults. Of course, the entire purpose of database systems is to manage state, providing atomicity, consistency, isolation, and durability (ACID).