Things about replication in Elasticsearch


Updated on 2018-04-18

Elasticsearch has been evolving fast over the past few years. There have been quite a few discussions on data loss during node crashes, which can be found here and here. Most of the issues have been fixed, as described here. However, since the fixes came with the major upgrade to version 5+, some serious issues still remain in lower versions, e.g., the stale replica problem described here.

I already discussed the two-node issue in the original article. I recently carried out an experiment with 3 nodes, which is actually the recommended minimum size for an Elasticsearch cluster. With 3 nodes, the quorum size is 2 and the minimum master nodes setting (discovery.zen.minimum_master_nodes) is 2. Therefore, any two quorums always overlap in at least one node that has the latest state. Let me explain this with an example. The nodes are A, B, and C. We go through the following test steps:

  1. Create a new index with 2 replicas, i.e., 3 copies in total;
  2. Shut down A;
  3. Index 1 document, which succeeds on B and C;
  4. Shut down B and C;
  5. Turn on A;

What is the state of the index now? The replica on A will not be allocated as the primary shard: with only one alive node, fewer than the minimum master nodes setting of 2, no master can be elected and the shard stays unassigned. Now, we turn on B. As B has the latest state, B propagates it to A.
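
For reference, here is a minimal sketch of how one could inspect the index state at each step of such an experiment, assuming the official Python client; the index name test-index and the node address are placeholders:

    from elasticsearch import Elasticsearch

    # Connect to whichever node is reachable at this point of the experiment (A in step 5).
    es = Elasticsearch(['nodeA:9200'])

    # Cluster health: 'red' means the primary is unassigned, 'yellow' means the
    # primary is assigned but some replicas are not, 'green' means all copies are assigned.
    print(es.cluster.health(index='test-index')['status'])

    # The cat shards API lists every shard copy, the node it lives on, whether it
    # is the primary ('p') or a replica ('r'), and its allocation state.
    print(es.cat.shards(index='test-index', v=True))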

Most open-source distributed systems rely on a mature consensus approach such as Raft or ZooKeeper. However, Elasticsearch decided to invent its own, which is actually the root cause of most of those serious issues. This really drives me crazy :(

As far as I know, the closest configuration to making Elasticsearch strongly consistent in a 3-node cluster consists of the following (see the sketch after the list):

  • minimum master nodes (discovery.zen.minimum_master_nodes) = 2
  • write consistency = quorum
  • write/read/search preference = primary
  • check if index succeeds on >= 2 nodes
  • always refresh before search (assuming search is an infrequent operation)
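
Here is a rough sketch of what that setup could look like with the official Python client. The index name test-index and the node addresses are placeholders, and the consistency parameter assumes a pre-5.x cluster (in 5.x it was replaced by wait_for_active_shards):

    from elasticsearch import Elasticsearch

    # discovery.zen.minimum_master_nodes: 2 is assumed to be set in
    # elasticsearch.yml on all three nodes A, B, and C.
    es = Elasticsearch(['nodeA:9200', 'nodeB:9200', 'nodeC:9200'])

    # 2 replicas per shard, i.e., 3 copies in total.
    es.indices.create(index='test-index', body={
        'settings': {'number_of_shards': 1, 'number_of_replicas': 2}
    })

    # Write with the quorum consistency check, then verify the write actually
    # succeeded on at least 2 copies (the check alone does not guarantee this).
    resp = es.index(index='test-index', doc_type='doc', id='1',
                    body={'msg': 'hello'}, consistency='quorum')
    assert resp['_shards']['successful'] >= 2, 'write did not reach 2 copies'

    # Refresh before searching (search is assumed to be infrequent), and read
    # only from the primary copy.
    es.indices.refresh(index='test-index')
    hits = es.search(index='test-index', preference='_primary',
                     body={'query': {'match_all': {}}})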

========

Replication is a key feature of Elasticsearch for two reasons: (1) when some machines fail, the remaining ones can still serve requests; (2) replication boosts read throughput, since reads can retrieve data from multiple nodes simultaneously.

Elasticsearch follows a primary-secondary model for replication. When a write request (e.g., create, index, update, delete) arrives, it is first forwarded to the primary replica. The primary replica finishes the request and then concurrently forwards it to all alive secondary replicas. There are a few details about this process.

First, there is a concept of write consistency level in Elasticsearch, with the available options one, quorum, and all. This concept is a bit different from what we normally find in other systems such as Kafka. It merely makes the primary replica check whether enough alive replicas are available to receive a write request. For instance, suppose we have a cluster of 3 nodes with 2 replicas, i.e., each shard has one primary replica and 2 secondary replicas. If we set the write consistency level to quorum, when the primary replica receives an index request, it checks whether at least 2 copies are available (i.e., >= total copies / 2 + 1). If the check passes, the primary replica performs the index action and then forwards the request to all replicas. One should note that the consistency level is only a check: there is a chance that a replica fails after the check but before the action.
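
To make the arithmetic concrete, here is a small sketch of the check as I understand it (not Elasticsearch code, just the formula):

    def required_active_copies(number_of_replicas, consistency='quorum'):
        """Copies that must be alive before the primary starts the write,
        per the pre-5.x write consistency check (as I understand it)."""
        total_copies = 1 + number_of_replicas      # 1 primary + N secondaries
        if consistency == 'one':
            return 1
        if consistency == 'all':
            return total_copies
        # 'quorum' (the check only kicks in when there is more than 1 replica)
        return total_copies // 2 + 1

    print(required_active_copies(2))  # 3 copies in total -> a quorum of 2 must be alive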

Second, we need to answer the question: when should the primary replica respond to the client? It turns out that there are two modes, sync and async, as discussed here. The sync mode means the primary replica only responds to the client after “all” secondary replicas have finished the request; note that this “all” has nothing to do with the selected write consistency level. Under the async mode, the primary replica responds to the client right after finishing the request itself; the request is then forwarded to the other replicas asynchronously. This speeds up the response for the client, but it may overload the Elasticsearch cluster. Meanwhile, since the request propagates to the replicas only eventually, there is no read-your-writes guarantee even inside the same session, unless the read preference is set to primary.
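
A small sketch of that session-level workaround, again with the Python client and placeholder names; reading from the primary copy avoids hitting a replica that has not yet applied an asynchronously propagated write:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['nodeA:9200', 'nodeB:9200', 'nodeC:9200'])

    # Write a document; under async replication the secondaries may not have it yet.
    es.index(index='test-index', doc_type='doc', id='42', body={'msg': 'just written'})

    # The get API is realtime, so no refresh is needed, but which copy serves the
    # read still matters: forcing the primary avoids a replica that is lagging behind.
    doc = es.get(index='test-index', doc_type='doc', id='42', preference='_primary')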

In the normal case, there is only one primary replica for each shard. Once the primary replica fails, a secondary replica is elected to serve as the primary. In some special situations, the primary replica may lose its connection to the other replicas, leading to multiple primary replicas in the system, which is called the split brain problem, as discussed here. The cure for this problem is setting discovery.zen.minimum_master_nodes to >= half of the nodes + 1. For example, if you have 3 nodes, minimum_master_nodes should be set to 2. By setting minimum_master_nodes we ensure that the service is only available if there are at least minimum_master_nodes living nodes within one master domain. In other words, there cannot be two masters in the system.
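
For a 3-node cluster, the setting could be applied as in the sketch below. The value normally lives in elasticsearch.yml on every node, but it is also dynamically updatable through the cluster settings API; node addresses are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['nodeA:9200', 'nodeB:9200', 'nodeC:9200'])

    # Equivalent to putting `discovery.zen.minimum_master_nodes: 2` in
    # elasticsearch.yml on each node. With 3 nodes: 3 // 2 + 1 = 2.
    es.cluster.put_settings(body={
        'persistent': {'discovery.zen.minimum_master_nodes': 2}
    })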

Finally, I want to discuss the problem of the stale shard, which I read about recently here. Let’s start with a concrete example. Say we have two nodes and each shard has two copies (one primary and one secondary). We first index 10 documents with the secondary shard node turned off. Then, we turn off the primary shard node and bring up the secondary shard node. The question is whether the secondary shard will be promoted to primary, and if it is, what happens to the 10 documents we indexed before? According to this blog, with Elasticsearch v5+, the primary shard not only performs the index but also informs the master about in-sync shards. In this case, the answer to our question is no, because the secondary shard is not in the in-sync set after being brought up. I didn’t experiment with this myself since I don’t have an Elasticsearch v5+ environment. I only tested it with Elasticsearch 2.4.5, where I found a different answer: after the secondary shard node was brought up, the secondary shard was indeed promoted to primary, and the 10 documents were lost once I brought the previous primary shard node back up. This is indeed a problem when such a special situation happens, though it should be quite rare in practice, especially if you have more than 2 nodes and use the quorum write consistency level.
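
For completeness, on a v5+ cluster one could inspect which copies the master currently considers in-sync via the cluster state metadata. This is a sketch I have not run myself (no v5+ environment at hand), with placeholder names:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['nodeA:9200'])

    # On 5.x+ the cluster state metadata records the allocation IDs of the shard
    # copies the master considers in-sync; a copy that missed the writes while it
    # was offline should not appear here and should not be auto-promoted to primary.
    state = es.cluster.state(metric='metadata', index='test-index')
    print(state['metadata']['indices']['test-index']['in_sync_allocations'])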