Monday 14 November 2016

Redis Cluster - How to fix a cluster that is in a broken state after migration

Sometimes we need to expand a Redis cluster by adding more machines. This is done by making the new machines part of the cluster and then resharding the existing cluster, as explained here.

However, the cluster often gets stuck in an inconsistent state when an error occurs during resharding. This can happen for many reasons, such as sudden termination of the reshard script, a Redis timeout error while moving slots (when the keys are too big), or a key that already exists in the target database (because moving a slot involves migrating its keys).

There is no quick fix for these problems, and one has to understand the internals of how a slot movement happens.

Still, there are some general steps you can try when a Redis cluster is stuck in an inconsistent state during a reshard/migration.

The following are the two main ways to fix a cluster that was broken by a migration.


Run the Fix Script

A resharding error can often be fixed by running the fix command of the redis-trib.rb script that ships with Redis.
It can be run as follows:

./redis-trib.rb fix 127.0.0.1:7000

Change the IP address and port to match your configuration. You only need to provide the IP address and port of a single node that is part of the cluster. The script reads the configuration of the other nodes automatically.
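Before letting the script change anything, it can help to look at what it would complain about. The sketch below is only an illustration: redis-trib.rb does have a check subcommand alongside fix, but the wrapper function trib_check_then_fix and the DRY_RUN and TRIB variables are hypothetical names made up here, and the script path is assumed to be the current directory. With DRY_RUN=1 the commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: run "check" (report only) and then "fix" against one node.
# TRIB is an assumed variable holding the path to redis-trib.rb.
TRIB="${TRIB:-./redis-trib.rb}"

trib_check_then_fix() {
    node="$1"   # host:port of any single node in the cluster
    for sub in check fix; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: show the command instead of running it.
            echo "$TRIB $sub $node"
        else
            "$TRIB" "$sub" "$node"
        fi
    done
}
```

For example, DRY_RUN=1 followed by trib_check_then_fix 127.0.0.1:7000 prints the check and fix commands for that node so they can be reviewed first.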


If the above cannot fix the cluster state, you can follow the step below.

Set the Cluster Slot to a Particular Node

This means manually checking the keys in the unstable slot and setting the slot to be served by a particular node. We can execute the "cluster nodes" command on all the nodes and see whether any slot is set in a migrating/importing state. If we are sure that a slot belongs to a particular node, and that node holds and serves the data for that slot, we can assign the slot to that node by executing the cluster setslot <slot> node <nodeid> command as described here.
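Spotting the stuck slots by eye in the cluster nodes output is tedious, so a small filter can help. In that output the slot entries start at the ninth field, a migrating slot appears as [slot->-nodeid] and an importing slot as [slot-<-nodeid]. The function name open_slots below is a hypothetical helper, not part of Redis; treat it as a rough sketch.

```shell
#!/bin/sh
# Hypothetical helper: read "cluster nodes" output on stdin and print
# "<node id> <slot> <state>" for every slot marked migrating or importing.
open_slots() {
    awk '{
        for (i = 9; i <= NF; i++) {
            if ($i ~ /^\[[0-9]+-[<>]-/) {
                # "->-" marks a migrating slot, "-<-" an importing one.
                state = ($i ~ /->-/) ? "migrating" : "importing"
                slot = substr($i, 2)      # drop the leading "["
                sub(/-.*$/, "", slot)     # keep only the slot number
                print $1, slot, state
            }
        }
    }'
}
```

It would be used per node, for example: redis-cli -h 127.0.0.1 -p 7000 cluster nodes | open_slots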

Suppose the cluster nodes command executed on the node 127.0.0.1:7000 shows that slot 1000 is with some other node, while on all other nodes it shows that the slot is with the node with node id abcdefgghasgdkashd. Then the configuration of that node (127.0.0.1:7000) is wrong and needs to be corrected. This can be done as follows.

redis-cli -h 127.0.0.1 -p 7000 cluster setslot 1000 node abcdefgghasgdkashd

The above command tells the node 127.0.0.1:7000 that slot 1000 is served by the node with node id abcdefgghasgdkashd, so that 127.0.0.1:7000 corrects its configuration to reflect this.

Note that you should run it only if you are sure that all other nodes agree on this in their cluster nodes output, and you are sure that the data for slot 1000 resides on the node abcdefgghasgdkashd.
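One way to check that the nodes agree is to ask each node which node id it lists as the owner of the slot and compare the answers. The sketch below assumes the usual cluster nodes layout, where assigned slots appear from the ninth field on as single numbers or ranges such as 0-5460. The function name slot_owner is hypothetical, and the ports 7001 and 7002 in the usage comment are assumptions about a local test cluster.

```shell
#!/bin/sh
# Hypothetical helper: read "cluster nodes" output on stdin and print the id
# of the node that this view lists as owner of slot $1 (nothing if unassigned).
slot_owner() {
    awk -v slot="$1" '{
        for (i = 9; i <= NF; i++) {
            if ($i ~ /^\[/) continue        # skip migrating/importing markers
            n = split($i, r, "-")           # "1000" or "0-5460"
            lo = r[1] + 0
            hi = (n > 1 ? r[2] : r[1]) + 0
            if (slot + 0 >= lo && slot + 0 <= hi) print $1
        }
    }'
}

# Usage sketch (ports are assumptions): count how many nodes report each owner.
# for port in 7000 7001 7002; do
#     redis-cli -h 127.0.0.1 -p "$port" cluster nodes | slot_owner 1000
# done | sort | uniq -c
```

If more than one node id comes back from the loop, the nodes disagree, and the odd one out is the candidate for the cluster setslot correction.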

