Friday, 11 November 2016

redis cluster - slave sync with master failed - client scheduled to be closed ASAP for overcoming of output buffer limits.

Sometimes while working with redis cluster, and while the master-slave replication happens, sometimes slave is not able to sync itself to the master.
While syncing, the slave's connection to the master breaks again and again.

The following are the logs on the slave instance(IP:Port are masked).

10 Nov 17:09:20.387 * Connecting to MASTER IP1:PORT1
10 Nov 17:09:20.387 * MASTER <-> SLAVE sync started
10 Nov 17:09:20.387 * Non blocking connect for SYNC fired the event.
10 Nov 17:09:20.387 * Master replied to PING, replication can continue...
10 Nov 17:09:20.388 * Trying a partial resynchronization (request 479e75d20ac03150de943a24d9b61300b6fa59d0:266086101935).
10 Nov 17:09:20.894 * Full resync from master: 479e75d20ac03150de943a24d9b61300b6fa59d0:266259284057
10 Nov 17:09:20.894 * Discarding previously cached master state.
10 Nov 17:10:31.281 * MASTER <-> SLAVE sync: receiving 3811948917 bytes from master
10 Nov 17:10:40.138 * MASTER <-> SLAVE sync: Flushing old data
10 Nov 17:11:11.933 * MASTER <-> SLAVE sync: Loading DB in memory
10 Nov 17:12:37.115 * MASTER <-> SLAVE sync: Finished with success
10 Nov 17:12:37.849 # Connection with master lost.
10 Nov 17:12:37.849 * Caching the disconnected master state.
10 Nov 17:12:38.866 * Connecting to MASTER IP1:PORT1
10 Nov 17:12:38.866 * MASTER <-> SLAVE sync started
10 Nov 17:12:38.866 * Non blocking connect for SYNC fired the event.

The error keeps repeating itself.
Somehow the error above does not give much clue. Just when the master-slave sync succeeds, connection with master is lost.

Looking at the master logs give more clue.
Below are master logs.

10 Nov 17:09:20.383 * Slave IP2:PORT2 asks for synchronization
10 Nov 17:09:20.383 * Unable to partial resync with slave IP2:PORT2 for lack of backlog (Slave request was: 266086101935).
10 Nov 17:09:20.383 * Starting BGSAVE for SYNC with target: disk
10 Nov 17:09:20.889 * Background saving started by pid 37651
10 Nov 17:10:30.557 * DB saved on disk
10 Nov 17:10:30.844 * RDB: 283 MB of memory used by copy-on-write
10 Nov 17:10:31.276 * Background saving terminated with success
10 Nov 17:10:39.561 * Synchronization with slave IP2:PORT2 succeeded
10 Nov 17:11:09.060 # Client id=2292912 addr=IP2:23666 fd=233 name= age=109 idle=9 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=7104 omem=127259816 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
10 Nov 17:11:09.160 # Connection with slave IP2:PORT2 lost.
10 Nov 17:12:38.861 * Slave IP2:PORT2 asks for synchronization
10 Nov 17:12:38.861 * Unable to partial resync with slave IP2:PORT2 for lack of backlog (Slave request was: 266261610586).
10 Nov 17:12:38.861 * Starting BGSAVE for SYNC with target: disk


The Key part is cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.

It seems to show that the slave connection was closed as it reached output buffer limits for the master.
Somehow we need to fix this.
There is a setting in the conf file
client-output-buffer-limit slave 256mb 64mb 60

Reading the excellent documentation seems to suggest that the two limits, 256 mb and 64 mb are two limits, hard limits and soft limits. If the client connection for slave exceeds 256 mb or exceeds 64 mb continuously for 60 seconds.

We can either increase this setting, and restart the master server, but it seems too much of an effort and a risk and restarting a master may make its outdated slave trying to become a master.
Thankfully, this can be changed at runtime by executing the following command on master server.


redis-cli -h MASTER_IP -p MASTER_PORT CONFIG get client-output-buffer-limit

It gives the output as 
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"

Notice that the hard and soft limits for slave is 256 MB/64 MB (268435456/67108864 bytes)
We will update it to  1 GB/256 MB(1073741824/268435456 bytes) by executing the following command.


redis-cli -h MASTER_IP -p MASTER_PORT CONFIG set client-output-buffer-limit "normal 0 0 0 slave 1073741824 268435456 60 pubsub 33554432 8388608 60"

Note that depending on how far your slave is lagging behind the master and how fast your slave is able to consume the data from the master while syncing, you may have to increase the limit to more than 1GB/256 MB. Just be sure to go through the excellent documentation of client-output-buffer-limit in redis.conf.

After executing this command, we see that the slave succeeds in syncing with the master and does not give any other error.  


Below are the logs on slave after executing the command.

10 Nov 17:15:58.692 * MASTER <-> SLAVE sync started
10 Nov 17:15:58.692 * Non blocking connect for SYNC fired the event.
10 Nov 17:15:58.692 * Master replied to PING, replication can continue...
10 Nov 17:15:58.692 * Trying a partial resynchronization (request 479e75d20ac03150de943a24d9b61300b6fa59d0:266435075281).
10 Nov 17:15:59.195 * Full resync from master: 479e75d20ac03150de943a24d9b61300b6fa59d0:266612963059
10 Nov 17:15:59.195 * Discarding previously cached master state.
10 Nov 17:17:10.490 * MASTER <-> SLAVE sync: receiving 3811953308 bytes from master
10 Nov 17:17:19.613 * MASTER <-> SLAVE sync: Flushing old data
10 Nov 17:17:53.203 * MASTER <-> SLAVE sync: Loading DB in memory
10 Nov 17:19:19.332 * MASTER <-> SLAVE sync: Finished with success
10 Nov 18:01:31.139 * Background saving started by pid 16989


:)

1 comment: