Wednesday 21 December 2016

Redis Cluster monitoring - part 1 - node monitoring script

Redis is an in-memory database, often used for caching, which provides very high performance and can run uninterrupted for months. Because redis stores all of its data in memory, a data set larger than the memory of a single machine has to be distributed across several machines. This is where Redis Cluster comes in: it gives us a way to distribute the data across machines and to add or remove machines as needed.

However, once we have a large number of nodes in a redis cluster, it becomes imperative to continuously monitor the state and health of each redis cluster node/system, because the chances of one or more nodes failing increase.

It also helps to have automatic scripts which monitor the redis nodes and send an alert when they detect an error or when some memory/connection threshold has been reached.

A cluster should have the following kinds of monitoring:
  1. monitoring individual nodes.
  2. monitoring overall cluster health.
  3. monitoring stats on redis nodes.
  4. viewing the redis stats over time on a graph.

Monitoring individual nodes basically involves watching the redis process running on each node and restarting it if it stops.

Monitoring overall cluster health is required because one of the machines may be down, in which case the monitoring script running on that machine cannot send an alert. To cover this, a global redis monitoring script should try a basic insert in each cluster stack and trigger an email alert if the insert fails.
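
As a rough illustration of the "basic insert" idea, here is a minimal sketch. The node list, the key name monitor:heartbeat and the function names are assumptions made for this example and not part of any existing script; redis-cli is called with -c so that it follows cluster redirects.

```shell
#!/bin/bash
# Sketch only: node list, key name and function names are illustrative assumptions.

REDIS_CLI="${REDIS_CLI:-redis-cli}"      # overridable, e.g. to plug in a stub
NODES="10.0.0.1:7000 10.0.0.2:7000"      # hypothetical host:port entry points

# Try a basic write through one node; -c makes redis-cli follow cluster redirects.
# Returns 0 only if the reply is exactly "OK".
check_write() {
        local host=${1%%:*} port=${1##*:} reply
        reply=$("$REDIS_CLI" -c -h "$host" -p "$port" set monitor:heartbeat "$(date +%s)" 2>&1)
        [[ $reply == "OK" ]]
}

# Check every node in NODES and print the ones whose write failed.
failed_nodes() {
        local node failed=''
        for node in $NODES; do
                check_write "$node" || failed="$failed $node"
        done
        echo "${failed# }"
}

# Example usage (not run here):
#   bad=$(failed_nodes); [[ -n $bad ]] && echo "need to send mail: writes failed on $bad"
```

The check is meant to run from a machine outside the cluster, so that it still works when a cluster machine is down.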

Monitoring stats on redis nodes is required so that we don't have to wait for things to go bad in redis, and can instead detect when a threshold is reached for one of the redis metrics. This involves automatic monitoring of individual redis nodes for connections, memory, replication lag etc.
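
A threshold check of this kind could look roughly like the sketch below. It reads the output of the redis INFO command and compares the connected_clients and used_memory fields against limits. The limit values and the function names are assumptions chosen for illustration.

```shell
#!/bin/bash
# Sketch only: threshold values and function names are illustrative assumptions.

MAX_CLIENTS=5000                 # assumed connection threshold
MAX_MEMORY_BYTES=4294967296      # assumed memory threshold (4 GiB)

# Print the value of one field from INFO-style "name:value" text on stdin.
# INFO lines end in CRLF, so carriage returns are stripped first.
info_field() {
        tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2; exit }'
}

# Read INFO text on stdin and print one line per breached threshold.
# Returns 1 if any threshold was breached, 0 otherwise.
check_thresholds() {
        local info clients memory status=0
        info=$(cat)
        clients=$(printf '%s\n' "$info" | info_field connected_clients)
        memory=$(printf '%s\n' "$info" | info_field used_memory)
        if [[ ${clients:-0} -gt $MAX_CLIENTS ]]; then
                echo "connected_clients=$clients exceeds $MAX_CLIENTS"
                status=1
        fi
        if [[ ${memory:-0} -gt $MAX_MEMORY_BYTES ]]; then
                echo "used_memory=$memory exceeds $MAX_MEMORY_BYTES"
                status=1
        fi
        return $status
}

# Example usage against a live node (not run here):
#   redis-cli -p 7000 info | check_thresholds || echo "need to send mail"
```

Keeping the parsing separate from the redis-cli call makes it possible to try the check on saved INFO output before pointing it at a real node.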

Finally, we need to have the stats of various redis nodes represented in terms of graphs. This is required to identify uneven patterns in the data usage/access and to have a global view of how the redis stack is used.

In this part, we will go through how we can monitor individual nodes.

Individual nodes can be monitored by a shell script. The script runs at a short, regular interval (anywhere from 30 seconds to a couple of minutes) and checks whether a redis server is running on each of the configured ports. If no redis server is running on one of those ports, the script restarts it.

This can be achieved using a simple shell script as below.

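#!/bin/bash
# Checks that a redis-server process exists for every port in the range
# and starts any instance that is missing.

START_PORT=7000
END_PORT=7003

error=0
ports=''
checkRedis(){
        # count redis-server processes whose command line contains ":<port>";
        # the [r] keeps grep from matching its own entry in the process list
        count=$(ps -ef | grep "[r]edis-server" | grep -c ":$1")
        if [[ $count -eq 0 ]]
        then
                error=1
                ports="$ports $1"
                echo "starting redis on port $1"
                # start redis either by redis-server or by service if redis is installed as a service
                # service redis-$1 start
                # the paths below are relative, so cd to the redis directory first (or use
                # absolute paths) when running from cron; redis.conf should set
                # "daemonize yes", otherwise the script blocks here
                src/redis-server "cluster-test/$1/redis.conf"
        fi
}

for ((i=START_PORT;i<=END_PORT;i++)); do
    checkRedis "$i"
done

if [[ $error -eq 1 ]]
then
        echo "need to send mail that redis was started on ports $ports"
fi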
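
The script only prints a reminder where the mail should be sent. One possible way to fill that in is sketched below, assuming a mailx-style mail command is installed on the machine. The recipient address, the subject line and the function names are placeholders invented for this example.

```shell
#!/bin/bash
# Sketch only: recipient, subject and function names are illustrative assumptions.

ALERT_TO="${ALERT_TO:-ops@example.com}"   # hypothetical recipient
MAIL_CMD="${MAIL_CMD:-mail}"              # assumes a mailx-style "mail" command

# Build the alert body for the given list of restarted ports.
alert_body() {
        echo "Host: $(uname -n)"
        echo "Time: $(date '+%Y-%m-%d %H:%M:%S')"
        echo "Redis was restarted on ports: $*"
}

# Send the alert; "mail -s SUBJECT RECIPIENT" reads the body from stdin.
send_alert() {
        alert_body "$@" | "$MAIL_CMD" -s "redis restarted on $(uname -n)" "$ALERT_TO"
}

# Example usage at the end of the monitoring script (not run here):
#   if [[ $error -eq 1 ]]; then send_alert $ports; fi
```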


The above script should run on every machine that hosts one or more redis nodes. If a machine runs redis on several ports, the port range can be set through START_PORT and END_PORT. In the above script, we specified that redis will be running on ports 7000, 7001, 7002 and 7003.

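The above script can be saved in a file like 'monitorIndividualNodes.sh' and run every 2 minutes from crontab. It should be run with bash rather than sh, since it uses bash-specific syntax such as [[ ]] and the (( )) loop:

*/2 * * * * bash /redis/monitorIndividualNodes.sh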

The script can be scheduled at any interval, such as every minute, through crontab or any other reliable scheduling service. On each run it checks whether a redis server is running on the predefined ports of that machine and starts redis on any port where it is not. Optionally, it should also send an email to alert the people concerned.

Also, after a system restart, cron will run the script at the next scheduled interval and all the redis instances will be started again.

Considering redis is very stable, and does not stop unless there is a machine restart, we don't have to worry about receiving too many emails. :)

In the next part, we will see the script to monitor the overall health of the cluster. This is useful when one or more machines are down, as a result of which the monitoring script for individual nodes cannot run on them and no alert is generated by them.

Happy redising. :)
