Friday, 8 September 2017

How does HyperLogLog work in Redis

Redis provides different functionality for a key depending on the data structure stored in it (string, hash, set, sorted set, and so on).

One of the most amazing features Redis provides is estimating unique counts.

Imagine you were asked to keep track of unique visitors on a particular page.
Ideally, to count the unique visitors, whenever a user lands on the page with an id, you need to check whether that id already exists in the set of users seen so far. Doing so requires enough space to keep every individual user id.
Almost magically, Redis does it within a fixed space of 12 KB, with an accuracy of over 99%. It does this with the commands PFADD and PFCOUNT.
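
For example, here is a minimal sketch using the Python redis-py client (assuming a Redis server running on localhost; the key name and user ids are made up for illustration):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Each PFADD feeds one element into the HyperLogLog stored at this key.
    r.pfadd("page:home:visitors", "user:1001")
    r.pfadd("page:home:visitors", "user:1002")
    r.pfadd("page:home:visitors", "user:1001")  # duplicate, changes nothing

    # PFCOUNT returns the estimated number of distinct elements.
    print(r.pfcount("page:home:visitors"))  # -> 2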

This is not something unique to Redis: the concept of HyperLogLog (HLL) originated in the paper by Flajolet et al. (linked at the end of this post).

In the sections below, we will go through the basic concept of HyperLogLog (HLL) and arrive at the number 12 KB, which is the maximum amount of memory Redis will use for a particular key to count the unique values in it, regardless of how many values are added.

Let's say you have some free time on your hands, and you flip a coin N1 times, keeping track of the heads you get. A flip results in heads or tails with equal probability.
You then mention to me that you got 3 consecutive heads.

Then you start flipping the coin again, ignoring the earlier results. This time you tell me that you flipped the coin N2 times and got 5 consecutive heads.

If I had to make a half-decent guess about the relationship between N1 and N2, I would guess that N1 < N2.

Now, coming to probability, the probability of getting 3 consecutive heads is (1/2)^3 = 1/8, and of getting 5 consecutive heads is (1/2)^5 = 1/32.
This more or less means that you could expect to get 3 consecutive heads if you flip the coin around 8 times (or maybe 8-10 times, since you need 3 flips for the heads to occur 3 times in a row).

But you may also get the 3 heads in your first 3 flips, and it is possible, if you are quite unlucky, to need far more, say 50-100 flips, to reach that elusive 3rd consecutive head.
So maybe we should take the average (mean) of the number of flips over multiple iterations of this experiment (getting 3 consecutive heads and counting the flips).

Viewing it the other way around: if you tell me the number of consecutive heads you saw, I can roughly guess the number of coin flips you made, especially if I ask you to repeat the experiment multiple times and average the results.

This is the core concept behind HyperLogLog.
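
To make this concrete, here is a small Python simulation (an illustration of the idea only, not anything Redis does) showing that the longest run of heads grows with the number of flips:

    import random

    def longest_heads_run(flips):
        # Flip a fair coin `flips` times, return the longest run of heads seen.
        best = current = 0
        for _ in range(flips):
            if random.random() < 0.5:  # heads
                current += 1
                best = max(best, current)
            else:
                current = 0
        return best

    # Average the longest run over many repetitions to smooth out luck.
    for n in (10, 100, 1000, 10000):
        avg = sum(longest_heads_run(n) for _ in range(200)) / 200
        print(n, round(avg, 2))
    # The average longest run grows roughly like log2(n): the longer you
    # flip, the longer the best run, which is what lets us guess N back
    # from the run length.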

What Redis does

When we run "PFADD KEY VAL" in Redis, Redis computes the hashcode of the value being added. The hashcode is a 64-bit integer.

Now, a Redis HyperLogLog has 16384 slots, or buckets, and Redis needs to map each value to one of these buckets.

So what it does is take the first 14 bits of the hashcode and use them to select a bucket.
Note that 16384 = 2^14, and each bit can take 2 values, 0 and 1, so the first 14 bits of the hashcode can take 2^14 = 16384 different values: exactly one for each bucket.

Now, remember that the hashcode has 64 bits, of which 14 were used for selecting the bucket.
In the remaining 50 bits, Redis counts the number of leading consecutive 0s for that hashcode. It keeps this count in the bucket the hashcode maps to: if the new count is greater than the count of consecutive 0s already stored in that bucket, it replaces the existing number.
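
Here is a rough Python sketch of that update step. This is a simplification for illustration, not Redis's actual implementation: Redis uses its own 64-bit hash function (sha1 below is just a stand-in) and packs the counters into 6-bit registers rather than a plain list:

    import hashlib

    NUM_BUCKETS = 16384            # 2^14 buckets
    registers = [0] * NUM_BUCKETS  # one counter per bucket

    def hash64(value):
        # A 64-bit hash of the value; sha1 is a stand-in for the sketch.
        digest = hashlib.sha1(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def pfadd(value):
        h = hash64(value)
        bucket = h >> 50                # the first 14 bits select the bucket
        rest = h & ((1 << 50) - 1)      # the remaining 50 bits
        zeros = 50 - rest.bit_length()  # leading consecutive 0s in those 50 bits
        # Keep only the largest count this bucket has ever seen.
        registers[bucket] = max(registers[bucket], zeros)

    pfadd("user:1001")
    pfadd("user:1002")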

In the end, each of the 16384 buckets keeps one number: the largest count of consecutive 0s it has seen.
Note that we counted consecutive 0s within 50 bits, and the maximum number of consecutive 0s in 50 bits is 50 (when all the bits are 0).
So each bucket has to keep a number between 0 and 50.
How many bits does each bucket need? To store a number between 0 and 50, you need 6 bits (because 2^6 = 64 > 50 > 32 = 2^5).
Now, there are 16384 buckets, and each needs 6 bits.
That is 16384 * 6 bits in total.
Recall that 8 bits = 1 byte.
So the 16384 buckets need 16384 * 6 bits = (16384 / 8) * 6 bytes = 2048 * 6 bytes = 12288 bytes = 12 KB.
So each HyperLogLog value needs at most 12 KB of memory. :)
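
You can sanity-check that arithmetic in a couple of lines of Python:

    bits = 16384 * 6        # 16384 buckets, 6 bits each
    print(bits / 8)         # -> 12288.0 bytes
    print(bits / 8 / 1024)  # -> 12.0 KB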

Also, the standard error of the estimate has been shown to be 1.04/sqrt(N), where N is the number of buckets. Since Redis uses 16384 slots or buckets, the error is 1.04/sqrt(16384) = 1.04/128 ≈ 0.81%, i.e. the accuracy is over 99%.

Because the harmonic mean of the values in the buckets is taken, rather than the simple mean, the algorithm is called HyperLogLog.

The harmonic mean is used because it minimises the impact of outliers among the buckets (if one bucket has seen 10 consecutive 0s while all the others have seen 3-4, that 10 is an outlier, far from the typical value, and it would drag a simple mean upwards).
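
A quick sketch of that effect, using made-up bucket counts:

    # Bucket counts where one bucket (10) is an outlier.
    counts = [3, 4, 3, 4, 3, 10]

    arithmetic_mean = sum(counts) / len(counts)
    harmonic_mean = len(counts) / sum(1 / c for c in counts)

    print(arithmetic_mean)          # -> 4.5, pulled up by the outlier
    print(round(harmonic_mean, 2))  # -> 3.75, much less affected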


Below are some very useful links from which I could grasp this concept.

http://antirez.com/news/75
http://blog.kiip.me/engineering/sketching-scaling-everyday-hyperloglog/
http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf

:)