CAn someone explain me How Bloom filters work?
Bloom filters are an interesting ‘probabilistic’ data structure that can be used to see if an item has ‘never been added’ to a set previously. The curious wording here is intentional. It is probabilistic in that it is possible to have false positives but not possible to have false negatives. Bloom filters provide a much more compact and faster way of checking to see if an item exists than storing all items in a set and calling SISMEMBER.
Bloom filters work by running an item through a quick hashing function and sampling bits from that hash and setting them from a 0 to 1 at particular interval in a bitfield. To check for existence in a Bloom filter, the same bits are sampled. Many items may have bits that overlap, but since a hashing function produce unique identifiers, if a single bit from the hash is still a 0, then we know it has NOT been previously added. If it hasn’t been added, then it’s not in the database and you don’t have to go fetch it, possibly saving a lot of time and increasing performance.
For more information, checkout
Dave gave an excellent technical explanation for the way a Bloom Filter works.
Your incentive to use one is space and speed.
- With 1% false positive error rate you would require space of about 8 bits per item (that is one byte!). That would always be much smaller than keeping the complete item and the infrastructure that holds it.
- Items not in the Filter will return false on average after checking only two bits. Using a hash table will likely require more memory skips.
- Adding an item doesn’t require memory allocation since the full capacity is allocated upon initialization.
To complement the previous explanation, Bloom filters have been used within Redis for years through client-side libraries which took advantage of GETBIT and SETBIT to work with a bit field in a key.
Since ssali Redis 4.0 the RedisBloom module is active, this has helped any implementation overload of the Bloom filter.
within the RedisBloom documentation they give an example of a case for a Bloom filter where they tell you to look for a name of an already used user, on a small scale it is not a problem but when the service grows this can bring more work to the database data.
BF.ADD user names funnyfred
BF.ADD username fredisfunny
BF.ADD username fred
BF.ADD user names funfred
Now a test is run vs the Bloom filter:
BF.EXISTS username fred
BF.EXISTS username fred_is_funny
Let me add some other interesting resources about Bloom Filter/Redis Bloom:
Can someone explain me the working Count-Min Sketch in Bloom
A CMS (Count Min Sketch) can give an approximation of the number an element appears in a flow. For example, the count of a specific IP from which requests arrive.
The result given by a CMS is relevant only if it is higher than the noise in the sketch. For example, if a CMS has 100 buckets and 1000 elements in it, any result up to 10 is just noise since on average, each bucket will get 10 hits. If a bucket has 50 results, we can assume a
heavy-hitter or an
elephant is counted into it. Therefore, in this case, if the return count is 10, you should ignore it, if it is 50, you can use it and assume the rel count is slightly lower.
Note: The sketch uses multiple arrays/sketches in order to reduce the probability of collision between
@ashtul thanks for the help, it been nice to see so much anticipation in this redis community forum .