CAn someone explain me How Bloom filters work?
Bloom filters are an interesting ‘probabilistic’ data structure that can be used to see if an item has ‘never been added’ to a set previously. The curious wording here is intentional. It is probabilistic in that it is possible to have false positives but not possible to have false negatives. Bloom filters provide a much more compact and faster way of checking to see if an item exists than storing all items in a set and calling SISMEMBER.
Bloom filters work by running an item through a quick hashing function and sampling bits from that hash and setting them from a 0 to 1 at particular interval in a bitfield. To check for existence in a Bloom filter, the same bits are sampled. Many items may have bits that overlap, but since a hashing function produce unique identifiers, if a single bit from the hash is still a 0, then we know it has NOT been previously added. If it hasn’t been added, then it’s not in the database and you don’t have to go fetch it, possibly saving a lot of time and increasing performance.
For more information, checkout
Dave gave an excellent technical explanation for the way a Bloom Filter works.
Your incentive to use one is space and speed.
- With 1% false positive error rate you would require space of about 8 bits per item (that is one byte!). That would always be much smaller than keeping the complete item and the infrastructure that holds it.
- Items not in the Filter will return false on average after checking only two bits. Using a hash table will likely require more memory skips.
- Adding an item doesn’t require memory allocation since the full capacity is allocated upon initialization.
To complement the previous explanation, Bloom filters have been used within Redis for years through client-side libraries which took advantage of GETBIT and SETBIT to work with a bit field in a key.
Since ssali Redis 4.0 the RedisBloom module is active, this has helped any implementation overload of the Bloom filter.
within the RedisBloom documentation they give an example of a case for a Bloom filter where they tell you to look for a name of an already used user, on a small scale it is not a problem but when the service grows this can bring more work to the database data.
BF.ADD user names funnyfred
BF.ADD username fredisfunny
BF.ADD username fred
BF.ADD user names funfred
Now a test is run vs the Bloom filter:
BF.EXISTS username fred
BF.EXISTS username fred_is_funny