CAn someone explain me How Bloom filters work?
Bloom filters are an interesting ‘probabilistic’ data structure that can be used to see if an item has ‘never been added’ to a set previously. The curious wording here is intentional. It is probabilistic in that it is possible to have false positives but not possible to have false negatives. Bloom filters provide a much more compact and faster way of checking to see if an item exists than storing all items in a set and calling SISMEMBER.
Bloom filters work by running an item through a quick hashing function and sampling bits from that hash and setting them from a 0 to 1 at particular interval in a bitfield. To check for existence in a Bloom filter, the same bits are sampled. Many items may have bits that overlap, but since a hashing function produce unique identifiers, if a single bit from the hash is still a 0, then we know it has NOT been previously added. If it hasn’t been added, then it’s not in the database and you don’t have to go fetch it, possibly saving a lot of time and increasing performance.
For more information, checkout
Dave gave an excellent technical explanation for the way a Bloom Filter works.
Your incentive to use one is space and speed.
- With 1% false positive error rate you would require space of about 8 bits per item (that is one byte!). That would always be much smaller than keeping the complete item and the infrastructure that holds it.
- Items not in the Filter will return false on average after checking only two bits. Using a hash table will likely require more memory skips.
- Adding an item doesn’t require memory allocation since the full capacity is allocated upon initialization.