Bloom Filters Flashcards
(9 cards)
What is a bloom filter?
A bloom filter is a data structure that can quickly test for set membership (if something is not in a set or probably in it) without storing the entire collection.
What is one of the key benefits of a bloom filter?
It doesn’t require the entire dataset being evaluated to be stored in memory.
How do bloom filters handle false negatives? What about false positives?
Bloom filters make false negatives impossible (they guarantee something isn’t in a set), but can yield false positives (say something is in a set that isn’t).
Since bloom filters can yield false positives, why would you use them?
In systems of extreme scale, the tiny error rate of false positives may be negligible as a trade off for memory and performance.
How does a bloom filters work at a high-level?
It uses a fixed-size array as a “fingerprint” or “signature” of a set and uses multiple hash functions to distribute items across the fingerprint.
Describe the internals of a bloom filter? What does it consist of? How does it handle new entries or queries?
A bloom filter consists of two parts:
- bit array (initialized at 0)
- hash functions (take input and map to an index)
When an item is added, it’s run through all of the hash functions and those are indices set to one.
When an item is queried, all of the hash functions are processed and the bits are checked:
- if any bit is zero, it’s not there
- if all bits are 1, it’s probably there
Why does a bloom filter use multiple different hash functions?
You can think of the independent hash functions as labels (positions in the bit array). The more labels, the more data is spread across those positions in the bit array.
What are the four steps when dealing with bloom filters?
Initialize the array, add an item, querying an item, and dealing with false positives.
Initializing the array with n positions along with k independent hash functions (each returning 0 to n-1).
Adding an item to the array will feed it into the k hash functions, and set the but at that position to 1 (if it isn’t already).
Querying will run through the k hash functions and if any bit is 0, it confirms it’s not in the set, all 1 values will indicate that it is probably in it.
When dealing with false positives, you may consider a secondary double-check via a slower lookup to confirm.
How does a bloom filter handle removals?
It doesn’t since all of the entries rely on hashing and previous items can affect the bits used, so you can’t remove one without potentially affecting others.