Cuckoo Filter
A Cuckoo filter for instances of T
. Cuckoo filters are probabilistic hash tables similar to Bloom filters but with several advantages. Like Bloom filters, a Cuckoo filter can determine if an object is contained within a set at a specified false positive rate with no false negatives. Like Bloom, a Cuckoo filter can determine if an element is probably inserted or definitely is not. In addition, and unlike standard Bloom filters, Cuckoo filters allow deletions and counting. They also use less space than a Bloom filter for similar performance.
The false positive rate of the filter is the probability that mightContain} will erroneously return true
for an object that was not added to the filter. Unlike Bloom filters, a Cuckoo filter will fail to insert when it reaches capacity. If an insert fails put will return false
.
Cuckoo filters allow deletion like counting Bloom filters using #delete(Object)
. While counting Bloom filters invariably use more space to allow deletions, Cuckoo filters achieve this with no space or time cost. Like counting variations of Bloom filters, Cuckoo filters have a limit to the number of times you can insert duplicate items. This limit is 8-9 in the current design, depending on internal state. You should never exceed 7 if possible. Reaching this limit can cause further inserts to fail and degrades the performance of the filter. Occasional duplicates will not degrade the performance of the filter but will slightly reduce capacity.
This Cuckoo filter implementation also allows counting the number of inserts for each item using #approximateCount(Object)
. This is probabilistic like the rest of the filter and any error is always an increase. The count will never return less than the number of actual inserts, but may return more. The insert limit of 7 still stands when counting so this is only useful for small numbers.
Once the filter reaches capacity (put returns false). It's best to either rebuild the existing filter or create a larger one. Deleting items in the current filter is also an option, but you should delete at least ~2% of the items in the filter before inserting again.
Existing items can be deleted without affecting the false positive rate or causing false negatives. However, deleting items that were not previously added to the filter can cause false negatives.
Hash collision attacks are theoretically possible against Cuckoo filters (as with any hash table based structure). If this is an issue for your application, use one of the cryptographically secure (but slower) hash functions. The default hash function, Murmer3 is not secure. Secure functions include SHA and SipHash. All hashes, including non-secure, are internally seeded and salted. Practical attacks against any of them are unlikely.
This implementation of a Cuckoo filter is serializable.
Author
Mark Gunlogson
Parameters
the type of items that the CuckooFilter
accepts
See also
paper on Cuckoo filter properties.
implementation
implementation
Functions
CuckooFilter
that's a copy of this instance.LongBitSet
table for the filter, in bits.true
if the element might have been put in this Cuckoo filter, false
if this is definitely not the case.