CuckooFilter

public final class CuckooFilter implements Serializable

A Cuckoo filter for instances of T. Cuckoo filters are probabilistic hash tables similar to Bloom filters but with several advantages. Like Bloom filters, a Cuckoo filter can determine if an object is contained within a set at a specified false positive rate with no false negatives. Like Bloom, a Cuckoo filter can determine if an element is probably inserted or definitely is not. In addition, and unlike standard Bloom filters, Cuckoo filters allow deletions and counting. They also use less space than a Bloom filter for similar performance.

The false positive rate of the filter is the probability that mightContain} will erroneously return true for an object that was not added to the filter. Unlike Bloom filters, a Cuckoo filter will fail to insert when it reaches capacity. If an insert fails put will return false .

Cuckoo filters allow deletion like counting Bloom filters using #delete(Object). While counting Bloom filters invariably use more space to allow deletions, Cuckoo filters achieve this with no space or time cost. Like counting variations of Bloom filters, Cuckoo filters have a limit to the number of times you can insert duplicate items. This limit is 8-9 in the current design, depending on internal state. You should never exceed 7 if possible. Reaching this limit can cause further inserts to fail and degrades the performance of the filter. Occasional duplicates will not degrade the performance of the filter but will slightly reduce capacity.

This Cuckoo filter implementation also allows counting the number of inserts for each item using #approximateCount(Object). This is probabilistic like the rest of the filter and any error is always an increase. The count will never return less than the number of actual inserts, but may return more. The insert limit of 7 still stands when counting so this is only useful for small numbers.

Once the filter reaches capacity (put returns false). It's best to either rebuild the existing filter or create a larger one. Deleting items in the current filter is also an option, but you should delete at least ~2% of the items in the filter before inserting again.

Existing items can be deleted without affecting the false positive rate or causing false negatives. However, deleting items that were not previously added to the filter can cause false negatives.

Hash collision attacks are theoretically possible against Cuckoo filters (as with any hash table based structure). If this is an issue for your application, use one of the cryptographically secure (but slower) hash functions. The default hash function, Murmer3 is not secure. Secure functions include SHA and SipHash. All hashes, including non-secure, are internally seeded and salted. Practical attacks against any of them are unlikely.

This implementation of a Cuckoo filter is serializable.

Author

Mark Gunlogson

Parameters

<T>

the type of items that the CuckooFilter accepts

See also

<a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">

paper on Cuckoo filter properties.

<a href="https://github.com/seiflotfy/cuckoofilter">Golang Cuckoo filter

implementation

<a href="https://github.com/efficient/cuckoofilter">C++ reference

implementation

Types

Link copied to clipboard
public class Builder
* Builds a Cuckoo Filter.

Functions

Link copied to clipboard
public int approximateCount(long item)
This method returns the approximate number of times an item was added to the filter.
Link copied to clipboard
public CuckooFilter copy()
Creates a new CuckooFilter that's a copy of this instance.
Link copied to clipboard
public boolean delete(long item)
Deletes an element from this CuckooFilter.
Link copied to clipboard
public boolean equals(Object object)
Link copied to clipboard
public long getActualCapacity()
Gets the absolute maximum number of items the filter can theoretically hold.
Link copied to clipboard
public long getCount()
Gets the current number of items in the Cuckoo filter.
Link copied to clipboard
public double getLoadFactor()
Gets the current load factor of the Cuckoo filter.
Link copied to clipboard
public long getStorageSize()
Gets the size of the underlying LongBitSet table for the filter, in bits.
Link copied to clipboard
public int hashCode()
Link copied to clipboard
public boolean mightContain(long item)
Returns true if the element might have been put in this Cuckoo filter, false if this is definitely not the case.
Link copied to clipboard
public boolean put(long item)
Puts an element into this CuckooFilter.