Can putAll of guava bloom filter be used to create a new bloom filter of larger size? - guava

Google guava has an implementation of the classic bloom filter. Creating one involves specifying the expected number of insertions and the desired false positive probability (fpp). I want to know whether the putAll function it provides can be used to create a new filter that allows a larger number of insertions than the bloom filter argument passed to it, while retaining the same fpp.

No. It will not.
As per the javadoc, the putAll function throws an exception when the filters are not compatible. Two bloom filters are compatible if they have the same number of hash functions, the same bit size, the same strategy, and equal funnels.
The number of hash functions and the bit size are derived from the expected number of insertions and the fpp. Creating a new bloom filter with the same fpp but a larger number of insertions will therefore result in a different bit size and a different number of hash functions.
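To make the incompatibility concrete, here is a small sketch against Guava's BloomFilter API; the sizes and fpp values are arbitrary, chosen only for illustration:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class PutAllDemo {
    public static void main(String[] args) {
        // Both filters use the same funnel, expected insertions and fpp,
        // so they end up with the same bit size and number of hash functions.
        BloomFilter<CharSequence> a =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000, 0.01);
        BloomFilter<CharSequence> b =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000, 0.01);
        a.put("foo");
        b.putAll(a);                                 // OK: the filters are compatible
        System.out.println(b.mightContain("foo"));   // true

        // Same fpp but more expected insertions -> different bit size and
        // hash function count, so the filters are no longer compatible.
        BloomFilter<CharSequence> bigger =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000, 0.01);
        System.out.println(bigger.isCompatible(a));  // false
        bigger.putAll(a);                            // throws IllegalArgumentException
    }
}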

Related

Unusual hash map implementation

Does anyone know the original hash table implementation?
Every implementation I've found is based on separate chaining or open addressing.
Chaining, by Hans Peter Luhn, in 1953.
https://en.wikipedia.org/wiki/Hash_table#History
The first implementation, though not necessarily the most common today, is probably the one that uses an array (resized as needed) in which each entry points to a list of elements.
The hash code, taken modulo the size of the array, gives the index of the entry whose list should contain the element being searched for. In case of a hash code collision, elements accumulate in the list of that entry.
So, once the hash code is computed, we have O(1) access to the array entry and O(N) for the actual search of the element in the list, which checks actual equality. N must be kept low for obvious performance reasons.
When collisions become frequent, we resize the array, increasing the number of entries and thereby reducing collisions, because the hash code is now taken modulo a larger number than before.
Some more sophisticated implementations convert the lists to trees when they become too long, reducing the equality search from O(N) to O(log N).
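As a rough illustration of the scheme described above, here is a minimal separate-chaining table in Java; it is a sketch, not a production implementation, and all names are made up:

import java.util.LinkedList;

// Minimal separate-chaining hash table: an array of buckets,
// each bucket a linked list of key/value pairs.
class ChainedHashMap<K, V> {
    private static class Entry<K, V> {
        final K key; V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private LinkedList<Entry<K, V>>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    ChainedHashMap(int capacity) {
        buckets = new LinkedList[capacity];
    }

    private int indexFor(K key, int capacity) {
        // hash code taken mod the array size (masked to stay non-negative)
        return (key.hashCode() & 0x7fffffff) % capacity;
    }

    V get(K key) {
        LinkedList<Entry<K, V>> bucket = buckets[indexFor(key, buckets.length)];
        if (bucket != null) {
            for (Entry<K, V> e : bucket) {           // O(N) scan within the bucket
                if (e.key.equals(key)) return e.value;
            }
        }
        return null;
    }

    void put(K key, V value) {
        if (size > buckets.length * 0.75) resize();  // keep the chains short
        int i = indexFor(key, buckets.length);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        buckets[i].add(new Entry<>(key, value));
        size++;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<Entry<K, V>>[] old = buckets;
        buckets = new LinkedList[old.length * 2];    // re-hash mod a larger size
        for (LinkedList<Entry<K, V>> bucket : old) {
            if (bucket == null) continue;
            for (Entry<K, V> e : bucket) {
                int i = indexFor(e.key, buckets.length);
                if (buckets[i] == null) buckets[i] = new LinkedList<>();
                buckets[i].add(e);
            }
        }
    }
}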

Why are bloom filters not implemented like count-min sketch?

So I only recently learned about these, but from what I understand, counting bloom filters are very similar to count-min sketches. The difference is that the former use a single array for all hash functions while the latter use an array per hash function.
If using separate arrays for each hash function results in fewer collisions and reduces false positives, why are counting bloom filters not implemented that way?
Though both are space-efficient probabilistic data structures, a Bloom filter and a count-min sketch solve different use cases.
A Bloom filter is used to test whether an element is a member of a set or not, and it can give false positive results. A false positive means it might report that a given element is already present when, in fact, it is not. See here for working details: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
A count-min sketch keeps track of counts, i.e., how many times an element has been added to a set. See here for working details: https://www.geeksforgeeks.org/count-min-sketch-in-java-with-examples/
I would like to add to @roottraveller's answer and try to answer the OP's question. First, I find the following resource really helpful for understanding the basic differences between a Bloom filter, a counting Bloom filter, and a count-min sketch: https://octo.vmware.com/bloom-filter/
As can be found in the document:
A Bloom filter is used to test whether an element is a member of a set or not.
A count-min sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data.
A counting Bloom filter is an extension of the Bloom filter that allows deletion of elements by storing the frequency of occurrence.
So, in short, a counting Bloom filter only supports deletion of elements and cannot return the frequency of elements; only the count-min sketch can return frequencies. And, to answer the OP's question, sketches are a family of probabilistic data structures that deal with data streams with efficient space and time complexity, and they have always been constructed using an array per hash function. (https://www.sciencedirect.com/science/article/abs/pii/S0196677403001913)
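To illustrate the "array per hash function" layout that distinguishes a count-min sketch, here is a minimal sketch in Java; the seeded hash mixing is a simplistic stand-in for the pairwise-independent hash functions a real implementation would use:

import java.util.Random;

// Minimal count-min sketch: one counter row per hash function,
// in contrast to a counting Bloom filter, which shares a single array.
class CountMinSketch {
    private final long[][] rows;   // rows[i] is the counter array for hash function i
    private final int[] seeds;
    private final int width;

    CountMinSketch(int depth, int width) {
        this.width = width;
        this.rows = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(42);
        for (int i = 0; i < depth; i++) seeds[i] = rnd.nextInt();
    }

    private int index(Object item, int row) {
        // simplistic seeded mixing of the item's hashCode, for illustration only
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return (h & 0x7fffffff) % width;
    }

    void add(Object item) {
        for (int i = 0; i < rows.length; i++) {
            rows[i][index(item, i)]++;
        }
    }

    long estimateCount(Object item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < rows.length; i++) {
            min = Math.min(min, rows[i][index(item, i)]);
        }
        return min;   // never underestimates; may overestimate due to collisions
    }
}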

Hyphenation algorithm using Bloom filter

A classical example of where Bloom filters shine is in hyphenation algorithms. It's even the example given in the original paper on Bloom filters.
I don't understand how a Bloom filter would be used in a hyphenation algorithm.
A hyphenation algorithm is defined as something that takes an input word and gives back the possible ways that that word can be hyphenated.
Would the Bloom filter contain both hyph-enation and hyphena-tion, and client code would query the filter for h-yphenation, hy-phenation, hyp-henation, ...?
Here's what the original paper says:
Analysis of Hyphenation Sample Application
[...] Let us assume that there are about 500,000 words to be hyphenated by the program and that 450,000 of these words can be hyphenated by application of a few simple rules. The other 50,000 words require reference to a dictionary. It is reasonable to estimate that at least 19 bits would, on the average, be required to represent each of these 50,000 words using a conventional hash-coding method. If we assume that a time factor of T = 4 is acceptable, we find from eq. (9) that the hash area would be 2,000,000 bits in size. This might very well be too large for a practical core contained hash area. By using method 2 with an allowable error frequency of, say, P = 1/16, and using the smallest possible hash area by having T = 2, we see from eq. (22) that the problem can be solved with a hash area of less than 300,000 bits, a size which would very likely be suitable for a core hash area. With a choice for P of 1/16, an access would be required to the disk resident dictionary for approximately 50,000 + 450,000/16 ~ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16 percent of the cases. This constitutes a reduction of 84 percent in the number of disk accesses from those required in a typical conventional approach using a completely disk resident hash area and dictionary.
For this case,
the dictionary is stored on disk and contains all words with the correct hyphenation,
the Bloom filter contains just keys that require special hyphenation, e.g. maybe hyphenation itself,
the Bloom filter responds with "probably" or "no".
Then the algorithm to find the possible hyphenations of a word is:
word = "hyphenation"; (or some other word)
x = bloomFilter.probablyContains(word);
if (x == "probably") {
lookupInDictionary(word).getHypenation();
} else {
// x == "no" case
useSimpleRuleBasedHypenation(word);
}
If the Bloom filter responds with "probably", then the algorithm would have to do a disk read in the dictionary.
The Bloom filter would respond with "probably" sometimes if there are in fact no special rules, in which case a disk I/O is done unnecessarily. But that's OK as long as that doesn't happen too often (false positive rate is low, e.g. 1/16).
The Bloom filter, as it doesn't have false negatives, would never respond with "no" for cases that do have special hyphenation.

Bloom filter - using the number of elements as the number of hash functions

My question is: if we use k (the number of hash functions) equal to n (the number of elements to be inserted), I saw that the probability of getting a false positive is extremely low. I realize this is really slow... is this the worst case? And will this ensure that a false positive is never made?
No, you can never ensure that a Bloom filter has no false positives by changing the number of hash functions. The optimal number of hash functions is linear in the ratio of the space to the cardinality, so your choice will be quite far from optimal unless you use so much space that you might as well store the set directly, avoiding the Bloom filter altogether.
This last part is not true if your objects are truly enormous or you have a tiny number of them.
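For reference, here are the usual estimates behind that statement, sketched in Java (m is the number of bits, n the number of elements, k the number of hash functions); the class name is just for illustration:

// Standard Bloom filter estimates:
//   optimal k  ≈ (m / n) * ln 2
//   false positive rate p ≈ (1 - e^(-k * n / m))^k
class BloomMath {
    static double optimalK(long mBits, long nElements) {
        return (double) mBits / nElements * Math.log(2);
    }

    static double falsePositiveRate(long mBits, long nElements, int k) {
        return Math.pow(1 - Math.exp(-(double) k * nElements / mBits), k);
    }
}

For example, with m = 10 * n bits the optimum is about 7 hash functions and the false positive rate is roughly 0.8%; choosing k = n instead is far from this optimum unless m is enormous relative to n.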

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream) {
    case(2)      : func1();
    case(3)      : func2();
    case(8)      : func3();
    case(33-122) : func4();
    ...
    case(10,000) : func40();
}
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
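For what it's worth, one way to sketch the coarse quick-reject idea is a bitmap over high-order buckets of the 32-bit space; the bucket granularity and class names below are illustrative assumptions, not part of the actual design:

import java.util.BitSet;

// Illustrative coarse prefilter: one bit per 2^16-value bucket of the 32-bit
// space. A zero bit means "no interesting value or range touches this bucket",
// so most stream values can be rejected with one shift and one bit test.
class CoarsePrefilter {
    private final BitSet buckets = new BitSet(1 << 16);

    void addValue(int value) {
        buckets.set(value >>> 16);
    }

    void addRange(int lo, int hi) {              // inclusive range lo..hi
        buckets.set(lo >>> 16, (hi >>> 16) + 1);
    }

    boolean possiblyInteresting(int value) {
        return buckets.get(value >>> 16);        // false => definitely uninteresting
    }
}

Values that survive the prefilter would then go on to the exact per-object lookup.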
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range of size a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position, so that values that are close in magnitude are unlikely to map to nearby hash values.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
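For comparison, a binary search over sorted, non-overlapping range bounds can look like the following sketch; the names and table layout are assumptions for illustration:

import java.util.Arrays;

// Sorted, non-overlapping ranges: starts[i]..ends[i] maps to handler index i.
// A single value is just a range with starts[i] == ends[i].
class RangeDispatcher {
    private final int[] starts;       // sorted ascending
    private final int[] ends;         // ends[i] >= starts[i]
    private final Runnable[] handlers;

    RangeDispatcher(int[] starts, int[] ends, Runnable[] handlers) {
        this.starts = starts;
        this.ends = ends;
        this.handlers = handlers;
    }

    // Returns the handler index, or -1 if the value is uninteresting.
    int lookup(int value) {
        int i = Arrays.binarySearch(starts, value);
        if (i < 0) i = -i - 2;                    // index of the last start <= value
        if (i >= 0 && value <= ends[i]) return i;
        return -1;
    }

    void dispatch(int value) {
        int i = lookup(value);
        if (i >= 0) handlers[i].run();
    }
}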
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.