Bloom Filter vs. Hashset

Bloom filters are often less space efficient than hash sets if the key cardinality is small.
Say we have 20-bit keys and a set size of 1024 (10 bits). A hash set with 1024 entries needs to store a 10-bit tag in each entry. As a result, the size of the hash set is ~10Kbit, with zero chance of false positives.
A Bloom filter with an FP rate of 10^-7 has a size of ~33Kbit (3x larger): https://hur.st/bloomfilter/?n=1000&p=1.0E-7&m=&k=20
I can accept some false positives. Is there a probabilistic data structure that is more space efficient than the two techniques above?
One could try hashing the key

For starters, I think you’ve underestimated the size of your hash set. A hash set holding 20-bit keys needs to store 20 bits per item (each element of the hash set is one of the items being stored), and with 1024 items you’d need a minimum of 20Kb for the hash table just to store the items. That’s ignoring the internal hash table overhead; most hash tables overallocate space in some way. If you’re using something like a linear probing table, your load factor will typically be around 2/3 and so there’s an extra 50% space premium on top of the baseline 20Kb. So something like 20Kb - 30Kb is a good estimate for your space usage here.
Now let's talk probabilistic data structures. The memory usage for a Bloom filter with false positive rate ε and n items is roughly 1.44n lg(1/ε) bits, where lg is the binary logarithm (log2 x).
The main reason your Bloom filter usage is so high here is that your false positive rate of 10^-7 is set very, very low, which means that each item is going to need a lot of bits to store.
You've said that you're okay with some false positives. If that's the case, I would first recommend simply increasing your false positive rate. Increasing ε to 10^-2, for example, would cut your space usage to 2/7 of its current size (since lg 10^2 / lg 10^7 = 2/7), which would get the space usage down to roughly 10Kb.
Another option would be to use a more modern data structure in place of a Bloom filter. An XOR filter uses 1.23n lg(1/ε) bits, which is an improvement over a Bloom filter. More recent improvements on the XOR filter drop that leading coefficient down to around 1.13. Cuckoo filters use (roughly) 1.08n lg(1/ε) + 3n bits. All of these are better than Bloom filters for an appropriate choice of ε, but as above, making ε bigger would still lead to more dramatic savings.
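To make the comparison concrete, here is a small sketch that plugs n = 1024 into the approximate formulas quoted above. The function names are made up for illustration, and the results are estimates from those formulas, not measurements of any real filter library.

```python
import math

def bloom_bits(n, eps):
    """Approximate Bloom filter size in bits: 1.44 * n * lg(1/eps)."""
    return 1.44 * n * math.log2(1 / eps)

def xor_bits(n, eps):
    """Approximate XOR filter size in bits: 1.23 * n * lg(1/eps)."""
    return 1.23 * n * math.log2(1 / eps)

def cuckoo_bits(n, eps):
    """Approximate cuckoo filter size in bits: 1.08 * n * lg(1/eps) + 3n."""
    return 1.08 * n * math.log2(1 / eps) + 3 * n

n = 1024
for eps in (1e-7, 1e-2):
    print(f"eps={eps:g}: Bloom ~{bloom_bits(n, eps):,.0f} bits, "
          f"XOR ~{xor_bits(n, eps):,.0f} bits, "
          f"cuckoo ~{cuckoo_bits(n, eps):,.0f} bits")
```

Note that the cuckoo filter's fixed 3n term means it only pays off once ε is small enough; at ε = 10^-2 the formulas put it slightly above the Bloom filter.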

Related

What should I do if primary or secondary clustering occurs in hash table

I get what primary and secondary clustering are, but what is the proper way to get rid of them or minimise them?
You can use a higher quality hash function to distribute the keys in a less collision-prone fashion. For some scenarios, the best practical hash function achievable has a kind of pseudo-random-but-repeatable placement property. In other cases, you might know something about the keys that lets you create a less collision-prone hash function - for example, you might know that the keys tend to be incrementing numbers, possibly with a few small gaps: in that case, an identity hash function h(n) = n will tend to place values in adjacent buckets, with less chance of collision than if the placements were more random.
In some cases, using prime numbers of buckets helps distribute elements better across the buckets than using a power-of-two bucket count. Basically, bucket counts that are powers of two are effectively masking out the high-order bits of the hash value when mapping onto the buckets: any randomness in the high order bits is discarded instead of helping to create a more uniform distribution across buckets. Still, bitwise masking is faster than a mod calculation on most hardware/CPUs.
You can also reduce the load factor: the ratio of elements to buckets. Clustering effects for hash tables using closed hashing get dramatically worse as the load factor approaches 1 (i.e. every bucket being full).
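To put rough numbers on the load-factor effect, here is a sketch using Knuth's classic approximations for linear probing: about (1 + 1/(1 - a))/2 probes for a successful lookup and (1 + 1/(1 - a)^2)/2 for an unsuccessful lookup or insert, at load factor a. These assume a well-mixed hash function; the function names are made up for illustration.

```python
def probes_hit(load):
    """Expected probes for a successful lookup under linear probing."""
    return 0.5 * (1 + 1 / (1 - load))

def probes_miss(load):
    """Expected probes for an unsuccessful lookup (or an insert) under linear probing."""
    return 0.5 * (1 + 1 / (1 - load) ** 2)

for load in (0.5, 0.75, 0.9, 0.99):
    print(f"load {load:.2f}: hit ~{probes_hit(load):.1f} probes, "
          f"miss ~{probes_miss(load):.1f} probes")
```

Going from a half-full table to a 90% full one raises the estimated cost of a miss from about 2.5 probes to about 50, which is why lowering the load factor helps so much.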
You could also stop using closed hashing and use separate chaining (maintaining containers of elements colliding at each bucket) instead, which doesn't suffer from primary clustering, but the indirection can lead to more memory usage overhead and less optimal use of CPU cache, with consequently lower runtime performance - especially when the elements are small (a few bytes each).
You can also use multiple hash functions to identify successive buckets at which an element may be stored, rather than simple offsets as in linear or quadratic probing, which reduces clustering. When you have alternative buckets, you can use techniques to move elements around to reduce the worst areas of clustering - search for robin hood hashing for example.

Don't you get a random number after doing modulo on a hashed number?

I'm trying to understand hash tables, and from what I've seen the modulo operator is used to select which bucket a key will be placed in. I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation. Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
I'm sure I'm misunderstanding this. Can someone explain?
Don't you get a random number after doing modulo on a hashed number?
It depends on the hash function.
Say you have an identity hash for numbers - h(n) = n - then if the keys being hashed are generally incrementing numbers (perhaps with an occasional omission), then after hashing they'll still generally hit successive buckets (wrapping at some point from the last bucket back to the first), with low collision rates overall. Not very random, but works out well enough. If the keys are random, it still works out pretty well - see the discussion of random-but-repeatable hashing below. The problem is when the keys are neither roughly-incrementing nor close-to-random - then an identity hash can provide terrible collision rates. (You might think "this is a crazy bad example hash function, nobody would do this"; actually, most C++ Standard Library implementations' hash functions for integers are identity hashes.)
On the other hand, if you have a hash function that say takes the address of the object being hashed, and they're all 8 byte aligned, then if you take the mod and the bucket count is also a multiple of 8, you'll only ever hash to every 8th bucket, having 8 times more collisions than you might expect. Not very random, and doesn't work out well. But, if the number of buckets is a prime, then the addresses will tend to scatter much more randomly over the buckets, and things will work out much better. This is the reason the GNU C++ Standard Library tends to use prime numbers of buckets (Visual C++ uses power-of-two bucket counts so it can utilise a bitwise AND for mapping hash values to buckets, as AND takes one CPU cycle and MOD can take e.g. 30-40 cycles, depending on your exact CPU).
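The aligned-address effect is easy to reproduce. The sketch below uses made-up, evenly spaced 8-byte-aligned "addresses" as hash values (an assumption for illustration, real allocators are less regular) and counts how many buckets are ever hit with a power-of-two bucket count versus a prime one.

```python
def buckets_used(hash_values, bucket_count):
    """Number of distinct buckets hit when mapping hash values with modulo."""
    return len({h % bucket_count for h in hash_values})

# Hypothetical 8-byte-aligned addresses, used directly as hash values.
addresses = [0x10000 + 8 * i for i in range(1000)]

print("64 buckets:", buckets_used(addresses, 64), "in use")  # multiple of 8
print("61 buckets:", buckets_used(addresses, 61), "in use")  # prime
```

With 64 buckets only every 8th bucket is ever used, while with 61 buckets every bucket receives elements.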
When all the inputs are known at compile time, and there aren't too many of them, then it's generally possible to create a perfect hash function (GNU gperf software is designed specifically for this), which means it will work out a number of buckets you'll need and a hash function that avoids any collisions, but the hash function may take longer to run than a general purpose function.
People often have a fanciful notion - also seen in the question - that a "perfect hash function", or at least one that has very few collisions in some large numerical hashed-to range, will provide minimal collisions in actual usage in a hash table. This Stack Overflow question is about coming to grips with the falsehood of that notion: it's just not true if there are still patterns and probabilities in the way the keys map into that large hashed-to range.
The gold standard for a general purpose high-quality hash function for runtime inputs is to have a quality that you might call "random but repeatable", even before the modulo operation, as that quality will apply to the bucket selection as well (even using the dumber and less forgiving AND bit-masking approach to bucket selection).
As you've noticed, this does mean you'll see collisions in the table. If you can exploit patterns in the keys to get fewer collisions than this random-but-repeatable quality would give you, then by all means make the most of that. If not, the beauty of hashing is that with random-but-repeatable hashing your collisions are statistically related to your load factor (the number of stored elements divided by the number of buckets).
As an example, for separate chaining with a load factor of 1.0: 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) will have one element, 1/(2!e) (~18.4%) two elements, 1/(3!e) (~6.1%) three elements, 1/(4!e) (~1.5%) four elements, 1/(5!e) (~0.3%) five elements, and so on. The average chain length over non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
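These percentages follow from the Poisson approximation for chain lengths under random-but-repeatable hashing. A short sketch that reproduces them (function names are illustrative, not from any library):

```python
import math

def chain_length_share(k, load=1.0):
    """Poisson estimate of the fraction of buckets holding exactly k elements."""
    return math.exp(-load) * load ** k / math.factorial(k)

def mean_nonempty_chain(load=1.0):
    """Average chain length counted over non-empty buckets only."""
    return load / (1 - math.exp(-load))

for k in range(6):
    print(f"{k} elements: {chain_length_share(k):.1%} of buckets")
print(f"average non-empty chain length: {mean_nonempty_chain():.2f}")
```

Neither function takes the table size as an input; only the load factor matters, which is the point made above.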
I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation.
This is still true post-modulo. Minimising the same result means each post-modulo value has (about) the same number of keys mapping to it. We're particularly concerned about in-use keys stored in the table, if there's a non-uniform statistical distribution to the use of keys. With a hash function that exhibits the random-but-repeatable quality, there will be random variation in post-modulo mapping, but overall they'll be close enough to evenly balanced for most practical purposes.
Just to recap, let me address this directly:
Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
So:
random is good: if you get something like the random-but-repeatable hash quality, then your average hash collisions will statistically be capped at low levels, and in practice you're unlikely to ever see a particularly horrible collision chain, provided you keep the load factor reasonable (e.g. <= 1.0)
that said, your "near-perfect hash function...between 0 and 100,000" may or may not be high quality, depending on whether the distribution of values has patterns in it that would produce collisions. When in doubt about such patterns, use a hash function with the random-but-repeatable quality.
What would happen if you took a random number instead of using a hash function? Then doing the modulo on it? If you call rand() twice you can get the same number - a proper hash function doesn't do that I guess, or does it? Even hash functions can output the same value for different input.
This comment shows you grappling with the desirability of randomness - hopefully with earlier parts of my answer you're now clear on this, but anyway the point is that randomness is good, but it has to be repeatable: the same key has to produce the same pre-modulo hash so the post-modulo value tells you the bucket it should be in.
As an example of random-but-repeatable, imagine you used rand() to populate a uint32_t a[256][8] array. You could then hash any 8 byte key (including e.g. a double) by XORing together one random number per byte of the key, selected by that byte's value and position:
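#include <cstdint>
#include <cstring>
uint32_t h(double d) {  // uses the table a[256][8] described above
    uint8_t i[8];
    std::memcpy(i, &d, 8);                        // view the key as 8 raw bytes
    uint32_t x = 0;
    for (int p = 0; p < 8; ++p) x ^= a[i[p]][p];  // one table entry per byte position
    return x;
}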
This would produce a near-ideal (rand() isn't a great quality pseudo-random number generator) random-but-repeatable hash, but having a hash function that needs to consult largish chunks of memory can easily be slowed down by cache misses.
Following on from what [Mureinik] said, assuming you have a perfect hash function, say your array/buckets are 75% full, then doing modulo on the hashed function will probably result in a 75% collision probability. If that's true, I thought they were much better. Though I'm only learning about how they work now.
The 75%/75% thing is correct for a high quality hash function, assuming:
closed hashing / open addressing, where collisions are handled by finding an alternative bucket, or
separate chaining when 75% of buckets have one or more elements linked therefrom (which is very likely to mean the load factor (which many people may think of when you talk about how "full" the table is) is already significantly more than 75%)
Regarding "I thought they were much better." - that's actually quite ok, as evidenced by the percentages of colliding chain lengths mentioned earlier in my answer.
I think you have the right understanding of the situation.
Both the hash function and the number of buckets affect the chance of collisions. Consider, for example, the worst possible hash function - one that returns a constant value. No matter how many buckets you have, all the entries will be lumped to the same bucket, and you'd have a 100% chance of collision.
On the other hand, if you have a (near) perfect hash function, the number of buckets would be the main factor for the chance of collision. If your hash table has only 20 buckets, the minimal chance of collision will indeed be 1 in 20 (over time). If the hash values weren't uniformly spread, you'd have a much higher chance of collision in at least one of the buckets. The more buckets you have, the lower the chance of collision. On the other hand, having too many buckets will take up more memory (even if they are empty), and ultimately reduce performance, even if there are fewer collisions.
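Both effects can be seen in a small experiment. The sketch below compares a constant hash with a well-mixed one (truncated SHA-256 is used here purely as a convenient stand-in for a "random but repeatable" hash) and reports the longest bucket chain for two bucket counts. All names are made up for illustration.

```python
import hashlib
from collections import Counter

def constant_hash(key):
    return 42  # worst case: every key gets the same hash value

def spread_hash(key):
    # Stand-in for a well-mixed hash: first 8 bytes of SHA-256.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def longest_chain(keys, hash_fn, bucket_count):
    """Size of the fullest bucket after mapping every key with modulo."""
    counts = Counter(hash_fn(k) % bucket_count for k in keys)
    return max(counts.values())

keys = [f"key{i}" for i in range(1000)]
for buckets in (20, 1000):
    print(f"{buckets} buckets: constant hash -> {longest_chain(keys, constant_hash, buckets)}, "
          f"spread hash -> {longest_chain(keys, spread_hash, buckets)}")
```

The constant hash puts all 1000 keys in one bucket regardless of the bucket count, while the well-mixed hash benefits directly from having more buckets.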

Hyphenation algorithm using Bloom filter

A classical example of where Bloom filters shine is in hyphenation algorithms. It's even the example given in the original paper on Bloom filters.
I don't understand how a Bloom filter would be used in a hyphenation algorithm.
A hyphenation algorithm is defined as something that takes an input word and gives back the possible ways that that word can be hyphenated.
Would the Bloom filter contain both hyph-enation and hyphena-tion, and client code would query the filter for h-yphenation, hy-phenation, hyp-henation, ...?
Here's what the original paper says:
Analysis of Hyphenation Sample Application
[...] Let us assume that there are about 500,000 words to be hyphenated by the program and that 450,000 of these words can be hyphenated by application of a few simple rules. The other 50,000 words require reference to a dictionary. It is reasonable to estimate that at least 19 bits would, on the average, be required to represent each of these 50,000 words using a conventional hash-coding method. If we assume that a time factor of T = 4 is acceptable, we find from eq. (9) that the hash area would be 2,000,000 bits in size. This might very well be too large for a practical core contained hash area. By using method 2 with an allowable error frequency of, say, P = 1/16, and using the smallest possible hash area by having T = 2, we see from eq. (22) that the problem can be solved with a hash area of less than 300,000 bits, a size which would very likely be suitable for a core hash area. With a choice for P of 1/16, an access would be required to the disk resident dictionary for approximately 50,000 + 450,000/16 ~ 78,000 of the 500,000 words to be hyphenated, i.e. for approximately 16 percent of the cases. This constitutes a reduction of 84 percent in the number of disk accesses from those required in a typical conventional approach using a completely disk resident hash area and dictionary.
For this case,
the dictionary is stored on disk and contains all words with the correct hyphenation,
the Bloom filter contains just keys that require special hyphenation, e.g. maybe hyphenation itself,
the Bloom filter responds with "probably" or "no".
Then the algorithm to find the possible hyphenations of a word is:
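word = "hyphenation"; // or some other word
x = bloomFilter.probablyContains(word);
if (x == "probably") {
    // the word may need special hyphenation: read the on-disk dictionary
    hyphenation = lookupInDictionary(word).getHyphenation();
} else {
    // x == "no" case: the word definitely has no special hyphenation
    hyphenation = useSimpleRuleBasedHyphenation(word);
}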
If the Bloom filter responds with "probably", then the algorithm would have to do a disk read in the dictionary.
The Bloom filter would respond with "probably" sometimes if there are in fact no special rules, in which case a disk I/O is done unnecessarily. But that's OK as long as that doesn't happen too often (false positive rate is low, e.g. 1/16).
The Bloom filter, as it doesn't have false negatives, would never respond with "no" for words that do have special hyphenation.
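The flow above can be sketched end to end in a few lines. This is a toy under stated assumptions, not the paper's implementation: the Bloom filter derives its bit positions from SHA-256 by double hashing, the "disk dictionary" is an in-memory dict holding only two made-up exception entries, and rule_based is a placeholder that returns the word unchanged.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(word.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def probably_contains(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

# Hypothetical stand-in for the disk-resident dictionary of exceptions.
DISK_DICTIONARY = {"hyphenation": "hy-phen-a-tion", "project": "proj-ect"}

def rule_based(word):
    return word  # placeholder for the simple hyphenation rules

def hyphenate(word, bloom):
    if bloom.probably_contains(word):       # "probably": pay for the disk read
        exact = DISK_DICTIONARY.get(word)
        if exact is not None:
            return exact
        # Otherwise it was a false positive: the read was wasted, use the rules.
    return rule_based(word)                 # "no": skip the disk entirely

bloom = BloomFilter(num_bits=256, num_hashes=4)
for special_word in DISK_DICTIONARY:
    bloom.add(special_word)

print(hyphenate("hyphenation", bloom))
print(hyphenate("cat", bloom))
```

A false positive only costs one unnecessary lookup; the result is still correct because the dictionary has the final say.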

Bloom filter - using the number of elements as the number of hash functions

My question is: if we use k (the number of hash functions) equal to n (the number of elements to be inserted), I saw that the probability of getting a false positive is extremely low. I realize this is really slow. Is this the worst case? And will this ensure that a false positive is never made?
No, you can never ensure that a Bloom filter has no false positives by changing the number of hash functions. The optimal number of hash functions is linear in the ratio of the space to the cardinality, so your choice will be quite far from optimal unless you use so much space that you might as well store the set directly, avoiding the Bloom filter altogether.
This last part is not true if your objects are truly enormous or you have a tiny number of them.
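A quick way to see this is to evaluate the standard approximation for the false positive rate, (1 - e^(-kn/m))^k for m bits, n items and k hash functions. The sketch below (helper names are made up) compares k = n with the near-optimal k = (m/n) ln 2 at several filter sizes.

```python
import math

def false_positive_rate(m, n, k):
    """Standard approximation (1 - e^(-k*n/m))^k for m bits, n items, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Near-optimal number of hash functions: (m/n) * ln 2, rounded, at least 1."""
    return max(1, round(m / n * math.log(2)))

n = 1000
for m in (10 * n, n * n):
    k_opt = optimal_k(m, n)
    print(f"m={m:,} bits: k={k_opt} gives {false_positive_rate(m, n, k_opt):.3g}, "
          f"k=n={n} gives {false_positive_rate(m, n, n):.3g}")
```

With 10 bits per item, k = n saturates the filter and nearly every query is a false positive. The rate with k = n only becomes tiny once m is on the order of n^2 bits (1000 bits per item here), and even then it is small but not zero.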

Choosing a minimum hash size for a given allowable number of collisions

I am parsing a large amount of network trace data. I want to split the trace into chunks, hash each chunk, and store a sequence of the resulting hashes rather than the original chunks. The purpose of my work is to identify identical chunks of data - I'm hashing the original chunks to reduce the data set size for later analysis. It is acceptable in my work that we trade off the possibility that collisions occasionally occur in order to reduce the hash size (e.g. 40 bit hash with 1% misidentification of identical chunks might beat 60 bit hash with 0.001% misidentification).
My question is, given a) number of chunks to be hashed and b) allowable percentage of misidentification, how can one go about choosing an appropriate hash size?
As an example:
1,000,000 chunks to be hashed, and we're prepared to have 1% misidentification (1% of hashed chunks appear identical when they are not identical in the original data). How do we choose a hash with the minimal number of bits that satisfies this?
I have looked at materials regarding the Birthday Paradox, though this is concerned specifically with the probability of a single collision. I have also looked at materials which discuss choosing a size based on an acceptable probability of a single collision, but have not been able to extrapolate from this how to choose a size based on an acceptable probability of n (or fewer) collisions.
Obviously, the quality of your hash function matters, but some easy probability theory will probably help you here.
The question is what exactly are you willing to accept: is it good enough that you have an expected number of collisions at only 1% of the data? Or do you demand that the probability of the number of collisions going over some bound be something? If it's the first, then a back-of-the-envelope calculation will do:
The expected number of pairs that hash to the same thing out of your set is (1,000,000 C 2)*P(any two are a pair). Let's assume that second number is 1/d, where d is the size of the hashtable. (Note: expectations are linear, so I'm not cheating very much so far.) Now, you say you want 1% collisions, so that is 10,000 total. Well, you have (1,000,000 C 2)/d = 10,000, so d = (1,000,000 C 2)/10,000, which is, according to Google, about 50,000,000.
So, you need 50 million-ish possible hash values. That is less than 2^26, so you will get your desired performance with somewhere around 26 bits of hash (depending on the quality of the hashing algorithm). I probably have a factor of 2 mistake in there somewhere, so you know, it's rough.
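The same back-of-the-envelope estimate as a small sketch, assuming a uniform hash so that any given pair collides with probability 1/d. The function name is made up for illustration, and the caveat above about a possible factor of 2 applies here too.

```python
import math

def hash_bits_for_expected_collisions(num_chunks, allowed_colliding_pairs):
    """Solve C(num_chunks, 2) / d = allowed_colliding_pairs for d, then take lg."""
    pairs = math.comb(num_chunks, 2)
    d = pairs / allowed_colliding_pairs
    return d, math.ceil(math.log2(d))

d, bits = hash_bits_for_expected_collisions(1_000_000, 10_000)
print(f"need about {d:,.0f} hash values, i.e. {bits} bits")
```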
If this is an offline task, you can't be that space constrained.
Sounds like a fun exercise!
Someone else might have a better answer, but I'd go the brute force route, provided that there's ample time:
Run the hashing calculation using incremental hash size and record the collision percentage for each hash size.
You might want to use binary search to reduce the search space.
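A minimal sketch of that brute-force search, under a few assumptions: SHA-256 truncated to its top bits stands in for whatever hash you actually use, the sample chunks are synthetic, and "misidentification" is measured as the fraction of distinct chunks that share a truncated hash with some other chunk. Because every size is a prefix of the same digest, the collision rate can only fall as bits are added, which is what makes the binary search valid.

```python
import hashlib

def truncated_hash(chunk, bits):
    """Top `bits` bits of SHA-256 of the chunk."""
    full = int.from_bytes(hashlib.sha256(chunk).digest(), "big")
    return full >> (256 - bits)

def misidentified_fraction(chunks, bits):
    """Fraction of distinct chunks whose truncated hash matches a different chunk."""
    distinct = set(chunks)
    groups = {}
    for chunk in distinct:
        groups.setdefault(truncated_hash(chunk, bits), []).append(chunk)
    collided = sum(len(g) for g in groups.values() if len(g) > 1)
    return collided / len(distinct)

def smallest_hash_size(chunks, allowed_fraction, max_bits=64):
    """Binary search for the fewest bits that keep misidentification within bounds."""
    lo, hi = 1, max_bits
    while lo < hi:
        mid = (lo + hi) // 2
        if misidentified_fraction(chunks, mid) <= allowed_fraction:
            hi = mid          # good enough, try fewer bits
        else:
            lo = mid + 1      # too many collisions, need more bits
    return lo

sample = [f"chunk-{i}".encode() for i in range(2000)]  # synthetic stand-in data
bits = smallest_hash_size(sample, allowed_fraction=0.01)
print(f"smallest hash size for this sample: {bits} bits")
```

The answer depends on the sample, so run it on a representative slice of the real trace.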