Making a hash table - hash

So it's time for me to index my database file format, and after looking at various methods I decided that a hash table would be my best option. Since I've only familiarized myself with the inner workings of a hash table today, here's my understanding of it, so please correct me if I'm wrong:
A hash table has a constant size that is equivalent of the maximum value storable in its hash function output size * key value pair size * bucket size + overflow bucket size. So for example, if the hash function makes 16 bit hashes and the bucket size is 4 and the values are 32bit then it would be 2^16 * 4 * 6 = 1572864 or 1.5MB plus overflow.
That in essence would make the hash table a sort of compressed lookup table. If the hash function changes, the whole table has to be reevaluated. Otherwise it just adds stuff to empty slots. Also the hash table can contain the maximum of units that its hash size could address (so for a 16bit hash its 65536) but to perform well without many collisions it would have to be much less.
OK, and here's what I'm trying to index: (up to) 100 million pairs with 64-bit integer keys and a 96-bit value. The keys are object IDs (that mostly come in short sequences but can be all over the place) and the values are the object location + length. Reads/writes are equally important and very frequent.
The other options I looked into were various trees, but the reason I didn't like them is that it seems to me I would have to do a lot of sparse reads/writes to look up the data or to restructure the tree each time I go in.
So here are my questions:
It seems to me that I need a hash with a weird number of bits in it, I'm thinking up to ~38 since it would be just about the maximum I can store on a single disk and should be comfy enough for the 100 million. Is the weird bit amount unheard of? I'm thinking I'll probably bottleneck on disk activity way before CPU.
Are there any articles out there on how to design a good hash function for my particular case? Googling gave me an overview of the common methods but I'm looking for explanations behind them.
Any other general tips/pitfalls I should know of?

A hash table has a constant size
...not necessarily - a hash table can support resizing, but that tends to be done in fairly dramatic and invasive chunks where you can reason about the hash table as if it were constant size both before and after.
...that is equivalent of the maximum value storable in its hash function output size * key value pair size * bucket size + overflow bucket size. So for example, if the hash function makes 16 bit hashes and the bucket size is 4 and the values are 32bit then it would be 2^16 * 4 * 6 = 1572864 or 1.5MB plus overflow.
Not at all. A better way to calculate the size is to say there are N values of a certain size, and that you want to maintain a capacity:size ratio somewhere between, say, 3:1 and 5:4; the table's memory usage is then N * sizeof(Value) * ratio.
The number of bits in the hash value is only relevant in that it indicates the maximum number of distinct buckets you can hash to: if you try to have a bigger table then you'll get more collisions than you would with a hash function generating wider-bit hash values. Having more bits from your hash function than you need is not a problem: you can, for example, take the modulus with the current table size to find your bucket: hashed_to_bucket = hash_value % num_buckets.
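As a rough sketch of the sizing rule and the modulus step described here: the names table_bytes and bucket_for are made up for illustration, and the 20-byte entry used in the note below is an assumption (the question's 64-bit key plus 96-bit value, with no per-entry overhead).

```cpp
#include <cassert>
#include <cstdint>

// Approximate table size: N entries * bytes per entry * capacity:size ratio.
// The ratio is passed as ratio_num / ratio_den, e.g. 5/4 or 3/1.
constexpr std::uint64_t table_bytes(std::uint64_t n_entries,
                                    std::uint64_t entry_bytes,
                                    std::uint64_t ratio_num,
                                    std::uint64_t ratio_den)
{
    return n_entries * entry_bytes * ratio_num / ratio_den;
}

// Map a hash value of any width onto one of num_buckets buckets.
inline std::uint64_t bucket_for(std::uint64_t hash_value,
                                std::uint64_t num_buckets)
{
    assert(num_buckets > 0);
    return hash_value % num_buckets;
}
```

Under those assumptions, table_bytes(100000000, 20, 5, 4) comes to 2.5 GB and table_bytes(100000000, 20, 3, 1) to 6 GB, regardless of how many bits the hash function produces.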
That in essence would make the hash table a sort of compressed lookup table.
That's a good way to look at a hash table.
If the hash function changes, the whole table has to be reevaluated. Otherwise it just adds stuff to empty slots.
Definitely reevaluated/regenerated. Otherwise adding to empty slots is but one of the undesirable consequences.
Also the hash table can contain the maximum of units that its hash size could address (so for a 16bit hash its 65536) but to perform well without many collisions it would have to be much less.
As above, that (e.g. 65536) is not a hard maximum, but "to perform well without collisions" going over that should be avoided. To perform well it does not have to be much less: anything right up to 65536 is perfectly fine if it's a good quality 16-bit hash function.
OK, and here's what I'm trying to index: (up to) 100 million pairs with 64-bit integer keys and a 96-bit value. The keys are object IDs (that mostly come in short sequences but can be all over the place) and the values are the object location + length. Reads/writes are equally important and very frequent.
The other options I looked into were various trees, but the reason I didn't like them is that it seems to me I would have to do a lot of sparse reads/writes to look up the data or to restructure the tree each time I go in.
Could be... a lot depends on your access patterns. For example, if you happen to try to access the keys following the "short sequences" then a data organisation model that tends to put them nearby in memory/disk helps. Some types of tree structures do that nicely, and you can sometimes hack your hash function to do it too (but need to balance that up against collision proneness).
It seems to me that I need a hash with a weird number of bits in it, I'm thinking up to ~38 since it would be just about the maximum I can store on a single disk and should be comfy enough for the 100 million. Is the weird bit amount unheard of? I'm thinking I'll probably bottleneck on disk activity way before CPU.
Not so... you have 64 bit integer keys - a 64 bit or larger hash would be desirable. That said, a 32 bit hash may well be fine too - that generates 4 billion distinct values which is greater than your 100 million keys.
Are there any articles out there on how to design a good hash function for my particular case? Googling gave me an overview of the common methods but I'm looking for explanations behind them.
Not that I'm aware of.
Any other general tips/pitfalls I should know of?
For tips... I'd say start simple (e.g. with the hash function returning the key unchanged and using modulus with a hash table capacity that's a prime number, OR using any common hash if you're picking up a hash table implementation that uses e.g. power-of-2 numbers of buckets) and measure your collision rates: that tells you how much effort it's worth putting into improving your hashing.
One very simple way to get "ideal, randomised" hashing in your case is to have 8 tables of 256 32-bit integers - initialised with hardcoded random numbers (you can google for random number download websites). Given any 64-bit key, just slice it into 8 bytes then use each byte as a key in the successive tables, XORing the 32-bit values you look up. A single bit of difference in any of the 64 input bits will then impact all 32 bits in the hash value with equal probability.
#include <stdint.h>

uint32_t table[8][256] = { ...add some random numbers... };

uint32_t h(uint64_t n)
{
    uint32_t result = 0;
    unsigned char* p = (unsigned char*)&n;   /* walk the key one byte at a time */
    for (int i = 0; i < 8; ++i)
        result ^= table[i][*p++];            /* one table per byte position */
    return result;
}
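To follow the "start simple and measure" advice, a minimal C++ sketch for counting collisions with the identity hash might look like the following. The function name count_collisions and the definition of a collision used here (a key landing in a bucket that already holds a key) are my own assumptions, not part of any particular hash table implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count how many keys land in a bucket that already holds at least one key,
// using the identity hash (key unchanged) followed by modulus.
inline std::size_t count_collisions(const std::vector<std::uint64_t>& keys,
                                    std::size_t num_buckets)
{
    assert(num_buckets > 0);
    std::vector<std::uint32_t> occupancy(num_buckets, 0);
    std::size_t collisions = 0;
    for (std::uint64_t key : keys) {
        std::uint32_t& count = occupancy[key % num_buckets];
        if (count > 0) ++collisions;   // bucket already in use
        ++count;
    }
    return collisions;
}
```

Feeding it a sample of real keys and the planned bucket count gives a number to compare against after swapping in a different hash; for 100 sequential keys and 127 buckets it reports no collisions.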


Don't you get a random number after doing modulo on a hashed number?

I'm trying to understand hash tables, and from what I've seen the modulo operator is used to select which bucket a key will be placed in. I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation. Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
I'm sure I'm misunderstanding this. Can someone explain?
Don't you get a random number after doing modulo on a hashed number?
It depends on the hash function.
Say you have an identity hash for numbers - h(n) = n - then if the keys being hashed are generally incrementing numbers (perhaps with an occasional omission), then after hashing they'll still generally hit successive buckets (wrapping at some point from the last bucket back to the first), with low collision rates overall. Not very random, but works out well enough. If the keys are random, it still works out pretty well - see the discussion of random-but-repeatable hashing below. The problem is when the keys are neither roughly-incrementing nor close-to-random - then an identity hash can provide terrible collision rates. (You might think "this is a crazy bad example hash function, nobody would do this"; actually, most C++ Standard Library implementations' hash functions for integers are identity hashes.)
On the other hand, if you have a hash function that, say, takes the address of the object being hashed, and the objects are all 8-byte aligned, then if you take the mod and the bucket count is also a multiple of 8, you'll only ever hash to every 8th bucket, having 8 times more collisions than you might expect. Not very random, and doesn't work out well. But, if the number of buckets is a prime, then the addresses will tend to scatter much more randomly over the buckets, and things will work out much better. This is the reason the GNU C++ Standard Library tends to use prime numbers of buckets (Visual C++ uses power-of-two bucket counts so it can utilise a bitwise AND for mapping hash values to buckets, as AND takes one CPU cycle and MOD can take e.g. 30-40 cycles, depending on your exact CPU).
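The alignment effect is easy to reproduce. In this sketch the "addresses" are simulated as multiples of the alignment rather than taken from real objects, and the names bucket_by_mask and buckets_reached are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For power-of-two bucket counts, masking gives the same result as modulus.
inline std::uint64_t bucket_by_mask(std::uint64_t hash_value,
                                    std::uint64_t num_buckets)
{
    assert(num_buckets != 0 && (num_buckets & (num_buckets - 1)) == 0);
    return hash_value & (num_buckets - 1);
}

// Count how many distinct buckets are reached by `count` simulated addresses
// spaced `alignment` bytes apart, using the address itself as the hash.
inline std::size_t buckets_reached(std::uint64_t alignment, std::size_t count,
                                   std::size_t num_buckets)
{
    assert(num_buckets > 0);
    std::vector<bool> used(num_buckets, false);
    std::size_t reached = 0;
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t b = static_cast<std::size_t>((alignment * i) % num_buckets);
        if (!used[b]) { used[b] = true; ++reached; }
    }
    return reached;
}
```

With 1000 simulated 8-byte-aligned addresses, buckets_reached(8, 1000, 64) touches only 8 of the 64 buckets, while buckets_reached(8, 1000, 61) touches all 61.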
When all the inputs are known at compile time, and there's not too many of them, then it's generally possible to create a perfect hash function (GNU gperf software is designed specifically for this), which means it will work out a number of buckets you'll need and a hash function that avoids any collisions, but the hash function may take longer to run than a general purpose function.
People often have a fanciful notion - also seen in the question - that a "perfect hash function", or at least one that has very few collisions in some large numerical hashed-to range, will provide minimal collisions in actual usage in a hash table; indeed, this Stack Overflow question is about coming to grips with the falsehood of that notion. It's just not true if there are still patterns and probabilities in the way the keys map into that large hashed-to range.
The gold standard for a general purpose high-quality hash function for runtime inputs is to have a quality that you might call "random but repeatable", even before the modulo operation, as that quality will apply to the bucket selection as well (even using the dumber and less forgiving AND bit-masking approach to bucket selection).
As you've noticed, this does mean you'll see collisions in the table. If you can exploit patterns in the keys to get fewer collisions than this random-but-repeatable quality would give you, then by all means make the most of that. If not, the beauty of hashing is that with random-but-repeatable hashing your collisions are statistically related to your load factor (the number of stored elements divided by the number of buckets).
As an example, for separate chaining - when your load factor is 1.0, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) or ~6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) or ~0.3% five elements, etc. - the average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
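Those percentages follow from a Poisson distribution with mean equal to the load factor, which is a good approximation for large tables under random-but-repeatable hashing. A small sketch, with function names of my own choosing, to recompute them for other load factors:

```cpp
#include <cassert>
#include <cmath>

// Expected fraction of buckets holding exactly k elements, assuming
// random-but-repeatable hashing (Poisson with mean = load factor).
inline double bucket_fraction(unsigned k, double load_factor)
{
    assert(load_factor >= 0.0);
    double factorial = 1.0;
    for (unsigned i = 2; i <= k; ++i) factorial *= i;
    return std::exp(-load_factor) * std::pow(load_factor, k) / factorial;
}

// Average chain length over non-empty buckets:
// load_factor / (fraction of buckets that are non-empty).
inline double mean_nonempty_chain(double load_factor)
{
    assert(load_factor > 0.0);
    return load_factor / (1.0 - std::exp(-load_factor));
}
```

At a load factor of 1.0, bucket_fraction(0, 1.0) is about 0.368 and mean_nonempty_chain(1.0) is about 1.58, matching the figures above.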
I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation.
This is still true post-modulo. Minimising the same result means each post-modulo value has (about) the same number of keys mapping to it. We're particularly concerned about in-use keys stored in the table, if there's a non-uniform statistical distribution to the use of keys. With a hash function that exhibits the random-but-repeatable quality, there will be random variation in post-modulo mapping, but overall they'll be close enough to evenly balanced for most practical purposes.
Just to recap, let me address this directly:
Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
So:
random is good: if you get something like the random-but-repeatable hash quality, then your average hash collisions will statistically be capped at low levels, and in practice you're unlikely to ever see a particularly horrible collision chain, provided you keep the load factor reasonable (e.g. <= 1.0)
that said, your "near-perfect hash function...between 0 and 100,000" may or may not be high quality, depending on whether the distribution of values has patterns in it that would produce collisions. When in doubt about such patterns, use a hash function with the random-but-repeatable quality.
What would happen if you took a random number instead of using a hash function? Then doing the modulo on it? If you call rand() twice you can get the same number - a proper hash function doesn't do that I guess, or does it? Even hash functions can output the same value for different input.
This comment shows you grappling with the desirability of randomness - hopefully with earlier parts of my answer you're now clear on this, but anyway the point is that randomness is good, but it has to be repeatable: the same key has to produce the same pre-modulo hash so the post-modulo value tells you the bucket it should be in.
As an example of random-but-repeatable, imagine you used rand() to populate a uint32_t a[8][256] array; you could then hash any 8-byte key (including e.g. a double) by XORing the random numbers, one table per byte position:
uint32_t h(double d) {
    uint8_t i[8];
    memcpy(i, &d, 8);           // view the key as 8 raw bytes
    uint32_t result = 0;
    for (int k = 0; k < 8; ++k)
        result ^= a[k][i[k]];   // table k handles byte k
    return result;
}
This would produce a near-ideal (rand() isn't a great quality pseudo-random number generator) random-but-repeatable hash, but having a hash function that needs to consult largish chunks of memory can easily be slowed down by cache misses.
Following on from what [Mureinik] said, assuming you have a perfect hash function, say your array/buckets are 75% full, then doing modulo on the hashed function will probably result in a 75% collision probability. If that's true, I thought they were much better. Though I'm only learning about how they work now.
The 75%/75% thing is correct for a high quality hash function, assuming:
closed hashing / open addressing, where collisions are handled by finding an alternative bucket, or
separate chaining when 75% of buckets have one or more elements linked therefrom (which is very likely to mean the load factor (which many people may think of when you talk about how "full" the table is) is already significantly more than 75%)
Regarding "I thought they were much better." - that's actually quite ok, as evidenced by the percentages of colliding chain lengths mentioned earlier in my answer.
I think you have the right understanding of the situation.
Both the hash function and the number of buckets affect the chance of collisions. Consider, for example, the worst possible hash function - one that returns a constant value. No matter how many buckets you have, all the entries will be lumped to the same bucket, and you'd have a 100% chance of collision.
On the other hand, if you have a (near) perfect hash function, the number of buckets would be the main factor for the chance of collision. If your hash table has only 20 buckets, the minimal chance of collision will indeed be 1 in 20 (over time). If the hash values weren't uniformly spread, you'd have a much higher chance of collision in at least one of the buckets. The more buckets you have, the less chance of collision. On the other hand, having too many buckets will take up more memory (even if they are empty), and ultimately reduce performance, even if there are fewer collisions.

Fast long calculation

I have the following task.
I have 1 billion or more distinct 20-byte hashes (stored in some database), whose total number is less than Java's Long.MAX_VALUE.
After that I have an almost infinite stream of such hashes.
Is it possible to create a bijective mapping from the set of these distinct 20-byte hashes to the set of numbers between 0 and Long.MAX_VALUE?
Something like a Lagrange polynomial calculation, but maybe there is something really fast and effective for such a case.
We need a fast long value calculation for each hash from this almost infinite stream.
Each 20-byte hash is just a number.
Before processing the stream we can create the mapping
20-byte | 8-byte
(hash1 1)
....
(hashN N)
After that, when we get the next hash from the infinite stream, we will obtain the 8-byte value without lookups, using only arithmetic calculations.
Since you gave no practical constraints on size or storage beyond "It has to be fast", I am going to assume you can take your time to pre-process the set of hashes in order to "make it fast". I am further assuming the hashes are distributed randomly and that the mapping to 8-byte numbers is likewise unpredictable.
My first approach would be a local SQLite database. That allows you to use its native B-tree indexing to quickly retrieve results. With a large enough page size you can store 256 pointers per B-tree node, for an expected log_256(10^9) ≈ 3.74 disk seeks per lookup. This will improve as more of your B-tree structures get cached.
Second approach, if you have the memory for it: in-memory BTree.
Would it work something like this?
aNextHash = Stream.getHash();
long aValue = aNextHash % Long.MAX_VALUE;

Is it safe to cut the hash?

I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns the hash as a series of hexadecimal digits (like an md5 hash).
As far as I understand the idea, this means that I need the hash to be exactly 8 symbols in length, because such a hash would be capable of hashing 4+ billion (16 * 16 * 16 * 16 * 16 * 16 * 16 * 16) distinct strings.
So I'd like to know whether it is safe to cut the hash to a certain length to save space?
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. - I know I can test whether an 8-character hash would be OK to store 2 billion strings. But I need to compare 2 billion hashes with their 2 billion truncated versions. It doesn't seem trivial to me, so I'd better ask before I do that.
The hash is a number, not a string of hexadecimal digits (characters). In the case of MD5, it is 128 bits, or 16 bytes when saved in efficient form. If your problem still applies, you can certainly consider truncating the number (either by coercing it into a smaller word or by bit-shifting first). Good hash algorithms distribute evenly to all bits.
Addendum:
Generally, whenever you deal with hashes, you want to check if the strings really match. This takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get. But it's good to plan for that happening at this phase.
Whether or not it's safe to store x values in a hash domain only capable of representing 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly about the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%. And so on.) I can't find the formula for calculating the probable number of collisions based on these numbers, but I suspect it is a huge number.
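For a rough estimate of the number of collisions, one standard result (assuming the hash behaves like a uniform random function) is that n items hashed into d possible values occupy about d * (1 - (1 - 1/d)^n) distinct values, so the remaining items must repeat a value already taken. A sketch with hypothetical function names, using the large-d approximation (1 - 1/d)^n ≈ exp(-n/d):

```cpp
#include <cassert>
#include <cmath>

// Expected number of distinct hash values seen after hashing n items
// uniformly at random into d possible values: d * (1 - (1 - 1/d)^n),
// approximated here as d * (1 - exp(-n/d)) for large d.
inline double expected_distinct(double n, double d)
{
    assert(d > 0.0 && n >= 0.0);
    return d * -std::expm1(-n / d);
}

// Expected number of items that repeat an already-used hash value.
inline double expected_collisions(double n, double d)
{
    return n - expected_distinct(n, d);
}
```

For n = 2^31 and d = 2^32 this comes out at roughly 458 million, i.e. about one item in five would share its hash with an earlier item.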
If in your case hash collisions are not a disaster (such as in Java's HashMap implementation which deals with collisions by turning the hash target into a list of objects which share the same hash key, albeit at the cost of reduced performance) then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)

Choosing a minimum hash size for a given allowable number of collisions

I am parsing a large amount of network trace data. I want to split the trace into chunks, hash each chunk, and store a sequence of the resulting hashes rather than the original chunks. The purpose of my work is to identify identical chunks of data - I'm hashing the original chunks to reduce the data set size for later analysis. It is acceptable in my work that we trade off the possibility that collisions occasionally occur in order to reduce the hash size (e.g. 40 bit hash with 1% misidentification of identical chunks might beat 60 bit hash with 0.001% misidentification).
My question is, given a) number of chunks to be hashed and b) allowable percentage of misidentification, how can one go about choosing an appropriate hash size?
As an example:
1,000,000 chunks to be hashed, and we're prepared to have 1% misidentification (1% of hashed chunks appear identical when they are not identical in the original data). How do we choose a hash with the minimal number of bits that satisfies this?
I have looked at materials regarding the Birthday Paradox, though this is concerned specifically with the probability of a single collision. I have also looked at materials which discuss choosing a size based on an acceptable probability of a single collision, but have not been able to extrapolate from this how to choose a size based on an acceptable probability of n (or fewer) collisions.
Obviously, the quality of your hash function matters, but some easy probability theory will probably help you here.
The question is what exactly you are willing to accept: is it good enough that you have an expected number of collisions at only 1% of the data? Or do you demand that the probability of the number of collisions going over some bound be something? If it's the first, then a back-of-the-envelope calculation will do:
The expected number of pairs that hash to the same thing out of your set is (1,000,000 C 2)*P(any two are a pair). Let's assume that second number is 1/d, where d is the size of the hashtable. (Note: expectations are linear, so I'm not cheating very much so far). Now, you say you want 1% collisions, so that is 10,000 total. Well, you have (1,000,000 C 2)/d = 10,000, so d = (1,000,000 C 2)/10,000, which is according to google about 50,000,000.
So, you need about 50 million possible hash values. That is less than 2^26, so you will get your desired performance with somewhere around 26 bits of hash (depending on the quality of the hashing algorithm). I probably have a factor of 2 mistake in there somewhere, so you know, it's rough.
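The same back-of-the-envelope calculation as a small function, so other chunk counts and tolerances can be tried. The name min_hash_bits is made up, and the model is the one above: a uniformly distributed hash and a budget on the expected number of colliding pairs.

```cpp
#include <cassert>
#include <cmath>

// Smallest number of hash bits such that the expected number of colliding
// pairs, C(n,2) / 2^bits, stays at or below allowed_pairs.
inline int min_hash_bits(double n_items, double allowed_pairs)
{
    assert(n_items >= 2.0 && allowed_pairs > 0.0);
    double pairs = n_items * (n_items - 1.0) / 2.0;   // C(n, 2)
    double needed_values = pairs / allowed_pairs;     // required d
    int bits = static_cast<int>(std::ceil(std::log2(needed_values)));
    return bits < 0 ? 0 : bits;
}
```

min_hash_bits(1e6, 1e4) gives 26, in line with the estimate above.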
If this is an offline task, you can't be that space constrained.
Sounds like a fun exercise!
Someone else might have a better answer, but I'd go the brute force route, provided that there's ample time:
Run the hashing calculation using incremental hash size and record the collision percentage for each hash size.
You might want to use binary search to reduce the search space.
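A minimal sketch of that brute-force search, assuming you already have a full-width 64-bit hash per chunk and that chunks with identical content have been removed from the sample, so any repeat after truncation is a misidentification. Both function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Fraction of hashes that, once truncated to the low `bits` bits, repeat a
// truncated value already seen. Assumes the input hashes are distinct.
inline double collision_fraction(const std::vector<std::uint64_t>& hashes,
                                 int bits)
{
    assert(bits >= 1 && bits <= 64);
    const std::uint64_t mask = bits == 64 ? ~0ULL : ((1ULL << bits) - 1);
    std::unordered_set<std::uint64_t> seen;
    std::size_t repeats = 0;
    for (std::uint64_t h : hashes)
        if (!seen.insert(h & mask).second) ++repeats;
    return hashes.empty() ? 0.0
                          : static_cast<double>(repeats) / hashes.size();
}

// Binary search for the smallest width whose measured collision fraction is
// within `allowed`; valid because adding bits never increases the fraction.
inline int smallest_width(const std::vector<std::uint64_t>& hashes,
                          double allowed)
{
    int lo = 1, hi = 64;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (collision_fraction(hashes, mid) <= allowed) hi = mid;
        else lo = mid + 1;
    }
    return lo;
}
```

The binary search needs at most six passes over the data instead of one pass per candidate width.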

Hash length reduction?

I know that say given a md5/sha1 of a value, that reducing it from X bits (ie 128) to say Y bits (ie 64 bits) increases the possibility of birthday attacks since information has been lost. Is there any easy to use tool/formula/table that will say what the probability of a "correct" guess will be when that length reduction occurs (compared to its original guess probability)?
Crypto is hard. I would recommend against trying to do this sort of thing. It's like cooking pufferfish: Best left to experts.
So just use the full length hash. And since MD5 is broken and SHA-1 is starting to show cracks, you shouldn't use either in new applications. SHA-2 is probably your best bet right now.
I would definitely recommend against reducing the bit count of hash. There are too many issues at stake here. Firstly, how would you decide which bits to drop?
Secondly, it would be hard to predict how the dropping of those bits would affect the distribution of outputs in the new "shortened" hash function. A (well-designed) hash function is meant to distribute inputs evenly across the whole of the output space, not a subset of it.
By dropping half the bits you are effectively taking a subset of the original hash function, which might not have nearly the desirable properties of a properly-designed hash function, and may lead to further weaknesses.
Well, since every extra bit in the hash doubles the number of possible hashes, every time you shorten the hash by a bit there are only half as many possible hashes, and thus the chance of guessing that random number is doubled.
128 bits = 2^128 possibilities
thus
64 bits = 2^64 possibilities
so by cutting it in half, you are left with
2^64 / 2^128 = 2^-64
of the original possibilities, which makes a random guess 2^64 times more likely to be correct.
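As a rough aid for comparing lengths, here is a sketch of the two usual estimates, assuming the truncated hash still behaves like a uniform random value (which is an assumption, not something truncation guarantees). The function names are hypothetical, and the birthday figure uses the common approximation 1 - exp(-n(n-1) / 2^(b+1)).

```cpp
#include <cassert>
#include <cmath>

// Chance that one random guess matches a specific b-bit hash value: 2^-b.
inline double guess_probability(int bits)
{
    assert(bits >= 0);
    return std::ldexp(1.0, -bits);
}

// Birthday approximation: chance that at least two of n uniformly random
// b-bit hash values coincide, 1 - exp(-n(n-1) / 2^(b+1)).
inline double birthday_collision_probability(double n, int bits)
{
    assert(n >= 0.0 && bits >= 0);
    return -std::expm1(-n * (n - 1.0) * std::ldexp(1.0, -(bits + 1)));
}
```

For example, with 2^32 hashed values the approximation gives a collision chance of about 39% at 64 bits, against a vanishingly small chance at 128 bits.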