What are the pitfalls of using power-of-two hash table sizes instead of the traditionally used prime-numbered sizes? Does using a prime number guarantee fixing the deficiencies of naive hash functions (e.g. XORing key bytes), or is it just "shotgun debugging"? What simple hash function would work with power-of-two table sizes without clustering the keys too closely?
I get what primary and secondary clustering are, but how do I get rid of them, or what is the proper way to minimise them?
how to get rid of them, or what is the proper way to minimise them
You can use a higher quality hash function to distribute the keys in a less collision-prone fashion. For some scenarios, the best practical hash function achievable has a kind of pseudo-random-but-repeatable placement property. In other cases, you might know something about the keys that lets you create a less collision-prone hash function - for example, you might know that the keys tend to be incrementing numbers, possibly with a few small gaps: in that case, an identity hash function h(n) = n will tend to place values in adjacent buckets, with less chance of collision than if the placements were more random.
In some cases, using prime numbers of buckets helps distribute elements better across the buckets than using a power-of-two bucket count. Basically, bucket counts that are powers of two are effectively masking out the high-order bits of the hash value when mapping onto the buckets: any randomness in the high order bits is discarded instead of helping to create a more uniform distribution across buckets. Still, bitwise masking is faster than a mod calculation on most hardware/CPUs.
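To make the masking point concrete, here is a minimal sketch in C. The function names and the sample sizes 256 and 251 are mine, chosen for illustration. Keys that differ only above the low 8 bits all land in bucket 0 of a 256-bucket table when an identity hash is masked, while a prime modulus spreads them out. A cheap multiplicative mix that keeps the high bits (often called Fibonacci hashing) is one possible way to keep a power-of-two table; the constant is the commonly used 2^64 divided by the golden ratio.

```c
#include <stdint.h>
#include <stddef.h>

/* Power-of-two table: keeps only the low bits of the hash. */
size_t bucket_mask(uint64_t hash, size_t pow2_buckets)
{
    return (size_t)(hash & (pow2_buckets - 1));
}

/* Prime-sized table: every bit of the hash influences the result. */
size_t bucket_mod(uint64_t hash, size_t prime_buckets)
{
    return (size_t)(hash % prime_buckets);
}

/* Multiplicative mix for a table of 2^log2_buckets buckets
 * (1 <= log2_buckets <= 63): multiply, then keep the HIGH bits. */
size_t bucket_fib(uint64_t hash, unsigned log2_buckets)
{
    return (size_t)((hash * 0x9E3779B97F4A7C15ULL) >> (64 - log2_buckets));
}

/* Count how many distinct buckets the keys 0, 256, 512, ... hit.
 * which: 0 = mask with 256 buckets, 1 = mod 251, 2 = multiplicative mix. */
size_t distinct_buckets(int which, size_t nkeys)
{
    unsigned char seen[256] = {0};
    size_t distinct = 0;
    for (size_t i = 0; i < nkeys; ++i) {
        uint64_t key = (uint64_t)i * 256;   /* identity hash: h(key) = key */
        size_t b = which == 0 ? bucket_mask(key, 256)
                 : which == 1 ? bucket_mod(key, 251)
                              : bucket_fib(key, 8);
        if (!seen[b]) {
            seen[b] = 1;
            ++distinct;
        }
    }
    return distinct;
}
```

With 100 such keys, the masked version uses a single bucket, while the prime modulus gives each key its own bucket.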
You can also reduce the load factor: the ratio of elements to buckets. Clustering effects for hash tables using closed hashing get rapidly worse as the load factor approaches 1 (i.e. every bucket being full).
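For linear probing, Knuth's classic approximations give a feel for how quickly the cost grows with the load factor a: roughly (1 + 1/(1 - a)) / 2 probes for a successful lookup and (1 + 1/(1 - a)^2) / 2 for an unsuccessful one. The sketch below only evaluates those formulas; the function names are mine, and the formulas assume a hash function that behaves randomly.

```c
/* Knuth's approximations for linear probing, valid for 0 <= load < 1. */

/* Expected probes for a lookup that finds its key. */
double probes_hit(double load)
{
    return 0.5 * (1.0 + 1.0 / (1.0 - load));
}

/* Expected probes for a lookup (or insert) of a key that is not present. */
double probes_miss(double load)
{
    double free_fraction = 1.0 - load;
    return 0.5 * (1.0 + 1.0 / (free_fraction * free_fraction));
}
```

At a load factor of 0.5 a miss costs about 2.5 probes; at 0.9 it is already about 50.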
You could also stop using closed hashing and use separate chaining (maintaining containers of elements colliding at each bucket) instead, which doesn't suffer from primary clustering, but the indirection can lead to more memory overhead and less optimal use of CPU cache, with consequently lower runtime performance - especially when the elements are small (a few bytes each).
You can also use multiple hash functions to identify successive buckets at which an element may be stored, rather than simple offsets as in linear or quadratic probing, which reduces clustering. When you have alternative buckets, you can use techniques to move elements around to reduce the worst areas of clustering - search for Robin Hood hashing for example.
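As a rough illustration of using a second hash function to choose the probe step (double hashing), the sketch below assumes a prime bucket count so that every step size eventually visits all buckets. The functions h1, h2 and probe are placeholders of my own, not a recommendation for real keys.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical primary hash: picks the starting bucket. */
size_t h1(uint64_t key, size_t nbuckets)
{
    return (size_t)(key % nbuckets);
}

/* Hypothetical secondary hash: picks a step in 1..nbuckets-1 (never 0). */
size_t h2(uint64_t key, size_t nbuckets)
{
    return (size_t)(1 + (key / nbuckets) % (nbuckets - 1));
}

/* Bucket examined on the attempt-th probe (attempt = 0, 1, 2, ...). */
size_t probe(uint64_t key, size_t attempt, size_t nbuckets)
{
    return (h1(key, nbuckets) + attempt * h2(key, nbuckets)) % nbuckets;
}
```

With 11 buckets, keys 3 and 14 both start at bucket 3, but their next probes go to buckets 4 and 5 respectively, so they do not keep chasing each other the way they would with a fixed offset.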
I have two UUIDs. I want to hash them perfectly to produce a single unique value, but with a constraint that f(m,n) and f(n,m) must generate the same hash.
UUIDs are 128-bit values
the hash function should have no collisions - all possible input pairings must generate unique hash values
f(m,n) and f(n,m) must generate the same hash - that is, ordering is not important
I'm working in Go, so the resulting value must fit in a 256-bit int
the hash does not need to be reversible
Can anyone help?
Concatenate them with the smaller one first.
To build on user2357112's brilliant solution and boil down the comment chain, let's consider your requirements one by one (and out of order):
No collisions
Technically, that's not a hash function. A hash function is about mapping heterogeneous, arbitrary-length data inputs into fixed-width, homogeneous outputs. The only way to accomplish that if the input is longer than the output is through some data loss. For most applications, this is tolerable because the hash function is only used as a fast lookup key and the code falls back onto the slower, complete comparison of the data. That's why many guides and languages insist that if you implement one, you must implement the other.
Fortunately, you say:
Two UUID inputs m and n
UUIDs are 128 bits each
Output of f(m,n) must be 256 bits or less
Combined, your two inputs are exactly 256 bits, which means you do not have to lose any data. If you needed a smaller output, then you would be out of luck. As it is, you can concatenate the two numbers together and generate a perfect, unique representation.
f(m,n) and f(n,m) must generate the same hash
To accomplish this final requirement, make a decision on the concatenation order by some intrinsic value of the two UUIDs. The suggested smaller-first works just great. However...
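A minimal sketch of the smaller-first concatenation, written in C (the question targets Go, where bytes.Compare on the two 16-byte slices would play the same role). The type and function names are hypothetical. It treats each UUID as 16 raw bytes and orders them with memcmp, which compares them as big-endian unsigned integers.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t bytes[16]; } uuid128;   /* hypothetical UUID type */
typedef struct { uint8_t bytes[32]; } pair_key;  /* 256-bit result */

/* Order-independent, collision-free key: smaller UUID first, then the larger. */
pair_key pair_key_of(const uuid128 *m, const uuid128 *n)
{
    pair_key out;
    const uuid128 *lo = m;
    const uuid128 *hi = n;

    if (memcmp(m->bytes, n->bytes, 16) > 0) {
        lo = n;
        hi = m;
    }
    memcpy(out.bytes, lo->bytes, 16);
    memcpy(out.bytes + 16, hi->bytes, 16);
    return out;
}
```

Because the ordering depends only on the two values, pair_key_of(&m, &n) and pair_key_of(&n, &m) produce identical bytes, and no information is discarded.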
The hash does not need to be reversible
If you specifically need irreversible hashing, that's a different question entirely. You could still use the less-than comparison to ensure order independence when feeding the pair to a cryptographic hash function, but you would be hard pressed to find one that guarantees no collisions, even with fixed-width inputs and a 256-bit output width.
So it's time for me to index my database file format, and after looking at various methods I decided that a hash table would be my best option. Since I've only familiarized myself with the inner workings of a hash table today, here's my understanding of it, so please correct me if I'm wrong:
A hash table has a constant size that is equivalent of the maximum value storable in its hash function output size * key value pair size * bucket size + overflow bucket size. So for example, if the hash function makes 16 bit hashes and the bucket size is 4 and the values are 32bit then it would be 2^16 * 4 * 6 = 1572864 or 1.5MB plus overflow.
That in essence would make the hash table a sort of compressed lookup table. If the hash function changes, the whole table has to be reevaluated. Otherwise it just adds stuff to empty slots. Also the hash table can contain the maximum of units that its hash size could address (so for a 16bit hash its 65536) but to perform well without many collisions it would have to be much less.
Ok and heres the things I'm trying to index: (up to) 100 million pairs with 64bit integer keys and a 96bit value. The keys are object ID's(that mostly come in short sequences but can be all over the place) and the values are the object location + length. Reads/writes are equally important and very frequent.
The other options i looked into were various trees but the reason I didn't like them is because it seems to me that i would have to do a lot of sparse reads/writes to look up the data or to restructure the tree each time I go in.
So here are my questions:
It seems to me that I need a hash with a weird number of bits in it, I'm thinking up to ~38 since it would be just about the maximum I can store on a single disk and should be comfy enough for the 100 million. Is the weird bit amount unheard of? I'm thinking I'll probably bottleneck on disk activity way before CPU.
Are there any articles out there on how to design a good hash function for my particular case? Googling gave me an overview of the common methods but I'm looking for explanations behind them.
Any other general tips/pitfalls I should know of?
A hash table has a constant size
...not necessarily - a hash table can support resizing, but that tends to be done in fairly dramatic and invasive chunks where you can reason about the hash table as if it were constant size both before and after.
...that is equivalent of the maximum value storable in its hash function output size * key value pair size * bucket size + overflow bucket size. So for example, if the hash function makes 16 bit hashes and the bucket size is 4 and the values are 32bit then it would be 2^16 * 4 * 6 = 1572864 or 1.5MB plus overflow.
Not at all. A better way to calculate size is to say there are N values of a certain size, and you want to maintain a capacity:size ratio somewhere between say 3:1 and 5:4: the table memory usage is: N * sizeof(Value) * ratio.
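Applying that estimate to the numbers in the question (up to 100 million entries), as a rough sketch. I count the 64-bit key together with the 96-bit value as one 20-byte stored entry, which is my assumption; the function name is illustrative.

```c
#include <stdint.h>

/* Approximate table size in bytes for n_entries entries kept at a
 * capacity:size ratio of ratio_num:ratio_den (e.g. 5:4 or 3:1). */
uint64_t table_bytes(uint64_t n_entries, uint64_t entry_bytes,
                     uint64_t ratio_num, uint64_t ratio_den)
{
    return n_entries * entry_bytes * ratio_num / ratio_den;
}
```

For example, table_bytes(100000000, 20, 5, 4) is 2,500,000,000 bytes (about 2.5 GB) and table_bytes(100000000, 20, 3, 1) is 6,000,000,000 bytes, regardless of how many bits the hash function produces.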
The number of bits in the hash value is only relevant in that it indicates the maximum number of distinct buckets you can hash to: if you try to have a bigger table then you'll get more collisions than you would with a hash function generating wider-bit hash values. If you have more bits from your hash function than you need, that is not a problem: you can, for example, take the modulus with the current table size to find your bucket: hashed_to_bucket = hash_value % num_buckets.
That in essence would make the hash table a sort of compressed lookup table.
That's a good way to look at a hash table.
If the hash function changes, the whole table has to be reevaluated. Otherwise it just adds stuff to empty slots.
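Definitely reevaluated/regenerated. Otherwise, entries stored under the old function generally can't be found at the buckets the new function computes; adding to empty slots is but one of the undesirable consequences.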
Also the hash table can contain the maximum of units that its hash size could address (so for a 16bit hash its 65536) but to perform well without many collisions it would have to be much less.
As above, that (e.g. 65536) is not a hard maximum, but "to perform well without collisions" going over that should be avoided. To perform well it does not have to be much less: anything right up to 65536 is perfectly fine if it's a good quality 16-bit hash function.
Ok and heres the things I'm trying to index: (up to) 100 million pairs with 64bit integer keys and a 96bit value. The keys are object ID's(that mostly come in short sequences but can be all over the place) and the values are the object location + length. Reads/writes are equally important and very frequent.
The other options i looked into were various trees but the reason I didn't like them is because it seems to me that i would have to do a lot of sparse reads/writes to look up the data or to restructure the tree each time I go in.
Could be... a lot depends on your access patterns. For example, if you happen to try to access the keys following the "short sequences" then a data organisation model that tends to put them nearby in memory/disk helps. Some types of tree structures do that nicely, and you can sometimes hack your hash function to do it too (but need to balance that up against collision proneness).
It seems to me that I need a hash with a weird number of bits in it, I'm thinking up to ~38 since it would be just about the maximum I can store on a single disk and should be comfy enough for the 100 million. Is the weird bit amount unheard of? I'm thinking I'll probably bottleneck on disk activity way before CPU.
Not so... you have 64 bit integer keys - a 64 bit or larger hash would be desirable. That said, a 32 bit hash may well be fine too - that generates 4 billion distinct values which is greater than your 100 million keys.
Are there any articles out there on how to design a good hash function for my particular case? Googling gave me an overview of the common methods but I'm looking for explanations behind them.
Not that I'm aware of.
Any other general tips/pitfalls I should know of?
For tips... I'd say start simple (e.g. with the hash function returning the key unchanged and using modulus with a hash table capacity that's a prime number, OR using any common hash if you're picking up a hash table implementation that uses e.g. power-of-2 numbers of buckets) and measure your collision rates: that tells you how much effort it's worth putting into improving your hashing.
One very simple way to get "ideal, randomised" hashing in your case is to have 8 tables of 256 32-bit integers - initialised with hardcoded random numbers (you can google for random number download websites). Given any 64-bit key, just slice it into 8 bytes then use each byte as a key in the successive tables, XORing the 32-bit values you look up. A single bit of difference in any of the 64 input bits will then impact all 32 bits in the hash value with equal probability.
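#include <stdint.h>

/* One table of 256 random 32-bit values per byte of the key. */
uint32_t table[8][256] = { ...add some random numbers... };

uint32_t h(uint64_t n)
{
    uint32_t result = 0;
    /* Extract bytes with shifts so the hash does not depend on host byte order. */
    for (int i = 0; i < 8; ++i)
        result ^= table[i][(n >> (8 * i)) & 0xFF];
    return result;
}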
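For experimenting, a self-contained variant of the table-lookup hash above could fill the tables from a fixed-seed generator instead of hardcoded numbers. This is only a sketch: the names tab_init and tab_hash are mine, and the generator is splitmix64, swapped in for the downloaded random numbers. For an on-disk index the seed (or the tables themselves) must stay fixed, because changing them changes every hash.

```c
#include <stdint.h>

static uint32_t tab[8][256];

/* splitmix64: a small, well-known 64-bit generator, used here only to fill the tables. */
static uint64_t splitmix64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Fill all 8 x 256 entries deterministically from the given seed. */
void tab_init(uint64_t seed)
{
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 256; ++j)
            tab[i][j] = (uint32_t)(splitmix64(&seed) >> 32);
}

/* XOR one table entry per key byte; the shifts keep the result
 * independent of host byte order. */
uint32_t tab_hash(uint64_t key)
{
    uint32_t result = 0;
    for (int i = 0; i < 8; ++i)
        result ^= tab[i][(key >> (8 * i)) & 0xFF];
    return result;
}
```

Call tab_init once with a constant seed before the first tab_hash call; the same seed always reproduces the same tables and therefore the same hashes.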
I know that jenkinshash produces an integer (2^32) for a given value. The documentation at this link:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/JenkinsHash.html
says
Returns:
a 32-bit value. Every bit of the key affects every bit of the return value. Two keys differing by one or two bits will have totally different hash values.
jenkinshash can return at most 2^32 different results for given values.
What if I have more than 2^32 values?
Will it return same result for two different values?
Thanks
As with most hash functions, yes, it may return duplicate hash values for different input data. The guarantee, according to the documentation you linked to, is that values that differ in one or two bits hash differently. As soon as they differ in 3 bits or more, you have no uniqueness guarantee.
The input data to the hash function may be of a larger size (have more unique input values) than the output of the hash. This trivially makes it so that duplicates must exist in the output data. Consider a hashing function that outputs an integer in the range 1-10 but takes an input in the range 1-100: it is obvious that multiple values must hash to the same value because you cannot enumerate the values 1-100 using only ten different integers. This is called the pigeonhole principle.
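The pigeonhole argument can be checked mechanically. In this sketch toy_hash is an arbitrary stand-in of my own; whatever function maps 1-100 onto 1-10, at least one output value must be shared by 10 or more inputs.

```c
/* Toy hash from 1..100 onto 1..10 (an arbitrary choice for illustration). */
int toy_hash(int x)
{
    return (x - 1) % 10 + 1;
}

/* Largest number of inputs from 1..100 that share a single output value. */
int max_bucket_load(void)
{
    int counts[11] = {0};   /* index 0 unused; outputs are 1..10 */
    int max = 0;
    for (int x = 1; x <= 100; ++x) {
        int c = ++counts[toy_hash(x)];
        if (c > max)
            max = c;
    }
    return max;
}
```

This particular toy_hash is perfectly even, so every output is shared by exactly 10 inputs; a less even function would only make the most crowded output worse.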
Any good hashing function will, however, try to distribute the output values evenly. In the 1-10 example you can expect a good hashing function to give a 2 approximately the same number of times as a 6.
Hashing functions that guarantee uniqueness are called perfect hash functions. They all provide an output range of at least the same cardinality as the input set. A perfect hashing function for the input integers 1-100 must have at least 100 different output values.
Note that according to Wikipedia the Jenkins hash functions are not cryptographic. This means that you should avoid them for password security and the like, but you can use the hash for somewhat even work distribution and checksums.
I'm hashing a large number of files, and to avoid hash collisions, I'm also storing a file's original size - that way, even if there's a hash collision, it's extremely unlikely that the file sizes will also be identical. Is this sound (a hash collision is equally likely to be of any size), or do I need another piece of information (if a collision is more likely to also be the same length as the original).
Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?
Hash functions are generally written to evenly distribute the data across all result buckets.
Assume your files are evenly distributed over a fixed range of available sizes; let's say there are only 1024 (2^10) evenly distributed distinct sizes. Storing the file size then at best reduces the chance of a collision by a factor equal to the number of distinct file sizes.
Note: we could assume it's 2^32 evenly distributed and distinct sizes and it still doesn't change the rest of the math.
It is commonly accepted that the general probability of a collision on MD5 (for example) is 1/(2^128).
Unless something specifically built into the hash function says otherwise, for any valid X the probability P(MD5(X) == MD5(X+1)) is the same as for any two distinct random values {Y, Z}. That is to say, P(MD5(Y) == MD5(Z)) = P(MD5(X) == MD5(X+1)) = 1/(2^128) for any values of X, Y and Z.
Combining this with the 2^10 distinct file sizes means that by storing the file size you are at most getting an additional 10 bits that signify whether items are different or not (again, this assumes your files are evenly distributed over all sizes).
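To put rough numbers on this, the birthday approximation says that among n items hashed to b uniformly distributed bits, the chance of at least one collision is about n^2 / 2^(b+1), as long as that value is small. The sketch below works in powers of two to avoid floating point; the function name and the example of 2^20 files are my own.

```c
/* Approximate log2 of the probability of any collision among 2^log2_items
 * items carrying hash_bits uniformly distributed bits (birthday estimate,
 * meaningful only while the result is well below 0). */
int log2_collision_chance(int log2_items, int hash_bits)
{
    return 2 * log2_items - 1 - hash_bits;
}
```

For about a million files (2^20), MD5 alone gives roughly 2^-89; adding 10 ideal bits of file size gives 2^-99, while a 256-bit hash gives 2^-217.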
So at the very best, all you are doing is adding another N bytes of storage for <=N bytes' worth of unique values (it can never be >N). You're therefore much better off increasing the number of bytes returned by your hash function, using something such as SHA-1/2 instead, as this is more likely to give you evenly distributed hash values than storing the file size.
In short: if MD5 isn't good enough for collisions, use a stronger hash. If the stronger hashes are too slow, use a fast hash with a low chance of collisions, such as MD5, and then use a slower hash such as SHA-1 or SHA256 to reduce the chance of a collision. But if SHA256 is fast enough and the doubled space isn't a problem, you should probably be using SHA256.
Depends on your hash function, but in general, files that are of the same size but different content are less likely to produce the same hash than files that are of different size. Still, it would probably be cleaner to simply use a time-tested hash with a larger space (e.g. MD5 instead of CRC32, or SHA1 instead of MD5) than to bet on your own solutions like storing file size.
The size of the hash is the same regardless of the size of the original data. As there is only a limited number of possible hashes it is theoretically possible that two files with different sizes may have the same hash. However, this means that it is also possible that two files with the same size may have the same hash.
Hash functions are designed so that it is very difficult to get a collision; otherwise they wouldn't be effective.
If you do get a hash collision, which is extraordinarily unlikely (a probability of about 1 : number_of_possible_hashes), that says nothing about file size.
If you really want to be double-sure about hash collisions, you can calculate two different hashes for the same file - it will be less error-prone than saving hash + file size.
The whole point of the family of cryptographic hashes (MD5, SHA-x, etc) is to make collisions vanishingly unlikely. The notion is that official legal processes are prepared to depend on it being impractical to manufacture a collision on purpose. So, really, it's a bad use of space and CPU time to add a belt to the suspenders of these hashes.