best way to resolve collisions in hashing strings

I got asked this question at an interview and said to use a second hash function, but the interviewer kept probing me for other answers. Anyone have other solutions?

best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
- Use a displacement list, so you try a number of offsets from the hashed-to bucket until you find a free bucket (modding by table size). Displacement lists might look something like { 3, 5, 11, 19... } etc. - ideally you want the difference between displacements not to be the sum of a sequence of other displacements.
- Rehash using a different algorithm (but then you'd need yet another algorithm if you happen to clash twice, etc.).
- Root a container in the buckets, such that colliding strings can be searched for. Typically the number of buckets should be similar to or greater than the number of elements, so elements per bucket will be fairly small and a brute-force search through an array/vector is a reasonable approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest: adding an offset is cheaper than calculating another hash or supporting separate heap allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) are enough to find an empty bucket, so locality of memory use is reasonable. They are, however, more collision prone than an alternative hashing algorithm (which should approach a #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
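For illustration, here's a minimal sketch of the displacement-list idea in C++ (DisplacementTable and the FNV-style hash are just placeholders, not from any particular library); a real table would also need deletion, resizing and better failure handling:

    #include <array>
    #include <cstdint>
    #include <optional>
    #include <string>
    #include <vector>

    // Minimal sketch of open addressing with a displacement list (not production code).
    struct DisplacementTable {
        static constexpr std::size_t kBuckets = 64;                 // table size (illustrative)
        static constexpr std::array<std::size_t, 4> kDisplacements{ // offsets tried after the home bucket
            3, 5, 11, 19};

        std::vector<std::optional<std::string>> buckets =
            std::vector<std::optional<std::string>>(kBuckets);

        static std::uint64_t hash(const std::string& s) {           // simple FNV-1a style hash
            std::uint64_t h = 1469598103934665603ULL;
            for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
            return h;
        }

        bool insert(const std::string& key) {
            std::size_t home = hash(key) % kBuckets;
            if (!buckets[home]) { buckets[home] = key; return true; }
            for (std::size_t d : kDisplacements) {                   // try each displacement in turn
                std::size_t idx = (home + d) % kBuckets;
                if (!buckets[idx]) { buckets[idx] = key; return true; }
            }
            return false;                                            // caller must handle exhaustion (e.g. resize)
        }
    };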

Use a linked list as the hash bucket, so any collisions are handled gracefully.
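A minimal sketch of that idea (separate chaining), assuming std::string keys and std::hash; the type names here are purely illustrative:

    #include <functional>
    #include <list>
    #include <string>
    #include <vector>

    // Minimal sketch of separate chaining: each bucket is a linked list of colliding keys.
    struct ChainedSet {
        std::vector<std::list<std::string>> buckets;
        explicit ChainedSet(std::size_t n) : buckets(n) {}

        std::list<std::string>& bucket_for(const std::string& key) {
            return buckets[std::hash<std::string>{}(key) % buckets.size()];
        }

        void insert(const std::string& key) {
            auto& b = bucket_for(key);
            for (const auto& k : b) if (k == key) return;  // already present
            b.push_back(key);                              // collision handled by appending to the chain
        }

        bool contains(const std::string& key) {
            for (const auto& k : bucket_for(key)) if (k == key) return true;
            return false;
        }
    };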

Alternative approach: You might want to consider using a trie instead of a hash table for dictionaries of strings.
The upside of this approach is that you get O(|S|) worst-case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that a hash table only gives you an average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when the load factor is too high.
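A minimal sketch of such a trie, with byte-indexed children (a real implementation would usually use a sparser child representation to save memory):

    #include <array>
    #include <memory>
    #include <string>

    // Minimal sketch of a trie over bytes: insert and lookup are O(|S|) in the string length.
    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 256> next{};  // one slot per possible byte
        bool is_end = false;                                // marks the end of a stored string
    };

    struct Trie {
        TrieNode root;

        void insert(const std::string& s) {
            TrieNode* node = &root;
            for (unsigned char c : s) {
                if (!node->next[c]) node->next[c] = std::make_unique<TrieNode>();
                node = node->next[c].get();
            }
            node->is_end = true;
        }

        bool contains(const std::string& s) const {
            const TrieNode* node = &root;
            for (unsigned char c : s) {
                if (!node->next[c]) return false;
                node = node->next[c].get();
            }
            return node->is_end;
        }
    };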

Assuming we are not using a perfect hash function (which you usually don't have), the hash tells you that:
- if the hashes are different, the objects are distinct
- if the hashes are the same, the objects are probably the same (if a good hashing function is used), but may still be distinct.
So in a hashtable, the collision will be resolved with some additional checking of whether the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gain a lot, because collisions rarely happen for good hashing functions). In a dictionary you just need to resolve those rare collision cases and make sure you get the right object out.
Using another non-perfect hash function will not resolve anything; it just reduces the chance of (another) collision.
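One way this extra check can look in code: store the hash alongside each key, so a differing hash rules an entry out cheaply, while an equal hash still requires a full key comparison. A rough sketch (Entry and find are just illustrative names):

    #include <functional>
    #include <string>
    #include <vector>

    // A stored hash rules out inequality cheaply, but equal hashes still require a
    // full key comparison, because distinct keys can share a hash.
    struct Entry {
        std::size_t hash;
        std::string key;
    };

    bool find(const std::vector<Entry>& entries, const std::string& key) {
        std::size_t h = std::hash<std::string>{}(key);
        for (const auto& e : entries) {
            if (e.hash != h) continue;      // different hash => definitely a different key
            if (e.key == key) return true;  // same hash => probably equal, confirm with full compare
        }
        return false;
    }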

Related

Is there a way to verify a common seed to a cumulative sequence of hashes with unknown repetitions between each value presented?

I am writing a variant of the Cuckoo Cycle that uses an adjacency list for presenting solutions from two pairs of 8-bit coordinates. I am not having any problems devising what I think should be an optimal solver for it, which uses two pairs of head/tail binary search trees to keep track of possible solution nodes and rejected (branch) nodes, plus a binary tree that keeps a list of the candidate cycles as they are being assembled (as I understand it, binary search trees shorten the amount of processing needed to find duplicates). However, I need to refine the verifier function for solutions.
I see in Cuckoo that there is some process by which it modifies the edges with XOR functions and masks to identify a valid cycle, but I have two issues.
One is that each hash is generated from the previous hash, starting with the nonce, so proving that all offered node/edge pairs are valid derivatives of the nonce seems to require the verifier to repeat the hash function, checking for a match each time until it gets a hit, which could take up to several thousand iterations in the worst case. Is there some property that can be used to shortcut this identification process, since, unlike protection against DoS, we are providing the salt of the hash?
Second is that even if the presented cycle is perfectly valid, it is possible that one or more of the node/edge pairs in the cycle has a duplicate coordinate. The hashes are 32 bits long and each coordinate is 8 bits. The answer to this probably has some relation to the previous question as well, since having the seed for a hash function is a known security risk because of collisions. So obviously, as well as verifying that the nodes are part of a cycle in the lowest possible values of the finite field, I need a way to be sure that a pair does not overlap with another possible, branching pair.
I will be studying the verifier more closely in the Cuckoo Cycle implementation to see if I can figure out how the algorithm ensures it is not approving a cycle that actually has a branch (and thus is invalid), but I thought I'd pop the question on this site in case someone knows more about ways of recognising hashes from a common seed, and whether there is any way to recognise a 50% collision between a given coordinate and another one.
Note: After thinking about it for a while, I realised that I could solve the 'fake cycle' problem (one or more nodes having a branch) by simply splitting the heads and tails into separate, consecutive hashes (odd then even), such as 16-bit Murmur3 hashes.
Thinking about it further, I realised that Cuckoo Cycle is actually a special type of hash collision search, one that seeks only collisions that occur exactly once in the low order of the finite field. I am devising a new scheme called Hummingbird, which will not target the smallest numbers (which is also what hashcash does) but will instead target the hashes in a chain most proximate to the seed nonce. This means that attempts to insert branched nodes into the graph of the solution will be discovered in the verification, which will probably take about 2-5 seconds depending on how deep the chain is. These solutions could be eliminated by specifying a maximum hash chain length as part of the consensus.
I just wanted to add that I answered my own question by realising that what I am looking for in my algorithm is essentially a hash collision, and the simplest solution, with the least bit-twiddling, was to make each coordinate a distinct hash in a hash chain (hash of the nonce, then hash of that hash, and so on).
I didn't fully understand that Cuckoo Cycle is essentially a search for partial hash collisions, and when that dawned on me, I realised that the simple solution is to just turn it into a search for hash collisions.
I have, from this realisation, moved very quickly forward to figuring out how my variation of Cuckoo can be much more simply implemented, as well as how to structure the B-tree based progressive search algorithm, the difficulty adjustment, and the rest.
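A minimal sketch of that hash-chain construction (toy_hash here is only an illustrative stand-in mixer; a real implementation would use a proper keyed hash, such as the siphash that Cuckoo Cycle itself uses):

    #include <cstdint>
    #include <vector>

    // Illustrative stand-in for a real keyed hash function; not suitable for proof of work.
    uint32_t toy_hash(uint32_t x) {
        x ^= x >> 16; x *= 0x7feb352dU;
        x ^= x >> 15; x *= 0x846ca68bU;
        x ^= x >> 16;
        return x;
    }

    // Build a hash chain from a nonce: element i is the hash applied i+1 times to the nonce,
    // so every coordinate comes from a distinct link and the verifier can recompute the chain.
    std::vector<uint32_t> hash_chain(uint32_t nonce, std::size_t length) {
        std::vector<uint32_t> chain;
        chain.reserve(length);
        uint32_t h = nonce;
        for (std::size_t i = 0; i < length; ++i) {
            h = toy_hash(h);
            chain.push_back(h);
        }
        return chain;
    }

    // Each 32-bit link can then be split into four 8-bit coordinates, e.g.
    // (h >> 24) & 0xff, (h >> 16) & 0xff, (h >> 8) & 0xff and h & 0xff.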
I wasn't aware there was a stackexchange specialist site for math, or cryptography, or I would have posted it there instead. I studied FEC a few months ago and that opened the floodgates to a whole bunch of other ideas that led me to getting so worked up about Cuckoo Cycle. I believe I have generalised the Cuckoo Cycle into a generic, parameterisable graph theoretic proof of work and I will get back to finishing my implementation.
Thanks to everyone who submitted an answer, I will upvote as I deem correct, though I have zero or nearly zero rep, for what it's worth.

Comparing hashes to test for collisions

I wish to compare hashes to check for collisions (Yes, I know it is time consuming, but never mind that). In checking for collisions, hashes need to be compared. Is the best method to have a single hash in a variable to compare against, or to have a list of all hashes previously generated and compare the latest hash to each item in the list?
I would prefer the first option because it is much faster, but is there a recommended method? Are you less likely to find a collision by using the first method?
Is the best method to have a single hash in a variable to compare against, or to have a list of all hashes previously generated and compare the latest hash to each item in the list?
Neither.
I would prefer the first option because it is much faster, but is there a recommended method?
I don't understand why you think the first method might work, but then you haven't fully explained your situation. Still, if you want to detect hash values that repeat, you do indeed need to keep track of already-seen hash values: to do that you don't want to search linearly through a list, and should use a set container to store seen hashes. A hash table - as suggested in a comment by gnasher729 a few hours back - would give O(1) performance (e.g. in C++, if your hashes are 64 bit, std::unordered_set<uint64_t>), or a balanced binary tree gives O(log N) performance (e.g. C++ std::set<uint64_t>).
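For example, a sketch of that approach, assuming the hashes arrive as a vector of 64-bit values:

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // Collect hash values that have been seen before; insert() reports whether the
    // value was newly added, so a false second element means a repeated hash.
    std::vector<uint64_t> find_repeated_hashes(const std::vector<uint64_t>& hashes) {
        std::unordered_set<uint64_t> seen;
        std::vector<uint64_t> repeats;
        for (uint64_t h : hashes) {
            if (!seen.insert(h).second)   // average O(1) membership test + insert
                repeats.push_back(h);
        }
        return repeats;
    }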
Are you less likely to find a collision by using the first method?
You're very likely to miss collisions.
All that said, you may want to reexamine your premise. The chance of a good (cryptographic quality) hash function producing collisions closely approaches the odds described by the "birthday paradox". As a rule of thumb, if you have 2^N distinct values to hash, you're statistically unlikely to see collisions if your hashes are comfortably more than 2*N bits wide: if you allow enough "comfort", you're more likely to be hit on the noggin by a meteor than have your program see a collision. You mentioned MD5, so I'd expect 128 bits: unless you're storing on the order of a quadrillion values or more (literally), it's pretty safe to ignore the potential for collisions.
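If you want to sanity-check that rule of thumb, the standard birthday-bound approximation is easy to compute (the numbers below are illustrative, not measurements):

    #include <cmath>
    #include <cstdio>

    // Birthday-bound approximation: with k values and b-bit hashes, the chance of at
    // least one collision is roughly 1 - exp(-k*(k-1) / 2^(b+1)).
    double collision_probability(double k, double bits) {
        return 1.0 - std::exp(-k * (k - 1.0) / std::pow(2.0, bits + 1.0));
    }

    int main() {
        // e.g. a quadrillion (1e15) 128-bit hashes: probability is still on the order of 1e-9.
        std::printf("%.3g\n", collision_probability(1e15, 128));
    }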
Do note one important use of hash values where collisions happen more often for a different reason, and that's in hash tables, where even non-colliding hash values may collide at the same bucket index after they're "wrapped" - often a la h % N when N is the number of buckets. In general, it's impractical to ignore the potential for collisions in a hash table, and very unwise to try.

comparing ways of resolving collisions in hash tables

How can I compare methods of collision resolution (i.e. linear probing, quadratic probing and double hashing) in hash tables? What data would be best to show the differences between them? Maybe someone has seen such comparisons.
There is no simple approach that's also universally meaningful.
That said, a good approach if you're tuning an actual app is to instrument (collect stats) for the hash table implementation you're using in the actual application of interest, with the real data it processes, and for whichever functions are of interest (insert, erase, find etc.). When those functions are called, record whatever you want to know about the collisions that happen: depending on how thorough you want to be, that might include the number of collisions before the element was inserted or found, the number of CPU/memory cache lines touched during that probing, the elapsed CPU or wall-clock time etc..
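As a rough sketch of what such instrumentation might look like for a simple linear-probing lookup (Stats and find_linear are purely illustrative, not from any particular library):

    #include <cstddef>
    #include <optional>
    #include <string>
    #include <vector>

    // Count how many buckets are probed per lookup, so different probing schemes
    // can be compared on real data.
    struct Stats {
        std::size_t lookups = 0;
        std::size_t probes = 0;   // total buckets touched across all lookups
    };

    bool find_linear(const std::vector<std::optional<std::string>>& buckets,
                     const std::string& key, std::size_t home, Stats& stats) {
        ++stats.lookups;
        for (std::size_t i = 0; i < buckets.size(); ++i) {
            std::size_t idx = (home + i) % buckets.size();  // linear probe sequence
            ++stats.probes;
            if (!buckets[idx]) return false;                // empty bucket: key absent
            if (*buckets[idx] == key) return true;
        }
        return false;
    }
    // Average probes per lookup: stats.probes / double(stats.lookups).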
If you want a more general impression, instrument an implementation and throw large quantities of random data at it - but be aware that the real-world applicability of whatever conclusions you draw may only be as good as the random data is similar to the real-world data.
There are also other, more subtle implications to the choice of collision-handling mechanism: linear probing allows an implementation to clean up "tombstone" buckets where deleted elements exist, which takes time but speeds later performance, so the mix of deletions amongst other operations can affect the stats you collect.
At the other extreme, you could try a mathematical comparison of the properties of different collision handling - that's way beyond what I'm able or interested in covering here.

why a hashtable's rehash complexity may be quadratic in the worst case

I do not understand why a hashtable's rehash complexity may be quadratic in the worst case, as stated at:
http://www.cplusplus.com/reference/unordered_set/unordered_multiset/reserve/
Any help would be appreciated!
Thanks
Just some basics:
A hash collision is when two or more elements take on the same hash. This can cause worst-case O(n) operations.
I won't really go into this much further, since one can find many explanations of it. Basically all the elements can have the same hash, so you'll have one big linked-list at that hash containing all your elements (and search on a linked-list is of course O(n)).
It doesn't have to be a linked-list, but most implementations do it this way.
A rehash creates a new hash table with the required size and basically does an insert for each element in the old table (there may be a slightly better way, but I'm sure most implementations don't beat the asymptotic worst-case complexity of simple inserts).
In addition to the above, it all comes down to this statement (from here¹):
Elements with equivalent values are grouped together in the same bucket and in such a way that an iterator (see equal_range) can iterate through all of them.
So all elements with equivalent values need to be grouped together. For this to hold, when doing an insert, you first have to check if there exist other elements with the same value. Consider the case where all the values take on the same hash. In this case, you'll have to look through the above-mentioned linked-list for these elements. So n insertions, looking through 0, then 1, then 2, then ..., then n-1 elements, which is 0+1+2+...+n-1 = n*(n-1)/2 = O(n^2).
Can't you optimize this to O(n)? To me it makes sense that you may be able to, but even if so, this doesn't mean that all implementations have to do it this way. When using hash tables it's generally assumed that there won't be too many collisions (even if this assumption is naive), thus avoiding the worst-case complexity and reducing the need for the additional complexity required to make a rehash not take O(n^2).
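If you want to see this effect for yourself, a deliberately degenerate hash functor makes it easy to observe (whether the growth is actually quadratic depends on the standard library implementation, as noted above):

    #include <chrono>
    #include <cstdio>
    #include <unordered_set>

    // A deliberately terrible hash: every key lands in the same bucket, so each insert
    // may have to scan one long chain and total time can grow roughly quadratically.
    struct ConstantHash {
        std::size_t operator()(int) const { return 0; }
    };

    int main() {
        for (int n : {1000, 2000, 4000}) {
            std::unordered_multiset<int, ConstantHash> s;
            auto start = std::chrono::steady_clock::now();
            for (int i = 0; i < n; ++i) s.insert(i);     // all distinct values, identical hash
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - start).count();
            std::printf("n=%d: %lld ms\n", n, static_cast<long long>(ms));
        }
    }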
1: To all the possible haters, sorry for quoting CPlusPlus instead of CPPReference (for everyone else - CPlusPlus is well-known for being wrong), but I couldn't find this information there (so, of course, it could be wrong, but I'm hoping it isn't, and it does make sense in this case).

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit, in terms of decreased collision probability, from combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
Combining them in series like that doesn't make sense. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it cannot get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
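If you do want to try the side-by-side combination, a sketch using zlib's crc32() and adler32() (link with -lz) might look like this; whether the wider hash is worth the extra storage is still worth measuring for your data:

    #include <cstdint>
    #include <string>
    #include <zlib.h>   // provides crc32() and adler32()

    // Pack Adler-32 into the high 32 bits and CRC-32 into the low 32 bits,
    // giving one 64-bit value per input.
    uint64_t combined_hash(const std::string& data) {
        const Bytef* buf = reinterpret_cast<const Bytef*>(data.data());
        uInt len = static_cast<uInt>(data.size());
        uint64_t a = adler32(adler32(0L, Z_NULL, 0), buf, len);
        uint64_t c = crc32(crc32(0L, Z_NULL, 0), buf, len);
        return (a << 32) | c;
    }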
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information, which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.