My knowledge of hash tables is limited and I am currently learning about them. I have a question on hash collision resolution by open hashing, also called separate chaining.
I understand that the hash buckets in this case hold a pointer to a linked list in which all the elements that hash to the same bucket are linked, so the search complexity would be O(n), where n is the number of elements in the linked list. Is there a way to improve on this?
Also, if there is a constraint on the size of the linked list, say it can hold at most 5 elements, and more than 5 elements hash into the same bucket, what would be the best way to handle this scenario?
Any pointers for learning more on the above and any help would be greatly appreciated.
Hash collisions shouldn't be too common, otherwise you're doing something wrong (e.g. a bad hash function or not a big enough hash table). So the number of elements in each linked-list should be minimal and the O(n) complexity shouldn't be too bad.
You could theoretically replace it with one of many other data structures. A balanced binary search tree, for example, would get O(log n) search time (assuming the items are comparable), but then insert time will be up to O(log n) instead of O(1), and it would take more space.
There should be no maximum on the number of elements in a list. If there were, you could probably resort to probing (e.g. linear probing), but deletions could be a nightmare as you may need to move elements around quite a bit.
Does anyone know the original hash table implementation?
Every realization I've found is based on separate chaining or open addressing methods
Chaining, by Hans Peter Luhn, in 1953.
https://en.wikipedia.org/wiki/Hash_table#History
The first implementation, and also the most common one, is probably the one that uses an array (resized as needed) in which each entry points to a list of elements.
The hash code, computed mod the size of the array, points to the integer index at which the list of the element to be searched is located. In case of hash code collision, the elements will accumulate in the list of the related entry.
So, once the hash code is computed, we have O(1) for accessing the entry of the array and O(N) for the actual search of the element in the list by verifying its actual equality. The value of N must be kept low for obvious performance consequences.
When collisions become too frequent, we resize the array, increasing the number of entries and decreasing the collisions accordingly. This works because the hash code is then computed mod a larger number than before.
Some more complicated implementations convert the lists to trees if they become too long, so that the equality search drops from O(N) to O(log(N)).
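The array-of-lists layout and the resize step described above can be sketched roughly as follows. This is only an illustration, not any particular library's implementation: the class name `ChainedHashTable`, the starting capacity of 8, and the `max_load` threshold of 0.75 are all assumptions made for the example.

```python
class ChainedHashTable:
    """Separate chaining: an array of buckets, each a list of (key, value) pairs."""

    def __init__(self, capacity=8, max_load=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.max_load = max_load  # assumed threshold that triggers a resize

    def _index(self, key):
        # Hash code mod the array size selects the bucket.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # equality check inside the chain
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > self.max_load * len(self.buckets):
            self._resize(2 * len(self.buckets))

    def get(self, key, default=None):
        # O(1) to reach the bucket, then O(N) in the length of its chain.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _resize(self, new_capacity):
        # Re-insert every element; a larger modulus spreads the chains out.
        new_buckets = [[] for _ in range(new_capacity)]
        for bucket in self.buckets:
            for k, v in bucket:
                new_buckets[hash(k) % new_capacity].append((k, v))
        self.buckets = new_buckets
```

Doubling the array on resize is one common choice; other growth factors work as long as the chains stay short on average.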
So hash tables are really cool for constant-time lookups of data in sets, but as I understand it they are limited by possible hash collisions, which lead to a small increase in time complexity.
It seems to me like any hashing function that supports a non-finite range of inputs is really a heuristic for reducing collision. Are there any absolute limitations to creating a perfect hash table for any range of inputs, or is it just something that no one has figured out yet?
I think this depends on what you mean by "any range of inputs."
If your goal is to create a hash function that can take in anything and never produce a collision, then there's no way to do what you're asking. This is a consequence of the pigeonhole principle - if you have n objects that can be hashed, you need at least n distinct outputs for your hash function or you're forced to get at least one hash collision. If there are infinitely many possible input objects, then no finite hash table could be built that will always avoid collisions.
On the other hand, if your goal is to build a hash table where lookups are worst-case O(1) (that is, you only have to look at a fixed number of locations to find any element), then there are many different options available. You could use a dynamic perfect hash table or a cuckoo hash table, which supports worst-case O(1) lookups and expected O(1) insertions and deletions. These hash tables work by using a variety of different hash functions rather than any one fixed hash function, which helps circumvent the above restriction.
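As a rough illustration of the cuckoo idea, here is a minimal sketch of a set with two tables and two seeded hash functions. Everything named here (`CuckooHashSet`, `max_kicks`, the starting capacity, the choice to grow the tables on every rebuild) is an assumption made for the example, and it assumes keys are hashable and never `None`.

```python
import random


class CuckooHashSet:
    """Two tables, two hash functions; every key lives in one of two slots."""

    def __init__(self, capacity=11, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks  # assumed bound on evictions per insert
        self.tables = [[None] * capacity, [None] * capacity]
        self._new_seeds()

    def _new_seeds(self):
        # Picking fresh seeds amounts to picking two new hash functions.
        self.seeds = [random.getrandbits(64), random.getrandbits(64)]

    def _slot(self, which, key):
        return hash((self.seeds[which], key)) % self.capacity

    def __contains__(self, key):
        # Worst case: exactly two probes, however full the table is.
        return any(self.tables[w][self._slot(w, key)] == key for w in (0, 1))

    def add(self, key):
        if key in self:
            return
        leftover = self._place(key)
        if leftover is not None:
            self._rebuild(leftover)

    def _place(self, key):
        """Insert key, evicting occupants; return the homeless key on failure."""
        which = 0
        for _ in range(self.max_kicks):
            i = self._slot(which, key)
            key, self.tables[which][i] = self.tables[which][i], key
            if key is None:          # the slot was empty, nothing was evicted
                return None
            which = 1 - which        # the evicted key goes to the other table
        return key

    def _rebuild(self, extra):
        # Simplification: grow the tables and pick new hash functions,
        # retrying until every key finds a home.
        keys = [k for t in self.tables for k in t if k is not None] + [extra]
        while True:
            self.capacity = 2 * self.capacity + 1
            self.tables = [[None] * self.capacity for _ in range(2)]
            self._new_seeds()
            if all(self._place(k) is None for k in keys):
                return
```

Lookups never loop: they inspect one slot per table. The cost of collisions is moved into insertion, which is where the "expected O(1)" qualifier comes from.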
Hope this helps!
I do not understand why a hash table's rehash complexity may be quadratic in the worst case, as stated at:
http://www.cplusplus.com/reference/unordered_set/unordered_multiset/reserve/
Any help would be appreciated !
Thanks
Just some basics:
A hash collision is when two or more elements take on the same hash. This can cause worst-case O(n) operations.
I won't really go into this much further, since one can find many explanations of this. Basically, all the elements can have the same hash, so you'll have one big linked list at that hash containing all your elements (and search on a linked list is of course O(n)).
It doesn't have to be a linked list, but most implementations do it this way.
A rehash creates a new hash table with the required size and basically does an insert for each element in the old table (there may be a slightly better way, but I'm sure most implementations don't beat the asymptotic worst-case complexity of simple inserts).
In addition to the above, it all comes down to this statement (from here [1]):
Elements with equivalent values are grouped together in the same bucket and in such a way that an iterator (see equal_range) can iterate through all of them.
So all elements with equivalent values need to be grouped together. For this to hold, when doing an insert, you first have to check whether other elements with the same value exist. Consider the case where all the values take on the same hash. In this case, you'll have to look through the above-mentioned linked list for these elements. So n insertions look through 0, then 1, then 2, ..., then n-1 elements, which is 0+1+2+...+(n-1) = n*(n-1)/2 = O(n^2).
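The arithmetic can be checked with a small counting sketch. It models the worst case only (every element lands in the same chain, and each insert scans the whole chain before appending); the function name `comparisons_for_rehash` is made up for the example and this is not how any particular standard library implements rehashing.

```python
def comparisons_for_rehash(n):
    """Count chain comparisons when n distinct, colliding elements are
    re-inserted one by one into a single bucket."""
    chain = []
    comparisons = 0
    for element in range(n):
        for existing in chain:       # scan for an equivalent value
            comparisons += 1
            if existing == element:  # never true here: all values are distinct
                break
        chain.append(element)
    return comparisons


# The total matches n*(n-1)/2 for a few sizes.
for n in (1, 10, 100):
    assert comparisons_for_rehash(n) == n * (n - 1) // 2
```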
Can't you optimize this to O(n)? To me it makes sense that you may be able to, but even if so, this doesn't mean that all implementations have to do it this way. When using hash tables it's generally assumed that there won't be too many collisions (even if this assumption is naive), thus avoiding the worst-case complexity, thus reducing the need for the additional complexity to have a rehash not take O(n^2).
[1]: To all the possible haters, sorry for quoting CPlusPlus instead of CPPReference (for everyone else - CPlusPlus is well-known for being wrong), but I couldn't find this information there (so, of course, it could be wrong, but I'm hoping it isn't, and it does make sense in this case).
I got asked this question at an interview and said to use a second hash function, but the interviewer kept probing me for other answers. Does anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
Use a displacement list, so you try a number of offsets from the hashed-to bucket until you find a free bucket (modding by table size). Displacement lists might look something like { 3, 5, 11, 19... } etc. - ideally you want to have the difference between displacements not be the sum of a sequence of other displacements.
Rehash using a different algorithm (but then you'd need yet another algorithm if you happen to clash twice etc.)
Root a container in the buckets, such that colliding strings can be searched for. Typically the number of buckets should be similar to or greater than the number of elements, so elements per bucket will be fairly small and a brute-force search through an array/vector is a reasonable approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest: adding an offset is cheaper than calculating another hash or supporting a separate heap allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) are enough to find an empty bucket, so the locality of memory use is reasonable. They are, however, more collision prone than an alternative hashing algorithm (which should approach a #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
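A displacement list can be sketched along these lines. The offsets 3, 5, 11, 19 are the ones from the text; the leading 0 (the home bucket itself), the class name `DisplacementTable`, the capacity of 64, and the choice to report failure by returning `False` are assumptions made for the example. Deletion is left out on purpose, since lookups rely on an empty slot meaning "not present".

```python
DISPLACEMENTS = [0, 3, 5, 11, 19]  # 0 is the home bucket; the rest are retries


class DisplacementTable:
    """Open addressing where retries follow a fixed list of offsets."""

    def __init__(self, capacity=64):
        self.slots = [None] * capacity

    def _probe(self, key):
        home = hash(key) % len(self.slots)
        for d in DISPLACEMENTS:
            yield (home + d) % len(self.slots)  # mod by table size

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return True
        # All retries exhausted: the caller needs last-resort handling,
        # e.g. growing the table or falling back to another structure.
        return False

    def contains(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return False        # an empty slot ends the search
            if self.slots[i] == key:
                return True
        return False
```

A longer displacement list lowers the chance of `insert` failing at the cost of more probes per miss.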
Use a linked list as the hash bucket. So any collisions are handled gracefully.
Alternative approach: You might want to consider using a trie instead of a hash table for dictionaries of strings.
The upside of this approach is that you get O(|S|) worst-case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that a hash table only gives you an average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when the load factor gets too high.
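A minimal trie sketch, to make the O(|S|) claim concrete. It assumes keys contain only the lowercase letters a-z, so each node can hold a fixed array of 26 children; the names `TrieNode` and `Trie` are made up for the example.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = [None] * 26   # one slot per letter a-z (assumption)
        self.is_word = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:               # one step per character: O(|S|)
            i = ord(ch) - ord("a")
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.is_word = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.children[ord(ch) - ord("a")]
            if node is None:
                return False
        return node.is_word
```

The cost of a lookup depends only on the length of the string, not on how many strings are stored, which is the trade-off against the extra memory per node.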
Assuming we are not using a perfect hash function (which you usually don't have) the hash tells you that:
if the hashes are different, the objects are distinct
if the hashes are the same, the objects are probably the same (if good hashing function is used), but may still be distinct.
So in a hash table, the collision will be resolved with some additional checking of whether the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gain a lot, because collisions rarely happen for good hashing functions). In a dictionary you just need to resolve those rare collision cases and ensure you get the right object out.
Using another non-perfect hash function will not resolve anything, it just reduces the chance of (another) collision.
I was going through Linear hashing article on Wiki. One line puzzled me and here it is:
" The cost of hash table expansion is spread out across each hash table insertion operation, as opposed to being incurred all at once.[2]"
In linear hashing, if the hash value of the item to be inserted is smaller than the split variable, then a new node (or bucket) is created and the value is inserted in that. According to the line above, the time complexity is measured over each insertion operation, so compared with a dynamic-array implementation, where we do amortized analysis, the insertion in linear hashing must take O(n) time. Please correct me if I am wrong.
One more thing: Second line on wiki says "Linear hashing is therefore well suited for interactive applications."
Can I compare a B+ tree with linear hashing in "interactive cases" (since both are extendible searching techniques)?
From what I know, O(n) is the worst-case time complexity, but in most cases a hash table returns results in constant time, which is O(1). As opposed to a B+ tree, where one must traverse the tree, a hash table works on a hashing function whose result points to the address of a stored value. In the worst case, if all the keys have the same hashing result, the time complexity might become O(n) because all the results will be stored in one bucket.
According to Wikipedia, a B+ tree has the following time complexities:
Inserting a record requires O(log_b n) operations
Finding a record requires O(log_b n) operations
An LH implementation can guarantee strictly bounded insertion time.
There's no reason for the split location and the key-hash location to be related, if collisions are handled by overflows. The trick is to link the creation of overflow slots to the split operation.
For example, if every Nth slot is always reserved to be an overflow slot, then you need to do at most N-1 splits to create a new overflow slot. In practice it's fewer than (N-1)/2 splits, because splitting one slot may free up an overflow slot.
http://goo.gl/6dbuH for a description, https://github.com/mischasan/hx for source code.
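For readers new to linear hashing, the basic split step that spreads expansion over individual insertions can be sketched as below. This is a plain textbook-style version with chained buckets, not the overflow-slot scheme described in this answer and not the hx implementation; the names `LinearHashSet`, `initial_buckets`, and `max_load` are assumptions made for the example.

```python
class LinearHashSet:
    """Linear hashing: the table grows by splitting one bucket at a time."""

    def __init__(self, initial_buckets=4, max_load=2.0):
        self.n0 = initial_buckets
        self.level = 0            # number of completed doubling rounds
        self.split = 0            # index of the next bucket to split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0
        self.max_load = max_load  # assumed average chain length that triggers a split

    def _address(self, key):
        h = hash(key)
        i = h % (self.n0 << self.level)
        if i < self.split:        # that bucket was already split this round
            i = h % (self.n0 << (self.level + 1))
        return i

    def __contains__(self, key):
        return key in self.buckets[self._address(key)]

    def add(self, key):
        bucket = self.buckets[self._address(key)]
        if key in bucket:
            return
        bucket.append(key)
        self.count += 1
        if self.count > self.max_load * len(self.buckets):
            self._split_one()     # at most one bucket is split per insertion

    def _split_one(self):
        old = self.buckets[self.split]
        wide = self.n0 << (self.level + 1)
        # Keys either stay put or move to the new bucket appended at the end.
        self.buckets[self.split] = [k for k in old if hash(k) % wide == self.split]
        self.buckets.append([k for k in old if hash(k) % wide != self.split])
        self.split += 1
        if self.split == self.n0 << self.level:  # round finished: table doubled
            self.level += 1
            self.split = 0
```

Each insertion touches one bucket and splits at most one other, instead of rehashing the whole table at once. Note that the bucket being split is chosen by the split pointer, not by where the new key landed, and that in this sketch the Python list holding the buckets is itself a dynamic array.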