Does anyone know the original hash table implementation?
Every realization I've found is based on separate chaining or open addressing methods
Chaining, by Hans Peter Luhn, in 1953.
https://en.wikipedia.org/wiki/Hash_table#History
The first implementation, not that the most common, is probably the one that uses an array (which is resized as needed) where each entry points to a list of elements.
The hash code, computed mod the size of the array, points to the integer index at which the list of the element to be searched is located. In case of hash code collision, the elements will accumulate in the list of the related entry.
So, once the hash code is computed, we have O(1) for accessing the entry of the array and O(N) for the actual search of the element in the list by verifying its actual equality. The value of N must be kept low for obvious performance consequences.
In case the collision becomes high we resize the array by increasing the number of entries and decreasing the collisions accordingly. This occurs as the hash code mod a higher number than the previous one is computed.
Some more complicated implementations convert the lists to trees if they become too long so that O(N) to O(log(N)) for equality search.
Related
I was looking at this StackOverflow answer to understand hashing better and saw the following (regarding the fact that we would need to get bucket size in constant time):
if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.
What does this mean that it's "linear on the number of items that hashed to the same or a colliding value"? Wouldn't it be linear on total number of items in the hashtable, since it's possible that it will need to walk through every value during linear probing? I don't see why it would just have to go through the ones that collided.
Like for example, if I am using linear probing (step size 1) on a hashtable and I have different keys (not colliding, all hash to unique values) mapping to the odd index slots 1,3,5,7,9..... Then, I want to insert many keys that all hash to 2, so I fill up all my even index spots with those keys. If I wanted to know how many keys hash to 2, wouldn't I need to go through the entire hash table? But I'm not just iterating through items that hashed to the same or colliding value, since the odd index slots are not colliding.
A hash table is conceptually similar to an array (table) of linked lists (bucket in the table). The difference is in how you manage and access that array: using a function to generate a number that is used to compute the array index.
Once you have two elements placed in the same bucket (the same computed value, i.e. collission), then the problem turns out to be a search in a list. The number of elements in the list is hopefully lower than the total elements in the hash table (meaning that other elements exist in other buckets).
However, you are skipping the important introduction in that paragraph:
If you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though -- it's linear on the number of items that hashed to the same or a colliding value.
Linear probing is a different implementation of a hash table in which you don't use any list (chain) for your collissions. Instead, you just find the nearest available spot in the array, starting from the expected position and going forward. The more populated the array is, the higher the chance is to find that the next position is being used too, so you just need to keep searching. The positions are used by items that hashed to the same or colliding value, although you are never (and you don't really care) which of these two cases is, unless you explicitly see the hash of the existing element there.
This CppCon presentation video makes a good introduction and in-depth analysis of hash tables.
I'm gonna write the problem as I found it and I will then explain what confuses me.
"A teacher is marking his students' work from 0-10 but he only marks with an 8 or above for a certain number 'x'(x=15 for example) of the 'n' students. You are given an array with all the students' marks in random order. Find the 'x' best marks in O(1)."
We certainly have been taught hashing but this requires me to store all the data in a hash table which is definitely not O(1). Maybe we don't have to take the 'conversion' into account? If we do , maybe the coversion combined with the search time after will lead to a method different than hashing.
In that case, leaving O(1) aside , what is the fastest algorithm including both the conversion and the search time?
Simple: It's not possible.
O(1) can only achieved if all of input size, number of necessary comparisons and output size are constants. You may argue that x could be treated as constant, but it still doesn't work:
You need to inspect every single input element, all n of them, as the random input order does not even allow any heuristics to guess where the xth element would be, even if you already had correctly guessed the other x-1 elements already in constant time.
As the problem is stated, there is no solution which can do it in the upper bounds of O(1) or O(x).
Let's just assume your instructor corrects his mistake, and gives you a revised version which correctly states O(n) as the required upper bound.
In that case your hash approach is (almost) correct. The catch of using a hash function, is that you now need to account for potential collisions on the hash function, which are the reason why hash maps don't work strictly in O(1), but rather only "on average" in O(1).
As you know all possible values (grades from 0-10), you can just allocate buckets with a known index. Inside each bucket you may use linked lists, as they also allow constant time insertions and linear time iteration.
Initially, all entries in the hash table are empty lists.
All elements with hash address i will be inserted into the linked list h[i]. If there is collision, during hashing of keys, the key will be added to the end of a linkedList.
For the average case of successful search, do i count it when the comparison is to check if the h[i] is null? if it's null it means that the linkedlist is null and it should return not found. Should it be 1 comparison or 0 comparison? in terms of complexity.
Sorry for this stupid question, i'm still learning algorithm complexity.
For "big-O" complexity it just doesn't matter, as there is no such thing as "O(2N+1)" complexity (from counting element and pointer comparisons) - it simplifies to O(N), where N is the number of elements in the bucket h[i]. Alternatively, you might say the average big-O complexity across buckets is O(N) where N is size / buckets, aka load factor.
If you're not doing big-O complexity analysis, we can't really tell you what you want to count. I would point out that comparisons of pointers to nullptr are much cheaper than object comparison involving an extra level of indirection or scanning along a large object (e.g. std::string objects too long for any Short-String-Optimisation buffer), so can often be neglected.
If in doubt as to what's wanted, I'd suggest you report the comparisons as in "searching for an element that's not present involves N object value comparisons and N+1 pointer comparisons, where N is the number of elements chained from h[i]".
If you must give just one expression (for example, some computerised multiple-choice test), I'd suggest a count of element comparisons is likely the desired answer - the number of value comparisons (i.e. 0 for an empty hash bucket), as it's most common to be interested in the complexity as a function of the number of data elements.
0 comparisons. If at h[i] you see a list of one entry and this is a hit (since you analyze successful search), this would be 1 comparison, and so on.
A HashMap (or) HashTable is an example of keyed array. Here, the indices are user-defined keys rather than the usual index number. For example, arr["first"]=99 is an example of a hashmap where theb key is first and the value is 99.
Since keys are used, a hashing function is required to convert the key to an index element and then insert/search data in the array. This process assumes that there are no collisions.
Now, given a key to be searched in the array and if present, the data must be fetched. So, every time, the key must be converted to an index number of the array before the search. So how does that take a O(1) time? Because, the time complexity is dependent on the hashing function also. So the time complexity must be O(hashing function's time).
When talking about hashing, we usually measure the performance of a hash table by talking about the expected number of probes that we need to make when searching for an element in the table. In most hashing setups, we can prove that the expected number of probes is O(1). Usually, we then jump from there to "so the expected runtime of a hash table lookup is O(1)."
This isn't necessarily the case, though. As you've pointed out, the cost of computing the hash function on a particular input might not always take time O(1). Similarly, the cost of comparing two elements in the hash table might also not take time O(1). Think about hashing strings or lists, for example.
That said, what is usually true is the following. If we let the total number of elements in the table be n, we can say that the expected cost of performing a looking up the hash table is independent of the number n. That is, it doesn't matter whether there are 1,000,000 elements in the hash table or 10100 - the number of spots you need to prove is, on average, the same. Therefore, we can say that the expected cost of performing a lookup in a hash table, as a function of the hash table size, is O(1) because the cost of performing a lookup doesn't depend on the table size.
Perhaps the best way to account for the cost of a lookup in a hash table would be to say that it's O(Thash + Teq), where Thash is the time required to hash an element and Teq is the time required to compare two elements in the table. For strings, for example, you could say that the expected cost of a lookup is O(L + Lmax), where L is the length of the string you're hashing and Lmax is the length of the longest string stored in the hash table.
Hope this helps!
My knowledge of hash tables is limited and I am currently learning it. I have a question on Hash collision resolution by open hashing or separate chain hashing.
I understand that the hash buckets in this case hold the pointer to the linked list where all the elements that map into the same key are linked. so the search complexity would be in the order of o(n) where n is the number of elements in the linked list. Is there a way to make this simpler ?
Also if there is a constraint on the size of the linked list, say it can hold only 5 elements max and if more than 5 elements hash into the same bucket, what would be the best way to handle this scenario ?
Any pointers for learning more on the above and any help would be greatly appreciated.
Hash collisions shouldn't be too common, otherwise you're doing something wrong (e.g. a bad hash function or not a big enough hash table). So the number of elements in each linked-list should be minimal and the O(n) complexity shouldn't be too bad.
You could theoretically replace it with one of many other data structures. A binary search tree, for example, would get O(log n) search time (assuming the items are comparable), but then insert time will be up to O(log n) instead of O(1), and it would take more space.
There should be no maximum on the number of elements in a list. If there were, you could probably resort to probing (e.g. linear probing), but deletions could be a nightmare as you may need to move elements around quite a bit.