Why is the cost of a hash lookup O(1) when evaluating the hash function might take more time than that? - hash

A HashMap (or) HashTable is an example of keyed array. Here, the indices are user-defined keys rather than the usual index number. For example, arr["first"]=99 is an example of a hashmap where theb key is first and the value is 99.
Since keys are used, a hashing function is required to convert the key to an index element and then insert/search data in the array. This process assumes that there are no collisions.
Now, given a key to be searched in the array and if present, the data must be fetched. So, every time, the key must be converted to an index number of the array before the search. So how does that take a O(1) time? Because, the time complexity is dependent on the hashing function also. So the time complexity must be O(hashing function's time).

When talking about hashing, we usually measure the performance of a hash table by talking about the expected number of probes that we need to make when searching for an element in the table. In most hashing setups, we can prove that the expected number of probes is O(1). Usually, we then jump from there to "so the expected runtime of a hash table lookup is O(1)."
This isn't necessarily the case, though. As you've pointed out, the cost of computing the hash function on a particular input might not always take time O(1). Similarly, the cost of comparing two elements in the hash table might also not take time O(1). Think about hashing strings or lists, for example.
That said, what is usually true is the following. If we let the total number of elements in the table be n, we can say that the expected cost of performing a looking up the hash table is independent of the number n. That is, it doesn't matter whether there are 1,000,000 elements in the hash table or 10100 - the number of spots you need to prove is, on average, the same. Therefore, we can say that the expected cost of performing a lookup in a hash table, as a function of the hash table size, is O(1) because the cost of performing a lookup doesn't depend on the table size.
Perhaps the best way to account for the cost of a lookup in a hash table would be to say that it's O(Thash + Teq), where Thash is the time required to hash an element and Teq is the time required to compare two elements in the table. For strings, for example, you could say that the expected cost of a lookup is O(L + Lmax), where L is the length of the string you're hashing and Lmax is the length of the longest string stored in the hash table.
Hope this helps!

Related

Unusual hash map implementation

Does anyone know the original hash table implementation?
Every realization I've found is based on separate chaining or open addressing methods
Chaining, by Hans Peter Luhn, in 1953.
https://en.wikipedia.org/wiki/Hash_table#History
The first implementation, not that the most common, is probably the one that uses an array (which is resized as needed) where each entry points to a list of elements.
The hash code, computed mod the size of the array, points to the integer index at which the list of the element to be searched is located. In case of hash code collision, the elements will accumulate in the list of the related entry.
So, once the hash code is computed, we have O(1) for accessing the entry of the array and O(N) for the actual search of the element in the list by verifying its actual equality. The value of N must be kept low for obvious performance consequences.
In case the collision becomes high we resize the array by increasing the number of entries and decreasing the collisions accordingly. This occurs as the hash code mod a higher number than the previous one is computed.
Some more complicated implementations convert the lists to trees if they become too long so that O(N) to O(log(N)) for equality search.

What is the big O of a perfect hash function?

Regular hash functions, in which collisions are probable, run in constant time: O(1). But what is the time complexity of a perfect hash function? Is it 1?
If the hash function is intended to be used to access a hash table, then there is no difference in terms of complexity between perfect and regular hash functions, since both of them may also create collisions in the table. The reason is that the index associated to an element in a hash table is the remainder of the division of the hash by the length of the table (usually a prime number). This is why two elements which hash to different values will collide if their remainder modulo the (said) prime is the same for both of them. This means that the time complexity of accessing the table is O(1) in both cases.
Note also that the computation of the hash usually depends on the size of the input. For instance, if the elements to be hashed are strings, good hashes take all their characters into account. Therefore, for the complexity to remain O(1), one has to limit the size (or length) of the inputs. Again, this applies to both perfect and regular hashes.

How hashtables are linear on same or collision values?

I was looking at this StackOverflow answer to understand hashing better and saw the following (regarding the fact that we would need to get bucket size in constant time):
if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.
What does this mean that it's "linear on the number of items that hashed to the same or a colliding value"? Wouldn't it be linear on total number of items in the hashtable, since it's possible that it will need to walk through every value during linear probing? I don't see why it would just have to go through the ones that collided.
Like for example, if I am using linear probing (step size 1) on a hashtable and I have different keys (not colliding, all hash to unique values) mapping to the odd index slots 1,3,5,7,9..... Then, I want to insert many keys that all hash to 2, so I fill up all my even index spots with those keys. If I wanted to know how many keys hash to 2, wouldn't I need to go through the entire hash table? But I'm not just iterating through items that hashed to the same or colliding value, since the odd index slots are not colliding.
A hash table is conceptually similar to an array (table) of linked lists (bucket in the table). The difference is in how you manage and access that array: using a function to generate a number that is used to compute the array index.
Once you have two elements placed in the same bucket (the same computed value, i.e. collission), then the problem turns out to be a search in a list. The number of elements in the list is hopefully lower than the total elements in the hash table (meaning that other elements exist in other buckets).
However, you are skipping the important introduction in that paragraph:
If you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though -- it's linear on the number of items that hashed to the same or a colliding value.
Linear probing is a different implementation of a hash table in which you don't use any list (chain) for your collissions. Instead, you just find the nearest available spot in the array, starting from the expected position and going forward. The more populated the array is, the higher the chance is to find that the next position is being used too, so you just need to keep searching. The positions are used by items that hashed to the same or colliding value, although you are never (and you don't really care) which of these two cases is, unless you explicitly see the hash of the existing element there.
This CppCon presentation video makes a good introduction and in-depth analysis of hash tables.

Number of comparison during a a closed address hashing?

Initially, all entries in the hash table are empty lists.
All elements with hash address i will be inserted into the linked list h[i]. If there is collision, during hashing of keys, the key will be added to the end of a linkedList.
For the average case of successful search, do i count it when the comparison is to check if the h[i] is null? if it's null it means that the linkedlist is null and it should return not found. Should it be 1 comparison or 0 comparison? in terms of complexity.
Sorry for this stupid question, i'm still learning algorithm complexity.
For "big-O" complexity it just doesn't matter, as there is no such thing as "O(2N+1)" complexity (from counting element and pointer comparisons) - it simplifies to O(N), where N is the number of elements in the bucket h[i]. Alternatively, you might say the average big-O complexity across buckets is O(N) where N is size / buckets, aka load factor.
If you're not doing big-O complexity analysis, we can't really tell you what you want to count. I would point out that comparisons of pointers to nullptr are much cheaper than object comparison involving an extra level of indirection or scanning along a large object (e.g. std::string objects too long for any Short-String-Optimisation buffer), so can often be neglected.
If in doubt as to what's wanted, I'd suggest you report the comparisons as in "searching for an element that's not present involves N object value comparisons and N+1 pointer comparisons, where N is the number of elements chained from h[i]".
If you must give just one expression (for example, some computerised multiple-choice test), I'd suggest a count of element comparisons is likely the desired answer - the number of value comparisons (i.e. 0 for an empty hash bucket), as it's most common to be interested in the complexity as a function of the number of data elements.
0 comparisons. If at h[i] you see a list of one entry and this is a hit (since you analyze successful search), this would be 1 comparison, and so on.

Is there a solution to creating a perfect hash table for non-finite inputs?

So hash tables are really cool for constant-time lookups of data in sets, but as I understand they are limited by possible hashing collisions which leads to increased small amounts of time-complexity.
It seems to me like any hashing function that supports a non-finite range of inputs is really a heuristic for reducing collision. Are there any absolute limitations to creating a perfect hash table for any range of inputs, or is it just something that no one has figured out yet?
I think this depends on what you mean by "any range of inputs."
If your goal is to create a hash function that can take in anything and never produce a collision, then there's no way to do what you're asking. This is a consequence of the pigeonhole principle - if you have n objects that can be hashed, you need at least n distinct outputs for your hash function or you're forced to get at least one hash collision. If there are infinitely many possible input objects, then no finite hash table could be built that will always avoid collisions.
On the other hand, if your goal is to build a hash table where lookups are worst-case O(1) (that is, you only have to look at a fixed number of locations to find any element), then there are many different options available. You could use a dynamic perfect hash table or a cuckoo hash table, which supports worst-case O(1) lookups and expected O(1) insertions and deletions. These hash tables work by using a variety of different hash functions rather than any one fixed hash function, which helps circumvent the above restriction.
Hope this helps!