How hashtables are linear on same or collision values? - hash

I was looking at this StackOverflow answer to understand hashing better and saw the following (regarding the fact that we would need to get bucket size in constant time):
if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.
What does this mean that it's "linear on the number of items that hashed to the same or a colliding value"? Wouldn't it be linear on total number of items in the hashtable, since it's possible that it will need to walk through every value during linear probing? I don't see why it would just have to go through the ones that collided.
Like for example, if I am using linear probing (step size 1) on a hashtable and I have different keys (not colliding, all hash to unique values) mapping to the odd index slots 1,3,5,7,9..... Then, I want to insert many keys that all hash to 2, so I fill up all my even index spots with those keys. If I wanted to know how many keys hash to 2, wouldn't I need to go through the entire hash table? But I'm not just iterating through items that hashed to the same or colliding value, since the odd index slots are not colliding.

A hash table is conceptually similar to an array (table) of linked lists (bucket in the table). The difference is in how you manage and access that array: using a function to generate a number that is used to compute the array index.
Once you have two elements placed in the same bucket (the same computed value, i.e. collission), then the problem turns out to be a search in a list. The number of elements in the list is hopefully lower than the total elements in the hash table (meaning that other elements exist in other buckets).
However, you are skipping the important introduction in that paragraph:
If you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though -- it's linear on the number of items that hashed to the same or a colliding value.
Linear probing is a different implementation of a hash table in which you don't use any list (chain) for your collissions. Instead, you just find the nearest available spot in the array, starting from the expected position and going forward. The more populated the array is, the higher the chance is to find that the next position is being used too, so you just need to keep searching. The positions are used by items that hashed to the same or colliding value, although you are never (and you don't really care) which of these two cases is, unless you explicitly see the hash of the existing element there.
This CppCon presentation video makes a good introduction and in-depth analysis of hash tables.

Related

Unusual hash map implementation

Does anyone know the original hash table implementation?
Every realization I've found is based on separate chaining or open addressing methods
Chaining, by Hans Peter Luhn, in 1953.
https://en.wikipedia.org/wiki/Hash_table#History
The first implementation, not that the most common, is probably the one that uses an array (which is resized as needed) where each entry points to a list of elements.
The hash code, computed mod the size of the array, points to the integer index at which the list of the element to be searched is located. In case of hash code collision, the elements will accumulate in the list of the related entry.
So, once the hash code is computed, we have O(1) for accessing the entry of the array and O(N) for the actual search of the element in the list by verifying its actual equality. The value of N must be kept low for obvious performance consequences.
In case the collision becomes high we resize the array by increasing the number of entries and decreasing the collisions accordingly. This occurs as the hash code mod a higher number than the previous one is computed.
Some more complicated implementations convert the lists to trees if they become too long so that O(N) to O(log(N)) for equality search.

What is the big O of a perfect hash function?

Regular hash functions, in which collisions are probable, run in constant time: O(1). But what is the time complexity of a perfect hash function? Is it 1?
If the hash function is intended to be used to access a hash table, then there is no difference in terms of complexity between perfect and regular hash functions, since both of them may also create collisions in the table. The reason is that the index associated to an element in a hash table is the remainder of the division of the hash by the length of the table (usually a prime number). This is why two elements which hash to different values will collide if their remainder modulo the (said) prime is the same for both of them. This means that the time complexity of accessing the table is O(1) in both cases.
Note also that the computation of the hash usually depends on the size of the input. For instance, if the elements to be hashed are strings, good hashes take all their characters into account. Therefore, for the complexity to remain O(1), one has to limit the size (or length) of the inputs. Again, this applies to both perfect and regular hashes.

The hash table probability

I still confuse how to find hash table probability. I have hash table of size 20 with open addressing uses the hash function
hash(int x) = x % 20
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%.
I use birthday paradox concerns to find it https://en.wikipedia.org/wiki/Birthday_problem and seems get an incorrect answer. Where is my mistake?
calculating
1/2=1-e^(-n^2/(2*20))
ln(1/2)=ln(e)*(-n^2/40)
-0.69314718=-n^2/40
n=scr(27.725887)=5.265538
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%.
Well, it depends on a few things.
The simple case is that you've already performed 11 inserts with distinct and effectively random integer keys, such that 11 of the buckets are in use, and your next insertion uses another distinct and effectively random key so it will hash to any bucket with equal probability: clearly there's only a 9/20 chance of that bucket being unused which means your chance of a collision during that 12th insertion exceeds 50% for the first time. This is the answer most formulas, textbooks, people etc. will give you, as it's the most meaningful for situations where hash tables are used with strong hash functions and/or prime numbers of buckets etc. - the scenarios where hash tables shine and are particularly elegant.
Another not-uncommon scenario is that you're putting say customer ids for a business into the hash table, and you're assigning the customers incrementing id numbers starting at 1. Even if you've already inserted customers with ids 1 to 19, you know they're in buckets [1] to [19] with no collisions - your hash just passes the keys through without the mod kicking in. You can then insert customer 20 into bucket [0] (after the mod operation) without a collision. Then, the 21st customer has 100% chance of a collision. (But, if your data's like this, please use an array and index directly using the customer id, or customer_id - 1 if you don't want to waste bucket [0].)
There are many other possible patterns in the keys that can affect when you exceed a 50% probability of a collision: e.g. all the keys being odd or multiples of some value, or being say ages or heights with a particular distribution curve.
The mistake with your use of the Birthday Paradox is thinking it answers your question. When you put "1/2" and "20" into the formula, it's telling you that the point at which your cumulative probability of a collision reaches 1/2, but your question is "the probability of the next element hitting a collision exceeds 50%" (emphasis mine).

Why is the cost of a hash lookup O(1) when evaluating the hash function might take more time than that?

A HashMap (or) HashTable is an example of keyed array. Here, the indices are user-defined keys rather than the usual index number. For example, arr["first"]=99 is an example of a hashmap where theb key is first and the value is 99.
Since keys are used, a hashing function is required to convert the key to an index element and then insert/search data in the array. This process assumes that there are no collisions.
Now, given a key to be searched in the array and if present, the data must be fetched. So, every time, the key must be converted to an index number of the array before the search. So how does that take a O(1) time? Because, the time complexity is dependent on the hashing function also. So the time complexity must be O(hashing function's time).
When talking about hashing, we usually measure the performance of a hash table by talking about the expected number of probes that we need to make when searching for an element in the table. In most hashing setups, we can prove that the expected number of probes is O(1). Usually, we then jump from there to "so the expected runtime of a hash table lookup is O(1)."
This isn't necessarily the case, though. As you've pointed out, the cost of computing the hash function on a particular input might not always take time O(1). Similarly, the cost of comparing two elements in the hash table might also not take time O(1). Think about hashing strings or lists, for example.
That said, what is usually true is the following. If we let the total number of elements in the table be n, we can say that the expected cost of performing a looking up the hash table is independent of the number n. That is, it doesn't matter whether there are 1,000,000 elements in the hash table or 10100 - the number of spots you need to prove is, on average, the same. Therefore, we can say that the expected cost of performing a lookup in a hash table, as a function of the hash table size, is O(1) because the cost of performing a lookup doesn't depend on the table size.
Perhaps the best way to account for the cost of a lookup in a hash table would be to say that it's O(Thash + Teq), where Thash is the time required to hash an element and Teq is the time required to compare two elements in the table. For strings, for example, you could say that the expected cost of a lookup is O(L + Lmax), where L is the length of the string you're hashing and Lmax is the length of the longest string stored in the hash table.
Hope this helps!

Separate chain Hashing for avoiding Hash collision

My knowledge of hash tables is limited and I am currently learning it. I have a question on Hash collision resolution by open hashing or separate chain hashing.
I understand that the hash buckets in this case hold the pointer to the linked list where all the elements that map into the same key are linked. so the search complexity would be in the order of o(n) where n is the number of elements in the linked list. Is there a way to make this simpler ?
Also if there is a constraint on the size of the linked list, say it can hold only 5 elements max and if more than 5 elements hash into the same bucket, what would be the best way to handle this scenario ?
Any pointers for learning more on the above and any help would be greatly appreciated.
Hash collisions shouldn't be too common, otherwise you're doing something wrong (e.g. a bad hash function or not a big enough hash table). So the number of elements in each linked-list should be minimal and the O(n) complexity shouldn't be too bad.
You could theoretically replace it with one of many other data structures. A binary search tree, for example, would get O(log n) search time (assuming the items are comparable), but then insert time will be up to O(log n) instead of O(1), and it would take more space.
There should be no maximum on the number of elements in a list. If there were, you could probably resort to probing (e.g. linear probing), but deletions could be a nightmare as you may need to move elements around quite a bit.