Hash tables - complexity of insert, search, and delete

I've been given two homework problems on the complexity of hash tables, but I'm struggling to understand the difference between them.
They are as follows:
Consider a hash function which is to take n inputs and map them to a table of size m.
Write the complexity of insert, search, and deletion for the hash function which distributes all n inputs evenly over the buckets of the hash table.
Write the complexity of insert, search, and deletion for the (supposedly perfect but unrealistic) hash function which will never hash two items to the same bucket, i.e. this hash function will never result in a collision.
These two questions seem quite similar to me and I'm not really sure of their differences.
For question one, since the n inputs are distributed evenly we can assume there will be zero or one items in each bucket, so all of insert, search and delete will be O(1). Is this correct?
How then does question two differ in any way? If the function never results in a collision then all the items will be spread evenly so wouldn't this result in O(1) for each operation?
Is my thinking correct for these problems or am I missing something?
EDIT:
I believe I've identified where I've gone wrong. O(1) is correct for every operation in question 2, because the hash function is ideal and never results in a collision.
However, for question 1, the items are spread evenly, BUT that does not mean there is only 1 item in each bucket; every bucket could have 20 items in a linked list, for example. Insertion would still be O(1).
But what about search? It would be O(1) + the cost of searching the linked list. We don't know the list's length; we only know the items are spread evenly. Can we get an expression for the length in terms of n (number of inputs) and m (size of table)?

Your edit is on the right track.
Can we get an expression for the length in terms of n (number of inputs) and m (size of table)?
For question 1, if the table size is constrained in some way such that the load factor (i.e. the average number of items per bucket) n/m is greater than 1 and is neither constant nor within constant bounds, then you can postulate a relationship m = f(n); the load factor is then n/f(n), so the complexity will be O(n/f(n)) too.
In the second case, the complexity is always O(1).
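To make the load-factor argument concrete, here is a minimal Python sketch of a fixed-size chained hash table (the class and method names are only illustrative, not from the assignment). Insert appends to a bucket in O(1), while search and delete walk a chain whose expected length is n/m when the items are spread evenly:

```python
class ChainedHashTable:
    """Minimal sketch of a fixed-size chained hash table (illustrative only).

    With n items spread evenly over m buckets, each chain holds about n/m
    items, so insert is O(1) while search and delete cost O(1 + n/m).
    """
    def __init__(self, m):
        self.buckets = [[] for _ in range(m)]

    def insert(self, key, value):
        # Append to the chain: O(1), regardless of how long the chain is.
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def search(self, key):
        # Walk the chain: expected O(n/m) comparisons.
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):   # also expected O(n/m)
            if k == key:
                del bucket[i]
                return True
        return False
```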

Related

quick sort is slower than merge sort

I think quicksort is less efficient when sorting an array with duplicate data, right? When the datatype is char, the bigger the array (over 100,000 elements), the closer the running time gets to the n^2 order.
And assuming there is no duplicate data, to get the best case of a quicksort where the first element is chosen as the pivot, I think we can recursively swap the first and middle elements, dividing the already sorted array like a merge sort does. Right? Is there a general best case?
Lomuto partition scheme, which scans from one end to the other during partition, is slower with duplicates. If all the values are the same, then each partition step splits it into sizes 1 and n-1, a worst case scenario.
Hoare partition scheme, which scans from both ends towards each other until the indexes (or iterators or pointers) cross, is usually faster with duplicates. Even though duplicates result in more swaps, each swap occurs just after reading and comparing two values to the pivot, so they are still in the cache for the swap (assuming object size is not huge). As the number of duplicates increases, the splitting improves towards the ideal case where each partition step splits the data into two equal halves. I ran a benchmark sorting 16 million 64-bit integers: with random data it took about 1.37 seconds, improving with duplicates, and with all values the same it took about 0.288 seconds.
Another alternative is a 3-way partition, which splits a partition into elements < pivot, elements == pivot, and elements > pivot. If all the elements are the same, it's done in O(n) time. For n elements with only k possible values, the time complexity is O(n ⌈log3(k)⌉), and since k is constant, the time complexity is still O(n).
Wiki links:
https://en.wikipedia.org/wiki/Quicksort#Repeated_elements
https://en.wikipedia.org/wiki/Dutch_national_flag_problem
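For illustration, here is a minimal Python sketch of the 3-way (Dutch national flag) partition described above, with a randomly chosen pivot; it is a sketch rather than a tuned implementation:

```python
import random

def quicksort_3way(a, lo=0, hi=None):
    """Sort a[lo..hi] in place using 3-way (Dutch national flag) partitioning."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[random.randint(lo, hi)]
    lt, i, gt = lo, lo, hi
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]   # grow the "< pivot" region
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]   # grow the "> pivot" region
            gt -= 1
        else:
            i += 1                      # equal to pivot: leave in the middle
    # Only the < and > regions are recursed; the == region is already in place.
    quicksort_3way(a, lo, lt - 1)
    quicksort_3way(a, gt + 1, hi)

if __name__ == "__main__":
    data = [3, 1, 3, 3, 2, 1]
    quicksort_3way(data)
    print(data)   # [1, 1, 2, 3, 3, 3]
```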

Hash table O(1) amortized or O(1) average amortized?

This question may seem a bit pedantic, but I've been really trying to dive deeper into amortized analysis and am a bit confused as to why insert for a hash table is O(1) amortized. (Note: I'm not talking about table doubling; I understand that.)
Using this definition, "Amortized analysis gives the average performance (over time) of each operation in the worst case," it seems like the worst case for N inserts into a hash table would result in a collision for every operation. I believe universal hashing guarantees a collision rate of 1/m when the load factor is kept low, but isn't it still theoretically possible to get a collision for every insert?
It seems like, technically, insert for a hash table is O(1) average amortized rather than O(1) amortized.
Edit: You can assume the hashtable uses basic chaining where the element is placed at the end of the corresponding linked list. The real meat of my question refers to amortized analysis on probabilistic algorithms.
Edit 2:
I found this post on quicksort,
"Also there’s a subtle but important difference between amortized running time and expected running time. Quicksort with random pivots takes O(n log n) expected running time, but its worst-case running time is in Θ(n^2). This means that there is a small possibility that quicksort will cost (n^2) dollars, but the probability that this will happen approaches zero as n grows large." I think this probably answers my question.
You could theoretically get a collision on every insert, but that would mean you had a poorly performing hash function that failed to space out values across the "buckets" for keys. A theoretically perfect hash function would always put a new value into a new bucket, so that each key would refer to its own bucket. (I am assuming a chained hash table and referring to the chain field as a "bucket", just how I was taught.) A theoretically worst-case function would stick all keys into the same bucket, leading to a chain in that bucket of length N.
The idea behind the amortization is that, given a reasonably good hashing function, N inserts should take linear time in total, i.e. O(1) amortized per insert, because the number of times an insertion costs more than O(1) is greatly dwarfed by the number of times an insertion is simple and O(1). That is not to say that insertion is without any calculation (the hash function still has to be computed, and in some special cases hash functions can be more calculation-heavy than just looking through a list).
At the end of the day this brings us to an important concept in big-O which is the idea that when calculating time complexity you need to look at the most frequently executed action. In this case that is the insertion of a value that does not collide with another hash.
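As a rough illustration of why the expensive cases are rare with a decent hash function, here is a small, purely hypothetical Python experiment that throws n random keys into m chained buckets and reports the longest chain; with a load factor of 2, the longest chain stays far below n:

```python
import random
from collections import Counter

n, m = 100_000, 50_000   # illustrative sizes, load factor n/m = 2

# Count how many keys land in each bucket under Python's built-in hash.
chain_lengths = Counter(hash(random.getrandbits(64)) % m for _ in range(n))

print("load factor n/m:", n / m)
print("longest chain observed:", max(chain_lengths.values()))
print("buckets holding more than 8 keys:",
      sum(1 for c in chain_lengths.values() if c > 8))
```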

Number of comparisons during closed-address hashing?

Initially, all entries in the hash table are empty lists.
All elements with hash address i will be inserted into the linked list h[i]. If there is a collision during hashing of the keys, the key will be added to the end of the linked list.
For the average case of a successful search, do I count the comparison that checks whether h[i] is null? If it's null, it means the linked list is empty and the search should return not found. Should that be 1 comparison or 0 comparisons, in terms of complexity?
Sorry for this stupid question, I'm still learning algorithm complexity.
For "big-O" complexity it just doesn't matter, as there is no such thing as "O(2N+1)" complexity (from counting element and pointer comparisons) - it simplifies to O(N), where N is the number of elements in the bucket h[i]. Alternatively, you might say the average big-O complexity across buckets is O(N) where N is size / buckets, aka load factor.
If you're not doing big-O complexity analysis, we can't really tell you what you want to count. I would point out that comparisons of pointers to nullptr are much cheaper than object comparison involving an extra level of indirection or scanning along a large object (e.g. std::string objects too long for any Short-String-Optimisation buffer), so can often be neglected.
If in doubt as to what's wanted, I'd suggest you report the comparisons as in "searching for an element that's not present involves N object value comparisons and N+1 pointer comparisons, where N is the number of elements chained from h[i]".
If you must give just one expression (for example, some computerised multiple-choice test), I'd suggest a count of element comparisons is likely the desired answer - the number of value comparisons (i.e. 0 for an empty hash bucket), as it's most common to be interested in the complexity as a function of the number of data elements.
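If it helps, here is a small Python sketch (the Node/search_bucket names are hypothetical) that counts both kinds of comparisons during a chained search, matching the "N value comparisons and N+1 pointer comparisons" accounting above; the `is None` checks stand in for pointer comparisons:

```python
class Node:
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt

def search_bucket(head, key):
    """Search one chain, counting value and pointer comparisons separately."""
    value_cmps = 0
    pointer_cmps = 0
    node = head
    while True:
        pointer_cmps += 1          # check for end of chain (empty h[i] => 1 pointer cmp, 0 value cmps)
        if node is None:
            return False, value_cmps, pointer_cmps
        value_cmps += 1            # compare the stored key with the search key
        if node.key == key:
            return True, value_cmps, pointer_cmps
        node = node.next

# Unsuccessful search over a chain of N = 3 nodes:
bucket = Node(1, Node(2, Node(3)))
print(search_bucket(bucket, 42))   # (False, 3, 4) -> N value cmps, N+1 pointer cmps
```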
0 comparisons. If at h[i] you see a list of one entry and this is a hit (since you analyze successful search), this would be 1 comparison, and so on.

Separate chain Hashing for avoiding Hash collision

My knowledge of hash tables is limited and I am currently learning it. I have a question on Hash collision resolution by open hashing or separate chain hashing.
I understand that the hash buckets in this case hold a pointer to the linked list where all the elements that map to the same bucket are linked, so the search complexity would be on the order of O(n), where n is the number of elements in the linked list. Is there a way to make this simpler?
Also, if there is a constraint on the size of the linked list, say it can hold only 5 elements max, and more than 5 elements hash into the same bucket, what would be the best way to handle this scenario?
Any pointers for learning more on the above and any help would be greatly appreciated.
Hash collisions shouldn't be too common, otherwise you're doing something wrong (e.g. a bad hash function or not a big enough hash table). So the number of elements in each linked-list should be minimal and the O(n) complexity shouldn't be too bad.
You could theoretically replace it with one of many other data structures. A binary search tree, for example, would get O(log n) search time (assuming the items are comparable), but then insert time will be up to O(log n) instead of O(1), and it would take more space.
There should be no maximum on the number of elements in a list. If there were, you could probably resort to probing (e.g. linear probing), but deletions could be a nightmare as you may need to move elements around quite a bit.
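One common way to keep chains short, rather than capping their length, is to grow the table once the load factor crosses a threshold and rehash everything. Here is a Python sketch of that idea (the 0.75 threshold and doubling growth are illustrative choices, not prescribed anywhere above):

```python
class ResizingChainedTable:
    """Sketch: chained hash table that doubles its bucket array when the
    load factor (items / buckets) exceeds max_load, keeping chains short."""
    def __init__(self, capacity=8, max_load=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.max_load = max_load

    def insert(self, key, value):
        if (self.size + 1) / len(self.buckets) > self.max_load:
            self._grow()
        self.buckets[hash(key) % len(self.buckets)].append((key, value))
        self.size += 1

    def search(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

    def _grow(self):
        # Rehash every element into a table twice as large; this costs O(n)
        # occasionally, but amortizes to O(1) per insert.
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for key, value in bucket:
                self.buckets[hash(key) % len(self.buckets)].append((key, value))
```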

Linear hashing complexity

I was going through Linear hashing article on Wiki. One line puzzled me and here it is:
" The cost of hash table expansion is spread out across each hash table insertion operation, as opposed to being incurred all at once.[2]"
In the case of linear hashing, if the hash value of the item to be inserted is smaller than the split variable, then a new node (or bucket) is created and the value is inserted in it. And according to the line above, the time complexity is measured over each "insertion operation"; compared to the "dynamic array" implementation, where we do amortized analysis, insertion in linear hashing must then take O(n) time. Please correct me if I am wrong.
One more thing: the second line on the wiki says "Linear hashing is therefore well suited for interactive applications."
Can I compare a B+ tree with linear hashing in "interactive cases" (since both are extendible searching techniques)?
From what I know, O(n) is the worst-case time complexity, but in most cases a hash table would return results in constant time, which is O(1). As opposed to a B+ tree, where one must traverse the tree, hash tables work on a hashing function whose result points to the address of a stored value. In the worst case, if all the keys have the same hashing result, the time complexity might become O(n) because all the results will be stored in one bucket.
According to Wikipedia, a B+ tree has the following time complexities:
Inserting a record requires O(log_b n) operations
Finding a record requires O(log_b n) operations
An LH implementation can guarantee strictly bounded insertion time.
There's no reason for the split location and the key-hash location to be related, if collisions are handled by overflows. The trick is to link the creation of overflow slots to the split operation.
For example, if every Nth slot is always reserved to be an overflow slot, then you need to do at most N-1 splits to create a new overflow slot. In practice it's fewer than (N-1)/2 splits, because splitting one slot may free up an overflow slot.
http://goo.gl/6dbuH for a description, https://github.com/mischasan/hx for source code.
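As a rough sketch of how the split variable steers addressing in linear hashing (class and method names here are hypothetical, not taken from the hx source): a key is first mapped with the current level's modulus, and only remapped with the next level's modulus if it lands before the split pointer, i.e. in a bucket that has already been split:

```python
class LinearHashAddressing:
    """Minimal sketch of linear-hashing bucket addressing only
    (no storage or overflow handling); n0 is the initial bucket count."""
    def __init__(self, initial_buckets=4):
        self.n0 = initial_buckets
        self.level = 0    # completed doubling rounds
        self.split = 0    # next bucket due to be split

    def bucket_for(self, key):
        h = hash(key)
        b = h % (self.n0 * 2 ** self.level)
        if b < self.split:
            # This bucket has already been split, so use the finer modulus.
            b = h % (self.n0 * 2 ** (self.level + 1))
        return b

    def split_one_bucket(self):
        # One small split per insertion (or every few insertions) spreads the
        # expansion cost out instead of rehashing the whole table at once.
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1
            self.split = 0
```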