Hash table O(1) amortized or O(1) average amortized? - hash

This question may seem a bit pedantic but i've been really trying to dive deeper into Amortized analysis and am a bit confused as to why insert for a hash table is O(1) amortized.(Note: Im not talking about table doubling, I understand that)
Using this definition, "Amortized analysis gives the average performance (over time) of each operation in the worst case." It seems like the worst case for N inserts into a hashtable would result in a collision for every operation. I believe universal hashing guarantees collision at a rate of 1/m when the load balance is kept low, but isn't it still theoretically possible to get a collision for every insert?
It seems like technically the average amortized analysis for hashtable's insert is O(1).
Edit: You can assume the hashtable uses basic chaining where the element is placed at the end of the corresponding linked list. The real meat of my question refers to amortized analysis on probabilistic algorithms.
Edit 2:
I found this post on quicksort,
"Also there’s a subtle but important difference between amortized running time and expected running time. Quicksort with random pivots takes O(n log n) expected running time, but its worst-case running time is in Θ(n^2). This means that there is a small possibility that quicksort will cost (n^2) dollars, but the probability that this will happen approaches zero as n grows large." I think this probably answers my question.

You could theoretically get a collision every insert but that would mean that you had a poor performing hashing function that failed to space out values across the "buckets" for keys. A theoretically perfect hash function would always put a new value into a new bucket so that each key would refer to it's own bucket. (I am assuming a chained hash table and referring to the chain field as a "bucket", just how I was taught). A theoretically worst case function would stick all keys into the same bucket leading to a chain in that bucket of length N.
The idea behind the amortization is that given a reasonably good hashing function you should end up with a linear time for insert because the amount of times that insertion is > O(1) would be greatly dwarfed by the number of times that insertion is simple and O(1). That is not to say that insertion is without any calculation (the hash function still has to be calculated and in some special cases hash functions can be more calc heavy than just looking through a list).
At the end of the day this brings us to an important concept in big-O which is the idea that when calculating time complexity you need to look at the most frequently executed action. In this case that is the insertion of a value that does not collide with another hash.

Related

Given an array, find top 'x' in O(1)

I'm gonna write the problem as I found it and I will then explain what confuses me.
"A teacher is marking his students' work from 0-10 but he only marks with an 8 or above for a certain number 'x'(x=15 for example) of the 'n' students. You are given an array with all the students' marks in random order. Find the 'x' best marks in O(1)."
We certainly have been taught hashing but this requires me to store all the data in a hash table which is definitely not O(1). Maybe we don't have to take the 'conversion' into account? If we do , maybe the coversion combined with the search time after will lead to a method different than hashing.
In that case, leaving O(1) aside , what is the fastest algorithm including both the conversion and the search time?
Simple: It's not possible.
O(1) can only achieved if all of input size, number of necessary comparisons and output size are constants. You may argue that x could be treated as constant, but it still doesn't work:
You need to inspect every single input element, all n of them, as the random input order does not even allow any heuristics to guess where the xth element would be, even if you already had correctly guessed the other x-1 elements already in constant time.
As the problem is stated, there is no solution which can do it in the upper bounds of O(1) or O(x).
Let's just assume your instructor corrects his mistake, and gives you a revised version which correctly states O(n) as the required upper bound.
In that case your hash approach is (almost) correct. The catch of using a hash function, is that you now need to account for potential collisions on the hash function, which are the reason why hash maps don't work strictly in O(1), but rather only "on average" in O(1).
As you know all possible values (grades from 0-10), you can just allocate buckets with a known index. Inside each bucket you may use linked lists, as they also allow constant time insertions and linear time iteration.

Is QuickSort really the fastet sorting technique

Hello all this is my very first question here. I am new to datastructure and algorithms my teacher asked me to compare time complexity of different algorithms including: merge sort, heap sort, insertion sort, and quick sort. I search over internet and find out that quick sort is the fastest of all but my version of quick sort is the slowest of all (it sort 100 random integers in almost 1 second while my other sorting algorithms took almost 0 second). I tweak my quick sort logic many times (taking first value as pivot than tried to take middle value as pivot but in vain) I finally search the code over internet and there was not much difference in my code and code on internet. Now I really am confused that if this is behaviour of quick sort is natural (I mean whatever your logic is you are gonna get same results.) or there are some specific situations where you should use quick sort. In the end I know my question is not clear (I don't know how to ask besides my english is also not very good.) I hope someone can help me I really wanted to attach picture of awkward result I am having but I can't (reputation < 10).
Theoretically, quicksort is supposed to be the fastest algorithm for sorting, with a runtime of O(nlogn). It's worst case would be O(n^2), but only occurs if there are repeated values are equal to the pivot.
In your situation, I can only assume that your pivot value is not ideal in your data array, but is still able to sort the values using that pivot. Otherwise, your quicksort implementation is unfortunately incorrect.
Quicksort has O(n^2) worst-case runtime and O(nlogn) average case runtime. A good reason why Quicksort is so fast in practice compared to most other O(nlogn) algorithms such as Heapsort, is because it is relatively cache-efficient. Its running time is actually O(n/Blog(n/B)), where B is the block size. Heapsort, on the other hand, doesn't have any such speedup: it's not at all accessing memory cache-efficiently.
The value you choose as pivot may not be appropriate hence your sorting may be taking some time.You can avoid quicksort’s worst-case run time of O(n^2) almost entirely by using an appropriate choice of the pivot – such as picking it at random.
Also , the best and worst case often are extremes rarely occurring in practice.But any average case analysis assume some distribution of inputs. For sorting, the typical choice is the random permutation model (as assumed on Wikipedia).

Separate chain Hashing for avoiding Hash collision

My knowledge of hash tables is limited and I am currently learning it. I have a question on Hash collision resolution by open hashing or separate chain hashing.
I understand that the hash buckets in this case hold the pointer to the linked list where all the elements that map into the same key are linked. so the search complexity would be in the order of o(n) where n is the number of elements in the linked list. Is there a way to make this simpler ?
Also if there is a constraint on the size of the linked list, say it can hold only 5 elements max and if more than 5 elements hash into the same bucket, what would be the best way to handle this scenario ?
Any pointers for learning more on the above and any help would be greatly appreciated.
Hash collisions shouldn't be too common, otherwise you're doing something wrong (e.g. a bad hash function or not a big enough hash table). So the number of elements in each linked-list should be minimal and the O(n) complexity shouldn't be too bad.
You could theoretically replace it with one of many other data structures. A binary search tree, for example, would get O(log n) search time (assuming the items are comparable), but then insert time will be up to O(log n) instead of O(1), and it would take more space.
There should be no maximum on the number of elements in a list. If there were, you could probably resort to probing (e.g. linear probing), but deletions could be a nightmare as you may need to move elements around quite a bit.

best way to resolve collisions in hashing strings

I got asked this question at an interview and said to use a second has function, but the interviewer kept probing me for other answers. Anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
Use a displacement list, so you try a number of offsets from the
hashed-to bucket until you find a free bucket (modding by table
size). Displacement lists might look something like { 3, 5, 11,
19... } etc. - ideally you want to have the difference between
displacements not be the sum of a sequence of other displacements.
rehash using a different algorithm (but then you'd need yet another
algorithm if you happen to clash twice etc.)
root a container in the
buckets, such that colliding strings can be searched for. Typically
the number of buckets should be similar to or greater than the
number of elements, so elements per bucket will be fairly small and
a brute-force search through an array/vector is a reasonable
approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest (because adding an offset is cheaper than calculating another hash or support separate heap & allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) is enough to find an empty bucket so the locality of memory use is reasonable) though they're more collision prone than an alternative hashing algorithm (which should approach #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
Use a linked list as the hash bucket. So any collisions are handled gracefully.
Alternative approach: You might want to concider using a trie instead of a hash table for dictionaries of strings.
The up side of this approach is you get O(|S|) worst case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that hash table allows you only average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when load balance is too high.
Assuming we are not using a perfect hash function (which you usually don't have) the hash tells you that:
if the hashes are different, the objects are distinct
if the hashes are the same, the objects are probably the same (if good hashing function is used), but may still be distinct.
So in a hashtable, the collision will be resolved with some additional checking if the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gained a lot, because collisions rarely happen for good hashing functions). In a dictionary you just need to resolve that rare collision cases and assure you get the right object out.
Using another non-perfect hash function will not resolve anything, it just reduces the chance of (another) collision.

Linear hashing complexity

I was going through Linear hashing article on Wiki. One line puzzled me and here it is:
" The cost of hash table expansion is spread out across each hash table insertion operation, as opposed to being incurred all at once.[2]"
In case of linear hashing if hash value of item to be inserted is smaller than split variable then a new node (or bucket) is created and value inserted in that.And according to above line( the time complexity is measured over each "insertion operation" which if compared to "dynamic array" implementation where we do amortized analysis , the insertion in Linear hashing must take O(n) time. Please correct me if i am wrong.
One more thing: Second line on wiki says "Linear hashing is therefore well suited for interactive applications."
Can i compare B+ tree with Linear hashing in "interactive cases" (since both are extendible searching techniques) ?
From what I know O(n) is the worst time complexity but in most cases a hash table would return results in constant time which is O(1). As oppose to B+ tree where one must traverse the tree hash tables work on hashing function where the result of hashing function points to the address of a stored value. In the worst case if all the keys have same hashing results then the time complexity might become O(n) because all the results will be stored in one bucket.
According to wikipedia b plus tree has following time complexities.
Inserting a record requires O(logbn) operations
Finding a record requires O(logbn) operations
An LH implementation can guarantee strictly bounded insertion time.
There's no reason for the split location and the key-hash location to be related, if collisions are handled by overflows. The trick is to link the creation of overflow slots to the split operation.
For example, if every Nth slot is always reserved to be an overflow slot, then you need to do at most N-1 splits to create a new overflow slot. In practice it's fewer than (N-1)/2 splits, because splitting one slot may free up an overflow slot.
http://goo.gl/6dbuH for a description, https://github.com/mischasan/hx for source code.