Linear hashing complexity

I was going through the Linear hashing article on Wikipedia. One line puzzled me, and here it is:
"The cost of hash table expansion is spread out across each hash table insertion operation, as opposed to being incurred all at once.[2]"
In the case of linear hashing, if the hash value of the item to be inserted is smaller than the split variable, then a new node (or bucket) is created and the value is inserted into it. According to the line above, the time complexity is measured over each "insertion operation"; if we compare this to the "dynamic array" implementation, where we do amortized analysis, insertion in linear hashing must take O(n) time. Please correct me if I am wrong.
One more thing: the second line on the wiki says, "Linear hashing is therefore well suited for interactive applications."
Can I compare a B+ tree with linear hashing in "interactive cases" (since both are extendible searching techniques)?

From what I know, O(n) is the worst-case time complexity, but in most cases a hash table returns results in constant time, which is O(1). As opposed to a B+ tree, where one must traverse the tree, a hash table works on a hashing function whose result points to the address of a stored value. In the worst case, if all the keys produce the same hash result, the time complexity can become O(n) because all the values will be stored in one bucket.
According to Wikipedia, a B+ tree has the following time complexities:
Inserting a record requires O(log_b n) operations.
Finding a record requires O(log_b n) operations.
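To put those bounds side by side, here is a rough back-of-the-envelope comparison (the values of n and b below are just illustrative assumptions):

import math

# A B+ tree of branching factor b needs about log_b(n) node visits per lookup,
# while a hash table needs roughly one bucket probe on average (assuming a
# decent hash function and a low load factor).
n = 1_000_000   # number of records (illustrative)
b = 100         # B+ tree branching factor (illustrative)

btree_levels = math.ceil(math.log(n, b))   # about 3 node visits
print(f"B+ tree: ~{btree_levels} node visits, hash table: ~1 probe on average")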

An LH implementation can guarantee strictly bounded insertion time.
There's no reason for the split location and the key-hash location to be related, if collisions are handled by overflows. The trick is to link the creation of overflow slots to the split operation.
For example, if every Nth slot is always reserved to be an overflow slot, then you need to do at most N-1 splits to create a new overflow slot. In practice it's fewer than (N-1)/2 splits, because splitting one slot may free up an overflow slot.
See http://goo.gl/6dbuH for a description and https://github.com/mischasan/hx for source code.
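For intuition on how linear hashing spreads the expansion cost, here is a minimal sketch (my own simplified version, not the hx code; the load-factor trigger and names are assumptions): each insert triggers at most one split, and a split rehashes only the single bucket at the split pointer, never the whole table.

class LinearHashTable:
    # Minimal linear hashing sketch: at most one bucket split per insert.
    def __init__(self):
        self.level = 0            # current splitting round
        self.split = 0            # next bucket to split
        self.n0 = 2               # initial number of buckets
        self.buckets = [[] for _ in range(self.n0)]
        self.count = 0
        self.max_load = 0.75      # split trigger (illustrative choice)

    def _addr(self, key):
        h = hash(key)
        a = h % (self.n0 * 2 ** self.level)
        if a < self.split:        # that bucket was already split this round
            a = h % (self.n0 * 2 ** (self.level + 1))
        return a

    def insert(self, key):
        self.buckets[self._addr(key)].append(key)
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()     # cost is O(one bucket), not O(n)

    def _split_one(self):
        # Split only the bucket at the split pointer, then advance the pointer.
        self.buckets.append([])
        old, self.buckets[self.split] = self.buckets[self.split], []
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1       # round finished: the table has doubled
            self.split = 0
        for key in old:           # redistribute just this bucket's keys
            self.buckets[self._addr(key)].append(key)

Lookups use the same _addr calculation, so the table can be searched at any point during a round of splitting.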

Related

Hash table O(1) amortized or O(1) average amortized?

This question may seem a bit pedantic, but I've been really trying to dive deeper into amortized analysis and am a bit confused as to why insert for a hash table is O(1) amortized. (Note: I'm not talking about table doubling; I understand that.)
Using this definition, "Amortized analysis gives the average performance (over time) of each operation in the worst case." It seems like the worst case for N inserts into a hash table would result in a collision for every operation. I believe universal hashing guarantees that any two keys collide with probability 1/m, so collisions are rare when the load factor is kept low, but isn't it still theoretically possible to get a collision on every insert?
It seems like, technically, the bound for a hash table's insert should be called O(1) average amortized.
Edit: You can assume the hash table uses basic chaining, where the element is placed at the end of the corresponding linked list. The real meat of my question refers to amortized analysis of probabilistic algorithms.
Edit 2:
I found this post on quicksort,
"Also there’s a subtle but important difference between amortized running time and expected running time. Quicksort with random pivots takes O(n log n) expected running time, but its worst-case running time is in Θ(n^2). This means that there is a small possibility that quicksort will cost (n^2) dollars, but the probability that this will happen approaches zero as n grows large." I think this probably answers my question.
You could theoretically get a collision on every insert, but that would mean you had a poorly performing hash function that failed to spread values across the "buckets" for keys. A theoretically perfect hash function would always put a new value into a new bucket, so that each key would refer to its own bucket. (I am assuming a chained hash table and referring to the chain field as a "bucket", just as I was taught.) A theoretically worst-case function would stick all keys into the same bucket, leading to a chain in that bucket of length N.
The idea behind the amortization is that, given a reasonably good hash function, you should end up with linear total time for N inserts (constant amortized time per insert), because the number of times an insertion costs more than O(1) is greatly dwarfed by the number of times insertion is simple and O(1). That is not to say that insertion is without any calculation (the hash function still has to be computed, and in some special cases hash functions can be more computationally heavy than just looking through a list).
At the end of the day this brings us to an important concept in big-O: when calculating time complexity, you need to look at the most frequently executed action. In this case, that is the insertion of a value that does not collide with another hash.
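As a quick, informal illustration of the gap between the average and the worst chain (a toy simulation, not a proof; random.randrange stands in for a well-behaved hash):

import random
from collections import Counter

random.seed(1)
m = 1024                          # buckets
n = 1024                          # inserted keys, so load factor alpha = 1.0
chains = Counter(random.randrange(m) for _ in range(n))

print("load factor:", n / m)
print("longest chain:", max(chains.values()))   # typically a single-digit number, not n

The average chain length is exactly the load factor n/m, while the longest chain is larger but still nowhere near n; that is the gap between the expected O(1) bound and the O(n) worst case.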

Given an array, find top 'x' in O(1)

I'm going to write the problem as I found it, and I will then explain what confuses me.
"A teacher is marking his students' work from 0-10, but he only marks with an 8 or above for a certain number 'x' (x=15, for example) of the 'n' students. You are given an array with all the students' marks in random order. Find the 'x' best marks in O(1)."
We have certainly been taught hashing, but this requires me to store all the data in a hash table, which is definitely not O(1). Maybe we don't have to take the 'conversion' into account? If we do, maybe the conversion combined with the search time afterwards will lead to a method different from hashing.
In that case, leaving O(1) aside, what is the fastest algorithm, including both the conversion and the search time?
Simple: It's not possible.
O(1) can only be achieved if the input size, the number of necessary comparisons, and the output size are all constants. You may argue that x could be treated as a constant, but it still doesn't work:
You need to inspect every single input element, all n of them, as the random input order does not allow any heuristic to guess where the xth element would be, even if you had already correctly identified the other x-1 elements in constant time.
As the problem is stated, there is no solution which can do it in the upper bounds of O(1) or O(x).
Let's just assume your instructor corrects his mistake, and gives you a revised version which correctly states O(n) as the required upper bound.
In that case your hash approach is (almost) correct. The catch of using a hash function is that you now need to account for potential collisions, which are the reason why hash maps don't work strictly in O(1), but rather only "on average" in O(1).
As you know all possible values (grades from 0-10), you can just allocate buckets with a known index. Inside each bucket you may use linked lists, as they also allow constant time insertions and linear time iteration.
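A minimal sketch of that bucket idea (the function and variable names are mine): one O(n) pass drops each mark into one of the 11 fixed grade buckets, then the x best are collected by walking the buckets from the top down.

def top_x_marks(marks, x):
    # Marks are integers 0..10, so 11 fixed buckets suffice (a counting-sort idea).
    buckets = [[] for _ in range(11)]
    for m in marks:                    # one O(n) pass over the input
        buckets[m].append(m)
    best = []
    for grade in range(10, -1, -1):    # walk buckets from the highest grade down
        for m in buckets[grade]:
            if len(best) == x:
                return best
            best.append(m)
    return best

print(top_x_marks([7, 10, 8, 3, 9, 10, 5, 8], 3))   # [10, 10, 9]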

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time), looking for shifts and masks that can be applied to the input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first, before we started thinking that their implementation required more computational complexity than other solutions).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in a 1-1 fashion. In your case, a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range whose size is a small multiple of N. GNU has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5-byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in the ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'd end up with something much slower to evaluate than the optimal search would be.
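For comparison, here is what the plain binary-search-on-bounds baseline could look like (a sketch; the table contents are made up from the example in the question, and anything not covered counts as uninteresting):

import bisect

# Sorted, non-overlapping entries: (low, high, handler_index). Illustrative data
# based on the pseudo-code in the question.
TABLE = [(2, 2, 0), (3, 3, 1), (8, 8, 2), (33, 122, 3), (10_000, 10_000, 39)]
LOWS = [low for low, _, _ in TABLE]

def lookup(value):
    i = bisect.bisect_right(LOWS, value) - 1   # rightmost entry with low <= value
    if i >= 0 and value <= TABLE[i][1]:
        return TABLE[i][2]                     # index into the jump table
    return None                                # uninteresting input

print(lookup(50), lookup(7))                   # 3 None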
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

Separate chain Hashing for avoiding Hash collision

My knowledge of hash tables is limited and I am currently learning it. I have a question on Hash collision resolution by open hashing or separate chain hashing.
I understand that the hash buckets in this case hold a pointer to the linked list where all the elements that map to the same key are linked. So the search complexity would be on the order of O(n), where n is the number of elements in that linked list. Is there a way to make this simpler?
Also, if there is a constraint on the size of the linked list, say it can hold only 5 elements max, and more than 5 elements hash into the same bucket, what would be the best way to handle this scenario?
Any pointers for learning more on the above and any help would be greatly appreciated.
Hash collisions shouldn't be too common, otherwise you're doing something wrong (e.g. a bad hash function or not a big enough hash table). So the number of elements in each linked-list should be minimal and the O(n) complexity shouldn't be too bad.
You could theoretically replace it with one of many other data structures. A binary search tree, for example, would get O(log n) search time (assuming the items are comparable), but then insert time will be up to O(log n) instead of O(1), and it would take more space.
There should be no maximum on the number of elements in a list. If there were, you could probably resort to probing (e.g. linear probing), but deletions could be a nightmare as you may need to move elements around quite a bit.
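For reference, a bare-bones separately chained table looks roughly like this (a sketch; it grows the bucket array when the load factor passes 0.75 instead of capping chain length):

class ChainedHashTable:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                     # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count > 0.75 * len(self.buckets):
            self._resize()                   # keep chains short on average

    def get(self, key):
        for k, v in self._bucket(key):       # O(chain length), O(1) on average
            if k == key:
                return v
        return None

    def _resize(self):
        old = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k, v in old:
            self._bucket(k).append((k, v))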

best way to resolve collisions in hashing strings

I got asked this question at an interview and said to use a second hash function, but the interviewer kept probing me for other answers. Anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
- Use a displacement list, so you try a number of offsets from the hashed-to bucket until you find a free bucket (modding by table size). Displacement lists might look something like { 3, 5, 11, 19... } etc. - ideally you want to have the difference between displacements not be the sum of a sequence of other displacements.
- Rehash using a different algorithm (but then you'd need yet another algorithm if you happen to clash twice etc.).
- Root a container in the buckets, such that colliding strings can be searched for. Typically the number of buckets should be similar to or greater than the number of elements, so elements per bucket will be fairly small and a brute-force search through an array/vector is a reasonable approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest (because adding an offset is cheaper than calculating another hash or supporting separate heap allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) are enough to find an empty bucket, so the locality of memory use is reasonable), though they're more collision prone than an alternative hashing algorithm (which should approach a #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
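A rough sketch of the displacement-list idea (the offsets and names here are illustrative, not taken from any particular implementation):

DISPLACEMENTS = [0, 3, 5, 11, 19]   # offsets tried in order (illustrative values)

def insert(table, key):
    # Open addressing with a small displacement list; table slots hold keys or None.
    m = len(table)
    base = hash(key) % m
    for d in DISPLACEMENTS:
        slot = (base + d) % m
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("displacement list exhausted; rehash or grow the table")

table = [None] * 16
print(insert(table, "apple"), insert(table, "pear"))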
Use a linked list as the hash bucket, so any collisions are handled gracefully.
Alternative approach: you might want to consider using a trie instead of a hash table for dictionaries of strings.
The upside of this approach is that you get O(|S|) worst-case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that a hash table gives you only an average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when the load factor gets too high.
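A minimal trie sketch for the O(|S|) claim (a dict-of-dicts toy version; the names are mine):

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:              # O(|S|): one step per character
            node = node.setdefault(ch, {})
        node["$"] = True             # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

t = Trie()
t.insert("hash")
print(t.contains("hash"), t.contains("has"))   # True False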
Assuming we are not using a perfect hash function (which you usually don't have), the hash tells you that:
if the hashes are different, the objects are distinct
if the hashes are the same, the objects are probably the same (if a good hashing function is used), but may still be distinct.
So in a hash table, the collision will be resolved with some additional checking of whether the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gain a lot, because collisions rarely happen with good hashing functions). In a dictionary you just need to resolve those rare collision cases and ensure you get the right object out.
Using another non-perfect hash function will not resolve anything; it just reduces the chance of (another) collision.