I think quicksort becomes less efficient when sorting an array that contains duplicate data, right? When the data type is char, the bigger the array (over 100,000 elements), the closer the running time gets to the order of n^2.
And assuming there is no duplicate data, to construct the best case for a quicksort that uses the first element as the pivot, I think we can take an already sorted array and recursively swap the first and middle elements of each half, dividing it the way merge sort does. Is that right? Is there a general best case?
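Roughly what I mean, as a Python sketch (the function name is mine, and this only guarantees perfectly even splits if the partition step preserves the relative order of the non-pivot elements, which real Lomuto/Hoare partitions do not, so it is at best an approximation of the best case):

    # Sketch: build an (approximate) best-case input for quicksort with the
    # first element as pivot: put each segment's median first, recursively.
    def best_case(sorted_vals):
        if len(sorted_vals) <= 1:
            return list(sorted_vals)
        mid = len(sorted_vals) // 2
        return ([sorted_vals[mid]]
                + best_case(sorted_vals[:mid])
                + best_case(sorted_vals[mid + 1:]))

    print(best_case(list(range(7))))   # [3, 1, 0, 2, 5, 4, 6]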
The Lomuto partition scheme, which scans from one end to the other during partitioning, is slower with duplicates. If all the values are the same, each partition step splits the array into sizes 1 and n-1, which is the worst case.
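For reference, a minimal Lomuto sketch in Python (the textbook variant with the last element as pivot; function names are mine):

    # Minimal Lomuto partition (textbook form, last element as pivot).
    # With all-equal input every a[j] <= pivot comparison succeeds, so the
    # pivot lands at one end and the split is maximally lopsided.
    def lomuto_partition(a, lo, hi):
        pivot = a[hi]
        i = lo - 1
        for j in range(lo, hi):
            if a[j] <= pivot:
                i += 1
                a[i], a[j] = a[j], a[i]
        a[i + 1], a[hi] = a[hi], a[i + 1]
        return i + 1                      # final index of the pivot

    def quicksort_lomuto(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo < hi:
            p = lomuto_partition(a, lo, hi)
            quicksort_lomuto(a, lo, p - 1)
            quicksort_lomuto(a, p + 1, hi)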
The Hoare partition scheme, which scans from both ends towards each other until the indexes (or iterators or pointers) cross, is usually faster with duplicates. Even though duplicates result in more swaps, each swap occurs just after the two values have been read and compared to the pivot, so they are still in the cache for the swap (assuming the object size is not huge). As the number of duplicates increases, the splitting improves towards the ideal case where each partition step splits the data into two equal halves. I ran a benchmark sorting 16 million 64-bit integers: with random data it took about 1.37 seconds, improving with duplicates, and with all values the same it took about 0.288 seconds.
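And a minimal sketch of the Hoare scheme (the classic textbook form with the middle element as the pivot value, not the exact code used in the benchmark above; names are mine):

    # Minimal Hoare partition. Both scans stop on elements equal to the
    # pivot, so with many duplicates the indexes tend to cross near the
    # middle, giving a balanced split.
    def hoare_partition(a, lo, hi):
        pivot = a[(lo + hi) // 2]
        i, j = lo - 1, hi + 1
        while True:
            i += 1
            while a[i] < pivot:
                i += 1
            j -= 1
            while a[j] > pivot:
                j -= 1
            if i >= j:
                return j
            a[i], a[j] = a[j], a[i]

    def quicksort_hoare(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo < hi:
            p = hoare_partition(a, lo, hi)
            quicksort_hoare(a, lo, p)
            quicksort_hoare(a, p + 1, hi)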
Another alternative is a 3-way partition, which splits a partition into elements < pivot, elements == pivot, and elements > pivot. If all the elements are the same, it's done in O(n) time. For n elements with only k possible values, the time complexity is O(n ⌈log3(k)⌉), and since k is constant, the time complexity is still O(n).
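A minimal 3-way (Dutch national flag) partition sketch (names are mine):

    # 3-way quicksort sketch. Elements equal to the pivot end up in
    # a[lt..gt] and are excluded from both recursive calls, so an
    # all-equal array finishes after a single O(n) pass.
    def quicksort_3way(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if lo >= hi:
            return
        pivot = a[lo]
        lt, i, gt = lo, lo + 1, hi
        while i <= gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        quicksort_3way(a, lo, lt - 1)     # elements < pivot
        quicksort_3way(a, gt + 1, hi)     # elements > pivot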
Wiki links:
https://en.wikipedia.org/wiki/Quicksort#Repeated_elements
https://en.wikipedia.org/wiki/Dutch_national_flag_problem
I've been given two homework problems on the complexity of hash tables, but I'm struggling to understand the difference between them.
They are as follows:
Consider a hash function which is to take n inputs and map them to a table of size m.
Write the complexity of insert, search, and deletion for the hash function which distributes all n inputs evenly over the buckets of the hash table.
Write the complexity of insert, search, and deletion for the (supposedly perfect but unrealistic) hash function which will never hash two items to the same bucket, i.e. this hash function will never result in a collision.
These two questions seem quite similar to me and I'm not really sure of their differences.
For question one, since the n inputs are distributed evenly we can assume there will be zero or one items in each bucket, so all of insert, search and delete will be O(1). Is this correct?
How then does question two differ in any way? If the function never results in a collision then all the items will be spread evenly so wouldn't this result in O(1) for each operation?
Is my thinking correct for these problems or am I missing something?
EDIT:
I believe I've identified where I've gone wrong. O(1) is correct for every operation in question two because the hash function is ideal and never results in a collision.
However, for question one, the items are spread evenly, but that does NOT mean there is only one item in each bucket; every bucket could have 20 items in a linked list, for example. So insertion would still be O(1).
But what about search? It would be O(1) plus the cost of searching the linked list. But we don't know the size; we only know the items are spread evenly. Can we get an expression for the length in terms of n (number of inputs) and m (size of table)?
Your edit is on the right track.
Can we get an expression for the length in terms of n (number of inputs) and m (size of table)?
For question 1, if the hash table's size is constrained in some way so that the load factor (i.e. the number of items per bucket), n/m, is greater than 1 and not bounded by a constant, then you can postulate a relationship m = f(n); the load factor is then n/f(n), and the complexity of search and deletion will be O(n/f(n)) too. For example, if m = √n, the load factor is n/√n = √n, so search costs O(√n).
In the second case, the complexity is always O(1).
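As a toy illustration of the load factor (the numbers and names here are my own, not part of the question): with n keys spread evenly over m buckets, each chain holds about n/m items, so an unsuccessful search scans about n/m entries.

    # Toy chained hash table: n keys spread evenly over m buckets give
    # chains of about n/m items, so search cost grows with n/m.
    n, m = 100_000, 1_000
    buckets = [[] for _ in range(m)]
    for key in range(n):
        buckets[hash(key) % m].append(key)    # insert: O(1) per key

    chain_lengths = [len(b) for b in buckets]
    print(sum(chain_lengths) / m)             # average chain length: n/m = 100
    print(max(chain_lengths))                 # also ~ n/m when spread evenly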
Is it O(n) or O(n log n)? I have n elements that I need to insert into a hash table; what are the worst-case and average runtimes?
Worst case is unlimited. You need to calculate hash codes and may have to compare elements, and the time for that is not limited.
Assuming that calculating hashes and comparing elements takes constant time, the worst case for insertion is O(n^2). What saves you is the fact that the worst case would be exceedingly rare, assuming a halfway decent hash function. Average time for a decent implementation is O(n).
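To make that concrete, a toy sketch (assuming chaining with a duplicate check on insert; the bucket counts and hash choices are mine): if every key lands in the same bucket, the i-th insert scans a chain of length i, so building the table costs 0 + 1 + ... + (n-1), which is on the order of n^2/2 comparisons.

    # Worst case: every key hashes to the same bucket, so each insert
    # (with a duplicate check) scans the whole existing chain.
    def build_table(keys, m, h):
        buckets = [[] for _ in range(m)]
        comparisons = 0
        for k in keys:
            chain = buckets[h(k) % m]
            comparisons += len(chain)         # duplicate check scans the chain
            if k not in chain:
                chain.append(k)
        return buckets, comparisons

    n = 2_000
    _, worst = build_table(range(n), n, lambda k: 0)   # everything collides
    _, good = build_table(range(n), n, hash)           # nothing collides here
    print(worst, good)                                 # 1999000 (~ n^2/2) vs 0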
I have a file with millions of lines (actually it's an online stream of data, which means we receive it line by line). Each line consists of an array of integers which is not sorted (positive and negative), there is no limit on the individual numbers, the lengths differ, and we might have duplicate values within one line.
I want to remove the duplicate lines (if two lines have the same values, regardless of how they are ordered, we consider them duplicates). Is there any good hashing function for this?
We want to do this in O(n), where n is the number of lines (we can assume that the maximum possible number of elements in each line is constant, e.g. we have a maximum of 100 elements per line).
I've read some of the questions posted here on Stack Overflow and I also googled it; most of them were for cases where the arrays have the same length, or the integers are positive, or even, or already sorted. Is there any way to solve this in the general case?
My solution:
First we sort each line using a linear-time sorting algorithm, e.g. counting sort, then we put the values into a string and use an MD5 hash to put it into a hash set. If it's not in the set, we add it; if it's already in the set, we compare the arrays that have the same hash value.
Problem with this solution: sorting with counting sort takes a lot of space, since there is no limit on the numbers, and collisions are still possible.
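Roughly, in code (a sketch of my idea; here I use an ordinary comparison sort per line instead of counting sort to avoid the space issue, and I let the hash set compare full tuples on collision instead of using MD5):

    # Deduplicate lines whose multisets of integers are equal. Each line is
    # sorted (k <= ~100 elements, so effectively constant work per line) and
    # the sorted tuple is the set key; the set compares full tuples on hash
    # collision, so there are no false merges.
    def dedupe_stream(lines):
        seen = set()
        for line in lines:
            key = tuple(sorted(line))
            if key not in seen:
                seen.add(key)
                yield line

    stream = [[3, -1, 2], [2, 3, -1], [3, 3, -1], [-1, 3, 3]]
    print(list(dedupe_stream(stream)))    # [[3, -1, 2], [3, 3, -1]]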
The problem with using a hashing algorithm on a data set this large is that you have a high probability of two different lines hashing to the same value. You want to stay in O(n), but I am not sure that is possible given the size of the data and the accuracy needed. If you use heapsort, which is space efficient, and then traverse the sorted data removing consecutive lines that are the same, you can accomplish this in O(n log n).
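A sketch of that sort-then-scan idea (keying each line by its sorted contents, using the language's built-in sort rather than heapsort, and assuming the keys fit in memory):

    # O(n log n) alternative: sort the canonical keys, then drop lines whose
    # key equals the previous key.
    def dedupe_by_sorting(lines):
        keyed = sorted((tuple(sorted(line)), line) for line in lines)
        result, prev_key = [], None
        for key, line in keyed:
            if key != prev_key:
                result.append(line)
                prev_key = key
        return result

    print(dedupe_by_sorting([[2, 3, -1], [3, -1, 2], [1], [1]]))
    # [[2, 3, -1], [1]]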
I read that it's possible to make quicksort run in O(n log n).
The algorithm says to choose the median as the pivot at each step.
But suppose we have this array:
10 8 39 2 9 20
Which value will be the median?
In math, if I remember correctly, the median would be (39+2)/2 = 41/2 = 20.5.
I don't have a 20.5 in my array, though.
Thanks in advance.
You can choose either of them; asymptotically, as the input scales up, it does not matter which.
We're talking about the exact wording of the description of an algorithm here, and I don't have the text you're referring to. But I think in context, by "median" they probably meant not the mathematical median of the values in the list, but rather the middle point in the list, i.e. the median INDEX, which in this case would be 3 or 4. As coffNjava says, you can take either one.
The median is actually found by sorting the array first; in your example, arranging the numbers as 2 8 9 10 20 39 gives a median of (9+10)/2 = 9.5, the mean of the two middle elements, which doesn't help you at all. Using the median is sort of an ideal situation, but it would only work if the array were at least already partially sorted, I think.
With an even-length array, you can't find an exact middle element, so I believe you can use either of the two middle values. It'll throw off the efficiency a bit, but not substantially, unless you always end up sorting even-length arrays.
Finding the median of an unsorted set of numbers can be done in O(N) time, but it's not really necessary to find the true median for the purposes of quicksort's pivot. You just need to find a pivot that's reasonable.
As the Wikipedia entry for quicksort says:
In very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by R. Sedgewick).
Finding the median of three values is much easier than finding it for the whole collection of values, and for collections that have an even number of elements, it doesn't really matter which of the two 'middle' elements you choose as the potential pivot.
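For illustration, a minimal median-of-three sketch (the helper name is mine), applied to the array from the question:

    # Median-of-three pivot selection: pick the median of the first, middle,
    # and last elements of the current partition.
    def median_of_three(a, lo, hi):
        mid = (lo + hi) // 2
        # Sort the three candidate indexes by their values; take the middle one.
        return sorted([lo, mid, hi], key=lambda i: a[i])[1]

    a = [10, 8, 39, 2, 9, 20]
    p = median_of_three(a, 0, len(a) - 1)
    print(p, a[p])    # 5 20 -- the median of a[0]=10, a[2]=39, a[5]=20 is 20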
I am implementing an algorithm that performs quicksort with leftmost pivot selection up to a certain limit, and when the sublists become almost sorted, I will use insertion sort to sort those elements.
For leftmost pivot selection, I know the average-case complexity of quicksort is O(n log n) and the worst-case complexity, i.e. when the list is almost sorted, is O(n^2). On the other hand, insertion sort is very efficient on an almost sorted list of elements, with a complexity of O(n).
So I think the complexity of this hybrid algorithm should be O(n). Am I correct?
The most important thing for the performance of qsort, above all, is picking a good pivot. This means choosing an element that's as close to the median of the elements you're sorting as possible.
The worst case of O(n^2) in qsort comes about from consistently choosing 'bad' pivots in every partition pass. This causes the partitions to be extremely lopsided rather than balanced, e.g. a 1 : n-1 element partition ratio.
I don't see how adding insertion sort into the mix as you've described would help or mitigate this problem.
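For reference, a sketch of the hybrid described in the question (the cutoff value and helper names are mine, not fixed parts of the algorithm); note that with a leftmost pivot, an already-sorted input still degrades towards O(n^2) no matter where the cutoff sits, since the recursion still goes one level per element above the cutoff:

    # Hybrid: quicksort with leftmost pivot, falling back to insertion sort
    # once a partition is smaller than CUTOFF.
    CUTOFF = 16

    def insertion_sort(a, lo, hi):
        for i in range(lo + 1, hi + 1):
            key, j = a[i], i - 1
            while j >= lo and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key

    def hybrid_sort(a, lo=0, hi=None):
        if hi is None:
            hi = len(a) - 1
        if hi - lo + 1 <= CUTOFF:
            insertion_sort(a, lo, hi)
            return
        pivot = a[lo]                       # leftmost pivot
        i, j = lo + 1, hi
        while True:                         # scan from both ends
            while i <= hi and a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i >= j:
                break
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
        a[lo], a[j] = a[j], a[lo]           # place pivot at its final slot
        hybrid_sort(a, lo, j - 1)
        hybrid_sort(a, j + 1, hi)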