Handling duplicate keys in quicksort

A naïve quicksort takes O(n^2) time to sort an array whose keys are all duplicates (in the extreme case, all equal), because every key equal to the pivot lands on the same side of it. There are ways to handle duplicate keys (such as the one described in Quicksort is Optimal). That solution only works with the Hoare partition, but I've implemented the Lomuto partition. To deal with duplicate keys, I alternate between placing duplicates to the left of the pivot and to the right of it. The algorithm works something like this:
//partition array from index start to end
select pivot element and move it to array[start]
boolean dupHandler=true;
int index=start;
for(i from start+1 to end)
    int val=array[start].compareTo(array[i]);
    if(val==0)
        if(dupHandler)
            swap array[++index] and array[i]
        dupHandler=!dupHandler;
    else if(val>0)
        swap array[++index] and array[i]
swap array[start] and array[index]
Is there a better (more efficient) way to handle duplicate keys?
EDIT: My code (as shown) requires compareTo to be consistent with equals (even though that's not a requirement).
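A runnable Java rendering of the pseudocode above might look like the following. It is only a sketch: the class name AlternatingLomuto and the middle-element pivot choice are assumptions, since the pseudocode leaves pivot selection open.

```java
// Sketch of the alternating-duplicates Lomuto partition described above.
// Assumptions (not in the original post): the class name and the
// middle-element pivot choice.
class AlternatingLomuto {

    // Sorts a[lo..hi] (both bounds inclusive) in place.
    static <T extends Comparable<T>> void sort(T[] a, int lo, int hi) {
        if (lo >= hi) {
            return;
        }
        int p = partition(a, lo, hi);
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }

    // Partitions a[start..end] and returns the pivot's final position.
    static <T extends Comparable<T>> int partition(T[] a, int start, int end) {
        // Move the chosen pivot (here: the middle element) to a[start].
        swap(a, start, start + (end - start) / 2);
        boolean dupHandler = true; // true: the next duplicate goes left
        int index = start;         // last slot of the "left" region
        for (int i = start + 1; i <= end; i++) {
            int val = a[start].compareTo(a[i]);
            if (val == 0) {
                if (dupHandler) {
                    swap(a, ++index, i);
                }
                dupHandler = !dupHandler; // alternate sides for duplicates
            } else if (val > 0) {
                swap(a, ++index, i);      // strictly smaller keys go left
            }
        }
        swap(a, start, index); // put the pivot between the two regions
        return index;
    }

    static <T> void swap(T[] a, int i, int j) {
        T tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

With eight equal keys, partition returns index 4, so the two recursive calls receive roughly equal halves instead of one call receiving everything.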

Related

How does scala's VectorMap work and how is it different than ListMap?

How does scala's VectorMap work? It says that it is constant time for look up.
I think ListMap has to iterate through everything to find an entry. Why would vector map be different?
Is it a hash table combined with a vector, where the hash table will map a key to an index in the vector, which has the entries?
Essentially, yes. It has a regular Map inside that maps keys to tuples (index, value), where index is pointing into a Vector of (keys), which is only used for in-order access (.head, .tail, .last, .iterator etc).
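The layout described in the answer can be illustrated with a small Java sketch. This is not Scala's implementation: the class InsertionOrderedMap and its methods are hypothetical names, the structure is mutable rather than persistent, and removal is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "hash map of key -> (index, value)
// plus a vector of keys"; not the real VectorMap.
class InsertionOrderedMap<K, V> {

    // The (index, value) tuple stored in the inner map.
    private static final class Slot<T> {
        final int index; // position of the key in the keys list
        T value;

        Slot(int index, T value) {
            this.index = index;
            this.value = value;
        }
    }

    private final Map<K, Slot<V>> slots = new HashMap<>();
    private final List<K> keys = new ArrayList<>();

    // Lookup goes through the hash map only, so it does not scan the keys.
    V get(K key) {
        Slot<V> slot = slots.get(key);
        return slot == null ? null : slot.value;
    }

    // A new key is appended to the key list; an existing key keeps its position.
    void put(K key, V value) {
        Slot<V> slot = slots.get(key);
        if (slot == null) {
            slots.put(key, new Slot<>(keys.size(), value));
            keys.add(key);
        } else {
            slot.value = value;
        }
    }

    // In-order access (the analogue of .head, .last, .iterator) uses the key list.
    List<K> keysInOrder() {
        return new ArrayList<>(keys);
    }
}
```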

Does Erlang Mnesia select on an ordered_set give a list in Erlang Term order?

The documentation doesn't make it clear to me whether I need to iterate in order with next, or perhaps foldl (it mentions that foldr goes in the opposite order to ordered_set, so presumably foldl goes in the same order), or whether I can use select and rely on the result being ordered (assuming an ordered_set table).
can I use select and rely upon it being ordered (assuming ordered_set table)
ets:select/2:
For tables of type ordered_set, objects are visited in the same order as in a first/next traversal. This means that the match
specification is executed against objects with keys in the first/next
order and the corresponding result list is in the order of that
execution.
ets:first/1:
Returns the first key Key in table Tab. For an ordered_set table, the
first key in Erlang term order is returned.
Table Traversal:
Traversals using match and select functions may not need to scan
the entire table depending on how the key is specified. A match
pattern with a fully bound key (without any match variables) will
optimize the operation to a single key lookup without any table
traversal at all. For ordered_set a partially bound key will limit the
traversal to only scan a subset of the table based on term order.
It would make no sense to me for a table of type ordered_set to return search results in a random order.

Complexity of insert in Hash Table

Consider an initially empty hash table of size M and hash function h(x) = x mod M. In the worst case, what is the time complexity (in Big-Oh notation) to insert n keys into the table if separate chaining is used to resolve collisions (without rehashing)? Suppose that each entry (bucket) of the table stores an unordered linked list. When adding a new element to an unordered linked list, such an element is inserted at the beginning of the list.
In the absence of collisions, inserting a key into a hash table/map is O(1), since looking up the bucket is a constant-time operation. I would not expect this to change in the case of collisions, assuming that collisions are resolved using a linked list and that the new element is inserted at the head of the list. The reason is that adding a new element to the head of a linked list is also O(1). So inserting under these assumptions should be O(1) per key, and therefore inserting n keys should be O(n).
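The argument can be made concrete with a minimal Java sketch of separate chaining with head insertion. The class and method names are hypothetical, and the sketch assumes, as the answer does, that an insert does not first search the chain for an existing key. If duplicates had to be rejected, each insert would scan its chain and the worst case for n keys would grow to O(n^2).

```java
// Hypothetical sketch: separate chaining with h(x) = x mod M and
// insertion at the head of an unordered linked list.
class ChainedHashTable {

    private static final class Node {
        final int key;
        final Node next;

        Node(int key, Node next) {
            this.key = key;
            this.next = next;
        }
    }

    private final Node[] buckets;

    ChainedHashTable(int m) {
        buckets = new Node[m];
    }

    // Constant work per call: one modulo, one allocation, one pointer update.
    // Assumption: no check for an already-present key.
    void insert(int key) {
        int b = Math.floorMod(key, buckets.length);
        buckets[b] = new Node(key, buckets[b]);
    }

    // Walks one chain; used here only to inspect the result.
    int chainLength(int bucket) {
        int n = 0;
        for (Node cur = buckets[bucket]; cur != null; cur = cur.next) {
            n++;
        }
        return n;
    }
}
```

Inserting 0, 4, 8 and 12 into a table with M = 4 puts all four keys in bucket 0, yet each insert still touches only the head of that chain.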

Efficiently take one value for each key out of a RDD[(key,value)]

My starting point is an RDD[(key,value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50 ± 20 values.
Now I'd like to take one value (doesn't matter which one) for each key. My current approach is the following:
HashPartition the RDD by the key. (There is no significant skew)
Group the tuples by key, resulting in RDD[(key, array of values)]
Take the first of each value array
Basically looks like this:
...
candidates
    .groupByKey()
    .map(c => (c._1, c._2.head))
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is in memory but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what I get out of the HashPartition. If I take the first tuple of each partition, I will get every key but maybe multiple tuples for a single key depending on the number of partitions? Or will I miss keys?
Thank you!
How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
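To see why a reduce that keeps its first argument avoids materializing the per-key groups, here is a plain-Java analogue on an in-memory list. It is not Spark code: FirstValuePerKey and firstPerKey are hypothetical names, and the sketch only mirrors the idea of merging values pairwise per key instead of collecting them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-machine analogue of reduceByKey((x, _) => x).
class FirstValuePerKey {

    static <K, V> Map<K, V> firstPerKey(List<Map.Entry<K, V>> pairs) {
        Map<K, V> result = new HashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            // merge stores the value if the key is new; otherwise it applies
            // the function to (existing, incoming) and keeps the existing one.
            result.merge(pair.getKey(), pair.getValue(), (kept, ignored) -> kept);
        }
        return result;
    }
}
```

At no point does this hold more than one value per key, whereas grouping first builds the whole collection of values for every key. On a real RDD, which value survives depends on how the data is partitioned and processed, which is fine here because any value will do.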

Overriding Ordering[Int] in Scala

I'm trying to sort an array of integers with a custom ordering.
E.g.
quickSort[Int](indices)(Ordering.by[Int, Double](value(_)))
Basically, I'm trying to sort the indices of rows by the values of a particular column. I end up with a StackOverflowError when I run this on a fairly large data set. If I use a more direct approach (e.g. sorting tuples), this is not a problem.
Is there a problem if you try to extend the default Ordering[Int]?
You can reproduce this like this:
val indices = (0 to 99999).toArray
val values = Array.fill[Double](100000)(math.random)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values(_))) // Works
val values2 = Array.fill[Double](100000)(0.0)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values2(_))) // Fails
Update:
I think that I found out what the problem is (am answering my own question). It seems that I've created a paradoxical situation by changing the ordering definition of integers.
Within the quickSort algorithm itself, array positions are also integers, and there are certain statements comparing positions of arrays. This position comparison should be following the standard integer ordering.
But because of the new definition, now these position comparators are also following the indexed value comparator and things are getting really messed up.
I suppose that, at least for the time being, I shouldn't change the default ordering of value types, as the library might depend on it.
Update 2:
It turns out that the above is in fact not the problem: there's a bug in quickSort when it is used together with a custom Ordering. When a new Ordering is defined, equality under that Ordering is given by 'equiv'; however, quickSort uses '=='. This results in the indices themselves being compared, rather than the indexed values.
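The distinction between the two notions of equality can be shown with a small Java analogue. It does not reproduce the Scala library code or the stack overflow; the class OrderingVsEquality and its methods are hypothetical, and equiv here simply mirrors what Ordering.equiv means (compare returns 0).

```java
import java.util.Comparator;

// Hypothetical illustration: two distinct indices can be equivalent under
// an ordering by indexed value while still being unequal as integers.
class OrderingVsEquality {

    // Orders indices by the value they point at, like Ordering.by(values(_)).
    static Comparator<Integer> byValue(double[] values) {
        return Comparator.comparingDouble(i -> values[i]);
    }

    // Equivalence under the ordering: the comparator reports no difference.
    static boolean equiv(Comparator<Integer> ord, int x, int y) {
        return ord.compare(x, y) == 0;
    }
}
```

With values {0.0, 0.0, 1.0}, indices 0 and 1 are equivalent under byValue although 0 == 1 is false. A sort that tests elements with == where it should test equivalence under the ordering will therefore treat equal-valued entries as different.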