Sort a list by an ordered index - scala

Let us assume that I have the following two sequences:
val index = Seq(2,5,1,4,7,6,3)
val unsorted = Seq(7,6,5,4,3,2,1)
The first is the index by which the second should be sorted. My current solution is to traverse over the index and construct a new sequence with the found elements from the unsorted sequence.
val sorted = index.foldLeft(Seq[Int]()) { (s, num) =>
s ++ Seq(unsorted.find(_ == num).get)
}
But this solution seems very inefficient and error-prone to me. On every iteration it searches the complete unsorted sequence. And if the index and the unsorted list aren't in sync, then either an error is thrown or an element is omitted. In both cases, the elements that are not in sync should instead be appended to the ordered sequence.
Is there a more efficient and solid solution for this problem? Or is there a sort algorithm which fits into this paradigm?
Note: This is a constructed example. In reality I would like to sort a list of MongoDB documents by an ordered list of document IDs.
Update 1
I've selected the answer from Marius Danila because it seems to be the fastest and most Scala-ish solution for my problem. It doesn't handle items that are not in sync, but that is easy to add.
So here is the updated solution:
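import scala.collection.immutable.HashMap
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
def sort[T: ClassTag, Key](index: Seq[Key], unsorted: Seq[T], key: T => Key): Seq[T] = {
  val positionMapping = HashMap(index.zipWithIndex: _*)
  // positions come from index, so the array is sized by index
  val inSync = new Array[T](index.size)
  val notInSync = new ArrayBuffer[T]()
  for (item <- unsorted) {
    if (positionMapping.contains(key(item))) {
      inSync(positionMapping(key(item))) = item
    } else {
      notInSync.append(item)
    }
  }
  // unfilled slots are null only for reference types T
  inSync.filterNot(_ == null) ++ notInSync
}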
Update 2
The approach suggested by Bask.cc seems to be the right answer. It doesn't handle the not-in-sync case either, but that is also easy to add.
val index: Seq[String]
val entities: Seq[Foo]
val idToEntityMap = entities.map(e => e.id -> e).toMap
val sorted = index.map(idToEntityMap)
val result = sorted ++ entities.filterNot(sorted.toSet)

Why do you want to sort the collection when you already have a sorted index collection? You can just use map.
Concerning "In reality I would like to sort a list of mongodb documents by an ordered list of document Id's":
val ids: Seq[String]
val entities: Seq[Foo]
val idToEntityMap = entities.map(e => e.id -> e).toMap
ids.map(idToEntityMap)
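A minimal sketch of the same lookup that also tolerates an index and entity list that are out of sync. `Foo` and `orderByIds` are hypothetical names made up for illustration; the only assumption about the real document type is that it exposes an `id`.

```scala
// Hypothetical entity type; only the `id` field matters here.
final case class Foo(id: String, payload: String)

def orderByIds(ids: Seq[String], entities: Seq[Foo]): Seq[Foo] = {
  val byId: Map[String, Foo] = entities.map(e => e.id -> e).toMap
  // Looking up with get skips ids that have no matching entity instead of throwing
  val ordered = ids.flatMap(id => byId.get(id))
  val known = ids.toSet
  // Entities whose id is not in the index are appended in their original order
  ordered ++ entities.filterNot(e => known(e.id))
}

val ids = Seq("b", "x", "a")
val entities = Seq(Foo("a", "first"), Foo("c", "extra"), Foo("b", "second"))
val ordered = orderByIds(ids, entities)
// ordered == Seq(Foo("b", "second"), Foo("a", "first"), Foo("c", "extra"))
```

The id "x" has no entity and is silently dropped; the entity "c" is not in the index and ends up at the end.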

This may not exactly map to your use case, but Googlers may find this useful:
scala> val ids = List(3, 1, 0, 2)
ids: List[Int] = List(3, 1, 0, 2)
scala> val unsorted = List("third", "second", "fourth", "first")
unsorted: List[String] = List(third, second, fourth, first)
scala> val sorted = ids map unsorted
sorted: List[String] = List(first, second, third, fourth)

I do not know the language that you are using, but irrespective of the language, this is how I would solve the problem.
From the first list (here 'index'), create a hash table with the document id as the key and the position of the document in the sorted order as the value.
Then, when traversing the list of documents, I would look up each document id in the hash table to get the position it should have in the sorted order, and use that position to place the document in pre-allocated memory.
Note: if the number of documents is small, then instead of a hash table you could use a pre-allocated table and index it directly by document id.
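In Scala, the position table described above could be sketched roughly like this. Note that this variant swaps the direct placement into pre-allocated memory for a stable `sortBy` on the looked-up position, so it is O(n log n) rather than O(n), but items missing from the index simply sort to the end. `sortByIndex` is a made-up name.

```scala
def sortByIndex[T, K](index: Seq[K], unsorted: Seq[T])(key: T => K): Seq[T] = {
  // id -> position in the desired order
  val position: Map[K, Int] = index.zipWithIndex.toMap
  // Items whose id is not in the index get the largest rank and keep their relative order
  unsorted.sortBy(item => position.getOrElse(key(item), Int.MaxValue))
}

val sortedDocs = sortByIndex(Seq(2, 5, 1, 4, 7, 6, 3), Seq(7, 6, 5, 4, 3, 2, 1, 9))(identity)
// sortedDocs == Seq(2, 5, 1, 4, 7, 6, 3, 9)
```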

Flat-mapping the index over the unsorted list seems to be a safer version (if the index isn't found it's just dropped, since find returns None):
index.flatMap(i => unsorted.find(_ == i))
It still has to traverse the unsorted list every time (worst case this is O(n^2)). With your example I'm not sure that there's a more efficient solution.

In this case you can use zip-sort-unzip:
(unsorted zip index).sortWith(_._2 < _._2).unzip._1
Btw, if you can, a better solution would be to sort the list on the db side using $orderBy.
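To see what zip-sort-unzip does with the sequences from the question, here is a small sketch. It treats index(i) as the rank of unsorted(i), which is a different reading of "index" than the lookup-by-value used in the question, so the result differs from the question's expected order.

```scala
val index = Seq(2, 5, 1, 4, 7, 6, 3)
val unsorted = Seq(7, 6, 5, 4, 3, 2, 1)

// Pair each element with the rank at the same position, sort by rank, drop the ranks
val byRank = (unsorted zip index).sortBy(_._2).unzip._1
// byRank == Seq(5, 7, 1, 4, 6, 2, 3)
```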

Ok.
Let's start from the beginning.
Besides the fact that you're rescanning the unsorted list each time, the Seq object creates a List collection by default. So in the foldLeft you're appending an element at the end of the list each time, and this is an O(N^2) operation.
An improvement would be
val sorted_rev = index.foldLeft(Seq[Int]()) { (s, num) =>
unsorted.find(_ == num).get +: s
}
val sorted = sorted_rev.reverse
But that is still an O(N^2) algorithm. We can do better.
The following sort function should work:
import scala.collection.immutable.HashMap
import scala.reflect.ClassTag
def sort[T: ClassTag, Key](index: Seq[Key], unsorted: Seq[T], key: T => Key): Seq[T] = {
  val positionMapping = HashMap(index.zipWithIndex: _*) //1
  val arr = new Array[T](unsorted.size) //2
  for (item <- unsorted) { //3
    val position = positionMapping(key(item))
    arr(position) = item
  }
  arr //6
}
The function sorts the items in unsorted according to the sequence of ids in index; the key function is used to extract the id from the objects you're trying to sort.
Line 1 creates a reverse index - mapping each object id to its final position.
Line 2 allocates the array which will hold the sorted sequence. We're using an array since we need constant-time random-position set performance.
The loop that starts at line 3 traverses the sequence of unsorted items and places each item in its intended position using the positionMapping reverse index.
Line 6 will return the array converted implicitly to a Seq using the WrappedArray wrapper.
Since our reverse-index is an immutable HashMap, lookup should take constant-time for regular cases. Building the actual reverse-index takes O(N_Index) time where N_Index is the size of the index sequence. Traversing the unsorted sequence takes O(N_Unsorted) time where N_Unsorted is the size of the unsorted sequence.
So the complexity is O(max(N_Index, N_Unsorted)), which I guess is the best you can do in the circumstances.
For your particular example, you would call the function like so:
val sorted = sort(index, unsorted, identity[Int])
For the real case, it would probably be like this:
val sorted = sort(idList, unsorted, obj => obj.id)
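One caveat with the array-based version: for primitive element types such as Int, a fresh Array[T] is filled with 0 rather than null, so unfilled slots cannot be told apart from real values. A possible variation, sketched here under the assumption that out-of-sync items should go to the end, uses Option slots sized by the index instead. `sortWithLeftovers` is a hypothetical name, not part of the answer above.

```scala
import scala.collection.mutable.ArrayBuffer

def sortWithLeftovers[T, K](index: Seq[K], unsorted: Seq[T])(key: T => K): List[T] = {
  val position: Map[K, Int] = index.zipWithIndex.toMap
  // One slot per index entry; None marks index entries with no matching item
  val slots = ArrayBuffer.fill[Option[T]](index.size)(None)
  val leftovers = ArrayBuffer.empty[T]
  for (item <- unsorted) {
    position.get(key(item)) match {
      case Some(p) if slots(p).isEmpty => slots(p) = Some(item)
      // Unknown key, or a second item with the same key
      case _ => leftovers += item
    }
  }
  (slots.collect { case Some(item) => item } ++ leftovers).toList
}

val withLeftovers = sortWithLeftovers(Seq(2, 5, 1), Seq(1, 0, 2, 9))(identity)
// withLeftovers == List(2, 1, 0, 9)
```

It is still a single pass over each sequence, at the cost of one Option wrapper per slot.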

The best I can do is to create a Map from the unsorted data, and use map lookups (basically the hashtable suggested by a previous poster). The code looks like:
val unsortedAsMap = unsorted.map(x => x -> x).toMap
index.map(unsortedAsMap)
Or, if there's a possibility of hash misses:
val unsortedAsMap = unsorted.map(x => x -> x).toMap
index.flatMap(unsortedAsMap.get)
It's O(n) in time*, but you're swapping time for space, as it uses O(n) space.
For a slightly more sophisticated version, that handles missing values, try:
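import scala.collection.JavaConversions._ // deprecated since Scala 2.12; used to iterate the Java map below
import scala.collection.mutable.ListBuffer
val unsortedAsMap = new java.util.LinkedHashMap[Int, Int]
for (i <- unsorted) unsortedAsMap.put(i, i) // java.util.Map has put, not add
val newBuffer = ListBuffer.empty[Int]
for (i <- index) {
  // containsKey avoids comparing an unboxed Int with null
  if (unsortedAsMap.containsKey(i)) newBuffer += unsortedAsMap.remove(i)
  // Not sure what to do for "else"
}
// whatever is left was not referenced by the index
for ((k, v) <- unsortedAsMap) newBuffer += v
newBuffer.result()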
If it's a MongoDB database in the first place, you might be better retrieving documents directly from the database by index, so something like:
index.map(lookupInDB)
*technically it's O(n log n), as Scala's standard immutable map is O(log n), but you could always use a mutable map, which is O(1)
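The same idea can probably be written without the Java collection, using Scala's own mutable.LinkedHashMap, whose remove returns an Option. This is only a sketch for Int elements as in the constructed example; `orderWithRemainder` is a made-up name.

```scala
import scala.collection.mutable

def orderWithRemainder(index: Seq[Int], unsorted: Seq[Int]): List[Int] = {
  // Insertion-ordered, so the leftovers keep their original order
  val remaining = mutable.LinkedHashMap.empty[Int, Int]
  for (i <- unsorted) remaining.put(i, i)
  val out = mutable.ListBuffer.empty[Int]
  // remove returns Some(value) when the key is present, so missing ids are skipped
  for (i <- index; v <- remaining.remove(i)) out += v
  // Whatever was never referenced by the index goes to the end
  out ++= remaining.values
  out.toList
}

val withRemainder = orderWithRemainder(Seq(2, 5, 1, 8), Seq(7, 5, 1, 2))
// withRemainder == List(2, 5, 1, 7)
```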

Related

Filter a Scala Seq[(String, String)] using a Seq[String]

I have this Seq[(String, String)] :
val tupleSeq: Seq[(String, String)] = Seq(
("aaa", "A_A_A"),
("bbb", "B_B_B"),
("ccc", "C_C_C")
)
I want to use the given seqA on tupleSeq:
val seqA: Seq[String] = Seq("aaa", "bbb")
In order to obtain :
val seqB: Seq[String] = Seq("A_A_A", "B_B_B")
Any ideas?
One approach is to use the data unaltered.
// The size of `data` is M
// The size of `query` is N
val data: Seq[(String, String)] = Seq(
("aaa", "A_A_A"),
("bbb", "B_B_B"),
("ccc", "C_C_C")
)
val query: Seq[String] = Seq("aaa", "bbb")
// Use the data as is
// O(M * N)
for {
(key, value) <- data
lookup <- query
if key == lookup
} yield value
The problem with this approach is that the overall complexity is O(M * N), where M and N are the sizes of the data and query collections. This might be completely acceptable if either M or N are known to be very small and can be further improved in practical terms by making use of functions that can terminate early (like find, exemplified in another answer).
If M and N are reasonably large, you might want to spend the time necessary to convert them into an appropriate data structure (which consumes time and space in a way which is linear to the size of the collection).
Depending on which size you expect to be larger you might want to either turn the data into a map and look up the relevant keys or turn the query into a set and iterate each key in the map to find which is relevant.
I would expect the data to be queried in most cases to be larger than the query, so probably you may want to turn the data into a map. Keeping the map around would also allow you to query it multiple times without suffering from the time to turn it into a more appropriate structure for querying.
// Turn the query into a set and iterate the data
// O(M)
val lookups = query.toSet
data.collect {
case (key, value) if lookups.contains(key) => value
}
// Turn the data into a map and iterate the query
// O(N)
val map = data.toMap
query.collect(map)
Your tupleSeq naturally looks like a Map of key-to-value pairs, so you should treat it like one. The code becomes very simple with this observation:
val myMap = tupleSeq.toMap
val seqB = seqA.collect(myMap) // List(A_A_A, B_B_B)
For additional space complexity, you get O(1) amortized time complexity for your query, which is an acceptable trade-off and arguably a better solution than linear searches through the sequence.
Note the use of collect instead of map because it discards keys that do not have a mapping value in your Map.
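A small sketch of the difference, using a key ("zzz") that is deliberately missing from the data. The names here are illustrative only.

```scala
import scala.util.Try

val pairs = Seq("aaa" -> "A_A_A", "bbb" -> "B_B_B", "ccc" -> "C_C_C")
val lookup: Map[String, String] = pairs.toMap
val keys = Seq("aaa", "zzz", "bbb")

// collect uses the Map as a PartialFunction: keys without a mapping are skipped
val found = keys.collect(lookup)
// map calls Map.apply, which throws NoSuchElementException for "zzz"
val strict = Try(keys.map(lookup))
// get makes every miss explicit instead
val optional = keys.map(lookup.get)
// found == Seq("A_A_A", "B_B_B"); strict is a Failure; optional contains a None for "zzz"
```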
val tupleSeq: Seq[(String, String)] = Seq(
("aaa", "A_A_A"),
("bbb", "B_B_B"),
("ccc", "C_C_C")
)
val seqA: Seq[String] = Seq("aaa", "bbb")
// List(A_A_A, B_B_B)
val seqB = for {
key <- seqA
value <- tupleSeq.find(_._1 == key).map(_._2)
} yield value
You can try something like this:
val seqB = tupleSeq.filter{x => seqA.contains(x._1)}.map(x => x._2)
It filters the sequence and keeps the tuples where the first value is part of your second sequence, and then maps the tuples to the second value.
seqB.foreach(println) then outputs this:
A_A_A
B_B_B

Is there a way to filter out the elements of a List by checking them against elements of an Array in Scala?

I have a List in Scala:
val hdtList = hdt.split(",").toList
hdtList.foreach(println)
Output:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string
There is also an array, obtained by collecting a DataFrame column and converting it to an array as below:
val partition_columns = spColsDF.select("partition_columns").collect.flatMap(x => x.getAs[String](0).split(","))
partition_columns.foreach(println)
Output:
source_system_name
period_year
Is there a way to filter out the elements source_system_name string and period_year bigint from hdtList by checking them against the elements of the array partition_columns, and put them into a new List?
I am confused about how to apply filter/map to the right collections and compare them.
Could anyone let me know how I can achieve that?
Unless I'm misunderstanding the question, I think this is what you need:
val filtered = hdtList.filter { x =>
!partition_columns.exists { col => x.startsWith(col) }
}
In your case you need to use filter, because you need to remove elements from hdtList.
map is a function that transforms elements; there is no way to remove elements from a collection using map. If you have a List of X elements, then after map executes you still have X elements, no fewer and no more.
val newList = hdtList.filter( x => partition_columns.exists(x.startsWith) )
Be aware that combining filter and exists over two lists is an O(N×M) algorithm. If your lists are big, you will have a performance problem.
One way to solve that problem is using Sets.
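A rough sketch of the Set-based variant, using a shortened sample of the data from the question. It assumes each entry has the form "name type" with a single space after the column name. Matching the exact name also avoids false hits that startsWith could give when one column name is a prefix of another (for example period and period_year).

```scala
// Shortened sample of the question's data
val hdtList = List(
  "forecast_id bigint",
  "period_year bigint",
  "period_num bigint",
  "source_system_name string"
)
val partitionColumns = Array("source_system_name", "period_year")

// Building the Set once makes each membership check effectively constant time
val partitionSet: Set[String] = partitionColumns.toSet

// Assumes the column name is everything before the first space
def columnName(entry: String): String = entry.takeWhile(_ != ' ')

val selected = hdtList.filter(e => partitionSet(columnName(e)))
val others = hdtList.filterNot(e => partitionSet(columnName(e)))
// selected == List("period_year bigint", "source_system_name string")
// others == List("forecast_id bigint", "period_num bigint")
```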
It might be useful to have both lists: the hdt elements referenced in partition_columns, and the hdt elements that aren't.
val (pc, notPc) = hdtList.partition(w =>
  partition_columns.contains(w.takeWhile(_ != ' ')))
// pc: List[String] = List(period_year bigint, source_system_name string)
// notPc: List[String] = List(forecast_id bigint, period_num bigint, ... etc.

Optimize Sorting Iterable Values after grouping in Spark

I have an RDD[(String, (Int, Int))] and need to get the top 10 values (tuples) for each key after sorting. I tried:
val sortedRDD = rdd.groupByKey.mapValues( x => x.toList.sortWith((x,y) => <<sorting logic>>).take(10))
This throws an OutOfMemoryException because the Iterable[(Int, Int)] is very large for some keys. How should I handle this? Is there a way to do this without using .groupByKey()?
You should use aggregateByKey instead of groupByKey to perform the sorting and "trimming" (that keeps only top 10) while grouping instead of grouping into potentially-huge groups and only then mapping the result.
Here's how this could look:
// your sorting logic:
val sortingFunction: ((Int, Int), (Int, Int)) => Boolean = ???
val N = 10
val sortedRDD = rdd.aggregateByKey(List[(Int, Int)]())(
// first function: seqOp, how to add another item of the group to the result
{
case (topSoFar, candidate) if topSoFar.size < N => candidate :: topSoFar
case (topTen, candidate) => (candidate :: topTen).sortWith(sortingFunction).take(N)
},
// second function: combOp, how to combine two partial results created by seqOp
{ case (list1, list2) => (list1 ++ list2).sortWith(sortingFunction).take(N) }
)
Notice that per group, we always create values that are 10 items or less.
NOTE: performance can possibly be improved by performing fewer "sort" operations (we sort the same list again and again whenever we add another item / list). To solve that, you can consider using a "sorted set" with a limited capacity (see Limited SortedSet) as the value, so that each addition efficiently adds or discards the new value without sorting.
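One way to avoid the repeated sorting, sketched here in plain Scala without any Spark dependency, is to keep the accumulator sorted at all times and insert each candidate at its place. `insertTopN` is a hypothetical helper, and "top" is assumed to mean largest according to the implicit Ordering; the commented wiring into aggregateByKey is untested.

```scala
// Inserts `a` into a list that is already sorted in descending order and
// keeps only the first n elements. One pass over at most n elements, no re-sort.
def insertTopN[A](sortedDesc: List[A], a: A, n: Int)(implicit ord: Ordering[A]): List[A] = {
  val (bigger, rest) = sortedDesc.span(x => ord.gteq(x, a))
  (bigger ::: a :: rest).take(n)
}

// Merging two partial results is just repeated insertion
def mergeTopN[A](left: List[A], right: List[A], n: Int)(implicit ord: Ordering[A]): List[A] =
  right.foldLeft(left)((acc, x) => insertTopN(acc, x, n))

val top3 = List(5, 1, 9, 3, 7).foldLeft(List.empty[Int])((acc, x) => insertTopN(acc, x, 3))
// top3 == List(9, 7, 5)

// Assumed wiring into Spark (not run here):
// rdd.aggregateByKey(List.empty[(Int, Int)])(
//   (acc, x) => insertTopN(acc, x, 10),
//   (a, b) => mergeTopN(a, b, 10))
```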

How to sort a list in scala

I am a newbie in scala and I need to sort a very large list with 40000 integers.
The operation is performed many times. So performance is very important.
What is the best method for sorting?
You can sort the list with List.sortWith() by providing a suitable function literal. For example, the following code prints a sorted copy of the initial list, ordered alphabetically by the lowercased first character of each element:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s: String, t: String)
=> s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
A much shorter version, relying on Scala's type inference, is the following:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s, t) => s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
For integers there is List.sorted, just use this:
val list = List(4, 3, 2, 1)
val sortedList = list.sorted
println(sortedList)
just check the docs
List has several methods for sorting. myList.sorted works for types with an already defined order (like Int, String and others). myList.sortWith and myList.sortBy take a function that defines the order.
Also, first link on google for scala List sort: http://alvinalexander.com/scala/how-sort-scala-sequences-seq-list-array-buffer-vector-ordering-ordered
You can use (1 to 400000).toList.sorted
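Since the question stresses performance and repeated sorting, one option worth measuring is to copy the integers into an Array[Int] and sort the primitives in place with scala.util.Sorting.quickSort. This is only a sketch; whether it is actually faster than List.sorted for your data is an assumption to verify with a benchmark. `sortedCopy` is a made-up helper name.

```scala
import scala.util.Sorting

def sortedCopy(xs: Seq[Int]): Array[Int] = {
  val arr = xs.toArray   // unboxed copy of the input
  Sorting.quickSort(arr) // sorts the array in place
  arr
}

val sortedInts = sortedCopy(List(4, 3, 2, 1)).toList
// sortedInts == List(1, 2, 3, 4)
```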

Scala: Getting the index of an element in a Map

I have a Map[String, Object], where the key is an ID. That ID is now referenced within another object and I have to know what index (what position in the Map) that ID has. I can't believe there isn't an .indexOf
How can I accomplish that?
Do I really have to build myself a list with all the keys or another Map with ID1 -> 1, ID2 -> 2,... ?
I have to get the ID's indexes multiple times. Would a List or that Map be more efficient?
@Dora, as everyone mentioned, maps are unordered, so there is no way to index them in place and store an id with them.
It's hard to guess the use case for storing (K,V) pairs in a map and then getting a unique id for every K. So here are a few suggestions based on my understanding:
1. You could use LinkedHashMap instead of Map, which maintains insertion order, so you get stable iteration. Get the keysIterator of this map and convert it into a list, which gives you a unique index for every key in your map. Something like this:
import scala.collection.mutable.LinkedHashMap
val myMap = LinkedHashMap("a"->11, "b"->22, "c"->33)
val l = myMap.keysIterator.toList
l.indexOf("a") //get index of key a
myMap.+=("d"->44) //insert new element
val l = myMap.keysIterator.toList
l.indexOf("a") //index of a still remains 0 due to linkedHashMap
l.indexOf("d") //get index of newly inserted element.
Obviously, it is more expensive to insert elements into a LinkedHashMap than into a HashMap.
Deleting an element from the map automatically shifts the indexes of later elements to the left.
myMap.-=("b")
val l = myMap.keysIterator.toList
l.indexOf("c") // Index of "c" moves from 2 to 1.
2. Change your map (K->V) to (K->(index, v)) and generate the index manually while inserting new elements.
class ValueObject(val index: Int, val value: Int)
val myMap = scala.collection.mutable.Map[String, ValueObject]()
myMap.+=("a"-> new ValueObject(myMap.size+1, 11))
myMap("a").index // get index of key a
myMap.+=("b"-> new ValueObject(myMap.size+1, 22))
myMap.+=("c"-> new ValueObject(myMap.size+1, 33))
myMap("c").index // get index of key c
myMap("b").index // get index of key b
Deletion would be expensive if we need indexes with no gaps, as we would need to update all keys accordingly. However, key insertion and lookup will be faster.
This problem can be solved efficiently if we know exactly what you need, so please explain if the above solutions don't work for you. (Maybe you don't really need a map to solve your problem.)
Maps do not have indexes, by definition.
What you can do is enumerate the keys in a map, but this is not guaranteed to be stable/repeatable. Add a new element, and it may change the numbers of every other element randomly due to rehashing. If you want a stable mapping of key to index (or the opposite), you will have to store this mapping somewhere, e.g. by serializing the map into a list.
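If the ids are looked up many times, a separate key-to-position map built once is likely the cheaper option compared with calling indexOf on a list for every lookup. A minimal sketch, assuming the data lives in an insertion-ordered mutable.LinkedHashMap and does not change after the index is built (the ids and values are made up):

```scala
import scala.collection.mutable

// Insertion-ordered map, so the enumeration of keys is stable
val data = mutable.LinkedHashMap("id1" -> "x", "id2" -> "y", "id3" -> "z")

// Built once in O(n); each later lookup is a hash lookup instead of a list scan
val keyIndex: Map[String, Int] = data.keysIterator.zipWithIndex.toMap

val positionOfId2 = keyIndex.get("id2")   // Some(1)
val positionOfNope = keyIndex.get("nope") // None
```

The index has to be rebuilt (or updated) whenever keys are added to or removed from the underlying map.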
Convert the Map[A,B] to an ordered collection, for instance a Seq[(A,B)], which has an indexOf method. Note that for
val m = Map("a" -> "x", "b" -> "y")
that
m.toSeq
Seq[(String, String)] = ArrayBuffer((a,x), (b,y))
As the extensive discussions above point out, this conversion does not guarantee any specific ordering in the resulting collection. The collection can be sorted as necessary.
Alternatively, you can index the set of keys in the map, for instance:
val idxm = for ( ((k,v),i) <- m.zipWithIndex ) yield ((k,i),v)
Map[(String, Int),String] = Map((a,0) -> x, (b,1) -> y)
where an equivalent to indexOf would be for example
idxm.find { case ((_,i),v) => i == 1 }
Option[((String, Int), String)] = Some(((b,1),y))