Is there a way to filter out the elements of a List by checking them against elements of an Array in Scala? - scala

I have a List in Scala:
val hdtList = hdt.split(",").toList
hdtList.foreach(println)
Output:
forecast_id bigint,period_year bigint,period_num bigint,period_name string,drm_org string,ledger_id bigint,currency_code string,source_system_name string,source_record_type string,gl_source_name string,gl_source_system_name string,year string,period string
There is an array which is obtained from a dataframe and converting its column to array as below:
val partition_columns = spColsDF.select("partition_columns").collect.flatMap(x => x.getAs[String](0).split(","))
partition_columns.foreach(println)
Output:
source_system_name
period_year
Is there a way to filter out the elements: source_system_name string, period_year bigint from hdtList by checking them against the elements in the Array: partition_columns and put them into new List.
I am confused on applying filter/map on the right collections appropriately and compare them.
Could anyone let me know how can I achieve that ?

Unless I'm misunderstanding the question, I think this is what you need:
val filtered = hdtList.filter { x =>
!partition_columns.exists { col => x.startsWith(col) }
}

In your case you need to use filter, because you need to remove elements from hdtList.
Map is a function that transform elements, there is no way to remove elements from a collection using map. If you have a List of X elements, after map execution, you have X elements, not less, not more.
val newList = hdtList.filter( x => partition_columns.exists(x.startsWith) )
Be aware that the combination filter+exists between two List is an algorithm NxM. If your Lists are big, you will have a performance problem.
One way to solve that problem is using Sets.

It might be useful to have both lists: the hdt elements referenced in partition_columns, and the hdt elements that aren't.
val (pc
,notPc) = hdtList.partition( w =>
partition_columns.contains(w.takeWhile(_!=' ')))
//pc: List[String] = List(period_year bigint, source_system_name string)
//notPc: List[String] = List(forecast_id bigint, period_num bigint, ... etc.

Related

How to sort a list in scala

I am a newbie in scala and I need to sort a very large list with 40000 integers.
The operation is performed many times. So performance is very important.
What is the best method for sorting?
You can sort the list with List.sortWith() by providing a relevant function literal. For example, the following code prints all elements of sorted list which contains all elements of the initial list in alphabetical order of the first character lowercased:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s: String, t: String)
=> s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
Much shorter version will be the following with Scala's type inference:
val initial = List("doodle", "Cons", "bible", "Army")
val sorted = initial.sortWith((s, t) => s.charAt(0).toLower < t.charAt(0).toLower)
println(sorted)
For integers there is List.sorted, just use this:
val list = List(4, 3, 2, 1)
val sortedList = list.sorted
println(sortedList)
just check the docs
List has several methods for sorting. myList.sorted works for types with already defined order (like Int or String and others). myList.sortWith and myList.sortBy receive a function that helps defining the order
Also, first link on google for scala List sort: http://alvinalexander.com/scala/how-sort-scala-sequences-seq-list-array-buffer-vector-ordering-ordered
you can use List(1 to 400000).sorted

Scala: Getting the index of an element in a Map

I have a Map[String, Object], where the key is an ID. That ID is now referenced within another object and I have to know what index (what position in the Map) that ID has. I can't believe there isn't an .indexOf
How can I accomplish that?
Do I really have to build myself a list with all the keys or another Map with ID1 -> 1, ID2 -> 2,... ?
I have to get the ID's indexes multiple times. Would a List or that Map be more efficient?
#Dora, as everyone mentioned, maps are unordered so there is no way to index them in place and store id with them.
It's hard to guess use case of storing (K,V) pairs in map and then getting unique id for every K. So, these are few suggestions based on my understanding -
1. You could use LinkedHashMap instead of Map which will maintain the insertion order so you will get stable iteration. Get KeysIterator on this map and convert it into a list which give you an unique index for every key in you map. Something like this-
import scala.collection.mutable.LinkedHashMap
val myMap = LinkedHashMap("a"->11, "b"->22, "c"->33)
val l = myMap.keysIterator.toList
l.indexOf("a") //get index of key a
myMap.+=("d"->44) //insert new element
val l = myMap.keysIterator.toList
l.indexOf("a") //index of a still remains 0 due to linkedHashMap
l.indexOf("d") //get index of newly inserted element.
Obviously, it is expensive to insert elements in linkedHashMap compared to HashMaps.
Deleting element from Map would automatically shift indexes to left.
myMap.-=("b")
val l = myMap.keysIterator.toList
l.indexOf("c") // Index of "c" moves from 2 to 1.
Change you map (K->V) to (K->(index, v)) and generate index manually while inserting new elements.
class ValueObject(val index: Int, val value: Int)
val myMap = scala.collection.mutable.Map[String, ValueObject]()
myMap.+=("a"-> new ValueObject(myMap.size+1, 11))
myMap("a").index<br/> // get index of key a
myMap.+=("b"-> new ValueObject(myMap.size+1, 22))
myMap.+=("c"-> new ValueObject(myMap.size+1, 33))
myMap("c").index<br/> // get index of key c
myMap("b").index<br/> // get index of key b
deletion would be expensive if we need indexes with no gaps, as we need to update all keys accordingly. However, keys insertion and search will be faster.
This problem can be solved efficiently if we know exactly what you need so please explain if above solutions doesn't work for you !!! (May be you really don't need map for solving your problem)
Maps do not have indexes, by definition.
What you can do is enumerate the keys in a map, but this is not guaranteed to be stable/repeatable. Add a new element, and it may change the numbers of every other element randomly due to rehashing. If you want a stable mapping of key to index (or the opposite), you will have to store this mapping somewhere, e.g. by serializing the map into a list.
Convert the Map[A,B] to an ordered collection namely for instance Seq[(A,B)] equipped with indexOf method. Note for
val m = Map("a" -> "x", "b" -> "y")
that
m.toSeq
Seq[(String, String)] = ArrayBuffer((a,x), (b,y))
From the extensive discussions above note this conversion does not guarantee any specific ordering in the resulting ordered collection. This collection can be sorted as necessary.
Alternatively you can index the set of keys in the map, namely for instance,
val idxm = for ( ((k,v),i) <- m.zipWithIndex ) yield ((k,i),v)
Map[(String, Int),String] = Map((a,0) -> x, (b,1) -> y)
where an equivalent to indexOf would be for example
idxm.find { case ((_,i),v) => i == 1 }
Option[((String, Int), String)] = Some(((b,1),y))

aggregate data for uniquely tagged values in a list in scala

I was wondering if somebody could help.
I'm trying to aggregate some data in a list based on id values, I have a listBuffer which is updated from a foreach function. My output means I have an id number and a value, because the foreach applies a function to each id often more than once, the list I end up with looks something like the following:
ListBuffer(3106;0, 3106;3, 3108;2, 3108;0, 3110;1, 3110;2, 3113;0, 3113;2, 3113;0)
What I want to do is apply a simple function to aggregate this data, so I am left with
List(3106;3 ,3108;2, 3110;3, 3113;2)
I thought this could be done with foldLeft or groupBy, however I'm not sure how to get it to recognise id values and normal values.
Any help or pointers would be much appreciated
First of all, you can't group key-value pairs this way. In scala you have tuples which are written as
val pair: (Int, Int) = (3106,3), where
pair._1 == 3106
pair._2 == 3
are true statements.
So you have:
val l = ListBuffer((3106,0), (3106,3), (3108,2), (3108,0), (3110,1), (3110,2), (3113,0), (3113,2), (3113,0))
val result = l.groupBy(x => x._1).map(x => (x._1, x._2.map(_._2))).map(x => (x._1, x._2.sum)).toList
println(result)
will give you
List((3106,3), (3108,2), (3110,3), (3113,2))

How to sort an ArrayBuffer[Double] and save indices

Given an ArrayBuffer[Double], how can sort its elements with maintaining also their indices, e.g.
val arr ArrayBuffer[Double] = ArrayBuffer(4,5.3,5,3,8,9)
the result has to be:
arrSorted = ArrayBuffer(3,4,5,5.3,8,9)
indices = Arraybuffer(3,0,2,1,4,5) //here the data structure doesn't matter, it can be Array, List, Vector, etc.
Thanks
This is a one-liner:
val (addSorted, indices) = arr.zipWithIndex.sorted.unzip
Going step by step, zipWithIndex creates a collection of tuples with the index as the second value in each tuple:
scala> println(arr.zipWithIndex)
ArrayBuffer((4.0,0), (5.3,1), (5.0,2), (3.0,3), (8.0,4), (9.0,5))
sorted sorts these tuples lexicographically (which is almost certainly what you want, but you could also use sortBy(_._1) to be explicit about the fact that you want to sort by the values):
scala> println(arr.zipWithIndex.sorted)
ArrayBuffer((3.0,3), (4.0,0), (5.0,2), (5.3,1), (8.0,4), (9.0,5))
unzip then turns this collection of tuples into a tuple of collections, which you can deconstruct with val (addSorted, indices) = ....

Sort a list by an ordered index

Let us assume that I have the following two sequences:
val index = Seq(2,5,1,4,7,6,3)
val unsorted = Seq(7,6,5,4,3,2,1)
The first is the index by which the second should be sorted. My current solution is to traverse over the index and construct a new sequence with the found elements from the unsorted sequence.
val sorted = index.foldLeft(Seq[Int]()) { (s, num) =>
s ++ Seq(unsorted.find(_ == num).get)
}
But this solution seems very inefficient and error-prone to me. On every iteration it searches the complete unsorted sequence. And if the index and the unsorted list aren't in sync, then either an error will be thrown or an element will be omitted. In both cases, the not in sync elements should be appended to the ordered sequence.
Is there a more efficient and solid solution for this problem? Or is there a sort algorithm which fits into this paradigm?
Note: This is a constructed example. In reality I would like to sort a list of mongodb documents by an ordered list of document Id's.
Update 1
I've selected the answer from Marius Danila because it seems the more fastest and scala-ish solution for my problem. It doesn't come with a not in sync item solution, but this could be easily implemented.
So here is the updated solution:
def sort[T: ClassTag, Key](index: Seq[Key], unsorted: Seq[T], key: T => Key): Seq[T] = {
val positionMapping = HashMap(index.zipWithIndex: _*)
val inSync = new Array[T](unsorted.size)
val notInSync = new ArrayBuffer[T]()
for (item <- unsorted) {
if (positionMapping.contains(key(item))) {
inSync(positionMapping(key(item))) = item
} else {
notInSync.append(item)
}
}
inSync.filterNot(_ == null) ++ notInSync
}
Update 2
The approach suggested by Bask.cc seems the correct answer. It also doesn't consider the not in sync issue, but this can also be easily implemented.
val index: Seq[String]
val entities: Seq[Foo]
val idToEntityMap = entities.map(e => e.id -> e).toMap
val sorted = index.map(idToEntityMap)
val result = sorted ++ entities.filterNot(sorted.toSet)
Why do you want to sort collection, when you already have sorted index collection? You can just use map
Concerning> In reality I would like to sort a list of mongodb documents by an ordered list of document Id's.
val ids: Seq[String]
val entities: Seq[Foo]
val idToEntityMap = entities.map(e => e.id -> e).toMap
ids.map(idToEntityMap _)
This may not exactly map to your use case, but Googlers may find this useful:
scala> val ids = List(3, 1, 0, 2)
ids: List[Int] = List(3, 1, 0, 2)
scala> val unsorted = List("third", "second", "fourth", "first")
unsorted: List[String] = List(third, second, fourth, first)
scala> val sorted = ids map unsorted
sorted: List[String] = List(first, second, third, fourth)
I do not know the language that you are using. But irrespective of the language this is how i would have solved the problem.
From the first list (here 'index') create a hash table taking key as the document id and the value as the position of the document in the sorted order.
Now when traversing through the list of document i would lookup the hash table using the document id and then get the position it should be in the sorted order. Then i would use this obtained order to sort in a pre allocated memory.
Note: if the number of documents is small then instead of using hashtable u could use a pre allocated table and index it directly using the document id.
Flat Mapping the index over the unsorted list seems to be a safer version (if the index isn't found it's just dropped since find returns a None):
index.flatMap(i => unsorted.find(_ == i))
It still has to traverse the unsorted list every time (worst case this is O(n^2)). With you're example I'm not sure that there's a more efficient solution.
In this case you can use zip-sort-unzip:
(unsorted zip index).sortWith(_._2 < _._2).unzip._1
Btw, if you can, better solution would be to sort list on db side using $orderBy.
Ok.
Let's start from the beginning.
Besides the fact you're rescanning the unsorted list each time, the Seq object will create, by default a List collection. So in the foldLeft you're appending an element at the end of the list each time and this is a O(N^2) operation.
An improvement would be
val sorted_rev = index.foldLeft(Seq[Int]()) { (s, num) =>
unsorted.find(_ == num).get +: s
}
val sorted = sorted_rev.reverse
But that is still an O(N^2) algorithm. We can do better.
The following sort function should work:
def sort[T: ClassTag, Key](index: Seq[Key], unsorted: Seq[T], key: T => Key): Seq[T] = {
val positionMapping = HashMap(index.zipWithIndex: _*) //1
val arr = new Array[T](unsorted.size) //2
for (item <- unsorted) { //3
val position = positionMapping(key(item))
arr(position) = item
}
arr //6
}
The function sorts a list of items unsorted by a sequence of indexes index where the key function will be used to extract the id from the objects you're trying to sort.
Line 1 creates a reverse index - mapping each object id to its final position.
Line 2 allocates the array which will hold the sorted sequence. We're using an array since we need constant-time random-position set performance.
The loop that starts at line 3 will traverse the sequence of unsorted items and place each item in it's meant position using the positionMapping reverse index
Line 6 will return the array converted implicitly to a Seq using the WrappedArray wrapper.
Since our reverse-index is an immutable HashMap, lookup should take constant-time for regular cases. Building the actual reverse-index takes O(N_Index) time where N_Index is the size of the index sequence. Traversing the unsorted sequence takes O(N_Unsorted) time where N_Unsorted is the size of the unsorted sequence.
So the complexity is O(max(N_Index, N_Unsorted)), which I guess is the best you can do in the circumstances.
For your particular example, you would call the function like so:
val sorted = sort(index, unsorted, identity[Int])
For the real case, it would probably be like this:
val sorted = sort(idList, unsorted, obj => obj.id)
The best I can do is to create a Map from the unsorted data, and use map lookups (basically the hashtable suggested by a previous poster). The code looks like:
val unsortedAsMap = unsorted.map(x => x -> x).toMap
index.map(unsortedAsMap)
Or, if there's a possibility of hash misses:
val unsortedAsMap = unsorted.map(x => x -> x).toMap
index.flatMap(unsortedAsMap.get)
It's O(n) in time*, but you're swapping time for space, as it uses O(n) space.
For a slightly more sophisticated version, that handles missing values, try:
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer
val unsortedAsMap = new java.util.LinkedHashMap[Int, Int]
for (i <- unsorted) unsortedAsMap.add(i, i)
val newBuffer = ListBuffer.empty[Int]
for (i <- index) {
val r = unsortedAsMap.remove(i)
if (r != null) newBuffer += i
// Not sure what to do for "else"
}
for ((k, v) <- unsortedAsMap) newBuffer += v
newBuffer.result()
If it's a MongoDB database in the first place, you might be better retrieving documents directly from the database by index, so something like:
index.map(lookupInDB)
*technically it's O(n log n), as Scala's standard immutable map is O(log n), but you could always use a mutable map, which is O(1)