efficient way of transforming a dataframe to map in scala - scala

I have a use case where I want to convert the dataframe to a map. For that i am using groupbykey and mapGroups operations. But they are running into memory issues.
Eg:
EmployeeDataFrame
EmployeeId, JobLevel, JobCode
1, 4, 50
1, 5, 60
2, 4, 70
2, 5, 80
3, 7, 90
case class EmployeeModel(EmployeeId: String ,JobLevel: String, JobCode: String)
val ds : Dataset[EmployeeModel] = EmployeeDataFrame.as[EmployeeModel]
val groupedData = ds
.groupByKey(_.EmployeeId)
.mapGroups((key, rows) => (key, rows.toList))
.collect()
.toMap
Expected Map
1, [(),()]
2, [(), ()]
3, [()]
Is there a better way of doing this?

Related

Scala - Calculate maximum average between two lists

I am at the beginning of my Scala journey. I am trying to find and compare the average value of a given dataset - type Map(String, List[Int]), for two random rows selected by the user, in order to return the greater average value between the two. I can calculate the average for each row but I can't find a way to compare the average between the two rows.
I have tried in different ways, but I only get error messages. However the program calculates the average of each row
DATASET
SK1, 9, 7, 2, 0, 7, 3, 7, 9, 1, 2, 8, 1, 9, 6, 5, 3, 2, 2, 7, 2, 8, 5, 4, 5, 1, 6, 5, 2, 4, 1
SK2, 0, 7, 6, 3, 3, 3, 1, 6, 9, 2, 9, 7, 8, 7, 3, 6, 3, 5, 5, 2, 9, 7, 3, 4, 6, 3, 4, 3, 4, 1
SK3, 8, 7, 1, 8, 0, 5, 8, 3, 5, 9, 7, 5, 4, 7, 9, 8, 1, 4, 6, 5, 6, 6, 3, 6, 8, 8, 7, 4, 0, 6
This is how I the program calculates the average of a row
//Function to find the average
def average(list: List[Int]): Double = list.sum.toDouble / list.size
def averageStockLevel1(stock1: String, stock2: String): (String, Int) = {
val ave1 = mapdata.get(stock1).map(average(_).toInt).getOrElse(0)
val ave2 = mapdata.get(stock2).map(average(_).toInt).getOrElse(0)
if (ave1>ave2){
(stock1,ave1)
}else{
(stock2,ave2)
}
}
This is how I have called the function in the menu
def handleFour(): Boolean = {
menuDoubleDataStock(averageStockLevel1)
true
}
//Pull two rows from the dataset
def menuShowDoubleDataStock(f: (String) => (String, Int), g:(String) => (String, Int)) = {
print("Please insert the Stock > ")
val data = f(readLine)
println(s"${data._1}: ${data._2}")
print("Please insert the Stock > ")
val data1 = g(readLine)
println(s"${data1._1}: ${data1._2}")
}
error message
Unspecified value parameters: g: String => (String, Int)
The error message "Unspecified value parameters: g: String => (String, Int)" tells you the following:
Your menuShowDoubleDataStock expects two parameters (f and g), but where you call it (from handleFour()), you only pass one value (averageStockLevel1) - that value is accepted as f, so the compiler complains that no value was passed for g.
Besides that specific error that the compiler currently complains about, there is also a second problem (which currently seems to be overshadowed by the one above): the type of f is defined as String => (String, Int) (a function that takes one String parameter), but the value that you are passing (averageStockLevel1) has the type (String, String) => (String, Int) (a function that takes two String parameters).
I'm not 100% sure if I understood what you are aiming to do, but I think the solution could be to change the signature of menuShowDoubleDataStock so that it only takes one parameter of type (String, String) => (String, Int):
// make the user enter two stock-names and pass them into resultCalculator to
// get the result (and then print it)
def menuShowDoubleDataStock(resultCalculator: (String, String) => (String, Int)) = {
print("Please insert the Stock > ")
val stockName1 = readLine
print("Please insert the Stock > ")
val stockName2 = readLine
val result = resultCalculator(stockName1, stockName2)
println(s"${result._1}: ${result._2}")
}
Then calling menuDoubleDataStock(averageStockLevel1) should work.

Efficiently find common values in a map of lists - scala

I asked a similar question already here. However, I misjudged the scale of my specific case. In my example I gave, there were only 4 keys in the map. I am actually dealing with over 10,000 keys and they are mapped to lists of different sizes. So the solution given was correct, but I am now looking for a way that will do this in a more efficient manner.
Say I have:
val myMap: Map[Int, List[Int]] = Map(
1 -> List(1, 10, 12, 76, 105), 2 -> List(2, 5, 10), 3 -> List(10, 12, 76, 5), 4 -> List(2, 4, 5, 10),
... -> List(...)
)
Imagine the (...) go on for over 10,000 keys. I want to return a List of Lists containing a pair of keys and their shared values if the size of the intersection of their respective lists is >= 3.
For example:
res0: List[(Int, Int, List[Int])] = List(
(1, 3, List(10, 12, 76)),
(2, 4, List(2, 5, 10)),
(...),
(...),
)
I've been pretty stuck on this for a couple of days, so any help is genuinely appreciated. Thank you in advance!
If space is not the concern then the problem can be solved in the O(N) where N is the number of elements in the list.
Algorithm:
Create a reverse lookup map out from the input map. Here reverse lookup maps the list element to the key (Id).
For each input map key
Create a temp map
Iterate over the list and look for value (Id) in the reverse lookup. Count the number of occurred for the fetched id.
All key which occurred equal or more than 3 times is the desired pair.
Code
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
object Application extends App {
val inputMap = Map(
1 -> List(1, 2, 3, 4),
2 -> List(2, 3, 4, 5),
3 -> List(3, 5, 6, 7),
4 -> List(1, 2, 3, 6, 7))
/*
Expected pairs
| pair | common elements |
---------------------------
(1, 2) -> 2, 3, 4
(1, 4) -> 2, 3, 4
(2, 1) -> 2, 3, 4
(3, 4) -> 3, 5, 6
(4, 1) -> 1, 2, 3
(4, 3) -> 3, 5, 6
*/
val reverseMap = mutable.Map[Int, ArrayBuffer[Int]]()
inputMap.foreach {
case (id, list) => list.foreach(
o => if (reverseMap.contains(o)) reverseMap(o).append(id) else reverseMap.put(o, ArrayBuffer(id)))
}
val result = inputMap.map {
case (id, list) =>
val m = mutable.Map[Int, Int]()
list.foreach(o =>
reverseMap(o).foreach(k => if (m.contains(k)) m.update(k, m(k)+1) else m.put(k, 1)))
val res = m.toList.filter(o => o._2 >= 3 && o._1 != id).map(o => (id, o._1))
res
}.flatten
println(result)
}

How to find duplicates in a list in Scala?

I have a list of unsorted integers and I want to find the elements which are duplicated.
val dup = List(1|1|1|2|3|4|5|5|6|100|101|101|102)
I have to find the list of unique elements and also how many times each element is repeated.
I know I can find it with below code :
val ans2 = dup.groupBy(identity).map(t => (t._1, t._2.size))
But I am not able to split the above list on "|" . I tried converting to a String then using split but I got the result below:
L
i
s
t
(
1
0
3
)
I am not sure why I am getting this result.
Reference: How to find duplicates in a list?
The symbol | is a function in scala. You can check the API here
|(x: Int): Int
Returns the bitwise OR of this value and x.
So you don't have a List, you have a single Integer (103) which is the result of operating | with all the integers in your pretended List.
Your code is fine, if you want to make a proper List you should separate its elements by commas
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
If you want to convert your given String before having it on a List you can do:
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\\|").toList
Even easier, convert the list of duplicates into a set - a set is a data structure that by default does not have any duplicates.
scala> val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup: List[Int] = List(1, 1, 1, 2, 3, 4, 5, 5, 6, 100, 101, 101, 102)
scala> val noDup = dup.toSet
res0: scala.collection.immutable.Set[Int] = Set(101, 5, 1, 6, 102, 2, 3, 4, 100)
To count the elements, just call the method sizeon the resulting set:
scala> noDup.size
res3: Int = 9
Another way to solve the problem
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\|").groupBy(x => x).mapValues(_.size)
res0: scala.collection.immutable.Map[String,Int] = Map(100 -> 1, 4 -> 1, 5 -> 2, 6 -> 1, 1 -> 3, 102 -> 1, 2 -> 1, 101 -> 2, 3 -> 1)

"unlist" in scala (e.g flattening a sequence of sequences of sequences...)

Is there a simple way in Scala to flatten or "unlist" a nested sequence of sequences of sequences (etc.) of something into just a simple sequence of those things, without any nesting structure?
I don't think there is a flatten` method which converts a deeply nested into a sequence.
Its easy to write a simple recursive function to do this
def flatten(ls: List[Any]): List[Any] = ls flatMap {
case ms: List[_] => flatten(ms)
case e => List(e)
}
val a = List(List(List(1, 2, 3, 4, 5)),List(List(1, 2, 3, 4, 5)))
flatten(a)
//> res0: List[Any] = List(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)

Scala: How to sort an array within a specified range of indices?

And I have a comparison function "compr" already in the code to compare two values.
I want something like this:
Sorting.stableSort(arr[i,j] , compr)
where arr[i,j] is a range of element in array.
Take the slice as a view, sort and copy it back (or take a slice as a working buffer).
scala> val vs = Array(3,2,8,5,4,9,1,10,6,7)
vs: Array[Int] = Array(3, 2, 8, 5, 4, 9, 1, 10, 6, 7)
scala> vs.view(2,5).toSeq.sorted.copyToArray(vs,2)
scala> vs
res31: Array[Int] = Array(3, 2, 4, 5, 8, 9, 1, 10, 6, 7)
Outside the REPL, the extra .toSeq isn't needed:
vs.view(2,5).sorted.copyToArray(vs,2)
Updated:
scala 2.13.8> val vs = Array(3, 2, 8, 5, 4, 9, 1, 10, 6, 7)
val vs: Array[Int] = Array(3, 2, 8, 5, 4, 9, 1, 10, 6, 7)
scala 2.13.8> vs.view.slice(2,5).sorted.copyToArray(vs,2)
val res0: Int = 3
scala 2.13.8> vs
val res1: Array[Int] = Array(3, 2, 4, 5, 8, 9, 1, 10, 6, 7)
Split array into three parts, sort middle part and then concat them, not the most efficient way, but this is FP who cares about performance =)
val sorted =
for {
first <- l.take(FROM)
sortingPart <- l.slice(FROM, UNTIL)
lastPart <- l.takeRight(UNTIL)
} yield (first ++ Sorter.sort(sortingPart) ++ lastPart)
Something like that:
def stableSort[T](x: Seq[T], i: Int, j: Int, comp: (T,T) => Boolean ):Seq[T] = {
x.take(i) ++ x.slice(i,j).sortWith(comp) ++ x.drop(i+j-1)
}
def comp: (Int,Int) => Boolean = { case (x1,x2) => x1 < x2 }
val x = Array(1,9,5,6,3)
stableSort(x,1,4, comp)
// > res0: Seq[Int] = ArrayBuffer(1, 5, 6, 9, 3)
If your class implements Ordering it would be less cumbersome.
This should be as good as you can get without reimplementing the sort. Creates just one extra array with the size of the slice to be sorted.
def stableSort[K:reflect.ClassTag](xs:Array[K], from:Int, to:Int, comp:(K,K) => Boolean) : Unit = {
val tmp = xs.slice(from,to)
scala.util.Sorting.stableSort(tmp, comp)
tmp.copyToArray(xs, from)
}