Scala getting average of the result

Hi, I am trying to calculate the average running time per movie genre from this TSV data set:

running time  genre
1             Documentary,Short
5             Animation,Short
4             Animation,Comedy,Romance

Animation is one genre, and the same goes for Short, Comedy, and Romance.
I'm new to Scala and I'm confused about how to compute an average per genre without using any mutable state.
I tried the snippet below just to attempt some sort of iteration and collect the run times per genre:
val a = list.foldLeft(Map[String, Int]()) {
  case (map, arr) =>
    map + (arr.genres.toString -> arr.runtimeMinutes)
}
Is there any way to calculate the average?

Assuming the data was already parsed into something like:
final case class Row(runningTime: Int, genres: List[String])
Then you can follow a declarative approach to compute your desired result.
1. Flatten a List[Row] into a list of pairs, where the first element is a genre and the second element is a running time.
2. Collect all running times for the same genre.
3. Reduce each group to compute its average.
def computeAverageRunningTimePerGenre(data: List[Row]): Map[String, Double] =
  data.flatMap {
    case Row(runningTime, genres) =>
      genres.map(genre => genre -> runningTime)
  }.groupMap(_._1)(_._2).view.mapValues { runningTimes =>
    runningTimes.sum.toDouble / runningTimes.size.toDouble
  }.toMap
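For example, with the question's data parsed into rows:
val rows = List(
  Row(1, List("Documentary", "Short")),
  Row(5, List("Animation", "Short")),
  Row(4, List("Animation", "Comedy", "Romance"))
)
computeAverageRunningTimePerGenre(rows)
// Map("Documentary" -> 1.0, "Short" -> 3.0, "Animation" -> 4.5, "Comedy" -> 4.0, "Romance" -> 4.0)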
Note: there are ways to make this faster, but IMHO it's better to start with the most readable alternative and refactor to something more performant only if needed.
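For instance, a sketch of one single-pass alternative (mine, not from the original answer): fold over the rows once, accumulating a (sum, count) pair per genre, and divide only at the end. It assumes the same Row model as above.
// Single pass over the data: accumulate (sum, count) per genre, divide at the end
def averagesSinglePass(data: List[Row]): Map[String, Double] = {
  val sums = data.foldLeft(Map.empty[String, (Int, Int)]) { (acc, row) =>
    row.genres.foldLeft(acc) { (m, genre) =>
      val (sum, count) = m.getOrElse(genre, (0, 0))
      m.updated(genre, (sum + row.runningTime, count + 1))
    }
  }
  sums.map { case (genre, (sum, count)) => genre -> sum.toDouble / count }
}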

I tried to break it down as follows:
I modeled your data as a List[(Int, List[String])]:
val data: List[(Int, List[String])] = List(
  (1, List("Documentary", "Short")),
  (5, List("Animation", "Short")),
  (4, List("Animation", "Comedy", "Romance"))
)
I wrote a function to spread the runtime value across each genre so that I have a value for each one:
val spread: ((Int, List[String])) => List[(Int, String)] = t => t._2.map((t._1, _))
// now, if I pass it a tuple, I get:
// spread((23, List("one","two","three"))) == List((23,"one"), (23,"two"), (23,"three"))
So far, so good. Now I can use spread with flatMap to get a flat list of (runtime, genre) pairs:
val flatData = data.flatMap(spread)

flatData: List[(Int, String)] = List(
  (1, "Documentary"),
  (1, "Short"),
  (5, "Animation"),
  (5, "Short"),
  (4, "Animation"),
  (4, "Comedy"),
  (4, "Romance")
)
Now we can use groupBy to summarize by genre:
flatData.groupBy(_._2)

res26: Map[String, List[(Int, String)]] = HashMap(
  "Animation" -> List((5, "Animation"), (4, "Animation")),
  "Documentary" -> List((1, "Documentary")),
  "Comedy" -> List((4, "Comedy")),
  "Romance" -> List((4, "Romance")),
  "Short" -> List((1, "Short"), (5, "Short"))
)
Finally, I can get the results (it took me about 10 tries):
flatData.groupBy(_._2).map(t => (t._1, t._2.map(_._1).foldLeft(0)(_ + _) / t._2.size.toDouble))

res43: Map[String, Double] = HashMap(
  "Animation" -> 4.5,
  "Documentary" -> 1.0,
  "Comedy" -> 4.0,
  "Romance" -> 4.0,
  "Short" -> 3.0
)
The map() after the groupBy() is chunky, but now that I've got it, it's easi(er) to explain. Each entry in the groupBy result is (genre, List[(runtime, genre)]). So we map each entry, pull out the runtimes with t._2.map(_._1), and use foldLeft to sum them before dividing by the group size. You have to coerce the calculation to a Double, or you'll get integer division.
I think it would have been good to define a cleaner model for the data like Luis did. That would've made all the tuple notation less obscure. Hey, I am learning, too.
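For what it's worth, here is a sketch of that last step with named pattern matching instead of the _._1/_._2 accessors, which reads a bit less chunky (same flatData as above):
// Same groupBy-then-average step, with names instead of tuple accessors
flatData.groupBy { case (_, genre) => genre }.map {
  case (genre, pairs) =>
    val runtimes = pairs.map { case (runtime, _) => runtime }
    genre -> runtimes.sum.toDouble / runtimes.size
}
// Map[String, Double] = HashMap("Animation" -> 4.5, "Documentary" -> 1.0, ...)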

Related

Scala MaxBy's Tuple

I have a Seq of Tuples, which represents a word count: (count, word)
For Example:
(5, "Hello")
(3, "World")
My goal is to find the word with the highest count. In case of a tie between two words, I'll pick the word that appears first in the dictionary (i.e., alphabetical order).
val wordCounts = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello")
)
val commonWord = wordCounts.maxBy(_._1)
print(commonWord)
Now, this code segment will return (10, "World"), because this is the first tuple that has the maximum count.
I could use .sortBy and then .head, but I want to be more efficient.
My question: is there any way to change the Ordering used by maxBy, in order to achieve the desired outcome?
Note: I prefer not to use .sortBy, because it's O(n*log(n)) and not O(n). I know that I can use .reduce, but I want to check whether I can adjust .maxBy.
Scala Version 2.13
Functions like max, min, maxBy and minBy use an implicit Ordering that defines the comparison between two items. There is a default Ordering for Tuple2; the problem is that it compares both elements in the same direction, while in your case you need greater-than for _._1 and less-than for _._2. You can easily solve this by negating the first element, so this does the trick:
wordCounts.minBy(x => (-x._1, x._2))
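With the sample data this returns (10,"Hello"): negating the counts turns the highest count into the smallest first component, and ties fall through to the natural, alphabetical ordering of the word.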
You can create your own Ordering by using orElse() to combine two Orderings together:
// can't use .orElseBy() because of the .reverse, so this is a bit verbose
val countThenAlphaOrdering =
  Ordering.by[(Int, String), Int](_._1)
    .orElse(Ordering.by[(Int, String), String](_._2).reverse)
Or you can use Ordering.Tuple2 in this case:
val countThenAlphaOrdering = Ordering.Tuple2(Ordering[Int], Ordering[String].reverse)
Then
val wordCounts = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello")
)
wordCounts.max(countThenAlphaOrdering) // (10,Hello): (Int, String)
implicit val WordSorter: Ordering[(Int, String)] = new Ordering[(Int, String)] {
  override def compare(
    x: (Int, String),
    y: (Int, String)
  ) = {
    val iComp = implicitly[Ordering[Int]].compare(x._1, y._1)
    if (iComp == 0)
      -implicitly[Ordering[String]].compare(x._2, y._2)
    else
      iComp
  }
}
val seq = Seq(
  (10, "World"),
  (5, "Something"),
  (10, "Hello")
)
def main(args: Array[String]): Unit = println(seq.max)
You can create your own implicit Ordering[(Int, String)] whose compare method returns the comparison of the numbers in the tuples when it's not zero, and the negated comparison of the strings when the int comparison is zero. The implicitly resolved Ordering[Int] and Ordering[String] are used for modularity, in case you want to change the behaviour later on. If you don't want to use those, you can just replace implicitly[Ordering[Int]].compare(x._1, y._1) with x._1.compareTo(y._1), and so on.
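Since Ordering is a single-abstract-method type in Scala 2.12+, that compareTo variant can be sketched as a drop-in replacement for the WordSorter above:
// Same comparison logic written with compareTo and a SAM lambda
implicit val wordSorter: Ordering[(Int, String)] =
  (x: (Int, String), y: (Int, String)) => {
    val iComp = x._1.compareTo(y._1)
    if (iComp == 0) -x._2.compareTo(y._2) else iComp
  }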

Sum of Values based on key in scala

I am new to Scala. I have a list of integer tuples:
val list = List((1,2,3),(2,3,4),(1,2,3))
val sum = list.groupBy(_._1).mapValues(_.map(_._2)).sum
val sum2 = list.groupBy(_._1).mapValues(_.map(_._3)).sum
How do I do this for N values? What I tried above is not a good way to sum N values per key.
I have also tried:
val sum = list.groupBy(_._1).values.sum // error
val sum = list.groupBy(_._1).mapValues(_.map(_._2).sum (_._3).sum) // error
It's easier to convert these tuples to List[Int] with shapeless and then work with them. Your tuples are actually more like lists anyway. As a bonus, you don't need to change your code at all for lists of Tuple4, Tuple5, etc.
import shapeless._, syntax.std.tuple._

val list = List((1,2,3), (2,3,4), (1,2,3))
list.map(_.toList)                         // convert tuples to lists
  .groupBy(_.head)                         // group by first element of each list
  .mapValues(_.map(_.tail).map(_.sum).sum) // sum the elements of all tails
Result is Map(2 -> 7, 1 -> 10).
val sum = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._1 + j._2 + j._3).sum))
sum: scala.collection.immutable.Map[Int,Int] = Map(2 -> 9, 1 -> 12)
Since a tuple can't be converted to a List in a type-safe way (without a library like shapeless), you need to spell the addition out element by element, as j._1 + j._2 + j._3. Note that this version also adds the key itself into each sum, which is why its totals differ from the other answers.
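If you'd rather not count the key itself (to match the Map(2 -> 7, 1 -> 10) of the other answers), a small variant of the same idea:
// Same approach, summing only the non-key elements
val sumWithoutKey = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._2 + j._3).sum))
// sumWithoutKey: Map[Int,Int] = Map(2 -> 7, 1 -> 10)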
Using the first element in the tuple as the key and the remaining elements as the values, you could do something like this:
val list = List((1,2,3), (2,3,4), (1,2,3))
list: List[(Int, Int, Int)] = List((1, 2, 3), (2, 3, 4), (1, 2, 3))

val sum = list.groupBy(_._1).map { case (k, v) =>
  k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum
}
sum: Map[Int, Int] = Map(2 -> 7, 1 -> 10)
I know it's a bit dirty to use asInstanceOf[Int], but .productIterator gives you an Iterator[Any].
This will work for any tuple size.

Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD RDD[(Int, List[String])] with thousands of keys and thousands to millions of values:
val rdd = sc.parallelize(Seq(
  (1, List("a")),
  (2, List("a", "b")),
  (3, List("b", "c", "d")),
  (4, List("f"))))
For each key I need to add random values taken from other keys. The number of elements to add varies and depends on the number of elements already under the key, so the output could look like:
val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
  (1, List("a", "c")),
  (2, List("a", "b", "b", "c")),
  (3, List("b", "c", "d", "a", "a", "f")),
  (4, List("f", "d"))))
I came up with the following solution, which is obviously not very efficient (note: flattening and the final aggregation are optional, I'm fine with flattened data):
// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))

// calculate the number of elements for each key
val count = rddFlat.countByKey().toSeq

// for each key, take samples from the input RDD, change the original key and union all RDDs
val rddRandom: RDD[(Int, String)] = count.map { x =>
  (x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))

// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
  .union(rddRandom)
  .aggregateByKey(List[String]())(_ :+ _, _ ++ _)
What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.
Looking at the current approach, and in order to ensure the scalability of the solution, the area of focus should probably be a sampling mechanism that can run in a distributed fashion, removing the need to collect the keys back to the driver.
In a nutshell, we need a distributed method to take a weighted sample of all the values.
What I propose is to create a matrix keys x values where each cell is the probability of the value being chosen for that key. Then, we can randomly score that matrix and pick those values that fall within the probability.
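As a sanity check on those weights: with N values in total and n_k of them under key k, key k gets N candidate cells, each kept with probability n_k/N, so it receives n_k extra values in expectation, i.e. proportionally to its current size.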
Let's write a Spark-based algorithm for that:
import scala.util.Random

// sample data to guide us.
// Note that I'm using distinguishable data across keys to see how the sampled data distributes over the keys
val data = sc.parallelize(Seq(
  (1, List("A", "B")),
  (2, List("x", "y", "z")),
  (3, List("1", "2", "3", "4")),
  (4, List("foo", "bar")),
  (5, List("+")),
  (6, List())))

val flattenedData = data.flatMap { case (k, vlist) => vlist.map(v => (k, v)) }
val values = data.flatMap { case (k, list) => list }
val keysBySize = data.map { case (k, list) => (k, list.size) }
val totalElements = keysBySize.map { case (k, size) => size }.sum
val keysByProb = keysBySize.mapValues { size => size.toDouble / totalElements }
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map { case ((key, prob), value) =>
  ((key, value), (prob, Random.nextDouble))
}
scoredSamples looks like this:
((1,A),(0.16666666666666666,0.911900315814998))
((1,B),(0.16666666666666666,0.13615047422122906))
((1,x),(0.16666666666666666,0.6292430257377151))
((1,y),(0.16666666666666666,0.23839887096373114))
((1,z),(0.16666666666666666,0.9174808344986465))
...
val samples = scoredSamples.collect{case (entry, (prob,score)) if (score<prob) => entry}
samples looks like this:
(1,foo)
(1,bar)
(2,1)
(2,3)
(3,y)
...
Now, we union our sampled data with the original and have our final result.
val result = (flattenedData union samples).groupByKey.mapValues(_.toList)
result.collect()
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))
Given that the whole algorithm is written as a sequence of transformations on the original data, with minimal shuffling (only the last groupByKey, which operates over a minimal result set), it should be scalable. The only limitation is the size of the list of values per key in the groupByKey stage, which is only there to comply with the representation used in the question.

Calculation on consecutive array elements

I have this:
val myInput: ArrayBuffer[(String, String)] = ArrayBuffer(
  (a, timestampAStr),
  (b, timestampBStr),
  ...
)
I would like to calculate the duration between each pair of consecutive timestamps from myInput and retrieve those like the following:
val myOutput = ArrayBuffer(
  (a, durationFromTimestampAToTimestampB),
  (b, durationFromTimestampBToTimestampC),
  ...
)
This is a paired evaluation, which led me to think something with foldLeft() might do the trick, but after giving this a little more thought, I could not come up with a solution.
I have already put something together with some for loops and .indices, but this does not seem as clean and concise as it could be. I would appreciate it if somebody had a better option.
You can use zip and sliding to achieve what you want. For example, if you have a collection
scala> List(2,3,5,7,11)
res8: List[Int] = List(2, 3, 5, 7, 11)
The list of differences is res8.sliding(2).map{case List(fst,snd)=>snd-fst}.toList, which you can zip with the original list.
scala> res8.zip(res8.sliding(2).map{case List(fst,snd)=>snd-fst}.toList)
res13: List[(Int, Int)] = List((2,1), (3,2), (5,2), (7,4))
You can zip your collection with itself after dropping the first item, to match each item with its successor, and then map to the calculated result:
val myInput: ArrayBuffer[(String, String)] = ArrayBuffer(
  ("a", "1000"),
  ("b", "1500"),
  ("c", "2500")
)
val result: ArrayBuffer[(String, Int)] = myInput.zip(myInput.drop(1)).map {
  case ((k1, v1), (k2, v2)) => (k1, v2.toInt - v1.toInt)
}
result.foreach(println)
// prints:
// (a,500)
// (b,1000)
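Note that zipping with the dropped collection naturally yields n-1 pairs, so the last key ("c" here) has no successor and produces no entry, which matches the shape of the desired myOutput.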

scala groupby and sum on Observable

I am a Scala beginner and I want to perform a simple groupBy and sum over an Observable. For example:
val test = Observable.just(("a", 1), ("a", 2), ("b", 5), ("b",3))
I would like to group by key and sum over the values so to have something like:
(a,3)
(b,8)
I am able to sum over everything with test.map(_._2).sum, but not when performing a groupBy.
Clearly not the classes you are looking for. I was rummaging around scala.react and reactivex.io, but how about this:
scala> val test999=Seq(("a",1),("a",16),("b",5),("a",9),("b",9),("c",90))
test999: Seq[(String, Int)] = List((a,1), (a,16), (b,5), (a,9), (b,9), (c,90))
scala> test999
res12: Seq[(String, Int)] = List((a,1), (a,16), (b,5), (a,9), (b,9), (c,90))
scala> test999.groupBy(_._1).mapValues(_.map(_._2).sum)
res13: scala.collection.immutable.Map[String,Int] = Map(b -> 14, a -> 26, c -> 90)
Given that "I am able to sum over all with test.map(_._2).sum", namely that it is possible to map over test, consider applying groupBy to test.toSeq.
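A minimal sketch of that suggestion, assuming RxScala, where Observable#toSeq emits the whole collected sequence as a single item once the source completes:
// Collect the stream into one Seq, then reuse the plain-collections groupBy
val summed = test.toSeq.map(_.groupBy(_._1).mapValues(_.map(_._2).sum))
summed.subscribe(m => m.foreach(println))
// expected to print (a,3) and (b,8)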