I am a scala beginner and I want to perform a simple groupby and sum over an Observable. For example:
val test = Observable.just(("a", 1), ("a", 2), ("b", 5), ("b",3))
I would like to group by key and sum over the values so to have something like:
(a,3)
(b,8)
I am able to sum over all with test.map(_._2).sum,but not when performing a groupby
not the classes you are looking for clearly. i was rummaging around scala.react and reactivex.io, but how about this:
scala> val test999=Seq(("a",1),("a",16),("b",5),("a",9),("b",9),("c",90))
test999: Seq[(String, Int)] = List((a,1), (a,16), (b,5), (a,9), (b,9), (c,90))
scala> test999
res12: Seq[(String, Int)] = List((a,1), (a,16), (b,5), (a,9), (b,9), (c,90))
scala> test999.groupBy(_._1).mapValues(_.map(_._2).sum)
res13: scala.collection.immutable.Map[String,Int] = Map(b -> 14, a -> 26, c -> 90)
Given that
I am able to sum over all with test.map(_._2).sum,...
namely that it is possible to map onto test, consider applying groupBy to test.toSeq.
Related
Hi I am trying to calculate the average of the movie result from this tsv set
running time Genre
1 Documentary,Short
5 Animation,Short
4 Animation,Comedy,Romance
Animation is one type of Genre
and same goes for Short, Comedy, Romance
I'm new to Scala and I'm confused about how to get an Average as per each genre using Scala without any immutable functions
I tried using this below snippet to just try some sort of iterations and get the runTimes as per each genre
val a = list.foldLeft(Map[String,(Int)]()){
case (map,arr) =>{
map + (arr.genres.toString ->(arr.runtimeMinutes))
}}
Is there any way to calculate the average
Assuming the data was already parsed into something like:
final case class Row(runningTime: Int, genres: List[String])
Then you can follow a declarative approach to compute your desired result.
Flatten a List[Row] into a list of pairs, where the first element is a genre and the second element is a running time.
Collect all running times for the same genre.
Reduce each group to compute its average.
def computeAverageRunningTimePerGenre(data: List[Row]): Map[String, Double] =
data.flatMap {
case Row(runningTime, genres) =>
genres.map(genre => genre -> runningTime)
}.groupMap(_._1)(_._2).view.mapValues { runningTimes =>
runningTimes.sum.toDouble / runningTimes.size.toDouble
}.toMap
Note: There are ways to make this faster but IMHO is better to start with the most readable alternative first and then refactor to something more performant if needed.
You can see the code running here.
I tried to break it down as follows:
I modeled your data as a List[(Int, String)]:
val data: List[(Int, List[String])] = List(
(1, List("Documentary","Short")),
(5, List("Animation","Short")),
(4, List("Animation","Comedy","Romance"))
)
I wrote a function to spread the runtime value across each genre so that I have a value for each one:
val spread: ((Int, List[String]))=>List[(Int, String)] = t => t._2.map((t._1, _))
// now, if I pass it a tuple, I get:
// spread((23, List("one","two","three")))) == List((23,one), (23,two), (23,three))
So far, so good. Now I can use spread with flatMap to get a 2-dimensional list:
val flatData = data.flatMap(spread)
flatData: List[(Int, String)] = List(
(1, "Documentary"),
(1, "Short"),
(5, "Animation"),
(5, "Short"),
(4, "Animation"),
(4, "Comedy"),
(4, "Romance")
)
Now we can use groupBy to summarize by genre:
flatData.groupBy(_._2)
res26: Map[String, List[(Int, String)]] = HashMap(
"Animation" -> List((5, "Animation"), (4, "Animation")),
"Documentary" -> List((1, "Documentary")),
"Comedy" -> List((4, "Comedy")),
"Romance" -> List((4, "Romance")),
"Short" -> List((1, "Short"), (5, "Short"))
)
Finally, I can get the results (it took me about 10 tries):
flatData.groupBy(_._2).map(t => (t._1, t._2.map(_._1).foldLeft(0)(_+_)/t._2.size.toDouble))
res43: Map[String, Double] = HashMap(
"Animation" -> 4.5,
"Documentary" -> 1.0,
"Comedy" -> 4.0,
"Romance" -> 4.0,
"Short" -> 3.0
)
The map() after the groupBy() is chunky, but now that I got it, it's easy(er) to explain. Each tuple in the groupBy is (rating, list(genre)). So we just map each tuple and use foldLeft to compute the average of the values in each list. You should coerce the calc to a double, or you'll get integer division.
I think it would have been good to define a cleaner model for the data like Luis did. That would've made all the tuple notation less obscure. Hey, I am learning, too.
I have list of tuples which contains userId and point. I want to combine or reduce this list by the key.
val points: List[(Int, Double)] = List(
(1, 1.0),
(2, 3.2),
(4, 2.0),
(1, 4.0),
(2, 6.8)
)
The expected result should look like:
List((1, 5.0), (2, 10.0), (4, 2.0))
I tried with groupBy and mapValue, but got an error:
val aggrPoint: Map[Int, Double] = incomes.groupBy(_._1).mapValues(seq => seq.reduce(_._2 + _._2))
Error:(16, 180) type mismatch;
found : Double
required: (Int, Double)
What am I doing wrong, and is there a idiomatic way to achieve this?
P.S) I found that in Spark aggregateByKey does this job. But, is there a built-in method in Scala?
What am I doing wrong, and is there a idiomatic way to achieve this?
let's go step by step to see what are you doing wrong. (I am going to use REPL)
first of all lets define the points
scala> val points: List[(Int, Double)] = List(
| (1, 1.0),
| (2, 3.2),
| (4, 2.0),
| (1, 4.0),
| (2, 6.8)
| )
points: List[(Int, Double)] = List((1,1.0), (2,3.2), (4,2.0), (1,4.0), (2,6.8))
As you can see that you have List[Tuple2[Int, Double]] so when you do groupBy and mapValues as
scala> points.groupBy(_._1).mapValues(seq => println(seq))
List((2,3.2), (2,6.8))
List((4,2.0))
List((1,1.0), (1,4.0))
res1: scala.collection.immutable.Map[Int,Unit] = Map(2 -> (), 4 -> (), 1 -> ())
You can see that seq object is of List[Tuple2[Int, Double]] again but only contains the grouped tuples as list.
So when you apply seq.reduce(_._2 + _._2), the reduce function takes two inputs of Tuple2[Int, Double] but the output is Double only which doesn't match for the next iteration on seq as the expected input is Tuple2[Int, Double]. Thats the main issue. All you have to do is match the input and output types for reduce function
One way would be to match Tuple2[Int, Double] as
scala> points.groupBy(_._1).mapValues(seq => seq.reduce{(x,y) => (x._1, x._2 + y._2)})
res6: scala.collection.immutable.Map[Int,(Int, Double)] = Map(2 -> (2,10.0), 4 -> (4,2.0), 1 -> (1,5.0))
But this isn't your desired output, so you can extract the double value from the reduced Tuple2[Int, Double] as
scala> points.groupBy(_._1).mapValues(seq => seq.reduce{(x,y) => (x._1, x._2 + y._2)}._2)
res8: scala.collection.immutable.Map[Int,Double] = Map(2 -> 10.0, 4 -> 2.0, 1 -> 5.0)
or you can just use map before you apply reduce function as
scala> points.groupBy(_._1).mapValues(seq => seq.map(_._2).reduce(_ + _))
res3: scala.collection.immutable.Map[Int,Double] = Map(2 -> 10.0, 4 -> 2.0, 1 -> 5.0)
I hope the explanation is clear enough to understand your mistake and you must have understood how a reduce function works
You can map the tuples in the mapValues to their 2nd elements then sum them as follows:
points.groupBy(_._1).mapValues( _.map(_._2).sum ).toList
// res1: List[(Int, Double)] = List((2,10.0), (4,2.0), (1,5.0))
Using collect
points.groupBy(_._1).collect{
case e => e._1 -> e._2.map(_._2).sum
}.toList
//res1: List[(Int, Double)] = List((2,10.0), (4,2.0), (1,5.0))
I am new to scala I have List of Integers
val list = List((1,2,3),(2,3,4),(1,2,3))
val sum = list.groupBy(_._1).mapValues(_.map(_._2)).sum
val sum2 = list.groupBy(_._1).mapValues(_.map(_._3)).sum
How to perform N values I tried above but its not good way how to sum N values based on key
Also I have tried like this
val sum =list.groupBy(_._1).values.sum => error
val sum =list.groupBy(_._1).mapvalues(_.map(_._2).sum (_._3).sum) error
It's easier to convert these tuples to List[Int] with shapeless and then work with them. Your tuples are actually more like lists anyways. Also, as a bonus, you don't need to change your code at all for lists of Tuple4, Tuple5, etc.
import shapeless._, syntax.std.tuple._
val list = List((1,2,3),(2,3,4),(1,2,3))
list.map(_.toList) // convert tuples to list
.groupBy(_.head) // group by first element of list
.mapValues(_.map(_.tail).map(_.sum).sum) // sums elements of all tails
Result is Map(2 -> 7, 1 -> 10).
val sum = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._1 + j._2 + j._3).sum))
> sum: scala.collection.immutable.Map[Int,Int] = Map(2 -> 9, 1 -> 12)
Since tuple can't type safe convert to List, need to specify add one by one as j._1 + j._2 + j._3.
using the first element in the tuple as the key and the remaining elements as what you need you could do something like this:
val list = List((1,2,3),(2,3,4),(1,2,3))
list: List[(Int, Int, Int)] = List((1, 2, 3), (2, 3, 4), (1, 2, 3))
val sum = list.groupBy(_._1).map { case (k, v) => (k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum) }
sum: Map[Int, Int] = Map(2 -> 7, 1 -> 10)
i know its a bit dirty to do asInstanceOf[Int] but when you do .productIterator you get a Iterator of Any
this will work for any tuple size
This question already has answers here:
Why does groupBy in Scala change the ordering of a list's items?
(2 answers)
Closed 6 years ago.
This is a very simple code that can be executed inside a Scala Worksheet also. It is a map reduce kind of approach to calculate frequency of the numbers in the list.
I am sorting the list before I am starting the groupBy and map operation. Even then the list.groupBy.map operation generates a map, which is not sorted. Neither number wise nor frequency wise
//put this code in Scala worksheet
//this list is sorted and sorted in list variable
val list = List(1,2,4,2,4,7,3,2,4).sorted
//now you can see list is sorted
list
//now applying groupBy and map operation to create frequency map
val freqMap = list.groupBy(x => x) map{ case(k,v) => k-> v.length }
freqMap
groupBy doesn't guarantee any order
val list = List(1,2,4,2,4,7,3,2,4).sorted
val freqMap = list.groupBy(x => x)
Output:
freqMap: scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(1), 2 -> List(2, 2, 2), 7 -> List(7), 3 -> List(3), 4 -> List(4, 4, 4))
groupBy takes the list and groups the elements. It builds a Map in which the
key is a unique value of the list
value is a List of all occurrence in the list.
Here is the official method definition from the Scala docs:
def groupBy [K] (f: (A) ⇒ K): Map[K, Traversable[A]]
If you want to order the grouped result you can do it with ListMap:
scala> val freqMap = list.groupBy(x => x)
freqMap: scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(1), 2 -> List(2, 2, 2), 7 -> List(7), 3 -> List(3), 4 -> List(4, 4, 4))
scala> import scala.collection.immutable.ListMap
import scala.collection.immutable.ListMap
scala> ListMap(freqMap.toSeq.sortBy(_._1):_*)
res0: scala.collection.immutable.ListMap[Int,List[Int]] = Map(1 -> List(1), 2 -> List(2, 2, 2), 3 -> List(3), 4 -> List(4, 4, 4), 7 -> List(7))
Here's a quite simple request to combine two lists as following:
scala> list1
res17: List[(Int, Double)] = List((1,0.1), (2,0.2), (3,0.3), (4,0.4))
scala> list2
res18: List[(Int, String)] = List((1,aaa), (2,bbb), (3,ccc), (4,ddd))
The desired output is as:
((aaa,0.1),(bbb,0.2),(ccc,0.3),(ddd,0.4))
I tried:
scala> (list1 ++ list2)
res23: List[(Int, Any)] = List((1,0.1), (2,0.2), (3,0.3), (4,0.4),
(1,aaa), (2,bbb), (3,ccc), (4,ddd))
But:
scala> (list1 ++ list2).groupByKey
<console>:10: error: value groupByKey is not a member of List[(Int,
Any)](list1 ++ list2).groupByKey
Any hints? Thanks!
The method you're looking for is groupBy:
(list1 ++ list2).groupBy(_._1)
If you know that for each key you have exactly two values, you can join them:
scala> val pairs = List((1, "a1"), (2, "b1"), (1, "a2"), (2, "b2"))
pairs: List[(Int, String)] = List((1,a1), (2,b1), (1,a2), (2,b2))
scala> pairs.groupBy(_._1).values.map {
| case List((_, v1), (_, v2)) => (v1, v2)
| }
res0: Iterable[(String, String)] = List((b1,b2), (a1,a2))
Another approach using zip is possible if the two lists contain the same keys in the same order:
scala> val l1 = List((1, "a1"), (2, "b1"))
l1: List[(Int, String)] = List((1,a1), (2,b1))
scala> val l2 = List((1, "a2"), (2, "b2"))
l2: List[(Int, String)] = List((1,a2), (2,b2))
scala> l1.zip(l2).map { case ((_, v1), (_, v2)) => (v1, v2) }
res1: List[(String, String)] = List((a1,a2), (b1,b2))
Here's a quick one-liner:
scala> list2.map(_._2) zip list1.map(_._2)
res0: List[(String, Double)] = List((aaa,0.1), (bbb,0.2), (ccc,0.3), (ddd,0.4))
If you are unsure why this works then read on! I'll expand it step by step:
list2.map(<function>)
The map method iterates over each value in list2 and applies your function to it. In this case each of the values in list2 is a Tuple2 (a tuple with two values). What you are wanting to do is access the second tuple value. To access the first tuple value use the ._1 method and to access the second tuple value use the ._2 method. Here is an example:
val myTuple = (1.0, "hello") // A Tuple2
println(myTuple._1) // prints "1.0"
println(myTuple._2) // prints "hello"
So what we want is a function literal that takes one parameter (the current value in the list) and returns the second tuple value (._2). We could have written the function literal like this:
list2.map(item => item._2)
We don't need to specify a type for item because the compiler is smart enough to infer it thanks to target typing. A really helpful shortcut is that we can just leave out item altogether and replace it with a single underscore _. So it gets simplified (or cryptified, depending on how you view it) to:
list2.map(_._2)
The other interesting part about this one liner is the zip method. All zip does is take two lists and combines them into one list just like the zipper on your favorite hoodie does!
val a = List("a", "b", "c")
val b = List(1, 2, 3)
a zip b // returns ((a,1), (b,2), (c,3))
Cheers!