I have list of tuples which contains userId and point. I want to combine or reduce this list by the key.
val points: List[(Int, Double)] = List(
(1, 1.0),
(2, 3.2),
(4, 2.0),
(1, 4.0),
(2, 6.8)
)
The expected result should look like:
List((1, 5.0), (2, 10.0), (4, 2.0))
I tried with groupBy and mapValue, but got an error:
val aggrPoint: Map[Int, Double] = incomes.groupBy(_._1).mapValues(seq => seq.reduce(_._2 + _._2))
Error:(16, 180) type mismatch;
found : Double
required: (Int, Double)
What am I doing wrong, and is there a idiomatic way to achieve this?
P.S) I found that in Spark aggregateByKey does this job. But, is there a built-in method in Scala?
What am I doing wrong, and is there a idiomatic way to achieve this?
let's go step by step to see what are you doing wrong. (I am going to use REPL)
first of all lets define the points
scala> val points: List[(Int, Double)] = List(
| (1, 1.0),
| (2, 3.2),
| (4, 2.0),
| (1, 4.0),
| (2, 6.8)
| )
points: List[(Int, Double)] = List((1,1.0), (2,3.2), (4,2.0), (1,4.0), (2,6.8))
As you can see that you have List[Tuple2[Int, Double]] so when you do groupBy and mapValues as
scala> points.groupBy(_._1).mapValues(seq => println(seq))
List((2,3.2), (2,6.8))
List((4,2.0))
List((1,1.0), (1,4.0))
res1: scala.collection.immutable.Map[Int,Unit] = Map(2 -> (), 4 -> (), 1 -> ())
You can see that seq object is of List[Tuple2[Int, Double]] again but only contains the grouped tuples as list.
So when you apply seq.reduce(_._2 + _._2), the reduce function takes two inputs of Tuple2[Int, Double] but the output is Double only which doesn't match for the next iteration on seq as the expected input is Tuple2[Int, Double]. Thats the main issue. All you have to do is match the input and output types for reduce function
One way would be to match Tuple2[Int, Double] as
scala> points.groupBy(_._1).mapValues(seq => seq.reduce{(x,y) => (x._1, x._2 + y._2)})
res6: scala.collection.immutable.Map[Int,(Int, Double)] = Map(2 -> (2,10.0), 4 -> (4,2.0), 1 -> (1,5.0))
But this isn't your desired output, so you can extract the double value from the reduced Tuple2[Int, Double] as
scala> points.groupBy(_._1).mapValues(seq => seq.reduce{(x,y) => (x._1, x._2 + y._2)}._2)
res8: scala.collection.immutable.Map[Int,Double] = Map(2 -> 10.0, 4 -> 2.0, 1 -> 5.0)
or you can just use map before you apply reduce function as
scala> points.groupBy(_._1).mapValues(seq => seq.map(_._2).reduce(_ + _))
res3: scala.collection.immutable.Map[Int,Double] = Map(2 -> 10.0, 4 -> 2.0, 1 -> 5.0)
I hope the explanation is clear enough to understand your mistake and you must have understood how a reduce function works
You can map the tuples in the mapValues to their 2nd elements then sum them as follows:
points.groupBy(_._1).mapValues( _.map(_._2).sum ).toList
// res1: List[(Int, Double)] = List((2,10.0), (4,2.0), (1,5.0))
Using collect
points.groupBy(_._1).collect{
case e => e._1 -> e._2.map(_._2).sum
}.toList
//res1: List[(Int, Double)] = List((2,10.0), (4,2.0), (1,5.0))
Related
This question already has answers here:
take top N after groupBy and treat them as RDD
(4 answers)
Closed 4 years ago.
I have a rdd with format of each row (key, (int, double))
I would like to transform the rdd into (key, ((int, double), (int, double) ...) )
Where the the values in the new rdd is the top N values pairs sorted by the double
So far I came up with the solution below but it's really slow and runs forever, it works fine with smaller rdd but now the rdd is too big
val top_rated = test_rated.partitionBy(new HashPartitioner(4)).sortBy(_._2._2).groupByKey()
.mapValues(x => x.takeRight(n))
I wonder if there are better and faster ways to do this?
The most efficient way is probably aggregateByKey
type K = String
type V = (Int, Double)
val rdd: RDD[(K, V)] = ???
//TODO: implement a function that adds a value to a sorted array and keeps top N elements. Returns the same array
def addToSortedArray(arr: Array[V], newValue: V): Array[V] = ???
//TODO: implement a function that merges 2 sorted arrays and keeps top N elements. Returns the first array
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] = ??? //TODO
val result: RDD[(K, Array[(Int, Double)])] = rdd.aggregateByKey(zeroValue = new Array[V](0))(seqOp = addToSortedArray, combOp = mergeSortedArrays)
Since you're interested only in the top-N values in your RDD, I would suggest that you avoid sorting across the entire RDD. In addition, use the more performing reduceByKey rather than groupByKey if at all possible. Below is an example using a topN method, borrowed from this blog:
def topN(n: Int, list: List[(Int, Double)]): List[(Int, Double)] = {
def bigHead(l: List[(Int, Double)]): List[(Int, Double)] = list match {
case Nil => list
case _ => l.tail.foldLeft( List(l.head) )( (acc, x) =>
if (x._2 <= acc.head._2) x :: acc else acc :+ x
)
}
def update(l: List[(Int, Double)], e: (Int, Double)): List[(Int, Double)] = {
if (e._2 > l.head._2) bigHead((e :: l.tail)) else l
}
list.drop(n).foldLeft( bigHead(list.take(n)) )( update ).sortWith(_._2 > _._2)
}
val rdd = sc.parallelize(Seq(
("a", (1, 10.0)), ("a", (4, 40.0)), ("a", (3, 30.0)), ("a", (5, 50.0)), ("a", (2, 20.0)),
("b", (3, 30.0)), ("b", (1, 10.0)), ("b", (4, 40.0)), ("b", (2, 20.0))
))
val n = 2
rdd.
map{ case (k, v) => (k, List(v)) }.
reduceByKey{ (acc, x) => topN(n, acc ++ x) }.
collect
// res1: Array[(String, List[(Int, Double)])] =
// Array((a,List((5,50.0), (4,40.0))), (b,List((4,40.0), (3,30.0)))))
In the following snippet of code, I am aware that x._1 denotes the first element of the tuple, but I couldn't understand what x._1._1 represents.I am not so familiar with Scala, sorry if it is a relatively naive question, thank you!!
val a = b.groupBy(x=> x._1._1)
Here is a quick example in the REPL of a nested tuple
scala> val t = ((1, 2), 3)
t: ((Int, Int), Int) = ((1,2),3)
scala> t._1 // Get the first part of the tuple
res0: (Int, Int) = (1,2)
scala> t._2 // Get the second part of the tuple
res1: Int = 3
scala> t._1._1 // Get the first part of the first part
res2: Int = 1
And here is an example with a sequence to demonstrate the groupBy:
scala> val s = Seq(((1, 2), 3), ((1, 5), 6), ((2, 4), 32))
s: Seq[((Int, Int), Int)] = List(((1,2),3), ((1,5),6), ((2,4),32))
scala> s.groupBy
def groupBy[K](f: (((Int, Int), Int)) => K): scala.collection.immutable.Map[K,Seq[((Int, Int), Int)]]
scala> s.groupBy(x => x._1._1)
res3: scala.collection.immutable.Map[Int,Seq[((Int, Int), Int)]] = Map(2 -> List(((2,4),32)), 1 -> List(((1,2),3), ((1,5),6)))
In this case the first element of the first element are the target for the grouping. Here's the result in an easier to look at format:
Map(
2 -> List(
((2,4),32)),
1 -> List(
((1,2),3),
((1,5),6))
)
It means x._1 itself is a tuple.
Example:
val b = Seq((("subTuple_1", "subTuple_2"), "tuple_2"))
val a = b.groupBy(x=> x._1._1)
As you mentioned ._1 gives you the first column of your tuple, and if the result of first column is Tuple, you can do ._1.
eg.
scala> Map(("a" -> "b") -> 100, ("c" -> "d") -> 200).map(_._1)
res31: scala.collection.immutable.Map[String,String] = Map(a -> b, c -> d)
scala> Map(("a" -> "b") -> 100, ("c" -> "d") -> 200).map(_._1._1)
res32: scala.collection.immutable.Iterable[String] = List(a, c)
groupBy,
scala> Map(("a" -> "b") -> 100, ("a" -> "c") -> 200).groupBy(_._1._1)
res19: scala.collection.immutable.Map[String,scala.collection.immutable.Map[(String, String),Int]] = Map(a -> Map((a,b) -> 100, (a,c) -> 200))
I am a scala beginner and I want to perform a simple groupby and sum over an Observable. For example:
val test = Observable.just(("a", 1), ("a", 2), ("b", 5), ("b",3))
I would like to group by key and sum over the values so to have something like:
(a,3)
(b,8)
I am able to sum over all with test.map(_._2).sum,but not when performing a groupby
not the classes you are looking for clearly. i was rummaging around scala.react and reactivex.io, but how about this:
scala> val test999=Seq(("a",1),("a",16),("b",5),("a",9),("b",9),("c",90))
test999: Seq[(String, Int)] = List((a,1), (a,16), (b,5), (a,9), (b,9), (c,90))
scala> test999
res12: Seq[(String, Int)] = List((a,1), (a,16), (b,5), (a,9), (b,9), (c,90))
scala> test999.groupBy(_._1).mapValues(_.map(_._2).sum)
res13: scala.collection.immutable.Map[String,Int] = Map(b -> 14, a -> 26, c -> 90)
Given that
I am able to sum over all with test.map(_._2).sum,...
namely that it is possible to map onto test, consider applying groupBy to test.toSeq.
I apply groupBy function to my List collection, however I want to remove the repetitive values in the value part of the Map. Here is the initial List collection:
PO_ID PRODUCT_ID RETURN_QTY
1 1 10
1 1 20
1 2 30
1 2 10
When I apply groupBy to that List, it will produce something like this:
(1, 1) -> (1, 1, 10),(1, 1, 20)
(1, 2) -> (1, 2, 30),(1, 2, 10)
What I really want is something like this:
(1, 1) -> (10),(20)
(1, 2) -> (30),(10)
So, is there anyway to remove the repetitive part in the Map's values [(1,1),(1,2)] ?
Thanks..
For
val a = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
consider
a.groupBy( v => (v._1,v._2) ).mapValues( _.map (_._3) )
which delivers
Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Note that mapValues operates over a List[List] of triplets obtained from groupBy, whereas in map we extract the third element of each triplet.
Is it easier to pull the tuple apart first?
scala> val ts = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
ts: Seq[(Int, Int, Int)] = List((1,1,10), (1,1,20), (1,2,30), (1,2,10))
scala> ts map { case (a,b,c) => (a,b) -> c }
res0: Seq[((Int, Int), Int)] = List(((1,1),10), ((1,1),20), ((1,2),30), ((1,2),10))
scala> ((Map.empty[(Int, Int), List[Int]] withDefaultValue List.empty[Int]) /: res0) { case (m, (k,v)) => m + ((k, m(k) :+ v)) }
res1: scala.collection.immutable.Map[(Int, Int),List[Int]] = Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Guess not.
I am trying to add pairs from list to a map using foldl. I get the following error:
"missing arguments for method /: in trait TraversableOnce; follow this method with `_' if you want to treat it as a partially applied function"
code:
val pairs = List(("a", 1), ("a", 2), ("c", 3), ("d", 4))
def lstToMap(lst:List[(String,Int)], map: Map[String, Int] ) = {
(map /: lst) addToMap ( _, _)
}
def addToMap(pair: (String, Int), map: Map[String, Int]): Map[String, Int] = {
map + (pair._1 -> pair._2)
}
What is wrong?
scala> val pairs = List(("a", 1), ("a", 2), ("c", 3), ("d", 4))
pairs: List[(String, Int)] = List((a,1), (a,2), (c,3), (d,4))
scala> (Map.empty[String, Int] /: pairs)(_ + _)
res9: scala.collection.immutable.Map[String,Int] = Map(a -> 2, c -> 3, d -> 4)
But you know, you could just do:
scala> pairs.toMap
res10: scala.collection.immutable.Map[String,Int] = Map(a -> 2, c -> 3, d -> 4)
You need to swap the input values of addToMap and put it in parenthesis for this to work:
def addToMap( map: Map[String, Int], pair: (String, Int)): Map[String, Int] = {
map + (pair._1 -> pair._2)
}
def lstToMap(lst:List[(String,Int)], map: Map[String, Int] ) = {
(map /: lst)(addToMap)
}
missingfaktor's answer is much more concise, reusable, and scala-like.
If you already have a collection of Tuple2s, you don't need to implement this yourself, there is already a toMap method that only works if the elements are tuples!
The full signature is:
def toMap[T, U](implicit ev: <:<[A, (T, U)]): Map[T, U]
It works by requiring an implicit A <:< (T, U) which is essentially a function that can take the element type A and cast/convert it to tuples of type (T, U). Another way of saying this is that it requires an implicit witness that A is-a (T, U). Therefore, this is completely type-safe.
Update: which is what #missingfaktor said
This is not a direct answer to the question, which is about folding correctly on the map, but I deem it important to emphasize that
a Map can be treated as a generic Traversable of pairs
and you can easily combine the two!
scala> val pairs = List(("a", 1), ("a", 2), ("c", 3), ("d", 4))
pairs: List[(String, Int)] = List((a,1), (a,2), (c,3), (d,4))
scala> Map.empty[String, Int] ++ pairs
res1: scala.collection.immutable.Map[String,Int] = Map(a -> 2, c -> 3, d -> 4)
scala> pairs.toMap
res2: scala.collection.immutable.Map[String,Int] = Map(a -> 2, c -> 3, d -> 4)