Aggregate RDD values per key


I have an RDD with a key/value structure (someKey,(measure1,measure2)). I grouped by the key and now I want to aggregate the values for each key.
val RDD1 : RDD[(String,(Int,Int))]
RDD1.groupByKey()
the result I need is:
key: avg(measure1), avg(measure2), max(measure1), max(measure2), min(measure1), min(measure2), count(*)

First of all, avoid groupByKey! It shuffles every value across the network before anything is aggregated. Use aggregateByKey or combineByKey instead; here we will use aggregateByKey. It transforms the values for each key, RDD[(K, V)] => RDD[(K, U)], and needs a zero value of type U plus two functions: one that merges a value into an accumulator, (U, V) => U, and one that merges two accumulators, (U, U) => U. I simplified your example a little and compute: key: avg(measure1), avg(measure2), min(measure1), min(measure2), count(*)
val rdd1 = sc.parallelize(List(("a", (11, 1)), ("a",(12, 3)), ("b",(10, 1))))
rdd1
.aggregateByKey((0.0, 0.0, Int.MaxValue, Int.MaxValue, 0))(
{
case ((sum1, sum2, min1, min2, count1), (v1, v2)) =>
(sum1 + v1, sum2 + v2, v1 min min1, v2 min min2, count1+1)
},
{
case ((sum1, sum2, min1, min2, count),
(otherSum1, otherSum2, otherMin1, otherMin2, otherCount)) =>
(sum1 + otherSum1, sum2 + otherSum2,
min1 min otherMin1, min2 min otherMin2, count + otherCount)
}
)
.map {
case (k, (sum1, sum2, min1, min2, count1)) => (k, (sum1/count1, sum2/count1, min1, min2, count1))
}
.collect()
giving
(a,(11.5,2.0,11,1,2)), (b,(10.0,1.0,10,1,1))
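The simplification above drops max(measure1) and max(measure2) from the question. As a rough sketch of how the same zero value / merge-value / merge-accumulators contract could be extended to cover them, here is a plain-Scala version that runs without Spark. The names Stats, AggregateSketch and aggregateLocally are hypothetical and exist only for illustration; the nested Seq stands in for the RDD's partitions.

```scala
// Plain-Scala sketch of the aggregateByKey contract; no Spark required.
// Stats, AggregateSketch and aggregateLocally are hypothetical names.
final case class Stats(sum1: Double, sum2: Double,
                       min1: Int, min2: Int,
                       max1: Int, max2: Int,
                       count: Int) {
  def avg1: Double = sum1 / count
  def avg2: Double = sum2 / count
}

object AggregateSketch {
  // Zero value of type U: a neutral element for every field.
  val zero: Stats =
    Stats(0.0, 0.0, Int.MaxValue, Int.MaxValue, Int.MinValue, Int.MinValue, 0)

  // (U, V) => U: fold one (measure1, measure2) pair into the accumulator.
  def seqOp(acc: Stats, v: (Int, Int)): Stats =
    Stats(acc.sum1 + v._1, acc.sum2 + v._2,
      acc.min1 min v._1, acc.min2 min v._2,
      acc.max1 max v._1, acc.max2 max v._2,
      acc.count + 1)

  // (U, U) => U: merge two accumulators built on different partitions.
  def combOp(a: Stats, b: Stats): Stats =
    Stats(a.sum1 + b.sum1, a.sum2 + b.sum2,
      a.min1 min b.min1, a.min2 min b.min2,
      a.max1 max b.max1, a.max2 max b.max2,
      a.count + b.count)

  // Each inner Seq plays the role of one partition.
  def aggregateLocally(partitions: Seq[Seq[(String, (Int, Int))]]): Map[String, Stats] = {
    // Step 1: within each partition, start from zero and apply seqOp per key.
    val perPartition = partitions.map { part =>
      part.foldLeft(Map.empty[String, Stats]) { case (m, (k, v)) =>
        m.updated(k, seqOp(m.getOrElse(k, zero), v))
      }
    }
    // Step 2: merge the per-partition accumulators with combOp.
    perPartition.flatten.foldLeft(Map.empty[String, Stats]) { case (m, (k, s)) =>
      m.updated(k, m.get(k).map(combOp(_, s)).getOrElse(s))
    }
  }
}
```

With Spark, passing AggregateSketch.zero, seqOp and combOp to aggregateByKey should have the same shape, with the averages read from avg1 and avg2 in a final mapValues.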

Related

How to convert RDD[Array[String]] to RDD[(Int, HashMap[String, List])]?

I have input data:
time, id, counter, value
00.2, 1 , c1 , 0.2
00.2, 1 , c2 , 0.3
00.2, 1 , c1 , 0.1
and I want for every id to create a structure to store counters and values. After thinking about vectors and rejecting them, I came to this:
(id, Hashmap( (counter1, List(Values)), (Counter2, List(Values)) ))
(1, HashMap( (c1,List(0.2, 0.1)), (c2,List(0.3)))
The problem is that I can't convert to a HashMap inside the map transformation, and additionally I don't know whether I will be able to reduce the lists inside the map by counter.
Does anyone have any idea?
My code is :
val data = inputRdd
.map(y => (y(1).toInt, mutable.HashMap(y(2), List(y(3).toDouble)))).reduceByKey(_++_)

Off the top of my head, untested:
import collection.mutable.HashMap
inputRdd
.map{ case Array(t, id, c, v) => (id.toInt, (c, v)) }
.aggregateByKey(HashMap.empty[String, List[String]])(
{ case (m, (c, v)) => { m(c) = v :: m.getOrElse(c, Nil); m } },
{ case (m1, m2) => { for ((k, vs) <- m2) m1(k) = m1.getOrElse(k, Nil) ++ vs; m1 } }
)

Here's one approach:
val rdd = sc.parallelize(Seq(
("00.2", 1, "c1", 0.2),
("00.2", 1, "c2", 0.3),
("00.2", 1, "c1", 0.1)
))
rdd.
map{ case (t, i, c, v) => (i, (c, v)) }.
groupByKey.mapValues(
_.groupBy(_._1).mapValues(_.map(_._2)).map(identity)
).
collect
// res1: Array[(Int, scala.collection.immutable.Map[String,Iterable[Double]])] = Array(
// (1,Map(c1 -> List(0.2, 0.1), c2 -> List(0.3)))
// )
Note that the final map(identity) is a remedy for the Map#mapValues not serializable problem suggested in this SO answer.

If, as you have mentioned, you have inputRdd as
//inputRdd: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[0] at parallelize at ....
Then a simple groupBy and foldLeft on the grouped values should do the trick for you to have the final desired result
val resultRdd = inputRdd.groupBy(_(1))
.mapValues(x => x
.foldLeft(Map.empty[String, List[String]]){(a, b) => {
if(a.keySet.contains(b(2))){
val c = a ++ Map(b(2) -> (a(b(2)) ++ List(b(3))))
c
}
else{
val c = a ++ Map(b(2) -> List(b(3)))
c
}
}}
)
//resultRdd: org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Map[String,List[String]])] = MapPartitionsRDD[3] at mapValues at ...
//(1,Map(c1 -> List(0.2, 0.1), c2 -> List(0.3)))
Changing RDD[(String, scala.collection.immutable.Map[String,List[String]])] to RDD[(Int, HashMap[String,List[String]])] is a conversion rather than a cast: map over the result, call toInt on the key, and copy each immutable Map into a HashMap. I hope that is easy enough for you to do.
I hope the answer is helpful
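For the last step, going from (String, immutable Map) to (Int, HashMap), a per-element conversion function might look like the sketch below. ConvertSketch and toTarget are hypothetical names, and parsing the values to Double is an assumption based on the toDouble call in the question.

```scala
import scala.collection.mutable

// Hypothetical helper: converts one element of the grouped result
// into the (Int, HashMap[String, List[Double]]) shape from the question.
object ConvertSketch {
  def toTarget(kv: (String, Map[String, List[String]])): (Int, mutable.HashMap[String, List[Double]]) = {
    val (id, counters) = kv
    val converted = mutable.HashMap.empty[String, List[Double]]
    // Copy every counter, trimming whitespace and parsing the values as Double.
    counters.foreach { case (counter, values) =>
      converted(counter.trim) = values.map(_.trim.toDouble)
    }
    (id.trim.toInt, converted)
  }
}
```

On the RDD this would presumably be applied as resultRdd.map(ConvertSketch.toTarget).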

scala error: constructor cannot be instantiated to expected type;

I am trying to do this for matrix multiplication of two large matrices in scala. Below is the logic for the multiplication:
val res = M_.map( M_ => (M_.j,M_) )
.join(N_.map( N_ => (N_.j, N_)))
.map({ case (_, ((i, v), (k, w))) => ((i, k), (v * w)) })
.reduceByKey(_ + _)
.map({ case ((i, k), sum) => (i, k, sum) })
M_ and N_ are two RDDs of these two classes:
case class M_Matrix ( i: Long, j: Long, v: Double )
case class N_Matrix ( j: Long, k: Long, w: Double )
But I am getting the following error (originally posted as an image): constructor cannot be instantiated to expected type.
What am I doing wrong here?

Since your rdd/dataframe contains M_Matrix and N_Matrix objects you can not match with a tuple. Something like this should work:
val res = M_.map( M_ => (M_.j,M_) )
.join(N_.map( N_ => (N_.j, N_)))
.map{ case (_, (m_matrix, n_matrix)) => ((m_matrix.i, n_matrix.k), m_matrix.v * n_matrix.w)}
.reduceByKey(_ + _)
.map{ case ((i, k), sum) => (i, k, sum)}
A better solution than using the case classes would be to use MatrixEntry:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
Use it instead of M_Matrix and N_Matrix when building the RDDs, then the join can look like this:
val res = M_.map( M_ => (M_.j,M_) )
.join(N_.map( N_ => (N_.i, N_)))
.map{ case (_, (m_matrix, n_matrix)) => ((m_matrix.i, n_matrix.j), m_matrix.value * n_matrix.value)}
.reduceByKey(_ + _)
.map{ case ((i, k), sum) => MatrixEntry(i, k, sum)}
This will result in an RDD[MatrixEntry], the same type as the two RDDs that were joined.
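If you prefer to keep destructuring inside the pattern, constructor patterns on the case classes are another option. The sketch below checks the join-and-multiply logic on local collections rather than RDDs; MatrixSketch and multiplyLocally are hypothetical names, and the for comprehension is only a stand-in for the RDD join.

```scala
final case class M_Matrix(i: Long, j: Long, v: Double)
final case class N_Matrix(j: Long, k: Long, w: Double)

object MatrixSketch {
  // Local stand-in for the RDD pipeline: join on j, multiply, sum per (i, k).
  def multiplyLocally(m: Seq[M_Matrix], n: Seq[N_Matrix]): Map[(Long, Long), Double] = {
    // Mimics join: pair every M entry with every N entry sharing the same j.
    val joined = for { a <- m; b <- n if a.j == b.j } yield (a.j, (a, b))
    joined
      // Constructor patterns match the case classes where a tuple pattern cannot.
      .map { case (_, (M_Matrix(i, _, v), N_Matrix(_, k, w))) => ((i, k), v * w) }
      .groupBy(_._1)
      .map { case (ik, products) => ik -> products.map(_._2).sum }
  }
}
```

The same case (_, (M_Matrix(i, _, v), N_Matrix(_, k, w))) pattern should also work in the RDD version's map step.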

Reduce/fold over scala sequence with grouping

In Scala, given an Iterable of pairs, say Iterable[(String, Int)],
is there a way to accumulate or fold over the ._2s based on the ._1s? For example, in the following list, add up all the numbers that come after A and, separately, the numbers that come after B:
List(("A", 2), ("B", 1), ("A", 3))
I could do this in 2 steps with groupBy
val mapBy1 = list.groupBy( _._1 )
for ((key,sublist) <- mapBy1) yield (key, sublist.foldLeft(0) (_+_._2))
but then I would be allocating the sublists, which I would rather avoid.

You could build the Map as you go and convert it back to a List after the fact.
listOfPairs.foldLeft(Map[String,Int]().withDefaultValue(0)){
case (m,(k,v)) => m + (k -> (v + m(k)))
}.toList

You could do something like:
list.foldLeft(Map[String, Int]()) {
case (map, (k,v)) => map + (k -> (map.getOrElse(k, 0) + v))
}

You could also use groupBy with mapValues:
list.groupBy(_._1).mapValues(_.map(_._2).sum).toList
res1: List[(String, Int)] = List((A,5), (B,1))
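On Scala 2.13 or later (an assumption, since the question does not state a version), groupMapReduce does the grouping and the summing in one pass without building the intermediate sublists. GroupSum and sumByKey are hypothetical names for this sketch.

```scala
// Assumes Scala 2.13+, where groupMapReduce is available on collections.
object GroupSum {
  def sumByKey(pairs: Iterable[(String, Int)]): Map[String, Int] =
    // key = _1, mapped value = _2, values sharing a key are combined with +
    pairs.groupMapReduce(_._1)(_._2)(_ + _)
}
```

For the list in the question, GroupSum.sumByKey(List(("A", 2), ("B", 1), ("A", 3))) yields a Map with A -> 5 and B -> 1.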

How can I count the average from Spark RDD?

I have a problem with Spark in Scala: I want to compute the average from RDD data. I created a new RDD like this:
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
I want to count them like this,
[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]
then,get the result like this,
[(2,120),(3,204),(4,160)]
How can I do this in Scala from an RDD?
I use Spark version 1.6.

You can use aggregateByKey.
val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val agg_rdd = rdd.aggregateByKey((0,0))((acc, value) => (acc._1 + value, acc._2 + 1),(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = agg_rdd.mapValues(x => x._1 / x._2) // integer division; use x._1.toDouble / x._2 for a fractional average
avg.collect

You can use groupByKey in this case, like this:
val rdd = spark.sparkContext.parallelize(List((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val processedRDD = rdd.groupByKey.mapValues{iterator => iterator.sum / iterator.size}
processedRDD.collect.toList
Here, groupByKey returns an RDD[(Int, Iterable[Int])], so you can simply compute the average over each Iterable.
Hope this works for you
Thanks

You can use .combineByKey() to compute average:
val data = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val sumCountPair = data.combineByKey((x: Int) => (x.toDouble,1),
(pair1: (Double, Int), x: Int) => (pair1._1 + x, pair1._2 + 1),
(pair1: (Double, Int), pair2: (Double, Int)) => (pair1._1 + pair2._1, pair1._2 + pair2._2))
val average = sumCountPair.map(x => (x._1, (x._2._1/x._2._2)))
average.collect()
Here sumCountPair has type RDD[(Int, (Double, Int))], denoting (Key, (SumValue, CountValue)). The next step just divides the sum by the count and returns (Key, AverageValue).

value _2 is not a member of Double spark-shell

I am getting an error while implementing aggregateByKey in spark-scala-shell.
The piece of code that I am trying to execute on Scala-shell is this,
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap
.aggregateByKey(0.0,0)(
(a,b) => (a._1 + b , a._2 + 1),
(a,b) => (a._1 + b._1 , a._2 + b._2 )
)
But I am getting the following error,
<console>:39: error: value _1 is not a member of Double
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap.aggregateByKey( 0.0,0)( (a,b) => (a._1 + b , a._2 +1), (a,b) => (a._1 + b._1 , a._2 + b._2 ))
scala> orderItemsMapJoinOrdersMapMap
res8: org.apache.spark.rdd.RDD[(String, Float)] = MapPartitionsRDD[16] at map at <console>:37
Can someone help me understand the Double and Float logic here and how to fix it?

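The problem is that you are passing the first curried argument the wrong way: the zero value must be a single tuple, so it needs its own pair of parentheses. Written as aggregateByKey(0.0,0), the call resolves to the overload that takes (zeroValue, numPartitions), the accumulator becomes a plain Double, and a._1 fails. It should be something like this: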
val orderItemsMapJoinOrdersMapMap: RDD[(String, Float)] = ...
// so elems of your orderItemsMapJoinOrdersMapMap are (String, Float)
// And your accumulator looks like (Double, Int)
// thus I believe that you just want to accumulate total number of elements and sum of the floats in them
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap
.aggregateByKey((0.0,0))(
(acc, elem) => (acc._1 + elem , acc._2 + 1), // elem is the Float value, not the (key, value) pair
(acc1, acc2) => (acc1._1 + acc2._1 , acc1._2 + acc2._2)
)
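To make the types explicit, here is a small plain-Scala sketch of the same accumulator logic, runnable without Spark. SumCountSketch, Acc and aggregate are hypothetical names, and the nested Seq stands in for the values of one key spread over several partitions.

```scala
// Plain-Scala sketch; the nested Seq plays the role of one key's values per partition.
object SumCountSketch {
  type Acc = (Double, Int) // (running sum, running count)

  // The zero value is ONE tuple, which is why the extra parentheses matter.
  val zero: Acc = (0.0, 0)

  // (U, V) => U: the second argument is the Float value itself.
  val seqOp: (Acc, Float) => Acc = (acc, elem) => (acc._1 + elem, acc._2 + 1)

  // (U, U) => U: merge two partial (sum, count) pairs.
  val combOp: (Acc, Acc) => Acc = (a, b) => (a._1 + b._1, a._2 + b._2)

  def aggregate(partitions: Seq[Seq[Float]]): Acc =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
}
```

Dividing the first component by the second in a following mapValues would then give the per-key average.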