Combine map, reduceByKey and another map - scala

Data is a collection of tuples in the format: (group, number)
data.map(a => (a._1, (a._2, 1)))
.reduceByKey((a,b) => (a._1 * b._1, a._2 + b._2))
.map(a => (a._1, pow(a._2._1, 1 / a._2._2)))
As a total newbie to Spark: what is the provided code doing? Can you explain this code to me?
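For context, here is a minimal, self-contained sketch of what that pipeline does, assuming data is an RDD[(String, Double)] of (group, number) pairs (the sample values and the SparkContext sc are illustrative assumptions, not from the original post). It pairs each number with a count of 1, multiplies the numbers and adds the counts per group, and then raises the product to the power 1/count, i.e. it computes the geometric mean of each group. Note that the exponent should be a floating-point division (1.0 / count); with an Int count, 1 / count would truncate to 0 for groups with more than one element.
import scala.math.pow
// assuming an existing SparkContext sc and some illustrative sample data
val data = sc.parallelize(Seq(("a", 2.0), ("a", 8.0), ("b", 3.0)))

val geoMeans = data
  .map(a => (a._1, (a._2, 1)))                        // (group, (number, 1))
  .reduceByKey((a, b) => (a._1 * b._1, a._2 + b._2))  // (group, (product, count))
  .map(a => (a._1, pow(a._2._1, 1.0 / a._2._2)))      // (group, product^(1/count))

geoMeans.collect()  // geometric mean per group: (a,4.0), (b,3.0)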

Related

scala error: constructor cannot be instantiated to expected type;

I am trying to do matrix multiplication of two large matrices in Scala. Below is the logic for the multiplication:
val res = M_.map( M_ => (M_.j,M_) )
.join(N_.map( N_ => (N_.j, N_)))
.map({ case (_, ((i, v), (k, w))) => ((i, k), (v * w)) })
.reduceByKey(_ + _)
.map({ case ((i, k), sum) => (i, k, sum) })
M_ and N_ are two RDDs of these two classes:
case class M_Matrix ( i: Long, j: Long, v: Double )
case class N_Matrix ( j: Long, k: Long, w: Double )
But I am getting the following error:
error: constructor cannot be instantiated to expected type
What am I doing wrong here?
Since your RDDs contain M_Matrix and N_Matrix objects, you cannot pattern match them with a tuple. Something like this should work:
val res = M_.map( M_ => (M_.j,M_) )
.join(N_.map( N_ => (N_.j, N_)))
.map{ case (_, (m_matrix, n_matrix)) => ((m_matrix.i, n_matrix.k), m_matrix.v * n_matrix.w)}
.reduceByKey(_ + _)
.map{ case ((i, k), sum) => (i, k, sum)}
A better solution than using the case classes would be to use MatrixEntry:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
Use it instead of M_Matrix and N_Matrix when building the RDDs, then the join can look like this:
val res = M_.map( M_ => (M_.j,M_) )
.join(N_.map( N_ => (N_.i, N_)))
.map{ case (_, (m_matrix, n_matrix)) => ((m_matrix.i, n_matrix.j), m_matrix.value * n_matrix.value)}
.reduceByKey(_ + _)
.map{ case ((i, k), sum) => MatrixEntry(i, k, sum)}
This will result in an RDD[MatrixEntry], the same type as the two RDDs that were joined.
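As an illustrative usage sketch (the sample matrices and the SparkContext sc are my assumptions, not from the original post), multiplying two 2 x 2 matrices stored as MatrixEntry coordinates would look like this:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry

// illustrative coordinate data: M = [[1,2],[3,4]], N = [[5,6],[7,8]]
val M_ = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0),
                            MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 4.0)))
val N_ = sc.parallelize(Seq(MatrixEntry(0, 0, 5.0), MatrixEntry(0, 1, 6.0),
                            MatrixEntry(1, 0, 7.0), MatrixEntry(1, 1, 8.0)))

// M's column index j must line up with N's row index i
val res = M_.map(m => (m.j, m))
  .join(N_.map(n => (n.i, n)))
  .map { case (_, (m, n)) => ((m.i, n.j), m.value * n.value) }
  .reduceByKey(_ + _)
  .map { case ((i, k), sum) => MatrixEntry(i, k, sum) }

res.collect()  // entries of M * N: MatrixEntry(0,0,19.0), MatrixEntry(0,1,22.0), MatrixEntry(1,0,43.0), MatrixEntry(1,1,50.0), in some order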

How can I count the average from Spark RDD?

I have a problem in Spark with Scala: I want to compute the average from RDD data. I created a new RDD like this:
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
I want to average them like this:
[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]
and then get the result like this:
[(2,120),(3,204),(4,160)]
How can I do this in Scala with an RDD?
I am using Spark version 1.6.
You can use aggregateByKey:
val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
// accumulate a (sum, count) pair per key
val agg_rdd = rdd.aggregateByKey((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)
// divide the sum by the count (integer division here)
val sum = agg_rdd.mapValues(x => (x._1 / x._2))
sum.collect
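For the sample data this gives Array((2,120), (3,204), (4,160)) (in some order), matching the expected result; note that x._1 / x._2 is integer division.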
You can use groupByKey in this case, like this:
val rdd = spark.sparkContext.parallelize(List((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val processedRDD = rdd.groupByKey.mapValues{iterator => iterator.sum / iterator.size}
processedRDD.collect.toList
Here, groupByKey will return an RDD[(Int, Iterable[Int])], and then you can simply apply the average operation on the Iterable.
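If a fractional average is wanted rather than the truncated integer one, a small variant of the same idea would be (a sketch, with the hypothetical name avgRDD):
// hypothetical variant producing Double averages instead of truncated Ints
val avgRDD = rdd.groupByKey.mapValues(values => values.sum.toDouble / values.size)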
Hope this works for you
Thanks
You can use .combineByKey() to compute the average:
val data = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val sumCountPair = data.combineByKey((x: Int) => (x.toDouble,1),
(pair1: (Double, Int), x: Int) => (pair1._1 + x, pair1._2 + 1),
(pair1: (Double, Int), pair2: (Double, Int)) => (pair1._1 + pair2._1, pair1._2 + pair2._2))
val average = sumCountPair.map(x => (x._1, (x._2._1/x._2._2)))
average.collect()
Here, sumCountPair has type RDD[(Int, (Double, Int))], denoting (Key, (SumValue, CountValue)). The next step divides the sum by the count and returns (Key, AverageValue).
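For the same sample data, average.collect() returns Array((2,120.0), (3,204.0), (4,160.0)) (in some order); keeping the running sum as a Double avoids integer truncation.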

value _2 is not a member of Double spark-shell

I am getting an error while implementing aggregateByKey in the Spark Scala shell.
The piece of code that I am trying to execute in the Scala shell is this:
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap
.aggregateByKey(0.0,0)(
(a,b) => (a._1 + b , a._2 + 1),
(a,b) => (a._1 + b._1 , a._2 + b._2 )
)
But I am getting the following error,
<console>:39: error: value _1 is not a member of Double
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap.aggregateByKey( 0.0,0)( (a,b) => (a._1 + b , a._2 +1), (a,b) => (a._1 + b._1 , a._2 + b._2 ))
scala> orderItemsMapJoinOrdersMapMap
res8: org.apache.spark.rdd.RDD[(String, Float)] = MapPartitionsRDD[16] at map at <console>:37
Can someone help me understand the Double and Float value logic here, and how to fix it?
The problem is that you are passing the zero value the wrong way: aggregateByKey(0.0, 0)(...) treats 0.0 as the zero value and 0 as the number of partitions, so your accumulator is just a Double. Pass a (Double, Int) tuple instead. It should be something like this:
val orderItemsMapJoinOrdersMapMap: RDD[(String, Float)] = ...
// so elems of your orderItemsMapJoinOrdersMapMap are (String, Float)
// And your accumulator looks like (Double, Int)
// thus I believe you just want to accumulate the total number of elements and the sum of their Float values
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap
  .aggregateByKey((0.0, 0))(
    (acc, elem) => (acc._1 + elem, acc._2 + 1),
    (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
  )
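If the goal is the per-key average of those Float values (my guess at the intent, not stated in the original answer), a follow-up mapValues step would turn the (sum, count) accumulator into it:
// hypothetical follow-up step: average = running sum / element count per key
val orderAverages = orderItemsMapJoinOrdersMapMapAgg
  .mapValues { case (sum, count) => sum / count }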

Spark Mapvalues vs. Map

So I saw this question on Stack Overflow asked by another user, and I tried to write the code myself as I am trying to practice Scala and Spark:
The question was to find the per-key average from a list:
Assuming the list is: ( (1,1), (1,3), (2,4), (2,3), (3,1) )
The code was:
val result = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
map{ case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))
So basically the above code will create an RDD of type RDD[(Int, (Int, Int))], where the first Int is the key and the value is (Int, Int): the first Int there is the sum of all the values with the same key, and the second Int is the number of times the key appeared.
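For the sample list above, that works out to (1,2.0), (2,3.5) and (3,1.0) being printed.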
I understand what is going on but for some reason when I rewrite the code like this:
val result = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
mapValues(value: (Int, Int) => (value._1 / value._2))
result.collectAsMap().map(println(_))
When I use mapValues instead of map with the case keyword, the code doesn't work. It gives an error saying error: not found: type /. What is the difference between using map with case and using mapValues? I thought mapValues would just take the value (which in this case is an (Int, Int)), return a new value, and leave the key of the key-value pair unchanged.
try
val result = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
mapValues(value => (value._1 / value._2))
result.collectAsMap().map(println(_))
Never mind, I found a good article on my problem: http://danielwestheide.com/blog/2012/12/12/the-neophytes-guide-to-scala-part-4-pattern-matching-anonymous-functions.html
If anyone else has the same problem, that article explains it well!
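In short (my own summary, so treat it as a sketch of the explanation): inside parentheses, value: (Int, Int) => (value._1 / value._2) is parsed as a type ascription, so the compiler tries to read (Int, Int) => (value._1 / value._2) as a function type and stumbles over /, hence "not found: type /". Either drop the annotation as in the answer above, or wrap the typed parameter in its own parentheses:
mapValues((value: (Int, Int)) => value._1 / value._2)
map with case, by contrast, creates a pattern-matching anonymous function, which is why the tuple destructuring works there.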

Aggregate RDD values per key

I have an RDD in a key-value structure, (someKey, (measure1, measure2)). I grouped by the key and now I want to aggregate the values for each key.
val RDD1 : RDD[(String,(Int,Int))]
RDD1.groupByKey()
the result I need is:
key: avg(measure1), avg(measure2), max(measure1), max(measure2), min(measure1), min(measure2), count(*)
First of all, avoid groupByKey! You should use aggregateByKey or combineByKey. We will use aggregateByKey. This function transforms the values for each key: RDD[(K, V)] => RDD[(K, U)]. It needs a zero value of type U and knowledge of how to merge: (U, V) => U and (U, U) => U. I simplified your example a little bit and want to get: key: avg(measure1), avg(measure2), min(measure1), min(measure2), count(*)
val rdd1 = sc.parallelize(List(("a", (11, 1)), ("a",(12, 3)), ("b",(10, 1))))
rdd1
  .aggregateByKey((0.0, 0.0, Int.MaxValue, Int.MaxValue, 0))(
    {
      case ((sum1, sum2, min1, min2, count1), (v1, v2)) =>
        (sum1 + v1, sum2 + v2, v1 min min1, v2 min min2, count1 + 1)
    },
    {
      case ((sum1, sum2, min1, min2, count),
            (otherSum1, otherSum2, otherMin1, otherMin2, otherCount)) =>
        (sum1 + otherSum1, sum2 + otherSum2,
         min1 min otherMin1, min2 min otherMin2, count + otherCount)
    }
  )
  .map {
    case (k, (sum1, sum2, min1, min2, count1)) =>
      (k, (sum1 / count1, sum2 / count1, min1, min2, count1))
  }
  .collect()
giving
(a,(11.5,2.0,11,1,2)), (b,(10.0,1.0,10,1,1))
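To also get the max values the original question asked for, the same pattern extends; this is a sketch of my own (not part of the original answer), seeding the maxima with Int.MinValue:
rdd1
  .aggregateByKey((0.0, 0.0, Int.MaxValue, Int.MaxValue, Int.MinValue, Int.MinValue, 0))(
    {
      // fold one (measure1, measure2) value into the running (sums, mins, maxes, count)
      case ((sum1, sum2, min1, min2, max1, max2, count), (v1, v2)) =>
        (sum1 + v1, sum2 + v2, v1 min min1, v2 min min2, v1 max max1, v2 max max2, count + 1)
    },
    {
      // merge two partial accumulators
      case ((sum1, sum2, min1, min2, max1, max2, count),
            (otherSum1, otherSum2, otherMin1, otherMin2, otherMax1, otherMax2, otherCount)) =>
        (sum1 + otherSum1, sum2 + otherSum2,
         min1 min otherMin1, min2 min otherMin2,
         max1 max otherMax1, max2 max otherMax2, count + otherCount)
    }
  )
  .map {
    case (k, (sum1, sum2, min1, min2, max1, max2, count)) =>
      (k, (sum1 / count, sum2 / count, max1, max2, min1, min2, count))
  }
  .collect()
giving, for the same rdd1,
(a,(11.5,2.0,12,3,11,1,2)), (b,(10.0,1.0,10,1,10,1,1))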