How can I count the average from Spark RDD? - scala

I have a problem in Spark with Scala: I want to compute the average from RDD data. I create an RDD like this:
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
I want to average them like this:
[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]
and then get the result like this:
[(2,120),(3,204),(4,160)]
How can I do this in Scala with an RDD?
I use Spark version 1.6.

You can use aggregateByKey:
val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
// accumulate a (sum, count) pair per key, then divide
val agg_rdd = rdd.aggregateByKey((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = agg_rdd.mapValues { case (sum, count) => sum / count }
avg.collect()
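Note that the final division here is integer division, which happens to be exact for this data (360/3, 612/3, 480/3); use toDouble on the sum if fractional averages are needed. The (sum, count) fold itself can be sketched off-cluster with plain Scala collections (a hypothetical illustration, not Spark API):

```scala
// Plain-Scala sketch of the same (sum, count) fold that aggregateByKey performs.
val data = Seq((2, 110), (2, 130), (2, 120), (3, 200), (3, 206), (3, 206),
               (4, 150), (4, 160), (4, 170))
val avgByKey = data
  .groupBy(_._1)
  .map { case (k, pairs) =>
    // zero value (0, 0); each element updates (sum, count), mirroring the seqOp
    val (sum, count) = pairs.foldLeft((0, 0)) { case ((s, c), (_, v)) => (s + v, c + 1) }
    (k, sum / count)
  }
// avgByKey: Map(2 -> 120, 3 -> 204, 4 -> 160)
```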

You can use groupByKey in this case, like this:
val rdd = spark.sparkContext.parallelize(List((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val processedRDD = rdd.groupByKey.mapValues { values => values.sum / values.size }
processedRDD.collect.toList
Here, groupByKey returns an RDD[(Int, Iterable[Int])], so you can simply apply the average operation on each Iterable.
Hope this works for you
Thanks

You can use .combineByKey() to compute the average:
val data = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val sumCountPair = data.combineByKey(
  (x: Int) => (x.toDouble, 1),
  (pair: (Double, Int), x: Int) => (pair._1 + x, pair._2 + 1),
  (pair1: (Double, Int), pair2: (Double, Int)) => (pair1._1 + pair2._1, pair1._2 + pair2._2))
val average = sumCountPair.map(x => (x._1, x._2._1 / x._2._2))
average.collect()
Here sumCountPair has type RDD[(Int, (Double, Int))], denoting (Key, (SumValue, CountValue)). The next step just divides the sum by the count and returns (Key, AverageValue).

Related

loop with accumulator on an rdd

I want to loop n times, where n is an accumulator, over the same RDD.
Let's say n = 10; I want the code below to loop 5 times (since the accumulator is increased by two).
val key = keyAcm.value.toInt
val rest = rdd.filter(_._1 > (key + 1))
val combined = rdd.filter(k => (k._1 == key) || (k._1 == key + 1))
  .map(x => (key, x._2))
  .reduceByKey { case (x, y) => x ++ y }
keyAcm.add(2)
combined.union(rest)
With this code I filter the RDD and keep keys 0 (the initial value of the accumulator) and 1. Then I try to merge their second elements and change the key, creating a new RDD with key 0 and a merged array. After that, I union this RDD with the original one, leaving behind the filtered keys (0 and 1). Lastly, I increase the accumulator by two. How can I repeat these steps until the accumulator reaches 10?
Any ideas?
// No loop needed: map each key k to k / 2 so that keys (0,1), (2,3), ...
// collapse into one key, then group the values.
val rdd: RDD[(Int, String)] = ???
val res: RDD[(Int, Iterable[String])] = rdd.map(x => (x._1 / 2, x._2)).groupByKey()
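With plain collections standing in for the RDD, the key-halving trick looks like this (toy data, assumed purely for illustration):

```scala
// Keys (0,1) collapse to 0, (2,3) to 1, and so on; grouping then merges
// the values that the accumulator loop would have merged pair by pair.
val pairs = Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"))
val merged = pairs
  .map { case (k, v) => (k / 2, v) }   // halve the key
  .groupBy(_._1)                       // group the collapsed keys
  .map { case (k, xs) => (k, xs.map(_._2)) }
// merged: Map(0 -> List(a, b), 1 -> List(c, d))
```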

Reduce Hashmaps per partition in Spark

I have an RDD of mutable.Map[Int, Array[Double]] and I would like to reduce the maps by Int key and find the means of the elements of the arrays.
For example I have:
Map(1 -> Array(0.1, 0.1), 2 -> Array(0.3, 0.2))
Map(1 -> Array(0.1, 0.4))
What I want:
Map(1 -> Array(0.1, 0.25), 2 -> Array(0.3, 0.2))
The problem is that I don't know how reduce works between maps, and additionally I have to do it per partition, collect the results to the driver, and reduce them there too. I found the foreachPartition method but I don't know if it is meant to be used in such cases.
Any ideas?
You can do it using combineByKey:
val rdd = ss.sparkContext.parallelize(Seq(
Map((1, Array(0.1, 0.1)), (2, Array(0.3, 0.2))),
Map((1, Array(0.1, 0.4)))
))
// functions for combineByKey
val create = (arr: Array[Double]) => arr.map( x => (x,1))
val update = (acc : Array[(Double,Int)], current: Array[Double]) => acc.zip(current).map{case ((s,c),x) => (s+x,c+1)}
val merge = (acc1 : Array[(Double,Int)],acc2:Array[(Double,Int)]) => acc1.zip(acc2).map{case ((s1,c1),(s2,c2)) => (s1+s2,c1+c2)}
val finalMap = rdd.flatMap(_.toList)
  // aggregate elementwise sum & count
  .combineByKey(create, update, merge)
  // calculate elementwise average per key
  .map { case (id, arr) => (id, arr.map { case (s, c) => s / c }) }
  .collectAsMap()
// finalMap = Map(2 -> Array(0.3, 0.2), 1 -> Array(0.1, 0.25))

value _2 is not a member of Double spark-shell

I am getting an error while implementing aggregateByKey in the Spark Scala shell.
The piece of code that I am trying to execute in the Scala shell is this:
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap
.aggregateByKey(0.0,0)(
(a,b) => (a._1 + b , a._2 + 1),
(a,b) => (a._1 + b._1 , a._2 + b._2 )
)
But I am getting the following error,
<console>:39: error: value _1 is not a member of Double
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap.aggregateByKey( 0.0,0)( (a,b) => (a._1 + b , a._2 +1), (a,b) => (a._1 + b._1 , a._2 + b._2 ))
scala> orderItemsMapJoinOrdersMapMap
res8: org.apache.spark.rdd.RDD[(String, Float)] = MapPartitionsRDD[16] at map at <console>:37
Can someone help me understand the Double and Float value logic here and how to fix it?
The problem is that you are providing the first curried argument the wrong way: aggregateByKey(0.0, 0) passes 0.0 as the zero value and 0 as the number of partitions, so your accumulator is a plain Double. It should be something like this:
val orderItemsMapJoinOrdersMapMap: RDD[(String, Float)] = ...
// so elems of your orderItemsMapJoinOrdersMapMap are (String, Float)
// And your accumulator looks like (Double, Int)
// thus I believe that you just want to accumulate total number of elements and sum of the floats in them
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap
  .aggregateByKey((0.0, 0))(
    (acc, elem) => (acc._1 + elem, acc._2 + 1),
    (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
  )
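The fold can be checked off-cluster with plain collections (hypothetical order data, assumed purely for illustration):

```scala
// Same (sum, count) accumulator on a plain Seq[(String, Float)], then the
// average per key from a final division.
val orderItems = Seq(("order1", 10.0f), ("order1", 20.0f), ("order2", 5.0f))
val sumCounts = orderItems.groupBy(_._1).map { case (k, xs) =>
  // zero value (0.0, 0): running sum of the floats plus a running count
  val (sum, count) = xs.foldLeft((0.0, 0)) { case ((s, c), (_, v)) => (s + v, c + 1) }
  (k, (sum, count))
}
val averages = sumCounts.map { case (k, (sum, count)) => (k, sum / count) }
// averages: Map(order1 -> 15.0, order2 -> 5.0)
```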

Spark Mapvalues vs. Map

So I saw this question on Stack Overflow asked by another user, and I tried to write the code myself as I am trying to practice Scala and Spark:
The question was to find the per-key average from a list:
Assuming the list is: ( (1,1), (1,3), (2,4), (2,3), (3,1) )
The code was:
val result = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
map{ case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))
So basically the above code will create an RDD of type [Int, (Int, Int)] where the first Int is the key and the value is (Int, Int) where the first Int here is the addition of all the values with the same key and the second Int is the amount of times the key appeared.
I understand what is going on but for some reason when I rewrite the code like this:
val result = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
mapValues(value: (Int, Int) => (value._1 / value._2))
result.collectAsMap().map(println(_))
When I use mapValues instead of map with the case keyword, the code doesn't work. It gives an error saying error: not found: type /. What is the difference between using map with case and using mapValues? I thought mapValues would just take the value (which in this case is an (Int, Int)) and return a new value, with the key remaining the same for the key-value pair.
Try:
val result = input.combineByKey(
(v) => (v, 1),
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
mapValues(value => (value._1 / value._2))
result.collectAsMap().map(println(_))
Never mind, I found a good article on my problem: http://danielwestheide.com/blog/2012/12/12/the-neophytes-guide-to-scala-part-4-pattern-matching-anonymous-functions.html
If anyone else has the same problem, that article explains it well!
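In short: in mapValues(value: (Int, Int) => ...) the compiler reads the parameter list as a type ascription rather than a typed lambda, which is why "/" ends up being parsed where a type is expected. A pattern-matching anonymous function (a case block) is the idiomatic fix; a plain-collections sketch of the two working spellings:

```scala
// A { case ... } block is an anonymous partial function, so the tuple
// value can be destructured in place.
val sumCounts = Map(1 -> (4, 2), 2 -> (7, 2))
val avgs1 = sumCounts.map { case (k, (sum, count)) => (k, sum / count.toFloat) }
// A typed lambda also works, but the parameter needs its own parentheses.
val avgs2 = sumCounts.map((kv: (Int, (Int, Int))) => (kv._1, kv._2._1 / kv._2._2.toFloat))
// both: Map(1 -> 2.0, 2 -> 3.5)
```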

Spark Scala Understanding reduceByKey(_ + _)

I can't understand reduceByKey(_ + _) in this first Spark-with-Scala example:
object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)
    val sc = new SparkContext()
    val lines = sc.textFile(inputPath)
    val wordCounts = lines.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _) // I can't understand this line
    wordCounts.saveAsTextFile(outputPath)
  }
}
Reduce takes two elements and produces a third by applying a function to the two parameters.
The code you have shown is equivalent to the following:
reduceByKey((x, y) => x + y)
Instead of defining dummy variables and writing a lambda, Scala is smart enough to figure out that what you are trying to achieve is applying a function (sum, in this case) to any two parameters it receives, hence the syntax
reduceByKey(_ + _)
reduceByKey takes a function of two parameters, applies it, and returns the result.
reduceByKey(_ + _) is equivalent to reduceByKey((x, y) => x + y)
Example:
val numbers = Array(1, 2, 3, 4, 5)
val sum = numbers.reduceLeft[Int](_ + _)
println("The sum of the numbers one through five is " + sum)
Results:
The sum of the numbers one through five is 15
numbers: Array[Int] = Array(1, 2, 3, 4, 5)
sum: Int = 15
Similarly, reduceByKey(_ ++ _) is equivalent to reduceByKey((x, y) => x ++ y).
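The underscore shorthand is plain Scala, not Spark-specific; each underscore consumes the next parameter in order, which a small standalone snippet makes visible:

```scala
// Each underscore stands for one parameter, left to right.
val add: (Int, Int) => Int = _ + _                        // same as (x, y) => x + y
val concat: (List[Int], List[Int]) => List[Int] = _ ++ _  // same as (x, y) => x ++ y
val total = List(1, 2, 3, 4, 5).reduce(_ + _)             // 15, as in the example above
```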