Spark mapValues vs. map - Scala

So I saw this question on Stack Overflow asked by another user, and I tried to write the code myself as I am practicing Scala and Spark:
The question was to find the per-key average from a list:
Assuming the list is: ( (1,1), (1,3), (2,4), (2,3), (3,1) )
The code was:
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
  map{ case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))
So basically the above code will create an RDD of type (Int, (Int, Int)), where the first Int is the key and the value is an (Int, Int) pair: its first Int is the sum of all the values with the same key and its second Int is the number of times the key appeared.
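To make that concrete, here is a quick hand trace of what the intermediate and final results should look like for the sample list above:
// after combineByKey: (1,(4,2)), (2,(7,2)), (3,(1,1))   // (key, (sum, count))
// after the map:      (1,2.0), (2,3.5), (3,1.0)         // (key, average)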
I understand what is going on but for some reason when I rewrite the code like this:
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
  mapValues(value: (Int, Int) => (value._1 / value._2))
result.collectAsMap().map(println(_))
When I use mapValues instead of map with the case keyword, the code doesn't work. It gives the error "error: not found: type /". What is the difference between using map with case and using mapValues? I thought mapValues would just take the value (which in this case is an (Int, Int)) and return a new value, with the key of the key-value pair staying the same.

try
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
  mapValues(value => (value._1 / value._2))
result.collectAsMap().map(println(_))
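For completeness, mapValues also accepts a pattern-matching anonymous function (the { case ... } form used with map), so a sketch like this works as well; here combined is an illustrative name for the RDD produced by the combineByKey step above:
// combined: RDD[(Int, (Int, Int))], i.e. key -> (sum, count)
val result = combined.mapValues { case (sum, count) => sum / count.toFloat }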

Never mind, I found a good article about my problem: http://danielwestheide.com/blog/2012/12/12/the-neophytes-guide-to-scala-part-4-pattern-matching-anonymous-functions.html
If anyone else has the same problem, that explains it well!

Related

aggregateByKey method not working in Spark RDD

Below is my sample data:
1,Siddhesh,43,32000
1,Siddhesh,12,4300
2,Devil,10,1000
2,Devil,10,3000
2,Devil,11,2000
I created a pair RDD to perform combineByKey and aggregateByKey operations. Below is my code:
val rd = sc.textFile("file:///home/cloudera/Desktop/details.txt")
  .map(line => line.split(","))
  .map(p => ((p(0).toString, p(1).toString), (p(3).toLong, p(2).toString.toInt)))
Above I paired the first two columns as the key and the remaining columns as the value. Now I want only the distinct values of the 3rd column from the value tuple, which I was able to do with combineByKey. Below is my code:
val reduced = rd.combineByKey(
  (x: (Long, Int)) => (x._1, Set(x._2)),
  (x: (Long, Set[Int]), y: (Long, Int)) => (x._1 + y._1, x._2 + y._2),
  (x: (Long, Set[Int]), y: (Long, Set[Int])) => (x._1 + y._1, x._2 ++ y._2)
)
scala> reduced.foreach(println)
((1,Siddhesh),(36300,Set(43, 12)))
((2,Devil),(6000,Set(10, 11)))
Now I map it so that, for each key, I get the sum of the values and the count of distinct values.
scala> val newRdd=reduced.map(p=>(p._1._1,p._1._2,p._2._1,p._2._2.size))
scala> newRdd.foreach(println)
(1,Siddhesh,36300,2)
(2,Devil,6000,2)
Here, for 'Devil' the last value is 2: the dataset has two records with 10 for 'Devil', and since I used a Set the duplicates are eliminated. So now I tried it with aggregateByKey. Below is my code with the error:
val rd = sc.textFile("file:///home/cloudera/Desktop/details.txt")
  .map(line => line.split(","))
  .map(p => ((p(0).toString, p(1).toString), (p(3).toString.toInt, p(2).toString.toInt)))
I converted the value column from Long to Int because while initializing the zero value it was throwing an error on '0'.
scala> val reducedByAggKey = rd.aggregateByKey((0,0))(
| (x:(Int,Set[Int]),y:(Int,Int))=>(x._1+y._1,x._2+y._2),
| (x:(Int,Set[Int]),y:(Int,Set[Int]))=>{(x._1+y._1,x._2++y._2)}
| )
<console>:36: error: type mismatch;
found : scala.collection.immutable.Set[Int]
required: Int
(x:(Int,Set[Int]),y:(Int,Int))=>(x._1+y._1,x._2+y._2),
^
<console>:37: error: type mismatch;
found : scala.collection.immutable.Set[Int]
required: Int
(x:(Int,Set[Int]),y:(Int,Set[Int]))=>{(x._1+y._1,x._2++y._2)}
^
And as suggested by Leo, below is my code with error:
scala> val reduced = rdd.aggregateByKey((0, Set.empty[Int]))(
| (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, y._2+x._2),
| (x: (Int, Set[Int]), y: (Int, Set[Int])) => (x._1 + y._1, y._2++ x._2)
| )
<console>:36: error: overloaded method value + with alternatives:
(x: Double)Double <and>
(x: Float)Float <and>
(x: Long)Long <and>
(x: Int)Int <and>
(x: Char)Int <and>
(x: Short)Int <and>
(x: Byte)Int <and>
(x: String)String
cannot be applied to (Set[Int])
(x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, y._2+x._2),
^
So where am I making a mess here? Please correct me.
If I understand your requirement correctly, to get the full count rather than the distinct count, use List instead of Set for the aggregations. As to the problem with your aggregateByKey, it's due to the incorrect type of the zeroValue, which should be (0, List.empty[Int]) (it would have been (0, Set.empty[Int]) if you were to stick with Set):
val reduced = rdd.aggregateByKey((0, List.empty[Int]))(
  (x: (Int, List[Int]), y: (Int, Int)) => (x._1 + y._1, y._2 :: x._2),
  (x: (Int, List[Int]), y: (Int, List[Int])) => (x._1 + y._1, y._2 ::: x._2)
)
reduced.collect
// res1: Array[((String, String), (Int, List[Int]))] =
// Array(((2,Devil),(6000,List(11, 10, 10))), ((1,Siddhesh),(36300,List(12, 43))))
val newRdd = reduced.map(p => (p._1._1, p._1._2, p._2._1, p._2._2.size))
newRdd.collect
// res2: Array[(String, String, Int, Int)] =
// Array((2,Devil,6000,3), (1,Siddhesh,36300,2))
Note that the Set to List change would apply to your combineByKey code as well if you want the full count instead of distinct count.
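A sketch of that change applied to the combineByKey version from the question (same (Long, Int) value type, List instead of Set; the names are illustrative):
val reducedList = rd.combineByKey(
  (x: (Long, Int)) => (x._1, List(x._2)),
  (acc: (Long, List[Int]), y: (Long, Int)) => (acc._1 + y._1, y._2 :: acc._2),
  (a: (Long, List[Int]), b: (Long, List[Int])) => (a._1 + b._1, a._2 ::: b._2)
)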
[UPDATE]
For the distinct count, per your comment, simply stay with Set, with the zeroValue set to (0, Set.empty[Int]):
val reduced = rdd.aggregateByKey((0, Set.empty[Int]))(
  (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2),
  (x: (Int, Set[Int]), y: (Int, Set[Int])) => (x._1 + y._1, x._2 ++ y._2)
)
reduced.collect
// res3: Array[((String, String), (Int, scala.collection.immutable.Set[Int]))] =
// Array(((2,Devil),(6000,Set(10, 11))), ((1,Siddhesh),(36300,Set(43, 12))))
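To get the same (id, name, sum, distinct count) rows as the combineByKey version, the final map from the question applies unchanged:
val newRdd = reduced.map(p => (p._1._1, p._1._2, p._2._1, p._2._2.size))
// expected: Array((2,Devil,6000,2), (1,Siddhesh,36300,2))   (ordering may differ)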

Flink WindowFunction Fold

I create a sliding window and hope to recursively pack all the elements that enter that window period. This is a chunk of the code:
.map(x => ((x.pickup.get.latitude, x.pickup.get.longitude), (x.dropoff.get.latitude, x.dropoff.get.longitude)))
.windowAll(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
.fold(List[((Double, Double), (Double, Double))]) {(acc, v) => acc :+ ((v._1._1, v._1._2), (v._2._1, v._2._2))}
I hope to create a List in which the elements are tuples, but this does not work.
I tried this and it works:
val l2 : List[((Int, Int), (Int, Int))] = List(((1, 1), (2, 2)))
val newl2 = l2 :+ ((3, 3), (4, 4))
How can I do this?
Thanks so much
The first argument of the fold function needs to be the initial value and not the type. Changing the last line into:
.fold(List.empty[((Double, Double), (Double, Double))]) {(acc, v) => acc :+ ((v._1._1, v._1._2), (v._2._1, v._2._2))}
should do the trick.
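The same value-versus-type distinction can be seen on plain Scala collections, independent of Flink; a minimal sketch using foldLeft:
// List.empty[...] is an actual (empty) list value, which is what a fold needs as its starting accumulator
val zero = List.empty[((Double, Double), (Double, Double))]
val folded = Seq(((1.0, 1.0), (2.0, 2.0)), ((3.0, 3.0), (4.0, 4.0)))
  .foldLeft(zero)((acc, v) => acc :+ v)
// folded: List(((1.0,1.0),(2.0,2.0)), ((3.0,3.0),(4.0,4.0)))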

How can I compute the average from a Spark RDD?

I have a problem with Spark and Scala: I want to compute per-key averages from RDD data. I create a new RDD like this:
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
I want to compute the averages like this,
[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]
then get the result like this,
[(2,120),(3,204),(4,160)]
How can I do this with Scala from an RDD?
I use Spark version 1.6.
You can use aggregateByKey.
val rdd = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val agg_rdd = rdd.aggregateByKey((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val sum = agg_rdd.mapValues(x => (x._1 / x._2))
sum.collect
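For the sample data above, this should give the following (note the integer division, since both the sum and the count are Int):
// sum.collect
// Array((2,120), (3,204), (4,160))   (ordering may differ)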
You can use groupByKey in this case, like this:
val rdd = spark.sparkContext.parallelize(List((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val processedRDD = rdd.groupByKey.mapValues{iterator => iterator.sum / iterator.size}
processedRDD.collect.toList
Here, groupByKey will return an RDD[(Int, Iterable[Int])]; then you can simply apply the average operation on the Iterable.
Hope this works for you
Thanks
You can use .combineByKey() to compute average:
val data = sc.parallelize(Seq((2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)))
val sumCountPair = data.combineByKey((x: Int) => (x.toDouble, 1),
  (pair1: (Double, Int), x: Int) => (pair1._1 + x, pair1._2 + 1),
  (pair1: (Double, Int), pair2: (Double, Int)) => (pair1._1 + pair2._1, pair1._2 + pair2._2))
val average = sumCountPair.map(x => (x._1, (x._2._1/x._2._2)))
average.collect()
Here sumCountPair has type RDD[(Int, (Double, Int))], denoting (Key, (SumValue, CountValue)). The next step just divides the sum by the count and returns (Key, AverageValue).
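For the same input, average.collect() should therefore return something like:
// Array((2,120.0), (3,204.0), (4,160.0))   (Double averages; ordering may differ)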

value _2 is not a member of Double spark-shell

I am getting an error while implementing aggregateByKey in spark-scala-shell.
The piece of code that I am trying to execute in the Scala shell is this:
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap
.aggregateByKey(0.0,0)(
(a,b) => (a._1 + b , a._2 + 1),
(a,b) => (a._1 + b._1 , a._2 + b._2 )
)
But I am getting the following error,
<console>:39: error: value _1 is not a member of Double
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap.aggregateByKey( 0.0,0)( (a,b) => (a._1 + b , a._2 +1), (a,b) => (a._1 + b._1 , a._2 + b._2 ))
scala> orderItemsMapJoinOrdersMapMap
res8: org.apache.spark.rdd.RDD[(String, Float)] = MapPartitionsRDD[16] at map at <console>:37
Can someone help me understand the Double and Float value logic here and how to fix it?
The problem is that you are providing the first curried argument the wrong way. It should be something like this,
val orderItemsMapJoinOrdersMapMap: RDD[(String, Float)] = ...
// so elems of your orderItemsMapJoinOrdersMapMap are (String, Float)
// And your accumulator looks like (Double, Int)
// thus I believe that you just want to accumulate total number of elements and sum of the floats in them
val orderItemsMapJoinOrdersMapMapAgg = orderItemsMapJoinOrdersMapMap
.aggregateByKey((0.0,0))(
(acc, elem) => (acc._1 + elem._2 , acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1 , acc1._2 + acc2._2)
)
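If the end goal is a per-key average of the Float values (which the (sum, count) accumulator suggests), a mapValues over the result finishes the job; a sketch, assuming that is the intent (orderAverages is an illustrative name):
val orderAverages = orderItemsMapJoinOrdersMapMapAgg.mapValues { case (sum, count) => sum / count }
// RDD[(String, Double)]: each key mapped to the average of its Float values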

Aggregate RDD values per key

I have an RDD in a key-value structure (someKey, (measure1, measure2)). I grouped by the key and now I want to aggregate the values for each key.
val RDD1 : RDD[(String,(Int,Int))]
RDD1.groupByKey()
the result I need is:
key: avg(measure1), avg(measure2), max(measure1), max(measure2), min(measure1), min(measure2), count(*)
First of all, avoid groupByKey! You should use aggregateByKey or combineByKey. We will use aggregateByKey. This function transforms the values for each key: RDD[(K, V)] => RDD[(K, U)]. It needs a zero value of type U and knowledge of how to merge (U, V) => U and (U, U) => U. I simplified your example a little bit and want to get: key: avg(measure1), avg(measure2), min(measure1), min(measure2), count(*)
val rdd1 = sc.parallelize(List(("a", (11, 1)), ("a",(12, 3)), ("b",(10, 1))))
rdd1
  .aggregateByKey((0.0, 0.0, Int.MaxValue, Int.MaxValue, 0))(
    {
      case ((sum1, sum2, min1, min2, count1), (v1, v2)) =>
        (sum1 + v1, sum2 + v2, v1 min min1, v2 min min2, count1 + 1)
    },
    {
      case ((sum1, sum2, min1, min2, count),
            (otherSum1, otherSum2, otherMin1, otherMin2, otherCount)) =>
        (sum1 + otherSum1, sum2 + otherSum2,
         min1 min otherMin1, min2 min otherMin2, count + otherCount)
    }
  )
  .map {
    case (k, (sum1, sum2, min1, min2, count1)) => (k, (sum1 / count1, sum2 / count1, min1, min2, count1))
  }
  .collect()
giving
(a,(11.5,2.0,11,1,2)), (b,(10.0,1.0,10,1,1))
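If the full set of aggregates from the question is needed (avg, max, min and count for both measures), the same accumulator can be widened with max fields; a sketch under that assumption, starting the max fields at Int.MinValue:
rdd1
  .aggregateByKey((0.0, 0.0, Int.MaxValue, Int.MaxValue, Int.MinValue, Int.MinValue, 0))(
    {
      // seqOp: fold one (v1, v2) value into the (sum1, sum2, min1, min2, max1, max2, count) accumulator
      case ((sum1, sum2, min1, min2, max1, max2, count), (v1, v2)) =>
        (sum1 + v1, sum2 + v2, v1 min min1, v2 min min2, v1 max max1, v2 max max2, count + 1)
    },
    {
      // combOp: merge two partial accumulators
      case ((s1, s2, mn1, mn2, mx1, mx2, c), (os1, os2, omn1, omn2, omx1, omx2, oc)) =>
        (s1 + os1, s2 + os2, mn1 min omn1, mn2 min omn2, mx1 max omx1, mx2 max omx2, c + oc)
    }
  )
  .map {
    // reorder into key: avg(m1), avg(m2), max(m1), max(m2), min(m1), min(m2), count(*)
    case (k, (s1, s2, mn1, mn2, mx1, mx2, c)) => (k, (s1 / c, s2 / c, mx1, mx2, mn1, mn2, c))
  }
  .collect()
// for rdd1 above this should give (a,(11.5,2.0,12,3,11,1,2)), (b,(10.0,1.0,10,1,10,1,1))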