Sum values of each unique key in Apache Spark RDD - scala

I have an RDD[(String, (Long, Long))] where the keys are not unique:
(com.instagram.android,(2,0))
(com.android.contacts,(6,1))
(com.android.contacts,(3,4))
(com.instagram.android,(8,3))
...
So I want to obtain an RDD where, for each unique key, the values are summed component-wise:
(com.instagram.android,(10,3))
(com.android.contacts,(9,5))
...
Here is my code:
val appNamesAndPropertiesRdd = appNodesRdd.map({
  case Row(_, appName, totalUsageTime, usageFrequency, _, _, _, _) =>
    (appName, (totalUsageTime, usageFrequency))
})

Use reduceByKey:
val rdd = appNamesAndPropertiesRdd.reduceByKey(
  (acc, elem) => (acc._1 + elem._1, acc._2 + elem._2)
)
reduceByKey uses aggregateByKey (described by SCouto) under the hood, but with a more readable API. For your case, the more advanced features of aggregateByKey that are hidden by the simpler API of reduceByKey are not necessary.
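For example, applied to the sample pairs from the question (a quick sketch, assuming a SparkContext sc is available), this produces the expected sums:
// Hypothetical sample data mirroring the question
val sampleRdd = sc.parallelize(Seq(
  ("com.instagram.android", (2L, 0L)),
  ("com.android.contacts", (6L, 1L)),
  ("com.android.contacts", (3L, 4L)),
  ("com.instagram.android", (8L, 3L))
))
val summed = sampleRdd.reduceByKey(
  (acc, elem) => (acc._1 + elem._1, acc._2 + elem._2)
)
summed.collect.foreach(println)
// (com.instagram.android,(10,3))
// (com.android.contacts,(9,5))   (output order may vary)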

First of all, I don't think that usageFrequency should be simply added up (see the note at the end of this answer).
Now, let's come to what you want to do. You want to add things up by key, and you can do that in two ways:
1. Using groupBy and then reducing each group to sum things up:
val requiredRdd = appNamesAndPropertiesRdd
  .groupBy({ case (an, (tut, uf)) => an })
  .map({
    case (an, iter) => (
      an,
      iter
        .map({ case (an, (tut, uf)) => (tut, uf) })
        .reduce({ case ((tut1, uf1), (tut2, uf2)) => (tut1 + tut2, uf1 + uf2) })
    )
  })
2. Or by using reduceByKey:
val requiredRdd = appNamesAndPropertiesRdd
  .reduceByKey({
    case ((tut1, uf1), (tut2, uf2)) => (tut1 + tut2, uf1 + uf2)
  })
And reduceByKey is the better choice of the two, for two reasons:
1. It avoids a group operation that is not really required.
2. The groupBy approach shuffles the full data set, which is expensive, whereas reduceByKey combines values on each partition before shuffling.
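As for the earlier note on usageFrequency: if it represents a rate rather than a count, one option is to carry an explicit record count along and average at the end. A minimal sketch, assuming a plain arithmetic mean is acceptable (that assumption is mine, not part of the question):
// Hypothetical: average usageFrequency per key instead of summing it
val averagedRdd = appNamesAndPropertiesRdd
  .mapValues { case (tut, uf) => (tut, uf, 1L) }   // attach a count of 1 to each record
  .reduceByKey { case ((t1, f1, c1), (t2, f2, c2)) =>
    (t1 + t2, f1 + f2, c1 + c2)                    // sum usage times, frequencies and counts
  }
  .mapValues { case (tut, ufSum, count) => (tut, ufSum.toDouble / count) }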

The function aggregateByKey is the best one for this purpose
appNamesAndPropertiesRdd.aggregateByKey((0, 0))(
  (acc, elem) => (acc._1 + elem._1, acc._2 + elem._2),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)
Explained here:
aggregateByKey((0, 0)) => This is the zero value, i.e. the initial value of the accumulator. In your case, since you want an addition, (0, 0) is the right initial value (use (0.0, 0.0) if you want Double instead of Int).
(acc, elem) => (acc._1 + elem._1, acc._2 + elem._2) => The first function (seqOp). It accumulates the elements within the same partition; the accumulator holds the partial value. Since elem is a tuple, you need to add each part of it to the corresponding part of the accumulator.
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2) => The second function (combOp). It merges the accumulators coming from the different partitions.
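The same call can be written with the three pieces named explicitly, which maps directly onto the explanation above (just a readability sketch; the names zero, seqOp and combOp are mine, and the Int value type matches the (0, 0) zero value used here):
val zero = (0, 0)                                   // initial accumulator per key
val seqOp = (acc: (Int, Int), elem: (Int, Int)) =>
  (acc._1 + elem._1, acc._2 + elem._2)              // fold one element into the accumulator (same partition)
val combOp = (acc1: (Int, Int), acc2: (Int, Int)) =>
  (acc1._1 + acc2._1, acc1._2 + acc2._2)            // merge accumulators from different partitions

appNamesAndPropertiesRdd.aggregateByKey(zero)(seqOp, combOp)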

Try this logic,
rdd.groupBy(_._1).map { x =>
  (x._1, x._2.map(_._2).foldLeft((0, 0)) { case ((acc1, acc2), (a, b)) => (acc1 + a, acc2 + b) })
}

Related

How to get minimum value for each distinct key using ReduceByKey() in Scala

I have a flatMap that returns the sequence Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)). Now I need to use reduceByKey() on the sequence that I got from the flatMap to find the minimum value for each key.
I tried using .reduceByKey(a,min(b)) and .reduceByKey((a, b) => if (a._1 < b._1) a else b), but neither of them is working.
This is my code
for (i <- 1 to 5) {
  var graph = graph.flatMap { in => in match { case (x, y, zs) => (x, y) :: zs.map(z => (z, y)) } }
    .reduceByKey((a, b) => if (a._1 < b._1) a else b)
}
For each distinct key the flatMap generates, I need to get the minimum value for that key. E.g., if the flatMap generates Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)), then reduceByKey() should generate (20,1),(22,1),(23,6),(24,6).
Here is the signature of reduceByKey:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Basically, given an RDD of key/value pairs, you need to provide a function that reduces two values (and not the entire pairs) into one. Therefore, you can use it as follows:
val rdd = sc.parallelize(Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)))
val result = rdd.reduceByKey((a, b) => if (a < b) a else b)
result.collect
// Array[(Int, Int)] = Array((24,6), (20,1), (22,1), (23,6))

How to find the common values in key value pairs and put it as value in all pairs?

How can I get the intersection of values in key value pairs?
I have pairs:
(p, Set(n))
on which I used reduceByKey and finally got:
(p1, Set(n1, n2)) (p2, Set(n1, n2, n3)) (p3, Set(n2, n3))
What I want is to find the n values that exist in all of the pairs and put them as the value. For the above data, the result would be:
(p1, Set(n2)) (p2, Set(n2)), (p3, Set(n2))
As far as I searched, there is no reduceByValue in Spark. The only function that seemed close to what I want was reduce(), but it didn't work, as the result was only a single key-value pair ((p3, Set(n2))).
Is there any way to solve it? Or should i think something else from the start?
Code:
val rRdd = inputFile.map(x => (x._1, Set(x._2))).reduceByKey(_ ++ _)
val wrongRdd = rRdd.reduce{(x, y) => (x._1, x._2.intersect(y._2))}
I can see why wrongRdd is not correct; I just put it there to show where (p3, Set(n2)) came from.
You can first reduce the sets to their intersection (say, s), then replace (k, v) with (k, s):
val rdd = sc.parallelize(Seq(
("p1", Set("n1", "n2")),
("p2", Set("n1", "n2", "n3")),
("p3", Set("n2", "n3"))
))
val s = rdd.map(_._2).reduce(_ intersect _)
// s: scala.collection.immutable.Set[String] = Set(n2)
rdd.map{ case (k, v) => (k, s) }.collect
// res1: Array[(String, scala.collection.immutable.Set[String])] = Array(
// (p1,Set(n2)), (p2,Set(n2)), (p3,Set(n2))
// )
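Since s is computed with a reduce on the driver and then captured by the map closure, for a large intersection set it may be worth broadcasting it explicitly. A small optional sketch:
// Optional: ship the intersection as a broadcast variable instead of a plain closure capture
val sBroadcast = sc.broadcast(s)
rdd.map { case (k, _) => (k, sBroadcast.value) }.collect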

How does this use of aggregate work in Scala?

I have been reading a Spark book and this example is from the book:
val input = List(1, 2, 3, 4, 5, 6)
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
I am trying to understand how this works and what _1 and _2 are at each step.
(0,0) is the seed value or initial value
This list gets split into separate RDDs;
let's say rdd1 contains List(1,2).
loop through this list
(acc, value)
acc = ??? during each iteration of the loop
value = ??? during each iteration of the loop
(acc, value) => (acc._1 + value, acc._2 + 1)
during the first iteration over List(1,2), what are the values of acc._1, acc._2, and value?
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
acc1 (for List(1,2)) is something like (3,2) and acc2 (for List(3,4)) is (7,2),
and this function adds 3+7 and 2+2 to give (10,4), and then combines this with the next group.
Dear kind-hearted helpers,
please do not use Scala jargon; I already read the documentation and did not understand it, hence I came here for help.
For List(1,2), what will be the values of acc._1 and acc._2 during the first iteration of the list, and what is the value of value during that iteration? And during the second iteration, what are their values?
The first parameter list of the aggregate function takes an initial value, which in this example is the tuple (0,0). The next parameter is seqop, which is a function (B, A) => B; in your example it would be ((Int, Int), Int) => (Int, Int).
What is happening here is that this function is applied to every element of the list, one by one. The tuple holds, on its left side, the running sum of the elements and, on its right side, the count of elements seen so far. The result of the aggregation is (21, 6).
A side note: the implementation of aggregate in Scala's TraversableOnce doesn't actually use the combop parameter, which in this example is (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2), so you can just ignore it in this case. If you are familiar with Scala, the code that gets executed is:
input.foldLeft((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1))
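To answer the List(1,2) question concretely, here is a trace of that foldLeft (the comments show acc and value at each step):
List(1, 2).foldLeft((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1))
// step 1: acc = (0, 0), value = 1  =>  (0 + 1, 0 + 1) = (1, 1)
// step 2: acc = (1, 1), value = 2  =>  (1 + 2, 1 + 1) = (3, 2)
// result: (3, 2), i.e. sum = 3, count = 2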
aggregate works by taking in two functions, one which combines values within a partition and one which combines partitions.
The first function (the one for a single partition) could be more clearly written as
((sum, count), value) => (sum + value, count + 1)
The second function (to combine partitions) could be written as
((partition1Sum, partition1Count), (partition2Sum, partition2Count)) =>
(partition1Sum + partition2Sum, partition1Count + partition2Count)
Note on tuple notation:
In Scala (a, b, c)._1 == a, (a, b, c)._2 == b and so on. _n gives you the nth element of the tuple.
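In Spark itself, where the combine function does matter, the same computation can be sketched like this (assuming a SparkContext sc; the two partitions mimic the rdd1/rdd2 split discussed in the question):
val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)      // e.g. partition 1: 1,2,3 and partition 2: 4,5,6
val result = rdd.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),           // within a partition: running (sum, count)
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)  // across partitions: add partial sums and counts
)
val avg = result._1 / result._2.toDouble                  // result = (21, 6), avg = 3.5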

Aggregation of multiple values using scala/spark

I am new to Spark and Scala. I want to sum up all the values present in the RDD; below is an example.
The RDD contains key-value pairs, and suppose that after doing some joins and transformations the RDD has 3 records as below, where A is the key:
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
Now I want to sum each value of a record with the corresponding value in the other records, so the output should look like:
(A, List(3,3,3,3,3,3,3))
Can anyone please help me out on this? Is there any possible way to achieve this using Scala?
Big Thanks in Advance
A naive approach is to use reduceByKey:
rdd.reduceByKey(
  (xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
)
but it is rather inefficient because it creates a new List on each merge.
You can improve on that by using, for example, aggregateByKey with a mutable buffer:
rdd.aggregateByKey(Array.fill(7)(0))( // Mutable buffer
  // For seqOp we'll mutate the accumulator
  (acc, xs) => {
    for {
      (x, i) <- xs.zipWithIndex
    } acc(i) += x
    acc
  },
  // For performance you could modify acc1 in place as above
  (acc1, acc2) => acc1.zip(acc2).map { case (x, y) => x + y }
).mapValues(_.toList)
It should also be possible to use DataFrames, but by default recent versions schedule aggregations separately, so without adjusting the configuration it is probably not worth the effort.
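Applied to the three records from the question (a quick sketch, assuming a SparkContext sc), either approach yields the expected result:
val rdd = sc.parallelize(Seq(
  ("A", List(1, 1, 1, 1, 1, 1, 1)),
  ("A", List(1, 1, 1, 1, 1, 1, 1)),
  ("A", List(1, 1, 1, 1, 1, 1, 1))
))
rdd.reduceByKey((xs, ys) => xs.zip(ys).map { case (x, y) => x + y }).collect
// Array((A,List(3, 3, 3, 3, 3, 3, 3)))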

SortByValue for a RDD of tuples

Recently I was asked (in a class assignment) to find the top 10 most frequently occurring words inside an RDD. I submitted my assignment with a working solution which looks like:
wordsRdd
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .map { case (x, y) => (y, x) }
  .sortByKey(false)
  .map { case (x, y) => (y, x) }
  .take(10)
So basically, I swap the tuple, sort by key, and then swap again. Then finally take 10. I don't find the repeated swapping very elegant.
So I wonder if there is a more elegant way of doing this.
I searched and found some people using Scala implicits to convert the RDD into a Scala Sequence and then doing the sortByValue, but I don't want to convert RDD to a Scala Seq, because that will kill the distributed nature of the RDD.
So is there a better way?
How about this:
wordsRdd.
  map(x => (x, 1)).
  reduceByKey(_ + _).
  takeOrdered(10)(Ordering.by(-1 * _._2))
or a little bit more verbose:
object WordCountPairsOrdering extends Ordering[(String, Int)] {
  def compare(a: (String, Int), b: (String, Int)) = b._2.compare(a._2)
}
wordsRdd.
  map(x => (x, 1)).
  reduceByKey(_ + _).
  takeOrdered(10)(WordCountPairsOrdering)