Spark - Best way to aggregate two values using reduceByKey - Scala

Using Spark, I have a pair RDD[(String, (Int, Int))]. I am trying to find the best way to show multiple sums per key (in this case the sum of each Int shown separately). I would like to do this with reduceByKey.
Is this possible?

Sure.
// sample data: two Int values per key
val rdd = sc.parallelize(Array(("foo", (1, 10)), ("foo", (2, 2)), ("bar", (5, 5))))
// sum each tuple component independently, per key
val res = rdd.reduceByKey((p1, p2) => (p1._1 + p2._1, p1._2 + p2._2))
res.collect()
// Array((foo,(3,12)), (bar,(5,5)))  -- order may vary
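Equivalently, you can pattern match on the two tuples, which some find easier to read when there are more fields (purely a stylistic choice, same result):
val res2 = rdd.reduceByKey { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) }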

Related

Column-bind two RDDs in Scala Spark without keys

The two RDDs have the same number of rows.
I am searching for the equivalent of R's cbind().
It seems join() always requires a key.
The closest match is the .zip method, followed by an appropriate .map to reshape the result. E.g.:
val rdd0 = sc.parallelize(Seq( (1, (2,3)), (2, (3,4)) ))
val rdd1 = sc.parallelize(Seq( (200,300), (300,400) ))
val zipRdd = (rdd0 zip rdd1).collect
returns:
zipRdd: Array[((Int, (Int, Int)), (Int, Int))] = Array(((1,(2,3)),(200,300)), ((2,(3,4)),(300,400)))
As noted, this relies on the two RDDs having the same number of rows; zip additionally requires them to have the same number of partitions and the same number of elements in each partition.
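For example, the subsequent .map could flatten each zipped pair into one cbind-style row (a sketch based on the sample RDDs above):
val cbindRdd = (rdd0 zip rdd1).map { case ((k, (a, b)), (c, d)) => (k, a, b, c, d) }
cbindRdd.collect()
// Array((1,2,3,200,300), (2,3,4,300,400))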

Merging RDD records to obtain a single Row with multiple conditional counters

As a little bit of context, what I'm trying to achieve here is: given multiple rows grouped by a certain set of keys, after that first reduce I would like to group them into a single general row keyed by, for example, date, carrying each of the counters previously calculated per group. This may not seem clear from the description alone, so here is an example output (quite simple, nothing complex) of what should happen.
(("Volvo", "T4", "2019-05-01"), 5)
(("Volvo", "T5", "2019-05-01"), 7)
(("Audi", "RS6", "2019-05-01"), 4)
And once those Row objects are merged...
date , volvo_counter , audi_counter
"2019-05-01" , 12 , 4
I reckon this is quite a corner case and that there may be different approaches, but I was wondering whether there is any solution within the same RDD, so that there's no need for multiple RDDs divided by counter.
What you want to do is a pivot. You talk about RDDs so I assume that your question is: "how to do a pivot with the RDD API?". As far as I know there is no built-in function in the RDD API that does it. You could do it yourself like this:
// let's create sample data
val rdd = sc.parallelize(Seq(
  (("Volvo", "T4", "2019-05-01"), 5),
  (("Volvo", "T5", "2019-05-01"), 7),
  (("Audi", "RS6", "2019-05-01"), 4)
))
// If the keys are not known in advance, we compute their distinct values
val values = rdd.map(_._1._1).distinct.collect.toSeq
// values: Seq[String] = WrappedArray(Volvo, Audi)
// Finally we make the pivot and use reduceByKey on the sequence
val res = rdd
  .map { case ((make, model, date), counter) =>
    date -> values.map(v => if (make == v) counter else 0)
  }
  .reduceByKey((a, b) => a.indices.map(i => a(i) + b(i)))
// which gives you this
res.collect.head
// (String, Seq[Int]) = (2019-05-01,Vector(12, 4))
Note that you can write much simpler code with the SparkSQL API:
// let's first transform the previously created RDD to a dataframe
// (assuming a SparkSession named `spark` is in scope):
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df = rdd.map { case ((a, b, c), d) => (a, b, c, d) }
  .toDF("make", "model", "date", "counter")
// And then it's as simple as that:
df.groupBy("date")
  .pivot("make")
  .agg(sum("counter"))
  .show
+----------+----+-----+
| date|Audi|Volvo|
+----------+----+-----+
|2019-05-01| 4| 12|
+----------+----+-----+
I think it's easier to do with a DataFrame:
// the answer assumes these case classes (and a SparkSession named `spark`):
case class Key(model: String, date: String)
case class Record(key: Key, value: Int)

import spark.implicits._
import org.apache.spark.sql.functions.{sum, when}

val data = Seq(
  Record(Key("Volvo", "2019-05-01"), 5),
  Record(Key("Volvo", "2019-05-01"), 7),
  Record(Key("Audi", "2019-05-01"), 4)
)
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF()

// build one conditional-sum expression per distinct model
val modelsExpr = df
  .select($"key.model".as("model"))
  .distinct()
  .collect()
  .map(r => r.getAs[String]("model"))
  .map(m => sum(when($"key.model" === m, $"value").otherwise(0)).as(s"${m}_counter"))

df
  .groupBy("key.date")
  .agg(modelsExpr.head, modelsExpr.tail: _*)
  .show(false)
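For reference, running this on the sample data should print roughly the following (the column order depends on the order distinct() returns the models):
+----------+-------------+------------+
|date      |Volvo_counter|Audi_counter|
+----------+-------------+------------+
|2019-05-01|12           |4           |
+----------+-------------+------------+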

ReduceByKey for a HashMap-based RDD

I have an RDD A of the form RDD[(key, HashMap[Int, Set[String]])] which I want to convert to a new RDD B of the same form, where B has unique keys and the value for each key k is the union of all sets for key k in RDD A.
For example,
RDD A
(1,{1->Set(3,5)}), (2,{3->Set(5,6)}), (1,{1->Set(3,4), 7->Set(10, 11)})
will convert to
RDD B
(1, {1->Set(3,4,5), 7->Set(10,11)}), (2, {3->Set(5,6)})
I am not able to formulate a function for this in Scala as I am new to the language. Any help would be appreciated.
Thanks in advance.
cats Semigroup would be a great fit here. Add
spark.jars.packages org.typelevel:cats_2.11:0.9.0
to the configuration and use the combine method:
import cats.implicits._

val rdd = sc.parallelize(Seq(
  (1, Map(1 -> Set(3, 5))),
  (2, Map(3 -> Set(5, 6))),
  (1, Map(1 -> Set(3, 4), 7 -> Set(10, 11)))
))

rdd.reduceByKey(_ combine _)
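If you'd rather not add a dependency, the same merge can be written by hand with reduceByKey; here is a rough sketch assuming plain immutable Map/Set values as in the snippet above:
val merged = rdd.reduceByKey { (m1, m2) =>
  (m1.keySet ++ m2.keySet).map { k =>
    k -> (m1.getOrElse(k, Set.empty[Int]) ++ m2.getOrElse(k, Set.empty[Int]))
  }.toMap
}
// e.g. (1, Map(1 -> Set(3, 4, 5), 7 -> Set(10, 11))), (2, Map(3 -> Set(5, 6)))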

Joining 2 RDDs when one has an Option type as key

I have 2 RDDs I would like to join which look like this:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
Is there any way I can do a left outer join on them?
I have tried this but it does not work because the types of the keys are different, i.e. Int vs Option[Int]:
q.leftOuterJoin(a)
The natural solution is to convert the Int to Option[Int] so they have the same type.
Following your example:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
q.map { case (k, v) => (Option(k), v) }.leftOuterJoin(a)
(Using Option(k) rather than Some(k) keeps the inferred key type as Option[Int], which matches a.)
If you want to recover the Int type at the output, you can do this:
q.map { case (k, v) => (Option(k), v) }.leftOuterJoin(a).map { case (k, v) => (k.get, v) }
Note that you can call .get without any problem, since it is not possible to get None keys there.
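To make this concrete, here is a small self-contained example (String values stand in for V; the sample values are just for illustration):
val a = sc.parallelize(Seq((Option(1), "a1"), (Option(3), "a3")))
val q = sc.parallelize(Seq((1, "q1"), (2, "q2")))

val joined = q.map { case (k, v) => (Option(k), v) }.leftOuterJoin(a)
// joined.collect() => Array((Some(1),(q1,Some(a1))), (Some(2),(q2,None)))  -- order may vary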
One way to do it is to convert the RDDs into DataFrames and join.
Here is a simple example:
import spark.implicits._

val a = spark.sparkContext.parallelize(Seq(
  (Some(3), 33),
  (Some(1), 11),
  (Some(2), 22)
)).toDF("id", "value1")

val q = spark.sparkContext.parallelize(Seq(
  (Some(3), 33)
)).toDF("id", "value2")

q.join(a, a("id") === q("id"), "leftouter").show
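On this sample data the result should look something like the following (both id columns are kept, so you may want to drop one afterwards, e.g. with .drop(a("id"))):
+---+------+---+------+
| id|value2| id|value1|
+---+------+---+------+
|  3|    33|  3|    33|
+---+------+---+------+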

Compare two RDDs and, for the keys that match, put the value from the right RDD into the result

I have 2 rdd
rdd1 rdd2
1,abc 3,asd
2,edc 4,qwe
3,wer 5,axc
4,ert
5,tyu
6,sdf
7,ghj
Compare the two RDDs: wherever an id matches, the value in rdd1 should be replaced with the value from rdd2.
I understand that RDDs are immutable, so I expect the result to be a new RDD.
The output rdd will look something like this
output rdd
1,abc
2,edc
3,asd
4,qwe
5,axc
6,sdf
7,ghj
It's a basic thing, but I am new to Spark and Scala and still trying things out.
Use leftOuterJoin to match two RDDs by key, then use map to choose the "new value" (from rdd2) if it exists, or keep the "old" one otherwise:
// sample data:
val rdd1 = sc.parallelize(Seq((1, "aaa"), (2, "bbb"), (3, "ccc")))
val rdd2 = sc.parallelize(Seq((3, "333"), (4, "444"), (5, "555")))
val result = rdd1.leftOuterJoin(rdd2).map {
  case (key, (oldV, maybeNewV)) => (key, maybeNewV.getOrElse(oldV))
}
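Collecting the result on this sample data gives, up to ordering:
result.collect()
// Array((1,aaa), (2,bbb), (3,333))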