ReduceByKey for a HashMap based RDD - scala

I have an RDD A of tuples (key, HashMap[Int, Set[String]]) which I want to convert to a new RDD B of (key, HashMap[Int, Set[String]]), where the latter RDD has unique keys and, for each inner key k, the value is the union of all the sets mapped to k across RDD A.
For example,
RDD A
(1,{1->Set(3,5)}), (2,{3->Set(5,6)}), (1,{1->Set(3,4), 7->Set(10, 11)})
will convert to
RDD B
(1, {1->Set(3,4,5), 7->Set(10,11)}), (2, {3->Set(5,6)})
I am not able to formulate a function for this in Scala as I am new to the language. Any help would be appreciated.
Thanks in advance.

cats Semigroup would be a great fit here. Add
spark.jars.packages org.typelevel:cats_2.11:0.9.0
to the configuration and use the combine method:
import cats.implicits._
val rdd = sc.parallelize(Seq(
  (1, Map(1 -> Set(3, 5))),
  (2, Map(3 -> Set(5, 6))),
  (1, Map(1 -> Set(3, 4), 7 -> Set(10, 11)))
))
rdd.reduceByKey(_ combine _)
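If you would rather not add the cats dependency, a plain-Scala merge function passed to reduceByKey should give the same result; this is just a sketch, and mergeMaps is a name made up here, not a library function:
// union the sets for every inner key that appears in either map
def mergeMaps(a: Map[Int, Set[Int]], b: Map[Int, Set[Int]]): Map[Int, Set[Int]] =
  (a.keySet ++ b.keySet).map { k =>
    k -> (a.getOrElse(k, Set.empty[Int]) ++ b.getOrElse(k, Set.empty[Int]))
  }.toMap
rdd.reduceByKey(mergeMaps)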

Related

Column bind two RDD in scala spark without KEYs

The two RDDs have the same number of rows.
I am searching for the equivalent of R's cbind().
It seems join() always requires a key.
The closest is the .zip method, with an appropriate subsequent .map (sketched further down). E.g.:
val rdd0 = sc.parallelize(Seq( (1, (2,3)), (2, (3,4)) ))
val rdd1 = sc.parallelize(Seq( (200,300), (300,400) ))
val zipRdd = (rdd0 zip rdd1).collect
returns:
zipRdd: Array[((Int, (Int, Int)), (Int, Int))] = Array(((1,(2,3)),(200,300)), ((2,(3,4)),(300,400)))
As noted, this is based on the (k, v) structure, and both RDDs must have the same number of rows.
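For a cbind-like result, a subsequent .map can flatten each zipped pair into one row; a rough sketch on the data above:
// flatten ((k, (a, b)), (c, d)) into a single tuple per row
val cbindLike = (rdd0 zip rdd1).map { case ((k, (a, b)), (c, d)) => (k, a, b, c, d) }
cbindLike.collect
// Array((1,2,3,200,300), (2,3,4,300,400))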

How can I sort an RDD[(Int, (val1, val2))] by val2, when only sortByKey is available as an option?

I have an RDD[(Int, (val1, val2))] which I want to sort by val2, but the only option available to me seems to be sortByKey.
Is sortBy available only in older Scala versions?
Is there another option besides collecting it to the driver?
In code I only do:
val nonslack = slacks.filter(x => Vlts.contains(x._1))
where Vlts is an Array[Int] and slacks is an RDD read from a file.
There is a sortBy in RDD:
val rdd = spark.sparkContext.parallelize(Seq(("one", ("one" -> 1)), ("two", ("two" -> 2)), ("three", ("three" -> 3))))
rdd.sortBy(_._2._2).collect().foreach(println(_))
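Applied to the RDD from the question (assuming nonslack really has the type RDD[(Int, (val1, val2))] and val2 has an ordering), it would look something like:
val sorted = nonslack.sortBy(_._2._2)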

Joining 2 RDDs when one having a Option type as key

I have 2 RDDs I would like to join, which look like this:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
Is there any way I can do a left outer join on them?
I have tried this, but it does not work because the key types are different, i.e. Int vs Option[Int]:
q.leftOuterJoin(a)
The natural solution is to convert the Int to Option[Int] so they have the same type.
Following your example:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
q.map { case (k, v) => (Some(k), v) }.leftOuterJoin(a)
If you want to recover the Int type at the output, you can do this:
q.map { case (k, v) => (Some(k), v) }.leftOuterJoin(a).map { case (k, v) => (k.get, v) }
Note that you can call .get there without any problem, since it is not possible to get a None key at that point.
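A minimal runnable sketch of the same idea, with Int standing in for V and made-up sample data; Option(k) is used instead of Some(k) so the key types line up as Option[Int] without an explicit annotation:
val a = sc.parallelize(Seq((Option(1), 10), (Option(3), 30))) // plays the role of RDD[(Option[Int], V)]
val q = sc.parallelize(Seq((1, 100), (2, 200)))               // plays the role of RDD[(Int, V)]
val joined = q
  .map { case (k, v) => (Option(k), v) }  // lift the Int key to Option[Int]
  .leftOuterJoin(a)
  .map { case (k, v) => (k.get, v) }      // safe: every key here came from a non-null Int
// joined.collect contains (1,(100,Some(10))) and (2,(200,None))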
One way to do it is to convert the RDDs into DataFrames and join them.
Here is a simple example:
import spark.implicits._
val a = spark.sparkContext.parallelize(Seq(
(Some(3), 33),
(Some(1), 11),
(Some(2), 22)
)).toDF("id", "value1")
val q = spark.sparkContext.parallelize(Seq(
(Some(3), 33)
)).toDF("id", "value2")
q.join(a, a("id") === q("id") , "leftouter").show

Compare two RDDs and, where the keys match, put the value from the right RDD into the result

I have 2 RDDs:
rdd1     rdd2
1,abc    3,asd
2,edc    4,qwe
3,wer    5,axc
4,ert
5,tyu
6,sdf
7,ghj
Compare the two RDDs, and wherever the ids match, the value in rdd1 should be updated with the value from rdd2.
I understand that RDDs are immutable, so I take it that a new RDD will be produced.
The output rdd will look something like this
output rdd
1,abc
2,edc
3,asd
4,qwe
5,axc
6,sdf
7,ghj
It's a basic thing, but I am new to Spark and Scala and am trying things out.
Use leftOuterJoin to match two RDDs by key, then use map to choose the "new value" (from rdd2) if it exists, or keep the "old" one otherwise:
// sample data:
val rdd1 = sc.parallelize(Seq((1, "aaa"), (2, "bbb"), (3, "ccc")))
val rdd2 = sc.parallelize(Seq((3, "333"), (4, "444"), (5, "555")))
val result = rdd1.leftOuterJoin(rdd2).map {
  case (key, (oldV, maybeNewV)) => (key, maybeNewV.getOrElse(oldV))
}
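With the sample data above, result.collect() should contain (order may vary):
Array((1,aaa), (2,bbb), (3,333))
Keys 1 and 2 keep their rdd1 values, and key 3 picks up the value from rdd2.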

Spark - Best way agg two values using ReduceByKey

Using Spark, I have a pair RDD[(String, (Int, Int))]. I am trying to find the best way to show multiple sums per key (in this case, the sum of each Int shown separately). I would like to do this with reduceByKey.
Is this possible?
Sure.
val rdd = sc.parallelize(Array(("foo", (1, 10)), ("foo", (2, 2)), ("bar", (5, 5))))
val res = rdd.reduceByKey((p1, p2) => (p1._1 + p2._1, p1._2 + p2._2))
res.collect()
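For the sample data above, res.collect() should give something like Array((foo,(3,12)), (bar,(5,5))) (key order may vary): for foo, 1 + 2 = 3 and 10 + 2 = 12.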