Joining multiple paired RDDs - Scala

I have a question regarding joining multiple RDDs simultaneously. I have about 8 paired RDDs of type RDD[(String, mutable.HashSet[String])] and I would like to join them by key. I can join two using Spark's join or cogroup.
However, is there a built-in way to do this? I can join two at a time and then join the resulting RDD with the next one, but if there is a better way I would like to use it.

There is no built-in method to join multiple RDDs at once. Assuming this question is related to the previous one and you want to combine the sets for each key, you can simply use union followed by reduceByKey:
val rdds = Seq(rdd1, rdd2, ..., rdd8)
val combined: RDD[(String, mutable.HashSet[String])] = sc
  .union(rdds)
  .reduceByKey(_ ++ _)
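For instance, a minimal self-contained sketch of the union approach (the two tiny sample RDDs are invented purely for illustration, assuming a spark-shell session where sc is available):
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Two small paired RDDs standing in for the real rdd1 ... rdd8
val rddA: RDD[(String, mutable.HashSet[String])] =
  sc.parallelize(Seq("k1" -> mutable.HashSet("a"), "k2" -> mutable.HashSet("b")))
val rddB: RDD[(String, mutable.HashSet[String])] =
  sc.parallelize(Seq("k1" -> mutable.HashSet("c")))

// Union everything into a single RDD, then merge the sets per key
val merged = sc.union(Seq(rddA, rddB)).reduceByKey(_ ++ _)
merged.collect()  // e.g. Array((k2,Set(b)), (k1,Set(a, c)))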
If not, you can try to reduce a collection of RDDs instead:
val combined: RDD[(String, Seq[mutable.HashSet[String]])] = rdds
  .map(_.mapValues(s => Seq(s)))
  .reduce((a, b) => a.join(b).mapValues { case (s1, s2) => s1 ++ s2 })

Related

Join Dataframes dynamically using Spark Scala when JOIN columns differ

Dynamically select multiple columns while joining different Dataframe in scala spark
From the above link, I was able to get the join expression working, but what if the column names are different? Then we cannot use Seq(columns) and need to build the join dynamically. Here left_ds and right_ds are the DataFrames I want to join.
Below I want to join on the columns id = acc_id and acc_no = number:
left_ds => id, acc_no, name, ph
right_ds => acc_id, number, location
val joinKeys = "id,acc_id|acc_no,number"
val joinKeyPair: Array[(String, String)] =
  joinKeys.split("\\|").map(_.split(",")).map(x => x(0).toUpperCase -> x(1).toUpperCase)
val joinExpr: Column = joinKeyPair
  .map { case (ltable_col, rtable_col) => left_ds.col(ltable_col) === right_ds.col(rtable_col) }
  .reduce(_ and _)
left_ds.join(right_ds, joinExpr, "left_outer")
Above is the join expression I was trying, but it is not working. Is there a way to achieve this when the join column names are different, without using Seq? If the number of join keys increases, I should still be able to make the code work dynamically.
With aliases it should work fine:
val conditionArrays = joinKeys.split("\\|").map(c => c.split(","))
val joinExpr = conditionArrays.map { case Array(a, b) => col("a." + a) === col("b." + b) }.reduce(_ and _)
left_ds.alias("a").join(right_ds.alias("b"), joinExpr, "left_outer")
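For example, a rough end-to-end sketch with two tiny DataFrames (the sample data and values below are made up to mirror the question, and it assumes a spark-shell session so that toDF is available):
import org.apache.spark.sql.functions.col

// Hypothetical frames standing in for left_ds / right_ds
val left_ds  = Seq((1, "A1", "john", "123")).toDF("id", "acc_no", "name", "ph")
val right_ds = Seq((1, "A1", "NY")).toDF("acc_id", "number", "location")

val joinKeys = "id,acc_id|acc_no,number"
val conditionArrays = joinKeys.split("\\|").map(c => c.split(","))
val joinExpr = conditionArrays
  .map { case Array(a, b) => col("a." + a) === col("b." + b) }
  .reduce(_ and _)

left_ds.alias("a").join(right_ds.alias("b"), joinExpr, "left_outer").show()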

Multiply elements in the Spark RDD with each other

One of the problems I've faced when running an Apache Spark job is multiplying each element in the RDD with every other element.
Simply put, I want to do something similar to this. Currently, I'm doing it with two nested iterators in a foreach. My gut feeling is that this can be done in a much more efficient manner:
for (elementOutSide <- iteratorA) {
  for (elementInside <- iteratorB) {
    if (!elementOutSide.get(3).equals(elementInside.get(3))) {
      val multemp = elementInside.getLong(3) * elementOutSide.getLong(3)
      ....
      ...
    }
  }
}
Can anyone help me correct and improve this? Thanks in advance!
As pointed out in the comments, this is a cartesian join. Here's how it can be done on an RDD[(Int, String)], where we're interested in the product of every two non-identical Ints:
val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
  (1, "aa"),
  (2, "ab"),
  (3, "ac")
))

// use "cartesian", then "collect" to map only relevant results
val result: RDD[Int] = rdd.cartesian(rdd).collect {
  case ((t1: Int, _), (t2: Int, _)) if t1 != t2 => t1 * t2
}
Note: this implementation assumes input records are unique, as instructed. If they aren't, you can perform the cartesian join and the mapping on the result of rdd.zipWithIndex while comparing the indices instead of the values.
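For instance, a small sketch of that zipWithIndex variant (the duplicated sample data is made up to show why the indices are needed):
// Same idea, but with a duplicated Int in the input
val rddWithDups = sc.parallelize(Seq((1, "aa"), (1, "ab"), (3, "ac")))

// Pair every record with a unique index, then compare indices instead of values,
// so identical Ints coming from different records are still multiplied
val indexed = rddWithDups.zipWithIndex()
val products = indexed.cartesian(indexed).collect {
  case (((t1, _), i1), ((t2, _), i2)) if i1 != i2 => t1 * t2
}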

Join two Dataframe without a common field in Spark-scala

I have two DataFrames in Spark Scala, but one of them consists of a single column. I have to join them, but they have no column in common. The number of rows is the same.
val userFriends = userJson.select($"friends", $"user_id")
val x = userFriends.select($"friends")
  .rdd
  .map(x => x.getList(0).toArray.map(_.toString))
val y = x.map(z => z.count(_ => true)).toDF("friendCount")
I have to join userFriends with y.
It's not possible to join them without common fields, unless you can rely on an ordering; in that case you can use row_number (with a window function) on both DataFrames and join on the row number (a sketch of that approach follows the code below).
But in your case this does not seem necessary; just keep the user_id column in your DataFrame. Something like this should work:
val userFriends = userJson.select($"friends", $"user_id")
val result_df = userFriends
  .rdd
  .map(x => (x.getList(0).toArray.map(_.toString).count(_ => true), x.getInt(1)))
  .toDF("friendsCount", "user_id")

What's the difference between join and cogroup in Apache Spark

What's the difference between join and cogroup in Apache Spark? What's the use case for each method?
Let me help clarify them; both are commonly used and important!
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
This is the signature of join; please look at it carefully. For example:
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
All keys that appear in the final result are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
But cogroup is different:
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
If a key appears in at least one of the two RDDs, it will appear in the final result. Let me clarify:
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.cogroup(rdd2).collect
res0: Array[(String, (Iterable[String], Iterable[String]))] = Array(
  (B,(CompactBuffer(2),CompactBuffer())),
  (D,(CompactBuffer(),CompactBuffer(d))),
  (A,(CompactBuffer(1),CompactBuffer(a))),
  (C,(CompactBuffer(3),CompactBuffer(c)))
)
This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one line per record, it gives you the iterable interface; the following operation is up to you!
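To illustrate that last point, here is one possible way (a small sketch, not the only option) to flatten the cogrouped iterables yourself, using Option to mark the side that has no value for a key:
// Flatten cogroup's (Iterable[V], Iterable[W]) pairs into one row per combination,
// keeping keys that exist on only one side, much like FULL OUTER JOIN
val flattened = rdd1.cogroup(rdd2).flatMapValues {
  case (vs, ws) if ws.isEmpty => vs.map(v => (Option(v), Option.empty[String]))
  case (vs, ws) if vs.isEmpty => ws.map(w => (Option.empty[String], Option(w)))
  case (vs, ws)               => for (v <- vs; w <- ws) yield (Option(v), Option(w))
}
flattened.collect()
// e.g. Array((B,(Some(2),None)), (D,(None,Some(d))), (A,(Some(1),Some(a))), (C,(Some(3),Some(c))))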
Good Luck!
The Spark docs are here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

Access joined RDD fields in a readable way

I joined two RDDs and now when I try to access the new RDD's fields I need to treat them as tuples. This leads to code that is not very readable. I tried to use 'type' to create some aliases, but it doesn't work and I still need to access the fields as tuples. Any idea how to make the code more readable?
For example, when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of _2, _5, etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
  .join(rdd2)
  .map {
    case (name, ((age, isProlonged), payments)) => (name, payments.sum)
  }
  .filter {
    case (name, sum) => sum > 0
  }
  .collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is using the DataFrame abstraction over RDDs and writing SQL queries.
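For instance, a minimal sketch of that DataFrame option for the same sample data as above (the column names are assumptions, and it presumes a spark-shell session where spark.implicits._ is in scope):
// Give the joined fields readable column names instead of tuple positions
val df1 = rdd1.map { case (name, (age, isProlonged)) => (name, age, isProlonged) }
  .toDF("name", "age", "isProlonged")
val df2 = rdd2.toDF("name", "payments")

df1.join(df2, Seq("name")).createOrReplaceTempView("joined")
spark.sql("SELECT name, age, payments FROM joined WHERE age > 25").show()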