What's the difference between join and cogroup in Apache Spark - scala

What's the difference between join and cogroup in Apache Spark? What's the use case for each method?

Let me help clarify them; both are commonly used and important!
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
This is the signature of join; please look at it carefully. For example,
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
All keys that appear in the final result are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
But cogroup is different,
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
As long as a key appears in at least one of the two RDDs, it will appear in the final result. Let me clarify:
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> var rdd3 = rdd1.cogroup(rdd2).collect
res0: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())),
(D,(CompactBuffer(),CompactBuffer(d))),
(A,(CompactBuffer(1),CompactBuffer(a))),
(C,(CompactBuffer(3),CompactBuffer(c)))
)
This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one row per record, it gives you an Iterable interface; what you do with it afterwards is up to you, which is quite convenient!
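For instance, if you do want the flattened, one-pair-per-row shape, one possible sketch (just one of many ways, built on the cogrouped rdd1 and rdd2 above) is to flatMap over the iterables:
val flattened = rdd1.cogroup(rdd2).flatMap { case (k, (vs, ws)) =>
  if (vs.isEmpty) ws.map(w => (k, (Option.empty[String], Option(w))))
  else if (ws.isEmpty) vs.map(v => (k, (Option(v), Option.empty[String])))
  else for (v <- vs; w <- ws) yield (k, (Option(v), Option(w)))
}
// flattened.collect returns e.g. (B,(Some(2),None)), (D,(None,Some(d))), (A,(Some(1),Some(a))), (C,(Some(3),Some(c)))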
Good Luck!
Spark docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

Related

Multiply elements in the Spark RDD with each other

One of the problems I've faced when running an Apache Spark job is multiplying each element in the RDD with every other element.
Simply put, I want to do something similar to this:
Currently, I'm doing this using two iterators in nested 'foreach' loops. My gut feeling is that this can be done in a much more efficient manner.
for (elementOutSide <- iteratorA) {
  for (elementInside <- iteratorB) {
    if (!elementOutSide.get(3).equals(elementInside.get(3))) {
      val multemp = elementInside.getLong(3) * elementOutSide.getLong(3)
      ....
      ...
    }
  }
}
Can anyone help me correct and improve this? Thanks in advance!
As pointed out by comments, this is a cartesian join. Here's how it can be done on an RDD[(Int, String)], where we're interested in the multiplication of every two non-identical Ints:
val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
  (1, "aa"),
  (2, "ab"),
  (3, "ac")
))
// use "cartesian", then "collect" to map only relevant results
val result: RDD[Int] = rdd.cartesian(rdd).collect {
  case ((t1: Int, _), (t2: Int, _)) if t1 != t2 => t1 * t2
}
Note: this implementation assumes input records are unique, as instructed. If they aren't, you can perform the cartesian join and the mapping on the result of rdd.zipWithIndex while comparing the indices instead of the values.
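Here is a minimal sketch of that zipWithIndex variant, assuming the same rdd as above but with possibly duplicated values:
// compare indices instead of values, so equal values from different records are still multiplied
val indexed = rdd.zipWithIndex() // RDD[((Int, String), Long)]
val resultWithDuplicates: RDD[Int] = indexed.cartesian(indexed).collect {
  case (((t1, _), i1), ((t2, _), i2)) if i1 != i2 => t1 * t2
}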

Spark Joins with None Values

I am trying to perform a join in Spark knowing that one of my keys on the left does not have a corresponding value in the other RDD.
The documentation says it should perform the join with None as an option if no key is found, but I keep getting a type mismatch error.
Any insight here?
Take these two RDDs:
val rdd1 = sc.parallelize(Array(("test","foo"),("test2", "foo2")))
val rdd2 = sc.parallelize(Array(("test","foo3"),("test3", "foo4")))
When you join them, you have a couple of options. What you do depends on what you want. Do you want an RDD only with the common keys?
val joined = rdd1.join(rdd2)
joined.collect
res1: Array[(String, (String, String))] = Array((test,(foo,foo3)))
If you want keys missing from rdd2 to be filled in with None, use leftOuterJoin:
val leftOuter = rdd1.leftOuterJoin(rdd2)
leftOuter.collect
res2: Array[(String, (String, Option[String]))] = Array((test2,(foo2,None)), (test,(foo,Some(foo3))))
If you want keys missing from either side to be filled in with None, use fullOuterJoin:
val fullOuter = rdd1.fullOuterJoin(rdd2)
fullOuter.collect
res3: Array[(String, (Option[String], Option[String]))] = Array((test2,(Some(foo2),None)), (test3,(None,Some(foo4))), (test,(Some(foo),Some(foo3))))
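If you'd rather work with plain strings afterwards, a small follow-up sketch (the "missing" placeholder is an arbitrary choice, not anything Spark-specific) could be:
val filled = fullOuter.mapValues { case (l, r) => (l.getOrElse("missing"), r.getOrElse("missing")) }
filled.collect
// e.g. Array((test2,(foo2,missing)), (test3,(missing,foo4)), (test,(foo,foo3)))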

Joining multiple pairedrdds

I have a question regarding joining multiple RDDs simultaneously. I have about 8 paired RDDs of type RDD[(String, mutable.HashSet[String])], and I would like to join them by key. I know I can join two using Spark's join or cogroup.
However, is there a built-in way to do this? I can join two at a time and then join the resulting RDD with the next one, but if there is a better way, I would like to use it.
There is no built-in method to join multiple RDDs. Assuming this question is related to the previous one and you want to combine sets for each key you can simply use union followed by reduceByKey:
val rdds = Seq(rdd1, rdd2, ..., rdd8)
val combined: RDD[(String, mutable.HashSet[String])] = sc
  .union(rdds)
  .reduceByKey(_ ++ _)
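A quick usage sketch, with two small hypothetical RDDs (rddA, rddB) standing in for the eight:
import scala.collection.mutable
val rddA = sc.parallelize(Seq(("a", mutable.HashSet("x")), ("b", mutable.HashSet("y"))))
val rddB = sc.parallelize(Seq(("a", mutable.HashSet("z"))))
sc.union(Seq(rddA, rddB)).reduceByKey(_ ++ _).collect
// each key now maps to the union of its sets: a -> {x, z}, b -> {y}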
If not you can try to reduce a collection of RDDs:
val combined: RDD[(String, Seq[mutable.HashSet[String]])] = rdds
  .map(_.mapValues(s => Seq(s)))
  .reduce((a, b) => a.join(b).mapValues { case (s1, s2) => s1 ++ s2 })

Access joined RDD fields in a readable way

I joined two RDDs, and now when I try to access the fields of the new RDD I need to treat them as tuples, which leads to code that is not very readable. I tried to use 'type' to create some aliases; however, it doesn't work and I still need to access the fields as tuples. Any idea how to make the code more readable?
for example - when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x=>x._2._2._5!='temp')
I would like to use names instead of _2, _5, etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true)))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
.join(rdd2)
.map {
case (name, ((age, isProlonged), payments)) => (name, payments.sum)
}
.filter {
case (name, sum) => sum > 0
}
.collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction over RDDs and write SQL queries.
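A minimal sketch of that DataFrame route, assuming a SparkSession named spark and with made-up column names:
import spark.implicits._
import org.apache.spark.sql.functions.sum
val people = Seq(("John", 28, true), ("Mary", 22, true)).toDF("name", "age", "isProlonged")
val payments = Seq(("John", 100), ("John", 200), ("John", -20)).toDF("name", "amount")
people.join(payments, "name")
  .groupBy("name")
  .agg(sum("amount").as("total"))
  .filter($"total" > 0)
  .show()
// only John survives the filter, with a total of 280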

Spark sort RDD and join their rank

I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and attach the index (rank) to this RDD, so that I can get an element and its rank with a filter.
Currently I sort the RDD with sortBy, but I do not know how to join an RDD with its rank, so I collect it as a sequence and zip it with its index. But this is not efficient, and I am wondering if there is a more elegant way to do it.
The code I'm using right now is:
val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // sort all nodes by PR score in descending order
  .collect() // collect to the master; this may be very expensive
tmpRes.zip(tmpRes.indices) // zip with index
If, by any chance, you'd like to bring back to the driver only the first n tuples, then you could use takeOrdered(n, [ordering]), where n is the number of results to bring back and ordering is the comparator you'd like to use.
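A quick sketch of that, using graph.vertices from the question (VertexId is just a Long):
// top 3 vertices by score, descending; only those 3 tuples come back to the driver
val byScoreDesc = Ordering.by[(Long, Double), Double](_._2).reverse
val top3: Array[(Long, Double)] = graph.vertices.takeOrdered(3)(byScoreDesc)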
Otherwise, you can use the zipWithIndex transformation, which will transform your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] with the proper index (of course you should do that after the sort).
For example:
scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))
Regards,