Column bind two RDDs in Scala Spark without keys

The two RDDs have the same number of rows.
I am searching for the equivalent of R's cbind().
It seems join() always requires a key.

The closest equivalent is the .zip method, followed by an appropriate .map to reshape the result. For example:
val rdd0 = sc.parallelize(Seq( (1, (2,3)), (2, (3,4)) ))
val rdd1 = sc.parallelize(Seq( (200,300), (300,400) ))
val zipRdd = (rdd0 zip rdd1).collect
returns:
zipRdd: Array[((Int, (Int, Int)), (Int, Int))] = Array(((1,(2,3)),(200,300)), ((2,(3,4)),(300,400)))
Note that zip requires both RDDs to have the same number of elements (and, in fact, the same number of elements per partition), which matches the cbind assumption here.
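To get a flat, cbind-like row instead of nested pairs, a subsequent .map can reshape each zipped element. A minimal sketch on the rdd0/rdd1 data above (the flat tuple shape is just one possible choice):
// Zip positionally, then flatten the nested tuples into a single row per element.
val cbindRdd = (rdd0 zip rdd1).map {
  case ((k, (a, b)), (c, d)) => (k, a, b, c, d)
}
cbindRdd.collect()
// Array((1,2,3,200,300), (2,3,4,300,400))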

Related

Compare two RDDs and, where keys match, take the value from the right RDD

I have two RDDs:
rdd1 rdd2
1,abc 3,asd
2,edc 4,qwe
3,wer 5,axc
4,ert
5,tyu
6,sdf
7,ghj
Compare the two RDDs, and where the IDs match, the value in rdd1 should be updated with the value from rdd2.
I understand that RDDs are immutable, so I expect a new RDD to be produced.
The output rdd will look something like this
output rdd
1,abc
2,edc
3,asd
4,qwe
5,axc
6,sdf
7,ghj
It's a basic thing, but I am new to Spark and Scala and am trying things out.
Use leftOuterJoin to match two RDDs by key, then use map to choose the "new value" (from rdd2) if it exists, or keep the "old" one otherwise:
// sample data:
val rdd1 = sc.parallelize(Seq((1, "aaa"), (2, "bbb"), (3, "ccc")))
val rdd2 = sc.parallelize(Seq((3, "333"), (4, "444"), (5, "555")))
val result = rdd1.leftOuterJoin(rdd2).map {
  case (key, (oldV, maybeNewV)) => (key, maybeNewV.getOrElse(oldV))
}
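On the sample data, collecting result should give every key from rdd1, with key 3 taking the new value from rdd2:
result.collect()
// e.g. Array((1,aaa), (2,bbb), (3,333)) -- partition ordering may differ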

Spark Joins with None Values

I am trying to perform a join in Spark knowing that one of my keys on the left does not have a corresponding value in the other RDD.
The documentation says it should perform the join with None as an option if no key is found, but I keep getting a type mismatch error.
Any insight here?
Take these two RDDs:
val rdd1 = sc.parallelize(Array(("test","foo"),("test2", "foo2")))
val rdd2 = sc.parallelize(Array(("test","foo3"),("test3", "foo4")))
When you join them, you have a couple of options, depending on what you want. Do you want an RDD containing only the common keys? Then use the plain (inner) join:
val joined = rdd1.join(rdd2)
joined.collect
res1: Array[(String, (String, String))] = Array((test,(foo,foo3)))
If you want keys missing from rdd2 to be filled in with None, use leftOuterJoin:
val leftOuter = rdd1.leftOuterJoin(rdd2)
leftOuter.collect
res2: Array[(String, (String, Option[String]))] = Array((test2,(foo2,None)), (test,(foo,Some(foo3))))
If you want keys missing from either side to be filled in with None, use fullOuterJoin:
val fullOuter = rdd1.fullOuterJoin(rdd2)
fullOuter.collect
res3: Array[(String, (Option[String], Option[String]))] = Array((test2,(Some(foo2),None)), (test3,(None,Some(foo4))), (test,(Some(foo),Some(foo3))))
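If you later want plain values rather than Options, one small follow-up sketch (the empty-string default is just an illustrative choice) is to map over the joined values:
// Replace the Options with plain strings; "" is an arbitrary default here.
val filled = fullOuter.mapValues {
  case (left, right) => (left.getOrElse(""), right.getOrElse(""))
}
// filled: RDD[(String, (String, String))] -- missing sides now hold ""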

Spark - Best way agg two values using ReduceByKey

Using Spark, I have a pair RDD[(String, (Int, Int))]. I am trying to find the best way to show multiple sums per key (in this case, the sum of each Int shown separately). I would like to do this with reduceByKey.
Is this possible?
Sure.
val rdd = sc.parallelize(Array(("foo", (1, 10)), ("foo", (2, 2)), ("bar", (5, 5))))
val res = rdd.reduceByKey((p1, p2) => (p1._1 + p2._1, p1._2 + p2._2))
res.collect()
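On this sample data, the per-key sums work out to (1 + 2, 10 + 2) = (3, 12) for foo and (5, 5) for bar. An equivalent, arguably more readable form uses a pattern-matching function literal:
val res2 = rdd.reduceByKey { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) }
res2.collect()
// Array((foo,(3,12)), (bar,(5,5))) -- ordering may differ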

Access joined RDD fields in a readable way

I joined two RDDs, and now when I try to access the new RDD's fields I need to treat them as tuples, which leads to code that is not very readable. I tried to use type aliases, but that doesn't help; I still need to access the fields as tuples. Any idea how to make the code more readable?
For example, when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of _2, _5, etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true)))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
  .join(rdd2)
  .map {
    case (name, ((age, isProlonged), payments)) => (name, payments.sum)
  }
  .filter {
    case (name, sum) => sum > 0
  }
  .collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction on top of RDDs and write SQL-like queries against named columns.
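A minimal sketch of that route, assuming the spark-shell SparkSession named spark (the column names are illustrative, not prescribed):
import spark.implicits._

// Name the fields once when building the DataFrames, then refer to them by name.
val df1 = rdd1.map { case (name, (age, isProlonged)) => (name, age, isProlonged) }
  .toDF("name", "age", "isProlonged")
val df2 = rdd2.toDF("name", "payments")

df1.join(df2, "name")
  .select($"name", $"payments")
  .show()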

Spark sort RDD and join their rank

I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and attach the index (rank) to this RDD, so that I can get an element and its rank by filtering.
Currently I sort the RDD with sortBy, but I do not know how to join an RDD with its rank, so I collect it as a sequence and zip it with its index. This is not efficient, and I am wondering if there is a more elegant way to do it.
The code I'm using right now is:
val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // Sort all nodes by its PR score in descending order
.collect() // collect to master, this may be very expensive
tmpRes.zip(tmpRes.indices) // zip with index
If, by any chance, you'd like to bring back to the driver only the first n tuples, then you could use takeOrdered(n, [ordering]), where n is the number of results to bring back and ordering is the comparator you'd like to use.
Otherwise, you can use the zipWithIndex transformation, which turns your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] with the proper index (of course, you should do that after your sort).
For example:
scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))
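For the takeOrdered route mentioned above, a minimal sketch on the same toy data (ordering by the score, negated so the largest comes first) might look like this:
// Bring back only the top-1 tuple by score (largest Int first).
data.takeOrdered(1)(Ordering.by[(String, Int), Int](t => -t._2))
// Array((B,2))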