Multiply elements in the Spark RDD with each other - scala

One of the problems I've faced when running an Apache Spark job is multiplying each element in an RDD with every other element.
Currently, I'm doing this with two iterators in nested 'foreach'-style loops, something similar to the code below. My gut feeling is that this can be done in a much more efficient manner.
for (elementOutSide <- iteratorA) {
  for (elementInside <- iteratorB) {
    if (!elementOutSide.get(3).equals(elementInside.get(3))) {
      val multemp = elementInside.getLong(3) * elementOutSide.getLong(3)
      ....
      ...
    }
  }
}
Can anyone help me correct and improve this? Thanks in advance!

As pointed out in the comments, this is a cartesian join. Here's how it can be done on an RDD[(Int, String)], where we're interested in the product of every two non-identical Ints:
val rdd: RDD[(Int, String)] = sc.parallelize(Seq(
  (1, "aa"),
  (2, "ab"),
  (3, "ac")
))

// use "cartesian", then "collect" to map only relevant results
val result: RDD[Int] = rdd.cartesian(rdd).collect {
  case ((t1: Int, _), (t2: Int, _)) if t1 != t2 => t1 * t2
}
Note: this implementation assumes input records are unique, as stated in the question. If they aren't, you can perform the cartesian join and the mapping on the result of rdd.zipWithIndex, comparing the indices instead of the values.
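For illustration, here's a sketch of that variant on the same rdd (it still produces each product twice, once per ordering, just like the version above):
// compare zipWithIndex indices instead of the values, so duplicate
// records are still multiplied with each other
val indexed = rdd.zipWithIndex()  // RDD[((Int, String), Long)]
val resultWithDuplicates = indexed.cartesian(indexed).collect {
  case (((t1, _), i1), ((t2, _), i2)) if i1 != i2 => t1 * t2
}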

Related

Column bind two RDD in scala spark without KEYs

The two RDDs have the same number of rows.
I am searching for the equivalent of R's cbind().
It seems join() always requires a key.
The closest is the .zip method, with appropriate subsequent .map usage. E.g.:
val rdd0 = sc.parallelize(Seq( (1, (2,3)), (2, (3,4)) ))
val rdd1 = sc.parallelize(Seq( (200,300), (300,400) ))
val zipRdd = (rdd0 zip rdd1).collect
returns:
zipRdd: Array[((Int, (Int, Int)), (Int, Int))] = Array(((1,(2,3)),(200,300)), ((2,(3,4)),(300,400)))
Indeed, this works on (k,v) records and requires the same number of rows in both RDDs; zip also assumes the same number of partitions with the same number of elements in each.
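For illustration, a possible follow-up .map (just a sketch, using the sample data above) that flattens each zipped pair into a single row:
// flatten each zipped pair into one "column-bound" tuple
val cbindRdd = (rdd0 zip rdd1).map {
  case ((k, (a, b)), (c, d)) => (k, a, b, c, d)
}
cbindRdd.collect
// Array((1,2,3,200,300), (2,3,4,300,400))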

Non Deterministic Behaviour of UNION of RDD in Spark

I'm performing a union operation on 3 RDDs. I'm aware that union doesn't preserve ordering, but in my case the result is quite weird. Can someone explain what's happening in my code?
I have a DataFrame (myDF) of rows and converted it to an RDD:
val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))
myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/
val rowCount = myRdd.count() // Count of Records in myRdd
val header = "name:country:date:nextdate:1" // random header
// Generating Header Rdd
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))
//Generating Trailer Rdd
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount),1).map(rec => (3, rec))
//Performing Union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")
Since union doesn't preserve ordering, it should not give the result below.
Output
name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
Without using any sorting, such as sortByKey(true, 1), how is it possible to get the above output?
But when I remove the map from headerRdd, myRdd and trailerRdd, the order is like:
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
What is the possible reason for the above behaviour?
In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered.
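To see why the header always ends up first, here's a small illustrative sketch (not the asker's data): union simply concatenates the partitions of its inputs in order, which glom makes visible:
// a single-partition header RDD always lands before the data partitions
val headerRdd  = sparkContext.parallelize(Seq("header"), 1)
val dataRdd    = sparkContext.parallelize(Seq("row1", "row2", "row3"), 2)
val trailerRdd = sparkContext.parallelize(Seq("trailer"), 1)
headerRdd.union(dataRdd).union(trailerRdd).glom().collect()
// e.g. Array(Array(header), Array(row1), Array(row2, row3), Array(trailer))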

What's the difference between join and cogroup in Apache Spark

What's the difference between join and cogroup in Apache Spark? What's the use case for each method?
Let me help clarify them; both are commonly used and important!
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
This is the signature of join; please look at it carefully. For example:
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))
All keys that appear in the final result are common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
But cogroup is different,
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
If a key appears in at least one of the two RDDs, it will appear in the final result. Let me clarify:
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
scala> var rdd3 = rdd1.cogroup(rdd2).collect
rdd3: Array[(String, (Iterable[String], Iterable[String]))] = Array(
  (B,(CompactBuffer(2),CompactBuffer())),
  (D,(CompactBuffer(),CompactBuffer(d))),
  (A,(CompactBuffer(1),CompactBuffer(a))),
  (C,(CompactBuffer(3),CompactBuffer(c)))
)
This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one row per record, it gives you an iterable interface; the subsequent operations are up to you!
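For instance (a sketch only, reusing rdd1 and rdd2 from above), flattening the cogroup result yourself reproduces the inner-join pairs:
// flatten the cogrouped iterables back into join-like pairs
rdd1.cogroup(rdd2).flatMap {
  case (key, (vs, ws)) => for (v <- vs; w <- ws) yield (key, (v, w))
}.collect
// e.g. Array((A,(1,a)), (C,(3,c)))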
Good Luck!
Spark docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

Access joined RDD fields in a readable way

I joined 2 RDDs, and now when I try to access the new RDD's fields I need to treat them as tuples. This leads to code that is not very readable. I tried to use 'type' to create some aliases, but it doesn't work and I still need to access the fields as tuples. Any idea how to make the code more readable?
for example - when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of _2, _5, etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true)))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
.join(rdd2)
.map {
case (name, ((age, isProlonged), payments)) => (name, payments.sum)
}
.filter {
case (name, sum) => sum > 0
}
.collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is using the DataFrame abstraction over RDDs and writing SQL queries.
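As a minimal sketch of that route (not the answer's own code; it reuses rdd1 and rdd2 from above, assumes a SparkSession named spark, and the column names are made up):
import spark.implicits._
// flatten the nested tuples into named columns
val people   = rdd1.map { case (name, (age, isProlonged)) => (name, age, isProlonged) }
                   .toDF("name", "age", "isProlonged")
val payments = rdd2.toDF("name", "payments")
people.createOrReplaceTempView("people")
payments.createOrReplaceTempView("payments")
spark.sql("""SELECT p.name, pay.payments
             FROM people p JOIN payments pay ON p.name = pay.name
             WHERE p.isProlonged""").show()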

Spark sort RDD and join their rank

I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and attach the index (rank) to this RDD, so that I can get an element and its rank via filter.
Currently I sort the RDD with sortBy, but I do not know how to join an RDD with its rank, so I collect it as a sequence and zip it with its index. But this is not efficient, and I am wondering if there is a more elegant way to do that.
The code I'm using right now is:
val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // Sort all nodes by PR score in descending order
.collect() // collect to master, this may be very expensive
tmpRes.zip(tmpRes.indices) // zip with index
If, by any chance, you'd like to bring back to the driver only the first n tuples, then you could use takeOrdered(n, [ordering]), where n is the number of results to bring back and ordering is the comparator you'd like to use.
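For instance, a minimal sketch of that option (n = 10 here is an arbitrary choice; in Scala the ordering goes in the second, implicit parameter list):
// bring back only the 10 highest-scoring vertices (descending by score)
val byScoreDesc = Ordering.by[(VertexId, Double), Double](_._2).reverse
val top10 = graph.vertices.takeOrdered(10)(byScoreDesc)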
Otherwise, you can use the zipWithIndex transformation, which will transform your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] with the proper index (of course you should do that after the sort).
For example:
scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))
Regards,