Access joined RDD fields in a readable way - scala

I joined two RDDs, and now, to access the fields of the new RDD, I have to treat them as tuples, which makes the code hard to read. I tried using 'type' to create aliases, but that doesn't help: I still have to access the fields as tuples. Any idea how to make the code more readable?
For example, when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of _2, _5, etc.
Thanks

Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
  .join(rdd2)
  .map {
    // Destructure the nested tuple into named values
    case (name, ((age, isProlonged), payments)) => (name, payments.sum)
  }
  .filter {
    case (name, sum) => sum > 0
  }
  .collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction over RDDs and write SQL queries.
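A further way to get names while staying with plain RDDs is to keep the values in case classes rather than tuples, so that fields are read by name after the join. The sketch below is only an illustration: Person, Account and their fields are hypothetical names that do not come from the question, and the "temp" status mirrors the filter in the question.

```scala
import org.apache.spark.sql.SparkSession

object NamedFieldsJoin {
  // Hypothetical record types: the field names are illustrative, not from the question.
  case class Person(age: Int, isProlonged: Boolean)
  case class Account(status: String, payments: List[Int])

  /** Joins people with accounts and returns (name, payment total) for non-"temp" accounts. */
  def nonTempTotals(spark: SparkSession): Array[(String, Int)] = {
    val sc = spark.sparkContext
    val people = sc.parallelize(Seq(
      "John" -> Person(28, isProlonged = true),
      "Mary" -> Person(22, isProlonged = true)))
    val accounts = sc.parallelize(Seq(
      "John" -> Account("active", List(100, 200, -20)),
      "Mary" -> Account("temp", List(50))))

    people.join(accounts)
      // Fields are read by name instead of x._2._2._5.
      .filter { case (_, (_, account)) => account.status != "temp" }
      .map { case (name, (_, account)) => (name, account.payments.sum) }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("named-fields").getOrCreate()
    try nonTempTotals(spark).foreach(println)
    finally spark.stop()
  }
}
```

The pattern match still describes the shape of the joined pair, but the fields inside each value no longer depend on their position.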

Related

Column bind two RDD in scala spark without KEYs

The two RDDs have the same number of rows.
I am looking for the equivalent of R's cbind().
It seems join() always requires a key.
The closest is the zip method, followed by an appropriate map. E.g.:
val rdd0 = sc.parallelize(Seq( (1, (2,3)), (2, (3,4)) ))
val rdd1 = sc.parallelize(Seq( (200,300), (300,400) ))
val zipRdd = (rdd0 zip rdd1).collect
returns:
zipRdd: Array[((Int, (Int, Int)), (Int, Int))] = Array(((1,(2,3)),(200,300)), ((2,(3,4)),(300,400)))
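The result is indeed a (k, v) pair per row, and zip does require the same number of rows in both RDDs (more precisely, the same number of partitions and the same number of elements in each partition).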
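When the two RDDs have the same number of rows but are partitioned differently, zip fails at runtime. One possible workaround, sketched here rather than taken from the answer above, is to key both sides by their row position with zipWithIndex and then join on that position. The helper name cbind is hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

object ColumnBind {
  /** Pairs the i-th element of `left` with the i-th element of `right`. */
  def cbind[A: ClassTag, B: ClassTag](left: RDD[A], right: RDD[B]): RDD[(A, B)] = {
    // zipWithIndex yields (element, position); swap to get (position, element).
    val indexedLeft = left.zipWithIndex().map(_.swap)
    val indexedRight = right.zipWithIndex().map(_.swap)
    // Join on the position, restore the row order, then drop the position.
    indexedLeft.join(indexedRight).sortByKey().values
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cbind").getOrCreate()
    val sc = spark.sparkContext
    try {
      val a = sc.parallelize(Seq(1, 2, 3, 4), 2)          // 2 partitions
      val b = sc.parallelize(Seq("a", "b", "c", "d"), 3)  // 3 partitions, so a.zip(b) would fail
      cbind(a, b).collect().foreach(println)
    } finally spark.stop()
  }
}
```

This costs a shuffle for the join and another for the sort, so plain zip remains preferable whenever its partitioning requirement holds.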

Non Deterministic Behaviour of UNION of RDD in Spark

I'm performing a union on 3 RDDs. I'm aware that union doesn't preserve ordering, but in my case the behaviour is quite weird. Can someone explain what's wrong in my code?
I have a DataFrame of rows (myDF) and converted it to an RDD:
val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))
myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/
val rowCount = myRdd.count() // Count of Records in myRdd
val header = "name:country:date:nextdate:1" // random header
// Generating the header RDD
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))
// Generating the trailer RDD
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount), 1).map(rec => (3, rec))
// Performing the union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")
As union doesn't preserve ordering, I would not expect the result below.
Output:
name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
How is it possible to get the above output without any sorting, i.e. without something like this?
sortByKey(true, 1)
But when I remove the map from headerRdd, myRdd and trailerRdd, the order is like this:
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
What is the possible reason for the above behaviour?
In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered.
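The partition layout can be inspected with glom(), which turns each partition into an array. The sketch below uses made-up sample strings and assumes each input RDD has a single partition with no partitioner set. To my understanding, union then places the partitions of the first RDD before those of the second, which would explain the header / rows / trailer order in the saved files. The second method shows the explicit alternative for code that must not depend on that layout: tag the records and sort by the tag.

```scala
import org.apache.spark.sql.SparkSession

object OrderedUnion {
  /** Returns the content of every partition of header ++ body ++ trailer. */
  def partitionLayout(spark: SparkSession): List[List[String]] = {
    val sc = spark.sparkContext
    val header = sc.parallelize(Seq("name:country:date"), 1)
    val body = sc.parallelize(Seq("row1", "row2", "row3"), 1)
    val trailer = sc.parallelize(Seq("T:3"), 1)
    // glom() exposes one array per partition, in partition order.
    header.union(body).union(trailer).glom().map(_.toList).collect().toList
  }

  /** Tags each record with 1 (header), 2 (body) or 3 (trailer) and sorts by the tag. */
  def explicitlyOrdered(spark: SparkSession): List[String] = {
    val sc = spark.sparkContext
    val header = sc.parallelize(Seq("name:country:date"), 1).map(rec => (1, rec))
    val body = sc.parallelize(Seq("row1", "row2", "row3"), 2).map(rec => (2, rec))
    val trailer = sc.parallelize(Seq("T:3"), 1).map(rec => (3, rec))
    // The sort fixes the order of the three groups; the order of rows
    // sharing the same tag is not something this sketch relies on.
    header.union(body).union(trailer)
      .sortByKey(ascending = true, numPartitions = 1)
      .values
      .collect()
      .toList
  }
}
```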

Compare two rdd and the values which match from the right rdd put it in the rdd

I have two RDDs:
rdd1     rdd2
1,abc    3,asd
2,edc    4,qwe
3,wer    5,axc
4,ert
5,tyu
6,sdf
7,ghj
I want to compare the two RDDs and, for every id that matches, replace the value in rdd1 with the value from rdd2.
I understand that RDDs are immutable, so I assume a new RDD will be created.
The output RDD should look something like this:
output rdd
1,abc
2,edc
3,asd
4,qwe
5,axc
6,sdf
7,ghj
It's a basic thing, but I am new to Spark and Scala and am trying things out.
Use leftOuterJoin to match two RDDs by key, then use map to choose the "new value" (from rdd2) if it exists, or keep the "old" one otherwise:
// sample data:
val rdd1 = sc.parallelize(Seq((1, "aaa"), (2, "bbb"), (3, "ccc")))
val rdd2 = sc.parallelize(Seq((3, "333"), (4, "444"), (5, "555")))
val result = rdd1.leftOuterJoin(rdd2).map {
  case (key, (oldV, maybeNewV)) => (key, maybeNewV.getOrElse(oldV))
}

Scala/Spark - Aggregating RDD

Just wondering how I can do the following:
Suppose I have an RDD containing (username, age, movieBought) for many usernames and some lines can have the same username and age but a different movieBought.
How can I remove the duplicated lines and transform it into (username, age, movieBought1, movieBought2...)?
Kind Regards
val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(_._3)))
val results = grouped.collect.toList
UPDATE (if each tuple also has number of movies item):
val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(m => (m._3, m._4))))
val results = grouped.collect.toList
I was gonna suggest collect and to list, but ka4eli beat me to it.
I guess you could also use groupBy / groupByKey followed by a reduce / reduceByKey operation. The downside, of course, is that the results (movie1, movie2, movie3, ...) are concatenated into a single string instead of a List, which makes accessing them difficult.
// rdd holds (username, age, movieBought) tuples
val group = rdd.map(x => ((x._1, x._2), x._3)).groupBy(_._1)
val result = group.map(x => (x._1._1, x._1._2, x._2.map(y => y._2).reduce(_ + "," + _)))
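A variant that removes duplicates and keeps the movies as a collection instead of a string is to aggregate them into a Set per (username, age) key. This is a sketch using aggregateByKey, which neither answer above uses, and the sample data is made up.

```scala
import org.apache.spark.sql.SparkSession

object MoviesPerUser {
  /** Collects the distinct movies bought by each (username, age) pair. */
  def moviesByUser(spark: SparkSession): Map[(String, Int), Set[String]] = {
    val sc = spark.sparkContext
    // Made-up sample rows: (username, age, movieBought), with one duplicate line.
    val rdd = sc.parallelize(Seq(
      ("ann", 30, "Alien"),
      ("ann", 30, "Brazil"),
      ("bob", 41, "Alien"),
      ("ann", 30, "Alien")))

    rdd
      .map { case (user, age, movie) => ((user, age), movie) }
      // Start from an empty Set, add each movie within a partition,
      // then merge the per-partition Sets.
      .aggregateByKey(Set.empty[String])(_ + _, _ ++ _)
      .collect()
      .toMap
  }
}
```

Because a Set drops repeated elements, the duplicated ("ann", 30, "Alien") line contributes only once.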

Spark sort RDD and join their rank

I have an RDD[(VertexId, Double)], and I want to sort it by _._2 and join the index (rank) with this RDD, so that I can get an element and its rank with a filter.
Currently I sort the RDD with sortBy, but I do not know how to join an RDD with its rank. So I collect it as a sequence and zip it with its index. This is not efficient, and I am wondering if there is a more elegant way to do it.
The code I'm using right now is:
val tmpRes = graph.vertices.sortBy(_._2, ascending = false) // Sort all nodes by PR score in descending order
  .collect() // collect to the driver; this may be very expensive
tmpRes.zip(tmpRes.indices) // zip with index
If, by any chance, you'd like to bring only the first n tuples back to the driver, you could use takeOrdered(n)(ordering), where n is the number of results to bring back and ordering is the comparator you'd like to use.
Otherwise, you can use the zipWithIndex transformation, which will transform your RDD[(VertexId, Double)] into an RDD[((VertexId, Double), Long)] with the proper index (of course, you should do that after your sort).
For example:
scala> val data = sc.parallelize(List(("A", 1), ("B", 2)))
scala> val sorted = data.sortBy(_._2)
scala> sorted.zipWithIndex.collect()
res1: Array[((String, Int), Long)] = Array(((A,1),0), ((B,2),1))
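Applied to the descending sort from the question, both suggestions could look roughly like the sketch below. It assumes Long as a stand-in for GraphX's VertexId, and the scores and method names are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RankByScore {
  // Long stands in for GraphX's VertexId; the scores are made-up sample values.
  def sampleScores(spark: SparkSession): RDD[(Long, Double)] =
    spark.sparkContext.parallelize(Seq((1L, 0.15), (2L, 0.55), (3L, 0.30)))

  /** Pairs every (id, score) with its 0-based rank, highest score first. */
  def ranked(scores: RDD[(Long, Double)]): RDD[((Long, Double), Long)] =
    scores.sortBy(_._2, ascending = false).zipWithIndex()

  /** Brings only the n highest-scoring entries back to the driver. */
  def topN(scores: RDD[(Long, Double)], n: Int): Array[(Long, Double)] =
    scores.takeOrdered(n)(Ordering.by[(Long, Double), Double](_._2).reverse)
}
```

The ranked RDD stays distributed, so a single vertex and its rank can be looked up with something like ranked(scores).filter(_._1._1 == 3L) without collecting everything first.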
Regards,