Compare two RDDs, and put the values that match from the right RDD into a new RDD - Scala

I have two RDDs:
rdd1     rdd2
1,abc    3,asd
2,edc    4,qwe
3,wer    5,axc
4,ert
5,tyu
6,sdf
7,ghj
Compare the two RDDs, and wherever an id matches, the value in rdd1 should be updated with the value from rdd2.
I understand that RDDs are immutable, so I expect a new RDD to be created.
The output RDD should look something like this:
output rdd
1,abc
2,edc
3,asd
4,qwe
5,axc
6,sdf
7,ghj
It's a basic thing, but I am new to Spark and Scala and am trying things out.

Use leftOuterJoin to match two RDDs by key, then use map to choose the "new value" (from rdd2) if it exists, or keep the "old" one otherwise:
// sample data:
val rdd1 = sc.parallelize(Seq((1, "aaa"), (2, "bbb"), (3, "ccc")))
val rdd2 = sc.parallelize(Seq((3, "333"), (4, "444"), (5, "555")))
val result = rdd1.leftOuterJoin(rdd2).map {
  case (key, (oldV, maybeNewV)) => (key, maybeNewV.getOrElse(oldV))
}
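Applied to the data from the question (a sketch, assuming a spark-shell session where `sc` is available), the same pattern produces the desired output; sortByKey is used only so the result prints in a stable order, since RDDs carry no ordering guarantee:

```scala
// The question's data as (id, value) pair RDDs.
val left  = sc.parallelize(Seq((1, "abc"), (2, "edc"), (3, "wer"),
                               (4, "ert"), (5, "tyu"), (6, "sdf"), (7, "ghj")))
val right = sc.parallelize(Seq((3, "asd"), (4, "qwe"), (5, "axc")))

// Keep every row of the left RDD; take the right value when the key matched.
val updated = left.leftOuterJoin(right).map {
  case (key, (oldV, maybeNewV)) => (key, maybeNewV.getOrElse(oldV))
}

updated.sortByKey().collect().foreach { case (k, v) => println(s"$k,$v") }
// 1,abc  2,edc  3,asd  4,qwe  5,axc  6,sdf  7,ghj
```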

Related

Column bind two RDD in scala spark without KEYs

The two RDDs have the same number of rows.
I am searching for the equivalent of R's cbind().
It seems join() always requires a key.
The closest is the .zip method, followed by an appropriate .map. E.g.:
val rdd0 = sc.parallelize(Seq( (1, (2,3)), (2, (3,4)) ))
val rdd1 = sc.parallelize(Seq( (200,300), (300,400) ))
val zipRdd = (rdd0 zip rdd1).collect
returns:
zipRdd: Array[((Int, (Int, Int)), (Int, Int))] = Array(((1,(2,3)),(200,300)), ((2,(3,4)),(300,400)))
Indeed, the zipped result keeps the original (k, v) structures, and the same number of rows is required in both RDDs.
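The "appropriate subsequent .map" mentioned above can flatten the nested tuples into a single cbind-style row (a sketch; note that zip also requires both RDDs to have the same number of partitions with the same number of elements per partition):

```scala
// Two RDDs with the same number of rows, no shared key.
val rdd0 = sc.parallelize(Seq((1, (2, 3)), (2, (3, 4))))
val rdd1 = sc.parallelize(Seq((200, 300), (300, 400)))

// Zip pairs rows positionally; the map flattens each pair into one tuple.
val bound = (rdd0 zip rdd1).map {
  case ((id, (a, b)), (c, d)) => (id, a, b, c, d)
}
bound.collect()
// Array((1,2,3,200,300), (2,3,4,300,400))
```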

Non Deterministic Behaviour of UNION of RDD in Spark

I'm performing a union operation on 3 RDDs. I'm aware that union doesn't preserve ordering, but in my case the behaviour is quite weird. Can someone explain what's wrong with my code?
I have a DataFrame (myDF) of rows and converted it to an RDD:
val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))
myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/
val rowCount = myRdd.count() // Count of Records in myRdd
val header = "name:country:date:nextdate:1" // random header
// Generating Header Rdd
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))
//Generating Trailer Rdd
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount),1).map(rec => (3, rec))
//Performing Union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")
Since union doesn't preserve ordering, it should not give the result below:
Output
name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
Without using any sorting, such as sortByKey(true, 1), how is it possible to get the above output?
But when I remove the map from headerRdd, myRdd and trailerRdd, the order is like:
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
What is the possible reason for the above behaviour?
In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered.
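This explains the result: union does not shuffle, it simply concatenates the partitions of its inputs in order. A minimal sketch (assuming a spark-shell `sc`, with simplified stand-in data) where each input has exactly one partition, so header, body and trailer always come out in that order:

```scala
// Each input RDD is built with exactly one partition.
val headerRdd  = sc.parallelize(Seq("HEADER"), 1)
val bodyRdd    = sc.parallelize(Seq("row1", "row2"), 1)
val trailerRdd = sc.parallelize(Seq("T:2"), 1)

// Union concatenates partitions: header's, then body's, then trailer's.
val unioned = headerRdd.union(bodyRdd).union(trailerRdd)
unioned.getNumPartitions  // 3: one per input RDD
unioned.collect()         // Array(HEADER, row1, row2, T:2)
```

Within a single partition built from a local collection the element order is also stable here, which is why the output looks sorted without any sort being applied.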

How can I insert values of a DataFrame column into a list

I want to add the values of a DataFrame column (named prediction) to a List, so that I can write a CSV file using those values, which will further split that column into 3 more columns.
I have tried creating a new list and assigning the column to it, but it only adds the schema of the column instead of the data.
// This is the prediction DataFrame, the output of the model, stored in the value PredictionModel
val PredictionModel = model.transform(testDF)
PredictionModel.select("features", "label", "prediction")
val ListOfPredictions: List[String] = List(PredictionModel.select("prediction").toString())
The expected result is the data of the column being assigned to the list so that it can be used further.
But the actual outcome is only the schema of the column being assigned to the list, as follows:
[prediction: double]
You can write the whole DataFrame as CSV:
PredictionModel.select("features", "label", "prediction")
  .write
  .option("header", "true")
  .option("delimiter", ",")
  .csv("C:/yourfile.csv")
But if you want the DataFrame as a List of concatenated column values, you can try this:
import spark.implicits._
import org.apache.spark.sql.functions.concat_ws

val data = Seq(
  (1, 99),
  (1, 99),
  (1, 70),
  (1, 20)
).toDF("id", "value")

val ok: List[String] = data
  .select(concat_ws(",", data.columns.map(data(_)): _*))
  .map(s => s.getString(0))
  .collect()
  .toList
output:
ok.foreach(println(_))
1,99
1,99
1,70
1,20
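If the goal is just the prediction values themselves rather than concatenated rows, a typed Dataset gets you a List directly. A sketch, assuming a spark-shell `spark` session and that prediction is a Double column (the sample values here are hypothetical stand-ins for the model output):

```scala
import spark.implicits._

// Hypothetical data standing in for PredictionModel.
val predictions = Seq(0.0, 1.0, 1.0).toDF("prediction")

// as[Double] gives a typed Dataset[Double]; collect brings the values to the driver.
val predictionList: List[Double] =
  predictions.select("prediction").as[Double].collect().toList
// List(0.0, 1.0, 1.0)
```

This avoids the original mistake: calling toString() on a DataFrame only renders its schema, it never materializes the data.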

spark.createDataFrame() not working with Seq RDD

createDataFrame takes 2 arguments: an RDD and a schema.
My schema is like this:
val schemas = StructType(
  Seq(
    StructField("number", IntegerType, false),
    StructField("notation", StringType, false)
  )
)
In one case I am able to create a DataFrame from an RDD, like below:
val data1 = Seq(Row(1, "one"), Row(2, "two"))
val rdd = spark.sparkContext.parallelize(data1)
val final_df = spark.createDataFrame(rdd, schemas)
In the other case, like below, I am not able to:
val data2 = Seq((1, "one"), (2, "two"))
val rdd = spark.sparkContext.parallelize(data2)
val final_df = spark.createDataFrame(rdd, schemas)
What's wrong with data2 that it cannot become a valid RDD for a DataFrame?
We are able to create a DataFrame using toDF() with data2, but not with createDataFrame:
val data2_DF=Seq((1,"one"),(2,"two")).toDF("number", "notation")
Please help me understand this behaviour.
Is Row mandatory when creating a DataFrame?
In the second case, just do:
val final_df = spark.createDataFrame(rdd)
Because your RDD is an RDD of Tuple2 (which is a Product), the schema is known at compile time, so you don't need to specify one.
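If you do need the explicit schema (for example, to force nullability), the tuples can be mapped to Rows first, so the RDD matches the createDataFrame(RDD[Row], schema) overload. A sketch, assuming a spark-shell `spark` session:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

val schemas = StructType(Seq(
  StructField("number", IntegerType, false),
  StructField("notation", StringType, false)
))

val data2 = Seq((1, "one"), (2, "two"))
// Convert each tuple to a Row; now the RDD type matches the schema overload.
val rowRdd = spark.sparkContext.parallelize(data2).map { case (n, s) => Row(n, s) }
val final_df = spark.createDataFrame(rowRdd, schemas)
final_df.printSchema()
```

So Row is mandatory only for the overload that takes a schema; the tuple-based overloads infer the schema via the Product encoder instead.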

Access joined RDD fields in a readable way

I joined two RDDs, and now when I try to access the new RDD's fields I need to treat them as tuples. This leads to code that is not very readable. I tried to use type aliases to create some readable names, but it doesn't work; I still need to access the fields as tuples. Any idea how to make the code more readable?
for example - when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of _2, _5, etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
  .join(rdd2)
  .map {
    case (name, ((age, isProlonged), payments)) => (name, payments.sum)
  }
  .filter {
    case (name, sum) => sum > 0
  }
  .collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction over RDDs and write SQL queries.
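A further option is to replace the inner tuples with case classes before joining, so the fields have real names everywhere, not just inside one pattern match. A sketch using the same sample data (the case class and field names are assumptions):

```scala
// Named fields instead of _1/_2 positional access.
case class Person(age: Int, isProlonged: Boolean)

val people   = sc.parallelize(List(("John", Person(28, true)), ("Mary", Person(22, true))))
val payments = sc.parallelize(List(("John", List(100, 200, -20))))

val positive = people.join(payments)
  .filter { case (_, (person, ps)) => ps.sum > 0 }
  .map    { case (name, (person, ps)) => (name, person.age, ps.sum) }
positive.collect()
// Array((John,28,280))
```

Unlike a type alias, a case class gives the compiler distinct named accessors (person.age rather than x._2._1), which is why the alias approach did not help.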