Unable to compare Spark SQL Date columns - scala

I have a case class in scala
case class TestDate (id: String, loginTime: java.sql.Date)
I created 2 RDD's of type TestDate
I wanted to do an inner join on two rdd's where the values of loginTime column is equal. Please find the code snippet below,
firstRDD.toDF.registerTempTable("firstTable")
secondRDD.toDF.registerTempTable("secondTable")
val res = sqlContext.sql("select * from firstTable INNER JOIN secondTable on to_date(firstTable.loginTime) = to_date(secondTable.loginTime)")
I'm not getting any exception. But i'm not getting correct answer too.
It does a cartesian and some random dates are generated in the result.

The issue was due to a wrong format given while creating the date object. When the format was rectified, it worked fine.

You can try using another approach:
val df1 = firstRDD.toDF
val df2 = secondRDD.toDF
val res = df1.join(df2, Seq("loginTime"))
If it doesn't work, you can try casting your dates to string:
val df1 = firstRDD.toDF.withColumn("loginTimeStr", col("loginTime").cast("string"))
val df2 = secondRDD.toDF.withColumn("loginTimeStr", col("loginTime").cast("string"))
val res = df1.join(df2, Seq("loginTimeStr"))
Finally, maybe the problem is that you also need the ID column in the join?
val df1 = firstRDD.toDF
val df2 = secondRDD.toDF
val res = df1.join(df2, Seq("id", "loginTime"))

Related

How to join multiple dataFrames in spark with different column names and types without converting into RDD

My df1 has column of type Double, df2 has column of type Timestamp and df3 has column of type Integer.
I'm trying to achieve something like this:
df1 = ...
df2 = ...
df3 = ...
val df4 = df1.zip(df2).zip(df3)
However there's no such function like "zip". How can I archive such result?
There's no explicit zip for DataFrames. You can do workaround:
val df1Ordered = df1.withColumn("rowNr", row_number().over(Window.orderBy('someColumn));
// the same for other DataFrames
// now join those DataFrames
val newDF = df1Ordered.join(df2Ordered, "rowNr").join("df3Ordered", "rowNr")
However it will be quite slow, because there is no partitionBy in Window operation.

Compare 2 dataframes and filter results based on date column in spark

I have 2 dataframes in spark as mentioned below.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc");
test: org.apache.spark.sql.DataFrame = [test_dt: string]
val test1 = hivecontext.table("testing");
where test1 has columns like id,name,age,audit_dt
I want to compare these 2 dataframes and filter rows from test1 where audit_dt > test_dt. Somehow I am not able to do that. I am able to compare audit_dt with literal date using lit function but i am not able to compare it with another dataframe column.
I am able to compare literal date using lit function as mentioned below
val output = test1.filter(to_date(test1("audit_date")).gt(lit("2017-03-23")))
Max Date in test dataframe is -> 2017-04-26
Data in test1 Dataframe ->
Id,Name,Age,Audit_Dt
1,Rahul,23,2017-04-26
2,Ankit,25,2017-04-26
3,Pradeep,28,2017-04-27
I just need the data for Id=3 since that only row qualifies the greater than criteria of max date.
I have already tried below mentioned option but it is not working.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc")
val MAX_AUDIT_DT = test.first().toString()
val output = test.filter(to_date(test("audit_date")).gt((lit(MAX_AUDIT_DT))))
Can anyone suggest as way to compare it with column of dataframe test?
Thanks
You can use non-equi joins, if both columns "test_dt" and "audit_date" are of class date.
/// cast to correct type
import org.apache.spark.sql.functions.to_date
val new_test = test.withColumn("test_dt",to_date($"test_dt"))
val new_test1 = test1.withColumn("Audit_Dt", to_date($"Audit_Dt"))
/// join
new_test1.join(new_test, $"Audit_Dt" > $"test_dt")
.drop("test_dt").show()
+---+-------+---+----------+
| Id| Name|Age| Audit_Dt|
+---+-------+---+----------+
| 3|Pradeep| 28|2017-04-27|
+---+-------+---+----------+
Data
val test1 = sc.parallelize(Seq((1,"Rahul",23,"2017-04-26"),(2,"Ankit",25,"2017-04-26"),
(3,"Pradeep",28,"2017-04-27"))).toDF("Id","Name", "Age", "Audit_Dt")
val test = sc.parallelize(Seq(("2017-04-26"))).toDF("test_dt")
Try with this:
test1.filter(to_date(test1("audit_date")).gt(to_date(test("test_dt"))))
Store the value in a variable and use in filter.
val dtValue = test.select("test_dt")
OR
val dtValue = test.first().getString(0)
Now apply filter
val output = test1.filter(to_date(test1("audit_date")).gt(lit(dtValue)))

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two spark-scala dataframes on multiple columns dynamically. I would to avoid hard coding column name comparison as shown in the following statments;
val joinRes = df1.join(df2, df1("col1") == df2("col1") and df1("col2") == df2("col2"))
The solution for this query already exists in pyspark version --provided in the following link
PySpark DataFrame - Join on multiple columns dynamically
I would like to code the same code using spark-scala
In scala you do it in similar way like in python but you need to use map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)

Spark Scala. Using an external variables "dataframe" in a map

I have two dataframes,
val df1 = sqlContext.csvFile("/data/testData.csv")
val df2 = sqlContext.csvFile("/data/someValues.csv")
df1=
startTime name cause1 cause2
15679 CCY 5 7
15683 2 5
15685 1 9
15690 9 6
df2=
cause description causeType
3 Xxxxx cause1
1 xxxxx cause1
3 xxxxx cause2
4 xxxxx
2 Xxxxx
and I want to apply a complex function getTimeCust to both cause1 and cause2 to determine a final cause, then match the description of this final cause code in df2. I must have a new df (or rdd) with the following columns:
startTime name cause descriptionCause
My solution was
val rdd2 = df1.map(row => {
val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), df2)
Row (row(0),row(1),cause,descriptionCause)
})
If a run the code below I have a NullPointerException because the df2 is not visible.
The function getTimeCust(Int, Int, DataFrame) works well outside the map.
Use df1.join(df2, <join condition>) to join your dataframes together then select the fields you need from the joined dataframe.
You can't use spark's distributed structures (rdd, dataframe, etc) in code that runs on an executor (like inside a map).
Try something like this:
def f1(cause1: Int, cause2: Int): Int = some logic to calculate cause
import org.apache.spark.sql.functions.udf
val dfCause = df1.withColumn("df1_cause", udf(f1)($"cause1", $"cause2"))
val dfJoined = dfCause.join(df2, on= df1Cause("df1_cause")===df2("cause"))
dfJoined.select("cause", "description").show()
Thank you #Assaf. Thanks to your answer and the spark udf with data frame. I have resolved the this problem. The solution is:
val getTimeCust= udf((cause1: Any, cause2: Any) => {
var lastCause = 0
var categoryCause=""
var descCause=""
lastCause= .............
categoryCause= ........
(lastCause, categoryCause)
})
and after call the udf as:
val dfWithCause = df1.withColumn("df1_cause", getTimeCust( $"cause1", $"cause2"))
ANd finally the join
val dfFinale=dfWithCause.join(df2, dfWithCause.col("df1_cause._1") === df2.col("cause") and dfWithCause.col("df1_cause._2") === df2.col("causeType"),'outer' )

How to create a dataframe using the value of another dataframe

I am getting suppId DataFrame using below code.
val suppId = sqlContext.sql("SELECT supp_id FROM supplier")
The DataFrame return single or multiple value.
Now I want to create a DataFrame using the value of supp_id from suppId DataFrame. But not understand, how to write this.
I have written below code. But the code is not working.
val nonFinalPE = sqlContext.sql("select * from pmt_expr)
nonFinalPE.where("supp_id in suppId(supp_id)")
It took me a second to figure out what you're trying to do. But, it looks like you want rows from nonFinalPe that are also in suppId. You'd get this by doing an inner join of the two data frames which would look like below
val suppId = sqlContext.sql("SELECT supp_id FROM supplier")
val nonFinalPE = sqlContext.sql("select * from pmt_expr")
val joinedDF = nonFinalPE.join(suppId, nonFinalPE("???") === suppId("supp_id"), "inner")