Scala: Comparing Values in 2 Spark DataFrames

I am trying to write a condition statement for joining 2 Spark dataframes together in Scala:
val joinCondition = when(
  $"filteredRESULT.key" == $"allDataUSE.key" &&
    $"allDataUSE.timestamp" >= $"filteredRESULT.tripStart" &&
    $"allDataUSE.timestamp" <= $"filteredRESULT.tripEND",
  $"allDataUSE.tripid" === $"filteredRESULT.tripid"
).otherwise($"allDataUSE.tripid" === 0)
The filteredRESULT df is very small, and includes a tripID, tripStart time, tripEnd time. My goal is to use filteredRESULT as a lookup table, where a row from the allDataUSE df is compared against the entries in filteredRESULT. For example:
If a row in allDataUSE matches a filteredRESULT key, and its timestamp is >= that trip's start time and <= that trip's end time, then the tripid column in allDataUSE should receive the value of tripid from the filteredRESULT df.
I am getting a boolean error when I run the above conditional statement. How can I perform this operation? Thank you!!

You are getting the Boolean error because when expects its condition to be a Column, but the plain Scala operator == returns a Boolean; mixing that Boolean with the Column results of >= and <= in the && chain is what triggers the error. Spark's === operator is the one that returns a Column, so it is the operator to use for equality inside the condition.
Below is a link to the Spark documentation for the column equality operator:
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Column.html#equals(java.lang.Object)
public Column equalTo(Object other)
Equality test.
// Scala:
df.filter( df("colA") === df("colB") )
// Java
import static org.apache.spark.sql.functions.*;
df.filter( col("colA").equalTo(col("colB")) );
So keep === (equalTo) for the comparisons inside the condition; it is == that should be replaced, since it collapses the expression to a plain Boolean.
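For reference, here is a minimal sketch of how the corrected logic could look, assuming allDataUSE and filteredRESULT are the DataFrames from the question, import spark.implicits._ is in scope for the $ syntax, and allDataUSE does not already carry a tripid column; this is an illustration, not a tested implementation:

import org.apache.spark.sql.functions.{coalesce, lit}

// Build the condition entirely from Column expressions (===, &&, >=, <=)
// so it stays a Column instead of collapsing to a Boolean.
val joinCondition =
  $"allDataUSE.key" === $"filteredRESULT.key" &&
  $"allDataUSE.timestamp" >= $"filteredRESULT.tripStart" &&
  $"allDataUSE.timestamp" <= $"filteredRESULT.tripEND"

// Left join keeps every row of allDataUSE; rows with no matching trip get tripid 0.
val tagged = allDataUSE.alias("allDataUSE")
  .join(filteredRESULT.alias("filteredRESULT"), joinCondition, "left")
  .withColumn("tripid", coalesce($"filteredRESULT.tripid", lit(0)))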

Related

generate dynamic join condition spark/scala

I have an array of tuples and I want to generate a join condition (OR-ed together) from it.
e.g.
input --> [("leftId", "rightId"), ("leftAltId", "rightAltId")]
output --> leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
method signature:
def inner(leftDF: DataFrame, rightDF: DataFrame, fieldsToJoin: Array[(String,String)]): Unit = {
}
I tried using a reduce operation on the array, but the output of my reduce is a Column and not a String, so it can't be fed back in as input. I could do it recursively, but I'm hoping there's a simpler way to initialize an empty column variable and build up the condition. Thoughts?
You can do something like this:
val cond = fieldsToJoin.map(x => col(x._1) === col(x._2)).reduce(_ || _)
leftDF.join(rightDF, cond)
Basically you first turn the array into an array of conditions (col turns each string into a Column, and === builds the equality comparison), and then reduce combines them with || ("or"). The result is a Column you can use as the join condition.
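Putting it together, a rough sketch of the full method (returning the joined DataFrame rather than Unit, which is presumably the intent; untested and assuming fieldsToJoin is non-empty):

import org.apache.spark.sql.DataFrame

def inner(leftDF: DataFrame, rightDF: DataFrame,
          fieldsToJoin: Array[(String, String)]): DataFrame = {
  // Turn each (leftCol, rightCol) pair into an equality Column,
  // then OR them all into a single join condition.
  val cond = fieldsToJoin
    .map { case (l, r) => leftDF(l) === rightDF(r) }
    .reduce(_ || _)
  leftDF.join(rightDF, cond)
}

// e.g. inner(leftDF, rightDF, Array(("leftId", "rightId"), ("leftAltId", "rightAltId")))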

Spark dataframe date_add function with case when not working

I have a Spark DataFrame in which I want to add a number of days to an existing date column, where the number depends on a condition.
My code is something like this:
F.date_add(df.transDate,
F.when(F.col('txn_dt') == '2016-01-11', 9999).otherwise(10)
)
Since the date_add() function expects its second argument to be an int, but my expression returns a Column, it throws an error.
How can I get the value out of the case/when condition?
pyspark.sql.functions.when() returns a Column, which is why your code is producing the TypeError: 'Column' object is not callable
You can get the desired result by moving the when to the outside, like this:
F.when(
F.col('txn_dt') == '2016-01-11',
F.date_add(df.transDate, 9999)
).otherwise(F.date_add(df.transDate, 10))
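The Scala API behaves the same way; as a rough equivalent sketch (assuming columns named txn_dt and transDate, and a made-up target column name newDate):

import org.apache.spark.sql.functions.{col, date_add, when}

// date_add takes a plain Int for the days argument, so branch first
// and call date_add with a literal Int inside each branch.
val result = df.withColumn("newDate",
  when(col("txn_dt") === "2016-01-11", date_add(col("transDate"), 9999))
    .otherwise(date_add(col("transDate"), 10))
)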

Join on several conditions without duplicated columns

Joining on identity with Spark leads to the common key column being duplicated in the final Dataset:
val result = ds1.join(ds2, ds1("key") === ds2("key"))
// result now has two "key" columns
This is avoidable by using a Seq instead of the comparison, similar to the USING keyword in SQL:
val result = ds1.join(ds2, Seq("key"))
// result now has only one "key" column
However, this doesn't work when joining with a common key + another condition, like:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
// result has two "key" columns
val result = ds1.join(ds2, Seq("key") && ds1("foo") < ds2("foo"))
// compile error: value && is not a member of Seq[String]
Currently one way of getting out of this is to drop the duplicated column afterwards, but this is quite cumbersome:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
.drop(ds1("key"))
Is there a more natural, cleaner way to achieve the same goal?
You can separate the equi-join component from the additional filter:
ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))
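As a small self-contained illustration of this (the data and column names are made up for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-example").getOrCreate()
import spark.implicits._

val ds1 = Seq(("a", 1), ("b", 5)).toDF("key", "foo")
val ds2 = Seq(("a", 3), ("b", 2)).toDF("key", "foo")

// USING-style join on "key" keeps a single key column;
// the extra inequality is applied as a separate filter.
val result = ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))
result.show()
// Only key "a" survives: the keys match and ds1.foo (1) < ds2.foo (3).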

Implementing SQL logic via Dataframes using Spark and Scala

I have three columns (c1, c2, c3) in a Hive table t1, and MySQL code that checks whether specific columns are null. I have a DataFrame, df, built from the same table with the same three columns c1, c2, c3, and I would like to implement the same logic via the DataFrame.
Here is the SQL:
if(
t1.c1=0 Or IsNull(t1.c1),
if(
IsNull(t1.c2/t1.c3),
1,
t1.c2/t1.c3
),
t1.c1
) AS myalias
I drafted the following logic in Scala using "when" as an alternative to SQL's "if". I am having trouble writing the "Or" logic. How can I express the above SQL logic via a Spark DataFrame using Scala?
val df_withalias = df.withColumn("myalias",when(
Or((df("c1") == 0), isnull(df("c1"))),
when(
(isNull((df("c2") == 0)/df("c3")),
)
)
)
How can I write the above logic?
First, you can use Column's || operator to construct logical OR conditions. Also note that when takes only two arguments (a condition and a value); if you want to supply an alternative value (used when the condition isn't met), you need .otherwise:
val df_withalias = df.withColumn("myalias",
when(df("c1") === 0 || isnull(df("c1")),
when(isnull(df("c2")/df("c3")), 1).otherwise(df("c2")/df("c3"))
).otherwise(df("c1"))
)
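As a side note, Spark SQL also understands if() and isnull(), so an equivalent (untested) sketch is to pass the original expression through expr directly:

import org.apache.spark.sql.functions.expr

val df_withalias = df.withColumn("myalias",
  expr("if(c1 = 0 or isnull(c1), if(isnull(c2 / c3), 1, c2 / c3), c1)"))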

How to update rows based on condition in spark-sql

I am working on spark-sql for data preparation.
The problem I am facing comes after getting the result of the SQL query: how should I update rows based on an if-then-else condition?
What I am doing:
val table_join = sqlContext.sql(""" SELECT a.*,b.col as someCol
from table1 a LEFT JOIN table2 b
on a.ID=b.ID """)
table_join.registerTempTable("Table_join")
Now that I have the final joined table as a DataFrame, how should I update its rows?
//Final filtering operation
val final_filtered_table = table_join.map{ case record=>
if(record.getAs[String]("col1") == "Y" && record.getAs[String]("col2") == "") record.getAs[String]("col2")="UNKNOWN"
else if (record.getAs[String]("col1") == "N") record("col1")=""
else record
}
In the above map, the if syntax works properly, but the moment I add the update expression to modify the value, it gives me an error.
But why does the expression below work?
if(record.getAs[String]("col1") == "Y" && record.getAs[String]("col2") == "") "UNKNOWN"
But the moment I change "UNKNOWN" to record.getAs[String]("col2")="UNKNOWN", it gives me an error at .getAs.
Another approach I tried is this:
val final_filtered_sql = table_join.map{row =>
if(row.getString(6) == "Y" && row.getString(33) == "") row.getString(6) == "UNKNOWN"
else if(row.getString(6) == "N") row.getString(6) == ""
else row
}
This works, but is it the right approach? I should not refer to the columns by their positions but by their names. What approach should I follow to look up the columns by name and then update them?
Please help me with this. What syntax should I use to update rows based on a condition in a DataFrame in spark-sql?
record.getAs[String]("col2")="UNKNOWN" won't work because record.getAs[String](name) returns a String, which has no = method; assigning a new value to a String doesn't make sense.
DataFrame records don't have any setter methods because DataFrames are based on RDDs, which are immutable collections; you cannot change their state, and that is what you're trying to do here.
One way would be to create a new DataFrame using selectExpr on table_join and put that if/else logic there using SQL.
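As a rough sketch of that idea using the DataFrame API instead of raw SQL (when/otherwise expresses the same if/else without mutating rows; the column names col1 and col2 are taken from the question):

import org.apache.spark.sql.functions.{col, lit, when}

// Build new col2/col1 values with when/otherwise instead of trying
// to mutate the immutable Row objects inside a map.
val final_filtered_table = table_join
  .withColumn("col2",
    when(col("col1") === "Y" && col("col2") === "", lit("UNKNOWN"))
      .otherwise(col("col2")))
  .withColumn("col1",
    when(col("col1") === "N", lit(""))
      .otherwise(col("col1")))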