How to update rows based on condition in spark-sql - scala

I am working with Spark SQL (Scala) for data preparation.
The problem I am facing comes after getting the result of the SQL query: how should I update rows based on an if-then-else condition?
What I am doing:
val table_join = sqlContext.sql("""
  SELECT a.*, b.col AS someCol
  FROM table1 a LEFT JOIN table2 b
  ON a.ID = b.ID
""")
table_join.registerTempTable("Table_join")
Now that I have the final joined table as a DataFrame, how should I update its rows?
// Final filtering operation
val final_filtered_table = table_join.map { case record =>
  if (record.getAs[String]("col1") == "Y" && record.getAs[String]("col2") == "") record.getAs[String]("col2") = "UNKNOWN"
  else if (record.getAs[String]("col1") == "N") record("col1") = ""
  else record
}
In the map above, the if condition itself works properly, but the moment I try to apply the update (the assignment), it gives me an error.
Why does the version below work:
if (record.getAs[String]("col1") == "Y" && record.getAs[String]("col2") == "") "UNKNOWN"
but the moment I change "UNKNOWN" to record.getAs[String]("col2") = "UNKNOWN", it gives me an error at .getAs?
Another approach I tried is this:
val final_filtered_sql = table_join.map { row =>
  if (row.getString(6) == "Y" && row.getString(33) == "") row.getString(6) == "UNKNOWN"
  else if (row.getString(6) == "N") row.getString(6) == ""
  else row
}
This runs, but is it the right approach? I should not be addressing columns by their positions but by their names. What approach should I follow to get the column names and then update the values?
Please help me with this: what syntax should I use to update rows based on a condition in a Spark SQL DataFrame?

record.getAs[String]("col2") = "UNKNOWN" won't work because record.getAs[String](name) returns a String, which has no assignment method; assigning a new value to a String doesn't make sense.
DataFrame records don't have any setter methods because DataFrames are based on RDDs, which are immutable collections: you cannot change their state, and that is exactly what you are trying to do here.
One way would be to create a new DataFrame using selectExpr on table_join and put the if/else logic there using SQL, as sketched below.
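A minimal sketch of that idea, assuming the columns are literally named col1 and col2 (adjust to your schema). The first variant uses selectExpr with a SQL CASE expression, aliasing the result as col2_fixed (a name chosen here purely for illustration); the second expresses the same logic with when/otherwise from the Column API:
import org.apache.spark.sql.functions.{when, col}

// Variant 1: SQL CASE inside selectExpr, keeping every existing column
val viaSelectExpr = table_join.selectExpr(
  "*",
  "CASE WHEN col1 = 'Y' AND col2 = '' THEN 'UNKNOWN' ELSE col2 END AS col2_fixed"
)

// Variant 2: when/otherwise, overwriting the columns in place
val viaWhen = table_join
  .withColumn("col2", when(col("col1") === "Y" && col("col2") === "", "UNKNOWN").otherwise(col("col2")))
  .withColumn("col1", when(col("col1") === "N", "").otherwise(col("col1")))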

Related

Using LINQ to search comma separated string

There are a number of records in the table, and there is a column called AssignedTo whose value is a comma-separated string; the possible values for it could be something like:
"1"
"2"
"3"
"11"
"12"
"1,2"
"1,3"
"2,3"
"1,2,3"
"1,3,11"
"1,3,12"
If I use the following LINQ query to search, in the case value = "1":
records = records.Where(x => x.AssignedTo.Contains(value) || search == null);
It returns the records with AssignedTo value
"1", "11", "12", "1,2", "1,3", "1,2,3", "1,3,11", "1,3,12"
I really want to return only the records whose AssignedTo contains "1" as a value, which are "1", "1,2", "1,3", "1,2,3", "1,3,11", "1,3,12"; I do not want "11" and "12".
If I use the following LINQ query to search for the qualifying records, still with value = "1":
records = records.Where(x => x.AssignedTo.Contains("," + value + ",") ||
x.AssignedTo.StartsWith(value + ",") ||
x.AssignedTo.EndsWith("," + value) ||
value == null);
It returns the records with AssignedTo value "1,2", "1,3", "1,2,3", "1,3,11", "1,3,12", but missing the record with AssignedTo value "1".
Since something like this is likely a search filter, doing the operation in-memory isn't a very good option unless the row count is guaranteed to stay manageable. Ideally something like this should be refactored to use a proper relational structure rather than a comma-delimited string.
However, the example you had was mostly there; it was just missing an Equals option to catch the value on its own. I'd also take the 'value == null' check out of the LINQ expression and into a conditional that decides whether to add the WHERE clause at all. The difference is that with the condition inside the LINQ expression it gets generated into the SQL, whereas by pre-checking you can avoid the SQL conditions altogether when no value is specified.
if (!string.IsNullOrEmpty(value))
    records = records.Where(x => x.AssignedTo.Contains("," + value + ",") ||
                                 x.AssignedTo.StartsWith(value + ",") ||
                                 x.AssignedTo.EndsWith("," + value) ||
                                 x.AssignedTo == value);
This would catch "n,...", "...,n,...", "...,n", and "n".
A better method would be to split the string and search the results:
records = records.Where(x => x.AssignedTo.Split(',').Contains(value) || search == null);
Note that you can't use this method directly in an EF query since there's no way to translate it to standard SQL. So you may want to filter using your Contains as a starting spot (to reduce the number of false positives) and then filter in-memory:
records = records.Where(x => x.AssignedTo.Contains(value) || search == null)
    .AsEnumerable() // do subsequent filtering in-memory
    .Where(x => x.AssignedTo.Split(',').Contains(value) || search == null);
Or redesign your database to use related tables rather than storing a comma-delimited list of strings...
If you are building a LINQ expression against the database, the Split function will throw an error. You can use the expression below there instead:
if (!string.IsNullOrEmpty(value))
{
    records = records.Where(x => ("," + x.AssignedTo + ",").Contains("," + value + ","));
}

generate dynamic join condition spark/scala

I have an array of tuples and I want to generate a join condition (OR-ed together) from it.
e.g.
input --> [("leftId", "rightId"), ("leftAltId", "rightAltId")]
output --> leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
method signature:
def inner(leftDF: DataFrame, rightDF: DataFrame, fieldsToJoin: Array[(String,String)]): Unit = {
}
I tried using a reduce operation on the array, but the output of my reduce is a Column and not a String, so it can't be fed back in as input. I could do it recursively, but I'm hoping there is a simpler way to initialize an empty column variable and build up the condition. Thoughts?
You can do something like this:
import org.apache.spark.sql.functions.col

val cond = fieldsToJoin.map(x => col(x._1) === col(x._2)).reduce(_ || _)
leftDF.join(rightDF, cond)
Basically you first turn the array into an array of conditions (col turns each string into a Column and === builds the equality comparison), and then reduce puts an "or" between them. The result is a single Column you can use as the join condition.
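If the same column name happens to exist in both DataFrames, referencing each name through its owning DataFrame avoids ambiguity. A sketch of the inner method from the question under that assumption (returning the joined DataFrame rather than Unit):
import org.apache.spark.sql.DataFrame

def inner(leftDF: DataFrame, rightDF: DataFrame, fieldsToJoin: Array[(String, String)]): DataFrame = {
  // Build one equality per tuple, binding each name to its owning DataFrame,
  // then OR the conditions together.
  val cond = fieldsToJoin
    .map { case (l, r) => leftDF(l) === rightDF(r) }
    .reduce(_ || _)
  leftDF.join(rightDF, cond)
}

// e.g. inner(leftDF, rightDF, Array(("leftId", "rightId"), ("leftAltId", "rightAltId")))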

Scala Comparing Values in 2 Spark Dataframes

I am trying to write a condition statement for joining 2 Spark dataframes together in Scala:
val joinCondition = when(
  $"filteredRESULT.key" == $"allDataUSE.key" &&
  $"allDataUSE.timestamp" >= $"filteredRESULT.tripStart" &&
  $"allDataUSE.timestamp" <= $"filteredRESULT.tripEND",
  $"allDataUSE.tripid" === $"filteredRESULT.tripid"
).otherwise($"allDataUSE.tripid" === 0)
The filteredRESULT df is very small, and includes a tripID, tripStart time, tripEnd time. My goal is to use filteredRESULT as a lookup table, where a row from the allDataUSE df is compared against the entries in filteredRESULT. For example:
If a row in allDataUSE matches a filteredRESULT key and its timestamp is >= that trip's start time and <= that trip's end time, then the tripid column in allDataUSE should receive the value of tripid from the filteredRESULT df.
I am getting a boolean error when I run the above conditional statement. How can I perform this operation? Thank you!!
You are getting the boolean error because when (like where) expects its condition to be a Column, but Scala's == operator returns a plain Boolean rather than a Column; that is why you are getting that error. Spark's column equality operator is === (equalTo), which does return a Column.
Below I am sharing the link to the Spark documentation so you can see it:
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Column.html#equals(java.lang.Object)
public Column equalTo(Object other)
Equality test.
// Scala:
df.filter( df("colA") === df("colB") )
// Java
import static org.apache.spark.sql.functions.*;
df.filter( col("colA").equalTo(col("colB")) );
Parameters:
other - (undocumented)
Returns:
(undocumented)
So replace the == comparisons in your condition with === and it will work.
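For the lookup-table goal described in the question, a left join with a range condition is usually a more natural fit than when alone. A minimal sketch, assuming the two DataFrames are named filteredRESULT and allDataUSE and have the columns mentioned in the question (the selected columns are illustrative):
import org.apache.spark.sql.functions.{coalesce, lit}

// Left-join each allDataUSE row to the trip whose time window contains its timestamp.
val joined = allDataUSE.join(
  filteredRESULT,
  allDataUSE("key") === filteredRESULT("key") &&
    allDataUSE("timestamp") >= filteredRESULT("tripStart") &&
    allDataUSE("timestamp") <= filteredRESULT("tripEND"),
  "left_outer"
)

// Keep the matched tripid, or 0 when no trip window matched.
val withTripId = joined.select(
  allDataUSE("key"),
  allDataUSE("timestamp"),
  coalesce(filteredRESULT("tripid"), lit(0)).as("tripid")
)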

Join on several conditions without duplicated columns

Joining two Datasets on a common key with Spark leads to that key column being duplicated in the final Dataset:
val result = ds1.join(ds2, ds1("key") === ds2("key"))
// result now has two "key" columns
This is avoidable by using a Seq instead of the comparison, similar to the USING keyword in SQL:
val result = ds1.join(ds2, Seq("key"))
// result now has only one "key" column
However, this doesn't work when joining with a common key + another condition, like:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
// result has two "key" columns
val result = ds1.join(ds2, Seq("key") && ds1("foo") < ds2("foo"))
// compile error: value && is not a member of Seq[String]
Currently one way of getting out of this is to drop the duplicated column afterwards, but this is quite cumbersome:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
.drop(ds1("key"))
Is there a more natural, cleaner way to achieve the same goal?
You can separate the equi-join component from the filter:
ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))
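A small usage sketch under the question's assumptions (ds1 and ds2 both have key and foo columns; the aliases are illustrative):
import org.apache.spark.sql.functions.col

val result = ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))

// Only one "key" column remains, while ds1("foo") and ds2("foo") are still
// individually addressable after the join.
result.select(col("key"), ds1("foo").as("foo_left"), ds2("foo").as("foo_right")).show()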

Replace Empty values with nulls in Spark Dataframe

I have a data frame with n columns and I want to replace the empty strings in all of them with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.
Your first approach seems to fail due to a bug that prevents replace from being able to replace values with nulls, see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver (and not once per record). You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need to use the === operator, not Scala's ==, which merely compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
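Since the question mentions n columns, here is a minimal sketch that applies the same when/otherwise expression to every column, assuming all of them are string-typed (otherwise filter the column list by dtype first):
import org.apache.spark.sql.functions.{when, col, lit}

// Fold over all column names, replacing "" with null in each one.
val cleanedDF = rawDF.columns.foldLeft(rawDF) { (df, c) =>
  df.withColumn(c, when(col(c) === "", lit(null)).otherwise(col(c)))
}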