Spark: specify multiple logical conditions in the where clause of a Spark DataFrame - Scala

While defining multiple logical/relational conditions on a Spark Scala DataFrame I get the error mentioned below, but the same thing works fine in PySpark.
Python code:
df2=df1.where(((col('a')==col('b')) & (abs(col('c')) <= 1))
| ((col('a')==col('fin')) & ((col('b') <= 3) & (col('c') > 1)) & (col('d') <= 500))
| ((col('a')==col('b')) & ((col('c') <= 15) & (col('c') > 3)) & (col('d') <= 200))
| ((col('a')==col('b')) & ((col('c') <= 30) & (col('c') > 15)) & (col('c') <= 100)))
Tried the Scala equivalent:
val df_aqua_xentry_dtb_match=df_aqua_xentry.where((col("a") eq col("b")) & (abs(col("c") ) <= 1))
notebook:2: error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
val df_aqua_xentry_dtb_match=df_aqua_xentry.where((col("a") eq col("b")) & (abs(col("c") ) <= 1))
How do I define multiple logical conditions on a Spark DataFrame using Scala?

eq returns a Boolean, <= returns a Column. They are incompatible.
You probably want this:
df.where((col("a") === col("b")) && (abs(col("c") ) <= 1))
=== is used for equality between columns and returns a Column, so we can chain multiple conditions in the same where with &&.
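For completeness, here is a sketch of the full Scala equivalent of the Python filter from the question (df1/df2 and the column names are the same placeholders used above):
import org.apache.spark.sql.functions.{abs, col}
// Each group of conditions is combined with && and the groups are combined with ||,
// mirroring the & and | of the PySpark version.
val df2 = df1.where(
  ((col("a") === col("b")) && (abs(col("c")) <= 1)) ||
  ((col("a") === col("fin")) && (col("b") <= 3) && (col("c") > 1) && (col("d") <= 500)) ||
  ((col("a") === col("b")) && (col("c") <= 15) && (col("c") > 3) && (col("d") <= 200)) ||
  ((col("a") === col("b")) && (col("c") <= 30) && (col("c") > 15) && (col("c") <= 100))
)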

With Spark you should use
=== instead of == or eq (see explanation)
&& instead of & (&& is the logical AND on Columns; & here resolves to Boolean's bitwise AND, which expects a Boolean operand)
val df_aqua_xentry_dtb_match = df_aqua_xentry.where((col("a") === col("b")) && (abs(col("c") ) <= 1))

Please see the below solution.
df.where("StudentId == 1").explain(true)
== Parsed Logical Plan ==
'Filter ('StudentId = 1)
+- Project [_1#3 AS StudentId#7, _2#4 AS SubjectName#8, _3#5 AS Marks#9]
   +- LocalRelation [_1#3, _2#4, _3#5]
== Analyzed Logical Plan ==
StudentId: int, SubjectName: string, Marks: int
Filter (StudentId#7 = 1)
+- Project [_1#3 AS StudentId#7, _2#4 AS SubjectName#8, _3#5 AS Marks#9]
   +- LocalRelation [_1#3, _2#4, _3#5]
== Optimized Logical Plan ==
LocalRelation [StudentId#7, SubjectName#8, Marks#9]
Here we used a where clause; internally the optimizer converted it to a filter operation even though where was used at the code level.
So we can also apply the filter function on the rows of the DataFrame, like below:
df.filter(row => row.getString(1) == "A" && row.getInt(0) == 1).show()
Here 0 and 1 are column indexes of the DataFrame row. In my case the schema is (StudentId(Int), SubjectName(String), Marks(Int)).
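For comparison, here is a sketch of the same filter written with Column expressions instead of a row lambda (assuming the StudentId/SubjectName/Marks schema above); this form lets the optimizer see the predicate:
import org.apache.spark.sql.functions.col
// Same condition as the row-based filter, expressed on columns rather than on Row objects.
df.filter(col("SubjectName") === "A" && col("StudentId") === 1).show()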

There are a few issues with your Scala version of the code.
"eq" in Scala is reference equality on AnyRef (it corresponds to Java's == on references), so
when you try to compare two Columns using "eq", it returns a Boolean
instead of the Column type. Here you can use the "===" operator for Column comparison.
String comparison
scala> "praveen" eq "praveen"
res54: Boolean = true
scala> "praveen" eq "nag"
res55: Boolean = false
scala> lit(1) eq lit(2)
res56: Boolean = false
scala> lit(1) eq lit(1)
res57: Boolean = false
Column comparison
scala> lit(1) === lit(2)
res58: org.apache.spark.sql.Column = (1 = 2)
scala> lit(1) === lit(1)
19/08/02 14:00:40 WARN Column: Constructing trivially true equals predicate, '1 = 1'. Perhaps you need to use aliases.
res59: org.apache.spark.sql.Column = (1 = 1)
You are also using the bitwise AND operator (&) instead of the "and"/"&&" operator for the Column type. This is the reason you were getting the above error (it was expecting a Boolean instead of a Column).
scala> df.show
+---+---+
| id|id1|
+---+---+
|  1|  2|
+---+---+
scala> df.where((col("id") === col("id1")) && (abs(col("id")) > 2)).show
+---+---+
| id|id1|
+---+---+
+---+---+
scala> df.where((col("id") === col("id1")) and (abs(col("id")) > 2)).show
+---+---+
| id|id1|
+---+---+
+---+---+
Hope this helps!

Related

Multiple filter conditions in Scala with in and not in clause filters

I am trying to do a filter similar to the below using Scala:
where col1 = 'abc'
and col2 not in (0,4)
and col3 in (1,2,3,4)
I tried writing something like this:
val finalDf: DataFrame =
initDf.filter(col("col1") ="abc")
.filter(col("col2") <> 0)
.filter(col("col2") <> 4)
.filter(col("col3") = 1 ||col("col3") = 2 ||col("col3") = 3 ||col("col3") = 4)
or
val finalDf: DataFrame =
initDf.filter(col("col1") ="abc")
&& col("col2") != 0 && col("col2") != 4
&& (col("col3") = 1
|| col("col3") = 2
|| col("col3") = 3
|| col("col3") = 4))
Neither seems to be working. Can anyone help me with this?
For Column, the operators are a little bit different:
For equality use ===
For inequality use =!=
If you want to use literals you can use the lit function.
Your example may look like this
dfMain.filter(col("col1") === lit("abc"))
  .filter(col("col2") =!= lit(0))
  .filter(col("col2") =!= lit(4))
  .filter(col("col3") === lit(1) || col("col3") === lit(2) || col("col3") === lit(3) || col("col3") === lit(4))
You can also use isin instead of this filter with multiple ors, as shown below.
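A sketch of the same filter using isin and its negation (initDf and the column names as in the question):
import org.apache.spark.sql.functions.{col, lit}
val finalDf = initDf.filter(
  col("col1") === lit("abc") &&
  !col("col2").isin(0, 4) &&       // col2 not in (0, 4)
  col("col3").isin(1, 2, 3, 4)     // col3 in (1, 2, 3, 4)
)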
If you want to find out more about operators for Columns you can read these:
Medium blog post part1
Medium blog post part2

Smaller or equal comparison syntax error

My UDF checks whether the time difference between two columns is within a 5-day limit. If the == operator is used, the expression compiles properly, but <= (or lt) fails with a type mismatch error. Code:
val isExpiration : (Column, Column, Column) =>
  Column = (BCED, termEnd, agrEnd) =>
  {
    if(abs(datediff(if(termEnd == null) {agrEnd} else {termEnd}, BCED)) lt 6)
      {lit(0)}
    else
      {lit(1)}
  }
Error:
notebook:3: error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
if(abs(datediff(if(termEnd == null) {agrEnd} else {termEnd}, BCED)) lt 6) {lit(0)}...
^
I must be missing something obvious - can anyone advise how to test whether a Column value is smaller than or equal to a constant?
It looks like you have mixed udf and Spark functions; you need to use only one of them. When possible it's always preferable not to use a udf, since those cannot be optimized (and are thus generally slower). Without a udf it could be done as follows:
df.withColumn("end", when($"termEnd".isNull, $"agrEnd").otherwise($"termEnd"))
.withColumn("expired", when(abs(datediff($"end", $"BCED")) lt 6, 0).otherwise(1))
I introduced a temporary column to make the code a bit more readable.
Using a udf it could, for example, be done as follows:
import java.sql.Date  // java.sql.Date is the JVM type that maps to Spark's DateType

val isExpired = udf((a: Date, b: Date) => {
  if ((math.abs(a.getTime() - b.getTime()) / (1000 * 3600 * 24)) < 6) {
    0
  } else {
    1
  }
})
df.withColumn("end", when($"termEnd".isNull, $"agrEnd").otherwise($"termEnd"))
.withColumn("expired", isExpired($"end", $"BCED"))
Here, I again made use of a temporary column but this logic could be moved into the udf if preferred.
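For illustration, here is a sketch with the null handling moved inside the UDF itself (column names as in the question; the date columns are assumed to map to java.sql.Date, and $ assumes import spark.implicits._):
import java.sql.Date
import org.apache.spark.sql.functions.udf
// termEnd may be null, so the fallback to agrEnd happens inside the UDF.
val isExpired = udf((bced: Date, termEnd: Date, agrEnd: Date) => {
  val end = if (termEnd == null) agrEnd else termEnd
  if (math.abs(end.getTime - bced.getTime) / (1000L * 3600 * 24) < 6) 0 else 1
})
df.withColumn("expired", isExpired($"BCED", $"termEnd", $"agrEnd"))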
That's because abs(col).lt(6) returns an object of type Column, and if expects the condition to evaluate to true or false, which is a Scala Boolean type.
Plus, a UDF doesn't work on the Column data type; it works with Scala objects (Int, String, Boolean, etc.).
Since all you're doing is using Spark SQL functions, you can rewrite your UDF like this:
val isExpiration = (
when(
abs(datediff(coalesce($"termEnd", $"agrEnd") , $"BCED")) <= 6, lit(0)
).otherwise(lit(1))
)
And, the usage would be:
df.show
//+----------+----------+----------+
//|      BCED|   termEnd|    agrEnd|
//+----------+----------+----------+
//|2018-06-10|2018-06-25|2018-06-25|
//|2018-06-10|      null|2018-06-15|
//+----------+----------+----------+
df.withColumn("x", isExpiration).show
//+----------+----------+----------+---+
//|      BCED|   termEnd|    agrEnd|  x|
//+----------+----------+----------+---+
//|2018-06-10|2018-06-25|2018-06-25|  1|
//|2018-06-10|      null|2018-06-15|  0|
//+----------+----------+----------+---+

Spark SQL: How to filter records by multiple columns and use groupBy too

//dataset
michael,swimming,silve,2016,USA
usha,running,silver,2014,IND
lisa,javellin,gold,2014,USA
michael,swimming,silver,2017,USA
Questions --
1) How many silver medals have been won by the USA in each sport -- and my code throws the error: value === is not a member of String
val rdd = sc.textFile("/home/training/mydata/sports.txt")
val text =rdd.map(lines=>lines.split(",")).map(arrays=>arrays(0),arrays(1),arrays(2),arrays(3),arrays(4)).toDF("first_name","sports","medal_type","year","country")
text.filter(text("medal_type")==="silver" && ("country")==="USA" groupBy("year").count().show
2) What is the difference between === and ==?
When I use filter and select with === with just one condition in it (no && or ||), it shows me the string result and boolean result respectively, but when I use select and filter with ==, an error is thrown.
Using this:
text.filter(text("medal_type")==="silver" && text("country")==="USA").groupBy("year").count().show
+----+-----+
|year|count|
+----+-----+
|2017|    1|
+----+-----+
I will just answer your first question. (Note that there is a typo in "silver" in the first data line.)
About the second question:
== and === are just functions in Scala.
In Spark, === uses the equalTo method, which is the equality test:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#equalTo-java.lang.Object-
// Scala:
df.filter( df("colA") === df("colB") )
// Java
import static org.apache.spark.sql.functions.*;
df.filter( col("colA").equalTo(col("colB")) );
and == uses the equals method, which just tests whether two references are the same object:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#equals-java.lang.Object-
Notice the return types of each function: == (equals) returns a Boolean, while === (equalTo) returns a Column of the results.
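A quick sketch to see the difference in the types (the comments note what each expression gives you):
import org.apache.spark.sql.functions.col
val asColumn: org.apache.spark.sql.Column = col("colA") === col("colB") // a Column expression, usable inside filter/where
val asBoolean: Boolean = col("colA") == col("colB")                     // a plain Boolean comparing the Column objects themselves, not the data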

Count empty values in dataframe column in Spark (Scala)

I'm trying to count empty values in a DataFrame column like this:
df.filter((df(colname) === null) || (df(colname) === "")).count()
colname holds the name of the column. This works fine if the column type is string, but if the column type is integer and there are some nulls, this code always returns 0. Why is this so? How can I change it to make it work?
As mentioned in the question, df.filter((df(colname) === null) || (df(colname) === "")).count() works for String data types, but testing shows that nulls are not handled.
@Psidom's answer handles both null and empty but does not handle NaN.
Checking for .isNaN as well should handle all three cases:
df.filter(df(colName).isNull || df(colName) === "" || df(colName).isNaN).count()
You can use isNull to test the null condition (comparing a column to null with === always evaluates to null rather than true, which is why your count is 0):
val df = Seq((Some("a"), Some(1)), (null, null), (Some(""), Some(2))).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: int]
df.filter(df("A").isNull || df("A") === "").count
// res7: Long = 2
df.filter(df("B").isNull || df("B") === "").count
// res8: Long = 1
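If you need this for more than one column, here is a sketch that counts null-or-empty values for every column of df in a single pass (column names are taken from the DataFrame itself):
import org.apache.spark.sql.functions.{col, count, when}
// count() ignores nulls, so count(when(cond, c)) counts only the rows where cond is true.
val emptyCounts = df.select(
  df.columns.map(c => count(when(col(c).isNull || col(c) === "", c)).alias(c)): _*
)
emptyCounts.show()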

How to join Datasets on multiple columns?

Given two Spark Datasets, A and B, I can do a join on a single column as follows:
a.joinWith(b, $"a.col" === $"b.col", "left")
My question is whether you can do a join using multiple columns. Essentially the equivalent of the following DataFrame API code:
a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")
You can do it exactly the same way as with a DataFrame:
val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS
xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// |          _1|         _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]|       null|
// +------------+-----------+
In Spark < 2.0.0 you can use something like this:
xs.as("xs").joinWith(
ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")
There's another way of joining: chaining one where after another. You first specify a join (and optionally its type) followed by where operator(s), i.e.:
scala> case class A(id: Long, name: String)
defined class A
scala> case class B(id: Long, name: String)
defined class B
scala> val as = Seq(A(0, "zero"), A(1, "one")).toDS
as: org.apache.spark.sql.Dataset[A] = [id: bigint, name: string]
scala> val bs = Seq(B(0, "zero"), B(1, "jeden")).toDS
bs: org.apache.spark.sql.Dataset[B] = [id: bigint, name: string]
scala> as.join(bs).where(as("id") === bs("id")).show
+---+----+---+-----+
| id|name| id| name|
+---+----+---+-----+
|  0|zero|  0| zero|
|  1| one|  1|jeden|
+---+----+---+-----+
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).show
+---+----+---+----+
| id|name| id|name|
+---+----+---+----+
|  0|zero|  0|zero|
+---+----+---+----+
The reason for such a goodie is that the Spark optimizer will join (no pun intended) consecutive wheres into one condition on the join. Use the explain operator to see the underlying logical and physical plans.
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).explain(extended = true)
== Parsed Logical Plan ==
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]
== Analyzed Logical Plan ==
id: bigint, name: string, id: bigint, name: string
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]
== Optimized Logical Plan ==
Join Inner, ((name#31 = name#36) && (id#30L = id#35L))
:- Filter isnotnull(name#31)
:  +- LocalRelation [id#30L, name#31]
+- Filter isnotnull(name#36)
   +- LocalRelation [id#35L, name#36]
== Physical Plan ==
*BroadcastHashJoin [name#31, id#30L], [name#36, id#35L], Inner, BuildRight
:- *Filter isnotnull(name#31)
:  +- LocalTableScan [id#30L, name#31]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false], input[0, bigint, false]))
   +- *Filter isnotnull(name#36)
      +- LocalTableScan [id#35L, name#36]
In Java, the && operator does not work on Columns. The correct way to join based on multiple columns in Spark with Java is as below:
Dataset<Row> datasetRf1 = joinedWithDays.join(
    datasetFreq,
    datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
        .and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
    "inner"
);
The and function works like the && operator.