How to create a when condition using a Map in Spark (Scala)

I'm trying to create a when condition like this:
df.withColumn("new_col",
  when($a === 1, "item_a")
    .when($b === 2 && $a === 1, "item_b"))
I created a map of conditions and values:
val condMap = Map(
($a === 1) -> "item_a",
($b === 2 && $a === 1) -> "item_b"
)
And tried:
df
.withColumn("new_col", condMap.map(c => when(c)))
But it's not working. How can I create a when condition using this map?
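One possible approach (a sketch, assuming the keys of condMap are Columns; note that a Map does not guarantee iteration order, so a Seq of pairs is safer if the order of the conditions matters) is to fold the entries into a single chained when expression:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

// condMap: Map[Column, String] as defined above
val (firstCond, firstValue) = condMap.head
val newCol: Column = condMap.tail.foldLeft(when(firstCond, firstValue)) {
  case (acc, (cond, value)) => acc.when(cond, value)
}
df.withColumn("new_col", newCol)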

Related

Multiple filter conditions in Scala with in and not in clauses

I am trying to do a filter similar to the below using Scala:
where col1 = 'abc'
and col2 not in (0,4)
and col3 in (1,2,3,4)
I tried writing something like this
val finalDf: DataFrame =
initDf.filter(col("col1") ="abc")
.filter(col("col2") <> 0)
.filter(col("col2") <> 4)
.filter(col("col3") = 1 ||col("col3") = 2 ||col("col3") = 3 ||col("col3") = 4)
or
val finalDf: DataFrame =
initDf.filter(col("col1") ="abc")
&& col("col2") != 0 && col("col2") != 4
&& (col("col3") = 1
|| col("col3") = 2
|| col("col3") = 3
|| col("col3") = 4))
Neither seems to be working. Can anyone help me with this?
For Column, the operators are a little different:
For equality use ===
For inequality use =!=
If you want to use literals you can use the lit function.
Your example may look like this:
dfMain.filter(col("col1") === lit("abc"))
.filter(col("col2") =!= lit(0))
.filter(col("col2") =!= lit(4))
.filter(col("col3") === lit(1) || col("col3") === lit(2) ||col("col3") === lit(3) ||col("col3") === lit(4))
You can also use isin instead of this filter with multiple ors.
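For example, the in / not in parts could be written with isin (a sketch reusing the column names from the question):
dfMain.filter(col("col1") === lit("abc"))
  .filter(!col("col2").isin(0, 4))
  .filter(col("col3").isin(1, 2, 3, 4))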
If you want to find out more about operators for Columns you can read these:
Medium blog post part1
Medium blog post part2

Spark DataFrame Get Null Count For All Columns

I have a DataFrame for which I would like to get the total null value count, and I have the following that does this generically on all the columns:
First, my DataFrame, which just contains one column (for simplicity):
val recVacDate = dfRaw.select("STATE")
When I print using a simple filter, I get to see the following:
val filtered = recVacDate.filter("STATE is null")
println(filtered.count()) // Prints 94051
But when I use this code below, I get just 1 as a result and I do not understand why?
val nullCount = recVacDate.select(recVacDate.columns.map(c => count(col(c).isNull || col(c) === "" || col(c).isNaN).alias(c)): _*)
println(nullCount.count()) // Prints 1
Any ideas as to what is wrong with the nullCount? The DataType of the column is a String.
This kind of fixed it:
df.select(df.columns.map(c => count(when(col(c).isNull || col(c) === "" || col(c).isNaN, c)).alias(c)): _*)
Notice the use of the when clause inside the count.
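The likely reason nullCount.count() printed 1 is that the select with aggregate expressions produces a single row (one count per column), so counting its rows gives 1; showing that row reveals the per-column counts. A minimal sketch (isNaN left out here, since it only applies to numeric columns):
import org.apache.spark.sql.functions.{col, count, when}

val nullCounts = dfRaw.select(
  dfRaw.columns.map(c => count(when(col(c).isNull || col(c) === "", c)).alias(c)): _*)
nullCounts.show()  // one row, one column per original column, each holding its null/empty count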

Spark specify multiple logical condition in where clause of spark dataframe

While defining multiple logical/relational conditions on a Spark Scala DataFrame, I am getting the error mentioned below, but the same thing works fine in Python (PySpark).
Python code:
df2=df1.where(((col('a')==col('b')) & (abs(col('c')) <= 1))
| ((col('a')==col('fin')) & ((col('b') <= 3) & (col('c') > 1)) & (col('d') <= 500))
| ((col('a')==col('b')) & ((col('c') <= 15) & (col('c') > 3)) & (col('d') <= 200))
| ((col('a')==col('b')) & ((col('c') <= 30) & (col('c') > 15)) & (col('c') <= 100)))
Tried for scala equivalent:
val df_aqua_xentry_dtb_match=df_aqua_xentry.where((col("a") eq col("b")) & (abs(col("c") ) <= 1))
notebook:2: error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
val df_aqua_xentry_dtb_match=df_aqua_xentry.where((col("a") eq col("b")) & (abs(col("c") ) <= 1))
How do I define multiple logical conditions on a Spark DataFrame using Scala?
eq returns a Boolean, <= returns a Column. They are incompatible.
You probably want this:
df.where((col("a") === col("b")) && (abs(col("c") ) <= 1))
=== is used for equality between columns and returns a Column, so we can use && to combine multiple conditions in the same where.
With Spark you should use
=== instead of == or eq (see explanation)
&& instead of & (&& is logical AND, & is bitwise AND)
val df_aqua_xentry_dtb_match = df_aqua_xentry.where((col("a") === col("b")) && (abs(col("c") ) <= 1))
Please see the below solution.
df.where("StudentId == 1").explain(true)
== Parsed Logical Plan ==
'Filter ('StudentId = 1)
+- Project [_1#3 AS StudentId#7, _2#4 AS SubjectName#8, _3#5 AS Marks#9]
+- LocalRelation [_1#3, _2#4, _3#5]
== Analyzed Logical Plan ==
StudentId: int, SubjectName: string, Marks: int
Filter (StudentId#7 = 1)
+- Project [_1#3 AS StudentId#7, _2#4 AS SubjectName#8, _3#5 AS Marks#9]
+- LocalRelation [_1#3, _2#4, _3#5]
== Optimized Logical Plan ==
LocalRelation [StudentId#7, SubjectName#8, Marks#9]
Here we used a where clause; internally the optimizer converts it to a filter operation even though where is what appears at the code level.
So we can also apply a filter function on the rows of the DataFrame, like below:
df.filter(row => row.getString(1) == "A" && row.getInt(0) == 1).show()
Here 0 and 1 are column indexes of the DataFrame. In my case the schema is (StudentId(Int), SubjectName(String), Marks(Int)).
There are a few issues with your Scala version of the code.
"eq" is reference equality in Scala (it desugars to Java's ==), so
when you try to compare two Columns using "eq", it returns a Boolean
instead of a Column. Here you should use the "===" operator for Column comparison.
String comparison
scala> "praveen" eq "praveen"
res54: Boolean = true
scala> "praveen" eq "nag"
res55: Boolean = false
scala> lit(1) eq lit(2)
res56: Boolean = false
scala> lit(1) eq lit(1)
res57: Boolean = false
Column comparison
scala> lit(1) === lit(2)
res58: org.apache.spark.sql.Column = (1 = 2)
scala> lit(1) === lit(1)
19/08/02 14:00:40 WARN Column: Constructing trivially true equals predicate, '1 = 1'. Perhaps you need to use aliases.
res59: org.apache.spark.sql.Column = (1 = 1)
You are using a "bitwise AND" operator instead of the "and"/"&&" operator for the Column type. This is the reason you were getting the above error (as it was expecting a Boolean instead of a Column).
scala> df.show
+---+---+
| id|id1|
+---+---+
| 1| 2|
+---+---+
scala> df.where((col("id") === col("id1")) && (abs(col("id")) > 2)).show
+---+---+
| id|id1|
+---+---+
+---+---+
scala> df.where((col("id") === col("id1")) and (abs(col("id")) > 2)).show
+---+---+
| id|id1|
+---+---+
+---+---+
Hope this helps!

Spark/Scala repeated creation of DataFrames using the same function on different data subsets

My current code repeatedly creates new DataFrames (df_1, df_2, df_3) using the same function, but applied to different subsets of the original DataFrame df (e.g. where("category == 1")).
I would like to create a function that can automate the creation of these DataFrames.
In the following example, my DataFrame df has three columns: "category", "id", and "amount". Assume I have 10 categories. I want to sum the values of the column "amount" as well as count the number of occurrences, for each category:
val df_1 = df.where("category == 1")
.groupBy("id")
.agg(sum(when(col("amount") > 0,(col("amount")))).alias("total_incoming_cat_1"),
count(when(col("amount") < 0, (col("amount")))).alias("total_outgoing_cat_1"))
val df_2 = df.where("category == 2")
.groupBy("id")
.agg(sum(when(col("amount") > 0,(col("amount")))).alias("total_incoming_cat_2"),
count(when(col("amount") < 0, (col("amount")))).alias("total_outgoing_cat_2"))
val df_3 = df.where("category == 3")
.groupBy("id")
.agg(sum(when(col("amount") > 0, (col("amount")))).alias("total_incoming_cat_3"),
count(when(col("amount") < 0, (col("amount")))).alias("total_outgoing_cat_3"))
I would like something like this:
def new_dfs(L:List, df:DataFrame): DataFrame={
for l in L{
val df_+l df.filter($amount == l)
.groupBy("id")
.agg(sum(when(col("amount") > 0, (col("amount")))).alias("total_incoming_cat_"+l),
count(when(col("amount") < 0, (col("amount")))).alias("total_outgoing_cat_"+l))
}
}
Is it not better to group by category and id?
df
.groupBy("category","id")
.agg(sum(when(col("amount") > 0,(col("amount")))).alias("total_incoming_cat"),
count(when(col("amount") < 0, (col("amount")))).alias("total_outgoing_cat"))

Count empty values in dataframe column in Spark (Scala)

I'm trying to count empty values in a DataFrame column like this:
df.filter((df(colname) === null) || (df(colname) === "")).count()
colname holds the name of the column. This works fine if the column type is string, but if the column type is integer and there are some nulls, this code always returns 0. Why is this so? How can I change it to make it work?
As mentioned in the question, df.filter((df(colname) === null) || (df(colname) === "")).count() works for String data types, but testing shows that nulls are not handled.
@Psidom's answer handles both null and empty but does not handle NaN.
Checking for .isNaN should handle all three cases:
df.filter(df(colName).isNull || df(colName) === "" || df(colName).isNaN).count()
You can use isNull to test the null condition:
val df = Seq((Some("a"), Some(1)), (null, null), (Some(""), Some(2))).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: int]
df.filter(df("A").isNull || df("A") === "").count
// res7: Long = 2
df.filter(df("B").isNull || df("B") === "").count
// res8: Long = 1