PySpark DataFrames: Does when(cond, value) always evaluate value?

I am trying to conditionally apply a UDF some_function() to column b1, based on the value in a1 (and otherwise leave b1 unchanged), using pyspark.sql.functions.when(condition, value) and a simple UDF:
some_function = udf(lambda x: x.translate(...))
df = df.withColumn('c1',when(df.a1 == 1, some_function(df.b1)).otherwise(df.b1))
With this example data:
+---+------+
| a1|    b1|
+---+------+
|  1|'text'|
|  2|  null|
+---+------+
I am seeing that some_function() is evaluated for every row, regardless of the condition (i.e. the UDF calls translate() on null and crashes); its result is only used where the condition is true. To clarify, this is not about UDFs handling null correctly, but about when(...) always evaluating value when value is a UDF.
Is this behaviour intended? If so, how can I apply a method conditionally so it doesn't get executed if condition is not met?
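The PySpark documentation for udf notes that user-defined functions do not support conditional expressions or short-circuiting, and suggests moving the condition into the function when it can fail on some rows. A minimal sketch of that workaround follows, in plain Python so it runs without a Spark session. The translation table is a made-up stand-in, because the question elides the real arguments to translate(), and wrapping the function with udf(...) is assumed to happen as in the question.

```python
from typing import Optional

# Hypothetical translation table: the question elides the real arguments
# to translate(), so this one is only a stand-in.
TABLE = str.maketrans({"t": "T"})


def some_function(x: Optional[str]) -> Optional[str]:
    """Null-aware version of the UDF body."""
    # Return None for a null input instead of calling a method on it.
    if x is None:
        return None
    return x.translate(TABLE)


print(some_function("text"))  # TexT
print(some_function(None))    # None
```

With a null-aware body, the when(...).otherwise(...) expression from the question can stay as it is: the UDF may still run on every row, but it no longer crashes on null.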

Related

pyspark join 2 columns if condition is met, and insert string into the result

I have a pyspark dataframe like this:
+-------+-------+--------------------+
|s_field|s_check|            t_filter|
+-------+-------+--------------------+
|  MANDT|   true|                 !=E|
|  WERKS|   true|0010_0020_0021_00...|
+-------+-------+--------------------+
As a first step, I split t_filter on _ with f.split(f.col("t_filter"), "_"):
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.show(truncate=False)
+-------+-------+--------------------+-------------------------+
|s_field|s_check|t_filter            |t_filter_1               |
+-------+-------+--------------------+-------------------------+
|MANDT  |true   |070_70              |[!= E]                   |
|WERKS  |true   |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|
+-------+-------+--------------------+-------------------------+
What I want to achieve is to create a new column, using s_field and t_filter as the input while doing a logic check for !=.
Ultimate aim:
+------------------------------+
|t_filter_2                    |
+------------------------------+
|MANDT != 'E'                  |
|WERKS in ('0010', '0020', ...)|
+------------------------------+
I have tried using withColumn, but I keep getting the error "col must be Column".
I am also not sure what the proper approach is to achieve this.
Note: there are a large number of rows (around 10k). I understand that using a UDF would be quite expensive, so I'm interested to know whether there are other ways to do this.
You can achieve this using withColumn with conditional evaluation via the when and otherwise functions. Following your example, the logic is: if t_filter contains !=, concatenate s_field and t_filter; otherwise, convert the t_filter_1 array to a string with ',' as the separator, then concatenate it with s_field along with literals for in and ().
from pyspark.sql import functions as f
filters.withColumn(
    "t_filter_2",
    f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
        f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
    ),
)
Output
+-------+-------+--------------------+-------------------------+---------------------------------------+
|s_check|s_field|t_filter |t_filter_1 |t_filter_2 |
+-------+-------+--------------------+-------------------------+---------------------------------------+
|true |MANDT |!='E' |[!='E'] |MANDT!='E' |
|true |WERKS |0010_0020_0021_00...|[0010, 0020, 0021, 00...]|WERKS in ('0010','0020','0021','00...')|
+-------+-------+--------------------+-------------------------+---------------------------------------+
Complete Working Example
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
# Reuse the active session if there is one (e.g. in a notebook or shell)
spark = SparkSession.builder.getOrCreate()
filters_data = [
    {"s_field": "MANDT", "s_check": True, "t_filter": "!='E'"},
    {"s_field": "WERKS", "s_check": True, "t_filter": "0010_0020_0021_00..."},
]
filters = spark.createDataFrame(filters_data)
# Split the underscore-separated values into an array column
filters = filters.withColumn("t_filter_1", f.split(f.col("t_filter"), "_"))
filters.withColumn(
    "t_filter_2",
    f.when(f.col("t_filter").contains("!="), f.concat("s_field", "t_filter")).otherwise(
        f.concat(f.col("s_field"), f.lit(" in ('"), f.concat_ws("','", "t_filter_1"), f.lit("')"))
    ),
).show(200, False)
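To sanity-check the expected strings without a Spark session, the same branching can be mirrored in plain Python. This is only a model of the column expression above, not Spark code, and build_filter is a hypothetical helper name.

```python
def build_filter(s_field: str, t_filter: str) -> str:
    """Mirror of the when/otherwise expression, applied to a single row."""
    if "!=" in t_filter:
        # Same as f.concat("s_field", "t_filter")
        return s_field + t_filter
    # Same as f.split(..., "_") followed by f.concat_ws("','", ...)
    values = t_filter.split("_")
    return s_field + " in ('" + "','".join(values) + "')"


print(build_filter("MANDT", "!='E'"))           # MANDT!='E'
print(build_filter("WERKS", "0010_0020_0021"))  # WERKS in ('0010','0020','0021')
```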

Scala/Spark: Checking for null elements in an array column but IntelliJ suggests not to use null?

I have a column called responseTimes which is of arrayType:
ArrayType(IntegerType,true)
I'm trying to add another column to count the number of null or not-set values in this array:
val contains_null = udf((xs: Seq[Integer]) => xs.contains(null))
df.withColumn("totalNulls", when(contains_null(col("responseTimes")), lit(1)).otherwise(0))
Although this gives me the right output, IntelliJ keeps telling me to avoid the use of null in my UDF which makes me think this is bad. Is there any other way to do it? Also, is it possible without using UDFs?
The reason is simple: it comes from the conventions for Spark UDFs, because Spark handles null differently from plain Scala code. I don't know whether you are aware of the array_contains built-in function in Spark SQL.
If UDFs are needed, follow these rules:
Scala code should deal with null values gracefully and shouldn’t error out if there are null values.
Scala code should return None (or null) for values that are unknown, missing, or irrelevant. DataFrames should also use null for values that are unknown, missing, or irrelevant.
Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.
Please refer to this link if you would like to read more: https://mungingdata.com/apache-spark/dealing-with-null/
You can rewrite your UDF to use Option. In Scala, Option(null) gives None, so you can do:
val contains_null = udf((xs: Seq[Integer]) => xs.exists(e => Option(e).isEmpty))
However, if you are using Spark 2.4+, it is more suitable to use Spark built-in functions for this. To check if an array column contains null elements, use exists as suggested by @mck's answer.
If you want to get the count of nulls in the array, you can combine the filter and size functions:
df.withColumn("totalNulls", size(expr("filter(responseTimes, x -> x is null)")))
A better way is probably to use the higher-order function exists to check isnull for each array element:
// sample dataframe
val df = spark.sql("select array(1,null,2) responseTimes union all select array(3,4)")
df.show
+-------------+
|responseTimes|
+-------------+
| [1,, 2]|
| [3, 4]|
+-------------+
// check whether any null elements exist in the array
val df2 = df.withColumn("totalNulls", expr("int(exists(responseTimes, x -> isnull(x)))"))
df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
| [1,, 2]| 1|
| [3, 4]| 0|
+-------------+----------+
You can also use array_max together with transform:
val df2 = df.withColumn("totalNulls", expr("int(array_max(transform(responseTimes, x -> isnull(x))))"))
df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
| [1,, 2]| 1|
| [3, 4]| 0|
+-------------+----------+
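The snippets above answer two slightly different questions: filter combined with size yields the number of null elements, while exists (like the original UDF) yields a 0/1 flag. The difference is modeled below on plain Python lists, with None standing in for null. This is not Spark code, and count_nulls and has_null are hypothetical helper names.

```python
from typing import List, Optional


def count_nulls(xs: List[Optional[int]]) -> int:
    """Model of size(filter(xs, x -> x is null)): how many elements are null."""
    return sum(1 for x in xs if x is None)


def has_null(xs: List[Optional[int]]) -> int:
    """Model of int(exists(xs, x -> isnull(x))): 1 if any element is null, else 0."""
    return int(any(x is None for x in xs))


rows = [[1, None, 2], [3, 4], [None, None]]
for xs in rows:
    print(xs, count_nulls(xs), has_null(xs))
```

The two only disagree when an array holds more than one null, as in the last row.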

unable to convert null value to 0

I'm working with Databricks and I don't understand why I'm not able to convert null values to 0 in what seems like a regular integer column.
I've tried these two options:
@udf(IntegerType())
def null_to_zero(x):
    """
    Helper function to transform Null values to zeros
    """
    return 0 if x == 'null' else x
and later:
.withColumn("col_test", null_to_zero(col("col")))
and everything is returned as null.
The second option simply doesn't have any impact:
.na.fill(value=0, subset=["col"])
What am I missing here? Is this a specific behavior of null values with Databricks?
The nulls are represented as None, not as the string 'null'. For your case it's better to use the coalesce function instead, like this (example based on the docs):
from pyspark.sql.functions import coalesce, lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.withColumn("col_test", coalesce(cDf["a"], lit(0.0))).show()
will give you desired behavior:
+----+----+--------+
| a| b|col_test|
+----+----+--------+
|null|null| 0.0|
| 1|null| 1.0|
|null| 2| 0.0|
+----+----+--------+
If you need more complex logic, then you can use when/otherwise, with a condition on null:
from pyspark.sql.functions import when
cDf.withColumn("col_test", when(cDf["a"].isNull(), lit(0.0)).otherwise(cDf["a"])).show()
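If a UDF is still preferred, the helper has to test for None, since that is how a SQL null reaches Python. A minimal sketch of just the Python function follows; wrapping it with udf(IntegerType()) as in the question is assumed and not shown, so the snippet runs without a Spark session.

```python
from typing import Optional


def null_to_zero(x: Optional[int]) -> int:
    """Return 0 for a missing value, otherwise the value itself."""
    # A SQL null arrives in Python as None, never as the string 'null'.
    return 0 if x is None else x


print(null_to_zero(None))  # 0
print(null_to_zero(7))     # 7
```

The built-in coalesce shown above remains the simpler choice, because it avoids the Python round trip entirely.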

Why doesn't a '==' comparison on a column work in Spark SQL?

I have a simple spark statement but it seems to return false contrary to expected result of true:
spark.sql("SELECT 1 AS a").withColumn("b", lit($"a" == 1)).show
+---+-----+
| a| b|
+---+-----+
| 1|false|
+---+-----+
I've tried $"a" == lit(1) and $"a".equals(1) etc. but all return false.
A statement of $"a" >= 1 returns true so why not $"a" == 1?
Spark's Column class defines the === operator as its equality test: it returns a new Column holding the result of comparing the column values row by row. The equalTo method does the same thing and exists mainly for Java callers. The == operator, by contrast, is Scala's ordinary equality and calls the equals method, which compares the Column object itself with the other operand and returns a plain Boolean. So $"a" == 1 is evaluated once on the driver as false, and lit(false) becomes a constant column. Operators such as >= are defined on Column and return a Column, which is why $"a" >= 1 works as expected. Have a look at the Spark API docs for these methods in the Column class:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Column.html#equalTo-java.lang.Object-
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/Column.html#equals-java.lang.Object-
Using a triple '=' solved my problem.
spark.sql("SELECT 1 AS a").withColumn("b", $"a" === 1).show
+---+----+
| a| b|
+---+----+
| 1|true|
+---+----+

Spark: Is "count" on Grouped Data a Transformation or an Action?

I know that count called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following
scala> val empDF = Seq((1,"James Gordon", 30, "Homicide"),(2,"Harvey Bullock", 35, "Homicide"),(3,"Kristen Kringle", 28, "Records"),(4,"Edward Nygma", 30, "Forensics"),(5,"Leslie Thompkins", 31, "Forensics")).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]
scala> empDF.show
+---+----------------+---+----------+
| id| name|age|department|
+---+----------------+---+----------+
| 1| James Gordon| 30| Homicide|
| 2| Harvey Bullock| 35| Homicide|
| 3| Kristen Kringle| 28| Records|
| 4| Edward Nygma| 30| Forensics|
| 5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+
scala> empDF.groupBy("department").count //count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]
scala> res1.show
+----------+-----+
|department|count|
+----------+-----+
| Homicide| 2|
| Records| 1|
| Forensics| 2|
+----------+-----+
When I called count on GroupedData (empDF.groupBy("department")), I got another DataFrame as the result (res1). This leads me to believe that count in this case was a transformation. It is further supported by the fact that no computations were triggered when I called count, instead, they started when I ran res1.show.
I haven't been able to find any documentation that suggests count could be a transformation as well. Could someone please shed some light on this?
The .count() that you have used in your code is called on a RelationalGroupedDataset, and it creates a new column with the count of elements in each group. This is a transformation. Refer:
https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset
The .count() that you normally use on an RDD/DataFrame/Dataset is completely different from the above, and that .count() is an action. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD
EDIT:
Always use count() inside .agg() when operating on a grouped Dataset, in order to avoid confusion in the future:
empDF.groupBy($"department").agg(count($"department") as "countDepartment").show
Case 1:
You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the data to the driver, it's an action for an RDD.
for ex: rdd.count // it returns a Long value
Case 2:
If you call count on a DataFrame, it initiates the DAG execution and returns the data to the driver, so it's an action for a DataFrame.
for ex: df.count // it returns a Long value
Case 3:
In your case you are calling groupBy on a DataFrame, which returns a RelationalGroupedDataset object, and then calling count on the grouped Dataset, which returns a DataFrame. So it's a transformation: it neither brings the data to the driver nor initiates the DAG execution.
for ex:
df.groupBy("department") // returns RelationalGroupedDataset
  .count // returns a DataFrame, so a transformation
  .count // returns a Long value since it is called on a DataFrame, so an action
As you've already figured out, if a method returns a distributed object (Dataset or RDD), it can be classified as a transformation.
However, these distinctions are much better suited to RDDs than to Datasets. The latter feature an optimizer, including the recently added cost-based optimizer, and might be much less lazy than the old API, blurring the difference between transformations and actions in some cases.
Here however it is safe to say count is a transformation.