Count the number of non-null values in a Spark DataFrame - scala

I have a data frame with some columns, and before doing analysis, I'd like to understand how complete the data frame is. So I want to filter the data frame and count for each column the number of non-null values, possibly returning a dataframe back.
Basically, I am trying to achieve the same result as expressed in this question but using Scala instead of Python.
Say you have:
val df = Seq[(Option[Int], Option[Int], Option[Int])](
  (Some(0), Some(4), Some(3)),
  (None, Some(3), Some(4)),
  (None, None, Some(5))
).toDF("x", "y", "z")
How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?

One straightforward option is to use the .describe() function to get a summary of your DataFrame; its count row contains the number of non-null values for each column:
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary| x| y| z|
+-------+---+---+---+
| count| 1| 2| 3|
+-------+---+---+---+
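If you want just the counts, without the summary column (a minimal sketch; note that describe() returns its values as strings, so cast them if you need numbers):
val counts = df.describe().filter($"summary" === "count").drop("summary")
counts.show()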

Although I like Psidom's answer, I'm often more interested in the fraction of null values, because the raw number of non-null values alone doesn't tell you much...
You can do something like:
import org.apache.spark.sql.functions.{sum, when, count}

df.agg(
  (sum(when($"x".isNotNull, 0).otherwise(1)) / count("*")).as("x: fraction null"),
  (sum(when($"y".isNotNull, 0).otherwise(1)) / count("*")).as("y: fraction null"),
  (sum(when($"z".isNotNull, 0).otherwise(1)) / count("*")).as("z: fraction null")
).show()
EDIT: the equivalent non-null count, sum(when($"x".isNotNull, 1).otherwise(0)), can simply be replaced by count($"x"), which only counts non-null values. As I don't find this obvious, I tend to use the sum notation, which I find clearer.
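A sketch of the same idea generalized to every column, so you don't have to spell each one out (assuming the DataFrame is called df):
import org.apache.spark.sql.functions.{col, count, lit}

// count(col) counts only non-null values, count(lit(1)) counts all rows,
// so 1 minus their ratio is the fraction of nulls in each column.
val exprs = df.columns.map { c =>
  (lit(1) - count(col(c)) / count(lit(1))).as(s"$c: fraction null")
}
df.agg(exprs.head, exprs.tail: _*).show()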

Here's how I did it in Scala 2.11, Spark 2.3.1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.agg(
  count("x").divide(count(lit(1)))
    .as("x: percent non-null")
  // ...copy-paste that for columns y and z
).head()
count("x") counts only the rows where x is non-null, while count(lit(1)) counts every row, so dividing the two gives the fraction of non-null values.
If you instead want to count percent null in population, find the complement of our count-based equation:
lit(1).minus(
  count("x").divide(count(lit(1)))
).as("x: percent null")
It's also worth knowing that you can cast null-ness to an integer and then sum it, but it's probably less performant:
// cast null-ness to an integer, then divide by the row count
sum(col("x").isNull.cast(IntegerType))
  .divide(count(lit(1)))
  .as("x: percent null")

Here is the simplest query for a single column. Note that comparing a column to null with === or =!= always evaluates to null and therefore filters out every row, so you have to use isNotNull:
df.filter($"x".isNotNull).count

You can also apply count to every column in a single select:
df.select(df.columns map count: _*)
or, to keep the original column names:
df.select(df.columns map count: _*).toDF(df.columns: _*)

Spark 2.3+
(for string and numeric type columns)
df.summary("count").show()
+-------+---+---+---+
|summary| x| y| z|
+-------+---+---+---+
| count| 1| 2| 3|
+-------+---+---+---+
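summary() also accepts several statistics at once if you want more than the counts, for example:
df.summary("count", "min", "max").show()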

Related

Spark scala selecting multiple columns from a list and single columns

I'm attempting to do a select on a dataframe but I'm having a little bit of trouble.
I have this initial dataframe
+---+-------+-------+-------+-------+
| id|value_a|value_b|value_c|value_d|
+---+-------+-------+-------+-------+
And what I have to do is sum value_a with value_b and keep the others the same. So I have this list
val select_list = List("id", "value_c", "value_d")
and after this I do the select
df.select(select_list.map(col): _*, (col("value_a") + col("value_b")).as("value_b"))
And I'm expecting to get this:
+---+-------+-------+-------+
| id|value_c|value_d|value_b|   <-- this value_b is the sum of value_a and the original value_b
+---+-------+-------+-------+
But I'm getting a "no _* annotation allowed here" error. Keep in mind that in reality I have a lot of columns, so I need to use a list; I can't simply select each column explicitly. I run into this trouble because the new column resulting from the sum has the same name as an existing column, so I can't just do select(col("*"), sum...).drop("value_b") or I'd be dropping both the old column and the new one with the sum.
What is the correct syntax to add multiple and single columns in a single select, or how else can I solve this?
For now I decided to do this:
df.select(col("*"), (col("value_a") + col("value_b")).as("value_b_tmp"))
  .drop("value_a", "value_b")
  .withColumnRenamed("value_b_tmp", "value_b")
This works fine, but I understand that withColumn and withColumnRenamed are expensive because they essentially build a new dataframe with a new or renamed column each time, and I'm looking for the least expensive operation possible.
Thanks in advance!
Simply use the .withColumn function; it will replace the column if it already exists. Note that value_b has to be added to the final select, since select_list does not contain it:
df
  .withColumn("value_b", col("value_a") + col("value_b"))
  .select((select_list :+ "value_b").map(col): _*)
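An alternative sketch that keeps everything in a single select: build one Seq[Column] first and expand only that with : _* (the compile error in the question comes from mixing a varargs expansion with extra arguments):
import org.apache.spark.sql.functions.col

val selectCols = select_list.map(col) :+
  (col("value_a") + col("value_b")).as("value_b")

df.select(selectCols: _*)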
You can also create a new sum column and fold the sum of the n columns into it:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val df: DataFrame = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, 2, 3), Row(1, 2, 3))),
  StructType(List(
    StructField("field1", IntegerType),
    StructField("field2", IntegerType),
    StructField("field3", IntegerType))))

val columnsToSum = df.schema.fieldNames

val withSum = columnsToSum.filter(name => name != "field1")
  .foldLeft(df.withColumn("sum", lit(0)))((df, column) =>
    df.withColumn("sum", col("sum") + col(column)))

withSum.show()
Gives:
+------+------+------+---+
|field1|field2|field3|sum|
+------+------+------+---+
| 1| 2| 3| 5|
| 1| 2| 3| 5|
+------+------+------+---+
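A sketch of an equivalent single-expression variant that avoids chaining one withColumn call per column (same columnsToSum as above):
import org.apache.spark.sql.functions.col

// Build one arithmetic expression (field2 + field3 + ...) and add it in one step.
val sumExpr = columnsToSum
  .filter(_ != "field1")
  .map(col)
  .reduce(_ + _)

df.withColumn("sum", sumExpr).show()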

How would I filter a dataframe by a column's percentile value in Scala Spark

Say I have this dataframe:
val df = Seq(("Mike",1),("Kevin",2),("Bob",3),("Steve",4)).toDF("name","score")
and I want to filter this dataframe so that it only returns rows where the "score" column is greater than or equal to the 75th percentile. How would I do this?
Thanks so much and have a great day!
What you want to base your filter on is the 75th percentile, also known as the upper quartile: 75% of the data lies below this point.
Based on the answer here, you can use spark's approximateQuantile to get what you want:
val q = df.stat.approxQuantile("score", Array(.75), 0)
q: Array[Double] = Array(3.0)
This array (q) gives you the boundary between the 3rd and 4th quartiles.
Using a simple spark filter should get you what you want:
df.filter($"score" >= q.head).show
+-----+-----+
| name|score|
+-----+-----+
| Bob| 3|
|Steve| 4|
+-----+-----+
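Note that the third argument to approxQuantile is the relative error; 0 means an exact (and more expensive) computation. On a large DataFrame a small non-zero error is usually good enough, a sketch:
// relativeError = 0.01 trades a little precision for a much cheaper computation
val Array(q75) = df.stat.approxQuantile("score", Array(0.75), 0.01)
df.filter($"score" >= q75).show()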

How to select the N highest values for each category in spark scala

Say I have this dataset:
val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")
which looks like this:
+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|yankees-redsox|       4|  14|
|  yankees-mets|       6|  17|
|yankees-redsox|       2|  10|
|  yankees-mets|       5|  17|
|yankees-redsox|       5|  10|
+--------------+--------+----+
I want to pivot on the teams column and, for all the other columns, return the 2 (or N) highest values for that column. So for yankees-mets and homeruns it would return 8 and 6, since those were their 2 highest homerun totals.
How would I do this in the general case?
Thanks
Your problem is not really a good fit for pivot, since a pivot means:
A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
You could create an additional rank column with a window function and then select only rows with rank 1 or 2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

main_df
  .withColumn(
    "rank",
    rank().over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
  )
  .where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
  .show
+------------+--------+----+----+
| teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets| 8| 20| 1|
|yankees-mets| 6| 17| 2|
+------------+--------+----+----+
Then, if you no longer need the rank column, you can just drop it.
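A sketch of the general case: keep the top N rows for every team at once by filtering on rank <= N and dropping the helper column (here N = 2). Note that rank() keeps ties, so you may get more than N rows per team; use row_number() instead if you want exactly N:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val n = 2
val byTeam = Window.partitionBy("teams").orderBy($"homeruns".desc)

main_df
  .withColumn("rank", rank().over(byTeam))
  .where($"rank" <= n)
  .drop("rank")
  .show()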

How to select a column based on value of another in Pyspark?

I have a dataframe, where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column which values are taken from other columns from my dataframe, based on processed values from special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How can I get the string value of the concatenation so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws

sdf.withColumn(
    "my_new_column",
    when(col("special_column") == "one", col("one_processed"))
    .otherwise(concat_ws("_", col("special_column"), lit("processed")))
)
The easiest way in your case would be just a simple when/otherwise, like:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1, 2, "two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
| 1| 2| one| 1|
| 1| 2| two| 2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to pick a column by a name computed from the data, since the execution plan would then depend on the data itself.

Spark: Is "count" on Grouped Data a Transformation or an Action?

I know that count called on an RDD or a DataFrame is an action. But while fiddling with the spark shell, I observed the following
scala> val empDF = Seq((1,"James Gordon", 30, "Homicide"),(2,"Harvey Bullock", 35, "Homicide"),(3,"Kristen Kringle", 28, "Records"),(4,"Edward Nygma", 30, "Forensics"),(5,"Leslie Thompkins", 31, "Forensics")).toDF("id", "name", "age", "department")
empDF: org.apache.spark.sql.DataFrame = [id: int, name: string, age: int, department: string]
scala> empDF.show
+---+----------------+---+----------+
| id| name|age|department|
+---+----------------+---+----------+
| 1| James Gordon| 30| Homicide|
| 2| Harvey Bullock| 35| Homicide|
| 3| Kristen Kringle| 28| Records|
| 4| Edward Nygma| 30| Forensics|
| 5|Leslie Thompkins| 31| Forensics|
+---+----------------+---+----------+
scala> empDF.groupBy("department").count //count returned a DataFrame
res1: org.apache.spark.sql.DataFrame = [department: string, count: bigint]
scala> res1.show
+----------+-----+
|department|count|
+----------+-----+
| Homicide| 2|
| Records| 1|
| Forensics| 2|
+----------+-----+
When I called count on GroupedData (empDF.groupBy("department")), I got another DataFrame as the result (res1). This leads me to believe that count in this case was a transformation. It is further supported by the fact that no computation was triggered when I called count; instead, computation started when I ran res1.show.
I haven't been able to find any documentation that suggests count could be a transformation as well. Could someone please shed some light on this?
The .count() you have used in your code is defined on RelationalGroupedDataset; it creates a new column with the count of elements in each group. This is a transformation. Refer:
https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.GroupedDataset
The .count() that you use normally over RDD/DataFrame/Dataset is completely different from the above and this .count() is an Action. Refer: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.rdd.RDD
EDIT:
always use count() inside .agg() when operating on a grouped Dataset, to avoid confusion in the future:
empDF.groupBy($"department").agg(count($"department") as "countDepartment").show
Case 1:
You use rdd.count() to count the number of rows. Since it initiates the DAG execution and returns the data to the driver, it's an action on an RDD.
for ex: rdd.count // it returns a Long value
Case 2:
If you call count on a Dataframe, it initiates the DAG execution and returns the data to the driver, so it's an action on a Dataframe.
for ex: df.count // it returns a Long value
Case 3:
In your case you are calling groupBy on a dataframe, which returns a RelationalGroupedDataset object, and you are calling count on that grouped Dataset, which returns a Dataframe. So it's a transformation, since it neither brings the data to the driver nor initiates DAG execution.
for ex:
df.groupBy("department") // returns RelationalGroupedDataset
.count // returns a Dataframe so a transformation
.count // returns a Long value since called on DF so an action
As you've already figured out, if a method returns a distributed object (Dataset or RDD) it can be qualified as a transformation.
However, these distinctions are much better suited to RDDs than Datasets. The latter feature an optimizer, including the recently added cost-based optimizer, and they can be much less lazy than the old API, blurring the difference between transformation and action in some cases.
Here, however, it is safe to say count is a transformation.
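One way to convince yourself (a small sketch): the grouped count only builds a query plan, which you can inspect without running a job, while calling an action on the result triggers execution:
val grouped = empDF.groupBy("department").count() // transformation: no Spark job runs here
grouped.explain()                                 // prints the plan; still no job
grouped.show()                                    // action: now the job actually runs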