How to find columns with many nulls in Spark/Scala

I have a dataframe in Spark/Scala which has hundreds of columns. Many of the columns have a lot of null values. I'd like to find the columns that have more than 90% nulls and then drop them from my dataframe. How can I do that in Spark/Scala?

org.apache.spark.sql.functions.array and udf will help.
import spark.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize[(String, String, String, String, String, String, String, String, String, String)](
  Seq(
    ("a", null, null, null, null, null, null, null, null, null), // 90%
    ("b", null, null, null, null, null, null, null, null, ""),   // 80%
    ("c", null, null, null, null, null, null, null, "", "")      // 70%
  )
).toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10")
// count nulls then check the condition
val check_90_null = udf { xs: Seq[String] =>
  xs.count(_ == null) >= (xs.length * 0.9)
}
// put all columns into an array
val columns = array(df.columns.map(col): _*)
// filter out rows that are at least 90% null
df.where(not(check_90_null(columns)))
  .show()
shows
+---+----+----+----+----+----+----+----+----+---+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|
+---+----+----+----+----+----+----+----+----+---+
| b|null|null|null|null|null|null|null|null| |
| c|null|null|null|null|null|null|null| | |
+---+----+----+----+----+----+----+----+----+---+
Note that the row starting with "a" (90% nulls) has been excluded. This approach filters rows by their null ratio; the second answer below shows how to drop columns instead.

Suppose you have a data frame like this:
val df = Seq(
  (Some(1.0), Some(2), Some("a")),
  (null, Some(3), null),
  (Some(2.0), Some(4), Some("b")),
  (null, null, Some("c"))
).toDF("A", "B", "C")
df.show
+----+----+----+
| A| B| C|
+----+----+----+
| 1.0| 2| a|
|null| 3|null|
| 2.0| 4| b|
|null|null| c|
+----+----+----+
Count the nulls using the agg function, then filter the columns based on the null counts and a threshold, set to 1 here:
val null_thresh = 1
// or, to use a percentage threshold instead:
// val null_thresh = df.count() * 0.9
val to_keep = df.columns.filter(
  c => df.agg(
    sum(when(df(c).isNull, 1).otherwise(0)).alias(c)
  ).first().getLong(0) <= null_thresh
)
df.select(to_keep.head, to_keep.tail: _*).show
And you get:
+----+----+
| B| C|
+----+----+
| 2| a|
| 3|null|
| 4| b|
|null| c|
+----+----+
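As a side note, the snippet above runs one aggregation job per column. A minimal sketch (not from the original answer) that collects all null counts in a single pass and then drops the offending columns, assuming the df and null_thresh defined above:
import org.apache.spark.sql.functions.{col, count, when}
// one row containing the null count of every column
val nullCounts = df.select(
  df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*
).first()
// drop columns whose null count exceeds the threshold
val to_drop = df.columns.filter(c => nullCounts.getAs[Long](c) > null_thresh)
df.drop(to_drop: _*).show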

Related

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe of two columns of string type A and B:
val df = (
  spark
    .createDataFrame(
      Seq(
        ("a1", "b1"),
        ("a1", "b2"),
        ("a1", "b2"),
        ("a2", "b3")
      )
    )
).toDF("A", "B")
I create maps between the distinct elements of each column and a set of integers:
val mapColA = (
  df
    .select("A")
    .distinct
    .rdd
    .zipWithIndex
    .collectAsMap
)
val mapColB = (
  df
    .select("B")
    .distinct
    .rdd
    .zipWithIndex
    .collectAsMap
)
Now I want to create new columns in the dataframe by applying those maps to their corresponding columns. For one map only this would be
df.select("A").map(x=>mapColA.get(x)).show()
However, I don't understand how to apply each map to its corresponding column and create two new columns (e.g. with withColumn). The expected result would be
val result = (
  spark
    .createDataFrame(
      Seq(
        ("a1", "b1", 1, 1),
        ("a1", "b2", 1, 2),
        ("a1", "b2", 1, 2),
        ("a2", "b3", 2, 3)
      )
    )
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+
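As shown, zipWithIndex produces arbitrary zero-based ids rather than the 1-based ones in the expected output. If you prefer to keep the collected maps, a hypothetical sketch (not from the answers above) that applies them through UDFs instead of a typed map over the Dataset, assuming mapColA and mapColB from the modified code:
import org.apache.spark.sql.functions.udf
// look up each value in the corresponding (driver-side) map
val lookupA = udf((a: String) => mapColA.get(a))
val lookupB = udf((b: String) => mapColB.get(b))
val df3 = df
  .withColumn("idA", lookupA($"A"))
  .withColumn("idB", lookupB($"B"))
df3.show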

Concatenate list of columns except when any of them is null

I have a dataframe for which I want to add a new column, which is a concatenation of all the items from the columns in listOfFixedColumns using "_". I want to set the new column's value to null if any of the columns in listOfFixedColumns is null.
+---+----+---------+
| a| b| unique_id |
+---+----+---------+
|foo | bar| foo_bar|
|null|bar | null |
|baz |null| null |
|null|null| null |
+---+----+---------+
I tried this which gets me only the concatenated column values
val listOfFixedColumns = List("A", "B", ..) // dynamic list of columns names as strings
df.withColumn("unique_id", concat_ws("_", listOfFixedColumns.map(c => col(c)): _*))
but I am not able to figure out how to take care of the null cases:
+---+----+---------+
| a| b|unique_id|
+---+----+---------+
|foo | bar| foo_bar|
|null|bar | bar |<-- needs a fix
|baz |null| baz |<-- needs a fix
|null|null| null |
+---+----+---------+
Do I have to use UDFs for this? I am a Scala beginner and any help will be appreciated.
You can use the isNull method of the Column class, together with the OR operator, to find out when there is a null column. Then use it in a condition with when:
import org.apache.spark.sql.functions.{col, concat_ws, when}
val df = Seq(
  ("foo", "bar", "foo_bar"),
  (null, "bar", null),
  ("baz", null, null),
  (null, null, null)
).toDF("A", "B", "C")
val listOfFixedColumns = List("A", "B", "C")
val hasNull = listOfFixedColumns
  .map(col(_).isNull)
  .reduce(_ || _)
val concatNonEmpty = concat_ws("_", listOfFixedColumns.map(col): _*)
df.withColumn("unique_id", when(!hasNull, concatNonEmpty).otherwise(null)).show
// +----+----+-------+---------------+
// | A| B| C| unique_id|
// +----+----+-------+---------------+
// | foo| bar|foo_bar|foo_bar_foo_bar|
// |null| bar| null| null|
// | baz|null| null| null|
// |null|null| null| null|
// +----+----+-------+---------------+
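An alternative sketch (not part of the answer above): unlike concat_ws, the plain concat function returns null as soon as any of its inputs is null, so interleaving the columns with literal "_" separators gives the desired behaviour without an explicit null check.
import org.apache.spark.sql.functions.{col, concat, lit}
// columns interleaved with "_" separators, dropping the trailing separator
val parts = listOfFixedColumns.map(col).flatMap(c => Seq(c, lit("_"))).dropRight(1)
df.withColumn("unique_id", concat(parts: _*)).show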

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following:
when col1 = X, store the values of col2 and col3,
and assign those values to the next row where col1 = Y.
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated
Note: Spark 1.6
Here's one approach using Window function with steps as follows:
Add row-identifying column (not needed if there is already one) and combine non-key columns (presumably many of them) into one
Create tmp1 with conditional nulls and tmp2 using last/rowsBetween Window function to back-fill with the last non-null value
Create newcols conditionally from cols and tmp2
Expand newcols back to individual columns using foldLeft
Note that this solution uses a Window function without partitioning, so it may not scale to large datasets.
val df = Seq(
  ("X", "x1", "x2"),
  ("Z", "z1", "z2"),
  ("Y", "", ""),
  ("X", "x3", "x4"),
  ("P", "p1", "p2"),
  ("Q", "q1", "q2"),
  ("Y", "", "")
).toDF("col1", "col2", "col3")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val colList = df.columns.filter(_ != "col1")
val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
struct(colList.map(col): _*).as("cols")
)
val df3 = df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
) )
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
  when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
  (accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A work-around is to revert to use Spark SQL which supports ignore-null in its last() method:
df2.
  withColumn( "tmp1", when($"col1" === "X", $"cols") ).
  createOrReplaceTempView("df2table")
// on Spark 1.6, use registerTempTable("df2table") and sqlContext.sql(...) instead
val df3 = spark.sqlContext.sql("""
  select col1, id, cols, tmp1, last(tmp1, true) over (
    order by id rows between unbounded preceding and current row
  ) as tmp2
  from df2table
""")
Yes, there is a lag function, which requires an ordering:
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame(
  Seq(
    Temp("A", Some("a1"), Some("a2")),
    Temp("D", Some("d1"), Some("d2")),
    Temp("B", Some("b1"), Some("b2")),
    Temp("E", None, None),
    Temp("C", None, None)
  ))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
  .withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
  .withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
  .show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

Is it possible to ignore null values when using LEAD window function in Spark?

My dataframe looks like this:
id value date
1 100 2017
1 null 2016
1 20 2015
1 100 2014
I would like to get the most recent previous value, ignoring nulls:
id value date recent value
1 100 2017 20
1 null 2016 20
1 20 2015 100
1 100 2014 null
Is there any way to ignore null values while using lead window function?
Is it possible to ignore null values when using lead window function in Spark
It is not (though see the Spark 3.2+ answer further below).
I would like to get most recent value but ignoring null
Just use last (or first) with ignoreNulls:
def last(columnName: String, ignoreNulls: Boolean): Column
Aggregate function: returns the last value of the column in a group.
The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = Seq(
  (1, Some(100), 2017), (1, None, 2016), (1, Some(20), 2015),
  (1, Some(100), 2014)
).toDF("id", "value", "date")
df.withColumn(
  "last_value",
  last("value", true).over(Window.partitionBy("id").orderBy("date"))
).show
+---+-----+----+----------+
| id|value|date|last_value|
+---+-----+----+----------+
| 1| 100|2014| 100|
| 1| 20|2015| 20|
| 1| null|2016| 20|
| 1| 100|2017| 100|
+---+-----+----+----------+
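Note that last over the default window frame includes the current row, so non-null rows simply get their own value back. To reproduce the asked-for "most recent previous value", a sketch (an addition, not part of the original answer) that restricts the frame to rows strictly before the current one:
val w = Window.partitionBy("id")
  .orderBy("date")
  .rowsBetween(Window.unboundedPreceding, -1)
df.withColumn("recent_value", last("value", true).over(w)).show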
Spark 3.2+ provides ignoreNulls inside lead and lag in Scala.
lead(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
lag(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
Test input:
import org.apache.spark.sql.expressions.Window
val df = Seq[(Integer, Integer, Integer)](
  (1, 100, 2017),
  (1, null, 2016),
  (1, 20, 2015),
  (1, 100, 2014)
).toDF("id", "value", "date")
lead:
val w = Window.partitionBy("id").orderBy(desc("date"))
val df2 = df.withColumn("lead_val", lead($"value", 1, null, true).over(w))
df2.show()
// +---+-----+----+--------+
// | id|value|date|lead_val|
// +---+-----+----+--------+
// | 1| 100|2017| 20|
// | 1| null|2016| 20|
// | 1| 20|2015| 100|
// | 1| 100|2014| null|
// +---+-----+----+--------+
lag:
val w = Window.partitionBy("id").orderBy("date")
val df2 = df.withColumn("lead_val", lag($"value", 1, null, true).over(w))
df2.show()
// +---+-----+----+--------+
// | id|value|date|lead_val|
// +---+-----+----+--------+
// | 1| 100|2014| null|
// | 1| 20|2015| 100|
// | 1| null|2016| 20|
// | 1| 100|2017| 20|
// +---+-----+----+--------+
You could do it in two steps:
Create a table with non null values
Join on the original table
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = Seq(
  (1, Some(100), 2017),
  (1, None, 2016),
  (1, Some(20), 2015),
  (1, Some(100), 2014)
).toDF("id", "value", "date")
// Step 1
val filledDf = df
  .where($"value".isNotNull)
  .withColumnRenamed("value", "recent_value")
// Step 2
val window: WindowSpec = Window.partitionBy("l.id", "l.date").orderBy($"r.date".desc)
val finalDf = df.as("l")
  .join(filledDf.as("r"), $"l.id" === $"r.id" && $"l.date" > $"r.date", "left")
  .withColumn("rn", row_number().over(window))
  .where($"rn" === 1)
  .select("l.id", "l.date", "value", "recent_value")
finalDf.orderBy($"date".desc).show
+---+----+-----+------------+
| id|date|value|recent_value|
+---+----+-----+------------+
| 1|2017| 100| 20|
| 1|2016| null| 20|
| 1|2015| 20| 100|
| 1|2014| 100| null|
+---+----+-----+------------+

Aggregate multiple columns using methods that can't be called directly from GroupedData class (like "last()") and rename them to original names [duplicate]

This question already has answers here:
Spark SQL: apply aggregate functions to a list of columns
(4 answers)
Closed 5 years ago.
Assuming we have the following DF
scala> df.show
+---+----+----+----+-------------------+---+
| id|name| cnt| amt| dt|scn|
+---+----+----+----+-------------------+---+
| 1|null| 1|1.12|2000-01-02 00:11:11|112|
| 1| aaa| 1|1.11|2000-01-01 00:00:00|111|
| 2| bbb|null|2.22|2000-01-03 12:12:12|201|
| 2|null| 2|1.13| null|200|
| 2|null|null|2.33| null|202|
| 3| ccc| 3|3.34| null|302|
| 3|null|null|3.33| null|301|
| 3|null|null| 0.0|2000-12-31 23:59:59|300|
+---+----+----+----+-------------------+---+
I want to get the following DF: sorted by scn, grouped by id, taking the last non-null value of every column (except id and scn).
It can be done like this:
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.orderBy("scn")
.groupBy("id")
.agg(last("name", true) as "name",
last("cnt", true) as "cnt",
last("amt", true) as "amt",
last("dt", true) as "dt")
.show
// Exiting paste mode, now interpreting.
+---+----+---+----+-------------------+
| id|name|cnt| amt| dt|
+---+----+---+----+-------------------+
| 1| aaa| 1|1.12|2000-01-02 00:11:11|
| 3| ccc| 3|3.34|2000-12-31 23:59:59|
| 2| bbb| 2|2.33|2000-01-03 12:12:12|
+---+----+---+----+-------------------+
In real life I want to process different DFs with a large number of columns.
My question is: how can I specify all columns (except id and scn) in the .agg(last(col_name, true)) programmatically?
Code for generating a source DF:
case class C(id: Integer, name: String, cnt: Integer, amt: Double, dt: String, scn: Integer)
val cc = Seq(
  C(1, null, 1, 1.12, "2000-01-02 00:11:11", 112),
  C(1, "aaa", 1, 1.11, "2000-01-01 00:00:00", 111),
  C(2, "bbb", null, 2.22, "2000-01-03 12:12:12", 201),
  C(2, null, 2, 1.13, null, 200),
  C(2, null, null, 2.33, null, 202),
  C(3, "ccc", 3, 3.34, null, 302),
  C(3, null, null, 3.33, "20001-01-01 00:33:33", 301),
  C(3, null, null, 0.00, "2000-12-31 23:59:59", 300)
)
val t = sc.parallelize(cc, 4).toDF()
val df = t.withColumn("dt", $"dt".cast("timestamp"))
val cols = df.columns.filterNot(_.equals("id"))
Solution similar to this answer, plus renaming columns in the resulting DF to the original ones:
val exprs = df.columns.filterNot(_.equals("id")).map(last(_, true))
val r = df.orderBy("scn").groupBy("id").agg(exprs.head, exprs.tail: _*).toDF(df.columns:_*)
Result:
scala> r.show
+---+----+---+----+-------------------+---+
| id|name|cnt| amt| dt|scn|
+---+----+---+----+-------------------+---+
| 1| aaa| 1|1.12|2000-01-02 00:11:11|112|
| 3| ccc| 3|3.34|2000-12-31 23:59:59|302|
| 2| bbb| 2|2.33|2000-01-03 12:12:12|202|
+---+----+---+----+-------------------+---+
or:
val exprs = df.columns.filterNot(_.equals("id")).map(c=>last(c, true).as(c.toString))
val r = df.orderBy("scn").groupBy("id").agg(exprs.head, exprs.tail: _*)