Is it possible to ignore null values when using the LEAD window function in Spark? (Scala)

My dataframe looks like this:
id  value  date
1   100    2017
1   null   2016
1   20     2015
1   100    2014
I would like to get the most recent previous value, but ignoring nulls:
id  value  date  recent value
1   100    2017  20
1   null   2016  20
1   20     2015  100
1   100    2014  null
Is there any way to ignore null values while using the lead window function?

Is it possible to ignore null values when using lead window function in Spark
It is not (prior to Spark 3.2, lead has no option to ignore nulls; see the answer below for newer versions).
I would like to get the most recent value but ignoring null
Just use last (or first) with ignoreNulls:
def last(columnName: String, ignoreNulls: Boolean): Column
Aggregate function: returns the last value of the column in a group.
The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._ // for toDF and $"..."

val df = Seq(
  (1, Some(100), 2017), (1, None, 2016), (1, Some(20), 2015),
  (1, Some(100), 2014)
).toDF("id", "value", "date")

df.withColumn(
  "last_value",
  last("value", ignoreNulls = true).over(Window.partitionBy("id").orderBy("date"))
).show
+---+-----+----+----------+
| id|value|date|last_value|
+---+-----+----+----------+
| 1| 100|2014| 100|
| 1| 20|2015| 20|
| 1| null|2016| 20|
| 1| 100|2017| 100|
+---+-----+----+----------+
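If you need exactly the "recent value" column from the question (the previous non-null value, excluding the current row), the same last/ignoreNulls idea can be combined with an explicit window frame. A minimal sketch, assuming the df defined above:

// frame covers all rows strictly before the current one (ordered by date)
val w = Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, -1)
df.withColumn("recent_value", last("value", ignoreNulls = true).over(w))
  .orderBy($"date".desc)
  .show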

Spark 3.2+ provides ignoreNulls inside lead and lag in Scala.
lead(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
lag(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
Test input:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = Seq[(Integer, Integer, Integer)](
  (1, 100, 2017),
  (1, null, 2016),
  (1, 20, 2015),
  (1, 100, 2014)
).toDF("id", "value", "date")
lead:
val w = Window.partitionBy("id").orderBy(desc("date"))
val df2 = df.withColumn("lead_val", lead($"value", 1, null, true).over(w))
df2.show()
// +---+-----+----+--------+
// | id|value|date|lead_val|
// +---+-----+----+--------+
// | 1| 100|2017| 20|
// | 1| null|2016| 20|
// | 1| 20|2015| 100|
// | 1| 100|2014| null|
// +---+-----+----+--------+
lag:
val w = Window.partitionBy("id").orderBy("date")
val df2 = df.withColumn("lag_val", lag($"value", 1, null, true).over(w))
df2.show()
// +---+-----+----+-------+
// | id|value|date|lag_val|
// +---+-----+----+-------+
// |  1|  100|2014|   null|
// |  1|   20|2015|    100|
// |  1| null|2016|     20|
// |  1|  100|2017|     20|
// +---+-----+----+-------+
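For completeness, recent Spark versions also appear to accept an IGNORE NULLS clause in plain SQL for these window functions (worth verifying on your version). A sketch, reusing the df above and a hypothetical view name t:

df.createOrReplaceTempView("t")
spark.sql("""
  SELECT id, value, date,
         lead(value, 1) IGNORE NULLS OVER (PARTITION BY id ORDER BY date DESC) AS lead_val
  FROM t
""").show()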

You could do it in two steps:
1. Create a table with only the non-null values
2. Join it back onto the original table
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._

val df = Seq(
  (1, Some(100), 2017),
  (1, None, 2016),
  (1, Some(20), 2015),
  (1, Some(100), 2014)
).toDF("id", "value", "date")

// Step 1: keep only the non-null values
val filledDf = df
  .where($"value".isNotNull)
  .withColumnRenamed("value", "recent_value")

// Step 2: for each row, join all strictly older non-null rows and keep the most recent one
val window: WindowSpec = Window.partitionBy("l.id", "l.date").orderBy($"r.date".desc)
val finalDf = df.as("l")
  .join(filledDf.as("r"), $"l.id" === $"r.id" && $"l.date" > $"r.date", "left")
  .withColumn("rn", row_number().over(window))
  .where($"rn" === 1)
  .select("l.id", "l.date", "value", "recent_value")

finalDf.orderBy($"date".desc).show
+---+----+-----+------------+
| id|date|value|recent_value|
+---+----+-----+------------+
| 1|2017| 100| 20|
| 1|2016| null| 20|
| 1|2015| 20| 100|
| 1|2014| 100| null|
+---+----+-----+------------+

Related

create a simple DF after reading a parquet file

I am a new Scala developer and I ran into some problems writing simple code in Spark Scala. I have this DF that I get after reading a parquet file:
ID  Timestamp
1   0
1   10
1   11
2   20
3   15
What I want is to create a result DF from the first one (for example, if the ID = 2, the timestamp should be multiplied by two). So I created a new class:
case class OutputData(id: bigint, timestamp:bigint)
And here is my code :
val tmp = spark.read.parquet("/user/test.parquet").select("id", "timestamp")
val outputData:OutputData = tmp.map(x:Row => {
var time_result
if (x.getString("id") == 2) {
time_result = x.getInt(2)* 2
}
if (x.getString("id") == 1) {
time_result = x.getInt(2) + 10
}
OutputData2(x.id, time_result)
})
case class OutputData2(id: bigint, timestamp:bigint)
Can you help me, please?
To make the implementation easier, you can cast your df to a Dataset using a case class, then process that Dataset with object notation instead of accessing the Row fields each time you want the value of some element. Also, since your input and output have the same format, you can use the same case class instead of defining two.
Code looks like:
import spark.implicits._

// Sample input data
val df = Seq(
  (1, 0L),
  (1, 10L),
  (1, 11L),
  (2, 20L),
  (3, 15L)
).toDF("ID", "Timestamp")
df.show()

// Case class as helper
case class OutputData(ID: Int, Timestamp: Long)

val newDF = df.as[OutputData].map(record => {
  // identify your id and apply logic based on that
  val newTime = if (record.ID == 2) record.Timestamp * 2 else record.Timestamp
  OutputData(record.ID, newTime) // return the same format with updated values
})
newDF.show()
The output of above code:
// original
+---+---------+
| ID|Timestamp|
+---+---------+
| 1| 0|
| 1| 10|
| 1| 11|
| 2| 20|
| 3| 15|
+---+---------+
// new one
+---+---------+
| ID|Timestamp|
+---+---------+
| 1| 0|
| 1| 10|
| 1| 11|
| 2| 40|
| 3| 15|
+---+---------+
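For comparison, the same conditional update can also be done with column functions only, without a case class or a typed map. A minimal sketch, assuming the same df as above:

import org.apache.spark.sql.functions._

// multiply Timestamp by two only where ID == 2, leave other rows untouched
val newDF2 = df.withColumn(
  "Timestamp",
  when($"ID" === 2, $"Timestamp" * 2).otherwise($"Timestamp")
)
newDF2.show()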

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following:
when col1 = X, store the values of col2 and col3,
and assign those column values to the next row where col1 = Y.
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated
Note: Spark 1.6
Here's one approach using a Window function, with the following steps:
1. Add a row-identifying column (not needed if there already is one) and combine the non-key columns (presumably many of them) into a single struct column
2. Create tmp1 with conditional nulls and tmp2 using the last/rowsBetween Window function to back-fill with the last non-null value
3. Create newcols conditionally from cols and tmp2
4. Expand newcols back into individual columns using foldLeft
Note that this solution uses a Window function without partitioning, thus it may not work well for a large dataset.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = Seq(
  ("X", "x1", "x2"),
  ("Z", "z1", "z2"),
  ("Y", "", ""),
  ("X", "x3", "x4"),
  ("P", "p1", "p2"),
  ("Q", "q1", "q2"),
  ("Y", "", "")
).toDF("col1", "col2", "col3")

val colList = df.columns.filter(_ != "col1")

val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
  struct(colList.map(col): _*).as("cols")
)

val df3 = df2.
  withColumn( "tmp1", when($"col1" === "X", $"cols") ).
  withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
    Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
  ) )

df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A work-around is to fall back to Spark SQL, which does support ignoring nulls in its last() method:
df2.
  withColumn( "tmp1", when($"col1" === "X", $"cols") ).
  registerTempTable("df2table")
  // on Spark 2.x use createOrReplaceTempView("df2table") instead

val df3 = sqlContext.sql("""
  select col1, id, cols, tmp1, last(tmp1, true) over (
    order by id rows between unbounded preceding and current row
  ) as tmp2
  from df2table
""")
Yes, there is a lag function that requires ordering
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
import ss.implicits._ // ss is the SparkSession; needed for the $"..." syntax

case class Temp(a: String, b: Option[String], c: Option[String])

val input = ss.createDataFrame(
  Seq(
    Temp("A", Some("a1"), Some("a2")),
    Temp("D", Some("d1"), Some("d2")),
    Temp("B", Some("b1"), Some("b2")),
    Temp("E", None, None),
    Temp("C", None, None)
  ))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+
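One caveat: lag($"b", 1) only reaches back a single row, so two consecutive nulls would leave the second one unfilled. If that can happen, a more general fill is last with ignoreNulls over an unbounded-preceding frame; a sketch, assuming the same input as above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// carry the last non-null value forward, however many nulls appear in a row
val w = Window.orderBy($"a").rowsBetween(Window.unboundedPreceding, 0)
input
  .withColumn("b", last($"b", ignoreNulls = true).over(w))
  .withColumn("c", last($"c", ignoreNulls = true).over(w))
  .show()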

Aggregate multiple columns using methods that can't be called directly from GroupedData class (like "last()") and rename them to original names [duplicate]

This question already has answers here:
Spark SQL: apply aggregate functions to a list of columns
(4 answers)
Closed 5 years ago.
Assuming we have the following DF
scala> df.show
+---+----+----+----+-------------------+---+
| id|name| cnt| amt| dt|scn|
+---+----+----+----+-------------------+---+
| 1|null| 1|1.12|2000-01-02 00:11:11|112|
| 1| aaa| 1|1.11|2000-01-01 00:00:00|111|
| 2| bbb|null|2.22|2000-01-03 12:12:12|201|
| 2|null| 2|1.13| null|200|
| 2|null|null|2.33| null|202|
| 3| ccc| 3|3.34| null|302|
| 3|null|null|3.33| null|301|
| 3|null|null| 0.0|2000-12-31 23:59:59|300|
+---+----+----+----+-------------------+---+
I want to get the following DF: sorted by scn, grouped by id, taking the last non-null value for every column (except id and scn).
It can be done like this:
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.orderBy("scn")
.groupBy("id")
.agg(last("name", true) as "name",
last("cnt", true) as "cnt",
last("amt", true) as "amt",
last("dt", true) as "dt")
.show
// Exiting paste mode, now interpreting.
+---+----+---+----+-------------------+
| id|name|cnt| amt| dt|
+---+----+---+----+-------------------+
| 1| aaa| 1|1.12|2000-01-02 00:11:11|
| 3| ccc| 3|3.34|2000-12-31 23:59:59|
| 2| bbb| 2|2.33|2000-01-03 12:12:12|
+---+----+---+----+-------------------+
In real life I want to process different DFs with a large number of columns.
My question is: how can I specify all the columns (except id and scn) in the .agg(last(col_name, true)) call programmatically?
Code for generating a source DF:
case class C(id: Integer, name: String, cnt: Integer, amt: Double, dt: String, scn: Integer)
val cc = Seq(
C(1, null, 1, 1.12, "2000-01-02 00:11:11", 112),
C(1, "aaa", 1, 1.11, "2000-01-01 00:00:00", 111),
C(2, "bbb", null, 2.22, "2000-01-03 12:12:12", 201),
C(2, null, 2, 1.13, null,200),
C(2, null, null, 2.33, null, 202),
C(3, "ccc", 3, 3.34, null, 302),
C(3, null, null, 3.33, "20001-01-01 00:33:33", 301),
C(3, null, null, 0.00, "2000-12-31 23:59:59", 300)
)
val t = sc.parallelize(cc, 4).toDF()
val df = t.withColumn("dt", $"dt".cast("timestamp"))
val cols = df.columns.filterNot(_.equals("id"))
Solution similar to this answer, plus renaming columns in the resulting DF to the original ones:
val exprs = df.columns.filterNot(_.equals("id")).map(last(_, true))
val r = df.orderBy("scn").groupBy("id").agg(exprs.head, exprs.tail: _*).toDF(df.columns:_*)
Result:
scala> r.show
+---+----+---+----+-------------------+---+
| id|name|cnt| amt| dt|scn|
+---+----+---+----+-------------------+---+
| 1| aaa| 1|1.12|2000-01-02 00:11:11|112|
| 3| ccc| 3|3.34|2000-12-31 23:59:59|302|
| 2| bbb| 2|2.33|2000-01-03 12:12:12|202|
+---+----+---+----+-------------------+---+
or:
val exprs = df.columns.filterNot(_.equals("id")).map(c=>last(c, true).as(c.toString))
val r = df.orderBy("scn").groupBy("id").agg(exprs.head, exprs.tail: _*)

How to find columns with many nulls in Spark/Scala

I have a dataframe in Spark/Scala which has hundreds of columns. Many of the columns have many null values. I'd like to find the columns that have more than 90% nulls and then drop them from my dataframe. How can I do that in Spark/Scala?
org.apache.spark.sql.functions.array and udf will help.
import spark.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize[(String, String, String, String, String, String, String, String, String, String)](
  Seq(
    ("a", null, null, null, null, null, null, null, null, null), // 90% null
    ("b", null, null, null, null, null, null, null, null, ""),   // 80% null
    ("c", null, null, null, null, null, null, null, "", "")      // 70% null
  )
).toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10")
// count nulls then check the condition
val check_90_null = udf { xs: Seq[String] =>
xs.count(_ == null) >= (xs.length * 0.9)
}
// all columns as array
val columns = array(df.columns.map(col): _*)
// filter out
df.where(not(check_90_null(columns)))
.show()
shows
+---+----+----+----+----+----+----+----+----+---+
| c1|  c2|  c3|  c4|  c5|  c6|  c7|  c8|  c9|c10|
+---+----+----+----+----+----+----+----+----+---+
|  b|null|null|null|null|null|null|null|null|   |
|  c|null|null|null|null|null|null|null|    |   |
+---+----+----+----+----+----+----+----+----+---+
so the row starting with "a" is excluded. Note that this approach filters out rows with too many nulls; the next answer drops columns, which is what was asked.
Suppose you have a data frame like this:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (Some(1.0), Some(2), Some("a")),
  (None, Some(3), None),
  (Some(2.0), Some(4), Some("b")),
  (None, None, Some("c"))
).toDF("A", "B", "C")
df.show
+----+----+----+
| A| B| C|
+----+----+----+
| 1.0| 2| a|
|null| 3|null|
| 2.0| 4| b|
|null|null| c|
+----+----+----+
Count the nulls per column using the agg function, then filter columns based on the null counts and a threshold, set to 1 here:
val null_thresh = 1
// or, as a fraction of the row count (e.g. 90%):
// val null_thresh = df.count() * 0.9
val to_keep = df.columns.filter(
c => df.agg(
sum(when(df(c).isNull, 1).otherwise(0)).alias(c)
).first().getLong(0) <= null_thresh
)
df.select(to_keep.head, to_keep.tail: _*).show
And you get:
+----+----+
| B| C|
+----+----+
| 2| a|
| 3|null|
| 4| b|
|null| c|
+----+----+
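Since the per-column agg above launches one Spark job per column, with hundreds of columns it can be cheaper to collect all the null counts in a single pass. A sketch, assuming the same df and a 90% threshold:

import org.apache.spark.sql.functions._

val total = df.count()
// one aggregation row holding the null count of every column
val nullCounts = df.select(
  df.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*
).first()

val to_drop = df.columns.filter(c => nullCounts.getAs[Long](c) >= total * 0.9)
val cleaned = df.drop(to_drop: _*)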

Spark: Add column to dataframe conditionally

I am trying to take my input data:
A   B     C
---------------
4   blah  2
2         3
56  foo   3
And add a column to the end based on whether B is empty or not:
A   B     C  D
--------------------
4   blah  2  1
2         3  0
56  foo   3  1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
I've tried .withColumn, but I can't get that to do what I want.
Try withColumn with the function when as follows:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`

val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
  .toDF("A", "B", "C")

val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show() shows
+---+----+---+---+
| A| B| C| D|
+---+----+---+---+
| 4|blah| 2| 1|
| 2| | 3| 0|
| 56| foo| 3| 1|
|100|null| 5| 0|
+---+----+---+---+
I added the (100, null, 5) row for testing the isNull case.
I tried this code with Spark 1.6.0, but as noted in the documentation of when, it works in the versions after 1.4.0.
My bad, I had missed one part of the question.
Best, cleanest way is to use a UDF.
Explanation within the code.
// create some example data... as a DataFrame
// note, the third record has an empty string
case class Stuff(a: String, b: Int)
val d = sc.parallelize(Seq(("a", 1), ("b", 2), ("", 3), ("d", 4))
  .map { x => Stuff(x._1, x._2) }).toDF

// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 if the string is empty
val func = udf( (s: String) => if (s.isEmpty) 0 else 1 )
// create a new dataframe with an added column named "notempty"
val r = d.select( $"a", $"b", func($"a").as("notempty") )
scala> r.show
+---+---+--------+
|  a|  b|notempty|
+---+---+--------+
|  a|  1|       1|
|  b|  2|       1|
|   |  3|       0|
|  d|  4|       1|
+---+---+--------+
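Note that s.isEmpty throws a NullPointerException if column a ever contains a null (as in the (100, null, 5) row used in the previous answer). A null-safe variant of the same UDF, as a small sketch:

// treat null the same as an empty string
val funcSafe = udf( (s: String) => if (s == null || s.isEmpty) 0 else 1 )
val r2 = d.select( $"a", $"b", funcSafe($"a").as("notempty") )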
How about something like this?
val newDF = df.filter($"B" === "").take(1) match {
case Array() => df
case _ => df.withColumn("D", $"B" === "")
}
Using take(1) should have a minimal performance hit.