Compare dates in scala present in dataframe column - scala

I am trying to compare dates below in filter as below:-
dataframe KIN_PRC_FILE has column pos_price_expiration_dt that has value 9999-12-31
val formatter = new SimpleDateFormat("yyyy-MM-dd");
val CURRENT_DATE = formatter.format(Calendar.getInstance().getTime());
val FILT_KMART_KIN_DATA= KIN_PRC_FILE.filter(s"(pos_price_expiration_dt)>=$CURRENT_DATE AND pos_price_type_cd").show(10)
but seems above query returns null records, can somebody help me to understand what is wrong here.

You just need to add single commas to your current_date variable
KIN_PRC_FILE.filter(s"pos_price_expiration_dt >= '$CURRENT_DATE'")
Quick example here
INPUT
df.show
+-----------------------+---+
|pos_price_expiration_dt| id|
+-----------------------+---+
| 2018-11-20| a|
| 2018-12-28| b|
| null| c|
+-----------------------+---+
OUTPUT
df.filter(s"pos_price_expiration_dt>='$CURRENT_DATE'").show
+-----------------------+---+
|pos_price_expiration_dt| id|
+-----------------------+---+
| 2018-12-28| b|
+-----------------------+---+

Note that you are using string comparison that has date values. Since you have it in the descending order i.e yyyy-MM-dd, this works, but not safe always.
You should consider casting the column to "date" format before doing such comparisons.
And for the current date you can always use the built-in variable. Check this out:
scala> val KIN_PRC_FILE = Seq(("2018-11-01"),("2018-11-15"),("2018-11-30"),("2018-11-28"),(null)).toDF("pos_price_expiration_dt").withColumn("pos_price_expiration_dt",'pos_price_expiration_dt.cast("date"))
KIN_PRC_FILE: org.apache.spark.sql.DataFrame = [pos_price_expiration_dt: date]
scala> KIN_PRC_FILE.printSchema
root
|-- pos_price_expiration_dt: date (nullable = true)
scala> KIN_PRC_FILE.show
+-----------------------+
|pos_price_expiration_dt|
+-----------------------+
| 2018-11-01|
| 2018-11-15|
| 2018-11-30|
| 2018-11-28|
| null|
+-----------------------+
scala> KIN_PRC_FILE.filter(s"pos_price_expiration_dt >= current_date ").show
+-----------------------+
|pos_price_expiration_dt|
+-----------------------+
| 2018-11-30|
| 2018-11-28|
+-----------------------+
scala>

Related

How to filter Date Columns and store them as numbers in Data Frames using Scala

I have a dataframe (dateds1) which looks like below,
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate| Contract Date| ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 1995/09/16| 2008/09/09|2009-02-09 00:00:00|2017-09-09 00:00:00|
| 1994/09/20| 2008/09/10|1999-05-05 00:00:00|2016-09-30 00:00:00|
| 1993/09/24| 2016/06/29|2003-12-07 00:00:00|2028-02-13 00:00:00|
| 1992/09/28| 2007/06/24|2004-06-05 00:00:00|2019-09-24 00:00:00|
| 1991/10/03| 2011/07/07|2011-07-07 00:00:00|2020-03-30 00:00:00|
| 1990/10/07| 2009/02/09|2009-02-09 00:00:00|2011-03-13 00:00:00|
| 1989/10/11| 1999/05/05|1999-05-05 00:00:00|2021-03-13 00:00:00|
I need help in filtering it out, my output should look like below,
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate| Contract Date| ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 19950916 | 20080909 |20090209 |20170909 |
| 19940920 | 20080910 |19990505 |20160930 |
| 19930924 | 20160629 |20031207 |20280213 |
| 19920928 | 20070624 |20040605 |20190924 |
| 19911003 | 20110707 |20110707 |20200330 |
| 19901007 | 20090209 |20090209 |20110313 |
| 19891011 | 19990505 |19990505 |20210313 |
I tried using filter, but I was able to filter only for either of the one case, when the dates are in YYYY/MM/DD or YYYY-MM-DD 00:00:00 format and number of columns are fixed. Can someone please help me in figuring it out for both the formats and when the number of columns are dynamic(They might be increasing or decreasing).
They should be converted from Date Datatype to Integers or Long in this format YYYYMMDD.
Note: The records in this Dataframe or either in YYYY/MM/DD or YYYY-MM-DD 00:00:00 format.
Any help is appreciated. Thanks
To do that conversion dynamically you'll have to iterate through all columns and perform different operations depending on the column type.
Here's an example:
import java.sql.Date
import org.apache.spark.sql.types._
import java.sql.Timestamp
val originalDf = Seq(
(Timestamp.valueOf("2016-09-30 03:04:00"),Date.valueOf("2016-09-30")),
(Timestamp.valueOf("2016-07-30 00:00:00"),Date.valueOf("2016-10-30"))
).toDF("ts_value","date_value")
Original table details:
> originalDf.show
+-------------------+----------+
| ts_value|date_value|
+-------------------+----------+
|2016-09-30 03:04:00|2016-09-30|
|2016-07-30 00:00:00|2016-10-30|
+-------------------+----------+
> originalDf.printSchema
root
|-- ts_value: timestamp (nullable = true)
|-- date_value: date (nullable = true)
Example of conversion operation:
val newDf = originalDf.columns.foldLeft(originalDf)((df, name) => {
val data_type = df.schema(name).dataType
if(data_type == DateType)
df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
else if(data_type == TimestampType)
df.withColumn(name, year(col(name))*10000 + month(col(name))*100 + dayofmonth(col(name)))
else
df
})
New table details:
newDf.show
+--------+----------+
|ts_value|date_value|
+--------+----------+
|20160930| 20160930|
|20160730| 20161030|
+--------+----------+
newDf.printSchema
root
|-- ts_value: integer (nullable = true)
|-- date_value: integer (nullable = true)
If you don't want to perform this operation in all columns you can manually specify the columns by changing
val newDf = originalDf.columns.foldLeft ...
to
val newDf = Seq("col1_name","col2_name", ... ).foldLeft ...
Hope this helps!

Spark scala create multiple columns from array column

Creating a multiple columns from array column
Dataframe
Car name | details
Toyota | [[year,2000],[price,20000]]
Audi | [[mpg,22]]
Expected dataframe
Car name | year | price | mpg
Toyota | 2000 | 20000 | null
Audi | null | null | 22
You can try this
Let's define the data
scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]
scala> carsDF.show(false)
+------+-----------------------------+
|car |details |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi |[[mpg,22]] |
+------+-----------------------------+
Splitting the data & accessing the values in the data
scala> carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val").show
+------+-----+------+
| car| col| val|
+------+-----+------+
|toyota| year| 2000|
|toyota|price|100000|
| Audi| mpg| 22|
+------+-----+------+
Define the list of columns that are required
scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)
Use pivoting on the above defined column names gives required output.
By giving new column names in the sequence makes it a single point input
scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show
+------+----+--------+------+-----+
| car| mpg| price| year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
| Audi|22.0| null| null| null|
+------+----+--------+------+-----+
This seems more elegant & easy way to achieve the output
you can do it like that
import org.apache.spark.functions.col
val df: DataFrame = Seq(
("toyota",Array(("year", 2000), ("price", 100000))),
("toyota",Array(("year", 2001)))
).toDF("car", "details")
+------+-------------------------------+
|car |details |
+------+-------------------------------+
|toyota|[[year, 2000], [price, 100000]]|
|toyota|[[year, 2001]] |
+------+-------------------------------+
val newdf = df
.withColumn("year", when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.withColumn("price", when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.drop("details")
newdf.show()
+------+----+------+
| car|year| price|
+------+----+------+
|toyota|2000|100000|
|toyota|2001| null|
+------+----+------+

How to retrieve the month from a date column values in scala dataframe?

Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
How can I do that in sparkSQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
NOTE:
The first string (to_date) must match the format of your existing date
Be careful with: "dd-MM-yyyy" vs "MM-dd-yyyy"
The second string (date_format) is the format of the output
Docs:
to_date
date_format
Nothing Wrong in your code just keeps your date format as your date column.
Here i am attaching screenshot with your code and change codes.
HAppy Hadoooooooooooopppppppppppppppppppppp
Not exactly related to this question but who wants to get a month as integer there is a month function:
val df2 = df.withColumn("month", month($"date", "dd-MM-yyyy"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| 4|
+---+----------+-----+
The same way you can use the year function to get only year.

Filling missing dates in spark dataframe column

I've a spark data frame with columns - "date" of type timestamp and "quantity" of type long. For each date, I've some value for quantity. The dates are sorted in increasing order. But there are some dates which are missing.
For eg -
Current df -
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
14-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
20-09-2016 | 2
As you can see, the df has some missing dates like 12-09-2016, 13-09-2016 etc. I want to put 0 in the quantity field for those missing dates such that resultant df should look like -
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
12-09-2016 | 0
13-09-2016 | 0
14-09-2016 | 0
15-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
18-09-2016 | 0
19-09-2016 | 0
20-09-2016 | 2
Any help/suggestion regarding this will be appreciated. Thanks in advance.
Note that I am coding in scala.
I have written this answer in a bit verbose way for easy understanding of the code. It can be optimized.
Needed imports
import java.time.format.DateTimeFormatter
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, TimestampType}
UDFs for String to Valid date format
val date_transform = udf((date: String) => {
val dtFormatter = DateTimeFormatter.ofPattern("d-M-y")
val dt = LocalDate.parse(date, dtFormatter)
"%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
.replaceAll(" ", "0")
})
Below UDF code taken from Iterate over dates range
def fill_dates = udf((start: String, excludedDiff: Int) => {
val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val fromDt = LocalDateTime.parse(start, dtFormatter)
(1 to (excludedDiff - 1)).map(day => {
val dt = fromDt.plusDays(day)
"%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
.replaceAll(" ", "0")
})
})
Setting up sample dataframe (df)
val df = Seq(
("10-09-2016", 1),
("11-09-2016", 2),
("14-09-2016", 0),
("16-09-2016", 1),
("17-09-2016", 0),
("20-09-2016", 2)).toDF("date", "quantity")
.withColumn("date", date_transform($"date").cast(TimestampType))
.withColumn("quantity", $"quantity".cast(LongType))
df.printSchema()
root
|-- date: timestamp (nullable = true)
|-- quantity: long (nullable = false)
df.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-14 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
Create a temporary dataframe(tempDf) to union with df:
val w = Window.orderBy($"date")
val tempDf = df.withColumn("diff", datediff(lead($"date", 1).over(w), $"date"))
.filter($"diff" > 1) // Pick date diff more than one day to generate our date
.withColumn("next_dates", fill_dates($"date", $"diff"))
.withColumn("quantity", lit("0"))
.withColumn("date", explode($"next_dates"))
.withColumn("date", $"date".cast(TimestampType))
tempDf.show(false)
+-------------------+--------+----+------------------------+
|date |quantity|diff|next_dates |
+-------------------+--------+----+------------------------+
|2016-09-12 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-13 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-15 00:00:00|0 |2 |[2016-09-15] |
|2016-09-18 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
|2016-09-19 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
+-------------------+--------+----+------------------------+
Now union two dataframes
val result = df.union(tempDf.select("date", "quantity"))
.orderBy("date")
result.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-12 00:00:00| 0|
|2016-09-13 00:00:00| 0|
|2016-09-14 00:00:00| 0|
|2016-09-15 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-18 00:00:00| 0|
|2016-09-19 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
Based on the #mrsrinivas excellent answer, here is the PySpark version.
Needed imports
from typing import List
import datetime
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, lit, udf, datediff, lead, explode
from pyspark.sql.types import DateType, ArrayType
UDF to create the range of next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]
Function the create the DateFrame filling the dates (support "grouping" columns):
def _get_fill_dates_df(df: DataFrame, date_column: str, group_columns: List[str], fill_column: str) -> DataFrame:
get_next_dates_udf = udf(_get_next_dates, ArrayType(DateType()))
window = Window.orderBy(*group_columns, date_column)
return df.withColumn("_diff", datediff(lead(date_column, 1).over(window), date_column)) \
.filter(col("_diff") > 1).withColumn("_next_dates", get_next_dates_udf(date_column, "_diff")) \
.withColumn(fill_column, lit("0")).withColumn(date_column, explode("_next_dates")) \
.drop("_diff", "_next_dates")
The usage of the function:
fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
df = df.union(fill_df)
It assumes that the date column is already in date type.
Here is a slight modification, to use this function with months and enter measure columns (columns that should be set to zero) instead of group columns:
from typing import List
import datetime
from dateutil import relativedelta
import math
import pyspark.sql.functions as f
from pyspark.sql import DataFrame, Window
from pyspark.sql.types import DateType, ArrayType
def fill_time_gaps_date_diff_based(df: pyspark.sql.dataframe.DataFrame, measure_columns: list, date_column: str):
group_columns = [col for col in df.columns if col not in [date_column]+measure_columns]
# save measure sums for qc
qc = df.agg({col: 'sum' for col in measure_columns}).collect()
# convert month to date
convert_int_to_date = f.udf(lambda mth: datetime.datetime(year=math.floor(mth/100), month=mth%100, day=1), DateType())
df = df.withColumn(date_column, convert_int_to_date(date_column))
# sort values
df = df.orderBy(group_columns)
# get_fill_dates_df (instead of months_between also use date_diff for days)
window = Window.orderBy(*group_columns, date_column)
# calculate diff column
fill_df = df.withColumn(
"_diff",
f.months_between(f.lead(date_column, 1).over(window), date_column).cast(IntegerType())
).filter(
f.col("_diff") > 1
)
# generate next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
return [
start_date + relativedelta.relativedelta(months=months)
for months in range(1, diff)
]
get_next_dates_udf = f.udf(_get_next_dates, ArrayType(DateType()))
fill_df = fill_df.withColumn(
"_next_dates",
get_next_dates_udf(date_column, "_diff")
)
# set measure columns to 0
for col in measure_columns:
fill_df = fill_df.withColumn(col, f.lit(0))
# explode next_dates column
fill_df = fill_df.withColumn(date_column, f.explode('_next_dates'))
# drop unneccessary columns
fill_df = fill_df.drop(
"_diff",
"_next_dates"
)
# union df with fill_df
df = df.union(fill_df)
# qc: should be removed for productive runs
if qc != df.agg({col: 'sum' for col in measure_columns}).collect():
raise ValueError('Sums before and after run do not fit.')
return df
Please note, that I assume that the month is given as Integer in the form YYYYMM. This could easily be adjusted by modifying the "convert month to date" part.

Create a new column based on date checking

I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case it won't work. My initial idea was to use udf, but not sure how to make it working for this case.
You can combine join and withColumn for this case. i.e. firstly join with df2 on ID column and then use when.otherwise syntax to modify the check column:
import org.apache.spark.sql.functions.lit
val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")
df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Or another option, firstly filter on df2, and then join it back with df1 on ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it sql Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11