How to obtain rows that have a datetime greater than a specific datetime? - scala

I want to obtain only those rows in a Spark DataFrame df that have a datetime greater than 2017-Jul-10 08:35. How can I do it?
I know how to extract rows corresponding to a specific datetime, e.g. 2017-Jul-10, but I don't know how to make the comparison, i.e. greater than 2017-Jul-10 08:35.
df = df.filter(df("p_datetime") === "2017-Jul-10")

Your p_datetime is in a custom date format, so you need to convert it to a proper date/timestamp format before comparing.
Below is a simple example that reproduces your problem:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
import spark.implicits._
val df = Seq(
("2017-Jul-10", "0.26"),
("2017-Jul-9", "0.81"),
("2015-Jul-8", "0.24"),
("2015-Jul-11", "null"),
("2015-Jul-12", "null"),
("2015-Jul-15", "0.13")
).toDF("datetime", "value")
val df1 = df.withColumn("datetime", from_unixtime(unix_timestamp($"datetime", "yyyy-MMM-dd")))
df1.filter($"datetime".gt(lit("2017-07-10"))).show // greater than
df1.filter($"datetime" > (lit("2017-07-10"))).show
Output:
+-------------------+-----+
| datetime|value|
+-------------------+-----+
|2017-07-10 00:00:00| 0.26|
+-------------------+-----+
df1.filter($"datetime".lt(lit("2017-07-10"))).show //less than
df1.filter($"datetime" < (lit("2017-07-10"))).show
Output:
+-------------------+-----+
| datetime|value|
+-------------------+-----+
|2017-07-09 00:00:00| 0.81|
|2015-07-08 00:00:00| 0.24|
|2015-07-11 00:00:00| null|
|2015-07-12 00:00:00| null|
|2015-07-15 00:00:00| 0.13|
+-------------------+-----+
df1.filter($"datetime".geq(lit("2017-07-10"))).show // greater than equal to
df1.filter($"datetime" <= (lit("2017-07-10"))).show
Output:
+-------------------+-----+
| datetime|value|
+-------------------+-----+
|2017-07-10 00:00:00| 0.26|
+-------------------+-----+
Edit: You can also compare timestamps directly:
val df1 = df.withColumn("datetime", unix_timestamp($"datetime", "yyyy-MMM-dd").cast(TimestampType)) // cast to timestamp
df1.filter($"datetime" >= lit("2017-07-10").cast(TimestampType)).show
// cast "2017-07-10" also to timestamp
Hope this helps!

Related

How to add the incremental date value with respect to first row value in spark dataframe

Input:
+------+--------+
|Test  |01-12-20|
|Ravi  |    null|
|Son   |    null|
+------+--------+
Expected output:
+------+--------+
|Test  |01-12-20|
|Ravi  |02-12-20|
|Son   |03-12-20|
+------+--------+
I tried with .withColumn(col("dated"), date_add(col("dated"), 1));
But this results in NULL for all the column values.
Could you please help me get the incremental date values in the second (date) column?
This will be a working solution for you
Input
from pyspark.sql import functions as F
from pyspark.sql import Window as W
df = spark.createDataFrame([("Test", "01-12-20"), ("Ravi", None), ("Son", None)], ["col1", "col2"])
df.show()
df = df.withColumn("col2", F.to_date(F.col("col2"),"dd-MM-yy"))
# a dummy col for window function
df = df.withColumn("del_col", F.lit(0))
_w = W.partitionBy(F.col("del_col")).orderBy(F.col("del_col").desc())
df = df.withColumn("rn_no", F.row_number().over(_w)-1)
# Create a column with the same date
df = df.withColumn("dated", F.first("col2").over(_w))
df = df.selectExpr('*', 'date_add(dated, rn_no) as next_date')
df.show()
DF
+----+--------+
|col1| col2|
+----+--------+
|Test|01-12-20|
|Ravi| null|
| Son| null|
+----+--------+
Final Output
+----+----------+-------+-----+----------+----------+
|col1| col2|del_col|rn_no| dated| next_date|
+----+----------+-------+-----+----------+----------+
|Test|2020-12-01| 0| 0|2020-12-01|2020-12-01|
|Ravi| null| 0| 1|2020-12-01|2020-12-02|
| Son| null| 0| 2|2020-12-01|2020-12-03|
+----+----------+-------+-----+----------+----------+
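Since the surrounding thread is Scala-oriented, here is a rough Scala sketch of the same approach, translated from the PySpark answer above (untested; assumes spark.implicits._ is in scope):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(("Test", "01-12-20"), ("Ravi", null), ("Son", null)).toDF("col1", "col2")
  .withColumn("col2", to_date($"col2", "dd-MM-yy"))
  .withColumn("del_col", lit(0))                 // dummy column for the window function
val w = Window.partitionBy($"del_col").orderBy($"del_col")
val result = df
  .withColumn("rn_no", row_number().over(w) - 1) // 0, 1, 2, ...
  .withColumn("dated", first($"col2").over(w))   // copy the first date to every row
  .selectExpr("col1", "col2", "date_add(dated, rn_no) as next_date")
result.show()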

Spark Scala - 7 Day Rolling Sum

I have some data that I want to calculate a 7 day rolling sum on. Every row for a specific date should be counted as 1 occurrence. My thought process here is to use something like:
val myWindow = Window.orderBy("Date").rangeBetween(currentRow,days(7))
val myData = df.withColumn("Count",df.count().over(myWindow))
But rangeBetween doesn't accept something like days(7) for looking 7 days ahead of the current date.
Any thoughts?
Input Data:
val df = Seq(
("08/04/2013",22),
("08/05/2013",24),
("08/06/2013",26),
("08/07/2013",29),
("08/08/2013",24),
("08/09/2013",24),
("08/10/2013",22),
("08/11/2013",24),
("08/11/2013",26)
).toDF("Date","Code")
+----------+----+
| Date|Code|
+----------+----+
|08/04/2013| 22|
|08/05/2013| 24|
|08/06/2013| 26|
|08/07/2013| 29|
|08/08/2013| 24|
|08/09/2013| 24|
|08/10/2013| 22|
|08/11/2013| 24|
|08/11/2013| 26|
+----------+----+
Expected output:
+----------+----------+-----+
|     Start|       End|Count|
+----------+----------+-----+
|08/04/2013|08/10/2013|    7|
|08/05/2013|08/11/2013|    8|
+----------+----------+-----+
From Spark 2.3 you can use long values with rangeBetween. Since one day has 86400 seconds, you can express your query as:
val myWindow = Window.orderBy("Date").rangeBetween(0, 7 * 86400)
val myData = df
.withColumn("Date", to_date($"Date", "MM/dd/yyyy").cast("timestamp").cast("long"))
.withColumn("Count", count($"*").over(myWindow))
.withColumn("Date", $"Date".cast("timestamp").cast("date"))

Convert DataFrame String column to Timestamp

I am trying the following code to convert a string date column to a timestamp column:
val df = Seq(
("19-APR-2019 10:11:10"),
("19-MAR-2019 10:11:10"),
("19-FEB-2019 10:11:10")
).toDF("date")
.withColumn("new_date", to_utc_timestamp(to_date('date, "dd-MMM-yyyy hh:mm:ss"), "UTC"))
df.show
It almost works, but it loses the hours:
+--------------------+-------------------+
| date| new_date|
+--------------------+-------------------+
|19-APR-2019 10:11:10|2019-04-19 00:00:00|
|19-MAR-2019 10:11:10|2019-03-19 00:00:00|
|19-FEB-2019 10:11:10|2019-02-19 00:00:00|
+--------------------+-------------------+
Do you have any idea or any other solution?
As SMaz mentioned in a comment, the following lines do the trick. to_date truncates the value to day precision, which is why the hours were lost, whereas to_timestamp keeps the time part:
import org.apache.spark.sql.functions.to_timestamp
df.withColumn("new_date", to_timestamp('date, "dd-MMM-yyyy HH:mm:ss"))

Compare dates in scala present in dataframe column

I am trying to compare dates in a filter as below.
The dataframe KIN_PRC_FILE has a column pos_price_expiration_dt that holds values like 9999-12-31.
val formatter = new SimpleDateFormat("yyyy-MM-dd");
val CURRENT_DATE = formatter.format(Calendar.getInstance().getTime());
val FILT_KMART_KIN_DATA= KIN_PRC_FILE.filter(s"(pos_price_expiration_dt)>=$CURRENT_DATE AND pos_price_type_cd").show(10)
But it seems the above query returns no records. Can somebody help me understand what is wrong here?
You just need to add single quotes around your CURRENT_DATE variable:
KIN_PRC_FILE.filter(s"pos_price_expiration_dt >= '$CURRENT_DATE'")
Quick example here
INPUT
df.show
+-----------------------+---+
|pos_price_expiration_dt| id|
+-----------------------+---+
| 2018-11-20| a|
| 2018-12-28| b|
| null| c|
+-----------------------+---+
OUTPUT
df.filter(s"pos_price_expiration_dt>='$CURRENT_DATE'").show
+-----------------------+---+
|pos_price_expiration_dt| id|
+-----------------------+---+
| 2018-12-28| b|
+-----------------------+---+
Note that you are comparing strings that contain date values. Because the format puts the most significant parts first (yyyy-MM-dd), lexicographic string comparison happens to work here, but it is not always safe.
You should consider casting the column to the date type before doing such comparisons.
And for the current date you can always use the built-in current_date. Check this out:
scala> val KIN_PRC_FILE = Seq(("2018-11-01"),("2018-11-15"),("2018-11-30"),("2018-11-28"),(null)).toDF("pos_price_expiration_dt").withColumn("pos_price_expiration_dt",'pos_price_expiration_dt.cast("date"))
KIN_PRC_FILE: org.apache.spark.sql.DataFrame = [pos_price_expiration_dt: date]
scala> KIN_PRC_FILE.printSchema
root
|-- pos_price_expiration_dt: date (nullable = true)
scala> KIN_PRC_FILE.show
+-----------------------+
|pos_price_expiration_dt|
+-----------------------+
| 2018-11-01|
| 2018-11-15|
| 2018-11-30|
| 2018-11-28|
| null|
+-----------------------+
scala> KIN_PRC_FILE.filter(s"pos_price_expiration_dt >= current_date ").show
+-----------------------+
|pos_price_expiration_dt|
+-----------------------+
| 2018-11-30|
| 2018-11-28|
+-----------------------+

Filling missing dates in spark dataframe column

I have a Spark data frame with columns "date" of type timestamp and "quantity" of type long. For each date, I have some value for quantity. The dates are sorted in increasing order, but some dates are missing.
For eg -
Current df -
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
14-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
20-09-2016 | 2
As you can see, the df has some missing dates like 12-09-2016, 13-09-2016 etc. I want to put 0 in the quantity field for those missing dates such that resultant df should look like -
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
12-09-2016 | 0
13-09-2016 | 0
14-09-2016 | 0
15-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
18-09-2016 | 0
19-09-2016 | 0
20-09-2016 | 2
Any help/suggestion regarding this will be appreciated. Thanks in advance.
Note that I am coding in scala.
I have written this answer in a slightly verbose way for easier understanding of the code. It can be optimized.
Needed imports
import java.time.format.DateTimeFormatter
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, TimestampType}
UDF for converting the string to a valid date format
val date_transform = udf((date: String) => {
val dtFormatter = DateTimeFormatter.ofPattern("d-M-y")
val dt = LocalDate.parse(date, dtFormatter)
"%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
.replaceAll(" ", "0")
})
Below UDF code taken from Iterate over dates range
def fill_dates = udf((start: String, excludedDiff: Int) => {
val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val fromDt = LocalDateTime.parse(start, dtFormatter)
(1 to (excludedDiff - 1)).map(day => {
val dt = fromDt.plusDays(day)
"%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
.replaceAll(" ", "0")
})
})
Setting up sample dataframe (df)
val df = Seq(
("10-09-2016", 1),
("11-09-2016", 2),
("14-09-2016", 0),
("16-09-2016", 1),
("17-09-2016", 0),
("20-09-2016", 2)).toDF("date", "quantity")
.withColumn("date", date_transform($"date").cast(TimestampType))
.withColumn("quantity", $"quantity".cast(LongType))
df.printSchema()
root
|-- date: timestamp (nullable = true)
|-- quantity: long (nullable = false)
df.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-14 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
Create a temporary dataframe(tempDf) to union with df:
val w = Window.orderBy($"date")
val tempDf = df.withColumn("diff", datediff(lead($"date", 1).over(w), $"date"))
.filter($"diff" > 1) // Pick date diff more than one day to generate our date
.withColumn("next_dates", fill_dates($"date", $"diff"))
.withColumn("quantity", lit("0"))
.withColumn("date", explode($"next_dates"))
.withColumn("date", $"date".cast(TimestampType))
tempDf.show(false)
+-------------------+--------+----+------------------------+
|date |quantity|diff|next_dates |
+-------------------+--------+----+------------------------+
|2016-09-12 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-13 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-15 00:00:00|0 |2 |[2016-09-15] |
|2016-09-18 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
|2016-09-19 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
+-------------------+--------+----+------------------------+
Now union two dataframes
val result = df.union(tempDf.select("date", "quantity"))
.orderBy("date")
result.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-12 00:00:00| 0|
|2016-09-13 00:00:00| 0|
|2016-09-14 00:00:00| 0|
|2016-09-15 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-18 00:00:00| 0|
|2016-09-19 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
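For Spark 2.4+, here is a more compact alternative sketch using the built-in sequence function to generate the missing dates and a left join to fill the quantities. This variant is my own addition, not part of the answer above, and assumes the same df with a timestamp "date" column and the same imports:
// one row per calendar day between the min and max dates, then left-join the original data
val bounds = df.agg(min($"date").cast("date").as("min_d"), max($"date").cast("date").as("max_d"))
val allDates = bounds
  .select(explode(sequence($"min_d", $"max_d")).as("date")) // default step is one day
  .withColumn("date", $"date".cast(TimestampType))
val filled = allDates
  .join(df, Seq("date"), "left")
  .na.fill(0, Seq("quantity"))
  .orderBy("date")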
Based on @mrsrinivas's excellent answer, here is the PySpark version.
Needed imports
from typing import List
import datetime
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, lit, udf, datediff, lead, explode
from pyspark.sql.types import DateType, ArrayType
UDF to create the range of next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]
Function to create the DataFrame filling the dates (supports "grouping" columns):
def _get_fill_dates_df(df: DataFrame, date_column: str, group_columns: List[str], fill_column: str) -> DataFrame:
get_next_dates_udf = udf(_get_next_dates, ArrayType(DateType()))
window = Window.orderBy(*group_columns, date_column)
return df.withColumn("_diff", datediff(lead(date_column, 1).over(window), date_column)) \
.filter(col("_diff") > 1).withColumn("_next_dates", get_next_dates_udf(date_column, "_diff")) \
.withColumn(fill_column, lit("0")).withColumn(date_column, explode("_next_dates")) \
.drop("_diff", "_next_dates")
The usage of the function:
fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
df = df.union(fill_df)
It assumes that the date column is already in date type.
Here is a slight modification to use this function with months, and with measure columns (columns that should be set to zero) instead of group columns:
from typing import List
import datetime
from dateutil import relativedelta
import math
import pyspark.sql.functions as f
from pyspark.sql import DataFrame, Window
from pyspark.sql.types import DateType, ArrayType, IntegerType
def fill_time_gaps_date_diff_based(df: DataFrame, measure_columns: list, date_column: str):
group_columns = [col for col in df.columns if col not in [date_column]+measure_columns]
# save measure sums for qc
qc = df.agg({col: 'sum' for col in measure_columns}).collect()
# convert month to date
convert_int_to_date = f.udf(lambda mth: datetime.datetime(year=math.floor(mth/100), month=mth%100, day=1), DateType())
df = df.withColumn(date_column, convert_int_to_date(date_column))
# sort values
df = df.orderBy(group_columns)
# get_fill_dates_df (instead of months_between also use date_diff for days)
window = Window.orderBy(*group_columns, date_column)
# calculate diff column
fill_df = df.withColumn(
"_diff",
f.months_between(f.lead(date_column, 1).over(window), date_column).cast(IntegerType())
).filter(
f.col("_diff") > 1
)
# generate next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
return [
start_date + relativedelta.relativedelta(months=months)
for months in range(1, diff)
]
get_next_dates_udf = f.udf(_get_next_dates, ArrayType(DateType()))
fill_df = fill_df.withColumn(
"_next_dates",
get_next_dates_udf(date_column, "_diff")
)
# set measure columns to 0
for col in measure_columns:
fill_df = fill_df.withColumn(col, f.lit(0))
# explode next_dates column
fill_df = fill_df.withColumn(date_column, f.explode('_next_dates'))
# drop unneccessary columns
fill_df = fill_df.drop(
"_diff",
"_next_dates"
)
# union df with fill_df
df = df.union(fill_df)
# qc: should be removed for productive runs
if qc != df.agg({col: 'sum' for col in measure_columns}).collect():
raise ValueError('Sums before and after run do not fit.')
return df
Please note that I assume the month is given as an integer in the form YYYYMM. This can easily be adjusted by modifying the "convert month to date" part.