Fetch week start date and week end date from Date - pyspark

I need to fetch the week start date and the week end date from a given date, taking into account that the week starts on Sunday and ends on Saturday.
I referred to this post, but it takes Monday as the starting day of the week. Is there any built-in function in Spark which can take care of this?

Find the day of the week (dayofweek returns 1 for Sunday through 7 for Saturday) and use selectExpr to derive the new columns, making Sunday the week start date.
from pyspark.sql import functions as F
df_b = spark.createDataFrame([('1','2020-07-13')],[ "ID","date"])
df_b = df_b.withColumn('day_of_week', F.dayofweek(F.col('date')))
df_b = df_b.selectExpr('*', 'date_sub(date, day_of_week-1) as week_start')
df_b = df_b.selectExpr('*', 'date_add(date, 7-day_of_week) as week_end')
df_b.show()
+---+----------+-----------+----------+----------+
| ID|      date|day_of_week|week_start|  week_end|
+---+----------+-----------+----------+----------+
|  1|2020-07-13|          2|2020-07-12|2020-07-18|
+---+----------+-----------+----------+----------+
Update in Spark SQL
Create a temporary view from the dataframe first:
df_a.createOrReplaceTempView("df_a_sql")
Code here:
%sql
select *, date_sub(date, dayofweek-1) as week_start,
       date_add(date, 7-dayofweek) as week_end
from
  (select *, dayofweek(date) as dayofweek
   from df_a_sql) T
Output
+---+----------+---------+----------+----------+
| ID|      date|dayofweek|week_start|  week_end|
+---+----------+---------+----------+----------+
|  1|2020-07-13|        2|2020-07-12|2020-07-18|
+---+----------+---------+----------+----------+

Perhaps this is helpful -
Load the test data
val df = spark.sql("select cast('2020-07-12' as date) as date")
df.show(false)
df.printSchema()
/**
* +----------+
* |date |
* +----------+
* |2020-07-12|
* +----------+
*
* root
* |-- date: date (nullable = true)
*/
week starting from SUNDAY and ending SATURDAY
// week starting from SUNDAY and ending SATURDAY
df.withColumn("week_end", next_day($"date", "SAT"))
.withColumn("week_start", date_sub($"week_end", 6))
.show(false)
/**
* +----------+----------+----------+
* |date |week_end |week_start|
* +----------+----------+----------+
* |2020-07-12|2020-07-18|2020-07-12|
* +----------+----------+----------+
*/
week starting from MONDAY and ending SUNDAY
// week starting from MONDAY and ending SUNDAY
df.withColumn("week_end", next_day($"date", "SUN"))
.withColumn("week_start", date_sub($"week_end", 6))
.show(false)
/**
* +----------+----------+----------+
* |date |week_end |week_start|
* +----------+----------+----------+
* |2020-07-12|2020-07-19|2020-07-13|
* +----------+----------+----------+
*/
week starting from TUESDAY and ending MONDAY
// week starting from TUESDAY and ending MONDAY
df.withColumn("week_end", next_day($"date", "MON"))
.withColumn("week_start", date_sub($"week_end", 6))
.show(false)
/**
* +----------+----------+----------+
* |date |week_end |week_start|
* +----------+----------+----------+
* |2020-07-12|2020-07-13|2020-07-07|
* +----------+----------+----------+
*/
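For completeness, a minimal PySpark sketch of the same next_day approach for the Sunday-to-Saturday case, assuming the spark session and the single-row test data above. Note that next_day returns the first matching day strictly after its input, so a date that is already a Saturday would roll into the following week; the dayofweek-based answer above does not have that edge case.
from pyspark.sql import functions as F

df = spark.sql("select cast('2020-07-12' as date) as date")

# week ending SATURDAY: next Saturday strictly after the date, then back 6 days
df = (df
      .withColumn("week_end", F.next_day(F.col("date"), "SAT"))
      .withColumn("week_start", F.date_sub(F.col("week_end"), 6)))
df.show()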

Find the start date and end date of the week in a pyspark dataframe, with Monday being the first day of the week.
from pyspark.sql.functions import col, dayofweek, expr, when

def add_start_end_week(dataframe, timestamp_col, StartDate, EndDate):
    """
    Get the start date and the end date of the week.
    args
        dataframe: spark dataframe
        timestamp_col: timestamp column based on which we calculate the start and end date
        StartDate: name of the week start date column to add
        EndDate: name of the week end date column to add
    """
    dataframe = dataframe.withColumn(
        'day_of_week', dayofweek(col(timestamp_col)))
    # start of the week (Monday as first day)
    dataframe = dataframe.withColumn(StartDate, when(col("day_of_week") > 1,
        expr("date_add(date_sub({}, day_of_week - 1), 1)".format(timestamp_col)))
        .otherwise(expr("date_sub({}, 6)".format(timestamp_col))))
    # end of the week (Sunday as last day)
    dataframe = dataframe.withColumn(EndDate, when(col("day_of_week") > 1,
        expr("date_add(date_add({}, 7 - day_of_week), 1)".format(timestamp_col)))
        .otherwise(col(timestamp_col)))
    return dataframe
Validate the above function:
df = spark.createDataFrame([('2021-09-26',),('2021-09-25',),('2021-09-24',),('2021-09-23',),('2021-09-22',),('2021-09-21',),('2021-09-20',)], ['dt'])
dataframe = df.withColumn('day_of_week', dayofweek(col('dt')))
# start of the week (Monday as first day)
dataframe = dataframe.withColumn('StartDate',when(col("day_of_week")>1,expr("date_add(date_sub(dt,day_of_week-1),1)")).otherwise(expr("date_sub(dt,6)")))
#End of the Week
dataframe = dataframe.withColumn('EndDate',when(col("day_of_week")>1,expr("date_add(date_add(dt,7-day_of_week),1)")).otherwise(col("dt")))
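For reference, a minimal usage sketch calling the helper directly, assuming the function and the df defined above (StartDate / EndDate are just example column names):
result = add_start_end_week(df, 'dt', 'StartDate', 'EndDate')
result.show()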

Related

How to convert short date D-MMM-yyyy using PySpark

Why does only the Jan date work when I try to convert using the code below?
df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "D-MMM-yyyy"))
display(df2)
Result:
Date
------------
undefined
2021-01-02
D is day-of-year.
The second row works because day 02 of the year is in fact in January, but day 05 of the year is not in November.
If you try:
data = [{"date": "05-Jan-2000"}, {"date": "02-Jan-2021"}]
It will work for both.
However, you need d which is the day of the month. So use d-MMM-yyyy.
For further information please see: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
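For example, a quick sketch with the single-letter d (day-of-month) pattern on the same sample data:
from pyspark.sql.functions import col, to_date

df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "d-MMM-yyyy"))
df2.show()
# both rows now parse: 2000-11-05 and 2021-01-02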
D is day-of-the-year.
What you're looking for is d - day of the month.
PySpark supports the Java DateTimeFormatter patterns: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html
df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "dd-MMM-yyyy"))
df2.show()
+----------+
| date|
+----------+
|2000-11-05|
|2021-01-02|
+----------+

How to convert one time zone to another in Spark Dataframe

I am reading from PostgreSQL into a Spark dataframe and have a date column in PostgreSQL like below:
last_upd_date
---------------------
"2021-04-21 22:33:06.308639-05"
But in the Spark dataframe it is adding an hour offset,
e.g. 2021-04-22 03:33:06.308639
Here it is adding 5 hours to the last_upd_date column.
But I want the output as 2021-04-21 22:33:06.308639
Can anyone help me fix this in the Spark dataframe?
You can create a UDF that formats the timestamp in the required time zone:
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

val formatTimestampWithTz = udf((i: Instant, zone: String) =>
  i.atZone(ZoneId.of(zone))
    .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")))
val df = Seq(("2021-04-21 22:33:06.308639-05")).toDF("dateString")
.withColumn("date", to_timestamp('dateString, "yyyy-MM-dd HH:mm:ss.SSSSSSx"))
.withColumn("date in Berlin", formatTimestampWithTz('date, lit("Europe/Berlin")))
.withColumn("date in Anchorage", formatTimestampWithTz('date, lit("America/Anchorage")))
.withColumn("date in GMT-5", formatTimestampWithTz('date, lit("-5")))
df.show(numRows = 10, truncate = 50, vertical = true)
Result:
-RECORD 0------------------------------------------
dateString | 2021-04-21 22:33:06.308639-05
date | 2021-04-22 05:33:06.308639
date in Berlin | 2021-04-22 05:33:06.308639
date in Anchorage | 2021-04-21 19:33:06.308639
date in GMT-5 | 2021-04-21 22:33:06.308639

Convert seconds to hhmmss Spark

I have a UDF that creates a timestamp out of two field values, a date and a time. However, the time field holds seconds.
So how can I merge the two fields of type date and seconds into an event hour of Unix timestamp type?
My current implementation looks like this:
private val unix_epoch = udf[Long, String, String] { (date, time) =>
  deltaDateFormatter.parseDateTime(s"$date $formatted").getSeconds
}

def transform(inputDf: DataFrame): Unit = {
  inputDf
    .withColumn("event_hour", unix_epoch($"event_date", $"event_time"))
    .withColumn("event_ts", from_unixtime($"event_hour").cast(TimestampType))
}
Input data:
event_date,event_time
20170501,87721
20170501,87728
20170501,87721
20170501,87726
Desired output:
event_tmstp, event_hour
2017-05-01 00:22:01,1493598121
2017-05-01 00:22:08,1493598128
2017-05-01 00:22:01,1493598121
2017-05-01 00:22:06,1493598126
Update: data schema:
event_date: string (nullable = true)
event_time: integer (nullable = true)
Cast event_date to a unix timestamp, add the event_time column to get event_hour, and convert back to normal timestamp event_tmstp.
PS I'm not sure why event_time has 86400 seconds (1 day) more. I needed to subtract that to get your expected output.
val df = Seq(
("20170501", 87721),
("20170501", 87728),
("20170501", 87721),
("20170501", 87726)
).toDF("event_date","event_time")
val df2 = df.select(
unix_timestamp(to_date($"event_date", "yyyyMMdd")) + $"event_time" - 86400
).toDF("event_hour").select(
$"event_hour".cast("timestamp").as("event_tmstp"),
$"event_hour"
)
df2.show
+-------------------+----------+
| event_tmstp|event_hour|
+-------------------+----------+
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:08|1493598128|
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:06|1493598126|
+-------------------+----------+
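A PySpark sketch of the same idea, in case it is useful (cast the date to unix seconds, add event_time, subtract the extra day, then cast back to a timestamp); the sample data is the one from the question:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("20170501", 87721), ("20170501", 87728), ("20170501", 87721), ("20170501", 87726)],
    ["event_date", "event_time"])

df2 = df.select(
    (F.unix_timestamp(F.to_date("event_date", "yyyyMMdd")) + F.col("event_time") - 86400)
        .alias("event_hour")
).select(
    F.col("event_hour").cast("timestamp").alias("event_tmstp"),
    "event_hour"
)
df2.show()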
Check whether the code below helps; it avoids using a UDF.
val df = Seq(
(20170501,87721),
(20170501,87728),
(20170501,87721),
(20170501,87726)
).toDF("date","time")
df
  .withColumn("date",
    to_date(
      unix_timestamp($"date".cast("string"), "yyyyMMdd").cast("timestamp")
    )
  )
  .withColumn("event_hour",
    unix_timestamp(
      concat_ws(" ", $"date", from_unixtime($"time", "HH:mm:ss.S")).cast("timestamp")
    )
  )
  .withColumn("event_ts", from_unixtime($"event_hour"))
  .show(false)
+----------+-----+----------+-------------------+
|date |time |event_hour|event_ts |
+----------+-----+----------+-------------------+
|2017-05-01|87721|1493598121|2017-05-01 00:22:01|
|2017-05-01|87728|1493598128|2017-05-01 00:22:08|
|2017-05-01|87721|1493598121|2017-05-01 00:22:01|
|2017-05-01|87726|1493598126|2017-05-01 00:22:06|
+----------+-----+----------+-------------------+

How to change date format from string (24 Jun 2020) to Date 24-06-2020 in spark sql?

I have a column with string values like '24 Jun 2020' and I want to cast it to a date type.
Is there a way to specify the format of the input and output dates while casting from string to date type?
Spark's default date format is yyyy-MM-dd. You can use to_date, to_timestamp, or from_unixtime + unix_timestamp to change your string to a date.
Example:
df.show()
#+-----------+
#| dt|
#+-----------+
#|24 Jun 2020|
#+-----------+
#using to_date function
df.withColumn("new_format", to_date(col("dt"),'dd MMM yyyy')).show()
#using to_timestamp function
df.withColumn("new_format", to_timestamp(col("dt"),'dd MMM yyyy').cast("date")).show()
#+-----------+----------+
#| dt|new_format|
#+-----------+----------+
#|24 Jun 2020|2020-06-24|
#+-----------+----------+
df.withColumn("new_format", to_date(col("dt"),'dd MMM yyyy')).printSchema()
#root
# |-- dt: string (nullable = true)
# |-- new_format: date (nullable = true)
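The from_unixtime + unix_timestamp variant mentioned above would look roughly like this (from_unixtime returns a string, so cast it if you need a date column):
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

df.withColumn(
    "new_format",
    from_unixtime(unix_timestamp(col("dt"), "dd MMM yyyy"), "yyyy-MM-dd").cast("date")
).show()
# new_format: 2020-06-24, same result as with to_date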
The default date format for date is yyyy-MM-dd -
val df1 = Seq("24 Jun 2020").toDF("dateStringType")
df1.show(false)
/**
* +--------------+
* |dateStringType|
* +--------------+
* |24 Jun 2020 |
* +--------------+
*/
// default date format is "yyyy-MM-dd"
df1.withColumn("dateDateType", to_date($"dateStringType", "dd MMM yyyy"))
.show(false)
/**
* +--------------+------------+
* |dateStringType|dateDateType|
* +--------------+------------+
* |24 Jun 2020 |2020-06-24 |
* +--------------+------------+
*/
// Use date_format to change the default date_format to "dd-MM-yyyy"
df1.withColumn("changDefaultFormat", date_format(to_date($"dateStringType", "dd MMM yyyy"), "dd-MM-yyyy"))
.show(false)
/**
* +--------------+------------------+
* |dateStringType|changDefaultFormat|
* +--------------+------------------+
* |24 Jun 2020 |24-06-2020 |
* +--------------+------------------+
*/

Subtract current date from another date in dataframe (Scala)

First of all, thank you for the time in reading my question :)
My question is the following: in Spark with Scala, I have a dataframe that contains a string with a date in the format dd/MM/yyyy HH:mm, for example df:
+----------------+
|date |
+----------------+
|8/11/2017 15:00 |
|9/11/2017 10:00 |
+----------------+
I want to get the difference between the current date and the date in the dataframe, in seconds, for example:
df.withColumn("difference", currentDate - unix_timestamp(col(date)))
+----------------+------------+
|date | difference |
+----------------+------------+
|8/11/2017 15:00 | xxxxxxxxxx |
|9/11/2017 10:00 | xxxxxxxxxx |
+----------------+------------+
I tried
val current = current_timestamp()
df.withColumn("difference", current - unix_timestamp(col(date)))
but I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' due to data type mismatch: differing types in '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' (timestamp and bigint).;;
I also tried
val current = BigInt(System.currentTimeMillis / 1000)
df.withColumn("difference", current - unix_timestamp(col(date)))
and
val current = unix_timestamp(current_timestamp())
but the column "difference" is null.
Thanks
You have to use the correct format for unix_timestamp (note MM for month, not mm):
df.withColumn("difference", current_timestamp().cast("long") - unix_timestamp(col("date"), "dd/MM/yyyy HH:mm"))
or, with a recent version:
to_timestamp(col("date"), "dd/MM/yyyy HH:mm") - current_timestamp()
to get an Interval column.