Previous and next month, year based on date in PySpark

I have data like this, and I need output like this. How do I achieve this in PySpark?

Use date functions.
First convert the string to a date using to_date. Next, use add_months to derive the previous and next months, format each with date_format, and store each pair in an array. Combine the two arrays into a struct column, then expand the struct into separate columns. Code and logic below.
Data
df = spark.createDataFrame([('1', 'A', '14-02-22'),
                            ('1', 'B', '11-03-22')],
                           ('Id', 'Status', 'Date'))
df.show()
+---+------+--------+
| Id|Status|    Date|
+---+------+--------+
|  1|     A|14-02-22|
|  1|     B|11-03-22|
+---+------+--------+
from pyspark.sql import functions as F

(df.withColumn('Date', F.to_date('Date', "dd-MM-yy"))  # Coerce string to date
   .withColumn('Date1', F.struct(
       F.array(F.date_format(F.add_months('Date', -1), "MMM yy"),
               F.date_format('Date', "MMM yy")).alias('Previous'),  # Array of previous month and current month, stored in the struct as Previous
       F.array(F.date_format('Date', "MMM yy"),
               F.date_format(F.add_months('Date', 1), "MMM yy")).alias('Next')))  # Array of current month and next month, stored in the struct as Next
 .select('Id', 'Status', 'Date', 'Date1.*')  # Select required columns, expanding Date1 so each struct field becomes a column
).show(truncate=False)
+---+------+----------+----------------+----------------+
|Id |Status|Date      |Previous        |Next            |
+---+------+----------+----------------+----------------+
|1  |A     |2022-02-14|[Jan 22, Feb 22]|[Feb 22, Mar 22]|
|1  |B     |2022-03-11|[Feb 22, Mar 22]|[Mar 22, Apr 22]|
+---+------+----------+----------------+----------------+
Alternatively, use concat_ws as follows:
(df.withColumn('Date', F.to_date('Date', "dd-MM-yy"))  # Coerce string to date
   .withColumn('Date1', F.struct(
       F.concat_ws('-', F.date_format(F.add_months('Date', -1), "MMM yy"),
                   F.date_format('Date', "MMM yy")).alias('Previous'),  # Previous month concatenated with current month
       F.concat_ws('-', F.date_format(F.add_months('Date', 1), "MMM yy"),
                   F.date_format('Date', "MMM yy")).alias('Next')))  # Next month concatenated with current month
 .select('Id', 'Status', 'Date', 'Date1.*')  # Select required columns, expanding Date1 so each struct field becomes a column
).show(truncate=False)
+---+------+----------+-------------+-------------+
|Id |Status|Date      |Previous     |Next         |
+---+------+----------+-------------+-------------+
|1  |A     |2022-02-14|Jan 22-Feb 22|Mar 22-Feb 22|
|1  |B     |2022-03-11|Feb 22-Mar 22|Apr 22-Mar 22|
+---+------+----------+-------------+-------------+

The essential functions for this task are date_format, to render the date in the desired format, and add_months, to add or subtract months from a date.
from pyspark.sql import functions as F
date_df = spark.createDataFrame(
    [
        ('A', '02/09/2022'),
        ('B', '02/07/2022'),
    ],
    ['name', 'date'])
(
    date_df
    .withColumn('date', F.to_date('date', 'dd/MM/yyyy'))
    .withColumn(
        'current_month',
        F.date_format(F.col('date'), 'MMM yyyy'))
    .withColumn(
        'prev_month',
        F.date_format(
            F.add_months(F.col('date'), -1),
            'MMM yyyy'))
    .withColumn(
        'next_month',
        F.date_format(
            F.add_months(F.col('date'), 1),
            'MMM yyyy'))
    .withColumn(
        'Previous',
        F.concat(F.col('prev_month'), F.lit('-'), F.col('current_month')))
    .withColumn(
        'Next',
        F.concat(F.col('current_month'), F.lit('-'), F.col('next_month')))
).show()
+----+----------+-------------+----------+----------+-----------------+-----------------+
|name|      date|current_month|prev_month|next_month|         Previous|             Next|
+----+----------+-------------+----------+----------+-----------------+-----------------+
|   A|2022-09-02|     Sep 2022|  Aug 2022|  Oct 2022|Aug 2022-Sep 2022|Sep 2022-Oct 2022|
|   B|2022-07-02|     Jul 2022|  Jun 2022|  Aug 2022|Jun 2022-Jul 2022|Jul 2022-Aug 2022|
+----+----------+-------------+----------+----------+-----------------+-----------------+

Related

How to get 1st day of the year in pyspark

I have a date variable that I need to pass to various functions.
For example, if I have the date in a variable as 12/09/2021, it should return 01/01/2021.
How do I get the 1st day of the year in PySpark?
You can use the trunc function, which truncates parts of a date.
from pyspark.sql import functions as f

df = spark.createDataFrame([()], [])
(
    df
    .withColumn('current_date', f.current_date())
    .withColumn("year_start", f.trunc("current_date", "year"))
    .show()
)
# Output
+------------+----------+
|current_date|year_start|
+------------+----------+
| 2022-02-23|2022-01-01|
+------------+----------+
x = '12/09/2021'
'01/01/' + x[-4:]
# output: '01/01/2021'
You can achieve this with date_trunc wrapped in to_date, since date_trunc returns a Timestamp rather than a Date.
Data Preparation
import pandas as pd
from pyspark.sql import functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2002-02-09', '2009-09-19'],
})
sparkDF = spark.createDataFrame(df)
sparkDF.show()
+----------+
| Date|
+----------+
|2021-01-23|
|2002-02-09|
|2009-09-19|
+----------+
Date Trunc & To Date
sparkDF = sparkDF.withColumn('first_day_year_dt', F.to_date(F.date_trunc('year', F.col('Date')), 'yyyy-MM-dd'))\
                 .withColumn('first_day_year_timestamp', F.date_trunc('year', F.col('Date')))
sparkDF.show()
+----------+-----------------+------------------------+
|      Date|first_day_year_dt|first_day_year_timestamp|
+----------+-----------------+------------------------+
|2021-01-23|       2021-01-01|     2021-01-01 00:00:00|
|2002-02-09|       2002-01-01|     2002-01-01 00:00:00|
|2009-09-19|       2009-01-01|     2009-01-01 00:00:00|
+----------+-----------------+------------------------+

Spark dataframe filter a timestamp by just the date part

How can I filter a Spark dataframe that has a column of type timestamp by just the date part? I tried the below, but it only matches if the time is 00:00:00.
Basically, I want the filter to match all rows with date 2020-01-01 (3 rows).
import java.sql.Timestamp
val df = Seq(
  (1, Timestamp.valueOf("2020-01-01 23:00:01")),
  (2, Timestamp.valueOf("2020-01-01 00:00:00")),
  (3, Timestamp.valueOf("2020-01-01 12:54:00")),
  (4, Timestamp.valueOf("2019-12-15 09:54:00")),
  (5, Timestamp.valueOf("2019-12-09 10:12:43"))
).toDF("someCol", "someTimeStamp")
df.filter(df("someTimeStamp") === "2020-01-01").show
+-------+-------------------+
|someCol|      someTimeStamp|
+-------+-------------------+
|      2|2020-01-01 00:00:00|   // ONLY MATCHED with time 00:00
+-------+-------------------+
Use the to_date function to extract the date from the timestamp:
scala> import org.apache.spark.sql.functions.to_date

scala> df.filter(to_date(df("someTimeStamp")) === "2020-01-01").show
+-------+-------------------+
|someCol|      someTimeStamp|
+-------+-------------------+
|      1|2020-01-01 23:00:01|
|      2|2020-01-01 00:00:00|
|      3|2020-01-01 12:54:00|
+-------+-------------------+

Why does the aggregation function pyspark.sql.functions.collect_list() add the local timezone offset on display?

I run the following code in a PySpark shell session. Running collect_list() after a groupBy changes how timestamps are displayed (a UTC+02:00 offset is added, probably because this is the local offset in Greece, where the code is run). Although the display is problematic, the timestamp under the hood remains unchanged. This can be observed either by adding a column with the actual unix timestamps or by reverting the dataframe to its initial shape using pyspark.sql.functions.explode(). Is this a bug?
import datetime
import os
import time
from pyspark.sql import functions, types
# configure utc timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")
os.environ['TZ'] = 'UTC'
time.tzset()
# create DataFrame
date_time = datetime.datetime(year = 2019, month=1, day=1, hour=12)
data = [(1, date_time), (1, date_time)]
schema = types.StructType([types.StructField("id", types.IntegerType(), False), types.StructField("time", types.TimestampType(), False)])
df_test = spark.createDataFrame(data, schema)
df_test.show()
+---+-------------------+
| id|               time|
+---+-------------------+
|  1|2019-01-01 12:00:00|
|  1|2019-01-01 12:00:00|
+---+-------------------+
# GroupBy and collect_list
df_test1 = df_test.groupBy("id").agg(functions.collect_list("time"))
df_test1.show(1, False)
+---+----------------------------------------------+
|id |collect_list(time) |
+---+----------------------------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|
+---+----------------------------------------------+
# add column with unix timestamps
to_timestamp = functions.udf(lambda x : [value.timestamp() for value in x], types.ArrayType(types.FloatType()))
df_test1 = df_test1.withColumn("unix_timestamp", to_timestamp(functions.col("collect_list(time)")))
df_test1.show(1, False)
+---+----------------------------------------------+----------------------------+
|id |collect_list(time) |unix_timestamp |
+---+----------------------------------------------+----------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|[1.54634394E9, 1.54634394E9]|
+---+----------------------------------------------+----------------------------+
# explode list to distinct rows
df_test1.groupBy("id").agg(functions.collect_list("time")).withColumn("test", functions.explode(functions.col("collect_list(time)"))).show(2, False)
+---+----------------------------------------------+-------------------+
|id |collect_list(time) |test |
+---+----------------------------------------------+-------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
+---+----------------------------------------------+-------------------+
P.S. 1.54634394E9 corresponds to 2019-01-01 12:00:00, which is the correct UTC timestamp.
For me the code above works and does not shift the time as in your case.
Maybe check what your session time zone is (and, optionally, set it to some tz):
spark.conf.get('spark.sql.session.timeZone')
In general, TimestampType in PySpark is not tz-aware as in Pandas; under the hood it carries long integers and displays them according to your machine's local time zone (by default).
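For reference, here is a minimal sketch of checking and pinning the session time zone (assuming an already running SparkSession named spark); the stored instants do not change, only how they are rendered:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the time zone currently used to render timestamps
print(spark.conf.get("spark.sql.session.timeZone"))

# Pin the session time zone so timestamps are displayed in UTC
spark.conf.set("spark.sql.session.timeZone", "UTC")

# The underlying value is an instant (microseconds since the epoch);
# only its string rendering depends on the session time zone
spark.sql("SELECT CAST('2019-01-01 12:00:00' AS TIMESTAMP) AS ts").show()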

Converting string time to day timestamp

I have just started working with PySpark and need some help converting a column's datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp), we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
To retrieve the time of day in AM/PM, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not A or P, so we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+

How to calculate difference between date column and current date?

I am trying to calculate the date difference between a column field and the current date of the system.
Here is my sample code, where I have hard-coded my column field with 20170126.
val currentDate = java.time.LocalDate.now
var datediff = spark.sqlContext.sql("""Select datediff(to_date('$currentDate'),to_date(DATE_FORMAT(CAST(unix_timestamp( cast('20170126' as String), 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'))) AS GAP
""")
datediff.show()
Output is like:
+----+
| GAP|
+----+
|null|
+----+
I need to calculate the actual gap between the two dates but am getting NULL.
You have not defined the type and format of "column field" so I assume it's a string in the (not-very-pleasant) format YYYYMMdd.
val records = Seq((0, "20170126")).toDF("id", "date")
scala> records.show
+---+--------+
| id| date|
+---+--------+
| 0|20170126|
+---+--------+
scala> records
  .withColumn("year", substring($"date", 0, 4))
  .withColumn("month", substring($"date", 5, 2))
  .withColumn("day", substring($"date", 7, 2))
  .withColumn("d", concat_ws("-", $"year", $"month", $"day"))
  .select($"id", $"d" cast "date")
  .withColumn("datediff", datediff(current_date(), $"d"))
  .show
+---+----------+--------+
| id|         d|datediff|
+---+----------+--------+
|  0|2017-01-26|      83|
+---+----------+--------+
PROTIP: Read up on the functions object.
Caveats
cast
Please note that I could not convince Spark SQL to cast the column "date" to DateType given the rules in DateTimeUtils.stringToDate:
yyyy,
yyyy-[m]m
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d *
yyyy-[m]m-[d]dT*
date_format
I could not convince date_format to work either, so I parsed the "date" column myself using the substring and concat_ws functions.
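As a side note, on Spark 2.2 and later, to_date also accepts an explicit format, which avoids the substring/concat_ws workaround. A minimal PySpark sketch, assuming a SparkSession named spark:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

records = spark.createDataFrame([(0, "20170126")], ["id", "date"])

# Parse the yyyyMMdd string directly into a date, then diff against today
(records
 .withColumn("d", F.to_date("date", "yyyyMMdd"))
 .withColumn("datediff", F.datediff(F.current_date(), F.col("d")))
 .show())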