How to convert short date D-MMM-yyyy using PySpark - pyspark

Why does only the Jan date convert when I use the code below?
df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "D-MMM-yyyy"))
display(df2)
Result:
Date
------------
undefined
2021-01-02

D is the day of the year.
The second row works because day 02 of the year does fall in January, but day 05 of the year does not fall in November.
If you try:
data = [{"date": "05-Jan-2000"}, {"date": "02-Jan-2021"}]
It will work for both.
However, you need d which is the day of the month. So use d-MMM-yyyy.
For further information please see: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
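To make the mismatch concrete, here is a small plain-Python sketch (not Spark code) that computes the day-of-year for the two sample dates. The helper name `day_of_year` is hypothetical; it only illustrates the quantity that the `D` pattern letter stands for.

```python
from datetime import date


def day_of_year(year: int, month: int, day: int) -> int:
    """Return the 1-based day of the year for a calendar date."""
    return date(year, month, day).timetuple().tm_yday


# 5 November 2000 is day 310 of the year, so "05" cannot be its day-of-year.
nov_5 = day_of_year(2000, 11, 5)
# 2 January 2021 is day 2 of the year, so "02" happens to be consistent.
jan_2 = day_of_year(2021, 1, 2)

print(nov_5, jan_2)  # 310 2
```

For the January row, day-of-year and day-of-month coincide, which is why that row alone parses with the `D` pattern.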

D is day-of-the-year.
What you're looking for is d - day of the month.
PySpark supports the Java DateTimeFormatter patterns: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/time/format/DateTimeFormatter.html
from pyspark.sql.functions import col, to_date
df2 = spark.createDataFrame([["05-Nov-2000"], ["02-Jan-2021"]], ["date"])
df2 = df2.withColumn("date", to_date(col("date"), "dd-MMM-yyyy"))
df2.show()
+----------+
| date|
+----------+
|2000-11-05|
|2021-01-02|
+----------+

Related

to_date in selectexpr of pyspark is truncating date time to year by default. how to avoid this?

I have a requirement where I derive the year to be loaded and then have to load the first day and last day of that year in date format to a table.
Here is what I'm doing:
boy = str(nxt_yr)+'01-01'
eoy = str(nxt_yr)+'12-31'
df_final = df_demo.selectExpr("to_date('{}','yyyy-MM-dd') as strt_dt".format(boy),"to_date('{}','yyyy-MM-dd') as end_dt".format(eoy))
spark.sql("set spark.sql.legacy.timeParserPolicy = LEGACY")
df_final.show(1)
This is giving me 2023-01-01 in both fields, in date datatype.
Is this expected behavior and if yes is there any workaround?
Note: I tried hardcoding the date as 2022-11-30 also in the code but still received the beginning of the year in the output.
It's working as expected. Additionally, you are missing a - between the year and the month in the date strings you build for conversion:
nxt_yr = 2022
boy = str(nxt_yr)+'-01-01'  # note the hyphen before the month
eoy = str(nxt_yr)+'-12-31'
spark.sql("set spark.sql.legacy.timeParserPolicy = LEGACY")
spark.sql(f"""
SELECT
to_date('{boy}','yyyy-MM-dd') as strt_dt
,to_date('{eoy}','yyyy-MM-dd') as end_dt
"""
).show()
+----------+----------+
| strt_dt| end_dt|
+----------+----------+
|2022-01-01|2022-12-31|
+----------+----------+
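One way to avoid this kind of missing-separator mistake is to build the boundary dates from date objects rather than by string concatenation. The following is a minimal plain-Python sketch; the helper name `year_bounds` is hypothetical, and the resulting strings are assumed to be passed on to `to_date(..., 'yyyy-MM-dd')` as in the query above.

```python
from datetime import date
from typing import Tuple


def year_bounds(year: int) -> Tuple[str, str]:
    """Return the first and last day of `year` as ISO yyyy-MM-dd strings."""
    first_day = date(year, 1, 1)
    last_day = date(year, 12, 31)
    # isoformat() always emits the hyphens, so none can be forgotten.
    return first_day.isoformat(), last_day.isoformat()


boy, eoy = year_bounds(2022)
print(boy, eoy)  # 2022-01-01 2022-12-31
```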

Split date into day of the week, month,year using Pyspark

I have very little experience in PySpark and I am trying, with no success, to create 3 new columns from a column that contains the timestamp of each row.
The column containing the date has the following format: EEE MMM dd HH:mm:ss Z yyyy.
So it looks like this:
+--------------------+
| timestamp|
+--------------------+
|Fri Oct 18 17:07:...|
|Mon Oct 21 21:49:...|
|Thu Oct 31 18:03:...|
|Sun Oct 20 15:00:...|
|Mon Sep 30 23:35:...|
+--------------------+
The 3 columns have to contain: the day of the week as an integer (so 0 for Monday, 1 for Tuesday...), the number of the month, and the year.
What is the most effective way to create these additional 3 columns and append them to the pyspark dataframe? Thanks in advance!!
Spark 1.5 and higher has many date processing functions (dayofweek was added in Spark 2.3). Here are some that may be useful for you:
from pyspark.sql.functions import col, year, month, dayofweek
# your_date_column must be a date or timestamp column; parse a string column first (for example with to_timestamp).
# dayofweek returns 1 for Sunday through 7 for Saturday.
df = df.withColumn('dayOfWeek', dayofweek(col('your_date_column')))
df = df.withColumn('month', month(col('your_date_column')))
df = df.withColumn('year', year(col('your_date_column')))
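Note that the question asks for 0 = Monday, while `dayofweek` numbers the days 1 = Sunday through 7 = Saturday, so a shift is needed. The plain-Python sketch below checks the arithmetic; the helper names `spark_dayofweek` and `monday_zero` are hypothetical and only mimic the numbering, they are not Spark functions.

```python
from datetime import date, timedelta


def spark_dayofweek(d: date) -> int:
    """Mimic Spark's dayofweek numbering: 1 = Sunday ... 7 = Saturday."""
    return d.isoweekday() % 7 + 1


def monday_zero(spark_dow: int) -> int:
    """Convert 1 = Sunday ... 7 = Saturday into 0 = Monday ... 6 = Sunday."""
    return (spark_dow + 5) % 7


# Check the conversion against Python's own Monday-based weekday()
# for seven consecutive days, starting on Friday 18 October 2019.
start = date(2019, 10, 18)
for offset in range(7):
    day = start + timedelta(days=offset)
    assert monday_zero(spark_dayofweek(day)) == day.weekday()

print(monday_zero(spark_dayofweek(start)))  # 4 (Friday)
```

Assuming the column has already been parsed to a timestamp, the same shift could be written on a Spark column as `(dayofweek(col('your_date_column')) + 5) % 7`.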

How to parse a string like 4/23/19 to a timestamp in pyspark

I have some columns with dates from source files that look like 4/23/19.
The 4 is the month, the 23 is the day, and the 19 is 2019.
How do I convert this to a timestamp in PySpark?
So far I have:
def ParseDateFromFormats(col, formats):
    return coalesce(*[to_timestamp(col, f) for f in formats])
df2 = df2.withColumn("_" + field.columnName, ParseDateFromFormats(df2[field.columnName], ["dd/MM/yyyy hh:mm", "dd/MM/yyyy", "dd-MMM-yy"]).cast(field.simpleTypeName))
There doesn't seem to be a date format that would work.
The reason your code didn't work might be that you reversed days and months: the input is month/day/year, but your formats all start with dd.
This works:
from pyspark.sql.functions import to_date
time_df = spark.createDataFrame([('4/23/19',)], ['dt'])
time_df.withColumn('proper_date', to_date('dt', 'MM/dd/yy')).show()
+-------+-----------+
| dt|proper_date|
+-------+-----------+
|4/23/19| 2019-04-23|
+-------+-----------+

Spark 2.2 extracting date not working from unix_timestamp

In Spark 2.2 extracting date not working from unix_timestamp
Input Data:
+-------------------------+
|UPDATE_TS |
+-------------------------+
|26NOV2009:03:27:01.154410|
|24DEC2012:00:47:46.805710|
|02MAY2013:00:45:33.233844|
|21NOV2014:00:33:39.350140|
|10DEC2013:00:30:30.532446|
I tried the following approaches, but the output I'm getting is null.
Queries tried:
Spark SQL:
sqlContext.sql("select from_unixtime(unix_timestamp(UPDATE_TS,'ddMMMyyyy:HH:MM:SS.ssssss'), 'yyyy') as new_date from df_vendor_tab").show()
DSL:
df_id.withColumn('part_date', from_unixtime(unix_timestamp(df_id.UPDATE_TS, "ddMMMyyyy:HH:MM:SS.sss"), "yyyy"))
expected output:
2009
2012
2013
2014
2013
You're using an incorrect format string. Capital M is for month; lowercase m is for minute. Likewise, lowercase s is for second and capital S is for fraction of a second.
The following would work:
from pyspark.sql.functions import from_unixtime, unix_timestamp, to_date
df_id.withColumn(
'part_date',
from_unixtime(unix_timestamp(df_id.UPDATE_TS, "ddMMMyyyy:HH:mm:ss.SSSSSS"), "yyyy")
).show(truncate=False)
#+-------------------------+---------+
#|UPDATE_TS |part_date|
#+-------------------------+---------+
#|26NOV2009:03:27:01.154410|2009 |
#|24DEC2012:00:47:46.805710|2012 |
#|02MAY2013:00:45:33.233844|2013 |
#|21NOV2014:00:33:39.350140|2014 |
#|10DEC2013:00:30:30.532446|2013 |
#+-------------------------+---------+
Plain Spark SQL also works fine with unix_timestamp and from_unixtime:
sqlContext.sql("Select from_unixtime(unix_timestamp('26NOV2009:03:27:01.154410', 'ddMMMyyyy'), 'yyyy')").show()
Output:
+----+
| _c0|
+----+
|2009|
+----+
Since you are only extracting the year, I have not considered the hours, minutes, and seconds.

Subtract current date with another date in dataframe scala

First of all, thank you for the time in reading my question :)
My question is the following: in Spark with Scala, I have a dataframe that contains a string column with a date in the format dd/MM/yyyy HH:mm, for example df:
+----------------+
|date |
+----------------+
|8/11/2017 15:00 |
|9/11/2017 10:00 |
+----------------+
I want to get the difference between the current date and the date in the dataframe, in seconds, for example:
df.withColumn("difference", currentDate - unix_timestamp(col(date)))
+----------------+------------+
|date | difference |
+----------------+------------+
|8/11/2017 15:00 | xxxxxxxxxx |
|9/11/2017 10:00 | xxxxxxxxxx |
+----------------+------------+
I tried
val current = current_timestamp()
df.withColumn("difference", current - unix_timestamp(col(date)))
but got this error:
org.apache.spark.sql.AnalysisException: cannot resolve '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' due to data type mismatch: differing types in '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' (timestamp and bigint).;;
I also tried
val current = BigInt(System.currenttimeMillis / 1000)
df.withColumn("difference", current - unix_timestamp(col(date)))
and
val current = unix_timestamp(current_timestamp())
but the column "difference" is null.
Thanks
You have to use the correct format for unix_timestamp (MM is month; mm is minute):
df.withColumn("difference", current_timestamp().cast("long") - unix_timestamp(col("date"), "dd/MM/yyyy HH:mm"))
or, with a recent version:
df.withColumn("difference", current_timestamp() - to_timestamp(col("date"), "dd/MM/yyyy HH:mm"))
to get an interval column.
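The underlying idea, parsing the string with a pattern that matches it and then subtracting on a common scale, can be sketched in plain Python (not Spark). The helper name `seconds_since` and the fixed `fixed_now` value are hypothetical, chosen so the result is reproducible. Note that Python's `strptime` uses `%m` for month and `%M` for minute, the opposite letter case from Spark's `MM` and `mm`.

```python
from datetime import datetime


def seconds_since(date_str: str, now: datetime) -> int:
    """Return whole seconds elapsed between a d/M/yyyy HH:mm string and `now`."""
    # %d = day, %m = month, %Y = year, %H = hour, %M = minute (Python directives).
    parsed = datetime.strptime(date_str.strip(), "%d/%m/%Y %H:%M")
    return int((now - parsed).total_seconds())


# A fixed "current" time one hour after the first sample row.
fixed_now = datetime(2017, 11, 8, 16, 0)
diff = seconds_since("8/11/2017 15:00 ", fixed_now)
print(diff)  # 3600
```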