PySpark: Convert Date

How can I convert this date string to a date type so that I can eventually transform it into yyyy-MM-dd? Similar questions, such as "Convert string of format MMM d yyyy hh:mm AM/PM to date using Pyspark", did not solve it.
import pyspark.sql.functions as F

df = spark.createDataFrame(sc.parallelize([
    ['Wed Sep 30 21:06:00 1998'],
    ['Fri Apr 1 08:37:00 2022'],
]),
    ['Date'])
+--------------------+
|                Date|
+--------------------+
|Wed Sep 30 21:06:...|
| Fri Apr 1 08:37:...|
+--------------------+
# fails: DDD is day-of-year and hh is the 12-hour clock, so the pattern never matches
df.withColumn('Date', F.to_date(F.col('Date'), "DDD MMM dd hh:mm:ss yyyy")).show()

I think you are using the wrong symbols for day-of-week and hour: DDD is day-of-year, not day-of-week, and hh is the 12-hour clock. Try E for day-of-week and HH for the 24-hour clock:
from pyspark.sql.functions import to_date
df = spark.createDataFrame([('Wed Sep 30 21:06:00 1998',), ('Fri Apr 1 08:37:00 2022',)], 'Date: string')
df.withColumn('Date', to_date('Date', "E MMM dd HH:mm:ss yyyy")).show()
+----------+
|      Date|
+----------+
|1998-09-30|
|2022-04-01|
+----------+
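Note that on Spark 3.x with the default time parser, the two-letter dd pattern is strict, so the single-digit day in 'Fri Apr 1 ...' may fail to parse. A hedged variant using the single-letter d, which accepts one or two digits (an assumption about the Spark version, not part of the original answer):
from pyspark.sql.functions import to_date
# d instead of dd: tolerates single-digit days such as "Apr 1"
df.withColumn('Date', to_date('Date', 'E MMM d HH:mm:ss yyyy')).show()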

Related

date format function MMM YYYY in spark sql returning inaccurate values

I'm trying to get the month and year out of a date, but something is wrong in the output for December 2020 only: it returns December 2021 instead of December 2020.
In the cancelation_year column I got the year using year(last_order_date), and it returns the year correctly.
In the cancelation_month_year column I used date_format(last_order_date, 'MMMM YYYY'), and it only returns the wrong value for December 2020.
The culprit is the uppercase YYYY: it is the week-based year, not the calendar year, so dates in the last days of December can roll over into the following week-based year. Use lowercase yyyy instead:
from pyspark.sql import functions as F
data = [{"dt": "12/27/2020 5:11:53 AM"}]
df = spark.createDataFrame(data)
df.withColumn("ts_new", F.date_format(F.to_date("dt", "M/d/y h:m:s 'AM'"), "MMMM yyyy")).show()
+--------------------+-------------+
|                  dt|       ts_new|
+--------------------+-------------+
|12/27/2020 5:11:5...|December 2020|
+--------------------+-------------+
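To see the two patterns side by side, here is a minimal sketch of the pitfall. It assumes SimpleDateFormat semantics; Spark 3 rejects week-based letters such as Y unless spark.sql.legacy.timeParserPolicy is set to LEGACY:
from pyspark.sql import functions as F
# assumption: legacy parser, since Spark 3 bans week-based pattern letters
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df = spark.createDataFrame([("2020-12-27",)], ["d"])
df.select(
    F.date_format(F.to_date("d"), "MMMM yyyy").alias("calendar_year"),  # December 2020
    F.date_format(F.to_date("d"), "MMMM YYYY").alias("week_year")       # December 2021
).show()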

What is the best way to cast or handle the date datatype in PySpark?

Can you please help me cast the datatype below in PySpark in the best possible way? We can't handle this in the dataframe.
Input:
Aug 11, 2020 04:34:54.0 PM
to expected output:
2020-08-11 04:34:54:00 PM
Try the from_unixtime and unix_timestamp functions.
Example:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp, to_timestamp

#sample data in dataframe
df = spark.createDataFrame([("Aug 11, 2020 04:34:54.0 PM",)], ["ts"])
df.show(10,False)
#+--------------------------+
#|ts                        |
#+--------------------------+
#|Aug 11, 2020 04:34:54.0 PM|
#+--------------------------+
df.withColumn("dt",from_unixtime(unix_timestamp(col("ts"),"MMM d, yyyy hh:mm:ss.SSS a"),"yyyy-MM-dd hh:mm:ss.SSS a")).\
show(10,False)
#+--------------------------+--------------------------+
#|ts                        |dt                        |
#+--------------------------+--------------------------+
#|Aug 11, 2020 04:34:54.0 PM|2020-08-11 04:34:54.000 PM|
#+--------------------------+--------------------------+
If you want the new column to be of timestamp type, use the to_timestamp function:
df.withColumn("dt",to_timestamp(col("ts"),"MMM d, yyyy hh:mm:ss.SSS a")).\
show(10,False)
#+--------------------------+-------------------+
#|ts                        |dt                 |
#+--------------------------+-------------------+
#|Aug 11, 2020 04:34:54.0 PM|2020-08-11 16:34:54|
#+--------------------------+-------------------+
df.withColumn("dt",to_timestamp(col("ts"),"MMM d, yyyy hh:mm:ss.SSS a")).printSchema()
#root
# |-- ts: string (nullable = true)
# |-- dt: timestamp (nullable = true)
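If you need both the typed column and a formatted string, you can derive the string from the timestamp column with date_format. A minimal sketch; the dt_str column name is illustrative, not from the original answer:
from pyspark.sql.functions import col, to_timestamp, date_format
df = df.withColumn("dt", to_timestamp(col("ts"), "MMM d, yyyy hh:mm:ss.SSS a"))
# date_format returns a string column rendered in the requested pattern
df = df.withColumn("dt_str", date_format(col("dt"), "yyyy-MM-dd hh:mm:ss.SSS a"))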

How to change the date format from a string (24 Jun 2020) to date 24-06-2020 in Spark SQL?

I have a column with string values like '24 Jun 2020' and I want to cast it to date type.
Is there a way to specify the input and output date formats while casting from string to date?
Spark's default date format is yyyy-MM-dd. You can use to_date, to_timestamp, or from_unixtime + unix_timestamp to change your string to a date.
Example:
from pyspark.sql.functions import col, to_date, to_timestamp

df = spark.createDataFrame([("24 Jun 2020",)], ["dt"])
df.show()
#+-----------+
#|         dt|
#+-----------+
#|24 Jun 2020|
#+-----------+
#using to_date function
df.withColumn("new_format", to_date(col("dt"),'dd MMM yyyy')).show()
#using to_timestamp function
df.withColumn("new_format", to_timestamp(col("dt"),'dd MMM yyyy').cast("date")).show()
#+-----------+----------+
#|         dt|new_format|
#+-----------+----------+
#|24 Jun 2020|2020-06-24|
#+-----------+----------+
df.withColumn("new_format", to_date(col("dt"),'dd MMM yyyy')).printSchema()
#root
# |-- dt: string (nullable = true)
# |-- new_format: date (nullable = true)
The default output format for the date type is yyyy-MM-dd:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq("24 Jun 2020").toDF("dateStringType")
df1.show(false)
/**
 * +--------------+
 * |dateStringType|
 * +--------------+
 * |24 Jun 2020   |
 * +--------------+
 */
// default date format is "yyyy-MM-dd"
df1.withColumn("dateDateType", to_date($"dateStringType", "dd MMM yyyy"))
.show(false)
/**
* +--------------+------------+
* |dateStringType|dateDateType|
* +--------------+------------+
* |24 Jun 2020 |2020-06-24 |
* +--------------+------------+
*/
// use date_format to render the date as a "dd-MM-yyyy" string (the result is a string column, not a date)
df1.withColumn("changDefaultFormat", date_format(to_date($"dateStringType", "dd MMM yyyy"), "dd-MM-yyyy"))
  .show(false)
/**
 * +--------------+------------------+
 * |dateStringType|changDefaultFormat|
 * +--------------+------------------+
 * |24 Jun 2020   |24-06-2020        |
 * +--------------+------------------+
 */
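For completeness, the same two-step conversion in PySpark (a minimal sketch, not from the original answers): to_date parses the string into a date, and date_format renders it back as a dd-MM-yyyy string.
from pyspark.sql import functions as F
df = spark.createDataFrame([("24 Jun 2020",)], ["dt"])
df.withColumn("new_format",
              F.date_format(F.to_date("dt", "dd MMM yyyy"), "dd-MM-yyyy")).show()
#+-----------+----------+
#|         dt|new_format|
#+-----------+----------+
#|24 Jun 2020|24-06-2020|
#+-----------+----------+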

Unexpected date when converting string to timestamp in pyspark

The following example:
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([('Feb 4 1997 10:30:00',), ('Jan 14 2000 13:33:00',), ('Jan 22 2020 14:29:12',)], ['t'])
ts_format = "MMM dd YYYY HH:mm:ss"
df.select(df.t,
          F.to_timestamp(df.t, ts_format),
          F.date_format(F.current_timestamp(), ts_format))\
  .show(truncate=False)
Outputs:
+--------------------+-----------------------------------------+------------------------------------------------------+
|t                   |to_timestamp(`t`, 'MMM dd YYYY HH:mm:ss')|date_format(current_timestamp(), MMM dd YYYY HH:mm:ss)|
+--------------------+-----------------------------------------+------------------------------------------------------+
|Feb 4 1997 10:30:00 |1996-12-29 10:30:00                      |Jan 22 2020 14:38:28                                  |
|Jan 14 2000 13:33:00|1999-12-26 13:33:00                      |Jan 22 2020 14:38:28                                  |
|Jan 22 2020 14:29:12|2019-12-29 14:29:12                      |Jan 22 2020 14:38:28                                  |
+--------------------+-----------------------------------------+------------------------------------------------------+
Question:
The conversion from current_timestamp() to a string works with the given format. Why doesn't the other direction (string to timestamp)?
Notes:
The pyspark 2.4.4 docs point to SimpleDateFormat patterns, where uppercase Y is the week year rather than the calendar year. Formatting with Y looks correct because the week year usually coincides with the calendar year, but parsing a week year without week fields resolves to the start of that week-based year, which is why every input collapses to late December of the previous year.
Changing the year's format to lowercase fixed the issue:
ts_format = "MMM dd yyyy HH:mm:ss"

Converting a specific string format to date in Spark SQL

I have a column that contains a date as a string in the following format: Sat Sep 14 09:54:30 UTC 2019. I am not familiar with this format at all.
I need to convert it to a date or timestamp, just a unit that I can compare against; a precision of one day is enough.
This gets the timestamp out of your string with Spark SQL (2.x); the from_utc_timestamp wrapper then shifts it from UTC to IST, which is why the result below reads 20:54:30 rather than 09:54:30:
spark.sql("""SELECT from_utc_timestamp(from_unixtime(unix_timestamp("Sat Sep 14 09:54:30 UTC 2019","EEE MMM dd HH:mm:ss zzz yyyy")),"IST") as timestamp""").show()
+-------------------+
|          timestamp|
+-------------------+
|2019-09-14 20:54:30|
+-------------------+
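Since only day precision is needed, casting the parsed value down to a date gives a unit that compares cleanly. A minimal PySpark sketch under the same pattern assumptions:
from pyspark.sql import functions as F

df = spark.createDataFrame([("Sat Sep 14 09:54:30 UTC 2019",)], ["s"])
df.withColumn(
    "day",
    F.unix_timestamp("s", "EEE MMM dd HH:mm:ss zzz yyyy")  # seconds since epoch
     .cast("timestamp")                                    # timestamp in the session time zone
     .cast("date")                                         # truncated to day precision
).show(truncate=False)
# day = 2019-09-14 in a UTC session; the value can shift near midnight in other time zones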