How can I convert this string to a date type so that I can eventually transform it into yyyy-MM-dd? Similar questions, such as Convert string of format MMM d yyyy hh:mm AM/PM to date using Pyspark, did not solve it.
import pyspark.sql.functions as F

df = spark.createDataFrame(sc.parallelize([
    ['Wed Sep 30 21:06:00 1998'],
    ['Fri Apr 1 08:37:00 2022'],
]), ['Date'])
+--------------------+
| Date|
+--------------------+
|Wed Sep 30 21:06:...|
|Fri Apr 1 08:37:...|
+--------------------+
# fails: 'DDD' (day-of-year) cannot match 'Wed', and 'hh' is the 1-12 clock hour
df.withColumn('Date', F.to_date(F.col('Date'), "DDD MMM dd hh:mm:ss yyyy")).show()
I think you are using the wrong symbols for day-of-week and hour; try this one:
from pyspark.sql.functions import to_date
df = spark.createDataFrame([('Wed Sep 30 21:06:00 1998',), ('Fri Apr 1 08:37:00 2022',)], 'Date: string')
df.withColumn('Date', to_date('Date', "E MMM dd HH:mm:ss yyyy")).show()
+----------+
| Date|
+----------+
|1998-09-30|
|2022-04-01|
+----------+
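One caveat, in case you are on Spark 3.x: its datetime parser allows the day-of-week letter 'E' for formatting only, so the parse above may fail unless you set spark.sql.legacy.timeParserPolicy=LEGACY. A minimal workaround sketch that drops the weekday token before parsing (substring_index keeps the last four space-separated fields):
from pyspark.sql import functions as F

# Strip the leading weekday ("Wed", "Fri", ...) and parse the rest.
df.withColumn(
    'Date',
    F.to_date(F.substring_index('Date', ' ', -4), 'MMM d HH:mm:ss yyyy')
).show()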
I'm trying to get the month and year out of a date, but something is wrong in the output for December 2020 only: it returns December 2021 instead of December 2020.
In the cancelation_year column I got the year using this function:
year(last_order_date), and it returns the year correctly.
In the cancelation_month_year column I used
date_format(last_order_date, 'MMMM YYYY'), and it only returns a wrong value for December 2020.
from pyspark.sql import functions as F

data = [{"dt": "12/27/2020 5:11:53 AM"}]
df = spark.createDataFrame(data)
# note: 'AM' is matched as a literal here; use the 'a' pattern instead if the data can also contain PM
df.withColumn("ts_new", F.date_format(F.to_date("dt", "M/d/y h:m:s 'AM'"), "MMMM yyyy")).show()
+--------------------+-------------+
| dt| ts_new|
+--------------------+-------------+
|12/27/2020 5:11:5...|December 2020|
+--------------------+-------------+
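For reference, the root cause: uppercase 'YYYY' is the week-based year, so the last days of December can land in week 1 of the next year, while lowercase 'yyyy' is the calendar year. A small sketch of the difference (assuming Spark 2.x, or spark.sql.legacy.timeParserPolicy=LEGACY on Spark 3, since the newer parser rejects week-based patterns outright):
from pyspark.sql import functions as F

# 2020-12-27 falls in week 1 of week-based year 2021.
demo = spark.createDataFrame([("2020-12-27",)], ["d"])
demo.select(
    F.date_format(F.to_date("d"), "MMMM yyyy").alias("calendar_year"),  # December 2020
    F.date_format(F.to_date("d"), "MMMM YYYY").alias("week_year"),      # December 2021
).show()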
Can you please help me convert the value below in PySpark in the best possible way? We can't handle this in the dataframe.
Input:
Aug 11, 2020 04:34:54.0 PM
to expected output:
2020-08-11 04:34:54:00 PM
Try the from_unixtime and unix_timestamp functions.
Example:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp, to_timestamp

# sample data in dataframe
df.show(10, False)
#+--------------------------+
#|ts |
#+--------------------------+
#|Aug 11, 2020 04:34:54.0 PM|
#+--------------------------+
df.withColumn("dt",from_unixtime(unix_timestamp(col("ts"),"MMM d, yyyy hh:mm:ss.SSS a"),"yyyy-MM-dd hh:mm:ss.SSS a")).\
show(10,False)
#+--------------------------+--------------------------+
#|ts |dt |
#+--------------------------+--------------------------+
#|Aug 11, 2020 04:34:54.0 PM|2020-08-11 04:34:54.000 PM|
#+--------------------------+--------------------------+
If you want the new column to be of timestamp type, then use Spark's to_timestamp function.
df.withColumn("dt",to_timestamp(col("ts"),"MMM d, yyyy hh:mm:ss.SSS a")).\
show(10,False)
#+--------------------------+-------------------+
#|ts |dt |
#+--------------------------+-------------------+
#|Aug 11, 2020 04:34:54.0 PM|2020-08-11 16:34:54|
#+--------------------------+-------------------+
df.withColumn("dt",to_timestamp(col("ts"),"MMM d, yyyy hh:mm:ss.SSS a")).printSchema()
#root
# |-- ts: string (nullable = true)
# |-- dt: timestamp (nullable = true)
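If you need both a real timestamp column and the 12-hour display string, you can also parse once and re-format from the parsed column; a sketch under the same format assumptions:
from pyspark.sql.functions import col, to_timestamp, date_format

# Parse to a timestamp once, then derive the display string from it.
df.withColumn("ts_parsed", to_timestamp(col("ts"), "MMM d, yyyy hh:mm:ss.SSS a")) \
  .withColumn("dt", date_format(col("ts_parsed"), "yyyy-MM-dd hh:mm:ss.SSS a")) \
  .show(10, False)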
I have a column with string values like '24 Jun 2020' and I want to cast it to date type.
Is there a way to specify the input and output date formats while casting from string to date?
Spark's default date format is yyyy-MM-dd. You can use to_date, to_timestamp, or from_unixtime + unix_timestamp to change your string to a date.
Example:
from pyspark.sql.functions import col, to_date, to_timestamp

df.show()
#+-----------+
#| dt|
#+-----------+
#|24 Jun 2020|
#+-----------+
# using to_date function
df.withColumn("new_format", to_date(col("dt"), 'dd MMM yyyy')).show()
# using to_timestamp function
df.withColumn("new_format", to_timestamp(col("dt"), 'dd MMM yyyy').cast("date")).show()
#+-----------+----------+
#| dt|new_format|
#+-----------+----------+
#|24 Jun 2020|2020-06-24|
#+-----------+----------+
df.withColumn("new_format", to_date(col("dt"),'dd MMM yyyy')).printSchema()
#root
# |-- dt: string (nullable = true)
# |-- new_format: date (nullable = true)
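For completeness, the from_unixtime + unix_timestamp route mentioned above; a sketch that yields the same result:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

# unix_timestamp parses the string to epoch seconds; from_unixtime renders it
# in the default yyyy-MM-dd HH:mm:ss form, which casts cleanly to date.
df.withColumn("new_format", from_unixtime(unix_timestamp(col("dt"), "dd MMM yyyy")).cast("date")).show()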
The default format for the date type is yyyy-MM-dd. In Scala:
val df1 = Seq("24 Jun 2020").toDF("dateStringType")
df1.show(false)
/**
* +--------------+
* |dateStringType|
* +--------------+
* |24 Jun 2020 |
* +--------------+
*/
// default date format is "yyyy-MM-dd"
df1.withColumn("dateDateType", to_date($"dateStringType", "dd MMM yyyy"))
.show(false)
/**
* +--------------+------------+
* |dateStringType|dateDateType|
* +--------------+------------+
* |24 Jun 2020 |2020-06-24 |
* +--------------+------------+
*/
// Use date_format to change the default date_format to "dd-MM-yyyy"
df1.withColumn("changDefaultFormat", date_format(to_date($"dateStringType", "dd MMM yyyy"), "dd-MM-yyyy"))
.show(false)
/**
* +--------------+------------------+
* |dateStringType|changDefaultFormat|
* +--------------+------------------+
* |24 Jun 2020 |24-06-2020 |
* +--------------+------------------+
*/
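If you are on PySpark rather than Scala, the equivalent sketch:
from pyspark.sql.functions import col, to_date, date_format

df1 = spark.createDataFrame([("24 Jun 2020",)], ["dateStringType"])
# to_date parses into the default yyyy-MM-dd; date_format re-renders as dd-MM-yyyy.
df1.withColumn("dateDateType", to_date(col("dateStringType"), "dd MMM yyyy")) \
   .withColumn("changDefaultFormat", date_format(col("dateDateType"), "dd-MM-yyyy")) \
   .show(truncate=False)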
The following example:
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([('Feb 4 1997 10:30:00',), ('Jan 14 2000 13:33:00',), ('Jan 22 2020 14:29:12',)], ['t'])
ts_format = "MMM dd YYYY HH:mm:ss"
df.select(df.t,
          F.to_timestamp(df.t, ts_format),
          F.date_format(F.current_timestamp(), ts_format)) \
  .show(truncate=False)
Outputs:
+--------------------+-----------------------------------------+------------------------------------------------------+
|t |to_timestamp(`t`, 'MMM dd YYYY HH:mm:ss')|date_format(current_timestamp(), MMM dd YYYY HH:mm:ss)|
+--------------------+-----------------------------------------+------------------------------------------------------+
|Feb 4 1997 10:30:00 |1996-12-29 10:30:00 |Jan 22 2020 14:38:28 |
|Jan 14 2000 13:33:00|1999-12-26 13:33:00 |Jan 22 2020 14:38:28 |
|Jan 22 2020 14:29:12|2019-12-29 14:29:12 |Jan 22 2020 14:38:28 |
+--------------------+-----------------------------------------+------------------------------------------------------+
Question:
The conversion from current_timestamp() to a string works with the given format. Why doesn't the other direction (string to timestamp) work?
Notes:
The pyspark 2.4.4 docs point to SimpleDateFormat patterns.
Changing the year's format to lowercase fixed the issue: in SimpleDateFormat patterns, uppercase 'Y' is the week-based year, while lowercase 'y' is the calendar year. When parsing, a week-based year resolves to the first day of that week-year, which is why the dates above all collapse to late December.
ts_format = "MMM dd yyyy HH:mm:ss"
I have a column that contains a date as a string, in the form Sat Sep 14 09:54:30 UTC 2019. I'm not familiar with this format at all.
I need to convert it to a date or timestamp, just a unit that I can compare against. I only need a point of comparison with a precision of one day.
This can help you get a timestamp from your string; you can then get the day out of it using Spark SQL (2.x):
spark.sql("""SELECT from_utc_timestamp(from_unixtime(unix_timestamp("Sat Sep 14 09:54:30 UTC 2019","EEE MMM dd HH:mm:ss zzz yyyy") ),"IST")as timestamp""").show()
+-------------------+
| timestamp|
+-------------------+
|2019-09-14 20:54:30|
+-------------------+
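If you prefer the DataFrame API over raw SQL, a sketch of the same parse; the 'EEE' and 'zzz' letters assume Spark 2.x or the legacy parser policy on Spark 3, and to_date truncates to the one-day precision asked for:
from pyspark.sql.functions import col, to_date

df = spark.createDataFrame([("Sat Sep 14 09:54:30 UTC 2019",)], ["raw"])
# to_date drops the time-of-day, leaving a date that compares at day precision.
df.withColumn("day", to_date(col("raw"), "EEE MMM dd HH:mm:ss zzz yyyy")).show()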