The following example:
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([('Feb 4 1997 10:30:00',), ('Jan 14 2000 13:33:00',), ('Jan 22 2020 14:29:12',)], ['t'])
ts_format = "MMM dd YYYY HH:mm:ss"
df.select(df.t,
          F.to_timestamp(df.t, ts_format),
          F.date_format(F.current_timestamp(), ts_format))\
    .show(truncate=False)
Outputs:
+--------------------+-----------------------------------------+------------------------------------------------------+
|t |to_timestamp(`t`, 'MMM dd YYYY HH:mm:ss')|date_format(current_timestamp(), MMM dd YYYY HH:mm:ss)|
+--------------------+-----------------------------------------+------------------------------------------------------+
|Feb 4 1997 10:30:00 |1996-12-29 10:30:00 |Jan 22 2020 14:38:28 |
|Jan 14 2000 13:33:00|1999-12-26 13:33:00 |Jan 22 2020 14:38:28 |
|Jan 22 2020 14:29:12|2019-12-29 14:29:12 |Jan 22 2020 14:38:28 |
+--------------------+-----------------------------------------+------------------------------------------------------+
Question:
The conversion from current_timestamp() to string works with the given format. Why doesn't it work the other way (string to timestamp)?
Notes:
The PySpark 2.4.4 docs point to SimpleDateFormat patterns.
Changing the year's format to lowercase fixed the issue, as sketched below:
ts_format = "MMM dd yyyy HH:mm:ss"
Related
How can I convert this date to a date format so that I can eventually transform it into yyyy-MM-dd? Similar examples, such as Convert string of format MMM d yyyy hh:mm AM/PM to date using Pyspark, could not solve it.
df = spark.createDataFrame(sc.parallelize([
    ['Wed Sep 30 21:06:00 1998'],
    ['Fri Apr 1 08:37:00 2022'],
]), ['Date'])
+--------------------+
| Date|
+--------------------+
|Wed Sep 30 21:06:...|
|Fri Apr 1 08:37:...|
+--------------------+
# fail - 'DDD' is day-of-year (not day-of-week) and 'hh' is the 12-hour clock
df.withColumn('Date', F.to_date(F.col('Date'), "DDD MMM dd hh:mm:ss yyyy")).show()
I think you are using the wrong symbols for day-of-week and hour - try this one:
from pyspark.sql.functions import to_date
df = spark.createDataFrame([('Wed Sep 30 21:06:00 1998',), ('Fri Apr 1 08:37:00 2022',)], 'Date: string')
df.withColumn('Date', to_date('Date', "E MMM dd HH:mm:ss yyyy")).show()
+----------+
| Date|
+----------+
|1998-09-30|
|2022-04-01|
+----------+
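If the time of day is needed as well, the same pattern can be passed to to_timestamp instead of to_date - a sketch only, assuming the pattern above parses on your Spark version (on Spark 3 the day-of-week symbol may need the legacy parser policy, as in the last example further down):
from pyspark.sql.functions import to_timestamp
# sketch: keep the full timestamp instead of truncating to a date
df.withColumn('Date', to_timestamp('Date', "E MMM dd HH:mm:ss yyyy")).show(truncate=False)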
I am trying to convert a date-time string to a DateTime, but it's showing "FormatException (FormatException: Trying to read dd from MONDAY at position 0)".
safePrint(Intl.withLocale(
'en',
() => new DateFormat("dd, EEE MMM yyyy HH:mm:ss \'GMT\'")
.parse(batch_timing[j])));
print(new DateFormat('dd, EEE MMM yyyy HH:mm:ss \'GMT\'')
.parse(batch_timing[j]));
Note: the date string is "Mon, 05 Sep 2022 10:00:00 GMT"
There is a correction needed in the conversion.
Check the below:
String mDate ="Mon, 05 Sep 2022 10:00:00 GMT";
print(DateFormat('EEE, dd MMM yyyy HH:mm:ss \'GMT\'')
.parse(mDate));
You have to use EEE, dd MMM yyyy HH:mm:ss 'GMT' instead of dd, EEE MMM yyyy HH:mm:ss 'GMT' - the pattern fields have to appear in the same order as in the input string, so the day-of-week (EEE) comes first.
I want to convert a string date column to a date or timestamp (yyyy-MM-dd). How can I do it in Scala Spark SQL?
Input:
D1
Apr 24 2022
Jul 08 2021
Jan 16 2022
Expected:
D2
2022-04-24
2021-07-08
2022-01-16
You can use to_date and format the input accordingly.
You need to make the month characters uppercase for the pattern to recognize them.
Refer to this page for the format patterns:
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
select 'Apr 24 2022' D1, to_date(upper('Apr 24 2022'),'MMM dd yyyy') D2
union
select 'Jul 08 2021' D1, to_date(upper('Jul 08 2021'),'MMM dd yyyy') D2
union
select 'Jan 16 2022' D1, to_date(upper('Jan 16 2022'),'MMM dd yyyy') D2
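For reference, a minimal PySpark sketch mirroring the SQL above on a DataFrame column (to_date and upper exist in the Scala API as well; the column name d1 is just illustrative):
import pyspark.sql.functions as F
# sketch mirroring the SQL answer above
df = spark.createDataFrame([('Apr 24 2022',), ('Jul 08 2021',), ('Jan 16 2022',)], ['d1'])
df.select('d1', F.to_date(F.upper('d1'), 'MMM dd yyyy').alias('d2')).show()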
I am new to PySpark.
I am trying to convert a string with the value Jun 22 2021 1:04PM to a timestamp using the code block below, but it produces null, even though the resulting column's datatype shows as timestamp.
df = df.withColumn("date", F.from_unixtime(F.unix_timestamp("date","MMM d, yyyy hh:mm:ss a"),'yyyy-MM-dd').cast('timestamp'))
Your date is of the format MMM d yyyy hh:mmaa
To convert a string in that format, do it like below:
from pyspark.sql import functions as f
df.withColumn("date_2", f.from_unixtime(f.unix_timestamp("date", 'MMM d yyyy hh:mmaa'),'MM-dd-yyyy HH:mm:ss')).show()
try this one:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp
df.withColumn("date", from_unixtime(unix_timestamp(col("date"), "MMM d yyyy hh:mma"), "yyyy-MM-dd")).show(truncate=False)
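A possible shortcut, assuming the same input format, is to parse straight into a timestamp column with to_timestamp instead of round-tripping through unix_timestamp and from_unixtime - a sketch only, with the pattern guessed for a value like Jun 22 2021 1:04PM:
import pyspark.sql.functions as F
# sketch: parse the string directly into a timestamp column
df = df.withColumn("date_ts", F.to_timestamp("date", "MMM d yyyy h:mma"))
df.show(truncate=False)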
I need to convert Tue Jul 07 2020 12:30:42 to a timestamp with Scala for Spark.
So the expected result would be: 2020-07-07 12:30:42
Any idea how to do this, please?
You can use the to_timestamp function.
import org.apache.spark.sql.functions.{col, to_timestamp}
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")  // Spark 3.0+ only
df.withColumn("date", to_timestamp(col("string"), "E MMM dd yyyy HH:mm:ss"))
  .show(false)
+------------------------+-------------------+
|string |date |
+------------------------+-------------------+
|Tue Jul 07 2020 12:30:42|2020-07-07 12:30:42|
+------------------------+-------------------+
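If setting the legacy parser policy is not an option, an alternative sketch is to drop the leading day-of-week before parsing, so the E symbol is not needed - shown in PySpark for illustration and assuming the fixed-width Tue Jul 07 2020 12:30:42 layout:
import pyspark.sql.functions as F
# sketch: strip the leading "Tue " and parse the remaining 20 characters
df.withColumn("date", F.to_timestamp(F.substring("string", 5, 20), "MMM dd yyyy HH:mm:ss")).show(truncate=False)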