spark date_format showing null in version 2.4 - scala

I am trying to convert a date format. I have data like the following in one column:
04-JUN-21 09.07.55.061067 PM
I am trying to convert it into this format:
2021-06-04 21:07:55
I am using the code below to do this:
val df = Seq("04-JUN-21 09.07.55.061067 PM", "05-JUN-21 09.07.55.061067 PM").toDF("UPLOADED_DATE")
df.select(date_format(to_timestamp($"UPLOADED_DATE","dd-MMM-yy hh.mm.ss.SSSSSS a"),"yyyy-MM-dd HH:mm:ss") as "UPLOADED_DATE").show
df.registerTempTable("tbl")
spark.sql("select date_format(to_timestamp(UPLOADED_DATE,'dd-MMM-yy hh.mm.ss.SSSSSS a'),'yyyy-MM-dd HH:mm:ss') as UPLOADED_DATE from tbl").show()
I get this result with Spark 3.0.2:
+-------------------+
| UPLOADED_DATE|
+-------------------+
|2021-06-04 21:07:55|
|2021-06-05 21:07:55|
+-------------------+
But I get null with Spark 2.4.7:
+-------------+
|UPLOADED_DATE|
+-------------+
| null|
| null|
+-------------+
Kindly let me know if there is a different way to do this in Spark 2.4.

Here is a workaround. It seems the fractional-seconds part of the timestamp is causing issues, so if you're going to discard it anyway, you can remove it before calling to_timestamp:
df.select(
  date_format(
    to_timestamp(
      regexp_replace($"UPLOADED_DATE", "\\.\\d{6}", ""),
      "dd-MMM-yy hh.mm.ss a"
    ),
    "yyyy-MM-dd HH:mm:ss"
  ) as "UPLOADED_DATE"
).show
+-------------------+
| UPLOADED_DATE|
+-------------------+
|2021-06-04 21:07:55|
|2021-06-05 21:07:55|
+-------------------+
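If you need to keep the fractional seconds rather than discard them, another option on Spark 2.4 is to bypass the built-in parser with a small UDF based on java.time. This is only a sketch under the assumption that every value matches the pattern above; the helper name parseTs is made up for illustration, and I have not verified it on 2.4.7 specifically:
import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatterBuilder
import java.util.Locale
import org.apache.spark.sql.functions.{date_format, udf}

val parseTs = udf { s: String =>
  if (s == null) null
  else {
    // Built inside the UDF so nothing non-serializable is captured in the closure;
    // parseCaseInsensitive lets "JUN" and "PM" parse as written.
    val fmt = new DateTimeFormatterBuilder()
      .parseCaseInsensitive()
      .appendPattern("dd-MMM-yy hh.mm.ss.SSSSSS a")
      .toFormatter(Locale.ENGLISH)
    Timestamp.valueOf(LocalDateTime.parse(s, fmt))
  }
}

df.select(
  date_format(parseTs($"UPLOADED_DATE"), "yyyy-MM-dd HH:mm:ss") as "UPLOADED_DATE"
).show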

Related

Convert String to Timestamp in Spark (Hive) and the datetime is invalid

I'm trying to convert a String to a timestamp, but in my time zone the last Sunday of March between 2:00 AM and 3:00 AM does not exist (the DST change), and parsing such a time returns null. Example:
scala> spark.sql("select to_timestamp('20220327 021500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 021500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
| null|
+--------------------------------------------------+
only showing top 1 row
scala> spark.sql("select to_timestamp('20220327 031500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 031500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
| 2022-03-27 03:15:00|
+--------------------------------------------------+
only showing top 1 row
A solution might be to add one hour for times between 2:00 AM and 3:00 AM on those days, but I don't know how to implement this.
I can't change the data source.
What can I do?
Thanks
EDIT
The official documentation says:
In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp,
to_timestamp and to_date will fail if the specified datetime pattern
is invalid. In Spark 3.0 or earlier, they result NULL. (1)
Let's consider the following dataframe with a column called ts.
val df = Seq("20220327 021500", "20220327 031500", "20220327 011500").toDF("ts")
In Spark 3.1+, we can use to_timestamp, which automatically adds one hour in your situation:
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
| ts| time|
+---------------+-------------------+
|20220327 021500|2022-03-27 03:15:00|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
In Spark 3.0 and 2.4.7, we obtain this:
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
| ts| time|
+---------------+-------------------+
|20220327 021500| null|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
But strangely, in Spark 2.4.7, to_utc_timestamp works the same way as to_timestamp does in later versions. The only problem is that we cannot use a custom date format. Yet, if we reformat the date ourselves, we obtain this:
df.withColumn("ts", concat(
substring('ts, 0, 4), lit("-"),
substring('ts, 5, 2), lit("-"),
substring('ts, 7, 5), lit(":"),
substring('ts,12,2), lit(":"),
substring('ts,14,2))
)
.withColumn("time", to_utc_timestamp('ts, "UTC"))
.show
+-------------------+-------------------+
| ts| time|
+-------------------+-------------------+
|2022-03-27 02:15:00|2022-03-27 03:15:00|
|2022-03-27 03:15:00|2022-03-27 03:15:00|
|2022-03-27 01:15:00|2022-03-27 01:15:00|
+-------------------+-------------------+
I found two different solutions; in both you have to change the string format to yyyy-MM-dd HH:mm:ss:
select cast(CONCAT(SUBSTR('20220327021000',0,4),'-',
                   SUBSTR('20220327021000',5,2),'-',
                   SUBSTR('20220327021000',7,2),' ',
                   SUBSTR('20220327021020',9,2),':',
                   SUBSTR('20220327021020',11,2),':',
                   SUBSTR('20220327021020',13,2)) as timestamp);

select cast(
  regexp_replace("20220327031005",
    '^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$', '$1-$2-$3 $4:$5:$6'
  ) as timestamp);
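If you prefer the DataFrame API, the same regexp_replace idea can be written like this. This is only a sketch: it assumes the column is named ts as in the earlier example, the optional space in the pattern covers the 'yyyyMMdd HHmmss' input shown above, and it relies on the same cast-to-timestamp behaviour as the SQL solutions:
import org.apache.spark.sql.functions.{col, regexp_replace}

// Rewrite "20220327 021500" (or "20220327021500") as "2022-03-27 02:15:00",
// then cast, which handles the DST gap like the SQL cast above.
df.withColumn(
  "time",
  regexp_replace(
    col("ts"),
    "^(\\d{4})(\\d{2})(\\d{2}) ?(\\d{2})(\\d{2})(\\d{2})$",
    "$1-$2-$3 $4:$5:$6"
  ).cast("timestamp")
).show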

How to get week of year in spark 3.0+?

I'm trying to create a calendar file with columns for day, month, etc. The following code works fine, but I couldn't find a clean way to extract the week of year (1-52). In spark 3.0+, the following line of code doesn't work: .withColumn("week_of_year", date_format(col("day_id"), "W"))
I know that I can create a view/table and then run a SQL query on it to extract the week_of_year, but is there no better way to do it?
df.withColumn("day_id", to_date(col("day_id"), date_fmt))
.withColumn("week_day", date_format(col("day_id"), "EEEE"))
.withColumn("month_of_year", date_format(col("day_id"), "M"))
.withColumn("year", date_format(col("day_id"), "y"))
.withColumn("day_of_month", date_format(col("day_id"), "d"))
.withColumn("quarter_of_year", date_format(col("day_id"), "Q"))
It seems those patterns are not supported anymore in Spark 3+:
Caused by: java.lang.IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: w, Please use the SQL function EXTRACT instead
You can use this:
import org.apache.spark.sql.functions._
df.withColumn("week_of_year", weekofyear($"date"))
TESTING
INPUT
val df = List("2021-05-15", "1985-10-05")
.toDF("date")
.withColumn("date", to_date($"date", "yyyy-MM-dd")
df.show
+----------+
| date|
+----------+
|2021-05-15|
|1985-10-05|
+----------+
OUTPUT
df.withColumn("week_of_year", weekofyear($"date")).show
+----------+------------+
| date|week_of_year|
+----------+------------+
|2021-05-15| 19|
|1985-10-05| 40|
+----------+------------+
The exception you saw recommends using the EXTRACT SQL function instead: https://spark.apache.org/docs/3.0.0/api/sql/index.html#extract
val df = Seq(("2019-11-16 16:50:59.406")).toDF("input_timestamp")
df.selectExpr("input_timestamp", "extract(week FROM input_timestamp) as w").show
+--------------------+---+
| input_timestamp| w|
+--------------------+---+
|2019-11-16 16:50:...| 46|
+--------------------+---+
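For completeness, the same EXTRACT expression can also be used from the DataFrame API via expr (a sketch reusing the input_timestamp column from the example above):
import org.apache.spark.sql.functions.expr

// Equivalent to the selectExpr call shown above.
df.withColumn("w", expr("extract(week FROM input_timestamp)")).show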

How to parse date time?

I'm trying to parse a date column that is currently type string. It is in the format of
2005-04-24T09:12:49Z
I have Spark version 2.1, and I've tried the following:
spark.sql("SELECT TO_DATE(Date) FROM df").show()
This returns
2005-04-24
but no timestamp.
Next I tried
val ts = unix_timestamp($"Date", "yyyy-dd-MM HH:mm:ss").cast("timestamp")
df.withColumn("Date", ts).show()
This returned all nulls
Then I tried
spark.sql("select TO_DATE(Date_Resulted, 'yyyy-MM-ddTHH:mm:ssZ') AS date from lab").show()
but this just returned the error:
org.apache.spark.sql.AnalysisException: Invalid number of arguments for function to_date; line 1 pos 7
There has to be an easy way to parse this string date column to return type DateTime. Any help would be much appreciated
There are different ways to get datetime in Spark.
Let's use the following sample data:
val df = Seq("2005-04-24T09:12:49Z").toDF("time_stamp")
df.createOrReplaceTempView("tmp")
Date from timestamp
//in spark sql api
spark.sql("select to_date(time_stamp)dt from tmp").show()
//in dataframe api
df.withColumn("dt",to_date('time_stamp)).select("dt").show()
Result:
//+----------+
//| dt|
//+----------+
//|2005-04-24|
//+----------+
get datetime from timestamp - Using from_unixtime and unix_timestamp functions
//in spark sql api
spark.sql("""select timestamp(from_unixtime(unix_timestamp(time_stamp,"yyyy-MM-dd'T'hh:mm:ss'Z'"),"yyyy-MM-dd hh:mm:ss")) as ts from tmp""").show()
//in dataframe api
df.withColumn("dt",from_unixtime(unix_timestamp('time_stamp,"yyyy-MM-dd'T'hh:mm:ss'Z'"),"yyyy-MM-dd hh:mm:ss").cast("timestamp")).select("dt").show()
// Result:
// +-------------------+
// | ts|
// +-------------------+
// |2005-04-24 09:12:49|
// +-------------------+
get datetime from timestamp - Using unix_timestamp function
//in spark sql api
spark.sql("""select timestamp(unix_timestamp(time_stamp,"yyyy-MM-dd'T'hh:mm:ss'Z'")) as ts from tmp""").show()
//in dataframe api
df.withColumn("dt",unix_timestamp('time_stamp,"yyyy-MM-dd'T'hh:mm:ss'Z'").cast("timestamp")).select("dt").show()
// Result:
// +-------------------+
// | ts|
// +-------------------+
// |2005-04-24 09:12:49|
// +-------------------+
get datetime from timestamp - Using to_timestamp function
//in spark sql api
spark.sql("select to_timestamp(time_stamp)ts from tmp").show()
//in dataframe api
df.withColumn("dt",to_timestamp('time_stamp)).select("dt").show()
// Result:
// +-------------------+
// | ts|
// +-------------------+
// |2005-04-24 04:12:49|
// +-------------------+
get datetime from timestamp - Using to_timestamp function with format specified
//in spark sql api
spark.sql("""select to_timestamp(time_stamp,"yyyy-MM-dd'T'hh:mm:ss'Z'")ts from tmp""").show()
//in dataframe api
df.withColumn("dt",to_timestamp($"time_stamp","yyyy-MM-dd'T'hh:mm:ss'Z'")).select("dt").show()
// Result:
// +-------------------+
// | ts|
// +-------------------+
// |2005-04-24 09:12:49|
// +-------------------+
This will work:
val df = Seq("2005-04-24T09:12:49Z").toDF("date")
df
.withColumn("date_converted", to_timestamp($"date", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.show()
gives:
+--------------------+-------------------+
| date| date_converted|
+--------------------+-------------------+
|2005-04-24T09:12:49Z|2005-04-24 09:12:49|
+--------------------+-------------------+

Converting string time to day timestamp

I have just started working with PySpark, and need some help converting a column's datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp) we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of day in AM/PM, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not just A or P, so we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+

How to retrieve the month from a date column values in scala dataframe?

Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
How can I do that in sparkSQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
NOTE:
The first string (to_date) must match the format of your existing date
Be careful with: "dd-MM-yyyy" vs "MM-dd-yyyy"
The second string (date_format) is the format of the output
Docs:
to_date
date_format
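If you prefer Spark SQL over the DataFrame API, the same expression should work there too (a quick sketch; the temp view name t is made up for illustration):
df.createOrReplaceTempView("t")
spark.sql(
  "select id, date, date_format(to_date(date, 'dd-MM-yyyy'), 'MMMM') as month from t"
).show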
Nothing is wrong with your code, just keep the format string consistent with your date column: the date is "04-04-2015", so the parsing pattern should be "dd-MM-yyyy" rather than "dd/MM/yy".
Not exactly related to this question, but for anyone who wants the month as an integer, there is a month function. It takes a single date/timestamp column, so convert the string first:
val df2 = df.withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| 4|
+---+----------+-----+
In the same way, you can use the year function to get only the year.
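A minimal sketch of that, assuming the same df and date format as above:
import org.apache.spark.sql.functions.{to_date, year}

// year() also expects a date/timestamp column, so convert the string first.
df.withColumn("year", year(to_date($"date", "dd-MM-yyyy"))).show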