Convert String to Timestamp in Spark (Hive) when the datetime is invalid - scala

I'm trying to change a String to a timestamp, but in my time zone, the last Sunday of March between 2:00 AM and 3:00 AM does not exist, and parsing such a value returns null. Example:
scala> spark.sql("select to_timestamp('20220327 021500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 021500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
|                                              null|
+--------------------------------------------------+
only showing top 1 row
scala> spark.sql("select to_timestamp('20220327 031500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 031500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
|                               2022-03-27 03:15:00|
+--------------------------------------------------+
only showing top 1 row
A possible solution would be to add one hour for times between 2:00 AM and 3:00 AM on those days, but I don't know how to implement it.
I can't change the data source.
What can I do?
Thanks
EDIT
The official documentation says:
In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp
and to_date will fail if the specified datetime pattern is invalid.
In Spark 3.0 or earlier, they result NULL.

Let's consider the following dataframe with a column called ts.
val df = Seq("20220327 021500", "20220327 031500", "20220327 011500").toDF("ts")
In Spark 3.1+, we can use to_timestamp, which automatically adds one hour for the nonexistent times in your situation.
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
|             ts|               time|
+---------------+-------------------+
|20220327 021500|2022-03-27 03:15:00|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
In Spark 3.0 and 2.4.7, we obtain this:
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
|             ts|               time|
+---------------+-------------------+
|20220327 021500|               null|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
But strangely, in Spark 2.4.7, to_utc_timestamp works the same way as to_timestamp does in later versions. The only problem is that we cannot use a custom date format. Yet, if we reformat the string ourselves, we obtain this:
df.withColumn("ts", concat(
substring('ts, 0, 4), lit("-"),
substring('ts, 5, 2), lit("-"),
substring('ts, 7, 5), lit(":"),
substring('ts,12,2), lit(":"),
substring('ts,14,2))
)
.withColumn("time", to_utc_timestamp('ts, "UTC"))
.show
+-------------------+-------------------+
|                 ts|               time|
+-------------------+-------------------+
|2022-03-27 02:15:00|2022-03-27 03:15:00|
|2022-03-27 03:15:00|2022-03-27 03:15:00|
|2022-03-27 01:15:00|2022-03-27 01:15:00|
+-------------------+-------------------+
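Another option, closer to the "add one hour" idea from the question: a minimal sketch for Spark 2.4/3.0 that keeps the normal parse and, only for rows where it comes back null (the DST gap), falls back to re-parsing the same string with the hour bumped by one. It assumes the df with the ts column from above.
import org.apache.spark.sql.functions._

// Normal parse: on 2.4/3.0 this is null only for the nonexistent local times.
val normal = to_timestamp($"ts", "yyyyMMdd HHmmss")

// Fallback: rebuild the string with HH + 1 (only ever used for the 02:xx gap hour).
val bumped = to_timestamp(
  concat(
    substring($"ts", 1, 9),                                                 // "yyyyMMdd "
    lpad((substring($"ts", 10, 2).cast("int") + 1).cast("string"), 2, "0"), // HH + 1
    substring($"ts", 12, 4)                                                 // mmss
  ),
  "yyyyMMdd HHmmss"
)

df.withColumn("time", coalesce(normal, bumped)).show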

I found two different solutions; in both of them you have to change the string format to yyyy-MM-dd HH:mm:ss:
select cast(CONCAT(SUBSTR('20220327021000',0,4),'-',
                   SUBSTR('20220327021000',5,2),'-',
                   SUBSTR('20220327021000',7,2),' ',
                   SUBSTR('20220327021000',9,2),':',
                   SUBSTR('20220327021000',11,2),':',
                   SUBSTR('20220327021000',13,2)) as timestamp);
select
cast(
regexp_replace("20220327031005",
'^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$','$1-$2-$3 $4:$5:$6'
) as timestamp);
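The same regexp rewrite can also be applied on the DataFrame side; a hedged sketch, assuming the df with the ts column from the question (the optional space in the pattern covers the 'yyyyMMdd HHmmss' layout):
import org.apache.spark.sql.functions._

// Reshape "20220327 021500" (or "20220327021500") into "2022-03-27 02:15:00",
// then let the cast to timestamp handle the conversion, as in the SQL above.
df.withColumn("time",
  regexp_replace($"ts",
    "^(\\d{4})(\\d{2})(\\d{2}) ?(\\d{2})(\\d{2})(\\d{2})$",
    "$1-$2-$3 $4:$5:$6"
  ).cast("timestamp")
).show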

Related

How to get 1st day of the year in pyspark

I have a date variable that I need to pass to various functions.
For example, if I have the date in a variable as 12/09/2021, it should return 01/01/2021.
How do I get the 1st day of the year in PySpark?
You can use the trunc function, which truncates parts of a date.
from pyspark.sql import functions as f

df = spark.createDataFrame([()], [])
(
    df
    .withColumn('current_date', f.current_date())
    .withColumn("year_start", f.trunc("current_date", "year"))
    .show()
)
# Output
+------------+----------+
|current_date|year_start|
+------------+----------+
| 2022-02-23|2022-01-01|
+------------+----------+
x = '12/09/2021'
'01/01/' + x[-4:]
output: '01/01/2021'
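If the value has to stay inside Spark rather than plain Python, the same idea can be written with trunc over a literal. A minimal sketch (shown in Scala, matching the main question; dd/MM/yyyy is an assumption about the input format):
import org.apache.spark.sql.functions._

val x = "12/09/2021"
spark.range(1)
  .withColumn("year_start", trunc(to_date(lit(x), "dd/MM/yyyy"), "year"))
  .show()
// +---+----------+
// | id|year_start|
// +---+----------+
// |  0|2021-01-01|
// +---+----------+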
You can achieve this with date_trunc combined with to_date, since date_trunc alone returns a Timestamp rather than a Date.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2002-02-09', '2009-09-19'],
})
sparkDF = sql.createDataFrame(df)  # `sql` is the SparkSession
sparkDF.show()
+----------+
| Date|
+----------+
|2021-01-23|
|2002-02-09|
|2009-09-19|
+----------+
Date Trunc & To Date
sparkDF = sparkDF.withColumn('first_day_year_dt', F.to_date(F.date_trunc('year', F.col('Date')), 'yyyy-MM-dd')) \
                 .withColumn('first_day_year_timestamp', F.date_trunc('year', F.col('Date')))
sparkDF.show()
+----------+-----------------+------------------------+
| Date|first_day_year_dt|first_day_year_timestamp|
+----------+-----------------+------------------------+
|2021-01-23| 2021-01-01| 2021-01-01 00:00:00|
|2002-02-09| 2002-01-01| 2002-01-01 00:00:00|
|2009-09-19| 2009-01-01| 2009-01-01 00:00:00|
+----------+-----------------+------------------------+

Convert DataFrame String column to Timestamp

I am trying the following code to convert a string date column to a timestamp column:
val df = Seq(
("19-APR-2019 10:11:10"),
("19-MAR-2019 10:11:10"),
("19-FEB-2019 10:11:10")
).toDF("date")
.withColumn("new_date", to_utc_timestamp(to_date('date, "dd-MMM-yyyy hh:mm:ss"), "UTC"))
df.show
It almost works, but it loses the hours:
+--------------------+-------------------+
|                date|           new_date|
+--------------------+-------------------+
|19-APR-2019 10:11:10|2019-04-19 00:00:00|
|19-MAR-2019 10:11:10|2019-03-19 00:00:00|
|19-FEB-2019 10:11:10|2019-02-19 00:00:00|
+--------------------+-------------------+
Do you have any idea or any other solution?
As SMaz mentioned in a comment, the following lines do the trick:
import org.apache.spark.sql.functions.to_timestamp
df.withColumn("new_date", to_timestamp('date, "dd-MMM-yyyy hh:mm:ss"))

Why does the aggregation function pyspark.sql.functions.collect_list() add the local timezone offset on display?

I run the following code in a pyspark shell session. Running collect_list() after a groupBy changes how timestamps are displayed (a UTC+02:00 offset is added, probably because this is the local offset in Greece, where the code is run). Although the display is problematic, the timestamp under the hood remains unchanged. This can be observed either by adding a column with the actual unix timestamps or by reverting the dataframe to its initial shape by using pyspark.sql.functions.explode(). Is this a bug?
import datetime
import os
import time

from pyspark.sql import functions, types

# configure UTC timezone
spark.conf.set("spark.sql.session.timeZone", "UTC")
os.environ['TZ'] = 'UTC'
time.tzset()
# create DataFrame
date_time = datetime.datetime(year = 2019, month=1, day=1, hour=12)
data = [(1, date_time), (1, date_time)]
schema = types.StructType([types.StructField("id", types.IntegerType(), False), types.StructField("time", types.TimestampType(), False)])
df_test = spark.createDataFrame(data, schema)
df_test.show()
+---+-------------------+
| id|               time|
+---+-------------------+
|  1|2019-01-01 12:00:00|
|  1|2019-01-01 12:00:00|
+---+-------------------+
# GroupBy and collect_list
df_test1 = df_test.groupBy("id").agg(functions.collect_list("time"))
df_test1.show(1, False)
+---+----------------------------------------------+
|id |collect_list(time)                            |
+---+----------------------------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|
+---+----------------------------------------------+
# add column with unix timestamps
to_timestamp = functions.udf(lambda x: [value.timestamp() for value in x], types.ArrayType(types.FloatType()))
df_test1 = df_test1.withColumn("unix_timestamp", to_timestamp(functions.col("collect_list(time)")))
df_test1.show(1, False)
+---+----------------------------------------------+----------------------------+
|id |collect_list(time)                            |unix_timestamp              |
+---+----------------------------------------------+----------------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|[1.54634394E9, 1.54634394E9]|
+---+----------------------------------------------+----------------------------+
# explode list to distinct rows
df_test1.groupBy("id").agg(functions.collect_list("time")).withColumn("test", functions.explode(functions.col("collect_list(time)"))).show(2, False)
+---+----------------------------------------------+-------------------+
|id |collect_list(time)                            |test               |
+---+----------------------------------------------+-------------------+
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
|1 |[2019-01-01 14:00:00.0, 2019-01-01 14:00:00.0]|2019-01-01 12:00:00|
+---+----------------------------------------------+-------------------+
P.S. 1.54634394E9 corresponds to 2019-01-01 12:00:00 UTC (the FloatType display rounds the value), which is the correct timestamp.
For me, the code above works and does not shift the time as in your case.
Maybe check what your session time zone is (and, optionally, set it to some tz):
spark.conf.get('spark.sql.session.timeZone')
In general, TimestampType in PySpark is not tz-aware like in Pandas; rather, it stores long integers (epoch values) and displays them according to your machine's local time zone (by default).
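A minimal sketch of that last point (in Scala for consistency with the earlier questions; the spark.sql.session.timeZone setting is the same in PySpark): the stored value is an epoch instant, and only its rendering in show() follows the session time zone.
import spark.implicits._

// 1546344000 seconds since the epoch is 2019-01-01 12:00:00 UTC.
val t = Seq(1546344000L).toDF("epoch")
  .withColumn("ts", $"epoch".cast("timestamp"))

spark.conf.set("spark.sql.session.timeZone", "UTC")
t.show(false)            // ts renders as 2019-01-01 12:00:00

spark.conf.set("spark.sql.session.timeZone", "Europe/Athens")
t.show(false)            // the same stored value renders as 2019-01-01 14:00:00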

Converting string time to day timestamp

I have just started working with PySpark, and need some help converting a column's datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
|   dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
|                        null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp), we see that the format must be specified as a SimpleDateFormat pattern (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of day in AM/PM, we must use hhmma. But in SimpleDateFormat, a catches AM or PM, not A or P. So we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
|    dt|                 ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
|    dt|                 ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+

How to retrieve the month from date column values in a Scala dataframe?

Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id|      date|month|
+---+----------+-----+
|  1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id|      date|month|
+---+----------+-----+
|  1|04-04-2015|April|
+---+----------+-----+
How can I do that in sparkSQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id|      date|month|
+---+----------+-----+
|  1|04-04-2015|April|
+---+----------+-----+
NOTE:
The first string (to_date) must match the format of your existing date
Be careful with: "dd-MM-yyyy" vs "MM-dd-yyyy"
The second string (date_format) is the format of the output
Docs:
to_date
date_format
Nothing is wrong with your code; just keep the date format pattern consistent with your date column.
(A screenshot comparing the original and corrected code was attached to the original answer.)
Happy Hadooping!
Not exactly related to this question, but for anyone who wants to get the month as an integer, there is a month function:
val df2 = df.withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
df2.show
+---+----------+-----+
| id|      date|month|
+---+----------+-----+
|  1|04-04-2015|    4|
+---+----------+-----+
In the same way, you can use the year function to get only the year.
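For example, a minimal sketch on the same dataframe:
import org.apache.spark.sql.functions.{to_date, year}

df.withColumn("year", year(to_date($"date", "dd-MM-yyyy"))).show
// +---+----------+----+
// | id|      date|year|
// +---+----------+----+
// |  1|04-04-2015|2015|
// +---+----------+----+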