to_timestamp() in pyspark giving null values - scala

I am trying the following simple transformation.
from pyspark.sql import functions as F
data = [["06/15/2020 14:04:04"]]
cols = ["date"]
df = spark.createDataFrame(data, cols)
df = df.withColumn("datetime", F.to_timestamp(F.col("date"), 'MM/DD/YYYY HH24:MI:SS'))
df.show()
But this gives me the error "All week-based patterns are unsupported since Spark 3.0, detected: Y, Please use the SQL function EXTRACT instead".
I want to parse the string using that date format and convert it to a timestamp.

You should use this format: MM/dd/yyyy HH:mm:ss
Check this page for all datetime format related details.
df = df.withColumn("datetime",to_timestamp(col("date"),'MM/dd/yyyy HH:mm:ss'))
df.show()
+-------------------+-------------------+
|               date|           datetime|
+-------------------+-------------------+
|06/15/2020 14:04:04|2020-06-15 14:04:04|
+-------------------+-------------------+

The different elements of the timestamp pattern are explained in Spark's documentation. Note that Spark's pattern letters follow the conventions of Java's date formatters (SimpleDateFormat in Spark 2.x, java.time's DateTimeFormatter since Spark 3.0), which use a somewhat confusing set of symbols. The symbol for the hour in 24-hour representation is simply H, with no numeric suffix. The minutes are m, not M, which is the month. The day of the month is d, not D, which is the day of the year. The year is matched by y, not Y, which is the week-based year. Week-based patterns are unsupported since Spark 3.0, hence the message you're getting.
In your case, the proper format should be MM/dd/yyyy HH:mm:ss.
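The pattern in the question looks like the Oracle/PostgreSQL style (HH24, MI), so the fix amounts to a token-by-token translation. Below is a minimal sketch of that translation in plain Python. The helper name to_spark_pattern and the mapping table are hypothetical and cover only the six tokens that appear in the question; this is not a Spark API.

```python
import re

# Hypothetical mapping (an assumption for illustration): the Oracle-style
# tokens from the question and the Spark pattern letters that replace them.
ORACLE_TO_SPARK = {
    "YYYY": "yyyy",  # calendar year (Y would mean week-based year)
    "HH24": "HH",    # hour of day, 0-23
    "MM": "MM",      # month of year
    "DD": "dd",      # day of month (D would mean day of year)
    "MI": "mm",      # minute of hour
    "SS": "ss",      # second of minute
}

# Longer tokens go first so that "HH24" is matched as a whole token.
_TOKEN_RE = re.compile(
    "|".join(sorted(ORACLE_TO_SPARK, key=len, reverse=True))
)


def to_spark_pattern(oracle_pattern: str) -> str:
    """Replace each known Oracle-style token with its Spark equivalent."""
    return _TOKEN_RE.sub(lambda m: ORACLE_TO_SPARK[m.group(0)], oracle_pattern)


spark_pattern = to_spark_pattern("MM/DD/YYYY HH24:MI:SS")
print(spark_pattern)  # MM/dd/yyyy HH:mm:ss
```

The resulting string is the one to pass as the second argument of to_timestamp.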

Related

If column having dates in multiple format, Get last date of month for specific date format

I have a spark data frame having two columns (SEQ - Integer, MAIN_DATE - Date) as:
Now I want to add a column based on the condition that if the format of MAIN_DATE is "MMM-YYYY" then it should be converted to Last day of the month and new data frame should look like this:
Any suggestion will be much appreciated.
You can use Spark's when/otherwise methods in order to operate differently for each different date format of the MAIN_DATE column.
More specifically, you can match the MMM-yyyy values of the column by the String length of the field (since we know that those values will always have 8 characters) as the condition in when, and then:
use to_date to convert the String value to a valid date based on a format we give as an argument, and
use last_day to get the last day of the month that each date in MAIN_DATE refers to.
As for the "regular" rows with the dd-MMM-yyyy date format, just a to_date conversion would be sufficient within the otherwise method.
After that, all that's left to do is to convert the dates back to the desired dd-MMM-yyyy format (because to_date returns a date that is displayed in the yyyy-MM-dd format).
This is the solution in Scala (split into two withColumn calls to make it more readable, instead of a one-liner):
import org.apache.spark.sql.functions._
df.withColumn("END_DATE",
    when(length(col("MAIN_DATE")).equalTo(8), last_day(to_date(col("MAIN_DATE"), "MMM-yyyy")))
      .otherwise(to_date(col("MAIN_DATE"), "dd-MMM-yyyy")))
  .withColumn("END_DATE", date_format(col("END_DATE"), "dd-MMM-yyyy"))
This is what the resulting df DataFrame will look like:
+---+-----------+-----------+
|SEQ|  MAIN_DATE|   END_DATE|
+---+-----------+-----------+
|  1|16-JAN-2020|16-Jan-2020|
|  2|   FEB-2017|28-Feb-2017|
+---+-----------+-----------+
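To sanity-check the expected END_DATE values outside Spark, the same branching can be sketched in plain Python. This is only a cross-check under stated assumptions, not the Spark solution itself: the function name end_date is hypothetical, and the parsing relies on English month abbreviations (the default C/English locale).

```python
import calendar
from datetime import datetime


def end_date(main_date: str) -> str:
    """Mimic the when/otherwise logic: month-only values get the last day."""
    if len(main_date) == 8:
        # "MMM-yyyy" values such as "FEB-2017" are exactly 8 characters long.
        parsed = datetime.strptime(main_date, "%b-%Y")
        last_day = calendar.monthrange(parsed.year, parsed.month)[1]
        parsed = parsed.replace(day=last_day)
    else:
        # "dd-MMM-yyyy" values such as "16-JAN-2020" are parsed as-is.
        parsed = datetime.strptime(main_date, "%d-%b-%Y")
    # Render back in the dd-MMM-yyyy layout used by the result table.
    return parsed.strftime("%d-%b-%Y")


for value in ["16-JAN-2020", "FEB-2017"]:
    print(value, "->", end_date(value))
```

Leap years are handled by calendar.monthrange, so a value like FEB-2020 maps to the 29th.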

How would I convert spark scala dataframe column to datetime?

Say I have a dataframe with two columns, both of which need to be converted to datetime format. However, the current formatting of the columns varies from row to row, and when I apply the to_date method, I get all nulls returned.
Here's a screenshot of the format....
the code I tried is...
date_subset.select(col("InsertDate"),to_date(col("InsertDate")).as("to_date")).show()
which returned
Your datetime is not in the default format, so you should give the format.
to_date(col("InsertDate"), "MM/dd/yyyy HH:mm")
I can't tell from your data which field is the month and which is the day, so swap MM and dd if needed, but this is the way to do it.

PySpark - Spark SQL: how to convert timestamp with UTC offset to epoch/unixtime?

How can I convert a timestamp in the format 2019-08-22T23:57:57-07:00 into unixtime using Spark SQL or PySpark?
The most similar function I know is unix_timestamp(), but it doesn't accept the above time format with a UTC offset.
Any suggestion on how I could approach that using preferably Spark SQL or PySpark?
Thanks
The Java SimpleDateFormat pattern for an ISO 8601 time zone in this case is XXX.
So you need to use yyyy-MM-dd'T'HH:mm:ssXXX as your format string.
SparkSQL
spark.sql(
    """select unix_timestamp("2019-08-22T23:57:57-07:00", "yyyy-MM-dd'T'HH:mm:ssXXX")
    AS epoch"""
).show(truncate=False)
#+----------+
#|epoch     |
#+----------+
#|1566543477|
#+----------+
Spark DataFrame
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([("2019-08-22T23:57:57-07:00",)], ["timestamp"])
df.withColumn(
    "unixtime",
    unix_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX")
).show(truncate=False)
#+-------------------------+----------+
#|timestamp                |unixtime  |
#+-------------------------+----------+
#|2019-08-22T23:57:57-07:00|1566543477|
#+-------------------------+----------+
Note that PySpark is just a wrapper around Spark. In general I've found the Scala/Java docs to be more complete than the Python ones, which may be helpful in the future.
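If you want to double-check the epoch value independently of Spark, the Python standard library can parse the same string. This is only a cross-check, not a replacement for unix_timestamp on a DataFrame column; the variable names are illustrative.

```python
from datetime import datetime, timezone

# fromisoformat understands the "-07:00" UTC offset (Python 3.7+).
parsed = datetime.fromisoformat("2019-08-22T23:57:57-07:00")

# timestamp() on an offset-aware datetime returns seconds since the Unix epoch.
epoch_seconds = int(parsed.timestamp())
print(epoch_seconds)  # 1566543477

# The same instant expressed in UTC, seven hours later on the clock.
as_utc = parsed.astimezone(timezone.utc)
print(as_utc.isoformat())  # 2019-08-23T06:57:57+00:00
```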

Spark using timestamp inside a RDD

I'm trying to compare timestamps within a map, but Spark seems to be using a different timezone or something else that is really weird.
I read a dummy csv file like the following to build the input dataframe :
"ts"
"1970-01-01 00:00:00"
"1970-01-01 00:00:00"
df.show(2)
+-------------------+
|                 ts|
+-------------------+
|1970-01-01 00:00:00|
|1970-01-01 00:00:00|
+-------------------+
For now, nothing to report, but then :
import java.sql.Timestamp
import java.time.Instant
df.rdd.map { row =>
  val timestamp = row.getTimestamp(0)
  val timestampMilli = timestamp.toInstant.toEpochMilli
  val epoch = Timestamp.from(Instant.EPOCH)
  val epochMilli = epoch.toInstant.toEpochMilli
  (timestamp, timestampMilli, epoch, epochMilli)
}.foreach(println)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
I don't understand why both timestamps are not 1970-01-01 00:00:00.0, 0. Does anyone know what I'm missing?
NB : I already setup the session timezone to UTC, using the following properties.
spark.sql.session.timeZone=UTC
user.timezone=UTC
The java.sql.Timestamp class inherits from java.util.Date. They both have the behavior of storing a UTC-based numeric timestamp, but displaying time in the local time zone. You'd see this with .toString() in Java, the same as you're seeing with println in the code given.
I believe your OS (or environment) is set to a time zone such as Europe/London. Keep in mind that at the Unix epoch (1970-01-01T00:00:00Z), London was on British Standard Time (BST, UTC+1), which the UK observed all year round from 1968 to 1971.
Your timestampMilli variable is showing -3600000 because it's interpreted your input in local time as 1970-01-01T00:00:00+01:00, which is equivalent to 1969-12-31T23:00:00Z.
Your epoch variable is showing 1970-01-01 01:00:00.0 because 0 is equivalent to 1970-01-01T00:00:00Z, which is equivalent to 1970-01-01T01:00:00+01:00.
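The arithmetic behind those two values can be reproduced with a short plain-Python sketch. It assumes a fixed UTC+1 offset, as described above, instead of looking up a real time zone database entry, and it is not Spark or java.sql.Timestamp code; the variable names simply mirror the ones in the question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the local zone was at UTC+1 on 1970-01-01.
plus_one = timezone(timedelta(hours=1))

# "1970-01-01 00:00:00" read as local time is one hour before the epoch.
local_midnight = datetime(1970, 1, 1, 0, 0, 0, tzinfo=plus_one)
timestamp_milli = int(local_midnight.timestamp() * 1000)
print(timestamp_milli)  # -3600000

# Epoch millisecond 0 rendered in the same local zone shows 01:00:00.
epoch_local = datetime.fromtimestamp(0, tz=plus_one)
print(epoch_local)  # 1970-01-01 01:00:00+01:00
```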
See also:
Is java.sql.Timestamp timezone specific?
How do I make java.sql.Timestamp UTC time?
Java - Convert java.time.Instant to java.sql.Timestamp without Zone offset
I do see you noted you set your session time zone to UTC, which in theory should work. But clearly the results are showing that it isn't using that. Sorry, but I don't know Spark well enough to tell you why. But I would focus on that part of the problem.

Spark Timestamp issue, same timestamp but mismatching

I am moving data from a source into my bucket and need to write a script for data validation. For the Timestamp data type, though, I hit a weird issue: I have two rows containing the same timestamp, [2017-06-08 17:50:02.422437] and [2017-06-08 17:50:02.422]. Because the second one has a different format due to a different file system configuration, Spark considers them different. Is there any way to resolve this problem? Ideally I would ignore this column when comparing the data frames.
You can use unix_timestamp and use that number for the comparison. If you need an actual date, you can use from_unixtime to convert it to your required format. I'm not sure this is an efficient method for large volumes of data...
sqlContext.sql("Select unix_timestamp('2017-06-08 17:50:02.422'), unix_timestamp('2017-06-08 17:50:02.422437') ").show
+----------+----------+
|       _c0|       _c1|
+----------+----------+
|1496958602|1496958602|
+----------+----------+
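The reason the two values compare equal after unix_timestamp is that the sub-second part is dropped. A minimal plain-Python sketch of that idea follows; the helper names truncate_to_seconds and truncate_to_millis are hypothetical, and the millisecond variant is an alternative to consider if whole seconds are too coarse for your validation.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"  # %f accepts one to six fractional digits

micro = datetime.strptime("2017-06-08 17:50:02.422437", FMT)
milli = datetime.strptime("2017-06-08 17:50:02.422", FMT)


def truncate_to_seconds(ts: datetime) -> datetime:
    """Drop the fractional part, similar in effect to comparing epoch seconds."""
    return ts.replace(microsecond=0)


def truncate_to_millis(ts: datetime) -> datetime:
    """Keep millisecond precision and drop the remaining microseconds."""
    return ts.replace(microsecond=ts.microsecond // 1000 * 1000)


print(micro == milli)                                            # False
print(truncate_to_seconds(micro) == truncate_to_seconds(milli))  # True
print(truncate_to_millis(micro) == truncate_to_millis(milli))    # True
```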