Spark SQL function to get timestamp from datetime string - pyspark

I am converting the SQL version of some code to Spark SQL (PySpark). I need to get the time component from a datetime string. In SQL, the code used is
to_char(a.timeclose, 'HH24:MI:SS') as timeclose
Which function can be used in PySpark?
Input : 1970-01-01 16:37:59
Output: 16:37:59

You can do it like this.
Data:
+--------+
|datetime|
+--------+
|09:39:20|
+--------+
df.createOrReplaceTempView("sample")
spark.sql("select split(datetime,':')[0] as a from sample").show()
Output :
+---+
| a|
+---+
| 09|
+---+

Spark does not have a standalone Time data type that you could use.
The closest you can get is to convert your current datetime/timestamp to epoch seconds using unix_timestamp.
Apart from that, kindly go through How to create a Minimal Reproducible Example and share samples from your dataset so we can assist you.
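A minimal sketch of that unix_timestamp route, assuming the column name timeclose from the question (the sample row is hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample mirroring the question's input
df = spark.createDataFrame([("1970-01-01 16:37:59",)], ["timeclose"])

# unix_timestamp converts the full datetime to epoch seconds; note it does not
# preserve the time-of-day string the question actually asks for
df.withColumn("epoch", F.unix_timestamp("timeclose", "yyyy-MM-dd HH:mm:ss")).show(truncate=False)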

Found a solution.
SQL code: to_char(a.timeopen, 'HH24:MI:SS') as timeopen
Alternative for that in PySpark: substring(CAST(a.timeopen AS string), -8) as timeopen
Input : 1970-01-01 16:37:59
Output: 16:37:59
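A runnable sketch of that expression (the table name a and the sample row are assumptions); date_format(col, 'HH:mm:ss') is a built-in alternative that yields the same string, added for comparison rather than taken from the answer:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample standing in for table "a"
df = spark.createDataFrame([("1970-01-01 16:37:59",)], ["timeopen"])
df = df.withColumn("timeopen", F.to_timestamp("timeopen"))
df.createOrReplaceTempView("a")

# The answer's approach: cast to string and keep the last 8 characters
spark.sql("select substring(cast(timeopen as string), -8) as timeopen from a").show()

# Alternative: format the timestamp directly
df.select(F.date_format("timeopen", "HH:mm:ss").alias("timeopen")).show()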

Related

to_timestamp() in pyspark giving null values

I am trying the following simple transformation.
data = [["06/15/2020 14:04:04"]]
cols = ["date"]
df = spark.createDataFrame(data,cols)
df = df.withColumn("datetime", F.to_timestamp(F.col("date"), 'MM/DD/YYYY HH24:MI:SS'))
df.show()
But this gives me an error "All week-based patterns are unsupported since Spark 3.0, detected: Y, Please use the SQL function EXTRACT instead"
I want to parse the data with that date format and convert it to a timestamp.
You should use this format: MM/dd/yyyy HH:mm:ss
Check Spark's Datetime Patterns documentation page for all the format-related details.
df = df.withColumn("datetime", F.to_timestamp(F.col("date"), 'MM/dd/yyyy HH:mm:ss'))
df.show()
+-------------------+-------------------+
| date| datetime|
+-------------------+-------------------+
|06/15/2020 14:04:04|2020-06-15 14:04:04|
+-------------------+-------------------+
The different elements of the timestamp pattern are explained in Spark's documentation. Note that since Spark 3.0 timestamps are parsed with Java DateTimeFormatter-style patterns (SimpleDateFormat-style in earlier versions), which use a somewhat confusing set of format symbols. The symbol matching the hour in 24-hour representation is simply H, with no numeric suffix. Minutes are m, not M, which stands for the month. The year is matched by y, not by Y, which is the week year; week-based patterns are unsupported since Spark 3.0, hence the message you're getting.
In your case, the proper format is MM/dd/yyyy HH:mm:ss.
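Putting the question's snippet together with the corrected pattern, a minimal runnable sketch (imports added; they are not shown in the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [["06/15/2020 14:04:04"]]
cols = ["date"]
df = spark.createDataFrame(data, cols)

# dd for day-of-month, yyyy for year, HH:mm:ss for 24-hour time
df = df.withColumn("datetime", F.to_timestamp(F.col("date"), "MM/dd/yyyy HH:mm:ss"))
df.show()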

PySpark - Spark SQL: how to convert timestamp with UTC offset to epoch/unixtime?

How can I convert a timestamp in the format 2019-08-22T23:57:57-07:00 into unixtime using Spark SQL or PySpark?
The most similar function I know is unix_timestamp(), but it doesn't accept the above time format with a UTC offset.
Any suggestion on how I could approach that using preferably Spark SQL or PySpark?
Thanks
The Java SimpleDateFormat pattern for an ISO 8601 time zone in this case is XXX.
So you need to use yyyy-MM-dd'T'HH:mm:ssXXX as your format string.
SparkSQL
spark.sql(
    """select unix_timestamp("2019-08-22T23:57:57-07:00", "yyyy-MM-dd'T'HH:mm:ssXXX")
       AS epoch"""
).show(truncate=False)
#+----------+
#|epoch |
#+----------+
#|1566543477|
#+----------+
Spark DataFrame
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([("2019-08-22T23:57:57-07:00",)], ["timestamp"])
df.withColumn(
    "unixtime",
    unix_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX")
).show(truncate=False)
#+-------------------------+----------+
#|timestamp |unixtime |
#+-------------------------+----------+
#|2019-08-22T23:57:57-07:00|1566543477|
#+-------------------------+----------+
Note that PySpark is just a wrapper around Spark; generally I've found the Scala/Java docs are more complete than the Python ones, which may be helpful in the future.
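If you need an actual timestamp column rather than epoch seconds, the same pattern also works with to_timestamp; a sketch, not part of the original answer:
from pyspark.sql import functions as F

df = spark.createDataFrame([("2019-08-22T23:57:57-07:00",)], ["timestamp"])

# to_timestamp parses the offset and stores the instant as TimestampType,
# displayed in the session time zone (spark.sql.session.timeZone)
df.withColumn(
    "parsed",
    F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX")
).show(truncate=False)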

Spark using timestamp inside an RDD

I'm trying to compare timestamps within a map, but Spark seems to be using a different timezone or something else that is really weird.
I read a dummy CSV file like the following to build the input dataframe:
"ts"
"1970-01-01 00:00:00"
"1970-01-01 00:00:00"
df.show(2)
+-------------------+
| ts |
+-------------------+
|1970-01-01 00:00:00|
|1970-01-01 00:00:00|
+-------------------+
For now, nothing to report, but then:
df.rdd.map { row =>
  val timestamp = row.getTimestamp(0)
  val timestampMilli = timestamp.toInstant.toEpochMilli
  val epoch = Timestamp.from(Instant.EPOCH)
  val epochMilli = epoch.toInstant.toEpochMilli
  (timestamp, timestampMilli, epoch, epochMilli)
}.foreach(println)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
I don't understand why both timestamps are not (1970-01-01 00:00:00.0, 0). Anyone know what I'm missing?
NB: I have already set the session timezone to UTC, using the following properties.
spark.sql.session.timeZone=UTC
user.timezone=UTC
The java.sql.Timestamp class inherits from java.util.Date. They both have the behavior of storing a UTC-based numeric timestamp, but displaying time in the local time zone. You'd see this with .toString() in Java, the same as you're seeing with println in the code given.
I believe your OS (or environment) is set to something similar to Europe/London. Keep in mind that at the Unix epoch (1970-01-01T00:00:00Z), London was on BST (UTC+1).
Your timestampMilli variable is showing -3600000 because it's interpreted your input in local time as 1970-01-01T00:00:00+01:00, which is equivalent to 1969-12-31T23:00:00Z.
Your epoch variable is showing 1970-01-01 01:00:00.0 because 0 is equivalent to 1970-01-01T00:00:00Z, which is equivalent to 1970-01-01T01:00:00+01:00.
See also:
Is java.sql.Timestamp timezone specific?
How do I make java.sql.Timestamp UTC time?
Java - Convert java.time.Instant to java.sql.Timestamp without Zone offset
I do see you noted that you set your session time zone to UTC, which in theory should work. But clearly the results show that it isn't being used. Sorry, but I don't know Spark well enough to tell you why; I would focus on that part of the problem.
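One avenue worth checking (my assumption about the cause, not part of the answer): spark.sql.session.timeZone only governs Spark SQL itself, while java.sql.Timestamp.toString uses the JVM default zone, so user.timezone has to reach the driver and executor JVMs as a JVM option. A configuration sketch:
from pyspark.sql import SparkSession

# The executor JVM option can be set here; the driver JVM is already running by
# the time builder configs apply, so pass -Duser.timezone=UTC for the driver via
# spark-submit or spark-defaults.conf instead.
spark = (
    SparkSession.builder
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
    .getOrCreate()
)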

spark-sql builtin dayofmonth function returning weird results

For some strange reason, the dayofmonth function in Spark seems to return strange values for years 1500 or earlier.
Following are the results that were obtained:
scala> spark.sql("SELECT dayofmonth('1501-02-14') ").show()
+------------------------------------+
|dayofmonth(CAST(1501-02-14 AS DATE))|
+------------------------------------+
| 14|
+------------------------------------+
scala> spark.sql("SELECT dayofmonth('1500-02-14') ").show()
+------------------------------------+
|dayofmonth(CAST(1500-02-14 AS DATE))|
+------------------------------------+
| 13|
+------------------------------------+
scala> spark.sql("SELECT dayofmonth('1400-02-14') ").show()
+------------------------------------+
|dayofmonth(CAST(1400-02-14 AS DATE))|
+------------------------------------+
| 12|
+------------------------------------+
Can anyone explain why Spark behaves this way?
That's because dates are exposed externally as java.sql.Date and are represented internally as the number of days since the Unix epoch (1970-01-01).
References: source 1, source 2 and 3.
This mainly creates lots of issues when dealing with dates before 1970, but you can try creating UDFs (I can't believe I'm writing this) with external libraries that might be able to cope with this problem, as advised here.
Reminder: of course, you need to take performance bottlenecks into account when using UDFs. More on that here.
For more information about Unix Time, you can read the following :
https://en.wikipedia.org/wiki/Unix_time
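A hedged sketch of the UDF route this answer suggests, using Python's own datetime (proleptic Gregorian calendar) rather than an external library; the column name d and the sample rows are hypothetical:
from datetime import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Parse the raw string in Python and take the day, bypassing Spark's
# internal date conversion for very old dates
@F.udf(returnType=IntegerType())
def day_of_month(s):
    return datetime.strptime(s, "%Y-%m-%d").day

df = spark.createDataFrame([("1500-02-14",), ("1400-02-14",)], ["d"])
df.withColumn("day", day_of_month("d")).show()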

Spark Timestamp issue, same timestamp but mismatching

I am moving data from the source into my bucket and need to write a script for data validation. But for the Timestamp data type I face a weird issue: I have two rows containing the same timestamp, [2017-06-08 17:50:02.422437] and [2017-06-08 17:50:02.422]; because the second one has a different format due to a different file system configuration, Spark considers them different. Is there any way to resolve this problem? The ideal way would be to ignore this column when doing the data frame comparison.
You can use unix_timestamp and use that number for comparison. For actual date requirements you can use from_unixtime to convert it to your required format. I'm not sure it is an efficient method on a large volume of data...
sqlContext.sql("Select unix_timestamp('2017-06-08 17:50:02.422'), unix_timestamp('2017-06-08 17:50:02.422437') ").show
+----------+----------+
| _c0| _c1|
+----------+----------+
|1496958602|1496958602|
+----------+----------+
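A sketch of the same idea at the DataFrame level (the DataFrame and column names are hypothetical): truncate both sides to whole seconds via unix_timestamp before comparing, so sub-second formatting differences are ignored.
from pyspark.sql import functions as F

# Hypothetical frames holding the two differently-formatted timestamps
df_source = spark.createDataFrame([("2017-06-08 17:50:02.422437",)], ["ts"])
df_bucket = spark.createDataFrame([("2017-06-08 17:50:02.422",)], ["ts"])

# Cast to timestamp, then compare on epoch seconds, which drops the fractional part
a = df_source.select(F.unix_timestamp(F.col("ts").cast("timestamp")).alias("ts_sec"))
b = df_bucket.select(F.unix_timestamp(F.col("ts").cast("timestamp")).alias("ts_sec"))
print("mismatched rows:", a.exceptAll(b).count())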