Spark using timestamp inside an RDD - Scala

I'm trying to compare timestamps within a map, but Spark seems to be using a different time zone or something else that is really weird.
I read a dummy CSV file like the following to build the input dataframe:
"ts"
"1970-01-01 00:00:00"
"1970-01-01 00:00:00"
df.show(2)
+-------------------+
|                 ts|
+-------------------+
|1970-01-01 00:00:00|
|1970-01-01 00:00:00|
+-------------------+
For now, nothing to report, but then:
import java.sql.Timestamp
import java.time.Instant

df.rdd.map { row =>
  val timestamp = row.getTimestamp(0)
  val timestampMilli = timestamp.toInstant.toEpochMilli
  val epoch = Timestamp.from(Instant.EPOCH)
  val epochMilli = epoch.toInstant.toEpochMilli
  (timestamp, timestampMilli, epoch, epochMilli)
}.foreach(println)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
I don't understand why both timestamps are not (1970-01-01 00:00:00.0, 0). Does anyone know what I'm missing?
NB: I have already set the session time zone to UTC, using the following properties.
spark.sql.session.timeZone=UTC
user.timezone=UTC

The java.sql.Timestamp class inherits from java.util.Date. They both have the behavior of storing a UTC-based numeric timestamp, but displaying time in the local time zone. You'd see this with .toString() in Java, the same as you're seeing with println in the code given.
I believe your OS (or environment) is set to something similar to Europe/London. Keep in mind that at the Unix epoch (1970-01-01T00:00:00Z), London was on BST (UTC+1).
Your timestampMilli variable is showing -3600000 because it has interpreted your input in local time as 1970-01-01T00:00:00+01:00, which is equivalent to 1969-12-31T23:00:00Z.
Your epoch variable is showing 1970-01-01 01:00:00.0 because 0 is equivalent to 1970-01-01T00:00:00Z, which is equivalent to 1970-01-01T01:00:00+01:00.
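To illustrate, here is a minimal standalone sketch of that behaviour (not Spark-specific; pinning the JVM default zone to Europe/London is only an assumption made to reproduce the numbers above):

import java.sql.Timestamp
import java.time.Instant
import java.util.TimeZone

object TimestampRenderingDemo extends App {
  // Assumption for illustration: force the default zone to Europe/London,
  // which was on BST (UTC+1) at the Unix epoch.
  TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"))

  val epoch = Timestamp.from(Instant.EPOCH)
  println(epoch.toInstant.toEpochMilli) // 0 -- the stored value is UTC-based
  println(epoch)                        // 1970-01-01 01:00:00.0 -- toString renders in the default zone

  // Parsing the wall-clock string "1970-01-01 00:00:00" in that zone gives -3600000 ms.
  val parsed = Timestamp.valueOf("1970-01-01 00:00:00")
  println(parsed.toInstant.toEpochMilli) // -3600000
}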
See also:
Is java.sql.Timestamp timezone specific?
How do I make java.sql.Timestamp UTC time?
Java - Convert java.time.Instant to java.sql.Timestamp without Zone offset
I do see you noted you set your session time zone to UTC, which in theory should work. But clearly the results show that it isn't being applied. Sorry, but I don't know Spark well enough to tell you why; I would focus on that part of the problem.
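For what it's worth, a hedged guess at where to look: spark.sql.session.timeZone only affects Spark SQL's own date/time handling, while println on a java.sql.Timestamp inside df.rdd.map runs on an executor JVM and uses that JVM's default zone. Passing -Duser.timezone to the executors (and to the driver at launch time) is one way to align them; the snippet below is a sketch, not a verified fix.

import org.apache.spark.sql.SparkSession

// Sketch only: executor JVM options can be passed through extraJavaOptions;
// the driver's -Duser.timezone normally has to be supplied at launch time
// (e.g. on the spark-submit command line), not after the JVM has started.
val spark = SparkSession.builder()
  .config("spark.sql.session.timeZone", "UTC")
  .config("spark.executor.extraJavaOptions", "-Duser.timezone=UTC")
  .getOrCreate()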

Related

to_timestamp() in pyspark giving null values

I am trying the following simple transformation.
import pyspark.sql.functions as F

data = [["06/15/2020 14:04:04"]]
cols = ["date"]
df = spark.createDataFrame(data, cols)
df = df.withColumn("datetime", F.to_timestamp(F.col("date"), 'MM/DD/YYYY HH24:MI:SS'))
df.show()
But this gives me an error "All week-based patterns are unsupported since Spark 3.0, detected: Y, Please use the SQL function EXTRACT instead"
I want to format the data into that date format and convert it to timestamp.
You should use this format: MM/dd/yyyy HH:mm:ss
Check this page for all datetime format related details.
df = df.withColumn("datetime",to_timestamp(col("date"),'MM/dd/yyyy HH:mm:ss'))
df.show()
+-------------------+-------------------+
| date| datetime|
+-------------------+-------------------+
|06/15/2020 14:04:04|2020-06-15 14:04:04|
+-------------------+-------------------+
The different elements of the timestamp pattern are explained in Spark's documentation. Note that Spark parses timestamps using Java-style datetime patterns (SimpleDateFormat before Spark 3.0, DateTimeFormatter-style patterns from 3.0 on), which use a somewhat confusing set of format symbols. The symbol for the hour in 24-hour representation is simply H, with no suffix like 24. Minutes are m, not M, which stands for the month. The year is matched by y, not Y, which stands for week-based year. Week-based years are unsupported, hence the message you're getting.
In your case, the proper format should be MM/dd/yyyy HH:mm:ss.
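A quick way to sanity-check a pattern outside Spark is plain java.time, which Spark 3's datetime patterns are modeled on (shown in Scala here for brevity):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// 'y' = year, 'Y' = week-based year, 'M' = month, 'm' = minute, 'H' = hour of day (0-23)
val fmt = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss")
println(LocalDateTime.parse("06/15/2020 14:04:04", fmt)) // 2020-06-15T14:04:04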

Spark SQL function to get timestamp from datetime string

I am converting the SQL version of some code to Spark SQL (PySpark). I need to get the time component from a datetime string. In SQL, the code used is
to_char(a.timeclose,''HH24:MI:SS'') as timeclose
Which function can be used in PySpark?
Input : 1970-01-01 16:37:59
Output: 16:37:59
You can do it like this.
Data :
+--------+
|datetime|
+--------+
|09:39:20|
+--------+
df.registerTempTable("sample")
spark.sql("select split(datetime,':')[0] as a from sample").show()
Output :
+---+
| a|
+---+
| 09|
+---+
Spark does not have a standalone Time data type that you can use.
The closest you can get is to convert your current datetime/timestamp to epoch seconds using unix_timestamp.
Apart from that, kindly go through the Minimal Reproducible Example guidance and share examples/samples from your dataset so we can assist you.
Found a solution.
SQL code: to_char(a.timeopen,'HH24:MI:SS') as timeopen
The equivalent in PySpark/Spark SQL: substring(CAST(a.timeopen AS string), -8) as timeopen
Input : 1970-01-01 16:37:59
Output: 16:37:59
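Another option worth mentioning (a hedged sketch in Scala, assuming a DataFrame df with a timestamp column named timeopen, as in the SQL snippet): Spark's built-in date_format can render just the time portion, without relying on the string width of the cast.

import org.apache.spark.sql.functions.{col, date_format}

// Renders e.g. 1970-01-01 16:37:59 as "16:37:59"
val withTime = df.withColumn("timeopen_hms", date_format(col("timeopen"), "HH:mm:ss"))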

Spark SQL and timezones - How to transform a unix timestamp to a localised timestamp

From a Spark DataFrame I need to convert an epoch/unix timestamp column (e.g. 1509102527 = GMT: Friday, 27 October 2017 11:08:47) to a localised timestamp in order to get the local hour in a specific time zone.
Is there a Spark SQL function that can take the unix timestamp and return a localised java.sql.Timestamp?
I already tried to use from_unixtime function, but it returns a localised timestamp based on the default system timezone of the machine the code is running on. The only solution I found so far is to convert that timestamp back to UTC and then from UTC to the target Timezone.
This is a unit test that works with the workaround, but there should be a better way to do it.
test("timezone localization should not change effective unix timestamp") {
import org.apache.spark.sql.functions._
val df = Seq(1509102527)
.toDF("unix_timestamp")
.withColumn("machine_localised_timestamp", from_unixtime('unix_timestamp))
.withColumn("utc_timestamp", to_utc_timestamp('machine_localised_timestamp, TimeZone.getDefault().getID()))
.withColumn("local_time", from_utc_timestamp('utc_timestamp, "Europe/Amsterdam"))
.withColumn("local_hour", hour('local_time))
.withColumn("reverted_unix_timestamp", unix_timestamp('local_time))
df.show(false)
val row = df.collect()(0)
row(row.fieldIndex("unix_timestamp")) shouldBe 1509102527
row(row.fieldIndex("reverted_unix_timestamp")) shouldBe 1509102527
row(row.fieldIndex("local_hour")) shouldBe 13
}
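One hedged alternative, in the same spirit as the UDF shown at the end of this page: skip from_unixtime (whose string output depends on the session/JVM zone) and do the zone conversion with java.time inside a UDF. This is a sketch, not necessarily the best way.

import java.time.{Instant, ZoneId}
import org.apache.spark.sql.functions.udf

// Maps epoch seconds straight to the local hour in the target zone,
// independently of the session or JVM default time zone.
val localHourAmsterdam = udf { (unixSeconds: Long) =>
  Instant.ofEpochSecond(unixSeconds).atZone(ZoneId.of("Europe/Amsterdam")).getHour
}
// e.g. df.withColumn("local_hour", localHourAmsterdam('unix_timestamp)) gives 13 for 1509102527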

Is Presto's from_unixtime function correct?

I am running a query that converts a bigint to a timestamp; the value is '1494257400'.
I am using a Presto query, but Presto does not give the correct result for the from_unixtime() function.
Hive version:
select from_unixtime(1494257400) result: '2017-05-09 00:30:00'
Presto version:
select from_unixtime(1494257400) result: '2017-05-08 08:30:00'
Hive gives the correct result, but Presto does not. How can I solve this?
Presto's from_unixtime renders the result in the Presto session time zone (typically UTC), while Hive's from_unixtime renders it in the system's local time zone.
According to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF, from_unixtime:
Converts the number of seconds from unix epoch (1970-01-01 00:00:00
UTC) to a string representing the timestamp of that moment in the
current system time zone in the format of "1970-01-01 00:00:00".
Hive's output is arguably misleading, because an ISO-formatted string should carry its UTC offset whenever it is not at GMT+00.
With Hive, you can use to_utc_timestamp({any primitive type} ts, string timezone) to convert your timestamp to the proper timezones. Take a look at the manual whose link is provided above.
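As a quick sanity check, both outputs correspond to the same instant rendered in different zones; the specific zone IDs below are only assumptions chosen to reproduce the two strings (a Scala/java.time sketch):

import java.time.{Instant, ZoneId}

val i = Instant.ofEpochSecond(1494257400L)
println(i)                                          // 2017-05-08T15:30:00Z
println(i.atZone(ZoneId.of("Asia/Seoul")))          // 2017-05-09T00:30+09:00[Asia/Seoul]  (matches the Hive output)
println(i.atZone(ZoneId.of("America/Los_Angeles"))) // 2017-05-08T08:30-07:00[America/Los_Angeles]  (matches the Presto output)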

Spark SQL is not converting timezone correctly [duplicate]

This question already has answers here:
Spark Structured Streaming automatically converts timestamp to local time
(4 answers)
Closed 4 years ago.
Using Scala 2.10.4 with Spark 1.5.1 and Spark 1.6
sqlContext.sql(
"""
|select id,
|to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')),
|from_utc_timestamp(from_unixtime(at), 'US/Pacific'),
|from_unixtime(at),
|to_date(from_unixtime(at)),
| at
|from events
| limit 100
""".stripMargin).collect().foreach(println)
Spark-Submit options:
--driver-java-options '-Duser.timezone=US/Pacific'
result:
[56d2a9573bc4b5c38453eae7,2016-02-28,2016-02-27 16:01:27.0,2016-02-28 08:01:27,2016-02-28,1456646487]
[56d2aa1bfd2460183a571762,2016-02-28,2016-02-27 16:04:43.0,2016-02-28 08:04:43,2016-02-28,1456646683]
[56d2aaa9eb63bbb63456d5b5,2016-02-28,2016-02-27 16:07:05.0,2016-02-28 08:07:05,2016-02-28,1456646825]
[56d2aab15a21fa5f4c4f42a7,2016-02-28,2016-02-27 16:07:13.0,2016-02-28 08:07:13,2016-02-28,1456646833]
[56d2aac8aeeee48b74531af0,2016-02-28,2016-02-27 16:07:36.0,2016-02-28 08:07:36,2016-02-28,1456646856]
[56d2ab1d87fd3f4f72567788,2016-02-28,2016-02-27 16:09:01.0,2016-02-28 08:09:01,2016-02-28,1456646941]
The time in US/Pacific should be 2016-02-28 00:01:27 etc., but somehow it subtracts 8 hours twice.
After reading for some time, these are my conclusions:
Spark SQL doesn't support date-time, nor time zones.
Using timestamp is the only solution.
from_unixtime(at) parses the epoch time correctly; it's just that printing it as a string changes it due to the time zone. It is safe to assume that from_unixtime converts it correctly (even though printing it might show different results).
from_utc_timestamp will shift (not just convert) the timestamp to that time zone; in this case it subtracts 8 hours from the time, since US/Pacific is -08:00 (see the sketch after this list).
Printing SQL results messes up the times with respect to the time zone parameter.
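A small sketch of that shift semantics in isolation, written against the Spark 2.x+ DataFrame API with the session zone pinned to UTC so the printed value is unambiguous (an illustration, not the original 1.5/1.6 setup; assumes a SparkSession named spark, as in spark-shell):

import org.apache.spark.sql.functions.from_utc_timestamp
import spark.implicits._

spark.conf.set("spark.sql.session.timeZone", "UTC")
Seq("2016-02-28 08:01:27").toDF("ts")
  .select(from_utc_timestamp($"ts", "US/Pacific").as("pacific"))
  .show(false)  // 2016-02-28 00:01:27 -- the wall clock shifted back 8 hours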
For the record, here we convert Long values like that using a UDF.
For our purposes, we are only interested in the date string representation of the timestamp (in ms since the epoch, UTC).
import org.apache.spark.sql.functions.udf

val udfToDateUTC = udf((epochMilliUTC: Long) => {
  val dateFormatter = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(java.time.ZoneId.of("UTC"))
  dateFormatter.format(java.time.Instant.ofEpochMilli(epochMilliUTC))
})
This way, we control the parsing as well as the rendering of the dates.
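A hypothetical usage sketch (the column name epoch_ms below is an assumption, not from the original code):

import org.apache.spark.sql.functions.col

// "epoch_ms" stands in for whatever Long column holds ms since the epoch (UTC).
val withDate = df.withColumn("date_utc", udfToDateUTC(col("epoch_ms")))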