This question already has answers here:
Spark Structured Streaming automatically converts timestamp to local time
(4 answers)
Closed 4 years ago.
Using Scala 2.10.4 with Spark 1.5.1 and Spark 1.6
sqlContext.sql(
"""
|select id,
|to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')),
|from_utc_timestamp(from_unixtime(at), 'US/Pacific'),
|from_unixtime(at),
|to_date(from_unixtime(at)),
| at
|from events
| limit 100
""".stripMargin).collect().foreach(println)
Spark-Submit options:
--driver-java-options '-Duser.timezone=US/Pacific'
result:
[56d2a9573bc4b5c38453eae7,2016-02-28,2016-02-27 16:01:27.0,2016-02-28 08:01:27,2016-02-28,1456646487]
[56d2aa1bfd2460183a571762,2016-02-28,2016-02-27 16:04:43.0,2016-02-28 08:04:43,2016-02-28,1456646683]
[56d2aaa9eb63bbb63456d5b5,2016-02-28,2016-02-27 16:07:05.0,2016-02-28 08:07:05,2016-02-28,1456646825]
[56d2aab15a21fa5f4c4f42a7,2016-02-28,2016-02-27 16:07:13.0,2016-02-28 08:07:13,2016-02-28,1456646833]
[56d2aac8aeeee48b74531af0,2016-02-28,2016-02-27 16:07:36.0,2016-02-28 08:07:36,2016-02-28,1456646856]
[56d2ab1d87fd3f4f72567788,2016-02-28,2016-02-27 16:09:01.0,2016-02-28 08:09:01,2016-02-28,1456646941]
The time in US/Pacific should be 2016-02-28 00:01:27, etc., but somehow 8 hours get subtracted twice.
After reading around for some time, these are my conclusions:
Spark SQL doesn't support date-time, nor timezones
Using timestamp is the only solution
from_unixtime(at) parses the epoch time correctly; it's only the printing of it as a string that changes it due to the timezone. It is safe to assume that from_unixtime converts it correctly (although printing it might show different results)
from_utc_timestamp will shift (not just convert) the timestamp to that timezone; in this case it subtracts 8 hours from the time, since the offset is -08:00
printing SQL results messes up the times with respect to the timezone parameter
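The double shift described in the conclusions above can be reproduced with plain java.time (a sketch, using the first `at` value from the result rows as input):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.temporal.ChronoUnit;

public class DoubleShiftDemo {
    public static void main(String[] args) {
        long at = 1456646487L; // first `at` value from the result rows
        Instant instant = Instant.ofEpochSecond(at);
        // from_unixtime: the underlying instant is 2016-02-28T08:01:27Z
        System.out.println(instant);
        // from_utc_timestamp shifts the value itself by the zone offset (-08:00)
        int offsetSeconds = ZoneId.of("US/Pacific").getRules().getOffset(instant).getTotalSeconds();
        Instant shifted = instant.plus(offsetSeconds, ChronoUnit.SECONDS);
        System.out.println(shifted); // 2016-02-28T00:01:27Z
        // rendering that shifted value under -Duser.timezone=US/Pacific subtracts 8 hours again
        System.out.println(shifted.atZone(ZoneId.of("US/Pacific")).toLocalDateTime()); // 2016-02-27T16:01:27
    }
}
```

The shift plus the zone-local rendering is exactly the "subtracted twice" effect seen in the output.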
For the record, here is how we convert Long values like that, using a UDF.
For our purposes, we are only interested in the date-string representation of the timestamp (in ms since epoch, in UTC).
val udfToDateUTC = udf((epochMilliUTC: Long) => {
  val dateFormatter = java.time.format.DateTimeFormatter
    .ofPattern("yyyy-MM-dd")
    .withZone(java.time.ZoneId.of("UTC"))
  dateFormatter.format(java.time.Instant.ofEpochMilli(epochMilliUTC))
})
This way, we control the parsing as well as the rendering of the dates.
Related
This question already has an answer here:
DateTime.fromMillisecondsSinceEpoch not returning an exact DateTime object
(1 answer)
Closed last month.
I am storing several rows in Postgres (Supabase) and want to use a timestamptz field both as the date of creation and to "group" these rows.
When I use Postgres's now(), the timestamptz has microseconds, whereas Flutter's DateTime.now() has only milliseconds.
See image that shows the difference
What I am doing now is to write the first record (using the column default set to now()), read it to get the timestamp and then set this for the other rows.
Is there a way to generate DateTime.now() with microseconds in flutter?
You can use millisecondsSinceEpoch to get the milliseconds from a DateTime.
final now = DateTime.now().millisecondsSinceEpoch;
print(now);
This will return the milliseconds since the "Unix epoch", 1970-01-01T00:00:00Z. Note that on native platforms, Dart's DateTime.now() also carries microsecond precision, exposed through microsecondsSinceEpoch (on the web it is limited to milliseconds).
https://api.flutter.dev/flutter/dart-core/DateTime/millisecondsSinceEpoch.html
I have a column in Timestamp format that includes milliseconds.
I would like to reformat my Timestamp column so that it does not include milliseconds. For example, if my Timestamp column has values like 2019-11-20T12:23:13.324+0000, I would like the reformatted column to have values like 2019-11-20T12:23:13.
Is there a straightforward way to perform this operation in Spark/Scala? I have found lots of posts on converting a string to a timestamp, but not on changing the format of a timestamp.
You can try date_trunc("second", col) (available from Spark 2.3; the older trunc function only truncates to year or month), or date_format to render the timestamp without milliseconds.
See more examples: https://sparkbyexamples.com/spark/spark-date-functions-truncate-date-time/
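Outside of Spark, the equivalent truncation in plain java.time looks like this (a sketch; the sample value is taken from the question):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class TruncateMillis {
    public static void main(String[] args) {
        Instant ts = Instant.parse("2019-11-20T12:23:13.324Z");
        // drop the fractional-second part
        Instant truncated = ts.truncatedTo(ChronoUnit.SECONDS);
        // render without the zone designator, matching the desired output
        DateTimeFormatter fmt = DateTimeFormatter
                .ofPattern("yyyy-MM-dd'T'HH:mm:ss")
                .withZone(ZoneOffset.UTC);
        System.out.println(fmt.format(truncated)); // 2019-11-20T12:23:13
    }
}
```

Wrapped in a UDF this gives the same result as the built-in functions, at the cost of leaving the Catalyst optimizer.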
I am trying to get the ISO year-week in Spark with Scala from a date which is in string format.
The following SQL query returns the expected result in hive.
i.e. if the date is 1st January 2016, per the ISO standard it is the 53rd week of year 2015, hence the result 201553.
hive> select from_unixtime(unix_timestamp('20160101', 'yyyyMMdd'), 'Yww');
OK
201553
Time taken: 0.444 seconds, Fetched: 1 row(s)
If I try to run the same in Spark via Spark sql, it is giving me a different result.
scala> spark.sql("""select from_unixtime(unix_timestamp('20160101', 'yyyyMMdd'), 'Yww')""").show
+------------------------------------------------------+
|from_unixtime(unix_timestamp(20160101, yyyyMMdd), Yww)|
+------------------------------------------------------+
| 201601|
+------------------------------------------------------+
The result I need from the Spark program is 201553.
I am using Spark version 2.3
Can someone explain what's going on?
Please let me know if there is any way to get ISO year week in Spark.
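The discrepancy most likely comes from week-numbering rules: with a US-style locale, weeks start on Sunday and the first week of the year needs only one day, so 2016-01-01 falls in week 1 of 2016 (201601); under ISO rules (Monday start, first week needs at least four days) it falls in week 53 of 2015. Outside of Spark SQL's formatter, java.time exposes the ISO fields directly, e.g.:

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

public class IsoWeekDemo {
    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2016, 1, 1);
        // ISO week-based year and week number (Monday start, min. 4 days in first week)
        int weekYear = date.get(IsoFields.WEEK_BASED_YEAR);     // 2015
        int week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR); // 53
        System.out.printf("%d%02d%n", weekYear, week); // 201553
    }
}
```

In Spark this logic could be wrapped in a UDF to get the ISO value regardless of the formatter's locale.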
I'm trying to compare timestamps within a map, but Spark seems to be using a different timezone or something else that is really weird.
I read a dummy CSV file like the following to build the input DataFrame:
"ts"
"1970-01-01 00:00:00"
"1970-01-01 00:00:00"
df.show(2)
+-------------------+
| ts |
+-------------------+
|1970-01-01 00:00:00|
|1970-01-01 00:00:00|
+-------------------+
For now, nothing to report, but then:
import java.sql.Timestamp
import java.time.Instant

df.rdd.map { row =>
  val timestamp = row.getTimestamp(0)
  val timestampMilli = timestamp.toInstant.toEpochMilli
  val epoch = Timestamp.from(Instant.EPOCH)
  val epochMilli = epoch.toInstant.toEpochMilli
  (timestamp, timestampMilli, epoch, epochMilli)
}.foreach(println)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
(1970-01-01 00:00:00.0,-3600000,1970-01-01 01:00:00.0,0)
I don't understand why both timestamps are not 1970-01-01 00:00:00.0, 0. Does anyone know what I'm missing?
NB: I have already set the session timezone to UTC, using the following properties.
spark.sql.session.timeZone=UTC
user.timezone=UTC
The java.sql.Timestamp class inherits from java.util.Date. They both have the behavior of storing a UTC-based numeric timestamp, but displaying time in the local time zone. You'd see this with .toString() in Java, the same as you're seeing with println in the code given.
I believe your OS (or environment) is set to something similar to Europe/London. Keep in mind that at the Unix epoch (1970-01-01T00:00:00Z), London was on BST (UTC+1).
Your timestampMilli variable is showing -3600000 because it's interpreted your input in local time as 1970-01-01T00:00:00+01:00, which is equivalent to 1969-12-31T23:00:00Z.
Your epoch variable is showing 1970-01-01 01:00:00.0 because 0 is equivalent to 1970-01-01T00:00:00Z, which is equivalent to 1970-01-01T01:00:00+01:00.
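A minimal sketch of this behaviour, pinning the JVM default zone to Europe/London to reproduce the numbers above:

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.util.TimeZone;

public class TimestampZoneDemo {
    public static void main(String[] args) {
        // pin the default zone so the output matches the answer above
        TimeZone.setDefault(TimeZone.getTimeZone("Europe/London"));
        // "1970-01-01 00:00:00" is parsed as *local* time: 1970-01-01T00:00:00+01:00 (BST)
        Timestamp ts = Timestamp.valueOf("1970-01-01 00:00:00");
        System.out.println(ts.getTime()); // -3600000
        // the true epoch instant, rendered in local time, displays as 01:00
        Timestamp epoch = Timestamp.from(Instant.EPOCH);
        System.out.println(epoch); // 1970-01-01 01:00:00.0
    }
}
```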
See also:
Is java.sql.Timestamp timezone specific?
How do I make java.sql.Timestamp UTC time?
Java - Convert java.time.Instant to java.sql.Timestamp without Zone offset
I do see you noted you set your session time zone to UTC, which in theory should work. But clearly the results are showing that it isn't using that. Sorry, but I don't know Spark well enough to tell you why. But I would focus on that part of the problem.
From a Spark DataFrame I need to convert an epoch/unix timestamp column (e.g. 1509102527 = GMT: Friday, 27 October 2017 11:08:47) to a localised timestamp in order to get the local hour in a specific timezone.
Is there a Spark SQL function that can take the unix timestamp and return a localised java.sql.Timestamp?
I already tried the from_unixtime function, but it returns a localised timestamp based on the default system timezone of the machine the code runs on. The only solution I found so far is to convert that timestamp back to UTC and then from UTC to the target timezone.
This is a unit test that works with the workaround, but there should be a better way to do it.
test("timezone localization should not change effective unix timestamp") {
  import org.apache.spark.sql.functions._
  import spark.implicits._ // for .toDF and the 'column syntax
  import java.util.TimeZone

  val df = Seq(1509102527)
    .toDF("unix_timestamp")
    .withColumn("machine_localised_timestamp", from_unixtime('unix_timestamp))
    .withColumn("utc_timestamp", to_utc_timestamp('machine_localised_timestamp, TimeZone.getDefault().getID()))
    .withColumn("local_time", from_utc_timestamp('utc_timestamp, "Europe/Amsterdam"))
    .withColumn("local_hour", hour('local_time))
    .withColumn("reverted_unix_timestamp", unix_timestamp('local_time))

  df.show(false)

  val row = df.collect()(0)
  row(row.fieldIndex("unix_timestamp")) shouldBe 1509102527
  row(row.fieldIndex("reverted_unix_timestamp")) shouldBe 1509102527
  row(row.fieldIndex("local_hour")) shouldBe 13
}
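For comparison, the round trip through UTC in the test above collapses to one step with plain java.time (a sketch using the same epoch value; in Spark this would sit inside a UDF):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class LocalHourDemo {
    public static void main(String[] args) {
        long unixTimestamp = 1509102527L; // 2017-10-27T11:08:47Z
        // attach the target zone directly to the instant, no intermediate string rendering
        ZonedDateTime local = Instant.ofEpochSecond(unixTimestamp)
                .atZone(ZoneId.of("Europe/Amsterdam"));
        System.out.println(local.getHour()); // 13 (CEST, UTC+2 on that date)
    }
}
```

Because the instant itself never changes, the "reverted" unix timestamp is trivially identical to the input.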