PySpark date_trunc behavior regarding daylight saving time - pyspark

I want to truncate a timestamp to the day, but I get inconsistent results caused by daylight saving time. They started occurring on 2022-10-30.
F.date_trunc('day', 'my_date')
        input        |       result
---------------------+---------------------
 2022-10-30 01:41:02 | 2022-10-30 00:00:00
 2022-10-30 07:41:01 | 2022-10-30 01:00:00
What is the reason for this happening?
I use pyspark v2.4.4.
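The two rows in the question can be reproduced in plain Python under one possible mechanism. This is only a sketch of a guess, not confirmed behaviour of Spark: it assumes the session time zone is Europe/Berlin (the question does not say) and that the truncation applies the UTC offset in force at the input instant, instead of the offset in force at local midnight. The function names are made up for the illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: the session time zone is Europe/Berlin (not stated in the question).
TZ = ZoneInfo("Europe/Berlin")


def naive_trunc_day(local_ts: datetime) -> datetime:
    """Hypothetical truncation that reuses the offset valid at the input instant."""
    aware = local_ts.replace(tzinfo=TZ)
    offset = aware.utcoffset()              # +02:00 before the switch, +01:00 after
    utc = aware.astimezone(timezone.utc)
    shifted = utc + offset                  # wall clock, treating the offset as fixed
    floored = shifted.replace(hour=0, minute=0, second=0, microsecond=0)
    result_utc = floored - offset           # back to UTC with that same offset
    # Rendering the result in the session zone picks up the offset valid at midnight.
    return result_utc.astimezone(TZ).replace(tzinfo=None)


def wall_clock_trunc_day(local_ts: datetime) -> datetime:
    """Truncation done purely on the local wall clock."""
    return local_ts.replace(hour=0, minute=0, second=0, microsecond=0)


before_switch = datetime(2022, 10, 30, 1, 41, 2)
after_switch = datetime(2022, 10, 30, 7, 41, 1)

print(naive_trunc_day(before_switch))      # 2022-10-30 00:00:00
print(naive_trunc_day(after_switch))       # 2022-10-30 01:00:00
print(wall_clock_trunc_day(after_switch))  # 2022-10-30 00:00:00
```

If this guess is right, only timestamps after the switch back to standard time on that day are affected, which matches the two rows above.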

Related

How to fix incorrect Date due to timezone in Core Data from using Calendar.current.startOfDay?

I have erroneously used Calendar.current.startOfDay(for: Date()) to populate a date attribute in Core Data. This means that when users cross different timezones I may have different dates unintentionally stored in the date attribute field, e.g.
Timezone 1 - 25th 23:00
Timezone 2 - 25th 22:00
Timezone 3 - 26th 05:00
I need to update the Calendar to use UTC Timezone but I need to also perform a migration so that the existing entries in Core Data read like this…
Result:
Timezone 1 - 26th 00:00
Timezone 2 - 26th 00:00
Timezone 3 - 26th 00:00
What are the steps to perform this migration? If I do a UTC startOfDay on it, Timezone 1 would get 25th 00:00 instead of 26th 00:00, which is what it should be. Is it possible to accurately update existing entries?
Edit:
For some context I need a reliable way to get all the entries for the 26th for example. I used startOfDay to store the date as it meant I could query by it too and have the relevant entry returned (at any moment in time get the startOfDay and it will give me the entries for the whole day). For historical dates I can do the same - let's say the user has navigated back 2 days I can take startOfDay and subtract 2 days using Calendar.current.date(byAdding: .day, value: -2, to: date) and query for that.
So now the timezone breaks the above logic but is there some way to fix this? If I loop through the entries I can figure out the date it was supposed to be for and perhaps change the attribute to a string - e.g. 26-05-2021 or start to store day, month, year instead and query that.
From reading your answer, Duncan, I don't think I want to use a UTC calendar, as it would start to store the entry against the incorrect date from the user's perspective depending on their timezone, e.g. the user moves to the next day and UTC is still on the previous one.
Edit 2:
In a migration I will take the date that is stored and map it to new day, month and year properties storing those instead by getting them from Calendar.current.dateComponents([.day, .month, .year], from: date). Then instead of query by date I will query by day month and year of the Calendar.current where the user is. The side effect here is there is potential the user adds something for today (27th) changes timezone and sees 26th data but I don't think it can be avoided and the old data will then show as intended.
If you took the current time and used Calendar.current.startOfDay(for: Date()) to calculate midnight in the user's local time zone, you have a loss of information. You don't know what time of day the operation was performed. If you saved the time of day in the local time zone in another field, you could reconstruct a Date in UTC.
It isn't clear that what you did was wrong. The day, month, and year is only meaningful in a specific time zone. I am in the Washington DC metro area. We are in daylight savings time (EDT). It is currently 20:56 on the 26th of May. However, it's 1:56 AM on the 27th of May in London, 2:57 AM in Munich, and 3:57 AM in Tel Aviv. All at the exact same instant in time. In UTC it is 0:57 AM on the 27th of May.
Most people think of the calendar date in their local time zone. That is their frame of reference. If you ask me the date right now, I'll tell you it's the evening of the 26th of May. Unless I know you are in a different time zone, that's the "right" answer to me.
If I start out at midnight on a given day in my time zone, calling Calendar.current.startOfDay(for: Date()) each hour, I'll get midnight that day for all 24 hours in my local time zone. For the first 20 hours of the day, that would be the same result I would get if I created a Calendar in UTC and made the same call. However, at 20:00 EDT, I would start getting the next calendar day if I made the same query in UTC.
If you don't know what time of day you made the call to Calendar.current.startOfDay(for: Date()), there is no foolproof way to figure out the day/month/year in UTC at the instant you made the call. It depends on the time of day in the local timezone, and that timezone's offset from UTC.
Consider this code:
var calendarUTC = Calendar(identifier: .gregorian)
if let utcTimeZone = TimeZone(identifier: "UTC") {
    print("Valid time zone")
    calendarUTC.timeZone = utcTimeZone
}
print("Start of day in UTC is \(calendarUTC.startOfDay(for: Date()))")
print("Start of day in local time zone is \(Calendar.current.startOfDay(for: Date()))")
That outputs:
Start of day in UTC is 2021-05-27 00:00:00 +0000
Start of day in local time zone is 2021-05-26 04:00:00 +0000
That's because right now, which is 20:56 on 26 May in my time zone, it's 0:56 on 27 May in UTC. So if I ask the UTC calendar for the start of day for now (Date()) I get midnight on 27 May, in UTC.
If I ask the same question of my local calendar, I get midnight on 26 May in my time zone, which is 4:00 AM on 26 May in UTC.
If I ran the same code this morning at 8:00 AM in my time zone, I would have gotten the output:
Start of day in UTC is 2021-05-26 00:00:00 +0000
Start of day in local time zone is 2021-05-26 04:00:00 +0000
(Since 8:00 AM on 26 May in EDT is also 26 May in UTC.)
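The same comparison can be sketched outside Swift. The snippet below is only an illustration: start_of_day is a made-up helper, and EDT is modelled as a fixed UTC-4 offset rather than a real time zone with DST rules.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for the answerer's local zone (EDT).
EDT = timezone(timedelta(hours=-4), "EDT")


def start_of_day(instant: datetime, tz: timezone) -> datetime:
    """Midnight of the calendar day `instant` falls on in `tz`, expressed in UTC."""
    local = instant.astimezone(tz)
    midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight.astimezone(timezone.utc)


evening = datetime(2021, 5, 26, 20, 56, tzinfo=EDT)
morning = datetime(2021, 5, 26, 8, 0, tzinfo=EDT)

print(start_of_day(evening, timezone.utc))  # 2021-05-27 00:00:00+00:00
print(start_of_day(evening, EDT))           # 2021-05-26 04:00:00+00:00
print(start_of_day(morning, timezone.utc))  # 2021-05-26 00:00:00+00:00
print(start_of_day(morning, EDT))           # 2021-05-26 04:00:00+00:00
```

The local start of day is the same for both instants, while the UTC start of day moves to the 27th once it is past 20:00 EDT.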
It's tricky and not 100% reliable, and it only works if you know that all dates were created using startOfDay. But first you need to decide what you want. Say one date was created at 10pm in New York, and one at exactly the same moment in London, at 3am the next day. What day do you want to be stored?
If your date stored is 25th, 10pm, then you know it was created in a timezone where the day started at 10pm UTC. You are lucky, there are only two time zones that would have created this, one without DST, one with DST. So you know it happened in one of these two time zones, within 24 hours.
Unfortunately, time zones cover 26 hours. Fortunately, only some islands in the Pacific Ocean have same time and different dates (+13 and -11 hours). For these places, you cannot possibly know which date is correct, but very few people would be affected.

time stamp are resetting to Zero when changed from to_date to to_char in postgresql

I need to change the following SQL query to the PostgreSQL format. How can I do that?
eg:
round((TIME_TO_SEC(testruntest.endtime) - TIME_TO_SEC(testruntest.starttime))/60,2)
I tried this query and got an error saying that "time_to_sec" is not a supported function...
Use the SQL standard EXTRACT function:
EXTRACT(epoch FROM testruntest.endtime)
The documentation describes:
For timestamp with time zone values, the number of seconds since 1970-01-01 00:00:00 UTC (can be negative); for date and timestamp values, the number of seconds since 1970-01-01 00:00:00 local time; for interval values, the total number of seconds in the interval
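For reference, the arithmetic the original expression performs (difference in seconds, divided by 60, rounded to two places) looks like this in Python. It is just a sketch of the calculation, assuming starttime and endtime are timestamps. elapsed_minutes is a made-up name and the sample values are invented.

```python
from datetime import datetime


def elapsed_minutes(starttime: datetime, endtime: datetime) -> float:
    """Difference between two timestamps in minutes, rounded to 2 decimals."""
    seconds = (endtime - starttime).total_seconds()  # like the difference of two epoch values
    return round(seconds / 60, 2)


# Invented sample values: a test that ran for 2 minutes 30 seconds.
start = datetime(2024, 1, 1, 10, 0, 0)
end = datetime(2024, 1, 1, 10, 2, 30)
print(elapsed_minutes(start, end))  # 2.5
```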

Strange time zone in PostgreSQL timestamp conversion

This SQL:
select to_timestamp(extract(epoch from '0001-01-01 00:00:00'::timestamp))
produces this output:
0001-01-01 08:06:00+08:06
I realize that to_timestamp() always adds a time zone, hence the additional 8 hours and +8 in the time zone segment. But what is the :06? And where did the extra 6 minutes come from?
EDIT
If I initially execute set local timezone to 'UTC'; then I get expected results.
Before time zones were standardized, each city had its own local time, mostly differing from its neighbours by just a few minutes.
Only after time zones were standardized (and adopted by everybody) were the local times set to the offsets we know today.
That's why you get these strange results for ancient dates, especially before the year 1900.
In fact, Taipei only changed from UTC+08:06 to UTC+08:00 on January 1st, 1896, so dates before that will have the +08:06 offset.
If you set your timezone to UTC this doesn't happen, because UTC's offset is zero and never changes.
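The historical offset can be checked against the IANA time zone database from Python as well. A small sketch, assuming the time zone in the question is Asia/Taipei and that tz data is installed on the machine:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Assumption: the session in the question runs with the Asia/Taipei time zone.
taipei = ZoneInfo("Asia/Taipei")

ancient = datetime(1, 1, 1, tzinfo=taipei)    # before standardized time in Taipei
modern = datetime(2000, 1, 1, tzinfo=taipei)

print(ancient.utcoffset())  # 8:06:00 (local mean time)
print(modern.utcoffset())   # 8:00:00
```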

Postgresql not converting timezone correctly

I am building an application that needs to handle different timezones, including timezones with daylight saving time. I am storing all the dates/times using PostgreSQL's timestamp with time zone data type, and all are stored in the UTC timezone (I modified the timezone entry in postgresql.conf).
Here is an example table that I have created:
CREATE TABLE category_city
(
    category_city_id serial NOT NULL,
    appt_start_time timestamp(0) with time zone,
    appt_end_time timestamp(0) with time zone,
    appt_days_of_week character varying,
    CONSTRAINT category_city_pkey PRIMARY KEY (category_city_id)
);
I have been testing some conversions using the 'America/Edmonton' timezone. Currently, this timezone is 6 hours behind UTC. I have a timestamp in this table that has the value 1970-01-01 15:00:00+00.
Now I perform this query:
SELECT appt_start_time at time zone 'America/Edmonton' as start
FROM category_city
This should give me the timestamp 1970-01-01 09:00:00+00, but instead it gives me the timestamp 1970-01-01 08:00:00+00, an offset of -7, which is the correct offset when not in DST, but is obviously not correct right now (DST is in effect).
I must be missing something because I'm sure I'm not the only one that needs to handle different timezones with DST.
The time on my server is correct, this is the output of the date command on Ubuntu:
Wed Apr 30 17:11:59 MDT 2014
Can anyone see something that I am overlooking, or has anyone experienced something similar and found a way to work around it? Any help would be appreciated!
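At 1970-01-01 15:00:00+00 the time in the timezone 'America/Edmonton' was 1970-01-01 08:00:00, regardless of when you ask. The offset is the one that applied on the date of the timestamp, not the one in effect today. If you want the time at a specific timezone abbreviation you need to make it explicit: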
select '1970-01-01 15:00:00+00' at time zone 'MST' as start;
start
---------------------
1970-01-01 08:00:00
select '1970-01-01 15:00:00+00' at time zone 'MDT' as start;
start
---------------------
1970-01-01 09:00:00
I guess it is easy to see above why that time at the MDT timezone will always be the same.
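The same point can be shown with Python's zoneinfo, which reads the same IANA database (assuming tz data is installed). The offset applied is the one valid at the timestamp itself. The 2014 value below is an extra example date, picked to match the day the question was asked.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

edmonton = ZoneInfo("America/Edmonton")

winter_1970 = datetime(1970, 1, 1, 15, 0, tzinfo=timezone.utc)
spring_2014 = datetime(2014, 4, 30, 15, 0, tzinfo=timezone.utc)  # extra example date

# Each instant is converted with the offset that applied on its own date.
print(winter_1970.astimezone(edmonton))  # 1970-01-01 08:00:00-07:00
print(spring_2014.astimezone(edmonton))  # 2014-04-30 09:00:00-06:00
```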

PostgreSQL date() with timezone

I'm having an issue selecting dates properly from Postgres - they are being stored in UTC, but not converting with the Date() function properly.
Converting the timestamp to a date gives me the wrong date if it's past 4pm PST.
2012-06-21 should be 2012-06-20 in this case.
The starts_at column datatype is timestamp without time zone. Here are my queries:
Without converting to PST timezone:
Select starts_at from schedules where id = 40;
starts_at
---------------------
2012-06-21 01:00:00
Converting gives this:
Select (starts_at at time zone 'pst') from schedules where id = 40;
timezone
------------------------
2012-06-21 02:00:00-07
But neither convert to the correct date in the timezone.
Basically what you want is:
$ select starts_at AT TIME ZONE 'UTC' AT TIME ZONE 'US/Pacific' from schedules where id = 40
I got the solution from the article below, which is straight GOLD!!! It explains this non-trivial issue very clearly; give it a read if you wish to understand PostgreSQL time zone management better.
Expressing PostgreSQL timestamps without zones in local time
Here is what is going on. First you should know that the 'PST' timezone is 8 hours behind the UTC timezone, so for instance Jan 1st 2014, 4:30 PM PST (Wed, 01 Jan 2014 16:30:00 -0800) is equivalent to Jan 2nd 2014, 00:30 AM UTC (Thu, 02 Jan 2014 00:30:00 +0000). Any time after 4:00 PM in PST slips over to the next day when interpreted as UTC.
Also, as Erwin Brandstetter mentioned above, PostgreSQL has two timestamp data types, one with a timezone and one without.
If your timestamps include a timezone, then a simple:
$ select starts_at AT TIME ZONE 'US/Pacific' from schedules where id = 40
will work. However if your timestamp is timezoneless, executing the above command will not work, and you must FIRST convert your timezoneless timestamp to a timestamp with a timezone, namely a UTC timezone, and ONLY THEN convert it to your desired 'PST' or 'US/Pacific' (which are the same up to some daylight saving time issues. I think you should be fine with either).
Let me demonstrate with an example where I create a timezoneless timestamp. Let's assume for convenience that our local timezone is indeed 'PST' (if it weren't then it gets a tiny bit more complicated which is unnecessary for the purpose of this explanation).
Say I have:
$ select timestamp '2014-01-2 00:30:00' AS a,
         timestamp '2014-01-2 00:30:00' AT TIME ZONE 'UTC' AS b,
         timestamp '2014-01-2 00:30:00' AT TIME ZONE 'UTC' AT TIME ZONE 'PST' AS c,
         timestamp '2014-01-2 00:30:00' AT TIME ZONE 'PST' AS d
This will yield:
"a"=>"2014-01-02 00:30:00" (This is the timezoneless timestamp)
"b"=>"2014-01-02 00:30:00+00" (This is the UTC TZ timestamp, note that up to a timezone, it is equivalent to the timezoneless one)
"c"=>"2014-01-01 16:30:00" (This is the correct 'PST' TZ conversion of the UTC timezone, if you read the documentation postgresql will not print the actual TZ for this conversion)
"d"=>"2014-01-02 08:30:00+00"
The last timestamp is the reason for all the confusion regarding converting timezoneless timestamp from UTC to 'PST' in postgresql. When we write:
timestamp '2014-01-2 00:30:00' AT TIME ZONE 'PST' AS d
We are taking a timezoneless timestamp and trying to convert it to the 'PST' TZ (we indirectly assume that PostgreSQL will understand that we want it to convert the timestamp from a UTC TZ, but PostgreSQL has plans of its own!). In practice, what PostgreSQL does is take the timezoneless timestamp ('2014-01-2 00:30:00') and treat it as if it WERE ALREADY a 'PST' TZ timestamp (i.e. 2014-01-2 00:30:00 -0800) and convert that to the UTC timezone!!! So it actually pushes it 8 hours ahead instead of back! Thus we get (2014-01-02 08:30:00+00).
Anyway, this last (unintuitive) behavior is the cause of all the confusion. Read the article if you want a more thorough explanation. I actually got results which are a bit different than theirs on this last part, but the general idea is the same.
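For readers who think in Python, the two readings of a timezoneless timestamp map onto naive and aware datetimes roughly like this. It is only an analogy for the a/b/c/d columns above, with 'PST' modelled as a fixed UTC-8 offset:

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # fixed offset, no DST rules

a = datetime(2014, 1, 2, 0, 30)             # timezoneless timestamp

# b: declare that the value is in UTC (no shift, only a zone attached).
b = a.replace(tzinfo=timezone.utc)

# c: UTC first, then shown on the PST wall clock (the intended conversion).
c = b.astimezone(PST).replace(tzinfo=None)

# d: the value is treated as if it were ALREADY PST, then expressed in UTC.
d = a.replace(tzinfo=PST).astimezone(timezone.utc)

print(b)  # 2014-01-02 00:30:00+00:00
print(c)  # 2014-01-01 16:30:00
print(d)  # 2014-01-02 08:30:00+00:00
```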
I don't see the exact type of starts_at in your question. You really should include this information, it is the key to the solution. I'll have to guess.
PostgreSQL always stores UTC time for the type timestamp with time zone internally. Input and output (display) are adjusted to the current timezone setting or to the given time zone. The effect of AT TIME ZONE also changes with the underlying data type. See:
Ignoring time zones altogether in Rails and PostgreSQL
If you extract a date from type timestamp [without time zone], you get the date for the current time zone. The day in the output will be the same as in the display of the timestamp value.
If you extract a date from type timestamp with time zone (timestamptz for short), the time zone offset is "applied" first. You still get the date for the current time zone, which agrees with the display of the timestamp. The same point in time translates to the next day in parts of Europe, when it is past 4 p.m. in California for instance. To get the date for a certain time zone, apply AT TIME ZONE first.
Therefore, what you describe at the top of the question contradicts your example.
I assume that starts_at is a timestamp [without time zone] and that the time on your server is set to the local time. Test with:
SELECT now();
Does it display the same time as a clock on your wall? If yes (and the db server is running with correct time), the timezone setting of your current session agrees with your local time zone. If no, you may want to visit the setting of timezone in your postgresql.conf or your client for the session. Details in the manual.
Be aware that the timezone offset uses the opposite sign of what's displayed in timestamp literals. See:
Peculiar time zone handling in a Postgres database
To get your local date from starts_at just
SELECT starts_at::date
Tantamount to:
SELECT date(starts_at)
BTW, your local time is at UTC-7 right now, not UTC-8, because daylight savings time is in effect (not among the brighter ideas of the human race).
Pacific Standard Time (PST) is 8 hours behind UTC, but during daylight saving periods (like now) local Pacific time is only 7 hours behind (PDT). That's why the timestamptz is displayed as 2012-06-21 02:00:00-07 in your example: the abbreviation 'PST' always means a fixed offset of 8 hours and does not take daylight saving time into account, while the result is displayed in your session's time zone, which currently has an offset of -7. These two expressions (one with a summer date, one with a winter date) are displayed with different offsets in such a session and may result in different dates when cast:
SELECT '2012-06-21 01:00:00'::timestamp AT TIME ZONE 'PST'
, '2012-12-21 01:00:00'::timestamp AT TIME ZONE 'PST'
I know this is an old one, but you may want to consider using AT TIME ZONE 'US/Pacific' when casting to avoid any PST/PDT issues. Note the single quotes: double quotes would make PostgreSQL look for a column named "US/Pacific". So
SELECT starts_at::TIMESTAMPTZ AT TIME ZONE 'US/Pacific'
FROM schedules
WHERE ID = '40';
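The difference between a fixed abbreviation and a full zone name can be sketched in Python too. This is an illustration only: 'PST' is modelled as a fixed UTC-8 offset, America/Los_Angeles is used as the canonical name for US/Pacific, the stored values are assumed to be UTC, and local_wall_time is a made-up helper.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PST_FIXED = timezone(timedelta(hours=-8), "PST")  # abbreviation: fixed offset
PACIFIC = ZoneInfo("America/Los_Angeles")         # full zone: follows DST rules


def local_wall_time(naive_utc: datetime, tz) -> datetime:
    """Interpret a timezoneless value as UTC and convert it to `tz`."""
    return naive_utc.replace(tzinfo=timezone.utc).astimezone(tz)


summer = datetime(2012, 6, 21, 1, 0)
winter = datetime(2012, 12, 21, 1, 0)

print(local_wall_time(summer, PST_FIXED))  # 2012-06-20 17:00:00-08:00
print(local_wall_time(summer, PACIFIC))    # 2012-06-20 18:00:00-07:00
print(local_wall_time(winter, PST_FIXED))  # 2012-12-20 17:00:00-08:00
print(local_wall_time(winter, PACIFIC))    # 2012-12-20 17:00:00-08:00
```

In winter both agree. In summer the fixed offset is an hour off, which can flip the date for values close to midnight.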