PostgreSQL: Date Difference with fractions

SELECT cu.user_id, cu.last_activity, cu.updated_time,
       DATE_PART('day', cu.last_activity - cu.updated_time),
       to_char(cu.last_activity - cu.updated_time, 'DD.HH24')
FROM stats.core_users cu
WHERE cu.user_id = '117132014' OR cu.user_id = '117132012';
I get a result like:
117132014 2017-12-11 10:34:51.349905 2017-12-09 12:00:38.503518 1 01.22
117132012 2017-12-11 05:18:20.312283 2017-12-08 15:46:51.914085 2 02.13
Is it feasible to get the day difference with fractions, like 1.91 days in the first case instead of 1 day and 22 hours, to be more precise and easier to fit into a machine learning model?

date_part() does what its name says: it returns a single part of a date, interval, or timestamp. In your case it is one part of an interval (because timestamp - timestamp returns an interval).
If you want the result as a fraction, you need to extract the total seconds of the interval and then divide by 86400 (the number of seconds in a day):
extract(epoch from cu.last_activity - cu.updated_time) / 86400
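Applied to the original query, a sketch might look like this (the round(..., 2) is an addition of mine, purely to produce the 1.91-style output asked about):
SELECT cu.user_id,
       round((extract(epoch from cu.last_activity - cu.updated_time) / 86400)::numeric, 2) AS days_fraction
FROM stats.core_users cu
WHERE cu.user_id = '117132014' OR cu.user_id = '117132012';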

Related

Unable to convert age function to a float with PostgreSQL

I am trying to write a query to output the average earnings per hour spent dashing by day of week.
worker_table:
session_id | worker_id | session_start       | session_end         | total_pay | num_of_deliveries
-----------+-----------+---------------------+---------------------+-----------+------------------
7712       | 9347      | 2020-08-31 03:32:43 | 2020-08-31 05:53:43 | 46.72     | 3
1560       | 5645      | 2020-07-26 01:48:40 | 2020-07-26 04:48:40 | 65.32     | 4
So far I am able to extract the day of week and the age, but I'm not sure how to cast the age to numeric so I can proceed with my query. When I run the query below, the age is, for example, "02:21:00", but I want it as a float so I can divide total_pay by it. Thanks.
select extract(dow from session_start), age(session_end, session_start)
from worker_table
Edit: If for some reason the table is not showing correctly, please view my table here: https://pastebin.com/g8GvyWWR
To get the number of seconds from an interval, you can use extract(epoch from ...). In your case, you could do something like this -
select extract(epoch from age(session_end, session_start)) as session_length_in_seconds
from worker_table
The full query might then look something like this -
select
    extract(dow from session_start) as day_of_week,
    avg(total_pay / (extract(epoch from age(session_end, session_start)) / 3600.0)) as avg_earnings_per_hour
from worker_table
group by 1
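As a sanity check against the first sample row: 05:53:43 - 03:32:43 is 02:21:00, i.e. 8460 seconds or 2.35 hours, so that session contributes 46.72 / 2.35 ≈ 19.88 in earnings per hour to the average.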

HOUR() function of Excel in Postgres (equivalent)

I have recently been working with Postgres and I have to make several calculations. However, I have not been able to imitate the HOUR() function of Excel; I read the official documentation but it did not help me much.
The function receives a decimal and obtains the hours, minutes, and seconds it represents; for example, the decimal 0.99988426 returns 11:59:50. I tried doing this in Postgres (I use PostgreSQL 10.4) with the to_timestamp function: select to_char(to_timestamp(0.99988426), 'HH24:MI:SS'); but this returns 19:00:00. Surely I am omitting something; any idea how to solve this?
24:00:00, or 86400 seconds, = 1
Half a day (12:00 noon), or 43200 seconds, = 43200/86400 = 0.5
23:59:50, or 86390 seconds, = 86390/86400 = 0.99988426
So to convert your decimal value to a time, all you have to do is multiply it by 86400, which gives you seconds, and then convert those to your format in either of the following ways:
SELECT TO_CHAR((0.99988426 * 86400) * '1 second'::interval, 'HH24:MI:SS');
SELECT (0.99988426 * 86400) * interval '1 sec';
There are two major differences to handle:
Excel does not consider the time zone. Serial date 0 starts at 00:00:00, but Postgres applies the session time zone, so here it becomes 19:00. You would need to convert the Postgres result to UTC to get the same value as in Excel.
select to_char(to_timestamp(0), 'HH24:MI:SS'), to_char(to_timestamp(0) AT TIME ZONE 'UTC', 'HH24:MI:SS');
 to_char  | to_char
----------+----------
 19:00:00 | 00:00:00
Excel considers 1 to be one day, while Postgres considers 1 to be 1 second. To get the same behavior, multiply your number by 86400, i.e. the number of seconds in a day:
select to_char(to_timestamp(0.99988426 * 86400) AT TIME ZONE 'UTC', 'HH24:MI:SS');
 to_char
----------
 23:59:50
(1 row)
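If all you need is the hour number, which is what Excel's HOUR() itself returns, a minimal sketch along the same lines would be:
select extract(hour from (0.99988426 * 86400) * interval '1 second');
which returns 23.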

How to bin timestamp data into buckets of n minutes in postgres

I have the following query which works, binning timestamped "observations" into buckets whose boundaries are defined by the bins table:
SELECT
    count(id),
    width_bucket(
        time::TIMESTAMP,
        (SELECT ARRAY(SELECT start_time
                      FROM bins
                      WHERE owner_id = 'some id'
                      ORDER BY start_time ASC)::TIMESTAMP[])
    ) AS bucket
FROM observations
WHERE owner_id = 'some id'
GROUP BY bucket
ORDER BY bucket;
I would like to modify this to allow querying arbitrary n-minute bins starting from a specified timestamp, rather than having to pull from an actual "bins" table.
That is, given a start time, a "bin width" in minutes, and a number of bins, is there a way I can generate the array of timestamps to pass into the width_bucket function?
Alternatively, is there a different/simpler approach to get the same results?
Use the function generate_series(start, stop, step interval), e.g.
select array(
    select generate_series(
        timestamp '2018-04-15 00:00',
        '2018-04-15 01:00',
        '30 minutes'))
array
---------------------------------------------------------------------
{"2018-04-15 00:00:00","2018-04-15 00:30:00","2018-04-15 01:00:00"}
(1 row)
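Putting the two together, a sketch of the parameterized query you describe might look like the following, where the start time, the 30-minute bin width, and the bin count of 10 are placeholder values to substitute with your own:
SELECT
    count(id),
    width_bucket(
        time::TIMESTAMP,
        (SELECT array(
            SELECT generate_series(
                timestamp '2018-04-15 00:00',                               -- start of the first bin
                timestamp '2018-04-15 00:00' + 10 * interval '30 minutes',  -- start + number of bins * width
                interval '30 minutes')))                                    -- bin width
    ) AS bucket
FROM observations
WHERE owner_id = 'some id'
GROUP BY bucket
ORDER BY bucket;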
The above answers seem to do what you want, but as of PostgreSQL 14, there is now a function date_bin just for binning timestamps.
Quoting the documentation:
date_bin(stride, source, origin)
source is a value expression of type timestamp or timestamp with time zone. (Values of type date are cast automatically to timestamp.) stride is a value expression of type interval. The return value is likewise of type timestamp or timestamp with time zone, and it marks the beginning of the bin into which the source is placed.
Examples:
SELECT date_bin('15 minutes', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-01-01');
Result: 2020-02-11 15:30:00
SELECT date_bin('15 minutes', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-01-01 00:02:30');
Result: 2020-02-11 15:32:30
In the case of full units (1 minute, 1 hour, etc.), it gives the same result as the analogous date_trunc call, but the difference is that date_bin can truncate to an arbitrary interval.
The stride interval must be greater than zero and cannot contain units of month or larger.
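To illustrate the date_trunc equivalence mentioned above, both of the following (a quick check of mine, not from the docs) return 2020-02-11 15:00:00:
SELECT date_trunc('hour', TIMESTAMP '2020-02-11 15:44:17');
SELECT date_bin('1 hour', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-01-01');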
I would like to call special attention to the line
The return value [...] marks the beginning of the bin into which the source is placed.
This means that input timestamps are always binned by "rounding down", rather than binned to whichever bin boundary is closest. E.g. if you do:
SELECT date_bin('1 hour', '2021-10-13 00:59:59', '2021-10-13 00:00:00');
then the result will be 2021-10-13 00:00:00 (rounded down by 59 minutes and 59 seconds), NOT 2021-10-13 01:00:00 (which is only one second away from the supplied timestamp). So date_bin does something slightly different from exactly what you asked for, but I figure this is good to post for anyone coming here in the future.
A different approach, without a series:
Divide the difference between time and the start timestamp by the width of the bin (5 minutes in the example below) and add 1, because the first bucket of width_bucket(...) is 1, not 0.
floor(extract(epoch from (time - '2019-06-04 00:00'::timestamp)) / (5 * 60) ) + 1 as bucket
Getting the start of the bin is also possible:
to_timestamp(floor(extract(epoch from time) / (5 * 60)) * (5 * 60)) as bin_start
Putting this all together:
SELECT
    count(id),
    floor(extract(epoch from (time - '2019-06-04 00:00'::timestamp)) / (5 * 60)) + 1 AS bucket,
    to_timestamp(floor(extract(epoch from time) / (5 * 60)) * (5 * 60)) AS bin_start
FROM observations
WHERE owner_id = 'some id'
GROUP BY bucket, bin_start
ORDER BY bucket;
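For example, an observation at 2019-06-04 00:07:30 lies 450 seconds after the start, so its bucket is floor(450 / 300) + 1 = 2 (the second five-minute bin) and its bin_start is 2019-06-04 00:05:00.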

Pandas DateOffset, step back one day

I am trying to understand why
print(pd.Timestamp("2015-01-01") - pd.DateOffset(day=1))
does not result in
pd.Timestamp("2014-12-31")
I am using pandas 0.18 and running in the CET timezone.
You can check pandas.tseries.offsets.DateOffset:
**kwds
Temporal parameters that add to or replace the offset value.
Parameters that add to the offset (like Timedelta):
years
months
weeks
days
hours
minutes
seconds
microseconds
nanoseconds
Parameters that replace the offset value:
year
month
day
weekday
hour
minute
second
microsecond
nanosecond
print(pd.Timestamp("2015-01-01") - pd.DateOffset(days=1))
2014-12-31 00:00:00
Another solution:
print(pd.Timestamp("2015-01-01") - pd.offsets.Day(1))
2014-12-31 00:00:00
Also it is possible to subtract Timedelta:
print(pd.Timestamp("2015-01-01") - pd.Timedelta(1, unit='d'))
pd.DateOffset(day=1) works (i.e. no error is raised) because "day" is a valid parameter, just as "days" is.
Compare the two calls below: "day" resets the day of the month, while "days" adds to the original day.
pd.Timestamp("2019-12-25") + pd.DateOffset(day=1)
Timestamp('2019-12-01 00:00:00')
pd.Timestamp("2019-12-25") + pd.DateOffset(days=1)
Timestamp('2019-12-26 00:00:00')
Day(d) and DateOffset(days=d) do not behave exactly the same when used on timestamps carrying timezone information (at least on pandas 0.18.0). It looks like DateOffset(days=1) adds one calendar day while keeping the wall-clock hour, whereas Day(1) adds exactly 24 hours of elapsed time.
>>> # 30/10/2016 02:00+02:00 is the hour before the DST change
>>> print(pd.Timestamp("2016-10-30 02:00+02:00", tz="Europe/Brussels") + pd.offsets.Day(1))
2016-10-31 01:00:00+01:00
>>> print(pd.Timestamp("2016-10-30 02:00+02:00", tz="Europe/Brussels") + pd.DateOffset(days=1))
2016-10-31 02:00:00+01:00

DB2 - get the average time between a set of dates

I have a list of events and each one has a startDate and endDate. I need to know the average time taken for each event.
I need something like this (pseudocode):
select sum ( (timestamp(endDate) - timestamp(startDate)) for each event )
/ (count of events)
It only makes mathematical sense to take the AVG() of a numeric value, not of datetime values or durations. Since you want your answer with minutes precision, get the difference in minutes and then convert back to days, hours, and minutes. (There are 24*60 = 1440 minutes in a standard day.)
with q as
  (select avg(
     timestampdiff(4, char(endDate - startDate))
   ) as avgmns
   from yourChosenData
  )
select int(avgmns / 1440) as avg_days,
       int(mod(avgmns, 1440) / 60) as avg_hours,
       mod(avgmns, 60) as avg_minutes
from q
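For example, if avgmns comes out to 2890 minutes, then avg_days = int(2890 / 1440) = 2, avg_hours = int(mod(2890, 1440) / 60) = int(10 / 60) = 0, and avg_minutes = mod(2890, 60) = 10, i.e. 2 days, 0 hours, and 10 minutes.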
Note that timestampdiff() returns an estimate, not an exact difference. To avoid this issue, one could use a more accurate calculation based on days() and midnight_seconds():
with q as
  (select avg(
     (days(endDate) - days(startDate)) * 1440
     + (midnight_seconds(endDate) - midnight_seconds(startDate)) / 60
   ) as avgmns
   from yourChosenData
  )
select int(avgmns / 1440) as avg_days,
       int(mod(avgmns, 1440) / 60) as avg_hours,
       mod(avgmns, 60) as avg_minutes
from q
In order to address the DST issue, if needed, one might choose either of:
include a UTC offset column corresponding to each timestamp field. This would also be useful if timestamps were being recorded in more than one timezone. The difference in offsets could then be fed into the calculation along with the timestamps.
provide a deterministic UDF which could return a UTC or DST adjustment offset for a given timestamp. If multiple timezones are involved, the zone should also be a parameter to the function. Depending on the geographic areas involved, the logic may also need to consider areas which observe alternative DST rules.
You have to be careful with the denominator to prevent a division by zero (SQL0802 - Data Conversion or Data Mapping Error).
Depending on the precision you need in the result, you will have to convert the difference. Let's suppose you need seconds (precision code 2):
select
    sum(timestampdiff(2, char(endDate - startDate)))
    /
    count(*)
from yourTable
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0000861.html