Consider the following simplified situation:
create table trans
(
id integer not null
, tm timestamp without time zone not null
, val integer not null
, cus_id integer not null
);
insert into trans
(id, tm, val, cus_id)
values
(1, '2017-12-12 16:42:00', 2, 500) --
,(2, '2017-12-12 16:42:02', 4, 501) -- <--+---------+
,(3, '2017-12-12 16:42:05', 7, 502) -- |dt=54s |
,(4, '2017-12-12 16:42:56', 3, 501) -- <--+ |dt=59s
,(5, '2017-12-12 16:43:00', 2, 503) -- |
,(6, '2017-12-12 16:43:01', 5, 501) -- <------------+
,(7, '2017-12-12 16:43:15', 6, 502) --
,(8, '2017-12-12 16:44:50', 4, 501) --
;
I want to group rows by cus_id, but only where the interval between the time stamps of consecutive rows for the same cus_id is less than 1 minute.
In the example above this applies to the rows with ids 2, 4 and 6. These rows share cus_id 501 and have intervals below 1 minute: the interval id{2,4} is 54s and the interval id{2,6} is 59s. The interval id{4,6} is also below 1 minute, but it is subsumed by the larger interval id{2,6}.
I need a query that gives me the output:
cus_id | tm | val
--------+---------------------+-----
501 | 2017-12-12 16:42:02 | 12
(1 row)
The tm value would be the tm of the first row, i.e. the row with the lowest tm. The val would be the sum(val) of the grouped rows.
In the example 3 rows are grouped, but that could also be 2, 4, 5, ...
For simplicity, I only let the rows for cus_id 501 have nearby time stamps, but in my real table, there would be a lot more of them. It contains 20M+ rows.
Is this possible?
Naive (suboptimal) solution using a CTE
(a faster approach would avoid the CTE, replacing it with a joined subquery, or might even use a window function; a sketch of the latter follows the query):
-- Step one: find the start of a cluster
-- (the start is everything after a 60 second silence)
WITH starters AS (
SELECT * FROM trans tr
WHERE NOT EXISTS (
SELECT * FROM trans nx
WHERE nx.cus_id = tr.cus_id
AND nx.tm < tr.tm
AND nx.tm >= tr.tm - '60sec'::interval
)
)
-- SELECT * FROM starters ; \q
-- Step two: join everything within 60sec to the starter
-- and aggregate the clusters
SELECT st.cus_id
, st.id AS id
, MAX(tr.id) AS max_id
, MIN(tr.tm) AS first_tm
, MAX(tr.tm) AS last_tm
, SUM(tr.val) AS val
FROM trans tr
JOIN starters st ON st.cus_id = tr.cus_id
AND st.tm <= tr.tm AND st.tm > tr.tm - '60sec'::interval
GROUP BY 1,2
ORDER BY 1,2
;
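For reference, here is a minimal sketch of the window-function variant mentioned above (untested against a 20M-row table). Note one semantic difference: it chains rows on the gap to the previous row of the same cus_id, which matches the question's wording, but can merge clusters that the starter-based join above would split.
-- Gaps-and-islands sketch: LAG() flags each row that starts a new
-- cluster, a running SUM() of the flags numbers the clusters,
-- and a final GROUP BY collapses each cluster.
WITH flagged AS (
  SELECT *
       , CASE WHEN tm - LAG(tm) OVER (PARTITION BY cus_id ORDER BY tm)
                   < '60sec'::interval
              THEN 0 ELSE 1 END AS is_start
  FROM trans
)
, numbered AS (
  SELECT *
       , SUM(is_start) OVER (PARTITION BY cus_id ORDER BY tm) AS grp
  FROM flagged
)
SELECT cus_id
     , MIN(tm) AS first_tm
     , SUM(val) AS val
FROM numbered
GROUP BY cus_id, grp
HAVING COUNT(*) > 1  -- singletons omitted, matching the expected output
ORDER BY cus_id, first_tm;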
The query below, when executed against a DB2 database, does not bring in records from 31st March 2019. Ideally it should bring in those records as well, since the operator used is <=. The rows exist, and it works if I use < '2019-04-01'; however, we do not want to use that and want to stay with <=.
select wonum, requireddate ,cost
from workorder
where reportdate >='2019-03-01' AND reportdate <= '2019-03-31'
If reportdate is a datetime, then you might want to consider renaming the column to e.g. reportdatetime or maybe REPORT_DATETIME, but hey, it's your database design.
So, anyway, you could do this:
select wonum, requireddate ,cost from workorder
where DATE(reportdate) >='2019-03-01' AND DATE(reportdate) <= '2019-03-31'
or
select wonum, requireddate ,cost from workorder
where DATE(reportdate) BETWEEN '2019-03-01' AND '2019-03-31'
or
select wonum, requireddate ,cost from workorder
where reportdate >='2019-03-01' AND reportdate <= '2019-03-31-24.00.00'
This works as designed.
'2019-03-31' == timestamp('2019-03-31-00.00.00')
If you really don't want to use < (is the < sign forbidden in your organization? :)), try the following:
reportdate <= timestamp('2019-03-31-23.59.59.999999999999', 12)
BTW, there is an interesting thing with timestamps in Db2:
with t(row, ts) as (values
(1, timestamp('2019-03-31-23.59.59.999999999999', 12))
, (2, timestamp('2019-04-01-00.00.00', 12) - 0.000000000001 second)
, (3, timestamp('2019-03-31-24.00.00', 12))
, (4, timestamp('2019-03-31-23.59.59.999999999999', 12) + 0.000000000001 second)
, (5, timestamp('2019-04-01-00.00.00', 12))
)
select row, ts, dense_rank() over (order by ts) order
from t;
ROW TS ORDER
----------- -------------------------------- --------------------
1 2019-03-31-23.59.59.999999999999 1
2 2019-03-31-23.59.59.999999999999 1
3 2019-03-31-24.00.00.000000000000 2
4 2019-04-01-00.00.00.000000000000 3
5 2019-04-01-00.00.00.000000000000 3
2019-03-31-24.00.00 is a "special" timestamp (with the 24:00:00 time part).
It's greater than any 2019-03-31-xx timestamp, but less than 2019-04-01-00.00.00.
So, as Paul mentioned, you may use reportdate <= '2019-03-31-24.00.00' instead of reportdate <= timestamp('2019-03-31-23.59.59.999999999999', 12).
Note that we must specify the fractional-seconds precision (12) explicitly in the latter case; otherwise the literal is cast to timestamp(6) with data truncation.
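As a quick sanity check, a sketch (assumed to run unmodified on Db2) comparing the two equivalent upper bounds; both predicates should classify a late timestamp on 31 March as 'in':
-- assumed illustration, reusing the VALUES pattern from above
with t(report) as (values timestamp('2019-03-31-23.59.59', 6))
select case when report <= '2019-03-31-24.00.00'
            then 'in' else 'out' end as via_24_00
     , case when report <= timestamp('2019-03-31-23.59.59.999999999999', 12)
            then 'in' else 'out' end as via_max_ts
from t;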
I'm using PostgreSQL 9.5
and I have a table like this:
CREATE TABLE tracks (
track bigserial NOT NULL,
time_track timestamp,
CONSTRAINT pk_aircraft_tracks PRIMARY KEY ( track )
);
I want to obtain the track whose time_track is closest to a given timestamp, using a SELECT statement.
E.g., if I have:
 track | datatime
-------+---------------------
     1 | 2016-12-01 21:02:47
     2 | 2016-11-01 21:02:47
     3 | 2016-12-01 22:02:47
For the input datatime 2016-12-01 21:00, the track is 1.
I found the similar question Is there a postgres CLOSEST operator? for integers.
But its approach does not work with timestamps, nor on PostgreSQL 9.5:
SELECT * FROM
(
(SELECT time_track, track FROM tracks WHERE time_track >= now() ORDER BY time_track LIMIT 1) AS above
UNION ALL
(SELECT time_track, track FROM tracks WHERE time_track < now() ORDER BY time_track DESC LIMIT 1) AS below
)
ORDER BY abs(?-time_track) LIMIT 1;
The error:
ERROR: syntax error at or near "UNION"
LINE 4: UNION ALL
Track 1 is the closest to '2016-12-01 21:00':
with tracks(track, datatime) as (
values
(1, '2016-12-01 21:02:47'::timestamp),
(2, '2016-11-01 21:02:47'),
(3, '2016-12-01 22:02:47')
)
select *
from tracks
order by
case when datatime > '2016-12-01 21:00' then datatime - '2016-12-01 21:00'
else '2016-12-01 21:00' - datatime end
limit 1;
track | datatime
-------+---------------------
1 | 2016-12-01 21:02:47
(1 row)
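For completeness, here is the two-branch approach from the question with its syntax fixed: the SELECTs inside a UNION cannot carry their own aliases, only the derived table as a whole gets one, and abs() is not defined for intervals, so the larger of the two signed differences is used instead. The literal '2016-12-01 21:00' stands in for the input parameter; with an index on time_track, each branch reads a single row.
SELECT track, time_track
FROM (
        (SELECT time_track, track
         FROM tracks
         WHERE time_track >= '2016-12-01 21:00'::timestamp
         ORDER BY time_track
         LIMIT 1)
        UNION ALL
        (SELECT time_track, track
         FROM tracks
         WHERE time_track < '2016-12-01 21:00'::timestamp
         ORDER BY time_track DESC
         LIMIT 1)
     ) AS candidates
-- take the larger of the two signed differences in place of abs()
ORDER BY greatest(time_track - '2016-12-01 21:00'::timestamp,
                  '2016-12-01 21:00'::timestamp - time_track)
LIMIT 1;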
I have a problem in PostgreSQL which I find difficult even to describe in the title: I have two tables, each containing a range of values that are very similar but not identical. Suppose I have values like 0, 10, 20, 30, ... in one, and 1, 5, 6, 9, 10, 12, 19, 25, 26, ... in the second one (these are milliseconds). For each value in the second table I want to find the values immediately lower and higher in the first one. So, for the value 12 it would give me 10 and 20. I'm doing it like this:
SELECT s.*, MAX(v1."millisec") AS low_v, MIN(v2."millisec") AS high_v
FROM "signals" AS s, "tracks" AS v1, "tracks" AS v2
WHERE v1."millisec" <= s."d_time"
AND v2."millisec" > s."d_time"
GROUP BY s."d_time", s."field2"; -- this is just an example
And it works ... but it is very slow once I process several thousand rows, even with indexes on s."d_time" and v.millisec. So I think there must be a much better way to do it, but I fail to find one. Could anyone help me?
Try:
select s.*,
(select millisec
from tracks t
where t.millisec <= s.d_time
order by t.millisec desc
limit 1
) as low_v,
(select millisec
from tracks t
where t.millisec > s.d_time
order by t.millisec asc
limit 1
) as high_v
from signals s;
Be sure you have an index on tracks.millisec. If you have just created the index, you'll need to analyze the table to take advantage of it.
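A minimal sketch of that advice (the index name is assumed):
CREATE INDEX tracks_millisec_idx ON tracks (millisec);
ANALYZE tracks;  -- refresh planner statistics so the index gets considered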
Naive (trivial) way to find the preceding and next value.
-- the data (this could have been part of the original question)
CREATE TABLE table_one (id SERIAL NOT NULL PRIMARY KEY
, msec INTEGER NOT NULL -- an index might help
);
CREATE TABLE table_two (id SERIAL NOT NULL PRIMARY KEY
, msec INTEGER NOT NULL -- an index might help
);
INSERT INTO table_one(msec) VALUES (0), ( 10), ( 20), ( 30);
INSERT INTO table_two(msec) VALUES (1), ( 5), ( 6), ( 9), ( 10), ( 12), ( 19), ( 25), ( 26);
-- The query: find lower/higher values in table one
-- , but with no values between "us" and "them".
--
SELECT this.msec AS this
, prev.msec AS prev
, next.msec AS next
FROM table_two this
LEFT JOIN table_one prev ON prev.msec < this.msec
     AND NOT EXISTS (SELECT 1 FROM table_one nx
                     WHERE nx.msec < this.msec AND nx.msec > prev.msec)
LEFT JOIN table_one next ON next.msec > this.msec
     AND NOT EXISTS (SELECT 1 FROM table_one nx
                     WHERE nx.msec > this.msec AND nx.msec < next.msec)
;
Result:
CREATE TABLE
CREATE TABLE
INSERT 0 4
INSERT 0 9
this | prev | next
------+------+------
1 | 0 | 10
5 | 0 | 10
6 | 0 | 10
9 | 0 | 10
10 | 0 | 20
12 | 10 | 20
19 | 10 | 20
25 | 20 | 30
26 | 20 | 30
(9 rows)
Try this:
select *
from signals s
   , (select millisec low_value
           , lead(millisec) over (order by millisec) high_value
      from tracks) intervals
where s.d_time between low_value and high_value - 1;
For this type of problem, window functions are ideal; see: http://www.postgresql.org/docs/9.1/static/tutorial-window.html
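One caveat with the query above: for the interval starting at the largest tracks value, lead() returns NULL, and BETWEEN then drops every signal above it. A variant with half-open comparisons that keeps those rows (assuming the intended semantics are low_value <= d_time < high_value, open-ended at the top):
select *
from signals s
join (select millisec as low_value
           , lead(millisec) over (order by millisec) as high_value
      from tracks) intervals
  on s.d_time >= low_value
 and (s.d_time < high_value or high_value is null);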
Is there a way to calculate a weighted moving average with a fixed window size in Amazon Redshift? In more detail, given a table with a date column and a value column, for each date compute the weighted average value over a window of a specified size, with weights specified in an auxiliary table.
My search attempts so far yielded plenty of examples for doing this with window functions for a simple average (without weights), for example here. There are also some related suggestions for Postgres, e.g., this SO question; however, Redshift's feature set is quite sparse compared with Postgres, and it doesn't support many of the advanced features that are suggested.
Assuming we have the following tables:
create temporary table _data (ref_date date, value int);
insert into _data values
('2016-01-01', 34)
, ('2016-01-02', 12)
, ('2016-01-03', 25)
, ('2016-01-04', 17)
, ('2016-01-05', 22)
;
create temporary table _weight (days_in_past int, weight int);
insert into _weight values
(0, 4)
, (1, 2)
, (2, 1)
;
Then, if we want to calculate a moving average over a window of three days (including the current date), where values closer to the current date are assigned a higher weight than those further in the past, we'd expect the weighted average for 2016-01-05 (based on the values from 2016-01-05, 2016-01-04 and 2016-01-03) to be:
(22*4 + 17*2 + 25*1) / (4+2+1) = 147 / 7 = 21
And the query could look as follows:
with _prepare_window as (
select
t1.ref_date
, datediff(day, t2.ref_date, t1.ref_date) as days_in_past
, t2.value * weight as weighted_value
, weight
, count(t2.ref_date) over(partition by t1.ref_date rows between unbounded preceding and unbounded following) as num_values_in_window
from
_data t1
left join
_data t2 on datediff(day, t2.ref_date, t1.ref_date) between 0 and 2
left join
_weight on datediff(day, t2.ref_date, t1.ref_date) = days_in_past
order by
t1.ref_date
, datediff(day, t2.ref_date, t1.ref_date)
)
select
ref_date
, round(sum(weighted_value)::float/sum(weight), 0) as weighted_average
from
_prepare_window
where
num_values_in_window = 3
group by
ref_date
order by
ref_date
;
Giving the result:
ref_date | weighted_average
------------+------------------
2016-01-03 | 23
2016-01-04 | 19
2016-01-05 | 21
(3 rows)
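For comparison, if the weights may be hardcoded and the data is known to have exactly one row per day with no gaps (both are assumptions, not givens in the question), a join-free sketch using LAG() computes the same thing; the first two days yield NULL, which mirrors the num_values_in_window = 3 filter:
select ref_date
     -- weights 4/2/1 for today, yesterday, the day before; divisor 7
     , round((4 * value
            + 2 * lag(value, 1) over (order by ref_date)
            + 1 * lag(value, 2) over (order by ref_date)
             )::float / 7, 0) as weighted_average
from _data
order by ref_date;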
Need help with the following query:
Current Data format:
StudentID   EnrolledStartTime    EnrolledEndTime
1           7/18/2011 1.00 AM    7/18/2011 1.05 AM
2           7/18/2011 1.00 AM    7/18/2011 1.09 AM
3           7/18/2011 1.20 AM    7/18/2011 1.40 AM
4           7/18/2011 1.50 AM    7/18/2011 1.59 AM
5           7/19/2011 1.00 AM    7/19/2011 1.05 AM
6           7/19/2011 1.00 AM    7/19/2011 1.09 AM
7           7/19/2011 1.20 AM    7/19/2011 1.40 AM
8           7/19/2011 1.10 AM    7/19/2011 1.59 AM
I would like to calculate the time difference between EnrolledEndTime and EnrolledStartTime, group the differences into 15-minute buckets, and count the students enrolled in each bucket per day.
Expected Result :
Count(StudentID)   Date        0-15Mins   16-30Mins   31-45Mins   46-60Mins
4                  7/18/2011   3          1           0           0
4                  7/19/2011   2          1           0           1
Can I use a combination of the PIVOT function to achieve the required result? Any pointers would be helpful.
Create a table variable/temp table that includes all the columns from the original table, plus one column that marks the row as 0, 16, 31 or 46. Then:
SELECT * FROM <temp table> PIVOT (COUNT(StudentID) FOR <bucket column> IN ([0], [16], [31], [46])) AS p;
That should put you pretty close.
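Spelled out, that approach might look like the sketch below (the table name Enrollments is assumed; the per-date total column from the expected output is omitted for brevity):
SELECT EnrolledDate
     , [0]  AS [0-15Mins]
     , [16] AS [16-30Mins]
     , [31] AS [31-45Mins]
     , [46] AS [46-60Mins]
FROM (
    SELECT StudentID
         -- truncate the start time to its date
         , DATEADD(day, DATEDIFF(day, 0, EnrolledStartTime), 0) AS EnrolledDate
         -- mark each row with its bucket; anything over 45 minutes
         -- (including over an hour) lands in the last bucket
         , CASE WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 15 THEN 0
                WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 30 THEN 16
                WHEN DATEDIFF(minute, EnrolledStartTime, EnrolledEndTime) <= 45 THEN 31
                ELSE 46 END AS Bucket
    FROM Enrollments
) src
PIVOT (COUNT(StudentID) FOR Bucket IN ([0], [16], [31], [46])) AS p;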
It's possible (just see the basic pivot instructions here: http://msdn.microsoft.com/en-us/library/ms177410.aspx), but one problem you'll have using pivot is that you need to know ahead of time which columns you want to pivot into.
E.g., you mention 0-15, 16-30, etc. but actually, you have no idea how long some students might take -- some might take 24-hours, or your full session timeout, or what have you.
So to alleviate this problem, I'd suggest having a final column as a catch-all, labeled something like '>60'.
Other than that, just do a select on this table, selecting the student ID, the date, and a CASE statement, and you'll have everything you need to work the pivot on.
CASE WHEN date2 - date1 < 15 THEN '0-15' WHEN date2-date1 < 30 THEN '16-30'...ELSE '>60' END.
I have an old version of MS SQL Server that doesn't support PIVOT, so I couldn't test the PIVOT part; I tried my best with it. The rest of the SQL will give you the exact data for the pivot table. If you accept NULL instead of 0, it can be written a lot more simply: you can skip the subselect defined in "with a ...".
declare @t table (EnrolledStartTime datetime, EnrolledEndTime datetime)
insert @t values('2011/7/18 01:00', '2011/7/18 01:05')
insert @t values('2011/7/18 01:00', '2011/7/18 01:09')
insert @t values('2011/7/18 01:20', '2011/7/18 01:40')
insert @t values('2011/7/18 01:50', '2011/7/18 01:59')
insert @t values('2011/7/19 01:00', '2011/7/19 01:05')
insert @t values('2011/7/19 01:00', '2011/7/19 01:09')
insert @t values('2011/7/19 01:20', '2011/7/19 01:40')
insert @t values('2011/7/19 01:10', '2011/7/19 01:59')
;with a
as
(select * from
-- dateadd(day, datediff(day, 0, x), 0) truncates x to midnight;
-- casting the datetime to int would round up for times after noon
(select distinct dateadd(day, datediff(day, 0, EnrolledStartTime), 0) date from @t) dates
cross join
(select '0-15Mins' t, 0 group1 union select '16-30Mins', 1 union select '31-45Mins', 2 union select '46-60Mins', 3) i)
, b as
(select (datediff(minute, EnrolledStartTime, EnrolledEndTime) - 1) / 15 group1
      , dateadd(day, datediff(day, 0, EnrolledStartTime), 0) date
from @t)
select count(b.date) count, a.date, a.t, a.group1 from a
left join b
on a.group1 = b.group1
and a.date = b.date
group by a.date, a.t, a.group1
-- PIVOT(max(date)
--   FOR t
--   IN ([0-15Mins], [16-30Mins], [31-45Mins], [46-60Mins])) AS p