Temporal Aggregation in PostgreSQL

I am working on a Java implementation for temporal aggregation using a PostgreSQL database.
My table looks like this:
Value | Start      | Stop
(int) | (Date)     | (Date)
------+------------+-----------
1     | 2004-01-01 | 2010-01-01
4     | 2000-01-01 | 2008-01-01
So, to visualize these periods:
                    |------------ 1 -------------|
|----------------- 4 ------------------|
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[        4         ][     5 = 4+1      ][   1    ]
My algorithm now calculates temporal aggregations of the data, e.g. SUM():
Value | Start      | Stop
------+------------+-----------
4     | 2000-01-01 | 2004-01-01
5     | 2004-01-01 | 2008-01-01
1     | 2008-01-01 | 2010-01-01
In order to verify the results, I would now like to compute the same aggregations directly in PostgreSQL. I know there is no easy solution to this problem yet, but there surely is some way to get the same results. The aggregations Count, Max, Min, Sum and Average should be supported. I do not mind a bad or slow solution; it just has to work.
A query I found so far which should work similarly is the following:
select count(*), ts, te
from ( checkout a normalize checkout b using() ) checkoutNorm
group by ts, te;
My adaptation looks like this:
select count(*), start, stop
from ( myTable a normalize myTable b using() ) myTableNorm
group by start, stop;
However, an error was reported: ERROR: syntax error at or near "normalize" -- LINE 2: from ( ndbs_10 a normalize ndbs_10 b using() ) ndbsNorm.
Does anyone have a solution to this problem? It does not have to be based on the above query, as long as it works. Thanks a lot.

Your question was really hard to understand. But I think I figured it out.
You want a running sum over value. Values only apply between the start and stop of a time period, so they have to be added at the beginning of that period and deducted at the end.
In addition, you want the beginning and end of the resulting period the sum is valid for.
That should do it:
-- DROP SCHEMA x CASCADE;
CREATE SCHEMA x;
CREATE TABLE x.tbl(val int, start date, stop date);
INSERT INTO x.tbl VALUES
(4 ,'2000-01-01' ,'2008-01-01')
,(7 ,'2001-01-01' ,'2009-01-01')
,(1 ,'2004-01-01' ,'2010-01-01')
,(2 ,'2005-01-01' ,'2006-01-01');
WITH a AS (
    SELECT start AS ts, val FROM x.tbl
    UNION ALL
    SELECT stop, val * (-1) FROM x.tbl
    ORDER BY 1, 2
    )
SELECT sum(val) OVER w AS val_sum
      ,ts               AS start
      ,lead(ts) OVER w  AS stop
FROM   a
WINDOW w AS (ORDER BY ts)
ORDER  BY ts;
val_sum | start | stop
--------+------------+------------
4 | 2000-01-01 | 2001-01-01
11 | 2001-01-01 | 2004-01-01
12 | 2004-01-01 | 2005-01-01
14 | 2005-01-01 | 2006-01-01
12 | 2006-01-01 | 2008-01-01
8 | 2008-01-01 | 2009-01-01
1 | 2009-01-01 | 2010-01-01
0 | 2010-01-01 |
Edit after request
For all requested aggregate functions:
SELECT period
      ,val_sum
      ,val_count
      ,val_sum::float / val_count AS val_avg
      ,(SELECT min(val) FROM x.tbl WHERE start < y.stop AND stop > y.start) AS val_min
      ,(SELECT max(val) FROM x.tbl WHERE start < y.stop AND stop > y.start) AS val_max
      ,start
      ,stop
FROM (
   WITH a AS (
      SELECT start AS ts, val, 1 AS c FROM x.tbl
      UNION ALL
      SELECT stop, val, -1 FROM x.tbl
      ORDER BY 1, 2
      )
   SELECT count(*) OVER w   AS period
         ,sum(val*c) OVER w AS val_sum
         ,sum(c) OVER w     AS val_count
         ,ts                AS start
         ,lead(ts) OVER w   AS stop
   FROM a
   WINDOW w AS (ORDER BY ts)
   ORDER BY ts
   ) y
WHERE stop IS NOT NULL;
period | val_sum | val_count | val_avg | val_min | val_max | start | stop
--------+---------+-----------+---------+---------+---------+------------+------------
1 | 4 | 1 | 4 | 4 | 4 | 2000-01-01 | 2001-01-01
2 | 11 | 2 | 5.5 | 4 | 7 | 2001-01-01 | 2004-01-01
3 | 12 | 3 | 4 | 1 | 7 | 2004-01-01 | 2005-01-01
4 | 14 | 4 | 3.5 | 1 | 7 | 2005-01-01 | 2006-01-01
5 | 12 | 3 | 4 | 1 | 7 | 2006-01-01 | 2008-01-01
6 | 8 | 2 | 4 | 1 | 7 | 2008-01-01 | 2009-01-01
7 | 1 | 1 | 1 | 1 | 1 | 2009-01-01 | 2010-01-01
min() and max() could possibly be optimized, but that should be good enough.
CTEs (WITH clauses) and subqueries are interchangeable, as you can see.
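One way that optimization might look (my own sketch, not part of the original answer, and assuming PostgreSQL 9.3+ for LATERAL): fold the two correlated min/max subqueries into a single LATERAL join, so each result period scans its overlapping source rows only once.
SELECT y.period
      ,y.val_sum
      ,y.val_count
      ,y.val_sum::float / y.val_count AS val_avg
      ,z.val_min
      ,z.val_max
      ,y.start
      ,y.stop
FROM (
   WITH a AS (
      SELECT start AS ts, val, 1 AS c FROM x.tbl
      UNION ALL
      SELECT stop, val, -1 FROM x.tbl
      )
   SELECT count(*) OVER w   AS period
         ,sum(val*c) OVER w AS val_sum
         ,sum(c) OVER w     AS val_count
         ,ts                AS start
         ,lead(ts) OVER w   AS stop
   FROM a
   WINDOW w AS (ORDER BY ts)
   ) y
CROSS JOIN LATERAL (
   -- one scan per period for both aggregates instead of two correlated subqueries
   SELECT min(t.val) AS val_min, max(t.val) AS val_max
   FROM x.tbl t
   WHERE t.start < y.stop
   AND   t.stop  > y.start
   ) z
WHERE y.stop IS NOT NULL
ORDER BY y.start;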


Filter a sum of values until a certain threshold is reached

DbFiddle
Stuck. Need SO :)
Consider the following distribution of values.
ID | CNT | SEC | SHOW (bool)
---+-----+-----+------------
 1 |  10 |   1 |
 2 |   1 |   1 |
 3 |  25 |   1 |
 4 |   1 |   1 |
 5 |   2 |   1 |
 6 |  10 |   1 |
 7 |  50 |   2 |
 8 |  90 |   2 |
My goal is to filter by sec and then
sort by cnt ascending,
sort by id ascending,
and then flag rows as show = false while cnt is < 5, and keep hiding rows until the running sum of cnt over all hidden rows (show = false) is >= 5.
So the final sum of all "hidden" rows may never be < 5.
Expected outcome for sec=1:
| id | cnt | cnt_sum | show |
|----|-----|---------|-------|
| 2 | 1 | 1 | false |
| 4 | 1 | 2 | false |
| 5 | 2 | 4 | false |
| 1 | 10 | 14 | false | -- The sum of all hidden rows before this point is 4
| 6 | 10 | 24 | true | -- The total of all hidden rows is now >= 5.
| 3 | 25 | 49 | true |
Expected outcome for sec=2:
| id | cnt | cnt_sum | show |
|----|-----|---------|-------|
| 7 | 50 | 50 | true |
| 8 | 90 | 140 | true |
I can already sort the values and create the running sums etc. What I have not figured out is how to determine the cutoff point at which "hiding" is no longer necessary.
I am already doing this in client code and I want to migrate it to SQL.
Here LAG() will help achieve what you want. You can write your query like this:
with cte as (
    select
        id, cnt, sec,
        sum(cnt) over (partition by sec order by cnt, id) sum_
    from
        tbl
)
select
    id, cnt, sum_,
    case
        when sum_ < 5 or lag(sum_) over (partition by sec order by cnt, id) < 5 then 'false'
        else 'true'
    end as "show"
from cte
DEMO
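For reference, a minimal setup to run this against (the table and column names here are an assumption matching the question, not taken from the linked demo):
CREATE TABLE tbl (id int, cnt int, sec int);

INSERT INTO tbl (id, cnt, sec) VALUES
 (1, 10, 1)
,(2,  1, 1)
,(3, 25, 1)
,(4,  1, 1)
,(5,  2, 1)
,(6, 10, 1)
,(7, 50, 2)
,(8, 90, 2);
With this data, the query above produces the show flags from the expected outcome for both sec = 1 and sec = 2.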

Cumulative sum of multiple window functions

I have a table with the structure:
id | date | player_id | score
--------------------------------------
1 | 2019-01-01 | 1 | 1
2 | 2019-01-02 | 1 | 1
3 | 2019-01-03 | 1 | 0
4 | 2019-01-04 | 1 | 0
5 | 2019-01-05 | 1 | 1
6 | 2019-01-06 | 1 | 1
7 | 2019-01-07 | 1 | 0
8 | 2019-01-08 | 1 | 1
9 | 2019-01-09 | 1 | 0
10 | 2019-01-10 | 1 | 0
11 | 2019-01-11 | 1 | 1
I want to create two more columns, total_score and last_seven_days.
total_score is a rolling sum of the player's score.
last_seven_days is the sum of the score over the seven days prior to (and not including) the date.
I have written the following SQL query:
SELECT id,
date,
player_id,
score,
sum(score) OVER all_scores AS all_score,
sum(score) OVER last_seven AS last_seven_score
FROM scores
WINDOW all_scores AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
last_seven AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING);
and get the following output:
id | date | player_id | score | all_score | last_seven_score
------------------------------------------------------------------
1 | 2019-01-01 | 1 | 1 | |
2 | 2019-01-02 | 1 | 1 | 1 | 1
3 | 2019-01-03 | 1 | 0 | 2 | 2
4 | 2019-01-04 | 1 | 0 | 2 | 2
5 | 2019-01-05 | 1 | 1 | 2 | 2
6 | 2019-01-06 | 1 | 1 | 3 | 3
7 | 2019-01-07 | 1 | 0 | 4 | 4
8 | 2019-01-08 | 1 | 1 | 4 | 4
9 | 2019-01-09 | 1 | 0 | 5 | 4
10 | 2019-01-10 | 1 | 0 | 5 | 3
11 | 2019-01-11 | 1 | 1 | 5 | 3
I have realised that I need to change this
last_seven AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING)
so that, instead of the fixed count of 7 rows, it uses some sort of date range, because just having the number 7 will introduce errors when days are missing.
I.e. it would be nice to be able to say date - 2 days or date - 6 days.
I would also like to add columns such as 3 months, 6 months and 12 months later down the track, so it needs to be dynamic.
DEMO
demo:db<>fiddle
Solution for Postgres 11+:
Using a RANGE interval, as @LaurenzAlbe did.
Solution for Postgres <11:
(just presenting the "days" part, the "all_scores" part is the same)
Joining the table against itself on the player_id and the relevant date range:
SELECT s1.*,
       (SELECT SUM(s2.score)
        FROM scores s2
        WHERE s2.player_id = s1.player_id
          AND s2."date" BETWEEN s1."date" - interval '7 days'
                            AND s1."date" - interval '1 day') AS last_seven_score
FROM scores s1
You need to use a window by RANGE:
last_seven AS (PARTITION BY player_id
ORDER BY date
RANGE BETWEEN INTERVAL '7 days' PRECEDING
AND INTERVAL '1 day' PRECEDING)
This solution will work only from v11 on.
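Putting that frame into the question's query, a complete version for v11+ could look like this (a sketch; only the last_seven window changes, all_scores stays row-based as before):
SELECT id,
       date,
       player_id,
       score,
       sum(score) OVER all_scores AS all_score,
       sum(score) OVER last_seven AS last_seven_score
FROM scores
WINDOW all_scores AS (PARTITION BY player_id ORDER BY id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
       last_seven AS (PARTITION BY player_id ORDER BY date
                      RANGE BETWEEN INTERVAL '7 days' PRECEDING
                                AND INTERVAL '1 day' PRECEDING);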

postgres tablefunc, sales data grouped by product, with crosstab of months

TIL about tablefunc and crosstab. At first I wanted to "group data by columns" but that doesn't really mean anything.
My product sales look like this:
product_id | units | date
-----------+-------+----------
        10 |     1 | 1-1-2018
        10 |     2 | 2-2-2018
        11 |     3 | 1-1-2018
        11 |    10 | 1-2-2018
        12 |     1 | 2-1-2018
        13 |    10 | 1-1-2018
        13 |    10 | 2-2-2018
I would like to produce a table of products with months as columns:
product_id | 01-01-2018 | 02-01-2018 | etc.
-----------+------------+------------+-----
        10 |          1 |          2 |
        11 |         13 |          0 |
        12 |          0 |          1 |
        13 |         20 |          0 |
First I would group by month, then invert and group by product, but I cannot figure out how to do this.
After enabling the tablefunc extension,
SELECT product_id, coalesce("2018-1-1", 0) as "2018-1-1"
, coalesce("2018-2-1", 0) as "2018-2-1"
FROM crosstab(
$$SELECT product_id, date_trunc('month', date)::date as month, sum(units) as units
FROM test
GROUP BY product_id, month
ORDER BY 1$$
, $$VALUES ('2018-1-1'::date), ('2018-2-1')$$
) AS ct (product_id int, "2018-1-1" int, "2018-2-1" int);
yields
| product_id | 2018-1-1 | 2018-2-1 |
|------------+----------+----------|
| 10 | 1 | 2 |
| 11 | 13 | 0 |
| 12 | 0 | 1 |
| 13 | 10 | 10 |
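For completeness, a setup sketch this query assumes (table name test as used in the query; the extension has to be created once per database, and the question's M-D-Y dates are written here in ISO form):
CREATE EXTENSION IF NOT EXISTS tablefunc;

CREATE TABLE test (product_id int, units int, date date);

INSERT INTO test (product_id, units, date) VALUES
 (10,  1, '2018-01-01')
,(10,  2, '2018-02-02')
,(11,  3, '2018-01-01')
,(11, 10, '2018-01-02')
,(12,  1, '2018-02-01')
,(13, 10, '2018-01-01')
,(13, 10, '2018-02-02');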

Grouping by rolling date interval in Netezza

I have a table in Netezza that looks like this
Date Stock Return
2015-01-01 A xxx
2015-01-02 A xxx
2015-01-03 A 0
2015-01-04 A 0
2015-01-05 A xxx
2015-01-06 A xxx
2015-01-07 A xxx
2015-01-08 A xxx
2015-01-09 A xxx
2015-01-10 A 0
2015-01-11 A 0
2015-01-12 A xxx
2015-01-13 A xxx
2015-01-14 A xxx
2015-01-15 A xxx
2015-01-16 A xxx
2015-01-17 A 0
2015-01-18 A 0
2015-01-19 A xxx
2015-01-20 A xxx
The data represents stock returns for various stocks and dates. What I need to do is group the data by a given interval and by the day within that interval. Another difficulty is that weekends (the 0s) have to be discounted (ignoring public holidays). And the start date of the first interval should be an arbitrary date.
For example, my output should look something like this:
Interval Q01 Q02 Q03 Q04 Q05
1 xxx xxx xxx xxx xxx
2 xxx xxx xxx xxx xxx
3 xxx xxx xxx xxx xxx
4 xxx xxx xxx xxx xxx
This output would represent intervals of 5 working days each, with averaged returns as results. In terms of the raw data from above,
with a start date of 1st Jan, the 1st interval includes the 1st/2nd/5th/6th/7th (the 3rd and 4th are weekend days and are ignored); Q01 would be the 1st, Q02 the 2nd, Q03 the 5th, etc. The second interval covers the 8th/9th/12th/13th/14th.
What I tried unsuccessfully is using
CEIL(CAST(EXTRACT(DOY FROM DATE) AS FLOAT) / CAST (10 AS FLOAT)) AS interval
EXTRACT(DAY FROM DATE) % 10 AS DAYinInterval
I also tried playing around with rolling counters and, for variable starting dates, setting my DOY to zero with something like this:
CEIL(CAST(EXTRACT(DOY FROM DATE) - EXTRACT(DOY FROM 'start-date' AS FLOAT) / CAST (10 AS FLOAT)) AS Interval
The one thing that came closest to what I would expect is this
SUM(Number) OVER(PARTITION BY STOCK ORDER BY DATE ASC rows 10 preceding) AS Counter
Unfortunately it counts up from 1 to 10 and then stays at 11, where it should start again from 1.
I would love to see how this can be implemented in an elegant way. Thanks.
I'm not entirely sure I understand the question, but I think I might, so I'm going to take a swing at this with some windowed aggregates and subqueries.
Here's the sample data, plugging in some random non-zero data for weekdays.
DATE | STOCK | RETURN
------------+-------+--------
2015-01-01 | A | 16
2015-01-02 | A | 80
2015-01-03 | A | 0
2015-01-04 | A | 0
2015-01-05 | A | 60
2015-01-06 | A | 25
2015-01-07 | A | 12
2015-01-08 | A | 1
2015-01-09 | A | 81
2015-01-10 | A | 0
2015-01-11 | A | 0
2015-01-12 | A | 35
2015-01-13 | A | 20
2015-01-14 | A | 69
2015-01-15 | A | 72
2015-01-16 | A | 89
2015-01-17 | A | 0
2015-01-18 | A | 0
2015-01-19 | A | 100
2015-01-20 | A | 67
(20 rows)
Here's my swing at it, with embedded comments.
select avg(return),
date_period,
day_period
from (
-- use row_number to generate a sequential value for each DOW,
-- with a WHERE to filter out the weekends
select date,
stock,
return,
date_period ,
row_number() over (partition by date_period order by date asc) day_period
from (
-- bin out the entries by date_period using the first_value of the entire set as the starting point
-- modulo 7
select date,
stock,
return,
date + (first_value(date) over (order by date asc) - date) % 7 date_period
from stocks
where date >= '2015-01-01'
-- setting the starting period date here
)
foo
where extract (dow from date) not in (1,7)
)
foo
group by date_period, day_period
order by date_period asc;
The results:
AVG | DATE_PERIOD | DAY_PERIOD
------------+-------------+------------
16.000000 | 2015-01-01 | 1
80.000000 | 2015-01-01 | 2
60.000000 | 2015-01-01 | 3
25.000000 | 2015-01-01 | 4
12.000000 | 2015-01-01 | 5
1.000000 | 2015-01-08 | 1
81.000000 | 2015-01-08 | 2
35.000000 | 2015-01-08 | 3
20.000000 | 2015-01-08 | 4
69.000000 | 2015-01-08 | 5
72.000000 | 2015-01-15 | 1
89.000000 | 2015-01-15 | 2
100.000000 | 2015-01-15 | 3
67.000000 | 2015-01-15 | 4
(14 rows)
Changing the starting date to '2015-01-03' to see if it adjusts properly:
...
from stocks
where date >= '2015-01-03'
...
And the results:
AVG | DATE_PERIOD | DAY_PERIOD
------------+-------------+------------
60.000000 | 2015-01-03 | 1
25.000000 | 2015-01-03 | 2
12.000000 | 2015-01-03 | 3
1.000000 | 2015-01-03 | 4
81.000000 | 2015-01-03 | 5
35.000000 | 2015-01-10 | 1
20.000000 | 2015-01-10 | 2
69.000000 | 2015-01-10 | 3
72.000000 | 2015-01-10 | 4
89.000000 | 2015-01-10 | 5
100.000000 | 2015-01-17 | 1
67.000000 | 2015-01-17 | 2
(12 rows)

Linear regression with postgres

I use Postgres and I have a large number of rows with a value and a date per station.
(Dates can be separated by several days.)
 id | value | idstation | udate
----+-------+-----------+---------------------
  1 |     5 |        12 | 1984-02-11 00:00:00
  2 |     7 |        12 | 1984-02-17 00:00:00
  3 |     8 |        12 | 1984-02-21 00:00:00
  4 |     9 |        12 | 1984-02-23 00:00:00
  5 |     4 |        12 | 1984-02-24 00:00:00
  6 |     8 |        12 | 1984-02-28 00:00:00
  7 |     9 |        14 | 1984-02-21 00:00:00
  8 |    15 |        15 | 1984-02-21 00:00:00
  9 |    14 |        18 | 1984-02-21 00:00:00
 10 |   200 |        19 | 1984-02-21 00:00:00
Forgive what may be a silly question, but I'm not much of a database guru.
Is it possible to directly write a SQL query that calculates a linear regression per station for each date, knowing that the regression must be calculated only with the current id's date, the previous id's date and the next id's date?
For example, the linear regression for id 2 must be calculated with the values 7 (current), 5 (previous) and 8 (next) for the dates 1984-02-17, 1984-02-11 and 1984-02-21.
Edit: I have to use regr_intercept(value, udate), but I really don't know how to do this if I have to use only the current, previous and next value/date for each row.
Edit 2: 3 rows were added to idstation 12; the id and date numbers have changed.
Hope you can help me, thank you!
This is the combination of Joop's statistics and Denis's window functions:
WITH num AS (
SELECT id, idstation
, (udate - '1984-01-01'::date) as idate -- count in days since Jan 1984
, value AS value
FROM thedata
)
-- id + the ids of the {prev,next} records
-- within the same idstation group
, drag AS (
SELECT id AS center
, LAG(id) OVER www AS prev
, LEAD(id) OVER www AS next
FROM thedata
WINDOW www AS (partition by idstation ORDER BY id)
)
-- junction CTE between ID and its three feeders
, tri AS (
SELECT center AS this, center AS that FROM drag
UNION ALL SELECT center AS this , prev AS that FROM drag
UNION ALL SELECT center AS this , next AS that FROM drag
)
SELECT t.this, n.idstation
, regr_intercept(value,idate) AS intercept
, regr_slope(value,idate) AS slope
, regr_r2(value,idate) AS rsq
, regr_avgx(value,idate) AS avgx
, regr_avgy(value,idate) AS avgy
FROM num n
JOIN tri t ON t.that = n.id
GROUP BY t.this, n.idstation
;
Results:
this | idstation | intercept | slope | rsq | avgx | avgy
------+-----------+-------------------+-------------------+-------------------+------------------+------------------
1 | 12 | -46 | 1 | 1 | 52 | 6
2 | 12 | -24.2105263157895 | 0.578947368421053 | 0.909774436090226 | 53.3333333333333 | 6.66666666666667
3 | 12 | -10.6666666666667 | 0.333333333333333 | 1 | 54.5 | 7.5
4 | 14 | | | | 51 | 9
5 | 15 | | | | 51 | 15
6 | 18 | | | | 51 | 14
7 | 19 | | | | 51 | 200
(7 rows)
The clustering of the group-of-three can probably be done more elegantly using a rank() or row_number() function, which would also allow larger sliding windows to be used.
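As a sketch of that idea (my own assumption, not part of the original answer): instead of switching to rank(), the drag/tri CTEs can simply be widened with extra lag()/lead() offsets, so every center id is fed by five rows instead of three.
WITH num AS (
   SELECT id, idstation
        , (udate - '1984-01-01'::date) AS idate  -- days since Jan 1984
        , value
   FROM thedata
   )
, drag AS (  -- id + the ids of the two previous and two next records per station
   SELECT id AS center
        , LAG(id, 2)  OVER www AS prev2
        , LAG(id)     OVER www AS prev
        , LEAD(id)    OVER www AS next
        , LEAD(id, 2) OVER www AS next2
   FROM thedata
   WINDOW www AS (PARTITION BY idstation ORDER BY id)
   )
, tri AS (   -- junction CTE: five feeders per center; NULL neighbours drop out in the join
             SELECT center AS this, center AS that FROM drag
   UNION ALL SELECT center, prev2 FROM drag
   UNION ALL SELECT center, prev  FROM drag
   UNION ALL SELECT center, next  FROM drag
   UNION ALL SELECT center, next2 FROM drag
   )
SELECT t.this, n.idstation
     , regr_intercept(value, idate) AS intercept
     , regr_slope(value, idate)     AS slope
     , regr_r2(value, idate)        AS rsq
FROM num n
JOIN tri t ON t.that = n.id
GROUP BY t.this, n.idstation
ORDER BY t.this;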
DROP SCHEMA zzz CASCADE;
CREATE SCHEMA zzz ;
SET search_path=zzz;
CREATE TABLE thedata
( id INTEGER NOT NULL PRIMARY KEY
, value INTEGER NOT NULL
, idstation INTEGER NOT NULL
, udate DATE NOT NULL
);
INSERT INTO thedata(id,value,idstation,udate) VALUES
(1 ,5 ,12 ,'1984-02-21' )
,(2 ,7 ,12 ,'1984-02-23' )
,(3 ,8 ,12 ,'1984-02-26' )
,(4 ,9 ,14 ,'1984-02-21' )
,(5 ,15 ,15 ,'1984-02-21' )
,(6 ,14 ,18 ,'1984-02-21' )
,(7 ,200 ,19 ,'1984-02-21' )
;
WITH a AS (
SELECT idstation
, (udate - '1984-01-01'::date) as idate -- count in days since Jan 1984
, value AS value
FROM thedata
)
SELECT idstation
, regr_intercept(value,idate) AS intercept
, regr_slope(value,idate) AS slope
, regr_r2(value,idate) AS rsq
, regr_avgx(value,idate) AS avgx
, regr_avgy(value,idate) AS avgy
FROM a
GROUP BY idstation
;
output:
idstation | intercept | slope | rsq | avgx | avgy
-----------+-------------------+-------------------+-------------------+------------------+------------------
15 | | | | 51 | 15
14 | | | | 51 | 9
19 | | | | 51 | 200
12 | -24.2105263157895 | 0.578947368421053 | 0.909774436090226 | 53.3333333333333 | 6.66666666666667
18 | | | | 51 | 14
(5 rows)
Note: if you want a spline-like regression you should also use the lag() and lead() window functions, like in Denis's answer.
If the average is OK for you, you could use the built-in avg()... Something like
SELECT avg("value") FROM "my_table" WHERE "idstation" = 3;
should do. For more complicated things you will need to write a PL/pgSQL function, I'm afraid, or check for an add-on for PostgreSQL.
Look into window functions. If I get your question correctly, lead() and lag() will likely give you precisely what you want. Example usage:
select idstation as idstation,
id as curr_id,
udate as curr_date,
lag(id) over w as prev_id,
lag(udate) over w as prev_date,
lead(id) over w as next_id,
lead(udate) over w as next_date
from dates
window w as (
partition by idstation order by udate, id
)
order by idstation, udate, id
http://www.postgresql.org/docs/current/static/tutorial-window.html