Grouping by rolling date interval in Netezza - postgresql

I have a table in Netezza that looks like this
Date       | Stock | Return
-----------+-------+-------
2015-01-01 | A     | xxx
2015-01-02 | A     | xxx
2015-01-03 | A     | 0
2015-01-04 | A     | 0
2015-01-05 | A     | xxx
2015-01-06 | A     | xxx
2015-01-07 | A     | xxx
2015-01-08 | A     | xxx
2015-01-09 | A     | xxx
2015-01-10 | A     | 0
2015-01-11 | A     | 0
2015-01-12 | A     | xxx
2015-01-13 | A     | xxx
2015-01-14 | A     | xxx
2015-01-15 | A     | xxx
2015-01-16 | A     | xxx
2015-01-17 | A     | 0
2015-01-18 | A     | 0
2015-01-19 | A     | xxx
2015-01-20 | A     | xxx
The data represents stock returns for various stocks and dates. What I need to do is group the data by a given interval, and by day within that interval. A further difficulty is that the weekends (the 0s) have to be discarded (public holidays can be ignored). And the start date of the first interval should be an arbitrary date.
For example, my output should look something like this
Interval Q01 Q02 Q03 Q04 Q05
1 xxx xxx xxx xxx xxx
2 xxx xxx xxx xxx xxx
3 xxx xxx xxx xxx xxx
4 xxx xxx xxx xxx xxx
This output would represent intervals of five working days each, with averaged returns as the results. In terms of the raw data above, with a start date of 1 Jan, the first interval comprises the 1st/2nd/5th/6th/7th (the 3rd and 4th fall on a weekend and are skipped): Q01 would be the 1st, Q02 the 2nd, Q03 the 5th, and so on. The second interval comprises the 8th/9th/12th/13th/14th.
What I tried, unsuccessfully, is
CEIL(CAST(EXTRACT(DOY FROM DATE) AS FLOAT) / CAST(10 AS FLOAT)) AS interval
EXTRACT(DAY FROM DATE) % 10 AS DAYinInterval
I also tried playing around with rolling counters, and for a variable start date I tried zeroing my DOY with something like this
CEIL(CAST(EXTRACT(DOY FROM DATE) - EXTRACT(DOY FROM 'start-date') AS FLOAT) / CAST(10 AS FLOAT)) AS Interval
The one thing that came closest to what I would expect is this
SUM(Number) OVER(PARTITION BY STOCK ORDER BY DATE ASC rows 10 preceding) AS Counter
Unfortunately it goes from 1 to 10 and is then followed by 11s, where it should restart from 1 to 10.
I would love to see how this can be implemented in an elegant way. Thanks.

I'm not entirely sure I understand the question, but I think I might, so I'm going to take a swing at this with some windowed aggregates and subqueries.
Here's the sample data, plugging in some random non-zero data for weekdays.
DATE | STOCK | RETURN
------------+-------+--------
2015-01-01 | A | 16
2015-01-02 | A | 80
2015-01-03 | A | 0
2015-01-04 | A | 0
2015-01-05 | A | 60
2015-01-06 | A | 25
2015-01-07 | A | 12
2015-01-08 | A | 1
2015-01-09 | A | 81
2015-01-10 | A | 0
2015-01-11 | A | 0
2015-01-12 | A | 35
2015-01-13 | A | 20
2015-01-14 | A | 69
2015-01-15 | A | 72
2015-01-16 | A | 89
2015-01-17 | A | 0
2015-01-18 | A | 0
2015-01-19 | A | 100
2015-01-20 | A | 67
(20 rows)
Here's my swing at it, with embedded comments.
select avg(return),
       date_period,
       day_period
from (
    -- use row_number to generate a sequential value for each working day
    -- of a period, with a WHERE to filter out the weekends
    select date,
           stock,
           return,
           date_period,
           row_number() over (partition by date_period order by date asc) day_period
    from (
        -- bin the entries into date_periods, using the first_value of the
        -- entire set as the starting point, modulo 7
        select date,
               stock,
               return,
               date + (first_value(date) over (order by date asc) - date) % 7 date_period
        from stocks
        where date >= '2015-01-01'   -- setting the starting period date here
    ) foo
    where extract(dow from date) not in (1,7)
) foo
group by date_period, day_period
order by date_period asc;
The results:
AVG | DATE_PERIOD | DAY_PERIOD
------------+-------------+------------
16.000000 | 2015-01-01 | 1
80.000000 | 2015-01-01 | 2
60.000000 | 2015-01-01 | 3
25.000000 | 2015-01-01 | 4
12.000000 | 2015-01-01 | 5
1.000000 | 2015-01-08 | 1
81.000000 | 2015-01-08 | 2
35.000000 | 2015-01-08 | 3
20.000000 | 2015-01-08 | 4
69.000000 | 2015-01-08 | 5
72.000000 | 2015-01-15 | 1
89.000000 | 2015-01-15 | 2
100.000000 | 2015-01-15 | 3
67.000000 | 2015-01-15 | 4
(14 rows)
Changing the starting date to '2015-01-03' to see if it adjusts properly:
...
from stocks
where date >= '2015-01-03'
...
And the results:
AVG | DATE_PERIOD | DAY_PERIOD
------------+-------------+------------
60.000000 | 2015-01-03 | 1
25.000000 | 2015-01-03 | 2
12.000000 | 2015-01-03 | 3
1.000000 | 2015-01-03 | 4
81.000000 | 2015-01-03 | 5
35.000000 | 2015-01-10 | 1
20.000000 | 2015-01-10 | 2
69.000000 | 2015-01-10 | 3
72.000000 | 2015-01-10 | 4
89.000000 | 2015-01-10 | 5
100.000000 | 2015-01-17 | 1
67.000000 | 2015-01-17 | 2
(12 rows)
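The same binning idea can be tried end-to-end in SQLite (3.25+ for window functions) through Python's sqlite3 module. This is only a sketch, not Netezza SQL: julianday() stands in for native date subtraction, strftime('%w') stands in for extract(dow ...), and the column name ret avoids any keyword ambiguity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stocks (date TEXT, stock TEXT, ret REAL);
INSERT INTO stocks VALUES
  ('2015-01-01','A',16),('2015-01-02','A',80),('2015-01-03','A',0),
  ('2015-01-04','A',0), ('2015-01-05','A',60),('2015-01-06','A',25),
  ('2015-01-07','A',12),('2015-01-08','A',1), ('2015-01-09','A',81),
  ('2015-01-10','A',0), ('2015-01-11','A',0), ('2015-01-12','A',35),
  ('2015-01-13','A',20),('2015-01-14','A',69),('2015-01-15','A',72),
  ('2015-01-16','A',89),('2015-01-17','A',0), ('2015-01-18','A',0),
  ('2015-01-19','A',100),('2015-01-20','A',67);
""")

rows = conn.execute("""
SELECT avg(ret), date_period, day_period
FROM (
  SELECT date, ret, date_period,
         row_number() OVER (PARTITION BY date_period ORDER BY date) AS day_period
  FROM (
    SELECT date, ret,
           -- bin each date to the start date of its 7-day period
           date(date, '-' || (CAST(julianday(date) -
                julianday('2015-01-01') AS INT) % 7) || ' days') AS date_period
    FROM stocks
    WHERE date >= '2015-01-01'        -- the arbitrary starting date
  ) binned
  WHERE strftime('%w', date) NOT IN ('0', '6')   -- drop Sun/Sat
) numbered
GROUP BY date_period, day_period
ORDER BY date_period, day_period
""").fetchall()

for avg_ret, period, day in rows:
    print(period, day, avg_ret)
```

Run against the sample data, this reproduces the 14 rows shown above (three periods of 5, 5 and 4 working days).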

Related

How to sum for previous n number of days for a number of dates in MySQL

I have a list of dates in MySQL, each with a value.
For each date I want to sum the values for that date and the previous 4 days.
I also want to sum the values from the start of that month up to that date. So for example:
For 07/02/2021 sum all values from 07/02/2021 to 01/02/2021
For 06/02/2021 sum all values from 06/02/2021 to 01/02/2021
For 31/01/2021 sum all values from 31/01/2021 to 01/01/2021
The output should look like:
Any help would be appreciated.
Thanks
In MYSQL 8.0 you get to use analytic/windowed functions.
SELECT
  *,
  SUM(value) OVER (
    ORDER BY date
    ROWS BETWEEN 4 PRECEDING
             AND CURRENT ROW
  ) AS five_day_period,
  SUM(value) OVER (
    PARTITION BY DATE_FORMAT(date, '%Y-%m-01')
    ORDER BY date
  ) AS month_to_date
FROM
  your_table
In the first case, it's just saying: sum up the value column, in date order, starting 4 rows before the current row and ending at the current row.
In the second case there is no ROWS BETWEEN, so the frame defaults to all rows from the start of the partition up to the current row. Instead, we add a PARTITION BY which treats all rows in the same calendar month separately from rows in a different calendar month. Thus, each running sum only looks back as far as the first row of the partition, which is the first row of the current month.
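For anyone without a MySQL 8 instance handy, both frames can be reproduced in SQLite (which shares this window-frame syntax) from Python. strftime here stands in for MySQL's DATE_FORMAT, and the table name and values are invented for the demo; note the 5-day window counts rows, so it assumes one row per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (d TEXT, value INT);
INSERT INTO t VALUES
  ('2021-01-30', 10), ('2021-01-31', 20),
  ('2021-02-01', 10), ('2021-02-02', 20), ('2021-02-03', 10),
  ('2021-02-04', 50), ('2021-02-05', 40), ('2021-02-06', 30),
  ('2021-02-07', 10);
""")

rows = conn.execute("""
SELECT d,
       -- frame 1: the current row plus the 4 rows before it
       SUM(value) OVER (ORDER BY d
                        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS five_day_period,
       -- frame 2: default frame, restarted for every calendar month
       SUM(value) OVER (PARTITION BY strftime('%Y-%m', d)
                        ORDER BY d) AS month_to_date
FROM t
ORDER BY d
""").fetchall()

for row in rows:
    print(row)
```

The month_to_date column resets to 10 at 2021-02-01, while five_day_period keeps sliding across the month boundary.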
In MySQL 5.x there are no such functions, so I would resort to correlated sub-queries.
SELECT
  *,
  (
    SELECT SUM(value)
    FROM your_table AS five_day_lookup
    WHERE date >= DATE_SUB(your_table.date, INTERVAL 4 DAY)
      AND date <= your_table.date
  ) AS five_day_period,
  (
    SELECT SUM(value)
    FROM your_table AS monthly_lookup
    WHERE date >= DATE(DATE_FORMAT(your_table.date, '%Y-%m-01'))
      AND date <= your_table.date
  ) AS month_to_date
FROM
  your_table
Here is another way to do it:
Select
t1.`mydate` AS 'Date'
, t1.`val` AS 'Value'
, SUM( IF(t2.`mydate` >= t1.`mydate` - INTERVAL 4 DAY,t2.val,0)) AS '5 Day Period'
, SUM( IF(t2.`mydate` >= DATE_ADD(DATE_ADD(LAST_DAY(t1.`mydate` ),INTERVAL 1 DAY),INTERVAL - 1 MONTH),t2.val,0)) AS 'Month of Date'
FROM tab t1
LEFT JOIN tab t2 ON t2.`mydate`
BETWEEN LEAST( DATE_ADD(DATE_ADD(LAST_DAY(t1.`mydate` ),INTERVAL 1 DAY),INTERVAL - 1 MONTH),
t1.`mydate` - INTERVAL 4 DAY)
AND t1.`mydate`
GROUP BY t1.`mydate`
ORDER BY t1.`mydate` desc;
sample
MariaDB [bkvie]> SELECT * FROM tab;
+----+------------+------+
| id | mydate | val |
+----+------------+------+
| 1 | 2021-02-07 | 10 |
| 2 | 2021-02-06 | 30 |
| 3 | 2021-02-05 | 40 |
| 4 | 2021-02-04 | 50 |
| 5 | 2021-02-03 | 10 |
| 6 | 2021-02-02 | 20 |
| 7 | 2021-01-31 | 20 |
| 8 | 2021-01-30 | 10 |
| 9 | 2021-01-29 | 30 |
| 10 | 2021-01-28 | 40 |
| 11 | 2021-01-27 | 20 |
| 12 | 2021-01-26 | 30 |
| 13 | 2021-01-25 | 10 |
| 14 | 2021-01-24 | 40 |
| 15 | 2021-02-01 | 10 |
+----+------------+------+
15 rows in set (0.00 sec)
result
MariaDB [bkvie]> Select
-> t1.`mydate` AS 'Date'
-> , t1.`val` AS 'Value'
-> , SUM( IF(t2.`mydate` >= t1.`mydate` - INTERVAL 4 DAY,t2.val,0)) AS '5 Day Period'
-> , SUM( IF(t2.`mydate` >= DATE_ADD(DATE_ADD(LAST_DAY(t1.`mydate` ),INTERVAL 1 DAY),INTERVAL - 1 MONTH),t2.val,0)) AS 'Month of Date'
-> FROM tab t1
-> LEFT JOIN tab t2 ON t2.`mydate`
-> BETWEEN LEAST( DATE_ADD(DATE_ADD(LAST_DAY(t1.`mydate` ),INTERVAL 1 DAY),INTERVAL - 1 MONTH),
-> t1.`mydate` - INTERVAL 4 DAY)
-> AND t1.`mydate`
-> GROUP BY t1.`mydate`
-> ORDER BY t1.`mydate` desc;
+------------+-------+--------------+---------------+
| Date | Value | 5 Day Period | Month of Date |
+------------+-------+--------------+---------------+
| 2021-02-07 | 10 | 140 | 170 |
| 2021-02-06 | 30 | 150 | 160 |
| 2021-02-05 | 40 | 130 | 130 |
| 2021-02-04 | 50 | 110 | 90 |
| 2021-02-03 | 10 | 70 | 40 |
| 2021-02-02 | 20 | 90 | 30 |
| 2021-02-01 | 10 | 110 | 10 |
| 2021-01-31 | 20 | 120 | 200 |
| 2021-01-30 | 10 | 130 | 180 |
| 2021-01-29 | 30 | 130 | 170 |
| 2021-01-28 | 40 | 140 | 140 |
| 2021-01-27 | 20 | 100 | 100 |
| 2021-01-26 | 30 | 80 | 80 |
| 2021-01-25 | 10 | 50 | 50 |
| 2021-01-24 | 40 | 40 | 40 |
+------------+-------+--------------+---------------+
15 rows in set (0.00 sec)

postgresql: how to add column which shows time difference between the two rows

I have a table with a time column
id | time
----|-------------------
1 | 2021-2-2 10:21:00
2 | 2021-2-2 10:24:00
3 | 2021-2-2 10:26:00
4 | 2021-2-2 10:29:00
Now I want to add a column which shows the time difference to the previous row, and another column which assigns a status based on that difference
id | time | diff | status
----|----------------------------------
1 | 2021-2-2 10:21:00 | None | None
2 | 2021-2-2 10:24:00 | 3:00 | 1
3 | 2021-2-2 10:26:00 | 2:00 | 0
4 | 2021-2-2 10:29:00 | 3:00 | 1
status = 1 if diff >= 3, else 0
How can I do this? I have to fetch the data for the last seven days and then produce this final table.
Use LAG:
SELECT id, time,
EXTRACT(EPOCH FROM time - LAG(time) OVER (ORDER BY time)) / 60 AS diff,
(EXTRACT(EPOCH FROM time - LAG(time) OVER (ORDER BY time)) / 60 >= 3)::int AS status
FROM yourTable
ORDER BY time;
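A quick way to sanity-check the LAG approach is SQLite via Python. SQLite has no EXTRACT(EPOCH ...), so julianday() differences (in days, times 1440 for minutes) stand in for it; the table name readings and column t are invented for the sketch, using the question's sample times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (id INT, t TEXT);
INSERT INTO readings VALUES
  (1, '2021-02-02 10:21:00'), (2, '2021-02-02 10:24:00'),
  (3, '2021-02-02 10:26:00'), (4, '2021-02-02 10:29:00');
""")

rows = conn.execute("""
SELECT id,
       -- minutes since the previous row (NULL for the first row)
       round((julianday(t) - julianday(LAG(t) OVER (ORDER BY t))) * 1440, 2) AS diff_min,
       round((julianday(t) - julianday(LAG(t) OVER (ORDER BY t))) * 1440, 2) >= 3 AS status
FROM readings
ORDER BY t
""").fetchall()

for row in rows:
    print(row)
```

As in the Postgres version, the first row has no predecessor, so both diff and status come back as NULL.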

Cumulative sum of multiple window functions

I have a table with the structure:
id | date | player_id | score
--------------------------------------
1 | 2019-01-01 | 1 | 1
2 | 2019-01-02 | 1 | 1
3 | 2019-01-03 | 1 | 0
4 | 2019-01-04 | 1 | 0
5 | 2019-01-05 | 1 | 1
6 | 2019-01-06 | 1 | 1
7 | 2019-01-07 | 1 | 0
8 | 2019-01-08 | 1 | 1
9 | 2019-01-09 | 1 | 0
10 | 2019-01-10 | 1 | 0
11 | 2019-01-11 | 1 | 1
I want to create two more columns, total_score and last_seven_days:
total_score is a rolling sum of the player's score
last_seven_days is the score for the seven days prior to (and not including) the date
I have written the following SQL query:
SELECT id,
date,
player_id,
score,
sum(score) OVER all_scores AS all_score,
sum(score) OVER last_seven AS last_seven_score
FROM scores
WINDOW all_scores AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
last_seven AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING);
and get the following output:
id | date | player_id | score | all_score | last_seven_score
------------------------------------------------------------------
1 | 2019-01-01 | 1 | 1 | |
2 | 2019-01-02 | 1 | 1 | 1 | 1
3 | 2019-01-03 | 1 | 0 | 2 | 2
4 | 2019-01-04 | 1 | 0 | 2 | 2
5 | 2019-01-05 | 1 | 1 | 2 | 2
6 | 2019-01-06 | 1 | 1 | 3 | 3
7 | 2019-01-07 | 1 | 0 | 4 | 4
8 | 2019-01-08 | 1 | 1 | 4 | 4
9 | 2019-01-09 | 1 | 0 | 5 | 4
10 | 2019-01-10 | 1 | 0 | 5 | 3
11 | 2019-01-11 | 1 | 1 | 5 | 3
I have realised that I need to change this
last_seven AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING)
to use some sort of date-based frame instead of the row count 7, because just counting 7 rows will introduce errors once dates are missing.
I.e. it would be nice to be able to say date - 2 days or date - 6 days.
Later down the track I would also like to add columns such as 3 months, 6 months and 12 months, so it needs to be dynamic.
demo:db<>fiddle
Solution for Postgres 11+:
Using a RANGE interval, as @LaurenzAlbe did
Solution for Postgres <11:
(just presenting the "days" part, the "all_scores" part is the same)
Joining the table against itself on the player_id and the relevant date range:
SELECT s1.*,
       (SELECT SUM(s2.score)
        FROM scores s2
        WHERE s2.player_id = s1.player_id
          AND s2."date" BETWEEN s1."date" - interval '7 days'
                            AND s1."date" - interval '1 day') AS last_seven_score
FROM scores s1
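The pre-11 correlated-subquery shape is easy to verify in SQLite from Python, using the question's data. SQLite's date(..., '-7 days') modifier stands in for Postgres interval arithmetic in this sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (id INT, date TEXT, player_id INT, score INT);
INSERT INTO scores VALUES
  (1,'2019-01-01',1,1), (2,'2019-01-02',1,1), (3,'2019-01-03',1,0),
  (4,'2019-01-04',1,0), (5,'2019-01-05',1,1), (6,'2019-01-06',1,1),
  (7,'2019-01-07',1,0), (8,'2019-01-08',1,1), (9,'2019-01-09',1,0),
  (10,'2019-01-10',1,0),(11,'2019-01-11',1,1);
""")

rows = conn.execute("""
SELECT s1.id, s1.date, s1.score,
       -- sum the player's score over the 7 days before (not including) this date
       (SELECT SUM(s2.score)
        FROM scores s2
        WHERE s2.player_id = s1.player_id
          AND s2.date BETWEEN date(s1.date, '-7 days')
                          AND date(s1.date, '-1 day')) AS last_seven_score
FROM scores s1
ORDER BY s1.id
""").fetchall()

for row in rows:
    print(row)
```

Because this sample has one row per day, the results agree with the ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING output in the question (e.g. 4 at id 9, 3 at id 11); the date-based version would keep working if days were missing.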
You need to use a window by RANGE:
last_seven AS (PARTITION BY player_id
ORDER BY date
RANGE BETWEEN INTERVAL '7 days' PRECEDING
AND INTERVAL '1 day' PRECEDING)
This solution will work only from v11 on.

postgres tablefunc, sales data grouped by product, with crosstab of months

TIL about tablefunc and crosstab. At first I wanted to "group data by columns" but that doesn't really mean anything.
My product sales look like this
product_id | units | date
-----------------------------------
10 | 1 | 1-1-2018
10 | 2 | 2-2-2018
11 | 3 | 1-1-2018
11 | 10 | 1-2-2018
12 | 1 | 2-1-2018
13 | 10 | 1-1-2018
13 | 10 | 2-2-2018
I would like to produce a table of products with months as columns
product_id | 01-01-2018 | 02-01-2018 | etc.
-----------------------------------
10 | 1 | 2
11 | 13 | 0
12 | 0 | 1
13 | 20 | 0
First I would group by month, then invert and group by product, but I cannot figure out how to do this.
After enabling the tablefunc extension,
SELECT product_id, coalesce("2018-1-1", 0) as "2018-1-1"
, coalesce("2018-2-1", 0) as "2018-2-1"
FROM crosstab(
$$SELECT product_id, date_trunc('month', date)::date as month, sum(units) as units
FROM test
GROUP BY product_id, month
ORDER BY 1$$
, $$VALUES ('2018-1-1'::date), ('2018-2-1')$$
) AS ct (product_id int, "2018-1-1" int, "2018-2-1" int);
yields
| product_id | 2018-1-1 | 2018-2-1 |
|------------+----------+----------|
| 10 | 1 | 2 |
| 11 | 13 | 0 |
| 12 | 0 | 1 |
| 13 | 10 | 10 |
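crosstab() is Postgres-specific, but the same pivot can be written as portable conditional aggregation. Here is a sketch in SQLite via Python, with an invented sales table holding the question's data in ISO dates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product_id INT, units INT, date TEXT);
INSERT INTO sales VALUES
  (10, 1,  '2018-01-01'), (10, 2,  '2018-02-02'),
  (11, 3,  '2018-01-01'), (11, 10, '2018-01-02'),
  (12, 1,  '2018-02-01'),
  (13, 10, '2018-01-01'), (13, 10, '2018-02-02');
""")

rows = conn.execute("""
SELECT product_id,
       -- one SUM(CASE ...) per output month column
       SUM(CASE WHEN strftime('%Y-%m', date) = '2018-01' THEN units ELSE 0 END) AS jan_2018,
       SUM(CASE WHEN strftime('%Y-%m', date) = '2018-02' THEN units ELSE 0 END) AS feb_2018
FROM sales
GROUP BY product_id
ORDER BY product_id
""").fetchall()

for row in rows:
    print(row)
```

The trade-off versus crosstab is the same as with the coalesce() wrapper above: each month column must be written out by hand, but missing months naturally become 0 instead of NULL.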

Temporal Aggregation in PostgreSQL

I am working on a Java implementation for temporal aggregation using a PostgreSQL database.
My table looks like this
Value | Start | Stop
(int) | (Date) | (Date)
-------------------------------
1 | 2004-01-01 | 2010-01-01
4 | 2000-01-01 | 2008-01-01
So to visualize these periods:
                    ------------------------------
----------------------------------------
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[        4         ][       5=4+1      ][   1    ]
My algorithm now calculates temporal aggregations of the data, e.g. SUM():
Value | Start | Stop
-------------------------------
4 | 2000-01-01 | 2004-01-01
5 | 2004-01-01 | 2008-01-01
1 | 2008-01-01 | 2010-01-01
In order to test the results, I now would like to query the data directly in PostgreSQL. I know that there is no easy solution to this problem yet; however, there surely is some way to get the same results. The aggregations COUNT, MAX, MIN, SUM and AVG should be supported. I do not mind a bad or slow solution; it just has to work.
A query I found so far, which should work similarly, is the following:
select count(*), ts, te
from ( checkout a normalize checkout b using() ) checkoutNorm
group by ts, te;
My adaptation looks like this:
select count(*), start, stop
from ( myTable a normalize myTable b using() ) myTableNorm
group by start, stop;
However, an error was reported: ERROR: syntax error at or near "normalize" -- LINE 2: from ( ndbs_10 a normalize ndbs_10 b using() ) ndbsNorm.
Does anyone have a solution to this problem? It does not have to be based on the above query, as long as it works. Thanks a lot.
Your question was really hard to understand, but I think I figured it out.
You want a running sum over val. A value only counts between the start and stop of its time period, so it has to be added at the beginning of that period and deducted at the end.
In addition, you want the begin and end of the resulting period each sum is valid for.
That should do it:
-- DROP SCHEMA x CASCADE;
CREATE SCHEMA x;
CREATE TABLE x.tbl(val int, start date, stop date);
INSERT INTO x.tbl VALUES
(4 ,'2000-01-01' ,'2008-01-01')
,(7 ,'2001-01-01' ,'2009-01-01')
,(1 ,'2004-01-01' ,'2010-01-01')
,(2 ,'2005-01-01' ,'2006-01-01');
WITH a AS (
SELECT start as ts, val FROM x.tbl
UNION ALL
SELECT stop, val * (-1) FROM x.tbl
ORDER BY 1, 2)
SELECT sum(val) OVER w AS val_sum
,ts AS start
,lead(ts) OVER w AS stop
FROM a
WINDOW w AS (ORDER BY ts)
ORDER BY ts;
val_sum | start | stop
--------+------------+------------
4 | 2000-01-01 | 2001-01-01
11 | 2001-01-01 | 2004-01-01
12 | 2004-01-01 | 2005-01-01
14 | 2005-01-01 | 2006-01-01
12 | 2006-01-01 | 2008-01-01
8 | 2008-01-01 | 2009-01-01
1 | 2009-01-01 | 2010-01-01
0 | 2010-01-01 |
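The trick in this answer, adding val at each start, subtracting it at each stop, and then taking a running sum over the sorted boundary dates, can be sketched in a few lines of plain Python to check the numbers:

```python
from collections import defaultdict

# the sample intervals from the answer: (val, start, stop)
intervals = [(4, '2000-01-01', '2008-01-01'),
             (7, '2001-01-01', '2009-01-01'),
             (1, '2004-01-01', '2010-01-01'),
             (2, '2005-01-01', '2006-01-01')]

# net change of the running sum at each boundary date
deltas = defaultdict(int)
for val, start, stop in intervals:
    deltas[start] += val   # value becomes active
    deltas[stop] -= val    # value expires

dates = sorted(deltas)     # ISO date strings sort chronologically
out, running = [], 0
for i, d in enumerate(dates):
    running += deltas[d]
    nxt = dates[i + 1] if i + 1 < len(dates) else None
    out.append((running, d, nxt))

for row in out:
    print(row)
```

This reproduces the val_sum/start/stop table above, including the trailing 0 row once all periods have expired, which the SQL later filters out with WHERE stop IS NOT NULL.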
Edit after request
For all requested aggregate functions:
SELECT period
,val_sum
,val_count
,val_sum::float /val_count AS val_avg
,(SELECT min(val) FROM x.tbl WHERE start < y.stop AND stop > y.start) AS val_min
,(SELECT max(val) FROM x.tbl WHERE start < y.stop AND stop > y.start) AS val_max
,start
,stop
FROM (
WITH a AS (
SELECT start as ts, val, 1 AS c FROM x.tbl
UNION ALL
SELECT stop, val, -1 FROM x.tbl
ORDER BY 1, 2)
SELECT count(*) OVER w AS period
,sum(val*c) OVER w AS val_sum
,sum(c) OVER w AS val_count
,ts AS start
,lead(ts) OVER w AS stop
FROM a
WINDOW w AS (ORDER BY ts)
ORDER BY ts
) y
WHERE stop IS NOT NULL;
period | val_sum | val_count | val_avg | val_min | val_max | start | stop
--------+---------+-----------+---------+---------+---------+------------+------------
1 | 4 | 1 | 4 | 4 | 4 | 2000-01-01 | 2001-01-01
2 | 11 | 2 | 5.5 | 4 | 7 | 2001-01-01 | 2004-01-01
3 | 12 | 3 | 4 | 1 | 7 | 2004-01-01 | 2005-01-01
4 | 14 | 4 | 3.5 | 1 | 7 | 2005-01-01 | 2006-01-01
5 | 12 | 3 | 4 | 1 | 7 | 2006-01-01 | 2008-01-01
6 | 8 | 2 | 4 | 1 | 7 | 2008-01-01 | 2009-01-01
7 | 1 | 1 | 1 | 1 | 1 | 2009-01-01 | 2010-01-01
min() and max() could possibly be optimized, but that should be good enough.
CTEs (WITH clauses) and subqueries are interchangeable, as you can see.