Cumulative sum of multiple window functions - postgresql

I have a table with the structure:
id | date | player_id | score
--------------------------------------
1 | 2019-01-01 | 1 | 1
2 | 2019-01-02 | 1 | 1
3 | 2019-01-03 | 1 | 0
4 | 2019-01-04 | 1 | 0
5 | 2019-01-05 | 1 | 1
6 | 2019-01-06 | 1 | 1
7 | 2019-01-07 | 1 | 0
8 | 2019-01-08 | 1 | 1
9 | 2019-01-09 | 1 | 0
10 | 2019-01-10 | 1 | 0
11 | 2019-01-11 | 1 | 1
I want to add two more columns, 'total_score' and 'last_seven_days'.
total_score is a running sum of score for the player_id.
last_seven_days is the sum of score over the seven days leading up to the date.
I have written the following SQL query:
SELECT id,
date,
player_id,
score,
sum(score) OVER all_scores AS all_score,
sum(score) OVER last_seven AS last_seven_score
FROM scores
WINDOW all_scores AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
last_seven AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING);
and get the following output:
id | date | player_id | score | all_score | last_seven_score
------------------------------------------------------------------
1 | 2019-01-01 | 1 | 1 | |
2 | 2019-01-02 | 1 | 1 | 1 | 1
3 | 2019-01-03 | 1 | 0 | 2 | 2
4 | 2019-01-04 | 1 | 0 | 2 | 2
5 | 2019-01-05 | 1 | 1 | 2 | 2
6 | 2019-01-06 | 1 | 1 | 3 | 3
7 | 2019-01-07 | 1 | 0 | 4 | 4
8 | 2019-01-08 | 1 | 1 | 4 | 4
9 | 2019-01-09 | 1 | 0 | 5 | 4
10 | 2019-01-10 | 1 | 0 | 5 | 3
11 | 2019-01-11 | 1 | 1 | 5 | 3
I have realised that I need to change this
last_seven AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING)
so that, instead of the fixed row count 7, it uses some sort of date interval, because a plain row count of 7 will introduce errors.
i.e. it would be nice to be able to do date - 2 days or date - 6 days.
I would also like to add columns for 3 months, 6 months and 12 months later on, so the window needs to be dynamic.
DEMO

demo:db<>fiddle
Solution for Postgres 11+:
Using a RANGE interval, as @LaurenzAlbe did
Solution for Postgres <11:
(just presenting the "days" part, the "all_scores" part is the same)
Joining the table against itself on the player_id and the relevant date range:
SELECT s1.*,
       (SELECT SUM(s2.score)
        FROM scores s2
        WHERE s2.player_id = s1.player_id
          AND s2."date" BETWEEN s1."date" - interval '7 days' AND s1."date" - interval '1 day') AS last_seven_score
FROM scores s1

You need to use a window by RANGE:
last_seven AS (PARTITION BY player_id
ORDER BY date
RANGE BETWEEN INTERVAL '7 days' PRECEDING
AND INTERVAL '1 day' PRECEDING)
This solution will work only from v11 on.
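
To see why the frame unit matters, here is a minimal Python sketch (not taken from either answer) that mimics the ROWS frame from the question and the RANGE frame above. The sample rows and the function names are made up for illustration; the rows deliberately contain a gap in the dates.

```python
from datetime import date, timedelta

# Hypothetical (day, score) rows for one player, in date order.
# Note the gap between 2019-01-03 and 2019-01-10.
rows = [
    (date(2019, 1, 1), 1),
    (date(2019, 1, 2), 1),
    (date(2019, 1, 3), 1),
    (date(2019, 1, 10), 1),
    (date(2019, 1, 11), 1),
]

def rows_frame_sum(rows, i, n=7):
    """Mimics ROWS BETWEEN n PRECEDING AND 1 PRECEDING: counts rows, ignores dates."""
    frame = [score for _, score in rows[max(0, i - n):i]]
    return sum(frame) if frame else None  # SUM over an empty frame is NULL

def range_frame_sum(rows, i, days=7):
    """Mimics RANGE BETWEEN INTERVAL 'n days' PRECEDING AND INTERVAL '1 day' PRECEDING."""
    current = rows[i][0]
    lo, hi = current - timedelta(days=days), current - timedelta(days=1)
    frame = [score for d, score in rows if lo <= d <= hi]
    return sum(frame) if frame else None

for i, (d, score) in enumerate(rows):
    print(d, score, rows_frame_sum(rows, i), range_frame_sum(rows, i))
```

For 2019-01-10 the row-based frame still adds up the three January 1-3 rows, while the date-based frame only sees 2019-01-03, which is the behaviour the question asks for.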

Related

How to sum for previous n number of days for a number of dates in MySQL

I have a list of dates each with a value in MYSQL.
For each date I want to sum the value for this date and the previous 4 days.
I also want to sum the values for the start of that month to the present date. So for example:
For 07/02/2021 sum all values from 07/02/2021 to 01/02/2021
For 06/02/2021 sum all values from 06/02/2021 to 01/02/2021
For 31/01/2021 sum all values from 31/01/2021 to 01/01/2021
The output should look like:
Any help would be appreciated.
Thanks
In MYSQL 8.0 you get to use analytic/windowed functions.
SELECT
*,
SUM(value) OVER (
ORDER BY date
ROWS BETWEEN 4 PRECEDING
AND CURRENT ROW
) AS five_day_period,
SUM(value) OVER (
PARTITION BY DATE_FORMAT(date, '%Y-%m-01')
ORDER BY date
) AS month_to_date
FROM
your_table
In the first case, it's just saying sum up the value column, in date order, starting from 4 rows before the current row, and ending on the current row.
In the second case, there's no ROWS BETWEEN, so the frame defaults to everything from the start of the partition up to the current row. We also add a PARTITION BY, which says to treat all rows in the same calendar month separately from rows in any other month. Thus, "all rows before the current one" only looks back as far as the first row in the partition, which is the first row of the current month.
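
The two frames described above can be sketched in plain Python. This is only an illustration of the logic, assuming one row per date; the sample values and the function names are hypothetical.

```python
from datetime import date

# Sample (date, value) rows, already in date order as ORDER BY date would give.
rows = [
    (date(2021, 1, 30), 10),
    (date(2021, 1, 31), 20),
    (date(2021, 2, 1), 10),
    (date(2021, 2, 2), 20),
    (date(2021, 2, 3), 10),
    (date(2021, 2, 4), 50),
]

def five_row_sum(rows, i):
    """ROWS BETWEEN 4 PRECEDING AND CURRENT ROW: current row plus up to 4 rows before it."""
    return sum(v for _, v in rows[max(0, i - 4):i + 1])

def month_to_date(rows, i):
    """PARTITION BY month, ORDER BY date: running total within the current row's month."""
    cur = rows[i][0]
    return sum(v for d, v in rows[:i + 1]
               if (d.year, d.month) == (cur.year, cur.month))

result = [(d, v, five_row_sum(rows, i), month_to_date(rows, i))
          for i, (d, v) in enumerate(rows)]
for line in result:
    print(*line)
```

Note that the first function counts rows, not days, so it only equals a five-day total when no dates are missing.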
In MySQL 5.x there are no such functions. As such I would resort to correlated sub-queries.
SELECT
*,
(
SELECT SUM(value)
FROM your_table AS five_day_lookup
WHERE date >= DATE_SUB(your_table.date, INTERVAL 4 DAY)
AND date <= your_table.date
)
AS five_day_period,
(
SELECT SUM(value)
FROM your_table AS monthly_lookup
WHERE date >= DATE(DATE_FORMAT(your_table.date, '%Y-%m-01'))
AND date <= your_table.date
)
AS month_to_date
FROM
your_table
Here is another way to do that:
Select
t1.`mydate` AS 'Date'
, t1.`val` AS 'Value'
, SUM( IF(t2.`mydate` >= t1.`mydate` - INTERVAL 4 DAY,t2.val,0)) AS '5 Day Period'
, SUM( IF(t2.`mydate` >= DATE_ADD(DATE_ADD(LAST_DAY(t1.`mydate` ),INTERVAL 1 DAY),INTERVAL - 1 MONTH),t2.val,0)) AS 'Month of Date'
FROM tab t1
LEFT JOIN tab t2 ON t2.`mydate`
BETWEEN LEAST( DATE_ADD(DATE_ADD(LAST_DAY(t1.`mydate` ),INTERVAL 1 DAY),INTERVAL - 1 MONTH),
t1.`mydate` - INTERVAL 4 DAY)
AND t1.`mydate`
GROUP BY t1.`mydate`
ORDER BY t1.`mydate` desc;
sample
MariaDB [bkvie]> SELECT * FROM tab;
+----+------------+------+
| id | mydate | val |
+----+------------+------+
| 1 | 2021-02-07 | 10 |
| 2 | 2021-02-06 | 30 |
| 3 | 2021-02-05 | 40 |
| 4 | 2021-02-04 | 50 |
| 5 | 2021-02-03 | 10 |
| 6 | 2021-02-02 | 20 |
| 7 | 2021-01-31 | 20 |
| 8 | 2021-01-30 | 10 |
| 9 | 2021-01-29 | 30 |
| 10 | 2021-01-28 | 40 |
| 11 | 2021-01-27 | 20 |
| 12 | 2021-01-26 | 30 |
| 13 | 2021-01-25 | 10 |
| 14 | 2021-01-24 | 40 |
| 15 | 2021-02-01 | 10 |
+----+------------+------+
15 rows in set (0.00 sec)
result
MariaDB [bkvie]> Select
-> t1.`mydate` AS 'Date'
-> , t1.`val` AS 'Value'
-> , SUM( IF(t2.`mydate` >= t1.`mydate` - INTERVAL 4 DAY,t2.val,0)) AS '5 Day Period'
-> , SUM( IF(t2.`mydate` >= DATE_ADD(DATE_ADD(LAST_DAY(t1.`mydate` ),INTERVAL 1 DAY),INTERVAL - 1 MONTH),t2.val,0)) AS 'Month of Date'
-> FROM tab t1
-> LEFT JOIN tab t2 ON t2.`mydate`
-> BETWEEN LEAST( DATE_ADD(DATE_ADD(LAST_DAY(t1.`mydate` ),INTERVAL 1 DAY),INTERVAL - 1 MONTH),
-> t1.`mydate` - INTERVAL 4 DAY)
-> AND t1.`mydate`
-> GROUP BY t1.`mydate`
-> ORDER BY t1.`mydate` desc;
+------------+-------+--------------+---------------+
| Date | Value | 5 Day Period | Month of Date |
+------------+-------+--------------+---------------+
| 2021-02-07 | 10 | 140 | 170 |
| 2021-02-06 | 30 | 150 | 160 |
| 2021-02-05 | 40 | 130 | 130 |
| 2021-02-04 | 50 | 110 | 90 |
| 2021-02-03 | 10 | 70 | 40 |
| 2021-02-02 | 20 | 90 | 30 |
| 2021-02-01 | 10 | 110 | 10 |
| 2021-01-31 | 20 | 120 | 200 |
| 2021-01-30 | 10 | 130 | 180 |
| 2021-01-29 | 30 | 130 | 170 |
| 2021-01-28 | 40 | 140 | 140 |
| 2021-01-27 | 20 | 100 | 100 |
| 2021-01-26 | 30 | 80 | 80 |
| 2021-01-25 | 10 | 50 | 50 |
| 2021-01-24 | 40 | 40 | 40 |
+------------+-------+--------------+---------------+
15 rows in set (0.00 sec)

Filter a sum of values until a certain threshold is reached

DbFiddle
Stuck. Need SO :)
Consider the following distribution of values.
ID CNT SEC SHOW(Bool)
1 10 1
2 1 1
3 25 1
4 1 1
5 2 1
6 10 1
7 50 2
8 90 2
My goal is to filter by sec and then
sort by cnt ascending,
sort by id ascending
and then flag all rows as show = false where cnt is < 5, and keep flagging further rows until the sum of cnt over all hidden rows (show = false) is >= 5.
So the sum of all "hidden" rows must never be < 5.
Expected outcome for sec=1:
| id | cnt | cnt_sum | show |
|----|-----|---------|-------|
| 2 | 1 | 1 | false |
| 4 | 1 | 2 | false |
| 5 | 2 | 4 | false |
| 1 | 10 | 14 | false | -- The sum of all hidden rows before this point is 4
| 6 | 10 | 24 | true | -- The total of all hidden rows is now >= 5.
| 3 | 25 | 49 | true |
Expected outcome for sec=2:
| id | cnt | cnt_sum | show |
|----|-----|---------|-------|
| 7 | 50 | 50 | true |
| 8 | 90 | 140 | true |
I can already sort the values and create the sums etc. What I have not figured out is how to determine the cutoff point after which hiding is no longer necessary.
I am already doing this in "client code" and I want to migrate it to SQL.
Here LAG() will help to achieve what you want. You can write your query like below:
with cte as (
SELECT
id, cnt, sec,
sum(cnt) over (partition by sec order by cnt,id) sum_
FROM
tbl )
select
id, cnt, sum_,
case
when sum_<5 or lag(sum_) over (partition by sec order by cnt,id) <5 then 'false'
else
'true'
end as "show"
from cte
DEMO
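
The CTE plus LAG() logic can be traced with a short Python sketch. It is an illustration only: the function name and the `threshold` parameter are assumptions, and `None` stands in for the NULL that LAG() returns on the first row of each partition.

```python
from itertools import groupby

# Rows from the question: (id, cnt, sec).
data = [
    (1, 10, 1), (2, 1, 1), (3, 25, 1), (4, 1, 1),
    (5, 2, 1), (6, 10, 1), (7, 50, 2), (8, 90, 2),
]

def flag_rows(data, threshold=5):
    """Return {sec: [(id, cnt, running_sum, show), ...]}."""
    out = {}
    # PARTITION BY sec ORDER BY cnt, id
    ordered = sorted(data, key=lambda r: (r[2], r[1], r[0]))
    for sec, group in groupby(ordered, key=lambda r: r[2]):
        running, prev = 0, None  # prev plays the role of LAG(sum_)
        flagged = []
        for id_, cnt, _ in group:
            running += cnt
            # Hide while the running sum, or the sum up to the previous row, is below the threshold.
            hide = running < threshold or (prev is not None and prev < threshold)
            flagged.append((id_, cnt, running, not hide))
            prev = running
        out[sec] = flagged
    return out

flags = flag_rows(data)
for sec, rows in flags.items():
    print(sec, rows)
```

Checking the previous running sum is what hides id 1 in sec = 1: the rows hidden before it only add up to 4, so one more row has to be hidden.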

How to group by each date with certain condition in postgresql?

I have a table like this in postgresql. Each row shows a customer subscribing to one of our products. For example, customer 1 paid for a 1-month subscription on 2019-07-03.
date product period subscriber_id units
2019-07-03 A 1Month 1 1
2019-07-02 A 1Year 2 1
2019-07-01 B 1Year 1 1
2019-06-30 B 1Month 3 1
2019-06-30 A 1Month 4 1
2019-06-03 B 1Month 4 1
2019-06-03 A 1Month 1 1
I want to calculate the number of distinct subscribers with a valid subscription on each day. The result would look like
base_date product total_distinct_count
2019-07-03 A 3
2019-07-03 B 3
2019-07-02 A 3
2019-07-02 B 3
2019-07-01 A 2
2019-07-01 B 3
2019-06-30 A 2
2019-06-30 B 1
...
In the first row, there are 3 different customers (1, 2, 4) who are still subscribed to product A on 2019-07-03.
I've tried to group by each day and use a distinct count,
SELECT date, COUNT(DISTINCT(subscriber_id))
-- do some conditions
GROUP BY date, product
I don't know how to group by with this condition. If there is a better way to solve this problem, I would really appreciate it!
This is pretty straightforward if you use date ranges.
CREATE TABLE SUBSCRIPTION (
date date,
product text,
period interval,
subscriber_id int,
units int
);
INSERT INTO SUBSCRIPTION VALUES
('2019-07-03', 'A' , '1 month', 1, 1),
('2019-07-02', 'A', '1 year', 2, 1),
('2019-07-01', 'B', '1 year', 1, 1),
('2019-06-30', 'B', '1 month', 3, 1),
('2019-06-30', 'A', '1 month', 4, 1),
('2019-06-03', 'B', '1 month', 4, 1),
('2019-06-03', 'A', '1 month', 1, 1);
-- First, get the list of dateranges, from 2019-06-03 to 2019-07-03 (or whatever you want)
WITH dates as (
SELECT daterange(t::date, (t + interval '1' day)::date, '[)')
FROM generate_series('2019-06-03'::timestamp without time zone,
'2019-07-03',
interval '1' day) as g(t)
)
SELECT lower(daterange)::date, count(distinct subscriber_id)
FROM dates
LEFT JOIN subscription ON daterange <@
daterange(subscription.date,
(subscription.date + period)::date)
GROUP BY daterange
;
lower | count
------------+-------
2019-06-03 | 2
2019-06-04 | 2
2019-06-05 | 2
2019-06-06 | 2
2019-06-07 | 2
2019-06-08 | 2
2019-06-09 | 2
2019-06-10 | 2
2019-06-11 | 2
2019-06-12 | 2
2019-06-13 | 2
2019-06-14 | 2
2019-06-15 | 2
2019-06-16 | 2
2019-06-17 | 2
2019-06-18 | 2
2019-06-19 | 2
2019-06-20 | 2
2019-06-21 | 2
2019-06-22 | 2
2019-06-23 | 2
2019-06-24 | 2
2019-06-25 | 2
2019-06-26 | 2
2019-06-27 | 2
2019-06-28 | 2
2019-06-29 | 2
2019-06-30 | 3
2019-07-01 | 3
2019-07-02 | 4
2019-07-03 | 4
(31 rows)
You could improve performance by storing (and indexing) the subscription valid time as a daterange instead of calculating it in the query.
EDIT: As Jay pointed out, I forgot to group by product:
WITH dates as (
SELECT daterange(t::date, (t + interval '1' day)::date, '[)')
FROM generate_series('2019-06-03'::timestamp without time zone,
'2019-07-03',
interval '1' day) as g(t)
)
SELECT lower(daterange)::date, product, count(distinct subscriber_id)
FROM dates
LEFT JOIN subscription ON daterange <@
daterange(subscription.date,
(subscription.date + period)::date)
GROUP BY daterange, product
;
lower | product | count
------------+---------+-------
2019-06-03 | A | 1
2019-06-03 | B | 1
2019-06-04 | A | 1
2019-06-04 | B | 1
2019-06-05 | A | 1
2019-06-05 | B | 1
2019-06-06 | A | 1
2019-06-06 | B | 1
2019-06-07 | A | 1
2019-06-07 | B | 1
2019-06-08 | A | 1
2019-06-08 | B | 1
2019-06-09 | A | 1
2019-06-09 | B | 1
2019-06-10 | A | 1
2019-06-10 | B | 1
2019-06-11 | A | 1
2019-06-11 | B | 1
2019-06-12 | A | 1
2019-06-12 | B | 1
2019-06-13 | A | 1
2019-06-13 | B | 1
2019-06-14 | A | 1
2019-06-14 | B | 1
2019-06-15 | A | 1
2019-06-15 | B | 1
2019-06-16 | A | 1
2019-06-16 | B | 1
2019-06-17 | A | 1
2019-06-17 | B | 1
2019-06-18 | A | 1
2019-06-18 | B | 1
2019-06-19 | A | 1
2019-06-19 | B | 1
2019-06-20 | A | 1
2019-06-20 | B | 1
2019-06-21 | A | 1
2019-06-21 | B | 1
2019-06-22 | A | 1
2019-06-22 | B | 1
2019-06-23 | A | 1
2019-06-23 | B | 1
2019-06-24 | A | 1
2019-06-24 | B | 1
2019-06-25 | A | 1
2019-06-25 | B | 1
2019-06-26 | A | 1
2019-06-26 | B | 1
2019-06-27 | A | 1
2019-06-27 | B | 1
2019-06-28 | A | 1
2019-06-28 | B | 1
2019-06-29 | A | 1
2019-06-29 | B | 1
2019-06-30 | A | 2
2019-06-30 | B | 2
2019-07-01 | A | 2
2019-07-01 | B | 3
2019-07-02 | A | 3
2019-07-02 | B | 3
2019-07-03 | A | 3
2019-07-03 | B | 2
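
As a cross-check of the containment logic, here is a minimal Python sketch that treats each subscription as a half-open range [start, end) and counts distinct subscribers per day and product. The `add_months` helper and the tuple layout are assumptions made for this illustration; periods are expressed in months (1 year = 12).

```python
import calendar
from datetime import date, timedelta

def add_months(d, months):
    """Add calendar months, clamping the day to the end of the target month."""
    y, m = divmod(d.year * 12 + d.month - 1 + months, 12)
    return date(y, m + 1, min(d.day, calendar.monthrange(y, m + 1)[1]))

# (start, product, months, subscriber_id) taken from the question's table.
subs = [
    (date(2019, 7, 3), "A", 1, 1),
    (date(2019, 7, 2), "A", 12, 2),
    (date(2019, 7, 1), "B", 12, 1),
    (date(2019, 6, 30), "B", 1, 3),
    (date(2019, 6, 30), "A", 1, 4),
    (date(2019, 6, 3), "B", 1, 4),
    (date(2019, 6, 3), "A", 1, 1),
]

def active_counts(subs, first, last):
    """Return {(day, product): distinct subscriber count} for first <= day <= last."""
    counts = {}
    day = first
    while day <= last:
        active = {}
        for start, product, months, sid in subs:
            # The subscription covers [start, end): the end date itself is excluded.
            if start <= day < add_months(start, months):
                active.setdefault(product, set()).add(sid)
        for product, ids in active.items():
            counts[(day, product)] = len(ids)
        day += timedelta(days=1)
    return counts

counts = active_counts(subs, date(2019, 6, 3), date(2019, 7, 3))
print(counts[(date(2019, 7, 3), "A")], counts[(date(2019, 7, 3), "B")])
```

Because the ranges are half-open, the two subscriptions that started on 2019-06-03 no longer count on 2019-07-03, which is why product B drops to 2 on that day.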

postgres tablefunc, sales data grouped by product, with crosstab of months

TIL about tablefunc and crosstab. At first I wanted to "group data by columns" but that doesn't really mean anything.
My product sales look like this
product_id | units | date
-----------------------------------
10 | 1 | 1-1-2018
10 | 2 | 2-2-2018
11 | 3 | 1-1-2018
11 | 10 | 1-2-2018
12 | 1 | 2-1-2018
13 | 10 | 1-1-2018
13 | 10 | 2-2-2018
I would like to produce a table of products with months as columns
product_id | 01-01-2018 | 02-01-2018 | etc.
-----------------------------------
10 | 1 | 2
11 | 13 | 0
12 | 0 | 1
13 | 20 | 0
First I would group by month, then invert and group by product, but I cannot figure out how to do this.
After enabling the tablefunc extension,
SELECT product_id, coalesce("2018-1-1", 0) as "2018-1-1"
, coalesce("2018-2-1", 0) as "2018-2-1"
FROM crosstab(
$$SELECT product_id, date_trunc('month', date)::date as month, sum(units) as units
FROM test
GROUP BY product_id, month
ORDER BY 1$$
, $$VALUES ('2018-1-1'::date), ('2018-2-1')$$
) AS ct (product_id int, "2018-1-1" int, "2018-2-1" int);
yields
| product_id | 2018-1-1 | 2018-2-1 |
|------------+----------+----------|
| 10 | 1 | 2 |
| 11 | 13 | 0 |
| 12 | 0 | 1 |
| 13 | 10 | 10 |
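
The same pivot can be sketched in plain Python to show what crosstab is doing: aggregate per product and month, then spread a fixed list of months into columns. This is an illustration only; it assumes the question's dates are month-day-year, and the names `sales`, `months` and `pivot` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# (product_id, units, date) from the question, reading the dates as month-day-year.
sales = [
    (10, 1, date(2018, 1, 1)),
    (10, 2, date(2018, 2, 2)),
    (11, 3, date(2018, 1, 1)),
    (11, 10, date(2018, 1, 2)),
    (12, 1, date(2018, 2, 1)),
    (13, 10, date(2018, 1, 1)),
    (13, 10, date(2018, 2, 2)),
]

# Fixed column list, like the VALUES clause passed to crosstab.
months = [date(2018, 1, 1), date(2018, 2, 1)]

def pivot(sales, months):
    """Return {product_id: [units per month, in the order of `months`]}."""
    totals = defaultdict(int)
    for product_id, units, d in sales:
        # Equivalent of date_trunc('month', date) followed by sum(units).
        totals[(product_id, d.replace(day=1))] += units
    products = sorted({p for p, _, _ in sales})
    # Missing product/month combinations become 0, like coalesce(..., 0).
    return {p: [totals.get((p, m), 0) for m in months] for p in products}

table = pivot(sales, months)
for product_id, row in table.items():
    print(product_id, *row)
```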

Setting multiple rows in postgres based on the set values of previous postgres rows

I'm running postgres 9.4
I'm essentially updating an existing unorganized structure to a folder-based organization. I'm auto-assigning an order number to each item so that users can reorder them, and I'm setting all of these values initially with a one-time UPDATE statement. However, it seems like SET evaluates my subquery against the table as it was before the update, rather than re-evaluating it for each successive row that it sets.
Here's my query example:
UPDATE folder_items
SET order_number =
(SELECT COALESCE(MAX(folder_items_2.order_number), 0) + 1
FROM folder_items AS folder_items_2
WHERE folder_items.parent_folder_id = folder_items_2.parent_folder_id
AND folder_items.folder_set_id = folder_items_2.folder_set_id
AND folder_items.id != folder_items_2.id);
With my initial table:
| folder_id | folder_set_id | order_number
row 1 | 1 | 1 | null
row 2 | 2 | 1 | null
row 3 | 3 | 2 | null
row 4 | 4 | 2 | null
row 5 | 5 | 2 | null
row 6 | 6 | 3 | null
when I run my query I get something like
| folder_id | folder_set_id | order_number
row 1 | 1 | 1 | 1
row 2 | 2 | 1 | 1
row 3 | 3 | 2 | 1
row 4 | 4 | 2 | 1
row 5 | 5 | 2 | 1
row 6 | 6 | 3 | 1
However, I want results that look like this:
| folder_id | folder_set_id | order_number
row 1 | 1 | 1 | 1
row 2 | 2 | 1 | 2
row 3 | 3 | 2 | 1
row 4 | 4 | 2 | 2
row 5 | 5 | 2 | 3
row 6 | 6 | 3 | 1
Is there a way to get these desired results? Is the best way to do some sort of window function that counts how many in the same folder_set_id are underneath each row?
Use ROW_NUMBER to calculate the order_number, then update the table.
with new_order as (
SELECT "folder_id",
row_number() over ( partition by "folder_set_id"
order by "folder_id") as rn
FROM Table1
)
UPDATE Table1 AS t
SET "order_number" = n.rn
FROM new_order AS n
WHERE t."folder_id" = n."folder_id";
SQL DEMO
OUTPUT
| row_id | folder_id | folder_set_id | order_number |
|--------|-----------|---------------|--------------|
| row 1 | 1 | 1 | 1 |
| row 2 | 2 | 1 | 2 |
| row 3 | 3 | 2 | 1 |
| row 4 | 4 | 2 | 2 |
| row 5 | 5 | 2 | 3 |
| row 6 | 6 | 3 | 1 |
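
What row_number() with PARTITION BY computes here can be sketched in a few lines of Python. This is an illustration of the numbering only, not of the UPDATE itself; the function name is hypothetical.

```python
from collections import defaultdict

# (folder_id, folder_set_id) pairs from the question's table.
items = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 3)]

def number_within_sets(items):
    """Return {folder_id: order_number}, numbering from 1 inside each folder_set_id."""
    counters = defaultdict(int)
    order = {}
    # PARTITION BY folder_set_id ORDER BY folder_id
    for folder_id, set_id in sorted(items, key=lambda r: (r[1], r[0])):
        counters[set_id] += 1
        order[folder_id] = counters[set_id]
    return order

order_numbers = number_within_sets(items)
print(order_numbers)
```

The numbering is computed in one pass over a snapshot of the data, which is why it avoids the problem of the correlated subquery seeing only the pre-update values.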