Computing rolling sums efficiently in PostgreSQL

Supposing I have a set of transactions (purchases) with dates for a set of customers, I want to calculate a rolling x day sum of purchase amount and number of purchases by customer in that same window. I've gotten it to work using a window function, but I have to fill in for dates where the customer did not make any purchases. In so doing, I'm using a Cartesian product. Is there a more efficient approach so that it's more scalable as the number of customers – and time window – increases?
Edit: As noted in the comments, I'm on PostgreSQL v9.3.
Here's sample data (note that some customers may have 0, 1, or multiple purchases on a given date):
| id | cust_id | txn_date | amount |
|----|---------|------------|--------|
| 1 | 123 | 2017-08-17 | 10 |
| 2 | 123 | 2017-08-17 | 5 |
| 3 | 123 | 2017-08-18 | 5 |
| 4 | 123 | 2017-08-20 | 50 |
| 5 | 123 | 2017-08-21 | 100 |
| 6 | 456 | 2017-08-01 | 5 |
| 7 | 456 | 2017-08-01 | 5 |
| 8 | 456 | 2017-08-01 | 5 |
| 9 | 456 | 2017-08-30 | 5 |
| 10 | 456 | 2017-08-01 | 1000 |
| 11 | 789 | 2017-08-15 | 1000 |
| 12 | 789 | 2017-08-30 | 1000 |
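For reference, a minimal setup that reproduces the sample data (the table and column names match the query below; the column types are my assumptions):
CREATE TABLE transactions (
    id       int
    ,cust_id  int
    ,txn_date date
    ,amount   numeric
);
INSERT INTO transactions (id, cust_id, txn_date, amount) VALUES
  (1, 123, '2017-08-17', 10)
, (2, 123, '2017-08-17', 5)
, (3, 123, '2017-08-18', 5)
, (4, 123, '2017-08-20', 50)
, (5, 123, '2017-08-21', 100)
, (6, 456, '2017-08-01', 5)
, (7, 456, '2017-08-01', 5)
, (8, 456, '2017-08-01', 5)
, (9, 456, '2017-08-30', 5)
, (10, 456, '2017-08-01', 1000)
, (11, 789, '2017-08-15', 1000)
, (12, 789, '2017-08-30', 1000);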
And here's the desired output:
| cust_id | txn_date | sum_dly_txns | tot_txns_7d | cnt_txns_7d |
|---------|------------|--------------|-------------|-------------|
| 123 | 2017-08-17 | 15 | 15 | 2 |
| 123 | 2017-08-18 | 5 | 20 | 3 |
| 123 | 2017-08-20 | 50 | 70 | 4 |
| 123 | 2017-08-21 | 100 | 170 | 5 |
| 456 | 2017-08-01 | 1015 | 1015 | 4 |
| 456 | 2017-08-30 | 5 | 5 | 1 |
| 789 | 2017-08-15 | 1000 | 1000 | 1 |
| 789 | 2017-08-30 | 1000 | 1000 | 1 |
Here's SQL that produces the totals as desired:
SELECT *
FROM (
    -- One row per day per user
    WITH daily_txns AS (
        SELECT
            t.cust_id
            ,t.txn_date AS txn_date
            ,SUM(t.amount) AS sum_dly_txns
            ,COUNT(t.id) AS cnt_dly_txns
        FROM transactions t
        GROUP BY t.cust_id, txn_date
    ),
    -- Every possible transaction date for every user
    dummydates AS (
        SELECT txn_date, uids.cust_id
        FROM (
            SELECT generate_series(
                timestamp '2017-08-01'
                ,timestamp '2017-08-30'
                ,interval '1 day')::date
        ) d(txn_date)
        CROSS JOIN (SELECT DISTINCT cust_id FROM daily_txns) uids
    ),
    -- Daily aggregates with zero-filled rows for the missing dates
    txns_dummied AS (
        SELECT
            d.cust_id
            ,d.txn_date
            ,COALESCE(sum_dly_txns,0) AS sum_dly_txns
            ,COALESCE(cnt_dly_txns,0) AS cnt_dly_txns
        FROM dummydates d
        LEFT JOIN daily_txns dx
            ON d.txn_date = dx.txn_date
            AND d.cust_id = dx.cust_id
        ORDER BY d.txn_date, d.cust_id
    )
    SELECT
        cust_id
        ,txn_date
        ,sum_dly_txns
        ,SUM(COALESCE(sum_dly_txns,0)) OVER w AS tot_txns_7d
        ,SUM(cnt_dly_txns) OVER w AS cnt_txns_7d
    FROM txns_dummied
    WINDOW w AS (
        PARTITION BY cust_id
        ORDER BY txn_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- 7d moving window
    )
    ORDER BY cust_id, txn_date
) xfers
WHERE sum_dly_txns > 0 -- Omit dates with no transactions
;

Instead of ROWS BETWEEN 6 PRECEDING AND CURRENT ROW, did you want to write RANGE '6 days' PRECEDING?
This must be what you are looking for:
SELECT DISTINCT
    cust_id
    ,txn_date
    ,SUM(amount) OVER (PARTITION BY cust_id, txn_date) AS sum_dly_txns
    ,SUM(amount) OVER (PARTITION BY cust_id ORDER BY txn_date RANGE '6 days' PRECEDING) AS tot_txns_7d
    ,COUNT(*) OVER (PARTITION BY cust_id ORDER BY txn_date RANGE '6 days' PRECEDING) AS cnt_txns_7d
FROM transactions
ORDER BY cust_id, txn_date
Edit: Since you are using an old version (I tested the query above on PostgreSQL 11), the above will not help you, so you will need old-fashioned SQL (that is, without window functions).
It is a bit less efficient but does a fair job.
WITH daily_txns AS (
    SELECT
        t.cust_id
        ,t.txn_date AS txn_date
        ,SUM(t.amount) AS sum_dly_txns
        ,COUNT(t.id) AS cnt_dly_txns
    FROM transactions t
    GROUP BY t.cust_id, txn_date
)
SELECT t1.cust_id, t1.txn_date, t1.sum_dly_txns
    ,SUM(t2.sum_dly_txns) AS tot_txns_7d
    ,SUM(t2.cnt_dly_txns) AS cnt_txns_7d
FROM daily_txns t1
JOIN daily_txns t2
    ON t1.cust_id = t2.cust_id
    -- 6 days back plus the current day makes a 7-day window
    AND t2.txn_date BETWEEN t1.txn_date - 6 AND t1.txn_date
GROUP BY t1.cust_id, t1.txn_date, t1.sum_dly_txns
ORDER BY t1.cust_id, t1.txn_date
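For what it's worth, PostgreSQL 9.3 already supports LATERAL, so there is a middle ground that avoids the Cartesian product entirely: aggregate per day, then let each daily row look back over its own 7-day window. A sketch under that assumption, reusing the daily_txns aggregation; treat it as a starting point rather than a drop-in replacement:
WITH daily_txns AS (
    SELECT cust_id, txn_date
        ,SUM(amount) AS sum_dly_txns
        ,COUNT(id) AS cnt_dly_txns
    FROM transactions
    GROUP BY cust_id, txn_date
)
SELECT d.cust_id, d.txn_date, d.sum_dly_txns
    ,w.tot_txns_7d, w.cnt_txns_7d
FROM daily_txns d
CROSS JOIN LATERAL (
    -- aggregate the 7-day window ending at this row's date
    SELECT SUM(d2.sum_dly_txns) AS tot_txns_7d
        ,SUM(d2.cnt_dly_txns) AS cnt_txns_7d
    FROM daily_txns d2
    WHERE d2.cust_id = d.cust_id
      AND d2.txn_date BETWEEN d.txn_date - 6 AND d.txn_date
) w
ORDER BY d.cust_id, d.txn_date;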

Related

SUM OVER PARTITION ON Date range

I'm trying to do a cumulative sum over specific periods of time for every row in Postgres. Example:
| Date       | Value | Employee |
|------------|-------|----------|
| 25-01-1990 | 34    | Aaron    |
| 15-02-1990 | 4     | Aaron    |
| 02-03-1990 | 3     | Aaron    |
| 22-05-1990 | 7     | Aaron    |
Expected result, taking a range of 60 days:
| Date       | Value | Employee |
|------------|-------|----------|
| 25-01-1990 | 34    | Aaron    |
| 15-02-1990 | 38    | Aaron    |
| 02-03-1990 | 41    | Aaron    |
| 01-05-1990 | 10    | Aaron    |
I tried with the following but the results are not correct:
WITH tab AS (SELECT * FROM table_with_values)
SELECT tab.Date, SUM(tab.Value)
FILTER (WHERE tab.Date<=tab.Date AND tab.Date >=t.Date - INTERVAL '60 DAY')
OVER(PARTITION BY tab.Employee ORDER BY tab.Date ROWS BETWEEN UNBOUND PRECEDENT AND CURRENT ROW)
AS values_cumulative, tab.Employee
FROM tab
Try this:
SELECT date, employee, sum(bvalue)
FROM (
SELECT a.*, b.date as bdate, b.value as bvalue
FROM testtable a
LEFT JOIN testtable b ON
a.employee = b.employee AND
b.date <= a.date AND
b.date >= a.date - integer '60') c
GROUP BY employee, date
ORDER BY date ASC;
date | employee | sum
------------+----------+-----
1990-01-25 | Aaron | 34
1990-02-15 | Aaron | 38
1990-03-02 | Aaron | 41
1990-05-01 | Aaron | 10
(4 rows)
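For completeness: on PostgreSQL 11 or later the same rolling 60-day sum can be written as a single window frame with a date-range offset. A sketch against the same testtable:
SELECT date, employee
    ,SUM(value) OVER (
        PARTITION BY employee
        ORDER BY date
        RANGE BETWEEN INTERVAL '60 days' PRECEDING AND CURRENT ROW
    ) AS values_cumulative
FROM testtable
ORDER BY date;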

postgres tablefunc, sales data grouped by product, with crosstab of months

TIL about tablefunc and crosstab. At first I wanted to "group data by columns" but that doesn't really mean anything.
My product sales look like this
product_id | units | date
-----------------------------------
10 | 1 | 1-1-2018
10 | 2 | 2-2-2018
11 | 3 | 1-1-2018
11 | 10 | 1-2-2018
12 | 1 | 2-1-2018
13 | 10 | 1-1-2018
13 | 10 | 2-2-2018
I would like to produce a table of products with months as columns
product_id | 01-01-2018 | 02-01-2018 | etc.
-----------------------------------
10 | 1 | 2
11 | 13 | 0
12 | 0 | 1
13 | 20 | 0
First I would group by month, then invert and group by product, but I cannot figure out how to do this.
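crosstab lives in the tablefunc contrib module, so it needs a one-time setup per database (assuming the contrib package is installed on the server):
CREATE EXTENSION IF NOT EXISTS tablefunc;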
After enabling the tablefunc extension,
SELECT product_id, coalesce("2018-1-1", 0) as "2018-1-1"
, coalesce("2018-2-1", 0) as "2018-2-1"
FROM crosstab(
$$SELECT product_id, date_trunc('month', date)::date as month, sum(units) as units
FROM test
GROUP BY product_id, month
ORDER BY 1$$
, $$VALUES ('2018-1-1'::date), ('2018-2-1')$$
) AS ct (product_id int, "2018-1-1" int, "2018-2-1" int);
yields
| product_id | 2018-1-1 | 2018-2-1 |
|------------+----------+----------|
| 10 | 1 | 2 |
| 11 | 13 | 0 |
| 12 | 0 | 1 |
| 13 | 10 | 10 |
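Note the coalesce calls in the outer query: crosstab returns NULL for every (product_id, month) combination that has no row in the source query, and coalesce turns those gaps into the zeros shown above.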

Calculate out price in FIFO SQL

Using Postgres 11
Using FIFO, I would like to calculate the price of items taken from the inventory, to keep track of the value of the total inventory.
Dataset is as follows:
| ID   | prodno | amount_purchased | amount_taken | price | created_at  |
|------|--------|------------------|--------------|-------|-------------|
| uuid | 13976  | 10               | NULL         | 130   | <timestamp> |
| uuid | 13976  | 10               | NULL         | 150   | <timestamp> |
| uuid | 13976  | 10               | NULL         | 110   | <timestamp> |
| uuid | 13976  | 10               | NULL         | 100   | <timestamp> |
| uuid | 13976  | NULL             | 14           | ??    | <timestamp> |
Before inserting the row with amount_taken, I would need to calculate the average price of each of the 14 items taken, which in this case would be 135.71. But how to calculate this relatively efficiently?
My initial idea was to split the rows into two temp tables, one where amount_taken is null and one where it is not, and then calculate down across all the rows. But seeing as this table could become rather large, rather fast (since most of the time only one item is taken from the inventory), I worry this would be a decent solution in the short term but would slow down as the table grows. So, internet, what's the better solution?
Given this setup:
CREATE TABLE test (
id int
, prodno int
, quantity numeric
, price numeric
, created_at timestamp
);
INSERT INTO test VALUES
(1, 13976, 10, 130, NOW())
, (2, 13976, 10, 150, NOW()+'1 hours')
, (3, 13976, 10, 110, NOW()+'2 hours')
, (4, 13976, 10, 100, NOW()+'3 hours')
, (5, 13976, -14, NULL, NOW()+'4 hours')
, (6, 13976, -1, NULL, NOW()+'5 hours')
, (7, 13976, -10, NULL, NOW()+'6 hours')
;
then the SQL
SELECT id, prodno, created_at, qty_sold
-- 5
, round((cum_sold_cost - coalesce(lag(cum_sold_cost) over w, 0))/qty_sold, 2) as fifo_price
, qty_bought, prev_bought, total_cost
, prev_total_cost
, cum_sold_cost
, coalesce(lag(cum_sold_cost) over w, 0) as prev_cum_sold_cost
FROM (
SELECT id, tneg.prodno, created_at, qty_sold, tpos.qty_bought, prev_bought, total_cost, prev_total_cost
-- 4
, round(prev_total_cost + ((tneg.cum_sold - tpos.prev_bought)/(tpos.qty_bought - tpos.prev_bought))*(total_cost-prev_total_cost), 2) as cum_sold_cost
FROM (
SELECT id, prodno, created_at, -quantity as qty_sold
, sum(-quantity) over w as cum_sold
FROM test
WHERE quantity < 0
WINDOW w AS (PARTITION BY prodno ORDER BY created_at)
-- 1
) tneg
LEFT JOIN (
SELECT prodno
, sum(quantity) over w as qty_bought
, coalesce(sum(quantity) over prevw, 0) as prev_bought
, quantity * price as cost
, sum(quantity * price) over w as total_cost
, coalesce(sum(quantity * price) over prevw, 0) as prev_total_cost
FROM test
WHERE quantity > 0
WINDOW w AS (PARTITION BY prodno ORDER BY created_at)
, prevw AS (PARTITION BY prodno ORDER BY created_at ROWS BETWEEN unbounded preceding AND 1 preceding)
-- 2
) tpos
-- 3
ON tneg.cum_sold BETWEEN tpos.prev_bought AND tpos.qty_bought
AND tneg.prodno = tpos.prodno
) t
WINDOW w AS (PARTITION BY prodno ORDER BY created_at)
yields
| id | prodno | created_at | qty_sold | fifo_price | qty_bought | prev_bought | total_cost | prev_total_cost | cum_sold_cost | prev_cum_sold_cost |
|----+--------+----------------------------+----------+------------+------------+-------------+------------+-----------------+---------------+--------------------|
| 5 | 13976 | 2019-03-07 21:07:13.267218 | 14 | 135.71 | 20 | 10 | 2800 | 1300 | 1900.00 | 0 |
| 6 | 13976 | 2019-03-07 22:07:13.267218 | 1 | 150.00 | 20 | 10 | 2800 | 1300 | 2050.00 | 1900.00 |
| 7 | 13976 | 2019-03-07 23:07:13.267218 | 10 | 130.00 | 30 | 20 | 3900 | 2800 | 3350.00 | 2050.00 |
tneg contains information about quantities sold
| id | prodno | created_at | qty_sold | cum_sold |
|----+--------+----------------------------+----------+----------|
| 5 | 13976 | 2019-03-07 21:07:13.267218 | 14 | 14 |
| 6 | 13976 | 2019-03-07 22:07:13.267218 | 1 | 15 |
| 7 | 13976 | 2019-03-07 23:07:13.267218 | 10 | 25 |
tpos contains information about quantities bought
| prodno | qty_bought | prev_bought | cost | total_cost | prev_total_cost |
|--------+------------+-------------+------+------------+-----------------|
| 13976 | 10 | 0 | 1300 | 1300 | 0 |
| 13976 | 20 | 10 | 1500 | 2800 | 1300 |
| 13976 | 30 | 20 | 1100 | 3900 | 2800 |
| 13976 | 40 | 30 | 1000 | 4900 | 3900 |
We match rows in tneg with rows in tpos on the condition that cum_sold lies between prev_bought and qty_bought.
cum_sold is the cumulative amount sold, qty_bought is the cumulative amount bought, and prev_bought is the previous value of qty_bought.
| id | prodno | created_at | qty_sold | cum_sold | qty_bought | prev_bought | total_cost | prev_total_cost | cum_sold_cost |
|----+--------+----------------------------+----------+----------+------------+-------------+------------+-----------------+---------------|
| 5 | 13976 | 2019-03-07 21:07:13.267218 | 14 | 14 | 20 | 10 | 2800 | 1300 | 1900.00 |
| 6 | 13976 | 2019-03-07 22:07:13.267218 | 1 | 15 | 20 | 10 | 2800 | 1300 | 2050.00 |
| 7 | 13976 | 2019-03-07 23:07:13.267218 | 10 | 25 | 30 | 20 | 3900 | 2800 | 3350.00 |
The fraction
((tneg.cum_sold - tpos.prev_bought)/(tpos.qty_bought - tpos.prev_bought)) as frac
measures how far cum_sold lies between prev_bought and qty_bought. We use this fraction to compute
cum_sold_cost, the cumulative cost associated with buying cum_sold items.
cum_sold_cost lies frac distance between prev_total_cost and total_cost.
Once you obtain cum_sold_cost, you have everything you need to compute marginal FIFO unit prices.
For each line of tneg, the difference between cum_sold_cost and its previous value is the cost of the qty_sold.
FIFO price is simply the ratio of this cost and qty_sold.
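As a sanity check against the question's expected 135.71: for the first sale (id 5), cum_sold = 14, prev_bought = 10 and qty_bought = 20, so frac = (14 - 10) / (20 - 10) = 0.4; cum_sold_cost = 1300 + 0.4 * (2800 - 1300) = 1900.00; and fifo_price = 1900.00 / 14 = 135.71.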

1st and 7th row in grouping

I have this table named Samples. The Date column values are just symbolic date values.
+----+------------+-------+------+
| Id | Product_Id | Price | Date |
+----+------------+-------+------+
| 1 | 1 | 100 | 1 |
| 2 | 2 | 100 | 2 |
| 3 | 3 | 100 | 3 |
| 4 | 1 | 100 | 4 |
| 5 | 2 | 100 | 5 |
| 6 | 3 | 100 | 6 |
...
+----+------------+-------+------+
I want to group by product_id such that I have the 1st sample in descending date order and a new column added with the Price of the 7th sample row in each product group. If the 7th row does not exist, then the value should be null.
Example:
+----+------------+-------+------+----------+
| Id | Product_Id | Price | Date | 7thPrice |
+----+------------+-------+------+----------+
| 4 | 1 | 100 | 4 | 120 |
| 5 | 2 | 100 | 5 | 100 |
| 6 | 3 | 100 | 6 | NULL |
+----+------------+-------+------+----------+
I believe I can achieve the table without the 7thPrice column with the following:
SELECT * FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY date DESC) r, * FROM Samples
) T WHERE T.r = 1
Any suggestions?
You can try something like this. I used your query to create a CTE. Then joined rank1 to rank7.
;WITH sampleCTE AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY date DESC) r, *
    FROM Samples
)
SELECT *
FROM (SELECT * FROM sampleCTE WHERE r = 1) a
LEFT JOIN (SELECT * FROM sampleCTE WHERE r = 7) b
    ON a.product_id = b.product_id
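If the server supports LEAD (SQL Server 2012+; also available in PostgreSQL), a single pass over the table works as well, since the 7th row in date-descending order sits exactly 6 rows ahead of the 1st. A sketch under that assumption:
SELECT Id, Product_Id, Price, [Date], [7thPrice]
FROM (
    SELECT *
        ,ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY [Date] DESC) AS r
        -- LEAD returns NULL when the group has fewer than 7 rows, as required
        ,LEAD(Price, 6) OVER (PARTITION BY Product_Id ORDER BY [Date] DESC) AS [7thPrice]
    FROM Samples
) t
WHERE t.r = 1;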

Grouping in t-sql with latest dates

I have a table like this
Event ID | Contract ID | Event date | Amount |
----------------------------------------------
1 | 1 | 2009-01-01 | 100 |
2 | 1 | 2009-01-02 | 20 |
3 | 1 | 2009-01-03 | 50 |
4 | 2 | 2009-01-01 | 80 |
5 | 2 | 2009-01-04 | 30 |
For each contract I need to fetch the latest event and amount associated with the event and get something like this
Event ID | Contract ID | Event date | Amount |
----------------------------------------------
3 | 1 | 2009-01-03 | 50 |
5 | 2 | 2009-01-04 | 30 |
I can't figure out how to group the data correctly. Any ideas?
Thanks in advance.
SQL 2k5/2k8:
with cte_ranked as (
select *
, row_number() over (
partition by ContractId order by EventDate desc) as [rank]
from [table])
select *
from cte_ranked
where [rank] = 1;
SQL 2k:
select t.*
from [table] as t
join (
select max(EventDate) as MaxDate
, ContractId
from [table]
group by ContractId) as mt
on t.ContractId = mt.ContractId
and t.EventDate = mt.MaxDate
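One caveat with the SQL 2k version: if two events on the same contract share the latest date, the join returns both rows, whereas row_number() in the CTE version always picks exactly one.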