I have an orders table that records the datetime when each order was placed and when it was completed:
orderid  userid  price  status     createdat          doneat
1        128     100    completed  2/16/21 18:40:45   2/21/21 07:59:46
2        128     150    completed  2/21/21 05:27:29   2/23/21 11:58:23
3        128     100    completed  9/3/21 08:38:14    9/10/21 14:24:35
4        5       100    completed  5/28/22 23:28:07   6/26/22 06:10:35
5        5       100    canceled   7/8/22 22:28:57    8/10/22 06:55:17
6        5       100    completed  7/25/22 13:46:38   8/10/22 06:57:20
7        5       5      completed  8/7/22 18:07:07    8/12/22 06:56:23
I would like to have a new column that is the cumulative total (sum price) per user when the order was created:
orderid  userid  price  status     createdat          doneat             cumulative total when placed (per user)
1        128     100    completed  2/16/21 18:40:45   2/21/21 07:59:46   0
2        128     150    completed  2/21/21 05:27:29   2/23/21 11:58:23   0
3        128     100    completed  9/3/21 08:38:14    9/10/21 14:24:35   250
4        5       100    completed  5/28/22 23:28:07   6/26/22 06:10:35   0
5        5       100    canceled   7/8/22 22:28:57    8/10/22 06:55:17   100
6        5       100    completed  7/25/22 13:46:38   8/10/22 06:57:20   100
7        5       5      completed  8/7/22 18:07:07    8/12/22 06:56:23   100
The logic is: for each user, sum the price of all orders that were completed before the current row's createdat datetime. For orderid=2, although it's the user's 2nd order, no orders were completed before its createdat datetime of 2/21/21 05:27:29, so the cumulative total when placed is 0.
The same goes for orderid in [5,6,7]. For those orders and that userid, the only order that was completed before their createdat dates is order 4, so their cumulative total when placed is 100.
In PowerBI the logic is like this:
SUMX (
filter(
orders,
earlier orders.userid = orders.userid && orders.doneat < orders.createdat && order.status = 'completed'),
orders.price)
Would anyone have any hints on how to achieve this in PostgreSQL?
I tried something like this and it didn't work.
select (case when o.doneat < o.createdat over (partition by o.userid, o.status order by o.createdat)
then sum(o.price) over (partition by o.userid, o.status ORDER BY o.doneat asc ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
end) as cumulativetotal_whenplaced
from order o
Thank you
You can duplicate each row into:
an "original" (which we'll decorate with a flag keep = true), that has an accounting value val = 0 (so far), and a time t = createdat;
a "duplicate" (keep = false), that has the price to account for (if status is 'completed') as val and a time t = doneat.
Then it's just a matter of accounting for the right bits:
select orderid, userid, price, status, createdat, doneat, cumtot
from (
select *, sum(val) over (partition by userid order by t, keep desc) as cumtot
from (
select *, createdat as t, 0 as val, true as keep from foo
union all
select *, doneat as t,
case when status = 'completed' then price else 0 end as val,
false as keep
from foo
) as a
) as a
where keep
order by orderid;
Example: DB Fiddle.
Note for Redshift: the window expression above needs to be replaced by:
...
select *, sum(val) over (
partition by userid order by t, keep desc
rows unbounded preceding) as cumtot
...
Result for your data:
orderid  userid  price  status     createdat                 doneat                    cumtot
1        128     100    completed  2021-02-16T18:40:45.000Z  2021-02-21T07:59:46.000Z  0
2        128     150    completed  2021-02-21T05:27:29.000Z  2021-02-23T11:58:23.000Z  0
3        128     100    completed  2021-09-03T08:38:14.000Z  2021-09-10T14:24:35.000Z  250
4        5       100    completed  2022-05-28T23:28:07.000Z  2022-06-26T06:10:35.000Z  0
5        5       100    canceled   2022-07-08T22:28:57.000Z  2022-08-10T06:55:17.000Z  100
6        5       100    completed  2022-07-25T13:46:38.000Z  2022-08-10T06:57:20.000Z  100
7        5       5      completed  2022-08-07T18:07:07.000Z  2022-08-12T06:56:23.000Z  100
Note: this type of accounting across time is actually robust to many corner cases (various orders overlapping, some starting and finishing while others are still in process, etc.). It is the basis for a fast interval compaction algorithm that I should describe someday on SO.
Bonus: try to figure out why the partitioning window is ordered by t (fairly obvious) and also by keep desc (less obvious).
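As a rough way to experiment with the duplicate-and-sum idea locally, here is a minimal sketch that runs the same query in SQLite through Python's sqlite3 module. The ISO-8601 text timestamps, the 1/0 stand-ins for true/false, and the helper name cumulative_totals are assumptions made for this sketch; they are not part of the original answer, which targets PostgreSQL.

```python
import sqlite3

# Sample rows from the question. Timestamps are rewritten as ISO-8601 text
# (an assumption of this sketch) so plain string comparison sorts them by time.
ORDERS = [
    (1, 128, 100, "completed", "2021-02-16 18:40:45", "2021-02-21 07:59:46"),
    (2, 128, 150, "completed", "2021-02-21 05:27:29", "2021-02-23 11:58:23"),
    (3, 128, 100, "completed", "2021-09-03 08:38:14", "2021-09-10 14:24:35"),
    (4, 5, 100, "completed", "2022-05-28 23:28:07", "2022-06-26 06:10:35"),
    (5, 5, 100, "canceled", "2022-07-08 22:28:57", "2022-08-10 06:55:17"),
    (6, 5, 100, "completed", "2022-07-25 13:46:38", "2022-08-10 06:57:20"),
    (7, 5, 5, "completed", "2022-08-07 18:07:07", "2022-08-12 06:56:23"),
]

# Same shape as the query above; keep is 1/0 instead of true/false.
QUERY = """
select orderid, cumtot
from (
  select *, sum(val) over (partition by userid order by t, keep desc) as cumtot
  from (
    select orderid, userid, createdat as t, 0 as val, 1 as keep from foo
    union all
    select orderid, userid, doneat as t,
           case when status = 'completed' then price else 0 end as val,
           0 as keep
    from foo
  ) as a
) as b
where keep = 1
order by orderid
"""


def cumulative_totals(rows):
    """Load rows into an in-memory table named foo and run the sweep query."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table foo (orderid integer, userid integer, price integer,"
        " status text, createdat text, doneat text)"
    )
    con.executemany("insert into foo values (?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


totals = cumulative_totals(ORDERS)
print(totals)
```

When a creation row and a completion row share the same t, keep desc sorts the creation row first, so a completion at exactly that instant is not yet counted. That matches the strict doneat < createdat condition in the question.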
Related
I need a query to return the initial and final numeric value of the number of listeners of some artists of the last 30 days ordered from the highest increase of listeners to the lowest.
To better understand what I mean, here are the tables involved.
artist table saves the information of a Spotify artist.
id  name      Spotify_id
1   Shakira   0EmeFodog0BfCgMzAIvKQp
2   Bizarrap  716NhGYqD1jl2wI1Qkgq36
platform_information table stores which information I want to get from the artists and on which platform.
id  platform  information
1   spotify   monthly_listeners
2   spotify   followers
platform_information_artist table stores information for each artist on a platform and information on a specific date.
id  platform_information_id  artist_id  date        value
1   1                        1          2022-11-01  100000
2   1                        1          2022-11-15  101000
3   1                        1          2022-11-30  102000
4   1                        2          2022-11-02  85000
5   1                        2          2022-11-06  90000
6   1                        2          2022-11-26  100000
Right now I have this query:
SELECT (SELECT value
FROM platform_information_artist
WHERE artist_id = 1
AND platform_information_id =
(SELECT id from platform_information WHERE platform = 'spotify' AND information = 'monthly_listeners')
AND DATE(date) >= DATE(NOW()) - INTERVAL 30 DAY
ORDER BY date ASC
LIMIT 1) as month_start,
(SELECT value
FROM platform_information_artist
WHERE artist_id = 1
AND platform_information_id =
(SELECT id from platform_information WHERE platform = 'spotify' AND information = 'monthly_listeners')
AND DATE(date) >= DATE(NOW()) - INTERVAL 30 DAY
ORDER BY date DESC
LIMIT 1) as month_end,
(SELECT month_end - month_start) as difference
ORDER BY month_start;
Which returns the following:
month_start  month_end  difference
100000       102000     2000
The problem is that this query only returns the artist I specify.
And I need the information like this:
artist_id  name      platform_information_id  month_start_value  month_end_value  difference
2          Bizarrap  1                        85000              100000           15000
1          Shakira   1                        100000             102000           2000
The query should return the 5 artists that have grown the most in number of monthly listeners over the last 30 days, along with the starting value 30 days ago, and the current value.
Thanks for the help.
I have a table with the following entries in them
id price quantity
1. 10 75
2. 10 75
3. 10 -150
4. 10 75
5. 10 -75
What I need to do is to update each row with a number that is the number of times the running total has been 0. In the above example, the cumulative totals would be
id. cum_total
1. 750
2. 1500
3. 0
4. 750
5. 0
Desired result
id price quantity seq
1. 10 75 1
2. 10 75 1
3. 10 -150 1
4. 10 75 2
5. 10 -75 2
I'm now lost in a spiral of CTEs and window functions and figured I'd ask the experts.
Thanks in advance :-)
Here is one option using analytic functions:
WITH cte AS (
SELECT *, CASE WHEN SUM(price*quantity) OVER (ORDER BY id) = 0 THEN 1 ELSE 0 END AS price_sum
FROM yourTable
),
cte2 AS (
SELECT *, LAG(price_sum, 1, 0) OVER (ORDER BY id) price_sum_lag
FROM cte
)
SELECT id, price, quantity, 1 + SUM(price_sum_lag) OVER (ORDER BY id) cumulative_total
FROM cte2
ORDER BY id;
Demo
You may try running each CTE in succession to see how the logic is working.
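To make that step-by-step inspection concrete, here is a small sketch that runs the CTE chain in SQLite via Python's sqlite3 module and returns the intermediate columns next to the final counter. The extra running column and the seq alias are additions made for this sketch only; the original answer targets another engine and names its last column cumulative_total.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table yourTable (id integer primary key, price integer, quantity integer)"
)
con.executemany(
    "insert into yourTable values (?, ?, ?)",
    [(1, 10, 75), (2, 10, 75), (3, 10, -150), (4, 10, 75), (5, 10, -75)],
)

# running       : running total of price * quantity (added here for inspection)
# price_sum     : 1 on rows where the running total is exactly 0
# price_sum_lag : the same flag shifted down one row, so the counter only
#                 increases on the row AFTER a zero was reached
steps = con.execute(
    """
    WITH cte AS (
        SELECT *,
               SUM(price * quantity) OVER (ORDER BY id) AS running,
               CASE WHEN SUM(price * quantity) OVER (ORDER BY id) = 0
                    THEN 1 ELSE 0 END AS price_sum
        FROM yourTable
    ),
    cte2 AS (
        SELECT *, LAG(price_sum, 1, 0) OVER (ORDER BY id) AS price_sum_lag
        FROM cte
    )
    SELECT id, running, price_sum, price_sum_lag,
           1 + SUM(price_sum_lag) OVER (ORDER BY id) AS seq
    FROM cte2
    ORDER BY id
    """
).fetchall()
con.close()

for row in steps:
    print(row)
```

Reading the columns left to right shows why the LAG is needed: without the shift, row 3 (where the total first hits 0) would already be counted as the start of the second batch.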
With window functions:
SELECT id, price, quantity,
coalesce(
sum(CASE WHEN iszero THEN 1 ELSE 0 END)
OVER (ORDER BY id
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
0
) + 1 AS batch
FROM (SELECT id, price, quantity,
sum(price * quantity) OVER (ORDER BY id) = 0 AS iszero
FROM mytable) AS subq;
I have a table like this
item_id date number
1 2000-01-01 100
1 2003-03-08 50
1 2004-04-21 10
1 2004-12-11 10
1 2010-03-03 10
2 2000-06-29 1
2 2002-05-22 2
2 2002-07-06 3
2 2008-10-20 4
I'm trying to get the average for each unique item_id over the last 3 dates.
It's difficult because there are missing dates in between, so a range of hardcoded dates doesn't always work.
I expect a result like :
item_id MyAverage
1 10
2 3
I don't really know how to do this. Currently I manage to do it for one item, but I have trouble extending it to multiple items:
SELECT AVG(MyAverage.number) FROM (
SELECT date,number
FROM item_list
where item_id = 1
ORDER BY date DESC limit 3
) as MyAverage;
My main problem is with generalising the "DESC limit 3" over a group by id.
attempt :
SELECT item_id,AVG(MyAverage.number)
FROM (
SELECT item_id,date,number
FROM item_list
ORDER BY date DESC limit 3) as MyAverage
GROUP BY item_id;
The limit is messing things up there.
I have made it "work" using BETWEEN date AND date, but it's not working as I want because I need a limit and not a hardcoded date.
Can anybody help?
You can use row_number() to assign 1 to 3 to the records with the latest dates for each ID and then filter on that.
SELECT x.item_id,
avg(x.number)
FROM (SELECT il.item_id,
il.number,
row_number() OVER (PARTITION BY il.item_id
ORDER BY il.date DESC) rn
FROM item_list il) x
WHERE x.rn BETWEEN 1 AND 3
GROUP BY x.item_id;
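The row_number() approach can be checked against the sample data with a short sketch like the one below. It runs the same query in an in-memory SQLite database through Python's sqlite3 module; the ORDER BY x.item_id at the end is added here only to make the output order deterministic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table item_list (item_id integer, date text, number integer)")
con.executemany(
    "insert into item_list values (?, ?, ?)",
    [
        (1, "2000-01-01", 100), (1, "2003-03-08", 50), (1, "2004-04-21", 10),
        (1, "2004-12-11", 10), (1, "2010-03-03", 10),
        (2, "2000-06-29", 1), (2, "2002-05-22", 2), (2, "2002-07-06", 3),
        (2, "2008-10-20", 4),
    ],
)

# Number each item's rows from newest to oldest, keep the newest three,
# then average them per item.
averages = con.execute(
    """
    SELECT x.item_id, avg(x.number)
    FROM (SELECT il.item_id,
                 il.number,
                 row_number() OVER (PARTITION BY il.item_id
                                    ORDER BY il.date DESC) rn
          FROM item_list il) x
    WHERE x.rn BETWEEN 1 AND 3
    GROUP BY x.item_id
    ORDER BY x.item_id
    """
).fetchall()
con.close()

print(averages)
```

Because the ranking restarts inside each PARTITION BY item_id group, the "limit 3" is applied per item rather than to the whole table, which is what the LIMIT-based attempt could not do.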
I'm fairly close to this solution, but I just need a little help getting over the end.
I'm trying to get a running count of the occurrences of client_ids regardless of the date, however I need the dates and ids to still appear in my results to verify everything.
I found part of the solution here but have not been able to modify it enough for my needs.
Here is what the answer should be, counting the occurrences of the client_ids sequentially:
id client_id deliver_on running_total
1 138 2017-10-01 1
2 29 2017-10-01 1
3 138 2017-10-01 2
4 29 2013-10-02 2
5 29 2013-10-02 3
6 29 2013-10-03 4
7 138 2013-10-03 3
However, here is what I'm getting:
id client_id deliver_on running_total
1 138 2017-10-01 1
2 29 2017-10-01 1
3 138 2017-10-01 1
4 29 2013-10-02 3
5 29 2013-10-02 3
6 29 2013-10-03 1
7 138 2013-10-03 2
Rather than counting the times the client_id appears sequentially, the code counts the time the id appears in the previous date range.
Here is my code and any help would be greatly appreciated.
Thank you,
SELECT n.id, n.client_id, n.deliver_on, COUNT(n.client_id) AS "running_total"
FROM orders n
LEFT JOIN orders o
ON (o.client_id = n.client_id
AND n.deliver_on > o.deliver_on)
GROUP BY n.id, n.deliver_on, n.client_id
ORDER BY n.deliver_on ASC
* EDIT WITH ANSWER *
I ended up solving my own question. Here is the solution with comments:
-- Set "1" for counting to be used later
WITH DATA AS (
SELECT
orders.id,
orders.client_id,
orders.deliver_on,
COUNT(1) -- Creates a column of "1" for counting the occurrences
FROM orders
GROUP BY 1
ORDER BY deliver_on, client_id
)
SELECT
id,
client_id,
deliver_on,
SUM(COUNT) OVER (PARTITION BY client_id
ORDER BY client_id, deliver_on
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- Counts the sequential client_ids based on the number of times they appear
FROM DATA
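For comparison, here is an alternative sketch (not the poster's method) that gets the same running count with ROW_NUMBER() partitioned by client_id and ordered by id. Ordering by id is an assumption based on the desired output in the question, where the count follows the id sequence; the query is run in SQLite via Python's sqlite3 module with the sample rows copied as given.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table orders (id integer primary key, client_id integer, deliver_on text)"
)
con.executemany(
    "insert into orders values (?, ?, ?)",
    [
        (1, 138, "2017-10-01"), (2, 29, "2017-10-01"), (3, 138, "2017-10-01"),
        (4, 29, "2013-10-02"), (5, 29, "2013-10-02"), (6, 29, "2013-10-03"),
        (7, 138, "2013-10-03"),
    ],
)

# ROW_NUMBER() restarts at 1 for every client_id and increases with id,
# so each row gets "how many times this client has appeared so far".
running = con.execute(
    """
    SELECT id, client_id, deliver_on,
           ROW_NUMBER() OVER (PARTITION BY client_id ORDER BY id) AS running_total
    FROM orders
    ORDER BY id
    """
).fetchall()
con.close()

for row in running:
    print(row)
```

Ordering the window by a unique column avoids ties: when several rows share the same deliver_on, a ROWS frame ordered only by date can number them in an arbitrary order.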
I have a view that contains bank account activity.
ACCOUNT BALANCE_ROW AMOUNT SORT_ORDER
111 1 0.00 1
111 0 10.00 2
111 0 -2.50 3
111 1 7.50 4
222 1 100.00 5
222 0 25.00 6
222 1 125.00 7
ACCOUNT = account number
BALANCE_ROW = 1 for a starting or ending balance row, otherwise 0
AMOUNT = the amount
SORT_ORDER = simple order to return the records in the order of start balance, activity, and end balance
I need to figure out a way to see if the sum of the non-balance rows (BALANCE_ROW = 0) equals the difference between the ending balance and the starting balance. The result for each account (1 for yes, 0 for no) would simply be added to the result set.
Example:
Account 111 had a starting balance of 0.00. There were two account activity records of 10.00 and -2.5. That resulted in the ending balance of 7.50.
I've been playing around with temp tables, but I was not sure if there is a more efficient way of accomplishing this.
Thanks for any input you may have!
I would use ranking, then group rows by ACCOUNT calculating totals along the way:
;
WITH ranked AS (
SELECT
*,
rnk = ROW_NUMBER() OVER (PARTITION BY ACCOUNT ORDER BY SORT_ORDER)
FROM data
),
grouped AS (
SELECT
ACCOUNT,
BALANCE_DIFF = SUM(CASE BALANCE_ROW WHEN 1 THEN AMOUNT END
* CASE rnk WHEN 1 THEN -1 ELSE 1 END),
ACTIVITY_SUM = SUM(CASE BALANCE_ROW WHEN 0 THEN AMOUNT ELSE 0 END)
FROM ranked -- rnk is defined in the ranked CTE, not in data
GROUP BY
ACCOUNT
)
SELECT *
FROM grouped
WHERE BALANCE_DIFF <> ACTIVITY_SUM
Ranking is only used here to make it easier to calculate the starting/ending balance difference. If starting and ending balance rows had, for instance, different BALANCE_ROW codes (like 1 for the starting balance, 2 for the ending one), it would be possible to avoid ranking.
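A minimal sketch of the ranking idea, adapted to return the 1/0 flag per account that the question asks for, could look like this. It runs in SQLite through Python's sqlite3 module, so the T-SQL alias = expression syntax is replaced by expression AS alias; storing AMOUNT as integer cents and the output column name balanced are assumptions made for this sketch.

```python
import sqlite3

# Rows from the question: (account, balance_row, amount, sort_order).
# AMOUNT is stored in integer cents (an assumption of this sketch) so the
# equality test does not depend on floating-point rounding.
ROWS = [
    (111, 1, 0, 1), (111, 0, 1000, 2), (111, 0, -250, 3), (111, 1, 750, 4),
    (222, 1, 10000, 5), (222, 0, 2500, 6), (222, 1, 12500, 7),
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table data (account integer, balance_row integer,"
    " amount integer, sort_order integer)"
)
con.executemany("insert into data values (?, ?, ?, ?)", ROWS)

# The first row per account (rnk = 1) is the starting balance and is negated,
# so summing the two balance rows yields ending minus starting balance.
checks = con.execute(
    """
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY account ORDER BY sort_order) AS rnk
        FROM data
    ),
    grouped AS (
        SELECT account,
               SUM(CASE WHEN balance_row = 1
                        THEN amount * (CASE WHEN rnk = 1 THEN -1 ELSE 1 END)
                        ELSE 0 END) AS balance_diff,
               SUM(CASE WHEN balance_row = 0 THEN amount ELSE 0 END) AS activity_sum
        FROM ranked
        GROUP BY account
    )
    SELECT account,
           CASE WHEN balance_diff = activity_sum THEN 1 ELSE 0 END AS balanced
    FROM grouped
    ORDER BY account
    """
).fetchall()
con.close()

print(checks)
```

Both sample accounts reconcile, so each gets a flag of 1; an account whose activity did not add up to the balance difference would get 0.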
Untested code, but should be really close for comparing the summed balance with the balance_row as you've defined in your question.
SELECT
Account, /* Account Number */
(select sum(B.amount) from yourview B
where B.balance_row = 0 and
B.account = A.account and
B.sort_order BETWEEN
(select max(sort_order) /* previous balance row on account */
from yourview C where
C.balance_row = 1 and
C.account = A.account and
C.sort_order < A.sort_order)
and A.sort_order /* BETWEEN needs the lower bound first */
) AS Test_Balance, /* Test_Balance = sum of amounts since last balance row */
Balance_Row /* Value of balance row */
FROM yourview A
WHERE A.Balance_Row = 1