PostgreSQL select statement to return rows after where condition

I am working on a query to return the next 7 days' worth of data every time an event happens, indicated by "where event = 1". The goal is then to group all the data by user id and perform aggregate functions on it after the event happens; the event is encoded as binary [0, 1].
So far I have been attempting to use nested select statements to structure the data the way I would like, but the window functions are starting to restrict me. I am now thinking a self join could be more appropriate, but I need help constructing such a query.
The query currently first creates daily aggregate values grouped by user and date (3rd-level nested select). The 2nd level then sums "value_x" to obtain an aggregate value grouped by user. The 1st-level nested select uses the lead function, partitioned by user and ordered by day, to grab the next row's value, which acts as selecting the next day's value when event = 1. Lastly, the outer select uses an aggregate function to calculate the average "sum_next_day_value_after_event" grouped by user and where event = 1. Put together, where event = 1, the query returns the avg(value_x) of the next row's total value_x.
However, this doesn't follow my time rule: where event = 1, return the next 7 days' worth of data after the event happens, and if there aren't 7 days' worth of data, return whatever data is <= 7 days out. Yes, I currently have only one lead with an offset of 1, but I could add six more of these calls to grab the next 6 rows. The problem is that lead grabs the next row without regard to date, so the next row's "value_x" could actually be 15 days after "event = 1". Also, as can be seen in the data table below, a user may have more than one row per day.
Here is the query I have so far:
select
    f.user_id,
    avg(f.sum_next_day_value_after_event) as sum_next_day_values
from (
    select
        bld.user_id,
        lead(bld.sum_daily_value_x, 1) over (partition by bld.user_id order by bld.daily) as sum_next_day_value_after_event
    from (
        select
            l.user_id,
            l.daily,
            sum(l.value_x) as sum_daily_value_x
        from (
            select
                user_id, value_x, date_part('day', day_ts) as daily
            from table_1
            group by date_part('day', day_ts), user_id, value_x) l
        group by l.user_id, l.daily
        order by l.user_id) bld) f
group by f.user_id
Below is a snippet of the data from table_1:
user_id  day_ts         value_x  event
50       4/2/21 07:37   25       0
50       4/2/21 07:42   45       0
50       4/2/21 09:14   67       1
50       4/5/21 10:09   8        0
50       4/5/21 10:24   75       0
50       4/8/21 11:08   34       0
50       4/15/21 13:09  32       1
50       4/16/21 14:23  12       0
50       4/29/21 14:34  90       0
55       4/4/21 15:31   12       0
55       4/5/21 15:23   34       0
55       4/17/21 18:58  32       1
55       4/17/21 19:00  66       1
55       4/18/21 19:57  54       0
55       4/23/21 20:02  34       0
55       4/29/21 20:39  57       0
55       4/30/21 21:46  43       0
Technical details:
PostgreSQL, supported by EDB, version = 14.1
pgAdmin4, version 5.7
Thanks for the help!

"The query currently first creates daily aggregate values"
I don't see any aggregate function in your first query, so the GROUP BY clause is useless.
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
group by date_part('day', day_ts), user_id, value_x
could be simplified as
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
which in turn provides no real added value, so this first query can be removed and the second query becomes:
select user_id
, date_part('day', day_ts) as daily
, sum(value_x) as sum_daily_value_x
from table_1
group by user_id, date_part('day', day_ts)
The order by user_id clause can also be removed at this step.
Now if you want to calculate the average value of sum_daily_value_x over the period of 7 days after the event (I'm referring to the avg() function in your top query), you can use avg() as a window function whose frame is restricted to the 7 days following the current row. Note that the frame needs a real calendar date to order by, so the query below groups by day_ts::date instead of date_part('day', day_ts), which only extracts the day of month and would collide across months:
select f.user_id
     , f.daily
     , avg(f.sum_daily_value_x) over (partition by f.user_id order by f.daily range between current row and interval '7 days' following) as sum_next_day_values
from (
    select user_id
         , day_ts::date as daily
         , sum(value_x) as sum_daily_value_x
    from table_1
    group by user_id, day_ts::date
) as f
The final group by f.user_id of your query has to go: the window function is evaluated per row, and a GROUP BY that doesn't cover f.daily and f.sum_daily_value_x would make the query invalid. Conversely, the partition by f.user_id clause in the window function is needed so that one user's frame never picks up another user's rows.
You can replace the avg() window function with any other aggregate, for instance sum(), which would better fit the alias sum_next_day_values.
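If you only want a result for the rows where event = 1, a self join (as the question suggests) may be the more direct route. Below is a minimal, untested sketch along those lines, assuming table_1 exactly as shown above and defining "next 7 days" as the 7 calendar days strictly after the event's day:
-- daily sums joined back to the event rows; events with no data in the
-- following 7 days simply produce no row
select e.user_id
     , e.day_ts as event_ts
     , avg(d.sum_daily_value_x) as avg_next_7_days
from table_1 e
join (
    select user_id, day_ts::date as daily, sum(value_x) as sum_daily_value_x
    from table_1
    group by user_id, day_ts::date
) d
  on d.user_id = e.user_id
 and d.daily > e.day_ts::date       -- strictly after the event's day
 and d.daily <= e.day_ts::date + 7  -- up to 7 days out
where e.event = 1
group by e.user_id, e.day_ts;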

I need to find the number of users that were invoiced for an amount greater than 0 in the previous month and were not invoiced in the current month

I need to find the number of users that were invoiced for an amount greater than 0 in the previous month and were not invoiced in the current month. This calculation is to be done for 12 months in a single query. Output should be as below.
Month Count
01/07/2019 50
01/08/2019 34
01/09/2019 23
01/10/2019 98
01/11/2019 10
01/12/2019 5
01/01/2020 32
01/02/2020 65
01/03/2020 23
01/04/2020 12
01/05/2020 64
01/06/2020 54
01/07/2020 78
I am able to get the value only for one month. I want to get it for all months in a single query.
This is my current query:
SELECT COUNT(DISTINCT TWO_MONTHS_AGO.USER_ID), TWO_MONTHS_AGO.MONTH AS INVOICE_MONTH
FROM (
    SELECT USER_ID, LAST_DAY(invoice_ct_dt) AS MONTH
    FROM table a AS ID
    WHERE invoice_amt > 0
    AND LAST_DAY(invoice_ct_dt) = ADD_MONTHS(LAST_DAY(CURRENT_DATE - 1), - 2)
    GROUP BY user_id
) AS TWO_MONTHS_AGO
LEFT JOIN (
    SELECT user_id, LAST_DAY(invoice_ct_dt) AS MONTH
    FROM table a AS ID
    WHERE LAST_DAY(invoice_ct_dt) = ADD_MONTHS(LAST_DAY(CURRENT_DATE - 1), - 1)
    GROUP BY USER_ID
) AS ONE_MONTH_AGO ON TWO_MONTHS_AGO.USER_ID = ONE_MONTH_AGO.USER_ID
WHERE ONE_MONTH_AGO.USER_ID IS NULL
GROUP BY INVOICE_MONTH;
Thank you in advance.
Lona
Probably lots of different approaches, but the way I would do it is as follows:
Summarise data by user and month for the last 13 months (you need 12 months plus the month before the first of those 12).
Compare "this" month (that has data) to "next" month and select records where there is no "next" month data.
Summarise this dataset by month and distinct userid.
For example, assuming a table created as follows:
create table INVOICE_DATA (
USERID varchar(4),
INVOICE_DT date,
INVOICE_AMT NUMBER(10,2)
);
the following query should give you what you want - you may need to adjust it depending on whether you are including this month, or only up to the end of last month, in your calculation, etc.:
--Summarise data by user and month
WITH MONTH_SUMMARY AS
(
SELECT USERID
,TO_CHAR(INVOICE_DT,'YYYY-MM') "INVOICE_MONTH"
,TO_CHAR(ADD_MONTHS(INVOICE_DT,1),'YYYY-MM') "NEXT_MONTH"
,SUM(INVOICE_AMT) "MONTHLY_TOTAL"
FROM INVOICE_DATA
WHERE INVOICE_DT >= TRUNC(ADD_MONTHS(current_date(),-13),'MONTH') -- Last 13 months of data
GROUP BY 1,2,3
),
--Get data for users with invoices in this month but not the next month
USER_DATA AS
(
SELECT USERID, INVOICE_MONTH, MONTHLY_TOTAL
FROM MONTH_SUMMARY MS_THIS
WHERE NOT EXISTS
(
SELECT USERID
FROM MONTH_SUMMARY MS_NEXT
WHERE
MS_THIS.USERID = MS_NEXT.USERID AND
MS_THIS.NEXT_MONTH = MS_NEXT.INVOICE_MONTH
)
AND MS_THIS.INVOICE_MONTH < TO_CHAR(current_date(),'YYYY-MM') -- Don't include this month as obviously no next month to compare to
)
SELECT INVOICE_MONTH, COUNT(DISTINCT USERID) "USER_COUNT"
FROM USER_DATA
GROUP BY INVOICE_MONTH
ORDER BY INVOICE_MONTH
;
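If you need the same idea in PostgreSQL, here is a rough, untested translation, with date_trunc and interval arithmetic standing in for the ADD_MONTHS/TRUNC calls above (assuming the INVOICE_DATA table created above, with NUMERIC in place of NUMBER):
--Same three steps, PostgreSQL flavour
WITH MONTH_SUMMARY AS
(
    SELECT USERID
          ,TO_CHAR(INVOICE_DT,'YYYY-MM') AS INVOICE_MONTH
          ,TO_CHAR(INVOICE_DT + interval '1 month','YYYY-MM') AS NEXT_MONTH
          ,SUM(INVOICE_AMT) AS MONTHLY_TOTAL
    FROM INVOICE_DATA
    WHERE INVOICE_DT >= date_trunc('month', current_date) - interval '13 months' -- Last 13 months of data
    GROUP BY 1,2,3
)
SELECT INVOICE_MONTH, COUNT(DISTINCT USERID) AS USER_COUNT
FROM MONTH_SUMMARY MS_THIS
WHERE NOT EXISTS
(
    SELECT 1
    FROM MONTH_SUMMARY MS_NEXT
    WHERE MS_THIS.USERID = MS_NEXT.USERID
      AND MS_THIS.NEXT_MONTH = MS_NEXT.INVOICE_MONTH
)
AND MS_THIS.INVOICE_MONTH < TO_CHAR(current_date,'YYYY-MM') -- Exclude this month, no next month to compare to
GROUP BY INVOICE_MONTH
ORDER BY INVOICE_MONTH;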

Recursively dividing a list of dates into groups

I have a list of start dates as below -
[screenshot: start dates sorted in descending order]
The start dates are always sorted in descending order.
I am looking for a PostgreSQL query that can give the following output -
[screenshot: start dates with groups]
Basically I am trying to create groups of dates from the given list such that each date in a group is within 61 days of the date at the top of the corresponding group.
For example, in the output:
Group 1 has the first 4 records because all 4 start dates are within 61 days of record no. 2.
Group 2 contains only record no. 6 since it is more than 61 days away from record no. 2.
Group 3 contains rows no. 7 and 8 since they are more than 61 days away from record no. 6, but within 61 days of each other.
P.S. I am new to PostgreSQL and Stack Overflow.
Any pointers will be helpful.
Your sample data does not match your sample output.
Your calculations in your sample output are wrong since this counts backwards and March and October both have 31 days.
To recurse properly you need to assign row numbers, using row_number():
with recursive num as (
select row_number() over (order by start_date desc) as rn,
start_date
from dateslist
),
Then you create groups and find gaps by carrying anchor values forward as you recurse. Since you have the start_date information you can calculate the offset within groups at the same time:
find_gaps as (
    select rn as anchor, start_date as anchor_date, rn, start_date, 0 as group_offset
    from num
    where rn = 1
    union all
    select case
               when f.anchor_date - n.start_date > 61 then n.rn  -- gap found: start a new anchor
               else f.anchor                                     -- otherwise carry the anchor forward
           end,
           case
               when f.anchor_date - n.start_date > 61 then n.start_date
               else f.anchor_date
           end,
           n.rn, n.start_date,
           case
               when f.anchor_date - n.start_date > 61 then n.start_date
               else f.anchor_date
           end - n.start_date                                    -- offset within the group
    from find_gaps f
    join num n on n.rn = f.rn + 1
)
The final query selects the columns you want for the output and applies a group number.
select start_date,
dense_rank() over (order by anchor) as group_number,
group_offset
from find_gaps
order by start_date desc;
Working Fiddle Demo
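For convenience, the fragments above chain together into one statement (assuming a dateslist table with a start_date date column, as in the demo):
with recursive num as (
    select row_number() over (order by start_date desc) as rn,
           start_date
    from dateslist
), find_gaps as (
    select rn as anchor, start_date as anchor_date, rn, start_date, 0 as group_offset
    from num
    where rn = 1
    union all
    select case when f.anchor_date - n.start_date > 61 then n.rn else f.anchor end,
           case when f.anchor_date - n.start_date > 61 then n.start_date else f.anchor_date end,
           n.rn,
           n.start_date,
           case when f.anchor_date - n.start_date > 61 then n.start_date else f.anchor_date end - n.start_date
    from find_gaps f
    join num n on n.rn = f.rn + 1
)
select start_date,
       dense_rank() over (order by anchor) as group_number,
       group_offset
from find_gaps
order by start_date desc;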

SQL - how to sum groups of 15 rows and find the max sum

The purpose of this question is to optimize some SQL by using set-based operations vs iterative (looping, like I'm doing below):
Some Explanation -
I have this cte that is inserted into a temp table #dataForPeak. Each row represents a minute and a respective value retrieved.
For every row, my code uses a while loop to add 15 rows at a time (the current row + the next 14 rows). These sums are inserted into another temp table #PeakDemandIntervals, which is my workaround for then finding the max sum of these groups of 15.
That max sum is my end goal. My code achieves it, but takes about 12 seconds for 26k rows. I'll be looking at much more data, so I know this is not fast enough for my use case.
My question is,
can anyone help me find a fast alternative to this loop?
It can include more tables, CTEs, nested queries, whatever. The while loop might not even be the issue, it's probably the inner code.
insert into #dataForPeak
select timestamp, value
from cte
order by timestamp;

while @@ROWCOUNT <> 0
begin
    declare @timestamp datetime = (select top 1 timestamp from #dataForPeak);
    insert into #PeakDemandIntervals
    select @timestamp, sum(interval.value) as peak
    from (select * from #dataForPeak base
          where base.timestamp >= @timestamp
            and base.timestamp <= DATEADD(minute, 14, @timestamp)
         ) interval;
    delete from #dataForPeak where timestamp = @timestamp;
end
select max(peak)
from #PeakDemandIntervals;
Edit
Here's an example of my goal, using groups of 3min instead of 15min.
Given the data:
Time | Value
1:50 | 2
1:51 | 4
1:52 | 6
1:53 | 8
1:54 | 6
1:55 | 4
1:56 | 2
the max sum (peak) I'm looking for is 20, because the group
1:52 | 6
1:53 | 8
1:54 | 6
has the highest sum.
Let me know if I need to clarify more than that.
Based on the example given, it seems like you are trying to get the maximum value of a rolling sum. You can calculate the 15-minute rolling sum very easily as follows:
SELECT [Time]
,[Value]
,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 14 PRECEDING) [RollingSum]
FROM #dataForPeak
Note the key here is the ROWS 14 PRECEDING clause. It effectively states that SQL Server should sum the preceding 14 records together with the current record, which gives you your 15-minute interval.
Now you can simply take the max of the rolling sum. The full query looks as follows:
;WITH CTE_RollingSum
AS
(
SELECT [Time]
,[Value]
,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 14 PRECEDING) [RollingSum]
FROM #dataForPeak
)
SELECT MAX([RollingSum]) AS Peak
FROM CTE_RollingSum
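As a quick sanity check against the 3-minute example in the question, the same pattern with ROWS 2 PRECEDING produces rolling sums 2, 6, 12, 18, 20, 18, 12, and taking the max returns the expected peak of 20:
-- inline sample data from the question; times kept as strings for brevity
;WITH SampleData AS
(
    SELECT * FROM (VALUES
        ('1:50', 2), ('1:51', 4), ('1:52', 6), ('1:53', 8),
        ('1:54', 6), ('1:55', 4), ('1:56', 2)
    ) v([Time], [Value])
),
CTE_RollingSum AS
(
    SELECT [Time]
          ,[Value]
          ,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 2 PRECEDING) [RollingSum]
    FROM SampleData
)
SELECT MAX([RollingSum]) AS Peak -- returns 20
FROM CTE_RollingSum;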

Running Count Total with PostgreSQL

I'm fairly close to a solution, but I just need a little help getting over the finish line.
I'm trying to get a running count of the occurrences of client_ids regardless of the date; however, I need the dates and ids to still appear in my results to verify everything.
I found part of the solution here but have not been able to modify it enough for my needs.
Here is what the answer should be, counting the occurrences of the client_ids sequentially:
id client_id deliver_on running_total
1 138 2017-10-01 1
2 29 2017-10-01 1
3 138 2017-10-01 2
4 29 2013-10-02 2
5 29 2013-10-02 3
6 29 2013-10-03 4
7 138 2013-10-03 3
However, here is what I'm getting:
id client_id deliver_on running_total
1 138 2017-10-01 1
2 29 2017-10-01 1
3 138 2017-10-01 1
4 29 2013-10-02 3
5 29 2013-10-02 3
6 29 2013-10-03 1
7 138 2013-10-03 2
Rather than counting the times the client_id appears sequentially, the code counts the times the id appears in the previous date range.
Here is my code and any help would be greatly appreciated.
Thank you,
SELECT n.id, n.client_id, n.deliver_on, COUNT(n.client_id) AS "running_total"
FROM orders n
LEFT JOIN orders o
ON (o.client_id = n.client_id
AND n.deliver_on > o.deliver_on)
GROUP BY n.id, n.deliver_on, n.client_id
ORDER BY n.deliver_on ASC
* EDIT WITH ANSWER *
I ended up solving my own question. Here is the solution with comments:
-- Set "1" for counting to be used later
WITH DATA AS (
SELECT
orders.id,
orders.client_id,
orders.deliver_on,
COUNT(1) -- Creates a column of "1" for counting the occurrences
FROM orders
GROUP BY 1
ORDER BY deliver_on, client_id
)
SELECT
id,
client_id,
deliver_on,
SUM(COUNT) OVER (PARTITION BY client_id
ORDER BY client_id, deliver_on
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- Counts the sequential client_ids based on the number of times they appear
FROM DATA
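For what it's worth, a shorter equivalent sketch (assuming id is the primary key of orders) skips the CTE entirely and lets a running count(*) window do the counting:
SELECT id,
       client_id,
       deliver_on,
       COUNT(*) OVER (PARTITION BY client_id
                      ORDER BY deliver_on, id) AS running_total -- running occurrence count per client
FROM orders
ORDER BY deliver_on, id;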

SELECT record based upon dates

Assuming data such as the following:
ID EffDate Rate
1 12/12/2011 100
1 01/01/2012 110
1 02/01/2012 120
2 01/01/2012 40
2 02/01/2012 50
3 01/01/2012 25
3 03/01/2012 30
3 05/01/2012 35
How would I find the rate for ID 2 as of 1/15/2012?
Or, the rate for ID 1 for 1/15/2012?
In other words, how do I write a query that finds the correct rate when the date falls between the EffDates of two records? (The rate should be for the date prior to the selected date.)
Thanks,
John
How about this:
SELECT Rate
FROM Table1
WHERE ID = 1 AND EffDate = (
SELECT MAX(EffDate)
FROM Table1
WHERE ID = 1 AND EffDate <= '2012-01-15');
Here's an SQL Fiddle to play with. I assume here that the 'ID/EffDate' pair is unique across the whole table (at least the opposite doesn't make sense).
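Plugging in the other example from the question, ID 2 as of 2012-01-15, the same pattern picks the 2012-01-01 row and returns 40:
SELECT Rate
FROM Table1
WHERE ID = 2 AND EffDate = (
    SELECT MAX(EffDate) -- latest EffDate on or before 2012-01-15 is 2012-01-01
    FROM Table1
    WHERE ID = 2 AND EffDate <= '2012-01-15'); -- returns 40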
SELECT TOP 1 Rate FROM the_table
WHERE ID=whatever AND EffDate <='whatever'
ORDER BY EffDate DESC
if I read you right.
(Edited to suit my idea of MS SQL, which I have no idea about.)