How to aggregate/partition window data by dynamic group? - postgresql

A question like this may have already been asked & answered, but I'm having trouble finding anything (it's tough to know what exactly to search for / how to phrase this).
If I have a table of values by date:
select *
from (values
(date '2018-05-11', 'lorem'),
(date '2018-05-10', 'ipsum'),
(date '2018-05-07', 'dolor'),
(date '2018-05-05', 'hello'),
(date '2018-05-04', 'world'),
(date '2018-04-30', 'foo'),
(date '2018-04-15', 'bar')
) as v(date, name)
order by date desc
How can I aggregate the values by date groups (e.g. "5 days") — grouping dynamically by the first value onwards (e.g. May 11-7, 6-1, Apr 30-26, etc.), not statically (e.g. modulo 5 days)?
Desired result:
min_date | max_date | names
-----------+------------+--------------------
2018-05-07 | 2018-05-11 | lorem, ipsum, dolor
2018-05-04 | 2018-05-05 | hello, world
2018-04-30 | 2018-04-30 | foo
2018-04-15 | 2018-04-15 | bar
————
I believe I first need to derive the max date to group each row under, which would be, e.g., 2018-05-11, 2018-05-05, etc.
I've tried two conceptual approaches for doing that, but neither works.
———
The first approach is to build up this rolling max date, but this isn't valid (column "groupbydate" does not exist):
select *,
case
when date > (lag(groupByDate) over w) - interval '5 days' then (lag(groupByDate) over w)
else date
end as groupByDate
from input
window w as (order by date desc)
————
The second approach is to "find" the max/"group by" date for each row, but I'm not sure how to differentiate the current table row's date from the current window row's `date`:
select *,
max(date) filter (where date < input.date + interval '5 days') over w
from input
window w as (order by date desc)
I think I can implement the second approach using a subquery, but I'm curious: is it possible to achieve this using window functions? Thank you!
EDIT: Actually, the second approach is wrong! It can find a different "group by" date for different dates that should be in the same group.
Here's how I achieved this with a subquery:
select date, name, (
select max(i2.date)
from input as i2
where i2.date < input.date + interval '5 days'
) as date_group
from input
And plugging into this outer query gets me my desired results:
select min_date, max_date, names
from (
select date_group, min(date) as min_date, max(date) as max_date, string_agg(name, ', ') as names
from groups -- results of above query, e.g. using CTE
group by date_group
order by date_group desc
) as x
Still curious if there's a way to do this with windowing functions. Thanks!
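It turns out this anchored grouping can't be expressed in a single window-function pass: each group's boundary depends on where the previous group ended, which is exactly the kind of row-to-row dependency window functions can't carry. A recursive CTE can, though. Here is a minimal sketch against the sample data, assuming it is available as input(date, name) and walking the rows newest-first (the names ordered, grouped, and anchor are made up for illustration):
with recursive ordered as (
    select date, name, row_number() over (order by date desc) as rn
    from input
), grouped as (
    -- the newest row anchors the first group
    select rn, date, name, date as anchor
    from ordered
    where rn = 1
    union all
    -- each older row either falls inside the current 5-day window or starts a new anchor
    select o.rn, o.date, o.name,
           case when o.date > g.anchor - interval '5 days' then g.anchor else o.date end
    from grouped g
    join ordered o on o.rn = g.rn + 1
)
select min(date) as min_date, max(date) as max_date,
       string_agg(name, ', ' order by date desc) as names
from grouped
group by anchor
order by anchor desc;
On the sample data this should reproduce the desired result above.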

Related

extract days of daterange grouped by month postgresql

I have a pickupDate and returnDate in my OrderHistory table. I want to extract the sum of rental days of all OrderHistory entries, grouped/ordered by month. A CTE seems to be the solution, but I don't get how to implement it in my query, since the CTEs I saw were referring to themselves where it says "FROM cte".
I tried something like this:
SELECT
SUM((EXTRACT (DAY FROM("OrderHistory"."returnDate")-("OrderHistory"."pickupDate")))) as traveltime
, to_char("OrderHistory"."pickupDate"::date, 'YYYY-MM') as M
FROM
"OrderHistory"
GROUP BY
M
ORDER BY
M
But the outcome doesn't split bookings between two months (e.g. pickupDate = 27th March 2022 and returnDate = 3rd April 2022); it assigns the whole 7 days to the month of March, since the pickupDate falls in it. It should show 4 days in March and 3 in April.
Sorry for the probably very stupid question, but I am a beginner. (My code is written for PostgreSQL, btw.)
Per PostgreSQL naming conventions (see also: Are PostgreSQL column names case-sensitive?), use legal, lower-case names exclusively so double-quoting is not needed.
Final result in db fiddle.
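For example, this is the renaming that the snake_case identifiers below assume:
alter table "OrderHistory" rename to order_history;
alter table order_history rename column "pickupDate" to pickup_date;
alter table order_history rename column "returnDate" to return_date;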
Add daterange column.
alter table order_history add column date_ranges daterange;
with a(m_begin, m_end, pickup_date) as
(select date_trunc('month', pickup_date)::date,
(date_trunc('month', pickup_date) + interval '1 month - 1 day')::date,
pickup_date from order_history)
update order_history set date_ranges =
daterange(a.m_begin, a.m_end, '[]') from a
where a.pickup_date = order_history.pickup_date;
Then the final query:
with a as (
select
pickup_date,
return_date,
return_date - pickup_date as total,
case when return_date <@ date_ranges then (return_date - pickup_date)
else (date_trunc('month', pickup_date) + interval '1 month - 1 day')::date - pickup_date
end as partial_mth
from order_history),
b as (select *, a.total - partial_mth as partial_not_mth from a)
select *,
case when to_char(pickup_date,'YYYY-MM') = to_char(return_date,'YYYY-MM')
then
sum(partial_mth) over (partition by to_char(pickup_date,'YYYY-MM')) +
sum(partial_not_mth) over (partition by to_char(return_date,'YYYY-MM'))
else sum(partial_mth) over (partition by to_char(pickup_date,'YYYY-MM'))
end
from b;
After trying different things I think I found the best answer to my question, which I want to share with the community:
WITH hier as (
SELECT
"OrderHistory"."pickupDate" as start_date
, "OrderHistory"."returnDate" as end_date
, to_char("OrderHistory"."pickupDate"::date, 'YYYY-MM') as M
FROM
"OrderHistory"
GROUP BY
1, 2, 3
ORDER BY
3
), calendar as (
select date '2022-01-01' + (n || ' days')::interval calendar_date
from generate_series(0, 365) n
)
select
to_char(calendar_date::date, 'YYYY-MM') as m
, count(*) as tage_gebucht
from calendar
inner join hier on calendar.calendar_date between start_date and end_date
where calendar_date between '2022-01-01' and '2022-12-31'
group by 1
order by 1;
I think this is the simplest solution I came up with.
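For what it's worth, a third option avoids materializing one row per day by intersecting each booking with each month's daterange. This is only a sketch, reusing the snake_case names from the first answer; it counts the half-open interval (pickup_date, return_date], which matches the 4 + 3 split described in the question:
select to_char(month_start, 'YYYY-MM') as m,
       sum(upper(r) - lower(r)) as days_booked
from order_history o
cross join lateral generate_series(
       date_trunc('month', o.pickup_date),
       date_trunc('month', o.return_date),
       interval '1 month') as month_start
cross join lateral (
    -- * is the range-intersection operator; '(]' excludes the pickup day itself
    select daterange(o.pickup_date, o.return_date, '(]')
         * daterange(month_start::date, (month_start + interval '1 month')::date, '[)')
) as x(r)
where not isempty(r)
group by 1
order by 1;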

Using 'over' function results in column "table.id" must appear in the GROUP BY clause or be used in an aggregate function

I'm writing an application which shows the growth of the total number of events in my table over time. I currently have the following query to do this:
query = session.query(
count(Event.id).label('count'),
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month')
).filter(
Event.date.isnot(None)
).group_by('year', 'month').all()
This results in the following output:
Count | Year | Month
------+------+------
  100 | 2021 |     1
   50 | 2021 |     2
   75 | 2021 |     3
While this is okay on its own, I want it to display the total number over time, not just the number of events that month, so the desired output should be:
Count | Year | Month
------+------+------
  100 | 2021 |     1
  150 | 2021 |     2
  225 | 2021 |     3
I read in various places that I should use a window function via SQLAlchemy's over function; however, I can't seem to wrap my head around it, and every time I try using it I get the following error:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.GroupingError) column "event.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT count(event.id) OVER (PARTITION BY event.date ORDER...
^
[SQL: SELECT count(event.id) OVER (PARTITION BY event.date ORDER BY EXTRACT(year FROM event.date), EXTRACT(month FROM event.date)) AS count, EXTRACT(year FROM event.date) AS year, EXTRACT(month FROM event.date) AS month
FROM event
WHERE event.date IS NOT NULL GROUP BY year, month]
This is the query I used:
session.query(
count(Event.id).over(
order_by=(
extract('year', Event.date),
extract('month', Event.date)
),
partition_by=Event.date
).label('count'),
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month')
).filter(
Event.date.isnot(None)
).group_by('year', 'month').all()
Could someone show me what I'm doing wrong? I've been searching for hours but can't figure out how to get the desired output, since adding event.id to the GROUP BY would stop my rows from being grouped by month and year.
The final query I ended up using:
query = session.query(
extract('year', Event.date).label('year'),
extract('month', Event.date).label('month'),
func.sum(func.count(Event.id)).over(order_by=(
extract('year', Event.date),
extract('month', Event.date)
)).label('count'),
).filter(
Event.date.isnot(None)
).group_by('year', 'month')
I'm not 100% sure what you want, but I'm assuming you want, for each month, the running total of events up to that month. You first need to calculate the number of events per month and then sum those counts with a PostgreSQL window function.
You can do that in a single SELECT statement:
SELECT extract(year FROM events.date) AS year
, extract(month FROM events.date) AS month
, SUM(COUNT(events.id)) OVER(ORDER BY extract(year FROM events.date), extract(month FROM events.date)) AS total_so_far
FROM events
GROUP BY 1,2
but it might be easier to think about if you split it into two:
SELECT year, month, SUM(events_count) OVER(ORDER BY year, month)
FROM (
SELECT extract(year FROM events.date) AS year
, extract(month FROM events.date) AS month
, COUNT(events.id) AS events_count
FROM events
GROUP BY 1,2
) AS monthly
but I'm not sure how to do that in SQLAlchemy (the final query shown in the question is essentially this single-statement form).

Generate missing data and fill it down - postgresql

I have the dataset:
The problem is that records are added only if an event happened, e.g. for the row with id 13897, the record was updated on 4/18/2020 and then on 5/1/2020, when the status was changed. What I need is the status of each record at the end of every month.
I was thinking about the below logic:
1. generate the series of dates from the min(date) till now - T1
2. get distinct id from the dataset - T2
3. do a cross join between the two tables above so that we get a new row for every row in the second table - T3
4. extract the dataset with all required fields - T4
5. merge T3 and T4 by concatenate(date and id) - T5
6. sort T5 by id and d asc - T5
7. fill down all the fields grouped by id - T5
8. generate the series of dates from min(date) till now with an interval of one month and get the last day of each month - T6
9. merge T5 and T6 by date - right join so that we get only rows with date = end of month
I am on step 6.
SELECT *
FROM (SELECT d, concat(dt, t2.id) AS cnct
      FROM (SELECT d, d::date AS dt
            FROM generate_series(
                   (SELECT min(created_at::date) FROM new_table),
                   current_date,
                   interval '1 day') d) t1
      CROSS JOIN
            (SELECT DISTINCT id FROM new_table) t2) t3
-- in case a record with the same id was updated several times throughout the day
LEFT JOIN (WITH cte AS (
             SELECT id, status,
                    created_at AT TIME ZONE 'eat' AT TIME ZONE 'utc' AS created_at,
                    updated_at::date AS date,
                    row_number() OVER (PARTITION BY id, updated_at::date
                                       ORDER BY updated_at DESC) AS rn
             FROM new_table)
           SELECT cte.*, concat(cte.date, cte.id) AS cnct
           FROM cte
           WHERE rn = 1) t4
ON t3.cnct = t4.cnct
I am stuck on step 7. I found fill column with last value from partition in postgresql but it is not what I need. I envision that I need to sort by a date block, i.e. dates from min date to now for one id (13894) are considered block 1, and dates from min date to now for another id (13897) are considered block 2. The next step, I thought, is to fill down all fields per block.
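For reference, the usual fill-down trick ("gaps and islands") uses a running count of non-null values as the block key. A minimal sketch, assuming the merged result of step 6 is available as t5(id, d, status) with status NULL on the generated rows:
select id, d,
       -- each block contains exactly one real row, so max() fills its status down
       max(status) over (partition by id, grp) as status_filled
from (
    select id, d, status,
           -- count(status) ignores NULLs, so it only increments on real rows
           count(status) over (partition by id order by d) as grp
    from t5
) s;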
And another question, how do you deal with the event-based data to adapt it for the time-series?
You can use PostgreSQL's DISTINCT ON feature to do this. We'll generate a series with the start of every month (you'll need to supply start and end dates here) and put the ID and the date into the DISTINCT ON so that we get only one row of new_table for each distinct ID and month pair. Then we simply filter and order to ensure that the row we're getting for each ID and month is the latest row whose date is before the new month.
SELECT DISTINCT ON (new_table.id, month_start) *
FROM new_table, generate_series(start_date, end_date, interval '1 month') month_start
WHERE new_table.date < month_start
ORDER BY new_table.id, month_start ASC, new_table.date DESC;
(If you need your results to have the last day of the month and not the first day of the next month, you can just subtract 1 day from month_start in your select clause.)
EDIT: Running on the data you supplied, I get this:
SELECT DISTINCT ON (new_table.id, month_start) new_table.id, month_start - interval '1 day' as month_end, new_table.status
FROM new_table, generate_series('2020-05-01', '2020-06-01', interval '1 month') month_start
WHERE new_table.date < month_start
ORDER BY new_table.id, month_start ASC, new_table.date DESC;
id | month_end | status
-------+------------------------+--------
13894 | 2020-04-30 00:00:00-07 | 5
13894 | 2020-05-31 00:00:00-07 | 5
13897 | 2020-04-30 00:00:00-07 | 2
13897 | 2020-05-31 00:00:00-07 | 5
(4 rows)

How to form a dynamic pivot table or return multiple values from GROUP BY subquery

I'm having some major issues with the following query formation:
I have projects with start and end dates
Name Start End
---------------------------------------
Project 1 2020-08-01 2020-09-10
Project 2 2020-01-01 2025-01-01
and I'm trying to count the monthly working days within each project with the following subquery
select date_trunc('month', days) as d_month, count(days) as d_count
from generate_series(greatest('2020-08-01'::date, p.start), least('2020-09-14'::date, p."end"), '1 day'::interval) days
where extract(DOW from days) not IN (0, 6)
group by d_month
where p.start is from the aliased main query and the dates are hard-coded for now. This correctly gives me the following result:
{"d_month"=>2020-08-01 00:00:00 +0000, "d_count"=>21}
{"d_month"=>2020-09-01 00:00:00 +0000, "d_count"=>10}
However, a subquery in the select list can only return a single value. The date range for the query is dynamic, so I would either need to somehow return the query as:
Name Start End 2020-08-01 2020-09-01 ...
-------------------------------------------------------------------------
Project 1 2020-08-01 2020-09-10 21 8
Project 2 2020-01-01 2025-01-01 21 10
Or simply return the whole subquery as JSON, but I can't seem to get that working either.
Any idea on how to achieve this or whether there are simpler solutions for this?
The most correct solution would be to create an actual calendar table that holds every possible day of interest to your business and, at a minimum for your purpose here, marks work days.
Ideally you would have columns to hold fiscal quarters, periods, and weeks to match your industry. You would also mark holidays. Joining to this table makes these kinds of calculations a snap.
create table calendar (
ddate date not null primary key,
is_work_day boolean default true
);
insert into calendar
select ts::date as ddate,
extract(dow from ts) not in (0,6) as is_work_day
from generate_series(
'2000-01-01'::timestamp,
'2099-12-31'::timestamp,
interval '1 day'
) as gs(ts);
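To illustrate that "snap": once the calendar table exists, the monthly working-day count is a plain join and group-by (my_projects, start, and "end" are the same names used in the query below):
select p.name,
       date_trunc('month', c.ddate) as month_start,
       count(*) as work_days
from my_projects p
join calendar c on c.ddate between p.start and p."end"
where c.is_work_day
group by p.name, month_start
order by p.name, month_start;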
Assuming a calendar table is not within scope, you can do this:
with bounds as (
select min(start) as first_start, max("end") as last_end
from my_projects
), cal as (
select ts::date as ddate,
extract(dow from ts) not in (0,6) as is_work_day
from bounds
cross join generate_series(
first_start,
last_end,
interval '1 day'
) as gs(ts)
), bymonth as (
select p.name, p.start, p.end,
date_trunc('month', c.ddate) as month_start,
count(*) as work_days
from my_projects p
join cal c on c.ddate between p.start and p.end
where c.is_work_day
group by p.name, p.start, p.end, month_start
)
select jsonb_object_agg(to_char(month_start, 'YYYY-MM-DD'), work_days)
|| jsonb_object_agg('name', name)
|| jsonb_object_agg('start', start)
|| jsonb_object_agg('end', "end") as result
from bymonth
group by name;
Doing a pivot from rows to columns in SQL is usually a bad idea, so the query produces json for you.

Postgres find where dates are NOT overlapping between two tables

I have two tables and I am trying to find data gaps in them where the dates do not overlap.
Item Table:
id unique start_date end_date data
1 a 2019-01-01 2019-01-31 X
2 a 2019-02-01 2019-02-28 Y
3 b 2019-01-01 2019-06-30 Y
Plan Table:
id item_unique start_date end_date
1 a 2019-01-01 2019-01-10
2 a 2019-01-15 'infinity'
I am trying to find a way to produce the following
Missing:
item_unique from to
a 2019-01-11 2019-01-14
b 2019-01-01 2019-06-30
step-by-step demo: db<>fiddle
WITH excepts AS (
SELECT
item,
generate_series(start_date, end_date, interval '1 day') gs
FROM items
EXCEPT
SELECT
item,
generate_series(start_date, CASE WHEN end_date = 'infinity' THEN ( SELECT MAX(end_date) as max_date FROM items) ELSE end_date END, interval '1 day')
FROM plan
)
SELECT
item,
MIN(gs::date) AS start_date,
MAX(gs::date) AS end_date
FROM (
SELECT
*,
SUM(same_day) OVER (PARTITION BY item ORDER BY gs) AS sum
FROM (
SELECT
item,
gs,
COALESCE((gs - LAG(gs) OVER (PARTITION BY item ORDER BY gs) >= interval '2 days')::int, 0) as same_day
FROM excepts
) s
) s
GROUP BY item, sum
ORDER BY 1,2
Finding the missing days is quite simple; it happens inside the WITH clause:
Generate all days of each date range in the first table and subtract the expanded day list of the second table from it. All dates that do not occur in the second table are kept. The infinity end is a little tricky, so I replaced the infinity occurrence with the max date of the first table; this avoids expanding an infinite list of dates.
The more interesting part is reaggregating this list, which is the part outside the WITH clause:
The lag() window function fetches the previous date. If the current date is more than one day after the previous one, the expression yields 1, marking the start of a new interval. (A time-change issue occurred here: between 2019-03-31 and 2019-04-01 there are only 23 hours because of daylight saving time, which is why I check for a 2-day difference rather than a one-day difference.)
These 0 and 1 values are summed cumulatively; every gap greater than one day starts a new interval (the days in between are covered).
This yields a groupable column that can be used to aggregate and find the min and max date of each interval.
I tried something with date ranges, which seems to be a better way, especially for avoiding the expansion of long date lists, but I didn't come up with a proper solution. Maybe someone else will?
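Following up on that range idea: on PostgreSQL 14 or later it does work, using multiranges. Aggregate each table into a datemultirange with range_agg, subtract, and unnest the remaining gaps. A sketch, reusing the items/plan names from the query above and again capping 'infinity' at the max end_date of items:
SELECT item,
       lower(gap) AS "from",
       upper(gap) - 1 AS "to"  -- range uppers are exclusive, so step back one day
FROM (
    SELECT item, range_agg(daterange(start_date, end_date, '[]')) AS covered
    FROM items
    GROUP BY item
) i
LEFT JOIN (
    SELECT item,
           range_agg(daterange(start_date,
                     least(end_date, (SELECT max(end_date) FROM items)), '[]')) AS planned
    FROM plan
    GROUP BY item
) p USING (item)
CROSS JOIN LATERAL unnest(covered - coalesce(planned, '{}'::datemultirange)) AS gap;
On the sample data this should yield the two rows of the Missing table above.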