Postgresql: Create a date sequence, use it in date range query - postgresql

I'm not great with SQL but I have been making good progress on a project up to this point. Now I am completely stuck.
I'm trying to get a count for the number of apartments with each status. I want this information for each day so that I can trend it over time. I have data that looks like this:
table: y_unit_status
unit | date_occurred | start_date | end_date | status
1 | 2017-01-01 | 2017-01-01 | 2017-01-05 | Occupied No Notice
1 | 2017-01-06 | 2017-01-06 | 2017-01-31 | Occupied Notice
1 | 2017-02-01 | 2017-02-01 | | Vacant
2 | 2017-01-01 | 2017-01-01 | | Occupied No Notice
And I want to get output that looks like this:
date | occupied_no_notice | occupied_notice | vacant
2017-01-01 | 2 | 0 | 0
...
2017-01-10 | 1 | 1 | 0
...
2017-02-01 | 1 | 0 | 1
Or, this approach would work:
date | status | count
2017-01-01 | occupied no notice | 2
2017-01-01 | occupied notice | 0
date_occurred: Date when the status of the unit changed
start_date: Same as date_occurred
end_date: Date when status stopped being x and changed to y.
I am pulling in the number of bedrooms and a property id so the second approach of selecting counts for one status at a time would produce a relatively large number of rows vs. option 1 (if that matters).
I've found a lot of references that have gotten me close to what I'm looking for but I always end up with a sort of rolling, cumulative count.
Here's my query, which produces a column of dates and counts, which accumulate over time rather than reflecting a snapshot of counts for a particular day. You can see my references to another table where I'm pulling in a property id. The table schema is Property -> Unit -> Unit Status.
WITH t AS(
SELECT i::date from generate_series('2016-06-29', '2017-08-03', '1 day'::interval) i
)
SELECT t.i as date,
u.hproperty,
count(us.hmy) as count --us.hmy is the id
FROM t
LEFT OUTER JOIN y_unit_status us ON t.i BETWEEN us.dtstart AND
us.dtend
INNER JOIN y_unit u ON u.hmy = us.hunit -- to get property id
WHERE us.sstatus = 'Occupied No Notice'
AND t.i >= us.dtstart
AND t.i <= us.dtend
AND u.hproperty = '1'
GROUP BY t.i, u.hproperty
ORDER BY t.i
limit 1500
I also tried a FOR loop, iterating over the dates to determine cases where the date was between start and end but my logic wasn't working. Thanks for any insight!

You are on the right track, but you'll need to handle NULL values in end_date. If those means that status is assumed to be changed somewhere in the future (but not sure when it will change), the containment operators (#> and <#) for the daterange type are perfect for you (because ranges can be "unbounded"):
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d, status, count(unit)
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') #> date_from + d
group by 1, 2
To achieve the first variant, you can use conditional aggregation:
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d,
count(unit) filter (where status = 'Occupied No Notice') occupied_no_notice,
count(unit) filter (where status = 'Occupied Notice') occupied_notice,
count(unit) filter (where status = 'Vacant') vacant
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') #> date_from + d
group by 1
Notes:
The syntax filter (where <predicate>) is new to 9.4+. Before that, you can use CASE (and the fact that most aggregate functions does not include NULL values) to emulate it.
You can even index the expression daterange(start_date, end_date, '[]') (using gist) for better performance.
http://rextester.com/HWKDE34743

Related

Recursive CTE PostgreSQL Connecting Multiple IDs with Additional Logic for Other Fields

Within my PostgreSQL database, I have an id column that shows each unique lead that comes in. I also have a connected_lead_id column which shows whether accounts are related to each other (ie husband and wife, parents and children, group of friends, group of investors, etc).
When we count the number of ids created during a time period, we want to see the number of unique "groups" of connected_ids during a period. In other words, we wouldn't want to count both the husband and wife pair, we would only want to count one since they are truly one lead.
We want to be able to create a view that only has the "first" id based on the "created_at" date and then contains additional columns at the end for "connected_lead_id_1", "connected_lead_id_2", "connected_lead_id_3", etc.
We want to add in additional logic so that we take the "first" id's source, unless that is null, then take the "second" connected_lead_id's source unless that is null and so on. Finally, we want to take the earliest on_boarded_date from the connected_lead_id group.
id | created_at | connected_lead_id | on_boarded_date | source |
2 | 9/24/15 23:00 | 8 | |
4 | 9/25/15 23:00 | 7 | |event
7 | 9/26/15 23:00 | 4 | |
8 | 9/26/15 23:00 | 2 | |referral
11 | 9/26/15 23:00 | 336 | 7/1/17 |online
142 | 4/27/16 23:00 | 336 | |
336 | 7/4/16 23:00 | 11 | 9/20/18 |referral
End Goal:
id | created_at | on_boarded_date | source |
2 | 9/24/15 23:00 | | referral |
4 | 9/25/15 23:00 | | event |
11 | 9/26/15 23:00 | 7/1/17 | online |
Ideally, we would also have i number of extra columns at the end to show each connected_lead_id that is attached to the base id.
Thanks for the help!
Ok the best I can come up with at the moment is to first build maximal groups of related IDs, and then join back to your table of leads to get the rest of the data (See this SQL Fiddle for the setup, full queries and results).
To get the maximal groups you can use a recursive common table expression to first grow the groups, followed by a query to filter the CTE results down to just the maximal groups:
with recursive cte(grp) as (
select case when l.connected_lead_id is null then array[l.id]
else array[l.id, l.connected_lead_id]
end from leads l
union all
select grp || l.id
from leads l
join cte
on l.connected_lead_id = any(grp)
and not l.id = any(grp)
)
select * from cte c1
The CTE above outputs several similar groups as well as intermediary groups. The query predicate below prunes out the non maximal groups, and limits results to just one permutation of each possible group:
where not exists (select 1 from cte c2
where c1.grp && c2.grp
and ((not c1.grp #> c2.grp)
or (c2.grp < c1.grp
and c1.grp #> c2.grp
and c1.grp <# c2.grp)));
Results:
| grp |
|------------|
| 2,8 |
| 4,7 |
| 14 |
| 11,336,142 |
| 12,13 |
Next join the final query above back to your leads table and use window functions to get the remaining column values, along with the distinct operator to prune it down to the final result set:
with recursive cte(grp) as (
...
)
select distinct
first_value(l.id) over (partition by grp order by l.created_at) id
, first_value(l.created_at) over (partition by grp order by l.created_at) create_at
, first_value(l.on_boarded_date) over (partition by grp order by l.created_at) on_boarded_date
, first_value(l.source) over (partition by grp
order by case when l.source is null then 2 else 1 end
, l.created_at) source
, grp CONNECTED_IDS
from cte c1
join leads l
on l.id = any(grp)
where not exists (select 1 from cte c2
where c1.grp && c2.grp
and ((not c1.grp #> c2.grp)
or (c2.grp < c1.grp
and c1.grp #> c2.grp
and c1.grp <# c2.grp)));
Results:
| id | create_at | on_boarded_date | source | connected_ids |
|----|----------------------|-----------------|----------|---------------|
| 2 | 2015-09-24T23:00:00Z | (null) | referral | 2,8 |
| 4 | 2015-09-25T23:00:00Z | (null) | event | 4,7 |
| 11 | 2015-09-26T23:00:00Z | 2017-07-01 | online | 11,336,142 |
| 12 | 2015-09-26T23:00:00Z | 2017-07-01 | event | 12,13 |
| 14 | 2015-09-26T23:00:00Z | (null) | (null) | 14 |
demo:db<>fiddle
Main idea - sketch:
Looping through the ordered set. Get all ids, that haven't been seen before in any connected_lead_id (cli). These are your starting points for recursion.
The problem is your number 142 which hasn't been seen before but is in same group as 11 because of its cli. So it is would be better to get the clis of the unseen ids. With these values it's much simpler to calculate the ids of the groups later in the recursion part. Because of the loop a function/stored procedure is necessary.
The recursion part: First step is to get the ids of the starting clis. Calculating the first referring id by using the created_at timestamp. After that a simple tree recursion over the clis can be done.
1. The function:
CREATE OR REPLACE FUNCTION filter_groups() RETURNS int[] AS $$
DECLARE
_seen_values int[];
_new_values int[];
_temprow record;
BEGIN
FOR _temprow IN
-- 1:
SELECT array_agg(id ORDER BY created_at) as ids, connected_lead_id FROM groups GROUP BY connected_lead_id ORDER BY MIN(created_at)
LOOP
-- 2:
IF array_length(_seen_values, 1) IS NULL
OR (_temprow.ids || _temprow.connected_lead_id) && _seen_values = FALSE THEN
_new_values := _new_values || _temprow.connected_lead_id;
END IF;
_seen_values := _seen_values || _temprow.ids;
_seen_values := _seen_values || _temprow.connected_lead_id;
END LOOP;
RETURN _new_values;
END;
$$ LANGUAGE plpgsql;
Grouping all ids that refer to the same cli
Loop through the id arrays. If no element of the array was seen before, add the referred cli the output variable (_new_values). In both cases add the ids and the cli to the variable which stores all yet seen ids (_seen_values)
Give out the clis.
The result so far is {8, 7, 336} (which is equivalent to the ids {2,4,11,142}!)
2. The recursion:
-- 1:
WITH RECURSIVE start_points AS (
SELECT unnest(filter_groups()) as ids
),
filtered_groups AS (
-- 3:
SELECT DISTINCT
1 as depth, -- 3
first_value(id) OVER w as id, -- 4
ARRAY[(MIN(id) OVER w)] as visited, -- 5
MIN(created_at) OVER w as created_at,
connected_lead_id,
MIN(on_boarded_date) OVER w as on_boarded_date -- 6,
first_value(source) OVER w as source
FROM groups
WHERE connected_lead_id IN (SELECT ids FROM start_points)
-- 2:
WINDOW w AS (PARTITION BY connected_lead_id ORDER BY created_at)
UNION
SELECT
fg.depth + 1,
fg.id,
array_append(fg.visited, g.id), -- 8
LEAST(fg.created_at, g.created_at),
g.connected_lead_id,
LEAST(fg.on_boarded_date, g.on_boarded_date), -- 9
COALESCE(fg.source, g.source) -- 10
FROM groups g
JOIN filtered_groups fg
-- 7
ON fg.connected_lead_id = g.id AND NOT (g.id = ANY(visited))
)
SELECT DISTINCT ON (id) -- 11
id, created_at,on_boarded_date, source
FROM filtered_groups
ORDER BY id, depth DESC;
The WITH part gives out the results from the function. unnest() expands the id array into each row for each id.
Creating a window: The window function groups all values by their clis and orders the window by the created_at timestamp. In your example all values are in their own window excepting 11 and 142 which are grouped.
This is a help variable to get the latest rows later on.
first_value() gives the first value of the ordered window frame. Assuming 142 had a smaller created_at timestamp the result would have been 142. But it's 11 nevertheless.
A variable is needed to save which id has been visited yet. Without this information an infinite loop would be created: 2-8-2-8-2-8-2-8-...
The minimum date of the window is taken (same thing here: if 142 would have a smaller date than 11 this would be the result).
Now the starting query of the recursion is calculated. Following describes the recursion part:
Joining the table (the original function results) against the previous recursion result. The second condition is the stop of the infinite loop I mentioned above.
Appending the currently visited id into the visited variable.
If the current on_boarded_date is earlier it is taken.
COALESCE gives the first NOT NULL value. So the first NOT NULL source is safed throughout the whole recursion
After the recursion which gives a result of all recursion steps we want to filter out only the deepest visits of every starting id.
DISTINCT ON (id) gives out the row with the first occurence of an id. To get the last one, the whole set is descendingly ordered by the depth variable.

selecting records without value

I have a problem when I'm trying to reach the desired result. The task looks simple — make a daily count of occurrences of the event for top countries.
The main table looks like this:
id | date | country | col1 | col2 | ...
1 | 2018-01-01 21:21:21 | US | value 1 | value 2 | ...
2 | 2018-01-01 22:32:54 | UK | value 1 | value 2 | ...
From this table, I want to get daily event counts by the country, which is achieved by
SELECT date::DATE AT TIME ZONE 'UTC', country, COALESCE(count(id),0) FROM tab1
GROUP BY 1, 2
The problem comes when there is no event was made by an UK user on 2 January 2018
country_events
date | country | count
2018-01-01 | US | 23
2018-01-01 | UK | 5
2018-01-02 | US | 30
2018-01-02 | UK | 0 -> is desired result, but row is missing
I've tried to generate date series and series of countries which I'm looking for, then CROSS JOIN these two tables. This helper with columns date and country I've left joined with my result table like
SELECT * FROM helper h
LEFT JOIN country_events c ON c.date::DATE = h.date::DATE AND c.country = h.country
I'm using PostgreSQL.
You need an outer join, not a cross join:
SELECT tab1.date::date, tab1.country, coalesce(count(*), 0)
FROM generate_series(TIMESTAMP '2018-01-01 00:00:00',
TIMESTAMP '2018-01-31 00:00:00',
INTERVAL '1 day') AS ts(d)
LEFT JOIN tab1 ON tab1.date >= ts.d AND tab1.date < ts.d + INTERVAL '1 day'
GROUP BY tab1.date::date, tab1.country
ORDER BY tab1.date::date, tab1.country;
This will give the desired list for January 2018.

Compare every row to all other rows

Assume I have a table like this:
CREATE TABLE events (
event_id INTEGER,
begin_date DATE,
end_date DATE,
PRIMARY KEY (event_id));
With data like this:
INSERT INTO events SELECT 1 AS event_id,'2017-01-01'::DATE AS begin_date, '2017-01-07'::DATE AS end_date;
INSERT INTO events SELECT 2 AS event_id,'2017-01-04'::DATE AS begin_date, '2017-01-05'::DATE AS end_date;
INSERT INTO events SELECT 3 AS event_id,'2017-01-02'::DATE AS begin_date, '2017-01-03'::DATE AS end_date;
INSERT INTO events SELECT 4 AS event_id,'2017-01-03'::DATE AS begin_date, '2017-01-08'::DATE AS end_date;
INSERT INTO events SELECT 5 AS event_id,'2017-01-02'::DATE AS begin_date, '2017-01-09'::DATE AS end_date;
INSERT INTO events SELECT 6 AS event_id,'2017-01-03'::DATE AS begin_date, '2017-01-06'::DATE AS end_date;
INSERT INTO events SELECT 7 AS event_id,'2017-01-08'::DATE AS begin_date, '2017-01-09'::DATE AS end_date;
I would like to be able to do this:
SELECT a.event_id,
COUNT (*) AS COUNT
FROM events AS a
LEFT JOIN events AS b
ON a.begin_date < b.begin_date
AND a.end_date > b.end_date
GROUP BY a.event_id
ORDER BY a.event_id ASC
With results like this:
*----------*--------*
| event_id | count |
*-------------------*
| 1 | 3 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 3 |
| 6 | 1 |
| 7 | 1 |
*----------*------- *
But with a window function (because it's much faster than the inequality join). Something like this, where I can compare the outer row to the inner rows.
SELECT a.event_id,
COUNT(*) OVER (a.begin_date < b.begin_date AND a.end_date > b.end_date) AS count
FROM events AS a
ORDER BY a.event_id ASC
Ideally this would work on both Postgres and Redshift.
I don't think you will get out of using a JOIN here. Even the window function idea would require a sort of implicit join since the window being compared would have to be a set of all other records.
Instead, consider using Postgres' daterange type. You can cast your begin and end dates into a range and check to see if the exclusive range from your left table contains the inclusive range of your right table:
SELECT t1.event_id, count(*)
FROM events t1
LEFT OUTER JOIN events t2
ON daterange(t1.begin_date, t1.end_date, '()') #> daterange(t2.begin_date, t2.end_date, '[]') AND t1.event_id <> t2.event_id
GROUP BY 1
ORDER BY 1;
event_id | count
----------+-------
1 | 3
2 | 1
3 | 1
4 | 1
5 | 3
6 | 1
7 | 1
The real question (and I don't know the answer) is that if all of this casting and "range includes" #> logic is more efficient then your inequality version. While the Nested Loop Left Join used here has a lower estimated rows, the Total Cost is about 33% higher.
I have a hunch if your data were stored as an inclusive daterange type, then the cost would be less as a cast wouldn't need to happen (although because we are comparing an inclusive range to an exclusive range then it may be a wash).

Fill in missing rows when aggregating over multiple fields in Postgres

I am aggregating sales for a set of products per day using Postgres and need to know not just when sales do happen, but also when they do not for further processing.
SELECT
sd.date,
COUNT(sd.sale_id) AS sales,
sd.product
FROM sales_data sd
-- sales per product, per day
GROUP BY sd.product, sd.date
ORDER BY sd.product, sd.date
This produces the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-17 | 2 | shower gel
2017-08-21 | 1 | shower gel
As you can see - the date ranges per product are not continuous as sales_data just didn't contain any info for these products on some days.
What I'm aiming to do is to add a sales = 0 row for each product that is not sold on any day in a range - for example here, between 2017-08-17 and 2017-08-21 to give something like the the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-18 | 0 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-21 | 0 | soap
2017-08-17 | 2 | shower gel
2017-08-18 | 0 | shower gel
2017-08-19 | 0 | shower gel
2017-08-20 | 0 | shower gel
2017-08-21 | 1 | shower gel
In a simpler case where there was only a single product, it seems like the solution would be to use generate_series() i.e.:
create a full range of dates using generate_series
LEFT JOIN the already aggregated sales data onto the date series
COALESCE any NULL counts to 0 in the missing rows
The problem I have is that this approach does not seem to work dates repeat in the aggregated data as I'm grouping over not just multiple dates, but multiple products also.
It feels like I should be able to do something cunning with window functions here to solve this e.g. joining onto the full date range over partitions defined by the product name - but I can't see a way of actually getting this to work.
You could use:
WITH cte AS (
SELECT date, s.product
FROM ... -- some way to generate date series
CROSS JOIN (SELECT DISTINCT product FROM sales_data) s
)
SELECT
c.date,
c.product,
COUNT(sd.sale_id) AS sales
FROM cte c
LEFT JOIN sales_data sd
ON c.date = sd.date AND c.product= sd.product
GROUP BY c.date, c.product
ORDER BY c.date, c.product;
First create Cartesian product of dates and products, then LEFT JOIN to actual data and do calculations.
Oracle has great feature for this scenarios called Partitioned Outer Joins:
SELECT times.time_id, product, quantity
FROM inventory PARTITION BY (product)
RIGHT OUTER JOIN times ON (times.time_id = inventory.time_id)
WHERE times.time_id BETWEEN TO_DATE('01/04/01', 'DD/MM/YY')
AND TO_DATE('06/04/01', 'DD/MM/YY')
ORDER BY 2,1;
select
date,
count(sale_id) as sales,
product
from
sales_data
right join (
(
select d::date as date
from generate_series (
(select min(date) from sales_data),
(select max(date) from sales_data),
'1 day'
) gs (d)
) gs
cross join
(select distinct product from sales_data) p
) cj using (product, date)
group by product, date
order by product, date

adding missing date in a table in PostgreSQL

I have a table that contains data for every day in 2002, but it has some missing dates. Namely, 354 records for 2002 (instead of 365). For my calculations, I need to have the missing data in the table with Null values
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | 65.6 | 2002-05-09 |
| 103 | 75.9 | 2002-05-10 |
+-----+------------+------------+
you see that 2002-05-08 is missing. I want my final table to be like:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | | 2002-05-08 |
| 103 | 65.6 | 2002-05-09 |
| 104 | 75.9 | 2002-05-10 |
+-----+------------+------------+
Is there a way to do that in PostgreSQL?
It doesn't matter if I have the result just as a query result (not necessarily an updated table)
date is a reserved word in standard SQL and the name of a data type in PostgreSQL. PostgreSQL allows it as identifier, but that doesn't make it a good idea. I use thedate as column name instead.
Don't rely on the absence of gaps in a surrogate ID. That's almost always a bad idea. Treat such an ID as unique number without meaning, even if it seems to carry certain other attributes most of the time.
In this particular case, as #Clodoaldo commented, thedate seems to be a perfect primary key and the column id is just cruft - which I removed:
CREATE TEMP TABLE tbl (thedate date PRIMARY KEY, rainfall numeric);
INSERT INTO tbl(thedate, rainfall) VALUES
('2002-05-06', 110.2)
, ('2002-05-07', 56.6)
, ('2002-05-09', 65.6)
, ('2002-05-10', 75.9);
Query
Full table by query:
SELECT x.thedate, t.rainfall -- rainfall automatically NULL for missing rows
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
LEFT JOIN tbl t USING (thedate)
ORDER BY x.thedate
Similar to what #a_horse_with_no_name posted, but simplified and ignoring the pruned id.
Fills in gaps between first and last date found in the table. If there can be leading / lagging gaps, extend accordingly. You can use date_trunc() like #Clodoaldo demonstrated - but his query suffers from syntax errors and can be simpler.
INSERT missing rows
The fastest and most readable way to do it is a NOT EXISTS anti-semi-join.
INSERT INTO tbl (thedate, rainfall)
SELECT x.thedate, NULL
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
WHERE NOT EXISTS (SELECT 1 FROM tbl t WHERE t.thedate = x.thedate)
Just do an outer join against a query that returns all dates in 2002:
with all_dates as (
select date '2002-01-01' + i as date_col
from generate_series(0, extract(doy from date '2002-12-31')::int - 1) as i
)
select row_number() over (order by ad.date_col) as id,
t.rainfall,
ad.date_col as date
from all_dates ad
left join your_table t on ad.date_col = t.date
order by ad.date_col;
This will not change your table, it will just produce the result as desired.
Note that the generated id column will not contain the same values as the ID column in your table as it is merely a counter in the result set.
You could also replace the row_number() function with extract(doy from ad.date_col)
To fill the gaps. This will not reorder the IDs:
insert into t (rainfall, "date") values
select null, "date"
from (
select d::date as "date"
from (
t
right join
generate_series(
(select date_trunc('year', min("date")) from t)::timestamp,
(select max("date") from t),
'1 day'
) s(d) on t."date" = s.d::date
where t."date" is null
) q
) s
You have to fully re-create your table as indexes haves to change.
The better way to do it is to use your prefered dbi language, make a loop ignoring ID and putting values in a new table with new serialized IDs.
for day in (whole needed calendar)
value = select rainfall from oldbrokentable where date = day
insert into newcleanedtable date=day, rainfall=value, id=serialized
(That's not real code! Just conceptual to be adapted to your prefered scripting language)