Postgres: Find missing dates between a dataset of two columns

I'm trying to create a query that returns the missing dates between two columns, across multiple rows.
Example:
leases
move_in move_out hotel_id
2021-04-01 2021-04-14 1
2021-04-17 2021-04-30 1
2021-04-01 2021-04-14 2
2021-04-17 2021-04-30 2
Result should be
date hotel_id
2021-04-15 1
2021-04-16 1
2021-04-15 2
2021-04-16 2

You're finding the difference between two sets. One is the leased hotel days. The other is all days in the month of April. And you're doing this for all hotels.
We can make a set of all days of April for all hotels. First we need to build the set of all days in the month of April: generate_series('2021-04-01'::date, '2021-04-30'::date, '1 day').
Then we need to cross join this with all hotel IDs.
select *
from generate_series('2021-04-01'::date, '2021-04-30'::date, '1 day') as dates(day)
cross join (
select distinct hotel_id as id
from leases
) as hotels(id)
Now for each day we can left join this with the leases for that day.
select *
from generate_series('2021-04-01'::date, '2021-04-30'::date, '1 day') as dates(day)
cross join (
select distinct hotel_id as id
from leases
) as hotels(id)
left join leases on day between move_in and leases.move_out and hotel_id = hotels.id
Any days without a lease won't have a lease.id (this assumes your leases table has an id primary key; if not, test any lease column, such as move_in, for NULL), so filter on that.
select day, hotels.id
from generate_series('2021-04-01'::date, '2021-04-30'::date, '1 day') as dates(day)
cross join (
select distinct hotel_id as id
from leases
) as hotels(id)
left join leases on day between move_in and leases.move_out and hotel_id = hotels.id
where leases.id is null
order by hotels.id, day
Demonstration.
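If you'd rather not hard-code April, a variation on the same idea is to derive each hotel's date window from its own leases. This is only a sketch: it assumes the leases table exactly as shown in the question (no id column), so the anti-join filter tests hotel_id instead.
select hotels.id, dates.day::date as missing_day
from (
select hotel_id as id, min(move_in) as first_day, max(move_out) as last_day
from leases
group by hotel_id
) as hotels
cross join lateral generate_series(hotels.first_day, hotels.last_day, '1 day') as dates(day)
left join leases on dates.day between leases.move_in and leases.move_out and leases.hotel_id = hotels.id
-- anti-join: keep only generated days that no lease covers
where leases.hotel_id is null
order by hotels.id, missing_day;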

If you're using PostgreSQL 14+ you can use multiranges to do this:
CREATE TEMP TABLE t (
"move_in" DATE,
"move_out" DATE,
"hotel_id" INTEGER
);
INSERT INTO t
("move_in", "move_out", "hotel_id")
VALUES ('2021-04-01', '2021-04-14', '1')
, ('2021-04-17', '2021-04-30', '1')
, ('2021-05-03', '2021-05-30', '1') -- added this as a test case
, ('2021-04-01', '2021-04-14', '2')
, ('2021-04-17', '2021-04-30', '2');
SELECT hotel_id, datemultirange(DATERANGE(MIN(move_in), MAX(move_out))) - range_agg(DATERANGE(move_in, move_out, '[]')) AS r
FROM t
GROUP BY hotel_id
returns
+--------+-------------------------------------------------+
|hotel_id|r |
+--------+-------------------------------------------------+
|2 |{[2021-04-15,2021-04-17)} |
|1 |{[2021-04-15,2021-04-17),[2021-05-01,2021-05-03)}|
+--------+-------------------------------------------------+
If you want to have 1 row per day you can use unnest and generate_series to expand the multiranges:
WITH available_ranges AS(
SELECT hotel_id, unnest(datemultirange(DATERANGE(MIN(move_in), MAX(move_out), '[]')) - range_agg(DATERANGE(move_in, move_out, '[]'))) AS r
FROM t
GROUP BY hotel_id
)
SELECT hotel_id, generate_series(lower(r), upper(r) - 1, '1 day'::interval)
FROM available_ranges
ORDER BY 1, 2
;
returns
+--------+---------------------------------+
|hotel_id|generate_series |
+--------+---------------------------------+
|1 |2021-04-15 00:00:00.000000 +00:00|
|1 |2021-04-16 00:00:00.000000 +00:00|
|1 |2021-05-01 00:00:00.000000 +00:00|
|1 |2021-05-02 00:00:00.000000 +00:00|
|2 |2021-04-15 00:00:00.000000 +00:00|
|2 |2021-04-16 00:00:00.000000 +00:00|
+--------+---------------------------------+
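If you prefer plain dates rather than timestamps in that last result, a small tweak of the same query (reusing the CTE above) is to cast the generated values:
WITH available_ranges AS(
SELECT hotel_id, unnest(datemultirange(DATERANGE(MIN(move_in), MAX(move_out), '[]')) - range_agg(DATERANGE(move_in, move_out, '[]'))) AS r
FROM t
GROUP BY hotel_id
)
-- ::date drops the midnight time component produced by generate_series
SELECT hotel_id, generate_series(lower(r), upper(r) - 1, '1 day'::interval)::date AS missing_date
FROM available_ranges
ORDER BY 1, 2;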

Related

First record by month & by year

I have a Rails application with 20+ years of data.
I'm struggling to create two SQL queries:
Fetch the first record of each year (based on filters)
Fetch the first record of each month (based on filters)
I made a DBFiddle here: https://www.db-fiddle.com/f/wjQqrrpaJeiYG8zkExbaos/0
For the first query (yearly), the result should be:
a | b_id | created_at
74780 | 82373 | 2020-01-02 01:34:33 +0000
15670 | 16639 | 2019-02-24 14:33:56 +0000
14586 | 87594 | 2018-01-06 09:14:31 +0000
I can fetch the years and months using date_part('year', created_at) and date_part('month', created_at), but didn't find a way to "glue" them with min(created_at).
Try using a window function with OVER:
with grouped as(
select *, min(created_at) over(partition by date_trunc('year', created_at))
from z order by date_trunc('year', created_at) desc
)
select a, b_id, created_at from grouped where min = created_at
For the first record by month you can use the same approach, replacing every date_trunc('year', created_at) with date_trunc('month', created_at), as sketched below.
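For reference, a sketch of the monthly variant (assuming the same table z with columns a, b_id and created_at from the fiddle):
with grouped as(
select *, min(created_at) over(partition by date_trunc('month', created_at)) as month_min
from z
)
-- keep only the row whose created_at equals the earliest created_at of its month
select a, b_id, created_at from grouped where month_min = created_at
order by created_at desc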

How to correctly test whether a date is within time intervals

I have a timestamp of a user action, and several time intervals during which the user has grants to perform the action.
I need to check whether the timestamp of this action is within at least one of the time intervals or not.
Table with users:
CREATE TABLE ausers (
id serial PRIMARY KEY,
user_name VARCHAR(255) default NULL,
action_date TIMESTAMP
);
INSERT INTO ausers VALUES(1,'Jhon', '2018-02-21 15:05:06');
INSERT INTO ausers VALUES(2,'Bob', '2018-05-24 12:22:26');
#|id|user_name|action_date
----------------------------------
1|1 |Jhon |21.02.2018 15:05:06
2|2 |Bob |24.05.2018 12:22:26
Table with grants:
CREATE TABLE user_grants (
id serial PRIMARY KEY,
user_id INTEGER,
start_date TIMESTAMP,
end_date TIMESTAMP
);
INSERT INTO user_grants VALUES(1, 1, '2018-01-01 00:00:01', '2018-03-01 00:00:00');
INSERT INTO user_grants VALUES(2, 1, '2018-06-01 00:00:01', '2018-09-01 00:00:00');
INSERT INTO user_grants VALUES(3, 2, '2018-01-01 00:00:01', '2018-02-01 00:00:00');
INSERT INTO user_grants VALUES(4, 2, '2018-02-01 00:00:01', '2018-03-01 00:00:00');
#|id|user_id|start_date |end_date
------------------------------------------------------
1|1 |1 |01.01.2018 00:00:01 |01.03.2018 00:00:00
2|2 |1 |01.06.2018 00:00:01 |01.09.2018 00:00:00
3|3 |2 |01.01.2018 00:00:01 |01.02.2018 00:00:00
4|4 |2 |01.02.2018 00:00:01 |01.03.2018 00:00:00
The query:
select u.user_name,
case
when array_agg(gr.range) @> array_agg(tstzrange(u.action_date, u.action_date, '[]')) then 'Yes'
else 'No'
end as "permition was granted"
from ausers u
left join (select tstzrange(ug.start_date, ug.end_date, '[]') as range, ug.user_id as uid
from user_grants ug) as gr on gr.uid = u.id
group by u.user_name;
Result:
#|user_name|permition was granted
---------------------------------
1|Bob |No
2|Jhon |No
Timestamp '01.02.2018 15:05:06' is within the "01.01.2018 00:00:01, 01.03.2018 00:00:00" range, so "Bob" had grants to perform the action and there should be "Yes" in the first row, not "No".
The expected output is like this:
#|user_name|permition was granted
---------------------------------
1|Bob |Yes
2|Jhon |No
I tried to test like this:
select array_agg(tstzrange('2018-02-21 15:05:06', '2018-02-21 15:05:06', '[]')) <@ array_agg(tstzrange('2018-01-01 00:00:01', '2018-03-01 00:00:01', '[]'));
#|?column?
----------
|false
Result is "false".
But if I remove the array_agg function
select tstzrange('2018-02-21 15:05:06', '2018-02-21 15:05:06', '[]') <@ tstzrange('2018-01-01 00:00:01', '2018-03-01 00:00:01', '[]');
#|?column?
----------
|true
It works fine - the result is "true". Why? What's wrong with array_agg?
I have to use array_agg because I have several time intervals to compare.
I have to make a "fake" time interval
array_agg(tstzrange(u.action_date, u.action_date, '[]'))
from a single timestamp because the @> operator doesn't allow comparing a timestamp with an array of timestamp intervals.
How can I check that one date is within at least one time interval from the array of time intervals?
There are several @> operators in PostgreSQL:
tstzrange @> tstzrange tests if the first interval contains the second.
anyarray @> anyarray tests if the first array contains all elements of the second array.
In your query that will test whether, for each interval in the second array, there is an equal interval in the first array.
There is a way to test if an interval is contained in one of the elements of an array of intervals:
someinterval <@ ANY (array_of_intervals)
but there is no straightforward way to express your condition with an operator alone.
Instead, skip the arrays: join the two tables on @> and count the result rows, as in the sketch below.
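For illustration, a minimal sketch of that join-and-count approach against the tables above (tsrange is used here because start_date, end_date and action_date are plain timestamps):
select u.user_name,
       -- count(ug.id) counts only the grants whose interval contains the action timestamp
       case when count(ug.id) > 0 then 'Yes' else 'No' end as "permission was granted"
from ausers u
left join user_grants ug
  on ug.user_id = u.id
 and tsrange(ug.start_date, ug.end_date, '[]') @> u.action_date
group by u.user_name;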
Since all three dates are scalar quantities, Postgres range checking is not required; a simple BETWEEN operation suffices.
select au.user_name
, case when ug.user_id is null then 'No' else 'Yes' end authorized
from ausers au
left join user_grants ug
on ( au.id = ug.user_id
and au.action_date between ug.start_date and ug.end_date
);
BTW, I think the expected results you posted are backwards. Neither user has a timestamp of '01.02.2018 15:05:06' as indicated in the description.

PostgreSQL Marketing Report

I'm writing out a query that takes ad marketing data from Google Ads, Microsoft, and Taboola and merges it into one table.
The table should have 3 rows, one for each ad company, with 4 columns: traffic source (ad company), money spent, sales, and cost per conversion. Right now I'm just dealing with the first 2 till I get those right. The whole table's data should be grouped within a given month.
Right now the results I'm getting are multiple rows from each traffic source, some of them merging months of data into the cost column instead of summing up the costs within a given month.
WITH google_ads AS
( SELECT 'Google' AS traffic_source,
date_trunc('month', "day"::date) AS month,
SUM(cost / 1000000) AS cost
FROM googleads_campaign AS g
GROUP BY month
ORDER BY month DESC),
taboola AS
( SELECT 'Taboola' AS traffic_source,
date_trunc('month', "date"::date) AS month,
SUM(spent) AS cost
FROM taboola_campaign AS t
GROUP BY month
ORDER BY month DESC),
microsoft AS
( SELECT 'Microsoft' AS traffic_source,
date_trunc('month', "TimePeriod"::date) AS month,
SUM("Spend") AS cost
FROM microsoft_campaign AS m
GROUP BY month
ORDER BY month DESC)
SELECT (CASE
WHEN M.traffic_source='Microsoft' THEN M.traffic_source
WHEN T.traffic_source='Taboola' THEN T.traffic_source
WHEN G.traffic_source='Google' THEN G.traffic_source
END) AS traffic_source1,
SUM(CASE
WHEN G.traffic_source='Google' THEN G.cost
WHEN T.traffic_source='Taboola' THEN T.cost
WHEN M.traffic_source='Microsoft' THEN M.cost
END) AS cost,
(CASE
WHEN G.traffic_source='Google' THEN G.month
WHEN T.traffic_source='Taboola' THEN T.month
WHEN M.traffic_source='Microsoft' THEN M.month
END) AS month1
FROM google_ads G
LEFT JOIN taboola T ON G.month = T.month
LEFT JOIN microsoft M ON G.month = M.month
GROUP BY traffic_source1, month1
Here's an example of the results I'm getting. The month column is simply for testing purposes.
| traffic_source1 | cost | month1 |
|:----------------|:-----------|:---------------|
| Google | 210.00 | 01/09/18 00:00 |
| Google | 1,213.00 | 01/10/18 00:00 |
| Google | 2,481.00 | 01/11/18 00:00 |
| Google | 3,503.00 | 01/12/18 00:00 |
| Google | 7,492.00 | 01/01/19 00:00 |
| Microsoft | 22,059.00 | 01/02/19 00:00 |
| Microsoft | 16,958.00 | 01/03/19 00:00 |
| Microsoft | 7,582.00 | 01/04/19 00:00 |
| Microsoft | 76,125.00 | 01/05/19 00:00 |
| Taboola | 37,205.00 | 01/06/19 00:00 |
| Google | 45,910.00 | 01/07/19 00:00 |
| Google | 137,421.00 | 01/08/19 00:00 |
| Google | 29,501.00 | 01/09/19 00:00 |
Instead, it should look like this (Let's say for the month of July this year, for instance):
| traffic_source | cost |
|----------------|-----------|
| Google | 53,901.00 |
| Microsoft | 22,059.00 |
| Taboola | 37,205.00 |
Any help would be greatly appreciated, thanks!
You can try this way:
WITH google_ads AS
( SELECT 'Google' AS traffic_source,
date_trunc('month', "day"::date) AS month,
SUM(cost / 1000000) AS cost
FROM googleads_campaign AS g
GROUP BY month
ORDER BY month DESC),
taboola AS
( SELECT 'Taboola' AS traffic_source,
date_trunc('month', "date"::date) AS month,
SUM(spent) AS cost
FROM taboola_campaign AS t
GROUP BY month
ORDER BY month DESC),
microsoft AS
( SELECT 'Microsoft' AS traffic_source,
date_trunc('month', "TimePeriod"::date) AS month,
SUM("Spend") AS cost
FROM microsoft_campaign AS m
GROUP BY month
ORDER BY month DESC)
SELECT (CASE
WHEN M.traffic_source='Microsoft' THEN M.traffic_source
WHEN T.traffic_source='Taboola' THEN T.traffic_source
WHEN G.traffic_source='Google' THEN G.traffic_source
END) AS traffic_source1,
SUM(CASE
WHEN G.traffic_source='Google' THEN G.cost
WHEN T.traffic_source='Taboola' THEN T.cost
WHEN M.traffic_source='Microsoft' THEN M.cost
END) AS cost,
(CASE
WHEN G.traffic_source='Google' THEN G.month
WHEN T.traffic_source='Taboola' THEN T.month
WHEN M.traffic_source='Microsoft' THEN M.month
END) AS month1
FROM google_ads G
LEFT JOIN taboola T ON G.month = T.month
LEFT JOIN microsoft M ON G.month = M.month
GROUP BY traffic_source1, month1
HAVING EXTRACT(month FROM (CASE
       WHEN G.traffic_source='Google' THEN G.month
       WHEN T.traffic_source='Taboola' THEN T.month
       WHEN M.traffic_source='Microsoft' THEN M.month
       END)) = ... -- desired month (July is 7); the CASE is repeated because PostgreSQL
                   -- does not accept the output alias month1 in HAVING, and you may also
                   -- want to filter on the year since the sample data spans 2018 and 2019
The concept of a different table for each ad source is really a very bad idea. It vastly compounds the complexity of queries requiring consolidation. It would be much better to have a single table that includes the source along with the other columns. Consider what happens when marketing decides to use 30-40 or more ad suppliers. If you cannot create a single table, then at least standardize column names and types. Also build a view, a materialized view, or a table function (below) which combines all the traffic sources into a single source.
create or replace function consolidated_ad_traffic()
returns table( traffic_source text
, ad_month timestamp with time zone
, ad_cost numeric(11,2)
, ad_sales numeric(11,2)
, conversion_cost numeric(11,6)
)
language sql
AS $$
with ad_sources as
( select 'Google' as traffic_source
, "date" as ad_date
, round(cast (cost AS numeric ) / 1000000.0,2) as cost
, sales
, cost_per_conversion
from googleads_campaign
union all
select 'Taboola'
, "date"
, spent
, sales
, cost_per_conversion
from taboola_campaign
union all
select 'Microsoft'
, "TimePeriod"
, "Spend"
, sales
, cost_per_conversion
from microsoft_campaign
)
select * from ad_sources;
$$;
With a consolidated view of the data you can now write normal selects as though you had a single table. Such as:
select * from consolidated_ad_traffic();
select distinct on( traffic_source, to_char(ad_month, 'mm'))
traffic_source
, to_char(ad_month, 'Mon') "For Month"
, to_char(sum(ad_cost) over(partition by traffic_source, to_char(ad_month, 'Mon')), 'FM99,999,999,990.00') monthly_traffic_cost
, to_char(sum(ad_cost) over(partition by traffic_source), 'FM99,999,999,990.00') total_traffic_cost
from consolidated_ad_traffic();
select traffic_source, sum(ad_cost) ad_cost
from consolidated_ad_traffic()
group by traffic_source
order by traffic_source;
select traffic_source
, to_char(ad_month, 'dd-Mon') "For Month"
, sum(ad_cost) "Monthly Cost"
from consolidated_ad_traffic()
where date_trunc('month',ad_month) = date_trunc('month', date '2019-07-01')
and traffic_source = 'Google'
group by traffic_source, to_char(ad_month, 'dd-Mon') ;
Now this won't do much for updating but will drastically ease selection.
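If a set-returning function is more than you need, a plain view is an alternative sketch of the same consolidation. It relies on the same column assumptions as the function above (each source table exposing sales and cost_per_conversion), and consolidated_ad_traffic_v is just a placeholder name.
create or replace view consolidated_ad_traffic_v as
select 'Google' as traffic_source
, "date" as ad_month
, round(cast(cost as numeric) / 1000000.0, 2) as ad_cost
, sales as ad_sales
, cost_per_conversion
from googleads_campaign
union all
select 'Taboola', "date", spent, sales, cost_per_conversion
from taboola_campaign
union all
select 'Microsoft', "TimePeriod", "Spend", sales, cost_per_conversion
from microsoft_campaign;
-- the earlier selects then work the same way, for example:
-- select traffic_source, sum(ad_cost) ad_cost from consolidated_ad_traffic_v group by traffic_source order by traffic_source;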

Getting data from postgres weekly (according to date)

user timespent(in sec) date(in timestamp)
u1 10 t1(2015-08-15)
u1 20 t2(2015-08-19)
u1 15 t3(2015-08-28)
u1 16 t4(2015-09-06)
Above is the format of my table, which represents the time spent by a user on a course, ordered by timestamp. I want to get the sum of timespent by a particular user, say u1, per week, in the format:
start_date end_date sum
2015-08-15 2015-08-21 30
2015-08-22 2015-08-28 15
2015-08-29 2015-09-04 0
2015-09-05 2015-09-11 16
The difficulty lies in the fact that the seven-day periods that you want to get are not regular weeks starting with Monday.
You therefore cannot use the standard functions that derive the week number from the date, and have to build your own week generator using generate_series().
Example data:
create table sessions (user_name text, time_spent int, session_date timestamp);
insert into sessions values
('u1', 10, '2015-08-15'),
('u1', 20, '2015-08-19'),
('u1', 15, '2015-08-28'),
('u1', 16, '2015-09-06');
The query for an arbitrarily chosen period from 2015-08-15 to 2015-09-06:
with weeks as (
select d::date start_date, d::date+ 6 end_date
from generate_series('2015-08-15', '2015-09-06', '7d'::interval) d
)
select w.start_date, w.end_date, coalesce(sum(time_spent), 0) total
from weeks w
left join (
select start_date, end_date, coalesce(time_spent, 0) time_spent
from weeks
join sessions
on session_date between start_date and end_date
where user_name = 'u1'
) s
on w.start_date = s.start_date and w.end_date = s.end_date
group by 1, 2
order by 1;
start_date | end_date | total
------------+------------+-------
2015-08-15 | 2015-08-21 | 30
2015-08-22 | 2015-08-28 | 15
2015-08-29 | 2015-09-04 | 0
2015-09-05 | 2015-09-11 | 16
(4 rows)
select
ui,
date_trunc('week', the_date)::date as start_date,
date_trunc('week', the_date)::date + 6 as end_date,
sum(timespent) as "sum"
from t
group by 1, 2, 3
order by 1,2
Something like this (assuming that by timestamp you mean the data type timestamp).
In order to make the first day of the week Sunday, I added an extra day to "date" in the group by.
select (start_date - date_part('dow', start_date) * interval '1 day')::date start_date,
(start_date + (6 - date_part('dow', start_date)) * interval '1 day')::date end_date,
total_time_spent
from (
select min("date") start_date, sum(timespent) total_time_spent
from mytable
where "user" = 'u1'
group by date_part('year', "date"), date_part('week', "date" + interval '1 day')) "tmp"
order by start_date
This is a more generic approach, for any date interval.

Compare interval date by row

I am trying to group dates within a 1 year interval for a given identifier, labeling which is the earliest date and which is the latest date. If there are no other dates within a 1 year interval from that date, then it will record its own date as both the first and last date. For example, originally the data is:
id | date
____________
a | 1/1/2000
a | 1/2/2001
a | 1/6/2000
b | 1/3/2001
b | 1/3/2000
b | 1/3/1999
c | 1/1/2000
c | 1/1/2002
c | 1/1/2003
And the output I want is:
id | first_date | last_date
___________________________
a | 1/1/2000 | 1/2/2001
b | 1/3/1999 | 1/3/2001
c | 1/1/2000 | 1/1/2000
c | 1/1/2002 | 1/1/2003
I have been trying to figure this out the whole day and can't. I can do it for ids with only 2 duplicates, but not for greater counts. Any help would be great.
SELECT id
, min(min_date) AS min_date
, max(max_date) AS max_date
, sum(row_ct) AS row_ct
FROM (
SELECT id, year, min_date, max_date, row_ct
, year - row_number() OVER (PARTITION BY id ORDER BY year) AS grp
FROM (
SELECT id
, extract(year FROM the_date)::int AS year
, min(the_date) AS min_date
, max(the_date) AS max_date
, count(*) AS row_ct
FROM tbl
GROUP BY id, year
) sub1
) sub2
GROUP BY id, grp
ORDER BY id, grp;
1) Group all rows per (id, year) in subquery sub1. Record min and max of the date. I added a count of rows (row_ct) for demonstration.
2) Subtract the row_number() from the year in the second subquery sub2. Thus, all years in succession end up in the same group (grp), and a gap in the years starts a new group. For id 'c', for example, the years 2000, 2002 and 2003 get row numbers 1, 2 and 3, giving grp values 1999, 2000 and 2000: the year 2000 forms its own group, while 2002 and 2003 share one.
3) In the final SELECT, group a second time, this time by (id, grp), and record min, max and row count again. Voilà. Produces exactly the result you are looking for.
-> SQLfiddle demo.
Related answers:
Return array of years as year ranges
Group by repeating attribute
select id, min("date") first_date, max("date") last_date
from <yourTbl> group by id
Use this (SQLFiddle Demo):
SELECT id,
min(date) AS first_date,
max(date) AS last_date
FROM mytable
GROUP BY 1
ORDER BY 1