selecting records without value - postgresql

I have a problem when I'm trying to reach the desired result. The task looks simple — make a daily count of occurrences of the event for top countries.
The main table looks like this:
id | date | country | col1 | col2 | ...
1 | 2018-01-01 21:21:21 | US | value 1 | value 2 | ...
2 | 2018-01-01 22:32:54 | UK | value 1 | value 2 | ...
From this table, I want to get daily event counts by the country, which is achieved by
SELECT date::DATE AT TIME ZONE 'UTC', country, COALESCE(count(id),0) FROM tab1
GROUP BY 1, 2
The problem comes when no event was made by a UK user on a given day, e.g. 2 January 2018:
country_events
date | country | count
2018-01-01 | US | 23
2018-01-01 | UK | 5
2018-01-02 | US | 30
2018-01-02 | UK | 0 -> is desired result, but row is missing
I've tried generating a series of dates and a list of the countries I'm interested in, then CROSS JOINing the two. I then left joined this helper, which has the columns date and country, to my result table like this:
SELECT * FROM helper h
LEFT JOIN country_events c ON c.date::DATE = h.date::DATE AND c.country = h.country
I'm using PostgreSQL.

You need an outer join, not a cross join:
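-- Group by the generated day, not by tab1.date, which is NULL on days without events.
-- count(tab1.id) ignores the NULL-extended rows, so empty days count as 0.
SELECT ts.d::date AS day, tab1.country, count(tab1.id) AS events
FROM generate_series(TIMESTAMP '2018-01-01 00:00:00',
                     TIMESTAMP '2018-01-31 00:00:00',
                     INTERVAL '1 day') AS ts(d)
LEFT JOIN tab1 ON tab1.date >= ts.d AND tab1.date < ts.d + INTERVAL '1 day'
GROUP BY ts.d, tab1.country
ORDER BY ts.d, tab1.country;
This returns at least one row for every day in January 2018. A day with no events at all comes back with a NULL country and a count of 0; a zero row per country additionally requires joining against the list of countries.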
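One way to get a zero row for every country of interest is to build the full grid of days and countries first, then left join the events onto that grid. The following is only a sketch: it assumes the country list ('US', 'UK') is hard-coded in a VALUES list, that tab1.date is a timestamp column, and the aliases cal, c and t are made up for the example.

```sql
-- Sketch: one row per (day, country), with 0 where no events exist.
-- Assumption: the countries of interest are listed inline.
SELECT cal.day::date AS day,
       c.country,
       count(t.id)   AS events          -- counts only matched rows
FROM generate_series(TIMESTAMP '2018-01-01',
                     TIMESTAMP '2018-01-31',
                     INTERVAL '1 day') AS cal(day)
CROSS JOIN (VALUES ('US'), ('UK')) AS c(country)
LEFT JOIN tab1 AS t
       ON t.date >= cal.day
      AND t.date <  cal.day + INTERVAL '1 day'
      AND t.country = c.country
GROUP BY cal.day, c.country
ORDER BY cal.day, c.country;
```

The cross join only builds the helper grid; the outer join is still what keeps the grid rows that have no matching events.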

Related

Postgres generate_series joined onto result set to fill empty dates within a range

I have a result set that sometimes has missing dates (because no data is present within that week), and I need to fill those with zeros. For simplicity I've reduced the query and table down to
Table: generated_data
id | data | date_key
1 | 3 | 2021-12-13 03:00:00.000
2 | 1 | 2021-12-22 05:00:00.000
3 | 4 | 2021-12-24 07:00:00.000
4 | 7 | 2022-01-03 01:00:00.000
5 | 2 | 2022-01-05 02:00:00.000
Query:
SELECT
    sum(data) / count(data),
    DATE_TRUNC('week', date_key AT TIME ZONE 'America/New_York') AS date_key
FROM generated_data
GROUP BY DATE_TRUNC('week', date_key AT TIME ZONE 'America/New_York')
would produce the following result set:
3 | 2021-12-13 00:00:00.000
2.5 | 2021-12-20 00:00:00.000
5.5 | 2022-01-03 00:00:00.000
but as you can see there's a missing date of 12/27 which I'd like to return in the result set as a zero. I've looked into using generate_series and joining onto the above simplified query, but haven't found a good solution.
The idea would be doing something like
SELECT GENERATE_SERIES('2021-11-08T00:00:00+00:00'::date, '2022-01-17T04:59:59.999000+00:00'::date, '1 week'::interval) as date_key
but I'm not sure how to join that back to the result query so that only the missing dates are added. What would an ON clause look like for something like that?
The final result set would look like
3 | 2021-12-13 00:00:00.000
2.5 | 2021-12-20 00:00:00.000
0 | 2021-12-27 00:00:00.000
5.5 | 2022-01-03 00:00:00.000
First, find the min and max of the date column and generate the series of weeks from that range. Then left join your table onto the generated weeks:
WITH data_range AS (
    -- Truncate both ends to the week first, so the last week is never skipped
    SELECT
        DATE_TRUNC('week', min(date_key) AT TIME ZONE 'America/New_York') AS min_week,
        DATE_TRUNC('week', max(date_key) AT TIME ZONE 'America/New_York') AS max_week
    FROM generated_data
),
generated_range AS (
    -- One row per week between the first and the last week with data
    SELECT GENERATE_SERIES(min_week, max_week, '1 week'::interval) AS week_start
    FROM data_range
)
SELECT
    -- sum/count is NULL for weeks without rows; COALESCE turns that into 0
    COALESCE(sum(gd.data) / count(gd.data), 0),
    gr.week_start AS date_key
FROM generated_range gr
LEFT JOIN generated_data gd
       ON DATE_TRUNC('week', gd.date_key AT TIME ZONE 'America/New_York') = gr.week_start
GROUP BY gr.week_start
ORDER BY 2

How can I find the status in each month using a start and end date?

[ Title was: "Find out the facts: How to find the month wise active members in an healthcare organization per each year and also find the growth percentage" ]
I have five years of historical data and would like to run some analytics on it. The data contains both active and inactive members. The task is to find the number of active members for each month of each year.
What I am doing now is extracting the month and year from the effective date and grouping by month and year, filtered on the active status, i.e. Status = 'Active'.
But this way I lose the historical records.
For example, suppose a person had a membership from 01-01-2015 to 31-12-2016. This member is shown as inactive now, but the same person was an active member during that period. So if I filter on the status, I lose these old records.
I need to go to a given month, say Jan 2015, and check everyone who was active at that time. So I thought of another approach.
I extracted the month of the expiry date and filtered for exp_month equal to or greater than the extracted month of the effective date, as shown below. Here I am not relying on the incoming source field containing the member status; instead I am creating a field with logic to identify the status of the member during the period in question. This is just to identify active members for each month of the year, but I am not sure this gives me a correct solution. Please suggest a better approach.
SELECT extract(YEAR FROM member_effective_date) AS year
, extract(MONTH FROM member_expiry_date) AS month
, CASE WHEN extract(MONTH FROM member_expiry_date)
= extract(MONTH FROM member_effective_date)
OR extract(MONTH FROM member_expiry_date)
> extract(MONTH FROM member_effective_date)
THEN 'Yes'
ELSE 'No' END AS active_status
FROM table_name
You need to use a cross join with a table of dates to get the status in each period. The cross join "inflates" the status table so that you can evaluate the status for each period.
Here is an example:
CREATE TEMP TABLE table_name AS
SELECT 'member1' AS member
, '2020-01-01'::DATE AS member_effective_date
, '2020-04-27'::DATE AS member_expiry_date
;
WITH month_list
-- Month start and end for previous 12 months
AS (SELECT DATE_TRUNC('month',dt) AS month_start
, MAX(dt) AS month_end
FROM
-- List of the previous 365 dates
(SELECT DATE_TRUNC('day',SYSDATE) - (n * INTERVAL '1 day') AS dt
FROM
-- List of numbers from 1 to 365
(SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 365) )
GROUP BY month_start
)
SELECT extract(YEAR FROM b.month_start) AS year
, extract(MONTH FROM b.month_start) AS month
, CASE WHEN -- Effective before the month ended and
-- expired after the month started
(a.member_effective_date <= b.month_end
AND a.member_expiry_date > b.month_start)
THEN 'Yes'
ELSE 'No' END AS active
FROM table_name a
CROSS JOIN month_list b -- Explicit cartesian product
ORDER BY 1,2
;
| year | month | active|
|------|-------|-------|
| 2019 | 8 | No |
| 2019 | 9 | No |
| 2019 | 10 | No |
| 2019 | 11 | No |
| 2019 | 12 | No |
| 2020 | 1 | Yes |
| 2020 | 2 | Yes |
| 2020 | 3 | Yes |
| 2020 | 4 | Yes |
| 2020 | 5 | No |
| 2020 | 6 | No |
| 2020 | 7 | No |
| 2020 | 8 | No |
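The example above relies on SYSDATE and the stl_scan system table, which are Amazon Redshift features. In plain PostgreSQL the month list can come from generate_series instead, and the per-month flag can be turned into a count of active members. This is a sketch under assumptions: the year 2020 is hard-coded, the alias m and the column name active_members are made up, and a member counts as active in a month using the same comparison as above (effective before the month ends, expiry after the month starts).

```sql
-- Sketch for PostgreSQL: active members per month of 2020.
-- table_name and its columns come from the example above.
SELECT m.month_start::date AS month_start,
       count(t.member)     AS active_members   -- 0 for months with no active member
FROM generate_series(TIMESTAMP '2020-01-01',
                     TIMESTAMP '2020-12-01',
                     INTERVAL '1 month') AS m(month_start)
LEFT JOIN table_name AS t
       ON t.member_effective_date <  m.month_start + INTERVAL '1 month'
      AND t.member_expiry_date    >  m.month_start
GROUP BY m.month_start
ORDER BY m.month_start;
```

Because the join condition tests whether the membership period overlaps the month, the current value of a status column is never consulted, so expired memberships still count for the months in which they were valid.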

Aggregate counts for each date between date range

I want to find the counts for each date for each city where I have a date range specified by two columns start_date and end_date.
Suppose I have created a table with values like this.
create table abc (city varchar(30),start_date date , end_date date);
insert into abc values('a','2018-01-01','2018-01-03');
insert into abc values('b','2018-01-02','2018-01-05');
insert into abc values('a','2018-01-03','2018-01-06');
insert into abc values('b','2018-01-03','2018-01-03');
insert into abc values('a','2018-01-02','2018-01-02');
insert into abc values('b','2018-01-02','2018-01-05');
I wish to find what are the counts for city a and b on each date. Here it should show me this.
a, 2018-01-01,1
a, 2018-01-02,2
a, 2018-01-03,2
a, 2018-01-04,1
a, 2018-01-05,1
a, 2018-01-06,1
b, 2018-01-02,2
b, 2018-01-03,3
b, 2018-01-04,2
b, 2018-01-05,2
If it was a single date a group by would have done it.
Any help is appreciated.
Use the function generate_series(start, stop, step interval) to get all dates within ranges:
select city, date::date, count(*)
from abc
cross join generate_series(start_date, end_date, '1day'::interval) date
group by 1, 2
order by 1, 2
 city |    date    | count
------+------------+-------
 a    | 2018-01-01 |     1
 a    | 2018-01-02 |     2
 a    | 2018-01-03 |     2
 a    | 2018-01-04 |     1
 a    | 2018-01-05 |     1
 a    | 2018-01-06 |     1
 b    | 2018-01-02 |     2
 b    | 2018-01-03 |     3
 b    | 2018-01-04 |     2
 b    | 2018-01-05 |     2
(10 rows)
The cross join in the above query is a lateral join: the function is executed once for each row. Because Postgres allows set-returning functions in the select list, you can also phrase this as:
select city, generate_series(start_date, end_date, '1day'::interval)::date date, count(*)
from abc
group by 1, 2
order by 1, 2

Postgresql: Create a date sequence, use it in date range query

I'm not great with SQL but I have been making good progress on a project up to this point. Now I am completely stuck.
I'm trying to get a count for the number of apartments with each status. I want this information for each day so that I can trend it over time. I have data that looks like this:
table: y_unit_status
unit | date_occurred | start_date | end_date | status
1 | 2017-01-01 | 2017-01-01 | 2017-01-05 | Occupied No Notice
1 | 2017-01-06 | 2017-01-06 | 2017-01-31 | Occupied Notice
1 | 2017-02-01 | 2017-02-01 | | Vacant
2 | 2017-01-01 | 2017-01-01 | | Occupied No Notice
And I want to get output that looks like this:
date | occupied_no_notice | occupied_notice | vacant
2017-01-01 | 2 | 0 | 0
...
2017-01-10 | 1 | 1 | 0
...
2017-02-01 | 1 | 0 | 1
Or, this approach would work:
date | status | count
2017-01-01 | occupied no notice | 2
2017-01-01 | occupied notice | 0
date_occurred: Date when the status of the unit changed
start_date: Same as date_occurred
end_date: Date when status stopped being x and changed to y.
I am pulling in the number of bedrooms and a property id so the second approach of selecting counts for one status at a time would produce a relatively large number of rows vs. option 1 (if that matters).
I've found a lot of references that have gotten me close to what I'm looking for but I always end up with a sort of rolling, cumulative count.
Here's my query, which produces a column of dates and counts, which accumulate over time rather than reflecting a snapshot of counts for a particular day. You can see my references to another table where I'm pulling in a property id. The table schema is Property -> Unit -> Unit Status.
WITH t AS(
SELECT i::date from generate_series('2016-06-29', '2017-08-03', '1 day'::interval) i
)
SELECT t.i as date,
u.hproperty,
count(us.hmy) as count --us.hmy is the id
FROM t
LEFT OUTER JOIN y_unit_status us ON t.i BETWEEN us.dtstart AND
us.dtend
INNER JOIN y_unit u ON u.hmy = us.hunit -- to get property id
WHERE us.sstatus = 'Occupied No Notice'
AND t.i >= us.dtstart
AND t.i <= us.dtend
AND u.hproperty = '1'
GROUP BY t.i, u.hproperty
ORDER BY t.i
limit 1500
I also tried a FOR loop, iterating over the dates to determine cases where the date was between start and end but my logic wasn't working. Thanks for any insight!
You are on the right track, but you'll need to handle NULL values in end_date. If a NULL means that the status will change at some unknown point in the future, the containment operators (@> and <@) of the daterange type are perfect for you (because ranges can be "unbounded"):
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d, status, count(unit)
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') @> date_from + d
group by 1, 2
To achieve the first variant, you can use conditional aggregation:
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d,
count(unit) filter (where status = 'Occupied No Notice') occupied_no_notice,
count(unit) filter (where status = 'Occupied Notice') occupied_notice,
count(unit) filter (where status = 'Vacant') vacant
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') @> date_from + d
group by 1
Notes:
The filter (where <predicate>) syntax is available in PostgreSQL 9.4 and later. Before that, you can emulate it with CASE (relying on the fact that most aggregate functions ignore NULL values).
You can even index the expression daterange(start_date, end_date, '[]') (using gist) for better performance.
http://rextester.com/HWKDE34743
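The two notes above could look roughly like this in practice. This is a sketch, not part of the original answer: the index name y_unit_status_period_idx and the output column name day are made up for illustration, and the table and column names are the ones from the question.

```sql
-- Expression index on the date range, as suggested in the notes.
-- The index name is hypothetical.
create index y_unit_status_period_idx
    on y_unit_status
 using gist (daterange(start_date, end_date, '[]'));

-- Pre-9.4 emulation of FILTER: CASE yields NULL for non-matching rows,
-- and count() skips NULLs.
with params as (
  select date '2017-01-01' as date_from,
         date '2017-02-02' as date_to
)
select date_from + d as day,
       count(case when status = 'Occupied No Notice' then unit end) as occupied_no_notice,
       count(case when status = 'Occupied Notice'    then unit end) as occupied_notice,
       count(case when status = 'Vacant'             then unit end) as vacant
from params
cross join generate_series(0, date_to - date_from) as d
left join y_unit_status
       on daterange(start_date, end_date, '[]') @> (date_from + d)
group by 1
order by 1;
```

The join condition uses exactly the indexed expression, which is what allows the planner to consider the GiST index for the containment test.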

Grouping based on every N days in postgresql

I have a table that includes ID, date, values (temperature) and some other stuff. My table looks like this:
+-----+--------------+------------+
| ID | temperature | Date |
+-----+--------------+------------+
| 1 | 26.3 | 2012-02-05 |
| 2 | 27.8 | 2012-02-06 |
| 3 | 24.6 | 2012-02-07 |
| 4 | 29.6 | 2012-02-08 |
+-----+--------------+------------+
I want to perform aggregation queries like sum and mean for every 10 days.
I was wondering if it is possible in psql or not?
select
"date",
temperature,
avg(temperature) over(order by "date" rows 10 preceding) mean
from t
order by "date"
select id,
temperature,
sum(temperature) over (order by "date" rows between 10 preceding and current row)
from the_table;
This might not be exactly what you want: it computes a moving sum over the current row and the 10 preceding rows, which is not necessarily the same as the last 10 days.
Since Postgres 11, you can use a range frame based on an interval:
select id,
temperature,
avg(temperature) over (order by "date"
range between interval '10 days' preceding and current row)
from the_table;