PostgreSQL running totals with groups missing data and outer joins

I've written a SQL query that pulls data from a user table and produces a weekly count and a cumulative running total of when users were created. The data is grouped by week (using the windowing feature of Postgres). I'm using a left outer join to include the weeks when no users were created. Here is the query...
<!-- language: lang-sql -->
WITH reporting_period AS (
    SELECT generate_series(date_trunc('week', date '2015-04-02'),
                           date_trunc('week', date '2015-10-02'),
                           interval '1 week') AS interval
)
SELECT
    date(interval) AS interval
    , count(users.created_at) AS interval_count
    , sum(count(users.created_at)) OVER (ORDER BY date_trunc('week', users.created_at)) AS cumulative_count
FROM reporting_period
LEFT JOIN users
    ON interval = date(date_trunc('week', users.created_at))
GROUP BY interval, date_trunc('week', users.created_at)
ORDER BY interval
It works almost perfectly. The cumulative value is calculated properly for weeks when a user was created. For weeks when no user was created, it is set to the grand total and not the cumulative total up to that point.
Notice that in the rows marked with **, the Week Tot column (interval_count) is 0 as expected, but the Run Tot column (cumulative_count) is 1053, which equals the grand total.
Week       | Week Tot | Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 1053 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 1053 **
2015-05-11 | 0 | 1053 **
2015-05-18 | 1 | 30
2015-05-25 | 0 | 1053 **
...
2015-06-08 | 996 | 1031
...
2015-09-07 | 2 | 1052
2015-09-14 | 0 | 1053 **
2015-09-21 | 1 | 1053 **
2015-09-28 | 0 | 1053 **
This is what I would like:
Week       | Week Tot | Run Tot
-----------------------------------
2015-03-30 | 4 | 4
2015-04-06 | 13 | 17
2015-04-13 | 0 | 17 **
2015-04-20 | 9 | 26
2015-04-27 | 3 | 29
2015-05-04 | 0 | 29 **
...
It seems to me that if the outer join can somehow supply the grand total in the last column, it should be possible to supply the current running total instead, but I'm at a loss on how to do it.
Is this possible?

This is not guaranteed to work out of the box, as I haven't tested it on actual tables, but the key here is to join users on created_at over a range of dates.
with reportingperiod as (
    select intervaldate as interval_begin,
           intervaldate + interval '1 month' as interval_end
    from (
        select generate_series(date(date_trunc('day', date '2015-03-15')),
                               date(date_trunc('day', date '2015-10-15')),
                               interval '1 month') as intervaldate
    ) as rp
)
select interval_end,
       interval_count,
       sum(interval_count) over (order by interval_end) as running_sum
from (
    select interval_end,
           count(u.created_at) as interval_count
    from reportingperiod rp
    left join (
        select created_at
        from users
        where created_at < '2015-10-02'
    ) u on u.created_at > rp.interval_begin
       and u.created_at <= rp.interval_end
    group by interval_end
) q

I figured it out. The trick was subqueries. (The root problem with the original query: for weeks with no users, users.created_at is NULL, so the window's ORDER BY key is NULL; all of those rows sort together at the end as peers, and the running sum for that peer group is the grand total.) Here's my approach:
1. Add a count column to the generate_series call with a default value of 0
2. Select interval and count(users.created_at) from the users data
3. Union the generate_series result and the result from the select in step 2 (at this point the result will have duplicates for each interval)
4. Use the results in a subquery to get interval and max(interval_count), which eliminates the duplicates
5. Use the window aggregation as before to get the running total
SELECT
    interval
    , interval_count
    , SUM(interval_count) OVER (ORDER BY interval) AS cumulative_count
FROM (
    SELECT interval, MAX(interval_count) AS interval_count
    FROM (
        SELECT GENERATE_SERIES(DATE(DATE_TRUNC('week', DATE '2015-04-02')),
                               DATE(DATE_TRUNC('week', DATE '2015-10-02')),
                               interval '1 week') AS interval,
               0 AS interval_count
        UNION
        SELECT DATE_TRUNC('week', users.created_at) AS interval,
               COUNT(users.created_at) AS interval_count
        FROM users
        WHERE users.created_at < DATE '2015-10-02'
        GROUP BY 1 ORDER BY 1
    ) sub1
    GROUP BY interval
) grouped_data
I'm not sure if there are any serious performance issues with this approach but it seems to work. If anyone has a better, more elegant or performant approach I would love the feedback.
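For what it's worth, here is an untested alternative sketch (not the posted solution): pre-aggregate users per week in a CTE, then left join to the generated weeks and run the window over COALESCEd counts, which avoids the UNION/MAX step entirely:
WITH weeks AS (
    SELECT generate_series(date_trunc('week', date '2015-04-02'),
                           date_trunc('week', date '2015-10-02'),
                           interval '1 week')::date AS week
), weekly AS (
    SELECT date_trunc('week', created_at)::date AS week,
           count(*) AS interval_count
    FROM users
    WHERE created_at < date '2015-10-02'
    GROUP BY 1
)
SELECT w.week,
       COALESCE(wk.interval_count, 0) AS interval_count,
       SUM(COALESCE(wk.interval_count, 0)) OVER (ORDER BY w.week) AS cumulative_count
FROM weeks w
LEFT JOIN weekly wk ON wk.week = w.week
ORDER BY w.week;
Because the weeks CTE supplies every week and the window orders by it, empty weeks stay at 0 and the running total never jumps to the grand total.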
Edit: My solution doesn't work when trying to group by arbitrary time windows. I just tried it with the following changes:
/* generate series using DATE_TRUNC('day'...)*/
SELECT GENERATE_SERIES(DATE(DATE_TRUNC('day', DATE '2015-04-02')),
DATE(DATE_TRUNC('day', DATE '2015-10-02')), interval '1 month') AS interval,
0 AS interval_count
/* And this part */
SELECT DATE_TRUNC('day', users.created_at) AS INTERVAL,
COUNT(users.created_at) AS interval_count FROM users
WHERE users.created_at < date '2015-10-02'
GROUP BY 1 ORDER BY 1
For example, is it possible to produce similar results but with the data grouped into intervals like so (see the sketch after this list)?
3/15/15 - 4/14/15,
4/15/15 - 5/14/15,
5/15/15 - 6/14/15
etc.
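A hedged sketch of how the range-join idea from the first answer above could produce exactly those mid-month-to-mid-month windows (untested):
WITH rp AS (
    SELECT intervaldate AS interval_begin,
           intervaldate + interval '1 month' AS interval_end
    FROM generate_series(timestamp '2015-03-15', timestamp '2015-09-15',
                         interval '1 month') AS intervaldate
)
SELECT rp.interval_begin::date AS range_start,
       count(u.created_at) AS interval_count,
       sum(count(u.created_at)) OVER (ORDER BY rp.interval_begin) AS cumulative_count
FROM rp
LEFT JOIN users u ON u.created_at >= rp.interval_begin
                 AND u.created_at <  rp.interval_end
GROUP BY rp.interval_begin
ORDER BY rp.interval_begin;
Since interval_begin is never NULL, the window's ORDER BY key is well-defined even for empty ranges, so the running total stays correct.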

Related

How to Shorten Execution Time for A View

I have 3 tables: a user table, an admin table, and a cust table. Both the admin and cust tables are foreign-keyed to the user table. Basically, every user has a user record, and the type of user they are is determined by whether they have a record in the admin or the cust table.
user        admin                  cust
user_id     user_id | admin_id     user_id | cust_id
---------   ---------|----------   ---------|---------
1           1        | a           2        | dd
2           4        | b           3        | ff
3
4
Then I have a login_history table that records the user_id and a login timestamp every time a user logs into the app:
login_history
user_id | login_on
---------|---------------------
1        | 2022-01-01 13:22:43
1        | 2022-01-02 16:16:27
3        | 2022-01-05 21:17:52
2        | 2022-01-11 11:12:26
3        | 2022-01-12 03:34:47
I would like to create a view that contains all dates for the first day of each week of the year starting from Jan 1st, with a count of unique admin users that logged in during that week and a count of unique cust users that logged in during that week. So the resulting view should contain the following 53 records, one for each week:
login_counts_view
week_start_date | admin_count | cust_count
-----------------|-------------|------------
2022-01-01       | 1           | 1
2022-01-08       | 0           | 2
2022-01-15       | 0           | 0
...
2022-12-31       | 0           | 0
Note that the first week (2022-01-01) only has 1 count for admin_count even though the admin with user_id 1 logged in twice that week.
Below is the current query I have for the view. However, the tables are pretty large and it takes over 10 seconds to retrieve all records from the view, mainly because of the left joined date comparisons.
CREATE VIEW login_counts_view AS
SELECT
week_start_dates.week_start_date::text AS week_start_date,
count(distinct a.user_id) AS admin_count,
count(distinct c.user_id) AS cust_count
FROM (
SELECT
to_char(i::date, 'YYYY-MM-DD') AS week_start_date
FROM
generate_series(date_trunc('year', NOW()), to_char(NOW(), 'YYYY-12-31')::date, '1 week') i
) week_start_dates
LEFT JOIN login_history l ON l.login_on::date BETWEEN week_start_dates.week_start_date::date AND (week_start_dates.week_start_date::date + INTERVAL '6 day')::date
LEFT JOIN admin a ON a.user_id = l.user_id
LEFT JOIN cust c ON c.user_id = l.user_id
GROUP BY week_start_date;
Does anyone have any tips as to how to make this query execute more efficiently?
Idea
Compute the pseudo-week of each login date: partition the year into 7-day slices and number them consecutively. The pseudo-week of a given date would be the ordinal number of the slice it falls into.
Then operate the joins on integers representing the pseudo-weeks instead of date values and comparisons.
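For instance, here is a worked check of the formula used in the implementation below:
-- 2022-01-12 is day 11 (0-based) of its year, and 11 / 7 = 1,
-- so it falls into pseudo-week 1 (the second 7-day slice)
SELECT (CAST(EXTRACT(EPOCH FROM timestamp '2022-01-12 03:34:47') AS INTEGER)
      - CAST(EXTRACT(EPOCH FROM date_trunc('year', timestamp '2022-01-12 03:34:47')) AS INTEGER))
      / (3600 * 24 * 7) AS pseudo_week;   -- returns 1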
Implementation
A view to implement this follows:
CREATE VIEW login_counts_view_fast AS
WITH RECURSIVE Numbers(i) AS (
    SELECT 0
    UNION ALL
    SELECT i + 1 FROM Numbers WHERE i < 52
)
SELECT CAST ( date_trunc('year', NOW()) AS DATE) + 7 * n.i week_start_date
, count(distinct lw.admin_id) admin_count
, count(distinct lw.cust_id) cust_count
FROM (
SELECT i FROM Numbers
) n
LEFT JOIN (
SELECT admin_id
, cust_id
, base
, pit
, pit-base delta
, (pit-base) / (3600 * 24 * 7) week
FROM (
SELECT a.user_id admin_id
, c.user_id cust_id
, CAST ( EXTRACT ( EPOCH FROM l.login_on ) AS INTEGER ) pit
, CAST ( EXTRACT ( EPOCH FROM date_trunc('year', NOW()) ) AS INTEGER ) base
FROM login_history l
LEFT JOIN admin a ON a.user_id = l.user_id
LEFT JOIN cust c ON c.user_id = l.user_id
) le
) lw
ON lw.week = n.i
GROUP BY n.i
;
Some remarks:
The epoch values are the number of seconds elapsed since an absolute base datetime (specifically 1970-01-01 00:00).
CASTs are necessary to convert doubles to integers and timestamps to dates, as mandated by the signatures of the PostgreSQL date functions, and to enforce integer arithmetic.
The recursive subquery is a generator of consecutive integers. It could possibly be replaced by a generate_series call (untested)
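For instance, that replacement might look like this (untested sketch):
-- Same 0..52 generator as the recursive CTE above, without recursion
SELECT i FROM generate_series(0, 52) AS g(i);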
Evaluation
See it in action in this db fiddle
The query plan indicates savings of 50-70% in execution time.
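One more possibility, not benchmarked here, and assuming login_on is a timestamp without time zone (the ::date cast is not immutable for timestamptz, so the index would fail to build there): an expression index matching the cast used in the original view's join predicate.
-- Hypothetical helper index for the original view's l.login_on::date comparisons
CREATE INDEX idx_login_history_login_date ON login_history ((login_on::date));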

Combine generate series and count into one query

Postgres version 9.4.18, PostGIS Version 2.2.
I removed some of the details about the tables from this question because I doubt it's needed to answer the question. I can add those details back if necessary.
Desired result:
I want a total count for each week of year and hour of day (0100 to 5223). I'm able to successfully generate a series of 0100 to 5223 (actually up to 5300), and I'm able to get a total count for each week of year and hour of day individually, but I'm unable to combine the queries so that weeks of year/hours of day with a zero count still show up. I want to combine the count result with the generate_series (and ideally divide that result by 30) to get something like below:
wwhh  | count_not_zero | count_not_zero_divided_by_30
-------+----------------+------------------------------
0100  | 10             | 33.3
0101  | 0              | 0
0102  | 0              | 0
...
0123  | 0              | 0
0200  | 3              | 10
0201  | 10             | 33.3
...
5223  | 20             | 66.6
Here are my individual queries that work...that I want to combine:
SELECT DISTINCT f_woyhh(d::timestamp) as woyhh
FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31', interval '1 hour') d
GROUP BY woyhh
ORDER by woyhh asc;
SELECT dt, count(*)
FROM (SELECT f_woyhh((time)::timestamp at time zone 'utc' at time zone 'america/chicago') AS dt,
             EXTRACT(YEAR FROM time) AS ctYear,
             count(*) AS ct
      FROM counties c
      INNER JOIN ltg_data d ON ST_Contains(c.the_geom, d.ltg_geom)
      WHERE countyname = 'Milwaukee' AND state = 'WI' AND EXTRACT(YEAR FROM time) > '1987'
      GROUP BY dt, EXTRACT(YEAR FROM time)) AS count
GROUP BY dt;
The result from the second query above is (and skips zero count dt, which I don't want):
dt | count
-------+-------
0100 | 10
0104 | 5
0108 | 4
...
Conclusion:
I'm trying to combine the above working individual queries into a single query that provides a three-column result: woyhh, count, and count divided by 30. And I want to include woyhh values that have a zero count, so that I have a complete set of woyhh.
Thanks for any help!!
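(f_woyhh isn't defined in the question; to follow along, a plausible stand-in, purely an assumption on my part, could zero-pad ISO week and hour like this:)
-- Hypothetical stand-in for the question's f_woyhh: returns 'wwhh', e.g. '0100'..'5323'
CREATE OR REPLACE FUNCTION f_woyhh(t timestamp) RETURNS text AS $$
    SELECT to_char(t, 'IW') || to_char(t, 'HH24');
$$ LANGUAGE sql IMMUTABLE;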
I found the answer. I'll be posting it tomorrow, but I wanted to put this on today so no one unnecessarily works on this question. I apologize for the formatting.
WITH CTE_Dates AS (
    SELECT DISTINCT f_woyhh(d::timestamp) AS dt
    FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31', interval '1 hour') d
),
CTE_WeeklyHourlyCounts AS (
    SELECT dt, count(*) AS ct
    FROM (SELECT f_woyhh((time)::timestamp at time zone 'utc' at time zone 'america/chicago') AS dt,
                 EXTRACT(YEAR FROM time) AS ctYear,
                 count(*) AS ct
          FROM counties c
          INNER JOIN ltg_data d ON ST_Contains(c.the_geom, d.ltg_geom)
          WHERE countyname = 'Milwaukee' AND state = 'WI' AND EXTRACT(YEAR FROM time) > '1987'
          GROUP BY dt, EXTRACT(YEAR FROM time)) AS count
    GROUP BY dt
),
CTE_FullStats AS (
    SELECT CTE_Dates.dt AS dt,
           CAST(CTE_WeeklyHourlyCounts.ct AS decimal) AS ct
    FROM CTE_Dates
    LEFT JOIN CTE_WeeklyHourlyCounts ON CTE_WeeklyHourlyCounts.dt = CTE_Dates.dt
    GROUP BY CTE_Dates.dt, CTE_WeeklyHourlyCounts.ct, CTE_WeeklyHourlyCounts.dt
)
SELECT dt,
       COALESCE(ct, 0) AS count,
       round(((COALESCE(ct, 0) * 100) / 30), 0) AS percent
FROM CTE_FullStats
GROUP BY dt, ct
ORDER BY dt;

Select query for fetching data on the interval of 4 hours

I am using a Postgres 8.1 database and I want to write a query which selects data in intervals of 4 hours.
The (omitted) image showed subscriber_id values with dates; that is how the data is currently available in the database. I want data like:
No. of Subscriber | Interval
-------------------+----------
0                 | 0-4
0                 | 4-8
7                 | 8-12
1                 | 12-16
0                 | 16-20
0                 | 20-24
Basically, each day has 24 hours; dividing 24/4 = 6 means I have a total of 6 intervals for each day:
0-4
4-8
8-12
12-16
16-20
20-24
So I need the count of subscribers within these intervals. Is there any date function in Postgres which solves my problem, or how can I write a query for this?
NOTE: Please write your solution for Postgres version 8.1.
Use generate_series() to generate periods and left join date_time with appropriate periods, e.g.:
with my_table(date_time) as (
values
('2016-10-24 11:10:00'::timestamp),
('2016-10-24 11:20:00'),
('2016-10-24 15:10:00'),
('2016-10-24 21:10:00')
)
select
  format('%s-%s', p, p+4) as "interval",
  sum((date_time notnull)::int) as "no of subscriber"
from generate_series(0, 20, 4) p
left join my_table
  on extract(hour from date_time) >= p
 and extract(hour from date_time) < p + 4
group by p
order by p;
interval | no of subscriber
----------+------------------
0-4 | 0
4-8 | 0
8-12 | 2
12-16 | 1
16-20 | 0
20-24 | 1
(6 rows)
I wouldn't suppose that there is a live guy who remembers version 8.1. You can try:
create table periods(p integer);
insert into periods values (0),(4),(8),(12),(16),(20);
select
p as "from", p+4 as "to",
sum((date_time notnull)::int) as "no of subscriber"
from periods
left join my_table
on extract(hour from date_time) >= p
and extract(hour from date_time) < p + 4
group by p
order by p;
from | to | no of subscriber
------+----+------------------
0 | 4 | 0
4 | 8 | 0
8 | 12 | 2
12 | 16 | 1
16 | 20 | 0
20 | 24 | 1
(6 rows)
In Postgres, you can do this by generating all the intervals for your time periods. This is a little tricky, because you have to pick out the dates in your data. However, generate_series() is really helpful.
The rest is just a left join and aggregation:
select dt.dt, count(t.t)
from (select generate_series(min(d.dte), max(d.dte) + interval '23 hour', interval '4 hour') as dt
from (select distinct date_trunc('day', t.t)::date as dte from t) d
) dt left join
t
on t.t >= dt.dt and t.t < dt.dt + interval '4 hour'
group by dt.dt
order by dt.dt;
Note that this keeps the period as the date/time of the beginning of the period. You can readily convert this to a date and an interval number, if that is more helpful.
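For instance, the conversion might look like this (a sketch, untested):
-- Convert each 4-hour period start to its date plus an interval ordinal (0-5)
SELECT dt::date AS day,
       extract(hour from dt)::int / 4 AS interval_no
FROM (SELECT generate_series(timestamp '2016-10-24 00:00', timestamp '2016-10-24 20:00',
                             interval '4 hour') AS dt) s;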
The previous solutions also work. Adding one more option: instead of creating a table for periods, we can also use an array and the unnest function in this query.
My code is:
select
  p as "from", p+4 as "to",
  sum((date_time notnull)::int) as "no of subscriber"
from unnest(ARRAY[0,4,8,12,16,20]) as p
left join my_table
  on extract(hour from date_time) >= p
 and extract(hour from date_time) < p + 4
group by p
order by p;
I think it will be better if you run six different queries, since you know the time intervals (lower and upper limits).

Break into multiple rows based on date range of a single row

I have a table which captures appointments; some are single-day appointments and some are multi-day, so the data looks like:
AppointmentId  StartDate   EndDate
9              2017-04-12  2017-04-12
10             2017-05-01  2017-05-03
11             2017-06-01  2017-06-01
I want to split the multi-day appointments into single days, so the result I am trying to achieve is like:
AppointmentId  StartDate   EndDate
9              2017-04-12  2017-04-12
10             2017-05-01  2017-05-01
10             2017-05-02  2017-05-02
10             2017-05-03  2017-05-03
11             2017-06-01  2017-06-01
So I have split appointment id 10 into multiple rows. I checked a few other questions, like here, but those split based on a single given start date and end date, not based on table data.
You can use a Calendar or dates table for this sort of thing.
For only 152kb in memory, you can have 30 years of dates in a table with this:
/* dates table */
declare @fromdate date = '20000101';
declare @years int = 30;
/* 30 years, 19 used data pages ~152kb in memory, ~264kb on disk */
;with n as (select n from (values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) t(n))
select top (datediff(day, @fromdate, dateadd(year, @years, @fromdate)))
[Date]=convert(date,dateadd(day,row_number() over(order by (select 1))-1,@fromdate))
into dbo.Dates
from n as deka cross join n as hecto cross join n as kilo
cross join n as tenK cross join n as hundredK
order by [Date];
create unique clustered index ix_dbo_Dates_date
on dbo.Dates([Date]);
Without taking the actual step of creating a table, you can use it inside a common table expression with just this:
declare @fromdate date = '20161229';
declare @thrudate date = '20170103';
;with n as (select n from (values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) t(n))
, dates as (
select top (datediff(day, @fromdate, @thrudate)+1)
[Date]=convert(date,dateadd(day,row_number() over(order by (select 1))-1,@fromdate))
from n as deka cross join n as hecto cross join n as kilo
cross join n as tenK cross join n as hundredK
order by [Date]
)
select [Date]
from dates;
Use either like so:
select
t.AppointmentId
, StartDate = d.date
, EndDate = d.date
from dates d
inner join appointments t
on d.date >= t.StartDate
and d.date <= t.EndDate
rextester demo: http://rextester.com/TNWQ64342
returns:
+---------------+------------+------------+
| AppointmentId | StartDate | EndDate |
+---------------+------------+------------+
| 9 | 2017-04-12 | 2017-04-12 |
| 10 | 2017-05-01 | 2017-05-01 |
| 10 | 2017-05-02 | 2017-05-02 |
| 10 | 2017-05-03 | 2017-05-03 |
| 11 | 2017-06-01 | 2017-06-01 |
+---------------+------------+------------+
Number and Calendar table reference:
Generate a set or sequence without loops - 1 - Aaron Bertrand
Generate a set or sequence without loops - 2 - Aaron Bertrand
Generate a set or sequence without loops - 3 - Aaron Bertrand
The "Numbers" or "Tally" Table: What it is and how it replaces a loop - Jeff Moden
Creating a Date Table/Dimension in SQL Server 2008 - David Stein
Calendar Tables - Why You Need One - David Stein
Creating a date dimension or calendar table in SQL Server - Aaron Bertrand
T-SQL Function to Determine Holidays in SQL Server - Aaron Bertrand
F_table_date - Michael Valentine Jones
Clearly a Calendar/Tally table would be the way to go, as SqlZim illustrated (+1); however, you can also use an ad-hoc tally table with a CROSS APPLY.
Example
Select A.AppointmentId
,StartDate = B.D
,EndDate = B.D
From YourTable A
Cross Apply (
Select Top (DateDiff(DD,A.StartDate,A.EndDate)+1) D=DateAdd(DD,-1+Row_Number() Over (Order By Number),A.StartDate)
From master..spt_values
) B
Returns
AppointmentId  StartDate   EndDate
9              2017-04-12  2017-04-12
10             2017-05-01  2017-05-01
10             2017-05-02  2017-05-02
10             2017-05-03  2017-05-03
11             2017-06-01  2017-06-01
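For comparison, in PostgreSQL (the flavor of the rest of this page) the same split can be sketched with generate_series; this assumes an appointments table shaped like the question's:
-- Expand each appointment to one row per day in its range (untested sketch)
SELECT a.AppointmentId,
       d::date AS StartDate,
       d::date AS EndDate
FROM appointments a
CROSS JOIN LATERAL generate_series(a.StartDate::timestamp,
                                   a.EndDate::timestamp,
                                   interval '1 day') AS d;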

Creating sequence of dates and inserting each date into query

I need to find certain data from the first day of the current month to the last day of the current month.
select count(*) from q_aggr_data as a
where a.filial_='fil1'
and a.operator_ like 'unit%'
and date_trunc('day',a.s_end_)='"+ date_to_search+ "'
group by a.s_name_,date_trunc('day',a.s_end_)
date_to_search here is 01.09.2014, 02.09.2014, 03.09.2014, ..., 30.09.2014.
I've tried to loop through i = 0...30 and make 30 queries, but that takes too long and is extremely naive. Also, for days where there is no entry it should return 0. I've seen how to generate date sequences, but can't get my head around how to inject those days one by one into the query.
By creating not only a series but a set of 1-day ranges, any timestamp data can be joined to a range using >= with <.
Note in particular that this approach avoids functions on the data (such as truncating to date), and because of this it permits the use of indexes to assist query performance.
If some data looked like this:
CREATE TABLE my_data
("data_dt" timestamp)
;
INSERT INTO my_data
("data_dt")
VALUES
('2014-09-01 08:24:00'),
('2014-09-01 22:48:00'),
('2014-09-02 13:12:00'),
('2014-09-03 03:36:00'),
('2014-09-03 18:00:00');
-- (further sample rows trimmed; see the SQLFiddle demo below)
That can then be joined to a generated set of ranges (dt_start & dt_end pairs), using an outer join so unmatched ranges are still reported:
SELECT
r.dt_start
, count(d.data_dt)
FROM (
SELECT
dt_start
, dt_start + INTERVAL '1 Day' dt_end
FROM
generate_series('2014-09-01 00:00'::timestamp,
'2014-09-30 00:00', '1 Day') AS dt_start
) AS r
LEFT OUTER JOIN my_data d ON d.data_dt >= r.dt_start
AND d.data_dt < r.dt_end
GROUP BY
r.dt_start
ORDER BY
r.dt_start
;
and a result such as this is produced:
| DT_START | COUNT |
|----------------------------------|-------|
| September, 01 2014 00:00:00+0000 | 2 |
| September, 02 2014 00:00:00+0000 | 1 |
| September, 03 2014 00:00:00+0000 | 2 |
| September, 04 2014 00:00:00+0000 | 2 |
...
| September, 29 2014 00:00:00+0000 | 0 |
| September, 30 2014 00:00:00+0000 | 0 |
See this SQLFiddle demo
One way to solve this problem is to group by truncated date.
select count(*)
from q_aggr_data as a
where a.filial_='fil1'
and a.operator_ like 'unit%'
group by date_trunc('day',a.s_end_), a.s_name_;
The other way is to use a window function, for example to get the count over the truncated date.
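A minimal sketch of that window-function variant (untested; table and column names taken from the question):
-- One row per day with its count, computed without collapsing rows via GROUP BY
SELECT DISTINCT date_trunc('day', a.s_end_) AS day,
       count(*) OVER (PARTITION BY date_trunc('day', a.s_end_)) AS day_count
FROM q_aggr_data a
WHERE a.filial_ = 'fil1'
  AND a.operator_ LIKE 'unit%';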
Please check if this query satisfies your requirements:
select sum(matched) -- include s_name_, s_end_ if you want to verify the results
from
(select a.filial_
, a.operator_
, a.s_name_
, generate_series s_end_
, (case when a.filial_ = 'fil1' then 1 else 0 end) as matched
from q_aggr_data as a
right join generate_series('2014-09-01', '2014-09-30', interval '1 day')
on a.s_end_ = generate_series
and a.filial_ = 'fil1'
and a.operator_ like 'unit%') aa
group by s_name_, s_end_
order by s_end_, s_name_
http://sqlfiddle.com/#!15/e8edf/3