Grouping by date, with 0 when count() yields no lines - postgresql

I'm using PostgreSQL 9 and I'm struggling with counting and grouping when there are no rows to count.
Let's assume the following schema:
create table views (
date_event timestamp with time zone,
event_id integer
);
Let's imagine the following content :
2012-01-01 00:00:05 2
2012-01-01 01:00:05 5
2012-01-01 03:00:05 8
2012-01-01 03:00:15 20
I want to group by hour, and count the number of lines. I wish I could retrieve the following :
2012-01-01 00:00:00 1
2012-01-01 01:00:00 1
2012-01-01 02:00:00 0
2012-01-01 03:00:00 2
2012-01-01 04:00:00 0
2012-01-01 05:00:00 0
.
.
2012-01-07 23:00:00 0
I mean that for each time slot, I count the rows in my table whose date falls in that slot; if there are none, I still return a row with a count of zero.
The following will definitely not work (it yields only slots with a count > 0).
SELECT extract ( hour from date_event ),count(*)
FROM views
where date_event > '2012-01-01' and date_event <'2012-01-07'
GROUP BY extract ( hour from date_event );
Please note I might also need to group by minute, or by hour, or by day, or by month, or by year (multiple queries are possible, of course).
I can only use plain old SQL, and since my views table can be very big (>100M records), I try to keep performance in mind.
How can this be achieved?
Thank you !

Given that you don't have the dates in the table, you need a way to generate them. You can use the generate_series function:
SELECT * FROM generate_series('2012-01-01'::timestamp, '2012-01-07 23:00', '1 hour') AS ts;
This will produce results like this:
ts
---------------------
2012-01-01 00:00:00
2012-01-01 01:00:00
2012-01-01 02:00:00
2012-01-01 03:00:00
...
2012-01-07 21:00:00
2012-01-07 22:00:00
2012-01-07 23:00:00
(168 rows)
The remaining task is to join the two selects using an outer join like this:
select extract(day from ts) as day, extract(hour from ts) as hour, coalesce(cnt.count, 0) as count
from (
-- count the rows that exist, per day and hour (range matches the generated series)
select extract(day from date_event) as day, extract(hour from date_event) as hr, count(*) as count
from views
where date_event >= '2012-01-01' and date_event < '2012-01-08'
group by extract(day from date_event), extract(hour from date_event)
) as cnt
-- keep every generated hour, even when no count matches it
-- (joining on day-of-month and hour is only safe while the range stays within one month)
right outer join generate_series('2012-01-01'::timestamp, '2012-01-07 23:00', '1 hour') as ts
on extract(hour from ts) = cnt.hr and extract(day from ts) = cnt.day
order by day, hour;
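A possible variation, sketched under the assumption that the views table and date_event column are exactly as in the question: truncating with date_trunc instead of extracting day and hour separately keeps the join correct across month boundaries, and switching granularity means changing only the 'hour' argument and the step interval. The aliases s, c, slot and cnt are illustrative names, and the match relies on the session time zone having a whole-hour offset.

```sql
-- Aggregate first (one pass over views), then left join onto the generated slots.
SELECT s.ts AS slot, coalesce(c.cnt, 0) AS count
FROM generate_series(timestamptz '2012-01-01',
                     timestamptz '2012-01-07 23:00',
                     interval '1 hour') AS s(ts)
LEFT JOIN (
    SELECT date_trunc('hour', date_event) AS slot, count(*) AS cnt
    FROM views
    WHERE date_event >= '2012-01-01' AND date_event < '2012-01-08'
    GROUP BY 1
) AS c ON c.slot = s.ts
ORDER BY s.ts;
```

Because the subquery filters on date_event directly, an index on that column can still be used for the range condition.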

This query gives the count per hour, but only for hours that contain at least one row:
select to_char(date_event, 'YYYY-MM-DD HH24:00') as hour_start, count(*) as count from views where date(date_event) >= '2012-01-01' and date(date_event) <= '2012-01-07' group by hour_start order by hour_start;

Related

generate series using break time

I have a table that store opening hour and closing hour
CREATE TABLE public.open_hours
(
id bigint NOT NULL,
open_hour character varying(255),
end_hour character varying(255),
day character varying(255),
CONSTRAINT pk_open_hour_id PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.open_hours
OWNER TO postgres;
I have another table that stores the break hours
CREATE TABLE public.break_hours
(
id bigint ,
start_time character varying(255),
end_time character varying(255),
open_hour_id bigint ,
CONSTRAINT break_hours_pkey PRIMARY KEY (id),
CONSTRAINT fkinhl5x01pnn54nv15ol5ntxr5 FOREIGN KEY (open_hour_id )
REFERENCES public.open_hours(id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.break_hours
OWNER TO postgres;
I need to generate a time series at 30-minute intervals that skips the break times.
For example: if my open hour is 08:00 AM, my end hour is 06:00 PM, and I have one break from 11:00 AM to 11:30 AM and another from 03:00 PM to 03:15 PM, then I need to generate series from 08:00 AM to 11:00 AM, from 11:30 AM to 03:00 PM, and from 03:15 PM to 06:00 PM.
sample data
open_hours
-----------
id open_hour end_hour day
1 08:00 AM 06:00 PM Monday
break_hours
id start_time end_time open_hour_id
1 11:00 AM 11:30 AM 1
2 03:00 PM 03:15 PM 1
Sample output
--------------
08:00 AM
08:30 AM
09:00 AM
09:30 AM
10:00 AM
10:30 AM
11:30 AM
12:00 PM
12:30 PM
01:00 PM
01:30 PM
02:00 PM
02:30 PM
03:15 PM
03:45 PM
04:15 PM
04:45 PM
05:15 PM
Query used for generating series between open hours is
SELECT DISTINCT gs AS start_time,gs + interval '30min' as end_time
FROM generate_series( timestamp '2018-11-09 08:00 AM', timestamp '2018-11-09 06:00 PM', interval '30min' )gs
ORDER BY start_time
It seems that your table modelling should be cleaned up. For example, you should not store times as text types but as time without time zone.
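A minimal sketch of that cleanup, assuming every stored string is a valid time literal such as '08:00 AM' (the cast fails on anything else, so it is worth trying on a copy of the tables first):

```sql
-- Convert the text columns to time; USING tells PostgreSQL how to cast existing values.
ALTER TABLE public.open_hours
    ALTER COLUMN open_hour TYPE time without time zone USING open_hour::time,
    ALTER COLUMN end_hour  TYPE time without time zone USING end_hour::time;

ALTER TABLE public.break_hours
    ALTER COLUMN start_time TYPE time without time zone USING start_time::time,
    ALTER COLUMN end_time   TYPE time without time zone USING end_time::time;
```

The queries in this answer add a date to these columns, which only works once they are of type time.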
demo: db<>fiddle
WITH hours AS (
SELECT
oh.open_hour + '1970-01-01'::date as open_hour,
oh.end_hour + '1970-01-01'::date as end_hour,
bh.start_time + '1970-01-01'::date as break_start,
bh.end_time + '1970-01-01'::date as break_end,
lead(start_time + '1970-01-01'::date) OVER (ORDER BY start_time) as next_start_time
FROM open_hours oh
LEFT JOIN break_hours bh
ON oh.id = bh.open_hour_id
)
SELECT generate_series(open_hour, break_start, interval '30 minutes')::time as time_slot
FROM (
SELECT
open_hour, break_start
FROM hours
ORDER BY break_start
LIMIT 1
)s
UNION
SELECT
generate_series(break_end, next_start_time, interval '30 minutes')::time
FROM (
SELECT
break_end, next_start_time
FROM
hours
WHERE next_start_time IS NOT NULL
) s
UNION
SELECT generate_series(break_end, end_hour, interval '30 minutes')::time
FROM (
SELECT
break_end, end_hour
FROM hours
ORDER BY break_start DESC
LIMIT 1
) s
Explanation:
WITH clause (CTE):
Merging both tables. I add an arbitrary date because that turns each time into a timestamp: generate_series, used later, works on timestamps but not on type time. The date part is cut away after the generation with the ::time cast.
The result of the CTE is:
open_hour end_hour break_start break_end next_start_time
1970-01-01 08:00:00 1970-01-01 18:00:00 1970-01-01 09:30:00 1970-01-01 09:45:00 1970-01-01 11:00:00
1970-01-01 08:00:00 1970-01-01 18:00:00 1970-01-01 11:00:00 1970-01-01 11:30:00 1970-01-01 15:00:00
1970-01-01 08:00:00 1970-01-01 18:00:00 1970-01-01 15:00:00 1970-01-01 15:15:00 (NULL)
UNION part:
This part contains three subparts. Because I have to merge the time series from both tables:
1. Taking the opening hour. Generate a time series to the first break beginning.
For this I only need the first row from the CTE above. That's why LIMIT 1 is used.
2. For all breaks: Generate a time series from current break ending to the next break beginning.
The CTE contains a window function lead() which shifts the start_time of the next row into the current one (have a look at the last column of the CTE result). So now I am able to get all break times, no matter how many there are. In my example I added a third break from 9:30 to 9:45 to demonstrate it. So the next time series can be generated from all these columns (current break_end to next_start_time). Only the last row does not contain a next_start_time because there is none.
3. Last step: Generate a time series from the last break ending to the closing hour.
This is quite similar to (1). After iterating over all break times I have to add the last time series, from the end of the last break to the closing time. This can be achieved either by filtering for the row without a next_start_time or by sorting DESC and using LIMIT 1, as I did.
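To see what lead() contributes in isolation, here is a small self-contained sketch that uses inline sample values (the three breaks from the CTE result above) instead of the real tables; the alias b is only illustrative:

```sql
-- lead() pulls the start_time of the following row (in start_time order)
-- into the current row; the last row has no successor and gets NULL.
SELECT start_time,
       end_time,
       lead(start_time) OVER (ORDER BY start_time) AS next_start_time
FROM (VALUES (time '09:30', time '09:45'),
             (time '11:00', time '11:30'),
             (time '15:00', time '15:15')) AS b(start_time, end_time)
ORDER BY start_time;
```

Each row then carries both ends of the gap between two breaks (end_time to next_start_time), which is exactly the range the second UNION part feeds to generate_series.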
More complex case with more day types:
demo: db<>fiddle
WITH hours AS (
SELECT
oh.id as day_id,
oh.open_hour + '1970-01-01'::date as open_hour,
oh.end_hour + '1970-01-01'::date as end_hour,
bh.start_time + '1970-01-01'::date as break_start,
bh.end_time + '1970-01-01'::date as break_end,
lead(start_time + '1970-01-01'::date) OVER (PARTITION BY oh.id ORDER BY start_time) as next_start_time
FROM open_hours oh
LEFT JOIN break_hours bh
ON oh.id = bh.open_hour_id
)
SELECT day_id, generate_series(open_hour, break_start, interval '30 minutes')::time as time_slot
FROM (
SELECT DISTINCT ON (day_id)
day_id, open_hour, break_start
FROM hours
ORDER BY day_id, break_start
)s
UNION
SELECT
day_id, generate_series(break_end, next_start_time, interval '30 minutes')::time
FROM (
SELECT
day_id, break_end, next_start_time
FROM
hours
WHERE next_start_time IS NOT NULL
) s
UNION
SELECT day_id, generate_series(break_end, end_hour, interval '30 minutes')::time
FROM (
SELECT DISTINCT ON (day_id)
day_id, break_end, end_hour
FROM hours
ORDER BY day_id, break_start DESC
) s
ORDER BY day_id, time_slot
The main idea stays the same as in the example for only one day. The difference is that we have to consider the different day types. I expanded the example above and added a second day with different opening hours and break times.
Changes:
The window function in the CTE got a PARTITION BY part. This ensures that only start_times that belong to the same day are shifted.
LIMIT 1 will not work anymore because it limits the whole table to one row. It has been replaced by DISTINCT ON (day_id), which limits the table to the first row of each day.

Getting attendance of an employee with a date series in a particular range in Postgres

I have an attendance table with employee_id, date and punch-in time.
Emp_Id PunchTime
101 10/10/2016 07:15
101 10/10/2016 12:20
101 10/10/2016 12:50
101 10/10/2016 16:31
102 10/10/2016 07:15
Here I have dates only for the working days. I want to get the attendance list of an employee for every date in a given period, and I need the day name as well. The result should look as follows:
date | day |employee_id | Intime | outtime |
2016-10-09 | sunday | 101 | | |
2016-10-10 | monday | 101 | 2016-10-10 7:15AM |2016-10-10 4:31 PM |
You can generate a list of dates and then do an outer join on them:
The following displays all days in October:
select d.date, a.emp_id,
min(punchtime) as intime,
max(punchtime) as outtime
from generate_series(date '2016-10-01', date '2016-11-01' - 1, interval '1' day) as d (date)
left join attendance a on d.date = a.punchtime::date
group by d.date, a.emp_id
order by d.date, a.emp_id;
As you want the first and last timestamp from each day this can be done using a simple group by query.
This will however not repeat the emp_id for the non_existing days.
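If the emp_id should appear on the missing days too, one option is to pair every generated day with every known employee before attaching the punches. This is only a sketch: it assumes the attendance table has the emp_id and punchtime columns used above, and the aliases d, e and dt are illustrative.

```sql
-- One row per (day, employee); intime/outtime stay NULL on days without punches.
select d.dt::date as date,
       to_char(d.dt, 'FMDay') as day,
       e.emp_id,
       min(a.punchtime) as intime,
       max(a.punchtime) as outtime
from generate_series(date '2016-10-01', date '2016-10-31', interval '1 day') as d (dt)
cross join (select distinct emp_id from attendance) as e
left join attendance a
       on a.emp_id = e.emp_id
      and a.punchtime::date = d.dt::date
group by d.dt, e.emp_id
order by d.dt, e.emp_id;
```

Adding a filter such as where e.emp_id = 101 before the group by restricts the list to a single employee.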
Something like the following will generate a list of the range of dates (starting and ending with whatever range is found in your punchtime table), with employees and intime, outtime for each. Check the SQL fiddle here:
http://sqlfiddle.com/#!15/d93bd/1
WITH RECURSIVE minmax AS
(
SELECT MIN(CAST(time AS DATE)) AS min, MAX(CAST(time as DATE)) AS max
FROM emp_time
),
dates AS
(
SELECT m.min AS datepart
FROM minmax m
UNION ALL
SELECT d.datepart + 1 FROM dates d, minmax mm
WHERE d.datepart + 1 <= mm.max
)
SELECT d.datepart as date, e.emp, MIN(e.time) as intime, MAX(e.time) as outtime FROM dates d
LEFT JOIN emp_time e ON d.datepart = CAST(e.time as DATE)
GROUP BY d.datepart, e.emp
ORDER BY d.datepart;

How many seconds passed by grouped by hour between two dates

Let's suppose I have a start date 2016-06-19 09:30:00 and an end date 2016-06-19 10:20:00
For each hour in that range, I would like to get the number of seconds that elapsed before the next hour started or before the end date was reached, grouped by hour and date. The result I'm trying to achieve (without any success so far) would be something like this:
hour | date | time_elapsed_in_seconds
9 | 2016-06-19 | 1800 (there are 1800 seconds between 09:30:00 and 10:00:00)
10 | 2016-06-19 | 1200 (there are 1200 seconds between 10:00:00 and 10:20:00)
Try this :
with table1 as (
select '2016-06-19 09:30:00'::timestamp without time zone start_date,'2016-06-19 10:20:00'::timestamp without time zone end_date
)
select extract(hour from the_hour) "hour",the_hour::date "date",extract (epoch from (new_end-new_start)) "time_elapsed" from (
select the_hour,CASE WHEN date_trunc('hour',start_date)=the_hour then start_date else the_hour end new_start,
CASE WHEN date_trunc('hour',end_date)=the_hour then end_date else the_hour+'1 hour'::interval end new_end
from (
select generate_series(date_trunc('hour',start_date),end_date,'1 hour'::interval) the_hour,start_date,end_date from table1
) a
) b

How can I group by 2 fields and having by an interval type?

I've got this table:
CREATE TABLE T (
id int,
month int,
hours interval
);
and I want to group by id and month, and add the hours.
For example:
id month hours
-------------------
1 1 08:00:00
1 1 09:00:00
1 2 10:00:00
1 2 11:00:00
I want:
1 1 17:00:00
1 2 21:00:00
I tried this:
SELECT * FROM T
GROUP BY T.id , T.month
HAVING SUM( SELECT EXTRACT ( epoch FROM T.hours ) / 3600 );
but it doesn't work and I can't fix it.
SELECT
id,
month,
sum(extract ('epoch' from hours)/3600)
FROM
T
GROUP BY
id,
month
SQL Fiddle
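If the total should stay an interval, as in the desired output of the question, rather than become a number of hours, sum() can be applied to the interval column directly. A short sketch against the table T from the question (total_hours is an illustrative alias):

```sql
-- sum() accepts interval values and returns an interval.
SELECT id, month, sum(hours) AS total_hours
FROM T
GROUP BY id, month
ORDER BY id, month;
```

With the sample rows this yields 17:00:00 for month 1 and 21:00:00 for month 2.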

SQL query to convert date ranges to per day records

Requirements
I have data table that saves data in date ranges.
Each record is allowed to overlap previous record(s) (record has a CreatedOn datetime column).
A new record can define its own date range if it needs to, and hence can overlap several older records.
Each new overlapping record overrides settings of older records that it overlaps.
Result set
What I need to get is get per day data for any date range that uses record overlapping. It should return a record per day with corresponding data for that particular day.
To convert ranges to days I was thinking of numbers/dates table and user defined function (UDF) to get data for each day in the range but I wonder whether there's any other (as in better* or even faster) way of doing this since I'm using the latest SQL Server 2008 R2.
Stored data
Imagine my stored data looks like this
ID | RangeFrom | RangeTo | Starts | Ends | CreatedOn (not providing data)
---|-----------|----------|--------|-------|-----------
1 | 20110101 | 20110331 | 07:00 | 15:00
2 | 20110401 | 20110531 | 08:00 | 16:00
3 | 20110301 | 20110430 | 06:00 | 14:00 <- overrides both partially
Results
If I wanted to get data from 1st January 2011 to 31st May 2011, the resulting table should look like the following (obvious rows omitted):
DayDate | Starts | Ends
--------|--------|------
20110101| 07:00 | 15:00 <- defined by record ID = 1
20110102| 07:00 | 15:00 <- defined by record ID = 1
... many rows omitted for obvious reasons
20110301| 06:00 | 14:00 <- defined by record ID = 3
20110302| 06:00 | 14:00 <- defined by record ID = 3
... many rows omitted for obvious reasons
20110501| 08:00 | 16:00 <- defined by record ID = 2
20110502| 08:00 | 16:00 <- defined by record ID = 2
... many rows omitted for obvious reasons
20110531| 08:00 | 16:00 <- defined by record ID = 2
Actually, since you are working with dates, a Calendar table would be more helpful.
Declare @StartDate date
Declare @EndDate date
;With Calendar As
(
Select @StartDate As [Date]
Union All
Select DateAdd(d,1,[Date])
From Calendar
Where [Date] < @EndDate
)
Select ...
From Calendar
Left Join MyTable
On Calendar.[Date] Between MyTable.RangeFrom And MyTable.RangeTo
Option ( Maxrecursion 0 );
Addition
Missed the part about the trumping rule in your original post:
Set DateFormat MDY;
Declare @StartDate date = '20110101';
Declare @EndDate date = '20110501';
-- This first CTE is obviously to represent
-- the source table
With SampleData As
(
Select 1 As Id
, Cast('20110101' As date) As RangeFrom
, Cast('20110331' As date) As RangeTo
, Cast('07:00' As time) As Starts
, Cast('15:00' As time) As Ends
, CURRENT_TIMESTAMP As CreatedOn
Union All Select 2, '20110401', '20110531', '08:00', '16:00', DateAdd(s,1,CURRENT_TIMESTAMP )
Union All Select 3, '20110301', '20110430', '06:00', '14:00', DateAdd(s,2,CURRENT_TIMESTAMP )
)
, Calendar As
(
Select @StartDate As [Date]
Union All
Select DateAdd(d,1,[Date])
From Calendar
Where [Date] < @EndDate
)
)
, RankedData As
(
Select C.[Date]
, S.Id
, S.RangeFrom, S.RangeTo, S.Starts, S.Ends
, Row_Number() Over( Partition By C.[Date] Order By S.CreatedOn Desc ) As Num
From Calendar As C
Join SampleData As S
On C.[Date] Between S.RangeFrom And S.RangeTo
)
Select [Date], Id, RangeFrom, RangeTo, Starts, Ends
From RankedData
Where Num = 1
Option ( Maxrecursion 0 );
In short, I rank all the sample data preferring the newer rows that overlap the same date.
Why do it all in the DB when you can do it better in memory?
This is the solution I eventually used; it seemed the most reasonable in terms of data transferred, speed and resources:
1. Get the actual range definitions from the DB to the mid tier (a smaller amount of data).
2. Generate an in-memory calendar for the requested date range (faster than in the DB).
3. Put the DB definitions into that calendar (much easier and faster than in the DB).
And that's it. I realised that complicating certain things in the DB is not worth it when you have executable in-memory code that can do the same manipulation faster and more efficiently.