PostgreSQL full join: counting the number of events by day

I'm trying to count the number of accidents that happened and were resolved on a given day. The data, stored in a table named Accidents, looks like the sample below. Each accident is assigned a unique accident_issue ID and the ID (empid) of the employee who resolved it. Note that some accidents are not resolved on the same day they happened, and that some accidents happened at the same time.
INSERT INTO Accidents (empid, accident_issue, accident_date, resolved_date) VALUES
('abcd', 'a49b0a4k', '3/12/19 13:25', '3/12/19 13:37'),
('abcd', 'ao3jbvna', '2/1/19 21:16', '2/1/19 21:19'),
('abcd', 'g4b04kcd', '12/12/18 20:37', '12/12/18 21:34'),
('abcd', 'hk9502jb', '12/10/18 21:09', '12/10/18 21:13'),
('abcd', 'cj9rj4vb', '11/30/18 19:44', '11/30/18 19:49'),
('abcd', 'd948mafg', '11/24/18 19:53', '11/26/18 19:55'),
('abcd', 'mkgiud84', '11/24/18 12:48', '11/25/18 14:37'),
('abcd', 'it93hvmv', '11/24/18 12:48', '11/25/18 15:29'),
('efgh', '94jbniv4', '5/17/18 19:56', '5/17/18 20:11'),
('efgh', '5k0bnck5', '4/13/18 15:07', '4/13/18 15:13'),
('efgh', 'mborj3hf', '2/28/18 21:32', '2/28/18 21:51'),
('efgh', 'vkrok4mn', '2/21/18 16:19', '2/21/18 16:35'),
('efgh', '2ivj39cn', '2/20/18 22:01', '2/20/18 22:06'),
('efgh', '0virj3mv', '2/20/18 16:21', '2/20/18 16:23'),
('efgh', 'x20xzn93', '2/9/18 21:16', '2/10/18 21:30'),
('efgh', '49jcn3k5', '2/6/18 19:35', '2/8/18 22:36');
I want the query result to show, for each employee and day, the number of accidents assigned and the number resolved.
My initial idea was to count the accidents and the resolved accidents per day separately, and then full join the two result sets.
This is the code I have so far.
SELECT
a.empid,
a.date,
a.number_of_accidents,
b.number_resolved
FROM
(SELECT A1.empid, A1.accident_issue, to_char(accident_date::date, 'yyyy-mm-dd') as date,
count(accident_date) as number_of_accidents
FROM Accidents as A1
GROUP BY A1.empid, A1.accident_issue
) AS a
FULL OUTER JOIN
(SELECT B1.empid, B1.accident_issue, to_char(resolved_date::date, 'yyyy-mm-dd') as date,
count(resolved_date) as number_resolved
FROM Accidents as B1
GROUP BY B1.empid, B1.accident_issue
) AS b
ON a.date = b.date
GROUP BY a.empid, a.date
When run separately, the two subqueries a and b seem to return what I want, but when they are joined, the output contains many duplicate rows.
I want the result to look something like this:
| empid | date     | number_of_accidents | number_resolved |
|-------|----------|---------------------|-----------------|
| abcd  | 11/24/18 | 3                   | 0               |
| abcd  | 11/25/18 | 0                   | 2               |
| abcd  | 2/1/19   | 1                   | 1               |
| abcd  | 3/12/19  | 1                   | 1               |
| efgh  | 2/20/18  | 2                   | 2               |
| efgh  | 2/21/18  | 1                   | 1               |
What seems to be the problem, and am I heading in the right direction?
Any help will be greatly appreciated. Thank you!

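The duplicates come from grouping the subqueries by accident_issue (which yields one row per accident rather than one per day) and then joining on the date alone. Aggregate by employee and day in the subqueries, and full join them on both employee and day.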
SELECT coalesce(o.empid, r.empid) AS empid,
       coalesce(o.day, r.day) AS date,
       -- coalesce turns the NULL of an unmatched side into 0
       coalesce(o.count, 0) AS number_of_accidents,
       coalesce(r.count, 0) AS number_resolved
FROM (SELECT a.empid,
             date_trunc('day', a.accident_date) AS day,
             count(*) AS count
      FROM accidents a
      GROUP BY a.empid,
               date_trunc('day', a.accident_date)) o
FULL JOIN (SELECT a.empid,
                  date_trunc('day', a.resolved_date) AS day,
                  count(*) AS count
           FROM accidents a
           GROUP BY a.empid,
                    date_trunc('day', a.resolved_date)) r
       ON r.empid = o.empid
      AND r.day = o.day;
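To make the logic of the full join easier to check, here is a minimal sketch of the same idea in Python rather than SQL. It uses a few of the sample rows from the question; the names `rows`, `day`, `opened`, `resolved`, and `report` are illustrative assumptions, not part of the original query.

```python
from collections import Counter
from datetime import datetime

# A subset of the question's sample rows: (empid, accident_date, resolved_date).
rows = [
    ("abcd", "11/24/18 19:53", "11/26/18 19:55"),
    ("abcd", "11/24/18 12:48", "11/25/18 14:37"),
    ("abcd", "11/24/18 12:48", "11/25/18 15:29"),
    ("efgh", "2/20/18 22:01", "2/20/18 22:06"),
    ("efgh", "2/20/18 16:21", "2/20/18 16:23"),
]


def day(text):
    """Truncate a 'm/d/yy hh:mm' string to its calendar date."""
    return datetime.strptime(text, "%m/%d/%y %H:%M").date()


# The two aggregating subqueries: one count per (employee, day).
opened = Counter((emp, day(a)) for emp, a, _ in rows)
resolved = Counter((emp, day(r)) for emp, _, r in rows)

# The full join: every key present on either side appears once.
# A Counter returns 0 for a missing key, which plays the role of COALESCE(count, 0).
report = [
    (emp, d.isoformat(), opened[(emp, d)], resolved[(emp, d)])
    for emp, d in sorted(opened.keys() | resolved.keys())
]

for line in report:
    print(line)
```

Because both sides are keyed by employee and day, each combination shows up exactly once, which is what the join condition on both columns achieves in SQL.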

Related

MySQL group by timestamp difference

I need to write a MySQL query that groups results by the difference between timestamps.
Is it possible?
I have a table of locations where every row has a created_at (timestamp) column, and I want to start a new group whenever the difference between consecutive rows is greater than 1 minute.
Example:
id | lat | lng | created_at
1. | ... | ... | 2020-05-03 06:11:35
2. | ... | ... | 2020-05-03 06:11:37
3. | ... | ... | 2020-05-03 06:11:46
4. | ... | ... | 2020-05-03 06:12:48
5. | ... | ... | 2020-05-03 06:12:52
Result of this data should be 2 groups (1,2,3) and (4,5)
It depends on what you actually want. If you want to group together records that belong to the same minute, regardless of the difference from the previous record, then simple aggregation is enough:
select
date_format(created_at, '%Y-%m-%d %H:%i:00') date_minute,
min(id) min_id,
max(id) max_id,
min(created_at) min_created_at,
max(created_at) max_created_at,
count(*) no_records
from mytable
group by date_minute
On the other hand, if you want to build groups of consecutive records that have less than a 1 minute gap between them, this is a gaps-and-islands problem. Here is one way to solve it using window functions (available in MySQL 8.0):
select
    min(id) min_id,
    max(id) max_id,
    min(created_at) min_created_at,
    max(created_at) max_created_at,
    count(*) no_records
from (
    select
        t.*,
        sum(case when created_at < lag_created_at + interval 1 minute then 0 else 1 end)
            over(order by created_at) grp
    from (
        select
            t.*,
            lag(created_at) over(order by created_at) lag_created_at
        from mytable t
    ) t
) t
group by grp
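The gaps-and-islands logic can also be sketched outside the database. The following Python version is only an illustration of the grouping rule (start a new group when the gap to the previous row is one minute or more); the function name `islands` and the `max_gap` parameter are assumptions made for this sketch.

```python
from datetime import datetime, timedelta

# The sample rows from the question: (id, created_at).
rows = [
    (1, "2020-05-03 06:11:35"),
    (2, "2020-05-03 06:11:37"),
    (3, "2020-05-03 06:11:46"),
    (4, "2020-05-03 06:12:48"),
    (5, "2020-05-03 06:12:52"),
]


def islands(rows, max_gap=timedelta(minutes=1)):
    """Group ids whose consecutive timestamps are less than max_gap apart."""
    groups = []
    previous = None
    # ISO-formatted timestamps sort correctly as plain strings.
    for row_id, text in sorted(rows, key=lambda r: r[1]):
        ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
        # Same test as the CASE expression: a new island starts on the first
        # row or when the gap to the previous row reaches max_gap.
        if previous is None or ts - previous >= max_gap:
            groups.append([])
        groups[-1].append(row_id)
        previous = ts
    return groups


groups = islands(rows)
print(groups)  # [[1, 2, 3], [4, 5]]
```

Rows 3 and 4 are 62 seconds apart, so row 4 opens the second group, matching the result the question asks for.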

PostgreSQL Marketing Report

I'm writing out a query that takes ad marketing data from Google Ads, Microsoft, and Taboola and merges it into one table.
The table should have 3 rows, one for each ad company, with 4 columns: traffic source (ad company), money spent, sales, and cost per conversion. Right now I'm just dealing with the first 2 until I get those right. The whole table should cover the data of a single given month.
Right now the results I'm getting are multiple rows from each traffic source, some of them merging months of data into the cost column instead of summing up the costs within a given month.
WITH google_ads AS
( SELECT 'Google' AS traffic_source,
date_trunc('month', "day"::date) AS month,
SUM(cost / 1000000) AS cost
FROM googleads_campaign AS g
GROUP BY month
ORDER BY month DESC),
taboola AS
( SELECT 'Taboola' AS traffic_source,
date_trunc('month', "date"::date) AS month,
SUM(spent) AS cost
FROM taboola_campaign AS t
GROUP BY month
ORDER BY month DESC),
microsoft AS
( SELECT 'Microsoft' AS traffic_source,
date_trunc('month', "TimePeriod"::date) AS month,
SUM("Spend") AS cost
FROM microsoft_campaign AS m
GROUP BY month
ORDER BY month DESC)
SELECT (CASE
WHEN M.traffic_source='Microsoft' THEN M.traffic_source
WHEN T.traffic_source='Taboola' THEN T.traffic_source
WHEN G.traffic_source='Google' THEN G.traffic_source
END) AS traffic_source1,
SUM(CASE
WHEN G.traffic_source='Google' THEN G.cost
WHEN T.traffic_source='Taboola' THEN T.cost
WHEN M.traffic_source='Microsoft' THEN M.cost
END) AS cost,
(CASE
WHEN G.traffic_source='Google' THEN G.month
WHEN T.traffic_source='Taboola' THEN T.month
WHEN M.traffic_source='Microsoft' THEN M.month
END) AS month1
FROM google_ads G
LEFT JOIN taboola T ON G.month = T.month
LEFT JOIN microsoft M ON G.month = M.month
GROUP BY traffic_source1, month1
Here's an example of the results I'm getting. The month column is simply for testing purposes.
| traffic_source1 | cost | month1 |
|:----------------|:-----------|:---------------|
| Google | 210.00 | 01/09/18 00:00 |
| Google | 1,213.00 | 01/10/18 00:00 |
| Google | 2,481.00 | 01/11/18 00:00 |
| Google | 3,503.00 | 01/12/18 00:00 |
| Google | 7,492.00 | 01/01/19 00:00 |
| Microsoft | 22,059.00 | 01/02/19 00:00 |
| Microsoft | 16,958.00 | 01/03/19 00:00 |
| Microsoft | 7,582.00 | 01/04/19 00:00 |
| Microsoft | 76,125.00 | 01/05/19 00:00 |
| Taboola | 37,205.00 | 01/06/19 00:00 |
| Google | 45,910.00 | 01/07/19 00:00 |
| Google | 137,421.00 | 01/08/19 00:00 |
| Google | 29,501.00 | 01/09/19 00:00 |
Instead, it should look like this (Let's say for the month of July this year, for instance):
| traffic_source | cost |
|----------------|-----------|
| Google | 53,901.00 |
| Microsoft | 22,059.00 |
| Taboola | 37,205.00 |
Any help would be greatly appreciated, thanks!
You can try this way:
WITH google_ads AS
( SELECT 'Google' AS traffic_source,
date_trunc('month', "day"::date) AS month,
SUM(cost / 1000000) AS cost
FROM googleads_campaign AS g
GROUP BY month
ORDER BY month DESC),
taboola AS
( SELECT 'Taboola' AS traffic_source,
date_trunc('month', "date"::date) AS month,
SUM(spent) AS cost
FROM taboola_campaign AS t
GROUP BY month
ORDER BY month DESC),
microsoft AS
( SELECT 'Microsoft' AS traffic_source,
date_trunc('month', "TimePeriod"::date) AS month,
SUM("Spend") AS cost
FROM microsoft_campaign AS m
GROUP BY month
ORDER BY month DESC)
SELECT (CASE
WHEN M.traffic_source='Microsoft' THEN M.traffic_source
WHEN T.traffic_source='Taboola' THEN T.traffic_source
WHEN G.traffic_source='Google' THEN G.traffic_source
END) AS traffic_source1,
SUM(CASE
WHEN G.traffic_source='Google' THEN G.cost
WHEN T.traffic_source='Taboola' THEN T.cost
WHEN M.traffic_source='Microsoft' THEN M.cost
END) AS cost,
(CASE
WHEN G.traffic_source='Google' THEN G.month
WHEN T.traffic_source='Taboola' THEN T.month
WHEN M.traffic_source='Microsoft' THEN M.month
END) AS month1
FROM google_ads G
LEFT JOIN taboola T ON G.month = T.month
LEFT JOIN microsoft M ON G.month = M.month
GROUP BY traffic_source1, month1
HAVING EXTRACT(month from month1) = ... desired month (July is 7)
The concept of a different table for each ad source is really a very bad idea. It vastly compounds the complexity of queries requiring consolidation. It would be much better to have a single table holding the source along with the other columns. Consider what happens when marketing decides to use 30-40 or more ad suppliers. If you cannot create a single table, then at least standardize column names and types. Also build a view, a materialized view, or a table function (below) which combines all the traffic sources into a single source.
create or replace function consolidated_ad_traffic()
returns table( traffic_source text
, ad_month timestamp with time zone
, ad_cost numeric(11,2)
, ad_sales numeric(11,2)
, conversion_cost numeric(11,6)
)
language sql
AS $$
with ad_sources as
( select 'Google' as traffic_source
, "day" as ad_date
, round(cast (cost AS numeric ) / 1000000.0,2) as cost
, sales
, cost_per_conversion
from googleads_campaign
union all
select 'Taboola'
, "date"
, spent
, sales
, cost_per_conversion
from taboola_campaign
union all
select 'Microsoft'
, "TimePeriod"
, "Spend"
, sales
, cost_per_conversion
from microsoft_campaign
)
select * from ad_sources;
$$;
With a consolidated view of the data you can now write normal selects as though you had a single table. Such as:
select * from consolidated_ad_traffic();
select distinct on( traffic_source, to_char(ad_month, 'mm'))
traffic_source
, to_char(ad_month, 'Mon') "For Month"
, to_char(sum(ad_cost) over(partition by traffic_source, to_char(ad_month, 'Mon')), 'FM99,999,999,990.00') monthly_traffic_cost
, to_char(sum(ad_cost) over(partition by traffic_source), 'FM99,999,999,990.00') total_traffic_cost
from consolidated_ad_traffic();
select traffic_source, sum(ad_cost) ad_cost
from consolidated_ad_traffic()
group by traffic_source
order by traffic_source;
select traffic_source
, to_char(ad_month, 'dd-Mon') "For Month"
, sum(ad_cost) "Monthly Cost"
from consolidated_ad_traffic()
where date_trunc('month',ad_month) = date_trunc('month', date '2019-07-01')
and traffic_source = 'Google'
group by traffic_source, to_char(ad_month, 'dd-Mon') ;
Now this won't do much for updating but will drastically ease selection.
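As a rough illustration of why consolidating first makes the monthly report simple, here is a small Python sketch of the same "union all, then filter and group" shape. All figures and the names `google`, `taboola`, `microsoft`, `consolidated`, and `monthly_cost` are made up for this example and do not come from the original data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-source rows; Google costs are in micros, as in the question.
google = [(date(2019, 7, 1), 1_500_000), (date(2019, 7, 2), 2_500_000),
          (date(2019, 8, 1), 9_000_000)]
taboola = [(date(2019, 7, 3), 3.0)]
microsoft = [(date(2019, 7, 4), 2.5), (date(2019, 6, 30), 7.0)]


def consolidated():
    """Yield (traffic_source, ad_date, cost) rows, like the UNION ALL."""
    for d, micros in google:
        yield ("Google", d, micros / 1_000_000)
    for d, spent in taboola:
        yield ("Taboola", d, spent)
    for d, spend in microsoft:
        yield ("Microsoft", d, spend)


def monthly_cost(year, month):
    """Sum cost per traffic source for one month: one row per source."""
    totals = defaultdict(float)
    for source, d, cost in consolidated():
        if (d.year, d.month) == (year, month):
            totals[source] += cost
    return dict(sorted(totals.items()))


july = monthly_cost(2019, 7)
print(july)
```

Filtering on the month before grouping by source is what keeps the result at one row per traffic source, instead of one row per source and month.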

Postgresql: Create a date sequence, use it in date range query

I'm not great with SQL but I have been making good progress on a project up to this point. Now I am completely stuck.
I'm trying to get a count for the number of apartments with each status. I want this information for each day so that I can trend it over time. I have data that looks like this:
table: y_unit_status
unit | date_occurred | start_date | end_date | status
1 | 2017-01-01 | 2017-01-01 | 2017-01-05 | Occupied No Notice
1 | 2017-01-06 | 2017-01-06 | 2017-01-31 | Occupied Notice
1 | 2017-02-01 | 2017-02-01 | | Vacant
2 | 2017-01-01 | 2017-01-01 | | Occupied No Notice
And I want to get output that looks like this:
date | occupied_no_notice | occupied_notice | vacant
2017-01-01 | 2 | 0 | 0
...
2017-01-10 | 1 | 1 | 0
...
2017-02-01 | 1 | 0 | 1
Or, this approach would work:
date | status | count
2017-01-01 | occupied no notice | 2
2017-01-01 | occupied notice | 0
date_occurred: Date when the status of the unit changed
start_date: Same as date_occurred
end_date: Date when status stopped being x and changed to y.
I am pulling in the number of bedrooms and a property id so the second approach of selecting counts for one status at a time would produce a relatively large number of rows vs. option 1 (if that matters).
I've found a lot of references that have gotten me close to what I'm looking for but I always end up with a sort of rolling, cumulative count.
Here's my query, which produces a column of dates and counts, which accumulate over time rather than reflecting a snapshot of counts for a particular day. You can see my references to another table where I'm pulling in a property id. The table schema is Property -> Unit -> Unit Status.
WITH t AS(
SELECT i::date from generate_series('2016-06-29', '2017-08-03', '1 day'::interval) i
)
SELECT t.i as date,
u.hproperty,
count(us.hmy) as count --us.hmy is the id
FROM t
LEFT OUTER JOIN y_unit_status us ON t.i BETWEEN us.dtstart AND
us.dtend
INNER JOIN y_unit u ON u.hmy = us.hunit -- to get property id
WHERE us.sstatus = 'Occupied No Notice'
AND t.i >= us.dtstart
AND t.i <= us.dtend
AND u.hproperty = '1'
GROUP BY t.i, u.hproperty
ORDER BY t.i
limit 1500
I also tried a FOR loop, iterating over the dates to determine cases where the date was between start and end but my logic wasn't working. Thanks for any insight!
You are on the right track, but you'll need to handle NULL values in end_date. If a NULL means that the status is expected to change at some unknown point in the future, the containment operators (@> and <@) for the daterange type are perfect for you (because ranges can be "unbounded"):
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d, status, count(unit)
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') @> date_from + d
group by 1, 2
To achieve the first variant, you can use conditional aggregation:
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d,
count(unit) filter (where status = 'Occupied No Notice') occupied_no_notice,
count(unit) filter (where status = 'Occupied Notice') occupied_notice,
count(unit) filter (where status = 'Vacant') vacant
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') @> date_from + d
group by 1
Notes:
The syntax filter (where <predicate>) was introduced in 9.4. Before that, you can use CASE (and the fact that most aggregate functions do not include NULL values) to emulate it.
You can even index the expression daterange(start_date, end_date, '[]') (using gist) for better performance.
http://rextester.com/HWKDE34743
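The snapshot logic (a status counts on a given day if the day falls inside its range, and a missing end date means the range is still open) can be sketched in Python as follows. This is an illustration only, using the sample rows from the question; `snapshot` and `daily_counts` are names invented for the sketch.

```python
from collections import Counter
from datetime import date, timedelta

# The question's sample rows: (unit, start_date, end_date, status).
# None stands for a NULL end_date, i.e. an unbounded range.
statuses = [
    (1, date(2017, 1, 1), date(2017, 1, 5), "Occupied No Notice"),
    (1, date(2017, 1, 6), date(2017, 1, 31), "Occupied Notice"),
    (1, date(2017, 2, 1), None, "Vacant"),
    (2, date(2017, 1, 1), None, "Occupied No Notice"),
]


def snapshot(day):
    """Count units per status on one day (inclusive bounds, like '[]')."""
    return Counter(
        status
        for _, start, end, status in statuses
        if start <= day and (end is None or day <= end)
    )


def daily_counts(date_from, date_to):
    """One snapshot per day in the inclusive range, like generate_series."""
    days = (date_to - date_from).days
    return {
        date_from + timedelta(n): snapshot(date_from + timedelta(n))
        for n in range(days + 1)
    }


counts = daily_counts(date(2017, 1, 1), date(2017, 2, 2))
print(counts[date(2017, 1, 10)])
```

Each day is evaluated independently, so the counts are a snapshot for that day rather than a running total.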

LAG function and GROUP BY

I have a table like this,
event_id | date
----------+------------------------
1703702 | 2013-06-25 07:50:57-04
3197588 | 2013-06-25 07:51:57-04
60894420 | 2013-06-25 07:52:57-04
60894420 | 2013-06-25 07:53:57-04
183503 | 2013-06-25 07:54:57-04
63116743 | 2013-06-25 07:55:57-04
63110451 | 2013-06-25 07:56:57-04
63116743 | 2013-06-25 07:57:57-04
63116743 | 2013-06-25 07:58:57-04
I'd like to apply the lag function but also a group by so I can find the time intervals between any particular event_id.
I'd like something like this:
SELECT event_id, difference
FROM (
SELECT event_id, date - lag(date) over (order by date) as
difference FROM table GROUP BY event_id
) t;
I cannot however use GROUP BY with the LAG function. I'd like a result similar to the following:
63116743, {120, 60}
60894420, {60}
...
...
So there was a 120s and 60s window between the events for the first id, and a 60s window for the second id.
Is there a way to do this? The output format is not too important as long as I can get it into an array in the end. I'm using Postgres 9.1
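WITH diffs AS (
    SELECT
        event_id,
        -- PARTITION BY restarts lag() for each event_id
        date - lag(date) OVER (PARTITION BY event_id ORDER BY date) AS difference
    FROM
        TABLE
)
SELECT
    event_id,
    array_agg(difference) AS all_diffs
FROM
    diffs
WHERE difference IS NOT NULL  -- lag() yields NULL for each event's first row
GROUP BY event_id;
This should work once TABLE is replaced with the actual table name.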
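For comparison, the per-event lag can be sketched in Python: collect the timestamps of each event_id, sort them, and take the differences between neighbours. The sketch uses the question's sample rows with the time zone offset dropped for simplicity; `gaps_by_event` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime

# The question's sample rows: (event_id, date), offset omitted.
events = [
    (1703702, "2013-06-25 07:50:57"),
    (3197588, "2013-06-25 07:51:57"),
    (60894420, "2013-06-25 07:52:57"),
    (60894420, "2013-06-25 07:53:57"),
    (183503, "2013-06-25 07:54:57"),
    (63116743, "2013-06-25 07:55:57"),
    (63110451, "2013-06-25 07:56:57"),
    (63116743, "2013-06-25 07:57:57"),
    (63116743, "2013-06-25 07:58:57"),
]


def gaps_by_event(events):
    """Map each repeated event_id to the gaps, in seconds, between its rows."""
    by_event = defaultdict(list)
    for event_id, text in events:
        by_event[event_id].append(datetime.strptime(text, "%Y-%m-%d %H:%M:%S"))

    gaps = {}
    for event_id, stamps in by_event.items():
        stamps.sort()
        # Pair each timestamp with its successor, like lag() within a partition.
        diffs = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
        if diffs:  # events seen only once have no gap to report
            gaps[event_id] = diffs
    return gaps


gaps = gaps_by_event(events)
print(gaps)
```

Grouping before differencing is the Python counterpart of partitioning the window by event_id.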

Iterate through rows, compare them against each other and store results in another table

I have a table that contains the following rows:
product_id | order_date
A | 12/04/12
A | 01/11/13
A | 01/21/13
A | 03/05/13
B | 02/14/13
B | 03/09/13
What I now need is an overview for each month showing how many products were bought for the first time (= were not bought the month before), how many are existing products (= were bought the month before), and how many were not purchased within that month. Taking the sample above as input, the script should deliver the following result, regardless of the period of time covered by the data:
month | new | existing | nopurchase
12/2012 | 1 | 0 | 0
01/2013 | 0 | 1 | 0
02/2013 | 1 | 0 | 1
03/2013 | 1 | 1 | 0
Would be great to get a first hint how this could be solved so I'm able to continue.
Thanks!
with t as (
select product_id pid, date_trunc('month', order_date)::date od
from t
group by 1, 2
)
select od,
sum(is_new::integer) "new",
sum(is_existing::integer) existing,
sum(not_purchased::integer) nopurchase
from (
select od,
lag(t_pid) over(partition by s_pid order by od) is null and t_pid is not null is_new,
lag(t_pid) over(partition by s_pid order by od) is not null and t_pid is not null is_existing,
lag(t_pid) over(partition by s_pid order by od) is not null and t_pid is null not_purchased
from (
select t.pid t_pid, s.pid s_pid, s.od
from
t
right join
(
select pid, s.od
from
t
cross join
(
select date_trunc('month', d)::date od
from
generate_series(
(select min(od) from t),
(select max(od) from t),
'1 month'
) s(d)
) s
group by pid, s.od
) s on t.od = s.od and t.pid = s.pid
) s
) s
group by 1
order by 1
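The classification rule the query applies (compare each product's purchase in a month with its purchase in the previous month) can be sketched in Python. This is a simplified illustration using the question's sample data; `month_index` and `monthly_overview` are names assumed for the sketch.

```python
from datetime import date

# The question's sample rows: (product_id, order_date).
orders = [
    ("A", date(2012, 12, 4)),
    ("A", date(2013, 1, 11)),
    ("A", date(2013, 1, 21)),
    ("A", date(2013, 3, 5)),
    ("B", date(2013, 2, 14)),
    ("B", date(2013, 3, 9)),
]


def month_index(d):
    """Number months consecutively so that m - 1 is always the month before."""
    return d.year * 12 + d.month - 1


def monthly_overview(orders):
    # One entry per product and month, like the grouped CTE.
    bought = {(pid, month_index(d)) for pid, d in orders}
    products = {pid for pid, _ in bought}
    first = min(m for _, m in bought)
    last = max(m for _, m in bought)

    report = []
    for m in range(first, last + 1):  # every month, even ones without orders
        new = existing = nopurchase = 0
        for pid in products:
            now = (pid, m) in bought
            before = (pid, m - 1) in bought
            new += now and not before
            existing += now and before
            nopurchase += before and not now
        report.append((f"{m % 12 + 1:02d}/{m // 12}", new, existing, nopurchase))
    return report


overview = monthly_overview(orders)
for line in overview:
    print(line)
```

As in the query, "no purchase" only counts products that were bought in the previous month, which is why product B does not appear as unpurchased before its first order.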