LAG function and GROUP BY - postgresql

I have a table like this,
event_id | date
----------+------------------------
1703702 | 2013-06-25 07:50:57-04
3197588 | 2013-06-25 07:51:57-04
60894420 | 2013-06-25 07:52:57-04
60894420 | 2013-06-25 07:53:57-04
183503 | 2013-06-25 07:54:57-04
63116743 | 2013-06-25 07:55:57-04
63110451 | 2013-06-25 07:56:57-04
63116743 | 2013-06-25 07:57:57-04
63116743 | 2013-06-25 07:58:57-04
I'd like to apply the lag function but also a group by so I can find the time intervals between any particular event_id.
I'd like something like this:
SELECT event_id, difference
FROM (
SELECT event_id, date - lag(date) over (order by date) as
difference FROM table GROUP BY event_id
) t;
I cannot however use GROUP BY with the LAG function. I'd like a result similar to the following:
63116743, {120, 60}
60894420, {60}
...
...
So there was a 120s and 60s window between the events for the first id, and a 60s window for the second id.
Is there a way to do this? The output format is not too important as long as I can get it into an array in the end. I'm using Postgres 9.1

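WITH diffs AS (
    SELECT
        event_id,
        date - lag(date) OVER (PARTITION BY event_id ORDER BY date) AS difference
    FROM
        TABLE  -- placeholder: substitute your actual table name
)
SELECT
    event_id,
    array_agg(difference) AS all_diffs
FROM
    diffs
GROUP BY event_id;
This should work. PARTITION BY restarts lag() for each event_id, so the first row of every event_id has no predecessor and its difference is NULL.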
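If the intervals are wanted in seconds and without the leading NULL, a variation along the following lines may help. It is only a sketch: the table name events is hypothetical, and an event_id that occurs only once drops out of the result because it has no difference to report.

```sql
-- Sketch only; "events" is a hypothetical table name with columns event_id and date.
WITH diffs AS (
    SELECT
        event_id,
        date,
        -- interval between this row and the previous row of the same event_id, in seconds
        extract(epoch FROM date - lag(date) OVER (PARTITION BY event_id ORDER BY date)) AS seconds
    FROM events
)
SELECT
    event_id,
    array_agg(seconds ORDER BY date) AS all_diffs  -- ORDER BY inside the aggregate needs 9.0 or later
FROM diffs
WHERE seconds IS NOT NULL                          -- drop the first row of each event_id
GROUP BY event_id;
```

For the sample data in the question this should give {120,60} for 63116743 and {60} for 60894420.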

Related

MySQL group by timestamp difference

I need to write mysql query which will group results by difference between timestamps.
Is it possible?
I have table with locations and every row has created_at (timestamp) and I want to group results by difference > 1min.
Example:
id | lat | lng | created_at
1. | ... | ... | 2020-05-03 06:11:35
2. | ... | ... | 2020-05-03 06:11:37
3. | ... | ... | 2020-05-03 06:11:46
4. | ... | ... | 2020-05-03 06:12:48
5. | ... | ... | 2020-05-03 06:12:52
Result of this data should be 2 groups (1,2,3) and (4,5)
It depends on what you actually want. If you want to group together records that belong to the same minute, regardless of the difference with the previous record, then simple aggregation is enough:
select
date_format(created_at, '%Y-%m-%d %H:%i:00') date_minute,
min(id) min_id,
max(id) max_id,
min(created_at) min_created_at,
max(created_at) max_created_at,
count(*) no_records
from mytable
group by date_minute
On the other hand, if you want to build groups of consecutive records that have less than 1 minute gap in between, this is a gaps and islands problem. Here is one way to solve it using window functions (available in MySQL 8.0):
select
    min(id) min_id,
    max(id) max_id,
    min(created_at) min_created_at,
    max(created_at) max_created_at,
    count(*) no_records
from (
    select
        t.*,
        sum(case when created_at < lag_created_at + interval 1 minute then 0 else 1 end)
            over(order by created_at) grp
    from (
        select
            t.*,
            lag(created_at) over(order by created_at) lag_created_at
        from mytable t
    ) t
) t
group by grp
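To see how the groups are formed, the intermediate flag can be printed for the five sample rows. This is a sketch under assumptions: the table definition is made up because the question does not show one, and lat and lng are left out.

```sql
-- Assumed minimal schema (MySQL 8.0+); lat and lng are omitted because the question elides them.
create table mytable (
    id int primary key,
    created_at datetime not null
);

insert into mytable (id, created_at) values
    (1, '2020-05-03 06:11:35'),
    (2, '2020-05-03 06:11:37'),
    (3, '2020-05-03 06:11:46'),
    (4, '2020-05-03 06:12:48'),
    (5, '2020-05-03 06:12:52');

-- is_start is 1 on the first row and whenever the gap to the previous row is 1 minute or more;
-- the running sum of is_start numbers the islands.
select
    id,
    created_at,
    is_start,
    sum(is_start) over (order by created_at) as grp
from (
    select
        t.*,
        case
            when created_at < lag(created_at) over (order by created_at) + interval 1 minute then 0
            else 1
        end as is_start
    from mytable t
) flagged
order by created_at;
```

Rows 1 to 3 get grp 1 and rows 4 and 5 get grp 2, because the only gap of a minute or more is the 62 seconds between 06:11:46 and 06:12:48.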

PostgreSQL Marketing Report

I'm writing out a query that takes ad marketing data from Google Ads, Microsoft, and Taboola and merges it into one table.
The table should have 3 rows, one for each ad company, with 4 columns: traffic source (ad company), money spent, sales, and cost per conversion. Right now I'm just dealing with the first 2 until I get those right. The whole table should cover only a given month's data.
Right now the results I'm getting are multiple rows from each traffic source, some of them merging months of data into the cost column instead of summing up the costs within a given month.
WITH google_ads AS
( SELECT 'Google' AS traffic_source,
date_trunc('month', "day"::date) AS month,
SUM(cost / 1000000) AS cost
FROM googleads_campaign AS g
GROUP BY month
ORDER BY month DESC),
taboola AS
( SELECT 'Taboola' AS traffic_source,
date_trunc('month', "date"::date) AS month,
SUM(spent) AS cost
FROM taboola_campaign AS t
GROUP BY month
ORDER BY month DESC),
microsoft AS
( SELECT 'Microsoft' AS traffic_source,
date_trunc('month', "TimePeriod"::date) AS month,
SUM("Spend") AS cost
FROM microsoft_campaign AS m
GROUP BY month
ORDER BY month DESC)
SELECT (CASE
WHEN M.traffic_source='Microsoft' THEN M.traffic_source
WHEN T.traffic_source='Taboola' THEN T.traffic_source
WHEN G.traffic_source='Google' THEN G.traffic_source
END) AS traffic_source1,
SUM(CASE
WHEN G.traffic_source='Google' THEN G.cost
WHEN T.traffic_source='Taboola' THEN T.cost
WHEN M.traffic_source='Microsoft' THEN M.cost
END) AS cost,
(CASE
WHEN G.traffic_source='Google' THEN G.month
WHEN T.traffic_source='Taboola' THEN T.month
WHEN M.traffic_source='Microsoft' THEN M.month
END) AS month1
FROM google_ads G
LEFT JOIN taboola T ON G.month = T.month
LEFT JOIN microsoft M ON G.month = M.month
GROUP BY traffic_source1, month1
Here's an example of the results I'm getting. The month column is simply for testing purposes.
| traffic_source1 | cost | month1 |
|:----------------|:-----------|:---------------|
| Google | 210.00 | 01/09/18 00:00 |
| Google | 1,213.00 | 01/10/18 00:00 |
| Google | 2,481.00 | 01/11/18 00:00 |
| Google | 3,503.00 | 01/12/18 00:00 |
| Google | 7,492.00 | 01/01/19 00:00 |
| Microsoft | 22,059.00 | 01/02/19 00:00 |
| Microsoft | 16,958.00 | 01/03/19 00:00 |
| Microsoft | 7,582.00 | 01/04/19 00:00 |
| Microsoft | 76,125.00 | 01/05/19 00:00 |
| Taboola | 37,205.00 | 01/06/19 00:00 |
| Google | 45,910.00 | 01/07/19 00:00 |
| Google | 137,421.00 | 01/08/19 00:00 |
| Google | 29,501.00 | 01/09/19 00:00 |
Instead, it should look like this (Let's say for the month of July this year, for instance):
| traffic_source | cost |
|----------------|-----------|
| Google | 53,901.00 |
| Microsoft | 22,059.00 |
| Taboola | 37,205.00 |
Any help would be greatly appreciated, thanks!
You can try this way:
WITH google_ads AS
( SELECT 'Google' AS traffic_source,
date_trunc('month', "day"::date) AS month,
SUM(cost / 1000000) AS cost
FROM googleads_campaign AS g
GROUP BY month
ORDER BY month DESC),
taboola AS
( SELECT 'Taboola' AS traffic_source,
date_trunc('month', "date"::date) AS month,
SUM(spent) AS cost
FROM taboola_campaign AS t
GROUP BY month
ORDER BY month DESC),
microsoft AS
( SELECT 'Microsoft' AS traffic_source,
date_trunc('month', "TimePeriod"::date) AS month,
SUM("Spend") AS cost
FROM microsoft_campaign AS m
GROUP BY month
ORDER BY month DESC)
SELECT (CASE
WHEN M.traffic_source='Microsoft' THEN M.traffic_source
WHEN T.traffic_source='Taboola' THEN T.traffic_source
WHEN G.traffic_source='Google' THEN G.traffic_source
END) AS traffic_source1,
SUM(CASE
WHEN G.traffic_source='Google' THEN G.cost
WHEN T.traffic_source='Taboola' THEN T.cost
WHEN M.traffic_source='Microsoft' THEN M.cost
END) AS cost,
(CASE
WHEN G.traffic_source='Google' THEN G.month
WHEN T.traffic_source='Taboola' THEN T.month
WHEN M.traffic_source='Microsoft' THEN M.month
END) AS month1
FROM google_ads G
LEFT JOIN taboola T ON G.month = T.month
LEFT JOIN microsoft M ON G.month = M.month
GROUP BY traffic_source1, month1
HAVING EXTRACT(month from month1) = ... desired month (July is 7)
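Two caveats on the filter above: PostgreSQL does not accept an output alias such as month1 in HAVING, and EXTRACT(month ...) = 7 matches July of every year. An alternative that avoids the joins is to stack the three sources with UNION ALL and filter on a date range. The following is only a sketch; it assumes the three cost columns have compatible numeric types, and the July 2019 range is just an example.

```sql
-- Sketch only: column names are taken from the question, their types are assumed.
WITH all_sources AS (
    SELECT 'Google' AS traffic_source,
           "day"::date AS ad_date,
           cost / 1000000.0 AS cost        -- 1000000.0 avoids integer division if cost is an integer
    FROM googleads_campaign
    UNION ALL
    SELECT 'Taboola', "date"::date, spent
    FROM taboola_campaign
    UNION ALL
    SELECT 'Microsoft', "TimePeriod"::date, "Spend"
    FROM microsoft_campaign
)
SELECT traffic_source,
       SUM(cost) AS cost
FROM all_sources
WHERE ad_date >= DATE '2019-07-01'          -- example month: July 2019
  AND ad_date <  DATE '2019-08-01'
GROUP BY traffic_source
ORDER BY traffic_source;
```

This returns at most one row per traffic source, which is the shape of the desired output in the question.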
The concept of a different table for each ad source is really a very bad idea. It vastly compounds the complexity of queries requiring consolidation. It would be much better to have a single table having the source along with the other columns. Consider what happens when marketing decides to use 30-40 or more ad suppliers. If you cannot create a single table then at least standardize column names and types. Also build a view, a materialized view, or a table function (below) which combines all the traffic sources into a single source.
create or replace function consolidated_ad_traffic()
returns table( traffic_source text
, ad_month timestamp with time zone
, ad_cost numeric(11,2)
, ad_sales numeric(11,2)
, conversion_cost numeric(11,6)
)
language sql
AS $$
with ad_sources as
( select 'Google' as traffic_source
, "day" as ad_date
, round(cast (cost AS numeric ) / 1000000.0,2) as cost
, sales
, cost_per_conversion
from googleads_campaign
union all
select 'Taboola'
, "date"
, spent
, sales
, cost_per_conversion
from taboola_campaign
union all
select 'Microsoft'
, "TimePeriod"
, "Spend"
, sales
, cost_per_conversion
from microsoft_campaign
)
select * from ad_sources;
$$;
With a consolidated view of the data you can now write normal selects as though you had a single table. Such as:
select * from consolidated_ad_traffic();
select distinct on( traffic_source, to_char(ad_month, 'mm'))
traffic_source
, to_char(ad_month, 'Mon') "For Month"
, to_char(sum(ad_cost) over(partition by traffic_source, to_char(ad_month, 'Mon')), 'FM99,999,999,990.00') monthly_traffic_cost
, to_char(sum(ad_cost) over(partition by traffic_source), 'FM99,999,999,990.00') total_traffic_cost
from consolidated_ad_traffic();
select traffic_source, sum(ad_cost) ad_cost
from consolidated_ad_traffic()
group by traffic_source
order by traffic_source;
select traffic_source
, to_char(ad_month, 'dd-Mon') "For Month"
, sum(ad_cost) "Monthly Cost"
from consolidated_ad_traffic()
where date_trunc('month',ad_month) = date_trunc('month', date '2019-07-01')
and traffic_source = 'Google'
group by traffic_source, to_char(ad_month, 'dd-Mon') ;
Now this won't do much for updating but will drastically ease selection.

How to PostgreSQL full join: counting number of events by day

I'm trying to count the number of accidents that happened and were resolved on a given day. The data, stored as 'Accidents', looks something like below. Each accident is assigned a unique accident_issue ID and the ID (empid) of the employee who resolved it. Note that some accidents are not resolved on the same day they happened, and that some accidents happened at the same time.
INSERT INTO Accidents (empid, accident_issue, accident_date, resolved_date) VALUES
('abcd', 'a49b0a4k', '3/12/19 13:25', '3/12/19 13:37'),
('abcd', 'ao3jbvna', '2/1/19 21:16', '2/1/19 21:19'),
('abcd', 'g4b04kcd', '12/12/18 20:37', '12/12/18 21:34'),
('abcd', 'hk9502jb', '12/10/18 21:09', '12/10/18 21:13'),
('abcd', 'cj9rj4vb', '11/30/18 19:44', '11/30/18 19:49'),
('abcd', 'd948mafg', '11/24/18 19:53', '11/26/18 19:55'),
('abcd', 'mkgiud84', '11/24/18 12:48', '11/25/18 14:37'),
('abcd', 'it93hvmv', '11/24/18 12:48', '11/25/18 15:29'),
('efgh', '94jbniv4', '5/17/18 19:56', '5/17/18 20:11'),
('efgh', '5k0bnck5', '4/13/18 15:07', '4/13/18 15:13'),
('efgh', 'mborj3hf', '2/28/18 21:32', '2/28/18 21:51'),
('efgh', 'vkrok4mn', '2/21/18 16:19', '2/21/18 16:35'),
('efgh', '2ivj39cn', '2/20/18 22:01', '2/20/18 22:06'),
('efgh', '0virj3mv', '2/20/18 16:21', '2/20/18 16:23'),
('efgh', 'x20xzn93', '2/9/18 21:16', '2/10/18 21:30'),
('efgh', '49jcn3k5', '2/6/18 19:35', '2/8/18 22:36');
I want the query result to show the number of accidents assigned to and resolved by each employee per day.
My initial idea was to first count the number of accidents and the number of those resolved per day separately, and then full join the two tables.
This is the code that I have been working on so far.
SELECT
a.empid,
a.date,
a.number_of_accidents,
b.number_resolved
FROM
(SELECT A1.empid, A1.accident_issue, to_char(accident_date::date, 'yyyy-mm-dd') as date,
count(accident_date) as number_of_accidents
FROM Accidents as A1
GROUP BY A1.empid, A1.accident_issue
) AS a
FULL OUTER JOIN
(SELECT B1.empid, B1.accident_issue, to_char(resolved_date::date, 'yyyy-mm-dd') as date,
count(resolved_date) as number_resolved
FROM Accidents as B1
GROUP BY B1.empid, B1.accident_issue
) AS b
ON a.date = b.date
GROUP BY a.empid, a.date
When run separately, the two subqueries a and b seem to return what I want, but when put together, for some reason the output gets corrupted and produces multiple duplicate rows.
I want the result to look something like below
| empid | date | number_of_accidents | number_solved |
|-------|-----------|---------------------|---------------|
| abcd | 11/24/18 | 3 | 0 |
| abcd | 11/25/18 | 0 | 2 |
| abcd | 2/1/19 | 1 | 1 |
| abcd | 3/12/19 | 1 | 1 |
| efgh | 2/20/18 | 2 | 2 |
| efgh | 2/21/18 | 1 | 1 |
What seems to be the problem, and am I heading the right direction?
Any help will be greatly appreciated. Thank you!
Aggregate by employee and day in the subqueries and full join them on common day and employee.
SELECT coalesce(o.empid, r.empid) empid,
coalesce(o.day, r.day) date,
o.count number_of_accidents,
r.count number_resolved
FROM (SELECT a.empid,
date_trunc('day', a.accident_date) day,
count(*) count
FROM accidents a
GROUP BY a.empid,
date_trunc('day', a.accident_date)) o
FULL JOIN (SELECT a.empid,
date_trunc('day', a.resolved_date) day,
count(*) count
FROM accidents a
GROUP BY a.empid,
date_trunc('day', a.resolved_date)) r
ON r.empid = o.empid
AND r.day = o.day;
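Note that the full join leaves NULL rather than 0 where one side has no rows for a day; wrapping the counts in coalesce(..., 0) gives the zeros shown in the desired output. Another option, sketched here under the assumption that both date columns can be cast to date as in the question, is to turn each accident into up to two events and aggregate once:

```sql
-- Sketch: one row per event (accident or resolution), then a single GROUP BY.
-- Assumes accident_date and resolved_date can be cast to date, as in the question.
SELECT empid,
       day,
       sum(happened) AS number_of_accidents,
       sum(resolved) AS number_resolved
FROM (
    SELECT empid, accident_date::date AS day, 1 AS happened, 0 AS resolved
    FROM accidents
    UNION ALL
    SELECT empid, resolved_date::date, 0, 1
    FROM accidents
    WHERE resolved_date IS NOT NULL   -- skip accidents that are still open
) events
GROUP BY empid, day
ORDER BY empid, day;
```

Because every day appears in the derived table with an explicit 0 or 1, the sums are never NULL and no join is needed.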

Postgresql: get first item of an ordered group not working [duplicate]

This question already has answers here:
Select first row in each GROUP BY group?
(20 answers)
Closed 6 years ago.
I have this data:
| id | person_id | date |
|--------|-----------|---------------------|
| 313962 | 1111111 | 2016-04-14 16:00:00 | --> this row
| 313946 | 2222222 | 2015-03-13 15:00:00 | --> this row
| 313937 | 1111111 | 2014-02-12 14:00:00 |
| 313944 | 1111111 | 2013-01-11 13:00:00 |
| ...... | ....... | ................... |
-What I would like to select are the indicated rows, i.e. the rows with the most recent date for each person_id.
-Also the output format for the date must be dd-mm-YYYY
So far I was trying with this:
SELECT
l.person_id,
to_char(DATE(l.date), 'dd-mm-YYYY') AS user_date
FROM login l
group by l.person_id
order by l.date desc
I was trying different approaches, but I have all kind of Aggregation error messages such as:
for select distinct order by expressions must appear
And
must appear in the GROUP BY clause or be used in an aggregate function
Any idea?
There are several ways, but the simplest way (and perhaps more efficient - but not SQL standard) is to rely on Postgresql's DISTINCT ON:
SELECT DISTINCT ON (person_id )
id, person_id , date
FROM login
ORDER BY person_id , date desc
The date formatting (do you really want that?) can be done in an outer select:
SELECT id,person_id, to_char(DATE(date), 'dd-mm-YYYY') as date
FROM (
SELECT DISTINCT ON (person_id )
id, person_id , date
FROM login
ORDER BY person_id, date desc )
AS XXX;
You can do it with a subquery, something like this:
SELECT
l.person_id,
to_char(DATE(l.date), 'dd-mm-YYYY') AS user_date
FROM login l
where l.date = (select max(date) from login where person_id = l.person_id)
order by l.person_id
You need something like the following to know which date to grab for each person.
select l.person_id, to_char(DATE(d.maxdate), 'dd-mm-YYYY')
from login l
inner join
(select person_id, max(date) as maxdate
from login group by person_id) d on l.person_id = d.person_id and l.date = d.maxdate
order by d.maxdate desc
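A further option that is not PostgreSQL-specific is to number the rows per person with row_number() and keep the first one. This is a sketch against the login table from the question, not one of the original answers:

```sql
-- Sketch: rank each person's logins from newest to oldest and keep rank 1.
SELECT id,
       person_id,
       to_char(date, 'DD-MM-YYYY') AS user_date
FROM (
    SELECT id,
           person_id,
           date,
           row_number() OVER (PARTITION BY person_id ORDER BY date DESC) AS rn
    FROM login
) ranked
WHERE rn = 1
ORDER BY person_id;
```

Unlike the max(date) variants, this returns exactly one row per person_id even when two logins share the same latest timestamp.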

adding missing date in a table in PostgreSQL

I have a table that contains data for every day in 2002, but some dates are missing: it has 354 records for 2002 instead of 365. For my calculations, I need the missing dates in the table with NULL values.
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | 65.6 | 2002-05-09 |
| 103 | 75.9 | 2002-05-10 |
+-----+------------+------------+
You can see that 2002-05-08 is missing. I want my final table to look like this:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | | 2002-05-08 |
| 103 | 65.6 | 2002-05-09 |
| 104 | 75.9 | 2002-05-10 |
+-----+------------+------------+
Is there a way to do that in PostgreSQL?
It doesn't matter if I have the result just as a query result (not necessarily an updated table)
date is a reserved word in standard SQL and the name of a data type in PostgreSQL. PostgreSQL allows it as identifier, but that doesn't make it a good idea. I use thedate as column name instead.
Don't rely on the absence of gaps in a surrogate ID. That's almost always a bad idea. Treat such an ID as unique number without meaning, even if it seems to carry certain other attributes most of the time.
In this particular case, as @Clodoaldo commented, thedate seems to be a perfect primary key and the column id is just cruft - which I removed:
CREATE TEMP TABLE tbl (thedate date PRIMARY KEY, rainfall numeric);
INSERT INTO tbl(thedate, rainfall) VALUES
('2002-05-06', 110.2)
, ('2002-05-07', 56.6)
, ('2002-05-09', 65.6)
, ('2002-05-10', 75.9);
Query
Full table by query:
SELECT x.thedate, t.rainfall -- rainfall automatically NULL for missing rows
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
LEFT JOIN tbl t USING (thedate)
ORDER BY x.thedate
Similar to what @a_horse_with_no_name posted, but simplified and ignoring the pruned id.
Fills in gaps between first and last date found in the table. If there can be leading / lagging gaps, extend accordingly. You can use date_trunc() like @Clodoaldo demonstrated - but his query, as originally posted, suffered from syntax errors and can be simpler.
INSERT missing rows
The fastest and most readable way to do it is a NOT EXISTS anti-semi-join.
INSERT INTO tbl (thedate, rainfall)
SELECT x.thedate, NULL
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
WHERE NOT EXISTS (SELECT 1 FROM tbl t WHERE t.thedate = x.thedate)
Just do an outer join against a query that returns all dates in 2002:
with all_dates as (
select date '2002-01-01' + i as date_col
from generate_series(0, extract(doy from date '2002-12-31')::int - 1) as i
)
select row_number() over (order by ad.date_col) as id,
t.rainfall,
ad.date_col as date
from all_dates ad
left join your_table t on ad.date_col = t.date
order by ad.date_col;
This will not change your table, it will just produce the result as desired.
Note that the generated id column will not contain the same values as the ID column in your table as it is merely a counter in the result set.
You could also replace the row_number() function with extract(doy from ad.date_col)
To fill the gaps. This will not reorder the IDs:
insert into t (rainfall, "date")
select null, s.d::date
from t
right join generate_series(
    (select date_trunc('year', min("date")) from t)::timestamp,
    (select max("date") from t)::timestamp,
    '1 day'
) s(d) on t."date" = s.d::date
where t."date" is null  -- keep only the days that have no row yet
You have to fully re-create your table, as the indexes have to change.
The better way to do it is to use your preferred DBI language and write a loop that ignores ID and puts the values in a new table with new serialized IDs.
for day in (whole needed calendar)
value = select rainfall from oldbrokentable where date = day
insert into newcleanedtable date=day, rainfall=value, id=serialized
(That's not real code! It is just conceptual and needs to be adapted to your preferred scripting language.)
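The same idea can also be sketched as a single PostgreSQL statement instead of a loop. The table names are the placeholders from the pseudocode, and the sketch assumes oldbrokentable has a "date" column of type date and a rainfall column:

```sql
-- Sketch of the loop above as one statement; table names are the placeholders from the pseudocode.
-- Assumes oldbrokentable has columns "date" (type date) and rainfall.
CREATE TABLE newcleanedtable AS
SELECT row_number() OVER (ORDER BY g.d) AS id,   -- fresh gap-free ids
       o.rainfall,                                -- NULL where the day was missing
       g.d::date AS "date"
FROM generate_series(timestamp '2002-01-01',
                     timestamp '2002-12-31',
                     interval '1 day') AS g(d)
LEFT JOIN oldbrokentable o ON o."date" = g.d::date;
```

The new table then holds one row for every day of 2002, with ids running in date order.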