PostgreSQL: merge rows keeping some information, without loops

I have a list of calls per user, sometimes separated by only minutes. Users may or may not buy something on these calls.
When a user makes a call within 45 minutes of the previous call, I need to treat it as part of the same call as the first one.
I need to get the final number of calls (aggregating the calls separated by less than 45 minutes)
and the number of calls in which they bought something, per user.
So for example, I have a list like this:
buyer  timestamp       bought_flag
tom    20150201 9:15   1
anna   20150201 9:25   0
tom    20150201 10:15  0
tom    20150201 10:45  1
tom    20150201 10:48  1
anna   20150201 11:50  0
tom    20150201 11:52  0
anna   20150201 11:54  0
The final table would be:
buyer  time_started    calls  articles_bought
tom    20150201 9:15   1      1
anna   20150201 9:25   1      0
tom    20150201 10:15  3      2
anna   20150201 11:50  2      0
tom    20150201 11:52  1      0
So, I need to merge rows separated by less than 45 minutes, still separated per user.
This would be very easy to do with a loop, but loops and functions/procedures are not available in the PostgreSQL setup I am using.
Any ideas about how to do it?
Thank you

Since you do not know beforehand how long a "call" is going to be (you could have a call from some buyer every 30 minutes for the full day - see comment to question), you can only solve this with a recursive CTE. (Note that I changed your column 'timestamp' to 'ts'. Never use a keyword as a table or column name.)
WITH conversations AS (
    WITH RECURSIVE calls AS (
        SELECT buyer, ts, bought_flag,
               row_number() OVER (ORDER BY ts) AS conversation,
               1::int AS calls
        FROM (
            SELECT buyer, ts,
                   lag(ts) OVER (PARTITION BY buyer ORDER BY ts) AS lag,
                   bought_flag
            FROM list) sub
        WHERE lag IS NULL OR ts - lag > interval '45 minutes'
      UNION ALL
        SELECT l.buyer, l.ts, l.bought_flag, c.conversation, c.calls + 1
        FROM list l
        JOIN calls c ON c.buyer = l.buyer AND l.ts > c.ts
        WHERE l.ts - c.ts < interval '45 minutes'
    )
    SELECT buyer, ts, bought_flag, conversation, max(calls) AS calls
    FROM calls
    GROUP BY buyer, ts, bought_flag, conversation
    ORDER BY conversation, ts
)
SELECT buyer, min(ts) AS time_started, max(calls) AS calls,
       sum(bought_flag) AS articles_bought
FROM conversations
GROUP BY buyer, conversation
ORDER BY time_started;
SELECT buyer, min(ts) AS time_started, max(calls) AS calls, sum(bought_flag) AS articles_bought
FROM conversations
GROUP BY buyer, conversation
ORDER BY time_started
A few words of explanation:
The starting term of the inner recursive CTE has a sub-query that gets the basic data from the table for every call, together with the time of that buyer's previous call. The outer query of the starting term keeps only those rows where there is no previous call (lag IS NULL) or where the previous call is more than 45 minutes away. These are therefore the initial calls of what I term here a "conversation". Each conversation gets an id, which is simply the row number from that query, and a calls column that tracks the number of calls in the conversation.
In the recursive term, successive calls in the same conversation are added, with the calls counter incremented.
When calls are very close together (such as 10:45 and 10:48 after 10:15), a later call can be reached through more than one earlier call and therefore appears multiple times (here 10:48). Those duplicates are collapsed in the outer CTE by grouping per call and keeping the highest value of the calls counter.
In the main query, finally, the bought_flag column is summed for every conversation of every buyer, the earliest call supplies time_started, and the highest calls value is the number of calls in the conversation.
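For reference, here is a minimal, hypothetical setup to run the query against: the table name list and the column names are taken from the query above (not from the asker's real schema), and the rows reproduce the question's sample data.
-- Hypothetical setup; table and column names are assumptions taken
-- from the answer above, not from the asker's real schema.
CREATE TABLE list (
    buyer       text,
    ts          timestamp,
    bought_flag int
);

INSERT INTO list (buyer, ts, bought_flag) VALUES
    ('tom',  '2015-02-01 09:15', 1),
    ('anna', '2015-02-01 09:25', 0),
    ('tom',  '2015-02-01 10:15', 0),
    ('tom',  '2015-02-01 10:45', 1),
    ('tom',  '2015-02-01 10:48', 1),
    ('anna', '2015-02-01 11:50', 0),
    ('tom',  '2015-02-01 11:52', 0),
    ('anna', '2015-02-01 11:54', 0);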

The big problem is that you need to group your results into 45-minute sessions, which makes it tricky. This query is a nice starting point but it's not completely correct: grouping by DATE_TRUNC('hour', ...) only approximates the 45-minute windows, so sessions that cross an hour boundary get split. It should help you get going though:
SELECT a.buyer,
       MIN(a.timestamp),
       COUNT(a),
       COUNT(b),
       SUM(a.bought_flag),
       SUM(b.bought_flag)
FROM calls a
LEFT JOIN calls b ON (a.buyer = b.buyer
                  AND a.timestamp != b.timestamp
                  AND a.timestamp < b.timestamp
                  AND a.timestamp + '45 minutes'::INTERVAL > b.timestamp)
GROUP BY a.buyer,
         DATE_TRUNC('hour', a.timestamp);
Results:
┌───────┬─────────────────────┬───────┬───────┬─────┬─────┐
│ buyer │         min         │ count │ count │ sum │ sum │
├───────┼─────────────────────┼───────┼───────┼─────┼─────┤
│ tom   │ 2015-02-01 11:52:00 │     1 │     0 │   0 │   Ø │
│ anna  │ 2015-02-01 11:50:00 │     2 │     1 │   0 │   0 │
│ anna  │ 2015-02-01 09:25:00 │     1 │     0 │   0 │   Ø │
│ tom   │ 2015-02-01 09:15:00 │     1 │     0 │   1 │   Ø │
│ tom   │ 2015-02-01 10:15:00 │     4 │     3 │   2 │   3 │
└───────┴─────────────────────┴───────┴───────┴─────┴─────┘

Thanks Patrick for the note about the original version.
You definitely need window functions here, but a CTE is optional.
with start_points as (
    select tmp.*,
           -- distance from this start point to the next one of the same buyer
           (lead(ts) over w) - ts as start_point_lead
    from (
        select t.*, ts - (lag(ts) over w) as lag
        from test t
        window w as (partition by buyer order by ts)
    ) tmp
    where lag is null or lag > interval '45 minutes'
    window w as (partition by buyer order by ts)
    order by ts
)
select s.buyer, s.ts, count(*), sum(t.bought_flag)
from start_points s
join test t
  on t.buyer = s.buyer
 and (t.ts - s.ts < s.start_point_lead or s.start_point_lead is null)
 and t.ts >= s.ts
group by s.buyer, s.ts
order by s.ts;
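Note that this answer queries a table named test with the question's columns; if you created the sample table list shown after the first answer, either create it again under the name test or replace test with list in the query.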

Related

How to complete missing rows from a table with rows from another table in postgres?

I have two action tables, one specific and one general, keyed on a status and its related actions.
In the first table, some rows (for certain statuses) are missing.
I am trying to return a global table that picks the missing rows from the second table (default_actions) whenever a row is missing in the first one.
Job table: job_actions
Default table: default_actions
I am using the following setup for testing:
CREATE TYPE STATUS AS ENUM ('In Progress', 'Failed', 'Completed');
CREATE TYPE EXPIRATION_ACTION AS ENUM ('Expire', 'Delete');
CREATE TYPE BASIC_ACTION AS (
    status          STATUS,
    operation       EXPIRATION_ACTION,
    expiration_time TIMESTAMP
);
CREATE TYPE ACTION AS (
    partition VARCHAR(40),
    job       VARCHAR(48),
    b_action  BASIC_ACTION
);
CREATE TABLE IF NOT EXISTS job_actions (
    partition       VARCHAR(40),
    job             VARCHAR(48),
    status          STATUS,
    operation       EXPIRATION_ACTION,
    expiration_time TIMESTAMP
);
CREATE TABLE IF NOT EXISTS default_actions OF BASIC_ACTION;
INSERT INTO default_actions (status, operation, expiration_time)
VALUES ('In Progress', 'Expire', 'infinity'::timestamp),
       ('Failed',      'Expire', 'infinity'::timestamp),
       ('Completed',   'Expire', 'infinity'::timestamp);

INSERT INTO job_actions (partition, job, status, operation, expiration_time)
VALUES ('part1', 'job1', 'Failed',      'Expire', NOW() + INTERVAL '1 hour'),
       ('part1', 'job2', 'In Progress', 'Expire', NOW() + INTERVAL '1 hour'),
       ('part1', 'job2', 'Failed',      'Expire', NOW() + INTERVAL '1 hour'),
       ('part1', 'job3', 'In Progress', 'Expire', NOW() + INTERVAL '1 hour'),
       ('part1', 'job3', 'Failed',      'Expire', NOW() + INTERVAL '1 hour');
I am trying to use something like
SELECT ja.partition, ja.job, ja.status, ja.operation, ja.expiration_time
FROM job_actions ja
WHERE NOT EXISTS (
    SELECT da.status, da.operation, da.expiration_time
    FROM default_actions da
);
But at the moment, it returns an empty table.
Would anyone know what I am doing wrong?
Your NOT EXISTS subquery is not correlated with the outer query: it only asks whether default_actions contains any row at all, which is always true here, so NOT EXISTS is false for every row and the result is empty.
To get the expected result instead: first, get all partitions and jobs from job_actions. Then cross join with default_actions to get all possible combinations. Left join that with job_actions and take the expiration_time from there unless it is a NULL value (no matching row was found).
Translated into SQL:
SELECT partition, job, status, operation,
coalesce(ja.expiration_time, da.expiration_time) AS expiration_time
FROM (SELECT DISTINCT partition, job
FROM job_actions) AS jobs
CROSS JOIN default_actions AS da
LEFT JOIN job_actions AS ja USING (partition, job, status, operation)
ORDER BY partition, job, status;
 partition │ job  │   status    │ operation │      expiration_time
═══════════╪══════╪═════════════╪═══════════╪════════════════════════════
 part1     │ job1 │ In Progress │ Expire    │ infinity
 part1     │ job1 │ Failed      │ Expire    │ 2021-06-18 14:57:23.912874
 part1     │ job1 │ Completed   │ Expire    │ infinity
 part1     │ job2 │ In Progress │ Expire    │ 2021-06-18 14:57:23.912874
 part1     │ job2 │ Failed      │ Expire    │ 2021-06-18 14:57:23.912874
 part1     │ job2 │ Completed   │ Expire    │ infinity
 part1     │ job3 │ In Progress │ Expire    │ 2021-06-18 14:57:23.912874
 part1     │ job3 │ Failed      │ Expire    │ 2021-06-18 14:57:23.912874
 part1     │ job3 │ Completed   │ Expire    │ infinity
(9 rows)
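If you would rather stay close to the asker's NOT EXISTS attempt, the subquery has to be correlated with the combination being tested; the original version only asks whether default_actions contains any row at all. Below is a sketch of that variant using the same tables; it is not part of the original answer, and for this sample data it returns the same nine rows:
SELECT ja.partition, ja.job, ja.status, ja.operation, ja.expiration_time
FROM job_actions ja
UNION ALL
-- add a default row for every (partition, job, status) combination
-- that has no specific row in job_actions
SELECT jobs.partition, jobs.job, da.status, da.operation, da.expiration_time
FROM (SELECT DISTINCT partition, job FROM job_actions) AS jobs
CROSS JOIN default_actions AS da
WHERE NOT EXISTS (
    SELECT 1
    FROM job_actions ja
    WHERE ja.partition = jobs.partition
      AND ja.job = jobs.job
      AND ja.status = da.status
)
ORDER BY partition, job, status;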

Average amount between numbers in PostgreSQL

I have a database like this:
ACTIONS
id  day  name
1   4    Bill
2   8    Susan
3   10   Bill
4   12   Bill
5   18   Susan
6   22   John
I want to get the average amount of days or latency between 2 records grouped by name.
Example with Bill
Calculation Steps
Days Between 10 - 4 = 6
Days Between 12 - 10 = 2
Average = 4
Example with Susan
Calculation Steps
Days Between 18 - 8 = 10
Average = 10
Since John has only one record, there is no interval between records, so it should return 0 or null; either way is fine.
So I'm looking to write a query which performs those calculation steps and returns the following dataset:
RESULTS
name   average_days_between
Bill   4
Susan  10
John   null
I was able to write a script that looped through each record and averaged the calculations one at a time, but with a large set of records it takes too long to run.
Is it possible to write a PostgreSQL query which generates a result set like that?
The lag() window function will do this for you. If it's too slow, then rewrite the intervals CTE as a subquery.
with actions (id, day, name) as (
    values (1,  4, 'Bill'),
           (2,  8, 'Susan'),
           (3, 10, 'Bill'),
           (4, 12, 'Bill'),
           (5, 18, 'Susan'),
           (6, 22, 'John')
), intervals as (
    select name,
           day - lag(day) over (partition by name order by day) as latency
    from actions
)
select name,
       avg(latency) as avg_latency,
       count(*) as observations
from intervals
where latency is not null
group by name
order by name;
┌───────┬─────────────────────┬──────────────┐
│ name  │     avg_latency     │ observations │
├───────┼─────────────────────┼──────────────┤
│ Bill  │  4.0000000000000000 │            2 │
│ Susan │ 10.0000000000000000 │            1 │
└───────┴─────────────────────┴──────────────┘
(2 rows)
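As written, the WHERE clause filters out anyone with a single record, so John disappears entirely. If John should appear with a null average, as in the question's expected result, one sketch is to replace the final select of the query above with the following; avg() already ignores nulls, so no filter is needed:
select name,
       avg(latency) as avg_latency,    -- null for John, since avg() skips nulls
       count(latency) as observations  -- counts non-null latencies only
from intervals
group by name
order by name;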

How to add dates with a dataset that has a start and end date?

Hello, I am a beginner at SQL, especially PostgreSQL.
I have a table that looks something like this:
ID | Entity   | Startdate  | enddate
---|----------|------------|-----------
1  | Hospital | 2013-01-01 | 2013-01-31
1  | Clinic   | 2013-02-01 | 2013-04-30
1  | Hospital | 2013-05-01 | 2013-05-31
What I would like to do is break out rows whose start and end dates span more than one month, so the above table would look like this:
ID | Entity   | Startdate  | enddate
---|----------|------------|-----------
1  | Hospital | 2013-01-01 | 2013-01-31
1  | Clinic   | 2013-02-01 | 2013-02-28
1  | Clinic   | 2013-03-01 | 2013-03-31
1  | Clinic   | 2013-04-01 | 2013-04-30
1  | Hospital | 2013-05-01 | 2013-05-31
Notice that rows 2, 3 and 4 have been broken down by month, with the ID and entity duplicated.
Any suggestions on how to run this in postgresql would be appreciated.
P.S. Apologies, I am still figuring out how to format the tables above properly; the pipes between the numbers and words are the column separators.
Hope it's not too confusing.
One way to do this is to create yourself an end_of_month function like this:
CREATE FUNCTION end_of_month(date)
RETURNS date AS
$BODY$
select (date_trunc('month', $1) + interval '1 month' - interval '1 day')::date;
$BODY$
LANGUAGE sql IMMUTABLE STRICT
COST 100;
Then you can have a string of UNIONS like this:
SELECT
    id,
    entity,
    startdate,
    least(end_of_month(startdate), enddate) AS enddate
FROM hospital
UNION
SELECT
    id,
    entity,
    (date_trunc('month', startdate) + interval '1 month')::date,
    least(end_of_month((startdate + interval '1 month')::date), enddate)
FROM hospital
WHERE (date_trunc('month', startdate) + interval '1 month')::date <= enddate
UNION
SELECT
    id,
    entity,
    (date_trunc('month', startdate) + interval '2 months')::date,
    least(end_of_month((startdate + interval '2 months')::date), enddate)
FROM hospital
WHERE (date_trunc('month', startdate) + interval '2 months')::date <= enddate
ORDER BY startdate, enddate
The problem with this approach is that you need as many UNION branches as the longest range spans months!
The alternative is to use a cursor.
EDIT
Just thought of another (better) non-cursor solution. Create a table of month-end dates. Then you can simply do:
select h.id,
h.entity,
h.startdate,
least(h.enddate, m.enddate) enddate
from hospital h
INNER JOIN monthends m
ON m.enddate > h.startdate and m.enddate <= end_of_month(h.enddate)
ORDER BY startdate, enddate
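The monthends table is assumed here but never defined. One hypothetical way to populate it, reusing the end_of_month() function defined above (the table name and date range are assumptions; adjust the range to cover your data):
-- Hypothetical helper table of month-end dates; widen the range as needed.
CREATE TABLE monthends AS
SELECT end_of_month(d::date) AS enddate
FROM generate_series('2010-01-01'::date,
                     '2030-12-01'::date,
                     interval '1 month') AS d;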
Here is an example of how to clone rows based on their data:
-- Demo data begin
with t(i,x,y) as (values
(1, '2013-02-03'::date, '2013-04-27'::date),
(2, current_date, current_date))
-- Demo data end
select
*,
greatest(x, z)::date as x1, least(y, z + '1 month - 1 day'::interval)::date as y1
from
t,
generate_series(date_trunc('month', x)::date, date_trunc('month', y)::date, '1 month') as z;
┌───┬────────────┬────────────┬────────────────────────┬────────────┬────────────┐
│ i │ x │ y │ z │ x1 │ y1 │
╞═══╪════════════╪════════════╪════════════════════════╪════════════╪════════════╡
│ 1 │ 2013-02-03 │ 2013-04-27 │ 2013-02-01 00:00:00+02 │ 2013-02-03 │ 2013-02-28 │
│ 1 │ 2013-02-03 │ 2013-04-27 │ 2013-03-01 00:00:00+02 │ 2013-03-01 │ 2013-03-31 │
│ 1 │ 2013-02-03 │ 2013-04-27 │ 2013-04-01 00:00:00+03 │ 2013-04-01 │ 2013-04-27 │
│ 2 │ 2017-08-27 │ 2017-08-27 │ 2017-08-01 00:00:00+03 │ 2017-08-27 │ 2017-08-27 │
└───┴────────────┴────────────┴────────────────────────┴────────────┴────────────┘
Just remove the demo data block and replace t, x and y with your table/column names.
Explanation:
least() and greatest() return the smallest and largest of their arguments, respectively.
generate_series(v1, v2, d) returns the series of values starting at v1, not greater than v2, in steps of d.
'1 month - 1 day'::interval is interval literal notation; <value>::<datatype> is an explicit type cast, the SQL-standard equivalent of cast(<value> as <datatype>).
date_trunc() truncates a date/timestamp value to the specified precision.
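Applied to the question's table (assuming it is named hospital with columns id, entity, startdate and enddate, as in the first answer), the same pattern becomes something like:
-- A sketch only: table and column names are taken from the question.
select id,
       entity,
       greatest(startdate, z)::date as startdate,
       least(enddate, z + '1 month - 1 day'::interval)::date as enddate
from hospital,
     generate_series(date_trunc('month', startdate),
                     date_trunc('month', enddate),
                     '1 month') as z
order by 1, 3;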

Group by school year in postgres

I have a query where I count the number of rows for each year:
SELECT count(*) as activation_count,
       extract(year from activated_at) as year
FROM "activations"
WHERE ...
GROUP BY year
But I'd like instead to have years ranging from September to September instead of January to January. Thus grouping by school years instead of calendar years.
Can I modify my query to do that?
And more generally, is it possible to group by a time range and specify an offset for it, e.g. extract(year from activated_at offset interval '2 months') as year? (This is not valid syntax, just the idea of what I want.)
What you essentially want is to treat all dates after September as "next year", so the following should work:
select count(*) as activation_count,
       case
           when extract(month from activated_at) >= 9
               then extract(year from activated_at) + 1
           else extract(year from activated_at)
       end as school_year
from activations
group by school_year;
Assuming that someone whose activated_at is '2016-09-01' should be counted as year = 2017, you could add 4 months to activated_at inside the extract, translating (in the mathematical sense of the word) September to January.
SELECT * FROM activations ;
┌────────────────────────┐
│      activated_at      │
├────────────────────────┤
│ 2016-08-01 00:00:00+02 │
│ 2016-09-01 00:00:00+02 │
│ 2016-10-01 00:00:00+02 │
│ 2017-02-02 00:00:00+01 │
└────────────────────────┘
(4 rows)
SELECT COUNT(*),
EXTRACT(year FROM (activated_at + '4 months'::interval)) AS year
FROM activations
GROUP BY year;
┌───────┬──────┐
│ count │ year │
├───────┼──────┤
│     3 │ 2017 │
│     1 │ 2016 │
└───────┴──────┘
(2 rows)
If it should be counted as year = 2016 instead, you could subtract 8 months.
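For the more general "group by a shifted time range" part of the question, the same trick works with date_trunc(): shift into the bucket, truncate, and shift back. A sketch (not from the original answer) that yields the actual start date of each school year instead of a year number:
-- 2016-10-01 falls in the school year starting 2016-09-01;
-- 2016-08-01 falls in the one starting 2015-09-01.
SELECT date_trunc('year', activated_at + interval '4 months')
         - interval '4 months' AS school_year_start,
       count(*) AS activation_count
FROM activations
GROUP BY school_year_start;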

query for median with case statement

How do I get the median of the val column from table test, considering only values greater than 20?
id  val
1   5.43
2   106.26
3   14.00
4   39.58
5   27.00
In this case the output would be median(27.00, 39.58, 106.26) = 39.58.
I am using PostgreSQL database.
Any help would be much appreciated.
Since PostgreSQL 9.4 you can use ordered-set aggregates (the sample table is named foo here rather than test):
postgres=# SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY val)
           FROM foo WHERE val > 20;
┌─────────────────┐
│ percentile_cont │
╞═════════════════╡
│           39.58 │
└─────────────────┘
(1 row)
or, using the FILTER clause (also added in 9.4):
postgres=# SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY val)
                  FILTER (WHERE val > 20)
           FROM foo;
┌─────────────────┐
│ percentile_cont │
╞═════════════════╡
│           39.58 │
└─────────────────┘
(1 row)
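Note that percentile_cont() interpolates between the two middle values when the number of qualifying rows is even. If the median must be an actual value from the table, percentile_disc() takes the same syntax; with the three qualifying rows here, both functions return 39.58:
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY val)
FROM foo
WHERE val > 20;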