Select multiple row values into single row with multi-table clauses - postgresql

I've searched the forums and while I see similar posts, they only address pieces of the full query I need to formulate (array_agg, where exists, joins, etc.). If the question I'm posting has been answered, I will gladly accept references to those threads.
I did find this thread ...which is very similar to what I need, except it is for MySQL, and I kept running into errors trying to get it into psql syntax. Hoping someone can help me get everything together. Here's the scenario:
Attribute
attrib_id | attrib_name
UserAttribute
user_id | attrib_id | value
Here's a small example of what the data looks like:
Attribute
attrib_id | attrib_name
-----------------------
1 | attrib1
2 | attrib2
3 | attrib3
4 | attrib4
5 | attrib5
UserAttribute -- there can be up to 15 attrib_id's/value's per user_id
user_id | attrib_id | value
----------------------------
101 | 1 | valueA
101 | 2 | valueB
102 | 1 | valueC
102 | 2 | valueD
103 | 1 | valueA
103 | 2 | valueB
104 | 1 | valueC
104 | 2 | valueD
105 | 1 | valueA
105 | 2 | valueB
Here's what I'm looking for
Result
user_id | attrib1_value | attrib2_value
--------------------------------------------------------
101 | valueA | valueB
102 | valueC | valueD
103 | valueA | valueB
104 | valueC | valueD
105 | valueA | valueB
As shown, I'm looking for single rows that contain:
- user_id from the UserAttribute table
- attribute values from the UserAttribute table
Note: I only need attribute values from the UserAttribute table for two specific attribute names in the Attribute table
Again, any help or reference to an existing solution would be greatly appreciated.
UPDATE:
@ronin provided a query that gets the desired results:
SELECT ua.user_id
,MAX(CASE WHEN a.attrib_name = 'attrib1' THEN ua.value ELSE NULL END) AS attrib_1_val
,MAX(CASE WHEN a.attrib_name = 'attrib2' THEN ua.value ELSE NULL END) AS attrib_2_val
FROM UserAttribute ua
JOIN Attribute a ON (a.attrib_id = ua.attrib_id)
WHERE a.attrib_name IN ('attrib1', 'attrib2')
GROUP BY ua.user_id;
To build on that, I tried to add some 'LIKE' pattern matching within the 'WHEN' condition (against the ua.value), but everything ends up as the 'FALSE' value. Will start a new question to see if that can be incorporated if I cannot figure it out. Thanks all for the help!!

If each attribute only has a single value for a user, you can start by making a sparse matrix:
SELECT user_id
,CASE WHEN attrib_id = 1 THEN value ELSE NULL END AS attrib_1_val
,CASE WHEN attrib_id = 2 THEN value ELSE NULL END AS attrib_2_val
FROM UserAttribute;
Then compress the matrix using an aggregate function:
SELECT user_id
,MAX(CASE WHEN attrib_id = 1 THEN value ELSE NULL END) AS attrib_1_val
,MAX(CASE WHEN attrib_id = 2 THEN value ELSE NULL END) AS attrib_2_val
FROM UserAttribute
GROUP BY user_id;
In response to the comment, searching by attribute name rather than id:
SELECT ua.user_id
,MAX(CASE WHEN a.attrib_name = 'attrib1' THEN ua.value ELSE NULL END) AS attrib_1_val
,MAX(CASE WHEN a.attrib_name = 'attrib2' THEN ua.value ELSE NULL END) AS attrib_2_val
FROM UserAttribute ua
JOIN Attribute a ON (a.attrib_id = ua.attrib_id)
WHERE a.attrib_name IN ('attrib1', 'attrib2')
GROUP BY ua.user_id;
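To try the pivot without a Postgres server, here is a minimal sketch that runs the name-based query against an in-memory SQLite database. SQLite standing in for PostgreSQL is an assumption made only to keep the example self-contained, and the ORDER BY is added so the output order is deterministic. The rows are a subset of the sample data from the question.

```python
import sqlite3

# In-memory SQLite is a stand-in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Attribute (attrib_id INTEGER PRIMARY KEY, attrib_name TEXT);
    CREATE TABLE UserAttribute (user_id INTEGER, attrib_id INTEGER, value TEXT);
    INSERT INTO Attribute VALUES (1, 'attrib1'), (2, 'attrib2'), (3, 'attrib3');
    INSERT INTO UserAttribute VALUES
        (101, 1, 'valueA'), (101, 2, 'valueB'),
        (102, 1, 'valueC'), (102, 2, 'valueD'),
        (103, 1, 'valueA'), (103, 2, 'valueB');
""")

# MAX() ignores the NULLs produced by the CASE branches, so each group
# collapses to the single non-NULL value per attribute.
pivot_sql = """
    SELECT ua.user_id
          ,MAX(CASE WHEN a.attrib_name = 'attrib1' THEN ua.value ELSE NULL END) AS attrib_1_val
          ,MAX(CASE WHEN a.attrib_name = 'attrib2' THEN ua.value ELSE NULL END) AS attrib_2_val
    FROM UserAttribute ua
    JOIN Attribute a ON (a.attrib_id = ua.attrib_id)
    WHERE a.attrib_name IN ('attrib1', 'attrib2')
    GROUP BY ua.user_id
    ORDER BY ua.user_id
"""
rows = conn.execute(pivot_sql).fetchall()
for row in rows:
    print(row)
```

Each user comes back as one row with both attribute values side by side, which is the shape asked for in the question.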

Starting with Postgres 9.4 you can use the simpler aggregate FILTER clause:
SELECT user_id
,MAX(value) FILTER (WHERE attrib_id = 1) AS attrib_1_val
,MAX(value) FILTER (WHERE attrib_id = 2) AS attrib_2_val
FROM UserAttribute
WHERE attrib_id IN (1,2)
GROUP BY 1;
For more than a few attributes or for top performance, look to crosstab() from the additional module tablefunc (Postgres 8.3+). Details here:
PostgreSQL Crosstab Query

What about something like this:
select ua.user_id, ua.value as attrib1_value, ua2.value as attrib2_value
from UserAttribute ua
left join UserAttribute ua2 on ua2.user_id = ua.user_id and ua2.attrib_id = 2 -- second attribute via self-join
where ua.attrib_id = 1; -- users without attribute 1 are not returned

Related

PostgreSQL how to generate a partition row_number() with certain numbers overridden

I have an unusual problem I'm trying to solve with SQL where I need to generate sequential numbers for partitioned rows but override specific numbers with values from the data, while not breaking the sequence (unless the override causes a number to be used greater than the number of rows present).
I feel I might be able to achieve this by selecting the rows where I need to override the generated sequence value and the rows I don't need to override the value, then unioning them together and somehow using coalesce to get the desired dynamically generated sequence value, or maybe there's some way I can utilise recursive.
I've not been able to solve this problem yet, but I've put together a SQL Fiddle which provides a simplified version:
http://sqlfiddle.com/#!17/236b5/5
The desired_dynamic_number is what I'm trying to generate and the generated_dynamic_number is my current work-in-progress attempt.
Any pointers around the best way to achieve the desired_dynamic_number values dynamically?
Update:
I'm almost there using lag:
http://sqlfiddle.com/#!17/236b5/24
step-by-step demo:db<>fiddle
SELECT
*,
COALESCE( -- 3
first_value(override_as_number) OVER w -- 2
, 1
)
+ row_number() OVER w - 1 -- 4, 5
FROM (
SELECT
*,
SUM( -- 1
CASE WHEN override_as_number IS NOT NULL THEN 1 ELSE 0 END
) OVER (PARTITION BY grouped_by ORDER BY secondary_order_by)
as grouped
FROM sample
) s
WINDOW w AS (PARTITION BY grouped_by, grouped ORDER BY secondary_order_by)
1. Create a new subpartition within each partition: the cumulative sum yields a unique group id for every run of records that starts with a non-NULL override_as_number and is followed by NULL records. For instance, (AAA, d) to (AAA, f) belong to the same subpartition.
2. first_value() gives the first value of each subpartition.
3. The COALESCE ensures a non-NULL result from first_value() if a partition starts with a NULL record.
4. row_number() - 1 creates a row count within a subpartition, starting at 0.
5. Adding the first_value() of a subpartition to the row count gives the result: the non-NULL record that starts a subpartition keeps its own value (row count 0), the first following NULL record gets that value + 1, and so forth.
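The same logic can be traced outside the database. Below is a rough Python sketch of the steps above, not a translation of the query planner's behaviour: a non-NULL override starts a new subpartition, and every following NULL row counts up from it. The function name dynamic_numbers is hypothetical, and the rows are copied from the sample output shown elsewhere in this thread.

```python
from itertools import groupby

# (grouped_by, secondary_order_by, override_as_number); None models SQL NULL.
sample = [
    ("AAA", "a", 1), ("AAA", "b", None), ("AAA", "c", 3), ("AAA", "d", 3),
    ("AAA", "e", None), ("AAA", "f", None), ("AAA", "g", 999),
    ("XYZ", "a", None), ("ZZZ", "a", None), ("ZZZ", "b", None),
]

def dynamic_numbers(rows):
    """Hypothetical helper mirroring the subpartition logic of the query."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        # start plays the role of COALESCE(first_value(...), 1),
        # offset plays the role of row_number() - 1.
        start, offset = 1, 0
        first = True
        for grouped_by, order_key, override in partition:
            if override is not None:
                start, offset = override, 0  # a new subpartition begins here
            elif not first:
                offset += 1                  # next NULL row in the subpartition
            first = False
            result.append((grouped_by, order_key, start + offset))
    return result

for row in dynamic_numbers(sample):
    print(row)
```

On this sample the computed numbers agree with the desired_dynamic_number column, including the jump to 999 for (AAA, g).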
The query below gives the exact result for the sample data, but you should verify it against all combinations.
select c.*,COALESCE(c.override_as_number,c.act) as final FROM
(
select b.*, dense_rank() over(partition by grouped_by order by grouped_by, actual) as act from
(
select a.*,COALESCE(override_as_number,row_num) as actual FROM
(
select grouped_by , secondary_order_by ,
dense_rank() over ( partition by grouped_by order by grouped_by, secondary_order_by ) as row_num
,override_as_number,desired_dynamic_number from fiddle
) a
) b
) c ;
column "final" is the result
grouped_by | secondary_order_by | row_num | override_as_number | desired_dynamic_number | actual | act | final
------------+--------------------+---------+--------------------+------------------------+--------+-----+-------
AAA | a | 1 | 1 | 1 | 1 | 1 | 1
AAA | b | 2 | | 2 | 2 | 2 | 2
AAA | c | 3 | 3 | 3 | 3 | 3 | 3
AAA | d | 4 | 3 | 3 | 3 | 3 | 3
AAA | e | 5 | | 4 | 5 | 4 | 4
AAA | f | 6 | | 5 | 6 | 5 | 5
AAA | g | 7 | 999 | 999 | 999 | 6 | 999
XYZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | b | 2 | | 2 | 2 | 2 | 2
(10 rows)
Hope this helps!
The real world problem I was trying to solve did not have a nicely ordered secondary_order_by column, instead it would be something a bit more randomised (a created timestamp).
For the benefit of people who stumble across this question with a similar problem to solve: a colleague solved this problem using a cartesian join, and I'm posting their solution below. The solution is Snowflake SQL, which should be possible to adapt to Postgres. It does fall down on higher override_as_number values, though, unless the 1000 in table(generator(rowcount => 1000)) is increased to something suitably high.
The SQL:
with tally_table as (
select row_number() over (order by seq4()) as gen_list
from table(generator(rowcount => 1000))
),
base as (
select *,
IFF(override_as_number IS NULL, row_number() OVER(PARTITION BY grouped_by, override_as_number order by random),override_as_number) as rownum
from "SANDPIT"."TEST"."SAMPLEDATA" order by grouped_by,override_as_number,random
) --select * from base order by grouped_by,random;
,
cart_product as (
select *
from tally_table cross join (Select distinct grouped_by from base ) as distinct_grouped_by
) --select * from cart_product;
,
filter_product as (
select *,
row_number() OVER(partition by cart_product.grouped_by order by cart_product.grouped_by,gen_list) as seq_order
from cart_product
where CONCAT(grouped_by,'~',gen_list) NOT IN (select concat(grouped_by,'~',override_as_number) from base where override_as_number is not null)
) --select * from try2 order by 2,3 ;
select base.grouped_by,
base.random,
base.override_as_number,
base.answer, -- This is hard coded as test data
IFF(override_as_number is null, gen_list, seq_order) as computed_answer
from base inner join filter_product on base.rownum = filter_product.seq_order and base.grouped_by = filter_product.grouped_by
order by base.grouped_by,
random;
In the end I went for a simpler solution using a temporary table and cursor to inject override_as_number values and shuffle other numbers.

How to find which posts have the highest comments and which posts have the fewest comments?

I am very new to PostgreSQL, SQL and databases, and I hope you can help me with this. I want to know which posts have the most comments and which have the fewest, and the users need to be specified too.
CREATE SCHEMA perf_demo;
SET search_path TO perf_demo;
-- Tables
CREATE TABLE users(
id SERIAL -- PRIMARY KEY
, email VARCHAR(40) NOT NULL UNIQUE
);
CREATE TABLE posts(
id SERIAL -- PRIMARY KEY
, user_id INTEGER NOT NULL -- REFERENCES users(id)
, title VARCHAR(100) NOT NULL UNIQUE
);
CREATE TABLE comments(
id SERIAL -- PRIMARY KEY
, user_id INTEGER NOT NULL -- REFERENCES users(id)
, post_id INTEGER NOT NULL -- REFERENCES posts(id)
, body VARCHAR(500) NOT NULL
);
-- Generate approx. N users
-- Note: NULL values might lead to lesser rows than N value.
INSERT INTO users(email)
WITH query AS (
SELECT 'user_' || seq || '#'
|| ( CASE (random() * 5)::INT
WHEN 0 THEN 'my'
WHEN 1 THEN 'your'
WHEN 2 THEN 'his'
WHEN 3 THEN 'her'
WHEN 4 THEN 'our'
END )
|| '.mail' AS email
FROM generate_series(1, 5) seq -- Important: Replace N with a useful value
)
SELECT email
FROM query
WHERE email IS NOT NULL;
-- Generate N posts
INSERT INTO posts(user_id, title)
WITH expanded AS (
SELECT random(), seq, u.id AS user_id
FROM generate_series(1, 8) seq, users u -- Important: Replace N with a useful value
),
shuffled AS (
SELECT e.*
FROM expanded e
INNER JOIN (
SELECT ei.seq, min(ei.random) FROM expanded ei GROUP BY ei.seq
) em ON (e.seq = em.seq AND e.random = em.min)
ORDER BY e.seq
)
-- Top 20 programming languages: https://www.tiobe.com/tiobe-index/
SELECT s.user_id,
'Let''s talk about (' || s.seq || ') '
|| ( CASE (random() * 19 + 1)::INT
WHEN 1 THEN 'C'
WHEN 2 THEN 'Python'
WHEN 3 THEN 'Java'
WHEN 4 THEN 'C++'
WHEN 5 THEN 'C#'
WHEN 6 THEN 'Visual Basic'
WHEN 7 THEN 'JavaScript'
WHEN 8 THEN 'Assembly language'
WHEN 9 THEN 'PHP'
WHEN 10 THEN 'SQL'
WHEN 11 THEN 'Ruby'
WHEN 12 THEN 'Classic Visual Basic'
WHEN 13 THEN 'R'
WHEN 14 THEN 'Groovy'
WHEN 15 THEN 'MATLAB'
WHEN 16 THEN 'Go'
WHEN 17 THEN 'Delphi/Object Pascal'
WHEN 18 THEN 'Swift'
WHEN 19 THEN 'Perl'
WHEN 20 THEN 'Fortran'
END ) AS title
FROM shuffled s;
-- Generate N comments
-- Note: The cross-join is a performance killer.
-- Try the SELECT without INSERT with small N values to get an estimation of the execution time.
-- With these values you can extrapolate the execution time for a bigger N value.
INSERT INTO comments(user_id, post_id, body)
WITH expanded AS (
SELECT random(), seq, u.id AS user_id, p.id AS post_id
FROM generate_series(1, 10) seq, users u, posts p -- Important: Replace N with a useful value
),
shuffled AS (
SELECT e.*
FROM expanded e
INNER JOIN ( SELECT ei.seq, min(ei.random) FROM expanded ei GROUP BY ei.seq ) em ON (e.seq = em.seq AND e.random = em.min)
ORDER BY e.seq
)
SELECT s.user_id, s.post_id, 'Here some comment: ' || md5(random()::text) AS body
FROM shuffled s;
Could someone please show me how this could be done? I am new to SQL/Postgres, so any help would be much appreciated. An example would be very helpful too.
Good job pasting the whole dataset creation procedure; that is exactly what needs to be included to make the example reproducible.
Let's start with how to join several tables: your posts table contains the user_id, which we can use to join with users as follows.
SELECT email,
users.id user_id,
posts.id post_id,
title
from posts join users
on posts.user_id=users.id;
This will list the posts together with the authors. Check the joining condition (after the ON) stating the fields we're using. The result should be similar to the below
email | user_id | post_id | title
------------------+---------+---------+----------------------------------------
user_1#her.mail | 1 | 5 | Let's talk about (5) Visual Basic
user_1#her.mail | 1 | 2 | Let's talk about (2) Assembly language
user_3#her.mail | 3 | 8 | Let's talk about (8) R
user_3#her.mail | 3 | 7 | Let's talk about (7) Perl
user_4#her.mail | 4 | 6 | Let's talk about (6) Visual Basic
user_5#your.mail | 5 | 4 | Let's talk about (4) R
user_5#your.mail | 5 | 3 | Let's talk about (3) C
user_5#your.mail | 5 | 1 | Let's talk about (1) Ruby
(8 rows)
Now it's time to join this result with the comments table. Since a post may or may not have comments, and you want to show all posts even if they have none, you should use a LEFT OUTER JOIN (more info about join types here).
So let's rewrite the above to include comments
SELECT email,
users.id user_id,
posts.id post_id,
title,
comments.body
from posts
join users
on posts.user_id=users.id
left outer join comments
on posts.id = comments.post_id
;
Check out the join between posts and comments based on post_id.
The result of the query is the list of posts, related author and comments, similar to the below
email | user_id | post_id | title | body
------------------+---------+---------+----------------------------------------+-----------------------------------------------------
user_1#her.mail | 1 | 5 | Let's talk about (5) Visual Basic |
user_1#her.mail | 1 | 2 | Let's talk about (2) Assembly language |
user_3#her.mail | 3 | 8 | Let's talk about (8) R | Here some comment: 200bb07acfbac893aed60e018b47b92b
user_3#her.mail | 3 | 8 | Let's talk about (8) R | Here some comment: 66159adaed11404b1c88ca23b6a689ef
user_3#her.mail | 3 | 8 | Let's talk about (8) R | Here some comment: e5cc1f7c10bb6103053bf281d3cadb60
user_3#her.mail | 3 | 8 | Let's talk about (8) R | Here some comment: 5ae8674c2ef819af0b1a93398efd9418
user_3#her.mail | 3 | 7 | Let's talk about (7) Perl | Here some comment: 5b818da691c1570dcf732ed8f6b718b3
user_3#her.mail | 3 | 7 | Let's talk about (7) Perl | Here some comment: 88a990e9495841f8ed628cdce576a766
user_4#her.mail | 4 | 6 | Let's talk about (6) Visual Basic |
user_5#your.mail | 5 | 4 | Let's talk about (4) R | Here some comment: ed19bb476eb220d6618e224a0ac2910d
user_5#your.mail | 5 | 3 | Let's talk about (3) C | Here some comment: 23cd43836a44aeba47ad212985f210a7
user_5#your.mail | 5 | 1 | Let's talk about (1) Ruby | Here some comment: b83999120bd2bb09d71aa0c6c83a05dd
user_5#your.mail | 5 | 1 | Let's talk about (1) Ruby | Here some comment: b4895f4e0aa0e0106b5d3834af80275e
(13 rows)
Now you can start aggregating and counting the comments for each post. You can use PG's aggregate functions; we'll use COUNT here.
SELECT email,
users.id user_id,
posts.id post_id,
title,
count(comments.id) nr_comments
from posts
join users
on posts.user_id=users.id
left outer join comments
on posts.id = comments.post_id
group by email,
users.id,
posts.id,
title
;
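Note that we count the comments.id field rather than using count(*). Because of the LEFT OUTER JOIN, a post without comments still produces one row (with NULL comment columns), so count(*) would report 1 for it, whereas count(comments.id) skips NULLs and reports 0. Also note that we group the results by email, users.id, posts.id and title, the columns shown alongside the count.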
The result should be similar to
email | user_id | post_id | title | nr_comments
------------------+---------+---------+----------------------------------------+-------------
user_3#her.mail | 3 | 7 | Let's talk about (7) Perl | 2
user_5#your.mail | 5 | 3 | Let's talk about (3) C | 1
user_5#your.mail | 5 | 1 | Let's talk about (1) Ruby | 2
user_3#her.mail | 3 | 8 | Let's talk about (8) R | 4
user_1#her.mail | 1 | 5 | Let's talk about (5) Visual Basic | 0
user_5#your.mail | 5 | 4 | Let's talk about (4) R | 1
user_4#her.mail | 4 | 6 | Let's talk about (6) Visual Basic | 0
user_1#her.mail | 1 | 2 | Let's talk about (2) Assembly language | 0
(8 rows)
This should be the result you're looking for. Just bear in mind, that you're showing the user from users who wrote the post, not the one who commented. To view who commented you'll need to change the joining conditions.
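To get from the counts to "most" and "fewest", one option is to sort by the count and read off both ends. The sketch below does that on an in-memory SQLite database with a tiny hand-made dataset; SQLite, the three posts and the four comments are all assumptions chosen to keep the example short, not the generated data from the question.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, user_id INTEGER,
                           post_id INTEGER, body TEXT);
    INSERT INTO users VALUES (1, 'user_1'), (2, 'user_2');
    INSERT INTO posts VALUES (1, 1, 'post A'), (2, 1, 'post B'), (3, 2, 'post C');
    INSERT INTO comments (user_id, post_id, body) VALUES
        (2, 1, 'c1'), (2, 1, 'c2'), (1, 1, 'c3'), (1, 3, 'c4');
""")

# Same join and grouping as above, sorted so the busiest post comes first.
ranked = conn.execute("""
    SELECT users.email, posts.id AS post_id, posts.title,
           COUNT(comments.id) AS nr_comments
    FROM posts
    JOIN users ON posts.user_id = users.id
    LEFT OUTER JOIN comments ON posts.id = comments.post_id
    GROUP BY users.email, posts.id, posts.title
    ORDER BY nr_comments DESC, posts.id
""").fetchall()

# Keep ties: every post sharing the top (or bottom) count is reported.
most = [r for r in ranked if r[3] == ranked[0][3]]
fewest = [r for r in ranked if r[3] == ranked[-1][3]]
print("most:", most)
print("fewest:", fewest)
```

Filtering on the top and bottom counts instead of using LIMIT 1 keeps all posts that tie for first or last place. In Postgres the same idea could be written with ORDER BY plus a filter on max/min of the count.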

Calculate relative errors in Postgres

Is there an easy way to calculate errors in Postgres given a table that looks something like this:
id | bool | score
1 | False | 9
1 | True | 9.6
2 | False | 5
2 | True | 4.7
The output that I want is id | (False_row - True_row) / True_row:
id | err
1 | -0.0625
2 | 0.063829
SELECT
id,
(false_row - true_row) / true_row
FROM (
SELECT
id,
SUM(CASE WHEN bool THEN score ELSE 0 END) AS true_row,
SUM(CASE WHEN NOT bool THEN score ELSE 0 END) AS false_row
FROM
table_name
GROUP BY
id
) AS sub;
In the subquery (sub) take the true_row and the false_row. This can be done using a variety of aggregate functions, SUM for example.
When you have your true_row and false_row just do the calculations in the outer query.
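As a quick check of the arithmetic, here is a small sketch that runs the query on the four sample rows using an in-memory SQLite database. SQLite is an assumption made for portability, and since it has no boolean type the bool column is stored as 0/1 here.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; bool is stored as 0/1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_name (id INTEGER, bool INTEGER, score REAL);
    INSERT INTO table_name VALUES (1, 0, 9), (1, 1, 9.6), (2, 0, 5), (2, 1, 4.7);
""")

# The inner query folds the two rows per id into one row with both scores;
# the outer query computes the relative error from them.
errors = conn.execute("""
    SELECT id, (false_row - true_row) / true_row AS err
    FROM (
        SELECT id,
               SUM(CASE WHEN bool THEN score ELSE 0 END) AS true_row,
               SUM(CASE WHEN NOT bool THEN score ELSE 0 END) AS false_row
        FROM table_name
        GROUP BY id
    ) AS sub
    ORDER BY id
""").fetchall()

for id_, err in errors:
    print(id_, round(err, 6))
```

For id 1 this is (9 - 9.6) / 9.6 and for id 2 it is (5 - 4.7) / 4.7, matching the values the question asks for up to rounding.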

I am computing a percentage in postgresql and I get the following unexpected behavior when dividing a number by the same number

I am new at postgresql and am having trouble wrapping my mind around why I am getting the results that I see.
I perform the following query
SELECT
name AS region_name,
COUNT(tripsq1.id) AS trips,
COUNT(DISTINCT user_id) AS unique_users,
COUNT(case when consumed_at = start_at then tripsq1.id end) AS first_day,
(SUM(case when consumed_at = start_at then tripsq1.id end)::NUMERIC(6,4))/COUNT(tripsq1.id)::NUMERIC(6,4) AS percent_on_first_day
FROM promotionsq1
INNER JOIN couponsq1
ON promotion_id = promotionsq1.id
INNER JOIN tripsq1
ON couponsq1.id = coupon_id
INNER JOIN regionsq1
ON regionsq1.id = region_id
WHERE promotion_name = 'TestPromo'
GROUP BY region_name;
and get the following result
region_name | trips | unique_users | first_day | percent_on_first_day
-------------------+-------+--------------+-----------+-----------------------
A | 3 | 2 | 1 | 33.3333333333333333
B | 1 | 1 | 0 |
C | 1 | 1 | 1 | 2000.0000000000000000
The first row's percentage is calculated correctly, while the third row's percentage is 20 times what it should be. The percent_on_first_day should be 100.00 since it is 100.0 * 1/1.
Any help would be greatly appreciated
I suspect the issue is caused by this code:
SUM(case when consumed_at = start_at then tripsq1.id end)
This tells me you are summing the ids, which is meaningless. You probably want:
SUM(case when consumed_at = start_at then 1 end)
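The effect is easy to reproduce. In the sketch below, a single hypothetical trip with id 20 is enough to show both variants side by side. The trips table, its columns and the id value are assumptions for illustration (an in-memory SQLite database stands in for Postgres), and the 100.0 factor follows the question's prose rather than the pasted query. The ELSE 0 is an optional addition so that a region with no first-day trips yields 0 instead of NULL.

```python
import sqlite3

# One hypothetical trip whose id happens to be 20 (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (id INTEGER, region TEXT, consumed_at TEXT, start_at TEXT);
    INSERT INTO trips VALUES (20, 'C', '2020-01-01', '2020-01-01');
""")

# First column sums the ids of matching trips; second column counts them.
summed_ids, summed_ones = conn.execute("""
    SELECT 100.0 * SUM(CASE WHEN consumed_at = start_at THEN id END) / COUNT(id),
           100.0 * SUM(CASE WHEN consumed_at = start_at THEN 1 ELSE 0 END) / COUNT(id)
    FROM trips
    GROUP BY region
""").fetchone()

print("summing ids: ", summed_ids)
print("summing ones:", summed_ones)
```

Summing the id inflates the percentage by whatever the ids happen to be, which is consistent with a result that is 20 times too large for a row with a single trip.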

Can window function LAG reference the column which value is being calculated?

I need to calculate the value of some column X based on other columns of the current record and the value of X for the previous record (using some partition and order). Basically I need to implement a query of the form
SELECT <some fields>,
<some expression using LAG(X) OVER (PARTITION BY ... ORDER BY ...)> AS X
FROM <table>
This is not possible because only existing columns can be used in a window function, so I'm looking for a way to work around this.
Here is an example. I have a table with events. Each event has type and time_stamp.
create table event (id serial, type integer, time_stamp integer);
I want to find "duplicate" events (to skip them). By duplicate I mean the following. Let's order all events of a given type by time_stamp ascending. Then:
- the first event is not a duplicate
- all events that follow a non-duplicate and fall within some time frame after it (that is, their time_stamp is not greater than the time_stamp of the previous non-duplicate plus some constant TIMEFRAME) are duplicates
- the next event whose time_stamp is greater than that of the previous non-duplicate by more than TIMEFRAME is not a duplicate
- and so on
For this data
insert into event (type, time_stamp)
values
(1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10),
(1,15), (1, 21), (2,13),
(1, 40);
and TIMEFRAME=10 result should be
time_stamp | type | duplicate
-----------------------------
1 | 1 | false
2 | 1 | true
3 | 1 | true
10 | 1 | true
15 | 1 | false
21 | 1 | true
40 | 1 | false
2 | 2 | false
10 | 2 | true
13 | 2 | false
I could calculate the value of duplicate field based on current time_stamp and time_stamp of the previous non-duplicate event like this:
WITH evt AS (
SELECT
time_stamp,
CASE WHEN
time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
THEN
time_stamp
ELSE
LAG(current_non_dupl_time_stamp) OVER w
END AS current_non_dupl_time_stamp
FROM event
WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate FROM evt;
But this does not work because the field which is calculated cannot be referenced in LAG:
ERROR: column "current_non_dupl_time_stamp" does not exist.
So the question: can I rewrite this query to achieve the effect I need?
Naive recursive chain knitter:
-- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
SELECT e.type,e.time_stamp
, ROW_NUMBER() OVER www as rn -- number the records
, FIRST_VALUE(e.time_stamp) OVER www as fst -- the "group leader"
, EXISTS (SELECT * FROM event x
WHERE x.type = e.type
AND x.time_stamp < e.time_stamp) AS is_dup
FROM event e
WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
;
WITH RECURSIVE ttt AS (
SELECT d0.*
FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
UNION ALL
SELECT d1.type, d1.time_stamp, d1.rn
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp -- 20 is the time frame used here; the question's example uses 10
ELSE ttt.fst END AS fst -- new "group leader"
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
ELSE True END AS is_dup
FROM drag d1
JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
)
SELECT * FROM ttt
ORDER BY type, time_stamp
;
Results:
CREATE TABLE
INSERT 0 10
CREATE VIEW
type | time_stamp | rn | fst | is_dup
------+------------+----+-----+--------
1 | 1 | 1 | 1 | f
1 | 2 | 2 | 1 | t
1 | 3 | 3 | 1 | t
1 | 10 | 4 | 1 | t
1 | 15 | 5 | 1 | t
1 | 21 | 6 | 1 | t
1 | 40 | 7 | 40 | f
2 | 2 | 1 | 2 | f
2 | 10 | 2 | 2 | t
2 | 13 | 3 | 2 | t
(10 rows)
An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.
State transition function:
create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
if st is null or st[1] + timeframe <= time_stamp
then
st[1] := time_stamp;
end if;
st[2] := time_stamp;
return st;
end $$;
Final function:
create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
select st[1] <> st[2];
$$;
Aggregate:
create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
sfunc = is_duplicate,
stype = int[],
finalfunc = is_duplicate_final
);
Query:
select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;
id | type | time_stamp | is_duplicate_agg
----+------+------------+------------------
1 | 1 | 1 | f
2 | 1 | 2 | t
4 | 1 | 3 | t
5 | 1 | 10 | t
7 | 1 | 15 | f
8 | 1 | 21 | t
10 | 1 | 40 | f
3 | 2 | 2 | f
6 | 2 | 10 | t
9 | 2 | 13 | f
(10 rows)
Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.
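If the state handling is hard to picture, the following Python sketch mimics what the aggregate does when used as a window function: the state is carried from row to row within each type, and the final function is applied after every row. It only emulates the logic and is not how Postgres executes it. The function flag_duplicates is a hypothetical driver, and the events are the ones inserted in the question.

```python
def is_duplicate(state, time_stamp, timeframe):
    """State transition: state is [last_non_duplicate, current] or None."""
    if state is None or state[0] + timeframe <= time_stamp:
        leader = time_stamp        # this row becomes the new non-duplicate
    else:
        leader = state[0]          # still inside the previous leader's window
    return [leader, time_stamp]

def is_duplicate_final(state):
    """Final function: a row is a duplicate unless it is its own leader."""
    return state[0] != state[1]

# (type, time_stamp) pairs from the question's INSERT statement.
events = [(1, 1), (1, 2), (2, 2), (1, 3), (1, 10), (2, 10),
          (1, 15), (1, 21), (2, 13), (1, 40)]

def flag_duplicates(rows, timeframe):
    """Hypothetical driver: PARTITION BY type ORDER BY time_stamp."""
    flagged = []
    state_by_type = {}
    for type_, ts in sorted(rows):
        state = is_duplicate(state_by_type.get(type_), ts, timeframe)
        state_by_type[type_] = state
        flagged.append((type_, ts, is_duplicate_final(state)))
    return flagged

for row in flag_duplicates(events, 10):
    print(row)
```

With a time frame of 10 the flags agree with the is_duplicate_agg column in the query output above.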
This feels more like a recursive problem than windowing function. The following query obtained the desired results:
WITH RECURSIVE base(type, time_stamp) AS (
-- 3. base of recursive query
SELECT x.type, x.time_stamp, y.next_time_stamp
FROM
-- 1. start with the initial records of each type
( SELECT type, min(time_stamp) AS time_stamp
FROM event
GROUP BY type
) x
LEFT JOIN LATERAL
-- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = x.type
AND time_stamp > (x.time_stamp + 10)
) y ON true
UNION ALL
-- 4. recursive join, same logic as base
SELECT e.type, e.time_stamp, z.next_time_stamp
FROM event e
JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
LEFT JOIN LATERAL
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = e.type
AND time_stamp > (e.time_stamp + 10)
) z ON true
)
-- The actual query:
-- 5a. All records from base are not duplicates
SELECT time_stamp, type, false
FROM base
UNION
-- 5b. All records from event that are not in base are duplicates
SELECT time_stamp, type, true
FROM event
WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base)
ORDER BY type, time_stamp
There are a lot of caveats with this. It assumes no duplicate time_stamp for a given type. Really, the joins should be based on a unique id rather than on type and time_stamp. I didn't test this much, but it may at least suggest an approach.
This is my first time trying a LATERAL join, so there may be a way to simplify it further. What I really wanted was a recursive CTE whose recursive part uses MIN(time_stamp) based on time_stamp > (x.time_stamp + 10), but aggregate functions are not allowed in CTEs in that manner. It seems the lateral join can be used in the CTE, though.
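The "hop from one non-duplicate to the next" idea behind the recursive query can be sketched in Python as follows. This is an illustration of the approach under the same assumption as above (no repeated time_stamp within a type), and the function name non_duplicate_leaders is hypothetical. Like the query, it uses a strict comparison: the next non-duplicate is the first event later than the current one plus the time frame.

```python
from bisect import bisect_right
from collections import defaultdict

# (type, time_stamp) pairs from the question's INSERT statement.
events = [(1, 1), (1, 2), (2, 2), (1, 3), (1, 10), (2, 10),
          (1, 15), (1, 21), (2, 13), (1, 40)]

def non_duplicate_leaders(rows, timeframe):
    """Hypothetical helper: return the set of (type, time_stamp) non-duplicates."""
    by_type = defaultdict(list)
    for type_, ts in rows:
        by_type[type_].append(ts)

    leaders = set()
    for type_, stamps in by_type.items():
        stamps.sort()
        i = 0                      # the earliest event of each type starts the chain
        while i < len(stamps):
            leaders.add((type_, stamps[i]))
            # index of the first time_stamp strictly greater than leader + timeframe
            i = bisect_right(stamps, stamps[i] + timeframe)
    return leaders

leaders = non_duplicate_leaders(events, 10)
for type_, ts in sorted(events):
    print(ts, type_, (type_, ts) not in leaders)
```

Everything not in the returned set is a duplicate, which corresponds to step 5b of the query.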