T-SQL: Combining rows based on another table - tsql

I am seeking to alter a table content based on information of another table using a stored procedure. To make my point (and dodge my rusty English skills) I created the following simplification.
I have a table with fragment amounts of the form
SELECT * FROM [dbo].[obtained_fragments] ->
fragment amount
22 42
76 7
101 31
128 4
177 22
212 6
and a table that lists all possible combinations to combine these fragments to other fragments.
SELECT * FROM [dbo].[possible_combinations] ->
fragment consists_of_f1 f1_amount_needed consists_of_f2 f2_amount_needed
1001 128 1 22 3
1004 151 1 101 12
1012 128 1 177 6
1047 212 1 76 4
My aim is to alter the first table so that all possible fragment combinations are performed, leading to
SELECT * FROM [dbo].[obtained_fragments] ->
fragment amount
22 30
76 3
101 31
177 22
212 5
1001 4
1047 1
In words, combined fragments are added to the table based on [dbo].[possible_combinations], and the amount of needed fragments is reduced. Depleted fragments are removed from the table.
How do I achieve this fragment transformation in an easy way? I started writing a while loop, checking if sufficient fragments are available, inside of a for loop, interating through the fragment numbers. However, I am unable to come up with a functional amount check and begin to wonder if this is even possible in T-SQL this way.
The code doesn't have to be super efficient since both tables will always be smaller than 200 rows.
It is important to note that it doesn't matter which combinations are created.
It might come in handy that [f1_amount_needed] always has a value of 1.
UPDATE
Using the solution of iamdave, which works perfectly fine as long I don't touch it, I receive the following error message:
Column name or number of supplied values does not match table definition.
I barely changed anything really. Is there a chance that using existing tables with more than the necessary columns instead of declaring the tables (as iamdave did) makes this difference?
DECLARE #t TABLE(Binding_ID int, Exists_of_Binding_ID_2 int, Exists_of_Pieces_2 int, Binding1 int, Binding2 int);
WHILE 1=1
BEGIN
DELETE #t
INSERT INTO #t
SELECT TOP 1
k.Binding_ID
,k.Exists_of_Binding_ID_2
,k.Exists_of_Pieces_2
,g1.mat_Binding_ID AS Binding1
,g2.mat_Binding_ID AS Binding2
FROM [dbo].[vwCombiBinding] AS k
JOIN [leer].[sandbox5] AS g1
ON k.Exists_of_Binding_ID_1 = g1.mat_Binding_ID AND g1.Amount >= 1
JOIN [leer].[sandbox5] AS g2
ON k.Exists_of_Binding_ID_2 = g2.mat_Binding_ID AND g2.Amount >= k.Exists_of_Pieces_2
ORDER BY k.Binding_ID
IF (SELECT COUNT(1) FROM #t) = 1
BEGIN
UPDATE g
SET Amount = g.Amount +1
FROM [leer].[sandbox5] AS g
JOIN #t AS t
ON g.mat_Binding_ID = t.Binding_ID
INSERT INTO [leer].[sandbox5]
SELECT
t.Binding_ID
,1
FROM #t AS t
WHERE NOT EXISTS (SELECT NULL FROM [leer].[sandbox5] AS g WHERE g.mat_Binding_ID = t.Binding_ID);
UPDATE g
SET Amount = g.Amount - 1
FROM [leer].[sandbox5] AS g
JOIN #t AS t
ON g.mat_Binding_ID = t.Binding1
UPDATE g
SET Amount = g.Amount - t.Exists_of_Pieces_2
FROM [leer].[sandbox5] AS g
JOIN #t AS t
ON g.mat_Binding_ID = t.Binding2
END
ELSE
BREAK
END
SELECT * FROM [leer].[sandbox5]

You can do this with a while loop that contains several statements to handle your iterative data updates. As you need to make changes based on a re-assessment of your data each iteration this has to be done in a loop of some kind:
declare #f table(fragment int,amount int);
insert into #f values (22 ,42),(76 ,7 ),(101,31),(128,4 ),(177,22),(212,6 );
declare #c table(fragment int,consists_of_f1 int,f1_amount_needed int,consists_of_f2 int,f2_amount_needed int);
insert into #c values (1001,128,1,22,3),(1004,151,1,101,12),(1012,128,1,177,6),(1047,212,1,76,4);
declare #t table(fragment int,consists_of_f2 int,f2_amount_needed int,fragment1 int,fragment2 int);
while 1 = 1
begin
-- Clear out staging area
delete #t;
-- Populate with the latest possible combination
insert into #t
select top 1 c.fragment
,c.consists_of_f2
,c.f2_amount_needed
,f1.fragment as fragment1
,f2.fragment as fragment2
from #c as c
join #f as f1
on c.consists_of_f1 = f1.fragment
and f1.amount >= 1
join #f as f2
on c.consists_of_f2 = f2.fragment
and f2.amount >= c.f2_amount_needed
order by c.fragment;
-- Update fragments table if a new combination can be made
if (select count(1) from #t) = 1
begin
-- Update if additional fragment
update f
set amount = f.amount + 1
from #f as f
join #t as t
on f.fragment = t.fragment;
-- Insert if a new fragment
insert into #f
select t.fragment
,1
from #t as t
where not exists(select null
from #f as f
where f.fragment = t.fragment
);
-- Update fragment1 amounts
update f
set amount = f.amount - 1
from #f as f
join #t as t
on f.fragment = t.fragment1;
-- Update fragment2 amounts
update f
set amount = f.amount - t.f2_amount_needed
from #f as f
join #t as t
on f.fragment = t.fragment2;
end
else -- If no new combinations possible, break the loop
break
end;
select *
from #f;
Output:
+----------+--------+
| fragment | amount |
+----------+--------+
| 22 | 30 |
| 76 | 3 |
| 101 | 31 |
| 128 | 0 |
| 177 | 22 |
| 212 | 5 |
| 1001 | 4 |
| 1047 | 1 |
+----------+--------+

Related

Taking N-samples from each group in PostgreSQL

I have a table containing data that has a column named id that looks like below:
id
value 1
value 2
value 3
1
244
550
1000
1
251
551
700
1
540
60
1200
...
...
...
...
2
19
744
2000
2
10
903
100
2
44
231
600
2
120
910
1100
...
...
...
...
I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points.
For example I would like a maximum 50 data points randomly selected from id = 1, id = 2 etc...
I cannot find any previous questions similar to this but have tried taking a stab at at least logically working through the solution where I could iterate and union all queries by id and limit to 50:
SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;
But it's obvious that you cannot use this type of solution because UNION ALL requires aggregating outputs from one id to the next and I do not have a list of id values to use in place of X in tbl.id = X.
Is there a way to accomplish this by gathering that list of unique id values and union all results or is there a more optimal way this could be done?
If you want to select a random sample for each id, then you need to randomize the rows somehow. Here is a way to do it:
select * from (
select *, row_number() over (partition by id order by random()) as u
from schema.table
) as a
where u <= 50;
Example (limiting to 3, and some row number for each id so you can see the selection randomness):
setup
DROP TABLE IF EXISTS foo;
CREATE TABLE foo
(
id int,
value1 int,
idrow int
);
INSERT INTO foo
select 1 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 2 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 3 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow;
Selection
select * from (
select *, row_number() over (partition by id order by random()) as u
from foo
) as a
where u <= 3;
Output:
id
value1
idrow
u
1
542
6
1
1
24
86
2
1
155
74
3
2
505
95
1
2
100
46
2
2
422
33
3
3
966
88
1
3
747
89
2
3
664
19
3
In case you are looking to get 50 (or less) from each group of IDs then you can use windowing -
From question - "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."
Query -
with data as (
select row_number() over (partition by id order by random()) rn,
* from table_name)
select * from data where rn<=50 order by id;
Fiddle.
Your description of trying to get the UNION ALL without specifying all the branches ahead of time is aiming for a LATERAL join. And that is one way to solve the problem. But unless you have a table of all distinct ids, you would have to compute one on the fly. For example (using the same fiddle as Pankaj used):
with uniq as (select distinct id from test)
select foo.* from uniq cross join lateral
(select * from test where test.id=uniq.id order by random() limit 3) foo
This could be either slower or faster than the Window Function method, depending on your system and your data and your indexes. In my hands, it was quite a bit faster even with the need to dynamically compute the list of distinct ids.

PostgreSQL - How to get the previous(lag) calculated value

I would like to get the previous(lag) calculated value?
id | value
-------|-------
1 | 1
2 | 3
3 | 5
4 | 7
5 | 9
What I am trying to achieve is this:
id | value | new value
-------|-------|-----------
1 | 1 | 10 <-- 1 * lag(new_value)
2 | 3 | 30 <-- 3 * lag(new_value)
3 | 5 | 150 <-- 5 * lag(new_value)
4 | 7 | 1050 <-- 7 * lag(new_value)
5 | 9 | 9450 <-- 9 * lag(new_value)
What I have tried:
SELECT value,
COALESCE(lag(new_value) OVER () * value, 10) AS new_value
FROM table
Error:
ERROR: column "new_value" does not exist
Similar to Juan's answer but I thought I'd post it anyway. It at least avoids the need for the ID column and doesn't have the empty row at the end:
with recursive all_data as (
select value, value * 10 as new_value
from data
where value = 1
union all
select c.value,
c.value * p.new_value
from data c
join all_data p on p.value < c.value
where c.value = (select min(d.value)
from data d
where d.value > p.value)
)
select *
from all_data
order by value;
The idea is to join exactly one row in the recursive part to exactly one "parent" row. While the "exactly one parent" can be done with a derived table and a lateral join (which surprisingly does allow the limit). The "exactly one row" from the "child" in the recursive part can unfortunately only be done using the sub-select with a min().
The where c.value= (...) wouldn't be necessary if it was possible to use an order by and limit in the recursive part as well, but unfortunately that is not supported in the current Postgres version.
Online example: http://rextester.com/WFBVM53545
My bad, this isnt that easy as I thought. Got a very close result but still need some tunning.
DEMO
WITH RECURSIVE t(n, v) AS (
SELECT MIN(value), 10
FROM Table1
UNION ALL
SELECT (SELECT min(value) from Table1 WHERE value > n),
(SELECT min(value) from Table1 WHERE value > n) * v
FROM t
JOIN Table1 on t.n = Table1.value
)
SELECT n, v
FROM t;

PostgreSQL - dynamic INSERT on column names

I'm looking to dynamically insert a set of columns from one table to another in PostgreSQL. What I think I'd like to do is read in a 'checklist' of column headings (those columns which exist in table 1 - the storage table), and if they exist in the export table (table 2) then insert them in all at once from table 1. Table 2 will be variable in its columns though - once imported ill drop it and import new data to be imported with potentially different column structure. So I need to import it based on the column names.
e.g.
Table 1. - The storage table
ID NAME YEAR LITH_AGE PROV_AGE SIO2 TIO2 CAO MGO COMMENTS
1 John 1998 2000 3000 65 10 5 5 comment1
2 Mark 2005 2444 3444 63 8 2 3 comment2
3 Luke 2001 1000 1500 77 10 2 2 comment3
Table 2. - The export table
ID NAME MG# METHOD SIO2 TIO2 CAO MGO
1 Amy 4 Method1 65 10 5 5
2 Poe 3 Method2 63 8 2 3
3 Ben 2 Method3 77 10 2 2
As you can see the export table may include columns which do not exist in the storage table, so these would be ignored.
I want to insert all of these columns at once, as I've found if I do it individually by column it extends the number of rows each time on the insert (maybe someone can solve this issue instead? Currently I've written a function to check if a column name exists in table 2, if it does, insert it, but as said this extends the rows of the table every time and NULL the rest of the columns).
The INSERT line from my function:
EXECUTE format('INSERT INTO %s (%s) (SELECT %s::%s FROM %s);',_tbl_import, _col,_col,_type,_tbl_export);
As a type of 'code example' for my question:
EXECUTE FORMAT('INSERT INTO table1 (%s) (SELECT (%s) FROM table2)',columns)
where 'columns' would be some variable denoting the columns that exist in the export table that need to go into the storage table. This will be variable as table 2 will be different every time.
This would ideally update Table 1 as:
ID NAME YEAR LITH_AGE PROV_AGE SIO2 TIO2 CAO MGO COMMENTS
1 John 1998 2000 3000 65 10 5 5 comment1
2 Mark 2005 2444 3444 63 8 2 3 comment2
3 Luke 2001 1000 1500 77 10 2 2 comment3
4 Amy NULL NULL NULL 65 10 5 5 NULL
5 Poe NULL NULL NULL 63 8 2 3 NULL
6 Ben NULL NULL NULL 77 10 2 2 NULL
UPDATED answer
As my original answer did not meet requirement came out later but was asked to post an alternative example for information_schema solution so here it is.
I made two versions for solutions:
V1 - is equivalent to already given example using information_schema. But that solution relies on table1 column DEFAULTs. Meaning, if table1 column that does not exist at table2 does not have DEFAULT NULL then it will be filled with whatever the default is.
V2 - is modified to force 'NULL' in case of two table columns mismatch and does not inherit table1 own DEFAULTs
Version1:
CREATE OR REPLACE FUNCTION insert_into_table1_v1()
RETURNS void AS $main$
DECLARE
columns text;
BEGIN
SELECT string_agg(c1.attname, ',')
INTO columns
FROM pg_attribute c1
JOIN pg_attribute c2
ON c1.attrelid = 'public.table1'::regclass
AND c2.attrelid = 'public.table2'::regclass
AND c1.attnum > 0
AND c2.attnum > 0
AND NOT c1.attisdropped
AND NOT c2.attisdropped
AND c1.attname = c2.attname
AND c1.attname <> 'id';
-- Following is the actual result of query above, based on given data examples:
-- -[ RECORD 1 ]----------------------
-- string_agg | name,si02,ti02,cao,mgo
EXECUTE format(
' INSERT INTO table1 ( %1$s )
SELECT %1$s
FROM table2
',
columns
);
END;
$main$ LANGUAGE plpgsql;
Version2:
CREATE OR REPLACE FUNCTION insert_into_table1_v2()
RETURNS void AS $main$
DECLARE
t1_cols text;
t2_cols text;
BEGIN
SELECT string_agg( c1.attname, ',' ),
string_agg( COALESCE( c2.attname, 'NULL' ), ',' )
INTO t1_cols,
t2_cols
FROM pg_attribute c1
LEFT JOIN pg_attribute c2
ON c2.attrelid = 'public.table2'::regclass
AND c2.attnum > 0
AND NOT c2.attisdropped
AND c1.attname = c2.attname
WHERE c1.attrelid = 'public.table1'::regclass
AND c1.attnum > 0
AND NOT c1.attisdropped
AND c1.attname <> 'id';
-- Following is the actual result of query above, based on given data examples:
-- t1_cols | t2_cols
-- --------------------------------------------------------+--------------------------------------------
-- name,year,lith_age,prov_age,si02,ti02,cao,mgo,comments | name,NULL,NULL,NULL,si02,ti02,cao,mgo,NULL
-- (1 row)
EXECUTE format(
' INSERT INTO table1 ( %s )
SELECT %s
FROM table2
',
t1_cols,
t2_cols
);
END;
$main$ LANGUAGE plpgsql;
Also link to documentation about pg_attribute table columns if something is unclear: https://www.postgresql.org/docs/current/static/catalog-pg-attribute.html
Hopefully this helps :)

Coalesce overlapping time ranges in PostgreSQL

I have a PostgreSQL (9.4) table that contains time stamp ranges and user IDs, and I need to collapse any overlapping ranges (with the same user ID) into a single record.
I've tried a complicated set of CTEs to accomplish this, but there are some edge cases in our (40,000+ rows) real table that complicate matters. I've come to the conclusion that I probably need a recursive CTE, but I haven't had any luck writing it.
Here's some code to create a test table and populate it with data. This isn't the exact layout of our table, but it's close enough for an example.
CREATE TABLE public.test
(
id serial,
sessionrange tstzrange,
fk_user_id integer
);
insert into test (sessionrange, fk_user_id)
values
('[2016-01-14 11:57:01-05,2016-01-14 12:06:59-05]', 1)
,('[2016-01-14 12:06:53-05,2016-01-14 12:17:28-05]', 1)
,('[2016-01-14 12:17:24-05,2016-01-14 12:21:56-05]', 1)
,('[2016-01-14 18:18:00-05,2016-01-14 18:42:09-05]', 2)
,('[2016-01-14 18:18:08-05,2016-01-14 18:18:15-05]', 1)
,('[2016-01-14 18:38:12-05,2016-01-14 18:48:20-05]', 1)
,('[2016-01-14 18:18:16-05,2016-01-14 18:18:26-05]', 1)
,('[2016-01-14 18:18:24-05,2016-01-14 18:18:31-05]', 1)
,('[2016-01-14 18:18:12-05,2016-01-14 18:18:20-05]', 3)
,('[2016-01-14 19:32:12-05,2016-01-14 23:18:20-05]', 3)
,('[2016-01-14 18:18:16-05,2016-01-14 18:18:26-05]', 4)
,('[2016-01-14 18:18:24-05,2016-01-14 18:18:31-05]', 2);
I have found that I can do this to get the sessions sorted by the time they started:
select * from test order by fk_user_id, sessionrange
I could use this to determine whether an individual record overlaps with the previous, using window functions:
SELECT *, sessionrange && lag(sessionrange) OVER (PARTITION BY fk_user_id ORDER BY sessionrange)
FROM test
ORDER BY fk_user_id, sessionrange
But this only detects whether the single previous record overlaps the current one (see the record where id = 6). I need to detect all the way back to the beginning of the partition.
After that, I'd need to group any records that overlap together, to find the beginning of the earliest session and the end of the last session to terminate.
I'm sure there's a way to do this that I'm overlooking. How can I collapse these overlapping records?
It is relatively easy to merge overlapping ranges as elements of an array. For simplicity the following function returns set of tstzrange:
create or replace function merge_ranges(tstzrange[])
returns setof tstzrange language plpgsql as $$
declare
t tstzrange;
r tstzrange;
begin
foreach t in array $1 loop
if r && t then r:= r + t;
else
if r notnull then return next r;
end if;
r:= t;
end if;
end loop;
if r notnull then return next r;
end if;
end $$;
Just aggregate the ranges for a user and use the function:
select fk_user_id, merge_ranges(array_agg(sessionrange))
from test
group by 1
order by 1, 2
fk_user_id | merge_ranges
------------+-----------------------------------------------------
1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"]
1 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"]
1 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"]
1 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"]
2 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"]
3 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"]
3 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"]
4 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"]
(8 rows)
Alternatively, the algorithm can be applied to the entire table in one function loop. I'm not sure but for a large dataset this method should be faster.
create or replace function merge_ranges_in_test()
returns setof test language plpgsql as $$
declare
curr test;
prev test;
begin
for curr in
select *
from test
order by fk_user_id, sessionrange
loop
if prev notnull and prev.fk_user_id <> curr.fk_user_id then
return next prev;
prev:= null;
end if;
if prev.sessionrange && curr.sessionrange then
prev.sessionrange:= prev.sessionrange + curr.sessionrange;
else
if prev notnull then
return next prev;
end if;
prev:= curr;
end if;
end loop;
return next prev;
end $$;
Results:
select *
from merge_ranges_in_test();
id | sessionrange | fk_user_id
----+-----------------------------------------------------+------------
1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"] | 1
5 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"] | 1
7 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"] | 1
6 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"] | 1
4 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"] | 2
9 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"] | 3
10 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"] | 3
11 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"] | 4
(8 rows)
The problem is very interesting. I've tried to find a recursive solution but it seems the procedural attempt is most natural and efficient.
I have finally found a recursive solution. The query deletes overlapping rows and inserts their compacted equivalent:
with recursive cte (user_id, ids, range) as (
select t1.fk_user_id, array[t1.id, t2.id], t1.sessionrange + t2.sessionrange
from test t1
join test t2
on t1.fk_user_id = t2.fk_user_id
and t1.id < t2.id
and t1.sessionrange && t2.sessionrange
union all
select user_id, ids || t.id, range + sessionrange
from cte
join test t
on user_id = t.fk_user_id
and ids[cardinality(ids)] < t.id
and range && t.sessionrange
),
list as (
select distinct on(id) id, range, user_id
from cte, unnest(ids) id
order by id, upper(range)- lower(range) desc
),
deleted as (
delete from test
where id in (select id from list)
)
insert into test
select distinct on (range) id, range, user_id
from list
order by range, id;
Results:
select *
from test
order by 3, 2;
id | sessionrange | fk_user_id
----+-----------------------------------------------------+------------
1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"] | 1
5 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"] | 1
7 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"] | 1
6 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"] | 1
4 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"] | 2
9 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"] | 3
10 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"] | 3
11 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"] | 4
(8 rows)

Can window function LAG reference the column which value is being calculated?

I need to calculate value of some column X based on some other columns of the current record and the value of X for the previous record (using some partition and order). Basically I need to implement query in the form
SELECT <some fields>,
<some expression using LAG(X) OVER(PARTITION BY ... ORDER BY ...) AS X
FROM <table>
This is not possible because only existing columns can be used in window function so I'm looking way how to overcome this.
Here is an example. I have a table with events. Each event has type and time_stamp.
create table event (id serial, type integer, time_stamp integer);
I wan't to find "duplicate" events (to skip them). By duplicate I mean the following. Let's order all events for given type by time_stamp ascending. Then
the first event is not a duplicate
all events that follow non duplicate and are within some time frame after it (that is their time_stamp is not greater then time_stamp of the previous non duplicate plus some constant TIMEFRAME) are duplicates
the next event which time_stamp is greater than previous non duplicate by more than TIMEFRAME is not duplicate
and so on
For this data
insert into event (type, time_stamp)
values
(1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10),
(1,15), (1, 21), (2,13),
(1, 40);
and TIMEFRAME=10 result should be
time_stamp | type | duplicate
-----------------------------
1 | 1 | false
2 | 1 | true
3 | 1 | true
10 | 1 | true
15 | 1 | false
21 | 1 | true
40 | 1 | false
2 | 2 | false
10 | 2 | true
13 | 2 | false
I could calculate the value of duplicate field based on current time_stamp and time_stamp of the previous non-duplicate event like this:
WITH evt AS (
SELECT
time_stamp,
CASE WHEN
time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
THEN
time_stamp
ELSE
LAG(current_non_dupl_time_stamp) OVER w
END AS current_non_dupl_time_stamp
FROM event
WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate
But this does not work because the field which is calculated cannot be referenced in LAG:
ERROR: column "current_non_dupl_time_stamp" does not exist.
So the question: can I rewrite this query to achieve the effect I need?
Naive recursive chain knitter:
-- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
SELECT e.type,e.time_stamp
, ROW_NUMBER() OVER www as rn -- number the records
, FIRST_VALUE(e.time_stamp) OVER www as fst -- the "group leader"
, EXISTS (SELECT * FROM event x
WHERE x.type = e.type
AND x.time_stamp < e.time_stamp) AS is_dup
FROM event e
WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
;
WITH RECURSIVE ttt AS (
SELECT d0.*
FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
UNION ALL
SELECT d1.type, d1.time_stamp, d1.rn
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp
ELSE ttt.fst END AS fst -- new "group leader"
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
ELSE True END AS is_dup
FROM drag d1
JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
)
SELECT * FROM ttt
ORDER BY type, time_stamp
;
Results:
CREATE TABLE
INSERT 0 10
CREATE VIEW
type | time_stamp | rn | fst | is_dup
------+------------+----+-----+--------
1 | 1 | 1 | 1 | f
1 | 2 | 2 | 1 | t
1 | 3 | 3 | 1 | t
1 | 10 | 4 | 1 | t
1 | 15 | 5 | 1 | t
1 | 21 | 6 | 1 | t
1 | 40 | 7 | 40 | f
2 | 2 | 1 | 2 | f
2 | 10 | 2 | 2 | t
2 | 13 | 3 | 2 | t
(10 rows)
An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.
State transition function:
create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
if st is null or st[1] + timeframe <= time_stamp
then
st[1] := time_stamp;
end if;
st[2] := time_stamp;
return st;
end $$;
Final function:
create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
select st[1] <> st[2];
$$;
Aggregate:
create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
sfunc = is_duplicate,
stype = int[],
finalfunc = is_duplicate_final
);
Query:
select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;
id | type | time_stamp | is_duplicate_agg
----+------+------------+------------------
1 | 1 | 1 | f
2 | 1 | 2 | t
4 | 1 | 3 | t
5 | 1 | 10 | t
7 | 1 | 15 | f
8 | 1 | 21 | t
10 | 1 | 40 | f
3 | 2 | 2 | f
6 | 2 | 10 | t
9 | 2 | 13 | f
(10 rows)
Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.
This feels more like a recursive problem than windowing function. The following query obtained the desired results:
WITH RECURSIVE base(type, time_stamp) AS (
-- 3. base of recursive query
SELECT x.type, x.time_stamp, y.next_time_stamp
FROM
-- 1. start with the initial records of each type
( SELECT type, min(time_stamp) AS time_stamp
FROM event
GROUP BY type
) x
LEFT JOIN LATERAL
-- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = x.type
AND time_stamp > (x.time_stamp + 10)
) y ON true
UNION ALL
-- 4. recursive join, same logic as base
SELECT e.type, e.time_stamp, z.next_time_stamp
FROM event e
JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
LEFT JOIN LATERAL
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = e.type
AND time_stamp > (e.time_stamp + 10)
) z ON true
)
-- The actual query:
-- 5a. All records from base are not duplicates
SELECT time_stamp, type, false
FROM base
UNION
-- 5b. All records from event that are not in base are duplicates
SELECT time_stamp, type, true
FROM event
WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base)
ORDER BY type, time_stamp
There are a lot of caveats with this. It assumes no duplicate time_stamp for a given type. Really the joins should be based on a unique id rather than type and time_stamp. I didn't test this much, but it may at least suggest an approach.
This is my first time to try a LATERAL join. So there may be a way to simplify that moe. Really what I wanted to do was a recursive CTE with the recursive part using MIN(time_stamp) based on time_stamp > (x.time_stamp + 10), but aggregate functions are not allowed in CTEs in that manner. But it seems the lateral join can be used in the CTE.