The actual query is more involved, but the problem I'm facing can be distilled to this:
I need a query that filters a rowset of monotonically increasing integers so that, in the final result set, row(n+1).value >= row(n).value + 5.
For the actual problem I need to solve, the rowset contains thousands of rows.
A few examples to clarify:
if rows are: 1,2,3,4,5 : then query should return: 1
if rows are: 1,5,7,10,11,12,13 : then query should return: 1,7,12
if rows are: 6,8,11,16,20,23: then query should return: 6,11,16,23
if rows are: 6,8,12,16,20,23: then query should return: 6,12,20
I've managed to get the required results with the following query, but it seems overly complicated. Uncomment the different "..with t(k).." to try them out.
I'm looking for any simplifications or alternative approaches to get the same results.
with recursive r(n, pri) as (
with t(k) as (values (1),(2),(3),(4),(5)) -- the data we want to filter
-- with t(k) as (values (1),(5),(7),(10),(11),(12),(13))
-- with t(k) as (values (6),(8),(11),(16),(20),(23))
-- with t(k) as (values (6),(8),(12),(16),(20),(23))
select min(k), 1::bigint from t -- bootstrap for recursive processing. 1 here represents rank().
UNION
select k, (rank() over(order by k)) rr -- rank() is required only to filter out the rows we don't want from the final result set
from r, t
where t.k >= r.n+5 and r.pri = 1 -- capture the rows we want, AND unfortunately a bunch of rows we don't want
)
select n from r where pri = 1; -- filter out the rows we don't want
-- The data:
CREATE TABLE rowseq(val INTEGER NOT NULL) ;
INSERT INTO rowseq(val ) values
-- (1),(2),(3),(4),(5)
(1), (5), (7), (10), (11), (12), (13)
--(6),(8),(11),(16),(20),(23)
--(6),(8),(12),(16),(20),(23)
;
-- a view holds the row numbers here; a non-recursive CTE listed before the
-- recursive one in the same WITH RECURSIVE clause would work as well
-- [we could also duplicate the row_number() in both legs of the recursive CTE]
CREATE TEMP VIEW qqq AS
SELECT val, row_number() OVER (ORDER BY val) AS rn
FROM rowseq
;
WITH RECURSIVE rrr AS (
SELECT qqq.val, qqq.rn
FROM qqq
WHERE qqq.rn = 1
UNION
SELECT q1.val, q1.rn
FROM rrr
JOIN qqq q1 ON q1.rn > rrr.rn AND q1.val >= rrr.val+5 -- The "Gap" condition
AND NOT EXISTS ( SELECT * FROM qqq nx -- But it must be the FIRST match
WHERE nx.rn > rrr.rn AND nx.val >= rrr.val+5 -- same condition
AND nx.rn < q1.rn -- but NO earlier match
)
)
-- prove it to the world!
SELECT *
FROM rrr
;
What about using plpgsql?
drop table if exists t;
create table t(k) as (
values
(1),(2),(3),(4),(5)
--(1),(5),(7),(10),(11),(12),(13)
--(6),(8),(11),(16),(20),(23)
--(6),(8),(12),(16),(20),(23)
);
create or replace function foo(in n int, out k int) returns setof int as $$
declare
r t;
rp t;
begin
rp := null;
for r in (select * from t order by k) loop -- the gap test relies on ascending order
if (rp is null) or (r.k >= rp.k + n) then
rp := r;
k := r.k;
return next;
end if;
end loop;
return;
end; $$ stable language plpgsql; -- stable rather than immutable, because the function reads table t
select * from foo(5);
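The loop in foo() is a single greedy pass: keep the first value, then keep each later value that is at least n above the last kept one. As a minimal sketch of that logic outside the database (Python, with the hypothetical name filter_with_gap; it models the algorithm only and is not a replacement for the function):

```python
def filter_with_gap(values, gap):
    """Keep the first value, then every value at least `gap` above the last kept one."""
    kept = []
    for v in sorted(values):  # the scan must run in ascending order
        if not kept or v >= kept[-1] + gap:
            kept.append(v)
    return kept


if __name__ == "__main__":
    # The four sample rowsets from the question.
    for rows in ([1, 2, 3, 4, 5],
                 [1, 5, 7, 10, 11, 12, 13],
                 [6, 8, 11, 16, 20, 23],
                 [6, 8, 12, 16, 20, 23]):
        print(rows, "->", filter_with_gap(rows, 5))
```

Because each decision depends on the previously kept value, the pass is inherently sequential, which is why the pure SQL versions need recursion.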
I have a table1: id int, id_2 int, date timestamp, vec float[]
And a table2: id int, vec float[]
My goal is to create a trigger on update of table1 that takes the last 10 rows (by date) for each id_2, averages their vectors along the first axis (10 x N -> N), and writes the result to table2 under id = id_2.
My code:
CREATE OR REPLACE FUNCTION public.foo()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
BEGIN
WITH rows AS (
SELECT DISTINCT t1.id_2, t2.id, t2.vec, t2.date, DENSE_RANK() OVER (PARTITION BY t1.id_2 ORDER BY t2.date desc) AS counter
FROM new_table AS t1
LEFT JOIN table1 t2 ON t1.id_2 = t2.id_2
),
elements_average AS (
SELECT id_2, AVG(unnest::float) AS av
FROM rows,
unnest(vec) with ORDINALITY
WHERE counter < 11
GROUP BY id_2, ORDINALITY
ORDER BY ORDINALITY
),
avr AS (
SELECT id_2, array_agg(av::float) AS averages
FROM elements_average
GROUP BY id_2
)
UPDATE table2 SET vec = averages FROM avr WHERE table2.id = avr.id_2;
RETURN NULL;
END;
$function$
;
CREATE TRIGGER foo_trigger AFTER
UPDATE
ON
public.table1 REFERENCING NEW TABLE AS new_table FOR EACH STATEMENT EXECUTE FUNCTION foo()
The problem: when I update several rows in table1 with different id_2 values in one transaction, the value written to table2 is wrong; it is not the average.
What is even stranger is that this code gives correct values in the same situation:
...
avr AS (
SELECT id_2, array_agg(av::float) AS averages
FROM elements_average
GROUP BY id_2
),
strange_thing AS (
SELECT * from elements_average
)
UPDATE table2 SET vec = averages FROM avr WHERE table2.id = avr.id_2;
RETURN NULL;
END;
$function$
;
So a small, seemingly meaningless SELECT changes the behavior of the function. Is this a bug in Postgres or a mistake on my side?
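For reference, the intended computation for one id_2 can be written as a minimal Python sketch. The name average_last_vectors is hypothetical, and the sketch assumes exactly `limit` rows are kept, whereas DENSE_RANK in the SQL may keep more when dates tie. It keeps the element position explicit, because an aggregate such as array_agg only has a defined input order when it is given its own ORDER BY:

```python
def average_last_vectors(rows, limit=10):
    """rows: iterable of (date, vec) pairs for one id_2; returns the element-wise mean."""
    latest = sorted(rows, key=lambda r: r[0], reverse=True)[:limit]
    if not latest:
        return []
    width = len(latest[0][1])  # assumes all vectors have the same length N
    # Position i of the result is the mean of position i across the kept vectors.
    return [sum(vec[i] for _, vec in latest) / len(latest) for i in range(width)]


if __name__ == "__main__":
    sample = [(1, [1.0, 2.0]), (2, [3.0, 4.0]), (3, [5.0, 6.0])]
    print(average_last_vectors(sample, limit=2))
```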
In PostgreSQL I run a huge select, and then I want to count its size and apply some extra filters.
Executing it twice sounds dumb,
so I wrapped it in a function,
then "cached" it in a CTE and returned the union of the filtered table and an extra row at the end whose "id" column stores the size:
with q as (select * from myFunc())
select * from q
where q.distance < 400
union all
select count(*) as id, null,null,null
from q
but this also doesn't look like a proper solution...
So the question is: does PostgreSQL have something like a "generator function", or anything else that can solve this properly?
PostgreSQL 13
myFunc, aka "selectItemsByRootTag":
CREATE OR REPLACE FUNCTION selectItemsByRootTag(
in tag_name VARCHAR(50)
)
RETURNS table(
id BIGINT,
name VARCHAR(50),
description TEXT,
/*info JSON,*/
distance INTEGER
)
AS $$
BEGIN
RETURN QUERY(
WITH RECURSIVE prod AS (
SELECT
tags.name, tags.id, tags.parent_tags
FROM
tags
WHERE tags.name = (tag_name)
UNION
SELECT c.name, c.id , c.parent_tags
FROM
tags as c
INNER JOIN prod as p
ON c.parent_tags = p.id
)
SELECT
points.id,
points.name,
points.description,
/*points.info,*/
points.distance
from points
left join tags on points.tag_id = tags.id
where tags.name in (select prod.name from prod)
);
END;
$$ LANGUAGE plpgsql;
As a result I would like to see maybe a set of two tables, or a generator function that yields some intermediate result; I am not sure exactly how it should look.
demo
CREATE OR REPLACE FUNCTION pg_temp.selectitemsbyroottag(tag_name text, _distance numeric)
RETURNS TABLE(id bigint, name text, description text, distance numeric, count bigint)
LANGUAGE plpgsql
AS $function$
DECLARE _sql text;
BEGIN
_sql := $p1$WITH RECURSIVE prod AS (
SELECT
tags.name, tags.id, tags.parent_tags
FROM
tags
WHERE tags.name ilike '%$p1$ || tag_name || $p2$%'
UNION
SELECT c.name, c.id , c.parent_tags
FROM
tags as c
INNER JOIN prod as p
ON c.parent_tags = p.id
)
SELECT
points.id,
points.name,
points.description,
points.distance,
count(*) over ()
from points
left join tags on points.tag_id = tags.id
where tags.name in (select prod.name from prod)
and points.distance > $p2$ || _distance
;
raise notice '_sql: %', _sql;
return query execute _sql;
END;
$function$
You can call it in the following ways:
select * from pg_temp.selectItemsByRootTag('test',20);
select * from pg_temp.selectItemsByRootTag('test_8',20) with ORDINALITY;
The first call returns each row together with a count column holding the total number of rows. The second call returns the same rows plus an incrementing ordinality number.
I also turned the q.distance < 400 filter into a function input argument.
selectItemsByRootTag('test',20); means points.distance > 20 and tags.name ilike '%test%'.
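One detail worth keeping in mind: a window function such as count(*) over () is evaluated after the WHERE clause, so it reports the number of filtered rows, while the UNION ALL query in the question counts the unfiltered set. A minimal Python sketch of the two totals, computed from one materialized result (rows_with_totals and the sample rows are hypothetical):

```python
def rows_with_totals(rows, max_distance):
    """Return the filtered rows plus the row count before and after filtering."""
    materialized = list(rows)        # the expensive query runs only once
    total_all = len(materialized)    # what the UNION ALL branch in the question counts
    filtered = [r for r in materialized if r["distance"] < max_distance]
    total_filtered = len(filtered)   # what count(*) over () reports after WHERE
    return filtered, total_all, total_filtered


if __name__ == "__main__":
    sample = [{"id": 1, "distance": 100},
              {"id": 2, "distance": 500},
              {"id": 3, "distance": 399}]
    print(rows_with_totals(sample, 400))
```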
Given a table with arrays of integers, the arrays should be merged so that all arrays that have overlapping entries end up as a single one.
Given the table arrays
a
------------
{1,2,3}
{1,4,7}
{4,7,9}
{15,17,18}
{18,16,15}
{20}
The result should look like this
{1,2,3,4,7,9}
{15,17,18,16}
{20}
As you can see duplicate values from a merged array may be removed and the order of the resulting entries in the array is unimportant. The arrays are integer arrays so functions from the intarray module can be used.
This will be done on a quite large table so performance is critical.
My first naive approach was to self-join the table on the && operator. Like this:
SELECT DISTINCT uniq(sort(t1.a || t2.a))
FROM arrays t1
JOIN arrays t2 ON t1.a && t2.a
This leaves two problems:
1. It is not recursive (it merges at most 2 arrays). This could probably be solved with a recursive CTE.
2. Merged arrays re-occur in the output.
Any input is very welcome.
do $$
declare
arr int[];
arr_id int := 0;
tmp_id int;
begin
create temporary table tmp (v int primary key, id int not null);
for arr in select a from t loop
select id into tmp_id from tmp where v = any(arr) limit 1;
if tmp_id is NULL then
tmp_id = arr_id;
arr_id = arr_id+1;
end if;
insert into tmp
select unnest(arr), tmp_id
on conflict do nothing;
end loop;
end
$$;
select array_agg(v) from tmp group by id;
Pure SQL version:
WITH RECURSIVE x (a) AS (VALUES ('{1,2,3}'::int2[]),
('{1,4,7}'),
('{4,7,9}'),
('{15,17,18}'),
('{18,16,15}'),
('{20}')
), y AS (
SELECT 1::int AS lvl,
ARRAY [ a::text ] AS a,
a AS res
FROM x
UNION ALL
SELECT lvl + 1,
t1.a || ARRAY [ t2.a::text ],
(SELECT array_agg(DISTINCT unnest ORDER BY unnest)
FROM (SELECT unnest(t1.res) UNION SELECT unnest(t2.a)) AS a)
FROM y AS t1
JOIN x AS t2 ON (t2.a && t1.res) AND NOT t2.a::text = ANY(t1.a)
WHERE lvl < 10
)
SELECT DISTINCT res
FROM x
JOIN LATERAL (SELECT res FROM y WHERE x.a && y.res ORDER BY lvl DESC LIMIT 1) AS z ON true
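Merging arrays that overlap transitively amounts to finding connected components over the values. As a minimal sketch of that idea outside the database, here is a union-find version in Python (merge_overlapping is a hypothetical name; this models the logic and is not a drop-in replacement for the queries above). It also handles the case where one array bridges two groups that were seen earlier:

```python
def merge_overlapping(arrays):
    """Merge arrays that share any value, transitively; returns sorted lists."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for arr in arrays:
        if not arr:
            continue
        root = find(arr[0])
        for v in arr[1:]:
            parent[find(v)] = root  # attach the other value's group to this one

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), []).append(v)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    data = [[1, 2, 3], [1, 4, 7], [4, 7, 9], [15, 17, 18], [18, 16, 15], [20]]
    print(merge_overlapping(data))
```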
I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I find 2 companies that potentially are the same, but I need to know which values (columns) differ between these 2 rows in order to continue.
I could compare the two rows column by column, 60 times over, but I am looking for a simpler and more generic solution.
Something like:
SELECT * FROM company where co_id=22
SHOW DIFFERENCE
SELECT * FROM company where co_id=33
The result should be the column names that differ.
For this you may use an intermediate key/value representation of the rows, with JSON functions or alternatively with the hstore extension (now only of historical interest). JSON comes built-in with every reasonably recent version of PostgreSQL, whereas hstore must be installed in the database with CREATE EXTENSION.
Demo:
CREATE TABLE table1 (id int primary key, t1 text, t2 text, t3 text);
Let's insert two rows that differ by the primary key and one other column (t3).
INSERT INTO table1 VALUES
(1,'foo','bar','baz'),
(2,'foo','bar','biz');
Solution with json
First we get a key/value representation of the rows together with their original row number, then we pair the rows based on that row number and filter out the columns whose "value" is the same.
WITH rowcols AS (
select rn, key, value
from (select row_number() over () as rn,
row_to_json(table1.*) as r from table1) AS s
cross join lateral json_each_text(s.r)
)
select r1.key from rowcols r1 join rowcols r2
on (r1.rn=r2.rn-1 and r1.key = r2.key)
where r1.value is distinct from r2.value; -- unlike <>, this also reports NULL vs non-NULL
Sample result:
key
-----
id
t3
Solution with hstore
SELECT skeys(h1-h2) from
(select hstore(t.*) as h1 from table1 t where id=1) h1
CROSS JOIN
(select hstore(t.*) as h2 from table1 t where id=2) h2;
h1-h2 computes the difference key by key and skeys() outputs the result as a set.
Result:
skeys
-------
id
t3
The select-list might be refined with skeys((h1-h2)-'id'::text) to always remove id which, as the primary key, will obviously always differ between rows.
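Both solutions rest on the same idea: turn each row into a key/value map and compare the maps key by key. A minimal Python sketch of that comparison, using the two demo rows as dicts (differing_columns is a hypothetical name):

```python
def differing_columns(row_a, row_b, ignore=()):
    """Return the column names whose values differ between two rows given as dicts."""
    return [
        col for col in row_a
        if col not in ignore and row_a[col] != row_b.get(col)
    ]


if __name__ == "__main__":
    r1 = {"id": 1, "t1": "foo", "t2": "bar", "t3": "baz"}
    r2 = {"id": 2, "t1": "foo", "t2": "bar", "t3": "biz"}
    print(differing_columns(r1, r2))
    print(differing_columns(r1, r2, ignore=("id",)))
```

In this sketch a missing value (None) compared with a real value counts as a difference.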
Here's a stored procedure that should get you most of the way...
While this should work "as is", it has no error checking, which you should add.
It gets all the columns in the table, and loops over them. A difference is when the count of the distinct items is more than one.
Also, the output is:
The count of the number of differences
Messages for each column where there is a difference
It might be more useful to return a rowset of the columns with the differences. Anyway, good luck!
Usage:
SELECT showdifference('public','company','co_id',22,33)
CREATE OR REPLACE FUNCTION showdifference(p_schema text, p_tablename text,p_idcolumn text,p_firstid integer, p_secondid integer)
RETURNS INTEGER AS
$BODY$
DECLARE
l_diffcount INTEGER;
l_column text;
l_dupcount integer;
column_cursor CURSOR FOR select column_name from information_schema.columns where table_name = p_tablename and table_schema = p_schema and column_name <> p_idcolumn;
BEGIN
-- need error checking here, to ensure the table and schema exist and the columns exist
-- Should also check that the records ids exist.
-- Should also check that the column type of the id field is integer
-- Set the number of differences to zero.
l_diffcount := 0;
-- use a cursor to iterate over the columns found in information_schema.columns
-- open the cursor
OPEN column_cursor;
LOOP
FETCH column_cursor INTO l_column;
EXIT WHEN NOT FOUND;
-- build a query to see if there is a difference between the columns. If there is raise a notice
EXECUTE 'select count(distinct ' || quote_ident(l_column) || ' ) from ' || quote_ident(p_schema) || '.' || quote_ident(p_tablename) || ' where ' || quote_ident(p_idcolumn) || ' in ('|| p_firstid || ',' || p_secondid ||')'
INTO l_dupcount;
IF l_dupcount > 1 THEN
-- increment the counter
l_diffcount := l_diffcount +1;
RAISE NOTICE '% has % differences', l_column, l_dupcount ; -- for "real" you might want to return a rowset and could do something here
END IF;
END LOOP;
-- close the cursor
CLOSE column_cursor;
RETURN l_diffcount;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
COST 100;
I'm trying to run a graph search to find all nodes accessible from a starting point, like so:
with recursive
nodes_traversed as (
select START_NODE ID
from START_POSITION
union all
select ed.DST_NODE
from EDGES ed
join nodes_traversed NT
on (NT.ID = ed.START_NODE)
and (ed.DST_NODE not in (select ID from nodes_traversed))
)
select distinct * from nodes_traversed
Unfortunately, when I try to run that, I get an error:
Recursive CTE member (nodes_traversed) can refer itself only in FROM clause.
That "not in select" clause is important to the recursive expression, though, as it provides the ending point. (Without it, you get infinite recursion.) Using generation counting, like in the accepted answer to this question, would not help, since this is a highly cyclic graph.
Is there any way to work around this without having to create a stored proc that does it iteratively?
Here is my solution, which uses a global temporary table. I have limited the recursion by level and by the nodes already stored in the temporary table.
I am not sure how well it will work on a large set of data.
create procedure get_nodes (
START_NODE integer)
returns (
NODE_ID integer)
as
declare variable C1 integer;
declare variable C2 integer;
begin
/**
create global temporary table id_list(
id integer
);
create index id_list_idx1 ON id_list (id);
*/
delete from id_list;
while ( 1 = 1 ) do
begin
select count(distinct id) from id_list into :c1;
insert into id_list
select id from
(
with recursive nodes_traversed as (
select :START_NODE AS ID , 0 as Lv
from RDB$DATABASE
union all
select ed.DST_NODE , Lv+1
from edges ed
join nodes_traversed NT
on
(NT.ID = ed.START_NODE)
and nt.Lv < 5 -- Max recursion level
and nt.id not in (select id from id_list)
)
select distinct id from nodes_traversed);
select count(distinct id) from id_list into :c2;
if (c1 = c2) then break;
end
for select distinct id from id_list into :node_id do
begin
suspend ;
end
end
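For comparison, the traversal that the "not in" condition was meant to express is a breadth-first search with a visited set, which terminates on cyclic graphs because each node is queued at most once. A minimal Python sketch of that logic (reachable_nodes is a hypothetical name; the edges are assumed to be (START_NODE, DST_NODE) pairs):

```python
from collections import deque


def reachable_nodes(edges, start):
    """edges: iterable of (start_node, dst_node) pairs; returns the set reachable from start."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in visited:  # the "not already traversed" check
                visited.add(nxt)
                queue.append(nxt)
    return visited


if __name__ == "__main__":
    sample_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)]
    print(sorted(reachable_nodes(sample_edges, 1)))
```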