I want to walk a directed graph. I need all involved nodes.
I want to support both cases (the result should always be 1,2,3):
INSERT INTO foo VALUES (1,2),(2,3);
INSERT INTO foo VALUES (1,2),(3,2);
WITH RECURSIVE traverse(id, path, cycle) AS (
SELECT a, ARRAY[a], false
FROM foo WHERE a = 1
UNION ALL
SELECT GREATEST(a,b), traverse.path || GREATEST(a,b), GREATEST(a,b) = ANY(traverse.path)
FROM traverse
INNER JOIN foo
ON LEAST(a,b) = traverse.id
WHERE NOT cycle
)
SELECT * FROM traverse
Table foo can have up to 50 million records. There is an index on each column (not a multicolumn index). The query isn't very fast; without GREATEST and LEAST it's very fast. Any other solutions?
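One alternative shape for the query is to match an edge on either endpoint and rely on UNION (instead of UNION ALL plus a path array) to stop at nodes that were already visited. The following is only a sketch: it uses Python's sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, the helper name connected_nodes is made up for illustration, and it says nothing about performance on a 50 million row table.

```python
import sqlite3

def connected_nodes(edges, start):
    """Return all node ids connected to `start`, ignoring edge direction."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE foo (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO foo VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE traverse(id) AS (
            SELECT ?
            UNION  -- UNION drops ids that were already produced, so cycles terminate
            SELECT CASE WHEN foo.a = traverse.id THEN foo.b ELSE foo.a END
            FROM traverse
            JOIN foo ON foo.a = traverse.id OR foo.b = traverse.id
        )
        SELECT id FROM traverse ORDER BY id
        """,
        (start,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

# The two cases from the question
print(connected_nodes([(1, 2), (2, 3)], 1))  # [1, 2, 3]
print(connected_nodes([(1, 2), (3, 2)], 1))  # [1, 2, 3]
```

PostgreSQL also accepts UNION in a recursive CTE, so the same query shape should carry over, but whether the OR join can use the two single-column indexes efficiently would need to be checked with EXPLAIN.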
Update: An iterative solution is not that bad after analyzing the requirements again:
There are 54 million edges and 21 million nodes in the db. They form distinct graphs, and each connected graph has 3 to 100 nodes.
It worked well for the question "give me all related nodes": 20 msec (graph depth = 13).
DROP TABLE IF EXISTS edges, nodes;
CREATE TABLE nodes
(
id INTEGER PRIMARY KEY
);
CREATE TABLE edges
(
"from" INTEGER NOT NULL REFERENCES nodes(id),
"to" INTEGER NOT NULL REFERENCES nodes(id)
);
INSERT INTO nodes SELECT generate_series(1,5);
INSERT INTO edges VALUES
-- circle
(1,2),(2,3),(3,1),
-- other direction
(4,3);
CREATE OR REPLACE FUNCTION walk_graph(param_node_id INTEGER)
RETURNS TABLE (id INTEGER)
LANGUAGE plpgsql
AS $$
DECLARE
var_node_ids INTEGER[] := ARRAY[param_node_id];
var_iteration_node_ids INTEGER[] := ARRAY[param_node_id];
BEGIN
WHILE array_length(var_iteration_node_ids, 1) > 0 LOOP
var_iteration_node_ids := ARRAY(SELECT DISTINCT "to" FROM edges
WHERE "from" = ANY(var_iteration_node_ids)
AND NOT("to" = ANY(var_node_ids))
UNION
SELECT DISTINCT "from" FROM edges
WHERE "to" = ANY(var_iteration_node_ids)
AND NOT("from" = ANY(var_node_ids)));
var_node_ids := var_node_ids || var_iteration_node_ids;
END LOOP;
RETURN QUERY SELECT unnest(var_node_ids);
END $$;
SELECT * FROM walk_graph(2);
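For readers who want to follow the loop without a database, here is a minimal Python sketch of the same frontier expansion. It assumes the edges fit in memory as a list of (from, to) tuples; it mirrors the logic of walk_graph above but is not a replacement for it.

```python
def walk_graph(edges, start):
    """Collect every node connected to `start`, following edges in both directions."""
    seen = {start}       # plays the role of var_node_ids
    frontier = {start}   # plays the role of var_iteration_node_ids
    while frontier:
        next_frontier = set()
        for src, dst in edges:
            if src in frontier and dst not in seen:
                next_frontier.add(dst)
            if dst in frontier and src not in seen:
                next_frontier.add(src)
        seen |= next_frontier
        frontier = next_frontier
    return seen

# Same sample data as above: a circle 1 -> 2 -> 3 -> 1 plus the edge 4 -> 3
edges = [(1, 2), (2, 3), (3, 1), (4, 3)]
print(sorted(walk_graph(edges, 2)))  # [1, 2, 3, 4]
```

Node 5 exists in the nodes table but has no edges, so it is not part of the result.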
Say I have a table:
CREATE TABLE nodes (
id SERIAL PRIMARY KEY,
parent_id INTEGER REFERENCES nodes(id),
trashed_at timestamptz
)
I have this query nodes_trash_node(node_id INTEGER):
UPDATE nodes SET
trashed_at = now()
WHERE nodes.id = node_id
OR nodes.id IN (SELECT id FROM nodes_descendants(node_id))
RETURNING *
The nodes_descendants function operates on an adjacency list structure and looks like this:
CREATE OR REPLACE FUNCTION nodes_descendants(node_id INTEGER, depth INTEGER DEFAULT 0)
RETURNS TABLE (id INTEGER) AS $$
WITH RECURSIVE tree AS (
SELECT id, array[node_id]::integer[] as ancestors
FROM nodes
WHERE parent_id = node_id
UNION ALL
SELECT nodes.id, tree.ancestors || nodes.parent_id
FROM nodes, tree
WHERE nodes.parent_id = tree.id
AND (depth = 0 OR cardinality(tree.ancestors) < depth)
)
SELECT id FROM tree;
$$ LANGUAGE sql;
(taken from here).
However I'd now like to convert my query to take a list of node_ids, but I'm struggling to find the correct syntax. Something like:
UPDATE nodes SET
trashed_at = now()
WHERE nodes.id = ANY(node_ids)
OR nodes.id IN (???)
RETURNING *
EDIT
Just to clarify, I'd like to now select many 'root' node_ids and all their descendants. For the example use case: select many files and folders and move to the trash at the same time.
Thanks.
It is actually straightforward if you do not use a function.
BTW, I have changed it to a proper INNER JOIN.
Please do not use table products (i.e. cross joins) followed by WHERE, as someday you will mistakenly leave the WHERE out.
WITH RECURSIVE tree AS (
SELECT id
FROM nodes
WHERE <Type your condition here>
UNION ALL
SELECT nodes.id
FROM nodes
JOIN tree ON nodes.parent_id = tree.id
)
UPDATE nodes SET
trashed_at = now()
WHERE nodes.id IN (SELECT id from Tree)
RETURNING *
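To make the multi-root case concrete, here is a small runnable sketch of the same statement with the seed condition filled in for a list of ids. It runs against SQLite through Python's sqlite3 module purely for illustration; the sample tree and the helper name trash_nodes are assumptions. In PostgreSQL the seed condition would more naturally be WHERE id = ANY(node_ids), and RETURNING * could stay.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES nodes(id),
        trashed_at TEXT
    );
    -- Hypothetical sample tree: 1 -> 2 -> 3, 4 -> 5, and 6 on its own
    INSERT INTO nodes (id, parent_id) VALUES
        (1, NULL), (2, 1), (3, 2), (4, NULL), (5, 4), (6, NULL);
""")

def trash_nodes(conn, node_ids):
    """Mark the given nodes and all of their descendants as trashed."""
    placeholders = ", ".join("?" for _ in node_ids)
    conn.execute(
        f"""
        WITH RECURSIVE tree(id) AS (
            SELECT id FROM nodes WHERE id IN ({placeholders})  -- all the roots at once
            UNION ALL
            SELECT nodes.id FROM nodes JOIN tree ON nodes.parent_id = tree.id
        )
        UPDATE nodes SET trashed_at = datetime('now')
        WHERE id IN (SELECT id FROM tree)
        """,
        list(node_ids),
    )
    # Report every row that is currently trashed
    rows = conn.execute(
        "SELECT id FROM nodes WHERE trashed_at IS NOT NULL ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]

print(trash_nodes(conn, [2, 4]))  # [2, 3, 4, 5]
```

Nodes 1 and 6 stay untouched because neither is a selected root nor a descendant of one.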
It's a little hard to know if this is right or not without the larger context of where these updates are running. Presumably it's within a procedure/function or through an application?
Either way, I think your final syntax was fine -- it's just you need to ensure the datatype you pass is an array:
update nodes
set trashed_at = now()
where id in (1, 2, 3);
Is essentially the same functionally as:
update nodes
set trashed_at = now()
where id = any(array[1, 2, 3]);
So, back to your original statement:
UPDATE nodes SET
trashed_at = now()
WHERE nodes.id = ANY(node_ids)
OR nodes.id IN (???)
RETURNING *
I think you can simplify this to:
UPDATE nodes SET
trashed_at = now()
WHERE nodes.id = ANY(node_ids)
RETURNING *
Just be sure node_ids is an array whose element type matches the id column (integer[] here, since SERIAL is a 32-bit integer).
So, assuming this was within a procedure, these are some examples:
DECLARE
node_ids integer[];
BEGIN
node_ids := array[1, 2, 3, 4];
-- or perhaps
select array_agg (bar)
into node_ids
from foo
where baz = x;
UPDATE nodes
SET trashed_at = now()
WHERE nodes.id = ANY(node_ids)
RETURNING *;
END;
IMO it's always been a struggle to pass an in-list as a parameter, but with PostgreSQL arrays it's not only possible but quite straightforward.
I have two tables that contain categorized tsrange values. The ranges in each table are non-overlapping per category, but the ranges in b might overlap those in a.
create table a ( id serial primary key, category int, period tsrange );
create table b ( id serial primary key, category int, period tsrange );
What I would like to do is combine these two tables into a CTE for another query. The combined values needs to be the tsranges from table a subtracted by any overlapping tsranges in table b with the same category.
The complication is that in the case where an overlapping b.period is contained inside an a.period, the result of the subtraction is two rows. The Postgres Range - operator does not support this, so I create a function that will return 1 or 2 rows:
create function subtract_tsrange( a tsrange , b tsrange )
returns table (period tsrange)
language 'plpgsql' as $$
begin
if a @> b and not isempty(b) and lower(a) <> lower(b) and upper(b) <> upper(a)
then
period := tsrange(lower(a), lower(b), '[)');
return next;
period := tsrange(upper(b), upper(a), '[)');
return next;
else
period := a - b;
return next;
end if;
return;
end
$$;
There can also be several b.periods overlapping an a.period, so one row from a might potentially be split into a lot of rows with shorter periods.
Now I want to create a select that takes each row in a and returns:
The original a.period if there is no overlapping b.period with the same category
or
1 or several rows representing the original a.period minus all overlapping b.periods with the same category.
After reading lots of other posts I figure I should use SELECT LATERAL in combination with my function somehow, but I'm still scratching my head as to how?? (We're talking Postgres 9.6 btw!)
Notes: your problem can easily be generalized to every range type, so I will use the anyrange pseudo-type in my answer, but you don't have to. Because of this I had to create a generic constructor for range types, since PostgreSQL does not define one (yet):
create or replace function to_range(t anyrange, l anyelement, u anyelement, s text default '[)', out to_range anyrange)
language plpgsql as $func$
begin
execute format('select %I($1, $2, $3)', pg_typeof(t)) into to_range using l, u, s;
end
$func$;
Of course, you can use the appropriate range constructor instead of to_range() calls.
Also, I will use the numrange type for testing purposes, as it can be created and checked more easily than the tsrange type, but my answer should work with that as well.
Answer:
I rewrote your function to handle any type of bounds (inclusive, exclusive and even unbounded ranges). Also, it will return an empty result set when a <@ b.
create or replace function range_div(a anyrange, b anyrange)
returns setof anyrange
language sql as $func$
select * from unnest(case
when b is null or a <@ b then '{}'
when a @> b then array[
to_range(a, case when lower_inf(a) then null else lower(a) end,
case when lower_inf(b) then null else lower(b) end,
case when lower_inc(a) then '[' else '(' end ||
case when lower_inc(b) then ')' else ']' end),
to_range(a, case when upper_inf(b) then null else upper(b) end,
case when upper_inf(a) then null else upper(a) end,
case when upper_inc(b) then '(' else '[' end ||
case when upper_inc(a) then ']' else ')' end)
]
else array[a - b]
end)
$func$;
With this in mind, what you need is somewhat of an inverse of aggregation. For example, with sum() one can start with an empty value (0) and keep adding values to it. But you have your initial value and need to keep removing parts of it.
One solution to that is to use recursive CTEs:
with recursive r as (
select *
from a
union
select r.id, r.category, d
from r
left join b using (category)
cross join range_div(r.period, b.period) d -- this is in fact an implicit lateral join
where r.period && b.period
)
select r.*
from r
left join b on r.category = b.category and r.period && b.period
where not isempty(r.period) and b.period is null
My sample data:
create table a (id serial primary key, category int, period numrange);
create table b (id serial primary key, category int, period numrange);
insert into a (category, period) values (1, '[1,4]'), (1, '[2,5]'), (1, '[3,6]'), (2, '(1,6)');
insert into b (category, period) values (1, '[2,3)'), (1, '[1,2]'), (2, '[3,3]');
The query above produces:
 id | category | period
----+----------+--------
  3 |        1 | [3,6]
  1 |        1 | [3,4]
  2 |        1 | [3,5]
  4 |        2 | (1,3)
  4 |        2 | (3,6)
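The repeated subtraction that the recursive query performs can be sketched outside the database. The Python below is a simplified illustration, not a port of range_div: it assumes non-empty half-open intervals [lo, hi) given as plain tuples and ignores inclusive or unbounded bounds. The function names are made up for the example.

```python
def subtract_one(a, b):
    """Subtract half-open interval b from a; returns 0, 1 or 2 remaining pieces."""
    a_lo, a_hi = a
    b_lo, b_hi = b
    if b_hi <= a_lo or b_lo >= a_hi:   # no overlap: a survives unchanged
        return [a]
    pieces = []
    if a_lo < b_lo:                    # part of a to the left of b
        pieces.append((a_lo, b_lo))
    if b_hi < a_hi:                    # part of a to the right of b
        pieces.append((b_hi, a_hi))
    return pieces

def subtract_all(a, bs):
    """Start with the whole of a and keep removing every interval in bs."""
    remaining = [a]
    for b in bs:
        remaining = [piece for r in remaining for piece in subtract_one(r, b)]
    return remaining

print(subtract_all((1, 10), [(3, 4), (6, 7)]))  # [(1, 3), (4, 6), (7, 10)]
```

Each b either leaves a piece alone, trims it, splits it in two, or removes it entirely, which is why one row from a can turn into several rows.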
This is a simple example of what I need, for any given table, I need to get all the instances of the primary keys, this is a little example, but I need a generic way to do it.
create table foo
(
a numeric
,b text
,c numeric
,constraint pk_foo primary key (a,b)
);
insert into foo(a,b,c) values (1,'a',1),(2,'b',2),(3,'c',3);
select <the magical thing>
result
a|b
1 |1|a|
2 |2|b|
3 |3|c|
.. ...
I need to check whether the primary key values are changed by the user, but I don't want to repeat code across many tables! I need a generic way to do it; I will put <the magical thing> in a function, call it from a before-update trigger, and so on.
In PostgreSQL you must always provide a resulting type for a query. However, you can obtain the code of the query you need, and then execute the query from the client:
create or replace function get_key_only_sql(regclass) returns text as $$
-- builds "select <primary key columns> from <table>" for the given table
select 'select '|| (
select string_agg(quote_ident(att.attname), ', ' order by col)
from pg_index i
join lateral unnest(indkey) col on (true)
join pg_attribute att on (att.attrelid = i.indrelid and att.attnum = col)
where i.indrelid = $1 and i.indisprimary
group by i.indexrelid
limit 1) || ' from '||$1::text;
$$ language sql;
Here's some client pseudocode using the function above:
sql = pgexecscalar("select get_key_only_sql('mytable'::regclass)");
rs = pgopen(sql);
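The two-step idea (first build the SQL text from the catalog, then execute it from the client) can be tried end to end without a PostgreSQL server. The sketch below swaps in SQLite through Python's sqlite3 module, so it reads the key columns from PRAGMA table_info instead of pg_index; the helper name key_only_sql is an assumption, and the PostgreSQL version would call get_key_only_sql from the client instead.

```python
import sqlite3

def key_only_sql(conn, table):
    """Build 'SELECT <primary key columns> FROM <table>' from SQLite's metadata."""
    quoted_table = '"' + table.replace('"', '""') + '"'
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk);
    # pk is the 1-based position inside the primary key, or 0 for non-key columns.
    info = conn.execute(f"PRAGMA table_info({quoted_table})").fetchall()
    pk_cols = sorted((row[5], row[1]) for row in info if row[5] > 0)
    col_list = ", ".join('"' + name.replace('"', '""') + '"' for _, name in pk_cols)
    return f"SELECT {col_list} FROM {quoted_table}"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (a NUMERIC, b TEXT, c NUMERIC, "
    "CONSTRAINT pk_foo PRIMARY KEY (a, b))"
)
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)", [(1, "a", 1), (2, "b", 2), (3, "c", 3)]
)

sql = key_only_sql(conn, "foo")        # step 1: obtain the query text
print(sql)                             # SELECT "a", "b" FROM "foo"
print(conn.execute(sql).fetchall())    # step 2: run it from the client
```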
I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I find 2 companies that potentially are the same, but I need to know which values (columns) differ between these 2 rows in order to continue.
I think it is possible to compare column by column x 60, but I am looking for a simpler and more generic solution.
Something like:
SELECT * FROM company where co_id=22
SHOW DIFFERENCE
SELECT * FROM company where co_id=33
The result should be the column names that differ.
For this you may use an intermediate key/value representation of the rows, with JSON functions or alternatively with the hstore extension (now only of historical interest). JSON comes built-in with every reasonably recent version of PostgreSQL, whereas hstore must be installed in the database with CREATE EXTENSION.
Demo:
CREATE TABLE table1 (id int primary key, t1 text, t2 text, t3 text);
Let's insert two rows that differ by the primary key and one other column (t3).
INSERT INTO table1 VALUES
(1,'foo','bar','baz'),
(2,'foo','bar','biz');
Solution with json
First we get a key/value representation of the rows with the original row number, then we pair the rows based on their original row number and filter out those with the same "value" column:
WITH rowcols AS (
select rn, key, value
from (select row_number() over () as rn,
row_to_json(table1.*) as r from table1) AS s
cross join lateral json_each_text(s.r)
)
select r1.key from rowcols r1 join rowcols r2
on (r1.rn=r2.rn-1 and r1.key = r2.key)
where r1.value <> r2.value;
Sample result:
key
-----
id
t3
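The key/value idea is easy to see outside SQL as well. In this minimal Python sketch each row is assumed to be a dict of column name to value (roughly what row_to_json produces), and the function name differing_columns is made up for the example.

```python
def differing_columns(row1, row2):
    """Return the column names whose values differ between two rows given as dicts."""
    return [key for key in row1 if row1[key] != row2.get(key)]

# The two demo rows from table1
row1 = {"id": 1, "t1": "foo", "t2": "bar", "t3": "baz"}
row2 = {"id": 2, "t1": "foo", "t2": "bar", "t3": "biz"}
print(differing_columns(row1, row2))  # ['id', 't3']
```

One difference to keep in mind: Python reports a None against a value as different, whereas the SQL comparison r1.value <> r2.value yields NULL when either side is NULL, so such a column is not returned by the query above.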
Solution with hstore
SELECT skeys(h1-h2) from
(select hstore(t.*) as h1 from table1 t where id=1) h1
CROSS JOIN
(select hstore(t.*) as h2 from table1 t where id=2) h2;
h1-h2 computes the difference key by key and skeys() outputs the result as a set.
Result:
skeys
-------
id
t3
The select-list might be refined with skeys((h1-h2)-'id'::text) to always remove id which, as the primary key, will obviously always differ between rows.
Here's a stored procedure that should get you most of the way...
While this should work "as is", it has no error checking, which you should add.
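It gets all the columns in the table and loops over them. A column counts as different when its count of distinct values across the two rows is more than one. Note that count(distinct ...) ignores NULLs, so a NULL in one row against a value in the other is not reported.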
Also, the output is:
The count of the number of differences
Messages for each column where there is a difference
It might be more useful to return a rowset of the columns with the differences. Anyway, good luck!
Usage:
SELECT showdifference('public','company','co_id',22,33)
CREATE OR REPLACE FUNCTION showdifference(p_schema text, p_tablename text,p_idcolumn text,p_firstid integer, p_secondid integer)
RETURNS INTEGER AS
$BODY$
DECLARE
l_diffcount INTEGER;
l_column text;
l_dupcount integer;
column_cursor CURSOR FOR select column_name from information_schema.columns where table_name = p_tablename and table_schema = p_schema and column_name <> p_idcolumn;
BEGIN
-- need error checking here, to ensure the table and schema exist and the columns exist
-- Should also check that the records ids exist.
-- Should also check that the column type of the id field is integer
-- Set the number of differences to zero.
l_diffcount := 0;
-- use a cursor to iterate over the columns found in information_schema.columns
-- open the cursor
OPEN column_cursor;
LOOP
FETCH column_cursor INTO l_column;
EXIT WHEN NOT FOUND;
-- build a query to see if there is a difference between the columns. If there is raise a notice
EXECUTE 'select count(distinct ' || quote_ident(l_column) || ' ) from ' || quote_ident(p_schema) || '.' || quote_ident(p_tablename) || ' where ' || quote_ident(p_idcolumn) || ' in ('|| p_firstid || ',' || p_secondid ||')'
INTO l_dupcount;
IF l_dupcount > 1 THEN
-- increment the counter
l_diffcount := l_diffcount +1;
RAISE NOTICE '% has % differences', l_column, l_dupcount ; -- for "real" you might want to return a rowset and could do something here
END IF;
END LOOP;
-- close the cursor
CLOSE column_cursor;
RETURN l_diffcount;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
COST 100;
I'm trying to run a graph search to find all nodes accessible from a starting point, like so:
with recursive
nodes_traversed as (
select START_NODE ID
from START_POSITION
union all
select ed.DST_NODE
from EDGES ed
join nodes_traversed NT
on (NT.ID = ed.START_NODE)
and (ed.DST_NODE not in (select ID from nodes_traversed))
)
select distinct * from nodes_traversed
Unfortunately, when I try to run that, I get an error:
Recursive CTE member (nodes_traversed) can refer itself only in FROM clause.
That "not in select" clause is important to the recursive expression, though, as it provides the ending point. (Without it, you get infinite recursion.) Using generation counting, like in the accepted answer to this question, would not help, since this is a highly cyclic graph.
Is there any way to work around this without having to create a stored proc that does it iteratively?
Here is my solution, which uses a global temporary table. I have limited the recursion by level and by the nodes already in the temporary table.
I am not sure how it will perform on a large data set.
create procedure get_nodes (
START_NODE integer)
returns (
NODE_ID integer)
as
declare variable C1 integer;
declare variable C2 integer;
begin
/**
create global temporary table id_list(
id integer
);
create index id_list_idx1 ON id_list (id);
*/
delete from id_list;
while ( 1 = 1 ) do
begin
select count(distinct id) from id_list into :c1;
insert into id_list
select id from
(
with recursive nodes_traversed as (
select :START_NODE AS ID , 0 as Lv
from RDB$DATABASE
union all
select ed.DST_NODE , Lv+1
from edges ed
join nodes_traversed NT
on
(NT.ID = ed.START_NODE)
and nt.Lv < 5 -- Max recursion level
and nt.id not in (select id from id_list)
)
select distinct id from nodes_traversed);
select count(distinct id) from id_list into :c2;
if (c1 = c2) then break;
end
for select distinct id from id_list into :node_id do
begin
suspend ;
end
end