efficient merge of arrays that contain overlapping values - postgresql

Given a table with arrays of integers, the arrays should be merged so that all arrays that have overlapping entries end up as a single one.
Given the table arrays
a
------------
{1,2,3}
{1,4,7}
{4,7,9}
{15,17,18}
{18,16,15}
{20}
The result should look like this
{1,2,3,4,7,9}
{15,17,18,16}
{20}
As you can see duplicate values from a merged array may be removed and the order of the resulting entries in the array is unimportant. The arrays are integer arrays so functions from the intarray module can be used.
This will be done on a quite large table so performance is critical.
My first naive approach was to self-join the table on the && operator. Like this:
SELECT DISTINCT uniq(sort(t1.a || t2.a))
FROM arrays t1
JOIN arrays t2 ON t1.a && t2.a
This leaves two problems:
It is not recursive (it merges at most 2 arrays).
This could probably be solved with a recursive CTE.
Merged arrays re-occur in the output.
Any input is very welcome.

-- Assign a group id to every distinct value, merging groups whenever an
-- incoming array overlaps one (or several) already-seen groups.
-- Reads source table t(a int[]); leaves results in temp table tmp(v, id).
do $$
declare
arr int[];        -- current source array
arr_id int := 0;  -- next unused group id
tmp_id int;       -- group id chosen for the current array
begin
create temporary table tmp (v int primary key, id int not null);
for arr in select a from t loop
-- pick the smallest group id among all groups this array touches
select min(id) into tmp_id from tmp where v = any(arr);
if tmp_id is null then
-- no overlap with anything seen so far: open a new group
tmp_id := arr_id;
arr_id := arr_id + 1;
else
-- BUG FIX: an array may bridge several existing groups; the original
-- code attached it to just one of them and left the others separate.
-- Collapse every touched group into tmp_id before inserting.
update tmp set id = tmp_id
where id <> tmp_id
and id in (select id from tmp where v = any(arr));
end if;
-- on conflict: a value already present keeps its (now merged) group id
insert into tmp
select unnest(arr), tmp_id
on conflict do nothing;
end loop;
end
$$;
select array_agg(v) from tmp group by id;

Pure SQL version:
-- Pure SQL version: recursively grow each starting array by absorbing every
-- other source array that overlaps the running result, then keep only the
-- deepest (most complete) merge per starting array.
WITH RECURSIVE x (a) AS (VALUES ('{1,2,3}'::int2[]),
('{1,4,7}'),
('{4,7,9}'),
('{15,17,18}'),
('{15,17,18}')
), y AS (
SELECT 1::int AS lvl,
ARRAY [ a::text ] AS a,
a AS res
FROM x
UNION ALL
SELECT lvl + 1,
t1.a || ARRAY [ t2.a::text ],
(SELECT array_agg(DISTINCT unnest ORDER BY unnest)
FROM (SELECT unnest(t1.res) UNION SELECT unnest(t2.a)) AS a)
FROM y AS t1
JOIN x AS t2 ON (t2.a && t1.res) AND NOT t2.a::text = ANY(t1.a)
WHERE lvl < 10
)
SELECT DISTINCT res
FROM x
JOIN LATERAL (SELECT res FROM y WHERE x.a && y.res ORDER BY lvl DESC LIMIT 1) AS z ON true

Related

Looking for a simpler alternative to a recursive query

The actual query is more involved, but the problem I'm facing can be distilled to this:
A query to filter a rowset of monotonically increasing integers so that - in the final result set, row(n+1).value >= row(n).value + 5.
For the actual problem I need to solve, the rowset count is in the 1000s.
A few examples to clarify:
if rows are: 1,2,3,4,5 : then query should return: 1
if rows are: 1,5,7,10,11,12,13 : then query should return: 1,7,12
if rows are: 6,8,11,16,20,23: then query should return: 6,11,16,23
if rows are: 6,8,12,16,20,23: then query should return: 6,12,20
I've managed to get the required results with the following query, but it seems overly complicated. Uncomment the different "..with t(k).." to try them out.
I'm looking for any simplifications or alternative approaches to get the same results.
-- Recursive gap filter: start at the minimum value, then repeatedly keep the
-- first value at least 5 greater than the last kept one. pri (= rank within
-- the candidates of each step) marks the row actually wanted.
with recursive r(n, pri) as (
with t(k) as (values (1),(2),(3),(4),(5)) -- the data we want to filter
-- with t(k) as (values (1),(5),(7),(10),(11),(12),(13))
-- with t(k) as (values (6),(8),(11),(16),(20),(23))
-- with t(k) as (values (6),(8),(12),(16),(20),(23))
select min(k), 1::bigint from t -- bootstrap for recursive processing. 1 here represents rank().
UNION
select k, (rank() over(order by k)) rr -- rank() is required just to filter out the rows we don't want from the final result set, and no other reason
from r, t
where t.k >= r.n+5 and r.pri = 1 -- capture the rows we want, AND unfortunately a bunch of rows we don't want
)
select n from r where pri = 1; -- filter out the rows we don't want
-- The data:
CREATE TABLE rowseq(val INTEGER NOT NULL) ;
INSERT INTO rowseq(val ) values
-- (1),(2),(3),(4),(5)
(1), (5), (7), (10), (11), (12), (13)
--(6),(8),(11),(16),(20),(23)
--(6),(8),(12),(16),(20),(23)
;
-- need this view, because a recursive CTE cannot be based on a CTE
-- [we could also duplicate the row_number() in both legs of the recursive CTE]
CREATE TEMP VIEW qqq AS
SELECT val, row_number() OVER (ORDER BY val) AS rn
FROM rowseq
;
-- Walk the numbered rows: anchor at rn = 1, then repeatedly jump to the
-- first later row whose value is at least 5 greater than the current one.
WITH RECURSIVE rrr AS (
SELECT qqq.val, qqq.rn
FROM qqq
WHERE qqq.rn = 1
UNION
SELECT q1.val, q1.rn
FROM rrr
JOIN qqq q1 ON q1.rn > rrr.rn AND q1.val >= rrr.val+5 -- The "Gap" condition
AND NOT EXISTS ( SELECT * FROM qqq nx -- But it must be the FIRST match
WHERE nx.rn > rrr.rn AND nx.val >= rrr.val+5 -- same condition
AND nx.rn < q1.rn -- but NO earlier match
)
)
-- prove it to the world!
SELECT *
FROM rrr
;
What about using plpgsql?
-- Sample data; swap the commented VALUES lists to try the other examples.
drop table if exists t;
create table t(k) as (
values
(1),(2),(3),(4),(5)
--(1),(5),(7),(10),(11),(12),(13)
--(6),(8),(11),(16),(20),(23)
--(6),(8),(12),(16),(20),(23)
);
-- foo(n): scan t in ascending order and emit each value that is at least
-- n greater than the previously emitted one (the first value always wins).
-- FIX 1: volatility changed IMMUTABLE -> STABLE; a function that reads a
--        table must not be IMMUTABLE (PostgreSQL volatility rules).
-- FIX 2: the gap logic requires ascending order, which a bare
--        "select * from t" does not guarantee; order explicitly by k.
create or replace function foo(in n int, out k int) returns setof int as $$
declare
r t;  -- row currently examined
rp t; -- last row emitted (null until the first one)
begin
rp := null;
for r in (select * from t order by k) loop
-- emit the first row, and any row far enough past the last emitted one
if (rp is null) or (r.k >= rp.k + n) then
rp := r;
k := r.k;
return next;
end if;
end loop;
return;
end; $$ stable language plpgsql;
select * from foo(5);

Postgres Select from a Table Based On Query Result

I have two tables with identical columns, in an identical order. I have a desire to join across one of the two tables, depending on a subquery condition. For example, assume I have the following schema:
-- Parent table; cid carries the condition that selects which child applies.
CREATE TABLE b (
bid SERIAL PRIMARY KEY,
cid INT NOT NULL
);
-- a1 and a2 have identical shapes; a query should read one OR the other.
CREATE TABLE a1 (
aid SERIAL PRIMARY KEY,
bid INT NOT NULL REFERENCES b
);
CREATE TABLE a2 (
aid SERIAL PRIMARY KEY,
bid INT NOT NULL REFERENCES b
);
I would like a query, that performs a join across either a1 or a2 based on some condition. Something like:
WITH z AS (
SELECT cid, someCondition FROM someTable
)
SELECT *
FROM CASE z.someCondition THEN a1 ELSE a2 END
JOIN b USING (bid)
WHERE cid = (SELECT cid FROM z);
However, the above doesn't work. Is there some way to conditionally join across a1 or a2, depending on some boolean condition stored in table z?
If the conditions are exclusive (I expect they are): just do both queries and UNION ALL them, with the smart union construct:
-- "Smart union": run both branch queries and UNION ALL the results; the
-- EXISTS predicates are mutually exclusive, so each row comes from exactly
-- one leg and no deduplication (plain UNION) is needed.
WITH z AS (
SELECT cid
, (cid %3) AS some_condition -- Fake ...
FROM b
)
SELECT *
FROM a1
JOIN b USING (bid)
WHERE EXISTS( SELECT * FROM z
WHERE some_condition = 1 AND cid = b.cid )
UNION ALL
SELECT *
FROM a2
JOIN b USING (bid)
WHERE EXISTS( SELECT * FROM z
WHERE some_condition = 2 AND cid = b.cid )
;
A somewhat different syntax to do the same:
-- Same "smart union" idea, with the EXISTS predicate attached to the join
-- condition instead of each leg's WHERE clause (equivalent for inner joins).
WITH z AS (
SELECT cid
, (cid %3) AS some_condition
FROM b
)
SELECT *
FROM a1
JOIN b ON a1.bid = b.bid
AND EXISTS( SELECT * FROM z
WHERE some_condition = 1 AND cid = b.cid )
UNION ALL
SELECT *
FROM a2
JOIN b ON a2.bid = b.bid
AND EXISTS( SELECT * FROM z
WHERE some_condition = 2 AND cid = b.cid )
;
SQL syntax does not allow conditional joins.
Probably the simplest way to achieve a similar effect is to use a dynamic query in a plpgsql function, which may look like this:
-- Dynamic-SQL dispatcher: reads from a1 or a2 depending on the flag.
-- acid: the cid to filter on; some_condition: true -> a1, false -> a2.
create function conditional_select(acid int, some_condition boolean)
returns table (aid int, bid int, cid int)
language plpgsql as $$
declare
tname text; -- table to query; always one of the two literals below
begin
if some_condition then tname = 'a1';
else tname = 'a2';
end if;
-- FIX: %I safely quotes the identifier (was %s, unquoted splice), and the
-- cid value is passed as a bind parameter via USING instead of being
-- interpolated into the SQL text — never build values into dynamic SQL.
return query execute format ($fmt$
select a.aid, b.bid, b.cid
from %I a
join b using(bid)
where cid = $1
$fmt$, tname) using acid;
end $$;
select * from conditional_select(1, true)
If, like in your example, you have only a few columns that you want to output, you can use the CASE statement for every column:
-- Pick each output column from a1 or a2 per row via CASE.
-- FIX: a searched CASE requires the WHEN keyword; "CASE expr THEN ..." is a
-- syntax error. NOTE(review): the inner joins drop b-rows lacking a match
-- in EITHER a1 or a2 — confirm that is acceptable for the data.
SELECT CASE WHEN z.someCondition THEN a1.aid ELSE a2.aid END AS aid,
       CASE WHEN z.someCondition THEN a1.bid ELSE a2.bid END AS bid
FROM b
JOIN a1 ON a1.bid = b.bid
JOIN a2 ON a2.bid = b.bid
JOIN someTable z USING (cid);
Depending on the size of tables a1 and a2 and how many columns you have to output, this may or may not be faster than Klin's solution with a function, which is inherently slower than plain SQL and even more so because of the dynamic query. Given that z.someCondition is a boolean value already, the CASE evaluation will be very fast. Small tables + few columns = this solution; large tables + many columns = Klin's solution.

Postgres ANY operator with array selected in a subquery

Can someone explain to me why the 4th select works, but the first 3 do not? (I'm on PostgreSQL 9.3.4 if it matters.)
-- One-row temp table whose single column holds an int[] array.
drop table if exists temp_a;
create temp table temp_a as
(
select array[10,20] as arr
);
-- ANY(subquery) compares the left operand with each ROW the subquery yields;
-- here every row is an int[], hence the int = int[] operator error.
select 10 = any(select arr from temp_a); -- ERROR: operator does not exist: integer = integer[]
select 10 = any(select arr::integer[] from temp_a); -- ERROR: operator does not exist: integer = integer[]
select 10 = any((select arr from temp_a)); -- ERROR: operator does not exist: integer = integer[]
-- ANY(array expression): the cast makes this a scalar subquery producing one
-- int[] value, so ANY compares against the array's ELEMENTS instead.
select 10 = any((select arr from temp_a)::integer[]); -- works
Here's a sqlfiddle: http://sqlfiddle.com/#!15/56a09/2
You might be expecting an aggregate. Per the documentation:
Note: Boolean aggregates bool_and and bool_or correspond to standard SQL aggregates every and any or some. As for any and some, it seems that there is an ambiguity built into the standard syntax:
SELECT b1 = ANY((SELECT b2 FROM t2 ...)) FROM t1 ...;
Here ANY can be considered either as introducing a subquery, or as being an aggregate function, if the subquery returns one row with a Boolean value. Thus the standard name cannot be given to these aggregates.
In Postgres, the any operator exists for subqueries and for arrays.
The first three queries return a set of values of type int[] and you're comparing them to an int. Can't work.
The last query is returning an int[] array but it's only working because you're returning a single element.
Exhibit A; this works:
select (select i from (values (array[1])) rows(i))::int[];
But this doesn't:
select (select i from (values (array[1]), (array[2])) rows(i))::int[];
This works as a result (equivalent to your fourth query):
select 1 = any((select i from (values (array[1])) rows(i))::int[]);
But this doesn't (equivalent to your fourth query returning multiple rows):
select 1 = any((select i from (values (array[1]), (array[2])) rows(i))::int[]);
These should also work, btw:
select 1 = any(
select unnest(arr) from temp_a
);
select 1 = any(
select unnest(i)
from (values (array[1]), (array[2])) rows(i)
);
Also note the array(select ...)) construct as an aside, since it's occasionally handy:
select 1 = any(array(
select i
from (values (1), (2)) rows(i)
));
select 1 = any(
select i
from (values (1), (2)) rows(i)
);

Update a value in DB with ascending values with no duplicates?

I am using this query to update a column with ascending values:
-- T-SQL "quirky update": the chained assignment writes @counter + 1 into
-- SomeColumn and back into @counter for every row visited.
DECLARE @counter NUMERIC(10, 0)
SET @counter = 1400000
UPDATE SomeTable
-- NOTE(review): without an ordering mechanism the row visit order (and thus
-- the assigned sequence) depends on the scan order chosen by the engine.
SET @counter = SomeColumn = @counter + 1
Question is, how can I not put duplicates there? For example the column already has 1400002 as value. Normally it has NULLs, but sometimes it doesn't. I could add
where SomeColumn is null
but this would not avoid duplicates. Any ideas?
Thanks
I am not sure that this will help or not but you can put your existing data into temp table and then use that temp table to remove duplicates like:
WHERE (#counter + 1) not in ( select SomeColumn from #temp)
If above is not correct then please explain your question a little more.
This worked for me in SQL Server 2008:
-- Fill NULL / out-of-range values in SomeTable.Value with unused numbers
-- from the range [@StartNumber, @EndNumber], skipping numbers already taken.
DECLARE @StartNumber int, @EndNumber int;
SET @StartNumber = 100;
SELECT @EndNumber = @StartNumber + COUNT(*) - 1 FROM SomeTable;
-- numbers: one row per candidate value in the range
WITH numbers AS (
SELECT @StartNumber AS Value
UNION ALL
SELECT
Value + 1
FROM numbers
WHERE Value < @EndNumber
),
-- validnumbers: candidates NOT already present in SomeTable, ranked
validnumbers AS (
SELECT
n.Value,
rownum = ROW_NUMBER() OVER (ORDER BY n.Value)
FROM numbers n
LEFT JOIN SomeTable t ON n.Value = t.Value
WHERE t.Value IS NULL
),
-- RowsToUpdate: rows whose Value must be (re)assigned, ranked the same way
RowsToUpdate AS (
SELECT
Value,
rownum = ROW_NUMBER() OVER (ORDER BY Value)
FROM SomeTable
WHERE Value IS NULL
OR Value NOT IN (SELECT Value FROM numbers)
)
-- pair the n-th row to fix with the n-th free number
UPDATE r
SET Value = v.Value
FROM RowsToUpdate r
INNER JOIN validnumbers v ON v.rownum = r.rownum
-- BUG FIX: the recursive numbers CTE yields roughly one row per row of
-- SomeTable; SQL Server's default MAXRECURSION limit (100) would abort the
-- statement on any table with more than ~100 rows. 0 = unlimited.
OPTION (MAXRECURSION 0);
Basically, it implements the following steps:
Create a number table.
Exclude the numbers present in SomeTable.
Rank the rest of the rows.
Exclude the values from SomeTable that are present in the number table.
Rank the rest of the rows.
Update the ranked rows of SomeTable from the ranked number list.
Not sure how good this solution would be for big tables, though...

T-SQL A question about inner join table variable

in my stored procedure I have a table variable that contains row IDs. There are 2 scenarios — the table variable is empty, or it is not.
declare #IDTable as table
(
number NUMERIC(18,0)
)
In the main query, I join that table:
inner join #IDTable tab on (tab.number = csr.id)
BUT:
as we know how inner join works, I need that my query returns some rows:
when #IDTable is empty
OR
return ONLY rows that exist in
#IDTable
I tried also with LEFT join but it doesn't work. Any ideas how to solve it ?
If `#IDTable` is empty then what rows do you return? Do you just ignore the Join on to the table?
I'm not sure I get what you're trying to do but this might be easier.
-- Branch on whether the table variable holds any rows.
-- FIX 1: T-SQL has no '==' operator; equality is '='.
-- FIX 2: EXISTS stops at the first row, so it beats COUNT(*) for this test.
if not exists (Select * From #IDTable)
begin
-- do a SELECT that doesn't join on to the #IDTable
end
else
begin
-- do a SELECT that joins on to #IDTable
end
It is not optimal, but it works:
-- Works for both cases: when @z is empty, NOT EXISTS accepts every row of
-- the left join; when @z has rows, only rows with a join match survive.
-- BUG FIX: the original declared the table variable '@z' but then joined
-- and filtered on '#z', a temp table that is never created; all references
-- now consistently use @z.
declare @z table
(
id int
)
--insert into @z values(2) -- uncomment to test the non-empty case
select * from somTable n
left join @z z on (z.id = n.id)
where NOT exists(select 1 from @z) or (z.id is not null)