How to group by multiple columns with OR statement transitively? - postgresql

I want to select rows from a table grouped by multiple columns connected by OR (not the usual AND you get when listing multiple columns in a GROUP BY clause).
In contrast to other questions already asked, the grouping should also be transitive: when row A connects to row B through a common value in one of the columns, and row B connects to row C, then row A should also be connected to row C (making A, B and C one group).
The setup:
CREATE TABLE test.tab1 (
rowid STRING,
col1 STRING,
col2 STRING,
col3 STRING
);
INSERT INTO test.tab1
(rowid, col1, col2, col3)
VALUES
( '1', 'A', 'Q', 'green'),
( '2', 'B', 'R', 'blue'),
( '3', 'B', 'S', 'red'),
( '4', 'C', 'T', 'purple'),
( '5', 'D', 'U', 'orange'),
( '6', 'E', 'R', 'black'),
( '7', 'F', 'U', 'brown'),
( '8', 'F', 'V', 'pink'),
( '9', 'G', 'W', 'white'),
('10', 'H', 'R', 'cyan'),
('11', 'A', 'Y', 'grey'),
('12', 'I', 'Z', 'azul'),
('13', 'H', 'X', 'magenta');
rowid | col1 | col2 | col3
------+------+------+---------
1     | A    | Q    | green
2     | B    | R    | blue
3     | B    | S    | red
4     | C    | T    | purple
5     | D    | U    | orange
6     | E    | R    | black
7     | F    | U    | brown
8     | F    | V    | pink
9     | G    | W    | white
10    | H    | R    | cyan
11    | A    | Y    | grey
12    | I    | Z    | azul
13    | H    | X    | magenta
The query should select all rows grouped by col1 OR col2: rows are connected whenever they share a value in col1 or in col2 with another row. The expected result would be (for simplicity, grouped columns are shown as concatenated strings):
INSERT INTO test.result
(rowid, col1, col2, col3)
VALUES
( '1,11', 'A', 'Q,Y', 'green,grey'),
( '2,3,6,10,13', 'B,E,H', 'R,S,X', 'blue,red,black,cyan,magenta'),
( '4', 'C', 'T', 'purple'),
( '5,7,8', 'D,F', 'U,V', 'orange,brown,pink'),
( '9', 'G', 'W', 'white'),
('12', 'I', 'Z', 'azul');
rowid       | col1  | col2  | col3
------------+-------+-------+----------------------------
1,11        | A     | Q,Y   | green,grey
2,3,6,10,13 | B,E,H | R,S,X | blue,red,black,cyan,magenta
4           | C     | T     | purple
5,7,8       | D,F   | U,V   | orange,brown,pink
9           | G     | W     | white
12          | I     | Z     | azul
My attempt, adapted from a similar question:
-- This is not transitive and does not work as desired.
-- For example it does not connect row 2 blue (col2=R) with row 13 magenta (col1=H),
-- even though they are both connected to row 10 cyan (col2=R, col1=H).
SELECT t1.col1,
STRING_AGG(t2.col3, ',') AS col3
FROM test.tab1 AS t1 INNER JOIN test.tab1 AS t2
ON t2.rowid = t1.rowid OR t2.col2 = t1.col2
GROUP BY t1.col1
HAVING COUNT(DISTINCT t1.rowid) > 1 OR COUNT(t2.rowid) = 1;
I am looking for solutions that work in BigQuery, but an answer for PostgreSQL would probably help a lot already. I have prepared a DB fiddle as the syntax is really similar and I believe an answer could easily be adapted.
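Conceptually, the transitive grouping is a connected-components problem: treat each row and each distinct col1/col2 value as graph nodes, link them, and collect the components. A minimal Python sketch of that idea over the sample data (union-find with path halving; this illustrates the logic only, it is not a BigQuery solution):

```python
# Union-find over rows and column values: two rows land in the same
# component whenever they share a col1 value OR a col2 value, transitively.
rows = [
    ('1', 'A', 'Q', 'green'),  ('2', 'B', 'R', 'blue'),
    ('3', 'B', 'S', 'red'),    ('4', 'C', 'T', 'purple'),
    ('5', 'D', 'U', 'orange'), ('6', 'E', 'R', 'black'),
    ('7', 'F', 'U', 'brown'),  ('8', 'F', 'V', 'pink'),
    ('9', 'G', 'W', 'white'),  ('10', 'H', 'R', 'cyan'),
    ('11', 'A', 'Y', 'grey'),  ('12', 'I', 'Z', 'azul'),
    ('13', 'H', 'X', 'magenta'),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Column values act as hub nodes connecting the rows that contain them.
for rowid, c1, c2, _ in rows:
    union('row:' + rowid, 'col1:' + c1)
    union('row:' + rowid, 'col2:' + c2)

groups = {}
for rowid, _, _, _ in rows:
    groups.setdefault(find('row:' + rowid), []).append(rowid)

for g in groups.values():
    print(','.join(g))
```

This reproduces the six expected groups (1,11 / 2,3,6,10,13 / 4 / 5,7,8 / 9 / 12). In SQL the same traversal is usually expressed with a recursive CTE that repeatedly merges overlapping groups until a fixpoint.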

Related

Postgres Group by intersection array

I have a table like this
SELECT id, items
FROM ( VALUES
( '1', ARRAY['A', 'B'] ),
( '2', ARRAY['A', 'B', 'C'] ),
( '3', ARRAY['E', 'F'] ),
( '4', ARRAY['G'] )
) AS t(id, items)
Two rows belong to the same group if they have at least one item in common.
For example, #1 and #2 belong to the same group because they both have A and B; #3 and #4 each form a group of their own.
So my desired output would be:
id | items   | group_alias
---+---------+------------
1  | {A,B}   | {A,B}
2  | {A,B,C} | {A,B}
3  | {E,F}   | {E,F}
4  | {G}     | {G}
The group_alias field is a new field that tells me that records #1 and #2 belong to the same group.
Having
CREATE TABLE temp1
(
id int PRIMARY KEY,
items char[] NOT NULL
);
INSERT INTO temp1 VALUES
( '1', ARRAY['A', 'B'] ),
( '2', ARRAY['A', 'B', 'C'] ),
( '3', ARRAY['E', 'F'] ),
( '4', ARRAY['G'] );
-- Index the array field to speed up queries
CREATE INDEX idx_items on temp1 USING GIN ("items");
Then
select t1.*,
coalesce( (select t2.items from temp1 t2
where t2.items && t1.items
and t1.id != t2.id
and array_length(t2.items,1)<array_length(t1.items,1)
order by array_length(t2.items,1) limit 1 )/*minimum common*/
, t1.items /*trivial solution*/ ) group_alias
from temp1 t1;
https://www.db-fiddle.com/f/46ydeE5ZXCJDk4Rw3cu4jt/10
This query returns all group aliases of an item; for example, item no. 5 has the group aliases {E} and {A,B}. Performance may be better if you create a temporary table for the items instead of creating them dynamically like you mentioned in one comment. Temporary tables are automatically dropped at the end of a session. You can create indexes on temporary tables, too, which can speed up the query.
CREATE TEMP TABLE temp
(
id int PRIMARY KEY,
items char[] NOT NULL
);
INSERT INTO temp VALUES
( '1', ARRAY['A', 'B'] ),
( '2', ARRAY['A', 'B', 'C'] ),
( '3', ARRAY['E', 'F'] ),
( '4', ARRAY['G'] ),
( '5', ARRAY['A', 'B', 'E'] );
The query:
SELECT DISTINCT
t1.id, t1.items, coalesce(match, t1.items) AS group_alias
FROM temp t1 LEFT JOIN (
SELECT
t2.id, match
FROM
temp t2,
LATERAL(
SELECT
match
FROM
temp t3,
LATERAL(
SELECT
array_agg(aa) AS match
FROM
unnest(t2.items) aa
JOIN
unnest(t3.items) ab
ON aa = ab
) AS m1
WHERE
t2.id != t3.id AND t2.items && t3.items
) AS m2
) AS groups
ON groups.id = t1.id
ORDER BY t1.id;
And the result:
id | items | group_alias
----+---------+-------------
1 | {A,B} | {A,B}
2 | {A,B,C} | {A,B}
3 | {E,F} | {E}
4 | {G} | {G}
5 | {A,B,E} | {A,B}
5 | {A,B,E} | {E}
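The alias rule in that query (take the smallest overlapping item set from another row, else fall back to the row's own items) can be stated compactly outside SQL. A minimal Python sketch over the four-row sample, with a hypothetical `group_alias` helper:

```python
# group_alias(rid) = the smallest item set of another overlapping row
# that is strictly smaller than rid's own set, else rid's own items.
data = {
    '1': {'A', 'B'},
    '2': {'A', 'B', 'C'},
    '3': {'E', 'F'},
    '4': {'G'},
}

def group_alias(rid):
    own = data[rid]
    candidates = [items for other, items in data.items()
                  if other != rid and items & own and len(items) < len(own)]
    return min(candidates, key=len) if candidates else own

for rid in sorted(data):
    print(rid, sorted(group_alias(rid)))
```

Rows 1 and 2 both resolve to {A,B}; rows 3 and 4 fall back to their own items, matching the desired output above.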

How to get array of descendants from adjacency lists?

I have the following table and data
CREATE TABLE relationships (a TEXT, b TEXT);
CREATE TABLE nodes(n TEXT);
INSERT INTO relationships(a, b) VALUES
('1', '2'),
('1', '3'),
('1', '4'),
('1', '5'),
('2', '6'),
('2', '7'),
('2', '8'),
('3', '9');
INSERT INTO nodes(n) VALUES ('1'), ('2'), ('3'), ('4'), ('5'), ('6'), ('7'), ('8'), ('9'), ('10');
I want to output
n | children
1 | ['2', '3', '4', '5', '6', '7', '8', '9']
2 | ['6', '7', '8', '9']
3 | ['9']
4 | []
5 | []
6 | []
7 | []
8 | []
9 | []
10 | []
I am trying to use WITH RECURSIVE but am stuck on how to pass a parameter into the CTE:
WITH RECURSIVE traverse(n) AS (
SELECT *
FROM relationships
WHERE a = n --- not sure how to pass data to here
UNION ALL
...
)
WITH basic_cte AS (
SELECT a1.n as n,
(SELECT COALESCE(json_agg(temp), '[]')
FROM (
(SELECT * FROM traverse(a1.a))
) as temp
) as children
FROM nodes as a1
)
SELECT *
FROM basic_cte;
To get a list of the children of all nodes, you need a left join to the nodes table
with recursive rels as (
select a,b, a as root
from relationships
union all
select c.*, r.root
from relationships c
join rels r on r.b = c.a
)
select n.n, array_agg(r.b) filter (where r.b is not null)
from nodes n
left join rels r on r.root = n.n
group by n.n
order by n.n;
Note: this returns NULL rather than an empty array for nodes without children. You can add a left join like in #a_horse_with_no_name's answer to get that functionality.
You can't really pass a parameter into the CTE unless you veer off into stored procedures and whatnot. The CTE is a single table that needs to contain all the rows you might want to use from it.
Assuming a fairly nice graph (no duplicate edges, no cycles), code like the following ought to do what you're looking for.
The base case for the recursive query gets all level-1 descendants (the children) for all nodes which could possibly be parents.
The recursive step walks through the 2nd level, 3rd level, etc down the tree.
Once we have all parent-descendant tuples we can aggregate the data as desired.
WITH RECURSIVE descendants(parent, child) AS (
SELECT * FROM relationships
UNION
SELECT d.parent, r.b
FROM descendants d JOIN relationships r ON d.child=r.a
)
SELECT parent AS n, array_agg(child) AS children
FROM descendants
GROUP BY parent
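The recursive query above is a plain transitive closure, and the same walk is easy to check outside the database. A small Python sketch over the sample edges (a hypothetical `descendants` helper, depth-first):

```python
from collections import defaultdict

edges = [('1', '2'), ('1', '3'), ('1', '4'), ('1', '5'),
         ('2', '6'), ('2', '7'), ('2', '8'), ('3', '9')]
nodes = [str(i) for i in range(1, 11)]

# Adjacency list: parent -> direct children.
children = defaultdict(list)
for a, b in edges:
    children[a].append(b)

def descendants(n):
    # Depth-first walk collecting every node reachable from n.
    out, stack = [], list(children[n])
    while stack:
        c = stack.pop()
        out.append(c)
        stack.extend(children[c])
    return sorted(out)

for n in nodes:
    print(n, descendants(n))
```

Nodes without children yield an empty list, which is what the left join against `nodes` provides in the SQL version.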

Retrieve rows with a column value that do not exist in a given list

Let's say I have a list of values
'A', 'B', 'C','D','E',...,'X'.
And I have a database column CHARS that is storing the exact values of my list (one value / row) except for 'C' and 'E'. So, CHARS contains 'A', 'B', 'D',..., 'X'.
Is there a way in PostgreSQL to return only the rows for 'C' and 'E'; the values from my list which are missing from column CHARS?
If your list of values comes from outside the database (e.g. a program), the simplest solution is the NOT IN operator:
SELECT * FROM your_table WHERE chars NOT IN ('A', 'B', 'C', 'D', 'E', ..., 'X')
(Note: the missing characters in the tuple can't be abbreviated with "..."; you have to write them all out.)
If your list of values comes from a table or query, you can use a subquery:
SELECT * FROM your_table WHERE chars NOT IN (SELECT a_char FROM another_table WHERE ...);
You can do an outer join against a values clause:
select i.ch as missing_character
from (
values
('A'), ('B'), ('C'), ('D'), ('E'), ('F'), ..., ('X')
) as i(ch)
left join the_table t on i.ch = t.chars
where t.chars is null
order by i.ch;
Note that each char needs to be enclosed in parentheses in the values clause to make sure that each one is a single row.
Alternatively you can use the EXCEPT operator:
select ch
from (
values ('A'), ('B'), ('C'), ('D'), ('E'), ('F'), ... , ('X')
) as i(ch)
except
select chars
from the_table
order by ch
Online example: https://rextester.com/XOUB52627
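Outside the database, the same check is just a set difference. A trivial Python sketch (the `expected`/`stored` names are illustrative):

```python
# Values we expect vs. values actually present in the CHARS column.
expected = {'A', 'B', 'C', 'D', 'E'}
stored = {'A', 'B', 'D'}

missing = sorted(expected - stored)
print(missing)  # -> ['C', 'E']
```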

Filtering data based on dynamic date

Create table #TEMP
(
ID INT
)
Create table #TEMP1
(
ID INT,
Letter_Type VARCHAR(100),
Letter_Sent_Date DATE
)
INSERT INTO #TEMP VALUES (1),(2),(3),(4)
GO
INSERT INTO #TEMP1 VALUES
(1,'A','01/01/2017'),
(1,'B','01/02/2017'),
(1,'C','01/03/2018'),
(1,'D','01/04/2018'),
(2,'A','01/01/2017'),
(2,'B','01/02/2017'),
(2,'C','01/10/2018'),
(2,'D','01/12/2018')
I'm trying to achieve the results below; the data should be filtered based on a date.
Suppose I want to know any letter sent after '01/05/2018' for letter type C.
For ID 1 there is no letter C - in that case, we need to print null value.
I'm trying to do it in single statement as query I currently have is super big due to couple of joins used.
Any help is much appreciated. Thanks!
OUTPUT
1,NULL,NULL
2,C,'01/10/2018'
This is rather clumsy, but I think it returns what you want (written in Oracle).
A CTE (WITH factoring clause) is used to prepare the test data; you don't need that, as your data is already in those two tables. The useful code begins at line 15; have a look at the comments.
SQL> with
2 temp (id) as
3 (select level from dual connect by level <= 4),
4 temp1 (id, letter_type, letter_sent_date) as
5 (select 1, 'A', date '2017-01-01' from dual union all
6 select 1, 'B', date '2017-01-02' from dual union all
7 select 1, 'C', date '2018-01-03' from dual union all
8 select 1, 'D', date '2018-01-04' from dual union all
9 --
10 select 2, 'A', date '2017-01-01' from dual union all
11 select 2, 'B', date '2017-01-02' from dual union all
12 select 2, 'C', date '2018-01-10' from dual union all
13 select 2, 'D', date '2018-01-12' from dual
14 ),
15 -- letter type = 'C' and letters were sent after 2018-01-05
16 satisfy_condition as
17 (select t.id, t1.letter_type, t1.letter_sent_date
18 from temp t join temp1 t1 on t1.id = t.id
19 where t1.letter_type = 'C'
20 and t1.letter_sent_date > date '2018-01-05'
21 )
22 -- Finally: rows that satisfy condition ...
23 select i.id, i.letter_type, i.letter_sent_date
24 from satisfy_condition i
25 union
26 -- ... and rows that contain letter type = 'C', but were sent before 2018-01-05
27 select t.id, null, null
28 from temp1 t
29 where t.letter_type = 'C'
30 and t.id not in (select i1.id from satisfy_condition i1)
31 order by id, letter_type, letter_sent_date;
ID LETTER_TYPE LETTER_SENT_DATE
---------- ------------ --------------------
1
2 C 2018-01-10
SQL>
In an effort to avoid a NOT IN / NOT EXISTS solution, maybe this can help you:
WITH TEMP (ID, LETTER_TYPE, LETTER_SENT_DATE) AS
(select 1, 'A', date '2017-01-01' from dual union all
SELECT 1, 'B', DATE '2017-02-01' FROM DUAL UNION ALL
--SELECT 1, 'C', DATE '2018-03-01' FROM DUAL UNION ALL
select 1, 'D', date '2018-04-01' from dual union all
select 2, 'A', date '2017-01-01' from dual union all
SELECT 2, 'B', DATE '2017-02-01' FROM DUAL UNION ALL
SELECT 2, 'C', DATE '2018-10-01' FROM DUAL UNION ALL
select 2, 'D', date '2018-12-01' from dual
)
SELECT ID, LETTER_TYPE, LETTER_SENT_DATE FROM TEMP
WHERE LETTER_SENT_DATE > DATE '2018-05-01'
AND LETTER_TYPE = 'C'
UNION
SELECT ID, NULL LETTER_TYPE, NULL LETTER_SENT_DATE
from (
select tt.id , sum(tt.HASLETTER) hasletter
FROM
(SELECT T1.ID, CASE WHEN (T1.LETTER_TYPE='C') THEN 1 ELSE 0 END HASLETTER --, NULL T1.LETTER_SENT_DATE
FROM TEMP T1) TT
GROUP BY TT.ID
) hl
where hl.hasletter = 0
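Stripped of SQL, the requirement is a per-id conditional lookup: report the letter and date if a 'C' letter exists after the cutoff, otherwise NULLs. A small Python sketch of that logic over the sample rows (the `letter_after` helper is hypothetical):

```python
from datetime import date

letters = [
    (1, 'A', date(2017, 1, 1)),  (1, 'B', date(2017, 1, 2)),
    (1, 'C', date(2018, 1, 3)),  (1, 'D', date(2018, 1, 4)),
    (2, 'A', date(2017, 1, 1)),  (2, 'B', date(2017, 1, 2)),
    (2, 'C', date(2018, 1, 10)), (2, 'D', date(2018, 1, 12)),
]

def letter_after(ids, letter, cutoff):
    # For each id: (letter, sent_date) if a matching letter was sent
    # after the cutoff, otherwise (None, None).
    out = {}
    for i in ids:
        hit = next((t for t in letters
                    if t[0] == i and t[1] == letter and t[2] > cutoff), None)
        out[i] = (hit[1], hit[2]) if hit else (None, None)
    return out

print(letter_after([1, 2], 'C', date(2018, 1, 5)))
```

Id 1 has no 'C' letter after the cutoff, so it reports (None, None); id 2 reports ('C', 2018-01-10), matching the expected output.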

Grouping sets of records in sql

The grouping is done on from and toloc, and one group is indicated by usrid.
Table :
from toloc usrid
a b 1
c d 1 --- group 1
e f 1
-------------------
a b 2
c d 2 --- group 2
e f 2
----------------------
a b 3
c d 3 --- group 3
h k 3
After the grouping query, the required result set would be:
from toloc usrid
a b 1
c d 1 --- groups 1 & 2 combined to form one group
e f 1
-------------------
a b 2
c d 2 --- group 2
h k 2
How can I achieve this result set?
I have to group similar sets of records in SQL. Is it possible with ROLLUP or the new GROUPING SETS? I haven't been able to figure it out.
I dug up this old question. Assuming there are no duplicated rows, the following should work.
I solved the question you linked to first and rewrote it to match this one, so the field names will differ from your example.
DECLARE #t TABLE (fromloc VARCHAR(30), toloc VARCHAR(30), usr_history INT)
INSERT #t VALUES ('a', 'b', 1)
INSERT #t VALUES ('c', 'b', 1)
INSERT #t VALUES ('e', 'f', 1)
INSERT #t VALUES ('a', 'b', 2)
INSERT #t VALUES ('c', 'b', 2)
INSERT #t VALUES ('e', 'f', 2)
INSERT #t VALUES ('a', 'b', 3)
INSERT #t VALUES ('c', 'd', 3)
INSERT #t VALUES ('h', 'k', 3)
;WITH c as
(
SELECT t1.usr_history h1, t2.usr_history h2, COUNT(*) COUNT
FROM #t t1
JOIN #t t2 ON t1.fromloc = t2.fromloc and t1.toloc = t2.toloc and t1.usr_history < t2.usr_history
GROUP BY t1.usr_history, t2.usr_history
),
d as (
SELECT usr_history h, COUNT(*) COUNT FROM #t GROUP BY usr_history
),
e as (
SELECT d.h FROM d JOIN c ON c.COUNT = d.COUNT and c.h2 = d.h
JOIN d d2 ON d2.COUNT=c.COUNT and d2.h= c.h1
)
SELECT fromloc, toloc, DENSE_RANK() OVER (ORDER BY usr_history) AS 'usrid'
FROM #t t
WHERE NOT EXISTS (SELECT 1 FROM e WHERE e.h = t.usr_history)
Answer to this question is here: https://stackoverflow.com/a/6727662/195446
Another way is to use FOR XML PATH as a signature of the set of records.
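The signature idea generalizes: compute one canonical value per usrid from its set of (from, toloc) pairs, then collapse identical signatures. A minimal Python sketch using a frozenset as the signature (illustrative only; FOR XML PATH would build a string signature instead):

```python
from collections import defaultdict

rows = [('a', 'b', 1), ('c', 'd', 1), ('e', 'f', 1),
        ('a', 'b', 2), ('c', 'd', 2), ('e', 'f', 2),
        ('a', 'b', 3), ('c', 'd', 3), ('h', 'k', 3)]

# Collect each usrid's set of (from, toloc) pairs.
by_usr = defaultdict(set)
for f, t, usr in rows:
    by_usr[usr].add((f, t))

# Identical signatures (same set of pairs) collapse into one group.
sig_to_usrs = defaultdict(list)
for usr, pairs in sorted(by_usr.items()):
    sig_to_usrs[frozenset(pairs)].append(usr)

for usrs in sig_to_usrs.values():
    print(usrs)
```

Usrids 1 and 2 share a signature and merge into one group; usrid 3 keeps its own.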