Postgres, get two row values that are both linked to the same ID - postgresql

I have a rather tricky database problem that has really stumped me, and I would appreciate any help.
I have a table which includes data from multiple different sources. This data from different sources can be ‘duplicated’ and we have ways of identifying if that is the case.
Each row in the table has an ‘id’, and if it is identified as a duplicate of another row then we merge it, and it is given a ‘merged_into_id’ which refers to another row in the same table.
I am trying to run a report which will return information about where we have identified duplicates from two of those different sources.
Let's say I have three sources: A, B and C. I want to identify all of the duplicate rows between source A and source B.
I have got the query working fine to do this if a row from source A is directly merged into source B. However, we also have instances in the DB where source A row AND source B row are merged into source C. I am struggling with these and was hoping someone could help with that.
An example:
Original DB:
id | source | merged_into_id
----+--------+----------------
1 | A | 3
2 | B | 3
3 | C | NULL
What I would like to do is to be able to return id 1 and id 2 from that table, as they are both merged into the same ID e.g. like so:
source_a_id | source_b_id
-------------+-------------
1 | 2
But I'm really struggling to get to that - all I've managed to do is create a parent and child link like the following:
parent_id | child_id | child_source
-----------+----------+--------------
3 | 1 | A
3 | 2 | B
I can also return just the IDs that I want, but they don't 'join' so to speak:
e.g.
SELECT
CASE WHEN child_source = 'A' THEN child_id END AS source_a_id,
CASE WHEN child_source = 'B' THEN child_id END AS source_b_id
But that just gives me a response with an empty row for the 'missing' data
---EDIT---
Using array_agg and array_to_string I've gotten a little closer to what I need:
SELECT
parent.id as parent_id,
ARRAY_TO_STRING(
ARRAY_AGG(CASE WHEN child_source = 'A' THEN child.id END)
, ','
) a_id,
ARRAY_TO_STRING(
ARRAY_AGG(CASE WHEN child_source = 'B' THEN child.id END)
, ','
) b_id
but it's not quite the right format, as I can occasionally have multiple versions from each source, so I get a table that looks like:
parent_id | a_id | b_id
-----------+------+-------
3 | 1 | 2,4,5
In this case, I want to return a table that looks like:
parent_id | a_id | b_id
-----------+------+------
3 | 1 | 2
3 | 1 | 4
3 | 1 | 5
Does anyone have any advice on getting to my desired output? Many thanks

Suppose that we have this table
select * from t;
id | source | merged_into_id
----+--------+----------------
1 | A | 3
2 | B | 3
3 | C |
5 | B | 3
4 | B | 3
(5 rows)
This should do the work
WITH B_source as (select * from t where source = 'B'),
A_source as (select * from t where source = 'A')
SELECT merged_into_id,A_source.id as a_id,B_source.id as b_id
FROM A_source
INNER JOIN B_source using (merged_into_id);
Result
merged_into_id | a_id | b_id
----------------+------+------
3 | 1 | 2
3 | 1 | 5
3 | 1 | 4
(3 rows)
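To make the join's behaviour easier to follow, here is a minimal Python sketch of the same pairing logic, run against the sample rows above. The function name pair_duplicates and the tuple layout are assumptions made for illustration; the SQL above remains the actual answer.

```python
from collections import defaultdict

# Rows mirror the sample table above: (id, source, merged_into_id).
ROWS = [
    (1, "A", 3),
    (2, "B", 3),
    (3, "C", None),
    (5, "B", 3),
    (4, "B", 3),
]


def pair_duplicates(rows, left="A", right="B"):
    """Pair every `left` row with every `right` row merged into the same id."""
    by_parent = defaultdict(lambda: {left: [], right: []})
    for row_id, source, merged_into_id in rows:
        # Rows with no merged_into_id never match, as in the inner join.
        if merged_into_id is not None and source in (left, right):
            by_parent[merged_into_id][source].append(row_id)
    return [
        (parent, a_id, b_id)
        for parent, ids in sorted(by_parent.items())
        for a_id in sorted(ids[left])
        for b_id in sorted(ids[right])
    ]


print(pair_duplicates(ROWS))
```

As with the join, a parent that has several A rows and several B rows yields every A/B combination, and rows merged directly from A into B are not covered by this pairing.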

Related

PSQL filter each group of rows

Recently I ran into a fairly unusual filtering case in PSQL.
My question is: how do I filter redundant elements out of each group of a grouped table?
For example, we have the following table:
id | group_idx | filter_idx
1 1 x
2 3 z
3 3 x
4 2 x
5 1 x
6 3 x
7 2 x
8 1 z
9 2 z
Firstly, to group rows:
SELECT group_idx FROM table
GROUP BY group_idx;
But how can I filter the redundant rows (filter_idx = 'z') out of each group after grouping?
P.S. I can't simply write the following, because I need to find the groups first.
SELECT group_idx FROM table
where filter_idx <> 'z';
Thanks.
Assuming that you want to see all groups at all times, even when you filter out all records of some group:
drop table if exists test cascade;
create table test (id integer, group_idx integer, filter_idx character);
insert into test
(id,group_idx,filter_idx)
values
(1,1,'x'),
(2,3,'z'),
(3,3,'x'),
(4,2,'x'),
(5,1,'x'),
(6,3,'x'),
(7,2,'x'),
(8,1,'z'),
(9,2,'z'),
(0,4,'y');--added an example of a group that would be discarded using WHERE.
Get groups in one query, filter your rows in another, then left join the two.
select groups.group_idx,
string_agg(filtered_rows.filter_idx,',')
from
(select distinct group_idx from test) groups
left join
(select group_idx,filter_idx from test where filter_idx<>'y') filtered_rows
using (group_idx)
group by 1;
-- group_idx | string_agg
-------------+------------
-- 3 | z,x,x
-- 4 |
-- 2 | x,x,z
-- 1 | x,x,z
--(4 rows)
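The "collect the groups first, then filter within them" idea can also be sketched in plain Python. This is only an illustration on the same test data; filter_within_groups is a hypothetical helper name, and the excluded value is passed in as a parameter.

```python
# Same rows as the test table above: (id, group_idx, filter_idx).
ROWS = [
    (1, 1, "x"), (2, 3, "z"), (3, 3, "x"), (4, 2, "x"), (5, 1, "x"),
    (6, 3, "x"), (7, 2, "x"), (8, 1, "z"), (9, 2, "z"), (0, 4, "y"),
]


def filter_within_groups(rows, excluded):
    """Return {group_idx: [kept filter_idx values]}, keeping empty groups."""
    groups = {}
    for _row_id, group_idx, filter_idx in rows:
        # Register the group even if none of its rows survive the filter,
        # which is the role the LEFT JOIN plays in the SQL version.
        kept = groups.setdefault(group_idx, [])
        if filter_idx != excluded:
            kept.append(filter_idx)
    return groups


print(filter_within_groups(ROWS, "y"))
```

Passing "y" mirrors the query above, where group 4 stays in the result with nothing left in it; passing "z" would apply the filter from the question instead.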

How to sum children occurrences from a joining table in Postgres?

I need to count how many consultants are using a skill through a joining table (consultant_skills), and the challenge is to sum the children occurrences to the parents recursively.
Here's the reproduction of what I'm trying to accomplish. The current results are:
skill_id | count
2 | 2
3 | 1
5 | 1
6 | 1
But I need to compute the count to the parents recursively, where the expected result would be:
skill_id | count
1 | 2
2 | 2
3 | 1
4 | 2
5 | 2
6 | 1
Does anyone know how can I do that?
Sqlfiddle Solution
You need to use WITH RECURSIVE, as Mike suggests. His answer is useful, especially in reference to using distinct to eliminate redundant counts for consultants, but it doesn't produce the exact results you're looking for.
See the working solution in the sqlfiddle above. I believe this is what you are looking for:
WITH RECURSIVE results(skill_id, parent_id, consultant_id)
AS (
SELECT skills.id as skill_id, parent_id, consultant_id
FROM consultant_skills
JOIN skills on skill_id = skills.id
UNION ALL
SELECT skills.id as skill_id, skills.parent_id as parent_id, consultant_id
FROM results
JOIN skills on results.parent_id = skills.id
)
SELECT skill_id, count(distinct consultant_id) from results
GROUP BY skill_id
ORDER BY skill_id
What is happening in the query below the UNION ALL is that we're recursively joining the skills table to itself, but rotating in the previous parent id as the new skill id, and using the new parent id on each iteration. The recursion stops because eventually the parent id is NULL and there is no JOIN because it's an INNER join. Hope that makes sense.
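The walk up the parent chain can be sketched in Python as well. The original tables live only in the sqlfiddle, so the PARENT mapping and CONSULTANT_SKILLS rows below are hypothetical data, invented so that the direct and rolled-up counts match the numbers quoted in the question; count_with_ancestors is likewise an illustrative name.

```python
from collections import defaultdict

# Hypothetical data (not from the original fiddle).
PARENT = {1: None, 2: 1, 3: 1, 4: None, 5: 4, 6: 5}  # skill_id -> parent_id
CONSULTANT_SKILLS = [(10, 2), (11, 2), (10, 3), (12, 5), (13, 6)]  # (consultant_id, skill_id)


def count_with_ancestors(parent, consultant_skills):
    """Count distinct consultants per skill, crediting every ancestor too."""
    consultants = defaultdict(set)
    for consultant_id, skill_id in consultant_skills:
        current = skill_id
        # Climb until the parent is None, like the recursion ending on NULL.
        while current is not None:
            consultants[current].add(consultant_id)
            current = parent[current]
    return {skill: len(ids) for skill, ids in sorted(consultants.items())}


print(count_with_ancestors(PARENT, CONSULTANT_SKILLS))
```

Using a set per skill plays the same role as count(distinct consultant_id): a consultant who holds two child skills is counted once for the shared parent.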

SQL Query for equal and opposite values

Suppose I have a table with two columns, id and val. I want to find all the distinct ids where there exists a pair of equal and opposite vals. For example, suppose you have the following table
id | val
------+------
1 | 3
2 | 5
2 | -5
1 | 4
2 | 6
3 | 9
2 | -6
3 | -9
I want the result to be
result
2
3
2 is in the result set because there are values 5, -5 and 6, -6. 3 is in the result set because of 9, -9.
I can do this by using where exists. Something like
select distinct tab1.id from tab tab1
where exists (
select * from tab tab2
where tab1.id = tab2.id
and tab1.val = -tab2.val
);
However I worry that a query like this has time complexity O(n^2) because it is computed like nested loops (?). However it is possible to compute this in O(n) time by scanning the table (and keeping track of previously seen results in a data structure with O(1) lookup time). What is the optimal way to write such a query?
It would help to see the EXPLAIN output for your query and how your indexes are set up.
It could perhaps also be done like this:
WITH pos AS (
SELECT id, val FROM tab WHERE val > 0),
neg AS (
SELECT id, val FROM tab WHERE val < 0)
SELECT DISTINCT id
FROM pos JOIN neg USING (id)
WHERE pos.val = -neg.val;
With the right indexes, this could be fast. It also depends on the volume of data.
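The single-pass idea from the question (scan once, remembering previously seen values in a structure with O(1) lookup) can be sketched in Python as follows. This only illustrates the algorithm; it says nothing about how the PostgreSQL planner will execute either query. The function name is hypothetical, and zero values are skipped on the assumption that a lone 0 should not count as its own opposite, which matches the pos/neg split above.

```python
# Rows from the example table: (id, val).
ROWS = [(1, 3), (2, 5), (2, -5), (1, 4), (2, 6), (3, 9), (2, -6), (3, -9)]


def ids_with_opposite_pairs(rows):
    """Return the sorted ids that have both some value v and -v (v != 0)."""
    seen = set()    # (id, val) pairs encountered so far
    result = set()
    for row_id, val in rows:
        # Expected O(1) membership test against everything seen earlier.
        if val != 0 and (row_id, -val) in seen:
            result.add(row_id)
        seen.add((row_id, val))
    return sorted(result)


print(ids_with_opposite_pairs(ROWS))
```

A hash join on (id, val) gives a database the same kind of expected linear behaviour, so checking the actual plan with EXPLAIN is more informative than reasoning from the shape of the SQL.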

Does the returning clause always execute first?

I have a many-to-many relation representing containers holding items.
I have a primary key row_id in the table.
I insert four rows: (container_id, item_id) values (1778712425160346751, 4). These rows will be identical except the aforementioned unique row_id.
I subsequently execute the following query:
delete from contains
where item_id = 4 and
container_id = '1778712425160346751' and
row_id =
(
select max(row_id) from contains
where container_id = '1778712425160346751' and
item_id = 4
)
returning
(
select count(*) from contains
where container_id = '1778712425160346751' and
item_id = 4
);
Now I expected to get 3 returned from this query, but I got a 4. Getting a 4 is the desired behavior, but it is not what was expected.
My question is: can I always expect that the returning clause executes before the delete, or is this an idiosyncrasy of certain versions or specific software?
The use of a query in the returning section is allowed but not documented. From the documentation:
output_expression
An expression to be computed and returned by the DELETE command after each row is deleted. The expression can use any column names of the table named by table_name or table(s) listed in USING. Write * to return all columns.
It seems logical that the query sees the table in a state before deleting, as the statement is not completed yet.
create temp table test as
select id from generate_series(1, 4) id;
delete from test
returning id, (select count(*) from test);
id | count
----+-------
1 | 4
2 | 4
3 | 4
4 | 4
(4 rows)
The same applies to update:
create temp table test as
select id from generate_series(1, 4) id;
update test
set id = id+ 1
returning id, (select sum(id) from test);
id | sum
----+-----
2 | 10
3 | 10
4 | 10
5 | 10
(4 rows)

How to find the last descendant (that matches other criteria) in a linear “ancestor-descendant” relationship

This question is based on the following question, but with an additional requirement: PostgreSQL: How to find the last descendant in a linear "ancestor-descendant" relationship
Basically, what I need is a PostgreSQL statement that finds the last descendant in a linear “ancestor-descendant” relationship that matches additional criteria.
Example:
Here is the content of table "RELATIONSHIP_TABLE":
id | id_ancestor | id_entry | bool_flag
---------------------------------------
1 | null | a | false
2 | 1 | a | false
3 | 2 | a | true
4 | 3 | a | false
5 | null | b | true
6 | null | c | false
7 | 6 | c | false
Every record within a particular hierarchy has the same "id_entry"
There are 3 different “ancestor-descendant” relationships in this example:
1. 1 <- 2 <- 3 <- 4
2. 5
3. 6 <- 7
Question PostgreSQL: How to find the last descendant in a linear "ancestor-descendant" relationship shows how to find the last record of each relationship. In the example above:
1. 4
2. 5
3. 7
So, what I need this time is the last descendant by "id_entry" whose "bool_flag" is set to true. In the example above:
1. 3
2. 5
3. <empty result>
Does anyone know a solution?
Thanks in advance :)
QStormDS
Graphs, trees, chains, etc represented as edge lists are usually good uses for recursive common table expressions - i.e. WITH RECURSIVE queries.
Something like:
WITH RECURSIVE walk(id, id_ancestor, id_entry, bool_flag, id_root, generation) AS (
SELECT id, id_ancestor, id_entry, bool_flag, id, 0
FROM RELATIONSHIP_TABLE
WHERE id_ancestor IS NULL
UNION ALL
SELECT x.id, x.id_ancestor, x.id_entry, x.bool_flag, walk.id_root, walk.generation + 1
FROM RELATIONSHIP_TABLE x INNER JOIN walk ON x.id_ancestor = walk.id
)
SELECT
id_entry, id_root, id
FROM (
SELECT
id, id_entry, bool_flag, id_root, generation,
max(CASE WHEN bool_flag THEN generation END ) OVER w as max_enabled_generation
FROM walk
WINDOW w AS (PARTITION BY id_root ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) x
WHERE generation = max_enabled_generation;
... though it feels like there really should be a better way to do this than tracking how many generations we've walked down each path.
If id_entry is common for all members of a tree, you can avoid needing to track id_root. You should create a UNIQUE constraint on (id_entry, id) and a foreign key constraint on FOREIGN KEY (id_entry, id_ancestor) REFERENCES (id_entry, id) to make sure that the ordering is consistent, then use:
WITH RECURSIVE walk(id, id_ancestor, id_entry, bool_flag, generation) AS (
SELECT id, id_ancestor, id_entry, bool_flag, 0
FROM RELATIONSHIP_TABLE
WHERE id_ancestor IS NULL
UNION ALL
SELECT x.id, x.id_ancestor, x.id_entry, x.bool_flag, walk.generation + 1
FROM RELATIONSHIP_TABLE x INNER JOIN walk ON x.id_ancestor = walk.id
)
SELECT
id_entry, id
FROM (
SELECT
id, id_entry, bool_flag, generation,
max(CASE WHEN bool_flag THEN generation END ) OVER w as max_enabled_generation
FROM walk
WINDOW w AS (PARTITION BY id_entry ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) x
WHERE generation = max_enabled_generation;
Since this gives you a table of final descendants matched up with root parents, you can now filter with a regular WHERE clause: just append AND bool_flag. If you instead want to exclude chains that have bool_flag set to false at any point along the way, you can add a bool_flag condition to the RECURSIVE query's join.
SQLFiddle example: http://sqlfiddle.com/#!12/92a64/3
WITH RECURSIVE tail AS (
SELECT id AS opa
, id, bool_flag FROM boolshit
WHERE bool_flag = True
UNION ALL
SELECT t.opa AS opa
, b.id, b.bool_flag FROM boolshit b
JOIN tail t ON b.id_ancestor = t.id
)
SELECT *
FROM boolshit bs
WHERE bs.bool_flag = True
AND NOT EXISTS (
SELECT * FROM tail t
WHERE t.opa = bs.id
AND t.id <> bs.id
AND t.bool_flag = True
);
Explanation: select all records that have the bool_flag set,
EXCEPT those that have offspring (direct or indirect) that have the bool_flag set, too. This effectively picks the last record of the chain that has the flag set.
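For comparison, the same "last flagged row in each chain" logic can be sketched in Python on the example data from the question. It assumes the relationship is strictly linear (each row has at most one child) and acyclic; last_flagged_descendants is a hypothetical name used only here.

```python
# Rows from the question: (id, id_ancestor, id_entry, bool_flag).
ROWS = [
    (1, None, "a", False),
    (2, 1, "a", False),
    (3, 2, "a", True),
    (4, 3, "a", False),
    (5, None, "b", True),
    (6, None, "c", False),
    (7, 6, "c", False),
]


def last_flagged_descendants(rows):
    """Map each id_entry to the id of the last row in its chain with the flag set."""
    by_id = {row[0]: row for row in rows}
    # Linear chains: every ancestor has at most one direct child.
    child_of = {row[1]: row[0] for row in rows if row[1] is not None}
    result = {}
    for row_id, id_ancestor, id_entry, _flag in rows:
        if id_ancestor is not None:
            continue  # start only from the root of each chain
        last_flagged = None
        current = row_id
        while current is not None:
            if by_id[current][3]:
                last_flagged = current  # later flagged rows overwrite earlier ones
            current = child_of.get(current)
        result[id_entry] = last_flagged
    return result


print(last_flagged_descendants(ROWS))
```

Chains without any flagged row map to None here, which corresponds to the empty result the question expects for the third relationship.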