Suppose I have a table with two columns id and val. I wan't to find all the distinct ids where there exist a pair of equal and opposite vals. For example suppose you have the following table
id | val
------+------
1 | 3
2 | 5
2 | -5
1 | 4
2 | 6
3 | 9
2 | -6
3 | -9
I want the result to be
result
2
3
2 in the result set because there are values 5, -5 and 6, -6. 3 is in the result set because of 9, -9.
I can do this by using where exists. Something like
select distinct tab1.id from tab tab1
where exists (
select * from tab tab2
where tab1.id = tab2.id
and tab1.val = -tab2.val
);
However I worry that a query like this has time complexity O(n^2) because it is computed like nested loops (?). However it is possible to compute this in O(n) time by scanning the table (and keeping track of previously seen results in a data structure with O(1) lookup time). What is the optimal way to write such a query?
We should have an explain of your request and how you sets indexes.
May be it could be done like this too :
WITH pos AS (
SELECT id, val FROM tab WHERE val > 0),
neg AS (
SELECT id, val FROM tab WHERE val < 0)
SELECT DISTINCT id
FROM pos JOIN neg USING (id)
WHERE pos.val = neg.val;
With right indexation, this could be quick. Depend also of the volume of data.
Related
While trying to map some data to a table, I wanted to obtain the ID of a table and its modulo respect the total rows in the same table. For example, given this table:
id
--
1
3
10
12
I would like this result:
id | mod
---+----
1 | 1 <- 1 mod 4
3 | 3 <- 3 mod 4
10 | 2 <- 10 mod 4
12 | 0 <- 12 mod 4
Is there an easy way to achieve this dynamically (as in, not counting the rows on before hand or doing it in an atomic way)?
So far I've tried something like this:
SELECT t1.id, t1.id % COUNT(t1.id) mod FROM tbl t1, tbl t2 GROUP BY t1.id;
This works but you must have the GROUP BY and tbl t2 as otherwise it returns 0 for the mod column which makes sense because I think it works by multiplying the table by itself so each ID gets a full set of the table. I guess for small enough tables this is ok but I can see how this becomes problematic for larger tables.
Edit: Found another hack-ish way:
WITH total AS (
SELECT COUNT(*) cnt FROM tbl
)
SELECT t1.id, t1.id % t2.cnt mod FROM tbl t1, total t2
It similar to the previous query but it "collapses" the multiplication to a single row with the previous count.
You can use COUNT() window function:
SELECT id,
id % COUNT(*) OVER () mod
FROM tbl;
I'm sure that the optimizer is smart enough to calculate the result of the window function only once.
See the demo.
I have a rather tricky database problem that has really stumped me, would appreciate any help.
I have a table which includes data from multiple different sources. This data from different sources can be ‘duplicated’ and we have ways of identifying if that is the case.
Each row in the table has an ‘id’, and if it is identified as a duplicate of another row then we merge it, and it is given a ‘merged_into_id’ which refers to another row in the same table.
I am trying to run a report which will return information about where we have identified duplicates from two of those different sources.
Lets say I have three sources: A, B and C. I want to identify all of the duplicate rows between source A and source B.
I have got the query working fine to do this if a row from source A is directly merged into source B. However, we also have instances in the DB where source A row AND source B row are merged into source C. I am struggling with these and was hoping someone could help with that.
An example:
Original DB:
id
source
merged_into_id
1
A
3
2
B
3
3
C
NULL
What I would like to do is to be able to return id 1 and id 2 from that table, as they are both merged into the same ID e.g. like so:
source_a_id
source_b_id
1
2
But I'm really struggling to get to that - all I've managed to do is create a parent and child link like the following:
parent_id
child_id
child_source
3
1
A
3
2
B
I can also return just the IDs that I want, but they don't 'join' so to speak:
e.g.
SELECT
CASE WHEN child_source = 'A' then child_id as source_a_id,
CASE WHEN child_source = 'B' then child_id as source_b_id
But that just gives me a response with an empty row for the 'missing' data
---EDIT---
Using array_agg and array_to_string I've gotten a little closer to what I need:
SELECT
parent.id as parent_id,
ARRAY_TO_STRING(
ARRAY_AGG(CASE WHEN child_source = 'A' THEN child.id END)
, ','
) a_id,
ARRAY_TO_STRING(
ARRAY_AGG(CASE WHEN child_source = 'B' THEN child.id END)
, ','
) b_id
but its not quite the right format as I can occasionally have multiple versions from each source, so I get a table that looks like :
parent_id
a_id
b_id
3
1
2,4,5
In this case, I want to return a table that looks like:
parent_id
a_id
b_id
3
1
2
3
1
4
3
1
5
Does anyone have any advice on getting to my desired output? Many thanks
Suppose that we have this table
select * from t;
id | source | merged_into_id
----+--------+----------------
1 | A | 3
2 | B | 3
3 | C |
5 | B | 3
4 | B | 3
(5 rows)
This should do the work
WITH B_source as (select * from t where source = 'B'),
A_source as (select * from t where source = 'A')
SELECT merged_into_id,A_source.id as a_id,B_source.id as b_id
FROM A_source
INNER JOIN B_source using (merged_into_id);
Result
merged_into_id | a_id | b_id
----------------+------+------
3 | 1 | 2
3 | 1 | 5
3 | 1 | 4
(3 rows)
I need to count how many consultants are using a skill through a joining table (consultant_skills), and the challenge is to sum the children occurrences to the parents recursively.
Here's the reproduction of what I'm trying to accomplish. The current results are:
skill_id | count
2 | 2
3 | 1
5 | 1
6 | 1
But I need to compute the count to the parents recursively, where the expected result would be:
skill_id | count
1 | 2
2 | 2
3 | 1
4 | 2
5 | 2
6 | 1
Does anyone know how can I do that?
Sqlfiddle Solution
You need to use WITH RECURSIVE, as the Mike suggests. His answer is useful, especially in reference to using distinct to eliminate redundant counts for consultants, but it doesn't drive to the exact results you're looking for.
See the working solution in the sqlfiddle above. I believe this is what you are looking for:
WITH RECURSIVE results(skill_id, parent_id, consultant_id)
AS (
SELECT skills.id as skill_id, parent_id, consultant_id
FROM consultant_skills
JOIN skills on skill_id = skills.id
UNION ALL
SELECT skills.id as skill_id, skills.parent_id as parent_id, consultant_id
FROM results
JOIN skills on results.parent_id = skills.id
)
SELECT skill_id, count(distinct consultant_id) from results
GROUP BY skill_id
ORDER BY skill_id
What is happening in the query below the UNION ALL is that we're recursively joining the skills table to itself, but rotating in the previous parent id as the new skill id, and using the new parent id on each iteration. The recursion stops because eventually the parent id is NULL and there is no JOIN because it's an INNER join. Hope that makes sense.
This question is based on the following question, but with an additional requirement: PostgreSQL: How to find the last descendant in a linear "ancestor-descendant" relationship
Basically, what I need is a Postgre-SQL statement that finds the last descendant in a linear “ancestor-descendant” relationship that matches additional criteria.
Example:
Here the content of table "RELATIONSHIP_TABLE":
id | id_ancestor | id_entry | bool_flag
---------------------------------------
1 | null | a | false
2 | 1 | a | false
3 | 2 | a | true
4 | 3 | a | false
5 | null | b | true
6 | null | c | false
7 | 6 | c | false
Every record within a particular hierarchy has the same "id_entry"
There are 3 different “ancestor-descendant” relationships in this example:
1. 1 <- 2 <- 3 <- 4
2. 5
3. 6 <- 7
Question PostgreSQL: How to find the last descendant in a linear "ancestor-descendant" relationship shows how to find the last record of each relationship. In the example above:
1. 4
2. 5
3. 7
So, what I need this time is the last descendant by "id_entry" whose "bool_flag" is set to true. In the example above:
1. 3
2. 5
3. <empty result>
Does anyone know a solution?
Thanks in advance :)
QStormDS
Graphs, trees, chains, etc represented as edge lists are usually good uses for recursive common table expressions - i.e. WITH RECURSIVE queries.
Something like:
WITH RECURSIVE walk(id, id_ancestor, id_entry, bool_flag, id_root, generation) AS (
SELECT id, id_ancestor, id_entry, bool_flag, id, 0
FROM RELATIONSHIP_TABLE
WHERE id_ancestor IS NULL
UNION ALL
SELECT x.id, x.id_ancestor, x.id_entry, x.bool_flag, walk.id_root, walk.generation + 1
FROM RELATIONSHIP_TABLE x INNER JOIN walk ON x.id_ancestor = walk.id
)
SELECT
id_entry, id_root, id
FROM (
SELECT
id, id_entry, bool_flag, id_root, generation,
max(CASE WHEN bool_flag THEN generation END ) OVER w as max_enabled_generation
FROM walk
WINDOW w AS (PARTITION BY id_root ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
) x
WHERE generation = max_enabled_generation;
... though it feels like there really should be a better way to do this than tracking how many generations we've walked down each path.
If id_entry is common for all members of a tree, you can avoid needing to track id_root. You should create a UNIQUE constraint on (id_entry, id) and a foreign key constraint on FOREIGN KEY (id_entry, id_ancestor) REFERENCES (id_entry, id) to make sure that the ordering is consistent, then use:
WITH RECURSIVE walk(id, id_ancestor, id_entry, bool_flag, generation) AS (
SELECT id, id_ancestor, id_entry, bool_flag, 0
FROM RELATIONSHIP_TABLE
WHERE id_ancestor IS NULL
UNION ALL
SELECT x.id, x.id_ancestor, x.id_entry, x.bool_flag, walk.generation + 1
FROM RELATIONSHIP_TABLE x INNER JOIN walk ON x.id_ancestor = walk.id
)
SELECT
id_entry, id
FROM (
SELECT
id, id_entry, bool_flag, generation,
max(CASE WHEN bool_flag THEN generation END ) OVER w as max_enabled_generation
FROM walk
WINDOW w AS (PARTITION BY id_entry ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
) x
WHERE generation = max_enabled_generation;
Since this gives you a table of final descendents matched up with root parents, you can just filter with a regular WHERE clause now, just append AND bool_flag. If you instead want to exclude chains that have bool_flag set to false at any point along the way, you can add WHERE bool_value in the RECURSIVE query's join.
SQLFiddle example: http://sqlfiddle.com/#!12/92a64/3
WITH RECURSIVE tail AS (
SELECT id AS opa
, id, bool_flag FROM boolshit
WHERE bool_flag = True
UNION ALL
SELECT t.opa AS opa
, b.id, b.bool_flag FROM boolshit b
JOIN tail t ON b.id_ancestor = t.id
)
SELECT *
FROM boolshit bs
WHERE bs.bool_flag = True
AND NOT EXISTS (
SELECT * FROM tail t
WHERE t.opa = bs.id
AND t.id <> bs.id
AND t.bool_flag = True
);
Explanation: select all records that have the bool_flag set,
EXCEPT those that have offspring (direct or indirect) that have the bool_flag set, too. This effectively picks the last record of the chain that has the flag set.
I have a table where the data may look like so:
ID | PARENT_ID | several_columns_to_follow
---+-----------+--------------------------
1 | 1 | some data
2 | 2 | ...
3 | 1 | ...
4 | 4 | ...
5 | 3 | ...
6 | 1 | ...
7 | 2 | ...
8 | 3 | ...
Per requirements, I need to allow sorting in two ways (per user request):
1) Random order of ID's within sequential parents - this is achieved easily with
SELECT * FROM my_table ORDER BY parent_id, random()
2) Random order of ID's within randomly sorted parents - and this is where I'm stuck. Obviously, just sorting the whole thing by random() won't be very useful.
Any suggestions? Ideally, in one SQL statement, but I'm willing to go for more than one if needed. The amount of data is not large, so I'm not worried about performance (there will never be more than about 100 rows in the final result).
Maybe this will do:
ORDER BY ((parent_id * date_part('microseconds', NOW()) :: Integer) % 123456),
((id * date_part('microseconds', NOW()) :: Integer) % 123456);
Maybe a prime number instead of 123456 will yield "more random" results
random() really is a nasty function in a final sort. If almost random is good enough, maybe you could sort on some hashed-up function of the row values, like
SELECT * FROM my_table
ORDER BY parent_id, (id *12345) % 54321
;
If it's never more than 100 records, it may make sense just to select all the records, place it in a local table (or in memory data structure on your client end) and just do a proper Fisher-Yates shuffle.
Something like this should do it:
WITH r AS (SELECT id, random() as rand FROM my_table ORDER BY rand)
SELECT m.* FROM r JOIN my_table m USING (id)
ORDER BY r.rand, random();