postgresql ignores index for recursive query


I have a table representing a graph of hierarchy links (parent_id, child_id).
The table has indexes on parent_id, on child_id, and on both together.
The graph may contain loops, and I need to detect them (or perhaps find all loops in order to eliminate them).
I also need to query all parents of a node recursively.
For this I use the following query (it is meant to be saved as a view):
WITH RECURSIVE recursion(parent_id, child_id, node_id, path) AS (
SELECT h.parent_id,
h.child_id,
h.child_id AS node_id,
ARRAY[h.parent_id, h.child_id] AS path
FROM hierarchy h
UNION ALL
SELECT h.parent_id,
h.child_id,
r.node_id,
ARRAY[h.parent_id] || r.path
FROM hierarchy h JOIN recursion r ON h.child_id = r.parent_id
WHERE NOT r.path @> ARRAY[h.parent_id]
)
SELECT parent_id,
child_id,
node_id,
path
FROM recursion
where node_id = 883
For this query Postgres chooses a terrible plan:
"CTE Scan on recursion (cost=2703799682.88..4162807558.70 rows=324223972 width=56)"
" Filter: (node_id = 883)"
" CTE recursion"
" -> Recursive Union (cost=0.00..2703799682.88 rows=64844794481 width=56)"
" -> Seq Scan on hierarchy h (cost=0.00..74728.61 rows=4210061 width=56)"
" -> Merge Join (cost=10058756.99..140682906.47 rows=6484058442 width=56)"
" Merge Cond: (h_1.child_id = r.parent_id)"
" Join Filter: (NOT (r.path #> ARRAY[h_1.parent_id]))"
" -> Index Scan using hierarchy_idx_child on hierarchy h_1 (cost=0.43..256998.25 rows=4210061 width=16)"
" -> Materialize (cost=10058756.56..10269259.61 rows=42100610 width=48)"
" -> Sort (cost=10058756.56..10164008.08 rows=42100610 width=48)"
" Sort Key: r.parent_id"
" -> WorkTable Scan on recursion r (cost=0.00..842012.20 rows=42100610 width=48)"
It seems that Postgres does not understand that the outer filter on node_id applies to child_id in the first (non-recursive) subquery.
I suppose I'm doing something very wrong, but what exactly?

It looks like you just need to move the node filter into the first (non-recursive) part of the union. The node_id alias cannot be referenced in that WHERE clause, so filter on h.child_id directly:
WITH RECURSIVE recursion(parent_id, child_id, node_id, path) AS (
SELECT h.parent_id,
h.child_id,
h.child_id AS node_id,
ARRAY[h.parent_id, h.child_id] AS path
FROM hierarchy h
WHERE h.child_id = 883
UNION ALL
SELECT h.parent_id,
h.child_id,
r.node_id,
ARRAY[h.parent_id] || r.path
FROM hierarchy h JOIN recursion r ON h.child_id = r.parent_id
WHERE NOT r.path @> ARRAY[h.parent_id]
)
SELECT parent_id,
child_id,
node_id,
path
FROM recursion
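To check that filtering in the anchor returns the same rows as filtering the finished CTE, here is a small sketch. It is an illustration under stated assumptions, not the original setup: it runs on SQLite through Python's sqlite3 module rather than on PostgreSQL, the sample edges are made up, and the path is kept as a comma-delimited string because SQLite has no array type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hierarchy (parent_id INTEGER, child_id INTEGER)")
# Made-up edges: a loop 1 -> 2 -> 3 -> 1, an extra parent 4 -> 3, and 5 -> 6.
conn.executemany(
    "INSERT INTO hierarchy VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 1), (4, 3), (5, 6)],
)

# {anchor_filter} and {outer_filter} are placeholders for the two variants.
TEMPLATE = """
WITH RECURSIVE recursion(parent_id, child_id, node_id, path) AS (
    SELECT h.parent_id, h.child_id, h.child_id,
           ',' || h.parent_id || ',' || h.child_id || ','
    FROM hierarchy h
    {anchor_filter}
    UNION ALL
    SELECT h.parent_id, h.child_id, r.node_id,
           ',' || h.parent_id || r.path
    FROM hierarchy h JOIN recursion r ON h.child_id = r.parent_id
    -- cycle guard: skip parents that are already on the path
    WHERE instr(r.path, ',' || h.parent_id || ',') = 0
)
SELECT parent_id, child_id, node_id FROM recursion
{outer_filter}
"""

def ancestors(anchor_filter="", outer_filter=""):
    """Run one variant of the query and return its rows in sorted order."""
    sql = TEMPLATE.format(anchor_filter=anchor_filter, outer_filter=outer_filter)
    return sorted(conn.execute(sql).fetchall())

# Variant 1 builds the closure for every node, then filters.
filtered_outside = ancestors(outer_filter="WHERE node_id = 3")
# Variant 2 starts the recursion from the one node of interest.
filtered_in_anchor = ancestors(anchor_filter="WHERE h.child_id = 3")
print(filtered_outside == filtered_in_anchor)
```

Both variants return the same rows, but only the second one lets the database start from the few edges that end at the requested node instead of seeding the recursion with the whole table.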

Here is a much more efficient way to solve graph traversal tasks:
CREATE OR REPLACE FUNCTION public.terr_ancestors(IN bigint)
RETURNS TABLE(node_id bigint, depth integer, path bigint[]) AS
$BODY$
WITH RECURSIVE recursion(node_id, depth, path) AS (
-- anchor: start from the requested node only
SELECT $1 AS node_id, 0, ARRAY[$1] AS path
UNION ALL
-- step: climb one level, prepending the parent to the path
SELECT h.parent_id AS node_id, r.depth + 1, h.parent_id || r.path
FROM hierarchy h JOIN recursion r ON h.child_id = r.node_id
-- cycle guard: the parent must differ from every node already on the path
WHERE h.parent_id <> ALL(r.path)
)
SELECT * FROM recursion
$BODY$
LANGUAGE sql STABLE;
Descendants can be handled in a similar way.
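A cycle guard of this kind has to compare the candidate parent with every element of the path. As a hedged illustration of why the quantifier matters, the sketch below mirrors the SQL semantics of x <> ANY(arr) and x <> ALL(arr) in plain Python, ignoring NULL handling. The function names are made up for the example.

```python
def differs_from_any(x, arr):
    """Mirrors SQL `x <> ANY(arr)`: true if at least one element differs."""
    return any(x != e for e in arr)

def differs_from_all(x, arr):
    """Mirrors SQL `x <> ALL(arr)`: true only if x is absent from arr."""
    return all(x != e for e in arr)

path = [3, 2, 1]  # nodes already visited
revisit = 2       # a node that is already on the path

# ANY lets the revisit through as soon as the path holds some other node,
# so it cannot stop a loop; ALL rejects it.
print(differs_from_any(revisit, path))
print(differs_from_all(revisit, path))
```

With a path of two or more distinct nodes the ANY form is always true, so only the ALL form (or NOT x = ANY(arr)) works as a guard against loops.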

Related

Postgresql node traversal using Recursive CTE

I am just trying to learn graph traversal using a recursive CTE in PostgreSQL.
Below is my data set:
I am using the code below to get the path along with the existing columns (node and edges).
It gives me output, but the path column is not in ARRAY format.
;WITH RECURSIVE CTE AS
(
SELECT NODE,EDGES,ARRAY[G.NODE]::TEXT AS PATH,1 AS LEVEL
FROM property_graph G
UNION ALL
SELECT G.NODE,G.EDGES,C.PATH || G.NODE,LEVEL + 1
FROM property_graph G
INNER JOIN CTE C ON G.NODE = ANY(C.EDGES)
WHERE G.NODE <> ALL(STRING_TO_ARRAY(C.PATH,'')) --Cond added to avoid cyclic graph
)
SELECT NODE,EDGES,PATH,LEVEL
FROM CTE
ORDER BY NODE,LEVEL;
Output:
Could you guys help me?
Thanks in advance.

The problem is that your PATH column is of type TEXT, and so is NODE, therefore the || operator performs string concatenation rather than array concatenation.
You should change the type of your PATH column from TEXT to TEXT[] (and then you can remove the STRING_TO_ARRAY in the WHERE clause).
For example:
WITH RECURSIVE CTE AS
(
SELECT NODE,EDGES,ARRAY[G.NODE]::TEXT[] AS PATH,1 AS LEVEL
FROM property_graph G
UNION ALL
SELECT G.NODE,G.EDGES,C.PATH || ARRAY[G.NODE]::TEXT[],LEVEL + 1
FROM property_graph G
INNER JOIN CTE C ON G.NODE = ANY(C.EDGES)
WHERE G.NODE <> ALL(C.PATH) --Cond added to avoid cyclic graph
)
SELECT NODE,EDGES,PATH,LEVEL
FROM CTE
ORDER BY NODE,LEVEL;

Postgres: Optimisation for query "WHERE id IN (...)"

I have a table (2M+ records) which keeps track of a ledger.
Some entries add points, while others subtract points (there are only two kinds of entries). The entries which subtract points always reference the (adding) entries they were subtracted from via referenceentryid. The adding entries always have NULL in referenceentryid.
This table has a dead column which a worker sets to true when an addition is depleted or expired, or when a subtraction points at a "dead" addition. Since the table has a partial index on dead=false, SELECT on live rows works pretty fast.
My problem is with the performance of the worker that sets dead to true.
The flow would be:
1. Get an entry for each addition which indicates the amount added, subtracted and whether or not it's expired.
2. Filter away entries which are both not expired and have more addition than subtraction.
3. Update dead=true on every row where either the id or the referenceentryid is in the filtered set of entries.
WITH entries AS
(
SELECT
additions.id AS id,
SUM(subtractions.amount) AS subtraction,
additions.amount AS addition,
additions.expirydate <= now() AS expired
FROM
loyalty_ledger AS subtractions
INNER JOIN
loyalty_ledger AS additions
ON
additions.id = subtractions.referenceentryid
WHERE
subtractions.dead = FALSE
AND subtractions.referenceentryid IS NOT NULL
GROUP BY
subtractions.referenceentryid, additions.id
), dead_entries AS (
SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE
)
-- THE SLOW BIT:
SELECT
*
FROM
loyalty_ledger AS ledger
WHERE
ledger.dead = FALSE AND
(ledger.id IN (SELECT id FROM dead_entries) OR ledger.referenceentryid IN (SELECT id FROM dead_entries));
In the query above the inner part runs pretty fast (a few seconds), while the last part runs forever.
The table and its indexes are defined as follows:
CREATE TABLE IF NOT EXISTS loyalty_ledger (
id SERIAL PRIMARY KEY,
programid bigint NOT NULL,
FOREIGN KEY (programid) REFERENCES loyalty_programs(id) ON DELETE CASCADE,
referenceentryid bigint,
FOREIGN KEY (referenceentryid) REFERENCES loyalty_ledger(id) ON DELETE CASCADE,
customerprofileid bigint NOT NULL,
FOREIGN KEY (customerprofileid) REFERENCES customer_profiles(id) ON DELETE CASCADE,
amount int NOT NULL,
expirydate TIMESTAMPTZ,
dead boolean DEFAULT false,
expired boolean DEFAULT false
);
CREATE index loyalty_ledger_referenceentryid_idx ON loyalty_ledger (referenceentryid) WHERE dead = false;
CREATE index loyalty_ledger_customer_program_idx ON loyalty_ledger (customerprofileid, programid) WHERE dead = false;
I'm trying to optimise the last part of the query.
EXPLAIN gives me the following:
"Index Scan using loyalty_ledger_referenceentryid_idx on loyalty_ledger ledger (cost=103412.24..4976040812.22 rows=986583 width=67)"
" Filter: ((SubPlan 3) OR (SubPlan 4))"
" CTE entries"
" -> GroupAggregate (cost=1.47..97737.83 rows=252177 width=25)"
" Group Key: subtractions.referenceentryid, additions.id"
" -> Merge Join (cost=1.47..91390.72 rows=341928 width=28)"
" Merge Cond: (subtractions.referenceentryid = additions.id)"
" -> Index Scan using loyalty_ledger_referenceentryid_idx on loyalty_ledger subtractions (cost=0.43..22392.56 rows=341928 width=12)"
" Index Cond: (referenceentryid IS NOT NULL)"
" -> Index Scan using loyalty_ledger_pkey on loyalty_ledger additions (cost=0.43..80251.72 rows=1683086 width=16)"
" CTE dead_entries"
" -> CTE Scan on entries (cost=0.00..5673.98 rows=168118 width=4)"
" Filter: ((subtraction >= addition) OR expired)"
" SubPlan 3"
" -> CTE Scan on dead_entries (cost=0.00..3362.36 rows=168118 width=4)"
" SubPlan 4"
" -> CTE Scan on dead_entries dead_entries_1 (cost=0.00..3362.36 rows=168118 width=4)"
Seems like the last part of my query is very inefficient. Any ideas on how to speed it up?

For large datasets, I have found semi-joins to perform much better than IN lists:
select *
from
loyalty_ledger as ledger
WHERE
ledger.dead = FALSE AND (
exists (
select null
from dead_entries d
where d.id = ledger.id
) or
exists (
select null
from dead_entries d
where d.id = ledger.referenceentryid
)
)
I honestly don't know whether they are faster, but each of these alternative WHERE clauses would also be worth a try. They are less code and more intuitive, but there is no guarantee they will perform better:
ledger.dead = FALSE AND
exists (
select null
from dead_entries d
where d.id = ledger.id or d.id = ledger.referenceentryid
)
or
ledger.dead = FALSE AND
exists (
select null
from dead_entries d
where d.id in (ledger.id, ledger.referenceentryid)
)

What helped me in the end was to do the id IN filtering part in the second WITH step, replacing IN with ANY syntax:
WITH entries AS
(
SELECT
additions.id AS id,
additions.amount - coalesce(SUM(subtractions.amount),0) AS balance,
additions.expirydate <= now() AS passed_expiration
FROM
loyalty_ledger AS additions
LEFT JOIN
loyalty_ledger AS subtractions
ON
subtractions.dead = FALSE AND
additions.id = subtractions.referenceentryid
WHERE
additions.dead = FALSE AND additions.referenceentryid IS NULL
GROUP BY
subtractions.referenceentryid, additions.id
), dead_rows AS (
SELECT
l.id AS id,
-- only additions that still have usable points can expire
l.referenceentryid IS NULL AND e.balance > 0 AND e.passed_expiration AS expired
FROM
loyalty_ledger AS l
INNER JOIN
entries AS e
ON
(l.id = e.id OR l.referenceentryid = e.id)
WHERE
l.dead = FALSE AND
(e.balance <= 0 OR e.passed_expiration)
ORDER BY e.balance DESC
)
UPDATE
loyalty_ledger AS l
SET
(dead, expired) = (TRUE, d.expired)
FROM
dead_rows AS d
WHERE
l.id = d.id AND
l.dead = FALSE;

I also believe that this part of the query:
-- THE SLOW BIT:
SELECT
*
FROM
loyalty_ledger AS ledger
WHERE
ledger.dead = FALSE AND
(ledger.id IN (SELECT id FROM dead_entries) OR ledger.referenceentryid IN (SELECT id FROM dead_entries));
can be rewritten as a JOIN with UNION ALL, which will most likely produce a different execution plan and might be faster.
It is hard to verify for sure without the other table structures, though.
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT id FROM dead_entries) AS dead_entries
ON ledger.id = dead_entries.id AND ledger.dead = FALSE
UNION ALL
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT id FROM dead_entries) AS dead_entries
ON ledger.referenceentryid = dead_entries.id AND ledger.dead = FALSE
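Also, CTEs in PostgreSQL are materialized and not indexed (always before version 12, and in later versions by default whenever the CTE is referenced more than once, as dead_entries is here). You are most likely better off dropping dead_entries from the WITH clause and repeating its definition as a subquery: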
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE) AS dead_entries
ON ledger.id = dead_entries.id AND ledger.dead = FALSE
UNION ALL
SELECT
*
FROM
loyalty_ledger AS ledger
INNER JOIN (SELECT
id
FROM
entries
WHERE
subtraction >= addition OR expired = TRUE) AS dead_entries
ON ledger.referenceentryid = dead_entries.id AND ledger.dead = FALSE
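To check that the UNION ALL form returns the same rows as the original OR of two IN subqueries, here is a minimal sketch. Everything in it is an assumption made for illustration: it runs on SQLite through Python's sqlite3 module, the table is cut down to four columns, the rows are made up, and a plain dead_entries table stands in for the CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE loyalty_ledger (
    id INTEGER PRIMARY KEY,
    referenceentryid INTEGER,
    amount INTEGER NOT NULL,
    dead BOOLEAN DEFAULT 0
);
-- ids 1 and 2 are additions; 3, 4 and 5 subtract from them
INSERT INTO loyalty_ledger (id, referenceentryid, amount) VALUES
    (1, NULL, 10), (2, NULL, 10), (3, 1, 10), (4, 2, 4), (5, 2, 3);
-- stand-in for the dead_entries CTE: addition 1 is fully spent
CREATE TABLE dead_entries (id INTEGER PRIMARY KEY);
INSERT INTO dead_entries VALUES (1);
""")

in_list = conn.execute("""
    SELECT id FROM loyalty_ledger AS ledger
    WHERE ledger.dead = 0
      AND (ledger.id IN (SELECT id FROM dead_entries)
           OR ledger.referenceentryid IN (SELECT id FROM dead_entries))
""").fetchall()

union_all = conn.execute("""
    SELECT ledger.id FROM loyalty_ledger AS ledger
    JOIN dead_entries AS d ON ledger.id = d.id AND ledger.dead = 0
    UNION ALL
    SELECT ledger.id FROM loyalty_ledger AS ledger
    JOIN dead_entries AS d ON ledger.referenceentryid = d.id AND ledger.dead = 0
""").fetchall()

print(sorted(in_list), sorted(union_all))
```

The two forms agree only as long as no row satisfies both conditions. That holds here because additions have NULL in referenceentryid; if a row could match both branches, UNION ALL would return it twice and UNION would be the safer choice.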

Why can't PostgreSQL do this simple FULL JOIN?

Here's a minimal setup with 2 tables a and b each with 3 rows:
CREATE TABLE a (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON a (value);
CREATE TABLE b (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON b (value);
INSERT INTO a (value) VALUES ('x'), ('y'), (NULL);
INSERT INTO b (value) VALUES ('y'), ('z'), (NULL);
Here is a LEFT JOIN that works fine as expected:
SELECT * FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value;
with output:
id | value | id | value
----+-------+----+-------
1 | x | |
2 | y | 1 | y
3 | | 3 |
(3 rows)
Changing "LEFT JOIN" to "FULL JOIN" gives an error:
SELECT * FROM a
FULL JOIN b ON a.value IS NOT DISTINCT FROM b.value;
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
Can someone please answer:
What is a "merge-joinable or hash-joinable join condition" and why joining on a.value IS NOT DISTINCT FROM b.value doesn't fulfill this condition, but a.value = b.value is perfectly fine?
It seems that the only difference is how NULL values are handled. Since the value column is indexed in both tables, EXPLAIN shows that a NULL lookup is just as efficient as a lookup of a non-NULL value:
EXPLAIN SELECT * FROM a WHERE value = 'x';
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.67 rows=6 width=36)
Recheck Cond: (value = 'x'::text)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value = 'x'::text)
EXPLAIN SELECT * FROM a WHERE value ISNULL;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.65 rows=6 width=36)
Recheck Cond: (value IS NULL)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value IS NULL)
This has been tested with PostgreSQL 9.6.3 and 10beta1.
There has been discussion about this issue, but it doesn't directly answer the above question.

PostgreSQL implements FULL OUTER JOIN with either a hash or a merge join.
To be eligible for such a join, the join condition has to have the form
<expression using only left table> <operator> <expression using only right table>
Now your join condition does look like this, but PostgreSQL does not have a special IS NOT DISTINCT FROM operator, so it parses your condition into:
(NOT ($1 IS DISTINCT FROM $2))
And such an expression cannot be used for hash or merge joins, hence the error message.
I can think of a way to work around it:
SELECT a_id, NULLIF(a_value, '<null>'),
b_id, NULLIF(b_value, '<null>')
FROM (SELECT id AS a_id,
COALESCE(value, '<null>') AS a_value
FROM a
) x
FULL JOIN
(SELECT id AS b_id,
COALESCE(value, '<null>') AS b_value
FROM b
) y
ON x.a_value = y.b_value;
That works if <null> does not appear anywhere in the value columns.
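A different workaround, not taken from the answer above, avoids the sentinel value by combining a LEFT JOIN with the unmatched rows of the other table. The sketch below is a hedged illustration that runs on SQLite through Python's sqlite3 module, where the null-safe comparison is spelled IS. In PostgreSQL the same shape would use IS NOT DISTINCT FROM, which the question already shows working in a LEFT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE b (id INTEGER PRIMARY KEY, value TEXT);
INSERT INTO a (value) VALUES ('x'), ('y'), (NULL);
INSERT INTO b (value) VALUES ('y'), ('z'), (NULL);
""")

rows = conn.execute("""
    SELECT a.id, a.value, b.id, b.value   -- every row of a, matched or not
    FROM a LEFT JOIN b ON a.value IS b.value
    UNION ALL
    SELECT a.id, a.value, b.id, b.value   -- plus rows of b without a match
    FROM b LEFT JOIN a ON a.value IS b.value
    WHERE a.id IS NULL
""").fetchall()

for row in rows:
    print(row)
```

The result has four rows: the unmatched 'x' from a, the matched 'y' pair, the matched NULL pair, and the unmatched 'z' from b, which is what the FULL JOIN was meant to return.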

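I just solved such a case by replacing the ON condition with "TRUE", and moving the original "ON" condition into a WHERE clause. I don't know the performance impact of this, though. Note that the WHERE clause then filters the joined rows, so rows without a match on the other side are dropped, whereas a true FULL JOIN would keep them.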

PostgreSQL: Order by name after a multiple-level hierarchy sort

I'm currently exporting queries from Oracle to PostgreSQL, and I am stuck on this one which is used to sort directories:
WITH RECURSIVE R AS (
SELECT ARRAY[ID] AS H
,ID
,PARENTID
,NAME
,1 AS level
FROM REPERTORIES
WHERE ID= (SELECT Min(ID) FROM REPERTORIES)
UNION ALL
SELECT R.H || A.ID
,A.ID
,A.PARENTID
,A.NAME
,R.level + 1
FROM REPERTORIES A
JOIN R ON A.PARENTID = R.ID
)
SELECT NAME
,ID
,PARENTID
, level
FROM R
ORDER BY H
It's partially working: each subdirectory is placed after its parent directory or after a directory sharing the same parent (a directory can have subdirectories, which can have subdirectories of their own, and so on).
But I also need to sort directories at the same level by their NAME (while, of course, still keeping their subdirectories right next to them).
How can I achieve this?
Thanks in advance (and sorry if my English is hard to understand)
EDIT: Here is the original Oracle query:
SELECT NAME, ID, PARENTID, level
FROM REPERTORIES
CONNECT BY PRIOR ID = PARENTID
START WITH ID = (SELECT Min(ID) FROM REPERTORIES)
ORDER SIBLINGS BY NAME

Similar to the way you construct h, construct an array that contains the path names and order by that.
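As a hedged sketch of why ordering by an array of names reproduces ORDER SIBLINGS BY, the plain-Python example below builds the name path for each directory and sorts by it. Python lists compare element by element, much like PostgreSQL arrays, so a parent sorts directly before its own subtree and siblings sort by name. The sample rows and the walk helper are made up for the illustration.

```python
# Hypothetical sample rows: (id, parent_id, name)
repertories = [
    (1, None, "root"),
    (2, 1, "zeta"),
    (3, 1, "alpha"),
    (4, 3, "beta"),
    (5, 2, "apple"),
]

# Index the rows by parent so that children can be looked up quickly.
children = {}
for row_id, parent_id, name in repertories:
    children.setdefault(parent_id, []).append((row_id, name))

def walk(row_id, name, name_path, level):
    """Yield (name_path, name, id, level) for a node and all its descendants."""
    name_path = name_path + [name]  # same idea as R.H || A.ID, but with names
    yield name_path, name, row_id, level
    for child_id, child_name in children.get(row_id, []):
        yield from walk(child_id, child_name, name_path, level + 1)

root_id, _, root_name = min(repertories)        # start with the smallest ID
rows = sorted(walk(root_id, root_name, [], 1))  # order by the array of names

for name_path, name, row_id, level in rows:
    print("  " * (level - 1) + name)
```

In the SQL query the equivalent change would be to carry a second array built from NAME alongside H and to use that array in the final ORDER BY. If sibling names can repeat, adding the ID as a tie-breaker keeps the order deterministic.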

The "order by H" is throwing me for a loop, however..
In general you need to do your sorting in the recursive part of the query to keep them "grouped" properly.
WITH RECURSIVE R AS (
SELECT ARRAY[ID] AS H
,ID
,PARENTID
,NAME
,1 AS level
FROM REPERTORIES
WHERE ID= (SELECT Min(ID) FROM REPERTORIES)
UNION ALL
SELECT R.H || A.ID
,A.ID
,A.PARENTID
,A.NAME
,R.level + 1
FROM REPERTORIES A
JOIN R ON A.PARENTID = R.ID
order by name
)
SELECT NAME
,ID
,PARENTID
, level
FROM R

How to design a SQL recursive query?

How would I redesign the query below so that it recursively walks the entire tree and returns all descendants from root to leaves? (I'm using SSMS 2008.) We have a President at the root. Under him are the VPs, then upper management, and so on down the line. I need to return the names and titles of each. The query shouldn't be hard-coded, though; I need to be able to run it for any selected employee, not just the president. The query below is the hard-coded approach.
select P.staff_name [Level1],
P.job_title [Level1 Title],
Q.license_number [License 1],
E.staff_name [Level2],
E.job_title [Level2 Title],
G.staff_name [Level3],
G.job_title [Level3 Title]
from staff_view A
left join staff_site_link_expanded_view P on P.people_id = A.people_id
left join staff_site_link_expanded_view E on E.people_id = C.people_id
left join staff_site_link_expanded_view G on G.people_id = F.people_id
left join facility_view Q on Q.group_profile_id = P.group_profile_id
Thank you, this most closely matched what I needed. Here is my CTE query:
with Employee_Hierarchy (staff_name, job_title, id_number, billing_staff_credentials_code, site_name, group_profile_id, license_number, region_description, people_id)
as
(
select C.staff_name, C.job_title, C.id_number, C.billing_staff_credentials_code, C.site_name, C.group_profile_id, Q.license_number, R.region_description, A.people_id
from staff_view A
left join staff_site_link_expanded_view C on C.people_id = A.people_id
left join facility_view Q on Q.group_profile_id = C.group_profile_id
left join regions R on R.regions_id = Q.regions_id
where A.last_name = 'kromer'
)
select C.staff_name, C.job_title, C.id_number, C.billing_staff_credentials_code, C.site_name, C.group_profile_id, Q.license_number, R.region_description, A.people_id
from staff_view A
left join staff_site_link_expanded_view C on C.people_id = A.people_id
left join facility_view Q on Q.group_profile_id = C.group_profile_id
left join regions R on R.regions_id = Q.regions_id
WHERE C.STAFF_NAME IS NOT NULL
GROUP BY C.STAFF_NAME, C.job_title, C.id_number, C.billing_staff_credentials_code, C.site_name, C.group_profile_id, Q.license_number, R.region_description, A.people_id
ORDER BY C.STAFF_NAME
But I am wondering what the purpose of "Employee_Hierarchy" is. When I replaced "staff_view" in the outer query with "Employee_Hierarchy", it returned only one record, "Kromer". So when and where can we use "Employee_Hierarchy"?

See:
SQL Server - Simple example of a recursive CTE
MSDN: Recursive Queries using Common Table Expression
SQL Server recursive CTE (this seems pretty much like exactly what you are working on!)
Update:
A proper recursive CTE basically consists of three things:
1. an anchor SELECT to begin with; it can select e.g. the root-level employees (where Reports_To is NULL), or any arbitrary employee that you define, e.g. by a parameter
2. a UNION ALL
3. a recursive SELECT statement that selects from the same, typically self-referencing table and joins with the recursive CTE currently being built up
This gives you the ability to recursively build up a result set that you can then select from.
If you look at the Northwind sample database, it has a table called Employees which is self-referencing: Employees.ReportsTo --> Employees.EmployeeID defines who reports to whom.
Your CTE would look something like this:
;WITH RecursiveCTE AS
(
-- anchor query; get the CEO
SELECT EmployeeID, FirstName, LastName, Title, 1 AS 'Level', ReportsTo
FROM dbo.Employees
WHERE ReportsTo IS NULL
UNION ALL
-- recursive part; select next Employees that have ReportsTo -> cte.EmployeeID
SELECT
e.EmployeeID, e.FirstName, e.LastName, e.Title,
cte.Level + 1 AS 'Level', e.ReportsTo
FROM
dbo.Employees e
INNER JOIN
RecursiveCTE cte ON e.ReportsTo = cte.EmployeeID
)
SELECT *
FROM RecursiveCTE
ORDER BY Level, LastName
I don't know if you can translate your sample to a proper recursive CTE - but that's basically the gist of it: anchor query, UNION ALL, recursive query
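To show the same anchor, UNION ALL, recursive step pattern starting from any selected employee rather than from the root, here is a minimal sketch. It is an illustration under assumptions: it runs on SQLite through Python's sqlite3 module, where the keyword is WITH RECURSIVE (SQL Server writes plain WITH), and the employees table, its columns, and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    name TEXT, title TEXT, reports_to INTEGER
);
INSERT INTO employees VALUES
    (1, 'Ann', 'President', NULL),
    (2, 'Bob', 'VP', 1),
    (3, 'Cid', 'VP', 1),
    (4, 'Dee', 'Manager', 2),
    (5, 'Eve', 'Analyst', 4);
""")

def subtree(start_id):
    """Return (name, title, level) for start_id and everyone below them."""
    return conn.execute("""
        WITH RECURSIVE cte(employee_id, name, title, level) AS (
            SELECT employee_id, name, title, 1
            FROM employees
            WHERE employee_id = ?          -- anchor: the selected employee
            UNION ALL
            SELECT e.employee_id, e.name, e.title, cte.level + 1
            FROM employees e
            JOIN cte ON e.reports_to = cte.employee_id  -- recursive step
        )
        SELECT name, title, level FROM cte ORDER BY level, name
    """, (start_id,)).fetchall()

print(subtree(2))
```

Passing the starting employee as a parameter to the anchor is what removes the hard-coding: subtree(1) returns the whole organisation, while subtree(2) returns only Bob and the people who report to him, directly or indirectly.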