PostgreSQL - Slow Count

I need to write a one-time query. It will be run once, and the data will then be moved to another system (AWS Personalize). It does not need to be fully optimized, but it has to be sped up enough that migrating the data is even possible.
Coming from MySQL, I thought this would not be a problem. But after a lot of reading, it seems the COUNT function is handled differently in PostgreSQL. With that said, this is the query, reduced in size. There are several other joins (removed from this example), but they do not present an issue, at least judging by the query plan.
explain
SELECT DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE,
'-1' AS EVENT_VALUE,
extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
INNER JOIN schedules sch ON p.id = sch.plan_id
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND (select Count(id) FROM schedules s WHERE s.plan_id = sch.plan_id AND s.status = 'DONE') = 1
The issue is here:
select Count(id) FROM schedules s WHERE s.plan_id = sch.plan_id AND s.status = 'DONE'
The id field in the schedules table is uuid.
I have tried lots of things, but they all end up the same or worse.
I have read somewhere that it is possible to use a row estimate in these cases, but I honestly have no idea how to do that here.
This is the query plan:
Unique (cost=0.99..25152516038.36 rows=100054 width=88)
-> Nested Loop (cost=0.99..25152515788.22 rows=100054 width=88)
-> Index Only Scan using idx_schedules_plan_id_done_date on schedules sch (cost=0.56..25152152785.84 rows=107641 width=16)
Filter: ((SubPlan 1) = 1)
SubPlan 1
-> Aggregate (cost=1168.28..1168.29 rows=1 width=8)
-> Bitmap Heap Scan on schedules s (cost=14.78..1168.13 rows=58 width=16)
Recheck Cond: (plan_id = sch.plan_id)
Filter: ((status)::text = 'DONE'::text)
-> Bitmap Index Scan on idx_schedules_plan_id_done_date (cost=0.00..14.77 rows=294 width=0)
Index Cond: (plan_id = sch.plan_id)
-> Index Scan using plans_pkey on plans p (cost=0.42..3.37 rows=1 width=24)
Index Cond: (id = sch.plan_id)
Filter: ((continuous IS NOT TRUE) AND ((status)::text = 'ENDED'::text))

You are not selecting any column from the schedules table, so it can be omitted from the main query and moved into an EXISTS() term.
DISTINCT is probably not needed, assuming id is a PK.
Maybe you don't need the COUNT() to be exactly one, but just > 0.
SELECT DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE
, '-1' AS EVENT_VALUE
, extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND EXISTS (
SELECT *
FROM schedules sch
WHERE p.id = sch.plan_id
)
AND EXISTS(
select *
FROM schedules s
WHERE s.plan_id = p.id
AND s.status = 'DONE' -- <<-- Must there be EXACTLY ONE schedules record?
) ;
Now you can see that the first EXISTS() is actually not needed: if the second one yields True, the first EXISTS() must yield True, too.
SELECT -- DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE
, '-1' AS EVENT_VALUE
, extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND EXISTS(
select *
FROM schedules s
WHERE s.plan_id = p.id
AND s.status = 'DONE'
) ;
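If there really must be exactly one DONE schedule per plan, the EXISTS() form above is not equivalent, because it also accepts plans with two or more DONE schedules. A minimal sketch that keeps the "exactly one" condition but computes the counts once per plan_id instead of once per joined row; the alias done_once is made up for illustration, and count(*) assumes schedules.id is never NULL:

```sql
SELECT 'plan_progress' AS EVENT_TYPE
     , '-1' AS EVENT_VALUE
     , extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
JOIN (
    -- One pass over schedules: keep only plans with exactly one DONE row.
    SELECT s.plan_id
    FROM schedules s
    WHERE s.status = 'DONE'
    GROUP BY s.plan_id
    HAVING count(*) = 1
) done_once ON done_once.plan_id = p.id
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE;
```

Because the derived table returns at most one row per plan_id, the join cannot duplicate plans, so DISTINCT ON is not needed here either (again assuming plans.id is the primary key).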

Related

Postgresql planner won't select index for 'NOT' queries

[Title updated to reflect updates in description]
I am running Postgresql 9.6
I have a complex query that isn't using the indexes that I expect, when I break it down to this small example I am lost as to why the index isn't being used.
These examples run on a table with 1 million records, and currently all records have the value 'COMPLETED' for column state. State is a text column and I have a btree index on it.
The following query uses my index as I'd expect:
explain analyze
SELECT * FROM(
SELECT
q.state = 'COMPLETED'::text AS completed_successfully
FROM request.request q
) a where NOT completed_successfully;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using request_state_index on request q (cost=0.43..88162.19 rows=11200 width=1) (actual time=200.554..200.554 rows=0 loops=1)
Filter: (state <> 'COMPLETED'::text)
Rows Removed by Filter: 1050005
Heap Fetches: 198150
Planning time: 0.272 ms
Execution time: 200.579 ms
(6 rows)
But if I add anything else to the select that references my table, then the planner chooses to do a sequential scan instead.
explain analyze
SELECT * FROM(
SELECT
q.state = 'COMPLETED'::text AS completed_successfully,
q.type
FROM request.request q
) a where NOT completed_successfully;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on request q (cost=0.00..234196.06 rows=11200 width=8) (actual time=407.713..407.713 rows=0 loops=1)
Filter: (state <> 'COMPLETED'::text)
Rows Removed by Filter: 1050005
Planning time: 0.113 ms
Execution time: 407.733 ms
(5 rows)
Even this simpler example has the same issue.
Uses Index:
SELECT
q.state
FROM request.request q
WHERE q.state = 'COMPLETED';
Doesn't use Index:
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state = 'COMPLETED';
[UPDATE]
I now understand (for this case) that the scan it uses there is an index-only scan, and that it stops using it because type isn't in the index. So the question is perhaps why it won't use the index in the 'NOT' case below:
When I use a different value that isn't in the table, it knows to use the index (which makes sense):
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state = 'CREATED';
But if I negate it, it doesn't:
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state != 'COMPLETED';
Why is my index not being used?
What can I do to ensure it gets used?
Most of the time, I expect nearly all the records in this table to be in one of several end states (matched with the IN operator). So when running my more complex query, I expect these records to be excluded early and quickly, before the more expensive part of the query.
[UPDATES]
It looks like 'NOT' is not a supported B-Tree operation. I'll need some kind of unique approach: https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BTREE
I tried adding the following partial indexes but they didn't seem to work:
CREATE INDEX request_incomplete_state_index ON request.request (state) WHERE state NOT IN('COMPLETED', 'FAILED', 'CANCELLED');
CREATE INDEX request_complete_state_index ON request.request (state) WHERE state IN('COMPLETED', 'FAILED', 'CANCELLED');
This partial index does work, but is not an ideal solution.
CREATE INDEX request_incomplete_state_exact_index ON request.request (state) WHERE state != 'COMPLETED';
explain analyze SELECT q.state, q.type FROM request.request q WHERE q.state != 'COMPLETED';
I also tried this expression index, which is not ideal either and also didn't work:
CREATE OR REPLACE FUNCTION request.request_is_done(in_state text)
RETURNS BOOLEAN
LANGUAGE sql
STABLE
AS $function$
SELECT in_state IN ('COMPLETED', 'FAILED', 'CANCELLED');
$function$
;
CREATE INDEX request_is_done_index ON request.request (request.request_is_done(state));
explain analyze select * from request.request q where NOT request.request_is_done(state);
Using a list (IN clause) of states with equality works. So I may have to rework my larger query so that it does not use NOT.
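One possible reason the NOT IN partial index above went unused: PostgreSQL only considers a partial index when it can prove that the query's WHERE clause implies the index predicate, and state != 'COMPLETED' does not imply state NOT IN ('COMPLETED', 'FAILED', 'CANCELLED'), since a FAILED row satisfies the first but not the second. A minimal sketch, reusing the index definition from above, in which the query repeats the index predicate verbatim; whether the planner actually chooses the index still depends on the table statistics:

```sql
-- Partial index containing only the rows that are not in an end state.
CREATE INDEX request_incomplete_state_index
    ON request.request (state)
    WHERE state NOT IN ('COMPLETED', 'FAILED', 'CANCELLED');

-- The WHERE clause is identical to the index predicate, so the planner
-- can prove that every row the query needs is in the partial index.
EXPLAIN ANALYZE
SELECT q.state, q.type
FROM request.request q
WHERE q.state NOT IN ('COMPLETED', 'FAILED', 'CANCELLED');
```

Note also that PostgreSQL requires functions used in an index expression to be marked IMMUTABLE, so the STABLE function request_is_done shown above cannot be indexed as declared.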

Why is "order by" on the primary key changing the query plan so that it ignores a useful index?

After investigating why a multi-column index doesn't help speed up a query when I was expecting it to, I realized that it's because of a simple ORDER BY clause.
I reduced the query to this simple form (first without the ORDER BY, then with it):
somedb=# explain select * from user_resource where resource_id = 943 and status = 2 limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..39.29 rows=10 width=44)
-> Index Scan using user_resource_resource_id_status on user_resource (cost=0.56..5422.22 rows=1400 width=44)
Index Cond: ((resource_id = 943) AND (status = 2))
(3 rows)
Time: 0.409 ms
somedb=# explain select * from user_resource where resource_id = 943 and status = 2 order by id desc limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1000.46..4984.60 rows=10 width=44)
-> Gather Merge (cost=1000.46..558780.31 rows=1400 width=44)
Workers Planned: 2
-> Parallel Index Scan Backward using idx_121518_primary on user_resource (cost=0.44..557618.69 rows=583 width=44)
Filter: ((resource_id = 943) AND (status = 2))
Once I add the ORDER BY, you can see the user_resource_resource_id_status key is not used anymore and the query becomes ~10 times slower.
Why is this? And is there a way to fix it? I would think sorting by a simple integer field shouldn't make an index useless. Thank you.
It is related to the limit.
You can run the query without the limit clause in a subquery with OFFSET 0, which prevents the subquery from being inlined, and then apply the ORDER BY and LIMIT outside it.
select * from (
select * from user_resource
where resource_id = 943
and status = 2
offset 0
) sub
order by id desc
limit 10;
It depends on how you created the index, for example with ASC, DESC, NULLS FIRST, and/or NULLS LAST.
See https://www.postgresql.org/docs/current/indexes-ordering.html, which explains how indexes work with ORDER BY.
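Another option worth trying, sketched here as an assumption rather than something tested against this table: extend the multi-column index with the sort column. With both equality columns fixed, the index entries for a given (resource_id, status) pair are already ordered by id, so a backward scan can return the top 10 without a separate sort. The index name is made up for illustration:

```sql
-- Equality columns first, then the ORDER BY column.
CREATE INDEX user_resource_resource_id_status_id
    ON user_resource (resource_id, status, id);

-- Check whether the planner now picks the new index for the ordered query.
EXPLAIN
SELECT *
FROM user_resource
WHERE resource_id = 943
  AND status = 2
ORDER BY id DESC
LIMIT 10;
```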

Postgresql should left join use WHERE or ON is enough?

When you join against a subquery, should you use WHERE inside it, or is 's on s.id = t.id' enough? I want to understand whether a subquery without WHERE selects all the rows and then filters them, or whether it selects only the rows that match the condition 'on add.id = table.id'.
SELECT * FROM table
left join (
select *
from add
/* where add.id = 1 */ -- do I need this?
group by add.id
) add on add.id = table.id
WHERE table.id = 1
As I understand from EXPLAIN:
Nested Loop Left Join (cost=2.95..13.00 rows=10 width=1026)
Join Filter: (add.id = table.id)
It loads all rows and then filters them. Is that bad?
I'm not sure if your example is too simple, but you shouldn't need a subquery at all for this one, and definitely not the group by.
Suppose you do need a subquery; then, for this specific example, it leads to exactly the same query plan whether you add the where clause or not. The idea of the query planner is that it tries to find a way to make your query as fast as possible. Oftentimes this means ordering the execution of joins and where clauses in such a way that the result set is reduced sooner rather than later. I wrote an equivalent query, only with reservations and customers; I hope that's okay.
EXPLAIN
SELECT *
FROM reservations
LEFT OUTER JOIN (
SELECT *
FROM customers
) AS customers ON customers.id = reservations.customer_id
WHERE customer_id = 1;
Nested Loop Left Join (cost=0.17..183.46 rows=92 width=483)
Join Filter: (customers.id = reservations.customer_id)
-> Index Scan using index_reservations_on_customer_id on reservations (cost=0.09..179.01 rows=92 width=255)
Index Cond: (customer_id = 1)
-> Materialize (cost=0.08..4.09 rows=1 width=228)
-> Index Scan using customers_pkey on customers (cost=0.08..4.09 rows=1 width=228)
Index Cond: (id = 1)
The deepest arrows are executed first. This means that even though I didn't have the equivalent of where add.id = 1 in my subquery, it still knew that the equality customers.id = customer_id = 1 should be true, so it decided to filter on customers.id = 1 before even attempting to join anything.
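For comparison, the subquery-free form mentioned at the start of this answer could look like the sketch below, using the same reservations and customers tables. Since the subquery in the example above is a plain SELECT *, one would expect the planner to arrive at the same plan, but that is worth confirming with EXPLAIN on your own data:

```sql
EXPLAIN
SELECT *
FROM reservations
LEFT OUTER JOIN customers ON customers.id = reservations.customer_id
WHERE reservations.customer_id = 1;
```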

Lock one table at update and another in subquery, which one will be locked first?

I have a query like this:
UPDATE table1 SET
col = 'some value'
WHERE id = X
RETURNING col1, (SELECT col2 FROM table2 WHERE id = table1.table2_id FOR UPDATE);
So, this query will lock both tables, table1 and table2, right? But which one will be locked first?
The execution plan for the query will probably look like this:
QUERY PLAN
-------------------------------------------------------------------------------------------
Update on laurenz.table1
Output: table1.col1, (SubPlan 1)
-> Index Scan using table1_pkey on laurenz.table1
Output: table1.id, table1.table2_id, 'some value'::text, table1.col1, table1.ctid
Index Cond: (table1.id = 42)
SubPlan 1
-> LockRows
Output: table2.col2, table2.ctid
-> Index Scan using table2_pkey on laurenz.table2
Output: table2.col2, table2.ctid
Index Cond: (table2.id = table1.table2_id)
That suggests that the row in table1 is locked first.
Looking into the code, I see that ExecUpdate first calls EvalPlanQual, where the updated tuple is locked, and only after that calls ExecProcessReturning where the RETURNING clause is processed.
So yes, the row in table1 is locked first.
So far, I have treated row locks, but there are also the ROW EXCLUSIVE locks on the tables themselves:
The tables are all locked in InitPlan in execMain.c, and it seems to me that again table1 will be locked before table2 here.

Why can't PostgreSQL do this simple FULL JOIN?

Here's a minimal setup with 2 tables a and b each with 3 rows:
CREATE TABLE a (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON a (value);
CREATE TABLE b (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON b (value);
INSERT INTO a (value) VALUES ('x'), ('y'), (NULL);
INSERT INTO b (value) VALUES ('y'), ('z'), (NULL);
Here is a LEFT JOIN that works fine as expected:
SELECT * FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value;
with output:
id | value | id | value
----+-------+----+-------
1 | x | |
2 | y | 1 | y
3 | | 3 |
(3 rows)
Changing "LEFT JOIN" to "FULL JOIN" gives an error:
SELECT * FROM a
FULL JOIN b ON a.value IS NOT DISTINCT FROM b.value;
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
Can someone please answer:
What is a "merge-joinable or hash-joinable join condition" and why joining on a.value IS NOT DISTINCT FROM b.value doesn't fulfill this condition, but a.value = b.value is perfectly fine?
It seems that the only difference is how NULL values are handled. Since the value column is indexed in both tables, running an EXPLAIN on a NULL lookup is just as efficient as looking up values that are non-NULL:
EXPLAIN SELECT * FROM a WHERE value = 'x';
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.67 rows=6 width=36)
Recheck Cond: (value = 'x'::text)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value = 'x'::text)
EXPLAIN SELECT * FROM a WHERE value ISNULL;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.65 rows=6 width=36)
Recheck Cond: (value IS NULL)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value IS NULL)
This has been tested with PostgreSQL 9.6.3 and 10beta1.
There has been discussion about this issue, but it doesn't directly answer the above question.
PostgreSQL implements FULL OUTER JOIN with either a hash or a merge join.
To be eligible for such a join, the join condition has to have the form
<expression using only left table> <operator> <expression using only right table>
Now your join condition does look like this, but PostgreSQL does not have a special IS NOT DISTINCT FROM operator, so it parses your condition into:
(NOT ($1 IS DISTINCT FROM $2))
And such an expression cannot be used for hash or merge joins, hence the error message.
I can think of a way to work around it:
SELECT a_id, NULLIF(a_value, '<null>'),
b_id, NULLIF(b_value, '<null>')
FROM (SELECT id AS a_id,
COALESCE(value, '<null>') AS a_value
FROM a
) x
FULL JOIN
(SELECT id AS b_id,
COALESCE(value, '<null>') AS b_value
FROM b
) y
ON x.a_value = y.b_value;
That works if <null> does not appear anywhere in the value columns.
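A sketch of another workaround that avoids the sentinel value, using the tables a and b from the question: take the LEFT JOIN that already works and append the rows of b that have no partner in a. The column aliases are made up for illustration:

```sql
-- All rows of a, with their matching rows of b (NULLs match NULLs).
SELECT a.id AS a_id, a.value AS a_value,
       b.id AS b_id, b.value AS b_value
FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value
UNION ALL
-- Rows of b that matched nothing in a, padded with NULLs on the a side.
SELECT NULL, NULL, b.id, b.value
FROM b
WHERE NOT EXISTS (
    SELECT 1
    FROM a
    WHERE a.value IS NOT DISTINCT FROM b.value
);
```

With the sample data above this should return the three rows of the LEFT JOIN plus one extra row for the 'z' value in b.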
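I just solved such a case by replacing the ON condition with "TRUE", and moving the original "ON" condition into a WHERE clause. I don't know the performance impact of this, though. Be aware that the WHERE clause is applied after the join, so rows that have no match on the other side are filtered out and the result is no longer that of a true FULL JOIN.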