PostgreSQL planner won't select index for 'NOT' queries - postgresql

[Title updated to reflect updates in description]
I am running PostgreSQL 9.6.
I have a complex query that isn't using the indexes I expect. Even after breaking it down to this small example, I am lost as to why the index isn't being used.
These examples run on a table with 1 million records, and currently all records have the value 'COMPLETED' for column state. State is a text column and I have a btree index on it.
The following query uses my index as I'd expect:
explain analyze
SELECT * FROM(
SELECT
q.state = 'COMPLETED'::text AS completed_successfully
FROM request.request q
) a where NOT completed_successfully;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using request_state_index on request q (cost=0.43..88162.19 rows=11200 width=1) (actual time=200.554..200.554 rows=0 loops=1)
Filter: (state <> 'COMPLETED'::text)
Rows Removed by Filter: 1050005
Heap Fetches: 198150
Planning time: 0.272 ms
Execution time: 200.579 ms
(6 rows)
But if I add anything else to the select that references my table, then the planner chooses to do a sequential scan instead.
explain analyze
SELECT * FROM(
SELECT
q.state = 'COMPLETED'::text AS completed_successfully,
q.type
FROM request.request q
) a where NOT completed_successfully;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on request q (cost=0.00..234196.06 rows=11200 width=8) (actual time=407.713..407.713 rows=0 loops=1)
Filter: (state <> 'COMPLETED'::text)
Rows Removed by Filter: 1050005
Planning time: 0.113 ms
Execution time: 407.733 ms
(5 rows)
Even this simpler example has the same issue.
Uses Index:
SELECT
q.state
FROM request.request q
WHERE q.state = 'COMPLETED';
Doesn't use Index:
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state = 'COMPLETED';
[UPDATE]
I now understand (for this case) that the scan used there is an index-only scan, and that the planner stops using it here because type is not in the index. So the question is perhaps why it won't use the index in the 'NOT' case below.
When I use a different value that isn't in the table, it knows to use the index (which makes sense):
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state = 'CREATED';
But if I negate it, it doesn't:
SELECT
q.state,
q.type
FROM request.request q
WHERE q.state != 'COMPLETED';
Why is my index not being used?
What can I do to ensure it gets used?
Most of the time, I expect nearly all the records in this table to be in one of many end states (matched with the IN operator). So when running my more complex query, I expect these records to be excluded early and quickly, before the more expensive part of the query runs.
[UPDATES]
It looks like 'NOT' is not a supported B-Tree operation. I'll need some kind of unique approach: https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-BTREE
I tried adding the following partial indexes but they didn't seem to work:
CREATE INDEX request_incomplete_state_index ON request.request (state) WHERE state NOT IN('COMPLETED', 'FAILED', 'CANCELLED');
CREATE INDEX request_complete_state_index ON request.request (state) WHERE state IN('COMPLETED', 'FAILED', 'CANCELLED');
This partial index does work, but is not an ideal solution.
CREATE INDEX request_incomplete_state_exact_index ON request.request (state) WHERE state != 'COMPLETED';
explain analyze SELECT q.state, q.type FROM request.request q WHERE q.state != 'COMPLETED';
I also tried this expression index. It is also not ideal, and it didn't work either:
CREATE OR REPLACE FUNCTION request.request_is_done(in_state text)
RETURNS BOOLEAN
LANGUAGE sql
STABLE
AS $function$
SELECT in_state IN ('COMPLETED', 'FAILED', 'CANCELLED');
$function$
;
CREATE INDEX request_is_done_index ON request.request (request.request_is_done(state));
explain analyze select * from request.request q where NOT request.request_is_done(state);
Using a list (IN clause) of states with equals works. So I may have to rework my larger query so that it does not use NOT.
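A minimal sketch of that IN-list approach, assuming the open (non-terminal) states can be enumerated. 'CREATED' appears earlier in the question; 'RUNNING' is a hypothetical state name used only for illustration. Whether the planner actually chooses request_state_index still depends on the table statistics.

```sql
-- Positive form: list the states that are still open instead of negating
-- the terminal ones, so a b-tree index can search for each listed value.
-- 'RUNNING' is a hypothetical state name; substitute the real open states.
EXPLAIN ANALYZE
SELECT q.state, q.type
FROM request.request q
WHERE q.state IN ('CREATED', 'RUNNING');
```

The trade-off is that the list has to be kept in sync whenever a new state is introduced.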

Related

SQL query lost performance after adding filter with subselect

I have a function in PostgreSQL that completely lost its performance after I added another filter to it.
Here is a simple example of how it looked like at first with good performance.
CREATE OR REPLACE FUNCTION my_function(param_a boolean, param_b boolean )
RETURNS TABLE(blablabla)
LANGUAGE sql
IMMUTABLE
AS $function$
with data as (
select id,amount,account_nr from transfer
)
select * from
data d
where param_a or 0.00 <> (select sum(d2.amount)
from data d2
where d2.id = d.id)
$function$
;
(cost=0.25..10.25 rows=1000 width=560)
(actual time=1162.528..1162.561 rows=306 loops=1)
Buffers: shared hit=1099180
Planning time: 2.928 ms
Execution time: 1162.630 ms
After I added another filter with a subselect and a count, I lost my performance. Is this count so bad for performance, and can I solve it another way?
CREATE OR REPLACE FUNCTION my_function(param_a boolean, param_b boolean )
RETURNS TABLE(blablabla)
LANGUAGE sql
IMMUTABLE
AS $function$
with data as (
select id,amount,account_nr from transfer
)
select * from
data d
where (param_b or 1 < (select count(d2.account_nr)
from data d2
where d2.id = d.id
group by d2.account_nr))
and (param_a or 0.00 <> (select sum(d2.amount)
from data d2
where d2.id = d.id))
$function$
;
(cost=0.25..10.25 rows=1000 width=560)
(actual time=271191.341..271191.383 rows=306 loops=1)
Buffers: shared hit=1099180
Planning time: 2.955 ms
Execution time: 271191.463 ms
Your slow query, embedded in your stored function, is this:
with data as ( -- original query from the question.
select id,amount,account_nr from transfer
)
select *
from data d
where (param_b or 1 < (select count(d2.account_nr)
from data d2
where d2.id = d.id
group by d2.account_nr)
)
and (param_a or 0.00 <> (select sum(d2.amount)
from data d2
where d2.id = d.id)
)
This has a pointless common table expression. We can get rid of it for simplicity's sake. You can always put it back if you need it for some other purpose.
It also has a couple of correlated subqueries. Let's refactor them into a single independent subquery, starting with that independent subquery:
select id,
count(account_nr) account_count,
sum(amount) total_amount
from transfer
group by id
This aggregate subquery generates the number of accounts and the total amount for each id in your transfer table. Eyeball the results to convince yourself it does what you need it to do.
Then we can join this to the main query and apply your WHERE conditions.
select d.id, d.amount, d.account_nr
from transfer d
join (
select id,
count(account_nr) account_count,
sum(amount) total_amount
from transfer
group by id
) d2 ON d.id = d2.id
where (param_b or 1 < d2.account_count)
and (param_a or 0.00 <> d2.total_amount)
Using the independent subquery can speed things up a lot; sometimes the query planner decides it needs to re-evaluate the dependent subquery many times.
The following index will help the subquery run faster (the INCLUDE clause requires PostgreSQL 11 or later).
CREATE INDEX id_details ON transfer (id) INCLUDE (account_nr, amount);
Convince yourself this works and is fast enough. (I did not debug it, because I don't have your data.) You'll need to test it substituting true and false for param_a and param_b.
Then, and only then, put it into your stored function.
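One way to run the test described above is to paste the query into EXPLAIN with literal values in place of the parameters. This is only a sketch: it substitutes false for both param_a and param_b so that both filters are exercised, and it has not been run against your data.

```sql
-- Literal false stands in for param_a and param_b, so both filters apply.
-- Repeat with true in either position to cover the other combinations.
EXPLAIN (ANALYZE, BUFFERS)
SELECT d.id, d.amount, d.account_nr
FROM transfer d
JOIN (
    SELECT id,
           count(account_nr) AS account_count,
           sum(amount)       AS total_amount
    FROM transfer
    GROUP BY id
) d2 ON d.id = d2.id
WHERE (false OR 1 < d2.account_count)
  AND (false OR 0.00 <> d2.total_amount);
```

Comparing the reported execution time and buffer counts with the figures from the original function shows whether the rewrite helps.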

PostgreSQL - Slow Count

I need to write a one-time query. It will be run once, and the data will then be moved to another system (AWS Personalize). It certainly does not need to be fully optimized, but it must at least be sped up enough that the migration of the data is possible.
Coming from MySQL, I thought it would not be a problem. But after reading a lot, it seems the COUNT function is handled differently in PostgreSQL. With that said, this is the query, reduced in size. There are several other joins (removed from this example), but they do not present an issue, at least judging by the QUERY PLAN.
explain
SELECT DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE,
'-1' AS EVENT_VALUE,
extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
INNER JOIN schedules sch ON p.id = sch.plan_id
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND (select Count(id) FROM schedules s WHERE s.plan_id = sch.plan_id AND s.status = 'DONE') = 1
The issue is here:
select Count(id) FROM schedules s WHERE s.plan_id = sch.plan_id AND s.status = 'DONE'
The id field in the schedules table is uuid.
I have tried lots of things, but they all end up the same. Same or worse.
I have read somewhere it is possible to use row estimate in these cases, but I have honestly no idea how to do that in this case.
This is the query plan:
Unique (cost=0.99..25152516038.36 rows=100054 width=88)
-> Nested Loop (cost=0.99..25152515788.22 rows=100054 width=88)
-> Index Only Scan using idx_schedules_plan_id_done_date on schedules sch (cost=0.56..25152152785.84 rows=107641 width=16)
Filter: ((SubPlan 1) = 1)
SubPlan 1
-> Aggregate (cost=1168.28..1168.29 rows=1 width=8)
-> Bitmap Heap Scan on schedules s (cost=14.78..1168.13 rows=58 width=16)
Recheck Cond: (plan_id = sch.plan_id)
Filter: ((status)::text = 'DONE'::text)
-> Bitmap Index Scan on idx_schedules_plan_id_done_date (cost=0.00..14.77 rows=294 width=0)
Index Cond: (plan_id = sch.plan_id)
-> Index Scan using plans_pkey on plans p (cost=0.42..3.37 rows=1 width=24)
Index Cond: (id = sch.plan_id)
Filter: ((continuous IS NOT TRUE) AND ((status)::text = 'ENDED'::text))
You are not selecting any column from the schedules table, so it can be omitted from the main query and put into an EXISTS() term.
DISTINCT is probably not needed, assuming id is a PK.
Maybe you don't need the COUNT() to be exactly one, but just > 0.
SELECT DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE
, '-1' AS EVENT_VALUE
, extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND EXISTS (
SELECT *
FROM schedules sch
WHERE p.id = sch.plan_id
)
AND EXISTS(
select *
FROM schedules s
WHERE s.plan_id = p.id
AND s.status = 'DONE' -- <<-- Must there be EXACTLY ONE schedules record?
) ;
Now you can see that the first EXISTS() is actually not needed: if the second one yields True, the first EXISTS() must yield True, too
SELECT -- DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE
, '-1' AS EVENT_VALUE
, extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND EXISTS(
select *
FROM schedules s
WHERE s.plan_id = p.id
AND s.status = 'DONE'
) ;
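If there really must be exactly one DONE schedule per plan, as the original count(...) = 1 condition requires, EXISTS is not equivalent. A sketch of one alternative is to aggregate schedules once and join the result. It assumes plans.id is unique and that schedules.id is never NULL, so count(*) matches the original Count(id). It has not been tested against the real data.

```sql
-- Aggregate once over schedules instead of running a correlated count per row.
-- The derived table yields at most one row per plan_id, so DISTINCT ON
-- should not be needed (assuming plans.id is the primary key).
SELECT 'plan_progress' AS EVENT_TYPE
     , '-1' AS EVENT_VALUE
     , extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
JOIN (
    SELECT s.plan_id
    FROM schedules s
    WHERE s.status = 'DONE'
    GROUP BY s.plan_id
    HAVING count(*) = 1   -- exactly one DONE schedule
) done ON done.plan_id = p.id
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE;
```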

Lock one table at update and another in subquery, which one will be locked first?

I have a query like this:
UPDATE table1 SET
col = 'some value'
WHERE id = X
RETURNING col1, (SELECT col2 FROM table2 WHERE id = table1.table2_id FOR UPDATE);
So, this query will lock both tables, table1 and table2, right? But which one will be locked first?
The execution plan for the query will probably look like this:
QUERY PLAN
-------------------------------------------------------------------------------------------
Update on laurenz.table1
Output: table1.col1, (SubPlan 1)
-> Index Scan using table1_pkey on laurenz.table1
Output: table1.id, table1.table2_id, 'some value'::text, table1.col1, table1.ctid
Index Cond: (table1.id = 42)
SubPlan 1
-> LockRows
Output: table2.col2, table2.ctid
-> Index Scan using table2_pkey on laurenz.table2
Output: table2.col2, table2.ctid
Index Cond: (table2.id = table1.table2_id)
That suggests that the row in table1 is locked first.
Looking into the code, I see that ExecUpdate first calls EvalPlanQual, where the updated tuple is locked, and only after that calls ExecProcessReturning where the RETURNING clause is processed.
So yes, the row in table1 is locked first.
So far, I have treated row locks, but there are also the ROW EXCLUSIVE locks on the tables themselves:
The tables are all locked in InitPlan in execMain.c, and it seems to me that again table1 will be locked before table2 here.

Simple WHERE EXISTS ... ORDER BY... query very slow in PostgreSQL

I have this very simple query, generated by my ORM (Entity Framework Core):
SELECT *
FROM "table1" AS "t1"
WHERE EXISTS (
SELECT 1
FROM "table2" AS "t2"
WHERE ("t2"."is_active" = TRUE) AND ("t1"."table2_id" = "t2"."id"))
ORDER BY "t1"."table2_id"
There are 2 "is_active" records. The other involved columns ("id") are the primary keys. Query returns exactly 4 rows.
Table 1 is 96 million records.
Table 2 is 30 million records.
The 3 columns involved in this query are indexed (is_active, id, table2_id).
The C#/LINQ code that generates this simple query is: Table2.Where(t => t.IsActive).Include(t => t.Table1).ToList();
SET STATISTICS 10000 was set to all of the 3 columns.
VACUUM FULL ANALYZE was run on both tables.
WITHOUT the ORDER BY clause, the query returns within a few milliseconds, and I’d expect nothing else for 4 records to return. EXPLAIN output:
Nested Loop (cost=1.13..13.42 rows=103961024 width=121)
-> Index Scan using table2_is_active_idx on table2 (cost=0.56..4.58 rows=1 width=8)
Index Cond: (is_active = true)
Filter: is_active
-> Index Scan using table1_table2_id_fkey on table1 t1 (cost=0.57..8.74 rows=10 width=121)
Index Cond: (table2_id = table1.id)
WITH the ORDER BY clause, the query takes 5 minutes to complete! EXPLAIN output:
Merge Semi Join (cost=10.95..4822984.67 rows=103961040 width=121)
Merge Cond: (t1.table2_id = t2.id)
-> Index Scan using table1_table2_id_fkey on table1 t1 (cost=0.57..4563070.61 rows=103961040 width=121)
-> Sort (cost=4.59..4.59 rows=2 width=8)
Sort Key: t2.id
-> Index Scan using table2_is_active_idx on table2 a (cost=0.56..4.58 rows=2 width=8)
Index Cond: (is_active = true)
Filter: is_active
The inner, first index scan should return no more than 2 rows. Then the outer, second index scan doesn't make any sense with its cost of 4563070 and 103961040 rows. It only has to match 2 rows in table2 with 4 rows in table1!
This is a very simple query with very few records to return. Why is Postgres failing to execute it properly?
OK, I solved my problem in the most unexpected way. I upgraded PostgreSQL from 9.6.1 to 9.6.3, and that was it. After restarting the service, the explain plan looked good and the query ran just fine. I did not change anything: no new index, nothing. The only explanation I can think of is that there was a query planner bug in 9.6.1 that was fixed by 9.6.3. Thank you all for your answers!
Add an index:
CREATE INDEX _index
ON table2
USING btree (id)
WHERE is_active IS TRUE;
And rewrite the query like this:
SELECT table1.*
FROM table2
INNER JOIN table1 ON (table1.table2_id = table2.id)
WHERE table2.is_active IS TRUE
ORDER BY table2.id
Take into account that PostgreSQL processes "is_active IS TRUE" and "is_active = TRUE" in different ways, so the expression in the index predicate and the one in the query must match.
If you can't rewrite the query, try adding this index:
CREATE INDEX _index
ON table2
USING btree (id)
WHERE is_active = TRUE;
Your guess is right, there is a bug in Postgres 9.6.1 that fits your use case exactly. And upgrading was the right thing to do. Upgrading to the latest point-release is always the right thing to do.
Quoting the release notes for Postgres 9.6.2:
Fix foreign-key-based join selectivity estimation for semi-joins and
anti-joins, as well as inheritance cases (Tom Lane)
The new code for taking the existence of a foreign key relationship
into account did the wrong thing in these cases, making the estimates
worse not better than the pre-9.6 code.
You should still create that partial index like Dima advised. But keep it simple:
is_active = TRUE and is_active IS TRUE subtly differ in that the second returns FALSE instead of NULL for NULL input. But none of this matters in a WHERE clause where only TRUE qualifies. And both expressions are just noise. In Postgres you can use boolean values directly:
CREATE INDEX t2_id_idx ON table2 (id) WHERE is_active; -- that's all
And do not rewrite your query with a LEFT JOIN. This would add rows consisting of NULL values to the result for "active" rows in table2 without any siblings in table1. To match your current logic it would have to be an [INNER] JOIN:
SELECT t1.*
FROM table2 t2
JOIN table1 t1 ON t1.table2_id = t2.id -- and no parentheses needed
WHERE t2.is_active -- that's all
ORDER BY t1.table2_id;
But there is no need to rewrite your query that way at all. The EXISTS semi-join you have is just as good. It results in the same query plan once you have the partial index.
SELECT *
FROM table1 t1
WHERE EXISTS (
SELECT 1 FROM table2
WHERE is_active -- that's all
AND id = t1.table2_id
)
ORDER BY table2_id;
BTW, since you fixed the bug by upgrading and once you have created that partial index (and run ANALYZE or VACUUM ANALYZE on the table at least once - or autovacuum did that for you), you will never again get a bad query plan for this, since Postgres maintains separate estimates for the partial index, which are unambiguous for your numbers. Details:
Get count estimates from pg_class.reltuples for given conditions
Index that is not used, yet influences query
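The sequence described in the answer above (create the partial index, refresh statistics, then check the plan) can be sketched as follows. The index name t2_id_idx and the queries come from the answer; the resulting plan is not shown because it depends on the data and the server version.

```sql
-- 1. Partial index covering only the active rows of table2.
CREATE INDEX t2_id_idx ON table2 (id) WHERE is_active;

-- 2. Refresh planner statistics so the new index gets its own estimates.
ANALYZE table2;

-- 3. Check which plan the original EXISTS query gets now.
EXPLAIN
SELECT *
FROM table1 t1
WHERE EXISTS (
    SELECT 1 FROM table2
    WHERE is_active
    AND id = t1.table2_id
)
ORDER BY table2_id;
```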

Postgres doesn't use index

I am using postgres 9.5 on linux7. Here is the environment:
create table t1(c1 int primary key, c2 varchar(100));
Insert some rows into the newly created table:
do $$
begin
for i in 1..12000000 loop
insert into t1 values(i,to_char(i,'9999999'));
end loop;
end $$;
Now I want to update the c2 column where c1 equals a random value (EXPLAIN shows that the index is not used).
explain update t1 set c2=to_char(4,'9999999') where c1=cast(floor(random()*100000) as int);
QUERY PLAN
----------------------------------------------------------------------------------
Update on t1 (cost=10000000000.00..10000000017.20 rows=1 width=10)
-> Seq Scan on t1 (cost=10000000000.00..10000000017.20 rows=1 width=10)
Filter: (c1 = (floor((random() * '100000'::double precision)))::integer)
(3 rows)
Now, if I replace "cast(floor(random()*100000) as int)" with a number (any number) index is used:
explain update t1 set c2=to_char(4,'9999999') where c1=12345;
QUERY PLAN
-------------------------------------------------------------------------
Update on t1 (cost=0.15..8.17 rows=1 width=10)
-> Index Scan using t1_pkey on t1 (cost=0.15..8.17 rows=1 width=10)
Index Cond: (c1 = 12345)
(3 rows)
Questions are:
Why doesn't Postgres use the index in the first case (when random() is used)?
How can I force Postgres to use the index?
This is because random() is a volatile function (see PostgreSQL CREATE FUNCTION), which means it must be (re)evaluated for each row.
So you actually aren't updating one random row each time (as I understand you wanted) but a random number of rows: those rows whose own randomly generated number happens to match their id. That number can easily be 0.
See it using a lower range for the random generated number:
test=# select * from t1 where c1=cast(floor(random()*10) as int);
c1 | c2
----+----
(0 rows)
test=# select * from t1 where c1=cast(floor(random()*10) as int);
c1 | c2
----+----------
3 | 3
(1 row)
test=# select * from t1 where c1=cast(floor(random()*10) as int);
c1 | c2
----+----------
4 | 4
9 | 9
(2 rows)
test=# select * from t1 where c1=cast(floor(random()*10) as int);
c1 | c2
----+----------
5 | 5
8 | 8
(2 rows)
If you want to retrieve only one random row, you first need to generate a single random id to compare against the row id.
HINT: You can think of the database planner as dumb: it always performs a sequential scan over all rows and evaluates the condition expressions once per row.
Under the hood, however, the planner is much smarter: if it knows that the expression yields the same result every time it is evaluated (within the same statement), it evaluates it once and performs an index scan.
A tricky (but dirty) solution could be to create your own random_stable() function, declaring it as stable even though it returns a randomly generated number.
...This will keep your query as simple as it is now. But I think it is a dirty solution because it hides the fact that the function is, in fact, volatile.
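For illustration only, a sketch of what such a wrapper might look like. The name random_stable() comes from the text above; the definition is an assumption, not code from the original answer. PostgreSQL does not verify the volatility label, so the STABLE declaration here is deliberately inaccurate.

```sql
-- Hypothetical helper: wraps random() but is (wrongly) declared STABLE,
-- so the planner may evaluate it once and use the result as an index condition.
CREATE FUNCTION random_stable() RETURNS double precision
LANGUAGE sql STABLE
AS $$ SELECT random() $$;

-- Same statement as in the question, with random_stable() swapped in.
EXPLAIN
UPDATE t1
SET c2 = to_char(4, '9999999')
WHERE c1 = cast(floor(random_stable() * 100000) AS int);
```

Because the label misstates what the function does, any other query that uses this function can behave in surprising ways, which is why the text treats it as a workaround rather than a fix.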
A better solution (the right one for me) is to write the query in a form that really generates the number a single time.
For example:
test=# with foo as (select floor(random()*1000000)::int as bar) select * from t1 join foo on (t1.c1 = foo.bar);
c1 | c2 | bar
-----+----------+-----
929 | 929 | 929
(1 row)
...or a subquery solution like the one @a_horse_with_no_name provides.
NOTE: I used select queries instead of update ones for simplicity and readability, but the case is the same: simply use the same where clause (with the subquery approach, of course; using WITH would be a little more tricky...). Then, to check that the index is used, you only need to prepend "explain", as you know.
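A sketch of how the WITH form could be carried over to the UPDATE from the question, using UPDATE ... FROM to join against the single generated value. The names foo and bar are taken from the select example above; the plan is not shown because it has not been verified on the asker's version.

```sql
-- The CTE produces exactly one row holding one random number;
-- the UPDATE then joins t1 against that single value.
EXPLAIN
WITH foo AS (
    SELECT floor(random() * 100000)::int AS bar
)
UPDATE t1
SET c2 = to_char(4, '9999999')
FROM foo
WHERE t1.c1 = foo.bar;
```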
Not sure why the index isn't used, maybe because of the definition of the random() function. If you use a sub-select for calling the function, then (at least for me with 9.5.3) Postgres uses the index:
explain
update t1
set c2=to_char(4,'9999999')
where c1= (select cast(floor(random()*100000) as int));
returns:
Update on t1 (cost=0.44..3.45 rows=1 width=10)
InitPlan 1 (returns $0)
-> Result (cost=0.00..0.01 rows=1 width=0)
-> Index Scan using t1_pkey on t1 (cost=0.43..3.44 rows=1 width=10)
Index Cond: (c1 = $0)