Why PostgreSQL doesn't use indexes on "WHERE NOT IN" conditions

I have two tables, db100 and db60, with the same fields: x, y, z.
Indexes are created for both tables on field z like this:
CREATE INDEX db100_z_idx
ON db100
USING btree
(z COLLATE pg_catalog."default");
CREATE INDEX db60_z_idx
ON db60
USING btree
(z COLLATE pg_catalog."default");
Trying to find z values from db60 that don't exist in db100:
select db60.z from db60 where db60.z not in (select db100.z from db100)
As far as I understand, all the information required to execute the query is present in the indexes, so I would expect only the indexes to be used.
However, it uses sequential scans on the tables instead:
"Seq Scan on db60 (cost=0.00..25951290012.84 rows=291282 width=4)"
" Filter: (NOT (SubPlan 1))"
" SubPlan 1"
" -> Materialize (cost=0.00..80786.26 rows=3322884 width=4)"
" -> Seq Scan on db100 (cost=0.00..51190.84 rows=3322884 width=4)"
Can someone please explain why PostgreSQL doesn't use the indexes in this example?
Both tables contain a few million records, and execution takes a while.
I know that using a left join with an "is null" condition gives better results. However, the question is about this particular syntax.
I'm on PostgreSQL 9.5.

SubPlan 1 is for select db100.z from db100. You select all rows, so an index is useless. You really want select DISTINCT z from db100 here, and then the index should be used.
In the main query you have select db60.z from db60 where db60.z not in .... Again, you select all rows except those matching a condition, so the index does not apply: it would only help with the inverse condition.
In general, an index is only used if the planner thinks that such a use will speed up query processing. It always depends on how many distinct values there are and how the rows are distributed over the physical pages on disk. An index to search for all rows having a column with a certain value is not the same as finding the rows that do not have that value; the index indicates on which pages and at which locations to find the rows, but that set cannot simply be inverted.
Given - in your case - that z is some text type, a meaningful "negative" index cannot be constructed (this is actually almost a truism, although in some cases a "negative" index could be conceivable). You should look into trigram indexes, as these tend to work much faster than btree for text indexing.
Do you really want to extract all 291,282 rows, or could you perhaps use a DISTINCT clause here too? That should speed things up quite a bit.
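As an illustration of both suggestions, here is a minimal sketch (not from the original answer) of the DISTINCT variant and the usual anti-join rewrites, using the tables from the question; note that NOT EXISTS and the left join treat NULLs in z differently than NOT IN does:
-- NOT IN with a deduplicated subquery, as suggested above:
SELECT db60.z FROM db60 WHERE db60.z NOT IN (SELECT DISTINCT db100.z FROM db100);
-- Anti-join alternatives that typically plan better on large tables:
SELECT db60.z FROM db60 WHERE NOT EXISTS (SELECT 1 FROM db100 WHERE db100.z = db60.z);
SELECT db60.z FROM db60 LEFT JOIN db100 ON db100.z = db60.z WHERE db100.z IS NULL;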

Related

Postgres partial vs regular / full index on nullable column

I have a table with 1m records, with 100k records having null on colA. The remaining records have pretty distinct values. Is there a difference between creating a regular index on this column and a partial index with where colA is not null?
Since regular Postgres indexes do not store NULL values, wouldn't it be the same as creating a partial index with where colA is not null?
Any pros or cons with either index?
If you create a partial index that excludes nulls, Postgres will not use it to find nulls.
Here's a test with a full index on 13.5.
# create index idx_test_num on test(num);
CREATE INDEX
# explain select count(*) from test where num is null;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=5135.00..5135.01 rows=1 width=8)
   ->  Bitmap Heap Scan on test  (cost=63.05..5121.25 rows=5500 width=0)
         Recheck Cond: (num IS NULL)
         ->  Bitmap Index Scan on idx_test_num  (cost=0.00..61.68 rows=5500 width=0)
               Index Cond: (num IS NULL)
(5 rows)
And with a partial index.
# create index idx_test_num on test(num) where num is not null;
CREATE INDEX
# explain select count(*) from test where num is null;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=10458.12..10458.13 rows=1 width=8)
   ->  Gather  (cost=10457.90..10458.11 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=9457.90..9457.91 rows=1 width=8)
               ->  Parallel Seq Scan on test  (cost=0.00..9352.33 rows=42228 width=0)
                     Filter: (num IS NULL)
(6 rows)
Since regular postgres indexes do not store NULL values...
This has not been true since version 8.2 [checks notes] 16 years ago. The 8.2 docs say...
Indexes are not used for IS NULL clauses by default. The best way to use indexes in such cases is to create a partial index using an IS NULL predicate.
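To make that 8.2-era recommendation concrete, a minimal sketch reusing the test table from the example above: a partial index dedicated to IS NULL lookups. On current versions this is usually unnecessary, since a plain btree index on num can serve the same query directly.
# create index idx_test_num_null on test(num) where num is null;
# explain select count(*) from test where num is null;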
8.3 introduced NULLS FIRST and NULLS LAST and many other improvements around nulls, including...
Allow col IS NULL to use an index (Teodor)
It all depends.
NULL values are included in (default) B-tree indexes since Postgres 8.3, as Schwern noted. However, predicates like the one you mention (where colA is not null) are only properly supported since Postgres 9.0. The release notes:
Allow IS NOT NULL restrictions to use indexes (Tom Lane)
This is particularly useful for finding MAX()/MIN() values in
indexes that contain many null values.
GIN indexes followed later:
As of PostgreSQL 9.1, null key values can be included in the index.
Typically, a partial index makes sense if it excludes a major part of the table from the index, making it substantially smaller and saving writes to the index. Since B-tree indexes are so shallow, bare seek performance scales fantastically (once the index is cached). 10 % fewer index entries hardly matter in that area.
Your case would exclude only around 10% of all rows, and that rarely pays. A partial index adds some overhead for the query planner and excludes queries that don't match the index condition. (The Postgres query planner doesn't try hard if the match is not immediately obvious.)
OTOH, Postgres will rarely use an index for predicates retrieving 10 % of the table - a sequential scan will typically be faster. Again, it depends.
If (almost) all queries exclude NULL anyway (in a way the Postgres planner understands), then a partial index excluding only 10 % of all rows is still a sensible option. But it may backfire if query patterns change. The added complexity may not be worth it.
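If you want to quantify the size tradeoff on your own data, here is a hedged sketch (table name t and the index names are hypothetical, not from the question) comparing a full and a partial index:
create index idx_cola_full on t (colA);
create index idx_cola_partial on t (colA) where colA is not null;
select relname, pg_size_pretty(pg_relation_size(oid)) from pg_class where relname in ('idx_cola_full', 'idx_cola_partial');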
Also worth noting that there are still corner cases with NULL values in Postgres indexes. I bumped into this case recently where Postgres proved unwilling to read sorted rows from a multicolumn index when the first index expression was filtered with IS NULL (making a partial index preferable for the case):
db<>fiddle here
So, it depends on the complete picture.

Postgres multi-column index (integer, boolean, and array)

I have a Postgres 9.4 database with a table like this:
| id | other_id | current | dn_ids | rank |
|----|----------|---------|---------------------------------------|------|
| 1 | 5 | F | {123,234,345,456,111,222,333,444,555} | 1 |
| 2 | 7 | F | {123,100,200,900,800,700,600,400,323} | 2 |
(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:
CREATE TABLE mytable (
    id integer NOT NULL,
    other_id integer,
    rank integer,
    current boolean DEFAULT false,
    dn_ids integer[] DEFAULT '{}'::integer[]
);
CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;
ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);
I need to optimize queries like this:
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000
This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.
Here's an example EXPLAIN:
Limit  (cost=36944.53..36945.78 rows=500 width=65)
  ->  Sort  (cost=36942.03..37007.42 rows=26156 width=65)
        Sort Key: rank
        ->  Seq Scan on mytable  (cost=0.00..35431.42 rows=26156 width=65)
              Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))
Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.
What's the right type of compound index to use for this query? I don't care about space at all.
Does it matter much how I order the terms in the WHERE clause?
What's the right type of compound index to use for this query? I don't care about space at all.
This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:
Difference between GiST and GIN index
You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).
Multicolumn index on 3 fields with heterogenous data types
However, that does not cover the boolean data type, which typically doesn't make sense as an index column to begin with. With just two (three including NULL) possible values it's not selective enough.
And a plain btree index is more efficient for integer columns. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully whether combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
Finally, while indexes can be used for sorted output, this will probably not pay off for a query like the one you show (unless you select large parts of the table).
Not applicable to updated question.
Partial indexes might be an option. Other than that, you already have all the indexes you need.
I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
Does it matter much how I order the terms in the WHERE clause?
The order of WHERE conditions is completely irrelevant.
Addendum after question update
The utility of indexes is bound to selective criteria. If more than roughly 5 % of the table is selected (the threshold depends on various factors), a sequential scan of the whole table is typically faster than dealing with the overhead of any indexes - except for pre-sorting output; that's the one thing an index is still good for in such cases.
For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.
Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorates with the sum of both.
Even with your added information, much of what's relevant is still in the dark. I am going to assume that:
Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
The most selective predicate is other_id = 5.
A substantial part of the rows is eliminated with NOT current.
Aside: current = F isn't valid syntax in standard Postgres. It must be NOT current or current = FALSE.
While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:
CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;
The array column dn_ids in a btree index cannot support the && operator; I only include it to allow index-only scans and to filter rows before accessing the heap (the table). It may even be faster without dn_ids in the index:
CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
GiST indexes may become more interesting in Postgres 9.5 due to this new feature:
Allow GiST indexes to perform index-only scans (Anastasia Lubennikova,
Heikki Linnakangas, Andreas Karlsson)
Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate would do nothing.
Auto increment SQL function
Unfortunately I don't think you can combine a btree and a GIN/GiST index into a single compound index, so the planner is going to have to choose between using the other_id index or the dn_ids index. One advantage of using other_id, as you pointed out, is that you could use a multicolumn index to improve the sort performance. The way you would do this would be:
CREATE INDEX index_mytable_on_other_id_and_current
    ON mytable (other_id, rank) WHERE NOT current;
This uses a partial index, and will allow you to skip the sort step when you are sorting by rank and querying on other_id.
Depending on the cardinality of other_id, the only benefit of this might be the sorting. Because your plan has a LIMIT clause, it's difficult to tell. Seq scans can be the fastest option if you're reading more than about 1/5 of the table, especially if you're on a standard HDD instead of solid state. If your planner insists on seq scanning when you know an index scan is faster (you've tested with enable_seqscan off), you may want to try fine-tuning your random_page_cost or effective_cache_size.
Finally, I'd recommend not keeping all of these indexes. Find the ones you need, and cull the rest. Indexes cause substantial performance degradation on inserts (especially multi-column and GIN/GiST indexes).
The simplest index for your query is mytable(other_id, current). This handles the first two conditions. This would be a normal b-tree type index.
You can satisfy the array condition using a GIST index on mytable(dn_ids).
However, I don't think you can mix the different data types in one index, at least not without extensions.
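One such extension was already mentioned above: btree_gin. A rough, untested sketch of that route (the index name is made up); whether it beats separate indexes combined in a bitmap scan would need benchmarking on your data:
CREATE EXTENSION IF NOT EXISTS btree_gin;
-- hypothetical multicolumn GIN index combining the scalar and the array column
CREATE INDEX mytable_other_id_dn_ids_gin ON mytable USING gin (other_id, dn_ids);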

Postgresql Sorting a Joined Table with an index

I'm currently working on a complex sorting problem in Postgres 9.2
You can find the Source Code used in this Question(simplified) here: http://sqlfiddle.com/#!12/9857e/11
I have a huge (well over 20 million rows) table containing various columns of different types.
CREATE TABLE data_table
(
    id bigserial PRIMARY KEY,
    column_a character(1),
    column_b integer
    -- ~100 more columns
);
Let's say I want to sort this table over two columns (ASC).
But I don't want to do that with a simple ORDER BY, because later I might need to insert rows into the sorted output, and the user probably only wants to see 100 rows of the sorted output at a time.
To achieve these goals I do the following:
CREATE TABLE meta_table
(
    id bigserial PRIMARY KEY,
    id_data bigint NOT NULL -- refers to the data_table
);
-- Function to get the Column A of the current row
CREATE OR REPLACE FUNCTION get_column_a(bigint)
    RETURNS character AS
    'SELECT column_a FROM data_table WHERE id=$1'
    LANGUAGE sql IMMUTABLE STRICT;
-- Function to get the Column B of the current row
CREATE OR REPLACE FUNCTION get_column_b(bigint)
    RETURNS integer AS
    'SELECT column_b FROM data_table WHERE id=$1'
    LANGUAGE sql IMMUTABLE STRICT;
-- Creating an index on the expression:
CREATE INDEX meta_sort_index
    ON meta_table
    USING btree
    (get_column_a(id_data), get_column_b(id_data), id_data);
And then I copy the IDs from data_table into meta_table:
INSERT INTO meta_table(id_data) (SELECT id FROM data_table);
Later I can add additional rows to the table with a similarly simple insert.
To get rows 900000 - 900099 (100 rows) I can now use:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
ORDER BY 1,2,3 OFFSET 900000 LIMIT 100;
(With an additional INNER JOIN on data_table if I want all the data.)
The Resulting Plan is:
Limit  (cost=498956.59..499012.03 rows=100 width=8)
  ->  Index Only Scan using meta_sort_index on meta_table  (cost=0.00..554396.21 rows=1000000 width=8)
This is a pretty efficient plan (Index Only Scans are new in Postgres 9.2).
But what if I want to get rows 20,000,000 - 20,000,099 (100 rows)? Same plan, much longer execution time. Well, to improve the OFFSET performance (Improving OFFSET performance in PostgreSQL) I can do the following (let's assume I saved every 100,000th row away into another table).
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE (get_column_a(id_data), get_column_b(id_data), id_data ) >= (get_column_a(587857), get_column_b(587857), 587857 )
ORDER BY 1,2,3 LIMIT 100;
This runs much faster. The Resulting Plan is:
Limit  (cost=0.51..61.13 rows=100 width=8)
  ->  Index Only Scan using meta_sort_index on meta_table  (cost=0.51..193379.65 rows=318954 width=8)
        Index Cond: (ROW((get_column_a(id_data)), (get_column_b(id_data)), id_data) >= ROW('Z'::bpchar, 27857, 587857))
So far everything works perfectly and Postgres does a great job!
Let's assume I want to change the order of the second column to DESC.
But then I would have to change my WHERE clause, because the > operator compares both columns ASC. The same query as above (ASC ordering) could also be written as:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE
(get_column_a(id_data) > get_column_a(587857))
OR (get_column_a(id_data) = get_column_a(587857) AND ((get_column_b(id_data) > get_column_b(587857))
OR ( (get_column_b(id_data) = get_column_b(587857)) AND (id_data >= 587857))))
ORDER BY 1,2,3 LIMIT 100;
Now the plan changes and the query becomes slow:
Limit  (cost=0.00..1095.94 rows=100 width=8)
  ->  Index Only Scan using meta_sort_index on meta_table  (cost=0.00..1117877.41 rows=102002 width=8)
        Filter: (((get_column_a(id_data)) > 'Z'::bpchar) OR (((get_column_a(id_data)) = 'Z'::bpchar) AND (((get_column_b(id_data)) > 27857) OR (((get_column_b(id_data)) = 27857) AND (id_data >= 587857)))))
How can I use the efficient older plan with DESC ordering?
Do you have any better ideas how to solve the problem?
(I already tried to declare my own type with its own operator classes, but that's too slow.)
You need to rethink your approach. Where to begin? This is a clear example of the performance limits of the functional approach you are taking to SQL. Functions are largely opaque to the planner, and you are forcing two separate lookups on data_table for every row retrieved, because the stored procedures' plans cannot be folded together.
Now, far worse, you are indexing one table based on data in another. This might work for append-only workloads (inserts allowed but no updates), but it will not work if data_table can ever have updates applied. If the data in data_table ever changes, the index will return wrong results.
In these cases, you are almost always better off writing in the join as explicit, and letting the planner figure out the best way to retrieve the data.
Now your problem is that your index becomes a lot less useful (and a lot more disk-I/O-intensive) when you change the order of your second column. On the other hand, if you had two different indexes on data_table and an explicit join, PostgreSQL could handle this much more easily.
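As a rough, untested sketch of that suggestion (the index names and exact pagination details are my assumptions, not part of the original answer), you could index the sort columns directly on data_table, one index per sort direction you need, and join explicitly so the planner can pick the matching one:
-- one index per ordering the application actually needs
CREATE INDEX data_table_sort_asc_idx  ON data_table (column_a, column_b, id);
CREATE INDEX data_table_sort_desc_idx ON data_table (column_a, column_b DESC, id);
-- explicit join instead of planner-opaque wrapper functions
SELECT d.column_a, d.column_b, d.id
FROM meta_table m
JOIN data_table d ON d.id = m.id_data
ORDER BY d.column_a, d.column_b DESC, d.id
LIMIT 100 OFFSET 900000;
Whether this actually beats the expression index for the DESC case would still need testing against your data.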

Postgres query optimization (forcing an index scan)

Below is my query. I am trying to get it to use an index scan, but it will only do a seq scan.
By the way, the metric_data table has 130 million rows. The metrics table has about 2000 rows.
metric_data table columns:
metric_id integer
, t timestamp
, d double precision
, PRIMARY KEY (metric_id, t)
How can I get this query to use my PRIMARY KEY index?
SELECT
S.metric,
D.t,
D.d
FROM metric_data D
INNER JOIN metrics S
ON S.id = D.metric_id
WHERE S.NAME = ANY (ARRAY ['cpu', 'mem'])
AND D.t BETWEEN '2012-02-05 00:00:00'::TIMESTAMP
AND '2012-05-05 00:00:00'::TIMESTAMP;
EXPLAIN:
Hash Join  (cost=271.30..3866384.25 rows=294973 width=25)
  Hash Cond: (d.metric_id = s.id)
  ->  Seq Scan on metric_data d  (cost=0.00..3753150.28 rows=29336784 width=20)
        Filter: ((t >= '2012-02-05 00:00:00'::timestamp without time zone)
             AND (t <= '2012-05-05 00:00:00'::timestamp without time zone))
  ->  Hash  (cost=270.44..270.44 rows=68 width=13)
        ->  Seq Scan on metrics s  (cost=0.00..270.44 rows=68 width=13)
              Filter: ((sym)::text = ANY ('{cpu,mem}'::text[]))
For testing purposes you can force the use of the index by "disabling" sequential scans - best in your current session only:
SET enable_seqscan = OFF;
Do not use this on a production server. Details in the manual here.
I put "disabling" in quotes because you cannot actually disable sequential table scans; the setting just makes any other available option preferable to Postgres. This will prove that the multicolumn index on (metric_id, t) can be used - just not as effectively as an index with t as the leading column.
You will probably get better results by switching the order of columns in your PRIMARY KEY (and the index used to implement it behind the curtains with it) to (t, metric_id). Or create an additional index with reversed columns like that.
Is a composite index also good for queries on the first field?
You do not normally have to force better query plans by manual intervention. If setting enable_seqscan = OFF leads to a much better plan, something is probably not right in your database. Consider this related answer:
Keep PostgreSQL from sometimes choosing a bad query plan
You cannot force an index scan in this case because it would not make the query faster.
You currently have an index on metric_data (metric_id, t), but the server cannot take advantage of it for this query, because it needs to be able to discriminate by metric_data.t alone (without metric_id), and there is no such index. The server can use sub-fields of compound indexes, but only starting from the beginning. For example, searching by metric_id would be able to employ this index.
If you create another index on metric_data (t), your query will make use of that index and will work much faster.
Also, you should make sure that you have an index on metrics (id).
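A minimal sketch of the indexes both answers point at (the index names are made up):
-- index leading with the timestamp, so the BETWEEN range can use it
CREATE INDEX metric_data_t_idx ON metric_data (t);
-- or cover both the range and the join column in one index
CREATE INDEX metric_data_t_metric_id_idx ON metric_data (t, metric_id);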
Have you tried to use:
WHERE S.NAME = ANY (VALUES ('cpu'), ('mem'))
instead of an ARRAY, like here?
It appears you are lacking suitable FK constraints:
CREATE TABLE metric_data
( metric_id integer
, t timestamp
, d double precision
, PRIMARY KEY (metric_id, t)
, CONSTRAINT metrics_xxx_fk FOREIGN KEY (metric_id) REFERENCES metrics (id)
);
and in table metrics:
CREATE TABLE metrics
( id INTEGER PRIMARY KEY
...
);
Also check whether your statistics are sufficient (and fine-grained enough, since you intend to select 0.2 % of the metric_data table).
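If the statistics turn out to be too coarse, a hedged sketch of how you might raise the per-column statistics target (the value 1000 is just an example) and refresh the statistics:
ALTER TABLE metric_data ALTER COLUMN t SET STATISTICS 1000;
ANALYZE metric_data;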

Why index seek and not a scan for the following setup in SQL Server 2005

I have created a table
create table #temp(a int, b int, c int)
I have 2 indexes on this table:
Non clustered non unique index on c
Clustered Index on a
When I try to execute the following query:
select b from #temp where c = 3
I see that the system goes for an index scan. This is fine, because the non-clustered index doesn't have b as a key column. Hence it does an index scan on the clustered index (column a).
But when I try to execute the query below:
select b from #temp where c = 3 and a = 3
I see that the execution plan has only an index seek. No scan. Why is that?
Neither the clustered index nor the non-clustered index has b as one of its columns.
I was expecting an index scan.
Please clarify.
If you have a as your clustering key, then that column is included in all non-clustered indices on that table.
So your index on c also includes a, so the condition
where c= 3 and a = 3
can be found in that index using an index seek. Most likely, the query optimizer decided that doing an index seek to find a and c plus a key lookup to get the rest of the data is faster/more efficient here than using an index scan.
BTW: why did you expect / prefer an index scan over an index seek? The index seek typically is faster and uses a lot less resources - I would always strive to get index seeks over scans.
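If the goal were to avoid the key lookup for b entirely, a hedged sketch (not from the original answer; the index name is made up) of a covering non-clustered index in SQL Server:
-- include b as a non-key column so the query is fully covered by the index
CREATE NONCLUSTERED INDEX IX_temp_c_incl_b ON #temp (c) INCLUDE (b);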
This is fine, because the non clustered index doesn't have b as the key value. Hence it does an index scan from column a.
This assumption is not right. Index seek vs. scan is driven by the WHERE clause, not the SELECT clause.
Now to your question:
The WHERE clause is optimized by the SQL optimizer, and since there is an a = 3 condition, the clustered index can be applied.