Postgres partial vs regular / full index on nullable column - postgresql

I have a table with 1m records, with 100k records having null on colA. The remaining records have pretty distinct values. Is there a difference between creating a regular index on this column and a partial index with where colA is not null?
Since regular Postgres indexes do not store NULL values, wouldn't it be the same as creating a partial index with where colA is not null?
Any pros or cons with either index?

If you create a partial index that excludes nulls, Postgres will not use it to find nulls.
Here's a test with a full index on PostgreSQL 13.5.
# create index idx_test_num on test(num);
CREATE INDEX
# explain select count(*) from test where num is null;
                                      QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=5135.00..5135.01 rows=1 width=8)
   ->  Bitmap Heap Scan on test  (cost=63.05..5121.25 rows=5500 width=0)
         Recheck Cond: (num IS NULL)
         ->  Bitmap Index Scan on idx_test_num  (cost=0.00..61.68 rows=5500 width=0)
               Index Cond: (num IS NULL)
(5 rows)
And with a partial index.
# create index idx_test_num on test(num) where num is not null;
CREATE INDEX
# explain select count(*) from test where num is null;
                                       QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=10458.12..10458.13 rows=1 width=8)
   ->  Gather  (cost=10457.90..10458.11 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=9457.90..9457.91 rows=1 width=8)
               ->  Parallel Seq Scan on test  (cost=0.00..9352.33 rows=42228 width=0)
                     Filter: (num IS NULL)
(6 rows)
Since regular postgres indexes do not store NULL values...
This has not been true since version 8.2 [checks notes] 16 years ago. The 8.2 docs say...
Indexes are not used for IS NULL clauses by default. The best way to use indexes in such cases is to create a partial index using an IS NULL predicate.
8.3 introduced NULLS FIRST and NULLS LAST and many other improvements around nulls, including...
Allow col IS NULL to use an index (Teodor)

It all depends.
NULL values have been included in (default) B-tree indexes since Postgres 8.3, as Schwern pointed out. However, predicates like the one you mention (where colA is not null) have only been properly supported since Postgres 9.0. The release notes:
Allow IS NOT NULL restrictions to use indexes (Tom Lane)
This is particularly useful for finding MAX()/MIN() values in indexes that contain many null values.
GIN indexes followed later:
As of PostgreSQL 9.1, null key values can be included in the index.
Typically, a partial index makes sense if it excludes a major part of the table from the index, making it substantially smaller and saving writes to the index. Since B-tree indexes are so shallow, bare seek performance scales fantastically (once the index is cached). 10 % fewer index entries hardly matter in that area.
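To make that concrete, here is a minimal sketch with a hypothetical table where roughly 90 % of rows hold NULL in the indexed column (the opposite of the distribution in the question):
-- Hypothetical table: the vast majority of rows have NULL in sensor_val
CREATE TABLE measurements (id bigint PRIMARY KEY, sensor_val integer);
-- Full index: also stores an entry for every NULL row
CREATE INDEX measurements_val_full ON measurements (sensor_val);
-- Partial index: roughly a tenth of the size here, and rows with NULL never
-- get written to it; usable only for queries whose WHERE clause implies
-- sensor_val IS NOT NULL
CREATE INDEX measurements_val_notnull ON measurements (sensor_val)
WHERE sensor_val IS NOT NULL;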
Your case would exclude only around 10% of all rows, and that rarely pays. A partial index adds some overhead for the query planner and excludes queries that don't match the index condition. (The Postgres query planner doesn't try hard if the match is not immediately obvious.)
OTOH, Postgres will rarely use an index for predicates retrieving 10 % of the table - a sequential scan will typically be faster. Again, it depends.
If (almost) all queries exclude NULL anyway (in a way the Postgres planner understands), then a partial index excluding only 10 % of all rows is still a sensible option. But it may backfire if query patterns change. The added complexity may not be worth it.
Also worth noting that there are still corner cases with NULL values in Postgres indexes. I bumped into this case recently where Postgres proved unwilling to read sorted rows from a multicolumn index when the first index expression was filtered with IS NULL (making a partial index preferable for the case):
db<>fiddle here
So, it depends on the complete picture.


Why doesn't postgresql use all columns in a multi-column index?

I am using the extension
CREATE EXTENSION btree_gin;
I have an index that looks like this...
create index boundaries2 on rets USING GIN(source, isonlastsync, status, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder);
Before I started messing with it, the index looked like this, with the same results I'm about to share, so it seems minor variations in the index definition don't make a difference:
create index boundaries on rets USING GIN((geoinfo::jsonb->'boundaries'), source, status, isonlastsync, ctcvalidto, searchablePrice, ctcSortOrder);
I give PostgreSQL 11 this query:
explain analyze select id from rets where ((geoinfo::jsonb->'boundaries' ?| array['High School: Torrey Pines']) AND source='SDMLS'
AND searchablePrice>=800000 AND searchablePrice<=1200000 AND YrBlt>=2000 AND EstSF>=2300
AND Beds>=3 AND FB>=2 AND ctcSortOrder>'2019-07-05 16:02:54 UTC' AND Status IN ('ACTIVE','BACK ON MARKET')
AND ctcvalidto='9999-12-31 23:59:59 UTC' AND isonlastsync='true') order by LstDate desc, ctcSortOrder desc LIMIT 3000;
with result...
Limit (cost=120.06..120.06 rows=1 width=23) (actual time=472.849..472.850 rows=1 loops=1)
-> Sort (cost=120.06..120.06 rows=1 width=23) (actual time=472.847..472.848 rows=1 loops=1)
Sort Key: lstdate DESC, ctcsortorder DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on rets (cost=116.00..120.05 rows=1 width=23) (actual time=472.748..472.841 rows=1 loops=1)
Recheck Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Rows Removed by Index Recheck: 93
Filter: (isonlastsync AND (yrblt >= 2000) AND (estsf >= 2300) AND (beds >= 3) AND (fb >= 2) AND (status = ANY ('{ACTIVE,"BACK ON MARKET"}'::text[])))
Rows Removed by Filter: 10
Heap Blocks: exact=102
-> Bitmap Index Scan on boundaries2 (cost=0.00..116.00 rows=1 width=0) (actual time=471.762..471.762 rows=104 loops=1)
Index Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Planning Time: 0.333 ms
Execution Time: 474.311 ms
(14 rows)
The Question
Why are the columns status and isonlastsync not used by the Bitmap Index Scan on boundaries2?
It can skip those columns if it predicts that filtering them out afterwards will be faster. This is usually the case if the cardinality of the columns is very low and you fetch a large enough portion of all rows; this is true for booleans like isonlastsync and usually true for status columns with just a few distinct values.
Rows Removed by Filter: 10 is very little to filter out, either because your table does not hold a large number of rows or because most of them match the conditions you specified for those two columns. You might try generating more data in that table or selecting rows with a rare status.
I suggest using a partial index (with a WHERE condition), at least for the boolean value, and removing those two columns to make the index a bit more lightweight.
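A minimal sketch of that suggestion, reusing the column list from the question; the exact WHERE clause is an assumption about what all your searches have in common:
-- Hypothetical partial GIN index: the two low-cardinality columns move into
-- the index predicate instead of being indexed themselves. Queries must
-- include matching conditions for the planner to consider this index.
CREATE INDEX boundaries_partial ON rets USING GIN
    (source, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder)
WHERE isonlastsync AND status IN ('ACTIVE', 'BACK ON MARKET');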
I cannot tell you why, but I can help you optimize the query.
You should not use a multi-column GIN index, but a GIN index on only the jsonb expression and a b-tree index on the other columns.
The order of columns matters: put the ones used in an equality condition first, with the most selective at the beginning. Next put the column with the most selective inequality or IN condition. For the following columns, the order doesn't matter, as they will only act as filters in the index scan.
Make sure that the indexes are cached in RAM.
I'd expect that to be faster.
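A sketch of what that could look like (the b-tree column order is only a guess at relative selectivity, not something measured):
-- GIN index only for the jsonb expression used with ?|
CREATE INDEX rets_boundaries_gin ON rets USING GIN ((geoinfo::jsonb->'boundaries'));
-- B-tree index for the remaining conditions: equality columns first,
-- then the (assumed) most selective range columns
CREATE INDEX rets_filters_btree ON rets (source, ctcvalidto, isonlastsync, ctcSortOrder, searchablePrice);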
I think you're asking yourself the wrong question. As Lukasz answered already, PostgreSQL may find it inefficient to check all columns in the index. The problem here is that your index is too big on disk.
You probably added as many columns as possible to the index while trying to make this SQL faster, but it is backfiring.
The trick is to realize how much data PostgreSQL has to read to find your records. If your index contains too much data, it will have to read a lot. Also, be aware that low cardinality columns don't play well with BTree and common index types; generally you want to avoid indexing them.
To keep the index as small as possible and lookups quick, you have to find the column with the most cardinality, or better, the column that will return the fewest rows for your query. My guess is "ctcSortOrder". This will be the first column of your index.
Now look at which column, after filtering by the first one, has the most cardinality or will filter out the most rows. I have no idea about your data, but "source" looks like a good candidate.
Try to avoid jsonb searches unless they are the primary source of the cardinality, and keep the index as a B-tree. A B-tree is several times faster.
And like Lukasz suggested, look at partial indexes. For example add "WHERE Status IN ('ACTIVE','BACK ON MARKET') AND isonlastsync='true'" as these two conditions may be common to all your searches.
Bottom line is, having a simpler, smaller index is faster than having all columns indexed. And the order of the columns does matter a lot. Stick with a B-tree unless there is a good reason not to (lots of cardinality in non-B-tree-compatible types).
If your table is huge (>10M rows) consider table partitioning, for example by ctcSortOrder. But I don't think this is your case.

Postgresql not allowing index scan although indexing is there

I have a table users on which I create
CREATE INDEX user_status_index
ON public.users
USING btree
(status COLLATE pg_catalog."default", "keep_id" COLLATE pg_catalog."default");
When I did
EXPLAIN ANALYZE select * from users where keep_id = 'pop90'
It gives me
"Seq Scan on users (cost=0.00..47284.38 rows=2 width=16) (actual time=960.463..3451.731 rows=2 loops=1)"
" Filter: (("keep_id")::text = 'pop90'::text)"
" Rows Removed by Filter: 1271935"
"Planning time: 0.075 ms"
"Execution time: 3451.773 ms"
Why is it not using an index scan?
How can I make it use an index scan?
Any help will be really appreciated.
You are using a multi column index.
There are some limitations when using multi column indexes.
You might be better off creating two single column indexes.
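For the query shown, a minimal sketch of such an index (the index name is arbitrary; the COLLATE clause just mirrors the original definition):
CREATE INDEX user_keep_id_index
    ON public.users
    USING btree
    ("keep_id" COLLATE pg_catalog."default");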
You could read this article:
Multi-column Indexes
While Postgres has the ability to create multi-column indexes, it’s important to understand when it makes sense to do so. The Postgres query planner has the ability to combine and use multiple single-column indexes in a multi-column query by performing a bitmap index scan. In general, you can create an index on every column that covers query conditions and in most cases Postgres will use them, so make sure to benchmark and justify the creation of a multi-column index before you create them. As always, indexes come with a cost, and multi-column indexes can only optimize the queries that reference the columns in the index in the same order, while multiple single column indexes provide performance improvements to a larger number of queries.
However there are cases where a multi-column index clearly makes sense. An index on columns (a, b) can be used by queries containing WHERE a = x AND b = y, or queries using WHERE a = x only, but will not be used by a query using WHERE b = y. So if this matches the query patterns of your application, the multi-column index approach is worth considering. Also note that in this case creating an index on a alone would be redundant.

Can the Postgres array overlap (&&) operator use an index?

We have a table with an indexed array column:
CREATE TABLE mention (
id SERIAL,
phraseIds integer[],
PRIMARY KEY (id)
);
CREATE INDEX indx_mentions_phraseIds on mention USING GIN (phraseids public.gin__int_ops);
Queries using the "overlaps" operator on this column don't seem to use the index:
explain analyze select m.id FROM mention m WHERE m.phraseIds && ARRAY[11638,11639];
Seq Scan on mention m (cost=0.00..933723.44 rows=1404 width=4) (actual time=103.018..3751.525 rows=1101 loops=1)
Filter: (phraseids && '{11638,11639}'::integer[])
Rows Removed by Filter: 7019974
Total runtime: 3751.618 ms
Is it possible to get Postgresql to use the index? Or should we be doing something else?
Update: I repeated the test with 'SET enable_seqscan TO off' and the index is still not used.
Update: I should have mentioned that I am using 9.2 with the intarray extension.
Update: It seems that the intarray extension is part of this problem. I re-created the table without using the intarray extension and the index is used as expected. Anyone know how to get the index to be used with the intarray extension? The docs (http://www.postgresql.org/docs/9.2/static/intarray.html) say that indexes are supported for &&.
I built a similar table in PostgreSQL 9.2; the difference was USING GIN (phraseids); I don't seem to have int_ops available in this context for some reason. I loaded a few thousand rows of random (ish) data.
Setting enable_seqscan off let PostgreSQL use the index.
PostgreSQL calculated the cost of a sequential scan to be less than the cost of a bitmap heap scan. The actual time of a sequential scan was 10% the actual time of a bitmap heap scan, but the total run time for a sequential scan was a little more than the total run time of a bitmap heap scan.

PostgreSQL index not used for query on IP ranges

I'm using PostgreSQL 9.2 and have a table of IP ranges. Here's the SQL:
CREATE TABLE ips (
id serial NOT NULL,
begin_ip_num bigint,
end_ip_num bigint,
country_name character varying(255),
CONSTRAINT ips_pkey PRIMARY KEY (id )
)
I've added plain B-tree indices on both begin_ip_num and end_ip_num:
CREATE INDEX index_ips_on_begin_ip_num ON ips (begin_ip_num);
CREATE INDEX index_ips_on_end_ip_num ON ips (end_ip_num );
The query being used is:
SELECT ips.* FROM ips
WHERE 3065106743 BETWEEN begin_ip_num AND end_ip_num;
The problem is that my BETWEEN query is only using the index on begin_ip_num. After using the index, it filters the result using end_ip_num. Here's the EXPLAIN ANALYZE result:
Index Scan using index_ips_on_begin_ip_num on ips (cost=0.00..2173.83 rows=27136 width=76) (actual time=16.349..16.350 rows=1 loops=1)
Index Cond: (3065106743::bigint >= begin_ip_num)
Filter: (3065106743::bigint <= end_ip_num)
Rows Removed by Filter: 47596
Total runtime: 16.425 ms
I've already tried various combinations of indices including adding a composite index on both begin_ip_num and end_ip_num.
Try a multicolumn index, but with reversed order on the second column:
CREATE INDEX index_ips_begin_end_ip_num ON ips (begin_ip_num, end_ip_num DESC);
Ordering is mostly irrelevant for a single-column index, since it can be scanned backwards almost as fast. But it is important for multicolumn indexes.
With the index I propose, Postgres can scan the first column and find the address, where the rest of the index fulfills the first condition. Then it can, for each value of the first column, return all rows that fulfill the second condition, until the first one fails. Then jump to the next value of the first column, etc.
This is still not very efficient and Postgres may be faster just scanning the first index column and filtering for the second. Very much depends on your data distribution.
Either way, CLUSTER using the multicolumn index from above can help performance:
CLUSTER ips USING index_ips_begin_end_ip_num
This way, candidates fulfilling your first condition are packed onto the same or adjacent data pages. It can help performance a lot if you have lots of rows per value of the first column. Else it is hardly effective.
(There are also non-blocking external tools for the purpose: pg_repack or pg_squeeze.)
Also, is autovacuum running and configured properly or have you run ANALYZE on the table? You need current statistics for Postgres to pick appropriate query plans.
What would really help here is a GiST index for a int8range column, available since PostgreSQL 9.2. See:
Optimizing queries on a range of timestamps (two columns)
If your IP ranges can be covered with one of the built-in network types inet or cidr, consider replacing your two bigint columns. Or, better yet, look to the additional module ip4r by Andrew Gierth (not in the standard distribution). The indexing strategy changes accordingly.
Barring that, you can check out this related answer on dba.SE using a sophisticated regime with partial indexes. Advanced stuff, but it delivers great performance:
Can spatial index help a “range - order by - limit” query
I had exactly this same problem on a nearly identical dataset from maxmind.com's free geoip table. I solved it using Erwin's tip about range types and GiST indexes. The GiST index was key. Without it I was querying at best about 3 rows per second. With it I queried nearly 500000 rows in under 10 seconds! Since Erwin didn't post detailed instructions on how to do this, I thought I'd add them here...
First of all, you must add a new column having the range type; note that int8range is required for bigint types. Next, set its values appropriately; note that the '[]' parameter makes the range inclusive at its lower and upper bounds (rtfm). Finally, add the index; note that the GiST index is where all the performance advantage comes from.
alter table ips add column iprange int8range;
update ips set iprange=int8range(begin_ip_num, end_ip_num, '[]');
create index index_ips_on_iprange on ips using gist (iprange);
Having laid the groundwork, you can now use the '<@' contained-by operator to search specific addresses against the table. See http://www.postgresql.org/docs/9.2/static/functions-range.html
SELECT "ips".* FROM "ips" WHERE (3065106743::bigint <@ iprange);
I'm a bit late to this party, but this is what works really well for me.
Consider installing the ip4r extension. It basically allows you to define a column that can hold IP ranges. The name of the extension implies it is just for IPv4, but currently it also supports IPv6.
After you populate the table with ranges in that column, all you need is to create a GiST index:
CREATE INDEX ip_zip_ip4_range ON ip_zip USING gist (ip4_range);
I have almost 10 million ranges in my database, but queries take a fraction of a millisecond:
region=> select count(*) from ip_zip ;
count
---------
9566133
region=> explain analyze select * from ip_zip where '8.8.8.8'::ip4 <<= ip4_range;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ip_zip (cost=234.55..25681.29 rows=9566 width=22) (actual time=0.085..0.086 rows=1 loops=1)
Recheck Cond: ('8.8.8.8'::ip4r <<= ip4_range)
Heap Blocks: exact=1
-> Bitmap Index Scan on ip_zip_ip4_range (cost=0.00..232.16 rows=9566 width=0) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: ('8.8.8.8'::ip4r <<= ip4_range)
Planning time: 0.106 ms
Execution time: 0.118 ms
(7 rows)
region=> explain analyze select * from ip_zip where '254.50.22.54'::ip4 <<= ip4_range;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ip_zip (cost=234.55..25681.29 rows=9566 width=22) (actual time=0.059..0.059 rows=1 loops=1)
Recheck Cond: ('254.50.22.54'::ip4r <<= ip4_range)
Heap Blocks: exact=1
-> Bitmap Index Scan on ip_zip_ip4_range (cost=0.00..232.16 rows=9566 width=0) (actual time=0.048..0.048 rows=1 loops=1)
Index Cond: ('254.50.22.54'::ip4r <<= ip4_range)
Planning time: 0.102 ms
Execution time: 0.145 ms
(7 rows)
I believe your query looks like WHERE [constant] BETWEEN begin_ip_num AND end_ip_num or similar.
As far as I know, Postgres doesn't have an "AND-EQUAL" access plan, so you need to add a composite index on the 2 columns as suggested by Erwin Brandstetter.

Can PostgreSQL index array columns?

I can't find a definite answer to this question in the documentation. If a column is an array type, will all the entered values be individually indexed?
I created a simple table with one int[] column, and put a unique index on it. I noticed that I couldn't add the same array of ints, which leads me to believe the index is a composite of the array items, not an index of each item.
INSERT INTO "Test"."Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test"."Test" VALUES ('{10, 20, 30}');
SELECT * FROM "Test"."Test" WHERE 20 = ANY ("Column1");
Is the index helping this query?
Yes, you can index an array, but you have to use the array operators and the GIN index type.
Example:
CREATE TABLE "Test"("Column1" int[]);
INSERT INTO "Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test" VALUES ('{10, 20, 30}');
CREATE INDEX idx_test on "Test" USING GIN ("Column1");
-- To enforce index usage because we have only 2 records for this test...
SET enable_seqscan TO off;
EXPLAIN ANALYZE
SELECT * FROM "Test" WHERE "Column1" #> ARRAY[20];
Result:
Bitmap Heap Scan on "Test" (cost=4.26..8.27 rows=1 width=32) (actual time=0.014..0.015 rows=2 loops=1)
Recheck Cond: ("Column1" #> '{20}'::integer[])
-> Bitmap Index Scan on idx_test (cost=0.00..4.26 rows=1 width=0) (actual time=0.009..0.009 rows=2 loops=1)
Index Cond: ("Column1" #> '{20}'::integer[])
Total runtime: 0.062 ms
Note
it appears that in many cases the gin__int_ops option is required
create index <index_name> on <table_name> using GIN (<column> gin__int_ops)
I have not yet seen a case where it would work with the && and @> operators without the gin__int_ops option
@Tregoreg raised a question in the comment to his offered bounty:
I didn't find the current answers working. Using GIN index on array-typed column does not increase the performance of ANY() operator. Is there really no solution?
@Frank's accepted answer tells you to use array operators, which is still correct for Postgres 11. The manual:
... the standard distribution of PostgreSQL includes a GIN operator class for arrays, which supports indexed queries using these operators:
<@
@>
=
&&
The complete list of built-in operator classes for GIN indexes in the standard distribution is here.
In Postgres indexes are bound to operators (which are implemented for certain types), not data types alone or functions or anything else. That's a heritage from the original Berkeley design of Postgres and very hard to change now. And it's generally working just fine. Here is a thread on pgsql-bugs with Tom Lane commenting on this.
Some PostGIS functions (like ST_DWithin()) seem to violate this principle, but that is not so. Those functions are rewritten internally to use respective operators.
The indexed expression must be to the left of the operator. For most operators (including all of the above) the query planner can achieve this by flipping operands if you place the indexed expression to the right - given that a COMMUTATOR has been defined. The ANY construct can be used in combination with various operators and is not an operator itself. When used as constant = ANY (array_expression) only indexes supporting the = operator on array elements would qualify and we would need a commutator for = ANY(). GIN indexes are out.
Postgres is not currently smart enough to derive a GIN-indexable expression from it. For starters, constant = ANY (array_expression) is not completely equivalent to array_expression @> ARRAY[constant]. Array operators return an error if any NULL elements are involved, while the ANY construct can deal with NULL on either side. And there are different results for data type mismatches.
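In practice that means rewriting the predicate to use one of the GIN-indexable array operators. A sketch against the example table from the accepted answer above:
-- Cannot use the GIN index: = ANY() is not a GIN-indexable operator
EXPLAIN SELECT * FROM "Test" WHERE 20 = ANY ("Column1");
-- Can use the GIN index (not strictly equivalent: NULL elements and
-- type mismatches behave differently, as noted above)
EXPLAIN SELECT * FROM "Test" WHERE "Column1" @> ARRAY[20];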
Related answers:
Check if value exists in Postgres array
Index for finding an element in a JSON array
SQLAlchemy: how to filter on PgArray column types?
Can IS DISTINCT FROM be combined with ANY or ALL somehow?
Asides
While working with integer arrays (int4, not int2 or int8) without NULL values (like your example implies), consider the additional module intarray, which provides specialized, faster operators and index support. See:
How to create an index for elements of an array in PostgreSQL?
Compare arrays for equality, ignoring order of elements
As for the UNIQUE constraint in your question that went unanswered: That's implemented with a btree index on the whole array value (like you suspected) and does not help with the search for elements at all. Details:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
It's now possible to index the individual array elements. For example:
CREATE TABLE test (foo int[]);
INSERT INTO test VALUES ('{1,2,3}');
INSERT INTO test VALUES ('{4,5,6}');
CREATE INDEX test_index on test ((foo[1]));
SET enable_seqscan TO off;
EXPLAIN ANALYZE SELECT * from test WHERE foo[1]=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using test_index on test (cost=0.00..8.27 rows=1 width=32) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: (foo[1] = 1)
Total runtime: 0.112 ms
(3 rows)
This works on at least Postgres 9.2.1. Note that you need to build a separate index for each array index; in my example I only indexed the first element.