Suggest an appropriate index for my database - PostgreSQL

I have a table product
Product(id BIGINT,
... Some more columns here
expired DATE);
I want to create an index on the expired field for faster retrieval.
Most of the time my WHERE clause is
...
WHERE (expired IS NULL OR expired > now());
Can you please suggest which index is most suitable for me?
When I execute explain analyze for the above query
EXPLAIN ANALYZE
SELECT 1
FROM product
WHERE (expired IS NULL) OR (expired > now());
it gave me the following result, in which it is not using the index I created.
Seq Scan on product (cost=0.00..190711.22 rows=5711449 width=0) (actual time=0.009..8653.380 rows=7163105 loops=1)
Filter: ((expired IS NULL) OR (expired > now()))
Rows Removed by Filter: 43043
Planning time: 0.117 ms
Execution time: 15679.478 ms
(5 rows)
I guess that is because of the OR condition. I tried to create a function-based index, but it gave me the following error:
ERROR: functions in index expression must be marked IMMUTABLE
Is there any alternative way to do this?

You should change the WHERE clause to look like this:
... WHERE COALESCE(expired, DATE 'infinity') > current_date;
This is equivalent to your query, but now you can use the following index:
CREATE INDEX ON product (COALESCE(expired, DATE 'infinity'));
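To verify that the planner actually picks up the expression index, you can run something like the following (a sketch; the exact plan will depend on your data and statistics):
EXPLAIN ANALYZE
SELECT 1
FROM product
WHERE COALESCE(expired, DATE 'infinity') > current_date;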

Probably the default B-tree index is the most appropriate one for you; hash indexes only handle "equals" comparisons, and the GiST and GIN indexes are for more complex data types than what you are using:
https://www.postgresql.org/docs/current/static/indexes-types.html
In fact the B-tree is the default, so all you need to do is something like:
CREATE INDEX product_expired_idx ON product (expired);

Related

Incorrect index usage, PostgreSQL version 12

Query Plan:
db=> explain
db-> SELECT MIN("id"), MAX("id") FROM "public"."tablename" WHERE ( "updated_at" >= '2022-07-24 09:08:05.926533' AND "updated_at" < '2022-07-28 09:16:54.95459' );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=128.94..128.95 rows=1 width=16)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..64.47 rows=1 width=8)
-> Index Scan using tablename_pkey on tablename (cost=0.57..416250679.26 rows=6513960 width=8)
Index Cond: (id IS NOT NULL)
Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
InitPlan 2 (returns $1)
-> Limit (cost=0.57..64.47 rows=1 width=8)
-> Index Scan Backward using tablename_pkey on tablename tablename_1 (cost=0.57..416250679.26 rows=6513960 width=8)
Index Cond: (id IS NOT NULL)
Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
(11 rows)
Indexes:
"tablename_pkey" PRIMARY KEY, btree (id)
"tablename_updated_at_incl_id_partial_idx" btree (updated_at) INCLUDE (id) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone
The idea is: when there is already a filtered (partial) index that covers only a small subset of records, why is the query doing an index scan on the primary key instead of tablename_updated_at_incl_id_partial_idx? Also, this is a heap table, not a clustered table.
Because you're using MIN and MAX, try redefining your second index so id is part of the BTREE index, not just INCLUDEd in it. That may make searching for the MIN and MAX items faster.
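A sketch of what that redefinition could look like, keeping the same partial-index predicate (the index name here is only illustrative):
-- id as a key column of the B-tree instead of an INCLUDE column
CREATE INDEX tablename_updated_at_id_partial_idx
ON tablename (updated_at, id)
WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone;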
If a small fraction of your table really is over 6e6 rows, then your table must be huge. And I am guessing that id and updated_at are nearly perfectly correlated with each other, so selecting specifically for recent updated_at means you are also selecting for higher id. But the planner doesn't know about that. It thinks that by walking up the id index it can stop after walking about 1/6513960 of it, once it finds the first row qualifying on the time column. But instead it has to walk most of the index before finding that row.
The simplest solution is probably to introduce some dummy arithmetic into the aggregates: SELECT MIN("id"+0), MAX("id"+0) ... This will force it not to use the index on id. This will probably be the most robust and simplest solution, as long as you have the flexibility to change the query text in your app. But even if you can't change the app, it should at least allow you to verify my assumptions and capture an EXPLAIN (ANALYZE) while it is not using the pk index.
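A sketch, using the same predicate as the plan above:
EXPLAIN (ANALYZE)
SELECT MIN("id" + 0), MAX("id" + 0)
FROM "public"."tablename"
WHERE "updated_at" >= '2022-07-24 09:08:05.926533'
AND "updated_at" < '2022-07-28 09:16:54.95459';
The "+ 0" turns the aggregate arguments into expressions that don't match any index, so the planner can no longer satisfy MIN/MAX by walking the id index.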
None of PostgreSQL's advanced statistics will (as of yet) fix this problem, so you are stuck with fixing it by changing the query or the indexes. Changing the query in the silly way I described is the best currently available solution, but if you need to do it with indexes alone, there are some other, less good options which will likely still be better than what you currently have.
One is to make the horrible index scan at least into a horrible index-only scan. You could replace your existing primary key index with one like create unique index on tablename (id) include (updated_at). Here the INCLUDE is necessary because otherwise the UNIQUE would not do what you want. It will still have to walk a large part of the index, but at least it won't need to keep jumping between index and table to fetch the time column. (Make sure the table is well-vacuumed)
Or, you could provide a partial index that the planner would find attractive, by switching the order of the columns in it: create index on tablename (id, updated_at) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone. The only thing that makes this better than your existing partial index is that this one would actually get used.

Why does Postgres use an index scan over a sequential scan even with a mismatching data type on the indexed column and query condition?

I have the following PostgreSQL table:
CREATE TABLE staff (
id integer primary key,
full_name VARCHAR(100) NOT NULL,
department VARCHAR(100) NULL,
tier bigint
);
I filled the table with random data using the following block:
do $$
declare
begin
FOR counter IN 1 .. 100000 LOOP
INSERT INTO staff (id, full_name, department, tier)
VALUES (nextval('staff_sequence'),
random_string(10),
get_department(),
floor(random() * 5 + 1)::bigint);
end LOOP;
end; $$;
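The block above assumes a staff_sequence sequence and helper functions random_string and get_department; hypothetical definitions (just so the block can actually be run) could look like this:
CREATE SEQUENCE IF NOT EXISTS staff_sequence;
-- returns a random lowercase string of the requested length
CREATE OR REPLACE FUNCTION random_string(len integer) RETURNS text AS $f$
SELECT string_agg(substr('abcdefghijklmnopqrstuvwxyz', (random() * 25 + 1)::int, 1), '')
FROM generate_series(1, len);
$f$ LANGUAGE sql VOLATILE;
-- returns a random department name from a fixed list
CREATE OR REPLACE FUNCTION get_department() RETURNS text AS $f$
SELECT (ARRAY['sales', 'engineering', 'hr', 'finance'])[floor(random() * 4 + 1)::int];
$f$ LANGUAGE sql VOLATILE;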
After the data is populated, I created an index on this table on the tier column:
create index staff_tier_idx on staff(tier);
Although I created this index, when I execute a query using this column, I want this index NOT to be used. To accomplish this, I tried to execute this query:
select count(*) from staff where tier=1::numeric;
Due to the mismatching data types on the indexed column and the query condition, I thought the index would not be used and a sequential scan would be executed instead. However, when I run EXPLAIN ANALYZE on the above query I get the following output:
Aggregate (cost=2349.54..2349.55 rows=1 width=8) (actual time=17.078..17.079 rows=1 loops=1)
-> Index Only Scan using staff_tier_idx on staff (cost=0.29..2348.29 rows=500 width=0) (actual time=0.022..15.925 rows=19942 loops=1)
Filter: ((tier)::numeric = '1'::numeric)
Rows Removed by Filter: 80058
Heap Fetches: 0
Planning Time: 0.305 ms
Execution Time: 17.130 ms
Showing that the index has indeed been used.
How do I change this so that the query uses a sequential scan instead of the index? This is purely for testing/learning purposes.
If it's of any importance, I am running this on an Amazon RDS database instance.
From the "Filter" rows of the plan like
Rows Removed by Filter: 80058
you can see that the index is not being used as a real index, but just as a skinny table, testing the cast condition for each row. This appears favorable because the index is less than 1/4 the size of the table, while the default ratio of random_page_cost/seq_page_cost = 4.
In addition to just outright disabling index scans as Adrian already suggested, you could also discourage this "skinny table" usage by just increasing random_page_cost, since pages of indexes are assumed to be read in random order.
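For example, something like this for the current session (the value 8 is just an illustration; the default random_page_cost is 4):
SET random_page_cost = 8;  -- makes index pages look more expensive to read
EXPLAIN ANALYZE
SELECT count(*) FROM staff WHERE tier = 1::numeric;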
Another method would be to change the query so it can't use the index-only scan. For example, just using count(full_name) would do that, as PostgreSQL then needs to visit the table to make sure full_name is not NULL (even though it has a constraint asserting that already--sometimes it is not very clever)
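A sketch of that variant:
-- counting a column that is not in the index rules out the index-only scan
EXPLAIN ANALYZE
SELECT count(full_name) FROM staff WHERE tier = 1::numeric;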
Which method is better depends on what it is you are wanting to test/learn.

Does Postgres use indexes if casting timestamp to date?

Let's say I have a table with some columns and a column dt which is of type TIMESTAMP.
I create a (non-functional) index on this column.
Then I execute a query
SELECT *
FROM tbl
WHERE
dt::DATE = NOW()::DATE
The question is: will Postgres use the index I created earlier, and under which circumstances will it or won't it?
I understand that a functional index would cover this case, but does a simple index cover both cases or not when it's a TIMESTAMP -> DATE type conversion?
EDIT:
performing an EXPLAIN ANALYZE on the query shows that it does not use the index and performs a sequential scan (table with 3+ million records):
Seq Scan on tbl (cost=0.00..192289.92 rows=17043 width=12) (actual time=7.237..2493.496 rows=4928 loops=1)
Filter: ((dt)::date = (now())::date)
Rows Removed by Filter: 3397155
Total runtime: 2494.546 ms
Let me ask the question differently then: is it possible to make Postgres use this index, or should I create another one?
A simple index will not work in this case; try it with EXPLAIN.
What you could do to use the simple index is
WHERE dt >= current_date::timestamptz
AND dt < (current_date + 1)::timestamptz
I think that this is pretty readable and the best solution, but if you want to go with your current query, you'll have to add a second index on (dt::date).
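For reference, that second (expression) index could look like this; the double parentheses are required for an expression index, and the name is only illustrative:
CREATE INDEX tbl_dt_date_idx ON tbl ((dt::date));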
Don't forget that every additional index costs space and slows down the performance of data modifying statements.

PostgreSQL index not used for query on IP ranges

I'm using PostgreSQL 9.2 and have a table of IP ranges. Here's the SQL:
CREATE TABLE ips (
id serial NOT NULL,
begin_ip_num bigint,
end_ip_num bigint,
country_name character varying(255),
CONSTRAINT ips_pkey PRIMARY KEY (id)
)
I've added plain B-tree indices on both begin_ip_num and end_ip_num:
CREATE INDEX index_ips_on_begin_ip_num ON ips (begin_ip_num);
CREATE INDEX index_ips_on_end_ip_num ON ips (end_ip_num);
The query being used is:
SELECT ips.* FROM ips
WHERE 3065106743 BETWEEN begin_ip_num AND end_ip_num;
The problem is that my BETWEEN query is only using the index on begin_ip_num. After using the index, it filters the result using end_ip_num. Here's the EXPLAIN ANALYZE result:
Index Scan using index_ips_on_begin_ip_num on ips (cost=0.00..2173.83 rows=27136 width=76) (actual time=16.349..16.350 rows=1 loops=1)
Index Cond: (3065106743::bigint >= begin_ip_num)
Filter: (3065106743::bigint <= end_ip_num)
Rows Removed by Filter: 47596
Total runtime: 16.425 ms
I've already tried various combinations of indices including adding a composite index on both begin_ip_num and end_ip_num.
Try a multicolumn index, but with reversed order on the second column:
CREATE INDEX index_ips_begin_end_ip_num ON ips (begin_ip_num, end_ip_num DESC);
Ordering is mostly irrelevant for a single-column index, since it can be scanned backwards almost as fast. But it is important for multicolumn indexes.
With the index I propose, Postgres can scan the first column and find the address, where the rest of the index fulfills the first condition. Then it can, for each value of the first column, return all rows that fulfill the second condition, until the first one fails. Then jump to the next value of the first column, etc.
This is still not very efficient and Postgres may be faster just scanning the first index column and filtering for the second. Very much depends on your data distribution.
Either way, CLUSTER using the multicolumn index from above can help performance:
CLUSTER ips USING index_ips_begin_end_ip_num
This way, candidates fulfilling your first condition are packed onto the same or adjacent data pages. It can help performance a lot if you have lots of rows per value of the first column; otherwise it is hardly effective.
(There are also non-blocking external tools for the purpose: pg_repack or pg_squeeze.)
Also, is autovacuum running and configured properly or have you run ANALYZE on the table? You need current statistics for Postgres to pick appropriate query plans.
What would really help here is a GiST index on an int8range column, available since PostgreSQL 9.2. See:
Optimizing queries on a range of timestamps (two columns)
If your IP ranges can be covered with one of the built-in network types inet or cidr, consider replacing your two bigint columns. Or, better yet, look to the additional module ip4r by Andrew Gierth (not in the standard distribution). The indexing strategy changes accordingly.
Barring that, you can check out this related answer on dba.SE using a sophisticated regime of partial indexes. Advanced stuff, but it delivers great performance:
Can spatial index help a “range - order by - limit” query
I had exactly this same problem on a nearly identical dataset from maxmind.com's free geoip table. I solved it using Erwin's tip about range types and GiST indexes. The GiST index was key. Without it I was querying at best about 3 rows per second. With it I queried nearly 500000 rows in under 10 seconds! Since Erwin didn't post detailed instructions on how to do this, I thought I'd add them here...
First of all, you must add a new column having the range type; note that int8range is required for bigint types. Next, set its values appropriately; note that the '[]' parameter makes the range inclusive at the lower and upper bounds (rtfm). Finally, add the index; note that the GiST index is where all the performance advantage comes from.
alter table ips add column iprange int8range;
update ips set iprange=int8range(begin_ip_num, end_ip_num, '[]');
create index index_ips_on_iprange on ips using gist (iprange);
Having laid the groundwork, you can now use the '<@' contained-by operator to search specific addresses against the table. See http://www.postgresql.org/docs/9.2/static/functions-range.html
SELECT "ips".* FROM "ips" WHERE (3065106743::bigint <@ iprange);
I'm a bit late to this party, but this is what works really well for me.
Consider installing the ip4r extension. It basically allows you to define a column that can hold IP ranges. The name of the extension implies it is just for IPv4, but it currently supports IPv6 as well.
After you populate the table with ranges in that column, all you need to do is create a GiST index:
CREATE INDEX ip_zip_ip4_range ON ip_zip USING gist (ip4_range);
I have almost 10 million ranges in my database, but queries take a fraction of a millisecond:
region=> select count(*) from ip_zip ;
count
---------
9566133
region=> explain analyze select * from ip_zip where '8.8.8.8'::ip4 <<= ip4_range;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ip_zip (cost=234.55..25681.29 rows=9566 width=22) (actual time=0.085..0.086 rows=1 loops=1)
Recheck Cond: ('8.8.8.8'::ip4r <<= ip4_range)
Heap Blocks: exact=1
-> Bitmap Index Scan on ip_zip_ip4_range (cost=0.00..232.16 rows=9566 width=0) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: ('8.8.8.8'::ip4r <<= ip4_range)
Planning time: 0.106 ms
Execution time: 0.118 ms
(7 rows)
region=> explain analyze select * from ip_zip where '254.50.22.54'::ip4 <<= ip4_range;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ip_zip (cost=234.55..25681.29 rows=9566 width=22) (actual time=0.059..0.059 rows=1 loops=1)
Recheck Cond: ('254.50.22.54'::ip4r <<= ip4_range)
Heap Blocks: exact=1
-> Bitmap Index Scan on ip_zip_ip4_range (cost=0.00..232.16 rows=9566 width=0) (actual time=0.048..0.048 rows=1 loops=1)
Index Cond: ('254.50.22.54'::ip4r <<= ip4_range)
Planning time: 0.102 ms
Execution time: 0.145 ms
(7 rows)
I believe your query looks like WHERE [constant] BETWEEN begin_ip_num AND end_ip_num, or something equivalent.
As far as I know Postgres doesn't have an "AND-EQUAL" access plan, so you need to add a composite index on the 2 columns as suggested by Erwin Brandstetter.

Postgres combining multiple Indexes

I have the following table/indexes -
CREATE TABLE test
(
coords geography(Point,4326),
user_id varchar(50),
created_at timestamp
);
CREATE INDEX ix_coords ON test USING GIST (coords);
CREATE INDEX ix_user_id ON test (user_id);
CREATE INDEX ix_created_at ON test (created_at DESC);
This is the query I want to execute:
select *
from updates
where ST_DWithin(coords, ST_MakePoint(-126.4, 45.32)::geography, 30000)
and user_id='3212312'
order by created_at desc
limit 60
When I run the query it only uses ix_coords index. How can I ensure that Postgres uses ix_user_id and ix_created_at index as well for the query?
This is a new table in which I did bulk insert of production data. Total rows in the test table: 15,069,489
I am running PostgreSQL 9.2.1 (with Postgis) with (effective_cache_size = 2GB). This is my local OSX with 16GB RAM, Core i7/2.5 GHz, non-SSD disk.
Adding the EXPLAIN ANALYZE output -
Limit (cost=71.64..71.65 rows=1 width=280) (actual time=1278.652..1278.665 rows=60 loops=1)
-> Sort (cost=71.64..71.65 rows=1 width=280) (actual time=1278.651..1278.662 rows=60 loops=1)
Sort Key: created_at
Sort Method: top-N heapsort Memory: 33kB
-> Index Scan using ix_coords on test (cost=0.00..71.63 rows=1 width=280) (actual time=0.198..1278.227 rows=178 loops=1)
Index Cond: (coords && '0101000020E61000006666666666E63C40C3F5285C8F824440'::geography)
Filter: (((user_id)::text = '4f1092000b921a000100015c'::text) AND ('0101000020E61000006666666666E63C40C3F5285C8F824440'::geography && _st_expand(coords, 30000::double precision)) AND _st_dwithin(coords, '0101000020E61000006666666666E63C40C3F5285C8F824440'::geography, 30000::double precision, true))
Rows Removed by Filter: 3122459
Total runtime: 1278.701 ms
UPDATE:
Based on the suggestions below I tried an index on coords + user_id:
CREATE INDEX ix_coords_and_user_id ON updates USING GIST (coords, user_id);
...but I get the following error:
ERROR: data type character varying has no default operator class for access method "gist"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
UPDATE:
So the CREATE EXTENSION btree_gist; solved the btree/gist compound index issue. And now my index looks like
CREATE INDEX ix_coords_user_id_created_at ON test USING GIST (coords, user_id, created_at);
NOTE: btree_gist does not accept DESC/ASC.
New query plan:
Limit (cost=134.99..135.00 rows=1 width=280) (actual time=273.282..273.292 rows=60 loops=1)
-> Sort (cost=134.99..135.00 rows=1 width=280) (actual time=273.281..273.285 rows=60 loops=1)
Sort Key: created_at
Sort Method: quicksort Memory: 41kB
-> Index Scan using ix_updates_coords_user_id_created_at on updates (cost=0.00..134.98 rows=1 width=280) (actual time=0.406..273.110 rows=115 loops=1)
Index Cond: ((coords && '0101000020E61000006666666666E63C40C3F5285C8F824440'::geography) AND ((user_id)::text = '4e952bb5b9a77200010019ad'::text))
Filter: (('0101000020E61000006666666666E63C40C3F5285C8F824440'::geography && _st_expand(coords, 30000::double precision)) AND _st_dwithin(coords, '0101000020E61000006666666666E63C40C3F5285C8F824440'::geography, 30000::double precision, true))
Rows Removed by Filter: 1
Total runtime: 273.331 ms
The query is performing better than before, almost a second better, but still not great. I guess this is the best that I can get? I was hoping for somewhere around 60-80 ms. Also, taking ORDER BY created_at DESC out of the query shaves off another 100 ms, meaning it is unable to use the index for the sort. Any way to fix this?
I don't know if Pg can combine a GiST index and regular b-tree indexes with a bitmap index scan, but I suspect not. You may be getting the best result you can without adding a user_id column to your GiST index (and consequently making it bigger and slower for other queries that don't use user_id).
As an experiment you could:
CREATE EXTENSION btree_gist;
CREATE INDEX ix_coords_and_user_id ON test USING GIST (coords, user_id);
which is likely to result in a big index, but might boost that query - if it works. Be aware that maintaining such an index will significantly slow INSERTs and UPDATEs. If you drop the old ix_coords your queries will use ix_coords_and_user_id even if they don't filter on user_id, but it'll be slower than ix_coords. Keeping both will make the INSERT and UPDATE slowdown even worse.
See btree-gist
(Obsoleted by edit to question that changes the question completely; when written the user had a multicolumn index they've now split into two separate ones):
You don't seem to be filtering or sorting on user_id, only create_date. Pg won't (can't?) use only the second term of a multi-column index like (user_id, create_date); it needs to use the first item too.
If you want to index create_date, create a separate index for it. If you use and need the (user_id, create_date) index and don't generally use user_id alone, see if you can reverse the column order. Alternately, create two independent indexes, (user_id) and (create_date). When both columns are needed, Pg can combine the two independent indexes using a bitmap index scan.
I think Craig is correct with his answer, but I just wanted to add a few things (and it wouldn't fit in a comment)
You have to work pretty hard to force PostgreSQL to use an index. The query optimizer is smart, and there are times when it will believe that a sequential table scan will be faster. It is usually right! :) But you can play with some settings (such as seq_page_cost, random_page_cost, etc.) to try and get it to favor an index. Here is a link to some of the configurations that you might want to examine if you feel like it is not making the correct decision. But, again... my experience is that most of the time, Postgres is smarter than I am! :)
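For experimenting, those settings can be changed per session, e.g. (illustrative values; the defaults are seq_page_cost = 1 and random_page_cost = 4):
SET random_page_cost = 1.1;  -- makes index access look cheaper relative to a sequential scan
-- run your query or EXPLAIN here, then put the setting back:
RESET random_page_cost;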
Hope this helps you (or someone in the future).