Slow LIKE query in Postgres

I have 20 million records in a table whose schema is as follows:

FieldName    Datatype
id           bigint (auto-increment, primary key)
name         varchar(255)
phone        varchar(255)
deleted_at   timestamp
created_at   timestamp
updated_at   timestamp

It has indexes on the name and phone columns:

Column   Index type
name     GIN trgm index
phone    btree index, GIN trgm index
I created the indexes using the following commands:
CREATE INDEX btree_idx ON contacts USING btree (phone);
CREATE INDEX trgm_idx ON contacts USING GIN (phone gin_trgm_ops);
CREATE INDEX trgm_idx_name ON contacts USING GIN (name gin_trgm_ops);
I am running the query below:
select * from contacts where phone like '%6666666%' limit 15;
I am doing a contains query on phone. The above query takes more than 5 minutes to return a result. Let me provide the explain output:
explain analyse select * from contacts where phone like '%6666666%' limit 15;
Limit (cost=1774.88..1830.57 rows=15 width=65) (actual time=7970.553..203001.985 rows=15 loops=1)
-> Bitmap Heap Scan on contacts (cost=1774.88..10819.13 rows=2436 width=65) (actual time=7970.552..203001.967 rows=15 loops=1)
Recheck Cond: ((phone)::text ~~ '%6666666%'::text)
Rows Removed by Index Recheck: 254869
Heap Blocks: lossy=2819
-> Bitmap Index Scan on trgm_idx (cost=0.00..1774.27 rows=2436 width=0) (actual time=6720.978..6720.978 rows=306226 loops=1)
Index Cond: ((phone)::text ~~ '%6666666%'::text)
Planning Time: 0.139 ms
Execution Time: 203002.791 ms
What can I do to optimize my query? Bringing the result under 5 seconds would be optimal.

One cause of the bad performance is probably
Heap Blocks: lossy=2819
Your work_mem setting is too small to contain a bitmap with one bit per table row, so PostgreSQL degrades it to one bit per 8kB block. This leads to many more rechecks than necessary.
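A minimal sketch of the remedy, assuming you can spare the memory (the 256MB value is an assumption; size it to your system):

-- Raise work_mem so the bitmap stays exact (one bit per row) instead
-- of lossy (one bit per 8kB page), avoiding most of the recheck work.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM contacts WHERE phone LIKE '%6666666%' LIMIT 15;

If the plan then reports Heap Blocks: exact=... instead of lossy=..., the recheck overhead should drop accordingly.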
Also, your test is bad. The search string contains only the trigram 666, which will match many rows that don't satisfy the query and have to be removed during recheck. A trigram index is not effective in this pathological case. Test with a number that contains more digits.
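You can see why with pg_trgm's helper function:

-- show_trgm() reveals the trigrams pg_trgm extracts. Apart from the
-- space-padded boundary trigrams (which a '%...%' substring match
-- cannot use), the whole string collapses into the single trigram 666.
SELECT show_trgm('6666666');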

Related

Postgres Query uses Bitmap index scan on GIN indexed item. Does not GIN use B-Tree internally?

From the Postgres docs on GIN -
Internally, a GIN index contains a B-tree index constructed over keys.
But in my use case, I see a bitmap index scan instead -
My schema and indexes are created as below.
CREATE TABLE customer (
jdata jsonb NOT NULL,
comments text NULL
);
CREATE INDEX idx_jdata ON customer USING gin (jdata jsonb_path_ops);
Say I inserted 10K records (sample prepared data / alternate link).
explain analyze select jdata from customer where jdata @> '{"supplier":{"id":"7f5644ca-f0d3-4f50-947b-9e3e38f7b796"}}'
Output -
Bitmap Heap Scan on file_set (cost=2139.27..36722.32 rows=10744 width=18) (actual time=1.438..267.122 rows=4048 loops=1)
Recheck Cond: (jdata @> '{"supplier": {"id": "7f5644ca-f0d3-4f50-947b-9e3e38f7b796"}}'::jsonb)
Heap Blocks: exact=1197
-> Bitmap Index Scan on idx_jdata (cost=0.00..2136.58 rows=10744 width=0) (actual time=1.214..1.214 rows=4048 loops=1)
Index Cond: (jdata @> '{"supplier": {"id": "7f5644ca-f0d3-4f50-947b-9e3e38f7b796"}}'::jsonb)
Planning Time: 0.065 ms
Execution Time: 267.456 ms
(the query plan above is based on real output from Postgres - I had to change table/column names)
Why is a bitmap index used when a GIN index was created?
My Postgres DB version is 13.5.
After looking at the comment from @a-horse-with-no-name, I tried the following:
SET enable_seqscan = OFF;
explain analyze
select * from pg_opclass where "oid"=10003;
SET enable_seqscan = ON;
and the output -
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------+
Index Scan using pg_opclass_oid_index on pg_opclass (cost=0.14..8.16 rows=1 width=93) (actual time=0.015..0.016 rows=1 loops=1)|
Index Cond: (oid = '10003'::oid) |
Planning Time: 0.060 ms |
Execution Time: 0.027 ms |
I see a difference -
"Bitmap Index Scan on idx_jdata" vs "Index Scan using pg_opclass_oid_index on pg_opclass"
Does this mean anything important? Can someone add more details about the "using" vs "on", and about pg_opclass?
A GIN index can contain the same tuple pointer multiple times, listed under the different tokens the field is broken down into. Those need to be deduplicated to make sure it doesn't return duplicate rows where it should not, and the way GIN chooses to do this is by forcing the scan to go through a bitmap index scan, which inherently deduplicates the pointers. It is not the index that is a bitmap; it is the scan that uses bitmaps. Any index can be used in a bitmap index scan, but GIN indexes can only be used in bitmap scans, due to the need to deduplicate.
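A minimal demonstration, reusing the table and index from the question:

-- Even with sequential scans discouraged, the GIN index is reached via
-- a Bitmap Index Scan, never a plain Index Scan, because GIN entries
-- must be deduplicated and the bitmap machinery does exactly that.
SET enable_seqscan = off;
EXPLAIN
SELECT jdata FROM customer
WHERE jdata @> '{"supplier": {"id": "7f5644ca-f0d3-4f50-947b-9e3e38f7b796"}}';
RESET enable_seqscan;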

When to use multi column indexes?

I have the following table with about 10 million rows
CREATE TABLE "autocomplete_books"
(
id uuid PRIMARY KEY DEFAULT uuid_generate_v4 (),
author_id uuid NOT NULL REFERENCES "author"(id) ON DELETE CASCADE,
language VARCHAR(30) NOT NULL,
name VARCHAR(100) NOT NULL,
importance_rank SMALLINT NOT NULL DEFAULT 1
);
I have the following query
SELECT DISTINCT ON (author_id)
author_id,
similarity(name, $1) as score,
language, name, importance_rank
FROM
"autocomplete_books"
WHERE
$1 % name AND language IN ($2, $3, $4)
ORDER BY
author_id, score DESC, importance_rank DESC
LIMIT
10
I am querying primarily on similarity, since this is an autocomplete endpoint, so I have a trigram index on name. I am also sorting on some other fields. I am not sure how the score field will mix with my other indexes, and whether it is better to have a compound index like so:
Option 1
CREATE INDEX ON "autocomplete_books" USING GIN (name gin_trgm_ops);
CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language, importance_rank DESC);
Or whether I should break them out like so:
Option 2
CREATE INDEX ON "autocomplete_books" USING GIN (name gin_trgm_ops);
CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language, importance_rank DESC);
CREATE INDEX ON "autocomplete_books" USING BTREE (language);
CREATE INDEX ON "autocomplete_books" USING BTREE (importance_rank DESC);
Here is the output of explain analyze run on 220k rows with the following indexes:
CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language);
CREATE INDEX ON "autocomplete_books" USING BTREE (importance_rank DESC);
Limit (cost=762.13..762.38 rows=50 width=82) (actual time=12.230..13.024 rows=50 loops=1)
-> Unique (cost=762.13..763.23 rows=217 width=82) (actual time=12.223..12.686 rows=50 loops=1)
-> Sort (cost=762.13..762.68 rows=220 width=82) (actual time=12.216..12.373 rows=50 loops=1)
Sort Key: author_id, (similarity((name)::text, 'sale'::text)) DESC, importance_rank DESC
Sort Method: quicksort Memory: 45kB
-> Bitmap Heap Scan on "books_autocomplete" mat (cost=45.71..753.57 rows=220 width=82) (actual time=1.905..11.610 rows=149 loops=1)
Recheck Cond: ('sale'::text % (name)::text)
Rows Removed by Index Recheck: 2837
Filter: ((language)::text = ANY ('{language1,language2,language3}'::text[]))
Heap Blocks: exact=2078
-> Bitmap Index Scan on "books_autocomplete_name_idx" (cost=0.00..45.65 rows=220 width=0) (actual time=1.551..1.557 rows=2986 loops=1)
Index Cond: ('sale'::text % (name)::text)
Planning time: 13.976 ms
Execution time: 13.545 ms
An index will only help you with sorting if all expressions in the ORDER BY clause are in the index, and you can't do that because of the second expression.
Also, only B-tree indexes are useful for supporting ORDER BY. Moreover, you cannot combine multiple indexes when you want to use ORDER BY, and you say that $1 % name is your most selective criterion, so you probably want to use an index on that.
There are two paths this query can take:
Go for the $1 % name condition with a trigram GIN index on name.
This is what the execution plan in your question does.
Then you'll have to live with that Sort, because you can't use an index for it. The danger here is that the bitmap index scan will find so many rows that the bitmap heap scan is quite expensive.
If there is an index that is an exact match for the ORDER BY clause:
CREATE INDEX ON autocomplete_books
(author_id, score DESC, importance_rank DESC);
you can scan the index and fetch rows in ORDER BY order until you have 10 that match the filter condition $1 % name. The danger here is that it may take longer than expected to find the 10 rows.
Try first with only the one index, then with only the other index and run the query with different parameters on a data set of realistic size to see what works out best.
You should drop all other indexes than these two, because they won't do any good for this query.
If one of the two strategies is a clear winner, drop the other index so the optimizer is not tempted to use it. Otherwise keep both and hope that the optimizer picks the right one depending on the parameters.
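A sketch of such a test; the search term 'sale' and the language list are placeholder assumptions:

-- Strategy 1: the trigram GIN index alone. Create it, measure, roll
-- back, then repeat the same steps with the ORDER BY index instead.
BEGIN;
CREATE INDEX ON autocomplete_books USING gin (name gin_trgm_ops);
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (author_id)
       author_id, similarity(name, 'sale') AS score,
       language, name, importance_rank
FROM autocomplete_books
WHERE 'sale' % name AND language IN ('en', 'de', 'fr')
ORDER BY author_id, score DESC, importance_rank DESC
LIMIT 10;
ROLLBACK;  -- discards the test index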

What's wrong with GIN index, can't avoid SEQ scan?

I've created a table like this,
create table mytable(hash char(40), title varchar(500));
create index name_fts on mytable using gin(to_tsvector('english', 'title'));
CREATE UNIQUE INDEX md5_uniq_idx ON mytable(hash);
When I query the title,
test=# explain analyze select * from mytable where to_tsvector('english', title) @@ 'abc | def'::tsquery limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..277.35 rows=10 width=83) (actual time=0.111..75.549 rows=10 loops=1)
-> Seq Scan on mytable (cost=0.00..381187.45 rows=13744 width=83) (actual time=0.110..75.546 rows=10 loops=1)
Filter: (to_tsvector('english'::regconfig, (title)::text) @@ '''abc'' | ''def'''::tsquery)
Rows Removed by Filter: 10221
Planning time: 0.176 ms
Execution time: 75.564 ms
(6 rows)
The index is not used. Any ideas? I have 10m rows.
There is a typo in your index definition; it should be
ON mytable USING gin (to_tsvector('english', title))
instead of
ON mytable USING gin (to_tsvector('english', 'title'))
The way you wrote it, it is a constant and not a field that is indexed, and such an index would indeed be useless for a search like the one you perform.
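A complete corrected sketch:

-- Index the expression over the column, not a string constant:
DROP INDEX name_fts;
CREATE INDEX name_fts ON mytable USING gin (to_tsvector('english', title));

-- The query's expression now matches the indexed expression exactly,
-- so the planner can use the index:
EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE to_tsvector('english', title) @@ 'abc | def'::tsquery
LIMIT 10;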
To see if an index can be used, you can execute
SET enable_seqscan=off;
and then run the query again.
If the index is still not used, the index probably cannot be used.
In addition to the above, there is something that strikes me as strange with your execution plan. PostgreSQL estimates that a sequential scan of mytable will return 13744 rows and not 10 million as you say there are. Did you disable autovacuum or is there something else that could cause your table statistics to be that inaccurate?
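If the statistics are merely stale, refreshing them is cheap:

-- Recompute planner statistics for the table (useful if autovacuum was
-- disabled or has not caught up yet).
ANALYZE mytable;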

Faster search in INTARRAY column

I have a table with approximately 300,000 rows with an INT[] column.
Each array contains approximately 2000 elements.
I created an index for this array column:
create index index_name ON table_name USING GIN (column_name)
Then I run this query:
SELECT COUNT(*)
FROM table_name
WHERE
column_name @> ARRAY[1777]
This query runs very slowly (Execution time: 66886.132 ms) and, as EXPLAIN ANALYZE shows, it does not use the GIN index; only a Seq Scan is used.
Why does Postgres not use the GIN index, and most importantly: how can I run the above query as fast as possible?
EDIT
This is the result from explain (analyze, verbose) for the above query:
Aggregate (cost=10000024724.75..10000024724.76 rows=1 width=0) (actual time=61087.513..61087.513 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.users (cost=10000000000.00..10000024724.00 rows=300 width=0) (actual time=12104.651..61087.500 rows=5 loops=1)
Output: id, email, pass, nick, reg_dt, reg_ip, gender, curr_location, about, followed_tag_ids, avatar_img_ext, rep_tag_ids, rep_tag_id_scores, stats, status
Filter: (users.rep_tag_ids @> '{1777}'::integer[])
Rows Removed by Filter: 299995
Planning time: 0.110 ms
Execution time: 61087.564 ms
These are the table and index definitions:
CREATE TABLE users
(
id serial PRIMARY KEY,
rep_tag_ids integer[] DEFAULT '{}'
-- other columns here
);
create index users_rep_tag_ids_idx ON users USING GIN (rep_tag_ids);
You should help the query optimizer to use the index. Install the intarray extension for PostgreSQL if you don't have it yet, and then recreate your index using the gin__int_ops operator class.
DROP INDEX users_rep_tag_ids_idx;
CREATE INDEX users_rep_tag_ids_idx ON users USING gin (rep_tag_ids gin__int_ops);
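For completeness, the prerequisite step (intarray ships with the contrib package, which is assumed to be installed on the server):

-- Run once per database before gin__int_ops becomes available.
CREATE EXTENSION IF NOT EXISTS intarray;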

GIN index not used for small table when 0 rows returned

In a Postgres 9.4 database, I created a GIN trigram index on a table called 'persons' that contains 1514 rows like the following:
CREATE INDEX persons_index_name_1 ON persons
USING gin (lower(name) gin_trgm_ops);
and a query that looks for similar names as follows:
select name, last_name from persons where lower(name) % 'thename'
So I first issued a query with a name that I knew beforehand would have similar matches, and explain analyze showed that the index I created was used in this query:
select name, last_name from persons where lower(name) % 'george'
And the results were as expected:
-> Bitmap Heap Scan on persons (cost=52.01..58.72 rows=2 width=26) (actual time=0.054..0.065 rows=1 loops=1)
Recheck Cond: (lower((name)::text) % 'george'::text)
Rows Removed by Index Recheck: 2
Heap Blocks: exact=1
-> Bitmap Index Scan on persons_index_name_1 (cost=0.00..52.01 rows=2 width=0) (actual time=0.032..0.032 rows=3 loops=1)
Index Cond: (lower((name)::text) % 'george'::text)
...
Execution time: 1.382 ms
So, out of curiosity, I wanted to see if the index was used when the thename parameter contained a name that didn't exist at all in the table:
select name, last_name from persons where lower(name) % 'noname'
But I saw that in this case the index was not used at all and the execution time was way higher:
-> Seq Scan on persons (cost=0.00..63.72 rows=2 width=26) (actual time=6.494..6.494 rows=0 loops=1)
Filter: (lower((name)::text) % 'noname'::text)
Rows Removed by Filter: 1514
...
Execution time: 7.387 ms
As a test, I tried the same with a GIST index and in both cases, the index was used and the execution time was like the first case above.
I went ahead and recreated the table, this time inserting 10014 rows, and I saw that in both cases above the GIN index was used and the execution time was the best for those cases.
Why is a GIN index not used when the query above returns no results in a table with not so many rows (1514 in my case)?
Trigram indexes are case insensitive, test with:
select 'case' <-> 'CASE' AS ci1
, 'case' % 'CASE' AS ci2
, 'CASE' <-> 'CASE' AS c1
, 'CASE' % 'CASE' AS c2;
So you might as well just:
CREATE INDEX persons_index_name_1 ON persons USING gin (name gin_trgm_ops);
And:
select name, last_name from persons where name % 'thename';
As to your actual question, for small tables an index look-up might not pay. That's exactly what your added tests demonstrate. And establishing that nothing matches can be more expensive than finding some matches.
Aside from that, your cost setting and / or table statistics may not be at their respective optimum to let Postgres pick the most adequate query plans.
The expected cost numbers translate to much higher actual cost for the sequential scan than for the bitmap index scan. You may be overestimating the cost of index scans as compared to sequential scans. random_page_cost (and cpu_index_tuple_cost) may be set too high and effective_cache_size too low.
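A sketch of the knobs to experiment with; the values are assumptions and depend on your hardware:

-- Lower random_page_cost toward seq_page_cost for SSDs or well-cached
-- data; raise effective_cache_size toward the RAM available for caching.
SET random_page_cost = 1.1;        -- default is 4.0
SET effective_cache_size = '4GB';
EXPLAIN ANALYZE
SELECT name, last_name FROM persons WHERE lower(name) % 'noname';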
Related: Keep PostgreSQL from sometimes choosing a bad query plan