When to use multi column indexes? - postgresql

I have the following table with about 10 million rows
CREATE TABLE "autocomplete_books"
(
id uuid PRIMARY KEY DEFAULT uuid_generate_v4 (),
author_id uuid NOT NULL REFERENCES "author"(id) ON DELETE CASCADE,
language VARCHAR(30) NOT NULL,
name VARCHAR(100) NOT NULL,
importance_rank SMALLINT NOT NULL DEFAULT 1
);
I have the following query
SELECT DISTINCT ON (author_id)
author_id,
similarity(name, $1) as score,
language, name, importance_rank
FROM
"autocomplete_books"
WHERE
$1 % name AND language IN ($2, $3, $4)
ORDER BY
author_id, score DESC, importance_rank DESC
LIMIT
10
I am querying primarily on similarity as this is an autocomplete endpoint, so I have a trigram index on name. I am also sorting on some other fields. I am not sure how the score field will mix with my other indexes and whether it is better to have a compound index like so
Option 1
CREATE INDEX ON "autocomplete_books" USING GIN (name gin_trgm_ops);
CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language, importance_rank DESC);
Or if I should break them out like so
Option 2
CREATE INDEX ON "autocomplete_books" USING GIN (name gin_trgm_ops);
CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language, importance_rank DESC);
CREATE INDEX ON "autocomplete_books" USING BTREE (language);
CREATE INDEX ON "autocomplete_books" USING BTREE (importance_rank DESC);
Here is the output of explain analyze run on 220k rows with the following indexes
CREATE INDEX ON "autocomplete_books" USING BTREE (author_id, language);
CREATE INDEX ON "autocomplete_books" USING BTREE (importance_rank DESC);
Limit (cost=762.13..762.38 rows=50 width=82) (actual time=12.230..13.024 rows=50 loops=1)
-> Unique (cost=762.13..763.23 rows=217 width=82) (actual time=12.223..12.686 rows=50 loops=1)
-> Sort (cost=762.13..762.68 rows=220 width=82) (actual time=12.216..12.373 rows=50 loops=1)
Sort Key: author_id, (similarity((name)::text, 'sale'::text)) DESC, importance_rank DESC
Sort Method: quicksort Memory: 45kB
-> Bitmap Heap Scan on "books_autocomplete" mat (cost=45.71..753.57 rows=220 width=82) (actual time=1.905..11.610 rows=149 loops=1)
Recheck Cond: ('sale'::text % (name)::text)
Rows Removed by Index Recheck: 2837
Filter: ((language)::text = ANY ('{language1,language2,language3}'::text[]))
Heap Blocks: exact=2078
-> Bitmap Index Scan on "books_autocomplete_name_idx" (cost=0.00..45.65 rows=220 width=0) (actual time=1.551..1.557 rows=2986 loops=1)
Index Cond: ('sale'::text % (name)::text)
Planning time: 13.976 ms
Execution time: 13.545 ms

An index will only help you with sorting if all expressions in the ORDER BY clause are in the index, and you can't do that because of the second expression.
Also, only b-tree indexes are useful for supporting ORDER BY. Now you cannot combine multiple indexes when you want to use ORDER BY, and you say that $1 % name is your most selective criterion, so you probably want to use an index on that.
There are two ways this query can be executed:
Go for the $1 % name condition with a trigram GIN index on name.
This is what the execution plan in your question does.
Then you'll have to live with that Sort, because you can't use an index for it. The danger here is that the bitmap index scan will find so many rows that the bitmap heap scan is quite expensive.
If there is an index that is an exact match for the ORDER BY clause:
CREATE INDEX ON autocomplete_books
(author_id, score DESC, importance_rank DESC);
you can scan the index and fetch rows in ORDER BY order until you have 10 that match the filter condition $1 % name. The danger here is that it may take longer than expected to find the 10 rows.
Try first with only the one index, then with only the other index and run the query with different parameters on a data set of realistic size to see what works out best.
You should drop all other indexes than these two, because they won't do any good for this query.
If one of the two strategies is a clear winner, drop the other index so the optimizer is not tempted to use it. Otherwise keep both and hope that the optimizer picks the right one depending on the parameters.
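One way to run that comparison is to drop one of the two indexes inside a transaction, run EXPLAIN (ANALYZE), and roll back, so nothing is lost. This is only a sketch: the index name below is the hypothetical default name, and the parameter values are the ones visible in the plan from the question.
BEGIN;
-- hypothetical default name for the composite index from Option 1
DROP INDEX autocomplete_books_author_id_language_importance_rank_idx;
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (author_id)
       author_id,
       similarity(name, 'sale') AS score,
       language, name, importance_rank
FROM "autocomplete_books"
WHERE 'sale' % name AND language IN ('language1', 'language2', 'language3')
ORDER BY author_id, score DESC, importance_rank DESC
LIMIT 10;
ROLLBACK;  -- the dropped index is restored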

Related

Slow Like Query in Postgres

I have 20 million records in a table whose schema is like below
FieldName Datatype
id bigint(Auto Inc,Primarykey)
name varchar(255)
phone varchar(255)
deleted_at timestamp
created_at timestamp
updated_at timestamp
It has indexes on the name and phone columns
Column Index type
name GIN trgm index
phone btree index, GIN trgm index
Created index using the following commands
CREATE INDEX btree_idx ON contacts USING btree (phone);
CREATE INDEX trgm_idx ON contacts USING GIN (phone gin_trgm_ops);
CREATE INDEX trgm_idx_name ON contacts USING GIN (name gin_trgm_ops);
I am running the below query
select * from contacts where phone like '%6666666%' limit 15;
I am doing a contains query on phone. The above query takes more than 5 minutes to return a result. Here is the explain output:
explain analyse select * from contacts where phone like '%6666666%' limit 15;
Limit (cost=1774.88..1830.57 rows=15 width=65) (actual time=7970.553..203001.985 rows=15 loops=1)
-> Bitmap Heap Scan on contacts (cost=1774.88..10819.13 rows=2436 width=65) (actual time=7970.552..203001.967 rows=15 loops=1)
Recheck Cond: ((phone)::text ~~ '%6666666%'::text)
Rows Removed by Index Recheck: 254869
Heap Blocks: lossy=2819
-> Bitmap Index Scan on trgm_idx (cost=0.00..1774.27 rows=2436 width=0) (actual time=6720.978..6720.978 rows=306226 loops=1)
Index Cond: ((phone)::text ~~ '%6666666%'::text)
Planning Time: 0.139 ms
Execution Time: 203002.791 ms
What can I do here to optimize my query? Bringing the result under 5 seconds would be optimal.
One cause of the bad performance is probably
Heap Blocks: lossy=2819
Your work_mem setting is too small to contain a bitmap with one bit per table row, so PostgreSQL degrades it to one bit per 8kB block. This leads to many more rechecks than necessary.
Also, your test is bad. The search string contains only the trigram 666, which will match many rows that don't satisfy the query and have to be removed during recheck. A trigram index is not effective in this pathological case. Test with a number that contains more digits.
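For example, a session-level experiment (a sketch; the work_mem value and the longer test number are purely illustrative):
SET work_mem = '256MB';  -- large enough for an exact (non-lossy) bitmap
EXPLAIN (ANALYZE, BUFFERS)
select * from contacts where phone like '%9876543210%' limit 15;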

Sorting a large spatial selection is not using GiST index (Postgres 11.5)

I have a table (demo) with a sequence as its primary key (seqno) and a geometry property contained within a JSONB column (doc). I have configured a primary key constraint for the sequence column and a GiST index for the geometry. I have already gathered statistics by running VACUUM ANALYZE. It's a fairly large table (42M rows).
CREATE TABLE demo
(
seqno bigint NOT NULL DEFAULT nextval('seqno'::regclass),
doc jsonb NOT NULL DEFAULT '{}'::jsonb,
CONSTRAINT demo_pkey PRIMARY KEY (seqno)
)
CREATE INDEX demo_doc_geometry_gist
ON demo USING gist (st_geometryfromtext(doc ->> 'geometry'::text))
I want to perform a spatial filter on a rather large area and return the first 10 rows, sorted by its primary key. Therefore, I have tried the following query:
SELECT seqno, doc
FROM demo
WHERE ST_Within(ST_GeometryFromText((doc->>'geometry')), ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno
LIMIT 10
This results in the following query plan:
Limit (cost=1000.59..15169.06 rows=10 width=633) (actual time=2479.372..2496.737 rows=10 loops=1)
-> Gather Merge (cost=1000.59..19780184.81 rows=13960 width=633) (actual time=2479.370..2496.732 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Index Scan using demo_pkey on demo (cost=0.56..19777573.45 rows=5817 width=633) (actual time=2440.310..2450.101 rows=5 loops=3)
Filter: (('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry ~ st_geometryfromtext((doc ->> 'geometry'::text))) AND _st_contains('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry, st_geometryfromtext((doc ->> 'geometry'::text))))
Rows Removed by Filter: 221313
Planning Time: 0.375 ms
Execution Time: 2496.786 ms
This shows that the primary key constraint index is used to scan all rows and perform the spatial filter on each row, which is obviously very inefficient. There are more than 5M matches for the given spatial predicate. The GiST index is not used at all.
However, when leaving out the ORDER BY clause, the GiST index for the geometry property is properly used, which is far more efficient.
Limit (cost=0.42..128.90 rows=10 width=633) (actual time=0.381..0.745 rows=10 loops=1)
-> Index Scan using demo_doc_geometry_gist on demo (cost=0.42..179352.99 rows=13960 width=633) (actual time=0.380..0.742 rows=10 loops=1)
Index Cond: ('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry ~ st_geometryfromtext((doc ->> 'geometry'::text)))
Filter: _st_contains('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry, st_geometryfromtext((doc ->> 'geometry'::text)))
Planning Time: 0.245 ms
Execution Time: 0.780 ms
Is there a way to make this query fast? Can we let the query planner combine the GiST index with the PK index to get a sorted result? Any other suggestions?
This shows that the primary key constraint index is used to scan all rows
It doesn't scan all rows, it stops after finding 10 of them which match. This would appear to be about 221313 * 3 + 10 rows, or about 1.6% of the total rows. It is not obvious that this is the wrong thing to do. You can suppress the usage of the primary key index by changing to ORDER BY seqno+0. This should use the GiST index, but I would not count on this being faster.
However, when leaving out the ORDER BY clause, the GiST index for the geometry property is properly used, which is far more efficient.
But it answers a far simpler question. Consider the difference between "find me 5 random people from Chicago" and "find me the 5 tallest people in Chicago".
As for making the query faster, I would try the ORDER BY seqno+0 trick. I don't think it will be faster, but I could be wrong.
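A sketch of that experiment, using the same polygon as in the question; only the ORDER BY expression changes:
SELECT seqno, doc
FROM demo
WHERE ST_Within(ST_GeometryFromText(doc->>'geometry'),
                ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno + 0  -- the expression no longer matches the primary key index
LIMIT 10;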
I would also try a btree index on (seqno, doc) so you can get an index-only-scan, although this would be much better if your geometry was in its own column, not embedded in JSONB, so you could index just the seqno and the geometry rather than the whole JSONB. In theory PostgreSQL could give you an index only scan for an index on (seqno, ST_GeometryFromText(doc->>'geometry')), but it just isn't clever enough to realize this.
You could also try a multi-column GiST index on (seqno, ST_GeometryFromText(doc->>'geometry')) using the btree_gist extension to enable the inclusion of seqno.
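A sketch of that index (the index name is made up; btree_gist provides the GiST operator class for bigint):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX demo_seqno_geometry_gist  -- hypothetical name
    ON demo USING gist (seqno, ST_GeometryFromText(doc ->> 'geometry'));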
Finally, you could try range partitioning your table on seqno. This would require a reorganization of your dataset, so isn't as simple as just building an index.
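A very rough sketch of what that could look like (the table and partition names are invented and the range boundaries are arbitrary; migrating the existing data is the real work):
CREATE TABLE demo_partitioned
(
    seqno bigint NOT NULL,
    doc jsonb NOT NULL DEFAULT '{}'::jsonb,
    PRIMARY KEY (seqno)
) PARTITION BY RANGE (seqno);
CREATE TABLE demo_p1 PARTITION OF demo_partitioned FOR VALUES FROM (1) TO (10000001);
-- one partition per seqno range, each with its own GiST expression index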
You can try to include the bounding box operator ~ in the query, as the docs say
This operand will make use of any indexes that may be available on the
geometries.
SELECT seqno, doc
FROM demo
WHERE ST_GeometryFromText((doc->>'geometry')) ~ ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))')
AND ST_Within(ST_GeometryFromText((doc->>'geometry')), ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno
LIMIT 10
Otherwise, you can run the query without the limit clause and with an offset of 0 to prevent inlining the subquery, then apply the limit.
SELECT * FROM (
SELECT seqno, doc
FROM demo
WHERE ST_Within(ST_GeometryFromText((doc->>'geometry')),
ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
OFFSET 0
) sub
ORDER BY seqno
LIMIT 10

Faster search in INTARRAY column

I have a table with approximately 300,000 rows with an INT[] column type.
Each array contains approximately 2000 elements.
I created an index for this array column:
create index index_name ON table_name USING GIN (column_name)
Then I run this query:
SELECT COUNT(*)
FROM table_name
WHERE
column_name @> ARRAY[1777]
This query runs very slowly (Execution time: 66886.132 ms) and, as EXPLAIN ANALYZE shows, it does not use the GIN index; only a Seq Scan is used.
Why does Postgres not use the GIN index, and, the main question: how can I run the above query as fast as possible?
EDIT
This is result from explain (analyze, verbose) for above query
Aggregate (cost=10000024724.75..10000024724.76 rows=1 width=0) (actual time=61087.513..61087.513 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.users (cost=10000000000.00..10000024724.00 rows=300 width=0) (actual time=12104.651..61087.500 rows=5 loops=1)
Output: id, email, pass, nick, reg_dt, reg_ip, gender, curr_location, about, followed_tag_ids, avatar_img_ext, rep_tag_ids, rep_tag_id_scores, stats, status
Filter: (users.rep_tag_ids @> '{1777}'::integer[])
Rows Removed by Filter: 299995
Planning time: 0.110 ms
Execution time: 61087.564 ms
This is table and index definitions
CREATE TABLE users
(
id serial PRIMARY KEY,
rep_tag_ids integer[] DEFAULT '{}'
-- other columns here
);
create index users_rep_tag_ids_idx ON users USING GIN (rep_tag_ids);
You should help the query optimizer to use the index. Install the intarray extension for PostgreSQL if you don't have it yet, and then recreate your index using the gin__int_ops operator class.
DROP INDEX users_rep_tag_ids_idx;
CREATE INDEX users_rep_tag_ids_idx ON users USING gin (rep_tag_ids gin__int_ops);
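To check that the planner now picks the index, something like this (one would hope the plan switches from a Seq Scan to a Bitmap Index Scan, although that is not guaranteed):
EXPLAIN (ANALYZE)
SELECT COUNT(*)
FROM users
WHERE rep_tag_ids @> ARRAY[1777];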

Indexes on join tables

When searching on Google for join table indexes, I got this question.
Now, I believe that it is giving some false information in the accepted answer, or I do not understand how everything works.
Given the following tables (running on PostgreSQL 9.4):
CREATE TABLE "albums" ("album_id" serial PRIMARY KEY, "album_name" text)
CREATE TABLE "artists" ("artist_id" serial PRIMARY KEY, "artist_name" text)
CREATE TABLE "albums_artists" ("album_id" integer REFERENCES "albums", "artist_id" integer REFERENCES "artists")
I was trying to replicate the scenario from the question mentioned above, by creating first an index on both of the columns of the albums_artists table and then one index for each column (without keeping the index on both columns).
I would have been expecting very different results when using the EXPLAIN command for a normal, traditional select like the following one:
SELECT "artists".* FROM "test"."artists"
INNER JOIN "test"."albums_artists" ON ("albums_artists"."artist_id" = "artists"."artist_id")
WHERE ("albums_artists"."album_id" = 1)
However, when actually running explain on it, I get exactly the same result for each of the cases (with one index on each column vs. one index on both columns).
I've been reading the PostgreSQL documentation on indexing, and it doesn't explain the results that I am getting:
Hash Join (cost=15.05..42.07 rows=11 width=36) (actual time=0.024..0.025 rows=1 loops=1)
Hash Cond: (artists.artist_id = albums_artists.artist_id)
-> Seq Scan on artists (cost=0.00..22.30 rows=1230 width=36) (actual time=0.006..0.006 rows=1 loops=1)
-> Hash (cost=14.91..14.91 rows=11 width=4) (actual time=0.009..0.009 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Bitmap Heap Scan on albums_artists (cost=4.24..14.91 rows=11 width=4) (actual time=0.008..0.009 rows=1 loops=1)
Recheck Cond: (album_id = 1)
Heap Blocks: exact=1
-> Bitmap Index Scan on albums_artists_album_id_index (cost=0.00..4.24 rows=11 width=0) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: (album_id = 1)
I would expect to not get an index scan at the last step when using an index composed of 2 different columns (since I am only using one of them in the WHERE clause).
I was about to open a bug against an ORM library that adds a single index on both columns for join tables, but now I am not so sure. Can anyone help me understand why the behavior is similar in the two cases, and what the difference would actually be, if there is any?
add a NOT NULL constraint on the key columns (a tuple with NULLs would make no sense here)
add a PRIMARY KEY (forcing a UNIQUE index on the two keyfields)
As support for FK lookups: add a compound index on the PK fields in reversed order
after creating/adding PKs and indexes, you may want to ANALYZE the table (only key columns have statistics)
CREATE TABLE albums_artists
( album_id integer NOT NULL REFERENCES albums (album_id)
, artist_id integer NOT NULL REFERENCES artists (artist_id)
, PRIMARY KEY (album_id, artist_id)
);
CREATE UNIQUE INDEX ON albums_artists (artist_id, album_id);
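And, as mentioned in the last point above, refresh the statistics once the table is populated so the planner has something to work with:
ANALYZE albums_artists;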
The reason behind the observed behaviour is the fact that the planner/optimiser is information-based, driven by heuristics. Without any information about the fraction of rows that will actually be needed given the conditions, or the fraction of rows that actually matches (in the case of a JOIN), the planner makes a guess (for example: 10% for a range query). For a small query, a hash join will always be a winning scenario; it does imply fetching all tuples from both tables, but the join itself is very efficient.
For columns that are part of a key or index, statistics will be collected, so the planner can make more realistic estimates about the number of rows involved. And that will often result in an indexed plan, since it may need fewer pages to be fetched.
Foreign keys are a very special case, since the planner knows that all the values from the referring table will be present in the referred table (that is, 100%, assuming NOT NULL).

GIN index not used for small table when 0 rows returned

In a Postgres 9.4 database, I created a GIN trigram index on a table called 'persons' that contains 1514 rows like the following:
CREATE INDEX persons_index_name_1 ON persons
USING gin (lower(name) gin_trgm_ops);
and a query that looks for similar names as follows:
select name, last_name from persons where lower(name) % 'thename'
So, I first issued a query with a name I knew beforehand would have similar matches, and explain analyze showed that the index I created was used in this query:
select name, last_name from persons where lower(name) % 'george'
And the results were as expected:
-> Bitmap Heap Scan on persons (cost=52.01..58.72 rows=2 width=26) (actual time=0.054..0.065 rows=1 loops=1)
Recheck Cond: (lower((name)::text) % 'george'::text)
Rows Removed by Index Recheck: 2
Heap Blocks: exact=1
-> Bitmap Index Scan on persons_index_name_1 (cost=0.00..52.01 rows=2 width=0) (actual time=0.032..0.032 rows=3 loops=1)
Index Cond: (lower((name)::text) % 'george'::text)
...
Execution time: 1.382 ms
So, out of curiosity, I wanted to see if the index was used when the thename parameter contained a name that didn't exist at all in the table:
select name, last_name from persons where lower(name) % 'noname'
But I saw that in this case the index was not used at all and the execution time was much higher:
-> Seq Scan on persons (cost=0.00..63.72 rows=2 width=26) (actual time=6.494..6.494 rows=0 loops=1)
Filter: (lower((name)::text) % 'noname'::text)
Rows Removed by Filter: 1514
...
Execution time: 7.387 ms
As a test, I tried the same with a GIST index and in both cases, the index was used and the execution time was like the first case above.
I went ahead and recreated the table but this time inserting 10014 rows; and I saw that in both cases above, the GIN index was used and the execution time was the best for those cases.
Why is a GIN index not used when the query above returns no results in a table with not so many rows (1514 in my case)?
Trigram indexes are case insensitive, test with:
select 'case' <-> 'CASE' AS ci1
, 'case' % 'CASE' AS ci2
, 'CASE' <-> 'CASE' AS c1
, 'CASE' % 'CASE' AS c2;
So you might as well just:
CREATE INDEX persons_index_name_1 ON persons USING gin (name gin_trgm_ops);
And:
select name, last_name from persons where name % 'thename';
As to your actual question, for small tables an index look-up might not pay. That's exactly what your added tests demonstrate. And establishing that nothing matches can be more expensive than finding some matches.
Aside from that, your cost settings and/or table statistics may not be at their optimum, which can keep Postgres from picking the most adequate query plan.
The expected cost numbers translate to much higher actual cost for the sequential scan than for the bitmap index scan. You may be overestimating the cost of index scans as compared to sequential scans. random_page_cost (and cpu_index_tuple_cost) may be set too high and effective_cache_size too low.
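If you want to experiment with those settings, a session-level sketch (the values are purely illustrative; tune them to your storage and RAM, then re-run EXPLAIN ANALYZE):
SET random_page_cost = 1.1;        -- closer to seq_page_cost for SSDs or well-cached data
SET effective_cache_size = '4GB';  -- roughly the memory available for caching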
Keep PostgreSQL from sometimes choosing a bad query plan