What's wrong with GIN index, can't avoid SEQ scan? - postgresql

I've created a table like this,
create table mytable(hash char(40), title varchar(500));
create index name_fts on mytable using gin(to_tsvector('english', 'title'));
CREATE UNIQUE INDEX md5_uniq_idx ON mytable(hash);
When I query the title,
test=# explain analyze select * from mytable where to_tsvector('english', title) @@ 'abc | def'::tsquery limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..277.35 rows=10 width=83) (actual time=0.111..75.549 rows=10 loops=1)
-> Seq Scan on mytable (cost=0.00..381187.45 rows=13744 width=83) (actual time=0.110..75.546 rows=10 loops=1)
Filter: (to_tsvector('english'::regconfig, (title)::text) @@ '''abc'' | ''def'''::tsquery)
Rows Removed by Filter: 10221
Planning time: 0.176 ms
Execution time: 75.564 ms
(6 rows)
The index is not used. Any ideas? I have 10m rows.

There is a typo in your index definition; it should be
ON mytable USING gin (to_tsvector('english', title))
instead of
ON mytable USING gin (to_tsvector('english', 'title'))
The way you wrote it, what gets indexed is the constant string 'title' rather than the column, and such an index is indeed useless for a search like the one you perform.
To see if an index can be used, you can execute
SET enable_seqscan=off;
and then run the query again.
If the index is still not used then, it probably cannot be used for that query.
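Putting both pieces together, a minimal test sequence might look like this (a sketch using the table, column and index names from your question):
DROP INDEX name_fts;
-- index the column expression, not the string literal 'title'
CREATE INDEX name_fts ON mytable USING gin (to_tsvector('english', title));
-- temporarily rule out sequential scans to verify the index is usable
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE to_tsvector('english', title) @@ 'abc | def'::tsquery
LIMIT 10;
RESET enable_seqscan;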
In addition to the above, there is something that strikes me as strange with your execution plan. PostgreSQL estimates that a sequential scan of mytable will return 13744 rows and not 10 million as you say there are. Did you disable autovacuum or is there something else that could cause your table statistics to be that inaccurate?
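If the statistics are merely stale rather than autovacuum being disabled, refreshing them manually should be enough; a quick sketch (mytable being the table from the question):
ANALYZE mytable;          -- refresh planner statistics only
-- or, to also update the visibility map:
VACUUM ANALYZE mytable;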

Related

Postgres: how do you optimize queries on date column with low selectivity?

I have a table with 143 million rows (and growing); its current size is 107 GB. One of the columns in the table is of type date and it has low selectivity. For any given date, it's reasonable to assume that there are somewhere between 0.5 and 4 million records with the same date value.
Now, if someone tries to do something like this:
select * from large_table where date_column > '2020-01-01' limit 100
It will execute "forever", and if you EXPLAIN ANALYZE it, you can see that it's doing a table scan. So the first (and only, so far) idea is to try to turn this into an index scan. If Postgres can scan a subsection of an index and return the "limit" number of records, that sounds fast to me:
create index our_index_on_the_date_column ON large_table (date_column DESC);
VACUUM ANALYZE large_table;
EXPLAIN ANALYZE select * from large_table where date_column > '2020-01-01' limit 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..37.88 rows=100 width=893) (actual time=0.034..13.520 rows=100 loops=1)
-> Seq Scan on large_table (cost=0.00..13649986.80 rows=36034774 width=893) (actual time=0.033..13.506 rows=100 loops=1)
Filter: (date_column > '2020-01-01'::date)
Rows Removed by Filter: 7542
Planning Time: 0.168 ms
Execution Time: 18.412 ms
(6 rows)
It still reverts to a sequential scan. Please disregard the execution time, as this took 11 minutes before caching came into action. We can force it to use the index by reducing the number of returned columns to what is covered by the index:
select date_column from large_table where date_column > '2019-01-15' limit 100
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..3.42 rows=100 width=4) (actual time=0.051..0.064 rows=100 loops=1)
-> Index Only Scan using our_index_on_the_date_column on large_table (cost=0.57..907355.11 rows=31874888 width=4) (actual time=0.050..0.056 rows=100 loops=1)
Index Cond: (date_column > '2019-01-15'::date)
Heap Fetches: 0
Planning Time: 0.082 ms
Execution Time: 0.083 ms
(6 rows)
But this is of course a contrived example, since the table is very wide and covering all parts of the table in the index is not feasible.
So, can anyone share some guidance on how to get reasonable performance when using columns with low selectivity as predicates?
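Not an authoritative answer, but one thing that might be worth trying, given the DESC index that is already there: add an ORDER BY on the indexed column, so the planner can satisfy the LIMIT by walking the index in order and stopping early (the same trick appears in the BETWEEN question further down this page). A sketch, untested against this table:
EXPLAIN ANALYZE
SELECT *
FROM large_table
WHERE date_column > '2020-01-01'
ORDER BY date_column DESC   -- matches our_index_on_the_date_column, allows an early stop
LIMIT 100;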

Postgresql. Optimize retrieving distinct values from large table

I have one de-normalized table with 40+ columns (~ 1.5 million rows, 1 Gb).
CREATE TABLE tbl1 (
...
division_id integer,
division_name varchar(10),
...
);
I need to speed up this query:
SELECT DISTINCT division_name, division_id
FROM table
ORDER BY division_name;
The query returns only ~250 rows, but it is very slow because of the size of the table.
I have tried to create an index:
create index idx1 on tbl1 (division_name, division_id)
But the current execution plan is:
explain analyze SELECT Distinct division_name, division_id FROM tbl1 ORDER BY 1;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=143135.77..143197.64 rows=24748 width=74) (actual time=1925.697..1925.723 rows=294 loops=1)
Sort Key: division_name
Sort Method: quicksort Memory: 74kB
-> HashAggregate (cost=141082.30..141329.78 rows=24748 width=74) (actual time=1923.853..1923.974 rows=294 loops=1)
Group Key: division_name, division_id
-> Seq Scan on tbl1 (cost=0.00..132866.20 rows=1643220 width=74) (actual time=0.069..703.008 rows=1643220 loops=1)
Planning time: 0.311 ms
Execution time: 1925.883 ms
Any suggestions as to why the index does not work, or how I can speed up the query in another way?
Server: PostgreSQL 9.6.
P.S. Yes, the table has 40+ columns and is de-normalized, but I know all the pros and cons of that decision.
Update1
@a_horse_with_no_name suggested using VACUUM ANALYZE instead of ANALYZE to update the table statistics. Now the query plan is:
QUERY PLAN
------------------------
Unique (cost=0.55..115753.43 rows=25208 width=74) (actual time=0.165..921.426 rows=294 loops=1)
-> Index Only Scan using idx1 on tbl1 (cost=0.55..107538.21 rows=1643044 width=74) (actual time=0.162..593.322 rows=1643220 loops=1)
Heap Fetches: 0
Much better!
The index will probably only help if PostgreSQL chooses an “index only scan”, that means that it does not have to look at the table data at all.
Normally PostgreSQL has to check the table data (“heap”) to see if a row is visible for the current transaction, because visibility information is not stored in the index.
If, however, the table does not change much and has recently been VACUUMed, PostgreSQL knows that most of the pages consist only of items visible for everyone (there is a “visibility map” to keep track of that information), and then it might be cheaper to scan the index.
Try running VACUUM on the table and see if that causes an index only scan to be used.
Other than that, there is no way to speed up such a query.
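For the table from this question, that would be something along these lines (a sketch; tbl1 is the table name used above):
VACUUM ANALYZE tbl1;   -- updates the visibility map (and the statistics)
EXPLAIN ANALYZE
SELECT DISTINCT division_name, division_id
FROM tbl1
ORDER BY division_name;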

Faster search in INTARRAY column

I have a table with approximately 300,000 rows with an INT[] column type.
Each array contains approximately 2,000 elements.
I created an index for this array column:
create index index_name ON table_name USING GIN (column_name)
Then I run this query:
SELECT COUNT(*)
FROM table_name
WHERE
column_name @> ARRAY[1777]
This query runs very slowly (Execution time: 66886.132 ms) and, as EXPLAIN ANALYZE shows, it does not use the GIN index; only a Seq Scan is used.
Why does Postgres not use the GIN index, and, most importantly, how can I run the above query as fast as possible?
EDIT
This is the result of explain (analyze, verbose) for the above query:
Aggregate (cost=10000024724.75..10000024724.76 rows=1 width=0) (actual time=61087.513..61087.513 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.users (cost=10000000000.00..10000024724.00 rows=300 width=0) (actual time=12104.651..61087.500 rows=5 loops=1)
Output: id, email, pass, nick, reg_dt, reg_ip, gender, curr_location, about, followed_tag_ids, avatar_img_ext, rep_tag_ids, rep_tag_id_scores, stats, status
Filter: (users.rep_tag_ids @> '{1777}'::integer[])
Rows Removed by Filter: 299995
Planning time: 0.110 ms
Execution time: 61087.564 ms
These are the table and index definitions:
CREATE TABLE users
(
id serial PRIMARY KEY,
rep_tag_ids integer[] DEFAULT '{}'
-- other columns here
);
create index users_rep_tag_ids_idx ON users USING GIN (rep_tag_ids);
You should help the query optimizer use the index. Install the intarray extension for PostgreSQL if you don't have it yet, and then recreate your index using the gin__int_ops operator class.
DROP INDEX users_rep_tag_ids_idx;
CREATE INDEX users_rep_tag_ids_idx ON users USING gin (rep_tag_ids gin__int_ops);
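A sketch of the full sequence, assuming you have the privileges needed to install the extension:
CREATE EXTENSION IF NOT EXISTS intarray;
DROP INDEX users_rep_tag_ids_idx;
CREATE INDEX users_rep_tag_ids_idx ON users USING gin (rep_tag_ids gin__int_ops);
-- re-run the original query to check the plan
EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*) FROM users WHERE rep_tag_ids @> ARRAY[1777];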

GIN index not used for small table when 0 rows returned

In a Postgres 9.4 database, I created a GIN trigram index on a table called 'persons' that contains 1514 rows like the following:
CREATE INDEX persons_index_name_1 ON persons
USING gin (lower(name) gin_trgm_ops);
and a query that looks for similar names as follows:
select name, last_name from persons where lower(name) % 'thename'
I first issued a query with a name that I knew beforehand would have similar matches, and EXPLAIN ANALYZE showed that the index I created was used in this query:
select name, last_name from persons where lower(name) % 'george'
And the results were the expected:
-> Bitmap Heap Scan on persons (cost=52.01..58.72 rows=2 width=26) (actual time=0.054..0.065 rows=1 loops=1)
Recheck Cond: (lower((name)::text) % 'george'::text)
Rows Removed by Index Recheck: 2
Heap Blocks: exact=1
-> Bitmap Index Scan on persons_index_name_1 (cost=0.00..52.01 rows=2 width=0) (actual time=0.032..0.032 rows=3 loops=1)
Index Cond: (lower((name)::text) % 'george'::text)
...
Execution time: 1.382 ms
So, out of curiosity, I wanted to see if the index was used when the thename parameter contained a name that didn't exist at all in the table:
select name, last_name from persons where lower(name) % 'noname'
But I saw that in this case the index was not used at all and the execution time was much higher:
-> Seq Scan on persons (cost=0.00..63.72 rows=2 width=26) (actual time=6.494..6.494 rows=0 loops=1)
Filter: (lower((name)::text) % 'noname'::text)
Rows Removed by Filter: 1514
...
Execution time: 7.387 ms
As a test, I tried the same with a GiST index, and in both cases the index was used and the execution time was like the first case above.
I went ahead and recreated the table but this time inserting 10014 rows; and I saw that in both cases above, the GIN index was used and the execution time was the best for those cases.
Why is a GIN index not used when the query above returns no results in a table with not so many rows (1514 in my case)?
Trigram indexes are case insensitive, test with:
select 'case' <-> 'CASE' AS ci1
, 'case' % 'CASE' AS ci2
, 'CASE' <-> 'CASE' AS c1
, 'CASE' % 'CASE' AS c2;
So you might as well just:
CREATE INDEX persons_index_name_1 ON persons USING gin (name gin_trgm_ops);
And:
select name, last_name from persons where name % 'thename';
As to your actual question, for small tables an index look-up might not pay. That's exactly what your added tests demonstrate. And establishing that nothing matches can be more expensive than finding some matches.
Aside from that, your cost setting and / or table statistics may not be at their respective optimum to let Postgres pick the most adequate query plans.
The expected cost numbers translate to much higher actual cost for the sequential scan than for the bitmap index scan. You may be overestimating the cost of index scans as compared to sequential scans. random_page_cost (and cpu_index_tuple_cost) may be set too high and effective_cache_size too low.
Keep PostgreSQL from sometimes choosing a bad query plan
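If you want to experiment with those settings, they can be changed per session before touching the server configuration; the values below are only illustrative, not recommendations:
SET random_page_cost = 1.1;          -- closer to seq_page_cost when data is on SSD or mostly cached
SET effective_cache_size = '4GB';    -- roughly the memory available for caching
EXPLAIN ANALYZE
SELECT name, last_name FROM persons WHERE lower(name) % 'noname';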

How to optimize BETWEEN condition on big table in PostgreSQL

I have a big table (about ten million rows) and I need to perform a query with ? BETWEEN columnA AND columnB.
Script to create database with table and sample data:
CREATE DATABASE test;
\c test
-- Create test table
CREATE TABLE test (id INT PRIMARY KEY, range_start NUMERIC(12, 0), range_end NUMERIC(12, 0));
-- Fill the table with sample data
INSERT INTO test (SELECT value, value, value FROM (SELECT generate_series(1, 10000000) AS value) source);
-- Query I want to be optimized
SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end;
I want to create INDEX so that PostgreSQL can do fast INDEX SCAN instead of SEQ SCAN. However I failed with my initial (and obvious) attempts:
CREATE INDEX test1 ON test (range_start, range_end);
CREATE INDEX test2 ON test (range_start DESC, range_end);
CREATE INDEX test3 ON test (range_end, range_start);
Also note that the number in the query is specifically chosen to be in the middle of the generated values (otherwise PostgreSQL is able to recognize that the value is near a range boundary and perform some optimizations).
Any ideas or thoughts would be appreciated.
UPDATE 1 Based on the official documentation, it seems that PostgreSQL is not able to properly use indexes for multicolumn inequality conditions. I am not sure why there is such a limitation or whether there is anything I can do to significantly speed up the query.
UPDATE 2 One possible approach would be to limit the INDEX SCAN by knowing the largest range I have; let's say it is 100000:
SELECT * FROM test WHERE range_start BETWEEN 4900000 AND 5000000 AND range_end > 5000000;
Why don't you try a range with a GiST index?
alter table test add numr numrange;
update test set numr = numrange(range_start,range_end,'[]');
CREATE INDEX test_idx ON test USING gist (numr);
EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000.0 <@ numr;
Bitmap Heap Scan on public.test (cost=2367.92..130112.36 rows=50000 width=48) (actual time=0.150..0.151 rows=1 loops=1)
Output: id, range_start, range_end, numr
Recheck Cond: (5000000.0 <@ test.numr)
-> Bitmap Index Scan on test_idx (cost=0.00..2355.42 rows=50000 width=0) (actual time=0.142..0.142 rows=1 loops=1)
Index Cond: (5000000.0 <@ test.numr)
Total runtime: 0.189 ms
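As a variation on the same idea, the range does not even have to be stored as a column; an expression index on the range constructor should work too, as long as the query uses the same expression (a sketch, not taken from the answer above):
CREATE INDEX test_range_expr_idx ON test
    USING gist (numrange(range_start, range_end, '[]'));
EXPLAIN ANALYZE
SELECT * FROM test
WHERE numrange(range_start, range_end, '[]') @> 5000000.0;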
On second thought, it is quite obvious why PostgreSQL cannot use a multicolumn index for a two-column inequality condition. However, what I did not understand was why there is a SEQ SCAN even with a LIMIT clause (sorry for not expressing that in my question):
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.09 rows=1 width=16) (actual time=4743.035..4743.037 rows=1 loops=1)
-> Seq Scan on test (cost=0.00..213685.51 rows=2499795 width=16) (actual time=4743.032..4743.032 rows=1 loops=1)
Filter: ((5000000::numeric >= range_start) AND (5000000::numeric <= range_end))
Total runtime: 4743.064 ms
Then it hit me that PostgreSQL cannot know that a matching row is less likely to be found at range_start = 1 than at range_start = 4999999. That is why it starts scanning from the first row and continues until it finds matching row(s).
The solution might be to convince PostgreSQL that there is some benefit to using the index:
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end ORDER BY range_start DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.53 rows=1 width=16) (actual time=0.102..0.103 rows=1 loops=1)
-> Index Scan Backward using test1 on test (cost=0.00..3667714.71 rows=2403325 width=16) (actual time=0.099..0.099 rows=1 loops=1)
Index Cond: ((5000000::numeric >= range_start) AND (5000000::numeric <= range_end))
Total runtime: 0.125 ms
Quite a performance boost, I would say :). Still, this boost will only work if such a range exists; otherwise it will be as slow as a SEQ SCAN. So it might be good to combine this approach with what I have outlined in my second update to the original question.
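Combining the two ideas might look like this (a sketch based on the assumed maximum range width of 100000 from UPDATE 2; note the >= so that ranges ending exactly at the value still match):
EXPLAIN ANALYZE
SELECT * FROM test
WHERE range_start BETWEEN 4900000 AND 5000000
  AND range_end >= 5000000
ORDER BY range_start DESC
LIMIT 1;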