Faster search in INTARRAY column - postgresql

I have a table with approximately 300 000 rows and an INT[] column.
Each array contains approximately 2000 elements.
I created an index for this array column:
create index index_name ON table_name USING GIN (column_name)
Then I run this query:
SELECT COUNT(*)
FROM table_name
WHERE
column_name @> ARRAY[1777]
This query runs very slowly (Execution time: 66886.132 ms) and, as EXPLAIN ANALYZE shows, it does not use the GIN index; only a Seq Scan is used.
Why does Postgres not use the GIN index, and the main question: how can I run the above query as fast as possible?
EDIT
This is the result of explain (analyze, verbose) for the above query:
Aggregate (cost=10000024724.75..10000024724.76 rows=1 width=0) (actual time=61087.513..61087.513 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.users (cost=10000000000.00..10000024724.00 rows=300 width=0) (actual time=12104.651..61087.500 rows=5 loops=1)
Output: id, email, pass, nick, reg_dt, reg_ip, gender, curr_location, about, followed_tag_ids, avatar_img_ext, rep_tag_ids, rep_tag_id_scores, stats, status
Filter: (users.rep_tag_ids @> '{1777}'::integer[])
Rows Removed by Filter: 299995
Planning time: 0.110 ms
Execution time: 61087.564 ms
These are the table and index definitions:
CREATE TABLE users
(
id serial PRIMARY KEY,
rep_tag_ids integer[] DEFAULT '{}'
-- other columns here
);
create index users_rep_tag_ids_idx ON users USING GIN (rep_tag_ids);

You should help the query optimizer to use the index. Install the intarray extension for PostgreSQL if you don't have it yet, and then recreate your index using the gin__int_ops operator class.
DROP INDEX users_rep_tag_ids_idx;
CREATE INDEX users_rep_tag_ids_idx ON users USING gin (rep_tag_ids gin__int_ops);
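If intarray is not installed yet, you first need (a sketch; requires appropriate privileges):
CREATE EXTENSION IF NOT EXISTS intarray;
After recreating the index as above, you can verify that the planner now picks it up:
EXPLAIN (ANALYZE, VERBOSE)
SELECT COUNT(*) FROM users WHERE rep_tag_ids @> ARRAY[1777];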

Postgres query uses a Bitmap Index Scan on a GIN-indexed item. Doesn't GIN use a B-tree internally?

From postgres docs on GIN -
Internally, a GIN index contains a B-tree index constructed over keys.
But in my use case, I see bitmap index scans instead -
My Schema and Indexes are created as below.
CREATE TABLE customer (
jdata jsonb NOT NULL,
comments text NULL
);
CREATE INDEX idx_jdata ON customer USING gin (jdata jsonb_path_ops);
Say I inserted about 10K records (sample prepared data / alternate link).
explain analyze select jdata from customer where jdata @> '{"supplier":{"id":"7f5644ca-f0d3-4f50-947b-9e3e38f7b796"}}'
Output -
Bitmap Heap Scan on file_set (cost=2139.27..36722.32 rows=10744 width=18) (actual time=1.438..267.122 rows=4048 loops=1)
Recheck Cond: (jdata @> '{"supplier": {"id": "7f5644ca-f0d3-4f50-947b-9e3e38f7b796"}}'::jsonb)
Heap Blocks: exact=1197
-> Bitmap Index Scan on idx_jdata (cost=0.00..2136.58 rows=10744 width=0) (actual time=1.214..1.214 rows=4048 loops=1)
Index Cond: (jdata @> '{"supplier": {"id": "7f5644ca-f0d3-4f50-947b-9e3e38f7b796"}}'::jsonb)
Planning Time: 0.065 ms
Execution Time: 267.456 ms
(The query plan above is based on real output from Postgres - I had to change table/column names.)
Why is a Bitmap Index Scan used when a GIN index was created?
My Postgres DB version is 13.5.
After looking at the comment from @a-horse-with-no-name, I tried the below:
SET enable_seqscan = OFF;
explain analyze
select * from pg_opclass where "oid"=10003;
SET enable_seqscan = ON;
and the output -
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------+
Index Scan using pg_opclass_oid_index on pg_opclass (cost=0.14..8.16 rows=1 width=93) (actual time=0.015..0.016 rows=1 loops=1)|
Index Cond: (oid = '10003'::oid) |
Planning Time: 0.060 ms |
Execution Time: 0.027 ms |
I see a difference -
"Bitmap Index Scan on idx_jdata" vs "Index Scan using pg_opclass_oid_index on pg_opclass"
Does this mean anything important? Can someone add more details about the "using" vs "on" wording, and about pg_opclass?
A GIN index can contain the same tuple pointer multiple times, listed under the different tokens the field is broken down into. Those need to be de-duplicated to make sure it doesn't return duplicate rows where it should not, and the way GIN chooses to do this is by forcing it to go through a bitmap index scan, which inherently deduplicates the pointers. It is not the index which is the bitmap; it is the scan which uses bitmaps. Any index can be used in a bitmap index scan, but GIN indexes can only be used in bitmap scans, due to the need to deduplicate.
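One way to observe this (a sketch against the customer table above; the exact plan shape may differ) is to disable bitmap scans and note that the planner cannot substitute a plain index scan on the GIN index:
SET enable_bitmapscan = off;
EXPLAIN SELECT jdata FROM customer
WHERE jdata @> '{"supplier": {"id": "7f5644ca-f0d3-4f50-947b-9e3e38f7b796"}}';
-- Expect a Seq Scan here: GIN indexes only support bitmap index scans,
-- so with bitmap scans disabled the index cannot be used at all.
RESET enable_bitmapscan;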

Slow Like Query in Postgres

I have 20 million records in a table whose schema is like below:
FieldName Datatype
id bigint(Auto Inc,Primarykey)
name varchar(255)
phone varchar(255)
deleted_at timestamp
created_at timestamp
updated_at timestamp
It has indexes on the name and phone columns:
Column Index type
name GIN trgm index
phone btree index, GIN trgm index
I created the indexes using the following commands:
CREATE INDEX btree_idx ON contacts USING btree (phone);
CREATE INDEX trgm_idx ON contacts USING GIN (phone gin_trgm_ops);
CREATE INDEX trgm_idx_name ON contacts USING GIN (name gin_trgm_ops);
I am running the below query
select * from contacts where phone like '%6666666%' limit 15;
I am doing a contains query on phone. The above query takes more than 5 minutes to get a result. Here is the explain output for it:
explain analyse select * from contacts where phone like '%6666666%' limit 15;
Limit (cost=1774.88..1830.57 rows=15 width=65) (actual time=7970.553..203001.985 rows=15 loops=1)
-> Bitmap Heap Scan on contacts (cost=1774.88..10819.13 rows=2436 width=65) (actual time=7970.552..203001.967 rows=15 loops=1)
Recheck Cond: ((phone)::text ~~ '%6666666%'::text)
Rows Removed by Index Recheck: 254869
Heap Blocks: lossy=2819
-> Bitmap Index Scan on trgm_idx (cost=0.00..1774.27 rows=2436 width=0) (actual time=6720.978..6720.978 rows=306226 loops=1)
Index Cond: ((phone)::text ~~ '%6666666%'::text)
Planning Time: 0.139 ms
Execution Time: 203002.791 ms
What can I do here to optimize my query? Bringing the result under 5 seconds would be optimal.
One cause of the bad performance is probably
Heap Blocks: lossy=2819
Your work_mem setting is too small to contain a bitmap with one bit per table row, so PostgreSQL degrades it to one bit per 8kB block. This leads to many more rechecks than necessary.
Also, your test is bad. The search string contains only the trigram 666, which will match many rows that don't satisfy the query and have to be removed during recheck. A trigram index is not effective in this pathological case. Test with a number that contains more digits.
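As a rough sketch of both fixes (the work_mem value is only an example, and the longer phone pattern is hypothetical):
-- Give the bitmap enough memory to stay exact (one bit per row) for this session.
SET work_mem = '256MB';
-- Re-test with a more selective pattern that yields several distinct trigrams.
EXPLAIN ANALYZE
SELECT * FROM contacts WHERE phone LIKE '%66612345%' LIMIT 15;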

PostgreSQL: Optimize retrieving distinct values from a large table

I have one de-normalized table with 40+ columns (~ 1.5 million rows, 1 Gb).
CREATE TABLE tbl1 (
...
division_id integer,
division_name varchar(10),
...
);
I need to speed up this query:
SELECT DISTINCT division_name, division_id
FROM table
ORDER BY division_name;
The query returns only ~250 rows, but it is very slow because of the size of the table.
I have tried to create an index:
create index idx1 on tbl1 (division_name, division_id)
But the current execution plan is:
explain analyze SELECT Distinct division_name, division_id FROM tbl1 ORDER BY 1;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=143135.77..143197.64 rows=24748 width=74) (actual time=1925.697..1925.723 rows=294 loops=1)
Sort Key: division_name
Sort Method: quicksort Memory: 74kB
-> HashAggregate (cost=141082.30..141329.78 rows=24748 width=74) (actual time=1923.853..1923.974 rows=294 loops=1)
Group Key: division_name, division_id
-> Seq Scan on tbl1 (cost=0.00..132866.20 rows=1643220 width=74) (actual time=0.069..703.008 rows=1643220 loops=1)
Planning time: 0.311 ms
Execution time: 1925.883 ms
Any suggestions why the index does not work, or how I can speed up the query some other way?
Server: PostgreSQL 9.6.
P.S. Yes, the table has 40+ columns and is de-normalized, but I know all the pros and cons of this decision.
Update 1
@a_horse_with_no_name suggested using vacuum analyze instead of analyze to update the table statistics. Now the query plan is:
QUERY PLAN
------------------------
Unique (cost=0.55..115753.43 rows=25208 width=74) (actual time=0.165..921.426 rows=294 loops=1)
-> Index Only Scan using idx1 on tbl1 (cost=0.55..107538.21 rows=1643044 width=74) (actual time=0.162..593.322 rows=1643220 loops=1)
Heap Fetches: 0
Much better!
The index will probably only help if PostgreSQL chooses an “index only scan”, which means that it does not have to look at the table data at all.
Normally PostgreSQL has to check the table data (“heap”) to see if a row is visible for the current transaction, because visibility information is not stored in the index.
If, however, the table does not change much and has recently been VACUUMed, PostgreSQL knows that most of the pages consist only of items visible for everyone (there is a “visibility map” to keep track of that information), and then it might be cheaper to scan the index.
Try running VACUUM on the table and see if that causes an index only scan to be used.
Other than that, there is no way to speed up such a query.
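A minimal version of that suggestion, using the table from the question:
VACUUM (ANALYZE) tbl1;
-- Re-check the plan afterwards; an Index Only Scan with "Heap Fetches: 0"
-- means the visibility map is being used effectively.
EXPLAIN ANALYZE
SELECT DISTINCT division_name, division_id FROM tbl1 ORDER BY division_name;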

What's wrong with GIN index, can't avoid SEQ scan?

I've created a table like this,
create table mytable(hash char(40), title varchar(500));
create index name_fts on mytable using gin(to_tsvector('english', 'title'));
CREATE UNIQUE INDEX md5_uniq_idx ON mytable(hash);
When I query the title,
test=# explain analyze select * from mytable where to_tsvector('english', title) @@ 'abc | def'::tsquery limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..277.35 rows=10 width=83) (actual time=0.111..75.549 rows=10 loops=1)
-> Seq Scan on mytable (cost=0.00..381187.45 rows=13744 width=83) (actual time=0.110..75.546 rows=10 loops=1)
Filter: (to_tsvector('english'::regconfig, (title)::text) @@ '''abc'' | ''def'''::tsquery)
Rows Removed by Filter: 10221
Planning time: 0.176 ms
Execution time: 75.564 ms
(6 rows)
The index is not used. Any ideas? I have 10m rows.
There is a typo in your index definition; it should be
ON mytable USING gin (to_tsvector('english', title))
instead of
ON mytable USING gin (to_tsvector('english', 'title'))
The way you wrote it, it is a constant and not a field that is indexed, and such an index would indeed be useless for a search like the one you perform.
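Concretely, recreating the index might look like this (a sketch reusing the names from the question):
DROP INDEX name_fts;
CREATE INDEX name_fts ON mytable USING gin (to_tsvector('english', title));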
To see if an index can be used, you can execute
SET enable_seqscan=off;
and then run the query again.
If the index is still not used, the index probably cannot be used.
In addition to the above, there is something that strikes me as strange with your execution plan. PostgreSQL estimates that a sequential scan of mytable will return 13744 rows and not 10 million as you say there are. Did you disable autovacuum or is there something else that could cause your table statistics to be that inaccurate?

How to optimize BETWEEN condition on big table in PostgreSQL

I have a big table (about ten million rows) and I need to perform a query with ? BETWEEN columnA AND columnB.
Script to create database with table and sample data:
CREATE DATABASE test;
\c test
-- Create test table
CREATE TABLE test (id INT PRIMARY KEY, range_start NUMERIC(12, 0), range_end NUMERIC(12, 0));
-- Fill the table with sample data
INSERT INTO test (SELECT value, value, value FROM (SELECT generate_series(1, 10000000) AS value) source);
-- Query I want to be optimized
SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end;
I want to create an INDEX so that PostgreSQL can do a fast INDEX SCAN instead of a SEQ SCAN. However, I failed with my initial (and obvious) attempts:
CREATE INDEX test1 ON test (range_start, range_end);
CREATE INDEX test2 ON test (range_start DESC, range_end);
CREATE INDEX test3 ON test (range_end, range_start);
Also note that the number in the query is specifically chosen to be in the middle of generated values (otherwise PostgreSQL is able to recognize that the value is near range boundary and perform some optimizations).
Any ideas or thoughts would be appreciated.
UPDATE 1 Based on the official documentation, it seems that PostgreSQL is not able to properly use indexes for multicolumn inequality conditions. I am not sure why there is such a limitation and whether there is anything I can do to significantly speed up the query.
UPDATE 2 One possible approach would be to limit the INDEX SCAN by knowing the largest range I have; let's say it is 100000:
SELECT * FROM test WHERE range_start BETWEEN 4900000 AND 5000000 AND range_end > 5000000;
Why don't you try a range type with a GiST index?
alter table test add numr numrange;
update test set numr = numrange(range_start,range_end,'[]');
CREATE INDEX test_idx ON test USING gist (numr);
EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000.0 <@ numr;
Bitmap Heap Scan on public.test (cost=2367.92..130112.36 rows=50000 width=48) (actual time=0.150..0.151 rows=1 loops=1)
Output: id, range_start, range_end, numr
Recheck Cond: (5000000.0 <@ test.numr)
-> Bitmap Index Scan on test_idx (cost=0.00..2355.42 rows=50000 width=0) (actual time=0.142..0.142 rows=1 loops=1)
Index Cond: (5000000.0 <@ test.numr)
Total runtime: 0.189 ms
After a second thought, it is quite obvious why PostgreSQL cannot use a multicolumn index for a two-column inequality condition. However, what I did not understand was why there is a SEQ SCAN even with the LIMIT clause (sorry for not expressing that in my question):
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.09 rows=1 width=16) (actual time=4743.035..4743.037 rows=1 loops=1)
-> Seq Scan on test (cost=0.00..213685.51 rows=2499795 width=16) (actual time=4743.032..4743.032 rows=1 loops=1)
Filter: ((5000000::numeric >= range_start) AND (5000000::numeric <= range_end))
Total runtime: 4743.064 ms
Then it hit me that PostgreSQL cannot know that the result is less likely to be at range_start=1 than at range_start=4999999. That is why it starts scanning from the first row until it finds matching row(s).
The solution might be to convince PostgreSQL that there is some benefit to using the index:
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end ORDER BY range_start DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.53 rows=1 width=16) (actual time=0.102..0.103 rows=1 loops=1)
-> Index Scan Backward using test1 on test (cost=0.00..3667714.71 rows=2403325 width=16) (actual time=0.099..0.099 rows=1 loops=1)
Index Cond: ((5000000::numeric >= range_start) AND (5000000::numeric <= range_end))
Total runtime: 0.125 ms
Quite a performance boost, I would say :). However, this boost will only work if such a range exists; otherwise it will be as slow as a SEQ SCAN. So it might be good to combine this approach with what I have outlined in my second update to the original question.
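For reference, a sketch of that combination (assuming, as in the second update, that no range is wider than 100000):
SELECT *
FROM test
WHERE range_start BETWEEN 4900000 AND 5000000
  AND range_end >= 5000000   -- >= keeps the inclusive semantics of BETWEEN
ORDER BY range_start DESC
LIMIT 1;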