PostgreSQL: Optimize retrieving distinct values from a large table

I have one de-normalized table with 40+ columns (~1.5 million rows, 1 GB).
CREATE TABLE tbl1 (
...
division_id integer,
division_name varchar(10),
...
);
I need to speed up this query:
SELECT DISTINCT division_name, division_id
FROM tbl1
ORDER BY division_name;
The query returns only ~250 rows, but it is very slow because of the size of the table.
I have tried to create an index:
create index idx1 on tbl1 (division_name, division_id)
But the current execution plan is:
explain analyze SELECT Distinct division_name, division_id FROM tbl1 ORDER BY 1;
QUERY PLAN
-----------------------------------------------------------------
Sort  (cost=143135.77..143197.64 rows=24748 width=74) (actual time=1925.697..1925.723 rows=294 loops=1)
   Sort Key: division_name
   Sort Method: quicksort  Memory: 74kB
   ->  HashAggregate  (cost=141082.30..141329.78 rows=24748 width=74) (actual time=1923.853..1923.974 rows=294 loops=1)
         Group Key: division_name, division_id
         ->  Seq Scan on tbl1  (cost=0.00..132866.20 rows=1643220 width=74) (actual time=0.069..703.008 rows=1643220 loops=1)
Planning time: 0.311 ms
Execution time: 1925.883 ms
Any suggestions why the index does not work, or how else I can speed up the query?
Server: PostgreSQL 9.6.
P.S. Yes, the table has 40+ columns and is de-normalized, but I know all the pros and cons of this decision.
Update1
@a_horse_with_no_name suggested using VACUUM ANALYZE instead of ANALYZE to update the table statistics. Now the query plan is:
QUERY PLAN
------------------------
Unique  (cost=0.55..115753.43 rows=25208 width=74) (actual time=0.165..921.426 rows=294 loops=1)
   ->  Index Only Scan using idx1 on tbl1  (cost=0.55..107538.21 rows=1643044 width=74) (actual time=0.162..593.322 rows=1643220 loops=1)
         Heap Fetches: 0
Much better!

The index will probably only help if PostgreSQL chooses an “index only scan”, that means that it does not have to look at the table data at all.
Normally PostgreSQL has to check the table data (“heap”) to see if a row is visible for the current transaction, because visibility information is not stored in the index.
If, however, the table does not change much and has recently been VACUUMed, PostgreSQL knows that most of the pages consist only of items visible for everyone (there is a “visibility map” to keep track of that information), and then it might be cheaper to scan the index.
Try running VACUUM on the table and see if that causes an index only scan to be used.
Other than that, there is no way to speed up such a query.
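The advice above can be sketched as follows; pg_class.relallvisible and relpages are real catalog columns, and tbl1 is the table from this question:

```sql
-- Update the visibility map (and the statistics) so that an
-- index-only scan becomes attractive to the planner.
VACUUM ANALYZE tbl1;

-- Check how much of the table is marked all-visible; the closer
-- relallvisible is to relpages, the cheaper an index-only scan is.
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'tbl1';
```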

Related

PostgreSql not using index

I have a table named snapshots with a column named data in jsonb format.
An index is created on the snapshots table:
create index on snapshots ((data ->> 'creator'));
The following query was using the index initially, but not after a couple of days:
SELECT id, data - 'sections' - 'sharing' AS data FROM snapshots WHERE data->>'creator' = 'abc@email.com' ORDER BY xmin::text::bigint DESC
Below is the output of running EXPLAIN ANALYZE:
Sort  (cost=19.10..19.19 rows=35 width=77) (actual time=292.159..292.163 rows=35 loops=1)
   Sort Key: (((xmin)::text)::bigint) DESC
   Sort Method: quicksort  Memory: 30kB
   ->  Seq Scan on snapshots  (cost=0.00..18.20 rows=35 width=77) (actual time=3.500..292.104 rows=35 loops=1)
         Filter: ((data ->> 'creator'::text) = 'abc@email.com'::text)
         Rows Removed by Filter: 152
Planning Time: 0.151 ms
Execution Time: 292.198 ms
A table with 187 rows is very small. For very small tables, a sequential scan is the most efficient strategy.
What is surprising here is the long duration of the query execution (292 milliseconds!). Unless you have incredibly lame or overloaded storage, this must mean that the table is extremely bloated – it is comparatively large, but almost all pages are empty, with only 187 live rows. You should rewrite the table to compact it:
VACUUM (FULL) snapshots;
Then the query will become much faster.
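To confirm the bloat theory before rewriting the table, one option (a sketch, assuming the pgstattuple contrib extension can be installed) is:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Reports physical table size, live/dead tuple counts and free space;
-- a comparatively large table_len with only 187 live tuples would
-- confirm heavy bloat.
SELECT table_len, tuple_count, dead_tuple_count, free_percent
FROM pgstattuple('snapshots');
```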

Slow distinct PostgreSQL query on nested jsonb field won't use index

I'm trying to get distinct values from a nested field of a JSONB column, but it takes about 2 minutes on a 400K-row table.
The original query used DISTINCT, but then I read that GROUP BY works better, so I tried that too - no luck, still extremely slow.
Adding an index did not help either:
create index "orders_financial_status_index" on orders ((data ->'data'->> 'financial_status'));
EXPLAIN ANALYZE gave this result:
HashAggregate  (cost=13431.16..13431.22 rows=4 width=32) (actual time=123074.941..123074.943 rows=4 loops=1)
   Group Key: ((data -> 'data'::text) ->> 'financial_status'::text)
   ->  Seq Scan on orders  (cost=0.00..12354.14 rows=430809 width=32) (actual time=2.993..122780.325 rows=434080 loops=1)
Planning time: 0.119 ms
Execution time: 123074.979 ms
It's worth mentioning that there are no null values on this column, and currently there are 4 unique values.
What should I do in order to query the distinct values faster?
No index will make this faster, because the query has to scan the whole table.
As you can see, the sequential scan uses almost all the time; the hash aggregate is fast.
Still I would not drop the index, because it allows PostgreSQL to estimate the number of groups accurately and decide on the more efficient hash aggregate rather than sorting the rows. You can try without the index to be sure.
However, two minutes for half a million rows is not very fast. Do you have slow storage? Is the table bloated? If the latter, VACUUM (FULL) should improve things.
You can speed up the query by reducing I/O. Load the table into RAM with pg_prewarm, then processing should be considerably faster.
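The pg_prewarm suggestion can be sketched like this (pg_prewarm is a standard contrib module; orders is the table from the question):

```sql
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Read the whole table into the PostgreSQL buffer cache;
-- the function returns the number of blocks loaded.
SELECT pg_prewarm('orders');
```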

What's wrong with GIN index, can't avoid SEQ scan?

I've created a table like this,
create table mytable(hash char(40), title varchar(500));
create index name_fts on mytable using gin(to_tsvector('english', 'title'));
CREATE UNIQUE INDEX md5_uniq_idx ON mytable(hash);
When I query the title,
test=# explain analyze select * from mytable where to_tsvector('english', title) @@ 'abc | def'::tsquery limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..277.35 rows=10 width=83) (actual time=0.111..75.549 rows=10 loops=1)
   ->  Seq Scan on mytable  (cost=0.00..381187.45 rows=13744 width=83) (actual time=0.110..75.546 rows=10 loops=1)
         Filter: (to_tsvector('english'::regconfig, (title)::text) @@ '''abc'' | ''def'''::tsquery)
         Rows Removed by Filter: 10221
Planning time: 0.176 ms
Execution time: 75.564 ms
(6 rows)
The index is not used. Any ideas? I have 10m rows.
There is a typo in your index definition, it should be
ON mytable USING gin (to_tsvector('english', title))
instead of
ON mytable USING gin (to_tsvector('english', 'title'))
The way you wrote it, it is a constant and not a field that is indexed, and such an index would indeed be useless for a search like the one you perform.
To see if an index can be used, you can execute
SET enable_seqscan=off;
and then run the query again.
If the index is still not used, the index probably cannot be used.
In addition to the above, there is something that strikes me as strange with your execution plan. PostgreSQL estimates that a sequential scan of mytable will return 13744 rows and not 10 million as you say there are. Did you disable autovacuum or is there something else that could cause your table statistics to be that inaccurate?
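To recap, the corrected index from the answer would be created like this (dropping the useless one first):

```sql
DROP INDEX name_fts;

-- Index the column expression, not the string constant 'title'
CREATE INDEX name_fts ON mytable
    USING gin (to_tsvector('english', title));
```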

How to optimize BETWEEN condition on big table in PostgreSQL

I have a big table (about ten million rows) and I need to perform a query with ? BETWEEN columnA AND columnB.
Script to create database with table and sample data:
CREATE DATABASE test;
\c test
-- Create test table
CREATE TABLE test (id INT PRIMARY KEY, range_start NUMERIC(12, 0), range_end NUMERIC(12, 0));
-- Fill the table with sample data
INSERT INTO test (SELECT value, value, value FROM (SELECT generate_series(1, 10000000) AS value) source);
-- Query I want to be optimized
SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end;
I want to create INDEX so that PostgreSQL can do fast INDEX SCAN instead of SEQ SCAN. However I failed with my initial (and obvious) attempts:
CREATE INDEX test1 ON test (range_start, range_end);
CREATE INDEX test2 ON test (range_start DESC, range_end);
CREATE INDEX test3 ON test (range_end, range_start);
Also note that the number in the query is specifically chosen to be in the middle of generated values (otherwise PostgreSQL is able to recognize that the value is near range boundary and perform some optimizations).
Any ideas or thoughts would be appreciated.
UPDATE 1 Based on the official documentation it seems that PostgreSQL is not able to properly use indexes for multicolumn inequality conditions. I am not sure why there is such limitation and if there is anything I can do to significantly speed up the query.
UPDATE 2 One possible approach would be to limit the INDEX SCAN by knowing the largest range I have; let's say it is 100000:
SELECT * FROM test WHERE range_start BETWEEN 4900000 AND 5000000 AND range_end > 5000000;
Why don't you try a range type with a GiST index?
alter table test add numr numrange;
update test set numr = numrange(range_start,range_end,'[]');
CREATE INDEX test_idx ON test USING gist (numr);
EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000.0 <@ numr;
Bitmap Heap Scan on public.test  (cost=2367.92..130112.36 rows=50000 width=48) (actual time=0.150..0.151 rows=1 loops=1)
   Output: id, range_start, range_end, numr
   Recheck Cond: (5000000.0 <@ test.numr)
   ->  Bitmap Index Scan on test_idx  (cost=0.00..2355.42 rows=50000 width=0) (actual time=0.142..0.142 rows=1 loops=1)
         Index Cond: (5000000.0 <@ test.numr)
Total runtime: 0.189 ms
On second thought, it is quite obvious why PostgreSQL cannot use a multicolumn index for a two-column inequality condition. However, what I did not understand was why there is a SEQ SCAN even with a LIMIT clause (sorry for not expressing that in my question):
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..0.09 rows=1 width=16) (actual time=4743.035..4743.037 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..213685.51 rows=2499795 width=16) (actual time=4743.032..4743.032 rows=1 loops=1)
         Filter: ((5000000::numeric >= range_start) AND (5000000::numeric <= range_end))
Total runtime: 4743.064 ms
Then it hit me: PostgreSQL cannot know that a match is less likely among rows with range_start = 1 than among rows with range_start = 4999999. That is why it starts scanning from the first row until it finds matching row(s).
The solution might be to convince PostgreSQL that there is some benefit to using the index:
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end ORDER BY range_start DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..1.53 rows=1 width=16) (actual time=0.102..0.103 rows=1 loops=1)
   ->  Index Scan Backward using test1 on test  (cost=0.00..3667714.71 rows=2403325 width=16) (actual time=0.099..0.099 rows=1 loops=1)
         Index Cond: ((5000000::numeric >= range_start) AND (5000000::numeric <= range_end))
Total runtime: 0.125 ms
Quite a performance boost, I would say :). Still, this boost will only work if such a range exists; otherwise it will be as slow as a SEQ SCAN. So it might be good to combine this approach with what I have outlined in my second update to the original question.
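Combining both ideas might look like this (a sketch, assuming the maximum range width of 100000 from Update 2; note >= rather than >, since BETWEEN is inclusive on both ends):

```sql
-- Bound the index scan by the known maximum range width, and walk
-- the index backward so that a hit near 5000000 is found quickly.
SELECT *
FROM test
WHERE range_start BETWEEN 4900000 AND 5000000
  AND range_end >= 5000000
ORDER BY range_start DESC
LIMIT 1;
```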

Postgresql index on xpath expression gives no speed up

We are trying to create OEBS-analog functionality in PostgreSQL. Let's say we have a form constructor and need to store form results in a database (e.g. email bodies). In Oracle you could use a table with ~150 columns (and some mapping stored elsewhere) to store each field in a separate column. But in contrast to Oracle, we would like to store the whole form in a PostgreSQL xml field.
The example of the tree is
<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<object_id>2</object_id>
<pack_form_id>23</pack_form_id>
<prod_form_id>34</prod_form_id>
</row>
We would like to search through this field.
Test table contains 400k rows and the following select executes in 90 seconds:
select *
from params
where (xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int=34
So I created this index:
create index prod_form_idx
ON params using btree(
((xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int)
);
And it made no difference - still 90 seconds execution. The EXPLAIN plan shows this:
Bitmap Heap Scan on params  (cost=40.29..6366.44 rows=2063 width=292)
   Recheck Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)
   ->  Bitmap Index Scan on prod_form_idx  (cost=0.00..39.78 rows=2063 width=0)
         Index Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)
I am not a great plan interpreter, so I suppose this means that the index is being used. The question is: where's all the speed? And what can I do in order to optimize this kind of query?
Well, at least the index is used. You get a bitmap index scan instead of a normal index scan though, which means the xpath() function will be called lots of times.
Let's do a little check:
CREATE TABLE foo ( id serial primary key, x xml, h hstore );
insert into foo (x,h) select XMLPARSE( CONTENT '<row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<object_id>2</object_id>
<pack_form_id>' || n || '</pack_form_id>
<prod_form_id>34</prod_form_id>
</row>' ),
('object_id=>2,prod_form_id=>34,pack_form_id=>'||n)::hstore
FROM generate_series( 1,100000 ) n;
test=> EXPLAIN ANALYZE SELECT count(*) FROM foo;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Aggregate  (cost=4821.00..4821.01 rows=1 width=0) (actual time=24.694..24.694 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..4571.00 rows=100000 width=0) (actual time=0.006..13.996 rows=100000 loops=1)
Total runtime: 24.730 ms
test=> explain analyze select * from foo where (h->'pack_form_id')='123';
QUERY PLAN
----------------------------------------------------------------------------------------------------
Seq Scan on foo  (cost=0.00..5571.00 rows=500 width=68) (actual time=0.075..48.763 rows=1 loops=1)
   Filter: ((h -> 'pack_form_id'::text) = '123'::text)
Total runtime: 36.808 ms
test=> explain analyze select * from foo where ((xpath('//pack_form_id/text()'::text, x))[1]::text) = '123';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on foo  (cost=0.00..5071.00 rows=500 width=68) (actual time=4.271..3368.838 rows=1 loops=1)
   Filter: (((xpath('//pack_form_id/text()'::text, x, '{}'::text[]))[1])::text = '123'::text)
Total runtime: 3368.865 ms
As we can see,
scanning the whole table with count(*) takes 25 ms
extracting one key/value from a hstore adds a small extra cost, about 0.12 µs/row
extracting one key/value from an xml field using xpath adds a huge cost, about 33 µs/row
Conclusions :
xml is slow (but everyone knows that)
if you want to put a flexible key/value store in a column, use hstore
Also since your xml data is pretty big it will be toasted (compressed and stored out of the main table). This makes the rows in the main table much smaller, hence more rows per page, which reduces the efficiency of bitmap scans since all rows on a page have to be rechecked.
You can fix this though. For some reason the xpath() function (which is very slow, since it handles xml) has the same cost (1 unit) as say, the integer operator "+"...
update pg_proc set procost=1000 where proname='xpath';
You may need to tweak the cost value. When given the right info, the planner knows xpath is slow and will avoid a bitmap index scan, using an index scan instead, which doesn't need rechecking the condition for all rows on a page.
Note that this does not solve the row estimates problem. Since you can't ANALYZE the inside of the xml (or hstore) you get default estimates for the number of rows (here, 500). So, the planner may be completely wrong and choose a catastrophic plan if some joins are involved. The only solution to this is to use proper columns.
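As a follow-up to the hstore suggestion: the key/value column itself can also be indexed, so the lookup no longer needs a sequential scan at all (a sketch; the GIN operator class for hstore supports the containment operator @>):

```sql
-- GIN index over all keys and values in the hstore column
CREATE INDEX foo_h_gin ON foo USING gin (h);

-- Containment query that can use the GIN index
SELECT id
FROM foo
WHERE h @> 'pack_form_id=>123';
```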