Slow distinct PostgreSQL query on nested jsonb field won't use index - postgresql

I'm trying to get the distinct values of a nested field in a JSONB column, but it takes about 2 minutes on a table with 400K rows.
The original query used DISTINCT, but then I read that GROUP BY works better, so I tried that too, with no luck: it is still extremely slow.
Adding an index did not help either:
create index "orders_financial_status_index" on orders ((data ->'data'->> 'financial_status'));
EXPLAIN ANALYZE gave this result:
HashAggregate (cost=13431.16..13431.22 rows=4 width=32) (actual time=123074.941..123074.943 rows=4 loops=1)
Group Key: ((data -> 'data'::text) ->> 'financial_status'::text)
-> Seq Scan on orders (cost=0.00..12354.14 rows=430809 width=32) (actual time=2.993..122780.325 rows=434080 loops=1)
Planning time: 0.119 ms
Execution time: 123074.979 ms
It's worth mentioning that there are no null values on this column, and currently there are 4 unique values.
What should I do in order to query the distinct values faster?

No index will make this faster, because the query has to scan the whole table.
As you can see, the sequential scan uses almost all the time; the hash aggregate is fast.
Still I would not drop the index, because it allows PostgreSQL to estimate the number of groups accurately and decide on the more efficient hash aggregate rather than sorting the rows. You can try without the index to be sure.
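One way to try the query without the index is sketched below. The question does not show the query text, so the GROUP BY statement here is an assumption reconstructed from the plan's Group Key. Since DDL is transactional in PostgreSQL, the index can be dropped inside a transaction and brought back with ROLLBACK. DROP INDEX keeps an exclusive lock on the table until the transaction ends, so this is better done on a test system.

```sql
BEGIN;

-- Temporarily remove the expression index (and the statistics gathered for it).
DROP INDEX orders_financial_status_index;

-- Assumed query shape, reconstructed from the Group Key in the plan above.
EXPLAIN (ANALYZE)
SELECT data -> 'data' ->> 'financial_status' AS financial_status
FROM orders
GROUP BY 1;

-- Undo the DROP INDEX; the index is available again afterwards.
ROLLBACK;
```

If the plan without the index shows a Sort plus GroupAggregate, or a very different row estimate for the aggregate, that would support keeping the index for its statistics.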
However, two minutes for half a million rows is not very fast. Do you have slow storage? Is the table bloated? If the latter, VACUUM (FULL) should improve things.
You can speed up the query by reducing I/O. Load the table into RAM with pg_prewarm, then processing should be considerably faster.
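A minimal sketch of these checks, assuming the pg_prewarm extension is installed on the server and shared_buffers is large enough to hold the table. The TOAST step is an addition not mentioned above: large jsonb values are usually stored in a separate TOAST relation, which pg_prewarm('orders') alone does not load.

```sql
-- Rough bloat check: total size (including TOAST) and dead tuples.
SELECT pg_size_pretty(pg_table_size('orders')) AS table_size,
       n_live_tup,
       n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Load the table's main relation into shared buffers.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('orders');

-- Also load the TOAST relation, if the table has one.
SELECT pg_prewarm(reltoastrelid::regclass)
FROM pg_class
WHERE oid = 'orders'::regclass
  AND reltoastrelid <> 0;

-- If the table turns out to be bloated, rewrite it.
-- This takes an ACCESS EXCLUSIVE lock for the duration of the rewrite.
-- VACUUM (FULL) orders;
```

The prewarmed pages can be evicted again under memory pressure, so the effect is not permanent.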

Related

Optimizing a postgis query - why is 2nd index not being used?

We have a table with tens of millions of polygons and we have this index:
CREATE INDEX IF NOT EXISTS polygons_geog_idx ON polygons USING GIST(geog);
That let us query the DB really efficiently, like so:
SELECT * FROM polygons WHERE st_dwithin('SRID=4326;POINT(-1 50)'::geography, geog, 500);
Now, due to business requirements, we need to return only the 200 biggest polygons. That is easily doable with:
ORDER BY st_area(geog) DESC
LIMIT 200
Full Query:
SELECT gid, st_area(geog) as size FROM polygons WHERE st_dwithin(geog, 'SRID=4326;POINT(-1 50)'::geography, 500) ORDER BY st_area(geog) DESC LIMIT 200;
Because of the ORDER BY and the st_area call in the select list, our query slows down by 10x. I thought it would be easily fixable by adding another index, as seen in this SO answer: CREATE INDEX polygons_geog_area_idx ON polygons (st_area(geog));
But polygons_geog_area_idx doesn't seem to be picked up:
Sort (cost=8.23..8.23 rows=1 width=12) (actual time=133.755..142.427 rows=2325 loops=1)
Sort Key: (st_area(geog, true))
Sort Method: quicksort Memory: 205kB
-> Index Scan using polygons_geog_idx on polygons (cost=0.14..8.22 rows=1 width=12) (actual time=0.468..121.974 rows=2325 loops=1)
Index Cond: (geog && '0101000020E6100000C33126587787F1BF3B0D62B197654940'::geography)
Filter: (('0101000020E6100000C33126587787F1BF3B0D62B197654940'::geography && _st_expand(geog, '500'::double precision)) AND _st_dwithin(geog, '0101000020E6100000C33126587787F1BF3B0D62B197654940'::geography, '500'::double precision, true))
Rows Removed by Filter: 3
Planning Time: 0.157 ms
Execution Time: 151.196 ms
(note: this is on development dataset, much smaller than actual dataset this will run on later)
What am I missing? Can you even use 2 indexes like I want?
PostgreSQL cannot combine two indexes in this way, one for the order and one for selectivity.
In order to sort by the area, it first needs to compute the area. The sort itself is fast (taking only 15% of the time) so it must be the computation of the area which is slow. An EXPLAIN VERBOSE suggests to me that the computation of the area is done as part of the index scan and then the result passed up to the sort, rather than being done in the sort itself. So it makes sense that the timing of doing this would be attributed to the index scan.
To reduce the time needed to compute the area, you could compute it once and store it in the table. The best way to do that (with a new enough version, PostgreSQL 12 or later) is a generated column.
alter table polygons add polygon_area double precision generated always as (st_area(geog)) stored;
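With that column in place, the query could read the stored value instead of calling st_area for every matching row. A sketch, assuming the polygon_area column from the statement above exists:

```sql
-- Same filter as before; the area is read from the stored generated column.
SELECT gid, polygon_area AS size
FROM polygons
WHERE st_dwithin(geog, 'SRID=4326;POINT(-1 50)'::geography, 500)
ORDER BY polygon_area DESC
LIMIT 200;
```

The GiST index still does the spatial filtering; the sort then works on precomputed numbers.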

Why doesn't PostgreSQL use all columns in a multi-column index?

I am using the extension
CREATE EXTENSION btree_gin;
I have an index that looks like this...
create index boundaries2 on rets USING GIN(source, isonlastsync, status, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder);
before I started messing with it, the index looked like this, with the same results that I'm about to share, so it seems minor variations in the index definition don't make a difference:
create index boundaries on rets USING GIN((geoinfo::jsonb->'boundaries'), source, status, isonlastsync, ctcvalidto, searchablePrice, ctcSortOrder);
I give pgsql 11 this query:
explain analyze select id from rets where ((geoinfo::jsonb->'boundaries' ?| array['High School: Torrey Pines']) AND source='SDMLS'
AND searchablePrice>=800000 AND searchablePrice<=1200000 AND YrBlt>=2000 AND EstSF>=2300
AND Beds>=3 AND FB>=2 AND ctcSortOrder>'2019-07-05 16:02:54 UTC' AND Status IN ('ACTIVE','BACK ON MARKET')
AND ctcvalidto='9999-12-31 23:59:59 UTC' AND isonlastsync='true') order by LstDate desc, ctcSortOrder desc LIMIT 3000;
with result...
Limit (cost=120.06..120.06 rows=1 width=23) (actual time=472.849..472.850 rows=1 loops=1)
-> Sort (cost=120.06..120.06 rows=1 width=23) (actual time=472.847..472.848 rows=1 loops=1)
Sort Key: lstdate DESC, ctcsortorder DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on rets (cost=116.00..120.05 rows=1 width=23) (actual time=472.748..472.841 rows=1 loops=1)
Recheck Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Rows Removed by Index Recheck: 93
Filter: (isonlastsync AND (yrblt >= 2000) AND (estsf >= 2300) AND (beds >= 3) AND (fb >= 2) AND (status = ANY ('{ACTIVE,"BACK ON MARKET"}'::text[])))
Rows Removed by Filter: 10
Heap Blocks: exact=102
-> Bitmap Index Scan on boundaries2 (cost=0.00..116.00 rows=1 width=0) (actual time=471.762..471.762 rows=104 loops=1)
Index Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Planning Time: 0.333 ms
Execution Time: 474.311 ms
(14 rows)
The Question
Why are the columns status and isonlastsync not used by the Bitmap Index Scan on boundaries2?
It can skip them if it predicts that filtering on those columns afterwards will be faster. This is usually the case when the cardinality of a column is very low and you fetch a large enough portion of all rows; this is true for a boolean like isonlastsync and usually true for status columns with just a few distinct values.
"Rows Removed by Filter: 10" is very little to filter out, either because your table does not hold a large number of rows or because most of them satisfy the conditions you specified for those two columns. You might try generating more data in that table or selecting rows with a rare status.
I suggest using partial indexes (with a WHERE condition), at least for the boolean value, and removing those two columns to make this index a bit more lightweight.
I cannot tell you why, but I can help you optimize the query.
You should not use a multi-column GIN index, but a GIN index on only the jsonb expression and a b-tree index on the other columns.
The order of columns matters: put the ones used in an equality condition first, with the most selective at the beginning. Next put the column with the most selective inequality or IN condition. For the following columns, the order doesn't matter, as they will only act as filters in the index scan.
Make sure that the indexes are cached in RAM.
I'd expect you to be faster that way.
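The two-index layout described in this answer might look like the sketch below. The index names are hypothetical, and the b-tree column order is only a guess at the selectivity of this data: equality columns first, then the range columns.

```sql
-- GIN index on the jsonb expression only (serves the ?| condition).
CREATE INDEX rets_boundaries_gin_idx
    ON rets USING gin ((geoinfo::jsonb -> 'boundaries'));

-- B-tree index on the scalar columns: equality conditions first,
-- then the inequality columns.
CREATE INDEX rets_filter_btree_idx
    ON rets (source, ctcvalidto, isonlastsync, searchablePrice, ctcSortOrder);
```

PostgreSQL can then combine both indexes with a BitmapAnd, or pick whichever single index it estimates to be cheaper.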
I think you're asking yourself the wrong question. As Lukasz answered already, PostgreSQL may find it inefficient to check all columns in the index. The problem here is that your index is too big on disk.
Probably by trying to make this SQL faster you added as many columns as possible to the index, but it is backfiring.
The trick is to realize how much data PostgreSQL has to read to find your records. If your index contains too much data, it will have to read a lot. Also, be aware that low-cardinality columns don't play well with BTree and other common index types; generally you want to avoid indexing them.
To keep the index as small as possible and lookups quick, find the column with the highest cardinality or, better, the column that will return the fewest rows for your query. My guess is "ctcSortOrder". This will be the first column of your index.
Now look: after filtering by the first column, which column has the most cardinality or will filter out the most rows? I have no idea about your data, but "source" looks like a good candidate.
Try to avoid jsonb searches unless they are the primary source of the cardinality, and keep the index as a Btree. BTree is several times faster.
And like Lukasz suggested, look into partial indexes. For example, add "WHERE Status IN ('ACTIVE','BACK ON MARKET') AND isonlastsync='true'", as these two conditions may be common to all your searches.
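A sketch of such a partial index, following the column order guessed at in this answer. The index name is hypothetical, and isonlastsync is assumed to be a boolean column, as the Filter line in the plan suggests.

```sql
-- Small b-tree index that only covers rows matching the common conditions.
CREATE INDEX rets_active_partial_idx
    ON rets (ctcSortOrder, source)
    WHERE status IN ('ACTIVE', 'BACK ON MARKET')
      AND isonlastsync;
```

The planner only considers a partial index when the query's WHERE clause implies the index predicate, so queries have to repeat these two conditions.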
Bottom line is, having a simpler, smaller index is faster than having all columns indexed. And the order of the columns does matter a lot. Stick with BTree unless there is a good reason (lots of cardinality in non-btree compatible types).
If your table is huge (>10M rows) consider table partitioning, for example by ctcSortOrder. But I don't think this is your case.

PostgreSQL. Optimize retrieving distinct values from large table

I have one de-normalized table with 40+ columns (~ 1.5 million rows, 1 Gb).
CREATE TABLE tbl1 (
...
division_id integer,
division_name varchar(10),
...
);
I need to speed up this query:
SELECT DISTINCT division_name, division_id
FROM tbl1
ORDER BY division_name;
The query returns only ~250 rows, but it is very slow because of the size of the table.
I have tried to create an index:
create index idx1 on tbl1 (division_name, division_id)
But this is the current execution plan:
explain analyze SELECT Distinct division_name, division_id FROM tbl1 ORDER BY 1;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=143135.77..143197.64 rows=24748 width=74) (actual time=1925.697..1925.723 rows=294 loops=1)
Sort Key: division_name
Sort Method: quicksort Memory: 74kB
-> HashAggregate (cost=141082.30..141329.78 rows=24748 width=74) (actual time=1923.853..1923.974 rows=294 loops=1)
Group Key: division_name, division_id
-> Seq Scan on tbl1 (cost=0.00..132866.20 rows=1643220 width=74) (actual time=0.069..703.008 rows=1643220 loops=1)
Planning time: 0.311 ms
Execution time: 1925.883 ms
Any suggestions why the index is not used, or how I can speed up the query some other way?
Server: PostgreSQL 9.6.
P.S. Yes, the table has 40+ columns and is de-normalized, but I know all the pros and cons of this decision.
Update1
@a_horse_with_no_name suggested using VACUUM ANALYZE instead of ANALYZE to update the table statistics. Now the query plan is:
QUERY PLAN
------------------------
Unique (cost=0.55..115753.43 rows=25208 width=74) (actual time=0.165..921.426 rows=294 loops=1)
-> Index Only Scan using idx1 on tbl1 (cost=0.55..107538.21 rows=1643044 width=74) (actual time=0.162..593.322 rows=1643220 loops=1)
Heap Fetches: 0
Much better!
The index will probably only help if PostgreSQL chooses an “index only scan”, that means that it does not have to look at the table data at all.
Normally PostgreSQL has to check the table data (“heap”) to see if a row is visible for the current transaction, because visibility information is not stored in the index.
If, however, the table does not change much and has recently been VACUUMed, PostgreSQL knows that most of the pages consist only of items visible for everyone (there is a “visibility map” to keep track of that information), and then it might be cheaper to scan the index.
Try running VACUUM on the table and see if that causes an index only scan to be used.
Other than that, there is no way to speed up such a query.
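The visibility map can be inspected before and after the VACUUM. A sketch using the table and query from the question; relallvisible in pg_class is the planner's estimate of how many pages are marked all-visible.

```sql
-- Update the visibility map and the planner statistics in one pass.
VACUUM (ANALYZE) tbl1;

-- If relallvisible is close to relpages, an index-only scan
-- should need few or no heap fetches.
SELECT relpages, relallvisible
FROM pg_class
WHERE oid = 'tbl1'::regclass;

-- Check whether the plan now uses an Index Only Scan on idx1.
EXPLAIN (ANALYZE)
SELECT DISTINCT division_name, division_id
FROM tbl1
ORDER BY division_name;
```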

Why is Postgres not using index on a simple GROUP BY?

I have created a 36M rows table with an index on type column:
CREATE TABLE items AS
SELECT
(random()*36000000)::integer AS id,
(random()*10000)::integer AS type,
md5(random()::text) AS s
FROM
generate_series(1,36000000);
CREATE INDEX items_type_idx ON items USING btree ("type");
I run this simple query and expect postgresql to use my index:
explain analyze select count(*) from "items" group by "type";
But the query planner decides to use Seq Scan instead:
HashAggregate (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1)
Group Key: type
-> Seq Scan on items (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1)
Planning time: 0.107 ms
Execution time: 6478.525 ms
Time without EXPLAIN: 5s 979ms
I have tried several solutions from here and here:
Run VACUUM ANALYZE
Configure default_statistics_target, random_page_cost, work_mem
but nothing helps apart from setting enable_seqscan = OFF:
SET enable_seqscan = OFF;
explain analyze select count(*) from "items" group by "type";
GroupAggregate (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1)
Group Key: type
-> Index Only Scan using items_type_idx on items (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1)
Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5256.667 ms
Time without EXPLAIN: 659ms
Query with index scan is about 10x faster on my machine.
Is there a better solution than setting enable_seqscan?
UPD1
My postgresql version is 9.6.3, work_mem = 4MB (tried 64MB), random_page_cost = 4 (tried 1.1), max_parallel_workers_per_gather = 0 (tried 4).
UPD2
I have tried to fill type column not with random numbers, but with i / 10000 to make pg_stats.correlation = 1 - still seqscan.
UPD3
@jgh is 100% right:
This typically only happens when the table's row width is much wider than some indexes
I've made the column data larger, and now Postgres uses the index. Thanks, everyone!
The Index-only scans wiki says:
"It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'."
and
"Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle"
Accordingly, your index is not considered "significantly smaller" than the table, and the entire dataset has to be read anyway, which leads the planner to use a seq scan.
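The size relationship the wiki refers to can be checked directly. A sketch using the table and index from the question; the SET LOCAL part is only a way to compare plans for one transaction without changing the setting for the whole session.

```sql
-- Compare the on-disk size of the table with that of the index.
SELECT pg_size_pretty(pg_relation_size('items'))          AS table_size,
       pg_size_pretty(pg_relation_size('items_type_idx')) AS index_size;

-- Compare plans with sequential scans discouraged, for this transaction only.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE) SELECT count(*) FROM items GROUP BY type;
ROLLBACK;
```

With three narrow columns per row, the index is not much smaller than the table, which matches the planner's choice of a seq scan.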

Slow index scan

I have a table with an index:
Table:
Participates (player_id integer, other...)
Indexes:
"index_participates_on_player_id" btree (player_id)
The table contains 400 million rows.
I execute the same query two times:
Query: explain analyze select * from participates where player_id=149294217;
First time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=261.061..2025.559 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 2025.644 ms
(3 rows)
Second time:
Index Scan using index_participates_on_player_id on participates (cost=0.57..19452.86 rows=6304 width=58) (actual time=0.030..0.479 rows=332 loops=1)
Index Cond: (player_id = 149294217)
Total runtime: 0.527 ms
(3 rows)
So the first execution has a large actual time. How can I speed up the first execution?
UPDATE
Sorry, to be clear: how can I accelerate the first query?
Why is the index scan so slow?
The difference in execution time is probably because, the second time through, the table and index data from the first run of the query is already in the shared buffers cache, so the subsequent run takes less time: it doesn't have to go all the way to disk for that information.
Edit:
Regarding the slowness of the original query, does the table have a lot of dead tuples? Those can slow things down considerably. If so, VACUUM ANALYZE the table.
Another factor can be if there are long-ish idle transactions on the server (i.e. several minutes or more). Due to the nature of MVCC this can also slow even index-based queries down quite a bit.
Also, the number of rows the query planner expects (6304) is quite different from the actual number (332), so you may want to run ANALYZE on the table beforehand to update the statistics.
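The points in this answer can each be checked with a short statement. A sketch, assuming the table name is participates as in the question; the statistics views used here are standard PostgreSQL catalog views.

```sql
-- "shared hit" counts pages found in cache; "read" counts pages fetched
-- from the OS or disk. A cold first run shows mostly reads.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM participates WHERE player_id = 149294217;

-- Dead tuples and the last (auto)vacuum and analyze times for the table.
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'participates';

-- Sessions sitting idle inside an open transaction, oldest first.
SELECT pid, state, xact_start, now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

-- Remove dead tuples and refresh the planner statistics.
VACUUM (ANALYZE) participates;
```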
1.) Take a look at http://www.postgresql.org/docs/9.3/static/runtime-config-resource.html and check the settings that let PostgreSQL use more memory. Tuning them can speed up your search, but there is no guarantee (it depends on the causes described in the previous answer)!
2.) Transfer a part of your tables/indexes to a more powerful tablespace. For example a tablespace based on SSDs.
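Moving the index (or the table) to a faster tablespace could be sketched as follows. The tablespace name and directory are hypothetical; the directory must already exist, be empty, and be owned by the PostgreSQL operating system user, and CREATE TABLESPACE requires superuser rights.

```sql
-- Hypothetical tablespace on an SSD mount point.
CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pg_tblspc';

-- Move the index used by the query; this rewrites the index and
-- locks it while the data is copied.
ALTER INDEX index_participates_on_player_id SET TABLESPACE fast_ssd;

-- Optionally move the table itself as well (rewrites the whole table).
-- ALTER TABLE participates SET TABLESPACE fast_ssd;
```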