I have a table with more than 10,000,000 rows. The table has a column OfficialEnterprise_vatNumber that should be unique and can be part of a full-text search.
Here are the indexes:
"uq_officialenterprise_vatnumber" "CREATE UNIQUE INDEX uq_officialenterprise_vatnumber ON commonservices.""OfficialEnterprise"" USING btree (""OfficialEnterprise_vatNumber"")"
"ix_officialenterprise_vatnumber" "CREATE INDEX ix_officialenterprise_vatnumber ON commonservices.""OfficialEnterprise"" USING gin (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text))"
But if I EXPLAIN a query that should be using the FTS index, like this:
SELECT * FROM commonservices."OfficialEnterprise"
WHERE
to_tsvector('commonservices.unaccent_dictionary', "OfficialEnterprise_vatNumber") @@ to_tsquery('FR:* | IE:*')
ORDER BY "OfficialEnterprise_vatNumber" ASC
LIMIT 100
It shows that the used index is uq_officialenterprise_vatnumber and not ix_officialenterprise_vatnumber.
Is there something I'm missing?
EDIT:
Here is the EXPLAIN ANALYZE output of the original query.
"Limit (cost=0.43..1460.27 rows=100 width=238) (actual time=6996.976..6997.057 rows=15 loops=1)"
" -> Index Scan using uq_officialenterprise_vatnumber on ""OfficialEnterprise"" (cost=0.43..1067861.32 rows=73149 width=238) (actual time=6996.975..6997.054 rows=15 loops=1)"
" Filter: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
" Rows Removed by Filter: 1847197"
"Planning Time: 0.185 ms"
"Execution Time: 6997.081 ms"
Here is the EXPLAIN ANALYZE of the query if I add || '0' to the ORDER BY (i.e. ORDER BY "OfficialEnterprise_vatNumber" || '0' ASC).
"Limit (cost=55558.82..55570.49 rows=100 width=270) (actual time=7.069..9.827 rows=15 loops=1)"
" -> Gather Merge (cost=55558.82..62671.09 rows=60958 width=270) (actual time=7.068..9.823 rows=15 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=54558.80..54635.00 rows=30479 width=270) (actual time=0.235..0.238 rows=5 loops=3)"
" Sort Key: (((""OfficialEnterprise_vatNumber"")::text || '0'::text))"
" Sort Method: quicksort Memory: 28kB"
" Worker 0: Sort Method: quicksort Memory: 25kB"
" Worker 1: Sort Method: quicksort Memory: 25kB"
" -> Parallel Bitmap Heap Scan on ""OfficialEnterprise"" (cost=719.16..53393.91 rows=30479 width=270) (actual time=0.157..0.166 rows=5 loops=3)"
" Recheck Cond: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
" Heap Blocks: exact=6"
" -> Bitmap Index Scan on ix_officialenterprise_vatnumber (cost=0.00..700.87 rows=73149 width=0) (actual time=0.356..0.358 rows=15 loops=1)"
" Index Cond: (to_tsvector('commonservices.unaccent_dictionary'::regconfig, (""OfficialEnterprise_vatNumber"")::text) ## to_tsquery('FR:* | IE:*'::text))"
"Planning Time: 0.108 ms"
"Execution Time: 9.886 ms"
rows=73149...(actual...rows=15)
So it vastly misestimates the number of rows it will find. If it actually found 73149 rows, using the ordering index probably would be faster.
Have you ANALYZED your table since creating the functional index on it? Doing that might fix the estimation problem. Or at least improve it enough to fix the planning decision.
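For example, using the table from the question:
ANALYZE commonservices."OfficialEnterprise";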
Yes, doing dummy operations like +0 or ||'' is hackish. They work by forcing the planner to think it can't use the index to fulfill the ORDER BY.
Doing full text search on a VAT number seems rather misguided in the first place. Maybe this would be best addressed with a LIKE query, or by creating an explicit column to hold the country of origin flag.
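For example, a prefix match on the country code can use a plain btree index with text_pattern_ops instead of full text search. A minimal sketch, assuming the goal is matching the two-letter prefix (the index name here is illustrative):
CREATE INDEX ix_officialenterprise_vatnumber_prefix
ON commonservices."OfficialEnterprise" ("OfficialEnterprise_vatNumber" text_pattern_ops);

SELECT * FROM commonservices."OfficialEnterprise"
WHERE "OfficialEnterprise_vatNumber" LIKE 'FR%'
OR "OfficialEnterprise_vatNumber" LIKE 'IE%'
ORDER BY "OfficialEnterprise_vatNumber" ASC
LIMIT 100;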
I have a table with 2M rows and I am running a query that uses 5 columns, all of them indexed. Still, the query execution time is high.
Query:
SELECT cmp_domain as domain, slug, cmp_name as company_name, prod_categories, prod_sub_categories, cmp_web_traff_rank
FROM prospects_v5.commercepedia
WHERE
country='United States of America'
AND 'Shopify' =ANY (technologies)
AND is_live = true
OR 'General Merchandise' =ANY(prod_categories)
order by cmp_web_traff_rank
LIMIT 10
OFFSET 30000;
Below is the explain plan:
" -> Gather Merge (cost=394508.12..401111.22 rows=56594 width=109) (actual time=14538.165..14557.052 rows=30010 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=393508.10..393578.84 rows=28297 width=109) (actual time=14520.435..14523.376 rows=10175 loops=3)"
" Sort Key: cmp_web_traff_rank"
" Sort Method: external merge Disk: 3896kB"
" Worker 0: Sort Method: external merge Disk: 4056kB"
" Worker 1: Sort Method: external merge Disk: 4096kB"
" -> Parallel Seq Scan on commercepedia (cost=0.00..391415.77 rows=28297 width=109) (actual time=6.726..14439.953 rows=32042 loops=3)"
" Filter: (((country = 'United States of America'::text) AND ('Shopify'::text = ANY (technologies)) AND is_live) OR ('General Merchandise'::text = ANY (prod_categories)))"
" Rows Removed by Filter: 459792"
"Planning Time: 0.326 ms"
"Execution Time: 14559.593 ms"
I have created btree indexes on country, is_live, and cmp_web_traff_rank, and GIN indexes on technologies and prod_categories.
When I use an AND condition for all columns, this is the explain plan:
" -> Gather Merge (cost=269444.76..269511.27 rows=570 width=109) (actual time=10780.530..10785.326 rows=1672 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=268444.74..268445.45 rows=285 width=109) (actual time=10762.765..10762.862 rows=557 loops=3)"
" Sort Key: cmp_web_traff_rank"
" Sort Method: quicksort Memory: 125kB"
" Worker 0: Sort Method: quicksort Memory: 133kB"
" Worker 1: Sort Method: quicksort Memory: 124kB"
" -> Parallel Bitmap Heap Scan on commercepedia (cost=19489.58..268433.12 rows=285 width=109) (actual time=318.652..10759.284 rows=557 loops=3)"
" Recheck Cond: (country = 'United States of America'::text)"
" Rows Removed by Index Recheck: 18486"
" Filter: (is_live AND ('Shopify'::text = ANY (technologies)) AND ('General Merchandise'::text = ANY (prod_categories)))"
" Rows Removed by Filter: 80120"
" Heap Blocks: exact=18391 lossy=10838"
" -> BitmapAnd (cost=19489.58..19489.58 rows=107598 width=0) (actual time=259.181..259.183 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_is_live (cost=0.00..4944.53 rows=267214 width=0) (actual time=52.584..52.584 rows=271711 loops=1)"
" Index Cond: (is_live = true)"
" -> Bitmap Index Scan on idx_country (cost=0.00..14544.45 rows=594137 width=0) (actual time=199.594..199.594 rows=593938 loops=1)"
" Index Cond: (country = 'United States of America'::text)"
"Planning Time: 0.243 ms"
"Execution Time: 10790.385 ms"
Is there any way I can improve the query performance further?
Here is some low-hanging fruit:
Try increasing work_mem to eliminate the lossy bitmap heap scan. You can inspect and set it with:
SHOW work_mem;
SET work_mem = xx;
If applicable, drop the index on is_live and recreate the other indexes as partial indexes conditioned on is_live (see the sketch after this list).
The plan does not use your GIN indexes right now. Try using operators that can use a GIN index, e.g. rewrite
'Shopify' = ANY (technologies)  ->  technologies @> '{Shopify}';
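A minimal sketch of both ideas combined, assuming most queries filter on is_live = true (the index name is illustrative):
CREATE INDEX idx_commercepedia_country_live
ON prospects_v5.commercepedia (country)
WHERE is_live;

SELECT cmp_domain AS domain, slug, cmp_name AS company_name, prod_categories, prod_sub_categories, cmp_web_traff_rank
FROM prospects_v5.commercepedia
WHERE country = 'United States of America'
AND is_live
AND technologies @> '{Shopify}'
ORDER BY cmp_web_traff_rank
LIMIT 10 OFFSET 30000;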
I'm trying to understand this situation: in my test environment (PostgreSQL 14 on Debian) I have a table with many rows (18,000,000).
Suppose the table is "entities", and in simple terms it has this structure:
CREATE TABLE entities (
id BIGINT not null,
somefield INTEGER not null,
otherfield TIMESTAMP not null
);
create index ix1 on entities(somefield);
create index ix2 on entities(otherfield);
Then I cloned it twice, say "entities1" and "entities2" with a dumb script:
insert into entities1 select * from entities;
insert into entities2 select * from entities;
Then I ran
vacuum analyze entities2;
reindex table entities2;
and I kept entities1 untouched.
Now I have this simple query:
select extract(year from otherfield), COUNT(*) as n
from entities1
group by extract(year from otherfield);
which results in:
2019 43956
2020 5981223
2021 12172333
My query runs way faster (2x) in the untouched table than in the analyzed table.
The execution plans are these:
entities1
"Finalize GroupAggregate (cost=528695.36..528746.53 rows=200 width=40) (actual time=4774.324..4780.668 rows=3 loops=1)"
" Group Key: (EXTRACT(year FROM ""otherfield""))"
" -> Gather Merge (cost=528695.36..528742.03 rows=400 width=40) (actual time=4774.316..4780.658 rows=9 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=527695.34..527695.84 rows=200 width=40) (actual time=4727.063..4727.064 rows=3 loops=3)"
" Sort Key: (EXTRACT(year FROM ""otherfield""))"
" Sort Method: quicksort Memory: 25kB"
" Worker 0: Sort Method: quicksort Memory: 25kB"
" Worker 1: Sort Method: quicksort Memory: 25kB"
" -> Partial HashAggregate (cost=527685.19..527687.69 rows=200 width=40) (actual time=4727.040..4727.041 rows=3 loops=3)"
" Group Key: EXTRACT(year FROM ""otherfield"")"
" Batches: 1 Memory Usage: 40kB"
" Worker 0: Batches: 1 Memory Usage: 40kB"
" Worker 1: Batches: 1 Memory Usage: 40kB"
" -> Parallel Seq Scan on entities1 (cost=0.00..489773.71 rows=7582297 width=32) (actual time=0.548..3302.243 rows=6065837 loops=3)"
"Planning Time: 1.309 ms"
"Execution Time: 4780.718 ms"
entities2
"Gather (cost=1000.56..4087193.67 rows=16836383 width=40) (actual time=209.129..12448.711 rows=3 loops=1)"
" Workers Planned: 1"
" Workers Launched: 1"
" Single Copy: true"
" -> GroupAggregate (cost=0.56..2402555.37 rows=16836383 width=40) (actual time=43.676..12263.595 rows=3 loops=1)"
" Group Key: EXTRACT(year FROM ""otherfield"")"
" -> Index Scan using ix2_entities2 on entities2 (cost=0.56..2101113.02 rows=18197512 width=32) (actual time=2.756..9803.865 rows=18197512 loops=1)"
"Planning Time: 0.164 ms"
"Execution Time: 12448.750 ms"
What I immediately noticed is that the first plan runs in parallel, so I forced a parallel execution on entities2 by adjusting the parallelism costs.
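The settings used (values as given above):
SET parallel_setup_cost = 100000;
SET parallel_tuple_cost = 0.001;
With those settings I obtained this plan: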
"GroupAggregate (cost=1532983.03..2062194.15 rows=16836383 width=40) (actual time=3373.406..10682.438 rows=3 loops=1)"
" Group Key: (EXTRACT(year FROM ""otherfield""))"
" -> Gather Merge (cost=1532983.03..1760751.81 rows=18197512 width=32) (actual time=3352.633..7930.565 rows=18197512 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=1432983.00..1451938.74 rows=7582297 width=32) (actual time=3276.100..3999.320 rows=6065837 loops=3)"
" Sort Key: (EXTRACT(year FROM ""otherfield""))"
" Sort Method: external merge Disk: 91728kB"
" Worker 0: Sort Method: external merge Disk: 88432kB"
" Worker 1: Sort Method: external merge Disk: 86704kB"
" -> Parallel Index Only Scan using ix2_entities2 on entities2 (cost=0.44..385130.71 rows=7582297 width=32) (actual time=1.225..1763.335 rows=6065837 loops=3)"
" Heap Fetches: 0"
"Planning Time: 0.124 ms"
"Execution Time: 10693.600 ms"
Now, with both plans in parallel, I see these differences:
The seq scan on entities1 is faster than the index scan on entities2, but the seq scan has to hash-aggregate, resulting in a more expensive subplan. OK, without statistics being collected, maybe the seq scan is more efficient than the index scan on the un-analyzed table.
The HashAggregate on entities1 says rows=200, which is the default estimate. Again, maybe that is legitimate because no statistics have been collected.
The sort methods: when forced to parallel, entities2 has to perform an external merge sort on disk, caused by the size of the data (about 175MB across the two workers). entities1 has practically no sort load (50kB across the two workers), and the plan chose a quicksort.
Both queries return the same result. I don't understand the need for the sort, and even less why ANALYZE has such a bad impact on performance.
What am I missing?
Let me know if I omitted any details.
Thanks in advance.
I have a query that is being run on PostgreSQL, and when executed at a high rate over large data sets it takes a long time because it isn't making use of the available indexes. I found that changing the filter from multiple ORs to an IN clause causes the right index to be used. Is there a way I can force the index to be used even when using ORs?
Query with Disjunction:
SELECT field1, field2,..., fieldN
FROM table1 WHERE
((((field9='val1' OR field9='val2') OR field9='val3') OR field9='val4')
AND (field6='val5'));
Query Plan:
"Bitmap Heap Scan on table1 (cost=18.85..19.88 rows=1 width=395) (actual time=0.017..0.017 rows=0 loops=1)"
" Recheck Cond: (((field6)::text = 'val5'::text) AND (((field9)::text = 'val1'::text) OR ((field9)::text = 'val2'::text) OR ((field9)::text = 'val3'::text) OR ((field9)::text = 'val4'::text)))"
" -> BitmapAnd (cost=18.85..18.85 rows=1 width=0) (actual time=0.016..0.016 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_field6_field9 (cost=0.00..9.01 rows=611 width=0) (actual time=0.015..0.015 rows=0 loops=1)"
" Index Cond: ((field6)::text = 'val5'::text)"
" -> BitmapOr (cost=9.59..9.59 rows=516 width=0) (never executed)"
" -> Bitmap Index Scan on idx_id_field9 (cost=0.00..2.40 rows=129 width=0) (never executed)"
" Index Cond: ((field9)::text = 'val1'::text)"
" -> Bitmap Index Scan on idx_id_field9 (cost=0.00..2.40 rows=129 width=0) (never executed)"
" Index Cond: ((field9)::text = 'val2'::text)"
" -> Bitmap Index Scan on idx_id_field9 (cost=0.00..2.40 rows=129 width=0) (never executed)"
" Index Cond: ((field9)::text = 'val3'::text)"
" -> Bitmap Index Scan on idx_id_field9 (cost=0.00..2.40 rows=129 width=0) (never executed)"
" Index Cond: ((field9)::text = 'val4'::text)"
"Planning time: 0.177 ms"
"Execution time: 0.061 ms"
Query with IN
SELECT field1, field2,..., fieldN
FROM table1
WHERE
((field9 IN ('val1', 'val2', 'val3', 'val4'))
AND (field6='val5'));
Query Plan:
"Index Scan using idx_field6_field9 on table1 (cost=0.43..6.77 rows=1 width=395) (actual time=0.032..0.032 rows=0 loops=1)"
" Index Cond: (((field6)::text = 'val5'::text) AND ((field9)::text = ANY ('{val1,val2,val3,val4}'::text[])))"
"Planning time: 0.145 ms"
"Execution time: 0.055 ms"
There is an index on field6 and field9, which the second query uses as expected and which the first one should use as well. field9 is something like a state field, so its cardinality is extremely low; there are only about 9 distinct values across the whole table. Unfortunately, it isn't straightforward to change the query to use an IN clause, so getting Postgres to choose the right plan would be ideal.
There is no way you can get the fast plan (single index scan) using the OR condition. You'll have to rewrite the query.
You want to know why, which is always difficult to explain. With optimizations like that, there are usually two reasons:
Nobody got around to doing it.
This requires extra effort every time a query with an OR is planned:
Are there several conditions linked with OR that have the same expression on one side?
Both plans, the original and the rewritten one, would have to be estimated. It may well be that the BitmapOr is the most efficient way to process the query.
This price would have to be paid by every query with OR in it.
I am not saying that it is a bad idea to add an optimization like this, but there are two sides to the coin.
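For what it's worth, the plan for the IN query above shows that Postgres normalizes the IN list to = ANY, so an equivalent rewrite that gets the single index scan is (field list abbreviated as in the question):
SELECT field1, field2, ..., fieldN
FROM table1
WHERE field9 = ANY (ARRAY['val1', 'val2', 'val3', 'val4'])
AND field6 = 'val5';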
Imagine the following table:
CREATE TABLE drops(
id BIGSERIAL PRIMARY KEY,
loc VARCHAR(5) NOT NULL,
tag INT NOT NULL
);
What I want to do is perform a query that finds all unique locations where a value matches the tag.
SELECT DISTINCT loc
FROM drops
WHERE tag = '1'
GROUP BY loc;
I am not sure whether it is due to the size (it's 9M rows!) or to my own inefficiency, but the query takes way too long for users to work with. At the time I was writing this, the above query took 1 minute 14 seconds.
Are there any tricks or methods I can utilize in order to shorten this to a mere few seconds?
Much appreciated!
The execution plan:
"Unique (cost=1967352.72..1967407.22 rows=41 width=4) (actual time=40890.768..40894.984 rows=30 loops=1)"
" -> Group (cost=1967352.72..1967407.12 rows=41 width=4) (actual time=40890.767..40894.972 rows=30 loops=1)"
" Group Key: loc"
" -> Gather Merge (cost=1967352.72..1967406.92 rows=82 width=4) (actual time=40890.765..40895.031 rows=88 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Group (cost=1966352.70..1966397.43 rows=41 width=4) (actual time=40879.910..40883.362 rows=29 loops=3)"
" Group Key: loc"
" -> Sort (cost=1966352.70..1966375.06 rows=8946 width=4) (actual time=40879.907..40881.154 rows=19129 loops=3)"
" Sort Key: loc"
" Sort Method: quicksort Memory: 1660kB"
" -> Parallel Seq Scan on drops (cost=0.00..1965765.53 rows=8946 width=4) (actual time=1.341..40858.553 rows=19129 loops=3)"
" Filter: (tag = 1)"
" Rows Removed by Filter: 3113338"
"Planning time: 0.146 ms"
"Execution time: 40895.280 ms"
The table is indexed on loc and tag.
Your 40 seconds are spent sequentially reading the whole table, throwing away 3113338 rows to keep only 19129.
The remedy is simple:
CREATE INDEX ON drops(tag);
But you say you have already done that, which I find hard to believe. What is the command you used?
Change the condition in the query from
WHERE tag = '1'
to
WHERE tag = 1
It happens to work because '1' is a literal, but don't try to compare strings and numbers.
And, as has been mentioned, keep either the DISTINCT or the GROUP BY, but not both.
If you use a GROUP BY clause, there is no need for the DISTINCT keyword; omitting it should speed up the query.
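Putting both suggestions together, the query becomes:
SELECT DISTINCT loc
FROM drops
WHERE tag = 1;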
I have a problem with a query that uses the wrong query plan. Because of the suboptimal query plan, the query takes almost 20 seconds.
The problem occurs only for a small number of owner_ids. The distribution of the owner_ids is not uniform. The owner_id in the example has 7948 routes. The total number of routes is 2903096.
The database is hosted on Amazon RDS on a server with 34.2 GiB memory, 4vCPU and provisioned IOPS (instance type db.m2.2xlarge). The Postgres version is 9.3.5.
EXPLAIN ANALYZE SELECT
route.id, route_meta.name
FROM
route
INNER JOIN
route_meta
USING (id)
WHERE
route.owner_id = 128905
ORDER BY
route_meta.name
LIMIT
61
Query plan:
"Limit (cost=0.86..58637.88 rows=61 width=24) (actual time=49.731..18828.052 rows=61 loops=1)"
" -> Nested Loop (cost=0.86..7934263.10 rows=8254 width=24) (actual time=49.728..18827.887 rows=61 loops=1)"
" -> Index Scan using route_meta_i_name on route_meta (cost=0.43..289911.22 rows=2902910 width=24) (actual time=0.016..2825.932 rows=1411126 loops=1)"
" -> Index Scan using route_pkey on route (cost=0.43..2.62 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=1411126)"
" Index Cond: (id = route_meta.id)"
" Filter: (owner_id = 128905)"
" Rows Removed by Filter: 1"
"Total runtime: 18828.214 ms"
If I increase the limit to 100, a better query plan is used. It now takes less than 100 ms.
EXPLAIN ANALYZE SELECT
route.id, route_meta.name
FROM
route
INNER JOIN
route_meta
USING (id)
WHERE
route.owner_id = 128905
ORDER BY
route_meta.name
LIMIT
100
Query plan:
"Limit (cost=79964.98..79965.23 rows=100 width=24) (actual time=93.037..93.294 rows=100 loops=1)"
" -> Sort (cost=79964.98..79985.61 rows=8254 width=24) (actual time=93.033..93.120 rows=100 loops=1)"
" Sort Key: route_meta.name"
" Sort Method: top-N heapsort Memory: 31kB"
" -> Nested Loop (cost=0.86..79649.52 rows=8254 width=24) (actual time=0.039..77.955 rows=7948 loops=1)"
" -> Index Scan using route_i_owner_id on route (cost=0.43..22765.84 rows=8408 width=4) (actual time=0.023..13.839 rows=7948 loops=1)"
" Index Cond: (owner_id = 128905)"
" -> Index Scan using route_meta_pkey on route_meta (cost=0.43..6.76 rows=1 width=24) (actual time=0.003..0.004 rows=1 loops=7948)"
" Index Cond: (id = route.id)"
"Total runtime: 93.444 ms"
I have already tried the following things:
increasing the statistics target for owner_id (the owner_id in the example is included in pg_stats):
ALTER TABLE route ALTER COLUMN owner_id SET STATISTICS 1000;
reindexing owner_id and name
vacuum analyze
increasing work_mem from 1MB to 16MB
rewriting the query with row_number() OVER (ORDER BY xxx) AS rn ... WHERE rn <= yyy in a subquery; this solves the specific case, but it introduces performance problems with other owner_ids (see the sketch after this list)
A similar problem was solved with a combined index, but that seems impossible here because of the different tables.
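For reference, a minimal sketch of the row_number() rewrite described above, using the owner_id and limit from the example query:
SELECT id, name
FROM (
SELECT route.id, route_meta.name,
row_number() OVER (ORDER BY route_meta.name) AS rn
FROM route
INNER JOIN route_meta
USING (id)
WHERE route.owner_id = 128905
) AS sub
WHERE rn <= 61;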