Looking at the PostgreSQL documentation, it seems that the more work_mem the better, since an operation will "generally be allowed to use as much memory as this value specifies before it starts to write data into temporary files", and writing to temporary files I understand to be bad.
But when trying to measure the improvement from increasing this value, I first set it to the lowest possible, 64kB, and the query got faster. How come?!?
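For reference, work_mem can be changed per session before each run, along these lines (the actual query text is omitted here):
SET work_mem = '64MB';
EXPLAIN ANALYZE <query>;  -- first plan below
SET work_mem = '64kB';
EXPLAIN ANALYZE <query>;  -- second plan below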
query plan with 64MB of work_mem:
QUERY PLAN |
----------------------------------------------------------------------------------------------------------------------------------+
Unique (cost=325.67..333.67 rows=472 width=1046) (actual time=5.574..5.938 rows=493 loops=1) |
-> Sort (cost=325.67..329.67 rows=1599 width=1046) (actual time=5.572..5.681 rows=1773 loops=1) |
Sort Key: id, revision_date DESC, revision_id DESC |
Sort Method: quicksort Memory: 1447kB |
-> Seq Scan on indicator_revisions (cost=0.00..240.58 rows=1599 width=1046) (actual time=0.015..2.501 rows=1773 loops=1)|
Filter: (NOT draft) |
Rows Removed by Filter: 269 |
Planning Time: 0.151 ms |
Execution Time: 6.025 ms |
query plan with 64kB of work_mem:
QUERY PLAN |
---------------------------------------------------------------------------------------------------------------------------------------------------------------+
Unique (cost=0.28..1015.05 rows=472 width=1046) (actual time=0.026..3.834 rows=493 loops=1) |
-> Index Scan using key_risk_indicators_index on indicator_revisions (cost=0.28..1011.05 rows=1599 width=1046) (actual time=0.022..3.138 rows=1773 loops=1)|
Filter: (NOT draft) |
Rows Removed by Filter: 269 |
Planning Time: 0.158 ms |
Execution Time: 3.904 ms |
About 2 ms of difference, roughly 35% faster?!?
PS: I know that in this case an index scan kicks in, but the planner should know better, right? If the index scan is better, why didn't the planner go for it the first time?
I have a simple query for a simple table in postgres. I have a simple index on that table.
In some environments the query uses the index; in other environments (on the same RDS instance, different database) it doesn't (checked using EXPLAIN ANALYSE).
One thing I've noticed is that if the 'Check X Min' flag on the index is TRUE, then the index is not used (pg_catalog.pg_index.indcheckxmin).
How do I ensure the index is used and, presumably, have the 'Check X Min' flag set to false?
Table contains 100K+ rows.
Things I have tried:
The index is valid and is always used in environments where the 'Check X Min' is set to false.
set enable_seqscan to off; still does not use the index.
Creating/recreating an index in these environments always seems to have 'Check X Min' set to true.
Vacuuming does not seem to help.
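For what it's worth, the flag can be inspected directly with a catalog query like this:
SELECT indexrelid::regclass AS index_name, indcheckxmin
FROM pg_catalog.pg_index
WHERE indrelid = 'schema_1.table_1'::regclass;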
Setup of table and index:
CREATE TABLE schema_1.table_1 (
field_1 varchar(20) NOT NULL,
field_2 int4 NULL,
field_3 timestamptz NULL,
field_4 numeric(10,2) NULL
);
CREATE INDEX table_1_field_1_field_3_idx ON schema_1.table_1 USING btree (field_1, field_3 DESC);
Query:
select field_1, field_2, field_3, field_4
from schema_1.table_1
where field_1 = 'abcdef'
order by field_3 desc limit 1;
When not using index:
QUERY PLAN |
---------------------------------------------------------------------------------------------------------------------|
Limit (cost=4.41..4.41 rows=1 width=51) (actual time=3.174..3.176 rows=1 loops=1) |
-> Sort (cost=4.41..4.42 rows=6 width=51) (actual time=3.174..3.174 rows=1 loops=1) |
Sort Key: field_3 DESC |
Sort Method: top-N heapsort Memory: 25kB |
-> Seq Scan on table_1 (cost=0.00..4.38 rows=6 width=51) (actual time=3.119..3.150 rows=3 loops=1)|
Filter: ((field_1)::text = 'abcdef'::text) |
Rows Removed by Filter: 96 |
Planning time: 2.895 ms |
Execution time: 3.197 ms |
When using index:
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------------------------|
Limit (cost=0.28..6.30 rows=1 width=51) (actual time=0.070..0.144 rows=1 loops=1) |
-> Index Scan using table_1_field_1_field_3_idx on table_1 (cost=0.28..12.31 rows=2 width=51) (actual time=0.049..0.066 rows=1 loops=1)|
Index Cond: ((field_1)::text = 'abcdef'::text) |
Planning time: 0.184 ms |
Execution time: 0.303 ms |
I have renamed the fields, schema, and table to avoid sharing business context.
You seem to be using CREATE INDEX CONCURRENTLY, and have long-open transactions. From the docs:
Even then, however, the index may not be immediately usable for queries: in the worst case, it cannot be used as long as transactions exist that predate the start of the index build.
You don't have a lot of options here. Hunt down and fix your long-open transactions, don't use CONCURRENTLY, or put up with the limitation.
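To hunt them down, a query against pg_stat_activity along these lines can help (the one-hour cutoff is just an example):
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 hour'
ORDER BY xact_start;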
I'm trying to understand whether it's possible to optimize a query containing a self-join, and if it is, how to do it.
I'm working on a bigger real-life task, but here I extracted a simple sub-task from it to keep focus on a particular issue: optimizing a self-join query.
I have a table called parties. It contains over 85k records and looks like this:
# \d test.parties
Table "test.parties"
Column | Type | Collation | Nullable | Default
-------------+------+-----------+----------+---------
id | uuid | | |
contract_id | uuid | | |
Doing a self-join on contract_id I get this plan:
# explain analyse select p1.id from test.parties p1 join test.parties p2 on p1.contract_id = p2.contract_id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=20207.87..628157.87 rows=40500000 width=16) (actual time=109.709..184.523 rows=197632 loops=1)
Merge Cond: (p1.contract_id = p2.contract_id)
-> Sort (cost=11181.94..11406.94 rows=90000 width=32) (actual time=55.560..66.173 rows=86332 loops=1)
Sort Key: p1.contract_id
Sort Method: external merge Disk: 3560kB
-> Seq Scan on parties p1 (cost=0.00..1620.00 rows=90000 width=32) (actual time=0.018..14.518 rows=86332 loops=1)
-> Sort (cost=9025.94..9250.94 rows=90000 width=16) (actual time=54.135..74.973 rows=197631 loops=1)
Sort Key: p2.contract_id
Sort Method: external sort Disk: 2544kB
-> Seq Scan on parties p2 (cost=0.00..1620.00 rows=90000 width=16) (actual time=0.009..10.462 rows=86332 loops=1)
Planning Time: 0.167 ms
Execution Time: 199.677 ms
(12 rows)
Adding an index on contract_id I get this plan:
# create index on test.parties(contract_id);
CREATE INDEX
# explain analyse select p1.id from test.parties p1 join test.parties p2 on p1.contract_id = p2.contract_id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=3084.47..10570.76 rows=192484 width=16) (actual time=32.457..97.662 rows=197632 loops=1)
Hash Cond: (p1.contract_id = p2.contract_id)
-> Seq Scan on parties p1 (cost=0.00..1583.32 rows=86332 width=32) (actual time=0.013..11.293 rows=86332 loops=1)
-> Hash (cost=1583.32..1583.32 rows=86332 width=16) (actual time=32.133..32.133 rows=86332 loops=1)
Buckets: 131072 Batches: 2 Memory Usage: 3048kB
-> Seq Scan on parties p2 (cost=0.00..1583.32 rows=86332 width=16) (actual time=0.007..12.815 rows=86332 loops=1)
Planning Time: 0.444 ms
Execution Time: 110.692 ms
(8 rows)
Is there a way I could get rid of those Seq Scans?
I don't see any index being used in your explain plan, so assuming that you have not yet looked into using indices, here is one suggestion:
CREATE INDEX idx ON parties (contract_id, id);
This should speed up the join, and it also covers the id value, which is required in the SELECT clause.
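Note that the planner may still prefer Seq Scans here, because the join reads almost the whole table anyway. With the covering index in place and an up-to-date visibility map, an index-only scan becomes possible; a sketch:
VACUUM ANALYZE test.parties;
EXPLAIN ANALYZE
SELECT p1.id
FROM test.parties p1
JOIN test.parties p2 ON p1.contract_id = p2.contract_id;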
I want to query my table, which has the following structure:
Table "public.company_geo_table"
Column | Type | Collation | Nullable | Default
--------------------+--------+-----------+----------+---------
geoname_id | bigint | | |
date | text | | |
cik | text | | |
count | bigint | | |
country_iso_code | text | | |
subdivision_1_name | text | | |
city_name | text | | |
Indexes:
"cik_country_index" btree (cik, country_iso_code)
"cik_geoname_index" btree (cik, geoname_id)
"cik_index" btree (cik)
"date_index" brin (date)
I tried the following SQL query, which needs to fetch rows for a specific cik number during a time period, grouped by cik and geoname_id (different areas):
select cik, geoname_id, sum(count) as total
from company_geo_table
where cik = '1111111'
and date between '2016-01-01' and '2016-01-10'
group by cik, geoname_id
The EXPLAIN result showed that it only used the cik index and the date index, and did not use the cik_geoname_index. Why? Is there any way I can optimize my solution? Any new indices? Thank you in advance.
HashAggregate (cost=117182.79..117521.42 rows=27091 width=47) (actual time=560132.903..560134.229 rows=3552 loops=1)
Group Key: cik, geoname_id
-> Bitmap Heap Scan on company_geo_table (cost=16467.77..116979.48 rows=27108 width=23) (actual time=6486.232..560114.828 rows=8175 loops=1)
Recheck Cond: ((date >= '2016-01-01'::text) AND (date <= '2016-01-10'::text) AND (cik = '1111111'::text))
Rows Removed by Index Recheck: 16621155
Heap Blocks: lossy=193098
-> BitmapAnd (cost=16467.77..16467.77 rows=27428 width=0) (actual time=6469.640..6469.641 rows=0 loops=1)
-> Bitmap Index Scan on date_index (cost=0.00..244.81 rows=7155101 width=0) (actual time=53.034..53.035 rows=8261120 loops=1)
Index Cond: ((date >= '2016-01-01'::text) AND (date <= '2016-01-10'::text))
-> Bitmap Index Scan on cik_index (cost=0.00..16209.15 rows=739278 width=0) (actual time=6370.930..6370.930 rows=676231 loops=1)
Index Cond: (cik = '1111111'::text)
Planning time: 12.909 ms
Execution time: 560135.432 ms
The estimate is not good (probably because the value '1111111' is used very often; I am not sure about the impact). It also looks like the cik column has the wrong data type (text), which can be a reason, or a partial reason, for the bad estimate:
Bitmap Heap Scan on company_geo_table (cost=16467.77..116979.48 rows=27108 width=23) (actual time=6486.232..560114.828 rows=8175 loops=1)
It looks like a composite index on (date, cik) could help.
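For example, something like:
CREATE INDEX ON company_geo_table (date, cik);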
Your problem seems to be here:
Rows Removed by Index Recheck: 16621155
Heap Blocks: lossy=193098
Your work_mem setting is too low, so PostgreSQL cannot fit a bitmap that contains one bit per table row; it degrades to one bit per 8kB block. This means that many false positive hits have to be removed during the bitmap heap scan.
Try a higher work_mem and see if that improves query performance.
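For example (the 256MB figure is only an illustration; size it to your machine and re-run the query to see whether "Heap Blocks: lossy" becomes "Heap Blocks: exact"):
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT cik, geoname_id, sum(count) AS total
FROM company_geo_table
WHERE cik = '1111111'
  AND date BETWEEN '2016-01-01' AND '2016-01-10'
GROUP BY cik, geoname_id;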
The ideal index would be
CREATE INDEX ON company_geo_table (cik, date);
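With the equality column (cik) first and the date range second, both WHERE conditions can be evaluated inside a single index, which avoids the BitmapAnd of two large (and partly lossy) bitmaps seen in the plan above.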
I'm running into an issue in PostgreSQL (version 9.6.10) with indexes not working to speed up a MAX query with a simple equality filter on another column. Logically it seems that a simple multicolumn index on (A, B DESC) should make the query super fast.
I can't for the life of me figure out why I can't get a query to be performant regardless of what indexes are defined.
The table definition has the following:
- A primary key column foo_uid VARCHAR (not used in the query)
- A NOT NULL UUID column called bar_uid
- A sequential_id column that was created as a BIGSERIAL UNIQUE type
Here's what the relevant columns look like exactly (with names modified for privacy):
Table "public.foo"
Column | Type | Modifiers
----------------------+--------------------------+--------------------------------------------------------------------------------
foo_uid | character varying | not null
bar_uid | uuid | not null
sequential_id | bigint | not null default nextval('foo_sequential_id_seq'::regclass)
Indexes:
"foo_pkey" PRIMARY KEY, btree (foo_uid)
"foo_bar_uid_sequential_id_idx", btree (bar_uid, sequential_id DESC)
"foo_sequential_id_key" UNIQUE CONSTRAINT, btree (sequential_id)
Despite having the index listed above on (bar_uid, sequential_id DESC), the following query falls back to scanning a different index and takes 100-300 ms with a few million rows in the database.
The Query (get the max sequential_id for a given bar_uid):
SELECT MAX(sequential_id)
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f';
The EXPLAIN ANALYZE result doesn't use the proper index. Also, for some reason it checks sequential_id IS NOT NULL even though the column is declared NOT NULL.
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.75..0.76 rows=1 width=8) (actual time=321.110..321.110 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.75 rows=1 width=8) (actual time=321.106..321.106 rows=1 loops=1)
-> Index Scan Backward using foo_sequential_id_key on foo (cost=0.43..98936.43 rows=308401 width=8) (actual time=321.106..321.106 rows=1 loops=1)
Index Cond: (sequential_id IS NOT NULL)
Filter: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Rows Removed by Filter: 920761
Planning time: 0.196 ms
Execution time: 321.127 ms
(9 rows)
I can add a seemingly unnecessary GROUP BY to this query, which speeds it up a bit, but it's still really slow for a query that should be near-instantaneous with the indexes defined:
SELECT MAX(sequential_id)
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'
GROUP BY bar_uid;
The EXPLAIN (ANALYZE, BUFFERS) result:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=8510.54..65953.61 rows=6 width=24) (actual time=234.529..234.530 rows=1 loops=1)
Group Key: bar_uid
Buffers: shared hit=1 read=11909
-> Bitmap Heap Scan on foo (cost=8510.54..64411.55 rows=308401 width=24) (actual time=65.259..201.969 rows=309023 loops=1)
Recheck Cond: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Heap Blocks: exact=10385
Buffers: shared hit=1 read=11909
-> Bitmap Index Scan on foo_bar_uid_sequential_id_idx (cost=0.00..8433.43 rows=308401 width=0) (actual time=63.549..63.549 rows=309023 loops=1)
Index Cond: (bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'::uuid)
Buffers: shared read=1525
Planning time: 3.067 ms
Execution time: 234.589 ms
(12 rows)
Does anyone have any idea what's blocking this query from being on the order of 10 milliseconds? This should logically be instantaneous with the right index defined. It should only require the time to follow links to the leaf value in the B-Tree.
Someone asked:
What do you get for SELECT * FROM pg_stats WHERE tablename = 'foo' and attname = 'bar_uid';?
-[ RECORD 1 ]----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | foo
attname                | bar_uid
inherited              | f
null_frac              | 0
avg_width              | 16
n_distinct             | 6
most_common_vals       | {fa61424d-389f-4e75-ba2d-b77e6bb8491f,5c5dcae9-1b7e-4413-99a1-62fde2b89c32,50b1e842-fc32-4c2c-b00f-4a17c3c1c5fa,7ff1999c-c0ea-b700-343f-9a737f6ad659,f667b353-e199-4890-9ffd-4940ea11fe2c,b24ce968-29fd-4587-ba1f-227036ee3135}
most_common_freqs      | {0.203733,0.203167,0.201567,0.195867,0.1952,0.000466667}
histogram_bounds       |
correlation            | -0.158093
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
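One workaround worth trying (a sketch, not a confirmed fix for this case): rewrite the MAX as ORDER BY ... LIMIT 1, which matches the (bar_uid, sequential_id DESC) index order exactly, so it can be answered by reading a single index tuple:
SELECT sequential_id
FROM foo
WHERE bar_uid = 'fa61424d-389f-4e75-ba2d-b77e6bb8491f'
ORDER BY sequential_id DESC
LIMIT 1;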
I have a table of locations (roughly 29 million rows):
Table "public.locations"
 Column |  Type   |                        Modifiers
--------+---------+---------------------------------------------------------
 id     | integer | not null default nextval('locations_id_seq'::regclass)
 dl     | text    |
Indexes:
"locations_pkey" PRIMARY KEY, btree (id)
"locations_test_idx" gin (to_tsvector('english'::regconfig, dl))
I want the following query to perform well.
EXPLAIN (ANALYZE,BUFFERS) SELECT id FROM locations WHERE to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps') LIMIT 10;
But the query plan produced shows a sequential scan being used.
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..65.18 rows=10 width=4) (actual time=62217.569..62217.569 rows=0 loops=1)
Buffers: shared hit=262 read=447808
I/O Timings: read=861.370
-> Seq Scan on locations (cost=0.00..967615.99 rows=148442 width=2) (actual time=62217.567..62217.567 rows=0 loops=1)
Filter: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
Rows Removed by Filter: 29688342
Buffers: shared hit=262 read=447808
I/O Timings: read=861.370
Planning time: 0.109 ms
Execution time: 62217.584 ms
After forcibly turning off sequential scans:
set enable_seqscan to off;
The query plan now uses the gin index.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1382.43..1403.20 rows=10 width=2) (actual time=0.043..0.043 rows=0 loops=1)
Buffers: shared hit=1 read=3
-> Bitmap Heap Scan on locations (cost=1382.43..309697.73 rows=148442 width=2) (actual time=0.043..0.043 rows=0 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
Buffers: shared hit=1 read=3
-> Bitmap Index Scan on locations_test_idx (cost=0.00..1345.32 rows=148442 width=0) (actual time=0.041..0.041 rows=0 loops=1)
Index Cond: (to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps'::text))
Buffers: shared hit=1 read=3
Planning time: 0.089 ms
Execution time: 0.069 ms
(10 rows)
The cost settings have been pasted below.
select name,setting from pg_settings where name like '%cost';
name | setting
----------------------+---------
cpu_index_tuple_cost | 0.005
cpu_operator_cost | 0.0025
cpu_tuple_cost | 0.01
random_page_cost | 4
seq_page_cost | 1
(5 rows)
I'm looking for a solution that avoids the sequential scan for the aforementioned query, without tricks like setting enable_seqscan to off.
I tried raising seq_page_cost to 20, but the query plan remained the same.
The problem here is that PostgreSQL thinks there are enough rows satisfying the condition that it will be cheaper to fetch rows sequentially until it has 10 that match.
But there is not a single row that satisfies the condition, so the query ends up scanning the whole table, when an index scan would have found that much faster.
You can improve the quality of statistics collected for that column like this:
ALTER INDEX locations_test_idx
    ALTER COLUMN to_tsvector SET STATISTICS 10000;
Then run ANALYZE, and PostgreSQL will collect better statistics for that column, hopefully improving the query plan.
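For instance:
ANALYZE locations;
-- re-run the query and check whether the planner now chooses
-- the bitmap scan on locations_test_idx on its own
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM locations
WHERE to_tsvector('english'::regconfig, dl) @@ to_tsquery('Lymps')
LIMIT 10;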