Postgres uses Parallel Seq Scan over Index Scan for simple query - postgresql

Debugging a slow select query, I found with EXPLAIN ANALYZE that Postgres uses a Parallel Seq Scan for a very simple SELECT query on a large table (>80M rows).
Some background
I migrated a table within the same database with an INSERT INTO ... SELECT ... query, which worked but took several hours. I then ran the following two queries, one on each table, and they perform significantly differently. Note that both tables have the exact same number of rows (a bit over 80 million) and have a primary key on column id.
On src table
EXPLAIN ANALYZE
SELECT *
FROM tmp_migration.asset a
WHERE id = 1452299
Yielding...
"QUERY PLAN"
"Gather (cost=1000.00..2723419.51 rows=40149 width=2536) (actual time=56362.052..56411.077 rows=1 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Parallel Seq Scan on asset a (cost=0.00..2718404.61 rows=16729 width=2536) (actual time=53152.645..56349.660 rows=0 loops=3)"
" Filter: (id = 1452299)"
" Rows Removed by Filter: 26851637"
"Planning Time: 0.077 ms"
"Execution Time: 56411.114 ms"
Note: almost a minute on a simple select
On target table
EXPLAIN ANALYZE
SELECT *
FROM public.asset a
WHERE id = 107588
Yielding...
"QUERY PLAN"
"Index Scan using ""PK_1209d107fe21482beaea51b745e"" on asset a (cost=0.57..8.59 rows=1 width=883) (actual time=85.544..85.548 rows=1 loops=1)"
" Index Cond: (id = 107588)"
"Planning Time: 0.090 ms"
"Execution Time: 85.576 ms"
Note: less than 100 ms execution time
What I tried so far
I checked the PKs of both tables and ran VACUUM tmp_migration.asset.
What causes this inefficient behavior?
Update (Fix with help of comments)
Indeed, the index was missing. I ran
SELECT * FROM pg_indexes WHERE schemaname = 'public' OR schemaname = 'tmp_migration'
which returned no indexes in tmp_migration. Once I ran
CREATE INDEX "asset_idx" ON tmp_migration.asset ("id")
the query executed as expected. The reason I was so sure a PK already existed was that my DB GUI (HeidiSQL) reported a PK on that column. Should have double-checked 🙏.
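For completeness, since the column is meant to be the primary key, the stricter fix would be to add the constraint itself, which creates the backing unique index. This is a sketch under the assumption that id really is unique and NOT NULL in tmp_migration.asset, and the constraint name is my own invention:

```sql
-- hypothetical constraint name; assumes id is unique and NOT NULL
ALTER TABLE tmp_migration.asset
  ADD CONSTRAINT asset_pkey PRIMARY KEY (id);
```

Unlike a plain index, this also enforces uniqueness and lets the GUI report the PK correctly.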
Thanks!

Related

Improve PostgreSQL query response

I have one table with about 100K rows, and it keeps growing.
My queries have a bad response time, and it affects my front-end user experience.
I want to ask for your help to improve my response time from the DB.
Today PostgreSQL runs on my local machine, a MacBook Pro 13 2019 with 16 GB RAM and an i5 core.
In the future I will put this DB in Docker and run it on a better server.
What can I do to improve it for now?
Table Structure:
CREATE TABLE dots
(
dot_id INT,
site_id INT,
latitude float ( 6 ),
longitude float ( 6 ),
rsrp float ( 6 ),
dist INT,
project_id INT,
dist_from_site INT,
geom geometry,
dist_from_ref INT,
file_name VARCHAR
);
The dot_id resets after each bulk insert, once per "file_name" value.
The queries:
Query #1:
await db.query(
`select MAX(rsrp) FROM dots where site_id=$1 and ${table}=$2 and project_id = $3 and file_name ilike $4`,
[site_id, dist, project_id, filename]
);
Time for response: 200ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=37159.88..37159.89 rows=1 width=4) (actual time=198.416..201.762 rows=1 loops=1)
Buffers: shared hit=16165 read=16031
-> Gather (cost=37159.66..37159.87 rows=2 width=4) (actual time=198.299..201.752 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16165 read=16031
-> Partial Aggregate (cost=36159.66..36159.67 rows=1 width=4) (actual time=179.009..179.010 rows=1 loops=3)
Buffers: shared hit=16165 read=16031
-> Parallel Seq Scan on dots (cost=0.00..36150.01 rows=3861 width=4) (actual time=122.889..178.817 rows=1088 loops=3)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (dist_from_ref = 500) AND (project_id = 1))
Rows Removed by Filter: 157073
Buffers: shared hit=16165 read=16031
Planning Time: 0.290 ms
Execution Time: 201.879 ms
(14 rows)
Query #2:
await db.query(
`SELECT DISTINCT (${table}) FROM dots where site_id=$1 and project_id = $2 and file_name ilike $3 order by ${table}`,
[site_id, project_id, filename]
);
Time for response: 1100ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Sort (cost=41322.12..41322.31 rows=77 width=4) (actual time=1176.071..1176.077 rows=66 loops=1)
Sort Key: dist_from_ref
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=16175 read=16021
-> HashAggregate (cost=41318.94..41319.71 rows=77 width=4) (actual time=1176.024..1176.042 rows=66 loops=1)
Group Key: dist_from_ref
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=16175 read=16021
-> Seq Scan on dots (cost=0.00..40499.42 rows=327807 width=4) (actual time=0.423..1066.316 rows=326668 loops=1)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (project_id = 1))
Rows Removed by Filter: 147813
Buffers: shared hit=16175 read=16021
Planning:
Buffers: shared hit=5 dirtied=1
Planning Time: 0.242 ms
Execution Time: 1176.125 ms
(16 rows)
Query #3:
await db.query(
`SELECT count(*) FROM dots WHERE site_id = $1 AND ${table} = $2 and project_id = $3 and file_name ilike $4`,
[site_id, dist, project_id, filename]
);
Time for response: 200ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=37159.88..37159.89 rows=1 width=8) (actual time=198.725..202.335 rows=1 loops=1)
Buffers: shared hit=16160 read=16036
-> Gather (cost=37159.66..37159.87 rows=2 width=8) (actual time=198.613..202.328 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16160 read=16036
-> Partial Aggregate (cost=36159.66..36159.67 rows=1 width=8) (actual time=179.182..179.183 rows=1 loops=3)
Buffers: shared hit=16160 read=16036
-> Parallel Seq Scan on dots (cost=0.00..36150.01 rows=3861 width=0) (actual time=119.340..179.020 rows=1088 loops=3)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (dist_from_ref = 500) AND (project_id = 1))
Rows Removed by Filter: 157073
Buffers: shared hit=16160 read=16036
Planning Time: 0.109 ms
Execution Time: 202.377 ms
(14 rows)
The tables do not have any indexes.
I added an index and it helped a bit: create index idx1 on dots (site_id, project_id, file_name, dist_from_site, dist_from_ref)
OK, that's a bit too much.
An index on columns (a,b) is useful for "where a=..." and also for "where a=... and b=..." but it is not really useful for "where b=...". Creating an index with many columns uses more disk space and is slower than with fewer columns, which is a waste if the extra columns in the index don't make your queries faster. Both dist_ columns in the index will probably not be used.
Indices are a compromise: if your table has small rows, like two integer columns, and you create an index on those two columns, then it will be as big as the table itself, so you had better be sure you need it. But in your case your rows are pretty large, at around 5 kB, and the number of rows is small, so adding one or several indexes on small int columns costs very little overhead.
So, since you very often use WHERE conditions on both site_id and project_id you can create an index on site_id,project_id. This will also work for a WHERE condition on site_id alone. If you sometimes use project_id alone, you can swap the order of the columns so it appears first, or even create another index.
You say the cardinality of these columns is about 30, so selecting on one value of site_id or project_id should hit 1/30 or 3.3% of the table, and selecting on both columns should hit 0.1% of the table, if they are uncorrelated and evenly distributed. This should already result in a substantial speedup.
You could also add an index on dist_from_site, and another one on dist_from_ref, if they have good selectivity (i.e., high cardinality in those columns). Postgres can combine indices with a bitmap index scan. But if, say, 50% of the rows in the table have the same value for dist_from_site, then an index will be useless for that value, due to not having enough selectivity.
You could also replace the previous 2-column index with 2 indices on site_id,project_id,dist_from_site and site_id,project_id,dist_from_ref. You can try it, see if it is worth the extra resources.
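As a sketch of that last suggestion (the index names below are my own invention, not anything prescribed by Postgres):

```sql
-- two 3-column indexes replacing the single (site_id, project_id) index
CREATE INDEX dots_site_proj_dfs ON dots (site_id, project_id, dist_from_site);
CREATE INDEX dots_site_proj_dfr ON dots (site_id, project_id, dist_from_ref);
```

Compare EXPLAIN (ANALYZE, BUFFERS) output before and after to see whether the extra index is worth its maintenance cost.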
Also there's the file_name column and ILIKE. ILIKE can't use a plain btree index, which means it's slow. One solution is to use an expression index:
CREATE INDEX dots_filename ON dots( lower(file_name) );
and replace your where condition with:
lower(file_name) like lower($4)
This will use the index unless the parameter $4 starts with a "%". And if you never use the LIKE '%' wildcards, and you're just using ILIKE for case-insensitive comparison, then you can replace LIKE with the = operator. Basically lower(a) = lower(b) is a case-insensitive comparison.
Likewise you could replace the previous 2-column index with an index on site_id, project_id, lower(file_name) if you often use these three together in the WHERE condition. But as said above, it won't optimize searches on file_name alone.
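Putting those pieces together, a hypothetical combined index and a matching rewrite of query #1 might look like this. It assumes ILIKE was only used for case-insensitive comparison (no % wildcards), so = is safe; the index name and column order are assumptions, not the answerer's exact recipe:

```sql
-- hypothetical combined index; assumes equality matching on file_name
CREATE INDEX dots_site_proj_fname
    ON dots (site_id, project_id, lower(file_name), dist_from_ref);

-- query #1, rewritten so the planner can match the index
SELECT max(rsrp)
FROM dots
WHERE site_id = $1
  AND project_id = $3
  AND lower(file_name) = lower($4)
  AND dist_from_ref = $2;
```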
Since your rows are huge, even adding 1 kB of index per row will only add 20% overhead to your table, so you can overdo it without too much trouble. So go ahead and experiment, you'll see what works best.

Postgres uses Hash Join with Seq Scan when Inner Select Index Cond is faster

Postgres is using a much heavier Seq Scan on the tracking table even though an index is available. The first query was the original attempt; it uses a Seq Scan and is therefore slow. I attempted to force an Index Scan with an inner select, but Postgres converted it back to effectively the same plan with nearly the same runtime. I then copied the id list produced by the inner select of the second query to build the third query, and Postgres finally used the Index Scan, which dramatically decreased the runtime. The third query is not viable in a production environment, though. What will cause Postgres to use the last query plan?
(vacuum was used on both tables)
Tables
tracking (worker_id, localdatetime) total records: 118664105
project_worker (id, project_id) total records: 12935
INDEX
CREATE INDEX tracking_worker_id_localdatetime_idx ON public.tracking USING btree (worker_id, localdatetime)
Queries
SELECT worker_id, localdatetime FROM tracking t JOIN project_worker pw ON t.worker_id = pw.id WHERE project_id = 68475018
Hash Join (cost=29185.80..2638162.26 rows=19294218 width=16) (actual time=16.912..18376.032 rows=177681 loops=1)
Hash Cond: (t.worker_id = pw.id)
-> Seq Scan on tracking t (cost=0.00..2297293.86 rows=118716186 width=16) (actual time=0.004..8242.891 rows=118674660 loops=1)
-> Hash (cost=29134.80..29134.80 rows=4080 width=8) (actual time=16.855..16.855 rows=2102 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 115kB
-> Seq Scan on project_worker pw (cost=0.00..29134.80 rows=4080 width=8) (actual time=0.004..16.596 rows=2102 loops=1)
Filter: (project_id = 68475018)
Rows Removed by Filter: 10833
Planning Time: 0.192 ms
Execution Time: 18382.698 ms
SELECT worker_id, localdatetime FROM tracking t WHERE worker_id IN (SELECT id FROM project_worker WHERE project_id = 68475018 LIMIT 500)
Hash Semi Join (cost=6905.32..2923969.14 rows=27733254 width=24) (actual time=19.715..20191.517 rows=20530 loops=1)
Hash Cond: (t.worker_id = project_worker.id)
-> Seq Scan on tracking t (cost=0.00..2296948.27 rows=118698327 width=24) (actual time=0.005..9184.676 rows=118657026 loops=1)
-> Hash (cost=6899.07..6899.07 rows=500 width=8) (actual time=1.103..1.103 rows=500 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 28kB
-> Limit (cost=0.00..6894.07 rows=500 width=8) (actual time=0.006..1.011 rows=500 loops=1)
-> Seq Scan on project_worker (cost=0.00..28982.65 rows=2102 width=8) (actual time=0.005..0.968 rows=500 loops=1)
Filter: (project_id = 68475018)
Rows Removed by Filter: 4493
Planning Time: 0.224 ms
Execution Time: 20192.421 ms
SELECT worker_id, localdatetime FROM tracking t WHERE worker_id IN (322016383,316007840,...,285702579)
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4766798.31 rows=21877360 width=24) (actual time=0.079..29.756 rows=22112 loops=1)
" Index Cond: (worker_id = ANY ('{322016383,316007840,...,285702579}'::bigint[]))"
Planning Time: 1.162 ms
Execution Time: 30.884 ms
... is in place of the 500 id entries used in the query
The same query run on another set of 500 ids:
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4776714.91 rows=21900980 width=24) (actual time=0.105..5528.109 rows=117838 loops=1)
" Index Cond: (worker_id = ANY ('{286237712,286237844,...,216724213}'::bigint[]))"
Planning Time: 2.105 ms
Execution Time: 5534.948 ms
The distribution of "worker_id" within "tracking" seems very skewed. For one thing, one instance of your query 3 returns over 5 times as many rows as the other. For another, the estimated number of rows is 100 to 1000 times higher than the actual number. This can certainly lead to bad plans (although it is unlikely to be the complete picture).
What is the actual number of distinct values for worker_id within tracking:
select count(distinct worker_id) from tracking;
What does the planner think this value is:
select n_distinct from pg_stats where tablename = 'tracking' and attname = 'worker_id';
If those values are far apart, and you force the planner to use a more reasonable value with
alter table tracking alter column worker_id set (n_distinct = <real value>); analyze tracking;
does that change the plans?
If you want to nudge PostgreSQL towards a nested loop join, try the following:
Create an index on tracking that can be used for an index-only scan:
CREATE INDEX ON tracking (worker_id) INCLUDE (localdatetime);
Make sure that tracking is VACUUMed often, so that an index-only scan is effective.
Reduce random_page_cost and increase effective_cache_size so that the optimizer prices index scans lower (but don't use insane values).
Make sure that you have good estimates on project_worker:
ALTER TABLE project_worker ALTER project_id SET STATISTICS 1000;
ANALYZE project_worker;
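As a sketch of the cost-parameter suggestion above; the values here are assumptions for SSD-backed storage with plenty of RAM, not recommendations for this particular server:

```sql
-- hypothetical values; verify against your own hardware before keeping them
ALTER SYSTEM SET random_page_cost = 1.1;
ALTER SYSTEM SET effective_cache_size = '8GB';
SELECT pg_reload_conf();
```

Re-run EXPLAIN ANALYZE afterwards to confirm the planner now prices the index scan below the sequential scan.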

postgres index performance unclear

I recently encountered and solved a problem, but I do not understand why there was a problem to begin with.
Simply put, I have 3 tables in a Postgres 10.5 database:
entities (id, name)
entities_to_stuff(
id,
entities_id -> fk entities.id,
stuff_id -> fk stuff.id,
unique constraint (entities_id, stuff_id)
)
stuff(id, name)
After inserting about 200k records, selects with this query:
select * from entities_to_stuff where entities_id = 1;
started to take 100-400 ms.
As I understand it, creating a unique constraint creates an index on the unique fields, so I have an index on (entities_id, stuff_id), entities_id being the "leftmost" column.
According to the docs, queries including the leftmost column are the most efficient (Postgres docs on this), so I assumed this index would do for me.
So I checked the execution plan: it wasn't using the index.
So, just to be sure, I did:
SET enable_seqscan = OFF;
and re-ran the query; it still took well over 100 ms most of the time.
I then got annoyed and created this index:
create index "idx_entities_id" on "entities_to_stuff" ("entities_id");
and suddenly the query runs in 0.2 ms or even less, and the execution plan also uses this index when sequential scans are enabled.
How is this index orders of magnitude faster than the one that already existed?
Edit:
execution plan after generating the additional index:
Index Scan using idx_entities_id on entities_to_stuff (cost=0.00..12.04 rows=2 width=32) (actual time=0.049..0.050 rows=1 loops=1)
Index Cond: (entities_id = 199283)
Planning time: 0.378 ms
Execution time: 0.073 ms
plan with just the unique constraint, seq_scan=on
Gather (cost=1000.00..38679.87 rows=2 width=32) (actual time=344.321..1740.861 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 0
-> Parallel Seq Scan on entities_to_stuff (cost=0.00..37679.67 rows=1 width=32) (actual time=344.088..1739.684 rows=1 loops=1)
Filter: (entities_id = 199283)
Rows Removed by Filter: 2907419
Planning time: 0.241 ms
Execution time: 740.888 ms
plan with constraint, seq-scan = off
Index Scan using uq_entities_to_stuff on entities_to_stuff (cost=0.43..66636.34 rows=2 width=32) (actual time=0.385..553.066 rows=1 loops=1)
Index Cond: (entities_id = 199283)
Planning time: 0.082 ms
Execution time: 553.103 ms

Postgresql avoid nested loop with join

Please help me to improve query performance if possible.
I have following query
select
s."CustomerCode",
s."MaterialCode",
fw."Name",
fw."ReverseName",
s."Uc"
from
"Sales" s
left join
"FiscalWeeks" fw on s."SalesDate" between fw."StartedAt" and fw."EndedAt"
And execution plan is
"Nested Loop Left Join (cost=0.00..1439970.46 rows=8954562 width=40) (actual time=0.129..114889.581 rows=1492427 loops=1)"
" Join Filter: ((s."SalesDate" >= fw."StartedAt") AND (s."SalesDate" <= fw."EndedAt"))"
" Rows Removed by Join Filter: 79098631"
" Buffers: shared hit=3818 read=10884"
" -> Seq Scan on "Sales" s (cost=0.00..29625.27 rows=1492427 width=26) (actual time=0.098..1216.287 rows=1492427 loops=1)"
" Buffers: shared hit=3817 read=10884"
" -> Materialize (cost=0.00..1.81 rows=54 width=26) (actual time=0.001..0.034 rows=54 loops=1492427)"
" Buffers: shared hit=1"
" -> Seq Scan on "FiscalWeeks" fw (cost=0.00..1.54 rows=54 width=26) (actual time=0.006..0.044 rows=54 loops=1)"
" Buffers: shared hit=1"
"Planning time: 0.291 ms"
"Execution time: 115840.838 ms"
I have following indexes
CREATE INDEX "Sales_SalesDate_idx" ON public."Sales" USING btree ("SalesDate");
ADD CONSTRAINT "FiscalWeekUnique" EXCLUDE USING gist (daterange("StartedAt", "EndedAt", '[]'::text) WITH &&);
Postgresql version is
"PostgreSQL 9.5.0, compiled by Visual C++ build 1800, 32-bit"
vacuum analyze was performed
I think that PostgreSQL does not understand that for each row in the Sales table there is exactly one matching row in the FiscalWeeks table, so it uses a nested loop. How can I make it see that?
Thank you.
The query has to use a nested loop join because of the join condition. The operators <= and >= do not support hash or merge joins.
Perhaps you can improve the query by adding an index to "FiscalWeeks" so that a sequential scan can be avoided, and the join condition can be pushed down into the inner loop:
CREATE INDEX ON "FiscalWeeks" ("StartedAt", "EndedAt");
Unrelated to that, but you would make your life better if you avoided upper case letters in table and column names.
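Since the table already carries a GiST exclusion constraint on daterange("StartedAt", "EndedAt", '[]'), another sketch worth trying is to phrase the join as range containment; the GiST index backing that constraint may then serve the inner lookups. This is an untested rewrite, not something verified against this data:

```sql
-- same result as the BETWEEN join, but phrased so the existing
-- GiST index on daterange("StartedAt", "EndedAt", '[]') can match it
SELECT s."CustomerCode",
       s."MaterialCode",
       fw."Name",
       fw."ReverseName",
       s."Uc"
FROM "Sales" s
LEFT JOIN "FiscalWeeks" fw
       ON daterange(fw."StartedAt", fw."EndedAt", '[]') @> s."SalesDate";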

PostgreSQL Performance check

I have heard of PostgreSQL being used in cases where tables run into billions of rows, with satisfactory response times too. But here is my simple experiment to check this out. I have a table with 6 columns and exactly 4255700 entries. I have used pgtune to tune the configuration to my setup. Now, when I run a simple "select * from tab1", it takes 173.425 seconds to fetch all the rows. Is this normal behavior? I have this single table in the DB.
The table definition is as follows -
CREATE TABLE file_group_permissions
(
fgp_id serial NOT NULL,
file_id integer NOT NULL,
pg_id integer NOT NULL,
policy_id integer,
tag_id integer,
inst_id integer,
CONSTRAINT file_group_permissions_pkey PRIMARY KEY (fgp_id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE file_group_permissions
OWNER TO sa;
-- Index: fgp_file_idx
-- DROP INDEX fgp_file_idx;
CREATE INDEX fgp_file_idx
ON file_group_permissions
USING btree
(file_id);
ALTER TABLE file_group_permissions CLUSTER ON fgp_file_idx;
-- Index: fgp_inst_idx
-- DROP INDEX fgp_inst_idx;
CREATE INDEX fgp_inst_idx
ON file_group_permissions
USING btree
(inst_id);
-- Index: fgp_tag_idx
-- DROP INDEX fgp_tag_idx;
CREATE INDEX fgp_tag_idx
ON file_group_permissions
USING btree
(tag_id);
-- Index: pgfgp_idx
-- DROP INDEX pgfgp_idx;
CREATE INDEX pgfgp_idx
ON file_group_permissions
USING btree
(pg_id);
Output of EXPLAIN (ANALYZE, BUFFERS) select * from file_group_permissions -
"Seq Scan on file_group_permissions (cost=0.00..69662.00 rows=4255700
width=24) (actual time=0.019..580.273 rows=4255700 loops=1)"
" Buffers: shared hit=2432 read=24673"
"Planning time: 0.070 ms"
"Execution time: 903.325 ms"
I have a MacBook Pro with 16 GB of RAM and a 512 GB SSD. I have configured PostgreSQL to use 2 GB of RAM.
Edit
EXPLAIN (ANALYZE, BUFFERS) select pg_id, count(distinct file_id) from file_group_permissions where pg_id in (6117,6115,6116,6113,6114) group by 1;
"GroupAggregate (cost=0.44..102028.21 rows=208 width=8) (actual time=4970.884..5013.423 rows=3 loops=1)"
" Group Key: pg_id"
" Buffers: shared hit=50891, temp read=4824 written=4824"
" -> Index Scan using pgfgp_idx on file_group_permissions (cost=0.44..85511.31 rows=3302964 width=8) (actual time=0.062..1080.926 rows=3323389 loops=1)"
" Index Cond: (pg_id = ANY('{6117,6115,6116,6113,6114}'::integer[]))"
" Buffers: shared hit=50891"
"Planning time: 0.219 ms"
"Execution time: 5013.495 ms"
EDIT1
I separated this table into a new DB and followed the suggestions (composite indices and postgresql conf), here's the new plan -
"GroupAggregate (cost=478307.10..502996.67 rows=209 width=8) (actual
time=7500.426..7528.021 rows=3 loops=1)"
" Group Key: pg_id"
" Buffers: shared read=27105, temp read=12137 written=12137"
" -> Sort (cost=478307.10..486536.26 rows=3291664 width=8) (actual
time=2944.597..3647.248 rows=3323389 loops=1)"
" Sort Key: pg_id"
" Sort Method: external sort Disk: 58488kB"
" Buffers: shared read=27105, temp read=7311 written=7311"
" -> Seq Scan on file_group_permissions (cost=0.00..96260.12
rows=3291664 width=8) (actual time=0.016..1516.743 rows=3323389 loops=1)"
" Filter: (pg_id = ANY
('{6117,6115,6116,6113,6114}'::integer[]))"
" Rows Removed by Filter: 932311"
" Buffers: shared read=27105"
"Planning time: 0.514 ms"
"Execution time: 7540.243 ms"
This table is simply jinxed and it is hurting the performance everywhere it is being joined.
I have created a test table, just like yours, and I have populated it with exactly the same number of records:
insert into file_group_permissions (file_id,pg_id,policy_id,tag_id,inst_id)
select
trunc(random()*10000) as file_id,
trunc(random()*10000) as pg_id,
trunc(random()*10000) as policy_id,
trunc(random()*10000) as tag_id,
trunc(random()*10000) as inst_id
from generate_series(1,4255700) g
When I run your query, it executes pretty fast:
EXPLAIN (ANALYZE, BUFFERS)
select pg_id, count(distinct file_id)
from file_group_permissions
where pg_id in (6117,6115,6116,6113,6114) group by 1;
GroupAggregate (cost=0.43..8.32 rows=5 width=8) (actual time=0.339..1.608 rows=5 loops=1)
Buffers: shared hit=2158
-> Index Scan using pgfgp_idx on file_group_permissions (cost=0.43..8.24 rows=5 width=8) (actual time=0.018..1.170 rows=2147 loops=1)
Index Cond: (pg_id = ANY ('{6117,6115,6116,6113,6114}'::integer[]))
Buffers: shared hit=2158
Total runtime: 1.633 ms
I have noticed this line in your execution plan:
Buffers: shared hit=50891, temp read=4824 written=4824
temp read=4824 written=4824 tells us that the database is somehow using disk to perform the scan operation. Perhaps you have to tweak some other postgresql.conf parameters, like these of mine:
shared_buffers = 1GB
temp_buffers = 32MB
work_mem = 32MB
effective_cache_size = 1GB
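Before editing postgresql.conf globally, a quick experiment is to raise work_mem for a single session and re-run the aggregate; if the temp read/written lines disappear from the plan, memory for the sort/aggregate was the bottleneck. The value below is an assumption for illustration, not a recommendation:

```sql
-- hypothetical value, for this session only
SET work_mem = '64MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT pg_id, count(DISTINCT file_id)
FROM file_group_permissions
WHERE pg_id IN (6117, 6115, 6116, 6113, 6114)
GROUP BY 1;
```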