I am looking for an idea to optimize my query.
Currently, I have a table with 4M rows, and I only want to retrieve the latest 1000 rows for a given reference:
SELECT *
FROM customers_material_events
WHERE reference = 'XXXXXX'
ORDER BY date DESC
LIMIT 1000;
This is the execution plan:
Limit (cost=12512155.48..12512272.15 rows=1000 width=6807) (actual time=8953.545..9013.658 rows=1000 loops=1)
Buffers: shared hit=16153 read=30342
-> Gather Merge (cost=12512155.48..12840015.90 rows=2810036 width=6807) (actual time=8953.543..9013.613 rows=1000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16153 read=30342
-> Sort (cost=12511155.46..12514668.00 rows=1405018 width=6807) (actual time=8865.186..8865.208 rows=632 loops=3)
Sort Key: date DESC
Sort Method: top-N heapsort Memory: 330kB
Worker 0: Sort Method: top-N heapsort Memory: 328kB
Worker 1: Sort Method: top-N heapsort Memory: 330kB
Buffers: shared hit=16153 read=30342
-> Parallel Seq Scan on customers_material_events (cost=0.00..64165.96 rows=1405018 width=6807) (actual time=0.064..944.029 rows=1117807 loops=3)
Filter: ((reference)::text = 'FFFEEE'::text)
Rows Removed by Filter: 17188
Buffers: shared hit=16091 read=30342
Planning Time: 0.189 ms
Execution Time: 9013.834 ms
(18 rows)
As you can see, the execution time is very slow...
The ideal index for this query would be:
CREATE INDEX ON customers_material_events (reference, date);
That would allow you to quickly find the values for a certain reference, automatically ordered by date, so no extra sort step is necessary.
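As a hedged sketch of how that might look in practice (the index name and the CONCURRENTLY option are my additions, not part of the original answer):
CREATE INDEX CONCURRENTLY customers_material_events_reference_date_idx
    ON customers_material_events (reference, date);
-- A backward scan of this index satisfies ORDER BY date DESC for a single
-- reference, so the LIMIT 1000 can stop after reading 1000 index entries.
ANALYZE customers_material_events;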
Related
I have a table of conversations, and one of messages. I want to return a list of conversations ordered by their messages' transcription confidence:
select *
from "conversations"
left outer join "messages" on "conversations"."sender_id" = "messages"."sender_id" and event = 'user'
order by (cast("messages"."parse_data" -> 'transcription' ->> 'confidence' as double precision)) nulls last fetch next 50 rows only
And here is the index created :
create index "messages_parse_data_transcription_index_bt"
on messages using btree (sender_id, event, (cast("messages"."parse_data" -> 'transcription' ->> 'confidence' as double precision)) asc NULLS LAST );
analyse messages;
Unfortunately, while it works, on a large database (3 GB, 7M rows in messages) it takes a very long time: index creation alone takes about 3 minutes, and the query itself takes 2 minutes 48 seconds.
Limit (cost=523667.88..523673.63 rows=50 width=3052) (actual time=132421.177..142355.335 rows=50 loops=1)
-> Gather Merge (cost=523667.88..560621.87 rows=321339 width=3052) (actual time=131738.458..141672.591 rows=50 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Sort (cost=522667.87..523471.22 rows=321339 width=3052) (actual time=131626.455..131626.606 rows=48 loops=2)
Sort Key: ((((messages.parse_data -> 'transcription'::text) ->> 'confidence'::text))::double precision)
Sort Method: top-N heapsort Memory: 125kB
Worker 0: Sort Method: top-N heapsort Memory: 125kB
-> Parallel Hash Left Join (cost=369055.84..511993.22 rows=321339 width=3052) (actual time=101533.553..131406.086 rows=279394 loops=2)
Hash Cond: ((conversations.sender_id)::text = (messages.sender_id)::text)
-> Parallel Seq Scan on conversations (cost=0.00..2634.32 rows=69832 width=33) (actual time=0.018..22.318 rows=59357 loops=2)
-> Parallel Hash (cost=281743.66..281743.66 rows=227615 width=3011) (actual time=90093.869..90093.870 rows=276748 loops=2)
Buckets: 2048 Batches: 512 Memory Usage: 2352kB
-> Parallel Seq Scan on messages (cost=0.00..281743.66 rows=227615 width=3011) (actual time=404.291..52209.756 rows=276748 loops=2)
Filter: ((event)::text = 'user'::text)
Rows Removed by Filter: 3163558
Planning Time: 20.895 ms
JIT:
Functions: 27
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 18.348 ms, Inlining 174.073 ms, Optimization 889.102 ms, Emission 424.619 ms, Total 1506.142 ms"
Execution Time: 142365.711 ms
How can I speed up my query execution?
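(Not from the original post, but one plan-grounded thing worth testing: the parallel hash spills to 512 batches with only ~2 MB of memory, so giving the session more work_mem may reduce the on-disk batching; the value below is just an assumption.)
set work_mem = '256MB';
-- Re-run EXPLAIN (ANALYZE, BUFFERS) on the query above: if the hash join still
-- reports hundreds of batches, it is still spilling to disk.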
I have trouble understanding the following query plan (anonymized). There seems to be a gap between how much time the individual parts of the query took, as if a step is missing.
The relevant part:
-> Sort (cost=306952.44..307710.96 rows=303409 width=40) (actual time=4222.122..4400.534 rows=582081 loops=1)
Sort Key: table1.column3, table1.column2
Sort Method: quicksort Memory: 71708kB
Buffers: shared hit=38058
-> Index Scan using myindex on table1 (cost=0.56..279325.68 rows=303409 width=40) (actual time=0.056..339.565 rows=582081 loops=1)
Index Cond: (((column1)::text = 'xxx1'::text) AND ((column2)::text > 'xxx2'::text) AND ((column2)::text < 'xxx3'::text))
Buffers: shared hit=38058
What happened between the index scan (up until 339.565) and starting to sort the result (4222.122)?
And the full plan (just in case):
GroupAggregate (cost=344290.95..348368.01 rows=30341 width=65) (actual time=4933.739..4933.806 rows=11 loops=1)
Group Key: t.column1, t.current_value, t.previous_value
Buffers: shared hit=38058
-> Sort (cost=344290.95..345045.68 rows=301892 width=72) (actual time=4933.712..4933.719 rows=58 loops=1)
Sort Key: t.column1, t.current_value, t.previous_value
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=38058
-> Subquery Scan on t (cost=306952.44..316813.23 rows=301892 width=72) (actual time=4573.523..4933.607 rows=58 loops=1)
Filter: ((t.current_value)::text <> (t.previous_value)::text)
Rows Removed by Filter: 582023
Buffers: shared hit=38058
-> WindowAgg (cost=306952.44..313020.62 rows=303409 width=72) (actual time=4222.144..4859.579 rows=582081 loops=1)
Buffers: shared hit=38058
-> Sort (cost=306952.44..307710.96 rows=303409 width=40) (actual time=4222.122..4400.534 rows=582081 loops=1)
Sort Key: table1.column3, table1.column2
Sort Method: quicksort Memory: 71708kB
Buffers: shared hit=38058
-> Index Scan using myindex on table1 (cost=0.56..279325.68 rows=303409 width=40) (actual time=0.056..339.565 rows=582081 loops=1)
Index Cond: (((column1)::text = 'xxx1'::text) AND ((column2)::text > 'xxx2'::text) AND ((column2)::text < 'xxx3'::text))
Buffers: shared hit=38058
Planning Time: 0.405 ms
Execution Time: 4941.003 ms
Both cost and actual time are shown as two numbers in these plans. The first is the setup (startup) time: loosely speaking, the time the plan step takes to deliver its first row. The second is the completion time, or time to last row.
actual time=setup...completion
The setup time of a Sort step includes the time required to retrieve the result set that needs sorting and to actually perform the sort: shuffling the rows around in RAM or, hopefully not, on disk. (Because quicksort is generally O(n log n), that can take a long time for big result sets. You knew that.)
In your plan, the inner Sort handles almost 600K rows coming from an index scan; that is where the time goes. It used 71,708 kB of RAM.
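To put numbers on it: the index scan delivered its last row at about 339.6 ms, and the Sort reported its first output row at 4222.1 ms, so roughly 4222 − 340 ≈ 3880 ms went into pulling those ~582K rows up from the scan and sorting them. That is the "missing" time.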
As far as I can tell, there's nothing anomalous in your plan. How to speed up that sort? You might try shorter or fixed-length column2 and column3 data types, but that's all guesswork without seeing your query and table definitions.
I have a table in Postgres with more than 70 million records that relates temperature to a given time (day) and place (meteorological station). I need to do some calculations over a period of time and a set of meteorological stations, such as sum, average, quartiles and normal values. The query takes about 30 seconds to return. How can I reduce this waiting time?
This is the EXPLAIN (ANALYZE, BUFFERS) output for:
select avg(p) as rain FROM waterbalances group by extract(month from date), extract(year from date);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=3310337.68..3314085.15 rows=13836 width=24) (actual time=21252.008..21252.624 rows=478 loops=1)
Group Key: (date_part('month'::text, (date)::timestamp without time zone)), (date_part('year'::text, (date)::timestamp without time zone))
Buffers: shared hit=6335 read=734014
-> Gather Merge (cost=3310337.68..3313566.30 rows=27672 width=48) (actual time=21251.984..21261.693 rows=1432 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15841 read=2195624
-> Sort (cost=3309337.66..3309372.25 rows=13836 width=48) (actual time=21130.846..21130.862 rows=477 loops=3)
Sort Key: (date_part('month'::text, (date)::timestamp without time zone)), (date_part('year'::text, (date)::timestamp without time zone))
Sort Method: quicksort Memory: 92kB
Worker 0: Sort Method: quicksort Memory: 92kB
Worker 1: Sort Method: quicksort Memory: 92kB
Buffers: shared hit=15841 read=2195624
-> Partial HashAggregate (cost=3308109.29..3308386.01 rows=13836 width=48) (actual time=21130.448..21130.618 rows=477 loops=3)
Group Key: date_part('month'::text, (date)::timestamp without time zone), date_part('year'::text, (date)::timestamp without time zone)
Buffers: shared hit=15827 read=2195624
-> Parallel Seq Scan on waterbalances (cost=0.00..3009020.66 rows=39878483 width=24) (actual time=1.528..15460.388 rows=31914000 loops=3)
Buffers: shared hit=15827 read=2195624
Planning Time: 7.621 ms
Execution Time: 21262.552 ms
(20 rows)
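(This isn't from the original thread, but since the query aggregates every row in the table, an index won't change much; one common approach is to precompute the monthly aggregates. The view name and manual refresh below are assumptions.)
create materialized view waterbalances_monthly as
select extract(year from date)  as year,
       extract(month from date) as month,
       avg(p) as rain
from waterbalances
group by 1, 2;
-- refresh whenever new data is loaded:
refresh materialized view waterbalances_monthly;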
We have a page where we show a list of results, and the results must be relevant given 2 factors:
keyword similarity
location
We are using PostgreSQL with PostGIS and tsvector columns; however, we don't know how to combine the scores coming from the tsvector ranking and st_distance in order to get the "best" search results. The queries seem to take between 30 seconds and 1 minute.
SELECT
ts_rank_cd(doc_vectors, plainto_tsquery('Uber '), 1 | 4 | 32) AS rank, ts_headline('english', short_job_description, plainto_tsquery('Uber '), 'MaxWords=80,MinWords=50'),
-- a bunch of fields omitted...
org.logo
FROM jobs.job as job
LEFT OUTER JOIN jobs.organization as org
ON job.organization_id = org.id
WHERE job.is_expired = 0 and deleted_at is NULL and doc_vectors @@ plainto_tsquery('Uber ') order by rank desc offset 80 limit 20;
Do you guys have suggestions for us?
EXPLAIN (ANALYZE, BUFFERS) for the same query:
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=886908.73..886908.81 rows=30 width=1108) (actual time=20684.508..20684.518 rows=30 loops=1)
Buffers: shared hit=1584 read=825114
-> Sort (cost=886908.68..889709.48 rows=1120318 width=1108) (actual time=20684.502..20684.509 rows=50 loops=1)
Sort Key: job.created_at DESC
Sort Method: top-N heapsort Memory: 75kB
Buffers: shared hit=1584 read=825114
-> Hash Left Join (cost=421.17..849692.52 rows=1120318 width=1108) (actual time=7.012..18887.816 rows=1111019 loops=1)
Hash Cond: (job.organization_id = org.id)
Buffers: shared hit=1581 read=825114
-> Seq Scan on job (cost=0.00..846329.53 rows=1120318 width=1001) (actual time=0.052..17866.594 rows=1111019 loops=1)
Filter: ((deleted_at IS NULL) AND (is_expired = 0) AND (is_hidden = 0))
Rows Removed by Filter: 196298
Buffers: shared hit=1564 read=824989
-> Hash (cost=264.41..264.41 rows=12541 width=107) (actual time=6.898..6.899 rows=12541 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 1037kB
Buffers: shared hit=14 read=125
-> Seq Scan on organization org (cost=0.00..264.41 rows=12541 width=107) (actual time=0.021..3.860 rows=12541 loops=1)
Buffers: shared hit=14 read=125
Planning time: 2.223 ms
Execution time: 20684.682 ms
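(Not an answer from the original thread, but a hedged, commonly suggested first step: the plan does a sequential scan over ~1.1M job rows, and a GIN index on the tsvector column lets the @@ predicate use an index instead. The index name is an assumption.)
create index job_doc_vectors_gin on jobs.job using gin (doc_vectors);
analyze jobs.job;
-- To blend the two relevance signals, one option is a weighted ORDER BY such as
-- 0.7 * ts_rank_cd(...) + 0.3 / (1 + ST_Distance(job.geom, :point)) DESC,
-- where the geom column, the parameter and the weights are all hypothetical.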
I am running the query below and it's taking 5 minutes:
SELECT "DID"
FROM location_signals
GROUP BY "DID";
I have an index on DID, which is varchar(100); the table has about 150 million records.
How can I improve and optimize this further?
Are there any additional indexes that could be added, or other recommendations? Thanks.
Edit: below are the results of EXPLAIN ANALYZE for the query:
Finalize GroupAggregate (cost=23803276.36..24466411.92 rows=179625 width=44) (actual time=285577.900..321360.237 rows=4833061 loops=1)
Group Key: DID
-> Gather Merge (cost=23803276.36..24462819.42 rows=359250 width=44) (actual time=285577.874..320018.354 rows=10825153 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=23802276.33..24420353.03 rows=179625 width=44) (actual time=281580.548..310818.137 rows=3608384 loops=3)
Group Key: DID
-> Sort (cost=23802276.33..24007703.15 rows=82170727 width=36) (actual time=281580.535..303887.638 rows=65736579 loops=3)
Sort Key: DID
Sort Method: external merge Disk: 2987656kB
Worker 0: Sort Method: external merge Disk: 3099408kB
Worker 1: Sort Method: external merge Disk: 2987648kB
-> Parallel Seq Scan on location_signals (cost=0.00..6259493.27 rows=82170727 width=36) (actual time=0.043..13460.990 rows=65736579 loops=3)
Planning Time: 1.332 ms
Execution Time: 322686.767 ms
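(Again not from the original post, but a hedged, plan-grounded illustration: each worker spills roughly 3 GB to disk for the sort. With an up-to-date visibility map, the planner can often answer this from the existing DID index with an index-only scan, which returns values already sorted and avoids the external sort. Worth testing with something like the following.)
vacuum (analyze) location_signals;   -- refreshes the visibility map and statistics
explain (analyze, buffers)
select "DID" from location_signals group by "DID";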