PostgreSQL GROUP BY query taking too long

I am running the query below and it's taking 5 minutes:
SELECT "DID"
FROM location_signals
GROUP BY "DID";
I have an index on "DID", which is a varchar(100) column; the table has about 150 million records.
How can I improve and optimize this further?
Are there any additional indexes that could be added, or other recommendations? Thanks.
Edit: below are the results of EXPLAIN ANALYZE:
Finalize GroupAggregate  (cost=23803276.36..24466411.92 rows=179625 width=44) (actual time=285577.900..321360.237 rows=4833061 loops=1)
  Group Key: DID
  ->  Gather Merge  (cost=23803276.36..24462819.42 rows=359250 width=44) (actual time=285577.874..320018.354 rows=10825153 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=23802276.33..24420353.03 rows=179625 width=44) (actual time=281580.548..310818.137 rows=3608384 loops=3)
              Group Key: DID
              ->  Sort  (cost=23802276.33..24007703.15 rows=82170727 width=36) (actual time=281580.535..303887.638 rows=65736579 loops=3)
                    Sort Key: DID
                    Sort Method: external merge  Disk: 2987656kB
                    Worker 0:  Sort Method: external merge  Disk: 3099408kB
                    Worker 1:  Sort Method: external merge  Disk: 2987648kB
                    ->  Parallel Seq Scan on location_signals  (cost=0.00..6259493.27 rows=82170727 width=36) (actual time=0.043..13460.990 rows=65736579 loops=3)
Planning Time: 1.332 ms
Execution Time: 322686.767 ms
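
No answer was recorded for this question, but the plan shows the cost is dominated by sorting the full table on disk (external merge, roughly 3 GB per worker). A sketch of one technique sometimes used for this pattern, assuming "DID" has far fewer distinct values than the table has rows and no NULLs of interest: emulate a loose index scan (skip scan) with a recursive CTE, so the existing index on "DID" is probed once per distinct value instead of sorting every row:

-- Hedged sketch: emulated loose index scan over the existing "DID" index.
-- Each iteration jumps to the next distinct value, so the work is
-- proportional to the number of distinct DIDs, not the total row count.
WITH RECURSIVE distinct_dids AS (
    (SELECT "DID" FROM location_signals ORDER BY "DID" LIMIT 1)
    UNION ALL
    SELECT (SELECT ls."DID"
            FROM location_signals ls
            WHERE ls."DID" > d."DID"
            ORDER BY ls."DID"
            LIMIT 1)
    FROM distinct_dids d
    WHERE d."DID" IS NOT NULL
)
SELECT "DID" FROM distinct_dids WHERE "DID" IS NOT NULL;

With the roughly 4.8 million distinct values visible in the plan, the recursion may or may not beat the parallel sort; the other obvious lever is raising work_mem for this session so the sort stops spilling to disk.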

Related

PostgreSQL: query optimization

I have a query that I cannot optimize:
SELECT evse_label AS evse_label,
evc_model AS evc_model,
address AS address,
city AS city,
message AS message,
reason AS reason,
min(timestamp) AS oldest_timestamp,
max(timestamp) AS latest_timestamp,
count(message) AS "# messages"
FROM public.om_logbook_mart
GROUP BY evse_label,
evc_model,
address,
city,
message,
reason
ORDER BY oldest_timestamp DESC
LIMIT 10000
I tried to create the following index:
CREATE INDEX evselabel_evc_model_address_city_reason_message_timestamp_idx
ON om_logbook_mart (evse_label,evc_model,address,city,message,reason,timestamp);
Unfortunately nothing changed, and the query still takes around 40 seconds.
This is the EXPLAIN ANALYZE output:
Limit  (cost=1184359.89..1184384.89 rows=10000 width=95) (actual time=39395.145..39451.451 rows=10000 loops=1)
  ->  Sort  (cost=1184359.89..1186097.37 rows=694989 width=95) (actual time=39394.488..39449.924 rows=10000 loops=1)
        Sort Key: (min("timestamp")) DESC
        Sort Method: top-N heapsort  Memory: 3375kB
        ->  Finalize GroupAggregate  (cost=856703.95..1134710.88 rows=694989 width=95) (actual time=30343.111..39339.675 rows=97445 loops=1)
              Group Key: evse_label, evc_model, address, city, message, reason
              ->  Gather Merge  (cost=856703.95..1096486.49 rows=1389978 width=95) (actual time=30343.062..39141.342 rows=231382 loops=1)
                    Workers Planned: 2
                    Workers Launched: 2
                    ->  Partial GroupAggregate  (cost=855703.92..935048.51 rows=694989 width=95) (actual time=29867.817..37735.438 rows=77127 loops=3)
                          Group Key: evse_label, evc_model, address, city, message, reason
                          ->  Sort  (cost=855703.92..862943.39 rows=2895788 width=79) (actual time=29867.788..35909.736 rows=2317226 loops=3)
                                Sort Key: evse_label, evc_model, address, city, message, reason
                                Sort Method: external merge  Disk: 215024kB
                                Worker 0:  Sort Method: external merge  Disk: 203704kB
                                Worker 1:  Sort Method: external merge  Disk: 219048kB
                                ->  Parallel Seq Scan on om_logbook_mart  (cost=0.00..287564.88 rows=2895788 width=79) (actual time=0.033..1852.787 rows=2317226 loops=3)
Planning Time: 1.285 ms
Execution Time: 39478.630 ms
Can you help me understand how to optimize it?
UPDATE: After setting work_mem to 500MB:
Limit  (cost=543440.12..543465.12 rows=10000 width=95) (actual time=6466.267..6468.723 rows=10000 loops=1)
  ->  Sort  (cost=543440.12..545184.77 rows=697860 width=95) (actual time=6466.265..6467.809 rows=10000 loops=1)
        Sort Key: (min("timestamp")) DESC
        Sort Method: top-N heapsort  Memory: 3646kB
        ->  HashAggregate  (cost=486607.40..493586.00 rows=697860 width=95) (actual time=6384.249..6425.960 rows=97403 loops=1)
              Group Key: evse_label, evc_model, address, city, message, reason
              Batches: 1  Memory Usage: 49169kB
              ->  Seq Scan on om_logbook_mart  (cost=0.00..329588.97 rows=6978597 width=79) (actual time=0.040..2362.520 rows=6978597 loops=1)
Planning Time: 0.135 ms
Execution Time: 6489.381 ms
Can I do anything more?
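
One more option, assuming the data in om_logbook_mart is loaded in batches and slightly stale results are acceptable (the question doesn't say): precompute the aggregate once and serve the report from it. A sketch, with a hypothetical view name:

-- Hedged sketch: materialize the aggregation.
CREATE MATERIALIZED VIEW om_logbook_mart_summary AS
SELECT evse_label, evc_model, address, city, message, reason,
       min(timestamp) AS oldest_timestamp,
       max(timestamp) AS latest_timestamp,
       count(message) AS "# messages"
FROM public.om_logbook_mart
GROUP BY evse_label, evc_model, address, city, message, reason;

-- Support the ORDER BY ... LIMIT of the report:
CREATE INDEX ON om_logbook_mart_summary (oldest_timestamp DESC);

-- The report then becomes a cheap ordered index read:
SELECT * FROM om_logbook_mart_summary
ORDER BY oldest_timestamp DESC
LIMIT 10000;

-- Re-run after each data load:
REFRESH MATERIALIZED VIEW om_logbook_mart_summary;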

Query taking nested loop instead of hash join

The query below uses a nested loop and runs for 21 minutes; after disabling nested loops it finishes in under 1 minute. Table stats are up to date and the tables have been vacuumed. Is there any way to figure out why Postgres chooses a nested loop instead of the more efficient hash join?
Also, to avoid the nested loop, is it better to set enable_nestloop to off or to increase random_page_cost? I think setting enable_nestloop to off would stop plans from using nested loops even where they would be efficient. What would be the better alternative? Please advise.
SELECT DISTINCT ON (three.quebec_delta)
last_value(three.reviewed_by_nm) OVER wnd AS reviewed_by_nm,
last_value(three.reviewer_specialty_nm) OVER wnd AS reviewer_specialty_nm,
last_value(three.kilo) OVER wnd AS kilo,
last_value(three.review_reason_dscr) OVER wnd AS review_reason_dscr,
last_value(three.review_notes) OVER wnd AS review_notes,
last_value(three.seven_uniform_charlie) OVER wnd AS seven_uniform_charlie,
last_value(three.di_audit_source_system_cd) OVER wnd AS di_audit_source_system_cd,
last_value(three.di_audit_update_dtm) OVER wnd AS di_audit_update_dtm,
three.quebec_delta
FROM
ods_authorization.quebec_foxtrot seven_uniform_foxtrot
JOIN ods_authorization.golf echo ON seven_uniform_foxtrot.four = echo.oscar
JOIN ods_authorization.papa three ON echo.five = three.quebec_delta
AND three.xray = '0'::bpchar
WHERE
seven_uniform_foxtrot.two_india >= (zulu () - '2 years'::interval)
AND lima (three.kilo, 'ADVISOR'::character varying)::text = 'ADVISOR'::text
WINDOW wnd AS (PARTITION BY three.quebec_delta ORDER BY three.seven_uniform_charlie DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Plan that runs for 21 minutes, using a nested loop:
Unique  (cost=550047.63..550257.15 rows=5238 width=281) (actual time=1295000.966..1296128.356 rows=319863 loops=1)
  ->  WindowAgg  (cost=550047.63..550244.06 rows=5238 width=281) (actual time=1295000.964..1296013.046 rows=461635 loops=1)
        ->  Sort  (cost=550047.63..550060.73 rows=5238 width=326) (actual time=1295000.929..1295089.796 rows=461635 loops=1)
              Sort Key: three.quebec_delta, three.seven_uniform_charlie DESC
              Sort Method: quicksort  Memory: 197021kB
              ->  Nested Loop  (cost=1001.12..549724.06 rows=5238 width=326) (actual time=8.274..1292470.826 rows=461635 loops=1)
                    ->  Gather  (cost=1000.56..527782.84 rows=24896 width=391) (actual time=4.287..12701.687 rows=3484699 loops=1)
                          Workers Planned: 2
                          Workers Launched: 2
                          ->  Nested Loop  (cost=0.56..524293.24 rows=10373 width=391) (actual time=3.492..400998.923 rows=1161566 loops=3)
                                ->  Parallel Seq Scan on papa three  (cost=0.00..436912.84 rows=10373 width=326) (actual time=1.554..2455.626 rows=1161566 loops=3)
                                      Filter: ((xray = 'november'::bpchar) AND ((lima_sierra(kilo, 'two_zulu'::character varying))::text = 'two_zulu'::text))
                                      Rows Removed by Filter: 501723
                                ->  Index Scan using five_tango on golf echo  (cost=0.56..8.42 rows=1 width=130) (actual time=0.342..0.342 rows=1 loops=3484699)
                                      Index Cond: (five_hotel = three.quebec_delta)
                    ->  Index Scan using lima_alpha on quebec_foxtrot seven_uniform_foxtrot  (cost=0.56..0.88 rows=1 width=65) (actual time=0.366..0.366 rows=0 loops=3484699)
                          Index Cond: (four = echo.oscar)
                          Filter: (two_india >= (zulu() - 'two_two'::interval))
                          Rows Removed by Filter: 1
Planning time: 0.777 ms
Execution time: 1296183.259 ms
Plan after setting enable_nestloop to off and work_mem to 8GB (I get the same plan when increasing random_page_cost to 1000):
Unique  (cost=5933437.24..5933646.68 rows=5236 width=281) (actual time=19898.050..20993.124 rows=319980 loops=1)
  ->  WindowAgg  (cost=5933437.24..5933633.59 rows=5236 width=281) (actual time=19898.049..20879.655 rows=461769 loops=1)
        ->  Sort  (cost=5933437.24..5933450.33 rows=5236 width=326) (actual time=19898.022..19978.839 rows=461769 loops=1)
              Sort Key: three.quebec_delta, three.seven_uniform_charlie DESC
              Sort Method: quicksort  Memory: 197056kB
              ->  Hash Join  (cost=1947451.87..5933113.80 rows=5236 width=326) (actual time=11616.323..17931.146 rows=461769 loops=1)
                    Hash Cond: (echo.oscar = seven_uniform_foxtrot.four)
                    ->  Gather  (cost=438059.74..4423656.32 rows=24897 width=391) (actual time=1909.685..7291.289 rows=3484833 loops=1)
                          Workers Planned: 2
                          Workers Launched: 2
                          ->  Parallel Hash Join  (cost=437059.74..4420166.62 rows=10374 width=391) (actual time=1904.546..7385.948 rows=1161611 loops=3)
                                Hash Cond: (echo.five = three.quebec_delta)
                                ->  Parallel Seq Scan on golf echo  (cost=0.00..3921922.09 rows=8152209 width=130) (actual time=0.003..1756.576 rows=6531668 loops=3)
                                ->  Parallel Hash  (cost=436930.07..436930.07 rows=10374 width=326) (actual time=1904.354..1904.354 rows=1161611 loops=3)
                                      Buckets: 4194304 (originally 32768)  Batches: 1 (originally 1)  Memory Usage: 1135200kB
                                      ->  Parallel Seq Scan on papa three  (cost=0.00..436930.07 rows=10374 width=326) (actual time=0.009..963.728 rows=1161611 loops=3)
                                            Filter: ((xray = 'november'::bpchar) AND ((lima(kilo, 'two_zulu'::character varying))::text = 'two_zulu'::text))
                                            Rows Removed by Filter: 502246
                    ->  Hash  (cost=1476106.74..1476106.74 rows=2662831 width=65) (actual time=9692.517..9692.517 rows=2685656 loops=1)
                          Buckets: 4194304  Batches: 1  Memory Usage: 287171kB
                          ->  Seq Scan on quebec_foxtrot seven_uniform_foxtrot  (cost=0.00..1476106.74 rows=2662831 width=65) (actual time=0.026..8791.556 rows=2685656 loops=1)
                                Filter: (two_india >= (zulu() - 'two_two'::interval))
                                Rows Removed by Filter: 9984069
Planning time: 0.742 ms
Execution time: 21218.770 ms
Try an index on papa(lima_sierra(kilo, 'two_zulu'::character varying)) and ANALYZE the table. With that index in place, PostgreSQL collects statistics on the expression, which should improve the estimate, so that you don't get a nested loop join.
Alternatively, if you just replace COALESCE(r_cd, 'ADVISOR') = 'ADVISOR' with
(r_cd = 'ADVISOR' OR r_cd IS NULL),
that might let the planner use the current table statistics to improve the estimates enough to change the plan.
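
Concretely, that suggestion would look something like this (a sketch using the obfuscated names from the plan, so translate to the real schema; note that an expression index requires the function to be IMMUTABLE):

-- Expression index so ANALYZE collects statistics on the filter expression:
CREATE INDEX ON ods_authorization.papa (lima_sierra(kilo, 'two_zulu'::character varying));
ANALYZE ods_authorization.papa;

-- Or, if the function is really COALESCE, rewrite the predicate so the
-- planner can use the plain column statistics:
--   WHERE COALESCE(r_cd, 'ADVISOR') = 'ADVISOR'
-- becomes
--   WHERE (r_cd = 'ADVISOR' OR r_cd IS NULL)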

Simple equijoin between PostgreSQL partitions is taking a long time (10 minutes)

We are using a PostgreSQL database and I am hitting some limits. The company_sale_account table is partitioned by company name.
We generate a report on accounts matched between two companies. Below is the query:
SELECT cpsa1.*
FROM company_sale_account cpsa1
JOIN company_sale_account cpsa2 ON cpsa1.sale_account_id = cpsa2.sale_account_id
WHERE cpsa1.company_name = 'company_a'
AND cpsa2.company_name = 'company_b'
We have set up B-tree indexes on the sale_account_id column of both partitions.
This worked fine until recently. Now we have 10 million rows in the company_a partition and 7 million rows in the company_b partition, and this query is taking more than 10 minutes.
Below is the explain plan output for it:
Sort  (cost=167950986.43..168904299.23 rows=381325118 width=132) (actual time=517017.334..603691.048 rows=16854094 loops=1)
  Sort Key: cpsa1.crm_account_id, ((cpsa1.account_name)::text), ((cpsa1.account_owner)::text), ((cpsa1.account_type)::text), cpsa1.is_customer, ((date_part('epoch'::text, cpsa1.created_date))::integer), ((hstore_to_json(cpsa1.custom_crm_fields))::tex (...)
  Sort Method: external merge  Disk: 2862656kB
  Buffers: shared hit=20125996 read=47811 dirtied=75, temp read=1333427 written=1333427
  I/O Timings: read=19619.322
  ->  Nested Loop  (cost=0.00..9331268.39 rows=381325118 width=132) (actual time=1.680..118698.570 rows=16854094 loops=1)
        Buffers: shared hit=20125977 read=47811 dirtied=75
        I/O Timings: read=19619.322
        ->  Append  (cost=0.00..100718.94 rows=2033676 width=33) (actual time=0.014..1783.243 rows=2033675 loops=1)
              Buffers: shared hit=75298 dirtied=75
              ->  Seq Scan on company_sale_account cpsa2  (cost=0.00..0.00 rows=1 width=516) (actual time=0.001..0.001 rows=0 loops=1)
                    Filter: ((company_name)::text = 'company_b'::text)
              ->  Seq Scan on company_sale_account_concur cpsa2_1  (cost=0.00..100718.94 rows=2033675 width=33) (actual time=0.013..938.145 rows=2033675 loops=1)
                    Filter: ((company_name)::text = 'company_b'::text)
                    Buffers: shared hit=75298 dirtied=75
        ->  Append  (cost=0.00..1.97 rows=23 width=355) (actual time=0.034..0.047 rows=8 loops=2033675)
              Buffers: shared hit=20050679 read=47811
              I/O Timings: read=19619.322
              ->  Seq Scan on company_sale_account cpsa1  (cost=0.00..0.00 rows=1 width=4525) (actual time=0.000..0.000 rows=0 loops=2033675)
                    Filter: (((company_name)::text = 'company_a'::text) AND ((cpsa2.sale_account_id)::text = (sale_account_id)::text))
              ->  Index Scan using ix_csa_adp_sale_account on company_sale_account_adp cpsa1_1  (cost=0.56..1.97 rows=22 width=165) (actual time=0.033..0.042 rows=8 loops=2033675)
                    Index Cond: ((sale_account_id)::text = (cpsa2.sale_account_id)::text)
                    Filter: ((company_name)::text = 'company_a'::text)
                    Buffers: shared hit=20050679 read=47811
                    I/O Timings: read=19619.322
Planning time: 30.853 ms
Execution time: 618218.321 ms
Do you have any suggestions on how to tune Postgres?
Please share your thoughts; it would be a great help to me.
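
No answer was recorded here, but two directions suggest themselves from the plan (a sketch, under the assumption that the partition names shown in the plan really map to these companies): each company lives in its own partition (company_sale_account_adp holds company_a, company_sale_account_concur holds company_b), so joining the partitions directly avoids the Append overhead and the redundant company_name filters:

-- Hedged sketch: join the two partitions directly.
SELECT cpsa1.*
FROM company_sale_account_adp cpsa1
JOIN company_sale_account_concur cpsa2
  ON cpsa1.sale_account_id = cpsa2.sale_account_id;

Separately, the estimate of 381,325,118 rows against 16,854,094 actual rows is off by more than 20x, so running ANALYZE on the partitions is cheap to try, and the 2.8 GB external-merge sort points at raising work_mem for this report session.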

Reduce execution time of my query in PostgreSQL

Below is my query. It takes 9387.430 ms to execute, which is certainly too long for such a request. I would like to reduce this execution time. Can you please help me with this? I have also provided the EXPLAIN ANALYZE output.
EXPLAIN ANALYZE
SELECT a.artist, b.artist, COUNT(*)
FROM release_has_artist a, release_has_artist b
WHERE a.release = b.release AND a.artist <> b.artist
GROUP BY(a.artist,b.artist)
ORDER BY (a.artist,b.artist);
Output of EXPLAIN ANALYZE:
Sort  (cost=1696482.86..1707588.14 rows=4442112 width=48) (actual time=9253.474..9314.510 rows=461386 loops=1)
  Sort Key: (ROW(a.artist, b.artist))
  Sort Method: external sort  Disk: 24832kB
  ->  Finalize GroupAggregate  (cost=396240.32..932717.19 rows=4442112 width=48) (actual time=1928.058..2911.463 rows=461386 loops=1)
        Group Key: a.artist, b.artist
        ->  Gather Merge  (cost=396240.32..860532.87 rows=3701760 width=16) (actual time=1928.049..2494.638 rows=566468 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Partial GroupAggregate  (cost=395240.29..432257.89 rows=1850880 width=16) (actual time=1912.809..2156.951 rows=188823 loops=3)
                    Group Key: a.artist, b.artist
                    ->  Sort  (cost=395240.29..399867.49 rows=1850880 width=8) (actual time=1912.794..2003.776 rows=271327 loops=3)
                          Sort Key: a.artist, b.artist
                          Sort Method: external merge  Disk: 4848kB
                          ->  Merge Join  (cost=0.85..177260.72 rows=1850880 width=8) (actual time=2.143..1623.628 rows=271327 loops=3)
                                Merge Cond: (a.release = b.release)
                                Join Filter: (a.artist <> b.artist)
                                Rows Removed by Join Filter: 687597
                                ->  Parallel Index Only Scan using release_has_artist_pkey on release_has_artist a  (cost=0.43..67329.73 rows=859497 width=8) (actual time=0.059..240.998 rows=687597 loops=3)
                                      Heap Fetches: 711154
                                ->  Index Only Scan using release_has_artist_pkey on release_has_artist b  (cost=0.43..79362.68 rows=2062792 width=8) (actual time=0.072..798.402 rows=2329742 loops=3)
                                      Heap Fetches: 2335683
Planning time: 2.101 ms
Execution time: 9387.430 ms
In your EXPLAIN ANALYZE output, there are two Sort Method: external ... Disk: ####kB lines, indicating that those sorts spilled to disk rather than completing in memory, due to an insufficiently-sized work_mem. Try increasing your work_mem up to 32MB (30 might be OK, but I like multiples of 8), and try again.
Note that you can set work_mem on a per-session basis; a global change to work_mem could have negative side effects, such as running out of memory, because the postgresql.conf-configured work_mem can be claimed by every sort and hash operation in every session (basically, it has a multiplicative effect).
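
For reference, a session-scoped setting might look like this (SET LOCAL additionally limits the change to a single transaction):

-- Applies to the current session only:
SET work_mem = '32MB';

-- Or scoped to one transaction:
BEGIN;
SET LOCAL work_mem = '32MB';
-- ... run the aggregation query here ...
COMMIT;

One more observation on this particular query: ORDER BY (a.artist, b.artist) with parentheses sorts by a single composite ROW value, which is why the plan shows Sort Key: (ROW(a.artist, b.artist)) above the aggregate. Writing ORDER BY a.artist, b.artist without the parentheses should let the planner reuse the ordering the GroupAggregate already produces and drop that final external sort.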

Partitions and Indexes

I have a table partitioned by quarter. The table name is data. The table has a couple of columns, including a date column, which has an index created on it:
create index on data (date);
Now I am querying the table:
justpremium=> EXPLAIN analyze SELECT sum(col_1) FROM data WHERE "date" BETWEEN '2018-12-01' AND '2018-12-31';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate  (cost=355709.66..355709.67 rows=1 width=32) (actual time=577.072..577.072 rows=1 loops=1)
  ->  Gather  (cost=355709.44..355709.65 rows=2 width=32) (actual time=577.005..578.418 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=354709.44..354709.45 rows=1 width=32) (actual time=573.255..573.256 rows=1 loops=3)
              ->  Append  (cost=0.42..352031.07 rows=1071346 width=8) (actual time=15.286..524.604 rows=837204 loops=3)
                    ->  Parallel Index Scan using data_date_idx on data  (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=3)
                          Index Cond: ((date >= '2018-12-01'::date) AND (date <= '2018-12-31'::date))
                    ->  Parallel Seq Scan on data_y2018q4  (cost=0.00..352022.64 rows=1071345 width=8) (actual time=15.282..465.859 rows=837204 loops=3)
                          Filter: ((date >= '2018-12-01'::date) AND (date <= '2018-12-31'::date))
                          Rows Removed by Filter: 1479844
Planning time: 1.437 ms
Execution time: 578.465 ms
(13 rows)
We can see that there is a Parallel Seq Scan on data_y2018q4. In fact that seems normal to me: I have one quarterly partition and I am querying about a third of it, so a seq scan makes sense.
But now let's query the partition table directly:
justpremium=> EXPLAIN analyze SELECT sum(col_1) FROM data_y2018q4 WHERE "date" BETWEEN '2018-12-01' AND '2018-12-31';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate  (cost=286475.38..286475.39 rows=1 width=32) (actual time=277.830..277.830 rows=1 loops=1)
  ->  Gather  (cost=286475.16..286475.37 rows=2 width=32) (actual time=277.760..279.194 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=285475.16..285475.17 rows=1 width=32) (actual time=275.950..275.950 rows=1 loops=3)
              ->  Parallel Index Scan using data_y2018q4_date_idx on data_y2018q4  (cost=0.43..282796.80 rows=1071345 width=8) (actual time=0.022..227.687 rows=837204 loops=3)
                    Index Cond: ((date >= '2018-12-01'::date) AND (date <= '2018-12-31'::date))
Planning time: 0.187 ms
Execution time: 279.233 ms
(9 rows)
Now I get an Index Scan using data_y2018q4_date_idx, and the whole query is twice as fast: 279.233 ms compared to 578.465 ms. What is the explanation for this? How can I force the planner to use the index scan when querying the data table, and so achieve the two-times-better timing?
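
No answer was recorded here, but one commonly tried knob (a sketch, not a guaranteed fix): if the storage is SSD-backed, lowering random_page_cost makes the planner price index scans more cheaply, which can flip the parent-table plan to the same Parallel Index Scan the direct query gets:

-- Session-scoped; 1.1 is a commonly used value for SSD storage.
SET random_page_cost = 1.1;

EXPLAIN ANALYZE
SELECT sum(col_1) FROM data WHERE "date" BETWEEN '2018-12-01' AND '2018-12-31';

It is also worth noting that in the two plans shown, the partition index scan is estimated cheaper (282796.80) than the seq scan chosen through the parent (352022.64), so re-running ANALYZE on the partition before comparing plans is a cheap sanity check.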