Is it possible to speed up this query by adding more indexes? - postgresql

I have the following SQL query
EXPLAIN ANALYZE
SELECT
    full_address,
    street_address,
    street.street,
    (select city
     from city
     where city.id = property.city_id) AS city,
    (select state_code
     from state
     where id = property.state_id) AS state_code,
    (select zipcode
     from zipcode
     where zipcode.id = property.zipcode_id) AS zipcode
FROM
    property
    INNER JOIN street ON street.id = property.street_id
WHERE
    street.street = 'W San Miguel Ave'
    AND property.zipcode_id = (SELECT id
                               FROM zipcode
                               WHERE zipcode = '85340')
Below is the EXPLAIN ANALYZE output
Gather (cost=1008.86..226541.68 rows=1 width=161) (actual time=59.311..21956.143 rows=184 loops=1)
Workers Planned: 2
Params Evaluated: $3
Workers Launched: 2
InitPlan 4 (returns $3)
-> Index Scan using zipcode_zipcode_county_id_state_id_index on zipcode zipcode_1 (cost=0.28..8.30 rows=1 width=16) (actual time=0.039..0.040 rows=1 loops=1)
Index Cond: (zipcode = '85340'::citext)
-> Nested Loop (cost=0.56..225508.35 rows=1 width=113) (actual time=7430.172..14723.451 rows=61 loops=3)
-> Parallel Seq Scan on street (cost=0.00..13681.63 rows=1 width=28) (actual time=108.023..108.053 rows=1 loops=3)
Filter: (street = 'W San Miguel Ave'::citext)
Rows Removed by Filter: 99131
-> Index Scan using property_street_address_street_id_city_id_state_id_zipcode_id_c on property (cost=0.56..211826.71 rows=1 width=117) (actual time=10983.195..21923.063 rows=92 loops=2)
Index Cond: ((street_id = street.id) AND (zipcode_id = $3))
SubPlan 1
-> Index Scan using city_id_pk on city (cost=0.28..8.30 rows=1 width=9) (actual time=0.003..0.003 rows=1 loops=184)
Index Cond: (id = property.city_id)
SubPlan 2
-> Index Scan using state_id_pk on state (cost=0.27..8.34 rows=1 width=3) (actual time=0.002..0.002 rows=1 loops=184)
Index Cond: (id = property.state_id)
SubPlan 3
-> Index Scan using zipcode_id_pk on zipcode (cost=0.28..8.30 rows=1 width=6) (actual time=0.002..0.003 rows=1 loops=184)
Index Cond: (id = property.zipcode_id)
Planning Time: 1.228 ms
Execution Time: 21956.246 ms
Is it possible to speed up this query by adding more indexes?

The query can be rewritten using joins rather than subselects. This may be faster and easier to index.
SELECT
    full_address,
    street_address,
    street.street,
    city.city as city,
    state.state_code as state_code,
    zipcode.zipcode as zipcode
FROM
    property
    INNER JOIN street ON street.id = property.street_id
    INNER JOIN city ON city.id = property.city_id
    INNER JOIN state ON state.id = property.state_id
    INNER JOIN zipcode ON zipcode.id = property.zipcode_id
WHERE
    street.street = 'W San Miguel Ave'
    AND zipcode.zipcode = '85340'
Assuming all the foreign keys (property.street_id, property.city_id, etc.) are indexed, this now becomes a search on street.street and zipcode.zipcode. As long as those two columns are indexed as well, the query should take milliseconds.
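If any of those indexes are missing, they could be created along these lines. This is only a sketch: the index names are made up, and anything that already exists in the schema should be skipped.

-- Hypothetical index definitions; skip any that already exist.
CREATE INDEX IF NOT EXISTS street_street_idx       ON street (street);
CREATE INDEX IF NOT EXISTS zipcode_zipcode_idx     ON zipcode (zipcode);
CREATE INDEX IF NOT EXISTS property_street_id_idx  ON property (street_id);
CREATE INDEX IF NOT EXISTS property_zipcode_id_idx ON property (zipcode_id);
-- A composite index matching both join/filter columns on property may also help:
CREATE INDEX IF NOT EXISTS property_street_zip_idx ON property (street_id, zipcode_id);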

Related

PostgreSQL: Scan relevant partitions only with subquery

I have a table foundation.data which is hash-partitioned on asset_id.
Whenever I do a query such as
with deleted_unprocessed_data as (
delete from foundation.unprocessed_ids d
where id = any(select id from foundation.unprocessed_ids up order by up.asset_id, up.data_point_timestamp asc limit 1000)
returning id, asset_id, data_point_timestamp
)
, ids_to_process as
(
insert into foundation.processed_ids select * from deleted_unprocessed_data
returning id, asset_id
)
select jh.asset_id,
min(jh.data_point_timestamp) as minborder,
max(jh.data_point_timestamp) as maxborder
from
(
select id, asset_id, data_point_timestamp FROM
foundation.DATA fd
where id = any(select id from ids_to_process)
and fd.asset_id = any(select asset_id from ids_to_process)
) jh
group by asset_id;
the explain will show that it accesses all partitions:
...
-> Append (cost=0.42..1822.75 rows=256 width=20) (actual time=0.285..0.617 rows=1 loops=1000)
-> Index Scan using asset_id_hash_0_id_idx on asset_id_hash_0 fd (cost=0.42..7.10 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1000)
Index Cond: (id = ids_to_process.id)
-> Index Scan using asset_id_hash_1_id_idx on asset_id_hash_1 fd_1 (cost=0.42..6.66 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1000)
Index Cond: (id = ids_to_process.id)
...
-> Index Scan using asset_id_hash_72_id_idx on asset_id_hash_72 fd_72 (cost=0.43..8.35 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1000)
Index Cond: (id = ids_to_process.id)
...
-> Index Scan using asset_id_hash_255_id_idx on asset_id_hash_255 fd_255 (cost=0.42..6.87 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1000)
Index Cond: (id = ids_to_process.id)
How do I force the planner to access only the relevant partitions?
The subquery is accessing only one partition:
-> Hash (cost=145.14..145.14 rows=200 width=4)
-> Group (cost=0.00..143.14 rows=200 width=4)
Group Key: asset_7800_1.asset_id
-> Seq Scan on asset_7800 asset_7800_1 (cost=0.00..135.11 rows=3209 width=4)
Filter: (asset_id = 7800)
The outer query on the other hand is accessing all partitions. I think this is because the planner can't be sure that the subquery will return rows belonging to only one partition.
I don't think there is an obvious way to do this, unfortunately. Try running EXPLAIN with the ANALYZE option; that will tell you what Postgres actually did, and you might see that it only ended up scanning one partition.
Answering my own question here, but I don't understand the dynamics behind it, so I'm happy for hints from your side as to why this changes so much.
I used a join instead:
with deleted_unprocessed_data as (
delete from foundation.unprocessed_ids d
where id = any(select id from foundation.unprocessed_ids up order by up.asset_id, up.data_point_timestamp asc limit 1000
-- case
-- when numberOfItems<1000 then numberOfItems
-- when numberOfItems>999 then 1000
-- end
)
returning id, asset_id, data_point_timestamp
)
, ids_to_process as
(
insert into foundation.processed_ids select * from deleted_unprocessed_data
returning id, asset_id
)
select jh.asset_id,
min(jh.data_point_timestamp) as minborder,
max(jh.data_point_timestamp) as maxborder
from
(
select fd.id, fd.asset_id, fd.data_point_timestamp FROM
foundation.DATA fd
join ids_to_process itp
on fd.asset_id = itp.asset_id
and fd.id = itp.id
) jh
group by asset_id;
Then I get the following EXPLAIN ANALYZE output, which shows that the index scans on all but one partition were never executed:
...
-> Append (cost=0.42..1823.39 rows=256 width=20) (actual time=0.002..0.003 rows=1 loops=1000)
-> Index Scan using asset_id_hash_0_id_idx on asset_id_hash_0 fd (cost=0.42..7.10 rows=1 width=20) (never executed)
Index Cond: (id = itp.id)
Filter: (itp.asset_id = asset_id)
-> Index Scan using asset_id_hash_1_id_idx on asset_id_hash_1 fd_1 (cost=0.42..6.67 rows=1 width=20) (never executed)
Index Cond: (id = itp.id)
Filter: (itp.asset_id = asset_id)
...
-> Index Scan using asset_id_hash_72_id_idx on asset_id_hash_72 fd_72 (cost=0.43..8.35 rows=1 width=20) (actual time=0.002..0.002 rows=1 loops=1000)
Index Cond: (id = itp.id)
Filter: (itp.asset_id = asset_id)
...
-> Index Scan using asset_id_hash_255_id_idx on asset_id_hash_255 fd_255 (cost=0.42..6.87 rows=1 width=20) (never executed)
Index Cond: (id = itp.id)
Filter: (itp.asset_id = asset_id)
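As far as I can tell, the difference comes down to run-time partition pruning: the "(never executed)" markers mean the executor skipped those partitions at execution time. With the plain join, each outer row carries an asset_id value that is compared against the partition key, so the Append node can prune the other hash partitions per row. With asset_id = any(subselect), the condition is evaluated as a filter against the subquery's result rather than as a parameter the Append node can prune on, so every partition had to be probed. Run-time pruning exists since PostgreSQL 11 and is governed by a setting you can check:

-- on by default in PostgreSQL 11 and later
SHOW enable_partition_pruning;
SET enable_partition_pruning = on;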

Improving performance of query

We have a PostgreSQL query with multiple tables and left outer joins, and it is running very slowly.
It currently completes in 25-40 s, and we want to optimize it to bring the run time down to 1-2 s.
select a.campaignid,
       b.campaign_name,
       case when b.message_type_id = 1 then 'Promotional'
            when b.message_type_id = 2 then 'Transactional'
            else 'Other' end as Campaign_type,
       c.username,
       aggregator_type,
       e.cli_manager_id as senderID,
       b.schedule_time as campaign_schedule_date,
       count(a.mobile) as campaign_submitted_count,
       count(case when a.status = 'DELIVRD' then mobile end) as Delivered,
       count(a.mobile) as Total_count,
       count(case when a.status = 'FAILED' then mobile end) as failure_count,
       count(case when a.status = 'DND_check_failed' then mobile end) as DND_count,
       sum(credits_used) as credits_used
from tbl_cdr_test a
left outer join tbl_campaign b on a.campaignid = b.tbl_campaign_id
left outer join tbl_users_master c on b.user_id = c.user_master_id
left outer join tbl_cli_manager e on b.user_id = e.user_id
left outer join tbl_user_channel f on b.user_id = f.user_id
left outer join tbl_user_configurations g on b.user_id = g.user_id
where date(insert_datetime) between '2020-05-23' and '2020-06-23'
  and c.username = coalesce(null, c.username)
  and g.msg_cat_id = coalesce(null, g.msg_cat_id)
  and a.campaignid = coalesce(null, a.campaignid)
  and e.cli_manager_id = coalesce(null, e.cli_manager_id)
group by a.campaignid, b.campaign_name, b.message_type_id, c.username, b.schedule_time,
         aggregator_type, e.cli_manager_id;
We have created appropriate indexes as well, but it is still taking time.
Moreover, the execution plan shows an "external merge Disk" sort method; to resolve that I have set work_mem = 50MB, but it is still sorting on disk instead of in memory. Please suggest.
Below is the execution plan:
GroupAggregate (cost=4872.01..4872.07 rows=1 width=543) (actual time=20564.239..27415.264 rows=8 loops=1)
Group Key: a.campaignid, b.campaign_name, b.message_type_id, c.username, b.schedule_time, f.aggregator_type, e.cli_manager_id
-> Sort (cost=4872.01..4872.01 rows=1 width=483) (actual time=19627.424..25020.702 rows=3206196 loops=1)
Sort Key: a.campaignid, b.campaign_name, b.message_type_id, c.username, b.schedule_time, f.aggregator_type, e.cli_manager_id
Sort Method: external merge Disk: 281456kB
-> Nested Loop (cost=22.03..4872.00 rows=1 width=483) (actual time=99.704..12086.244 rows=3206196 loops=1)
Join Filter: (b.user_id = g.user_id)
-> Nested Loop Left Join (cost=21.89..4871.79 rows=1 width=495) (actual time=99.688..4518.533 rows=3206196 loops=1)
-> Nested Loop (cost=21.75..4871.54 rows=1 width=77) (actual time=99.664..935.689 rows=356244 loops=1)
-> Nested Loop (cost=21.33..31.57 rows=1 width=65) (actual time=0.295..2.376 rows=588 loops=1)
Join Filter: (b.user_id = c.user_master_id)
-> Merge Join (cost=21.18..30.22 rows=6 width=46) (actual time=0.246..0.663 rows=588 loops=1)
Merge Cond: (e.user_id = b.user_id)
-> Index Scan using "idx_FK_7hc6agd_tbl_cli_ma_1592228110_32" on tbl_cli_manager e (cost=0.42..6281.84 rows=762 width=12) (actual time=0.014..0.035 rows=5 loops=1)
Filter: (cli_manager_id = COALESCE(cli_manager_id))
-> Sort (cost=20.76..21.13 rows=147 width=34) (actual time=0.225..0.333 rows=585 loops=1)
Sort Key: b.user_id
Sort Method: quicksort Memory: 36kB
-> Seq Scan on tbl_campaign b (cost=0.00..15.47 rows=147 width=34) (actual time=0.013..0.154 rows=147 loops=1)
-> Index Scan using ind_user_master_c_user on tbl_users_master c (cost=0.14..0.21 rows=1 width=19) (actual time=0.002..0.002 rows=1 loops=588)
Index Cond: (user_master_id = e.user_id)
Filter: ((username)::text = (COALESCE(username))::text)
-> Append (cost=0.42..4839.94 rows=3 width=20) (actual time=0.546..1.426 rows=606 loops=588)
-> Index Scan using testh11_campaignid_idx on testh11 a (cost=0.42..4253.99 rows=2 width=20) (actual time=0.543..0.543 rows=0 loops=588)
Index Cond: (campaignid = b.tbl_campaign_id)
Filter: ((campaignid = COALESCE(campaignid)) AND (date(insert_datetime) >= '2020-05-23'::date) AND (date(insert_datetime) <= '2020-06-23'::date))
Rows Removed by Filter: 656
-> Index Scan using testh21_campaignid_idx on testh21 a_1 (cost=0.42..585.94 rows=1 width=20) (actual time=0.002..0.796 rows=606 loops=588)
Index Cond: (campaignid = b.tbl_campaign_id)
Filter: ((campaignid = COALESCE(campaignid)) AND (date(insert_datetime) >= '2020-05-23'::date) AND (date(insert_datetime) <= '2020-06-23'::date))
-> Index Scan using idx_user_id_tbl_user_c_1592227657_19 on tbl_user_channel f (cost=0.14..0.24 rows=1 width=422) (actual time=0.002..0.004 rows=9 loops=356244)
Index Cond: (user_id = b.user_id)
-> Index Scan using "idx_FK_6958qvy_tbl_user_c_1592228774_151" on tbl_user_configurations g (cost=0.14..0.20 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=3206196)
Index Cond: (user_id = e.user_id)
Filter: (msg_cat_id = COALESCE(msg_cat_id))
Planning Time: 6.561 ms
Execution Time: 27477.860 ms
There is a gross underestimate of the result rows for the index scan on testh21. The consequence is that PostgreSQL chooses nested loop joins, which is where your time is spent.
Try the following:
New statistics:
ANALYZE testh21;
If that improves the estimate, make sure that autoanalyze treats the table more often.
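One way to do that, with illustrative threshold values, is to lower the autoanalyze thresholds for that table only:

-- Illustrative values: autoanalyze after ~2% of rows change instead of the default 10%.
ALTER TABLE testh21 SET (autovacuum_analyze_scale_factor = 0.02);
ALTER TABLE testh21 SET (autovacuum_analyze_threshold = 1000);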
Prevent bad estimates caused by correlation:
CREATE STATISTICS testh21_stat (dependencies)
ON campaignid, insert_datetime FROM testh21;
ANALYZE testh21;
Perhaps there is a correlation between the columns, and that improves the estimate.
More detailed statistics: try raising default_statistics_target before running ANALYZE on the table.
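For example (the target value is only illustrative; the default is 100):

SET default_statistics_target = 500;   -- affects ANALYZE runs in this session
ANALYZE testh21;
-- or persistently, for just the column with the bad estimate:
ALTER TABLE testh21 ALTER COLUMN campaignid SET STATISTICS 500;
ANALYZE testh21;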
If you cannot improve the estimates, take the hammer and set enable_nestloop = off for the duration of the query.
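That setting can be scoped to a single transaction so it does not affect the rest of the workload:

BEGIN;
SET LOCAL enable_nestloop = off;   -- reverted automatically at COMMIT/ROLLBACK
-- run the slow query here
COMMIT;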

Postgres Table Slow Performance

We have a Product table in a Postgres DB hosted on Heroku, with 8 GB RAM, 250 GB disk space, and 1000 IOPS allowed.
We have appropriate indexes on the relevant columns.
Platform
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu (Ubuntu 9.5.12-1.pgdg14.04+1), compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.4) 4.8.4, 64-bit
We are running a keyword search query on this table, which has 2.8 million records. The search query is too slow: it returns results in about 50 seconds.
Query
SELECT
    P.sfid AS prodsfid,
    P.image_url__c image,
    P.productcode sku,
    P.Short_Description__c shortDesc,
    P.NAME pname,
    P.category__c,
    P.price__c price,
    P.description,
    P.vendor_name__c vname,
    P.vendor__c supSfid
FROM staging.product2 P
JOIN (
    SELECT p1.sfid
    FROM staging.product2 p1
    WHERE p1.NAME ILIKE '%s%'
       OR p1.productcode ILIKE '%s%'
) AS TEMP ON (P.sfid = TEMP.sfid)
WHERE P.status__c = 'Available'
AND LOWER(P.vendor_shipping_country__c) = ANY (
    VALUES
        ('us'),
        ('usa'),
        ('united states'),
        ('united states of america')
)
AND P.vendor_catalog_tier__c = ANY (
VALUES
('a1c37000000oljnAAA'),
('a1c37000000oljQAAQ'),
('a1c37000000oljQAAQ'),
('a1c37000000pT7IAAU'),
('a1c37000000omDjAAI'),
('a1c37000000oljMAAQ'),
('a1c37000000oljaAAA'),
('a1c37000000pT7SAAU'),
('a1c0R000000AFcVQAW'),
('a1c0R000000A1HAQA0'),
('a1c0R0000000OpWQAU'),
('a1c0R0000005TZMQA2'),
('a1c37000000oljdAAA'),
('a1c37000000ooTqAAI'),
('a1c37000000omLBAAY'),
('a1c0R0000005N8GQAU')
)
Here is the explain plan:
Nested Loop (cost=31.85..33886.54 rows=3681 width=750)
-> Hash Join (cost=31.77..31433.07 rows=4415 width=750)
Hash Cond: (lower((p.vendor_shipping_country__c)::text) = "*VALUES*".column1)
-> Nested Loop (cost=31.73..31423.67 rows=8830 width=761)
-> HashAggregate (cost=0.06..0.11 rows=16 width=32)
Group Key: "*VALUES*_1".column1
-> Values Scan on "*VALUES*_1" (cost=0.00..0.06 rows=16 width=32)
-> Bitmap Heap Scan on product2 p (cost=31.66..1962.32 rows=552 width=780)
Recheck Cond: ((vendor_catalog_tier__c)::text = "*VALUES*_1".column1)
Filter: ((status__c)::text = 'Available'::text)
-> Bitmap Index Scan on vendor_catalog_tier_prd_idx (cost=0.00..31.64 rows=1016 width=0)
Index Cond: ((vendor_catalog_tier__c)::text = "*VALUES*_1".column1)
-> Hash (cost=0.03..0.03 rows=4 width=32)
-> Unique (cost=0.02..0.03 rows=4 width=32)
-> Sort (cost=0.02..0.02 rows=4 width=32)
Sort Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=4 width=32)
-> Index Scan using sfid_prd_idx on product2 p1 (cost=0.09..0.55 rows=1 width=19)
Index Cond: ((sfid)::text = (p.sfid)::text)
Filter: (((name)::text ~~* '%s%'::text) OR ((productcode)::text ~~* '%s%'::text))
It returns around 140,576 records, but we only need the top 5,000. Will adding a LIMIT help here?
Let me know how to make it fast and what is causing the slowness.
EXPLAIN ANALYZE
@RaymondNijland Here is the EXPLAIN ANALYZE:
Nested Loop (cost=31.83..33427.28 rows=4039 width=750) (actual time=1.903..4384.221 rows=140576 loops=1)
-> Hash Join (cost=31.74..30971.32 rows=4369 width=750) (actual time=1.852..1094.964 rows=164353 loops=1)
Hash Cond: (lower((p.vendor_shipping_country__c)::text) = "*VALUES*".column1)
-> Nested Loop (cost=31.70..30962.02 rows=8738 width=761) (actual time=1.800..911.738 rows=164353 loops=1)
-> HashAggregate (cost=0.06..0.11 rows=16 width=32) (actual time=0.012..0.019 rows=15 loops=1)
Group Key: "*VALUES*_1".column1
-> Values Scan on "*VALUES*_1" (cost=0.00..0.06 rows=16 width=32) (actual time=0.004..0.005 rows=16 loops=1)
-> Bitmap Heap Scan on product2 p (cost=31.64..1933.48 rows=546 width=780) (actual time=26.004..57.290 rows=10957 loops=15)
Recheck Cond: ((vendor_catalog_tier__c)::text = "*VALUES*_1".column1)
Filter: ((status__c)::text = 'Available'::text)
Rows Removed by Filter: 645
Heap Blocks: exact=88436
-> Bitmap Index Scan on vendor_catalog_tier_prd_idx (cost=0.00..31.61 rows=1000 width=0) (actual time=24.811..24.811 rows=11601 loops=15)
Index Cond: ((vendor_catalog_tier__c)::text = "*VALUES*_1".column1)
-> Hash (cost=0.03..0.03 rows=4 width=32) (actual time=0.032..0.032 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Unique (cost=0.02..0.03 rows=4 width=32) (actual time=0.026..0.027 rows=4 loops=1)
-> Sort (cost=0.02..0.02 rows=4 width=32) (actual time=0.026..0.026 rows=4 loops=1)
Sort Key: "*VALUES*".column1
Sort Method: quicksort Memory: 25kB
-> Values Scan on "*VALUES*" (cost=0.00..0.01 rows=4 width=32) (actual time=0.001..0.002 rows=4 loops=1)
-> Index Scan using sfid_prd_idx on product2 p1 (cost=0.09..0.56 rows=1 width=19) (actual time=0.019..0.020 rows=1 loops=164353)
Index Cond: ((sfid)::text = (p.sfid)::text)
Filter: (((name)::text ~~* '%s%'::text) OR ((productcode)::text ~~* '%s%'::text))
Rows Removed by Filter: 0
Planning time: 2.488 ms
Execution time: 4391.378 ms
Another query version, with ORDER BY, but it seems very slow as well (140 seconds):
SELECT
    P.sfid AS prodsfid,
    P.image_url__c image,
    P.productcode sku,
    P.Short_Description__c shortDesc,
    P.NAME pname,
    P.category__c,
    P.price__c price,
    P.description,
    P.vendor_name__c vname,
    P.vendor__c supSfid
FROM staging.product2 P
WHERE P.status__c = 'Available'
AND P.vendor_shipping_country__c IN (
    'us',
    'usa',
    'united states',
    'united states of america'
)
AND P.vendor_catalog_tier__c IN (
'a1c37000000omDQAAY',
'a1c37000000omDTAAY',
'a1c37000000omDXAAY',
'a1c37000000omDYAAY',
'a1c37000000omDZAAY',
'a1c37000000omDdAAI',
'a1c37000000omDfAAI',
'a1c37000000omDiAAI',
'a1c37000000oml6AAA',
'a1c37000000oljPAAQ',
'a1c37000000oljRAAQ',
'a1c37000000oljWAAQ',
'a1c37000000oljXAAQ',
'a1c37000000oljZAAQ',
'a1c37000000oljcAAA',
'a1c37000000oljdAAA',
'a1c37000000oljlAAA',
'a1c37000000oljoAAA',
'a1c37000000oljqAAA',
'a1c37000000olnvAAA',
'a1c37000000olnwAAA',
'a1c37000000olnxAAA',
'a1c37000000olnyAAA',
'a1c37000000olo0AAA',
'a1c37000000olo1AAA',
'a1c37000000olo4AAA',
'a1c37000000olo8AAA',
'a1c37000000olo9AAA',
'a1c37000000oloCAAQ',
'a1c37000000oloFAAQ',
'a1c37000000oloIAAQ',
'a1c37000000oloJAAQ',
'a1c37000000oloMAAQ',
'a1c37000000oloNAAQ',
'a1c37000000oloSAAQ',
'a1c37000000olodAAA',
'a1c37000000oloeAAA',
'a1c37000000olzCAAQ',
'a1c37000000om0xAAA',
'a1c37000000ooV1AAI',
'a1c37000000oog8AAA',
'a1c37000000oogDAAQ',
'a1c37000000oonzAAA',
'a1c37000000oluuAAA',
'a1c37000000pT7SAAU',
'a1c37000000oljnAAA',
'a1c37000000olumAAA',
'a1c37000000oljpAAA',
'a1c37000000pUm2AAE',
'a1c37000000olo3AAA',
'a1c37000000oo1MAAQ',
'a1c37000000oo1vAAA',
'a1c37000000pWxgAAE',
'a1c37000000pYJkAAM',
'a1c37000000omDjAAI',
'a1c37000000ooTgAAI',
'a1c37000000op2GAAQ',
'a1c37000000one0AAA',
'a1c37000000oljYAAQ',
'a1c37000000pUlxAAE',
'a1c37000000oo9SAAQ',
'a1c37000000pcIYAAY',
'a1c37000000pamtAAA',
'a1c37000000pd2QAAQ',
'a1c37000000pdCOAAY',
'a1c37000000OpPaAAK',
'a1c37000000OphZAAS',
'a1c37000000olNkAAI'
)
ORDER BY p.productcode asc
LIMIT 5000
Here is the EXPLAIN ANALYZE for this:
Limit (cost=0.09..45271.54 rows=5000 width=750) (actual time=48593.355..86376.864 rows=5000 loops=1)
-> Index Scan using productcode_prd_idx on product2 p (cost=0.09..743031.39 rows=82064 width=750) (actual time=48593.353..86376.283 rows=5000 loops=1)
Filter: (((status__c)::text = 'Available'::text) AND ((vendor_shipping_country__c)::text = ANY ('{us,usa,"united states","united states of america"}'::text[])) AND ((vendor_catalog_tier__c)::text = ANY ('{a1c37000000omDQAAY,a1c37000000omDTAAY,a1c37000000omDXAAY,a1c37000000omDYAAY,a1c37000000omDZAAY,a1c37000000omDdAAI,a1c37000000omDfAAI,a1c37000000omDiAAI,a1c37000000oml6AAA,a1c37000000oljPAAQ,a1c37000000oljRAAQ,a1c37000000oljWAAQ,a1c37000000oljXAAQ,a1c37000000oljZAAQ,a1c37000000oljcAAA,a1c37000000oljdAAA,a1c37000000oljlAAA,a1c37000000oljoAAA,a1c37000000oljqAAA,a1c37000000olnvAAA,a1c37000000olnwAAA,a1c37000000olnxAAA,a1c37000000olnyAAA,a1c37000000olo0AAA,a1c37000000olo1AAA,a1c37000000olo4AAA,a1c37000000olo8AAA,a1c37000000olo9AAA,a1c37000000oloCAAQ,a1c37000000oloFAAQ,a1c37000000oloIAAQ,a1c37000000oloJAAQ,a1c37000000oloMAAQ,a1c37000000oloNAAQ,a1c37000000oloSAAQ,a1c37000000olodAAA,a1c37000000oloeAAA,a1c37000000olzCAAQ,a1c37000000om0xAAA,a1c37000000ooV1AAI,a1c37000000oog8AAA,a1c37000000oogDAAQ,a1c37000000oonzAAA,a1c37000000oluuAAA,a1c37000000pT7SAAU,a1c37000000oljnAAA,a1c37000000olumAAA,a1c37000000oljpAAA,a1c37000000pUm2AAE,a1c37000000olo3AAA,a1c37000000oo1MAAQ,a1c37000000oo1vAAA,a1c37000000pWxgAAE,a1c37000000pYJkAAM,a1c37000000omDjAAI,a1c37000000ooTgAAI,a1c37000000op2GAAQ,a1c37000000one0AAA,a1c37000000oljYAAQ,a1c37000000pUlxAAE,a1c37000000oo9SAAQ,a1c37000000pcIYAAY,a1c37000000pamtAAA,a1c37000000pd2QAAQ,a1c37000000pdCOAAY,a1c37000000OpPaAAK,a1c37000000OphZAAS,a1c37000000olNkAAI}'::text[])))
Rows Removed by Filter: 1707920
Planning time: 1.685 ms
Execution time: 86377.139 ms
You might want to consider a GIN or GiST index on your staging.product2 table. ILIKE with a wildcard on both sides is slow and difficult to improve substantially. I've seen a GIN index improve a similar query by 60-80%.
See this doc.
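In practice that usually means a trigram index from the pg_trgm extension, which is what lets GIN/GiST indexes serve ILIKE '%...%' predicates. A minimal sketch (index names are made up):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN indexes that ILIKE '%keyword%' on name/productcode can use:
CREATE INDEX product2_name_trgm_idx
    ON staging.product2 USING gin (name gin_trgm_ops);
CREATE INDEX product2_productcode_trgm_idx
    ON staging.product2 USING gin (productcode gin_trgm_ops);

Note that a one-character pattern such as '%s%' contains no complete trigram, so the index pays off mainly for longer search terms.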

Postgres Query not selecting index for a column with OR condition

I have a query where Postgres performs a hash join with sequential scans instead of a nested loop with index scans when I use an OR condition. This causes the query to take 2 seconds instead of completing in < 100 ms. I have run VACUUM ANALYZE and have rebuilt the index on the PATIENTCHARTNOTE table (which is about 5 GB), but it's still using the hash join. Do you have any suggestions on how I can improve this?
explain analyze
SELECT Count (_pcn.id) AS total_open_note
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND ( _pt.assigned_to_user_id = '136964'
OR _pcn.createdby_id = '136964'
);
Aggregate (cost=237655.59..237655.60 rows=1 width=8) (actual time=1602.069..1602.069 rows=1 loops=1)
-> Hash Join (cost=83095.43..237645.30 rows=4117 width=4) (actual time=944.850..1602.014 rows=241 loops=1)
Hash Cond: (_appt.patient_id = _pt.id)
Join Filter: ((_pt.assigned_to_user_id = 136964) OR (_pcn.createdby_id = 136964))
Rows Removed by Join Filter: 94036
-> Hash Join (cost=46650.68..182243.64 rows=556034 width=12) (actual time=415.862..1163.812 rows=94457 loops=1)
Hash Cond: (_pcn.appointment_id = _appt.id)
-> Seq Scan on patientchartnote _pcn (cost=0.00..112794.20 rows=1073978 width=12) (actual time=0.016..423.262 rows=1073618 loops=1)
Filter: (active AND (title IS NOT NULL) AND ((title)::text <> ''::text))
Rows Removed by Filter: 22488
-> Hash (cost=35223.61..35223.61 rows=696486 width=8) (actual time=414.749..414.749 rows=692839 loops=1)
Buckets: 131072 Batches: 16 Memory Usage: 2732kB
-> Seq Scan on appointment _appt (cost=0.00..35223.61 rows=696486 width=8) (actual time=0.010..271.208 rows=692839 loops=1)
Filter: (datecomplete IS NULL)
Rows Removed by Filter: 652426
-> Hash (cost=24698.57..24698.57 rows=675694 width=12) (actual time=351.566..351.566 rows=674929 loops=1)
Buckets: 131072 Batches: 16 Memory Usage: 2737kB
-> Seq Scan on patient _pt (cost=0.00..24698.57 rows=675694 width=12) (actual time=0.013..197.268 rows=674929 loops=1)
Filter: active
Rows Removed by Filter: 17426
Planning time: 1.533 ms
Execution time: 1602.715 ms
When I replace "OR _pcn.createdby_id = '136964'" with "AND _pcn.createdby_id = '136964'", Postgres performs an index scan
Aggregate (cost=29167.56..29167.57 rows=1 width=8) (actual time=937.743..937.743 rows=1 loops=1)
-> Nested Loop (cost=1.28..29167.55 rows=7 width=4) (actual time=19.136..937.669 rows=37 loops=1)
-> Nested Loop (cost=0.85..27393.03 rows=1654 width=4) (actual time=2.154..910.250 rows=1649 loops=1)
-> Index Scan using patient_activeassigned_idx on patient _pt (cost=0.42..3075.00 rows=1644 width=8) (actual time=1.599..11.820 rows=1627 loops=1)
Index Cond: ((active = true) AND (assigned_to_user_id = 136964))
Filter: active
-> Index Scan using appointment_datepatient_idx on appointment _appt (cost=0.43..14.75 rows=4 width=8) (actual time=0.543..0.550 rows=1 loops=1627)
Index Cond: ((patient_id = _pt.id) AND (datecomplete IS NULL))
-> Index Scan using patientchartnote_activeappointment_idx on patientchartnote _pcn (cost=0.43..1.06 rows=1 width=8) (actual time=0.014..0.014 rows=0 loops=1649)
Index Cond: ((active = true) AND (createdby_id = 136964) AND (appointment_id = _appt.id) AND (title IS NOT NULL))
Filter: (active AND ((title)::text <> ''::text))
Planning time: 1.489 ms
Execution time: 937.910 ms
(13 rows)
Using OR in SQL queries usually results in bad performance.
That is because – different from AND – it does not restrict, but extend the number of rows in the query result. With AND, you can use an index scan for one part of the condition and further restrict the result set with a filter on the second condition. That is not possible with OR.
So PostgreSQL does the only thing left: it computes the whole join and then filters out all rows that do not match the condition. Of course that is very inefficient when you are joining three tables (I didn't count the outer join).
Assuming that all columns called id are primary keys, you could rewrite the query as follows:
SELECT count(*) FROM
(SELECT _pcn.id
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND _pt.assigned_to_user_id = '136964'
UNION
SELECT _pcn.id
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND _pcn.createdby_id = '136964'
) q;
Even though this is running the query twice, indexes can be used to filter out all but a few rows early on, so this query should perform better.
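For the UNION rewrite to pay off, each branch needs an index matching its selective predicate. The faster AND-plan in the question already shows such indexes (patient_activeassigned_idx and patientchartnote_activeappointment_idx); if they did not exist, hypothetical partial indexes along these lines would be one option (illustrative definitions, not the actual ones from the question):

-- one index per UNION branch, restricted to active rows
CREATE INDEX patient_assigned_active_idx
    ON patient (assigned_to_user_id)
    WHERE active;

CREATE INDEX patientchartnote_createdby_active_idx
    ON patientchartnote (createdby_id, appointment_id)
    WHERE active;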

postgres two column sort low performance

I've got a query that performs multiple joins. I'm trying to get only the latest position for each keyword in the results.
Here is the query:
SELECT DISTINCT ON (p.keyword_id)
a.id AS account_id,
w.parent_id AS parent_id,
w.name AS name,
p.position AS position
FROM websites w
JOIN accounts a ON w.account_id = a.id
JOIN keywords k ON k.website_id = w.parent_id
JOIN positions p ON p.website_id = w.parent_id
WHERE a.amount > 0 AND w.parent_id NOTNULL AND (round((a.amount / a.payment_renewal_period), 2) BETWEEN 1 AND 19)
ORDER BY p.keyword_id, p.created_at DESC;
The plan with costs for that query is as follows:
Unique (cost=73673.65..76630.38 rows=264 width=40) (actual time=30777.117..49143.023 rows=259 loops=1)
-> Sort (cost=73673.65..75152.02 rows=591347 width=40) (actual time=30777.116..47352.373 rows=10891486 loops=1)
Sort Key: p.keyword_id, p.created_at DESC
Sort Method: external merge Disk: 512672kB
-> Merge Join (cost=219.59..812.26 rows=591347 width=40) (actual time=3.487..3827.028 rows=10891486 loops=1)
Merge Cond: (w.parent_id = k.website_id)
-> Nested Loop (cost=128.46..597.73 rows=1268 width=44) (actual time=3.378..108.915 rows=61582 loops=1)
-> Nested Loop (cost=2.28..39.86 rows=1 width=28) (actual time=0.026..0.216 rows=7 loops=1)
-> Index Scan using index_websites_on_parent_id on websites w (cost=0.14..15.08 rows=4 width=28) (actual time=0.004..0.023 rows=7 loops=1)
Index Cond: (parent_id IS NOT NULL)
-> Bitmap Heap Scan on accounts a (cost=2.15..6.18 rows=1 width=4) (actual time=0.019..0.020 rows=1 loops=7)
Recheck Cond: (id = w.account_id)
Filter: ((amount > '0'::numeric) AND (round((amount / (payment_renewal_period)::numeric), 2) >= '1'::numeric) AND (round((amount / (payment_renewal_period)::numeric), 2) <= '19'::numeric))
Heap Blocks: exact=7
-> Bitmap Index Scan on accounts_pkey (cost=0.00..2.15 rows=1 width=0) (actual time=0.006..0.006 rows=1 loops=7)
Index Cond: (id = w.account_id)
-> Bitmap Heap Scan on positions p (cost=126.18..511.57 rows=4631 width=16) (actual time=0.994..8.226 rows=8797 loops=7)
Recheck Cond: (website_id = w.parent_id)
Heap Blocks: exact=1004
-> Bitmap Index Scan on index_positions_on_5_columns (cost=0.00..125.02 rows=4631 width=0) (actual time=0.965..0.965 rows=8797 loops=7)
Index Cond: (website_id = w.parent_id)
-> Sort (cost=18.26..18.92 rows=264 width=4) (actual time=0.106..1013.966 rows=10891487 loops=1)
Sort Key: k.website_id
Sort Method: quicksort Memory: 37kB
-> Seq Scan on keywords k (cost=0.00..7.64 rows=264 width=4) (actual time=0.005..0.039 rows=263 loops=1)
Planning time: 1.081 ms
Execution time: 49184.222 ms
The thing is, when I run the query with w.id instead of w.parent_id in the positions join, the total cost decreases to
Unique (cost=3621.07..3804.99 rows=264 width=40) (actual time=128.430..139.550 rows=259 loops=1)
-> Sort (cost=3621.07..3713.03 rows=36784 width=40) (actual time=128.429..135.444 rows=40385 loops=1)
Sort Key: p.keyword_id, p.created_at DESC
Sort Method: external sort Disk: 2000kB
-> Merge Join (cost=128.73..831.59 rows=36784 width=40) (actual time=25.521..63.299 rows=40385 loops=1)
Merge Cond: (k.website_id = w.id)
-> Index Only Scan using index_keywords_on_website_id_deleted_at on keywords k (cost=0.27..24.23 rows=264 width=4) (actual time=0.137..0.274 rows=263 loops=1)
Heap Fetches: 156
-> Materialize (cost=128.46..606.85 rows=1268 width=44) (actual time=3.772..49.587 rows=72242 loops=1)
-> Nested Loop (cost=128.46..603.68 rows=1268 width=44) (actual time=3.769..30.530 rows=61582 loops=1)
-> Nested Loop (cost=2.28..45.80 rows=1 width=32) (actual time=0.047..0.204 rows=7 loops=1)
-> Index Scan using websites_pkey on websites w (cost=0.14..21.03 rows=4 width=32) (actual time=0.007..0.026 rows=7 loops=1)
Filter: (parent_id IS NOT NULL)
Rows Removed by Filter: 4
-> Bitmap Heap Scan on accounts a (cost=2.15..6.18 rows=1 width=4) (actual time=0.018..0.019 rows=1 loops=7)
Recheck Cond: (id = w.account_id)
Filter: ((amount > '0'::numeric) AND (round((amount / (payment_renewal_period)::numeric), 2) >= '1'::numeric) AND (round((amount / (payment_renewal_period)::numeric), 2) <= '19'::numeric))
Heap Blocks: exact=7
-> Bitmap Index Scan on accounts_pkey (cost=0.00..2.15 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=7)
Index Cond: (id = w.account_id)
-> Bitmap Heap Scan on positions p (cost=126.18..511.57 rows=4631 width=16) (actual time=0.930..2.341 rows=8797 loops=7)
Recheck Cond: (website_id = w.parent_id)
Heap Blocks: exact=1004
-> Bitmap Index Scan on index_positions_on_5_columns (cost=0.00..125.02 rows=4631 width=0) (actual time=0.906..0.906 rows=8797 loops=7)
Index Cond: (website_id = w.parent_id)
Planning time: 1.124 ms
Execution time: 157.167 ms
Indexes on websites
Indexes:
"websites_pkey" PRIMARY KEY, btree (id)
"index_websites_on_account_id" btree (account_id)
"index_websites_on_deleted_at" btree (deleted_at)
"index_websites_on_domain_id" btree (domain_id)
"index_websites_on_parent_id" btree (parent_id)
"index_websites_on_processed_at" btree (processed_at)
Indexes on positions
Indexes:
"positions_pkey" PRIMARY KEY, btree (id)
"index_positions_on_5_columns" UNIQUE, btree (website_id, keyword_id, created_at, engine_id, region_id)
"overlap_index" btree (keyword_id, created_at)
The second EXPLAIN output shows more than 200 times fewer rows, so it is hardly surprising that sorting is much faster.
You will notice that the sort spills to disk in both cases (Sort Method: external merge Disk: ...kB). If you can keep the sort in memory by raising work_mem, it will be much faster.
But the first sort is so large that you won't be able to fit it in memory.
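For the smaller sort (about 2 MB on disk) a modest, session-scoped bump would already be enough; the value here is only illustrative:

SET work_mem = '64MB';  -- applies per sort/hash node in this session, so keep it reasonable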
Ideas to speed up the query:
An index on (keyword_id, created_at) for positions. Not sure if that helps, though.
Do the filtering first, like this:
SELECT
a.id AS account_id,
w.parent_id AS parent_id,
w.name AS name,
p.position AS position
FROM (SELECT DISTINCT ON (keyword_id)
position,
website_id,
keyword_id,
created_at
FROM positions
ORDER BY keyword_id, created_at DESC) p
JOIN ...
WHERE ...
ORDER BY p.keyword_id, p.created_at DESC;
Remark: The DISTINCT ON is somewhat strange, since you do not ORDER BY the values of the SELECT list, so the result values are not well defined.
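For reference, here is a fully spelled-out version of that filter-first sketch, built from the tables in the original query. It is only a sketch: because the latest row per keyword is picked before joining, the result can differ from the original query whenever that latest row's website is filtered out by the join conditions.

SELECT
    a.id        AS account_id,
    w.parent_id AS parent_id,
    w.name      AS name,
    p.position  AS position
FROM (SELECT DISTINCT ON (keyword_id)
             website_id, keyword_id, position, created_at
      FROM positions
      ORDER BY keyword_id, created_at DESC) p
JOIN websites w ON p.website_id = w.parent_id
JOIN accounts a ON w.account_id = a.id
JOIN keywords k ON k.website_id = w.parent_id
WHERE a.amount > 0
  AND w.parent_id IS NOT NULL
  AND round((a.amount / a.payment_renewal_period), 2) BETWEEN 1 AND 19
ORDER BY p.keyword_id, p.created_at DESC;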