SELECT DISTINCT optimization

I'm using PostgreSQL 8.4. What is the best way to optimize this query:
SELECT DISTINCT campaigns.* FROM campaigns
LEFT JOIN adverts ON campaign_id = campaigns.id
LEFT JOIN shops ON campaigns.shop_id = shops.id
LEFT JOIN exports_adverts ON advert_id = adverts.id
LEFT JOIN exports ON export_id = exports.id
LEFT JOIN rotations ON rotations.advert_id = adverts.id
LEFT JOIN blocks ON block_id = blocks.id
WHERE
(shops.is_active = TRUE)
AND exports.user_id = any(uids)
OR blocks.user_id = any(uids)
AND campaigns.id = any(ids)
My tables are:
CREATE TABLE campaigns (
id integer NOT NULL,
shop_id integer NOT NULL,
title character varying NOT NULL,
...
);
CREATE TABLE adverts (
id integer NOT NULL,
campaign_id integer NOT NULL,
title character varying NOT NULL,
...
);
CREATE TABLE shops (
id integer NOT NULL,
title character varying NOT NULL,
is_active boolean DEFAULT true NOT NULL,
...
);
CREATE TABLE exports (
id integer NOT NULL,
title character varying,
user_id integer NOT NULL,
...
);
CREATE TABLE exports_adverts (
id integer NOT NULL,
export_id integer NOT NULL,
advert_id integer NOT NULL,
...
);
CREATE TABLE rotations (
id integer NOT NULL,
block_id integer NOT NULL,
advert_id integer NOT NULL,
...
);
CREATE TABLE blocks (
id integer NOT NULL,
title character varying NOT NULL,
user_id integer NOT NULL,
...
);
I already have indexes on all the fields used in this query. Is there anything I can do to optimize it?
EXPLAIN for this query:
Unique (cost=284529.95..321207.47 rows=57088 width=106) (actual time=508048.104..609870.600 rows=106 loops=1)
-> Sort (cost=284529.95..286567.59 rows=815056 width=106) (actual time=508048.102..602413.688 rows=8354563 loops=1)
Sort Key: campaigns.id, campaigns.shop_id, campaigns.title
Sort Method: external merge Disk: 1017136kB
-> Hash Left Join (cost=2258.33..62419.56 rows=815056 width=106) (actual time=49.509..17510.009 rows=8354563 loops=1)
Hash Cond: (rotations.block_id = blocks.id)
-> Merge Right Join (cost=1719.44..44560.73 rows=815056 width=110) (actual time=42.194..12317.422 rows=8354563 loops=1)
Merge Cond: (rotations.advert_id = adverts.id)
-> Index Scan using rotations_advert_id_key on rotations (cost=0.00..29088.30 rows=610999 width=8) (actual time=0.040..3026.898 rows=610999 loops=1)
-> Sort (cost=1719.44..1737.90 rows=7386 width=110) (actual time=42.144..3965.416 rows=8354563 loops=1)
Sort Key: adverts.id
Sort Method: external sort Disk: 1336kB
-> Hash Left Join (cost=739.01..1244.87 rows=7386 width=110) (actual time=10.519..21.351 rows=10571 loops=1)
Hash Cond: (exports_adverts.export_id = exports.id)
-> Hash Left Join (cost=715.60..1119.90 rows=7386 width=114) (actual time=10.178..17.472 rows=10571 loops=1)
Hash Cond: (adverts.id = exports_adverts.advert_id)
-> Hash Left Join (cost=304.71..433.53 rows=2781 width=110) (actual time=3.614..5.106 rows=3035 loops=1)
Hash Cond: (campaigns.id = adverts.campaign_id)
-> Hash Join (cost=1.13..9.32 rows=112 width=106) (actual time=0.051..0.303 rows=106 loops=1)
Hash Cond: (campaigns.shop_id = shops.id)
-> Seq Scan on campaigns (cost=0.00..6.23 rows=223 width=106) (actual time=0.011..0.150 rows=223 loops=1)
-> Hash (cost=1.08..1.08 rows=4 width=4) (actual time=0.015..0.015 rows=4 loops=1)
-> Seq Scan on shops (cost=0.00..1.08 rows=4 width=4) (actual time=0.010..0.012 rows=4 loops=1)
Filter: is_active
-> Hash (cost=234.37..234.37 rows=5537 width=8) (actual time=3.546..3.546 rows=5537 loops=1)
-> Seq Scan on adverts (cost=0.00..234.37 rows=5537 width=8) (actual time=0.010..2.200 rows=5537 loops=1)
-> Hash (cost=227.06..227.06 rows=14706 width=8) (actual time=6.532..6.532 rows=14706 loops=1)
-> Seq Scan on exports_adverts (cost=0.00..227.06 rows=14706 width=8) (actual time=0.016..3.028 rows=14706 loops=1)
-> Hash (cost=14.85..14.85 rows=685 width=4) (actual time=0.311..0.311 rows=685 loops=1)
-> Seq Scan on exports (cost=0.00..14.85 rows=685 width=4) (actual time=0.014..0.156 rows=685 loops=1)
-> Hash (cost=368.95..368.95 rows=13595 width=4) (actual time=7.281..7.281 rows=13595 loops=1)
-> Seq Scan on blocks (cost=0.00..368.95 rows=13595 width=4) (actual time=0.027..3.990 rows=13595 loops=1)

Splitting the OR into two UNION queries may help (UNION already removes duplicates, so the explicit DISTINCT is no longer needed):
SELECT campaigns.*
FROM campaigns
LEFT JOIN adverts ON campaign_id = campaigns.id
LEFT JOIN shops ON campaigns.shop_id = shops.id
LEFT JOIN exports_adverts ON advert_id = adverts.id
LEFT JOIN exports ON export_id = exports.id
LEFT JOIN rotations ON rotations.advert_id = adverts.id
LEFT JOIN blocks ON block_id = blocks.id
WHERE
(shops.is_active = TRUE)
AND exports.user_id = any(uids)
UNION
SELECT campaigns.*
FROM campaigns
LEFT JOIN adverts ON campaign_id = campaigns.id
LEFT JOIN shops ON campaigns.shop_id = shops.id
LEFT JOIN exports_adverts ON advert_id = adverts.id
LEFT JOIN exports ON export_id = exports.id
LEFT JOIN rotations ON rotations.advert_id = adverts.id
LEFT JOIN blocks ON block_id = blocks.id
WHERE
blocks.user_id = any(uids)
AND campaigns.id = any(ids)
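A further option, not from the answer above but a common rewrite for this shape of query: turn the chains of joins into EXISTS subqueries, so each campaign row is emitted at most once and neither DISTINCT nor UNION deduplication is needed. A sketch, assuming the same tables and the array parameters uids and ids from the question:
-- Sketch only: EXISTS never multiplies campaign rows, so no dedup step is needed
SELECT campaigns.*
FROM campaigns
WHERE (EXISTS (SELECT 1 FROM shops
               WHERE shops.id = campaigns.shop_id AND shops.is_active)
       AND EXISTS (SELECT 1 FROM adverts
                   JOIN exports_adverts ON exports_adverts.advert_id = adverts.id
                   JOIN exports ON exports.id = exports_adverts.export_id
                   WHERE adverts.campaign_id = campaigns.id
                     AND exports.user_id = ANY(uids)))
   OR (campaigns.id = ANY(ids)
       AND EXISTS (SELECT 1 FROM adverts
                   JOIN rotations ON rotations.advert_id = adverts.id
                   JOIN blocks ON blocks.id = rotations.block_id
                   WHERE adverts.campaign_id = campaigns.id
                     AND blocks.user_id = ANY(uids)));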

Related

postgres query optimisation to avoid hash right join

I have this Postgres query where I left join a couple of tables. The query runs for hours and causes issues. When I run EXPLAIN ANALYZE, I see that most of the time is spent in one of the left joins, for which the optimizer selects a hash right join. When I use an inner join instead and run EXPLAIN ANALYZE, the optimizer selects a different plan and the query finishes in minutes.
I have to use a left join because with an inner join some data would be excluded.
How should I rewrite the query to avoid this hash right join?
Many thanks in advance!
Links to query plans are attached above. I am using PostgreSQL 12.11 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
WITH memberships AS (
SELECT customer_sk
, membership_sk
, membership_state
, membership_b2b_type
, membership_sml_type
, membership_start_date
, membership_end_date
, membership_pause_from
, membership_pause_to
, covid_pause_start_date
, covid_pause_end_date
, city_sk AS membership_city_region_sk
, sport_persona_current
, membership_cancellation_reason
, membership_sequence_nr_reverse
, company_sk
, company_name
FROM dwh.fact_membership
WHERE membership_is_urban_sports IS TRUE
),
-- Data preparation
request_cancellation AS (
SELECT membership_sk,
requested_cancellation_last_date
FROM staging.request_cancellation
),
blacklisted_emails AS (
SELECT customer_sk, email, 'blacklisted' AS blacklisted
FROM dwh_userdata.blacklist_emails
),
nonanon_customer AS (
SELECT id
, first_name
, last_name
, email
FROM dwh_userdata.customer
),
nonanon_customer_address_prep AS (
SELECT customer_id
, city
, state
, country
, zip
, row_number() over (partition by customer_id order by created_at desc) as row_number
FROM dwh_userdata.customer_address
),
nonanon_customer_address AS (
SELECT *
FROM nonanon_customer_address_prep
WHERE row_number = 1
),
favorite_sport_category_prep_1 AS (
SELECT membership_sk
, service_top_category_name
, count(DISTINCT booking_sk) as cnt_booking
FROM dwh.report_venue_visitors
WHERE booking_is_valid
GROUP BY 1, 2
),
favorite_sport_category_prep_2 AS (
SELECT membership_sk
, service_top_category_name
, cnt_booking
, row_number()
over (partition by membership_sk order by cnt_booking DESC,service_top_category_name ) AS row_number
FROM favorite_sport_category_prep_1
),
favorite_sport_category AS (
SELECT membership_sk
, service_top_category_name AS favourite_sport_category
, cnt_booking
FROM favorite_sport_category_prep_2
WHERE row_number = 1
),
free_trial AS (
select distinct membership_sk
, customer_sk
, trial_status AS free_trial_status
, trial AS free_trial_length
, trial_start_date AS free_trial_start
, trial_end_date AS free_trial_end
FROM dwh.report_memberships
WHERE trial_status IS NOT NULL
and trial_start_date >= '2020-06-23'
)
-- #### OUTPUT TABLE
SELECT c.customer_sk AS named_user
, CASE WHEN c.gender IN ('M', 'F') THEN c.gender ELSE NULL END AS gender
, nc.first_name
, nc.last_name
, customer_language
, anss.state AS newsletter_status
, dl.city_name AS membership_city_region
, dl.country_code AS membership_country_code
, dl.country_name AS membership_country_name
, dl.admin1 AS membership_administrative_state
, m.membership_sk
, m.membership_state
, m.membership_b2b_type
, m.company_sk
, m.company_name
, m.membership_sml_type
, CASE
WHEN m.membership_start_date IS NOT NULL
THEN CONCAT(TO_CHAR(m.membership_start_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS membership_start_date
, CASE
WHEN m.membership_end_date IS NOT NULL THEN CONCAT(TO_CHAR(m.membership_end_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS membership_end_date
, ft.free_trial_status
, ft.free_trial_length
, CASE
WHEN ft.free_trial_start IS NOT NULL THEN CONCAT(TO_CHAR(ft.free_trial_start, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS free_trial_start
, CASE
WHEN ft.free_trial_end IS NOT NULL THEN CONCAT(TO_CHAR(ft.free_trial_end, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS free_trial_end
, CASE
WHEN m.membership_pause_from IS NOT NULL
THEN CONCAT(TO_CHAR(m.membership_pause_from, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS membership_pause_from
, CASE
WHEN m.membership_pause_to IS NOT NULL THEN CONCAT(TO_CHAR(m.membership_pause_to, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS membership_pause_to
, CASE
WHEN m.covid_pause_start_date IS NOT NULL THEN CONCAT(TO_CHAR(m.covid_pause_start_date, 'YYYY-MM-DD'),
'T00:00:00')
ELSE NULL END AS covid_pause_start_date
, CASE
WHEN m.covid_pause_end_date IS NOT NULL
THEN CONCAT(TO_CHAR(m.covid_pause_end_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS covid_pause_end_date
, CASE
WHEN rc.requested_cancellation_last_date IS NOT NULL THEN CONCAT(
TO_CHAR(rc.requested_cancellation_last_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS requested_cancellation_last_date
, membership_cancellation_reason
, be.blacklisted AS blacklist_email
, fsc.favourite_sport_category AS fav_sports_category
, m.sport_persona_current
, ambd.membership_months_active
, ambd.membership_months_total
, ambd.is_gm1_positive
, ambd.cnt_bookings_total
, ambd.cnt_bookings_last_30_days_total
, ambd.cnt_bookings_last_30_days_onsite
, ambd.cnt_bookings_onsite
, ambd.cnt_bookings_online
, ambd.cnt_bookings_last_30_days_online
, CASE
WHEN ambd.latest_booking_date IS NOT NULL
THEN CONCAT(TO_CHAR(ambd.latest_booking_date, 'YYYY-MM-DD'), 'T00:00:00')
ELSE NULL END AS latest_booking_date
, ambd.avg_bookings_active_month
, ambd.last_checkin_type
, ambd.fav_sports_category_onsite
, ambd.fav_sports_category_online
, ambd.fav_studio_last_30_days
, ambd.fav_studio_group_website
FROM dwh.dim_customer c
LEFT JOIN nonanon_customer nc
ON nc.id = c.customer_sk
LEFT JOIN nonanon_customer_address nca
ON nca.customer_id = customer_sk
LEFT JOIN memberships m
ON c.customer_sk = m.customer_sk
AND membership_sequence_nr_reverse = 1
LEFT JOIN request_cancellation rc
ON m.membership_sk = rc.membership_sk
LEFT JOIN dwh.dim_location dl
ON m.membership_city_region_sk = dl.city_sk
LEFT JOIN blacklisted_emails be
ON be.email = nc.email
LEFT JOIN favorite_sport_category fsc
ON fsc.membership_sk = m.membership_sk
LEFT JOIN staging.airship_newsletter_subscription_status anss
ON anss.customer_id = c.customer_sk
LEFT JOIN free_trial ft
ON ft.customer_sk = m.customer_sk
LEFT JOIN staging.airship_membership_booking_details ambd
ON ambd.membership_sk = m.membership_sk
AND membership_sequence_nr_reverse = 1
WHERE be.blacklisted IS NULL
AND nc.email NOT LIKE '%delete%'
AND nc.email IS NOT NULL
AND ((m.membership_sk IS NULL AND anss.state = 'subscribed') OR membership_state IS NOT NULL)
Results of EXPLAIN ANALYZE:
Hash Left Join (cost=6667580.77..6764370.56 rows=3256 width=692) (actual time=4319030.909..4328353.358 rows=518825 loops=1)
Hash Cond: (fact_membership.customer_sk = ft.customer_sk)
-> Hash Left Join (cost=6663581.42..6759951.96 rows=3256 width=380) (actual time=4318059.369..4324841.032 rows=518825 loops=1)
Hash Cond: (fact_membership.membership_sk = ambd.membership_sk)
Join Filter: (fact_membership.membership_sequence_nr_reverse = 1)
-> Hash Left Join (cost=6655261.78..6748793.03 rows=3256 width=242) (actual time=4317733.942..4323056.862 rows=518825 loops=1)
Hash Cond: (c.customer_sk = anss.customer_id)
Filter: (((fact_membership.membership_sk IS NULL) AND (anss.state = 'subscribed'::text)) OR (fact_membership.membership_state IS NOT NULL))
Rows Removed by Filter: 129098
-> Merge Left Join (cost=6642237.84..6733674.25 rows=3256 width=227) (actual time=4317378.943..4321020.832 rows=647923 loops=1)
Merge Cond: (fact_membership.membership_sk = favorite_sport_category_prep_2.membership_sk)
-> Sort (cost=167496.47..167504.61 rows=3256 width=218) (actual time=4146517.144..4147134.144 rows=647923 loops=1)
Sort Key: fact_membership.membership_sk
Sort Method: external merge Disk: 82352kB
-> Merge Left Join (cost=150681.68..167306.50 rows=3256 width=218) (actual time=4142397.925..4145027.017 rows=647923 loops=1)
Merge Cond: (c.customer_sk = nonanon_customer_address_prep.customer_id)
-> Sort (cost=59476.20..59484.34 rows=3256 width=218) (actual time=4139725.733..4140241.833 rows=647923 loops=1)
Sort Key: c.customer_sk
Sort Method: external merge Disk: 82344kB
-> Hash Right Join (cost=52983.04..59286.23 rows=3256 width=218) (actual time=33403.336..4135281.108 rows=647923 loops=1)
Hash Cond: (request_cancellation.membership_sk = fact_membership.membership_sk)
-> Seq Scan on request_cancellation (cost=0.00..5128.40 rows=308340 width=8) (actual time=1.160..228.691 rows=308340 loops=1)
-> Hash (cost=52942.34..52942.34 rows=3256 width=214) (actual time=30038.787..30048.670 rows=647923 loops=1)
Buckets: 65536 (originally 4096) Batches: 131072 (originally 1) Memory Usage: 10511kB
-> Gather (cost=1064.24..52942.34 rows=3256 width=214) (actual time=11.564..12621.194 rows=647923 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Left Join (cost=64.24..51616.74 rows=1357 width=214) (actual time=5.510..22450.906 rows=215974 loops=3)
Hash Cond: (fact_membership.city_sk = dl.city_sk)
-> Nested Loop Left Join (cost=59.79..51608.59 rows=1357 width=191) (actual time=5.239..22013.464 rows=215974 loops=3)
-> Nested Loop (cost=59.37..50428.72 rows=1357 width=60) (actual time=4.923..6958.191 rows=215974 loops=3)
-> Hash Left Join (cost=58.94..49440.62 rows=1357 width=55) (actual time=3.419..2000.407 rows=215976 loops=3)
Hash Cond: ((customer.email)::text = blacklist_emails.email)
Filter: (('blacklisted'::text) IS NULL)
Rows Removed by Filter: 122
-> Parallel Seq Scan on customer (cost=0.00..46660.28 rows=271334 width=46) (actual time=0.999..1668.668 rows=216091 loops=3)
Filter: ((email IS NOT NULL) AND ((email)::text !~~ '%delete%'::text))
Rows Removed by Filter: 3191
-> Hash (cost=34.53..34.53 rows=1953 width=54) (actual time=2.222..2.226 rows=1953 loops=3)
Buckets: 2048 Batches: 1 Memory Usage: 144kB
-> Seq Scan on blacklist_emails (cost=0.00..34.53 rows=1953 width=54) (actual time=0.263..1.207 rows=1953 loops=3)
-> Index Scan using customer_pk on dim_customer c (cost=0.42..0.73 rows=1 width=13) (actual time=0.020..0.020 rows=1 loops=647929)
Index Cond: (customer_sk = customer.id)
-> Index Scan using dwh_fact_membership_3b307128 on fact_membership (cost=0.42..0.86 rows=1 width=131) (actual time=0.066..0.067 rows=1 loops=647923)
Index Cond: (customer_sk = c.customer_sk)
Filter: ((membership_is_urban_sports IS TRUE) AND (membership_sequence_nr_reverse = 1))
Rows Removed by Filter: 0
-> Hash (cost=3.09..3.09 rows=109 width=35) (actual time=0.148..0.214 rows=109 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 16kB
-> Seq Scan on dim_location dl (cost=0.00..3.09 rows=109 width=35) (actual time=0.031..0.098 rows=109 loops=3)
-> Materialize (cost=91205.48..107807.50 rows=2553 width=4) (actual time=2668.900..3946.682 rows=470415 loops=1)
-> Subquery Scan on nonanon_customer_address_prep (cost=91205.48..107801.12 rows=2553 width=4) (actual time=2666.188..3647.463 rows=470415 loops=1)
Filter: (nonanon_customer_address_prep.row_number = 1)
Rows Removed by Filter: 40218
-> WindowAgg (cost=91205.48..101418.18 rows=510635 width=148) (actual time=2664.902..3526.361 rows=510633 loops=1)
-> Sort (cost=91205.48..92482.07 rows=510635 width=12) (actual time=2664.083..2833.676 rows=510634 loops=1)
Sort Key: customer_address.customer_id, customer_address.created_at DESC
Sort Method: external merge Disk: 13032kB
-> Seq Scan on customer_address (cost=0.00..34063.35 rows=510635 width=12) (actual time=4.596..1522.444 rows=510635 loops=1)
-> Materialize (cost=6474741.37..6566128.10 rows=13051 width=13) (actual time=170857.053..173215.019 rows=465703 loops=1)
-> Subquery Scan on favorite_sport_category_prep_2 (cost=6474741.37..6566095.47 rows=13051 width=13) (actual time=170855.731..173002.743 rows=465703 loops=1)
Filter: (favorite_sport_category_prep_2.row_number = 1)
Rows Removed by Filter: 1343535
-> WindowAgg (cost=6474741.37..6533469.01 rows=2610117 width=29) (actual time=170854.901..172755.674 rows=1809238 loops=1)
-> Sort (cost=6474741.37..6481266.67 rows=2610117 width=21) (actual time=170853.124..171205.257 rows=1809238 loops=1)
Sort Key: report_venue_visitors.membership_sk, (count(DISTINCT report_venue_visitors.booking_sk)) DESC, report_venue_visitors.service_top_category_name
Sort Method: external merge Disk: 63696kB
-> GroupAggregate (cost=5839877.44..6063400.07 rows=2610117 width=21) (actual time=154838.978..169250.761 rows=1809238 loops=1)
Group Key: report_venue_visitors.membership_sk, report_venue_visitors.service_top_category_name
-> Sort (cost=5839877.44..5889232.80 rows=19742146 width=21) (actual time=154835.761..158654.645 rows=19827987 loops=1)
Sort Key: report_venue_visitors.membership_sk, report_venue_visitors.service_top_category_name
Sort Method: external merge Disk: 694120kB
-> Seq Scan on report_venue_visitors (cost=0.00..2233036.56 rows=19742146 width=21) (actual time=1.868..117392.591 rows=19827987 loops=1)
Filter: booking_is_valid
Rows Removed by Filter: 6441170
-> Hash (cost=7199.42..7199.42 rows=317242 width=19) (actual time=352.386..352.386 rows=317242 loops=1)
Buckets: 65536 Batches: 8 Memory Usage: 2606kB
-> Seq Scan on airship_newsletter_subscription_status anss (cost=0.00..7199.42 rows=317242 width=19) (actual time=1.120..154.407 rows=317242 loops=1)
-> Hash (cost=4207.06..4207.06 rows=121006 width=150) (actual time=320.770..320.771 rows=121006 loops=1)
Buckets: 32768 Batches: 8 Memory Usage: 3111kB
-> Seq Scan on airship_membership_booking_details ambd (cost=0.00..4207.06 rows=121006 width=150) (actual time=1.446..107.525 rows=121006 loops=1)
-> Hash (cost=3993.93..3993.93 rows=434 width=26) (actual time=951.259..951.264 rows=26392 loops=1)
Buckets: 32768 (originally 1024) Batches: 1 (originally 1) Memory Usage: 1760kB
-> Subquery Scan on ft (cost=3981.99..3993.93 rows=434 width=26) (actual time=857.944..888.163 rows=26392 loops=1)
-> Unique (cost=3981.99..3989.59 rows=434 width=30) (actual time=857.288..878.098 rows=26392 loops=1)
-> Sort (cost=3981.99..3983.08 rows=434 width=30) (actual time=856.675..863.298 rows=26392 loops=1)
Sort Key: report_memberships.membership_sk, report_memberships.customer_sk, report_memberships.trial_status, report_memberships.trial, report_memberships.trial_start_date, report_memberships.trial_end_date
Sort Method: quicksort Memory: 2830kB
-> Bitmap Heap Scan on report_memberships (cost=2256.96..3962.98 rows=434 width=30) (actual time=102.229..817.152 rows=26392 loops=1)
Recheck Cond: ((trial_start_date >= '2020-06-23'::date) AND (trial_status IS NOT NULL))
Heap Blocks: exact=1383
-> BitmapAnd (cost=2256.96..2256.96 rows=434 width=0) (actual time=99.478..99.479 rows=0 loops=1)
-> Bitmap Index Scan on dwh_report_memberships_bc76fe51 (cost=0.00..578.02 rows=31145 width=0) (actual time=7.497..7.497 rows=26392 loops=1)
Index Cond: (trial_start_date >= '2020-06-23'::date)
-> Bitmap Index Scan on dwh_report_memberships_35525e76 (cost=0.00..1678.48 rows=90406 width=0) (actual time=91.704..91.704 rows=92029 loops=1)
Index Cond: (trial_status IS NOT NULL)
Planning Time: 7.850 ms
Execution Time: 4328700.854 ms
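One observation on the query as posted (an aside, not a verified fix): the WHERE clause requires nc.email IS NOT NULL and nc.email NOT LIKE '%delete%', which discards every row the LEFT JOIN to nonanon_customer would have preserved, so that join is effectively an inner join and can be declared as one, giving the planner more freedom in choosing the join order. A minimal illustration of the equivalence, using the base table behind the CTE:
-- The NULL-rejecting filters on nc.email mean no unmatched row can survive,
-- so this LEFT JOIN is equivalent to an INNER JOIN:
SELECT c.customer_sk, nc.email
FROM dwh.dim_customer c
JOIN dwh_userdata.customer nc ON nc.id = c.customer_sk  -- was: LEFT JOIN
WHERE nc.email IS NOT NULL
  AND nc.email NOT LIKE '%delete%';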

Is it possible to speed up this query by adding more indexes?

I have the following SQL query:
EXPLAIN ANALYZE
SELECT
    full_address,
    street_address,
    street.street,
    (SELECT city FROM city WHERE city.id = property.city_id) AS city,
    (SELECT state_code FROM state WHERE id = property.state_id) AS state_code,
    (SELECT zipcode FROM zipcode WHERE zipcode.id = property.zipcode_id) AS zipcode
FROM property
INNER JOIN street ON street.id = property.street_id
WHERE street.street = 'W San Miguel Ave'
  AND property.zipcode_id = (SELECT id FROM zipcode WHERE zipcode = '85340')
Below is the EXPLAIN ANALYZE results
Gather (cost=1008.86..226541.68 rows=1 width=161) (actual time=59.311..21956.143 rows=184 loops=1)
Workers Planned: 2
Params Evaluated: $3
Workers Launched: 2
InitPlan 4 (returns $3)
-> Index Scan using zipcode_zipcode_county_id_state_id_index on zipcode zipcode_1 (cost=0.28..8.30 rows=1 width=16) (actual time=0.039..0.040 rows=1 loops=1)
Index Cond: (zipcode = '85340'::citext)
-> Nested Loop (cost=0.56..225508.35 rows=1 width=113) (actual time=7430.172..14723.451 rows=61 loops=3)
-> Parallel Seq Scan on street (cost=0.00..13681.63 rows=1 width=28) (actual time=108.023..108.053 rows=1 loops=3)
Filter: (street = 'W San Miguel Ave'::citext)
Rows Removed by Filter: 99131
-> Index Scan using property_street_address_street_id_city_id_state_id_zipcode_id_c on property (cost=0.56..211826.71 rows=1 width=117) (actual time=10983.195..21923.063 rows=92 loops=2)
Index Cond: ((street_id = street.id) AND (zipcode_id = $3))
SubPlan 1
-> Index Scan using city_id_pk on city (cost=0.28..8.30 rows=1 width=9) (actual time=0.003..0.003 rows=1 loops=184)
Index Cond: (id = property.city_id)
SubPlan 2
-> Index Scan using state_id_pk on state (cost=0.27..8.34 rows=1 width=3) (actual time=0.002..0.002 rows=1 loops=184)
Index Cond: (id = property.state_id)
SubPlan 3
-> Index Scan using zipcode_id_pk on zipcode (cost=0.28..8.30 rows=1 width=6) (actual time=0.002..0.003 rows=1 loops=184)
Index Cond: (id = property.zipcode_id)
Planning Time: 1.228 ms
Execution Time: 21956.246 ms
Is it possible to speed up this query by adding more indexes?
The query can be rewritten using joins rather than subselects. This may be faster and easier to index.
SELECT
    full_address,
    street_address,
    street.street,
    city.city AS city,
    state.state_code AS state_code,
    zipcode.zipcode AS zipcode
FROM property
INNER JOIN street ON street.id = property.street_id
INNER JOIN city ON city.id = property.city_id
INNER JOIN state ON state.id = property.state_id
INNER JOIN zipcode ON zipcode.id = property.zipcode_id
WHERE street.street = 'W San Miguel Ave'
  AND zipcode.zipcode = '85340'
Assuming all the foreign keys (property.street_id, property.city_id, etc.) are indexed, this now becomes a search on street.street and zipcode.zipcode. As long as those columns are indexed, the query should take milliseconds.
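For reference, a sketch of the supporting index DDL (index names are invented; the foreign-key indexes may already exist):
CREATE INDEX IF NOT EXISTS street_street_idx ON street (street);
CREATE INDEX IF NOT EXISTS zipcode_zipcode_idx ON zipcode (zipcode);
-- foreign-key columns used by the joins:
CREATE INDEX IF NOT EXISTS property_street_id_idx ON property (street_id);
CREATE INDEX IF NOT EXISTS property_city_id_idx ON property (city_id);
CREATE INDEX IF NOT EXISTS property_state_id_idx ON property (state_id);
CREATE INDEX IF NOT EXISTS property_zipcode_id_idx ON property (zipcode_id);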

PostgreSQL table indexing

I want to index my tables for the following query:
select
t.*
from main_transaction t
left join main_profile profile on profile.id = t.profile_id
left join main_customer customer on (customer.id = profile.user_id)
where
(upper(t.request_no) like upper('%'||#requestNumber||'%') or upper(customer.phone) like upper('%'||#phoneNumber||'%'))
and t.service_type = 'SERVICE_1'
and t.status = 'SUCCESS'
and t.mode = 'AUTO'
and t.transaction_type = 'WITHDRAW'
and customer.client = 'corp'
and t.pub_date>='2018-09-05' and t.pub_date<='2018-11-05'
order by t.pub_date desc, t.id asc
LIMIT 1000;
This is how I tried to index my tables:
CREATE INDEX main_transaction_pr_id ON main_transaction (profile_id);
CREATE INDEX main_profile_user_id ON main_profile (user_id);
CREATE INDEX main_customer_client ON main_customer (client);
CREATE INDEX main_transaction_gin_req_no ON main_transaction USING gin (upper(request_no) gin_trgm_ops);
CREATE INDEX main_customer_gin_phone ON main_customer USING gin (upper(phone) gin_trgm_ops);
CREATE INDEX main_transaction_general ON main_transaction (service_type, status, mode, transaction_type); -- don't know if this one is right!
After indexing as above, my query still spends over 4.5 seconds just to select 1000 rows!
I am selecting from the following table, which has 34 columns including 3 FOREIGN KEYs and over 3 million rows:
CREATE TABLE main_transaction (
id integer NOT NULL DEFAULT nextval('main_transaction_id_seq'::regclass),
description character varying(255) NOT NULL,
request_no character varying(18),
account character varying(50),
service_type character varying(50),
pub_date timestamptz(6) NOT NULL,
service_id varchar(50) COLLATE "pg_catalog"."default",
....
);
I am also joining two tables (main_profile, main_customer) to search customer.phone and to select customer.client. To get from the main_transaction table to the main_customer table, I can only go through main_profile.
My question is: how can I index my tables to increase performance for the above query?
Please do not suggest UNION as a replacement for the OR condition (upper(t.request_no) like upper('%'||#requestNumber||'%') or upper(customer.phone) like upper('%'||#phoneNumber||'%')); could we use a CASE WHEN condition instead? I have to convert my PostgreSQL query into Hibernate JPA, and I don't know how to express UNION there except with Hibernate native SQL, which I am not allowed to use.
Explain:
Limit (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.381 rows=1 loops=1)
-> Sort (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.380 rows=1 loops=1)
Sort Key: t.pub_date DESC, t.id
Sort Method: quicksort Memory: 27kB
-> Hash Join (cost=20817.10..411600.73 rows=38 width=1906) (actual time=3214.473..3885.369 rows=1 loops=1)
Hash Cond: (t.profile_id = profile.id)
Join Filter: ((upper((t.request_no)::text) ~~ '%20181104-2158-2723948%'::text) OR (upper((customer.phone)::text) ~~ '%20181104-2158-2723948%'::text))
Rows Removed by Join Filter: 593118
-> Seq Scan on main_transaction t (cost=0.00..288212.28 rows=205572 width=1906) (actual time=0.068..1527.677 rows=593119 loops=1)
Filter: ((pub_date >= '2016-09-05 00:00:00+05'::timestamp with time zone) AND (pub_date <= '2018-11-05 00:00:00+05'::timestamp with time zone) AND ((service_type)::text = 'SERVICE_1'::text) AND ((status)::text = 'SUCCESS'::text) AND ((mode)::text = 'AUTO'::text) AND ((transaction_type)::text = 'WITHDRAW'::text))
Rows Removed by Filter: 2132732
-> Hash (cost=17670.80..17670.80 rows=180984 width=16) (actual time=211.211..211.211 rows=181516 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 3166kB
-> Hash Join (cost=6936.09..17670.80 rows=180984 width=16) (actual time=46.846..183.689 rows=181516 loops=1)
Hash Cond: (customer.id = profile.user_id)
-> Seq Scan on main_customer customer (cost=0.00..5699.73 rows=181106 width=16) (actual time=0.013..40.866 rows=181618 loops=1)
Filter: ((client)::text = 'corp'::text)
Rows Removed by Filter: 16920
-> Hash (cost=3680.04..3680.04 rows=198404 width=8) (actual time=46.087..46.087 rows=198404 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 2966kB
-> Seq Scan on main_profile profile (cost=0.00..3680.04 rows=198404 width=8) (actual time=0.008..20.099 rows=198404 loops=1)
Planning time: 0.757 ms
Execution time: 3885.680 ms
With the restriction to not use UNION, you won't get a good plan.
You can slightly speed up processing with the following indexes:
main_transaction ((service_type::text), (status::text), (mode::text),
(transaction_type::text), pub_date)
main_customer ((client::text))
These should at least get rid of the sequential scans, but the hash join that takes the lion's share of the processing time will remain.
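Spelled out as DDL, the suggested indexes would look roughly like this (index names are invented; the parenthesized casts become expression columns):
CREATE INDEX main_transaction_filter_idx
    ON main_transaction ((service_type::text), (status::text), (mode::text),
                         (transaction_type::text), pub_date);
CREATE INDEX main_customer_client_text_idx
    ON main_customer ((client::text));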

Postgres Query not selecting index for a column with OR condition

I have a query where Postgres performs a hash join with a sequential scan instead of an index scan with a nested loop when I use an OR condition. This causes the query to take 2 seconds instead of completing in under 100 ms. I have run VACUUM ANALYZE and have rebuilt the index on the PATIENTCHARTNOTE table (which is about 5 GB), but it is still using the hash join. Do you have any suggestions on how I can improve this?
explain analyze
SELECT Count (_pcn.id) AS total_open_note
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND ( _pt.assigned_to_user_id = '136964'
OR _pcn.createdby_id = '136964'
);
Aggregate (cost=237655.59..237655.60 rows=1 width=8) (actual time=1602.069..1602.069 rows=1 loops=1)
-> Hash Join (cost=83095.43..237645.30 rows=4117 width=4) (actual time=944.850..1602.014 rows=241 loops=1)
Hash Cond: (_appt.patient_id = _pt.id)
Join Filter: ((_pt.assigned_to_user_id = 136964) OR (_pcn.createdby_id = 136964))
Rows Removed by Join Filter: 94036
-> Hash Join (cost=46650.68..182243.64 rows=556034 width=12) (actual time=415.862..1163.812 rows=94457 loops=1)
Hash Cond: (_pcn.appointment_id = _appt.id)
-> Seq Scan on patientchartnote _pcn (cost=0.00..112794.20 rows=1073978 width=12) (actual time=0.016..423.262 rows=1073618 loops=1)
Filter: (active AND (title IS NOT NULL) AND ((title)::text <> ''::text))
Rows Removed by Filter: 22488
-> Hash (cost=35223.61..35223.61 rows=696486 width=8) (actual time=414.749..414.749 rows=692839 loops=1)
Buckets: 131072 Batches: 16 Memory Usage: 2732kB
-> Seq Scan on appointment _appt (cost=0.00..35223.61 rows=696486 width=8) (actual time=0.010..271.208 rows=692839 loops=1)
Filter: (datecomplete IS NULL)
Rows Removed by Filter: 652426
-> Hash (cost=24698.57..24698.57 rows=675694 width=12) (actual time=351.566..351.566 rows=674929 loops=1)
Buckets: 131072 Batches: 16 Memory Usage: 2737kB
-> Seq Scan on patient _pt (cost=0.00..24698.57 rows=675694 width=12) (actual time=0.013..197.268 rows=674929 loops=1)
Filter: active
Rows Removed by Filter: 17426
Planning time: 1.533 ms
Execution time: 1602.715 ms
When I replace "OR _pcn.createdby_id = '136964'" with "AND _pcn.createdby_id = '136964'", Postgres performs an index scan:
Aggregate (cost=29167.56..29167.57 rows=1 width=8) (actual time=937.743..937.743 rows=1 loops=1)
-> Nested Loop (cost=1.28..29167.55 rows=7 width=4) (actual time=19.136..937.669 rows=37 loops=1)
-> Nested Loop (cost=0.85..27393.03 rows=1654 width=4) (actual time=2.154..910.250 rows=1649 loops=1)
-> Index Scan using patient_activeassigned_idx on patient _pt (cost=0.42..3075.00 rows=1644 width=8) (actual time=1.599..11.820 rows=1627 loops=1)
Index Cond: ((active = true) AND (assigned_to_user_id = 136964))
Filter: active
-> Index Scan using appointment_datepatient_idx on appointment _appt (cost=0.43..14.75 rows=4 width=8) (actual time=0.543..0.550 rows=1 loops=1627)
Index Cond: ((patient_id = _pt.id) AND (datecomplete IS NULL))
-> Index Scan using patientchartnote_activeappointment_idx on patientchartnote _pcn (cost=0.43..1.06 rows=1 width=8) (actual time=0.014..0.014 rows=0 loops=1649)
Index Cond: ((active = true) AND (createdby_id = 136964) AND (appointment_id = _appt.id) AND (title IS NOT NULL))
Filter: (active AND ((title)::text <> ''::text))
Planning time: 1.489 ms
Execution time: 937.910 ms
(13 rows)
Using OR in SQL queries usually results in bad performance.
That is because, unlike AND, it does not restrict but extends the number of rows in the query result. With AND, you can use an index scan for one part of the condition and further restrict the result set with a filter on the second condition. That is not possible with OR.
So PostgreSQL does the only thing left: it computes the whole join and then filters out all rows that do not match the condition. Of course that is very inefficient when you are joining three tables (I didn't count the outer join).
Assuming that all columns called id are primary keys, you could rewrite the query as follows:
SELECT count(*) FROM
(SELECT _pcn.id
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND _pt.assigned_to_user_id = '136964'
UNION
SELECT _pcn.id
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND _pcn.createdby_id = '136964'
) q;
Even though this is running the query twice, indexes can be used to filter out all but a few rows early on, so this query should perform better.
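The second plan in the question suggests suitable indexes already exist (patient_activeassigned_idx, patientchartnote_activeappointment_idx). If they did not, partial indexes along these lines would let each UNION branch be driven by an index (names and column choices are assumptions, not taken from the question):
CREATE INDEX patient_assigned_partial_idx
    ON patient (assigned_to_user_id) WHERE active;
CREATE INDEX patientchartnote_createdby_partial_idx
    ON patientchartnote (createdby_id, appointment_id) WHERE active;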

postgresql doesn't use index for primary key = foreign key

I have 3 main tables:
ts_entity(id,short_name,name,type_id)
ts_entry_entity(id,entity_id,entry_id)
ts_entry(id, ... other columns ...)
All the id columns are UUIDs and have a B-tree index.
ts_entry_entity.entity_id is a foreign key to ts_entity.id and has a B-tree index.
ts_entry_entity.entry_id is also a foreign key and likewise has a B-tree index.
I have one SQL statement, like:
select ts_entity.id,ts_entity.short_name,ts_entity.name,ts_entry.id, ... ts_entry.otherColumns ...
from ts_entity,ts_entry_entity,ts_entry
where ts_entity.id=ts_entry_entity.entity_id
and ts_entry_entity.entry_id=ts_entry.id
and ... ts_entry.otherColumns='xxx' ...
order by ts_entity.short_name
limit 100 offset 0
And here comes the weird thing: the join condition ts_entry_entity.entity_id = ts_entity.id doesn't use any index, and the query takes about 50 seconds.
There is no WHERE condition on the ts_entity table.
My question: why does ts_entry_entity.entity_id = ts_entity.id not use an index? Why does it take so much time? How can I optimize the SQL?
Below is the EXPLAIN ANALYZE result.
Limit (cost=235455.31..235455.41 rows=1 width=1808) (actual time=54590.304..54590.781 rows=100 loops=1)
-> Unique (cost=235455.31..235455.41 rows=1 width=1808) (actual time=54590.301..54590.666 rows=100 loops=1)
-> Sort (cost=235455.31..235455.32 rows=1 width=1808) (actual time=54590.297..54590.410 rows=100 loops=1)
Sort Key: ts_entity.short_name, ts_entity.id, ts_entity.name, ts_entry_version.display_date, ts_entry.id, (formatdate(totimestamp(ts_entry_version.display_date, '-5'::character varying), 'MM/DD/YYYY'::character varying)), ts_entry_version.submitted_date, (formatdate(totimestamp(ts_entry_version.submitted_date, '-5'::character varying), 'MM/DD/YYYY'::character varying)), ts_entry_type.name, (get_priority((ts_entry_version.priority)::integer)), ts_entry_version.priority, (get_sentiment((ts_entry_version.sentiment)::integer)), ts_entry_version.sentiment, (getdisplayvalue((ts_entry_version.source_id)::character varying, 0, ', '::character varying)), ts_entry_version.source_id, (NULLIF((ts_entry_version.title)::text, ''::text)), ts_entry.submitted_date, (formatdate(totimestamp(ts_entry.submitted_date, '-5'::character varying), 'MM/DD/YYYY'::character varying)), (getdisplayvalue((ts_entry_version.submitter_id)::character varying, 0, ', '::character varying)), ts_entry_version.submitter_id, entryadhoc_o9e2c9f871634dd3aeafe9bdced2e34f.owner_id, (getdisplayvalue(toentityid((entryadhoc_o9e2c9f871634dd3aeafe9bdced2e34f.value)::character varying, '23f03fe70a16aed0d7e210357164e401'::character varying), 0, ', '::character varying)), (toentityid((entryadhoc_o9e2c9f871634dd3aeafe9bdced2e34f.value)::character varying, '23f03fe70a16aed0d7e210357164e401'::character varying)), entryadhoc_td66ad96a9ab472db3cf1279b65baa69.owner_id, (totimestamp((entryadhoc_td66ad96a9ab472db3cf1279b65baa69.value)::character varying, '-5'::character varying)), (formatdate(totimestamp((entryadhoc_td66ad96a9ab472db3cf1279b65baa69.value)::character varying, '-5'::character varying), 'MM/DD/YYYY'::character varying)), entryadhoc_z3757638d8d64373ad835c3523a6a70b.owner_id, (totimestamp((entryadhoc_z3757638d8d64373ad835c3523a6a70b.value)::character varying, '-5'::character varying)), (formatdate(totimestamp((entryadhoc_z3757638d8d64373ad835c3523a6a70b.value)::character varying, '-5'::character varying), 'MM/DD/YYYY'::character varying)), entryadhoc_i0f819c1244b427794a83767eaa68e73.owner_id, (totimestamp((entryadhoc_i0f819c1244b427794a83767eaa68e73.value)::character varying, '-5'::character varying)), (formatdate(totimestamp((entryadhoc_i0f819c1244b427794a83767eaa68e73.value)::character varying, '-5'::character varying), 'MM/DD/YYYY'::character varying)), entryadhoc_i7f5d5035cac421daa9879c1e21ec63f.owner_id, (getdisplayvalue(toentityid((entryadhoc_i7f5d5035cac421daa9879c1e21ec63f.value)::character varying, '23f03fe70a16aed0d7e210357164e401'::character varying), 0, ', '::character varying)), (toentityid((entryadhoc_i7f5d5035cac421daa9879c1e21ec63f.value)::character varying, '23f03fe70a16aed0d7e210357164e401'::character varying)), entryadhoc_v7f9c1146ee24742a73b83526dc66df7.owner_id, (NULLIF(entryadhoc_v7f9c1146ee24742a73b83526dc66df7.value, ''::text))
Sort Method: external merge Disk: 3360kB
-> Nested Loop (cost=22979.01..235455.30 rows=1 width=1808) (actual time=94.889..54532.919 rows=2846 loops=1)
Join Filter: (ts_entry_entity.entity_id = ts_entity.id)
Rows Removed by Join Filter: 34363583
-> Nested Loop (cost=22979.01..234676.15 rows=1 width=987) (actual time=78.801..2914.864 rows=2846 loops=1)
-> Nested Loop Anti Join (cost=22978.59..234675.43 rows=1 width=987) (actual time=78.776..2867.254 rows=2846 loops=1)
-> Hash Join (cost=22978.17..63457.52 rows=258 width=987) (actual time=78.614..2573.586 rows=2846 loops=1)
Hash Cond: (ts_entry.current_version_id = ts_entry_version.id)
-> Hash Left Join (cost=19831.38..59727.56 rows=154823 width=383) (actual time=47.558..2391.088 rows=155061 loops=1)
Hash Cond: (ts_entry.id = entryadhoc_v7f9c1146ee24742a73b83526dc66df7.owner_id)
-> Hash Left Join (cost=16526.15..54467.69 rows=154823 width=337) (actual time=38.534..2138.354 rows=155061 loops=1)
Hash Cond: (ts_entry.id = entryadhoc_i7f5d5035cac421daa9879c1e21ec63f.owner_id)
-> Hash Left Join (cost=13220.92..49207.82 rows=154823 width=291) (actual time=30.462..1888.735 rows=155061 loops=1)
Hash Cond: (ts_entry.id = entryadhoc_i0f819c1244b427794a83767eaa68e73.owner_id)
-> Hash Left Join (cost=9915.69..43947.95 rows=154823 width=245) (actual time=22.268..1640.688 rows=155061 loops=1)
Hash Cond: (ts_entry.id = entryadhoc_z3757638d8d64373ad835c3523a6a70b.owner_id)
-> Hash Left Join (cost=6610.46..38688.08 rows=154823 width=199) (actual time=19.612..1409.457 rows=155061 loops=1)
Hash Cond: (ts_entry.id = entryadhoc_td66ad96a9ab472db3cf1279b65baa69.owner_id)
-> Hash Left Join (cost=3305.23..33428.21 rows=154823 width=153) (actual time=12.431..1161.689 rows=155061 loops=1)
Hash Cond: (ts_entry.id = entryadhoc_o9e2c9f871634dd3aeafe9bdced2e34f.owner_id)
-> Seq Scan on ts_entry (cost=0.00..28168.34 rows=154823 width=107) (actual time=0.101..898.818 rows=155061 loops=1)
Filter: ((NOT is_draft) AND (class <> 2))
Rows Removed by Filter: 236596
-> Hash (cost=3292.29..3292.29 rows=1035 width=46) (actual time=12.304..12.304 rows=2846 loops=1)
Buckets: 4096 (originally 2048) Batches: 1 (originally 1) Memory Usage: 305kB
-> Bitmap Heap Scan on ts_attribute entryadhoc_o9e2c9f871634dd3aeafe9bdced2e34f (cost=40.45..3292.29 rows=1035 width=46) (actual time=1.191..9.030 rows=2846 loops=1)
Recheck Cond: (def_id = 'b4e9878722eb409c9fdfff3fdba582a3'::bpchar)
More details about tables:
ts_entity(id,short_name,name,type_id)
ts_entry_entity(id,entity_id,entry_id)
ts_entry(id,version_id)
ts_entry_version(id,entry_id,submitted_date,title,submitter)
ts_attribute(id,attribute_definition_id,entry_id,value)
ts_attribute_definition(id,name)
As you can see, ts_entry_version saves all versions of an entry. ts_attribute holds the extendable columns of an entry.
More details about the SQL:
We have several filters on ts_entry_version columns and on ts_attribute.value.
ts_attribute.value is varchar, but the content may be a time in milliseconds, a normal string value, or one or several id values.
The structure of the SQL is like below:
select ts_entity.short_name, ts_entry_version.title, ts_attribute.value
from ts_entity, ts_entry_entity, ts_entry
left join ts_attribute on ts_entry.id = ts_attribute.entry_id
    and ts_attribute.attribute_definition_id = 'xxx'
where ts_entity.id = ts_entry_entity.entity_id
and ts_entry_entity.entry_id = ts_entry.id
and ts_entry.version_id = ts_entry_version.id
and ts_entry_version.title like '%xxx%'
order by ts_entity.short_name asc
limit 100 offset 0
I found a clue in the PostgreSQL official documentation: https://www.postgresql.org/docs/current/static/runtime-config-query.html
After changing that configuration, the query optimizer prefers to use indexes.
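The settings on that page that most directly influence this choice are the planner cost constants and the enable_* switches. An illustrative example (values are not recommendations, and enable_seqscan = off is for diagnosis only):
-- make random index access look cheaper, e.g. on SSD storage:
SET random_page_cost = 1.1;
-- diagnosis only: forbid sequential scans to check whether the planner
-- can use the index at all:
SET enable_seqscan = off;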