PostgreSQL - weird query planner behavior

Assume I have a query like this:
SELECT *
FROM clients c
INNER JOIN clients_balances cb ON cb.id_clients = c.id
LEFT JOIN clients com ON com.id = c.id_companies
LEFT JOIN clients com_real ON com_real.id = c.id_companies_real
LEFT JOIN rate_tables rt_orig ON rt_orig.id = c.orig_rate_table
LEFT JOIN rate_tables rt_term ON rt_term.id = c.term_rate_table
LEFT JOIN payment_terms pt ON pt.id = c.id_payment_terms
LEFT JOIN paygw_clients_profiles cpgw ON (cpgw.id_clients = c.id AND cpgw.id_companies = c.id_companies_real)
WHERE
EXISTS (SELECT * FROM accounts WHERE (name LIKE 'x' OR accname LIKE 'x' OR ani LIKE 'x') AND id_clients = c.id)
AND c."type" = '0'
AND c."id" > 0
ORDER BY c."name";
This query takes around 35 seconds to run when used in the production environment ("clients" has about 1 million records). However, if I take out ANY join - the query will take only about 300 ms to execute.
I've played around with the query planner settings, but to no avail.
Here are a few explain analyze outputs:
http://explain.depesz.com/s/hzy (slow - 48049.574 ms)
http://explain.depesz.com/s/FWCd (fast - 286.234 ms, rate_tables JOIN removed)
http://explain.depesz.com/s/MyRf (fast - 539.733 ms, paygw_clients_profiles JOIN removed)
It looks like in the fast case the planner starts from the EXISTS statement and has to perform the joins for only two rows in total. However, in the slow case it first joins all the tables and only then filters by EXISTS.
What I need is to make this query run in a reasonable time with all seven joins in place.
Postgres version is 9.3.10 on CentOS 6.3.
Thanks.
UPDATE
Rewriting the query like this:
SELECT *
FROM clients c
INNER JOIN clients_balances cb ON cb.id_clients = c.id
INNER JOIN accounts a ON a.id_clients = c.id AND (a.name = 'x' OR a.accname = 'x' OR a.ani = 'x')
LEFT JOIN clients com ON com.id = c.id_companies
LEFT JOIN clients com_real ON com_real.id = c.id_companies_real
LEFT JOIN rate_tables rt_orig ON rt_orig.id = c.orig_rate_table
LEFT JOIN rate_tables rt_term ON rt_term.id = c.term_rate_table
LEFT JOIN payment_terms pt ON pt.id = c.id_payment_terms
LEFT JOIN paygw_clients_profiles cpgw ON (cpgw.id_clients = c.id AND cpgw.id_companies = c.id_companies_real)
WHERE
c."type" = '0' AND c.id > 0
ORDER BY c."name";
This makes it run fast. However, it is not acceptable: the account filter parameters are optional, and I still need the result if there are no matches in that table. Using "LEFT JOIN accounts" instead of "INNER JOIN accounts" kills the performance again.

As suggested by Tom Lane, I raised two parameters, join_collapse_limit and from_collapse_limit, from the default of 8 to 10, and this solved the issue.

Related

SSMS Query slow join

Tried to join three tables: car, car_and_engine, and engine. The second table, car_and_engine, connects the cars and their engines. A car type has up to three possible engine types. The query is significantly slower than expected (based on experience with similar operations in other languages). Is there anything terribly inefficient about this code?
select engine_type, AVG(horsepower) into #horsepower_by_engine_type
from TRANSPORT.dbo.engine
group by engine_type
go
with temp as(select * from TRANSPORT.dbo.car left join TRANSPORT.dbo.car_and_engine on TRANSPORT.dbo.car_and_engine.car_type_y = TRANSPORT.dbo.car.car_type_x)
select * from temp left join #horsepower_by_engine_type as e1 on temp.engine_type_1 = e1.engine_type
left join #horsepower_by_engine_type as e2 on temp.engine_type_2 = e2.engine_type
left join #horsepower_by_engine_type as e3 on temp.engine_type_3 = e3.engine_type
You don't really need a temp table (except when you are doing some diagnostics). You could replace your temp table syntax with an inline-view.
with temp as(select * from TRANSPORT.dbo.car left join TRANSPORT.dbo.car_and_engine on TRANSPORT.dbo.car_and_engine.car_type_y = TRANSPORT.dbo.car.car_type_x)
-- each aggregate needs an alias so the derived table has a named column
select * from temp left join
(select engine_type, AVG(horsepower) as avg_horsepower
from TRANSPORT.dbo.engine
group by engine_type) as e1 on temp.engine_type_1 = e1.engine_type
left join
(select engine_type, AVG(horsepower) as avg_horsepower
from TRANSPORT.dbo.engine
group by engine_type) as e2 on temp.engine_type_2 = e2.engine_type
left join
(select engine_type, AVG(horsepower) as avg_horsepower
from TRANSPORT.dbo.engine
group by engine_type) as e3 on temp.engine_type_3 = e3.engine_type
Better still, you could put your summary into your CTE
with temp as (select * from TRANSPORT.dbo.car left join TRANSPORT.dbo.car_and_engine on TRANSPORT.dbo.car_and_engine.car_type_y = TRANSPORT.dbo.car.car_type_x),
avgHP as (select engine_type, AVG(horsepower) as avg_horsepower from TRANSPORT.dbo.engine group by engine_type)
select * from temp left join avgHP as e1 on temp.engine_type_1 = e1.engine_type
left join avgHP as e2 on temp.engine_type_2 = e2.engine_type
left join avgHP as e3 on temp.engine_type_3 = e3.engine_type
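The idea of defining the summary once and joining it three times can be tried out in isolation. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table contents and the avg_horsepower alias are made up for illustration.

```python
import sqlite3

# In-memory database with hypothetical sample data (not from the question).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE engine (engine_type TEXT, horsepower REAL);
CREATE TABLE car_and_engine (
    car_type TEXT, engine_type_1 TEXT, engine_type_2 TEXT, engine_type_3 TEXT
);
INSERT INTO engine VALUES ('V6', 200), ('V6', 300), ('I4', 150);
INSERT INTO car_and_engine VALUES ('sedan', 'V6', 'I4', NULL);
""")

# The summary is written once as a CTE and joined once per engine slot.
rows = conn.execute("""
WITH avgHP AS (
    SELECT engine_type, AVG(horsepower) AS avg_horsepower
    FROM engine
    GROUP BY engine_type
)
SELECT ce.car_type, e1.avg_horsepower, e2.avg_horsepower, e3.avg_horsepower
FROM car_and_engine AS ce
LEFT JOIN avgHP AS e1 ON ce.engine_type_1 = e1.engine_type
LEFT JOIN avgHP AS e2 ON ce.engine_type_2 = e2.engine_type
LEFT JOIN avgHP AS e3 ON ce.engine_type_3 = e3.engine_type
""").fetchall()

print(rows)
```

The car with no third engine still appears, with NULL (None in Python) in the third average column. Whether the CTE form is actually faster than the temp table on SQL Server depends on the data and the plan, so it is worth comparing both.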

Long running query hangs application despite multiple cores

Our server has 8 cores and is running a web application (DHIS2) which uses Postgres as its database.
There is a big select query which takes a few hours to execute. (The query is executed from the terminal.)
When that query is run, the CPU utilization of that query's process jumps to a constant 100%.
This hangs the application, and the application's page does not even load in the browser. This must be because other Postgres processes are waiting for that query's process to complete.
BUT, when we have multiple cores in the machine, why should 100% utilization of a single core stop the rest of the processes from executing?
I am unable to understand the concept of multiple CPUs and cores in this context. Does Postgres not recognize them? What can be the dependency of a select query on another query?
Could somebody please explain this behaviour and suggest ways to manage the execution of big queries, perhaps through some kind of Postgres configuration?
Postgres Version - 9.6
OS - Ubuntu 16
Database Size - 200GB on disk
DHIS2 Version - 2.30
Query (calculates outliers) -
select datasets,
max(ou1.name) Country, ou1.organisationunitid as Country__Id ,
max(ou2.name) state, ou2.organisationunitid as State__Id ,
max(ou3.name) Division, ou3.organisationunitid as Division__Id ,
max(ou4.name) District, max(ou4.code) as District__Code, ou4.organisationunitid as District__Id,
max(ou5.name) Block, max(ou5.code) as Block__Code, ou5.organisationunitid as Block__Id,
max(ou6.name) Facility, max(ou6.code) as Facility__Code,ou6.organisationunitid as Facility__Id,
max(ou.name) as outlierfacility,
max(de.name) as dataelement,
max(coc.name) as category,
concat(max(p.startdate),':',max(p.enddate)) as period,
max(pt.name) as frequency,
_dv.value,
u upperbound,
l lowerbound,
mean,
std
from
(
with stats as (
select dv.sourceid,
dv.dataelementid,
dv.categoryoptioncomboid,
dv.attributeoptioncomboid,
array_agg(distinct dv.periodid) as periods,
array_agg(distinct ds.name) as datasets,
avg(dv.value::float) as mean,
stddev(dv.value::float) as std
from datavalue dv
inner join datasetmembers dsm on dsm.dataelementid = dv.dataelementid
inner join dataelement de on de.dataelementid = dsm.dataelementid
inner join dataset ds on ds.datasetid = dsm.datasetid
inner join period pe on pe.periodid = dv.periodid
inner join periodtype pt on pt.periodtypeid = pe.periodtypeid
inner join categoryoptioncombo coc on dv.categoryoptioncomboid = coc.categoryoptioncomboid
inner join _orgunitstructure ous on ous.organisationunitid = dv.sourceid
where pe.startdate between date('2019-04-29') - interval '6 months' and date('2019-04-29') and pt.name='Monthly'
and de.valueType in ('NUMBER','INTEGER')
and ds.uid in ('123qwe123','123ewq123')
group by dv.sourceid,dv.dataelementid,dv.categoryoptioncomboid,dv.attributeoptioncomboid
)
select dv.*,datasets,mean,std,mean+3*std u,mean-3*std l
from datavalue dv
inner join period pe on pe.periodid = dv.periodid
inner join periodtype pt on pt.periodtypeid = pe.periodtypeid
inner join stats on
stats.dataelementid = dv.dataelementid and
stats.sourceid= dv.sourceid and
stats.categoryoptioncomboid = dv.categoryoptioncomboid and
stats.attributeoptioncomboid = dv.attributeoptioncomboid
where dv.periodid = any(periods)
and (dv.value::float > mean+3*std or dv.value::float < mean-3*std)
) _dv
inner join dataelement de on _dv.dataelementid = de.dataelementid
inner join categoryoptioncombo coc on _dv.categoryoptioncomboid = coc.categoryoptioncomboid
inner join _orgunitstructure ous on _dv.sourceid = ous.organisationunitid
inner join organisationunit ou on ou.organisationunitid = ous.organisationunitid
left join organisationunit ou1 on ou1.organisationunitid = ous.idlevel1
left join organisationunit ou2 on ou2.organisationunitid = ous.idlevel2
left join organisationunit ou3 on ou3.organisationunitid = ous.idlevel3
left join organisationunit ou4 on ou4.organisationunitid = ous.idlevel4
left join organisationunit ou5 on ou5.organisationunitid = ous.idlevel5
left join organisationunit ou6 on ou6.organisationunitid = ous.idlevel6
inner join period p on _dv.periodid = p.periodid
inner join periodtype pt on p.periodtypeid = pt.periodtypeid
group by ou1.organisationunitid,
ou2.organisationunitid,
ou3.organisationunitid,
ou4.organisationunitid,
ou5.organisationunitid,
ou6.organisationunitid,
_dv.dataelementid,_dv.sourceid,_dv.categoryoptioncomboid,_dv.attributeoptioncomboid,_dv.periodid,_dv.value,u,l,mean,std,datasets
order by country,state,division,district,block,facility,dataelement,category

Why does not adding distinct in this query produce duplicate rows?

This query was taken from a Rails application log...I'm trying to edit a massive postgresql statement I didn't write....If I don't add a distinct keyword after the SELECT, 2 duplicate rows appear for each braintree account. Why is this and is there another way to avoid having to use the distinct to avoid duplicates?
EDIT: I understand what distinct is supposed to do, the reason I'm asking is that it doesn't generate duplicates for other toy lines. By other toy lines, this query is building a "table" for a particular toy id (this specific example toys.id = 12). How do I figure out where the duplicate rows are being generated?
SELECT accounts.braintree_account_id as braintree_account_id,
accounts.braintree_account_id as braintree_account_id, format('%s %s', addresses.first_name,
addresses.last_name) as shipping_address_full_name,
users.email as email, addresses.line_1 as shipping_address_line_1,
addresses.line_2 as shipping_address_line_2, addresses.city as
shipping_address_city, addresses.state as shipping_address_state,
addresses.zip as shipping_address_zip_code, addresses.country
as shipping_address_country, CASE WHEN xy_shirt IS NULL THEN '' ELSE xy_shirt END, plans.name as plan_name, toys.sku as sku, to_char(accounts.created_at, 'MM/DD/YYYY HH24:MM:SS') as
account_created_at,
to_char(accounts.next_assessment_at, 'MM/DD/YYYY HH24:MM:SS') as account_next_assessment_at,
accounts.account_status as account_status FROM \"accounts\" INNER JOIN \"addresses\" ON
\"addresses\".\"id\" = \"accounts\".\"shipping_address_id\" AND \"addresses\".\"type\" IN
('ShippingAddress') LEFT OUTER JOIN shipping_methods ON
shipping_methods.account_id = accounts.id LEFT OUTER JOIN plans ON
accounts.plan_id = plans.id
LEFT OUTER JOIN users ON
accounts.user_id = users.id LEFT OUTER JOIN toys ON plans.toy_id = toys.id
LEFT OUTER JOIN account_variations ON accounts.id =
account_variations.account_id LEFT OUTER JOIN variations ON
account_variations.variation_id = variations.id
LEFT OUTER JOIN
choice_value_variations ON variations.id =
choice_value_variations.variation_id
LEFT OUTER JOIN choice_values ON
choice_value_variations.choice_value_id = choice_values.id LEFT OUTER
JOIN choice_types ON choice_values.choice_type_id = choice_types.id
LEFT
OUTER JOIN choice_type_toys ON choice_type_toys.toy_id = toys.id
AND choice_type_toys.choice_type_id = choice_types.id
LEFT OUTER JOIN
(SELECT * FROM crosstab('SELECT accounts.id, choice_types.id,
choice_values.presentation FROM accounts\n
LEFT JOIN account_variations ON
accounts.id=account_variations.account_id\n
LEFT JOIN variations ON account_variations.variation_id=variations.id\n
LEFT JOIN choice_value_variations ON
variations.id=choice_value_variations.variation_id\n
LEFT JOIN choice_values ON
choice_value_variations.choice_value_id=choice_values.id\n
LEFT JOIN choice_types ON choice_values.choice_type_id=choice_types.id
ORDER BY 1,2',\n 'select distinct choice_types.id
from choice_types JOIN choice_values ON choice_values.choice_type_id =
choice_types.id JOIN choice_value_variations ON
choice_value_variations.choice_value_id = choice_values.id JOIN
variations ON choice_value_variations.variation_id = variations.id JOIN choice_type_toys ON choice_type_toys.choice_type_id = choice_types.id JOIN toys ON toys.id = choice_type_toys.toy_id
where toys.id=12 ORDER
BY choice_types.id ASC')\n
AS (account_id int, xy_shirt
VARCHAR)) account_variation_view\n ON
accounts.id=account_variation_view.account_id WHERE
\"accounts\".\"account_status\" = 'active' AND
\"addresses\".\"flagged_invalid_at\" IS NULL AND \"toys\".\"id\" = 12
AND (NOT EXISTS (SELECT \"account_skipped_months\".* FROM
\"account_skipped_months\" WHERE
\"account_skipped_months\".\"month_year\" = 'JUL2016' AND
(account_skipped_months.account_id = accounts.id)))"
The purpose of using DISTINCT in a SELECT statement is to eliminate duplicate rows.
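As for where the duplicates come from: a common cause is a one-to-many join whose columns are not in the SELECT list, so the extra rows look identical. The sketch below is an assumption about the mechanism, not a diagnosis of the actual data; it uses Python's sqlite3 module and two hypothetical tables loosely modelled on accounts and account_variations. Grouping by the left table's key and keeping groups with more than one row shows which accounts fan out.

```python
import sqlite3

# Hypothetical data: account 1 has two variations, account 2 has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE account_variations (account_id INTEGER, variation_id INTEGER);
INSERT INTO accounts VALUES (1, 'alice'), (2, 'bob');
INSERT INTO account_variations VALUES (1, 10), (1, 11), (2, 12);
""")

# Selecting only account columns hides the joined table, so the rows
# produced by the one-to-many join look like exact duplicates.
joined = conn.execute("""
SELECT accounts.name
FROM accounts
LEFT JOIN account_variations ON account_variations.account_id = accounts.id
ORDER BY accounts.id
""").fetchall()

# Count rows per account to find the accounts that are multiplied.
fan_out = conn.execute("""
SELECT accounts.id, COUNT(*) AS matches
FROM accounts
LEFT JOIN account_variations ON account_variations.account_id = accounts.id
GROUP BY accounts.id
HAVING COUNT(*) > 1
""").fetchall()

print(joined)
print(fan_out)
```

Applying the same check to the real query one join at a time (add a join, group by accounts.id, look for counts above 1) should point at the join that multiplies rows for this particular toy.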

Left Join Is Not Doing What I expect

I am totally confused and have been working at this for 2 hours.
I thought restrictions on the left side of the join are honored.
With this query I am getting rows where [docSVsys].[visibility] is 1 as well as rows where it is <> 1.
I thought this would restrict [docSVsys].[visibility] to 1.
select top 1000
[docSVsys].[sID], [docSVsys].[visibility]
,[Table].[sID],[Table].[enumID],[Table].[valueID]
from [docSVsys] with (nolock)
left Join [DocMVenum1] as [Table] with (nolock)
on [docSVsys].[visibility] in (1)
and [Table].[sID] = [docSVsys].[sID]
and [Table].[enumID] = '140'
and [Table].[valueID] in (1,7)
This works
select top 1000
[docSVsys].[sID], [docSVsys].[visibility]
,[Table].[sID],[Table].[enumID],[Table].[valueID]
from [docSVsys] with (nolock)
left Join [DocMVenum1] as [Table] with (nolock)
on [Table].[sID] = [docSVsys].[sID]
and [Table].[enumID] = '140'
and [Table].[valueID] in (1,7)
where [docSVsys].[visibility] in (1)
I am just having a really off day, as I had it in my mind that the left side honored the join.
SELECT *
FROM A
LEFT JOIN B ON Condition
is equivalent to
SELECT *
FROM A
CROSS JOIN B
WHERE Condition
UNION ALL
SELECT A.*, NULL AS B
FROM A
WHERE NOT EXISTS (SELECT * FROM B WHERE Condition)
That is some rough pseudo-code.
Note that all rows from A get through. It's just that the columns from B will be NULL if the join fails for some particular row of A.
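The equivalence can be checked on a toy example. This is a minimal sketch using Python's sqlite3 module rather than SQL Server; tables A and B, their columns, and the join condition are invented for illustration.

```python
import sqlite3

# Hypothetical tables: A has three rows, B matches only some of them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (id INTEGER);
CREATE TABLE B (a_id INTEGER, val TEXT);
INSERT INTO A VALUES (1), (2), (3);
INSERT INTO B VALUES (1, 'x'), (2, 'y');
""")

# Plain LEFT JOIN with a compound condition.
left_join = conn.execute("""
SELECT A.id, B.val
FROM A LEFT JOIN B ON B.a_id = A.id AND B.val = 'x'
ORDER BY A.id
""").fetchall()

# The same result spelled out: matching pairs, plus every A row
# that has no matching B row, padded with NULL.
expanded = conn.execute("""
SELECT A.id, B.val
FROM A CROSS JOIN B
WHERE B.a_id = A.id AND B.val = 'x'
UNION ALL
SELECT A.id, NULL
FROM A
WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.a_id = A.id AND B.val = 'x')
ORDER BY 1
""").fetchall()

print(left_join)
print(expanded)
```

Both queries return one row per row of A; rows 2 and 3 fail the condition and come back with NULL in the B column instead of disappearing.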
Put the filter on docSVsys into the WHERE clause.
LEFT JOINs preserve all rows from the left (first) table, no matter what. The condition in the ON clause is only for matching which rows from the right/second table should be paired with rows from the left/first table.
If you want to exclude some rows from the first table, use the WHERE clause:
select top 1000
[docSVsys].[sID], [docSVsys].[visibility]
,[Table].[sID],[Table].[enumID],[Table].[valueID]
from [docSVsys] with (nolock)
left Join [DocMVenum1] as [Table] with (nolock)
on [Table].[sID] = [docSVsys].[sID]
and [Table].[enumID] = '140'
and [Table].[valueID] in (1,7)
where [docSVsys].[visibility] in (1)
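To see the difference side by side, here is a small sketch in Python's sqlite3 module. The docs and enums tables are hypothetical, simplified stand-ins for docSVsys and DocMVenum1, and the data is made up.

```python
import sqlite3

# Hypothetical data: one visible document and one hidden document,
# each with a matching row in the second table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE docs (sid INTEGER, visibility INTEGER);
CREATE TABLE enums (sid INTEGER, value_id INTEGER);
INSERT INTO docs VALUES (1, 1), (2, 0);
INSERT INTO enums VALUES (1, 7), (2, 7);
""")

# Filter on the left table placed in ON: it only controls matching,
# so the hidden document is kept with NULLs on the right.
filter_in_on = conn.execute("""
SELECT d.sid, d.visibility, e.value_id
FROM docs AS d
LEFT JOIN enums AS e ON d.visibility = 1 AND e.sid = d.sid
ORDER BY d.sid
""").fetchall()

# Filter placed in WHERE: the hidden document is removed.
filter_in_where = conn.execute("""
SELECT d.sid, d.visibility, e.value_id
FROM docs AS d
LEFT JOIN enums AS e ON e.sid = d.sid
WHERE d.visibility = 1
ORDER BY d.sid
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```

The first query returns both documents; the second returns only the visible one.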

Eliminating NULL rows in TSQL query [duplicate]

Possible Duplicate:
How to eliminate NULL fields in TSQL
I am using SSMS 2008 R2 and am developing a TSQL query. I want just 1 record / profile_name. Because some of these values are NULL, I am currently doing LEFT JOINS on most of the tables. But the problem with the LEFT JOINs is that now I get > 1 record for some profile_names!
But if I change this to INNER JOINs then some profile_names are excluded entirely because they have NULL values for these columns. How do I limit the query result to just one record / profile_name regardless of NULL values? And if there are non-NULL values then I want it to choose the record with non-NULL values. Here is initial query:
select distinct
gp.group_profile_id,
gp.profile_name,
gp.license_number,
gp.is_accepting,
case when gp.is_accepting = 1 then 'Yes'
when gp.is_accepting = 0 then 'No '
end as is_accepting_placement,
mo.profile_name as managing_office,
regions.[region_description] as region,
pv.vendor_name,
pv.id as vendor_id,
at.description as applicant_type,
dbo.GetGroupAddress(gp.group_profile_id, null, 0) as [Office Address],
gsv.status_description
from group_profile gp With (NoLock)
inner join group_profile_type gpt With (NoLock) on gp.group_profile_type_id = gpt.group_profile_type_id and gpt.type_code = 'FOSTERHOME' and gp.agency_id = #agency_id and gp.is_deleted = 0
inner join group_profile mo With (NoLock) on gp.managing_office_id = mo.group_profile_id
left outer join payor_vendor pv With (NoLock) on gp.payor_vendor_id = pv.payor_vendor_id
left outer join applicant_type at With (NoLock) on gp.applicant_type_id = at.applicant_type_id and at.is_foster_home = 1
inner join group_status_view gsv With (NoLock) on gp.group_profile_id = gsv.group_profile_id and gsv.status_value = 'OPEN' and gsv.effective_date =
(Select max(b.effective_date) from group_status_view b With (NoLock)
where gp.group_profile_id = b.group_profile_id)
left outer join regions With (NoLock) on isnull(mo.regions_id, gp.regions_id) = regions.regions_id
left join enrollment en on en.group_profile_id = gp.group_profile_id
join event_log el on el.event_log_id = en.event_log_id
left join people client on client.people_id = el.people_id
As you can see, the result of the above query is 1 row / profile_name:
group_profile_id profile_name license_number is_accepting is_accepting_placement managing_office region vendor_name vendor_id applicant_type Office Address status_description Cert Date2
But now watch what happens when I add in 2 LEFT JOINs and 1 additional column:
select distinct
gp.group_profile_id,
gp.profile_name,
gp.license_number,
gp.is_accepting,
case when gp.is_accepting = 1 then 'Yes'
when gp.is_accepting = 0 then 'No '
end as is_accepting_placement,
mo.profile_name as managing_office,
regions.[region_description] as region,
pv.vendor_name,
pv.id as vendor_id,
at.description as applicant_type,
dbo.GetGroupAddress(gp.group_profile_id, null, 0) as [Office Address],
gsv.status_description,
ri.[description] as race
from group_profile gp With (NoLock)
inner join group_profile_type gpt With (NoLock) on gp.group_profile_type_id = gpt.group_profile_type_id and gpt.type_code = 'FOSTERHOME' and gp.agency_id = #agency_id and gp.is_deleted = 0
inner join group_profile mo With (NoLock) on gp.managing_office_id = mo.group_profile_id
left outer join payor_vendor pv With (NoLock) on gp.payor_vendor_id = pv.payor_vendor_id
left outer join applicant_type at With (NoLock) on gp.applicant_type_id = at.applicant_type_id and at.is_foster_home = 1
inner join group_status_view gsv With (NoLock) on gp.group_profile_id = gsv.group_profile_id and gsv.status_value = 'OPEN' and gsv.effective_date =
(Select max(b.effective_date) from group_status_view b With (NoLock)
where gp.group_profile_id = b.group_profile_id)
left outer join regions With (NoLock) on isnull(mo.regions_id, gp.regions_id) = regions.regions_id
left join enrollment en on en.group_profile_id = gp.group_profile_id
join event_log el on el.event_log_id = en.event_log_id
left join people client on client.people_id = el.people_id
left join race With (NoLock) on el.people_id = race.people_id
left join race_info ri with (nolock) on ri.race_info_id = race.race_info_id
The above query results in all of the same profile_names, but some with NULL race values:
group_profile_id profile_name license_number is_accepting is_accepting_placement managing_office region vendor_name vendor_id applicant_type Office Address status_description Cert Date2 race
Unfortunately it complicates matters that I need to join in 2 additional tables for this one additional field value (race). If I simply change the last two LEFT JOINs above to INNER JOINs then I eliminate the NULL rows above. But I also eliminate some of the profile_names:
group_profile_id profile_name license_number is_accepting is_accepting_placement managing_office region vendor_name vendor_id applicant_type Office Address status_description Cert Date2 race
Hopefully I have provided all of the details that you need for this question.
Not the most elegant solution, but one that will work:
select [stuff]
from group_profile gp With (NoLock)
inner join group_profile_type gpt With (NoLock) on gp.group_profile_type_id = gpt.group_profile_type_id and gpt.type_code = 'FOSTERHOME' and gp.agency_id = #agency_id and gp.is_deleted = 0
inner join group_profile mo With (NoLock) on gp.managing_office_id = mo.group_profile_id
join payor_vendor pv on ISNULL(gp.payor_vendor_id, 'THISVALUEWILLNEVEROCCUR') = ISNULL(pv.payor_vendor_id, 'THISVALUEWILLNEVEROCCUR')
...etc...
Biggest issue with what I posted is that you'll be doing a whole lot of table scans.
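A different technique from the one above, offered only as a sketch: keep the LEFT JOINs and collapse each profile to one row with GROUP BY, using MAX on the optional column. Aggregates skip NULLs, so a non-NULL value wins when one exists and NULL remains only when there is none. The example uses Python's sqlite3 module, and the two tables are hypothetical simplifications (in the real schema race is reached through event_log and people_id).

```python
import sqlite3

# Hypothetical data: Home A has one NULL and one real value,
# Home B has no matching rows at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE group_profile (group_profile_id INTEGER, profile_name TEXT);
CREATE TABLE race (group_profile_id INTEGER, description TEXT);
INSERT INTO group_profile VALUES (1, 'Home A'), (2, 'Home B');
INSERT INTO race VALUES (1, NULL), (1, 'Race X');
""")

# LEFT JOIN keeps every profile; GROUP BY plus MAX collapses the
# multiplied rows and prefers a non-NULL description.
rows = conn.execute("""
SELECT gp.profile_name, MAX(r.description) AS race
FROM group_profile AS gp
LEFT JOIN race AS r ON r.group_profile_id = gp.group_profile_id
GROUP BY gp.group_profile_id, gp.profile_name
ORDER BY gp.group_profile_id
""").fetchall()

print(rows)
```

If a profile has several different non-NULL values, MAX simply picks the largest one, so this only fits when any one non-NULL value is acceptable.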