SQL LEFT JOIN and SUM() slow - tsql

I'm running a query that calculates the total length of an object.
Each object is made up of several smaller segments (stored in a different table).
Running the query with a single segment per object works fine, but it gets really slow once more segments are added (3000 objects and 9000 segments).
Below is the relevant part of my query:
SELECT
    t1.ObjectID,
    SUM(SegmentLength) AS TotalLength
FROM Objects t1
LEFT JOIN Segment t2
    ON t1.ObjectID = t2.ObjectID
   AND t1.ProjektID = t2.ProjektID
   AND t1.CraneID = t2.CraneID
WHERE
    t1.ObjectID = 'esttest3' AND
    (x_ > 0 OR y_ > 0) AND
    x_ IS NOT NULL AND
    y_ IS NOT NULL
GROUP BY
    t1.ObjectID
This takes about 1 minute for 3000 objects with 3 segments each.
I've tried adding indices:
FROM Objects t1 WITH (INDEX(IX_Object_All))
LEFT JOIN
Segment t2 WITH (INDEX(IX_Segment_All))
But this did not improve performance.
EDIT: These indexes were non-clustered; creating clustered indexes fixed the issue (no index hints were needed).
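A sketch of what that looks like (the index names are made up, and the exact key column order may differ):
CREATE CLUSTERED INDEX CIX_Objects_JoinKey ON Objects (ObjectID, ProjektID, CraneID);
CREATE CLUSTERED INDEX CIX_Segment_JoinKey ON Segment (ObjectID, ProjektID, CraneID);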
If I change the LEFT JOIN into an INNER JOIN, the extracted query (the one above) becomes instantaneous (~0 sec). However, the INNER JOIN still takes a minute in my original query with the extra fields (which basically makes it a left join again, I suppose, since there will be values for each object).
I also tried creating a view with a calculated TotalLength and doing a LEFT JOIN with the view. Still no improvement.
My final idea is to create a trigger on the Segment table that updates a TotalLength field in the Object table. I worry, however, that this will hurt performance when importing projects with ~9k segments, updating the value for each one...
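Something along these lines is what I have in mind (just a sketch; it assumes a TotalLength column already exists on Objects and recalculates only the objects touched by the modified segments):
-- Hypothetical trigger: keeps Objects.TotalLength (an assumed column) in sync with Segment
CREATE TRIGGER trg_Segment_TotalLength
ON Segment
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE o
    SET TotalLength = ISNULL((SELECT SUM(s.SegmentLength)
                              FROM Segment s
                              WHERE s.ObjectID  = o.ObjectID
                                AND s.ProjektID = o.ProjektID
                                AND s.CraneID   = o.CraneID), 0)
    FROM Objects o
    -- only recalculate objects referenced by the inserted/updated/deleted segments
    WHERE EXISTS (SELECT 1 FROM inserted i
                  WHERE i.ObjectID = o.ObjectID AND i.ProjektID = o.ProjektID AND i.CraneID = o.CraneID)
       OR EXISTS (SELECT 1 FROM deleted d
                  WHERE d.ObjectID = o.ObjectID AND d.ProjektID = o.ProjektID AND d.CraneID = o.CraneID);
END;
Since an AFTER trigger in SQL Server fires once per statement rather than once per row, a bulk import done with set-based statements should recalculate each affected object only once.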

Unexpected sort order on postgres left outer join

Background
I'm using Postgres 11 and pgAdmin4 v5.2. The problem I describe below is on my dev machine which has both the postgres server and pgAdmin client.
Questions I've looked at on SO that deal with incorrect ordering seem to involve collation issues with text fields, whereas my problem is with an integer field.
Setup
I have a table norm_plans that contains ~5k records.
Column   | Type
---------+------------------------
canon_id | integer
name     | character varying(200)
... other fields
canon_id is autopopulated using a sequence.
I've created a new table norm_plans_cmp as a copy of norm_plans (CREATE TABLE norm_plans_cmp AS TABLE norm_plans WITH DATA;)
I next insert some new records into norm_plans and update some existing records (fields other than canon_id).
The new records increment the sequence and are assigned canon_id values as expected.
I now want to compare norm_plans against norm_plans_cmp so I perform a left outer join:
select a.*, b.*
from norm_plans a
left outer join norm_plans_cmp b
on a.canon_id = b.canon_id
order by a.canon_id
Problem
I would expect records to be sorted by canon_id. This holds true from 1 to 2,000, but after 2,000 I get canon_ids from 5,001 to 5,111 (the last canon_id), and then it picks up again from 2,001. I'm viewing this data in pgAdmin; see screenshot 1 below showing the shift from 2,000 to 5,001, and screenshot 2 showing the transition from 5,111 back to 2,001.
Additional observations
While incorrect, the ordering seems consistent. Running the query multiple times results in the same (incorrect) ordering.
Despite my question title, I'm not totally sure the left join has anything to do with this.
Running SELECT * ... ORDER BY canon_id on norm_plans or norm_plans_cmp alone also results in incorrect ordering, albeit at different points in the order.
Answers to this SO question suggest index corruption may be a contributing problem, but I have no indexes on either norm_plans or norm_plans_cmp (canon_id is not defined as a PK).
At this point, I'm stumped!

Joining too many tables makes Postgres query extremely slow

I've been trying to optimize this simple query on Postgres 12 that joins several tables to a base relation. Each has a 1-to-1 relation with the base table and anywhere from 10 thousand to 10 million rows.
SELECT *
FROM base
LEFT JOIN t1 ON t1.id = base.t1_id
LEFT JOIN t2 ON t2.id = base.t2_id
LEFT JOIN t3 ON t3.id = base.t3_id
LEFT JOIN t4 ON t4.id = base.t4_id
LEFT JOIN t5 ON t5.id = base.t5_id
LEFT JOIN t6 ON t6.id = base.t6_id
LEFT JOIN t7 ON t7.id = base.t7_id
LEFT JOIN t8 ON t8.id = base.t8_id
LEFT JOIN t9 ON t9.id = base.t9_id
(the actual relations are a bit more complicated than this, but for demonstration purposes this is fine)
I noticed that the query is still very slow even when I only do SELECT base.id, which seems odd, because then the query planner should know that the joins are unnecessary and they shouldn't affect performance.
Then I noticed that 8 seems to be some kind of magic number. If I remove any single one of the joins, the query time goes from 500 ms to 1 ms. With EXPLAIN I was able to see that Postgres does index-only scans when joining 8 tables, but with 9 tables it starts doing sequential scans.
That happens even when I only do SELECT base.id, so somehow the number of tables is tripping up the query planner.
We finally found out that there is indeed a configuration setting in Postgres called join_collapse_limit, which is set to 8 by default.
https://www.postgresql.org/docs/current/runtime-config-query.html
The planner will rewrite explicit JOIN constructs (except FULL JOINs) into lists of FROM items whenever a list of no more than this many items would result. Smaller values reduce planning time but might yield inferior query plans. By default, this variable is set the same as from_collapse_limit, which is appropriate for most uses. Setting it to 1 prevents any reordering of explicit JOINs. Thus, the explicit join order specified in the query will be the actual order in which the relations are joined. Because the query planner does not always choose the optimal join order, advanced users can elect to temporarily set this variable to 1, and then specify the join order they desire explicitly.
After reading this article we decided to increase the limit, along with other settings such as from_collapse_limit and geqo_threshold. Beware that query planning time increases exponentially with the number of joins, so the limit is there for a reason and should not be increased carelessly.
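For example, the limits can be raised for the current session before running the query (the values below are only illustrative and should be tuned for your workload):
SET join_collapse_limit = 12;  -- default 8
SET from_collapse_limit = 12;  -- default 8
SET geqo_threshold = 14;       -- default 12; the genetic optimizer takes over from this many FROM items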

How to optimize the following query by adding more indexes?

I am trying to optimize a query which has been destroying my DB.
https://explain.depesz.com/s/isM1
If you have any insights into how to make this better please let me know.
We are using RDS/Postgres 11.9
EXPLAIN ANALYZE SELECT "src_rowdifference"."key",
"src_rowdifference"."port_id",
"src_rowdifference"."shipping_line_id",
"src_rowdifference"."container_type_id",
"src_rowdifference"."shift_id",
"src_rowdifference"."prev_availability_id",
"src_rowdifference"."new_availability_id",
"src_rowdifference"."date",
"src_rowdifference"."prev_last_update",
"src_rowdifference"."new_last_update"
FROM "src_rowdifference"
INNER JOIN "src_containertype" ON ("src_rowdifference"."container_type_id" = "src_containertype"."key")
WHERE ("src_rowdifference"."container_type_id" IN
(SELECT U0."key"
FROM "src_containertype" U0
INNER JOIN "notification_tablenotification_container_types" U1 ON (U0."key" = U1."containertype_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."new_last_update" >= '2020-01-15T03:11:06.291947+00:00'::timestamptz
AND "src_rowdifference"."port_id" IN
(SELECT U0."key"
FROM "src_port" U0
INNER JOIN "notification_tablenotification_ports" U1 ON (U0."key" = U1."port_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."shipping_line_id" IN
(SELECT U0."key"
FROM "src_shippingline" U0
INNER JOIN "notification_tablenotification_shipping_lines" U1 ON (U0."key" = U1."shippingline_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."prev_last_update" IS NOT NULL
AND NOT ("src_rowdifference"."prev_availability_id" = 'na'
AND "src_rowdifference"."prev_availability_id" IS NOT NULL)
AND NOT ("src_rowdifference"."key" IN
(SELECT V1."rowdifference_id"
FROM "notification_tablenotificationtrigger_row_differences" V1
WHERE V1."tablenotificationtrigger_id" IN
(SELECT U0."id"
FROM "notification_tablenotificationtrigger" U0
WHERE U0."notification_id" = 'test#test.com'))));
All my indexes are btree + btree(varchar_pattern_ops)
"src_rowdifference_port_id_shipping_line_id_9b3465fc_uniq" UNIQUE CONSTRAINT, btree (port_id, shipping_line_id, container_type_id, shift_id, date, new_last_update)
Edit: A somewhat unrelated change I made was adding some more SSD disk space to my RDS instance. That made a huge difference to CPU usage and, in turn, to the number of connections we have.
It is hard to think about the plan as a whole, as I don't understand what it is looking for. But looking at the individual pieces, there are two which together dominate the run time.
One is the index scan on src_rowdifference_port_id_shipping_line_id_9b3465fc, which seems pretty slow given the number of rows returned. Comparing the Index Condition to the index columns, I can see that the condition on new_last_update cannot be applied efficiently in the index because two columns in the index come before it and have no equality conditions in the node. So instead that >= is applied as an "in-index filter" where it needs to test each row and reject it, rather than just skipping it in bulk. I don't know how many rows that removes as the "Rows Removed by Filter" does not count in-index filters, but it is potentially large. So one thing to try would be to make a new index on (port_id, shipping_line_id, container_type_id, new_last_update). Or maybe replace that index with a reordered version (port_id, shipping_line_id, container_type_id, new_last_update, shift_id, date) but of course that might make some other query worse.
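As a sketch, that new index (the name here is chosen arbitrarily) could be created like so; CONCURRENTLY avoids blocking writes while it builds:
CREATE INDEX CONCURRENTLY src_rowdifference_port_line_type_lastupdate_idx
    ON src_rowdifference (port_id, shipping_line_id, container_type_id, new_last_update);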
The other time-consuming thing is kicking the materialized node 47 thousand times (each one looping over up to 22 thousand rows) to implement NOT (SubPlan 1). That should be using a hashed subplan, rather than a linearly searched subplan. The only reason I can think of for it not using the hashed subplan is that work_mem is not large enough to anticipate fitting it into memory. What is your setting for work_mem? What happens if you bump it up to "100MB" or so?
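To test that cheaply, you can raise work_mem for the current session only and re-run the EXPLAIN ANALYZE (100MB is just the figure mentioned above, not a recommendation for your server):
SET work_mem = '100MB';  -- session-local; does not change the server-wide setting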
The NOT (SubPlan 1) from the EXPLAIN corresponds to the part of your query AND NOT ("src_rowdifference"."key" IN (...)). If bumping up work_mem doesn't work, you could try rewriting that into a NOT EXISTS clause instead.
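A sketch of that rewrite, replacing the final AND NOT (... IN ...) predicate and keeping the same tables and columns as the original subquery:
AND NOT EXISTS
    (SELECT 1
     FROM "notification_tablenotificationtrigger_row_differences" V1
     INNER JOIN "notification_tablenotificationtrigger" U0
         ON V1."tablenotificationtrigger_id" = U0."id"
     WHERE U0."notification_id" = 'test#test.com'
       AND V1."rowdifference_id" = "src_rowdifference"."key")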

Why are rows in temp table not showing in count?

I have a T-SQL query with temporary tables. It runs and shows the count of rows per table; however, for one of the temp tables (which has 124 rows), the SELECT COUNT shows zero instead of the 124 rows that are in the temporary table.
I have tried changing the joins in the select statement and retyping the entire query. I have added aliases for all the fields and double-checked that all of the temporary tables give the correct results. As far as I can see, I have written the COUNT the same way for all 4 temporary tables, but COUNT(#ML.MembershipID) is the only one that shows zero and doesn't match the row count.
--this is the select part of my query
SELECT #Type.MembershipType,
       COUNT(#MS.MembershipID) AS MembersStartCount,
       COUNT(#ME.MembershipID) AS MembersEndCount,
       COUNT(#ML.MembershipID) AS MembersLostCount,
       COUNT(#MN.MembershipID) AS MembersGainedCount
FROM Filteredccx_Membership mem
INNER JOIN #Type ON mem.ccx_membershipid = #Type.MembershipID
LEFT OUTER JOIN #MS ON mem.ccx_membershipid = #MS.MembershipID
LEFT OUTER JOIN #ME ON mem.ccx_membershipid = #ME.MembershipID
LEFT OUTER JOIN #MN ON mem.ccx_membershipid = #MN.MembershipID
LEFT OUTER JOIN #ML ON mem.ccx_membershipid = #ML.MembershipID
GROUP BY #Type.MembershipType
The COUNT(#ML.MembershipID) AS MembersLostCount should show 124 in total across the 2 membership types, but it shows 0 in both rows. All of the other COUNTs show the number of rows in their temp tables.
Your LEFT OUTER JOIN is probably not returning anything from the #ML table.
Meaning that #ML.MembershipID is NULL for all rows.
Try changing it to:
COUNT(ISNULL(#ML.MembershipID, 0)) AS MembersLostCount,
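The underlying behavior is that COUNT(column) skips NULLs while COUNT(*) counts every row, which a tiny example with dummy values shows:
SELECT COUNT(v) AS count_col,   -- 0: NULL values are ignored
       COUNT(*) AS count_rows   -- 3
FROM (VALUES (CAST(NULL AS int)), (NULL), (NULL)) AS t(v);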

PostgREST using limit and offset in subqueries or CTE

We are using PostgREST in our project for some quite complex database views.
At some point, when using limit and offset (x-range headers or query parameters) with sub-selects, we started getting very high response times.
From what we have read, this seems to be a known issue where PostgreSQL executes the sub-selects even for records that are not requested. The solution would be to juggle the offset and limit a little, putting them in a sub-select or a CTE.
Is there an internal GUC value or something similar that we can use in the database views to optimize the response times? Does anybody have a hint on how to achieve this?
EDIT: as suggested, here are some more details. Let's say we have a relationship between products and parts, and I want to know the parts count per product (this is a simplified version of the database views we are exposing).
There are two ways of doing this
A. Subselect:
SELECT products.id
    ,(
        SELECT count(part_id) AS total
        FROM parts
        WHERE product_id = products.id
    )
FROM products
LIMIT 1000 OFFSET 99000
B. CTE:
WITH parts_count AS (
    SELECT product_id
        ,count(part_id) AS total
    FROM parts
    GROUP BY product_id
    ORDER BY product_id
)
SELECT products.id
    ,parts_count.total
FROM products
LEFT JOIN parts_count ON parts_count.product_id = products.id
LIMIT 1000
OFFSET 99000
The problem with A is that the sub-select is performed for every row, so even though I read only 1000 records, there are 100,000 sub-selects.
The problem with B is that the join with the parts_count table takes very long, since there are 100,000 records there (although the WITH query itself takes only 200 ms for 2,000 records!). Ideally I would like to limit the parts_count table with the same limit and offset as the main query, but I can't do this in PostgREST since it just appends the limit and offset at the end; I don't have access to those parameters inside the WITH query.
It is unavoidable that high OFFSET leads to bad performance.
There is no other way to compute OFFSET but to scan and discard all the rows until you reach the offset, and no database in the world will be fast if OFFSET is high.
That's a conceptual problem, and the only way to avoid it is to avoid OFFSET.
If your goal is pagination, then usually keyset pagination is a better solution:
You add an ORDER BY clause that matches your requirements, make sure there is a unique key in the ORDER BY clause, and remember the last value you found. To fetch the next page, add a WHERE condition with those values. With proper index support, this can be very fast.
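A minimal sketch of keyset pagination on this schema, assuming id is the unique ordering key:
-- first page
SELECT id FROM products ORDER BY id LIMIT 1000;

-- next page: continue after the last id seen on the previous page (1000 here is an example)
SELECT id FROM products WHERE id > 1000 ORDER BY id LIMIT 1000;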
For your query, a more efficient version is probably:
SELECT p.id,
       count(parts.part_id) AS total
FROM (SELECT id FROM products
      LIMIT 1000 OFFSET 99000) p
LEFT JOIN parts ON parts.product_id = p.id
GROUP BY p.id;
It is rather weird that you have no ORDER BY, but LIMIT and OFFSET. That doesn't make much sense.