I have this SQL:
SELECT
*
FROM
Requisicao r
join convenio c on c.idconvenio = r.idconvenio
join empresa e on e.idempresa = c.idempresa
When I execute it I get this plan of execution:
PLAN JOIN (C NATURAL,E INDEX (INTEG_160),R INDEX (INTEG_318))
What means that Convenio's index was not used (every table has its indexes)
I would like to understand it a little better so I can improve some performance issues I'm having with this system.
Thanks.
What seems wrong for you? Because you don't have any conditions (WHERE clause) server will read one table naturally, i.e. from the very first row to the last one. Taking into consideration index's selectivity served decided that it would be better to read from c and to join records from e and r.
I agree with Andrei. If convenio.idconvenio has low selectivity, the plan is fine.
Related
I am trying to optimize a query which has been destroying my DB.
https://explain.depesz.com/s/isM1
If you have any insights into how to make this better please let me know.
We are using RDS/Postgres 11.9
explain analyze SELECT "src_rowdifference"."key",
"src_rowdifference"."port_id",
"src_rowdifference"."shipping_line_id",
"src_rowdifference"."container_type_id",
"src_rowdifference"."shift_id",
"src_rowdifference"."prev_availability_id",
"src_rowdifference"."new_availability_id",
"src_rowdifference"."date",
"src_rowdifference"."prev_last_update",
"src_rowdifference"."new_last_update"
FROM "src_rowdifference"
INNER JOIN "src_containertype" ON ("src_rowdifference"."container_type_id" = "src_containertype"."key")
WHERE ("src_rowdifference"."container_type_id" IN
(SELECT U0."key"
FROM "src_containertype" U0
INNER JOIN "notification_tablenotification_container_types" U1 ON (U0."key" = U1."containertype_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."new_last_update" >= '2020-01-15T03:11:06.291947+00:00'::timestamptz
AND "src_rowdifference"."port_id" IN
(SELECT U0."key"
FROM "src_port" U0
INNER JOIN "notification_tablenotification_ports" U1 ON (U0."key" = U1."port_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."shipping_line_id" IN
(SELECT U0."key"
FROM "src_shippingline" U0
INNER JOIN "notification_tablenotification_shipping_lines" U1 ON (U0."key" = U1."shippingline_id")
WHERE U1."tablenotification_id" = 'test#test.com')
AND "src_rowdifference"."prev_last_update" IS NOT NULL
AND NOT ("src_rowdifference"."prev_availability_id" = 'na'
AND "src_rowdifference"."prev_availability_id" IS NOT NULL)
AND NOT ("src_rowdifference"."key" IN
(SELECT V1."rowdifference_id"
FROM "notification_tablenotificationtrigger_row_differences" V1
WHERE V1."tablenotificationtrigger_id" IN
(SELECT U0."id"
FROM "notification_tablenotificationtrigger" U0
WHERE U0."notification_id" = 'test#test.com'))));
All my indexes are btree + btree(varchar_pattern_ops)
"src_rowdifference_port_id_shipping_line_id_9b3465fc_uniq" UNIQUE CONSTRAINT, btree (port_id, shipping_line_id, container_type_id, shift_id, date, new_last_update)
Edit: A little unrelated change that I made was added some more ssd disk space to my RDS instance. That made a huge difference to the CPU usage and in turn made a huge difference to the number of connections we have.
It is hard to think about the plan as a whole, as I don't understand what it is looking for. But looking at the individual pieces, there are two which together dominate the run time.
One is the index scan on src_rowdifference_port_id_shipping_line_id_9b3465fc, which seems pretty slow given the number of rows returned. Comparing the Index Condition to the index columns, I can see that the condition on new_last_update cannot be applied efficiently in the index because two columns in the index come before it and have no equality conditions in the node. So instead that >= is applied as an "in-index filter" where it needs to test each row and reject it, rather than just skipping it in bulk. I don't know how many rows that removes as the "Rows Removed by Filter" does not count in-index filters, but it is potentially large. So one thing to try would be to make a new index on (port_id, shipping_line_id, container_type_id, new_last_update). Or maybe replace that index with a reordered version (port_id, shipping_line_id, container_type_id, new_last_update, shift_id, date) but of course that might make some other query worse.
The other time consuming thing is kicking the materialized node 47 thousand times (each one looping over up to 22 thousand rows) to implement NOT (SubPlan 1). That should be using a hashed subplan, rather than a linear searched subplan. The only reason I can think of that it not doing the hashed subplan is that work_mem is not large enough to anticipate fitting it into memory. What is your setting for work_mem? What happens if you bump it up to "100MB" or so?
The NOT (SubPlan 1) from the EXPLAIN corresponds to the part of your query AND NOT ("src_rowdifference"."key" IN (...)). If bumping up work_mem doesn't work, you could try rewriting that into a NOT EXISTS clause instead.
these are 3 approaches how to make a join. I would like to hear some word on perforance of these 3 queries.
Thank you
SELECT * FROM
tableA A LEFT JOIN tableB B
INNER JOIN tableC C
ON C.ColumnC = B.ColumnB
ON B.ColumnB = A.ColumnB
WHERE ColumnX = 'XY'
Versus
SELECT * FROM
tableA A LEFT JOIN tableB B
ON B.ColumnB = A.ColumnB
INNER JOIN tableC C
ON C.ColumnC = B.ColumnB
WHERE ColumnX = 'XY'
Versus Common Table Expression
WITH T...
It does not matter.
SQL Server has a cost-based optimizer (as opposed to a rule-based optimizer). That means that the engine is able to figure out that both of your first two options are identical. Run your estimated and actual execution plans and you will see that this is the case.
The only reason you would choose one option over the other is for readability's sake. I go with your second option, because it's a lot easier to read when there are a great many joins involved. ON clauses in reverse order become quite difficult to track.
In my experience, any of the above could be quicker depending on your tables.
As you're setting up joins, you want to start with the most restrictive as possible (without negatively affecting your end result, obviously). This same logic also applies to the Where clause for the same reason. By starting with the most restrictive, you're limiting the number of rows that are being joined and thus evaluated by the Where clause and then returned/manipulated in the select clause. For my answers below regarding the three specific scenarios, I'm assuming a sufficiently complicated query that is doing more than just looking to combine data from multiple tables (i.e., queries answering specific questions).
If Table A is huge and Tables B & C are smaller and more directly related to the data you're trying to isolate, then the first option would likely be fastest.
If Table B or C are huge and Table A is more related to your desired data, the second option would likely be fastest.
As far as option 3 goes, I love CTEs but I try to only use them when I need to do so. Using a CTE will speed up your overall query if the data joined, manipulated, and returned by the CTE is only related to the rest of the query in a limited fashion. Including tables that are only partially related to your end result in your primary string of joins is going to needlessly slow down your query. If you can parse out that data into a CTE, it can run quickly by itself and then be incorporated back into the main query at the end.
I have a query on a postgresql 9.2 system that takes about 20s in it's normal form but only takes ~120ms when using a CTE.
I simplified both queries for brevity.
Here is the normal form (takes about 20s):
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
ORDER BY modified_at DESC
LIMIT 25;
Here is the explain for this query: http://explain.depesz.com/s/2v8
The CTE form (about 120ms):
WITH raw AS (
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
)
SELECT *
FROM raw
ORDER BY modified_at DESC
LIMIT 25;
Here is the explain for the CTE: http://explain.depesz.com/s/uxy
Simply by moving the ORDER BY to the outer part of the query reduces the cost by 99%.
I have two questions: 1) is there a way to construct the first query without using a CTE in such a way that it is logically equivalent more performant and 2) what does this difference in performance say about how the planner is determining how to fetch the data?
Regarding the questions above, are there additional statistics or other planner hints that would help improve the performance of the first query?
Edit: Taking away the limit also causes the query to use a heap scan as opposed to an index scan backwards. Without the LIMIT the query completes in 40ms.
After seeing the effect of the LIMIT I tried with LIMIT 1, LIMIT 2, etc. The query performs in under 100ms when using LIMIT 1 and 10s+ with LIMIT > 1.
After thinking about this some more, question 2 boils down to why does the planner use an index scan backwards in one case and a bitmap heap scan + sort in another logically equivalent case? And how can I "help" the planner use the efficient plan in both cases?
Update:
I accepted Craig's answer because it was the most comprehensive and helpful. The way I ended up solving the problem was by using a query that was practically equivalent though not logically equivalent. At the root of the issue was an index scan backwards of the index on modified_at. In order to inform the planner that this was not a good idea I add a predicate of the form WHERE modified_at >= NOW() - INTERVAL '1 year'. This included enough data for the application but prevented the planner from going down the backwards index scan path.
This was a much lower impact solution that prevented the need to rewrite the queries using either a sub query or a CTE. YMMV.
Here's why this is happening, with the following explanation current until at least 9.3 (if you're reading this and on a newer version, check to make sure it hasn't changed):
PostgreSQL doesn't optimize across CTE boundaries. Each CTE clause is run in isolation and its results are consumed by other parts of the query. So a query like:
WITH blah AS (
SELECT * FROM some_table
)
SELECT *
FROM blah
WHERE id = 4;
will cause the full inner query to get executed. PostgreSQL won't "push down" the id = 4 qualification into the inner query. CTEs are "optimization fences" in that regard, which can be both good or bad; it lets you override the planner when you want to, but prevents you from using CTEs as simple syntactic cleanup for a deeply nested FROM subquery chain if you do need push-down.
If you rephrase the above as:
SELECT *
FROM (SELECT * FROM some_table) AS blah
WHERE id = 4;
using a sub-query in FROM instead of a CTE, Pg will push the qual down into the subquery and it'll all run nice and quickly.
As you have discovered, this can also work to your benefit when the query planner makes a poor decision. It appears that in your case a backward index scan of the table is immensely more expensive a bitmap or index scan of two smaller indexes followed by a filter and sort, but the planner doesn't think it will be so it plans the query to scan the index.
When you use the CTE, it can't push the ORDER BY into the inner query, so you're overriding its plan and forcing it to use what it thinks is an inferior execution plan - but one that turns out to be much better.
There's a nasty workaround that can be used for these situations called the OFFSET 0 hack, but you should only use it if you can't figure out a way to make the planner do the right thing - and if you have to use it, please boil this down to a self-contained test case and report it to the PostgreSQL mailing list as a possible query planner bug.
Instead, I recommend first looking at why the planner is making the wrong decision.
The first candidate is stats / estimates problems, and sure enough when we look at your problematic query plan there's a factor of 3500 mis-estimation of the expected result rows. That's big, but not impossibly big, though it's more interesting that you actually only get one row where the planner is expecting a non-trivial row set. That doesn't help us much, though; if the row count is lower than expected that means that choosing to use the index was a better choice than expected.
The main issue looks like it's not using the smaller, more selective indexes sierra_kilo and papa_lima because it sees the ORDER BY and thinks that it'll save more time doing a backward index scan and avoiding the sort than it really does. That makes sense given that there's only one matching row to sort! If it got the expected 3500 rows then it might've made more sense to avoid the sort, though that's still a fairly small rowset to just sort in memory.
Do you set any parameters like enable_seqscan, etc? If you do, unset them; they're for testing only and totally inappropriate for production use. If you aren't using the enable_ params I think it's worth raising this on the PostgreSQL mailing list pgsql-perform. The anonymized plans make this a bit difficult, though, especially since there's no gurantee that identifiers from one plan refer to the same objects in the other plan, and they don't match what you wrote in the query on the question. You'll want to produce a properly hand-done version where everything matches up before asking on the mailing list.
There's a fairly good chance that you'll need to provide the real values for anyone to help. If you don't want to do that on a public mailing list, there's another option available. (I should note that I work for one of them, per my profile).
Just a shot in the dark, but what happens if you run this
SELECT *
FROM (
SELECT *
FROM tableA
WHERE (columna = 1 OR columnb = 2) AND
atype = 35 AND
aid IN (1, 2, 3)
) AS x
ORDER BY modified_at DESC
LIMIT 25;
This is in a stored procedure..This if statement, then I do a little work. The #AsOfDate is a passed in variable of date datatype. The question I have is Why do I get better performance by removing the inner-most exists, but ONLY when the entire statement is in an IF EXISTS?
The two tables:
dbo.TXXX_InventoryDetail -- 1.3 billion records..stats up to date
dbo.TXXX_InventoryFull -- 9.8 million records..stats up to date
Statement:
if exists (select 1
from dbo.TXXX_InventoryDetail o
where exists (select 1
from dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=#AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut))
and o.C001_AsOfDate=#AsOfDate)
io output:
Table 'TXXX_InventoryFull'. Scan count 9240262, logical reads 29548864
Table 'T001_InventoryDetail'. Scan count 1, logical reads 17259
If I remove the second where exists and do a join:
if exists (select 1
from dbo.TXXX_InventoryDetail o,
dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=#AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut)
and o.C001_AsOfDate=#AsOfDate)
io output:
Table 'TXXX_InventoryDetail'. Scan count 0, logical reads 333952
Table 'TXXX_InventoryFull'. Scan count 1, logical reads 630
Now..the reason I think it is the if exists is that if I remove it and do a select count(*) like this:
select COUNT(*)
from dbo.T001_InventoryDetail o
where exists (select 1
from dbo.TXXX_InventoryFull i
where i.C001_AsOfDate= o.C001_AsOfDate
and i.C001_ProductID=o.C001_ProductID
and i.C001_StoreNumber=o.C001_StoreNumber
and i.C001_AsOfDate=#AsOfDate
and (i.C001_LastModelDate!=o.C001_LastModelDate
or o.C001_InventoryQty!=o.C001_InventoryQty
or i.C001_OnOrderQty!=o.C001_OnOrderQty
or i.C001_TBOQty!=o.C001_TBOQty
or i.C001_ModelQty!=o.C001_ModelQty
or i.C001_TBOAdjustQty!=o.C001_TBOAdjustQty
or i.C001_ReturnQtyPending!=o.C001_ReturnQtyPending
or i.C001_ReturnQtyInProcess!=o.C001_ReturnQtyInProcess
or i.C001_ReturnQtyDueOut!=o.C001_ReturnQtyDueOut))
and o.C001_AsOfDate=#AsOfDate
TXXX_InventoryFull'. Scan count 41, logical reads 692
T001_InventoryDetail'. Scan count 65, logical reads 17477
Worktable'. Scan count 0, logical reads 0
It is generally said that one should avoid doing coordinated subqueries in the predicate, as these tend to force nested loop joins. When querying large datasets, especially where one's trying to discover a difference between the sets, it's important to allow the query optimizer to choose dynamically between hash, merge and nested loop algorhithms, which may not be possible if the query is structured using a coordinated subquery. Better to create these as derived tables in the FROM clause.
I have found similar issues using the EXISTS statement on a SQL 08 R2 server, where the exact same statement runs fine on SQL 08 and SQL 05.
I found that changing something like
WHILE EXISTS(SELECT * FROM X)
Would be super slow, but:
WHILE ISNULL((SELECT TOP 1 ID FROM X), 0) <> 0
Runs perfectly fast again.
To me, it seems like an R2 issue...
I would guess that the plan you get is quite different when you use the join. Perhaps the imbalance in the number of rows (very large outer table, smaller inner table) is giving the optimizer fits, but it can probably eliminate rows much easier with the join (you'll probably see additional loop operators with the worse query). Tough to really guess without seeing the plans or being able to reproduce, but you should always aim at eliminating the most rows as early in the plan as possible. Pulling back millions of rows through several operators / subqueries only to eliminate most of them later in the plan is almost certainly going to yield worse performance.
I have a table with posts, which are categorized by:
type
tag
language
All of those "categories" are stored in next tables (posts_types) and connected via next tables (posts_types_assignment).
COUNTing in PostgreSQL is really slow (i have more than 500k records in that table) and i need to get the number of posts categorized by any combination of type/tag/lang.
If i would solve it through triggers, it would be full of many multi-level loops, which really doesn't look like nice and is hard to maintenance.
Is there any other solution how to effectively get actual number of posts categorized in any type/tag/language?
Let me get this straight.
You have a table posts. You have a table posts_types. The two have a many to many join on posts_types_assignment. And you have some query like this that is slow:
SELECT count(*)
FROM posts p
JOIN posts_types_assigment pta1
ON p.id = pta1.post_id
JOIN posts_types pt1
ON pt1.id = pta1.post_type_id
AND pt1.type = 'language'
AND pt1.name = 'English'
JOIN posts_types_assigment pta2
ON p.id = pta2.post_id
JOIN posts_types pt2
ON pt2.id = pta2.post_type_id
AND pt2.type = 'tag'
AND pt2.name = 'awesome'
And you would like to know why it is painfully slow.
My first note is that PostgreSQL would have to do a lot less work if you had the identifiers in the posts table rather than in the joins. But that is a moot issue, the decision has been made.
My more useful note is that I believe that PostgreSQL has a similar query optimizer to Oracle. In which case to limit the combinatorial explosion of possible query plans that it has to consider, it only considers plans that start with some table, and then repeatedly joins on one more data set at a time. However no such query plan will work here. You can start with pt1, get 1 record, then go to pta1, get a bunch of records, join p, wind up with the same number of records, then join pta2, and now you get a huge number of records, then join to pt2, get just a few records. Joining to pta2 is the slow step, because the database has no idea which records you want, and therefore has to create a temporary result set for every combination of a post and a piece of metadata (type, language or tag) on it.
If this is indeed your problem, then the right plan looks like this. Join pt1 to pta1, put an index on it. Join pt2 to pta2, then join to the result of the first query, then join to p. Then count. This means that we don't get huge result sets.
If this the case, there is no way to tell the query optimizer that this once you want it to think up a new type of execution plan. But there is a way to force it.
CREATE TEMPORARY TABLE t1
AS
SELECT pta*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
WHERE pt.type = 'language'
AND pt.name = 'English';
CREATE INDEX idx1 ON t1 (post_id);
CREATE TEMPORARY TABLE t2
AS
SELECT pta*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
JOIN t1
ON t1.post_id = pta.post_id
WHERE pt.type = 'language'
AND pt.name = 'English';
SELECT COUNT(*)
FROM posts p
JOIN t1
ON p.id = t1.post_id;
Barring random typos, etc, this is likely to perform somewhat better. If it doesn't, double check the indexes on your tables.
As btilly notes, and if he has correctly guessed the schema, the table design does not help - it seems (at first sight, at least) that, for example, to have three tables posts_tag(post_id,tag) post_lang(post_id,lang) post_type(post_id,type) would be more natural and much more efficient.
Apart from that (or in addition to that), one could think of a table or materialized view that summarizes all the possible countings, with columns (lang,type,tag,nposts). Of course, to compute this in full would be VERY slow, but (apart from the first time) it can be done either in full "in background", at some intervals (if the data does not vary much, and if you don't require exact counts), or eagerly with triggers.
See for example here