SQL gets processed in this order:
FROM,
WHERE,
GROUP BY,
HAVING,
SELECT,
ORDER BY
In the newer join syntax (explicitly using the JOIN keyword), why isn't this faster than the old style (listing tables in the FROM clause and joining them in the WHERE clause)?
FROM gets processed before WHERE, so why wouldn't the newer style of join be faster?
The way that I imagine it is like this:
If you use the old style syntax, you are looking at entire tables and then filtering out the results.
If you use the new style syntax, you are filtering out your results first before moving to a 2nd step.
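For concreteness, here are the two styles with hypothetical tables orders and customers; both queries mean exactly the same thing:

-- Old style: implicit join, everything filtered in the WHERE clause
SELECT o.id, c.name
FROM orders o, customers c
WHERE o.customer_id = c.id
  AND o.total > 100;

-- New style: explicit JOIN with an ON clause
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.total > 100;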
Am I missing something?
When you send a query to PostgreSQL, it doesn't always do the scanning, filtering, etc. in the same order. It examines the query, the tables involved, and any constraints or indexes that might be relevant, and comes up with an execution plan. If you want to see the execution plan for a query, you can use EXPLAIN, which will invoke the planner without actually executing the query. Here's some documentation for EXPLAIN.
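For example (the table names are made up), prefixing a query with EXPLAIN shows the chosen plan without running it, and EXPLAIN ANALYZE runs it and adds real row counts and timings:

EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.total > 100;

-- EXPLAIN ANALYZE actually executes the statement and reports measured timings
EXPLAIN ANALYZE
SELECT count(*) FROM orders WHERE total > 100;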
You tagged your question for postgresql, but other RDBMSes have similar facilities for examining the query plan.
(Postgres 11.7)
I'm using the Rows pg_hint_plan hint to dynamically fix a bad row-count estimate.
My query accepts an array of arguments, which get unnested and joined onto the rest of the query as a predicate. By default, the query planner always assumes this array-argument contains 100 records, whereas in reality this number could be very different. This bad estimate was resulting in poor query plans. I set the number of rows definitively from within the calling application, by changing the hint text per query.
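As a sketch of what that looks like (the alias names, array literal, and row count here are invented), pg_hint_plan's Rows hint corrects the estimated row count of the join between the named relations:

/*+ Rows(args t #5000) */
SELECT t.*
FROM unnest('{1,2,3}'::int[]) AS args(id)  -- in the real query this is the array argument
JOIN some_table t ON t.id = args.id;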
This approach seems to work sometimes, but I see some strange behaviour testing the query (in DBeaver).
If I start with a brand new connection, when I explain the query (or indeed just run it), the hint seems to be ignored for the first 6 executions, but thereafter it starts getting interpreted correctly. This is consistently reproducible: I see the offending row count estimates change on the 7th execution on a new connection.
More interestingly, the query also uses some (immutable) functions to do some lookup operations. If I remove these and replace them with an equivalent CTE or sub-select, this strange behaviour seems to disappear, and the hints are evaluated correctly all the time, even on a brand new connection.
What could be causing it to not honour the pg_hint_plan hints until after 6 requests have been made in that session? Why does the presence of the functions have a bearing on the hints?
Since you are using JDBC, try setting the prepareThreshold connection parameter to 1, as detailed in the documentation.
That will make the driver use a server-side prepared statement as soon as possible, and it seems like this extension only works in that case.
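For example (host, port, and database name are placeholders), the parameter can be added directly to the JDBC connection URL:

jdbc:postgresql://localhost:5432/mydb?prepareThreshold=1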
From PostgreSQL 9.6 Release Notes:
Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized.
My question is: if a CTE (WITH clause) contains only read operations, but its result is used to feed a write operation, like an INSERT or UPDATE, is it also disallowed to parallelize its sequential scans?
I mean, since a CTE is much like a temporary table that only exists for the currently executing query, can I assume that its inner query can take advantage of the brand new parallel seq scan in PostgreSQL 9.6? Or is it treated like a plain subquery and unable to perform a parallel scan?
For example, consider this query:
WITH foobarbaz AS (
    SELECT foo
    FROM bar
    WHERE some_expensive_function(baz)
)
DELETE FROM bar
USING foobarbaz
WHERE bar.foo = foobarbaz.foo;
Is the foobarbaz calculation expected to be parallelizable, or is that disallowed because of the DELETE statement?
If it isn't allowed, I thought I could replace the CTE with a CREATE TEMPORARY TABLE statement. But I think I would run into the same issue, as CREATE TABLE is a write operation. Am I wrong?
Lastly, a last resort I could try is to perform it as a pure read operation and use its result as input for INSERT and/or UPDATE operations. Outside of a transaction it should work. But the question is: if the read operation and the INSERT/UPDATE sit between BEGIN and COMMIT statements, will it be disallowed anyway? I understand they are two completely different operations, but they run in the same transaction and the same Postgres session.
To be clear, my concern is that I have an awful mass of hard-to-read and hard-to-redesign SQL queries that involve multiple sequential scans with slow function calls and that perform complex changes over two tables. The whole process runs in a single transaction because, if it didn't, the mess in case of failure would be totally unrecoverable.
My hope is to be able to parallelize some sequential scans to take advantage of the machine's 8 CPU cores and complete the process sooner.
Please don't answer that I need to fully redesign that mess: I know, and I'm working on it. But it is a large project and we need to keep working in the meantime.
Anyway, any suggestion will be appreciated.
EDIT:
Here is a brief report of what I have been able to discover so far:
As @a_horse_with_no_name says in his comment (thanks), the CTE and the rest of the query form a single DML statement, and if it contains a write operation, even outside of the CTE, then the CTE cannot be parallelized (I tested this too).
Also I found this wiki page with more concise information about parallel scans than what I found in the release notes link.
An interesting point I was able to verify thanks to that wiki page is that I need to declare the involved functions as parallel safe. I did that and it worked (in a test without writes).
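For reference, that declaration is a one-line change per function; a sketch using the function from the example query above (the signature is an assumption, adjust it to match the real function):

-- assuming baz is of type text; PostgreSQL 9.6+
ALTER FUNCTION some_expensive_function(text) PARALLEL SAFE;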
Another interesting point is what @a_horse_with_no_name says in his second comment: using dblink to perform a pure read-only query. But, investigating that a bit, I saw that postgres_fdw, which is explicitly mentioned in the wiki as not supporting parallel scans, provides roughly the same functionality using a more modern and standards-compliant infrastructure.
And, on the other hand, even if it worked, I would end up getting data from outside the transaction, which in some cases would be acceptable for me but, I think, is not a good idea as a general solution.
Finally, I checked that it is possible to perform a parallel scan in a read-only query inside a transaction, even if the transaction later performs write operations (no exception is triggered and I could commit).
...in summary, I think that my best bet (if not the only one) would be to refactor the script so that it reads the data into memory first and then performs the write operations in the same transaction.
It will increase I/O overhead but, given the latencies I'm dealing with, it will still come out ahead.
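A sketch of that shape (table and column names come from the example query above; the literal array is a placeholder for the values the application collects between the two statements):

BEGIN;
-- pure read-only statement: eligible for a parallel sequential scan
SELECT foo FROM bar WHERE some_expensive_function(baz);
-- the application collects the returned foo values, then issues the write
DELETE FROM bar WHERE foo = ANY ('{1,2,3}'::int[]);
COMMIT;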
I have some complex queries that produce the same result. The only difference is execution order: for example, one query performs the selection before the join, while the other performs the join first and then the selection. However, when I read the explanation (on the Explain tab, using pgAdmin III), both queries have the same diagram.
Why?
I'm not a pro at explaining this with all the correct terminology, but essentially the planner attempts to find the most efficient way to execute the statement. It does this by breaking the statement down into simpler sub-statements; just because you write it one way doesn't mean that's the order in which it will be executed. It's a bit like precedence in arithmetic (brackets, multiply, divide, etc.).
Certain operations will influence the order of execution, enabling you to "tune" your queries to make them more efficient: http://www.postgresql.org/docs/current/interactive/performance-tips.html
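For example (hypothetical tables a and b), these two formulations describe the same result, and EXPLAIN will typically show an identical plan for both:

-- join first, then filter
SELECT a.*
FROM a
JOIN b ON a.b_id = b.id
WHERE b.kind = 'x';

-- filter first in a subquery, then join
SELECT a.*
FROM a
JOIN (SELECT id FROM b WHERE kind = 'x') AS filtered ON a.b_id = filtered.id;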
I have two queries, let's call them Query A and Query B.
Both of these queries run in under a second for the scenario I'm testing; Query A returns 1 result and Query B returns 0 results.
If I union (or union all) these two queries, it takes over a minute to return the (expected) 1 result.
Both queries select the same columns from the same tables. I could potentially rewrite this entire thing without a union by having a highly conditional where clause but I was trying to get away from doing that.
Any ideas? I'm not sure how much of the exact query and schema I can get away with sharing, but I'm happy to provide what I can.
This is on MSSQL 2008 if it matters to anyone's response.
I would try looking at the execution plans within Management Studio for the individual queries, and then compare that to the execution plan for the query containing the UNION.
If there's that drastic of a difference in the execution times, I would imagine that there's something wrong with the execution plan for the UNION'd query. Identifying what's different will help point you (and maybe us) in the right direction on what the underlying problem is.
Separate clauses in a UNION that are very similar and run against the same tables can be merged into one query by the optimiser. You can see this from the lack of a UNION operator in the query plan. I've seen similar things before, but rarely.
What you can do is a SELECT .. INTO #temp ... for the first query, followed by an INSERT INTO #temp ... for the second.
Now, where did I read this...
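A sketch of that workaround in T-SQL (table, column, and filter values are placeholders); note that it mirrors UNION ALL, so add DISTINCT to the final SELECT if you need UNION semantics:

-- first query creates and populates the temp table
SELECT col1, col2
INTO #temp
FROM SomeTable
WHERE status = 'A';

-- second query appends its rows
INSERT INTO #temp (col1, col2)
SELECT col1, col2
FROM SomeTable
WHERE status = 'B';

SELECT col1, col2 FROM #temp;
DROP TABLE #temp;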
Are they both doing table scans? It sounds like the combined query might be exceeding cache capacity and spilling to disk.
Even if they are from the same table the records would probably lock independently.
I have a situation where running a query that filters by an indexed column in a partitioned table performs a full table scan.
Apparently, this is a known issue in PostgreSQL, and it's explained in detail here.
Is there a more elegant way around this other than performing a query on each partition, and then performing a UNION on all of the results?
Indexes work just fine for scanning only the relevant partitions in PostgreSQL. But you have to set everything up properly for that to work, and it's easy to miss a step in the long list of things documented at http://www.postgresql.org/docs/current/static/ddl-partitioning.html
The main thing to realize is that, in order to avoid a sequential scan, you have to give PostgreSQL enough information to prove that some partitions cannot contain the data you're looking for; then they are skipped as potential sources for the query results. The article you link to points this out as a solution to the seq scan problem: "If you add range constraints to the date field of each partition, this query can be optimized into a loop where you query the “latest” partition first and work backwards until you find a single value that is higher than the range of all the remaining partitions." But it doesn't show the improved plan you'd see after that change.
Some common mistakes you might have made:
- The constraint_exclusion parameter in postgresql.conf is off by default. With that default, you won't get what you expect.
- You didn't create non-overlapping partitions using CHECK constraints, which leaves the planner unable to know what's inside each of them. It's possible to miss this step but still get your data into the right partitions; the planner just won't know that.
- You didn't put an index on each partition, and only created one on the master table. That gives you a sequential scan on just the relevant partition, so it's not as bad as the above, but it's not good either.
There's some work to make this all easier in upcoming PostgreSQL releases (the constraint_exclusion setting gains a "partition" mode that is applied fairly automatically in 8.4, and some sort of partition setup automation is being worked on). Right now, if you follow the instructions carefully and avoid all these problems, it should work.
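Putting those pieces together, here is a minimal sketch of a correct setup for the inheritance-style partitioning this answer describes (table names, ranges, and types are made up):

-- master table
CREATE TABLE measurements (logdate date NOT NULL, value int);

-- non-overlapping partitions with CHECK constraints the planner can prove things about
CREATE TABLE measurements_2009_01 (
    CHECK (logdate >= DATE '2009-01-01' AND logdate < DATE '2009-02-01')
) INHERITS (measurements);

CREATE TABLE measurements_2009_02 (
    CHECK (logdate >= DATE '2009-02-01' AND logdate < DATE '2009-03-01')
) INHERITS (measurements);

-- an index on each partition, not just on the master table
CREATE INDEX measurements_2009_01_logdate ON measurements_2009_01 (logdate);
CREATE INDEX measurements_2009_02_logdate ON measurements_2009_02 (logdate);

-- constraint exclusion must be enabled
SET constraint_exclusion = on;

-- with all of that in place, this should touch only the 2009-01 partition
EXPLAIN SELECT * FROM measurements WHERE logdate = DATE '2009-01-15';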