PostgreSQL gives the same explanation for different queries

I have some complex queries that produce the same result; the only difference is the order of operations. For example, one query performs the selection before the join, while the other performs the join first and then the selection. However, when I look at the explanation (on the Explain tab in pgAdmin III), both queries show the same diagram.
Why?

I'm not a pro at explaining this with all the correct terminology, but essentially the query planner attempts to find the most efficient way to execute the statement. It does this by breaking the statement down into simpler steps; just because you write it one way doesn't mean that's the order in which the plan will be executed. It's a bit like precedence in arithmetic (brackets, multiplication, division, etc.).
Certain operations will influence the order of execution, enabling you to "tune" your queries to make them more efficient. See http://www.postgresql.org/docs/current/interactive/performance-tips.html
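As a rough illustration (the table and column names below are made up), you can run EXPLAIN on two logically equivalent phrasings and will normally see the planner produce the same plan for both:

-- "Selection first": the filter is written inside a subquery, before the join.
EXPLAIN
SELECT o.id, c.name
FROM (SELECT * FROM customers WHERE country = 'DE') c
JOIN orders o ON o.customer_id = c.id;

-- "Join first": the same filter is written after the join, in the WHERE clause.
EXPLAIN
SELECT o.id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.country = 'DE';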

Related

How to: Change actual execution method from "row" to "batch" - Azure SQL Server

I am having some major issues. When inserting data into my database, I am using an INSTEAD OF INSERT trigger which performs a query.
On my TEST database, this query takes much less than 1 second for insert of a single row. In production however, this query takes MUCH longer (> 30 seconds for 1 row).
When comparing the execution plans for both of them, there seem to be some CLEAR differences:
Test has: "Actual Execution Method: Batch"
Prod has: "Actual Execution Method: Row"
Test has: "Actual number of rows: 1"
Prod has: "Actual number of rows: 92.000.000"
Less than a week ago production was running similar to test. But not anymore - sadly.
Can any of you help me figure out why?
I believe, if I can just get the same execution plan for both, it should be no problem.
Sometimes using the query hint OPTION (HASH JOIN) helps force a query plan into batch processing mode. The following query, which uses the AdventureWorks2012 sample database, demonstrates what I mean.
SELECT s.OrderDate, s.ShipDate, SUM(d.OrderQty), AVG(d.UnitPrice), AVG(d.UnitPriceDiscount)
FROM Demo d
JOIN Sales.SalesOrderHeader s
    ON d.SalesOrderID = s.SalesOrderID
WHERE d.OrderQty > 500
GROUP BY s.OrderDate, s.ShipDate
The above query uses row mode. With the query hint added, it uses batch mode.
SELECT s.OrderDate, s.ShipDate, SUM(d.OrderQty), AVG(d.UnitPrice), AVG(d.UnitPriceDiscount)
FROM Demo d
JOIN Sales.SalesOrderHeader s
    ON d.SalesOrderID = s.SalesOrderID
WHERE d.OrderQty > 500
GROUP BY s.OrderDate, s.ShipDate
OPTION (HASH JOIN)
You don't get to force row vs. batch processing directly in SQL Server. It is a cost-based decision in the optimizer. You can (as you have noticed) force a previously generated plan that uses batch mode. However, there is deliberately no "only use batch mode" option, as it is not always the fastest. Batch mode execution is like a turbo on a car engine - it works best when you are working with larger sets of rows. It can be slower on small-cardinality OLTP queries.
If you have a case where you sometimes process 1 row and sometimes 92M rows, the bigger problem is that high variance in the number of rows processed by the query. That can make it very hard to find a plan that is optimal for all scenarios, whether the variance comes from parameter sensitivity or from the shape of the query plan itself. Ultimately, the solutions for this kind of problem are either to use OPTION (RECOMPILE), if the cost of compilation is far less than the cost of running a bad plan, or (as you have done) to find a specific plan in the Query Store that you can force and that works well enough for all cases.
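For completeness, this is roughly what the recompile approach would look like on the earlier example (the @MinQty parameter is only an illustration and is not from the original question):

DECLARE @MinQty int = 500;

SELECT s.OrderDate, s.ShipDate, SUM(d.OrderQty), AVG(d.UnitPrice), AVG(d.UnitPriceDiscount)
FROM Demo d
JOIN Sales.SalesOrderHeader s
    ON d.SalesOrderID = s.SalesOrderID
WHERE d.OrderQty > @MinQty
GROUP BY s.OrderDate, s.ShipDate
OPTION (RECOMPILE);   -- re-optimize for the actual parameter value on every execution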
Hope that helps explain what is happening under the hood.
I have found a somewhat satisfying solution to my problem.
By going into the Query Store of the database, using Microsoft SQL Server Management Studio, I was able to force a specific plan for a specific query - but only if that plan had already been generated for the query.
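If you prefer a script over the SSMS dialogs, the same forcing can be done in T-SQL. The identifiers below are placeholders you would first look up in the Query Store views:

-- Find the query and its captured plans (the LIKE pattern is just an example).
SELECT q.query_id, p.plan_id, qt.query_sql_text
FROM sys.query_store_query AS q
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
WHERE qt.query_sql_text LIKE '%SalesOrderHeader%';

-- Force a plan that was already captured for that query (IDs are placeholders).
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;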

Parallel queries on a CTE used for write operations in PostgreSQL

From PostgreSQL 9.6 Release Notes:
Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized.
My question is: if a CTE (WITH clause) contains only read operations, but its result is used to feed a write operation, like an INSERT or UPDATE, are parallel sequential scans also disallowed?
I mean, since a CTE is much like a temporary table that only exists for the currently executing query, can I assume that its inner query can take advantage of the brand-new parallel seq scan of PostgreSQL 9.6? Or is it treated like a subquery of the outer statement and therefore cannot use a parallel scan?
For example, consider this query:
WITH foobarbaz AS (
SELECT foo FROM bar
WHERE some_expensive_function(baz)
)
DELETE FROM bar
USING foobarbaz
WHERE bar.foo = foobarbaz.foo
;
Is the foobarbaz calculation expected to be parallelizable, or is it disallowed because of the DELETE statement?
If it isn't allowed, I thought I could replace the CTE with a CREATE TEMPORARY TABLE statement. But I think I would run into the same issue, as CREATE TABLE is a write operation. Am I wrong?
Lastly, one more thing I could try is to perform it as a pure read operation and use its result as input for INSERT and/or UPDATE operations. Outside of a transaction it should work. But the question is: if the read operation and the INSERT/UPDATE are between BEGIN and COMMIT statements, will it be disallowed anyway? I understand they are two completely different operations, but they run in the same transaction in Postgres.
To be clear, my concern is that I have an awful mass of hard-to-read and hard-to-redesign SQL queries that involve multiple sequential scans with slow function calls and that perform complex changes over two tables. The whole process runs in a single transaction because otherwise, in case of failure, the mess would be totally unrecoverable.
My hope is to be able to parallelize some sequential scans to take advantage of the machine's 8 CPU cores and complete the process sooner.
Please don't answer that I need to fully redesign that mess: I know, and I'm working on it. But it is a large project and we need to keep working in the meantime.
Anyway, any suggestion will be appreciated.
EDIT:
Here is a brief report of what I have been able to discover so far:
As @a_horse_with_no_name says in his comment (thanks), the CTE and the rest of the query form a single DML statement, and if it contains a write operation, even outside of the CTE, then the CTE cannot be parallelized (I also tested this).
I also found this wiki page with more concise information about parallel scans than what I found in the release notes link.
An interesting point I was able to verify thanks to that wiki page is that I need to declare the involved functions as parallel safe. I did it and it worked (in a test without writes).
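For reference, marking a function as parallel safe is a one-line change; the argument type below is an assumption, since the real signature isn't shown in the question:

-- Assumed argument type; adjust to the actual signature of the function.
ALTER FUNCTION some_expensive_function(text) PARALLEL SAFE;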
Another interesting point is what @a_horse_with_no_name suggests in his second comment: using dblink to perform a pure read-only query. But, investigating that a bit, I saw that postgres_fdw, which the wiki explicitly mentions as not supporting parallel scans, provides roughly the same functionality using a more modern and standards-compliant infrastructure.
On the other hand, even if it worked, I would end up getting data from outside the transaction, which in some cases would be acceptable for me but is not, I think, a good idea as a general solution.
Finally, I verified that it is possible to perform a parallel scan in a read-only query inside a transaction, even if the transaction later performs write operations (no exception is raised and I could commit).
...in summary, I think my best bet (if not the only one) is to refactor the script so that it reads the data into memory first and then performs the write operations later in the same transaction.
It will increase I/O overhead but, given the latencies I am dealing with, it will still be better.
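A minimal sketch of that split, assuming the application collects the keys between the two statements (the literal values in the DELETE are placeholders):

BEGIN;

-- Read-only phase: eligible for parallel sequential scans.
SELECT foo
FROM bar
WHERE some_expensive_function(baz);

-- Write phase, same transaction, driven by the keys read above.
DELETE FROM bar
WHERE foo IN (1, 2, 3);   -- placeholder values collected by the application

COMMIT;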

Old vs New Style Joins

SQL gets processed in this order: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.
In the new style of join syntax (explicitly using the word JOIN), why isn't this faster than the old style of joins (listing tables and then using a WHERE clause)?
From gets processed before Where, so why wouldn't the newer style of join be faster?
The way that I imagine it is like this:
If you use the old style syntax, you are looking at entire tables and then filtering out the results.
If you use the new style syntax, you are filtering out your results first before moving to a 2nd step.
Am I missing something?
When you send a query to PostgreSQL, it doesn't always do scanning, filtering, etc. in the same order. It examines the query, the tables involved, and any constraints or indexes that might be relevant, and comes up with an execution plan. If you want to see the execution plan for a query, you can use EXPLAIN, and it will invoke the planner without actually executing the query. Here's some documentation for EXPLAIN.
You tagged your question for postgresql, but other RDBMSes have similar facilities for examining the query plan.
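For example (with made-up tables), you can compare the two styles yourself and will usually see identical plans:

-- Old-style join: tables listed in FROM, join condition in the WHERE clause.
EXPLAIN
SELECT o.id, c.name
FROM orders o, customers c
WHERE o.customer_id = c.id
  AND c.country = 'DE';

-- New-style (explicit) join: the same logical query.
EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'DE';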

Vastly different query run time in application

I'm having a scaling issue with an application that uses a PostgreSQL 9 backend. I have one table whose size is about 40 million records and growing, and the conditional queries against it have slowed down dramatically.
To help figure out what's going wrong, I've taken a development snapshot of the database and dumped the queries, along with their execution times, into the log.
Now for the confusing part, and the gist of the question ....
The run times for my queries in the log are vastly different (an order of magnitude or more) from what I get when I run the 'exact' same query in DbVisualizer to get the explain plan.
I say 'exact', but really the difference is that the application is using a prepared statement to which I bind values at runtime, while the queries I run in DbVisualizer have those values in place already. The values themselves are exactly as I pulled them from the log.
Could the use of prepared statements make that big of a difference?
The answer is YES. Prepared statements cut both ways.
On the one hand, the query does not have to be re-planned for every execution, saving some overhead. This can make a difference or be hardly noticeable, depending on the complexity of the query.
On the other hand, with uneven data distribution, a one-size-fits-all query plan may be a bad choice. Called with particular values another query plan could be (much) better suited.
Running the query with parameter values in place can lead to a different query plan. More planning overhead, possibly a (much) better query plan.
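One way to see this for yourself (the table and column names are made up) is to compare the plan of a prepared statement with the plan of the same query with the value inlined:

-- Prepare a parameterized statement.
PREPARE orders_by_customer (int) AS
    SELECT * FROM orders WHERE customer_id = $1;

-- Plan used for the prepared statement (may be a generic, one-size-fits-all plan).
EXPLAIN EXECUTE orders_by_customer(42);

-- Plan when the value is a literal: the planner can exploit the actual value.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;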
Also consider unnamed prepared statements like @peufeu describes. Those re-plan the query for the given parameters every time - and you still have safe parameter handling.
Similar considerations apply to queries inside PL/pgSQL functions, where queries can be treated as prepared statements internally - unless executed dynamically with EXECUTE. I quote the manual on Executing Dynamic Commands:
The important difference is that EXECUTE will re-plan the command on
each execution, generating a plan that is specific to the current
parameter values; whereas PL/pgSQL may otherwise create a generic plan
and cache it for re-use. In situations where the best plan depends
strongly on the parameter values, it can be helpful to use EXECUTE to
positively ensure that a generic plan is not selected.
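A small sketch of that difference inside PL/pgSQL (the function and table names are invented for illustration):

CREATE OR REPLACE FUNCTION count_orders(p_customer int)
RETURNS bigint
LANGUAGE plpgsql AS
$$
DECLARE
    result bigint;
BEGIN
    -- A static query here may be planned once and cached as a generic plan:
    -- SELECT count(*) INTO result FROM orders WHERE customer_id = p_customer;

    -- A dynamic query via EXECUTE is re-planned for the actual value on every call.
    EXECUTE 'SELECT count(*) FROM orders WHERE customer_id = $1'
        INTO result
        USING p_customer;
    RETURN result;
END;
$$;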
Apart from that, general guidelines for performance optimization apply.
Erwin nails it, but let me add that the extended query protocol allows you to use more flavors of prepared statements. Besides avoiding re-parsing and re-planning, one big advantage of prepared statements is that parameter values are sent separately, which avoids escaping and parsing overhead - not to mention the risk of SQL injection and escaping bugs you run when you don't use an API that handles parameters in such a way that you can't forget to escape them.
http://www.postgresql.org/docs/9.1/static/protocol-flow.html
Query planning for named prepared-statement objects occurs when the
Parse message is processed. If a query will be repeatedly executed
with different parameters, it might be beneficial to send a single
Parse message containing a parameterized query, followed by multiple
Bind and Execute messages. This will avoid replanning the query on
each execution.
The unnamed prepared statement is likewise planned during Parse
processing if the Parse message defines no parameters. But if there
are parameters, query planning occurs every time Bind parameters are
supplied. This allows the planner to make use of the actual values of
the parameters provided by each Bind message, rather than use generic
estimates.
So, if your DB interface supports it, you can use unnamed prepared statements. It's a bit of a middle ground between a query and a usual prepared statement.
If you use PHP with PDO, please note that PDO's prepared statement implementation is rather useless for Postgres: it uses named prepared statements but re-prepares every time you call prepare(), so no plan caching takes place. You get the worst of both worlds: many round trips and a plan built without parameter values. I've seen it be 1000x slower than pg_query() and pg_query_params() on specific queries where the Postgres optimizer really needs to know the parameters to produce the optimal plan. pg_query() uses raw queries, pg_query_params() uses unnamed prepared statements. Usually one is faster than the other, depending on the size of the parameter data.

Using a UNION or UNION ALL on two select statements makes them incredibly slower

I have two queries, let's call them Query A and Query B.
Both of these queries run in under a second for the scenario I'm testing; Query A returns 1 result and Query B returns 0 results.
If I union (or union all) these two queries, it takes over a minute to return the (expected) 1 result.
Both queries select the same columns from the same tables. I could potentially rewrite this entire thing without a union by having a highly conditional where clause but I was trying to get away from doing that.
Any ideas? I'm not sure how much of the exact query and schema I can get away with sharing, but I'm happy to provide what I can.
This is on MSSQL 2008 if it matters to anyone's response.
I would try looking at the execution plans within Management Studio for the individual queries, and then compare that to the execution plan for the query containing the UNION.
If there's that drastic of a difference in the execution times, I would imagine that there's something wrong with the execution plan for the UNION'd query. Identifying what's different will help point you (and maybe us) in the right direction on what the underlying problem is.
The separate clauses in a UNION that are very similar and on the same tables can be merged into one query by the optimiser. You can see this by the absence of a UNION operator in the query plan. I've seen similar things before, but rarely.
What you can do is a SELECT .. INTO #temp ... for the first query, followed by an INSERT INTO #temp ... for the second (see the sketch below).
Now, where did I read this...
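A rough sketch of that workaround, with invented table and column names (the two WHERE clauses stand in for Query A and Query B):

-- First query populates the temp table.
SELECT OrderId, Total
INTO #results
FROM dbo.Orders
WHERE Status = 'Open';          -- stands in for Query A

-- Second query appends its rows instead of being UNIONed.
INSERT INTO #results (OrderId, Total)
SELECT OrderId, Total
FROM dbo.Orders
WHERE Status = 'Cancelled';     -- stands in for Query B

SELECT OrderId, Total FROM #results;

DROP TABLE #results;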
Are they both doing table scans? It sounds like the combined query might be exceeding cache capacity and spilling to disk.
Even if they read from the same table, the records would probably be locked independently.