PostgreSQL approaching plan caching on identical queries? - postgresql

I am running some benchmarks tests on a lot of queries. I have a set of queries and they will be run multiple times after each other. I know that PostgreSQL caches query plans so this is important to consider but as far as I know this does not always happen.
So I have two approaches. I am considering to either (a) force the query plan to be generated each time I run a query or either (b) to 'warm up' a bit so that a plan is cached and it is reused each time. How can I perform either or what precautions can I take to ensure that one or the other is happening?
It would be great if I could monitor plans in the cache but I am not sure if it is possible.
UPDATE: My queries are complex SELECTs to retrieve data, no DELETEs/INSERTs etc. Does this mean I should not give so much respect to the query planner in benchmarks?

PostgreSQL only caches query plans if
you use prepared statements
the statement is executed inside a PL/pgSQL function
So if you want to benchmark how much faster your queries become if you avoid the overhead of planning, you should create a prepared statement and execute it al least six times (because the first five runs will always generate a custom plan).
If your queries are complex, odds are that you might even lose if you cache query plans, particularly if the runtime of the queries is long. In such a case, it is usually better to spend more effort on planning each query. The biggest win with prepared statements is when the execution time of the queries is low.

Related

How can I bench mark query performance in postgreSQL? CPU or TIME but need to be consistant for every run

How can I bench mark SQL performance in postgreSQL? I tried using Explain Analyze but that gives varied Execution time every time when I repeat same query.
I am applying some tuning techniques on my query and trying to see whether my tuning technique is improving the query performace. The Explain analyze has varying execution times that I cant bechmark and compare . The tuning has imapact in MilliSeconds so I am looking for bechmarch that can give fixed values to compare against.
There will always be variations in the time it takes a statement to complete:
Pages may be cached in memory or have to be read from disk. This is usually the source of the greatest deviations.
Concurrent processes may need CPU time
You have to wait for internal short time locks (latches) to access a data structure
These are just the first three things that come to my mind.
In short, execution time is always subject to small variations.
Run the query several times and take the the median of the execution times. That is as good as it gets.
Tuning for milliseconds only makes sense if it is a query that is executed a lot.
Also, tuning only makes sense if you have realistic test data. Don't make the mistake to examine and tune a query with only a few test data when it will have to perform with millions of rows.

Benchmarking Redshift Queries

I want to know how long my queries take to execute, so that I can see whether my changes improve the runtime or not.
Simply timing the executing of the whole query is unsuitable, since this also takes into account the (highly variable) time spent waiting in an execution queue.
Redshift provides the STL_WLM_QUERY table that contains separate columns for queue wait time and execution time. However, my queries do not reliably show up in this table. For example if I execute the same query multiple times the number of corresponding rows in STL_WLM_QUERY is often much smaller than the number of repetitions. Sometimes, but not always, only one row is generated no matter how often I run the query. I suspect some caching is going on.
Is there a better way to find the actual execution time of a Redshift query, or can someone at least explain under what circumstances exactly a row in STL_WLM_QUERY is generated?
My tips
If possible, ensure that your query has not waited at all, if it has
there should be a row on stl_wlm_query. If it did wait - then rerun
it.
Run the query once to compile it, then a second time to benchmark
it. compile time can be significant
Disable the new query result caching feature (if you have it yet -
you probably don't)
(https://aws.amazon.com/about-aws/whats-new/2017/11/amazon-redshift-introduces-result-caching-for-sub-second-response-for-repeat-queries/)

How to get the execution plan of a prepared statement work?

How does the DBMS (postgres in my case) deals with execution plan and prepared statement.
The parameters of a query can have a huge impact on the execution plan, mainly due to data statistics.
It might prefer in certain cases use an index if the data is well distributed but for a particular value prefer a sequential scan because the parameter is not discriminant (usually when the parameter matches > 10% of table rows)
I am wondering if prepared statement are always a good way to improve performances or if it more a kind of "best effort"
Thanks in advance
Edit: my concern is about running frequently the same query, but with other parameters that could need to vary the execution plan. It is quite hard to measure the performance gain of prepared statement vs always have the most accurate execution plan
A prepared statement is a GREAT way to make the same simple query run over and over faster. For instance, something like
insert into table values ($1,$2,$3);
OTOH it is NOT a great way to make big ugly complex reporting queries run faster, where the data set may change based on what's in the where clause.
The whole point of prepared queries is to save the somewhat expensive step of query planning over and over. For the simple insert listed above, run 1,000 times, the cost of planning adds up.
OTOH for a big complex reporting query, the planning time is inconsequential. Most big reporting queries etc take seconds to minutes to even hours to run. The planning time, measured in milliseconds, is not worth worrying about here.

Understanding EXPLAIN function in Postgresql

Trying to understand EXPLAIN function - I have two queries - first query is optimised, that is running 600 ms(I have 100k rows) and second query is running 900 ms
But when I run EXPLAIN ANALYZE - first query, that is running quickly shows me cost - 64296 and second query, that is running slowly shows me cost - 20873
can't understand why faster query has bigger cost, and why longer running query has smaller cost.
Could someone give me some hint ?
PostgreSQL EXPLAIN is an animal that really has a lot of arms & legs, each of which can cause it to work in a way that isn't easy to understand at first.
To answer your question, I understand that although running the first query Q1 (not its EXPLAIN), it runs faster than the second (Q2), but when you do an EXPLAIN ANALYSE, Q1 actually has a higher cost.
I could think of two reasons that come to mind at this moment:
If the Queries are LIMIT queries, its possible for Q1 to execute faster and still have higher 'cost', since the PostgreSQL Planner (intentionally) does not plan for a smaller total cost, but a smaller cost of the required result (in this case, a smaller number of rows).
Another reason could be that caching could be playing havoc with your times. Could you confirm if the observation is persistent with multiple (3+) runs?
Besides these hunches, if you really want to get deep into understanding EXPLAIN, recommend you to refer the following articles here, here and here.
Cost is what planner thinks about how many recourses (I/O and CPU time) it will take to perform the query. It's just an estimation, calculated by a mathematical model.
In your case planner was wrong, it chose suboptimal plan. It happens sometimes.
Why? There could be many reasons. Maybe statistics are inadequate (try to run analyze for your tables first of all). Maybe statistics are ok, but planner uses the wrong model (for example, you may have correlated predicates in your query which are known to be problematic). Maybe your query is over several dozens of tables and planner just can't go through all possible plans. And so on.

Vastly different query run time in application

I'm having a scaling issue with an application that uses a PostgreSQL 9 backend. I have one table who's size is about 40 million records and growing and the conditional queries against it have slowed down dramatically.
To help figure out what's going wrong, I've taken a development snapshot of the database and dump the queries with the execution time into the log.
Now for the confusing part, and the gist of the question ....
The run times for my queries in the log are vastly different (an order of magnitude+) that what I get when I run the 'exact' same query in DbVisualizer to get the explain plan.
I say 'exact' but really the difference is, the application is using a prepared statement to which I bind values at runtime while the queries I run in DbVisualizer has those values in place already. The values themselves are exactly as I pulled them from the log.
Could the use of prepared statements make that big of a difference?
The answer is YES. Prepared statements cut both ways.
On the one hand, the query does not have to be re-planned for every execution, saving some overhead. This can make a difference or be hardly noticeable, depending on the complexity of the query.
On the other hand, with uneven data distribution, a one-size-fits-all query plan may be a bad choice. Called with particular values another query plan could be (much) better suited.
Running the query with parameter values in place can lead to a different query plan. More planning overhead, possibly a (much) better query plan.
Also consider unnamed prepared statements like #peufeu provided. Those re-plan the query considering parameters every time - and you still have safe parameter handling.
Similar considerations apply to queries inside PL/pgSQL functions, where queries can be treated as prepared statements internally - unless executed dynamically with EXECUTE. I quote the manual on Executing Dynamic Commands:
The important difference is that EXECUTE will re-plan the command on
each execution, generating a plan that is specific to the current
parameter values; whereas PL/pgSQL may otherwise create a generic plan
and cache it for re-use. In situations where the best plan depends
strongly on the parameter values, it can be helpful to use EXECUTE to
positively ensure that a generic plan is not selected.
Apart from that, general guidelines for performance optimization apply.
Erwin nails it, but let me add that the extended query protocol allows you to use more flavors of prepared statements. Besides avoiding re-parsing and re-planning, one big advantage of prepared statements is to send parameter values separately, which avoids escaping and parsing overhead, not to mention the opportunity for SQL injections and bugs if you don't use an API that handles parameters in a manner you can't forget to escape them.
http://www.postgresql.org/docs/9.1/static/protocol-flow.html
Query planning for named prepared-statement objects occurs when the
Parse message is processed. If a query will be repeatedly executed
with different parameters, it might be beneficial to send a single
Parse message containing a parameterized query, followed by multiple
Bind and Execute messages. This will avoid replanning the query on
each execution.
The unnamed prepared statement is likewise planned during Parse
processing if the Parse message defines no parameters. But if there
are parameters, query planning occurs every time Bind parameters are
supplied. This allows the planner to make use of the actual values of
the parameters provided by each Bind message, rather than use generic
estimates.
So, if your DB interface supports it, you can use unnamed prepared statements. It's a bit of a middle ground between a query and a usual prepared statement.
If you use PHP with PDO, please note that PDO's prepared statement implementation is rather useless for postgres, since it uses named prepared statements, but re-prepares every time you call prepare(), no plan caching takes place. So you get the worst of both : many roundtrips and plan without parameters. I've seen it be 1000x slower than pg_query() and pg_query_params() on specific queries where the postgres optimizer really needs to know the parameters to produce the optimal plan. pg_query uses raw queries, pg_query_params uses unnamed prepared statements. Usually one is faster than the other, that depends on the size of parameter data.