How does the DBMS (Postgres in my case) deal with execution plans and prepared statements?
The parameters of a query can have a huge impact on the execution plan, mainly due to data statistics.
In certain cases it might prefer to use an index if the data is well distributed, but for a particular value it might prefer a sequential scan because the parameter is not selective enough (usually when the parameter matches more than about 10% of the table's rows).
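For example (a sketch with a hypothetical orders table; the actual plan choice depends on the statistics for the column):

explain select * from orders where customer_id = 42;   -- selective value: an index scan is likely
explain select * from orders where customer_id = 1;    -- very common value: a sequential scan is likely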
I am wondering whether prepared statements are always a good way to improve performance, or whether they are more of a "best effort".
Thanks in advance
Edit: my concern is about frequently running the same query, but with different parameters that might call for a different execution plan. It is quite hard to measure the performance gain of prepared statements versus always having the most accurate execution plan.
A prepared statement is a GREAT way to make the same simple query run over and over faster. For instance, something like
insert into some_table values ($1, $2, $3);
OTOH it is NOT a great way to make big ugly complex reporting queries run faster, where the data set may change based on what's in the where clause.
The whole point of prepared queries is to save the somewhat expensive step of query planning over and over. For the simple insert listed above, run 1,000 times, the cost of planning adds up.
OTOH for a big complex reporting query, the planning time is inconsequential. Most big reporting queries take seconds, minutes, or even hours to run. The planning time, measured in milliseconds, is not worth worrying about here.
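In PostgreSQL that pattern looks roughly like this (a sketch; some_table and the parameter types are placeholders, and most drivers do the same thing for you through their prepared-statement APIs):

prepare ins (int, text, date) as
    insert into some_table values ($1, $2, $3);

execute ins(1, 'foo', '2024-01-01');
execute ins(2, 'bar', '2024-01-02');
-- ... and so on; each EXECUTE reuses the statement and skips the parse/plan step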
I am trying to compare the performance of a view before and after adding an index, so I am trying to measure its performance using the query below:
create table qtemp.ffs as (select * from psavlldsvw) with data
Statement ran successfully (1,932 ms = 1.932 sec)
The above statement is what I have used, where psavlldsvw is the view name.
As you might guess, the idea is to measure how much time the above query takes to complete in both cases.
Can I please get some feedback on how good this method is for comparison?
The test is indeed meaningless...
First of all, the question is poorly worded: you cannot and are not testing a view. Views are performance-neutral on Db2 for i.
Running a statement, adding an index and rerunning the statement is a meaningless test. Db2 for i has all kinds of tricks built in to improve the speed of a repeated statement. Among them:
Input data cached in memory
Data access paths are left open
Starting from a fresh connection, you can ensure that no data is in memory by using SETOBJACC OBJ(YOURLIB/YOURFILE) OBJTYPE(*FILE) POOL(*PURGE) for each table referenced by your statement.
Now run the statement multiple times; at least 3 if the system defaults have not been changed. You should see that the first (few) iterations are slower than the last few. This is a result of the data access path being left open for a repeated statement.
Now add your index, disconnect/reconnect, clear the object(s) from memory and run your tests again.
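Putting the steps together, a single test pass might look roughly like this (a sketch; YOURLIB/YOURFILE stands for each table behind the view, and QSYS2.QCMDEXC is used here only so the CL command can be issued from an SQL session):

-- from a fresh connection, purge each referenced table from memory
call qsys2.qcmdexc('SETOBJACC OBJ(YOURLIB/YOURFILE) OBJTYPE(*FILE) POOL(*PURGE)');

-- run the statement under test several times (at least 3) and record each timing
create table qtemp.ffs as (select * from psavlldsvw) with data;
drop table qtemp.ffs;

-- add the index, disconnect/reconnect, purge again, and repeat the runs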
Depending on the use case for the statement, you may want to focus on the first iteration performance or the later iterations.
Mao is correct that using Visual Explain (VE) is the best way to see whether an index is being used, or otherwise to understand how the query is performing.
Lastly, realize that load on the server affects how the query engine operates. The query engine optimizer will calculate your job's "fair share" of memory, and that value will affect whether or not some more efficient yet memory-intensive plans are used. So if you're testing in a non-prod environment that doesn't exactly match prod in terms of resources, data size and load, the results are likely to differ when the query is moved to prod.
Performance tuning is part art, part science. Generally, use VE to ensure that you've got a decent query to start with. Then monitor actual production use to ensure that it's performing as expected.
How can I benchmark SQL performance in PostgreSQL? I tried using EXPLAIN ANALYZE, but it gives a different execution time every time I repeat the same query.
I am applying some tuning techniques to my query and trying to see whether they improve the query's performance. EXPLAIN ANALYZE shows varying execution times that I can't benchmark and compare. The tuning has an impact measured in milliseconds, so I am looking for a benchmark that can give stable values to compare against.
There will always be variations in the time it takes a statement to complete:
Pages may be cached in memory or have to be read from disk. This is usually the source of the greatest deviations.
Concurrent processes may need CPU time
You have to wait for internal short-term locks (latches) to access a data structure
These are just the first three things that come to my mind.
In short, execution time is always subject to small variations.
Run the query several times and take the median of the execution times. That is as good as it gets.
Tuning for milliseconds only makes sense if it is a query that is executed a lot.
Also, tuning only makes sense if you have realistic test data. Don't make the mistake of examining and tuning a query with only a little test data when it will have to perform against millions of rows.
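If you want to automate the run-several-times-and-take-the-median approach, a minimal PL/pgSQL sketch along these lines runs the statement repeatedly and reports the median (my_table, the predicate and the repetition count are placeholders for your real query):

do $$
declare
    i int;
    t0 timestamptz;
    elapsed double precision[] := '{}';
begin
    for i in 1..11 loop
        t0 := clock_timestamp();
        perform count(*) from my_table where some_column = 42;  -- the query under test
        elapsed := elapsed || (extract(epoch from clock_timestamp() - t0) * 1000)::double precision;
    end loop;
    raise notice 'median runtime: % ms',
        (select percentile_cont(0.5) within group (order by e) from unnest(elapsed) as e);
end;
$$;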
I want to know how long my queries take to execute, so that I can see whether my changes improve the runtime or not.
Simply timing the execution of the whole query is unsuitable, since this also takes into account the (highly variable) time spent waiting in an execution queue.
Redshift provides the STL_WLM_QUERY table that contains separate columns for queue wait time and execution time. However, my queries do not reliably show up in this table. For example if I execute the same query multiple times the number of corresponding rows in STL_WLM_QUERY is often much smaller than the number of repetitions. Sometimes, but not always, only one row is generated no matter how often I run the query. I suspect some caching is going on.
Is there a better way to find the actual execution time of a Redshift query, or can someone at least explain under what circumstances exactly a row in STL_WLM_QUERY is generated?
My tips:
If possible, ensure that your query has not waited at all; if it has waited, there should be a row in stl_wlm_query. If it did wait, then rerun it.
Run the query once to compile it, then a second time to benchmark it. Compile time can be significant.
Disable the new query result caching feature (if you have it yet - you probably don't): https://aws.amazon.com/about-aws/whats-new/2017/11/amazon-redshift-introduces-result-caching-for-sub-second-response-for-repeat-queries/
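To actually read queue versus execution time, a query along these lines should work (a sketch; the text filter is a placeholder, and the times in these system tables are in microseconds):

-- optional, if the result caching feature is already enabled on your cluster:
-- set enable_result_cache_for_session to off;

select w.query,
       w.service_class,
       w.total_queue_time,
       w.total_exec_time,
       trim(q.querytxt) as querytxt
from stl_wlm_query w
join stl_query q on q.query = w.query
where q.querytxt like '%my_benchmark_query%'
order by q.starttime desc
limit 20;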
I am running some benchmark tests on a lot of queries. I have a set of queries, and they will be run multiple times one after another. I know that PostgreSQL caches query plans, so this is important to consider, but as far as I know this does not always happen.
So I have two approaches. I am considering either (a) forcing the query plan to be generated each time I run a query, or (b) 'warming up' a bit so that a plan is cached and reused each time. How can I do either, and what precautions can I take to ensure that one or the other is happening?
It would be great if I could monitor plans in the cache but I am not sure if it is possible.
UPDATE: My queries are complex SELECTs to retrieve data, no DELETEs/INSERTs etc. Does this mean I should not give so much respect to the query planner in benchmarks?
PostgreSQL only caches query plans if
you use prepared statements
the statement is executed inside a PL/pgSQL function
So if you want to benchmark how much faster your queries become if you avoid the overhead of planning, you should create a prepared statement and execute it at least six times (because the first five runs will always generate a custom plan).
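A minimal sketch of that (the table t and the parameter are hypothetical):

prepare q(int) as select * from t where id = $1;

execute q(1);  -- runs 1 through 5 are planned with custom plans
execute q(1);
execute q(1);
execute q(1);
execute q(1);
execute q(1);  -- from the sixth run on, a cached generic plan may be used

explain execute q(1);  -- shows the plan that would be used now

deallocate q;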
If your queries are complex, odds are that you might even lose performance if you cache query plans, particularly if the runtime of the queries is long. In such a case, it is usually better to spend the extra effort on planning each query. The biggest win from prepared statements comes when the execution time of the queries is low.
I am trying to understand the EXPLAIN function. I have two queries: the first query is optimized and runs in 600 ms (I have 100k rows), and the second query runs in 900 ms.
But when I run EXPLAIN ANALYZE, the first query, which runs quickly, shows a cost of 64296, while the second query, which runs slowly, shows a cost of 20873.
I can't understand why the faster query has a bigger cost and the slower query has a smaller cost.
Could someone give me some hint ?
PostgreSQL EXPLAIN is an animal that really has a lot of arms & legs, each of which can cause it to work in a way that isn't easy to understand at first.
To answer your question: as I understand it, when you actually run the first query Q1 (not its EXPLAIN), it runs faster than the second (Q2), but when you do an EXPLAIN ANALYZE, Q1 shows a higher cost.
I could think of two reasons that come to mind at this moment:
If the queries are LIMIT queries, it's possible for Q1 to execute faster and still have a higher 'cost', since the PostgreSQL planner (intentionally) does not plan for the smallest total cost, but for the smallest cost of producing the required result (in this case, a smaller number of rows).
Another reason could be that caching is playing havoc with your timings. Could you confirm whether the observation persists across multiple (3+) runs?
Besides these hunches, if you really want to get deep into understanding EXPLAIN, I recommend reading some of the in-depth articles that have been written on the topic.
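To see the first point in action, you can compare the plans of the same query with and without a LIMIT (a sketch with a hypothetical orders table):

explain analyze select * from orders order by created_at limit 10;   -- planner optimizes for returning the first rows quickly
explain analyze select * from orders order by created_at;            -- planner optimizes for the full result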
Cost is what the planner thinks the query will take in resources (I/O and CPU time). It's just an estimate, calculated by a mathematical model.
In your case the planner was wrong; it chose a suboptimal plan. That happens sometimes.
Why? There could be many reasons. Maybe the statistics are inadequate (first of all, try running ANALYZE on your tables). Maybe the statistics are OK, but the planner uses the wrong model (for example, you may have correlated predicates in your query, which are known to be problematic). Maybe your query joins several dozen tables and the planner just can't go through all possible plans. And so on.
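As a cheap first step, refresh the statistics and then compare the estimates against reality again (a sketch; the table and query are placeholders):

analyze my_table;

explain (analyze, buffers)
select count(*) from my_table where some_column = 42;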