I understand that explain in PostgreSQL just estimates the cost of a query, and that explain analyze does the same but also executes the query and reports the actual results.
But I can't figure out in which cases I should use explain and in which explain analyze.
As you correctly mention, the difference between explain & explain analyze is that the former only shows the plan the planner chose, together with its estimated costs, while the latter actually executes the query.
Thus, explain analyze shows the same plan annotated with the actual row counts and timings, which lets you see where the estimates were off.
However, you don't want to "explain analyze" any data modification queries, unless you intend to actually modify the database or you wrap the call in a transaction that you roll back. These would be insert, update, delete & create table as queries. Utility commands such as alter, drop & truncate table cannot be explained at all.
Likewise for very costly queries, you may want to avoid putting the extra burden on the server by running an explain analyze.
A good rule to follow is to try just explain first. Examine the output, and if the cost estimates or the query plan differ significantly from what you expect, run explain analyze, making sure that:
the database is able to take on the additional load
no data will be inadvertently changed as a result of running this query.
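If you do need actual timings for a data-modifying statement, one common approach is to wrap the explain analyze in a transaction and roll it back. A minimal sketch; the table orders and its columns are hypothetical names used only for illustration:

```sql
-- Hypothetical table and columns, for illustration only.
BEGIN;

-- EXPLAIN ANALYZE really performs the UPDATE inside this transaction.
EXPLAIN ANALYZE
UPDATE orders
SET    status = 'archived'
WHERE  created_at < now() - interval '1 year';

-- Undo the changes made by the statement above.
ROLLBACK;
```

The rows touched by the update stay locked until the rollback, so the extra load and lock contention still apply.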
In PostgreSQL, we can use "EXPLAIN ANALYZE" on a query to get the query plan of a given SQL query. While this is useful, is there any way to get information on the other candidate plans that the optimizer generated (and subsequently discarded)?
This is so that we can analyze some of the candidates ourselves (e.g. the top 3) generated by the DBMS.
No. The planner discards incipient plans as early as it can, before they are even completely formed. Once it decides a plan can't be the best, it never finishes constructing it, so it can't display it.
You can usually use the various enable_* settings or the *_cost settings to force it to make a different choice and show the plan for that, but it can be hard to control exactly what that different choice is.
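As a rough sketch of the enable_* approach, the following compares the default plan with the plan chosen when sequential scans are discouraged. The table orders and the column customer_id are hypothetical; enable_seqscan is a real planner setting, but turning it off only makes sequential scans look very expensive, it does not forbid them:

```sql
-- Hypothetical table and column, for illustration only.

-- Plan the planner picks by default.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL enable_seqscan = off;
-- Plan chosen when sequential scans are heavily penalized.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
ROLLBACK;
```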
You can also temporarily drop an index to see what the planner would do without that index. If you DROP an index inside a transaction, then do the EXPLAIN, then ROLLBACK the transaction, the rollback undoes the DROP INDEX, so the index doesn't need to be rebuilt; it is simply revived. But be warned that DROP INDEX takes a strong lock (ACCESS EXCLUSIVE) on the table and holds it until the ROLLBACK, so this method is not completely free of consequences.
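The drop-and-rollback trick could look like this. The index name orders_customer_id_idx and the table orders are hypothetical:

```sql
-- Hypothetical index and table names, for illustration only.
BEGIN;

-- Blocks other sessions from using the table until the transaction ends.
DROP INDEX orders_customer_id_idx;

-- Plain EXPLAIN only: show the plan the planner picks without the index.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- The index comes back untouched; nothing is rebuilt.
ROLLBACK;
```

Keep the transaction as short as possible, since other sessions that need the table wait for the lock.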
If you just want to see what the other plan is, you just need EXPLAIN, not EXPLAIN ANALYZE. This is faster and, if the statement has side effects, also safer.
How does the DBMS (Postgres in my case) deal with execution plans and prepared statements?
The parameters of a query can have a huge impact on the execution plan, mainly because of data statistics.
In certain cases it might prefer to use an index when the data is well distributed, but for a particular value prefer a sequential scan because the parameter is not selective (usually when the parameter matches > 10% of the table's rows).
I am wondering whether prepared statements are always a good way to improve performance, or whether it is more a kind of "best effort".
Thanks in advance
Edit: my concern is about frequently running the same query with different parameters that may call for a different execution plan. It is quite hard to measure the performance gain of a prepared statement against always having the most accurate execution plan.
A prepared statement is a GREAT way to make the same simple query run over and over faster. For instance, something like
insert into table values ($1,$2,$3);
OTOH it is NOT a great way to make big ugly complex reporting queries run faster, where the data set may change based on what's in the where clause.
The whole point of prepared queries is to save the somewhat expensive step of query planning over and over. For the simple insert listed above, run 1,000 times, the cost of planning adds up.
OTOH for a big complex reporting query, the planning time is inconsequential. Most big reporting queries etc take seconds to minutes to even hours to run. The planning time, measured in milliseconds, is not worth worrying about here.
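At the SQL level, the simple insert case might be written as below. This is only a sketch: the table events and its three columns (an integer, a text and a timestamp) are assumptions, and most client drivers offer their own prepared-statement API instead of explicit PREPARE/EXECUTE:

```sql
-- Hypothetical table with three columns (int, text, timestamptz).
PREPARE add_event (int, text, timestamptz) AS
    INSERT INTO events VALUES ($1, $2, $3);

-- Each EXECUTE reuses the prepared statement; only the parameters change.
EXECUTE add_event(1, 'login',  now());
EXECUTE add_event(2, 'logout', now());

-- Prepared statements live until the session ends or they are deallocated.
DEALLOCATE add_event;
```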
I am running some benchmark tests on a lot of queries. I have a set of queries that will be run multiple times, one after another. I know that PostgreSQL caches query plans, so this is important to consider, but as far as I know this does not always happen.
So I have two approaches. I am considering either (a) forcing the query plan to be generated each time I run a query, or (b) 'warming up' a bit so that a plan is cached and reused each time. How can I do either, and what precautions can I take to ensure that one or the other is happening?
It would be great if I could monitor plans in the cache but I am not sure if it is possible.
UPDATE: My queries are complex SELECTs to retrieve data, no DELETEs/INSERTs etc. Does this mean I should not give so much respect to the query planner in benchmarks?
PostgreSQL only caches query plans if
you use prepared statements
the statement is executed inside a PL/pgSQL function
So if you want to benchmark how much faster your queries become if you avoid the overhead of planning, you should create a prepared statement and execute it at least six times (because the first five runs will always generate a custom plan; only after that may the planner switch to a cached generic plan).
If your queries are complex, odds are that you might even lose if you cache query plans, particularly if the runtime of the queries is long. In such a case, it is usually better to spend more effort on planning each query. The biggest win with prepared statements is when the execution time of the queries is low.
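For a benchmark, both approaches can be sketched roughly as follows. The table orders and the statement name by_customer are hypothetical; plan_cache_mode exists in PostgreSQL 12 and later, and the generic_plans and custom_plans columns of pg_prepared_statements are, as far as I know, available from PostgreSQL 14 on:

```sql
-- Hypothetical table and statement name, for illustration only.
PREPARE by_customer (int) AS
    SELECT count(*) FROM orders WHERE customer_id = $1;

-- Warm up: the first five executions are planned with custom plans.
EXECUTE by_customer(1);
EXECUTE by_customer(2);
EXECUTE by_customer(3);
EXECUTE by_customer(4);
EXECUTE by_customer(5);

-- A generic plan shows $1 in its conditions, a custom plan shows the value.
EXPLAIN EXECUTE by_customer(6);

-- PostgreSQL 14+ (assumption): how often each kind of plan was used.
SELECT name, generic_plans, custom_plans FROM pg_prepared_statements;

-- PostgreSQL 12+: force one behaviour for the session instead of guessing.
SET plan_cache_mode = force_custom_plan;   -- re-plan on every execution
-- SET plan_cache_mode = force_generic_plan;  -- always reuse the cached plan
```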
I am trying to understand the EXPLAIN function. I have two queries: the first is optimised and runs in 600 ms (I have 100k rows), and the second runs in 900 ms.
But when I run EXPLAIN ANALYZE, the first query, which runs quickly, shows a cost of 64296, and the second query, which runs slowly, shows a cost of 20873.
I can't understand why the faster query has the bigger cost and the slower query has the smaller cost.
Could someone give me a hint?
PostgreSQL EXPLAIN is an animal that really has a lot of arms & legs, each of which can cause it to work in a way that isn't easy to understand at first.
To restate your question: when you run the queries themselves (not their EXPLAIN), the first query (Q1) is faster than the second (Q2), but EXPLAIN ANALYZE reports a higher cost for Q1.
Two possible reasons come to mind at this moment:
If the queries are LIMIT queries, it's possible for Q1 to execute faster and still have a higher 'cost', since the PostgreSQL planner (intentionally) does not plan for the smallest total cost, but for the smallest cost of producing the required result (in this case, a smaller number of rows).
Another reason could be that caching could be playing havoc with your times. Could you confirm if the observation is persistent with multiple (3+) runs?
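One way to check the caching hypothesis is the BUFFERS option of EXPLAIN, which reports how many blocks were found in PostgreSQL's shared buffers and how many had to be read in. A minimal sketch with a hypothetical query:

```sql
-- Hypothetical query, for illustration only.
-- In the output, "shared hit" counts blocks already in shared buffers,
-- while "read" counts blocks that had to be fetched from outside them.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
```

Running each query a few times and comparing the hit and read counts shows whether a cold cache explains the timing difference.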
Besides these hunches, if you really want to get deep into understanding EXPLAIN, I recommend the following articles: here, here and here.
Cost is the planner's estimate of how many resources (I/O and CPU time) it will take to perform the query. It's just an estimate, calculated by a mathematical model and expressed in arbitrary units, not milliseconds.
In your case the planner was wrong: its estimates do not match the actual run times, and it may have chosen a suboptimal plan. It happens sometimes.
Why? There could be many reasons. Maybe the statistics are inadequate (first of all, try running analyze on your tables). Maybe the statistics are ok, but the planner uses the wrong model (for example, you may have correlated predicates in your query, which are known to be problematic). Maybe your query spans several dozen tables and the planner just can't go through all possible plans. And so on.
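The first two suggestions can be tried like this. The table orders and the correlated columns city and zip are hypothetical; CREATE STATISTICS requires PostgreSQL 10 or later:

```sql
-- Hypothetical table and columns, for illustration only.

-- Refresh the planner statistics for the table.
ANALYZE orders;

-- Tell the planner that city and zip are correlated (PostgreSQL 10+).
CREATE STATISTICS orders_city_zip_stats (dependencies)
    ON city, zip FROM orders;

-- Extended statistics are only populated by the next ANALYZE.
ANALYZE orders;
```

After that, rerun EXPLAIN ANALYZE and compare the estimated row counts with the actual ones.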
I have many read-only tables in a Postgres database. All of these tables can be queried using any combination of columns.
What can I do to optimize queries? Is it a good idea to add indexes to all columns to all tables?
Columns that are used for filtering or joining (or, to a lesser degree, sorting) are of interest for indexing. Columns that are just selected are barely relevant!
For the following query only indexes on a and e may be useful:
SELECT a,b,c,d
FROM tbl_a
WHERE a = $some_value
AND e < $other_value;
Here, f and possibly c are candidates, too:
SELECT a,b,c,d
FROM tbl_a
JOIN tbl_b USING (f)
WHERE a = $some_value
AND e < $other_value
ORDER BY c;
After creating indexes, test whether they are actually useful with EXPLAIN ANALYZE. Also compare execution times with and without the indexes. Deleting and recreating indexes is fast and easy. There are also planner settings you can change while experimenting with EXPLAIN ANALYZE. The difference may be staggering or nonexistent.
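For the first example query, such a test might look like the sketch below. The index names and the literal values are made up, and the columns a and e are assumed to be integers:

```sql
-- Index names and literal values are assumptions, for illustration only.
CREATE INDEX tbl_a_a_idx ON tbl_a (a);
CREATE INDEX tbl_a_e_idx ON tbl_a (e);

-- Make sure the planner has current statistics.
ANALYZE tbl_a;

-- Check whether the plan uses the indexes and how long it really takes.
EXPLAIN ANALYZE
SELECT a, b, c, d
FROM   tbl_a
WHERE  a = 1
AND    e < 100;
```

Depending on the data, a single multicolumn index on (a, e) may serve this query better than two separate indexes, which is worth comparing the same way.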
As your tables are read only, index maintenance is cheap. It's merely a question of disk space.
If you really want to know what you are doing, start by reading the docs.
If you don't know what queries to expect ...
Try logging enough queries to find typical use cases. Log queries with the parameter log_statement = all for that. Or just log slow queries using log_min_duration_statement.
Create indexes that might be useful and check the statistics after some time to see what actually gets used. PostgreSQL has a whole infrastructure in place for monitoring statistics. One convenient way to study statistics (and many other tasks) is pgAdmin, where you can choose your table / function / index and get all the data on the "statistics" tab in the object browser (main window).
Proceed as described above to see if indexes in use actually speed things up.
If the query planner chooses to use one or more of your indexes but to no or adverse effect, then something is probably wrong with your setup and you need to study the basics of performance optimization: vacuum, analyze, cost parameters, memory usage, ...
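The logging and statistics steps above can also be done from plain SQL. A minimal sketch: the 250ms threshold is an arbitrary example value, and ALTER SYSTEM requires superuser rights:

```sql
-- Log every statement that runs longer than 250 ms (example threshold).
ALTER SYSTEM SET log_min_duration_statement = '250ms';
SELECT pg_reload_conf();  -- apply the change without a restart

-- After some time: which indexes are actually used, and how big are they?
SELECT relname,
       indexrelname,
       idx_scan,                                   -- scans using this index
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
ORDER  BY idx_scan, relname;
```

Indexes whose idx_scan stays at 0 after a representative period are candidates for removal.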
If you filter on several columns, indexes may help, but not too much. Also, indexes may not help for small tables.
First search for "postgresql tuning" - you will find useful information.
If the database can fit in memory, buy enough RAM.
If the database cannot fit in memory, an SSD will help.
If this is not enough and the database is read-only, run 2, 3 or more servers. Or partition the database (in the best case so that each part fits in the memory of its server).
Even if the queries are generated, I think they will not be random. Monitor the database for slow queries and improve only those.