What is PostgreSQL explain telling me exactly?

MySQL's explain output is pretty straightforward. PostgreSQL's is a little more complicated. I haven't been able to find a good resource that explains it either.
Can you describe what exactly explain is saying or at least point me in the direction of a good resource?

The part I always found confusing is the startup cost vs total cost. I Google this every time I forget about it, which brings me back to here, which doesn't explain the difference, which is why I'm writing this answer. This is what I have gleaned from the Postgres EXPLAIN documentation, explained as I understand it.
Here's an example from an application that manages a forum:
EXPLAIN SELECT * FROM post LIMIT 50;
Limit (cost=0.00..3.39 rows=50 width=422)
-> Seq Scan on post (cost=0.00..15629.12 rows=230412 width=422)
Here's the graphical explanation from PgAdmin:
(When you're using PgAdmin, you can point your mouse at a component to read the cost details.)
The cost is represented as a tuple, e.g. the cost of the LIMIT is cost=0.00..3.39 and the cost of sequentially scanning post is cost=0.00..15629.12. The first number in the tuple is the startup cost and the second number is the total cost. Because I used EXPLAIN and not EXPLAIN ANALYZE, these costs are estimates, not actual measures.
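As an aside, here is what the ANALYZE variant of the same statement would look like; running it actually executes the query, and each plan node gains an extra section of measured values (no numbers are shown here because they depend on your data):
EXPLAIN ANALYZE SELECT * FROM post LIMIT 50;
-- Each plan node then also reports: (actual time=startup..total rows=N loops=N)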
Startup cost is a tricky concept. It doesn't just represent the amount of time before that component starts. It represents the amount of time between when the component starts executing (reading in data) and when the component outputs its first row.
Total cost is the entire execution time of the component, from when it begins reading in data to when it finishes writing its output.
As a complication, each "parent" node's cost includes the costs of its child nodes. In the text representation, the tree is represented by indentation, e.g. LIMIT is a parent node and Seq Scan is its child. In the PgAdmin representation, the arrows point from child to parent (the direction of the flow of data), which might be counterintuitive if you are familiar with graph theory.
The documentation says that costs are inclusive of all child nodes, but notice that the total cost of the parent (3.39) is much smaller than the total cost of its child (15629.12). Total cost is not inclusive here because a component like LIMIT doesn't need to process its entire input. See the EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000 LIMIT 2; example in the Postgres EXPLAIN documentation.
In the example above, the startup cost is zero for both components, because neither component needs to do any processing before it starts emitting rows: the sequential scan reads the first row of the table and emits it, and the LIMIT reads its first row from the scan and then emits it.
When would a component need to do a lot of processing before it can start to output any rows? There are a lot of possible reasons, but let's look at one clear example. Here's the same query from before but now containing an ORDER BY clause:
EXPLAIN SELECT * FROM post ORDER BY body LIMIT 50;
Limit (cost=23283.24..23283.37 rows=50 width=422)
-> Sort (cost=23283.24..23859.27 rows=230412 width=422)
Sort Key: body
-> Seq Scan on post (cost=0.00..15629.12 rows=230412 width=422)
And graphically:
Once again, the sequential scan on post has no startup cost: it starts outputting rows immediately. But the sort has a significant startup cost 23283.24 because it has to sort the entire table before it can output even a single row. The total cost of the sort 23859.27 is only slightly higher than the startup cost, reflecting the fact that once the entire dataset has been sorted, the sorted data can be emitted very quickly.
Notice that the startup cost of the LIMIT (23283.24) is exactly equal to the startup cost of the Sort. This is not because LIMIT itself has a high startup cost; it actually has essentially zero startup cost on its own, but EXPLAIN rolls up all of the child costs into each parent, so the LIMIT's startup cost includes the summed startup costs of its children.
This rollup of costs can make it difficult to understand the execution cost of each individual component. For example, our LIMIT has zero startup time, but that's not obvious at first glance. For this reason, several other people linked to explain.depesz.com, a tool created by Hubert Lubaczewski (a.k.a. depesz) that helps understand EXPLAIN by — among other things — subtracting out child costs from parent costs. He mentions some other complexities in a short blog post about his tool.

Explaining_EXPLAIN.pdf could help too.

It executes from most indented to least indented, and I believe from the bottom of the plan to the top. (So if there are two indented sections, the one farther down the page executes first, then when they meet the other executes, then the rule joining them executes.)
The idea is that at each step there are 1 or 2 datasets that arrive and get processed by some rule. If just one dataset, that operation is done to that data set. (For instance scan an index to figure out what rows you want, filter a dataset, or sort it.) If two, the two datasets are the two things that are indented further, and they are joined by the rule you see. The meaning of most of the rules can be reasonably easily guessed (particularly if you have read a bunch of explain plans before), however you can try to verify individual items either by looking in the documentation or (easier) by just throwing the phrase into Google along with a few keywords like EXPLAIN.
This is obviously not a full explanation, but it provides enough context that you can usually figure out whatever you want. For example consider this plan from an actual database:
explain analyze
select a.attributeid, a.attributevalue, b.productid
from orderitemattribute a, orderitem b
where a.orderid = b.orderid
and a.attributeid = 'display-album'
and b.productid = 'ModernBook';
------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=125379.14..125775.12 rows=3311 width=29) (actual time=841.478..841.478 rows=0 loops=1)
Merge Cond: (a.orderid = b.orderid)
-> Sort (cost=109737.32..109881.89 rows=57828 width=23) (actual time=736.163..774.475 rows=16815 loops=1)
Sort Key: a.orderid
Sort Method: quicksort Memory: 1695kB
-> Bitmap Heap Scan on orderitemattribute a (cost=1286.88..105163.27 rows=57828 width=23) (actual time=41.536..612.731 rows=16815 loops=1)
Recheck Cond: ((attributeid)::text = 'display-album'::text)
-> Bitmap Index Scan on orditematt_attributeid_idx (cost=0.00..1272.43 rows=57828 width=0) (actual time=25.033..25.033 rows=16815 loops=1)
Index Cond: ((attributeid)::text = 'display-album'::text)
-> Sort (cost=15641.81..15678.73 rows=14769 width=14) (actual time=14.471..16.898 rows=1109 loops=1)
Sort Key: b.orderid
Sort Method: quicksort Memory: 76kB
-> Bitmap Heap Scan on orderitem b (cost=310.96..14619.03 rows=14769 width=14) (actual time=1.865..8.480 rows=1114 loops=1)
Recheck Cond: ((productid)::text = 'ModernBook'::text)
-> Bitmap Index Scan on id_orderitem_productid (cost=0.00..307.27 rows=14769 width=0) (actual time=1.431..1.431 rows=1114 loops=1)
Index Cond: ((productid)::text = 'ModernBook'::text)
Total runtime: 842.134 ms
(17 rows)
Try reading it for yourself and see if it makes sense.
What I read is that the database first scans the id_orderitem_productid index, using that to find the rows it wants from orderitem, then sorts that dataset using a quicksort (the sort used will change if data doesn't fit in RAM), then sets that aside.
Next, it scans orditematt_attributeid_idx to find the rows it wants from orderitemattribute and then sorts that dataset using a quicksort.
It then takes the two datasets and merges them. (A merge join is a sort of "zipping" operation where it walks the two sorted datasets in parallel, emitting the joined row when they match.)
As I said, you work through the plan inner part to outer part, bottom to top.

There is an online helper tool available too, explain.depesz.com (the depesz tool mentioned above), which will highlight where the expensive parts of the analysis results are. Pasting the same results there makes it clearer, at least to me, where the problem is.

PgAdmin will show you a graphical representation of the explain plan. Switching back and forth between the two can really help you understand what the text representation means. However, if you just want to know what it is going to do, you may be able to just always use the GUI.

PostgreSQL's official documentation provides an interesting, thorough explanation on how to understand explain's output.

If you install PgAdmin, there's an Explain button that, as well as giving the text output, draws diagrams of what's happening, showing the filters, sorts and sub-set merges, which I find really useful for seeing what's happening.

dalibo/pev2 is a visualizer tool which is very helpful.
It's available here: https://explain.dalibo.com/
Postgres Explain Visualizer 2 (PEV2) looks similar to pev; however, pev is not actively maintained.
This project is a rewrite of the excellent Postgres Explain Visualizer
(pev). Kudos go to Alex Tatiyants.
The pev project was initially written in early 2016 but seems to be
abandoned since then. There was no activity at all for more than 3
years and counting, though there are several issues open and relevant
pull requests pending.

Related

We have intermittent slow queries. Is our PostgreSQL struggling with memory?

I am investigating a few slow queries and I need some help reading the data I got.
We have this one particular query which uses an index and runs pretty fast most of the time; however, from time to time it runs slow (700 ms+), and I'm not sure why.
Limit (cost=8.59..8.60 rows=1 width=619) (actual time=5.653..5.654 rows=1 loops=1)
-> Sort (cost=8.59..8.60 rows=1 width=619) (actual time=5.652..5.652 rows=1 loops=1)
Sort Key: is_main DESC, id
Sort Method: quicksort Memory: 25kB
-> Index Scan using index_pictures_on_imageable_id_and_imageable_type on pictures (cost=0.56..8.58 rows=1 width=619) (actual time=3.644..5.587 rows=1 loops=1)
Index Cond: ((imageable_id = 12345) AND ((imageable_type)::text = 'Product'::text))
Filter: (tag = 30)
Rows Removed by Filter: 2
Planning Time: 1.699 ms
Execution Time: 5.764 ms
If I understand that correctly, I would say that almost the entire cost of the query is on the index scan, right? That sounds good to me, so why does the same query run pretty slow sometimes?
I started to think that maybe our instance is not being able to keep the entire index in memory, so it is using disk from time to time. That would explain the slow queries. However, that is way over my head. Does that make sense?
That table has around 15 million rows and is 5156 MB in size. The index is 1752 MB. BTW, it is a btree index.
Our PostgreSQL is on a "Highly available" Google Cloud SQL instance. It has 2 vCPUs and 7.5 GB of RAM. Our entire database is around 35 GB in size.
CPU consumption almost never goes beyond 40%. It usually settles around 20-30%.
Checking instance memory graph, I noticed that consumption grows until ~4 GB, then it drops down ~700 MB and it starts growing again. That is a repetitive pattern.
In theory, the instance has 7.5 GB of RAM, but I don't know if all of it is supposed to be available for PostgreSQL. Anyway, ~3.5 GB just for OS sounds pretty high, right?
Memory graph
I read that these configs are important, so throwing them here (Cloud SQL defaults):
shared_buffers | 318976 (8 kB pages, ≈ 2.4 GB)
temp_buffers | 1024 (8 kB pages, 8 MB)
work_mem | 4096 (kB, 4 MB)
Considering that we have a bunch of other tables and indexes, is it reasonable to assume that if one index alone is 1.7 GB, 7.5 GB for the entire instance is too low?
Is there any way I can assert whether we have a memory issue or not?
I appreciate your help.
Three things that can help you:
Use pg_prewarm(). This function "prewarms" a table by loading it into memory, which drastically reduces disk access and helps performance a lot. The limitation of prewarming is resources, so not every table can be kept in memory; if the table is small or not constantly accessed, it's not recommended. Every time your database is stopped, you need to run pg_prewarm() again the next time the database comes up. (A sketch of all three suggestions follows this list.)
https://www.postgresql.org/docs/current/pgprewarm.html
CLUSTER the table using your index. You can cluster each table on only one index. Clustering the table on an index is a great way to get good access to the data: the physical storage order of the data follows the clustering index, so accessing a range of previously ordered data is much faster.
CLUSTER [VERBOSE] table_name [ USING index_name ]
Reference: https://www.postgresql.org/docs/current/sql-cluster.html
Run VACUUM ANALYZE on the table periodically. PostgreSQL collects statistics about the data in your tables, and VACUUM with the ANALYZE option refreshes those statistics so the planner can optimize your queries.
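A minimal SQL sketch of all three suggestions. The table and index names (pictures and index_pictures_on_imageable_id_and_imageable_type) are taken from your plan and are only assumptions about your schema:
-- 1) Prewarm: load the table and its index into the buffer cache.
--    pg_prewarm ships as an extension and must be re-run after every restart.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('pictures');
SELECT pg_prewarm('index_pictures_on_imageable_id_and_imageable_type');

-- 2) Physically reorder the table to follow the index (takes an exclusive lock,
--    and is not maintained automatically as the table changes).
CLUSTER VERBOSE pictures USING index_pictures_on_imageable_id_and_imageable_type;

-- 3) Reclaim dead row versions and refresh planner statistics.
VACUUM ANALYZE pictures;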
I think it is more of a memory problem, as you say. Checking your graph, I can say that most of the time your database is using the 4 GB of memory assigned, and when you run your query Postgres has to go to disk.
I suppose your query runs faster when the database is under the memory limit. Another thing to consider is that maybe, some time ago, your database was not as big as it is now, and the default memory assignment (4 GB) was OK.
You can modify the memory assigned to Postgres by configuring the flags, in particular the work_mem flag. I suggest assigning 2 GB of extra memory and checking the results. If you see your database using 100% of the memory again, consider increasing both the instance's total memory and the memory assigned to the database.
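On Cloud SQL these are set as database flags in the console; for reference, on a self-managed Postgres a setting like work_mem could be changed as sketched below. The 64MB value is only an illustrative guess, and note that work_mem is allocated per sort or hash operation, not once per instance:
-- Inspect the current value.
SHOW work_mem;

-- Raise it cluster-wide and reload the configuration.
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();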

explain analyze - cost to actual time relation

Usually when improving my queries I see a coinciding improvement in both cost and actual time when running EXPLAIN ANALYZE on the before and after queries.
However, in one case, the before query reports
"Hash Join (cost=134.06..1333.57 rows=231 width=70)
(actual time=115.349..115.650 rows=231 loops=1)"
<cut...>
"Planning time: 4.060 ms"
"Execution time: 115.787 ms"
and the after reports
"Hash Join (cost=4.63..1202.61 rows=77 width=70)
(actual time=0.249..0.481 rows=231 loops=1)"
<cut...>
"Planning time: 2.079 ms"
"Execution time: 0.556 ms"
So as you can see, the estimated costs are similar but the actual execution times are vastly different, regardless of the order in which I run the tests.
Using Postgres 8.4.
Can anyone clear up my understanding as to why the cost does not show an improvement?
There isn't much information available in the details given in the question, but a few pointers may help others who come here searching on the topic.
The cost is a numerical estimate based on table statistics that are calculated when ANALYZE is run on the tables involved in the query. If a table has never been analyzed, then the plan and the cost may be way suboptimal. The query plan is affected by the table statistics.
The actual time is the actual time taken to run the query. Again, this may not correlate well with the cost, depending on how fresh the table statistics are. The plan is arrived at based on the current table statistics, but the actual execution may find real data conditions different from what the table statistics suggest, resulting in a skewed execution time.
The point to note here is that table statistics affect the plan and the cost estimate, whereas the plan and the actual data conditions affect the actual time. So, as a best practice, before working on query optimization, always run ANALYZE on the tables.
A few notes:
analyze <table> - updates the statistics of the table.
vacuum analyze <table> - removes stale versions of the updated records from the table and then updates the statistics of the table.
explain <query> - only generates a plan for the query using statistics of the tables involved in the query.
explain (analyze) <query> - generates a plan for the query using existing statistics of the tables involved in the query, and also runs the query, collecting actual run-time data. Since the query is actually run, if it is a DML query then care should be taken to enclose it in BEGIN and ROLLBACK if the changes are not intended to be persisted, as sketched below.
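A minimal sketch of that last point, using a hypothetical orders table so that the UPDATE is measured but never committed:
-- Wrap EXPLAIN (ANALYZE) of a DML statement in a transaction
-- so the data changes are thrown away afterwards.
BEGIN;
EXPLAIN (ANALYZE)
UPDATE orders SET status = 'shipped' WHERE id = 42;
ROLLBACK;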
Cost meaning
The costs are in an arbitrary unit. A common misunderstanding is that they are in milliseconds or some other unit of time, but that’s not the case.
The cost units are anchored (by default) to a single sequential page read costing 1.0 units (seq_page_cost).
Each row processed adds 0.01 (cpu_tuple_cost)
Each non-sequential page read adds 4.0 (random_page_cost).
There are many more constants like this, all of which are configurable.
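You can check what these constants are currently set to on your own system with an ordinary query against the pg_settings view, for example:
SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost', 'cpu_tuple_cost');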
Startup cost
The first number you see after cost= is known as the “startup cost”. This is an estimate of how long it will take to fetch the first row.
The startup cost of an operation includes the cost of its children.
Total cost
The second number, after the startup cost and the two dots, is known as the “total cost”. This estimates how long it will take to return all the rows.
Example
QUERY PLAN |
--------------------------------------------------------------+
Sort (cost=66.83..69.33 rows=1000 width=17) |
Sort Key: username |
-> Seq Scan on users (cost=0.00..17.00 rows=1000 width=17)|
We can see that the total cost of the Seq Scan operation is 17.00, and the startup cost of the Seq Scan is 0.00. For the Sort operation, the total cost is 69.33, which is not much more than its startup cost (66.83).
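As a sanity check on where the Seq Scan's 17.00 comes from: with the default constants above, and assuming the users table occupies 7 disk pages (an assumption for illustration; the planner's actual figures live in pg_class), the estimate decomposes like this:
-- Page and row counts the planner sees for the table.
SELECT relpages, reltuples FROM pg_class WHERE relname = 'users';

-- Seq Scan total cost = (relpages * seq_page_cost) + (reltuples * cpu_tuple_cost)
--                     = (7 * 1.0) + (1000 * 0.01)
--                     = 17.00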
Actual time meaning
The “actual time” values are in milliseconds of real time; they are the result of EXPLAIN's ANALYZE option. Note: EXPLAIN ANALYZE actually performs the query (be careful with UPDATE and DELETE).
EXPLAIN ANALYZE can also be used to compare the estimated number of rows with the actual rows returned by each operation.
Helping the planner estimate more accurately
Gather better statistics
Tables also change over time, so tuning the autovacuum settings to make sure it runs frequently enough for your workload can be very helpful.
If you’re having trouble with bad estimates for a column with a skewed distribution, you may benefit from increasing the amount of information Postgres gathers by using the ALTER TABLE SET STATISTICS command, or even the default_statistics_target for the whole database.
Another common cause of bad estimates is that, by default, Postgres will assume that two columns are independent. You can fix this by asking it to gather correlation data on two columns from the same table via extended statistics.
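To make those two suggestions concrete, here are hedged examples against the users table from the plan above; the country and city columns and the statistics name are hypothetical:
-- Gather more detail for a column with a skewed distribution
-- (the default default_statistics_target is 100).
ALTER TABLE users ALTER COLUMN username SET STATISTICS 1000;

-- Tell the planner that two columns are correlated (PostgreSQL 10+).
CREATE STATISTICS users_country_city (dependencies)
  ON country, city FROM users;

-- Refresh the statistics after either change.
ANALYZE users;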
Tune the constants it uses for the calculations
Assuming you’re running on SSDs, you’ll likely at minimum want to tune your setting of random_page_cost. This defaults to 4, which is 4x more expensive than the seq_page_cost we looked at earlier. This ratio made sense on spinning disks, but on SSDs it tends to penalize random I/O too much.
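If that applies to you, an adjustment could look like the sketch below; 1.1 is a commonly suggested starting point for SSD-backed storage, not an official recommendation:
-- Lower the random I/O penalty, then reload the configuration.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();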
Source:
PG doc - using explain
Postgres explain cost

Identify queries leading to seq_scan shown in pg_stat_user_tables

I have verified all the queries using EXPLAIN, and they show an index scan node in the plan.
But when I see the stats in the table pg_stat_user_tables, I see a non-zero value for seq_scan.
It is possible that PostgreSQL is doing a bitmap heap scan rather than just an index scan, but I am not completely sure.
I have the following questions:
Does a bitmap heap scan count as seq_scan in the above stats table?
How to identify the queries that perform the sequential scan? The traffic to the database is non-uniform, hence monitoring pg_stat_activity is not helpful.
A bitmap index scan is counted under idx_scan.
Finding the query that performs the sequential scan is harder.
Let's assume that the table is fairly big, so that a sequential scan takes a certain time (for this answer, I assume at least 500 ms, but that may be different of course). If the sequential scan is very short, you wouldn't and shouldn't worry.
Now put auto_explain into shared_preload_libraries and add the following parameters to postgresql.conf:
auto_explain.log_min_duration = 500
auto_explain.log_nested_statements = on
Then restart the server.
Now you will get the execution plans of all statements exceeding a duration of 500 ms in the PostgreSQL log, and you should be able to find the query.
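To narrow down which tables are accumulating sequential scans in the first place, pg_stat_user_tables (the view from the question) can be queried directly, for example:
-- Tables ranked by number of sequential scans and rows read by them.
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 20;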
By the way, sequential scans are not always something you should worry about. If they occur only rarely, that is usually fine. It might be that you are hunting your own database backup that uses pg_dump! Only expensive sequential scans that happen often are a problem.

Why would there be varying response times for the same query?

Is there a reason why the same query, executed a number of times, has huge variance in response times, from 50% to 200% of the projected response time? The runs range from 6 seconds to 20 seconds even though it is the only active query in the database.
Context:
Database on Postgres 9.6 on AWS RDS (with Provisioned IOPS)
Contains one table comprising five numeric columns, indexed on id, holding 200 million rows
The query:
SELECT col1, col2
FROM calculations
WHERE id > 0
AND id < 100000;
The query's explain plan:
Bitmap Heap Scan on calculation (cost=2419.37..310549.65 rows=99005 width=43)
Recheck Cond: ((id > 0) AND (id <= 100000))
-> Bitmap Index Scan on calculation_pkey (cost=0.00..2394.62 rows=99005 width=0)
Index Cond: ((id > 0) AND (id <= 100000))
Is there any reason why a simple query like this isn't more predictable in response time?
Thanks.
When you see something like this in PostgreSQL EXPLAIN ANALYZE:
(cost=2419.37..310549.65)
...it doesn't mean the cost is somewhere between 2419.37 and 310549.65. These are in fact two different measures. The first value is the startup cost, and the second value is the total cost. Most of the time you'll care only about the total cost. The time you should be concerned with the startup cost is when that component of the execution plan feeds (for example) an EXISTS clause, where only the first row needs to be returned (so you only care about the startup cost, not the total cost, as it exits almost immediately after startup).
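For instance, an EXISTS-style query against the calculations table from the question (the query itself is just an illustration) only ever needs the first matching row, so the startup cost of the inner scan is the number that matters:
EXPLAIN
SELECT EXISTS (
    SELECT 1 FROM calculations WHERE id > 0 AND id < 100000
);
-- The subplan stops as soon as one row is produced, so its startup
-- cost, not its total cost, approximates the work actually done.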
The PostgreSQL documentation on EXPLAIN goes into greater detail about this.
A query may be (and should be, excluding special cases) more predictable in response time when you are the sole user of the server. In the case of a cloud server, you do not know anything about the actual server load, even if your query is the only one performed on your database, because the server most likely supports multiple databases at the same time. Since you asked about response time, there may also be various circumstances involved in accessing a remote server over the network.
After investigation of the historical load, we have found out that the provisioned IOPS we originally configured had been exhausted during the last set of load tests performed on the environment.
According to Amazon's documentation (http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html), after this point, Amazon does not guarantee consistency in execution times and the SLAs are no longer applicable.
We have confirmed that replicating the database onto a new instance of AWS RDS with same configuration yields consistent response times when executing the query multiple times.

How do you minimise the number of auxiliary scan descriptors for a table in T-SQL?

I've recently seen occasional problems with stored procedures on a legacy system which displays error messages like this:
Server Message: Number 10901, Severity 17:
This query requires X auxiliary scan
descriptors but currently there are
only Y auxiliary scan descriptors
available. Either raise the value of
the 'number of aux scan descriptors'
configuration parameter or try your
query later.
where X is slightly lower than Y. The Sybase manual usefully tells me that I should redesign my table to use fewer auxiliary scan descriptors (how?!), or increase the number available on the system. The weird thing is, it's been working fine for years and the only thing that's changed is that we amended the data types of a couple of columns and added an index. Can anyone shed any light on this?
You don't say what version of Sybase you are on but the following is good for ASE 12.5 onwards.
I suspect that it's the addition of the new index that's thrown out the query plan for that stored procedure. Have you tried running
update statistics table_name
on it? If that fails you can find out how many scan descriptors you have by running
sp_monitorconfig "aux scan descriptors"
and then increase that by running
sp_configure "aux scan descriptors", x
where x is the number of scan descriptors you require.
If you wish to reduce the number of scan descriptors that the stored procedure is using, then according to here you have to
Rewrite the query, or break it into steps using temporary tables. For data-only-locked tables, consider adding indexes if there are many table scans.
but without seeing a query plan it's impossible to give more specific advice.
This is a defect in Sybase 12.5.2 for which a CR was submitted; see issue 361967 in this list. It was patched in 12.5.3 and above.