Understanding simple PostgreSQL EXPLAIN

I can't understand the EXPLAIN output of a fairly simple query:
select * from store order by id desc limit 1;
QUERY PLAN
Limit (cost=0.00..0.03 rows=1 width=31)
  -> Index Scan Backward using store_pkey on store (cost=0.00..8593.28 rows=250000 width=31)
Why does the top-level node (Limit) have a lower cost than the nested one (Index Scan)? As I read the documentation, the cost should be cumulative, i.e. 8593.28 + 0.03.

The docs say:
Actually two numbers are shown: the start-up cost before the first row can be returned, and the total cost to return all the rows. For most queries the total cost is what matters, but in contexts such as a subquery in EXISTS, the planner will choose the smallest start-up cost instead of the smallest total cost (since the executor will stop after getting one row, anyway).
In other words, 8593.28 would be the cost to return all the rows, but due to the limit you're only returning one, so the actual cost is much lower (more or less equal to the start-up cost).

The numbers you see in the top node (0.00..0.03) are (per the documentation):
0.00 .. the estimated start-up cost of the node
0.03 .. the estimated total cost of the node
If you want actual total times, run EXPLAIN ANALYZE, which appends actual times for every node. Like:
Limit (cost=0.29..0.31 rows=1 width=30) (actual time=xxx.xxx..xxx.xxx rows=1 loops=1)
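Concretely, for the query from the question, a minimal example would be (output and timings will of course vary with your data):

-- Run the same query with ANALYZE to get actual per-node times next to the estimates.
EXPLAIN ANALYZE
SELECT * FROM store ORDER BY id DESC LIMIT 1;

-- BUFFERS adds block-level I/O detail on top of the timings, if you want to dig deeper.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM store ORDER BY id DESC LIMIT 1;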

Related

How does an index's fill factor relate to a query plan?

When a PostgreSQL query's execution plan is generated, how does an index's fill factor affect whether the index gets used instead of a sequential scan?
A fellow dev and I were reviewing the performance of a PostgreSQL (12.4) query with a windowed function of row_number() OVER (PARTITION BY x, y, z) and seeing if we could speed it up with an index on those fields. We found that the index would get used if we created it with a fill factor >= 80 but not at 75. This was a surprise to us, as we did not expect the fill factor to be considered in creating the query plan.
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used. What causes this behavior and should we consider it when selecting an index's fill factor on a table that will have frequent inserts and deletes and be periodically vacuumed?
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used.
So, it is not the fill factor, but rather the size of the index (which is influenced by the fill factor). This agrees with my memory that index size is a (fairly weak) influence on the cost estimate. That influence is almost zero if you are reading only one tuple, but larger if you are reading many tuples.
If the cost estimates of the plans are close to each other, then small differences such as this will be enough to drive one over the other. But that doesn't mean you should worry about them. If one plan is clearly superior to the other, then you should think about why the estimates are so close together to start with when the realities are not.
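A minimal sketch of how to observe this yourself, assuming a hypothetical table t with columns x, y, z: a lower fill factor leaves more free space per leaf page, so the index has more pages and a higher estimated cost.

-- Lower fill factor -> more free space per page -> larger index -> higher cost estimate.
CREATE INDEX t_xyz_idx ON t (x, y, z) WITH (fillfactor = 75);
SELECT pg_size_pretty(pg_relation_size('t_xyz_idx'));
EXPLAIN SELECT row_number() OVER (PARTITION BY x, y, z) FROM t;

-- Rebuild more densely packed and compare the index size and the chosen plan.
DROP INDEX t_xyz_idx;
CREATE INDEX t_xyz_idx ON t (x, y, z) WITH (fillfactor = 90);
SELECT pg_size_pretty(pg_relation_size('t_xyz_idx'));
EXPLAIN SELECT row_number() OVER (PARTITION BY x, y, z) FROM t;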

What are rows and width in PostgreSQL EXPLAIN?

What do rows and width mean in PostgreSQL?
I read the doc, which talks about the "estimated number of rows". I still do not get how these rows are calculated and what this signifies.
I ran EXPLAIN on 2 queries; both return 12 rows. One uses correlated subqueries and the other uses joins.
subquery explain:
XN HashAggregate (cost=812238052402.22..812238052402.23 rows=4 width=32)
Join explain:
XN HashAggregate (cost=6670401214317.72..6670401214317.75 rows=12 width=11)
I have 2 questions:
I am fetching the same number of columns in both queries. Then how are the widths different?
How do I interpret rows, and how are they calculated?
The width is the average length of a row in bytes, calculated as the sum of the average widths of the output columns. PostgreSQL keeps statistics for every column, one of which is avg_width; it is used for expected memory allocations. You can see these statistics in the view pg_stats.
There are a lot of articles about row estimation. You can also find it in the PostgreSQL documentation: https://www.postgresql.org/docs/current/row-estimation-examples.html
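For example, you can look at avg_width (and the other per-column statistics) directly; the table name below is a placeholder:

-- Per-column planner statistics, including the average width in bytes.
SELECT attname, avg_width, n_distinct, null_frac
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'your_table';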

PostgreSQL default_statistics_target not improving row estimation

I am trying to optimize our queries on Postgres, which sometimes take minutes on huge tables. I started looking at the query plan and noticed close to a 1000x difference between the estimated number of rows and the actual rows when running with EXPLAIN ANALYZE.
This led me to the parameter default_statistics_target, which controls the number of rows sampled by the ANALYZE command to collect stats used by the query planner. As a few blogs suggested, I experimented by increasing the value, setting it to 1000 and even to the maximum allowed value of 10000.
I ran ANALYZE every time to ensure the stats were updated. But surprisingly, this did not improve the row estimation at all. In fact it reduced the estimated value a bit further, which seems strange.
I also tested reducing the value to 10, which seems to have improved the count a bit. So I am confused whether the parameter actually does what I thought it does, or whether there is some other way to improve row estimation. Any help would be much appreciated.
Postgres version: 9.6
Query plan: at the last index scan step, it estimated 462 rows but the actual count is 1.9M.
https://explain.depesz.com/s/GZY
After changing default_statistics_target = 1000, the rows at the index scan step were:
-> (cost=0.57..120.42 rows=114 width=32) (actual time=248.999..157947.395 rows=1930518 loops=1)
And on setting it to default_statistics_target = 10, counts were:
-> (cost=0.57..2610.79 rows=2527 width=32) (actual time=390.437..62668.837 rows=1930518 loops=1)
P.S. Table under consideration has more than 100M rows.
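For reference, the kind of commands described above look roughly like this (the table name events is a placeholder); the target can also be raised for a single column instead of globally:

-- Session-level target; affects subsequent ANALYZE runs in this session.
SET default_statistics_target = 1000;
ANALYZE events;

-- Or raise the target only for the column whose estimate is off.
ALTER TABLE events ALTER COLUMN event_name_id SET STATISTICS 1000;
ANALYZE events;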
This looks like a correlation problem. The planner assumes that the conditions on project_id, event_name_id, and "timestamp" are independent and so multiplies their estimated selectivities. If they are not independent, then no amount of traditional statistics is going to help; maybe you need extended statistics.
Also, at the time it makes the plan it doesn't even know what value event_name_id will be compared to, as $0 is not determined until run time, so it can't use the value-specific statistics for that. You could execute the subquery manually, then hard code the resulting value into that spot in the query, so the planner knows what the value is while it is planning.
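A sketch of what that could look like, assuming a hypothetical table named events and the columns mentioned above; note that extended statistics (CREATE STATISTICS) require PostgreSQL 10 or later, so on 9.6 this would mean upgrading first:

-- Teach the planner about the dependency between the two equality columns.
CREATE STATISTICS events_proj_event_dep (dependencies)
  ON project_id, event_name_id FROM events;
ANALYZE events;
-- Then re-run EXPLAIN ANALYZE and compare estimated vs. actual row counts.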

Speed up ST_snap query in PostGIS

Explanation
I have 2 tables in PostgreSQL using the PostGIS extension. Both tables represent streets in a province as linestrings.
The streetsA table (orange lines) has a size of 96 MB (471026 rows); the second table, streetsB (green lines), has a size of 78 MB (139708 rows). The streets differ a bit in their positions, which is why I applied the ST_Snap function to match streetsB to streetsA.
create table snapped as select ST_snap(a.geom, b.geom, ST_Distance(a.geom, b.geom)*0.5) from streetsA as a, streetsB as b;
However, due to the large size of the tables, the query takes more than 5 hours to complete. I haven't changed anything in the Postgres settings. Is it a good idea to perform the query on such a large dataset? Does a spatial index make sense for this query? I am using a laptop with 16 GB RAM and a Core i7.
EXPLAIN gives me the following output:
Nested Loop (cost=0.00..5264516749.25 rows=65806100408 width=32)
  -> Seq Scan on streetsa a (cost=0.00..16938.26 rows=471026 width=153)
  -> Materialize (cost=0.00..12127.62 rows=139708 width=206)
        -> Seq Scan on streets b (cost=0.00..11429.08 rows=139708
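If you do try a spatial index, it only pays off when the join has a condition the planner can use; the query above pairs every street in streetsA with every street in streetsB. A rough sketch, where the GiST indexes are standard but the ST_DWithin tolerance of 50 (in the data's units) is purely an assumption:

-- GiST indexes on both geometry columns.
CREATE INDEX ON streetsA USING gist (geom);
CREATE INDEX ON streetsB USING gist (geom);

-- Restrict candidate pairs with ST_DWithin so the planner can use the indexes
-- instead of materializing the full cross join.
CREATE TABLE snapped AS
SELECT ST_Snap(a.geom, b.geom, ST_Distance(a.geom, b.geom) * 0.5) AS geom
FROM streetsA AS a
JOIN streetsB AS b
  ON ST_DWithin(a.geom, b.geom, 50);  -- tolerance is an assumption; adjust to your data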

PostgreSQL index-only scan speed

So I've got an index-only scan returning 750k rows, and pulling it into a CTE and doing count(*) is taking 0.5 seconds. It's barely using any IOPS, and maxing out the instance to 16xlarge isn't moving the needle. Switching to a bitmap heap scan still gives me 0.5 seconds. What are some alternatives (other than using a materialized view) I can try to speed it up? Or is this just Postgres v10 at its finest?