PostgreSQL default_statistics_target not improving row estimation

I am trying to optimize our queries on Postgres, some of which take minutes on huge tables. I started looking at the query plans and noticed close to a 1000x difference between the estimated number of rows and the actual rows when running with EXPLAIN ANALYZE.
This led me to the parameter default_statistics_target, which controls the number of rows sampled by the ANALYZE command to collect statistics used by the query planner. As a few blogs suggested, I experimented by increasing the value, setting it to 1000 and even to the maximum allowed value of 10000.
I ran ANALYZE every time to make sure the statistics were updated. But surprisingly, this did not improve the row estimation at all. In fact, it reduced the estimate a bit further, which is hard to understand.
I also tested reducing the value to 10, which seems to have improved the estimate a bit. So I am confused whether the parameter actually does what I thought it does, or whether there is some other way to improve the row estimation. Any help would be much appreciated.
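For reference, this is roughly what I ran each time (the table name is just a placeholder):

SET default_statistics_target = 1000;   -- also tried 10000, and later 10
ANALYZE my_big_table;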
Postgres version: 9.6
Query plan: at the last index scan step, the estimated row count is 462 but the actual count is 1.9M.
https://explain.depesz.com/s/GZY
After changing default_statistics_target = 1000, the row counts at the index scan step were:
-> (cost=0.57..120.42 rows=114 width=32) (actual time=248.999..157947.395 rows=1930518 loops=1)
And after setting default_statistics_target = 10, the counts were:
-> (cost=0.57..2610.79 rows=2527 width=32) (actual time=390.437..62668.837 rows=1930518 loops=1)
P.S. The table under consideration has more than 100M rows.

This looks like a correlation problem. The planner assumes that the conditions on project_id, event_name_id, and "timestamp" are independent and so multiplies their estimated selectivities. If they are not independent, then no amount of traditional per-column statistics is going to help. Maybe you need extended statistics.
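A minimal sketch of what that could look like, assuming an upgrade: CREATE STATISTICS only exists in PostgreSQL 10 and later, so it is not available on 9.6, and the table name events is a placeholder:

CREATE STATISTICS proj_event_dep (dependencies)
    ON project_id, event_name_id FROM events;
ANALYZE events;   -- extended statistics are only populated by ANALYZE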
Also, at the time it makes the plan it doesn't even know what value event_name_id will be compared to, as $0 is not determined until run time, so it can't use the value-specific statistics for that. You could execute the subquery manually, then hard code the resulting value into that spot in the query, so the planner knows what the value is while it is planning.
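For example, something along these lines; the query shape, names, and values here are invented purely to illustrate the idea:

-- run the subquery on its own first
SELECT id FROM event_names WHERE name = 'some_event';   -- say it returns 42

-- then run the main query with the literal value instead of the subquery,
-- so the planner can use the per-value statistics for event_name_id
SELECT count(*)
FROM events
WHERE project_id = 7
  AND event_name_id = 42
  AND "timestamp" >= '2019-01-01';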

Related

How does an index's fill factor relate to a query plan?

When a PostgreSQL query's execution plan is generated, how does an index's fill factor affect whether the index gets chosen over a sequential scan?
A fellow dev and I were reviewing the performance of a PostgreSQL (12.4) query with a window function, row_number() OVER (PARTITION BY x, y, z), and seeing if we could speed it up with an index on those fields. We found that the index would get used if we created it with a fill factor >= 80 but not at 75. This was a surprise to us, as we did not expect the fill factor to be considered in creating the query plan.
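The index in question looked roughly like this (names are placeholders):

CREATE INDEX idx_xyz ON big_table (x, y, z) WITH (fillfactor = 75);   -- or fillfactor = 80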
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used. What causes this behavior and should we consider it when selecting an index's fill factor on a table that will have frequent inserts and deletes and be periodically vacuumed?
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used.
So, it is not the fill factor, but rather the size of the index (which is influenced by the fill factor). This agrees with my memory that index size is a (fairly weak) influence on the cost estimate. That influence is almost zero if you are reading only one tuple, but larger if you are reading many tuples.
If the cost estimates of the plans are close to each other, then small differences such as this will be enough to tip one over the other. But that doesn't mean you should worry about them. If one plan is clearly superior to the other, then you should think about why the estimates are so close together to start with, when the realities are not.

What are rows and width in PostgreSQL EXPLAIN?

What do rows and width mean in PostgreSQL?
I read the docs, which talk about the "estimated number of rows". I still do not get how these rows are calculated and what they signify.
I ran EXPLAIN on two queries; both return 12 rows. One uses correlated subqueries and the other uses joins.
subquery explain:
XN HashAggregate (cost=812238052402.22..812238052402.23 rows=4 width=32)
Join explain:
XN HashAggregate (cost=6670401214317.72..6670401214317.75 rows=12 width=11)
I have two questions:
I am fetching the same number of columns in both queries, so why are the widths different?
How do I interpret rows, and how is it calculated?
The width is the average length of a row in bytes, calculated as the sum of the average widths of the output columns. PostgreSQL keeps several metrics for every column, and one of them is avg_width; it is used for expected memory allocations. You can see these statistics in the pg_stats view.
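For example, a query like this shows the per-column average widths (the table name is a placeholder):

SELECT attname, avg_width, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'my_table';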
There are a lot of articles about row estimation. You can also find it in the PostgreSQL documentation: https://www.postgresql.org/docs/current/row-estimation-examples.html

What is the estimated elapsed time for a query given its EXPLAIN cost?

If I use Postgres's EXPLAIN command, there's a top-level "cost". Assuming that the explain is accurate (i.e. despite the cost being, in reality, quite unreliable and/or inconsistent), what is the very approximate conversion from cost to minutes/seconds of query duration (for a "large" cost)?
In my case, the query cost is 60 million.
For context, my hardware is a regular laptop and the data is 12M rows joining to 250K rows on an indexed column, grouped on several columns to produce 1K rows of output.
This question is not about the query itself per se - there could be better ways to code the query.
This question is also not about how inaccurate, unreliable or inconsistent the explain output is.
This question is about estimating the run time a query would take if it executed, given its EXPLAIN cost and given that the EXPLAIN output is in fact an accurate analysis of the query.

Understanding simple PostgreSQL EXPLAIN

I can't understand the EXPLAIN output of a quite simple query:
select * from store order by id desc limit 1;
QUERY PLAN
Limit  (cost=0.00..0.03 rows=1 width=31)
  ->  Index Scan Backward using store_pkey on store  (cost=0.00..8593.28 rows=250000 width=31)
Why does the top-level node (Limit) have a lower cost than the nested one (Index Scan)? As I read in the documentation, it should be a cumulative cost, i.e. 8593.28 + 0.03.
The docs (emphasis added) say:
Actually two numbers are shown: the start-up cost before the first row can be returned, and the total cost to return all the rows. For most queries the total cost is what matters, but in contexts such as a subquery in EXISTS, the planner will choose the smallest start-up cost instead of the smallest total cost (since the executor will stop after getting one row, anyway).
In other words, 8593.28 would be the cost to return all the rows, but due to the LIMIT you're only returning one row, so the estimated cost is much lower (more or less equal to the start-up cost).
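Roughly speaking, the Limit node is charged only the fraction of the child's total cost needed to produce one row out of the estimated 250000: 8593.28 × (1 / 250000) ≈ 0.03, which matches the total cost shown on the Limit line.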
The numbers you see in the top node (0.00..0.03) are (per documentation)
0.00 .. the estimated start-up cost of the node
0.03 .. the estimated total cost of the node
If you want actual total times, run EXPLAIN ANALYZE, which appends actual times for every node. Like:
Limit (cost=0.29..0.31 rows=1 width=30) (actual time=xxx.xxx..xxx.xxx rows=1 loops=1)

How do I get an accurate measurement of a query's efficiency?

I am comparing queries on PostgreSQL 8.3.14 that return the same result set.
I have used EXPLAIN on my queries to track the estimated total cost. I have also run the queries a few times and recorded the total time each took to run. I understand consecutive runs will cause more data to be cached and skew the actual no-cache runtime.
Still, I would expect the EXPLAIN cost to be somewhat proportional to the total runtime (even with the cache skew).
My data contradicts this. I compared four queries.
Query A
Total Cost: 119 500
Average Runtime: 28.101 seconds
Query B
Total Cost: 115 700
Average Runtime: 28.291 seconds
Query C
Total Cost: 116 200
Average Runtime: 32.409 seconds
Query D
Total Cost: 93 200
Average Runtime: 37.503 seconds
I ran Query D last, so if anything it should be the fastest because of the caching effect. Running the queries without the cache seems to be difficult, based on this Q+A:
[SO]:See and clear Postgres caches/buffers?
How can I measure which query is the most efficient?
The query cost shown by the planner is a function of the structure of your indexes and also the relative frequencies of certain values in the relevant tables. PostgreSQL keeps track of the most common values seen in all of the columns of all of your tables so that it can get an idea of how many rows each stage of each plan is likely to operate on.
This information can become out of date. If you are really trying to get an accurate idea of how costly a query is, make sure the statistics Postgres is using are up to date by executing a VACUUM ANALYZE statement.
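For instance (the table name is a placeholder):

VACUUM ANALYZE my_table;
-- or, to refresh the statistics without vacuuming:
ANALYZE my_table;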
Beyond that, the planner is forced to do some apples-to-oranges comparisons: weighing the time it takes to seek on disk against the time it takes to run a tight loop over an in-memory relation. Since different hardware can do these things at different relative speeds, Postgres may sometimes guess wrong, especially for near ties. These relative costs can be tuned in your server's configuration file.
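For example, the main knobs live in postgresql.conf (they can also be changed with SET for testing); the values below are only illustrative, not recommendations:

# postgresql.conf -- illustrative values only
seq_page_cost = 1.0            # cost of a sequentially fetched page
random_page_cost = 4.0         # lower this if random I/O is cheap (e.g. data mostly cached)
cpu_tuple_cost = 0.01          # cost of processing one row
effective_cache_size = 2GB     # how much of the data the planner assumes is cached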
Edit:
The statistics collected by PostgreSQL do not relate to "query performance" and are not updated by successive queries. They only describe the frequency and distribution of values in each column of each table (except where disabled). Having accurate statistics is important for accurate query planning, but it's on you, the operator, to tell PostgreSQL how often and to what level of detail those statistics should be gathered. The discrepancy you are observing is a sign that the statistics are out of date, or that you could benefit from tuning other planner parameters.
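For example, the level of detail can be raised globally or for a single column (names and numbers are placeholders):

-- collect a larger sample and a more detailed histogram for one column
ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 500;
ANALYZE my_table;

-- or raise the default for all columns (postgresql.conf or per session)
SET default_statistics_target = 100;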
Try running them through EXPLAIN ANALYZE and posting the output from that to http://explain.depesz.com/