Speed up ST_snap query in PostGIS - postgresql

Explanation
I have 2 tables in PostgreSQL using the PostGIS extension. Both tables represent the streets of a province as linestrings.
The streetsA table (orange lines) has a size of 96 MB (471026 rows); the second table, streetsB (green lines), has a storage size of 78 MB (139708 rows). The streets differ a bit in their positions, which is why I applied the ST_Snap function to match streetsB to streetsA.
create table snapped as select ST_snap(a.geom, b.geom, ST_Distance(a.geom, b.geom)*0.5) from streetsA as a, streetsB as b;
However, due to the large size of the tables, the query takes more than 5 hours to complete. I haven't changed anything in the Postgres settings. Is it a good idea to perform the query on such a large dataset? Does a spatial index make sense for this query? I am using a laptop with 16 GB RAM and a Core i7.
EXPLAIN gives me the following output:
Nested Loop  (cost=0.00..5264516749.25 rows=65806100408 width=32)
  ->  Seq Scan on streetsa a  (cost=0.00..16938.26 rows=471026 width=153)
  ->  Materialize  (cost=0.00..12127.62 rows=139708 width=206)
        ->  Seq Scan on streetsb b  (cost=0.00..11429.08 rows=139708
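For what it's worth, a spatial index usually does help here, provided the join is restricted so the planner can actually use it. Below is a sketch (not from the original post): GiST indexes plus an ST_DWithin condition limit each street in streetsA to nearby candidates in streetsB instead of pairing it with all 139708 rows. The search distance of 10.0 is an assumed value in the layer's SRID units and should be chosen to match the real offset between the two networks.
CREATE INDEX IF NOT EXISTS streetsa_geom_idx ON streetsA USING GIST (geom);
CREATE INDEX IF NOT EXISTS streetsb_geom_idx ON streetsB USING GIST (geom);

CREATE TABLE snapped AS
SELECT ST_Snap(a.geom, b.geom, ST_Distance(a.geom, b.geom) * 0.5) AS geom
FROM streetsA AS a
JOIN streetsB AS b
  ON ST_DWithin(a.geom, b.geom, 10.0);  -- index-assisted proximity filter instead of a full cross join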

Related

What are rows and width in PostgreSQL EXPLAIN?

What does rows and width mean in PostgreSQL?
I read the docs, which talk about the "estimated number of rows", but I still do not get how these rows are calculated and what this number signifies.
I ran EXPLAIN on 2 queries; both return 12 rows. One uses correlated subqueries and the other uses joins.
Subquery explain:
XN HashAggregate (cost=812238052402.22..812238052402.23 rows=4 width=32)
Join explain:
XN HashAggregate (cost=6670401214317.72..6670401214317.75 rows=12 width=11)
I have 2 questions:
I am fetching the same number of columns in both queries, so why are the widths different?
How do I interpret rows, and how are they calculated?
The width is the average length of a row in bytes, calculated as the sum of the average widths of the output columns. PostgreSQL keeps several metrics for every column, one of which is avg_width; it is used for estimating expected memory allocations. You can see these statistics in the view pg_stats.
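As an illustration (the table name my_table is a placeholder), these per-column metrics can be inspected directly:
SELECT attname, avg_width, n_distinct, null_frac
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'my_table';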
There are lots of articles about row estimation. You can also find it in the PostgreSQL documentation: https://www.postgresql.org/docs/current/row-estimation-examples.html

PostgreSQL default_statistics_target not improving row estimation

I am trying to optimize our Postgres queries, which sometimes take minutes on huge tables. I started looking at the query plans and noticed close to a 1000x difference between the estimated and actual number of rows when running with EXPLAIN ANALYZE.
This led me to the parameter default_statistics_target, which controls the number of rows sampled by the ANALYZE command to collect the stats used by the query planner. As a few blogs suggested, I experimented by increasing the value, setting it to 1000 and even to the maximum allowed value of 10000.
I ran ANALYZE every time to ensure the stats were updated. But surprisingly, this did not improve the row estimation at all. In fact, it reduced the estimate a bit further, which is strange to me.
I also tested reducing the value to 10, which seems to have improved the count a bit. So I am confused whether the parameter actually does what I thought it does, or whether there is some other way to improve the row estimation. Any help would be much appreciated.
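For reference, the commands involved look roughly like this (a sketch; my_table and my_column are placeholders, not taken from the original setup):
SET default_statistics_target = 1000;  -- affects ANALYZE runs in this session
ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 1000;  -- or raise the target for a single column
ANALYZE my_table;  -- refresh the statistics afterwards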
Postgres version: 9.6
Query plan: at the last index scan step, it estimates 462 rows but the actual count is 1.9M.
https://explain.depesz.com/s/GZY
After changing default_statistics_target = 1000, the rows at the index scan step were:
-> (cost=0.57..120.42 rows=114 width=32) (actual time=248.999..157947.395 rows=1930518 loops=1)
And after setting default_statistics_target = 10, the counts were:
-> (cost=0.57..2610.79 rows=2527 width=32) (actual time=390.437..62668.837 rows=1930518 loops=1)
P.S. Table under consideration has more than 100M rows.
This looks like a correlation problem. The planner assumes that the conditions on project_id, event_name_id, and "timestamp" are independent and so multiplies their estimated selectivities. If they are not independent, then no amount of traditional statistics is going to help; you may need extended statistics.
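For example (a sketch only: extended statistics require PostgreSQL 10 or later, so not the 9.6 mentioned above, and the table name events is assumed):
CREATE STATISTICS events_proj_event_stats (dependencies)
  ON project_id, event_name_id
  FROM events;
ANALYZE events;  -- extended statistics are only populated by ANALYZE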
Also, at the time it makes the plan, the planner doesn't even know what value event_name_id will be compared to, since $0 is not determined until run time, so it can't use value-specific statistics for that condition. You could execute the subquery manually and hard-code the resulting value into that spot in the query, so the planner knows the value while planning.

Does POSTGRES's query optimizer's statistical estimator compute a most common value list for an intermediate product of a multi-table join?

I am reading through Postgres' query optimizer's statistical estimator's code to understand how it works.
For reference, Postgres' query optimizer's statistical estimator estimates the size of the output of an operation (e.g. join, select) in a Postgres plan tree. This allows Postgres to choose between the different ways a query can be executed.
Postgres' statistical estimator uses cached statistics about the contents of each relation's columns to help estimate output sizes. The two key saved data structures seem to be:
A most common value (MCV) list: a list of the most common values stored in that column and the frequencies at which they appear.
A histogram of the data stored in that column.
For example, given the table:
X Y
1 A
1 B
1 C
2 A
2 D
3 B
The most common values list for X would contain {1: 0.5, 2: 0.333}.
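For a table like this (assuming it is named example and has been ANALYZEd), the MCV list and its frequencies can be read straight from pg_stats:
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'example'
  AND attname = 'x';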
However, when Postgres completes the first join in a multi join operation like in the example below:
SELECT *
FROM A, B, C, D
WHERE A.ID = B.ID AND B.ID2 = C.ID2 AND C.ID3 = D.ID3
the resulting table does not have an MCV (or histogram), since we've just created it and haven't ANALYZEd it! This will make it harder to estimate the output size/cost of the remaining joins.
Does Postgres automatically generate/estimate the MCV (and histogram) for this table to help statistical estimation? If it does, how does it create this MCV?
For reference, here's what I've looked at so far:
The documentation giving a high level overview of how Postgres statistical planner works:
https://www.postgresql.org/docs/12/planner-stats-details.html
The code which carries out the majority of POSTGRES's statistical estimation:
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/selfuncs.c
The code which generates a relation's MCV:
https://github.com/postgres/postgres/blob/master/src/backend/statistics/mcv.c
Generic logic for clause selectivities:
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/clausesel.c
A pointer to the right code file to look at would be much appreciated! Many thanks for your time. :)
The result of a join is called a join relation in PostgreSQL jargon, but that does not mean that it is a “materialized” table that is somehow comparable to a regular PostgreSQL table (which is called a base relation).
In particular, since the join relation does not physically exist, it cannot be ANALYZEd to collect statistics. Rather, the row count is estimated based on the size of the joined relations and the selectivity of the join conditions. This selectivity is a number between 0 (the condition excludes all rows) and 1 (the condition does not filter out anything).
The relevant code is in calc_joinrel_size_estimate in src/backend/optimizer/path/costsize.c, which you are invited to study.
The key points are:
Join conditions that correspond to foreign keys are considered specially:
If all columns in a foreign key are join conditions, then we know that the result of such a join must be as big as the referencing table, so the selectivity is 1 / (size of the referenced table); see the sketch below.
Other join conditions are estimated separately by guessing what percentage of rows will be eliminated by that condition.
In the case of a left (or right) outer join, we know that the result size must be at least as big as the left (or right) side.
Finally, the size of the cartesian join (the product of the relation sizes) is multiplied by all the selectivities calculated above.
Note that this treats all conditions as independent, which causes bad estimates if the conditions are correlated. But since PostgreSQL doesn't have cross-table statistics, it cannot do better.
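To illustrate the foreign-key case mentioned above, here is a sketch with hypothetical customers/orders tables (not from the question): with the constraint in place and fresh statistics, the planner's estimate for the inner join should come out close to the size of the referencing table, i.e. the cartesian product multiplied by a selectivity of 1 / count(customers).
CREATE TABLE customers (id int PRIMARY KEY);
CREATE TABLE orders (
    id          int PRIMARY KEY,
    customer_id int NOT NULL REFERENCES customers (id)
);
-- ... load data, then:
ANALYZE customers;
ANALYZE orders;
EXPLAIN SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id;
-- expected row estimate: roughly the row count of orders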

Postgresql index only scan speed

So I've got an index-only scan returning 750k rows, and pulling it into a CTE and doing a count(*) takes 0.5 seconds. It's barely using any IOPS, and maxing out the instance to a 16xlarge isn't moving the needle. Switching to a bitmap heap scan still gives me 0.5 seconds. What are some alternatives (other than using a materialized view) I can try to speed it up? Or is this just Postgres v10 at its finest?
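For context, the query shape being described is roughly this (a sketch; big_table, id and indexed_col are placeholders):
WITH hits AS (
    SELECT id
    FROM   big_table
    WHERE  indexed_col = 'some_value'  -- the part served by the index-only scan
)
SELECT count(*) FROM hits;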

Understanding simple PostgreSQL EXPLAIN

I can't understand the EXPLAIN output of a quite simple query:
select * from store order by id desc limit 1;
QUERY PLAN
Limit  (cost=0.00..0.03 rows=1 width=31)
  ->  Index Scan Backward using store_pkey on store  (cost=0.00..8593.28 rows=250000 width=31)
Why does the top-level node (Limit) have a lower cost than the nested node (Index Scan)? As I read the documentation, the costs should be cumulative, i.e. 8593.28 + 0.03.
The docs say:
Actually two numbers are shown: the start-up cost before the first row can be returned, and the total cost to return all the rows. For most queries the total cost is what matters, but in contexts such as a subquery in EXISTS, the planner will choose the smallest start-up cost instead of the smallest total cost (since the executor will stop after getting one row, anyway).
In other words, 8593.28 would be the cost to return all the rows, but because of the LIMIT you are only returning one, so the actual cost is much lower (more or less equal to the start-up cost).
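As a rough check, assuming the planner simply scales the child node's total cost by the fraction of its rows the Limit will actually fetch: (1 / 250000) × 8593.28 ≈ 0.03, which matches the total cost shown for the Limit node.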
The numbers you see in the top node (0.00..0.03) are (per the documentation):
0.00 .. the estimated start-up cost of the node
0.03 .. the estimated total cost of the node
If you want actual total times, run EXPLAIN ANALYZE, which appends actual times for every node. Like:
Limit (cost=0.29..0.31 rows=1 width=30) (actual time=xxx.xxx..xxx.xxx rows=1 loops=1)