I am trying to run a spatial query on a large table (50k rows). The table contains the data for shortest path determination using pgr_astar(), following https://docs.pgrouting.org/latest/en/pgr_aStar.html
I prepared the query (below) and everything works fine, but I have to run it several hundred times with different parameters, so my question is: how can I optimize this work?
The query takes about 40-50 s.
SELECT * FROM pgr_astar('SELECT id, source, target, cost, reverse_co, x1, y1, x2, y2 FROM public."500m_g"', 679009, 631529, directed := false, heuristic := 5)
EXPLAIN shows:
Function Scan on pgr_astar (cost=0.25..10.25 rows=1000 width=40)
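If the many-to-many overload of pgr_aStar is available in my pgRouting version, I assume I could pass arrays of vertices so the graph is built from the 50k edges only once instead of once per pair; something like this (the vertex ids are just placeholders):
SELECT * FROM pgr_astar(
    'SELECT id, source, target, cost, reverse_co, x1, y1, x2, y2 FROM public."500m_g"',
    ARRAY[679009, 631529],   -- start vertex ids (placeholders)
    ARRAY[631529, 679009],   -- end vertex ids (placeholders)
    directed := false,
    heuristic := 5
);
The result would then contain start_vid/end_vid columns, so the individual paths can be told apart.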
I am trying to define a custom function and I want to know how to calculate the estimated cost of that function.
https://www.postgresql.org/docs/current/sql-createfunction.html
I tried giving different values for the COST option but could not figure out how to estimate that cost.
If I cared enough to bother, I would do it experimentally.
For example, if your function takes double precision, you could compare:
explain analyze select sqrt(x::double precision) from generate_series(1,1000000) f(x);
to
explain analyze select your_func(x::double precision) from generate_series(1,1000000) f(x);
And then find the COST setting that makes the ratio of the cost estimates approximately match the ratio of the actual times.
You could try to subtract the baseline costs of the generate_series and the cast, but if the added time of your function is so small that it warrants such precision, then it is probably small enough to just make the cost 1 and not worry about it.
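Once you have that ratio, you can apply the resulting value with ALTER FUNCTION (a sketch; the signature and the value 25 are placeholders):
-- the planner charges COST * cpu_operator_cost (default 0.0025) per call,
-- so COST 25 estimates the call at 25 times a simple operator
ALTER FUNCTION your_func(double precision) COST 25;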
The formula PostgreSQL uses to calculate the EXPLAIN cost is shown below, together with some EXPLAIN examples; to get the full cost of a table you have to sum this for everything belonging to it (indexes, foreign keys, sequences):
Reference: https://postgrespro.com/blog/pgsql/5969403
SELECT relname,
       relpages * current_setting('seq_page_cost')::numeric +
       reltuples * current_setting('cpu_tuple_cost')::numeric AS cost
FROM pg_class;
-- WHERE relname = 'tablename';
You can use EXPLAIN to see the estimated CPU cost of each query; note that this value is static and based on the objects used.
CREATE OR REPLACE FUNCTION a() RETURNS SETOF INTEGER AS $$
    SELECT 1;
$$
LANGUAGE SQL;
EXPLAIN SELECT * FROM a() CROSS JOIN (Values(1),(2),(3)) as foo;
Nested Loop (cost=0.25..47.80 rows=3000 width=8)
-> Function Scan on a (cost=0.25..10.25 rows=1000 width=4)
-> Materialize (cost=0.00..0.05 rows=3 width=4)
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4)
(4 rows)
If two functions with COST 0.0001 and 10000 are used at the same time as predicates of a SELECT statement, the query planner will execute the 0.0001-cost function first and the 10000-cost condition only later, as you can see in the example below.
EXPLAIN SELECT * FROM pg_language WHERE lanname ILIKE '%sql%' AND slow_function(lanname) AND fast_function(lanname);
QUERY PLAN
-------------------------------------------------------------------------
Seq Scan on pg_language (cost=0.00..101.05 rows=1 width=114)
Filter: (fast_function(lanname) AND (lanname ~~* '%sql%'::text) AND
slow_function(lanname))
(2 rows)
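For reference, fast_function and slow_function from that example can be recreated with something like this; the bodies are trivial stand-ins, only the COST clause matters to the planner (PL/pgSQL is used so the functions are not inlined away):
CREATE FUNCTION fast_function(l text) RETURNS boolean AS $$
BEGIN
    RETURN true;          -- trivial stand-in check
END;
$$ LANGUAGE plpgsql COST 0.0001;

CREATE FUNCTION slow_function(l text) RETURNS boolean AS $$
BEGIN
    PERFORM pg_sleep(1);  -- stand-in for an expensive check
    RETURN true;
END;
$$ LANGUAGE plpgsql COST 10000;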
Is unordered_window guaranteed to provide elements in the same order when used repeatedly in a SELECT? I was hoping to avoid the cost of ordering, since the order is irrelevant as long as it is the same.
I.e., will xs[i] and ys[i] always be elements from the same row in xyz?
select
array_agg(x) over unordered_window as xs,
array_agg(y) over unordered_window as ys
from
xyz
window
unordered_window as (partition by z);
Use EXPLAIN (VERBOSE, COSTS OFF) to see what happens:
QUERY PLAN
═══════════════════════════════════════════════════════════
WindowAgg
Output: array_agg(x) OVER (?), array_agg(y) OVER (?), z
-> Sort
Output: z, x, y
Sort Key: xyz.z
-> Seq Scan on laurenz.xyz
Output: z, x, y
There is only a single sort, so we can deduce that the order will be the same.
But that is not guaranteed, so it is possible (albeit unlikely) that the implementation will change.
But you can see that a sort is performed anyway, so adding an ORDER BY will only add another sort key, which won't slow down execution much. You might just as well add it and be safe.
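For example, assuming xyz has some column you can order by (id here is just a placeholder), the safe version is only a small change:
select
    array_agg(x) over ordered_window as xs,
    array_agg(y) over ordered_window as ys
from
    xyz
window
    ordered_window as (partition by z order by id);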
For example, I have this query to find the minimum distance between geometries stored in 2 tables, using the PostGIS function ST_Distance.
With thousands of geometries in both tables it takes too much time without ST_DWithin. ST_DWithin returns true if the geometries are within the specified distance of one another (here 2000 m).
SELECT DISTINCT ON (a.id)
    a.id,
    b.id,
    min(ST_Distance(a.geom, b.geom)) AS distance
FROM table1 a, table2 b
WHERE ST_DWithin(a.geom, b.geom, 2000.0)
GROUP BY a.id, b.id
ORDER BY a.id, distance;
But you have to estimate the distance value so that all geometries (e.g. those stored in table1) are actually fetched. Therefore you either have to look at your data in a GIS, or you have to calculate the maximum distance for all of them (and that takes a lot of time).
At the moment I do it by adjusting the distance value by hand until all features of table1, for example, are returned by the query.
Would it be efficient to have the query automatically increase the distance value (by a reasonable step) until the count of all geometries (e.g. for table1) is reached? How can I put this into practice?
Would it slow everything down because the query may need many attempts to find the right distance value?
Do I have to use a recursive query for this purpose?
See this post here: K-Nearest Neighbor Query in PostGIS
Basically, the <-> operator is a bit unusual in that it works in the ORDER BY clause, but it avoids having to guess how far you want to search with ST_DWithin. There is a major gotcha with this operator, though: one side of it in the ORDER BY clause must be a constant. That is, you CANNOT write:
select a.id, b.id from table1 a, table2 b order by a.geom <-> b.geom limit 1;
Instead you would have to create a loop, substituting a constant value for b.geom in the query above.
More information can be found here: http://boundlessgeo.com/2011/09/indexed-nearest-neighbour-search-in-postgis/
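One common workaround, sketched here with the table1/table2 layout from the question, is a LATERAL join: inside the subquery the outer row's geometry acts as a constant, so the <-> ordering can use the index on table2 for each row of table1:
SELECT a.id AS id1, nn.id AS id2, nn.dist
FROM table1 a
CROSS JOIN LATERAL (
    SELECT b.id, a.geom <-> b.geom AS dist   -- KNN distance operator
    FROM table2 b
    ORDER BY a.geom <-> b.geom               -- can use the GiST index on b.geom
    LIMIT 1
) nn;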
I have a problem with a GiST index.
I have a table 'country' with a 'geog' column (geography, multipolygon).
I also have a GiST index on this column.
This simple query with ST_CoveredBy() against a table with 2 rows (each 'geog' about 5 MB) takes 13 s (the query result is correct):
select c."ID" from gis.country c where ST_CoveredBy(ST_GeogFromText('SRID=4326;POINT(8.4375 58.5791015625)'), c."geog") =true
When I dropped the index, the query also took 13 s.
What I've already done:
VACUUM ANALYZE gis.country ("geog")
Maybe this is the problem: "Do not call with a GEOMETRYCOLLECTION as an argument". I have read (http://www.mail-archive.com/postgis-users#postgis.refractions.net/msg17780.html) that this happens because of overlapping polygons, but there are no overlapping polygons in the country map.
EDIT
Query plan -
Index Scan using sindx_country_geography_col1 on country c (cost=0.00..8.52 rows=1 width=4)
Index Cond: ('0101000020E61000000000000000E0204000000000204A4D40'::geography && "geog")
Filter: _st_covers("geog", '0101000020E61000000000000000E0204000000000204A4D40'::geography)
You won't see any benefit of an index querying against a table with only two rows. The benefit of an index only shines if you have hundreds or more rows to query.
I'm going to guess that you have two very detailed country multipolygons. There are strategies to divide these into grids to improve performance. How you break up your countries into grids should be based on (1) the density of your areas of interest (where you are most likely to query), and (2) multipolygon complexity or density of vertices.
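A sketch of that grid approach using ST_Subdivide (PostGIS 2.2+); the table and column names follow the question, and the vertex limit of 256 is just a starting point:
-- split each country into pieces of at most 256 vertices and index the pieces
CREATE TABLE gis.country_subdivided AS
SELECT c."ID",
       ST_Subdivide(c."geog"::geometry, 256)::geography AS geog
FROM gis.country c;

CREATE INDEX ON gis.country_subdivided USING gist (geog);
A point-in-country test then runs against many small pieces instead of one 5 MB multipolygon, so the index can discard most of them.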
I am running the following query on 489 million rows (102 gb) on a computer with 2 gb of memory:
select * from table order by x, y, z, h, j, l;
I am using psycopg2 with a server cursor ("cursor_unique_name") and fetch 30000 rows at a time.
Obviously the result of the query cannot stay in memory, but my question is whether the following set of queries would be just as fast:
select * into temp_table from table order by x, y, z, h, j, l;
select * from temp_table
This means that I would use a temp_table to store the ordered result and fetch data from that table instead.
The reason for asking this question is that the query takes only 36 minutes to complete when run manually in psql, but it took more than 8 hours (it never finished) to fetch the first 30000 rows when the query was executed using psycopg2.
If you want to fetch this table in chunks and sorted, then you need to create an index. Every fetch will have to sort the whole table if there is no such index. Your cursor probably sorted the table once for every row fetched; waiting for the Sun to turn into a red giant would probably end sooner…
create index tablename_order_idx on tablename (x, y, z, h, j, l);
If your table data is relatively stable then you should cluster it by this index. This way table data will be fetched without too much seeking on disk.
cluster tablename using tablename_order_idx;
If you want to get data in chunks then you should not use a cursor, as it will always work one row at a time. You should use LIMIT and OFFSET:
select * from tablename order by x, y, z, h, j, l
limit 30000 offset 44*30000
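If OFFSET itself becomes slow for late chunks (it still has to walk past all skipped rows), a variation on the same index is keyset pagination: remember the last row of the previous chunk and filter with a row-value comparison. The column names are the ones from the query; the bound values are placeholders:
select * from tablename
where (x, y, z, h, j, l) > (10, 20, 30, 40, 50, 60)  -- last row of the previous chunk
order by x, y, z, h, j, l
limit 30000;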