"Sparse" Geospatial Queries with ST_Contains are Much Slower than dense ones - postgresql

I am hitting a wall when it comes to trying to explain what is happening with a query I have. For simplicity, I have stripped it down to the minimum. Simply put, I am trying to find all points within a simple envelope, like so:
SELECT
    lines.start_point AS point
FROM
    "objects"
    INNER JOIN "lines" ON "lines"."id" = "objects"."line_id"
WHERE
    ST_Contains(
        ST_MakeEnvelope(142.055256, -10.798657, 142.385532, -10.485534, 4326),
        lines.start_point::geometry
    )
LIMIT 50 OFFSET 0;
I have indexes set up on lines.start_point, and everything is very fast in areas with lots of data: those queries return in the sub-500 ms range.
What I did not expect is that areas with very little data would be super slow - sometimes > 90,000ms. Is there something I am totally missing here with ST_Contains that would explain this?
With the example bounding box above, my data has only 63 start points within the box, yet the query took 2 min 54 sec to find them. My only thought is that ST_Contains is quite fast when it can pick up a lot of points in a single pass, but if it has to scan the entire area, it is really slow.
Additionally, I have tried other ways to look for points, like the && operator. In that case, the roles reverse: dense areas take a really long time and sparse areas are lightning fast. An example of that query is here:
SELECT
    lines.start_point AS point
FROM
    "objects"
    INNER JOIN "lines" ON "lines"."id" = "objects"."line_id"
WHERE
    lines.start_point && ST_MakeEnvelope(142.055256, -10.798657, 142.385532, -10.485534, 4326)
LIMIT 50 OFFSET 0;
Any information would help. Thanks
EDIT: Add && query example

Because of the LIMIT clause, the planner may think it is faster not to use the spatial index. You can try to query all rows first and then apply the limit. Make sure to keep the OFFSET 0 to prevent the subquery from being inlined.
SELECT * FROM (
    SELECT
        lines.start_point AS point
    FROM
        "objects"
        INNER JOIN "lines" ON "lines"."id" = "objects"."line_id"
    WHERE
        ST_Contains(
            ST_MakeEnvelope(142.055256, -10.798657, 142.385532, -10.485534, 4326),
            lines.start_point::geometry
        )
    OFFSET 0
) AS sub  -- a subquery in FROM needs an alias in PostgreSQL
LIMIT 50;
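To confirm which plan you are actually getting in the sparse areas, it is worth comparing both variants with EXPLAIN; a sketch for the original query:

-- Look for an index scan on the spatial index versus a sequential scan on "lines".
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    lines.start_point AS point
FROM
    "objects"
    INNER JOIN "lines" ON "lines"."id" = "objects"."line_id"
WHERE
    ST_Contains(
        ST_MakeEnvelope(142.055256, -10.798657, 142.385532, -10.485534, 4326),
        lines.start_point::geometry
    )
LIMIT 50 OFFSET 0;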
Also I see you do a cast to geometry, so make sure the index is on the geometry too!
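If start_point is stored as geography, a plain index on the column will not help here, because the WHERE clause filters on the cast expression. A minimal sketch of a functional index on the cast (the index name is illustrative):

CREATE INDEX lines_start_point_geom_idx
    ON lines
    USING gist ((start_point::geometry));  -- index the cast expression itself

ANALYZE lines;  -- refresh statistics so the planner considers the new index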

Related

Searching nearest neighbour in same table GIS PostgreSQL

We have a database of individual trees with geolocation. In the DB we have a geom point, combined from long and lat, named estimated_geometric_location. We get a periodic update of these trees, let's say every month. I would like to get a list of trees with two properties: I am looking to identify the most likely update of a specific tree, i.e. when a new set of trees from one tracking event comes in, we need to run a routine suggesting that data entry x.2 is an update of data point x.1. Ideally this routine then updates the new data point (child) by adding the older mother data point, which then hopefully represents that same tree.
So far I have something like this, but the DB is not responding (or maybe I am not waiting long enough... I have waited about 10 minutes so far):
SELECT
    i.id,
    ST_Distance(i.estimated_geometric_location, i.b_estimated_geometric_location) AS dist
FROM (
    SELECT
        a.id,
        b.id AS b_id,
        a.estimated_geometric_location,
        b.estimated_geometric_location AS b_estimated_geometric_location,
        rank() OVER (PARTITION BY a.id
                     ORDER BY ST_Distance(a.estimated_geometric_location, b.estimated_geometric_location)) AS pos
    FROM trees a, trees b
    WHERE a.id <> b.id
) i
WHERE pos = 1
Would be great to get some ideas on this. I got this from a post here somewhere and have adapted it but so far no luck.
There are a couple of things to mention. If the data comes from a tracking event, why compare existing trees to each other? I'd expect to have something like
SELECT id
FROM trees
ORDER BY st_distance(estimated_geometric_location, st_makepoint(15, 30))
LIMIT 1
which returns the tree closest to the point with longitude 15 and latitude 30. Have a look at whether you need to do that join at all.
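If the real task is to match each newly tracked point to its nearest existing tree, a LATERAL join does that per row without the full self-join. A sketch, assuming a hypothetical survey_date column that distinguishes the new batch from older records:

-- For every tree from the newest batch, find the single closest older tree.
SELECT t_new.id AS new_id, nn.id AS old_id, nn.dist
FROM trees t_new
CROSS JOIN LATERAL (
    SELECT t.id,
           ST_Distance(t.estimated_geometric_location,
                       t_new.estimated_geometric_location) AS dist
    FROM trees t
    WHERE t.survey_date < t_new.survey_date  -- hypothetical column marking the batch
    ORDER BY t.estimated_geometric_location <-> t_new.estimated_geometric_location
    LIMIT 1
) nn
WHERE t_new.survey_date = (SELECT max(survey_date) FROM trees);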
Supposing that you do, the problem with a query like this is complexity. If you have any number of trees (say 1000) in your database, then you're actually calculating the distances between 1000 trees and all of their 999 counterparts - 999,000 distances! Just saying: since the distance between A and B is the same as between B and A, you should be able to shave off half of them by requiring a.id < b.id.
Furthermore, think about what you're doing. You want to find the minimal distance between any two trees and the ids of the trees that correspond to that distance, right? There is no need to calculate any distances as soon as you know they're not the minimal one.
SELECT a.id, b.id, ST_Distance(a.estimated_geometric_location, b.estimated_geometric_location) AS distance
FROM trees a, trees b
WHERE a.id < b.id
ORDER BY distance
LIMIT 1
is a much simpler way of getting there, and for me it's a lot faster as well.

Assign points to polygons efficiently

I have a table of polygons (thousands) and a table of points (millions). Both tables have GiST indexes on their geometry columns. Importantly, the polygons do not overlap, so every point is contained by exactly one polygon. I want to generate a table with this relation (polygon_id + point_id).
The trivial solution, of course, is
SELECT a.polygon_id, p.point_id
FROM my_polygons a
JOIN my_points p ON ST_Contains(a.geom, p.geom)
This works, but I think it is unnecessarily slow, since it matches every polygon with every point - it does not know that every point can belong to only one polygon.
Is there any way to speed things up?
I tried looping over every polygon, selecting points by ST_Contains, but only those not already in the result table:
CREATE TABLE polygon2point (polygon_id uuid, point_id uuid);

DO $$
DECLARE r record;
BEGIN
    FOR r IN SELECT polygon_id, geom
             FROM my_polygons
    LOOP
        INSERT INTO polygon2point (polygon_id, point_id)
        SELECT r.polygon_id, p.point_id
        FROM my_points p
        LEFT JOIN polygon2point t ON p.point_id = t.point_id
        WHERE t.point_id IS NULL
          AND ST_Contains(r.geom, p.geom);
    END LOOP;
END$$;
This is even slower than the trivial JOIN approach. Any ideas?
A way to increase the speed is to subdivide the polygons into smaller ones.
You would create a new table (or a materialized view, should the polygons change often), index it, and then run the query. If the subdivisions have 128 vertices or fewer, the data will, by default, be stored uncompressed on disk, making the queries even faster.
CREATE TABLE poly_subdivided AS
SELECT ST_Subdivide(a.geom, 128) AS geom, a.polygon_id
FROM my_polygons a;
CREATE INDEX poly_subdivided_geom_idx ON poly_subdivided USING gist(geom);
ANALYZE poly_subdivided;
SELECT a.polygon_id, p.point_id
FROM poly_subdivided a
JOIN my_points p ON ST_Contains(a.geom, p.geom)
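One caveat worth checking: ST_Contains is false for a point that lies exactly on a polygon's boundary, and subdividing introduces new internal boundaries where that can now happen. If your points can fall exactly on those edges, ST_Intersects plus DISTINCT is the safer combination; a sketch:

-- ST_Intersects also matches boundary points; DISTINCT removes the
-- duplicate rows that arise when a point touches two subdivisions.
SELECT DISTINCT a.polygon_id, p.point_id
FROM poly_subdivided a
JOIN my_points p ON ST_Intersects(a.geom, p.geom);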
Here is a great article on the topic.

How to get a road path which can be travelled in 10 minutes from a location

I have a PostGIS road network table with speed limits based on the type of road. I can get the shortest path/route between two points by using Dijkstra or any other algorithm. Now I want to get the possible paths that can be travelled from a location (point) in 10 minutes of time. Because I have speed limits based on road type, the resultant paths may not be of the same length. In this case, single-source all-destinations algorithms may be helpful, but my destination points may or may not be available as nodes in the network, because my cost is time. Please help me.
pgr_drivingDistance uses the cost value you provide, in the units you implicitly specify. That means that when you add a column <traveling_time> (note that I use seconds in my example) as the time needed to traverse an edge (given its length and speed limit) and select that as cost, the function's result will represent equal driving-time limits.
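For illustration, a sketch of how such a column could be filled, assuming hypothetical length_m (metres) and speed_kmh columns on your edge table:

ALTER TABLE <edge_table> ADD COLUMN traveling_time double precision;

-- seconds = metres / (km/h converted to m/s)
UPDATE <edge_table>
SET traveling_time = length_m / (speed_kmh / 3.6);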
As for the parts where the algorithm couldn't fully traverse the next edge 'in time', you will need to add those yourself. The general idea here is to identify all possible edges connected to the end vertices in the result set of pgr_drivingDistance, but not equal to any of the involved edges, and interpolate a new end point along those lines.
- Updated -
The following query is tested and returns all full and partial edges representing a 600 second trip along your network:
WITH
dd AS (
    SELECT pg.id1 AS node,
           pg.id2 AS edge,
           pg.cost
    FROM pgr_drivingDistance('SELECT id,
                                     source,
                                     target,
                                     <travel_time_in_sec> AS cost
                              FROM <edge_table>',
                             <start_id>,
                             600,
                             false,
                             false
    ) AS pg
),
dd_edgs AS (
    SELECT edg.id,
           edg.geom
    FROM <edge_table> AS edg
    JOIN dd AS d1
      ON edg.source = d1.node
    JOIN dd AS d2
      ON edg.target = d2.node
),
dd_ext AS (
    SELECT edg.id,
           CASE
               WHEN dd.node = edg.source
               THEN ST_LineSubstring(edg.geom, 0, (600 - dd.cost) / edg.<travel_time>)
               ELSE ST_LineSubstring(edg.geom, 1 - ((600 - dd.cost) / edg.<travel_time>), 1)
           END AS geom
    FROM dd
    JOIN <edge_table> AS edg
      ON dd.node IN (edg.source, edg.target)
     AND edg.id NOT IN (SELECT id FROM dd_edgs)
)
SELECT id,
       geom
FROM dd_ext
UNION ALL
SELECT id,
       geom
FROM dd_edgs;
The CASE statement decides, for any follow-up edge, whether the fraction of line length will be measured from the start or the end point.
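For reference, ST_LineSubstring takes start and end fractions of the total line length, between 0 and 1; a quick illustration:

-- Returns the first quarter of the line: LINESTRING(0 0, 2.5 0)
SELECT ST_AsText(ST_LineSubstring('LINESTRING(0 0, 10 0)'::geometry, 0.0, 0.25));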
As a side note: the current version of pgRouting provides a set of functions where inter-edge points are considered; if updating your (rather outdated) PostGIS/pgRouting versions is an option, consider those functions instead.

Optimize query for intersection of ST_Buffer layer in PostGIS

I have two tables stored in PostGIS:
1. a multipolygon vector with about 590000 rows (layerA) and
2. a single multipart (1 row) vector layer (layerB)
and I want to find the area of the intersection between each polygon's buffer in layerA and layerB. My query so far is
SELECT ST_Area(ST_Intersection(a.geom, b.geom)) AS myarea, a.gid AS mygid
FROM (SELECT ST_Buffer(geom, 500) AS geom, gid FROM layerA) AS a,
     layerB AS b
So far I can see my query working, but I estimate that it needs 17 hours to complete (on my PC). Is there another way to execute this query more efficiently and faster?
What if you check for overlap with ST_Intersects before doing the intersection and area calculation? It might lower the time.
SELECT ST_Area(ST_Intersection(a.geom, b.geom)) AS myarea, a.gid AS mygid
FROM (SELECT ST_Buffer(geom, 500) AS geom, gid FROM layerA) AS a,
     layerB AS b
WHERE ST_Intersects(a.geom, b.geom)
You would probably get more answers to this at gis.stackexchange.com.
There are several things you can do.
You should make sure that the first filtering down to actually intersecting polygons is done with the help of an index.
Put a GiST index on the table with many geometries and use ST_DWithin(a.geom, b.geom, 500) instead of ST_Intersects on the buffered geometries. That is because the buffered geometries cannot use the index built on the unbuffered geometries.
Also, you say you have multipolygons. If there actually is more than one polygon in each multipolygon, you might get a lot more speed if you first split them into single polygons before building the index. That will make the index do a much bigger part of the job.
There is actually a function in PostGIS to split even single polygons into smaller pieces, for the same reason:
ST_SubDivide
So first use ST_Dump to get single polygons:
CREATE TABLE a_singles AS
SELECT id, (ST_Dump(geom)).geom AS geom FROM layerA;
Then create index:
CREATE INDEX idx_a_s_geom
ON a_singles
USING gist(geom);
Finally, the query, something like:
SELECT ST_Area(ST_Intersection(ST_Buffer(a_s.geom, 500), b.geom))
FROM a_singles AS a_s
INNER JOIN layerB AS b
    ON ST_DWithin(a_s.geom, b.geom, 500);
If that is still slow, you can start playing with ST_Subdivide.
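A sketch of what that could look like (the 128 vertex cap, table name, and index name are all illustrative):

-- Split the single polygons into smaller pieces and index them.
CREATE TABLE a_subdivided AS
SELECT id, ST_Subdivide(geom, 128) AS geom
FROM a_singles;

CREATE INDEX idx_a_sub_geom ON a_subdivided USING gist (geom);
ANALYZE a_subdivided;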
One more thing: if the single multipolygon in table layerB contains many geometries, split them too and put an index there as well.
It might still be slow after all those things. That depends on how many vertex points there are in the split polygons that actually intersect (and, for ST_DWithin, also on how many vertex points there are in polygons with overlapping bounding boxes).
But right now you don't have any index helping you, so this should make it quite a lot faster.

How to count up a value until all geometry features from one table are selected

For example, I have this query to find the minimum distance between two geometries (stored in 2 tables) with a PostGIS function called ST_Distance.
With thousands of geometries (in both tables) it takes too much time without using ST_DWithin. ST_DWithin returns true if the geometries are within the specified distance of one another (here 2000 m).
SELECT DISTINCT ON (a.id)
       a.id,
       b.id,
       min(ST_Distance(a.geom, b.geom)) AS distance
FROM table1 a, table2 b
WHERE ST_DWithin(a.geom, b.geom, 2000.0)
GROUP BY a.id, b.id
ORDER BY a.id, distance
But you have to estimate the distance value so that all geometries (e.g. those stored in table1) are fetched. So you either have to inspect your data in a GIS, or you have to calculate the maximum distance for all features (and that takes a lot of time).
At the moment I approximate the distance value by hand until all features from table1 are returned.
Would it be efficient if my query automatically increased the distance value (by a reasonable step) until the count of all geometries (e.g. for table1) is reached? How can I put this into execution?
Would it slow everything down, because the query may need a lot of attempts to find the right distance value?
Do I have to use a recursive query for this purpose?
See this post here: K-Nearest Neighbor Query in PostGIS
Basically, the <-> operator is a bit unusual in that it works in the ORDER BY clause, but it avoids having to guess how far you want to search in ST_DWithin. There is a major gotcha with this operator, though: the geometry on one side of it must be a constant. That is, you CANNOT write:
select a.id, b.id from table1 a, table2 b order by a.geom <-> b.geom limit 1;
Instead, you would have to create a loop, substituting a constant value for b.geom in the query above.
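On PostgreSQL 9.3+, a LATERAL join expresses that loop directly: the outer row's geometry acts as the constant side of <-> for each inner lookup, so the index can be used. A sketch (note that for non-point geometries on older PostGIS versions, <-> compares bounding boxes, so re-check with ST_Distance):

-- For each row of table1, find the nearest row of table2 via the KNN index.
SELECT a.id AS a_id, nn.id AS b_id, nn.distance
FROM table1 a
CROSS JOIN LATERAL (
    SELECT b.id,
           ST_Distance(a.geom, b.geom) AS distance
    FROM table2 b
    ORDER BY b.geom <-> a.geom  -- a.geom is constant for each outer row
    LIMIT 1
) nn;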
More information can be found here: http://boundlessgeo.com/2011/09/indexed-nearest-neighbour-search-in-postgis/