I have table of polygons (thousands), and table of points (millions). Both tables have GIST indexes on geometry columns. Important this is, polygons do not overlap, so every point is contained by exactly one polygon. I want to generate table with this relation (polygon_id + point_id).
Trivial solution of course is
SELECT a.polygon_id, p.point_id
FROM my_polygons a
JOIN my_points p ON ST_Contains(a.geom, p.geom)
This works, but I think it is unnecessary slow, since it matches every polygon with every point - it does not know that every point can belong to one polygon only.
Is there any way to speed things up?
I tried looping for every polygon, selecting points by ST_Contains, but only those not already in the result table:
CREATE TABLE polygon2point (polygon_id uuid, point_id uuid);
DO $$DECLARE r record;
BEGIN
FOR r IN SELECT polygon_id, geom
FROM my_polygon
LOOP
INSERT INTO polygon2point (polygon_id, point_id)
SELECT r.polygon_id, p.point_id
FROM my_points p
LEFT JOIN polygon2point t ON p.point_id = t.point_id
WHERE t.point_id IS NULL AND ST_Contains(r.geom, p.geom);
END LOOP;
END$$;
This even slower than trivial JOIN approach. Any ideas?
A way to increase the speed is to subdivide the polygons into smaller ones.
You would create a new table (or a materialized view should the polygon change often), index it, and then run the query. If the subdivisions have 128 vertices or less, the data will, by default, be stored uncompressed on disk, making the queries even faster.
CREATE TABLE poly_subdivided AS
SELECT ST_SUBDIVIDE(a.geom, 128) AS geom , a.polygon_id
FROM poly;
CREATE INDEX poly_subdivided_geom_idx ON poly_subdivided USING gist(geom);
ANALYZE poly_subdivided;
SELECT a.polygon_id, p.point_id
FROM poly_subdivided a
JOIN my_points p ON ST_Contains(a.geom, p.geom)
Here is a great article on the topic.
Related
I have two polygon layers. I want to run st_intersection on them, to give the result of the areas where they overlap as a new layer. The new layer should contain the attributes from both input layers. I found this image which seems to illustrate my desired end results.
My two input layers are both polygons:
SELECT st_geometrytype(geom),
COUNT(*)
FROM a
GROUP BY st_geometrytype(geom)
-- Result is 1368 st_polygons
SELECT st_geometrytype(geom),
COUNT(*)
FROM b
GROUP BY st_geometrytype(geom)
-- Result is 539548 st_polygons
The query I run is as below:
SELECT a.*,
b.*,
st_intersection(a.geom, b.geom) as geom
FROM a,b
WHERE st_intersects(a.geom, b.geom)
However in the result I get not just polygons (which I expect), but lines, points, multipolygons and geometry collections. I guess because some of my input polygons share points but not true intersections perhaps?
Grateful for some advice please on how to deal with this, whether my query is correct, anything I can do to improve performance etc. Thanks.
ST_intersect returns several geometry types, depending on the relative topology.
For example, running ST_intersect on two adjacent polygons returns the common part of the shared boundary.
While it ouptuts a single table (as you can verify in pgadmin, for example), in the Browser swatch of QGIS it will be shown as multiple tables of different geometry types (for example: POLYGON, MULTIPOLY, LINE, and POINT) but (somewhat confusingly) with the same name.
Visually, you can tell them apart observing the accompaining icons on the left:
You can however select which type of geometry you want, for example by adding a WHERE filter with ST_Dimension:
SELECT a.*,
b.*,
st_intersection(a.geom, b.geom) as geom
FROM a,b
WHERE st_intersects(a.geom, b.geom)
AND ST_Dimension(st_intersects(a.geom, b.geom)) = 2;
or, for performance sake, re-write it in a fashion similar to:
SELECT clipped.*
FROM (
SELECT a.id, b."fieldName",
(ST_Dump(ST_Intersection(a.geom, b.geom))).geom AS geom
FROM "public"."table_A_name" AS a INNER JOIN "public"."table_B_name" AS b
ON ST_Intersects(a.geom, b.geom)
) AS clipped
WHERE ST_Dimension("clipped"."geom") = 2;
The latter solution creates an anonymous temporary table, which allows ST_Intersection to run only once.
You might have noticed thath the trick is in ST_Dimension("clipped"."geom") = 2.
ST_Dimensions which filters the outputs from ST_Intersection so as to keep only polygons (which have a topological dimension of 2).
I have two tables stored in PostGIS:
1. a multipolygon vector with about 590000 rows (layerA) and
2. a single multipart (1 row) vector layer (layerB)
and I want to find the area of the intersection between each polygon's buffer in layerA and layerB. My query so far is
SELECT ST_Area(ST_Intersection(a.geom, b.geom)) AS myarea, a.gid AS mygid FROM
(SELECT ST_Buffer(geom, 500) AS geom, gid FROM layerA) AS a,
layerB AS b
So far, I can see my query working but I calculate that it needs 17 hours to be completed (with my PC). Is there another way to execute this query more efficiently and faster?
What if you check intersects of overlapping area before intersection and area calculation, it might lower time.
SELECT ST_Area(ST_Intersection(a.geom, b.geom)) AS myarea, a.gid AS mygid FROM
(SELECT ST_Buffer(geom, 500) AS geom, gid FROM layerA) AS a,
layerB AS b WHERE ST_intersects(a.geom, b.geom)
You would probably get more answers to this at gis.stackexchange.com.
Therea are several things you can do.
You should make sure you get that first filtering of polygons actually intersecting with help of index.
Put a gist index on the table with many geometries and use st_dwithin(geom,500) instead of st_intersects on the buffered geometries. That is because the buffered geometries cannot use the index calculated on the unbuffered geometries.
Also, you say you have multi polygons. If there actually is more than 1 polygon in each multipolygon you might get a lot more speed if you first split the polygons to single polygons before building the index. That will make the.index doing a much bigger part of the job.
There is actually a function in postgis to split even single polygons into smaller pieces for the same reason.
ST_SubDivide
So first use ST_Dump to get single polygons:
CREATE table a_singles AS
SELECT id, (ST_Dump(geom)).geom geom FROM a;
Then create index:
CREATE INDEX idx_a_s_geom
ON a_singles
USING gist(geom);
At last the query, something like
SELECT ST_Area(ST_Intersection(ST_Buffer(a_s.geom,500), b.geom))
FROM a_singles AS a_s
INNER JOIN b
on ST_DWithin(a_s.geom,b.geom,500);
If that still is slow you can start playing with ST_SubDivide.
One more thing. If the single multipolygon in table b contains many geometries, also split them and put an index also there.
It might be slow also after all those things. That depends on how many vertex points there is in the splitted polygons that actually intersect (and for st_dwithin also on how many vertexpoints there is in polygons with overlapping bounding boxes)
But now you don't have any index helping you so this should make it quite a lot faster.
I've a table with some points representing buildings :
CREATE TABLE buildings(
pk serial NOT NULL,
geom geometry(Point,4326),
height double precision,
area double precision,
perimeter double precision
)
And another table with polylines(most of them closed):
CREATE TABLE regions
(
pk serial NOT NULL,
geom geometry(Polygon,4326)
)
I would like to:
count the numbers of points inside each regions (buildings_n)
find the average value of one the features(eg. area) within the regions boundary (area_avg)
Adding the two new columns:
ALTER TABLE regions ADD COLUMN buildings_n integer;
ALTER TABLE regions ADD COLUMN area_avg double precision;
How can I do these two queries?
I've tried this one for the point 1, but it fails:
INSERT INTO regions (buildings_n)
SELECT count(b.geom)
FROM regions a, buildings b
WHERE st_contains(a.geom,b.geom);
thank you,
Stefano
Region geometry
The first problem you have is that ST_Contains with 'polylines' or linestrings only finds the points that are exactly on the linestring's geometry. If you want points within a region represented by a linestring it won't work, especially if these are not closed. See the examples of valid ST_Contains relations here:
http://www.postgis.org/docs/ST_Contains.html
For the spatial relation to work you have to transform the region's geometry to polygons, either beforehand or on the fly in the query. For example:
ST_Contains(ST_MakePolygon(a.geom),b.geom)
See this reference for more info:
http://www.postgis.org/docs/ST_MakePolygon.html
Calculate aggregate values
The second problem is that to use the aggregate functions count or average on subsets of the buildings table (and not the entire table) you need to associate the region id with each building...
SELECT a.pk region_pk, b.pk building_pk, b.area
FROM regions a, buildings b
WHERE ST_Contains(ST_MakePolygon(a.geom),b.geom)
.. and then group your building data by the region they belong to:
SELECT region_pk, count(), avg(area) average
FROM joined_regions_and_buildings
GROUP BY region_pk;
Update new columns
The third problem is that you are using INSERT to add values to the newly created columns. INSERT is for adding new records to a table, UPDATE is used for changing values of existing records in a table.
Solution
So, all of the points above combined result in the following query:
WITH joined_regions_and_buildings AS (
SELECT a.pk region_pk, b.pk building_pk, b.area
FROM regions a, buildings b
WHERE ST_Contains(ST_MakePolygon(a.geom),b.geom)
)
UPDATE regions a
SET buildings_n = b.count, area_avg = b.average
FROM (
SELECT region_pk, count(), avg(area) average
FROM joined_regions_and_buildings
GROUP BY region_pk
) b
WHERE a.pk = b.region_pk;
I have 2.2M line geometries (roads) in one table and 1500 line geometries (coast lines) in another. Both tables have a spatial index.
I need to find the endpoints of roads which are within a certain distance from the coast and store the point geometry along with the distance.
Current solution, which seems ineffecient, and takes many many hours to complete on a very fast machine;
CREATE TEMP TABLE with start and end points of the road geometries within distance, using ST_STARTPOINT, ST_ENDPOINT and ST_DWITHIN.
CREATE SPATIAL INDEXES for both geometry columns in the temp table.
Do two INSERT INTO operations, one for startpoints and one for endpoints;
SELECT geometry and distance, using ST_DISTANCE from point to coastline and a WHERE ST_DWITHIN to only consider points within the chosen distance.
Code looks something along these lines:
create temp table roadpoints_temp as select st_startpoint(road.geom) as geomstart, st_endpoint(road.geom) as geomend from
coastline_table coast, roadline_table road where st_dwithin(road.geom, coast.geom, 100);
create index on roadpoints_temp (geomstart);
create index on roadpoints_temp (geomend);
create table roadcoast_points as select roadpoints_temp.geomstart as geom, round(cast(st_distance(roadpoints_temp.geomstart,kyst.geom) as numeric),2) as dist
from roadpoints_temp, coastline_table coast where st_dwithin(roadpoints_temp.geomstart, coast.geom, 100);
insert into roadcoast_points select roadpoints_temp.geomend as geom, round(cast(st_distance(roadpoints_temp.geomend,kyst.geom) as numeric),2) as dist
from roadpoints_temp, coastline_table coast where st_dwithin(roadpoints_temp.geomend, coast.geom, 100);
drop table roadpoints_temp;
All comments and suggestions welcome :-)
You need to effectively utilize your indexes. It seems that fastest plan would be to find for each coast all the roads that are within distance of it. Doing two rechecks separately means you lose connection of closest coastline to the road and need to re-find this pair again and again.
You need to check your execution plan using EXPLAIN to have a Seq Scan on coastline table and GiST index scan on road table.
select road.*
from coastline_table coast, roadline_table road
where
ST_DWithin(coast.geom, road.geom, 100) -- indexed query
and -- non-indexed recheck
(
ST_DWithin(ST_StartPoint(road.geom), coast.geom, 100)
or ST_DWithin(ST_EndPoint(road.geom), coast.geom, 100)
);
For example, I have this query to find the minimum distance between two geometries (stored in 2 tables) with a PostGIS function called ST_Distance.
Having thousands of geometries (in both tables) it takes to much time without using ST_DWithin. ST_DWithin returns true if the geometries are within the specified distance of one another (here 2000m).
SELECT DISTINCT ON
(id)
table1.id,
table2.id
min(ST_Distance(a.geom, b.geom)) AS distance
FROM table1 a, table2 b
WHERE ST_DWithin(a.geom, b.geom, 2000.0)
GROUP BY table1.id, table2.id
ORDER BY table1.id, distance
But you have to estimate the distance value to fetch all geometries (e.g. stored in table1). Therefore you have to look at your data in some way in a GIS, or you have to calculate the maximum distance for all (and that takes a lot of time).
In the moment I do it in that way that I approximate the distance value until all features are queried from table1, for example.
Would it be efficient that my query automatically increases (with a reasonable value) the distance value until the count of all geometries (e.g. for table1) is reached? How can I put this in execution?
Would it be slow down everything because the query needs maybe a lot of approaches to find the distance value?
Do I have to use a recursive query for this purpose?
See this post here: K-Nearest Neighbor Query in PostGIS
Basically, the <-> operator is a bit unusual in that it works in the order by clause, but it avoids having to make a guess as to how far you want to search in ST_DWithin. There is a major gotcha with this operator though, which is that the geometry in the order by clause must be a constant that is you CAN NOT write:
select a.id, b.id from table a, table b order by geom.a <-> geom.b limit 1;
Instead you would have to create a loop, substituting in a value above for geom.b
More information can be found here: http://boundlessgeo.com/2011/09/indexed-nearest-neighbour-search-in-postgis/