PostgreSQL - optimising joins on latitudes and longitudes comparing distances

I have two tables, say A and B that contain city information with two columns: latitude and longitude. A contains 100,000 records and B contains 1,000,000 records. My objective is to find the rows of B that are within 1 kilometre from A (for each row in A). How do I go about doing this efficiently? I am targeting a time of less than 30 minutes.
The following query takes forever (which I believe is the result of the cross-product of 100,000 * 1,000,000 = 100 billion row comparisons!):
select *
from A
inner join B
on is_nearby(A.latitude, A.longitude, B.latitude, B.longitude)
is_nearby() is just a simple function that finds the difference between the latitudes and longitudes.
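A function of that shape might look like the following (a hypothetical sketch, since the original isn't shown; the key point is that no index can help a predicate like this, so every pair of rows has to be compared):
CREATE FUNCTION is_nearby(lat1 float8, lon1 float8, lat2 float8, lon2 float8)
RETURNS boolean AS $$
  -- ~0.009 degrees is roughly 1 km of latitude; a crude approximation
  SELECT abs(lat1 - lat2) < 0.009 AND abs(lon1 - lon2) < 0.009;
$$ LANGUAGE sql IMMUTABLE;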
I did a test for one row of A; it takes about 5 seconds per row. By my calculation (100,000 rows × 5 seconds ≈ 139 hours), the full query would take almost six days to finish execution, which is not acceptable.

Yes, PostGIS will make things faster, since it (a) knows how to convert degrees of latitude and longitude to kilometres (I'll use the geography type below), and (b) supports a GiST index, which is what enables fast spatial searches like this one.
Assuming you have PostGIS version 2 available on your system, upgrade your database and tables:
CREATE EXTENSION postgis;
-- Add a geog column to each of your tables, starting with table A
ALTER TABLE A ADD COLUMN geog geography(Point,4326);
UPDATE A SET geog = ST_MakePoint(longitude, latitude);
CREATE INDEX ON A USING GIST (geog);
-- ... repeat for table B
Now to find the rows of B that are within 1 kilometre from A (for each row in A):
SELECT A.*, B.*, ST_Distance(A.geog, B.geog)/1000 AS dist_km
FROM A
JOIN B ON ST_DWithin(A.geog, B.geog, 1000);
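If you want to confirm that the GiST index is actually being used, EXPLAIN should show an index scan on geog rather than a sequential scan (the exact plan will vary with your data and versions):
EXPLAIN
SELECT A.*, B.*
FROM A
JOIN B ON ST_DWithin(A.geog, B.geog, 1000);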

Related

Optimize query for intersection of ST_Buffer layer in PostGIS

I have two tables stored in PostGIS:
1. a multipolygon vector with about 590000 rows (layerA) and
2. a single multipart (1 row) vector layer (layerB)
and I want to find the area of the intersection between each polygon's buffer in layerA and layerB. My query so far is
SELECT ST_Area(ST_Intersection(a.geom, b.geom)) AS myarea, a.gid AS mygid FROM
(SELECT ST_Buffer(geom, 500) AS geom, gid FROM layerA) AS a,
layerB AS b
So far, I can see my query working, but I estimate it needs about 17 hours to complete on my PC. Is there a more efficient, faster way to execute this query?
What if you check for intersection before computing the intersection geometry and its area? That might lower the run time.
SELECT ST_Area(ST_Intersection(a.geom, b.geom)) AS myarea, a.gid AS mygid FROM
(SELECT ST_Buffer(geom, 500) AS geom, gid FROM layerA) AS a,
layerB AS b WHERE ST_Intersects(a.geom, b.geom)
You would probably get more answers to this at gis.stackexchange.com.
There are several things you can do.
First, you should make sure that the initial filtering of intersecting polygons is done with the help of an index.
Put a GiST index on the table with many geometries and use ST_DWithin(a.geom, b.geom, 500) instead of ST_Intersects on the buffered geometries. The reason is that the buffered geometries, which are computed on the fly, cannot use an index built on the unbuffered geometries.
Also, you say you have multipolygons. If there is actually more than one polygon in each multipolygon, you might get a lot more speed if you first split them into single polygons before building the index. That lets the index do a much bigger part of the job.
There is also a function in PostGIS, ST_Subdivide, that splits even single polygons into smaller pieces for the same reason.
So first use ST_Dump to get single polygons:
CREATE TABLE a_singles AS
SELECT id, (ST_Dump(geom)).geom AS geom FROM a;
Then create index:
CREATE INDEX idx_a_s_geom
ON a_singles
USING gist(geom);
Finally, the query, something like:
SELECT ST_Area(ST_Intersection(ST_Buffer(a_s.geom,500), b.geom))
FROM a_singles AS a_s
INNER JOIN b
on ST_DWithin(a_s.geom,b.geom,500);
If that is still slow, you can start playing with ST_Subdivide.
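For example, something like this (a sketch; 256 is the default maximum number of vertices per output piece, the table and index names are made up, and ST_Subdivide requires PostGIS 2.2+):
CREATE TABLE a_divided AS
SELECT id, ST_Subdivide(geom, 256) AS geom
FROM a_singles;

CREATE INDEX idx_a_d_geom
ON a_divided
USING gist(geom);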
One more thing: if the single multipolygon in table b contains many component polygons, split those too and put an index on them as well.
It might still be slow after all of those changes. That depends on how many vertex points there are in the split polygons that actually intersect (and, for ST_DWithin, on how many vertex points there are in polygons with overlapping bounding boxes).
But right now you have no index helping you at all, so these steps should make it quite a lot faster.

Delete overlapping shapefile geometry by majority coverage?

I am trying to replace records in a table with new records containing updated geometry field values.
I have two tables that contain records with geometry fields. I would like to identify (and remove) all of the records in one table whose geometry is covered by a majority (>50%) of a geometry in the other table. A lot of the geometries overlap in minute ways, so ST_Intersects() returns nearly all of the records. None of the records are completely contained by the other table's geometries either, so ST_CoveredBy() and ST_Within() return no records at all.
How can I identify & remove all records with geometry that the new geometry values overlap by a majority (>50%)?
The function to use is ST_Intersection, which returns the intersection geometry. You can then compare its area to the source area.
To make it more efficient, make sure that both geometry columns are indexed, and restrict the area computation to intersecting geometries only.
SELECT a.id
FROM a, b
WHERE ST_Area(ST_Intersection(a.geom, b.geom)) > 0.5 * ST_Area(a.geom)
AND ST_Intersects(a.geom, b.geom);
See this answer if you are interested in finding the biggest areas instead of the ones greater than 50%.
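Since the goal is also to remove those records, the same predicate can drive the delete directly (a sketch; run it inside a transaction or against a backup first):
DELETE FROM a
USING b
WHERE ST_Intersects(a.geom, b.geom)
AND ST_Area(ST_Intersection(a.geom, b.geom)) > 0.5 * ST_Area(a.geom);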

PostGIS - Count Points in Polygons (and average their features within the boundaries)

I have a table with some points representing buildings:
CREATE TABLE buildings(
pk serial NOT NULL,
geom geometry(Point,4326),
height double precision,
area double precision,
perimeter double precision
)
And another table with polylines (most of them closed):
CREATE TABLE regions
(
pk serial NOT NULL,
geom geometry(Polygon,4326)
)
I would like to:
count the number of points inside each region (buildings_n)
find the average value of one of the features (e.g. area) within the region's boundary (area_avg)
Adding the two new columns:
ALTER TABLE regions ADD COLUMN buildings_n integer;
ALTER TABLE regions ADD COLUMN area_avg double precision;
How can I do these two queries?
I've tried this for point 1, but it fails:
INSERT INTO regions (buildings_n)
SELECT count(b.geom)
FROM regions a, buildings b
WHERE st_contains(a.geom,b.geom);
Region geometry
The first problem you have is that ST_Contains with 'polylines' or linestrings only finds points that lie exactly on the linestring's geometry. If you want the points within the region outlined by a linestring, it won't work. See the examples of valid ST_Contains relations here:
http://www.postgis.org/docs/ST_Contains.html
For the spatial relation to work you have to transform the region's geometry to polygons, either beforehand or on the fly in the query. For example:
ST_Contains(ST_MakePolygon(a.geom),b.geom)
See this reference for more info:
http://www.postgis.org/docs/ST_MakePolygon.html
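Note that ST_MakePolygon requires a closed LineString, so if some of your polylines are not closed you may need to close them first, for example by appending the start point to the end (a sketch):
ST_MakePolygon(
  CASE WHEN ST_IsClosed(a.geom) THEN a.geom
       ELSE ST_AddPoint(a.geom, ST_StartPoint(a.geom))
  END
)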
Calculate aggregate values
The second problem is that to use the aggregate functions count and avg on subsets of the buildings table (and not the entire table), you need to associate the region id with each building...
SELECT a.pk region_pk, b.pk building_pk, b.area
FROM regions a, buildings b
WHERE ST_Contains(ST_MakePolygon(a.geom),b.geom)
.. and then group your building data by the region they belong to:
SELECT region_pk, count(*), avg(area) AS average
FROM joined_regions_and_buildings
GROUP BY region_pk;
Update new columns
The third problem is that you are using INSERT to add values to the newly created columns. INSERT is for adding new records to a table, UPDATE is used for changing values of existing records in a table.
Solution
So, all of the points above combined result in the following query:
WITH joined_regions_and_buildings AS (
SELECT a.pk region_pk, b.pk building_pk, b.area
FROM regions a, buildings b
WHERE ST_Contains(ST_MakePolygon(a.geom),b.geom)
)
UPDATE regions a
SET buildings_n = b.count, area_avg = b.average
FROM (
SELECT region_pk, count(*) AS count, avg(area) AS average
FROM joined_regions_and_buildings
GROUP BY region_pk
) b
WHERE a.pk = b.region_pk;

How to count up a value until all geometry features from one table are selected

For example, I have this query to find the minimum distance between the geometries stored in two tables, using the PostGIS function ST_Distance.
With thousands of geometries in both tables, it takes too much time without using ST_DWithin. ST_DWithin returns true if the geometries are within the specified distance of one another (here 2000 m).
SELECT DISTINCT ON (a.id)
a.id AS id_1,
b.id AS id_2,
ST_Distance(a.geom, b.geom) AS distance
FROM table1 a, table2 b
WHERE ST_DWithin(a.geom, b.geom, 2000.0)
ORDER BY a.id, distance;
But you have to guess a distance value large enough that every geometry (e.g. stored in table1) gets a match. That means you either have to look at your data in a GIS, or you have to calculate the maximum nearest-neighbour distance over all features (and that takes a lot of time).
At the moment I adjust the distance value by hand until all features from table1 are returned.
Would it be efficient to have the query automatically increase the distance value (by some reasonable step) until all geometries (e.g. from table1) are matched? How can I implement this?
Would it slow everything down, because the query might need many attempts to find the right distance value?
Do I have to use a recursive query for this purpose?
See this post here: K-Nearest Neighbor Query in PostGIS
Basically, the <-> operator is a bit unusual in that it works in the ORDER BY clause, but it avoids having to guess how far you need to search with ST_DWithin. There is a major gotcha with this operator, though: one side of it in the ORDER BY clause must be a constant. That is, you CANNOT write:
SELECT a.id, b.id FROM table1 a, table2 b ORDER BY a.geom <-> b.geom LIMIT 1;
Instead you would have to create a loop, substituting a constant geometry for b.geom in the query above.
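On PostgreSQL 9.3+ you can express that loop in a single statement with a LATERAL subquery, which re-runs the ordered LIMIT 1 search once per outer row, so a.geom is a constant on each pass (a sketch using the question's table names; on older PostGIS versions <-> orders by bounding-box centroid distance, so you may want to re-check the winner with ST_Distance):
SELECT a.id AS id_1, nn.id AS id_2, nn.distance
FROM table1 a
CROSS JOIN LATERAL (
  SELECT b.id, ST_Distance(a.geom, b.geom) AS distance
  FROM table2 b
  ORDER BY a.geom <-> b.geom
  LIMIT 1
) nn;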
More information can be found here: http://boundlessgeo.com/2011/09/indexed-nearest-neighbour-search-in-postgis/

SQL - 5% random sample by group

I have a table with about 10 million rows and 4 columns, and no primary key. The data in columns 2, 3 and 4 (X2, X3 and X4) fall into 50 groups identified in column 1 (X1).
To get a random sample of 5% from table, I have always used
SELECT TOP 5 PERCENT *
FROM thistable
ORDER BY NEWID()
The result returns about 500,000 rows, but some groups get unequal representation in the sample (relative to their original size) when sampled this way.
This time, to get a better sample, I want a 5% sample from each of the 50 groups identified in column X1, so that I end up with a random sample of 5% of the rows in each group (instead of 5% of the entire table).
How can I approach this problem? Thank you.
You need to be able to count each group and then pull the data out in a random order. Fortunately, we can do this with a CTE-style query. Although a CTE isn't strictly needed, it helps break the solution down into small pieces, rather than lots of sub-selects and the like.
I assume you've already got a column that groups the data, and that the value in this column is the same for all items in the group. If so, something like this might work (columns and table names to be changed to suit your situation):
WITH randomID AS (
-- First assign a random ID to all rows. This will give us a random order.
SELECT *, NEWID() as random FROM sourceTable
),
countGroups AS (
-- Now we add row numbers for each group. So each group will start at 1. We order
-- by the random column we generated in the previous expression, so you should get
-- different results in each execution
SELECT *, ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY random) AS rowcnt FROM randomID
)
-- Now we get the data
SELECT *
FROM countGroups c1
WHERE rowcnt <= (
SELECT MAX(rowcnt) / 20 FROM countGroups c2 WHERE c1.groupcolumn = c2.groupcolumn
)
The two CTE expressions allow you to randomly order and then count each group. The final select should then be fairly straightforward: for each group, find out how many rows there are in it, and only return 5% of them (total_row_count_in_group / 20).
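If you prefer to avoid the correlated subquery at the end, a window COUNT gives the same 5% cut in one pass (a sketch with the same assumed table and column names):
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY groupcolumn ORDER BY NEWID()) AS rowcnt,
         COUNT(*) OVER (PARTITION BY groupcolumn) AS grouptotal
  FROM sourceTable
) t
WHERE rowcnt <= grouptotal / 20;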