The PostGIS workshop page at https://postgis.net/workshops/postgis-intro/knn.html shows how to use CROSS JOIN LATERAL to join two tables that each have a geometry (point) column, using a spatial index for performance.
I have two tables: table1 with points (1 1) and (2 2), and table2 with points (1.5 1.5) and (9 9).
With the CROSS JOIN LATERAL ordered by distance from table1 to table2, I get the following result:
(1 1) (1.5 1.5)
(2 2) (1.5 1.5)
But I am interested in a one-to-one relation: the same point (1.5 1.5) in table2 should not be joined to more than one point in table1, so I expect a result like this:
(1 1) (1.5 1.5)
(2 2) (9 9)
Is it possible to add such a condition to the subquery?
By the way, the expected result above is only one option: swapping the pairs would be equally valid, since it also maps each point to its nearest available point without using the same point more than once.
Performance is critical for me, and I have spatial indexes on those columns in both tables.
Thanks!
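The one-to-one constraint described above can be illustrated outside SQL. The following is a minimal Python sketch of one possible approach, a greedy shortest-pair-first matching; the function name `greedy_one_to_one` is hypothetical, the points are the ones from the question, and the all-pairs scan is only for illustration (in the database the candidate pairs would presumably come from the index-assisted lateral join instead). It is a heuristic and does not guarantee the minimum total distance.

```python
import math

def greedy_one_to_one(points_a, points_b):
    """Pair each point in points_a with a distinct point in points_b.

    All candidate pairs are sorted by distance and accepted shortest-first,
    skipping any pair whose endpoints are already used. This is a greedy
    heuristic, not a minimum-total-distance assignment.
    """
    candidates = sorted(
        (math.dist(a, b), i, j)
        for i, a in enumerate(points_a)
        for j, b in enumerate(points_b)
    )
    used_a, used_b, pairs = set(), set(), {}
    for _, i, j in candidates:
        if i in used_a or j in used_b:
            continue  # one of the two points is already matched
        used_a.add(i)
        used_b.add(j)
        pairs[points_a[i]] = points_b[j]
    return pairs

# Points from the question: table1 and table2.
table1 = [(1.0, 1.0), (2.0, 2.0)]
table2 = [(1.5, 1.5), (9.0, 9.0)]
matches = greedy_one_to_one(table1, table2)
```

With these points both table1 rows are equally close to (1.5 1.5); the tie is broken by input order, which gives the first of the two acceptable results mentioned in the question.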
Data
I have 2 tables
- 3D point geometries (obs.geom), n=10
- a single 3D point (target.geom), n=1
Problem - part 1
When I run the following code it lists all 10 of the obs geoms rather than just the closest point to the target.geom. Can anyone give me some pointers?
SELECT ST_3DClosestPoint(target.geom, obs.geom)
FROM target, obs
Part 2
I then want to add in the Distance3D
SELECT ST_3DDistance(ST_3DClosestPoint(target.geom, obs.geom) as dist_closest, obs.geom) as distance
FROM target, obs
where dist_closest > 1.5
We cannot use the KNN operator <-> here (it works only in 2D), so we have to work around it a bit.
For a single point in the target table it looks like this:
select * , st_distance(o.geom, t.geom), st_3ddistance(o.geom, t.geom)
from obs o, target t
order by st_3ddistance(o.geom, t.geom) limit 1
But this will not work if you want results for many targets at once. To find the closest point for each of many targets, we have to use a lateral join:
select t2.*, a.*
from target t2,
lateral (select o.*
         from obs o
         -- the lateral subquery can reference t2 directly, so no second join to target is needed
         order by st_3ddistance(o.geom, t2.geom) limit 1) a
If you want more than one closest point, just increase the limit.
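What the lateral join computes (for every target, the `limit` closest obs rows by 3D distance) can be sketched in plain Python. This is only an illustration under assumed data: the function name `nearest_obs`, the ids and the coordinates are made up, and the linear scan stands in for what the database does per target row.

```python
import heapq
import math

def nearest_obs(targets, obs, k=1):
    """For each target id, return the ids of the k closest obs by 3D distance."""
    result = {}
    for t_id, t_xyz in targets.items():
        closest = heapq.nsmallest(
            k, obs.items(), key=lambda item: math.dist(item[1], t_xyz)
        )
        result[t_id] = [o_id for o_id, _ in closest]
    return result

# Hypothetical sample data: id -> (x, y, z)
targets = {1: (0.0, 0.0, 0.0), 2: (10.0, 10.0, 10.0)}
obs = {"a": (1.0, 0.0, 0.0), "b": (0.0, 0.0, 5.0), "c": (9.0, 10.0, 12.0)}

nearest = nearest_obs(targets, obs, k=1)   # same role as "limit 1"
top_two = nearest_obs(targets, obs, k=2)   # same role as "limit 2"
```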
I'm not sure I can give this question a clear title...
What I want is to calculate the distance between points and polygons (this is step 1) and then, for each point, keep only the closest polygon (NB: one polygon can have many points attached, but one point must be attached to only one polygon).
What I'm currently doing is the following:
CREATE TABLE temp_table AS
SELECT
areas.*,
points.*, -- includes a points_id column
ST_DistanceSphere(areas.geometry, points.geometry) AS distance_sphere
FROM points
INNER JOIN areas
ON ST_DWithin(areas.geometry, points.geometry, 25);
SELECT *
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY temp_table.points_id ORDER BY distance_sphere ASC) AS rownumber, *
FROM temp_table
) X
WHERE rownumber = 1;
I have a feeling it's quite inefficient, as it calculates many useless rows (the first query has been running all night on a database of 4,000,000 rows; it took 29 minutes with a LIMIT 10 at the end).
Would putting the first query inside the second one be faster?
SELECT *
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY temp_table.points_id ORDER BY distance_sphere ASC) AS rownumber, *
FROM (
SELECT
areas.*,
points.*, -- includes a points_id column
ST_DistanceSphere(areas.geometry, points.geometry) AS distance_sphere
FROM areas
INNER JOIN points
ON ST_DWithin(areas.geometry, points.geometry, 25)
) temp_table -- the subquery needs an alias so that temp_table.points_id resolves
) X
WHERE rownumber = 1;
If not, how could I optimize what I'm doing?
Which EPSG/SRID do you use, and are its units degrees or meters? For example:
- 4326 is in degrees
- 3857 is in meters
If you use a metric SRID, you should use ST_Distance, not ST_DistanceSphere. If you use a degree-based SRID, be careful with ST_DWithin: it uses the units of the SRID, so 25 means 25 degrees, which is a HUGE distance (roughly 2,800 km at the equator).
So if you use 4326 (degrees), use a much smaller value than 25 in your ST_DWithin.
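To put numbers on the degree-versus-meter point, here is a small Python sketch of the arithmetic. It assumes a spherical Earth with a mean radius of 6,371 km, and it assumes (this is not stated in the question) that the 25 was meant as 25 meters; the helper names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation (assumption)

def metres_per_degree(latitude_deg):
    """Return (north-south, east-west) metres spanned by one degree at a latitude."""
    per_degree_lat = math.pi * EARTH_RADIUS_M / 180
    per_degree_lon = per_degree_lat * math.cos(math.radians(latitude_deg))
    return per_degree_lat, per_degree_lon

def metres_to_degrees(distance_m, latitude_deg):
    """Radius in degrees that covers at least distance_m in every direction."""
    _, per_degree_lon = metres_per_degree(latitude_deg)
    # An east-west degree is the shorter one, so dividing by it is the safe choice.
    return distance_m / per_degree_lon

lat_m, lon_m = metres_per_degree(0.0)
radius_25_degrees_km = 25 * lat_m / 1000              # what "25" means in SRID 4326
radius_25_m_in_degrees = metres_to_degrees(25, 45.0)  # 25 m expressed in degrees at 45° N
```

So a search radius of 25 in SRID 4326 spans well over two thousand kilometers, while 25 meters at mid-latitudes corresponds to only a few ten-thousandths of a degree.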
Create GiST indexes on both geometry columns:
CREATE INDEX ON points USING gist (geometry);
CREATE INDEX ON areas USING gist (geometry);
Then just use your query with the proposed changes (change ST_DistanceSphere to ST_Distance, or use ST_DWithin with a much smaller value).
I have the following, which gives me the number of customers within 10,000 meters of any store location:
SELECT COUNT(*) as customer_count FROM customer_table c
WHERE EXISTS(
SELECT 1 FROM locations_table s
WHERE ST_Distance_Sphere(s.the_geom, c.the_geom) < 10000
)
What I need is for this query to return not only the number of customers within 10,000 meters, but also the following. The number of customers within...
10,000 meters
more than 10,000, but less than 50,000
more than 50,000, but less than 100,000
more than 100,000
...of any location.
I'm open to this working in a couple of ways. One is to count each customer only once (by their shortest distance to any store), which would count everyone exactly once; I realize this is probably pretty complex. I'm also open to having people counted multiple times, which gives the really accurate values anyway and, I think, should be much simpler.
Thanks for any direction.
You can do both types of queries relatively easily. But an issue here is that you do not know which customers are associated with which store locations, which seems like an interesting thing to know. If you want that, use the PK and store_name of the locations_table in the query. See both options with location id and store_name below. To emphasize the difference between the two options:
The first option counts, for every store location, how many customers fall in each distance class, considering all customers for every store location.
The second option counts, for every store location, how many customers fall in each distance class, considering each customer only for their nearest store location.
Both are queries of O(n × m) order (the CROSS JOIN between customer_table and locations_table) and are likely to become rather slow as the number of rows in either table grows.
Count customers in all distance classes
Make a CROSS JOIN between customers and store locations to get every distance, join the result to your distance classes, and group by the store location id, name and class. Because the join condition is loc_dist.dist < grps.dist, the classes are cumulative: a customer 5,000 m from a store is counted in every class for that store. You can create a "table" of distance classes with the VALUES command, which you can then simply use in any query:
SELECT loc_dist.id, loc_dist.store_name, grps.grp, count(*)
FROM (
SELECT s.id, s.store_name, ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
FROM customer_table c, locations_table s) AS loc_dist
JOIN (
VALUES(1, 10000.), (2, 50000.), (3, 100000.), (4, 1000000.)
) AS grps(grp, dist) ON loc_dist.dist < grps.dist
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
Count customers in the nearest distance class
If you want each customer counted only once, in the distance class of their nearest store, make the same CROSS JOIN on customer_table and locations_table as in the previous case, keep only the closest store per customer with DISTINCT ON, then assign the distance class with a CASE clause and GROUP BY store location id, name and distance class as before:
SELECT
id, store_name,
CASE
WHEN dist < 10000. THEN 1
WHEN dist < 50000. THEN 2
WHEN dist < 100000. THEN 3
ELSE 4
END AS grp,
count(*)
FROM (
-- DISTINCT ON keeps the closest store per customer; c.id is assumed to be the primary key of customer_table
SELECT DISTINCT ON (c.id) s.id, s.store_name, ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
FROM customer_table c, locations_table s
ORDER BY c.id, dist) AS loc_dist
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
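The difference between the two counting schemes can be checked on a toy data set. The Python sketch below is only an illustration: it uses planar coordinates in meters instead of spherical distances, the store names and positions are made up, and the function names are hypothetical. `cumulative_counts` mirrors the join on dist < class limit, and `nearest_store_counts` counts each customer once, for their nearest store.

```python
from collections import Counter
import math

# (class id, upper distance limit in metres), as in the VALUES list above
CLASS_LIMITS = [(1, 10_000.0), (2, 50_000.0), (3, 100_000.0), (4, 1_000_000.0)]

def distance_class(dist):
    """Return the first class whose limit exceeds dist; the last class is the fallback."""
    for grp, limit in CLASS_LIMITS[:-1]:
        if dist < limit:
            return grp
    return CLASS_LIMITS[-1][0]

def cumulative_counts(customers, stores):
    """Count each customer/store pair in every class whose limit exceeds its distance."""
    counts = Counter()
    for c in customers:
        for name, s in stores.items():
            d = math.dist(c, s)
            for grp, limit in CLASS_LIMITS:
                if d < limit:
                    counts[(name, grp)] += 1
    return counts

def nearest_store_counts(customers, stores):
    """Count each customer exactly once, in the class of their nearest store."""
    counts = Counter()
    for c in customers:
        name, d = min(
            ((n, math.dist(c, s)) for n, s in stores.items()),
            key=lambda pair: pair[1],
        )
        counts[(name, distance_class(d))] += 1
    return counts

# Hypothetical data, planar coordinates in metres
stores = {"north": (0.0, 0.0), "south": (0.0, 200_000.0)}
customers = [(0.0, 5_000.0), (0.0, 30_000.0), (0.0, 192_000.0)]

cumulative = cumulative_counts(customers, stores)
nearest_only = nearest_store_counts(customers, stores)
```

In the cumulative result the counts per store add up to more than the number of customers, while the nearest-store result always sums to exactly the number of customers.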
For example, I have this query to find the minimum distance between two geometries (stored in 2 tables) with a PostGIS function called ST_Distance.
With thousands of geometries in both tables, it takes too much time without ST_DWithin. ST_DWithin returns true if the geometries are within the specified distance of one another (here 2000 m).
SELECT DISTINCT ON
(a.id)
a.id AS table1_id,
b.id AS table2_id,
min(ST_Distance(a.geom, b.geom)) AS distance
FROM table1 a, table2 b
WHERE ST_DWithin(a.geom, b.geom, 2000.0)
GROUP BY a.id, b.id
ORDER BY a.id, distance
But you have to estimate a distance value large enough to fetch all geometries (e.g. those stored in table1). To do that you either have to inspect your data in a GIS, or calculate the maximum distance over all features (and that takes a lot of time).
At the moment I approximate the distance value by trial and error until all features from table1 are returned.
Would it be efficient for my query to automatically increase the distance value (by a reasonable step) until the count of all geometries (e.g. in table1) is reached? How can I implement this?
Would it slow everything down, because the query may need many attempts to find the distance value?
Do I have to use a recursive query for this purpose?
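The idea of widening the search distance step by step can be sketched in Python to show the control flow. This is only an illustration under assumptions: the function name, the doubling factor and the sample points are made up, and the inner linear scan stands in for an index-assisted ST_DWithin lookup. Only features that are still unmatched are retried with the larger radius, and once any neighbour lies within the radius, the closest one inside it is the true nearest neighbour.

```python
import math

def nearest_with_growing_radius(points_a, points_b, start=2000.0, factor=2.0, max_rounds=20):
    """Find the nearest b for every a, widening the radius only for unmatched a."""
    nearest = {}                          # index in a -> (index in b, distance)
    pending = set(range(len(points_a)))   # features of a without a match yet
    radius = start
    for _ in range(max_rounds):
        if not pending:
            break
        for i in sorted(pending):
            in_range = []
            for j, b in enumerate(points_b):
                d = math.dist(points_a[i], b)
                if d <= radius:           # stand-in for ST_DWithin(a.geom, b.geom, radius)
                    in_range.append((d, j))
            if in_range:
                d, j = min(in_range)
                nearest[i] = (j, d)
                pending.discard(i)
        radius *= factor
    return nearest, pending

# Hypothetical planar points in metres
points_a = [(0.0, 0.0), (0.0, 10_000.0)]
points_b = [(0.0, 1_500.0), (0.0, 50_000.0)]
nearest, unmatched = nearest_with_growing_radius(points_a, points_b)
```

Here the first point is matched in the first round (radius 2000), and the second one only after the radius has grown to 16000.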
See this post here: K-Nearest Neighbor Query in PostGIS
Basically, the <-> operator is a bit unusual in that it works in the ORDER BY clause, but it avoids having to guess how far to search with ST_DWithin. There is a major gotcha with this operator, though: the geometry in the ORDER BY clause must be a constant, that is, you CANNOT write:
select a.id, b.id from table1 a, table2 b order by a.geom <-> b.geom limit 1;
Instead you would have to create a loop, substituting a constant value for b.geom above.
More information can be found here: http://boundlessgeo.com/2011/09/indexed-nearest-neighbour-search-in-postgis/
I have a table that has a value field. The records have values somewhat evenly distributed between 0 and 100.
I want to query this table for n records, given a target mean, x, so that I'll receive a weighted random result set where avg(value) will be approximately x.
I could easily do something like
SELECT TOP n * FROM table ORDER BY abs(x - value)
... but that would give me the same result every time I run the query.
What I want to do is to add weighting of some sort, so that any record may be selected, but with diminishing probability as the distance from x increases, so that I'll end up with something like a normal distribution around my given mean.
I would appreciate any suggestions as to how I can achieve this.
Why not use the RAND() function?
SELECT TOP n * FROM table ORDER BY abs(x - value) + RAND()
EDIT
Using RAND() won't work, because RAND() in a SELECT is evaluated once per query and so produces the same number for every row. Heximal was right to use NEWID(), but it needs to be used directly in the ORDER BY:
SELECT Top N value
FROM table
ORDER BY
abs(X - value) + (cast(cast(Newid() as varbinary) as integer))/10000000000
The large divisor 10000000000 keeps the random term small, so that avg(value) stays close to X while AVG(X - value) stays low.
With all that said, asking the question (without the SQL bits) on https://stats.stackexchange.com/ may get you better results.
Try
SELECT TOP n * FROM table ORDER BY abs(x - value), newid()
or
select * from (
SELECT TOP n * FROM table ORDER BY abs(x - value)
) a order by newid()
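The weighting asked for in the question (any record may be selected, with probability falling off with distance from x) can be sketched outside SQL. The Python sketch below is one possible scheme, not the method used in the answers above: it assumes Gaussian weights with a hypothetical `spread` parameter and draws without replacement by giving each record an exponential random key whose rate is its weight, then keeping the n smallest keys. The resulting mean is only approximately x, and it drifts when x is close to 0 or 100.

```python
import math
import random

def weighted_sample(values, target_mean, n, spread=10.0, rng=random):
    """Draw up to n distinct values, favouring those close to target_mean."""
    keyed = []
    for v in values:
        # Gaussian weight: largest at target_mean, shrinking with distance
        weight = math.exp(-((v - target_mean) ** 2) / (2 * spread ** 2))
        if weight == 0.0:
            continue  # weight underflowed: effectively never selected
        # Exponential key with rate = weight; the n smallest keys form a
        # weighted random sample without replacement.
        keyed.append((rng.expovariate(weight), v))
    keyed.sort()
    return [v for _, v in keyed[:n]]

# Hypothetical table: values spread evenly between 0 and 100
values = [i / 10 for i in range(1001)]
sample = weighted_sample(values, target_mean=30, n=50, rng=random.Random(0))
sample_mean = sum(sample) / len(sample)
```

Running it with a different seed, or with the default unseeded generator, gives a different sample each time while keeping the mean near the target.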