Which GEO implementation to use for millions of points - postgresql

I am trying to figure out which geospatial implementation to use to find the points nearest to a given point, based on longitude/latitude. I will have millions, if not billions, of latitude/longitude points to compare. I have looked at several implementations: PostGIS (it looks very popular and performs well), Neo4j (graph databases are a new concept to me and I am unsure how they perform), AWS DynamoDB geohash (scales very well, but the only library is written in Java, and I am hoping to write a library in Node.js), and others, but I can't figure out which would perform best. I am looking purely at performance, as opposed to the number of features. All I need is to compare one point to all points and find the closest (read operation), and to be able to change a point in the database quickly (write operation). Could anyone suggest a good implementation based on these requirements?

PostGIS has several functions for geohashing. If you make your strings long enough the search becomes quicker (fewer collisions per box + its 8 neighbours), but geohash generation becomes slower when inserting new points.
The question is also how accurate you want to be. At increasing latitude, lat/long "distance" deteriorates because a degree of longitude shrinks from about 110 km at the Equator to 0 at the poles, while a degree of latitude is always about 110 km. At the mid-latitude of 45 degrees a degree of longitude is nearly 79 km, giving an error in squared distance of a factor of 2 (sqr(110/79)). Spherical distance, which gives you the true distance between lat/long pairs, is very expensive to calculate (lots of trigonometry going on), and then your geohashing won't work (unless you convert all points to planar coordinates).
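To make the trade-off between hash length and box size concrete, here is a minimal sketch of the standard geohash encoding in Python. The function name `geohash_encode` is made up for this illustration; in the database you would rely on the PostGIS geohash functions rather than on this code.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a latitude/longitude pair (in degrees) as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


full_hash = geohash_encode(57.64911, 10.40744, 9)
box_hash = full_hash[:8]  # truncating the string selects the enclosing, larger box
print(full_hash, box_hash)
```

Because each character only refines the previous ones, a shorter prefix always names a box that contains the longer hash, which is what makes a prefix index usable for the search.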
A solution that might work is the following:
CREATE INDEX hash8 ON tablename(substring(hash_column FROM 1 FOR 8)). This gives you an index on a box twice as large as your resolution, which helps finding points and reducing the need to search neighbouring hash boxes.
On INSERT of a point, compute its geohash of length 9 (10m resolution approx.) into hash_column, using PostGIS. You could use a BEFORE INSERT TRIGGER here.
In a function:
Given a point, find the nearest point by looking for all points with a geohash value shortened to 8 chars equal to the given point's 8-char geohash (hence the index above).
Compute the distance to each of the encountered points using spherical coordinates, keeping the closest point. But since you are only looking for the nearest point (at least initially), do not search on distance using spherical coordinates, but use the below optimization, which should make the search much much faster.
Compute if the given point is closer to the edge of the box determined by the 8-char geohash than the closest computed point. If so, repeat the procedure with the 7-char geohash on all points in its 8 neighbours. This can be highly optimized by computing distances to individual box sides and corners and evaluating only the relevant neighbour hash boxes; I leave this to you to tinker with.
In any case, this will not be particularly speedy. If you are indeed going towards billions of points you might want to think about partitioning, which has a rather "natural" solution for geohashing (break up your table on substring(hash_column FROM 1 FOR 2) for instance, giving you up to 1,024 partitions, since each geohash character encodes 5 bits). Just make sure that you account for cross-boundary searches.
Two optimizations can be made fairly quickly:
First, "normalize" your spherical coordinates (meaning: compensate for the reduced length of a degree of longitude with increasing latitude) so that you can search for nearest points using a "pseudo-cartesian" approach. This only works if points are close together, but since you are working with lots of points this should not be a problem. More specifically, this should work for all points in geohash boxes of length 6 or more.
Assuming the WGS84 ellipsoid (which is used in all GPS devices), the Earth's semi-major axis (a) is 6,378,137 meters, with a squared eccentricity (e2) of 0.00669438. A second of longitude has a length (in meters) of
longSec := Pi * a * cos(lat) / sqrt(1 - e2 * sqr(sin(lat))) / 180 / 3600
or
longSec := 30.92208078 * cos(lat) / sqrt(1 - 0.00669438 * sqr(sin(lat)))
For a second of latitude:
latSec := 30.870265 - 0.155506 * cos(2 * lat) + 0.0003264 * cos(4 * lat)
To make your local coordinate system "square", multiply your longitude differences by the correction factor longSec/latSec.
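As a quick numerical check, the two lengths and their ratio can be sketched in Python. The function names are made up for this illustration, and the latitude series uses the coefficients 0.155506 and 0.0003264 as multipliers of the cos(2·lat) and cos(4·lat) terms (the usual series for the length of a degree of latitude, divided by 3600).

```python
import math


def long_sec(lat_deg: float) -> float:
    """Length in meters of one arc-second of longitude at the given latitude."""
    lat = math.radians(lat_deg)
    return 30.92208078 * math.cos(lat) / math.sqrt(1 - 0.00669438 * math.sin(lat) ** 2)


def lat_sec(lat_deg: float) -> float:
    """Length in meters of one arc-second of latitude at the given latitude."""
    lat = math.radians(lat_deg)
    return 30.870265 - 0.155506 * math.cos(2 * lat) + 0.0003264 * math.cos(4 * lat)


def correction_factor(lat_deg: float) -> float:
    """Factor by which longitude differences are scaled to match latitude differences."""
    return long_sec(lat_deg) / lat_sec(lat_deg)


for lat_deg in (0, 30, 45, 60):
    print(lat_deg, round(correction_factor(lat_deg), 4))
```

The factor is close to 1 at the Equator and drops to roughly 0.71 at 45 degrees, which is the shrinking degree of longitude described earlier.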
Secondly, since you are looking for the nearest point, do not search on distance because of the computationally expensive square root. Instead, search on the term within the square root, the squared distance if you will, because this has the same property of selecting for the nearest point.
In pseudo-code:
CREATE FUNCTION nearest_point(pt geometry, ptHash8 char(8)) RETURNS integer AS $$
DECLARE
  corrFactor double precision;
  ptLat double precision;
  ptLong double precision;
  currPt record;
  minDist double precision;
  currDist double precision;
  diffLat double precision;
  diffLong double precision;
  minId integer;
BEGIN
  minDist := 100000000.; -- sentinel, far larger than any squared distance in degrees
  ptLat := ST_Y(pt);
  ptLong := ST_X(pt);
  -- longSec / latSec at the latitude of the search point
  corrFactor := 30.92208078 * cos(radians(ptLat)) / (sqrt(1 - 0.00669438 * power(sin(radians(ptLat)), 2)) *
    (30.870265 - 0.155506 * cos(2 * radians(ptLat)) + 0.0003264 * cos(4 * radians(ptLat))));
  -- all_points is assumed to have columns id, pt (geometry) and hash8 (8-char geohash)
  FOR currPt IN SELECT * FROM all_points WHERE hash8 = ptHash8
  LOOP
    diffLat := ST_Y(currPt.pt) - ptLat;
    diffLong := (ST_X(currPt.pt) - ptLong) * corrFactor; -- "square" things out
    currDist := diffLat * diffLat + diffLong * diffLong; -- squared distance, no square root needed
    IF currDist < minDist THEN
      minDist := currDist; -- this does not happen so often
      minId := currPt.id;
    END IF;
  END LOOP;
  IF minDist < 100000000. THEN
    RETURN minId;
  ELSE
    RETURN NULL; -- no points in this hash box
  END IF;
END; $$ LANGUAGE PLPGSQL STRICT;
Needless to say, this would be a lot faster in a C language function. Also, do not forget to do boundary checking to see if neighbouring geohash boxes need to be searched.
Incidentally, "spatial purists" would not index on the 8-char geohash and search from there; instead they would start from the 9-char hash and work outwards from there. However, a "miss" in your initial hash box (because there are no other points or you are close to a hash box side) is expensive because you have to start computing distances to neighbouring hash boxes and pull in more data. In practice you should work from a hash box which is about twice the size of the typical distance to the nearest point; what that distance is depends on your point set.

Related

Efficient way to select all points inside radius

I'm working on a MEAN stack application, so I have a MongoDB collection that contains worldwide locations. The schema looks as follows.
[{
address : "example address",
langitude : 79.8816,
latitude : 6.773
},
{...}]
What is an efficient way to select all points inside a circle (a predefined point and a predefined radius)?
We can also do this with an SQL query, but I want an efficient way that does not iterate over the points one by one and check whether each is inside that radius.
Distance d between two points with coordinates {lat1,lon1} and {lat2,lon2} is given by
d = 2*asin(sqrt((sin((lat1-lat2)/2))^2 +
cos(lat1)*cos(lat2)*(sin((lon1-lon2)/2))^2))
Above formula gives d in radians. To get the answer in km, use the formula below.
distance_km ≈ radius_km * distance_radians ≈ 6371 * d
Here, the magic number 6371 is approx. the radius of planet earth. This is the minimum computation that you will have to do in your case. You can compute this distance and select all the records having a distance less than your radius value.
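A minimal Python sketch of this computation, assuming each record carries `latitude` and `longitude` keys in degrees (the key names and the sample records are made up for the illustration):

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean radius of the Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    h = (math.sin((lat1 - lat2) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon1 - lon2) / 2) ** 2)
    d = 2 * math.asin(min(1.0, math.sqrt(h)))  # clamp guards against rounding above 1
    return EARTH_RADIUS_KM * d


def within_radius(points, center_lat, center_lon, radius_km):
    """Linear scan: keep every record whose distance to the center is within the radius."""
    return [p for p in points
            if haversine_km(p["latitude"], p["longitude"], center_lat, center_lon) <= radius_km]


# Hypothetical sample records.
points = [
    {"address": "A", "latitude": 6.773, "longitude": 79.8816},
    {"address": "B", "latitude": 6.9271, "longitude": 79.8612},
    {"address": "C", "latitude": 51.5, "longitude": -0.12},
]
print([p["address"] for p in within_radius(points, 6.78, 79.88, 20)])
```

This still visits every record once, so it is the baseline that an index-backed geo query is meant to beat.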
Better approach
You can use Elasticsearch which supports geolocation queries for better performance.

Distance order of coordinates equation without acos

I need an equation that will compute the distance order between 2 coordinates (the unit of the result doesn't matter; I only want to obtain a sorted list).
The equation can include the cosine and sine of a specific coordinate, but it can't include acos or the sin/cos of the difference coord1 - coord2.
For example: it can include cos(coord1.latitude) + sin(coord2.longitude), but it can't include sin(coord1.latitude - coord2.latitude).
Someone told me that such an equation exists, but I cannot find it on the internet.
EDIT: I found a solution that accounts for the Earth's curvature and gives a pretty solid result. Sorting by the value of the expression below in descending order puts the closest coordinates first. Computing the exact distance is not a problem once we have the data in the correct order:
(sin(currentPosLatitude) * sin(targetLatitude)) + (cos(currentPosLatitude) * cos(targetLatitude)) *
(sin(currentPosLongitude) * sin(targetLongitude) + cos(currentPosLongitude) * cos(targetLongitude))
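A small Python sketch of this ordering, under the assumption that the sines and cosines of each coordinate are computed once per point (for example stored alongside the coordinates). The helper names and the sample cities are only for illustration. The expression is the cosine of the central angle between the two points, so a larger value means a shorter distance.

```python
import math


def trig(lat_deg, lon_deg):
    """Per-point terms: sin/cos of latitude and longitude, computed once."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return math.sin(lat), math.cos(lat), math.sin(lon), math.cos(lon)


def closeness(a, b):
    """Cosine of the central angle between a and b; no acos, no trig of differences."""
    sin_lat_a, cos_lat_a, sin_lon_a, cos_lon_a = a
    sin_lat_b, cos_lat_b, sin_lon_b, cos_lon_b = b
    return (sin_lat_a * sin_lat_b
            + cos_lat_a * cos_lat_b * (sin_lon_a * sin_lon_b + cos_lon_a * cos_lon_b))


current = trig(52.23, 21.01)  # Warsaw
targets = {
    "Paris": trig(48.86, 2.35),
    "Krakow": trig(50.06, 19.94),
    "Berlin": trig(52.52, 13.40),
}
# Descending closeness corresponds to ascending distance.
ordered = sorted(targets, key=lambda name: closeness(current, targets[name]), reverse=True)
print(ordered)
```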

Most efficient way to find points within a certain radius from a given point

I've read several questions + answers here on SO about this theme, but I can't understand which is the common way (if there is one...) to find all the points within a "circle" having a certain radius, centered on a given point.
In particular I found two ways that seem the most convincing:
select id, point
from my_table
where st_Distance(point, st_PointFromText('POINT(-116.768347 33.911404)', 4326)) < 10000;
and:
select id, point
from my_table
where st_Within(point, st_Buffer(st_PointFromText('POINT(-116.768347 33.911404)', 4326), 10000));
Which is the most efficient way to query my database? Is there some other option to consider?
Creating a buffer to find the points is a definite no-no because of (1) the overhead of creating the geometry that represents the buffer, and (2) the point-in-polygon calculation is much less efficient than a simple distance calculation.
You are obviously working with (longitude, latitude) data, so you should convert that to an appropriate Cartesian coordinate system which has the same unit of measure as your distance of 10,000. If that distance is in meters, then you could also cast the point from the table to geography and calculate directly on the (long, lat) coordinates. Since you only want to identify the points that are within the specified distance, you could use the ST_DWithin() function with calculation on the sphere for added speed (don't do this at very high latitudes or with very long distances):
SELECT id, point
FROM my_table
WHERE ST_DWithin(point::geography,
ST_GeogFromText('POINT(-116.768347 33.911404)'),
10000, false);
I have used the following query:
SELECT *, ACOS(SIN(latitude) * SIN(Lat) + COS(latitude) * COS(Lat) * COS(longitude - (Long))) * 6380 AS distance FROM Table_tab WHERE ACOS(SIN(latitude) * SIN(Lat) + COS(latitude) * COS(Lat) * COS(longitude - (Long))) * 6380 < 10
In the above query, latitude and longitude come from the database, and Lat and Long are the coordinates of the point we search from. All four values must be in radians, since SIN and COS expect radians.
How it works: it calculates the distance (in km) between the search point and every point in the database and checks whether that distance is less than 10 km. It returns all the coordinates within 10 km.
I do not know how postgis does it best, but in general:
Depending on your data it might be best to first search in a square bounding box (which contains the search area circle) in order to eliminate a lot of candidates, this should be extremely fast as you can use simple range operators on lon/lat which are ideally indexed properly for this.
In a second step search using the radius.
Also if your limit max points is relatively low and you know you have a lot of candidates, you may simply do a first 'optimistic' attempt with a box inside your circle, if you find enough points you are done !
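The two-step idea can be sketched in Python as follows. This is only an illustration under stated assumptions: points are plain `(lat, lon)` tuples in degrees, the search area is away from the poles and the antimeridian, and the function names are made up. In a database the first step would be range conditions on indexed columns.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    h = (math.sin((lat1 - lat2) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon1 - lon2) / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * math.asin(min(1.0, math.sqrt(h)))


def bounding_box(lat, lon, radius_km):
    """Smallest lat/lon box containing the circle (not valid near poles or antimeridian)."""
    r = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(r)
    dlon = math.degrees(math.asin(math.sin(r) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def points_within(points, lat, lon, radius_km):
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_km)
    # Step 1: cheap range comparisons eliminate most candidates.
    candidates = [p for p in points
                  if min_lat <= p[0] <= max_lat and min_lon <= p[1] <= max_lon]
    # Step 2: exact distance check only on what survived the box test.
    return [p for p in candidates if haversine_km(lat, lon, p[0], p[1]) <= radius_km]


center = (33.911404, -116.768347)
pts = [
    (33.95, -116.77),           # about 4 km away: inside the circle
    (34.2, -116.77),            # about 32 km away: outside the box
    (33.996404, -116.668347),   # in a corner of the box, but outside the circle
]
print(points_within(pts, center[0], center[1], 10))
```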

postgis: point returned in ST_LineLocatePoint not able to detect in ST_Contains

I am using postgis's ST_LineLocatePoint to find out the closest point on a LineString to the given Point, and using ST_LineInterpolatePoint to extract a Point from the returned float number.
ST_LineLocatePoint Query:
SELECT ST_AsText(ST_LineInterpolatePoint(foo.the_line,
ST_LineLocatePoint(foo.the_line,
ST_GeomFromText('POINT(12.962315 77.584841)')))) AS g
FROM (
SELECT ST_GeomFromText('LINESTRING(12.96145 77.58408,12.96219 77.58447,12.96302 77.58489,
12.96316 77.58496,12.96348 77.58511)') AS the_line
) AS foo;
Output:
g
------------------------------------------
POINT(12.9624389808159 77.5845959902924)
Which exactly lies on the linestring I have passed. Demonstration is displayed here.
But when I check whether this point lies on the linestring using ST_Contains, it always returns false, even though the point lies on it.
ST_Contains Query:
SELECT ST_Contains(ST_GeomFromText('LINESTRING(12.96145 77.58408,12.96219 77.58447,
12.96302 77.58489, 12.96316 77.58496, 12.96348 77.58511)'),
ST_GeomFromText('POINT(12.9624389808159 77.5845959902924)'));
Output
st_contains
-------------
f
I don't understand what I am doing wrong. Can anyone help me with this?
Postgresql : 9.4
postgis : 2.1
reference: ST_LineLocatePoint, ST_Contains
I am not getting where I am doing wrong.
I think you're doing fine. I had the same issue some time ago: I used ST_ClosestPoint to locate a point on a linestring and then tried to cut the linestring at this point, but I couldn't.
Following the documentation:
ST_ClosestPoint — Returns the 2-dimensional point on g1 that is
closest to g2. This is the first point of the shortest line.
So I got a situation where one function says the point is on the line, and another function says it can't cut because the point is not on the line. I was as confused as you are now.
In my case the resolution was to draw another line that intersects the first line 'exactly' at the given point, after which the first line could be cut.
After some research I found that the issue was the rounding of the computed and stored coordinates. I explain it to myself this way: by definition a line is infinitely thin and a point is infinitely small (neither has any area), so they can easily miss each other. That is my own reasoning, though, and I'm not sure whether it is right. I advise you to use ST_Intersects with a very small ST_Buffer, or the ST_DWithin function with a very small distance.
To be sure that your point lies on a line, it has to be a vertex of that line (e.g. for LINESTRING(0 0, 5 5), the points (0 0) and (5 5)). An example with POINT(3 3) works because its coordinates are computed without any rounding.
This is actually a really common question (most likely a duplicate, but I'm too lazy to find it.)
The issue is related to numerical precision, where the Point is not exactly on the LineString, but is within a very small distance of it. Sort of like how SELECT sin(pi()) is not exactly zero.
Rather than using DE-9IM spatial predicates (like Contains, or Covers, etc.) which normally expect exact noding, it is more robust to use distance-based techniques like ST_DWithin with a small distance threshold. For example:
SELECT ST_Distance(the_point, the_line),
ST_Covers(the_line, the_point),
ST_DWithin(the_point, the_line, 1e-10)
FROM (
SELECT 'POINT(12.9624389808159 77.5845959902924)'::geometry AS the_point,
'LINESTRING(12.96145 77.58408,12.96219 77.58447,12.96302 77.58489,12.96316 77.58496,12.96348 77.58511)'::geometry AS the_line
) AS foo;
-[ RECORD 1 ]----------------------
st_distance | 1.58882185807825e-014
st_covers | f
st_dwithin | t
Here you can see that ST_DWithin indicates that the point is within a very small distance of the line, so it effectively contains the point.
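The same distance-with-tolerance idea, sketched outside the database in plain Python. The helper names are made up, and the code treats the coordinates as planar, which is also what the geometry type does. It is an illustration of the technique, not of how PostGIS implements ST_DWithin.

```python
import math


def point_segment_distance(p, a, b):
    """Planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_on_line(p, line, tol=1e-10):
    """True if p lies within tol of any segment of the polyline."""
    return any(point_segment_distance(p, a, b) <= tol for a, b in zip(line, line[1:]))


line = [(12.96145, 77.58408), (12.96219, 77.58447), (12.96302, 77.58489),
        (12.96316, 77.58496), (12.96348, 77.58511)]
point = (12.9624389808159, 77.5845959902924)
print(is_on_line(point, line))
```

The tolerance should be chosen well above floating-point noise but far below any distance that matters in your data.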
ST_Contains() only returns true if the geometry to test lies within the supplied geometry. In your case the point has to lie within the linestring and this is always false since a linestring does not have an interior.
You should use the ST_Covers() function instead: true if no point of the geometry to test (your point) lies outside the supplied geometry (your linestring).

Dijkstra's algorithm with negative weights

Can we use Dijkstra's algorithm with negative weights?
STOP! Before you think "lol nub you can just endlessly hop between two points and get an infinitely cheap path", I'm more thinking of one-way paths.
An application for this would be a mountainous terrain with points on it. Obviously going from high to low doesn't take energy, in fact, it generates energy (thus a negative path weight)! But going back again just wouldn't work that way, unless you are Chuck Norris.
I was thinking of incrementing the weight of all points until they are non-negative, but I'm not sure whether that will work.
As long as the graph does not contain a negative cycle (a directed cycle whose edge weights have a negative sum), it will have a shortest path between any two points, but Dijkstra's algorithm is not designed to find them. The best-known algorithm for finding single-source shortest paths in a directed graph with negative edge weights is the Bellman-Ford algorithm. This comes at a cost, however: Bellman-Ford requires O(|V|·|E|) time, while Dijkstra's requires O(|E| + |V|log|V|) time, which is asymptotically faster for both sparse graphs (where E is O(|V|)) and dense graphs (where E is O(|V|^2)).
In your example of a mountainous terrain (necessarily a directed graph, since going up and down an incline have different weights) there is no possibility of a negative cycle, since this would imply leaving a point and then returning to it with a net energy gain - which could be used to create a perpetual motion machine.
Increasing all the weights by a constant value so that they are non-negative will not work. To see this, consider the graph where there are two paths from A to B, one traversing a single edge of length 2, and one traversing edges of length 1, 1, and -2. The second path is shorter, but if you increase all edge weights by 2, the first path now has length 4, and the second path has length 6, reversing the shortest paths. This tactic will only work if all possible paths between the two points use the same number of edges.
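The counterexample can be checked with a few lines of Python. The node names A, B, C, D and the small Bellman-Ford helper are only for this illustration; the edge weights are the ones from the paragraph above.

```python
def bellman_ford(edges, source):
    """Single-source shortest paths; edges are (u, v, weight) tuples, no negative cycles."""
    nodes = {n for u, v, _ in edges for n in (u, v)}
    dist = {n: float("inf") for n in nodes}
    pred = {n: None for n in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
    return dist, pred


def path_to(pred, target):
    """Walk the predecessor chain back from target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]


# One direct edge of length 2, and a detour with lengths 1, 1 and -2.
edges = [("A", "B", 2), ("A", "C", 1), ("C", "D", 1), ("D", "B", -2)]
shifted = [(u, v, w + 2) for u, v, w in edges]  # add 2 so every weight is non-negative

dist, pred = bellman_ford(edges, "A")
dist_shifted, pred_shifted = bellman_ford(shifted, "A")
print(path_to(pred, "B"), dist["B"])
print(path_to(pred_shifted, "B"), dist_shifted["B"])
```

The shift penalizes paths in proportion to their number of edges, which is why the three-edge detour loses to the single edge after the offset.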
If you read the proof of optimality, one of the assumptions made is that all the weights are non-negative. So, no. As Bart recommends, use Bellman-Ford if there are no negative cycles in your graph.
You have to understand that a negative edge isn't just a negative number --- it implies a reduction in the cost of the path. If you add a negative edge to your path, you have reduced the cost of the path --- if you increment the weights so that this edge is now non-negative, it does not have that reducing property anymore and thus this is a different graph.
I encourage you to read the proof of optimality --- there you will see that the assumption that adding an edge to an existing path can only increase (or not affect) the cost of the path is critical.
You can use Dijkstra's on a negatively weighted graph, but you first have to find the proper offset for each vertex. That is essentially what Johnson's algorithm does. But that would be overkill, since Johnson's uses Bellman-Ford to find the weight offset(s). Johnson's is designed to find all shortest paths between pairs of vertices.
http://en.wikipedia.org/wiki/Johnson%27s_algorithm
There is actually an algorithm which uses Dijkstra's algorithm in a negative path environment; it does so by removing all the negative edges and rebalancing the graph first. This algorithm is called 'Johnson's Algorithm'.
The way it works is by adding a new node (let's say Q) which has a 0-cost edge to every other node in the graph. It then runs Bellman-Ford on the graph from Q, getting a cost for each node with respect to Q, which we will call q[x]; this will be either 0 or a negative number (if it used one of the negative paths).
E.g. a -> -3 -> b, therefore if we add a node Q which has 0 cost to all of these nodes, then q[a] = 0, q[b] = -3.
We then rebalance the edges using the formula weight + q[source] - q[destination], so the new weight of a->b is -3 + 0 - (-3) = 0. We do this for all other edges in the graph, then remove Q and its outgoing edges, and voila! We now have a rebalanced graph with no negative edges, on which we can run Dijkstra's!
The running time is O(nm) [bellman-ford] + n x O(m log n) [n Dijkstra's] + O(n^2) [weight computation] = O (nm log n) time
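The reweighting step can be sketched in Python as below. The function name and the three-node example graph are made up for the illustration. Starting every potential at 0 is equivalent to having already relaxed the 0-cost edges out of the virtual node Q.

```python
def johnson_reweight(nodes, edges):
    """Return (q, reweighted_edges); raises ValueError if a negative cycle exists.

    edges are (source, destination, weight) tuples.
    """
    q = {n: 0 for n in nodes}  # distances from the virtual node Q
    # Bellman-Ford relaxation rounds; len(nodes) rounds are enough with Q included.
    for _ in range(len(nodes)):
        for u, v, w in edges:
            if q[u] + w < q[v]:
                q[v] = q[u] + w
    if any(q[u] + w < q[v] for u, v, w in edges):
        raise ValueError("graph contains a negative cycle")
    # New weight = weight + q[source] - q[destination], always non-negative.
    return q, [(u, v, w + q[u] - q[v]) for u, v, w in edges]


nodes = ["a", "b", "c"]
edges = [("a", "b", -3), ("b", "c", 2), ("a", "c", 4)]
q, reweighted = johnson_reweight(nodes, edges)
print(q)
print(reweighted)
```

A shortest-path length found by Dijkstra's on the reweighted graph converts back to the original weights as d(u, v) = d'(u, v) - q[u] + q[v].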
More info: http://joonki-jeong.blogspot.co.uk/2013/01/johnsons-algorithm.html
Actually I think it'll work to modify the edge weights. Not with an offset but with a factor. Assume instead of measuring the distance you are measuring the time required from point A to B.
weight = time = distance / velocity
You could even adapt velocity depending on the slope to use the physical one if your task is for real mountains and car/bike.
Yes, you could do that by adding one step at the end, i.e.:
If v ∈ Q, Then Decrease-Key(Q, v, v.d)
Else Insert(Q, v) and S = S \ {v}.
An expression tree is a binary tree in which all leaves are operands (constants or variables), and the non-leaf nodes are binary operators (+, -, /, *, ^). Implement this tree to model polynomials with the basic methods of the tree including the following:
A function that calculates the first derivative of a polynomial.
Evaluate a polynomial for a given value of x.
[20] Use the following rules for the derivative:
Derivative(constant) = 0
Derivative(x) = 1
Derivative(P(x) + Q(x)) = Derivative(P(x)) + Derivative(Q(x))
Derivative(P(x) - Q(x)) = Derivative(P(x)) - Derivative(Q(x))
Derivative(P(x) * Q(x)) = P(x)*Derivative(Q(x)) + Q(x)*Derivative(P(x))
Derivative(P(x) / Q(x)) = (Q(x)*Derivative(P(x)) - P(x)*Derivative(Q(x))) / (Q(x) ^ 2)
Derivative(P(x) ^ c) = c * (P(x) ^ (c - 1)) * Derivative(P(x)), where the exponent c is a constant
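One possible shape for such a tree, sketched in Python. The class and function names are assumptions made for this illustration, the quotient rule is the standard one, and the power rule is restricted to constant exponents, which is all a polynomial needs. The derivative is returned as a new, unsimplified tree.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str = "x"


@dataclass(frozen=True)
class BinOp:
    op: str  # one of + - * / ^
    left: "Node"
    right: "Node"


Node = Union[Const, Var, BinOp]


def evaluate(node: Node, x: float) -> float:
    """Evaluate the expression tree for a given value of x."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Var):
        return x
    a, b = evaluate(node.left, x), evaluate(node.right, x)
    if node.op == "+":
        return a + b
    if node.op == "-":
        return a - b
    if node.op == "*":
        return a * b
    if node.op == "/":
        return a / b
    if node.op == "^":
        return a ** b
    raise ValueError(f"unknown operator {node.op!r}")


def derivative(node: Node) -> Node:
    """Build the tree of the first derivative with respect to x."""
    if isinstance(node, Const):
        return Const(0)
    if isinstance(node, Var):
        return Const(1)
    p, q = node.left, node.right
    dp = derivative(p)
    if node.op == "^":
        if not isinstance(q, Const):
            raise ValueError("only constant exponents are supported")
        return BinOp("*", BinOp("*", q, BinOp("^", p, Const(q.value - 1))), dp)
    dq = derivative(q)
    if node.op in ("+", "-"):
        return BinOp(node.op, dp, dq)
    if node.op == "*":
        return BinOp("+", BinOp("*", p, dq), BinOp("*", q, dp))
    if node.op == "/":
        numerator = BinOp("-", BinOp("*", q, dp), BinOp("*", p, dq))
        return BinOp("/", numerator, BinOp("^", q, Const(2)))
    raise ValueError(f"unknown operator {node.op!r}")


# 3*x^2 + 2*x - 5
poly = BinOp("-",
             BinOp("+",
                   BinOp("*", Const(3), BinOp("^", Var(), Const(2))),
                   BinOp("*", Const(2), Var())),
             Const(5))
print(evaluate(poly, 2), evaluate(derivative(poly), 2))
```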