Find points near LineString in MongoDB, sorted by distance

I have an array of points representing a street (black line) and points representing places on a map (red points). I want to find all the points near the specified street, sorted by distance. I also need the ability to specify a maximum distance (blue and green areas). Here is a simple example:
I thought of using the $near operator, but it only accepts a Point as input, not a LineString.
How can MongoDB handle this type of query?

As you mentioned, Mongo currently doesn't support anything other than Point. Have you come across the concept of a RouteBoxer [1]? It was very popular a few years back on Google Maps: given the line that you've drawn, find stops that are within dist(x). It works by creating a series of bounding boxes around each point in the line and searching for points that fall within each box.
I stumbled upon your question after I had just realised that Mongo only works with points, which I assume is reasonable.
I already have a few options for how to do it (they expand on what @mnemosyn says in the comment). With the dataset that I'm working on, it's all on the client side, so I could use the RouteBoxer, but I would like to implement it server-side for performance reasons. Here are my suggestions:
Break the LineString down into its individual coordinate sets and query for $near using each of those, then combine the results and extract a unique set (a sketch of this appears after this list). There are algorithms out there for simplifying a complex line by reducing the number of points, but a simple one is easy to write.
Do the same as above, but as a stored procedure/function. I haven't played around with Mongo's stored functions and I don't know how well they work with drivers, but this could be faster than the first option above as you won't have to do round trips, and depending on the machine your Mongo instance(s) are hosted on, the calculations could be faster by microseconds.
Implement the RouteBoxer approach server-side (it has been done in PHP), and then use either of the two options above to find stops that are $within the resulting bounding boxes. In fact, since the RouteBoxer method returns rectangles, it would be possible to merge all these rectangles into one polygon covering your route and just do a $within on that (what @mnemosyn suggested).
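A minimal sketch of the first option using pymongo might look like the following (the database and collection names, the location field and the 2dsphere index are assumptions for the sketch, not taken from the question):

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod; adjust the URI as needed
places = client.mydb.places  # hypothetical database/collection names

# $near with $geometry requires a 2dsphere index on the queried field
places.create_index([("location", "2dsphere")])

def points_near_linestring(linestring_coords, max_distance_m):
    # Option 1: run a $near query for every vertex of the (possibly
    # simplified) line and merge the results into a unique set.
    # Per-vertex distance ordering is lost when merging, so re-sort
    # the combined result afterwards if a global ordering is needed.
    seen = {}
    for lng, lat in linestring_coords:
        cursor = places.find({
            "location": {
                "$near": {
                    "$geometry": {"type": "Point", "coordinates": [lng, lat]},
                    "$maxDistance": max_distance_m,
                }
            }
        })
        for doc in cursor:
            seen[doc["_id"]] = doc
    return list(seen.values())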
EDIT: I had thought of this but forgot about it: it might be possible to achieve some of the above using the aggregation framework.
It's something that I'm hopefully going to be working on soon; I'll open-source the result of whichever approach I end up going with.
EDIT: I must mention, though, that options 1 and 2 have the flaw that if you have two points in a line that are, say, 2 km apart, and you want points that are within 1.8 km of your line, you'll obviously miss all the points along that part of your line. The solution is to inject points into your line when simplifying it (I know, it defeats the purpose of reducing points if you add new ones back in).
The flaw with option 3 is that it won't always be accurate, as some points within your polygon are likely to have a distance greater than your limit, though the difference wouldn't be a significant percentage of your limit.
[1] Google Maps utility library: RouteBoxer

As you said, Mongo's $near only works with a point as the centre, not a line. However, if you flip your premise from "find points near the line" to "find the line near the point", then you can use your points as the centre and the line as the target.
This is the difference between
for each line, find the points near it
and
for each point, find the line near it.
If you have a large number of points to check, you can combine this with nevi_me's answer to reduce the list of points that need checking to a much smaller subset.
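A rough pymongo sketch of the flipped query, assuming a hypothetical streets collection that stores each street as a GeoJSON LineString under a geometry field with a 2dsphere index:

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
streets = client.mydb.streets  # hypothetical collection of LineString documents
streets.create_index([("geometry", "2dsphere")])

def streets_near_point(lng, lat, max_distance_m):
    # "for each point, find the line near it": the point is the centre of
    # the $near query and the indexed LineStrings are the targets,
    # returned sorted by distance from the point.
    return list(streets.find({
        "geometry": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lng, lat]},
                "$maxDistance": max_distance_m,
            }
        }
    }))

# Keep a place if the street of interest shows up in streets_near_point()
# for that place's coordinates; the per-place distance can then be used
# to sort the surviving places.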


scipy.interpolate.griddata slow due to unnecessary data

I have a map with a 600*600 equidistant x,y grid with associated scalar values.
I have around 1000 x,y coordinates at which I would like to get the bilinearly interpolated map values. Those are randomly placed in an inner center area of the map of around 400*400 size.
I decided to go with the griddata function with method linear. My understanding is that with linear interpolation I would only need the three nearest grid positions around each coordinate to get well-defined interpolated values, so I would only require around 3000 data points of the map to perform the interpolation. The full 360k data points are unnecessary for this task.
Stupidly throwing the complete map in results in long execution times of half a minute. Since it's easy to narrow the map down to the area of interest, I could reduce the execution time to nearly 20%.
I am now wondering if I overlooked something in my assumption that I only need the three nearest neighbours for my task, and if not, whether there is a fast way to filter those 3000 out of the 360k. I assume looping 3000 times over the 360k lines will take longer than just throwing in the inner map.
Edit: I also had a look at the comparison of the results with the full 600*600 grid and with the reduced data points. I am actually surprised and concerned by the observation that the interpolation results partly differ significantly.
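For reference, the setup described above might look roughly like this (the map data, the crop margin and the array names are made up for the sketch):

import numpy as np
from scipy.interpolate import griddata

# Hypothetical 600*600 equidistant grid with scalar values.
x = np.linspace(0.0, 599.0, 600)
y = np.linspace(0.0, 599.0, 600)
xx, yy = np.meshgrid(x, y, indexing="ij")
values = np.sin(xx / 50.0) * np.cos(yy / 50.0)  # stand-in map values

# ~1000 random query coordinates in the inner ~400*400 area.
rng = np.random.default_rng(0)
queries = rng.uniform(100.0, 500.0, size=(1000, 2))

# Narrow the map down to the area of interest (plus a margin) before
# calling griddata, instead of handing it all 360k points.
margin = 5.0
lo = queries.min(axis=0) - margin
hi = queries.max(axis=0) + margin
mask = (xx >= lo[0]) & (xx <= hi[0]) & (yy >= lo[1]) & (yy <= hi[1])
grid_points = np.column_stack([xx[mask], yy[mask]])
grid_values = values[mask]

interpolated = griddata(grid_points, grid_values, queries, method="linear")
print(interpolated.shape)  # (1000,)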
So I found out that RegularGridInterpolator is the way to go for me. It's fast, and I have a regular grid already.
I tried to sort out my findings on the differences in interpolation values and found griddata to show unexpected behaviour for me.
Check out the issue I created for details.
https://github.com/scipy/scipy/issues/17378
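A minimal sketch of the RegularGridInterpolator approach on the same made-up grid (again, the data here is just a stand-in):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

x = np.linspace(0.0, 599.0, 600)
y = np.linspace(0.0, 599.0, 600)
xx, yy = np.meshgrid(x, y, indexing="ij")
values = np.sin(xx / 50.0) * np.cos(yy / 50.0)  # stand-in map values

rng = np.random.default_rng(0)
queries = rng.uniform(100.0, 500.0, size=(1000, 2))

# RegularGridInterpolator exploits the regular grid structure directly,
# so no triangulation of the 360k scattered points is needed.
interp = RegularGridInterpolator((x, y), values, method="linear")
interpolated = interp(queries)
print(interpolated.shape)  # (1000,)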

Changing max node capacity in M-tree affects the results

Posting the code for the entire tree for this problem would be pointless (too long and chaotic), and I've tried to fix this problem for a while now, so I don't really want a concrete solution, but rather ideas as to why this might be happening. So:
I have a dataset of 1,000,000 coordinates and I insert them into the tree. I do a range search afterwards, and for MaxCapacity=10 I get the correct results (and for any number >= 10). If I switch to MaxCapacity=4, the results are wrong. But if I shrink the dataset to about 20,000 coordinates, the results are correct again for MaxCapacity=4.
So to me this looks like an incorrect split algorithm, and it only shows up for small MaxCapacities and large datasets, where we have an enormous number of splits. But the algorithm checks out for almost everything, so I can't really find a mistake there. Any other ideas? The tree is written in Scala; the promotion policy promotes the two points that are furthest away from each other, and for the split policy we iterate through the entries of the overflown node and put each entry into the group of the promoted point it is closer to.
Don't know if anyone will be interested in this, but I found the reason for it. I thought the problem was in the split, but I was wrong. The problem was in the insert recursion, when choosing which node to descend into next in order to place the entry. I was choosing this node by calculating the distance between each node's center and the entry's point; the node with the minimum distance was chosen.
This works fine if the entry happens to reside inside the radius of multiple nodes: in that case the minimum distance works as intended. But what if the entry doesn't reside inside any node's radius? Then we would have to expand a radius to contain the entry, so we need to find the node whose radius would expand the least if it were to include the entry among its children. A node's distance from the entry point might be the minimum, but the expansion needed might be catastrophically big. I had not considered this case, and as a result entries were placed in the wrong nodes, causing huge expansions and huge overlaps. When I implemented this case, the problem was fixed!
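The node-selection fix described above might look roughly like this (a Python sketch rather than the original Scala, with hypothetical center/radius fields on each node):

import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def choose_subtree(children, entry_point):
    # Children whose covering radius already contains the entry:
    # among these, plain minimum distance to the center is fine.
    containing = [c for c in children
                  if distance(c.center, entry_point) <= c.radius]
    if containing:
        return min(containing, key=lambda c: distance(c.center, entry_point))

    # No child contains the entry: pick the one whose radius would have
    # to expand the least, and expand it, instead of blindly taking the
    # child with the smallest center distance.
    def expansion(c):
        return distance(c.center, entry_point) - c.radius

    best = min(children, key=expansion)
    best.radius += expansion(best)
    return best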

pgr_drivingDistance with flexible distance value on each route

I would like to calculate a graph similar to an isochrone using pgSQL. For this I have already used the algorithm pgr_drivingDistance: you provide a starting point and a distance value and receive an isochrone.
The output of the algorithm is produced with code which looks something like:
SELECT * FROM pgr_drivingDistance(
    'SELECT id, source, target, cost FROM edge_table',
    2, 2, false  -- starting point, distance, directed
);
The red star represents the starting point.
Now I want a graph which works the same way, i.e. starting at one point and getting routes in all directions. The difference is that I don't want to provide a travel distance, but rather a list of point coordinates which lie on the road network. The route in every direction has to stop at the first point reached on that route. The distance on every route is different, and I don't know in advance which points are the closest ones.
The desired output using the "stopping" points, which are visualized in green, is supposed to look like this.
What I have tried already:
Using the given algorithm pgr_drivingDistance and raising the distance value every time no point is reached. Problem here: the distance is equal for all directions and not individual for each route.
Using the algorithm pgr_dijkstra for each route. Problem here: because you don't know which point is affected, you don't know which end point to choose for the calculation. You also cannot take the closest one in the immediate vicinity, because you need the closest one on the specific route.
I know that I would have to build an almost completely new algorithm, but maybe someone has an idea of how to start, or even experience with this kind of problem.
Thank you in advance!
This is a one-to-many routing problem: you have to compute the route to each end point to find the shortest one. I have not looked at the pgRouting functions recently, but I believe there are one-to-many, many-to-one and many-to-many Dijkstra functions. You should be able to use the one-to-many variant to compute all the routes in one go, and then sort the routes by length to find the shortest one.
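A hedged sketch of that approach from Python (the connection details, the vertex ids of the green points, and the end_vid/agg_cost output columns, which depend on the pgRouting version, are all assumptions):

import psycopg2

conn = psycopg2.connect("dbname=routing user=postgres")  # hypothetical DSN
start_vid = 2             # the red-star starting vertex
stop_vids = [15, 42, 77]  # hypothetical vertex ids of the green points

sql = """
    SELECT end_vid, MAX(agg_cost) AS route_cost
    FROM pgr_dijkstra(
        'SELECT id, source, target, cost FROM edge_table',
        %s,            -- one start vertex
        %s::bigint[],  -- many end vertices (candidate stopping points)
        false
    )
    GROUP BY end_vid
    ORDER BY route_cost;  -- shortest route first
"""

with conn, conn.cursor() as cur:
    cur.execute(sql, (start_vid, stop_vids))
    for end_vid, route_cost in cur.fetchall():
        print(end_vid, route_cost)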

Clarification on OVERPASS_MAX_QUERY_AREA_SIZE default (OSMnx, Overpass API)

I am using OSMnx to query the Overpass API. I've noticed that it has a fairly large default for the maximum query area size:
OVERPASS_MAX_QUERY_AREA_SIZE = 50*1000*50*1000
This value is used to subdivide "larger" polygons into chunks to submit to the Overpass API.
I'd like to understand why the area is so large. For example, the entirety of San Francisco (~50 sq miles) is "simplified" to a single query.
Key questions:
Is there any advantage to reducing query sizes submitted to the Overpass API?*
Is there any advantage to reducing the complexity of shapes/polygons being submitted to the Overpass API (that is, using rectangles with just 4 corner coordinates), versus more complex polygons?**
*Note: Example query that I would be running (looking for the ways that would constitute a walk network):
[out:json][timeout:180];(way["highway"]["area"!~"yes"]["highway"!~"cycleway|motor|proposed|construction|abandoned|platform|raceway"]["foot"!~"no"]["service"!~"private"]["access"!~"private"](37.778007,-122.445467,37.783454,-122.438958);>;);out;
**Note: This question is partially answered in this other post. That said, that question does not focus completely on the performance implications, and is not asked in the context of the variable area threshold used in OSMnx to subdivide "larger" geometries.
max_query_area_size appears to be a heuristic value someone came up with after doing a number of test runs. From the Overpass API side, this figure has pretty much no meaning on its own.
It may be completely off for different kinds of queries, or even in a different area than SF. As an example: for infrequent tags it's usually better to go ahead with a rather large bounding box rather than firing off a huge number of queries with tiny bounding boxes.
For some statement types, however, a large bounding box may cause significantly longer processing times. In this case, splitting the area into smaller pieces may help. Some queries might even consume too much memory, which forces you to split your bounding box into smaller pieces.
As you didn't mention the kind of query you want to run, it's very difficult to provide general advice. It's like asking for the best way to write SQL statements without providing any additional context.
Using bounding boxes instead of (poly:...) has performance advantages: if you can specify a bounding box, use the respective bounding box filter rather than providing four lat/lon pairs to the poly filter.
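To illustrate, the bbox-filtered walk-network query from the question can be posted directly to a public Overpass endpoint (the endpoint URL and timeouts here are assumptions; OSMnx builds and submits essentially this kind of request on your behalf):

import requests

# The bbox-filtered walk-network query quoted in the question.
query = (
    '[out:json][timeout:180];'
    '(way["highway"]["area"!~"yes"]'
    '["highway"!~"cycleway|motor|proposed|construction|abandoned|platform|raceway"]'
    '["foot"!~"no"]["service"!~"private"]["access"!~"private"]'
    '(37.778007,-122.445467,37.783454,-122.438958);>;);out;'
)

response = requests.post(
    "https://overpass-api.de/api/interpreter",  # one public Overpass instance
    data={"data": query},
    timeout=200,
)
response.raise_for_status()
print(len(response.json()["elements"]), "elements returned")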

Modifying Levenshtein Distance for positional Bias

I am using the Levenshtein distance algorithm to compare a company name provided as user input against a database of known company names to find the closest match. By itself, the algorithm works okay, but I want to build in a bias so that the edit distance is considered lower if the initial parts of the strings match.
For example, if the search criterion is "ABCD", then both "ABCD Co." and "XYX ABCD" have identical edit distances. However, I want to give weight to the fact that the initial part of the first string matches the search criterion more closely than the second string does.
One way of doing this might be to modify the insert/delete/replace costs to be higher at the beginning of the strings and lower towards the end. Does anyone have an example of a successful implementation of this? Is using Levenshtein distance still the best way to do what I am trying to achieve? Is my assumption of the approach accurate?
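A minimal sketch of the cost-weighting idea, assuming a simple linear decay of the edit cost from the start of the strings to the end (the weight function and the bias parameter are just one possible choice):

def weighted_levenshtein(a, b, bias=1.0):
    # Levenshtein distance where edits near the start of the strings cost
    # more than edits near the end, so prefix mismatches are penalized.
    n, m = len(a), len(b)
    max_len = max(n, m, 1)

    def weight(pos):
        # decays linearly from (1 + bias) at position 0 to 1.0 at the end
        return 1.0 + bias * (1.0 - pos / max_len)

    # dp[i][j] = weighted distance between a[:i] and b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + weight(i - 1)      # delete a[i-1]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + weight(j - 1)      # insert b[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else weight(min(i, j) - 1)
            dp[i][j] = min(
                dp[i - 1][j] + weight(i - 1),        # deletion
                dp[i][j - 1] + weight(j - 1),        # insertion
                dp[i - 1][j - 1] + sub,              # substitution / match
            )
    return dp[n][m]

# With the default bias, "ABCD Co." now scores lower (closer) against
# "ABCD" than "XYX ABCD" does, because its matching prefix pushes all
# edits towards the cheap end of the strings.
print(weighted_levenshtein("ABCD", "ABCD Co."))   # smaller
print(weighted_levenshtein("ABCD", "XYX ABCD"))   # larger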
UPDATE: For my immediate purposes I have decided to forgo the above and instead use the Jaro-Winkler distance, which seems to solve the problem. However, I will leave this open for further input.
What you're looking for looks like a Smith-Waterman local alignment: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm