Accuracy of MongoDB Intersect Query - mongodb

I read somewhere that MongoDB doesn't accurately do spatial searches, as it creates a bounding box around the objects and checks if they intersect, rather than checking whether the original shapes actually intersect. Annoyingly I can't find this webpage again.
Has anyone else had this experience?
UPDATE:
I'm trying to decide whether to use MongoDb or PostGis for a scalable web system (Java Spring-Boot back-end), which requires accurate intersection queries. So for PostGis I'd probably use ST_Overlaps, and for MongoDb $geoIntersects. I will also use the spatially near functions however their accuracy isn't so important.
Thanks :)

Related

How to do in-memory search for polygons that contain a given point?

I have a PostGreSQL table that has a geometry type column, in which different simple polygons (possibly intersecting) are stored. The polygons are are all areas within a city. I receive an input of a point (latitude-longitude pair) and need to find the list of polygons that contain the given point. What I have currently:
Unclustered GiST index defined on the polygon column.
Use ST_Contains(#param_Point, table.Polygon) on the whole table.
It is quite slow, so I am looking for a more performant in-memory alternative. I have the following ideas:
Maintain dictionary of polygons in Redis, keyed by their geohash. Polygons with same geohash would be saved as a list. When I receive the point, calculate its geohash and trim to a desired level. Then search in the Redis map and keep trimming the point's geohash until I find the first result (or enough results).
Have a trie of geohashes loaded from the database. Update the trie periodically or by receiving update events. Calculate the point's geohash, search in the trie until I find enough results. I prefer this because the map may have long lists for a geohash, given the nature of the polygons.
Any other approaches?
I have read about libraries like GeoTrie and Polygon Geohasher but can't seem to integrate them with the database and the above ideas.
Any cues or starting points, please?
Have you tried using ST_Within? Not sure if it meets your criteria but I believe it is meant to be faster than st_contains

Clarification on OVERPASS_MAX_QUERY_AREA_SIZE default (OSMnx, Overpass API)

I am using OSMnx to query the Overpass API. I've noticed that it has a fairly large default for minimum area size:
OVERPASS_MAX_QUERY_AREA_SIZE = 50*1000*50*1000
This value is used to subdivide "larger" polygons into chunks to submit to the Overpass API.
I'd like to understand why the area is so large. For example, the entirety of San Francisco (~50 sq miles) is "simplified" to a single query.
Key questions:
Is there any advantage to reducing query sizes submitted to the Overpass API?*
Is there any advantage to reducing the complexity of shapes/polygons being submitted to the Overpass API (that is, using rectangles with just 4 corner coordinates), versus more complex polygons?**
*Note: Example query that I would be running (looking for the ways that would constitute a walk network):
[out:json][timeout:180];(way["highway"]["area"!~"yes"]["highway"!~"cycleway|motor|proposed|construction|abandoned|platform|raceway"]["foot"!~"no"]["service"!~"private"]["access"!~"private"](37.778007,-122.445467,37.783454,-122.438958);>;);out;
**Note: This question is partially answered in this other post. That said, that question does not focus completely on the performance implications, and is not asked in the context of the variable area threshold used in OSMnx to subdivide "larger" geometries.
max_query_area_size appears to be some heuristic value someone came up after doing a number of test runs. From Overpass API side this figure has pretty much no meaning on its own.
It may be completely off for different kinds of queries or even in a different area than SF. As an example: for infrequent tags, it's usually better to go ahead with a rather large bounding box, rather than firing off a huge number of queries with tiny bounding boxes.
For some statement types, a large bounding box may cause significant longer processing time, though. In this case splitting up the area in smaller pieces may help. Some queries might even consume too much memory, which forces you to split your bounding box in smaller pieces.
As you didn't mention the kind of query you want to run, it's very difficult to provide some general advise. It's like asking for a best way to write SQL statements without providing any additional context.
Using bounding boxes instead of (poly:...) has performance advantages. If you can specify a bounding box, use the respective bounding box filter rather than providing 4 lat/lon pairs to the poly filter.

GIN and GIST indexing in Postgres for k-d trees? (NOT full-text searches or geographic coordinates)

I've been looking for information on using GIN and GIST indexes, for implementing k-d trees, but every time I've looked, inevitably every result is about geographic coordinates (and thus assumes you're using PostGIS) or full-text searching, which bears little resemblance to the type of indexing I'm trying to do.
What I'd really like to do is to use a k-d tree algorithm to search a database of about 500,000 similar items in k dimensions (where k ~ 8-12 and can vary based on category), using a custom N-nearest-neighbors algorithm that's been optimized to use only integer values as input. But obviously, that's much too specific to find anything at all, and loosening my search terms gets me a bunch of full-text-search tutorials. Can someone please at least point me in the right direction? There's almost nothing in the official Postgres documentation, and what information is there has been shown to be inaccurate and/or outdated in a lot of the more recent information I've been able to find.
Really, any information regarding how one might implement a k-d tree type of indexing algorithm in Postgres would be immensely helpful to me.

Find points near LineString in mongodb sorted by distance

I have an array of points representing a street (black line) and points, representing a places on map (red points). I want to find all the points near the specified street, sorted by distance. I also need to have the ability to specify max distance (blue and green areas). Here is a simple example:
I thought of using the $near operator but it only accepts Point as an input, not LineString.
How mongodb can handle this type of queries?
As you mentioned, Mongo currently doesn't support anything other than Point. Have you come across the concept of a route boxer? 1 It was very popular a few years back on Google Maps. Given the line that you've drawn, find stops that are within dist(x). It was done by creating a series of bounding boxes around each point in the line, and searching for points that fall within the bucket.
I stumbled upon your question after I just realised that Mongo only works with points, which is reasonable I assume.
I already have a few options of how to do it (they expand on what #mnemosyn says in the comment). With the dataset that I'm working on, it's all on the client-side, so I could use the routeboxer, but I would like to implement it server-side for performance reasons. Here are my suggestions:
break the LineString down into its individual coordinate sets, and query for $near using each of those, combine results and extract an unique set. There are algorithms out there for simplifying a complex line, by reducing the number of points, but a simple one is easy to write.
do the same as above, but as a stored procedure/function. I haven't played around with Mongo's stored functions, and I don't know how well they work with drivers, but this could be faster than the first option above as you won't have to do roundtrips, and depending on the machine that your instance(s) of Mongo is(are) hosted, calculations could be faster by microseconds.
Implement the routeboxer approach server-side (has been done in PHP), and then use either of the above 2 to find stops that are $within the resulting bounding boxes. Heck since the routeboxer method returns rectangles, it would be possible to merge all these rectangles into one polygon covering your route, and just do a $within on that. (What #mnemosyn suggested).
EDIT: I thought of this but forgot about it, but it might be possible to achieve some of the above using the aggregation framework.
It's something that I'm going to be working on soon (hopefully), I'll open-source my result(s) based on which I end up going with.
EDIT: I must mention though that 1 and 2 have the flaw that if you have 2 points in a line that are say 2km apart, and you want points that are within 1.8km of your line, you'll obviously miss all the points between that part of your line. The solution is to inject points onto your line when simplifying it (I know, beats the objective of reducing points when adding new ones back in).
The flaw with 3 then is that it won't always be accurate as some points within your polygon are likely to have a distance greater than your limit, though the difference wouldn't be a significant percentage of your limit.
[1] google maps utils routeboxer
As you said Mongo's $near only works on points not lines as the centre point however if you flip your premise from find points near the line to find the line near the point then you can use your points as the centre and line as the target
this is the difference between
foreach line find points near it
and
foreach point find line near it
if you have a large number of points to check you can combine this with nevi_me's answer to reduce the list of points that need checking to a much smaller subset

What SRID should I use for my application and how?

I'm using PostgreSQL with PostGIS. All my data has already decimal lat/long attached to it (i.e. -87.34554 33.12321) but to use PostGIS I need to convert it to a certain type of SRID.
The majority of my queries are looking for data inside a certain radius.
What SRID should I use? I created already a geometry column with SRID 4269.
In this example:
link text the author is converting SRID 4269 to SRID 32661. I'm very confused about how and when to use these SRIDs. Any lite on the subject would be truly appreciated.
As long as you never intend to reproject/transform the data to another coordinate system, it doesn't technically matter what srid you use. However assuming you don't want to throw away that important metadata, and you do want to transform it, you will want to ensure your assigned srid matches the data, so postgis knows what to do when the time comes.
So why would you want to reproject from epsg:4269? The answer is because certain types of queries (such as distance) make no sense in this 'unprojected' world. Your units are in decimal degrees, and a straight measurement of x decimal degrees is a different real distance depending where in the planet you are.
In your example above, someone is using epsg:32661 as they believe it will give them better accuracy for the are they're working in. If your data is in a specific area of the globe, you can select a projection that's accurate for that area. If it spans the entire globe, you have to choose a projection that does 'ok' for your needs.
Now fortunately PostGIS has a few ways of making all this easier. For approx distances you can just use the st_distance_sphere function which, as you might guess, assumes the earth is a sphere. Or the more accurate st_distance_spheroid. Using these, you don't need to reproject and you will probably be fine for your distance queries except in edge cases. Newer versions of PostGIS also let you use geography columns
tl;dr - use st_distance_spheroid for your distance queries, store your data in geography columns, or transform it to a local projection (when storing, or on the fly, depending on your needs).
Take a look at this question: How do you know what SRID to use for a shp file?
The SRID is just a way of storing the WKT inside the database (you may have noticed that, altough you store lat/long points, the preferred storing is a long string with number and capital letters).
The SRID or EPSG can be different for the country/state/... altough there are some very common ones especially the 2 mentioned by you. If you need specific info what area uses what SRID, there is a database for handling that.
Inside your database, you have a table spatial_ref_sys that has the information on what SRID PostGIS knows about.