I have a collection named Events.
Each document in the Events collection has a source and a destination, each stored as lat-long.
I would like to make a query on the Events collection and get only those events that are within some distance from source and within some distance from destination.
I read that MongoDB does not support two geospatial indexes on one collection.
I am confused about how my data model should look and how I can write a query to achieve this.
Thanks
You'll have to work with the geo index limitation, so it leaves you with a couple of options. One is to create two collections and run two queries, and resolve the intersection at the application level. This could be expensive depending on what you're doing.
The other scenario is to work with one collection, but change your query to check $geoWithin some geometry which represents the intersection of the areas around your source and destination. Since you're querying for events that are within some distance of both your source and your destination, this implies that there is an intersection area. It is up to you to calculate the intersection geometry. If possible, you can precalculate and store these intersections.
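The first option (two queries, intersection resolved in the application) can be sketched as follows. This is a minimal illustration, assuming both collections store the same event `_id`; the two input lists are hypothetical stand-ins for the results of the two geo queries.

```python
def intersect_by_id(source_hits, destination_hits):
    """Keep the documents from source_hits whose _id also appears in destination_hits."""
    destination_ids = {doc["_id"] for doc in destination_hits}
    return [doc for doc in source_hits if doc["_id"] in destination_ids]


# Hypothetical result sets standing in for the output of the two geo queries.
near_source = [{"_id": 1}, {"_id": 2}, {"_id": 3}]
near_destination = [{"_id": 3}, {"_id": 1}, {"_id": 7}]

matches = intersect_by_id(near_source, near_destination)
print([doc["_id"] for doc in matches])
```

The cost mentioned above shows up here: both result sets have to be pulled into the application before anything can be discarded.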
You may only have 1 geospatial index per collection, for now. While MongoDB may allow you to create multiple indexes, this behavior is unsupported. Because MongoDB can only use one index to support a single query, in most cases, having multiple geo indexes will produce undesirable behavior.
There are a few tickets in JIRA for this that you might want to vote on:
https://jira.mongodb.org/browse/SERVER-2331
https://jira.mongodb.org/browse/SERVER-3653
From reading the MongoDB documentation and from playing with indexes, queries and monitoring results, my understanding of how geolocation queries in MongoDB work is the following:
Start at the given location
look at EVERY document from close to far
keep those matching additional query criteria
until either number_limit or distance_limit is reached
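The steps above can be modelled in a few lines. This is only a sketch of the behaviour as described in this post, not MongoDB's actual implementation; it uses planar distance for brevity, treats `limit` as a cap on kept results (an assumption), and all field names are made up.

```python
import math


def near_scan(docs, origin, predicate, limit=None, max_distance=None):
    """Visit docs from nearest to farthest, keeping those that pass predicate."""
    ordered = sorted(docs, key=lambda d: math.dist(d["loc"], origin))
    kept, examined = [], 0
    for doc in ordered:
        if max_distance is not None and math.dist(doc["loc"], origin) > max_distance:
            break  # distance limit reached
        examined += 1
        if predicate(doc):
            kept.append(doc)
            if limit is not None and len(kept) >= limit:
                break  # result limit reached
    return kept, examined


# Ten documents on a line; only the farthest one passes the extra criterion.
docs = [{"loc": (i, 0), "waiters": 60 if i == 9 else 5} for i in range(10)]
kept, examined = near_scan(docs, (0, 0), lambda d: d["waiters"] > 50, limit=1)
print(len(kept), examined)
```

With a restrictive predicate and no distance limit, `examined` grows to the size of the whole collection before the first match is found.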
Example
To show what we are trying to do: Let's take the mongodb tutorial example as a base: https://docs.mongodb.com/manual/tutorial/geospatial-tutorial/
Let's assume we have a list of restaurants with location and much more information on top, like established_at, type (chinese, thai, italian, ...), priceOfACoke, numberOfWaiters, wheelchairAccess, ...
Problem:
Let's assume you want to query the collection of all restaurants in the US of A to return all Italian restaurants close to the city center of Pittsburgh that were established between 2 and 5 years ago, with wheelchair access and more than 50 waiters, where a coke is cheaper than $1.
This is a geo query with restrictive additional criteria and no distance limit; and since the "waiters > 50 and coke cheaper than $1" condition filters out most or all of the results, this query seems to run through the whole collection and takes very long.
If run without "geoNear", assuming there is a combined index of the fields in question, this query is quite fast, even if there are only 10 results out of 1 million documents.
However, as soon as geoNear comes into play, the performance is terrible.
From what I understand, there can only be ONE geo index per collection and only ONE additional property in the geo index, so there is not much I can do to help MongoDB find the results with several criteria, as a traditional index seems not to be used.
Also, when using aggregate, geo has to be the first filter...
Are there any hints or pointers to speed up queries like this?
If possible, I'd prefer not to get "Use ElasticSearch" or "Use multiple collections" responses - I still hope there is another way to help MongoDB reduce the number of documents to check BEFORE it starts doing the geoNear part.
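For concreteness, the query from the problem statement might be written as an aggregation pipeline like the one below, with $geoNear as the first stage and the extra criteria passed through its `query` option. This only shows the shape of the query. The field values (`"italian"`, `True`), the approximate Pittsburgh coordinates and the 365-day year are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def build_pipeline(now):
    """Build a one-stage pipeline: $geoNear with the extra criteria in `query`."""
    return [
        {
            "$geoNear": {
                # Approximate city center of Pittsburgh as [longitude, latitude].
                "near": {"type": "Point", "coordinates": [-79.9959, 40.4406]},
                "distanceField": "distance",
                "spherical": True,
                "query": {
                    "type": "italian",
                    "wheelchairAccess": True,
                    "numberOfWaiters": {"$gt": 50},
                    "priceOfACoke": {"$lt": 1},
                    "established_at": {
                        "$gte": now - timedelta(days=5 * 365),
                        "$lte": now - timedelta(days=2 * 365),
                    },
                },
            }
        }
    ]


pipeline = build_pipeline(datetime(2020, 1, 1))
```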
I would like to know which approach is faster to query in MongoDB.
Let's say I would like to search for income information based on areas.
A person can have many residences in different states, and each polygon area would have an associated income for that individual.
I have outlined the options for querying this information below and would like to know which would be fastest to search.
1) Have a single collection that holds two types of documents.
Document1: holds the polygons and has a 2dsphere index on them. It will be searched with aggregation to return ids that link to Document2, essentially taking the place of a relation in MySQL.
Document2: holds the other information (let's say income amount) under different indexes, including an index on income amount, and carries the id that Document1 uses to reference it.
The two documents are searched with an aggregation pipeline.
Stage 1 of the pipeline: search Document1 geospatially and get the id values.
Stage 2 of the pipeline: use the ids found in Document1 to search Document2, also filtering by income type.
2) Separate the documents so that each type has its own collection, and avoid aggregation.
Query collection1 geospatially and use the person ids found to query collection2 for income info.
3) A third option involving a polyglot setup, a combination of MongoDB and PostGIS: query PostGIS for the id and then use that to search the MongoDB collection. I am including this option since I believe PostGIS to be faster than Mongo for geospatial queries, but I am curious whether the speed of PostGIS will be cancelled out by the latency of querying two databases.
The end goal is to pull back data based on a geospatial radius. One geospatial polygon, representing the area where the person lives and does business for that income, maps to 1 relational id, and each relational id maps to many sets of data. Essentially I have a many-to-1-to-many relationship.
Many geospatials map to 1 person, which maps to many data sets.
You should generally keep collections limited to a single type of document.
Solution 1 will not work. You cannot use the aggregation pipeline the way you are describing (if I'm understanding you correctly). Also, it sounds as though you are thinking in a relational way about a solution using a non-relational database.
Solution 2 will work but it will not have optimum performance. This solution sounds even more like a relational database solution where the collections are being treated like tables.
Solution 3 will probably work but as you said it will now require two databases.
All three of these solutions are progressively pulling these two types of documents further and further away from one another. I believe the best solution for a document database like MongoDB is to embed this data. It's impossible without a real example of your documents and without a clear understanding of your application to suggest an exact solution. But in general embedding data is preferred over creating relationships between documents in MongoDB. As long as no document will ever get to be over 16MB it's worth looking into whether embedding is the right solution.
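As a rough illustration of what embedding could look like here, the sketch below puts each area and its income inside the person document. Every field name and value is hypothetical, since the real documents were not shown.

```python
# One person document with the per-area income embedded next to its polygon.
person = {
    "_id": "person-1",  # hypothetical identifier
    "name": "Example Person",
    "residences": [
        {
            "state": "PA",
            "income": 52000,
            "area": {
                "type": "Polygon",
                # A GeoJSON ring: [longitude, latitude] pairs, first == last.
                "coordinates": [[
                    [-80.1, 40.3], [-79.8, 40.3], [-79.8, 40.6],
                    [-80.1, 40.6], [-80.1, 40.3],
                ]],
            },
        },
    ],
}


def incomes_by_state(doc):
    """Read the embedded incomes without a second lookup."""
    return {r["state"]: r["income"] for r in doc["residences"]}


print(incomes_by_state(person))
```

With this shape a single geospatial match on a person's areas returns the income data in the same document, so no id needs to be carried to a second query.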
I am developing an app where users receive geo-based information depending on their position. The server should be able to handle a huge number of connected clients (>100k).
I have come up with 4 approaches for handling the users' positions.
Approach - Without geospatial index:
The app server just holds a list of connected clients and their locations.
Whenever new information is available, the server loops over the whole list and checks whether each client is within a given radius.
Doubts: Very expensive
Approach - Handle geospatial index in the app server:
The app server maintains an R-tree with all connected clients and their locations.
For this I was looking at JSI (Java Spatial Index).
Doubts: It is very expensive to update the geospatial index with JSI
Approach - Let the database (MongoDB) do the geospatial index / calculation:
The app server only holds a reference to the connected client (connection) and saves the key to that reference, together with its location, in MongoDB.
When new information is available, the server can query the database to get the keys of all clients nearby.
Pro: I guess MongoDB has a much better implementation of geospatial indexes than I could ever write in the app server.
Doubts: Clients are traveling around, which forces me to update the geospatial index frequently. Can I do that, or will I run into a performance problem?
Approach - Own "index" using 2 dimensional array
Today I was thinking about creating a very simple index using a two-dimensional array, where the outer array is for the longitude and the inner one for the latitude.
Let's say 3 longitude / latitude degrees would be enough precision.
I could receive a list of users in a given area by
ArrayList userList = data[91][35]; // 91.2548980712891, 35.60869979858
// i would also need to get the users in the surrounding arrays 90;35, 92;35 ...
// if i need more precision i could use one more decimal data[912][356]
Pro: I would have fast read and write access without a query to the database
Doubts: Radius is shorter at poles. Ugly hack?
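The array idea above could be sketched as follows. This is an illustrative variant rather than the exact structure described: it swaps the fixed two-dimensional array for a dictionary keyed by cell, so negative coordinates work, and the class and method names are made up.

```python
import math
from collections import defaultdict


class GridIndex:
    """Bucket users into fixed-size lon/lat cells (cell_deg degrees per side)."""

    def __init__(self, cell_deg=1.0):
        self.cell_deg = cell_deg
        self.cells = defaultdict(set)  # (ix, iy) -> user ids
        self.where = {}                # user id -> (ix, iy)

    def _cell(self, lon, lat):
        return (math.floor(lon / self.cell_deg), math.floor(lat / self.cell_deg))

    def update(self, user_id, lon, lat):
        """Move a user to the cell for (lon, lat); cheap if the cell is unchanged."""
        new_cell = self._cell(lon, lat)
        old_cell = self.where.get(user_id)
        if old_cell == new_cell:
            return
        if old_cell is not None:
            self.cells[old_cell].discard(user_id)
        self.cells[new_cell].add(user_id)
        self.where[user_id] = new_cell

    def candidates(self, lon, lat):
        """Users in the cell containing (lon, lat) and its 8 neighbours."""
        ix, iy = self._cell(lon, lat)
        found = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                found |= self.cells.get((ix + dx, iy + dy), set())
        return found


index = GridIndex()
index.update("a", 91.25, 35.61)
index.update("b", 92.10, 35.20)
index.update("c", 120.0, 10.0)
index.update("a", 91.90, 36.05)  # user "a" moves to a neighbouring cell
print(sorted(index.candidates(91.25, 35.61)))
```

The candidates still need an exact distance check afterwards, and this sketch ignores the wrap-around at 180 degrees longitude as well as the shrinking cell width towards the poles raised in the doubts above.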
I would be very grateful if someone could point me in "the" right direction.
The index used by MongoDB for geospatial indexing is based on a geohash, which essentially converts a two-dimensional space into a one-dimensional key, suitable for B-tree indexing. While this is somewhat less efficient than an R-tree index, it will be vastly more efficient than your scenario 1. I would also argue that filtering the data at the db level with a spatial query will be more efficient and easier to maintain than creating your own spatial indexing strategy on top.
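To make the 2D-to-1D idea concrete, here is a sketch of the widely used base-32 geohash encoding. It illustrates the general technique only and is not claimed to match MongoDB's internal key format.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, precision=8):
    """Interleave longitude and latitude bisection bits into one base-32 string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash(57.64911, 10.40744, precision=2))  # prints "u4"
```

Because each extra character only refines the previous cell, points that share a long prefix lie in the same small cell, which is what lets an ordered index such as a B-tree serve many spatial lookups as range scans.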
The main thing to be aware of with MongoDB is that you cannot use a geometry column as a shard key, though you can shard a collection containing a geometry field using another key. Also, if you wish to do any aggregation queries (which isn't clear from your question), the geometry stage must be the first in the pipeline.
There is also a geohaystack index, which is based on small buckets and optimized for searches over small areas (see http://docs.mongodb.org/manual/core/geohaystack/), which might be useful in your case.
As far as speed is concerned, insertion and search on a B-tree index are essentially O(log n) (see the Wikipedia article on B-trees), while without an index your search will be O(n), so it will not take very long before the difference in performance between having and not having an index is enormous.
If you are concerned about heavy writes slowing things down, you can tune the write concern in MongoDB so that you don't have to wait for a majority of replicas to respond to every write (the default), but at the cost of potentially inconsistent data, if you should lose your master.
I am building a MongoDB database that will work with an Android app. I have a user collection and a records collection. The records documents consist of GPS tracks with fields such as start and end coordinates, total time, top speed and distance. The user document has user id, first name, last name and so forth.
I want to have aggregate stats for each user that summarizes total distance, total time, total average speed and top speed to date.
I am unsure whether I should do a map reduce and create an aggregate collection for users, or add these stats to the user document with some kind of cron-job-type solution. I have read many guides about map reduce and aggregation for MongoDB but can't figure this out.
Thanks!
It sounds like your aggregate indicator values are per-user, in which case I would simply calculate them and push them directly into the user object at the same time as you update current coordinates, speed etc. They would be nice and easy (and fast) to query, and you could aggregate them further if you wished.
When I say pre-calculate, I don't mean MapReduce, which you would use as a batch process, I simply mean calculate on update of the user object.
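A sketch of what "calculate on update" might look like: build one update document that folds a new record into running totals on the user. The `stats.*` field names are hypothetical, and the `$max` operator used here is only available from MongoDB 2.6 onward.

```python
def stats_update(record):
    """Build an update document that folds one new record into a user's totals."""
    return {
        "$inc": {
            "stats.totalDistance": record["distance"],
            "stats.totalTime": record["totalTime"],
            "stats.recordCount": 1,
        },
        # Only overwrites the stored top speed if the new value is higher.
        "$max": {"stats.topSpeed": record["topSpeed"]},
    }


def average_speed(stats):
    """Derive the overall average speed from the stored totals."""
    return stats["totalDistance"] / stats["totalTime"] if stats["totalTime"] else 0.0


update = stats_update({"distance": 12.5, "totalTime": 0.5, "topSpeed": 41.0})
print(average_speed({"totalDistance": 100.0, "totalTime": 4.0}))  # prints 25.0
```

The average is derived from the two stored totals rather than stored itself, so it never drifts out of sync with them. With a driver such as PyMongo, the update document would be passed to a call along the lines of `users.update_one({"_id": user_id}, update)`.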
If your aggregate stats are compiled across users, then you could still pre-calculate them on update, but if you also need to be able to query those aggregate stats against some other condition or filter, such as, "tell me what the total distance travelled for all users within x region", then depending on the number of combinations you may not be able to cover all those with pre-calculation.
So, if your aggregate stats ARE across users, AND need some sort of filter applying, then they'll need to be calculated from some snapshot of data. The two approaches here are:
the aggregation framework in 2.2
MapReduce
You would need to use MapReduce say, if you've a LOT of historical data that you want to crunch and you can pre-calculate the results for fast reading later. By my definition, that data isn't changing frequently, but even if it did, you can also use incremental MR to add new results to an existing calculation.
The aggregation framework in 2.2 will allow you to do a lot of this on demand, but it won't be as quick of course as pre-calculated values but way quicker than MR when executed on-demand. It can't cope with the high volume result-sets that you can do with MR, but it's better suited to queries where you don't know the parameter values in advance.
By way of example, if you wanted to calculate the aggregate sums of users stats within a particular lat/long, you couldn't use MR because there are just too many combinations of that filter, so you'd need to do that on the fly.
If however, you wanted it by city, well you could conceivably use MR there because you could stick to a finite set of cities and just pre-calculate them all.
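The on-the-fly case could look roughly like the pipeline below, which sums stats for all records starting within a radius of a point. It is a sketch under assumptions: the `start`, `distance`, `totalTime` and `topSpeed` field names are guesses at the records schema, and `$geoWithin` with `$centerSphere` requires a MongoDB version that supports it (2.4 or later).

```python
EARTH_RADIUS_KM = 6378.1  # used to convert kilometres to radians


def region_totals_pipeline(lng, lat, radius_km):
    """Match records starting inside a circle, then sum their stats."""
    return [
        {"$match": {"start": {"$geoWithin": {
            # $centerSphere takes [[lng, lat], radius in radians].
            "$centerSphere": [[lng, lat], radius_km / EARTH_RADIUS_KM],
        }}}},
        {"$group": {
            "_id": None,
            "totalDistance": {"$sum": "$distance"},
            "totalTime": {"$sum": "$totalTime"},
            "topSpeed": {"$max": "$topSpeed"},
        }},
    ]


pipeline = region_totals_pipeline(-79.9959, 40.4406, 10)
```

Because the centre and radius are parameters, nothing here can be pre-computed, which is the reason the answer steers this kind of query towards the aggregation framework rather than MapReduce.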
But to wrap up, if your aggregate indicator values are per-user alone, then I'd start by calculating and storing the values inside the user object when I update the user object as I said in the first paragraph. Yes, you're storing the value as well as the inputs, but that's the model that saves you having to calculate on the fly.
I am looking for a database that implements 2 geospatial indexes or allows me to simulate them efficiently.
Motivation: our application deals with vectors, rather than locations and we often need to locate all the records where the source is near something and the destination is near something else.
MongoDB does not have it. Is there a database which does?
Maybe it could be simulated with the MongoDB map reduce feature, where the database looks up all the records satisfying the source constraint and then passes them through map-reduce to leave only those which satisfy the destination constraint as well. Has anyone done this?
Thanks.
It might be possible to fake this with MapReduce in Mongo, but this would only be suitable for use as a batch job and not likely to perform well as an application level query.
A workaround would be to store the source and destination points in separate Mongo collections. Then you could do a query on the source collection using $near to pull the closest points to the source point, then do another $near query against the destination collection, and compute the intersection in memory.
Another option - since you can use the geospatial index to index a field which contains an array of points, store both the source and destination points as elements in an array. Then issue two queries to that collection (one for the source point, one for the destination) and scan through the two result sets to calculate the final result (the queries won't distinguish between which match was the source and which was the destination, so you'd have to check that on the client side).
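The client-side check for the array approach might look like this. It is a sketch assuming each document stores `points` as `[source, destination]` in `(lng, lat)` order; the function names and the 6371 km mean Earth radius are choices made for the illustration.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lng, lat) pairs."""
    lng1, lat1, lng2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def matches_both(doc, source, destination, max_km):
    """Check that the points match in the right roles, not just in either one."""
    src, dst = doc["points"]
    return (haversine_km(src, source) <= max_km
            and haversine_km(dst, destination) <= max_km)


def final_result(source_hits, destination_hits, source, destination, max_km):
    """Intersect the two result sets by _id, then verify source/destination roles."""
    by_id = {d["_id"]: d for d in source_hits}
    common = [by_id[d["_id"]] for d in destination_hits if d["_id"] in by_id]
    return [d for d in common if matches_both(d, source, destination, max_km)]


forward = {"_id": 1, "points": [(0.0, 0.0), (1.0, 1.0)]}
reverse = {"_id": 2, "points": [(1.0, 1.0), (0.0, 0.0)]}  # same points, roles swapped
hits = [forward, reverse]  # both queries return both documents
result = final_result(hits, hits, (0.0, 0.0), (1.0, 1.0), max_km=5)
print([d["_id"] for d in result])
```

The reversed document shows why the role check is needed: it appears in both result sets, yet its source is near the requested destination and vice versa.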