How to speed up mongodb geo queries with multiple criteria? - mongodb

Status:
As far as I can tell from reading the mongodb documentation, playing with indexes and queries, and monitoring the results, geo location queries in mongodb work as follows:
Start at the given location
look at EVERY document from close to far
keep those matching additional query criteria
until either number_limit or distance_limit is reached
Example
To show what we are trying to do: Let's take the mongodb tutorial example as a base: https://docs.mongodb.com/manual/tutorial/geospatial-tutorial/
Let's assume we have a list of restaurants with location and much more information on top, like established_at, type (chinese, thai, italian, ...), priceOfACoke, numberOfWaiters, wheelchairAccess, ...
Problem:
Let's assume you want to query the collection of all restaurants in the US of A to return all Italian restaurants close to the city center of Pittsburgh that were established between 2 and 5 years ago, with wheelchair access and more than 50 waiters, where a coke is cheaper than $1.
This is a geo query with restrictive additional criteria and no distance limit; and since the "waiters>50 and coke cheaper than $1" part filters out most/all of the results, this query seems to run through the whole collection and takes very long.
If run without "geoNear", assuming there is a compound index on the fields in question, this query is quite fast, even if there are only 10 results out of 1 million documents.
However, as soon as geoNear comes into play, the performance is terrible.
From what I understand, there can only be ONE geo index per collection and only ONE additional property in the geo index, so there is not much one can do to help mongodb find the results with several criteria, as a traditional index does not seem to be used.
Also, when using the aggregation pipeline, the geo part ($geoNear) has to be the first stage...
Are there any hints or pointers to speed up queries like this?
If possible, I'd prefer not to get "Use ElasticSearch" or "Use multiple collections" responses - I still hope there is another way to help mongodb reduce the number of documents to check BEFORE it starts doing the geoNear part.
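For concreteness, the query in question might look roughly like the sketch below. This is illustration only: the compound index, the Pittsburgh coordinates, and the field names beyond the tutorial's schema are assumptions taken from the example above, not tested advice.

// geo key plus one extra property, per the one-additional-property limitation above
db.restaurants.createIndex({ location: "2dsphere", cuisine: 1 })

db.restaurants.aggregate([
  { $geoNear: {
      near: { type: "Point", coordinates: [ -79.9959, 40.4406 ] },  // approx. Pittsburgh city center
      distanceField: "distance",
      spherical: true,
      query: {                                   // the restrictive extra criteria
        cuisine: "Italian",
        wheelchairAccess: true,
        numberOfWaiters: { $gt: 50 },
        priceOfACoke: { $lt: 1 }
      }
  } }
])

The question is exactly whether anything can make the query part of $geoNear use an ordinary index, instead of being evaluated document by document during the near-to-far walk.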

Related

All vs All comparisons on MongoDB

We are planning to use MongoDB for a general purpose system and it seems well suited to the particular data and use cases we have.
However, we have one use case where we will need to compare every document (of which there could be tens of millions) with every other document. The 'distance measure' could be pre-computed offline by another system, but we are concerned about the online performance of MongoDB when we want to query - e.g. when we want to see the top 10 closest documents in the entire collection to a list of specific documents ...
Is this likely to be slow? Also, can this be done across collections (e.g. query for the top 10 closest documents in one collection to a document in another collection)?
Thanks in advance,
FK

MongoDB: Optimization of Performance: Aggregation Pipeline (one collection) VS Aggregation plus Additional Query on Separate Collection

I would like to know which is faster in terms of querying MongoDB.
Let's say I would like to search for income information based on areas.
A person can have many residences in different states, and each polygon area would have an associated income for that individual.
I have outlined two options for querying this information, I would like to know which would be faster to search.
1) Have a single collection which has two types of documents.
Document1: has a 2dsphere geospatial index on its polygons. It will be searched with aggregation to return ids that link to Document2, essentially taking the place of a relation in MySQL.
Document2: has the other information (let's say income amount) and different indexes, including an index on income amount, plus an id which the first document references.
The two document types are searched with an aggregation pipeline.
Stage 1 of the pipeline: search Document1 geospatially for items and get the id value.
Stage 2 of the pipeline: use the id found in Document1 to search the second document, also filtering by income type.
2) Separate the documents so that each has its own collection, avoiding aggregation: query collection1 geospatially and use the person ids found to query collection2 for income info.
3) A third option involving a polyglot database, a combination of MongoDB and PostGIS: query PostGIS for the id and then use that to search the MongoDB collection. I am including this option since I believe PostGIS to be faster for geospatial querying than Mongo, but I am curious whether the speed of PostGIS will be cancelled out by the latency of now querying two databases.
The end goal is to pull back data based on a geospatial radius. One geospatial polygon, representing the area where the person lives and does business for that income, maps to 1 relational id, and each relational id maps to many sets of data. Essentially I have a many-to-1-to-many relationship: many geospatials map to 1 person, which maps to many data sets.
You should generally keep collections limited to a single type of document.
Solution 1 will not work. You cannot use the aggregation pipeline the way you are describing (if I'm understanding you correctly). Also, it sounds as though you are thinking in a relational way about a solution using a non-relational database.
Solution 2 will work but it will not have optimum performance. This solution sounds even more like a relational database solution where the collections are being treated like tables.
Solution 3 will probably work but as you said it will now require two databases.
All three of these solutions progressively pull these two types of documents further and further away from one another. I believe the best solution for a document database like MongoDB is to embed this data. It's impossible to suggest an exact solution without a real example of your documents and a clear understanding of your application, but in general embedding data is preferred over creating relationships between documents in MongoDB. As long as no document will ever grow beyond 16MB, it's worth looking into whether embedding is the right solution.
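To illustrate the embedded shape (a sketch only; the collection name, field names, and coordinates are hypothetical):

db.people.insertOne({
  personId: 1234,                                   // hypothetical person identifier
  residences: [                                     // one embedded entry per polygon area
    {
      area: { type: "Polygon",                      // GeoJSON polygon for the area
              coordinates: [ [ [-90, 35], [-89, 35], [-89, 36], [-90, 36], [-90, 35] ] ] },
      income: 52000
    }
  ]
})

// a multikey 2dsphere index over the embedded polygons, plus a scalar index on income
db.people.createIndex({ "residences.area": "2dsphere" })
db.people.createIndex({ "residences.income": 1 })

// one query, no join: which people have a residence area containing this point?
db.people.find({ "residences.area": { $geoIntersects:
  { $geometry: { type: "Point", coordinates: [-89.5, 35.5] } } } })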

Paginating results in MongoDB without relying on .skip()

I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB itself advises against this, as it requires the cursor to iterate through all the records up to the one you're skipping to, which gets more and more expensive the further down the list you go.
I've seen a few tutorials that rely on the _id property of the results to be sequential, but this doesn't apply here - my database has tens of thousands of records, and each has a unique id, and the 4000 results that apply to the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should spit out 20 records to my app. When the user scrolls to the bottom of the page, I could potentially get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id; otherwise you can't guarantee the order for follow-up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
MongoDB ranged pagination has a good example as well.
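A minimal sketch of that range-based approach, assuming the default ObjectId _id and a hypothetical results collection and query:

var pageSize = 20;
// first page: fetch one extra document to detect whether a next page exists
var docs = db.results.find({ category: "shoes" })   // hypothetical query
                     .sort({ _id: 1 })
                     .limit(pageSize + 1)
                     .toArray();
var hasNext = docs.length > pageSize;
var lastId = docs.length ? docs[Math.min(docs.length, pageSize) - 1]._id : null;
// next page: resume strictly after the last _id already shown, no skip() needed
db.results.find({ category: "shoes", _id: { $gt: lastId } })
          .sort({ _id: 1 })
          .limit(pageSize + 1);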

Mongodb model for Uniqueness

Scenario:
10,000,000 records/day
Records:
Visitor, day of visit, cluster (where we saw it), metadata
What we want to know with this information:
Unique visitors on one or more clusters for a given range of dates.
Unique visitors by day.
Grouped metadata for a given range (platform, browser, etc.).
The model I stuck with in order to query this information easily is:
{
  VisitorId: 1,
  ClusterVisit: [
    {clusterId: 1, dates: [date1, date2]},
    {clusterId: 2, dates: [date1, date3]}
  ]
}
Indexes:
by VisitorId (to ensure uniqueness)
by ClusterVisit.clusterId + ClusterVisit.dates (for searching)
by VisitorId + ClusterVisit.clusterId (for updating)
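In shell form, those indexes would be roughly the following (the collection name is assumed):

db.visits.createIndex({ VisitorId: 1 }, { unique: true })                          // uniqueness
db.visits.createIndex({ "ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1 })   // searching
db.visits.createIndex({ VisitorId: 1, "ClusterVisit.clusterId": 1 })              // updating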
I also have to split groups of clusters into different collections in order to access the data more efficiently.
Importing:
First, we search for a combination of VisitorId and clusterId and $addToSet the date, as sketched below.
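In update form, that first step might look like this (collection name assumed; date1 as in the model above):

db.visits.update(
  { VisitorId: 1, "ClusterVisit.clusterId": 1 },     // match the visitor and the cluster entry
  { $addToSet: { "ClusterVisit.$.dates": date1 } }   // positional operator targets the matched entry
)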
Second: if the first doesn't match, we upsert the new cluster entry:
db.visits.update(
  { VisitorId: 1 },
  { $addToSet: { ClusterVisit: { clusterId: 1, dates: [date1] } } },
  { upsert: true }
)
Between the first and second steps, I cover the cases where the clusterId doesn't exist yet or the VisitorId doesn't exist yet.
Problems:
Totally inefficient (near impossible) on update/insert/upsert once the collection grows; I guess because the document size gets bigger whenever a new date is added.
Difficult to maintain (unsetting dates, mostly).
I have a collection with more than 50,000,000 documents that I can't grow any more; it manages only ~100 updates/sec.
I think the model I'm using is not the best for this amount of information. What do you think would be best to get more upserts/sec and query the information FAST, before I mess with sharding, which is going to take more time while I learn and get confident with it?
I have an x1.large instance on AWS
RAID 10 with 10 disks
Arrays are expensive on large collections: mapreduce, aggregate...
Try .explain() (see the sketch after this list):
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit index hints (also sketched below):
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
The end of memory space for collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Slow logs performance track:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can only use one index (a compound index is better):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
More Stack Overflow links on improving performance:
https://stackoverflow.com/a/7635093/602018
A good starting point for further sharding/replication education is:
https://education.10gen.com/courses
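As a concrete starting point for the first two items, against the visitor model from the question (collection and index keys assumed):

// confirm which index the search actually uses
db.visits.find({ "ClusterVisit.clusterId": 1 }).explain()

// force a specific index if the planner picks a bad one
db.visits.find({ "ClusterVisit.clusterId": 1 })
         .hint({ "ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1 })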

Cassandra Or MongoDB For Our Location Based Application

We are looking at using a NoSQL database system for a large project. Currently, we have read a bit about MongoDB and Cassandra, though we have absolutely no experience with either. We are very proficient with traditional relational databases like MySQL and Microsoft SQL, but the NoSQL (key/value store) is a new paradigm for us.
So basically, which NoSQL database do you guys recommend for our use?
We do both heavy writes and reads. Basically we have tens of thousands of devices that are reporting:
device_id (int), latitude (decimal), longitude (decimal), date/time (datetime), heading (char(2)), speed (int)
Every minute. So, at peak times we need to be able to process hundreds of writes a second.
Then, we also have users that are querying this information in the form of: give me all messages from device_id 1234 for the last day, or last week. Users also do other querying like: give me all messages from device_id 1234 where speed is greater than 50 and date is today.
So, our initial thoughts are that MongoDB or Cassandra is going to allow us to scale this much more easily than a traditional database would.
A document or value in MongoDB or Cassandra for us might look like:
{
  device_id: 1234,
  location: [-118.12719739973545, 33.859012351859946],
  datetime: 1282274060,
  heading: "N",
  speed: 34
}
Which system do you guys recommend? Thanks greatly.
MongoDB has built-in support for geospatial indexes: http://www.mongodb.org/display/DOCS/Geospatial+Indexing
As an example to find the 10 closest devices to that location you can just do
db.devices.find({location: {$near: [-118.12719739973545, 33.859012351859946]}}).limit(10)
I have a post on a location-based app using MongoDB, just like the one you described. MongoDB, with its strong query and index support, might be the better choice for you. Like Cassandra, MongoDB has partitioning and replication for scaling reads and writes, but their underlying architectures are very different.
Although you have not mentioned any location-based queries, if you are interested in queries like "give me all the devices within radius r of location l, between time t1 and t2", you will find MongoDB's geospatial querying and indexing extremely useful.
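Such a query might look roughly like this (a sketch: the 2d index, the radius, and the time window are assumptions):

db.devices.createIndex({ location: "2d" })   // legacy coordinate pairs, as in the example document
var t1 = 1282270000, t2 = 1282280000;        // hypothetical epoch-second time window
db.devices.find({
  location: { $near: [-118.12719739973545, 33.859012351859946], $maxDistance: 0.5 },  // distance in coordinate units
  datetime: { $gte: t1, $lte: t2 }
})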
I have done some work with mongodb and geospatial data, but not on the scale mentioned above. The geospatial searches are very fast, much more so than in mysql.
I suggest looking into mongodb's sharding, replication, and clustering functionality to deal with the volume of writes. Sharding on the device identifier may be a good way to handle the write volume (a sketch follows below). If you're interested in proximity of events, then sharding on lat/lng may be more appropriate.
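A minimal sketch of the device-identifier approach (database and collection names are assumptions):

sh.enableSharding("tracking")                              // hypothetical database name
sh.shardCollection("tracking.devices", { device_id: 1 })   // spread writes across shards by device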
jack
Go with mongodb for geo-location search. Release 2.4 improves on core geo features. Lots of big sites use it for geolocation search.
You might consider using ElasticSearch. ES keeps the JSON of the original document stored, along with all the indexed fields. The JSON can be instantiated into any modern language's variables/arguments. In Java, one could even disable that and store native Java persistence data in a field. After search retrieval, just loop through and instantiate a collection of the original object types.
Using ElasticSearch gives you trie indexes for high-speed numeric range queries; you obviously get full-text searches of every flavor, plus geographic bounding-box queries, all with AND or OR filtering. Date searches are also native (although Java's handling of dates is bad enough that I switched to BIGINT representations of timestamps to represent dates).
UNLIKE some past and maybe present NoSQL solutions, geographic indexing and querying is PART of any query, and no extra steps are required. I.e., one MongoDB solution in the recent past required a geospatial search to collect conforming document IDs, after which you used those IDs inside another query to search for your other criteria. In reality, that's what happens in all solutions anyway, but it's much faster and cached in ElasticSearch.