Mongo $in operator performance - mongodb

Is it slow/poor form to use the $in operator in MongoDB with a large array of possibilities?
posts.find({
  author : {
    $in : ['friend1','friend2','friend3'....'friend40']
  }
})
App Engine, for example, won't let you use more than 30 because they translate directly to one query per item in the IN array, and so instead force you into using their method for handling fan out. While that's probably the most efficient method in Mongo too, the code for it is significantly more complex so I'd prefer to just use this generic method.
Will Mongo execute these $in queries efficiently for reasonable-sized datasets?

$in can be fairly efficient with small lists (it is hard to say what "small" is, but at least into the tens or hundreds). It does not work like App Engine, since MongoDB has actual B-tree indexes and isn't a column store like Bigtable.
With $in it will skip around in the index to find the matching documents, or walk through the whole collection if there isn't an index to use.
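As a rough illustration of that "skipping around", the sketch below models a single-field index as a sorted array of [key, docId] pairs and seeks into it once per $in value. This is a simplified model, not MongoDB's actual B-tree code; lowerBound, lookupIn and authorIndex are names made up for the example.

```javascript
// Simplified model of an index: [key, docId] pairs sorted by key.
// Assumption for illustration only; a real B-tree is more involved.
function lowerBound(index, key) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (index[mid][0] < key) lo = mid + 1;
    else hi = mid;
  }
  return lo; // position of the first entry whose key is >= key
}

// One binary-search seek per $in value, then read the run of equal keys.
function lookupIn(index, values) {
  const docIds = [];
  for (const value of values) {
    for (let i = lowerBound(index, value); i < index.length && index[i][0] === value; i++) {
      docIds.push(index[i][1]);
    }
  }
  return docIds;
}

// Hypothetical index on the author field.
const authorIndex = [
  ['alice', 1], ['bob', 2], ['bob', 5], ['carol', 3], ['dave', 4],
];
console.log(lookupIn(authorIndex, ['bob', 'dave', 'zed'])); // [ 2, 5, 4 ]
```

Each value costs one seek, so the number of seeks grows with the length of the $in list rather than with the size of the collection.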

Assuming you have created an index on the author field, from an algorithmic point of view the time complexity of the $in operation is O(N*log(M)), where N is the length of the input array and M is the size of the collection.
The time complexity of the $in operation will not change unless you switch to a different database (though I don't think any database can beat O(N*log(M))).
However, from an engineering point of view, if N becomes large it is better to let your business logic server simulate the $in operation, either in batches or one by one.
This is simply because memory on database servers is far more valuable than memory on business logic servers.
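A minimal sketch of the batching idea, assuming the application splits the value list itself and runs one find() per batch. The helper names chunk and buildInFilters, and the batch size, are made up for illustration; nothing here is a MongoDB API.

```javascript
// Split a large list of $in values into fixed-size batches.
function chunk(values, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < values.length; i += batchSize) {
    batches.push(values.slice(i, i + batchSize));
  }
  return batches;
}

// Build one filter document per batch; each can be passed to find() separately.
function buildInFilters(field, values, batchSize) {
  return chunk(values, batchSize).map((batch) => ({ [field]: { $in: batch } }));
}

const filters = buildInFilters('author', ['f1', 'f2', 'f3', 'f4', 'f5'], 2);
console.log(JSON.stringify(filters));
```

The application would then run each filter in turn and concatenate the results, trading extra round trips for smaller individual queries.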

If you build an index (ensureIndex) on the field you are matching against, it should be pretty quick.
Have you tried using explain()? It's a good, built-in way to profile your queries:
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Use%7B%7Bexplain%7D%7D.

Related

MongoDB $all optimization of tag-based query

A non-distributed database has many posts, posts have zero or more user-defined tags, most posts have the most_posts_have_this tag, few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow; it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
The short answer is no. This is due to how Mongo builds an index on an array:
"To index a field that holds an array value, MongoDB creates an index key for each element in the array."
So when you query the tags field, imagine that Mongo queries each tag separately and then does an intersection.
If you run explain() you will see that after the index scan phase Mongo executes a fetch-document phase. For a pure index scan that phase would be redundant, so its presence shows that the index alone does not answer the query. Basically, Mongo fetches ALL documents that have either of the tags, and only then performs the $all logic in the filtering phase.
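To check which phases a query goes through, one option is to walk the plan that explain() returns and list its stage names. The sketch below assumes the queryPlanner.winningPlan shape (stage, inputStage, inputStages) used by recent server versions; older versions report plans differently. listStages is a made-up helper and samplePlan is a hand-written stand-in, not real server output.

```javascript
// Collect stage names from an explain() plan tree, outermost stage first.
function listStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage, ...children.flatMap((child) => listStages(child))];
}

// Hypothetical, hand-written plan used only to exercise the helper.
const samplePlan = {
  stage: 'FETCH',
  inputStage: { stage: 'IXSCAN', indexName: 'tags_1' },
};
console.log(listStages(samplePlan)); // [ 'FETCH', 'IXSCAN' ]
```

In a shell, the argument would be something like the queryPlanner.winningPlan field of the explain() result for the $all query; a FETCH stage above the IXSCAN corresponds to the fetch phase described above.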
So what can you do?
If you have prior knowledge of which tag is sparser, you could query that one first and only then filter on the more common tag. I'm assuming this is not really the case, but it is worth considering if possible. If your tags are somewhat static, maybe you can even precalculate this.
Otherwise you will have to consider a restructuring that allows better index usage for this use case. I will say that for most access patterns your current structure is better.
The new structure can be an object like so:
tags2: {
tagname1: 1,
tagname2: 2,
...
}
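Now, if each key of tags2 is indexed separately, the idea is that Mongo can skip the "fetch" phase, as the index contains all the information needed to execute the query below. Note that a plain index on the tags2 field indexes the embedded document as a single value; indexing each key separately needs a wildcard index (MongoDB 4.2+) or one index per tag field.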
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming to say the least, but sadly Mongo does not excel in this specific use case. I can think of more "hacky" approaches, but I would say these two are the more reasonable ones to actually consider implementing, depending on performance requirements.
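A minimal sketch of the restructuring described above, assuming tags are plain strings that are safe to use as field names (no dots and no leading dollar sign). tagsToObject and allTagsFilter are made-up helper names, and storing 1 as the value is an arbitrary choice.

```javascript
// Convert a tags array into the object form, one key per tag.
function tagsToObject(tags) {
  const out = {};
  for (const tag of tags) out[tag] = 1;
  return out;
}

// Build a filter that requires every listed tag to exist under tags2.
function allTagsFilter(tags) {
  const filter = {};
  for (const tag of tags) filter[`tags2.${tag}`] = { $exists: true };
  return filter;
}

const post = { tags2: tagsToObject(['most_posts_have_this', 'few_posts_have_this']) };
console.log(JSON.stringify(post));
console.log(JSON.stringify(allTagsFilter(['most_posts_have_this', 'few_posts_have_this'])));
```

Whether this is actually faster than the array form depends on how the keys are indexed, so it is worth comparing explain() output for both shapes before migrating data.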
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure if this would actually be faster (depends on how Mongo optimizes things) but could be worth a try.
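A sketch of the "less frequent tag first" idea, assuming the application already has per-tag counts (the counts object below is hypothetical). It uses an equality match on the rarest tag followed by $all on the rest, rather than two $in stages. The server may combine adjacent $match stages, so there is no guarantee this changes the plan; compare explain() output before relying on it.

```javascript
// Order tags from least to most frequent; tags without a known count sort first.
function orderByRarity(tags, counts) {
  return [...tags].sort((a, b) => (counts[a] ?? 0) - (counts[b] ?? 0));
}

// Match the rarest tag first, then require the remaining tags.
function buildRarestFirstPipeline(tags, counts) {
  if (tags.length === 0) throw new Error('at least one tag is required');
  const [rarest, ...rest] = orderByRarity(tags, counts);
  const pipeline = [{ $match: { tags: rarest } }];
  if (rest.length > 0) pipeline.push({ $match: { tags: { $all: rest } } });
  return pipeline;
}

// Hypothetical precomputed tag frequencies.
const counts = { most_posts_have_this: 90000, few_posts_have_this: 12 };
const pipeline = buildRarestFirstPipeline(
  ['most_posts_have_this', 'few_posts_have_this'],
  counts
);
console.log(JSON.stringify(pipeline));
```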
The best you can do:
Make sure you have an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which it is using with cursor.explain(). You can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.

Querying $in with one record performance same as $eq? [duplicate]


How do I sort a MongoDB collection in MeteorJS permanently?

From the tutorials out there I know that I can sort a MongoDB collection in meteor on request like this:
// Sorted by createdAt descending
Users.find({}, {sort: {createdAt: -1}})
But I feel like this solution is not optimal in terms of performance.
Because if I understand it right, every time there is a request for Users, the raw collection is requested and then sorted over and over again.
So wouldn't it be better to sort the whole collection once and for all and then access the already sorted collection with Users.find()?
The question is: How do I sort the whole collection permanently not just the found results?
This is a known limitation of MiniMongo, Meteor's client-side implementation of (a subset of) the MongoDB functionality.
"Sorting" a MongoDB collection does not really have a coherent meaning. It does not translate into a concrete set of operations. What would you sort it by? Is there a "natural" way to sort a set of documents whose structure may vary?
The mechanism that is used for making data retrieval more efficient is an index. On the server, indices are used to assist sorting, if possible:
In MongoDB, sort operations can obtain the sort order by retrieving documents based on the ordering in an index. If the query planner cannot obtain the sort order from an index, it will sort the results in memory. Sort operations that use an index often have better performance than those that do not use an index. In addition, sort operations that do not use an index will abort when they use 32 megabytes of memory.
(Source: MongoDB documentation)
As a collection does not have an inherent order to it, the entity that holds information about the order requirements in MongoDB is a Cursor. A cursor can be fetched multiple times, and in theory could be made into an efficient ordered data fetcher.
Unfortunately, this is not the case at the moment. The way it is currently implemented, MiniMongo does not have indices and does not cache the documents by order. They are re-sorted every time the cursor is fetched.
The sorting is reasonably efficient (as efficient as sorting can be: O(n*log(n)) invocations of the sort function), but for a large data set it could be fairly lengthy and degrade the user experience.
At the moment, if you have a special use case that requires repeated access to a large data set that is always ordered the same way, you could try keeping a cache of ordered documents, by observing the cursor and updating the cache when there are changes.
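A minimal sketch of such a cache, written as plain JavaScript so it runs without Meteor. SortedCache is a made-up class; its method names and argument order mirror the added, changed and removed callbacks that cursor.observe() accepts, and it assumes every document has a unique _id.

```javascript
// Keeps documents in sorted order so reads never need to re-sort.
class SortedCache {
  constructor(compare) {
    this.compare = compare; // same contract as Array.prototype.sort comparators
    this.docs = [];
  }

  // Binary search for the position after all docs that sort before or equal to doc.
  _insertionIndex(doc) {
    let lo = 0;
    let hi = this.docs.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.compare(this.docs[mid], doc) <= 0) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  added(doc) {
    this.docs.splice(this._insertionIndex(doc), 0, doc);
  }

  removed(oldDoc) {
    const i = this.docs.findIndex((d) => d._id === oldDoc._id);
    if (i !== -1) this.docs.splice(i, 1);
  }

  changed(newDoc, oldDoc) {
    // The sort key may have changed, so re-insert at the right position.
    this.removed(oldDoc);
    this.added(newDoc);
  }
}

// Sorted by createdAt descending, as in the question.
const cache = new SortedCache((a, b) => b.createdAt - a.createdAt);
cache.added({ _id: 'u1', createdAt: 10 });
cache.added({ _id: 'u2', createdAt: 30 });
cache.added({ _id: 'u3', createdAt: 20 });
cache.changed({ _id: 'u1', createdAt: 40 }, { _id: 'u1', createdAt: 10 });
console.log(cache.docs.map((d) => d._id)); // [ 'u1', 'u2', 'u3' ]
```

In a Meteor app the three methods would be passed as the added, changed and removed callbacks of Users.find({}).observe(...), and consumers would read cache.docs instead of fetching the sorted cursor again.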

Skipping the first term of a compound index by using hint()

Suppose I have a Mongo collection with fields a and i. I've populated this collection with {a:'a', i : index } where index increases iteratively from 0 to 1000, and created a compound index {a:1,i:1}.
I know this is very, very wrong, but can't explain (no pun intended) why:
collection.find({i:{$gt:500}}).explain() confirms that the index was not used (I can see that it scanned all 1,000 documents in the collection).
Somehow forcing Mongo to use the index seems to work though:
collection.find({i:{$gt:500}}).hint({a:1,i:1}).explain()
Edit
The Mongo documentation is very clear that it will only use a compound index if one of your query terms matches the first term of the compound index. In this case, using hint, it appears that Mongo used the compound index {a:1,i:1} even though the query terms do NOT include a. Is this true?
The interesting part about the way MongoDB performs queries is that it may actually run multiple candidate plans in parallel to determine which is best. It may have chosen not to use the index because of other experimenting you've done from the shell, or because of when you added the data and whether it was in memory, etc. (or a few other factors). Looking at the performance numbers, it's not reporting that using the index was actually any faster than not using it (although you shouldn't put much stock in those numbers generally). In this case, the data set is really small.
But, more importantly, according to the MongoDB docs, the output from the hinted run also suggests that the query wasn't covered entirely by the index (indexOnly=false).
That's because your index is a:1, i:1, yet the query is for i. Compound indexes only support searches based on any prefix of the indexed fields (meaning they must be in the order they were specified).
http://docs.mongodb.org/manual/core/read-operations/#query-optimization
FYI: Use the verbose option to see a report of all plans that were considered for the find().
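The prefix rule can be written down as a small check: walk the index keys in order and stop at the first one the query does not constrain. This is a simplified rule of thumb, not the query planner's actual logic, and usablePrefix is a made-up helper.

```javascript
// Return the leading index keys that the query constrains, in index order.
function usablePrefix(indexKeys, queryFields) {
  const constrained = new Set(queryFields);
  const prefix = [];
  for (const key of indexKeys) {
    if (!constrained.has(key)) break; // a gap ends the usable prefix
    prefix.push(key);
  }
  return prefix;
}

console.log(usablePrefix(['a', 'i'], ['i']));      // []
console.log(usablePrefix(['a', 'i'], ['a']));      // [ 'a' ]
console.log(usablePrefix(['a', 'i'], ['i', 'a'])); // [ 'a', 'i' ]
```

An empty result for the query on i alone matches the behaviour in the question: without a hint the planner has no prefix of {a:1,i:1} to work with.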

Does providing a projection argument to find() limit the data that is added to Mongo's working set?

In Mongo, suppose I have a collection mycollection that has fields a, b, and huge. I very frequently want to perform queries, mapreduce, updates, etc. on a and b, and very occasionally want to return huge in query results as well.
I know that db.mycollection.find() will scan the entire collection and result in Mongo attempting to add the whole collection to the working set, which may exceed the amount of RAM I have available.
If I instead call db.mycollection.find({}, { a : 1, b : 1 }), will this still result in the whole collection being added to the working set or only the terms of my projection?
MongoDB can use something called covered queries: http://docs.mongodb.org/manual/applications/indexes/#create-indexes-that-support-covered-queries. These allow you to load all the values from the index rather than from the documents on disk (or from memory, if those documents are in memory at the time).
Be warned that you cannot use covered queries on a full table scan; the condition, projection and sort must all be within the index, e.g.:
db.col.createIndex({a:1,b:1}); // ensureIndex() in older shells
db.col.find({a:1}, {_id:0, a:1, b:1}).sort({b:1});
would work (the sort is optional here). You can add _id to your index if you intend to return that too.
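The conditions above can be expressed as a rough pre-check. The sketch below only looks at top-level field names and ignores operators, nested fields and multikey indexes, so it is a necessary-condition check rather than what the server actually decides. couldBeCovered is a made-up helper.

```javascript
// Rough check: could a query be answered from the index alone?
function couldBeCovered(indexKeys, filter, projection, sort = {}) {
  // Per the warning above, an unfiltered scan is not treated as coverable.
  if (Object.keys(filter).length === 0) return false;

  const indexed = new Set(indexKeys);
  const included = Object.keys(projection).filter((f) => projection[f]);
  // Without an inclusion projection, whole documents are returned.
  if (included.length === 0) return false;

  // _id comes back by default unless the projection explicitly excludes it.
  const idExcluded = projection._id === 0 || projection._id === false;
  const returned = idExcluded ? included : [...new Set([...included, '_id'])];

  const needed = [...Object.keys(filter), ...returned, ...Object.keys(sort)];
  return needed.every((field) => indexed.has(field));
}

console.log(couldBeCovered(['a', 'b'], { a: 1 }, { _id: 0, a: 1, b: 1 }, { b: 1 })); // true
console.log(couldBeCovered(['a', 'b'], { a: 1 }, { a: 1, b: 1 }));                   // false
```

The second call fails only because _id is returned by default and is not part of the index, which is why the example above projects it out.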
Map Reduce does not support covered queries; as far as I know there is no way to project only a certain set of fields into the MR, though maybe there is some hack I do not know of. Map Reduce only supports a $match-like operator as its input query, with a separate parameter for the sort of the incoming query ( http://docs.mongodb.org/manual/applications/map-reduce/ ).
Note that for updates, I believe only atomic operations ( http://docs.mongodb.org/manual/tutorial/isolate-sequence-of-operations/ ), excluding findAndModify, do not load the document into your working set; however, "believe" is the key word there.
Considering you need to do both MR and normal find and update on these records I would strongly recommend you look into checking why you are paging in so much data and whether you really do need to do it that often. It seems like you are trying to do too much processing in a short and frequent amount of time.
On the other hand, if this is a script which runs every night or so (e.g. a scoreboard recalculation script), then I would not worry too much about its excessive working set.