MongoDB $in operator query performance with boolean values - mongodb

Is my query performance affected if I use
db.collection.find({field:{$in:[true,false]}})
I know it returns the same thing as
db.collection.find({})
But I may have some cases where the query ends up looking like the first one. So, is it going to affect my query performance?
Thanks in advance!

Definitely, the simple find will be faster, because:
It doesn't need to check any index.
It doesn't need to apply any filter.
Whereas for the other query
db.collection.find({field:{$in:[true,false]}})
If you have an index on field, the performance may be better, but that depends on the total amount of data, cluster resources, schema, etc.
If you do not have an index, it will be slower than the simple find, as it has to do a COLLSCAN and test every document against the filter.
Again, these are all theories. You should try it out with different amounts of data in your cluster and benchmark the results as recommended in the documentation.
When do you need db.collection.find({field:{$in:[true,false]}}) in your case?
You would only need it if some documents have a value for field that is neither true nor false (for example, the field is missing or null). Otherwise, you can use the simple find.
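To make the difference between the two queries concrete, here is a minimal sketch in plain JavaScript that imitates the filter on an in-memory array. It is not MongoDB code: the `matchesIn` helper and the sample documents are made up for illustration, and the sketch only handles scalar values (no array-valued fields).

```javascript
// Hypothetical in-memory stand-in for a collection; not real MongoDB code.
const docs = [
  { _id: 1, field: true },
  { _id: 2, field: false },
  { _id: 3, field: null },  // explicit null
  { _id: 4 },               // field missing
  { _id: 5, field: "yes" }, // a different type
];

// Imitates {field: {$in: values}} for scalar values using strict equality.
function matchesIn(doc, name, values) {
  return values.some((v) => doc[name] === v);
}

// Imitates find({field: {$in: [true, false]}}).
const filtered = docs.filter((d) => matchesIn(d, "field", [true, false]));
// Imitates find({}): nothing is tested, every document is returned.
const unfiltered = docs.slice();

console.log(filtered.map((d) => d._id)); // only the documents holding a boolean
console.log(unfiltered.length);          // all documents
```

In this sketch the filtered result leaves out the documents with a null, missing, or non-boolean field, which is the only situation where the `$in` form returns something different from the plain find.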

Related

MongoDB $all optimization of tag-based query

A non-distributed database has many posts, posts have zero or more user-defined tags, most posts have the most_posts_have_this tag, few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow, it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
The short answer is no. This is due to how Mongo builds an index on an array:
To index a field that holds an array value, MongoDB creates an index key for each element in the array.
So when you query the tags field, imagine that Mongo queries each tag separately and then intersects the results.
If you run "explain" you will see that after the index scan stage Mongo executes a document fetch stage. For a pure index scan this stage would be redundant, so its presence shows that the index alone does not answer the query. Basically, Mongo fetches all the documents matched by the index scan, and only then applies the "$all" logic in the filtering stage.
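The "one index key per array element" behavior quoted above can be pictured with a small in-memory sketch. This is plain JavaScript, not MongoDB internals: `buildMultikeyIndex` and the sample posts are hypothetical, and a real index is a B-tree rather than a Map.

```javascript
// Hypothetical posts; the tag names come from the question.
const posts = [
  { _id: 1, tags: ["most_posts_have_this"] },
  { _id: 2, tags: ["most_posts_have_this", "few_posts_have_this"] },
  { _id: 3, tags: ["most_posts_have_this"] },
  { _id: 4, tags: [] },
];

// Build one index entry per array element, as the quoted documentation describes.
function buildMultikeyIndex(docs, field) {
  const index = new Map(); // key -> array of _id values
  for (const doc of docs) {
    // A Set avoids duplicate entries when a document repeats a tag.
    for (const key of new Set(doc[field])) {
      if (!index.has(key)) index.set(key, []);
      index.get(key).push(doc._id);
    }
  }
  return index;
}

const tagIndex = buildMultikeyIndex(posts, "tags");
console.log(tagIndex.get("most_posts_have_this").length); // entries behind the common tag
console.log(tagIndex.get("few_posts_have_this").length);  // entries behind the rare tag
```

The number of entries behind each key is what makes the choice of starting tag matter: scanning the common tag touches far more entries than scanning the rare one.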
So what can you do?
If you have prior knowledge of which tag is sparser, you could query that tag first and only then filter on the more common tag. I'm assuming this is not really the case, but it is worth considering if possible. If your tags are somewhat static, you could even precalculate this.
Otherwise you will have to consider restructuring the data to allow better index usage for this use case, although I will say that for most access patterns your current structure is better.
The new structure can be an object like so:
tags2: {
    tagname1: 1,
    tagname2: 2,
    ...
}
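Note that a plain index on tags2 indexes the embedded object as a whole, so each key of the object has to be indexed separately, for example with a wildcard index on tags2.$** (available since MongoDB 4.2). Mongo can then use the index entries of a single tag, ideally the rarer one, to execute the following query: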
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming to say the least, but sadly Mongo does not excel in this specific use case. I can think of more "hacky" approaches, but I would say these two are the more reasonable ones to actually consider implementing, depending on performance requirements.
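As an illustration of the restructuring described in this answer, the following plain JavaScript sketch converts a tags array into the tags2 object shape and imitates the two `$exists` conditions in memory. The helper names `tagsToObject` and `hasAllTags` are made up for this example. It also assumes tag names are valid field names; names containing a dot would clash with MongoDB's dotted-path syntax.

```javascript
// Convert a tags array into the object shape shown in the answer.
// The value 1 is an arbitrary placeholder; only the key's existence matters.
function tagsToObject(tags) {
  const tags2 = {};
  for (const tag of tags) {
    tags2[tag] = 1;
  }
  return tags2;
}

// Imitates {"tags2.a": {$exists: true}, "tags2.b": {$exists: true}} in memory.
function hasAllTags(doc, wanted) {
  return wanted.every((tag) =>
    Object.prototype.hasOwnProperty.call(doc.tags2, tag)
  );
}

const post = { _id: 1, tags: ["most_posts_have_this", "few_posts_have_this"] };
post.tags2 = tagsToObject(post.tags);

console.log(hasAllTags(post, ["most_posts_have_this", "few_posts_have_this"])); // true
console.log(hasAllTags(post, ["most_posts_have_this", "missing_tag"]));         // false
```

Keeping both `tags` and `tags2` on the document, as done here, is one possible choice; it doubles the tag storage but leaves the existing array-based queries untouched.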
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure if this would actually be faster (depends on how Mongo optimizes things) but could be worth a try.
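The idea of starting from the less frequent tag can be sketched in plain JavaScript. This is only an in-memory illustration of the strategy, not what MongoDB does internally: the `postings` map (tag to post IDs) and the `postsWithAllTags` helper are hypothetical and would have to be maintained or computed by the application.

```javascript
// Hypothetical per-tag posting lists (tag -> post ids).
const postings = new Map([
  ["most_posts_have_this", [1, 2, 3, 4, 5, 6]],
  ["few_posts_have_this", [2, 9]],
]);

// Intersect posting lists, starting from the shortest one, so the number
// of candidates examined is bounded by the rarest tag.
function postsWithAllTags(index, tags) {
  if (tags.length === 0) return [];
  // Unknown tags get an empty list, which makes the result empty.
  const lists = tags.map((t) => index.get(t) || []);
  lists.sort((a, b) => a.length - b.length);
  const [smallest, ...rest] = lists;
  const restSets = rest.map((list) => new Set(list));
  return smallest.filter((id) => restSets.every((s) => s.has(id)));
}

console.log(
  postsWithAllTags(postings, ["most_posts_have_this", "few_posts_have_this"])
); // [ 2 ]
```

Here only the two candidates behind the rare tag are checked against the common tag, instead of walking all six entries of the common one.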
The best you can do:
Make sure you have an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which it is using with cursor.explain(). You can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.

In mongo if I have a bunch of IDs, is it more performant to query per ID, or by $in: IDs?

I'm wondering how $in works behind the scenes, and what optimizations are made. Does it loop through the database, looking for the required items, or know immediately where those are? Do indexes matter in those operations?
I'm trying to be efficient as possible, by making one query, and querying the documents I need in one go, but maybe when providing a single ID, which is guaranteed to be indexed, it's faster, and worth the multiple queries.
I guess there is a factor of how many documents we're talking about, in my case it's only a few. I assume with a lot of IDs it may worth it to just query them in one go, but maybe not. I'm not too experienced in mongo.
Generally, it is always better to reduce network round trips to the database.
In your case, using the $in operator is better, because making a separate request to the database for each ID costs one round trip per ID.
When you send your query to the database, it will try to create the most efficient execution plan for it, and if there are any indexes that can make the plan more efficient, the database will use them.
Mongo creates an index on the _id field of every collection by default.
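The round-trip argument can be illustrated with a small simulation in plain JavaScript. `FakeCollection` is a made-up in-memory stand-in that merely counts requests; it does not talk to MongoDB, and real latency depends on the network and the driver.

```javascript
// Hypothetical in-memory "server" that counts how many requests it receives.
class FakeCollection {
  constructor(docs) {
    // The Map stands in for the default _id index.
    this.byId = new Map(docs.map((d) => [d._id, d]));
    this.roundTrips = 0;
  }

  // Imitates find({_id: id}): one request per call.
  findOne(id) {
    this.roundTrips += 1;
    return this.byId.has(id) ? this.byId.get(id) : null;
  }

  // Imitates find({_id: {$in: ids}}): one request for the whole list.
  findIn(ids) {
    this.roundTrips += 1;
    return ids.filter((id) => this.byId.has(id)).map((id) => this.byId.get(id));
  }
}

const ids = [1, 2, 3];

const perId = new FakeCollection([{ _id: 1 }, { _id: 2 }, { _id: 3 }]);
ids.forEach((id) => perId.findOne(id));

const batched = new FakeCollection([{ _id: 1 }, { _id: 2 }, { _id: 3 }]);
const found = batched.findIn(ids);

console.log(perId.roundTrips, batched.roundTrips); // 3 1
```

Both variants use the same index lookups; what changes is the number of requests, which grows with the number of IDs only in the per-ID variant.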

How to efficiently add a compound index with the _id field in MongoDB

I am doing a range query on _id and need to return only one particular field ("data") from the found documents. I would like to make this query indexOnly for optimal performance.
Here is the query:
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1})
This query is of course not indexOnly so I need to add another index:
db.collection.createIndex({_id:1,data:1})
and tell MongoDB to use that Index with:
db.collection.find({_id:{$gte:"c",$lte:"d"}},{_id:0,data:1}).hint({_id:1,data:1})
(The hint is needed because otherwise MongoDB will use the standard _id index for the query.)
This works as expected and makes the query indexOnly. However, one cannot delete the standard _id index even though it is no longer needed, which leads to a lot of wasted space for the doubled index. It is also annoying to be forced to always use hint() in the query.
So I am wondering if there is a smarter way to do this.
I don't believe that there is any way to do what you want. The _id index cannot be removed, and you need to have the second index in order to perform a covered (indexOnly) query on your data.
Do you really have the need to have only a single index? I would suspect that you probably only have the requirement for either increased speed or reduced disk usage, but not both. If you do really have a requirement for both increased speed and reduced disk usage, you may need to look for a different database solution, since all of the techniques used to speed up MongoDB queries (indexes, covered queries, sharding, etc) tend to trade increased disk usage in order to gain the speed boost they provide.
EDIT:
Also, if the call to hint is bugging you, you can probably leave it off, since MongoDB will eventually re-optimize its query plan, at which point it will switch over to your new index if it really is faster.
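To show what "covered" (indexOnly) means for the range query in the question, here is a minimal in-memory sketch in plain JavaScript. It is an illustration under simplifying assumptions, not MongoDB's implementation: the sample documents and `coveredRangeQuery` are made up, and a linear filter stands in for the B-tree range scan a real index would do.

```javascript
// Hypothetical documents with an extra field that the query does not need.
const documents = [
  { _id: "b1", data: 10, extra: "large blob" },
  { _id: "c1", data: 20, extra: "large blob" },
  { _id: "c7", data: 30, extra: "large blob" },
  { _id: "e2", data: 40, extra: "large blob" },
];

// Imitates an index on {_id: 1, data: 1}: sorted entries holding only
// those two fields.
const compoundIndex = documents
  .map((d) => ({ _id: d._id, data: d.data }))
  .sort((a, b) => (a._id < b._id ? -1 : a._id > b._id ? 1 : 0));

// Range scan over the index entries only. The documents array is never
// read here, which is what makes the query "covered".
function coveredRangeQuery(index, low, high) {
  return index
    .filter((entry) => entry._id >= low && entry._id <= high)
    .map((entry) => ({ data: entry.data })); // projection {_id: 0, data: 1}
}

console.log(coveredRangeQuery(compoundIndex, "c", "d"));
```

The sketch also shows where the extra disk usage mentioned in the answer comes from: the index duplicates the `_id` and `data` values that already live in the documents.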

Is mongoDB efficient in doing multi-key lookups?

I'm evaluating MongoDB, coming from Membase/memcached, because I want more flexibility.
Of course Membase is excellent at doing fast (multi-)key lookups.
I like the additional options that MongoDB gives me, but is it also fast at doing multi-key lookups? I've seen the $or and $in operators and I'm sure I can model it with those. I just want to know if it's performant (in the same league as Membase).
Use case: e.g., Lucene/Solr returns 20 product IDs. Look up these product IDs in Couchdb to return the docs or the appropriate fields.
Thanks,
Geert-Jan
For your use case, I'd say it is. From my experience: I hacked some analytics into a database of mine that made a lot of $in queries with thousands of IDs (it was a hack). To my surprise, it worked rather well, in the low-millisecond range.
Of course, it's hard to compare this, and -as usual- theory is a bad companion when it comes to performance. I guess the best way to figure it out is to migrate some test data and send some queries to the system.
Use MongoDB's excellent built-in profiler, use $explain, keep the one-index-per-query rule in mind, take a look at the logs, keep an eye on mongostat, and do some benchmarks. This shouldn't take too long and should give you a definite answer. If your queries turn out slow, people here and on the newsgroup probably have some ideas on how to improve the exact query or the indexing.
One index per query. It's sometimes thought that queries on multiple keys can use multiple indexes; this is not the case with MongoDB. If you have a query that selects on multiple keys, and you want that query to use an index efficiently, then a compound-key index is necessary.
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Oneindexperquery.
There's more information on that page as well with regard to Indexes.
The bottom line is that Mongo will be great if your indexes are in memory and you index the fields you want to query, using compound keys where a query selects on several fields. If you have poor indexing, your performance will suffer as a result. This is pretty much in line with most systems.
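The compound-key advice quoted above can be pictured with a small in-memory sketch in plain JavaScript. The products, field names, and helpers are hypothetical. A Map keyed by both values is used for brevity; unlike a real compound B-tree index, it cannot answer queries on the first field alone.

```javascript
// Hypothetical products queried by category and brand.
const products = [
  { _id: 1, category: "shoes", brand: "acme" },
  { _id: 2, category: "shoes", brand: "zeta" },
  { _id: 3, category: "hats", brand: "acme" },
  { _id: 4, category: "shoes", brand: "acme" },
];

// Imitates a compound index on {category: 1, brand: 1}: one structure keyed
// by both values, so a two-key query is a single lookup.
function buildCompoundIndex(docs, fieldA, fieldB) {
  const index = new Map();
  for (const doc of docs) {
    const key = JSON.stringify([doc[fieldA], doc[fieldB]]);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc._id);
  }
  return index;
}

// Look up the ids of the documents matching both values.
function lookup(index, a, b) {
  return index.get(JSON.stringify([a, b])) || [];
}

const categoryBrandIndex = buildCompoundIndex(products, "category", "brand");
console.log(lookup(categoryBrandIndex, "shoes", "acme")); // [ 1, 4 ]
```

With two separate single-field structures, the lookup would have to pick one of them and then filter the candidates on the other field, which is the situation the quoted advice warns about.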

How complete should MongoDB indexes be?

For example, I have documents with only three fields: user, date, status. Since I select by user and sort by date, I have those two fields as an index. That is the proper thing to do. However, since each date only has one status, I am essentially indexing everything. Is it okay to not index all fields in a query? Where do you draw the line?
What makes this question more difficult is the complete opposite approach to indexes between read-heavy and write-heavy collections. If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Is it okay to not index all fields in a query?
Yes, but you'll want to avoid this for frequently used queries. Anything not indexed will imply a "table scan". This means accessing each possible document individually, which will be slow.
Where do you draw the line?
Also note that if you sort by an un-indexed field, MongoDB will "yell at you" if you're trying to sort too much data. So you have to have some awareness of how much data is "outside of" the index.
If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Monitoring, instrumenting, experimenting and experience.
There is no hard and fast rule here, it's all going to be about trade-offs. CPU vs. RAM vs. Disk IO vs. Responsiveness, etc.
The perfect situation is to store everything in a single index. By everything I mean all the fields you query on, sort by, and retrieve. This ensures that you get maximum performance (if the index fits in RAM).
This situation is not always possible, so you'll have to make choices.
Here are three tips to keep the index as small as possible:
Does each of your queries return a lot of results or only a few? => A few: you do not have to index all the fields you retrieve (only the query and sort fields, because few results mean few disk accesses).
Are your query results often the same (i.e. is your working set small)? => Don't index the fields you retrieve, because the results are cached by MongoDB.
Is one of your query fields more selective than another? => Index only the more selective field.
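For the last tip, one rough way to compare how selective two fields are is to count distinct values in a sample of documents. The sketch below is plain JavaScript over made-up data using the three fields from the question. The `selectivity` helper is hypothetical, and the ratio is only a heuristic that says nothing about skewed value distributions.

```javascript
// Hypothetical sample of documents with the three fields from the question.
const sample = [
  { user: "ann", date: "2012-01-01", status: "open" },
  { user: "ann", date: "2012-01-02", status: "closed" },
  { user: "bob", date: "2012-01-01", status: "open" },
  { user: "cat", date: "2012-01-03", status: "open" },
];

// Fraction of distinct values: closer to 1 means a value narrows the
// result set down further, i.e. the field is more selective.
function selectivity(docs, field) {
  if (docs.length === 0) return 0;
  return new Set(docs.map((d) => d[field])).size / docs.length;
}

console.log(selectivity(sample, "user"));   // 3 distinct values over 4 documents
console.log(selectivity(sample, "status")); // 2 distinct values over 4 documents
```

In this sample `user` is the more selective field, so by the tip above it would be the one to index if only one of the two can be.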