MongoDB $nin query optimization - mongodb

I have a MongoDB collection that has about 5 million documents. The data in the documents look something like the following:
{
since: [Some Unix timestamp ie. 1660106561)
team: ["a","b","c]
}
when I run the following query, my mongoDB connection times out:
db.myCollection.find({team: {$nin: ['b']}}).sort({since: -1})
I have the following compound indices on my collection:
[since, team]
Is there anything I can do to prevent this query from timing out?

This query is just very hard to execute, $nin is a none selective query, This means it does not utilize indexes well.
For instance, the inequality operators $nin and $ne are not very selective since they often match a large portion of the index. As a result, in many cases, a $nin or $ne query with an index may perform no better than a $nin or $ne query that must scan all documents in a collection.
Specfically in this context i believe the index is even hurting the query performance, because the index is a multikey index using $nin on it still forces scanning large portions of the index tree, after that the query engine still has to fetch and filter each document to make sure it does not have the B value. performance can get even worse if this is just a toy example and the carnality of values is even greater. essentially forcing the engine to do double work.
I reckon that if you execute this query without any index usage, are alternatively if the "since" date matches the "narutal" order and you can drop the skip, you can get much better performance like this:
db.collection.find({team: {$nin: ['b']}}).sort({$natural: -1}).hint({$natural: 1});
Mind you, any query that needs to scan millions of documents will not have a lightning fast execution speed.

Related

Why does mongodb not use index scan but collection scan with find()?

I am using mongodb 3.2.4
When I execute db.mytable.find().explain() The winning plan is 'Collscan'
But when I execute db.mytable.find().hint(_id:1).explain() The winning plan is 'IXscan'
So here comes a question: since _id is the default index of a table, why mongodb does not use this index to query?
An index can be used when there is a filter criteria or a sort operation - when the fields in the index are used in the filter predicate and/or the sort. In your case, the find method doesn't have a filter criteria or a sort - so no index is used, and you can see that in the query plan as a collection scan. It is as expected. But, when you provide a hint to the find method the query optimizer tries to use the index, and in your case it did (and you see it in the query plan as an IXSCAN). In either case, with or the without the hint, the find has to scan all the documents or keys in the index.
The _id has a default unique index, yes, but unless you are using the _id field in the query filter predicate or in a sort, the query cannot use it (or, specify explicitly to use index with a hint). You can verify with the following queries, db.mytable.find( { _id: 123 } ) or db.mytable.find( { } ).sort( { _id: -1 } ) the query planner will show index scan even though you do not specify the hint.
The main purpose of the indexes is to make your queries run fast; it is about query performance. It has to be a query with filter predicate and/or a sort operation to use an index (and the fields used in the filter or sort must be indexed for performance). With the find method, in your case, without any of the two you are just accessing all the documents as they are in the collection and the index is of no use (and the query optimizer shows that in the plan).

Which query is faster to perform in mongodb: using $in range or pair of $gte and $lte?

I'm interesting in performance issue. Suppose I have a collection with field ref (represented in each document). What I want is to find all documents in specific range (for example, [1-1,000,000]. Is there any difference in the following queries in terms of db performance
db.test.find({"ref": {"$gte":1, "$lte": 1000000}}) and
db.test.find({"ref": {"$in": [1,2,3, ..., 1000000]}})
Additional question is about memory consumption. Which query is more suitable in this case if I use pymongo driver?
There is a huge difference, since range query will be transformed internally to regex query, and if there is an index on 'ref' field, covered query will be evaluated almost in no-time.
Query where you pass a list of elements is a bit heavy, so it will take more time to send it over the wire, after which MongoDb engine will have to evaluate comparison of each array element with each document property.

How to index an $or query with sort

Suppose I have a query that looks something like this:
db.things.find({
deleted: false,
type: 'thing',
$or: [{
'creator._id': someid
}, {
'parent._id': someid
}, {
'somerelation._id': someid
}]
}).sort({
'date.created': -1
})
That is, I want to find documents that meets one of those three conditions and sort it by newest. However, $or queries do not use indexes in parallel when used with a sort. Thus, how would I index this query?
http://docs.mongodb.org/manual/core/indexes/#index-behaviors-and-limitations
You can assume the following selectivity:
deleted - 99%
type - 25%
creator._id, parent._id, somerelation._id - < 1%
Now you are going to need more than one index for this query; there is no doubt about that.
The question is what indexes?
Now you have to take into consideration that none of your $ors will be able to sort their data cardinally in an optimal manner using the index due to a bug in MongoDBs query optimizer: https://jira.mongodb.org/browse/SERVER-1205 .
So you know that the $or will have some performance problems with a sort and that putting the sort field into the $or clause indexes is useless atm.
So considering this the first index you want is one that covers the base query you are making. As #Leonid said you could make this into a compound index, however, I would not do it the order he has done it. Instead, I would do:
db.col.ensureIndex({type:-1,deleted:-1,date.created:-1})
I am very unsure about the deleted field being in the index at all due to its super low selectivity; it could, in fact, create a less performant operation (this is true for most databases including SQL) being in the index rather than being taken out. This part will need testing by you; maybe the field should be last (?).
As to the order of the index, again I have just guessed. I have said DESC for all fields because your sort is DESC, but you will need to explain this yourself here.
So that should be able to handle the master clause of your query. Now to deal with those $ors.
Each $or will use an index separately, and the MongoDB query optimizer will look for indexes for them separately too as though they are separate queries altogether, so something worth noting here is a little snag about compound indexes ( http://docs.mongodb.org/manual/core/indexes/#compound-indexes ) is that they work upon prefixes ( an example note here: http://docs.mongodb.org/manual/core/indexes/#id5 ) so you can't make one single compound index to cover all three clauses, so a more optimal method of declaring indexes on the $or (considering the bug above) is:
db.col.ensureindex({creator._id:1});
db.col.ensureindex({aprent._id:1});
db.col.ensureindex({somrelation._id:1});
It should be able to get you started on making optimal indexes for your query.
I should stress however that you need to test this yourself.
Mongodb can use only one index per query, so I can't see the way to use indexes to query someid in your model.
So, the best approach is to add special field for this task:
ids = [creator._id, parent._id, somerelation._id]
In this case you'll be able to query without using $or operator:
db.things.find({
deleted: false,
type: 'thing',
ids: someid
}).sort({
'date.created': -1
})
In this case your index will look something like this:
{deleted:1, type:1, ids:1, 'date.created': -1}
If you had flexibility to adjust the schema, I would suggest adding a new field, associatedIds : [ ] which would hold creator._id, parent._id, some relation._id - you can update that field atomically when you update the main corresponding field, but now you can have a compound index on this field, type and created_date which eliminates the need for $or in your query entirely.
Considering your requirement for indexing , I would suggest you to use $orderBy operator along side your $or query. By that I mean you should be able to index on the criteria's in your $or expressions used in your $or query and then you can $orderBy to sort the result.
For example:
db.things.find({
deleted: false,
type: 'thing',
$or: [{
'creator._id': someid
}, {
'parent._id': someid
}, {
'somerelation._id': someid
}]
},{$orderBy:{'date.created': -1}})
The above query would require compound indexes on each of the fields in the $or expressions combined with the sort object specified in the orderBy criteria.
for example:
db.things.ensureIndex{'parent._id': 1,"date.created":-1}
and so on for other fields.
It is a good practice to specify "limit" for the result to prevent mongodb from performing a huge in memory sort.
Read More on $orderBy operator here

Where does compound indexes in mongodb come into play

What are the advantages we get from compound indexes. I mean suppose we have a collection, in which I have to index over 2 fields say key1 and key2. How different is it from having a compound index {key1:1, key2:1}. Whats the problem with having 2 separate indexes. Can't mongodb make use of 2 or more indexes to satisfy a query.
As at MongoDB 2.2:
Every query, including update operations, use one and only one index.
The query optimizer selects the index empirically by occasionally running alternate query plans and by selecting the plan with the best response time for each query type.
An exception to the above rule is $or queries; each clause is executed in parallel and can use a separate index.
For more information see:
Indexing Overview
Query Optimizer
Explain

how to structure a compound index in mongodb

I need some advice in creating and ordering indexes in mongo.
I have a post collection with 5 properties:
Posts
status
start date
end date
lowerCaseTitle
sortOrder
Almost all the posts will have the same status of 1 and only a handful will have a rejected status. All my queries will filter on status, start and end dates, and sort on sortOrder. I also will have one query that does a regex search on the title.
Should I set up a compound key on {status:1, start:1, end:1, sort:1}? Does it matter which order I put the fields in the compound index - should I put status first in the compound index since it's the most broad? Is it better to do a compound index rather than a single index on each property? Does mongo only use a single index on any given query?
Are there any hints for indexes on lowerCaseTitle if I'm doing a regex query on that?
sample queries are:
db.posts.find({status: {$gte:0}, start: {$lt: today}, end: {$gt: today}}).sort({sortOrder:1})
db.posts.find( {lowerCaseTitle: /japan/, status:{$gte:0}, start: {$lt: today}, end: {$gt: today}}).sort({sortOrder:1})
That's a lot of questions in one post ;) Let me go through them in a practical order :
Every query can use at most one index (with the exception of top level $or clauses and such). This includes any sorting.
Because of the above you will definitely need a compound index for your problem rather than seperate per-field indexes.
Low cardinality fields (so, fields with very few unique values across your dataset) should usually not be in the index since their selectivity is very limited.
Order of the fields in your compound index matter, and so does the relative direction of each field in your compound index (e.g. "{name:1, age:-1}"). There's a lot of documentation about compound indexes and index field directions on mongodb.org so I won't repeat all of it here.
Sorts will only use the index if the sort field is in the index and is the field in the index directly after the last field that was used to select the resultset. In most cases this would be the last field of the index.
So, you should not include status in your index at all since once the index walk has eliminated the vast majority of documents based on higher cardinality fields it will at most have 2-3 documents left in most cases which is hardly optimized by a status index (especially since you mentioned those 2-3 documents are very likely to have the same status anyway).
Now, the last note that's relevant in your case is that when you use range queries (and you are) it'll not use the index for sorting anyway. You can check this by looking at the "scanAndOrder" value of your explain() once you test your query. If that value exists and is true it means it'll sort the resultset in memory (scan and order) rather than use the index directly. This cannot be avoided in your specific case.
So, your index should therefore be :
db.posts.ensureIndex({start:1, end:1})
and your query (order modified for clarity only, query optimizer will run your original query through the same execution path but I prefer putting indexed fields first and in order) :
db.posts.find({start: {$lt: today}, end: {$gt: today}, status: {$gte:0}}).sort({sortOrder:1})