MongoDB partial index not working as expected with count - mongodb

On MongoDB 3.6.3 I create this collection with two million documents:
function randInt(n) { return parseInt(Math.random()*n); }

for (var j = 0; j < 20; j++) {
    print("Building op " + j);
    var bulkop = db.media.initializeOrderedBulkOp();
    for (var i = 0; i < 100000; ++i) {
        bulkop.insert({
            id_profile: NumberLong("222"),
            needle_id: NumberInt(randInt(2000000000)),
            visibility: NumberInt(randInt(5)),
        });
    }
    print("Executing op " + j);
    bulkop.execute();
}
Then I create this partial index:
db.media.createIndex(
    { "id_profile": 1, "visibility": 1 },
    { unique: false, partialFilterExpression: { "needle_id": { $exists: true } } }
);
Then I run this query, which exactly matches the partial index:
db.media.count({ $and: [
    { id_profile: NumberInt(222) },
    { visibility: NumberInt(0) },
    { needle_id: { $exists: true } }
]})
But it's slow :( In fact it is just as slow as when I use a plain (non-partial) index and have to filter out all the docs that don't have needle_id:
db.media.createIndex(
    { "id_profile": 1, "visibility": 1 },
    { unique: false }
);
db.media.count({ $and: [
    { id_profile: NumberInt(222) },
    { visibility: NumberInt(0) },
    { needle_id: { $exists: true } }
]})
So is this a bug with partial indexes? What can I do to speed up my count?

I investigated this issue for quite some time, and it seems that running a count query against a partial index does not avoid examining the documents behind the index.
With a regular index on a large collection, a count query that is covered by the index finishes within a few milliseconds: it effectively just counts the index entries for the given key bounds. The execution plan of such a query shows only an index scan stage (IXSCAN), meaning the index is scanned over the requested bounds and the number of entries it contains is counted.
But with a partial index, the count as implemented by MongoDB does not stay inside the index. The execution plan shows a FETCH stage instead of a bare index scan: each candidate document is loaded from the collection before it is counted, much as a modifying query would do, and only then is the total returned. Because the fetch is still driven by the index, it is quicker than having no index at all, but much slower than a pure index scan.
More complete information can be found here: Why does the "distinct" and "count" commands happen so slowly on indexed items in MongoDB?
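For intuition, here is a plain Node.js sketch (not the mongo shell; the data, field values, and function names are made up) of the difference between counting index keys directly and fetching each document behind a partial index:

```javascript
// Conceptual model only, not MongoDB internals. A partial index stores
// keys just for documents matching its filter expression, so a count
// could in principle be answered from index keys alone, while a FETCH
// stage must load each candidate document.

const docs = [];
for (let i = 0; i < 10000; i++) {
  const d = { id_profile: 222, visibility: i % 5 };
  if (i % 2 === 0) d.needle_id = i;        // only half the docs have needle_id
  docs.push(d);
}

// "Partial index" on {id_profile, visibility}, filter: needle_id exists.
const partialIndex = docs
  .filter(d => d.needle_id !== undefined)
  .map(d => ({ key: [d.id_profile, d.visibility], doc: d }));

// Index-only count: scan matching keys, never touch the documents.
function countScan(index, idProfile, visibility) {
  return index.filter(e => e.key[0] === idProfile && e.key[1] === visibility).length;
}

// Fetch-based count: walk the index but load each document to re-check
// the predicate (what the observed FETCH stage effectively does).
function fetchCount(index, idProfile, visibility) {
  let n = 0, docsExamined = 0;
  for (const e of index) {
    if (e.key[0] !== idProfile || e.key[1] !== visibility) continue;
    docsExamined++;                         // simulated document load
    if (e.doc.needle_id !== undefined) n++;
  }
  return { n, docsExamined };
}

const a = countScan(partialIndex, 222, 0);
const b = fetchCount(partialIndex, 222, 0);
// Both agree on the count, but fetchCount examined one document per match.
```

Both approaches return the same total; the difference is purely in how many documents have to be loaded along the way.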

Related

How to optimize mongo query with two parallel array?

I have a query like this:
xml_db.find(
    {
        'high_performer': { '$nin': [some_value] },
        'low_performer': { '$nin': [some_value] },
        'expiration_date': { '$gte': datetime.now().strftime('%Y-%m-%d') },
        'source': 'some_value'
    }
)
I have tried to create an index with those fields but getting error:
pymongo.errors.OperationFailure: cannot index parallel arrays [low_performer] [high_performer]
So, how to efficiently run this query?
Compound index field ordering should follow the equality --> sort --> range rule. A good description of this can be found in this response.
This means that the first field in the index would be source, followed by the range filters (expiration_date, low_performer and high_performer).
As you noticed, one of the "performer" fields cannot be included in the index since only a single array can be indexed. You should use your knowledge of the data set to determine which filter (low_performer or high_performer) would be more selective and choose that filter to be included in the index.
Assuming that high_performer is more selective, the only remaining step would be to determine the ordering between expiration_date and high_performer. Again, you should use your knowledge of the data set to make this determination based on selectivity.
Assuming expiration_date is more selective, the index to create would then be:
{ "source" : 1, "expiration_date" : 1, "high_performer" : 1 }
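Since knowledge of the data set drives the choice, here is one rough way to compare the selectivity of the two "performer" filters, sketched in plain Node.js over a made-up sample (the field values and proportions are purely illustrative):

```javascript
// Illustrative sample: "x" is rare in high_performer, common in low_performer.
const sample = [];
for (let i = 0; i < 100; i++) {
  sample.push({
    high_performer: i % 10 === 0 ? ["x"] : ["y"],
    low_performer:  i % 2 === 0 ? ["x"] : ["y"],
  });
}

// Fraction of documents a {$nin: ["x"]} filter on the field would keep.
// The filter that keeps the SMALLEST fraction is the most selective.
function ninSelectivity(docs, field) {
  return docs.filter(d => !d[field].includes("x")).length / docs.length;
}

const high = ninSelectivity(sample, "high_performer"); // 0.9
const low  = ninSelectivity(sample, "low_performer");  // 0.5
// For this sample, the low_performer filter keeps fewer documents, so it
// would be the better candidate to include in the index.
```

Running a measurement like this against a representative sample of your real data (or comparing `totalDocsExamined` in `explain()` output) is more reliable than guessing.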

Created indexes on a mongodb collection, still fails while sorting a large data set

My Query below:
db.chats.find({ bid: 'someID' }).sort({start_time: 1}).limit(10).skip(82560).pretty()
I have indexes on chats collection on the fields in this order
{
    "cid" : 1,
    "bid" : 1,
    "start_time" : 1
}
I am trying to perform sort, but when I write a query and check the result of explain(), I still get the winningPlan as
{
    "stage" : "SKIP",
    "skipAmount" : 82560,
    "inputStage" : {
        "stage" : "SORT",
        "sortPattern" : {
            "start_time" : 1
        },
        "limitAmount" : 82570,
        "inputStage" : {
            "stage" : "SORT_KEY_GENERATOR",
            "inputStage" : {
                "stage" : "COLLSCAN",
                "filter" : {
                    "ID" : {
                        "$eq" : "someID"
                    }
                },
                "direction" : "forward"
            }
        }
    }
}
I was expecting not to have a sort stage in the winning plan as I have indexes created for that collection.
Having no index at all results in the following error:
MongoError: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM [duplicate]
However, I managed to make the sort work by increasing the RAM allocation from 32 MB to 64 MB. I am looking for help with adding the indexes properly.
The order of fields in an index matters. To sort query results by a field that is not at the start of the index key pattern, the query must include equality conditions on all the prefix keys that precede the sort keys. The cid field is neither in the query nor used for sorting, so you must leave it out. Then you put the bid field first in the index definition, since you use it in an equality condition. The start_time field goes after that, to be used for sorting. The index should therefore look like this:
{"bid" : 1, "start_time" : 1}
See the documentation for further reference.
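To see why {"bid": 1, "start_time": 1} removes the SORT stage, here is a conceptual sketch in plain Node.js (not the mongo shell; the data is made up): index entries are stored sorted by the compound key, so the contiguous run of entries for one bid value is already in start_time order and can be streamed out directly.

```javascript
// Build a toy "index": entries sorted by (bid, start_time), like a
// compound B-tree index would store them.
const index = [];
for (let i = 0; i < 1000; i++) {
  index.push({ bid: "b" + (i % 10), start_time: i });
}
index.sort((a, b) =>
  a.bid < b.bid ? -1 : a.bid > b.bid ? 1 : a.start_time - b.start_time);

// An equality condition on bid selects one contiguous run of the index.
// No extra sort is needed: the run is already ordered by start_time.
const run = index.filter(e => e.bid === "b3").map(e => e.start_time);
const alreadySorted = run.every((v, i) => i === 0 || run[i - 1] <= v);
```

This is the same reason the cid prefix breaks the plan: without an equality condition on cid, the bid === "b3" entries would no longer be contiguous or pre-sorted.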

How to efficiently page batches of results with MongoDB

I am using the below query on my MongoDB collection which is taking more than an hour to complete.
db.collection.find({language:"hi"}).sort({_id:-1}).skip(5000).limit(1)
I am trying to get the results in batches of 5000, to process in either ascending or descending order, for documents with "hi" as the value of the language field. So I am using this query and skipping the already-processed documents each time by incrementing the "skip" value.
The document count in this collection is just above 20 million.
An index on the field "language" is already created.
The MongoDB version I am using is 2.6.7.
Is there a more appropriate index for this query which can get the result faster?
When you want to sort descending, you should create a multi-field index which uses the field(s) you sort on as descending field(s). You do that by setting those field(s) to -1.
This index should greatly increase the performance of your sort:
db.collection.ensureIndex({ language: 1, _id: -1 });
When you also want to speed up the other case - retrieving sorted in ascending order - create a second index like this:
db.collection.ensureIndex({ language: 1, _id: 1 });
Keep in mind that when you do not sort your results, you receive them in natural order. Natural order is often insertion order, but there is no guarantee for that. There are various events which can cause the natural order to get messed up, so when you care about the order you should always sort explicitly. The only exception to this rule are capped collections which always maintain insertion order.
In order to efficiently "page" through results in the way that you want, it is better to use a "range query" and keep track of the last value you processed.
Your desired "sort key" here is _id, which makes things simple:
First you want your index in the correct order, created with .createIndex() (rather than the deprecated ensureIndex() method):
db.collection.createIndex({ "language": 1, "_id": -1 })
Then you want to do some simple processing, from the start:
var lastId = null;
var cursor = db.collection.find({ "language": "hi" });

cursor.sort({ "_id": -1 }).limit(5000).forEach(function(doc) {
    // do something with your document, but always record the last _id seen
    lastId = doc._id;
});
That's the first batch. Now when you move on to the next one:
var cursor = db.collection.find({ "language": "hi", "_id": { "$lt": lastId } });

cursor.sort({ "_id": -1 }).limit(5000).forEach(function(doc) {
    // do something with your document, but always record the last _id seen
    lastId = doc._id;
});
That way the lastId value is always considered when making the selection. You store it between batches and continue from where the last one left off.
That is much more efficient than processing with .skip(), which regardless of the index will "still" need to "skip" through all data in the collection up to the skip point.
Using the $lt operator here "filters" all the results you already processed, so you can move along much more quickly.
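The batching above can be sketched in plain Node.js over an in-memory array (names and data are illustrative, not a real MongoDB collection): each batch resumes from the last _id seen instead of skipping over everything already processed.

```javascript
// Toy collection: 20 documents, all with language "hi".
const docs = [];
for (let i = 1; i <= 20; i++) docs.push({ _id: i, language: "hi" });

// One batch: filter on language, resume below lastId, sort _id descending
// (like sort({_id: -1})), and take at most `size` documents.
function nextBatch(lastId, size) {
  return docs
    .filter(d => d.language === "hi" && (lastId === null || d._id < lastId))
    .sort((a, b) => b._id - a._id)
    .slice(0, size);
}

let lastId = null;
const batches = [];
for (;;) {
  const batch = nextBatch(lastId, 5);
  if (batch.length === 0) break;
  batches.push(batch.map(d => d._id));
  lastId = batch[batch.length - 1]._id;   // resume point for the next batch
}
// batches holds [20..16], [15..11], [10..6], [5..1]: each batch picks up
// exactly where the previous one stopped, with no skipping.
```

With the {language: 1, _id: -1} index in place, the real query does the equivalent seek directly in the index, so each batch costs roughly the same regardless of how deep into the collection you are.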

MongoDB MongoEngine index declaration

I have Document
class Store(Document):
    store_id = IntField(required=True)
    items = ListField(ReferenceField(Item, required=True))

    meta = {
        'indexes': [
            {
                'fields': ['campaign_id'],
                'unique': True
            },
            {
                'fields': ['items']
            }
        ]
    }
I want to set up indexes on items and store_id; is my configuration right?
Your second index declaration looks like it should do what you want. But to make sure that the index is really effective, you should use explain. Connect to your database with the mongo shell and perform a find-query which should use that index followed by .explain(). Example:
db.yourCollection.find({items:"someItem"}).explain();
The output will be a document with lots of fields. The documentation explains what exactly each field means. Pay special attention to these fields:
millis: time in milliseconds the query required
indexOnly: whether the query was answered from the index alone
n: number of returned documents
nscannedObjects: the number of documents that had to be examined. For a covered (index-only) query this should equal n; when it is higher, some documents could not be excluded by the index alone and had to be inspected individually.

does using sort with $or prevent the use of indexes in mongodb?

It seems like using $or with a sort does a full table scan and avoids my indexes on title and keywords. How can I get it to use my two indexes when running an $or query?
This query uses both the title and keywords indexes:
db.tasks.find({$or: [{keywords: /^japan/}, {title:/^japan/}]})
This one does a full table scan and uses my total_-1 index:
db.tasks.find({$or: [{keywords: /^japan/}, {title:/^japan/}]}).sort({total:-1})
Meanwhile, queries against only keywords or only title with a sort do use the indexes on keywords or title respectively:
db.tasks.find({title:/^japan/}).sort({total:-1})
db.tasks.find({keywords:/^japan/}).sort({total:-1})
Sorting and indexes in Mongo are a complex topic. Mongo also has a special error that prevents you from doing a sort without an index if you have too many items. So it's good that you're asking about indexes, because an un-indexed sort will eventually start failing.
There is a bug in JIRA that seems to cover your issue, however there are some extra details to consider.
The first thing to note are your last queries:
db.tasks.find({title:/^japan/}).sort({total:-1})
db.tasks.find({keywords:/^japan/}).sort({total:-1})
These queries will fail eventually because you are only indexing on title, not on title/total. Here's a script that demonstrates the problem:
> db.foo.ensureIndex({title:1})
> for(var i = 0; i < 100; i++) { db.foo.insert({title: 'japan', total: i}); }
> db.foo.count()
100
> db.foo.find({title: 'japan'}).sort({total:-1}).explain()
... uses BTreeCursor title_1
> // Now try with one million items
> for(var i = 0; i < 1000000; i++) { db.foo.insert({title: 'japan', total: i}); }
> db.foo.find({title: 'japan'}).sort({total:-1}).explain()
Sat Mar 31 05:57:41 uncaught exception: error: {
"$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
"code" : 10128
}
So if you plan to query & sort on title and total, then you need an index on both, in that order:
> db.foo.ensureIndex({title:1,total:1})
> db.foo.find({title: 'japan'}).sort({total:-1}).explain()
{
"cursor" : "BtreeCursor title_1_total_1 reverse",
...
The JIRA bug I listed above is for something like the following:
> db.foo.find({$or: [{title:/^japan/}, {title:/^korea/}]}).sort({total:-1})
Yours is slightly different, but it will encounter the same problem: even if you have indexes on both title/total and keywords/total, MongoDB will not be able to use them optimally.
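A conceptual sketch in plain Node.js (made-up data, not MongoDB internals) of why the sorted $or is hard to serve from the two indexes: each "index scan" yields results in its own key order, so the merged, de-duplicated union still has to be re-sorted by total afterwards.

```javascript
// Toy documents; each branch of the $or is simulated as its own scan.
const docs = [
  { _id: 1, title: "japan tour", keywords: "japan",         total: 5 },
  { _id: 2, title: "korea",      keywords: "japanese food", total: 9 },
  { _id: 3, title: "japanese",   keywords: "travel",        total: 1 },
];

// Simulated index scans for each $or branch.
const byTitle    = docs.filter(d => /^japan/.test(d.title));
const byKeywords = docs.filter(d => /^japan/.test(d.keywords));

// Merge the two result sets, de-duplicate by _id (a document may match
// both branches), then sort by total: this final sort is the step that
// neither the title index nor the keywords index can provide.
const seen = new Set();
const merged = [...byTitle, ...byKeywords].filter(d =>
  seen.has(d._id) ? false : (seen.add(d._id), true));
merged.sort((a, b) => b.total - a.total);
const ids = merged.map(d => d._id);  // [2, 1, 3]
```

Since the union's order by total cannot be read off either branch's index, the server ends up doing that sort itself, which is why the plan degrades the way you observed.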