I am using the below query on my MongoDB collection which is taking more than an hour to complete.
db.collection.find({language:"hi"}).sort({_id:-1}).skip(5000).limit(1)
I am trying to get the results in batches of 5000, processed in either ascending or descending order, for documents with "hi" as the value of the language field. So I am using this query, in which I skip the already-processed documents each time by incrementing the "skip" value.
The document count in this collection is just above 20 million.
An index on the field "language" is already created.
The MongoDB version I am using is 2.6.7.
Is there a more appropriate index for this query which can get the result faster?
When you want to sort descending, you should create a multi-field index which uses the field(s) you sort on as descending field(s). You do that by setting those field(s) to -1.
This index should greatly increase the performance of your sort:
db.collection.ensureIndex({ language: 1, _id: -1 });
When you also want to speed up the other case - retrieving sorted in ascending order - create a second index like this:
db.collection.ensureIndex({ language: 1, _id: 1 });
Keep in mind that when you do not sort your results, you receive them in natural order. Natural order is often insertion order, but there is no guarantee for that. There are various events which can cause the natural order to get messed up, so when you care about the order you should always sort explicitly. The only exception to this rule are capped collections which always maintain insertion order.
In order to efficiently "page" through results in the way that you want, it is better to use a "range query" and keep the last value you processed.
Your desired "sort key" here is _id, so that makes things simple:
First you want your index in the correct order, which is done with .createIndex() (the non-deprecated method):
db.collection.createIndex({ "language": 1, "_id": -1 })
Then you want to do some simple processing, from the start:
var lastId = null;
var cursor = db.collection.find({language:"hi"});
cursor.sort({_id:-1}).limit(5000).forEach(function(doc) {
// do something with your document. But always set the next line
lastId = doc._id;
})
That's the first batch. Now when you move on to the next one:
var cursor = db.collection.find({ "language": "hi", "_id": { "$lt": lastId } });
cursor.sort({_id:-1}).limit(5000).forEach(function(doc) {
// do something with your document. But always set the next line
lastId = doc._id;
})
This way the lastId value is always considered when making the selection. You store it between batches and continue on from the last one.
That is much more efficient than processing with .skip(), which regardless of the index will "still" need to "skip" through all data in the collection up to the skip point.
Using the $lt operator here "filters" all the results you already processed, so you can move along much more quickly.
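The batching logic above can be sketched in plain JavaScript, with an in-memory array standing in for the collection (the numeric _id values and batch size of 5 are just for illustration; in MongoDB the _ids would be ObjectIds and the batch size 5000):

```javascript
// Simulated collection: documents with numeric _id values.
const docs = [];
for (let i = 1; i <= 12; i++) {
  docs.push({ _id: i, language: "hi" });
}

// One "batch": filter by language, apply the _id range bound,
// sort descending, and limit. This mirrors the shell query above.
function nextBatch(collection, lastId, batchSize) {
  return collection
    .filter(d => d.language === "hi" && (lastId === null || d._id < lastId))
    .sort((a, b) => b._id - a._id)
    .slice(0, batchSize);
}

// Walk the whole collection in batches, keeping lastId between batches.
let lastId = null;
const batches = [];
while (true) {
  const batch = nextBatch(docs, lastId, 5);
  if (batch.length === 0) break;
  lastId = batch[batch.length - 1]._id;
  batches.push(batch.map(d => d._id));
}
// batches is [[12,11,10,9,8],[7,6,5,4,3],[2,1]]
```

Note that each batch starts from the stored bound rather than re-counting from the beginning, which is exactly what .skip() cannot avoid doing.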
I have a collection with 1000+ records and I need to run the query below. I have come across the issue that this query takes more than a minute even if the departmentIds array has length something like 15-20. I think if I use an index the query time will be reduced.
From what I observe the 99% of the time spent on the query is due to the $in part.
How do I decide which fields to index? Should I index only department.department_id, since that's what is taking most of the time, or should I create a compound index on userId, something, and department.department_id (basically all the fields I'm using in the query)?
Here is what my query looks like
let departmentIds = [.......................... can be large]
let query = {
userId: someid,
something: something,
'department.department_id': {
$in: departmentIds
}
};
//db query
let result = db
.collection(TABLE_NAME)
.find(query)
.project({
anotherfield: 1,
department: 1
})
.toArray();
You need to check all search cases and create indexes for those that are used often and are most critical for your application. For the particular case above, these seem to be the index options:
userId:1
userId:1,something:1
userId:1,something:1,department.department_id:1
I would bet on option 1, since userId sounds like a unique key with high selectivity, which is very suitable for an index. Of course, it is best to do some testing and identify the fastest option; the explain option can help a lot with that:
db.collection.find().explain("executionStats")
This is executed immediately:
db.mycollection.find({ strField: 'AAA'}).count()
And this takes a lot to finish:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $exists: true }}).count()
This is how I created my index:
db.mycollection.createIndex({strField: 1, dateTimeField: 1}, { sparse: true })
But it doesn't help, even when using hint(indexName).
Why this happens and how to fix it?
The { $exists: true } query predicate is problematic, especially if there are documents in the collection for which that field does not exist.
When MongoDB creates an index entry for a document, it collects all of the field values according to the index spec, and concatenates them.
If a field is not present in the document, the index stores null in that field's position.
If the field is explicitly set to null, it also stores null in that field's position.
This means that these 2 documents will have identical entries in the index:
{ strField: 'AAA', dateTimeField: null}
{ strField: 'AAA'}
Note that even with the index being sparse, both documents will be indexed, since at least one of the index's fields exists in each document.
When testing {dateTimeField:{$exists:true}}, the first document will match, while the second will not.
When processing a count query using an index, if the query can be satisfied by scanning a single range of the index, the query executor can use a count_scan stage, and get the correct result without loading a single document from disk.
Because the executor cannot definitively tell from the index whether or not the field exists, it cannot use a count_scan, and must instead use an ordinary ixscan followed by a fetch stage, and load all of the matching documents from disk in order to arrive at the correct count.
In the case of the first query, the executor would have been able to use a count_scan, while the second would have had to examine all of the documents. You should be able to see this by running explain with the executionStats option on each query.
One way to avoid this pitfall is to take advantage of the fact that MongoDB query operators are type-sensitive. This means that the following query will match only documents where dateTimeField holds a timestamp greater than or equal to epoch 0:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $gte: new ISODate("1970-01-01T00:00:00Z") }}).count()
This will allow the query executor to count all of the documents that have the matching string and contain a date, but will exclude documents that contain a dateTimeField with a numeric or string value.
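The core of the problem, that a missing field and an explicit null collapse to the same index entry, can be sketched in plain JavaScript by simulating index-key extraction (the field names are taken from the question; the extraction function is a simplified stand-in for what the storage engine does):

```javascript
// Build the index key for a document under the spec
// { strField: 1, dateTimeField: 1 }. A field that is absent
// is stored as null, exactly like an explicit null.
function indexKey(doc, fields) {
  return fields.map(f => (doc[f] === undefined ? null : doc[f]));
}

const withNull = { strField: "AAA", dateTimeField: null };
const without  = { strField: "AAA" };

const k1 = indexKey(withNull, ["strField", "dateTimeField"]);
const k2 = indexKey(without,  ["strField", "dateTimeField"]);
// Both keys are ["AAA", null], so the index alone cannot
// answer { dateTimeField: { $exists: true } } without fetching the documents.
```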
document : {
score:123
}
I have a field in the document called score(integer). I want to use a range query db.collection.find({score: {$gte: 100, $lt: 200}}). I have definite number of these ranges(approx 20).
Should I introduce a new field in the document to indicate the type of range and then query on the identifier of that range? For example:
document: {
score: 123,
scoreType: "type1"
}
So which query is better:
1. db.collection.find({score: {$gte: 100, $lt: 200}})
2. db.collection.find({scoreType: "type1"})
In any case I will have to create an index on either score or scoreType.
Which index would tend to perform better?
It depends entirely on your situation. If you are sure your set of ranges will always remain the same, then use scoreType.
Keep in mind: scoreType is a fixed value and thus will not help when you query over different ranges. It might work for 100 to 200 if that scoreType was created with this range in mind, but it will not work for other ranges such as 100 to 500 (do you plan on adding a new scoreType2?). With flexibility in scope, this is a bad idea.
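If you do go the scoreType route, the value has to be computed at write time and kept in sync with score. A minimal sketch of that bucketing (the boundary values and type names here are invented for illustration):

```javascript
// Map a raw score to one of a fixed set of range labels.
// Boundaries are illustrative; you would list your ~20 real ranges here.
const ranges = [
  { type: "type1", min: 100, max: 200 },
  { type: "type2", min: 200, max: 300 },
  { type: "type3", min: 300, max: 500 },
];

// Half-open buckets: min inclusive, max exclusive, matching $gte/$lt.
function scoreType(score) {
  const r = ranges.find(r => score >= r.min && score < r.max);
  return r ? r.type : null;
}

// Set the field whenever the document is written:
const doc = { score: 123, scoreType: scoreType(123) }; // scoreType: "type1"
```

The cost is that every change to the range boundaries requires rewriting the scoreType field on existing documents, which is exactly the inflexibility the answer above warns about.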
I have Document
class Store(Document):
store_id = IntField(required=True)
items = ListField(ReferenceField(Item, required=True))
meta = {
'indexes': [
{
'fields': ['campaign_id'],
'unique': True
},
{
'fields': ['items']
}
]
}
I want to set up indexes on items and store_id. Is my configuration right?
Your second index declaration looks like it should do what you want. But to make sure that the index is really effective, you should use explain. Connect to your database with the mongo shell and perform a find-query which should use that index followed by .explain(). Example:
db.yourCollection.find({items:"someItem"}).explain();
The output will be a document with lots of fields. The documentation explains what exactly each field means. Pay special attention to these fields:
millis: the time in milliseconds the query required
indexOnly: whether the query was answered from the index alone
n: the number of returned documents
nscannedObjects: the number of objects which had to be examined without using an index. For an index-only query this should be equal to n. When it is higher, it means that some documents could not be excluded by the index and had to be scanned manually.
I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because it matters. Here is the complete python code:
def reduce_duplicates(mydb, max_group_size):
# 1. Count the group sizes
res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce, 'filter_scratch', full_response = True)
# 2. For each entry from the filter scratch collection having count > max_group_size
deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
key = entry['_id']
group_size = int(entry['value'])
# 2b. query the original collection by the entry key, order it by test_date ascending, limit to the group size minus max_group_size.
for id in mydb.static.find(key, limit = group_size - max_group_size, **deleteFindArgs):
mydb.static.remove(id)
return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
MR the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most N newest records. So, the last two steps are foreach-found-remove and this is the important detail of my question, that changes everything and I had to be more specific about it - sorry.
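The core selection logic (everything except the actual database calls) can be sketched in plain JavaScript; the record shape { key, test_date } mirrors the fields used in the Python code above, with numeric test_date values standing in for real timestamps:

```javascript
// Keep at most maxGroupSize newest records per key;
// return the surplus (oldest) records, i.e. the ones to delete.
function recordsToDelete(records, maxGroupSize) {
  // 1. Group by key (the map-reduce step in the original).
  const byKey = new Map();
  for (const r of records) {
    if (!byKey.has(r.key)) byKey.set(r.key, []);
    byKey.get(r.key).push(r);
  }
  // 2. For each oversized group, sort oldest-first and take
  //    everything beyond the newest maxGroupSize records.
  const doomed = [];
  for (const group of byKey.values()) {
    if (group.length > maxGroupSize) {
      group.sort((a, b) => a.test_date - b.test_date);
      doomed.push(...group.slice(0, group.length - maxGroupSize));
    }
  }
  return doomed;
}

const recs = [
  { key: "a", test_date: 1 }, { key: "a", test_date: 2 },
  { key: "a", test_date: 3 }, { key: "b", test_date: 1 },
];
const toDelete = recordsToDelete(recs, 2);
// Only the oldest "a" record (test_date 1) is selected for deletion.
```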
Now, about the collection remove command. It does accept query, but mine include sorting and limiting. Can I do it with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to corrupt Mongo. Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, that the foreach-found-remove approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.
You can use a query to remove all matching documents
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.
Let's say, you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
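The chunked-deletion loop can be sketched in plain JavaScript, with an array standing in for the collection (the filter, chunk size, and document shape are invented for illustration; against a real server each round would be a find with a limit followed by a deleteMany on the collected _ids):

```javascript
// Delete all documents matching a predicate, chunkSize at a time.
function deleteInChunks(collection, matches, chunkSize) {
  let deleted = 0;
  let rounds = 0;
  while (true) {
    // Collect the _ids of up to chunkSize matching documents...
    const ids = collection.filter(matches).slice(0, chunkSize).map(d => d._id);
    if (ids.length === 0) break;
    // ...then remove exactly those documents.
    const idSet = new Set(ids);
    for (let i = collection.length - 1; i >= 0; i--) {
      if (idSet.has(collection[i]._id)) collection.splice(i, 1);
    }
    deleted += ids.length;
    rounds++;
  }
  return { deleted, rounds };
}

// 25 documents, 13 of which match the filter; chunks of 5.
const coll = [];
for (let i = 0; i < 25; i++) {
  coll.push({ _id: i, name: i % 2 === 0 ? "John" : "Jane" });
}
const result = deleteInChunks(coll, d => d.name === "John", 5);
// result is { deleted: 13, rounds: 3 }
```

Because each round re-queries for the next chunk, the loop terminates as soon as no matching documents remain, and the database is never asked to remove more than one chunk in a single operation.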
You can remove it directly using MongoDB scripting language:
db.mycoll.remove({_id:'your_id_here'});
Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6m documents in a 100m-document collection. Documentation is at https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany
db.collection.deleteMany(
<filter>,
{
writeConcern: <document>,
collation: <document>
}
)
I would recommend paging if there is a large number of records.
First: Get the count of data you want to delete:
-------------------------- COUNT --------------------------
var query = {"FIELD": "XYZ", "DATE": {$lt: new ISODate("2019-11-10")}};
db.COL.aggregate([
{$match:query},
{$count: "all"}
])
Second: Start deleting chunk by chunk:
-------------------------- DELETE --------------------------
var query = {"FIELD": "XYZ", "DATE": {$lt: new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
{$match:query},
{ $limit : 5 }
])
cursor.forEach(function (doc){
db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query = {"FIELD": "XYZ", "DATE": {$lt: new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.COL.deleteMany({"_id": { "$in": ids.map(r => r._id)}});
Run this query in the mongo shell:
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using Node.js, write this code:
User.remove({ _id: req.body.id }, function(err){...});