Iterating over MongoDB collection to duplicate all documents is painfully slow - mongodb

I have a collection of 7,000,000 documents (each of perhaps 1-2 KB BSON) in a MongoDB collection that I would like to duplicate, modifying one field. The field is a string with a numeric value, and I would like to increment the field by 1.
From the Mongo shell, I took the following approach:
> var all = db.my_collection.find()
> all.forEach(function(it) {
... it._id = 0; // to force mongo to create a new objectId
... it.field = (parseInt(it.field) + 1).toString();
... db.my_collection.insert(it);
... })
Executing the above code is taking an extremely long time; at first I thought the code was broken somehow, but when I checked the status of the collection from a separate terminal roughly an hour later, the process was still running and there were now 7,000,001 documents! Sure enough, exactly 1 new document matched the incremented field.
For context, I'm running a 2015 MBP with 4 cores and 16 GB RAM. I see mongo near the top of my CPU usage, averaging about 85%.
1) Am I missing a bulk modify/update capability in MongoDB?
2) Is there any reason why the above operation would work, yet work so slowly that it creates new documents at a rate of about one per hour?

Try the db.collection.mapReduce() way:
NB: A single emit can only hold half of MongoDB’s maximum BSON document size.
var mapFunction1 = function() {
emit(ObjectId(), (parseInt(this.field) + 1).toString());
};
MongoDB will not call the reduce function for a key that has only a single value.
var reduceFunction1 = function(id, field) {
return field;
};
Finally,
db.my_collection.mapReduce(
mapFunction1,
reduceFunction1,
{"out":"my_collection"} //Replaces the entire content; consider merge
)

I'm embarrassed to say that I was mistaken in believing that this line:
... it._id = 0; // to force mongo to create a new objectId
does indeed force mongo to create a new ObjectId. Instead I needed to be explicit:
... it._id = ObjectId();
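For reference, here is the original loop with that one-line fix applied (an untested sketch; it assumes field always holds a numeric string):
> db.my_collection.find().forEach(function(it) {
... it._id = ObjectId(); // explicitly assign a fresh ObjectId so insert creates a new document
... it.field = (parseInt(it.field) + 1).toString();
... db.my_collection.insert(it);
... })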

Related

Getting previous page's results when using skip and limit while querying DocumentDB

I have a collection with ~800,000 documents, I am trying to fetch all of them, 5,000 at a time.
When running the following code:
const CHUNK_SIZE = 5000;
let skip = 0;
let matches;
do {
matches = await dbClient
.collection(collectionName)
.find({})
.skip(skip)
.limit(CHUNK_SIZE)
.toArray();
// ... some processing
skip += CHUNK_SIZE;
} while (matches.length);
After about 30 iterations, I start getting documents I already received in a previous iteration.
What am I missing here?
As posted in the comments, you'll have to apply a .sort() on the query.
To do so without adding too much performance overhead, it is easiest to do this on the _id, e.g.
.sort(
{
"_id" : 1.0
}
)
Neither MongoDB nor the Amazon DocumentDB flavor guarantees an implicit result sort order without it.
Amazon DocumentDB
Mongo Result Ordering
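Applied to the loop from the question, the fix would look something like this (a sketch; the variable names are those from the question):
const CHUNK_SIZE = 5000;
let skip = 0;
let matches;
do {
  matches = await dbClient
    .collection(collectionName)
    .find({})
    .sort({ _id: 1 }) // stable order, so skip/limit pages no longer overlap
    .skip(skip)
    .limit(CHUNK_SIZE)
    .toArray();
  // ... some processing
  skip += CHUNK_SIZE;
} while (matches.length);
Sorting by _id also makes range-based pagination possible (filtering on { _id: { $gt: lastSeenId } } instead of using skip), which avoids the growing cost of large skips, though that goes beyond the fix described above.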

How to parseFloat in a bulk update Mongo operation

So I have a few fields that are currently strings that I am converting over to Double values in my mongo database. I was originally doing it with a find().forEach(function(){ ... }) style, but that was taking way too long, so I decided to try out the bulk operations Mongo has. So here is my attempt:
var bulkOp = db.transactions.initializeOrderedBulkOp();
bulkOp.find({"tran.val": {$exists: true}}).update(
{
$set: {"tran.val":parseFloat("$tran.val")}
}
);
bulkOp.execute();
When I do this, my tran.val, which used to be "1.35" or something along those lines, now becomes NaN. I also tried without the $ on tran.val, but no luck. Am I missing something?
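For what it's worth: parseFloat runs client-side on the literal string "$tran.val" (it is not resolved to the field's value), so every matched document gets tran.val set to NaN. A minimal sketch of one workaround, assuming the conversion has to be computed client-side and queued into a bulk operation (collection and field names taken from the question):
var bulkOp = db.transactions.initializeUnorderedBulkOp();
var queued = 0;
db.transactions.find({ "tran.val": { $exists: true, $type: 2 } }).forEach(function (doc) { // $type 2 = string
  bulkOp.find({ _id: doc._id }).updateOne({
    $set: { "tran.val": parseFloat(doc.tran.val) } // parse the actual value client-side
  });
  queued++;
  if (queued % 1000 === 0) { // flush in batches to keep memory bounded
    bulkOp.execute();
    bulkOp = db.transactions.initializeUnorderedBulkOp();
  }
});
if (queued % 1000 !== 0) bulkOp.execute(); // flush the remainder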

Adding a new field to 100 million records in mongodb

What is the fastest and safest strategy for adding a new field to over 100 million mongodb documents?
Background
Using mongodb 3.0 in a 3 node replica set
We are adding a new field (post_hour) that is based on data in another field (post_time) in the current document. The post_hour field is a truncated version of post_time to the hour.
I faced a similar scenario in which I had to create a script to update around 25 million documents, and it was taking a lot of time to update them all. To improve performance, I inserted the updated documents one by one into a new collection and then renamed the new collection. This approach helped because I was inserting the documents rather than updating them (an 'insert' operation is faster than an 'update' operation).
Here is a sample script (I have not tested it):
/* This method returns post_hour derived from post_time (body omitted) */
function convertPostTimeToPostHour(postTime){
}
var totalCount = db.persons.count();
var chunkSize = 1000;
var chunkCount = Math.ceil(totalCount / chunkSize);
var offset = 0;
for (var index = 0; index < chunkCount; index++) {
var personList = db.persons.find().skip(offset).limit(chunkSize);
personList.forEach(function (person) {
var newPerson = person;
newPerson.post_hour = convertPostTimeToPostHour(person.post_time);
db.personsNew.insert(newPerson); // This will insert the record into a new collection
});
offset += chunkSize;
}
When the above script is executed, the new collection 'personsNew' will contain the updated records with the 'post_hour' field set.
If the existing collection has any indexes, you need to recreate them on the new collection.
Once the indexes are created, you can rename the collection 'persons' to 'personsOld' and 'personsNew' to 'persons'.
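A sketch of that final swap in the shell (the index shown is a hypothetical placeholder; recreate whatever indexes the original collection actually has):
// Recreate the original collection's indexes on the new collection first
db.personsNew.createIndex({ post_time: 1 }); // hypothetical example index
// Swap the collections
db.persons.renameCollection("personsOld");
db.personsNew.renameCollection("persons");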
The snapshot() will prevent duplicates in the query result (since we are growing the documents, they could otherwise move and be returned again); it can be removed if it causes any trouble.
Please find mongo shell script below where 'a1' is collection name:
var documentLimit = 1000;
var docCount = db.a1.find({
post_hour : {
$exists : false
}
}).count();
var chunks = docCount / documentLimit;
for (var i = 0; i <= chunks; i++) {
db.a1.find({
post_hour : {
$exists : false
}
}).snapshot()
.limit(documentLimit)
.forEach(function (doc) {
doc.post_hour = 12; // put your transformation here
// db.a1.save(doc); // uncomment this line to save data
// you can also specify write concern here
printjson(doc); // comment this line to avoid pollution of shell output
// this is just for test purposes
});
}
You can play with the parameters, but since bulk operations are executed in blocks of 1,000 records, that looks optimal.
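If the per-document save() in the loop above turns out to be too slow, a variant (a sketch only, not tested) is to queue the updates into an unordered bulk operation and execute it once per chunk; because updated documents no longer match the $exists: false filter, the snapshot() is not needed in this form:
var documentLimit = 1000;
var docCount = db.a1.find({ post_hour: { $exists: false } }).count();
var chunks = Math.ceil(docCount / documentLimit);
for (var i = 0; i < chunks; i++) {
  var bulk = db.a1.initializeUnorderedBulkOp();
  var queued = 0;
  db.a1.find({ post_hour: { $exists: false } })
    .limit(documentLimit)
    .forEach(function (doc) {
      bulk.find({ _id: doc._id }).updateOne({ $set: { post_hour: 12 } }); // put your transformation here
      queued++;
    });
  if (queued > 0) {
    bulk.execute(); // one round trip per chunk instead of one save() per document
  }
}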

Get count of documents without loading the whole collection in DerbyJS 0.6

How can I count the results of a query without loading the whole resultset into memory?
The easy way of counting documents returned by a query would be:
var q = model.query('mycollection', { date: today });
q.fetch(function() {
var length = q.get().length;
});
But this would load the whole resultset into memory and "count" an array in javascript. When you have lots of data you don't want to do this. I think.
Counting the underlying mongodb collection is rather complicated since LiveDB (I think it is LiveDB) creates many mongodb documents for one derbyjs document.
The internets point to this google groups thread from 2013, but the solution described there (putting $count: true into the query options) doesn't seem to work in DerbyJS 0.6 and current mongodb.
query.extraRef is undefined.
It is done as described in the google groups thread, but query.extraRef is now query.refExtra.
Example:
var q = model.query('mycollection', { $count: true, date: today });
q.refExtra('_page.docsOfToday');
q.fetch();

Mongo -Select parent document with maximum child documents count, Faster way?

I'm quite new to mongo and am trying to get the following query to work. It does work, but it's taking a bit too much time; I think I'm doing something wrong.
There are quite a number of documents in a collection parents, around 6,000. Each document has a certain number of children (childs is another collection with 40,000 documents in it). parents & childs are associated with each other by an attribute in the document called parent_id. Please see the following code; it takes approximately 1 minute to execute the queries. I don't think mongo should take that much time.
function getChildMaxDocCount(){
var maxLen = 0;
var bigSizeParent = null;
db.parents.find().forEach(function (parent){
var currentCount = db.childs.count({parent_id: parent._id});
if(currentCount > maxLen){
maxLen = currentCount;
bigSizeParent = parent._id;
}
});
printjson({"maxLen":maxLen, "bigSizeParent":bigSizeParent });
}
Is there any feasible/optimal way to achieve this?
If I got you right, you want the parent with the most children. This is easy to accomplish using the aggregation framework. If each child can only have one parent, the aggregation query would look like this:
db.childs.aggregate(
{ $group: { _id:"$parent_id", children:{$sum:1} } },
{ $sort: { "children":-1 } },
{ $limit : 1 }
);
Which should return a document like:
{ _id:"SomeParentId", children:15}
If a child can have more than one parent, how the query would look depends heavily on the data model.
Have a look at the aggregation framework documentation for details.
Edit: Some explanation
The aggregation pipeline takes every document it is told to process through a series of steps, in such a way that all documents are first processed by the first step and the resulting documents are then fed into the next step.
Step 1: Grouping
We group all documents into new documents (virtual ones, if you want) and tell mongod to increment the field children by one for each document which has the same parent_id. Since we are referring to a field of the current document, we need to add a $ sign.
Step 2: Sorting
Now that we have a bunch of documents which hold the parent_id and the number of children this parent has, we sort it by the children field in descending (-1) order.
Step 3: Limiting
Since we are only interested in the parent_id which has the most children, we only let mongod return the first document after sorting.
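Putting this together with the original function's output (maxLen and bigSizeParent), a sketch of how the result could be consumed in the shell:
var result = db.childs.aggregate([
  { $group: { _id: "$parent_id", children: { $sum: 1 } } },
  { $sort: { children: -1 } },
  { $limit: 1 }
]).toArray()[0];
if (result) {
  printjson({ maxLen: result.children, bigSizeParent: result._id });
  var bigSizeParentDoc = db.parents.findOne({ _id: result._id }); // fetch the full parent document if needed
}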