Our previous data model assumed that a certain field (let's call it field, for lack of imagination) could contain more than one value, so we modeled it as an array.
Initial model:
{
    field: ['val1']
}
We then realized (10 million docs later) that that wasn't the case and changed to:
{
    field: 'val1'
}
I thought it would be simple to migrate to the new model but apparently it isn't.
I tried:
db.collection.update({},{$rename: {"field.0": 'newField'}})
but it complains that an array element cannot be used as the source field of the $rename operator.
Since, as I understand it, an update operation cannot assign one field's value to another field, I investigated the aggregation framework, but I couldn't figure out a way to do it there either.
Can I redact a document with the aggregation framework and the $out operator to accomplish what I want?
I also tried forEach, but it is dead slow:
db.coll.find({ "field": { $exists: true } }).snapshot().forEach(function(doc) {
    doc.newField = doc.field[0];
    delete doc.field;
    db.coll.save(doc);
});
I parallelized it with a bash script and got to about 200 updates/s, which means 10,000,000/(200*60*60) ≈ 14h, quite a long wait, not counting the timeout errors that I handle in the bash script but that waste even more time.
So now I ask, is there any chance that bulk operations or aggregation framework would speed up the process?
I would go for the bulk operations. They are abstractions on top of the server's write commands that make it easy to build and streamline batched updates. You get performance gains over large collections because the Bulk API sends write operations to the server in bulk, and it also gives you real feedback about what succeeded and what failed. Instead of sending every update as its own request, you queue the operations and send them in batches of, say, 1000, so you only hit the server once every 1000 operations:
var bulk = db.collection.initializeUnorderedBulkOp(),
    counter = 0;

// $type 4 matches documents where the field is still an array
db.collection.find({ "field": { "$exists": true, "$type": 4 } }).forEach(function(doc) {
    var updatedVal = doc.field[0];
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": { "field": updatedVal }
    });
    counter++;
    if (counter % 1000 == 0) {
        bulk.execute(); // Execute per 1000 operations and re-initialize the queue
        bulk = db.collection.initializeUnorderedBulkOp();
    }
});

// Clean up remaining operations in the queue
if (counter % 1000 != 0) { bulk.execute(); }
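If you are on MongoDB 3.2 or newer, where the Bulk() API is deprecated, the same batching pattern can be written with bulkWrite(). A minimal sketch of the equivalent migration:

var ops = [];
db.collection.find({ "field": { "$exists": true, "$type": 4 } }).forEach(function(doc) {
    ops.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$set": { "field": doc.field[0] } }
        }
    });
    if (ops.length === 1000) {
        db.collection.bulkWrite(ops); // send the current batch of 1000 updates
        ops = [];
    }
});
if (ops.length > 0) db.collection.bulkWrite(ops);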
I have a collection with ~800,000 documents, and I am trying to fetch all of them, 5,000 at a time.
When running the following code:
const CHUNK_SIZE = 5000;
let skip = 0;
let matches;
do {
    matches = await dbClient
        .collection(collectionName)
        .find({})
        .skip(skip)
        .limit(CHUNK_SIZE)
        .toArray();
    // ... some processing
    skip += CHUNK_SIZE;
} while (matches.length);
After about 30 iterations, I start getting documents I already received in a previous iteration.
What am I missing here?
As posted in the comments, you'll have to apply a .sort() on the query.
To do so without adding too much performance overhead, it is easiest to sort on _id, e.g.
.sort({ "_id": 1 })
Without an explicit sort, neither MongoDB nor the Amazon DocumentDB flavor guarantees any result ordering (see the Amazon DocumentDB and MongoDB result-ordering documentation).
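Putting that together with your loop, a minimal sketch of the corrected pagination (same dbClient, collectionName and CHUNK_SIZE as in your snippet) would be:

const CHUNK_SIZE = 5000;
let skip = 0;
let matches;
do {
    matches = await dbClient
        .collection(collectionName)
        .find({})
        .sort({ "_id": 1 }) // stable order so consecutive skip/limit pages never overlap
        .skip(skip)
        .limit(CHUNK_SIZE)
        .toArray();
    // ... some processing
    skip += CHUNK_SIZE;
} while (matches.length);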
I want to copy the price information in my documents into a prices[] array.
var entitiesCol = db.getCollection('entities');
entitiesCol.find({ "type": "item" }).forEach(function(item) {
    item.prices = [{
        "value": item.price
    }];
    entitiesCol.save(item);
});
It takes too long and some fields are not updated.
I am using Mongoose on the server side, so I could also use it.
What can I do about this?
In the mongo shell, you can use the bulkWrite() method to carry out the updates in a fast and efficient manner. Consider the following example:
var entitiesCol = db.getCollection('entities'),
    counter = 0,
    ops = [];

entitiesCol.find({
    "type": "item",
    "prices.0": { "$exists": false }
}).snapshot().forEach(function(item) {
    ops.push({
        "updateOne": {
            "filter": { "_id": item._id },
            "update": {
                "$push": {
                    "prices": { "value": item.price }
                }
            }
        }
    });
    counter++;
    if (counter % 500 === 0) {
        entitiesCol.bulkWrite(ops);
        ops = [];
    }
});

if (counter % 500 !== 0)
    entitiesCol.bulkWrite(ops);
The counter variable is there to manage the bulk updates efficiently when the collection is large. It batches the update operations and sends the writes to the server in groups of 500, which performs better because you are not making a round trip for every single update, only once every 500 operations.
For bulk operations MongoDB imposes a default internal limit of 1000 operations per batch, so choosing 500 keeps the batch size under your own control rather than letting MongoDB split larger batches (> 1000 operations) at the default.
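Since you mentioned Mongoose on the server side: Model.bulkWrite() works the same way there. The following is only a rough sketch, assuming a hypothetical Entity model mapped to the entities collection (adjust to your schema):

const ops = [];
await Entity.find({ "type": "item", "prices.0": { "$exists": false } })
    .lean()
    .cursor()
    .eachAsync(async (item) => {
        ops.push({
            updateOne: {
                filter: { _id: item._id },
                update: { $push: { prices: { value: item.price } } }
            }
        });
        if (ops.length === 500) {
            await Entity.bulkWrite(ops.splice(0)); // send this batch of 500 and reset the queue
        }
    });
if (ops.length > 0) await Entity.bulkWrite(ops);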
Given these three documents:
db.test.save({"_id":1, "foo":"bar1", "xKey": "xVal1"});
db.test.save({"_id":2, "foo":"bar2", "xKey": "xVal2"});
db.test.save({"_id":3, "foo":"bar3", "xKey": "xVal3"});
And a separate array of information that references those documents:
[{"_id":1, "foo":"bar1Upd"},{"_id":2, "foo":"bar2Upd"}]
Is it possible to update "foo" on the two referenced documents (1 and 2) in a single operation?
I know I can loop through the array and do them one by one but I have thousands of documents, which means too many round trips to the server.
Many thanks for your thoughts.
It's not possible to update "foo" on the two referenced documents (1 and 2) in a single atomic operation, as MongoDB has no such mechanism. However, seeing that you have a large collection, one option is to take advantage of the Bulk API, which lets you send your updates in batches instead of sending every update request to the server individually.
The process involves looping over the documents in the array and queuing Bulk updates, which at least allows many operations to be sent in a single request and answered with a single response.
This gives you much better performance since you won't be making a round trip to the server for every update, only once every 500 operations, making the updates more efficient and quicker.
-EDIT-
Choosing a lower value than the maximum is a deliberate, controlled choice. As noted in the documentation, MongoDB will by default send batches of at most 1000 operations at a time, and there is no guarantee that those default 1000-operation requests actually fit under the 16MB BSON limit. So you stay on the "safe" side and impose a lower batch size that you can manage effectively, so that the total stays under the data limit when it is sent to the server.
Let's use an example to demonstrate the approaches above:
a) If using MongoDB v3.0 or below:
var bulk = db.test.initializeOrderedBulkOp(),
    largeArray = [{ "_id": 1, "foo": "bar1Upd" }, { "_id": 2, "foo": "bar2Upd" }],
    counter = 0;

largeArray.forEach(function(doc) {
    bulk.find({ "_id": doc._id }).updateOne({ "$set": { "foo": doc.foo } });
    counter++;
    if (counter % 500 == 0) {
        bulk.execute();
        bulk = db.test.initializeOrderedBulkOp();
    }
});

if (counter % 500 != 0) bulk.execute();
b) If using MongoDB v3.2.x or above (version 3.2 has since deprecated the Bulk() API and provided a newer set of APIs based on bulkWrite()):
var largeArray = [{ "_id": 1, "foo": "bar1Upd" }, { "_id": 2, "foo": "bar2Upd" }],
    bulkUpdateOps = [];

largeArray.forEach(function(doc) {
    bulkUpdateOps.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$set": { "foo": doc.foo } }
        }
    });
    if (bulkUpdateOps.length === 500) {
        db.test.bulkWrite(bulkUpdateOps);
        bulkUpdateOps = [];
    }
});

if (bulkUpdateOps.length > 0) db.test.bulkWrite(bulkUpdateOps);
In the mongo shell, it is possible to insert an array of documents with one call. In a Meteor project, I have tried using
MyCollection = new Mongo.Collection("my_collection")
documentArray = [{"one": 1}, {"two": 2}]
MyCollection.insert(documentArray)
However, when I check my_collection from the mongo shell, it shows that only one document has been inserted, and that document contains the entire array as if it had been a map:
db.my_collection.find({})
{ "_id" : "KPsbjZt5ALZam4MTd", "0" : { "one" : 1 }, "1" : { "two" : 2} }
Is there a Meteor call that I can use to add a series of documents all at once, or must I use a technique such as the one described here?
I imagine that inserting multiple documents in a single call would optimize performance on the client side, where the new documents would become available all at once.
You could use the Bulk API to do the insert on the server side. Iterate over the array with forEach() and, within the loop, queue each document as a bulk insert operation; the Bulk API is an abstraction on top of the server's write commands that makes it easy to build batched operations.
Note that for MongoDB servers older than 2.6 the API will down-convert the operations. However, it's not possible to down-convert 100%, so there may be edge cases where it cannot correctly report the right numbers.
You can get raw access to the collection and database objects in the npm MongoDB driver through rawCollection and rawDatabase methods on Mongo.Collection
MyCollection = new Mongo.Collection("my_collection");

if (Meteor.isServer) {
    Meteor.startup(function () {
        Meteor.methods({
            insertData: function() {
                var bulkOp = MyCollection.rawCollection().initializeUnorderedBulkOp(),
                    counter = 0,
                    documentArray = [{ "one": 1 }, { "two": 2 }];

                documentArray.forEach(function(data) {
                    bulkOp.insert(data);
                    counter++;
                    // Send to the server in batches of 1000 insert operations
                    if (counter % 1000 == 0) {
                        // Execute per 1000 operations and re-initialize the queue
                        bulkOp.execute(function(e, result) {
                            // do something with the result
                        });
                        bulkOp = MyCollection.rawCollection().initializeUnorderedBulkOp();
                    }
                });

                // Clean up remaining operations in the queue
                if (counter % 1000 != 0) {
                    bulkOp.execute(function(e, result) {
                        // do something with the result
                    });
                }
            }
        });
    });
}
I'm currently using the mikowals:batch-insert package.
Your code would work then with one small change:
MyCollection = new Mongo.Collection("my_collection");
documentArray = [{"one": 1}, {"two": 2}];
MyCollection.batchInsert(documentArray);
The one drawback I've noticed is that it doesn't honor simple-schema.
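If that matters for your data, one workaround is to validate the documents against your SimpleSchema yourself before calling batchInsert(). This is only a rough sketch, assuming a hypothetical MySchema for the collection:

// Hypothetical schema; in practice use the SimpleSchema you already attach to MyCollection.
const MySchema = new SimpleSchema({
    one: { type: Number, optional: true },
    two: { type: Number, optional: true }
});

const documentArray = [{ "one": 1 }, { "two": 2 }];

// Keep only the documents that pass validation, then insert them in one call.
const validDocs = documentArray.filter(function(doc) {
    const context = MySchema.newContext();
    context.validate(doc);
    return context.isValid();
});

MyCollection.batchInsert(validDocs);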
I have a huge collection of documents (more than two million) and I found myself querying a very small subset, using something like
scs = db.balance_sheets.find({"9087n":{$gte:40}, "20/58n":{ $lte:40000000}})
which returns fewer than 5k results. The question is, can I create a new collection with the results of this query?
I tried insert:
db.scs.insert(db.balance_sheets.find({"9087n":{$gte:40}, "20/58n":{ $lte:40000000}}).toArray())
But it gives me errors: Socket say send() errno:32 Broken pipe 127.0.0.1:27017
I tried aggregate:
db.balance_sheets.aggregate([{ "9087n":{$gte:40}, "20/58n":{ $lte:40000000}} ,{$out:"pme"}])
And I get "exception: A pipeline stage specification object must contain exactly one field."
Any hints?
Thanks
The first option would be:
var cursor = db.balance_sheets.find({ "9087n": { "$gte": 40 }, "20/58n": { "$lte": 40000000 } });
while (cursor.hasNext()) {
    var doc = cursor.next();
    db.pme.save(doc);
}
As for the aggregation, try
db.balance_sheets.aggregate([
{
"$match": { "9087n": { "$gte": 40 }, "20/58n": { "$lte": 40000000 } }
},
{ "$out": "pme" }
]);
For improved performance, especially with large collections, take advantage of the Bulk API: you send the operations to the server in batches of, say, 500, which performs better because you are not making a round trip for every document, only once every 500 operations.
The following demonstrates this approach. The first example uses the Bulk API, available in MongoDB versions >= 2.6 and < 3.2, to insert all the documents matching the query from the balance_sheets collection into the pme collection:
var bulk = db.pme.initializeUnorderedBulkOp(),
    counter = 0;

db.balance_sheets.find({
    "9087n": { "$gte": 40 },
    "20/58n": { "$lte": 40000000 }
}).forEach(function (doc) {
    bulk.insert(doc);
    counter++;
    if (counter % 500 == 0) {
        bulk.execute(); // Execute per 500 operations
        // and re-initialize the queue every 500 insert statements
        bulk = db.pme.initializeUnorderedBulkOp();
    }
});

// Clean up remaining operations in queue
if (counter % 500 != 0) { bulk.execute(); }
The next example applies to MongoDB version 3.2, which has since deprecated the Bulk API and provided a newer set of APIs based on bulkWrite():
var bulkOps = db.balance_sheets.find({
"9087n": { "$gte": 40 },
"20/58n": { "$lte": 40000000 }
}).map(function (doc) {
return { "insertOne" : { "document": doc } };
});
db.pme.bulkWrite(bulkOps);
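Note that this version builds the whole operations array in memory and sends it in a single bulkWrite() call, which is fine for a few thousand matching documents as in your case. If the result set were much larger, you could fall back to the same batching pattern as above; a rough sketch:

var ops = [];
db.balance_sheets.find({
    "9087n": { "$gte": 40 },
    "20/58n": { "$lte": 40000000 }
}).forEach(function (doc) {
    ops.push({ "insertOne": { "document": doc } });
    if (ops.length === 500) {
        db.pme.bulkWrite(ops); // send the current batch of 500 inserts
        ops = [];
    }
});
if (ops.length > 0) db.pme.bulkWrite(ops);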