Better way to move MongoDB Collection to another Collection - mongodb

In my web scraping project I need to move the previous day's scraped data from mongo_collection to mongo_his_collection.
I am using this query to move the data:
for record in collection.find():
    his_collection.insert(record)
collection.remove()
It works fine, but it sometimes breaks when the MongoDB collection contains more than 10k documents.
Can you suggest a more optimized approach that uses fewer resources to do the same task?

You could use a MapReduce job for this.
MapReduce allows you to specify an out-collection to store the results in.
If your map function emits each document with its own _id as the key, and your reduce function returns the first (and, since _ids are unique, only) entry of the values array, the MapReduce is essentially a copy operation from the source collection to the out-collection.
Untested code:
db.runCommand({
    mapReduce: "mongo_collection",
    map: function(document) {
        emit(document._id, document);
    },
    reduce: function(key, values) {
        return values[0];
    },
    out: {
        merge: "mongo_his_collection"
    }
})

If both your collections are in the same database, I believe you're looking for renameCollection.
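For example (untested, and assuming mongo_his_collection can simply be replaced rather than appended to; the second argument drops the existing target first):
db.mongo_collection.renameCollection("mongo_his_collection", true)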
If not, you unfortunately have to do it manually, using a targeted mongodump / mongorestore command:
mongodump -d your_database -c mongo_collection
mongorestore -d your_database -c mongo_his_collection dump/your_database/mongo_collection.bson
Note that I just typed these two commands off the top of my head without actually testing them, so do make sure you check them before running them in production.
[EDIT]: sorry, I just realised that this was something you needed to do on a regular basis. In that case, mongodump / mongorestore probably isn't the best solution.
I don't see anything wrong with your solution - it would help if you edited your question to explain what you mean by "it breaks".

The query breaks because you are not limiting the find(). When you create a cursor, the server (mongod) will try to load the entire result set in memory. This will cause problems and/or fail if your collection is too large.
To avoid this, use a skip/limit loop. Here is an example in Java:
MongoClient client = new MongoClient();
DB db = client.getDB("your_DB_name");
int count = 0;
while (true) {
    DBCursor cursor = db.getCollection("mongo_collection").find()
            .sort(new BasicDBObject("$natural", 1)).skip(count).limit(100);
    if (!cursor.hasNext()) {
        break;  // no more documents to copy
    }
    while (cursor.hasNext()) {
        db.getCollection("mongo_his_collection").insert(cursor.next());
        count++;
    }
}
This will work, but you would get better performance by batching the writes as well. To do that, build an array of DBObjects from the cursor and write them all at once with one insert.
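The same batching idea, sketched in mongo shell terms rather than Java (collection names are taken from the question; the batch size of 100 is arbitrary):
var batch = [];
db.mongo_collection.find().sort({$natural: 1}).forEach(function(doc) {
    batch.push(doc);
    if (batch.length === 100) {
        db.mongo_his_collection.insert(batch);   // one insert call per 100 documents
        batch = [];
    }
});
if (batch.length > 0) {
    db.mongo_his_collection.insert(batch);       // flush the remainder
}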
Also, if the collection is being altered while you are copying, there is no guarantee that you will traverse all documents, as some may end up getting moved if they increase in size.

You can try mongodump & mongorestore.

You can use renameCollection to do it directly. Or, if the collections are on different mongods, use cloneCollection.
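For example, an untested cloneCollection invocation, run on the destination server, might look like this (the host name is a placeholder):
db.runCommand({
    cloneCollection: "your_database.mongo_collection",   // namespace on the source server
    from: "source-host:27017"                            // host running the source mongod
})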
References:
MongoDB docs renameCollection: http://docs.mongodb.org/manual/reference/command/renameCollection/#dbcmd.renameCollection
MongoDB docs cloneCollection: http://docs.mongodb.org/manual/reference/command/cloneCollection/
Relevant blog post: http://blog.shlomoid.com/2011/08/how-to-move-mongodb-collection-between.html

Related

How to avoid mongo returning duplicated documents when iterating a cursor over a constantly updated big collection?

Context
I have a big collection with millions of documents which is constantly updated by the production workload. When performing a query, I have noticed that a document can be returned multiple times; my workload tries to migrate the documents to a SQL system that enforces unique row ids, hence it crashes.
Problem
Because the collection is so big and lots of users are updating it after the query is started, iterating over the cursor's results may give me documents with the same id (an old and an updated version).
What I've tried
const cursor = db.collection.find(query, {snapshot: true});
while (cursor.hasNext()) {
    const doc = cursor.next();
    // do some stuff
}
Based on old documentation for the mongo driver (I'm using Node.js, but this is applicable to any official MongoDB driver), there is an option called snapshot which is said to avoid what is happening to me. Sadly, the driver returns an error indicating that this option does not exist (it was deprecated).
Question
Is there a way to iterate through the documents of a collection in a safe fashion, so that I don't get the same document twice?
I only see a viable option with the aggregation pipeline, but I want to explore other options with standard queries.
Finally I got the answer from a mongo changelog page:
MongoDB 3.6.1 deprecates the snapshot query option.
For MMAPv1, use hint() on the { _id: 1} index instead to prevent a cursor from returning a document more than once if an intervening write operation results in a move of the document.
For other storage engines, use hint() with { $natural : 1 } instead.
So, from my code example:
const cursor = db.collection.find(query).hint({$natural: 1});
while (cursor.hasNext()) {
    const doc = cursor.next();
    // do some stuff
}
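For MMAPv1, the changelog quote above suggests hinting the _id index instead; only the hint changes:
const cursor = db.collection.find(query).hint({_id: 1});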

Fast query and deletion of documents of a large collection in MongoDB

I have a collection (let's say CollOne) with several million documents. They share a common field "id":
{...,"id":1}
{...,"id":2}
I need to delete some documents in CollOne by id. Those ids are stored in a document in another collection (CollTwo). This ids_to_delete document has the following structure:
{"action_type":"toDelete","ids":[4,8,9,....]}
As CollOne is quite large, finding and deleting one document will take quite a long time. Is there any way to speed up the process?
You can't really avoid a deletion operation in the database if you want to delete anything. If you're having performance issues, I would recommend making sure you have an index built on the id field; otherwise Mongo will use a COLLSCAN to satisfy the query, which means it will iterate over the entire collOne collection, which is probably where you feel the pain.
Once you make sure an index is built, there is no "more" efficient way than using deleteMany.
db.collOne.deleteMany({id: {$in: [4, 8, 9, ....]}})
In case you don't have an index and wonder how to build one, you should use createIndex like so:
(Prior to version 4.2, building an index locks the entire database; at large scale this could take several hours if not more. To avoid this, use the background option, shown after the basic command below.)
db.collOne.createIndex({id: 1})
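A minimal sketch of the background option mentioned above (for versions before 4.2):
db.collOne.createIndex({id: 1}, {background: true})   // builds the index without holding the exclusive lock for the whole build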
---- EDIT ----
In the mongo shell:
The mongo shell is JavaScript based, so you just have to execute the same logic with JS syntax; here's how I would do it:
let toDelete = db.collTwo.findOne({ ... })
db.collOne.deleteMany({id: {$in: toDelete.ids}})

MongoDB - how to get fields fill-rates as quickly as possible?

We have a very big MongoDB collection of documents with some pre-defined fields that can either have a value or not.
We need to gather the fill-rates of those fields. We wrote a script that goes over all documents and counts the fill-rates for each field; the problem is that it takes a long time to process all documents.
Is there a way to use db.collection.aggregate or db.collection.mapReduce to run such a script server-side?
Should it have significant performance improvements?
Will it slow down other usages of that collection (e.g. holding a major lock)?
Answering my own question: I was able to migrate my script from a cursor that scans the whole collection to a map-reduce query, and running on a sample of the collection, the map-reduce version seems at least twice as fast.
Here's how the old script worked (in node.js):
var cursor = collection.find(query, projection).sort({_id: 1}).limit(limit);
var next = function() {
    cursor.nextObject(function(err, doc) {
        processDoc(doc, next);
    });
};
next();
and this is the new script:
collection.mapReduce(
    function () {
        var processDoc = function(doc) {
            ...
        };
        processDoc(this);
    },
    function (key, values) {
        return Array.sum(values)
    },
    {
        query: query,
        out: {inline: 1}
    },
    function (error, results) {
        // print results
    }
);
processDoc stayed basically the same, but instead of incrementing a counter on a global stats object, I do:
emit(field_name, 1);
Running the old and new scripts on a sample of 100k documents, the old one took 20 seconds and the new one took 8.
Some notes:
map-reduce's limit option doesn't work on sharded collections; I had to query for _id : { $gte, $lte } to create the sample size needed.
map-reduce's performance-boost option jsMode: true doesn't work on sharded collections either (it might have improved performance even more); it might work to run it manually on each shard to gain that feature.
As I understand it, what you want to achieve is to compute something over your documents so that you end up with a new "document" that can be queried. You don't need to store the computed "new values".
If you don't need to write your "new values" back into those documents, you can use the Aggregation Framework.
Aggregations operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
https://docs.mongodb.com/manual/aggregation/
Since the Aggregation Framework has a lot of features, I can't give you more information about how to resolve your issue.
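For instance, a rough, untested sketch of a fill-rate count for two hypothetical fields (fieldA and fieldB are placeholders for your pre-defined field names; query stands for the same filter the original script used):
db.collection.aggregate([
    { $match: query },
    { $group: {
        _id: null,
        total:  { $sum: 1 },
        fieldA: { $sum: { $cond: [ { $gt: ["$fieldA", null] }, 1, 0 ] } },   // counts documents where fieldA exists and is not null
        fieldB: { $sum: { $cond: [ { $gt: ["$fieldB", null] }, 1, 0 ] } }
    } }
])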

mongodb: how can I see the execution time for the aggregate command?

I execute the following MongoDB command in the mongo shell:
db.coll.aggregate(...)
and I see the list of results, but is it possible to see the query execution time? Is there any equivalent of the explain method for aggregation queries?
var before = new Date()
// run the aggregation query here
var after = new Date()
execution_mills = after - before
You can add a time function to your .mongorc.js file (in your home directory):
function time(command) {
    const t1 = new Date();
    const result = command();
    const t2 = new Date();
    print("time: " + (t2 - t1) + "ms");
    return result;
}
and then you can use it like so:
time(() => db.coll.aggregate(...))
Caution
This method doesn't give relevant results for db.collection.find()
I see that in MongoDB it is possible to use these two commands:
db.setProfilingLevel(2)
and then, after the query, you can use db.system.profile.find() to see the query execution time and other details.
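A rough sketch of that approach (untested; "mydb.coll" is a placeholder for your database and collection namespace):
db.setProfilingLevel(2)                                   // record all operations in db.system.profile
db.coll.aggregate([{ $match: {} }])                       // the aggregation you want to measure
db.system.profile.find({ ns: "mydb.coll" }).sort({ ts: -1 }).limit(1)
// the "millis" field of the returned profile document is the server-side execution time
db.setProfilingLevel(0)                                   // switch profiling back off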
Or you can install the excellent mongo-hacker, which automatically times every query, pretty()fies it, colorizes the output, sorts the keys, and more.
I will write an answer to explain this better.
Basically there is no explain() functionality for the aggregation framework yet: https://jira.mongodb.org/browse/SERVER-4504
However there is a way to measure client side but not without its downsides:
You are not measuring the database
You are measuring the application
There are too many unknowns about the in-between parts to get an accurate reading; i.e. you can't say that it took 0.04ms for the result to be formulated by the MongoDB server, serialised, sent over the wire, de-serialised by the app and then stored into a hash, allowing you to subtract that sum from the total to get an aggregation benchmark.
That being said, you might be able to get a reasonably accurate result by doing it in the MongoDB console on the same server as the mongos / mongod. This leaves very few in-between parts; still too many, but perhaps few enough to get a reading you could roughly trust. As such, you could use @Zagorulkin's answer in that position.

Is there a better way to export a mongodb query to a new collection?

What I want:
I have a master collection of products; I then want to filter them and put them into a separate collection.
db.masterproducts.find({category:"scuba gear"}).copyTo(db.newcollection)
Of course, I realise the 'copyTo' does not exist.
I thought I could do it with MapReduce, as results are created in a new collection using the new 'out' parameter in v1.8; however, this new collection is not a subset of my original collection. Or can it be, if I use MapReduce correctly?
To get around it I am currently doing this:
Step 1:
/usr/local/mongodb/bin/mongodump --db database --collection masterproducts -q '{category:"scuba gear"}'
Step 2:
/usr/local/mongodb/bin/mongorestore -d database -c newcollection --drop dump/database/masterproducts.bson
My 2 step method just seems rather inefficient!
Any help greatly appreciated.
Thanks
Bob
You can iterate through your query result and save each item like this:
db.oldCollection.find(query).forEach(function(x){db.newCollection.save(x);})
You can create a small server-side JavaScript function (like this one; just add the filtering you want) and execute it using eval
You can use dump/restore in the way you described above
A copy-collection command should be in MongoDB soon (it will be done in vote order)! See the JIRA feature.
You should be able to create a subset with MapReduce (using 'out'). The problem is that MapReduce has a special output format, so your documents are going to be transformed (there is a JIRA ticket to add support for another format, but I cannot find it at the moment). It is also going to be very inefficient :/ (see the untested sketch after this answer).
Copying a cursor to a collection makes a lot of sense; I suggest creating a ticket for this.
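An untested sketch of the MapReduce subset approach mentioned above, using the collection and filter from the question; keep in mind that every output document comes out wrapped in the map-reduce format { _id: ..., value: <original document> }:
db.masterproducts.mapReduce(
    function() { emit(this._id, this); },           // map: use each document's own _id as the key
    function(key, values) { return values[0]; },    // reduce: _ids are unique, so just pass the single value through
    {
        query: { category: "scuba gear" },          // copy only the matching subset
        out: "newcollection"
    }
)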
There is also the toArray() method, which can be used:
// create the new collection
db.createCollection("resultCollection")
// now query for type = "foo" and insert the results into the new collection
db.resultCollection.insert(db.originalCollection.find({type: 'foo'}).toArray())