How to compare all documents in two collections with millions of documents each and write the diff to a third collection in MongoDB

I have two collections (coll1, coll2) with a million documents each.
These two collections were created by running two versions of the same code against the same data source, so both contain the same number of documents. A document in one collection may have an extra or missing field or sub-document, or different values, but the documents in both collections share the same primary_key_id, which is indexed.
I have this JavaScript function saved on the server to compute the diff:
db.system.js.save({
    _id: "diffJSON",
    value: function diffJSON(obj1, obj2) {
        var result = {};
        for (var key in obj1) {
            // shallow comparison; != is also true for any two distinct array/object references
            if (obj2[key] != obj1[key]) result[key] = obj2[key];
            // recurse into nested objects and arrays (typeof reports 'object' for both;
            // a check for typeof == 'array' would be dead code, as typeof never returns 'array')
            if (typeof obj2[key] == 'object' && typeof obj1[key] == 'object')
                result[key] = diffJSON(obj1[key], obj2[key]);
        }
        return result;
    }
});
It runs fine when called like this:
diffJSON(testObj1, testObj2);
Question: how do I run diffJSON on coll1 and coll2, and output the diffJSON result into coll3 along with the primary_key_id?
I am new to MongoDB, and I understand that joins don't work the way they do in an RDBMS, so I wonder whether I have to copy the two documents being compared into a single collection and then run the diffJSON function.
Also, most of the time (say 90%) the documents in the two collections will be identical; I only need to know about the ~10% of documents that have any diff.
Here is a simple example document
(the real documents are around 15 KB each, just so you know the scale):
var testObj1 = { test:"1",test1: "2", tt:["td","ax"], tr:["Positive"] ,tft:{test:["a"]}};
var testObj2 = { test:"1",test1: "2", tt:["td","ax"], tr:["Negative"] };
If you know a better way to diff the documents, please feel free to suggest.

You can use a simple shell script to achieve this. First, create a file named script.js and paste this code into it:
// load previously saved diffJSON() function
db.loadServerScripts();
// get all the documents from collection coll1
var cursor = db.coll1.find();
if (cursor != null && cursor.hasNext()) {
    // iterate over the cursor
    while (cursor.hasNext()) {
        var doc1 = cursor.next();
        // get the doc with the same _id from coll2
        var id = doc1._id;
        var doc2 = db.coll2.findOne({_id: id});
        // compute the diff
        var diff = diffJSON(doc2, doc1);
        // if there is a difference between the two objects
        if (Object.keys(diff).length > 0) {
            diff._id = id;
            // insert the diff into coll3 with the same _id
            db.coll3.insert(diff);
        }
    }
}
In this script I assume that your primary key is the _id field.
Then execute it from your shell like this:
mongo --host hostName --port portNumber databaseName < script.js
where databaseName is the name of the database containing the collections coll1 and coll2.
For these sample documents (I just added an _id field to your docs):
var testObj1 = { _id: 1, test:"1",test1: "2", tt:["td","ax"], tr:["Positive"] ,tft:{test:["a"]}};
var testObj2 = { _id: 1, test:"1",test1: "2", tt:["td","ax"], tr:["Negative"] };
the script will save the following doc in coll3:
{ "_id" : 1, "tt" : { }, "tr" : { "0" : "Positive" } }
(Note the empty tt object: identical nested arrays and objects still enter the recursive branch, which returns an empty diff for them.)

This solution builds on the one proposed by felix (I don't have the reputation needed to comment on it). I made a few small changes to his script that bring important performance improvements:
// load previously saved diffJSON() function
db.loadServerScripts();
// get all the documents from collections coll1 and coll2
var cursor1 = db.coll1.find().sort({'_id': 1});
var cursor2 = db.coll2.find().sort({'_id': 1});
if (cursor1 != null && cursor1.hasNext() && cursor2 != null && cursor2.hasNext()) {
    // iterate over both cursors in lockstep
    while (cursor1.hasNext() && cursor2.hasNext()) {
        var doc1 = cursor1.next();
        var doc2 = cursor2.next();
        var pk = doc1._id;
        // compute the diff
        var diff = diffJSON(doc2, doc1);
        // if there is a difference between the two objects
        if (Object.keys(diff).length > 0) {
            diff._id = pk;
            // insert the diff into coll3 with the same _id
            db.coll3.insert(diff);
        }
    }
}
Two cursors fetch all the entries, sorted by the primary key. This is the crucial point and is where most of the performance improvement comes from: by retrieving the documents sorted by primary key, we make sure we pair them correctly. This relies on the fact that the two collections hold the same set of keys.
This way we avoid a findOne call to coll2 for each document in coll1. It might seem insignificant, but we are talking about a million round trips, which would put a lot of stress on the database.
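Since correctness hinges on the two cursors staying in lockstep, a cheap safety net is to assert that each pair of documents really shares the same key before diffing. A minimal sketch of such a guard (not part of the original script), to be placed right after the two next() calls:
// guard against the cursors drifting apart; tojson() makes ObjectIds comparable
if (tojson(doc1._id) !== tojson(doc2._id)) {
    throw "primary key mismatch: " + tojson(doc1._id) + " vs " + tojson(doc2._id);
}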
Another important assumption is that the primary key field is _id. If that's not the case, it is crucial to have a unique index on the primary key field; otherwise the script might mismatch documents that share a primary key.
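If the primary key is the question's primary_key_id rather than _id, the unique indexes can be created like this (the sort in the script would then need to use primary_key_id too):
db.coll1.createIndex({ primary_key_id: 1 }, { unique: true });
db.coll2.createIndex({ primary_key_id: 1 }, { unique: true });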

Related

Mongo: Can I upsert (insert, or update if the data already exists) a list of data on the basis of a unique key

I have a list of data:
{ bookName: book1,
  bookId: bookId1,
  bookType: type1,
  publisher: publisher1,
  uniqueCombo: "bookId" + "publisher" }
My unique identifier in this case is uniqueCombo. I want to upsert these data into my Mongo collection on the basis of uniqueCombo: if the uniqueCombo of the incoming data is not present in the collection, insert it; otherwise update the documents where the uniqueCombo is found.
I could loop here and upsert the data one by one, but I don't want to use a loop. I could also put a unique index on the uniqueCombo field, which would do the job, but I need to know if there is any other way to achieve this.
I am using MongoDB shell version v4.2.3.
Unless uniqueCombo is renamed to _id, you will have to use a loop. However, you can use a bulk operator to build it into a single operation.
var bulk = db.users.initializeUnorderedBulkOp(); // 'users' as in the original answer; substitute your collection
// for each book, queue one upsert keyed on uniqueCombo
for (var i = 0; i < books.length; i++) {
    var book = books[i];
    bulk.find({
        uniqueCombo: book.bookId + book.publisher
    }).upsert().updateOne({
        bookName: book.bookName,
        bookId: book.bookId,
        bookType: book.bookType,
        publisher: book.publisher,
        uniqueCombo: book.bookId + book.publisher
    });
}
bulk.execute();
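Since you are on a 4.2 shell, the same thing can also be expressed with db.collection.bulkWrite(), which takes the whole list of operations at once; the driver still batches them for you. A sketch, assuming the same books array and collection as above:
var ops = books.map(function(book) {
    var combo = book.bookId + book.publisher;
    return {
        replaceOne: {
            filter: { uniqueCombo: combo },
            replacement: {
                bookName: book.bookName,
                bookId: book.bookId,
                bookType: book.bookType,
                publisher: book.publisher,
                uniqueCombo: combo
            },
            upsert: true
        }
    };
});
db.users.bulkWrite(ops);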

Adding a new field to 100 million records in mongodb

What is the fastest and safest strategy for adding a new field to over 100 million mongodb documents?
Background
Using MongoDB 3.0 in a 3-node replica set
We are adding a new field (post_hour) that is based on data in another field (post_time) in the current document. The post_hour field is a truncated version of post_time to the hour.
I faced a similar scenario in which I had to create a script to update around 25 million documents, and updating them all took a lot of time. To improve performance, I inserted each updated document into a new collection one by one and then renamed the new collection. This approach helped because I was inserting the documents rather than updating them (an insert is faster than an update).
Here is a sample script (I have not tested it):
/* This method returns postHour; its body was left empty in the original answer */
function convertPostTimeToPostHour(postTime) {
}
var totalCount = db.persons.count();
var chunkSize = 1000;
var chunkCount = totalCount / chunkSize;
var offset = 0;
for (var index = 0; index < chunkCount; index++) {
    // sort added so that skip/limit paging is deterministic
    var personList = db.persons.find().sort({_id: 1}).skip(offset).limit(chunkSize);
    personList.forEach(function (person) {
        var newPerson = person;
        newPerson.post_hour = convertPostTimeToPostHour(person.post_time);
        db.personsNew.insert(newPerson); // insert the record into the new collection
    });
    offset += chunkSize;
}
When the script above has executed, the new collection personsNew will contain the updated records with the post_hour field set.
If the existing collection has any indexes, you need to recreate them on the new collection.
Once the indexes are created, you can rename the collection persons to personsOld and personsNew to persons.
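For reference, the renames themselves can be done from the shell like this (names as used above; recreate the indexes on personsNew before renaming):
db.persons.renameCollection("personsOld");
db.personsNew.renameCollection("persons");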
The snapshot() prevents the same document from being returned twice in the query results as documents grow and are moved on disk; it can be removed if it causes any trouble.
Please find the mongo shell script below, where a1 is the collection name:
var documentLimit = 1000;
var docCount = db.a1.find({
    post_hour: { $exists: false }
}).count();
var chunks = docCount / documentLimit;
for (var i = 0; i <= chunks; i++) {
    db.a1.find({
        post_hour: { $exists: false }
    }).snapshot()
      .limit(documentLimit)
      .forEach(function (doc) {
          doc.post_hour = 12; // put your transformation here
          // db.a1.save(doc); // uncomment this line to save data
          // you can also specify write concern here
          printjson(doc); // comment this line out to avoid polluting shell output;
                          // it is just for test purposes
      });
}
You can play with the parameters, but as bulk writes are executed in blocks of 1000 records, that looks optimal.
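If you prefer to batch the writes explicitly instead of calling save() per document, the same loop can be expressed with a bulk operator. This is only a sketch, reusing the a1 collection and the placeholder transformation from above:
var bulk = db.a1.initializeUnorderedBulkOp();
db.a1.find({ post_hour: { $exists: false } }).snapshot().forEach(function(doc) {
    // queue one $set per document; the shell flushes these in blocks of 1000
    bulk.find({ _id: doc._id }).updateOne({ $set: { post_hour: 12 /* your transformation */ } });
});
bulk.execute();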

Get the count of Fields in each document through query using MongoDB java driver

Is it possible to get the number of fields in a document using a query from the MongoDB Java driver?
Example:
Document1: {_id: 1, "type1": 10, "type2": 30, "ABC": 123, "DEF": 345}
Document2: {_id: 2, "type2": 30, "ABC": 123, "DEF": 345}
Note: in the second document the "type1" key doesn't exist.
When I project, is it possible to project only "type1" and "type2" and get the number of such fields existing in each document?
With my current code I fetch all the documents and, for each one, individually check whether every key I am looking for is present while iterating over the whole cursor.
The code snippet is as follows:
import com.mongodb.MongoClient;
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

MongoClient mcl = new MongoClient();
MongoDatabase mdb = mcl.getDatabase("test");
MongoCollection<Document> mcol = mdb.getCollection("testcol");
FindIterable<Document> findIterable = mcol.find();
MongoCursor<Document> cursor = findIterable.iterator();
// there are 120 types; check whether each one is present
while (cursor.hasNext()) {
    Document doc = cursor.next();
    int numberOfTypes = 0;
    for (int i = 1; i <= 120; i++) {
        if (doc.containsKey("type" + i)) {
            numberOfTypes++;
        }
    }
    System.out.println("*********************************************");
    System.out.println("_id " + doc.get("_id"));
    System.out.println("Number of types in this document: " + numberOfTypes);
    System.out.println("*********************************************");
}
This code works when there are few records, but with, say, 5,000,000 documents, each containing 120 types, the application crashes: a Document object is created on every iteration, which causes heavy garbage collection. Is there another approach to achieve the same functionality?
From your Java code I read
project only "type1" and "type2" and get number of fields existing in that document
as
project only the type[1..120] fields and the number of such fields in the document.
With this assumption, you can map-reduce it as follows:
db.testcol.mapReduce(
    function () {
        var value = { count: 0 };
        for (var i = 1; i <= 120; i++) {
            var key = "type" + i;
            if (this.hasOwnProperty(key)) {
                value[key] = this[key];
                value.count++;
            }
        }
        if (value.count > 0) {
            emit(this._id, value);
        }
    },
    function () {
        // nothing to reduce: keys (_id) are unique, so reduce is never called
    },
    {
        out: { inline: 1 }
    }
);
out: { inline: 1 } works for small datasets, when the result fits into the 16 MB limit. For larger responses you need to output to a collection, which you can query and iterate as usual.
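If your server is at least MongoDB 3.4.4, $objectToArray offers an aggregation alternative to map-reduce that counts the matching fields server-side. A sketch, assuming the same testcol collection and the typeN naming from the question (it counts every field whose name starts with "type"):
db.testcol.aggregate([
    { $project: {
        numberOfTypes: {
            $size: {
                $filter: {
                    input: { $objectToArray: "$$ROOT" },
                    as: "field",
                    // keep fields whose name starts with "type"
                    cond: { $eq: [{ $substrCP: ["$$field.k", 0, 4] }, "type"] }
                }
            }
        }
    }}
]);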

MongoDB: Several fields to a list

I currently have a collection that follows a format like this:
{ "_id": ObjectId(...),
"name" : "Name",
"red": 0,
"blue": 0,
"yellow": 1,
"green": 0,
...}
and so on (a bunch of colors). What I would like to do is to create a new array named colors, whose elements are those colors that have a value of 1.
For example:
{ "_id": ObjectId(...),
"name" : "Name",
"colors": ["yellow"]
}
Is this something I can do in the Mongo shell, or should I do it in a program?
I'm pretty sure I can do it using Python; however, I'm having difficulty doing it directly in the shell. If it can be done in the shell, can anyone point me in the right direction?
Thanks.
Yes, it can easily be done in the shell, or by adapting the example below into any language.
The key here is to look at the fields that are "colors", then construct an update statement that removes those fields from the document while testing them to see whether they qualify for inclusion in the array, and of course adds that array to the document update as well:
var bulk = db.collection.initializeOrderedBulkOp(),
    count = 0;
db.collection.find().forEach(function(doc) {
    doc.colors = doc.colors || [];
    var update = { "$unset": {} };
    Object.keys(doc).filter(function(key) {
        // anchored so only these exact key names are excluded
        return !/^(_id|name|colors)$/.test(key);
    }).forEach(function(key) {
        update.$unset[key] = "";
        if (doc[key] == 1)
            doc.colors.push(key);
    });
    update["$addToSet"] = { "colors": { "$each": doc.colors } };
    bulk.find({ "_id": doc._id }).updateOne(update);
    count++;
    // send the queued updates every 1000 documents
    if (count % 1000 == 0) {
        bulk.execute();
        bulk = db.collection.initializeOrderedBulkOp();
    }
});
if (count % 1000 != 0)
    bulk.execute();
The Bulk Operations usage means that batches of updates are sent rather than one request and response per document, so this will process a lot faster than merely issuing singular updates back and forth.
The main operators here are $unset to remove the existing fields and $addToSet to add the new evaluated array. Both are built up by cycling the keys of the document that make up the possible colors and excluding the other keys you don't want to modify using a regex filter.
$addToSet and this line:
doc.colors = doc.colors || [];
are used to make sure that if any document was already partially converted, or was already storing the correct array because of a code change, it is not adversely affected or overwritten by the update process.
tl;dr, spoiler
MongoDB's shell has access to some JavaScript-like methods on its objects. You can query your collection with db.yourCollectionName.find(), which returns a cursor (see the cursor methods). Then iterate through it to get each document, iterate over its keys, filter out keys like _id and name, check whether the value is 1, and collect the qualifying keys.
Once done, use db.yourCollectionName.update() or db.yourCollectionName.findAndModify() to find the record by _id and use $set to add a new field whose value is the collected keys. A sketch of this approach follows.
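A minimal sketch of that approach in the shell, assuming the field layout from the question (unlike the bulk version above, it does not $unset the old color fields):
db.collection.find().forEach(function(doc) {
    // collect the names of the color fields whose value is 1
    var colors = Object.keys(doc).filter(function(key) {
        return key !== "_id" && key !== "name" && doc[key] === 1;
    });
    db.collection.update({ _id: doc._id }, { $set: { colors: colors } });
});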

Query or command to find a Document, given an ObjectID but NOT a collection

So I have a document that has references to foreign ObjectIDs that may point to other documents or collections.
For example, this is the pseudo-structure of the document:
{
    _id: ObjectID(xxxxxxxx),
    ....
    reference: ObjectID(yyyyyyyy)
}
I can't find anything that does not involve providing the collection, and since I don't know for sure which collection to search, I am wondering if there is a way to find the document across the entire database, i.e. the collection that ObjectID(yyyyyyyy) belongs to.
The only possible way to do this is by listing every collection in the database and performing a db.collection.find() on each one.
E.g. in the Mongo shell I would do something like
var result = new Array();
var collections = db.getCollectionNames();
for (var i = 0; i < collections.length; i++) {
    var found = db.getCollection(collections[i]).findOne({ "_id": ObjectId("yyyyyyyy") });
    if (found) {
        result.push(found);
    }
}
printjson(result); // printjson renders the array of documents readably
You need to run your query on all collections in your database.
db.getCollectionNames().forEach(function(collection) {
    db[collection].find({ $or: [
        { _id: ObjectId("535372b537e6210c53005ee5") },
        { reference: ObjectId("535372b537e6210c53005ee5") }
    ]}).forEach(printjson);
});