How to compare two MongoDB collections?

I'm trying to 'compare' all documents between two collections; the comparison should return true if and only if all documents in the two collections are exactly equal.
I've been searching for methods on the collection, but couldn't find one that can do this.
I experimented with something like this in the mongo shell, but it doesn't work as I expected:
db.test1 == db.test2
or
db.test1.to_json() == db.test2.to_json()
Please share your thoughts! Thank you.

You can try using MongoDB's db.eval combined with a custom equals function, something like this.
Your methods don't work because in the first case you are comparing object references, which are not the same. In the second case, there is no guarantee that to_json will generate the same string even for the objects that are the same.
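For instance (a shell illustration; BSON preserves field order, so two logically equal documents can serialize differently):
// Same fields, different insertion order: the JSON strings differ.
tojson({ a: 1, b: 2 }) == tojson({ b: 2, a: 1 })  // false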
Instead, try something like this:
var compareCollections = function(){
    db.test1.find().forEach(function(obj1){
        db.test2.find({/*if you know some properties, you can put them here...if don't, leave this empty*/}).forEach(function(obj2){
            var equals = function(o1, o2){
                // here goes some compare code...modified from the SO link you have in the answer.
            };
            if(equals(obj1, obj2)){
                // Do what you want to do
            }
        });
    });
};
db.eval(compareCollections);
With db.eval you ensure that the code is executed on the database server side, without fetching both collections to the client. (Note that db.eval is deprecated in newer MongoDB versions, so this approach only applies to older servers.)
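For reference, here is a minimal shape such an equals function could take (a sketch: it assumes the documents contain only plain values, arrays, and subdocuments; types like Date or ObjectId would need special handling):
// Minimal deep-equality sketch for plain documents.
var equals = function (o1, o2) {
    var keys1 = Object.keys(o1).sort();
    var keys2 = Object.keys(o2).sort();
    if (keys1.length !== keys2.length) return false;
    for (var i = 0; i < keys1.length; i++) {
        var k = keys1[i];
        if (k !== keys2[i]) return false;      // different field names
        var v1 = o1[k], v2 = o2[k];
        if (v1 !== null && v2 !== null &&
            typeof v1 === 'object' && typeof v2 === 'object') {
            if (!equals(v1, v2)) return false; // recurse into subdocuments/arrays
        } else if (v1 !== v2) {
            return false;                      // plain values differ
        }
    }
    return true;
};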

Call a custom Python function on every document in a MongoDB collection

I want to call a custom Python function on some existing attribute of every document in the entire collection and store the result as a new key-value pair in that (same) document. Is there any way to do that (since each call is independent of the others)?
I noticed cursor.forEach, but can't it be done efficiently using just Python?
A simple example would be to split the string in text and store the number of words as a new attribute.
def split_count(text):
    # some complex preprocessing...
    return len(text.split())

# Need something like this...
db.collection.update_many({}, {'$set': {"split": split_count('$text')}}, upsert=True)
But it seems that setting a new attribute in a document based on the value of another attribute in the same document is not possible this way yet. This post is old, but the issue seems to still be open.
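For what it's worth, newer servers can do this particular update entirely server side: MongoDB 4.2+ accepts an aggregation pipeline as the update document, which may reference other fields of the same document. In shell syntax (PyMongo's update_many accepts the same pipeline as a Python list):
// MongoDB 4.2+ only; assumes 'text' is always a string field.
db.collection.updateMany({}, [
    { $set: { split: { $size: { $split: ["$text", " "] } } } }
]);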
I found a way to call any custom python function on a collection using parallel_scan in PyMongo.
import threading

def process_text(cursor):
    for row in cursor.batch_size(200):
        # Any complex preprocessing here...
        split_text = row['text'].split()
        db.collection.update_one({'_id': row['_id']},
                                 {'$set': {'split_text': split_text,
                                           'num_words': len(split_text)}},
                                 upsert=True)

def preprocess(num_threads=4):
    # Get up to max 'num_threads' cursors.
    cursors = db.collection.parallel_scan(num_threads)
    threads = [threading.Thread(target=process_text, args=(cursor,))
               for cursor in cursors]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
This is not really faster than cursor.forEach (but not that slow either), but it lets me execute arbitrarily complex Python code and save the results from within Python itself.
Also, if I have an array of ints in one of the attributes, cursor.forEach converts them to floats, which I don't want. So I preferred this way.
But I would be glad to know if there're any better ways than this :)
It is quite unlikely that it will ever be efficient to do this kind of thing in Python, because every document would have to make a round trip and go through the Python function on the client machine.
In your example code, you are passing the result of a function to a MongoDB update query, which won't work. You can't run any Python code inside MongoDB queries on the database server.
As the answer to the question you linked suggests, this type of action has to be performed in the mongo shell, e.g.:
db.collection.find().snapshot().forEach(
    function (elem) {
        var splitLength = elem.text.split(" ").length;
        db.collection.update(
            { _id: elem._id },
            { $set: { split: splitLength } }
        );
    }
);

Meteor - Cannot Access All Read Operations

I'm new to Meteor. I've been stuck on this problem for a while. I can successfully add items to a collection and look at them fully in the console. However, I cannot access all of the read operations in my .js file.
That is, I can use .find() and .findOne() with empty parameters. But when I try to add .sort or an argument, I get an error telling me the object is undefined.
Autopublish is turned on, so I'm not sure what the problem is. These calls are being made directly in the client.
This returns something--
Template.showcards.events({
    "click .play-card": function () {
        alert(Rounds.find());
    }
});
And this returns nothing--
Template.showcards.events({
    "click .play-card": function () {
        alert(Rounds.find().sort({player1: -1}));
    }
});
Sorry for the newbie question. Thanks in advance.
Meteor's collection API works a bit differently from the mongo shell's API, which is understandably confusing for new users. You'll need to do this:
Template.showcards.events({
    'click .play-card': function() {
        var sortedCards = Rounds.find({}, {sort: {player1: -1}}).fetch();
        console.log(sortedCards);
    }
});
See this for more details. Also note that logging a cursor (the result of a find) probably isn't what you want. If you want to see the contents of the documents, you need to fetch them.
Rounds.find() returns a cursor, and in Meteor the sort is passed as an option rather than chained, so you will want to do this:
Rounds.find({}, {sort: {player1: -1}}).fetch();
Note that fetch returns an Array of document objects. So you would do something more like this:
var docs = Rounds.find({}, {sort: {player1: -1}}).fetch();
alert(docs[0]);

Updating multiple complex array elements in MongoDB

I know this has been asked before, but I have yet to find a solution that works efficiently. I am working with the MongoDB C# driver, though this is more of a general question about MongoDB operations.
I have a document structure that looks something like this:
field1: value1
field2: value2
...
users: [ {...user 1 subdocument...}, {...user 2 subdocument...}, ... ]
Some facts:
Each user subdocument includes further sub-arrays & subdocuments (so they're fairly complex).
The average users array only contains about 5 elements, but in the worst case can surpass 100.
Several thousand update operations on multiple users may be conducted per day in this system, each on one document at a time. Larger arrays will receive more frequent updates due to their data size.
I am trying to figure out how to do this efficiently. From what I've heard, you cannot directly set several array elements to new values all at once, so I had to try something else.
I tried using the $pullAll / $addToSet + $each operations to remove the old array and replace it with a modified one. I am aware that $pullAll can remove only the elements that I need as well, but I would like to preserve the order of elements.
The C# code:
try
{
    WriteConcernResult wcr = collection.Update(query,
        Update.Combine(Update.PullAll("users"),
                       Update.AddToSetEach("users", newUsers.ToArray())));
}
catch (WriteConcernException wce)
{
    return wce.Message;
}
In this case newUsers is a List<BsonValue> converted to an array. However I am getting the following exception message:
Cannot update 'users' and 'users' at the same time
By the looks of it, I can't have two update statements in use on the same field in the same write operation.
I also tried Update.Set("users", newUsers.ToArray()), but apparently the Set statement doesn't work with arrays, just basic values:
Argument 2: cannot convert from 'MongoDB.Bson.BsonValue[]' to 'MongoDB.Bson.BsonValue'
So then I tried converting that array to a BsonDocument:
Update.Set("users", newUsers.ToArray().ToBsonDocument());
And got this:
An Array value cannot be written to the root level of a BSON document.
I could try replacing the whole document, but that seems like overkill and definitely not very efficient.
So the only thing I can think of now is to run two separate write operations: one to remove the unwanted old users and another to replace them with their newer versions:
WriteConcernResult wcr = collection.Update(query, Update.PullAll("users"));
wcr = collection.Update(query, Update.AddToSetEach("users", newUsers.ToArray()));
Is this my best option? Or is there another, better way of doing this?
Your code should work with a minor change:
Update.Set("users", new BsonArray(newUsers));
BsonArray is a BsonValue, whereas an array of documents is not, and we don't implicitly convert arrays like we do other primitive values.
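For context, Update.Set produces a plain $set of the whole array, so the mongo shell equivalent would be something like this sketch (the selector and array contents stand in for the question's query and newUsers variables):
db.collection.update(
    { /* same selector as the C# 'query' variable */ },
    { $set: { users: [ /* the rebuilt user subdocuments */ ] } }
);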
This extension method solved my problem:
using System.Collections;
using MongoDB.Bson;

public static class MongoExtension
{
    public static BsonArray ToBsonArray(this IEnumerable list)
    {
        var array = new BsonArray();
        foreach (var item in list)
            array.Add((BsonValue) item);
        return array;
    }
}

Play with Meteor MongoID in Monk

I'm building an API using Express and Monk that connects to a database where writes are mainly handled by a Meteor application.
I know that Meteor uses its own algorithm to generate IDs. So when I do something like that:
id = "aczXLTjzjjn3PchX6" // this is an ID generated by Meteor (not a valid MongoID)
Users.findOne({ _id: id }, function(err, doc) {
console.log(doc);
});
Monk outputs:
Argument passed in must be a single String of 12 bytes or a string of 24 hex characters.
This way, it seems very tricky to me to design a solid and reliable REST API. Thus, I have two questions:
How can I handle the difference in my queries between ids generated by Meteor and valid MongoID()? Is there a simple way to get JSON results from a Meteor database?
Will it be a problem to insert documents from the API, which this time will have a valid MongoID()? I will end up with both types of ids in my database, which seems very bad to me. :/
As I said here in a similar issue, you can just override the id converter part of Monk:
var idConverter = Users.id; // Keep a reference in case...
Users.id = function (str) { return str; };
But don't expect monk to convert Ids automatically anymore.
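With the override in place, the failing query from the question works as-is:
Users.findOne({ _id: "aczXLTjzjjn3PchX6" }, function(err, doc) {
    console.log(doc); // the string id passes straight through, so no cast error
});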
How can I handle the difference in my queries between ids generated by Meteor and valid MongoID()? Is there a simple way to get JSON results from a Meteor database?
There isn't much you need to do. When it is a valid ObjectId (a MongoDB id) and you got a string, just convert it to an ObjectId:
id = ObjectId(id);
User.find(id, ...)
Here's the implementation of Monk's id method (this.col.id is a reference to the MongoDB native ObjectId):
Collection.prototype.id =
Collection.prototype.oid = function (str) {
    if (null == str) return this.col.id();
    return 'string' == typeof str ? this.col.id(str) : str;
};
Will it be a problem to insert documents from the API which this time will have a valid MongoId()? I will end up with both type of ids in my database, seems very bad to me. :/
It is bad, though it won't cause much trouble (in my experience in Node.js) if you are careful. You won't always be careful (programmer errors happen a lot), but it's manageable. In statically typed languages (like Java) this is a big no, because a field can have only one type (either string or ObjectId).
My suggestion is not to use MongoDB's ObjectId at all and to just use strings as ids. On inserts, supply a string _id so the driver won't generate an ObjectId. You could also stop the driver from doing so by overriding pkFactory, but that doesn't seem to be easy with Monk.
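For example (a sketch; the name field is just illustrative):
// Supplying a string _id on insert, so the driver never generates an ObjectId.
Users.insert({ _id: "aczXLTjzjjn3PchX6", name: "example" }, function(err, doc) {
    // doc._id stays the string you supplied
});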
One more thing: Monk is not actively maintained, and it's just a thin layer on top of the MongoDB driver. In my experience, if you have multiple collections and a large/complex code base, Mongoose will be much better to use.
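For illustration, a Mongoose model pinned to string ids might look like this sketch (the schema fields are hypothetical):
var mongoose = require('mongoose');

// _id declared as String: Meteor-generated ids and any other string keys
// are stored uniformly, but you must supply _id yourself on insert.
var userSchema = new mongoose.Schema({
    _id: String,
    name: String
});
var User = mongoose.model('User', userSchema);

User.findById('aczXLTjzjjn3PchX6', function(err, doc) {
    console.log(doc);
});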
Just to keep this question updated.
As stated in the docs, Monk automatically casts strings into ObjectIDs.
To disable this behaviour without resorting to hacky solutions, just set castIds to false when getting the collection.
So:
const Users = db.get('users', { castIds: false });
Now this will work:
Users.findOne({ _id: "aczXLTjzjjn3PchX6" }, function(err, doc) {
    console.log(doc);
});

Are Rx extensions suitable for reading a file and storing to a database?

I have a really long Excel file which I read using EPPlus. For each line I test whether it meets certain criteria, and if so I add the line (an object representing the line) to a collection. When the file is read, I store those objects to the database. Would it be possible to do both things at the same time? My idea is to have a collection of objects that would be consumed by a thread saving the objects to the DB, while at the same time the Excel reader method populates the collection... Could this be done using Rx, or is there a better method?
Thanks.
An alternate answer - based on comments to my first.
Create a function returning an IEnumerable<Records> from EPPlus/Xls - use yield return -
then convert the sequence to an observable on the thread pool and you've got the Rx way of having a producer/consumer with a BlockingCollection.
IEnumerable<Records> epplusRecords()
{
    while (...)
        yield return nextRecord;
}

var myRecords = epplusRecords()
    .ToObservable(Scheduler.ThreadPool)
    .Where(rec => meetsCriteria(rec))
    .Select(rec => newShape(rec))
    .Do(newRec => writeToDb(newRec))
    .ToArray();
Your case seems to be one of pulling data (IEnumerable) rather than pushing data (IObservable/Rx).
Hence I would suggest LINQ to Objects as something that can be used to model the solution.
Something like the code shown below.
public static IEnumerable<Records> ReadRecords(string excelFile)
{
    // Read from the excel file and yield values
}

// use linq operators to do the filtering
var filtered = ReadRecords("fileName").Where(r => /* your condition */);
foreach (var r in filtered)
    WriteToDb(r);
NOTE: Using IEnumerable, you don't create intermediate collections in this case, and the whole process looks like a pipeline.
It doesn't seem like a good fit, as there's no inherent concurrency, timing, or eventing in the use case.
That said, it may be a use case for PLINQ, if EPPlus supports concurrent reads. Something like:
epplusRecords()
    .AsParallel()
    .Where(rec => meetsCriteria(rec))
    .Select(rec => newShape(rec))
    .ForAll(newRec => writeToDb(newRec)); // PLINQ has no Do operator; ForAll is the side-effecting sink