Mongo: Fetching documents that aren't referenced in another collection - mongodb

Let's assume that I have two collections in my Mongo database: A and B. Each A document may have a reference to a B document, but B documents don't have references back to A.
How can I efficiently find all documents in B that are not referenced by a document in A?
Is there a more effective approach than retrieving all documents in B and manually comparing against A documents? Can this be done with map reduce?
Should I consider adding references from B to A to support the query? Since Mongo doesn't support transactions, I had avoided any two way references to avoid any chance of an inconsistent state in the event of a failure.
Also, I need to be able to effectively page through these results, if that impacts the solution.

In pseudo-code:
// Get the set of B document ids that are referenced by A documents.
var bref_ids = db.A.distinct('b_id');
// Get the set of all other B documents.
var unreferenced_b_docs = db.B.find({_id: {$nin: bref_ids}});
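On MongoDB 3.2 or later, the same anti-join can also be done on the server with $lookup, which avoids sending a potentially huge $nin list and makes paging straightforward. This is a minimal sketch, assuming the reference field on A is b_id as in the snippet above. The helper name unreferencedPage and its parameters are made up for illustration.

```javascript
// Hypothetical helper: builds an aggregation pipeline that keeps only the
// B documents with no matching A document and returns one page of them.
// Assumes A documents store the reference in a field named "b_id".
function unreferencedPage(pageNumber, pageSize) {
  return [
    // For each B document, collect the A documents that point at it.
    { $lookup: { from: "A", localField: "_id", foreignField: "b_id", as: "refs" } },
    // Keep only B documents where nothing matched.
    { $match: { refs: { $size: 0 } } },
    // A stable sort order is needed for paging to be consistent.
    { $sort: { _id: 1 } },
    { $skip: pageNumber * pageSize },
    { $limit: pageSize }
  ];
}

// In the mongo shell, the first page of 20 would be:
//   db.B.aggregate(unreferencedPage(0, 20));
```

An index on A.b_id is likely to matter here, since the lookup queries A once per B document.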

Related

Find oldest for every joined document

I have two collections, A and B. Each document in collection A corresponds to many documents in collection B. Documents in collection B have two important properties, "AID", the ID of the corresponding collection A document, and "date", something I want to sort by.
I now have a find query on collection A, which returns many documents in that collection. For each of the returned documents, I now want to find the oldest corresponding document in collection B.
So far, I used a for-each on my find query in collection A, and then used a find({"AID": doc._id}).sort({"date": 1}).limit(1) on collection B.
This is obviously super inefficient, since I have to go through my enormous collection B every single time for each document in collection A. Can I somehow simplify this into two aggregated queries only, perhaps using some kind of aggregate pipeline? How can I only traverse collection B once?
Thanks!
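One way to go through collection B only once is to collect the _ids from the query on A and then let a single aggregation on B pick the oldest document per AID. This is a sketch under the field names given in the question ("AID" and "date"). The helper name oldestPerParent is hypothetical, and it has not been tested against the poster's data.

```javascript
// Hypothetical helper: builds a pipeline that returns, for each given A _id,
// the B document with the smallest "date".
function oldestPerParent(parentIds) {
  return [
    // Restrict B to the documents belonging to the selected A documents.
    { $match: { AID: { $in: parentIds } } },
    // Oldest first, so $first below picks the oldest document of each group.
    { $sort: { date: 1 } },
    // One output document per AID, carrying the whole oldest B document.
    { $group: { _id: "$AID", oldest: { $first: "$$ROOT" } } }
  ];
}

// In the mongo shell (the query on A is whatever the original find used):
//   var ids = db.A.find(query, { _id: 1 }).map(function (d) { return d._id; });
//   db.B.aggregate(oldestPerParent(ids));
```

A compound index on AID and date should help the $match and $sort stages, though that depends on the data.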

Mongodb compare two big data collection

I want to compare two very big collections. The main goal of the operation is to know which elements have changed or been deleted.
Collections 1 and 2 have the same structure and hold more than 3 million records.
Example:
record 1 {id:'7865456465465',name:'tototo', info:'tototo'}
So I want to know which elements have changed, and which elements are not present in collection 2.
What is the best solution to do this?
1) Define what equality of 2 documents means. For me it would be: both documents should contain all fields with exact same values given their ids are unique. Note that mongo does not guarantee field order, and if you update a field it might move to the end of the document which is fine.
2) I would use some framework that can connect to mongo and fetch data at the same time converting it to a map-like data structure or even JSON. For instance I would go with Scala + Lift record (db.coll.findAll()) + Lift JSON. Lift JSON library has Diff function that will give you a diff of 2 JSON docs.
3) Finally I would sort both collections by ids, open db cursors, iterate and compare.
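The sort-and-iterate idea from step 3 can be sketched with plain arrays standing in for the two sorted cursors. This assumes flat documents with a unique id field, as in the question's example. The function names sameFlatDoc and diffSorted are made up for illustration, and ids are compared as plain strings, which must match the order the database sorted them in.

```javascript
// Field-order-insensitive equality for flat documents
// (assumption: no nested objects or arrays).
function sameFlatDoc(a, b) {
  const keysA = Object.keys(a).sort();
  const keysB = Object.keys(b).sort();
  if (keysA.length !== keysB.length) return false;
  return keysA.every((k, i) => k === keysB[i] && a[k] === b[k]);
}

// Walks two arrays that are both sorted ascending by `id` and reports
// ids that changed and ids missing from the second array.
function diffSorted(first, second) {
  const changed = [];
  const missing = [];
  let j = 0;
  for (const doc of first) {
    // Skip ids that exist only in the second collection.
    while (j < second.length && second[j].id < doc.id) j++;
    if (j < second.length && second[j].id === doc.id) {
      if (!sameFlatDoc(doc, second[j])) changed.push(doc.id);
      j++;
    } else {
      missing.push(doc.id);
    }
  }
  return { changed, missing };
}
```

With real collections the arrays would be replaced by cursors sorted on id, so that neither collection has to fit in memory.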
If the schema is flat, as it is in your case, you can use a free tool (dataq.io) to compare the data in two tables.
Disclaimer: I am the founder of this product.

MongoDB - how can I find all documents that aren't referenced by a document from another collection

So here is the Problem:
I have a document in collection A. When it is first created, it is not referenced by any other documents. At some point a document in collection B will be created, and it will reference the ObjectId of a document in collection A.
What is the best way to find all documents in collection A that aren't referenced by a document in collection B?
I understand MongoDB doesn't support joins, but I wonder if there is a solution to this problem other than getting all referenced ObjectIds from Collection B and finding documents in collection A that aren't in that list, as this solution likely wouldn't scale well.
Can I just embed the document from Collection A into the document from collection B and then remove it from Collection A? Is that the best solution?
Thanks for your help and comments.
With MongoDB 3.2, the addition of the $lookup operator makes this possible:
db.a.aggregate([
  {
    $lookup: {
      from: "b",            // secondary collection containing references to _id of 'a'
      localField: "_id",    // the _id field of the 'a' collection
      foreignField: "a_id", // the referencing field of the 'b' collection
      as: "references"
    }
  },
  {
    $match: {
      references: []        // keep only documents with no matches in 'b'
    }
  }
]);
The above query will return all documents in collection a that have no references in collection b.
Be careful with this, though. Performance may become an issue with large collections.
Lots of options:
1) Add the id of the B document to an array in the A document (a reverse reference). Now you can look for A documents that don't have any elements in that array. Issue: array may get too large for document size if you have lots of cross references.
2) Add a collection C that tracks references between A's and B's. Behaves like a join table.
3) Have a simple flag in A 'referenced'. When you add a B mark all of the A's that it references as 'referenced'. When you remove a B, do a scan of B for all of the A's that are referenced by it and unflag any A's that no longer have a reference. Issue: could get out of sync.
4) Use map reduce on B to create a collection containing the ids of all the A's that are referenced by any B. Use that collection to mark all the A's that are referenced (after unmarking all of them first). Can use this to fix (3) periodically.
5) Put both document types in the same collection and use map reduce to emit the _id and a flag to say 'in A' or 'referenced by B'. In the reduce step look for any groups that have 'in A' but not 'referenced by B'.
...
Since there are no joins, the only options are the ones you mention: either use embedded documents, or resign yourself to the use of two-part queries.
It depends on your implementation, but adding document type B to the corresponding document in A sounds like the best bet. That way you can retrieve the A's without a B by using a straightforward query ($exists operator)...
db.A.find({ B: { $exists: false } })

MongoDB: Nested query with arrays, and its performance

I have 2 collections on 2 separate DBs. Both store an array field. I plan to query both at once to get:
All collection 1 documents that have elements [A,B] in their array field and whose _ids are present in the array field of a specific collection 2 document (identified by its _id).
As an example:
docs (collection 1, DB 1):
[{"_id":ObjectId("doc1"), "array1":["A","B"]}, {"_id":ObjectId("doc2"), "array1":["A","C"]}]
user_docs (collection 2, DB 2):
[{"_id":ObjectId("usr1"), "array2": [ObjectId("doc1"),ObjectId("foo")]}, {"_id":ObjectId("usr2"), "array2": [ObjectId("bar"),ObjectId("baz")]}]
I need a query that, given A, B and usr1, returns the 'doc1' object (because it has A and B in its array1 field and usr1 has it in its array2 field).
I obviously can fetch all docs having A,B in one query and all usr1's docs in another query and find the common elements at application level, but is there any better way of doing it using MongoDB?
Thanks for your help.
OK, I'm not sure I understand exactly what you're trying to do from your description. But I don't understand why you would query data across DBs; this seems very heavy-handed to me. Why can't you store both data sets in the same DB? You can always separate them later if required. I'm not sure this will solve your problem, but it would be a good place to start.
Best of luck.
You will have to query MongoDB twice, since you have no possibility of a join. You will have to do it at the application level. If you can denormalize, do it. Cache the needed data in an embedded doc, so that you can do one query only.
I think @Eamonn is right that you shouldn't have to do a query across DBs.
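The two-query approach does not have to intersect the results in application code: the array fetched by the first query can be fed into the second query's filter. This is a rough sketch using the field names from the question (array1, array2). The helper name docsFilterFor and the database names in the comments are assumptions.

```javascript
// Hypothetical helper: builds the filter for the second query from the
// user document returned by the first query.
function docsFilterFor(userDoc, requiredElements) {
  return {
    // Only documents whose _id is listed in the user's array2...
    _id: { $in: userDoc.array2 },
    // ...and whose array1 contains every required element.
    array1: { $all: requiredElements }
  };
}

// In the mongo shell ("db1" and "db2" stand for the two databases):
//   var user = db.getSiblingDB("db2").user_docs.findOne({ _id: userId });
//   db.getSiblingDB("db1").docs.find(docsFilterFor(user, ["A", "B"]));
```

This still costs two round trips, but the second query returns only the matching documents.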

In MongoDB, do document _id's need to be unique across a collection or the entire DB?

I'm building a database with several collections. I have unique strings that I plan on using for all the documents in the main collection. Documents in other collections will reference documents in the main collection, which means I'll have to save said id's in the other collections. However, if _id's only need to be unique across a collection and not across an entire database, then I would just make the _id's in the other collections also use the aforementioned unique strings.
Also, I assume that in order to set my own _id's, all I have to do is have an "_id":"unique_string" property as part of the document that I insert, correct? I wouldn't need to convert the "unique_string" into another format, right?
Also, hypothetically speaking, would I be able to have a variable save the string "_id" and use that instead? Just to be clear, something as follows: var id = "_id" and then later on in the code (during an insert or a query for example) have id:"unique_string".
Best, and thanks, Sami
_ids only have to be unique within a collection, not across the whole database. You can quickly verify this by inserting two documents with the same _id in two different collections.
Your other assumptions are correct, just try them and see whether they work (they will). The proof of the pudding is in the eating.
Note: use _id directly; var id = "_id" just complicates the code.
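One caveat on the variable idea, assuming the code in question is JavaScript: inside an object literal, id:"unique_string" creates a key literally named "id", not "_id". The small illustration below shows the difference. The variable names wrong and right are only for the example.

```javascript
var id = "_id";

// The key here is the literal name "id", not the value of the variable.
var wrong = { id: "unique_string" };

// Bracket assignment uses the variable's value ("_id") as the key.
var right = {};
right[id] = "unique_string";

console.log(Object.keys(wrong)); // [ 'id' ]
console.log(Object.keys(right)); // [ '_id' ]
```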