I have two collections which are in what we would call a "one to one relationship" if this were a relational database. I do not know why one is not nested within the other, but the fact is that for every document in collection "A", there is meant to be a document in collection "B" and vice versa.
Of course, in the absence of foreign key constraints, and in the presence of bugs, sometimes there is a document in "A" which does not have a related document in "B" (or vice versa).
I am new to MongoDB and I'm having trouble creating a query or script which will find me all the documents in "A" which do not have a related document in "B" (and vice versa). I guess I could use some sort of loop, but I don't yet know how that would work - I've only just started using simple queries on the RoboMongo command line.
Can anyone get me started with scripts for this? I have seen "Verifying reference (foreign key) integrity in MongoDB", but that doesn't help me. A bug has caused the "referential integrity" to break down, and I need the scripts in order to help me track down the bug. I also cannot redesign the database to use embedding (though I expect I'll ask why one document is not nested within the other).
I have also seen "How to find items in a collections which are not in another collection with MongoDB", but it has no answers.
Pseudo-Code for a Bad Technique
var misMatches = [];
var collectionB = db.getCollection('B');
var allOfA = db.getCollection('A').find();
while (allOfA.hasNext()) {
    var nextA = allOfA.next();
    // findOne() returns null when nothing matches; find() would return a
    // cursor, which is truthy even when it yields no documents
    if (!collectionB.findOne({ _id: nextA._id })) {
        misMatches.push(nextA._id);
    }
}
I don't know if this scales well, but ...
... given this sample data set:
> db.a.insert([{a:1},{a:2},{a:10} ])
> db.b.insert([ {b:2},{b:10},{b:20}])
//             ^^^^^         ^^^^^^
//             inconsistent 1-to-1 relationship
You could use map-reduce to collect the set of keys in a and merge it with the set of keys from b:
mapA=function() {
emit(this.a, {col: ["a"]})
}
mapB=function() {
emit(this.b, {col: ["b"]})
}
reduce=function(key, values) {
    // merge all `col` arrays and sort the result; the return value keeps
    // the same {col: [...]} shape as the emitted values, so groups with
    // more than two values (duplicate keys) and re-reduces work correctly
    var cols = [];
    values.forEach(function(v) { cols = cols.concat(v.col); });
    return {col: cols.sort()};
}
Producing:
> db.a.mapReduce(mapA, reduce, {out:{replace:"result"}})
> db.b.mapReduce(mapB, reduce, {out:{reduce:"result"}})
> db.result.find()
{ "_id" : 1, "value" : { "col" : [ "a" ] } }
{ "_id" : 2, "value" : { "col" : [ "a", "b" ] } }
{ "_id" : 10, "value" : { "col" : [ "a", "b" ] } }
{ "_id" : 20, "value" : { "col" : [ "b" ] } }
Then it is quite easy to find all ids that are missing from one collection or the other. In addition, you should be able to spot duplicate keys in either collection:
> db.result.find({"value.col": { $ne: [ "a", "b" ]}})
{ "_id" : 1, "value" : { "col" : [ "a" ] } }
{ "_id" : 20, "value" : { "col" : [ "b" ] }
Can someone please clarify whether the following terms correspond as I understand them:
Mongo: embedded -> Mongoose: Sub Document
Mongo: referenced documents -> Mongoose: Population
Thank you for the help.
Embedded documents and subdocuments are the same thing:
{
"embeddedDoc" : { "a" : 1, "b" : 2 },
"embeddedDocs" : [
{ "c" : 2, "d" : "cookie" },
{ "s" : 99, "h" : "pie" },
]
}
Both terms are used when talking about MongoDB and Mongoose. I wouldn't say one is a "MongoDB term" and the other is a "Mongoose term".
A referenced document is a document for which some identifier (usually the _id) is stored in another document.
{
"referencedDoc" : "3F6A99E",
"referencedDocs" : [
"22AE5",
"95A11B"
]
}
In some other collection, or even the same collection, there'd be documents with _ids "3F6A99E", "22AE5", and "95A11B". Population is a Mongoose-specific concept. It's the process by which the references are resolved and replaced by the referenced documents, simulating a simple join. For example, after calling .populate() for the field path referencedDocs, you might end up with something like
{
"referencedDoc" : "3F6A99E",
"referencedDocs" : [
{ "_id" : "22AE5", "food" : "pickles" },
{ "_id" : "95A11B", "food" : "tuna" }
]
}
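As a sketch of how that resolution is triggered in Mongoose (the model and schema names here are hypothetical, and the string _ids match the example above):
// hypothetical model whose fields reference documents in another collection
var mongoose = require('mongoose');

var thingSchema = new mongoose.Schema({
    referencedDoc:  { type: String, ref: 'Other' },
    referencedDocs: [{ type: String, ref: 'Other' }]
});
var Thing = mongoose.model('Thing', thingSchema);

// resolve the references for the field path "referencedDocs"
Thing.findOne().populate('referencedDocs').exec(function(err, doc) {
    // doc.referencedDocs now holds the referenced documents themselves
});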
My MongoDB database is made up of 2 main collections:
1) Maps
{
"_id" : ObjectId("542489232436657966204394"),
"fileName" : "importFile1.json",
"territories" : [
{
"$ref" : "territories",
"$id" : ObjectId("5424892224366579662042e9")
},
{
"$ref" : "territories",
"$id" : ObjectId("5424892224366579662042ea")
}
]
},
{
"_id" : ObjectId("542489262436657966204398"),
"fileName" : "importFile2.json",
"territories" : [
{
"$ref" : "territories",
"$id" : ObjectId("542489232436657966204395")
}
],
"uploadDate" : ISODate("2012-08-22T09:06:40.000Z")
}
2) Territories, which are referenced in "Map" objects:
{
"_id" : ObjectId("5424892224366579662042e9"),
"name" : "Afghanistan",
"area" : 653958
},
{
"_id" : ObjectId("5424892224366579662042ea"),
"name" : "Angola",
"area" : 1252651
},
{
"_id" : ObjectId("542489232436657966204395"),
"name" : "Unknown",
"area" : 0
}
My objective is to list every map with its cumulative area and number of territories. I am trying the following query:
db.maps.aggregate(
{'$unwind':'$territories'},
{'$group':{
'_id':'$fileName',
'numberOf': {'$sum': '$territories.name'},
'locatedArea':{'$sum':'$territories.area'}
}
})
However, the results show 0 for each of these values:
{
"result" : [
{
"_id" : "importFile2.json",
"numberOf" : 0,
"locatedArea" : 0
},
{
"_id" : "importFile1.json",
"numberOf" : 0,
"locatedArea" : 0
}
],
"ok" : 1
}
I probably did something wrong when trying to access the member variables of Territory (name and area), but I couldn't find an example of such a case in the Mongo docs. area is stored as an integer, and name as a string.
Yes indeed, the field "territories" has an array of database references and not the actual documents. DBRefs are objects that contain information with which we can locate the actual documents.
In the above example you can see this clearly; run the mongo query below:
db.maps.find({ "_id": ObjectId("542489232436657966204394") }).forEach(function(doc) {
    print(doc.territories[0]);
})
it will print the DBRef object rather than the document itself:
o/p: DBRef("territories", ObjectId("5424892224366579662042e9"))
So '$sum': '$territories.name' and '$sum': '$territories.area' give you 0, since the DBRef objects carry no fields called name or area. You need to resolve each reference to the actual document before doing something like $territories.name.
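For illustration, if the territories were embedded as actual subdocuments (hypothetically), the original aggregation would work once the count is fixed to '$sum': 1 (summing a string like '$territories.name' yields 0 even on real documents, since $sum ignores non-numeric values):
db.maps.aggregate([
    {'$unwind': '$territories'},
    {'$group': {
        '_id': '$fileName',
        'numberOf': {'$sum': 1},                      // count documents instead of summing a string
        'locatedArea': {'$sum': '$territories.area'}
    }}
])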
To achieve what you want, you can make use of the cursor's map() function, since neither aggregation nor map-reduce supports sub-queries, and you already have a self-contained map document with references to its territories.
Steps to achieve:
a) get each map
b) resolve the `DBRef`.
c) calculate the total area, and the number of territories.
d) make and return the desired structure.
Mongo shell script:
db.maps.find().map(function(doc) {
    var refName;
    var territory_refs = doc.territories.map(function(terr_ref) {
        refName = terr_ref.$ref;   // the collection the DBRef points to ("territories")
        return terr_ref.$id;
    });
    var areaSum = 0;
    // resolve the references: fetch the referenced documents by _id
    // (db[refName], not db.refName, so the variable is used as the collection name)
    db[refName].find({
        "_id" : {
            $in : territory_refs
        }
    }).forEach(function(t) {
        areaSum += t.area;
    });
    return {
        "id" : doc.fileName,
        "noOfTerritories" : territory_refs.length,
        "areaSum" : areaSum
    };
})
o/p:
[
{
"id" : "importFile1.json",
"noOfTerritories" : 2,
"areaSum" : 1906609
},
{
"id" : "importFile2.json",
"noOfTerritories" : 1,
"areaSum" : 0
}
]
Map-Reduce functions should not be, and cannot be, used to resolve DBRefs on the server side.
See what the documentation has to say:
The map function should not access the database for any reason.
The map function should be pure, or have no impact outside of the
function (i.e. side effects.)
The reduce function should not access the database, even to perform
read operations. The reduce function should not affect the outside
system.
Moreover, a reduce function, even if used (which could never work anyway), would never be called for your problem, since a group w.r.t. "fileName" or "ObjectId" would always contain only one document in your dataset.
MongoDB will not call the reduce function for a key that has only a
single value
Suppose we had something like the following document, but we wanted to return only the fields that had numeric information:
{
"_id" : ObjectId("52fac254f40ff600c10e56d4"),
"name" : "Mikey",
"list" : [ 1, 2, 3, 4, 5 ],
"people" : [ "Fred", "Barney", "Wilma", "Betty" ],
"status" : false,
"created" : ISODate("2014-02-12T00:37:40.534Z"),
"views" : 5
}
Now I know that we can query for fields that match a certain type by use of the $type operator. But I've yet to stumble upon a way to $project this as a field value. So if we looked at the document in the "unwound" form, you would see this:
{
"_id" : ObjectId("52fac254f40ff600c10e56d4"),
"name" : 2,
"list" : 16,
"people" : 2
"status" : 8,
"created" : 9,
"views" : 16
}
The final objective would be to list only the fields that matched a certain type (say, the numeric types), filtering out the other fields after much document mangling, to produce a result as follows:
{
"_id" : ObjectId("52fac254f40ff600c10e56d4"),
"list" : [ 1, 2, 3, 4, 5 ],
"views" : 5
}
Does anyone have an approach to handle this?
There are a few issues that make this not practical:
Since the query is a distinct parameter from the projection, this isn't possible from a single query alone, as the projection cannot be influenced by the results of the query
As there's no way with the aggregation framework to iterate fields and check type, that's also not an option
That being said, there's a slightly whacky way of using a Map-Reduce that does get similar answers, albeit in a Map-Reduce style output that's not awesome:
map = function() {
function isNumber(n) {
return !isNaN(parseFloat(n)) && isFinite(n);
}
var numerics = [];
for (var fn in this) {
if (Array.isArray(this[fn])) {
// example ... more complex logic needed; checking only the
// first element is a rough heuristic
if (isNumber(this[fn][0])) {
numerics.push({f: fn, v: this[fn]});
}
} else if (isNumber(this[fn])) {
// else-if, so a numeric array isn't pushed twice
numerics.push({f: fn, v: this[fn]});
}
}
emit(this._id, { n: numerics });
};
reduce = function(key, values) {
    // never actually invoked here: each _id is emitted exactly once, and
    // MongoDB skips the reduce step for keys with a single value
    return values;
};
It's not complete, but the results are similar to what you wanted:
"_id" : ObjectId("52fac254f40ff600c10e56d4"),
"value" : {
"n" : [
{
"f" : "list",
"v" : [
1,
2,
3,
4,
5
]
},
{
"f" : "views",
"v" : 5
}
]
}
The map is just looking at each property and deciding whether it looks like a number ... and if so, adding to an array that will be stored as an object so that the map-reduce engine won't choke on array output. I've kept it simple in the example code -- you could improve the logic of numeric and array checking for sure. :)
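For completeness, a sketch of how this would be invoked (the collection name is assumed; inline output avoids creating a result collection):
db.collection.mapReduce(map, reduce, { out: { inline: 1 } })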
Of course, it's not live like a find or aggregation, but as MongoDB wasn't designed with this in mind, this may have to do if you really wanted this functionality.
I have two existing collections and need to populate a third collection based on the comparison between the two existing.
The two collections that need to be compared have the following schema:
// Settings collection:
{
"Identifier":"ABC123",
"C":"1",
"U":"V",
"Low":116,
"High":124,
"ImportLogId":1
}
// Data collection
{
"Identifier":"ABC123",
"C":"1",
"U":"V",
"Date":"11/6/2013 12AM",
"Value":128,
"ImportLogId": 1
}
I am new to MongoDB and NoSQL in general so I am having a tough time grasping how to do this. The SQL would look something like this:
SELECT s.Identifier, r.Value, r.U, r.C, r.Date
FROM Settings s
JOIN Reads r
ON s.Identifier = r.Identifier
AND s.C = r.C
AND s.U = r.U
WHERE (r.Value <= s.Low OR r.Value >= s.High)
In this case using the sample data, I would want to return a record because the value from the Data collection is greater than the high value from the setting collection. Is this possible using Mongo queries or map reduce, or is this bad collection structure (i.e. maybe all of this should be in one collection)?
A few more additional notes:
The Settings collection should really only have 1 record per "Identifier". The Data collection will have many records per "Identifier". This process could potentially be scanning hundreds of thousands of documents at one time, so resource consideration is somewhat important
There is no good way of performing an operation like this in MongoDB. If you want a BAD way, you can use code like this:
db.settings.find().forEach(
    function(doc) {
        var data = db.data.find({
            Identifier: doc.Identifier,
            C: doc.C,
            U: doc.U,
            $or: [{Value: {$lte: doc.Low}}, {Value: {$gte: doc.High}}]
        }).toArray();
        // Do what you need
    }
)
but don't expect it will perform even remotely as good as any decent RDBMS.
You could rebuild your schema and embed documents from data collection like this:
{
"_id" : ObjectId("527a7f4b07c17a1f8ad009d2"),
"Identifier" : "ABC123",
"C" : "1",
"U" : "V",
"Low" : 116,
"High" : 124,
"ImportLogId" : 1,
"Data" : [
{
"Date" : ISODate("2013-11-06T00:00:00Z"),
"Value" : 128
},
{
"Date" : ISODate("2013-10-09T00:00:00Z"),
"Value" : 99
}
]
}
It may work if the number of embedded documents is low, but to be honest, working with arrays of documents is far from a pleasant experience. Not to mention that you can easily hit the document size limit as the Data array grows.
If this kind of operation is typical for your application, I would consider using a different solution. As much as I like MongoDB, it works well only with certain types of data and access patterns.
Without the concept of JOIN, you must change your approach and denormalize.
In your case, it looks like you're doing data log validation. My advice is to loop over the settings collection and, for each of its documents, use the findAndModify operator to set a validation flag on the matching data collection records; after that, you can just use find on the data collection, filtering by the new flag.
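A rough sketch of that idea (the flag field outOfRange is made up here, and update with multi: true is used instead of findAndModify so that all matching records per setting are flagged in one command):
db.settings.find().forEach(function(s) {
    // flag every data record for this Identifier/C/U whose Value
    // falls outside the [Low, High] band
    db.data.update(
        {
            Identifier: s.Identifier,
            C: s.C,
            U: s.U,
            $or: [{Value: {$lte: s.Low}}, {Value: {$gte: s.High}}]
        },
        { $set: { outOfRange: true } },
        { multi: true }
    );
});
// afterwards, the offending records can be fetched directly:
db.data.find({ outOfRange: true })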
Starting Mongo 4.4, we can achieve this type of "join" with the new $unionWith aggregation stage coupled with a classic $group stage:
// > db.settings.find()
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Low" : 116 }
// { "Identifier" : "DEF456", "C" : "1", "U" : "W", "Low" : 416 }
// { "Identifier" : "GHI789", "C" : "1", "U" : "W", "Low" : 142 }
// > db.data.find()
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Value" : 14 }
// { "Identifier" : "GHI789", "C" : "1", "U" : "W", "Value" : 43 }
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Value" : 45 }
// { "Identifier" : "DEF456", "C" : "1", "U" : "W", "Value" : 8 }
db.data.aggregate([
{ $unionWith: "settings" },
{ $group: {
_id: { Identifier: "$Identifier", C: "$C", U: "$U" },
Values: { $push: "$Value" },
Low: { $mergeObjects: { v: "$Low" } }
}},
{ $match: { "Low.v": { $lt: 150 } } },
{ $out: "result-collection" }
])
// > db["result-collection"].find()
// { _id: { Identifier: "ABC123", C: "1", U: "V" }, Values: [14, 45], Low: { v: 116 } }
// { _id: { Identifier: "GHI789", C: "1", U: "W" }, Values: [43], Low: { v: 142 } }
This:
Starts with a union of both collections into the pipeline via the new $unionWith stage.
Continues with a $group stage that:
Groups records based on Identifier, C and U
Accumulates Values into an array
Accumulates Lows via a $mergeObjects operation in order to get a value of Low that isn't null. Using $first wouldn't work, since it could pick up null first (for elements from the data collection), whereas $mergeObjects discards null values when merging in an object containing a non-null value.
Then discards grouped records whose Low value is, say, 150 or more.
And finally output resulting records to a third collection via an $out stage.
A feature we've developed called Data Compare & Sync might be able to help here.
It lets you compare two MongoDB collections and see the differences (e.g. spot the same, missing, or different fields).
You can then export these comparison results to a CSV file, and use that to create your new, third collection.
Disclosure: We are the creators of the MongoDB GUI, Studio 3T.
I have two databases named: DB_A and DB_B.
Each database has one collection with same name called store.
Both collections have lots and lots of documents that have exactly the same structure { key: "key1", value: "value1" }, etc.
Actually, I was supposed to only create DB_A and insert all documents into DB_A. But later when I did my second round of inserting, I made a mistake by typing the wrong name as the database name.
So now, each database has a size of 32GB, I wish to merge two databases.
One problem/constraint is that the free space available now is only 15GB, so I can't just copy all things from DB_B to DB_A.
I am wondering if I can perform some kind of "move" to merge the two databases. I would prefer the most efficient way, as simply reinserting 32GB into DB_A would take quite some time.
I think the easiest (and maybe the only) way is to write a script that merges the two databases document by document; a sketch follows the steps below.
Get first document from DB_B.
Insert it into DB_A if needed.
Delete it from DB_B.
Repeat until done.
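A minimal sketch of those steps in the mongo shell (the collection name store comes from the question; the duplicate check assumes matching _ids):
var src = db.getSiblingDB("DB_B").store;
var dst = db.getSiblingDB("DB_A").store;

var doc;
while ((doc = src.findOne()) !== null) {
    // step 2: insert into DB_A unless a document with the same _id is already there
    if (dst.findOne({ _id: doc._id }) === null) {
        dst.insert(doc);
    }
    // step 3: delete from DB_B, freeing space as we go
    src.remove({ _id: doc._id });
}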
Instead of deleting documents from the source db (DB_B), you may want to just read documents in batches. That should be more performant, but slightly more difficult to code (especially if you have never done such a thing).
Starting Mongo 4.2, the new aggregation stage $merge can be used to merge the content of a collection into another collection in another database:
// > use db1
// > db.collection.find()
// { "_id" : 1, "key" : "a", "value" : "b" }
// { "_id" : 2, "key" : "c", "value" : "d" }
// { "_id" : 3, "key" : "a", "value" : "b" }
// > use db2
// > db.collection.find()
// { "_id" : 1, "key" : "e", "value" : "f" }
// { "_id" : 4, "key" : "a", "value" : "b" }
// > use db1
db.collection.aggregate([
{ $merge: { into: { db: "db2", coll: "coll" } } }
])
// > use db2
// > db.collection.find()
// { "_id" : 1, "key" : "a", "value" : "b" }
// { "_id" : 2, "key" : "c", "value" : "d" }
// { "_id" : 3, "key" : "a", "value" : "b" }
// { "_id" : 4, "key" : "a", "value" : "b" }
By default, when the target and the source collections contain a document with the same _id, $merge will replace the document from the target collection with the document from the source collection. In order to customise this behaviour, check $merge's whenMatched parameter.
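For instance (a sketch; "keepExisting" is one of the documented whenMatched modes), to keep the version already in db2 when _ids collide:
db.collection.aggregate([
    { $merge: {
        into: { db: "db2", coll: "collection" },
        whenMatched: "keepExisting"   // keep the target's document on _id collisions
    }}
])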