MongoDB: How to merge two collections/databases together into one?

I have two databases named: DB_A and DB_B.
Each database has one collection with same name called store.
Both collections have lots and lots of documents with exactly the same structure, e.g. { key: "key1", value: "value1" }.
Originally I was supposed to create only DB_A and insert all documents into it. But later, during my second round of inserting, I made a mistake and typed the wrong database name.
So now each database is about 32GB in size, and I wish to merge the two databases.
One problem/constraint is that the free space available now is only 15GB, so I can't simply copy everything from DB_B to DB_A.
I am wondering if I can perform some kind of "move" to merge the two databases? I would prefer the most efficient way, since simply re-inserting 32GB into DB_A will take quite some time.

I think the easiest (and maybe the only) way is to write a script that merges the two databases document after document.
Get first document from DB_B.
Insert it into DB_A if needed.
Delete it from DB_B.
Repeat until done.
Instead of deleting documents from the source database (DB_B), you may want to just read documents in batches. This should be more performant, but slightly more difficult to code (especially if you've never done such a thing).
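For illustration, here is a minimal mongo-shell sketch of that batched move; the batch size, the try/catch around duplicates, and the use of insertMany/deleteMany (shell 3.2+) are assumptions for the sketch, not part of the answer above:
// Sketch only: moves documents from DB_B.store into DB_A.store in batches.
var src = db.getSiblingDB("DB_B").store;   // source collection
var dst = db.getSiblingDB("DB_A").store;   // destination collection
var batchSize = 1000;                      // arbitrary batch size

while (true) {
    var batch = src.find().limit(batchSize).toArray();
    if (batch.length === 0) break;                  // nothing left to move
    try {
        dst.insertMany(batch, { ordered: false });  // copy the batch into DB_A
    } catch (e) {
        // ignore duplicate-key errors: those documents already exist in DB_A
    }
    var ids = batch.map(function (d) { return d._id; });
    src.deleteMany({ _id: { $in: ids } });          // then remove the batch from DB_B
}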

Starting in Mongo 4.2, the new aggregation stage $merge can be used to merge the contents of a collection into a collection in another database:
// > use db1
// > db.collection.find()
// { "_id" : 1, "key" : "a", "value" : "b" }
// { "_id" : 2, "key" : "c", "value" : "d" }
// { "_id" : 3, "key" : "a", "value" : "b" }
// > use db2
// > db.collection.find()
// { "_id" : 1, "key" : "e", "value" : "f" }
// { "_id" : 4, "key" : "a", "value" : "b" }
// > use db1
db.collection.aggregate([
  { $merge: { into: { db: "db2", coll: "collection" } } }
])
// > use db2
// > db.collection.find()
// { "_id" : 1, "key" : "a", "value" : "b" }
// { "_id" : 2, "key" : "c", "value" : "d" }
// { "_id" : 3, "key" : "a", "value" : "b" }
// { "_id" : 4, "key" : "a", "value" : "b" }
By default, when the target and the source collections contain a document with the same _id, $merge will replace the document from the target collection with the document from the source collection. In order to customise this behaviour, check $merge's whenMatched parameter.
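For instance, to keep the document already present in the target when _ids collide (rather than replacing it), one could run something like the following; this builds on the example above and is a sketch rather than part of the original answer:
// > use db1
db.collection.aggregate([
  { $merge: {
      into: { db: "db2", coll: "collection" },
      whenMatched: "keepExisting"   // keep the existing db2 document on _id collisions
  } }
])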

Related

MongoDB $or + sort + index. How to avoid sorting in memory?

I'm having trouble generating a proper index for my Mongo query, one that would avoid the SORT stage. I am not even sure that is possible in my case. So here is my query with execution stats:
db.getCollection('test').find(
{
    "$or" : [
        {
            "a" : { "$elemMatch" : { "_id" : { "$in" : [4577] } } },
            "b" : { "$in" : [290] },
            "c" : { "$in" : [35, 49, 57, 101, 161, 440] },
            "d" : { "$lte" : 399 }
        },
        {
            "e" : { "$elemMatch" : { "numbers" : { "$in" : ["1K0407151AC", "0K20N51150A"] } } },
            "d" : { "$lte" : 399 }
        }
    ]
})
.sort({ "X" : 1, "d" : 1, "Y" : 1, "Z" : 1 }).explain("executionStats")
The fields 'm', 'a' and 'e' are arrays, which is why 'm' is not included in any index.
If you check the execution stats screenshot, you will see that memory usage is pretty close to the maximum, and unfortunately I have had cases where the query failed to execute because of the 32MB in-memory sort limit.
Index for the first part of the $or query:
{
    "a._id" : 1,
    "X" : 1,
    "d" : 1,
    "Y" : 1,
    "Z" : 1,
    "b" : 1,
    "c" : 1
}
Index for the second part of the $or query:
{
    "e.numbers" : 1,
    "X" : 1,
    "d" : 1,
    "Y" : 1,
    "Z" : 1
}
The indexes are used by the query, but not for sorting. Instead of a SORT stage I would like to see a SORT_MERGE stage, but no success so far. If I run the partial queries inside $or separately, they are able to use the index to avoid sorting in memory. As a workaround that is OK, but I would then need to merge and re-sort the results in the application.
The MongoDB version is 3.4.2. I checked this and that question; my query is the result. Perhaps I missed something?
Edit: the Mongo documents look like this:
{
    "_id" : "290_440_K760A03",
    "Z" : "K760A03",
    "c" : 440,
    "Y" : "NPS",
    "b" : 290,
    "X" : "Schlussleuchte",
    "e" : [
        {
            "..." : 184,
            "numbers" : [
                "0K20N51150A"
            ]
        }
    ],
    "a" : [
        {
            "_id" : 4577,
            "..." : [
                {
                    "..." : [
                        {
                            "..." : "R"
                        }
                    ]
                }
            ]
        },
        {
            "_id" : 4578
        }
    ],
    "d" : 101,
    "m" : [
        "AT",
        "BR",
        "CH"
    ],
    "moreFields" : "..."
}
Edit 2: removed the field "m" from the query to decrease complexity and attached a test collection dump for anyone who wants to help :)
Here is the solution:
I just added one document to my test collection, as shown in the edit of your question. Then I created the four indexes below:
1. {"m":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1}
2. {"a._id":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1}
3. {"m":1,"X":1,"d":1,"Y":1,"Z":1}
4. {"e.numbers":1,"X":1,"d":1,"Y":1,"Z":1}
When I executed the given query for execution stats, it showed the SORT_MERGE stage as expected.
Here is the explanation:
MongoDB has a rule of thumb called equality-sort-range which says a lot about how we should build our indexes. I just followed this rule and kept the index fields in that order. So here the index should be {Equality fields, "X":1,"d":1,"Y":1,"Z":1, Range fields}. You can see that the query has a range on field "d" only ("d" : { "$lte" : 399 }), but "d" is already covered in the SORT fields of the index ("X":1,"d":1,"Y":1,"Z":1), so we can skip the range part (i.e. field "d") at the end of the index.
If "d" had NOT been in the sort/equality predicates, I would have included it as the range field, and the index would have looked like {Equality fields, "X":1,"Y":1,"Z":1,"d":1}.
Now my index is {Equality fields, "X":1,"d":1,"Y":1,"Z":1} and I only need to work out the equality fields. To figure them out I checked the query's find predicates and found two conditions combined by the $or operator.
The first condition has equality on "a._id", "b", "c" and "m" ("d" has a range, not an equality). So I would need an index like {"a._id":1,"m":1,"b":1,"c":1,"X":1,"d":1,"Y":1,"Z":1}, but this would give an error because it contains two array fields, "a._id" and "m", and Mongo doesn't allow a compound index on parallel arrays. So I created two separate indexes to let the query planner use whichever one it chooses; hence the first and second index.
The second $or condition has "e.numbers" and "m". Both are array fields, so I had to create two indexes as for the first condition, and that is how I got the third and fourth index.
We know that a query can use only one index at a time, so I needed to create all of these indexes because I don't know which branch of the $or will be executed.
Note: if you are concerned about index size, you can keep only one index from the first two and one from the last two. Or you can keep all four and hint Mongo to use the proper index if you know better than the query planner.
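As an illustration, a minimal shell sketch of creating those four indexes on the question's test collection and re-checking the plan (nothing here beyond the indexes already listed above):
// Create the four candidate indexes described above.
db.getCollection('test').createIndex({ "m": 1, "b": 1, "c": 1, "X": 1, "d": 1, "Y": 1, "Z": 1 });
db.getCollection('test').createIndex({ "a._id": 1, "b": 1, "c": 1, "X": 1, "d": 1, "Y": 1, "Z": 1 });
db.getCollection('test').createIndex({ "m": 1, "X": 1, "d": 1, "Y": 1, "Z": 1 });
db.getCollection('test').createIndex({ "e.numbers": 1, "X": 1, "d": 1, "Y": 1, "Z": 1 });

// Re-run the original query with .explain("executionStats") and look for a
// SORT_MERGE stage (instead of SORT) in the winning plan.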

How to Find the Equivalent of Broken "Foreign Key" Relationships?

I have two collections which are in what we would call a "one to one relationship" if this were a relational database. I do not know why one is not nested within the other, but the fact is that for every document in collection "A", there is meant to be a document in collection "B" and vice versa.
Of course, in the absence of foreign key constraints, and in the presence of bugs, sometimes there is a document in "A" which does not have a related document in "B" (or vice versa).
I am new to MongoDB and I'm having trouble creating a query or script which will find me all the documents in "A" which do not have a related document in "B" (and vice versa). I guess I could use some sort of loop, but I don't yet know how that would work - I've only just started using simple queries on the RoboMongo command line.
Can anyone get me started with scripts for this? I have seen "Verifying reference (foreign key) integrity in MongoDB", but that doesn't help me. A bug has caused the "referential integrity" to break down, and I need the scripts in order to help me track down the bug. I also cannot redesign the database to use embedding (though I expect I'll ask why one document is not nested within the other).
I have also seen "How to find items in a collections which are not in another collection with MongoDB", but it has no answers.
Pseudo-Code for a Bad Technique
var misMatches = [];
var collectionB = db.getCollection('B');
var allOfA = db.getCollection('A').find();
while (allOfA.hasNext()) {
    var nextA = allOfA.next();
    // findOne returns null when B has no document with this _id
    if (collectionB.findOne({ _id: nextA._id }) === null) {
        misMatches.push(nextA._id);
    }
}
I don't know if this scales well, but ...
... given this sample data set:
> db.a.insert([{a:1}, {a:2}, {a:10}])
> db.b.insert([{b:2}, {b:10}, {b:20}])
// {a:1} and {b:20} have no counterpart: an inconsistent 1-to-1 relationship
You could use map-reduce to collect the set of keys in a and merge it with the set of keys from b:
mapA = function() {
    emit(this.a, {col: ["a"]})
}
mapB = function() {
    emit(this.b, {col: ["b"]})
}
reduce = function(key, values) {
    // merge the `col` arrays from every value; sort the result
    var cols = [];
    values.forEach(function(v) { cols = cols.concat(v.col); });
    return {col: cols.sort()};
}
Producing:
> db.a.mapReduce(mapA, reduce, {out:{replace:"result"}})
> db.b.mapReduce(mapB, reduce, {out:{reduce:"result"}})
> db.result.find()
{ "_id" : 1, "value" : { "col" : [ "a" ] } }
{ "_id" : 2, "value" : { "col" : [ "a", "b" ] } }
{ "_id" : 10, "value" : { "col" : [ "a", "b" ] } }
{ "_id" : 20, "value" : { "col" : [ "b" ] } }
Then it is quite easy to find all the ids that were not found in both collections a and b. In addition, you should be able to spot duplicate keys in one or the other collection:
> db.result.find({"value.col": { $ne: [ "a", "b" ]}})
{ "_id" : 1, "value" : { "col" : [ "a" ] } }
{ "_id" : 20, "value" : { "col" : [ "b" ] }

MongoDB - How can I use MapReduce to merge a value from one collection into another collection on multiple keys of a second collection?

I have two MongoDB collections: The first is a collection that includes frequency information for different IDs and is shown (truncated form) below:
[
    {
        "_id" : "A1",
        "value" : 19
    },
    {
        "_id" : "A2",
        "value" : 6
    },
    {
        "_id" : "A3",
        "value" : 12
    },
    {
        "_id" : "A4",
        "value" : 8
    },
    {
        "_id" : "A5",
        "value" : 4
    },
    ...
]
The second collection is more complex and contains information for each _id listed in the first collection (there it is called frequency_collection_id), and frequency_collection_id may appear inside two lists (info.details_one and info.details_two) in each record:
[
    {
        "_id" : ObjectId("53cfc1d086763c43723abb07"),
        "info" : {
            "status" : "pass",
            "details_one" : [
                {
                    "frequency_collection_id" : "A1",
                    "name" : "A1_object_name",
                    "class" : "known"
                },
                {
                    "frequency_collection_id" : "A2",
                    "name" : "A2_object_name",
                    "class" : "unknown"
                }
            ],
            "details_two" : [
                {
                    "frequency_collection_id" : "A1",
                    "name" : "A1_object_name",
                    "class" : "known"
                },
                {
                    "frequency_collection_id" : "A2",
                    "name" : "A2_object_name",
                    "class" : "unknown"
                }
            ]
        }
    }
    ...
]
What I'm looking to do is merge the frequency information (from the first collection) into the second collection, in effect creating a collection that looks like:
[
    {
        "_id" : ObjectId("53cfc1d086763c43723abb07"),
        "info" : {
            "status" : "pass",
            "details_one" : [
                {
                    "frequency_collection_id" : "A1",
                    "name" : "A1_object_name",
                    "class" : "known",
                    "value" : 19        // merged in from the first collection
                },
                {
                    "frequency_collection_id" : "A2",
                    "name" : "A2_object_name",
                    "class" : "unknown",
                    "value" : 6         // merged in from the first collection
                }
            ],
            "details_two" : [
                {
                    "frequency_collection_id" : "A1",
                    "name" : "A1_object_name",
                    "class" : "known",
                    "value" : 19        // merged in from the first collection
                },
                {
                    "frequency_collection_id" : "A2",
                    "name" : "A2_object_name",
                    "class" : "unknown",
                    "value" : 6         // merged in from the first collection
                }
            ]
        }
    }
    ...
]
I know that this should be possible with MongoDB's MapReduce functions, but all the examples I've seen are either too minimal for my collection structure, or are answering different questions than I'm looking for.
Does anyone have any pointers? How can I merge my frequency information (from my first collection) into the records (inside my two lists in each record of the second collection)?
I know this is more or less a JOIN, which MongoDB does not support, but from my reading, it looks like this is a prime example of MapReduce.
I'm learning Mongo as best I can, so please forgive me if my question is too naive.
Just like all MongoDB operations, a MapReduce always operates on a single collection and cannot obtain information from another one. So your first step needs to be to dump both collections into one. Your documents have different _ids, so it should not be a problem for them to coexist in the same collection.
Then you do a MapReduce where the map function emits both kinds of documents under their common key, which is their frequency ID.
Your reduce function will then receive an array of two documents for each key: the two documents you emitted for that frequency ID. You then just have to merge these two documents into one. Keep in mind that the reduce function can receive these two documents in any order. It can also happen that it gets called with a partial result (only one of the two documents) or with an already completed result; you need to handle these cases gracefully. A good implementation could be to create a new object and then iterate over the input documents, copying all existing relevant fields and their values to the new object, so the resulting object is an amalgamation of the input documents.
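A minimal sketch of that reduce-side merge, under heavy assumptions: both kinds of documents have already been copied into a single collection (named combined here, an assumed name) and flattened so that each carries a frequency_collection_id field; how you unfold details_one/details_two back into the nested shape is left aside.
var map = function () {
    // Both document shapes emit under their shared key, the frequency ID.
    emit(this.frequency_collection_id, this);
};

var reduce = function (key, values) {
    // Reduce may see the inputs in any order, or only a partial set,
    // so fold every field that is present into one merged object.
    var merged = {};
    values.forEach(function (doc) {
        for (var field in doc) {
            if (doc[field] !== undefined) {
                merged[field] = doc[field];
            }
        }
    });
    return merged;
};

db.combined.mapReduce(map, reduce, { out: { reduce: "merged_result" } });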

MongoDB: Why do I get this error: "can't append to array using string field name: comments"?

I have a DB structure like below:
{
    "_id" : 1,
    "comments" : [
        {
            "_id" : 2,
            "content" : "xxx"
        }
    ]
}
I push a new subdocument into the comments field, and that works fine:
db.test.update(
    {"_id" : 1, "comments._id" : 2},
    {$push : {"comments.$.comments" : {_id : 3, content:"xxx"}}}
)
After that, the structure is:
{
    "_id" : 1,
    "comments" : [
        {
            "_id" : 2,
            "comments" : [
                {
                    "_id" : 3,
                    "content" : "xxx"
                }
            ],
            "content" : "xxx"
        }
    ]
}
But when I try to push a new subdocument into the comment whose _id is 3, there is an error:
db.test.update(
    {"_id" : 1, "comments.comments._id" : 3},
    {$push : {"comments.comments.$.comments" : {_id : 4, content:"xxx"}}}
)
error message:
can't append to array using string field name: comments
Well, it makes total sense if you think about it. MongoDB has the advantage, and the disadvantage, of magically solving certain things.
When you query the database for a specific regular field like this:
{ field : "value" }
The query {field: "value"} makes total sense. It wouldn't if the value were part of an array, but Mongo solves that for you, so it also works when the structure is:
{ field : ["value", "anothervalue"] }
Mongo iterates through all of the elements and matches "value" against the field, and you don't have to think about it. It works perfectly, but only at one level, because it's impossible to guess what you want to do when there are multiple levels.
In your case the first query works because it is exactly that case:
db.test.update(
    {"_id" : 1, "comments._id" : 2},
    {$push : {"comments.$.comments" : {_id : 3, content:"xxx"}}}
)
It matches _id at the first level and comments._id at the second level; it gets an array as a result, but Mongo is able to resolve it.
But in the second case, think about what you are asking for; let's isolate the where clause:
{"_id" : 1, "comments.comments._id" : 3},
"Give me the records from the main collection with _id: 1" (one doc)
"And the comments whose inner comments have an _id = 3" (an array inside an array)
The first level is easily solved (comments._id); the second is not possible, because comments returns an array. One more level down is an array of arrays, so Mongo ends up with an array of arrays as a result, and it's not possible to push a document into all the elements of that array.
The solution is to narrow your where clause so that it identifies a unique document in comments (it could be the first one), but that's not a good solution because you never know the position of the document you're looking for. Using the shell, I think the only accurate option is to do it in two steps. Check this query, which works (it is not the real solution) but "solves" the multiple-array part by fixing it to the first element:
db.test.update(
    {"_id" : 1, "comments.0.comments._id" : 3},
    {$push : {"comments.0.comments.$.comments" : {_id : 4, content:"xxx"}}}
)
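For completeness (not part of the original answer): on MongoDB 3.6 or later, arrayFilters can address nested array elements without hard-coding positions, so the second push could be written roughly like this:
db.test.update(
    { "_id" : 1 },
    { $push : { "comments.$[outer].comments.$[inner].comments" : { _id : 4, content : "xxx" } } },
    { arrayFilters : [ { "outer._id" : 2 }, { "inner._id" : 3 } ] }
)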

How can I select a number of records per a specific field using mongodb?

I have a collection of documents in mongodb, each of which have a "group" field that refers to a group that owns the document. The documents look like this:
{
group: <objectID>
name: <string>
contents: <string>
date: <Date>
}
I'd like to construct a query which returns the most recent N documents for each group. For example, suppose there are 5 groups, each of which have 20 documents. I want to write a query which will return the top 3 for each group, which would return 15 documents, 3 from each group. Each group gets 3, even if another group has a 4th that's more recent.
In the SQL world, I believe this type of query is done with "partition by" and a counter. Is there such a thing in mongodb, short of doing N+1 separate queries for N groups?
You cannot do this using the aggregation framework yet: you can get the $max or top date value for each group, but the aggregation framework does not yet have a way to accumulate the top N, and there is no way to push the entire document into the result set (only individual fields).
So you have to fall back on MapReduce. Here is something that would work, but I'm sure there are many variants (all require somehow sorting an array of objects based on a specific attribute; I borrowed my solution from one of the answers to this question).
Map function - outputs group name as a key and the entire rest of the document as the value - but it outputs it as a document containing an array because we will try to accumulate an array of results per group:
map = function () {
    emit(this.name, {a: [this]});
}
The reduce function will accumulate all the documents belonging to the same group into one array (via concat). Note that if you optimize reduce to keep only the top five array elements by checking the date, you won't need the finalize function and you will use less memory while running mapReduce (it will also be faster); a sketch of that variant appears after the reduce function below.
reduce = function (key, values) {
    var result = {a: []};
    values.forEach(function (v) {
        result.a = v.a.concat(result.a);
    });
    return result;
}
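A sketch of that optimization (my illustration, not part of the original answer): trim to the five most recent documents inside reduce itself, so the intermediate arrays stay small and finalize becomes unnecessary.
reduceTop5 = function (key, values) {
    var all = [];
    values.forEach(function (v) { all = all.concat(v.a); });
    // newest first, then keep only the top five
    all.sort(function (x, y) { return y.date - x.date; });
    return {a: all.slice(0, 5)};
}
// Pass reduceTop5 as the reduce function and omit {finalize: final}.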
Since the original reduce above keeps all values for each key, I need a finalize function to pull out only the latest five elements per key.
final = function (key, value) {
    Array.prototype.sortByProp = function (p) {
        return this.sort(function (a, b) {
            return (a[p] < b[p]) ? 1 : (a[p] > b[p]) ? -1 : 0;
        });
    }
    value.a.sortByProp('date');
    return value.a.slice(0, 5);
}
Using template documents similar to the one you provided, you run this by calling the mapReduce command:
> db.top5.mapReduce(map, reduce, {finalize:final, out:{inline:1}})
{
    "results" : [
        {
            "_id" : "group1",
            "value" : [
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe13"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.498Z"),
                    "contents" : 0.23778377776034176
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe0e"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.467Z"),
                    "contents" : 0.4434165076818317
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe09"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.436Z"),
                    "contents" : 0.5935856597498059
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe04"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.405Z"),
                    "contents" : 0.3912118375301361
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfdff"),
                    "name" : "group1",
                    "date" : ISODate("2013-04-17T20:07:59.372Z"),
                    "contents" : 0.221651989268139
                }
            ]
        },
        {
            "_id" : "group2",
            "value" : [
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe14"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.504Z"),
                    "contents" : 0.019611883210018277
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe0f"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.473Z"),
                    "contents" : 0.5670706110540777
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe0a"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.442Z"),
                    "contents" : 0.893193120136857
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe05"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.411Z"),
                    "contents" : 0.9496864483226091
                },
                {
                    "_id" : ObjectId("516f011fbfd3e39f184cfe00"),
                    "name" : "group2",
                    "date" : ISODate("2013-04-17T20:07:59.378Z"),
                    "contents" : 0.013748752186074853
                }
            ]
        },
        {
            "_id" : "group3",
            ...
        }
    ],
    "timeMillis" : 15,
    "counts" : {
        "input" : 80,
        "emit" : 80,
        "reduce" : 5,
        "output" : 5
    },
    "ok" : 1
}
Each result has the group name as _id and, as its value, an array of the five most recent documents from the collection for that group.
You need the aggregation framework: a $group stage piped into a $limit stage.
You also want to $sort the records in some way, or else the limit will have undefined behaviour; the returned documents will be pseudo-random (in the order used internally by Mongo).
Something like this:
db.collection.aggregate([{$group:...},{$sort:...},{$limit:...}])
See the documentation if you want to know more.
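For readers on newer servers (the $$ROOT variable arrived in 2.6 and $slice in $project in 3.2), the top-N-per-group part can be expressed directly in one pipeline. A sketch using the question's field names; this is an addition, not part of either answer above:
db.collection.aggregate([
    { $sort: { group: 1, date: -1 } },                        // newest first within each group
    { $group: { _id: "$group", docs: { $push: "$$ROOT" } } }, // collect whole documents per group
    { $project: { docs: { $slice: ["$docs", 3] } } }          // keep the 3 most recent per group
])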