MongoDB - How can I use MapReduce to merge a value from one collection into another collection on multiple keys of a second collection? - mongodb

I have two MongoDB collections: The first is a collection that includes frequency information for different IDs and is shown (truncated form) below:
[
{
"_id" : "A1",
"value" : 19
},
{
"_id" : "A2",
"value" : 6
},
{
"_id" : "A3",
"value" : 12
},
{
"_id" : "A4",
"value" : 8
},
{
"_id" : "A5",
"value" : 4
},
...
]
The second collection is more complex and contains information for each _id listed in the first collection (it's called frequency_collection_id in the second collection), but frequency_collection_id may be inside two lists (info.details_one, and info.details_two) for each record:
[
{
"_id" : ObjectId("53cfc1d086763c43723abb07"),
"info" : {
"status" : "pass",
"details_one" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known"
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown"
}
],
"details_two" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known"
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown"
}
],
}
}
...
]
What I'm looking to do, is merge the frequency information (from the first collection) into the second collection, in effect creating a collection that looks like:
[
{
"_id" : ObjectId("53cfc1d086763c43723abb07"),
"info" : {
"status" : "pass",
"details_one" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known",
**"value" : 19**
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown",
**"value" : 6**
}
],
"details_two" : [
{
"frequency_collection_id" : "A1",
"name" : "A1_object_name",
"class" : "known",
**"value" : 19**
},
{
"frequency_collection_id" : "A2",
"name" : "A2_object_name",
"class" : "unknown",
**"value" : 6**
}
],
}
}
...
]
I know that this should be possible with MongoDB's MapReduce functions, but all the examples I've seen are either too minimal for my collection structure, or are answering different questions than I'm looking for.
Does anyone have any pointers? How can I merge my frequency information (from my first collection) into the records (inside my two lists in each record of the second collection)?
I know this is more or less a JOIN, which MongoDB does not support, but from my reading, it looks like this is a prime example of MapReduce.
I'm learning Mongo as best I can, so please forgive me if my question is too naive.

Just like all MongoDB operations, a MapReduce always operates only on a single collection and can not obtain info from another one. So you first step needs to be to dump both collections into one. Your documents have different _id's, so it should not be a problem for them to coexist in the same collection.
Then you do a MapReduce where the map function emits both kinds of documents for their common key, which is their frequency ID.
Your reduce function will then receive an array of two documents for each key: the two documents you have received. You then just have to merge these two documents into one. Keep in mind that the reduce-function can receive these two documents in any order. It can also happen that it gets called for a partial result (only one of the two documents) or for an already completed result. You need to handle these cases gracefully! A good implementation could be to create a new object and then iterate the input-documents copying all existing relevant fields with their values to the new object, so the resulting object is an amalgamation of the input documents.

Related

Updating nested List in mongoDB Query working sometimes but with large data set it fails [duplicate]

This question already has answers here:
Updating a Nested Array with MongoDB
(2 answers)
Closed 5 years ago.
Following is a MongoDB document:
{
"_id" : 2,
"mem_id" : M002,
"email" : "xyz#gmail.com",
"event_type" : [
{
"name" : "MT",
"count" : 1,
"language" : [
{
"name" : "English",
"count" : 1,
"genre" : [
{
"name" : "Action",
"count" : 6
},
{
"name" : "Sci-Fi",
"count" : 3
}
],
"cast" : [
{
"name" : "Sam Wortington",
"count" : 2
},
{
"name" : "Bruce Willis",
"count" : 4
},
{
"name" : "Will Smith",
"count" : 7
},
{
"name" : "Irfan Khan",
"count" : 1
}
]
}
]
}
]
}
I'm not able to update fields that is of type array, specially event_type, language, genre and cast because of nesting. Basically, I wanted to update all the four mentioned fields along with count field for each and subdocuments. The update statement should insert a value to the tree if the value is new else should increment the count for that value.
What can be the query in mongo shell?
Thanks
You are directly hitting one of the current limitations of MongoDB.
The problem is that the engine does not support several positional operators.
See this Multiple use of the positional `$` operator to update nested arrays
There is an open ticket for this: https://jira.mongodb.org/browse/SERVER-831 (mentioned also there)
You can also read this one on how to change your data model: Updating nested arrays in mongodb
If it is feasible for you, you can do:
db.collection.update({_id:2,"event_type.name":'MT' ,"event_type.language.name":'English'},{$set:{"event_type.0.language.$.count":<number>}})
db.collection.update({_id:2,"event_type.name":'MT' ,"event_type.language.name":'English'},{$set:{"event_type.$.language.0.count":<number>}})
But you cannot do:
db.collection.update({_id:2,"event_type.name":'MT' ,"event_type.language.name":'English'},{$set:{"event_type.$.language.$.count":<number>}})
Let's take case by case:
To update the field name in event_type array:
db.testnested.update({"event_type.name" : "MT"}, {$set : {"event_type.name" : "GMT"}})
This command will update the name for an object inside the event_type list, to GMT from MT:
BEFORE:
db.testnested.find({}, {"event_type.name" : 1})
{ "_id" : 2, "event_type" : [ { "name" : "MT" } ] }
AFTER:
db.testnested.find({}, {"event_type.name" : 1})
{ "_id" : 2, "event_type" : [ { "name" : "GMT" } ] }
2.To update fields inside event_type, such as language, genre that are intern list:
There is no direct query for this. You need to read the document, update that document using the JavaScript or language of your choice, and then save() the same. I dont think there is any other way available till mongo 2.4
For further documentation, you can refer to save().
Thanks!

Get specific object in array of array in MongoDB

I need get a specific object in array of array in MongoDB.
I need get only the task object = [_id = ObjectId("543429a2cb38b1d83c3ff2c2")].
My document (projects):
{
"_id" : ObjectId("543428c2cb38b1d83c3ff2bd"),
"name" : "new project",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b"),
"members" : [
ObjectId("5424ac37eb0ea85d4c921f8b")
],
"US" : [
{
"_id" : ObjectId("5434297fcb38b1d83c3ff2c0"),
"name" : "Test Story",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b"),
"tasks" : [
{
"_id" : ObjectId("54342987cb38b1d83c3ff2c1"),
"name" : "teste3",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b")
},
{
"_id" : ObjectId("543429a2cb38b1d83c3ff2c2"),
"name" : "jklasdfa_XXX",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b")
}
]
}
]
}
Result expected:
{
"_id" : ObjectId("543429a2cb38b1d83c3ff2c2"),
"name" : "jklasdfa_XXX",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b")
}
But i not getting it.
I still testing with no success:
db.projects.find({
"US.tasks._id" : ObjectId("543429a2cb38b1d83c3ff2c2")
}, { "US.tasks.$" : 1 })
I tryed with $elemMatch too, but return nothing.
db.projects.find({
"US" : {
"tasks" : {
$elemMatch : {
"_id" : ObjectId("543429a2cb38b1d83c3ff2c2")
}
}
}
})
Can i get ONLY my result expected using find()? If not, what and how use?
Thanks!
You will need an aggregation for that:
db.projects.aggregate([{$unwind:"$US"},
{$unwind:"$US.tasks"},
{$match:{"US.tasks._id":ObjectId("543429a2cb38b1d83c3ff2c2")}},
{$project:{_id:0,"task":"$US.tasks"}}])
should return
{ task : {
"_id" : ObjectId("543429a2cb38b1d83c3ff2c2"),
"name" : "jklasdfa_XXX",
"author" : ObjectId("5424ac37eb0ea85d4c921f8b")
}
Explanation:
$unwind creates a new (virtual) document for each array element
$match is the query part of your find
$project is similar as to project part in find i.e. it specifies the fields you want to get in the results
You might want to add a second $match before the $unwind if you know the document you are searching (look at performance metrics).
Edit: added a second $unwind since US is an array.
Don't know what you are doing (so realy can't tell and just sugesting) but you might want to examine if your schema (and mongodb) is ideal for your task because the document looks just like denormalized relational data probably a relational database would be better for you.

Ensure Unique indexes in embedded doc in mongodb

Is there a way to make a subdocument within a list have a unique field in mongodb?
document structure:
{
"_id" : "2013-08-13",
"hours" : [
{
"hour" : "23",
"file" : [
{
"date_added" : ISODate("2014-04-03T18:54:36.400Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.410Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.402Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.671Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
}
]
}
]
}
I want to make sure that the document's hours.hour value has a unique item when inserted. The issue is hours is a list. Can you ensureIndex in this way?
Indexes are not the tool for ensuring uniqueness in an embedded array, rather they are used across documents to ensure that certain fields do not repeat there.
As long as you can be certain that the content you are adding does not differ from any other value in any way then you can use the $addToSet operator with update:
db.collection.update(
{ "_id": "2013-08-13", "hour": 23 },
{ "$addToSet": {
"hours.$.file": {
"date_added" : ISODate("2014-04-03T18:54:36.671Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
}
}}
)
So that document would not be added as there is already an element matching those exact values within the target array. If the content was different (and that means any part of the content, then a new item would be added.
For anything else you would need to maintain that manually by loading up the document and inspecting the elements of the array. Say for a different "filename" with exactly the same timestamp.
Problems with your Schema
Now the question is answered I want to point out the problems with your schema design.
Dates as strings are "horrible". You may think you need them but you do not. See the aggregation framework date operators for more on this.
You have nested arrays, which generally should be avoided. The general problems are shown in the documentation for the positional $ operator. That says you only get one match on position, and that is always the "top" level array. So updating beyond adding things as shown above is going to be difficult.
A better schema pattern for you is to simply do this:
{
"date_added" : ISODate("2014-04-03T18:54:36.400Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.410Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.402Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.671Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
}
If that is in it's own collection then you can always actually use indexes to ensure uniqueness. The aggregation framework can break down the date parts and hours where needed.
Where you must have that as part of another document then try at least to avoid the nested arrays. This would be acceptable but not as flexible as separating the entries:
{
"_id" : "2013-08-13",
"hours" : {
"23": [
{
"date_added" : ISODate("2014-04-03T18:54:36.400Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.410Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.402Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
},
{
"date_added" : ISODate("2014-04-03T18:54:36.671Z"),
"name" : "1376434800_file_output_2014-03-10-09-27_44.csv"
}
]
}
}
It depends on your intended usage, the last would not allow you to do any type of aggregation comparison across hours within a day. Not in any simple way. The former does this easily and you can still break down selections by day and hour with ease.
Then again, if you are only ever appending information then your existing schema should be find. But be aware of the possible issues and alternatives.

How can I select a number of records per a specific field using mongodb?

I have a collection of documents in mongodb, each of which have a "group" field that refers to a group that owns the document. The documents look like this:
{
group: <objectID>
name: <string>
contents: <string>
date: <Date>
}
I'd like to construct a query which returns the most recent N documents for each group. For example, suppose there are 5 groups, each of which have 20 documents. I want to write a query which will return the top 3 for each group, which would return 15 documents, 3 from each group. Each group gets 3, even if another group has a 4th that's more recent.
In the SQL world, I believe this type of query is done with "partition by" and a counter. Is there such a thing in mongodb, short of doing N+1 separate queries for N groups?
You cannot do this using the aggregation framework yet - you can get the $max or top date value for each group but aggregation framework does not yet have a way to accumulate top N plus there is no way to push the entire document into the result set (only individual fields).
So you have to fall back on MapReduce. Here is something that would work, but I'm sure there are many variants (all require somehow sorting an array of objects based on a specific attribute, I borrowed my solution from one of the answers in this question.
Map function - outputs group name as a key and the entire rest of the document as the value - but it outputs it as a document containing an array because we will try to accumulate an array of results per group:
map = function () {
emit(this.name, {a:[this]});
}
The reduce function will accumulate all the documents belonging to the same group into one array (via concat). Note that if you optimize reduce to keep only the top five array elements by checking date then you won't need the finalize function, and you will use less memory during running mapreduce (it will also be faster).
reduce = function (key, values) {
result={a:[]};
values.forEach( function(v) {
result.a = v.a.concat(result.a);
} );
return result;
}
Since I'm keeping all values for each key, I need a finalize function to pull out only latest five elements per key.
final = function (key, value) {
Array.prototype.sortByProp = function(p){
return this.sort(function(a,b){
return (a[p] < b[p]) ? 1 : (a[p] > b[p]) ? -1 : 0;
});
}
value.a.sortByProp('date');
return value.a.slice(0,5);
}
Using a template document similar to one you provided, you run this by calling mapReduce command:
> db.top5.mapReduce(map, reduce, {finalize:final, out:{inline:1}})
{
"results" : [
{
"_id" : "group1",
"value" : [
{
"_id" : ObjectId("516f011fbfd3e39f184cfe13"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.498Z"),
"contents" : 0.23778377776034176
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0e"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.467Z"),
"contents" : 0.4434165076818317
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe09"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.436Z"),
"contents" : 0.5935856597498059
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe04"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.405Z"),
"contents" : 0.3912118375301361
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfdff"),
"name" : "group1",
"date" : ISODate("2013-04-17T20:07:59.372Z"),
"contents" : 0.221651989268139
}
]
},
{
"_id" : "group2",
"value" : [
{
"_id" : ObjectId("516f011fbfd3e39f184cfe14"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.504Z"),
"contents" : 0.019611883210018277
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0f"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.473Z"),
"contents" : 0.5670706110540777
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe0a"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.442Z"),
"contents" : 0.893193120136857
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe05"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.411Z"),
"contents" : 0.9496864483226091
},
{
"_id" : ObjectId("516f011fbfd3e39f184cfe00"),
"name" : "group2",
"date" : ISODate("2013-04-17T20:07:59.378Z"),
"contents" : 0.013748752186074853
}
]
},
{
"_id" : "group3",
...
}
]
}
],
"timeMillis" : 15,
"counts" : {
"input" : 80,
"emit" : 80,
"reduce" : 5,
"output" : 5
},
"ok" : 1,
}
Each result has _id as group name and values as array of most recent five documents from the collection for that group name.
you need aggregation framework $group stage piped in a $limit stage...
you want also to $sort the records in some ways or else the limit will have undefined behaviour, the returned documents will be pseudo-random (the order used internally by mongo)
something like that:
db.collection.aggregate([{$group:...},{$sort:...},{$limit:...}])
here there is the documentation if you want to know more

MongoDB: Create another collection from sub-array? (many-to-many relationship)

Suppose I have 'film' objects like the one below in my collection. Films have many actors, and actors belong to many films. Many-to-many.
How do I create another collection that consists of the unique 'actor' elements? Remember, some actors will be listed in more than one film.
{
"_id" : ObjectId("4edcffa5f320646bc8bd76b4"),
"directed_by" : [
"John Gilling"
],
"forbid:genre" : null,
"genre" : [ ],
"guid" : "#9202a8c04000641f8000000000b02e5d",
"id" : "/en/pirates_of_blood_river",
"initial_release_date" : "1962",
"name" : "Pirates of Blood River",
"starring" : [
{
"actor" : {
"guid" : "#9202a8c04000641f800000000006823e",
"id" : "/en/christopher_lee",
"name" : "Christopher Lee",
"lc_name" : "christopher lee"
}
},
{
"actor" : {
"guid" : "#9202a8c04000641f80000000001de22e",
"id" : "/en/oliver_reed",
"name" : "Oliver Reed",
"lc_name" : "oliver reed"
}
},
{
"actor" : {
"guid" : "#9202a8c04000641f80000000003b41da",
"id" : "/en/glenn_corbett",
"name" : "Glenn Corbett",
"lc_name" : "glenn corbett"
}
}
]
}
This can be done in the client app, but also via aggregation in mongodb - for example, you can run a mapreduce job with an output collection. In the map step, given a film document, you can emit key value pairs of ( actor.guid, { [other actor details] } ), and in the reduce step just return a single set of details (you could also count the number of films the actor was in at this point, if you wanted).
http://www.mongodb.org/display/DOCS/MapReduce has more info on the syntax.