MapReduce on a "Parent Links" tree in MongoDB

I have a collection of entities, which represents a tree. Each entity has a property containing an array of attributes.
For example:
{
"_id" : 1,
"parent_id" : null,
"attributes" : [ "A", "B", "C" ]
}
I would like to use MapReduce to generate another collection that mirrors the original, except that each item contains not only the attributes directly associated with the entity but also those of all its ancestors, all the way up to the root of the hierarchy.
So given the following entities:
{
"_id" : 1,
"parent_id" : null,
"attributes" : [ "A", "B", "C" ]
}
{
"_id" : 2,
"parent_id" : 1,
"attributes" : [ "D", "E", "F" ]
}
{
"_id" : 3,
"parent_id" : 2,
"attributes" : [ "G", "H", "I" ]
}
The result of the MapReduce job would be the following:
{
"_id" : 1,
"attributes" : [ "A", "B", "C" ]
}
{
"_id" : 2,
"attributes" : [ "A", "B", "C", "D", "E", "F" ]
}
{
"_id" : 3,
"attributes" : [ "A", "B", "C", "D", "E", "F", "G", "H", "I" ]
}
I've managed to produce MapReduce jobs that do simple things, like counting the attributes for each entity, but I can't get my head round how I might deal with a hierarchy. I am open to alternative ways of storing the data, but I don't want to store the whole hierarchy in a single document.
Is this kind of thing possible with MapReduce in MongoDB, or am I just thinking about the problem in the wrong way?

OK, so I don't think this will perform or scale well, because the map function has to recursively look up each parent document from the child node, one findOne call per ancestor. However, it does produce the output you want.
var mapFunc = function(doc, id) {
// if this is being invoked by mapReduce, it won't pass any parameters
if(doc == null) {
doc = this;
id = this._id;
} else if (doc.parent_id != null) {
// if this is a recursive call, find the parent
doc = db.test.findOne({_id:doc.parent_id});
}
// emit the id, which is always the id of the child node (starting point), and the attributes
emit(id, {attributes: doc.attributes});
// if parent_id is not null, call mapFunc with the hidden parameters
if(doc.parent_id != null) {
// recursive mapFunc call
mapFunc(doc, id);
}
}
// since we're going to call this from within mapReduce recursively, we have to save it in the system JS
db.system.js.save({ "_id" : "mapFunc", "value" : mapFunc});
var reduceFunc = function(key, values) {
var result = {attributes:[]};
values.forEach(function(value) {
// concat the result to the new values (I don't think order is guaranteed here)
result.attributes = value.attributes.concat(result.attributes);
});
return result;
}
// this just moves the attributes up a level
var finalize = function(key, value) {return value.attributes};
// quick test...
db.test.mapReduce(mapFunc, reduceFunc, {out: {inline: 1}, finalize: finalize});
Provides:
{
"results" : [
{
"_id" : 1,
"value" : [
"A",
"B",
"C"
]
},
{
"_id" : 2,
"value" : [
"A",
"B",
"C",
"D",
"E",
"F"
]
},
{
"_id" : 3,
"value" : [
"A",
"B",
"C",
"D",
"E",
"F",
"G",
"H",
"I"
]
}
],
"timeMillis" : 2,
"counts" : {
"input" : 3,
"emit" : 6,
"reduce" : 2,
"output" : 3
},
"ok" : 1,
}
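For comparison, the same ancestor walk can be sketched outside the database. This is a minimal sketch and not the MapReduce approach above: it assumes the whole collection fits in memory as an array of documents and that the parent links contain no cycles. The name `collectInheritedAttributes` is hypothetical.

```javascript
// Hypothetical helper: for every document, gather the attributes of all of
// its ancestors (root first), followed by the document's own attributes.
function collectInheritedAttributes(docs) {
  // Index documents by _id so each parent lookup is a single Map access.
  const byId = new Map(docs.map(function (doc) { return [doc._id, doc]; }));
  const cache = new Map(); // _id -> already computed attribute list

  function resolve(doc) {
    if (cache.has(doc._id)) return cache.get(doc._id);
    const parent = doc.parent_id == null ? null : byId.get(doc.parent_id);
    // Assumption: a document whose parent is missing is treated like a root.
    const inherited = parent ? resolve(parent) : [];
    const result = inherited.concat(doc.attributes);
    cache.set(doc._id, result);
    return result;
  }

  return docs.map(function (doc) {
    return { _id: doc._id, attributes: resolve(doc) };
  });
}

const tree = [
  { _id: 1, parent_id: null, attributes: ["A", "B", "C"] },
  { _id: 2, parent_id: 1, attributes: ["D", "E", "F"] },
  { _id: 3, parent_id: 2, attributes: ["G", "H", "I"] }
];
const flattened = collectInheritedAttributes(tree);
console.log(JSON.stringify(flattened));
```

Because each document's result is cached, every parent is resolved once, rather than once per descendant as in the recursive map function.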

Related

How to aggregate 2 list if at least one element matches?

For example, I have 6 items in collection
{ _id: 1, list: ["A", "B"] }
{ _id: 2, list: ["C", "A"] }
{ _id: 3, list: ["E", "F"] }
{ _id: 4, list: ["E", "D"] }
{ _id: 5, list: ["U", "I"] }
{ _id: 6, list: ["D", "K"] }
I would like a query that merges all the items whose lists share at least one element. So the result would be:
{ _id: 7, list: ["A", "B", "C"] }
{ _id: 8, list: ["E", "F", "D", "K"] }
I'm new to MongoDB, so could anyone help me with this query? Thanks a lot.
I found this solution which almost solves your problem.
db.lists.aggregate([
{$unwind:"$list"},
{$group:{_id:"$list", merged:{$addToSet:"$_id"}, size:{$sum:1}}},
{$match:{size: {$gt: 1}}},
{$project:{_id: 1, merged:1, size: 1, merged1: "$merged"}},
{$unwind:"$merged"},
{$unwind:"$merged1"},
{$group:{_id:"$merged", letter:{$first:"$_id"}, size:{$sum: 1}, set: {$addToSet:"$merged1"}}},
{$sort:{size:1}},
{$group:{_id: "$letter", matchedIds:{$last:"$set"}, size:{$sum:1}}},
{$match: {size:{$gt:1}}}
])
I have tested this in my mongo shell which gives the following output:
{ "_id" : "E", "matchedIds" : [ 6, 3, 4 ], "size" : 2 }
{ "_id" : "A", "matchedIds" : [ 1, 2 ], "size" : 2 }
The matchedIds field holds the ids of the documents that have a common value in their list arrays.
I think the aggregation above could be optimized, but this is what I found first; I will try to find other ways. In addition, you can add a $lookup stage at the end of the pipeline to match the ids with the set values. I couldn't test this because my MongoDB version doesn't support $lookup, but you can fetch those values manually in a loop if you use Node.js or something similar.
Edited
This algorithm will only work if the number of intersecting lists for each list is no more than 3.
For example this will work:
{ "_id" : 1, "list" : [ "A", "B" ] }
{ "_id" : 2, "list" : [ "C", "A" ] }
{ "_id" : 3, "list" : [ "E", "F" ] }
{ "_id" : 4, "list" : [ "E", "D" ] }
{ "_id" : 5, "list" : [ "U", "I" ] }
{ "_id" : 6, "list" : [ "D", "K" ] }
{ "_id" : 7, "list" : [ "A", "L" ] }
but this will not:
{ "_id" : 1, "list" : [ "A", "B" ] }
{ "_id" : 2, "list" : [ "C", "A" ] }
{ "_id" : 3, "list" : [ "E", "F" ] }
{ "_id" : 4, "list" : [ "E", "D" ] }
{ "_id" : 5, "list" : [ "U", "I" ] }
{ "_id" : 6, "list" : [ "D", "K" ] }
{ "_id" : 7, "list" : [ "L", "K" ] }
Here the lists with ids 7, 6, 4, and 3 are linked by shared elements, so the number of intersecting lists is 4, and in this case the algorithm above will not work. It works only if fewer than 4 lists intersect for each list.
Final notice
It seems you can't achieve your desired result by doing the merge computation in the MongoDB database layer. If you are building an application, it would be better to do this computation in the application layer.
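As an illustration of doing the merge in the application layer, here is a minimal sketch that treats the problem as finding connected groups of documents with a union-find structure. It is not part of the aggregation above; it assumes the documents have already been fetched into an array, and `mergeOverlappingLists` is a hypothetical name.

```javascript
// Hypothetical helper: merge documents whose lists share at least one
// element, directly or through a chain of other documents.
function mergeOverlappingLists(docs) {
  // Union-find over document indexes; each index starts as its own root.
  const parent = docs.map(function (_, i) { return i; });
  function find(i) {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]]; // path halving keeps chains short
      i = parent[i];
    }
    return i;
  }
  function union(a, b) {
    const ra = find(a), rb = find(b);
    if (ra !== rb) parent[Math.max(ra, rb)] = Math.min(ra, rb);
  }

  // Link each document to the first document that contained the same element.
  const firstSeen = new Map();
  docs.forEach(function (doc, i) {
    doc.list.forEach(function (item) {
      if (firstSeen.has(item)) union(i, firstSeen.get(item));
      else firstSeen.set(item, i);
    });
  });

  // Collect the ids and the de-duplicated elements of every group.
  const groups = new Map();
  docs.forEach(function (doc, i) {
    const root = find(i);
    if (!groups.has(root)) groups.set(root, { ids: [], list: [] });
    const group = groups.get(root);
    group.ids.push(doc._id);
    doc.list.forEach(function (item) {
      if (group.list.indexOf(item) === -1) group.list.push(item);
    });
  });

  // Keep only groups that merged two or more documents.
  return Array.from(groups.values()).filter(function (group) {
    return group.ids.length > 1;
  });
}

const lists = [
  { _id: 1, list: ["A", "B"] },
  { _id: 2, list: ["C", "A"] },
  { _id: 3, list: ["E", "F"] },
  { _id: 4, list: ["E", "D"] },
  { _id: 5, list: ["U", "I"] },
  { _id: 6, list: ["D", "K"] }
];
const merged = mergeOverlappingLists(lists);
console.log(JSON.stringify(merged));
```

Unlike the pipeline, this sketch has no limit on how many lists may be chained together, so adding { _id: 7, list: ["L", "K"] } simply extends the second group.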

Is there a `$slice` like comparison for MongoDB's filters?

In MongoDB there is a projection operator $slice which allows projecting a subarray.
Is there any way to filter by an array slice as well? Something like:
db.testdb.find( {arrayofstring: { $eqSlice: {$slice: [0,1], $val: [ "a" ] } } }, {...})
Edit: An example and its expected output
> db.studentsTestDataTypes.find({},{ _id: 1, int: 1, arraystring: 1})
{ "_id" : ObjectId("56977186756088b586154f9d"), "int" : 2001, "arraystring" : [ "a", "b", "c" ] }
{ "_id" : ObjectId("56977186756088b586154f9e"), "int" : 2002, "arraystring" : [ "d", "e", "f" ] }
Example of expected result: Filtering by those entries with value "a" at the first position of arraystring:
{ "_id" : ObjectId("56977186756088b586154f9d"), "int" : 2001, "arraystring" : [ "a", "b", "c" ] }
Suppose you have the following documents in your collection:
{ "_id" : ObjectId("56977186756088b586154f9d"), "int" : 2001, "arraystring" : [ "a", "b", "c" ] }
{ "_id" : ObjectId("56977186756088b586154f9e"), "int" : 2002, "arraystring" : [ "d", "e", "f" ] }
{ "_id" : ObjectId("56978e21ae9bb55c0d7cdc67"), "int" : 2001, "arraystring" : [ "b", "a", "c" ] }
The easiest and best way is to use dot notation:
db.collection.find({ "arraystring.0": "a" } )
Which yields:
{
"_id" : ObjectId("56977186756088b586154f9d"),
"int" : 2001,
"arraystring" : [
"a",
"b",
"c"
]
}
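The same dot notation extends to a slice longer than one element by adding one condition per position. The following is a minimal sketch of building such a filter; `sliceEqualsFilter` is a hypothetical helper name, not a MongoDB API, and it only constructs the filter object.

```javascript
// Hypothetical helper: build a filter that requires `values` to appear in
// `field` starting at index `start`, one dot-notation condition per position.
function sliceEqualsFilter(field, start, values) {
  const filter = {};
  values.forEach(function (value, offset) {
    filter[field + "." + (start + offset)] = value;
  });
  return filter;
}

const filter = sliceEqualsFilter("arraystring", 0, ["a", "b"]);
console.log(JSON.stringify(filter));
// prints {"arraystring.0":"a","arraystring.1":"b"}
```

In the mongo shell the resulting object could be passed to find, for example db.collection.find(filter).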

Aggregation on the basis of the set of nested docs

Let's say I have the next 5 docs:
{ "_id" : "1", "student" : "Oscar", "courses" : [ "A", "B" ] }
{ "_id" : "2", "student" : "Alan", "courses" : [ "A", "B", "C" ] }
{ "_id" : "3", "student" : "Kate", "courses" : [ "A", "B", "D" ] }
{ "_id" : "4", "student" : "John", "courses" : [ "A", "B", "C" ] }
{ "_id" : "5", "student" : "Bema", "courses" : [ "A", "B" ] }
I want to query the collection so that it returns the students (by their _id) grouped by the set (combination) of courses they take, along with the number of students in each set.
In the example above there are 3 sets (combinations) of courses, with the following numbers of students:
1 - [ "A", "B" ] <- 2 students take this combination
2 - [ "A", "B", "C" ] <- 2 students
3 - [ "A", "B", "D" ] <- 1 student
I feel like this is more like MapReduce task rather than Aggregation...not sure...
UPDATE 1
Thanks a lot to @ExplosionPills
So the following aggregation command:
db.students.aggregate([{
$group: {
_id: "$courses",
count: {$sum: 1},
students: {$push: "$_id"}
}
}])
gives me the following output:
{ "_id" : [ "A", "B", "D" ], "count" : 1, "students" : [ "3" ] }
{ "_id" : [ "A", "B", "C" ], "count" : 2, "students" : [ "2", "4" ] }
{ "_id" : [ "A", "B" ], "count" : 2, "students" : [ "1", "5" ] }
It groups by set of courses and returns, for each set, the number of students who take it and their _ids.
UPDATE 2
I found out that the aggregation above treats the combination [ "C", "A", "B" ] as different from [ "A", "B", "C" ]. But I need these two to count as the same.
So let's look at the following documents:
{ "_id" : "1", "student" : "Oscar", "courses" : [ "A", "B" ] }
{ "_id" : "2", "student" : "Alan", "courses" : [ "A", "B", "C" ] }
{ "_id" : "3", "student" : "Kate", "courses" : [ "A", "B", "D" ] }
{ "_id" : "4", "student" : "John", "courses" : [ "A", "B", "C" ] }
{ "_id" : "5", "student" : "Bema", "courses" : [ "A", "B" ] }
{ "_id" : "6", "student" : "Alex", "courses" : [ "C", "A", "B" ] }
Let's see this in output:
{ "_id" : [ "C", "A", "B" ], "count" : 1, "students" : [ "6" ] }
{ "_id" : [ "A", "B", "D" ], "count" : 1, "students" : [ "3" ] }
{ "_id" : [ "A", "B", "C" ], "count" : 2, "students" : [ "2", "4" ] }
{ "_id" : [ "A", "B" ], "count" : 2, "students" : [ "1", "5" ] }
See lines 1 and 3 of the output: this is not what I wanted.
So, to treat [ "C", "A", "B" ] and [ "A", "B", "C" ] as the same combination, I changed the aggregation as follows:
db.students.aggregate([
{$unwind: "$courses" },
{$sort : {"courses": 1}},
{$group: {_id: "$_id", courses: {$push: "$courses"}}},
{$group: {_id: "$courses", count: {$sum:1}, students: {$push: "$_id"}}}
])
Output:
{ "_id" : [ "A", "B", "D" ], "count" : 1, "students" : [ "3" ] }
{ "_id" : [ "A", "B" ], "count" : 2, "students" : [ "5", "1" ] }
{ "_id" : [ "A", "B", "C" ], "count" : 3, "students" : [ "6", "4", "2" ] }
This is an aggregate operation using grouping.
db.students.aggregate([{
$group: {
// The group key: documents with the same courses array fall into one group.
// The $ prefix refers to the value of that field in each document.
_id: "$courses",
// Add 1 for each document in the group (effectively a counter)
count: {$sum: 1}
}
}]);
EDIT:
If the courses can be in any order, you can $unwind, $sort, and $group again as suggested in the edited question. It's also possible to do this via mapReduce, but I'm not sure which is faster.
db.students.mapReduce(
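function () {
// Use the sorted courses as the key, and emit a value in the same
// shape that reduce returns (reduce is skipped for keys with one value).
emit(this.courses.sort(), {students: [this._id], count: 1});
},
function (key, values) {
// reduce may be called again on its own output, so merge partial results
var result = {students: [], count: 0};
values.forEach(function (value) {
result.students = result.students.concat(value.students);
result.count += value.count;
});
return result;
},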
{out: {inline: 1}}
)
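To make the order-insensitive grouping concrete, here is a minimal in-memory sketch of the same idea: sort each courses array, then group on the sorted result. It runs on a plain array rather than on the server, and `groupByCourseSet` is a hypothetical name.

```javascript
// Hypothetical helper: group students by their set of courses, ignoring the
// order in which the courses are stored.
function groupByCourseSet(students) {
  const groups = new Map();
  students.forEach(function (doc) {
    // Copy before sorting so the input documents are left untouched.
    const sorted = doc.courses.slice().sort();
    const key = JSON.stringify(sorted); // Map keys need a primitive value
    if (!groups.has(key)) {
      groups.set(key, { courses: sorted, count: 0, students: [] });
    }
    const group = groups.get(key);
    group.count += 1;
    group.students.push(doc._id);
  });
  return Array.from(groups.values());
}

const students = [
  { _id: "1", student: "Oscar", courses: ["A", "B"] },
  { _id: "2", student: "Alan", courses: ["A", "B", "C"] },
  { _id: "3", student: "Kate", courses: ["A", "B", "D"] },
  { _id: "4", student: "John", courses: ["A", "B", "C"] },
  { _id: "5", student: "Bema", courses: ["A", "B"] },
  { _id: "6", student: "Alex", courses: ["C", "A", "B"] }
];
const courseGroups = groupByCourseSet(students);
console.log(JSON.stringify(courseGroups));
```

With the six documents from the question, Alex's [ "C", "A", "B" ] lands in the same group as Alan and John.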

Count and group by with mongo db

I'm currently facing a problem with MongoDB.
I need to display some statistics:
- A treatment is a record that contains a date, the user who performed the treatment, and a list of anomalies
Can you help me with the query to get:
"The number of anomalies per user"?
Thanks in advance :D
db.treatment.aggregate(
{
$group : {_id : "$anomalies", totalUser : { $sum : 1 }}
}
);
Note: change the collection and field names if I got them wrong.
Source : http://www.mkyong.com/mongodb/mongodb-aggregate-and-group-example/
So, if your collection had the following documents:
> db.treatments.find()
{ "_id" : 1, "date" : ISODate("2014-08-29T15:44:45.843Z"), "user" : "A", "anomalies" : [ "a", "b", "c" ] }
{ "_id" : 2, "date" : ISODate("2014-08-29T15:45:01.782Z"), "user" : "A", "anomalies" : [ "e", "f", "g" ] }
{ "_id" : 3, "date" : ISODate("2014-08-29T15:45:34.889Z"), "user" : "B", "anomalies" : [ "a", "b", "c", "e", "f", "g" ] }
{ "_id" : 4, "date" : ISODate("2014-08-29T15:48:01.860Z"), "user" : "B", "anomalies" : [ "a", "b", "c", "e", "f", "g" ] }
{ "_id" : 5, "date" : ISODate("2014-08-29T15:48:28.937Z"), "user" : "A", "anomalies" : [ "x", "y", "z" ] }
You can use a $group stage to $sum the $size of the anomalies array for each user:
> db.treatments.aggregate([ { $group: { _id: "$user", allAnomalies: { $sum: { $size: "$anomalies" } } } } ] )
{ "_id" : "B", "allAnomalies" : 12 }
{ "_id" : "A", "allAnomalies" : 9 }
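For readers who want to check the arithmetic, the following is a minimal in-memory sketch of what that $group stage computes: the sum of the anomalies array lengths per user. It is plain JavaScript on an array of documents, dates are omitted because they do not affect the count, and `countAnomaliesByUser` is a hypothetical name.

```javascript
// Hypothetical helper: total number of anomalies per user, mirroring
// { $group: { _id: "$user", allAnomalies: { $sum: { $size: "$anomalies" } } } }.
function countAnomaliesByUser(treatments) {
  const totals = {};
  treatments.forEach(function (treatment) {
    const previous = totals[treatment.user] || 0;
    totals[treatment.user] = previous + treatment.anomalies.length;
  });
  return totals;
}

const treatments = [
  { _id: 1, user: "A", anomalies: ["a", "b", "c"] },
  { _id: 2, user: "A", anomalies: ["e", "f", "g"] },
  { _id: 3, user: "B", anomalies: ["a", "b", "c", "e", "f", "g"] },
  { _id: 4, user: "B", anomalies: ["a", "b", "c", "e", "f", "g"] },
  { _id: 5, user: "A", anomalies: ["x", "y", "z"] }
];
const anomalyTotals = countAnomaliesByUser(treatments);
console.log(JSON.stringify(anomalyTotals));
// prints {"A":9,"B":12}
```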

MongoDB with flexible fields. How to find all records with specific field name?

I have a schema like this:
{
"_id" : ObjectId("50ec1d93ba02ece1979ee4a5"),
"url" : "google.com",
"results" : [
{ "1357651347" : { "data1" : "a", "data2" : "b", "data3" : "c" }},
{ "1357651706" : { "data1" : "d", "data2" : "e", "data3" : "f" }},
{ "1357651772" : { "data1" : "g", "data2" : "h", "data3" : "i" }}
]
}
I'm interested in the results with id 1357651706. How do I get them (in PHP)?
You can check if something exists or you can check if something is null (or not).
So for $exists ( http://docs.mongodb.org/manual/reference/operator/exists/ ):
db.col.find({"results.1357651706": {$exists:true}})
And for checking if something is not null:
db.col.find({ "results.1357651706": {$ne: null} })
Note: It is normally better to run the null query the other way around, checking whether something is null, and then make the decision in your app. That way you can also use a sparse index for the query to make it leaner.
+1 to Sammaye's answer, but consider reworking your schema to get rid of the dynamic field names which make queries like this awkward.
Something like this instead:
{
"_id" : ObjectId("50ec1d93ba02ece1979ee4a5"),
"url" : "google.com",
"results" : [
{ id: 1357651347, "data1" : "a", "data2" : "b", "data3" : "c" },
{ id: 1357651706, "data1" : "d", "data2" : "e", "data3" : "f" },
{ id: 1357651772, "data1" : "g", "data2" : "h", "data3" : "i" }
]
}
Then you can query for the doc containing the result you're looking for like this:
db.col.find({'results.id': 1357651706})
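If existing documents need to be moved to the reworked schema, the conversion could look roughly like this. It is a minimal sketch on a plain object, assuming every entry in results is an object whose keys are numeric timestamps as in the question; `normalizeResults` is a hypothetical name and the _id field is left out of the sample for brevity.

```javascript
// Hypothetical helper: turn { "<timestamp>": { ...data } } entries into
// { id: <timestamp>, ...data } entries so they can be queried by results.id.
function normalizeResults(doc) {
  const results = [];
  doc.results.forEach(function (entry) {
    Object.keys(entry).forEach(function (key) {
      // Assumption: the dynamic key is a numeric id stored as a string.
      results.push(Object.assign({ id: Number(key) }, entry[key]));
    });
  });
  // Return a copy; the original document is not modified.
  return Object.assign({}, doc, { results: results });
}

const original = {
  url: "google.com",
  results: [
    { "1357651347": { data1: "a", data2: "b", data3: "c" } },
    { "1357651706": { data1: "d", data2: "e", data3: "f" } },
    { "1357651772": { data1: "g", data2: "h", data3: "i" } }
  ]
};
const normalized = normalizeResults(original);
console.log(JSON.stringify(normalized.results[1]));
// prints {"id":1357651706,"data1":"d","data2":"e","data3":"f"}
```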