MongoDB MapReduce - How to populate an array in reduce function?

I have a MovieRatings database with columns userId, movieId, movie-categoryId, reviewId, movieRating and reviewDate.
In my mapper I want to extract userId -> (movieId, movieRating).
Then in the reducer I want to group all (movieId, movieRating) pairs by user.
Here is my attempt:
Map function:
var map = function() {
var values={movieId : this.movieId, movieRating : this.movieRating};
emit(this.userId, values);}
Reduce function:
var reduce = function(key,values) {
var ratings = [];
values.forEach(function(V){
var temp = {movieId : V.movieId, movieRating : V.movieRating};
Array.prototype.push.apply(ratings, temp);
});
return {userId : key, ratings : ratings };
}
Run MapReduce:
db.ratings.mapReduce(map, reduce, { out: "map_reduce_step1" })
Output: db.map_reduce_step1.find()
{ "_id" : 1, "value" : { "userId" : 1, "ratings" : [ ] } }
{ "_id" : 2, "value" : { "userId" : 2, "ratings" : [ ] } }
{ "_id" : 3, "value" : { "userId" : 3, "ratings" : [ ] } }
{ "_id" : 4, "value" : { "userId" : 4, "ratings" : [ ] } }
{ "_id" : 5, "value" : { "userId" : 5, "ratings" : [ ] } }
{ "_id" : 6, "value" : { "userId" : 6, "ratings" : [ ] } }
{ "_id" : 7, "value" : { "userId" : 7, "ratings" : [ ] } }
{ "_id" : 8, "value" : { "userId" : 8, "ratings" : [ ] } }
{ "_id" : 9, "value" : { "userId" : 9, "ratings" : [ ] } }
{ "_id" : 10, "value" : { "userId" : 10, "ratings" : [ ] } }
{ "_id" : 11, "value" : { "userId" : 11, "ratings" : [ ] } }
{ "_id" : 12, "value" : { "userId" : 12, "ratings" : [ ] } }
{ "_id" : 13, "value" : { "userId" : 13, "ratings" : [ ] } }
{ "_id" : 14, "value" : { "userId" : 14, "ratings" : [ ] } }
{ "_id" : 15, "value" : { "movieId" : 1, "movieRating" : 3 } }
{ "_id" : 16, "value" : { "userId" : 16, "ratings" : [ ] } }
I am not getting the expected output. In fact, this output makes no sense to me!
Here is the python equivalent of what I am trying to do in the reducer (just in case the purpose of reducer wasn't clear above) :
def reducer_ratings_by_user(self, user_id, itemRatings):
#Group (item, rating) pairs by userID
ratings = []
for movieID, rating in itemRatings:
ratings.append((movieID, rating))
yield user_id, ratings
Edit 1 #chridam
Here is an outline of what I really want to do here :
Movies.csv file looks like :
userId,movieId,movie-categoryId,reviewId,movieRating,reviewDate
1,1,1,1,5,7/12/2000
2,1,1,2,5,7/12/2000
3,1,1,3,5,7/12/2000
4,1,1,4,4,7/12/2000
5,1,1,5,4,7/12/2000
6,1,1,6,5,7/15/2000
1,2,1,7,4,7/25/2000
8,1,1,8,4,7/28/2000
9,1,1,9,3,8/3/2000
...
...
I import this into mongoDB :
mongoimport --db SomeName --collection ratings --type csv --headerline --file Movies.csv
Then I am trying to apply the map-reduce functions as defined above. After that I will export it back to a csv by doing something like :
mongoexport --db SomeName --collection map_reduce_step1 --csv --out movie_ratings_out.csv --fields ...
This movie_ratings_out.csv file should be like :
userId, movieId1, rating1, movieId2, rating2 ,...
1,1,5,2,4
...
...
So each row contains all the (movie, rating) pairs for one user.
Edit 2
Sample :
db.ratings.find().pretty()
{
"_id" : ObjectId("57f4a0dd9cb74fc4d344a40f"),
"userId" : 4,
"movieId" : 1,
"movie-categoryId" : 1,
"reviewId" : 4,
"movieRating" : 4,
"reviewDate" : "7/12/2000"
}
{
"_id" : ObjectId("57f4a0dd9cb74fc4d344a410"),
"userId" : 5,
"movieId" : 1,
"movie-categoryId" : 1,
"reviewId" : 5,
"movieRating" : 4,
"reviewDate" : "7/12/2000"
}
{
"_id" : ObjectId("57f4a0dd9cb74fc4d344a411"),
"userId" : 4,
"movieId" : 2,
"movie-categoryId" : 1,
"reviewId" : 6,
"movieRating" : 5,
"reviewDate" : "7/15/2000"
}
{
"_id" : ObjectId("57f4a0dd9cb74fc4d344a412"),
"userId" : 4,
"movieId" : 3,
"movie-categoryId" : 1,
"reviewId" : 2,
"movieRating" : 5,
"reviewDate" : "7/12/2000"
}
...
Then after MapReduce expected output json is :
{
"_id" : ....,
"userId" : 4,
"movieList" : [ {
"movieId" : 2
"movieRating" : 5
},
{
"movieId" : 1
"movieRating" : 4
}
...
]
}
{
"_id" : ....,
"userId" : 5,
"movieList" : ...
}
...

You just need to run an aggregation pipeline containing a $group stage, which summarizes documents: it groups the input documents by a specified identifier expression and applies the accumulator expression(s) to each group. The $group stage is similar to SQL's GROUP BY clause. Just as GROUP BY in SQL is normally combined with aggregate functions, $group in MongoDB is combined with accumulator operators. You can read more about the aggregation functions here.
The accumulator operator you would need to create the movieList array is $push.
The stage that follows $group is $project, which selects or reshapes each document in the stream: it can include, exclude or rename fields, inject computed fields and create sub-document fields, using mathematical, date, string and/or logical (comparison, boolean, control) expressions. This is similar to what you would do with the SQL SELECT clause.
The last step is the $out stage, which writes the documents produced by the aggregation pipeline to a collection. It must be the last stage in the pipeline.
So as a result, you can run the following aggregate operation:
db.ratings.aggregate([
{
"$group": {
"_id": "$userId",
"movieList": {
"$push": {
"movieId": "$movieId",
"movieRating": "$movieRating",
}
}
}
},
{
"$project": {
"_id": 0, "userId": "$_id", "movieList": 1
}
},
{ "$out": "movie_ratings_out" }
])
Using the four sample documents above, querying db.getCollection('movie_ratings_out').find({}) would yield:
/* 1 */
{
"_id" : ObjectId("57f52636b9c3ea346ab1d399"),
"movieList" : [
{
"movieId" : 1.0,
"movieRating" : 4.0
}
],
"userId" : 5.0
}
/* 2 */
{
"_id" : ObjectId("57f52636b9c3ea346ab1d39a"),
"movieList" : [
{
"movieId" : 1.0,
"movieRating" : 4.0
},
{
"movieId" : 2.0,
"movieRating" : 5.0
},
{
"movieId" : 3.0,
"movieRating" : 5.0
}
],
"userId" : 4.0
}
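As for why the original reduce function returned empty arrays: Array.prototype.push.apply(ratings, temp) reads its second argument as an array-like list of arguments, and a plain object such as temp has no length, so nothing is pushed. In addition, MongoDB does not call reduce for a key that has only one value (which is why _id 15 holds the raw emitted value), and it may call reduce again on earlier reduce output, so the value returned by reduce should have the same shape as the value emitted by map. Below is a minimal sketch under those constraints. The names mapRatings and reduceRatings are chosen here for illustration, and the code has only been checked locally in plain JavaScript, not against the original data.

```javascript
// Part 1: why the array stayed empty (plain JavaScript, no database needed).
var ratings = [];
var temp = { movieId: 1, movieRating: 4 };
// apply() reads temp as an array-like argument list; it has no length,
// so push() is called with zero arguments.
Array.prototype.push.apply(ratings, temp);
var lengthAfterApply = ratings.length;
// Pushing the object directly appends it as a single element.
ratings.push(temp);
var lengthAfterPush = ratings.length;

// Part 2: a map/reduce pair whose emitted and reduced values share one shape.
// mapRatings and reduceRatings are illustrative names, not from the question.
var mapRatings = function () {
  // emit() is supplied by MongoDB when the function runs inside mapReduce.
  emit(this.userId, {
    ratings: [{ movieId: this.movieId, movieRating: this.movieRating }]
  });
};

var reduceRatings = function (key, values) {
  var merged = [];
  values.forEach(function (v) {
    // Each value carries an array, whether it came from map or from reduce.
    merged = merged.concat(v.ratings);
  });
  return { ratings: merged };
};

// Local check of the re-reduce property, using user 4's sample ratings.
var a = { ratings: [{ movieId: 1, movieRating: 4 }] };
var b = { ratings: [{ movieId: 2, movieRating: 5 }] };
var c = { ratings: [{ movieId: 3, movieRating: 5 }] };
var oneStep = reduceRatings(4, [a, b, c]);
var twoStep = reduceRatings(4, [reduceRatings(4, [a, b]), c]);
```

With a collection available, these would be passed as db.ratings.mapReduce(mapRatings, reduceRatings, { out: "map_reduce_step1" }), as in the question. Reducing in one step or in two steps gives the same array, which is the property MongoDB relies on.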

Related

Mongo aggregation - Sorting using a field value from previous pipeline as the sort field

I have produced the below output using mongodb aggregation (including $group pipeline inside levelsCount field) :
{
"_id" : "1",
"name" : "First",
"levelsCount" : [
{ "_id" : "level_One", "levelNum" : 1, "count" : 1 },
{ "_id" : "level_Three", "levelNum" : 3, "count" : 1 },
{ "_id" : "level_Four", "levelNum" : 4, "count" : 8 }
]
}
{
"_id" : "2",
"name" : "Second",
"levelsCount" : [
{ "_id" : "level_One", "levelNum" : 1, "count" : 5 },
{ "_id" : "level_Two", "levelNum" : 2, "count" : 2 },
{ "_id" : "level_Three", "levelNum" : 3, "count" : 1 },
{ "_id" : "level_Four", "levelNum" : 4, "count" : 3 }
]
}
{
"_id" : "3",
"name" : "Third",
"levelsCount" : [
{ "_id" : "level_One", "levelNum" : 1, "count" : 1 },
{ "_id" : "level_Two", "levelNum" : 2, "count" : 3 },
{ "_id" : "level_Three", "levelNum" : 3, "count" : 2 },
{ "_id" : "level_Four", "levelNum" : 4, "count" : 3 }
]
}
Now, I need to sort these documents based on the levelNum and count fields of the levelsCount array elements. I.e. if two documents both had the count 5 for levelNum: 1 (level_One), then the sort goes on to compare the count of levelNum: 2 (level_Two), and so on.
I see how $sort pipeline would work on multiple fields (Something like { $sort : { level_One : 1, level_Two: 1 } }), But the problem is how to access those values of levelNum of each array element and set that value as a field name to do sorting on that. (I couldn't handle it even after $unwinding the levelsCount array).
P.s: The initial order of levelsCount array's elements may differ on each document and is not important.
Edit:
The expected output of the above structure would be:
// Sorted result:
{
"_id" : "2",
"name" : "Second",
"levelsCount" : [
{ "_id" : "level_One", "levelNum" : 1, "count" : 5 }, // "level_One's count: 5" is greater than "level_One's count: 1" in two other documents, regardless of other level_* fields. Therefore this whole document with "name: Second" is ordered first.
{ "_id" : "level_Two", "levelNum" : 2, "count" : 2 },
{ "_id" : "level_Three", "levelNum" : 3, "count" : 1 },
{ "_id" : "level_Four", "levelNum" : 4, "count" : 3 }
]
}
{
"_id" : "3",
"name" : "Third",
"levelsCount" : [
{ "_id" : "level_One", "levelNum" : 1, "count" : 1 },
{ "_id" : "level_Two", "levelNum" : 2, "count" : 3 }, // "level_Two's count" in this document exists with value (3) while the "level_Two" doesn't exist in the below document which mean (0) value for count. So this document with "name: Third" is ordered higher than the below document.
{ "_id" : "level_Three", "levelNum" : 3, "count" : 2 },
{ "_id" : "level_Four", "levelNum" : 4, "count" : 3 }
]
}
{
"_id" : "1",
"name" : "First",
"levelsCount" : [
{ "_id" : "level_One", "levelNum" : 1, "count" : 1 },
{ "_id" : "level_Three", "levelNum" : 3, "count" : 1 },
{ "_id" : "level_Four", "levelNum" : 4, "count" : 8 }
]
}
Of course, I'd prefer to have an output document in the below format, But the first problem is to sort all docs:
{
"_id" : "1",
"name" : "First",
"levelsCount" : [
{ "level_One" : 1 },
{ "level_Three" : 1 },
{ "level_Four" : 8 }
]
}
You can sort by levelNum in descending order and by count in ascending order:
db.collection.aggregate([
{
$sort: {
"levelsCount.levelNum": -1,
"levelsCount.count": 1
}
}
])
For a key-value format of the levelsCount array:
use $map to iterate over the levelsCount array,
build a key-value pair array for each element and convert it to an object using $arrayToObject.
{
$addFields: {
levelsCount: {
$map: {
input: "$levelsCount",
in: {
$arrayToObject: [
[{ k: "$$this._id", v: "$$this.count" }]
]
}
}
}
}
}
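Note that a $sort on a field inside an array compares documents by a single array element (the smallest for ascending, the largest for descending), so it does not by itself compare documents level by level. The ordering rule described in the question (compare counts level by level, higher count first, a missing level counting as 0) can be sketched in plain JavaScript to check the expected result. This is only an illustration of the rule, not a MongoDB pipeline. The helper names countsByLevel and compareDocs and the fixed maximum of four levels are assumptions made here.

```javascript
// Sample documents from the question, reduced to the fields used for sorting.
var docs = [
  { name: "First", levelsCount: [
    { levelNum: 1, count: 1 }, { levelNum: 3, count: 1 }, { levelNum: 4, count: 8 }
  ] },
  { name: "Second", levelsCount: [
    { levelNum: 1, count: 5 }, { levelNum: 2, count: 2 },
    { levelNum: 3, count: 1 }, { levelNum: 4, count: 3 }
  ] },
  { name: "Third", levelsCount: [
    { levelNum: 1, count: 1 }, { levelNum: 2, count: 3 },
    { levelNum: 3, count: 2 }, { levelNum: 4, count: 3 }
  ] }
];

var MAX_LEVEL = 4; // assumption: levels are numbered 1 to 4

// Build [countForLevel1, ..., countForLevel4], using 0 for missing levels.
function countsByLevel(doc) {
  var counts = [];
  for (var n = 0; n < MAX_LEVEL; n++) {
    counts.push(0);
  }
  doc.levelsCount.forEach(function (level) {
    counts[level.levelNum - 1] = level.count;
  });
  return counts;
}

// Compare level by level; the document with the higher count comes first.
function compareDocs(a, b) {
  var ca = countsByLevel(a);
  var cb = countsByLevel(b);
  for (var i = 0; i < MAX_LEVEL; i++) {
    if (ca[i] !== cb[i]) {
      return cb[i] - ca[i];
    }
  }
  return 0;
}

var sortedNames = docs.slice().sort(compareDocs).map(function (d) {
  return d.name;
});
// sortedNames is ["Second", "Third", "First"], matching the expected output.
```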

How do I calculate a field in all documents based on a value of a particular document in the same collection?

I am new to MongoDb and trying to achieve some basic calculation in it. I have collection, calc, as below
{ "_id" : 1, "value" : 10}
{ "_id" : 2, "value" : 20}
{ "_id" : 3, "value" : 20}
{ "_id" : 4, "value" : 30}
{ "_id" : 5, "value" : 30}
{ "_id" : 6, "value" : 30}
I want to add the value of "_id":1 to all value field of the documents in that collection and create a new field with the calculated result. So the final result I am looking for is as below.
{ "_id" : 1, "value" : 10, "sumup":20 }
{ "_id" : 2, "value" : 20, "sumup":30 }
{ "_id" : 3, "value" : 20, "sumup":30 }
{ "_id" : 4, "value" : 30, "sumup":40 }
{ "_id" : 5, "value" : 30, "sumup":40 }
{ "_id" : 6, "value" : 30, "sumup":40 }
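You could try this in the mongo shell (replace db.collection with your own collection, calc in the question). The inner findOne() runs first, on the client, and its result is embedded in the pipeline as a constant: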
db.collection.aggregate([
{
"$project": {
"value": 1,
"sumup": {
"$add": [ "$value", (db.collection.findOne({"_id": 1})).value ]
}
}
}
])
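The same idea can be written as two explicit steps, which makes it clearer that the value of document 1 is read once and then passed into the pipeline as a plain number. This is a minimal sketch. The function name buildSumupPipeline is hypothetical, and the collection name calc is taken from the question.

```javascript
// Build the pipeline from an already fetched base value.
// buildSumupPipeline is an illustrative helper, not a MongoDB API.
function buildSumupPipeline(baseValue) {
  return [
    {
      $project: {
        value: 1,
        // baseValue is a constant by the time the server sees the pipeline.
        sumup: { $add: ["$value", baseValue] }
      }
    }
  ];
}

// In the mongo shell this would be used roughly as follows:
//   var base = db.calc.findOne({ _id: 1 }).value;
//   db.calc.aggregate(buildSumupPipeline(base));
var examplePipeline = buildSumupPipeline(10);
```

Because the base value is read before the aggregation starts, a later change to document 1 is not reflected until the two steps are run again.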

MongoDB: Sort in combination with Aggregation group

I have a collection called transaction with below documents,
/* 0 */
{
"_id" : ObjectId("5603fad216e90d53d6795131"),
"statusId" : "65c719e6727d",
"relatedWith" : "65c719e67267",
"status" : "A",
"userId" : "100",
"createdTs" : ISODate("2015-09-24T13:15:36.609Z")
}
/* 1 */
{
"_id" : ObjectId("5603fad216e90d53d6795134"),
"statusId" : "65c719e6727d",
"relatedWith" : "65c719e6726d",
"status" : "B",
"userId" : "100",
"createdTs" : ISODate("2015-09-24T13:14:31.609Z")
}
/* 2 */
{
"_id" : ObjectId("5603fad216e90d53d679512e"),
"statusId" : "65c719e6727d",
"relatedWith" : "65c719e6726d",
"status" : "C",
"userId" : "100",
"createdTs" : ISODate("2015-09-24T13:13:36.609Z")
}
/* 3 */
{
"_id" : ObjectId("5603fad216e90d53d6795132"),
"statusId" : "65c719e6727d",
"relatedWith" : "65c719e6726d",
"status" : "D",
"userId" : "100",
"createdTs" : ISODate("2015-09-24T13:16:36.609Z")
}
When I run the below Aggregation query without $group,
db.transaction.aggregate([
{
"$match": {
"userId": "100",
"statusId": "65c719e6727d"
}
},
{
"$sort": {
"createdTs": -1
}
}
])
I get the result in the expected order, i.e. sorted by createdTs in descending order (minimal result):
/* 0 */
{
"result" : [
{
"_id" : ObjectId("5603fad216e90d53d6795132"),
"createdTs" : ISODate("2015-09-24T13:16:36.609Z")
},
{
"_id" : ObjectId("5603fad216e90d53d6795131"),
"createdTs" : ISODate("2015-09-24T13:15:36.609Z")
},
{
"_id" : ObjectId("5603fad216e90d53d6795134"),
"createdTs" : ISODate("2015-09-24T13:14:31.609Z")
},
{
"_id" : ObjectId("5603fad216e90d53d679512e"),
"createdTs" : ISODate("2015-09-24T13:13:36.609Z")
}
],
"ok" : 1
}
If I apply the below aggregation with $group, the result is sorted in the inverse order (i.e. ascending):
db.transaction.aggregate([
{
"$match": {
"userId": "100",
"statusId": "65c719e6727d"
}
},
{
"$sort": {
"createdTs": -1
}
},
{
$group: {
"_id": {
"statusId": "$statusId",
"relatedWith": "$relatedWith",
"status": "$status"
},
"status": {$first: "$status"},
"statusId": {$first: "$statusId"},
"relatedWith": {$first: "$relatedWith"},
"createdTs": {$first: "$createdTs"}
}
}
]);
I get the result in inverse order, i.e. sorted by createdTs in ascending order:
/* 0 */
{
"result" : [
{
"_id" : ObjectId("5603fad216e90d53d679512e"),
"createdTs" : ISODate("2015-09-24T13:13:36.609Z")
},
{
"_id" : ObjectId("5603fad216e90d53d6795134"),
"createdTs" : ISODate("2015-09-24T13:14:31.609Z")
},
{
"_id" : ObjectId("5603fad216e90d53d6795131"),
"createdTs" : ISODate("2015-09-24T13:15:36.609Z")
},
{
"_id" : ObjectId("5603fad216e90d53d6795132"),
"createdTs" : ISODate("2015-09-24T13:16:36.609Z")
}
],
"ok" : 1
}
Where am I wrong ?
The $group stage doesn't guarantee the ordering of its results. See the first paragraph here.
If you want the results to be sorted after a $group, you need to add a $sort after the $group stage.
In your case, you should move the $sort after the $group. And before you ask: no, the $sort won't be able to use an index after the $group like it can before the $group :-).
The internal algorithm of $group seems to keep some sort of ordering (reversed apparently), but I would not count on that and add a $sort.
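A minimal sketch of the question's pipeline with an explicit $sort after the $group. The first $sort is kept as well, on the assumption that $first should still pick the most recent document in each group. The variable name pipeline is chosen here for illustration.

```javascript
var pipeline = [
  { $match: { userId: "100", statusId: "65c719e6727d" } },
  // Sort before grouping so that $first sees the newest document first.
  { $sort: { createdTs: -1 } },
  {
    $group: {
      _id: {
        statusId: "$statusId",
        relatedWith: "$relatedWith",
        status: "$status"
      },
      status: { $first: "$status" },
      statusId: { $first: "$statusId" },
      relatedWith: { $first: "$relatedWith" },
      createdTs: { $first: "$createdTs" }
    }
  },
  // Sort again after grouping; this is what fixes the output order.
  { $sort: { createdTs: -1 } }
];

// In the mongo shell: db.transaction.aggregate(pipeline);
```

The second $sort works on the grouped documents, so it sorts in memory and cannot use an index on createdTs.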
You are not doing anything wrong here; it's the behavior of $group in MongoDB.
Let's have a look at an example.
Suppose you have the following documents in a collection:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
{ "_id" : 6, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-15T12:05:10Z") }
{ "_id" : 7, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T14:12:12Z") }
Now if you run this
db.collection.aggregate([{ $sort: { item: 1,date:1}} ] )
the output will be in ascending order of item and date.
Now if you add group stage in aggregation pipeline it will reverse the order.
db.collection.aggregate([{ $sort: { item: 1,date:1}},{$group:{_id:"$item"}} ] )
Output will be
{ "_id" : "xyz" }
{ "_id" : "jkl" }
{ "_id" : "abc" }
Now the solution for your problem
change "createdTs": -1 to "createdTs": 1 for group

How to sort nested object

db.sort.drop();
db.sort.insert({stats: [{userId: 1, date: '01012013'},{userId: 2, date: '31122012'}]});
db.sort.insert({stats: [{userId: 1, date: '31122013'},{userId: 2, date: '01012012'}]});
> db.sort.find({'stats.userId': 1}).sort({'stats.date': 1}).pretty()
{
"_id" : ObjectId("52af1ce974be7dbd071e8563"),
"stats" : [
{
"userId" : 1,
"date" : "31122013"
},
{
"userId" : 2,
"date" : "01012012"
}
]
}
{
"_id" : ObjectId("52af1ce974be7dbd071e8562"),
"stats" : [
{
"userId" : 1,
"date" : "01012013"
},
{
"userId" : 2,
"date" : "31122012"
}
]
}
How can I get the documents sorted by the date of the stats entry with userId: 1?
I expect to see:
{
"_id" : ObjectId("52af1ce974be7dbd071e8562"),
"stats" : [
{
"userId" : 1,
"date" : "01012013"
},
{
"userId" : 2,
"date" : "31122012"
}
]
}
{
"_id" : ObjectId("52af1ce974be7dbd071e8563"),
"stats" : [
{
"userId" : 1,
"date" : "31122013"
},
{
"userId" : 2,
"date" : "01012012"
}
]
}
Something along the lines of:
db.COLLECTION.find({'stats.userId': 1}).sort({'stats.date': 1})
Given your data above, when I execute that query, I get this back:
"stats" : [
{
"userId" : 1,
"date" : 1012013
}
You could use the aggregation framework to achieve this:
db.sort.aggregate(
[
{$unwind:"$stats"},
{$match:{"stats.userId":1}},
{$sort:{"stats.date":1}}
]);
If you could change your schema to store the user ID as the field name with the date as its value, you can accomplish this easily:
> db.so2.find().pretty()
{
"_id" : 1,
"stats" : [
{
"userId1" : "01012013"
},
{
"userId2" : "31122012"
}
]
}
{
"_id" : 2,
"stats" : [
{
"userId1" : "31122013"
},
{
"userId2" : "01012012"
}
]
}
You then change the query predicate from userId : 1 to check for existence of the userId field
> db.so2.find({"stats.userId1":{$exists:1}}).sort({"stats.userId1":1}).pretty()
{
"_id" : 1,
"stats" : [
{
"userId1" : "01012013"
},
{
"userId2" : "31122012"
}
]
}
{
"_id" : 2,
"stats" : [
{
"userId1" : "31122013"
},
{
"userId2" : "01012012"
}
]
}

Multiple group operations using Mongo aggregation framework

Given a set of questions that have linked survey and category id:
> db.questions.find().toArray();
[
{
"_id" : ObjectId("4fda05bc322b1c95b531ac25"),
"id" : 1,
"name" : "Question 1",
"category_id" : 1,
"survey_id" : 1,
"score" : 5
},
{
"_id" : ObjectId("4fda05cb322b1c95b531ac26"),
"id" : 2,
"name" : "Question 2",
"category_id" : 1,
"survey_id" : 1,
"score" : 3
},
{
"_id" : ObjectId("4fda05d9322b1c95b531ac27"),
"id" : 3,
"name" : "Question 3",
"category_id" : 2,
"survey_id" : 1,
"score" : 4
},
{
"_id" : ObjectId("4fda4287322b1c95b531ac28"),
"id" : 4,
"name" : "Question 4",
"category_id" : 2,
"survey_id" : 1,
"score" : 7
}
]
I can find the category average with:
db.questions.aggregate(
{ $group : {
_id : "$category_id",
avg_score : { $avg : "$score" }
}
}
);
{
"result" : [
{
"_id" : 1,
"avg_score" : 4
},
{
"_id" : 2,
"avg_score" : 5.5
}
],
"ok" : 1
}
How can I get the average of category averages (note this is different than simply averaging all questions)? I would assume I would do multiple group operations but this fails:
> db.questions.aggregate(
... { $group : {
... _id : "$category_id",
... avg_score : { $avg : "$score" },
... }},
... { $group : {
... _id : "$survey_id",
... avg_score : { $avg : "$score" },
... }}
... );
{
"errmsg" : "exception: the _id field for a group must not be undefined",
"code" : 15956,
"ok" : 0
}
>
It's important to understand that the operations in the argument to aggregate() form a pipeline. This means that the input to any element of the pipeline is the stream of documents produced by the previous element in the pipeline.
In your example, your first query creates a pipeline of documents that look like this:
{
"_id" : 2,
"avg_score" : 5.5
},
{
"_id" : 1,
"avg_score" : 4
}
This means that the second element of the pipeline is seeing a series of documents where the only keys are "_id" and "avg_score". The keys "category_id" and "score" no longer exist in this document stream.
If you want to further aggregate on this stream, you'll have to aggregate using the keys that are seen at this stage in the pipeline. Since you want to average the averages, you need to put in a single constant value for the _id field, so that all of the input documents get grouped into a single result.
The following code produces the correct result:
db.questions.aggregate(
{ $group : {
_id : "$category_id",
avg_score : { $avg : "$score" },
}
},
{ $group : {
_id : "all",
avg_score : { $avg : "$avg_score" },
}
}
);
When run, it produces the following output:
{
"result" : [
{
"_id" : "all",
"avg_score" : 4.75
}
],
"ok" : 1
}
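The arithmetic behind the two $group stages can be checked in plain JavaScript with the four sample questions. This is only a sketch of what the pipeline computes, not how MongoDB implements it. The helper name average is introduced here for illustration.

```javascript
// Sample data from the question, reduced to the fields that matter.
var questions = [
  { category_id: 1, score: 5 },
  { category_id: 1, score: 3 },
  { category_id: 2, score: 4 },
  { category_id: 2, score: 7 }
];

function average(numbers) {
  var sum = numbers.reduce(function (acc, n) { return acc + n; }, 0);
  return sum / numbers.length;
}

// Stage 1: group by category_id and average the scores in each group.
var scoresByCategory = {};
questions.forEach(function (q) {
  if (!scoresByCategory[q.category_id]) {
    scoresByCategory[q.category_id] = [];
  }
  scoresByCategory[q.category_id].push(q.score);
});
var categoryAverages = Object.keys(scoresByCategory).map(function (key) {
  return { _id: Number(key), avg_score: average(scoresByCategory[key]) };
});

// Stage 2: group everything under one constant _id and average the averages.
// Only _id and avg_score exist at this point, as in the real pipeline.
var overall = {
  _id: "all",
  avg_score: average(categoryAverages.map(function (d) { return d.avg_score; }))
};
// Category averages are 4 and 5.5, so overall.avg_score is 4.75.
```

In this sample both categories have two questions, so the average of averages happens to equal the plain average of all scores. With uneven category sizes the two would differ.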