get total of sub documents in a collection - mongodb

How do I get the total number of comments in the collection if my documents look like this? (Not the total comments per post, but the total for the whole collection.)
{
    _id: 1,
    post: 'content',
    comments: [
        {
            name: '',
            comment: ''
        }
    ]
}
If post A has 3 comments and post B has 5 comments, the result should be 8.

You could use the aggregation framework:
> db.prabir.aggregate(
    { $unwind : "$comments" },
    { $group: {
        _id: '',
        count: { $sum: 1 }
    }}
  )
{ "result" : [ { "_id" : "", "count" : 8 } ], "ok" : 1 }
In a nutshell, this (temporarily) creates a separate document for each comment and then increments count once for each of those documents.
For a large number of posts and comments it might be more efficient to keep a running count of the comments. Whenever a comment is added, you also increment a counter. Example:
// Insert a comment
> comment = { name: 'JohnDoe', comment: 'FooBar' }
> db.prabir.update(
    { post: "A" },
    {
        $push: { comments: comment },
        $inc: { numComments: 1 }
    }
  )
Using the aggregation framework again:
> db.prabir.aggregate(
    { $project : { _id: 0, numComments: 1 }},
    { $group: {
        _id: '',
        count: { $sum: "$numComments" }
    }}
  )
{ "result" : [ { "_id" : "", "count" : 8 } ], "ok" : 1 }

You can use the aggregate method of the aggregation framework for that:
db.test.aggregate(
// Only include docs with at least one comment.
{$match: {'comments.0': {$exists: true}}},
// Duplicate the documents, 1 per comments array entry
{$unwind: '$comments'},
// Group all docs together and count the number of unwound docs,
// which will be the same as the number of comments.
{$group: {_id: null, count: {$sum: 1}}}
);
UPDATE
As of MongoDB 2.6, there's a more efficient way to do this by using the $size aggregation operator to directly get the number of comments in each doc:
db.test.aggregate(
{$group: {_id: null, count: {$sum: {$size: '$comments'}}}}
);
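Note that $size raises an error if a document has no comments field at all. If that can happen in your data, guarding with $ifNull is one option (a sketch, assuming an empty array is an acceptable default for missing fields):
db.test.aggregate([
    // Treat a missing comments field as an empty array so $size doesn't error
    {$group: {_id: null, count: {$sum: {$size: {$ifNull: ['$comments', []]}}}}}
]);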

Related

MongoDB grouping results based on greater than expression

I have a ton of records in a collection that look like this:
{
    "_id" : ObjectId("5a95cf7790bd8fbf1c6a39da"),
    "dmb_reviewerID" : "AB9S9279OZ3QO",
    "dmb_asin" : "0078764343",
    "dmb_reviewerName" : "Alan",
    "dmb_helpful" : [
        1,
        1
    ],
    "dmb_reviewText" : "I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access.",
    "dmb_overall" : 5.0,
    "dmb_summary" : "Good game and Beta access!!",
    "dmb_unixReviewTime" : 1373155200,
    "dmb_reviewTime" : "07 7, 2013"
}
I need to find all of the product IDs (dmb_asin) which have 200 reviews or more.
So far, I've managed to count them and return a sum using an aggregate, but I can't figure out how to show only those with a count of 200 or more.
My code:
aggregate({
    $group: {
        _id: "$dmb_asin",
        reviews: {
            $addToSet: "$dmb_asin"
        },
        count: {
            $sum: 1
        }
    }
});
Try this code (if I understand you correctly):
aggregate([
    {
        $group: {
            _id: '$dmb_asin',
            count: {
                $sum: 1
            }
        }
    },
    {
        $match: {
            count: {
                $gte: 200
            }
        }
    }
])
Try this query:
db.collection.aggregate([
    {$group: {
        _id: "$dmb_asin",
        reviews: {
            $addToSet: "$dmb_asin"
        },
        count: {
            $sum: 1
        }
    }},
    {$match: {"count": {$gte: 200}}}
])
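If you only need the qualifying product IDs as a plain array (rather than the aggregation result documents), you could map the cursor in the shell. A sketch, assuming MongoDB 2.6+ where aggregate returns a cursor:
// Collect just the dmb_asin values that have 200 or more reviews
var popularAsins = db.collection.aggregate([
    { $group: { _id: "$dmb_asin", count: { $sum: 1 } } },
    { $match: { count: { $gte: 200 } } }
]).map(function (doc) { return doc._id; });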

Get the number of documents liked per document in MongoDB

I'm working on a project using MongoDB as the database and I'm encountering a problem: I can't find the right query to get a simple count of the likes of a document. The collection that I use looks like this:
{ "username" : "example1",
"like" : [ { "document_id" : "doc1" },
"document_id" : "doc2 },
...]
}
So what I need to compute is the number of likes for each document, so that at the end I will have:
{ "document_id" : "docA" , nbLikes : 30 }, {"document_id" : "docB", nbLikes : 1}
Can anyone help me with this, because I haven't managed to do it myself.
You can do this by unwinding the like array of each doc and then grouping by document_id to get a count for each value:
db.test.aggregate([
// Duplicate each doc, once per 'like' array element
{$unwind: '$like'},
// Group them by document_id and assemble a count
{$group: {_id: '$like.document_id', nbLikes: {$sum: 1}}},
// Reshape the docs to match the desired output
{$project: {_id: 0, document_id: '$_id', nbLikes: 1}}
])
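If you also want the most-liked documents first, a $sort stage can be appended to the same pipeline (a sketch; the descending order is just an assumption about what you need):
db.test.aggregate([
    {$unwind: '$like'},
    {$group: {_id: '$like.document_id', nbLikes: {$sum: 1}}},
    {$project: {_id: 0, document_id: '$_id', nbLikes: 1}},
    // Most-liked documents first
    {$sort: {nbLikes: -1}}
])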
Add "likeCount" field and increase count for per $push operation and read "likeCount" field
db.test.update(
{ _id: "..." },
{
$inc: { likeCount: 1 },
$push: { like: { "document_id" : "doc1" } }
}
)
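Reading the counter back is then an ordinary query; a sketch (the _id value is a placeholder):
// Return only the counter for the given document
db.test.find({ _id: "..." }, { _id: 0, likeCount: 1 })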

How can I get duplicate elements identified by 2 keys in a mongo collection?

I have a collection with a few million documents, in which I need to find at least one duplicate document. The duplication criterion is based on 2 keys, not one. So I need to find at least 2 documents which both have { property1 : value1, property2 : value2 }.
For this I am trying to use the aggregation framework as in the following example:
db.listings.aggregate(
    { $group: {
        _id : { property1 : "$property1", property2 : "$property2" },
        count: { $sum: 1 }
    }},
    { $match : {
        count: { $gt : 1 }
    }},
    { $limit: 1 }
)
I think this should be working, BUT
Mongo returns the following error:
{
    "code" : 16390,
    "ok" : 0,
    "errmsg" : "exception: sharded pipeline failed on shard shard1: { errmsg: \"exception: aggregation result exceeds maximum document size (16MB)\", code: 16389, ok: 0.0}"
}
I have also tried
db.collection.aggregate(
    { $group: {
        _id: { $concat: [ "$property1", ": ", "$property2" ] },
        count: { $sum: 1 }
    }}
)
I got the same result.
Does anyone have a better idea how to do this? I am not really a mongo expert, but I have to do this one way or the other.
Thanks in advance.
Your idea to shrink the doc as much as possible with $concat is a good one, but $concat is a $project operator, not a $group operator. So try something like this:
db.collection.aggregate(
{ $project: { _id: { $concat: ["$property1", ":", "$property2"] }}},
{ $group: { _id: '$_id', c: { $sum: 1 }}},
{ $match: { c: { $gt: 1 }}})
It still may use too much memory, but it's worth a shot.
Using map-reduce is an alternative. You can find examples here:
http://docs.mongodb.org/manual/tutorial/map-reduce-examples/
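If you are on MongoDB 2.6 or newer, another option is to keep the original two-key grouping but let the aggregation return a cursor and spill to disk, which avoids the 16MB result-document limit behind the error above (a sketch):
db.listings.aggregate(
    [
        { $group: {
            _id: { property1: "$property1", property2: "$property2" },
            count: { $sum: 1 }
        }},
        { $match: { count: { $gt: 1 } } },
        { $limit: 1 }
    ],
    // Allow stages to write temporary files instead of holding everything in memory
    { allowDiskUse: true }
)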

mongodb aggregation framework group + project

I have the following issue:
This query returns 1 result, which is what I want:
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } } }])
{
"result" : [
{
"_id" : "b91e51e9-6317-4030-a9a6-e7f71d0f2161",
"version" : 1.2000000000000002
}
],
"ok" : 1
}
This query (I just added a projection so that I can later query for the entire document) returns multiple results. What am I doing wrong?
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } } }, {$project: { _id : 1 } } ])
{
"result" : [
{
"_id" : ObjectId("5139310a3899d457ee000003")
},
{
"_id" : ObjectId("513931053899d457ee000002")
},
{
"_id" : ObjectId("513930fd3899d457ee000001")
}
],
"ok" : 1
}
Found the answer:
1. First I need to get all the _ids
db.items.aggregate( [
{ '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
{ '$group': { _id: '$item.id', dbid: { $max: "$_id" } } }
]);
2. Then I need to query the documents
db.items.find({ _id: { '$in': "IDs returned from aggregate" } });
which will look like this:
db.items.find({ _id: { '$in': [ '1', '2', '3' ] } });
(I know it's late, but I'm still answering so that other people don't have to search for the right answer somewhere else.)
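In the shell, the two steps can be wired together directly; a sketch, assuming MongoDB 2.6+ where aggregate returns a cursor:
// Step 1: collect the _ids of the latest document per item.id
var ids = db.items.aggregate([
    { '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
    { '$group': { _id: '$item.id', dbid: { $max: "$_id" } } }
]).map(function (doc) { return doc.dbid; });

// Step 2: fetch the full documents
db.items.find({ _id: { '$in': ids } });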
See the answer from Deka; that will do the job.
Not all accumulators are available in the $project stage. We need to consider what we can do in $project with respect to accumulators and what we can do in $group. Let's take a look at this:
db.companies.aggregate([{
$match: {
funding_rounds: {
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
funding: {
$push: {
amount: "$funding_rounds.raised_amount",
year: "$funding_rounds.funded_year"
}
}
}
}, ]).pretty()
Here we're checking that funding_rounds is not empty. The documents are then unwound and passed to $sort and the later stages. We'll see one document for each element of the funding_rounds array for every company. So, the first thing we do here is $sort based on:
funding_rounds.funded_year
funding_rounds.funded_month
funding_rounds.funded_day
In the group stage, grouping by company name, the array is built using $push. $push has to be part of the document given as the value for a field we name in the $group stage, and we can push any valid expression. In this case, we're pushing documents onto this array, and every document that we push is added to the end of the array we're accumulating. Here, the pushed documents are built from raised_amount and funded_year. So, the $group stage produces a stream of documents that have an _id where we're specifying the company name.
Notice that $push is available in $group stages but not in the $project stage. This is because $group stages are designed to take a sequence of documents and accumulate values based on that stream of documents.
$project, on the other hand, works with one document at a time. So, we can calculate an average over an array within an individual document inside a $project stage. But seeing documents one at a time and, for every document, pushing a new value onto an accumulating array is something the $project stage is just not designed to do. For that type of operation we want to use $group.
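For example, averaging an array inside a single document is something $project can do on its own. A sketch, assuming MongoDB 3.2+ where $avg is available as an array expression:
db.companies.aggregate([
    { $match: { funding_rounds: { $ne: [] } } },
    // $avg is evaluated over the array inside each document,
    // so no $unwind or $group is needed for this per-company average
    { $project: { _id: 0, name: 1, avg_raised: { $avg: "$funding_rounds.raised_amount" } } }
])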
Let's take a look at another example:
db.companies.aggregate([{
$match: {
funding_rounds: {
$exists: true,
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
first_round: {
$first: "$funding_rounds"
},
last_round: {
$last: "$funding_rounds"
},
num_rounds: {
$sum: 1
},
total_raised: {
$sum: "$funding_rounds.raised_amount"
}
}
}, {
$project: {
_id: 0,
company: "$_id.company",
first_round: {
amount: "$first_round.raised_amount",
article: "$first_round.source_url",
year: "$first_round.funded_year"
},
last_round: {
amount: "$last_round.raised_amount",
article: "$last_round.source_url",
year: "$last_round.funded_year"
},
num_rounds: 1,
total_raised: 1,
}
}, {
$sort: {
total_raised: -1
}
}]).pretty()
In the $group stage, we're using the $first and $last accumulators. Again, as with $push, we can't use $first and $last in $project stages, because project stages are not designed to accumulate values across multiple documents; rather, they reshape documents one at a time. The total number of rounds is calculated with the $sum accumulator: the value 1 simply counts one for each document grouped under a given _id value. The $project may seem complex, but it's just making the output pretty, passing num_rounds and total_raised through from the previous stage.
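By contrast, totals that live entirely inside one document can be computed in a $project without any $unwind or $group. A sketch, assuming MongoDB 3.2+ where $sum also works as an array expression:
db.companies.aggregate([
    { $match: { funding_rounds: { $exists: true, $ne: [] } } },
    { $project: {
        _id: 0,
        name: 1,
        // $size and $sum operate on the array within each document
        num_rounds: { $size: "$funding_rounds" },
        total_raised: { $sum: "$funding_rounds.raised_amount" }
    } }
])
The first_round and last_round fields, which depend on ordering across the array, still call for the $sort/$group approach shown above.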

How do I use aggregation operators in a $match in MongoDB (for example $year or $dayOfMonth)?

I have a collection full of documents with a created_date attribute. I'd like to send these documents through an aggregation pipeline to do some work on them. Ideally I would like to filter them using a $match before I do any other work on them, so that I can take advantage of indexes; however, I can't figure out how to use the new $year/$month/$dayOfMonth operators in my $match expression.
There are a few examples floating around of how to use the operators in a $project operation, but I'm concerned that by placing a $project as the first step in my pipeline I'll lose access to my indexes (the MongoDB documentation indicates that the first expression must be a $match to take advantage of indexes).
Sample data:
{
    post_body: 'This is the body of test post 1',
    created_date: ISODate('2012-09-29T05:23:41Z'),
    comments: 48
}
{
    post_body: 'This is the body of test post 2',
    created_date: ISODate('2012-09-24T12:34:13Z'),
    comments: 10
}
{
    post_body: 'This is the body of test post 3',
    created_date: ISODate('2012-08-16T12:34:13Z'),
    comments: 10
}
I'd like to run this through an aggregation pipeline to get the total comments on all posts made in September:
{
aggregate: 'posts',
pipeline: [
{$match:
/*Can I use the $year/$month operators here to match Sept 2012?
$year:created_date : 2012,
$month:created_date : 9
*/
/*or does this have to be
created_date :
{$gte:{$date:'2012-09-01T04:00:00Z'},
$lt: {$date:'2012-10-01T04:00:00Z'} }
*/
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}
This works but the match loses access to any indexes for more complicated queries:
{
aggregate: 'posts',
pipeline: [
{$project:
{
comments: 1,
month : {$month:'$created_date'},
year : {$year:'$created_date'}
}
},
{$match:
{
month:9,
year: 2012
}
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}
As you already found, you cannot $match on fields that are not in the document (it works exactly the same way that find works), and if you use $project first then you will lose the ability to use indexes.
What you can do instead is combine your efforts as follows:
{
aggregate: 'posts',
pipeline: [
{$match: {
created_date :
{$gte:{$date:'2012-09-01T04:00:00Z'},
$lt: {$date:'2012-10-01T04:00:00Z'}
}}
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}
The above only gives you the aggregation for September. If you want to aggregate over multiple months, you can, for example:
{
aggregate: 'posts',
pipeline: [
{$match: {
created_date :
{ $gte: {$date:'2012-07-01T04:00:00Z'},
$lt: {$date:'2012-10-01T04:00:00Z'}
}
}},
{$project: {
comments: 1,
new_created: {
"yr" : {"$year" : "$created_date"},
"mo" : {"$month" : "$created_date"}
}
}
},
{$group:
{_id: "$new_created",
totalComments:{$sum:'$comments'}
}
}
]
}
and you'll get back something like:
{
"result" : [
{
"_id" : {
"yr" : 2012,
"mo" : 7
},
"totalComments" : 5
},
{
"_id" : {
"yr" : 2012,
"mo" : 8
},
"totalComments" : 19
},
{
"_id" : {
"yr" : 2012,
"mo" : 9
},
"totalComments" : 21
}
],
"ok" : 1
}
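For completeness: on MongoDB 3.6 or newer you can use the date operators directly inside a $match via $expr, although such an expression match generally cannot use an index on created_date the way the range query above can. A sketch:
db.posts.aggregate([
    // Expression match on year and month (MongoDB 3.6+)
    { $match: { $expr: { $and: [
        { $eq: [ { $year: "$created_date" }, 2012 ] },
        { $eq: [ { $month: "$created_date" }, 9 ] }
    ] } } },
    { $group: { _id: null, totalComments: { $sum: "$comments" } } }
])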
Let's look at building some pipelines that involve operations that are already familiar to us. So, we're going to look at the following stages:
match - this is a filtering stage, similar to find.
project
sort
skip
limit
We might ask ourselves why these stages are necessary, given that this functionality is already provided in the MongoDB query language; the reason is that we need these stages to support the more complex, analytics-oriented functionality that's included with the aggregation framework. The query below is simply equivalent to a find:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, ])
Let's introduce a project stage in this aggregation pipeline:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$project: {
_id: 0,
name: 1,
founded_year: 1
}
}])
We use the aggregate method to run the aggregation framework. An aggregation pipeline is merely an array of documents, each of which specifies a particular stage operator. So, in the above case we have an aggregation pipeline with two stages. The $match stage passes the documents one at a time to the $project stage.
Let's extend this with a limit stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}])
This gets the matching documents and limits them to five before projecting out the fields, so the projection works on only 5 documents. Now suppose we were to do something like this:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$project: {
_id: 0,
name: 1
}
}, {
$limit: 5
}])
This gets the matching documents, projects that large number of documents, and only then limits the result to five. So the projection works on a large number of documents that are ultimately discarded. The lesson is that we should limit the documents flowing through the pipeline to those that are absolutely necessary for the next stage. Now, let's look at the sort stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$sort: {
name: 1
}
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}])
This will sort all documents by name and return only 5 of them. Now suppose we were to do something like this:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$limit: 5
}, {
$sort: {
name: 1
}
}, {
$project: {
_id: 0,
name: 1
}
}])
This will take the first 5 documents and sort only those. Let's add the skip stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$sort: {
name: 1
}
}, {
$skip: 10
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}, ])
This will sort all the documents, skip the initial 10, and return the next 5 to us. We should try to include $match stages as early as possible in the pipeline. To filter documents in a $match stage, we use the same syntax for constructing query documents (filters) as we do for find().
Try this:
db.createCollection("so");
db.so.remove();
db.so.insert([
{
post_body: 'This is the body of test post 1',
created_date: ISODate('2012-09-29T05:23:41Z'),
comments: 48
},
{
post_body: 'This is the body of test post 2',
created_date: ISODate('2012-09-24T12:34:13Z'),
comments: 10
},
{
post_body: 'This is the body of test post 3',
created_date: ISODate('2012-08-16T12:34:13Z'),
comments: 10
}
]);
//db.so.find();
db.so.ensureIndex({"created_date":1});
db.runCommand({
aggregate:"so",
pipeline:[
{
$match: { // filter only those posts in september
created_date: { $gte: ISODate('2012-09-01'), $lt: ISODate('2012-10-01') }
}
},
{
$group: {
_id: null, // no shared key
comments: { $sum: "$comments" } // total comments for all the posts in the pipeline
}
},
]
//,explain:true
});
The result is:
{ "result" : [ { "_id" : null, "comments" : 58 } ], "ok" : 1 }
So you could also modify your previous example to do this, although I'm not sure why you'd want to, unless you plan on doing something else with month and year in the pipeline:
{
aggregate: 'posts',
pipeline: [
{$match: { created_date: { $gte: ISODate('2012-09-01'), $lt: ISODate('2012-10-01') } } },
{$project:
{
comments: 1,
month : {$month:'$created_date'},
year : {$year:'$created_date'}
}
},
{$match:
{
month:9,
year: 2012
}
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}