MongoDB grouping results based on greater than expression

I have a ton of records in a collection that look like this:
{
"_id" : ObjectId("5a95cf7790bd8fbf1c6a39da"),
"dmb_reviewerID" : "AB9S9279OZ3QO",
"dmb_asin" : "0078764343",
"dmb_reviewerName" : "Alan",
"dmb_helpful" : [
1,
1
],
"dmb_reviewText" : "I haven't gotten around to playing the campaign but the multiplayer is solid and pretty fun. Includes Zero Dark Thirty pack, an Online Pass, and the all powerful Battlefield 4 Beta access.",
"dmb_overall" : 5.0,
"dmb_summary" : "Good game and Beta access!!",
"dmb_unixReviewTime" : 1373155200,
"dmb_reviewTime" : "07 7, 2013"
}
I need to find all of the product IDs (dmb_asin) which have 200 reviews or more.
So far, I've managed to count them and return a sum using an aggregate, but I can't figure out how to show only those with a count of 200 or more.
My code:
aggregate([
    {
        $group: {
            _id: "$dmb_asin",
            reviews: { $addToSet: "$dmb_asin" },
            count: { $sum: 1 }
        }
    }
]);

Try this code (if I understand you correctly):
aggregate([
    {
        $group: {
            _id: '$dmb_asin',
            count: { $sum: 1 }
        }
    },
    {
        $match: {
            count: { $gte: 200 }
        }
    }
])

Try this query:
db.collection.aggregate([
    {
        $group: {
            _id: "$dmb_asin",
            reviews: { $addToSet: "$dmb_asin" },
            count: { $sum: 1 }
        }
    },
    {
        $match: { count: { $gte: 200 } }
    }
])

Related

Sum unique properties in different collection elements

I am quite new to MongoDB. Hopefully I am using the correct terminology to express my problem.
I have the following collection:
Data collection
{
"name":"ABC",
"resourceId":"i-1234",
"volumeId":"v-1234",
"data":"11/6/2013 12AM",
"cost": 0.5
},
{
"name":"ABC",
"resourceId":"v-1234",
"volumeId":"",
"data":"11/6/2013 2AM",
"cost": 1.5
}
I want to query the collection so that if one entry's volumeId matches another entry's resourceId, the costs of the two entries are summed together.
As a result, the cost would be 2.0 in this case.
Basically I want to match the volumeId of one entry to the resourceId of another entry and sum the costs if matched.
I hope I have explained my problem properly. Any help is appreciated. Thanks
Try this aggregation query:
db.col.aggregate([
{
$project: {
resourceId: 1,
volumeId: 1,
cost: 1,
match: {
$cond: [
{$eq: ["$volumeId", ""]},
"$resourceId",
"$volumeId"
]
}
}
},
{
$group: {
_id: '$match',
cost: {$sum: '$cost'},
resId: {
$addToSet: {
$cond: [
{$eq: ['$match', '$resourceId']},
null,
'$resourceId'
]
}
}
}
},
{$unwind: '$resId'},
{$match: {
resId: {
$ne: null
}
}
},
{
$project: {
resourceId: '$resId',
cost: 1,
_id: 0
}
}
])
And you will get the following:
{ "cost" : 2, "resourceId" : "i-1234" }
This is assuming the statement I wrote in the comment is true.

How to handle partial week data grouping in mongodb

I have some docs (daily open price for a stock) like the following:
/* 0 */
{
"_id" : ObjectId("54d65597daf0910dfa8169b0"),
"D" : ISODate("2014-12-29T00:00:00.000Z"),
"O" : 104.98
}
/* 1 */
{
"_id" : ObjectId("54d65597daf0910dfa8169af"),
"D" : ISODate("2014-12-30T00:00:00.000Z"),
"O" : 104.73
}
/* 2 */
{
"_id" : ObjectId("54d65597daf0910dfa8169ae"),
"D" : ISODate("2014-12-31T00:00:00.000Z"),
"O" : 104.51
}
/* 3 */
{
"_id" : ObjectId("54d65597daf0910dfa8169ad"),
"D" : ISODate("2015-01-02T00:00:00.000Z"),
"O" : 103.75
}
/* 4 */
{
"_id" : ObjectId("54d65597daf0910dfa8169ac"),
"D" : ISODate("2015-01-05T00:00:00.000Z"),
"O" : 102.5
}
and I want to aggregate the records by week so I can get the weekly average open price. My first attempt is to use:
db.ohlc.aggregate({
$match: {
D: {
$gte: new ISODate('2014-12-28')
}
}
}, {
$project: {
year: {
$year: '$D'
},
week: {
$week: '$D'
},
O: 1
}
}, {
$group: {
_id: {
year: '$year',
week: '$week'
},
O: {
$avg: '$O'
}
}
}, {
$sort: {
_id: 1
}
})
But I soon realized the result is incorrect, as both the last week of 2014 (week number 52) and the first week of 2015 (week number 0) are partial weeks. With this aggregation I would have an average price for 12/29-12/31/2014 and another one for 01/02/2015 (which is the only trading date in the first week of 2015), but in my application I need to group the data from 12/29/2014 through 01/02/2015. Any advice?
To answer my own question, the trick is to calculate the number of weeks based on a reference date (1970-01-04) and group by that number. You can check out my new post at http://midnightcodr.github.io/2015/02/07/OHLC-data-grouping-with-mongodb/ for details.
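For reference, here is a minimal sketch of that trick against the sample docs above; this is my reading of the approach rather than the exact code from the post, and $floor requires MongoDB 3.2+:
// Group by the number of whole weeks elapsed since Sunday 1970-01-04,
// so a week that spans a year boundary stays in a single bucket.
var WEEK_MS = 7 * 24 * 60 * 60 * 1000;
db.ohlc.aggregate([
    { $match: { D: { $gte: ISODate("2014-12-28") } } },
    { $project: {
        O: 1,
        week: { $floor: { $divide: [
            { $subtract: [ "$D", ISODate("1970-01-04") ] }, // milliseconds since the reference Sunday
            WEEK_MS
        ] } }
    } },
    { $group: { _id: "$week", avgOpen: { $avg: "$O" } } },
    { $sort: { _id: 1 } }
])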
I use this for building OHLC candles; with allowDiskUse, $out and some date filters it works great. Maybe you can adapt the grouping?
db.getCollection('market').aggregate(
[
{ $match: { date: { $exists: true } } },
{ $sort: { date: 1 } },
{ $project: { _id: 0, date: 1, rate: 1, amount: 1, tm15: { $mod: [ "$date", 900 ] } } },
{ $project: { _id: 0, date: 1, rate: 1, amount: 1, candleDate: { $subtract: [ "$date", "$tm15" ] } } },
{ $group: { _id: "$candleDate", open: { $first: '$rate' }, low: { $min: '$rate' }, high: { $max: '$rate' }, close: { $last: '$rate' }, volume: { $sum: '$amount' }, trades: { $sum: 1 } } }
])
In my experience, this is not a good approach to the problem. Why? It will not scale: the amount of computation needed is considerable, especially for the grouping.
What I would do in your situation is to move part of the application logic to the documents in the DB.
My first approach would be to add a "week" field that stores the previous (or next) Sunday of the date the sample belongs to. This is quite easy to do at the moment of insertion. Then you can simply run the aggregation grouping by that field. If you want more performance, add an index on { symbol : 1, week : 1 } and do a sort in the aggregate.
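A minimal sketch of that first approach, reusing the D and O fields from the question (the weekStart() helper and the week field name are illustrative choices of mine, not part of the original answer):
// Compute the Sunday that starts the week a given date falls in.
function weekStart(d) {
    var s = new Date(d.getTime());
    s.setUTCHours(0, 0, 0, 0);
    s.setUTCDate(s.getUTCDate() - s.getUTCDay()); // roll back to Sunday
    return s;
}

// Stamp each sample with its week at insertion time...
db.ohlc.insert({
    D: ISODate("2015-01-02T00:00:00Z"),
    O: 103.75,
    week: weekStart(ISODate("2015-01-02T00:00:00Z"))
});

// ...then the weekly average is a plain group on that field.
db.ohlc.aggregate([
    { $group: { _id: "$week", avgOpen: { $avg: "$O" } } },
    { $sort: { _id: 1 } }
])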
My second approach, which I would use if you plan on running this type of aggregation a lot, is basically to have documents that group the samples in a weekly manner, like this:
{
week : <Day Representing Week>,
prices: [
{ Day Sample }, ...
]
}
Then you can simply work on those documents directly. This will also significantly reduce the size of your indexes, thus speeding things up.
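A hedged sketch of the upsert that would maintain such weekly documents (the ohlc_weekly collection name and the weekStart() helper from the previous sketch are illustrative, not prescribed by this answer):
// Push each daily sample into its week's document;
// upsert creates the weekly document the first time it is needed.
var sample = { D: ISODate("2015-01-02T00:00:00Z"), O: 103.75 };
db.ohlc_weekly.update(
    { week: weekStart(sample.D) },
    { $push: { prices: sample } },
    { upsert: true }
);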

mongodb aggregation framework group + project

I have the following issue:
this query returns 1 result, which is what I want:
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } } }])
{
"result" : [
{
"_id" : "b91e51e9-6317-4030-a9a6-e7f71d0f2161",
"version" : 1.2000000000000002
}
],
"ok" : 1
}
this query (I just added a projection so I can later query for the entire document) returns multiple results. What am I doing wrong?
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } }, $project: { _id : 1 } }])
{
"result" : [
{
"_id" : ObjectId("5139310a3899d457ee000003")
},
{
"_id" : ObjectId("513931053899d457ee000002")
},
{
"_id" : ObjectId("513930fd3899d457ee000001")
}
],
"ok" : 1
}
Found the answer.
1. First I need to get all the _ids:
db.items.aggregate( [
{ '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
{ '$group': { _id: '$item.id', dbid: { $max: "$_id" } } }
]);
2. Then I need to query the documents:
db.items.find({ _id: { '$in': "IDs returned from aggregate" } });
which will look like this:
db.items.find({ _id: { '$in': [ '1', '2', '3' ] } });
(I know it's late, but I'm still answering so that other people don't have to go searching for the right answer somewhere else.)
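For completeness, a sketch of wiring the two steps together in the shell (this assumes a shell where aggregate() returns a cursor; on 2.2/2.4 you would read the result array from the returned document instead):
// Collect the dbid values from the aggregation, then fetch the full documents.
var ids = db.items.aggregate([
    { $match: { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
    { $group: { _id: '$item.id', dbid: { $max: '$_id' } } }
]).toArray().map(function (doc) { return doc.dbid; });

db.items.find({ _id: { $in: ids } });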
See Deka's answer; it will do the job.
Not all accumulators are available in the $project stage. We need to consider what we can do in $project with respect to accumulators and what we can do in $group. Let's take a look at this:
db.companies.aggregate([{
$match: {
funding_rounds: {
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
funding: {
$push: {
amount: "$funding_rounds.raised_amount",
year: "$funding_rounds.funded_year"
}
}
}
}, ]).pretty()
Here we're checking that funding_rounds is not empty. Then it's unwound and passed on to $sort and the later stages. We'll see one document for each element of the funding_rounds array for every company. So, the first thing we do here is $sort based on:
funding_rounds.funded_year
funding_rounds.funded_month
funding_rounds.funded_day
In the group stage, keyed by company name, the array is built using $push. $push must appear inside the document specified as the value of a field we name in the $group stage, and it can push any valid expression. In this case, we're pushing documents built from raised_amount and funded_year, and each pushed document is appended to the end of the array we're accumulating. So, the output of the $group stage is a stream of documents whose _id is the company name.
Notice that $push is available in $group stages but not in $project stage. This is because $group stages are designed to take a sequence of documents and accumulate values based on that stream of documents.
$project, on the other hand, works with one document at a time. So, we can calculate an average over an array held within an individual document inside a $project stage. But accumulating a value as documents stream by, pushing a new element for every document seen, is something the $project stage is simply not designed to do; for that type of operation we want $group.
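For instance, here is a minimal sketch of that per-document calculation (assuming MongoDB 3.2+, where $avg is accepted as a $project expression, and reusing the companies collection from these examples):
// Average the raised_amount values inside each company's own funding_rounds
// array: one document at a time, no accumulation across documents,
// so $project can do it.
db.companies.aggregate([
    { $match: { funding_rounds: { $ne: [] } } },
    { $project: {
        _id: 0,
        name: 1,
        avg_raised: { $avg: "$funding_rounds.raised_amount" }
    } }
])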
Let's take a look at another example:
db.companies.aggregate([{
$match: {
funding_rounds: {
$exists: true,
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
first_round: {
$first: "$funding_rounds"
},
last_round: {
$last: "$funding_rounds"
},
num_rounds: {
$sum: 1
},
total_raised: {
$sum: "$funding_rounds.raised_amount"
}
}
}, {
$project: {
_id: 0,
company: "$_id.company",
first_round: {
amount: "$first_round.raised_amount",
article: "$first_round.source_url",
year: "$first_round.funded_year"
},
last_round: {
amount: "$last_round.raised_amount",
article: "$last_round.source_url",
year: "$last_round.funded_year"
},
num_rounds: 1,
total_raised: 1,
}
}, {
$sort: {
total_raised: -1
}
}]).pretty()
In the $group stage, we're using the $first and $last accumulators. Again, as with $push, we can't use $first and $last in $project stages, because $project stages are not designed to accumulate values across multiple documents; rather, they're designed to reshape documents one at a time. The total number of rounds is calculated with the $sum accumulator: the value 1 is added once for each document grouped under a given _id, so it counts the documents in each group. The final $project may seem complex, but it's just making the output pretty: it reshapes first_round and last_round and passes num_rounds and total_raised through from the previous stage.

get total of sub documents in a collection

How do I get the total number of comments in the collection if my collection looks like this? (Not the total comments per post, but the total for the whole collection.)
{
_id: 1,
post: 'content',
comments: [
{
name: '',
comment: ''
}
]
}
If I have post A with 3 comments and post B with 5 comments, the result should be 8.
You could use the aggregation framework:
> db.prabir.aggregate(
{ $unwind : "$comments" },
{ $group: {
_id: '',
count: { $sum: 1 }
}
})
{ "result" : [ { "_id" : "", "count" : 8 } ], "ok" : 1 }
In a nutshell this (temporarily) creates a separate document for each comment and then increments count for each document.
For a large number of posts and comments it might be more efficient to keep track of the number of comments separately. Whenever a comment is added, you also increment a counter. Example:
// Insert a comment
> comment = { name: 'JohnDoe', comment: 'FooBar' }
> db.prabir.update(
{ post: "A" },
{
$push: { comments: comment },
$inc: { numComments: 1 }
}
)
Using the aggregation framework again:
> db.prabir.aggregate(
{ $project : { _id: 0, numComments: 1 }},
{ $group: {
_id: '',
count: { $sum: "$numComments" }
}
})
{ "result" : [ { "_id" : "", "count" : 8 } ], "ok" : 1 }
You can use the aggregate method of the aggregation framework for that:
db.test.aggregate(
// Only include docs with at least one comment.
{$match: {'comments.0': {$exists: true}}},
// Duplicate the documents, 1 per comments array entry
{$unwind: '$comments'},
// Group all docs together and count the number of unwound docs,
// which will be the same as the number of comments.
{$group: {_id: null, count: {$sum: 1}}}
);
UPDATE
As of MongoDB 2.6, there's a more efficient way to do this by using the $size aggregation operator to directly get the number of comments in each doc:
db.test.aggregate(
{$group: {_id: null, count: {$sum: {$size: '$comments'}}}}
);

How do I use aggregation operators in a $match in MongoDB (for example $year or $dayOfMonth)?

I have a collection full of documents with a created_date attribute. I'd like to send these documents through an aggregation pipeline to do some work on them. Ideally I would like to filter them with a $match before I do any other work on them, so that I can take advantage of indexes; however, I can't figure out how to use the new $year/$month/$dayOfMonth operators in my $match expression.
There are a few examples floating around of how to use the operators in a $project operation, but I'm concerned that by placing a $project as the first step in my pipeline I'll lose access to my indexes (the MongoDB documentation indicates that the first stage must be a $match to take advantage of indexes).
Sample data:
{
post_body: 'This is the body of test post 1',
created_date: ISODate('2012-09-29T05:23:41Z'),
comments: 48
}
{
post_body: 'This is the body of test post 2',
created_date: ISODate('2012-09-24T12:34:13Z'),
comments: 10
}
{
post_body: 'This is the body of test post 3',
created_date: ISODate('2012-08-16T12:34:13Z'),
comments: 10
}
I'd like to run this through an aggregation pipeline to get the total comments on all posts made in September
{
aggregate: 'posts',
pipeline: [
{$match:
/*Can I use the $year/$month operators here to match Sept 2012?
$year:created_date : 2012,
$month:created_date : 9
*/
/*or does this have to be
created_date :
{$gte:{$date:'2012-09-01T04:00:00Z'},
$lt: {$date:'2012-10-01T04:00:00Z'} }
*/
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}
This works but the match loses access to any indexes for more complicated queries:
{
aggregate: 'posts',
pipeline: [
{$project:
{
comments: 1,
month : {$month:'$created_date'},
year : {$year:'$created_date'}
}
},
{$match:
{
month:9,
year: 2012
}
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}
As you already found, you cannot $match on fields that are not in the document (it works exactly the same way find() does), and if you use $project first then you will lose the ability to use indexes.
What you can do instead is combine your efforts as follows:
{
aggregate: 'posts',
pipeline: [
{$match: {
created_date :
{$gte: {$date: '2012-09-01T04:00:00Z'},
$lt: {$date: '2012-10-01T04:00:00Z'}}
}
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}
The above only gives you the aggregation for September; if you wanted to aggregate over multiple months, you can, for example:
{
aggregate: 'posts',
pipeline: [
{$match: {
created_date :
{ $gte: {$date: '2012-07-01T04:00:00Z'},
$lt: {$date: '2012-10-01T04:00:00Z'}
}
}
},
{$project: {
comments: 1,
new_created: {
"yr" : {"$year" : "$created_date"},
"mo" : {"$month" : "$created_date"}
}
}
},
{$group:
{_id: "$new_created",
totalComments:{$sum:'$comments'}
}
}
]
}
and you'll get back something like:
{
"result" : [
{
"_id" : {
"yr" : 2012,
"mo" : 7
},
"totalComments" : 5
},
{
"_id" : {
"yr" : 2012,
"mo" : 8
},
"totalComments" : 19
},
{
"_id" : {
"yr" : 2012,
"mo" : 9
},
"totalComments" : 21
}
],
"ok" : 1
}
Let's look at building some pipelines that involve operations that are already familiar to us. We're going to look at the following stages:
match - this is a filtering stage, similar to find.
project
sort
skip
limit
We might ask ourselves why these stages are necessary, given that this functionality is already provided by the MongoDB query language; the reason is that we need these stages to support the more complex analytics-oriented functionality included with the aggregation framework. The query below is simply equivalent to a find:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, ])
Let's introduce a project stage in this aggregation pipeline:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$project: {
_id: 0,
name: 1,
founded_year: 1
}
}])
We use the aggregate method to run the aggregation framework. An aggregation pipeline is merely an array of documents, each of which specifies a particular stage operator. So, in the above case we have an aggregation pipeline with two stages, where the $match stage passes documents one at a time to the $project stage.
Let's extend the pipeline with a limit stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}])
This gets the matching documents and limits them to five before projecting out the fields, so the projection works on only 5 documents. Now assume we were to do something like this instead:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$project: {
_id: 0,
name: 1
}
}, {
$limit: 5
}])
This gets the matching documents, projects that large number of documents, and only then limits to five; so the projection works on a large number of documents before the limit cuts them down to 5. The lesson is that we should pass on to the next stage only those documents that are absolutely necessary. Now, let's look at the sort stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$sort: {
name: 1
}
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}])
This will sort all documents by name and return only 5 of them. Now assume we were to do something like this instead:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$limit: 5
}, {
$sort: {
name: 1
}
}, {
$project: {
_id: 0,
name: 1
}
}])
This will take the first 5 documents and sort only those. Let's add a skip stage:
db.companies.aggregate([{
$match: {
founded_year: 2004
}
}, {
$sort: {
name: 1
}
}, {
$skip: 10
}, {
$limit: 5
}, {
$project: {
_id: 0,
name: 1
}
}, ])
This will sort all the documents, skip the first 10, and return the next 5 to us. We should try to include $match stages as early as possible in the pipeline. To filter documents in a $match stage, we use the same syntax for constructing query documents (filters) as we do for find().
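For instance, a short sketch of the same range filter written both ways, reusing the companies collection from above:
// The identical filter document works in find() and in a $match stage.
db.companies.find({ founded_year: { $gte: 2004, $lte: 2008 } })

db.companies.aggregate([
    { $match: { founded_year: { $gte: 2004, $lte: 2008 } } },
    { $project: { _id: 0, name: 1, founded_year: 1 } }
])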
Try this;
db.createCollection("so");
db.so.remove();
db.so.insert([
{
post_body: 'This is the body of test post 1',
created_date: ISODate('2012-09-29T05:23:41Z'),
comments: 48
},
{
post_body: 'This is the body of test post 2',
created_date: ISODate('2012-09-24T12:34:13Z'),
comments: 10
},
{
post_body: 'This is the body of test post 3',
created_date: ISODate('2012-08-16T12:34:13Z'),
comments: 10
}
]);
//db.so.find();
db.so.ensureIndex({"created_date":1});
db.runCommand({
aggregate:"so",
pipeline:[
{
$match: { // filter only those posts in september
created_date: { $gte: ISODate('2012-09-01'), $lt: ISODate('2012-10-01') }
}
},
{
$group: {
_id: null, // no shared key
comments: { $sum: "$comments" } // total comments for all the posts in the pipeline
}
},
]
//,explain:true
});
Result is;
{ "result" : [ { "_id" : null, "comments" : 58 } ], "ok" : 1 }
So you could also modify your previous example to do this, although I'm not sure why you'd want to, unless you plan on doing something else with month and year in the pipeline;
{
aggregate: 'posts',
pipeline: [
{$match: { created_date: { $gte: ISODate('2012-09-01'), $lt: ISODate('2012-10-01') } } },
{$project:
{
comments: 1,
month : {$month:'$created_date'},
year : {$year:'$created_date'}
}
},
{$match:
{
month:9,
year: 2012
}
},
{$group:
{_id: '0',
totalComments:{$sum:'$comments'}
}
}
]
}