Group and count using the aggregation framework - MongoDB

I'm trying to group and count the following structure:
[{
"_id" : ObjectId("5479c4793815a1f417f537a0"),
"status" : "canceled",
"date" : ISODate("2014-11-29T00:00:00.000Z"),
"offset" : 30,
"devices" : [
{
"name" : "Mouse",
"cost" : 150,
},
{
"name" : "Keyboard",
"cost" : 200,
}
],
},
{
"_id" : ObjectId("5479c4793815a1f417d557a0"),
"status" : "done",
"date" : ISODate("2014-10-20T00:00:00.000Z"),
"offset" : 30,
"devices" : [
{
"name" : "LCD",
"cost" : 150,
},
{
"name" : "Keyboard",
"cost" : 200,
}
],
},
{
"_id" : ObjectId("5479c4793815a1f417f117a0"),
"status" : "done",
"date" : ISODate("2014-12-29T00:00:00.000Z"),
"offset" : 30,
"devices" : [
{
"name" : "Headphones",
"cost" : 150,
},
{
"name" : "LCD",
"cost" : 200,
}
],
}]
I need to group and count them, something like this:
"result" : [
{
"_id" : {
"status" : "canceled"
},
"count" : 1
},
{
"_id" : {
"status" : "done"
},
"count" : 2
},
totaldevicecost: 730,
],
"ok" : 1
}
My problem is calculating the cost sum in the "devices" subarray. How can I do that?

It seems like you got a start on this but got lost on some of the other concepts. There are some basic truths when working with arrays in documents, but let's start where you left off:
db.sample.aggregate([
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 }
}}
])
So that is just going to use the $group pipeline stage to gather your documents by the different values of the "status" field, and also to produce another field, "count", which counts the occurrences of the grouping key by passing a value of 1 to the $sum operator for each document found. This puts you at a point much like you describe:
{ "_id" : "done", "count" : 2 }
{ "_id" : "canceled", "count" : 1 }
That's the first stage and easy enough to understand, but now you need to know how to get values out of an array. Once you understand the "dot notation" concept properly, you might be tempted to do something like this:
db.sample.aggregate([
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 },
"total": { "$sum": "$devices.cost" }
}}
])
But what you will find is that the "total" will in fact be 0 for each of those results:
{ "_id" : "done", "count" : 2, "total" : 0 }
{ "_id" : "canceled", "count" : 1, "total" : 0 }
Why? Well, MongoDB aggregation operations like this do not actually traverse array elements when grouping. In order to do that, the aggregation framework has a concept called $unwind. The name is relatively self-explanatory. An embedded array in MongoDB is much like having a "one-to-many" association between linked data sources. So what $unwind produces is exactly that sort of "join" result, where the resulting "documents" are based on the content of the array, with the parent information duplicated into each one.
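To illustrate, running just the { "$unwind": "$devices" } stage over the first sample document would produce one output document per array member, along these lines:
{
    "_id" : ObjectId("5479c4793815a1f417f537a0"),
    "status" : "canceled",
    "date" : ISODate("2014-11-29T00:00:00.000Z"),
    "offset" : 30,
    "devices" : { "name" : "Mouse", "cost" : 150 }
}
{
    "_id" : ObjectId("5479c4793815a1f417f537a0"),
    "status" : "canceled",
    "date" : ISODate("2014-11-29T00:00:00.000Z"),
    "offset" : 30,
    "devices" : { "name" : "Keyboard", "cost" : 200 }
}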
So in order to act on array elements you need to use $unwind first. This should logically lead you to code like this:
db.sample.aggregate([
{ "$unwind": "$devices" },
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 },
"total": { "$sum": "$devices.cost" }
}}
])
And then the result:
{ "_id" : "done", "count" : 4, "total" : 700 }
{ "_id" : "canceled", "count" : 2, "total" : 350 }
But that isn't quite right, is it? Remember what you just learned about $unwind and how it does a de-normalized join with the parent information? The parent is now duplicated for every array member, and each document had two of them. So while the "total" field is correct, the "count" is twice as much as it should be in each case.
A bit more care needs to be taken, so instead of doing this in a single $group stage, it is done in two:
db.sample.aggregate([
{ "$unwind": "$devices" },
{ "$group": {
"_id": "$_id",
"status": { "$first": "$status" },
"total": { "$sum": "$devices.cost" }
}},
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 },
"total": { "$sum": "$total" }
}}
])
Which now gets the result with correct totals in it:
{ "_id" : "canceled", "count" : 1, "total" : 350 }
{ "_id" : "done", "count" : 2, "total" : 700 }
Now the numbers are right, but it is still not exactly what you are asking for. I would think you should stop there as the sort of result you are expecting is really not suited to just a single result from aggregation alone. You are looking for the total to be "inside" the result. It really doesn't belong there, but on small data it is okay:
db.sample.aggregate([
{ "$unwind": "$devices" },
{ "$group": {
"_id": "$_id",
"status": { "$first": "$status" },
"total": { "$sum": "$devices.cost" }
}},
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 },
"total": { "$sum": "$total" }
}},
{ "$group": {
"_id": null,
"data": { "$push": { "count": "$count", "total": "$total" } },
"totalCost": { "$sum": "$total" }
}}
])
And the final result form:
{
"_id" : null,
"data" : [
{
"count" : 1,
"total" : 350
},
{
"count" : 2,
"total" : 700
}
],
"totalCost" : 1050
}
But, "Do Not Do That". MongoDB has a document limit on response of 16MB, which is a limitation of the BSON spec. On small results you can do this kind of convenience wrapping, but in the larger scheme of things you want the results in the earlier form and either a separate query or live with iterating the whole results in order to get the total from all documents.
You do appear to be using a MongoDB version less than 2.6, or copying output from a RoboMongo shell that does not support the latest version's features. From MongoDB 2.6, though, the result of aggregation can be a "cursor" rather than a single BSON array, so the overall response can be much larger than 16MB, but only when you are not compacting everything into a single document, as in the last example.
This is especially true in cases where you are "paging" the results, with hundreds to thousands of result lines, but you only want a "total" to return in an API response while returning a "page" of 25 results at a time.
Anyhow, that should give you a reasonable guide on how to get the type of results you are expecting from your common document form. Remember $unwind in order to process arrays, and generally $group multiple times in order to get totals at different grouping levels from your document and collection groupings.

Related

How to group documents on index of array elements?

I'm looking for a way to take data such as this
{ "_id" : 5, "count" : 1, "arr" : [ "aga", "dd", "a" ] },
{ "_id" : 6, "count" : 4, "arr" : [ "aga", "ysdf" ] },
{ "_id" : 7, "count" : 4, "arr" : [ "sad", "aga" ] }
I would like to sum the count based on the 1st item (index) of arr. In another aggregation I would like to do the same with the 1st and the 2nd items in the arr array.
I've tried using unwind, but that breaks up the data and the hierarchy is then lost.
I've also tried using
$group: {
_id: {
arr_0:'$arr.0'
},
total:{
$sum: '$count'
}
}
but the result is blank arrays
Actually you can't use dot notation to group your documents by an element at a specified index. To do that you have two options:
The first and optimal way is using the $arrayElemAt operator, new in MongoDB 3.2, which returns the element at a specified index in an array.
db.collection.aggregate([
{ "$group": {
"_id": { "$arrayElemAt": [ "$arr", 0 ] },
"count": { "$sum": 1 }
}}
])
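For the other aggregation mentioned in the question, grouping on both the 1st and 2nd items, the same operator can be used twice in a compound key. A minimal sketch along those lines (not from the original answer, field names as in the sample data):
db.collection.aggregate([
    { "$group": {
        "_id": {
            "first": { "$arrayElemAt": [ "$arr", 0 ] },
            "second": { "$arrayElemAt": [ "$arr", 1 ] }
        },
        "count": { "$sum": 1 }
    }}
])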
For MongoDB 3.0 and earlier you will need to de-normalise your array with $unwind, then $group by _id in a first stage and use the $first operator to return the first item in the array. From there you regroup your documents using that value and use $sum to get the count. But this will only work for the first and last index, because MongoDB also provides the $last operator.
db.collection.aggregate([
{ "$unwind": "$arr" },
{ "$group": {
"_id": "$_id",
"arr": { "$first": "$arr" }
}},
{ "$group": {
"_id": "$arr",
"count": { "$sum": 1 }
}}
])
which yields something like this:
{ "_id" : "sad", "count" : 1 }
{ "_id" : "aga", "count" : 2 }
To group using the element at an arbitrary position p in your array, your best option is the mapReduce function.
// Emit the element at the desired index as the grouping key (index 0 here).
var mapFunction = function() { emit(this.arr[0], 1); };
// Sum the emitted values for each key.
var reduceFunction = function(key, values) { return Array.sum(values); };
db.collection.mapReduce(mapFunction, reduceFunction, { "out": { "inline": 1 } })
Which returns:
{
"results" : [
{
"_id" : "aga",
"value" : 2
},
{
"_id" : "sad",
"value" : 1
}
],
"timeMillis" : 27,
"counts" : {
"input" : 3,
"emit" : 3,
"reduce" : 1,
"output" : 2
},
"ok" : 1
}
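If the index really does need to vary, one option (a sketch, not part of the original answer) is to pass the position in through the mapReduce "scope" option rather than hard-coding it:
var mapFunction = function() {
    // "p" is injected via the scope option below
    if (this.arr.length > p) emit(this.arr[p], 1);
};
var reduceFunction = function(key, values) { return Array.sum(values); };
db.collection.mapReduce(mapFunction, reduceFunction, {
    "out": { "inline": 1 },
    "scope": { "p": 1 }
})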

Match on key from two queries in a single query

I have time series data in mongodb as follows:
{
"_id" : ObjectId("558912b845cea070a982d894"),
"code" : "ZL0KOP",
"time" : NumberLong("1420128024000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d895"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128025000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d896"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128003000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d897"),
"code" : "ZL0KOP",
"time" : NumberLong("1420041724000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d89e"),
"code" : "YBUHCW",
"time" : NumberLong("1420041732000"),
"direction" : "10",
"siteId" : "0002"
}
{
"_id" : ObjectId("558912b845cea070a982d8a1"),
"code" : "U48AIW",
"time" : NumberLong("1420041729000"),
"direction" : "10",
"siteId" : "0002"
}
{
"_id" : ObjectId("558912b845cea070a982d8a0"),
"code" : "OJ3A06",
"time" : NumberLong("1420300927000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d89d"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420300885000"),
"direction" : "10",
"siteId" : "0003"
}
{
"_id" : ObjectId("558912b845cea070a982d8a2"),
"code" : "ZLV05H",
"time" : NumberLong("1420300922000"),
"direction" : "10",
"siteId" : "0001"
}
{
"_id" : ObjectId("558912b845cea070a982d8a3"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420300928000"),
"direction" : "10",
"siteId" : "0000"
}
The codes that match two or more conditions need to be filtered out.
For example:
condition1: 1420128000000 < time < 1420128030000,siteId == 0000
condition2: 1420300880000 < time < 1420300890000,siteId == 0003
results for the first condition:
{
"_id" : ObjectId("558912b845cea070a982d894"),
"code" : "ZL0KOP",
"time" : NumberLong("1420128024000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d895"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128025000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d896"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128003000"),
"direction" : "10",
"siteId" : "0000"
}
results for the second condition:
{
"_id" : ObjectId("558912b845cea070a982d89d"),
"code" : "AQ0ZSQ", "time" : NumberLong("1420300885000"),
"direction" : "10",
"siteId" : "0003"
}
The only code that matches all the conditions above should be:
{"code" : "AQ0ZSQ", "count":2}
"count" means, the code "AQ0ZSQ" appeared in both conditions
The only solution I can think of is using two queries. For example, using Python:
result1 = list(db.codes.objects({'time': {'$gt': 1420128000000, '$lt': 1420128030000}, 'siteId': '0000'}).only("code"))
result2 = list(db.codes.objects({'time': {'$gt': 1420300880000, '$lt': 1420300890000}, 'siteId': '0003'}).only("code"))
and then find the shared codes in both results.
The problem is that there are millions of documents in the collection, and both queries can easily exceed the 16MB limitation.
So is it possible to do that in one query? or should I change the document structure?
What you are asking for here requires the aggregation framework, in order to calculate on the server that there was an intersection between the two result sets.
The first part of the logic is that you need an $or query for the two conditions; then there is some additional projection and filtering on those results:
db.collection.aggregate([
// Fetch all possible documents for consideration
{ "$match": {
"$or": [
{
"time": { "$gt": 1420128000000, "$lt": 1420128030000 },
"siteId": "0000"
},
{
"time": { "$gt": 1420300880000, "$lt": 1420300890000 },
"siteId": "0003"
}
]
}},
// Logically compare the conditions against the results and add a score
{ "$project": {
"code": "$code",
"score": { "$add": [
{ "$cond": [
{ "$and":[
{ "$gt": [ "$time", 1420128000000 ] },
{ "$lt": [ "$time", 1420128030000 ] },
{ "$eq": [ "$siteId", "0000" ] }
]},
1,
0
]},
{ "$cond": [
{ "$and":[
{ "$gt": [ "$time", 1420300880000 ] },
{ "$lt": [ "$time", 1420300890000 ] },
{ "$eq": [ "$siteId", "0003" ] }
]},
1,
0
]}
]}
}},
// Now Group the results by "code"
{ "$group": {
"_id": "$code",
"score": { "$sum": "$score" }
}},
// Now filter to keep only results with score 2
{ "$match": { "score": 2 } }
])
So let's break that down and see how it works.
First you want a query with $match to get all the possible documents for "all" of your "intersection" conditions. That is what the $or expression provides here, since matched documents need only meet either set of conditions. You need all of them in order to work out the "intersection".
In the second pipeline stage, $project, a boolean test of your conditions is performed for each set. Notice that the usage of $and here, as with the other boolean operators of the aggregation framework, is slightly different from the query form.
In the aggregation framework form (outside of $match, which uses normal query operators) these operators take an array of arguments, typically representing "two" values for comparison, rather than the operation being assigned to the "right" of a field name.
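For example, the same range test written in each form (values taken from the first condition above):
// query form, as used inside $match
{ "time": { "$gt": 1420128000000, "$lt": 1420128030000 } }
// aggregation expression form, as used inside $project / $cond
{ "$and": [
    { "$gt": [ "$time", 1420128000000 ] },
    { "$lt": [ "$time", 1420128030000 ] }
]}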
Since these conditions are logical or "boolean", we want to return the result as a "numeric" value rather than true/false. That is what $cond does here: where the condition is true for the inspected document a score of 1 is emitted, otherwise 0.
Finally in this $project expression both of your conditions are wrapped with $add to form the "score" result. So if neither condition were true (not possible after the $match) the score would be 0, if "one" is true then 1, or where "both" are true then 2.
Note here that the specific conditions asked for will never score above 1 for a single document, since no single document can fall within both time ranges or carry "two" "siteId" values at once.
Now the important part is to $group by the "code" value and $sum the score value to get a total per "code".
This leaves the final $match filter stage of the pipeline to only keep those documents with a "score" value that is equal to the number of conditions you asked for. In this case 2.
There is a failing there, however: where the same "code" appears more than once in the matches for a single condition (as it does here), the "score" would be inflated and therefore incorrect.
So after the introduction to the principles of using logical operators in aggregation, you can fix that fault by essentially "tagging" each result logically as to which condition "set" it applies to. Then you can basically consider which "code" appeared in "both" sets in this case:
db.collection.aggregate([
{ "$match": {
"$or": [
{
"time": { "$gt": 1420128000000, "$lt": 1420128030000 },
"siteId": "0000"
},
{
"time": { "$gt": 1420300880000, "$lt": 1420300890000 },
"siteId": "0003"
}
]
}},
// If it's the first logical condition it's "A" otherwise it can
// only be the other, therefore "B". Extend for more sets as needed.
{ "$group": {
"_id": {
"code": "$code",
"type": { "$cond": [
{ "$and":[
{ "$gt": [ "$time", 1420128000000 ] },
{ "$lt": [ "$time", 1420128030000 ] },
{ "$eq": [ "$siteId", "0000" ] }
]},
"A",
"B"
]}
}
}},
// Simply add up the results for each "type"
{ "$group": {
"_id": "$_id.code",
"score": { "$sum": 1 }
}},
// Now filter to keep only results with score 2
{ "$match": { "score": 2 } }
])
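On the sample data above, that should leave just the one intersecting code, something like:
{ "_id" : "AQ0ZSQ", "score" : 2 }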
It might be a lot to take in if this is your first time using the aggregation framework. Please take the time to look at the operators used here and at the Aggregation Pipeline Operators documentation in general.
Beyond simple data selection, this is the tool you should be reaching for most often when using MongoDB, so you would do well to understand all the operations that are possible.

Aggregation framework flatten subdocument data with parent document

I am building a dashboard that rotates between different webpages. I want to pull all slides that are part of the "Test" deck and order them appropriately. After the query my result would ideally look like:
[
{ "url" : "http://10.0.1.187", "position": 1, "duartion": 10 },
{ "url" : "http://10.0.1.189", "position": 2, "duartion": 3 }
]
I currently have a dataset that looks like the following:
{
"_id" : ObjectId("53a612043c24d08167b26f82"),
"url" : "http://10.0.1.189",
"decks" : [
{
"title" : "Test",
"position" : 2,
"duration" : 3
}
]
}
{
"_id" : ObjectId("53a6103e3c24d08167b26f81"),
"decks" : [
{
"title" : "Test",
"position" : 1,
"duration" : 2
},
{
"title" : "Other Deck",
"position" : 1,
"duration" : 10
}
],
"url" : "http://10.0.1.187"
}
My attempted query looks like:
db.slides.aggregate([
{
"$match": {
"decks.title": "Test"
}
},
{
"$sort": {
"decks.position": 1
}
},
{
"$project": {
"_id": 0,
"position": "$decks.position",
"duration": "$decks.duration",
"url": 1
}
}
]);
But it does not yield my desired results. How can I query my dataset and get my expected results in an optimal way?
Well, to truly "flatten" the document as your title suggests, $unwind is always going to be employed, as there really is no other way to do that. There are, however, some different approaches if you can live with the array being filtered down to the matching element.
Basically speaking, if you really only have one thing to match in the array then your fastest approach is to simply use .find() matching the required element and projecting:
db.slides.find(
{ "decks.title": "Test" },
{ "decks.$": 1 }
).sort({ "decks.position": 1 }).pretty()
That is still an array but as long as you have only one element that matches then this does work. Also the items are sorted as expected, though of course the "title" field is not dropped from the matched documents, as that is beyond the possibilities for simple projection.
{
"_id" : ObjectId("53a6103e3c24d08167b26f81"),
"decks" : [
{
"title" : "Test",
"position" : 1,
"duration" : 2
}
]
}
{
"_id" : ObjectId("53a612043c24d08167b26f82"),
"decks" : [
{
"title" : "Test",
"position" : 2,
"duration" : 3
}
]
}
Another approach, as long as you have MongoDB 2.6 or greater available, is using the $map operator and some others in order to both "filter" and re-shape the array "in-place" without actually applying $unwind:
db.slides.aggregate([
{ "$project": {
"url": 1,
"decks": {
"$setDifference": [
{
"$map": {
"input": "$decks",
"as": "el",
"in": {
"$cond": [
{ "$eq": [ "$$el.title", "Test" ] },
{
"position": "$$el.position",
"duration": "$$el.duration"
},
false
]
}
}
},
[false]
]
}
}},
{ "$sort": { "decks.position": 1 }}
])
The advantage there is that you make the changes without "unwinding", which can reduce processing time with large arrays, as you are not creating new documents for every array member and then needing a separate $match stage to "filter" or another $project to reshape.
{
"_id" : ObjectId("53a6103e3c24d08167b26f81"),
"decks" : [
{
"position" : 1,
"duration" : 2
}
],
"url" : "http://10.0.1.187"
}
{
"_id" : ObjectId("53a612043c24d08167b26f82"),
"url" : "http://10.0.1.189",
"decks" : [
{
"position" : 2,
"duration" : 3
}
]
}
You can again either live with the "filtered" array, or if you want you can truly "flatten" this by adding an additional $unwind. Here you do not need to filter with $match, as the result already contains only the matched items.
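For example, a minimal sketch (building on the $map pipeline above, not taken verbatim from the original answer) that appends those extra stages to produce fully flat documents:
db.slides.aggregate([
    { "$project": {
        "url": 1,
        "decks": {
            "$setDifference": [
                { "$map": {
                    "input": "$decks",
                    "as": "el",
                    "in": { "$cond": [
                        { "$eq": [ "$$el.title", "Test" ] },
                        { "position": "$$el.position", "duration": "$$el.duration" },
                        false
                    ]}
                }},
                [false]
            ]
        }
    }},
    // no $match needed here: only matching items remain in the filtered array
    { "$unwind": "$decks" },
    { "$sort": { "decks.position": 1 } },
    { "$project": {
        "_id": 0,
        "url": 1,
        "position": "$decks.position",
        "duration": "$decks.duration"
    }}
])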
But generally speaking if you can live with it then just use .find() as it will be the fastest way. Otherwise what you are doing is fine for small data, or there is the other option for consideration.
Well as soon as I posted I realized I should be using an $unwind. Is this query the optimal way to do it, or can it be done differently?
db.slides.aggregate([
{
"$unwind": "$decks"
},
{
"$match": {
"decks.title": "Test"
}
},
{
"$sort": {
"decks.position": 1
}
},
{
"$project": {
"_id": 0,
"position": "$decks.position",
"duration": "$decks.duration",
"url": 1
}
}
]);
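For reference, on the two sample documents that pipeline should produce something like:
{ "position" : 1, "duration" : 2, "url" : "http://10.0.1.187" }
{ "position" : 2, "duration" : 3, "url" : "http://10.0.1.189" }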

MongoDB nested grouping

I have the following MongoDB data model:
{
"_id" : ObjectId("53725814740fd6d2ee0ca2bb"),
"date" : "2014-01-01",
"establishmentId" : 1,
"products" : [
{
"productId" : 1,
"price" : 7.03,
"someOtherInfo" : 325,
"somethingElse" : 6878
},
{
"productId" : 2,
"price" : 4.6,
"someOtherInfo" : 243,
"somethingElse" : 1757
},
{
"productId" : 3,
"price" : 2.14,
"someOtherInfo" : 610,
"somethingElse" : 5435
},
{
"productId" : 4,
"price" : 1.45,
"someOtherInfo" : 627,
"somethingElse" : 5762
},
{
"productId" : 5,
"price" : 3.9,
"someOtherInfo" : 989,
"somethingElse" : 3752
}
]
}
What is the fastest way to get the average price across all establishments? Is there a better data model to achieve this?
An aggregation operation should handle this well. I'd suggest looking into the $unwind operation.
Something along these lines should work (just as an example):
db.collection.aggregate(
{$match: {<query parameters>}},
{$unwind: "$products"},
{
$group: {
_id: "<blank or field(s) to group by before averaging>",
$avg: "$price"
}
}
);
An aggregation built in this style should produce a JSON object that has the data you want.
Due to the gross syntax errors in the other answer provided, the more direct answer is:
db.collection.aggregate([
{ "$unwind": "$products" },
{ "$group": {
"_id": null,
"avgprice": { "$avg": "$products.price" }
}}
])
The usage of the aggregation framework here is to first $unwind the array, which is a way to "de-normalize" the content in the array into separate documents.
Then in the $group stage you pass a value of null to the _id, which means "group everything", and pass "$products.price" (note the dot notation) into the $avg operator to return the average value across all of the sub-document entries in all of the documents in the collection.
See the full operator reference for more information.
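On the single sample document shown above (prices 7.03, 4.6, 2.14, 1.45 and 3.9) that should return something like:
{ "_id" : null, "avgprice" : 3.824 }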
The best solution I found was:
db.collection.aggregate([
{ "$match": { "date": { "$gte": "2014-01-01", "$lte": "2014-01-31" }, "establishmentId": { "$in": [1,2,3,4,5,6] } } },
{ "$unwind": "$products" },
{ "$group": {
"_id": {date:"$date",product:"$products.productId"},
"avgprice": { "$avg": "$products.price" }
}}
])
Something I also found out is that it is much better to $match first and then $unwind, so there are fewer items to unwind; this results in a faster overall process.

Obtaining $group result with group count

Assuming I have a collection called "posts" (in reality it is a more complex collection, posts is too simple) with the following structure:
> db.posts.find()
{ "_id" : ObjectId("50ad8d451d41c8fc58000003"), "title" : "Lorem ipsum", "author" :
"John Doe", "content" : "This is the content", "tags" : [ "SOME", "RANDOM", "TAGS" ] }
I expect this collection to span hundreds of thousands, perhaps millions, of documents. I need to query for posts by tags, group the results by tag, and display the results paginated. This is where the aggregation framework comes in. I plan to use the aggregate() method to query the collection:
db.posts.aggregate([
{ "$unwind" : "$tags" },
{ "$group" : {
_id: { tag: "$tags" },
count: { $sum: 1 }
} }
]);
The catch is that to create the paginator I would need to know the length of the output array. I know that to do that you can do:
db.posts.aggregate([
{ "$unwind" : "$tags" },
{ "$group" : {
_id: { tag: "$tags" },
count: { $sum: 1 }
} },
{ "$group" : {
_id: null,
total: { $sum: 1 }
} }
]);
But that would discard the output from the previous pipeline stage (the first group). Is there a way the two operations can be combined while preserving each stage's output? I know that the output of the whole aggregate operation can be cast to an array in some languages and have its contents counted, but there is a possibility that the pipeline output may exceed the 16MB limit. Also, performing the same query just to obtain the count seems like a waste.
So is obtaining the document result and count at the same time possible? Any help is appreciated.
Use $project to save the tag and count into tmp.
Use $push or $addToSet to store tmp into your data list.
Code:
db.test.aggregate([
    { "$unwind": "$tags" },
    { "$group": { "_id": "$tags", "count": { "$sum": 1 } } },
    { "$project": { "tmp": { "tag": "$_id", "count": "$count" } } },
    { "$group": { "_id": null, "total": { "$sum": 1 }, "data": { "$addToSet": "$tmp" } } }
])
Output:
{
"result" : [
{
"_id" : null,
"total" : 5,
"data" : [
{
"tag" : "SOME",
"count" : 1
},
{
"tag" : "RANDOM",
"count" : 2
},
{
"tag" : "TAGS1",
"count" : 1
},
{
"tag" : "TAGS",
"count" : 1
},
{
"tag" : "SOME1",
"count" : 1
}
]
}
],
"ok" : 1
}
I'm not sure you need the aggregation framework for this, other than for counting all the tags, e.g.:
db.posts.aggregate([
    { "$unwind" : "$tags" },
    { "$group" : {
        _id: { tag: "$tags" },
        count: { $sum: 1 }
    } }
]);
For paginating through posts per tag you can just use the normal query syntax, like so:
db.posts.find({tags: "RANDOM"}).skip(10).limit(10)
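If you also need the total number of posts for a given tag to build the paginator, one simple option (a sketch, not part of the original answer) is a separate count over the same filter, which ignores skip and limit:
db.posts.find({ tags: "RANDOM" }).count()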