Find total based on group by of two MongoDB fields - mongodb

I have collection data like this -
{
"user_id" : "1",
"branch_id" : "1",
"total" : 100,
},
{
"user_id" : "1",
"branch_id" : "1",
"total" : 200
},
{
"user_id" : "1",
"branch_id" : "3",
"total" : 1400
},
{
"user_id" : "2",
"branch_id" : "1",
"total" : 100
},
{
"user_id" : "2",
"branch_id" : "1",
"total" : 100
},
I am looking to get output in the below format -
[
{
"user_id":"1",
"branch_id":"1",
"grand_total":"300"
},
{
"user_id":"1",
"branch_id":"3",
"grand_total":"1400"
},
{
"user_id":"2",
"branch_id":"1",
"grand_total":"200"
}
]
I have tried a mongo aggregate query, but the query gives output as undefined.
Basically, I need the total points each user has earned, per user and per branch.
Here is what I have tried, but it is not working -
Collection.aggregate(
{
"$group": {
"_id": "$user_id",
"nameCount": { "$sum": 1 },
"branch_id": {
"$sum": {
"$cond": [ {"$branch_id":{"$ne":null}} ]
}
}
}
},
{
"$project": {
"_id": 0,
"name": "$_id",
"nameCount": 1,
"branch_id":1
}
}
);
Please help.

Your aggregation pipeline has to look like this:
{
"$group": {
"_id": {
user_id: "$user_id",
branch_id: "$branch_id"
},
"grand_total": {
"$sum": "$total"
},
}
}, {
"$project": {
"_id": 0,
"user_id": "$_id.user_id",
"branch_id": "$_id.branch_id",
"total": "$grand_total"
}
}
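Put together as a complete call (a sketch assuming a Mongoose-style model named Collection, as in the question; the exact invocation depends on your driver version), this looks like:
Collection.aggregate([
  { "$group": {
    "_id": { "user_id": "$user_id", "branch_id": "$branch_id" },
    "grand_total": { "$sum": "$total" }
  }},
  { "$project": {
    "_id": 0,
    "user_id": "$_id.user_id",
    "branch_id": "$_id.branch_id",
    "grand_total": "$grand_total"
  }}
], function (err, result) {
  if (err) throw err;
  console.log(result); // e.g. [{ user_id: "1", branch_id: "1", grand_total: 300 }, ...]
});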
Inside the _id field of your "$group" stage you add the fields that you want to group your documents by. If you only want to group by one field, you can write it as follows:
{"$group": {
"_id": "$user_id"
}
}
If you have multiple fields you want to group by (as it seems in your case), then you write it as follows:
{"$group": {
"_id": {
user_id: "$user_id",
branch_id: "$branch_id"
}
}
}
Every aggregation stage changes your documents. So if, in your $group stage, you store the sum of all totals in a field called "grand_total":
"grand_total": {
"$sum": "$total"
}
then in your $project stage the original total field doesn't exist anymore; instead there is a new field (grand_total) that holds the sum.
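To make this concrete, for the sample data in the question the documents coming out of the $group stage should look roughly like this, before $project reshapes them:
{ "_id" : { "user_id" : "1", "branch_id" : "1" }, "grand_total" : 300 }
{ "_id" : { "user_id" : "1", "branch_id" : "3" }, "grand_total" : 1400 }
{ "_id" : { "user_id" : "2", "branch_id" : "1" }, "grand_total" : 200 }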

Related

In MongoDB how to get the max/min/avg/count of a single matched element occurring inside an embedded array?

I have a MongoDB collection as below, and I want to get the min/max/avg/count of the xxx field inside all documents that $match: { "Parsed.FileId": "421462559", "Parsed.MessageId": "123" }.
Note that the Fields array of each document contains only a single element with SchemaName = xxx.
Is this possible with MongoDB aggregation (or another feature), and how?
{
"_id" : NumberLong(409),
"Parsed" : {
"FileId" : "421462559",
"MessageId": "123",
"Fields" : [
{
"SchemaName" : "xxx",
"Type" : 0,
"Value" : 6
},
{
"SchemaName" : "yyy",
"Type" : 0,
"Value" : 5
}
]
}
},
{
"_id" : NumberLong(510),
"Parsed" : {
"FileId" : "421462559",
"MessageId": "123",
"Fields" : [
{
"SchemaName" : "xxx",
"Type" : 0,
"Value" : 10
},
{
"SchemaName" : "yyy",
"Type" : 0,
"Value" : 20
}
]
}
}
For the example collection above, I expect to get the result for field xxx as:
{
count: 2,
min: 6,
max: 10,
avg: 8
}
You can use the aggregation below.
Basically, you need to first $unwind the nested array and then use $group with the corresponding accumulators, i.e. $min, $max, $sum and $avg. Note that $match appears twice: the first one filters whole documents before the $unwind (and can use an index), while the second keeps only the unwound array elements whose SchemaName is xxx.
db.collection.aggregate([
{ "$match": { "Parsed.Fields.SchemaName": "xxx" }},
{ "$unwind": "$Parsed.Fields" },
{ "$match": { "Parsed.Fields.SchemaName": "xxx" }},
{ "$group": {
"_id": null,
"count": { "$sum": 1 },
"max": { "$max": "$Parsed.Fields.Value" },
"min": { "$min": "$Parsed.Fields.Value" },
"avg": { "$avg": "$Parsed.Fields.Value" }
}}
])
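Since the question states that each document's Fields array contains at most one xxx element, here is a sketch of an alternative without $unwind, using $filter, $map and $arrayElemAt to extract the single matching value per document (an illustration only, not the accepted answer):
db.collection.aggregate([
  { "$match": { "Parsed.Fields.SchemaName": "xxx" }},
  { "$project": {
    // pick out the Value of the single element whose SchemaName is "xxx"
    "xxxValue": {
      "$arrayElemAt": [
        { "$map": {
          "input": {
            "$filter": {
              "input": "$Parsed.Fields",
              "cond": { "$eq": [ "$$this.SchemaName", "xxx" ] }
            }
          },
          "in": "$$this.Value"
        }},
        0
      ]
    }
  }},
  { "$group": {
    "_id": null,
    "count": { "$sum": 1 },
    "max": { "$max": "$xxxValue" },
    "min": { "$min": "$xxxValue" },
    "avg": { "$avg": "$xxxValue" }
  }}
])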

Is it recommended to use unwind when working with a large amount of data with nested documents in MongoDB [duplicate]

I am new to MongoDB and trying to work with nested documents. I have a query as below:
db.EndpointData.aggregate([
{ "$group" : { "_id" : "$EndpointId", "RequestCount" : { "$sum" : 1 }, "FirstActivity" : { "$min" : "$DateTime" }, "LastActivity" : { "$max" : "$DateTime" }, "Tags" : { "$push" : "$Tags" } } },
{ "$unwind" : "$Tags" },
{ "$unwind" : "$Tags" },
{ "$group" : { "_id" : "$_id", "RequestCount" : { "$first" : "$RequestCount" }, "Tags" : { "$push" : "$Tags" }, "FirstActivity" : { "$first" : "$FirstActivity" }, "LastActivity" : { "$first" : "$LastActivity" } } },
{ "$unwind" : "$Tags" },
{ "$unwind" : "$Tags.Sensors" },
{ "$group" : { "_id" : { "EndpointId" : "$_id", "Uid" : "$Tags.Uid", "Type" : "$Tags.Sensors.Type" }, "RequestCount" : { "$first" : "$RequestCount" }, "FirstActivity" : { "$first" : "$FirstActivity" }, "LastActivity" : { "$first" : "$LastActivity" } } },
{ "$group" : { "_id" : { "EndpointId" : "$_id.EndpointId", "Uid" : "$_id.Uid" }, "count" : { "$sum" : 1 }, "RequestCount" : { "$first" : "$RequestCount" }, "FirstActivity" : { "$first" : "$FirstActivity" }, "LastActivity" : { "$first" : "$LastActivity" } } },
{ "$group" : { "_id" : "$_id.EndpointId", "TagCount" : { "$sum" : 1 }, "SensorCount" : { "$sum" : "$count" }, "RequestCount" : { "$first" : "$RequestCount" }, "FirstActivity" : { "$first" : "$FirstActivity" }, "LastActivity" : { "$first" : "$LastActivity" } } }])
and my data structure is as below
{
"_id": "6aef51dfaf42ea1b70d0c4db",
"EndpointId": "98799bcc-e86f-4c8a-b340-8b5ed53caf83",
"DateTime": "2018-05-06T19:05:02.666Z",
"Url": "test",
"Tags": [
{
"Uid": "C1:3D:CA:D4:45:11",
"Type": 1,
"DateTime": "2018-05-06T19:05:02.666Z",
"Sensors": [
{
"Type": 1,
"Value": { "$numberDecimal": "-95" }
},
{
"Type": 2,
"Value": { "$numberDecimal": "-59" }
},
{
"Type": 3,
"Value": { "$numberDecimal": "11.029802536740132" }
}
]
},
{
"Uid": "C1:3D:CA:D4:45:11",
"Type": 1,
"DateTime": "2018-05-06T19:05:02.666Z",
"Sensors": [
{
"Type": 1,
"Value": { "$numberDecimal": "-92" }
},
{
"Type": 2,
"Value": { "$numberDecimal": "-59" }
}
]
}
]
}
This query works fine and gives correct results. I count the Tags, the Sensors, and the number of occurrences of each EndpointId. The problem is that when I work with a large amount of data (about 10,000,000 documents) I run into memory problems. It seems that having four levels of $unwind causes the problem in this query. How can I reduce the number of $unwind stages in this query?
As long as your data has unique sensor and tag readings per document, which is what you have presented so far, then you simply don't need $unwind at all.
In fact, all you really need is a single $group:
db.endpoints.aggregate([
  // In reality you would $match to limit the selection of documents
  { "$match": {
    "DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
  }},
  { "$group": {
    "_id": "$EndpointId",
    "FirstActivity" : { "$min" : "$DateTime" },
    "LastActivity" : { "$max" : "$DateTime" },
    "RequestCount": { "$sum": 1 },
    "TagCount": {
      "$sum": {
        "$size": { "$setUnion": ["$Tags.Uid",[]] }
      }
    },
    "SensorCount": {
      "$sum": {
        "$sum": {
          "$map": {
            "input": { "$setUnion": ["$Tags.Uid",[]] },
            "as": "tag",
            "in": {
              "$size": {
                "$reduce": {
                  "input": {
                    "$filter": {
                      "input": {
                        "$map": {
                          "input": "$Tags",
                          "in": {
                            "Uid": "$$this.Uid",
                            "Type": "$$this.Sensors.Type"
                          }
                        }
                      },
                      "cond": { "$eq": [ "$$this.Uid", "$$tag" ] }
                    }
                  },
                  "initialValue": [],
                  "in": { "$setUnion": [ "$$value", "$$this.Type" ] }
                }
              }
            }
          }
        }
      }
    }
  }}
])
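To make the nested SensorCount expression easier to follow, here is the same per-document computation written as plain JavaScript (an illustrative sketch only; the helper name is hypothetical and this is not part of the pipeline):
// For one document: for each distinct Tag Uid, count the distinct Sensor Types
// recorded under that Uid, then sum those counts.
function sensorCountForDocument(doc) {
  const uids = [...new Set(doc.Tags.map(t => t.Uid))];
  return uids
    .map(uid => {
      const types = doc.Tags
        .filter(t => t.Uid === uid)
        .reduce((acc, t) => acc.concat(t.Sensors.map(s => s.Type)), []);
      return new Set(types).size;
    })
    .reduce((a, b) => a + b, 0);
}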
Or, if you actually do need to accumulate those "unique" values of "Sensors" and "Tags" from across different documents, then you still need initial $unwind statements to get the right grouping, but nowhere near as many as you presently have:
db.endpoints.aggregate([
// In reality you would $match to limit the selection of documents
{ "$match": {
"DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
}},
{ "$unwind": "$Tags" },
{ "$unwind": "$Tags.Sensors" },
{ "$group": {
"_id": {
"EndpointId": "$EndpointId",
"Uid": "$Tags.Uid",
"Type": "$Tags.Sensors.Type"
},
"FirstActivity": { "$min": "$DateTime" },
"LastActivity": { "$max": "$DateTime" },
"RequestCount": { "$addToSet": "$_id" }
}},
{ "$group": {
"_id": {
"EndpointId": "$_id.EndpointId",
"Uid": "$_id.Uid",
},
"FirstActivity": { "$min": "$FirstActivity" },
"LastActivity": { "$max": "$LastActivity" },
"count": { "$sum": 1 },
"RequestCount": { "$addToSet": "$RequestCount" }
}},
{ "$group": {
"_id": "$_id.EndpointId",
"FirstActivity": { "$min": "$FirstActivity" },
"LastActivity": { "$max": "$LastActivity" },
"TagCount": { "$sum": 1 },
"SensorCount": { "$sum": "$count" },
"RequestCount": { "$addToSet": "$RequestCount" }
}},
{ "$addFields": {
"RequestCount": {
"$size": {
"$reduce": {
"input": {
"$reduce": {
"input": "$RequestCount",
"initialValue": [],
"in": { "$setUnion": [ "$$value", "$$this" ] }
}
},
"initialValue": [],
"in": { "$setUnion": [ "$$value", "$$this" ] }
}
}
}
}}
],{ "allowDiskUse": true })
And from MongoDB 4.0 you can use $toString on the ObjectId within _id and simply merge the unique keys for those in order to keep the RequestCount, using $mergeObjects. This is cleaner and a bit more scalable than pushing nested array content and flattening it:
db.endpoints.aggregate([
// In reality you would $match to limit the selection of documents
{ "$match": {
"DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
}},
{ "$unwind": "$Tags" },
{ "$unwind": "$Tags.Sensors" },
{ "$group": {
"_id": {
"EndpointId": "$EndpointId",
"Uid": "$Tags.Uid",
"Type": "$Tags.Sensors.Type"
},
"FirstActivity": { "$min": "$DateTime" },
"LastActivity": { "$max": "$DateTime" },
"RequestCount": {
"$mergeObjects": {
"$arrayToObject": [[{ "k": { "$toString": "$_id" }, "v": 1 }]]
}
}
}},
{ "$group": {
"_id": {
"EndpointId": "$_id.EndpointId",
"Uid": "$_id.Uid",
},
"FirstActivity": { "$min": "$FirstActivity" },
"LastActivity": { "$max": "$LastActivity" },
"count": { "$sum": 1 },
"RequestCount": { "$mergeObjects": "$RequestCount" }
}},
{ "$group": {
"_id": "$_id.EndpointId",
"FirstActivity": { "$min": "$FirstActivity" },
"LastActivity": { "$max": "$LastActivity" },
"TagCount": { "$sum": 1 },
"SensorCount": { "$sum": "$count" },
"RequestCount": { "$mergeObjects": "$RequestCount" }
}},
{ "$addFields": {
"RequestCount": {
"$size": {
"$objectToArray": "$RequestCount"
}
}
}}
],{ "allowDiskUse": true })
Either form returns the same data, though the order of keys in the result may vary:
{
"_id" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",
"FirstActivity" : ISODate("2018-05-06T19:05:02.666Z"),
"LastActivity" : ISODate("2018-05-06T19:05:02.666Z"),
"RequestCount" : 2,
"TagCount" : 4,
"SensorCount" : 16
}
The result is obtained from these sample documents, which were originally given as the sample source in the original question on the topic:
{
"_id" : ObjectId("5aef51dfaf42ea1b70d0c4db"),
"EndpointId" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",
"DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
"Url" : "test",
"Tags" : [
{
"Uid" : "C1:3D:CA:D4:45:11",
"Type" : 1,
"DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
"Sensors" : [
{
"Type" : 1,
"Value" : NumberDecimal("-95")
},
{
"Type" : 2,
"Value" : NumberDecimal("-59")
},
{
"Type" : 3,
"Value" : NumberDecimal("11.029802536740132")
},
{
"Type" : 4,
"Value" : NumberDecimal("27.25")
},
{
"Type" : 6,
"Value" : NumberDecimal("2924")
}
]
},
{
"Uid" : "C1:3D:CA:D4:45:11",
"Type" : 1,
"DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
"Sensors" : [
{
"Type" : 1,
"Value" : NumberDecimal("-95")
},
{
"Type" : 2,
"Value" : NumberDecimal("-59")
},
{
"Type" : 3,
"Value" : NumberDecimal("11.413037961112279")
},
{
"Type" : 4,
"Value" : NumberDecimal("27.25")
},
{
"Type" : 6,
"Value" : NumberDecimal("2924")
}
]
},
{
"Uid" : "E5:FA:2A:35:AF:DD",
"Type" : 1,
"DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
"Sensors" : [
{
"Type" : 1,
"Value" : NumberDecimal("-97")
},
{
"Type" : 2,
"Value" : NumberDecimal("-58")
},
{
"Type" : 3,
"Value" : NumberDecimal("10.171658037099185")
}
]
}
]
}
/* 2 */
{
"_id" : ObjectId("5aef51e0af42ea1b70d0c4dc"),
"EndpointId" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",
"Url" : "test",
"Tags" : [
{
"Uid" : "E2:02:00:18:DA:40",
"Type" : 1,
"DateTime" : ISODate("2018-05-06T19:05:04.574Z"),
"Sensors" : [
{
"Type" : 1,
"Value" : NumberDecimal("-98")
},
{
"Type" : 2,
"Value" : NumberDecimal("-65")
},
{
"Type" : 3,
"Value" : NumberDecimal("7.845424441900629")
},
{
"Type" : 4,
"Value" : NumberDecimal("0.0")
},
{
"Type" : 6,
"Value" : NumberDecimal("3012")
}
]
},
{
"Uid" : "12:3B:6A:1A:B7:F9",
"Type" : 1,
"DateTime" : ISODate("2018-05-06T19:05:04.574Z"),
"Sensors" : [
{
"Type" : 1,
"Value" : NumberDecimal("-95")
},
{
"Type" : 2,
"Value" : NumberDecimal("-59")
},
{
"Type" : 3,
"Value" : NumberDecimal("12.939770381907275")
}
]
}
]
}
The bottom line is that you can either use the first form given here, which accumulates "within each document" and then "accumulates per endpoint" within a single stage and is the most optimal, or you actually need to identify things like the "Uid" on the tags or the "Type" on the sensors where those values occur more than once over any combination of documents grouped by the endpoint.
Your sample data supplied to date only shows that these values are "unique within each document", therefore the first given form would be most optimal if this is the case for all remaining data.
In the event that it is not, then "unwinding" the two nested arrays in order to "aggregate the detail across documents" is the only way to approach this. You can limit the date range or other criteria as most "queries" typically have some bounds and do not actually work on the "whole" collection data, but the main fact remains that arrays would be "unwound" creating essentially a document copy for every array member.
The point on optimization is that you only need to do this "twice", as there are only two arrays. Doing successive $group to $unwind to $group stages is always a sure sign you are doing something really wrong. Once you "take something apart" you should only ever need to "put it back together" once. The series of graded steps demonstrated here is that do-it-once approach, which is what optimizes the pipeline.
Outside of the scope of your question still remains:
Add other realistic constraints to the query to reduce the documents processed, maybe even do so in "batches" and combine results
Add the allowDiskUse option to the pipeline to let temporary storage be used (actually demonstrated in the commands above).
Consider that "nested arrays" are probably not the best storage method for the analysis you want to do. It's always more efficient when you know you need to $unwind to simply write the data in that "unwound" form directly into a collection.
If you're dealing with data on the order of 10,000,000 documents, you're going to run into aggregation pipeline size limits easily. Specifically, according to the MongoDB documentation, there is a pipeline RAM use limit of 100MB. If each document has as little as 10 bytes of data, 10,000,000 of them are already enough to hit that limit, and your documents will absolutely exceed that amount.
There are a few options available to you to resolve this problem:
1) You can use the allowDiskUse option as noted in the documentation.
2) You can project your documents further between unwind stages to limit document size (very unlikely to be enough on its own).
3) You can periodically generate summary documents on subsets of your data, and then perform your aggregations on those summary documents (a rough sketch follows at the end of this answer). If, for example, you run summary documents on subsets of size 1,000, you can reduce the number of documents in your pipelines from 10,000,000 to just 10,000.
4) You can look into sharding your collection and running these aggregate operations on a cluster to reduce the load on any single server.
Options 1 and 2 are both very short-term solutions. They're easy to implement, but won't help much in the long run. Options 3 and 4, however, are far more involved and trickier to implement, but will provide the greatest amount of scalability and are more likely to continue meeting your needs long-term.
Do be warned, however, that if you plan to approach option 4, you need to be very prepared. A sharded collection cannot be unsharded, and messing up can cause potentially irreparable data loss. Having a dedicated DBA with experience with MongoDB clusters is recommended.
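As referenced in option 3, here is a rough sketch of a periodic pre-aggregation job (the endpoint_daily_summary collection name is hypothetical, this assumes DateTime is stored as a real date, and $merge requires MongoDB 4.2+; additive metrics such as counts roll up cleanly this way, while distinct Tag/Sensor counts need more care):
db.EndpointData.aggregate([
  { "$match": {
    "DateTime": { "$gte": ISODate("2018-05-06"), "$lt": ISODate("2018-05-07") }
  }},
  { "$group": {
    "_id": {
      "EndpointId": "$EndpointId",
      "day": { "$dateToString": { "format": "%Y-%m-%d", "date": "$DateTime" } }
    },
    "RequestCount": { "$sum": 1 },
    "FirstActivity": { "$min": "$DateTime" },
    "LastActivity": { "$max": "$DateTime" }
  }},
  // upsert one summary document per endpoint per day
  { "$merge": { "into": "endpoint_daily_summary", "whenMatched": "replace", "whenNotMatched": "insert" } }
], { "allowDiskUse": true })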

MongoDB combining group aggregation and strLenBytes

I have a Mongo collection with documents like this:
{
"_id" : ObjectId("5a9d0d44c3a1ce5f14c6940a"),
"topic_id" : "5a7af30613b79405643e7da1",
"value" : "VMware Virtual Platform",
"timestamp" : "2018-03-05 09:26:25.136546",
"insert_ts" : "2018-03-05 09:26:25.136682",
"inserted_by" : 1
},
{
"_id" : ObjectId("5a9d0d44c3a1ce5f14c69409"),
"topic_id" : "5a7af30713b79479f82b4b84",
"value" : "VMware, Inc.",
"timestamp" : "2018-03-05 09:26:25.118931",
"insert_ts" : "2018-03-05 09:26:25.119081",
"inserted_by" : 1
},
{
"_id" : ObjectId("5a9d0d44c3a1ce5f14c69408"),
"topic_id" : "5a7af30713b7946d6d0a8772",
"value" : "Phoenix Technologies LTD 6.00 09/21/2015",
"timestamp" : "2018-03-05 09:26:25.101624",
"insert_ts" : "2018-03-05 09:26:25.101972",
"inserted_by" : 1
}
I would like to fetch some aggregated data from this collection. I want to know the oldest timestamp, the document count and the total string length of all values, grouped by topic_id, where the document _id is greater than x.
In MySQL, I would write SQL like this:
SELECT
MAX(_id) as max_id,
COUNT(*) as message_count,
MIN(timestamp) as min_timestamp,
LENGTH(GROUP_CONCAT(value)) as size
FROM `dev_topic_data_numeric`
WHERE _id > 22000
GROUP BY topic_id
How do I achieve this in MongoDB? I already tried to build it like this:
db.getCollection('topic_data_text').aggregate(
[
{
"$match":
{
"_id": {"$gte": ObjectId("5a9d0aefc3a1ce5f14c68c81") }
}
},
{
"$group":
{
"_id": "$topic_id",
"max_id": {"$max":"$_id"},
"min_timestamp": {"$min": "$timestamp"},
"message_count": {"$sum": 1},
/*"size": {"$strLenBytes": "$value" }*/
}
}
]
);
When I uncomment $strLenBytes it crashes, saying that strLenBytes is not a group operator. The MongoDB API documentation does not help me here. How do I have to write this to get the string length?
My expected result should look like this:
{
"_id" : "5a7af30613b79405643e7da1",
"max_id" : ObjectId("5a9d0d44c3a1ce5f14c6940a"),
"min_timestamp" : "2018-03-05 09:26:25.136546",
"message_count" : 1,
"size" : 23,
}
My MongoDB version is 3.4.4.
This is because $strLenBytes is not an accumulator, unlike $sum or $max. The $group stage accumulates values, so the operators that are valid in the $group stage are typically accumulators.
$strLenBytes converts one value to another in a 1-1 fashion. This is typically an operator for the $project stage.
Adding a $project stage in your aggregation should give you the result you require. Note that you would also need to modify the $group stage slightly to pass on the required values:
> db.test.aggregate([
{
"$match":
{
"_id": {"$gte": ObjectId("5a9d0aefc3a1ce5f14c68c81") }
}
},
{
"$group":
{
"_id": {"topic_id": "$topic_id", value: "$value"},
"max_id": {"$max":"$_id"},
"min_timestamp": {"$min": "$timestamp"},
"message_count": {"$sum": 1}
}
},
{
"$project":
{
"_id": "$_id.topic_id",
"max_id": "$max_id",
"min_timestamp": "$min_timestamp",
"message_count": "$message_count",
size: {"$strLenBytes": "$_id.value" }
}
}
])
Output using your example documents:
{
"_id": "5a7af30613b79405643e7da1",
"max_id": ObjectId("5a9d0d44c3a1ce5f14c6940a"),
"min_timestamp": "2018-03-05 09:26:25.136546",
"message_count": 1,
"size": 23
}
{
"_id": "5a7af30713b79479f82b4b84",
"max_id": ObjectId("5a9d0d44c3a1ce5f14c69409"),
"min_timestamp": "2018-03-05 09:26:25.118931",
"message_count": 1,
"size": 12
}
{
"_id": "5a7af30713b7946d6d0a8772",
"max_id": ObjectId("5a9d0d44c3a1ce5f14c69408"),
"min_timestamp": "2018-03-05 09:26:25.101624",
"message_count": 1,
"size": 40
}
After testing @kevin-adistambha's answer and some further experimenting, I found another way to achieve the result I wanted. It may perform better, but that needs more testing to be sure. It works because the argument to an accumulator such as $sum can be any expression, so $strLenBytes can be evaluated per document inside the $group stage:
db.getCollection('topic_data_text').aggregate(
[
{
"$match":
{
"_id": {"$gt": ObjectId("5a9f9d8bd5de3ac75f8cc269") }
}
},
{
"$group":
{
"_id": "$topic_id",
"max_id": {"$max":"$_id"},
"min_timestamp": {"$min": "$timestamp"},
"message_count": {"$sum": 1},
"size": {"$sum": {"$strLenBytes": "$value"}}
}
}
]
);

MongoDB group by and sum and get media

I have this collection in my database:
{ "IdUser" : "1", "IdItem" : "1" },
{ "IdUser" : "1", "IdItem" : "2" },
{ "IdUser" : "1", "IdItem" : "3" },
{ "IdUser" : "2", "IdItem" : "4" },
{ "IdUser" : "2", "IdItem" : "5" },
{ "IdUser" : "4", "IdItem" : "6" },
{ "IdUser" : "5", "IdItem" : "7" }
How can I obtain this result:
Users with one item: 2
Users with two items: 1
Users with three items: 1
You need to first $group your documents by IdUser, then count the number of times each IdUser appears in your collection using the $sum accumulator operator. This allows you, in the next stage, to group your documents by "count" and return the number of "users" with the same number of "items".
db.items.aggregate([
{ "$group": {
"_id": "$IdUser",
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$count",
"nUsers": { "$sum": 1 }
}}
])
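Run against the sample documents above, this should produce something like the following (document order may vary), where _id is the number of items and nUsers is how many users have that many:
{ "_id" : 1, "nUsers" : 2 }
{ "_id" : 2, "nUsers" : 1 }
{ "_id" : 3, "nUsers" : 1 }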
Using the aggregate() method will give you the desired result, though the output documents are different: instead of a key/value pair, you have two different fields whose values show the number of items and the number of users with that respective item count. The following aggregation pipeline explains this:
db.collection.aggregate([
{
"$group": {
"_id": "$IdUser",
"count": {
"$sum": { "$cond": [{ "$gt": [ "$IdItem", null ] }, 1, 0 ] }
}
}
},
{
"$group": {
"_id": "$count",
"users": { "$push": "$_id" }
}
},
{
"$project": {
"_id": 0,
"number_of_items": "$_id",
"number_of_users": { "$size": "$users" }
}
}
])
In the above, you $group all the documents by the IdUser key to get their counts, taking into consideration documents that may not have the IdItem field, which would not be counted in the aggregate. A further $group operation is then necessary to get the users per count, in the form of an array.
The last step in the pipeline $project serves to reshape the final output so that you get the following output (with the sample documents supplied with the question):
{ "number_of_items" : 1, "number_of_users" : 2 }
{ "number_of_items" : 3, "number_of_users" : 1 }
{ "number_of_items" : 2, "number_of_users" : 1 }

How to group by different fields

I want to find all users named 'Hans' and aggregate their 'age' and number of 'childs' by grouping them.
Assuming I have following in my database 'users'.
{
"_id" : "01",
"user" : "Hans",
"age" : "50"
"childs" : "2"
}
{
"_id" : "02",
"user" : "Hans",
"age" : "40"
"childs" : "2"
}
{
"_id" : "03",
"user" : "Fritz",
"age" : "40"
"childs" : "2"
}
{
"_id" : "04",
"user" : "Hans",
"age" : "40"
"childs" : "1"
}
The result should be something like this:
"result" :
[
{
"age" :
[
{
"value" : "50",
"count" : "1"
},
{
"value" : "40",
"count" : "2"
}
]
},
{
"childs" :
[
{
"value" : "2",
"count" : "2"
},
{
"value" : "1",
"count" : "1"
}
]
}
]
How can I achieve this?
This should almost be a MongoDB FAQ, mostly because it is a real example of how you should alter your thinking from SQL processing and embrace what engines like MongoDB do.
The basic principle here is "MongoDB does not do joins". Any way of "envisioning" how you would construct SQL to do this essentially requires a "join" operation. The typical form is "UNION" which is in fact a "join".
So how do you do it under a different paradigm? Well, first let's look at how not to do it and understand the reasons why, even if of course it will work for your very small sample:
The Hard Way
db.docs.aggregate([
{ "$group": {
"_id": null,
"age": { "$push": "$age" },
"childs": { "$push": "$childs" }
}},
{ "$unwind": "$age" },
{ "$group": {
"_id": "$age",
"count": { "$sum": 1 },
"childs": { "$first": "$childs" }
}},
{ "$sort": { "_id": -1 } },
{ "$group": {
"_id": null,
"age": { "$push": {
"value": "$_id",
"count": "$count"
}},
"childs": { "$first": "$childs" }
}},
{ "$unwind": "$childs" },
{ "$group": {
"_id": "$childs",
"count": { "$sum": 1 },
"age": { "$first": "$age" }
}},
{ "$sort": { "_id": -1 } },
{ "$group": {
"_id": null,
"age": { "$first": "$age" },
"childs": { "$push": {
"value": "$_id",
"count": "$count"
}}
}}
])
That will give you a result like this:
{
"_id" : null,
"age" : [
{
"value" : "50",
"count" : 1
},
{
"value" : "40",
"count" : 3
}
],
"childs" : [
{
"value" : "2",
"count" : 3
},
{
"value" : "1",
"count" : 1
}
]
}
So why is this bad? The main problem should be apparent in the very first pipeline stage:
{ "$group": {
"_id": null,
"age": { "$push": "$age" },
"childs": { "$push": "$childs" }
}},
What we asked to do here is group up everything in the collection for the values we want and $push those results into an array. When things are small this works, but real-world collections would result in this "single document" in the pipeline exceeding the 16MB BSON limit that is allowed. That is what is bad.
The rest of the logic follows the natural course by working with each array. But of course real world scenarios would almost always make this untenable.
You could avoid this somewhat by doing things like "duplicating" the documents to be of "type" "age" or "childs" and grouping documents individually by type. But it's all a bit too "over complex" and not a solid way of doing things.
The natural response is "what about a UNION?", but since MongoDB does not do the "join" then how to approach that?
A Better Way (aka A New Hope)
Your best approach here, both architecturally and performance wise, is to simply submit "both" queries (yes, two) in "parallel" to the server via your client API. As the results are received you then "combine" them into a single response that you can then send back as a source of data to your eventual "client" application.
Different languages have different approaches to this, but the general case is to look for an "asynchronous processing" API that allows you to do this in tandem.
My example here uses node.js, as the "asynchronous" side is basically "built in" and reasonably intuitive to follow. The "combination" side of things can be any type of "hash/map/dict" table implementation; the example just does it the simple way:
var async = require('async'),
MongoClient = require('mongodb').MongoClient;
MongoClient.connect('mongodb://localhost/test',function(err,db) {
var collection = db.collection('docs');
async.parallel(
[
function(callback) {
collection.aggregate(
[
{ "$group": {
"_id": "$age",
"type": { "$first": { "$literal": "age" } },
"count": { "$sum": 1 }
}},
{ "$sort": { "_id": -1 } }
],
callback
);
},
function(callback) {
collection.aggregate(
[
{ "$group": {
"_id": "$childs",
"type": { "$first": { "$literal": "childs" } },
"count": { "$sum": 1 }
}},
{ "$sort": { "_id": -1 } }
],
callback
);
}
],
function(err,results) {
if (err) throw err;
var response = {};
results.forEach(function(res) {
res.forEach(function(doc) {
if ( !response.hasOwnProperty(doc.type) )
response[doc.type] = [];
response[doc.type].push({
"value": doc._id,
"count": doc.count
});
});
});
console.log( JSON.stringify( response, null, 2 ) );
}
);
});
Which gives the cute result:
{
"age": [
{
"value": "50",
"count": 1
},
{
"value": "40",
"count": 3
}
],
"childs": [
{
"value": "2",
"count": 3
},
{
"value": "1",
"count": 1
}
]
}
So the key thing to note here is that the "separate" aggregation statements themselves are actually quite simple. The only thing you face is combining those in your final result. There are many approaches to "combining", particularly to deal with large results from each of the queries, but this is the basic example of the execution model.
Key points here:
Shuffling data in the aggregation pipeline is possible but not performant for large data sets.
Use a language implementation and API that support "parallel" and "asynchronous" execution so you can "load up" all or "most" of your operations at once.
The API should support some method of "combination" or otherwise allow a separate "stream" write to process each result set received into one.
Forget about the SQL way. The NoSQL way delegates the processing of such things as "joins" to your "data logic layer", which is what contains the code as shown here. It does it this way because it is scalable to very large datasets. It is rather the job of your "data logic" handling nodes in large applications to deliver this to the end API.
This is fast compared to any other form of "wrangling" I could possibly describe. Part of "NoSQL" thinking is to "Unlearn what you have learned" and look at things a different way. And if that way doesn't perform better, then stick with the SQL approach for storage and query.
That's why alternatives exist.
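For reference, on current Node.js drivers the same "two queries in parallel, then combine" pattern can be written with Promise.all and async/await. A minimal sketch, assuming the same test database and docs collection as above:
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const collection = client.db('test').collection('docs');

  // one simple pipeline per field we want to summarize
  const pipelineFor = field => [
    { "$group": { "_id": `$${field}`, "count": { "$sum": 1 } } },
    { "$sort": { "_id": -1 } }
  ];

  // run both aggregations in parallel and wait for both result sets
  const [ages, childs] = await Promise.all([
    collection.aggregate(pipelineFor("age")).toArray(),
    collection.aggregate(pipelineFor("childs")).toArray()
  ]);

  // combine into the same shape as the async.parallel example above
  const response = {
    age: ages.map(d => ({ value: d._id, count: d.count })),
    childs: childs.map(d => ({ value: d._id, count: d.count }))
  };

  console.log(JSON.stringify(response, null, 2));
  await client.close();
}

main().catch(console.error);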
That was a tough one!
First, the bare solution:
db.test.aggregate([
{ "$match": { "user": "Hans" } },
// duplicate each document: one for "age", the other for "childs"
{ $project: { age: "$age", childs: "$childs",
data: {$literal: ["age", "childs"]}}},
{ $unwind: "$data" },
// pivot data to something like { data: "age", value: "40" }
{ $project: { data: "$data",
value: {$cond: [{$eq: ["$data", "age"]},
"$age",
"$childs"]} }},
// Group by data type, and count
{ $group: { _id: {data: "$data", value: "$value" },
count: { $sum: 1 },
value: {$first: "$value"} }},
// aggregate values in an array for each independant (type,value) pair
{ $group: { _id: "$_id.data", values: { $push: { count: "$count", value: "$value" }} }} ,
// project value to the correctly name field
{ $project: { result: {$cond: [{$eq: ["$_id", "age"]},
{age: "$values" },
{childs: "$values"}]} }},
// group all data in the result array, and remove unneeded `_id` field
{ $group: { _id: null, result: { $push: "$result" }}},
{ $project: { _id: 0, result: 1}}
])
Producing:
{
"result" : [
{
"age" : [
{
"count" : 3,
"value" : "40"
},
{
"count" : 1,
"value" : "50"
}
]
},
{
"childs" : [
{
"count" : 1,
"value" : "1"
},
{
"count" : 3,
"value" : "2"
}
]
}
]
}
And now, for some explanations:
One of the major issues here is that each incoming document has to be part of two different sums. I solved that by adding a literal array ["age", "childs"] to your documents, and then unwinding them by that array. That way, each document will be presented twice in the later stage.
Once that is done, to ease processing, I change the data representation to something much more manageable, like { data: "age", value: "40" }.
The following steps perform the data aggregation per se, up to the third $project step, which maps the value fields to the corresponding age or childs field.
The final two steps will simply wrap the two documents in one, removing the unneeded _id field.
Pfff!