How to group by different fields - mongodb

I want to find all users named 'Hans' and aggregate their 'age' and number of 'childs' by grouping them.
Assuming I have the following in my database 'users':
{
    "_id" : "01",
    "user" : "Hans",
    "age" : "50",
    "childs" : "2"
}
{
    "_id" : "02",
    "user" : "Hans",
    "age" : "40",
    "childs" : "2"
}
{
    "_id" : "03",
    "user" : "Fritz",
    "age" : "40",
    "childs" : "2"
}
{
    "_id" : "04",
    "user" : "Hans",
    "age" : "40",
    "childs" : "1"
}
The result should be something like this:
"result" :
[
{
"age" :
[
{
"value" : "50",
"count" : "1"
},
{
"value" : "40",
"count" : "2"
}
]
},
{
"childs" :
[
{
"value" : "2",
"count" : "2"
},
{
"value" : "1",
"count" : "1"
}
]
}
]
How can I achieve this?

This should almost be a MongoDB FAQ, mostly because it is a real example of how you should alter your thinking from SQL processing and embrace what engines like MongoDB do.
The basic principle here is "MongoDB does not do joins". Any way of "envisioning" how you would construct SQL to do this essentially requires a "join" operation. The typical form is "UNION", which is in fact a "join".
So how to do it under a different paradigm? Well first, let's approach how not to do it, and understand the reasons why, even though it will of course work for your very small sample:
The Hard Way
db.docs.aggregate([
    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},
    { "$unwind": "$age" },
    { "$group": {
        "_id": "$age",
        "count": { "$sum": 1 },
        "childs": { "$first": "$childs" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$push": {
            "value": "$_id",
            "count": "$count"
        }},
        "childs": { "$first": "$childs" }
    }},
    { "$unwind": "$childs" },
    { "$group": {
        "_id": "$childs",
        "count": { "$sum": 1 },
        "age": { "$first": "$age" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$first": "$age" },
        "childs": { "$push": {
            "value": "$_id",
            "count": "$count"
        }}
    }}
])
That will give you a result like this:
{
    "_id" : null,
    "age" : [
        {
            "value" : "50",
            "count" : 1
        },
        {
            "value" : "40",
            "count" : 3
        }
    ],
    "childs" : [
        {
            "value" : "2",
            "count" : 3
        },
        {
            "value" : "1",
            "count" : 1
        }
    ]
}
So why is this bad? The main problem should be apparent in the very first pipeline stage:
{ "$group": {
"_id": null,
"age": { "$push": "$age" },
"childs": { "$push": "$childs" }
}},
What we asked to do here is group up everything in the collection for the values we want and $push those results into an array. When things are small this works, but real-world collections would result in this "single document" in the pipeline exceeding the 16MB BSON limit. That is what is bad.
The rest of the logic follows the natural course by working with each array. But of course real-world scenarios would almost always make this untenable.
You could avoid this somewhat by doing things like "duplicating" the documents to be of "type" "age" or "childs" and grouping documents individually by type. But it's all a bit too "over complex" and not a solid way of doing things.
The natural response is "what about a UNION?", but since MongoDB does not do the "join" then how to approach that?
A Better Way ( aka A New Hope )
Your best approach here, both architecturally and performance-wise, is to simply submit "both" queries ( yes, two ) in "parallel" to the server via your client API. As the results are received you then "combine" them into a single response that you can send back as a source of data to your eventual "client" application.
Different languages have different approaches to this, but the general case is to look for an "asynchronous processing" API that allows you to do this in tandem.
My example here uses node.js, as the "asynchronous" side is basically "built in" and reasonably intuitive to follow. The "combination" side of things can be any type of "hash/map/dict" table implementation; it is just done the simple way for example only:
var async = require('async'),
    MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost/test', function(err, db) {
    if (err) throw err;

    var collection = db.collection('docs');

    async.parallel(
        [
            function(callback) {
                collection.aggregate(
                    [
                        { "$group": {
                            "_id": "$age",
                            "type": { "$first": { "$literal": "age" } },
                            "count": { "$sum": 1 }
                        }},
                        { "$sort": { "_id": -1 } }
                    ],
                    callback
                );
            },
            function(callback) {
                collection.aggregate(
                    [
                        { "$group": {
                            "_id": "$childs",
                            "type": { "$first": { "$literal": "childs" } },
                            "count": { "$sum": 1 }
                        }},
                        { "$sort": { "_id": -1 } }
                    ],
                    callback
                );
            }
        ],
        function(err, results) {
            if (err) throw err;

            // Combine both result sets, keyed by the "type" set in each pipeline
            var response = {};
            results.forEach(function(res) {
                res.forEach(function(doc) {
                    if ( !response.hasOwnProperty(doc.type) )
                        response[doc.type] = [];
                    response[doc.type].push({
                        "value": doc._id,
                        "count": doc.count
                    });
                });
            });
            console.log( JSON.stringify( response, null, 2 ) );
        }
    );
});
Which gives the cute result:
{
    "age": [
        {
            "value": "50",
            "count": 1
        },
        {
            "value": "40",
            "count": 3
        }
    ],
    "childs": [
        {
            "value": "2",
            "count": 3
        },
        {
            "value": "1",
            "count": 1
        }
    ]
}
So the key thing to note here is that the "separate" aggregation statements themselves are actually quite simple. The only thing you face is combining those in your final result. There are many approaches to "combining", particularly to deal with large results from each of the queries, but this is the basic example of the execution model.
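For what it's worth, a modern Promise-based driver can express the same "parallel" execution without the async library. This is a minimal sketch only, assuming a current mongodb driver (4.x or later) and the same "docs" collection; note the modern driver also wants .toArray() on the aggregation cursor:
const { MongoClient } = require('mongodb');

async function main() {
    // A sketch assuming the Promise-based driver; adjust the URL to your deployment
    const client = new MongoClient('mongodb://localhost:27017');
    await client.connect();
    const collection = client.db('test').collection('docs');

    // Submit both aggregations to the server at once
    const [ages, childs] = await Promise.all([
        collection.aggregate([
            { "$group": {
                "_id": "$age",
                "type": { "$first": { "$literal": "age" } },
                "count": { "$sum": 1 }
            }},
            { "$sort": { "_id": -1 } }
        ]).toArray(),
        collection.aggregate([
            { "$group": {
                "_id": "$childs",
                "type": { "$first": { "$literal": "childs" } },
                "count": { "$sum": 1 }
            }},
            { "$sort": { "_id": -1 } }
        ]).toArray()
    ]);

    // Combine the two result sets into the single response document
    const response = {};
    for (const doc of [...ages, ...childs]) {
        (response[doc.type] = response[doc.type] || []).push({
            "value": doc._id,
            "count": doc.count
        });
    }
    console.log(JSON.stringify(response, null, 2));
    await client.close();
}

main().catch(console.error);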
Key points here.
Shuffling data in the aggregation pipeline is possible but not performant for large data sets.
Use a language implementation and API that support "parallel" and "asynchronous" execution so you can "load up" all or "most" of your operations at once.
The API should support some method of "combination" or otherwise allow a separate "stream" write to process each result set received into one.
Forget about the SQL way. The NoSQL way delegates the processing of such things as "joins" to your "data logic layer", which is what contains the code shown here. It does it this way because it is scalable to very large datasets. In large applications, it is the job of your "data logic" handling nodes to deliver this to the end API.
This is fast compared to any other form of "wrangling" I could possibly describe. Part of "NoSQL" thinking is to "Unlearn what you have learned" and look at things a different way. And if that way doesn't perform better, then stick with the SQL approach for storage and query.
That's why alternatives exist.

That was a tough one!
First, the bare solution:
db.test.aggregate([
    { "$match": { "user": "Hans" } },
    // duplicate each document: one for "age", the other for "childs"
    { $project: { age: "$age", childs: "$childs",
                  data: { $literal: ["age", "childs"] } } },
    { $unwind: "$data" },
    // pivot data to something like { data: "age", value: "40" }
    { $project: { data: "$data",
                  value: { $cond: [ { $eq: ["$data", "age"] },
                                    "$age",
                                    "$childs" ] } } },
    // group by data type and value, and count
    { $group: { _id: { data: "$data", value: "$value" },
                count: { $sum: 1 },
                value: { $first: "$value" } } },
    // aggregate values in an array for each independent (type, value) pair
    { $group: { _id: "$_id.data",
                values: { $push: { count: "$count", value: "$value" } } } },
    // project the values to the correctly named field
    { $project: { result: { $cond: [ { $eq: ["$_id", "age"] },
                                     { age: "$values" },
                                     { childs: "$values" } ] } } },
    // group all data in the result array, and remove the unneeded `_id` field
    { $group: { _id: null, result: { $push: "$result" } } },
    { $project: { _id: 0, result: 1 } }
])
Producing:
{
    "result" : [
        {
            "age" : [
                {
                    "count" : 3,
                    "value" : "40"
                },
                {
                    "count" : 1,
                    "value" : "50"
                }
            ]
        },
        {
            "childs" : [
                {
                    "count" : 1,
                    "value" : "1"
                },
                {
                    "count" : 3,
                    "value" : "2"
                }
            ]
        }
    ]
}
And now, for some explanations:
One of the major issues here is that each incoming document has to be part of two different sums. I solved that by adding a literal array ["age", "childs"] to your documents, and then unwinding by that array. That way, each document will be presented twice to the later stages.
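To illustrate, after the $project and $unwind stages each matched document flows through the rest of the pipeline twice, once per member of the literal array. For the first "Hans" document that looks like:
// the same source document, duplicated with a different "data" discriminator
{ "_id" : "01", "age" : "50", "childs" : "2", "data" : "age" }
{ "_id" : "01", "age" : "50", "childs" : "2", "data" : "childs" }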
Once that is done, to ease processing, I change the data representation to something much more manageable, like { data: "age", value: "40" }.
The following steps perform the data aggregation per se, up to the third $project step, which maps the value fields to the corresponding age or childs field.
The final two steps will simply wrap the two documents in one, removing the unneeded _id field.
Pfff!

Related

How to aggregate into a map [duplicate]

I have data in a profile collection:
[
    {
        name: "Harish",
        gender: "Male",
        caste: "Vokkaliga",
        education: "B.E"
    },
    {
        name: "Reshma",
        gender: "Female",
        caste: "Vokkaliga",
        education: "B.E"
    },
    {
        name: "Rangnath",
        gender: "Male",
        caste: "Lingayath",
        education: "M.C.A"
    },
    {
        name: "Lakshman",
        gender: "Male",
        caste: "Lingayath",
        education: "B.Com"
    },
    {
        name: "Reshma",
        gender: "Female",
        caste: "Lingayath",
        education: "B.E"
    }
]
Here I need to calculate the totals for each distinct gender, each distinct caste, and each distinct education.
Expected output:
{
    gender: [
        { name: "Male", total: "3" },
        { name: "Female", total: "2" }
    ],
    caste: [
        { name: "Vokkaliga", total: "2" },
        { name: "Lingayath", total: "3" }
    ],
    education: [
        { name: "B.E", total: "3" },
        { name: "M.C.A", total: "1" },
        { name: "B.Com", total: "1" }
    ]
}
Using MongoDB aggregation, how can I get the expected result?
There are different approaches depending on the version available, but they all essentially break down to transforming your document fields into separate documents in an "array", then "unwinding" that array with $unwind and doing successive $group stages in order to accumulate the output totals and arrays.
MongoDB 3.4.4 and above
The latest releases have special operators like $arrayToObject and $objectToArray, which can make the transformation from the source document to the initial "array" more dynamic than in earlier releases:
db.profile.aggregate([
    { "$project": {
        "_id": 0,
        "data": {
            "$filter": {
                "input": { "$objectToArray": "$$ROOT" },
                "cond": { "$in": [ "$$this.k", ["gender","caste","education"] ] }
            }
        }
    }},
    { "$unwind": "$data" },
    { "$group": {
        "_id": "$data",
        "total": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.k",
        "v": {
            "$push": { "name": "$_id.v", "total": "$total" }
        }
    }},
    { "$group": {
        "_id": null,
        "data": { "$push": { "k": "$_id", "v": "$v" } }
    }},
    { "$replaceRoot": {
        "newRoot": {
            "$arrayToObject": "$data"
        }
    }}
])
So using $objectToArray you make the initial document into an array of its keys and values, as "k" and "v" keys in the resulting array of objects. We apply $filter here in order to select by "key". Here we use $in with a list of the keys we want, but this could be used more dynamically with a list of keys to "exclude" where that was shorter. It's just using logical operators to evaluate the condition.
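To make that concrete, here is roughly what the first sample document looks like inside that $project stage (an inserted document would also carry an "_id" pair, which the $filter discards along with "name"):
// { "$objectToArray": "$$ROOT" } turns the document into "k"/"v" pairs:
[
    { "k" : "name", "v" : "Harish" },
    { "k" : "gender", "v" : "Male" },
    { "k" : "caste", "v" : "Vokkaliga" },
    { "k" : "education", "v" : "B.E" }
]
// ...and after the $filter only the wanted keys remain as "data":
[
    { "k" : "gender", "v" : "Male" },
    { "k" : "caste", "v" : "Vokkaliga" },
    { "k" : "education", "v" : "B.E" }
]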
The end stage here uses $replaceRoot, and since all our manipulation and "grouping" in between still keeps that "k" and "v" form, we then use $arrayToObject to promote our "array of objects" in the result to the "keys" of the top-level document in the output.
MongoDB 3.6 $mergeObjects
As an extra wrinkle here, MongoDB 3.6 includes $mergeObjects, which can be used as an "accumulator" in a $group pipeline stage as well, thus replacing the $push and making the final $replaceRoot a simple shift of the "data" key to the "root" of the returned document instead:
db.profile.aggregate([
    { "$project": {
        "_id": 0,
        "data": {
            "$filter": {
                "input": { "$objectToArray": "$$ROOT" },
                "cond": { "$in": [ "$$this.k", ["gender","caste","education"] ] }
            }
        }
    }},
    { "$unwind": "$data" },
    { "$group": { "_id": "$data", "total": { "$sum": 1 } }},
    { "$group": {
        "_id": "$_id.k",
        "v": {
            "$push": { "name": "$_id.v", "total": "$total" }
        }
    }},
    { "$group": {
        "_id": null,
        "data": {
            "$mergeObjects": {
                "$arrayToObject": [
                    [{ "k": "$_id", "v": "$v" }]
                ]
            }
        }
    }},
    { "$replaceRoot": { "newRoot": "$data" } }
])
This is not really that different from what is demonstrated overall, but it shows how $mergeObjects can be used in this way, and may be useful in cases where the grouping key was something different and we did not want that final "merge" to the root space of the object.
Note that the $arrayToObject is still needed to transform the "value" back into the name of the "key", but we just do it during the accumulation rather than after the grouping, since the new accumulation allows the "merge" of keys.
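As a rough illustration of that accumulation, each grouped key passes through $arrayToObject as a single-element array of "k"/"v", and $mergeObjects folds the results together:
// each accumulated value is first reshaped by $arrayToObject, e.g.
{ "$arrayToObject": [[ { "k": "gender", "v": [ /* name/total entries */ ] } ]] }
// ...which evaluates to a single-key document:
{ "gender": [ /* name/total entries */ ] }
// ...and $mergeObjects then folds those one-key documents into one across the group:
{ "gender": [ /* ... */ ], "education": [ /* ... */ ], "caste": [ /* ... */ ] }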
MongoDB 3.2
Taking it back a version, or even if you have a MongoDB 3.4.x that is earlier than the 3.4.4 release, we can still use much of this, but instead we deal with creating the array in a more static fashion, as well as handling the final "transform" on output differently, due to the aggregation operators we don't have:
db.profile.aggregate([
    { "$project": {
        "data": [
            { "k": "gender", "v": "$gender" },
            { "k": "caste", "v": "$caste" },
            { "k": "education", "v": "$education" }
        ]
    }},
    { "$unwind": "$data" },
    { "$group": {
        "_id": "$data",
        "total": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.k",
        "v": {
            "$push": { "name": "$_id.v", "total": "$total" }
        }
    }},
    { "$group": {
        "_id": null,
        "data": { "$push": { "k": "$_id", "v": "$v" } }
    }},
    /*
    { "$replaceRoot": {
        "newRoot": {
            "$arrayToObject": "$data"
        }
    }}
    */
]).map( d =>
    d.data.map( e => ({ [e.k]: e.v }) )
        .reduce((acc,curr) => Object.assign(acc,curr),{})
)
This is exactly the same thing, except that instead of a dynamic transform of the document into the array, we "explicitly" assign each array member with the same "k" and "v" notation. We really just keep those key names for convention at this point, since none of the aggregation operators here depend on them at all.
Also, instead of using $replaceRoot, we do exactly the same thing as the previous pipeline stage implementation, but in client code instead. All MongoDB drivers have some implementation of cursor.map() to enable "cursor transforms". Here with the shell we use the basic JavaScript functions of Array.map() and Array.reduce() to take that output and again promote the array content to being the keys of the top-level document returned.
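By way of comparison, the equivalent "cursor transform" with the Node.js driver might look like the following sketch, assuming it runs inside an async function, "db" is a connected Db object, and "pipeline" holds the stages shown above:
// a sketch only: the Node.js driver exposes map() on the aggregation cursor too
const results = await db.collection('profile')
    .aggregate(pipeline)
    .map(d => d.data
        .map(e => ({ [e.k]: e.v }))
        .reduce((acc, curr) => Object.assign(acc, curr), {})
    )
    .toArray();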
MongoDB 2.6
And falling back to MongoDB 2.6 to cover the versions in between, the only thing that changes here is the usage of $map and a $literal for input with the array declaration:
db.profile.aggregate([
    { "$project": {
        "data": {
            "$map": {
                "input": { "$literal": ["gender","caste","education"] },
                "as": "k",
                "in": {
                    "k": "$$k",
                    "v": {
                        "$cond": {
                            "if": { "$eq": [ "$$k", "gender" ] },
                            "then": "$gender",
                            "else": {
                                "$cond": {
                                    "if": { "$eq": [ "$$k", "caste" ] },
                                    "then": "$caste",
                                    "else": "$education"
                                }
                            }
                        }
                    }
                }
            }
        }
    }},
    { "$unwind": "$data" },
    { "$group": {
        "_id": "$data",
        "total": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.k",
        "v": {
            "$push": { "name": "$_id.v", "total": "$total" }
        }
    }},
    { "$group": {
        "_id": null,
        "data": { "$push": { "k": "$_id", "v": "$v" } }
    }},
    /*
    { "$replaceRoot": {
        "newRoot": {
            "$arrayToObject": "$data"
        }
    }}
    */
])
.map( d =>
    d.data.map( e => ({ [e.k]: e.v }) )
        .reduce((acc,curr) => Object.assign(acc,curr),{})
)
Since the basic idea here is to "iterate" a provided array of the field names, the actual assignment of values comes by "nesting" the $cond statements. For three possible outcomes this means only a single nesting in order to "branch" for each outcome.
Modern MongoDB from 3.4 has $switch, which makes this branching simpler; yet this demonstrates that the logic was always possible, and that the $cond operator has been around since the aggregation framework was introduced in MongoDB 2.2.
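For reference, the "in" argument of that $map could be written with $switch on MongoDB 3.4 or later; a sketch of the drop-in replacement:
// hypothetical replacement for the nested $cond above, MongoDB 3.4+
"in": {
    "k": "$$k",
    "v": {
        "$switch": {
            "branches": [
                { "case": { "$eq": [ "$$k", "gender" ] }, "then": "$gender" },
                { "case": { "$eq": [ "$$k", "caste" ] }, "then": "$caste" }
            ],
            // anything else in the input list falls through to education
            "default": "$education"
        }
    }
}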
Again, the same transformation on the cursor result applies as there is nothing new there and most programming languages have the ability to do this for years, if not from inception.
Of course the basic process can be done all the way back to MongoDB 2.2, just applying the array creation and $unwind in a different way. But no-one should be running any MongoDB under 2.8 at this point in time, and official support even for 3.0 is fast running out.
Output
For visualization, the output of all demonstrated pipelines here has the following form before the last "transform" is done:
/* 1 */
{
    "_id" : null,
    "data" : [
        {
            "k" : "gender",
            "v" : [
                { "name" : "Male", "total" : 3.0 },
                { "name" : "Female", "total" : 2.0 }
            ]
        },
        {
            "k" : "education",
            "v" : [
                { "name" : "M.C.A", "total" : 1.0 },
                { "name" : "B.E", "total" : 3.0 },
                { "name" : "B.Com", "total" : 1.0 }
            ]
        },
        {
            "k" : "caste",
            "v" : [
                { "name" : "Lingayath", "total" : 3.0 },
                { "name" : "Vokkaliga", "total" : 2.0 }
            ]
        }
    ]
}
And then either by the $replaceRoot or the cursor transform as demonstrated the result becomes:
/* 1 */
{
    "gender" : [
        { "name" : "Male", "total" : 3.0 },
        { "name" : "Female", "total" : 2.0 }
    ],
    "education" : [
        { "name" : "M.C.A", "total" : 1.0 },
        { "name" : "B.E", "total" : 3.0 },
        { "name" : "B.Com", "total" : 1.0 }
    ],
    "caste" : [
        { "name" : "Lingayath", "total" : 3.0 },
        { "name" : "Vokkaliga", "total" : 2.0 }
    ]
}
So whilst we can put some new and fancy operators into the aggregation pipeline where we have them available, the most common use case is these "end of pipeline transforms", in which case we may as well simply do the same transformation on each document in the cursor results returned instead.

Remove duplicate documents based on field

I've seen a number of solutions for this; however, they are all for Mongo v2 and are not suitable for v3.
My document looks like this:
{
"_id" : ObjectId("582c98667d81e1d0270cb3e9"),
"asin" : "B01MTKPJT1",
"url" : "https://www.amazon.com/Trump-President-Presidential-Victory-T-Shirt/dp/B01MTKPJT1%3FSubscriptionId%3DAKIAIVCW62S7NTZ2U2AQ%26tag%3Dselfbalancingscooters-21%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB01MTKPJT1",
"image" : "http://ecx.images-amazon.com/images/I/41RvN8ud6UL.jpg",
"salesRank" : NumberInt(442137),
"title" : "Trump Wins 45th President Presidential Victory T-Shirt",
"brand" : "\"Getting Political On Me\"",
"favourite" : false,
"createdAt" : ISODate("2016-11-16T17:33:26.763+0000"),
"updatedAt" : ISODate("2016-11-16T17:33:26.763+0000")
}
and my collection contains around 500k documents. I want to remove all duplicate documents (except for one) where the ASIN is the same.
How can I achieve this?
This is something we can actually do using the aggregation framework and without client side processing.
MongoDB 3.4
db.collection.aggregate(
    [
        { "$sort": { "_id": 1 } },
        { "$group": {
            "_id": "$asin",
            "doc": { "$first": "$$ROOT" }
        }},
        { "$replaceRoot": { "newRoot": "$doc" } },
        { "$out": "collection" }
    ]
)
MongoDB version <= 3.2:
db.collection.aggregate(
    [
        { "$sort": { "_id": 1 } },
        { "$group": {
            "_id": "$asin",
            "doc": { "$first": "$$ROOT" }
        }},
        { "$project": {
            "asin": "$doc.asin",
            "url": "$doc.url",
            "image": "$doc.image",
            "salesRank": "$doc.salesRank",
            "title": "$doc.title",
            "brand": "$doc.brand",
            "favourite": "$doc.favourite",
            "createdAt": "$doc.createdAt",
            "updatedAt": "$doc.updatedAt"
        }},
        { "$out": "collection" }
    ]
)
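Note that $out atomically replaces the named collection with the pipeline output, so the statements above rewrite "collection" in place; take a backup first if you are unsure. On a collection of around 500k documents you may also want to let the $sort and $group stages spill to disk, for example with the 3.4 form:
// same 3.4 pipeline, with allowDiskUse for larger collections
db.collection.aggregate(
    [
        { "$sort": { "_id": 1 } },
        { "$group": {
            "_id": "$asin",
            "doc": { "$first": "$$ROOT" }
        }},
        { "$replaceRoot": { "newRoot": "$doc" } },
        { "$out": "collection" }
    ],
    { "allowDiskUse": true }
)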
Use a loop; it will take time, but it will do the work:
db.amazon_sales.find({}, { asin: 1 }).sort({ _id: 1 }).forEach(function(doc) {
    // remove any later document sharing this asin, keeping the first one seen
    db.amazon_sales.remove({ _id: { $gt: doc._id }, asin: doc.asin });
})
Then add this index:
db.amazon_sales.createIndex( { "asin": 1 }, { unique: true } )

Aggregate query in MongoDB returns specific field

Document Sample:
{
    "_id" : ObjectId("53329dfgg43771e49538b4567"),
    "u" : {
        "_id" : ObjectId("532a435gs4c771edb168c1bd7"),
        "n" : "Salman khan",
        "e" : "salman@gmail.com"
    },
    "ps" : 0,
    "os" : 1,
    "rs" : 0,
    "cd" : 1395685800,
    "ud" : 0
}
Query:
db.collectiontmp.aggregate([
    { $match: { os: 1 } },
    { $project: { name: { $toUpper: "$u.e" }, _id: 0 } },
    { $group: { _id: "$u._id", total: { $sum: 1 } } },
    { $sort: { total: -1 } },
    { $limit: 10 }
]);
I need the following things from the above query:
Group by u._id
Return the total number of records and the email from the record, as shown below:
{
    "result": [
        {
            "email": "",
            "total": ""
        },
        {
            "email": "",
            "total": ""
        }
    ],
    "ok": 1
}
The first thing you are doing wrong here is not understanding how $project is intended to work. Pipeline stages such as $project and $group will only output the fields that are "explicitly" identified. So only the fields you say to output will be available to the following pipeline stages.
Specifically, here you "project" only part of the "u" field in your document, and you have therefore removed the other data from being available. The only field present now is "name", which is the one you "projected".
Perhaps it was really your intention to do something like this:
db.collectiontmp.aggregate([
    { "$group": {
        "_id": {
            "_id": "$u._id",
            "email": { "$toUpper": "$u.e" }
        },
        "total": { "$sum": 1 }
    }},
    { "$project": {
        "_id": 0,
        "email": "$_id.email",
        "total": 1
    }},
    { "$sort": { "total": -1 } },
    { "$limit": 10 }
])
Or even:
db.collectiontmp.aggregate([
    { "$group": {
        "_id": "$u._id",
        "email": { "$first": { "$toUpper": "$u.e" } },
        "total": { "$sum": 1 }
    }},
    { "$project": {
        "_id": 0,
        "email": 1,
        "total": 1
    }},
    { "$sort": { "total": -1 } },
    { "$limit": 10 }
])
That gives you the sort of output you are looking for.
Remember that as this is a "pipeline", only the "output" from a prior stage is available to the "next" stage. There is no "global" concept of the document, as this is not a declarative statement such as in SQL, but a "pipeline".
So think of the Unix pipe "|" command, or otherwise look that up. Then your thinking will fall into place.

mongodb get filtered count

I have no extensive knowledge of how to create MongoDB queries, but I wanted to ask how I could query a collection to get something like this:
{
    Total: 1000,
    Filtered: 459,
    DocumentArray: []
}
Of course doing that in one query, so I do not need to do something like this:
db.collection.find();
db.collection.find().count();
db.collection.count();
Well you could do something along these lines:
Considering documents like this:
{ "_id" : ObjectId("531251829df82824bdb53578"), "name" : "a", "type" : "b" }
{ "_id" : ObjectId("531251899df82824bdb53579"), "name" : "a", "type" : "c" }
{ "_id" : ObjectId("5312518e9df82824bdb5357a"), "type" : "c", "name" : "b" }
And an aggregate pipeline like this:
db.collection.aggregate([
    { "$group": {
        "_id": null,
        "totalCount": { "$sum": 1 },
        "docs": { "$push": {
            "name": "$name",
            "type": "$type"
        }}
    }},
    { "$unwind": "$docs" },
    { "$match": { "docs.name": "a" } },
    { "$group": {
        "_id": null,
        "totalCount": { "$first": "$totalCount" },
        "filteredCount": { "$sum": 1 },
        "docs": { "$push": "$docs" }
    }}
])
But I would not recommend it. It will certainly blow up on any "real" collection by exceeding the maximum BSON document size, and I doubt it would perform very well. But that is how it can be done, even if the utility is purely academic.
Just do what you are doing if you need the information. That is the "right way" to do it.
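That said, if you are on MongoDB 3.4 or later, the $facet stage can at least produce such a combined result in one statement, with each count computed by its own sub-pipeline rather than by pushing every document into a single array. A sketch, using the same sample documents and filter:
db.collection.aggregate([
    { "$facet": {
        // total documents in the collection
        "Total": [ { "$count": "n" } ],
        // documents matching the filter, counted
        "Filtered": [ { "$match": { "name": "a" } }, { "$count": "n" } ],
        // the matching documents themselves; the combined output document
        // is still subject to the 16MB BSON limit, so keep this bounded
        "DocumentArray": [ { "$match": { "name": "a" } }, { "$limit": 100 } ]
    }}
])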

Selecting Distinct values from Array in MongoDB

I have a collection named Alpha_Num with the following structure. I am trying to find out which Alphabet-Numerals pair appears the maximum number of times.
If we just go with the data below, the pair abcd-123 appears twice, as does the pair efgh-10001, but the second one is not a valid case for me, as it appears within the same document.
{
    "_id" : 12345,
    "Alphabet" : "abcd",
    "Numerals" : [
        "123",
        "456",
        "2345"
    ]
}
{
    "_id" : 123456,
    "Alphabet" : "efgh",
    "Numerals" : [
        "10001",
        "10001",
        "1002"
    ]
}
{
    "_id" : 123456567,
    "Alphabet" : "abcd",
    "Numerals" : [
        "123"
    ]
}
I tried to use the aggregation framework, with something like below:
db.Alpha_Num.aggregate([
    { "$unwind": "$Numerals" },
    { "$group": {
        "_id": { "Alpha": "$Alphabet", "Num": "$Numerals" },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "count": -1 } }
])
The problem with this query is that it counts the pair efgh-10001 twice.
Question: how do I select distinct values from the array "Numerals" under the above condition?
Problem solved.
db.Alpha_Num.aggregate([
    { "$unwind": "$Numerals" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "Alpha": "$Alphabet"
        },
        "Num": { "$addToSet": "$Numerals" }
    }},
    { "$unwind": "$Num" },
    { "$group": {
        "_id": {
            "Alpha": "$_id.Alpha",
            "Num": "$Num"
        },
        "count": { "$sum": 1 }
    }}
])
Grouping using $addToSet and unwinding again did the trick. I got the answer from one of the 10gen online courses.
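Since the original goal was the pair appearing the maximum number of times, the ordering from the first attempt can simply be appended to this corrected pipeline; a sketch:
db.Alpha_Num.aggregate([
    { "$unwind": "$Numerals" },
    // collapse duplicates within each document first
    { "$group": {
        "_id": { "_id": "$_id", "Alpha": "$Alphabet" },
        "Num": { "$addToSet": "$Numerals" }
    }},
    { "$unwind": "$Num" },
    // then count each distinct Alphabet-Numerals pair across documents
    { "$group": {
        "_id": { "Alpha": "$_id.Alpha", "Num": "$Num" },
        "count": { "$sum": 1 }
    }},
    // surface the most frequent pair
    { "$sort": { "count": -1 } },
    { "$limit": 1 }
])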