Select Group by count and distinct count in same mongodb query - mongodb

I am trying to do something like
select campaign_id,campaign_name,count(subscriber_id),count(distinct subscriber_id)
group by campaign_id,campaign_name from campaigns;
This query returns everything except count(distinct subscriber_id):
db.campaigns.aggregate([
{$match: {subscriber_id: {$ne: null}}},
{$group: {
_id: {campaign_id: "$campaign_id",campaign_name: "$campaign_name"},
count: {$sum: 1}
}}
])
The following query returns everything except count(subscriber_id):
db.campaigns_logs.aggregate([
{$match : {subscriber_id: {$ne: null}}},
{$group : { _id: {campaign_id: "$campaign_id",campaign_name: "$campaign_name",subscriber_id: "$subscriber_id"}}},
{$group : { _id: {campaign_id: "$campaign_id",campaign_name: "$campaign_name"},
count: {$sum: 1}
}}
])
But I want count(subscriber_id) and count(distinct subscriber_id) in the same result.

You were headed in the right direction. Stepping away from the SQL mindset, "distinct" is really just another way of writing a $group operation in either language. That means you have two group operations happening here and, in aggregation pipeline terms, two pipeline stages.
Just with simplified documents to visualize:
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "123"
},
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "123"
},
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "456"
}
It stands to reason that for the given "campaign" combination the total count and the "distinct" count are "3" and "2" respectively. So the logical thing to do is first "group" all of those "subscriber_id" values and keep the count of occurrences for each. Then, thinking in "pipeline" terms, "total" those counts per "campaign" and count the "distinct" occurrences as a separate number:
db.campaigns.aggregate([
{ "$match": { "subscriber_id": { "$ne": null }}},
// Count all occurrences
{ "$group": {
"_id": {
"campaign_id": "$campaign_id",
"campaign_name": "$campaign_name",
"subscriber_id": "$subscriber_id"
},
"count": { "$sum": 1 }
}},
// Sum all occurrences and count distinct
{ "$group": {
"_id": {
"campaign_id": "$_id.campaign_id",
"campaign_name": "$_id.campaign_name"
},
"totalCount": { "$sum": "$count" },
"distinctCount": { "$sum": 1 }
}}
])
After the first "group" the output documents can be visualized like this:
{
"_id" : {
"campaign_id" : "A",
"campaign_name" : "A",
"subscriber_id" : "456"
},
"count" : 1
}
{
"_id" : {
"campaign_id" : "A",
"campaign_name" : "A",
"subscriber_id" : "123"
},
"count" : 2
}
So of the "three" documents in the sample, "2" belong to one distinct value and "1" to another. These counts can still be totaled with $sum to get the total number of matching documents, which is what the following stage does, with the final result:
{
"_id" : {
"campaign_id" : "A",
"campaign_name" : "A"
},
"totalCount" : 3,
"distinctCount" : 2
}
A really good analogy for the aggregation pipeline is the unix pipe "|" operator, which allows "chaining" of operations so you can pass the output of one command through to the input of the next, and so on. Starting to think of your processing requirements in that way will help you understand operations with the aggregation pipeline better.
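To make the two stages concrete, here is a rough plain-JavaScript sketch (not MongoDB code) that mimics them on the three sample documents above. The helper name countTotalAndDistinct is made up for this illustration, and the Map-based grouping only approximates what $group does on the server.

```javascript
// Plain-JavaScript emulation of the two $group stages (illustrative only;
// countTotalAndDistinct is a hypothetical helper, not a MongoDB API).
const docs = [
  { campaign_id: "A", campaign_name: "A", subscriber_id: "123" },
  { campaign_id: "A", campaign_name: "A", subscriber_id: "123" },
  { campaign_id: "A", campaign_name: "A", subscriber_id: "456" }
];

function countTotalAndDistinct(input) {
  // Stage 1: one entry per (campaign_id, campaign_name, subscriber_id),
  // keeping the number of occurrences of each.
  const perSubscriber = new Map();
  for (const d of input) {
    if (d.subscriber_id == null) continue; // mirrors the $match on $ne: null
    const key = JSON.stringify([d.campaign_id, d.campaign_name, d.subscriber_id]);
    const entry = perSubscriber.get(key) ||
      { campaign_id: d.campaign_id, campaign_name: d.campaign_name, count: 0 };
    entry.count += 1;
    perSubscriber.set(key, entry);
  }

  // Stage 2: per campaign, sum the stage-1 counts (total) and count the
  // stage-1 entries themselves (distinct).
  const perCampaign = new Map();
  for (const e of perSubscriber.values()) {
    const key = JSON.stringify([e.campaign_id, e.campaign_name]);
    const out = perCampaign.get(key) || {
      _id: { campaign_id: e.campaign_id, campaign_name: e.campaign_name },
      totalCount: 0,
      distinctCount: 0
    };
    out.totalCount += e.count;
    out.distinctCount += 1;
    perCampaign.set(key, out);
  }
  return [...perCampaign.values()];
}

const campaignCounts = countTotalAndDistinct(docs);
console.log(JSON.stringify(campaignCounts));
```

The output of the first loop feeds the second one, which is the same "pipe" idea the aggregation pipeline is built on.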

SQL Query: (group by & count of distinct)
select city,count(distinct(emailId)) from TransactionDetails group by city;
The equivalent mongo query would look like this:
db.TransactionDetails.aggregate([
{$group:{_id:{"CITY" : "$cityName"},uniqueCount: {$addToSet: "$emailId"}}},
{$project:{_id:0,"CITY":"$_id.CITY",uniqueCustomerCount:{$size:"$uniqueCount"}} }
]);

Related

Mongodb - group by same value in different fields in different documents

I have documents with common values in different fields that I want to group by that value. Simplified records are:
{ _id:1,
"Home" : "A",
"Away" : "B" }
{ _id:2,
"Home" : "B",
"Away" : "C" }
{ _id:3,
"Home" : "C",
"Away" : "A" }
{ _id:4,
"Home" : "C",
"Away" : "B" }
{ _id:5,
"Home" : "A",
"Away" : "C" }
I am trying to get an aggregate group result that includes, for example, the value "A" whether it appears in a document in the field "Home", or the field "Away". The result I want is:
{"_id": "A", "count": 3},
{"_id": "B", "count": 3},
{"_id": "C", "count": 4}
Grouping by either "Home" or "Away" is no problem but that wouldn't give me all the records, as shown below, I wouldn't get a count of records where "A" or "B" or "C" was in the "Home" field:
{$group:
{_id: "$Away"} etc... }
I have tried using $cond from other posts here as follows:
$group : {
_id : {
$cond : [{
$gt : [ "$Away", null]
}, "$Home"]
}
}
Also tried an $or which is pretty obviously wrong since it will only find the same value for Away and Home fields within each document (which is never the case):
$group : {
_id : {
$or : [ "$Away", "$Home"]
}
}
I'm stuck and not sure if this is even possible; to group on a value that may be in different fields in different documents.
You can build an object from the two fields, convert it with $objectToArray, $unwind the result, and then group, like this:
Create an object using $set that holds both values ($Home and $Away).
Use $project so the original fields are not passed to the next stage. They are no longer necessary because you have the object.
Then use $objectToArray followed by $unwind to get every value as its own document.
Finally, $group by the property v generated by $objectToArray.
db.collection.aggregate([
{
"$set": {
"obj": {
"Home": "$Home",
"Away": "$Away"
}
}
},
{
"$project": {"Away": 0,"Home": 0}
},
{
"$set": {"obj": {"$objectToArray": "$obj"}}
},
{
"$unwind": "$obj"
},
{
"$group": {
"_id": "$obj.v",
"count": {"$sum": 1}
}
}
])
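As a sanity check on the expected counts, the same idea can be sketched in plain JavaScript (not MongoDB code): emit one value per field for every document, then count. The function name countAppearances is made up for this illustration.

```javascript
// Plain-JavaScript emulation of "$objectToArray + $unwind + $group"
// (illustrative only; countAppearances is a hypothetical helper).
const matches = [
  { _id: 1, Home: "A", Away: "B" },
  { _id: 2, Home: "B", Away: "C" },
  { _id: 3, Home: "C", Away: "A" },
  { _id: 4, Home: "C", Away: "B" },
  { _id: 5, Home: "A", Away: "C" }
];

function countAppearances(input) {
  const counts = new Map();
  for (const doc of input) {
    // One value per field, like unwinding the array built from Home and Away.
    for (const value of [doc.Home, doc.Away]) {
      counts.set(value, (counts.get(value) || 0) + 1);
    }
  }
  // Sort by key only to make the printed order predictable.
  return [...counts]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => a._id.localeCompare(b._id));
}

const appearanceCounts = countAppearances(matches);
console.log(JSON.stringify(appearanceCounts));
```

For the five sample records this gives 3 for "A", 3 for "B" and 4 for "C", matching the result asked for in the question.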

MongoDB aggregation, find number of distinct values in documents' arrays

Reading the docs, I see you can get the number of elements in document arrays. For example given the following documents:
{ "_id" : 1, "item" : "ABC1", "description" : "product 1", colors: [ "blue", "black", "red" ] }
{ "_id" : 2, "item" : "ABC2", "description" : "product 2", colors: [ "purple" ] }
{ "_id" : 3, "item" : "XYZ1", "description" : "product 3", colors: [ ] }
and the following query:
db.inventory.aggregate([{$project: {item: 1, numberOfColors: { $size: "$colors" }}}])
We would get the number of elements in each document's colors array:
{ "_id" : 1, "item" : "ABC1", "numberOfColors" : 3 }
{ "_id" : 2, "item" : "ABC2", "numberOfColors" : 1 }
{ "_id" : 3, "item" : "XYZ1", "numberOfColors" : 0 }
I've not been able to figure out if and how you could sum up all the colors in all the documents directly from a query, ie:
{ "totalColors": 4 }
You can use the following query to get the count of all colors in all docs:
db.inventory.aggregate([
{ $unwind: '$colors' } , // expands nested array so we have one doc per each array value
{ $group: {_id: null, allColors: {$addToSet: "$colors"} } } , // find all colors
{ $project: { totalColors: {$size: "$allColors"}}} // find count of all colors
])
Infinitely better is to simply $sum the $size:
db.inventory.aggregate([
{ "$group": { "_id": null, "totalColors": { "$sum": { "$size": "$colors" } } } }
])
If you wanted "distinct in each document" then you would instead:
db.inventory.aggregate([
{ "$group": {
"_id": null,
"totalColors": {
"$sum": {
"$size": { "$setUnion": [ [], "$colors" ] }
}
}
}}
])
Where $setUnion takes values like ["purple","blue","purple"] and turns them into ["purple","blue"] as a "set" with "distinct items".
And if you really want "distinct across documents" then don't accumulate the "distinct" into a single document. That causes performance issues and simply does not scale to large data sets, and can possibly break the 16MB BSON Limit. Instead accumulate naturally via the key:
db.inventory.aggregate([
{ "$unwind": "$colors" },
{ "$group": { "_id": "$colors" } },
{ "$group": { "_id": null, "totalColors": { "$sum": 1 } } }
])
Where you only use $unwind because you want "distinct" values from the array as combined with other documents. Generally $unwind should be avoided unless the value contained in the array is being accessed in the "grouping key" _id of $group. Where it is not, it is better to treat arrays using other operators instead, since $unwind creates a "copy" of the whole document per array element.
And of course there was also nothing wrong with simply using .distinct() here, which will return the "distinct" values "as an array", whose length property you can then just check in code:
var totalSize = db.inventory.distinct("colors").length;
For the simple operation you are asking about, that would be the fastest approach overall to a simple "count of distinct elements". Of course the limitation remains that the result cannot exceed the 16MB BSON limit as a payload, which is where you defer to .aggregate() instead.
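The three meanings of "count" discussed above are easy to mix up, so here is a small plain-JavaScript sketch (not MongoDB code) that computes all three. The helper colorCounts and the second data set with repeated colors are made up for this illustration; only the first data set comes from the question.

```javascript
// Plain-JavaScript comparison of the three counts (illustrative only;
// colorCounts is a hypothetical helper, not a MongoDB API).
function colorCounts(docs) {
  return {
    // like $sum of $size: every array element counts
    totalElements: docs.reduce((sum, d) => sum + d.colors.length, 0),
    // like $sum of $size of $setUnion: duplicates inside one document collapse
    distinctPerDocument: docs.reduce((sum, d) => sum + new Set(d.colors).size, 0),
    // like $unwind + $group by color: duplicates across documents collapse too
    distinctAcrossDocuments: new Set(docs.flatMap(d => d.colors)).size
  };
}

// Sample documents from the question (only the colors field matters here).
const inventory = [
  { _id: 1, item: "ABC1", colors: ["blue", "black", "red"] },
  { _id: 2, item: "ABC2", colors: ["purple"] },
  { _id: 3, item: "XYZ1", colors: [] }
];

// Made-up documents with repeated colors, to show where the counts differ.
const withDuplicates = [
  { _id: 1, colors: ["purple", "blue", "purple"] },
  { _id: 2, colors: ["blue"] }
];

const inventoryCounts = colorCounts(inventory);
const duplicateCounts = colorCounts(withDuplicates);
console.log(inventoryCounts, duplicateCounts);
```

On the question's data all three numbers are 4 because no color repeats. On the made-up data they are 4, 3 and 2, which is why it matters which one you actually want.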

MongoDB aggregate nested array correctly

OK I am very new to Mongo, and I am already stuck.
Db has the following structure (much simplified for sure):
{
{
"_id" : ObjectId("57fdfbc12dc30a46507044ec"),
"keyterms" : [
{
"score" : "2",
"value" : "AA",
},
{
"score" : "2",
"value" : "AA",
},
{
"score" : "4",
"value" : "BB",
},
{
"score" : "3",
"value" : "CC",
}
]
},
{
"_id" : ObjectId("57fdfbc12dc30a46507044ef"),
"keyterms" : [
...
There are several objects. Each object has an array "keyterms", and each entry of that array has a score and a value. There are some duplicates though (not really, since in the real db the keyterms entries have many more fields, but as far as value and score are concerned they are duplicates).
Now I need a query, which
selects one object by id
groups its keyterms by value
and counts the duplicates
sorts them by score
So I want to have something like that as result
// for Object 57fdfbc12dc30a46507044ec
"keyterms": [
{
"score" : "4",
"value" : "BB",
"count" : 1
},
{
"score" : "3",
"value" : "CC",
"count" : 1
}
{
"score" : "2",
"value" : "AA",
"count" : 2
}
]
In SQL I would have written something like this
select
score, value, count(*) as count
from
all_keywords_table_or_some_join
group by
value
order by
score
But, sadly enough, it's not SQL.
In Mongo I managed to write this:
db.getCollection('tests').aggregate([
{$match: {'_id': ObjectId('57fdfbc12dc30a46507044ec')}},
{$unwind: "$keyterms"},
{$sort: {"keyterms.score": -1}},
{$group: {
'_id': "$_id",
'keyterms': {$push: "$keyterms"}
}},
{$project: {
'keyterms.score': 1,
'keyterms.value': 1
}}
])
But there is something missing: the grouping of the keyterms by their value. I cannot get rid of the feeling that this is the wrong approach altogether. How can I select the keyterms array, continue with just that, and use an aggregate function only on it? That would be easy.
BTW I read this
(Mongo aggregate nested array)
but I can't figure it out for my example unfortunately...
You'd want an aggregation pipeline where after you $unwind the array, you group the flattened documents by the array's value and score keys, aggregate the counts using the $sum accumulator operator and retain the main document's _id with the $first operator.
After sorting by score, a further $group stage should then group the documents from the previous stage by the original document's _id (kept in doc_id) so as to preserve the original schema, and recreate the keyterms array using the $push operator.
The following demonstration attempts to explain the above aggregation operation:
db.tests.aggregate([
{ "$match": { "_id": ObjectId("57fdfbc12dc30a46507044ec") } },
{ "$unwind": "$keyterms" },
{
"$group": {
"_id": {
"value": "$keyterms.value",
"score": "$keyterms.score"
},
"doc_id": { "$first": "$_id" },
"count": { "$sum": 1 }
}
},
{ "$sort": {"_id.score": -1 } },
{
"$group": {
"_id": "$doc_id",
"keyterms": {
"$push": {
"value": "$_id.value",
"score": "$_id.score",
"count": "$count"
}
}
}
}
])
Sample Output
{
"_id" : ObjectId("57fdfbc12dc30a46507044ec"),
"keyterms" : [
{
"value" : "BB",
"score" : "4",
"count" : 1
},
{
"value" : "CC",
"score" : "3",
"count" : 1
},
{
"value" : "AA",
"score" : "2",
"count" : 2
}
]
}
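For readers who find it easier to follow in ordinary code, the grouping described above can be sketched in plain JavaScript (not MongoDB code) on the sample document. The helper groupKeyterms is made up for this illustration. One assumption to note: the sample scores are strings, and this sketch converts them to numbers for sorting, which is not necessarily how the $sort stage would compare them.

```javascript
// Plain-JavaScript emulation of "$unwind + $group + $sort + $group"
// (illustrative only; groupKeyterms is a hypothetical helper).
const sampleDoc = {
  _id: "57fdfbc12dc30a46507044ec",
  keyterms: [
    { score: "2", value: "AA" },
    { score: "2", value: "AA" },
    { score: "4", value: "BB" },
    { score: "3", value: "CC" }
  ]
};

function groupKeyterms(document) {
  // Group the array entries by (value, score) and count each group.
  const groups = new Map();
  for (const term of document.keyterms) {
    const key = JSON.stringify([term.value, term.score]);
    const group = groups.get(key) || { value: term.value, score: term.score, count: 0 };
    group.count += 1;
    groups.set(key, group);
  }
  // Assumption: scores are compared numerically, highest first.
  const keyterms = [...groups.values()]
    .sort((a, b) => Number(b.score) - Number(a.score));
  // Rebuild a document with the original _id and the grouped array.
  return { _id: document._id, keyterms };
}

const groupedDoc = groupKeyterms(sampleDoc);
console.log(JSON.stringify(groupedDoc, null, 2));
```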
Meanwhile, I solved it myself:
aggregate([
{$match: {'_id': ObjectId('57fdfbc12dc30a46507044ec')}},
{$unwind: "$keyterms"},
{$sort: {"keyterms.score": -1}},
{$group: {
'_id': "$keyterms.value",
'keyterms': {$push: "$keyterms"},
'escore': {$first: "$keyterms.score"},
'evalue': {$first: "$keyterms.value"}
}},
{$limit: 15},
{$project: {
"score": "$escore",
"value": "$evalue",
"count": {$size: "$keyterms"}
}}
])

How to return distinct $or in mongodb?

So I have this query
db.collection.find({$or:[{data_id:123},{data_id:345},{data_id:443}]});
How do I tweak it to return only one document for each part of the $or?
I.e. something analogous to the SQL:
SELECT DISTINCT data_id, [...] WHERE data_id='123' OR data_id='345'...
Your question needs to be considered in light of the documents you have, as "distinct" can mean a few different things here. Consider the following sample:
{
"tripId": 123,
"thisField": "this",
"thatField": "that"
},
{
"tripId": 123,
"thisField": "other",
"thatField": "then"
},
{
"tripId": 345,
"thisField": "other",
"thatField": "then"
},
{
"tripId": 345,
"thisField": "this",
"thatField": "that"
},
{
"tripId": 123,
"thisField": "this",
"thatField": "that"
},
{
"tripId": 789,
"thisField": "this",
"thatField": "that"
}
MongoDB has the .distinct() method, which returns the distinct values for a single field, but only for one field, and the items are returned simply as an array of those field values.
For anything else you want the .aggregate() method. This is the aggregation pipeline, which performs a number of different functions and can handle some very complex operations due to the "pipeline" nature of its processing.
Particularly here you would want to use a $group pipeline stage in order to "group" together values based on a key. That "key" is expressed in the form of an _id key in the $group statement. Much like "SELECT" in SQL with a "GROUP BY" or a "DISTINCT" modifier ( which are much the same in function ) you need to specify all of the fields you intend in the results.
Moreover, anything that would not be specified in a "GROUP BY" portion of a statement would have to be subject to some sort of "grouping operation" in order to select which field values to present. For this there are various "Group Accumulator Operators" to act on these values.
Here is one example using the $first operator:
db.collection.aggregate([
{ "$match": {
"tripId": { "$in": [ 123,345 ] }
}},
{ "$group": {
"_id": "$tripId",
"thisField": { "$first": "$thisField" },
"thatField": { "$first": "$thatField" },
"total": { "$sum": 1 }
}}
])
Gives this result:
{ "_id" : 345, "thisField" : "other", "thatField" : "then", "total" : 2 }
{ "_id" : 123, "thisField" : "this", "thatField" : "that", "total" : 3 }
So with the addition of a $sum operator to count the occurrences of each distinct value, this picks up the "first" occurrence of each field named in the accumulator expressions outside of the grouping key.
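What $first and $sum do per group can be sketched in plain JavaScript (not MongoDB code) on the sample documents above. The helper firstPerTrip is made up for this illustration, and the Map here keeps groups in first-seen order, which is not something $group guarantees.

```javascript
// Plain-JavaScript emulation of "$match with $in" followed by "$group with
// $first and $sum" (illustrative only; firstPerTrip is a hypothetical helper).
const trips = [
  { tripId: 123, thisField: "this", thatField: "that" },
  { tripId: 123, thisField: "other", thatField: "then" },
  { tripId: 345, thisField: "other", thatField: "then" },
  { tripId: 345, thisField: "this", thatField: "that" },
  { tripId: 123, thisField: "this", thatField: "that" },
  { tripId: 789, thisField: "this", thatField: "that" }
];

function firstPerTrip(input, wantedTripIds) {
  const groups = new Map();
  for (const d of input) {
    if (!wantedTripIds.includes(d.tripId)) continue; // the $in filter
    const group = groups.get(d.tripId);
    if (group) {
      group.total += 1; // $sum: 1 for every further document in the group
    } else {
      // $first: the field values are taken from the first document seen
      groups.set(d.tripId, {
        _id: d.tripId,
        thisField: d.thisField,
        thatField: d.thatField,
        total: 1
      });
    }
  }
  return [...groups.values()];
}

const firstResults = firstPerTrip(trips, [123, 345]);
console.log(JSON.stringify(firstResults));
```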
In versions of MongoDB since 2.6 you can "shortcut" naming all of the fields you want individually using the $$ROOT expression variable. This is a reference to "all" of the fields present in the document as of the state in the current stage where it is used. It's a little shorter to write, but the output is a little different due to the syntax:
db.collection.aggregate([
{ "$match": {
"tripId": { "$in": [ 123,345 ] }
}},
{ "$group": {
"_id": "$tripId",
"doc": { "$first": "$$ROOT" },
"total": { "$sum": 1 }
}}
])
Outputs as:
{
"_id" : 345,
"doc" : {
"_id" : ObjectId("54feaf3839c29b9cd470bcbe"),
"tripId" : 345,
"thisField" : "other",
"thatField" : "then"
},
"total" : 2
}
{
"_id" : 123,
"doc" : {
"_id" : ObjectId("54feaf3839c29b9cd470bcbc"),
"tripId" : 123,
"thisField" : "this",
"thatField" : "that"
},
"total" : 3
}
That is a general case with most $group aggregation operations where you specify a "key" and subject other fields present to a "grouping operator"/"accumulator" of some sort.
The other case is that if you were looking for the "distinct" occurrences of "all" fields, then you would express these as part of the "key" for the group expression like this:
db.collection.aggregate([
{ "$match": {
"tripId": { "$in": [ 123,345 ] }
}},
{ "$group": {
"_id": {
"tripId": "$tripId",
"thisField": "$thisField",
"thatField": "$thatField"
},
"total": { "$sum": 1 }
}}
])
That gives us this output:
{
"_id" : {
"tripId" : 345,
"thisField" : "this",
"thatField" : "that"
},
"total" : 1
}
{
"_id" : {
"tripId" : 345,
"thisField" : "other",
"thatField" : "then"
},
"total" : 1
}
{
"_id" : {
"tripId" : 123,
"thisField" : "other",
"thatField" : "then"
},
"total" : 1
}
{
"_id" : {
"tripId" : 123,
"thisField" : "this",
"thatField" : "that"
},
"total" : 2
}
The result is 4 documents, one for each "distinct" combination of values in the fields that make up the "composite key". It correctly reports that most of those combinations occurred 1 time, with the exception of the one combination that actually occurs twice with all the same values.
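Grouping on a "composite key" can be sketched in plain JavaScript (not MongoDB code) by turning the key fields into a single string. The helper countDistinctCombinations is made up for this illustration, and using JSON.stringify as the grouping key is just a convenient stand-in for how $group compares _id values.

```javascript
// Plain-JavaScript emulation of $group with a composite _id
// (illustrative only; countDistinctCombinations is a hypothetical helper).
const tripDocs = [
  { tripId: 123, thisField: "this", thatField: "that" },
  { tripId: 123, thisField: "other", thatField: "then" },
  { tripId: 345, thisField: "other", thatField: "then" },
  { tripId: 345, thisField: "this", thatField: "that" },
  { tripId: 123, thisField: "this", thatField: "that" },
  { tripId: 789, thisField: "this", thatField: "that" }
];

function countDistinctCombinations(input, wantedTripIds) {
  const groups = new Map();
  for (const d of input) {
    if (!wantedTripIds.includes(d.tripId)) continue; // the $in filter
    // All three fields together form the grouping key.
    const id = { tripId: d.tripId, thisField: d.thisField, thatField: d.thatField };
    const key = JSON.stringify(id);
    const group = groups.get(key) || { _id: id, total: 0 };
    group.total += 1;
    groups.set(key, group);
  }
  return [...groups.values()];
}

const combinations = countDistinctCombinations(tripDocs, [123, 345]);
console.log(JSON.stringify(combinations, null, 2));
```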
Naturally the $$ROOT variable would not apply here as the "whole document" contains the "unique" _id field for each document. You can always add a $project stage beforehand to filter that field out, but the same requirement to specify the fields you want still applies:
db.collection.aggregate([
{ "$match": {
"tripId": { "$in": [ 123,345 ] }
}},
{ "$project": {
"_id": 0,
"tripId": 1,
"thisField": 1,
"thatField": 1
}},
{ "$group": {
"_id": "$$ROOT",
"total": { "$sum": 1 }
}}
])
So that serves as an introduction with examples of what you can do in the form of "distinct" queries with MongoDB and specifically the aggregation framework. There are various other common SQL to Aggregation mapping examples given in the documentation.
The other general case was your usage of $or in your question. As you see in the samples here, when you want the same "or" condition over values of the same field, then the more efficient way to write this in your query is with the $in operator. Rather than an array of "query documents" this takes an array of "possible values" for the common field it is examining in the expression. It is basically a $or condition, but expressed in a shorter form for this case.
Instead of $or, use an $in query, which will serve your purpose.
db.collection.find({data_id:{$in: [123, 345, 443]}});

Mongodb Aggregation grouping with leave the field

After applying the aggregation
db.grades.aggregate([
{$match: {'type': 'homework'}},
{$sort: {'student_id':1, 'score':1}}
])
got the result:
{
"result" : [
{
"_id" : ObjectId("50906d7fa3c412bb040eb579"),
"student_id" : 0,
"type" : "homework",
"score" : 14.8504576811645
},
{
"_id" : ObjectId("50906d7fa3c412bb040eb57a"),
"student_id" : 0,
"type" : "homework",
"score" : 63.98402553675503
},
...
How can I modify the query to keep only the document with the minimum score for each student, and get a result that keeps the _id field? For example, like this:
{
"_id" : ObjectId("50906d7fa3c412bb040eb579"),
"score" : 14.8504576811645
}
Thanks.
Is this a homework question from the education site? I can't remember, but this is fairly trivial.
db.grades.aggregate([
{ "$match": { type: 'homework' } },
{ "$sort": {student_id: 1, score: 1} },
{ "$group": {
"_id": "$student_id",
"doc": { "$first": "$_id"},
"score": { "$first": "$score"}
}},
{ "$sort": { "_id": 1 } },
{ "$project": {
"_id": "$doc",
"score": 1
}}
])
All this does is use $first to get the first result when grouping by student_id. By first it means exactly that, so this is only useful after sorting and is different from $min which would take the smallest value from the grouped results.
So if you got part of the way there, not only do you keep the first score, but you also do the same operation on the _id value as well.
The additional sort is only there so the results don't trip you up, because they are likely to appear in the reverse order of student_id. Finally there is just a small use of $project to get the document form that you want.
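The "sort, then take the first per group" idea can be sketched in plain JavaScript (not MongoDB code). The helper lowestHomeworkPerStudent is made up for this illustration. The two documents for student 0 come from the question's output; the documents for student 1 and their "hypothetical-" ids are invented purely so there is more than one group.

```javascript
// Plain-JavaScript emulation of "$match + $sort + $group with $first"
// (illustrative only; lowestHomeworkPerStudent is a hypothetical helper).
const grades = [
  { _id: "50906d7fa3c412bb040eb579", student_id: 0, type: "homework", score: 14.8504576811645 },
  { _id: "50906d7fa3c412bb040eb57a", student_id: 0, type: "homework", score: 63.98402553675503 },
  // Invented documents, not from the question:
  { _id: "hypothetical-1", student_id: 1, type: "homework", score: 70.5 },
  { _id: "hypothetical-2", student_id: 1, type: "homework", score: 21.25 },
  { _id: "hypothetical-3", student_id: 1, type: "quiz", score: 5 }
];

function lowestHomeworkPerStudent(input) {
  // Keep homework only, then order by student and ascending score.
  const sorted = input
    .filter(g => g.type === "homework")
    .sort((a, b) => a.student_id - b.student_id || a.score - b.score);

  // After sorting, the first document seen for each student has the lowest score.
  const firstPerStudent = new Map();
  for (const g of sorted) {
    if (!firstPerStudent.has(g.student_id)) {
      firstPerStudent.set(g.student_id, { _id: g._id, score: g.score });
    }
  }
  return [...firstPerStudent.values()];
}

const lowestScores = lowestHomeworkPerStudent(grades);
console.log(JSON.stringify(lowestScores));
```

Because the _id is taken from the same first document as the score, the two values always belong together, which is the point of applying $first to both fields.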