MongoDB - group by same value in different fields in different documents

I have documents with common values in different fields that I want to group by that value. Simplified records are:
{ _id:1,
"Home" : "A",
"Away" : "B" }
{ _id:2,
"Home" : "B",
"Away" : "C" }
{ _id:3,
"Home" : "C",
"Away" : "A" }
{ _id:4,
"Home" : "C",
"Away" : "B" }
{ _id:5,
"Home" : "A",
"Away" : "C" }
I am trying to get an aggregate group result that includes, for example, the value "A" whether it appears in a document in the field "Home", or the field "Away". The result I want is:
{"_id": "A", "count": 3},
{"_id": "B", "count": 3},
{"_id": "C", "count": 4}
Grouping by either "Home" or "Away" is no problem, but that wouldn't give me all the records. With the stage shown below, for example, I wouldn't get a count of records where "A" or "B" or "C" was in the "Home" field:
{$group:
{_id: "$Away"} etc... }
I have tried using $cond from other posts here as follows:
$group : {
_id : {
$cond : [{
$gt : [ "$Away", null]
}, "$Home"]
}
}
Also tried an $or which is pretty obviously wrong since it will only find the same value for Away and Home fields within each document (which is never the case):
$group : {
_id : {
$or : [ "$Away", "$Home"]
}
}
I'm stuck and not sure whether it is even possible to group on a value that may be in different fields in different documents.

You can build an object from the two fields, convert it with $objectToArray, $unwind it, and then group, like this:
Create the object using $set and the existing values ($Home and $Away).
Use $project to keep the original fields out of the next stage. They are not necessary any more, since you have the object.
Then apply $objectToArray and $unwind the result to get one document per value.
Finally, $group by the property v generated by $objectToArray.
db.collection.aggregate([
{
"$set": {
"obj": {
"Home": "$Home",
"Away": "$Away"
}
}
},
{
"$project": {"Away": 0,"Home": 0}
},
{
"$set": {"obj": {"$objectToArray": "$obj"}}
},
{
"$unwind": "$obj"
},
{
"$group": {
"_id": "$obj.v",
"count": {"$sum": 1}
}
}
])
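To check the expected counts without a database, the same logic can be sketched in plain JavaScript. This is only an in-memory emulation of what the stages do, not MongoDB code; countTeams is a hypothetical helper name, and the sample documents are the ones from the question.

```javascript
// Plain-JavaScript emulation of the pipeline's logic (no MongoDB involved).
const docs = [
  { _id: 1, Home: "A", Away: "B" },
  { _id: 2, Home: "B", Away: "C" },
  { _id: 3, Home: "C", Away: "A" },
  { _id: 4, Home: "C", Away: "B" },
  { _id: 5, Home: "A", Away: "C" },
];

// countTeams is a hypothetical helper name used only for this illustration.
function countTeams(documents) {
  const counts = new Map();
  for (const doc of documents) {
    // Rough equivalent of $objectToArray + $unwind: one entry per field value.
    for (const team of [doc.Home, doc.Away]) {
      // Rough equivalent of $group with { $sum: 1 }.
      counts.set(team, (counts.get(team) || 0) + 1);
    }
  }
  // Sorted by key here only to make the output order deterministic.
  return [...counts]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([_id, count]) => ({ _id, count }));
}

console.log(countTeams(docs));
```

Each document contributes two entries, one for Home and one for Away, which is why a value is counted regardless of the field it appears in.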

Related

MongoDB aggregation, find number of distinct values in documents' arrays

Reading the docs, I see you can get the number of elements in document arrays. For example given the following documents:
{ "_id" : 1, "item" : "ABC1", "description" : "product 1", colors: [ "blue", "black", "red" ] }
{ "_id" : 2, "item" : "ABC2", "description" : "product 2", colors: [ "purple" ] }
{ "_id" : 3, "item" : "XYZ1", "description" : "product 3", colors: [ ] }
and the following query:
db.inventory.aggregate([{$project: {item: 1, numberOfColors: { $size: "$colors" }}}])
We would get the number of elements in each document's colors array:
{ "_id" : 1, "item" : "ABC1", "numberOfColors" : 3 }
{ "_id" : 2, "item" : "ABC2", "numberOfColors" : 1 }
{ "_id" : 3, "item" : "XYZ1", "numberOfColors" : 0 }
I've not been able to figure out if and how you could sum up all the colors in all the documents directly from a query, ie:
{ "totalColors": 4 }
You can use the following query to get the count of all colors in all docs:
db.inventory.aggregate([
{ $unwind: '$colors' } , // expands nested array so we have one doc per each array value
{ $group: {_id: null, allColors: {$addToSet: "$colors"} } } , // find all colors
{ $project: { totalColors: {$size: "$allColors"}}} // find count of all colors
])
Far better is to simply $sum the $size:
db.inventory.aggregate([
{ "$group": { "_id": null, "totalColors": { "$sum": { "$size": "$colors" } } } }
])
If you wanted "distinct in each document" then you would instead:
db.inventory.aggregate([
{ "$group": {
"_id": null,
"totalColors": {
"$sum": {
"$size": { "$setUnion": [ [], "$colors" ] }
}
}
}}
])
Where $setUnion takes values like ["purple","blue","purple"] and makes them into ["purple","blue"] as a "set" with "distinct items".
And if you really want "distinct across documents" then don't accumulate the "distinct" into a single document. That causes performance issues and simply does not scale to large data sets, and can possibly break the 16MB BSON Limit. Instead accumulate naturally via the key:
db.inventory.aggregate([
{ "$unwind": "$colors" },
{ "$group": { "_id": "$colors" } },
{ "$group": { "_id": null, "totalColors": { "$sum": 1 } } }
])
Where you only use $unwind because you want "distinct" values from the array as combined with other documents. Generally $unwind should be avoided unless the value contained in the array is being accessed in the "grouping key" _id of $group. Where it is not, it is better to treat arrays using other operators instead, since $unwind creates a "copy" of the whole document per array element.
And of course there is also nothing wrong with simply using .distinct() here, which will return the "distinct" values "as an array", whose length property you can then just read in code:
var totalSize = db.inventory.distinct("colors").length;
Which for the simple operation you are asking, would be the overall fastest approach for a simple "count of distinct elements". Of course the limitation remains that the result cannot exceed the 16MB BSON limit as a payload. Which is where you defer to .aggregate() instead.
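The three notions of "count" discussed above are easy to mix up, so here is a small plain-JavaScript sketch of the difference. It is an in-memory illustration only, not MongoDB code. The first three documents come from the question; the fourth is a hypothetical addition so that the three numbers differ.

```javascript
const inventory = [
  { _id: 1, colors: ["blue", "black", "red"] },
  { _id: 2, colors: ["purple"] },
  { _id: 3, colors: [] },
  // Hypothetical extra document (not in the question), with a repeated value.
  { _id: 4, colors: ["purple", "blue", "purple"] },
];

// Like $sum over $size: every array element counts.
const totalElements = inventory.reduce((n, d) => n + d.colors.length, 0);

// Like $sum over $size of $setUnion: duplicates inside one document collapse.
const distinctPerDocument = inventory.reduce(
  (n, d) => n + new Set(d.colors).size,
  0
);

// Like $unwind + $group by color + counting the groups, or distinct("colors").length.
const distinctOverall = new Set(inventory.flatMap((d) => d.colors)).size;

console.log({ totalElements, distinctPerDocument, distinctOverall });
```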

MongoDB aggregate nested array correctly

OK I am very new to Mongo, and I am already stuck.
Db has the following structure (much simplified for sure):
{
{
"_id" : ObjectId("57fdfbc12dc30a46507044ec"),
"keyterms" : [
{
"score" : "2",
"value" : "AA",
},
{
"score" : "2",
"value" : "AA",
},
{
"score" : "4",
"value" : "BB",
},
{
"score" : "3",
"value" : "CC",
}
]
},
{
"_id" : ObjectId("57fdfbc12dc30a46507044ef"),
"keyterms" : [
...
There are some objects. Each object has an array "keyterms". Each entry of this array has a score and a value. There are some duplicates, though (not really, since in the real db the keyterms entries have many more fields, but with respect to value and score they are duplicates).
Now I need a query, which
selects one object by id
groups its keyterms by value
and counts the duplicates
sorts them by score
So I want to have something like that as result
// for Object 57fdfbc12dc30a46507044ec
"keyterms": [
{
"score" : "4",
"value" : "BB",
"count" : 1
},
{
"score" : "3",
"value" : "CC",
"count" : 1
},
{
"score" : "2",
"value" : "AA",
"count" : 2
}
]
In SQL I would have written something like this
select
score, value, count(*) as count
from
all_keywords_table_or_some_join
group by
value
order by
score
But, sadly enough, it's not SQL.
In Mongo I managed to write this:
db.getCollection('tests').aggregate([
{$match: {'_id': ObjectId('57fdfbc12dc30a46507044ec')}},
{$unwind: "$keyterms"},
{$sort: {"keyterms.score": -1}},
{$group: {
'_id': "$_id",
'keyterms': {$push: "$keyterms"}
}},
{$project: {
'keyterms.score': 1,
'keyterms.value': 1
}}
])
But there is something missing: the grouping of the keyterms by their value. I cannot get rid of the feeling that this is the wrong approach altogether. How can I select the keyterms array, continue with just that, and use an aggregate function only on it? That would be easy.
BTW I read this
(Mongo aggregate nested array)
but I can't figure it out for my example unfortunately...
You'd want an aggregation pipeline where, after you $unwind the array, you group the flattened documents by the array's value and score keys, aggregate the counts using the $sum accumulator operator, and retain the main document's _id with the $first operator.
After a $sort on the score, the next $group stage should then group the documents from the previous stage by the original document's _id, so as to preserve the original schema, and recreate the keyterms array using the $push operator.
The following demonstration attempts to explain the above aggregation operation:
db.tests.aggregate([
{ "$match": { "_id": ObjectId("57fdfbc12dc30a46507044ec") } },
{ "$unwind": "$keyterms" },
{
"$group": {
"_id": {
"value": "$keyterms.value",
"score": "$keyterms.score"
},
"doc_id": { "$first": "$_id" },
"count": { "$sum": 1 }
}
},
{ "$sort": {"_id.score": -1 } },
{
"$group": {
"_id": "$doc_id",
"keyterms": {
"$push": {
"value": "$_id.value",
"score": "$_id.score",
"count": "$count"
}
}
}
}
])
Sample Output
{
"_id" : ObjectId("57fdfbc12dc30a46507044ec"),
"keyterms" : [
{
"value" : "BB",
"score" : "4",
"count" : 1
},
{
"value" : "CC",
"score" : "3",
"count" : 1
},
{
"value" : "AA",
"score" : "2",
"count" : 2
}
]
}
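As a cross-check of the sample output, the grouping, counting and sorting can be emulated in plain JavaScript. This is an in-memory sketch and not MongoDB code; groupKeyterms is a hypothetical helper name, and the _id is written as a plain string for simplicity.

```javascript
const doc = {
  _id: "57fdfbc12dc30a46507044ec",
  keyterms: [
    { score: "2", value: "AA" },
    { score: "2", value: "AA" },
    { score: "4", value: "BB" },
    { score: "3", value: "CC" },
  ],
};

// groupKeyterms is a hypothetical helper name used only for this illustration.
function groupKeyterms(document) {
  const groups = new Map();
  for (const { value, score } of document.keyterms) {
    // Compound key, like the { value, score } _id of the first $group.
    const key = JSON.stringify([value, score]);
    const group = groups.get(key) || { value, score, count: 0 };
    group.count += 1; // like { $sum: 1 }
    groups.set(key, group);
  }
  // Descending sort on the stored score strings, like { "_id.score": -1 }.
  const keyterms = [...groups.values()].sort((a, b) =>
    a.score < b.score ? 1 : a.score > b.score ? -1 : 0
  );
  return { _id: document._id, keyterms };
}

console.log(JSON.stringify(groupKeyterms(doc), null, 2));
```

Note that the scores in the sample are stored as strings, so they are compared as strings: a score of "10" would sort below "2". Storing them as numbers avoids that.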
Meanwhile, I solved it myself:
db.getCollection('tests').aggregate([
{$match: {'_id': ObjectId('57fdfbc12dc30a46507044ec')}},
{$unwind: "$keyterms"},
{$sort: {"keyterms.score": -1}},
{$group: {
'_id': "$keyterms.value",
'keyterms': {$push: "$keyterms"},
'escore': {$first: "$keyterms.score"},
'evalue': {$first: "$keyterms.value"}
}},
{$limit: 15},
{$project: {
"score": "$escore",
"value": "$evalue",
"count": {$size: "$keyterms"}
}}
])

Select Group by count and distinct count in same mongodb query

I am trying to do something like
select campaign_id,campaign_name,count(subscriber_id),count(distinct subscriber_id)
from campaigns group by campaign_id,campaign_name;
This query gives results, except for count(distinct subscriber_id):
db.campaigns.aggregate([
{$match: {subscriber_id: {$ne: null}}},
{$group: {
_id: {campaign_id: "$campaign_id",campaign_name: "$campaign_name"},
count: {$sum: 1}
}}
])
The following query gives results, except for count(subscriber_id):
db.campaigns_logs.aggregate([
{$match : {subscriber_id: {$ne: null}}},
{$group : { _id: {campaign_id: "$campaign_id",campaign_name: "$campaign_name",subscriber_id: "$subscriber_id"}}},
{$group : { _id: {campaign_id: "$campaign_id",campaign_name: "$campaign_name"},
count: {$sum: 1}
}}
])
But I want count(subscriber_id) and count(distinct subscriber_id) in the same result.
You are thinking along the right lines here. Changing your SQL mindset, "distinct" is really just another way of writing a $group operation in either language. That means you have two group operations happening here and, in aggregation pipeline terms, two pipeline stages.
Just with simplified documents to visualize:
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "123"
},
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "123"
},
{
"campaign_id": "A",
"campaign_name": "A",
"subscriber_id": "456"
}
It stands to reason that for the given "campaign" combination the total count and "distinct" count are "3" and "2" respectively. So the logical thing to do is "group" up all of those "subscriber_id" values first and keep the count of occurrences for each, then while thinking "pipeline", "total" those counts per "campaign" and then just count the "distinct" occurrences as a separate number:
db.campaigns.aggregate([
{ "$match": { "subscriber_id": { "$ne": null }}},
// Count all occurrences
{ "$group": {
"_id": {
"campaign_id": "$campaign_id",
"campaign_name": "$campaign_name",
"subscriber_id": "$subscriber_id"
},
"count": { "$sum": 1 }
}},
// Sum all occurrences and count distinct
{ "$group": {
"_id": {
"campaign_id": "$_id.campaign_id",
"campaign_name": "$_id.campaign_name"
},
"totalCount": { "$sum": "$count" },
"distinctCount": { "$sum": 1 }
}}
])
After the first "group" the output documents can be visualized like this:
{
"_id" : {
"campaign_id" : "A",
"campaign_name" : "A",
"subscriber_id" : "456"
},
"count" : 1
}
{
"_id" : {
"campaign_id" : "A",
"campaign_name" : "A",
"subscriber_id" : "123"
},
"count" : 2
}
So from the "three" documents in the sample, "2" belong to one distinct value and "1" to another. This can still be totaled with $sum in order to get the total matching documents which you do in the following stage, with the final result:
{
"_id" : {
"campaign_id" : "A",
"campaign_name" : "A"
},
"totalCount" : 3,
"distinctCount" : 2
}
A really good analogy for the aggregation pipeline is the unix pipe "|" operator, which allows "chaining" of operations so you can pass the output of one command through to the input of the next, and so on. Starting to think of your processing requirements in that way will help you understand operations with the aggregation pipeline better.
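The two-stage idea can also be traced in plain JavaScript, with the output of one step fed into the next, much like the pipe analogy. This is an in-memory sketch and not MongoDB code; groupBy is a hypothetical helper, and the rows are the simplified documents shown earlier.

```javascript
const logs = [
  { campaign_id: "A", campaign_name: "A", subscriber_id: "123" },
  { campaign_id: "A", campaign_name: "A", subscriber_id: "123" },
  { campaign_id: "A", campaign_name: "A", subscriber_id: "456" },
];

// groupBy is a hypothetical helper: it buckets rows by a compound key.
function groupBy(rows, keyOf) {
  const buckets = new Map();
  for (const row of rows) {
    const key = JSON.stringify(keyOf(row));
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(row);
  }
  return [...buckets.values()];
}

// Stage 1: one group per campaign + subscriber, counting occurrences.
const perSubscriber = groupBy(logs, (r) => [
  r.campaign_id,
  r.campaign_name,
  r.subscriber_id,
]).map((rows) => ({
  _id: {
    campaign_id: rows[0].campaign_id,
    campaign_name: rows[0].campaign_name,
    subscriber_id: rows[0].subscriber_id,
  },
  count: rows.length,
}));

// Stage 2: one group per campaign; sum the counts and count the groups.
const perCampaign = groupBy(perSubscriber, (g) => [
  g._id.campaign_id,
  g._id.campaign_name,
]).map((groups) => ({
  _id: {
    campaign_id: groups[0]._id.campaign_id,
    campaign_name: groups[0]._id.campaign_name,
  },
  totalCount: groups.reduce((n, g) => n + g.count, 0),
  distinctCount: groups.length,
}));

console.log(JSON.stringify(perCampaign));
```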
SQL Query: (group by & count of distinct)
select city,count(distinct(emailId)) from TransactionDetails group by city;
The equivalent mongo query would look like this:
db.TransactionDetails.aggregate([
{$group:{_id:{"CITY" : "$cityName"},uniqueCount: {$addToSet: "$emailId"}}},
{$project:{_id:0,"CITY":"$_id.CITY",uniqueCustomerCount:{$size:"$uniqueCount"}} }
]);

Documents in MongoDB where last n sub-array elements contain a value

Consider this set of data in MongoDB...
{
_id: 1,
name: "Johnny",
properties: [
{
type: "A",
value: 257,
date: "4/1/2014"
},
{
type: "A",
value: 200,
date: "4/2/2014"
},
{
type: "B",
value: 301,
date: "4/3/2014"
},
...]
}
What is the proper way to query for the documents in which one (or more) of the last two "properties" elements has a value > x, or one (or more) of the last two "properties" elements of type "A" has a value > x?
If you can stomach modifying your insertion method, try the following.
Change your updates to push the following:
doc = { type : "A", "value" : 123, "date" : new Date() }
db.foo.update( {_id:1}, { "$push" : { "properties" : { "$each" : [ doc ], "$sort" : { date : -1} } } } )
This will give you an array of documents sorted in descending order by time, making the "most recent" document first.
You can now use the standard MongoDB dot notation to query against the 0, 1, etc elements of your properties array, which represent the most recent additions logically.
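As a sketch of what such a dot-notation query could look like for the first case (value > x in either of the two most recent elements), assuming the array is kept sorted newest-first as described above. The threshold x = 200 and the variable name newestTwoOverX are placeholders chosen for illustration.

```javascript
// Placeholder threshold; the question leaves x unspecified.
const x = 200;

// Filter document only: index 0 and 1 are the two newest elements,
// provided every update pushes with $sort: { date: -1 }.
const newestTwoOverX = {
  $or: [
    { "properties.0.value": { $gt: x } },
    { "properties.1.value": { $gt: x } },
  ],
};

// Hypothetical usage in the shell: db.foo.find(newestTwoOverX)
console.log(JSON.stringify(newestTwoOverX));
```

This only holds for documents written through the sorted $push; arrays stored before that change would still be in their old order.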
As per the comments, the aggregation framework is for a lot more than simply "aggregating" values, so you can take advantage of the various pipeline operators to do very advanced things that cannot be achieved simply using .find()
db.collection.aggregate([
// Match documents that "could" meet the conditions to narrow down
{ "$match": {
"properties": { "$elemMatch": {
"type": "A", "value": { "$gt": 200 }
}}
}},
// Keep a copy of the document for later with an array copy
{ "$project": {
"_id": {
"_id": "$_id",
"name": "$name",
"properties": "$properties"
},
"properties": 1
}},
// Unwind the array to "de-normalize"
{ "$unwind": "$properties" },
// Get the "last" element of the array and copy the existing one
{ "$group": {
"_id": "$_id",
"properties": { "$last": "$_id.properties" },
"last": { "$last": "$properties" },
"count": { "$sum": 1 }
}},
// Unwind the copy again
{ "$unwind": "$properties" },
// Project to mark the element you already have
{ "$project": {
"properties": 1,
"last": 1,
"count": 1,
"seen": { "$eq": [ "$properties", "$last" ] }
}},
// Match again, being careful to keep any array with one element only
// This gets rid of the element you already kept
{ "$match": {
"$or": [
{ "seen": false },
{ "seen": true, "count": 1 }
]
}},
// Group to get the second last element as "next"
{ "$group": {
"_id": "$_id",
"last": { "$last": "$last" },
"next": { "$last": "$properties" }
}},
// Then match to see if either of those elements fits
{ "$match": {
"$or": [
{ "last.type": "A", "last.value": { "$gt": 200 } },
{ "next.type": "A", "next.value": { "$gt": 200 } }
]
}},
// Finally restore your matching documents
{ "$project": {
"_id": "$_id._id",
"name": "$_id.name",
"properties": "$_id.properties"
}}
])
Running through that in a bit more detail:
The first $match usage is to make sure you are only working on documents that can "possibly" match your extended conditions. Always a good idea to optimize like this.
The next stage is to $project since you likely want to keep the original document detail and you are at least going to need the array again in order to get the second last element.
The next stages make use of $unwind in order to break the array into individual documents which is then followed by $group which is used to find the last item on the document _id boundary. This is actually the last item in the array. Plus you keep a count of the array elements.
So then after using $unwind again on the original array content, the usage of $project again adds a "seen" field to the document, indicating via the $eq operator whether or not the document from the original is actually the one that was previously kept as the "last" element.
After that stage you again issue a $match in order to filter that last document from the result, while also making sure in the condition that you are not removing anything that originally matched where the array length is actually 1.
From here you want to $group again in order to get the "second last" element from the array (or indeed the same "last" element where there was only one).
The final steps are simply to $match where either of those last two elements meets the conditions, and then finally $project the document in its original form.
So while that is fairly involved and of course increases in complexity by the number of items you want to test at the end of the array it can be done, and shows how aggregate is very suited to the problem.
Where possible it is the best approach, since invoking the JavaScript interpreter incurs an overhead compared to the native code used by aggregate.
Using mapReduce would remove the code complexity for taking the last two possible elements (or more) but it will invoke the JavaScript interpreter by nature and will therefore run much more slowly.
For the record, since the sample in the question would not be a match, here is some data in which the last two documents match, one of which has only one element in the array:
{
"_id" : 1,
"name" : "Johnny",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "A",
"value" : 200,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}
{
"_id" : 2,
"name" : "Ace",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "B",
"value" : 200,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}
{
"_id" : 3,
"name" : "Bo",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
}
]
}
{
"_id" : 4,
"name" : "Sue",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "A",
"value" : 240,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}
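To see which of these documents should come back, the condition (type "A" with value > 200 in either of the last two elements) can be checked in plain JavaScript. This is an in-memory sketch of the expected result, not the pipeline itself; matchesLastTwo is a hypothetical helper, and the date fields are omitted because the condition does not use them.

```javascript
const people = [
  { _id: 1, name: "Johnny", properties: [
    { type: "A", value: 257 }, { type: "A", value: 200 }, { type: "B", value: 301 },
  ] },
  { _id: 2, name: "Ace", properties: [
    { type: "A", value: 257 }, { type: "B", value: 200 }, { type: "B", value: 301 },
  ] },
  { _id: 3, name: "Bo", properties: [
    { type: "A", value: 257 },
  ] },
  { _id: 4, name: "Sue", properties: [
    { type: "A", value: 257 }, { type: "A", value: 240 }, { type: "B", value: 301 },
  ] },
];

// matchesLastTwo is a hypothetical helper name used only for this illustration.
function matchesLastTwo(doc) {
  // slice(-2) yields the last two elements, or the single element of a short array.
  return doc.properties
    .slice(-2)
    .some((p) => p.type === "A" && p.value > 200);
}

const matched = people.filter(matchesLastTwo).map((d) => d.name);
console.log(matched);
```

Johnny does not match because his second-last element has a value of exactly 200, and Ace does not match because neither of his last two elements is of type "A".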
Have you considered using a $where clause? It is not the most efficient, but I think it should get you what you want. For instance, if you wanted every document in which either of the last two properties elements has a value field greater than 200, you could try:
db.collection.find({properties:{$exists:true},
$where: "(this.properties[this.properties.length-1].value > 200) || " +
"(this.properties[this.properties.length-2].value > 200)"});
This needs some work for edge cases (array < 2 members for example) and more complex queries (by the "type" field too) but should get you started.
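One way to cover the short-array edge case is to write the condition as a function instead of a string, since $where also accepts a JavaScript function with `this` bound to the document. The following is a minimal sketch under that assumption; lastTwoValueOver200 is a hypothetical name, and it requires server-side JavaScript to be enabled.

```javascript
// Predicate intended to be passed as the $where function;
// MongoDB binds `this` to the document being tested.
function lastTwoValueOver200() {
  var props = Array.isArray(this.properties) ? this.properties : [];
  // slice(-2) returns fewer than two elements for short arrays instead of failing.
  return props.slice(-2).some(function (p) {
    return p != null && p.value > 200;
  });
}

// Hypothetical usage in the shell:
//   db.collection.find({ $where: lastTwoValueOver200 })

// Local check outside MongoDB, binding `this` by hand:
console.log(lastTwoValueOver200.call({ properties: [{ value: 301 }] }));
console.log(lastTwoValueOver200.call({ name: "no array" }));
```

Adding the "type" condition would only mean extending the inner test, for example with p.type === "A".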

sort array in query and project all fields

I would like to sort a nested array at query time while also projecting all fields in the document.
Example document:
{ "_id" : 0, "unknown_field" : "foo", "array_to_sort" : [ { "a" : 3, "b" : 4 }, { "a" : 3, "b" : 3 }, { "a" : 1, "b" : 0 } ] }
I can perform the sorting with an aggregation but I cannot preserve all the fields I need. The application does not know at query time what other fields may appear in each document, so I am not able to explicitly project them. If I had a wildcard to project all fields then this would work:
db.c.aggregate([
{$unwind: "$array_to_sort"},
{$sort: {"array_to_sort.b":1, "array_to_sort.a": 1}},
{$group: {_id:"$_id", array_to_sort: {$push:"$array_to_sort"}}}
]);
...but unfortunately, it produces a result that does not contain the "unknown_field":
{
"_id" : 0,
"array_to_sort" : [
{
"a" : 1,
"b" : 0
},
{
"a" : 3,
"b" : 3
},
{
"a" : 3,
"b" : 4
}
]
}
Here is the insert command in case you would like to experiment:
db.c.insert({"unknown_field": "foo", "array_to_sort": [{"a": 3, "b": 4}, {"a": 3, "b":3}, {"a": 1, "b":0}]})
I cannot pre-sort the array because the sort criteria is dynamic. I may be sorting by any combination of a and/or b ascending/descending at query time. I realize I may need to do this in my client application, but it would be sweet if I could do it in mongo because then I could also $slice/skip/limit the results for paging instead of retrieving the entire array every time.
Since you are grouping on the document _id you can simply place the fields you wish to keep within the grouping _id. Then you can re-form the document using $project:
db.c.aggregate([
{ "$unwind": "$array_to_sort"},
{ "$sort": {"array_to_sort.b":1, "array_to_sort.a": 1}},
{ "$group": {
"_id": {
"_id": "$_id",
"unknown_field": "$unknown_field"
},
"Oarray_to_sort": { "$push":"$array_to_sort"}
}},
{ "$project": {
"_id": "$_id._id",
"unknown_field": "$_id.unknown_field",
"array_to_sort": "$Oarray_to_sort"
}}
]);
The other "trick" in there is using a temporary name for the array in the grouping stage. This is so when you $project and change the name, you get the fields in the order specified in the projection statement. If you did not, then the "array_to_sort" field would not be the last field in the order, as it is copied from the prior stage.
That is an intended optimization in $project, but if you want the order then you can do it as above.
For completely unknown structures there is the mapReduce way of doing things:
db.c.mapReduce(
function () {
this["array_to_sort"].sort(function(a,b) {
return a.a - b.a || a.b - b.b;
});
emit( this._id, this );
},
function(){},
{ "out": { "inline": 1 } }
)
Of course that has an output format that is specific to mapReduce and is therefore not exactly the document you had, but all the fields are contained under "value":
{
"results" : [
{
"_id" : 0,
"value" : {
"_id" : 0,
"some_field" : "a",
"array_to_sort" : [
{
"a" : 1,
"b" : 0
},
{
"a" : 3,
"b" : 3
},
{
"a" : 3,
"b" : 4
}
]
}
}
],
}
Future releases (as of writing) allow you to use a $$ROOT variable in aggregate to represent the document:
db.c.aggregate([
{ "$project": {
"_id": "$$ROOT",
"array_to_sort": "$array_to_sort"
}},
{ "$unwind": "$array_to_sort"},
{ "$sort": {"array_to_sort.b":1, "array_to_sort.a": 1}},
{ "$group": {
"_id": "$_id",
"array_to_sort": { "$push":"$array_to_sort"}
}}
]);
So there is no point in using a final $project stage there, as you do not actually know the other fields in the document. But they will all be contained (including the original array and its order) within the _id field of the result document.
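For comparison, the client-side fallback mentioned in the question can be sketched in plain JavaScript. This is an assumption about how an application might do it, not a MongoDB feature: sortEmbeddedArray is a hypothetical helper that takes a sort spec shaped like a $sort document and optional skip/limit values for paging, and it keeps every other field without having to name it.

```javascript
// sortEmbeddedArray is a hypothetical helper used only for this illustration.
// spec has the same shape as a $sort document, e.g. { b: 1, a: 1 }.
function sortEmbeddedArray(doc, field, spec, skip = 0, limit = Infinity) {
  const keys = Object.entries(spec);
  // Copy before sorting so the input document is left untouched.
  const sorted = [...doc[field]].sort((x, y) => {
    for (const [key, direction] of keys) {
      if (x[key] < y[key]) return -direction;
      if (x[key] > y[key]) return direction;
    }
    return 0;
  });
  // The spread keeps all other fields, known or unknown.
  return { ...doc, [field]: sorted.slice(skip, skip + limit) };
}

const doc = {
  _id: 0,
  unknown_field: "foo",
  array_to_sort: [{ a: 3, b: 4 }, { a: 3, b: 3 }, { a: 1, b: 0 }],
};

console.log(JSON.stringify(sortEmbeddedArray(doc, "array_to_sort", { b: 1, a: 1 })));
// Second "page" of size 1 when sorting by b descending:
console.log(JSON.stringify(sortEmbeddedArray(doc, "array_to_sort", { b: -1 }, 1, 1)));
```

The trade-off is the one the question already names: the whole array still has to be retrieved before it can be sorted and sliced.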