MongoDB aggregation, find number of distinct values in documents' arrays

Reading the docs, I see you can get the number of elements in document arrays. For example given the following documents:
{ "_id" : 1, "item" : "ABC1", "description" : "product 1", colors: [ "blue", "black", "red" ] }
{ "_id" : 2, "item" : "ABC2", "description" : "product 2", colors: [ "purple" ] }
{ "_id" : 3, "item" : "XYZ1", "description" : "product 3", colors: [ ] }
and the following query:
db.inventory.aggregate([{$project: {item: 1, numberOfColors: { $size: "$colors" }}}])
We would get the number of elements in each document's colors array:
{ "_id" : 1, "item" : "ABC1", "numberOfColors" : 3 }
{ "_id" : 2, "item" : "ABC2", "numberOfColors" : 1 }
{ "_id" : 3, "item" : "XYZ1", "numberOfColors" : 0 }
I've not been able to figure out if and how you could sum up all the colors in all the documents directly from a query, ie:
{ "totalColors": 4 }

You can use the following query to get the count of distinct colors across all docs:
db.inventory.aggregate([
{ $unwind: '$colors' } , // expands nested array so we have one doc per each array value
{ $group: {_id: null, allColors: {$addToSet: "$colors"} } } , // find all colors
{ $project: { totalColors: {$size: "$allColors"}}} // find count of all colors
])

Infinitely better is to simply $sum the $size:
db.inventory.aggregate([
{ "$group": { "_id": null, "totalColors": { "$sum": { "$size": "$colors" } } } }
])
If you wanted "distinct in each document" then you would instead:
db.inventory.aggregate([
{ "$group": {
"_id": null,
"totalColors": {
"$sum": {
"$size": { "$setUnion": [ [], "$colors" ] }
}
}
}}
])
Where $setUnion takes values like ["purple","blue","purple"] and makes it into ["purple","blue"] as a "set" with "distinct items".
And if you really want "distinct across documents" then don't accumulate the "distinct" into a single document. That causes performance issues and simply does not scale to large data sets, and can possibly break the 16MB BSON Limit. Instead accumulate naturally via the key:
db.inventory.aggregate([
{ "$unwind": "$colors" },
{ "$group": { "_id": "$colors" } },
{ "$group": { "_id": null, "totalColors": { "$sum": 1 } } }
])
Where you only use $unwind because you want "distinct" values from the array as combined with other documents. Generally $unwind should be avoided unless the value contained in the array is being accessed in the "grouping key" _id of $group. Where it is not, it is better to treat arrays using other operators instead, since $unwind creates a "copy" of the whole document per array element.
And of course there was also nothing wrong with simply using .distinct() here, which will return the "distinct" values as an array, whose length property you can then read in code:
var totalSize = db.inventory.distinct("colors").length;
Which for the simple operation you are asking, would be the overall fastest approach for a simple "count of distinct elements". Of course the limitation remains that the result cannot exceed the 16MB BSON limit as a payload. Which is where you defer to .aggregate() instead.
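To make the difference between the three counts concrete, here is a minimal plain-JavaScript sketch that mirrors each pipeline in memory rather than on the server. The function names and the repeated "purple" in the second document are assumptions added for illustration; they are not part of the original data.

```javascript
// Hypothetical sample: the second document repeats "purple" so the counts differ.
const inventory = [
  { _id: 1, colors: ["blue", "black", "red"] },
  { _id: 2, colors: ["purple", "blue", "purple"] },
  { _id: 3, colors: [] }
];

// Mirrors { $sum: { $size: "$colors" } }: every array element counts.
function totalElements(docs) {
  return docs.reduce((sum, doc) => sum + doc.colors.length, 0);
}

// Mirrors $size over $setUnion: duplicates are removed within each document only.
function distinctPerDocument(docs) {
  return docs.reduce((sum, doc) => sum + new Set(doc.colors).size, 0);
}

// Mirrors $unwind followed by $group on the color: duplicates are removed
// across all documents.
function distinctAcrossDocuments(docs) {
  return new Set(docs.flatMap((doc) => doc.colors)).size;
}

console.log(
  totalElements(inventory),          // 6
  distinctPerDocument(inventory),    // 5
  distinctAcrossDocuments(inventory) // 4
);
```

With the original three documents all three counts happen to be 4, which is why the choice of pipeline only becomes visible once values repeat.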

Related

Mongodb - group by same value in different fields in different documents

I have documents with common values in different fields that I want to group by that value. Simplified records are:
{ _id:1,
"Home" : "A",
"Away" : "B" }
{ _id:2,
"Home" : "B",
"Away" : "C" }
{ _id:3,
"Home" : "C",
"Away" : "A" }
{ _id:4,
"Home" : "C",
"Away" : "B" }
{ _id:5,
"Home" : "A",
"Away" : "C" }
I am trying to get an aggregate group result that includes, for example, the value "A" whether it appears in a document in the field "Home", or the field "Away". The result I want is:
{"_id": "A", "count": 3},
{"_id": "B", "count": 3},
{"_id": "C", "count": 4}
Grouping by either "Home" or "Away" alone is no problem, but that doesn't cover all the records. With the group below, for example, I wouldn't get a count of records where "A" or "B" or "C" was in the "Home" field:
{$group:
{_id: "$Away"} etc... }
I have tried using $cond from other posts here as follows:
$group : {
_id : {
$cond : [{
$gt : [ "$Away", null]
}, "$Home"]
}
}
Also tried an $or which is pretty obviously wrong since it will only find the same value for Away and Home fields within each document (which is never the case):
$group : {
_id : {
$or : [ "$Away", "$Home"]
}
}
I'm stuck and not sure if this is even possible; to group on a value that may be in different fields in different documents.
You can build an object, convert it with $objectToArray, $unwind the result, and then group, like this:
Create an object with $set that holds the same values ($Home and $Away).
Use $project so the original fields are not passed to the next stage. They are no longer necessary, since the object holds their values.
Then apply $objectToArray and $unwind to get one document per value.
Finally, $group by the v property generated by $objectToArray.
db.collection.aggregate([
{
"$set": {
"obj": {
"Home": "$Home",
"Away": "$Away"
}
}
},
{
"$project": {"Away": 0,"Home": 0}
},
{
"$set": {"obj": {"$objectToArray": "$obj"}}
},
{
"$unwind": "$obj"
},
{
"$group": {
"_id": "$obj.v",
"count": {"$sum": 1}
}
}
])
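As a rough cross-check of what the pipeline above computes, the same tally can be sketched in plain JavaScript over the five sample documents. The helper name countAppearances is hypothetical and the code runs in memory, not on the server.

```javascript
// The five sample documents from the question.
const games = [
  { _id: 1, Home: "A", Away: "B" },
  { _id: 2, Home: "B", Away: "C" },
  { _id: 3, Home: "C", Away: "A" },
  { _id: 4, Home: "C", Away: "B" },
  { _id: 5, Home: "A", Away: "C" }
];

// Hypothetical helper: visit one value per listed field (like
// $objectToArray + $unwind), then tally by value (like $group with $sum: 1).
function countAppearances(docs, fields) {
  const counts = new Map();
  for (const doc of docs) {
    for (const field of fields) {
      const value = doc[field];
      counts.set(value, (counts.get(value) || 0) + 1);
    }
  }
  return [...counts].map(([_id, count]) => ({ _id, count }));
}

const tally = countAppearances(games, ["Home", "Away"]);
console.log(tally);
```

For this data the tally is 3 for "A", 3 for "B" and 4 for "C", matching the result the question asks for.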

Database sort on numeric values that are actually strings

I have the query below, in which field_a is a String property and field_b is an array of type Number. I want an array of the unique combinations of field_a and field_b. Here field_a contains numeric values, but in string format, so I want to apply a natural sort in the aggregation pipeline. As far as I can tell, $natural can only be used with a query such as db.collection.find().sort( { $natural: 1 } )
So how can I use a natural sort in MongoDB, or do I need to depend on JS functions or on lodash/underscore.js?
db.collection.aggregate([
{"$group": { "_id": { field_a: "$field_a", field_b: "$field_b" } } },
{ $project: { a: "$_id" } },
{"$group": {"_id": 'a', "res": {"$addToSet": "$_id" }}},
{"$unwind": "$res"},
{"$sort": { "res": 1}},
{"$group": { "_id": null, "res": {"$push": "$res" }}}
])
In short, this is what you want to do here:
db.collection.aggregate([
{ "$group": {
"_id": {
"field_a": {
"$concat": [
{ "$substrCP": [
"0000000000",
0,
{ "$subtract": [ 10, { "$strLenCP": "$field_a" } ] }
]},
"$field_a"
]
},
"field_b": "$field_b"
}
}},
{ "$sort": { "_id": 1 } }
])
Explanation
As a basic concept, the problem you have is that "strings" sort in a way that does not translate to how numeric sorts work.
As a brief example, these documents use a string value:
{ "_id" : ObjectId("5928276f84c3559bc2fd458b"), "a" : "5" }
{ "_id" : ObjectId("5928277484c3559bc2fd458c"), "a" : "50" }
{ "_id" : ObjectId("5928277e84c3559bc2fd458d"), "a" : "60" }
{ "_id" : ObjectId("5928278284c3559bc2fd458e"), "a" : "6" }
If you try to sort these, then the lexical order applies:
> db.list.find().sort({ "a": 1 })
{ "_id" : ObjectId("5928276f84c3559bc2fd458b"), "a" : "5" }
{ "_id" : ObjectId("5928277484c3559bc2fd458c"), "a" : "50" }
{ "_id" : ObjectId("5928278284c3559bc2fd458e"), "a" : "6" }
{ "_id" : ObjectId("5928277e84c3559bc2fd458d"), "a" : "60" }
As strings, this makes sense, since "50" begins with "5" and is therefore less than "6".
Since the aggregation framework could not "cast" these as numeric values at the time of writing (MongoDB 4.0 later added conversion operators such as $toInt and $convert), the option shown here is to present the "strings" in a way in which they will order lexically in the same way as they would with numeric values.
In brief terms we "zero pad" them, which is essentially making the values fixed length strings which are prefixed or "padded" with 0 which makes the "strings" appear to be ordered like they would be numerically:
db.list.aggregate([
{ "$project": {
"a": {
"$concat": [
{ "$substrCP": [
"0000000000",
0,
{ "$subtract": [ 10, { "$strLenCP": "$a" } ] }
]},
"$a"
]
}
}},
{ "$sort": { "a": 1 } }
])
And this will produce a list in order like:
{ "_id" : ObjectId("5928276f84c3559bc2fd458b"), "a" : "0000000005" }
{ "_id" : ObjectId("5928278284c3559bc2fd458e"), "a" : "0000000006" }
{ "_id" : ObjectId("5928277484c3559bc2fd458c"), "a" : "0000000050" }
{ "_id" : ObjectId("5928277e84c3559bc2fd458d"), "a" : "0000000060" }
The basic premise here is that you take a "template" string, which is in this example a string of 0 which is 10 characters long. Then we look at the length of the field data to transform using $strLenCP and $subtract that length from 10 which is the length of the template string used here.
The difference in length is fed to the $substrCP operator as the number of characters to take from the template. This output is then fed to $concat in order to make a "string" which like the template is 10 characters long, but starts with zeros and ends in the initial numeric string.
In your actual usage, this will end up as part of a composite key. Yet since the transformed key is first in the key order, then simply sorting by _id considers this and primarily sorts by values in the first key and then the second.
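The padding logic can be tried out in plain JavaScript. This is only a sketch of the same idea, not the pipeline itself: zeroPad is a hypothetical helper that mirrors the $substrCP and $concat steps, and the Math.max guard for values longer than the template is an addition that is not in the answer.

```javascript
const TEMPLATE = "0000000000"; // ten zeros, as in the answer

// Mirrors $concat of [ $substrCP(template, 0, 10 - length), value ].
function zeroPad(value) {
  // Guard (an assumption): never ask for a negative number of zeros.
  const padLength = Math.max(0, TEMPLATE.length - value.length);
  return TEMPLATE.substring(0, padLength) + value;
}

const values = ["5", "50", "60", "6"];

// Plain string sort: lexical order, "50" lands before "6".
const lexical = [...values].sort();

// Sorting the padded strings reproduces numeric order.
const padded = values.map(zeroPad).sort();

console.log(lexical); // [ '5', '50', '6', '60' ]
console.log(padded);  // [ '0000000005', '0000000006', '0000000050', '0000000060' ]
```

The approach only holds while every value fits within the template width, so the template should be at least as long as the longest expected string.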

MongoDB: find documents with a given array of subdocuments

I want to find documents which contain given subdocuments, let's say I have the following documents in my commits collection:
// Document 1
{
"commit": 1,
"authors" : [
{"name" : "Joe", "lastname" : "Doe"},
{"name" : "Joe", "lastname" : "Doe"}
]
}
// Document 2
{
"commit": 2,
"authors" : [
{"name" : "Joe", "lastname" : "Doe"},
{"name" : "John", "lastname" : "Smith"}
]
}
// Document 3
{
"commit": 3,
"authors" : [
{"name" : "Joe", "lastname" : "Doe"}
]
}
All I want from the above collection is the 1st document, since I know I'm looking for a commit with 2 authors where both have the same name and lastname. So I came up with the query:
db.commits.find({
$and: [{'authors': {$elemMatch: {'name': 'Joe',
'lastname': 'Doe'}}},
{'authors': {$elemMatch: {'name': 'Joe',
'lastname': 'Doe'}}}],
'authors': { $size: 2 }
})
$size is used to filter out 3rd document, but the query still returns 2nd document since both $elemMatch return True.
I can't use index on subdocuments, since the order of authors used for search is random. Is there a way to remove 2nd document from results without using Mongo's aggregate function?
What you are asking for here is a little different from a standard query. In fact you are asking for where the "name" and "lastname" is found in that combination in your array two times or more to identify that document.
Standard query arguments do not match "how many times" an array element is matched within a result. But of course you can ask the server to "count" that for you using the aggregation framework:
db.collection.aggregate([
// Match possible documents to reduce the pipeline
{ "$match": {
"authors": { "$elemMatch": { "name": "Joe", "lastname": "Doe" } }
}},
// Unwind the array elements for processing
{ "$unwind": "$authors" },
// Group back and "count" the matching elements
{ "$group": {
"_id": "$_id",
"commit": { "$first": "$commit" },
"authors": { "$push": "$authors" },
"count": { "$sum": {
"$cond": [
{ "$and": [
{ "$eq": [ "$authors.name", "Joe" ] },
{ "$eq": [ "$authors.lastname", "Doe" ] }
]},
1,
0
]
}}
}},
// Filter out anything that didn't match at least twice
{ "$match": { "count": { "$gte": 2 } } }
])
So essentially you put your conditions to match inside the $cond operator, which returns 1 where matched and 0 where not, and this is passed to $sum to get a total for the document.
Then filter out any documents that did not match 2 or more times.
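The counting idea can be illustrated in plain JavaScript against the three sample commits. This is an in-memory sketch only; countMatchingAuthors is a hypothetical helper that plays the role of the $cond inside $sum.

```javascript
// The three sample documents from the question.
const commits = [
  { commit: 1, authors: [{ name: "Joe", lastname: "Doe" }, { name: "Joe", lastname: "Doe" }] },
  { commit: 2, authors: [{ name: "Joe", lastname: "Doe" }, { name: "John", lastname: "Smith" }] },
  { commit: 3, authors: [{ name: "Joe", lastname: "Doe" }] }
];

// Adds 1 for each element where both fields match and 0 otherwise,
// like { $sum: { $cond: [ <both $eq conditions>, 1, 0 ] } }.
function countMatchingAuthors(doc, name, lastname) {
  return doc.authors.reduce(
    (count, author) =>
      count + (author.name === name && author.lastname === lastname ? 1 : 0),
    0
  );
}

// Equivalent of the final { $match: { count: { $gte: 2 } } }.
const matchedCommits = commits
  .filter((doc) => countMatchingAuthors(doc, "Joe", "Doe") >= 2)
  .map((doc) => doc.commit);

console.log(matchedCommits); // [ 1 ]
```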

Documents in MongoDB where last n sub-array elements contain a value

Consider this set of data in MongoDB...
{
_id: 1,
name: "Johnny",
properties: [
{
type: "A",
value: 257,
date: "4/1/2014"
},
{
type: "A",
value: 200,
date: "4/2/2014"
},
{
type: "B",
value: 301,
date: "4/3/2014"
},
...]
}
What is the proper way to query for the documents in which one (or more) of the last two "properties" elements has a value > x, or one (or more) of the last two "properties" elements of type "A" has a value > x?
If you can stomach modifying your insertion method try as follows;
Change your updates to push the following:
doc = { type : "A", "value" : 123, "date" : new Date() }
db.foo.update( {_id:1}, { "$push" : { "properties" : { "$each" : [ doc ], "$sort" : { date : -1} } } } )
This will give you an array of documents sorted in descending order by time, making the "most recent" document first.
You can now use the standard MongoDB dot notation to query against the 0, 1, etc elements of your properties array, which represent the most recent additions logically.
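Assuming the array is kept sorted newest-first as described, a filter for "any of the n most recent elements has a value above x" could be built as sketched below. buildRecentValueFilter is a hypothetical helper and the threshold of 200 is only an example; the code just constructs the filter object and does not talk to a server.

```javascript
// Hypothetical helper: builds a filter matching documents where any of the
// first `n` elements of `properties` has a `value` greater than `threshold`.
function buildRecentValueFilter(n, threshold) {
  const clauses = [];
  for (let i = 0; i < n; i++) {
    // Dot notation with an array index, e.g. "properties.0.value".
    clauses.push({ ["properties." + i + ".value"]: { $gt: threshold } });
  }
  return { $or: clauses };
}

const recentFilter = buildRecentValueFilter(2, 200);
console.log(JSON.stringify(recentFilter));
```

In the shell, the resulting object would then be passed along the lines of db.foo.find(recentFilter).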
As per the comments, the aggregation framework is for a lot more than simply "aggregating" values, so you can take advantage of the various pipeline operators to do very advanced things that cannot be achieved simply using .find()
db.collection.aggregate([
// Match documents that "could" meet the conditions to narrow down
{ "$match": {
"properties": { "$elemMatch": {
"type": "A", "value": { "$gt": 200 }
}}
}},
// Keep a copy of the document for later with an array copy
{ "$project": {
"_id": {
"_id": "$_id",
"name": "$name",
"properties": "$properties"
},
"properties": 1
}},
// Unwind the array to "de-normalize"
{ "$unwind": "$properties" },
// Get the "last" element of the array and copy the existing one
{ "$group": {
"_id": "$_id",
"properties": { "$last": "$_id.properties" },
"last": { "$last": "$properties" },
"count": { "$sum": 1 }
}},
// Unwind the copy again
{ "$unwind": "$properties" },
// Project to mark the element you already have
{ "$project": {
"properties": 1,
"last": 1,
"count": 1,
"seen": { "$eq": [ "$properties", "$last" ] }
}},
// Match again, being careful to keep any array with one element only
// This gets rid of the element you already kept
{ "$match": {
"$or": [
{ "seen": false },
{ "seen": true, "count": 1 }
]
}},
// Group to get the second last element as "next"
{ "$group": {
"_id": "$_id",
"last": { "$last": "$last" },
"next": { "$last": "$properties" }
}},
// Then match to see if either of those elements fits
{ "$match": {
"$or": [
{ "last.type": "A", "last.value": { "$gt": 200 } },
{ "next.type": "A", "next.value": { "$gt": 200 } }
]
}},
// Finally restore your matching documents
{ "$project": {
"_id": "$_id._id",
"name": "$_id.name",
"properties": "$_id.properties"
}}
])
Running through that in a bit more detail:
The first $match usage is to make sure you are only working on documents that can "possibly" match your extended conditions. Always a good idea to optimize like this.
The next stage is to $project since you likely want to keep the original document detail and you are at least going to need the array again in order to get the second last element.
The next stages make use of $unwind in order to break the array into individual documents which is then followed by $group which is used to find the last item on the document _id boundary. This is actually the last item in the array. Plus you keep a count of the array elements.
So then after using $unwind again on the original array content, the usage of $project again adds a "seen" field to the document indicating via the use of the $eq operator whether or not the document from the original is actually the one that was previously kept as the "last" element.
After that stage you again issue a $match in order to filter that last document from the result, but also making sure in the condition that you are not removing anything that originally matched where the array length is actually 1.
From here you want to $group again in order to get the "second last" element from the array (or indeed the same "last" element where there was only one).
The final steps are simply to $match where either of those last two elements meets the conditions, and then finally $project the document in its original form.
So while that is fairly involved and of course increases in complexity by the number of items you want to test at the end of the array it can be done, and shows how aggregate is very suited to the problem.
Where possible it is the best approach as invoking the JavaScript interpreter will convey an overhead compared to the native code used by aggregate.
Using mapReduce would remove the code complexity for taking the last two possible elements (or more) but it will invoke the JavaScript interpreter by nature and will therefore run much more slowly.
For the record, since the sample in the question would not be a match, here is some data that will match the last two documents, one of which only has one element in the array:
{
"_id" : 1,
"name" : "Johnny",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "A",
"value" : 200,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}
{
"_id" : 2,
"name" : "Ace",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "B",
"value" : 200,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}
{
"_id" : 3,
"name" : "Bo",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
}
]
}
{
"_id" : 4,
"name" : "Sue",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "A",
"value" : 240,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}
Have you considered using a $where clause? It is not the most efficient, but I think it should get you what you want. For instance, if you wanted every document in which either of the last two properties elements has a value field greater than 200, you could try:
db.collection.find({properties:{$exists:true},
$where: "(this.properties[this.properties.length-1].value > 200) || " +
"(this.properties[this.properties.length-2].value > 200)"});
This needs some work for edge cases (array < 2 members for example) and more complex queries (by the "type" field too) but should get you started.
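One way to cover those edge cases is sketched below as a plain JavaScript predicate, checked against a condensed copy of the four sample documents (dates omitted). lastElementsMatch is a hypothetical name, and whether to run such logic inside $where or in application code is left open here.

```javascript
// Condensed copy of the four sample documents shown earlier.
const people = [
  { _id: 1, name: "Johnny", properties: [
    { type: "A", value: 257 }, { type: "A", value: 200 }, { type: "B", value: 301 } ] },
  { _id: 2, name: "Ace", properties: [
    { type: "A", value: 257 }, { type: "B", value: 200 }, { type: "B", value: 301 } ] },
  { _id: 3, name: "Bo", properties: [
    { type: "A", value: 257 } ] },
  { _id: 4, name: "Sue", properties: [
    { type: "A", value: 257 }, { type: "A", value: 240 }, { type: "B", value: 301 } ] }
];

// Hypothetical predicate: does any of the last `n` elements have the given
// type and a value above `min`? slice(-n) copes with arrays shorter than n,
// and a missing or non-array field is treated as empty.
function lastElementsMatch(doc, n, type, min) {
  const properties = Array.isArray(doc.properties) ? doc.properties : [];
  return properties.slice(-n).some((p) => p.type === type && p.value > min);
}

const matchingIds = people
  .filter((doc) => lastElementsMatch(doc, 2, "A", 200))
  .map((doc) => doc._id);

console.log(matchingIds); // [ 3, 4 ]
```

Only the last two sample documents pass, including the one whose array holds a single element.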

sort array in query and project all fields

I would like to sort a nested array at query time while also projecting all fields in the document.
Example document:
{ "_id" : 0, "unknown_field" : "foo", "array_to_sort" : [ { "a" : 3, "b" : 4 }, { "a" : 3, "b" : 3 }, { "a" : 1, "b" : 0 } ] }
I can perform the sorting with an aggregation but I cannot preserve all the fields I need. The application does not know at query time what other fields may appear in each document, so I am not able to explicitly project them. If I had a wildcard to project all fields then this would work:
db.c.aggregate([
{$unwind: "$array_to_sort"},
{$sort: {"array_to_sort.b":1, "array_to_sort.a": 1}},
{$group: {_id:"$_id", array_to_sort: {$push:"$array_to_sort"}}}
]);
...but unfortunately, it produces a result that does not contain the "unknown_field":
{
"_id" : 0,
"array_to_sort" : [
{
"a" : 1,
"b" : 0
},
{
"a" : 3,
"b" : 3
},
{
"a" : 3,
"b" : 4
}
]
}
Here is the insert command in case you would like to experiment:
db.c.insert({"unknown_field": "foo", "array_to_sort": [{"a": 3, "b": 4}, {"a": 3, "b":3}, {"a": 1, "b":0}]})
I cannot pre-sort the array because the sort criteria is dynamic. I may be sorting by any combination of a and/or b ascending/descending at query time. I realize I may need to do this in my client application, but it would be sweet if I could do it in mongo because then I could also $slice/skip/limit the results for paging instead of retrieving the entire array every time.
Since you are grouping on the document _id you can simply place the fields you wish to keep within the grouping _id. Then you can re-form using $project
db.c.aggregate([
{ "$unwind": "$array_to_sort"},
{ "$sort": {"array_to_sort.b":1, "array_to_sort.a": 1}},
{ "$group": {
"_id": {
"_id": "$_id",
"unknown_field": "$unknown_field"
},
"Oarray_to_sort": { "$push":"$array_to_sort"}
}},
{ "$project": {
"_id": "$_id._id",
"unknown_field": "$_id.unknown_field",
"array_to_sort": "$Oarray_to_sort"
}}
]);
The other "trick" in there is using a temporary name for the array in the grouping stage. This is so when you $project and change the name, you get the fields in the order specified in the projection statement. If you did not, then the "array_to_sort" field would not be the last field in the order, as it is copied from the prior stage.
That is an intended optimization in $project, but if you want the order then you can do it as above.
For completely unknown structures there is the mapReduce way of doing things:
db.c.mapReduce(
function () {
this["array_to_sort"].sort(function(a,b) {
return a.a - b.a || a.b - b.b;
});
emit( this._id, this );
},
function(){},
{ "out": { "inline": 1 } }
)
Of course that has an output format that is specific to mapReduce and therefore not exactly the document you had, but all the fields are contained under "value":
{
"results" : [
{
"_id" : 0,
"value" : {
"_id" : 0,
"some_field" : "a",
"array_to_sort" : [
{
"a" : 1,
"b" : 0
},
{
"a" : 3,
"b" : 3
},
{
"a" : 3,
"b" : 4
}
]
}
}
],
}
Since MongoDB 2.6 (a future release at the time of writing), you can use the $$ROOT variable in aggregate to represent the whole document:
db.c.aggregate([
{ "$project": {
"_id": "$$ROOT",
"array_to_sort": "$array_to_sort"
}},
{ "$unwind": "$array_to_sort"},
{ "$sort": {"array_to_sort.b":1, "array_to_sort.a": 1}},
{ "$group": {
"_id": "$_id",
"array_to_sort": { "$push":"$array_to_sort"}
}}
]);
So there is no point there using the final "project" stage as you do not actually know the other fields in the document. But they will all be contained (including the original array and order ) within the _id field of the result document.
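For comparison, the client-side fallback mentioned in the question can be sketched in a few lines of plain JavaScript. The helper sortNestedArray and the shape of its sortSpec argument are assumptions made for this example, not part of any MongoDB API.

```javascript
const sampleDoc = {
  _id: 0,
  unknown_field: "foo",
  array_to_sort: [{ a: 3, b: 4 }, { a: 3, b: 3 }, { a: 1, b: 0 }]
};

// Hypothetical helper: `sortSpec` is an ordered list of [field, direction]
// pairs, e.g. [["b", 1], ["a", 1]], with 1 ascending and -1 descending.
// `skip` and `limit` page through the sorted array.
function sortNestedArray(document, arrayField, sortSpec, skip = 0, limit = Infinity) {
  const sorted = [...document[arrayField]].sort((x, y) => {
    for (const [field, direction] of sortSpec) {
      if (x[field] !== y[field]) {
        return (x[field] < y[field] ? -1 : 1) * direction;
      }
    }
    return 0;
  });
  // The spread keeps every other field, known or not, untouched.
  return { ...document, [arrayField]: sorted.slice(skip, skip + limit) };
}

const sortedDoc = sortNestedArray(sampleDoc, "array_to_sort", [["b", 1], ["a", 1]]);
console.log(JSON.stringify(sortedDoc));
```

The trade-off is the one the question already names: the whole array still travels to the client before it is sorted and sliced.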