Documents in MongoDB where last n sub-array elements contain a value

Consider this set of data in MongoDB...
{
    _id: 1,
    name: "Johnny",
    properties: [
        {
            type: "A",
            value: 257,
            date: "4/1/2014"
        },
        {
            type: "A",
            value: 200,
            date: "4/2/2014"
        },
        {
            type: "B",
            value: 301,
            date: "4/3/2014"
        },
    ...]
}
What is the proper way to query for the documents in which one (or more) of the last two "properties" elements has a value > x, or one (or more) of the last two "properties" elements of type "A" has a value > x?

If you can stomach modifying your insertion method, try the following.
Change your updates to push the following:
doc = { type : "A", "value" : 123, "date" : new Date() }
db.foo.update( {_id:1}, { "$push" : { "properties" : { "$each" : [ doc ], "$sort" : { date : -1} } } } )
This will give you an array of documents sorted in descending order by time, making the "most recent" document first.
You can now use standard MongoDB dot notation to query against the 0, 1, etc. elements of your properties array, which logically represent the most recent additions.
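For example, a minimal sketch of such a query, assuming the type "A" variant of the question and a threshold of 200 (both illustrative values). Since the array is kept newest-first, positions 0 and 1 are always the two most recent entries:
db.foo.find({
    "$or": [
        { "properties.0.type": "A", "properties.0.value": { "$gt": 200 } },
        { "properties.1.type": "A", "properties.1.value": { "$gt": 200 } }
    ]
})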

As per the comments, the aggregation framework is for a lot more than simply "aggregating" values, so you can take advantage of the various pipeline operators to do very advanced things that cannot be achieved using .find() alone:
db.collection.aggregate([
    // Match documents that "could" meet the conditions to narrow down
    { "$match": {
        "properties": { "$elemMatch": {
            "type": "A", "value": { "$gt": 200 }
        }}
    }},
    // Keep a copy of the document for later with an array copy
    { "$project": {
        "_id": {
            "_id": "$_id",
            "name": "$name",
            "properties": "$properties"
        },
        "properties": 1
    }},
    // Unwind the array to "de-normalize"
    { "$unwind": "$properties" },
    // Get the "last" element of the array and copy the existing one
    { "$group": {
        "_id": "$_id",
        "properties": { "$last": "$_id.properties" },
        "last": { "$last": "$properties" },
        "count": { "$sum": 1 }
    }},
    // Unwind the copy again
    { "$unwind": "$properties" },
    // Project to mark the element you already have
    { "$project": {
        "properties": 1,
        "last": 1,
        "count": 1,
        "seen": { "$eq": [ "$properties", "$last" ] }
    }},
    // Match again, being careful to keep any array with one element only
    // This gets rid of the element you already kept
    { "$match": {
        "$or": [
            { "seen": false },
            { "seen": true, "count": 1 }
        ]
    }},
    // Group to get the second last element as "next"
    { "$group": {
        "_id": "$_id",
        "last": { "$last": "$last" },
        "next": { "$last": "$properties" }
    }},
    // Then match to see if either of those elements fits
    { "$match": {
        "$or": [
            { "last.type": "A", "last.value": { "$gt": 200 } },
            { "next.type": "A", "next.value": { "$gt": 200 } }
        ]
    }},
    // Finally restore your matching documents
    { "$project": {
        "_id": "$_id._id",
        "name": "$_id.name",
        "properties": "$_id.properties"
    }}
])
Running through that in a bit more detail:
The first $match usage is to make sure you are only working on documents that can "possibly" match your extended conditions. Always a good idea to optimize like this.
The next stage is to $project since you likely want to keep the original document detail and you are at least going to need the array again in order to get the second last element.
The next stages make use of $unwind in order to break the array into individual documents which is then followed by $group which is used to find the last item on the document _id boundary. This is actually the last item in the array. Plus you keep a count of the array elements.
Then, after using $unwind again on the original array content, another $project adds a "seen" field to the document, indicating via the $eq operator whether or not the element from the original array is actually the one that was previously kept as the "last" element.
After that stage you again issue a $match in order to filter that last document from the result, but also making sure in the condition that you are not removing anything that originally matched where the array length is actually 1.
From here you want to $group again in order to get the "second last" element from the array (or indeed the same "last" element where there was only one).
The final steps are simply to $match where either of those last two elements meets the conditions, and then finally $project the document in its original form.
So while that is fairly involved, and of course grows in complexity with the number of items you want to test at the end of the array, it can be done, and it shows how well suited the aggregation framework is to the problem.
Where possible this is the best approach, as invoking the JavaScript interpreter carries an overhead compared to the native code used by the aggregation framework.
Using mapReduce would remove the code complexity of taking the last two (or more) elements, but it invokes the JavaScript interpreter by nature and will therefore run much more slowly.
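A minimal mapReduce sketch of that idea, assuming the same collection and a threshold of 200 (an illustrative value). The reduce function is never invoked because every _id key is unique:
db.collection.mapReduce(
    function () {
        if (!this.properties || !this.properties.length) return;   // guard missing/empty arrays
        var last = this.properties.slice(-2);                      // last one or two elements
        var matched = last.some(function (p) {
            return p.type === "A" && p.value > 200;
        });
        if (matched) emit(this._id, this);
    },
    function () {},                  // never called; each _id key is emitted at most once
    { "out": { "inline": 1 } }
)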
For the record, since the sample in the question would not be a match, here is some sample data in which the last two documents will match, one of which has only one element in the array:
{
"_id" : 1,
"name" : "Johnny",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "A",
"value" : 200,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}
{
"_id" : 2,
"name" : "Ace",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "B",
"value" : 200,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}
{
"_id" : 3,
"name" : "Bo",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
}
]
}
{
"_id" : 4,
"name" : "Sue",
"properties" : [
{
"type" : "A",
"value" : 257,
"date" : "4/1/2014"
},
{
"type" : "A",
"value" : 240,
"date" : "4/2/2014"
},
{
"type" : "B",
"value" : 301,
"date" : "4/3/2014"
}
]
}

Have you considered using a $where clause? It is not the most efficient approach, but I think it should get you what you want. For instance, if you wanted every document where either of the last two properties elements has a value greater than 200, you could try:
db.collection.find({
    properties: { $exists: true },
    $where: "(this.properties[this.properties.length-1].value > 200) || " +
            "(this.properties[this.properties.length-2].value > 200)"
});
This needs some work for edge cases (array < 2 members for example) and more complex queries (by the "type" field too) but should get you started.
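A hedged sketch of that extension, guarding against short or empty arrays and also testing the "type" field (threshold of 200 assumed), using the function form of $where:
db.collection.find({
    "properties.0": { "$exists": true },        // skip documents with a missing or empty array
    "$where": function () {
        var last = this.properties.slice(-2);   // one or two trailing elements
        return last.some(function (p) {
            return p.type === "A" && p.value > 200;
        });
    }
});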

Related

MongoDB aggregation, find number of distinct values in documents' arrays

Reading the docs, I see you can get the number of elements in document arrays. For example given the following documents:
{ "_id" : 1, "item" : "ABC1", "description" : "product 1", colors: [ "blue", "black", "red" ] }
{ "_id" : 2, "item" : "ABC2", "description" : "product 2", colors: [ "purple" ] }
{ "_id" : 3, "item" : "XYZ1", "description" : "product 3", colors: [ ] }
and the following query:
db.inventory.aggregate([{$project: {item: 1, numberOfColors: { $size: "$colors" }}}])
We would get the number of elements in each document's colors array:
{ "_id" : 1, "item" : "ABC1", "numberOfColors" : 3 }
{ "_id" : 2, "item" : "ABC2", "numberOfColors" : 1 }
{ "_id" : 3, "item" : "XYZ1", "numberOfColors" : 0 }
I've not been able to figure out if and how you could sum up all the colors across all the documents directly from a query, i.e.:
{ "totalColors": 4 }
You can use the following query to get the count of all colors in all docs:
db.inventory.aggregate([
{ $unwind: '$colors' } , // expands nested array so we have one doc per each array value
{ $group: {_id: null, allColors: {$addToSet: "$colors"} } } , // find all colors
{ $project: { totalColors: {$size: "$allColors"}}} // find count of all colors
])
Infinitely better is to simply $sum the $size:
db.inventory.aggregate([
{ "$group": { "_id": null, "totalColors": { "$sum": { "$size": "$colors" } } }
])
If you wanted "distinct in each document" then you would instead:
db.inventory.aggregate([
{ "$group": {
"_id": null,
"totalColors": {
"$sum": {
"$size": { "$setUnion": [ [], "$colors" ] }
}
}
}}
])
Where $setUnion takes values like ["purple","blue","purple"] and turns them into ["purple","blue"], a "set" with "distinct" items.
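A quick way to see that behaviour in the shell (run against any collection that has at least one document; db.inventory from above is used here purely as a convenient source of a single input document):
db.inventory.aggregate([
    { "$limit": 1 },
    { "$project": {
        "_id": 0,
        "distinct": { "$setUnion": [ [], [ "purple", "blue", "purple" ] ] }
    }}
])
// yields something like { "distinct" : [ "blue", "purple" ] } -- set member order is not guaranteed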
And if you really want "distinct across documents" then don't accumulate the "distinct" into a single document. That causes performance issues and simply does not scale to large data sets, and can possibly break the 16MB BSON Limit. Instead accumulate naturally via the key:
db.inventory.aggregate([
{ "$unwind": "$colors" },
{ "$group": { "_id": "$colors" } },
{ "$group": { "_id": null, "totalColors": { "$sum": 1 } } }
])
Here you only use $unwind because you want "distinct" values from the array combined across documents. Generally $unwind should be avoided unless the value contained in the array is being accessed in the "grouping key" _id of $group. Where it is not, it is better to handle arrays with other operators instead, since $unwind creates a "copy" of the whole document per array element.
And of course there was also nothing wrong with simply using .distinct() here, which returns the "distinct" values "as an array", for which you can just test the array's .length in code:
var totalSize = db.inventory.distinct("colors").length;
For the simple operation you are asking about, this would be the overall fastest approach for a "count of distinct elements". Of course the limitation remains that the result cannot exceed the 16MB BSON limit as a payload, which is where you defer to .aggregate() instead.

How to return distinct $or in mongodb?

So I have this query
db.collection.find({$or:[{data_id:123},{data_id:345},{data_id:443}]});
How do I tweak it to return only one result for each part of the $or?
I.E something analogous to the SQL:
SELECT DISTINCT data_id, [...] WHERE data_id='123' OR data_id='345'...
Your question needs to be considered in the context of the documents you have, as "distinct" can mean a few different things here. Consider the following sample:
{
"tripId": 123,
"thisField": "this",
"thatField": "that"
},
{
"tripId": 123,
"thisField": "other",
"thatField": "then"
},
{
"tripId": 345,
"thisField": "other",
"thatField": "then"
},
{
"tripId": 345,
"thisField": "this",
"thatField": "that"
},
{
"tripId": 123,
"thisField": "this",
"thatField": "that"
},
{
"tripId": 789,
"thisField": "this",
"thatField": "that"
}
MongoDB has the .distinct() method, which returns distinct values for a single field only, and the items come back simply as an array of those field values.
For anything else you want the .aggregate() method. This is the aggregation pipeline, which performs a number of different functions and can handle some very complex operations due to the "pipeline" nature of its processing.
Particularly here you would want to use a $group pipeline stage in order to "group" together values based on a key. That "key" is expressed in the form of an _id key in the $group statement. Much like "SELECT" in SQL with a "GROUP BY" or a "DISTINCT" modifier (which are much the same in function), you need to specify all of the fields you intend to appear in the results.
Moreover, anything that would not be specified in a "GROUP BY" portion of a statement would have to be subject to some sort of "grouping operation" in order to select which field values to present. For this there are various "Group Accumulator Operators" to act on these values:
One example here uses the $first operator:
db.collection.aggregate([
{ "$match": {
"tripId": { "$in": [ 123,345 ] }
}},
{ "$group": {
"_id": "$tripId",
"thisField": { "$first": "$thisField" },
"thatField": { "$first": "$thatField" },
"total": { "$sum": 1 }
}}
])
Gives this result:
{ "_id" : 345, "thisField" : "other", "thatField" : "then", "total" : 2 }
{ "_id" : 123, "thisField" : "this", "thatField" : "that", "total" : 3 }
So, with the addition of a $sum operator to count occurrences of the same distinct values, this picks up the "first" occurrence of the values in the fields mentioned in the accumulator expressions outside of the grouping key.
In versions of MongoDB since 2.6 you can "shortcut" naming all of the fields you want individually using the $$ROOT expression variable. This is a reference to "all" of the fields present in the document as of the state in the current stage where it is used. It's a little shorter to write, but the output is a little different due to the syntax:
db.collection.aggregate([
{ "$match": {
"tripId": { "$in": [ 123,345 ] }
}},
{ "$group": {
"_id": "$tripId",
"doc": { "$first": "$$ROOT" },
"total": { "$sum": 1 }
}}
])
Outputs as:
{
"_id" : 345,
"doc" : {
"_id" : ObjectId("54feaf3839c29b9cd470bcbe"),
"tripId" : 345,
"thisField" : "other",
"thatField" : "then"
},
"total" : 2
}
{
"_id" : 123,
"doc" : {
"_id" : ObjectId("54feaf3839c29b9cd470bcbc"),
"tripId" : 123,
"thisField" : "this",
"thatField" : "that"
},
"total" : 3
}
That is a general case with most $group aggregation operations where you specify a "key" and subject other fields present to a "grouping operator"/"accumulator" of some sort.
The other case is that, if you were looking for the "distinct" occurrences of "all" fields, you would express these as part of the "key" for the group expression like this:
db.collection.aggregate([
{ "$match": {
"tripId": { "$in": [ 123,345 ] }
}},
{ "$group": {
"_id": {
"tripId": "$tripId",
"thisField": "$thisField",
"thatField": "$thatField"
},
"total": { "$sum": 1 }
}}
])
That gives us this output:
{
"_id" : {
"tripId" : 345,
"thisField" : "this",
"thatField" : "that"
},
"total" : 1
}
{
"_id" : {
"tripId" : 345,
"thisField" : "other",
"thatField" : "then"
},
"total" : 1
}
{
"_id" : {
"tripId" : 123,
"thisField" : "other",
"thatField" : "then"
},
"total" : 1
}
{
"_id" : {
"tripId" : 123,
"thisField" : "this",
"thatField" : "that"
},
"total" : 2
}
The total result is 4 documents, considering the "distinct" values of each of the fields mentioned as part of the "composite key". It correctly accounts for the fact that most of those combinations occurred once, with the exception of the one example that occurs twice with all the same values.
Naturally the $$ROOT variable would not apply here as the "whole document" contains the "unique" _id field for each document. You can always add a $project stage beforehand to filter that field out, but the same conditions to specifying the fields required applies:
db.collection.aggregate([
{ "$match": {
"tripId": { "$in": [ 123,345 ] }
}},
{ "$project": {
"_id": 0,
"tripId": 1,
"thisField": 1,
"thatField": 1
}},
{ "$group": {
"_id": "$$ROOT",
"total": { "$sum": 1 }
}}
])
So that serves as an introduction with examples of what you can do in the form of "distinct" queries with MongoDB and specifically the aggregation framework. There are various other common SQL to Aggregation mapping examples given in the documentation.
The other general case was your usage of $or in your question. As you see in the samples here, when you want the same "or" condition over values of the same field, the more efficient way to write this in your query is with the $in operator. Rather than an array of "query documents", this takes an array of "possible values" for the common field it is examining in the expression. It is basically a $or condition, but expressed in a shorter form for this case.
Instead of $or, use an $in query; it will serve your purpose:
db.collection.find({data_id:{$in: [123, 345, 443]}});

Complex MongoDB Aggregation

I have a situation where I need to perform a group by operation based on an array value that sums up occurrences of a field value. The counts are then filtered on and the results are prepared so that they can be displayed according to the condition. Essentially, the documents are transformed back to how they would be presented if you simply used the find function. I am running into an issue of the temporary documents being too large due to the number of items collected in the matchedDocuments array. Any suggestions on how to improve this would be helpful.
db.collection1.aggregate([
{
'$unwind': '$arrayOfValues'
}, {
'$group': {
'_id': '$arrayOfValues',
'x_count': {
$sum: {
$cond: [{
$eq: ['$field.value', 'x']
},
1, 0
]
}
},
'y_count': {
$sum: {
$cond: [{
$eq: ['$field.value', 'y']
},
1, 0
]
}
},
'matchedDocuments': {
'$push': '$$CURRENT'
}
}
},
{'$match': {'$or': [{'x_count': {'$gte': 2}}, {'y_count': { '$gte': 1}}]}},
{'$unwind': '$matchedDocuments'},
{
'$group': {
'_id': '$matchedDocuments.key',
'document': {
'$last': '$$CURRENT.matchedDocuments'
}
}
}
], {
allowDiskUse: true
})
Below are some sample documents and the expected result based on the criteria above:
// Sample documents
{ "_id" : ObjectId("5407c76b7b1c276c74f90524"), "field" : "x", "arrayOfValues" : [ "a", "b", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90525"), "field" : "x", "arrayOfValues" : [ "b", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90526"), "field" : "z", "arrayOfValues" : [ "a" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90527"), "field" : "x", "arrayOfValues" : [ "a", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90528"), "field" : "z", "arrayOfValues" : [ "b" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90529"), "field" : "y", "arrayOfValues" : [ "k" ] }
// Expected Result
[
{ "_id" : ObjectId("5407c76b7b1c276c74f90524"), "field" : "x", "arrayOfValues" : [ "a", "b", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90525"), "field" : "x", "arrayOfValues" : [ "b", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90527"), "field" : "x", "arrayOfValues" : [ "a", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90529"), "field" : "y", "arrayOfValues" : [ "k" ] }
]
I think ultimately you are asking a little too much from a single query, as clearly the biggest problem here is trying to store all of the original documents from whence the array element came whilst trying to aggregate a total.
For me, I would just try to identify which conditions on the document would result in a match and then issue a separate query to get the actual documents back. You could adapt the aggregation below to try to return the document, but I think it's very likely to fail when doing so as it would be the reverse of what you should be using arrays for.
The process is also generally much more efficient in the way it goes about the matching, allowing you to firstly "select the elements you are interested in with a match condition" and secondly "use the natural grouping conditions rather than rely on conditional sums".
var cursor = db.collection.aggregate([
{ "$match": { "field": { "$in": ["x", "y"] } } },
{ "$unwind": "$arrayOfValues" },
{ "$group": {
"_id": {
"elem": "$arrayOfValues",
"field": "$field"
},
"count": { "$sum": 1 }
}},
{ "$match": {
"$or": [
{ "_id.field": "x", "count": { "$gte": 2 } },
{ "_id.field": "y", "count": { "$gte": 1 } }
]
}},
{ "$group": {
"_id": "$_id.field",
"values": { "$push": "$_id.elem" }
}}
])
var query = { "$or": [] };
cursor.forEach(function(doc) {
query["$or"].push({
"field": doc._id,
"arrayOfValues": { "$in": doc.values }
});
});
db.collection.find(query)
For the record the query should come out like this, given the supplied data:
{
"$or" : [
{
"field" : "x",
"arrayOfValues" : {
"$in" : [
"c",
"b",
"a"
]
}
},
{
"field" : "y",
"arrayOfValues" : {
"$in" : [
"k"
]
}
}
]
}
The basic logic is met by just looking for the values of "field" that you are interested in, so at least eliminating all others from the possible results. Then you basically want to tally up the counts for each array element under each of those "field" values and test where the required occurrences were met.
This may or may not work best the other way around, but the sample here shows the greatest variation by the "arrayOfValues" so that makes sense as the second level of grouping.
As stated earlier, I think it is too much to ask to basically "stuff" all of the parent document information into an array for each "arrayOfValues" element, as this works against the basic principles of a sensible schema, where that sort of relation would naturally be stored as separate documents. So the end principle here is just to find the "conditions" that match those documents, which is what the end result comes out as.
The transformed query is then issued against the collection, which returns all documents that meet the conditions determined from the previous analysis. At the end of the day, this moves the responsibility of "fetching" the matching documents off to another query rather than trying to store the matching documents in arrays.
This seems the most logical and scalable approach, but if you mostly tend to use your data in this type of result you should look at re-designing your schema to suit this better. But there really is not enough specific information here to comment on that further.

sort array in query and project all fields

I would like to sort a nested array at query time while also projecting all fields in the document.
Example document:
{ "_id" : 0, "unknown_field" : "foo", "array_to_sort" : [ { "a" : 3, "b" : 4 }, { "a" : 3, "b" : 3 }, { "a" : 1, "b" : 0 } ] }
I can perform the sorting with an aggregation but I cannot preserve all the fields I need. The application does not know at query time what other fields may appear in each document, so I am not able to explicitly project them. If I had a wildcard to project all fields then this would work:
db.c.aggregate([
{$unwind: "$array_to_sort"},
{$sort: {"array_to_sort.b":1, "array_to_sort:a": 1}},
{$group: {_id:"$_id", array_to_sort: {$push:"$array_to_sort"}}}
]);
...but unfortunately, it produces a result that does not contain the "unknown_field":
{
"_id" : 0,
"array_to_sort" : [
{
"a" : 1,
"b" : 0
},
{
"a" : 3,
"b" : 3
},
{
"a" : 3,
"b" : 4
}
]
}
Here is the insert command incase you would like to experiment:
db.c.insert({"unknown_field": "foo", "array_to_sort": [{"a": 3, "b": 4}, {"a": 3, "b":3}, {"a": 1, "b":0}]})
I cannot pre-sort the array because the sort criteria is dynamic. I may be sorting by any combination of a and/or b ascending/descending at query time. I realize I may need to do this in my client application, but it would be sweet if I could do it in mongo because then I could also $slice/skip/limit the results for paging instead of retrieving the entire array every time.
Since you are grouping on the document _id you can simply place the fields you wish to keep within the grouping _id. Then you can re-form the document using $project:
db.c.aggregate([
{ "$unwind": "$array_to_sort"},
{ "$sort": {"array_to_sort.b":1, "array_to_sort:a": 1}},
{ "$group": {
"_id": {
"_id": "$_id",
"unknown_field": "$unknown_field"
},
"Oarray_to_sort": { "$push":"$array_to_sort"}
}},
{ "$project": {
"_id": "$_id._id",
"unknown_field": "$_id.unknown_field",
"array_to_sort": "$Oarray_to_sort"
}}
]);
The other "trick" in there is using a temporary name for the array in the grouping stage. This is so when you $project and change the name, you get the fields in the order specified in the projection statement. If you did not, then the "array_to_sort" field would not be the last field in the order, as it is copied from the prior stage.
That is an intended optimization in $project, but if you want the order then you can do it as above.
For completely unknown structures there is the mapReduce way of doing things:
db.c.mapReduce(
function () {
this["array_to_sort"].sort(function(a,b) {
return a.a - b.a || a.b - b.b;
});
emit( this._id, this );
},
function(){},
{ "out": { "inline": 1 } }
)
Of course that has an output format that is specific to mapReduce and therefore not exactly the document you had, but all the fields are contained under "value":
{
"results" : [
{
"_id" : 0,
"value" : {
"_id" : 0,
"some_field" : "a",
"array_to_sort" : [
{
"a" : 1,
"b" : 0
},
{
"a" : 3,
"b" : 3
},
{
"a" : 3,
"b" : 4
}
]
}
}
],
}
Future releases (as of this writing) allow you to use the $$ROOT variable in aggregate to represent the whole document:
db.c.aggregate([
{ "$project": {
"_id": "$$ROOT",
"array_to_sort": "$array_to_sort"
}},
{ "$unwind": "$array_to_sort"},
{ "$sort": {"array_to_sort.b":1, "array_to_sort:a": 1}},
{ "$group": {
"_id": "$_id",
"array_to_sort": { "$push":"$array_to_sort"}
}}
]);
So there is no point using a final $project stage here, as you do not actually know the other fields in the document. But they will all be contained (including the original array and its order) within the _id field of the result document.
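Since the field names are unknown to the server, one option (a sketch, not the only way, and assuming a shell recent enough that aggregate() returns a cursor, which $$ROOT already requires) is to re-shape the result in client code, where the whole original document is available under _id:
db.c.aggregate([
    { "$project": { "_id": "$$ROOT", "array_to_sort": "$array_to_sort" } },
    { "$unwind": "$array_to_sort" },
    { "$sort": { "array_to_sort.b": 1, "array_to_sort.a": 1 } },
    { "$group": { "_id": "$_id", "array_to_sort": { "$push": "$array_to_sort" } } }
]).forEach(function (doc) {
    var original = doc._id;                       // every original field, known or not
    original.array_to_sort = doc.array_to_sort;   // swap in the sorted copy of the array
    printjson(original);
});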

MongoDB aggregation grouping while keeping a field

After applying the aggregation
db.grades.aggregate([
{$match: {'type': 'homework'}},
{$sort: {'student_id':1, 'score':1}}
])
got the result:
{
"result" : [
{
"_id" : ObjectId("50906d7fa3c412bb040eb579"),
"student_id" : 0,
"type" : "homework",
"score" : 14.8504576811645
},
{
"_id" : ObjectId("50906d7fa3c412bb040eb57a"),
"student_id" : 0,
"type" : "homework",
"score" : 63.98402553675503
},
...
How can I modify the query so that, for each student, only the document with the minimum score is kept, while preserving its _id field? For example, like this:
{
"_id" : ObjectId("50906d7fa3c412bb040eb579"),
"score" : 14.8504576811645
}
Thanks.
Is this a homework question from the education site? I can't remember, but this is fairly trivial.
db.grades.aggregate([
{ "$match": { type: 'homework' } },
{ "$sort": {student_id: 1, score: 1} },
{ "$group": {
"_id": "$student_id",
"doc": { "$first": "$_id"},
"score": { "$first": "$score"}
}},
{ "$sort: { "_id": 1 } },
{ "$project": {
"_id": "$doc",
"score": 1
}}
])
All this does is use $first to get the first result when grouping by student_id. By first it means exactly that, so this is only useful after sorting and is different from $min which would take the smallest value from the grouped results.
So if you got part of the way there, not only do you keep the first score, but you also apply the same operation to the _id value.
The additional sort is only there so the results don't trip you up, because they are likely to appear in the reverse order of student_id. Finally there is just a small use of $project to get the document form that you want.
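For contrast, a minimal sketch with $min shows why it is not enough on its own here: it returns the smallest score per student, but there is no paired _id to go with it, which is exactly what the sort-then-$first pattern above provides.
db.grades.aggregate([
    { "$match": { "type": "homework" } },
    { "$group": {
        "_id": "$student_id",
        "score": { "$min": "$score" }     // smallest score, but no matching document _id
    }}
])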