MongoDB Find key with max value within nested document - mongodb

I have a document like this:
{
timestamp: ISODate("2013-10-10T23:00:00.000Z"),
values: {
0: 25,
1: 2,
3: 16,
4: 12,
5: 10
}
}
Two questions:
How can I get the "argmax" 0 from the nested document in values?
If I have multiple documents like this, can I query for all documents with an "argmax" of 2, for instance?

You really need to change the way you are structuring your documents as what you have right now is not good. Nested objects like you have cannot be "traversed" in normal query operations, so their is no way of efficiently searching "across keys".
The only way to do this is using JavaScript evaluation of $where, and this means "no index" can be used to optimise searching. It is also basically a "brute force" match against every document in the collection:
db.collection.find(function() {
var values = this.values;
return Object.keys(values).some(function(key) {
return values[key] == 2;
});
})
That is just to find a value of "2" in the nested key, in order to find if the "maximum" value was "2" then you would do:
db.collection.find(function() {
var values = this.values;
return Math.max.apply(null, Object.keys(values).map(function(key) {
return values[key];
})) == 2;
})
So brute force is not good. Better to structure your document with "values" as an "array". Then all native queries work fine:
{
"timestamp": ISODate("2013-10-10T23:00:00.000Z"),
"values": [25, 2, 16, 12, 10]
}
Now you can do:
db.collection.aggregate([
{ "$match": { "values": 2 } },
{ "$unwind": "$values" },
{ "$group": {
"_id": "$_id",
"timestamp": { "$first": "$timestamp" },
"values": { "$push": "$values" },
"maxVal": { "$max": "$values" }
}},
{ "$match": { "maxVal": 2 } }
])
Which might at first glance seem more cumbersome, but the construction of this using native operators as opposed to JavaScript translation does make this much more efficient. It is also notably more efficient in that it is now possible to actually search whether the "values" array actually even contains "2" as a value using an index even, without needing to test all content in looped code.
The main work is done within testing the "max" value for the array, so even if this was not the most shining example, you can see even see the clear difference in how a normal query operation can be combined with JavaScript evaluation now, to make that process faster:
db.collection.find({
"values": 2,
"$where": function() {
return Math.max.apply(null,this.values) == 2;
}
})
So the initial "values": 2 will filter the documents immediately for those that contain "2", and the subsequent expression merely filters down further those documents where the "max" value of the array is "2" as well.
Moreover, if it was your intention to query for "maximum" values like this on a regular basis, then you would be better off storing this value as a discrete field in the document itself, like so:
{
"timestamp": ISODate("2013-10-10T23:00:00.000Z"),
"values": [25, 2, 16, 12, 10],
"minValue": 2,
"maxValue": 25
}
Then finding documents with the "maximum value" of 2 is as simple as:
db.collection.find({ "maxValue": 2 })
Or the largest "max" within all documents:
db.collection.find().sort({ "maxValue": -1 }).limit(1)
Or even both "min" and "max" from all documents at the same time:
db.collection.aggregate([
{ "$group": {
"_id": null,
"minValue": { "$min": "$minValue" },
"maxValue": { "$max": "$maxValue" }
}}
])
Maintaining this data when adding new "values" is a simple matter of employing the $min and $max update operators as you update the document. So to add "26" to the values:
db.collection.update(
{ "timestamp": ISODate("2013-10-10T23:00:00.000Z") },
{
"$push": { "values": 26 },
"$min": { "minValue": 26 },
"$max": { "maxValue": 26 }
}
)
Which results in only ajusting values where either $min or $max respectively was less than or greater than the current value.
{
"timestamp": ISODate("2013-10-10T23:00:00.000Z"),
"values": [25, 2, 16, 12, 10, 26],
"minValue": 2,
"maxValue": 26
}
Therefore it should be clear to see why the structure is important, and that nested objects should be avoided in preference to an array where it is your intention to traverse the data, in either analysing the document itself, or indeed across multiple documents in a collection.

Related

Finding documents based on the minimum value in an array

my document structure is something like :
{
_id: ...,
key1: ....
key2: ....
....
min_value: //should be the minimum of all the values in options
options: [
{
source: 'a',
value: 12,
},
{
source: 'b',
value: 10,
},
...
]
},
{
_id: ...,
key1: ....
key2: ....
....
min_value: //should be the minimum of all the values in options
options: [
{
source: 'a',
value: 24,
},
{
source: 'b',
value: 36,
},
...
]
}
the value of various sources in options will keep getting updated on a frequent basis(evey few mins or hours),
assume the size of options array doesnt change, i.e. no extra elements are added to the list
my queries are of the following type:
-find all documents where the min_value of all the options falls between some limit.
I could first do an unwind on options(and then take min) and then run comparison queries, but I am new to mongo and not sure how performance
is affected by unwind operation. The number of documents of this type would be about a few million.
Or does anyone has any suggestions around changing the document structure which could help me simplify this query? ( apart from creating separate documents per source - it would involves lot of data duplication )
Thanks!
Using $unwind is indeed quite expensive, most notably so with larger arrays, but there is a cost in all cases of usage. There are a couple of way to approach not needing $unwind here without real structural changes.
Pure Aggregation
In the basic case, as of MongoDB 3.2.x release series the $min operator can work directly on an array of values in a "projection" sense in addition to it's standard grouping accumulator role. This means that with the help of the related $map operator for processing elements of an array, you can then get the minimal value without using $unwind:
db.collection.aggregate([
// Still makes sense to use an index to select only possible documents
{ "$match": {
"options": {
"$elemMatch": {
"value": { "$gte": minValue, "$lt": maxValue }
}
}
}},
// Provides a logical filter to remove non-matching documents
{ "$redact": {
"$cond": {
"if": {
"$let": {
"vars": {
"min_value": {
"$min": {
"$map": {
"input": "$options",
"as": "option",
"in": "$$option.value"
}
}
}
},
"in": { "$and": [
{ "$gte": [ "$$min_value", minValue ] },
{ "$lt": [ "$$min_value", maxValue ] }
]}
}
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}},
// Optionally return the min_value as a field
{ "$project": {
"min_value": {
"$min": {
"$map": {
"input": "$options",
"as": "option",
"in": "$$option.value"
}
}
}
}}
])
The basic case is to get the "minimum" value from the array ( done inside of $let since we want to use the result "twice" in logical conditions. Helps us not repeat ourselves ) is to first extract the "value" data from the "options" array. This is done using $map.
The output of $map is an array with just those values, so this is supplied as the argument to $min, which then returns the minimum value for that array.
Using $redact is sort of like a $match pipeline stage with the difference that rather than needing a field to be "present" in the document being examined, you instead just form a logical condition with calculations.
In this case the condition is $and where "both" the logical forms of $gte and $lt return true against the calculated value ( from $let as "$$min_value" ).
The $redact stage then has the special arguments to apply to $$KEEP the document when the condition is true or $$PRUNE the document from results when it is false.
It's all very much like doing $project and then $match to actually project the value into the document before filtering in another stage, but all done in one stage. Of course you might actually want to $project the resulting field in what you return, but it generally cuts the workload if you remove non-matched documents "first" using $redact instead.
Updating Documents
Of course I think the best option is to actually keep the "min_value" field in the document rather than work it out at run-time. So this is a very simple thing to do when adding to or altering array items during update.
For this there is the $min "update" operator. Use it when appending with $push:
db.collection.update({
{ "_id": id },
{
"$push": { "options": { "source": "a", "value": 9 } },
"$min": { "min_value": 9 }
}
})
Or when updating a value of an element:
db.collection.update({
{ "_id": id, "options.source": "a" },
{
"$set": { "options.$.value": 9 },
"$min": { "min_value": 9 }
}
})
If the current "min_value" in the document is greater than the argument in $min or the key does not yet exist then the value given will be written. If it is greater than, the existing value stays in place since it is already the smaller value.
You can even set all your existing data with a simple "bulk" operations update:
var ops = [];
db.collection.find({ "min_value": { "$exists": false } }).forEach(function(doc) {
// Queue operations
ops.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": {
"$min": {
"min_value": Math.min.apply(
null,
doc.options.map(function(option) {
return option.value
})
)
}
}
}
});
// Write once in 1000 documents
if ( ops.length == 1000 ) {
db.collection.bulkWrite(ops);
ops = [];
}
});
// Clear any remaining operations
if ( ops.length > 0 )
db.collection.bulkWrite(ops);
Then with a field in place, it is just a simple range selection:
db.collection.find({
"min_value": {
"$gte": minValue, "$lt": maxValue
}
})
So it really should be in your best interests to keep a field ( or fields if you regularly need different conditions ) in the document since that provides the most efficient query.
Of course, the new functions of aggregation $min along with $map also make this viable to use without a field, if you prefer more dynamic conditions.

MongoDB sum arrays from multiple documents on a per-element basis

I have the following document structure (simplified for this example)
{
_id : ObjectId("sdfsdf"),
result : [1, 3, 5, 7, 9]
},
{
_id : ObjectId("asdref"),
result : [2, 4, 6, 8, 10]
}
I want to get the sum of those result arrays, but not a total sum, instead a new array corresponding to the sum of the original arrays on an element basis, i.e.
result : [3, 7, 11, 15, 19]
I have searched through the myriad questions here and a few come close (e.g. this one, this one, and this one), but I can't quite get there.
I can get the sum of each array fine
aggregate(
[
{
"$unwind" : "$result"
},
{
"$group": {
"_id": "$_id",
"results" : { "$sum" : "$result"}
}
}
]
)
which gives me
[ { _id: sdfsdf, results: 25 },
{ _id: asdref, results: 30 } ]
but I can't figure out how to get the sum of each element
You can use includeArrayIndex if you have 3.2 or newer MongoDb.
Then you should change $unwind.
Your code should be like this:
.aggregate(
[
{
"$unwind" : { path: "$result", includeArrayIndex: "arrayIndex" }
},
{
"$group": {
"_id": "$arrayIndex",
"results" : { "$sum" : "$result"}
}
},
{
$sort: { "_id": 1}
},
{
"$group":{
"_id": null,
"results":{"$push":"$results"}
}
},
{
"$project": {"_id":0,"results":1}
}
]
)
There is an alternate approach to this, but mileage may vary on how practical it is considering that a different approach would involve using $push to create an "array of arrays" and then applying $reduce as introduced in MongoDB 3.4 to $sum those array elements into a single array result:
db.collection.aggregate([
{ "$group": {
"_id": null,
"result": { "$push": "$result" }
}},
{ "$addFields": {
"result": {
"$reduce": {
"input": "$result",
"initialValue": [],
"in": {
"$map": {
"input": {
"$zip": {
"inputs": [ "$$this", "$$value" ],
"useLongestLength": true
}
},
"as": "el",
"in": { "$sum": "$$el" }
}
}
}
}
}}
])
The real trick there is in the "input" to $map we use the $zip operation which creates a transposed list of arrays "pairwise" for the two array inputs.
In a first iteration this takes the empty array as supplied to $reduce and would return the "zipped" output with consideration to the first object found as in:
[ [0,1], [0,3], [0,5], [0,7], [0,9] ]
So the useLongestLength would substitute the empty array with 0 values out to the the length of the current array and "zip" them together as above.
Processing with $map, each element is subject to $sum which "reduces" the returned results as:
[ 1, 3, 5, 7, 9 ]
On the second iteration, the next entry in the "array of arrays" would be picked up and processed by $zip along with the previous "reduced" content as:
[ [1,2], [3,4], [5,6], [7,8], [9,10] ]
Which is then subject to the $map for each element using $sum again to produce:
[ 3, 7, 11, 15, 19 ]
And since there were only two arrays pushed into the "array of arrays" that is the end of the operation, and the final result. But otherwise the $reduce would keep iterating until all array elements of the input were processed.
So in some cases this would be the more performant option and what you should be using. But it is noted that particularly when using a null for $group you are asking "every" document to $push content into an array for the result.
This could be a cause of breaking the BSON Limit in extreme cases, and therefore when aggregating positional array content over large results, it is probably best to use $unwind with the includeArrayIndex option instead.
Or indeed actually take a good look at the process, where in particular if the "positional array" in question is actually the result of some other "aggregation operation", then you should rather be looking at the previous pipeline stages that were used to create the "positional array". And then consider that if you wanted those positions "aggregated further" to new totals, then you should in fact do that "before" the positional result was obtained.

MongoDB lists - get every Nth item

I have a Mongodb schema that looks roughly like:
[
{
"name" : "name1",
"instances" : [
{
"value" : 1,
"date" : ISODate("2015-03-04T00:00:00.000Z")
},
{
"value" : 2,
"date" : ISODate("2015-04-01T00:00:00.000Z")
},
{
"value" : 2.5,
"date" : ISODate("2015-03-05T00:00:00.000Z")
},
...
]
},
{
"name" : "name2",
"instances" : [
...
]
}
]
where the number of instances for each element can be quite big.
I sometimes want to get only a sample of the data, that is, get every 3rd instance, or every 10th instance... you get the picture.
I can achieve this goal by getting all instances and filtering them in my server code, but I was wondering if there's a way to do it by using some aggregation query.
Any ideas?
Updated
Assuming the data structure was flat as #SylvainLeroux suggested below, that is:
[
{"name": "name1", "value": 1, "date": ISODate("2015-03-04T00:00:00.000Z")},
{"name": "name2", "value": 5, "date": ISODate("2015-04-04T00:00:00.000Z")},
{"name": "name1", "value": 2, "date": ISODate("2015-04-01T00:00:00.000Z")},
{"name": "name1", "value": 2.5, "date": ISODate("2015-03-05T00:00:00.000Z")},
...
]
will the task of getting every Nth item (of specific name) be easier?
It seems that your question clearly asked "get every nth instance" which does seem like a pretty clear question.
Query operations like .find() can really only return the document "as is" with the exception of general field "selection" in projection and operators such as the positional $ match operator or $elemMatch that allow a singular matched array element.
Of course there is $slice, but that just allows a "range selection" on the array, so again does not apply.
The "only" things that can modify a result on the server are .aggregate() and .mapReduce(). The former does not "play very well" with "slicing" arrays in any way, at least not by "n" elements. However since the "function()" arguments of mapReduce are JavaScript based logic, then you have a little more room to play with.
For analytical processes, and for analytical purposes "only" then just filter the array contents via mapReduce using .filter():
db.collection.mapReduce(
function() {
var id = this._id;
delete this._id;
// filter the content of "instances" to every 3rd item only
this.instances = this.instances.filter(function(el,idx) {
return ((idx+1) % 3) == 0;
});
emit(id,this);
},
function() {},
{ "out": { "inline": 1 } } // or output to collection as required
)
It's really just a "JavaScript runner" at this point, but if this is just for anaylsis/testing then there is nothing generally wrong with the concept. Of course the output is not "exactly" how your document is structured, but it's as near a facsimile as mapReduce can get.
The other suggestion I see here requires creating a new collection with all the items "denormalized" and inserting the "index" from the array as part of the unqique _id key. That may produce something you can query directly, bu for the "every nth item" you would still have to do:
db.resultCollection.find({
"_id.index": { "$in": [2,5,8,11,14] } // and so on ....
})
So work out and provide the index value of "every nth item" in order to get "every nth item". So that doesn't really seem to solve the problem that was asked.
If the output form seemed more desirable for your "testing" purposes, then a better subsequent query on those results would be using the aggregation pipeline, with $redact
db.newCollection([
{ "$redact": {
"$cond": {
"if": {
"$eq": [
{ "$mod": [ { "$add": [ "$_id.index", 1] }, 3 ] },
0 ]
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
])
That at least uses a "logical condition" much the same as what was applied with .filter() before to just select the "nth index" items without listing all possible index values as a query argument.
No $unwind is needed here. You can use $push with $arrayElemAt to project the array value at requested index inside $group aggregation.
Something like
db.colname.aggregate(
[
{"$group":{
"_id":null,
"valuesatNthindex":{"$push":{"$arrayElemAt":["$instances",N]}
}}
},
{"$project":{"valuesatNthindex":1}}
])
You might like this approach using the $lookup aggregation. And probably the most convenient and fastest way without any aggregation trick.
Create a collection Names with the following schema
[
{ "_id": 1, "name": "name1" },
{ "_id": 2, "name": "name2" }
]
and then Instances collection having the parent id as "nameId"
[
{ "nameId": 1, "value" : 1, "date" : ISODate("2015-03-04T00:00:00.000Z") },
{ "nameId": 1, "value" : 2, "date" : ISODate("2015-04-01T00:00:00.000Z") },
{ "nameId": 1, "value" : 3, "date" : ISODate("2015-03-05T00:00:00.000Z") },
{ "nameId": 2, "value" : 7, "date" : ISODate("2015-03-04T00:00:00.000Z") },
{ "nameId": 2, "value" : 8, "date" : ISODate("2015-04-01T00:00:00.000Z") },
{ "nameId": 2, "value" : 4, "date" : ISODate("2015-03-05T00:00:00.000Z") }
]
Now with $lookup aggregation 3.6 syntax you can use $sample inside the $lookup pipeline to get the every Nth element randomly.
db.Names.aggregate([
{ "$lookup": {
"from": Instances.collection.name,
"let": { "nameId": "$_id" },
"pipeline": [
{ "$match": { "$expr": { "$eq": ["$nameId", "$$nameId"] }}},
{ "$sample": { "size": N }}
],
"as": "instances"
}}
])
You can test it here
Unfortunately, with the aggregation framework it's not possible as this would require an option with $unwind to emit an array index/position, of which currently aggregation can't handle. There is an open JIRA ticket for this here SERVER-4588.
However, a workaround would be to use MapReduce but this comes at a huge performance cost since the actual calculations of getting the array index are performed using the embedded JavaScript engine (which is slow), and there still is a single global JavaScript lock, which only allows a single JavaScript thread to run at a single time.
With mapReduce, you could try something like this:
Mapping function:
var map = function(){
for(var i=0; i < this.instances.length; i++){
emit(
{ "_id": this._id, "index": i },
{ "index": i, "value": this.instances[i] }
);
}
};
Reduce function:
var reduce = function(){}
You can then run the following mapReduce function on your collection:
db.collection.mapReduce( map, reduce, { out : "resultCollection" } );
And then you can query the result collection to geta list/array of every Nth item of the instance array by using the map() cursor method :
var thirdInstances = db.resultCollection.find({"_id.index": N})
.map(function(doc){return doc.value.value})
You can use below aggregation:
db.col.aggregate([
{
$project: {
instances: {
$map: {
input: { $range: [ 0, { $size: "$instances" }, N ] },
as: "index",
in: { $arrayElemAt: [ "$instances", "$$index" ] }
}
}
}
}
])
$range generates a list of indexes. Third parameter represents non-zero step. For N = 2 it will be [0,2,4,6...], for N = 3 it will return [0,3,6,9...] and so on. Then you can use $map to get correspinding items from instances array.
Or with just a find block:
db.Collection.find({}).then(function(data) {
var ret = [];
for (var i = 0, len = data.length; i < len; i++) {
if (i % 3 === 0 ) {
ret.push(data[i]);
}
}
return ret;
});
Returns a promise whose then() you can invoke to fetch the Nth modulo'ed data.

MongoDB Nested Array Intersection Query

and thank you in advance for your help.
I have a mongoDB database structured like this:
{
'_id' : objectID(...),
'userID' : id,
'movies' : [{
'movieID' : movieID,
'rating' : rating
}]
}
My question is:
I want to search for a specific user that has 'userID' : 3, for example, get all is movies, then i want to get all the other users that have at least, 15 or more movies with the same 'movieID', then with that group i wanna select only the users that have those 15 movies in similarity and have one extra 'movieID' that i choose.
I already tried aggregation, but failed, and if i do single queries like getting all the users movies from a user, the cycling every user movie and comparing it takes a bunch of time.
Any ideias?
Thank you
There are a couple of ways to do this using the aggregation framework
Just a simple set of data for example:
{
"_id" : ObjectId("538181738d6bd23253654690"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 2, "rating": 6 },
{ "_id": 3, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654691"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 4, "rating": 6 },
{ "_id": 2, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654692"),
"movies": [
{ "_id": 2, "rating": 5 },
{ "_id": 5, "rating": 6 },
{ "_id": 6, "rating": 7 }
]
}
Using the first "user" as an example, now you want to find if any of the other two users have at least two of the same movies.
For MongoDB 2.6 and upwards you can simply use the $setIntersection operator along with the $size operator:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document if you want to keep more than `_id`
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
}},
// Unwind the array
{ "$unwind": "$movies" },
// Build the array back with just `_id` values
{ "$group": {
"_id": "$_id",
"movies": { "$push": "$movies._id" }
}},
// Find the "set intersection" of the two arrays
{ "$project": {
"movies": {
"$size": {
"$setIntersection": [
[ 1, 2, 3 ],
"$movies"
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
This is still possible in earlier versions of MongoDB that do not have those operators, just using a few more steps:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document along with the "set" to match
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
"set": { "$cond": [ 1, [ 1, 2, 3 ], 0 ] }
}},
// Unwind both those arrays
{ "$unwind": "$movies" },
{ "$unwind": "$set" },
// Group back the count where both `_id` values are equal
{ "$group": {
"_id": "$_id",
"movies": {
"$sum": {
"$cond":[
{ "$eq": [ "$movies._id", "$set" ] },
1,
0
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
In Detail
That may be a bit to take in, so we can take a look at each stage and break those down to see what they are doing.
$match : You do not want to operate on every document in the collection so this is an opportunity to remove the items that are not possibly matches even if there still is more work to do to find the exact ones. So the obvious things are to exclude the same "user" and then only match the documents that have at least one of the same movies as was found for that "user".
The next thing that makes sense is to consider that when you want to match n entries then only documents that have a "movies" array that is larger than n-1 can possibly actually contain matches. The use of $and here looks funny and is not required specifically, but if the required matches were 4 then that actual part of the statement would look like this:
"$and": [
{ "movies": { "$not": { "$size": 1 } } },
{ "movies": { "$not": { "$size": 2 } } },
{ "movies": { "$not": { "$size": 3 } } }
]
So you basically "rule out" arrays that are not possibly long enough to have n matches. Noting here that this $size operator in the query form is different to $size for the aggregation framework. There is no way for example to use this with an inequality operator such as $gt is it's purpose is to specifically match the requested "size". Hence this query form to specify all of the possible sizes that are less than.
$project : There are a few purposes in this statement, of which some differ depending on the MongoDB version you have. Firstly, and optionally, a document copy is being kept under the _id value so that these fields are not modified by the rest of the steps. The other part here is keeping the "movies" array at the top of the document as a copy for the next stage.
What is also happening in the version presented for pre 2.6 versions is there is an additional array representing the _id values for the "movies" to match. The usage of the $cond operator here is just a way of creating a "literal" representation of the array. Funny enough, MongoDB 2.6 introduces an operator known as $literal to do exactly this without the funny way we are using $cond right here.
$unwind : To do anything further the movies array needs to be unwound as in either case it is the only way to isolate the existing _id values for the entries that need to be matched against the "set". So for the pre 2.6 version you need to "unwind" both of the arrays that are present.
$group : For MongoDB 2.6 and greater you are just grouping back to an array that only contains the _id values of the movies with the "ratings" removed.
Pre 2.6 since all values are presented "side by side" ( and with lots of duplication ) you are doing a comparison of the two values to see if they are the same. Where that is true, this tells the $cond operator statement to return a value of 1 or 0 where the condition is false. This is directly passed back through $sum to total up the number of matching elements in the array to the required "set".
$project: Where this is the different part for MongoDB 2.6 and greater is that since you have pushed back an array of the "movies" _id values you are then using $setIntersection to directly compare those arrays. As the result of this is an array containing the elements that are the same, this is then wrapped in a $size operator in order to determine how many elements were returned in that matching set.
$match: Is the final stage that has been implemented here which does the clear step of matching only those documents whose count of intersecting elements was greater than or equal to the required number.
Final
That is basically how you do it. Prior to 2.6 is a bit clunkier and will require a bit more memory due to the expansion that is done by duplicating each array member that is found by all of the possible values of the set, but it still is a valid way to do this.
All you need to do is apply this with the greater n matching values to meet your conditions, and of course make sure your original user match has the required n possibilities. Otherwise just generate this on n-1 from the length of the "user's" array of "movies".

Mongo. Narroving down results of nested array

If I have a document like this:
{
"name" : "Foo",
"words" :
[
"lorem",
"ipsum",
"dolor",
"sit",
"amet",
...
]
}
Let's say this words array is pretty big. Now I need a query that would fetch that document:
db.docs.find({'name':'Foo'}) - that will get whole document
but what I want, instead of fetching the entire words array (cause it's too big) I would like to retrieve only elements that meet some criteria. Let's say I want to see only words that start with "a" or have a length of at least 3 characters.
You know maybe something like this:
// this won't work!
db.docs.find({
"$where":"(this.words.map(function(e){ if (e.length >=3) { return e } }))"
})
EDIT
You cannot filter array contents using find, You can only match that the array contains the condition. So in order to filter the contents of the array you need to make use of aggregate:
db.docs.aggregate([
// Still makes sense to match the documents that meet the condition
{ "$match": {
"name": "Foo",
"words": { "$regex": "^[A-Za-z0-9_]{4,}" }
}},
// Unwind the array to "de-normalize"
{ "$unwind": "$words" },
// Actually "filter" the array elements
{ "$match": { "words": { "$regex": "^[A-Za-z0-9_]{4,}" } } },
// Group back the document with the "filtered" array
{ "$group": {
"_id": "$_id",
"name": { "$first": "$name" },
"words": { "$push": "$words" }
}}
])
That makes use a regular expression condition that will match at least 4 characters from the start of the string. The ^ anchor is quite important here as it allows an index to be used which is much more optimal than whatever else you can do.
The result returned will look like this:
{
"result" : [
{
"_id" : ObjectId("5341f0476cbcc02b995092ac"),
"name" : "Foo",
"words" : [
"lorem",
"ipsum",
"dolor"
]
}
],
"ok" : 1
}
You can also throw a lot of arbitrary JavaScript at mapReduce and test the length of elements in the array, but that will take considerably longer to execute.
--
The terms are quite simple, you simply add the additional operator to the query document as so:
db.docs.find({ "name": "Foo", "$where": "(this.words.length > 3)" })
You really should not be using the $where operator unless absolutely necessary, and even then you really should think about what you are doing. Heed the warnings that are given in that document.
As stated in the manual page for $size, probably the best way to deal with detecting array length for a given range (rather than exact) is to create a "counter" field in your document that is updated as elements are added/removed from the array. This makes a very simple and efficient query:
db.docs.find({ "name": "Foo", "counter": { "$gt": 3 } })
Of course from MongoDB versions 2.6 and upwards you can also do this:
db.docs.aggregate([
{ "$project": {
"name": 1,
"words": 1,
"count": { "$size": "$words" }
}},
{ "$match": {
"count": { "$gt": 3 }
}}
])
Either of those forms is going to perform a lot better than using something that is going to remove the use of an index and then invoke the JavaScript interpreter over each resulting document. Or even just use the $size operator for an exact size of the array.