MongoDB sum arrays from multiple documents on a per-element basis - mongodb

I have the following document structure (simplified for this example)
{
_id : ObjectId("sdfsdf"),
result : [1, 3, 5, 7, 9]
},
{
_id : ObjectId("asdref"),
result : [2, 4, 6, 8, 10]
}
I want to get the sum of those result arrays, but not a total sum, instead a new array corresponding to the sum of the original arrays on an element basis, i.e.
result : [3, 7, 11, 15, 19]
I have searched through the myriad questions here and a few come close (e.g. this one, this one, and this one), but I can't quite get there.
I can get the sum of each array fine
aggregate(
[
{
"$unwind" : "$result"
},
{
"$group": {
"_id": "$_id",
"results" : { "$sum" : "$result"}
}
}
]
)
which gives me
[ { _id: sdfsdf, results: 25 },
{ _id: asdref, results: 30 } ]
but I can't figure out how to get the sum of each element

You can use includeArrayIndex if you have MongoDB 3.2 or newer.
To do so, change the $unwind stage to its document form.
Your code should look like this:
.aggregate(
[
{
"$unwind" : { path: "$result", includeArrayIndex: "arrayIndex" }
},
{
"$group": {
"_id": "$arrayIndex",
"results" : { "$sum" : "$result"}
}
},
{
$sort: { "_id": 1}
},
{
"$group":{
"_id": null,
"results":{"$push":"$results"}
}
},
{
"$project": {"_id":0,"results":1}
}
]
)
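To see what those stages compute, here is a minimal plain-JavaScript sketch of the same logic, run on the two sample arrays from the question rather than against a server. The function name sumByIndex is hypothetical and only used for illustration.

```javascript
// Plain-JavaScript model of the pipeline above (not MongoDB code).
// sumByIndex is a hypothetical helper name.
function sumByIndex(docs) {
  const totals = new Map(); // arrayIndex -> running sum
  for (const doc of docs) {
    // $unwind with includeArrayIndex: one (value, arrayIndex) pair per element
    doc.result.forEach((value, arrayIndex) => {
      // $group by arrayIndex with $sum
      totals.set(arrayIndex, (totals.get(arrayIndex) || 0) + value);
    });
  }
  // $sort by index, then $push the sums into a single array
  return [...totals.entries()]
    .sort((x, y) => x[0] - y[0])
    .map(([, sum]) => sum);
}

const sampleDocs = [
  { result: [1, 3, 5, 7, 9] },
  { result: [2, 4, 6, 8, 10] }
];
console.log(sumByIndex(sampleDocs)); // [ 3, 7, 11, 15, 19 ]
```

The explicit sort mirrors the $sort stage, which is there because $group does not guarantee the order of its output.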

There is an alternate approach to this, though how practical it is will vary. It uses $push to create an "array of arrays" and then applies $reduce, as introduced in MongoDB 3.4, to $sum those array elements into a single array result:
db.collection.aggregate([
{ "$group": {
"_id": null,
"result": { "$push": "$result" }
}},
{ "$addFields": {
"result": {
"$reduce": {
"input": "$result",
"initialValue": [],
"in": {
"$map": {
"input": {
"$zip": {
"inputs": [ "$$this", "$$value" ],
"useLongestLength": true
}
},
"as": "el",
"in": { "$sum": "$$el" }
}
}
}
}
}}
])
The real trick there is in the "input" to $map we use the $zip operation which creates a transposed list of arrays "pairwise" for the two array inputs.
In the first iteration this pairs the first document's array ( $$this ) with the empty array supplied as the initialValue of $reduce ( $$value ), so the "zipped" output is:
[ [1,null], [3,null], [5,null], [7,null], [9,null] ]
So useLongestLength pads the empty array with null values out to the length of the current array and "zips" them together as above. Since $sum ignores non-numeric values, the null entries add nothing.
Processing with $map, each element is subject to $sum which "reduces" the returned results as:
[ 1, 3, 5, 7, 9 ]
On the second iteration, the next entry in the "array of arrays" would be picked up and processed by $zip along with the previous "reduced" content as:
[ [2,1], [4,3], [6,5], [8,7], [10,9] ]
Which is then subject to the $map for each element using $sum again to produce:
[ 3, 7, 11, 15, 19 ]
And since there were only two arrays pushed into the "array of arrays" that is the end of the operation, and the final result. But otherwise the $reduce would keep iterating until all array elements of the input were processed.
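The iteration described above can be modelled in plain JavaScript. This is only a sketch of the logic, not MongoDB code; zipLongest and sumArrays are hypothetical helper names, and positions missing from the shorter array are simply treated as contributing nothing to the sum.

```javascript
// Plain-JavaScript model of $reduce + $zip + $map (not MongoDB code).
// zipLongest and sumArrays are hypothetical helper names.
function zipLongest(a, b) {
  const length = Math.max(a.length, b.length);
  // Positions missing from the shorter array come back as undefined
  return Array.from({ length }, (_, i) => [a[i], b[i]]);
}

function sumArrays(arrays) {
  return arrays.reduce(
    (value, current) =>
      zipLongest(current, value).map(pair =>
        // Like $sum, skip anything that is not a number
        pair.reduce((acc, n) => (typeof n === "number" ? acc + n : acc), 0)
      ),
    [] // plays the role of initialValue
  );
}

console.log(sumArrays([[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]])); // [ 3, 7, 11, 15, 19 ]
```

Because the zip runs to the longest length, arrays of different sizes still work: shorter ones just stop contributing past their last element.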
So in some cases this would be the more performant option and what you should be using. But it is noted that particularly when using a null for $group you are asking "every" document to $push content into an array for the result.
This could be a cause of breaking the BSON Limit in extreme cases, and therefore when aggregating positional array content over large results, it is probably best to use $unwind with the includeArrayIndex option instead.
Or indeed actually take a good look at the process, where in particular if the "positional array" in question is actually the result of some other "aggregation operation", then you should rather be looking at the previous pipeline stages that were used to create the "positional array". And then consider that if you wanted those positions "aggregated further" to new totals, then you should in fact do that "before" the positional result was obtained.

Related

Find Index of first Matching Element $gte with $indexOfArray

MongoDB has $indexOfArray to let you find the element's array index, for example:
$indexOfArray: ["$article.date", ISODate("2019-03-29")]
Is it possible to use comparison operators with $indexOfArray together, like:
$indexOfArray: ["$article.date", {$gte: ISODate("2019-03-29")}]
No, it's not possible with $indexOfArray, as that will only look for an equality match to an expression given as the second argument.
Instead you can make a construct like this:
db.data.insertOne({
"_id" : ObjectId("5ca01e301a97dd8b468b3f55"),
"array" : [
ISODate("2018-03-01T00:00:00Z"),
ISODate("2018-03-02T00:00:00Z"),
ISODate("2018-03-03T00:00:00Z")
]
})
db.data.aggregate([
{ "$addFields": {
"matchedIndex": {
"$let": {
"vars": {
"matched": {
"$arrayElemAt": [
{ "$filter": {
"input": {
"$zip": {
"inputs": [ "$array", { "$range": [ 0, { "$size": "$array" } ] }]
}
},
"cond": { "$gte": [ { "$arrayElemAt": ["$$this", 0] }, new Date("2018-03-02") ] }
}},
0
]
}
},
"in": {
"$arrayElemAt": [{ "$ifNull": [ "$$matched", [0,-1] ] },1]
}
}
}
}}
])
Which would return for the $gte of Date("2018-03-02"):
{
"_id" : ObjectId("5ca01e301a97dd8b468b3f55"),
"array" : [
ISODate("2018-03-01T00:00:00Z"),
ISODate("2018-03-02T00:00:00Z"),
ISODate("2018-03-03T00:00:00Z")
],
"matchedIndex" : 1
}
Or -1 where the condition was not met in order to be consistent with $indexOfArray.
The basic premise is using $zip in order to "pair" with the array index positions which get generated from $range and $size of the array. This can be fed to a $filter condition which will return ALL matching elements to the supplied condition. Here it is the first element of the "pair" ( being the original array content ) via $arrayElemAt matching the specified condition using $gte
{ "$gte": [ { "$arrayElemAt": ["$$this", 0] }, new Date("2018-03-02") ] }
The $filter will return either ALL elements after ( in the case of $gte ) or an empty array where nothing was found. Consistent with $indexOfArray you only want the first match, which is done with another wrapping $arrayElemAt on the output for the 0 position.
Since the result could be an omitted value ( which is what happens by $arrayElemAt: [[], 0] ) then you use $ifNull to test the result and pass a two element array back with a -1 as the second element in the case where the output was not defined. In either case that "paired" array has the second element ( index 1 ) extracted again via $arrayElemAt in order to get the first matched index of the condition.
Of course since you want to refer to that whole expression, it just reads a little cleaner in the end within a $let, but that is optional as you can "inline" with the $ifNull if wanted.
So it is possible, it's just a little more involved than placing a range expression inside of $indexOfArray.
Note that any expression which actually returns a single value for equality match is just fine. But since operators like $gte return a boolean, then that would not be equal to any value in the array, and thus the sort of processing with $filter and then extraction is what you require.
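For comparison, the same sequence of steps can be written in plain JavaScript. This is a sketch of the logic only, run on the sample dates from above; firstIndexWhere is a hypothetical helper name.

```javascript
// Plain-JavaScript model of the $zip / $filter / $arrayElemAt construct
// (not MongoDB code). firstIndexWhere is a hypothetical helper name.
function firstIndexWhere(array, cond) {
  // $zip with $range: pair every element with its index
  const pairs = array.map((el, idx) => [el, idx]);
  // $filter: keep every pair whose element satisfies the condition
  const matches = pairs.filter(([el]) => cond(el));
  // $arrayElemAt position 0, with the $ifNull fallback of [0, -1]
  const matched = matches.length > 0 ? matches[0] : [0, -1];
  // $arrayElemAt position 1: the index half of the pair
  return matched[1];
}

const sampleDates = [
  new Date("2018-03-01T00:00:00Z"),
  new Date("2018-03-02T00:00:00Z"),
  new Date("2018-03-03T00:00:00Z")
];
const cutoff = new Date("2018-03-02");

console.log(firstIndexWhere(sampleDates, d => d >= cutoff));                 // 1
console.log(firstIndexWhere(sampleDates, d => d >= new Date("2019-01-01"))); // -1
```

In application code Array.prototype.findIndex gives the same answer directly; the longer form is kept here only because it lines up with the pipeline stages one for one.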

Mongodb Aggregate a $slice to get an element in exact position from a nested array

I would like to retrieve a value from a nested array where it exists at an exact position within the array.
I want to create name value pairs by doing $slice[0,1] for the name and then $slice[1,1] for the value.
Before I attempt to use aggregate, I want to attempt a find within a nested array. I can do what I want on a single depth array in a document as shown below:
{
"_id" : ObjectId("565cc5261506995581569439"),
"a" : [
4,
2,
8,
71,
21
]
}
I apply the following: db.getCollection('anothertest').find({},{_id:0, a: {$slice:[0,1]}})
and I get:
{
"a" : [
4
]
}
This is fantastic. However, what if the array I want to $slice [0,1] is located within the document at objectRawOriginData.Reports.Rows.Rows.Cells?
If I can first of all FIND then I want to apply the same as an AGGREGATE.
Your best bet here, especially if your application is not yet ready for release, is to hold off until MongoDB 3.2 for deployment, or at least start working with a release candidate in the interim. The main reason is that the "projection" $slice does not work with the aggregation framework, and neither do the other forms of array matching projection. This has been addressed for the upcoming release.
This is going to give you a couple of new operators, being $slice and even $arrayElemAt which can be used to address array elements by position in the aggregation pipeline.
Either:
db.getCollection('anothertest').aggregate([
{ "$project": {
"_id": 0,
"a": { "$slice": ["$a",0,1] }
}}
])
Which returns the familiar:
{ "a" : [ 4 ] }
Or:
db.getCollection('anothertest').aggregate([
{ "$project": {
"_id": 0,
"a": { "$arrayElemAt": ["$a", 0] }
}}
])
Which is just the element and not an array:
{ "a" : 4 }
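The difference in result shape between the two operators has a close analogy in plain JavaScript. The sketch below is not MongoDB code, and the name/value pairing at the end is only an illustration of what the question describes, using the sample array from above.

```javascript
// Plain-JavaScript analogy for the two operators (not MongoDB code).
const a = [4, 2, 8, 71, 21];

// { "$slice": ["$a", 0, 1] } keeps the array wrapper: start at 0, take 1 element
const sliced = a.slice(0, 0 + 1);

// { "$arrayElemAt": ["$a", 0] } returns the bare element
const element = a[0];

// The name/value pairing from the question, built from positions 0 and 1
const pair = { name: a[0], value: a[1] };

console.log(sliced);  // [ 4 ]
console.log(element); // 4
console.log(pair);    // { name: 4, value: 2 }
```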
Until that release becomes available other than in release candidate form, the currently available operators make it quite easy for the "first" element of the array:
db.getCollection('anothertest').aggregate([
{ "$unwind": "$a" },
{ "$group": {
"_id": "$_id",
"a": { "$first": "$a" }
}}
])
Through use of the $first operator after $unwind. But getting another indexed position becomes horribly iterative:
db.getCollection('anothertest').aggregate([
{ "$unwind": "$a" },
// Keeps the first element
{ "$group": {
"_id": "$_id",
"first": { "$first": "$a" },
"a": { "$push": "$a" }
}},
{ "$unwind": "$a" },
// Removes the first element
{ "$redact": {
"$cond": {
"if": { "$ne": [ "$first", "$a" ] },
"then": "$$KEEP",
"else": "$$PRUNE"
}
}},
// Top is now the second element
{ "$group": {
"_id": "$_id",
"second": { "$first": "$a" }
}}
])
And so on, and also a lot of handling to alter that to deal with arrays that might be shorter than the "nth" element you are looking for. So "possible", but ugly and not performant.
Also noting that is "not really" working with "indexed positions", and is purely matching on values. So duplicate values would easily be removed, unless there was another unique identifier per array element to work with. Future $unwind also has the ability to project an array index, which is handy for other purposes, but the other operators are more useful for this specific case than that feature.
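The duplicate-value problem is easy to reproduce with a plain-JavaScript model of that pipeline. This is a sketch of the logic only; secondByValue is a hypothetical helper name.

```javascript
// Plain-JavaScript model of the $first / $redact approach above,
// showing why matching on values is not the same as matching on position.
// secondByValue is a hypothetical helper name.
function secondByValue(arr) {
  const first = arr[0];                        // $first after $unwind
  const rest = arr.filter(el => el !== first); // $redact: prune values equal to "first"
  return rest[0];                              // $first of what is left
}

console.log(secondByValue([4, 2, 8, 71, 21])); // 2 (same as position 1)
console.log(secondByValue([4, 4, 8, 71, 21])); // 8 (but position 1 actually holds 4)
```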
So for my money I would wait till you had the feature available to be able to integrate this in an aggregation pipeline, or at least re-consider why you believe you need it and possibly design around it.

MongoDB Find key with max value within nested document

I have a document like this:
{
timestamp: ISODate("2013-10-10T23:00:00.000Z"),
values: {
0: 25,
1: 2,
3: 16,
4: 12,
5: 10
}
}
Two questions:
How can I get the "argmax" 0 from the nested document in values?
If I have multiple documents like this, can I query for all documents with an "argmax" of 2, for instance?
You really need to change the way you are structuring your documents as what you have right now is not good. Nested objects like you have cannot be "traversed" in normal query operations, so there is no way of efficiently searching "across keys".
The only way to do this is using JavaScript evaluation of $where, and this means "no index" can be used to optimise searching. It is also basically a "brute force" match against every document in the collection:
db.collection.find(function() {
var values = this.values;
return Object.keys(values).some(function(key) {
return values[key] == 2;
});
})
That just finds documents where any nested key has a value of "2". To find documents where the "maximum" value is "2" you would do:
db.collection.find(function() {
var values = this.values;
return Math.max.apply(null, Object.keys(values).map(function(key) {
return values[key];
})) == 2;
})
So brute force is not good. Better to structure your document with "values" as an "array". Then all native queries work fine:
{
"timestamp": ISODate("2013-10-10T23:00:00.000Z"),
"values": [25, 2, 16, 12, 10]
}
Now you can do:
db.collection.aggregate([
{ "$match": { "values": 2 } },
{ "$unwind": "$values" },
{ "$group": {
"_id": "$_id",
"timestamp": { "$first": "$timestamp" },
"values": { "$push": "$values" },
"maxVal": { "$max": "$values" }
}},
{ "$match": { "maxVal": 2 } }
])
Which might at first glance seem more cumbersome, but the construction of this using native operators as opposed to JavaScript translation does make this much more efficient. It is also notably more efficient in that it is now possible to actually search whether the "values" array actually even contains "2" as a value using an index even, without needing to test all content in looped code.
The main work is done in testing the "max" value for the array. So even if this is not the most shining example, you can still see the clear difference in how a normal query operation can now be combined with JavaScript evaluation to make that process faster:
db.collection.find({
"values": 2,
"$where": function() {
return Math.max.apply(null,this.values) == 2;
}
})
So the initial "values": 2 will filter the documents immediately for those that contain "2", and the subsequent expression merely filters down further those documents where the "max" value of the array is "2" as well.
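The queries above test the maximum value itself. If what is wanted is the "argmax" from the question, a small client-side sketch could look like the following. It assumes the array position stands in for the original key, which is an assumption and not something the restructured document guarantees; argMax is a hypothetical helper name.

```javascript
// Sketch only: returns the position of the largest value in an array.
// argMax is a hypothetical helper name; the array index is ASSUMED to
// play the role of the original nested key.
function argMax(values) {
  if (values.length === 0) return -1; // nothing to compare
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    // Strict comparison keeps the first position when there are ties
    if (values[i] > values[best]) best = i;
  }
  return best;
}

console.log(argMax([25, 2, 16, 12, 10])); // 0
```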
Moreover, if it was your intention to query for "maximum" values like this on a regular basis, then you would be better off storing this value as a discrete field in the document itself, like so:
{
"timestamp": ISODate("2013-10-10T23:00:00.000Z"),
"values": [25, 2, 16, 12, 10],
"minValue": 2,
"maxValue": 25
}
Then finding documents with the "maximum value" of 2 is as simple as:
db.collection.find({ "maxValue": 2 })
Or the largest "max" within all documents:
db.collection.find().sort({ "maxValue": -1 }).limit(1)
Or even both "min" and "max" from all documents at the same time:
db.collection.aggregate([
{ "$group": {
"_id": null,
"minValue": { "$min": "$minValue" },
"maxValue": { "$max": "$maxValue" }
}}
])
Maintaining this data when adding new "values" is a simple matter of employing the $min and $max update operators as you update the document. So to add "26" to the values:
db.collection.update(
{ "timestamp": ISODate("2013-10-10T23:00:00.000Z") },
{
"$push": { "values": 26 },
"$min": { "minValue": 26 },
"$max": { "maxValue": 26 }
}
)
This adjusts the stored values only where the supplied value is less than the current minimum ( for $min ) or greater than the current maximum ( for $max ):
{
"timestamp": ISODate("2013-10-10T23:00:00.000Z"),
"values": [25, 2, 16, 12, 10, 26],
"minValue": 2,
"maxValue": 26
}
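The effect of that update can be modelled in plain JavaScript. This is a sketch of the semantics only, not MongoDB code; pushValue is a hypothetical helper name.

```javascript
// Plain-JavaScript model of the update above (not MongoDB code).
// pushValue is a hypothetical helper name.
function pushValue(doc, n) {
  return {
    ...doc,
    values: [...doc.values, n],          // $push
    minValue: Math.min(doc.minValue, n), // $min: only ever lowers the stored value
    maxValue: Math.max(doc.maxValue, n)  // $max: only ever raises the stored value
  };
}

const original = { values: [25, 2, 16, 12, 10], minValue: 2, maxValue: 25 };
const updated = pushValue(original, 26);
console.log(updated.minValue, updated.maxValue); // 2 26
```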
Therefore it should be clear to see why the structure is important, and that nested objects should be avoided in preference to an array where it is your intention to traverse the data, in either analysing the document itself, or indeed across multiple documents in a collection.

MongoDB lists - get every Nth item

I have a Mongodb schema that looks roughly like:
[
{
"name" : "name1",
"instances" : [
{
"value" : 1,
"date" : ISODate("2015-03-04T00:00:00.000Z")
},
{
"value" : 2,
"date" : ISODate("2015-04-01T00:00:00.000Z")
},
{
"value" : 2.5,
"date" : ISODate("2015-03-05T00:00:00.000Z")
},
...
]
},
{
"name" : "name2",
"instances" : [
...
]
}
]
where the number of instances for each element can be quite big.
I sometimes want to get only a sample of the data, that is, get every 3rd instance, or every 10th instance... you get the picture.
I can achieve this goal by getting all instances and filtering them in my server code, but I was wondering if there's a way to do it by using some aggregation query.
Any ideas?
Updated
Assuming the data structure was flat as @SylvainLeroux suggested below, that is:
[
{"name": "name1", "value": 1, "date": ISODate("2015-03-04T00:00:00.000Z")},
{"name": "name2", "value": 5, "date": ISODate("2015-04-04T00:00:00.000Z")},
{"name": "name1", "value": 2, "date": ISODate("2015-04-01T00:00:00.000Z")},
{"name": "name1", "value": 2.5, "date": ISODate("2015-03-05T00:00:00.000Z")},
...
]
will the task of getting every Nth item (of specific name) be easier?
It seems that your question clearly asked "get every nth instance" which does seem like a pretty clear question.
Query operations like .find() can really only return the document "as is" with the exception of general field "selection" in projection and operators such as the positional $ match operator or $elemMatch that allow a singular matched array element.
Of course there is $slice, but that just allows a "range selection" on the array, so again does not apply.
The "only" things that can modify a result on the server are .aggregate() and .mapReduce(). The former does not "play very well" with "slicing" arrays in any way, at least not by "n" elements. However since the "function()" arguments of mapReduce are JavaScript based logic, then you have a little more room to play with.
For analytical processes, and for analytical purposes "only" then just filter the array contents via mapReduce using .filter():
db.collection.mapReduce(
function() {
var id = this._id;
delete this._id;
// filter the content of "instances" to every 3rd item only
this.instances = this.instances.filter(function(el,idx) {
return ((idx+1) % 3) == 0;
});
emit(id,this);
},
function() {},
{ "out": { "inline": 1 } } // or output to collection as required
)
It's really just a "JavaScript runner" at this point, but if this is just for analysis/testing then there is nothing generally wrong with the concept. Of course the output is not "exactly" how your document is structured, but it's as near a facsimile as mapReduce can get.
The other suggestion I see here requires creating a new collection with all the items "denormalized" and inserting the "index" from the array as part of the unique _id key. That may produce something you can query directly, but for the "every nth item" you would still have to do:
db.resultCollection.find({
"_id.index": { "$in": [2,5,8,11,14] } // and so on ....
})
So work out and provide the index value of "every nth item" in order to get "every nth item". So that doesn't really seem to solve the problem that was asked.
If the output form seemed more desirable for your "testing" purposes, then a better subsequent query on those results would be using the aggregation pipeline, with $redact
db.newCollection.aggregate([
{ "$redact": {
"$cond": {
"if": {
"$eq": [
{ "$mod": [ { "$add": [ "$_id.index", 1] }, 3 ] },
0 ]
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
])
That at least uses a "logical condition" much the same as what was applied with .filter() before to just select the "nth index" items without listing all possible index values as a query argument.
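The condition used in that $redact stage is the same arithmetic test as the earlier .filter() call. A plain-JavaScript sketch of it, with keepEveryNth as a hypothetical helper name:

```javascript
// Plain-JavaScript model of the $redact condition (not MongoDB code).
// keepEveryNth is a hypothetical helper name.
function keepEveryNth(items, n) {
  // Same test as { "$eq": [ { "$mod": [ { "$add": [index, 1] }, n ] }, 0 ] }
  return items.filter((_, index) => (index + 1) % n === 0);
}

console.log(keepEveryNth(["a", "b", "c", "d", "e", "f", "g"], 3)); // [ 'c', 'f' ]
```

With n = 3 this keeps zero-based indexes 2, 5, 8 and so on, which are the same values that otherwise had to be listed by hand in the $in query.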
No $unwind is needed here. You can use $push with $arrayElemAt to project the array value at requested index inside $group aggregation.
Something like
db.colname.aggregate(
[
{"$group":{
"_id":null,
"valuesatNthindex":{"$push":{"$arrayElemAt":["$instances",N]}
}}
},
{"$project":{"valuesatNthindex":1}}
])
You might like this approach using the $lookup aggregation. And probably the most convenient and fastest way without any aggregation trick.
Create a collection Names with the following schema
[
{ "_id": 1, "name": "name1" },
{ "_id": 2, "name": "name2" }
]
and then Instances collection having the parent id as "nameId"
[
{ "nameId": 1, "value" : 1, "date" : ISODate("2015-03-04T00:00:00.000Z") },
{ "nameId": 1, "value" : 2, "date" : ISODate("2015-04-01T00:00:00.000Z") },
{ "nameId": 1, "value" : 3, "date" : ISODate("2015-03-05T00:00:00.000Z") },
{ "nameId": 2, "value" : 7, "date" : ISODate("2015-03-04T00:00:00.000Z") },
{ "nameId": 2, "value" : 8, "date" : ISODate("2015-04-01T00:00:00.000Z") },
{ "nameId": 2, "value" : 4, "date" : ISODate("2015-03-05T00:00:00.000Z") }
]
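Now, with the $lookup pipeline syntax available from MongoDB 3.6, you can use $sample inside the $lookup pipeline to get N randomly chosen instances for each name. Note that this is a random sample rather than strictly every Nth element.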
db.Names.aggregate([
{ "$lookup": {
"from": Instances.collection.name,
"let": { "nameId": "$_id" },
"pipeline": [
{ "$match": { "$expr": { "$eq": ["$nameId", "$$nameId"] }}},
{ "$sample": { "size": N }}
],
"as": "instances"
}}
])
Unfortunately, this is not possible with the aggregation framework, as it would require an option for $unwind to emit an array index/position, which aggregation currently can't do. There is an open JIRA ticket for this: SERVER-4588.
However, a workaround would be to use MapReduce but this comes at a huge performance cost since the actual calculations of getting the array index are performed using the embedded JavaScript engine (which is slow), and there still is a single global JavaScript lock, which only allows a single JavaScript thread to run at a single time.
With mapReduce, you could try something like this:
Mapping function:
var map = function(){
for(var i=0; i < this.instances.length; i++){
emit(
{ "_id": this._id, "index": i },
{ "index": i, "value": this.instances[i] }
);
}
};
Reduce function:
var reduce = function(){}
You can then run the following mapReduce function on your collection:
db.collection.mapReduce( map, reduce, { out : "resultCollection" } );
And then you can query the result collection to get a list/array of every Nth item of the instance array by using the map() cursor method:
var thirdInstances = db.resultCollection.find({"_id.index": N})
.map(function(doc){return doc.value.value})
You can use below aggregation:
db.col.aggregate([
{
$project: {
instances: {
$map: {
input: { $range: [ 0, { $size: "$instances" }, N ] },
as: "index",
in: { $arrayElemAt: [ "$instances", "$$index" ] }
}
}
}
}
])
$range generates a list of indexes. The third parameter is a non-zero step. For N = 2 it will be [0,2,4,6...], for N = 3 it will return [0,3,6,9...] and so on. Then you can use $map to get the corresponding items from the instances array.
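A plain-JavaScript sketch of the same idea, for checking which positions get picked. It is not MongoDB code, and range and sampleEvery are hypothetical helper names.

```javascript
// Plain-JavaScript model of $range with a step, fed into $map + $arrayElemAt
// (not MongoDB code). range and sampleEvery are hypothetical helper names.
function range(start, end, step) {
  if (step <= 0) throw new Error("step must be a positive number in this sketch");
  const out = [];
  for (let i = start; i < end; i += step) out.push(i);
  return out;
}

function sampleEvery(instances, n) {
  // $map over the generated indexes, picking each element by position
  return range(0, instances.length, n).map(index => instances[index]);
}

console.log(range(0, 10, 3));                              // [ 0, 3, 6, 9 ]
console.log(sampleEvery([10, 20, 30, 40, 50, 60, 70], 3)); // [ 10, 40, 70 ]
```

Note that this starts at index 0, so it picks positions 0, N, 2N and so on, whereas a test such as (idx+1) % N == 0 picks positions N-1, 2N-1 and so on.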
Or with just a find block:
db.Collection.find({}).then(function(data) {
var ret = [];
for (var i = 0, len = data.length; i < len; i++) {
if (i % 3 === 0 ) {
ret.push(data[i]);
}
}
return ret;
});
Returns a promise whose then() you can invoke to fetch the Nth modulo'ed data.

MongoDB Nested Array Intersection Query

and thank you in advance for your help.
I have a mongoDB database structured like this:
{
'_id' : objectID(...),
'userID' : id,
'movies' : [{
'movieID' : movieID,
'rating' : rating
}]
}
My question is:
I want to search for a specific user, for example the one with 'userID' : 3, and get all of his movies. Then I want to get all the other users that have at least 15 movies with the same 'movieID'. From that group I want to select only the users that share those 15 movies and also have one extra 'movieID' that I choose.
I already tried aggregation, but failed. If I do single queries, such as getting all the movies for a user and then cycling through every other user's movies and comparing them, it takes a long time.
Any ideas?
Thank you
There are a couple of ways to do this using the aggregation framework
Just a simple set of data for example:
{
"_id" : ObjectId("538181738d6bd23253654690"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 2, "rating": 6 },
{ "_id": 3, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654691"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 4, "rating": 6 },
{ "_id": 2, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654692"),
"movies": [
{ "_id": 2, "rating": 5 },
{ "_id": 5, "rating": 6 },
{ "_id": 6, "rating": 7 }
]
}
Using the first "user" as an example, now you want to find if any of the other two users have at least two of the same movies.
For MongoDB 2.6 and upwards you can simply use the $setIntersection operator along with the $size operator:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document if you want to keep more than `_id`
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
}},
// Unwind the array
{ "$unwind": "$movies" },
// Build the array back with just `_id` values
{ "$group": {
"_id": "$_id",
"movies": { "$push": "$movies._id" }
}},
// Find the "set intersection" of the two arrays
{ "$project": {
"movies": {
"$size": {
"$setIntersection": [
[ 1, 2, 3 ],
"$movies"
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
This is still possible in earlier versions of MongoDB that do not have those operators, just using a few more steps:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document along with the "set" to match
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
"set": { "$cond": [ 1, [ 1, 2, 3 ], 0 ] }
}},
// Unwind both those arrays
{ "$unwind": "$movies" },
{ "$unwind": "$set" },
// Group back the count where both `_id` values are equal
{ "$group": {
"_id": "$_id",
"movies": {
"$sum": {
"$cond":[
{ "$eq": [ "$movies._id", "$set" ] },
1,
0
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
In Detail
That may be a bit to take in, so we can take a look at each stage and break those down to see what they are doing.
$match : You do not want to operate on every document in the collection so this is an opportunity to remove the items that are not possibly matches even if there still is more work to do to find the exact ones. So the obvious things are to exclude the same "user" and then only match the documents that have at least one of the same movies as was found for that "user".
The next thing that makes sense is to consider that when you want to match n entries then only documents that have a "movies" array that is larger than n-1 can possibly actually contain matches. The use of $and here looks funny and is not required specifically, but if the required matches were 4 then that actual part of the statement would look like this:
"$and": [
{ "movies": { "$not": { "$size": 1 } } },
{ "movies": { "$not": { "$size": 2 } } },
{ "movies": { "$not": { "$size": 3 } } }
]
So you basically "rule out" arrays that are not possibly long enough to have n matches. Noting here that this $size operator in the query form is different to $size for the aggregation framework. There is no way, for example, to use this with an inequality operator such as $gt, as its purpose is to specifically match the requested "size". Hence this query form, which specifies all of the possible sizes that are too small.
$project : There are a few purposes in this statement, of which some differ depending on the MongoDB version you have. Firstly, and optionally, a document copy is being kept under the _id value so that these fields are not modified by the rest of the steps. The other part here is keeping the "movies" array at the top of the document as a copy for the next stage.
What is also happening in the version presented for pre 2.6 versions is there is an additional array representing the _id values for the "movies" to match. The usage of the $cond operator here is just a way of creating a "literal" representation of the array. Funny enough, MongoDB 2.6 introduces an operator known as $literal to do exactly this without the funny way we are using $cond right here.
$unwind : To do anything further the movies array needs to be unwound as in either case it is the only way to isolate the existing _id values for the entries that need to be matched against the "set". So for the pre 2.6 version you need to "unwind" both of the arrays that are present.
$group : For MongoDB 2.6 and greater you are just grouping back to an array that only contains the _id values of the movies with the "ratings" removed.
Pre 2.6 since all values are presented "side by side" ( and with lots of duplication ) you are doing a comparison of the two values to see if they are the same. Where that is true, this tells the $cond operator statement to return a value of 1 or 0 where the condition is false. This is directly passed back through $sum to total up the number of matching elements in the array to the required "set".
$project: Where this is the different part for MongoDB 2.6 and greater is that since you have pushed back an array of the "movies" _id values you are then using $setIntersection to directly compare those arrays. As the result of this is an array containing the elements that are the same, this is then wrapped in a $size operator in order to determine how many elements were returned in that matching set.
$match: Is the final stage that has been implemented here which does the clear step of matching only those documents whose count of intersecting elements was greater than or equal to the required number.
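The core of the 2.6 version, stripped of pipeline syntax, is a set intersection count. A plain-JavaScript sketch using the sample data from above follows; it is not MongoDB code, commonMovieCount is a hypothetical helper name, and the user ids are shortened placeholders for the ObjectId values.

```javascript
// Plain-JavaScript model of the $setIntersection + $size steps
// (not MongoDB code). commonMovieCount is a hypothetical helper name.
function commonMovieCount(targetIds, movies) {
  const target = new Set(targetIds);
  // $unwind + $group/$push: reduce each movie to just its _id
  const ids = movies.map(m => m._id);
  // $setIntersection + $size: count the distinct ids present in both
  return new Set(ids.filter(id => target.has(id))).size;
}

// Shortened placeholder ids standing in for the ObjectId values
const otherUsers = [
  { _id: "user4691", movies: [{ _id: 1, rating: 5 }, { _id: 4, rating: 6 }, { _id: 2, rating: 7 }] },
  { _id: "user4692", movies: [{ _id: 2, rating: 5 }, { _id: 5, rating: 6 }, { _id: 6, rating: 7 }] }
];

// Final $match: keep users sharing at least 2 movies with [1, 2, 3]
const similar = otherUsers.filter(u => commonMovieCount([1, 2, 3], u.movies) >= 2);
console.log(similar.map(u => u._id)); // [ 'user4691' ]
```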
Final
That is basically how you do it. Prior to 2.6 is a bit clunkier and will require a bit more memory due to the expansion that is done by duplicating each array member that is found by all of the possible values of the set, but it still is a valid way to do this.
All you need to do is apply this with the greater n matching values to meet your conditions, and of course make sure your original user match has the required n possibilities. Otherwise just generate this on n-1 from the length of the "user's" array of "movies".
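When n grows, the list of excluded sizes for the initial $match is tedious to write by hand, so it can be generated. A minimal sketch, assuming the array field is called "movies" as in the examples; buildSizeExclusions is a hypothetical helper name.

```javascript
// Builds the list of { "$not": { "$size": k } } clauses for k = 1 .. n-1,
// i.e. every array length too short to hold n matches.
// buildSizeExclusions is a hypothetical helper name; the field name
// "movies" is taken from the examples above.
function buildSizeExclusions(n) {
  const clauses = [];
  for (let size = 1; size < n; size++) {
    clauses.push({ movies: { $not: { $size: size } } });
  }
  return clauses;
}

// For n = 4 this yields the three clauses shown in the $and example
console.log(JSON.stringify(buildSizeExclusions(4)));
```

The returned array would be supplied as the value of "$and" in the $match stage. For n = 1 the list is empty, in which case it is probably safest to leave the "$and" key out of the query altogether.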