MongoDB Nested Array Intersection Query

Thank you in advance for your help.
I have a MongoDB database structured like this:
{
    '_id' : ObjectId(...),
    'userID' : id,
    'movies' : [{
        'movieID' : movieID,
        'rating' : rating
    }]
}
My question is:
I want to search for a specific user, for example the one with 'userID' : 3, and get all of his movies. Then I want to find all the other users that have at least 15 movies with the same 'movieID' values, and from that group select only the users that also have one extra 'movieID' that I choose.
I have already tried aggregation and failed, and if I do single queries (getting all of one user's movies, then cycling through every other user's movies and comparing) it takes far too long.
Any ideas?
Thank you

There are a couple of ways to do this using the aggregation framework.
Just a simple set of data for example:
{
"_id" : ObjectId("538181738d6bd23253654690"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 2, "rating": 6 },
{ "_id": 3, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654691"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 4, "rating": 6 },
{ "_id": 2, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654692"),
"movies": [
{ "_id": 2, "rating": 5 },
{ "_id": 5, "rating": 6 },
{ "_id": 6, "rating": 7 }
]
}
Using the first "user" as an example, now you want to find if any of the other two users have at least two of the same movies.
For MongoDB 2.6 and upwards you can simply use the $setIntersection operator along with the $size operator:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document if you want to keep more than `_id`
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
}},
// Unwind the array
{ "$unwind": "$movies" },
// Build the array back with just `_id` values
{ "$group": {
"_id": "$_id",
"movies": { "$push": "$movies._id" }
}},
// Find the "set intersection" of the two arrays
{ "$project": {
"movies": {
"$size": {
"$setIntersection": [
[ 1, 2, 3 ],
"$movies"
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
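Run against the three sample documents with a required match of two, that pipeline should return only the second user, in a form like this (sketched from the sample data above):
{
    "_id" : {
        "_id" : ObjectId("538181738d6bd23253654691"),
        "movies" : [
            { "_id" : 1, "rating" : 5 },
            { "_id" : 4, "rating" : 6 },
            { "_id" : 2, "rating" : 7 }
        ]
    },
    "movies" : 2
}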
This is still possible in earlier versions of MongoDB that do not have those operators, just using a few more steps:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document along with the "set" to match
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
"set": { "$cond": [ 1, [ 1, 2, 3 ], 0 ] }
}},
// Unwind both those arrays
{ "$unwind": "$movies" },
{ "$unwind": "$set" },
// Group back the count where both `_id` values are equal
{ "$group": {
"_id": "$_id",
"movies": {
"$sum": {
"$cond":[
{ "$eq": [ "$movies._id", "$set" ] },
1,
0
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
In Detail
That may be a bit to take in, so we can take a look at each stage and break those down to see what they are doing.
$match : You do not want to operate on every document in the collection so this is an opportunity to remove the items that are not possibly matches even if there still is more work to do to find the exact ones. So the obvious things are to exclude the same "user" and then only match the documents that have at least one of the same movies as was found for that "user".
The next thing that makes sense is to consider that when you want to match n entries then only documents that have a "movies" array that is larger than n-1 can possibly actually contain matches. The use of $and here looks funny and is not required specifically, but if the required matches were 4 then that actual part of the statement would look like this:
"$and": [
{ "movies": { "$not": { "$size": 1 } } },
{ "movies": { "$not": { "$size": 2 } } },
{ "movies": { "$not": { "$size": 3 } } }
]
So you basically "rule out" arrays that are not possibly long enough to have n matches. Note here that this $size operator in the query form is different from $size in the aggregation framework. There is no way, for example, to use this with an inequality operator such as $gt, as its purpose is to specifically match the requested "size". Hence this query form specifies all of the possible sizes that are less than the required length.
$project : There are a few purposes in this statement, of which some differ depending on the MongoDB version you have. Firstly, and optionally, a document copy is being kept under the _id value so that these fields are not modified by the rest of the steps. The other part here is keeping the "movies" array at the top of the document as a copy for the next stage.
What also happens in the version presented for pre-2.6 releases is that an additional array is projected, representing the _id values of the "movies" to match. The usage of the $cond operator here is just a way of creating a "literal" representation of the array. Funnily enough, MongoDB 2.6 introduces an operator known as $literal to do exactly this, without the odd way we are using $cond right here.
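For example, on 2.6 that one field of the $project could instead be written with $literal, sketched as:
"set": { "$literal": [ 1, 2, 3 ] }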
$unwind : To do anything further the movies array needs to be unwound as in either case it is the only way to isolate the existing _id values for the entries that need to be matched against the "set". So for the pre 2.6 version you need to "unwind" both of the arrays that are present.
$group : For MongoDB 2.6 and greater you are just grouping back to an array that only contains the _id values of the movies with the "ratings" removed.
Pre 2.6, since all values are presented "side by side" (and with lots of duplication), you do a comparison of the two values to see if they are the same. Where that is true the $cond operator returns a value of 1, and 0 where the condition is false. This is passed directly through $sum to total up the number of elements in the array that match the required "set".
$project: This is where the approach differs for MongoDB 2.6 and greater. Since you have pushed back an array of the "movies" _id values, you then use $setIntersection to directly compare those arrays. As the result is an array containing only the elements that are the same, it is then wrapped in a $size operator in order to determine how many elements were returned in that matching set.
$match: This is the final stage implemented here, which does the clear step of matching only those documents whose count of intersecting elements was greater than or equal to the required number.
Final
That is basically how you do it. The pre-2.6 form is a bit clunkier and will require a bit more memory due to the expansion done by duplicating each array member against every value of the "set", but it is still a valid way to do this.
All you need to do is apply this with the larger n of matching values to meet your conditions, and of course make sure your original user match has the required n possibilities. Otherwise generate this on n-1 of the length of the "user's" array of "movies".
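To also enforce the "one extra 'movieID' that I choose" part of the question, one approach (a sketch only; extraMovieID is a hypothetical placeholder for the chosen id, not something defined above) is to fold that requirement into the first $match stage and leave the intersection count itself unchanged:
{ "$match": {
    "_id": { "$ne": ObjectId("538181738d6bd23253654690") },
    "$and": [
        { "movies._id": { "$in": [ 1, 2, 3 ] } },    // the source user's movie ids
        { "movies._id": extraMovieID },              // hypothetical: the extra movie they must also have
        { "movies": { "$not": { "$size": 1 } } }
        // ...plus the other $size exclusions up to n-1 as shown above
    ]
}}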

Related

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
{"_id": "42.abc",
"ts_utc": "2019-05-27T23:43:16.963Z"},
{"_id": "42.def",
"ts_utc": "2019-05-27T23:43:17.055Z"},
{"_id": "69.abc",
"ts_utc": "2019-05-27T23:43:17.147Z"},
{"_id": "69.def",
"ts_utc": "2019-05-27T23:44:02.427Z"}
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
{
"_id" : /^42\..*/
}
).sort(
{
"ts_utc" : -1.0
}
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format displayed above, you can split the _id into two parts (on the dot character) and use aggregation to find the max ts_utc for each first (numeric) part.
That way you can do it in one shot, instead of iterating per group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
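On the sample data this should yield one result per prefix group (order not guaranteed), something like:
{ "_id" : "42", "max" : "2019-05-27T23:43:17.055Z" }
{ "_id" : "69", "max" : "2019-05-27T23:44:02.427Z" }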
As @danh mentioned in the comments, the best approach is probably to add an auxiliary field to indicate the grouping. You can further index that auxiliary field to boost performance.
Here is an ad-hoc way to derive the field and get the latest result per grouping:
db.collection.aggregate([
{
"$addFields": {
"group": {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
$sort: {
ts_utc: -1
}
},
{
"$group": {
"_id": "$group",
"doc": {
"$first": "$$ROOT"
}
}
},
{
"$replaceRoot": {
"newRoot": "$doc"
}
}
])
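If you do persist such an auxiliary field, a rough sketch of backfilling and indexing it (assuming MongoDB 4.2+ for the pipeline-style update, and "group" as the chosen field name) might be:
// Store the prefix before the "." as its own field (4.2+ update-with-pipeline form)
db.collection.updateMany(
    {},
    [ { "$set": { "group": { "$arrayElemAt": [ { "$split": [ "$_id", "." ] }, 0 ] } } } ]
)
// Index it with the timestamp so the latest-per-group query can use the index
db.collection.createIndex({ "group": 1, "ts_utc": -1 })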

Mongo Sort by Count of Matches in Array

Let's say my test data is:
db.multiArr.insert({ "ID" : "fruit1", "Keys" : ["apple", "orange", "banana"] })
db.multiArr.insert({ "ID" : "fruit2", "Keys" : ["apple", "carrot", "banana"] })
To get an individual fruit like carrot I do:
db.multiArr.find({'Keys':{$in:['carrot']}})
When I do an $or query for carrot and banana, I see both records, fruit1 and then fruit2:
db.multiArr.find({ $or: [{'Keys':{$in:['carrot']}}, {'Keys':{$in:['banana']}}]})
The result should be fruit2 and then fruit1, because fruit2 has both carrot and banana.
To actually answer this: first you need to "calculate" the number of matches to the given condition, in order to "sort" the results with preference for the most matches on top.
For this you need the aggregation framework, which is what you use for "calculation" and "manipulation" of data in MongoDB:
db.multiArr.aggregate([
{ "$match": { "Keys": { "$in": [ "carrot", "banana" ] } } },
{ "$project": {
"ID": 1,
"Keys": 1,
"order": {
"$size": {
"$setIntersection": [ ["carrot", "banana"], "$Keys" ]
}
}
}},
{ "$sort": { "order": -1 } }
])
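Against the two sample documents, this should sort fruit2 first, with output roughly like:
{ "_id" : ObjectId(...), "ID" : "fruit2", "Keys" : [ "apple", "carrot", "banana" ], "order" : 2 }
{ "_id" : ObjectId(...), "ID" : "fruit1", "Keys" : [ "apple", "orange", "banana" ], "order" : 1 }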
On MongoDB releases older than version 3, you can use the longer form:
db.multiArr.aggregate([
{ "$match": { "Keys": { "$in": [ "carrot", "banana" ] } } },
{ "$unwind": "$Keys" },
{ "$group": {
"_id": "$_id",
"ID": { "$first": "$ID" },
"Keys": { "$push": "$Keys" },
"order": {
"$sum": {
{ "$cond": [
{ "$or": [
{ "$eq": [ "$Keys", "carrot" ] },
{ "$eq": [ "$Keys", "banana" ] }
]},
1,
0
]}
}
}
}},
{ "$sort": { "order": -1 } }
])
In either case the function here is to first match the possible documents to the conditions by providing a "list" of arguments with $in. Once the results are obtained you want to "count" the number of matching elements in the array to the "list" of possible values provided.
In the modern form the $setIntersection operator compares the two "lists" returning a new array that only contains the "unique" matching members. Since we want to know how many matches that was, we simply return the $size of that list.
In older versions, you pull apart the document array with $unwind in order to perform operations on it since older versions lacked the newer operators that worked with arrays without alteration. The process then looks at each value individually and if either expression in $or matches the possible values then the $cond ternary returns a value of 1 to the $sum accumulator, otherwise 0. The net result is the same "count of matches" as shown for the modern version.
The final thing is simply to $sort the results based on the "count of matches" that was returned, so the most matches are on "top". This is "descending order" and therefore you supply -1 to indicate that.
Addendum concerning $in and arrays
You are misunderstanding a couple of things about MongoDB queries for starters. The $in operator is actually intended for a "list" of arguments like this:
{ "Keys": { "$in": [ "carrot", "banana" ] } }
Which is essentially the shorthand way of saying "Match either 'carrot' or 'banana' in the property 'Keys'". And could even be written in long form like this:
{ "$or": [{ "Keys": "carrot" }, { "Keys": "banana" }] }
Which really should lead you to see that if it were a "singular" match condition, you would simply supply the value to match against the property:
{ "Keys": "carrot" }
So that should cover the misconception that you use $in to match a property that is an array within a document. Rather the "reverse" case is the intended usage where instead you supply a "list of arguments" to match a given property, be that property an array or just a single value.
The MongoDB query engine makes no distinction between a single value or an array of values in an equality or similar operation.
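So against the sample data, the singular form above should match only the document that contains "carrot" in its array:
{ "_id" : ObjectId(...), "ID" : "fruit2", "Keys" : [ "apple", "carrot", "banana" ] }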

MongoDB sum arrays from multiple documents on a per-element basis

I have the following document structure (simplified for this example)
{
_id : ObjectId("sdfsdf"),
result : [1, 3, 5, 7, 9]
},
{
_id : ObjectId("asdref"),
result : [2, 4, 6, 8, 10]
}
I want to get the sum of those result arrays, but not a total sum, instead a new array corresponding to the sum of the original arrays on an element basis, i.e.
result : [3, 7, 11, 15, 19]
I have searched through the myriad questions here and a few come close, but I can't quite get there.
I can get the sum of each array fine
aggregate(
[
{
"$unwind" : "$result"
},
{
"$group": {
"_id": "$_id",
"results" : { "$sum" : "$result"}
}
}
]
)
which gives me
[ { _id: sdfsdf, results: 25 },
{ _id: asdref, results: 30 } ]
but I can't figure out how to get the sum of each element
You can use includeArrayIndex if you have MongoDB 3.2 or newer.
You then need to change the $unwind stage, so your code should look like this:
.aggregate(
[
{
"$unwind" : { path: "$result", includeArrayIndex: "arrayIndex" }
},
{
"$group": {
"_id": "$arrayIndex",
"results" : { "$sum" : "$result"}
}
},
{
$sort: { "_id": 1}
},
{
"$group":{
"_id": null,
"results":{"$push":"$results"}
}
},
{
"$project": {"_id":0,"results":1}
}
]
)
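Against the two sample documents, this should produce the element-wise sums:
{ "results" : [ 3, 7, 11, 15, 19 ] }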
There is an alternate approach to this, though mileage may vary on how practical it is. It involves using $push to create an "array of arrays" and then applying $reduce, as introduced in MongoDB 3.4, to $sum those array elements into a single array result:
db.collection.aggregate([
{ "$group": {
"_id": null,
"result": { "$push": "$result" }
}},
{ "$addFields": {
"result": {
"$reduce": {
"input": "$result",
"initialValue": [],
"in": {
"$map": {
"input": {
"$zip": {
"inputs": [ "$$this", "$$value" ],
"useLongestLength": true
}
},
"as": "el",
"in": { "$sum": "$$el" }
}
}
}
}
}}
])
The real trick there is that in the "input" to $map we use the $zip operation, which creates a transposed list of arrays "pairwise" from the two array inputs.
In the first iteration this takes the empty array supplied as the initialValue to $reduce and returns the "zipped" output with respect to the first document found, as in:
[ [0,1], [0,3], [0,5], [0,7], [0,9] ]
So useLongestLength pads the empty array out to the length of the current array and "zips" the two together as above (strictly speaking the padding values are null unless a defaults specification is given, but $sum ignores them, so the effect is the same as zeros).
Processing with $map, each element is subject to $sum which "reduces" the returned results as:
[ 1, 3, 5, 7, 9 ]
On the second iteration, the next entry in the "array of arrays" would be picked up and processed by $zip along with the previous "reduced" content as:
[ [1,2], [3,4], [5,6], [7,8], [9,10] ]
Which is then subject to the $map for each element using $sum again to produce:
[ 3, 7, 11, 15, 19 ]
And since there were only two arrays pushed into the "array of arrays" that is the end of the operation, and the final result. But otherwise the $reduce would keep iterating until all array elements of the input were processed.
So in some cases this would be the more performant option and what you should be using. But it is noted that particularly when using a null for $group you are asking "every" document to $push content into an array for the result.
This could be a cause of breaking the BSON Limit in extreme cases, and therefore when aggregating positional array content over large results, it is probably best to use $unwind with the includeArrayIndex option instead.
Or indeed, take a good look at the process: if the "positional array" in question is actually the result of some other "aggregation operation", then you should rather be looking at the previous pipeline stages that were used to create it. And if you want those positions "aggregated further" into new totals, then you should in fact do that "before" the positional result is obtained.

Mongodb Aggregate a $slice to get an element in exact position from a nested array

I would like to retrieve a value from a nested array where it exists at an exact position within the array.
I want to create name value pairs by doing $slice[0,1] for the name and then $slice[1,1] for the value.
Before I attempt to use aggregate, I want to attempt a find within a nested array. I can do what I want on a single depth array in a document as shown below:
{
"_id" : ObjectId("565cc5261506995581569439"),
"a" : [
4,
2,
8,
71,
21
]
}
I apply the following: db.getCollection('anothertest').find({},{_id:0, a: {$slice:[0,1]}})
and I get:
{
"a" : [
4
]
}
This is fantastic. However, what if the array I want to $slice [0,1] is located within the document at objectRawOriginData.Reports.Rows.Rows.Cells?
If I can first of all FIND then I want to apply the same as an AGGREGATE.
Your best bet here, especially if your application is not yet ready for release, is to hold off until MongoDB 3.2 for deployment, or at least start working with a release candidate in the interim. The main reason is that the "projection" $slice does not work with the aggregation framework, nor do other forms of array matching projection. But this has been addressed for the upcoming release.
This is going to give you a couple of new operators, being $slice and even $arrayElemAt which can be used to address array elements by position in the aggregation pipeline.
Either:
db.getCollection('anothertest').aggregate([
{ "$project": {
"_id": 0,
"a": { "$slice": ["$a",0,1] }
}}
])
Which returns the familiar:
{ "a" : [ 4 ] }
Or:
db.getCollection('anothertest').aggregate([
{ "$project": {
"_id": 0,
"a": { "$arrayElemAt": ["$a", 0] }
}}
])
Which is just the element and not an array:
{ "a" : 4 }
Until that release becomes available other than in release candidate form, the currently available operators make it quite easy for the "first" element of the array:
db.getCollection('anothertest').aggregate([
{ "$unwind": "$a" },
{ "$group": {
"_id": "$_id",
"a": { "$first": "$a" }
}}
])
Through use of the $first operator after $unwind. But getting another indexed position becomes horribly iterative:
db.getCollection('anothertest').aggregate([
{ "$unwind": "$a" },
// Keeps the first element
{ "$group": {
"_id": "$_id",
"first": { "$first": "$a" },
"a": { "$push": "$a" }
}},
{ "$unwind": "$a" },
// Removes the first element
{ "$redact": {
"$cond": {
"if": { "$ne": [ "$first", "$a" ] },
"then": "$$KEEP",
"else": "$$PRUNE"
}
}},
// Top is now the second element
{ "$group": {
"_id": "$_id",
"second": { "$first": "$a" }
}}
])
And so on, plus a fair amount of extra handling to deal with arrays that might be shorter than the "nth" element you are looking for. So "possible", but ugly and not performant.
Also note that this is "not really" working with "indexed positions" and is purely matching on values, so duplicate values would easily be removed unless there was another unique identifier per array element to work with. The future $unwind also has the ability to project an array index, which is handy for other purposes, but the other operators are more useful for this specific case than that feature.
So for my money I would wait till you had the feature available to be able to integrate this in an aggregation pipeline, or at least re-consider why you believe you need it and possibly design around it.
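If you do adopt 3.2 and the array really is at the nested location from the question, a sketch (assuming the intermediate levels such as Reports and Rows are plain sub-documents; if they are themselves arrays you would need additional handling such as $unwind stages) might look like:
db.getCollection('anothertest').aggregate([
    { "$project": {
        "_id": 0,
        "cell": { "$arrayElemAt": [ "$objectRawOriginData.Reports.Rows.Rows.Cells", 0 ] }
    }}
])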

How to sort by 'value' of a specific key within a property stored as an array with k-v pairs in mongodb

I have a mongodb collection, let's call it rows containing documents with the following general structure:
{
"setid" : 154421,
"date" : ISODate("2014-02-22T14:06:48.229Z"),
"version" : 2,
"data" : [
{
"k" : "name",
"v" : "ryan"
},
{
"k" : "points",
"v" : "375"
},
{
"k" : "email",
"v" : "ryan#123.com"
}
],
}
There is no guarantee what values of k and v might populate the "data" property for any particular document (e.g. other documents might have 5 k-v pairs with different key names). The only rule is that documents with the same setid have the same k-v pairs (i.e. the rows collection might hold 100 other documents with setid = 154421 that have the same set of 3 keys in the data property: "name", "points", "email", each with their own respective values).
How would one, with this setup, construct a query to retrieve all rows with a particular setid, sorted by points? I need, in effect, some way of saying 'sort by the field data.v where the value of k == points', or something like that.
Something like this:
db.rows.find({setid:154421},{$sort:{'data.v',-1}, {$where: k:'points'}}})
I know this is the incorrect syntax, but I'm just taking a stab at it to illustrate my point.
Is it possible?
Assuming that what you want would be all the documents that have the "points" value as a "key" in the array, and then sort on the "value" for that "key", then this is a little out of scope for the .find() method.
The reason being, if you did something like this:
db.collection.find({
"setid": 154421, "data.k": "points" }
).sort({ "data.v" : -1 })
the problem is that even though the matched elements do have the matching key of "points", there is no way of telling which data.v you are referring to for the sort. Also, a sort within .find() results will not do something like this:
db.collection.find({
"setid": 154421, "data.k": "points" }
).sort({ "data.$.v" : -1 })
That would be trying to use a positional operator within a sort, essentially telling it which element's value of v to use. But this is not supported and not likely to be, most likely because that "index" value would be different in every document.
But this kind of selective sorting can be done with the use of .aggregate().
db.collection.aggregate([
// Actually shouldn't need the setid
{ "$match": { "data": {"$elemMatch": { "k": "points" } } } },
// Saving the original document before you filter
{ "$project": {
"doc": {
"_id": "$_id",
"setid": "$setid",
"date": "$date",
"version": "$version",
"data": "$data"
},
"data": "$data"
}},
// Unwind the array
{ "$unwind": "$data" },
// Match the "points" entries, so filtering to only these
{ "$match": { "data.k": "points" } },
// Sort on the value, presuming you want the highest
{ "$sort": { "data.v": -1 } },
// Restore the document
{ "$project": {
"setid": "$doc.setid",
"date": "$doc.date",
"version": "$doc.version",
"data": "$doc.data"
}}
])
Of course that presumes the data array only has the one element that has the key points. If there were more than one, you would need to $group before the sort like this:
// Group to remove the duplicates and get highest
{ "$group": {
"_id": "$doc",
"value": { "$max": "$data.v" }
}},
// Sort on the value
{ "$sort": { "value": -1 } },
// Restore the document
{ "$project": {
"_id": "$_id._id",
"setid": "$_id.setid",
"date": "$_id.date",
"version": "$_id.version",
"data": "$_id.data"
}}
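Put together, the variant that copes with multiple "points" entries per document would look something like this (a sketch assembled from the stages above):
db.collection.aggregate([
    { "$match": { "data": { "$elemMatch": { "k": "points" } } } },
    { "$project": {
        "doc": {
            "_id": "$_id",
            "setid": "$setid",
            "date": "$date",
            "version": "$version",
            "data": "$data"
        },
        "data": "$data"
    }},
    { "$unwind": "$data" },
    { "$match": { "data.k": "points" } },
    { "$group": {
        "_id": "$doc",
        "value": { "$max": "$data.v" }
    }},
    { "$sort": { "value": -1 } },
    { "$project": {
        "_id": "$_id._id",
        "setid": "$_id.setid",
        "date": "$_id.date",
        "version": "$_id.version",
        "data": "$_id.data"
    }}
])
Note that "v" holds string values in the sample documents, so $max and the sort compare lexicographically; store numeric points as numbers if you want numeric ordering.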
So there is one usage of .aggregate() in order to do some complex sorting on documents and still return the original document result in full.
Do some more reading on aggregation operators and the general framework. It's a useful tool to learn that takes you beyond .find().