Mongo Sort by Count of Matches in Array - mongodb

Let's say my test data is
db.multiArr.insert({"ID" : "fruit1","Keys" : ["apple", "orange", "banana"]})
db.multiArr.insert({"ID" : "fruit2","Keys" : ["apple", "carrot", "banana"]})
To get an individual fruit like carrot I do
db.multiArr.find({'Keys':{$in:['carrot']}})
When I do an or query for carrot and banana, I see both records, fruit1 and then fruit2
db.multiArr.find({ $or: [{'Keys':{$in:['carrot']}}, {'Keys':{$in:['banana']}}]})
Result of the output should be fruit2 and then fruit1, because fruit2 has both carrot and banana

To actually answer this first, you need to "calculate" the number of matches to the given condition in order to "sort" the results to return with the preference to the most matches on top.
For this you need the aggregation framework, which is what you use for "calculation" and "manipulation" of data in MongoDB:
db.multiArr.aggregate([
{ "$match": { "Keys": { "$in": [ "carrot", "banana" ] } } },
{ "$project": {
"ID": 1,
"Keys": 1,
"order": {
"$size": {
"$setIntersection": [ ["carrot", "banana"], "$Keys" ]
}
}
}},
{ "$sort": { "order": -1 } }
])
On a MongoDB version older than 2.6, which lacks $setIntersection and the aggregation $size operator, you can do the longer form:
db.multiArr.aggregate([
{ "$match": { "Keys": { "$in": [ "carrot", "banana" ] } } },
{ "$unwind": "$Keys" },
{ "$group": {
"_id": "$_id",
"ID": { "$first": "$ID" },
"Keys": { "$push": "$Keys" },
"order": {
"$sum": {
"$cond": [
{ "$or": [
{ "$eq": [ "$Keys", "carrot" ] },
{ "$eq": [ "$Keys", "banana" ] }
]},
1,
0
]
}
}
}},
{ "$sort": { "order": -1 } }
])
In either case the function here is to first match the possible documents to the conditions by providing a "list" of arguments with $in. Once the results are obtained you want to "count" the number of matching elements in the array to the "list" of possible values provided.
In the modern form the $setIntersection operator compares the two "lists" returning a new array that only contains the "unique" matching members. Since we want to know how many matches that was, we simply return the $size of that list.
In older versions, you pull apart the document array with $unwind in order to perform operations on it since older versions lacked the newer operators that worked with arrays without alteration. The process then looks at each value individually and if either expression in $or matches the possible values then the $cond ternary returns a value of 1 to the $sum accumulator, otherwise 0. The net result is the same "count of matches" as shown for the modern version.
The final thing is simply to $sort the results based on the "count of matches" that was returned so the most matches is on "top". This is "descending order" and therefore you supply the -1 to indicate that.
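To see the counting logic outside of the pipeline, here is a minimal plain JavaScript sketch of what the $match / $setIntersection / $size / $sort combination works out. It runs against in-memory copies of the two sample documents rather than a collection, and the names countMatches and sortByMatches are made up for illustration, not anything MongoDB provides.

```javascript
// Hypothetical helper: number of distinct wanted values present in keys,
// mirroring { $size: { $setIntersection: [ wanted, "$Keys" ] } }.
function countMatches(wanted, keys) {
  const present = new Set(keys);
  return new Set(wanted.filter((w) => present.has(w))).size;
}

// Hypothetical helper: attach the count as "order", keep documents with at
// least one match (the $in condition), then sort descending (the -1).
function sortByMatches(docs, wanted) {
  return docs
    .map((doc) => ({ ...doc, order: countMatches(wanted, doc.Keys) }))
    .filter((doc) => doc.order > 0)
    .sort((a, b) => b.order - a.order);
}

const sample = [
  { ID: "fruit1", Keys: ["apple", "orange", "banana"] },
  { ID: "fruit2", Keys: ["apple", "carrot", "banana"] },
];

// fruit2 matches both values, fruit1 only one, so fruit2 is listed first.
console.log(sortByMatches(sample, ["carrot", "banana"]).map((d) => d.ID));
```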
Addendum concerning $in and arrays
You are misunderstanding a couple of things about MongoDB queries for starters. The $in operator is actually intended for a "list" of arguments like this:
{ "Keys": { "$in": [ "carrot", "banana" ] } }
Which is essentially the shorthand way of saying "Match either 'carrot' or 'banana' in the property 'Keys'". And could even be written in long form like this:
{ "$or": [{ "Keys": "carrot" }, { "Keys": "banana" }] }
Which really should lead you to see that for a "singular" match condition, you simply supply the value to match to the property:
{ "Keys": "carrot" }
So that should cover the misconception that you use $in to match a property that is an array within a document. Rather the "reverse" case is the intended usage where instead you supply a "list of arguments" to match a given property, be that property an array or just a single value.
The MongoDB query engine makes no distinction between a single value or an array of values in an equality or similar operation.
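As a rough illustration of that last point, the following plain JavaScript sketch mimics how a simple equality or $in condition treats a scalar field and an array field alike. It is a simplified stand-in only (it ignores whole-array equality, nested documents and type ordering), and matchesEquality and matchesIn are hypothetical names.

```javascript
// Hypothetical stand-in for { Keys: value }: an array field matches when
// any element equals the value, a scalar field when it equals the value.
function matchesEquality(fieldValue, value) {
  return Array.isArray(fieldValue)
    ? fieldValue.includes(value)
    : fieldValue === value;
}

// Hypothetical stand-in for { Keys: { $in: values } }: true when any of
// the listed values matches under the rule above.
function matchesIn(fieldValue, values) {
  return values.some((v) => matchesEquality(fieldValue, v));
}

console.log(matchesIn(["apple", "carrot", "banana"], ["carrot"]));
console.log(matchesIn("carrot", ["carrot", "banana"]));
console.log(matchesEquality(["apple", "orange", "banana"], "carrot"));
```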

Related

Query by field value, not value in field array

The following snippet shows three queries:
find all the documents
find the documents containing a field a containing either the string "x" or an array containing the string "x"
find the documents containing a field a containing an array containing the string "x"
I was not able to find the documents containing a field a containing the string "x", not inside an array.
> db.stuff.find({},{_id:0})
{ "a" : "x" }
{ "a" : [ "x" ] }
> db.stuff.find({a:"x"},{_id:0})
{ "a" : "x" }
{ "a" : [ "x" ] }
> db.stuff.find({a:{$elemMatch:{$eq:"x"}}},{_id:0})
{ "a" : [ "x" ] }
>
MongoDB basically does not care if the data at a "given path" is actually in an array or not. If you want to make the distinction, then you need to "tell it that":
db.stuff.find({ "a": "x", "$where": "return !Array.isArray(this.a)" })
This is what $where adds to the bargain, where you can supply a condition that explicitly asks "is this an array" via Array.isArray() in JavaScript evaluation. And the JavaScript NOT ! assertion reverses the logic.
An alternate approach is to add the $exists check:
db.stuff.find({ "a": "x", "a.0": { "$exists": false } })
Which also essentially asks "is this an array" by looking for the first element index. So the "reverse" false case means "this is not an array".
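Both of those checks amount to the same two-step test, which can be sketched in plain JavaScript against the two sample documents. This is only an in-memory approximation of the query semantics, and findScalarOnly is a made-up name.

```javascript
const stuff = [{ a: "x" }, { a: ["x"] }];

// Hypothetical in-memory version of: match "a" against the value (scalar or
// array element), then discard the documents where "a" is an array.
function findScalarOnly(docs, value) {
  return docs
    .filter((doc) =>
      Array.isArray(doc.a) ? doc.a.includes(value) : doc.a === value
    )
    .filter((doc) => !Array.isArray(doc.a));
}

console.log(JSON.stringify(findScalarOnly(stuff, "x")));
```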
Or even as you note you can use $elemMatch to select only the array, but "negate" that using $not:
db.stuff.find({ "a": { "$not": { "$elemMatch": { "$eq": "x" } } } })
Though probably "not" the best of options since that also "negates index usage", which the other examples all strive to avoid by at least including "one" positive condition for a match. So it's for the best to include the "implicit AND" by combining arguments:
db.stuff.find({
"a": { "$eq": "x", "$not": { "$elemMatch": { "$eq": "x" } } }
})
Or for "aggregation" which does not support $where, you can test using the $isArray aggregation operator should your MongoDB version ( 3.2 or greater ) support it:
db.stuff.aggregate([
{ "$match": { "a": "x" } },
{ "$redact": {
"$cond": {
"if": { "$not": { "$isArray": "$a" } },
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
])
Noting that it is good practice to supply "regular" query conditions as well where possible, and in all cases.
Also noting that querying the BSON $type does not typically work in this case, since the "contents" of the array itself are in fact a "string", which is what the $type operator is going to consider, and thus not report that such an array is in fact an array.

MongoDB: Why $literal required ? And where it can be used?

I have gone through MongoDB $literal in the Aggregation framework, but I don't understand where it could be used. More importantly, why is it required?
Example from official MongoDB documentation,
db.records.aggregate( [
{ $project: { costsOneDollar: { $eq: [ "$price", { $literal: "$1" } ] } } }
])
Instead of the above example using $literal, why can't I use as below ?
db.records.aggregate( [
{ $project: { costsOneDollar: { $eq: [ "$price", "$1" ] } } }
] )
Also provide some other example which shows the best(or effective) usage of $literal.
For your basic case I think the documentation is fairly self explanatory:
In expression, the dollar sign $ evaluates to a field path; i.e. provides access to the field. For example, the $eq expression $eq: [ "$price", "$1" ] performs an equality check between the value in the field named price and the value in the field named 1 in the document.
So since $ is reserved for evaluation of field path values within the document, then this would be considered to actually be looking for a "field" named 1 within the document. So the actual comparison would likely be between the field named "price" and a field named "1", and since there is no field named "1" this would be treated as null and therefore false for every document.
On the other hand where the field "price" actually has a value equal to "$1", then the usage of $literal allows that "value" ( and not the field path reference ) to be considered. Hence "literal".
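The difference can be shown with a toy resolver in plain JavaScript. This is a greatly simplified sketch and not how MongoDB is implemented: it only handles top-level field paths and the $literal wrapper, and resolve and evalEq are hypothetical names.

```javascript
// Hypothetical resolver: "$name" strings are read as field paths,
// { $literal: v } is returned untouched, anything else is a plain value.
function resolve(expr, doc) {
  if (expr !== null && typeof expr === "object" && "$literal" in expr) {
    return expr.$literal; // taken as-is, never interpreted
  }
  if (typeof expr === "string" && expr.startsWith("$")) {
    const value = doc[expr.slice(1)]; // top-level fields only
    return value === undefined ? null : value; // missing field treated as null
  }
  return expr;
}

// Hypothetical stand-in for { $eq: [ left, right ] }.
function evalEq([left, right], doc) {
  return resolve(left, doc) === resolve(right, doc);
}

const record = { price: "$1" };

// "$1" is read as a field named "1", which does not exist.
console.log(evalEq(["$price", "$1"], record));
// With $literal the string "$1" itself is compared.
console.log(evalEq(["$price", { $literal: "$1" }], record));
```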
The operator has actually been around for some time ( since MongoDB 2.2 actually ) but under the guise of $const, which though not documented is still the basic operator, and $literal is really just an "alias" for that.
The usage mainly is and always has been to use where an expression is required to have some "specific value" as instructed within the pipeline. Take this simple statement:
{ "$project": { "myField": "one" } }
So for any number of reasons you might want to do that, and basically return a "literal" value in such a statement. But if you tried, it would result in an error as it essentially does not resolve to either a "field path" or a boolean condition for field selection, as is required here. So if you instead use:
{ "$project": { "myField": { "$literal": "one" } } }
Then you have "myField" with a value of "one" just like you asked for.
Other usages are more historic, such as:
{ "$project": { "array": { "$literal": ["A","B","C" ] } } },
{ "$unwind": "$array" },
{ "$group": {
"_id": "$_id",
"trans": { "$push": {
"$cond": [
{ "$eq": [ "$array", "A" ] },
"$fieldA",
{ "$cond": [
{ "$eq": [ "$array", "B" ] },
"$fieldB",
"$fieldC"
]}
]
}}
}}
Which might more modernly be replaced with something like:
{ "$project": {
"trans": {
"$map": {
"input": ["A","B","C"],
"as": "el",
"in": {
"$cond": [
{ "$eq": [ "$$el", "A" ] },
"$fieldA",
{ "$cond": [
{ "$eq": [ "$$el", "B" ] },
"$fieldB",
"$fieldC"
]}
]
}
}
}
}}
As a construct to move selected fields into an array based on position, with the difference being that as "array" and a field assignment the $literal is necessary, but as the "input" argument the plain array notation is just fine.
So the general cases are:
Where something reserved such as $ is needed as the value to match
Where there is a specific value to inject as a field assignment, and not as an argument to another operator expression.
The $1 example you give would try and compare the price field with the 1 field. By specifying the $literal operator, you're telling MongoDB that it is the exact string "$1". The same might be true if you wanted to use a MongoDB function name as a field name in your code, or even using a query snippet as a field value.

How to concatenate all values and find specific substring in Mongodb?

I have json document like this:
{
"A": [
{
"C": "abc",
"D": "de"
},
{
"C": "fg",
"D": "hi"
}
]
}
I would like to check whether "A" contains the string ef or not:
first concatenate all values into abcdefghi, then search for ef
In XML, XPATH it would be something like:
//A[contains(., 'ef')]
Is there any similar query in Mongodb?
All options are pretty horrible for this type of search, but there are a few approaches you can take. Please note though that the end case here is likely the best solution, but I present the options in order to illustrate the problem.
If your keys in the array "A" are consistently defined and always contained an array, you would be searching like this:
db.collection.aggregate([
// Filter the documents containing your parts
{ "$match": {
"$and": [
{ "$or": [
{ "A.C": /e/ },
{ "A.D": /e/ }
]},
{"$or": [
{ "A.C": /f/ },
{ "A.D": /f/ }
]}
]
}},
// Keep the original form and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"A": "$A"
},
"A": 1
}},
// Unwind the array
{ "$unwind": "$A" },
// Join the two fields and push to a single array
{ "$group": {
"_id": "$_id",
"joined": { "$push": {
"$concat": [ "$A.C", "$A.D" ]
}}
}},
// Copy the array
{ "$project": {
"C": "$joined",
"D": "$joined"
}},
// Unwind both arrays
{ "$unwind": "$C" },
{ "$unwind": "$D" },
// Join the copies and test if they are the same
{ "$project": {
"joined": { "$concat": [ "$C", "$D" ] },
"same": { "$eq": [ "$C", "$D" ] },
}},
// Discard the "same" elements and search for the required string
{ "$match": {
"same": false,
"joined": { "$regex": "ef" }
}},
// Project the original form of the matching documents
{ "$project": {
"_id": "$_id._id",
"A": "$_id.A"
}}
])
So apart from the horrible $regex matching there are a few hoops to go through in order to get the fields "joined" in order to again search for the string in sequence. Also note the reverse joining that is possible here that could possibly produce a false positive. Currently there would be no simple way to avoid that reverse join or otherwise filter it, so there is that to consider.
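To make the "reverse joining" concern concrete, here is a small plain JavaScript sketch of the join steps working on in-memory documents. The function names and the second sample document are invented for illustration. Like the pipeline, it joins every pair of differing elements in both orders.

```javascript
// Hypothetical in-memory version of the $concat / double $unwind steps:
// join C and D per element, then join every pair of differing elements.
function pairwiseJoins(doc) {
  const joined = doc.A.map((el) => el.C + el.D);
  const pairs = [];
  for (const c of joined) {
    for (const d of joined) {
      if (c !== d) pairs.push(c + d); // the "same": false filter
    }
  }
  return pairs;
}

// Hypothetical stand-in for the final $regex match on the joined string.
function pipelineMatches(doc, needle) {
  return pairwiseJoins(doc).some((s) => s.includes(needle));
}

const genuine = { A: [{ C: "abc", D: "de" }, { C: "fg", D: "hi" }] };
// Invented document: "ef" only appears when the elements are joined in reverse.
const reversedOnly = { A: [{ C: "fab", D: "c" }, { C: "xy", D: "e" }] };

console.log(pipelineMatches(genuine, "ef"));
console.log(pairwiseJoins(reversedOnly));
console.log(pipelineMatches(reversedOnly, "ef"));
```

The second document is reported as a match even though its values concatenated in array order do not contain the string, which is the false positive described above.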
Another approach is to basically run everything through arbitrary JavaScript. The mapReduce method can be your vehicle for this. Here you can be a bit looser with the types of data that can be contained in "A" and try to tie in some more conditional matching to attempt to reduce the set of documents you are working on:
db.collection.mapReduce(
function () {
var joined = "";
if ( Object.prototype.toString.call( this.A ) === '[object Array]' ) {
this.A.forEach(function(doc) {
for ( var k in doc ) {
joined += doc[k];
}
});
} else {
joined = this.A; // presuming this is just a string
}
var id = this._id;
delete this["_id"];
if ( joined.match(/ef/) )
emit( id, this );
},
function(){}, // will not reduce
{
"query": {
"$or": [
{ "A": /ef/ },
{ "$and": [
{ "$or": [
{ "A.C": /e/ },
{ "A.D": /e/ }
]},
{"$or": [
{ "A.C": /f/ },
{ "A.D": /f/ }
]}
] }
]
},
"out": { "inline": 1 }
}
);
So you can use that with whatever arbitrary logic to search the contained objects. This one just differentiates between "arrays" and presumes otherwise a string, allowing the additional part of the query to just search for the matching "string" element first, and which is a "short circuit" evaluation.
But really at the end of the day, the best approach is to simply have the data present in your document, and you would have to maintain this yourself as you update the document contents:
{
"A": [
{
"C": "abc",
"D": "de"
},
{
"C": "fg",
"D": "hi"
}
],
"search": "abcdefghi"
}
So that is still going to invoke a horrible usage of $regex type queries but at least this avoids ( or rather shifts to writing the document ) the overhead of "joining" the elements in order to effect the search for your desired string.
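One way to maintain such a field is to rebuild it in application code every time the array is written. A minimal sketch, assuming each element only has the "C" and "D" keys and that they are joined in that order. buildSearch is a hypothetical helper name.

```javascript
// Hypothetical helper: join C then D of every element, in array order.
function buildSearch(A) {
  return A.map((el) => el.C + el.D).join("");
}

const doc = {
  A: [
    { C: "abc", D: "de" },
    { C: "fg", D: "hi" },
  ],
};

// Recompute the stored field before saving the document.
doc.search = buildSearch(doc.A);
console.log(doc.search);
```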
Where this eventually leads is that a "full blown" text search solution, and that means an external one at this time as opposed to the text search facilities in MongoDB, is probably going to be your best performance option.
Either using the "pre-stored" approach in creating your "joined" field or otherwise where supported ( Solr is one solution that can do this ) have a "computed field" in this text index that is created when indexing document content.
At any rate, those are the approaches and the general point of the problem. This is not XPath searching, nor is there some "XPath like" view of an entire collection in this sense, so you are best suited to structuring your data towards the methods that are going to give you the best performance.
With all of that said, your sample here is a fairly contrived example, and if you had an actual use case for something "like" this, then that actual case may make a very interesting question indeed. Actual cases generally have different solutions than the contrived ones. But now you have something to consider.

MongoDB Nested Array Intersection Query

and thank you in advance for your help.
I have a mongoDB database structured like this:
{
'_id' : objectID(...),
'userID' : id,
'movies' : [{
'movieID' : movieID,
'rating' : rating
}]
}
My question is:
I want to search for a specific user that has 'userID' : 3, for example, and get all his movies. Then I want to get all the other users that have at least 15 movies with the same 'movieID'. Then, from that group, I want to select only the users that have those 15 movies in common and also have one extra 'movieID' that I choose.
I already tried aggregation, but failed, and if I do single queries, like getting all the movies from a user and then cycling through every user's movies and comparing them, it takes a long time.
Any ideas?
Thank you
There are a couple of ways to do this using the aggregation framework
Just a simple set of data for example:
{
"_id" : ObjectId("538181738d6bd23253654690"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 2, "rating": 6 },
{ "_id": 3, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654691"),
"movies": [
{ "_id": 1, "rating": 5 },
{ "_id": 4, "rating": 6 },
{ "_id": 2, "rating": 7 }
]
},
{
"_id" : ObjectId("538181738d6bd23253654692"),
"movies": [
{ "_id": 2, "rating": 5 },
{ "_id": 5, "rating": 6 },
{ "_id": 6, "rating": 7 }
]
}
Using the first "user" as an example, now you want to find if any of the other two users have at least two of the same movies.
For MongoDB 2.6 and upwards you can simply use the $setIntersection operator along with the $size operator:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document if you want to keep more than `_id`
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
}},
// Unwind the array
{ "$unwind": "$movies" },
// Build the array back with just `_id` values
{ "$group": {
"_id": "$_id",
"movies": { "$push": "$movies._id" }
}},
// Find the "set intersection" of the two arrays
{ "$project": {
"movies": {
"$size": {
"$setIntersection": [
[ 1, 2, 3 ],
"$movies"
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
This is still possible in earlier versions of MongoDB that do not have those operators, just using a few more steps:
db.users.aggregate([
// Match the possible documents to reduce the working set
{ "$match": {
"_id": { "$ne": ObjectId("538181738d6bd23253654690") },
"movies._id": { "$in": [ 1, 2, 3 ] },
"$and": [
{ "movies": { "$not": { "$size": 1 } } }
]
}},
// Project a copy of the document along with the "set" to match
{ "$project": {
"_id": {
"_id": "$_id",
"movies": "$movies"
},
"movies": 1,
"set": { "$cond": [ 1, [ 1, 2, 3 ], 0 ] }
}},
// Unwind both those arrays
{ "$unwind": "$movies" },
{ "$unwind": "$set" },
// Group back the count where both `_id` values are equal
{ "$group": {
"_id": "$_id",
"movies": {
"$sum": {
"$cond":[
{ "$eq": [ "$movies._id", "$set" ] },
1,
0
]
}
}
}},
// Filter the results to those that actually match
{ "$match": { "movies": { "$gte": 2 } } }
])
In Detail
That may be a bit to take in, so we can take a look at each stage and break those down to see what they are doing.
$match : You do not want to operate on every document in the collection so this is an opportunity to remove the items that are not possibly matches even if there still is more work to do to find the exact ones. So the obvious things are to exclude the same "user" and then only match the documents that have at least one of the same movies as was found for that "user".
The next thing that makes sense is to consider that when you want to match n entries then only documents that have a "movies" array that is larger than n-1 can possibly actually contain matches. The use of $and here looks funny and is not required specifically, but if the required matches were 4 then that actual part of the statement would look like this:
"$and": [
{ "movies": { "$not": { "$size": 1 } } },
{ "movies": { "$not": { "$size": 2 } } },
{ "movies": { "$not": { "$size": 3 } } }
]
So you basically "rule out" arrays that are not possibly long enough to have n matches. Noting here that this $size operator in the query form is different to $size for the aggregation framework. There is no way, for example, to use this with an inequality operator such as $gt, as its purpose is to specifically match the requested "size". Hence this query form, which specifies all of the possible sizes that are too small.
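Since that list grows with n, it is easier to generate it than to type it out. A small sketch in plain JavaScript. sizeExclusions is a hypothetical helper name, and the field name "movies" is taken from the example.

```javascript
// Hypothetical helper: build the $and list that rules out "movies" arrays
// with fewer than n elements (sizes 1 to n-1).
function sizeExclusions(n) {
  const conditions = [];
  for (let size = 1; size < n; size++) {
    conditions.push({ movies: { $not: { $size: size } } });
  }
  return conditions;
}

// For n = 4 this produces the three conditions listed above.
console.log(JSON.stringify(sizeExclusions(4), null, 2));
```

The result can then be supplied as the value of "$and" in the $match stage.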
$project : There are a few purposes in this statement, of which some differ depending on the MongoDB version you have. Firstly, and optionally, a document copy is being kept under the _id value so that these fields are not modified by the rest of the steps. The other part here is keeping the "movies" array at the top of the document as a copy for the next stage.
What is also happening in the version presented for pre 2.6 versions is there is an additional array representing the _id values for the "movies" to match. The usage of the $cond operator here is just a way of creating a "literal" representation of the array. Funny enough, MongoDB 2.6 introduces an operator known as $literal to do exactly this without the funny way we are using $cond right here.
$unwind : To do anything further the movies array needs to be unwound as in either case it is the only way to isolate the existing _id values for the entries that need to be matched against the "set". So for the pre 2.6 version you need to "unwind" both of the arrays that are present.
$group : For MongoDB 2.6 and greater you are just grouping back to an array that only contains the _id values of the movies with the "ratings" removed.
Pre 2.6 since all values are presented "side by side" ( and with lots of duplication ) you are doing a comparison of the two values to see if they are the same. Where that is true, this tells the $cond operator statement to return a value of 1 or 0 where the condition is false. This is directly passed back through $sum to total up the number of matching elements in the array to the required "set".
$project: Where this is the different part for MongoDB 2.6 and greater is that since you have pushed back an array of the "movies" _id values you are then using $setIntersection to directly compare those arrays. As the result of this is an array containing the elements that are the same, this is then wrapped in a $size operator in order to determine how many elements were returned in that matching set.
$match: Is the final stage that has been implemented here which does the clear step of matching only those documents whose count of intersecting elements was greater than or equal to the required number.
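To check that the two pipelines count the same thing, the logic can be mimicked in plain JavaScript on the sample data. This is only an in-memory sketch. The function names and the "label" field are invented for readability and are not part of the pipelines.

```javascript
const wanted = [1, 2, 3]; // movie _id values of the first user

const others = [
  { label: "second user", movies: [{ _id: 1, rating: 5 }, { _id: 4, rating: 6 }, { _id: 2, rating: 7 }] },
  { label: "third user", movies: [{ _id: 2, rating: 5 }, { _id: 5, rating: 6 }, { _id: 6, rating: 7 }] },
];

// 2.6 and later: size of the set intersection of the two _id lists.
function countWithSets(movies, set) {
  const ids = new Set(movies.map((m) => m._id));
  return [...new Set(set)].filter((id) => ids.has(id)).length;
}

// Pre 2.6: unwind both arrays (every movie against every set value) and
// sum 1 for each equal pair, as the $cond inside $sum does.
function countWithCrossProduct(movies, set) {
  let total = 0;
  for (const movie of movies) {
    for (const id of set) {
      total += movie._id === id ? 1 : 0;
    }
  }
  return total;
}

for (const user of others) {
  console.log(user.label, countWithSets(user.movies, wanted), countWithCrossProduct(user.movies, wanted));
}

// The final $match keeps only the users with at least two movies in common.
console.log(others.filter((u) => countWithSets(u.movies, wanted) >= 2).map((u) => u.label));
```

Note that the cross product visits every movie against every value in the set, which is where the extra memory use of the older form comes from.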
Final
That is basically how you do it. Prior to 2.6 is a bit clunkier and will require a bit more memory due to the expansion that is done by duplicating each array member that is found by all of the possible values of the set, but it still is a valid way to do this.
All you need to do is apply this with the greater n matching values to meet your conditions, and of course make sure your original user match has the required n possibilities. Otherwise just generate this on n-1 from the length of the "user's" array of "movies".

Mongo. Narrowing down results of nested array

If I have a document like this:
{
"name" : "Foo",
"words" :
[
"lorem",
"ipsum",
"dolor",
"sit",
"amet",
...
]
}
Let's say this words array is pretty big. Now I need a query that would fetch that document:
db.docs.find({'name':'Foo'}) - that will get whole document
but what I want, instead of fetching the entire words array (cause it's too big) I would like to retrieve only elements that meet some criteria. Let's say I want to see only words that start with "a" or have a length of at least 3 characters.
You know maybe something like this:
// this won't work!
db.docs.find({
"$where":"(this.words.map(function(e){ if (e.length >=3) { return e } }))"
})
EDIT
You cannot filter array contents using find; you can only match documents whose array contains the condition. So in order to filter the contents of the array you need to make use of aggregate:
db.docs.aggregate([
// Still makes sense to match the documents that meet the condition
{ "$match": {
"name": "Foo",
"words": { "$regex": "^[A-Za-z0-9_]{4,}" }
}},
// Unwind the array to "de-normalize"
{ "$unwind": "$words" },
// Actually "filter" the array elements
{ "$match": { "words": { "$regex": "^[A-Za-z0-9_]{4,}" } } },
// Group back the document with the "filtered" array
{ "$group": {
"_id": "$_id",
"name": { "$first": "$name" },
"words": { "$push": "$words" }
}}
])
That makes use of a regular expression condition that will match at least 4 characters from the start of the string. The ^ anchor is quite important here as it allows an index to be used which is much more optimal than whatever else you can do.
The result returned will look like this:
{
"result" : [
{
"_id" : ObjectId("5341f0476cbcc02b995092ac"),
"name" : "Foo",
"words" : [
"lorem",
"ipsum",
"dolor"
]
}
],
"ok" : 1
}
You can also throw a lot of arbitrary JavaScript at mapReduce and test the length of elements in the array, but that will take considerably longer to execute.
--
The terms are quite simple, you simply add the additional operator to the query document as so:
db.docs.find({ "name": "Foo", "$where": "(this.words.length > 3)" })
You really should not be using the $where operator unless absolutely necessary, and even then you really should think about what you are doing. Heed the warnings that are given in that document.
As stated in the manual page for $size, probably the best way to deal with detecting array length for a given range (rather than exact) is to create a "counter" field in your document that is updated as elements are added/removed from the array. This makes a very simple and efficient query:
db.docs.find({ "name": "Foo", "counter": { "$gt": 3 } })
Of course from MongoDB versions 2.6 and upwards you can also do this:
db.docs.aggregate([
{ "$project": {
"name": 1,
"words": 1,
"count": { "$size": "$words" }
}},
{ "$match": {
"count": { "$gt": 3 }
}}
])
Either of those forms is going to perform a lot better than using something that is going to remove the use of an index and then invoke the JavaScript interpreter over each resulting document. Or even just use the $size operator for an exact size of the array.
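For the "counter" field approach mentioned above, the counter has to change in the same update that changes the array. A minimal sketch of building such an update document, where addWordUpdate is a hypothetical helper and the field names "words" and "counter" follow the examples:

```javascript
// Hypothetical helper: update document that appends a word to "words"
// and increments "counter" in the same operation.
function addWordUpdate(word) {
  return { $push: { words: word }, $inc: { counter: 1 } };
}

// In the shell this might be applied as:
// db.docs.update({ "name": "Foo" }, addWordUpdate("consectetur"))
console.log(JSON.stringify(addWordUpdate("consectetur")));
```

Removing elements needs more care, since the counter must only be decremented by the number of elements actually removed.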