MongoDb : Find common element from two arrays within a query - mongodb

Let's say we have records of following structure in database.
{
"_id": 1234,
"tags" : [ "t1", "t2", "t3" ]
}
Now, I want to check if database contains a record with any of the tags specified in array tagsArray which is [ "t3", "t4", "t5" ]
I know about $in operator but I not only want to know whether any of the records in database has any of the tag specified in tagsArray, I also want to know which tag of the record in database matches with any of the tags specified in tagsArray. (i.e. t3 in for the case of record mentioned above)
That is, I want to compare two arrays (one of the record and other given by me) and find out the common element.
I need to have this expression along with many expressions in the query so projection operators like $, $elematch etc won't be of much use. (Or is there a way it can be used without having to iterate over all records?)
I think I can use $where operator but I don't think that is the best way to do this.
How can this problem be solved?

There are a few approaches to do what you want, it just depends on your version of MongoDB. Just submitting the shell responses. The content is basically JSON representation which is not hard to translate for DBObject entities in Java, or JavaScript to be executed on the server so that really does not change.
The first and the fastest approach is with MongoDB 2.6 and greater where you get the new set operations:
var test = [ "t3", "t4", "t5" ];
db.collection.aggregate([
{ "$match": { "tags": {"$in": test } }},
{ "$project": {
"tagMatch": {
"$setIntersection": [
"$tags",
test
]
},
"sizeMatch": {
"$size": {
"$setIntersection": [
"$tags",
test
]
}
}
}},
{ "$match": { "sizeMatch": { "$gte": 1 } } },
{ "$project": { "tagMatch": 1 } }
])
The new operators there are $setIntersection that is doing the main work and also the $size operator which measures the array size and helps for the latter filtering. This ends up as a basic comparison of "sets" in order to find the items that intersect.
If you have an earlier version of MongoDB then this is still possible, but you need a few more stages and this might affect performance somewhat depending if you have large arrays:
var test = [ "t3", "t4", "t5" ];
db.collection.aggregate([
{ "$match": { "tags": {"$in": test } }},
{ "$project": {
"tags": 1,
"match": { "$const": test }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$match" },
{ "$project": {
"tags": 1,
"matched": { "$eq": [ "$tags", "$match" ] }
}},
{ "$match": { "matched": true }},
{ "$group": {
"_id": "$_id",
"tagMatch": { "$push": "$tags" },
"count": { "$sum": 1 }
}}
{ "$match": { "count": { "$gte": 1 } }},
{ "$project": { "tagMatch": 1 }}
])
Or if all of that seems to involved or your arrays are large enough to make a performance difference then there is always mapReduce:
var test = [ "t3", "t4", "t5" ];
db.collection.mapReduce(
function () {
var intersection = this.tags.filter(function(x){
return ( test.indexOf( x ) != -1 );
});
if ( intersection.length > 0 )
emit ( this._id, intersection );
},
function(){},
{
"query": { "tags": { "$in": test } },
"scope": { "test": test },
"output": { "inline": 1 }
}
)
Note that in all cases the $in operator still helps you to reduce the results even though it is not the full match. The other common element is checking the "size" of the intersection result to reduce the response.
All pretty easy to code up, convince the boss to switch to MongoDB 2.6 or greater if you are not already there for the best results.

Related

mongodb find doc with matching values of element in object array [duplicate]

I have a collection of documents:
date: Date
users: [
{ user: 1, group: 1 }
{ user: 5, group: 2 }
]
date: Date
users: [
{ user: 1, group: 1 }
{ user: 3, group: 2 }
]
I would like to query against this collection to find all documents where every user id in my array of users is in another array, [1, 5, 7]. In this example, only the first document matches.
The best solution I've been able to find is to do:
$where: function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
Unfortunately, this seems to hurt performance is stated in the $where docs:
$where evaluates JavaScript and cannot take advantage of indexes.
How can I improve this query?
The query you want is this:
db.collection.find({"users":{"$not":{"$elemMatch":{"user":{$nin:[1,5,7]}}}}})
This says find me all documents that don't have elements that are outside of the list 1,5,7.
I don't know about better, but there are a few different ways to approach this, and depending on the version of MongoDB you have available.
Not too sure if this is your intention or not, but the query as shown will match the first document example because as your logic is implemented you are matching the elements within that document's array that must be contained within the sample array.
So if you actually wanted the document to contain all of those elements, then the $all operator would be the obvious choice:
db.collection.find({ "users.user": { "$all": [ 1, 5, 7 ] } })
But working with the presumption that your logic is actually intended, at least as per suggestion you can "filter" those results by combining with the $in operator so that there are less documents subject to your $where** condition in evaluated JavaScript:
db.collection.find({
"users.user": { "$in": [ 1, 5, 7 ] },
"$where": function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
})
And you get an index though the actual scanned will be multiplied by the number of elements in the arrays from the matched documents, but still better than without the additional filter.
Or even possibly you consider the logical abstraction of the $and operator used in combination with $or and possibly the $size operator depending on your actual array conditions:
db.collection.find({
"$or": [
{ "users.user": { "$all": [ 1, 5, 7 ] } },
{ "users.user": { "$all": [ 1, 5 ] } },
{ "users.user": { "$all": [ 1, 7 ] } },
{ "users": { "$size": 1 }, "users.user": 1 },
{ "users": { "$size": 1 }, "users.user": 5 },
{ "users": { "$size": 1 }, "users.user": 7 }
]
})
So this is a generations of all of the possible permutations of your matching condition, but again performance will likely vary depending on your available installed version.
NOTE: Actually a complete fail in this case as this does something entirely different and in fact results in a logical $in
Alternates are with the aggregation framework, your mileage may vary on which is most efficient due to the number of documents in your collection, one approach with MongoDB 2.6 and upwards:
db.problem.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Just keeping the "user" element value
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
// Compare to see if all elements are a member of the desired match
{ "$project": {
"match": { "$setEquals": [
{ "$setIntersection": [ "$users", [ 1, 5, 7 ] ] },
"$users"
]}
}},
// Filter out any documents that did not match
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
So that approach uses some newly introduced set operators in order to compare the contents, though of course you need to restructure the array in order to make the comparison.
As pointed out, there is a direct operator to do this in $setIsSubset which does the equivalent of the combined operators above in a single operator:
db.collection.aggregate([
{ "$match": {
"users.user": { "$in": [ 1,5,7 ] }
}},
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
{ "$unwind": "$users" },
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
{ "$project": {
"match": { "$setIsSubset": [ "$users", [ 1, 5, 7 ] ] }
}},
{ "$match": { "match": true } },
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
Or with a different approach while still taking advantage of the $size operator from MongoDB 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
// and a note of it's current size
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
"size": { "$size": "$users" }
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
Which of course can still be done, though a little more long winded in versions prior to 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Group it back to get it's original size
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users" },
"size": { "$sum": 1 }
}},
// Unwind the array copy again
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
That generally rounds out the different ways, try them out and see what works best for you. In all likelihood the simple combination of $in with your existing form is probably going to be the best one. But in all cases, make sure you have an index that can be selected:
db.collection.ensureIndex({ "users.user": 1 })
Which is going to give you the best performance as long as you are accessing that in some way, as all the examples here do.
Verdict
I was intrigued by this so ultimately contrived a test case in order to see what had the best performance. So first some test data generation:
var batch = [];
for ( var n = 1; n <= 10000; n++ ) {
var elements = Math.floor(Math.random(10)*10)+1;
var obj = { date: new Date(), users: [] };
for ( var x = 0; x < elements; x++ ) {
var user = Math.floor(Math.random(10)*10)+1,
group = Math.floor(Math.random(10)*10)+1;
obj.users.push({ user: user, group: group });
}
batch.push( obj );
if ( n % 500 == 0 ) {
db.problem.insert( batch );
batch = [];
}
}
With 10000 documents in a collection with random arrays from 1..10 in length holding random values of 1..0, I came to a match count of 430 documents (reduced from 7749 from the $in match ) with the following results (avg):
JavaScript with $in clause: 420ms
Aggregate with $size : 395ms
Aggregate with group array count : 650ms
Aggregate with two set operators : 275ms
Aggregate with $setIsSubset : 250ms
Noting that over the samples done all but the last two had a peak variance of approximately 100ms faster, and the last two both exhibited 220ms response. The largest variations were in the JavaScript query which also exhibited results 100ms slower.
But the point here is relative to hardware, which on my laptop under a VM is not particularly great, but gives an idea.
So the aggregate, and specifically the MongoDB 2.6.1 version with set operators clearly wins on performance with the additional slight gain coming from $setIsSubset as a single operator.
This is particularly interesting given (as indicated by the 2.4 compatible method) the largest cost in this process will be the $unwind statement ( over 100ms avg ), so with the $in selection having a mean around 32ms the rest of the pipeline stages execute in less than 100ms on average. So that gives a relative idea of aggregation versus JavaScript performance.
I just spent a substantial portion of my day trying to implement Asya's solution above with object-comparisons rather than strict equality. So I figured I'd share it here.
Let's say you expanded your question from userIds to full users.
You want to find all documents where every item in its users array is present in another users array: [{user: 1, group: 3}, {user: 2, group: 5},...]
This won't work: db.collection.find({"users":{"$not":{"$elemMatch":{"$nin":[{user: 1, group: 3},{user: 2, group: 5},...]}}}}}) because $nin only works for strict equality. So we need to find a different way of expressing "Not in array" for arrays of objects. And using $where would slow down the query too much.
Solution:
db.collection.find({
"users": {
"$not": {
"$elemMatch": {
// if all of the OR-blocks are true, element is not in array
"$and": [{
// each OR-block == true if element != that user
"$or": [
"user": { "ne": 1 },
"group": { "ne": 3 }
]
}, {
"$or": [
"user": { "ne": 2 },
"group": { "ne": 5 }
]
}, {
// more users...
}]
}
}
}
})
To round out the logic: $elemMatch matches all documents that have a user not in the array. So $not will match all documents that have all of the users in the array.

Get Distinct list of two properties using MongoDB 2.4

I have an article collection:
{
_id: 9999,
authorId: 12345,
coAuthors: [23456,34567],
title: 'My Article'
},
{
_id: 10000,
authorId: 78910,
title: 'My Second Article'
}
I'm trying to figure out how to get a list of distinct author and co-author ids out of the database. I have tried push, concat, and addToSet, but can't seem to find the right combination. I'm on 2.4.6 so I don't have access to setUnion.
Whilst $setUnion would be the "ideal" way to do this, there is another way that basically involved "switching" between a "type" to alternate which field is picked:
db.collection.aggregate([
{ "$project": {
"authorId": 1,
"coAuthors": { "$ifNull": [ "$coAuthors", [null] ] },
"type": { "$const": [ true,false ] }
}},
{ "$unwind": "$coAuthors" },
{ "$unwind": "$type" },
{ "$group": {
"_id": {
"$cond": [
"$type",
"$authorId",
"$coAuthors"
]
}
}},
{ "$match": { "_id": { "$ne": null } } }
])
And that is it. You may know the $const operation as the $literal operator from MongoDB 2.6. It has always been there, but was only documented and given an "alias" at the 2.6 release.
Of course the $unwind operations in both cases produce more "copies" of the data, but this is grouping for "distinct" values so it does not matter. Just depending on the true/false alternating value for the projected "type" field ( once unwound ) you just pick the field alternately.
Also this little mapReduce does much the same thing:
db.collection.mapReduce(
function() {
emit(this.authorId,null);
if ( this.hasOwnProperty("coAuthors"))
this.coAuthors.forEach(function(id) {
emit(id,null);
});
},
function(key,values) {
return null;
},
{ "out": { "inline": 1 } }
)
For the record, $setUnion is of course a lot cleaner and more performant:
db.collection.aggregate([
{ "$project": {
"combined": {
"$setUnion": [
{ "$map": {
"input": ["A"],
"as": "el",
"in": "$authorId"
}},
{ "$ifNull": [ "$coAuthors", [] ] }
]
}
}},
{ "$unwind": "$combined" },
{ "$group": {
"_id": "$combined"
}}
])
So there the only real concerns are converting the singular "authorId" to an array via $map and feeding an empty array where the "coAuthors" field is not present in the document.
Both output the same distinct values from the sample documents:
{ "_id" : 78910 }
{ "_id" : 23456 }
{ "_id" : 34567 }
{ "_id" : 12345 }

How to calculate difference between values of different documents using mongo aggregation?

Hi my mongo structure as below
{
"timemilliSec":1414590255,
"data":[
{
"x":23,
"y":34,
"name":"X"
},
{
"x":32,
"y":50,
"name":"Y"
}
]
},
{
"timemilliSec":1414590245,
"data":[
{
"x":20,
"y":13,
"name":"X"
},
{
"x":20,
"y":30,
"name":"Y"
}
]
}
Now I want to calculate difference of first document and second document and second to third in this way
so calculation as below
diffX = ((data.x-data.x)/(data.y-data.y)) in our case ((23-20)/(34-13))
diffY = ((data.x-data.x)/(data.y-data.y)) in our case ((32-20)/(50-30))
Tough question in principle, but I'm going to stay with the simplified case you present of two documents and base a solution around that. The concepts should abstract, but are more difficult for expanded cases. Possible with the aggregation framework in general:
db.collection.aggregate([
// Match the documents in a pair
{ "$match": {
"timeMilliSec": { "$in": [ 1414590255, 1414590245 ] }
}}
// Trivial, just keeping an order
{ "$sort": { "timeMilliSec": -1 } },
// Unwind the arrays
{ "$unwind": "$data" },
// Group first and last
{ "$group": {
"_id": "$data.name",
"firstX": { "$first": "$data.x" },
"lastX": { "$last": "$data.x" },
"firstY": { "$first": "$data.y" },
"lastY": { "$last": "$data.y" }
}},
// Difference on the keys
{ "$project": {
"diff": {
"$divide": [
{ "$subtract": [ "$firstX", "$lastX" ] },
{ "$subtract": [ "$firstY", "$lastY" ] }
]
}
}},
// Not sure you want to take it this far
{ "$group": {
"_id": null,
"diffX": {
"$min": {
"$cond": [
{ "$eq": [ "$_id", "X" ] },
"$diff",
false
]
}
},
"diffY": {
"$min": {
"$cond": [
{ "$eq": [ "$_id", "Y" ] },
"$diff",
false
]
}
}
}}
])
Possibly overblown, not sure of the intent, but the output of this based on the sample would be:
{
"_id" : null,
"diffX" : 0.14285714285714285,
"diffY" : 0.6
}
Which matches the calculations.
You can adapt to your case, but the general principle is as shown.
The last "pipeline" stage there is a little "extreme" as all that is done is combine the results into a single document. Otherwise, the "X" and "Y" results are already obtained in two documents in the pipeline. Mostly by the $group operation with $first and $last operations to find the respective elements on the grouping boundary.
The subsequent operations in $project as a pipeline stage performs the required math to determine the distinct results. See the aggregation operators for more details, particularly $divide and $subtract.
Whatever you do you follow this course. Get a "start" and "end" pair on your two keys. Then perform the calculations.

How to optimize mongoDB query?

I am having following sample document in the mongoDB.
{
"location" : {
"language" : null,
"country" : "null",
"city" : "null",
"state" : null,
"continent" : "null",
"latitude" : "null",
"longitude" : "null"
},
"request" : [
{
"referrer" : "direct",
"url" : "http://www.google.com/"
"title" : "index page"
"currentVisit" : "1401282897"
"visitedTime" : "1401282905"
},
{
"referrer" : "direct",
"url" : "http://www.stackoverflow.com/",
"title" : "index page"
"currentVisit" : "1401282900"
"visitedTime" : "1401282905"
},
......
]
"uuid" : "109eeee0-e66a-11e3"
}
Note:
The database contains more than 10845 document
Each document contains nearly 100 request(100 object in the request array).
Technology/Language - node.js
I had setProfiling to check the execution time
First Query - 13899ms
Second Query - 9024ms
Third Query - 8310ms
Fourth Query - 6858ms
There is no much difference using indexing
Queries:
I am having the following aggregation queries to be executed to fetch the data.
var match = {"request.currentVisit":{$gte:core.getTime()[1].toString(),$lte:core.getTime()[0].toString()}};
For Example: var match = {"request.currentVisit":{$gte:"1401282905",$lte:"1401282935"}};
For third and fourth query request.visitedTime instead of request.currentVisit
First
[
{ "$project":{
"request.currentVisit":1,
"request.url":1
}},
{ "$match":{
"request.1": {$exists:true}
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
Second
[
{ "$project": {
"request.currentVisit":1,
"request.url":1
}},
{ "$match": {
"request":{ "$size": 1 }
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id":{
"url":"$request.url"
},
"count":{ "$sum": 1 }
}},
{ "$sort": { "count": -1} }
]
Third
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request.1": { "$exists": true }
}},
{ "$match": match },
{ "$group": {
"_id": "$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id": null,
"total": { "$sum":"$count" }}
}}
]
Forth
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request":{ "$size": 1 }
}},
{ "$match": match },
{ "$group": {
"_id":"$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id":null,
"total": { "$sum": "$count" }
}}
]
Problem:
It is taking more than 38091 ms to fetch the data.
Is there any way to optimize the query?
Any suggestion will be grateful.
Well there are a few problems and you definitely need indexes, but you cannot have compound ones. It is the "timestamp" values that you are querying within the array that you want to index. It would also be advised that you either convert these to numeric values rather than the current strings, or indeed to BSON Date types. The latter form is actually internally stored as a numeric timestamp value, so there is a general storage size reduction, which also reduces the index size as well as being more efficient to match on the numeric values.
The big problem with each query is that you are always later diving into the "array" contents after processing an $unwind and then "filtering" that with match. While this what you want to do for your result, since you have not applied the same filter at an earlier stage, you have many documents in the pipeline that do not match these conditions when you $unwind. The result is "lots" of documents you do not need being processed in this stage. And here you cannot use an index.
Where you need this match is at the start of the pipeline stages. This narrows down the documents to the "possible" matches before that acutual array is filtered.
So using the first as an example:
[
{ "$match":{
{ "request.currentVisit":{
"$gte":"1401282905", "$lte": "1401282935"
}
}},
{ "$unwind": "$request" },
{ "$match":{
{ "request.currentVisit":{
"$gte":"1401282905", "$lte": "1401282935"
}
}},
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
So a few changes. There is a $match at the head of the pipeline. This narrows down documents and is able to use an index. That is the most important performance consideration. Golden rule, always "match" first.
The $project you had in there was redundant as you cannot project "just" the fields of an array that is yet unwound. There is also a misconception that people believe they $project first to reduce the pipeline. The effect is very minimal if in fact there is a later $project or $group statement that actually limits the fields, then this will be "forward optimized" so things do get taken out of the pipeline processing for you. Still the $match statement above does more to optimize.
Dropping the need to see if the array is actually there with the other $match stage, as you are now "implicitly" doing that at the start of the pipeline. If more conditions make you more comfortable, then add them to that initial pipeline stage.
The rest remains unchanged, as you then $unwind the array and $match to filter the items that you actually want before moving on to your remaining processing. By now, the input documents have been significantly reduced, or reduced as much as they are going to be.
The other alternative that you can do with MongoDB 2.6 and greater is "filter" the array content before you even **$unwind it. This would produce a listing like this:
[
{ "$match":{
{ "request.currentVisit":{
"$gte":"1401282905", "$lte": "1401282935"
}
}},
{ "$project": {
"request": {
"$setDifference": [
{
"$map": {
"input": "$request",
"as": "el",
"in": {
"$cond"": [
{
"$and":[
{ "$gte": [ "1401282905", "$$el.currentVisit" ] },
{ "$lt": [ "1401282935", "$$el.currentVisit" ] }
]
}
"$el",
false
]
}
}
}
[false]
]
}
}}
{ "$unwind": "$request" },
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
That may save you some by being able to "filter" the array before the $unwind and which is possibly better than doing the $match afterwards.
But this is the general rule for all of your statements. You need usable indexes and you need to $match first.
It is possible that the actual results you really want could be obtained in a single query, but as it stands your question is not presented that way. Try changing your processing as outlined, and you should see a notable improvement.
If you are still then trying to come to terms with how this could possibly be singular, then you can always ask another question.

Check if every element in array matches condition

I have a collection of documents:
date: Date
users: [
{ user: 1, group: 1 }
{ user: 5, group: 2 }
]
date: Date
users: [
{ user: 1, group: 1 }
{ user: 3, group: 2 }
]
I would like to query against this collection to find all documents where every user id in my array of users is in another array, [1, 5, 7]. In this example, only the first document matches.
The best solution I've been able to find is to do:
$where: function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
Unfortunately, this seems to hurt performance is stated in the $where docs:
$where evaluates JavaScript and cannot take advantage of indexes.
How can I improve this query?
The query you want is this:
db.collection.find({"users":{"$not":{"$elemMatch":{"user":{$nin:[1,5,7]}}}}})
This says find me all documents that don't have elements that are outside of the list 1,5,7.
I don't know about better, but there are a few different ways to approach this, and depending on the version of MongoDB you have available.
Not too sure if this is your intention or not, but the query as shown will match the first document example because as your logic is implemented you are matching the elements within that document's array that must be contained within the sample array.
So if you actually wanted the document to contain all of those elements, then the $all operator would be the obvious choice:
db.collection.find({ "users.user": { "$all": [ 1, 5, 7 ] } })
But working with the presumption that your logic is actually intended, at least as per suggestion you can "filter" those results by combining with the $in operator so that there are less documents subject to your $where** condition in evaluated JavaScript:
db.collection.find({
"users.user": { "$in": [ 1, 5, 7 ] },
"$where": function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
})
And you get an index though the actual scanned will be multiplied by the number of elements in the arrays from the matched documents, but still better than without the additional filter.
Or even possibly you consider the logical abstraction of the $and operator used in combination with $or and possibly the $size operator depending on your actual array conditions:
db.collection.find({
"$or": [
{ "users.user": { "$all": [ 1, 5, 7 ] } },
{ "users.user": { "$all": [ 1, 5 ] } },
{ "users.user": { "$all": [ 1, 7 ] } },
{ "users": { "$size": 1 }, "users.user": 1 },
{ "users": { "$size": 1 }, "users.user": 5 },
{ "users": { "$size": 1 }, "users.user": 7 }
]
})
So this is a generations of all of the possible permutations of your matching condition, but again performance will likely vary depending on your available installed version.
NOTE: Actually a complete fail in this case as this does something entirely different and in fact results in a logical $in
Alternates are with the aggregation framework, your mileage may vary on which is most efficient due to the number of documents in your collection, one approach with MongoDB 2.6 and upwards:
db.problem.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Just keeping the "user" element value
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
// Compare to see if all elements are a member of the desired match
{ "$project": {
"match": { "$setEquals": [
{ "$setIntersection": [ "$users", [ 1, 5, 7 ] ] },
"$users"
]}
}},
// Filter out any documents that did not match
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
So that approach uses some newly introduced set operators in order to compare the contents, though of course you need to restructure the array in order to make the comparison.
As pointed out, there is a direct operator to do this in $setIsSubset which does the equivalent of the combined operators above in a single operator:
db.collection.aggregate([
{ "$match": {
"users.user": { "$in": [ 1,5,7 ] }
}},
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
{ "$unwind": "$users" },
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
{ "$project": {
"match": { "$setIsSubset": [ "$users", [ 1, 5, 7 ] ] }
}},
{ "$match": { "match": true } },
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
Or with a different approach while still taking advantage of the $size operator from MongoDB 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
// and a note of it's current size
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
"size": { "$size": "$users" }
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
Which of course can still be done, though a little more long winded in versions prior to 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Group it back to get it's original size
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users" },
"size": { "$sum": 1 }
}},
// Unwind the array copy again
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
That generally rounds out the different ways, try them out and see what works best for you. In all likelihood the simple combination of $in with your existing form is probably going to be the best one. But in all cases, make sure you have an index that can be selected:
db.collection.ensureIndex({ "users.user": 1 })
Which is going to give you the best performance as long as you are accessing that in some way, as all the examples here do.
Verdict
I was intrigued by this so ultimately contrived a test case in order to see what had the best performance. So first some test data generation:
var batch = [];
for ( var n = 1; n <= 10000; n++ ) {
var elements = Math.floor(Math.random(10)*10)+1;
var obj = { date: new Date(), users: [] };
for ( var x = 0; x < elements; x++ ) {
var user = Math.floor(Math.random(10)*10)+1,
group = Math.floor(Math.random(10)*10)+1;
obj.users.push({ user: user, group: group });
}
batch.push( obj );
if ( n % 500 == 0 ) {
db.problem.insert( batch );
batch = [];
}
}
With 10000 documents in a collection with random arrays from 1..10 in length holding random values of 1..0, I came to a match count of 430 documents (reduced from 7749 from the $in match ) with the following results (avg):
JavaScript with $in clause: 420ms
Aggregate with $size : 395ms
Aggregate with group array count : 650ms
Aggregate with two set operators : 275ms
Aggregate with $setIsSubset : 250ms
Noting that over the samples done all but the last two had a peak variance of approximately 100ms faster, and the last two both exhibited 220ms response. The largest variations were in the JavaScript query which also exhibited results 100ms slower.
But the point here is relative to hardware, which on my laptop under a VM is not particularly great, but gives an idea.
So the aggregate, and specifically the MongoDB 2.6.1 version with set operators clearly wins on performance with the additional slight gain coming from $setIsSubset as a single operator.
This is particularly interesting given (as indicated by the 2.4 compatible method) the largest cost in this process will be the $unwind statement ( over 100ms avg ), so with the $in selection having a mean around 32ms the rest of the pipeline stages execute in less than 100ms on average. So that gives a relative idea of aggregation versus JavaScript performance.
I just spent a substantial portion of my day trying to implement Asya's solution above with object-comparisons rather than strict equality. So I figured I'd share it here.
Let's say you expanded your question from userIds to full users.
You want to find all documents where every item in its users array is present in another users array: [{user: 1, group: 3}, {user: 2, group: 5},...]
This won't work: db.collection.find({"users":{"$not":{"$elemMatch":{"$nin":[{user: 1, group: 3},{user: 2, group: 5},...]}}}}}) because $nin only works for strict equality. So we need to find a different way of expressing "Not in array" for arrays of objects. And using $where would slow down the query too much.
Solution:
db.collection.find({
"users": {
"$not": {
"$elemMatch": {
// if all of the OR-blocks are true, element is not in array
"$and": [{
// each OR-block == true if element != that user
"$or": [
"user": { "ne": 1 },
"group": { "ne": 3 }
]
}, {
"$or": [
"user": { "ne": 2 },
"group": { "ne": 5 }
]
}, {
// more users...
}]
}
}
}
})
To round out the logic: $elemMatch matches all documents that have a user not in the array. So $not will match all documents that have all of the users in the array.