I have the following aggregate query, which gives me counts (countA) for a given date range, in this case 01/01/2016-03/31/2016. Is it possible to add a second date range, for example 04/01/2016-07/31/2016, and count those as countB?
db.getCollection('customers').aggregate(
{$match: {"status": "Closed"}},
{$unwind: "$lines"},
{$match: {"lines.status": "Closed"}},
{$match: {"lines.deliveryMethod": "Tech Delivers"}},
{$match: {"date": {$gte: new Date('01/01/2016'), $lte: new Date('03/31/2016')}}},
{$group:{_id:"$lines.productLine",countA: {$sum: 1}}}
)
Thanks in advance
Sure, and you can also simplify your pipeline stages quite a lot, mostly since successive $match stages are really a single stage, and because you should always apply match criteria at the beginning of an aggregation pipeline. Even if it doesn't actually "filter" the array content, it at least selects only the documents containing entries that will actually match. This speeds things up considerably, especially with large data sets.
For the two date ranges, well this is just an $or query argument. Also it would be applied "before" the array filtering is done, since after all it is a document level match to begin with. So again, in the very first pipeline $match:
db.getCollection('customers').aggregate([
// Filter all document conditions first. Reduces things to process.
{ "$match": {
"status": "Closed",
"lines": { "$elemMatch": {
"status": "Closed",
"deliveryMethod": "Tech Delivers"
}},
"$or": [
{ "date": {
"$gte": new Date("2016-01-01"),
"$lt": new Date("2016-04-01")
}},
{ "date": {
"$gte": new Date("2016-04-01"),
"$lt": new Date("2016-08-01")
}}
]
}},
// Unwind the array
{ "$unwind": "$lines" },
// Filter just the matching elements
// Successive $match is really just one pipeline stage
{ "$match": {
"lines.status": "Closed",
"lines.deliveryMethod": "Tech Delivers"
}},
// Then group on the productline values within the array
{ "$group":{
"_id": "$lines.productLine",
"countA": {
"$sum": {
"$cond": [
{ "$and": [
{ "$gte": [ "$date", new Date("2016-01-01") ] },
{ "$lt": [ "$date", new Date("2016-04-01") ] }
]},
1,
0
]
}
},
"countB": {
"$sum": {
"$cond": [
{ "$and": [
{ "$gte": [ "$date", new Date("2016-04-01") ] },
{ "$lt": [ "$date", new Date("2016-08-01") ] }
]},
1,
0
]
}
}
}}
])
The $or basically "joins" two result sets as it looks for "either" range criteria to apply. As this is given in addition to the other arguments, the logic is an "AND" condition as with the others on the criteria met with either $or argument. Note the $gte and $lt combination is also another form of expressing "AND" conditions on the same key.
The $elemMatch is applied since "both" criteria are required on the array element. If you just directly applied them with "dot notation", then all that really asks is that "at least one array element" matches each condition, rather than the array element matching "both" conditions.
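To make that difference concrete, here is a minimal sketch in plain JavaScript rather than MongoDB itself. The sample document and the "Courier" value are made up for illustration; the two predicates only imitate what the query forms ask for.

```javascript
// Plain JavaScript stand-ins for the two matching styles (not MongoDB API calls).
// The document and the "Courier" value are hypothetical sample data.
const doc = {
  status: "Closed",
  lines: [
    { status: "Closed", deliveryMethod: "Courier" },
    { status: "Open", deliveryMethod: "Tech Delivers" }
  ]
};

// "Dot notation" style: each condition may be satisfied by a different element.
const dotNotationMatch =
  doc.lines.some(l => l.status === "Closed") &&
  doc.lines.some(l => l.deliveryMethod === "Tech Delivers");

// "$elemMatch" style: one single element must satisfy both conditions.
const elemMatchMatch = doc.lines.some(
  l => l.status === "Closed" && l.deliveryMethod === "Tech Delivers"
);

console.log(dotNotationMatch); // true
console.log(elemMatchMatch);   // false
```

So this document would be selected by the dot notation form even though no single line is both "Closed" and "Tech Delivers".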
The later filtering after $unwind can use the "dot notation" since the array elements are now "de-normalised" into separate documents. So there is only one element per document to now match the conditions.
When you apply the $group, instead of just using { "$sum": 1 } you rather "conditionally" assess whether to count it or not by using $cond. Since both date ranges are within the results, you just need to determine if the current document being "rolled up" belongs to one date range or another. As a "ternary" (if/then/else) operator, this is what $cond provides.
It looks at the values within "date" in the document and if it matches the condition set ( first argument - if ) then it returns 1 ( second argument - then ), else it returns 0, effectively not adding to the current count.
Since these are "logical" conditions then the "AND" is expressed with a logical $and operator, which itself returns true or false, requiring both contained conditions to be true.
Also note the correction in the Date object constructors, since if you do not instantiate with the string in that representation then the resulting Date is in "localtime" as opposed to the "UTC" format in which MongoDB is storing the dates. Only use a "local" constructor if you really mean that, and often people really don't.
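A small sketch of the difference, runnable in any JavaScript engine. The numeric constructor is used here as the "local" example because the parsing of non-ISO strings such as '01/01/2016' is implementation-defined (in practice usually local time); the variable names are just for illustration.

```javascript
// ISO date-only string: parsed as midnight UTC.
const utcMidnight = new Date("2016-01-01");

// Numeric arguments: midnight in the local timezone of the machine.
const localMidnight = new Date(2016, 0, 1);

// The two instants differ by exactly the local UTC offset on that date.
const offsetMs = localMidnight.getTime() - utcMidnight.getTime();

console.log(utcMidnight.toISOString()); // 2016-01-01T00:00:00.000Z
console.log(offsetMs === localMidnight.getTimezoneOffset() * 60000); // true
```

On a machine set to UTC the two values coincide, which is exactly why this kind of mistake often goes unnoticed until the code runs somewhere else.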
The other note is the $lt date change, which should always be "one day" greater than the last date you are looking for. Remember these are "beginning of day" dates, and therefore you usually want all possible times within the date, and not just up to the beginning. So it's "less than the next day" as the correct condition.
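Putting the $cond counting and the "less than the next day" ranges together, here is a rough plain JavaScript equivalent of what the $group stage computes. The unwound documents and product line names are hypothetical; only the range boundaries come from the pipeline above.

```javascript
// Half-open ranges [start, end): the end is the day AFTER the last wanted date.
const ranges = {
  countA: [new Date("2016-01-01"), new Date("2016-04-01")],
  countB: [new Date("2016-04-01"), new Date("2016-08-01")]
};

// Hypothetical documents as they would look after $unwind.
const unwound = [
  { productLine: "X", date: new Date("2016-03-31T23:59:59Z") },
  { productLine: "X", date: new Date("2016-04-01T00:00:00Z") },
  { productLine: "Y", date: new Date("2016-07-31T12:00:00Z") }
];

const totals = {};
for (const d of unwound) {
  const t = totals[d.productLine] ||
    (totals[d.productLine] = { countA: 0, countB: 0 });
  for (const [name, [start, end]] of Object.entries(ranges)) {
    // Same idea as $cond: add 1 when start <= date < end, otherwise add 0.
    t[name] += (d.date >= start && d.date < end) ? 1 : 0;
  }
}

console.log(totals);
// { X: { countA: 1, countB: 1 }, Y: { countA: 0, countB: 1 } }
```

Note how the entry one second before midnight on 31 March still lands in countA, which an $lte against the beginning of 31 March would have missed.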
For the record, with MongoDB versions from 2.6, it's likely better to "pre-filter" the array content "before" you $unwind. This removes the overhead of producing new documents in the "de-normalizing" that occurs that would not match the conditions you want to apply to array elements.
For MongoDB 3.2 and greater, use $filter:
db.getCollection('customers').aggregate([
// Filter all document conditions first. Reduces things to process.
{ "$match": {
"status": "Closed",
"lines": { "$elemMatch": {
"status": "Closed",
"deliveryMethod": "Tech Delivers"
}},
"$or": [
{ "date": {
"$gte": new Date("2016-01-01"),
"$lt": new Date("2016-04-01")
}},
{ "date": {
"$gte": new Date("2016-04-01"),
"$lt": new Date("2016-08-01")
}}
]
}},
// Pre-filter the array content to matching elements
{ "$project": {
"lines": {
"$filter": {
"input": "$lines",
"as": "line",
"cond": {
"$and": [
{ "$eq": [ "$$line.status", "Closed" ] },
{ "$eq": [ "$$line.deliveryMethod", "Tech Delivers" ] }
]
}
}
}
}},
// Unwind the array
{ "$unwind": "$lines" },
// Then group on the productline values within the array
{ "$group":{
"_id": "$lines.productLine",
"countA": {
"$sum": {
"$cond": [
{ "$and": [
{ "$gte": [ "$date", new Date("2016-01-01") ] },
{ "$lt": [ "$date", new Date("2016-04-01") ] }
]},
1,
0
]
}
},
"countB": {
"$sum": {
"$cond": [
{ "$and": [
{ "$gte": [ "$date", new Date("2016-04-01") ] },
{ "$lt": [ "$date", new Date("2016-08-01") ] }
]},
1,
0
]
}
}
}}
])
Or for at least MongoDB 2.6, then apply $redact instead:
db.getCollection('customers').aggregate([
// Filter all document conditions first. Reduces things to process.
{ "$match": {
"status": "Closed",
"lines": { "$elemMatch": {
"status": "Closed",
"deliveryMethod": "Tech Delivers"
}},
"$or": [
{ "date": {
"$gte": new Date("2016-01-01"),
"$lt": new Date("2016-04-01")
}},
{ "date": {
"$gte": new Date("2016-04-01"),
"$lt": new Date("2016-08-01")
}}
]
}},
// Pre-filter the array content to matching elements
{ "$redact": {
"$cond": {
"if": {
"$and": [
{ "$eq": [ "$status", "Closed" ] },
{ "$eq": [
{ "$ifNull": ["$deliveryMethod", "Tech Delivers" ] },
"Tech Delivers"
]}
]
},
"then": "$$DESCEND",
"else": "$$PRUNE"
}
}},
// Unwind the array
{ "$unwind": "$lines" },
// Then group on the productline values within the array
{ "$group":{
"_id": "$lines.productLine",
"countA": {
"$sum": {
"$cond": [
{ "$and": [
{ "$gte": [ "$date", new Date("2016-01-01") ] },
{ "$lt": [ "$date", new Date("2016-04-01") ] }
]},
1,
0
]
}
},
"countB": {
"$sum": {
"$cond": [
{ "$and": [
{ "$gte": [ "$date", new Date("2016-04-01") ] },
{ "$lt": [ "$date", new Date("2016-08-01") ] }
]},
1,
0
]
}
}
}}
])
Note the funny little $ifNull in there, which is necessary due to the recursive nature of $$DESCEND: all levels of the document are inspected, including the "top level" document, before "descending" into subsequent arrays and members or even nested objects. The "status" field is present and has a value of "Closed" due to the earlier query selection criteria for the top level field, but of course there is no "top level" element called "deliveryMethod", since it exists only within the array elements.
That is basically the "care" that needs to be taken when using $redact like this. If the structure of the document does not allow such conditions, then it's not really an option, so revert to processing with $unwind then $match instead.
But where possible, use those methods in preference to the $unwind then $match processing, as it will save considerable time and use less resources by using the newer techniques instead.
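To illustrate why the $ifNull default matters with $$DESCEND, here is a rough plain JavaScript imitation of the recursion. It is only a sketch of the idea and not how the server implements $redact; the redact helper, the sample document and the "Courier" value are all made up for the example.

```javascript
// Treat only plain objects as "sub-documents" (Dates etc. are just values).
const isSubDoc = v => Object.prototype.toString.call(v) === "[object Object]";

// Hypothetical stand-in for $redact with $$DESCEND / $$PRUNE.
function redact(doc, keep) {
  if (!keep(doc)) return undefined;                  // like $$PRUNE
  const out = {};
  for (const [key, value] of Object.entries(doc)) {  // like $$DESCEND
    if (Array.isArray(value)) {
      out[key] = value
        .map(el => (isSubDoc(el) ? redact(el, keep) : el))
        .filter(el => el !== undefined);
    } else if (isSubDoc(value)) {
      const kept = redact(value, keep);
      if (kept !== undefined) out[key] = kept;
    } else {
      out[key] = value;
    }
  }
  return out;
}

// With the default, a level that has no "deliveryMethod" still passes.
const withIfNull = d =>
  d.status === "Closed" &&
  (d.deliveryMethod == null ? "Tech Delivers" : d.deliveryMethod) === "Tech Delivers";

// Without it, the top level document fails the test and everything is pruned.
const withoutIfNull = d =>
  d.status === "Closed" && d.deliveryMethod === "Tech Delivers";

const order = {
  status: "Closed",
  lines: [
    { status: "Closed", deliveryMethod: "Tech Delivers", productLine: "X" },
    { status: "Closed", deliveryMethod: "Courier", productLine: "Y" },
    { status: "Open", deliveryMethod: "Tech Delivers", productLine: "Z" }
  ]
};

console.log(redact(order, withIfNull).lines.length); // 1
console.log(redact(order, withoutIfNull));           // undefined
```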
I have a collection of documents:
date: Date
users: [
{ user: 1, group: 1 }
{ user: 5, group: 2 }
]
date: Date
users: [
{ user: 1, group: 1 }
{ user: 3, group: 2 }
]
I would like to query against this collection to find all documents where every user id in my array of users is in another array, [1, 5, 7]. In this example, only the first document matches.
The best solution I've been able to find is to do:
$where: function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
Unfortunately, this seems to hurt performance, as stated in the $where docs:
$where evaluates JavaScript and cannot take advantage of indexes.
How can I improve this query?
The query you want is this:
db.collection.find({"users":{"$not":{"$elemMatch":{"user":{$nin:[1,5,7]}}}}})
This says find me all documents that don't have elements that are outside of the list 1,5,7.
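The double negative is just De Morgan's law applied to the array. A quick plain JavaScript check, using the two sample documents from the question, shows both phrasings agree:

```javascript
const ids = [1, 5, 7];
const docs = [
  { users: [{ user: 1, group: 1 }, { user: 5, group: 2 }] },
  { users: [{ user: 1, group: 1 }, { user: 3, group: 2 }] }
];

// What the $where function asks: every user is in the list.
const everyInList = d => d.users.every(u => ids.includes(u.user));

// What $not + $elemMatch + $nin asks: no user is outside the list.
const noneOutside = d => !d.users.some(u => !ids.includes(u.user));

const results = docs.map(d => [everyInList(d), noneOutside(d)]);
console.log(results); // [ [ true, true ], [ false, false ] ]
```

One thing to keep in mind: as far as I know, a document whose users array is empty or missing also satisfies the $not form, so add a further condition if those should be excluded.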
I don't know about better, but there are a few different ways to approach this, and depending on the version of MongoDB you have available.
Not too sure if this is your intention or not, but the query as shown will match the first document example because as your logic is implemented you are matching the elements within that document's array that must be contained within the sample array.
So if you actually wanted the document to contain all of those elements, then the $all operator would be the obvious choice:
db.collection.find({ "users.user": { "$all": [ 1, 5, 7 ] } })
But working with the presumption that your logic is actually intended, at least as per suggestion you can "filter" those results by combining with the $in operator so that there are fewer documents subject to your $where condition in evaluated JavaScript:
db.collection.find({
"users.user": { "$in": [ 1, 5, 7 ] },
"$where": function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
})
That way an index can be used, though the number of items actually scanned will be multiplied by the number of elements in the arrays of the matched documents. It is still better than without the additional filter.
Or even possibly you consider the logical abstraction of the $and operator used in combination with $or and possibly the $size operator depending on your actual array conditions:
db.collection.find({
"$or": [
{ "users.user": { "$all": [ 1, 5, 7 ] } },
{ "users.user": { "$all": [ 1, 5 ] } },
{ "users.user": { "$all": [ 1, 7 ] } },
{ "users": { "$size": 1 }, "users.user": 1 },
{ "users": { "$size": 1 }, "users.user": 5 },
{ "users": { "$size": 1 }, "users.user": 7 }
]
})
So this is a generation of all of the possible permutations of your matching condition, but again performance will likely vary depending on your available installed version.
NOTE: This is actually a complete fail in this case, as it does something entirely different and in fact results in a logical $in.
Alternates are with the aggregation framework, your mileage may vary on which is most efficient due to the number of documents in your collection, one approach with MongoDB 2.6 and upwards:
db.problem.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Just keeping the "user" element value
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
// Compare to see if all elements are a member of the desired match
{ "$project": {
"match": { "$setEquals": [
{ "$setIntersection": [ "$users", [ 1, 5, 7 ] ] },
"$users"
]}
}},
// Filter out any documents that did not match
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
So that approach uses some newly introduced set operators in order to compare the contents, though of course you need to restructure the array in order to make the comparison.
As pointed out, there is a direct operator to do this in $setIsSubset which does the equivalent of the combined operators above in a single operator:
db.collection.aggregate([
{ "$match": {
"users.user": { "$in": [ 1,5,7 ] }
}},
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
{ "$unwind": "$users" },
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
{ "$project": {
"match": { "$setIsSubset": [ "$users", [ 1, 5, 7 ] ] }
}},
{ "$match": { "match": true } },
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
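For anyone wondering why the two forms are interchangeable, here is a small plain JavaScript sketch of the set logic. The helper names are made up and only imitate $setIsSubset, $setIntersection and $setEquals; they are not MongoDB calls.

```javascript
const allowed = new Set([1, 5, 7]);

// Imitation of $setIsSubset: every value is a member of the set.
const isSubset = (values, set) => values.every(v => set.has(v));

// Imitations of $setIntersection and $setEquals (duplicates are ignored).
const intersection = (values, set) => values.filter(v => set.has(v));
const setEquals = (a, b) => {
  const sa = new Set(a);
  const sb = new Set(b);
  return sa.size === sb.size && [...sa].every(v => sb.has(v));
};

// The "two operator" form: intersect with the list, then compare to the original.
const viaIntersection = values =>
  setEquals(intersection(values, allowed), values);

console.log(isSubset([1, 5], allowed), viaIntersection([1, 5])); // true true
console.log(isSubset([1, 3], allowed), viaIntersection([1, 3])); // false false
```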
Or with a different approach while still taking advantage of the $size operator from MongoDB 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
// and a note of its current size
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
"size": { "$size": "$users" }
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
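The idea behind the size comparison, reduced to plain JavaScript as a sketch (the function name is made up, and the sample documents are the ones from the question):

```javascript
const ids = [1, 5, 7];

// A document passes when the number of matching elements
// equals the original size of the array.
function allUsersAllowed(doc) {
  const size = doc.users.length;
  const count = doc.users.filter(u => ids.includes(u.user)).length;
  return size === count;
}

const first = { users: [{ user: 1, group: 1 }, { user: 5, group: 2 }] };
const second = { users: [{ user: 1, group: 1 }, { user: 3, group: 2 }] };

console.log(allUsersAllowed(first));  // true
console.log(allUsersAllowed(second)); // false
```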
Which of course can still be done, though a little more long winded in versions prior to 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Group it back to get its original size
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users" },
"size": { "$sum": 1 }
}},
// Unwind the array copy again
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
That generally rounds out the different ways, try them out and see what works best for you. In all likelihood the simple combination of $in with your existing form is probably going to be the best one. But in all cases, make sure you have an index that can be selected:
db.collection.ensureIndex({ "users.user": 1 })
Which is going to give you the best performance as long as you are accessing that in some way, as all the examples here do.
Verdict
I was intrigued by this so ultimately contrived a test case in order to see what had the best performance. So first some test data generation:
var batch = [];
for ( var n = 1; n <= 10000; n++ ) {
var elements = Math.floor(Math.random(10)*10)+1;
var obj = { date: new Date(), users: [] };
for ( var x = 0; x < elements; x++ ) {
var user = Math.floor(Math.random(10)*10)+1,
group = Math.floor(Math.random(10)*10)+1;
obj.users.push({ user: user, group: group });
}
batch.push( obj );
if ( n % 500 == 0 ) {
db.problem.insert( batch );
batch = [];
}
}
With 10000 documents in a collection with random arrays from 1..10 in length holding random values of 1..10, I came to a match count of 430 documents (reduced from 7749 from the $in match ) with the following results (avg):
JavaScript with $in clause: 420ms
Aggregate with $size : 395ms
Aggregate with group array count : 650ms
Aggregate with two set operators : 275ms
Aggregate with $setIsSubset : 250ms
Noting that over the samples done all but the last two had a peak variance of approximately 100ms faster, and the last two both exhibited 220ms response. The largest variations were in the JavaScript query which also exhibited results 100ms slower.
But the point here is relative to hardware, which on my laptop under a VM is not particularly great, but gives an idea.
So the aggregate, and specifically the MongoDB 2.6.1 version with set operators clearly wins on performance with the additional slight gain coming from $setIsSubset as a single operator.
This is particularly interesting given (as indicated by the 2.4 compatible method) the largest cost in this process will be the $unwind statement ( over 100ms avg ), so with the $in selection having a mean around 32ms the rest of the pipeline stages execute in less than 100ms on average. So that gives a relative idea of aggregation versus JavaScript performance.
I just spent a substantial portion of my day trying to implement Asya's solution above with object-comparisons rather than strict equality. So I figured I'd share it here.
Let's say you expanded your question from userIds to full users.
You want to find all documents where every item in its users array is present in another users array: [{user: 1, group: 3}, {user: 2, group: 5},...]
This won't work: db.collection.find({"users":{"$not":{"$elemMatch":{"$nin":[{user: 1, group: 3},{user: 2, group: 5},...]}}}}) because $nin only works for strict equality. So we need to find a different way of expressing "Not in array" for arrays of objects. And using $where would slow down the query too much.
Solution:
db.collection.find({
"users": {
"$not": {
"$elemMatch": {
// if all of the OR-blocks are true, element is not in array
"$and": [{
// each OR-block == true if element != that user
"$or": [
{ "user": { "$ne": 1 } },
{ "group": { "$ne": 3 } }
]
}, {
"$or": [
{ "user": { "$ne": 2 } },
{ "group": { "$ne": 5 } }
]
}, {
// more users...
}]
}
}
}
})
To round out the logic: $elemMatch matches all documents that have a user not in the array. So $not will match all documents that have all of the users in the array.
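The same logic in plain JavaScript, as a sketch with made-up sample documents, in case the nesting of $not, $and and $or is hard to follow:

```javascript
const allowed = [{ user: 1, group: 3 }, { user: 2, group: 5 }];

// The $and of $or blocks: the element differs from EVERY allowed entry
// (each $or block is true when user OR group is different).
const notInAllowed = el =>
  allowed.every(a => el.user !== a.user || el.group !== a.group);

// $not + $elemMatch: no element of the array is "not in allowed".
const matches = doc => !doc.users.some(notInAllowed);

// Hypothetical documents for illustration.
const good = { users: [{ user: 1, group: 3 }, { user: 2, group: 5 }] };
const bad = { users: [{ user: 1, group: 3 }, { user: 1, group: 5 }] };

console.log(matches(good)); // true
console.log(matches(bad));  // false
```

The second document fails even though user 1 and group 5 both appear in the allowed list, because they never appear together as one pair.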
I have a collection set with documents like :
{
"_id": ObjectId("57065ee93f0762541749574e"),
"name": "myName",
"results" : [
{
"_id" : ObjectId("570e3e43628ba58c1735009b"),
"color" : "GREEN",
"week" : 17,
"year" : 2016
},
{
"_id" : ObjectId("570e3e43628ba58c1735009d"),
"color" : "RED",
"week" : 19,
"year" : 2016
}
]
}
I am trying to build a query which allows me to return all documents of my collection, but with the 'results' field containing only the subdocuments where week > X and year > Y.
I can select the documents where week > X and year > Y with the aggregate function and a $match but I miss documents with no match.
So far, here is my function :
query = ModelUser.aggregate([
{$unwind:{path:'$results', preserveNullAndEmptyArrays:true}},
{$match:{
$or: [
{$and:[
{'results.week':{$gte:parseInt(week)}},
{'results.year':{$eq:parseInt(year)}}
]},
{'results.year':{$gt:parseInt(year)}},
{'results.week':{$exists: false}}
]
}},
{$group:{
_id: {
_id:'$_id',
name: '$name'
},
results: {$push:{
_id:'$results._id',
color: '$results.color',
numSemaine: '$results.numSemaine',
year: '$results.year'
}}
}},
{$project: {
_id: '$_id._id',
name: '$_id.name',
results: '$results'
}}
]);
The only thing I miss is : I have to get all 'name' even if there is no result to display.
Any idea how to do this without 2 queries ?
It looks like you actually have MongoDB 3.2, so use $filter on the array. This will just return an "empty" array [] where the conditions supplied did not match anything:
db.collection.aggregate([
{ "$project": {
"name": 1,
"user": 1,
"results": {
"$filter": {
"input": "$results",
"as": "result",
"cond": {
"$and": [
{ "$eq": [ "$$result.year", year ] },
{ "$or": [
{ "$gt": [ "$$result.week", week ] },
{ "$not": { "$ifNull": [ "$$result.week", false ] } }
]}
]
}
}
}
}}
])
Using the $ifNull test in place of $exists as a logical form can actually "compact" the condition, since it returns an alternate value where the property is not present:
db.collection.aggregate([
{ "$project": {
"name": 1,
"user": 1,
"results": {
"$filter": {
"input": "$results",
"as": "result",
"cond": {
"$and": [
{ "$eq": [ "$$result.year", year ] },
{ "$gt": [
{ "$ifNull": [ "$$result.week", week+1 ] },
week
]}
]
}
}
}
}}
])
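To see that the compact condition behaves the same as the longer one, here is a plain JavaScript sketch of both. The values for year and week and the extra "BLUE" and "GREY" entries are made up for the example.

```javascript
const year = 2016;
const week = 17;

const results = [
  { color: "GREEN", week: 17, year: 2016 },
  { color: "RED", week: 19, year: 2016 },
  { color: "BLUE", year: 2016 },            // hypothetical: no "week" at all
  { color: "GREY", week: 40, year: 2015 }   // hypothetical: wrong year
];

// Long form: week greater than the value OR week not present.
const longForm = r =>
  r.year === year && (r.week > week || r.week == null);

// Compact form: substitute week+1 when missing, so the $gt test passes.
const compactForm = r =>
  r.year === year && (r.week == null ? week + 1 : r.week) > week;

const viaLong = results.filter(longForm).map(r => r.color);
const viaCompact = results.filter(compactForm).map(r => r.color);

console.log(viaLong);    // [ 'RED', 'BLUE' ]
console.log(viaCompact); // [ 'RED', 'BLUE' ]
```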
In MongoDB 2.6 releases, you can probably get away with using $redact and $$DESCEND, but of course need to fake the match in the top level document. This has similar usage of the $ifNull operator:
db.collection.aggregate([
{ "$redact": {
"$cond": {
"if": {
"$and": [
{ "$eq": [{ "$ifNull": [ "$year", year ] }, year ] },
{ "$gt": [
{ "$ifNull": [ "$week", week+1 ] },
week
]}
]
},
"then": "$$DESCEND",
"else": "$$PRUNE"
}
}}
])
If you actually have MongoDB 2.4, then you are probably better off filtering the array content in client code instead. Every language has methods for filtering array content, but as a JavaScript example reproducible in the shell:
db.collection.find().forEach(function(doc) {
doc.results = doc.results.filter(function(result) {
return (
result.year == year &&
( result.hasOwnProperty('week') ? result.week > week : true )
);
});
printjson(doc);
})
The reason being is that prior to MongoDB 2.6 you need to use $unwind and $group, and various stages in-between. This is a "very costly" operation on the server, considering that all you want to do is remove items from the arrays of documents and not actually "aggregate" from items within the array.
MongoDB releases have gone to great lengths to provide array processing that does not use $unwind, since its usage for that purpose alone is not a performant option. It should only ever be used in the case where you are removing a "significant" amount of data from arrays as a result.
The whole point is that otherwise the "cost" of the aggregation operation is likely greater than the "cost" of transferring the data over the network to be filtered on the client instead. Use with caution:
db.collection.aggregate([
// Create an array if one does not exist or is already empty
{ "$project": {
"name": 1,
"user": 1,
"results": {
"$cond": [
{ "$ifNull": [ "$results.0", false ] },
"$results",
[false]
]
}
}},
// Unwind the array
{ "$unwind": "$results" },
// Conditionally $push based on match expression and conditionally count
{ "$group": {
"_id": "$_id",
"name": { "$first": "$name" },
"user": { "$first": "$user" },
"results": {
"$push": {
"$cond": [
{ "$or": [
{ "$not": "$results" },
{ "$and": [
{ "$eq": [ "$results.year", year ] },
{ "$gt": [
{ "$ifNull": [ "$results.week", week+1 ] },
week
]}
]}
] },
"$results",
false
]
}
},
"count": {
"$sum": {
"$cond": [
{ "$and": [
{ "$eq": [ "$results.year", year ] },
{ "$gt": [
{ "$ifNull": [ "$results.week", week+1 ] },
week
]}
] },
1,
0
]
}
}
}},
// $unwind again
{ "$unwind": "$results" },
// Filter out false items unless count is 0
{ "$match": {
"$or": [
{ "results": { "$ne": false } },
{ "count": 0 }
]
}},
// Group again
{ "$group": {
"_id": "$_id",
"name": { "$first": "$name" },
"user": { "$first": "$user" },
"results": { "$push": "$results" }
}},
// Now swap [false] for []
{ "$project": {
"name": 1,
"user": 1,
"results": {
"$cond": [
{ "$ne": [ "$results", [false] ] },
"$results",
[]
]
}
}}
])
Now that is a lot of operations and shuffling just to "filter" content from an array compared to all of the other approaches which are really quite simple. And aside from the complexity, it really does "cost" a lot more to execute on the server.
So if your server version actually supports the newer operators that can do this optimally, then it's okay to do so. But if you are stuck with that last process, then you probably should not be doing it and instead do your array filtering in the client.
I want to fetch "all the documents" having the highest value for a specific field and then group by another field.
Consider below data:
_id:1, country:india, quantity:12, name:xyz
_id:2, country:USA, quantity:5, name:abc
_id:3, country:USA, quantity:6, name:xyz
_id:4, country:india, quantity:8, name:def
_id:5, country:USA, quantity:10, name:jkl
_id:6, country:india, quantity:12, name:jkl
Answer should be
country:india max-quantity:12
name xyz
name jkl
country:USA max-quantity:10
name jkl
I have tried several queries, but I can only get the max value without the name, or I can group by but then it shows all the values.
db.coll.aggregate([{
$group:{
_id:"$country",
"maxQuantity":{$max:"$quantity"}
}
}])
For example, the above will give the max quantity for every country, but how do I combine it with the other field so that it shows all the documents with the max quantity?
If you want to keep document information, then you basically need to $push it into an array. But of course, then having your $max values, you need to filter the contents of the array for just the elements that match:
db.coll.aggregate([
{ "$group":{
"_id": "$country",
"maxQuantity": { "$max": "$quantity" },
"docs": { "$push": {
"_id": "$_id",
"name": "$name",
"quantity": "$quantity"
}}
}},
{ "$project": {
"maxQuantity": 1,
"docs": {
"$setDifference": [
{ "$map": {
"input": "$docs",
"as": "doc",
"in": {
"$cond": [
{ "$eq": [ "$maxQuantity", "$$doc.quantity" ] },
"$$doc",
false
]
}
}},
[false]
]
}
}}
])
So you store everything in an array and then test each array member to see if its value matches the one that was recorded as the maximum, discarding any that do not.
I'd keep the _id values in the array documents since that is what makes them "unique" and won't be adversely affected by $setDifference when filtering out values. But of course if "name" is always unique then it won't be required.
You can also just return whatever fields you want from $map, but I'm just returning the whole document for example.
Keep in mind that this has the limitation of not exceeding the BSON size limit of 16MB, so it is okay for small data samples, but anything producing a potentially large list ( since you cannot pre-filter array content ) would be better off processed with a separate query to find the "max" values, and another to fetch the matching documents.
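The "two query" alternative boils down to two passes over the data. As a sketch in plain JavaScript over the sample data from the question (in practice each pass would be a query against the collection, and the variable names here are just for illustration):

```javascript
const docs = [
  { _id: 1, country: "india", quantity: 12, name: "xyz" },
  { _id: 2, country: "USA", quantity: 5, name: "abc" },
  { _id: 3, country: "USA", quantity: 6, name: "xyz" },
  { _id: 4, country: "india", quantity: 8, name: "def" },
  { _id: 5, country: "USA", quantity: 10, name: "jkl" },
  { _id: 6, country: "india", quantity: 12, name: "jkl" }
];

// Pass 1: find the maximum quantity per country (the $group with $max).
const maxByCountry = {};
for (const d of docs) {
  if (!(d.country in maxByCountry) || d.quantity > maxByCountry[d.country]) {
    maxByCountry[d.country] = d.quantity;
  }
}

// Pass 2: fetch every document whose quantity equals that maximum.
const namesByCountry = {};
for (const d of docs) {
  if (d.quantity === maxByCountry[d.country]) {
    (namesByCountry[d.country] = namesByCountry[d.country] || []).push(d.name);
  }
}

console.log(maxByCountry);   // { india: 12, USA: 10 }
console.log(namesByCountry); // { india: [ 'xyz', 'jkl' ], USA: [ 'jkl' ] }
```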
I know how to do a similar task more simply, but only if you restrict it to a specific set of countries:
[
{"$match":{"country":{"$in":["USA","india"]}}}, // stage one
{ "$sort": { "quantity": -1 }}, // stage two
{"$limit":2 } // stage three - count equal to ["USA","india"] length
]
If you need all countries, try the following, but without guarantees from me:
[
{"$project": {
"country": "$country",
"quantity": "$quantity",
"document": "$$ROOT" // save all fields for future usage
}},
{ "$sort": { "quantity": -1 }},
{"$group":{"_id":{"country":"$country"},"original_doc":{"$first":"$document"} }}
]
Another way can be like:
db.coll.aggregate(
[
{
$sort:{ country: -1, "quantity": -1 }
},
{
"$group":
{
"_id":{ "country": "$country" },
"data":{ "$first": "$$ROOT" }
}
}
])
Another possibility, close to Blakes Seven's solution, is to simplify the $setDifference + $map part a bit by using $filter on the array of documents.
db.coll.aggregate([
{ "$group":{
"_id": "$country",
"maxQuantity": { "$max": "$quantity" },
"docs": { "$push": {
"_id": "$_id",
"name": "$name",
"quantity": "$quantity"
}}
}},
{ "$project": {
"maxQuantity": 1,
"docs": {
"$filter": {
"input": "$docs",
"as": "doc",
"cond": { $eq: ["$$doc.quantity", "$maxQuantity"] }
}
}
}}
])
I am struggling with an aggregation in mongodb. I have the following type of documents:
{
"_id": "xxxx",
"workHome": true,
"commute": true,
"tel": false,
"weekend": true,
"age":39
},
{
"_id": "yyyy",
"workHome": false,
"commute": true,
"tel": false,
"weekend": true,
"age":32
},
{
"_id": "zzzz",
"workHome": false,
"commute": false,
"tel": false,
"weekend": false,
"age":27
}
Out of this I want to generate an aggregation by the total number of fields that are "true" in the document. There are a total of 4 boolean fields in the document so I want the query to group them together to generate the following output (as examples from e.g. a collection with 100 documents in total):
0:20
1:30
2:10
3:20
4:20
This means: There are 20 documents out of 100 with 'all false', 30 documents with '1x true', 10 documents with '2x true' etc. up to a total of 'all 4 are true'.
Is there any way to do this with an $aggregate statement? Right now I am trying to $group by the $sum of 'true' values but don't find a way to get the conditional query to work.
So assuming that the data is consistent with all the same fields as "workHome", "commute", "tel" and "weekend", then you would proceed with a "logical" evaluation such as this:
db.collection.aggregate([
{ "$project": {
"mapped": { "$map": {
"input": ["A","B","C","D"],
"as": "el",
"in": { "$cond": [
{ "$eq": [ "$$el", "A" ] },
"$workHome",
{ "$cond": [
{ "$eq": [ "$$el", "B" ] },
"$commute",
{ "$cond": [
{ "$eq": [ "$$el", "C" ] },
"$tel",
"$weekend"
]}
]}
]}
}}
}},
{ "$unwind": "$mapped" },
{ "$group": {
"_id": "$_id",
"size": { "$sum": { "$cond": [ "$mapped", 1, 0 ] } }
}},
{ "$group": {
"_id": "$size",
"count": { "$sum": 1 }
}},
{ "$sort": { "_id": 1 } }
])
From your simple sample this gives:
{ "_id" : 0, "count" : 1 }
{ "_id" : 2, "count" : 1 }
{ "_id" : 3, "count" : 1 }
To break this down, first the $map operator here transposes the values of the fields to an array of the same length as the fields themselves. This is done by comparing each element of the "input" to an expected value via $cond and either returning the true condition where there is a match, or moving on to the next condition embedded in the false part of this "ternary" operator. This is done until all logical matches are met and results in an array of values from the fields like so, for the first document:
[true,true,false,true]
The next step is to $unwind the array elements for further comparison. This "de-normalizes" into separate documents for each array element, and is usually required in aggregation pipelines when processing arrays.
Once that is done a $group pipeline stage is invoked, in order to assess the "total" of those elements with a true value. The same $cond ternary is used to transform the logical true/false conditions into numeric values here, which are fed to the $sum accumulator for addition.
Since the "grouping key" provided in _id in the $group is the original document _id value, the current totals are per document for those fields that are true. In order to get totals on the "counts" over the whole collection ( or selection ) then the further $group stage is invoked with the grouping key being the returned "size" of the matched true results from each document.
The $sum accumulator used there simply adds 1 for each match on the grouping key, thus "counting" the number of occurrences of each match count.
Finally $sort by the number of matches "key" to produce some order to the results.
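If it helps to see the whole thing outside of the pipeline, this is roughly the same computation as a plain JavaScript sketch over the three sample documents (the variable names are just for illustration):

```javascript
const docs = [
  { _id: "xxxx", workHome: true, commute: true, tel: false, weekend: true, age: 39 },
  { _id: "yyyy", workHome: false, commute: true, tel: false, weekend: true, age: 32 },
  { _id: "zzzz", workHome: false, commute: false, tel: false, weekend: false, age: 27 }
];

const fields = ["workHome", "commute", "tel", "weekend"];

const histogram = {};
for (const d of docs) {
  // First $group: number of true fields for this document.
  const size = fields.filter(f => d[f] === true).length;
  // Second $group: count how many documents share that number.
  histogram[size] = (histogram[size] || 0) + 1;
}

console.log(histogram); // { '0': 1, '2': 1, '3': 1 }
```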
For the record, this is so much nicer with the upcoming release of MongoDB ( as of writing ) which includes the $filter operator:
db.collection.aggregate([
{ "$group": {
"_id": {
"$size": {
"$filter": {
"input": { "$map": {
"input": ["A","B","C","D"],
"as": "el",
"in": { "$cond": [
{ "$eq": [ "$$el", "A" ] },
"$workHome",
{ "$cond": [
{ "$eq": [ "$$el", "B" ] },
"$commute",
{ "$cond": [
{ "$eq": [ "$$el", "C" ] },
"$tel",
"$weekend"
]}
]}
]}
}},
"as": "el",
"cond": {
"$eq": [ "$$el", true ]
}
}
}
},
"count": { "$sum": 1 }
}},
{ "$sort": { "_id": 1 } }
])
So now just "two" pipeline stages doing the same thing as the original statement that will work from MongoDB 2.6 and above.
Therefore if your own application is in "development" itself, or you are otherwise curious, then take a look at the Development Branch releases where this functionality is available now.
I have a collection of documents:
date: Date
users: [
{ user: 1, group: 1 }
{ user: 5, group: 2 }
]
date: Date
users: [
{ user: 1, group: 1 }
{ user: 3, group: 2 }
]
I would like to query against this collection to find all documents where every user id in my array of users is in another array, [1, 5, 7]. In this example, only the first document matches.
The best solution I've been able to find is to do:
$where: function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
Unfortunately, this seems to hurt performance, as stated in the $where docs:
$where evaluates JavaScript and cannot take advantage of indexes.
How can I improve this query?
The query you want is this:
db.collection.find({"users":{"$not":{"$elemMatch":{"user":{$nin:[1,5,7]}}}}})
This says find me all documents that don't have elements that are outside of the list 1,5,7.
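The double negative can be hard to read, so here is the same logic in plain JavaScript. This is only a sketch of the reasoning, not of how the server evaluates the query; the helper names are invented for illustration and the sample data is the two documents from the question.

```javascript
const ids = [1, 5, 7];

const docs = [
  { users: [ { user: 1, group: 1 }, { user: 5, group: 2 } ] },
  { users: [ { user: 1, group: 1 }, { user: 3, group: 2 } ] }
];

// "$elemMatch" with "$nin": at least one element is outside the list
function hasOutsider(doc) {
  return doc.users.some(function (u) { return ids.indexOf(u.user) === -1; });
}

// "$not": no such element exists, so every element is inside the list
function allUsersAllowed(doc) {
  return !hasOutsider(doc);
}

const matched = docs.filter(allUsersAllowed);
console.log(matched.length); // 1, only the first document
```

One consequence of phrasing it as "no element is outside the list" is that a document with an empty users array also satisfies the condition, just as [].every(...) returns true in the $where version.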
I don't know about better, but there are a few different ways to approach this, depending on the version of MongoDB you have available.
Not too sure if this is your intention or not, but the query as shown will match the first document example because, as your logic is implemented, every element within that document's array must be contained within the sample array.
So if you actually wanted the document to contain all of those elements, then the $all operator would be the obvious choice:
db.collection.find({ "users.user": { "$all": [ 1, 5, 7 ] } })
But working with the presumption that your logic is actually intended, at least as per suggestion you can "filter" those results by combining with the $in operator so that there are fewer documents subject to your $where condition in evaluated JavaScript:
db.collection.find({
"users.user": { "$in": [ 1, 5, 7 ] },
"$where": function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
})
And you get to use an index, though the actual number scanned will be multiplied by the number of elements in the arrays from the matched documents. That is still better than without the additional filter.
Or even possibly you consider the logical abstraction of the $and operator used in combination with $or and possibly the $size operator depending on your actual array conditions:
db.collection.find({
"$or": [
{ "users.user": { "$all": [ 1, 5, 7 ] } },
{ "users.user": { "$all": [ 1, 5 ] } },
{ "users.user": { "$all": [ 1, 7 ] } },
{ "users": { "$size": 1 }, "users.user": 1 },
{ "users": { "$size": 1 }, "users.user": 5 },
{ "users": { "$size": 1 }, "users.user": 7 }
]
})
So this is a generation of all of the possible permutations of your matching condition, but again performance will likely vary depending on your available installed version.
NOTE: Actually a complete fail in this case as this does something entirely different and in fact results in a logical $in
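A small plain JavaScript sketch may help show why the $all form does not express the original condition: $all only requires the listed values to be present, so an array with extra users still matches. The helper names and the sample document are hypothetical, and the functions only emulate the operators' semantics as I understand them.

```javascript
// Emulates { "users.user": { "$all": wanted } }: every wanted value must be present
function matchesAll(doc, wanted) {
  const present = doc.users.map(function (u) { return u.user; });
  return wanted.every(function (id) { return present.indexOf(id) !== -1; });
}

// The check the question asks for: every element is in the allowed list
function isSubset(doc, allowed) {
  return doc.users.every(function (u) { return allowed.indexOf(u.user) !== -1; });
}

// Hypothetical document with one user (9) outside the allowed list
const doc = {
  users: [ { user: 1, group: 1 }, { user: 5, group: 2 }, { user: 9, group: 3 } ]
};

console.log(matchesAll(doc, [1, 5]));  // true: 1 and 5 are both present
console.log(isSubset(doc, [1, 5, 7])); // false: 9 is not allowed
```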
Alternatives use the aggregation framework. Your mileage may vary on which is most efficient, depending on the number of documents in your collection. One approach with MongoDB 2.6 and upwards:
db.problem.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Just keeping the "user" element value
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
// Compare to see if all elements are a member of the desired match
{ "$project": {
"match": { "$setEquals": [
{ "$setIntersection": [ "$users", [ 1, 5, 7 ] ] },
"$users"
]}
}},
// Filter out any documents that did not match
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
So that approach uses some newly introduced set operators in order to compare the contents, though of course you need to restructure the array in order to make the comparison.
As pointed out, there is a direct operator to do this in $setIsSubset which does the equivalent of the combined operators above in a single operator:
db.collection.aggregate([
{ "$match": {
"users.user": { "$in": [ 1,5,7 ] }
}},
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
{ "$unwind": "$users" },
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
{ "$project": {
"match": { "$setIsSubset": [ "$users", [ 1, 5, 7 ] ] }
}},
{ "$match": { "match": true } },
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
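To see why the single operator can stand in for the combined pair, here is a plain JavaScript sketch of the set semantics involved. The three functions are named after the operators but are only emulations written for this illustration, and the sample arrays are made up.

```javascript
const allowed = [1, 5, 7];

// $setIntersection: distinct values present in both arrays
function setIntersection(a, b) {
  const inB = new Set(b);
  return Array.from(new Set(a)).filter(function (v) { return inB.has(v); });
}

// $setEquals: same distinct values, ignoring order and duplicates
function setEquals(a, b) {
  const sa = new Set(a), sb = new Set(b);
  if (sa.size !== sb.size) return false;
  return Array.from(sa).every(function (v) { return sb.has(v); });
}

// $setIsSubset: every value of the first array appears in the second
function setIsSubset(a, b) {
  const inB = new Set(b);
  return a.every(function (v) { return inB.has(v); });
}

// Hypothetical "users.user" arrays as produced by the $group stage
const samples = [ [1, 5], [1, 3], [7, 7, 1], [2] ];
const results = samples.map(function (users) {
  return {
    users: users,
    twoOperators: setEquals(setIntersection(users, allowed), users),
    singleOperator: setIsSubset(users, allowed)
  };
});
console.log(results);
```

For each sample, both columns agree: the intersection with the allowed list equals the original set exactly when nothing in it falls outside that list.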
Or with a different approach while still taking advantage of the $size operator from MongoDB 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
// and a note of its current size
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
"size": { "$size": "$users" }
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
Which of course can still be done, though a little more long winded in versions prior to 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Group it back to get its original size
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users" },
"size": { "$sum": 1 }
}},
// Unwind the array copy again
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
That generally rounds out the different ways, try them out and see what works best for you. In all likelihood the simple combination of $in with your existing form is probably going to be the best one. But in all cases, make sure you have an index that can be selected:
db.collection.ensureIndex({ "users.user": 1 })
Which is going to give you the best performance as long as you are accessing that in some way, as all the examples here do.
Verdict
I was intrigued by this so ultimately contrived a test case in order to see what had the best performance. So first some test data generation:
var batch = [];
for ( var n = 1; n <= 10000; n++ ) {
// Math.random() takes no arguments; each expression yields an integer from 1 to 10
var elements = Math.floor(Math.random()*10)+1;
var obj = { date: new Date(), users: [] };
for ( var x = 0; x < elements; x++ ) {
var user = Math.floor(Math.random()*10)+1,
group = Math.floor(Math.random()*10)+1;
obj.users.push({ user: user, group: group });
}
batch.push( obj );
if ( n % 500 == 0 ) {
db.problem.insert( batch );
batch = [];
}
}
With 10000 documents in a collection with random arrays from 1..10 in length holding random values of 1..10, I came to a match count of 430 documents (reduced from 7749 from the $in match ) with the following results (avg):
JavaScript with $in clause: 420ms
Aggregate with $size : 395ms
Aggregate with group array count : 650ms
Aggregate with two set operators : 275ms
Aggregate with $setIsSubset : 250ms
Note that over the samples taken, all but the last two had a peak variance of approximately 100ms faster, while the last two both exhibited a 220ms response. The largest variations were in the JavaScript query, which also exhibited results 100ms slower.
But the point here is relative to hardware, which on my laptop under a VM is not particularly great, but gives an idea.
So the aggregate, and specifically the MongoDB 2.6.1 version with set operators clearly wins on performance with the additional slight gain coming from $setIsSubset as a single operator.
This is particularly interesting given (as indicated by the 2.4 compatible method) the largest cost in this process will be the $unwind statement ( over 100ms avg ), so with the $in selection having a mean around 32ms the rest of the pipeline stages execute in less than 100ms on average. So that gives a relative idea of aggregation versus JavaScript performance.
I just spent a substantial portion of my day trying to implement Asya's solution above with object-comparisons rather than strict equality. So I figured I'd share it here.
Let's say you expanded your question from userIds to full users.
You want to find all documents where every item in its users array is present in another users array: [{user: 1, group: 3}, {user: 2, group: 5},...]
This won't work: db.collection.find({"users":{"$not":{"$elemMatch":{"$nin":[{user: 1, group: 3},{user: 2, group: 5},...]}}}}) because $nin only works for strict equality. So we need to find a different way of expressing "not in array" for arrays of objects, and using $where would slow down the query too much.
Solution:
db.collection.find({
"users": {
"$not": {
"$elemMatch": {
// if all of the OR-blocks are true, element is not in array
"$and": [{
// each OR-block == true if element != that user
"$or": [
{ "user": { "$ne": 1 } },
{ "group": { "$ne": 3 } }
]
}, {
"$or": [
{ "user": { "$ne": 2 } },
{ "group": { "$ne": 5 } }
]
}, {
// more users...
}]
}
}
}
})
To round out the logic: $elemMatch matches all documents that have a user not in the array. So $not will match all documents that have all of the users in the array.
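Writing one OR-block per allowed object by hand gets tedious, so the query document can be generated instead. The following is a minimal sketch under stated assumptions: buildAllUsersInQuery is a hypothetical helper name, the objects are assumed to have exactly the fields user and group, and the list is assumed to be non-empty.

```javascript
// Hypothetical helper: builds the query document for a list of allowed
// { user, group } pairs. An element is "not in the list" when, for every
// allowed pair, at least one of its fields differs.
function buildAllUsersInQuery(allowed) {
  if (allowed.length === 0) {
    // This sketch assumes at least one allowed pair
    throw new Error("allowed list must not be empty");
  }
  return {
    "users": {
      "$not": {
        "$elemMatch": {
          "$and": allowed.map(function (a) {
            return {
              "$or": [
                { "user": { "$ne": a.user } },
                { "group": { "$ne": a.group } }
              ]
            };
          })
        }
      }
    }
  };
}

const query = buildAllUsersInQuery([
  { user: 1, group: 3 },
  { user: 2, group: 5 }
]);
console.log(JSON.stringify(query, null, 2));
// In the mongo shell the result would be passed as db.collection.find(query)
```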