Usage of mapreduce in mongodb [duplicate]

I have a query where I need to return 10 of "Type A" records, while returning all other records. How can I accomplish this?
Update: Admittedly, I could do this with two queries, but I wanted to avoid that if possible, thinking it would be less overhead and possibly more performant. My query is already an aggregation query that takes both kinds of records into account; I just need to limit the number of the one type of record in the results.
Update: the following is an example query that highlights the problem:
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Fiction"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Horror"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
db.books.aggregate([
{$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
{$match: {"type": "Science"}},
{$project: {
'title': 1,
'author': 1,
'type': 1,
'typeSortOrder':
{$add: [
{$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
{$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
{$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
]},
}},
{$sort: {'typeSortOrder': 1}},
{$limit: 10}
])
I would like to have all these records returned in one query, but limit the type to at most 10 of any category.
I realize that the typeSortOrder doesn't need to be conditional when the queries are broken out like this, I had it there for when the queries were one query, originally (which is where I would like to get back to).

I don't think this is presently (2.6) possible to do with one aggregation pipeline. It's difficult to give a precise argument as to why not, but basically the aggregation pipeline performs transformations of streams of documents, one document at a time. There's no awareness within the pipeline of the state of the stream itself, which is what you'd need in order to determine that you've hit the limit for A's, B's, etc., and need to drop further documents of the same type. $group does bring multiple documents together and allows their field values in aggregate to affect the resulting group document ($sum, $avg, etc.). Maybe this makes some sense, but it's necessarily not rigorous, because there are simple operations you could add to make limiting by type possible, e.g., adding a $push x accumulator to $group that only pushes the value if the array being pushed to has fewer than x elements.
Even if I did have a way to do it, I'd recommend just doing two aggregations. Keep it simple.
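If it helps, here is a minimal sketch of that approach on a 2.6 shell, running one aggregation per type and concatenating client-side; the list of types is an assumption for illustration:
// run one aggregation per type and concatenate client-side
var types = [ "Fiction", "Science", "Horror" ];  // assumed list of types
var results = [];
types.forEach(function(type) {
    results = results.concat(db.books.aggregate([
        { "$geoNear": { "near": [-118.09771, 33.89244], "distanceField": "distance", "spherical": true } },
        { "$match": { "type": type } },
        { "$limit": 10 }
    ]).toArray());
});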

Problem
The result here is not impossible, but it is possibly impractical. The general observation has been made that you cannot "slice" an array or otherwise "limit" the number of results pushed onto one. And the method for doing this per "type" is essentially to use arrays.
The "impractical" part usually comes down to the number of results, where too large a result set is going to blow up the BSON document limit when "grouping". But I'm going to consider this along with some other recommendations on your "geo search", with the ultimate goal of returning at most 10 results of each "type".
Principle
To first consider and understand the problem, let's look at a simplified "set" of data and the pipeline code necessary to return the "top 2 results" from each type:
{ "title": "Title 1", "author": "Author 1", "type": "Fiction", "distance": 1 },
{ "title": "Title 2", "author": "Author 2", "type": "Fiction", "distance": 2 },
{ "title": "Title 3", "author": "Author 3", "type": "Fiction", "distance": 3 },
{ "title": "Title 4", "author": "Author 4", "type": "Science", "distance": 1 },
{ "title": "Title 5", "author": "Author 5", "type": "Science", "distance": 2 },
{ "title": "Title 6", "author": "Author 6", "type": "Science", "distance": 3 },
{ "title": "Title 7", "author": "Author 7", "type": "Horror", "distance": 1 }
That's a simplified view of the data and somewhat representative of the state of documents after an initial query. Now comes the trick of how to use the aggregation pipeline to get the "nearest" two results for each "type":
db.books.aggregate([
{ "$sort": { "type": 1, "distance": 1 } },
{ "$group": {
"_id": "$type",
"1": {
"$first": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
},
"books": {
"$push": {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
}
}
}},
{ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}},
{ "$unwind": "$books" },
{ "$project": {
"1": 1,
"books": 1,
"seen": { "$eq": [ "$1", "$books" ] }
}},
{ "$sort": { "_id": 1, "seen": 1 } },
{ "$group": {
"_id": "$_id",
"1": { "$first": "$1" },
"2": { "$first": "$books" },
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}},
{ "$project": {
"1": 1,
"2": 2,
"pos": { "$literal": [1,2] }
}},
{ "$unwind": "$pos" },
{ "$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [
{ "$eq": [ "$pos", 1 ] },
"$1",
{ "$cond": [
{ "$eq": [ "$pos", 2 ] },
"$2",
false
]}
]
}
}
}},
{ "$unwind": "$books" },
{ "$match": { "books": { "$ne": false } } },
{ "$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books.distance",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] }
]
}
}},
{ "$sort": { "sortOrder": 1 } }
])
Of course that is just two results, but it outlines the process for getting n results, which naturally is done in generated pipeline code. Before moving on to the code, the process deserves a walk-through.
After any query, the first thing to do here is $sort the results, and this you basically want to do by both the "grouping key", which is the "type", and by the "distance", so that the "nearest" items are on top.
The reason for this is shown in the $group stages that will repeat. What is done is essentially "popping" the $first result off of each grouping "stack". So that other documents are not lost, they are placed in an array using $push.
Just to be safe, the next stage is really only required after the "first step", but could optionally be added for similar filtering in the repetition. The main check here is that the resulting "array" is larger than just one item. Where it is not, the contents are replaced with a single value of false. The reason for which is about to become evident.
After this "first step" the real repetition cycle beings, where that array is then "de-normalized" with $unwind and then a $project made in order to "match" the document that has been last "seen".
As only one of the documents will match this condition the results are again "sorted" in order to float the "unseen" documents to the top, while of course maintaining the grouping order. The next thing is similar to the first $group step, but where any kept positions are maintained and the "first unseen" document is "popped off the stack" again.
The document that was "seen" is then pushed back to the array not as itself but as a value of false. This is not going to match the kept value and this is generally the way to handle this without being "destructive" to the array contents where you don't want the operations to fail should there not be enough matches to cover the n results required.
Cleaning up when complete, the next "projection" adds an array to the final documents, now grouped by "type", representing each position in the n results required. When this array is unwound, the documents can again be grouped back together, but now all in a single array that possibly contains several false values but is n elements long.
Finally unwind the array again, use $match to filter out the false values, and project to the required document form.
Practicality
The problem as stated earlier is with the number of results being filtered as there is a real limit on the number of results that can be pushed into an array. That is mostly the BSON limit, but you also don't really want 1000's of items even if that is still under the limit.
The trick here is keeping the initial "match" small enough that the "slicing operations" becomes practical. There are some things with the $geoNear pipeline process that can make this a possibility.
The obvious one is limit. By default this is 100, but you clearly want to have something in the range of:
(the number of categories you can possibly match) X (required matches)
For example, with three types and 10 required matches each, that is 3 X 10 = 30. If this is essentially a number not in the 1000's, then there is already some help here.
The others are maxDistance and minDistance, where essentially you put upper and lower bounds on how "far out" to search. The max bound is the general limiter while the min bound is useful when "paging", which is the next helper.
When "upwardly paging", you can use the query argument in order to exclude the _id values of documents "already seen" using the $nin query. In much the same way, the minDistance can be populated with the "last seen" largest distance, or at least the smallest largest distance by "type". This allows some concept of filtering out things that have already been "seen" and getting another page.
Really a topic in itself, but those are the general things to look for in reducing that initial match in order to make the process practical.
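As a rough sketch of those levers applied to the initial stage (the concrete values, and the lastSeenIds and lastDistance variables, are assumptions for illustration):
{ "$geoNear": {
    "near": [-118.09771, 33.89244],
    "distanceField": "distance",
    "spherical": true,
    "limit": 30,                                 // e.g. 3 categories X 10 required matches
    "maxDistance": someUpperBound,               // general limiter on how "far out" to search
    "minDistance": lastDistance,                 // when "paging": the "last seen" largest distance
    "query": { "_id": { "$nin": lastSeenIds } }  // exclude documents "already seen"
}}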
Implementing
The general problem of returning "10 results at most, per type" is clearly going to want some code in order to generate the pipeline stages. No-one wants to type that out, and practically speaking you will probably want to change that number at some point.
So now to the code that can generate the monster pipeline. All code is in JavaScript, but easy to translate in principle:
var coords = [-118.09771, 33.89244];
var key = "$type";
var val = {
"_id": "$_id",
"title": "$title",
"author": "$author",
"distance": "$distance"
};
var maxLen = 10;
var stack = [];
var pipe = [];
var fproj = { "$project": { "pos": { "$literal": [] } } };
pipe.push({ "$geoNear": {
"near": coords,
"distanceField": "distance",
"spherical": true
}});
pipe.push({ "$sort": {
"type": 1, "distance": 1
}});
for ( var x = 1; x <= maxLen; x++ ) {
fproj["$project"][""+x] = 1;
fproj["$project"]["pos"]["$literal"].push( x );
var rec = {
"$cond": [ { "$eq": [ "$pos", x ] }, "$"+x ]
};
if ( stack.length == 0 ) {
rec["$cond"].push( false );
} else {
var lval = stack.pop();
rec["$cond"].push( lval );
}
stack.push( rec );
if ( x == 1) {
pipe.push({ "$group": {
"_id": key,
"1": { "$first": val },
"books": { "$push": val }
}});
pipe.push({ "$project": {
"1": 1,
"books": {
"$cond": [
{ "$eq": [ { "$size": "$books" }, 1 ] },
{ "$literal": [false] },
"$books"
]
}
}});
} else {
pipe.push({ "$unwind": "$books" });
var proj = {
"$project": {
"books": 1
}
};
proj["$project"]["seen"] = { "$eq": [ "$"+(x-1), "$books" ] };
var grp = {
"$group": {
"_id": "$_id",
"books": {
"$push": {
"$cond": [ { "$not": "$seen" }, "$books", false ]
}
}
}
};
for ( var n = x; n >= 1; n-- ) {
if ( n != x )
proj["$project"][""+n] = 1;
grp["$group"][""+n] = ( n == x ) ? { "$first": "$books" } : { "$first": "$"+n };
}
pipe.push( proj );
pipe.push({ "$sort": { "_id": 1, "seen": 1 } });
pipe.push(grp);
}
}
pipe.push(fproj);
pipe.push({ "$unwind": "$pos" });
pipe.push({
"$group": {
"_id": "$_id",
"msgs": { "$push": stack[0] }
}
});
pipe.push({ "$unwind": "$books" });
pipe.push({ "$match": { "books": { "$ne": false } }});
pipe.push({
"$project": {
"_id": "$books._id",
"title": "$books.title",
"author": "$books.author",
"type": "$_id",
"distance": "$books",
"sortOrder": {
"$add": [
{ "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
{ "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] },
]
}
}
});
pipe.push({ "$sort": { "sortOrder": 1, "distance": 1 } });
Alternate
Of course the end result here and the general problem with all above is that you really only want the "top 10" of each "type" to return. The aggregation pipeline will do it, but at the cost of keeping more than 10 and then "popping off the stack" until 10 is reached.
An alternate approach is to "brute force" this with mapReduce and "globally scoped" variables. Not as nice since the results are all in arrays, but it may be a practical approach:
db.collection.mapReduce(
function () {
if ( !stash.hasOwnProperty(this.type) ) {
stash[this.type] = [];
}
if ( stash[this.type].length < maxLen ) {
stash[this.type].push({
"title": this.title,
"author": this.author,
"type": this.type,
"distance": this.distance
});
emit( this.type, 1 );
}
},
function(key,values) {
return 1; // really just want to keep the keys
},
{
"query": {
"location": {
"$nearSphere": [-118.09771, 33.89244]
}
},
"scope": { "stash": {}, "maxLen": 10 },
"finalize": function(key,value) {
return { "msgs": stash[key] };
},
"out": { "inline": 1 }
}
)
This is a real cheat which just uses the "global scope" to keep a single object whose keys are the grouping keys. The results are pushed onto an array in that global object until the maximum length is reached. Results are already sorted by nearest, so the mapper just gives up doing anything with the current document after the 10 are reached per key.
The reducer has nothing useful to do, since the real results are kept in the global scope; it just returns 1 to preserve the keys. The finalize then just "pulls" the value from the global and returns it in the result.
Simple, but of course you don't have all the $geoNear options if you really need them, and this form has the hard limit of 100 documents as the output from the initial query.

This is a classic case for subquery/join which is not supported by MongoDB. All joins and subquery-like operations need to be implemented in the application logic. So multiple queries is your best bet. Performance of the multiple query approach should be good if you have an index on type.
Alternatively you can write a single aggregation query minus the type-matching and limit clauses and then process the stream in your application logic to limit documents per type.
This approach will be low on performance for large result sets, because documents may be returned in random order and your limiting logic will then need to traverse the entire result set.
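For what it's worth, a minimal sketch of that application-side limiting in shell JavaScript, assuming a cap of 10 per type:
// stream the single aggregation and cap results per type in application logic
var perTypeLimit = 10,
    counts = {},
    results = [];
db.books.aggregate([
    { "$geoNear": { "near": [-118.09771, 33.89244], "distanceField": "distance", "spherical": true } }
]).forEach(function(doc) {
    counts[doc.type] = ( counts[doc.type] || 0 ) + 1;
    if ( counts[doc.type] <= perTypeLimit )
        results.push(doc);
});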

I guess you can use cursor.limit() on a cursor to specify the maximum number of documents the cursor will return. limit() is analogous to the LIMIT statement in a SQL database.
You must apply limit() to the cursor before retrieving any documents from the database.
I guess this example should help:
var myCursor = db.bios.find().limit( 5 );

Related

mongodb find doc with matching values of element in object array [duplicate]

I have a collection of documents:
{
    date: Date,
    users: [
        { user: 1, group: 1 },
        { user: 5, group: 2 }
    ]
}
{
    date: Date,
    users: [
        { user: 1, group: 1 },
        { user: 3, group: 2 }
    ]
}
I would like to query against this collection to find all documents where every user id in my array of users is in another array, [1, 5, 7]. In this example, only the first document matches.
The best solution I've been able to find is to do:
$where: function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
Unfortunately, this seems to hurt performance, as stated in the $where docs:
$where evaluates JavaScript and cannot take advantage of indexes.
How can I improve this query?
The query you want is this:
db.collection.find({"users":{"$not":{"$elemMatch":{"user":{$nin:[1,5,7]}}}}})
This says find me all documents that don't have elements that are outside of the list 1,5,7.
I don't know about better, but there are a few different ways to approach this, and depending on the version of MongoDB you have available.
Not too sure if this is your intention or not, but the query as shown will match the first document example, because as your logic is implemented you are matching the elements within that document's array that must all be contained within the sample array.
So if you actually wanted the document to contain all of those elements, then the $all operator would be the obvious choice:
db.collection.find({ "users.user": { "$all": [ 1, 5, 7 ] } })
But working with the presumption that your logic is actually intended, at least as per suggestion, you can "filter" those results by combining with the $in operator so that there are fewer documents subject to your $where condition in evaluated JavaScript:
db.collection.find({
"users.user": { "$in": [ 1, 5, 7 ] },
"$where": function() {
var ids = [1, 5, 7];
return this.users.every(function(u) {
return ids.indexOf(u.user) !== -1;
});
}
})
And you get to use an index, though the documents actually scanned will be multiplied by the number of elements in the arrays from the matched documents; still better than without the additional filter.
Or you could even consider the logical abstraction of the $and operator used in combination with $or, and possibly the $size operator, depending on your actual array conditions:
db.collection.find({
"$or": [
{ "users.user": { "$all": [ 1, 5, 7 ] } },
{ "users.user": { "$all": [ 1, 5 ] } },
{ "users.user": { "$all": [ 1, 7 ] } },
{ "users": { "$size": 1 }, "users.user": 1 },
{ "users": { "$size": 1 }, "users.user": 5 },
{ "users": { "$size": 1 }, "users.user": 7 }
]
})
So this generates all of the possible permutations of your matching condition, but again performance will likely vary depending on your available installed version.
NOTE: Actually a complete fail in this case as this does something entirely different and in fact results in a logical $in
Alternates are with the aggregation framework, your mileage may vary on which is most efficient due to the number of documents in your collection, one approach with MongoDB 2.6 and upwards:
db.problem.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Just keeping the "user" element value
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
// Compare to see if all elements are a member of the desired match
{ "$project": {
"match": { "$setEquals": [
{ "$setIntersection": [ "$users", [ 1, 5, 7 ] ] },
"$users"
]}
}},
// Filter out any documents that did not match
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
So that approach uses some newly introduced set operators in order to compare the contents, though of course you need to restructure the array in order to make the comparison.
As pointed out, there is a direct operator to do this in $setIsSubset which does the equivalent of the combined operators above in a single operator:
db.collection.aggregate([
{ "$match": {
"users.user": { "$in": [ 1,5,7 ] }
}},
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
{ "$unwind": "$users" },
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users.user" }
}},
{ "$project": {
"match": { "$setIsSubset": [ "$users", [ 1, 5, 7 ] ] }
}},
{ "$match": { "match": true } },
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
Or with a different approach while still taking advantage of the $size operator from MongoDB 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
// and a note of its current size
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
"size": { "$size": "$users" }
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
Which of course can still be done, though a little more long winded in versions prior to 2.6:
db.collection.aggregate([
// Match documents that "could" meet the conditions
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Keep your original document and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"date": "$date",
"users": "$users"
},
"users": 1,
}},
// Unwind the array copy
{ "$unwind": "$users" },
// Group it back to get its original size
{ "$group": {
"_id": "$_id",
"users": { "$push": "$users" },
"size": { "$sum": 1 }
}},
// Unwind the array copy again
{ "$unwind": "$users" },
// Filter array contents that do not match
{ "$match": {
"users.user": { "$in": [ 1, 5, 7 ] }
}},
// Count the array elements that did match
{ "$group": {
"_id": "$_id",
"size": { "$first": "$size" },
"count": { "$sum": 1 }
}},
// Compare the original size to the matched count
{ "$project": {
"match": { "$eq": [ "$size", "$count" ] }
}},
// Filter out documents that were not the same
{ "$match": { "match": true } },
// Return the original document form
{ "$project": {
"_id": "$_id._id",
"date": "$_id.date",
"users": "$_id.users"
}}
])
That generally rounds out the different ways; try them out and see what works best for you. In all likelihood the simple combination of $in with your existing form is probably going to be the best one. But in all cases, make sure you have an index that can be selected:
db.collection.ensureIndex({ "users.user": 1 })
Which is going to give you the best performance as long as you are accessing that in some way, as all the examples here do.
Verdict
I was intrigued by this so ultimately contrived a test case in order to see what had the best performance. So first some test data generation:
var batch = [];
for ( var n = 1; n <= 10000; n++ ) {
var elements = Math.floor(Math.random()*10)+1;
var obj = { date: new Date(), users: [] };
for ( var x = 0; x < elements; x++ ) {
var user = Math.floor(Math.random()*10)+1,
group = Math.floor(Math.random()*10)+1;
obj.users.push({ user: user, group: group });
}
batch.push( obj );
if ( n % 500 == 0 ) {
db.problem.insert( batch );
batch = [];
}
}
With 10000 documents in a collection with random arrays from 1..10 in length holding random values of 1..10, I came to a match count of 430 documents (reduced from 7749 from the $in match) with the following results (avg):
JavaScript with $in clause: 420ms
Aggregate with $size : 395ms
Aggregate with group array count : 650ms
Aggregate with two set operators : 275ms
Aggregate with $setIsSubset : 250ms
Noting that over the samples taken, all but the last two had a peak variance of approximately 100ms faster, and the last two both exhibited a 220ms response. The largest variations were in the JavaScript query, which also exhibited results 100ms slower.
But the point here is relative to hardware, which on my laptop under a VM is not particularly great, but gives an idea.
So the aggregate, and specifically the MongoDB 2.6.1 version with set operators clearly wins on performance with the additional slight gain coming from $setIsSubset as a single operator.
This is particularly interesting given (as indicated by the 2.4 compatible method) the largest cost in this process will be the $unwind statement ( over 100ms avg ), so with the $in selection having a mean around 32ms the rest of the pipeline stages execute in less than 100ms on average. So that gives a relative idea of aggregation versus JavaScript performance.
I just spent a substantial portion of my day trying to implement Asya's solution above with object-comparisons rather than strict equality. So I figured I'd share it here.
Let's say you expanded your question from userIds to full users.
You want to find all documents where every item in its users array is present in another users array: [{user: 1, group: 3}, {user: 2, group: 5},...]
This won't work: db.collection.find({"users":{"$not":{"$elemMatch":{"$nin":[{user: 1, group: 3},{user: 2, group: 5},...]}}}}) because $nin only works for strict equality. So we need to find a different way of expressing "not in array" for arrays of objects. And using $where would slow down the query too much.
Solution:
db.collection.find({
    "users": {
        "$not": {
            "$elemMatch": {
                // if all of the OR-blocks are true, the element is not in the array
                "$and": [{
                    // each OR-block == true if element != that user
                    "$or": [
                        { "user": { "$ne": 1 } },
                        { "group": { "$ne": 3 } }
                    ]
                }, {
                    "$or": [
                        { "user": { "$ne": 2 } },
                        { "group": { "$ne": 5 } }
                    ]
                }, {
                    // more users...
                }]
            }
        }
    }
})
To round out the logic: $elemMatch matches all documents that have a user not in the array. So $not will match all documents that have all of the users in the array.

MongoDb Aggregate on both field and a nested array field in the same record

I have a collection. I am trying to get an aggregate sum/count of a field in the record. I also need an aggregate sum/count of a nested array field in the record.
I am using MongoDB 3.0.0 with Jongo.
Please find my record below:
db.events.insert([{
"eventId": "a21sda2s-711f-12e6-8bcf-p1ff819aer3o",
"orgName": "ORG1",
"eventName": "EVA2",
"eventCost": 5000,
"bids": [{
"vendorName": "v1",
"bidStatus": "ACCEPTED",
"bidAmount": 4400
},{
"vendorName": "v2",
"bidStatus": "PROCESSING",
"bidAmount": 4900
},{
"vendorName": "v3",
"bidStatus": "REJECTED",
"bidAmount": "3000"
}] }, {
"eventId": "4427f318-7699-11e5-8bcf-feff819cdc9f",
"orgName": "ORG1",
"eventName": "EVA3",
"eventCost": 1000,
"bids": [ {
"vendorName": "v1",
"bidStatus": "REJECTED",
"bidAmount": 800
}, {
"vendorName": "v2",
"bidStatus": "PROCESSING",
"bidAmount": 900
},{
"vendorName": "v3",
"bidStatus": "PROCESSING",
"bidAmount": 990
}] }])
I need $eventCount and $eventCost where I aggregate $eventCost field.
I get $acceptedCount and $acceptedAmount by aggregating $bids.bidAmount field (with a condition on $bids.bidStatus)
The result I need would be in the form:
[
    {
        "_id" : "EVA2",
        "eventCount" : 2,
        "eventCost" : 10000,
        "acceptedCount" : 2,
        "acceptedAmount" : 7400
    },
    {
        "_id" : "EVA3",
        "eventCount" : 1,
        "eventCost" : 1000,
        "acceptedCount" : 0,
        "acceptedAmount" : 0
    }
]
I am not able to get the result in a single query. Right now I make two queries, Query A and Query B (refer below), and merge them in my Java code.
I use an $unwind operator in my Query B.
Is there a way I can achieve the same result in a single query? I feel all I need is a way to pass the bids[] array downstream for the next operation in the pipeline.
I tried the $push operator, but I am not able to figure out a way to push the entire bids[] array downstream.
I don't want to change my record structure, but if there is something intrinsically wrong, I could give it a try. Thanks for all your help.
My Solution
Query A:
db.events.aggregate([
{$group: {
_id: "$eventName",
eventCount: {$sum: 1}, // Get count of all events
eventCost: {$sum: "$eventCost"} // Get sum of costs
} }
])
Query B:
db.events.aggregate([
{$unwind: "$bids" },
{$group: {
_id: "$eventName",
// Get Count of Bids that have been accepted
acceptedCount:{ $sum:{$cond: [{$eq: ["$bids.bidStatus","ACCEPTED"]} ,1,0] } } ,
// Get Sum of Amounts that have been accepted
acceptedAmount:{$sum:{$cond: [{$eq: ["$bids.bidStatus","ACCEPTED"]} ,"$bids.bidAmount",0]
} } } }
])
Join Query A and QueryB in Java Code.
What I need:
A single DB operation to accomplish the same
The problem with unwinding arrays is that it's going to mess up your counts for the grouped events if you try to unwind these before you do that initial grouping, as the number of items in each document array will affect the count and sum with the denormalized documents.
Provided it is practical for your data size, there is however nothing wrong with using $push to simply create an "array" of "arrays", where of course you just process $unwind twice on each grouped document:
db.events.aggregate([
{ "$group": {
"_id": "$eventName",
"eventCount": { "$sum": 1 },
"eventCost": { "$sum": "$eventCost" },
"bids": { "$push": "$bids" }
}},
{ "$unwind": "$bids" },
{ "$unwind": "$bids" },
{ "$group": {
"_id": "$_id",
"eventCount": { "$first": "$eventCount" },
"eventCost": { "$first": "$eventCost" },
"acceptedCount":{
"$sum":{
"$cond": [
{ "$eq": [ "$bids.bidStatus","ACCEPTED" ] },
1,
0
]
}
},
"acceptedCost":{
"$sum":{
"$cond": [
{ "$eq": [ "$bids.bidStatus","ACCEPTED" ] },
"$bids.bidAmount",
0
]
}
}
}}
])
The likely better alternative to this is to sum up the "accepted" values from each document first, and then sum those values per "event" later:
db.events.aggregate([
{ "$unwind": "$bids" },
{ "$group": {
"_id": "$_id",
"eventName": { "$first": "$eventName" },
"eventCost": { "$first": "$eventCost" },
"acceptedCount":{
"$sum":{
"$cond": [
{ "$eq": [ "$bids.bidStatus","ACCEPTED" ] },
1,
0
]
}
},
"acceptedCost":{
"$sum":{
"$cond": [
{ "$eq": [ "$bids.bidStatus","ACCEPTED" ] },
"$bids.bidAmount",
0
]
}
}
}},
{ "$group": {
"_id": "$eventName",
"eventCount": { "$sum": 1 },
"eventCost": { "$sum": "$eventCost" },
"acceptedCount": { "$sum": "$acceptedCount" },
"acceptedCost": { "$sum": "$acceptedCost" }
}}
])
In that way each array is reduced to just the values you need to collect and this makes the latter $group a lot easier.
Those are a couple of approaches with the latter being the better option, but if you are actually able to process both queries in parallel and combine them in a smart way, then running two queries as you are currently doing would be my recommended approach for the best performance.
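If you do go that route, the client-side merge is simple enough; a sketch in shell JavaScript keyed on the grouped _id, where queryA and queryB are assumed to hold the pipelines from Query A and Query B above:
// merge the two result sets by _id client-side
var merged = {};
db.events.aggregate(queryA).forEach(function(doc) {
    merged[doc._id] = doc;
});
db.events.aggregate(queryB).forEach(function(doc) {
    var target = merged[doc._id] || ( merged[doc._id] = { "_id": doc._id } );
    target.acceptedCount = doc.acceptedCount;
    target.acceptedAmount = doc.acceptedAmount;
});
var results = Object.keys(merged).map(function(k) { return merged[k]; });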

MongoDB: How to Get the Lowest Value Closer to a given Number and Decrement by 1 Another Field

Given the following document containing 3 nested documents...
{ "_id": ObjectId("56116d8e4a0000c9006b57ac"), "name": "Stock 1", "items" [
{ "price": 1.50, "description": "Item 1", "count": 10 }
{ "price": 1.70, "description": "Item 2", "count": 13 }
{ "price": 1.10, "description": "Item 3", "count": 20 }
]
}
... I need to select the sub-document with the lowest price closer to a given amount (here below I assume 1.05):
db.stocks.aggregate([
{$unwind: "$items"},
{$sort: {"items.price":1}},
{$match: {"items.price": {$gte: 1.05}}},
{$group: {
_id:0,
item: {$first:"$items"}
}},
{$project: {
_id: "$item._id",
price: "$item.price",
description: "$item.description"
}}
]);
This works as expected and here is the result:
"result" : [
{
"price" : 1.10,
"description" : "Item 3",
"count" : 20
}
],
"ok" : 1
Alongside returning the item with the lowest price closer to a given amount, I need to decrement count by 1. For instance, here below is the result I'm looking for:
"result" : [
{
"price" : 1.10,
"description" : "Item 3",
"count" : 19
}
],
"ok" : 1
It depends on whether you actually want to "update" the result or simply "return" the result with a decremented value. In the former case you will of course need to go back to the document and "decrement" the value for the returned result.
Also want to note that what you "think" is efficient here is actually not. Doing the "filter" of elements "post sort" or even "post unwind" really makes no difference at all to how the $first accumulator works in terms of performance.
The better approach is to basically "pre filter" the values from the array where possible. This reduces the document size in the aggregation pipeline, and the number of array elements to be processed by $unwind:
db.stocks.aggregate([
{ "$match": {
"items.price": { "$gte": 1.05 }
}},
{ "$project": {
"items": {
"$setDifference": [
{ "$map": {
"input": "$items",
"as": "item",
"in": {
"$cond": [
{ "$gte": [ "$$item.price", 1.05 ] }
],
"$$item",
false
}
}},
[false]
]
}
}},
{ "$unwind": "$items"},
{ "$sort": { "items.price":1 } },
{ "$group": {
"_id": 0,
"item": { "$first": "$items" }
}},
{ "$project": {
"_id": "$item._id",
"price": "$item.price",
"description": "$item.description"
}}
]);
Of course that does require a MongoDB version 2.6 or greater server to have the available operators, and going by your output you may have an earlier version. If that is the case, then at least lose the $match as it does not do anything of value placed there and would be detrimental to performance.
Where a $match is useful is in the document selection before you do anything, as what you always want to avoid is processing documents that do not even possibly meet the conditions you want, from within the array or anywhere else. So you should always $match or use a similar query stage first.
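For instance, a sketch of a pre-2.6 variant of your pipeline with a query-stage $match moved to the front for document selection (the per-item filter after $unwind is kept for correctness):
db.stocks.aggregate([
    // select only documents whose array can possibly match
    { "$match": { "items.price": { "$gte": 1.05 } } },
    { "$unwind": "$items" },
    // per-item filter, still needed after unwinding
    { "$match": { "items.price": { "$gte": 1.05 } } },
    { "$sort": { "items.price": 1 } },
    { "$group": {
        "_id": 0,
        "item": { "$first": "$items" }
    }}
]);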
At any rate, if all you wanted was a "projected result" then just use $subtract in the output:
{ "$project": {
"_id": "$item._id",
"price": "$item.price",
"description": "$item.description",
"count": { "$subtract": [ "$item.count", 1 ] }
}}
If you wanted however to "update" the result, then you would be iterating the array ( it's still an array even with one result ) to update the matched item and "decrement" the count via $inc:
var result = db.stocks.aggregate([
{ "$match": {
"items.price": { "$gte": 1.05 }
}},
{ "$project": {
"items": {
"$setDifference": [
{ "$map": {
"input": "$items",
"as": "item",
"in": {
"$cond": [
{ "$gte": [ "$$item.price", 1.05 ] }
],
"$$item",
false
}
}},
[false]
]
}
}},
{ "$unwind": "$items"},
{ "$sort": { "items.price":1 } },
{ "$group": {
"_id": 0,
"item": { "$first": "$items" }
}},
{ "$project": {
"_id": "$item._id",
"price": "$item.price",
"description": "$item.description"
}}
]);
result.forEach(function(item) {
db.stocks.update({ "item._id": item._id},{ "$inc": { "item.$.count": -1 }})
})
And on a MongoDB 2.4 shell the same aggregate approach applies (adapting the 2.6-only operators as noted), however the result contains another field called result wrapping the array, so add that level:
result.result.forEach(function(item) {
db.stocks.update({ "item._id": item._id},{ "$inc": { "item.$.count": -1 }})
})
So either just $project for display only, or use the returned result to effect an .update() on the data as required.

How to find events that occurred in a timeframe (mongo)

I have the following document structure:
{
    _id: ID1,
    value: {
        data: { userData: { name: "aaa", surname: "bbb" } },
        events: [
            { event1Name: { timestamp: UNIX_TIMESTAMP, value: NUMBER } },
            { event2Name: { timestamp: UNIX_TIMESTAMP, value: NUMBER } },
            { event3Name: { timestamp: UNIX_TIMESTAMP, value: NUMBER } },
            { event4Name: { timestamp: UNIX_TIMESTAMP, value: NUMBER } }
        ],
        activity: { countEvents: INTEGER, totalValue: NUMBER }
    }
}
This is the output of a mapReduce pipeline. Using aggregation, I need to find which users have a certain number of events and a certain amount of value (summed up) within a timeframe. Consider these are online buyers: I need to find those that have made 3 purchases within the last month, or those that have bought for a total amount greater than $300.
Your question is a bit light on information, but the main thing is that as long as there is consistent "keyname" naming in the documents then this really is not an issue.
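The pipeline below assumes startTime and endTime are already bound to UNIX timestamps; for "the last month", an illustrative sketch:
// hypothetical bounds for "the last month", as UNIX timestamps in seconds
var endTime = Math.floor(Date.now() / 1000);
var startTime = endTime - ( 30 * 24 * 60 * 60 );
With those in hand: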
db.junk.aggregate([
// Match where type within timeframe
{ "$match": {
"value.events.confirmedSale.timestamp": {
"$gte": startTime, "$lt": endTime
}
}},
// Pre-filter the array for required data
{ "$project": {
"value": {
"data": "$value.data",
"events": {
"$setDifference": [
{"$map": {
"input": "$value.events",
"as": "el",
"in": {
"$cond": [
{ "$and": [
{ "$gte": [ "$$el.confirmedSale.timestamp", startTime ] },
{ "$lt": [ "$$el.confirmedSale.timestamp", endTime ] }
]},
"$$el",
false
]
}
}},
[false]
]
}
}
}},
// Unwind array elements for processing
{ "$unwind": "$value.events" },
// Group data
{ "$group": {
"_id": "$_id",
"value": { "$sum": "$value.events.confirmedSale.value"},
"count": { "$sum": 1 }
}},
// Filter results on totals
{ "$match": {
"value": { "$gte": 300, "count": { "$gte": 3 } }
}}
])
However, due to the document structure you cannot really get more extensive than that. Such naming requires "path names" to embedded objects to be absolute, and this particular case does not do well for indexing either.
With some control over the document creation, then it should look more like this:
{ _id: 1,
value: {
data:{
userData:{name:"aaa",surname:"bbb"}
},
events:[
{ "type": "adCLick", "timestamp": 1234, "value": 1234 },
{ "type": "confirmedSale", "timestamp": 5678, "value": 5678 },
{ "type": "confirmedSale", "timestamp": 4567, "value": 4567 },
{ "type": "something", "timestamp": 9876, "value": 9876}
]
}
}
Now that the event name you were using as a key is just a consistent "type" property in the data, the query can be much more readable, can do more with combined events than you otherwise could, and can also work in the use of indexes for performance.
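For example, a sketch of the initial selection against the restructured form (startTime and endTime as assumed earlier):
{ "$match": {
    "value.events": {
        "$elemMatch": {
            "type": "confirmedSale",
            "timestamp": { "$gte": startTime, "$lt": endTime }
        }
    }
}}
An index on { "value.events.type": 1, "value.events.timestamp": 1 } can then help service that selection.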
MongoDB is primarily a "database"; if you do not keep consistent naming paths, you will have performance and feature loss as a consequence. The aggregation framework is the "high performance" option over mapReduce with JavaScript. Working with a set key pattern is fine for the aggregation framework, but if you vary that pattern, then your only option is mapReduce.
