Difference between aggregate ($match) and find in MongoDB?

What is the difference between the $match operator used inside the aggregate function and the regular find in MongoDB?
Why doesn't the find function allow renaming the field names like the aggregate function?
For example, in aggregate we can pass the following $project stage:
{ "$project" : { "OrderNumber" : "$PurchaseOrder.OrderNumber" , "ShipDate" : "$PurchaseOrder.ShipDate"}}
Whereas find does not allow this.
Why doesn't the aggregate output return as a DBCursor or a List? And why can't we get a count of the documents that are returned?
Thank you.

Why doesn't the aggregate output return as a DBCursor or a List?
The aggregation framework was created to solve simple problems that would otherwise require map-reduce.
This framework is commonly used for computations that take the full collection as input and produce a few documents as output.
What is the difference between the $match operator used inside the aggregate function and the regular find in MongoDB?
One of the differences, as you stated, is the return type: find operations return a DBCursor.
Other differences:
The aggregation result must be under 16 MB (the maximum BSON document size). If you are using shards, the full data must be collected at a single point after the first $group or $sort.
$match's primary purpose is to filter the documents flowing through the pipeline, and placing it early also improves aggregation performance, since later stages then process fewer documents.
And why can't we get a count of the documents that are returned?
You can. Just count the number of elements in the resulting array, or add the following stage to the end of the pipeline:
{$group: {_id: null, count: {$sum: 1}}}
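As a minimal sketch (collection name and filter are illustrative, not from the question), appending that stage to a pipeline looks like this:
// illustrative pipeline: filter, then count the surviving documents
db.orders.aggregate([
    { $match: { status: "shipped" } },
    { $group: { _id: null, count: { $sum: 1 } } }
])
// returns a single document such as { "_id" : null, "count" : 42 }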
Why doesn't the find function allow renaming the field names like the aggregate function?
MongoDB is young and features are still coming. Maybe in a future version we'll be able to do that. Renaming fields is more critical in aggregation than in find.
EDIT (2014/02/26):
MongoDB 2.6 aggregation operations will return a cursor.
EDIT (2014/04/09):
MongoDB 2.6 was released with the predicted aggregation changes.

I investigated a few things about the aggregate and find calls:
I tested with a descending sort on a collection of 160k documents and limited my output to a few documents.
1. The aggregate command is slower than the find command.
2. If you then access the data, e.g. via ToList(), the aggregate command is faster than find.
3. If you look at the total times (points 1 + 2), the two commands seem to be equal.
Maybe aggregate eagerly materializes its result (as if ToList() had already been called), so it does not have to be called again. If you don't call ToList() afterwards, the find() call is much faster:
7 ms vs 50 ms (5 documents)
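A plausible way to observe this laziness in the shell (collection and sort field are illustrative; ToList() above is the C# driver's counterpart of toArray()):
// find() is lazy: building the cursor is cheap, the work happens on iteration
var t0 = new Date();
var cur = db.items.find().sort({ x: -1 }).limit(5); // fast: nothing fetched yet
var docs = cur.toArray();                           // the real work happens here
print((new Date() - t0) + " ms");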

Related

Which query is faster to perform in mongodb: using $in range or pair of $gte and $lte?

I'm interested in a performance issue. Suppose I have a collection with a field ref (present in each document). What I want is to find all documents in a specific range (for example, [1, 1000000]). Is there any difference between the following queries in terms of db performance:
db.test.find({"ref": {"$gte":1, "$lte": 1000000}}) and
db.test.find({"ref": {"$in": [1,2,3, ..., 1000000]}})
An additional question is about memory consumption. Which query is more suitable in this case if I use the pymongo driver?
There is a huge difference. The range query is expressed as a single pair of index bounds, so if there is an index on the 'ref' field it can be answered with one bounded index scan, and a covered query will be evaluated almost in no time.
The query where you pass a list of a million elements is heavy: it takes more time to send it over the wire, after which the MongoDB engine has to check each array element against the documents (or perform an index lookup per element).
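A minimal sketch of the fast path, assuming a numeric ref field (the index and the projection are illustrative):
// an ascending index turns the $gte/$lte pair into one bounded index scan
db.test.createIndex({ ref: 1 })
// projecting only indexed fields (and excluding _id) makes this a covered
// query: the server answers from the index without fetching any documents
db.test.find({ ref: { $gte: 1, $lte: 1000000 } }, { _id: 0, ref: 1 })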

Joining two aggregation results in mongodb using $out of aggregate

I have a reporting app in which I generate MongoDB commands, and it involves running three aggregate calls. The aggregate calls each have [match, group, project] in their pipelines.
RESULT OF AGGREGATE 1-3
{_id: <XXX>, ...}
The grouping "_id" for these calls is the same, but because their $match stages are different they cannot be in the same aggregate call. I need to join all of these aggregation results. I know that one way to solve this is using conditions during the $group stage, but the problem is that the conditions are really complicated to mix with the already complex $group stage.
To give some context on why that solution is very difficult if not impossible: the data is quite huge, each doc has 700 attributes, and the docs are coming in at around 1k per day. Generating such a complicated condition for EACH field in the $group stage would make a mess.
I have seen answers that run map-reduce to combine these aggregation results, but I am looking for other solutions. As I've researched, aggregate has an $out stage. Is there any way I can use that $out stage to join these aggregation results? (The reason for thinking of $out is that I have to save ALL the results anyway as a report.)
If you indeed want to go ahead with merging the aggregation results, then you can create an output collection using bulk upserts. For performance, you can create a compound index on this output collection over your grouping attributes.
// assumes the mongo shell; "report_output" is an illustrative name for
// the merged output collection
var bulk = db.report_output.initializeUnorderedBulkOp();
dataArray.forEach(function(data) {
    data.forEach(function(row) {
        // set v1 only when the group is first inserted; always refresh v2
        var setOnInsert = { grouping_attrs: row.grouping_values, v1: row.v1 };
        var set = { v2: row.v2 };
        var query = { grouping_attrs: row.grouping_values };
        bulk.find(query).upsert().update({ $setOnInsert: setOnInsert, $set: set });
    });
});
bulk.execute();
Here your dataArray is created using find on the $out collections.
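For context, a sketch of how the pieces could fit together (collection names, the stand-in pipeline, and the index are illustrative assumptions, not from the question):
// stand-in for one of the three pipelines from the question
var pipeline1 = [
    { $match: { source: 1 } },
    { $group: { _id: "$key", v1: { $sum: 1 } } },
    { $project: { grouping_values: "$_id", v1: 1, _id: 0 } }
];
// $out persists this call's result; repeat for the other two pipelines
db.source.aggregate(pipeline1.concat([{ $out: "agg_result_1" }]));
// compound index on the grouping attributes to speed up the bulk upserts
db.report_output.createIndex({ grouping_attrs: 1 });
// dataArray: one result set per $out collection
var dataArray = [
    db.agg_result_1.find().toArray(),
    db.agg_result_2.find().toArray(),
    db.agg_result_3.find().toArray()
];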

MongoDB Query Nested Array Search

I need to query documents with MongoDB that contain nested arrays. I see a lot of examples using the simple $in operator. The only problem is that I strictly need to check for proper subsets.
Consider the following document.
{data: [[1,2,3], [4,5,6]]}
The query needs to be able to get documents where a nested array contains all of [1,2,3], where 1, 2, 3 can be in any order. This rules out the following query, because it will only match when the elements are in that exact order:
{data:{$elemMatch:{$all:[[1,2,3]]}}}
I've also tried nested $elemMatch operators, with no success, because the $in operator will return the document even if only one element matches, as in the following:
{data:{$elemMatch:{$elemMatch:{$in:[1,4]}}}}
Not sure what your actual query looks like, but this should do what you need:
db.documentDto.find({"some_field":{"$elemMatch":{"$in":[1,2,3]}} })
I haven't got a complete answer (and not much time, as it's late here), but I would consider the following (a sketch follows at the end of this answer):
Use the aggregation pipeline instead of a plain query, if you're not already.
Use the $unwind operator to deconstruct your nested arrays.
Use $sort to sort the contents of the arrays, so you can compare them.
Use $match to filter out the arrays which don't fit the subset values, as you can now check based on order.
Use $group to group the result back together based on the _id value.
Ref:
http://docs.mongodb.org/manual/reference/operator/aggregation-pipeline/ will give you info on each of the above.
From a quick search I came up with a similar question/example that might be helpful: Mongodb sort inner array
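Here is a minimal sketch of that idea against the question's data (collection name is illustrative; note that filtering with $all inside the $match makes the explicit $sort step unnecessary, since $all ignores element order):
// one document per inner array, keep only the inner arrays that contain
// the whole subset {1,2,3} in any order, then reassemble by _id
db.documentDto.aggregate([
    { $unwind: "$data" },
    { $match: { data: { $all: [1, 2, 3] } } }, // [3,1,2,9] matches too
    { $group: { _id: "$_id", data: { $push: "$data" } } }
]);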

MongoDB query: Using Limit together with $near skips few documents

I am currently developing an app which gets a specific number of documents from a collection if their location coordinates fall within a certain range of distance. I am using an active-record library for CodeIgniter, and the query that is generated is as follows:
db.updates.find({locs: { $near: [72.844102008984, 19.130207090604], $maxDistance: 5000 }, posted_on: { $lt: 1398425538.1942 }}).sort({ posted_on: -1 }).limit(10).toArray()
The problem I am facing is that the above query skips a few documents which should actually get pulled. But if I remove the limit(10) from the above query, then the proper documents get pulled.
I am not sure, but does using limit() in MongoDB omit a few results? Or does it limit the output to only the closest (nearest) documents?
P.S. The documents skipped when using the limit are not always the same; the results appear random.
I suspect you are running into problems with the special nature of the $near query. $near performs both a limit() and a sort() on the cursor returning the results:
Specifies a point for which a geospatial query returns the closest documents first. The query sorts the documents from nearest to farthest.
By default, queries that use a 2d index return a limit of 100 documents; however you may use limit() to change the number of results.
http://docs.mongodb.org/manual/reference/operator/query/near/
While the documentation does specifically discuss overriding the default limit of 100 with your own limit() call:
You can further limit the number of results using cursor.limit().
it is silent on adding your own sort(), or on both sorting and overriding the limit at the same time. I suspect you are running into side effects of doing both. Note that it's not incorrect to do both; it just may not produce the results you are looking for. I'd suggest trying the same query using $geoWithin:
http://docs.mongodb.org/manual/reference/operator/query/geoWithin/
$geoWithin does not apply a sort or a limit on the results, so it gives you something of a more raw result set.
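A sketch of the rewritten query (the $center radius is an illustrative value in the same units as the legacy 2d coordinates, not a converted 5000):
// $geoWithin applies no implicit sort or limit, so the explicit
// posted_on sort below is the only ordering involved
db.updates.find({
    locs: { $geoWithin: { $center: [[72.844102008984, 19.130207090604], 0.05] } },
    posted_on: { $lt: 1398425538.1942 }
}).sort({ posted_on: -1 }).limit(10).toArray()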
Do you have any identical posted_on dates in the system? I recommend sorting by a second key, perhaps _id. If the sort order is non-deterministic, the system may skip documents in a non-deterministic manner. Adding the _id field to your sort order is generally not that expensive if you have an index on the other fields, as the documents will already be very close to the correct order, and _id always has its own index. ("By default, all collections have an index on the _id field, and applications and users may add additional indexes to support important queries and operations." http://docs.mongodb.org/manual/core/index-single/ )
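For example, a deterministic version of the question's query (same shape, with _id appended to the sort):
// _id breaks ties between documents with equal posted_on values,
// so repeated runs return the same 10 documents
db.updates.find({
    locs: { $near: [72.844102008984, 19.130207090604], $maxDistance: 5000 },
    posted_on: { $lt: 1398425538.1942 }
}).sort({ posted_on: -1, _id: -1 }).limit(10).toArray()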

Efficient pagination of MongoDB aggregation?

For efficiency, the Mongo documentation recommends that limit statements immediately follow sort statements, thus ending up with the somewhat nonsensical:
collection.find(f).sort(s).limit(l).skip(p)
I say this is somewhat nonsensical because it seems to say take the first l items, and then drop the first p of those l. Since p is usually larger than l, you'd think you'd end up with no results, but in practice you end up with l results.
Aggregation works more as you'd expect:
collection.aggregate({$unwind: u}, {$group: g},{$match: f}, {$sort: s}, {$limit: l}, {$skip: p})
returns 0 results if p>=l.
collection.aggregate({$unwind: u}, {$group: g}, {$match: f}, {$sort: s}, {$skip: p}, {$limit: l})
works, but the documentation seems to imply that this will fail if the match returns a result set that's larger than working memory. Is this true? If so, is there a better way to perform pagination on a result set returned through aggregation?
Source: the "Changed in version 2.4" comment at the end of this page: http://docs.mongodb.org/manual/reference/operator/aggregation/sort/
In MongoDB, cursor methods (i.e. those used with find()) like limit, sort, and skip can be applied in any order: the order does not matter. A find() returns a cursor to which the modifiers are applied; the server always evaluates them as sort -> skip -> limit.
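For example (collection and sort key are illustrative), these two calls produce identical results:
// cursor modifiers are collected before the query executes; the server
// always evaluates them as sort -> skip -> limit
db.items.find().limit(5).skip(10).sort({ x: 1 });
db.items.find().sort({ x: 1 }).skip(10).limit(5);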
The aggregation framework does not return a database cursor. Instead it returns a document containing the results of the aggregation. It works by producing intermediate results at each step of the pipeline, and thus the order of operations really matters.
I guess MongoDB does not support ordering of cursor modifier methods because of the way it is implemented internally.
You can't paginate on the result of the aggregation framework because there is only a single document with results. You can still paginate on a regular query by using skip and limit, but a better practice would be to use a range query, due to its efficiency in using an index.
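A sketch of such a range query for pagination (collection name is illustrative): instead of skipping p documents, remember where the previous page ended and seek past it.
// first page
var page = db.items.find().sort({ _id: 1 }).limit(10).toArray();
// next page: continue after the last seen _id instead of skipping,
// so the index positions the cursor directly
var lastId = page[page.length - 1]._id;
db.items.find({ _id: { $gt: lastId } }).sort({ _id: 1 }).limit(10);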
UPDATE:
Since v2.6 Mongo aggregation framework returns a cursor instead of a single document. Compare: v2.4 and v2.6.
The documentation seems to imply that this (aggregation) will fail if the match returns a result set that's larger than working memory. Is this true?
No. You can, for example, aggregate on a collection that is larger than physical memory without even using the $match operator. It might be slow, but it should work. There is no problem if $match returns something that is larger than RAM.
Here are the actual pipeline limits.
http://docs.mongodb.org/manual/core/aggregation-pipeline-limits/
The $match operator alone does not cause memory problems. As stated in the documentation, $group and $sort are the usual villains. They are cumulative, and might require access to the entire input set before they can produce any output. If they load too much data into physical memory, they will fail.
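Since v2.6 (mentioned above) you can also let those stages spill to disk instead of failing; a minimal sketch:
// allowDiskUse lets $group/$sort write temporary files when they
// exceed the pipeline's memory limit, rather than aborting
db.items.aggregate(
    [{ $group: { _id: "$key", total: { $sum: 1 } } }],
    { allowDiskUse: true }
);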
If so, is there a better way to perform pagination on a result set returned through aggregation?
It has been correctly said that you cannot "paginate" (apply $skip and $limit) on the result of the aggregation, because it is simply a MongoDB document. But you can "paginate" on the intermediate results of the aggregation pipeline.
Using $limit in the pipeline will help keep the result set within the 16 MB bound, the maximum BSON document size. Even if the collection grows, you should be safe.
Problems could arise with $group and, especially, $sort. You can create "sort-friendly" indexes to deal with them if they do actually happen. Have a look at the documentation on indexing strategies.
http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/
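For instance (field name is illustrative), an index matching a leading $sort lets the pipeline read documents in index order instead of sorting in memory:
// a $sort at the very start of the pipeline can use this index and
// stream results instead of building an in-memory sort
db.items.createIndex({ posted_on: -1 });
db.items.aggregate([{ $sort: { posted_on: -1 } }, { $limit: 100 }]);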
Finally, be aware that $skip does not help with performance. On the contrary, it tends to slow down the application, since it forces MongoDB to scan every skipped document to reach the desired point in the collection.
http://docs.mongodb.org/manual/reference/method/cursor.skip/
MongoDB's recommendation of $sort preceding $limit is absolutely true: when the two appear together, the server only needs to keep the top n results in memory while sorting.
It's just that the proposed solution doesn't fit your use case, which is pagination.
You can modify your query to get the benefit of this optimization.
collection.aggregate([
{
$unwind: u
},
{
$group: g
},
{
$match: f
},
{
$sort: s
},
{
$limit: l+p
},
{
$skip: p
}
]);
or for find query
collection.find(f).sort(s).limit(l+p).skip(p)
Though, as you can see, with deep pagination the memory requirement will grow more and more even with this optimization, since $limit must now keep l+p results.