MongoDB, indexing query in inner object and grouping? [duplicate]

I'm trying to use the aggregation framework with $match and $group stages. Does the $group stage use index data? I'm using the latest available MongoDB version, 2.5.4.

$group does not use index data.
From the MongoDB docs:
The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.
The $geoNear pipeline operator takes advantage of a geospatial index.
When using $geoNear, the $geoNear pipeline operation must appear as the first stage in an aggregation pipeline.
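To illustrate (collection, field, and index names here are assumptions, not from the question): the index is only usable when the $match is the very first stage of the pipeline.
// Hypothetical collection and index; adjust names to your own schema.
db.orders.createIndex({ status: 1 })

// Because $match is the first stage, it can use the { status: 1 } index;
// the $group that follows still processes the matched documents without an index.
db.orders.aggregate([
    { $match: { status: "shipped" } },
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
])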

@ArthurTacca, as of Mongo 4.0 a $sort preceding a $group will speed things up significantly. See https://stackoverflow.com/a/56427875/92049.

As 4J41's answer says, $group does not (directly) use an index, although $sort does if it is the first stage in the pipeline. However, it seems possible that $group could, in principle, have an optimised implementation if it immediately follows a $sort, in which case you could effectively make it use an index by putting a $sort beforehand.
There does not seem to be a straight answer either way in the docs about whether $group has this optimisation (although I bet there would be if it did, which suggests it doesn't). The answer is in MongoDB bug 4507: currently $group does NOT have this implementation, so the top line of 4J41's answer is right after all. If you really need efficiency, depending on the application it may be quickest to use a regular query and do the grouping in your client code.
Edit: As sebastian's answer says, in practice using a $sort (which can take advantage of an index) before a $group can make a very large speed improvement. The bug above is still open, so aggregation is not taking the absolute best possible advantage of the index (that is, starting to group items as they are loaded, rather than loading them all into memory first). But it is still certainly worth doing, as shown in the sketch below.
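A sketch of that workaround (collection, field, and index names are assumptions): put an indexed $sort immediately before the $group so documents arrive in index order.
// Assumes an index on { itemId: 1 } that the leading $sort can walk.
db.events.createIndex({ itemId: 1 })

db.events.aggregate([
    { $sort: { itemId: 1 } },
    { $group: { _id: "$itemId", count: { $sum: 1 } } }
])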

Per Mongo's 4.2 $group documentation, there is a special optimization for $first:
Optimization to Return the First Document of Each Group
If a pipeline sorts and groups by the same field and the $group stage only uses the $first accumulator operator, consider adding an index on the grouped field which matches the sort order. In some cases, the $group stage can use the index to quickly find the first document of each group.
It makes sense, since only the first entry in an ordered index should be needed for each bin in the $group stage. Unfortunately, in my 3.6 testing, I haven't been able to get nearly the performance I would expect if the index were really being used. I've posted about that problem in detail in another question.
EDIT 2020-04-23
I confirmed with Atlas's MongoDB Support that this $first optimization was added in Mongo 4.2, hence my trouble getting it to work with 3.6. There is also a bug preventing it from working with a composite $group _id at the moment. Further details are available in the post that I linked above.
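For reference, a sketch of the pattern the 4.2 docs describe (collection and field names are assumptions): sort and group by the same field, use only the $first accumulator, and keep an index that matches the sort order.
// Assumes MongoDB 4.2+ and an index matching the sort order below.
db.readings.createIndex({ sensorId: 1, timestamp: -1 })

db.readings.aggregate([
    { $sort: { sensorId: 1, timestamp: -1 } },
    { $group: { _id: "$sensorId", latestTimestamp: { $first: "$timestamp" } } }
])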

Changed in version 3.2: Starting in MongoDB 3.2, indexes can cover an aggregation pipeline. In MongoDB 2.6 and 3.0, indexes could not cover an aggregation pipeline since even when the pipeline uses an index, aggregation still requires access to the actual documents.
https://docs.mongodb.com/master/core/aggregation-pipeline/#pipeline-operators-and-indexes
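A sketch of what a coverable pipeline might look like on 3.2+ (collection, field, and index names are assumptions; check explain() to confirm the plan is actually covered in your case):
// The index contains every field the pipeline references.
db.sales.createIndex({ status: 1, amount: 1 })

db.sales.aggregate([
    { $match: { status: "complete" } },
    { $group: { _id: "$status", total: { $sum: "$amount" } } }
])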

Related

Get data between stages in an aggregation Pipeline

Is it possible to retrieve the documents between stages in mongo aggregation pipeline?
Imagine that I have an aggregation pipeline running in pymongo with 10 stages, and I want to be able to retrieve some info available after stage 8 that will not be available in the last stage. Is it possible?
The idea is quite similar to this question, and looking at the answers I found $facet, but it wasn't clear to me whether, if the first stage of all the outputFields is the same, it will be executed only once and perform as expected. Also, as I saw in the docs, $facet does not support indexes, which is a problem in my case.
To retrieve values of particular fields which are changed in subsequent stages, use $set to duplicate those values into new fields.
To retrieve the result set exactly as it exists after the 8th stage, send the first 8 stages as their own pipeline.
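A sketch of the first suggestion (stage contents and field names are placeholders): copy the value into a new field before the stage that would overwrite it, and the copy survives to the end of the pipeline. Note that $set requires MongoDB 4.2+; on older versions $addFields does the same thing.
db.items.aggregate([
    // ... stages 1-8 of the original pipeline ...
    { $set: { priceAfterStage8: "$price" } },  // snapshot the value you need later
    // ... stages 9-10, which may overwrite "price" ...
])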

MongoDB {aggregation $match} vs {find} speed

I have a MongoDB collection with millions of rows and I'm trying to optimize my queries. I'm currently using the aggregation framework to retrieve data and group it as I want. My typical aggregation query is something like: $match > $group > $group > $project
However, I noticed that the last parts only take a few ms; the beginning is the slowest.
I tried to perform a query with only the $match filter, and then to perform the same query with collection.find. The aggregation query takes ~80ms while the find query takes 0 or 1ms.
I have indexes on pretty much every field, so I guess this isn't the problem. Any idea what could be going wrong? Or is it just a "normal" drawback of the aggregation framework?
I could use find queries instead of aggregation queries, however I would have to perform a lot of processing after the request and this process can be done quickly with $group etc. so I would rather keep the aggregation framework.
Thanks,
EDIT:
Here is my criteria:
{
    "action" : "click",
    "timestamp" : {
        "$gt" : ISODate("2015-01-01T00:00:00Z"),
        "$lt" : ISODate("2015-02-011T00:00:00Z")
    },
    "itemId" : "5"
}
The main purpose of the aggregation framework is to ease querying a large number of entries and generate a small number of results that hold value to you.
As you have said, you can also use multiple find queries, but remember that you cannot create new fields with find queries. On the other hand, the $group stage allows you to define new fields.
If you would like to achieve the functionality of the aggregation framework, you would most likely have to run an initial find (or chain several ones), pull that information and further manipulate it with a programming language.
The aggregation pipeline might seem to take longer, but at least you know you only have to take into account the performance of one system, the MongoDB engine.
Whereas, when it comes to manipulating the data returned from a find query, you would most likely have to further manipulate the data with a programming language, thus increasing the complexity depending on the intricacies of the programming language of choice.
Have you tried using explain() on your find queries? It will give you a good idea of how much time the find() query actually takes. You can do the same for the $match stage with the explain option and see whether there is any difference in index access and other parameters.
Also, the $group part of the aggregation framework doesn't utilize indexes, so it has to process all the records returned by the $match stage. So, to better understand the behaviour of your query, look at the result set it returns and whether it fits into memory to be processed by MongoDB.
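A sketch of that comparison (the collection name is a placeholder; the criteria are abbreviated from the question above):
// Plain find with the same criteria:
db.clicks.find({ action: "click", itemId: "5" }).explain("executionStats")

// The equivalent $match-only pipeline, passing the explain option to aggregate():
db.clicks.aggregate(
    [ { $match: { action: "click", itemId: "5" } } ],
    { explain: true }
)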
If you are concerned with performance, aggregation is no doubt a more time-consuming task than a find clause.
When you are fetching records on multiple conditions, with lookups, grouping, and a limited (paginated) result set, aggregation is the best approach. Meanwhile, a find query is fast when you have to fetch a very big data set; if you only need some population and projection and no pagination, I suggest using a find query, which is fast.

How to aggregate and merge the result into a collection?

I want to aggregate and insert the results into an existing collection, without deleting that collection. The documentation seems to suggest that this isn't directly possible. I find that hard to believe.
The map-reduce functionality has 'output modes', including 'merge', which does what I want. I'm looking for the equivalent for aggregation.
The new $out aggregation stage supports inserting into a collection, but it replaces the collection rather than updating it. If I did this I would (I think) have to run another map-reduce to merge this into another collection, which seems inefficient.
Am I missing something or is the functionality just missing from the aggregation feature?
I used the output from the aggregation to insert/merge into the collection:
db.coll2.insert(
    db.coll1.aggregate([]).toArray()
)
Reading the documentation answers this question quite precisely. At the moment, Mongo is not able to do what you want.
The $out operation creates a new collection in the current database if one does not already exist. The collection is not visible until the aggregation completes. If the aggregation fails, MongoDB does not create the collection.
If the collection specified by the $out operation already exists, then upon completion of aggregation the $out stage atomically replaces the existing collection with the new results collection. The $out operation does not change any indexes that existed on the previous collection. If the aggregation fails, the $out operation makes no changes to the previous collection.
For anyone coming to this more recently: starting with version 4.2, you can do this using the $merge operator in an aggregation pipeline. It needs to be the last stage in the pipeline.
{ $merge: { into: "myOutput", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
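For example (the $group stage and collection names here are placeholders), appended as the final stage the results are merged into myOutput instead of replacing it:
db.coll1.aggregate([
    { $group: { _id: "$itemId", total: { $sum: 1 } } },
    { $merge: { into: "myOutput", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
])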
If you're not stuck on using the aggregation operators, you could do an incremental map-reduce on the collection. Its merge output mode allows you to merge results into an existing collection.
See documentation below:
http://docs.mongodb.org/manual/tutorial/perform-incremental-map-reduce/
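A minimal sketch of the merge output mode (the map/reduce functions and collection names are illustrative only):
db.coll1.mapReduce(
    function () { emit(this.itemId, 1); },                  // map
    function (key, values) { return Array.sum(values); },   // reduce
    { out: { merge: "myOutput" } }                           // merge into the existing collection
)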

Aggregation framework on full table scan

I know that the aggregation framework is suitable if there is an initial $match pipeline to limit the collection to be aggregated. However, there may be times that the filtered collection is still large, say around 2 million documents, and the aggregation involves $group. Is the aggregation framework fit to work on such a collection given a requirement to output results in at most 5 seconds? Currently I work on a single node. By performing the aggregation on a sharded setup, will there be a significant improvement in performance?
As far as I know, the only limitations are that the result of the aggregation can't surpass the 16MB limit, since what it returns is a document and that's the size limit for a document in MongoDB. Also, you can't use more than 10% of the total memory of the machine; that is why $match stages are usually used to reduce the set you work with, or a $project stage to reduce the data per document.
Be aware that in a sharded environment, after $group or $sort stages the aggregation is brought back to the mongos before being sent to the next stage of the pipeline. Potentially the mongos could be running on the same machine as your application and could hurt your application's performance if not handled correctly.

Is there a difference between $lt/$gt and $ne in MongoDB?

I am just getting started with MongoDB and trying to understand how indexes work. I have a list of items in a collection. Each item has a version that gets incremented. Then, all previous versions (less than current version) get removed (record is not updated so that both versions are available for a while). There is a compound index on item ID and version. For removing, does it make a difference (in terms of performance) whether you use $ne versus $lt?
I would assume no, but I just want to confirm.
Without knowing the details of the implementation, $lt can be more efficient than $ne. On a B-tree index, $ne would be two range scans ($lt and $gt), whereas $lt is just one.
But in your case $lt seems to be what you want anyway (to find the older versions). If you used $ne, you could accidentally also remove newer versions that you just assume do not exist, but might actually have been created in the meantime. Remember that MongoDB does not support transactions or consistent views across documents, so concurrent updates might bite you here.
Actually, there's a huge difference. The "$ne and $nin operators are not selective", which means that an index will not speed up that part of the query at all. So if you use $ne, then the version part of the compound index will not be used by MongoDB.
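A sketch of the difference (collection, field names, and the version value are assumptions):
// Compound index on item ID and version, as described in the question.
db.items.createIndex({ itemId: 1, version: 1 })

// One contiguous index range: only versions below the current one.
db.items.deleteMany({ itemId: "abc", version: { $lt: 7 } })

// $ne is not selective: only the itemId prefix of the index helps,
// and every version for that item is examined.
db.items.deleteMany({ itemId: "abc", version: { $ne: 7 } })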