Get all documents in a MongoDB aggregation pipeline containing a $limit stage

I am using mongodb 3.4.
In my application, I have an aggregation operation with a $limit stage.
Now I want to know how I can get all records from the pipeline, ignoring the $limit. Removing $limit is not feasible because I need all records only in a specific scenario.
In the mongodb find() operation, I know I can bypass the cursor.limit() operation by passing 0 as the limit, and I'm looking for something similar for aggregation $limit.
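As far as I know there is no 0-style sentinel for the aggregation $limit stage (it only accepts a positive integer), so a common workaround is to build the pipeline conditionally. A minimal sketch in the mongo shell, where the collection name, filter, and includeAll flag are hypothetical:

// Hypothetical sketch: append $limit only when we are not in the
// "return everything" scenario, since $limit: 0 is not allowed.
var pipeline = [ { $match: { status: "active" } } ];
if (!includeAll) {
    pipeline.push({ $limit: 100 });
}
db.orders.aggregate(pipeline);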

Related

Get data between stages in an aggregation Pipeline

Is it possible to retrieve the documents between stages in mongo aggregation pipeline?
Imagine that I have an aggregation pipeline running in pymongo with 10 stages, and I want to be able to retrieve some info available after stage 8 that will not be available in the last stage. Is it possible?
The idea is quite similar to this question, and looking at the answers I found $facet, but it wasn't clear to me whether, if the first stage of all the outputFields is the same, it will be executed only once and perform as expected. Also, as I saw in the docs, $facet does not support indexes, which is a problem in my case.
To retrieve values of particular fields which are changed in subsequent stages, use $set to duplicate those values into new fields.
To retrieve the result set exactly as it exists after the 8th stage, send the first 8 stages as their own pipeline.
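As a rough illustration of the first suggestion (collection and field names are hypothetical), copying a value into a new field before the stage that overwrites it lets that value survive to the end of the pipeline:

// Hypothetical sketch: keep a copy of "price" as it exists after
// stage 8, so later stages can modify "price" without losing it.
db.items.aggregate([
    // ... stages 1-8 ...
    { $set: { priceAfterStage8: "$price" } },
    // ... stages 9-10 that change "price" ...
])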

How can I perform facet aggregation in ReactiveMongo with Play Framework?

I am trying to run $facet aggregation in order to get documents according to $match with $skip and the count of matching documents.
I found there is no $facet aggregation in ReactiveMongo. How can I do this in my Play Framework 2.7 app?
Any aggregation stage can be written using PipelineOperator (if no convenient function is provided).
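In pipeline terms, the stage being asked about would look roughly like the sketch below (collection and filter are hypothetical); as the answer above notes, in ReactiveMongo the same stage can be expressed as a raw document through PipelineOperator:

// Hypothetical sketch: one facet pages the matched documents,
// the other counts how many matched in total.
db.articles.aggregate([
    { $match: { published: true } },
    { $facet: {
        results:    [ { $skip: 20 }, { $limit: 10 } ],
        totalCount: [ { $count: "count" } ]
    } }
])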

MongoDB, indexing query in inner object and grouping? [duplicate]

I'm trying to use the aggregation framework with $match and $group stages. Does the $group stage use index data? I'm using the latest available MongoDB version, 2.5.4.
$group does not use index data.
From the mongoDB docs:
The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.
The $geoNear pipeline operator takes advantage of a geospatial index.
When using $geoNear, the $geoNear pipeline operation must appear as the first stage in an aggregation pipeline.
@ArthurTacca, as of Mongo 4.0 a $sort preceding $group will speed things up significantly. See https://stackoverflow.com/a/56427875/92049.
As 4J41's answer says, $group does not (directly) use an index, although $sort does if it is the first stage in the pipeline. However, it seems possible that $group could, in principle, have an optimised implementation if it immediately follows a $sort, in which case you could effectively make it use an index by putting a $sort beforehand.
There does not seem to be a straight answer either way in the docs about whether $group has this optimisation (although I bet there would be if it did, so this suggests it doesn't). The answer is in MongoDB bug 4507: currently $group does NOT have this implementation, so the top line of 4J41's answer is right after all. If you really need efficiency, depending on the application it may be quickest to use a regular query and do the grouping in your client code.
Edit: As sebastian's answer says, it seems that in practice using $sort (that can take advantage of an index) before a $group can make a very large speed improvement. The bug above is still open so it seems that it is not making the absolute best possible advantage of the index (that is, starting to group items as items are loaded, rather than loading them all in memory first). But it is still certainly worth doing.
Per Mongo's 4.2 $group documentation, there is a special optimization for $first:
Optimization to Return the First Document of Each Group
If a pipeline sorts and groups by the same field and the $group stage only uses the $first accumulator operator, consider adding an index on the grouped field which matches the sort order. In some cases, the $group stage can use the index to quickly find the first document of each group.
It makes sense, since only the first entry in an ordered index should be needed for each bin in the $group stage. Unfortunately, in my 3.6 testing, I haven't been able to get nearly the performance I would expect if the index were really being used. I've posted about that problem in detail in another question.
EDIT 2020-04-23
I confirmed with Atlas's MongoDB Support that this $first optimization was added in Mongo 4.2, hence my trouble getting it to work with 3.6. There is also a bug preventing it from working with a composite $group _id at the moment. Further details are available in the post that I linked above.
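For reference, the pattern the 4.2 documentation describes looks roughly like this sketch (collection and field names are hypothetical): the index matches the sort order, the group key is the sort prefix, and $first is the only accumulator.

// Hypothetical sketch of the $first optimization pattern.
db.readings.createIndex({ sensorId: 1, timestamp: 1 })
db.readings.aggregate([
    { $sort:  { sensorId: 1, timestamp: 1 } },
    { $group: { _id: "$sensorId", firstValue: { $first: "$value" } } }
])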
Changed in version 3.2: Starting in MongoDB 3.2, indexes can cover an aggregation pipeline. In MongoDB 2.6 and 3.0, indexes could not cover an aggregation pipeline since even when the pipeline uses an index, aggregation still requires access to the actual documents.
https://docs.mongodb.com/master/core/aggregation-pipeline/#pipeline-operators-and-indexes

How to process documents by batch in Mongo pipeline without using $skip?

My team and I have some questions related to the Mongo pipeline.
We have a pipeline used to process a large number of documents. In this pipeline, we use the $group stage. However, this stage has a hard-coded limit of 100MB, which is not enough for the quantity of documents we have (https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/#memory-restrictions).
The solution we implemented was a combination of the $skip and $limit stages at the beginning of the pipeline, in order to process the documents in batches of N (where N is a number chosen so that a batch of N documents weighs less than 100MB).
However, this solution has a performance issue, as the $skip stage gets slower as the number of documents to skip increases (https://arpitbhayani.me/techie/mongodb-cursor-skip-is-slow.html).
We then found this article: https://arpitbhayani.me/techie/fast-and-efficient-pagination-in-mongodb.html. The second approach described there seems to be a good solution for our use case. We would like to implement this approach. However, for this to work, the documents have to be sorted by the internal Mongo ID (_id) at the start of the pipeline. We thought about using the $sort stage beforehand, but this stage also has a hard-coded 100MB limit.
What method could we use to ensure that the documents are sorted by _id at the start of the pipeline?
Is there maybe another solution to process the documents by batch weighting less than 100MB?
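For what it's worth, a minimal sketch of the range-based approach from the second article, assuming a hypothetical collection and batch size. The $match/$sort on _id at the start of the pipeline can be served by the default _id index, so it avoids the in-memory sort limit:

// Hypothetical sketch: each batch resumes after the last _id seen
// in the previous batch, instead of using $skip.
var lastId = null;
var batchSize = 1000;
while (true) {
    var matchStage = lastId === null ? {} : { _id: { $gt: lastId } };
    var batch = db.docs.aggregate([
        { $match: matchStage },
        { $sort: { _id: 1 } },      // served by the default _id index
        { $limit: batchSize },
        // ... per-batch processing stages ...
    ]).toArray();
    if (batch.length === 0) break;
    lastId = batch[batch.length - 1]._id;
}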

How to aggregate and merge the result into a collection?

I want to aggregate and insert the results into an existing collection, without deleting that collection. The documentation seems to suggest that this isn't directly possible. I find that hard to believe.
The map-reduce functionality has 'output modes', including 'merge', which does what I want. I'm looking for the equivalent for aggregation.
The new $out aggregation stage supports inserting into a collection, but it replaces the collection rather than updating it. If I did this I would (I think) have to run another map-reduce to merge this into another collection, which seems inefficient.
Am I missing something or is the functionality just missing from the aggregation feature?
I used the output from aggregation to insert/merge into the collection:
db.coll2.insert(
    db.coll1.aggregate([]).toArray()
)
Reading the documentation answers this question quite precisely. At the moment, Mongo is not able to do what you want.
The $out operation creates a new collection in the current database if one does not already exist. The collection is not visible until the aggregation completes. If the aggregation fails, MongoDB does not create the collection.
If the collection specified by the $out operation already exists, then upon completion of aggregation the $out stage atomically replaces the existing collection with the new results collection. The $out operation does not change any indexes that existed on the previous collection. If the aggregation fails, the $out operation makes no changes to the previous collection.
For anyone coming to this more recently: starting with version 4.2, you can do this using the $merge stage in an aggregation pipeline. It needs to be the last stage in the pipeline.
{ $merge: { into: "myOutput", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
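As a usage sketch (the source collection and group stage are hypothetical), $merge goes at the end of aggregate() and upserts into the existing collection without dropping it:

// Hypothetical sketch: merge aggregated totals into "myOutput".
db.coll1.aggregate([
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
    { $merge: { into: "myOutput", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
])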
If you're not stuck on using the aggregation operators, you could do an incremental map-reduce on the collection. This approach allows you to merge results into an existing collection.
See documentation below:
http://docs.mongodb.org/manual/tutorial/perform-incremental-map-reduce/