MongoDB Atlas - pipeline length greater than 50 not supported - mongodb

When I use Mongoose for MongoDB with an Atlas server, it gives the error that a pipeline length greater than 50 is not supported.
I have searched the whole web for this and can't find a solution.
Is there a workaround for this?

This happens because you have used too many pipeline stages in one aggregation.
If you want to use more than 50 pipeline stages, you can use $facet.
The following workflow may help (a sketch follows at the end of this answer):
Separate your pipeline into several chunks (the number of stages in each chunk must not exceed 50).
After running the $facet, use $unwind to split the result back into separate documents as normal (you may need to restructure the data with $project to restore the former shape).
You can run another $facet after that if you want.
For example, if you plan to run 150 stages in one aggregation, you separate them into three or four chunks. Make sure each chunk works on the same fields/scope, to avoid "missing" or "undefined variable" errors, and use $unwind to restore the document format before running the next chunk.
Make sure that no output document exceeds 16MB; that is why I recommend using $unwind after $facet (documents may exceed 16MB during intermediate stages). The reason is that $facet outputs everything as a single document with an array of documents inside (if you want to show all documents), so $unwind separates those "inner" documents back into individual documents.
One note on this: try limiting the fields you carry along to prevent the BufBuilder from exceeding its maximum size, which is 64MB. Prefer $project over $addFields, as $addFields increases the buffer between stages.
Another note: avoid pipeline stages that need more than 100MB of RAM if you are on an Atlas tier below M10.
It would help if you provided pseudo-code for the problem you are having, but I think this should fix it.
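Here is a minimal sketch of the chunking workflow above, written in TypeScript for the Node.js driver (the same pipeline array should also work with Mongoose's Model.aggregate()). The collection name, field names, and the contents of each chunk are placeholders, and $replaceRoot stands in for the $project restructuring mentioned above:

```typescript
import { MongoClient, Document } from "mongodb";

// Each chunk holds at most 50 of your original pipeline stages.
const chunk1: Document[] = [
  { $match: { status: "active" } }, // ...up to 50 stages
];
const chunk2: Document[] = [
  { $sort: { createdAt: -1 } },     // ...up to 50 stages
];

const pipeline: Document[] = [
  // The first chunk runs inside a facet; the facet result is a single
  // document whose "stage1" field is an array of the intermediate documents.
  { $facet: { stage1: chunk1 } },
  // Split that array back into separate documents...
  { $unwind: "$stage1" },
  // ...and restore the original document shape before the next chunk.
  { $replaceRoot: { newRoot: "$stage1" } },
  // Second chunk, same pattern; repeat for as many chunks as needed.
  { $facet: { stage2: chunk2 } },
  { $unwind: "$stage2" },
  { $replaceRoot: { newRoot: "$stage2" } },
];

async function run(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const docs = await client
      .db("mydb")
      .collection("items")
      .aggregate(pipeline)
      .toArray();
    console.log(`got ${docs.length} documents`);
  } finally {
    await client.close();
  }
}
```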

Related

MongoDB $facet limitation

Scenario - In a pipeline, the $facet stage has a limitation of 16MB on the data it can process and pass to the next stage. This means that if I have millions of records (as in my case), the data processed by any $facet stage will be limited to 16MB.
Question -
How to overcome the above problem?
Are there any other pipeline stages that can help in this regard?
Can we fix this issue at the programming level? (Note: I am using the C# MongoDB driver.)
Solutions already looked at:
Using the "allowDiskUse" feature -> this doesn't work as expected.
When using $facet (for pagination, for example), the document that $facet builds is subject to MongoDB's 104MB (104857600-byte) limit, and each individual record can be at most 16MB.
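If the $facet is being used for pagination, one way to stay within those limits is to keep each facet's output small, for example by returning only a count plus one modest page of projected fields. A hedged sketch (field names and page size are placeholders):

```typescript
import { Document } from "mongodb";

// One page of results plus a total count, built in a single $facet.
// Choose pageSize so the "data" array stays well under the 16MB
// output-document limit.
function paginatedFacet(page: number, pageSize: number): Document[] {
  return [
    {
      $facet: {
        metadata: [{ $count: "total" }],
        data: [
          { $skip: page * pageSize },
          { $limit: pageSize },
          // Project only the fields you actually need.
          { $project: { _id: 1, name: 1, updatedAt: 1 } },
        ],
      },
    },
  ];
}
```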

How to process documents by batch in Mongo pipeline without using $skip?

My team and I have some questions related to a Mongo pipeline.
We have a pipeline used to process a large number of documents. In this pipeline, we use the $group stage. However, this stage has a hard-coded limit of 100MB, which is not enough for the quantity of documents we have (https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/#memory-restrictions).
The solution we implemented was a combination of the $skip and $limit stages at the beginning of the pipeline, in order to process the documents in batches of N (where N is chosen so that a batch of N documents weighs less than 100MB).
However, this solution has a performance issue: the $skip stage gets slower as the number of documents to skip increases (https://arpitbhayani.me/techie/mongodb-cursor-skip-is-slow.html).
We then found this article: https://arpitbhayani.me/techie/fast-and-efficient-pagination-in-mongodb.html. The second approach described there seems to be a good solution for our use case. We would like to implement this approach. However, for this to work, the documents have to be sorted by the internal Mongo ID (_id) at the start of the pipeline. We thought about using the $sort stage beforehand, but this stage also has a hard-coded 100MB limit.
What method could we use to ensure that the documents are sorted by _id at the start of the pipeline?
Is there maybe another solution to process the documents in batches weighing less than 100MB?
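For reference, a minimal sketch of the second approach from that article (range/keyset batching on _id) with the Node.js driver; the collection name, batch size, and the trailing stages are placeholders. Because the $match/$sort on _id can use the _id index, the sort does not need to buffer documents in memory, so the 100MB blocking-sort limit should not apply to it:

```typescript
import { MongoClient, ObjectId, Document } from "mongodb";

const BATCH_SIZE = 10_000; // tune so one batch stays well under 100MB

async function processInBatches(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const coll = client.db("mydb").collection("events");
    let lastId: ObjectId | null = null;

    for (;;) {
      const range: Document = lastId ? { _id: { $gt: lastId } } : {};
      const batch = await coll
        .aggregate([
          { $match: range },      // resumes right after the previous batch
          { $sort: { _id: 1 } },  // index-backed, not an in-memory sort
          { $limit: BATCH_SIZE },
          // ...the rest of the pipeline ($group etc.) goes here and now
          // only ever sees BATCH_SIZE documents at a time.
        ])
        .toArray();

      if (batch.length === 0) break;
      lastId = batch[batch.length - 1]._id as ObjectId;
    }
  } finally {
    await client.close();
  }
}
```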

Should I use the "allowDiskUse" option in a product environment?

Should I use the allowDiskUse option when the returned documents exceed the 16MB limit in an aggregation?
Or should I alter the DB structure or the code's logic to avoid the limit?
What are the advantages and disadvantages of allowDiskUse?
Thanks for your help.
Here is the official doc I have seen:
Result Size Restrictions
Changed in version 2.6.
Starting in MongoDB 2.6, the aggregate command can return a cursor or store the results in a collection. When returning a cursor or storing the results in a collection, each document in the result set is subject to the BSON Document Size limit, currently 16 megabytes; if any single document exceeds the BSON Document Size limit, the command will produce an error. The limit only applies to the returned documents; during the pipeline processing, the documents may exceed this size.
Memory Restrictions
Changed in version 2.6.
Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/
allowDiskUse is unrelated to the 16MB result size limit. That setting controls whether pipeline stages such as $sort or $group can use some temporary disk space if they need more than 100MB of memory. In theory, for an arbitrary pipeline this could be a very large amount of disk space. Personally it's never been a problem, but that will come down to your data.
If your result is going to be more than 16MB then you need to use the $out pipeline stage to output the data to a collection or use a pipeline API that returns a cursor to results instead of returning all the data inline (for some drivers this is a separate method, for others it is a flag passed to the same method).
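A hedged sketch of both options with the Node.js driver (database, collection, and pipeline contents are placeholders): allowDiskUse lets memory-heavy stages spill to temporary files, while iterating the cursor, or ending the pipeline with $out, avoids building one oversized response document:

```typescript
import { MongoClient } from "mongodb";

async function run(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const coll = client.db("mydb").collection("logs");

    // Option 1: allow $group/$sort to spill to disk and stream the results.
    const cursor = coll.aggregate(
      [
        { $group: { _id: "$userId", hits: { $sum: 1 } } },
        { $sort: { hits: -1 } },
      ],
      { allowDiskUse: true }
    );
    for await (const doc of cursor) {
      // Each individual document must still be under 16MB, but no single
      // oversized response document is ever built.
      console.log(doc);
    }

    // Option 2: write an arbitrarily large result set to another collection.
    await coll
      .aggregate([
        { $group: { _id: "$userId", hits: { $sum: 1 } } },
        { $out: "hits_by_user" },
      ])
      .toArray(); // draining the cursor makes the pipeline (and $out) run
  } finally {
    await client.close();
  }
}
```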

When do I absolutely need the $out operator in MongoDB aggregation?

MongoDB aggregation documentation on $out says:
"Takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the pipeline. The $out operator lets the aggregation framework return result sets of any size."
https://docs.mongodb.org/manual/reference/operator/aggregation/out/
So, one issue may be that the aggregation runs out of memory or uses a lot of memory. But how will $out help here? Ultimately, if the aggregation returns a lot of buckets, they have to be held in memory first.
The $out operator is useful when you have a use case that takes a long time to calculate but doesn't need to be current all the time.
Let's say you have a website where you want a list of the top ten currently most popular articles on the frontpage (most hits in the past 60 minutes). To create this statistic, you need to parse your access log collection with a pipeline like this:
$match the last hour
$group by article-id and user to filter out reloads
$group again by article-id to get the hit count for each article
$sort by count.
$limit to 10 results
When you have a very popular website with a lot of content, this can be quite a load-heavy aggregation. And when it's on the frontpage, you need to run it for every single frontpage hit. This can create quite a nasty load on your database and considerably bog down the loading time.
How do we solve this problem?
Instead of performing that aggregation on every page hit, we perform it once every minute with a cronjob which uses $out to put the aggregated top ten list into a new collection. You can then query the cached results in that collection directly. Getting all 10 results from a 10-document collection will be far faster than performing that aggregation all the time.
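A hedged sketch of that scheduled aggregation with the Node.js driver; the collection and field names (accessLog, articleId, userId, ts) and the target collection topArticles are placeholders:

```typescript
import { MongoClient } from "mongodb";

// Run this from a cronjob (e.g. once a minute): it recomputes the
// top-ten list and replaces the cached collection via $out.
async function refreshTopArticles(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const oneHourAgo = new Date(Date.now() - 60 * 60 * 1000);
    await client
      .db("mydb")
      .collection("accessLog")
      .aggregate([
        { $match: { ts: { $gte: oneHourAgo } } },                        // last hour only
        { $group: { _id: { article: "$articleId", user: "$userId" } } }, // collapse reloads
        { $group: { _id: "$_id.article", hits: { $sum: 1 } } },          // hits per article
        { $sort: { hits: -1 } },
        { $limit: 10 },
        { $out: "topArticles" },                                         // cache the result
      ])
      .toArray(); // draining the cursor makes the pipeline (and $out) run
  } finally {
    await client.close();
  }
}

// The frontpage then just reads the ten cached documents:
//   db.collection("topArticles").find().sort({ hits: -1 }).toArray()
```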

Aggregation framework on full table scan

I know that the aggregation framework is suitable if there is an initial $match stage to limit the collection to be aggregated. However, there may be times when the filtered collection is still large, say around 2 million documents, and the aggregation involves $group. Is the aggregation framework fit to work on such a collection, given a requirement to output results in at most 5 seconds? Currently I work on a single node. Would performing the aggregation on a sharded set significantly improve performance?
As far as I know, the only limitations are that the result of the aggregation can't exceed the 16MB limit, since what it returns is a document and that's the size limit for a document in MongoDB, and that you can't use more than 10% of the total memory of the machine. That's why $match stages are usually used to reduce the set you work with, or a $project stage to reduce the data per document.
Be aware that in a sharded environment, after $group or $sort stages the aggregation is brought back to the mongos before being sent to the next stage of the pipeline. The mongos could potentially be running on the same machine as your application, and that could hurt your application's performance if not handled correctly.
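As a hedged illustration of the "reduce early" advice (collection and field names are placeholders): filter with $match and strip unneeded fields with $project before the $group, so the memory-limited stages see as small a working set as possible:

```typescript
import { Document } from "mongodb";

// Shrink the working set before the expensive $group stage.
const pipeline: Document[] = [
  // 1. Reduce the number of documents as early as possible (ideally on an indexed field).
  { $match: { status: "done", createdAt: { $gte: new Date("2024-01-01") } } },
  // 2. Reduce the size of each document: keep only what $group needs.
  { $project: { customerId: 1, amount: 1 } },
  // 3. $group now operates on far less data.
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
];
```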