MongoDB $facet limitation

Scenario: in a pipeline, the $facet stage is limited to 16MB of data that it can process and pass to the next stage. This means that if I have millions of records (as in my case), the output of any $facet stage is limited to 16MB.
Questions:
How can I overcome this problem?
Are there any other pipeline stages that can help in this regard?
Can we fix this issue at the programming level? (Note: I am using the MongoDB C# driver.)
Solutions already looked at:
Using the "allowDiskUse" option: this doesn't work as expected.

$facet has a limit of roughly 104MB (100 MiB) when it is used for pagination in MongoDB. Each record (document)
can store at most 16MB.
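One common way to stay under the 16MB document limit is to make $facet return only one page of results plus a count, rather than every matching record. The sketch below only builds the pipeline as plain Python dicts; the function name `paginated_facet`, the facet keys `data` and `total`, and the field names in the usage lines are hypothetical and not taken from the question. It does not lift the limit, it just keeps the single $facet output document small.

```python
def paginated_facet(match, sort, page, page_size):
    """Build a pipeline whose $facet output holds one page plus a total count."""
    skip = (page - 1) * page_size  # pages are 1-based in this sketch
    return [
        {"$match": match},   # filter first so $facet sees fewer documents
        {"$sort": sort},
        {"$facet": {
            # only page_size documents end up in the output document
            "data": [{"$skip": skip}, {"$limit": page_size}],
            # one small document holding the number of matches
            "total": [{"$count": "count"}],
        }},
    ]


# Hypothetical filter and sort fields, for illustration only.
pipeline = paginated_facet({"status": "A"}, {"createdAt": -1}, page=3, page_size=20)
```

The resulting list can be passed to whichever driver is in use; the same stage documents apply in the C# driver, only the way they are constructed differs.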

Related

Mongodb Atlas - pipeline length greater than 50 not supported

When I use mongoose for MongoDB with an Atlas server, I get the error that a pipeline length greater than 50 is not supported.
I have searched the whole web for this and can't find a solution.
Is there a workaround for this?
This happens because you have used too many pipeline stages in one aggregation.
If you want to use more than 50 pipeline stages, you can use $facet.
The following workflow may be of help:
Separate your pipelines into several chunks (pipeline stages in each chunk must not exceed 50).
After running the facet, you can use $unwind to separate the result into separate documents like normal (you may need to restructure the data to restore the former format using $project).
You can run another facet after that if you want.
In this case, if you plan to run 150 stages in one aggregation, separate them into 4-5 chunks and make sure that each chunk uses the same scope to avoid "missing" or "undefined variable" errors. You can also use $unwind to restore the document format before running the next chunk.
Make sure that no output document exceeds 16MB; that's why I recommend using $unwind after $facet (documents can exceed 16MB during the stages). The reason is that $facet outputs everything into one document with an array of documents inside (if you want to show all documents), so $unwind separates those "inner" documents into individual documents.
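The workflow above can be sketched as a small helper that wraps each chunk of stages in its own $facet and then flattens the result again. This is a minimal sketch that only builds the pipeline as Python dicts; `chunk_with_facet`, `chunk_size`, and the facet key `chunk` are hypothetical names. It uses $replaceRoot to restore the document shape where the answer mentions $project, and it assumes none of the stages is one that $facet rejects in a sub-pipeline (for example $out).

```python
def chunk_with_facet(stages, chunk_size=50, key="chunk"):
    """Wrap every chunk_size stages in a $facet, then flatten its output."""
    pipeline = []
    for start in range(0, len(stages), chunk_size):
        chunk = stages[start:start + chunk_size]
        # $facet emits ONE document: {key: [all documents produced by chunk]}
        pipeline.append({"$facet": {key: chunk}})
        # turn the array back into one document per element
        pipeline.append({"$unwind": "$" + key})
        # promote the inner document so the next chunk sees the usual shape
        pipeline.append({"$replaceRoot": {"newRoot": "$" + key}})
    return pipeline


# 120 placeholder stages, for illustration only.
stages = [{"$match": {"n": i}} for i in range(120)]
chunked = chunk_with_facet(stages)
```

Each chunk costs three top-level stages, so 120 inner stages become 9 top-level ones. Whether this satisfies the Atlas limit depends on how the limit is counted, which is the answer's claim rather than something this sketch verifies.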
One note on this: you can try to limit the length of field names to prevent BufBuilder from exceeding its maximum size, which is 64MB. Try using $project instead of $addFields, as $addFields will increase the buffer between each stage.
Another note: you should not use pipeline stages that exceed 100MB of RAM if you are using a MongoDB Atlas tier below M10.
It would be better if you provided pseudo-code for the problem you are having, but I think this would fix it.

MongoDB and Aggregation Framework Pipeline Stages

I am using the MongoDB aggregation framework and have multiple $lookup stages in the pipeline. Is that going to affect performance? Is there any limitation on the number of stages in an aggregation pipeline?
There is no limitation on the number of stages in a pipeline. However, there are result size and memory limitations; refer to the online docs. $lookup doesn't, at least for now, take advantage of indexes. The more data and stages you have, the more time the MongoDB engine needs to process them.

Should I use the "allowDiskUse" option in a production environment?

Should I use the allowDiskUse option when a returned document exceeds the 16MB limit in aggregation?
Or should I alter the db structure or code logic to avoid the limit?
What are the advantages and disadvantages of 'allowDiskUse'?
Thanks for your help.
Here is the official doc I have seen:
Result Size Restrictions
Changed in version 2.6.
Starting in MongoDB 2.6, the aggregate command can return a cursor or store the results in a collection. When returning a cursor or storing the results in a collection, each document in the result set is subject to the BSON Document Size limit, currently 16 megabytes; if any single document exceeds the BSON Document Size limit, the command will produce an error. The limit only applies to the returned documents; during the pipeline processing, the documents may exceed this size.
Memory Restrictions
Changed in version 2.6.
Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/
allowDiskUse is unrelated to the 16MB result size limit. That setting controls whether pipeline stages such as $sort or $group can use temporary disk space if they need more than 100MB of memory. In theory, for an arbitrary pipeline this could be a very large amount of disk space. Personally, it has never been a problem for me, but that will depend on your data.
If your result is going to be more than 16MB then you need to use the $out pipeline stage to output the data to a collection or use a pipeline API that returns a cursor to results instead of returning all the data inline (for some drivers this is a separate method, for others it is a flag passed to the same method).
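As a rough illustration of the two options, the sketch below builds the document for the `aggregate` database command: it always requests a cursor, optionally enables `allowDiskUse`, and optionally appends an $out stage. The helper name `aggregate_command` and the collection names are hypothetical; the command fields (`aggregate`, `pipeline`, `cursor`, `allowDiskUse`) are those of the server command.

```python
def aggregate_command(collection, pipeline, out_collection=None, allow_disk_use=False):
    """Build an aggregate command that returns a cursor or writes via $out."""
    stages = list(pipeline)  # copy so the caller's list is not modified
    if out_collection is not None:
        # $out must be the last stage; results go to a collection
        stages.append({"$out": out_collection})
    return {
        "aggregate": collection,         # the command name must be the first key
        "pipeline": stages,
        "cursor": {},                    # ask for a cursor, not one inline result
        "allowDiskUse": allow_disk_use,  # lets stages such as $sort spill to disk
    }


# Hypothetical collection and field names, for illustration only.
cmd = aggregate_command(
    "orders",
    [{"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}}],
    out_collection="order_totals",
    allow_disk_use=True,
)
```

Most drivers hide this document behind a helper; in PyMongo, for instance, `collection.aggregate(pipeline, allowDiskUse=True)` returns a cursor. Note that the 16MB limit still applies to each individual document either way.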

(Real time) Small data aggregation MongoDB: triggers?

What is a reliable and efficient way to aggregate small data in MongoDB?
Currently, my data that needs to be aggregated is under 1 GB, but can go as high as 10 GB. I'm looking for a real time strategy or near real time (aggregation every 15 minutes).
It seems like Map/Reduce, Hadoop, and Storm are all overkill. I know that triggers don't exist, but I found this one post that may be ideal for my situation. Is creating a trigger in MongoDB an ideal solution for real-time small data aggregation?
MongoDB has two built-in options for aggregating data - the aggregation framework and map-reduce.
The aggregation framework is faster (executing as native C++ code as opposed to a JavaScript map-reduce job) but more limited in the sorts of aggregations that are supported. Map-reduce is very versatile and can support very complex aggregations but is slower than the aggregation framework and can be more difficult to code.
Either of these would be a good option for near real time aggregation.
One further consideration to take into account is that as of the 2.4 release the aggregation framework returns a single document containing its results and is therefore limited to returning 16MB of data. In contrast, MongoDB map-reduce jobs have no such limitation and may output directly to a collection. In the upcoming 2.6 release of MongoDB, the aggregation framework will also gain the ability to output directly to a collection, using the new $out operator.
Based on the description of your use case, I would recommend using map-reduce as I assume you need to output more than 16MB of data. Also, note that after the first map-reduce run you may run incremental map-reduce jobs that run only on the data that is new/changed and merge the results into the existing output collection.
As you know, MongoDB doesn't support triggers, but you may easily implement triggers in the application by tailing the MongoDB oplog. This blog post and this SO post cover the topic well.
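To make the incremental map-reduce idea concrete, here is a pure-Python simulation, not MongoDB code: a first run aggregates the existing data, and a later run aggregates only the new documents and folds them into the earlier output by calling the reduce function again. This is intended to mirror what the `reduce` output action of mapReduce does on the server; all function and field names are hypothetical.

```python
from collections import defaultdict


def map_reduce(docs, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn and reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def merge_incremental(existing, new_results, reduce_fn):
    """Fold results computed from new data into the existing output."""
    merged = dict(existing)
    for key, value in new_results.items():
        if key in merged:
            # re-reduce the old and new values for a key seen before
            merged[key] = reduce_fn(key, [merged[key], value])
        else:
            merged[key] = value
    return merged


def emit_amount(doc):
    return [(doc["user"], doc["amount"])]


def sum_values(key, values):
    return sum(values)


first_batch = [{"user": "a", "amount": 1}, {"user": "b", "amount": 2}, {"user": "a", "amount": 3}]
new_batch = [{"user": "a", "amount": 5}, {"user": "c", "amount": 7}]

totals = map_reduce(first_batch, emit_amount, sum_values)
totals = merge_incremental(totals, map_reduce(new_batch, emit_amount, sum_values), sum_values)
```

The approach only works when the reduce function can be applied to its own output, as a sum can; that is also a requirement MongoDB places on reduce functions.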

Aggregation framework on full table scan

I know that the aggregation framework is suitable if there is an initial $match stage to limit the collection to be aggregated. However, there may be times when the filtered collection is still large, say around 2 million documents, and the aggregation involves $group. Is the aggregation framework fit to work on such a collection, given a requirement to output results in at most 5 seconds? Currently I work on a single node. Would performing the aggregation on a sharded setup significantly improve performance?
As far as I know, the only limitations are that the result of the aggregation can't exceed 16MB, since what it returns is a document and that is the size limit for a document in MongoDB. Also, you can't use more than 10% of the machine's total memory; for that reason, $match phases are usually used to reduce the set you work with, or a $project phase to reduce the data per document.
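A minimal sketch of that ordering, built as plain Python dicts: filter first, keep only the fields the grouping needs, then group. The helper name `reduced_group_pipeline`, its parameters, and the field names in the usage lines are hypothetical.

```python
def reduced_group_pipeline(match, keep_fields, group_key, sum_field):
    """Build $match -> $project -> $group so $group handles less data."""
    # group_key and sum_field must be listed in keep_fields
    projection = {field: 1 for field in keep_fields}
    projection["_id"] = 0  # drop _id as well, it is not needed for grouping
    return [
        {"$match": match},          # fewer documents
        {"$project": projection},   # fewer fields per document
        {"$group": {
            "_id": "$" + group_key,
            "total": {"$sum": "$" + sum_field},
        }},
    ]


# Hypothetical field names, for illustration only.
group_pipeline = reduced_group_pipeline(
    {"year": 2013}, ["region", "amount"], group_key="region", sum_field="amount"
)
```

An index on the fields used in the initial $match helps only when $match is the first stage, which is another reason to keep it at the front.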
Be aware that in a sharded environment, after $group or $sort phases the aggregation is brought back to the mongos before being sent to the next phase of the pipeline. The mongos could potentially be running on the same machine as your application and could hurt your application's performance if not handled correctly.