Should I use the "allowDiskUse" option in a product environment? - mongodb

Should I use the allowDiskUse option when the returned documents exceed the 16MB limit in aggregation?
Or should I alter the database structure or code logic to avoid the limit?
What are the advantages and disadvantages of allowDiskUse?
Thanks for your help.
Here is the official doc I have seen:
Result Size Restrictions
Changed in version 2.6.
Starting in MongoDB 2.6, the aggregate command can return a cursor or store the results in a collection. When returning a cursor or storing the results in a collection, each document in the result set is subject to the BSON Document Size limit, currently 16 megabytes; if any single document exceeds the BSON Document Size limit, the command will produce an error. The limit only applies to the returned documents; during the pipeline processing, the documents may exceed this size.
Memory Restrictions
Changed in version 2.6.
Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/

allowDiskUse is unrelated to the 16MB result size limit. That setting controls whether pipeline stages such as $sort or $group can use temporary disk space when they need more than 100MB of memory. In theory, for an arbitrary pipeline this could be a very large amount of disk space. In my experience it has never been a problem, but that will depend on your data.
If your result is going to be more than 16MB, then you need to use the $out pipeline stage to write the output to a collection, or use a pipeline API that returns a cursor to the results instead of returning all the data inline (for some drivers this is a separate method, for others it is a flag passed to the same method).
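The two options can be sketched as follows. This is a minimal example; the collection name "orders" and its fields are hypothetical, and the commented line shows where the actual server call would go:

```javascript
// Hypothetical collection "orders" with fields customerId and total.
// $sort and $group may need more than 100MB of RAM on a large collection,
// so allowDiskUse lets them spill to temporary files on disk.
const pipeline = [
  { $sort: { total: -1 } },
  { $group: { _id: "$customerId", spent: { $sum: "$total" } } },
  // If the full result set would exceed 16MB, write it to a collection
  // with $out instead of returning it inline:
  { $out: "customer_totals" }
];
const options = { allowDiskUse: true };

// In the mongo shell or a driver this would run as:
// db.orders.aggregate(pipeline, options);
```

Without the $out stage, the same call returns a cursor, which also avoids the 16MB inline-result limit as long as each individual document fits.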

Related

What is the maximum size of an output collection in a stage of aggregation pipeline?

I read that a result document cannot be more than 16MB in size in the aggregation pipeline.
What if a query outputs billions of documents? Is it simply going to crash then?
That seems rather limiting: can we really not query a collection beyond 16MB?

Mongodb Atlas - pipeline length greater than 50 not supported

When I use mongoose for MongoDB with an Atlas Mongo server, it gives the error that a pipeline length greater than 50 is not supported.
I have searched the whole web for this and can't find a solution.
Is there a workaround for this?
This happens because you have used too many pipeline stages in one aggregation.
If you want to use more than 50 pipeline stages, you can use $facet.
The following workflow may be of help:
Separate your pipeline into several chunks (the stages in each chunk must not exceed 50), and run each chunk as the sub-pipeline of a $facet stage.
After running the facet, you can use $unwind to separate the result into individual documents as normal (you may need to restructure the data with $project to restore the former format).
You can run another facet after that if you want.
For example, if you plan to run 150 stages in one aggregation, separate them into 4-5 chunks. Make sure that each chunk uses the same scope to avoid "missing" or "undefined variable" errors, and use $unwind to restore the document format before running the next chunk.
Make sure that each output document does not exceed 16MB; that's why I recommend using $unwind after $facet (documents can exceed 16MB during the stages). The reason is that $facet outputs everything as one document with an array of documents inside (if you want to show all documents), so $unwind separates those "inner" documents back into individual documents.
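The chunking workflow above can be sketched like this. The $addFields stages are placeholders standing in for your own pipeline; in a real pipeline, each $facet output document (which holds a whole chunk's results as an array) must stay under 16MB:

```javascript
// Placeholder for a pipeline with more than 50 stages.
const bigStages = Array.from({ length: 120 }, (_, i) => ({ $addFields: { ["f" + i]: i } }));

// Split the stages into chunks of at most 50 each.
function chunk(stages, size) {
  const out = [];
  for (let i = 0; i < stages.length; i += size) out.push(stages.slice(i, i + size));
  return out;
}

// Wrap each chunk in a $facet stage, then $unwind + $replaceRoot to turn
// the facet's single array-document back into plain documents before the
// next chunk runs.
const pipeline = chunk(bigStages, 50).flatMap(stages => [
  { $facet: { docs: stages } },
  { $unwind: "$docs" },
  { $replaceRoot: { newRoot: "$docs" } }
]);
```

The resulting top-level pipeline is only a few stages long (three per chunk), so it stays well under the 50-stage limit even though the sub-pipelines carry far more stages.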
One note on this is that you can try keeping field names short to prevent the BufBuilder from exceeding its maximum size, which is 64MB. Try using $project instead of $addFields, as $addFields carries all existing fields forward and increases the buffer between each stage.
Another note: you should not use pipeline stages that exceed 100MB of RAM if you are using a MongoDB Atlas tier below M10.
It may be better if you provide pseudo-code for the problem you are having, but I think this would fix it.

Default batch size in aggregate command in MongoDB

From the aggregate command documentation:
To indicate a cursor with the default batch size, specify cursor: {}.
However, I haven't found the value of such a default or how to find it (maybe using a mongo admin command).
How can I find that value?
From the docs:
The MongoDB server returns the query results in batches. The amount of data in the batch will not exceed the maximum BSON document size.
New in version 3.4: Operations of type find(), aggregate(), listIndexes, and listCollections return a maximum of 16 megabytes per batch. batchSize() can enforce a smaller limit, but not a larger one.
find() and aggregate() operations have an initial batch size of 101 documents by default. Subsequent getMore operations issued against the resulting cursor have no default batch size, so they are limited only by the 16 megabyte message size.
So, the default for the first batch is 101 documents, the batch size for subsequent getMore() calls is undetermined but cannot exceed 16 megabytes.
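In command-option form, the default and an explicit batch size look like the sketch below (the collection name in the commented call is hypothetical). Note that batchSize can lower the 16MB per-batch cap but not raise it:

```javascript
// Default first batch: an empty cursor document requests the server
// default (101 documents for find/aggregate).
const defaultOptions = { cursor: {} };

// Explicit first-batch size of 1000 documents.
const explicitOptions = { cursor: { batchSize: 1000 } };

// As a raw command this would run as:
// db.runCommand({ aggregate: "events", pipeline: [], ...explicitOptions });
```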
If I'm not entirely wrong, I think it's 101 for aggregation pipeline.

16 MB size of aggregration pipeline in MongoDB

This is about a recommendation on MongoDB. I have a collection that is always growing row by row (I mean the count of documents). It is about 5 billion now. When I make a request on this collection I sometimes get the error about the 16MB size limit.
The first thing that I want to ask is which structure is the best way of creating collections whose document counts grow hugely. What is the best approach? What should I do for this kind of structure and for performance?
Just to clarify, the 16MB limitation is on documents, not collections. Specifically, it's the maximum BSON document size, as specified on this page in the documentation.
The maximum BSON document size is 16 megabytes.
If you're running into the 16MB limit in aggregation, it is because you are using MongoDB version 2.4 or older. In those versions, the aggregate() method returned a single document, which is subject to the same limitation as all other documents. Starting in 2.6, the aggregate() method returns a cursor, which is not subject to the 16MB limit. For more information, you should consult this page in the documentation. Note that each stage in the aggregation pipeline is still limited to 100MB of RAM.
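With the cursor API, a total result far larger than 16MB can be consumed one document at a time; only each individual document must fit in 16MB. A minimal sketch, with a hypothetical collection "sales" and the server calls shown as comments:

```javascript
// Each individual document must still fit in 16MB, but the cursor can
// stream an arbitrarily large total result set.
const pipeline = [
  { $match: { year: { $gte: 2015 } } },
  { $group: { _id: "$year", count: { $sum: 1 } } }
];

// In the shell, aggregate() returns a cursor since 2.6:
// const cursor = db.sales.aggregate(pipeline, { allowDiskUse: true });
// cursor.forEach(doc => printjson(doc));
```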

Aggregation framework on full table scan

I know that the aggregation framework is suitable if there is an initial $match stage to limit the collection to be aggregated. However, there may be times when the filtered set is still large, say around 2 million documents, and the aggregation involves $group. Is the aggregation framework fit to work on such a collection given a requirement to output results in at most 5 seconds? Currently I work on a single node. By performing the aggregation on a sharded set, will there be a significant improvement in performance?
As far as I know, the main limitations are that each returned document of the aggregation can't surpass 16MB, since that is the maximum BSON document size in MongoDB, and that each pipeline stage is limited to 100MB of RAM unless allowDiskUse is enabled. For that reason, $match phases are usually used to reduce the set you work with, and a $project phase to reduce the data per document.
Be aware that in a sharded environment, after $group or $sort phases the aggregation is brought back to the mongos before being sent to the next phase of the pipeline. The mongos could potentially be running on the same machine as your application and could hurt your application's performance if not handled correctly.
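A pipeline shaped for a large scan typically filters and trims fields before the memory-hungry $group, along the lines of this sketch (field and collection names are hypothetical):

```javascript
const pipeline = [
  // Filter as early as possible: a leading $match can use indexes and
  // shrinks the working set for every later stage.
  { $match: { status: "active" } },
  // Drop every field $group does not need, so less data flows between stages.
  { $project: { customerId: 1, total: 1 } },
  // The memory-heavy stage now sees only the filtered, trimmed documents.
  { $group: { _id: "$customerId", spent: { $sum: "$total" } } }
];

// db.orders.aggregate(pipeline, { allowDiskUse: true });
```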