MongoDB $group operation - memory usage optimisation

I need to perform a $group operation over my entire collection. This group stage is reaching the limit of 100MB RAM usage.
The $group stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $group will produce an error. However, to allow for the handling of large datasets, set the allowDiskUse option to true to enable $group operations to write to temporary files.
I'm not limited by RAM, but I couldn't find how to increase this memory usage limit. Does anyone know how to configure this restriction?
Setting allowDiskUse to true would solve the problem, but I assume the whole operation will be much slower, and I'd like to find a better solution.
{
  $group: {
    _id: {
      producer: "$producer",
      dataset: "$dataset",
      featureOfInterest: "$_id.featureOfInterest",
      observedProperty: "$_id.observedProperty"
    },
    documentId: { $push: "$documentId" }
  }
}
This $group operation is performed over entire complex objects (producer and dataset). I understand that this operation is expensive, since "it requires scanning the entire result set before yielding, and MongoDB will have to at least store a pointer or an index of each element in the groups." I'd rather $group on unique ID fields for both of these objects.
How could I $group using the unique IDs and $project the whole objects afterwards?
I'd like to obtain the same result as the group operation above using the group operation below at the beginning of my aggregation pipeline:
{
  $group: {
    _id: {
      producer: "$producer.producerId",
      dataset: "$dataset.datasetId",
      featureOfInterest: "$_id.featureOfInterest",
      observedProperty: "$_id.observedProperty"
    },
    documentId: { $push: "$documentId" }
  }
}
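To keep the full producer and dataset objects available after grouping on their IDs, I imagine something along these lines might work (just a sketch on my part, assuming producerId and datasetId uniquely identify the embedded objects, and using $first to carry one copy of each object through the group), but I'm not sure it's the right approach:
{
  $group: {
    _id: {
      producer: "$producer.producerId",   // group on the unique IDs only
      dataset: "$dataset.datasetId",
      featureOfInterest: "$_id.featureOfInterest",
      observedProperty: "$_id.observedProperty"
    },
    producer: { $first: "$producer" },    // carry one full copy of each object
    dataset: { $first: "$dataset" },
    documentId: { $push: "$documentId" }
  }
}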

allowDiskUse
There is no option in MongoDB to increase the 100 MB memory limit for aggregation stages, so in a heavy pipeline you have to set the allowDiskUse flag to true.
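For reference, a minimal shell example of passing the flag (the $group stage is the one from your question; "observations" is just a placeholder collection name):
// Placeholder collection name; the option goes in the second argument of aggregate()
db.observations.aggregate(
  [
    {
      $group: {
        _id: {
          producer: "$producer",
          dataset: "$dataset",
          featureOfInterest: "$_id.featureOfInterest",
          observedProperty: "$_id.observedProperty"
        },
        documentId: { $push: "$documentId" }
      }
    }
  ],
  { allowDiskUse: true }   // lets $group spill to temporary files past 100 MB
)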
However, you may be interested in reading about the MongoDB In-Memory Storage Engine.
Example of starting mongod with the in-memory storage engine from the command line:
mongod --storageEngine inMemory --dbpath <path> --inMemorySizeGB <newSize>
More information in the MongoDB docs: https://docs.mongodb.com/manual/core/inmemory/
Regarding the second question - I didn't get it. Please post example documents.

Related

MongoError: document constructed by $facet is 104860008 bytes, which exceeds the limit of 104857600 bytes

I am getting a memory size limit error while running multiple sub-pipelines in $facet. Can someone help me with this issue?
Scenario: I have a cron job which runs once a day. I want to execute multiple pipelines using $facet against a collection with millions of documents whenever the job is triggered.
[
  {
    $facet: {
      query1: [pipeline_1],
      query2: [pipeline_2],
      query3: [pipeline_3],
      ...
      query_n: [pipeline_n]
    }
  },
  {
    $merge: { into: some_collection }
  }
]
I tried db.collection.aggregate([], {allowDiskUse: true});, but I'm still getting the same error.
What can be the workaround for this? Please help.
In most cases using allowDiskUse should eliminate this error, which is why I suspect you're using the wrong syntax for it. Depending on your Mongo version there are some hard limitations on certain operators like $graphLookup; these operators will always hit a memory error regardless of the allowDiskUse flag.
Assuming you don't use any of these operators, allowDiskUse will work. Here is a syntax example for the Node.js driver:
db.collection.aggregate([], {allowDiskUse: true});
If you are using $graphLookup or one of the other limited operators, then there's not much you can do. I would start by splitting these $facet stages into separate pipelines, as sketched below. If the problem still persists you'll need to either optimize the aggregation or find a different approach.
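To illustrate the splitting approach, here is a rough sketch for the Node.js driver (pipeline_1 ... pipeline_n and some_collection are the placeholders from your question; "events" stands in for your source collection, and the $merge options may need adjusting to match how the $facet output was consumed):
// Run each former $facet sub-pipeline as its own aggregation,
// merging the results into the same target collection.
const subPipelines = [pipeline_1, pipeline_2, pipeline_3 /* , ... pipeline_n */];

for (const stages of subPipelines) {
  await db.collection("events").aggregate(
    [...stages, { $merge: { into: "some_collection" } }],
    { allowDiskUse: true }
  ).toArray(); // iterating the cursor forces the $merge to run
}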

Does MongoDB aggregation $project decrease the amount of data to be kept in memory?

I am wondering whether writing $project just after the $match stage actually decreases the amount of data to be kept in memory. As an example, if we want array elements with paging from a user document, like the following:
const skip = 20;
const limit = 50;
UserModel.aggregate([
  { $match: { _id: userId } },
  { $project: { _id: 0, postList: { $slice: ["$postList", skip, limit] } } },
  { $lookup: ...
]);
Assume that there are other lists in the user document and they are very large in size.
So, will $project help improve the performance by not keeping the other large lists in memory?
Each aggregation stage scans the input documents from the collection (if it's the first stage) or from the previous stage. For example:
$match (filters the documents) - reduces the number of documents and the overall size
$project (transforms or shapes the documents) - can reduce (or increase) the size of the documents; the number of documents remains the same
$group - reduces the number of documents and changes the size
$skip, $limit - reduce the number of documents
$sort - no change in the size or number of documents
etc.
Each stage can affect the memory or the CPU, or both.
In general, the document size, the number of documents, the indexes, and memory can affect the query performance.
The memory restrictions for aggregation are clearly specified in the documentation (see Aggregation Pipeline Limits). If a stage exceeds the memory limit, the aggregation will terminate. In such cases you can specify the aggregation option { allowDiskUse: true }, and using this option will affect the query performance. If your aggregation is working without any memory-related issues (like query termination due to exceeding the memory limits), then there is no direct issue with your query performance.
The $match and $sort stages use indexes if they occur early in the pipeline, and this can improve performance.
Adding a stage to a pipeline means extra processing, and it can affect the overall performance, because the documents from the previous stage have to pass through this extra stage. In an aggregation pipeline the documents are passed through each stage, like in a pipe, and each stage does some data transformation. If you can avoid a stage, it can sometimes benefit the overall query performance. When the numbers are large, having an extra (unnecessary) stage is definitely a disadvantage. You have to take into consideration the memory restrictions as well as the size and the number of documents.
A $project can be used to reduce the size of the document. But is it necessary to add this stage? It depends on the factors mentioned above, your implementation, and the application. The documentation (Projection Optimization) says:
The aggregation pipeline can determine if it requires only a subset of the fields in the documents to obtain the results. If so, the pipeline will only use those required fields, reducing the amount of data passing through the pipeline.
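As a concrete way to check this on your own data, you can inspect the plan with explain; a minimal sketch in the mongo shell, assuming a hypothetical users collection holding the document from the question:
// $match can use the _id index; the early $project keeps only postList,
// so later stages never see the other large arrays in the document.
db.users.explain("executionStats").aggregate([
  { $match: { _id: userId } },   // userId as in the question
  { $project: { _id: 0, postList: { $slice: ["$postList", 20, 50] } } }
])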

Should I use the "allowDiskUse" option in a product environment?

Should I use the allowDiskUse option when the returned document exceeds the 16MB limit in an aggregation?
Or should I alter the DB structure or the code's logic to avoid the limit?
What are the advantages and disadvantages of allowDiskUse?
Thanks for your help.
Here is the official doc I have seen:
Result Size Restrictions
Changed in version 2.6.
Starting in MongoDB 2.6, the aggregate command can return a cursor or store the results in a collection. When returning a cursor or storing the results in a collection, each document in the result set is subject to the BSON Document Size limit, currently 16 megabytes; if any single document exceeds the BSON Document Size limit, the command will produce an error. The limit only applies to the returned documents; during the pipeline processing, the documents may exceed this size.
Memory Restrictions
Changed in version 2.6.
Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/
allowDiskUse is unrelated to the 16MB result size limit. That setting controls whether pipeline stages such as $sort or $group can use some temporary disk space if they need more than 100MB of memory. In theory, for an arbitrary pipeline this could be a very large amount of disk space. Personally it's never been a problem for me, but that will be down to your data.
If your result is going to be more than 16MB then you need to use the $out pipeline stage to output the data to a collection or use a pipeline API that returns a cursor to results instead of returning all the data inline (for some drivers this is a separate method, for others it is a flag passed to the same method).
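A brief sketch of both options in the mongo shell (collection and field names here are made up for illustration):
// Option 1: write the result set to a collection instead of returning it inline
db.events.aggregate([
  { $group: { _id: "$type", ids: { $push: "$_id" } } },
  { $out: "grouped_events" }
])

// Option 2: iterate a cursor so results are streamed in batches
// (each individual document must still stay under 16MB)
const cursor = db.events.aggregate([
  { $group: { _id: "$type", ids: { $push: "$_id" } } }
]);
while (cursor.hasNext()) {
  printjson(cursor.next());
}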

collection count increasing after sharding mongodb

I am not able to understand why the collection count keeps increasing after sharding in MongoDB.
I have a collection of 20M records; when I sharded it, the collection count kept increasing. Please help me out.
Cluster configuration:
3 shards
3 config servers
6 query routers
If I get it right, you mean that db.shardedCollection.count() returns more documents than you expect. This is a known bug (SERVER-3645).
TL;DR
The problem is that the way sharding works, it can happen that after a chunk migration so-called orphaned documents exist. These are documents which exist as a duplicate on a shard that is not responsible for the key range the document falls into. For almost all practical purposes, this is not a problem, since the mongos takes care of "sorting them out" (which is a bit simplified, but sufficient in this context).
However, when calling a db.collection.count() on a sharded collection, this query gets routed to all shards, since it does not contain the shard key.
Disclaimer: from here on, it is my theory, deduced from the observed behavior.
Since the orphaned documents still technically exist on a shard, they seem to get counted, and the result of the count as a whole is reported back to the mongos, which simply sums up all the results. I assume .count() takes a shortcut on the individual shards, possibly simply counting the entries of the _id index for performance reasons.
Workaround
As written in the ticket, using an aggregation mitigates the problem:
db.collection.aggregate([{ $group: { _id: "uniqueDocs", count: { $sum: 1 } } }])
However, this aggregation is not ideal and should show better performance when changed as below, if you have a shard key other than _id:
db.books.aggregate([
  { $project: { _id: 0, yourShardKey: 1 } },
  { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
or
db.books.aggregate([
  { $project: { _id: 1 } },
  { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
if you use _id as your shard key.

Incorrect Count returned by MongoDB (WiredTiger)

This sounds odd, and I hope I am doing something wrong, but my MongoDB collection is returning a count that is off by one.
I have a collection with (I am sure) 359671 documents. However, the count() command returns 359670 documents.
I am executing the count() command using the mongo shell:
rs0:PRIMARY> db.COLLECTION.count()
359670
This is incorrect.
It is not finding each and every document in my collection.
If I provide the following query to count, I get the correct result:
rs0:PRIMARY> db.COLLECTION.count({_id: {$exists: true}})
359671
I believe this is a bug in WiredTiger. As far as I am aware each document has the same definition, an _id field of an integer ranging from 0 to 359670, and a BinData field. I did not have this problem with the older storage engine (or Mongo 2, either could have caused the issue).
Is this something I have done wrong? I do not want to use the {_id: {$exists: true}} query as that takes 100x longer to complete.
According to this issue, this behaviour can occur if MongoDB experiences a hard crash and is not shut down gracefully. When no query is issued, MongoDB probably just falls back to the collected statistics.
According to the article, calling db.COLLECTION.validate(true) should reset the counters.
As now stated in the docs, db.collection.count() without a query parameter returns results based on the collection's metadata:
This may result in an approximate count. In particular:
On a sharded cluster, the resulting count will not correctly filter out orphaned documents.
After an unclean shutdown, the count may be incorrect.
When using a query parameter, as you did in the second query ({_id: {$exists: true}}), then it forces count to not use the collection's metadata, but to scan the collection instead.
Starting Mongo 4.0.3, count() is considered deprecated and the following alternatives are recommended instead:
Exact count of documents:
db.collection.countDocuments({})
which under the hood actually performs the following "expensive", but accurate aggregation (expensive since the whole collection is scanned to count records):
db.collection.aggregate([{ $group: { _id: null, n: { $sum: 1 } } }])
Approximate count of documents:
db.collection.estimatedDocumentCount()
which performs exactly what db.collection.count() does/did (it's actually a wrapper around count) and uses the collection's metadata.
This is thus almost instantaneous, but may lead to an approximate result in the particular cases mentioned above.
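Applied to the COLLECTION from the question above, the two recommended calls look like this (a small shell sketch):
// Exact count: scans documents (or an index), so it is not fooled by stale metadata
db.COLLECTION.countDocuments({})

// Metadata-based estimate: near-instant, but may be off after an unclean shutdown
db.COLLECTION.estimatedDocumentCount()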