Does MongoDB aggregation $project decrease the amount of data to be kept in memory?

I am wondering whether adding a $project stage right after the $match stage actually decreases the amount of data kept in memory. As an example, suppose we want a page of elements from an array in a user document, like the following:
const skip = 20;
const limit = 50;
UserModel.aggregate([
{ $match: { _id: userId } },
{ $project: { _id: 0, postList: 1 } },
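{ $project: { postList: { $slice: ["$postList", skip, limit] } } }, // $slice is an expression, so it has to sit inside a stage such as $project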
{ $lookup: ...
]);
Assume that the user document contains other lists that are very large.
So, will $project help improve performance by not keeping the other large lists in memory?

Each aggregation stage scans its input documents, which come from the collection (if it's the first stage) or from the previous stage. For example:
$match (filters the documents): reduces the number of documents and therefore the overall size
$project (transforms or shapes the document): can reduce (or increase) the size of each document; the number of documents remains the same
$group: reduces the number of documents and changes their size
$skip, $limit: reduce the number of documents
$sort: no change in the size or number of documents
etc.
Each stage can affect memory, CPU, or both.
In general the document size, number of documents, the indexes, and memory can affect the query performance.
The memory restrictions for aggregation are specified in the documentation (see Aggregation Pipeline Limits). If a stage exceeds the memory limit, the aggregation terminates with an error. In such cases you can specify the aggregation option { allowDiskUse: true }, although using this option will affect query performance. If your aggregation runs without memory-related issues (such as termination for exceeding the memory limit), then memory is not directly a problem for your query's performance.
The $match and $sort stages can use indexes when they appear early in the pipeline, and this can improve performance.
Adding a stage to a pipeline means extra processing, and it can affect overall performance, because the documents from the previous stage have to pass through the extra stage. In an aggregation pipeline the documents pass through each stage, as in a pipe, and each stage does some data transformation. If you can avoid a stage, it can sometimes benefit overall query performance. When the numbers are large, an extra (unnecessary) stage is definitely a disadvantage. You have to take into consideration the memory restrictions as well as the size and number of documents.
A $project can be used to reduce the size of the document. But is it necessary to add this stage? It depends on the factors mentioned above, your implementation, and the application. The documentation (Projection Optimization) says:
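As a rough sketch of passing that option, assuming a collection object whose aggregate(pipeline, options) method follows the shell signature. The names aggregateAllowingDisk and fakeCollection, and the "$category" field, are hypothetical and exist only so the snippet runs without a server:

```javascript
// Hypothetical wrapper around a shell/driver-style collection object.
function aggregateAllowingDisk(collection, pipeline) {
  // allowDiskUse lets memory-hungry stages such as $group or $sort
  // write to temporary files instead of failing at the memory limit.
  return collection.aggregate(pipeline, { allowDiskUse: true });
}

// Stand-in collection that just records the call (assumption: no real server).
const fakeCollection = {
  aggregate(pipeline, options) {
    return { pipeline, options };
  },
};

const recordedCall = aggregateAllowingDisk(fakeCollection, [
  { $group: { _id: "$category", count: { $sum: 1 } } },
]);
console.log(recordedCall.options); // { allowDiskUse: true }
```

Against a real deployment, the same second argument would be passed to the collection's own aggregate call.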
The aggregation pipeline can determine if it requires only a subset of
the fields in the documents to obtain the results. If so, the pipeline
will only use those required fields, reducing the amount of data
passing through the pipeline.
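For the paging case in the question, one possible shape is to fold the $slice expression into the $project stage, so that only the requested page of postList leaves that stage. This is a minimal sketch and not a measured optimization; buildPagedPostsPipeline is a hypothetical helper that only builds the stage array:

```javascript
// Hypothetical helper: builds the pipeline only; it does not talk to a server.
function buildPagedPostsPipeline(userId, skip, limit) {
  return [
    // Select the single user document by _id.
    { $match: { _id: userId } },
    // Keep only the requested page of postList and drop every other field.
    { $project: { _id: 0, postList: { $slice: ["$postList", skip, limit] } } },
  ];
}

const pagedPipeline = buildPagedPostsPipeline("user-1", 20, 50);
console.log(JSON.stringify(pagedPipeline, null, 2));
```

The resulting array could then be passed to UserModel.aggregate(...), with any $lookup stages appended after the $project.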

Related

MongoDB - How can I analyze Aggregation performance? [duplicate]

Is there an explain function for the Aggregation framework in MongoDB? I can't see it in the documentation.
If not is there some other way to check, how a query performs within the aggregation framework?
I know with find you just do
db.collection.find().explain()
But with the aggregation framework I get an error
db.collection.aggregate(
{ $project : { "Tags._id" : 1 }},
{ $unwind : "$Tags" },
{ $match: {$or: [{"Tags._id":"tag1"},{"Tags._id":"tag2"}]}},
{
$group:
{
_id : { id: "$_id"},
"count": { $sum:1 }
}
},
{ $sort: {"count":-1}}
).explain()
Starting with MongoDB version 3.0, simply changing the order from
collection.aggregate(...).explain()
to
collection.explain().aggregate(...)
will give you the desired results.
For older versions >= 2.6, you will need to use the explain option for aggregation pipeline operations
explain:true
db.collection.aggregate([
{ $project : { "Tags._id" : 1 }},
{ $unwind : "$Tags" },
{ $match: {$or: [{"Tags._id":"tag1"},{"Tags._id":"tag2"}]}},
{ $group: {
_id : "$_id",
count: { $sum:1 }
}},
{$sort: {"count":-1}}
],
{
explain:true
}
)
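The two forms above can be wrapped in one small helper. This is only a sketch: explainAggregate is a hypothetical name, the collection argument is assumed to behave like a mongo shell collection, and the two stub objects stand in for a newer and an older shell so the snippet runs without a server:

```javascript
// Hypothetical helper; `collection` is assumed to mimic a mongo shell collection.
function explainAggregate(collection, pipeline) {
  if (typeof collection.explain === "function") {
    // Newer shells (3.0+): collection.explain().aggregate(pipeline)
    return collection.explain().aggregate(pipeline);
  }
  // Older shells (2.6): pass the explain option to aggregate itself.
  return collection.aggregate(pipeline, { explain: true });
}

// Stubs that record which call path was taken (assumption: no real server).
const newerShellStub = {
  explain() {
    return { aggregate: (pipeline) => ({ via: "explain().aggregate", pipeline }) };
  },
};
const olderShellStub = {
  aggregate: (pipeline, options) => ({ via: "aggregate option", options }),
};

const tagPipeline = [{ $match: { "Tags._id": "tag1" } }];
console.log(explainAggregate(newerShellStub, tagPipeline).via);
console.log(explainAggregate(olderShellStub, tagPipeline).via);
```

Note that the pipeline is passed as a single array in both cases.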
An important consideration with the Aggregation Framework is that an index can only be used to fetch the initial data for a pipeline (e.g. usage of $match, $sort, $geoNear at the beginning of a pipeline) as well as subsequent $lookup and $graphLookup stages. Once data has been fetched into the aggregation pipeline for processing (e.g. passing through stages like $project, $unwind, and $group) further manipulation will be in-memory (possibly using temporary files if the allowDiskUse option is set).
Optimizing pipelines
In general, you can optimize aggregation pipelines by:
Starting a pipeline with a $match stage to restrict processing to relevant documents.
Ensuring the initial $match / $sort stages are supported by an efficient index.
Filtering data early using $match, $limit, and $skip.
Minimizing unnecessary stages and document manipulation (perhaps reconsidering your schema if complicated aggregation gymnastics are required).
Taking advantage of newer aggregation operators if you have upgraded your MongoDB server. For example, MongoDB 3.4 added many new aggregation stages and expressions including support for working with arrays, strings, and facets.
There are also a number of Aggregation Pipeline Optimizations that automatically happen depending on your MongoDB server version. For example, adjacent stages may be coalesced and/or reordered to improve execution without affecting the output results.
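Applied to the tag-count pipeline from the question, these guidelines might look like the sketch below. It is an illustration under assumptions, not a benchmarked result: buildTagCountPipeline is a hypothetical helper, and an index on "Tags._id" is assumed to exist for the first $match to benefit from it.

```javascript
// Hypothetical helper that only builds the stage array.
function buildTagCountPipeline(tagIds) {
  const tagFilter = { "Tags._id": { $in: tagIds } };
  return [
    // Filter first, so an index on "Tags._id" (assumed) can be used and
    // documents without any requested tag never enter the pipeline.
    { $match: tagFilter },
    { $project: { "Tags._id": 1 } },
    { $unwind: "$Tags" },
    // After $unwind each document holds one tag; drop the unrequested ones.
    { $match: tagFilter },
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    { $sort: { count: -1 } },
  ];
}

const tagCountPipeline = buildTagCountPipeline(["tag1", "tag2"]);
console.log(tagCountPipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
```

The $in filter is equivalent to the $or of two equality conditions used in the question.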
Limitations
As at MongoDB 3.4, the Aggregation Framework explain option provides information on how a pipeline is processed but does not support the same level of detail as the executionStats mode for a find() query. If you are focused on optimizing initial query execution you will likely find it beneficial to review the equivalent find().explain() query with executionStats or allPlansExecution verbosity.
There are a few relevant feature requests to watch/upvote in the MongoDB issue tracker regarding more detailed execution stats to help optimize/profile aggregation pipelines:
SERVER-19758: Add "executionStats" and "allPlansExecution" explain modes to aggregation explain
SERVER-21784: Track execution stats for each aggregation pipeline stage and expose via explain
SERVER-22622: Improve $lookup explain to indicate query plan on the "from" collection
Starting with version 2.6.x, MongoDB allows users to run explain with the aggregation framework.
All you need to do is to add explain : true
db.records.aggregate(
[ ...your pipeline...],
{ explain: true }
)
Thanks to Rafa, I know that it was possible to do even in 2.4, but only through runCommand(). But now you can use aggregate as well.
The aggregation framework is a set of analytics tools within MongoDB that allows us to run various types of reports or analysis on documents in one or more collections. It is based on the idea of a pipeline: we take input from a MongoDB collection and pass the documents from that collection through one or more stages, each of which performs a different operation on its inputs. Each stage takes as input whatever the stage before it produced as output, and the inputs and outputs for all stages are streams of documents. Each stage has a specific job: it expects a specific form of document and produces a specific output, which is itself a stream of documents. At the end of the pipeline, we get access to the output.
An individual stage is a data processing unit. Each stage takes a stream of documents as input, processes the documents one at a time, and produces an output stream of documents, again one at a time. Each stage provides a set of knobs or tunables that we can control to parameterize the stage for whatever task we're interested in. So a stage performs a generic, general-purpose task, and we parameterize it for the particular set of documents we're working with and for exactly what we would like the stage to do with them. These tunables typically take the form of operators that modify fields, perform arithmetic operations, reshape documents, or do some sort of accumulation task, as well as a variety of other things. Oftentimes we'll want to include the same type of stage multiple times within a single pipeline.
For example, we may wish to perform an initial filter so that we don't have to pass the entire collection into our pipeline, and then later on, following some additional processing, filter once again using a different set of criteria. So, to recap, a pipeline works with a MongoDB collection. It is composed of stages, each of which does a different data processing task on its input and produces documents as output to be passed to the next stage. Finally, at the end of the pipeline, output is produced that we can then use within our application. In many cases, it's necessary to include the same type of stage multiple times within an individual pipeline.
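The stream-of-documents idea can be modelled in plain JavaScript with generators. This is only a conceptual sketch, not how the server is implemented; match, project, and runPipeline are hypothetical helpers, and the sample documents are made up:

```javascript
// Each "stage" is a function that takes an iterable of documents and
// yields documents one at a time, mirroring the description above.
function match(predicate) {
  return function* (docs) {
    for (const doc of docs) if (predicate(doc)) yield doc;
  };
}

function project(fields) {
  return function* (docs) {
    for (const doc of docs) {
      const shaped = {};
      for (const field of fields) if (field in doc) shaped[field] = doc[field];
      yield shaped;
    }
  };
}

// Chain the stages so that each consumes the previous stage's output.
function runPipeline(docs, stages) {
  let stream = docs;
  for (const stage of stages) stream = stage(stream);
  return Array.from(stream);
}

const sampleDocs = [
  { name: "a", score: 5, big: "x".repeat(1000) },
  { name: "b", score: 9, big: "y".repeat(1000) },
  { name: "c", score: 7, big: "z".repeat(1000) },
];

const streamed = runPipeline(sampleDocs, [
  match((d) => d.score > 4),   // initial filter
  project(["name", "score"]),  // drop the large field
  match((d) => d.score < 9),   // same stage type again, different criteria
]);
console.log(streamed); // [ { name: 'a', score: 5 }, { name: 'c', score: 7 } ]
```

The same stage type (match) appears twice, as in the recap above.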

MongoDB $group operation - memory usage optimisation

I need to perform a $group operation over my entire collection. This group stage is reaching the limit of 100MB RAM usage.
The $group stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $group will produce an error. However, to allow for the handling of large datasets, set the allowDiskUse option to true to enable $group operations to write to temporary files.
I'm not limited by RAM, but I couldn't find how to increase this memory usage limit. Does anyone know how to configure this restriction?
Setting allowDiskUse to true would solve the problem, but I assume the whole operation would be much slower, and I'd like to find a better solution.
{
$group: {
_id: {
producer: "$producer",
dataset:"$dataset",
featureOfInterest:"$_id.featureOfInterest",
observedProperty:"$_id.observedProperty"
},
documentId: {$push:"$documentId"}
}
}
This $group operation is performed over entire complex objects (producer and dataset). I understand that this operation is expensive since "It requires to scan the entire result set before yielding, and MongoDB will have to at least store a pointer or an index of each element in the groups." I'd rather $group on the unique ID fields of both of these objects.
How could I $group the objects by unique ID and $project the whole objects afterwards?
I'd like to obtain the same result as the group operation above using the group operation below at the beginning of my aggregation pipeline:
{
$group: {
_id: {
producer: "$producer.producerId",
dataset:"$dataset.datasetId",
featureOfInterest:"$_id.featureOfInterest",
observedProperty:"$_id.observedProperty"
},
documentId: {$push:"$documentId"}
}
}
allowDiskUse
There is no option in MongoDB to raise the aggregation memory limit above 100 MB, so for a heavy pipeline you have to set the allowDiskUse flag to true.
However
You may be interested in reading about the MongoDB In-Memory Storage Engine.
Example of starting MongoDB with the in-memory storage engine from the command line:
mongod --storageEngine inMemory --dbpath <path> --inMemorySizeGB <newSize>
More information in the MongoDB docs:
https://docs.mongodb.com/manual/core/inmemory/
Regarding the second question - I didn't get it. Please post example documents.
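One possible reading of the second question is: group on the ID fields, carry one copy of each full object along with the $first accumulator, then rebuild the original _id shape. The sketch below follows that reading. It assumes that producerId and datasetId uniquely identify their objects, which the question does not confirm, and whether it lowers memory use in practice would need to be measured. The names groupByIdsStage and reshapeStage are hypothetical.

```javascript
// Group on the small ID values instead of the whole embedded objects.
const groupByIdsStage = {
  $group: {
    _id: {
      producer: "$producer.producerId",
      dataset: "$dataset.datasetId",
      featureOfInterest: "$_id.featureOfInterest",
      observedProperty: "$_id.observedProperty",
    },
    // Keep one copy of each full object per group
    // (assumption: equal IDs imply equal objects).
    producer: { $first: "$producer" },
    dataset: { $first: "$dataset" },
    documentId: { $push: "$documentId" },
  },
};

// Rebuild the _id shape produced by the original $group stage.
const reshapeStage = {
  $project: {
    _id: {
      producer: "$producer",
      dataset: "$dataset",
      featureOfInterest: "$_id.featureOfInterest",
      observedProperty: "$_id.observedProperty",
    },
    documentId: 1,
  },
};

const groupByIdPipeline = [groupByIdsStage, reshapeStage];
console.log(JSON.stringify(groupByIdPipeline, null, 2));
```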

Should I use the "allowDiskUse" option in a production environment?

Should I use the allowDiskUse option when the returned documents exceed the 16MB limit in aggregation?
Or should I alter the database structure or code logic to avoid the limit?
What are the advantages and disadvantages of allowDiskUse?
Thanks for your help.
Here is the official documentation I have seen:
Result Size Restrictions
Changed in version 2.6.
Starting in MongoDB 2.6, the aggregate command can return a cursor or store the results in a collection. When returning a cursor or storing the results in a collection, each document in the result set is subject to the BSON Document Size limit, currently 16 megabytes; if any single document exceeds the BSON Document Size limit, the command will produce an error. The limit only applies to the returned documents; during the pipeline processing, the documents may exceed this size.
Memory Restrictions
Changed in version 2.6.
Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/
allowDiskUse is unrelated to the 16MB result size limit. That setting controls whether pipeline stages such as $sort or $group can use some temporary disk space if they need more than 100MB of memory. In theory, for an arbitrary pipeline this could be a very large amount of disk space. Personally it's never been a problem for me, but that will depend on your data.
If your result is going to be more than 16MB then you need to use the $out pipeline stage to output the data to a collection or use a pipeline API that returns a cursor to results instead of returning all the data inline (for some drivers this is a separate method, for others it is a flag passed to the same method).
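As a small sketch of the $out variant: withOutputCollection is a hypothetical helper that appends the stage, and the collection name "tagCounts" and the sample stages are made-up placeholders.

```javascript
// Hypothetical helper: returns a new pipeline that writes its results to a
// collection instead of returning them inline.
function withOutputCollection(pipeline, targetCollection) {
  // $out has to be the last stage of the pipeline.
  return [...pipeline, { $out: targetCollection }];
}

const reportPipeline = withOutputCollection(
  [
    { $match: { status: "active" } },
    { $group: { _id: "$category", count: { $sum: 1 } } },
  ],
  "tagCounts"
);
console.log(reportPipeline[reportPipeline.length - 1]); // { '$out': 'tagCounts' }
```

The stored results can then be read back from the target collection with ordinary queries.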

Aggregation framework on full table scan

I know that the aggregation framework is suitable if there is an initial $match stage to limit the collection to be aggregated. However, there may be times when the filtered collection is still large, say around 2 million documents, and the aggregation involves $group. Is the aggregation framework fit to work on such a collection, given a requirement to output results in at most 5 seconds? Currently I work on a single node. Would performing the aggregation on a sharded cluster significantly improve performance?
As far as I know, the only limitations are that the result of the aggregation can't exceed 16MB, since what it returns is a document and that's the size limit for a document in MongoDB, and that you can't use more than 10% of the total memory of the machine. For that reason $match phases are usually used to reduce the set you work with, or a $project phase to reduce the data per document. (These limits describe MongoDB versions before 2.6; from 2.6 on, results can be returned as a cursor and each stage is limited to 100MB of RAM unless allowDiskUse is set.)
Be aware that in a sharded environment, after $group or $sort phases the aggregation is brought back to the mongos before being sent to the next phase of the pipeline. The mongos could potentially be running on the same machine as your application and could hurt your application's performance if not handled correctly.
