MongoDB Explain for the Aggregation Framework

Is there an explain function for the Aggregation Framework in MongoDB? I can't see it in the documentation.
If not, is there some other way to check how a query performs within the aggregation framework?
I know with find you just do
db.collection.find().explain()
But with the aggregation framework I get an error
db.collection.aggregate(
    { $project: { "Tags._id": 1 } },
    { $unwind: "$Tags" },
    { $match: { $or: [ { "Tags._id": "tag1" }, { "Tags._id": "tag2" } ] } },
    { $group: {
        _id: { id: "$_id" },
        count: { $sum: 1 }
    }},
    { $sort: { count: -1 } }
).explain()

Starting with MongoDB version 3.0, simply changing the order from
collection.aggregate(...).explain()
to
collection.explain().aggregate(...)
will give you the desired results (documentation here).
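For example, applied to the pipeline from the question, the shell helper call would look roughly like this (a sketch; the collection name is a placeholder):
db.collection.explain().aggregate([
    { $project: { "Tags._id": 1 } },
    { $unwind: "$Tags" },
    { $match: { $or: [ { "Tags._id": "tag1" }, { "Tags._id": "tag2" } ] } },
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])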
For older versions (>= 2.6), you will need to use the explain option for aggregation pipeline operations:
explain: true
db.collection.aggregate([
    { $project: { "Tags._id": 1 } },
    { $unwind: "$Tags" },
    { $match: { $or: [ { "Tags._id": "tag1" }, { "Tags._id": "tag2" } ] } },
    { $group: {
        _id: "$_id",
        count: { $sum: 1 }
    }},
    { $sort: { count: -1 } }
],
{
    explain: true
})
An important consideration with the Aggregation Framework is that an index can only be used to fetch the initial data for a pipeline (e.g. usage of $match, $sort, $geoNear at the beginning of a pipeline) as well as subsequent $lookup and $graphLookup stages. Once data has been fetched into the aggregation pipeline for processing (e.g. passing through stages like $project, $unwind, and $group) further manipulation will be in-memory (possibly using temporary files if the allowDiskUse option is set).
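As a rough sketch of that point (the index and the reordered pipeline below are assumptions, not part of the original question): with an index on Tags._id, placing the $match first lets the index select the input documents, while the remaining stages still run in memory.
// hypothetical index to support the initial $match
db.collection.createIndex({ "Tags._id": 1 })

// the $match at the front can use the index to fetch the input documents;
// $unwind, $group and the $sort on the computed count run in memory afterwards
db.collection.aggregate([
    { $match: { "Tags._id": { $in: ["tag1", "tag2"] } } },
    { $unwind: "$Tags" },
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])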
Optimizing pipelines
In general, you can optimize aggregation pipelines by:
Starting a pipeline with a $match stage to restrict processing to relevant documents.
Ensuring the initial $match / $sort stages are supported by an efficient index.
Filtering data early using $match, $limit, and $skip.
Minimizing unnecessary stages and document manipulation (perhaps reconsidering your schema if complicated aggregation gymnastics are required).
Taking advantage of newer aggregation operators if you have upgraded your MongoDB server. For example, MongoDB 3.4 added many new aggregation stages and expressions including support for working with arrays, strings, and facets.
There are also a number of Aggregation Pipeline Optimizations that automatically happen depending on your MongoDB server version. For example, adjacent stages may be coalesced and/or reordered to improve execution without affecting the output results.
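For instance, a $match written after a $project that does not compute the filtered field may be moved ahead of the projection by the optimizer; the explain output shows the order that actually runs. A minimal sketch of such a pipeline:
// as written: $project first, then $match on a field the projection merely passes through
db.collection.aggregate([
    { $project: { "Tags._id": 1 } },
    { $match: { "Tags._id": "tag1" } }
])
// the optimizer may execute the $match before the $project (and use an index on Tags._id)
// without changing the results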
Limitations
As of MongoDB 3.4, the Aggregation Framework explain option provides information on how a pipeline is processed but does not support the same level of detail as the executionStats mode for a find() query. If you are focused on optimizing initial query execution you will likely find it beneficial to review the equivalent find().explain() query with executionStats or allPlansExecution verbosity.
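For example, the initial document selection from the question's pipeline could be profiled with a plain find(), which does support the richer verbosity modes (a sketch reusing the question's filter):
db.collection.find(
    { $or: [ { "Tags._id": "tag1" }, { "Tags._id": "tag2" } ] }
).explain("executionStats")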
There are a few relevant feature requests to watch/upvote in the MongoDB issue tracker regarding more detailed execution stats to help optimize/profile aggregation pipelines:
SERVER-19758: Add "executionStats" and "allPlansExecution" explain modes to aggregation explain
SERVER-21784: Track execution stats for each aggregation pipeline stage and expose via explain
SERVER-22622: Improve $lookup explain to indicate query plan on the "from" collection

Starting with version 2.6.x, MongoDB allows users to do explain with the aggregation framework.
All you need to do is add explain: true
db.records.aggregate(
[ ...your pipeline...],
{ explain: true }
)
Thanks to Rafa, I know that it was possible to do this even in 2.4, but only through runCommand(). But now you can use aggregate as well.
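For reference, on current servers the runCommand() route goes through the explain command; a rough sketch (collection name, pipeline, and verbosity are placeholders):
db.runCommand({
    explain: {
        aggregate: "records",
        pipeline: [ /* ...your pipeline... */ ],
        cursor: {}
    },
    verbosity: "queryPlanner"
})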

The aggregation framework is a set of analytics tools within MongoDB that allows us to run various types of reports or analyses on documents in one or more collections. It is based on the idea of a pipeline. We take input from a MongoDB collection and pass the documents from that collection through one or more stages, each of which performs a different operation on its inputs. Each stage takes as input whatever the stage before it produced as output, and the inputs and outputs for all stages are a stream of documents. Each stage has a specific job that it does: it expects a specific form of document and produces a specific output, which is itself a stream of documents. At the end of the pipeline, we get access to the output.
An individual stage is a data processing unit. Each stage takes as input a stream of documents, processes each document one at a time, and produces an output stream of documents, again one at a time. Each stage provides a set of knobs, or tunables, that we can control to parameterize the stage to perform whatever task we're interested in doing. So a stage performs a generic, general-purpose task of some kind, and we parameterize the stage for the particular set of documents that we're working with and for exactly what we would like that stage to do with those documents. These tunables typically take the form of operators that we can supply to modify fields, perform arithmetic operations, reshape documents, or do some sort of accumulation task, as well as a variety of other things. Oftentimes, it's the case that we'll want to include the same type of stage multiple times within a single pipeline.
e.g. We may wish to perform an initial filter so that we don't have to pass the entire collection into our pipeline, but then later on, following some additional processing, filter once again using a different set of criteria. So, to recap: pipelines work with a MongoDB collection. They're composed of stages, each of which does a different data processing task on its input and produces documents as output to be passed to the next stage. And finally, at the end of the pipeline, output is produced that we can then do something with in our application. In many cases, it's necessary to include the same type of stage multiple times within an individual pipeline.
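As a small illustration of the same stage type appearing twice in one pipeline (the collection and field names here are invented for the example):
db.orders.aggregate([
    // first filter: avoid feeding the whole collection into the pipeline
    { $match: { status: "shipped" } },
    { $unwind: "$items" },
    // second filter: different criteria, applied to the unwound item documents
    { $match: { "items.qty": { $gte: 2 } } },
    { $group: { _id: "$items.sku", total: { $sum: "$items.qty" } } }
])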

Related

MongoDB save pre-defined aggregation pipeline in server itself, and pass some sort of input variables

I have a Python application (using pymongo) to run an aggregation pipeline on MongoDB server. The aggregation pipeline is quite large with around 18 stages and the final stage is a $merge -- so it doesn't return anything back to the Python client. Note that even though the pipeline is very large, it only accesses a few documents and it executes really fast.
From the Python client, I am calling this pipeline around a million times using multi-threading (with 32 threads). For each call, the only difference in the aggregation pipeline is in the first $match stage, where the values of the fields change. In the sample code below, the only parameters that change in each call are var-list1, var2, and var3.
pipeline = [
    {
        '$match': {
            'week': {'$in': <var-list1>},
            'eventCode': <var2>,
            'customer': <var3>
        }
    }, {
        ..... some long list of aggregation stages which is fixed...
    }, {
        '$merge': {
            'into': 'my_out_collection',
            'on': ['week', 'eventCode', 'customer'],
            'whenMatched': 'replace',
            'whenNotMatched': 'insert'
        }
    }
]
My MongoDB server has the capacity to accept more parallel threads (based on disk and CPU usage metrics). However, I am not able to use this as my client's network-out traffic is hitting the ceiling. I suspect this is because the aggregation pipeline I am using is very big.
Does MongoDB support any ways of saving an aggregation pipeline in the server, with some placeholder variables, that can be executed by just passing in the variables?
MongoDB server version = 4.4
Pymongo = 3.10.1

MongoDB - How can I analyze Aggregation performance? [duplicate]


can we write mongodb crud queries and aggregate query together?

In MongoDB, can we execute the query written below?
db.dbaname.find(userName:"abc").aggregate([])
Otherwise, is there any other way we can execute a CRUD query and an aggregate query together?
Short answer - No you can't do this : .find(userName:"abc").aggregate([])
The aggregation pipeline is heavily used for reads; it is mostly similar to .find() but capable of executing complex queries with the help of its multiple stages & many aggregation operators. There are only two stages in aggregation, $out & $merge, that can perform writes to the database - these stages are not used nearly as much as the other stages & should be used only when needed, and since they have to be the last stage in the aggregation pipeline, all the previous stages need to be tested very well. So when it comes to CRUD, eliminating CUD, you'll benefit on R - Reads.
The same .find({ userName: "abc" }) can be written as:
.aggregate( [ { $match : { userName:"abc"} } ] ) // Using `$match` stage
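Side by side, assuming a hypothetical users collection (note that find() takes a query document in braces):
// CRUD read
db.users.find({ userName: "abc" })

// equivalent read using the aggregation pipeline
db.users.aggregate([
    { $match: { userName: "abc" } }
])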

Does MongoDB aggregation $project decrease the amount of data to be kept in memory?

I am wondering whether writing $project just after the $match statement actually decreases the amount of data to be kept in memory. As an example, if we want an array element with paging from a user document like the following:
const skip = 20;
const limit = 50;
UserModel.aggregate([
    { $match: { _id: userId } },
    { $project: { _id: 0, postList: 1 } },
    { $slice: ["$postList", skip, limit] },
    { $lookup: ...
]);
Assume that there are other lists in the user document and they are very large in size.
So, will $project help to improve the performance by not keeping the other large lists in memory?
Each aggregation stage scans the input documents from the collection (if it's the first stage) or from the previous stage. For example:
$match (filters the documents) - this will reduce the number of documents and the overall size
$project (transforms or shapes the documents) - this can reduce (or increase) the size of the documents; the number of documents remains the same
$group - reduces the number of documents and changes the size
$skip, $limit - reduce the number of documents
$sort - no change in the size or number of documents
etc.
Each stage can affect the memory or cpu or both.
In general the document size, number of documents, the indexes, and memory can affect the query performance.
The memory restrictions for aggregation are already clearly specified in the documentation (see Aggregation Pipeline Limits). If the memory usage exceeds the restrictions, the aggregation will terminate. In such cases you can specify the aggregation option { allowDiskUse: true }, and the usage of this option will affect the query performance. If your aggregation is working without any memory-related issues (like query termination due to exceeding the memory limits) then there is no direct issue with your query performance.
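A minimal sketch of that option (the pipeline itself is elided here):
db.collection.aggregate(
    [ /* ...pipeline containing a memory-heavy $group or $sort... */ ],
    { allowDiskUse: true }
)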
The $match and $sort stages use indexes, if used early in the pipeline. And this can improve performance.
Adding a stage to a pipeline means extra processing, and it can affect the overall performance, because the documents from the previous stage have to pass through this extra stage. In an aggregation pipeline the documents are passed through each stage, like in a pipe, and each stage does some data transformation. If you can avoid a stage, it can sometimes benefit the overall query performance. When the numbers are large, having an extra (unnecessary) stage is definitely a disadvantage. You have to take into consideration both the memory restrictions as well as the size and the number of documents.
A $project can be used to reduce the size of the document. But is it necessary to add this stage? It depends on the factors mentioned above, your implementation, and the application. The documentation (Projection Optimization) says:
The aggregation pipeline can determine if it requires only a subset of the fields in the documents to obtain the results. If so, the pipeline will only use those required fields, reducing the amount of data passing through the pipeline.
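Applied to the question's pipeline, that means only postList needs to flow past the $match, whether or not the explicit $project is written; spelling it out (a sketch reusing the question's variables, with the $slice expression applied inside the $project):
UserModel.aggregate([
    { $match: { _id: userId } },
    // keep only the field the rest of the pipeline needs, already sliced for paging
    { $project: { _id: 0, postList: { $slice: ["$postList", skip, limit] } } }
]);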

MongoDB Aggregation Framework Group Performance

Working with the MongoDB aggregation framework, it is clear that the $group function is the bottleneck. By using explain() on some find queries, I'm able to tailor my indexes to reduce table scans significantly; however, it seems that $group does not take into account any $sort that happens before it, even if I end up sorting by the fields it will group by.
Besides simply reducing the result set, are there any practical ways to improve the performance of the $group function? I'm almost tempted to take advantage of the sort, and just do the $group in my own application, but there must be an elegant and performant solution using the framework.
I'm noticing that as the result set from the $match increases, the $group time also increases.
My document is basically like this
{
    a: (String),
    b: (String)
}
with a pipeline that looks something like
{ $match: { a: "frank" } },
{ $sort: { b: 1 } },
{ $group: { _id: "$b" } }
It is surprising to me, because I assume by the time it gets to the group, the data is loaded into memory, and since the fields are indexed, a few thousand records shouldn't take that much time to load into memory. Is this not the case?
It just seems that the $sort has no effect on the overall performance. Is there a way to use indexes, as well as the previous stages of the pipeline, to improve the performance of the $group? Also, does $group stay within the result set from the previous stages, or does it go back to an entire collection scan? (I'm pretty sure, or at least hoping, that's not the case.)
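One way to check whether the index and the preceding $sort are actually used ahead of the $group is to run the pipeline itself through explain (a sketch; on 3.0+ shells db.collection.explain().aggregate(...) works as well):
db.collection.aggregate(
    [
        { $match: { a: "frank" } },
        { $sort: { b: 1 } },
        { $group: { _id: "$b" } }
    ],
    { explain: true }
)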