MongoDB save pre-defined aggregation pipeline in server itself, and pass some sort of input variables - mongodb

I have a Python application (using pymongo) to run an aggregation pipeline on MongoDB server. The aggregation pipeline is quite large with around 18 stages and the final stage is a $merge -- so it doesn't return anything back to the Python client. Note that even though the pipeline is very large, it only accesses a few documents and it executes really fast.
From the Python client, I am calling this pipeline around a million times using multi-threading (with 32 threads). For each call, the only difference in the aggregation pipeline is in the first $match stage where the values of the fields change. In the sample code below, only parameters that change in each call in var-list1, var2, and var3.
pipeline = [
{
'$match': {
'week': {'$in': <var-list1>},
'eventCode': <var2>,
'customer': <var3>
}
}, {
..... some long list of aggregation stages which is fixed...
}, {
'$merge': {
'into': 'my_out_collection',
'on': ['week', 'eventCode', 'customer'],
'whenMatched': 'replace',
'whenNotMatched': 'insert'
}
}
]
My MongoDB server has the capacity to accept more parallel threads (based on disk and CPU usage metrics). However, I am not able to use this as my client's network-out traffic is hitting the ceiling. I suspect this is because the aggregation pipeline I am using is very big.
Does MongoDB support any ways of saving an aggregation pipeline in the server, with some placeholder variables, that can be executed by just passing in the variables?
MongoDB server version = 4.4
Pymongo = 3.10.1

Related

MongoDB - How can I analyze Aggregation performance? [duplicate]

Is there an explain function for the Aggregation framework in MongoDB? I can't see it in the documentation.
If not is there some other way to check, how a query performs within the aggregation framework?
I know with find you just do
db.collection.find().explain()
But with the aggregation framework I get an error
db.collection.aggregate(
{ $project : { "Tags._id" : 1 }},
{ $unwind : "$Tags" },
{ $match: {$or: [{"Tags._id":"tag1"},{"Tags._id":"tag2"}]}},
{
$group:
{
_id : { id: "$_id"},
"count": { $sum:1 }
}
},
{ $sort: {"count":-1}}
).explain()
Starting with MongoDB version 3.0, simply changing the order from
collection.aggregate(...).explain()
to
collection.explain().aggregate(...)
will give you the desired results (documentation here).
For older versions >= 2.6, you will need to use the explain option for aggregation pipeline operations
explain:true
db.collection.aggregate([
{ $project : { "Tags._id" : 1 }},
{ $unwind : "$Tags" },
{ $match: {$or: [{"Tags._id":"tag1"},{"Tags._id":"tag2"}]}},
{ $group: {
_id : "$_id",
count: { $sum:1 }
}},
{$sort: {"count":-1}}
],
{
explain:true
}
)
An important consideration with the Aggregation Framework is that an index can only be used to fetch the initial data for a pipeline (e.g. usage of $match, $sort, $geonear at the beginning of a pipeline) as well as subsequent $lookup and $graphLookup stages. Once data has been fetched into the aggregation pipeline for processing (e.g. passing through stages like $project, $unwind, and $group) further manipulation will be in-memory (possibly using temporary files if the allowDiskUse option is set).
Optimizing pipelines
In general, you can optimize aggregation pipelines by:
Starting a pipeline with a $match stage to restrict processing to relevant documents.
Ensuring the initial $match / $sort stages are supported by an efficient index.
Filtering data early using $match, $limit , and $skip .
Minimizing unnecessary stages and document manipulation (perhaps reconsidering your schema if complicated aggregation gymnastics are required).
Taking advantage of newer aggregation operators if you have upgraded your MongoDB server. For example, MongoDB 3.4 added many new aggregation stages and expressions including support for working with arrays, strings, and facets.
There are also a number of Aggregation Pipeline Optimizations that automatically happen depending on your MongoDB server version. For example, adjacent stages may be coalesced and/or reordered to improve execution without affecting the output results.
Limitations
As at MongoDB 3.4, the Aggregation Framework explain option provides information on how a pipeline is processed but does not support the same level of detail as the executionStats mode for a find() query. If you are focused on optimizing initial query execution you will likely find it beneficial to review the equivalent find().explain() query with executionStats or allPlansExecution verbosity.
There are a few relevant feature requests to watch/upvote in the MongoDB issue tracker regarding more detailed execution stats to help optimize/profile aggregation pipelines:
SERVER-19758: Add "executionStats" and "allPlansExecution" explain modes to aggregation explain
SERVER-21784: Track execution stats for each aggregation pipeline stage and expose via explain
SERVER-22622: Improve $lookup explain to indicate query plan on the "from" collection
Starting with version 2.6.x mongodb allows users to do explain with aggregation framework.
All you need to do is to add explain : true
db.records.aggregate(
[ ...your pipeline...],
{ explain: true }
)
Thanks to Rafa, I know that it was possible to do even in 2.4, but only through runCommand(). But now you can use aggregate as well.
The aggregation framework is a set of analytics tools within MongoDB that allows us to run various types of reports or analysis on documents in one or more collections. Based on the idea of a pipeline. We take input from a MongoDB collection and pass the documents from that collection through one or more stages, each of which performs a different operation on it's inputs. Each stage takes as input whatever the stage before it produced as output. And the inputs and outputs for all stages are a stream of documents. Each stage has a specific job that it does. It's expecting a specific form of document and produces a specific output, which is itself a stream of documents. At the end of the pipeline, we get access to the output.
An individual stage is a data processing unit. Each stage takes as input a stream of documents one at a time, processes each document one at a time and produces the output stream of documents. Again, one at a time. Each stage provide a set of knobs or tunables that we can control to parameterize the stage to perform whatever task we're interested in doing. So a stage performs a generic task - a general purpose task of some kind and parameterize the stage for the particular set of documents that we're working with. And exactly what we would like that stage to do with those documents. These tunables typically take the form of operators that we can supply that will modify fields, perform arithmetic operations, reshape documents or do some sort of accumulation task as well as a veriety of other things. Often times, it the case that we'll want to include the same type of stage multiple times within a single pipeline.
e.g. We may wish to perform an initial filter so that we don't have to pass the entire collection into our pipeline. But, then later on, following some additional processing, want to filter once again using a different set of criteria. So, to recap, pipeline works with a MongoDB collection. They're composed of stages, each of which does a different data processing task on it's input and produces documents as output to be passed to the next stage. And finally at the end of the pipeline output is produced that we can then do something within our application. In many cases, it's necessary to include the same type of stage, multiple times within an individual pipeline.

MongoError: document constructed by $facet is 104860008 bytes, which exceeds the limit of 104857600 bytes

I am getting memory size limit error while running multiple sub-pipelines in $facet. Can someone help me on this issue.
Scenario: I have a crown job which runs once a day. I want to execute multiple pipelines using $facet against a collection with millions of documents whenever job is triggered.
[
{
$facet: {
query1: [pipeline_1],
query2: [pipeline_2],
query3: [pipeline_3]
...
query_n: [pipeline_n]
},
},
{
$merge:{ into: some_collection}
}
]
I tried db.collection.aggregate([], {allowDiskUse: true});, But still getting same error.
What can be the work around on this. Please help.
In most cases using allowDiskUse should eliminate this error this is why i'm suspecting you're using the wrong syntax for it. Depending on your Mongo version there are some hard limitations on certain operators like $graphLookup, these operators will always have a memory error regardless of the allowDiskUse flag usage.
Assuming you don't use any of these operators allowDiskUse will work, here is a syntax example for the nodejs driver:
db.collection.aggregate([], {allowDiskUse: true});
If you are $graphLookup or one of the other limited operators then there's not much you can do. I would start by splitting these $facet stages into separate pipelines. If the problem still persists you'll need to either optimize the aggregation or find a different approach.

can we write mongodb crud queries and aggregate query together?

In MongoDB can we execute below written query?
db.dbaname.find(userName:"abc").aggregate([])
else is there any other way we can execute CRUD and aggregate query together.
Short answer - No you can't do this : .find(userName:"abc").aggregate([])
aggregation-pipeline is heavily used for reads which is mostly similar to .find() but capable of executing complex queries with help of it's multiple stages & many aggregation-operators. there are only two stages in aggregation $out & $merge that can perform writes to database - these stages are not that much used compared to other stages & needs to be used only when needed & as they need to be last stages in aggregation pipeline, then all the previous stages are to be tested very well. So when it comes to CRUD eliminating CUD you'll benefit over R - Reads.
Same .find(userName:"abc") can be written as :
.aggregate( [ { $match : { userName:"abc"} } ] ) // Using `$match` stage

Mongodb Aggregation Map reduce

I have started exploring mongodb couple of weeks back. I have a scenario here. I have a collection which has 3 million records.
I would want to perform aggregation on the aggreation based on two keys (also need to use match condition). I used aggregation framework for the same. I came to know that aggregation would fail if the processing document size (array) exceeds 16 MB.
I faced the same issue when i tried. I am trying to use map reduce now. I would need the guidance on implementing the same. How can I overcome the 16 MB size limit by using map reduce?
Also I came to know that I can do it by splitting the collection into multiple collections and do the aggregation on the same. Would be great if anyone can point me in right direction?
Even without code there are basic answers to your questions.
The limitation on the BSON document 16MB output size is for "inline" responses. That means a response from your operations that does not write the individual "documents" from your response to a collection.
So with mapReduce a statement much like this:
db.collection.mapReduce(
mapper,
reducer,
{ "out": { "inline": 1 } }
)
Has the problem that the "array" in the response needs to be under 16MB. But if you change this to output to a collection:
db.collection.mapReduce(
mapper,
reducer,
{ "out": { "replace": "newcollection" } }
)
Then you no longer have this limitation.
The same applies to the aggregate method from versions 2.6 and upwards using the $out pipeline stage:
db.collection.aggregate([
// lots of pipeline
{ "$out": "newcollection }
])
This overcomes the limtation by the same means by outputing to a collection.
Actually with the aggregate statement, again from version 2.6 and upwards this returns a cursor, just like the .find() method, and is also not subject to this limitation.

Mongodb Explain for Aggregation framework

Is there an explain function for the Aggregation framework in MongoDB? I can't see it in the documentation.
If not is there some other way to check, how a query performs within the aggregation framework?
I know with find you just do
db.collection.find().explain()
But with the aggregation framework I get an error
db.collection.aggregate(
{ $project : { "Tags._id" : 1 }},
{ $unwind : "$Tags" },
{ $match: {$or: [{"Tags._id":"tag1"},{"Tags._id":"tag2"}]}},
{
$group:
{
_id : { id: "$_id"},
"count": { $sum:1 }
}
},
{ $sort: {"count":-1}}
).explain()
Starting with MongoDB version 3.0, simply changing the order from
collection.aggregate(...).explain()
to
collection.explain().aggregate(...)
will give you the desired results (documentation here).
For older versions >= 2.6, you will need to use the explain option for aggregation pipeline operations
explain:true
db.collection.aggregate([
{ $project : { "Tags._id" : 1 }},
{ $unwind : "$Tags" },
{ $match: {$or: [{"Tags._id":"tag1"},{"Tags._id":"tag2"}]}},
{ $group: {
_id : "$_id",
count: { $sum:1 }
}},
{$sort: {"count":-1}}
],
{
explain:true
}
)
An important consideration with the Aggregation Framework is that an index can only be used to fetch the initial data for a pipeline (e.g. usage of $match, $sort, $geonear at the beginning of a pipeline) as well as subsequent $lookup and $graphLookup stages. Once data has been fetched into the aggregation pipeline for processing (e.g. passing through stages like $project, $unwind, and $group) further manipulation will be in-memory (possibly using temporary files if the allowDiskUse option is set).
Optimizing pipelines
In general, you can optimize aggregation pipelines by:
Starting a pipeline with a $match stage to restrict processing to relevant documents.
Ensuring the initial $match / $sort stages are supported by an efficient index.
Filtering data early using $match, $limit , and $skip .
Minimizing unnecessary stages and document manipulation (perhaps reconsidering your schema if complicated aggregation gymnastics are required).
Taking advantage of newer aggregation operators if you have upgraded your MongoDB server. For example, MongoDB 3.4 added many new aggregation stages and expressions including support for working with arrays, strings, and facets.
There are also a number of Aggregation Pipeline Optimizations that automatically happen depending on your MongoDB server version. For example, adjacent stages may be coalesced and/or reordered to improve execution without affecting the output results.
Limitations
As at MongoDB 3.4, the Aggregation Framework explain option provides information on how a pipeline is processed but does not support the same level of detail as the executionStats mode for a find() query. If you are focused on optimizing initial query execution you will likely find it beneficial to review the equivalent find().explain() query with executionStats or allPlansExecution verbosity.
There are a few relevant feature requests to watch/upvote in the MongoDB issue tracker regarding more detailed execution stats to help optimize/profile aggregation pipelines:
SERVER-19758: Add "executionStats" and "allPlansExecution" explain modes to aggregation explain
SERVER-21784: Track execution stats for each aggregation pipeline stage and expose via explain
SERVER-22622: Improve $lookup explain to indicate query plan on the "from" collection
Starting with version 2.6.x mongodb allows users to do explain with aggregation framework.
All you need to do is to add explain : true
db.records.aggregate(
[ ...your pipeline...],
{ explain: true }
)
Thanks to Rafa, I know that it was possible to do even in 2.4, but only through runCommand(). But now you can use aggregate as well.
The aggregation framework is a set of analytics tools within MongoDB that allows us to run various types of reports or analysis on documents in one or more collections. Based on the idea of a pipeline. We take input from a MongoDB collection and pass the documents from that collection through one or more stages, each of which performs a different operation on it's inputs. Each stage takes as input whatever the stage before it produced as output. And the inputs and outputs for all stages are a stream of documents. Each stage has a specific job that it does. It's expecting a specific form of document and produces a specific output, which is itself a stream of documents. At the end of the pipeline, we get access to the output.
An individual stage is a data processing unit. Each stage takes as input a stream of documents one at a time, processes each document one at a time and produces the output stream of documents. Again, one at a time. Each stage provide a set of knobs or tunables that we can control to parameterize the stage to perform whatever task we're interested in doing. So a stage performs a generic task - a general purpose task of some kind and parameterize the stage for the particular set of documents that we're working with. And exactly what we would like that stage to do with those documents. These tunables typically take the form of operators that we can supply that will modify fields, perform arithmetic operations, reshape documents or do some sort of accumulation task as well as a veriety of other things. Often times, it the case that we'll want to include the same type of stage multiple times within a single pipeline.
e.g. We may wish to perform an initial filter so that we don't have to pass the entire collection into our pipeline. But, then later on, following some additional processing, want to filter once again using a different set of criteria. So, to recap, pipeline works with a MongoDB collection. They're composed of stages, each of which does a different data processing task on it's input and produces documents as output to be passed to the next stage. And finally at the end of the pipeline output is produced that we can then do something within our application. In many cases, it's necessary to include the same type of stage, multiple times within an individual pipeline.