My team and I have some questions about MongoDB aggregation pipelines.
We have a pipeline that processes a large number of documents. In this pipeline, we use the $group stage. However, this stage has a hard-coded limit of 100MB, which is not enough for the number of documents we have (https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/#memory-restrictions).
The solution we implemented combines the $skip and $limit stages at the beginning of the pipeline in order to process the documents in batches of N (where N is chosen so that a batch of N documents weighs less than 100MB).
However, this solution has a performance issue, as the $skip stage gets slower as the number of documents to skip increases (https://arpitbhayani.me/techie/mongodb-cursor-skip-is-slow.html).
We then found this article: https://arpitbhayani.me/techie/fast-and-efficient-pagination-in-mongodb.html. The second approach described there seems to be a good solution for our use case. We would like to implement this approach. However, for this to work, the documents have to be sorted by the internal Mongo ID (_id) at the start of the pipeline. We thought about using the $sort stage beforehand, but this stage also has a hard-coded 100MB limit.
What method could we use to ensure that the documents are sorted by _id at the start of the pipeline?
Is there perhaps another way to process the documents in batches weighing less than 100MB?
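For illustration, the range-based approach from the linked article could be sketched in Python roughly as follows. The function name `build_batch_pipeline` and its parameters are hypothetical. The sketch relies on the documented behaviour that a $sort placed at the start of a pipeline (or preceded only by $match) can use an index, here the default _id index, instead of sorting in memory.

```python
def build_batch_pipeline(last_id, batch_size, downstream_stages):
    """Build a pipeline for the next batch of documents after `last_id`.

    `last_id` is the largest _id seen in the previous batch, or None for
    the first batch. `downstream_stages` are the original stages
    (for example the $group) that should run on each batch.
    """
    head = []
    if last_id is not None:
        # Resume strictly after the last _id of the previous batch,
        # replacing the increasingly expensive $skip.
        head.append({"$match": {"_id": {"$gt": last_id}}})
    # Sorting on _id right after $match can be served by the _id index.
    head.append({"$sort": {"_id": 1}})
    head.append({"$limit": batch_size})
    return head + list(downstream_stages)
```

A caller would run `collection.aggregate(build_batch_pipeline(last_id, N, stages))` once per batch. Because $group discards the original _id values, the last _id of each batch has to be recorded separately (for example with a projection-only query); that bookkeeping is left out of this sketch.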
Is it possible to retrieve the documents between stages in a MongoDB aggregation pipeline?
Imagine that I have an aggregation pipeline running in pymongo with 10 stages and I want to be able to retrieve some info that is available after stage 8 but will not be available after the last stage. Is it possible?
The idea is quite similar to this question. Looking at the answers, I found $facet, but it wasn't clear to me whether, if stage 1 is the same in all output fields, it is executed only once and performs as expected. Also, as I saw in the docs, $facet does not support indexes, which is a problem in my case.
To retrieve values of particular fields which are changed in subsequent stages, use $set to duplicate those values into new fields.
To retrieve the result set exactly as it exists after the 8th stage, send the first 8 stages as their own pipeline.
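Both suggestions can be sketched in Python as small helpers that only build pipeline lists. The helper names and the field names in the usage note are hypothetical.

```python
def keep_copy(field, copy_name):
    """Build a $set stage that duplicates `field` into `copy_name`.

    Later stages may then overwrite `field` while `copy_name` keeps the
    earlier value, provided those stages pass `copy_name` through.
    """
    return {"$set": {copy_name: "$" + field}}


def prefix_pipeline(pipeline, n_stages):
    """Return the first `n_stages` stages as a standalone pipeline."""
    return list(pipeline[:n_stages])
```

For example, `keep_copy("price", "price_after_stage_8")` could be inserted right after stage 8, and `collection.aggregate(prefix_pipeline(pipeline, 8))` would return the result set as it exists after the 8th stage. Note that a later $group or $project drops the copied field unless it is carried along explicitly.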
I am using Mongo 4.2 (stuck with this) and have a collection, say "product_data", whose documents have the following schema:
_id:"2lgy_itmep53vy"
uIdHash:"2lgys2yxouhug5xj3ms45mluxw5hsweu"
userTS:1494055844000
Case 1: With this, I have the following indexes for the collection:
1. _id: Regular - Unique
2. uIdHash: Hashed
I tried to execute
db.product_data.find( {"uIdHash":"2lgys2yxouhug5xj3ms45mluxw5hsweu"}).sort({"userTS":-1}).explain()
and these are the stages in the result:
Of course, I realized that it would make sense to add a compound index to avoid MongoDB's in-memory SORT stage.
Case 2: I then added another index alongside the existing ones:
3. {uIdHash:1 , userTS:-1}: Regular and Compound
As expected, this execution was able to optimize away the sorting stage:
All good so far. Now I am looking to build pagination on top of this query, so I need to limit the data queried. Hence the query becomes
db.product_data.find( {"uIdHash":"2lgys2yxouhug5xj3ms45mluxw5hsweu"}).sort({"userTS":-1}).limit(10).explain()
The results for each case are now as follows:
Case 1 Limit Result:
The in-memory sorting does less work (36 instead of 50) and returns the expected number of documents. Fair enough, a good underlying optimization in the stage.
Case 2 Limit Result:
Surprisingly, with the index in use and the data queried, there is an additional Limit stage added to processing!
The doubts I now have are as follows:
Why do we need an additional stage for LIMIT when we already have 10 documents returned from the FETCH stage?
What would be the impact of this additional stage? Given that I need pagination, should I stick with the Case 1 indexes and not use the last compound index?
Limit stage tells you that the database is limiting the result set. This means subsequent stages will work with less data.
Your question of "why do we need an additional stage for limit" doesn't make sense. You send the query to the database, and you do not use (or need) any stages. The database decides how to fulfill the query; if you asked it to limit the result set, it does that and communicates that it has done so by telling you there is a limit stage in query processing.
The query executor is able to perform some optimizations. One of these is that when there is a limit and no blocking stage (like a sort), when the limit is reached, all of the upstream stages can stop early.
This means that if there were no limit stage, the ixscan and fetch stages would have continued through all 24 matching documents.
There is no discrete limit stage with the non-index sort because it is combined with the sort stage.
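The early-stop behaviour can be illustrated with a toy model built from Python generators. This is only an analogy under simplifying assumptions, not MongoDB's actual query executor; the function names and counters are hypothetical.

```python
from itertools import islice


def ixscan(index_keys, counter):
    """Yield index keys one at a time, counting each key examined."""
    for key in index_keys:
        counter["keys_examined"] += 1
        yield key


def fetch(keys, documents, counter):
    """Look up the document for each key, counting each document examined."""
    for key in keys:
        counter["docs_examined"] += 1
        yield documents[key]


def run(index_keys, documents, limit=None):
    """Chain ixscan -> fetch -> optional limit and report the work done."""
    counter = {"keys_examined": 0, "docs_examined": 0}
    stream = fetch(ixscan(index_keys, counter), documents, counter)
    if limit is not None:
        # The limit stops pulling from upstream once `limit` items arrive.
        stream = islice(stream, limit)
    results = list(stream)
    return results, counter
```

With 24 matching keys, `run(keys, docs, limit=10)` pulls only 10 keys and 10 documents through the upstream generators, whereas `run(keys, docs)` walks all 24, which mirrors the behaviour described above.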
When I use mongoose for MongoDB with an Atlas Mongo server, it gives the error that a mongo pipeline length greater than 50 is not supported.
I have searched the whole web for this and can't find a solution.
Is there a workaround for this?
This happens because you have used too many pipeline stages in one aggregation.
If you want to use more than 50 pipeline stages, you can use $facet.
The following workflow may be of help:
Separate your pipelines into several chunks (pipeline stages in each chunk must not exceed 50).
After running the facet, you can use $unwind to separate the result into separate documents like normal (you may need to restructure the data to restore the former format using $project).
You can run another facet after that if you want.
In this case, if you plan to run 150 stages in one aggregation, separate them into 4-5 chunks and make sure that each chunk uses the same scope to avoid "missing" or "undefined variable" errors. You can also use $unwind to restore the document format before running the next chunk.
Make sure that no output document exceeds 16MB; that's why I recommend using $unwind after $facet (it can exceed 16MB during the stages). The reason is that $facet outputs everything into 1 document with an array of documents inside (if you want to show all documents), so $unwind will separate those "inner" documents into separate documents.
One more note: try to limit the field names carried through the pipeline to keep BufBuilder from exceeding its maximum size, which is 64MB. Prefer $project over $addFields, since $addFields increases the buffer between stages.
Another one is that you should not use pipeline stages that exceed 100MB of RAM if you are using MongoDB Atlas < M10.
It would help if you provided pseudocode for the problem you are having, but I think this would fix it.
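The chunking workflow described above could be sketched in Python like this. The helper names and the `docs` field name are hypothetical, and $replaceRoot is used here in place of the $project restructuring mentioned above. Whether a given Atlas tier counts stages nested inside $facet toward its limit is not verified here.

```python
def chunk_stages(stages, max_len=50):
    """Split a flat list of stages into chunks of at most `max_len`."""
    return [stages[i:i + max_len] for i in range(0, len(stages), max_len)]


def wrap_in_facets(stages, max_len=50, field="docs"):
    """Wrap each chunk in $facet, then restore one document per result."""
    top_level = []
    for chunk in chunk_stages(stages, max_len):
        # Run the chunk as a $facet sub-pipeline; output is one document
        # holding an array under `field`.
        top_level.append({"$facet": {field: chunk}})
        # Split the array back into separate documents...
        top_level.append({"$unwind": "$" + field})
        # ...and promote each inner document to the top level again.
        top_level.append({"$replaceRoot": {"newRoot": "$" + field}})
    return top_level
```

For 150 stages this yields 9 top-level stages. Keep in mind that $facet sub-pipelines cannot contain stages such as $out, $merge or a nested $facet, and that a leading $match placed inside $facet cannot use an index, so it may be worth leaving it outside the first $facet.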
Whenever I run a count query on MongoDB with explain I can see two different stages COUNT_SCAN and IXSCAN. I want to know the difference between them according to performance and how can I improve the query.
The field is indexed.
The following query:
db.collection.explain(true).count({field: 1})
uses COUNT_SCAN, and a query like:
db.collection.explain(true).count({field: {"$in": [1, 2]}})
uses IXSCAN.
In short: COUNT_SCAN is the most efficient way to get a count by reading the value from an index, but it can only be performed in certain situations. Otherwise, an IXSCAN is performed, followed by some filtering of documents and a count in memory.
When reading from a secondary, the read concern "available" is used. This concern level doesn't consider orphan documents in sharded clusters, and so no SHARDING_FILTER stage will be performed. This is when you see COUNT_SCAN.
However, if we use read concern "local", we need to fetch the documents in order to perform the SHARDING_FILTER stage. In this case, there are multiple stages to fulfill the query: IXSCAN, then FETCH, then SHARDING_FILTER.
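To compare plans like these programmatically, the stage names can be read off an explain document. The sketch below assumes the plan tree shape used by explain output in this server generation, where each node has a `stage` name and children under `inputStage` or `inputStages`; the function name is hypothetical.

```python
def collect_stages(plan):
    """Return stage names from the root of a plan tree down to its leaves."""
    names = [plan["stage"]]
    children = []
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    # Some stages (for example OR) list several children.
    children.extend(plan.get("inputStages", []))
    for child in children:
        names.extend(collect_stages(child))
    return names
```

On an unsharded deployment the tree to pass in is typically found under `queryPlanner.winningPlan` in the explain result; on a sharded cluster each shard reports its own winning plan. A plan containing FETCH and SHARDING_FILTER above IXSCAN indicates the slower path described above.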
I know that the aggregation framework is suitable if there is an initial $match stage to limit the collection to be aggregated. However, there may be times when the filtered collection is still large, say around 2 million documents, and the aggregation involves $group. Is the aggregation framework fit to work on such a collection, given a requirement to output results in at most 5 seconds? Currently I work on a single node. Would performing the aggregation on a sharded set bring a significant improvement in performance?
As far as I know, the only limitations are that the result of the aggregation can't surpass the limit of 16MB, since what it returns is a document and that's the size limit for a document in MongoDB. Also, you can't use more than 10% of the machine's total memory, which is why $match phases are usually used to reduce the set you work with, or a $project phase to reduce the data per document.
Be aware that in a sharded environment, after $group or $sort phases, the aggregation is brought back to the MongoS before it is sent to the next phase of the pipeline. The MongoS could potentially be running on the same machine as your application and could hurt your application's performance if not handled correctly.