Aggregation framework on full table scan - mongodb

I know that the aggregation framework is suitable when there is an initial $match stage to limit the set of documents to be aggregated. However, the filtered set may still be large, say around 2 million documents, and the aggregation will involve $group. Is the aggregation framework fit for such a collection given a requirement to return results in at most 5 seconds? Currently I work on a single node. Would performing the aggregation on a sharded cluster bring a significant improvement in performance?

As far as I know, the only limitations are that the result of the aggregation can't exceed 16MB, since it returns a document and that is the maximum document size in MongoDB, and that you can't use more than 10% of the total memory of the machine. That is why $match stages are usually used to reduce the set you work with, and a $project stage to reduce the data per document.
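For example, a pipeline of that shape might look like this (a rough sketch; the collection and field names here are made up):
db.events.aggregate([
  // filter as early as possible so $group sees fewer documents
  { $match: { status: "active" } },
  // keep only the fields the $group actually needs
  { $project: { userId: 1, amount: 1 } },
  // group the reduced set
  { $group: { _id: "$userId", total: { $sum: "$amount" } } }
])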
Be aware that in a sharded environment, after a $group or $sort stage the results are brought back to the mongos before being sent to the next stage of the pipeline. The mongos could potentially be running on the same machine as your application and could hurt your application's performance if not handled correctly.

Related

Why do we need an additional LIMIT stage with compound index in Mongo

I am using Mongo 4.2 (stuck with this version) and have a collection, say "product_data", with documents of the following schema:
_id:"2lgy_itmep53vy"
uIdHash:"2lgys2yxouhug5xj3ms45mluxw5hsweu"
userTS:1494055844000
Case 1: With this, I have the following indexes for the collection:
_id:Regular - Unique
uIdHash: Hashed
I tried to execute
db.product_data.find( {"uIdHash":"2lgys2yxouhug5xj3ms45mluxw5hsweu"}).sort({"userTS":-1}).explain()
and these are the stages in result:
Of course, I realized that it would make sense to add a compound index to avoid the in-memory 'Sort' stage.
Case 2: Now I have added another index alongside the existing ones:
3. {uIdHash: 1, userTS: -1}: Regular and Compound
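For reference, such an index can be created with (same collection and fields as above):
db.product_data.createIndex({ uIdHash: 1, userTS: -1 })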
As expected, the execution result here was able to optimize away the sorting stage:
All good so far. Now I am looking to build pagination on top of this query, so I need to limit the data queried. Hence the query further translates to
db.product_data.find( {"uIdHash":"2lgys2yxouhug5xj3ms45mluxw5hsweu"}).sort({"userTS":-1}).limit(10).explain()
The results for each case are now as follows:
Case 1 Limit Result:
The in-memory sorting does less work (36 instead of 50) and returns the expected number of documents. Fair enough, a good underlying optimization in the stage.
Case 2 Limit Result:
Surprisingly, with the index in use and the data queried, there is an additional Limit stage added to processing!
The doubts I now have are as follows:
Why do we need an additional LIMIT stage when we already have 10 documents returned from the FETCH stage?
What would be the impact of this additional stage? Given that I need pagination, shall I stick with Case 1 indexes and not use the last compound index?
The LIMIT stage tells you that the database is limiting the result set, which means subsequent stages will work with less data.
Your question of "why do we need an additional stage for limit" doesn't make sense. You send the query to the database, and you do not use (or need) any stages. The database decides how to fulfill the query, if you asked it to limit the result set it does that and it communicates to you that it has done that, by telling you there is a limit stage in query processing.
The query executor is able to perform some optimizations. One of these is that when there is a limit and no blocking stage (like a sort), when the limit is reached, all of the upstream stages can stop early.
This means that if there were no limit stage, the ixscan and fetch stages would have continued through all 24 matching documents.
There is no discrete LIMIT stage with the non-index sort because it is combined with the SORT stage.
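One way to observe that early termination is to compare the executionStats with and without the limit (a sketch reusing the query from the question; the exact counters depend on your data):
db.product_data.find({ uIdHash: "2lgys2yxouhug5xj3ms45mluxw5hsweu" })
               .sort({ userTS: -1 })
               .limit(10)
               .explain("executionStats")
// with the compound index, executionStats.totalKeysExamined stops at roughly the
// limit (10); without .limit(10), the IXSCAN and FETCH stages walk every matching entry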

Mongodb Atlas - pipeline length greater than 50 not supported

When I use Mongoose for MongoDB with an Atlas server, it gives the error that a pipeline length greater than 50 is not supported.
I have searched the whole web for this and can't find a solution.
Is there a workaround for this?
This happens because you have used too many pipeline stages in one aggregation.
If you want to use more than 50 pipeline stages, you can use $facet.
The following workflow may be of help:
Separate your pipelines into several chunks (pipeline stages in each chunk must not exceed 50).
After running the facet, you can use $unwind to separate the result into separate documents like normal (you may need to restructure the data to restore the former format using $project).
You can run another facet after that if you want.
For example, if you plan to run 150 stages in one aggregation, separate them into 4-5 chunks. Make sure each chunk works on the same scope to avoid "missing" or "undefined variable" cases, and use $unwind to restore the document format before running the next chunk.
Make sure that each output document does not exceed 16MB; that's why I recommend using $unwind after $facet (the intermediate result can exceed 16MB during the stages). The reason is that $facet outputs everything into one document with an array of documents inside (if you want to keep all documents), so $unwind will separate those "inner" documents back into individual documents.
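A minimal sketch of that chunking pattern (the stage contents and field names are placeholders; I use $replaceRoot here to restore the document shape, which serves the same purpose as the $project mentioned above):
db.collection.aggregate([
  // chunk 1: up to 50 of your original stages wrapped in a $facet
  { $facet: { chunk1: [
      { $match: { status: "active" } },
      { $addFields: { score: { $add: ["$a", "$b"] } } }
  ] } },
  // unwind the facet output back into individual documents
  { $unwind: "$chunk1" },
  // restore the original document shape before the next chunk
  { $replaceRoot: { newRoot: "$chunk1" } },
  // chunk 2: the next batch of stages, again wrapped in a $facet
  { $facet: { chunk2: [
      { $sort: { score: -1 } },
      { $limit: 100 }
  ] } },
  { $unwind: "$chunk2" },
  { $replaceRoot: { newRoot: "$chunk2" } }
])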
One note on this is that you can try limiting the field names to prevent the BufBuilder from exceeding its maximum size, which is 64MB. Try using $project instead of $addFields, as $addFields increases the buffer between each stage.
Another note is that you should not use pipeline stages that exceed 100MB of RAM if you are on a MongoDB Atlas tier below M10.
It might help if you provided pseudo-code for the problem you are having, but I think this should fix it.

How to process documents by batch in Mongo pipeline without using $skip?

My team and I have some questions related to the Mongo aggregation pipeline.
We have a pipeline used to process a large number of documents. In this pipeline, we use the $group stage. However, this stage has a hard-coded limit of 100MB, which is not enough for the quantity of documents we have (https://docs.mongodb.com/manual/core/aggregation-pipeline-limits/#memory-restrictions).
The solution we implemented was a combination of the $skip and $limit stages at the beginning of the pipeline, in order to process the documents in batches of N (where N is chosen so that a batch of N documents weighs less than 100MB).
However, this solution has a performance issue: the $skip stage gets slower as the number of documents to skip increases (https://arpitbhayani.me/techie/mongodb-cursor-skip-is-slow.html).
We then found this article: https://arpitbhayani.me/techie/fast-and-efficient-pagination-in-mongodb.html. The second approach described there seems to be a good solution for our use case. We would like to implement this approach. However, for this to work, the documents have to be sorted by the internal Mongo ID (_id) at the start of the pipeline. We thought about using the $sort stage beforehand, but this stage also has a hard-coded 100MB limit.
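For reference, the batching we have in mind based on that second approach looks roughly like this (a sketch; the collection name, batch size, and the $group stage are placeholders):
var batchSize = 1000;   // chosen so a batch of this size stays well under 100MB
var lastId = null;      // last _id handled in the previous batch
while (true) {
  // fetch only the _id boundaries of the next batch, resuming after lastId (no $skip)
  var ids = db.products.find(lastId === null ? {} : { _id: { $gt: lastId } }, { _id: 1 })
                       .sort({ _id: 1 })
                       .limit(batchSize)
                       .toArray();
  if (ids.length === 0) break;
  var firstId = ids[0]._id;
  lastId = ids[ids.length - 1]._id;
  // run the heavy pipeline only on this batch's _id range
  db.products.aggregate([
    { $match: { _id: { $gte: firstId, $lte: lastId } } },
    { $group: { _id: "$category", count: { $sum: 1 } } }   // placeholder processing stage
  ]);
}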
What method could we use to ensure that the documents are sorted by _id at the start of the pipeline?
Is there maybe another solution to process the documents by batch weighting less than 100MB?

What is the difference between COUNT_SCAN and IXSCAN?

Whenever I run a count query on MongoDB with explain, I can see two different stages, COUNT_SCAN and IXSCAN. I want to know the difference between them in terms of performance, and how I can improve the query.
The field is indexed.
The following query:
db.collection.explain(true).count({ field: 1 })
uses COUNT_SCAN, and a query like:
db.collection.explain(true).count({ field: { $in: [1, 2] } })
uses IXSCAN.
The short version: COUNT_SCAN is the most efficient way to get a count, by reading the value from an index, but it can only be performed in certain situations. Otherwise, IXSCAN is performed, followed by some filtering of documents and a count in memory.
When reading from a secondary, the read concern available is used. This concern level does not consider orphaned documents in sharded clusters, so no SHARDING_FILTER stage is performed. This is when you see COUNT_SCAN.
However, if we use read concern local, we need to fetch the documents in order to perform the SHARDING_FILTER stage. In this case, there are multiple stages to fulfill the query: IXSCAN, then FETCH, then SHARDING_FILTER.
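To see which plan you are getting, you can inspect the explain output (a sketch; the collection and field names are the same as in the example above):
var res = db.collection.explain("executionStats").count({ field: 1 });
printjson(res.queryPlanner.winningPlan);        // COUNT_SCAN when the index alone can answer the count
printjson(res.executionStats.executionStages);  // otherwise IXSCAN, then FETCH, then SHARDING_FILTER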

(Real time) Small data aggregation MongoDB: triggers?

What is a reliable and efficient way to aggregate small data in MongoDB?
Currently, my data that needs to be aggregated is under 1 GB, but it can grow to as much as 10 GB. I'm looking for a real-time or near-real-time strategy (aggregation every 15 minutes).
It seems like the likes of Map/Reduce, Hadoop, and Storm are all overkill. I know that triggers don't exist, but I found this one post that may be ideal for my situation. Is creating a trigger in MongoDB an ideal solution for real-time small data aggregation?
MongoDB has two built-in options for aggregating data - the aggregation framework and map-reduce.
The aggregation framework is faster (executing as native C++ code as opposed to a JavaScript map-reduce job) but more limited in the sorts of aggregations that are supported. Map-reduce is very versatile and can support very complex aggregations but is slower than the aggregation framework and can be more difficult to code.
Either of these would be a good option for near real time aggregation.
One further consideration to take into account is that as of the 2.4 release the aggregation framework returns a single document containing its results and is therefore limited to returning 16MB of data. In contrast, MongoDB map-reduce jobs have no such limitation and may output directly to a collection. In the upcoming 2.6 release of MongoDB, the aggregation framework will also gain the ability to output directly to a collection, using the new $out operator.
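For example, with $out the full result set is written to a collection instead of being returned as a single 16MB-capped document (a sketch with made-up collection and field names; requires 2.6+):
db.events.aggregate([
  { $group: { _id: "$userId", total: { $sum: "$amount" } } },
  { $out: "user_totals" }   // write the results to the user_totals collection
])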
Based on the description of your use case, I would recommend using map-reduce as I assume you need to output more than 16MB of data. Also, note that after the first map-reduce run you may run incremental map-reduce jobs that run only on the data that is new/changed and merge the results into the existing output collection.
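A minimal sketch of that incremental pattern (the map/reduce functions, field names, and the lastRunTimestamp bookkeeping are placeholders):
var mapFn = function () { emit(this.userId, this.amount); };          // placeholder map function
var reduceFn = function (key, values) { return Array.sum(values); };  // placeholder reduce function
db.events.mapReduce(mapFn, reduceFn, {
  query: { ts: { $gt: lastRunTimestamp } },  // only documents that are new/changed since the last run
  out: { merge: "event_totals" }             // merge results into the existing output collection
});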
As you know, MongoDB doesn't support triggers, but you may easily implement triggers in the application by tailing the MongoDB oplog. This blog post and this SO post cover the topic well.