When do I absolutely need the $out operator in MongoDB aggregation?

MongoDB aggregation documentation on $out says:
"Takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the pipeline. The $out operator lets the aggregation framework return result sets of any size."
https://docs.mongodb.org/manual/reference/operator/aggregation/out/
So one issue may be that an aggregation can run out of memory or use a lot of it. But how does $out help here? Ultimately, if the aggregation returns a lot of buckets, they have to be held in memory first anyway.

The $out operator is useful when you have a certain use-case which takes long to calculate but doesn't need to be current all the time.
Let's say you have a website where you want a list of the top ten currently most popular articles on the frontpage (most hits in the past 60 minutes). To create this statistic, you need to parse your access log collections with a pipeline like this:
$match the last hour
$group by article-id and user to filter out reloads
$group again by article-id to get the hit count for each article
$sort by count
$limit to 10 results
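As a rough illustration, here is a minimal sketch of that pipeline in shell syntax, assuming a hypothetical hits collection with articleId, userId and ts (timestamp) fields:

db.hits.aggregate([
    // only hits from the past 60 minutes
    { $match: { ts: { $gt: new Date(Date.now() - 60 * 60 * 1000) } } },
    // collapse reloads: one document per (article, user) pair
    { $group: { _id: { article: "$articleId", user: "$userId" } } },
    // count distinct users per article
    { $group: { _id: "$_id.article", hits: { $sum: 1 } } },
    // highest hit count first, top ten only
    { $sort: { hits: -1 } },
    { $limit: 10 }
])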
When you have a very popular website with a lot of content, this can be quite a load-heavy aggregation. And when it drives the frontpage, you need to run it for every single frontpage hit. This can put a nasty load on your database and considerably slow down page loads.
How do we solve this problem?
Instead of performing that aggregation on every page hit, we perform it once every minute with a cronjob which uses $out to put the aggregated top ten list into a new collection. You can then query the cached results in that collection directly. Getting all 10 results from a 10-document collection will be far faster than performing that aggregation all the time.
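For instance, the cron job could append $out as the final stage, and the frontpage could then read the cached collection directly (the collection name here is an assumption for illustration):

db.hits.aggregate([
    /* ...the five stages above... */
    { $out: "topTenArticles" }   // replaces the cached collection atomically
])
db.topTenArticles.find()         // ten tiny documents; effectively instant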

Related

Mongodb Atlas - pipeline length greater than 50 not supported

When I use Mongoose with a MongoDB Atlas server, it gives the error that a pipeline length greater than 50 is not supported.
I have searched the whole web for this and can't find a solution.
Is there a workaround for this?
This happens because you have used too many pipeline stages in one aggregation.
If you want to use more than 50 pipeline stages, you can use $facet.
The following workflow may be of help:
Separate your pipeline into several chunks (the pipeline stages in each chunk must not exceed 50).
After running the facet, you can use $unwind to separate the result into separate documents like normal (you may need to restructure the data to restore the former format using $project).
You can run another facet after that if you want.
For example, if you plan to run 150 stages in one aggregation, separate them into 4-5 chunks. Make sure each chunk works on the same scope to avoid "missing" or "undefined variable" cases, and use $unwind to restore the document format before running the next chunk.
Make sure that each output document does not exceed 16MB; that is why I recommend using $unwind after $facet (documents can exceed 16MB during intermediate stages). The reason is that $facet outputs everything into one document with an array of documents inside (if you want to keep all documents), so $unwind separates those "inner" documents back into separate documents.
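Here is a minimal sketch of that chunking idea in shell syntax, with hypothetical stage lists chunkA and chunkB (each under 50 stages):

db.coll.aggregate([
    { $facet: { result: [ /* chunkA: up to 50 stages */ ] } },
    { $unwind: "$result" },                    // one document per result again
    { $replaceRoot: { newRoot: "$result" } },  // restore the former shape
    { $facet: { result: [ /* chunkB: up to 50 stages */ ] } },
    { $unwind: "$result" },
    { $replaceRoot: { newRoot: "$result" } }
])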
One note on this: you can try limiting field names to prevent the BufBuilder from exceeding its maximum size, which is 64MB. Try using $project instead of $addFields, as $addFields increases the buffer between stages.
Another note: you should not use pipeline stages that exceed 100MB of RAM if you are using a MongoDB Atlas tier below M10.
It would be better if you provided pseudo-code for the problem you are having, but I think this should fix it.

Why aggregate+sort is faster than find+sort in mongo?

I'm using Mongoose in my project. As the number of documents in my collection grows, find+sort becomes slower, so I use aggregate+$sort instead. I just wonder why?
Without seeing your data and your query it is difficult to answer why aggregate+sort is faster than find+sort.
But here are some general points that hold for find and aggregate:
Well-indexed data (indexing that suits your query) will always yield faster results for your find query.
For an aggregate query, execution time grows with the pipeline: the more stages and operations you use, the longer it takes.
The aggregation pipeline lets you create new fields such as sums and averages, which is not possible with a find.
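For example, a $group stage can compute derived fields that find() cannot (the collection and field names here are assumptions for illustration):

db.orders.aggregate([
    { $group: {
        _id: "$customerId",
        total: { $sum: "$amount" },   // new field: total spent per customer
        avg:   { $avg: "$amount" }    // new field: average order value
    } }
])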
See this thread for more info:
MongoDB {aggregation $match} vs {find} speed

Is it ok to use $out for temporary collections?

I tried to create a single aggregation request, but without any luck - I need to split it. I think I can do the following:
The first aggregation request will filter/transform/sort/limit documents and save the result to a temporary collection by using $out.
After that, I'll execute 2-3 aggregation requests on the temporary collection.
Finally, I'll delete the temporary collection.
By saving data to a temporary collection, I'll skip filter/sort/limit stages on subsequent aggregation requests.
Is it OK? What's the overhead of this approach? What's the main usage of the $out operator?
Yes; MongoDB does this itself when it runs map-reduce aggregations: Temporary Collection in MongoDB
It would be great to get specifics as to what you are trying to accomplish as it may be possible to do in a single aggregation or map-reduce operation.
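A minimal sketch of the workflow from the question, with hypothetical collection and field names:

// First pass: filter/sort/limit once and persist the result
db.events.aggregate([
    { $match: { type: "purchase" } },
    { $sort: { ts: -1 } },
    { $limit: 100000 },
    { $out: "tmp_events" }
])
// Follow-up aggregations run against the much smaller temp collection
db.tmp_events.aggregate([ { $group: { _id: "$userId", n: { $sum: 1 } } } ])
// Clean up when done
db.tmp_events.drop()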

MongoDB {aggregation $match} vs {find} speed

I have a MongoDB collection with millions of rows and I'm trying to optimize my queries. I'm currently using the aggregation framework to retrieve data and group it as I want. My typical aggregation query is something like: $match > $group > $group > $project
However, I noticed that the last parts only take a few ms, the beginning is the slowest.
I tried to perform a query with only the $match filter, and then to perform the same query with collection.find. The aggregation query takes ~80ms while the find query takes 0 or 1ms.
I have indexes on pretty much every field, so I guess this isn't the problem. Any idea what could go wrong? Or is it just a "normal" drawback of the aggregation framework?
I could use find queries instead of aggregation queries; however, I would have to perform a lot of processing after the request, and that processing can be done quickly with $group etc., so I would rather keep the aggregation framework.
Thanks,
EDIT:
Here are my criteria:
{
    "action": "click",
    "timestamp": {
        "$gt": ISODate("2015-01-01T00:00:00Z"),
        "$lt": ISODate("2015-02-11T00:00:00Z")
    },
    "itemId": "5"
}
The main purpose of the aggregation framework is to make it easy to query a big number of entries and generate a low number of results that hold value to you.
As you have said, you can also use multiple find queries, but remember that you can not create new fields with find queries. On the other hand, the $group stage allows you to define your new fields.
If you would like to achieve the functionality of the aggregation framework, you would most likely have to run an initial find (or chain several ones), pull that information and further manipulate it with a programming language.
The aggregation pipeline might seem to take longer, but at least you know you only have to take into account the performance of one system: the MongoDB engine.
When it comes to manipulating the data returned from a find query, on the other hand, you would most likely have to process it further in a programming language, increasing the complexity depending on the intricacies of the language of choice.
Have you tried using explain() on your find queries? It will give you a good idea of exactly how much time the find() query takes. You can do the same for $match with the explain option and see whether there is any difference in index access and other parameters.
Also, the $group stage of the aggregation framework doesn't use indexes, so it has to process all the records returned by the $match stage. To better understand how your query behaves, look at the result set it returns and whether it fits into memory for MongoDB to process.
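A quick sketch of that comparison in shell syntax (the collection name is an assumption), reusing a trimmed version of the criteria from the question:

var criteria = { action: "click", itemId: "5" };
db.events.find(criteria).explain("executionStats")                // find() plan and timings
db.events.aggregate([ { $match: criteria } ], { explain: true })  // $match plan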
If you are concerned with performance, aggregation is no doubt a more time-consuming task than a find clause.
When you are fetching records on multiple conditions, with lookups, grouping, and a limited (paginated) result set, aggregate is the best approach. Meanwhile, a find query is fast when you have to fetch a very big data set with only some population and projection and no pagination; in that case I suggest using a find query.

Aggregation framework on full table scan

I know that the aggregation framework is suitable when there is an initial $match stage to limit the collection to be aggregated. However, there may be times when the filtered collection is still large, say around 2 million documents, and the aggregation involves $group. Is the aggregation framework fit to work on such a collection, given a requirement to output results in at most 5 seconds? Currently I work on a single node. By performing the aggregation on a sharded cluster, will there be a significant improvement in performance?
As far as I know, the only limitations are that the result of the aggregation can't surpass the 16MB limit, since what it returns is a document and that's the size limit for a document in MongoDB. Also, you can't use more than 10% of the total memory of the machine; that is why $match stages are usually used to reduce the set you work with, or a $project stage to reduce the data per document.
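A sketch of that reduction in shell syntax (collection and field names are assumptions for illustration):

db.big.aggregate([
    { $match: { status: "active" } },                 // shrink the working set first
    { $project: { _id: 0, userId: 1, amount: 1 } },   // keep only the fields $group needs
    { $group: { _id: "$userId", total: { $sum: "$amount" } } }
])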
Be aware that in a sharded environment, after $group or $sort stages the aggregation is brought back to the mongos before being sent to the next stage of the pipeline. Potentially the mongos could be running on the same machine as your application and could hurt your application's performance if not handled correctly.