Is it OK to use $out for temporary collections?

I tried to create a single aggregation request, but without any luck - I need to split it. I think I can do the following:
The first aggregation request will filter/transform/sort/limit documents and save the result to a temporary collection using $out
After that, I'll execute 2-3 aggregation requests on the temporary collection
Finally, I'll delete the temporary collection
By saving data to a temporary collection, I'll skip the filter/sort/limit stages on the subsequent aggregation requests.
Is this OK? What's the overhead of this approach, and what's the main use of the $out operator?

Yes; MongoDB does this itself when it runs map-reduce aggregations (see: Temporary Collection in MongoDB).
It would be great to get specifics as to what you are trying to accomplish as it may be possible to do in a single aggregation or map-reduce operation.
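For illustration, a minimal mongo-shell sketch of the pattern from the question; the collection and field names (events, createdAt, score, category) are assumptions:

    // Stage 1: filter/sort/limit once and save the result via $out.
    db.events.aggregate([
        { $match: { createdAt: { $gte: ISODate("2024-01-01") } } },  // filter
        { $sort:  { score: -1 } },                                   // sort
        { $limit: 100000 },                                          // limit
        { $out: "events_tmp" }                                       // save to temp collection
    ]);

    // Stage 2: run the follow-up aggregations against the much smaller temp collection.
    db.events_tmp.aggregate([
        { $group: { _id: "$category", n: { $sum: 1 } } }
    ]);

    // Stage 3: delete the temporary collection.
    db.events_tmp.drop();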

Related

Efficiently duplicate documents in MongoDB

I would like to find out the most efficient way to duplicate documents in MongoDB: take a bunch of documents from an existing collection, update one of their fields, unset _id to generate a new one, and push them back into the same collection to create duplicates.
This is typically to create a "branching" feature in MongoDB, allowing users to modify data in two separate branches at the same time.
I've tried the following things:
On my server, fetch the data in chunks across multiple threads, modify it, and insert the modified documents with new _ids back into the database.
This basically works, but performance is not great (~20 s for 1 million documents).
In a future MongoDB version (tested on 4.1.10), use the new $out aggregation mode to insert into the same collection.
This does not seem to work; it raises the error "errmsg" : "$out with mode insertDocuments is not supported when the output collection is the same as the aggregation collection"
Any ideas how to be faster than the first approach? Thanks!
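For reference, a hedged sketch of the first approach (batched read-modify-insert) in the mongo shell; the collection name docs and the field branch are assumptions:

    // Copy matching documents with a new _id and an updated field.
    var batch = [];
    db.docs.find({ branch: "main" }).forEach(function (doc) {
        delete doc._id;            // unset _id so insertMany generates fresh ones
        doc.branch = "feature-x";  // the copies no longer match the cursor's filter,
                                   // so they cannot be re-read and re-duplicated
        batch.push(doc);
        if (batch.length === 1000) {  // insert in chunks to bound memory use
            db.docs.insertMany(batch, { ordered: false });
            batch = [];
        }
    });
    if (batch.length > 0) db.docs.insertMany(batch, { ordered: false });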

When do I absolutely need the $out operator in MongoDB aggregation?

MongoDB aggregation documentation on $out says:
"Takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the pipeline. The $out operator lets the aggregation framework return result sets of any size."
https://docs.mongodb.org/manual/reference/operator/aggregation/out/
So one issue may be that an aggregation can run out of memory, or use a lot of it. But how does $out help here? Ultimately, if the aggregation returns a lot of buckets, they have to be held in memory first.
The $out operator is useful when you have a use case which takes a long time to calculate but doesn't need to be current all the time.
Let's say you have a website where you want a list of the ten currently most popular articles on the frontpage (most hits in the past 60 minutes). To create this statistic, you need to parse your access-log collection with a pipeline like this:
$match the last hour
$group by article-id and user to filter out reloads
$group again by article-id to get the hit count for each article
$sort by count
$limit to 10 results
When you have a very popular website with a lot of content, this can be quite a load-heavy aggregation. And when it runs on the frontpage, you need to do it for every single frontpage hit. This can put a nasty load on your database and considerably bog down page-load times.
How do we solve this problem?
Instead of performing that aggregation on every page hit, we perform it once every minute with a cronjob which uses $out to put the aggregated top-ten list into a new collection. You can then query the cached results in that collection directly. Getting all 10 results from a 10-document collection will be far faster than performing that aggregation on every hit.
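A minimal mongo-shell sketch of that cronjob aggregation; the collection and field names (accesslog, articleId, userId, time) are assumptions:

    // Run once a minute; $out atomically replaces the cached "topArticles" collection.
    db.accesslog.aggregate([
        { $match: { time: { $gte: new Date(Date.now() - 60 * 60 * 1000) } } },
        { $group: { _id: { article: "$articleId", user: "$userId" } } },  // collapse reloads
        { $group: { _id: "$_id.article", hits: { $sum: 1 } } },           // hits per article
        { $sort:  { hits: -1 } },
        { $limit: 10 },
        { $out: "topArticles" }
    ]);

    // The frontpage then reads the tiny cached collection instead:
    db.topArticles.find().sort({ hits: -1 });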

Use $out to insert into a collection without wiping it first

According to the documentation, using $out in MongoDB's aggregation framework wipes any existing data in the target collection before writing.
Is there any way to force it not to remove existing documents but only add to the collection?
No, the aggregation framework doesn't have such a feature. You can either write a map-reduce job, which, if I remember correctly, can append to a collection, or you can have the aggregation return a cursor, iterate over it, and update your collection yourself.
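A minimal sketch of that cursor approach in the mongo shell; the collection names source and target are placeholders:

    // Iterate the aggregation cursor and append to the target collection
    // instead of letting $out replace it.
    db.source.aggregate([
        /* your pipeline, without the $out stage */
    ]).forEach(function (doc) {
        db.target.insertOne(doc);  // appends; existing documents stay untouched
    });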

MongoDB - keeping track of aggregated documents

I have a mongodb collection that stores raw information coming from an app. I wrote a multi-pipeline aggregation method to generate more meaningful data from the raw documents.
Using the $out operator in my aggregation function, I store the aggregation results in another collection.
I would like to be able to either delete raw documents that were already aggregated, or somehow mark those documents so I know not to aggregate again.
I am worried that I cannot guarantee I won't miss documents created in between, or create duplicate aggregated documents.
Is there a way to achieve this?
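One common pattern (a sketch, not from this thread) is to pin a cutoff timestamp first, aggregate only documents created before it, then mark exactly that same range as processed. The field names (createdAt, processed) are assumptions, and it presumes inserts never arrive with a createdAt earlier than the pinned cutoff:

    // Pin the cutoff before aggregating, so documents created in between
    // simply fall into the next batch instead of being missed.
    var cutoff = new Date();
    db.raw.aggregate([
        { $match: { processed: { $ne: true }, createdAt: { $lt: cutoff } } },
        // ... your transformation stages ...
        { $out: "aggregated_batch" }
    ]);
    // Mark exactly the range that was just aggregated.
    db.raw.updateMany(
        { processed: { $ne: true }, createdAt: { $lt: cutoff } },
        { $set: { processed: true } }
    );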

MongoDB: batch jobs and deleting data

I am writing a batch job to aggregate some data using the aggregation framework. Since the output is potentially large, I use a $limit at the top of my pipeline to reduce the number of documents processed at a time. My question is: after the aggregation is complete, how can I reliably remove all the documents that have been processed from the collection without worrying about race conditions? Or is there another way to go about what I am trying to do?
1> Rename your collection.
2> Do all your processing on the renamed collection.
Warning: this doesn't work well in sharded environments.
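A minimal mongo-shell sketch of the rename approach; the collection names (events, events_batch, events_aggregated) are placeholders:

    // Atomically move the data aside; new writes implicitly recreate "events".
    db.events.renameCollection("events_batch");

    // Process the frozen snapshot at leisure.
    db.events_batch.aggregate([
        { $limit: 100000 },
        // ... aggregation stages ...
        { $out: "events_aggregated" }
    ]);

    // Everything in the snapshot has been processed; drop it without racing
    // against new inserts.
    db.events_batch.drop();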