MongoDB: batch jobs and deleting data

I am writing a batch job to aggregate some data using the aggregation framework. Since the data output is potentially large, I am using a $limit at the top of my pipeline to reduce the number of objects that are processed at a time. My question is: after the aggregation is complete, how would I reliably remove all the objects that have been processed from the collection without having to worry about race conditions? Or is there another way to go about what I am trying to do?

1> Rename your collection.
2> Now do all your processing on the renamed collection, while new writes continue to land in a freshly created collection under the original name (see the sketch below).
Warning: this doesn't work well in sharded environments.
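A minimal sketch of that pattern in the shell, assuming a collection named events (the name and the pipeline are illustrative): the batch job works only on the renamed snapshot and drops it when finished, so there is no race with documents that arrive during processing.
// Atomically move the current data out of the way; subsequent writes
// implicitly recreate a collection under the original name
db.events.renameCollection("events_batch");

// Run the aggregation against the renamed snapshot only
db.events_batch.aggregate([
    { $limit: 100000 },
    /* ...rest of the pipeline... */
]);

// Once processing has succeeded, discard the processed snapshot
db.events_batch.drop();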

Related

Efficiently duplicate documents in MongoDB

I would like to find out the most efficient way to duplicate documents in MongoDB: I want to take a bunch of documents from an existing collection, update one of their fields, unset _id to generate a new one, and push them back into the collection to create duplicates.
This is typically to create a "branching" feature in MongoDB, allowing users to modify data in two separate branches at the same time.
I've tried the following things:
In my server, get data chunks in multiple threads, modify the data, and insert the modified data with a new _id back into the database (sketched below).
This basically works, but performance is not great (~20 s for 1 million documents).
In a future MongoDB version (tested on version 4.1.10), use the new $out aggregation mechanism to insert into the same collection.
This does not seem to work and raises the error "errmsg" : "$out with mode insertDocuments is not supported when the output collection is the same as the aggregation collection".
Any ideas how to be faster than the first approach? Thanks!
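For reference, here is a rough single-threaded sketch of the first approach in the shell; the collection and field names (documents, branch) are illustrative, and a real implementation would split the work across threads and use the driver's bulk API.
// Copy documents in batches, changing one field and letting the server
// assign fresh _id values on insert
var batch = [];
db.documents.find({ branch: "main" }).forEach(function (doc) {
    delete doc._id;               // unset _id so a new one is generated
    doc.branch = "feature-x";     // the field being changed on the copy
    batch.push(doc);
    if (batch.length === 1000) {
        db.documents.insertMany(batch, { ordered: false });
        batch = [];
    }
});
if (batch.length > 0) {
    db.documents.insertMany(batch, { ordered: false });
}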

Using Mongo to continuously roll up data

I've been experimenting with the Mongo Aggregation Framework and, with help from folks on here, am able to generate the right set of output docs for a given input. I have a couple of conceptual issues though that I'm hoping folks can help me design around.
The application I have is a runtime system that collects data for all the transactions it processes. All this data is written to a distributed, sharded collection in Mongo. What I need to do is periodically (every 5 seconds at this point) run a job that traverses this data, rolling it up by various categories and appending the rolled-up documents to a set of existing collections (or one existing collection).
I have a couple of challenges with the way Mongo Aggregation works:
1 - the $out pipeline stage doesn’t append to the target collection, it overwrites it - I need to append to a constantly growing collection. It also can't write to a sharded collection, but I don't think this is that big an issue.
2 - I don't know how I can configure it to essentially "tail" the input collection. Right now I would need to run it from a server, mark the set of documents it's going to process with a query before running the aggregate() command, and then have another job that periodically goes back through the source collection deleting documents that have been marked for processing (this assumes the aggregation worked and rolled them up properly; there is no transactionality).
Anyone have any suggestions for a better way to do this?
Thanks,
Ian
I recommend looking at version 3.6 (released last Nov) and the feature known as change streams. Change streams are effectively the "tail" you seek. A compact program in pseudo-code would look like this. Note also how we iterate over the aggregation cursor on inputCollection and write the documents one by one to outputCollection.
// Open a change stream on the input collection (MongoDB 3.6+)
tailableCursor = db.inputCollection.watch();
while (!tailableCursor.isClosed()) {
    // Block until something comes in
    if (tailableCursor.hasNext()) {
        document = tailableCursor.next();
        // Examine the change event to ensure it is of interest
        if (isOfInterest(document)) {   // isOfInterest() stands for your own filtering logic
            cur = db.inputCollection.aggregate([ /* pipeline */ ]);
            while (cur.hasNext()) {
                db.outputCollection.insert(cur.next());
            }
        }
    }
}

Is it ok to use $out for temporary collections?

I tried to create a single aggregation request but without any luck - I need to split it. I think I can do the following:
First aggregation request will filter/transform/sort/limit documents and save the result to a temporary collection by using $out
After that, I'll execute 2-3 aggregation requests on the temporary collection
Finally, I'll delete the temporary collection
By saving data to a temporary collection, I'll skip filter/sort/limit stages on subsequent aggregation requests.
Is it ok? What's the overhead of this approach? What's the main usage of $out operator?
Yes; MongoDB does this itself when it runs map-reduce aggregations: Temporary Collection in MongoDB
It would be great to get specifics as to what you are trying to accomplish as it may be possible to do in a single aggregation or map-reduce operation.
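For illustration, the workflow described in the question might look roughly like this in the shell; the collection names and pipeline stages are placeholders.
// 1. Filter/transform/sort/limit once and materialise the result with $out
db.source.aggregate([
    { $match: { /* filter */ } },
    { $sort: { createdAt: -1 } },
    { $limit: 100000 },
    { $out: "tmp_filtered" }       // overwrites tmp_filtered if it already exists
]);

// 2. Run the follow-up aggregations against the much smaller temp collection
db.tmp_filtered.aggregate([ /* pipeline A */ ]);
db.tmp_filtered.aggregate([ /* pipeline B */ ]);

// 3. Drop the temporary collection when done
db.tmp_filtered.drop();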

MongoDB MapReduce

I have a question about MongoDB's mapReduce function. Say there is currently a mapReduce running that will take a long time. What will happen when a user tries to access the same collection the mapReduce is writing to?
Does the map-reduce write all the data after it is finished, or does it write it while running?
Long running read and write operations, such as queries, updates, and deletes, yield under many conditions. MongoDB operations can also yield locks between individual document modifications in write operations that affect multiple documents like update() with the multi parameter.
In map-reduce, MongoDB takes a read and write lock, unless output operations are specified as non-atomic. Portions of map-reduce jobs can run concurrently.
See the concurrency page for details on MongoDB locking. For your case, the map-reduce command takes a read and write lock on the relevant collections while it is running. Portions of the map-reduce command can be concurrent, but in the general case it is locked while running.
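For example, writing the output with a non-atomic merge lets the post-processing phase yield the write lock periodically; this is only a hedged sketch, and the collection names, fields, and map/reduce bodies below are placeholders.
db.orders.mapReduce(
    function () { emit(this.customerId, this.total); },     // map
    function (key, values) { return Array.sum(values); },   // reduce
    {
        // nonAtomic is only valid with the merge and reduce output actions
        out: { merge: "orderTotals", nonAtomic: true }
    }
);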

(Real time) Small data aggregation MongoDB: triggers?

What is a reliable and efficient way to aggregate small data in MongoDB?
Currently, my data that needs to be aggregated is under 1 GB, but can go as high as 10 GB. I'm looking for a real time strategy or near real time (aggregation every 15 minutes).
It seems like the likes of Map/Reduce, Hadoop, and Storm are all overkill. I know that triggers don't exist, but I found this one post that may be ideal for my situation. Is creating a trigger in MongoDB an ideal solution for real time small data aggregation?
MongoDB has two built-in options for aggregating data - the aggregation framework and map-reduce.
The aggregation framework is faster (executing as native C++ code as opposed to a JavaScript map-reduce job) but more limited in the sorts of aggregations that are supported. Map-reduce is very versatile and can support very complex aggregations but is slower than the aggregation framework and can be more difficult to code.
Either of these would be a good option for near real time aggregation.
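For reference, a simple rollup in the aggregation framework looks something like this; the collection and field names are purely illustrative.
db.transactions.aggregate([
    { $match: { status: "complete" } },            // filter the input
    { $group: {                                    // roll up by category
        _id: "$category",
        total: { $sum: "$amount" },
        count: { $sum: 1 }
    } }
]);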
One further consideration to take into account is that as of the 2.4 release the aggregation framework returns a single document containing its results and is therefore limited to returning 16MB of data. In contrast, MongoDB map-reduce jobs have no such limitation and may output directly to a collection. In the upcoming 2.6 release of MongoDB, the aggregation framework will also gain the ability to output directly to a collection, using the new $out operator.
Based on the description of your use case, I would recommend using map-reduce as I assume you need to output more than 16MB of data. Also, note that after the first map-reduce run you may run incremental map-reduce jobs that run only on the data that is new/changed and merge the results into the existing output collection.
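A minimal sketch of such an incremental map-reduce, assuming a timestamp field and collection names that are only illustrative: restrict each run to new documents with a query, and re-reduce the results into the existing output collection.
var lastRun = ISODate("2014-01-01T00:00:00Z");   // persisted from the previous run

db.events.mapReduce(
    function () { emit(this.category, this.amount); },      // map
    function (key, values) { return Array.sum(values); },   // reduce
    {
        query: { createdAt: { $gt: lastRun } },   // only documents added since the last run
        out: { reduce: "eventTotals" }            // re-reduce into the existing output
    }
);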
As you know, MongoDB doesn't support triggers, but you may easily implement triggers in the application by tailing the MongoDB oplog. This blog post and this SO post cover the topic well.
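As a rough, hedged sketch of the oplog-tailing idea (it assumes a replica set, and the namespace mydb.events is made up), using the legacy shell's tailable-cursor options:
var oplog = db.getSiblingDB("local").oplog.rs;
var cursor = oplog.find({ ns: "mydb.events", op: "i" })   // inserts into one namespace
                  .addOption(DBQuery.Option.tailable)
                  .addOption(DBQuery.Option.awaitData);
while (cursor.hasNext()) {
    var entry = cursor.next();
    // entry.o is the inserted document; apply your "trigger" logic here
    // (a production version would reopen the cursor if it times out with no data)
}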