MongoDB aggregation pipeline output change stream

I am using MongoDB's aggregation pipeline to generate a new collection B containing aggregated results from collection A. For this purpose I use the $out stage. Every time I run the aggregation pipeline, new documents might be added, some might be updated and some removed.
I would now like to have a change stream over the aggregated collection B so that I am notified whenever the aggregation produces results that differ from the previous run (i.e. at least one insert/update/delete).
However, if I use the $out stage, the collection is recreated on every execution: I get a rename and an invalidate change event, and then the stream is closed. I can use start_after with a resume token to open the stream again, but I am not notified of the actual changes, only of the rename and invalidate.
I tried using $merge to avoid recreating the collection. The change stream then works as I expect, but I can no longer delete old documents from collection B.
Is there a way to make my use case work (i.e. the result of the aggregation pipeline becomes the new content of the collection, and I get change notifications for inserts/deletes/updates relative to the previous content)?
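For reference, a minimal sketch of the two variants described above (mongo shell syntax; the $group stage is only a placeholder pipeline body, and A/B are the collection names from the question):
// Variant 1: $out -- replaces B wholesale, so a change stream on B only sees
// a rename followed by an invalidate, and then closes.
db.A.aggregate([
    { $group: { _id: "$key", total: { $sum: "$value" } } },  // placeholder pipeline
    { $out: "B" }
])

// Variant 2: $merge -- upserts into B in place, so the change stream keeps working,
// but documents that dropped out of the aggregation result are never removed from B.
db.A.aggregate([
    { $group: { _id: "$key", total: { $sum: "$value" } } },  // placeholder pipeline
    { $merge: { into: "B", whenMatched: "replace", whenNotMatched: "insert" } }
])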

$out does not diff the new result set vs the previous contents of a collection. It drops the previous contents and inserts new documents.
Therefore there is nothing in MongoDB that knows which documents were added to B and which were removed. I don't see how you would be able to get this information via a change stream on B.
You are going to need to come up with another solution, I'm afraid.
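To see this concretely, a minimal sketch of watching B from the shell while the $out pipeline runs (the event handling here is only illustrative):
var stream = db.B.watch()
while (stream.hasNext()) {   // hasNext() blocks until the next change arrives
    var event = stream.next()
    // As the question describes, once $out replaces B the stream only reports a
    // collection-level "rename" followed by an "invalidate" and then closes;
    // there are no per-document insert/update/delete events to react to.
    printjson(event)
}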

Related

Efficiently duplicate documents in MongoDB

I would like to find out the most efficient way to duplicate documents in MongoDB: I want to take a bunch of documents from an existing collection, update one of their fields, unset _id so a new one is generated, and push them back into the collection to create duplicates.
This is typically to create a "branching" feature in MongoDB, allowing users to modify data in two separate branches at the same time.
I've tried the following things:
In my server, get data chunks in multiple threads, modify the data, and insert the modified data with a new _id into the database.
This basically works but performance is not great (~20s for 1 million elements).
In a future MongoDB version (tested on 4.1.10), use the new $out aggregation mechanism to insert into the same collection.
This does not seem to work and raises the error message "errmsg" : "$out with mode insertDocuments is not supported when the output collection is the same as the aggregation collection"
Any ideas how to be faster than the first approach? Thanks!
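For reference, a minimal sketch of the first approach in the shell (the "items" collection and the "branch" field/value are placeholder names for the branching use case):
var copies = db.items.find({ branch: "main" }).toArray().map(function (doc) {
    delete doc._id            // let the server assign a fresh _id
    doc.branch = "feature-x"  // the single field being updated (placeholder value)
    return doc
})
db.items.insertMany(copies, { ordered: false })  // unordered bulk insert for throughput
For a million documents you would read and insert in batches rather than materialising everything at once, which is essentially the multi-threaded chunking described in the question.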

Using Mongo to continuously rollup data

I've been experimenting with the Mongo Aggregation Framework and, with help from folks on here, am able to generate the right set of output docs for a given input. I have a couple of conceptual issues though that I'm hoping folks can help me design around.
The application I have is a runtime system that collects data for all the transactions it processes. All this data is written to a distributed, sharded collection in Mongo. What I need to do is periodically (every 5 seconds at this point) run a job that traverses this data, rolling it up by various categories and appending the rolled-up documents to a set of existing collections (or one existing collection).
I have a couple of challenges with the way Mongo Aggregation works:
1 - the $out pipeline stage doesn't append to the target collection, it overwrites it - I need to append to a constantly growing collection. It also can't write to a sharded collection, but I don't think this is that big an issue.
2 - I don't know how I can configure it to essentially "tail" the input collection. Right now I would need to run it from a server, mark the set of documents it's going to process with a query before running the aggregate() command, and then have another job that periodically goes back through the source collection deleting documents that have been marked for processing (this assumes the aggregation worked and rolled them up properly; there is no transactionality).
Anyone have any suggestions for a better way to do this?
Thanks,
Ian
I recommend looking at version 3.6 (released last Nov) and the feature known as change streams. Change streams are effectively the "tail" you seek. A compact program in the shell (with a placeholder filter for the events you care about) would look like this. Note also how we iterate over the aggregation on inputCollection and write doc by doc to the outputCollection.
var tailableCursor = db.inputCollection.watch()
while (!tailableCursor.isExhausted()) {
    // Block until something comes in
    if (tailableCursor.hasNext()) {
        var changeEvent = tailableCursor.next()
        // Examine the change event to ensure it is of interest
        if (isOfInterest(changeEvent)) {   // placeholder predicate supplied by you
            var cur = db.inputCollection.aggregate([ /* pipeline */ ])
            while (cur.hasNext()) {
                db.outputCollection.insert(cur.next())
            }
        }
    }
}

Use $out to insert into collection without wiping it first

According to the documentation, using $out in MongoDB's aggregation framework wipes any existing data before writing.
Is there any way to force it not to remove existing documents but only add to the collection?
No, the aggregation framework doesn't have such a feature. You can either write a map-reduce job, which, if I remember correctly, can append to a collection, or you can have the aggregation return a cursor, iterate over it, and update your collection yourself.
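A minimal sketch of the cursor approach in the shell (the "source" and "target" collection names and the $group pipeline are placeholders):
var cur = db.source.aggregate([
    { $group: { _id: "$key", total: { $sum: "$value" } } }  // placeholder pipeline, no $out stage
])
while (cur.hasNext()) {
    db.target.insert(cur.next())  // appends; existing documents in target are left untouched
}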

Automatically remove document from the collection when array field becomes empty

I have a collection whose documents contain an array field of event triggers, and when there are no triggers left I want to remove the document. As I understand it, Mongo doesn't have trigger support. Is there any way I can delegate this job to Mongo?
You are correct, there are no triggers in Mongo, so there is no built-in way to do this. You have to use application logic to achieve it. One way would be to do a cleanup every n minutes, removing documents whose array has size zero. Another way (which I like more) is, after each update to a document, to remove it if its array is empty.
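A minimal sketch of both approaches (shell syntax; the "items" collection, the "triggers" array field and updatedId are placeholders):
// Periodic cleanup, e.g. run every n minutes: delete every document whose array is empty.
db.items.deleteMany({ triggers: { $size: 0 } })

// Per-update cleanup: right after pulling a trigger out of one document,
// delete that document if its array ended up empty.
db.items.deleteOne({ _id: updatedId, triggers: { $size: 0 } })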
The only feature I know MongoDB provides to expire data is by using an expiration index.
Expire data
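For completeness, a sketch of such an expiration (TTL) index; note that it removes documents at a date stored in the document (the "expireAt" field here is a placeholder), not when an array becomes empty, so it only fits if you can set such a date:
db.items.createIndex({ expireAt: 1 }, { expireAfterSeconds: 0 })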

Overwriting a collection in MongoDB (involving remove + bulk save). How to make sure this is performed as a transaction?

In some situations I need to completely overwrite a specific MongoDB collection by doing:
1. db.collection.remove()
2. db.collection.insert(doc), multiple times.
What if 1. succeeds but 2. fails somewhere?
Is there a way to do a rollback when this fails?
Any other way to go about this?
If your collection isn't sharded you could:
1. Rename the original collection.
2. Create a new collection using the original name.
3. Populate the new collection.
4. If all went well, drop the original collection; otherwise, drop the new collection and rename the original one back to its original name.
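A minimal sketch of those steps in the shell, assuming a placeholder collection name "mydata" and a placeholder newDocs array:
var newDocs = [ /* the documents that should replace the old contents */ ]

db.mydata.renameCollection("mydata_old")       // 1. rename the original out of the way
db.createCollection("mydata")                  // 2. recreate it under the original name
try {
    db.mydata.insertMany(newDocs)              // 3. populate the new collection
    db.mydata_old.drop()                       // 4a. success: drop the renamed original
} catch (e) {
    db.mydata.drop()                           // 4b. failure: discard the new collection
    db.mydata_old.renameCollection("mydata")   //     and put the original back
}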