Using Mongo to continuously rollup data - mongodb

I've been experimenting with the Mongo Aggregation Framework and, with help from folks on here, am able to generate the right set of output docs for a given input. I have a couple of conceptual issues though that I'm hoping folks can help me design around.
The application I have is a runtime system that collects data for all the transactions it processes. All this data is written to a distributed, sharded collection in Mongo. What I need to do is periodically (every 5 seconds at this point) run a job that traverses this data, rolling it up by various categories and appending the rolled-up documents to a set of existing collections (or one existing collection).
I have a couple of challenges with the way Mongo Aggregation works:
1 - the $out pipeline stage doesn’t append to the target collection, it overwrites it - I need to append to a constantly growing collection. It also can't write to a sharded collection, but I don't think this is that big an issue.
2 - I don't know how I can configure it to essentially "tail" the input collection. Right now I would need to run it from a server: mark the set of documents it's going to process with a query before running the aggregate() command, then have another job periodically go back through the source collection deleting documents that have been marked for processing (this assumes the aggregate worked and rolled them up properly - there is no transactionality).
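For illustration, the cycle I have in mind would look roughly like this in the shell (the collection and field names here are made up):
// 1) Mark the documents this run will process
var batchId = ObjectId();
db.transactions.update(
    { batchId: { $exists: false } },
    { $set: { batchId: batchId } },
    { multi: true }
);
// 2) Roll up only the marked documents and append the results by hand
db.transactions.aggregate([
    { $match: { batchId: batchId } },
    { $group: { _id: "$category", total: { $sum: "$amount" }, count: { $sum: 1 } } }
]).forEach(function(doc) {
    db.rollups.insert(doc);
});
// 3) Delete the marked documents once the rollup has succeeded (no transactionality)
db.transactions.remove({ batchId: batchId });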
Anyone have any suggestions for a better way to do this?
Thanks,
Ian

I recommend looking at version 3.6 (released last November) and the feature known as change streams. Change streams are effectively the "tail" you seek. A compact program in pseudo-code would look like this. Note also how we iterate over the aggregation on inputCollection and write doc by doc to the outputCollection.
tailableCursor = db.inputCollection.watch();
while (true) {
    // Block until something comes in
    document = tailableCursor.next();
    // Examine document to ensure it is of interest
    if (of interest) {
        cur = db.inputCollection.aggregate([pipeline]);
        while (cur.hasNext()) {
            db.outputCollection.insert(cur.next());
        }
    }
}

Related

MongoDB sequence number based on count in one operation

I'm working on creating an immutable, append-only event log for MongoDB. For this I need a sequence number generated, and I can base it off the count of documents, since there will be no removals from the event log. However, I'm trying to avoid having to do two operations on MongoDB and would rather it happen in one "transaction" within the database itself.
If I were to do this from the Mongo shell, it would be something like below:
db['event-log'].insertOne({SequenceNumber: db['event-log'].count() +1 })
Is this doable in any way with the regular API?
Prior to v4, there was the possibility of doing eval - which would have made this much easier.
Update
The reason for my need of a sequence number is to be able to guarantee the order in which the documents were inserted when reading them back. The default behavior of Mongo is to retrieve them in $natural order, and one can explicitly request that on .find() as well. Although the documentation is clear on not relying on it, it seems that as long as there are no modifications or removals of documents already there, it should be fine from what I can gather.
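For example, reading them back in insertion order would then be something like:
db['event-log'].find().sort({ $natural: 1 })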
I also realized that I might get around this in another way: I'm going to introduce an Actor framework, and I could make my committer a stateful actor that holds the sequence number if I need it.

Faster way to remove 2TB of data from single collection (without sharding)

We collect a lot of data and have decided to migrate our data from MongoDB into a data lake. We are going to leave only a portion of our data in Mongo and use it as our operational database that keeps only the newest, most relevant data. We have a replica set, but we don't use sharding. I suspect that if we had a sharded cluster we could achieve the necessary results much more simply, but it's a one-time operation, so setting up a cluster just for one operation looks like a very complex solution (plus I also suspect that converting such a collection into a sharded collection would be a very long-running operation, but I could be completely wrong here).
One of our collections is about 2TB in size right now. We want to remove old data from the original database as fast as possible, but the standard "remove" operation looks very slow, even if we use unorderedBulkOperation.
I found a few suggestions to copy the data we want to keep into another collection and then just drop the original collection, instead of trying to remove data (i.e. migrate the data we want to keep instead of removing the data we don't). There are a few different ways that I found to copy a portion of data from the original collection to another collection:
Extract data and insert it into the other collection one by one, or extract a portion of data and insert it in bulk using insertMany(). It looks faster than just removing data, but still not fast enough.
Use the $out operator with the aggregation framework to extract portions of data (see the sketch after this list). It's very fast! But it extracts every portion of data into a separate collection and doesn't have the ability to append data in the current MongoDB version, so we will need to combine all exported portions of data into one final collection, which is slow again. I see that $out will be able to append data in the next release of Mongo (https://jira.mongodb.org/browse/SERVER-12280). But we need some solution now, and unfortunately we won't be able to do a quick upgrade of our Mongo version anyway.
mongoexport / mongoimport - export a portion of data into a JSON file and append it to another collection using import. It's quite fast too, so it looks like a good option.
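For reference, the $out extraction from option 2 looks roughly like this in our case (collection name, field name and date range are simplified placeholders):
// Copy one slice of the data we want to keep into its own collection
db.events.aggregate([
    { $match: { timestamp: { $gte: ISODate("2019-01-01"), $lt: ISODate("2019-02-01") } } },
    { $out: "events_2019_01" }
], { allowDiskUse: true });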
Currently it looks like the best choice for improving the performance of the migration is a combination of the $out and mongoexport/mongoimport approaches, plus multithreading to perform multiple of the described operations at once.
But is there an even faster option that I might have missed?

Best way to query entire MongoDB collection for ETL

We want to query an entire live production MongoDB collection (v2.6, around 500GB of data on around 70M documents).
We're wondering what's the best approach for this:
A single query with no filtering to open a cursor and get documents in batches of 5/6k
Iterate with pagination, using a logic of find().limit(5000).skip(currentIteration * 5000)
We're unsure which is the best practice and will yield the best results with minimum impact on performance.
I would go with 1. & 2. mixed if possible: iterate over your huge dataset in pages, but access those pages by querying instead of skipping over them, as skipping may be costly, as also pointed out by the docs:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
So if possible, build your pages on an indexed field and process those batches of data with a corresponding query range.
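For example, paging on _id (or any other indexed field) instead of skipping could look roughly like this:
var lastId = null;
var batch;
do {
    var query = (lastId === null) ? {} : { _id: { $gt: lastId } };
    batch = db.yourcoll.find(query).sort({ _id: 1 }).limit(5000).toArray();
    batch.forEach(function(doc) {
        // process doc here
        lastId = doc._id;
    });
} while (batch.length === 5000);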
The brutal way
Generally speaking, most drivers load batches of documents anyway. So your language's equivalent of
var docs = db.yourcoll.find();
docs.forEach(function(doc) {
    // whatever
});
will actually just create a cursor initially, and will then, when the current batch is close to exhaustion, load a new batch transparently. So doing this pagination manually while planning to access every document in the collection will have little to no advantage, but will carry the overhead of multiple queries.
As for ETL, manually iterating over the documents to modify them and then store them in a new instance does not, under most circumstances, seem reasonable to me, as you basically reinvent the wheel.
Alternate approach
Generally speaking, there is no one-size-fits all "best" way. The best way is the one that best fits your functional and non-functional requirements.
When doing ETL from MongoDB to MongoDB, I usually proceed as follows:
ET…
Unless you have very complicated transformations, MongoDB's aggregation framework is a surprisingly capable ETL tool. I use it regularly for that purpose and have yet to find a problem not solvable with the aggregation framework for in-MongoDB ETL. Given that in general each document is processed one by one, the impact on your production environment should be minimal, if noticeable at all. Once you have done your transformation, simply use the $out stage to save the results in a new collection.
Even collection-spanning transformations can be achieved using the $lookup stage.
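A stripped-down sketch of such an in-MongoDB transformation might look like this (the collections and fields are invented purely for illustration):
db.orders.aggregate([
    // Transform: keep and reshape only the fields the target system needs
    { $project: { customerId: 1, total: 1, createdAt: 1 } },
    // Collection-spanning enrichment via $lookup
    { $lookup: {
        from: "customers",
        localField: "customerId",
        foreignField: "_id",
        as: "customer"
    } },
    { $unwind: "$customer" },
    // Save the transformed documents into a new collection
    { $out: "orders_transformed" }
], { allowDiskUse: true });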
…L
After you have done the extract and transform on the old instance, you have several possibilities for loading the data into the new MongoDB instance:
Create a temporary replica set consisting of the old instance, the new instance and an arbiter. Make sure your old instance becomes primary, do the ET part, have the primary step down so your new instance becomes primary, and remove the old instance and the arbiter from the replica set. The advantage is that you use MongoDB's replication mechanics to get the data from your old instance to your new instance, without the need to worry about partially executed transfers and such. And you can use it the other way around: transfer the data first, make the new instance the primary, remove the other members from the replica set, and then perform your transformations and remove the "old" data.
Use db.cloneCollection() (see the example after this list). The advantage here is that you only transfer the collections you need, at the expense of more manual work.
Use db.cloneDatabase() to copy over the entire DB. Unless you have multiple databases on the original instance, this method has little to no advantage over the replica set method.
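For instance, db.cloneCollection() (available on these older server versions) is run on the new instance and pulls the named collection from the old one; roughly:
// Run on the new instance; the host and collection name are placeholders
db.cloneCollection("oldhost.example.com:27017", "orders_transformed", {})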
As written, without knowing your exact use cases, transformations and constraints, it is hard to tell which approach makes the most sense for you.
MongoDB 3.4 supports parallel collection scans (the parallelCollectionScan command). I haven't tried this myself yet, but it looks interesting to me.
Note that this will not work on sharded clusters. If you have a parallel processing setup, this will speed up the scanning for sure.
Please see the documentation here: https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/

Solution to Bulk FindAndModify in MongoDB

My use case is as follows -
I have a collection of documents in mongoDB which I have to send for analysis.
The format of the documents are as follows -
{
    _id: ObjectId("517e769164702dacea7c40d8"),
    date: "1359911127494",
    status: "available",
    other_fields...
}
I have a reader process which picks the first 100 documents with status:available, sorted by date, and updates them to status:processing.
ReaderProcess sends the documents for analysis. Once the analysis is complete the status is changed to processed.
Currently the reader process first fetches 100 documents sorted by date and then updates the status to processing for each document in a loop. Is there any better/more efficient solution for this case?
Also, in the future, for scalability, we might go with more than one reader process.
In this case, I want the 100 documents picked by one reader process to not get picked by another reader process. But fetching and updating are separate queries right now, so it is very possible that multiple reader processes pick the same documents.
Bulk findAndModify (with limit) would have solved all these problems. But unfortunately it is not provided in MongoDB yet. Is there any solution to this problem?
As you mention, there is currently no clean way to do what you want. The best approach at this time for operations like the one you need is this:
1) Reader selects X documents with the appropriate limit and sorting.
2) Reader marks the documents returned by 1) with its own unique reader ID (e.g. update({_id:{$in:[<result set ids>]}, state:"available", $isolated:1}, {$set:{readerId:<your reader's ID>, state:"processing"}}, false, true)).
3) Reader selects all documents marked as processing and with its own reader ID. At this point it is guaranteed that you have exclusive access to the resulting set of documents.
4) Offer the result set from 3) for your processing.
Note that this even works in highly concurrent situations, as a reader can never reserve documents already reserved by another reader (note that step 2 can only reserve currently available documents, and writes are atomic). I would also add a timestamp with the reservation time if you want to be able to time out reservations (for example for scenarios where readers might crash/fail).
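In mongo shell terms, and using the field names from the question (status/date) plus an assumed readerId field, the flow could look roughly like this:
var readerId = "reader-1";  // unique per reader process

// 1) Select the first 100 available documents, oldest first ("docs" is a placeholder collection name)
var ids = db.docs.find({ status: "available" })
                 .sort({ date: 1 })
                 .limit(100)
                 .map(function(doc) { return doc._id; });

// 2) Reserve only those that are still available, in one multi-update
db.docs.update(
    { _id: { $in: ids }, status: "available", $isolated: 1 },
    { $set: { readerId: readerId, status: "processing", reservedAt: new Date() } },
    { multi: true }
);

// 3) Read back exactly the documents this reader managed to reserve
var reserved = db.docs.find({ readerId: readerId, status: "processing" }).toArray();

// 4) Hand "reserved" off for analysis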
EDIT: More details :
All write operations can occasionally yield for pending operations if the write takes a relatively long time. This means that step 3) might not see all documents marked by step 2) unless you take the following steps:
Use an appropriate "w" (write concern) value, meaning 1 or higher. This will ensure that the connection on which the write operation is invoked will wait for it to complete regardless of it yielding.
Make sure you do the read in step 3 on the same connection (only relevant for replica sets with slaveOk-enabled reads) or thread, so that they are guaranteed to be sequential. The former can be done in most drivers with the "requestStart" and "requestDone" methods or similar (see e.g. the Java driver documentation).
Add the $isolated flag to your multi-updates to ensure it cannot be interleaved with other write operations.
Also see comments for discussion regarding atomicity/isolation. I incorrectly assumed multi-updates were isolated. They are not, or at least not by default.

Prioritize specific long-running operation

I have a Mongo collection with a little under 2 million documents in it, and I have a query that I wish to run that will delete around 700,000 of them, based on a date field.
The remove query looks something like this:
db.collection.remove({'timestamp': { $lt: ISODate('XXXXX') }})
The exact date is not important in this case; the syntax is correct and I know it will work. However, I also know it's going to take forever (last time we did something similar it took a little under 2 hours).
There is another process inserting and updating records at the same time that I cannot stop. However, as long as those insertions/updates "eventually" get executed, I don't mind them being deferred.
My question is: Is there any way to set the priority of a specific query / operation so that it runs faster / before all the queries sent afterwards? In this case, I assume mongo has to do a lot of swapping data in and out of the database which is not helping performance.
I don't know whether the priority can be fine-tuned, so there might be a better answer.
A simple workaround might be what is suggested in the documentation:
Note: For large deletion operations it may be more effect [sic] to copy the documents that you want to save to a new collection and then use drop() on the original collection.
Another approach is to write a simple script that fetches e.g. 500 elements and then deletes them using $in. You can add some kind of sleep() to throttle the deletion process. This was recommended in the newsgroup.
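A minimal version of such a script, in the shell, could look like this (the batch size and sleep interval are just examples):
var cutoff = ISODate('XXXXX');  // same cutoff date as in the remove() above
var batch;
do {
    batch = db.collection.find({ timestamp: { $lt: cutoff } }, { _id: 1 })
                         .limit(500)
                         .map(function(doc) { return doc._id; });
    if (batch.length > 0) {
        db.collection.remove({ _id: { $in: batch } });
        sleep(100);  // throttle so concurrent inserts/updates are not starved
    }
} while (batch.length > 0);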
If you encounter this problem again in the future, you might want to:
Use a day-by-day collection, so you can simply drop the entire collection once the data becomes old enough (this makes aggregation harder), or
use a TTL collection, where items time out automatically and don't need to be deleted in bulk.
If your application needs to delete data older than a certain amount of time, I suggest using TTL indexes. Example (from the MongoDB site):
db.log.events.ensureIndex( { "status": 1 }, { expireAfterSeconds: 3600 } )
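Adapted to the timestamp field from your remove query (the 30-day retention is just an example), that would be something like:
// Documents expire once their timestamp value is older than 30 days
db.collection.createIndex({ "timestamp": 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 })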
This works like a capped collection, except data is deleted by time. The biggest win for you is that it works in a background thread, so your inserts/updates will be mostly unaffected. I use this technique on a SaaS-based product in production, and it works like a charm.
This may not be your use case, but I hope that helped.