Prioritize a specific long-running operation - MongoDB

I have a mongo collection with a little under 2 million documents in it, and I have a query I wish to run that will delete around 700,000 of them, based on a date field.
The remove query looks something like this:
db.collection.remove({'timestamp': { $lt: ISODate('XXXXX') }})
The exact date is not important in this case; the syntax is correct and I know it will work. However, I also know it's going to take forever (last time we did something similar it took a little under 2 hours).
There is another process inserting and updating records at the same time that I cannot stop. However, as long as those insertions/updates "eventually" get executed, I don't mind them being deferred.
My question is: is there any way to set the priority of a specific query/operation so that it runs faster or ahead of the queries sent after it? In this case, I assume mongo has to do a lot of swapping of data in and out of memory, which is not helping performance.

I don't know whether the priority can be fine-tuned, so there might be a better answer.
A simple workaround might be what is suggested in the documentation:
Note: For large deletion operations it may be more effect [sic] to copy the documents that you want to save to a new collection and then use drop() on the original collection.
Another approach is to write a simple script that fetches, say, 500 documents at a time and then deletes them using $in. You can add some kind of sleep() to throttle the deletion process. This was recommended in the newsgroup.
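The throttled batch delete can be sketched as follows. This is a Node.js sketch, not a drop-in script: the batching helper is plain JavaScript, while the actual driver calls (with assumed collection and field names) are only indicated in comments.

```javascript
// Split an array of _ids into batches of a given size.
function toBatches(ids, batchSize) {
  const batches = [];
  for (let i = 0; i < ids.length; i += batchSize) {
    batches.push(ids.slice(i, i + batchSize));
  }
  return batches;
}

// In the real script, something along these lines (names are assumptions):
//   const ids = (await coll.find({ timestamp: { $lt: cutoff } },
//                                { projection: { _id: 1 } }).toArray())
//               .map(d => d._id);
//   for (const batch of toBatches(ids, 500)) {
//     await coll.deleteMany({ _id: { $in: batch } });
//     await new Promise(r => setTimeout(r, 100)); // throttle between batches
//   }
```

Fetching only the _id values keeps memory use modest even for hundreds of thousands of matches, and the sleep between batches gives the concurrent insert/update process room to run.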
If you will encounter this problem in the future, you might want to
Use a day-by-day collection so you can simply drop the entire collection once data becomes old enough (this makes aggregation harder), or
use a TTL collection where items time out automatically and don't need to be deleted in bulk.

If your application needs to delete data older than a certain amount of time, I suggest using TTL indexes. Example (adapted from the MongoDB docs; note the indexed field must hold a BSON date for the TTL to apply):
db.log.events.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 3600 } )
This works like a capped collection, except that data is deleted by age. The biggest win for you is that deletion runs in a background thread, so your inserts/updates will be mostly unaffected. I use this technique on a SaaS product in production, and it works like a charm.
This may not be your use case, but I hope it helps.

MongoDB watch single document [scalability]

This is MongoDB's API:
db.foo.watch([{$match: {"bar.baz": "qux" }}])
Let's say that collection foo contains millions of documents. The arguments passed to watch tell the server to deliver only the changes that $match the filter, but behind the scenes the stream is still triggered by every document change.
The problem is that as my application scales, my listeners will also scale and my intuition is that I will end up having n^2 complexity with this approach.
I think that as I add more listeners, database performance will deteriorate due to changes to documents that are not part of the $match query. There are other ways to deal with this, (web sockets & rooms) but before prematurely optimizing the system, I would like to know if my intuition is correct.
Actual Question:
Can I attach a listener to a single document, such that watch's performance isn't affected by sibling documents?
When I do collection.watch([$matchQuery]), does the MongoDB driver listen to all documents and then filters out the relevant ones? (this is what I am trying to avoid)
The code collection.watch([$matchQuery]) actually means watch the change stream for that collection rather than the collection directly.
As far as I know, there is no way to attach a listener to a single document. Since I don't know of one, I'll give you a couple of tips on avoiding scalability problems with the approach you have chosen. Your code appears to use change streams, which should not cause problems unless you open too many of them.
There are two ways to accomplish this task by watching the entire collection from a single outside process, without degrading database performance.
If you use change streams, open only a single stream, with logic that checks all the conditions you need to filter for over time. The common mistake is opening many change streams for single-document filtering tasks; that is when people run into problems.
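One way to keep to a single change stream while still serving many per-document subscribers is to multiplex: route every event to the callbacks registered for that document's _id. Below is a minimal sketch of the dispatch logic in plain JavaScript; the single driver call is shown in a comment, and all names are illustrative.

```javascript
// Registry of per-document listeners: _id (as string) -> array of callbacks.
const listeners = new Map();

function addListener(id, callback) {
  if (!listeners.has(id)) listeners.set(id, []);
  listeners.get(id).push(callback);
}

// Called once per event coming off the single change stream.
function dispatch(changeEvent) {
  const id = String(changeEvent.documentKey._id);
  for (const cb of listeners.get(id) ?? []) cb(changeEvent);
}

// In the real app, open ONE stream and route everything through dispatch:
//   const stream = collection.watch([], { fullDocument: "updateLookup" });
//   stream.on("change", dispatch);
```

This keeps the number of open change streams constant as listeners scale, which is the property the answer is arguing for.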
The simpler way, since you mentioned Atlas, is to use Triggers. You can set a match expression in your Trigger's configuration so that the trigger fires only when the expression evaluates to true. With the expression below (from the documentation), the trigger function will not execute unless the field status is updated to "blocked"; many other match expressions are possible:
{
  "updateDescription.updatedFields": {
    "status": "blocked"
  }
}
I hope this helps; if not, I can keep digging. With change streams or Triggers you should be fine, as long as you're willing to write a bit of code. :)

A good way to expire specific documents in one collection and add them to another collection on expiry

I'm using Nodejs and MongoDB driver.
A two part question
I have a collection called openOffers whose documents should expire when they hit the time in closeOfferAt. I know that MongoDB offers TTL indexes with expireAt and expireAfterSeconds, but when I use these, the same TTL is applied to every document in the collection. I'm not sure I'm doing this correctly; I want document-level custom expiry. Any syntax would be very useful!
Docs in openOffers:
{
  "_id": "12345",
  "data": { ... },
  "closeOfferAt": "2022-02-21T23:22:34.023Z"
}
I want to push these expired documents to another collection, openOffersLog, so that I can retain them for later analysis.
Current approach:
I haven't figured out a way to set a custom TTL on each doc in openOffers. Currently I insert docs into both openOffers and openOffersLog together, and any data change in openOffers has to be separately propagated to openOffersLog to keep them consistent. There has to be a more scalable approach, I suppose.
EDIT-1:
I'm looking for some syntax logic that I can use for the above use case. If not possible with the current MongoDB driver, I'm looking for an alternative solution with NodeJS code I can experiment with. I'm new to both NodeJS and MongoDB -- so any reasoning supporting the solution would be super useful as well.
There are two ways to use a TTL index:
Delete after a certain amount of time - this is what you have already implemented
Delete at a specific clock time - for the details, see the MongoDB docs
The second option fulfills your requirement. Just set expireAfterSeconds to 0 (zero) when creating the index:
db.collection.createIndex({ "closeOfferAt": 1 }, { expireAfterSeconds: 0 })
Then set the expiration date in closeOfferAt when inserting each document; MongoDB will remove the document at that timestamp:
db.collection.insertOne({
  "_id": "12345",
  "data": { ... },
  "closeOfferAt": ISODate("2022-02-23T06:14:15.840Z")
})
Do not make your application logic depend on TTL indexes.
In your app you should have a scheduler that runs periodic tasks; one of them can move finished offers to the other collection and delete them from the original, in bulk, with no TTL index needed.
For consistency, nothing beats a single source of truth, so if you can, avoid deleting and instead just change a status flag and some timestamps.
A good use of a TTL index is to automatically clear out old data after a relatively long time, like a month or more. That keeps collection and index sizes in check.
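The scheduler approach can be sketched with an aggregation that archives expired offers via $merge and then deletes them. The pipeline is built as plain objects below; the actual aggregate and deleteMany calls are shown in comments, and the collection names are the ones from the question.

```javascript
// Build the archive pipeline: select offers past their cutoff and upsert
// them into openOffersLog ($merge is available in MongoDB 4.2+).
function buildArchivePipeline(cutoff) {
  return [
    { $match: { closeOfferAt: { $lte: cutoff } } },
    { $merge: { into: "openOffersLog", on: "_id",
                whenMatched: "replace", whenNotMatched: "insert" } },
  ];
}

// A periodic task (cron, node-cron, etc.) would run roughly:
//   const cutoff = new Date();
//   await db.collection("openOffers").aggregate(buildArchivePipeline(cutoff)).toArray();
//   await db.collection("openOffers").deleteMany({ closeOfferAt: { $lte: cutoff } });
```

Because $merge replaces on _id, rerunning the job after a crash is safe: already-archived offers are simply replaced with identical copies before the delete runs.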

Performance loss with large collections

I have a collection named "test" with 132K documents in it. Fetching the first document of the collection takes 2-5 ms, but the last document is another story: it takes 100-200 ms to pull.
So I've decided to ask the community.
My questions
What is the best number of documents per collection for performance?
Why does it take so long to fetch the last document of the collection? (I admit I only partially understand how MongoDB works.)
What should I do about this issue and future problems?
After some research into how MongoDB works, I found the solution. I wasn't using any indexes on my collection, so every query scanned all the documents. After creating the indexes I needed, queries became much faster, around 1 ms.
Conclusion
Create indexes for your collection to match your query patterns; reads become far faster, and targeted updates and deletes benefit as well, since they can locate documents through the index. Also research the available options, such as background index builds, which avoid blocking operations while an index is being created.
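The difference the answer describes can be illustrated without a database: without an index, reaching the N-th document means examining N documents, while an index gives a direct lookup. A rough in-memory analogy (a Map standing in for a B-tree index; all names and numbers are illustrative):

```javascript
// 132K fake documents, mirroring the collection size in the question.
const docs = Array.from({ length: 132000 }, (_, i) => ({ _id: i, value: i * 2 }));

// "Collection scan": walk documents until the match is found.
function collScan(target) {
  let scanned = 0;
  for (const doc of docs) {
    scanned++;
    if (doc._id === target) return { doc, scanned };
  }
  return { doc: null, scanned };
}

// "Index": a map from _id to document, built once up front.
const index = new Map(docs.map(d => [d._id, d]));

console.log(collScan(0).scanned);      // 1 document examined (the "first" doc)
console.log(collScan(131999).scanned); // 132000 documents examined (the "last" doc)
console.log(index.get(131999)._id);    // 131999, via direct lookup, no scan
```

In mongosh, db.test.createIndex({ someField: 1 }) plus .explain() on the slow query shows whether an IXSCAN has replaced the COLLSCAN.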

Dealing with a LARGE data in mongodb

This is going to be a "general-ish" question, but I have a reason for that: I am not sure what kind of approach to take to make things faster.
I have a MongoDB server running on a big AWS instance (r3.4xlarge, 16 vCPU cores and 122 GB of RAM). The database has one huge collection with 293,017,368 documents in it. Some of them have a field called paid which holds a string value and some do not; likewise, some have an array called payment_history and some do not. I need to perform some tasks on that database, but all the documents that lack paid, payment_history, or both are irrelevant to me. So I thought of cleaning (shrinking) the DB before proceeding with the actual operations. Since the first step is to check something like ({paid: {$exists: false}}) to delete records, I figured I should create an index on paid. At the present rate, I can see the index will take 80 days to finish.
I am not sure what my approach should be in this situation. Shall I write a map-reduce job that visits every document, performs whatever I need in one pass, and writes the resulting documents to a different collection? Shall I somehow (I'm not sure how) split the huge DB into small sections, apply the transforms to each, and then merge the resulting records into a final cleaned set? Or shall I somehow (again, not sure how) load the data into Elastic MapReduce or Redshift and operate on it there? In a nutshell, what do you think is the best route to take in such a situation?
I am sorry in advance if this question sounds a little bit vague. I tried to explain the real situation as much as I could.
Thanks a lot in advance for help :)
EDIT
Following the comment about sparse indexing, I am now building a partial index with this command: db.mycol.createIndex({paid: 1}, {partialFilterExpression: {paid: {$exists: true}}}). It indexes roughly 53 documents per second, so I am not sure how long the whole collection will take. I am leaving it running overnight and will come back tomorrow to update this question. I intend to document the entire journey here, for the sake of people who hit the same problem in the same situation.
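If the partial index turns out to be too slow, the "copy what you want to keep, then drop" strategy from the earlier answer also applies here: select only the relevant documents (those with paid or payment_history) into a new collection in one aggregation pass. A sketch, with the pipeline built as plain objects and an assumed target collection name; the actual call is shown in a comment.

```javascript
// Build a pipeline that keeps only documents with paid and/or payment_history
// and writes them to a fresh collection with $out.
function buildCopyPipeline(target) {
  return [
    { $match: { $or: [ { paid: { $exists: true } },
                       { payment_history: { $exists: true } } ] } },
    { $out: target },
  ];
}

// In mongosh or the driver (target collection name is an assumption):
//   db.mycol.aggregate(buildCopyPipeline("mycol_relevant"), { allowDiskUse: true });
// then verify counts and drop/rename the original collection.
```

This still scans all 293M documents once, but a single sequential pass that writes out the keepers is generally far cheaper than building an index and then deleting ~80% of the collection document by document.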

Is moving documents between collections a good way to represent state changes in MongoDB?

I have two collections, one (A) containing items to be processed (relatively small) and one (B) with those already processed (fairly large, with extra result fields).
Items are read from A, get processed and save()'d to B, then remove()'d from A.
The rationale is that indices can be different across these, and that the "incoming" collection can be kept very small and fast this way.
I've run into two issues with this:
if either remove() or save() time out or otherwise fail under load, I lose the item completely, or process it twice
if both fail, the side effects happen but there is no record of that
I can sidestep the double-failure case with findAndModify locks (not needed otherwise, since we have a process-level lock), but then we get stale-lock issues, and partial failures can still happen. There's no way to atomically remove+save across collections, as far as I can tell (maybe by design?)
Is there a Best Practice for this situation?
There's no way to atomically remove+save to different collections, as far as I can tell (maybe by design?)
Yes, this is by design. MongoDB explicitly does not provide joins or transactions; remove + save is a form of transaction. (Multi-document transactions only arrived later, in MongoDB 4.0.)
Is there a Best Practice for this situation?
You really have two low-complexity options here, both involve findAndModify.
Option #1: a single collection
Based on your description, you are basically building a queue with some extra features. If you leverage a single collection then you use findAndModify to update the status of each item as it is processing.
Unfortunately, that means you will lose this benefit: "...that the 'incoming' collection can be kept very small and fast this way."
Option #2: two collections
The other option is basically a two-phase commit leveraging findAndModify.
Take a look at the MongoDB docs on the two-phase commit pattern for the details.
Once an item is processed in A you set a field to flag it for deletion. You then copy that item over to B. Once copied to B you can then remove the item from A.
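The flag -> copy -> remove sequence can be illustrated in memory, with arrays standing in for collections A and B. Each step mirrors one database operation, so a crash between steps leaves an item flagged or duplicated, never lost. All names are illustrative.

```javascript
// In-memory stand-ins for the two collections.
const A = [{ _id: 1, payload: "x" }];
const B = [];

function moveItem(id) {
  const item = A.find(d => d._id === id);
  if (!item) return;                      // already moved: safe to re-run
  item.pendingDelete = true;              // step 1: flag in A (findAndModify)
  if (!B.some(d => d._id === id)) {       // step 2: idempotent copy to B
    const { pendingDelete, ...rest } = item;
    B.push(rest);
  }
  const i = A.findIndex(d => d._id === id);
  A.splice(i, 1);                         // step 3: remove from A
}

moveItem(1);
console.log(A.length, B.length); // 0 1
```

A recovery job then only needs two rules: items flagged pendingDelete that exist in B can be removed from A, and flagged items missing from B get re-copied, which is exactly the cron-based cleanup the next paragraph suggests.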
I haven't tried this myself yet, but the book 50 Tips and Tricks for MongoDB Developers mentions several times using cron jobs (or services/schedulers) to clean up data like this. You could leave the documents in collection A flagged for deletion and run a daily job to clear them out, reducing the scope of the original transaction.
From what I've learned so far, I'd never leave the database in a state where I rely on the next database action succeeding, unless it is the last action (journaling will replay the last db action upon recovery). For example, I have a three-phase account registration process where I create a user in CollectionA and then add a related document to CollectionB. When I create the user, I embed the details of the CollectionB document in CollectionA in case the second write fails. Later I will write a process that removes the embedded data from CollectionA once the document in CollectionB exists.
Not having transactions does cause pain points like this, but I think in some cases there are new ways of thinking about it. In my case, time will tell as I progress with my app