Purge documents in MongoDB without impacting the working set

We have a collection of documents, and each document has a time window associated with it (for example, fields like 'fromDate' and 'toDate'). Once a document has expired (i.e. its toDate is in the past), it is no longer accessed by our clients.
So we wanted to purge these documents to reduce the number of documents in the collection and thus make our queries faster. However, we later realized that this past data could be important for analyzing patterns in how the data changes, so we decided to archive it instead of purging it completely. This is what I've come up with.
Let's say we have a "collectionA" which holds past versions of documents:
1) Query all the past documents in "collectionA" (queries are made on a secondary server).
2) Insert them into a separate collection called "collectionA-archive".
3) Delete the documents from "collectionA" that were successfully inserted into the archive.
4) Delete documents in "collectionA-archive" that meet a certain condition (we do not want to keep a huge archive).
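
A minimal mongo shell sketch of those steps, assuming a 3.2+ shell, the 'toDate' field, and the collection names above (a production version would batch the writes and handle failures; the one-year archive cutoff is just an example):

    // step 1: read expired documents from a secondary (assumes a replica set)
    db.getMongo().setReadPref("secondary");
    var now = new Date();
    db.getCollection("collectionA").find({ toDate: { $lt: now } }).forEach(function (doc) {
        // steps 2 and 3: writes always go to the primary
        db.getCollection("collectionA-archive").insertOne(doc);
        db.getCollection("collectionA").deleteOne({ _id: doc._id });
    });
    // step 4: trim the archive (the one-year cutoff is hypothetical)
    var yearAgo = new Date(Date.now() - 365 * 24 * 3600 * 1000);
    db.getCollection("collectionA-archive").deleteMany({ toDate: { $lt: yearAgo } });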
My question here is: even though I'm making the queries on the secondary, since the insertions happen on the primary, do the documents inserted into the archive collection make it into the primary's working set? The last thing we need is these past documents being held in the primary's RAM, which could affect the performance of our live API.
I know one solution could be to insert the past documents into a separate DB server, but acquiring another server is a bit of a hassle, so I would like to know whether this is achievable within one server.

Related

MongoDB query for deleted documents

I will be running a nightly cron job to query a collection and then send the results to another system.
I need to sync this collection between two systems.
Documents can be removed from the host and this deletion needs to be reflected on the client system.
So my question is: is there a way to query for documents that have been recently deleted?
I'm looking for something like db.Collection.find({RECORDS_THAT_WERE_DELETED_YESTERDAY});
I was reading about parsing the oplog; however, I don't have one set up yet. Is that something you can introduce into an existing DB?
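
The oplog only exists on replica set members, but a standalone mongod can generally be restarted with --replSet and initiated with rs.initiate() to become a single-node replica set. Once it exists, delete operations can be read from the local database; a rough sketch (the namespace and time window are hypothetical, and note the oplog is capped, so old entries eventually roll off):

    // oplog entries with op: "d" are deletes; the "o" field holds the deleted _id
    var since = Timestamp(Math.floor(Date.now() / 1000) - 86400, 0);  // roughly 24h ago
    db.getSiblingDB("local").oplog.rs.find({
        op: "d",
        ns: "mydb.myCollection",          // hypothetical namespace
        ts: { $gt: since }
    }).forEach(function (entry) {
        print(entry.o._id);               // _id of a deleted document
    });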

Optimize 'Not In' Query for NoSQL

For my application, when pulling documents from my NoSQL database, I need to ensure that they are unique documents every time, not ones that I've already pulled.
Two approaches I have considered:
1) A basic Not-In/'$nin' query that checks the potential documents against a given local array of document IDs that have already been pulled (updated on every pull as well).
2) Instead of using a $nin condition in the query, pulling candidate documents and then manually sifting out those whose IDs match a locally constructed hash map (also updated on every pull).
Both of these approaches seem slightly inefficient, so I was wondering if there was a more optimized solution that I have not thought about yet?
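
For concreteness, approach 1 in mongo shell terms might look like this (the collection name, page size, and seenIds state are all hypothetical):

    // exclude everything already pulled; seenIds grows on every pull
    var seenIds = [];                                           // client-side state
    var batch = db.items.find({ _id: { $nin: seenIds } }).limit(20).toArray();
    batch.forEach(function (doc) { seenIds.push(doc._id); });

The catch in both approaches is the same: the exclusion set grows without bound, so each pull does more work than the last.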

What is the best way to archive history data in mongo

I have a collection in Mongo that stores every user action of my application, and it is very large (3 million documents per day). On the UI I have a requirement to show user actions for at most a 6-month period.
The queries on this collection are becoming very slow with all the historic data, even though indexes are in place. So I want to move the documents that are older than 6 months to a separate collection.
Is this the right way to handle my issue?
Following are some of the techniques you can use to manage data growth in MongoDB:
Using capped collections
Using TTL indexes
Using multiple collections, e.g. one per month
Using different databases on the same host
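
As an illustration of the TTL option: a TTL index lets MongoDB delete expired documents automatically. A minimal sketch, assuming a 'createdAt' date field (the field and collection names are hypothetical; 6 months is approximated as 180 days):

    // documents are removed automatically once createdAt is older than ~180 days
    db.useractions.createIndex(
        { createdAt: 1 },
        { expireAfterSeconds: 180 * 24 * 3600 }   // 15,552,000 seconds
    );

Note that a TTL index deletes documents outright rather than archiving them, so it only fits if the history does not need to be kept.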

Is dynamically creating and dropping collections in MongoDB going to create scalability issues?

I have an application (built in Meteor) that provides some ad hoc reporting capabilities to the end user. I have built up that functionality by using the aggregation pipeline to produce the results for a given query. This made it extremely fast, and I was using $out to push the results right into a results table.
The results table included a queryID, which the client used to figure out which were the correct results.
Unfortunately, as you may know (and I discovered), that doesn't work so well once you have more than one user running reports at a time because $out deletes the whole results table before pushing the new query in.
I see three possible workarounds:
Run the aggregation, but manually push the results into the results collection
$out the results into a temporary collection (dynamically named to avoid conflicts) and then manually copy the results from there into results collection, immediately dropping the temporary one. This made some sense when I thought I could use copyTo(), but that doesn't appear possible within Meteor, so I think this option doesn't make much sense relative to #1 in this case.
$out the results into a temporary collection (dynamically named to avoid conflicts) and have the client pull its results directly from there. I would then periodically drop the extra collections after say 24 hours (like I do with specific query results in the main collection today).
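
In mongo shell terms, workaround #3 might look roughly like this (the source collection and pipeline are hypothetical):

    // give each report run its own results collection
    var queryId = ObjectId().str;                      // unique suffix per run
    db.events.aggregate([
        { $match: { type: "purchase" } },              // hypothetical pipeline
        { $group: { _id: "$userId", count: { $sum: 1 } } },
        { $out: "results_" + queryId }
    ]);
    // the client reads from that collection; cleanup later:
    // db.getCollection("results_" + queryId).drop();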
#3 would be the fastest by far - the time it takes to manually copy rows dwarfs the time it takes the queries to run. But I'm concerned about the impact of creating and dropping so many collections.
We're not talking millions of users here, but if an average of 500 users a day were each running 10-20 reports, there could be an additional 5-10k collections in the database at any one time. That seems like a lot. Perhaps I could be smarter about cleaning them up somehow, though I can't just remove them immediately, because a user might want to have multiple tabs open with different reports. Even so, we're potentially talking about hundreds to thousands of collections.
Is that going to be a problem?
Are there other approaches I should consider instead?
Other recommendations?
Thanks!
Dropping a collection in MongoDB is a very efficient operation, in any case much more efficient than deleting a subset of documents from a larger collection.
The maximum number of collections is quite high: it is limited only by the size of the namespace file in MMAPv1, while no hard limit exists in the WiredTiger engine.
So I would favor your solution #3.
Some improvements/alternatives you can think:
Consider creating the collections in a separate database (say, one per day); then you can drop the entire database in a single operation without having to drop individual collections (a sketch follows after this list).
Use an endpoint for the result set: cache the results, then drop the $out collection. Let the cache serve user requests, and only rerun the aggregation when the cache has expired.
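
A rough sketch of the per-day database idea, where queryId and reportDocs are hypothetical; note that $out writes into the source database on MongoDB versions of this era, so the results are inserted into the scratch database manually here:

    // one scratch database per day holds all report collections
    var dayTag = new Date().toISOString().slice(0, 10).replace(/-/g, "_");
    var scratch = db.getSiblingDB("reports_" + dayTag);
    scratch.getCollection("results_" + queryId).insertMany(reportDocs);
    // when a day's reports are stale, one call drops them all:
    // db.getSiblingDB("reports_" + staleDayTag).dropDatabase();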
This kind of activity is done very easily in relational databases such as MySQL or PostgreSQL. You might consider synchronising your data to a separate relational database for the purposes of reporting.
There is a package, https://github.com/perak/mysql-shadow, which claims to provide synchronisation. I played with it and it didn't work perfectly, although doing just a one-way sync is more likely to succeed.
The other option is to use GraphQL over a mongo/mysql hybrid database, which can be done with the Apollo stack: http://www.apollodata.com/

MongoDB: is indexing a pain?

Speaking in general, I want to know the best practices for querying (and therefore indexing) schemaless data structures (i.e. documents).
Let's say I use MongoDB to store and query deterministic data structures in a collection. At this point all documents have the same structure, so I can easily create indexes for any queries in my app, since I know each document has the required field(s) for the index.
What happens after I change the structure and try to save new documents to the db? Let's say I joined two fields, FirstName and LastName, into FullName. As a result, the collection contains nondeterministic data. I see two problems here:
Old indexes cannot cover the new data, so new indexes are needed that handle both the old and the new fields
The app has to deal with two representations of the documents
This may become a big problem when there are many changes in the db, resulting in many versions of the document structure.
I see two main approaches:
Lazy migration. This means that each document is migrated on demand (i.e. only after being loaded from the collection) to the final structure and then stored back to the collection. This approach does not actually solve the problem, because it concedes nondeterminism at any point in time.
Forced migration. This is the same approach as for RDBMS migrations. The migration is performed for all documents at a single point in time, while the app is not running. The main con is the downtime of the app.
So the question: Is there any good way of solving the problem, especially without app downtime?
If you can't have downtime then the only choice is to do the migrations "on the fly":
Change the application so that when new documents are saved the new field is written, while still reading from the old ones.
Update your collection with a script/queries to add the new field to the existing documents.
Create new indexes on that field.
Change the application so that it reads from the new fields.
Drop the unnecessary indexes and remove the old fields from the documents.
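
Steps 2 and 3, sketched in mongo shell for the FirstName/LastName → FullName example (the users collection is hypothetical; an aggregation-pipeline update like this needs MongoDB 4.2+, and on older versions a forEach loop over the cursor does the same job):

    // step 2: backfill FullName on documents that don't have it yet
    db.users.updateMany(
        { FullName: { $exists: false } },
        [ { $set: { FullName: { $concat: ["$FirstName", " ", "$LastName"] } } } ]
    );
    // step 3: index the new field
    db.users.createIndex({ FullName: 1 });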
Changing the schema on a live database is never an easy process, no matter what database you use. It always requires some forward thinking and careful planning.
is indexing a pain?
Indexing is not a pain, but premature optimization is. You should always test and check that you actually need indexes before adding them, and once you have them, check that they are being used properly.
If you're worried about performance issues on a live system when creating indexes, you should consider having replica sets and doing rolling maintenance (in short: taking secondaries out of replication one at a time, creating the indexes on them, bringing them back into replication, and then repeating the process for the remaining replica set members).
Edit
What I was describing is basically a process of migrating your schema to a new one while temporarily supporting both versions of the documents.
In step 1, you're basically adding support for multiple versions of documents. You're updating existing documents, i.e. creating the new fields, while still reading data from the previous version's fields. Step 2 is optional, because you can gradually update your documents as they are saved.
In step 4 you're removing the support for the previous versions from your application code and migrating to a new version. Finally, in step 5 you're removing the previous version fields from your actual MongoDB documents.