Mongo Collection with TTL index not reclaiming disk space - mongodb

I have a Mongo collection with a TTL index. I can see the documents are being evicted as expected, but I don't see the disk space being reclaimed. Has anyone seen this kind of issue?
Let me know if you need more details.

We discussed this with the MongoDB team, and based on all the information there are a couple of things to understand; it is not that simple.
TTL will delete the documents, and ideally the space should be reclaimed, but it will not be.
If new documents come in, the freed space will be reused for them.
If the collection size is going to remain constant, you need to run the compact command. In the case of a sharded cluster, it has to be run on each shard.
Another option is to create a new collection, move your data into the newer collection, and, once done, drop the previous collection.
Or take a backup of this collection, drop the collection, and then restore it (see the sketch below).
After all of these things, there is a possibility that mongod still holds on to the space and you need to restart the cluster; once restarted it will release the storage.
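As a rough sketch of the backup-and-restore route, something along these lines from the command line; the database name mydb, collection name mycoll and the /backup path are all placeholders:
mongodump --db mydb --collection mycoll --out /backup
# then, from the mongo shell, drop the bloated collection: db.mycoll.drop()
mongorestore --db mydb --collection mycoll /backup/mydb/mycoll.bson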

Related

Mongo DB collection size not changed after removal of fields from document

To reduce my existing collection size, I removed some unwanted fields from the documents in the collection. Before and after, I ran the collection stats to check the size, but it never changed.
Am I missing something (do I need to run an update or anything else) for the reduced size to show up in my stats? Please advise.
I am running this in my local PC, not having any other nodes.
Thank you.
This is correct behavior.
The only time mongo releases disk space is when you drop a database or do a repair. (see here)
Basically, MongoDB keeps any space it has allocated until you drop a database or run a repair. It does this because allocating new space is time-consuming, so once it has the space it keeps it and reuses the freed room inside its existing files until none is left, and only then allocates more.
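If you want to see what is actually allocated versus used, db.collection.stats() is the place to look ("mycollection" is a placeholder name):
db.mycollection.stats()
// "size"        : size of the data (documents plus padding) in the collection
// "storageSize" : disk space allocated to the collection; only a repair or drop shrinks this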
As mentioned by Daniel F in a comment on the question, this worked for me in the mongo shell:
use your-database-name
db.runCommand({ compact: "your-collection-name" })
I know this is an old question, but I'll add this here just in case. Read the documentation on compact carefully, as it causes the node it is run on to stop taking most of its traffic until it is done.
All the previous ideas have their use cases; I'll add another one that might be the best fit. If you are able to query the subset of documents you want to keep, that subset is a lot smaller than the entire collection, and the database has plenty of other collections you don't want to wait on restoring from scratch, then you can write that subset to another collection, apply indexes, and, when done, plan a short downtime if necessary and switch the collections.
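A hedged sketch of that approach in the mongo shell, assuming an unsharded collection and that the subset can be expressed as a query; the collection names, the query and the index below are placeholders:
// copy the subset you want to keep into a new collection
db.events.aggregate([
    { $match: { date: { $gte: new Date("2020-01-01") } } },  // the subset you want to keep
    { $out: "events_trimmed" }
])
db.events_trimmed.createIndex({ date: 1 })                    // recreate whatever indexes you need
// during the planned downtime, swap the collections
db.events_trimmed.renameCollection("events", true)           // dropTarget=true drops the old, bloated collection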

Is space reclaimed when I delete an index at MongoDB?

I have looked around the whole documentation but can't figure out if this actually happens or not. If I remove an index from a collection in MongoDB, does it delete the index files right away? Is space reclaimed?
No, MongoDB won't automatically release disk space after collection data or indexes are deleted. Allocating new files is relatively slow compared to other operations in a high-performance database, so MongoDB keeps all previously allocated files open and available by design.
If you need to reclaim disk space, use the repairDatabase command, which achieves compaction as a side effect of its checking/fixing functionality.
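A minimal example, run against the database you want compacted; note that repairDatabase blocks the database while it runs and needs free disk space on the order of your current data set size:
use mydb                               // "mydb" is a placeholder database name
db.runCommand({ repairDatabase: 1 })   // rewrites the data files, reclaiming free space as a side effect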
An alternative available when using replica sets is to add a new member and let it sync: the data will be written fairly compactly into that member's fresh database extent files. To compact all members you would do this in a rolling fashion, and probably force the primary to step down at the end so it can be re-synced too.

TTL index on oplog or reducing the size of oplog?

I am using MongoDB with Elasticsearch for my application. Elasticsearch creates its indexes by monitoring the oplog collection. When both applications are running constantly, any changes to the collections in MongoDB are immediately indexed. The only problem I face is that if for some reason I have to delete and recreate the index, it takes ages (2 days) for the indexing to complete.
When I looked at the size of my oplog, its default capacity is 40 GB and it is holding around 60 million transactions, which is why creating a fresh index takes so long.
What would be the best way to optimize fresh index creation?
Is it to reduce the size of the oplog so that it holds fewer transactions without affecting my replication, or is it possible to create a TTL index (which I failed to do on several attempts) on the oplog?
I am using elasticsearch with mongodb using mongodb river https://github.com/richardwilly98/elasticsearch-river-mongodb/.
Any help to overcome the above mentioned issues is appreciated.
I am not an Elasticsearch pro, but your question:
What would be the best way to optimize fresh index creation?
applies, at least a little, to everyone who uses third-party full-text search technologies with MongoDB.
The first thing to note is that if you have A LOT of records then there is no easy way around this unless you are prepared to lose some of them.
The oplog isn't really a good fit for this. Personally, I would look at using a custom script driven by timestamps in the main collection, or a change table that gives you a single place to quickly query for new or updated records.
Unless you filter the oplog to get specific records, i.e. inserts, you could be pulling out ALL oplog records, including deletes, collection operations and even database operations. You could try stripping unneeded records out of your oplog search; however, this creates a new problem: the oplog has no indexes and no index updating.
This means that if you start reading it in a more selective manner, you will actually be running an unindexed query over those 60 million records, which will result in slow(er) performance.
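For illustration, filtering the oplog down to inserts on one namespace would look something like this ("mydb.mycoll" is a placeholder); keep in mind the find below is unindexed, so it still scans the whole capped collection:
use local                                           // the oplog lives in the local database
db.oplog.rs.find({ op: "i", ns: "mydb.mycoll" })    // "i" = insert; other op values include "u" (update) and "d" (delete)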
The fact that the oplog does not support adding indexes answers another one of your questions:
is it possible to create a ttl index(which I failed to do on several attempts) on oplog.
Nope.
As for the other one of your questions:
Is it to reduce the size of oplog so that it holds less number of transactions
Yes, but you will have a smaller replication recovery window, and on top of that you will lose records from your "fresh" index, so only part of your data is actually indexed. I am unsure, from your question, whether this is a problem or not.
You can reduce the oplog for a single secondary member that no replica is syncing from. Look up rs.syncFrom and "Change the Size of the Oplog" in the MongoDB docs.
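Before resizing anything, it is worth checking how much replication window your current oplog actually gives you:
db.printReplicationInfo()   // shows the configured oplog size, how much is used, and the time span it covers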

MongoDB very slow deletes

I've got a small replica set of three mongod servers (16GB RAM each, at least 4 CPU cores and real HDDs) and one dedicated arbiter. The replicated data has about 100,000,000 records currently. Nearly all of this data is in one collection with an index on _id (the auto-generated Mongo ID) and date, which is a native Mongo date field. Periodically I delete old records from this collection using the date index, something like this (from the mongo shell):
db.repo.remove({"date" : {"$lt" : new Date(1362096000000)}})
This does work, but it runs very, very slowly. One of my nodes has slower I/O than the other two, having just a single SATA drive. When this node is primary, the deletes run at about 5-10 documents/sec. By using rs.stepDown() I have demoted this slower primary and forced an election to get a primary with better I/O. On that server, I am getting about 100 docs/sec.
My main question is, should I be concerned? I don't have the numbers from before I introduced replication, but I know the delete was much faster. I'm wondering if the replica set sync is causing I/O wait, or if there is some other cause. I would be totally happy with temporarily disabling sync and index updates until the delete statement finishes, but I don't know of any way to do that currently. For some reason, when I disable two of the three nodes, leaving just one node and the arbiter, the remaining node is demoted and writes are impossible (isn't the arbiter supposed to solve that?).
To give you some indication of the general performance, if I drop and recreate the date index, it takes about 15 minutes to scan all 100M docs.
This is happening because even though
db.repo.remove({"date" : {"$lt" : new Date(1362096000000)}})
looks like a single command, it is actually operating on many documents, as many as satisfy the query.
When you use replication, every change operation has to be written to a special collection in the local database called oplog.rs (oplog for short).
The oplog has to have an entry for each deleted document and every one of those entries needs to be applied to the oplog on each secondary before it can also delete the same record.
One thing I can suggest you consider is TTL indexes: they will "automatically" delete documents based on an expiration date/value you set, so you won't have one massive delete and will instead be able to spread the load out over time.
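For the collection in the question, a TTL index would look something like this; the 30-day window is just an example value:
// mongod's background TTL monitor removes expired documents in small batches, roughly once a minute
// if a plain index on { "date": 1 } already exists, it has to be dropped first
db.repo.ensureIndex({ "date": 1 }, { expireAfterSeconds: 2592000 })   // 30 days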
Another suggestion that may not fit your case, but it was the optimal solution for me:
drop the indexes from the collection
iterate over all entries of the collection and store the _ids of the records to delete in an in-memory array
each time the array is big enough (for me it was 10K records), remove these records by their _ids
rebuild the indexes
It is the fastest way, but it requires stopping the system, which was acceptable for me.
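A rough mongo shell sketch of that approach; the 10K batch size comes from the answer above, and as noted it assumes the system can be stopped while the indexes are dropped and rebuilt:
var ids = [];
db.repo.find({ "date": { "$lt": new Date(1362096000000) } }, { _id: 1 }).forEach(function (doc) {
    ids.push(doc._id);
    if (ids.length >= 10000) {                    // flush a batch of ids
        db.repo.remove({ _id: { $in: ids } });
        ids = [];
    }
});
if (ids.length > 0) {                             // remove the final partial batch
    db.repo.remove({ _id: { $in: ids } });
}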

MongoDB filtered replication collection

I have a capped local MongoDB collection A, and I'd like to replicate it to a MongoDB collection B on a cloud server. In B, I want to keep the documents from A that will be deleted because of A's capped size.
Collection A        replicate       Collection B
--------------     ---------->      --------------
Capped at 50MB                      Infinite size!
local                               cloud server
Is this possible in MongoDB (as it is possible in CouchDB using filters)?
Or should I search for a totally different approach?
Thanks for your advice!
The deletes in a capped collection are not operations, and so they are not replicated via the oplog. Hence, all you need to do is make the collection non-capped on a secondary and it will simply continue to grow as you add data to the capped collection and those ops are replicated. Try something like this:
Add the secondary on the cloud server as normal
Stop that new secondary and restart it outside the set without the --replSet argument
Drop the capped collection and recreate it with the same name as a regular, non-capped collection
(Optional) Re-import the data manually using mongodump/mongorestore
Restart the secondary with the original replica set parameters
The new normal collection will just keep growing
If you want to delete the collection or make other changes you will need to take the secondary out each time, but otherwise this should behave as you want. I haven't done this myself explicitly, but I have seen this happen accidentally and had to do the reverse :)
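For step 3 in the list above, the commands on the temporarily stand-alone secondary would be along these lines ("mydb" and "A" are placeholder names):
use mydb
db.A.drop()                // drop the capped collection
db.createCollection("A")   // recreate it under the same name as a regular, non-capped collection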