Drop a 5TB collection in Mongo without bringing down the db - mongodb

In our Mongo configuration we have a replica set with a primary and 2 secondaries. We currently have a collection that is about 5TB in size that we want to drop completely. From reading the docs, it sounds like just dropping the collection would lock the database. It seems like it might take a while to delete 5TB, and anything more than a few minutes of downtime really isn't an option.
I tried deleting records a little bit at a time via query and remove commands, but this still slowed the db down to a crawl.
I've thought about taking the primary out of the set, dropping the collection and then putting it back in the set as primary, but what will the impact be when those changes replicate to the secondaries? Is it still just going to use a ton of CPU and lock things up?
The end goal is to move all of our mongo instances to smaller disks, so it would be nice if there was an option that allowed us to tackle both the migration and the deletion of the data at the same time.
Any advice is appreciated.

Related

Oplog tailing in Meteor - to do it or not to do it?

I am trying to reconcile this kadira.io article, which says that oplog tailing is a must for every Meteor production app, with this compose.io article (see the section "To Oplog or not Oplog"), which says you should only use the oplog in certain circumstances.
Basically I have a Meteor app which does not have a high volume of users or a massive amount of continuous writing to collections.
It does however need to read a lot of data from the DB which seems to be slowing things down.
As far as I know it is only running on one server.
I am wondering whether adding oplog tailing will speed things up.
Thanks in advance.
Whether or not your app does it, the replica set is always tailing the oplog to keep all nodes in sync. Usually, if your system is not write-heavy, the tailing shouldn't be an issue because, with replication working, the latest oplog entries should be in memory. What causes stress is usually the first round, when the program tries to find where to tail from: with no index, it has to do a COLLSCAN. Other than that there's no need to worry. It's a one-time thing, so as long as you know what's going on, it should be fine.
Back to your question: yes, the tailing runs against one server. Which one depends on your readPreference and replica set tags, if any. And after the tail point has been found the first time, it shouldn't be a problem.
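For illustration, here is roughly what that first tailing query boils down to in the mongo shell (a simplified sketch, not Meteor's actual implementation; on a replica set the oplog is the capped collection local.oplog.rs):
// The newest oplog entry's timestamp is the point to start tailing from.
var oplog = db.getSiblingDB("local").oplog.rs;
var last = oplog.find().sort({$natural: -1}).limit(1).next();
// Locating the start point in the query below is the COLLSCAN mentioned above,
// since the oplog has no index on ts; after that the cursor just waits for new entries.
var cursor = oplog.find({ts: {$gt: last.ts}})
                  .addOption(DBQuery.Option.tailable)
                  .addOption(DBQuery.Option.awaitData);
while (cursor.hasNext()) {
    printjson(cursor.next()); // each document is one replicated write operation
}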

MongoDB disk space reclamation

I am familiar with both the MongoDB repairDatabase and compact commands, but these both seem to lock the database and/or collection. Is there another way to reclaim deleted disk space without essentially shutting down the database? What are best practices in this area? Thanks!
Best practice would probably depend on your schema and what your application does. Here's my use case; perhaps you can learn something from it. My application stores very large amounts of time-stamped data samples. Deleting data from a very large store is a very expensive operation, and it gets even more complicated when you try doing it on a live system. MongoDB has had several issues in the past with reclaiming disk space back to the OS and we had to dance around them; I'm not sure how well it works now. But what we did solved everything for good: we partitioned the data in such a way that we could dispose of old stuff by simply dropping an entire database. Dropping a MongoDB database is a very cheap and efficient operation, almost instantaneous even when you drop a TB. Note that dropping a collection is not as effective as dropping a database; this was actually the key to the solution, and it required redesigning the schema. Your case could of course be different, but the lesson learned is that deleting data from large storage is very expensive.
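A minimal sketch of that partitioning idea (all database and collection names here are invented for illustration): route each period's samples into its own database, and retire old data with dropDatabase() instead of per-document removes.
// Writes go to a per-month database, e.g. "samples_2015_06".
var month = "2015_06"; // would normally be derived from the sample's timestamp
db.getSiblingDB("samples_" + month).readings.insert({sensorId: 1, value: 3.7, ts: new Date()});
// Retention: dropping a whole database is nearly instantaneous even for a TB,
// unlike removing documents one by one from a huge collection.
db.getSiblingDB("samples_2014_01").dropDatabase();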
The best method currently is to run a master/slave setup.
Shut down one mongod instance and let it resync.
More details here: Reducing MongoDB database file size
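Roughly, the resync goes like this (a sketch only; the paths, port and replica set name are hypothetical, and you would do one member at a time):
# 1. Cleanly shut down one secondary
mongo --port 27017 admin --eval "db.shutdownServer()"
# 2. Remove its data files so it has to perform a fresh initial sync
rm -rf /data/db/*
# 3. Start it again with its usual options; the initial sync rewrites the data
#    files compactly, reclaiming the space freed by deleted documents
mongod --replSet rs0 --dbpath /data/db --port 27017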

MongoRestore Create Index Phase Uses 100% resources and locks up database

I'm using MongoDB. I have a table with 7M records and a weighted text search index.
When I do a mongorestore, the create-index phase of the restore uses 100% of my database's resources, and MongoDB is unresponsive to anything until it is done. My db is locked to any incoming connections. In fact, it stops reporting any progress of the index creation to my output at that point, and my MongoDB client starts getting request timeout errors. I can still tail the server-side MongoDB logs to check the progress of the index creation.
I need the database to be responsive while this process is happening. It works just fine for all my other tables, which are a bit smaller. The next largest table, which works great, and still uses a weighted text search index is around 3M records.
What do I do?! Thanks.
I haven't tried this, but it seems that indexes created with { background: true } are dumped with this property by mongodump. This property will be passed to mongorestore during the index creation phase.
Maybe you could recreate some strategic indexes with the background option, and then dump the database. Then the restore process should put less strain on the server and finish faster. Read and write operations should be allowed while MongoDB rebuilds the backgrounded indexes.
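For example, something along these lines in the shell before taking the dump (the collection, field names and weights are invented; substitute your real text index definition):
// Rebuild the weighted text index with background:true, then run mongodump.
db.articles.dropIndex("title_text_body_text"); // name of the existing index
db.articles.ensureIndex(
    {title: "text", body: "text"},
    {weights: {title: 10, body: 1}, background: true}
);
// The dumped index metadata should now carry background:true, so mongorestore
// should build it in the background as well.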
Notice that background index builds take longer to complete and result in a larger index. Also, this will not work with secondary replica set members, since background index creation operations will be foregrounded on them.
http://docs.mongodb.org/manual/tutorial/build-indexes-in-the-background/
http://docs.mongodb.org/manual/tutorial/build-indexes-on-replica-sets/
HTH.
I ran into similar issue(s):
Mongo restore took up so many resources that other database operations would simply time out or take on the order of a minute to complete (the restore is in essence a denial-of-service attack on the DB).
The Mongo restore index phase completely blocked the database.
I found that limiting the bandwidth used by the restore solved issue 1. I used the Linux tc command-line tool to achieve this, tweaking the rate and burst from very low values until other database operations started to be affected, and then scaling it back a bit. The command looked as follows:
sudo tc qdisc change dev enp3s0 root tbf rate 30000kbit burst 40000kbit latency 5ms
To solve issue 2, I found this link, which suggests you either:
update the *.metadata.json files in the dump directory to add background:true if not present.
use mongorestore's --noIndexRestore option to avoid accidentally building any indexes in the foreground, and then create the indexes with background:true after mongorestore finishes restoring the data.
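For instance (the dump path, database and index definition below are purely illustrative):
# Restore the data only, skipping every index build
mongorestore --noIndexRestore /path/to/dump
# Then build the heavy index yourself in the background once the data is in place
mongo mydb --eval 'db.articles.ensureIndex({title: "text", body: "text"}, {weights: {title: 10, body: 1}, background: true})'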
Of course, all of this is only an issue because MongoDB best practices are not being followed, namely having the operational database always run as part of a replica set. If replication is present, then one has many more options available, such as (oversimplified) taking one member out of the larger replica set, restoring to it, and then moving it back into the replica set.

MongoDB Replica Set: Disk size difference in Primary and Secondary Nodes

I just did the MongoDB replica set configuration and all looks good. All data moved to the secondary nodes properly. But when I looked at the data directory, I can see the primary has ~140G of data while the secondary has only ~110G.
Did anyone come across this kind of issue while setting up a replica set? Is this normal behavior?
When you do an initial sync from scratch on a secondary, it writes all the data fresh. This removes padding, empty space (deleted data) etc. As a result, in that respect it is similar to running a repair.
If you ran a repair on the primary (blocking operation, only to be done if absolutely necessary), then the two would be far closer overall.
If you check the output from db.stats() you should see that the various databases have the same object count; the data directory size differences are nothing to be worried about.
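For example, run something like this on both the primary and the secondary and compare (the database name is hypothetical; on older versions you may need rs.slaveOk() to query a secondary):
rs.slaveOk(); // allow reads on the secondary
var stats = db.getSiblingDB("mydb").stats();
// "objects" should match across members once replication has caught up;
// "storageSize" and "fileSize" are where the on-disk difference shows up.
printjson({objects: stats.objects, dataSize: stats.dataSize,
           storageSize: stats.storageSize, fileSize: stats.fileSize});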

When/Where is the best-practices time/place to configure a MongoDB "schema"?

In an app that uses MongoDB, when/where is the best place to make database changes that would be migrations in a relational database?
For example, how should creating indexes or setting shard keys be managed? Where should this code go?
It's probably best to do this in the shell, consciously, because you could cause havoc if you accidentally start such a command at the wrong moment or on the wrong instance.
Most importantly: do this offline on an extra slave instance if you are adding an index to an existing DB! For large data sets, building an index can take hours, even days!
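A rough sketch of that "build it offline on a spare instance" approach (ports, paths and the index spec are hypothetical, and the real procedure has a few more steps):
# 1. Restart one secondary/slave as a standalone instance on a different port
mongod --dbpath /data/db --port 37017 # same dbpath as before, but no --replSet
# 2. Build the index there while it is out of rotation
mongo --port 37017 mydb --eval 'db.events.ensureIndex({userId: 1, createdAt: -1})'
# 3. Restart it with its normal options and let it catch up, then repeat for the
#    other members, switching the primary/master over last.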
see also:
http://www.mongodb.org/display/DOCS/Indexes
http://www.javabeat.net/articles/print.php?article_id=353
http://www.mongodb.org/display/DOCS/Indexing+as+a+Background+Operation
http://nosql.mypopescu.com/post/1312926692/mongodb-indexes-and-indexing
If you have a large data set, make sure to read up on the 4square (Foursquare) outage last year:
http://www.infoq.com/news/2010/10/4square_mongodb_outage
http://blog.foursquare.com/2010/10/05/so-that-was-a-bummer/
http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html
One of the main reasons for not wanting to put indexing in a script or config file of some sort is that in MongoDB the index operation is blocking(!) -- that means MongoDB will stop other operations on the database from proceeding until the indexing is completed. Just imagine an innocent change in the code requiring a new index to improve performance -- and this change is carelessly checked in and deployed to production... and suddenly your production MongoDB is freezing up for your app server, because MongoDB is internally adding the new index before doing anything else. Ouch! Apparently that has happened to a couple of folks, which is why they keep reminding people at the MongoDB conferences to be careful not to 'programmatically' require indexes.
New versions of MongoDB allow background indexing -- you should always do that e.g. db.yourcollection.ensureIndex(..., {background: true})
otherwise, not-so-fun stuff happens:
https://jira.mongodb.org/browse/SERVER-1341
https://jira.mongodb.org/browse/SERVER-3067