MongoDB - recover from corrupted shard?

How do I rescue a sharded MongoDB cluster when one shard is permanently damaged?
I have a MongoDB cluster with 48 shards. Each shard is a replica set consisting of a primary and one secondary. Due to Bad Planning (tm), one of the boxes ran out of disk space and died. The other one, already close to full, then ran out of space too. Due to bad circumstances (probably a compact() or repairDatabase() running at the time), the entire shard was corrupted.
I stopped the daemons and tried to repair, but the repair would not succeed.
So, the question is: how do I accept the loss of one shard but keep the other good shards? With 48 shards, the loss of one is only about 2% of my data. I'm okay with losing that data, but I have to get back to a normal, healthy state.
What do I do?

Revised answer (the original answer is obsolete):
1. Stop all daemons on all boxes.
2. Change the config files for the primaries so that they come up as standalone instances.
3. Use mongoexport or mongodump to dump each surviving shard's data to a file. Make sure the dump contains the collections you want; if you can, exclude the _id field.
4. When the backups are complete and moved off the boxes to appropriately safe locations, clean up: delete all the data files and essentially re-create your cluster.
5. Reload your data from the backups.
Note that when you re-create the cluster, you should probably pre-populate it with a fairly large number of chunks (pre-split), so that chunk splitting doesn't take forever; a rough sketch follows after these steps.
If you end up with unbalanced shards (lots of chunks on one, few on another), pause the reload, turn off the balancer's throttle so it goes Real Fast, and once the cluster is balanced again, resume reloading.
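As a rough sketch of the dump and the pre-splitting (host names, ports, database/collection names, the shard key, and the split points are all placeholders, not taken from the question):

# On each surviving shard, restarted as a standalone mongod:
mongodump --host shard03.example.com --port 27018 --out /backups/shard03

# After re-creating the cluster, pre-split the collection before reloading
# (run in a mongos shell; assumes a numeric shard key named user_id):
> sh.enableSharding("mydb")
> sh.shardCollection("mydb.mycoll", { user_id: 1 })
> for (var i = 1; i < 100; i++) { sh.splitAt("mydb.mycoll", { user_id: i * 1000 }) }

The split points obviously have to match the actual distribution of your shard key.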

Related

Deleting selective data from MongoDB Secondary Only

Is it possible to delete data from a single Mongo secondary instance, by running a delete command directly on the secondary, without affecting the primary and the other secondary instances?
Explanation: I want to purge a large collection of ~500 GB, with ~500 million records. I want to keep only the last few months of data, so I will have to remove ~400 million records. It is a replica set with one primary and 2 secondaries. The storage engine is WiredTiger. I do not want any downtime or slowness, as it is the production DB of a live transactional system. I am considering the options below:
Create a new collection, copy the last few months of records into it, and drop the old one. But copying such a huge amount of data will slow down the DB server.
Take a backup of the entire collection, then run a bulk delete with a batch size of 1000. This will take weeks to delete so many records, and it will also create a huge oplog, as every delete produces an oplog entry that is synced to the secondaries. These oplog entries will take up a lot of disk space.
Another option is to run the bulk delete on one secondary only. Once the data is deleted, I promote it to primary and then run the same delete on the other 2 secondary instances. This would not affect the production environment. Hence the question: can we run a delete on a secondary only? Once this secondary comes back into the cluster after the deletion, what will be the behaviour of the sync process between primary and secondary?
I ran a small test on a local MongoDB cluster. In principle it seems to work when you follow this procedure:
1. Shut down the secondary.
2. Restart the secondary as a standalone (you cannot perform any changes on a SECONDARY).
3. Connect to the standalone and delete the old data (see the sketch after this list).
4. Shut down the standalone.
5. Restart the standalone normally as a replica set member.
Repeat steps (1) to (5) with the other secondary. You may run the above steps in parallel on all secondaries, but then you have no redundancy in case of problems.
Promote one of the secondaries from above to primary.
Repeat steps (1) to (5) with the last node (the former primary).
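A minimal sketch of steps (1) to (5), assuming the data is selected by a date field; the port numbers, dbpath, replica set name, and the database/collection/field names are all illustrative:

# (1) cleanly shut down the secondary (in a mongo shell connected to it)
> use admin
> db.shutdownServer()

# (2) restart it as a standalone, without --replSet, on a non-standard port
mongod --dbpath /data/db --port 27217

# (3) connect to the standalone and delete the old data
> use mydb
> db.mycollection.deleteMany({ createdAt: { $lt: ISODate("2019-01-01") } })

# (4) shut the standalone down again
> use admin
> db.shutdownServer()

# (5) restart it normally as a replica set member
mongod --dbpath /data/db --port 27017 --replSet rs0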
As I said, I did a "quick and dirty" test with a few documents and it seemed to work.
However, I don't think it will work in your setup, because:
Step (3), "delete the old data", will take some time, maybe hours or even days. When you have finished the deletion, you will most likely run into this situation:
Resync a Member of a Replica Set:
A replica set member becomes "stale" when its replication process falls so far behind that the primary overwrites oplog entries the member has not yet replicated. The member cannot catch up and becomes "stale." When this occurs, you must completely resynchronize the member by removing its data and performing an initial sync.
I.e. the initial sync would add all the deleted data back again.
Perhaps there are hacks to force a "stale" member back to SECONDARY. Then you would have to drop the old PRIMARY and add it again as a SECONDARY. But by doing this you would lose all the data newly inserted into production while step (3) was running. I assume the application is constantly inserting new data (otherwise you would not have accumulated such an amount of documents), so that data would be lost.
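One sanity check before attempting any of this: compare the expected duration of the offline deletion with your oplog window, since the whole operation has to fit inside it. In the mongo shell:

// On the primary: prints the configured oplog size and the time span
// currently covered by the oplog ("log length start to end")
> rs.printReplicationInfo()

// Shows how far each secondary currently lags behind
// (rs.printSlaveReplicationInfo() on older shell versions)
> rs.printSecondaryReplicationInfo()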

MongoDB - Removed Documents directly from Primary Shard

I have a problem in production on my cluster.
Our monitoring failed to catch the disk space usage, and the disk filled up.
I needed to remove part of the data directly on a primary shard node.
I did this on the mongod with the command:
db.collection.remove({query})
I know this is dangerous, but it was my only option at the moment because I couldn't open a mongo shell on the mongos.
The cluster now works well, but I need to know the real impact of my action, and how to resolve it.
The real impact is that you lose the data you deleted. There should be no other operational impact on the database itself. It should just return nothing when the affected documents are requested.
I'm sure you understand that deleting directly on a shard (bypassing mongos) is not a recommended action by any means. In general, bypassing mongos can result in undefined behaviour of the cluster, and the resulting issue could stay dormant for a long time. In the worst case, this could lead to a corrupt backup.
Having said that, deleting via the mongo shell (or a driver) is much preferred to going into the dbPath directory and deleting files. That action could leave the database unrecoverable.
The more immediate impact may be felt by the application, e.g. if your application expects a result and it receives none. I would encourage you to test all workflows of your application and confirm that everything is working as expected.
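As for "how to solve": there is not much to undo, but as a rough sketch you can at least verify through mongos (never directly on the shard) that the cluster metadata looks healthy and see what the application will still find; the names below are placeholders:

// Connect to a mongos, not to the shard
> sh.status()                                   // shards, chunk distribution, balancer state
> use mydb
> db.mycollection.find({ /* the query you deleted with */ }).count()
                                                // what is still visible through the cluster
> db.mycollection.getShardDistribution()        // per-shard document and chunk counts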

Expected balancing behavior when restoring data with mongorestore to a sharded cluster

I have noticed that when restoring data with mongorestore to a sharded cluster through mongos, all the records are initially saved to the primary shard (of the collection), and only the balancer process moves these chunks afterwards, which is a relatively slow process, so right after the restore I have a situation like this:
chunks:
rs_shard-1 28
rs_shard-2 29
rs_shard-4 27
rs_shard-3 644
I don't have any errors in the mongodb/mongos log files.
I'm not sure, but I think that in the past the data was restored in an already balanced way. I'm now using version 2.4.6. Can someone confirm what the expected behavior is?
Here is what happens imho:
When restoring the data, there are initial ranges for the chunks assigned to each shard. The data is inserted by mongorestore without waiting for any responses from mongos, let alone the shards, resulting in relatively fast insertion of the documents. I assume that you have a monotonically increasing shard key, like ObjectId for example. Now what happens is that one shard has been assigned the range from X to infinity (called "maxKey" in mongoland) during the initial assignment of chunk ranges. The documents in this range will be created on that shard, resulting in a lot of chunk splits and an increasing number of chunks on that server. A chunk split will trigger a balancer round, but since the insertion of new documents is faster than the chunk migration, the number of chunks increases faster than the balancer can reduce it.
So what I would do is to check the shard key. I am pretty sure that it is monotonically increasing. Which is bad not only when restoring a backup, but in production use, too. Please see shard key documentation and Considerations for Selecting Shard Keys in the MongoDB docs.
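To see which shard key a collection actually uses, and how its chunks are spread across the shards, something like this against mongos should do (database and collection names are placeholders):

> sh.status()                                   // lists every sharded collection with its shard key
> use config
> db.collections.find({ _id: "mydb.mycoll" })   // the "key" field is the shard key
> use mydb
> db.mycoll.getShardDistribution()              // documents and chunks per shard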
A few additional notes. The mongodump utility is designed for small databases, like the config db of a sharded cluster. Your database has a size of roughly 46.5GB which isn't exactly small. I'd rather use file system snapshots on each individual shard, synchronized using a cronjob. If you really need a point in time recovery, you can still use mongodump in direct file access mode on the snapshotted files to create a dump and restore those dumps using the --oplogLimit option. Other than the ability to do a point in time recovery, the usage of mongodump has no advantage over taking file system snapshots, but has the disadvantage that you have to stop the balancer in order to have a consistent backup and to lock the database during the whole backup procedure in order to have a true point in time recovery option.

MongoDB chunks selection to move

If a shard has, say, 200 chunks on it and it is time to move some of those chunks to another shard:
1. How does MongoDB decide which chunks to move?
2. Is this move logic in the config server or in mongos?
3. How can I affect/control the chunk selection algorithm so that MongoDB moves chunks to other shards in a way that helps distribute my reads based on my users' access patterns?
Movement of chunks between shards is triggered by mongos. Mongos will move chunks under two circumstances. If one shard contains 9 or more chunks more than any other, mongos will trigger a balancing round and redistribute the chunks across the other shards. In this situation, the chunks with the lowest shard keys will be moved. Additionally, if the topmost chunk is split, mongos will move the chunk with the higher shard key to another shard.
One of the features of Mongo is that in a properly set-up sharded cluster, chunks are split and moved automatically such that your application does not need to be aware that it is interacting with a sharded database. Everything happens behind-the-scenes.
However, it is possible to split and move chunks manually using the "split" and "moveChunk" commands. Please see the mongo documentation for examples of how to use these commands: "Splitting Shard Chunks" http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks and "Moving Chunks" http://www.mongodb.org/display/DOCS/Moving+Chunks. There have been cases where users have written their own custom balancers tailored to their own applications, but this is not common, and only attempted by the most advanced users.
As an alternative, it is possible to give the balancer a window of time in which it may operate, or to disable it entirely (a configuration sketch follows below the links). Some users temporarily disable the balancer for maintenance, or give it a window so it is not competing for write locks at times when they expect their application to be putting the db under high load.
More information on the balancer is available in the "Balancing" and "Balancer window" sections of the "Sharding Administration" documentation.
http://www.mongodb.org/display/DOCS/Sharding+Administration
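As a sketch of the balancer window and on/off switch (the window times are just an example; run against mongos):

// Restrict balancing to a nightly window
> use config
> db.settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
    { upsert: true }
  )

// Or disable / re-enable the balancer entirely
> sh.setBalancerState(false)
> sh.getBalancerState()
> sh.setBalancerState(true)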
Hopefully the above resources will give you a better understanding of how sharding works, and how chunks are balanced between shards.

MongoDB hot backup - copy data/db vs. replica set with fsyncLock

I read about the different MongoDB setups for doing backup without downtime. Which strategy is best or can they even be compared?
Enable journaling and simply copy the /data/db directory. It is unclear to me whether this is enough; the MongoDB home page states that you have to "snapshot it", and gives SAN and LVM as examples.
Questions:
What does "snapshot" mean in this context? Will a copy command count as a snapshot? Is it safe to copy a journaled MongoDB (2.0+) data directory on a Windows server with NTFS? How do you ensure that it is safe to do on your own filesystem and setup?
Establish a replica set with 2 servers and an arbiter. Then use rs.status() and fsyncLock/unlock to ensure data is read only on the secondary server while doing backup.
> db.fsyncLock
function () {
return db.adminCommand({fsync:1, lock:true});
}
> db.fsyncUnlock
function () {
return db.getSiblingDB("admin").$cmd.sys.unlock.findOne();
}
Questions:
If you use locks in a replica set, it seems that writes and reads can be blocked for the whole replica set; has this bug not been fixed?
What if the secondary is voted in as primary while the backup is in progress? Will the backup process stop or will the replica set stop responding to write requests until it is unlocked?
Considerations:
For now I would like the simple solution and simply copy the data/db directory with journal files and wait with the replica set. MongoDB runs on a 64 bit Windows server (RackSpace Cloud).
The best bet is to do fsync + lock on a secondary, then snapshot the volume at the disk or volume level (e.g. using lvm2, hyper-v, btrfs), unlock the database, then copy the snapshotted data files. This minimizes downtime of the secondary and is easy to restore.
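On Linux with LVM, that sequence might look roughly like this (volume group, logical volume, snapshot size and paths are placeholders; on your Windows setup the snapshot step would be a VSS or hypervisor snapshot instead):

// In a mongo shell connected to the secondary: flush to disk and block writes
> db.fsyncLock()

# On the host: take the volume snapshot (takes seconds, not hours)
lvcreate --size 10G --snapshot --name mongo-backup /dev/vg0/mongodb

// Back in the mongo shell: release the lock as soon as the snapshot exists
> db.fsyncUnlock()

# Then mount the snapshot and copy the data files off at leisure
mount /dev/vg0/mongo-backup /mnt/mongo-backup
tar -czf /backups/mongo-$(date +%F).tar.gz -C /mnt/mongo-backup .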
"Snapshotting" in this context refers to the snapshot features offered by some volume managers, file systems and hypervisors. Essentially, this is a 'copy-on-write' feature for block devices: instead of overwriting data when the OS demands it, it will write the new data elsewhere and keep both the old version and the new version readable. Snapshotting usually takes almost no time, but on some systems, it's a bad idea to keep many snapshots of the same files, because it may dramatically slow future writes.
Why I believe this is the best strategy for full backups:
Using mongodump won't store the index data. The indexes will have to be rebuilt on restore, and rebuilding indexes during a recovery can take hours - the last thing you need when everybody is yelling at you is an operation that takes hours and can't be accelerated.
Fsync + lock will block writers and might block readers; hence, it's best to do that on a (passive) secondary, not on the primary.
Halting a secondary will make it fall behind on the oplog, which is why you should keep the lock time as short as possible. Instead of copying all the data files (which could take hours) while holding the lock, merely taking a snapshot should take only a couple of seconds. Hence, oplog limits are not a concern.
Everything is 'back to normal' while the actual copy is running, which gives you peace of mind. The only difference will be higher load on a secondary during the backup, which shouldn't be a major concern.
Addressing your questions:
regarding locks in replica sets: Keep the lock time short, and use a passive secondary (which can't be elected master) so the writer queue can't stall.
"What if the secondary is voted in as primary while the backup is in progress" can't happen if your backup system is passive
For now I would like the simple solution and simply copy the data/db dir with journal files and wait with the replica set. The MongoDB runs on a 64 bit Windows server (RackSpace Cloud).
You can do that. Volume snapshotting is probably still the best way to go, giving you only seconds of downtime. If your data is small, a simple mongodump might be even easier, but make sure recovery times are acceptable (depends on your indexes).
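If you do start with plain mongodump for now, a minimal version (host and target paths are placeholders) would be:

# Dump the whole server; add --oplog once you run a replica set, so the dump
# captures a consistent point in time (then restore with --oplogReplay)
mongodump --host localhost --port 27017 --out D:\backups\mongo-2012-06-01

# Restore later with
mongorestore --host localhost --port 27017 D:\backups\mongo-2012-06-01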