We have 1 test MongoDB cluster that includes:
1 mongos server
3 config servers
6 shards
Q1. We have tried to restore an outdated config server backup. We can only see that config.chunks has fewer records than before, but we can still query and insert/update data in MongoDB. What is the worst that can happen if we use an outdated config server backup?
Q2. Are there any tools that can rebuild the lost records in the config server from the existing data in each shard?
Answer to Q1
With outdated config server contents, there may be a huge, unnoticed loss of data. Here is why:
Sharding in MongoDB is based on key ranges; each shard is assigned the ranges of shard key values it is responsible for.
For illustration purposes, let's assume your shard key is an integer starting at 1 and growing towards infinity. The key ranges could then look like this (boundaries exclusive):
shard0001: -infinity to 100
shard0002: 101 - 200
shard0003: 201 - 300
shard0004: 301 - 400
shard0005: 401 - 500
shard0006: 501 - 600
So how does your mongos know about this distribution? It is stored on the config servers. Now let's assume that the metadata has changed since your backup and shard0002 actually holds the data from 100 to 500. Say you want to retrieve the document with shard key 450. According to the old metadata, this document has to be on shard0005, if it exists at all, so the query gets routed to shard0005. An index lookup is done there and the shard reports that it does not have the document. So although the document exists (on shard0002), the outdated metadata makes mongos look it up on shard0005, where it does not exist.
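You can inspect this metadata yourself through a mongos; a quick sketch, with the namespace being illustrative:
// In the mongos shell: list the chunk ranges and their owning shards for one collection
use config
db.chunks.find({ ns: "yourDb.yourShardedCollection" },
               { shard: 1, min: 1, max: 1, _id: 0 }).sort({ min: 1 })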
Answer to Q2
Not as far as I know. What you could do, however, is use the following procedure for MongoDB < 3.0.0.
Disclaimer
I haven't tested this procedure. Make sure you have the backups ready before wiping the data directories, and do not omit the --repair and --objcheck flags.
For maximum security, create filesystem snapshots before using it.
If you don't, please do not blame me for any data loss.
Shut down the whole cluster gracefully
Use mongodump against the data directory
mongodump --repair --dbpath /path/to/your/datafiles -o /path/for/backups/mongo
Do this once for each shard.
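A hedged sketch of what that looks like in practice, with per-shard output directories so the dumps do not overwrite each other (all paths are illustrative):
# Run on each shard host against its (stopped) data directory
mongodump --repair --dbpath /data/shard0001 -o /backups/mongo/shard0001
mongodump --repair --dbpath /data/shard0002 -o /backups/mongo/shard0002
# ... and so on for the remaining shards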
Wipe all data directories and recreate your sharded cluster
Connect to a mongos
sh.enableSharding("yourDb")
sh.shardCollection("yourDb.yourShardedCollection", { "yourShardKey": 1 })
For each shard's dump, use mongorestore to write the backup to a mongos
mongorestore -h mongosHost:mongosPort --db yourDb --dir /path/for/backups/ \
--objcheck --writeConcern "{w:1}"
Note that you should NOT do the restores in parallel, since this might well overload the balancer.
What we basically do is gather all the data from the individual shards, create a new sharded collection in a new database, and put the collected data into that database, letting the sharded collection be balanced automatically.
Watch the process very carefully and make absolutely sure that you do not overload the balancer; otherwise you might run out of disk space on a shard, especially if you do not have an optimal shard key.
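If you want to keep an eye on the balancer and the chunk distribution while the restores run, a few helpers in the mongos shell are enough; a small sketch (the database name is illustrative):
// In the mongos shell
sh.isBalancerRunning()              // true while a migration round is in progress
sh.getBalancerState()               // true if the balancer is enabled at all
sh.status()                         // chunk counts per shard for every sharded collection
db.getSiblingDB("yourDb").stats()   // rough on-disk footprint of the restored database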
Of course, you can recreate other sharded databases from the backup by using mongorestore accordingly. To restore unsharded databases, simply connect to the replica set that should hold the collection instead of connecting to the mongos.
Side note:
If you need to restore a config server, simply dump one of the other two and restore the config database to that server.
The reason this works is that the metadata cannot be updated unless all config servers are up, running, and in sync.
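A hedged sketch of that dump-and-restore, with host names and paths being illustrative:
# Dump the config database from a healthy config server ...
mongodump -h healthyConfigHost:27019 --db config -o /backups/configdump
# ... and restore it onto the rebuilt config server
mongorestore -h rebuiltConfigHost:27019 --db config /backups/configdump/config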
Related
Is it safe to run mongodump against a running server with many writes per second? Is it possible to get a corrupted dump this way?
From here:
Use --oplog to capture incoming write operations during the mongodump operation to ensure that the backups reflect a consistent data state.
Does this mean that no matter how many writes hit the database, the dump will be consistent?
If I run mongodump --oplog at 1 AM and it finishes at 2 AM, and I then run mongorestore --oplogReplay, what state will I get?
From here:
However, the use of mongodump and mongorestore as a backup strategy can be problematic for sharded clusters and replica sets.
but why? I have a replica set of 1 primary and 2 secondaries. What is the problem with running mongodump against one of the secondaries? It should be the same as the primary (apart from replication lag).
The docs are quite clear about it:
--oplog
Creates a file named oplog.bson as part of the mongodump output. The oplog.bson file, located in the top level of the output directory, contains oplog entries that occur during the mongodump operation. This file provides an effective point-in-time snapshot of the state of a mongod instance. To restore to a specific point-in-time backup, use the output created with this option in conjunction with mongorestore --oplogReplay.
Without --oplog, if there are write operations during the dump operation, the dump will not reflect a single moment in time. Changes made to the database during the update process can affect the output of the backup.
--oplog has no effect when running mongodump against a mongos instance to dump the entire contents of a sharded cluster. However, you can use --oplog to dump individual shards.
Without --oplog you still get a valid dump, just a bit inconsistent - some of the writes done between 1 AM and 2 AM will be missing.
With --oplog you additionally have the oplog entries captured up to 2 AM. The dump itself is still inconsistent, but replaying the oplog on restore fixes this and gives you the state as of the end of the dump.
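Put together, the 1 AM / 2 AM scenario looks roughly like this (host names and paths are illustrative):
# Started at 1 AM, finishes at 2 AM; oplog.bson captures the writes made in between
mongodump --host rsMember:27017 --oplog -o /backups/1am
# Restoring with --oplogReplay yields the state as of 2 AM
mongorestore --host restoreTarget:27017 --oplogReplay /backups/1am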
The problems with dumping sharded clusters deserve a dedicated page in the docs, essentially because of the complexity of synchronising the backups of all nodes:
To create backups of a sharded cluster, you will stop the cluster balancer, take a backup of the config database, and then take backups of each shard in the cluster using mongodump to capture the backup data. To capture a more exact moment-in-time snapshot of the system, you will need to stop all application writes before taking the filesystem snapshots; otherwise the snapshot will only approximate a moment in time.
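The balancer part of that procedure can be driven from the mongos shell; a brief sketch:
// Before the backups: stop chunk migrations
sh.stopBalancer()
sh.getBalancerState()   // should now report false
// ... take the config database and per-shard backups ...
// Afterwards: re-enable the balancer
sh.startBalancer()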
There is no problem dumping a replica set.
We have a three-server replicaset running MongoDB 2.2 on Ubuntu 10.04, and recently had to upgrade the hard drive for each server where one particular database resides. This database contains log information for web service requests, where they write to collections in hourly buckets using the current timestamp to determine the name, e.g. log_yyyymmddhh.
I performed this process:
back up the database on the primary server with mongodump --db log_db
take a secondary server offline, replace the disk
bring the secondary server up in standalone mode (i.e. comment out the replSet entry
in /etc/mongodb.conf before starting the service)
restore the database on the secondary server with mongorestore --drop --db log_db
add the secondary server back into the replicaset and bring it online,
letting replication catch up the hourly buckets that were updated/created
while it had been offline
Everything seemed to go as expected, except that the collection which was the current bucket at the time of the backup was not brought up to date by replication. I had to manually copy that collection over by hand to get it up to date. Note that collections which were created after the backup were synched just fine.
What did I miss in this process that caused MongoDB not to get things back in sync for that one collection? I assume something got out of whack with regard to the oplog?
Edit 1:
The oplog on the primary showed that its earliest timestamp went back a couple of days, so there should have been plenty of room to cover the few hours the secondary was offline.
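For reference, the oplog window can be checked from the shell; a quick sketch:
// On the primary: oplog size and the time range it currently covers
db.printReplicationInfo()
// Also on the primary: how far behind each secondary currently is
db.printSlaveReplicationInfo()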
Edit 2:
Our MongoDB installation uses two disk partitions: /dev/sda1 and /dev/sdb1. The primary MongoDB directory /var/lib/mongodb/ is on /dev/sda1, and holds several databases, while the log database resides by itself on /dev/sdb1. There's a sym link /var/lib/mongodb/log_db which points to a directory on /dev/sdb1. Since the log db was getting full, we needed to upgrade the disk for /dev/sdb1.
You should be using mongodump with the --oplog option. Running a full database backup with mongodump on a replicaset that is updating collections at the same time may not leave you with a consistent backup. This becomes worse with larger databases, more collections and more frequent updates/inserts/deletes.
From the documentation for your version (2.2) of MongoDB (it's the same for 2.6 but just to be as accurate as possible):
--oplog
Use this option to ensure that mongodump creates a dump of the
database that includes an oplog, to create a point-in-time snapshot of
the state of a mongod instance. To restore to a specific point-in-time
backup, use the output created with this option in conjunction with
mongorestore --oplogReplay.
Without --oplog, if there are write operations during the dump
operation, the dump will not reflect a single moment in time. Changes
made to the database during the update process can affect the output
of the backup.
http://docs.mongodb.org/v2.2/reference/mongodump/
This is not covered well in most MongoDB tutorials around backups and restores. Generally you are better off if you can perform a live snapshot of the storage volume your database resides on (assuming your storage solution has a live snapshot ability compatible with MongoDB). Failing that, your next best bet is taking a secondary offline and then performing a snapshot or backup of the database files. Mongodump on a live database is increasingly a less optimal solution for larger databases due to performance issues.
I'd definitely take a look at the MongoDB overview of backup options: http://docs.mongodb.org/manual/core/backups/
I would guess this has to do with the oplog not being long enough, although it seems like you checked that and it looked reasonably big.
Still, when adding new members to a replica set you shouldn't be snapshotting and restoring them. It's better to simply add a new member and let replication happen by itself. This is described in the Mongo docs and is the process I've always followed.
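For completeness, that simpler procedure amounts to little more than the following (the hostname is illustrative):
// On the primary: add the freshly provisioned, empty member and let initial sync do the work
rs.add("newMember.example.net:27017")
rs.status()   // watch it move through STARTUP2 to SECONDARY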
I have a MongoDB replica set with 2 members, 1 primary and 1 secondary. If I issue show dbs, both of them show the following:
local 24.06640625GB
test 0.203125GB
db1 9.94921875GB
db1test 0.953125GB
And then I issue use db1 -> db.events.count(); the result is 1003130 documents on both members.
That makes sense, since they replicate each other: db1 and db1test on both DB servers have the same disk usage and the same number of documents in each collection.
Then I decide to add a new member (a new DB server) which has an empty /data/db. I start the new server with:
sudo mongod --replSet rs0 --fork --logpath /var/log/mongodb/mongodb.log
Then on the primary server, I issue
rs.add('ipOfNewDBServer:27017')
After a few seconds, my new MongoDB server's shell prompt changes from > -> STARTUP2 -> rs0:SECONDARY, which I think means the sync has started.
On the newly added MongoDB server I issue show dbs; it looks like the following:
local 22.0673828125GB
test 0.203125GB
db1 1.953125GB
db1test 0.453125GB
The disk usage of each database is not the same as on the other two servers (the original primary and secondary). However, if I issue use db1 -> db.events.count(), the result is 1003130, the same as on the other two, and when I check the other collections in db1 they all match as well.
I wonder why the database disk usage differs while the collections in each database hold the same number of documents. Please correct me if I did anything wrong in syncing the data from the two existing members to the new one. The official MongoDB documentation says "This procedure relies on MongoDB's regular process for initial sync", but I have no idea what that means here. Please help, thanks.
The new member of the replica set has the benefit of no fragmentation, because it performs a full initial sync against the replica set. The existing members very likely have fragmentation due to deletes and document updates moving documents around, so they use more disk space for the same data.
In our environment, we periodically take each member of the replica set offline, wipe its data directory, and let it do a full resync to drive out fragmentation. It works for us, but our dataset may be "small" relative to other deployments. I think there is a way to do this through the console with some db.runCommand, but I don't know what it is.
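The shell command the author is probably thinking of is compact (or, on older MMAPv1 deployments, repairDatabase); a hedged sketch, with the collection name being illustrative:
// Run from the database that holds the collection, on a member taken out of rotation;
// rewrites and defragments that collection's data files
db.runCommand({ compact: "events" })
// Older MMAPv1 alternative: rebuild the whole database on disk (needs free disk space)
db.runCommand({ repairDatabase: 1 })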
Is it possible to modify the MongoDB oplog and replay it?
A bug caused an update to be applied to more documents than it was supposed to be, overwriting some data. Data was recovered from backup and reintegrated, so nothing was actually lost, but I was wondering if there was a way to modify the oplog to remove or modify the offending update and replay it.
I don't have in depth knowledge of MongoDB internals, so informative answers along the lines of, "you don't understand how it works, it's like this" will also be considered for acceptance.
One of the big issues with data corruption caused by application or human error is that the offending write to the primary will immediately be replicated to the secondaries.
This is one of the reasons users take advantage of "slaveDelay" - an option to run one of your secondary nodes with a fixed time delay (which of course only helps if you discover the error or bug within a period shorter than the delay on that secondary).
If you don't have such a set-up, you have to rely on a backup to recreate the affected records in their pre-bug state.
Perform all the operations on a separate stand-alone copy of your data - only after verifying that everything was properly recreated should you then move the corrected data into your production system.
What is required to be able to do this is a recent copy of the backup (let's say the backup is X hours old) and the oplog on your cluster must hold more than X hours worth of data. I didn't specify which node's oplog because (a) every member of the replica set has the same contents in the oplog and (b) it is possible that your oplog size is different on different node members, in which case you want to check the "largest" one.
So let's say your most recent backup is 52 hours old, but luckily you have an oplog that holds 75 hours worth of data (yay).
You already realized that all of your nodes (primary and secondaries) have the "bad" data, so what you would do is restore this most recent backup into a new mongod. This is where you will restore these records to what they were right before the offending update - and then you can just move them into the current primary from where they will get replicated to all the secondaries.
While restoring your backup, create a mongodump of your oplog collection via this command:
mongodump -d local -c oplog.rs -o oplogD
Move the oplog to its own directory renaming it to oplog.bson:
mkdir oplogR
mv oplogD/local/oplog.rs.bson oplogR/oplog.bson
Now you need to find the "offending" operation. You can dump the oplog out in human-readable form using the bsondump command on the oplogR/oplog.bson file (and then use grep or the like to find the "bad" update). Alternatively, you can query the original oplog in the replica set via the use local and db.oplog.rs.find() commands in the shell.
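For example, if you know roughly which namespace the bad update touched, a query like this narrows it down (the namespace is illustrative):
// In the shell, on any replica-set member
use local
db.oplog.rs.find({ op: "u", ns: "yourDb.yourCollection" }).sort({ $natural: -1 }).limit(10)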
Your goal is to find this entry and note its ts field.
It might look like this:
"ts" : Timestamp( 1361497305, 2789 )
Note that the mongorestore command has two relevant options, one called --oplogReplay and the other called --oplogLimit. You will now replay this oplog on the restored stand-alone server, BUT you will stop before this offending update operation.
The command would be (host and port are where your newly restored backup is):
mongorestore -h host --port NNNN --oplogReplay --oplogLimit 1361497305:2789 oplogR
This will restore each operation from the oplog.bson file in oplogR directory stopping right before the entry with ts value Timestamp(1361497305, 2789).
Recall that the reason you are doing this on a separate instance is so you can verify that the restore and replay created correct data - once you have verified it, you can write the restored records to the appropriate place in the real primary (and allow replication to propagate the corrected records to the secondaries).
I was playing around on our dev server for a while for a new product, and now it's sort of live. I want to move the existing data from a single machine (a local mongod) to our 6-server shard setup (2 shards, each a 3-member replica set). Is there any way to clone the db to a remote shard?
(worst case, a simple dump & insert with shard key example would be very nice!)
thanks!
You should add your dev server to the sharding environment:
Restart your dev server with the --shardsvr option
On your mongos, run (against the admin database): db.runCommand( { addshard : "serverhostname[:port]", name : "migration" } );
Then use the removeShard command to remove your "migration" shard (a sketch of this step follows below).
When it is done (you get "remove shard completed successfully"), you can stop your dev server; all your data will have been migrated from dev to the new cluster.
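A hedged sketch of the removal step, run from the mongos shell (re-issuing the command reports the draining progress):
// Must be run against the admin database
db.adminCommand({ removeShard: "migration" })
// Run it again later: the returned "state" field moves from "started" to "completed"
db.adminCommand({ removeShard: "migration" })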
You don't have to shard your database for the migration; however, you do need to if you want to benefit from sharding.
The advantage of this solution is that it requires minimal action on your part (everything is automatic) and there is no downtime (although the more load you put on the cluster, the slower the operation is).
However, this is a slow solution (slower than a manual copy).
One more advantage compared to a raw file copy: the transfer also compacts (~ defragments) the data, which is always good :-)
Add your dev server to a replica-set as the master, and the other 3 servers as slaves.
Then remove the dev server once the data has been copied by the other servers.
http://www.mongodb.org/display/DOCS/Replica+Set+Commands
You can use mongodump to dump out the database, and then load the db-dump with mongorestore onto the master of each of your replica-sets
man mongodump, man mongorestore
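A minimal dump-and-restore sketch for this route (hosts, database name and paths are illustrative); pointing mongorestore at a mongos lets the cluster place the documents according to the configured shard key:
# Dump the database from the dev machine
mongodump --host devHost:27017 --db yourDb -o /tmp/devdump
# Restore it through a mongos of the new cluster
mongorestore --host mongosHost:27017 --db yourDb /tmp/devdump/yourDb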
See:
http://www.mongodb.org/display/DOCS/Replica+Set+Internals
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Master+Slave