Is it possible to modify the MongoDB oplog and replay it?
A bug caused an update to be applied to more documents than it was supposed to, overwriting some data. Data was recovered from backup and reintegrated, so nothing was actually lost, but I was wondering if there is a way to modify the oplog to remove or modify the offending update and replay it.
I don't have in depth knowledge of MongoDB internals, so informative answers along the lines of, "you don't understand how it works, it's like this" will also be considered for acceptance.
One of the big issues in application or human error data corruption is that the offending write to the primary will immediately be replicated to the secondary.
This is one of the reasons that users take advantage of "slaveDelay" - an option to run one of your secondary nodes with a fixed time delay (of course that only helps if you discover the error or bug within a window shorter than the delay on that secondary).
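If you want to set one up, a rough sketch from the shell looks like this (the member index and the one-hour delay are just examples; a delayed member must also be priority 0 and is usually hidden):
mongo --host <primary_host> --eval '
cfg = rs.conf();
cfg.members[2].priority = 0;     // delayed members cannot become primary
cfg.members[2].hidden = true;    // keep clients from reading stale data from it
cfg.members[2].slaveDelay = 3600; // stay one hour behind the primary
rs.reconfig(cfg);
'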
In case you don't have such a set-up, you have to rely on a backup to recreate the state of the records you need to restore to their pre-bug state.
Perform all the operations on a separate stand-alone copy of your data - only after verifying that everything was properly recreated should you then move the corrected data into your production system.
What you need to be able to do this is a recent copy of the backup (let's say the backup is X hours old) and an oplog on your cluster that holds more than X hours' worth of data. I didn't specify which node's oplog because (a) every member of the replica set has the same contents in the oplog and (b) your oplog size may differ between members, in which case you want to check the "largest" one.
So let's say your most recent backup is 52 hours old, but luckily you have an oplog that holds 75 hours worth of data (yay).
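You can check how many hours of operations each member's oplog currently covers with something like this, run against each member in turn (the helper prints the window as "log length start to end"):
mongo --host <member_host> --eval 'db.printReplicationInfo()'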
You already realized that all of your nodes (primary and secondaries) have the "bad" data, so what you would do is restore this most recent backup into a new mongod. This is where you will restore these records to what they were right before the offending update - and then you can just move them into the current primary from where they will get replicated to all the secondaries.
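As a sketch, restoring the backup into a throwaway stand-alone mongod might look roughly like this (the dbpath, dump directory and port NNNN are placeholders; the host and port are the same ones used in the restore commands below):
# start a throwaway stand-alone mongod to restore into
mongod --dbpath /data/restore_tmp --port NNNN --fork --logpath /data/restore_tmp/mongod.log
# load the most recent backup into it
mongorestore --host localhost --port NNNN /path/to/backup_dump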
While restoring your backup, create a mongodump of your oplog collection via this command:
mongodump -d local -c oplog.rs -o oplogD
Move the oplog to its own directory, renaming it to oplog.bson:
mkdir oplogR
mv oplogD/local/oplog.rs.bson oplogR/oplog.bson
Now you need to find the "offending" operation. You can dump out the oplog to human readable form, using the bsondump command on oplogR/oplog.bson file (and then use grep or what-not to find the "bad" update). Alternatively you can query against the original oplog in the replica set via use local and db.oplog.rs.find() commands in the shell.
Your goal is to find this entry and note its ts field.
It might look like this:
"ts" : Timestamp( 1361497305, 2789 )
Note that the mongorestore command has two relevant options, one called --oplogReplay and the other called --oplogLimit. You will now replay this oplog on the restored stand-alone server BUT you will stop before this offending update operation.
The command would be (host and port are where your newly restored backup is):
mongorestore -h host --port NNNN --oplogReplay --oplogLimit 1361497305:2789 oplogR
This will restore each operation from the oplog.bson file in oplogR directory stopping right before the entry with ts value Timestamp(1361497305, 2789).
Recall that the reason you were doing this on a separate instance is so you can verify that the restore and replay created correct data - once you have verified it, you can write the restored records to the appropriate place in the real primary (and allow replication to propagate the corrected records to the secondaries).
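If replacing the whole affected collection on the primary is acceptable, one possible way to move the verified data back looks roughly like this (db/collection names and hosts are placeholders):
# dump the verified collection from the stand-alone instance
mongodump --host localhost --port NNNN -d mydb -c mycoll -o verifiedDump
# drop and replace that collection on the real primary; replication distributes it
mongorestore --host <primary_host> -d mydb -c mycoll --drop verifiedDump/mydb/mycoll.bson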
Related
Is it safe to run mongodump against a running server with many writes per second? Is it possible to get a corrupted dump this way?
From here:
Use --oplog to capture incoming write operations during the mongodump operation to ensure that the backups reflect a consistent data state.
Does it mean that no matter how many writes happen, the dump will be consistent?
If I run mongodump --oplog at 1 AM and it finishes at 2 AM, and I then run mongorestore --oplogReplay, what state will I get?
From here:
However, the use of mongodump and mongorestore as a backup strategy can be problematic for sharded clusters and replica sets.
But why? I have a replica set of 1 primary and 2 secondaries. What is the problem with running mongodump against one of the secondaries? It should be the same as the primary (apart from replication lag).
The docs are quite clear about it:
--oplog
Creates a file named oplog.bson as part of the mongodump output. The oplog.bson file, located in the top level of the output directory, contains oplog entries that occur during the mongodump operation. This file provides an effective point-in-time snapshot of the state of a mongod instance. To restore to a specific point-in-time backup, use the output created with this option in conjunction with mongorestore --oplogReplay.
Without --oplog, if there are write operations during the dump operation, the dump will not reflect a single moment in time. Changes made to the database during the update process can affect the output of the backup.
--oplog has no effect when running mongodump against a mongos instance to dump the entire contents of a sharded cluster. However, you can use --oplog to dump individual shards.
Without --oplog you still get a valid dump, just a bit inconsistent - some of the writes done between 1 AM and 2 AM will be missing.
With --oplog you also have the oplog file captured up to 2 AM. The dump itself is still inconsistent, but replaying that oplog on restore fixes this and brings the data to a consistent state as of the end of the dump.
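In terms of the 1 AM / 2 AM example, the pair of commands would look roughly like this (the output directory is just an example):
# started at 1 AM; --oplog also captures writes that happen while the dump runs
mongodump --oplog --out /backups/1am
# --oplogReplay applies the captured oplog, giving the state as of the end of the dump (~2 AM)
mongorestore --oplogReplay /backups/1am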
The problems with dumping sharded clusters deserve a dedicated page in the docs, essentially because of the complexity of synchronising the backups of all nodes:
To create backups of a sharded cluster, you will stop the cluster balancer, take a backup of the config database, and then take backups of each shard in the cluster using mongodump to capture the backup data. To capture a more exact moment-in-time snapshot of the system, you will need to stop all application writes before taking the filesystem snapshots; otherwise the snapshot will only approximate a moment in time.
There are no such problems with dumping a replica set.
My MongoDB was hacked today. All data was deleted, and the hacker demands a ransom to get it back. I will not pay him, because I know he will not send my database back.
But I had the oplog turned on, and I see it contains over 300,000 documents recording all operations.
Is there any tool that can restore my data from these logs?
Depending on how far back your oplog is, you may be able to restore the deployment. I would recommend taking a backup of the current state of your dbpath just in case.
Note that there are many variables in play for doing a restore like this, so success is never a guarantee. It can be done using mongodump and mongorestore, but only if your oplog goes back to the beginning of time (i.e. when the deployment was first created). If it does, you may be able to restore your data. If it does not, you'll see errors during the process.
1. Secure your deployment before doing anything else. This situation arises due to a lack of security. There are extensive security features available in MongoDB. Check out the Security Checklist page for details.
2. Dump the oplog collection using mongodump --host <old_host> --username <user> --password <pwd> -d local -c oplog.rs -o oplogDump.
3. Check the content of the oplog to determine the timestamp at which the offending drop operation occurred, using bsondump oplogDump/local/oplog.rs.bson. You're looking for a line that looks approximately like this:
{"ts":{"$timestamp":{"t":1502172266,"i":1}},"t":{"$numberLong":"1"},"h":{"$numberLong":"7041819298365940282"},"v":2,"op":"c","ns":"test.$cmd","o":{"dropDatabase":1}}
This line means that a dropDatabase() command was executed on the test database.
Keep note of the t value in {"$timestamp":{"t":1502172266,"i":1}}.
4. Restore to a secure new deployment using mongorestore --host <new_host> --username <user> --password <pwd> --oplogReplay --oplogLimit=1502172266 --oplogFile=oplogDump/local/oplog.rs.bson oplogDump
Note the parameter to --oplogLimit, which basically tells mongorestore to stop replaying the oplog once it hits that timestamp (the timestamp of the dropDatabase command found in Step 3).
The --oplogFile parameter is new in MongoDB 3.4. For older versions, you would need to copy oplogDump/local/oplog.rs.bson into the root of the dump directory as a file named oplog.bson (i.e. oplogDump/oplog.bson) and remove the --oplogFile parameter from the example command above.
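Concretely, the pre-3.4 variant of Step 4 would look roughly like this (same placeholder host and credentials as above):
# place the oplog at the root of the dump directory instead of using --oplogFile
cp oplogDump/local/oplog.rs.bson oplogDump/oplog.bson
mongorestore --host <new_host> --username <user> --password <pwd> --oplogReplay --oplogLimit=1502172266 oplogDump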
After Step 4, if your oplog goes back to the beginning of time and you stop the oplog replay at the right time, hopefully you should see your data at the point just before the dropDatabase command was executed.
We have a three-server replicaset running MongoDB 2.2 on Ubuntu 10.04, and recently had to upgrade the hard drive for each server where one particular database resides. This database contains log information for web service requests, where they write to collections in hourly buckets using the current timestamp to determine the name, e.g. log_yyyymmddhh.
I performed this process:
back up the database on the primary server with mongodump --db log_db
take a secondary server offline and replace the disk
bring the secondary server up in standalone mode (i.e. comment out the replSet entry in /etc/mongodb.conf before starting the service)
restore the database on the secondary server with mongorestore --drop --db log_db
add the secondary server back into the replica set and bring it online, letting replication catch up the hourly buckets that were updated/created while it had been offline
Everything seemed to go as expected, except that the collection which was the current bucket at the time of the backup was not brought up to date by replication. I had to copy that collection over by hand to get it up to date. Note that collections created after the backup were synced just fine.
What did I miss in this process that caused MongoDB not to get things back in synch for that one collection? I assume something got out of whack with regard to the oplog?
Edit 1:
The oplog on the primary showed that its earliest timestamp went back a couple of days, so there should have been plenty of space to maintain transactions for a few hours (which was the time the secondary was offline).
Edit 2:
Our MongoDB installation uses two disk partitions: /dev/sda1 and /dev/sdb1. The primary MongoDB directory /var/lib/mongodb/ is on /dev/sda1, and holds several databases, while the log database resides by itself on /dev/sdb1. There's a sym link /var/lib/mongodb/log_db which points to a directory on /dev/sdb1. Since the log db was getting full, we needed to upgrade the disk for /dev/sdb1.
You should be using mongodump with the --oplog option. Running a full database backup with mongodump on a replicaset that is updating collections at the same time may not leave you with a consistent backup. This becomes worse with larger databases, more collections and more frequent updates/inserts/deletes.
From the documentation for your version (2.2) of MongoDB (it's the same for 2.6 but just to be as accurate as possible):
--oplog
Use this option to ensure that mongodump creates a dump of the database that includes an oplog, to create a point-in-time snapshot of the state of a mongod instance. To restore to a specific point-in-time backup, use the output created with this option in conjunction with mongorestore --oplogReplay.
Without --oplog, if there are write operations during the dump operation, the dump will not reflect a single moment in time. Changes made to the database during the update process can affect the output of the backup.
http://docs.mongodb.org/v2.2/reference/mongodump/
This is not covered well in most MongoDB tutorials around backups and restores. Generally you are better off if you can perform a live snapshot of the storage volume your database resides on (assuming your storage solution has a live snapshot ability compatible with MongoDB). Failing that, your next best bet is taking a secondary offline and then performing a snapshot or backup of the database files. Mongodump on a live database is increasingly a less optimal solution for larger databases due to performance issues.
I'd definitely take a look at the MongoDB overview of backup options: http://docs.mongodb.org/manual/core/backups/
I would guess this has to do with the oplog not being long enough, although it seems like you checked that and it looked reasonably big.
Still, when adding new members to a replica set you shouldn't be snapshotting and restoring them. It's better to simply add a new member and let replication happen by itself. This is described in the Mongo docs and is the process I've always followed.
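For example, after provisioning the new member with an empty data directory and the replSet option, you would simply run something like this from a shell connected to the primary (hostname and port are placeholders) and let the initial sync do the rest:
# add the freshly provisioned, empty member; it performs an initial sync from the existing members
mongo --host <primary_host> --eval 'rs.add("newhost.example.net:27017")'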
I have a standalone MongoDB instance. It has many databases in it. I am, though, only concerned with backing up/restoring one of those databases; let's call it DbOne.
Using the instructions in (http://www.mongodb.com/blog/post/dont-let-your-standalone-mongodb-server-stand-alone), I can create an oplog on this standalone server.
Using the tool Tayra, I can record/store the oplog entries. Being able to create incremental backups is the main reason I enabled the oplog on my standalone instance.
I intend to take full backups once a day, using the command
mongodump --db DbOne --oplog
From my understanding, this backup will contain a point-in-time snapshot of my db.
Assuming I want to discard all updates since this backup, I delete all backedup oplog and I restore only this full backup, using the command
mongorestore --drop --db DbOne --oplogReplay
At this point, do I need to do something to the oplog collection in the local db? Will MongoDB automatically drop the entries pertaining to this db from the oplog? Because if not, wouldn't Tayra end up finding those oplog entries and backing them up all over again?
Tbh, I haven't tried this yet on my machine. I am hoping someone can point to a document that lists supported/expected behaviour in this scenario.
I had experimented with a MongoDB server, set up as a replica set with only 1 member, shortly after asking the question. However, I forgot to answer this question.
I took a backup using mongodump --db DbOne --oplog, then executed some additional updates. Keeping the MongoDB server as is, i.e. still running as a replica set, running the mongorestore command created thousands of oplog entries, one for each document of each collection in the db. It was a big mess.
The alternative was to shut down MongoDB and start it as a standalone instance (i.e. not running as a replica set). Now if I were to restore using mongorestore, the oplog wouldn't be touched. This was bad, because the oplog now contained entries that were not present in the actual collections.
I wanted a mechanism that would restore both my database and the oplog info in the local database to the time the backup took place. mongodump doesn't back up the local database.
Eventually I had to stop using mongodump and instead switched to backing up the entire data directory (after stopping MongoDB). Once we switch to AWS, I can use the EBS snapshot feature to perform the same.
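As a sketch, the data-directory approach amounts to something like this (paths and service name are illustrative):
# stop mongod so the files on disk are in a consistent state
sudo service mongodb stop
# archive the whole data directory
tar czf /backups/mongodb-$(date +%F).tar.gz /var/lib/mongodb
sudo service mongodb start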
I understand you want a link to the docs about mongorestore:
http://docs.mongodb.org/manual/reference/program/mongorestore/
From what I understand you want to make a point in time backup and then restore that backup. The commands you listed above will do that:
1) mongodump --db DbOne --oplog
2) mongorestore --drop --db DbOne --oplogReplay
However, please note that the "point in time" at which the backup is effectively taken is when the dump ends, not the moment the command started. This is a fine detail that might not matter to you, but it is included for completeness.
Let me know if there is anything else.
Best,
Charlie
The MongoDB documentation says that
To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).
in http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
I wonder how to make MongoDB free deleted disk space automatically?
P.S. We store many download tasks in MongoDB, up to 20GB, and finish them within half an hour.
In general if you don't need to shrink your datafiles you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation and the more space that MongoDB can allocate in datafiles the less fragmentation you will have.
So, you should try to provide as much disk-space as possible for the database.
However if you must shrink the database you should keep two things in mind.
MongoDB grows its data files by doubling, so the data files may be 64MB, then 128MB, etc., up to 2GB (at which point it stops doubling and each subsequent file is allocated at 2GB).
As with most any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So, you'll need to create a job using your OS's scheduler to run the shrink. You could use a bash script, or have a job run a PHP script, etc.
Server-side JavaScript
You can use server-side JavaScript to do the shrink and run that JS via mongo's shell on a regular basis via a job (like cron or the Windows scheduling service) ...
Assuming a database and collection both called foo, you would save the JavaScript below into a file called bar.js and run ...
$ mongo foo bar.js
The JavaScript file would look something like ...
// Get the current collection sizes.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();
print('Storage Size: ' + tojson(storage));
print('TotalSize: ' + tojson(total));
print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');
// Run repair
db.repairDatabase()
// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();
print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));
This will run and return something like ...
MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153
Run this on a schedule (during off-peak hours) and you are good to go.
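For example, a crontab entry along these lines would run it every Sunday at 03:00 (paths are illustrative):
# m h dom mon dow  command
0 3 * * 0 /usr/bin/mongo foo /path/to/bar.js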
Capped Collections
However there is one other option, capped collections.
Capped collections are fixed-size collections that have a very high-performance auto-FIFO age-out feature (age-out is based on insertion order). They are a bit like the "RRD" concept if you are familiar with that.
In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.
Basically you can limit the size of (or number of documents in) a collection to, say, 20GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.
This is a great way to keep a large amount of data, discarding the older data as time goes by and keeping the same amount of disk space used.
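Creating one is a single command; for example, a 20GB capped collection might be created like this (the database and collection names are just examples; size is in bytes):
mongo mydb --eval 'db.createCollection("log", { capped: true, size: 20 * 1024 * 1024 * 1024 })'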
I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.
You must be using a replica set.
My thought is once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up and let it resynchronize with the master.
The process is time consuming, but it should only cost a few seconds of downtime, when you do the rs.stepDown().
Also, this cannot be automated. Well, it could, but I don't think I'm willing to try.
Running db.repairDatabase() will require that you have space equal to the current size of the database available on the file system. This can be bothersome when you know that the collections left or data you need to retain in the database would currently use much less space than what is allocated and you do not have enough space to make the repair.
As an alternative if you have few collections you actually need to retain or only want a subset of the data, then you can move the data you need to keep into a new database and drop the old one. If you need the same database name you can then move them back into a fresh db by the same name. Just make sure you recreate any indexes.
// 1. Empty the scratch database
use cleanup_database
db.dropDatabase();

// 2. Copy the documents you want to keep into the scratch database
use oversize_database
db.collection.find().forEach(function(doc){
    db.getSiblingDB("cleanup_database").collection_subset.insert(doc);
});

// 3. Drop the oversized database to release its disk space
use oversize_database
db.dropDatabase();

// 4. Copy the retained documents back into a freshly created database
use cleanup_database
db.collection_subset.find().forEach(function(doc){
    db.getSiblingDB("oversize_database").collection.insert(doc);
});

// 5. Recreate any indexes you need, e.g.:
use oversize_database
db.collection.ensureIndex({field:1});

// 6. Clean up the scratch database
use cleanup_database
db.dropDatabase();
An export/drop/import operation for databases with many collections would likely achieve the same result but I have not tested.
Also, as a policy you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except indexes would be lost, and your db and collections will be recreated when the inserts for the processes next run. Just make sure your jobs include creating any necessary indexes at an appropriate time.
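For example, a job's cleanup step could be as simple as this (the database name is an example):
# drop the transient database; the next job run recreates its collections and indexes
mongo --eval 'db.getSiblingDB("processing_db").dropDatabase()'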
If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.
To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), effectively doing the same thing as a repair, and disk space is reclaimed.
By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.
To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:
NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES
#!/bin/bash
# First arg is host MongoDB is running on, second arg is the MongoDB port
MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath
# make sure the node we are connecting to is not the primary
while [ "$($MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'db.isMaster().ismaster')" = "true" ]
do
    $MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'rs.stepDown()'
    sleep 2
done
echo "Node is no longer primary!"
# Now shut down that server
# something like (assuming user is set up for key based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t user@$MONGOHOST sudo service mongodb stop
# Wipe the data files for that server
ssh -t user@$MONGOHOST sudo rm -rf $DBPATH
ssh -t user@$MONGOHOST sudo mkdir $DBPATH
ssh -t user@$MONGOHOST sudo chown mongodb:mongodb $DBPATH
# Start up server again
# similar to shutdown, something like
ssh -t user@$MONGOHOST sudo service mongodb start