I have a two-member replica set. I accidentally removed all documents in a collection; I'm not sure how I did it, but the data is gone.
Is it possible to get all the data back?
Unless you have a backup (always recommended for just this type of thing) or one of the replicas is using slaveDelay, I am afraid the removal of the records is final. You might have been able to force a shutdown in time to save the on-disk data if you had killed the process before the next fsync to disk (similarly, if you had broken replication before the removal was replicated), but even then it would have been tricky.
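For reference, adding a delayed member of the kind mentioned above looks roughly like this from the mongo shell; the host name, member _id and one-hour delay are placeholder values, not something from this particular setup:

// Sketch only: add a hidden, delayed member to the replica set (run on the primary).
cfg = rs.conf()
cfg.members.push({
    _id: 3,                              // must be unique within this replica set
    host: "backup1.example.com:27017",   // placeholder host
    priority: 0,                         // never eligible to become primary
    hidden: true,                        // invisible to normal client reads
    slaveDelay: 3600                     // apply the oplog one hour behind the primary
})
rs.reconfig(cfg)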
Related
If I understand correctly, when I delete a record and commit, Postgres writes the change to the write-ahead log (WAL) and waits for the next checkpoint to flush the change to the data file.
My question is:
Is there any way I can recover a deleted record after the commit but before Postgres checkpoints?
By the way, why does this approach reduce disk writes? Isn't the WAL an append-only log file?
I couldn't find anything anywhere on how to do this without paying Postgres engineers.
Changes to the datafiles must be written and flushed by the end of the next checkpoint, but they can also be written earlier if the pages need to be (or are expected to be) evicted to make room for other data to be read in. As soon as the corresponding WAL is flushed, the change to the datafile is eligible to be written.
There is no user-available way to recover the deleted record. You could force a crash, then interfere with the recovery process. But you would have to be quite an expert to pull this off. It would be easier to just retrieve the record from a backup and then re-insert it. You don't even need to be an engineer to make this happen, just have valid backups (possibly including WAL archives) and know how to use them. Or better yet, don't commit things you don't want committed.
The system is designed this way for crash safety, not for reduced disk writing.
MongoDB configuration has a param called "autoresync".
This is what it says:
autoresync: Automatically resync if slave data is stale
So if we enable this parameter, when one of the secondaries goes into the RECOVERING state, can it auto-heal non-primary members whose data is stale and who are unable to continue replicating? Sometimes we see that the data is too stale. If we enable this parameter, can it automatically heal the member and bring it back to a good state?
That parameter is legacy and has not been supported for a long time; it dates back to when MongoDB still had the master-slave paradigm.
With current versions of MongoDB, secondaries always recover (auto-heal), IF the primary has an oplog big enough to cover the "missing" data.
So, if your secondary cannot replicate/recover, check that your PRIMARY node's oplog is big enough; the oplog can be resized without a restart.
To see how much time your oplog can cover, use db.getReplicationInfo().timeDiff.
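For example (the 16000 MB figure is just a placeholder, and replSetResizeOplog needs MongoDB 3.6 or newer):

// On the PRIMARY: show oplog size and the time window it covers.
rs.printReplicationInfo()
db.getReplicationInfo().timeDiff            // window length in seconds

// Resize the oplog without a restart (size is in megabytes).
db.adminCommand({ replSetResizeOplog: 1, size: 16000 })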
I've been reading a lot about MongoDB recently, but one topic I can't find any clear material on is how data is written to the journal and oplog.
This is what I understand of the process so far; please correct me where I'm wrong:
A client connects to mongod and performs a write. The write is stored in the socket buffer.
When Mongo is available (not sure what available means at this point), data is written to the journal?
The MongoDB docs then say that every 60 seconds writes are flushed from the journal onto disk. By this I can only assume they mean written to the primary's database and the oplog. If this is the case, how do writes appear earlier than the 60-second sync interval?
Some time later, secondaries pull data from the primary or their sync source and update their oplog and databases. The docs seem very vague about when exactly this happens and what can delay it.
I'm also wondering, if journaling were disabled (I understand that's a really bad idea), at what point would the oplog and database get updated?
Lastly, I'm a bit stumped about the points in this process at which write locks get taken. Is it just when the database and oplog are updated, or at other times too?
Thanks to anyone who can shed some light on this or point me to some reading material.
Simon
Here is what happens as far as I understand it. I simplified a bit, but it should make clear how it works.
A client connects to mongo. No writes are done so far, and no connection is torn down, because what happens next really depends on the write concern. Let's assume that we go with the (as of this writing) default of "acknowledged".
The client sends its write operation. Here is where I am really not sure: either after this step or the next one, the acknowledgement is sent to the driver.
The write operation is run through the query optimizer. I believe this is where the acknowledgement is sent, because with an acknowledged write concern you may be returned a duplicate key error. It is possible that this was checked in the previous step; if I had to bet, I'd say it happens after this one.
The output of the query optimizer is then applied to the data in memory: more precisely, to the memory-mapped data files, to the memory-mapped oplog, and to the journal's memory-mapped files. Queries are answered from these memory-mapped parts, or the corresponding data is mapped into memory in order to answer the query. The oplog is read from memory too, if present.
Roughly every 100 ms, the journal is synced to disk. The precise interval is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of "journaled", the driver will be notified now (see the sketch after these steps).
Every syncDelay seconds (60 by default), the current state of the memory-mapped files is synced to disk. I think the journal is then truncated to the entries which haven't been applied to the data yet, but I am not too sure of that, since it should basically never happen that data in the journal hasn't yet been applied to the current data.
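To make the write-concern difference in the steps above concrete, here is a rough shell illustration; the collection name and documents are made up, and the exact point of acknowledgement is as described above, not something this snippet changes:

// "Acknowledged": the driver returns once the write has been applied in memory
// and checked for errors such as duplicate keys.
db.test.insert({ _id: 1, x: "foo" }, { writeConcern: { w: 1 } })

// "Journaled": the driver returns only after the operation has also been
// written to the on-disk journal, i.e. after the next journal sync.
db.test.insert({ _id: 2, x: "bar" }, { writeConcern: { w: 1, j: true } })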
If you have read carefully, you will have noticed that the data is ready for the oplog as soon as it has been run through the query optimizer and applied to the files mapped into memory. When an oplog entry is pulled by one of the secondaries, it is immediately applied to that secondary's memory-mapped files and synced to disk the same way as on the primary.
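If you want to see what the secondaries actually pull, the oplog is an ordinary capped collection in the local database and can be inspected from the shell, for example:

// Show the most recent oplog entry on this node.
db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).pretty()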
Some things to note: as soon as the (relatively small) data is written to the journal, it is quite safe. If a node goes down between two syncs to the data files, both the data files and the oplog can be restored from their last on-disk state plus the journal. In general, the maximum data loss you can have is the operations recorded since the last journal commit, roughly 50 ms in the median case.
As for the locks: if you have read carefully, you will have noticed that no locks are imposed at the database level when the data is synced to disk. Write locks may be taken in order to ensure that only one thread at any given point in time modifies a given document. Other write locks are possible, but in general they should be rather rare.
A write lock on the filesystem layer is created once, though only implicitly, IIRC. During startup, a lock file (mongod.lock) is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those data files while a valid lock exists. And you shouldn't either ;)
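If you want to watch the locking yourself, the shell exposes it; a small illustrative example:

// Operations currently waiting for a lock, if any.
db.currentOp({ waitingForLock: true })

// Aggregate lock statistics for the whole server.
db.serverStatus().locks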
Hope this helps.
I'm trying to figure out the best way to quickly recover from a disastrous query (e.g. dropDatabase()) that has corrupted the primary and any secondaries.
My current solution requires recovering from a snapshot on MMS, which means 2 hours of downtime to download and untar the file.
My plan is to add a 30-minute delayed secondary to my replica set so I could quickly failover in this disaster case -- I would lose 30 minutes of writes, but this is better than 2 hours of downtime. Ideally, I could actually have the delayed database replay the oplog up to the bad query first.
Does this approach make sense? If so, how do you actually promote a delayed secondary (after removing the corrupted servers), ensuring that it doesn't replicate the bad query that corrupted the primary?
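To make the failover step concrete, I imagine the promotion would be a forced reconfiguration along these lines (the host name is made up, and I'm not sure this is right; replaying the oplog up to just before the bad query would still be a separate, manual step):

// Run on the surviving delayed member. The forced reconfig is needed because
// a majority of the old replica set is gone.
cfg = rs.conf()
cfg.members = [ { _id: 2, host: "delayed1.example.com:27017", priority: 1 } ]  // drop hidden/slaveDelay
rs.reconfig(cfg, { force: true })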
I have read about the different MongoDB setups for doing backups without downtime. Which strategy is best, or can they even be compared?
1. Enable journaling and simply copy the /data/db directory. It is unclear to me whether this is enough; the MongoDB home page states that you have to "snapshot it", and gives SAN and LVM as examples.
Questions:
What does "snapshot" mean in this context? Will a copy command count as a snapshot? Is it safe to copy a journaled MongoDB (2.0+) data directory on a Windows server with NTFS? How do you ensure that this is safe to do on your own filesystem and setup?
2. Establish a replica set with 2 servers and an arbiter. Then use rs.status() and fsyncLock/fsyncUnlock to ensure the data is read-only on the secondary server while doing the backup.
> db.fsyncLock
function () {
    return db.adminCommand({fsync:1, lock:true});
}
> db.fsyncUnlock
function () {
    return db.getSiblingDB("admin").$cmd.sys.unlock.findOne();
}
Questions:
If you use fsync locks in a replica set, it seems that writes and reads can be blocked for the whole replica set, and this bug has not been fixed?
What if the secondary is voted in as primary while the backup is in progress? Will the backup process stop or will the replica set stop responding to write requests until it is unlocked?
Considerations:
For now I would like the simple solution: simply copy the data/db directory, including the journal files, and hold off on the replica set. MongoDB runs on a 64-bit Windows server (RackSpace Cloud).
The best bet is to do fsync + lock on a secondary, then snapshot the volume at the disk or volume level (e.g. using lvm2, hyper-v, btrfs), unlock the database, then copy the snapshotted data files. This minimizes downtime of the secondary and is easy to restore.
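As a sketch, the sequence on the backup secondary looks roughly like this; the snapshot itself is taken outside the mongo shell with whatever your volume manager or hypervisor provides:

db.fsyncLock()      // flush pending writes to disk and block further writes
// ... take the disk/volume snapshot now, e.g. with LVM, Hyper-V or btrfs tooling ...
db.fsyncUnlock()    // writes resume; copy the data files out of the snapshot at leisure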
"Snapshotting" in this context refers to the snapshot features offered by some volume managers, file systems and hypervisors. Essentially, this is a 'copy-on-write' feature for block devices: instead of overwriting data when the OS demands it, it will write the new data elsewhere and keep both the old version and the new version readable. Snapshotting usually takes almost no time, but on some systems, it's a bad idea to keep many snapshots of the same files, because it may dramatically slow future writes.
Why I believe this is the best strategy for full backups:
Using mongodump won't store the index data. The indexes will be rebuilt on restore, but rebuilding indexes during recovery can take hours, and the last thing you need when everybody is yelling at you is an operation that takes hours and can't be accelerated.
Fsync + lock will block writers and might block readers; hence, it's best to do that on a (passive) secondary, not on the primary.
Halting a secondary will let the oplog fill up, which is why you should keep the lock time as short as possible. Instead of copying all the data files (which could take hours) during the lock, merely taking a snapshot should take only a couple of seconds. Hence, oplog limits are not a concern.
Everything is 'back to normal' while the actual copy is running, which gives you peace of mind. The only difference will be higher load on a secondary during the backup, which shouldn't be a major concern.
Addressing your questions:
Regarding locks in replica sets: keep the lock time short, and use a passive secondary (which can't be elected master) so the writer queue can't stall.
"What if the secondary is voted in as primary while the backup is in progress?" That can't happen if your backup secondary is passive.
"For now I would like the simple solution: simply copy the data/db dir with journal files and wait with the replica set. MongoDB runs on a 64-bit Windows server (RackSpace Cloud)."
You can do that. Volume snapshotting is probably still the best way to go, giving you only seconds of downtime. If your data is small, a simple mongodump might be even easier, but make sure recovery times are acceptable (depends on your indexes).