For MongoDB version 3+
We are using MongoDB as a single stand-alone instance. I discovered I can't create a point in time recovery without building a single node replica so I can take advantage of the oplogs to "replay transactions" for point in time recovery. I've done some testing making a single node replica I seem to be able to restore to a specific time.
Is there an alternative with a stand-alone mongodb to get point in time recovery or do I need to convert it to a single node replca.
Thanks
By point-in-time I assume you mean arbitrary point-in-time. You can get a backup up to a particular point by taking a filesystem snapshot at that point, but you can't roll that snapshot forward arbitrarily (by say 10 seconds) to another point without taking another snapshot, or having an oplog.
Basically, yes, you will need an oplog (and hence a single node replica set at a minimum) to be able to recover your data to a specific point in time. The oplog needs to be big enough to cover all operations between your backup/snapshot to the point in time you wish to recover.
For example, if you take backups every 24 hours then you will need an oplog to cover at least 24 hours worth of operations in order to be able to restore to any point in that day. The more often you take backups, the smaller you can make your oplog while retaining the capability.
Related
I've been read a lot about MongoDB recently, but one topic I can't find any clear material on, is how data is written to the journal and oplog.
So this is what I understand of the process so far, please correct me where I'm wrong
A client connect to mongod and performs a write. The write is stored in the socket buffer
When Mongo is available (not sure what available means at this point), data is written to the journal?
The mongoDB docs then say that writes every 60 seconds are flushed from the journal onto disk. By this I can only assume this mean written to the primary and the oplog. If this is the case, how to writes appear earlier than the 60 seconds sync interval?
Some time later, secondaries suck data from the primary or their sync source and update their oplog and databases. It seems very vague about when exactly this happens and what delays it.
I'm also wondering if journaling was disabled (I understand that's a really bad idea), at what point does the oplog and database get updated?
Lastly I'm a bit stumpted at which points in this process, the write locks get created. Is this just when the database and oplog are updated or at other times too?
Thanks to anyone who can shed some light on this or point me to some reading material.
Simon
Here is what happens as far as I understand it. I simplified a bit, but it should make clear how it works.
A client connects to mongo. No writes done so far, and no connection torn down, because it really depends on the write concern what happens now.Let's assume that we go with the (by the time of this writing) default "acknowledged".
The client sends it's write operation. Here is where I am really not sure. Either after this step or the next one the acknowledgement is sent to the driver.
The write operation is run through the query optimizer. It is here where the acknowledgment is sent because with in an acknowledged write concern, you may be returned a duplicate key error. It is possible that this was checked in the last step. If I should bet, I'd say it is after this one.
The output of the query optimizer is then applied to the data in memory Actually to the data of the memory mapped datafiles, to the memory mapped oplog and to the journal's memory mapped files. Queries are answered from this memory mapped parts or the according data is mapped to memory for answering the query. The oplog is read from memory if present, too.
Every 100ms in general the journal is synced to disk. The precise value is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of journaled, the driver will be notified now.
Every syncDelay seconds, the current state of the memory mapped files is synced to disk I think the journal is truncated to the entries which weren't applied to the data yet, but I am not too sure of that since that it should basically never happen that data in the journal isn't yet applied to the current data.
If you have read carefully, you noticed that the data is ready for the oplog as early as it has been run through the query optimizer and was applied to the files mapped into memory. When the oplog entry is pulled by one of the secondaries, it is immediately applied to it's data of the memory mapped files and synced in the disk the same way as on the primary.
Some things to note: As soon as the relatively small data is written to the journal, it is quite safe. If a node goes down between two syncs to the datafiles, both the datafiles and the oplog can be restored from their last state in the datafiles and the journal. In general, the maximum data loss you can have is the operations recorded into the log after the last commit, 50ms in median.
As for the locks. If you have written carefully, there aren't locks imposed on a database level when the data is synced to disk. Write locks may be created in order to assure that only one thread at any given point in time modifies a given document. There are other write locks possible , but in general, they should be rather rare.
Write locks on the filesystem layer are created once, though only implicitly, iirc. During application startup, a lock file is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those datafiles while a valid lock exists. And you shouldn't either ;)
Hope this helps.
Amazon says the following on Redshift billing
"Node usage hours are billed for each hour your data warehouse cluster is running in an Available state. If you no longer wish to be charged for your data warehouse cluster, you must terminate it to avoid being billed for additional node hours."
This means if I just create a cluster and whether use it or not I'll be billed 24/7 because the cluster doesn't have any state like "Suspend". Is there a way to shut down the whole Redshift server when not in use so that I'll be billed only for the hours when I want to use the clusters?
Edit: With Tomasz's reply it sounds like if I want to shutdown the cluster on weekend it'll be like backing up the whole database on Friday evening and restoring on Sunday evening. This doesn't sound good. What does Amazon really mean when they say "PAY ONLY FOR THE HOURS YOU USE"?
Can you tell me how much time will it take to backup/restore a data warehouse of size around 100GB? Can I automatically associate security groups to the cluster after restoring from the Java code?
You can create a manual snapshot of a cluster when you have finished work and then remove cluster.
You will pay for S3 storage, but that is much less than for running Redshift cluster.
Next day just restore cluster from latest snapshot. You will have to add security groups to new cluster, probably with JAVA API:
The new cluster will be associated only with the default security and
parameter groups. If the original cluster was associated with any
other security or parameter group, you will need to manually associate
those groups with the new cluster.
The easiest way to create snapshot is from the console, but you probably will want to do it automatically using cli or Java SDK.
Creating a snapshot of a 3 node cluster filled up to 80% took me about 5 minutes (it's so quick because snapshots are incremental). 100GB is much less than my setup, so it should be even faster. Also restore shouldn't take long time.
UPDATE: A lot has changed in the intervening years, in particular restore from snapshot is now quite fast. Your cluster becomes available in a few minutes and you can run queries while the restore continues in the background. Total time for complete restore of 100GB would now be measured in minutes (varies based on node type & count).
What does Amazon really mean when they say "PAY ONLY FOR THE HOURS YOU USE"?
You pay for the whole hour of any partial hours used.
Can you tell me how much time will it take to backup/restore a data warehouse of size around 100GB?
Snapshots are incremental and this is what makes them fast (as Tomasz mentioned). It's is fairly quick to shutdown a cluster about half an hour. However restoring from a snapshot is very slow I'd suggest around 3 hours for restoring 100GB.
If you really want to be able to take a database cluster up and down quickly you might be better using another analytic DB (e.g. Greenplum or Vertica free editions) with the data stored on EBS volumes. It'd be a lot more work to manage though, that's the tradeoff.
Now we can able to pause and resume the Redshift cluster (both Console and CLI)
check out the link:
https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
"MongoDB in Action" book says:
Imagine you issue a write to the primary node of a replica set. What happens next? First, the write is recorded and then added to the
primary’s oplog. Meanwhile, all sec- ondaries have their own oplogs
that replicate the primary’s oplog. So when a given secondary node is
ready to update itself, it does three things. First, it looks at the
time- stamp of the latest entry in its own oplog. Next, it queries the
primary’s oplog for all entries greater than that timestamp. Finally,
it adds each of those entries to its own oplog and applies the entries
to itself
So this means nodes must be time synchronized? because timestamps must be equal on all nodes.
In general, yes, it is a very good idea to have your hosts synchronized (NTP is the usual solution). In fact I have seen far worse issues caused than an out of sync oplog - different times on database hosts in a cluster should be considered a must.
This is actually mentioned on the Production Notes page in the docs:
http://www.mongodb.org/display/DOCS/Production+Notes#ProductionNotes-Linux
See the note about minimizing clock skew.
Based on the writing you have provided, nodes are basing everything on the timestamp of the most recently received write, not their own clocks. However, the problem happens when the master is stepped down and a secondary becomes the primary. If the time is skewed greatly, it may cause replication to be delayed or other issues.
I read about the different MongoDB setups for doing backup without downtime. Which strategy is best or can they even be compared?
Enable journaling and simply copy the /data/db directory - it is unclear to me if this is enough – on the MongoDB home page it states that you have to "snapshot it" and it works on SAN and LVM as examples.
Questions:
What does snapshot mean in this context will a copy command count as a snapshot? Is it save to copy a journaling MongoDB (2.0+) data directory on a Windows server with NTFS? How do you ensure that it is safe to do on your own filesystem and setup?
Establish a replica set with 2 servers and an arbiter. Then use rs.status() and fsyncLock/unlock to ensure data is read only on the secondary server while doing backup.
> db.fsyncLock
function () {
return db.adminCommand({fsync:1, lock:true});
}
> db.fsyncUnlock
function () {
return db.getSiblingDB("admin").$cmd.sys.unlock.findOne();
}
Questions:
If you use locks in a replica set it seems that writes and reads can be locked for the whole replica set and this bug has not been fixed?
What if the secondary is voted in as primary while the backup is in progress? Will the backup process stop or will the replica set stop responding to write requests until it is unlocked?
Considerations:
For now I would like the simple solution and simply copy the data/db directory with journal files and wait with the replica set. MongoDB runs on a 64 bit Windows server (RackSpace Cloud).
The best bet is to do fsync + lock on a secondary, then snapshot the volume at the disk or volume level (e.g. using lvm2, hyper-v, btrfs), unlock the database, then copy the snapshotted data files. This minimizes downtime of the secondary and is easy to restore.
"Snapshotting" in this context refers to the snapshot features offered by some volume managers, file systems and hypervisors. Essentially, this is a 'copy-on-write' feature for block devices: instead of overwriting data when the OS demands it, it will write the new data elsewhere and keep both the old version and the new version readable. Snapshotting usually takes almost no time, but on some systems, it's a bad idea to keep many snapshots of the same files, because it may dramatically slow future writes.
Why I believe this is the best strategy for full backups:
Using mongodump won't store the index data The indexes will be restored, but rebuilding indexes for recovery can take hours - the last thing you need when everybody is yelling at you is an operation that takes hours and can't be accelerated.
Fsync + lock will block writers and might block readers hence, it's best to do that on a (passive) secondary, not on the primary.
Halting a secondary will fill the oplog which is why you should keep the lock time as short as possible. Instead of copying all data files (which could take hours) during the lock, merely performing a snapshot should take only a couple of seconds. Hence, oplog limits are not a concern.
Everything is 'back to normal' while the actual copy is running, which gives you peace of mind. The only difference will be higher load on a secondary during the backup, which shouldn't be a major concern.
Addressing your questions:
regarding locks in replica sets: Keep the lock time short, and use a passive secondary (which can't be elected master) so the writer queue can't stall.
"What if the secondary is voted in as primary while the backup is in progress" can't happen if your backup system is passive
For now I would like the simple solution and simply copy the data/db dir with journal files and wait with the replica set. The MongoDB runs on a 64 bit Windows server (RackSpace Cloud).
You can do that. Volume snapshotting is probably still the best way to go, giving you only seconds of downtime. If your data is small, a simple mongodump might be even easier, but make sure recovery times are acceptable (depends on your indexes).
I am working on a project which has some important data in it. This means we cannot to lose any of it if the light or server goes down. We are using MongoDB for the database. I'd like to be sure that my data is in the database after the insert and rollback the whole batch if one element was not inserted. I know it is the philosophy behind Mongo that we do not need transactions but how can I make sure that my data is really safely stored after insert rather than sent to some "black hole".
Should I make a search?
Should I use some specific mongoDB commands?
Should I use sharding even if one server is enough for satisfying
the speed and by the way it doesn't guarantee anything if the light
goes down?
What is the best solution?
Your best bet is to use Write Concerns - these allow you to tell MongoDB how important a piece of data is. The quickest Write Concern is also the least safe - the data is not flushed to disk until the next scheduled flush. The safest will confirm that the data has been written to disk on a number of machines before returning.
The write concern you are looking for is FSYNC_SAFE (at least that is what it is called from the point of view of the Java driver) or REPLICAS_SAFE which confirms that your data has been replicated.
Bear in mind that MongoDB does not have transactions in the traditional sense - your rollback will have to be rolled by hand as you can't tell the Mongo database to do this for you.
The other thing you need to do is either use the relatively new --journal option (which uses a Write Ahead Log), or use replica sets to share your data across many machines in order to maximise data integrity in the event of a crash/power loss.
Sharding is not so much a protection against hardware failure as a method for sharing the load when dealing with particularly large datasets - sharding shouldn't be confused with replica sets which is a way of writing data to more than one disk on more than one machine.
Therefore, if your data is valuable enough, you should definitely be using replica sets, perhaps even siting slaves in other data centres/availability zones/racks/etc in order to provide the resilience you require.
There is/will be (can't remember offhand whether this has been implemented yet) a way to specify the priority of individual nodes in a replica set such that if the master goes down the new master that is elected is one in the same data centre if such a machine is available (ie to stop a slave on the other side of the country from becoming master unless it really is the only other option).
I received a really nice answer from a person called GVP on google groups. I will quote it(basically it adds up to Rich's answer):
I'd like to be sure that my data is in the database after the
insert and rollback the whole batch if one element was not inserted.
This is a complex topic and there are several trade-offs you have to
consider here.
Should I use sharding?
Sharding is for scaling writes. For data safety, you want to look a
replica sets.
Should I use some specific mongoDB commands?
First thing to consider is "safe" mode or "getLastError()" as
indicated by Andreas. If you issue a "safe" write, you know that the
database has received the insert and applied the write. However,
MongoDB only flushes to disk every 60 seconds, so the server can fail
without the data on disk.
Second thing to consider is "journaling"
(v1.8+). With journaling turned on, data is flushed to the journal
every 100ms. So you have a smaller window of time before failure. The
drivers have an "fsync" option (check that name) that goes one step
further than "safe", it waits for acknowledgement that the data has
be flushed to the disk (i.e. the journal file). However, this only
covers one server. What happens if the hard drive on the server just
dies? Well you need a second copy.
Third thing to consider is
replication. The drivers support a "W" parameter that says "replicate
this data to N nodes" before returning. If the write does not reach
"N" nodes before a certain timeout, then the write fails (exception
is thrown). However, you have to configure "W" correctly based on the
number of nodes in your replica set. Again, because a hard drive
could fail, even with journaling, you'll want to look at replication.
Then there's replication across data centers which is too long to get
into here. The last thing to consider is your requirement to "roll
back". From my understanding, MongoDB does not have this "roll back"
capacity. If you're doing a batch insert the best you'll get is an
indication of which elements failed.
Here's a link to the PHP driver on this one: http://it.php.net/manual/en/mongocollection.batchinsert.php You'll have to check the details on replication and the W parameter. I believe the same limitations apply here.