MongoDB write and lock processes

I've been reading a lot about MongoDB recently, but one topic I can't find any clear material on is how data is written to the journal and oplog.
This is what I understand of the process so far; please correct me where I'm wrong:
A client connects to mongod and performs a write. The write is stored in the socket buffer.
When Mongo is available (not sure what available means at this point), data is written to the journal?
The MongoDB docs then say that every 60 seconds writes are flushed from the journal onto disk. By this I can only assume they mean written to the primary's data files and the oplog. If this is the case, how do writes appear earlier than the 60-second sync interval?
Some time later, secondaries pull data from the primary or their sync source and update their oplog and databases. The docs seem very vague about when exactly this happens and what delays it.
I'm also wondering, if journaling were disabled (I understand that's a really bad idea), at what point do the oplog and database get updated?
Lastly, I'm a bit stumped about which points in this process the write locks get created at. Is this just when the database and oplog are updated, or at other times too?
Thanks to anyone who can shed some light on this or point me to some reading material.
Simon

Here is what happens as far as I understand it. I simplified a bit, but it should make clear how it works.
A client connects to mongo. No writes are done so far, and no connection is torn down, because what happens now really depends on the write concern. Let's assume that we go with the (at the time of this writing) default "acknowledged".
The client sends its write operation. Here is where I am really not sure: either after this step or the next one, the acknowledgement is sent to the driver.
The write operation is run through the query optimizer. It is here that the acknowledgment is sent, because with an acknowledged write concern you may be returned a duplicate key error. It is possible that this was checked in the last step. If I had to bet, I'd say it is after this one.
The output of the query optimizer is then applied to the data in memory: actually to the data of the memory mapped datafiles, to the memory mapped oplog and to the journal's memory mapped files. Queries are answered from these memory mapped parts, or the corresponding data is mapped into memory for answering the query. The oplog is read from memory, too, if present.
Every 100ms in general the journal is synced to disk. The precise value is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of journaled, the driver will be notified now.
Every syncDelay seconds, the current state of the memory mapped files is synced to disk. I think the journal is then truncated to the entries which weren't applied to the data yet, but I am not too sure of that, since it should basically never happen that data in the journal isn't yet applied to the current data.
If you have read carefully, you noticed that the data is ready for the oplog as early as it has been run through the query optimizer and applied to the files mapped into memory. When the oplog entry is pulled by one of the secondaries, it is immediately applied to its own memory mapped files and synced to disk the same way as on the primary.
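To make those acknowledgement points concrete, here is a minimal PyMongo sketch (the answer doesn't name a driver, so PyMongo and the collection names are my assumptions):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client.get_database("test")

# w=0: fire-and-forget, the driver does not wait for any acknowledgement.
unacked = db.get_collection("events", write_concern=WriteConcern(w=0))
unacked.insert_one({"stage": "socket buffer only"})

# w=1 (the default "acknowledged"): the driver returns once the write has been
# applied to the primary's memory mapped files, before any disk sync.
acked = db.get_collection("events", write_concern=WriteConcern(w=1))
acked.insert_one({"stage": "applied in memory"})

# j=True ("journaled"): the driver returns only after the next journal commit,
# i.e. the roughly 100ms group commit described above.
journaled = db.get_collection("events", write_concern=WriteConcern(w=1, j=True))
journaled.insert_one({"stage": "flushed to the journal"})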
Some things to note: as soon as the relatively small data is written to the journal, it is quite safe. If a node goes down between two syncs to the datafiles, both the datafiles and the oplog can be restored from their last state in the datafiles and the journal. In general, the maximum data loss you can have is the operations recorded into the journal after the last commit, 50ms in the median case.
As for the locks: if you have read carefully, there aren't locks imposed on a database level when the data is synced to disk. Write locks may be created in order to ensure that only one thread at any given point in time modifies a given document. There are other write locks possible, but in general they should be rather rare.
Write locks on the filesystem layer are created once, though only implicitly, iirc. During application startup, a lock file is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those datafiles while a valid lock exists. And you shouldn't either ;)
Hope this helps.

Related

How does Mongo's eventual consistency work with a large number of data writes?

I have a flow like this:
I have a Worker that's processing a "large" batch (say, 1M records) and storing the results in Mongo.
Once the batch is complete, a notification message is sent to Publish, which then pulls all the records from Mongo for final publication.
Let's say the Worker write process is done, i.e. it has sent all 1M records to Mongo through a driver. Mongo is "eventually consistent" so I'm not 100% guaranteed all records are written to physical storage at the time the Notify Publish happens.
When Publish does a 'find' and gets a cursor on the collection holding the batch records, is the cursor smart enough to handle the eventual consistency?
So in practical terms let's imagine 750,000 records are actually physically written by Mongo when Notify Publish happens and Publish does its find. Will the cursor traverse 750,000 records and stop or will it block or otherwise handle the remaining 250,000 as they're eventually written to disk (which presumably is very likely to happen while publishing of the first 750K)?
As @BlakesSeven already noted in the comments, "eventual consistency" refers to the fact that in a replicated environment, when a write is finished on the primary, it will only be written to the secondaries eventually. You can modify this behavior, at the cost of reduced write performance, by setting the write concern to w > 1. Setting it to "majority" basically guarantees that a write operation is durable even in case of a failover, though at (in some cases) drastically reduced performance.
In general here is what happens when you do a write (simplified) with journaling enabled:
The operation is checked for being syntactically correct.
The query optimizer kicks in and does its stuff. (Irrelevant for this question, so I'll spare you the details.)
The write operation is applied to the in memory representation of the data set called "private view".
Every commitIntervalMs, the private view is synced to the journal, with a median of 15 or 50ms, depending on the write concern.
On sync, the operation is applied to the shared view. Iirc, this is the point where a new connection would be provided with the new data.
So in order to ensure that the data will be readable by the new connection, simply delay the publish notification by commitIntervalMs + 1, which, given your batch size, is hardly noticeable.
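A hedged sketch of that approach (PyMongo, the collection name, the notify_publish hook and the 100ms commit interval are all assumptions of mine, not part of the question):

import time
from pymongo import MongoClient

COMMIT_INTERVAL_MS = 100  # assumed journal commit interval of the server

client = MongoClient("mongodb://localhost:27017")
batch = client.test.batch_results

def notify_publish():
    # placeholder for whatever sends the message that triggers the Publish step
    pass

def worker_store(records):
    batch.insert_many(records)  # acknowledged writes, applied in memory on the primary
    # wait out one commit interval so a fresh connection is guaranteed to see the data
    time.sleep((COMMIT_INTERVAL_MS + 1) / 1000.0)
    notify_publish()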

Confirm basic understanding of MongoDB's acknowledged write concern

Using MongoDB (via PyMongo) in the default "acknowledged" write concern mode, is it the case that if I have a line that writes to the DB (e.g. a mapReduce that outputs a new collection) followed by a line that reads from the DB, the read will always see the changes from the write?
Further, is the above true for all stricter write concerns than "acknowledged," i.e. "journaled" and "replica acknowledged," but not true in the case of "unacknowledged"?
If the write has been acknowledged, it should have been written to memory, thus any subsequent query should get the current data. This won't work if you have a replica set and allow reads from secondaries.
Journaled writes are written to the journal file on disk, which protects your data in case of power / hardware failures, etc. This shouldn't have an impact on consistency, which is covered as soon as the data is in memory.
Any replica configuration in the write concern will ensure that writes need to be acknowledged by the majority / all nodes in the replica set. This will only make a difference if you read from replicas or to protect your data against unreachable / dead servers.
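A minimal PyMongo illustration of the scenario in the question, assuming reads are directed to the primary (database and collection names are made up):

from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017")
db = client.get_database("test", read_preference=ReadPreference.PRIMARY)

# Acknowledged write (the default): the server has applied it in memory.
db.orders.insert_one({"_id": 1, "status": "new"})

# A subsequent read on the same client against the primary sees the write,
# even though it may not have reached the journal or data files yet.
assert db.orders.find_one({"_id": 1}) is not None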
For example, in the case of the WiredTiger storage engine, there'll be a cache of pages in memory that are periodically written to and read from disk, depending on memory pressure. And in the case of the MMAPv1 storage engine, there would be a memory mapped address space that corresponds to pages on the disk. Then there is a secondary structure called the journal: a log of every single thing that the database processes. Notice that the journal is also in memory.
When does the journal get written to disk?
When the app requests something from the mongodb server via a TCP connection, the server is going to process the request and write it into the memory pages. But those pages may not be written to disk for quite a while, depending on memory pressure. The server is also going to record the request in the journal. By default, in the MongoDB driver, when we make a database request, we wait for the response, say an acknowledged insert/update. But we don't wait for the journal to be written to the disk. The value that represents whether we're going to wait for this write to be acknowledged by the server is called w.
w = 1
j = false
And by default, it's set to 1, which means: wait for this server to respond to the write. By default, j equals false. j, which stands for journal, represents whether or not we wait for the journal to be written to the disk before we continue. So, what are the implications of these defaults? The implications are that when we do an update/insert, we're really doing the operation in memory and not necessarily on disk. This means, of course, that it's very fast. And periodically (every few seconds) the journal gets written to the disk. It won't be long, but during this window of vulnerability, when the data has been written into the server's memory pages but the journal has not yet been persisted to the disk, if the server crashed, we could lose the data. We also have to realize, as programmers, that just because the write came back as good and was written successfully to memory, it may not ever persist to disk if the server subsequently crashes. And whether or not this is a problem depends on the application. For some applications, where there are lots of writes logging small amounts of data, we might find that it's very hard to even keep up with the data stream if we wait for the journal to get written to the disk, because the disk is going to be 100 times, 1,000 times slower than memory for every single write. But for other applications, we may find that it's completely necessary to wait for the write to be journaled and to know that it's been persisted to the disk before we continue. So, it's really up to us.
The w and j values together are called the write concern. They can be set in the driver at the collection level, database level or client level.
w = 1 : wait for the write to be acknowledged
w = 0 : do not wait for the write to be acknowledged
j = true : sync to journal
j = false : do not sync to journal
There are also other values for w that have significance. With w=1 & j=true we can make sure that those writes have been persisted to disk. Now, if the writes have been written to the journal and the server crashes, then even though the pages may not have been written back to disk yet, on recovery the mongod process can look at the journal on the disk and recreate all the writes that were not yet persisted to the pages, because they've been written to the journal. So that's why this gives us a greater level of safety.
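For illustration, a hedged PyMongo sketch of setting w and j at the client and collection level (the driver choice and the database/collection names are my assumptions):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Client-level default: w=1 means wait for the primary's acknowledgement,
# but not for the journal to be written to disk (j defaults to false).
client = MongoClient("mongodb://localhost:27017", w=1)

# Collection-level override: also wait for the journal sync before returning.
logs = client.test.get_collection(
    "logs", write_concern=WriteConcern(w=1, j=True))

client.test.metrics.insert_one({"fast": True})  # in-memory guarantee only
logs.insert_one({"durable": True})              # survives a crash after the ack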

Does MongoDB journaling guarantee durability?

Even if journaling is on, is there still a chance to lose writes in MongoDB?
"By default, the greatest extent of lost writes, i.e., those not made to the journal, are those made in the last 100 milliseconds."
This is from Manage Journaling, which indicates you could lose writes made since the last time the journal was flushed to disk.
If I want more durability, "To force mongod to commit to the journal more frequently, you can specify j:true. When a write operation with j:true is pending, mongod will reduce journalCommitInterval to a third of the set value."
Even in this case, it looks like flushing the journal to disk is asynchronous so there is still a chance to lose writes. Am I missing something about how to guarantee that writes are not lost?
Posting a new answer to clean this up. I performed tests and read the source code again, and I'm sure the confusion comes from an unfortunate sentence in the write concern documentation. With journaling enabled and a j:true write concern, the write is durable, and there is no mysterious window for data loss.
Even if journaling is on, is there still a chance to lose writes in MongoDB?
Yes, because the durability also depends on the individual operations write concern.
"By default, the greatest extent of lost writes, i.e., those not made to the journal, are those made in the last 100 milliseconds."
This is from Manage Journaling, which indicates you could lose writes made since the last time the journal was flushed to disk.
That is correct. The journal is flushed by a separate thread asynchronously, so you can lose everything since the last flush.
If I want more durability, "To force mongod to commit to the journal more frequently, you can specify j:true. When a write operation with j:true is pending, mongod will reduce journalCommitInterval to a third of the set value."
This irritated me, too. Here's what it means:
When you send a write operation with j:true, it doesn't trigger the disk flush immediately, and it doesn't happen on the network thread. That makes sense, because there could be dozens of applications talking to the same mongod instance. If every application used journaled writes a lot, the db would be very slow because it would be fsyncing all the time.
Instead, what happens is that the 'durability thread' will take all pending journal commits and flush them to disk. The thread is implemented like this (comments mine):
sleepmillis(oneThird); //dur.cpp, line 801
for( unsigned i = 1; i <= 2; i++ ) {
    // break, if any j:true write is pending
    if( commitJob._notify.nWaiting() )
        break;
    // or the number of bytes is greater than some threshold
    if( commitJob.bytes() > UncommittedBytesLimit / 2 )
        break;
    // otherwise, sleep another third
    sleepmillis(oneThird);
}
// fsync all pending writes
durThreadGroupCommit();
So a pending j:true operation will cause the journal commit thread to commit earlier than it normally would, and it will commit all pending writes to the journal, including those that don't have j:true set.
Even in this case, it looks like flushing the journal to disk is asynchronous so there is still a chance to lose writes. Am I missing something about how to guarantee that writes are not lost?
The write (or the getLastError command) with a j:true journaled write concern will wait for the durability thread to finish syncing, so there's no risk of data loss (as far as the OS and hardware guarantee that).
The sentence "However, there is a window between journal commits when the write operation is not fully durable" probably refers to a mongod running with journaling enabled that accepts a write that does NOT use the j:true write concern. In that case, there's a chance of the write getting lost since the last journal commit.
I filed a docs bug report for this.
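In driver terms, the j:true write concern described above can be requested per operation; a hedged PyMongo sketch (the driver choice and collection name are my assumptions):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
orders = client.test.orders

# Returns as soon as the primary has applied the write in memory.
orders.insert_one({"_id": 1, "total": 10})

# Waits until the durability thread has committed the journal, so the write
# is not lost even if mongod crashes right after the acknowledgement.
journaled = orders.with_options(write_concern=WriteConcern(j=True))
journaled.insert_one({"_id": 2, "total": 20})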
Maybe. Yes, it waits for the data to be written, but according to the docs "there is a window between journal commits when the write operation is not fully durable", whatever that is. I couldn't find out what they refer to.
I'm leaving the edited answer here, but I reversed myself back-and-forth, so it's a bit irritating:
This is a bit tricky, because there are a lot of levers you can pull:
Your MongoDB setup
Assuming that journaling is activated (default for 64 bit), the journal will be committed in regular intervals. The default value for the journalCommitInterval is 100ms if the journal and the data files are on the same block device, or 30ms if they aren't (so it's preferable to have the journal on a separate disk).
You can also change the journalCommitInterval to as little as 2ms, but it will increase the number of write operations and reduce overall write performance.
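As a hedged example, on MMAPv1-era servers the interval could also be lowered at runtime via setParameter; whether this parameter is accepted depends on the server version and storage engine, so treat it as an illustration only:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Lower the journal commit interval to 50 ms (MMAPv1-era parameter name;
# newer servers use the storage.journal.commitIntervalMs config option).
client.admin.command("setParameter", 1, journalCommitInterval=50)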
The Write Concern
You need to specify a write concern that tells the driver and the database to wait until the data is written to disk. However, this won't wait until the data has actually been written to the disk, because that would take 100ms in a bad-case scenario with the default setup.
So, at the very best, there's a 2ms window where data can get lost. That's insufficient for a number of applications, however.
The fsync command forces a disk flush of all data files, but that's unnecessary if you use journaling, and it's inefficient.
Real-Life Durability
Even if you were to journal every write, what is it good for if the datacenter administrator has a bad day and uses a chainsaw on your hardware, or the hardware simply disintegrates itself?
Redundant storage, not on a block device level like RAID, but on a much higher level is a better option for many scenarios: Have the data in different locations or at least on different machines using a replica set and use the w:majority write concern with journaling enabled (journaling will only apply on the primary, though). Use RAID on the individual machines to increase your luck.
This offers the best tradeoff of performance, durability and consistency. Also, it allows you to adjust the write concern for every write and has good availability. If the data is queued for the next fsync on three different machines, it might still be 30ms to the next journal commit on any of the machines (worst case), but the chance of three machines going down within the 30ms interval is probably a millionfold lower than the chainsaw-massacre-admin scenario.
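A hedged sketch of that recommendation with PyMongo (the replica set hosts, database/collection names and the wtimeout value are placeholders I invented):

from pymongo import MongoClient
from pymongo.errors import WTimeoutError
from pymongo.write_concern import WriteConcern

client = MongoClient(
    "mongodb://node1.example.net,node2.example.net,node3.example.net/"
    "?replicaSet=rs0")

payments = client.bank.get_collection(
    "payments",
    write_concern=WriteConcern(w="majority", j=True, wtimeout=5000))

try:
    # Acknowledged only once a majority of members have the write and the
    # primary has committed it to its journal.
    payments.insert_one({"amount": 100})
except WTimeoutError:
    # The write may still complete later; the requested guarantee simply
    # wasn't met within the timeout.
    pass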
Evidence
TL;DR: I think my answer above is correct.
The documentation can be a little irritating, especially with regards to wtimeout, so I checked the source. I'm not an expert on the mongo source, so take this with a grain of salt:
In write_concern.cpp, we find (edited for brevity):
if ( cmdObj["j"].trueValue() ) {
    if( !getDur().awaitCommit() ) {
        // --journal is off
        result->append("jnote", "journaling not enabled on this server");
    } // ...
}
else if ( cmdObj["fsync"].trueValue() ) {
    if( !getDur().awaitCommit() ) {
        // if we get here, we're not running with --journal
        log() << "fsync from getlasterror" << endl;
        result->append( "fsyncFiles" , MemoryMappedFile::flushAll( true ) );
    }
}
Note the call MemoryMappedFile::flushAll( true ) if fsync is set. This call is clearly not in the first branch. Otherwise, durability is handled on a separate thread (relevant files are prefixed dur_).
That explains what wtimeout is for: it refers to the time waiting for slaves, and has nothing to do with I/O or fsync on the server.
Journaling is for keeping the data on a particular mongod in a consistent state, even in case of chainsaw madness; however, with client settings through the write concern it can be used to force durability. About the write concern: DOCS.
There is an option, j:1, which you can read about here, which ensures that the particular write operation waits for acknowledgement until it is written to the journal file on disk (so not just to the memory map). However, these docs say the opposite. :) I would vote for the first case; it makes me feel more comfortable.
If you run lots of commands with that option, mongodb will adapt the commit interval of the journal to speed things up; you can read about it here: DOCS (this is the one you also mentioned), and as others already said, you can specify an interval between 2-300ms.
Durability is, in my opinion, much better ensured by the w:2 option: if the update/write operation is acknowledged by two members of a replica set, it is really unlikely to lose both in the same minute (the datafile flush interval), but not impossible.
Using both options means that when the operation is acknowledged by the database cluster, it will reside in memory on two different boxes, and on one of them it will also be in a consistent, recoverable place on disk.
Generally, lost writes are an issue in every system where there is buffering/caching/delayed writing involved between a system's runtime and permanent (non-volatile) storage, even at the OS level (for example write-behind caching). So there is always a chance to lose writes: even if your concrete provider (MongoDB) provides functionality for transaction durability, it's the underlying OS that is responsible for ultimately writing the data, and even then there is caching at the device level... And that's just the lower levels; making the system highly concurrent, distributed and performant only makes matters worse.
In short there is no absolute durability, only practical/eventual/hope-for-the-best durability especially with a NoSQL storage like Mongo, which isn't primarily made for consistency and durability in the first place.
I would have to agree with Sammaye that journaling has little to do with durability. However, if you want to get an answer to whether you can really trust mongodb to store your data with good consistency, then I would suggest that you read this blog post. There is a reply from 10gen regarding that post, and a reply from the author to the 10gen post. I would suggest that you read into it to make an educated decision. It took me some time to understand all the details on my own, but this post has the basics covered.
The response to the blog post was given here by 10gen, the company that makes mongodb.
And the response to the response was given by the professor on this post.
It explains a lot about how Mongodb can shard data, how it actually functions, and the performance hits it takes if you add on extra safety locks. I strongly want to say that these three writings are by far the most comprehensive things out there that talk about the benefits and drawbacks of mongodb. If you think it's one sided, look at the comments and see what people had to say, because if something received a reply from the company that made the software, then it must have made some good points at least.

Under what circumstances is a change not written to a MongoDB database when safe mode is off?

I have safe mode turned off in my MongoDB database because none of the data being written is absolutely 100% mission critical and the gain in insertion speed is very important, but I would really prefer if all of the data is written to the database.
My understanding is that with journaling turned on and safe mode turned off, if the server crashes in the 100ms between when a write request is received and the data is output to the journal, the data can be lost.
If the data is successfully written to the journal, is it a pretty safe bet, even if the database is lagging due to heavy load, that the data will end up in the database when the database catches up and is able to process what's in the journal? Or is my understanding of what the journal does flawed? Are there any other circumstances under which inserted data may be lost?
What happens if I update a document a fraction of a second before another process attempts to read it, but the changes haven't been committed to the collection yet? Will the read block until the insert has completed?
The read will only be blocked until the insert has completed if the read is requested on the same connection as the write. There is no guarantee that once the data is written, it will be immediately visible to other connections unless proper getLastError is used.
Data is processed and put into the memory mapped data region before journaling. However, it may not be "fsynced" to disk as often as the journal. This means that even if the load is high, the data should eventually be updated and become visible to other connections. The journal data is only used to restore durability when the mongod instance unexpectedly crashes.
Your data may be lost due to network interruption, disk corruption, duplicate key errors on an index, etc.

How safe is MongoDB's safe mode on inserts?

I am working on a project which has some important data in it. This means we cannot afford to lose any of it if the power or server goes down. We are using MongoDB for the database. I'd like to be sure that my data is in the database after the insert, and roll back the whole batch if one element was not inserted. I know it is the philosophy behind Mongo that we do not need transactions, but how can I make sure that my data is really safely stored after an insert rather than sent to some "black hole"?
Should I make a search?
Should I use some specific mongoDB commands?
Should I use sharding even if one server is enough for satisfying the speed, and by the way it doesn't guarantee anything if the power goes down?
What is the best solution?
Your best bet is to use Write Concerns - these allow you to tell MongoDB how important a piece of data is. The quickest Write Concern is also the least safe - the data is not flushed to disk until the next scheduled flush. The safest will confirm that the data has been written to disk on a number of machines before returning.
The write concern you are looking for is FSYNC_SAFE (at least that is what it is called from the point of view of the Java driver) or REPLICAS_SAFE which confirms that your data has been replicated.
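Those names are from the Java driver; with PyMongo, rough equivalents would look like the following sketch (my translation, not from the answer; fsync-style write concerns are deprecated on recent servers):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client.important

# Roughly FSYNC_SAFE: wait until the data has been flushed to disk
# (with journaling enabled, until it has reached the journal).
fsync_safe = db.get_collection("docs", write_concern=WriteConcern(fsync=True))

# Roughly REPLICAS_SAFE: wait until at least two members have the write.
replicas_safe = db.get_collection("docs", write_concern=WriteConcern(w=2))

fsync_safe.insert_one({"critical": True})
replicas_safe.insert_one({"critical": True})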
Bear in mind that MongoDB does not have transactions in the traditional sense - your rollback will have to be done by hand, as you can't tell the Mongo database to do this for you.
The other thing you need to do is either use the relatively new --journal option (which uses a Write Ahead Log), or use replica sets to share your data across many machines in order to maximise data integrity in the event of a crash/power loss.
Sharding is not so much a protection against hardware failure as a method for sharing the load when dealing with particularly large datasets - sharding shouldn't be confused with replica sets, which are a way of writing data to more than one disk on more than one machine.
Therefore, if your data is valuable enough, you should definitely be using replica sets, perhaps even siting slaves in other data centres/availability zones/racks/etc in order to provide the resilience you require.
There is/will be (can't remember offhand whether this has been implemented yet) a way to specify the priority of individual nodes in a replica set such that if the master goes down the new master that is elected is one in the same data centre if such a machine is available (ie to stop a slave on the other side of the country from becoming master unless it really is the only other option).
I received a really nice answer from a person called GVP on Google Groups. I will quote it (basically it adds to Rich's answer):
I'd like to be sure that my data is in the database after the insert and rollback the whole batch if one element was not inserted.
This is a complex topic and there are several trade-offs you have to consider here.
Should I use sharding?
Sharding is for scaling writes. For data safety, you want to look at replica sets.
Should I use some specific mongoDB commands?
The first thing to consider is "safe" mode or "getLastError()" as indicated by Andreas. If you issue a "safe" write, you know that the database has received the insert and applied the write. However, MongoDB only flushes to disk every 60 seconds, so the server can fail without the data on disk.
The second thing to consider is "journaling" (v1.8+). With journaling turned on, data is flushed to the journal every 100ms, so you have a smaller window of time before failure. The drivers have an "fsync" option (check that name) that goes one step further than "safe": it waits for acknowledgement that the data has been flushed to the disk (i.e. the journal file). However, this only covers one server. What happens if the hard drive on the server just dies? Well, you need a second copy.
The third thing to consider is replication. The drivers support a "W" parameter that says "replicate this data to N nodes" before returning. If the write does not reach "N" nodes before a certain timeout, then the write fails (an exception is thrown). However, you have to configure "W" correctly based on the number of nodes in your replica set. Again, because a hard drive could fail, even with journaling, you'll want to look at replication. Then there's replication across data centers, which is too long to get into here.
The last thing to consider is your requirement to "roll back". From my understanding, MongoDB does not have this "roll back" capacity. If you're doing a batch insert, the best you'll get is an indication of which elements failed.
Here's a link to the PHP driver on this one: http://it.php.net/manual/en/mongocollection.batchinsert.php You'll have to check the details on replication and the W parameter. I believe the same limitations apply here.
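For completeness, a hedged PyMongo sketch of an unordered batch insert that reports which elements failed (the collection name and the deliberately duplicated _id are mine, purely for illustration):

from pymongo import MongoClient
from pymongo.errors import BulkWriteError

client = MongoClient("mongodb://localhost:27017")
coll = client.test.items

docs = [{"_id": 1}, {"_id": 1}, {"_id": 2}]  # second doc triggers a duplicate key error

try:
    # ordered=False keeps inserting after a failure, which gives you the
    # "indication of which elements failed" mentioned above.
    coll.insert_many(docs, ordered=False)
except BulkWriteError as exc:
    failed = [err["index"] for err in exc.details["writeErrors"]]
    print("failed elements:", failed)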