Do I have a zombie somewhere?
My script finished inserting a massive amount of new data.
However, the server still shows high lock rates and is slowly inserting new records. It's been about an hour since the script that did the inserts finished, and documents are still trickling in.
Where are these coming from, and how do I purge the queue? (I refactored the code to use an index and want to redo the process while avoiding the 100-200% lock rate.)
This could be due to one of the following scenarios:
1. Throughput-bound disk I/O
You can look into the following metrics using mongostat and the MongoDB Management Service (MMS):
Average Flush time (how long MongoDB's periodic sync to disk is taking)
IOStats in the hardware tab (look specifically at IOWait)
Since disk I/O is slower than CPU processing, the inserts get queued up, and this can continue for quite a while after the script finishes. You can check the server status with db.serverStatus() and look at the "globalLock" field (writes acquire the global lock), specifically its "currentQueue", to see the number of writers waiting in the queue.
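A minimal sketch of that check from Python with pymongo (the connection string is a placeholder; the fields read are the serverStatus fields mentioned above):

from pymongo import MongoClient

# Hypothetical connection string; adjust host/port for your deployment.
client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# globalLock.currentQueue shows clients queued behind the global lock;
# a persistently large "writers" value suggests inserts are still
# backed up behind disk I/O.
queue = status["globalLock"]["currentQueue"]
print("queued readers:", queue["readers"])
print("queued writers:", queue["writers"])

# backgroundFlushing.average_ms is the average flush-to-disk time
# (reported by MMAPv1); high values point at the disk being the bottleneck.
flush = status.get("backgroundFlushing", {})
print("average flush ms:", flush.get("average_ms"))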
2. The sharded cluster balancer is turned on (which is the default)
If you are working in a sharded environment, the balancer keeps the cluster in a balanced state whenever writes come in, and it can keep moving chunks from one shard to another even after your script has completed. In such a case I would suggest keeping the balancer off while doing the bulk load; all your documents then go to a single shard initially, but the balancer can be kicked on again during any downtime (a sketch follows).
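A rough sketch of toggling the balancer around a bulk load with pymongo; the mongos address is hypothetical, and the balancerStop/balancerStart admin commands assume MongoDB 3.4+ (older versions flip a flag in config.settings instead):

from pymongo import MongoClient

# Connect to a mongos, not directly to a shard (hypothetical address).
mongos = MongoClient("mongodb://mongos.example.net:27017")

# MongoDB 3.4+: dedicated admin commands (what sh.stopBalancer() calls).
mongos.admin.command("balancerStop")
# ... run the bulk load here ...
mongos.admin.command("balancerStart")

# Older versions: flip the "stopped" flag in the config database instead.
# mongos.config.settings.update_one(
#     {"_id": "balancer"}, {"$set": {"stopped": True}}, upsert=True
# )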
3. Write concern
This may also contribute slightly to the problem if writes are issued with the Replica Acknowledged or Acknowledged write concern; which concern to use is up to you, based on the type of data you are storing.
Related
Use case
I am using MongoDB to persist messages from a message queue system (e.g. RabbitMQ / Kafka). Each message has a timestamp, and based on that timestamp I want to expire the documents 1 hour later. Therefore I have a deleteAt field which is indexed and created with expireAfterSeconds: 0. Everything works fine, except when MongoDB is under heavy load.
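For reference, a minimal sketch of the setup described above using pymongo (database and collection names are made up):

from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient()               # assumes a local mongod
messages = client.mq.messages        # hypothetical database/collection names

# TTL index: documents become eligible for removal once "deleteAt" is in the past.
messages.create_index("deleteAt", expireAfterSeconds=0)

# Each message carries its expiry timestamp explicitly (1 hour after the
# message timestamp), which is why expireAfterSeconds stays 0.
now = datetime.utcnow()
messages.insert_one({
    "payload": "...",
    "timestamp": now,
    "deleteAt": now + timedelta(hours=1),
})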
We are inserting roughly 5-7k messages/second into a single replica set. The TTL thread seems to be way slower than the rate of messages coming in, and thus the storage is quickly growing (which is exactly what we want to avoid with TTLs).
To describe the behaviour more exactly: when I sort the messages by deleteAt ascending (oldest date first), I can see that sometimes none of those messages are deleted for hours. Because of this observation I believe that the TTL thread is sometimes stuck or not active at all.
My question
What could I do to ensure that the TTL thread is not negatively impacted by the rate of messages coming in? According to our metrics our only bottleneck seems to be CPU, even though the SSD disk I/O is pretty high too.
Do I need to tune something (e.g. give MongoDB more threads for document deletion) so that the TTL thread can keep up with the write rate?
I believe I am facing a known bug as described in MongoDB's Jira Dashboard: https://jira.mongodb.org/browse/SERVER-19334
From https://docs.mongodb.com/manual/core/index-ttl/:
The background task that removes expired documents runs every 60 seconds. As a result, documents may remain in a collection during the period between the expiration of the document and the running of the background task.
Because the duration of the removal operation depends on the workload of your mongod instance, expired data may exist for some time beyond the 60 second period between runs of the background task.
I'm not aware of any way to tune that TTL thread, and I suspect you'll need to run your own cron to do batched deletes.
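A sketch of what such a batched delete job could look like in pymongo (the bounded batch size and the collection names are assumptions, not a tested recommendation):

from datetime import datetime
from pymongo import MongoClient

client = MongoClient()
messages = client.mq.messages  # hypothetical names, matching the use case above

def purge_expired(batch_size=10000):
    """Delete expired messages in bounded batches so a single huge delete
    does not compete with the insert load for too long."""
    while True:
        expired_ids = [
            doc["_id"]
            for doc in messages.find(
                {"deleteAt": {"$lt": datetime.utcnow()}},
                projection={"_id": 1},
            ).limit(batch_size)
        ]
        if not expired_ids:
            break
        messages.delete_many({"_id": {"$in": expired_ids}})

# Run this from cron or a scheduler every minute or so.
purge_expired()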
The other thing to look at might be what's taking up CPU and IO and see if there's any way of reducing that load.
You can create the index with "sparse"; this should perform the cleanup on a separate thread in the background.
I have a flow like this:
I have a Worker that's processing a "large" batch (say, 1M records) and storing the results in Mongo.
Once the batch is complete, a notification message is sent to Publish, which then pulls all the records from Mongo for final publication.
Let's say the Worker write process is done, i.e. it has sent all 1M records to Mongo through a driver. Mongo is "eventually consistent" so I'm not 100% guaranteed all records are written to physical storage at the time the Notify Publish happens.
When Publish does a 'find' and gets a cursor on the collection holding the batch records, is the cursor smart enough to handle the eventual consistency?
So in practical terms, let's imagine 750,000 records are actually physically written by Mongo when Notify Publish happens and Publish does its find. Will the cursor traverse 750,000 records and stop, or will it block or otherwise handle the remaining 250,000 as they're eventually written to disk (which presumably is very likely to happen while the first 750K are being published)?
As @BlakesSeven already noted in the comments, "eventual consistency" refers to the fact that in a replicated environment, when a write is finished on the primary, it will only be written to the secondaries eventually. You can modify this behavior, at the cost of reduced write performance, by setting the write concern to > 1. Setting it to "majority" basically guarantees that a write operation is durable even in case of a failover, though at a (in some cases) drastically reduced performance.
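To illustrate, a hedged sketch of what a "majority" write concern could look like for the Worker's inserts with pymongo (the connection string, database, and collection names are placeholders):

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://rs0.example.net:27017/?replicaSet=rs0")

# With w="majority", the insert only returns once a majority of replica-set
# members have acknowledged it, trading latency for durability.
records = client.batchdb.get_collection(
    "records", write_concern=WriteConcern(w="majority")
)
records.insert_one({"batch_id": 42, "value": "..."})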
In general here is what happens when you do a write (simplified) with journaling enabled:
The operation is checked for being syntactically correct.
The query optimizer kicks in and does its stuff. (Irrelevant for this question, so I'll spare the details.)
The write operation is applied to the in memory representation of the data set called "private view".
Every commitIntervalMs, the private view is synced to the journal; the median wait is roughly 15 or 50 ms, depending on the write concern.
On sync, the operation is applied to the shared view. IIRC, this is the point where a new connection would be provided with the new data.
So in order to ensure that the data will be readable by the new connection, simply delay the publish notification by commitIntervalMs + 1, which, given your batch size, is hardly noticeable.
I have a large collection with more than half a million docs, which I need to update continuously. To achieve this, my first approach was to use w=1 to ensure the write result, which causes a lot of delay.
collection.update(
    {'_id': _id},
    {'$set': data},
    w=1
)
So I decided to use w=0 in my update method, and now the performance is significantly faster.
Given my past bitter experience with MongoDB, I'm not sure whether all updates are guaranteed when w=0. My question is: are updates guaranteed to be applied when using w=0?
Edit: Also, I would like to know how it works. Does it create an internal queue and perform the updates asynchronously one by one? Using mongostat, I saw that some updates were still being processed even after the Python script quit. Or is the update instant?
Edit 2: According to the answer of Sammaye, link, any error can cause a silent failure. But what happens under a heavy load of updates? Do some updates fail then?
No, w=0 can fail; it is only:
http://docs.mongodb.org/manual/core/write-concern/#unacknowledged
Unacknowledged is similar to errors ignored; however, drivers will attempt to receive and handle network errors when possible.
Which means that the write can fail silently within MongoDB itself.
It is not reliable if you need a specific guarantee. At the end of the day, if you wish to touch the database and get an acknowledgement from it, then you must wait; laws of physics.
Does w:0 guarantee an update?
As Sammaye has written: No, since there might be a time when the data is only applied to the in-memory data and not yet written to the journal. So if there is an outage during this time, which, depending on the configuration, is somewhere between 10 ms (with j:1 and the journal and the datafiles living on separate block devices) and 100 ms by default, your update may be lost.
Please keep in mind that illegal updates (such as changing the _id of a document) will silently fail.
How does the update work with w:0?
Assuming there are no network errors, the driver will return as soon as it has sent the operation to the mongod/mongos instance with w:0. But let's look a bit further to give you an idea of what happens under the hood.
Next, the update will be processed by the query optimizer and applied to the in-memory data set. After successful application of the operation, a write with write concern w:1 would return at this point. The applied operations are synced to the journal every commitIntervalMs, which is divided by 3 with write concern j:1. If you have a write concern of {j:1}, the driver will return after the operations have been stored in the journal successfully. Note that there are still edge cases in which data that made it to the journal won't be applied to replica set members in case a very "well" timed outage occurs at this point.
By default, every syncPeriodSecs, the data from the journal is applied to the actual data files.
Regarding what you saw in mongostat: its granularity isn't very high, so you may well be seeing operations that took place in the past. As discussed, the update to the in-memory data isn't instant, since the update first has to pass through the query optimizer.
Will heavy load make updates silently fail with w:0?
In general, it is safe to say "No." And here is why:
For each connection, there is a certain amount of RAM allocated. If the load is so high that mongo can't allocate any further RAM, there would be a connection error – which is dealt with, regardless of the write concern, except for unacknowledged writes.
Furthermore, the application of updates to the in-memory data is extremely fast, most likely still faster than they come in when we are talking about load peaks. If mongod is totally overloaded (e.g. 150k updates a second on a standalone mongod with spinning disks), problems might occur, of course, though even that is usually mitigated, from a durability point of view, by the underlying OS.
However, updates still may silently disappear in case of an outage when the write concern is w:0,j:0 and the outage happens in the time the update is not synced to the journal.
Notes:
The optimal balance between maximum performance and minimal guaranteed durability is a write concern of j:1. With a proper setup, you can reduce the latency to slightly over 10ms.
To further reduce the latency/update, it might be worth having a look at bulk write operations, if those apply to your use case. In my experience, they do more often than not. Please read and try before dismissing the idea.
Doing write operations with w:0,j:0 is highly discouraged in case you expect any guarantee of data durability. Use at your own risk. This write concern is only meant for "cheap" data which is easy to reobtain, or where the need for speed exceeds the need for durability. Collecting real-time weather data at a large scale would be an example: the system still works even if one or two data points are missing here and there. For most applications, durability is a concern. Conclusion: use w:1,j:1 at least for durable writes.
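As a hedged illustration of that conclusion, here is how the original update could be issued with an acknowledged, journaled write concern in pymongo (database and collection names, and the placeholder values, are assumptions):

from pymongo import MongoClient, WriteConcern

client = MongoClient()
collection = client.mydb.get_collection(
    "mycoll",                                   # hypothetical names
    write_concern=WriteConcern(w=1, j=True),    # acknowledged + journaled
)

_id, data = 1, {"status": "processed"}          # placeholder values
# Returns only after the operation is acknowledged and written to the
# journal on the primary, closing the w:0 window discussed above.
collection.update_one({"_id": _id}, {"$set": data})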
I've been read a lot about MongoDB recently, but one topic I can't find any clear material on, is how data is written to the journal and oplog.
So this is what I understand of the process so far, please correct me where I'm wrong
A client connects to mongod and performs a write. The write is stored in the socket buffer.
When Mongo is available (not sure what available means at this point), data is written to the journal?
The MongoDB docs then say that every 60 seconds writes are flushed from the journal onto disk. By this I can only assume this means written to the primary and the oplog. If this is the case, how do writes appear earlier than the 60-second sync interval?
Some time later, secondaries suck data from the primary or their sync source and update their oplog and databases. It seems very vague about when exactly this happens and what delays it.
I'm also wondering if journaling was disabled (I understand that's a really bad idea), at what point does the oplog and database get updated?
Lastly, I'm a bit stumped about at which points in this process the write locks get created. Is this just when the database and oplog are updated, or at other times too?
Thanks to anyone who can shed some light on this or point me to some reading material.
Simon
Here is what happens as far as I understand it. I simplified a bit, but it should make clear how it works.
A client connects to mongod. No writes are done so far, and no connection is torn down, because what happens now really depends on the write concern. Let's assume that we go with the (at the time of this writing) default of "acknowledged".
The client sends its write operation. Here is where I am really not sure. Either after this step or the next one, the acknowledgement is sent to the driver.
The write operation is run through the query optimizer. It is here where the acknowledgment is sent, because with an acknowledged write concern you may be returned a duplicate key error. It is possible that this was checked in the last step; if I had to bet, I'd say it happens after this one.
The output of the query optimizer is then applied to the data in memory, actually to the data of the memory-mapped datafiles, to the memory-mapped oplog, and to the journal's memory-mapped files. Queries are answered from these memory-mapped parts, or the according data is mapped into memory for answering the query. The oplog is read from memory too, if present.
Every 100ms in general the journal is synced to disk. The precise value is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of journaled, the driver will be notified now.
Every syncDelay seconds, the current state of the memory-mapped files is synced to disk. I think the journal is then truncated to the entries which weren't applied to the data yet, but I am not too sure of that, since it should basically never happen that data in the journal isn't yet applied to the current data.
If you have read carefully, you noticed that the data is ready for the oplog as soon as it has been run through the query optimizer and applied to the files mapped into memory. When the oplog entry is pulled by one of the secondaries, it is immediately applied to its data of the memory-mapped files and synced to disk the same way as on the primary.
Some things to note: as soon as the relatively small data is written to the journal, it is quite safe. If a node goes down between two syncs to the datafiles, both the datafiles and the oplog can be restored from their last state in the datafiles and the journal. In general, the maximum data loss you can have is the operations recorded into the journal after the last commit, which is 50 ms in the median.
As for the locks: if you have read carefully, there aren't locks imposed on a database level when the data is synced to disk. Write locks may be created in order to ensure that only one thread at any given point in time modifies a given document. There are other write locks possible, but in general they should be rather rare.
Write locks on the filesystem layer are created once, though only implicitly, IIRC. During application startup, a lock file is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those datafiles while a valid lock exists. And you shouldn't either ;)
Hope this helps.
I am working on a project which has some important data in it. This means we cannot afford to lose any of it if the lights or the server go down. We are using MongoDB as the database. I'd like to be sure that my data is in the database after the insert, and to roll back the whole batch if one element was not inserted. I know it is the philosophy behind Mongo that we do not need transactions, but how can I make sure that my data is really safely stored after an insert rather than sent to some "black hole"?
Should I make a search?
Should I use some specific mongoDB commands?
Should I use sharding, even if one server is enough to satisfy the speed requirements (and by the way, it doesn't guarantee anything if the light goes down)?
What is the best solution?
Your best bet is to use Write Concerns - these allow you to tell MongoDB how important a piece of data is. The quickest Write Concern is also the least safe - the data is not flushed to disk until the next scheduled flush. The safest will confirm that the data has been written to disk on a number of machines before returning.
The write concern you are looking for is FSYNC_SAFE (at least that is what it is called from the point of view of the Java driver) or REPLICAS_SAFE which confirms that your data has been replicated.
Bear in mind that MongoDB does not have transactions in the traditional sense: your rollback will have to be done by hand, as you can't tell the Mongo database to do this for you.
The other thing you need to do is either use the relatively new --journal option (which uses a Write Ahead Log), or use replica sets to share your data across many machines in order to maximise data integrity in the event of a crash/power loss.
Sharding is not so much a protection against hardware failure as a method for sharing the load when dealing with particularly large datasets - sharding shouldn't be confused with replica sets which is a way of writing data to more than one disk on more than one machine.
Therefore, if your data is valuable enough, you should definitely be using replica sets, perhaps even siting slaves in other data centres/availability zones/racks/etc in order to provide the resilience you require.
There is/will be (I can't remember offhand whether this has been implemented yet) a way to specify the priority of individual nodes in a replica set, such that if the master goes down, the new master that is elected is one in the same data centre if such a machine is available (i.e. to stop a slave on the other side of the country from becoming master unless it really is the only other option).
I received a really nice answer from a person called GVP on Google Groups. I will quote it (basically it builds on Rich's answer):
I'd like to be sure that my data is in the database after the insert and rollback the whole batch if one element was not inserted.

This is a complex topic and there are several trade-offs you have to consider here.

Should I use sharding?

Sharding is for scaling writes. For data safety, you want to look at replica sets.

Should I use some specific mongoDB commands?

First thing to consider is "safe" mode or "getLastError()", as indicated by Andreas. If you issue a "safe" write, you know that the database has received the insert and applied the write. However, MongoDB only flushes to disk every 60 seconds, so the server can fail without the data on disk.

Second thing to consider is "journaling" (v1.8+). With journaling turned on, data is flushed to the journal every 100ms, so you have a smaller window of time before failure. The drivers have an "fsync" option (check that name) that goes one step further than "safe": it waits for acknowledgement that the data has been flushed to the disk (i.e. the journal file). However, this only covers one server. What happens if the hard drive on the server just dies? Well, you need a second copy.

Third thing to consider is replication. The drivers support a "W" parameter that says "replicate this data to N nodes" before returning. If the write does not reach "N" nodes before a certain timeout, then the write fails (an exception is thrown). However, you have to configure "W" correctly based on the number of nodes in your replica set. Again, because a hard drive could fail, even with journaling, you'll want to look at replication. Then there's replication across data centers, which is too long to get into here.

The last thing to consider is your requirement to "roll back". From my understanding, MongoDB does not have this "roll back" capacity. If you're doing a batch insert, the best you'll get is an indication of which elements failed.
Here's a link to the PHP driver on this one: http://it.php.net/manual/en/mongocollection.batchinsert.php You'll have to check the details on replication and the W parameter. I believe the same limitations apply here.
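The quoted advice about the "W" parameter maps onto write concerns in the drivers; as a hedged Python/pymongo sketch (the replica set address, database, and collection names are placeholders, and w=2 / wtimeout are example values):

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://rs0.example.net:27017/?replicaSet=rs0")

# w=2: the batch insert is only acknowledged once the data has reached at
# least two replica-set members; wtimeout bounds how long the server waits.
important = client.mydb.get_collection(
    "important",  # hypothetical database/collection names
    write_concern=WriteConcern(w=2, wtimeout=5000),
)

docs = [{"seq": i, "payload": "..."} for i in range(1000)]
result = important.insert_many(docs, ordered=True)
print(len(result.inserted_ids), "documents acknowledged by at least 2 members")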