MongoDB TTL doesn't delete documents if under load - mongodb

Use case
I am using MongoDB to persist messages from a message queue system (e. g. RabbitMQ / Kafka). Each message has a timestamp and based on that timestamp I want to expire the documents 1 hour afterwards. Therefore I've got a deleteAt field which is indexed and has set expireAfterSeconds: 0. Everything works fine, except if MongoDB is under heavy load.
We are inserting roughly 5-7k messages / second into a single replica set. The TTL Thread seems to be way slower than the rate of message coming in and thus the storage is quickly growing (which we want to avoid with TTLs).
To describe the behaviour more exactly, when I sort the messages by deleteAt ascending (oldest date first) I can see that it sometimes does not delete any of those messages for hours. Because of this observation I believe that the TTL thread sometimes is stuck or not active at all.
My question
What could I do to ensure that the TTL thread is not negatively impacted by the rate of messages coming in? According to our metrics our only bottleneck seems to be CPU, even though the SSD disk I/O is pretty high too.
Do I need to tune something (e. g. give MongoDB more threads for document deletion) so that the TTL thread can keep up with the write rate?

I believe I am facing a known bug as described in MongoDB's Jira Dashboard: https://jira.mongodb.org/browse/SERVER-19334

From https://docs.mongodb.com/manual/core/index-ttl/:
The background task that removes expired documents runs every 60 seconds. As a result, documents may remain in a collection during the period between the expiration of the document and the running of the background task.
Because the duration of the removal operation depends on the workload of your mongod instance, expired data may exist for some time beyond the 60 second period between runs of the background task.
I'm not aware of any way to tune that TTL thread, and I suspect you'll need to run your own cron to do batched deletes.
The other thing to look at might be what's taking up CPU and IO and see if there's any way of reducing that load.

You can create the index with "sparse", this should perform the clean up on a separate thread in the background.

Related

MongoDB oplog contains many Noops

I am trying to improve the oplog of my MongoDB server, because for now it's covering less hours, than I would like (I am not planning to increase oplog file size for now). What I found that there are many noops records in the oplog collection - { "op": "n" } + the whole document on "o". And they could take about ~20%-30% of the physical oplog size.
How could I find the reason for that, because it seems to be not ok ?
We are using MongoDB 3.6 + NodeJS 10 + Mongoose
p.s. it appears for many different collection and use cases, so it's hard to understand what is a application logic behind all these items.
No-op writes are expected in a MongoDB 3.4+ replica set in order to support the Max Staleness specification that helps applications avoid reading from stale secondaries and provides a more accurate measure of replication lag. These no-op writes only happen when the primary is idle. The idle write interval is not currently configurable (as at MongoDB 4.2).
The Max Staleness specification includes an example scenario and more detailed rationale for why the Primary must write periodic no-ops as well as other design decisions.
A relevant excerpt from the design rationale:
An idle primary must execute a no-op every 10 seconds (idleWritePeriodMS) to keep secondaries' lastWriteDate values close to the primary's clock. The no-op also keeps opTimes close to the primary's, which helps mongos choose an up-to-date secondary to read from in a CSRS.
Monitoring software like MongoDB Cloud Manager that charts replication lag will also benefit when spurious lag spikes are solved.

What is MongDB (WiredTiger) update query's default lock wait time?

I have embedded sub documents array in MongoDB document and multiple users can try add sub documents into the array. I use update($push) query to add document into array, but if multiple users try to add entry from UI, how do I make sure second $push doesn't fail due to lock by first? I would have chance of only couple of users adding entry at same into single document, so not to worry about the case where 100s of users exist. What is the default wait time of update in WiredTiger, so 2nd push doesn't abort immediately and can take upto 1 sec, but $push should complete successful?
I tried finding default wait time in MongoDB and WiredTiger docs, I could find transaction default wait times, but update query.
Internally, WiredTiger uses an optimistic locking method. This means that when two threads are trying to update the same document, one of them will succeed, and the other will back off. This will manifest as a "write conflict" (see metrics.operation.writeConflicts).
This conflict will be retried transparently, so from a client's point of view, the write will just take longer than usual.
The backing off algorithm will wait longer the more conflict it encounters, from 1 ms, and capped at 100 ms per wait. So the more conflict it encounters, it will eventually wait at 100 ms per retry.
Having said that, the design of updating a single document from multiple sources will have trouble scaling in the future due to two reasons:
MongoDB has a limit of 16 MB document size, so the array cannot be updated indefinitely.
Locking issues, as you have readily identified.
For #2, in a pathological case, a write can encounter conflict after conflict, waiting 100 ms between waits. There is no cap limit on the number of waits, so it can potentially wait for minutes. This is a sign that the workload is bottlenecked on a single document, and the app essentially operates on a single-thread model.
Typically the solution is to not create artificial bottlenecks, but to spread out the work across many different documents, perhaps in a separate collection. This way, concurrency can be maintained.

How does Mongo's eventual consistency work with a large number of data writes?

I have a flow like this:
I have a Worker that's processing a "large" batch (say, 1M records) and storing the results in Mongo.
Once the batch is complete, a notification message is sent to Publish, which then pulls all the records from Mongo for final publication.
Let's say the Worker write process is done, i.e. it has sent all 1M records to Mongo through a driver. Mongo is "eventually consistent" so I'm not 100% guaranteed all records are written to physical storage at the time the Notify Publish happens.
When Publish does a 'find' and gets a cursor on the collection holding the batch records, is the cursor smart enough to handle the eventual consistency?
So in practical terms let's imagine 750,000 records are actually physically written by Mongo when Notify Publish happens and Publish does its find. Will the cursor traverse 750,000 records and stop or will it block or otherwise handle the remaining 250,000 as they're eventually written to disk (which presumably is very likely to happen while publishing of the first 750K)?
As #BlakesSeven already noted in the comments, "eventual consistency" refers to the fact that in a replicated environment, when a write is finished on the primary, it will only be written to the secondaries eventually. You can modify this behavior at the cost of reduced write performance by setting the write concern to > 1. Setting it to "majority" basically guarantees that a write operation is durable even in case of a failover – though at a (in some cases) drastically reduced performance.
In general here is what happens when you do a write (simplified) with journaling enabled:
The operation is checked for being syntactically correct.
The query optimizer kicks in and does his stuff. (Irrelevant for this question, so I spare the details).
The write operation is applied to the in memory representation of the data set called "private view".
Every commitIntervalMs, the private view is synced to the journal, with a median of 15 or 50ms, depending on the write concern.
On sync, the operation is applied to the shared view. Iirc, this is the point where a new connection would be provided with the new data.
So in order to ensure that the data will be readable by the new connection, simply delay the publish notification by commitIntervalMs + 1, which, given your batch size, is hardly noticeable.

Mongo continues to insert documents, slowly, long after script is quit

Do I have a zombie somewhere?
My script finished inserting a massive amount of new data.
However, the server continues with high lock rates and slowly inserting new records. It's been about an hour since the script that did the inserts finished, and the documents are still trickling in.
Where are these coming from and how to I purge the queue? (I refactored the code to use an index and want to redo the process to avoid the 100-200% lock rate)
This could be because of following scenarios,
1..Throughput bound Disk IO
One can look into following metrics using "mongostat" and "MongoDB Management Service":
Average Flush time (how long MongoDB's periodic sync to disk is taking)
IOStats in the hardware tab (look specifically IOWait)
Since the Disk IO is slower than CPU processing time, all the inserts get queued up, and this can continue for longer duration, one can check the server status using db.serverStatus() and look into "globalLock"(as Write acquires global lock) field for "currrentQueue" associated with the lock, to see number of writers in queue.
2..Another possible cause could be Managed Sharded Cluster Balancer has been put in On Status(which is by Default On)
If you have been working on clustered environment, whenever write operation starts Balancer automatically gets ON, in-order to keep the cluster in balanced state, which can continue moving chunks from one shard to another even after completion of your scripts. In such a case I would rather suggest to keep the balancer off while having bulk load, in such a case all your documents goes to single shard, but balancer can be kicked on at any downtime.
3..Write Concern
This may also contribute to problem slightly, if they are set to Replica Acknowledged or Acknowledged mode, it depends on you, based on your type of data, to decide on these concerns.

Is it possible to use a cassandra table as a basic queue

Is it possible to use a table in cassandra as a queue, I don't think the strategy I use in mysql works, ie given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on datastax that describes the pitfalls and how to get around it, however Im not sure what the impact of having lots of tombstones lying around is, how long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue, us a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones will hang around after a compaction, and even if you dedicated the CPUs to only run compactions you would still have dead to live rations of tens of thousands or more. And even with a GC grace of zero tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill upp with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards, don't use a too small number). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages after time of arrival (for example with TIMEUUID clustering key) and can somehow keep track of the newest messages that has been delivered, you can do a query to find all messages after that message. That would mean less thrawling through tombstones for Cassandra, but it is no panacea.
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and it's compare-and-swap features there is no way to make that work correctly. To implement a lock you need to read the value of the column, check if it's not locked, then write that it should now be locked. Even with consistency level ALL another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do atomically, but at the cost of performance.
There are a couple of more answers here on StackOverflow about Cassandra and queues, read them (start with this: Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds.
The grace period can be defined. Per default it is 10 days:
gc_grace_seconds¶
(Default: 864000 [10 days]) Specifies the time to wait before garbage
collecting tombstones (deletion markers). The default value allows a
great deal of time for consistency to be achieved prior to deletion.
In many deployments this interval can be reduced, and in a single-node
cluster it can be safely set to zero. When using CLI, use gc_grace
instead of gc_grace_seconds.
Taken from the
documentation
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your worker to process one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of distributed database systems.
I highly recommend looking at specialized systems like messaging systems which support the queue pattern natively. Take a look at RabbitMQ for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.