Mongodb cluster sudden crashes with read / write concern queries - mongodb

We have a mongodb cluster with 5 PSA replica sets and one sharded collection. About 3,5 TB of data, 2 billion docs on primaries. Average insert rate: 300rps. Average select rate: 1000rps. Mongodb version 4.0.6. Collection has only one extra unique index, all read queries use one of the indexes (no long running queries).
PROBLEM. Sometimes (4 times in a last 2 month) one of the nodes stops responding to the queries with specified read concern or write concern. The same query without read/write concern executes successfully regardless of doing it locally or through mongos. These queries never execute, no errors, no timeouts even when restarting mongos, that initiate the query. No errors in mongod logs, no errors in system logs. Restart of this node fixes the problem. Mongodb sees such broken node as normal, rs.status() shows that everything is ok.
Have no idea how to reproduce this problem, much more intense load testing have no results.
We would appreciate any help and suggestions.

Related

How to count number of read/write/delete operations in mongo for a period?

I need to find out how many entity writes/reads or deletes a mongo cluster does within a specific period for internal metrics.
Only found db.currentOp().inprog.length which counts current op log.
Obviously, would be great if I don't need to do it through code, but from the sharded cluster out of the box system.
Later edit: logging all queries will be another option, but not for my production DB as it's too much and I need to do it over a 30 days period at least to have a good average

MongoDB: blocked queries during write operation in a replica set

I am using MongoDB (3.0) with a replica set of 3 servers. I experience very slow queries since a week and I have tried to find out what was wrong on my servers.
By using the db.currentOp() command I can see that queries are sometimes blocked on the secondaries when a "replication worker" is running. All the queries are waiting for lock ("waitingForLock" : true) and it seems that the replication worker has taken this lock and is running since several minutes (seems pretty long).
To be more specific about my user case, I have multiple databases in the replica set, all these database containing the same collections but not the same amount of data (I use one database per client).
I use WiredTiger as a storage engine that normally (as the doc claims) do not use global locks. So I was expecting that queries on the specific collection to be slow if this collection is updated, but I was not expecting all the queries to be slow or blocked.
Does anyone experienced the same issue? Is there some limitation with MongoDB when read are performed when processes write in the database?
Furthermore, is there a way to tell MongoDB that I don't care about consistency for read operations (in order to avoid locks)?
Thanks.
Update :
By restarting the servers the problems disappeared. It seems that memory and cpu usage was growing (but was still very low) that this lead to slow replication process which hold a lock and prevent queries execution.
I still don't understand why the we have the problem on this database. Maybe version 3.0.9 has a bug (I will upgrade to 3.0.12). Still it takes one month to the database to be very slow and only a restart of all the servers solve the problem. Our workload is mainly writes (with findAndModify). Does anyone know about a bug in Mongo where intensive write leads to performance decreasing over the time ?

Why do individual shards in MongoDB report more delete operations compared to corresponding mongos in a sharded cluster?

So I have a production sharded MongoDB cluster that has 8 shards (replica sets) managed by mongos. Let's say I have 20 servers which are running my application and each of the servers runs a mongos process that manages the 8 shards.
Given this setup, when I check the number of ops on each of my mongos on the 20 servers, I can see that my number of inserts and deletes are in proportion - which is in accordance with my application logic. However, when I run mongostat --discover on the individual shards, I see that deletes are nearly 4x the number of inserts which violates both my application logic as well as the 1:1 ratio indicated by mongos. Straightforward intuition supports that mongos would write to only one shard and so the average ratio of inserts and deletes across individual shards should be the same as that on mongos (which the application directly writes to) unless mongos does something different internally with the shards.
Could anyone point me to any relevant info on why this would happen or let me know if something could possibly wrong with my infra?
Thanks
The reason for this is that I was running the remove() queries to mongos without specifying my shard key. In that case, mongos does not know which shard to direct the query to and thus broadcasts the query to all the shards effectively performing more deletes than a targeted query.
Check documentation for more information.

MongoDB is giving inconsistent write times

I am using Scala, Reactive Mongo 0.10.5 and Mongo 2.6.4 running on Ubuntu. I have tested on a few machine configurations but right now I am working with 15gb of memory, 2 cores and 60gb of SSD storage (AWS)
I have just set up a test mongo instance and have been using it to benchmark a few things, however I am seeing some inconsistency that I can't explain.
I am writing a consistent amount of data using 10 separate threads to a single collection. Each write consists of a document containing an array which contains 1000 elements. Each element is a complex document consisting of several fields and nested fields. I have tested with arrays of 1000, 10000 and 100 and have seen the same behavior with all. Each write is unique (i.e. I never write to the same document twice)
The write speed tends to be around 100-200ms per write with the current hardware I am using. I would like better but that isn't my main issue.
My main issue is that sometimes the write times will spike. When they do, it can take a single write several seconds to complete. They do eventually complete but it takes a while. I have timeouts built into the app doing the writing (10 seconds) and when the spikes happen it will frequently hit that timeout. I have increased the timeout and verified that the write does eventually complete but it can take a long time (30+ seconds).
I have worked with Mongo before using the Mongo Java Driver in Scala and have not noticed this problem. However it is unclear whether the issue is a result of the driver, or my Mongo setup.
I have looked at the logs and while they report when the query is taking longer, they don't actually provide any information about why it is taking longer. I have done the same with profiling and again they report a long query but don't say why it is long.
I have run mongostat while running and it seems that when the writes start taking a long time I notice a similar slow down in mongostat. I.E. mongostat will pause for several seconds before continuing.
The mongo machine itself is bored while this is happening. Load averages are minimal as are CPU and memory usage. It does not appear to be going into swap.
I suspect I just have something configured incorrectly in the Mongo but I haven't been able to find anything that indicates what.
Has anyone seen this behavior before? Is it something in my configuration or perhaps something with the Reactive Mongo driver?
UPDATE:
Using iostat I was able to determine that the normal writes/second is hitting around 1Mb/second. However during the slow periods it spikes to 6-7Mb/second.
I also found the following in the mongo logs.
[DataFileSync] flushing mmaps took 15621ms for 35 files
[DataFileSync] flushing mmaps took 14816ms for 22 files
In at least one case this log statement corresponds exactly with one of the slow downs.
This definitely seems to be a disk flush problem based on these observations.
Does this imply that I am pushing more data than the current Mongo configuration can handle? Or is there some other configuration that can be done to reduce the impact of those flushes?
It appears that in this case the problem may actually have been related to thread locking within the application itself. Once I resolved the issues with thread locking these other issues seemed to go away.
To be honest I don't know why thread locking would result in the observed behavior in Mongo, but if the problem is gone I am not going to complain.

how to divide mongodb data without stopping service

I got a mongodb running online, which contains about 200 million object, and the file size is about 20GB. And, I found that the insert speed become very slow (about 2000 per sec, and this value is more than 10000 in the beginning). So I decide to divide the data to optimize the insert speed.
I would like the know if i can divide the mongodb data without stopping service, and how?
You just described "sharding". Luckily for you, MongoDB has nice sharding features out of the box.
Your migration will consist of:
Create 3 mongo config servers
Create more than 1 mongos router
Add your current replica set as a shard
Point your application to connect to your mongos
Configure shard keys your current collections
Then, you are set to add shards as needed
For detailed instructions, see 10gen's sharding overview
At MongoHQ, we convert replica sets to shards all the time. So, should be quick, painless, and without downtime.