MongoDB: Disk I/O % utilization on Data Partition has gone - mongodb

Recently I got this alert from MongoDB Atlas:
Disk I/O % utilization on Data Partition has gone above 70 on nvme2n1
But I have no idea how to pinpoint the problematic query, index, part of the code, or collection.
How can I analyze this to find the root cause?

Not an answer, but I have seen that many people face a similar problem.
In my case the root cause was this: we had a collection with huge documents, each containing an array of data (in fact, a list of coordinates with some metadata), and we updated a document once for every coordinate we added, plus some additional operations.
As far as I know, MongoDB cannot fetch just part of a document from storage; it fetches the full document. When we fetched many different large documents, they did not fit into MongoDB's in-memory cache, so each access went to disk, which led to this issue.
So we split this document into several smaller ones, and that fixed the issue. While we need frequent access to update or add this data, we keep it in separate documents; once processing is done, we merge all these documents back into one big document for "history check" purposes.
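The split-and-merge approach above can be sketched in Python as a simple bucketing step. This is only an illustration, not the code we actually ran: the function names `split_into_buckets` and `merge_buckets`, the field names (`track_id`, `seq`, `points`), and the bucket size of 3 are all hypothetical.

```python
from typing import Any


def split_into_buckets(track_id: str, points: list[Any], bucket_size: int) -> list[dict]:
    """Split one large coordinate array into several small documents."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    buckets = []
    for start in range(0, len(points), bucket_size):
        buckets.append({
            "track_id": track_id,          # hypothetical key linking the buckets
            "seq": start // bucket_size,   # bucket order, used when merging back
            "points": points[start:start + bucket_size],
        })
    return buckets


def merge_buckets(buckets: list[dict]) -> list[Any]:
    """Rebuild the full coordinate list once processing is finished."""
    merged = []
    for bucket in sorted(buckets, key=lambda b: b["seq"]):
        merged.extend(bucket["points"])
    return merged


# Hypothetical data: 7 coordinates split into buckets of 3.
points = [(i, i * 2) for i in range(7)]
buckets = split_into_buckets("track-1", points, bucket_size=3)
print(len(buckets))
print(merge_buckets(buckets) == points)
```

With this layout, adding a coordinate only rewrites the small last bucket instead of the whole large document.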

Recently, we hit this alert on MongoDB Atlas, Disk I/O % utilization on Data Partition has gone above 90, after an instance reboot during maintenance. After a discussion with Atlas support, we now clearly understand this metric.
Understanding Disk I/O % Utilization
Definitions of Disk I/O % Utilization and Disk I/O % utilization on Data Partition, per the docs:
Disk I/O % Utilization alerts indicate that the percentage of time during which requests are being issued reaches a specified threshold.
Disk I/O % utilization on Data Partition occurs if the percentage of time during which requests are being issued to any partition that contains the MongoDB collection data meets or exceeds the threshold.
Two traps in iostat: %util and svctm
Device saturation occurs when this value (%util) is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
This means if there was even just one I/O operation in progress for a given time period, the operating system would report 100% Disk Util, as the disk was in use 100% of that time.
Thus, the disk utilization percentage by itself is NOT an indicator of stress on the disk relative to its maximum IOPS capacity.
Having disk utilization at 100% does not in itself imply there is an issue. Disk utilization is the percentage of time requests are issued to any partition containing the MongoDB collection data. This includes requests from any process, not just MongoDB processes. Modern disk storage can sustain multiple I/O operations simultaneously, so having a ~100% utilization is not unusual, because it just means that the disk is constantly processing at least one operation during the 100% interval.
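To make the point concrete, here is a simplified Python model of the metric. It is a sketch under stated assumptions, not how iostat or Atlas actually compute the value: requests are given as hypothetical (start, end) pairs in milliseconds, and the function names are made up for illustration. It shows that a device handling one request at a time and a device handling eight requests in parallel both report 100% utilization.

```python
def busy_fraction(requests, window):
    """Fraction of the window during which at least one request is in flight."""
    start_w, end_w = window
    # Clip each request to the window, then merge overlapping intervals.
    clipped = sorted(
        (max(s, start_w), min(e, end_w))
        for s, e in requests
        if e > start_w and s < end_w
    )
    busy = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / (end_w - start_w)


def average_in_flight(requests, window):
    """Average number of requests in flight over the window."""
    start_w, end_w = window
    total = sum(
        min(e, end_w) - max(s, start_w)
        for s, e in requests
        if e > start_w and s < end_w
    )
    return total / (end_w - start_w)


window = (0, 1000)  # one second, in milliseconds
one_at_a_time = [(i * 100, (i + 1) * 100) for i in range(10)]
eight_parallel = [(0, 1000)] * 8

print(busy_fraction(one_at_a_time, window), average_in_flight(one_at_a_time, window))
print(busy_fraction(eight_parallel, window), average_in_flight(eight_parallel, window))
```

Both workloads have a busy fraction of 1.0, but the second keeps eight requests in flight on average, which is why the percentage alone says little about how close the device is to its limit.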
Conclusion
We should look at a combination of all the available disk-related metrics, as well as IOWait in the System CPU when diagnosing potential disk performance-related issues.
Possible actions to help resolve Disk Utilization % alerts
Optimize your queries
Create an Index to Support Read Operations
Pay attention to Query Selectivity and Covered Query
Use the Atlas Performance Advisor to view slow queries and suggested indexes.
Review Indexing Strategies for possible further indexing improvements.
Analyze Query Performance to review how your queries are using your indexes.
Analyze the profiler output to optimize queries with long execution times
Increase hardware resources, such as instance size and IOPS on Atlas
Source: Mongo Doc

As the alert says, it is due to high utilization of the disk. The most common cause is unoptimized queries with a poor Query Targeting Ratio, or simply reading or writing a lot of documents from or to the disk in a relatively short time window.
In order to identify these queries, start with the Profiler and look for the operations with a poor Examined:Returned ratio. You can also refer to the Performance Advisor to see if it suggests any indexes on the inefficient operations. Since Profiler's window is limited to the last 24 hours, you can also refer to your logs to identify the Slow Queries.
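As a rough illustration of the Examined:Returned check, here is a small Python sketch that ranks profiler entries by that ratio. The field names `docsExamined`, `nreturned`, `ns`, `op` and `millis` follow the database profiler output; the sample entries, the helper names, and the threshold of 100 are hypothetical and should be tuned to your workload.

```python
def targeting_ratio(entry):
    """Documents examined per document returned for one profiler entry."""
    examined = entry.get("docsExamined", 0)
    returned = entry.get("nreturned", 0)
    # Treat "nothing returned" as 1 to avoid dividing by zero.
    return examined / max(returned, 1)


def worst_operations(entries, min_ratio=100.0):
    """Entries at or above min_ratio (an arbitrary threshold), worst first."""
    flagged = [e for e in entries if targeting_ratio(e) >= min_ratio]
    return sorted(flagged, key=targeting_ratio, reverse=True)


# Hypothetical entries shaped like documents from the profiler.
sample_profile = [
    {"ns": "shop.orders", "op": "query", "docsExamined": 500000, "nreturned": 20, "millis": 840},
    {"ns": "shop.users", "op": "query", "docsExamined": 1, "nreturned": 1, "millis": 1},
    {"ns": "shop.events", "op": "query", "docsExamined": 120000, "nreturned": 0, "millis": 310},
]

for entry in worst_operations(sample_profile):
    print(entry["ns"], targeting_ratio(entry))
```

Operations that examine many documents but return few are the usual candidates for a new or better index.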
Ultimately, the effort to solve this is tri-directional:
Optimize query execution with efficient indexing and filtering strategies
Keep a check on the volume of data being read or written in one go
Increase the IOPS of the cluster
For the official reference, check out the documentation here.

Related

mongodb max number of parallel find() requests from single instance

What is the maximum theoretical number of parallel requests that we can squeeze from a single MongoDB instance before deciding to shard?
Assume the database and indexes fit in memory and all requests are find() queries fetching a single document based on an indexed field. The host OS is Ubuntu, the data partition is on SSD, and ulimits are set to max.
On my laptop, with a simple test on a single instance, I reach nearly 40k/sec; after that the average execution times start to increase significantly. What is the theoretical upper limit?
It depends. If your active dataset fits in memory, so that most requests don't need to perform any disk I/O, then you can achieve 24k+ requests pretty easily. If not on a (bigger) single machine, then at least with a replica set with multiple secondaries.
If the active dataset is much larger than the available RAM, then you have the same problem as with any other database. An advantage of MongoDB's newer storage engine, WiredTiger (available since v3.0), is transparent compression: it can reduce the amount of data and I/O and thus improve performance, even though compression adds CPU load.
For more performance it really helps:
if the most accessed documents are small, so it takes less time to load them, transfer them, and deserialize them in your app
If you use projections in find(), for the same reasons
if you use bulk operations to reduce networking I/O and context switches
Even MongoDB itself has an option to limit the maximum number of incoming connections. It defaults to 64k.
For more information, you can refer to the link.

How to calculate the number of CPU, memory and storage that my Google Cloud SQL needs

My DB is reaching 100% CPU utilization, and increasing the number of CPUs no longer helps.
What kind of information should I consider when sizing my Google Cloud SQL instance? How do you set up the DB configuration?
Info I have:
For 10-50 minutes a day I have 120 requests/second and the CPU reaches 100% utilization
Memory usage peaks at 2.5 GB during this critical period
Storage usage is currently around 1.3GB
Current configuration:
vCPUs: 10
Memory: 10 GB
SSD storage: 50 GB
Unfortunately, there is no magic formula for determining the correct database size. This is because queries have variable load - some are small and simple and take no time at all, others are complex or huge and take lots of resources to complete.
There are generally two strategies for dealing with high load: reduce your load (use connection pooling, optimize your queries, cache results), or increase the size of your database (add CPUs, storage, or read replicas).
Usually, high CPU utilization means that the CPU is overloaded or that there are too many database tables in the same instance. Here are some common issues and fixes provided by Google’s documentation:
If CPU utilization is over 98% for 6 hours, your instance is not properly sized for your workload, and it is not covered by the SLA.
If you have 10,000 or more database tables on a single instance, it could result in the instance becoming unresponsive or unable to perform maintenance operations, and the instance is not covered by the SLA.
When the CPU is overloaded, it is recommended to use this documentation to view the percentage of available CPU your instance is using on the Instance details page in the Google Cloud Console.
It is also recommended to monitor your CPU usage and receive alerts at a specified threshold; to do so, set up a Stackdriver alert.
Increasing the number of CPUs for your instance should reduce the strain on your instance. Note that changing CPUs requires an instance restart. If your instance is already at the maximum number of CPUs, shard your database to multiple instances.
Google has this very interesting documentation about investigating high utilization and determining whether a system or user task is causing high CPU utilization. You could use it to troubleshoot your instance and find what's causing the high CPU utilization.

MongoDB Replica Set CPU load

I am running a fairly standard MongoDB (3.0.5) replica set with 1 primary and 2 secondaries. My PHP application's read preference is primary, so no reads take place on the secondaries - they are only for failover. I am running a load test on my application, which creates around 600 queries / updates per second. The operations are all being run against a collection that has ~500,000 documents. However, the queries are optimized and supported by indexes. Any query will not take longer than 40ms max.
My problem is that I am getting a quite high CPU load on all 3 nodes (200% - 300%) - sometimes the load on the secondaries is even higher than on the primary. Disk IO and RAM usage seem to be okay - at least they are not hitting any limits.
The primary's log file contains a huge amount of getmore oplog queries - I would guess that any operation on the primary creates an oplog query. It appears to me that this is too much replication overhead but I don't have any prior experience with MongoDB under load and I don't have any key figures.
As the setup will have to tolerate even more load in production, my question is whether the replication overhead is to be expected and whether it's normal that the CPU load goes up that high, even on the secondaries or is there something I'm missing?
Think about it this way. Whatever data-changing operation happens on the primary, it also needs to happen on every secondary. If there are many such operations and they create high CPU load on the primary, well, then the same situation will repeat itself on the secondaries.
Of course, in your case you'd expect the primary's CPU to be more stressed, because in addition to the writes it also handles all the reads. Probably, in your scenario, reads are relatively light and there aren't many of them when compared to the amount of writes. This would explain why the load on the primary is roughly the same as on the secondaries.
"my question is whether the replication overhead is to be expected"
What you call replication overhead I see as the nature of replication. A primary stressed by writes results in all secondaries being stressed by writes as well.
"and whether it's normal that the CPU load goes up that high, even on the secondaries"
You have 600 write queries per second and your RAM and disk are not stressed. To me this signifies that you've set up your indexes properly. High CPU load is expected with this number of write operations per second, because the indexes are being used intensively.
Please keep in mind that once you have gathered more data, the indexes and the memory-mapped data may not fit into memory anymore, and then both the RAM and the disk will be stressed, while CPU is unlikely to be under high load anymore. In this situation, you will probably want to either add more RAM or look into sharding.

What is mongodb behavior regarding keeping loaded indexes in ram?

Say I have a single collection in mongodb with only one index, and I require the index for the entire life cycle of the application using that mongo collection.
I would like to know about the behaviour of mongodb.
In this case once the index is loaded into memory, will mongodb keep it in the ram?
Thanks
The first thing MongoDB will knock out of RAM will be the LRU (least recently used) piece of data. So if you only have one index, chances are it will continue to be used pretty regularly and it should stay in memory.
Source
Unfortunately you cannot currently pin a collection or index in memory. MongoDB uses memory-mapped files to load collections and indexes into memory. As your activities touch various pieces of your database through queries, updates, insertions and deletions, that data will get loaded into memory. This is referred to as the working set. If the total memory required to load the working set is less than available memory, no problem.
If not, MongoDB is going to use an LRU algorithm to pick what to unload from memory. This is why it's so important to understand the concept of the working set and how it relates to your available memory.
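The LRU behaviour described above can be illustrated with a toy Python model. This is only a sketch of the general least-recently-used idea, not MongoDB's or the operating system's actual eviction code; the `LruCache` class, the page names, and the capacity of 3 are hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Toy model of memory that evicts the least recently used page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()  # oldest access first, newest last

    def touch(self, page_id):
        """Access a page; return the evicted page id, or None."""
        if page_id in self._pages:
            self._pages.move_to_end(page_id)  # mark as most recently used
            return None
        self._pages[page_id] = True
        if len(self._pages) > self.capacity:
            evicted, _ = self._pages.popitem(last=False)  # drop the oldest
            return evicted
        return None

    def resident(self):
        return list(self._pages)


# The index is touched on every query, the documents only once each.
cache = LruCache(capacity=3)
for page in ["index", "doc-a", "index", "doc-b", "index", "doc-c", "index", "doc-d"]:
    cache.touch(page)

print(cache.resident())
```

Because the index is accessed between every document read, it is never the least recently used entry, so the older documents are evicted instead.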
This writeup from the documentation should be helpful:
How do I calculate how much RAM I need for my application?
The amount of RAM you need depends on several factors, including but not limited to:
The relationship between database storage and working set.
The operating system’s cache strategy for LRU (Least Recently Used)
The impact of journaling
The number or rate of page faults and other MMS gauges to detect when you need more RAM
Each database connection thread will need up to 1 MB of RAM.
MongoDB defers to the operating system when loading data into memory from disk. It simply memory maps all its data files and relies on the operating system to cache data. The OS typically evicts the least-recently-used data from RAM when it runs low on memory. For example if clients access indexes more frequently than documents, then indexes will more likely stay in RAM, but it depends on your particular usage.
To calculate how much RAM you need, you must calculate your working set size, or the portion of your data that clients use most often. This depends on your access patterns, what indexes you have, and the size of your documents. Because MongoDB uses a thread per connection model, each database connection also will need up to 1MB of RAM, whether active or idle.
If page faults are infrequent, your working set fits in RAM. If fault rates rise higher than that, you risk performance degradation. This is less critical with SSD drives than with spinning disks.
http://docs.mongodb.org/manual/faq/diagnostics/
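The arithmetic in the quoted passage can be sketched as follows. This is a rough upper-bound estimate under stated assumptions: the function name `estimate_ram_mb` and the example figures (a 6000 MB working set and 500 connections) are hypothetical, and the estimate ignores the operating system's own needs and the impact of journaling.

```python
def estimate_ram_mb(working_set_mb, connections, per_connection_mb=1.0):
    """Working set plus up to 1 MB per connection, active or idle."""
    return working_set_mb + connections * per_connection_mb


# Hypothetical figures: 6000 MB working set, 500 open connections.
print(estimate_ram_mb(working_set_mb=6000, connections=500))  # 6500.0
```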
You can use the serverStatus command to get an estimate of your current working set:
db.runCommand( { serverStatus: 1, workingSet: 1 } )

what constitutes "large amount of write activity" for Mongodb?

I am currently working on an online ordering application using MongoDB as the backend. While looking into sharding, I found that the Mongo docs say you should consider sharding if
"your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention."
So my question is: what constitutes a large amount of write activity? Are we talking thousands of writes per second? Hundreds?
I know that sharding introduces a level of infrastructure complexity that I'd rather not get into if I don't have to.
thanks!
R
The "large amount of write activity" is not defined in terms of a specific number, but rather as the point when your common usage pattern exceeds the resources of your server hardware. For example, when average I/O flush time or iowait indicates that I/O has become a significant limiting factor.
You do have other options to consider before sharding:
if your working set is larger than RAM and you have significant page faults, upgrade your RAM
if your disk I/O isn't keeping up, consider upgrading to faster disks, RAID, or SSD
review and adjust your readahead settings
look into optimization of slow or inefficient queries
review your indexes and remove unnecessary ones