Bulk insert (of 1000 documents) to MongoDB server: is it 1 IOPS in GCP, or 1000 IOPS? - mongodb

I read the GCP docs about disk performance, and I read about MongoDB bulk operations, and I'm wondering whether a bulk operation counts as 1 operation (so that I could do, say, 30 of them per second per GB) or counts as the number of operations it contains.
Thank you!

MongoDB bulk operations reduce network and processing overhead but have little effect on disk IOPS.
Disk IOPS remain roughly proportional to the number of documents inserted/updated/deleted, even in bulk mode.
But it's not 1-to-1: in addition to the number of documents, disk IOPS depend on the indexes, the journal, checkpoints, shards, and cache size.
A very rough estimate would be (2 + #_of_indexes) * #_of_documents_updated / seconds.
Source: Sizing MongoDB Clusters
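As a worked illustration of that estimate (all of the numbers below are made up):

```
// Hypothetical numbers: a collection with 3 indexes, a bulk insert of
// 1000 documents that completes in 1 second.
const indexes = 3;
const documents = 1000;
const seconds = 1;

// Rough estimate from the answer above: neither 1 IOPS nor exactly 1000.
const estimatedIops = (2 + indexes) * documents / seconds;
console.log(estimatedIops); // 5000
```

So by this rule of thumb, the bulk insert lands closer to 1000+ IOPS than to 1 - the bulk API saves round trips, not disk writes.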

Related

mongodb max number of parallel find() requests from single instance

What is the maximum theoretical number of parallel requests that we can squeeze out of a single mongodb instance before deciding to shard?
Consider that the database and indexes fit in memory and all requests are find() queries fetching a single document based on an indexed field. The hosting OS is Ubuntu, the data partition is an SSD, and ulimits are set to max.
On my laptop, with a simple test on a single instance, I reach nearly 40k/sec; after that the average execution times start to increase significantly. But I'm wondering what the upper theoretical limit could be?
It depends. If your active dataset can fit in memory - if most of the requests don't need to perform any disk I/O - then you can achieve 24k+ requests pretty easily. If not on a (bigger) single machine, then at least with a replica set cluster with multiple secondaries.
If the active dataset is much larger than the available RAM, then you have the same problem as with any other database. The advantage of MongoDB's WiredTiger storage engine (available since v3.0) is transparent compression - it can reduce the amount of data and I/O and thus improve performance, even though the compression adds CPU load.
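If you want to experiment with the compression, the block compressor can be chosen per collection; a mongosh sketch (the collection name is made up, and snappy is the default):

```
// Sketch: create a collection compressed with zlib (tighter but more
// CPU-hungry than the default snappy).
db.createCollection("events", {
  storageEngine: {
    wiredTiger: { configString: "block_compressor=zlib" }
  }
});

// Verify which compressor the collection was created with.
db.events.stats().wiredTiger.creationString;
```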
For more performance it really helps:
- if the most accessed documents are small, so they take less time to load, transfer, and deserialize in your app
- if you use projections in find(), for the same reasons
- if you use bulk operations to reduce networking I/O and context switches (see the sketch below)
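A minimal mongosh sketch of the last two points (the collection and field names are made up):

```
// Projection: return only the fields you need instead of whole documents.
db.users.find({ status: "active" }, { _id: 0, name: 1, email: 1 });

// Bulk write: one round trip for many operations instead of one each.
db.users.bulkWrite([
  { insertOne: { document: { name: "a", email: "a@example.com" } } },
  { insertOne: { document: { name: "b", email: "b@example.com" } } },
  { updateOne: { filter: { name: "c" }, update: { $set: { status: "active" } } } }
], { ordered: false });
```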
Even MongoDB itself has an option (net.maxIncomingConnections) to limit the maximum number of incoming connections. It defaults to 65536.
For more information you can refer to the linked article.

What is the minimum number of shards required for Mongo database to store 1 billion documents?

We need to store 1 billion documents of 1KB each. Each shard is planned to have 8GB of RAM. The platform is Open Shift Red Hat Linux.
Initially we had 10 shards for 300 million. We started inserting documents with 2000 inserts/second. Everything went well till 250 million. After that the insert slowed down drastically to 300/400 insert per second.
The queries are also taking a long time (more than 1 minute), even though all the queries are covered queries (queries that can be satisfied entirely from the indexes).
Hence we assumed, that 20 million per shard is the optimal value and hence we require 50 shards for the current hardware to achieve 1 billion.
Is this reasonable estimate or we can improve it (less shards) by tweaking mongo db parameters for better performance with the current hardware?
There are two compound indexes and one unique index (on a long). Insertion is done using bulk writes (with the unordered option), with 10 threads and 200 records per bulk write, using JavaScript directly on the mongos. The shard key is nodeId (the prefix of a compound index), which has a cardinality of up to 10k. For 300 million documents, the total index size comes to 45 GB, 40 GB of which is for the 2 compound indexes. Almost 9500 chunks are distributed across the 10 nodes. One interesting fact: if I increase RAM to 12 GB, the speed increases to 1500 inserts/sec. Is RAM the limiting factor?
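For reference, one such unordered 200-document batch would look roughly like this in the shell (a sketch; the collection name and document shape are made up):

```
// Sketch: unordered bulk insert of a 200-document batch. With
// { ordered: false } the server may apply inserts in parallel and keeps
// going past individual errors instead of stopping at the first one.
const batch = [];
for (let i = 0; i < 200; i++) {
  batch.push({
    insertOne: {
      document: { nodeId: "node-42", startTime: new Date(), payload: "..." }
    }
  });
}
db.records.bulkWrite(batch, { ordered: false });
```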
Update:
Using the mongostat tool, we found that the flush (fsync) takes more than 55 seconds to complete. The MongoDB cluster runs on Kubernetes, on the Red Hat OpenShift platform, on a Dell EMC server with NFS (EXT4 disk format). Is it a problem with the I/O that it supports only 2 MB/second? It takes 60 seconds to write 2000 records per second and another 55 seconds to flush completely to disk (during which all DB operations are blocked).
The disk utilization does not even reach 4%.
Have you tried not sharding at all?
There's a common tendency to shard prematurely. I've seen a MongoDB consultant suggest a rule of thumb: don't shard until your total data size is at least 2 TB. Your 1B documents of 1KB each should be around 1 TB. While it's only a rule of thumb, maybe it's worth trying.
If nothing else, it'll be much simpler to design the db without sharding and performance will be much more predictable.

MongoDB: Disk I/O % utilization on Data Partition has gone

Lately I have been getting this alert from MongoDB Atlas:
Disk I/O % utilization on Data Partition has gone above 70 on nvme2n1
But I have no idea how to localize the problem: which query / index / part of the code / collection is the culprit?
How can I run an analysis to find the root cause of the problem?
Not an answer, but I've just seen that many people face a similar problem.
In my case the root cause was: we had a collection with huge documents containing an array of data (in fact, a list of coordinates with some metadata), and we updated it as many times as there were coordinates (once per new coordinate), plus some additional operations.
As far as I know, MongoDB cannot fetch just part of a document; it fetches the full document. When we fetch many different big documents, they don't fit into MongoDB's in-memory cache, so every access hits the disk, and that led to this issue.
So we just split this document up into several smaller ones, and that fixed the issue. While we need frequent access to update/add this data, we keep it in separate documents; finally, once the process is done, we gather all these documents back into one big document, for "history check" purposes.
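A sketch of that kind of split using the common "bucket" pattern (the collection name, field names, and the 1000-entry cap are all assumptions):

```
// Sketch: append a coordinate to the current non-full bucket document;
// once a bucket reaches 1000 entries, the filter no longer matches and
// the upsert starts a fresh bucket. Each document stays small enough to
// be rewritten cheaply and to fit in cache.
db.trackBuckets.updateOne(
  { trackId: "t1", count: { $lt: 1000 } },
  {
    $push: { coords: { lat: 52.37, lng: 4.90, ts: new Date() } },
    $inc: { count: 1 }
  },
  { upsert: true }
);
```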
Recently we got this alert on MongoDB Atlas: Disk I/O % utilization on Data Partition has gone above 90, after instance reboot maintenance. After a discussion with the Atlas support guys, we came to clearly understand this metric.
Understanding Disk I/O % Utilization
The definitions of Disk I/O % Utilization and Disk I/O % utilization on Data Partition, per the docs:
Disk I/O % Utilization alerts indicate that the percentage of time during which requests are being issued reaches a specified threshold.
Disk I/O % utilization on Data Partition occurs if the percentage of time during which requests are being issued to any partition that contains the MongoDB collection data meets or exceeds the threshold.
Two traps in iostat: %util and svctm
Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
This means if there was even just one I/O operation in progress for a given time period, the operating system would report 100% Disk Util, as the disk was in use 100% of that time.
Thus, the disk utilization percentage by itself is NOT an indicator of stress on the disk relative to its maximum IOPS capacity.
Having disk utilization at 100% does not in itself imply there is an issue. Disk utilization is the percentage of time requests are issued to any partition containing the MongoDB collection data. This includes requests from any process, not just MongoDB processes. Modern disk storage can sustain multiple I/O operations simultaneously, so ~100% utilization is not unusual: it just means that the disk was constantly processing at least one operation during that interval.
Conclusion
We should look at a combination of all the available disk-related metrics, as well as IOWait in the System CPU when diagnosing potential disk performance-related issues.
Possible actions to help resolve Disk Utilization % alerts
- Optimize your queries
- Create an index to support read operations (see the sketch after this list)
- Pay attention to query selectivity and covered queries
- Use the Atlas Performance Advisor to view slow queries and suggested indexes
- Review indexing strategies for possible further improvements
- Analyze query performance to review how your queries are using your indexes
- Analyze the Profiler output to optimize queries with long execution times
- Increase hardware resources, such as instance size and IOPS, on Atlas
Source: Mongo Doc
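A mongosh sketch of the indexing and covered-query items (collection and field names are hypothetical):

```
// An index that both supports the filter and covers the query.
db.orders.createIndex({ customerId: 1, status: 1 });

// Covered query: the filter and the projection use only indexed fields
// (and _id is excluded), so the documents themselves never need to be
// fetched from disk.
db.orders.find(
  { customerId: 12345 },
  { _id: 0, customerId: 1, status: 1 }
);
```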
As the alert says, it is due to high utilization of the disk. The most common cause is unoptimized queries with a poor Query Targeting Ratio, or simply reading/writing a lot of documents from/to the disk in a relatively short time window.
In order to identify these queries, start with the Profiler and look for operations with a poor Examined:Returned ratio. You can also refer to the Performance Advisor to see if it suggests any indexes for the inefficient operations. Since the Profiler's window is limited to the last 24 hours, you can also refer to your logs to identify slow queries.
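A mongosh sketch of that Profiler pass (the 100 ms slowms threshold is an assumption):

```
// Profile operations slower than 100 ms (threshold is an assumption).
db.setProfilingLevel(1, { slowms: 100 });

// Later: rank profiled queries by their Examined:Returned ratio; values
// far above 1 indicate poorly targeted queries.
db.system.profile.aggregate([
  { $match: { op: "query", nreturned: { $gt: 0 } } },
  { $project: {
      ns: 1,
      millis: 1,
      examinedToReturned: { $divide: ["$docsExamined", "$nreturned"] }
  } },
  { $sort: { examinedToReturned: -1 } },
  { $limit: 10 }
]);
```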
Ultimately, the effort to solve this goes in three directions:
- Optimize query execution with efficient indexing and filtering strategies
- Keep a check on the volume of data being read/written in one go
- Increase the IOPS of the cluster
For official reference, check out the documentation here.

behaviour of balancer in mongodb sharding

I was experimenting with mongo sharding. The collection has the shard key {policyId, startTime}.
policyId - a Java UUID (limited values, let's say 50)
startTime - a monotonically increasing time.
After inserting around 30M(32 GB) documents in the collection : Below is the data distribution:
shard key: { "policyId" : 1, "startDate" : 1 }
unique: false
balancing: true
chunks:
sharda 63
shardb 138
During insertion, sh.isBalancerRunning() was returning 'false'. When I stopped inserting documents, the balancer started moving chunks, and after that I got an even distribution of data.
Below are my concerns / Questions regarding balancer:
1. Only once insertion into the db stopped did the balancer become active and start moving chunks. If I keep inserting data for a longer duration, more chunks will be created and the data will become even more skewed, and chunk migration will itself take more time to balance the shards. So how does mongo decide when to migrate chunks?
2. I noticed spikes in write latency while data was being inserted beyond 20M docs. Does this mean the balancer was moving some of the chunks intermittently?
3. The count API gives inconsistent results during chunk migration, because the balancer copies chunks from one shard to another and then deletes the old chunk. Should we expect the find API to also give incorrect results (duplicate docs)?
If possible, could anyone share documentation or a blog post about the mongo balancer for better understanding?
Your assumption (that the balancer only becomes active and starts moving chunks once insertion into the db has stopped) is wrong. The balancer process automatically migrates chunks whenever there is an uneven distribution of a sharded collection's chunks across the shards.
Migration is not a continuous or steady process; automatic migration happens when it is required. For more details refer to https://docs.mongodb.com/v3.0/core/sharding-balancing/#sharding-migration-thresholds
Reads during migration will not give incorrect results: no duplicate records should come back via the find API.
For more about balancer refer https://docs.mongodb.com/manual/core/sharding-balancer-administration/
About migration refer https://docs.mongodb.com/v3.0/core/sharding-chunk-migration/
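To observe this yourself, a few shell commands (the collection name is made up):

```
sh.getBalancerState();               // is the balancer enabled at all?
sh.isBalancerRunning();              // is a migration round in progress right now?
sh.status();                         // chunk counts per shard
db.policies.getShardDistribution();  // per-shard data and document counts
```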
There are various things to consider:
- Default chunk size: 64 MB.
- Cardinality: if the cardinality is high, and the data accumulating over time will not push a single shard-key value past 64 MB (assume you store 1 or more years of data), then you don't have to worry. If not, you will probably have to increase the default chunk size (see the sketch below).
Suppose you have 2 shards and the cardinality of a hashed key is 100: then data for 50 of the values will go to one shard and the other 50% to the other. With range keys, values 0-50 will go to one shard and 50-100 to the other.
Now suppose your current chunk, holding values A to F, reaches 64 MB: the chunk will be split and data will be moved to the other shard.
If your cardinality is low, then the value A by itself can exceed 64 MB; such a chunk cannot be split and gets marked as a jumbo chunk.
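Raising the default chunk size is done through the config database; a sketch, run from mongosh connected to a mongos:

```
// Sketch: raise the default chunk size from 64 MB to 128 MB.
db.getSiblingDB("config").settings.updateOne(
  { _id: "chunksize" },
  { $set: { value: 128 } },
  { upsert: true }
);
```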

Does mongo replication split data or duplicate it

I am creating a MongoDB/Node.js based CMS and I am using GridFS to store all the uploaded docs. The question I have is this:
Do MongoDB replica sets allow an increased amount of DB storage, or simply duplicates of the database? For instance, if I have 5 servers with 1TB of storage each and I replicate mongo across all of them, would my GridFS system theoretically have 5TB of storage (minus caching and padding), or 1TB of storage duplicated several times for better read performance?
Thanks!
Informal description:
Replication = The same copy of the data on multiple nodes, i.e., 5 nodes with 1TB each provide 1TB overall.
Sharding / Partitioning = Fraction of the data goes to the nodes, i.e., 5 nodes with 1TB each provide 5TB overall.
Each approach has certain advantages and disadvantages, e.g., replication can help with read throughput and is good as backup, but slows down inserts (depending on commit level), whereas partitioning can help with insert throughput and distributed lookups.
Again, details are left to the storage system implementor.
Sharding means splitting your data across multiple nodes; it's useful when you have a huge amount of data.
Replication means copying the data from one node to another; it's useful when your application is read-heavy or when you want to back up your data, for example.
Resources:
http://www.mongodb.org/display/DOCS/Sharding
http://www.mongodb.org/display/DOCS/Replication
http://nosql-exp.blogspot.com/2010/09/mongodb-sharding-and-replication-with.html
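To make the distinction concrete, here is roughly how each is set up in the shell (a sketch: the host names and database name are placeholders, and {files_id: 1, n: 1} is the shard key commonly recommended for GridFS chunk collections):

```
// Replication: every member stores a full copy of the data.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "node1:27017" },
    { _id: 1, host: "node2:27017" },
    { _id: 2, host: "node3:27017" }
  ]
});

// Sharding: each shard stores only a fraction of the collection.
sh.enableSharding("cms");
sh.shardCollection("cms.fs.chunks", { files_id: 1, n: 1 });
```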
Do MongoDB replica sets allow an increased amount of DB storage, or simply duplicates of the database?
Mongo can do both.
The first case is called sharding.
The second case is called replication.