MongoDB Atlas - in general, what affects my write rate here?

I am using an Atlas cluster on the M60 tier, configured with 3099 IOPS.
I am trying to write 116,550,000 documents, averaging around 12 KB per doc, as fast as possible (optimally in less than 6 hours).
I have 6 compound indexes made of multiple fields.
My question is: what, in principle, affects my write speed and ultimately keeps it from getting any faster?
Is it the IOPS? If I increase the IOPS, say to 9000, will I see a drastic change in write speed?
Is sharding the answer here? And if so, can I use multiple shards on a lower tier, say M30, but with more IOPS, and still achieve my goal?
Thanks to all responders :)
The best write speed I managed to achieve was 1,200 docs per second, using Node. Optionally, I have a Spark cluster with the mongodb-spark-connector as a sending option, but I think the bottleneck is in the MongoDB server and not the client.
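For context, a minimal sketch of the kind of insert loop I run from Node (the connection string, database, and collection names are placeholders, and the batch size is just a starting point to tune):

    // Batched, unordered inserts with the official Node.js driver.
    const { MongoClient } = require('mongodb');

    async function load(docs) {
      const client = new MongoClient('mongodb+srv://user:pass@cluster.example.mongodb.net');
      await client.connect();
      const coll = client.db('mydb').collection('events');

      const BATCH = 1000; // ~12 KB docs -> roughly 12 MB per batch
      for (let i = 0; i < docs.length; i += BATCH) {
        // ordered:false lets the server continue past individual failures
        // and tends to be faster than ordered inserts
        await coll.insertMany(docs.slice(i, i + BATCH), { ordered: false });
      }
      await client.close();
    }

Running several such workers in parallel (or several Spark partitions) is an option on the client side, but the question stands about where the server-side bottleneck is.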

Related

MongoDB production - nssize, preallocDataFiles, managing a large number of collections

I have a large number of collections getting created during high bursts of traffic. I generally delete these collections once I'm done processing the data in them, but during sudden bursts I sometimes run into namespace issues.
Can I increase nssize to handle this, and what values of nssize are OK? By default it is 16 MB. I increased it to 100 MB and still hit the issue. Can I increase it further without worrying?
Also, I have a lot of databases where the data is around 1 MB but Mongo preallocates 64 MB of space. How do I fix this? If I run compact, does it hurt Mongo performance?
You can increase the namespace size up to 2047 MB. The namespace file is per database, and the default size should be fine for about 24,000 collections.
What are the issues you're seeing, exactly? Do you have log lines or error messages? The numbers don't look like they should be a problem.
For more about nsSize, see the docs.
As for your second question, please see the link in the first comment as it has a good explanation and links to more info.
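As a rough example of the nsSize side (the dbpath and the 200 MB value are purely illustrative), nsSize is set at mongod startup rather than at runtime, and if I recall correctly it only applies to namespace files created after the change:

    # restart mongod with a larger namespace file size (legacy MMAPv1 option)
    mongod --dbpath /data/db --nssize 200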

What is the largest number of nodes in a Couchbase cluster?

Does anybody know what the biggest Couchbase cluster ever deployed is? Since there is a lot of information broadcast from each node, I have doubts about the scalability.
Thanks
I would like to answer this question differently than with a "simple number of nodes". In your question you are talking about scalability and some "doubts" about it... And as you can guess, working at Couchbase, I have no doubts about the scalability...
When people use Couchbase, like any NoSQL solution, they have a specific use case in mind for their data, and each use case has a specific data "life cycle" (volume, throughput, expiration, ...). So what do you have in mind when you are talking about scalability?
For example, I have been working on a project where we have a 20-node cluster processing 650,000 ops/s with 30% of the operations mutating data. For this specific use case, there is no need to go bigger. You can see other use cases like Draw Something with 80-90 nodes: ~50 million total users, 16 million daily users, 2 billion documents...
So instead of talking about a "hypothetical" cluster size, I would like to understand your use case and type of deployment (available hardware/VMs) to define what would be a good topology.
Check out this article; it covers the growth of the game 'Draw Something'. They went from a 6-node cluster to a 90-node cluster in 8 weeks due to rapid growth. They also had zero downtime while adding nodes to the cluster, and at week 6 they were processing 3,000 drawings a second.
http://www.couchbase.com/customer-stories/couchbase-helps-omgpop-scale-draw-something-50-million-users-50-days
Edit
Check slide 16 at this link; it shows a cluster size of 100+ for Viber.
http://www.couchbase.com/presentations/couchbase-tlv-2014-couchbase-at-viber

When to start MongoDB sharding

At the moment we run a MongoDB replica set containing 2 servers + 1 arbiter.
We store about 150 GB of data in the databases on the replica set.
Right now we are thinking about when to start sharding, because we are wondering if there is a point after which you can't start sharding anymore.
It is obvious that we would have to start sharding before we run out of hard disk space, our CPU is overloaded, or overall performance goes down because of too little RAM.
Somebody also told me that there is a 256 GB data-size limit after which you can't start sharding anymore. I read the official documentation http://docs.mongodb.org/manual/sharding/ and "MongoDB: The Definitive Guide", but I could not verify that.
From your experience, is there a limit by which you should have started sharding?
I would start sharding when you hit about 60-70% resource utilisation. This could be both hard disk space and RAM. The 256 GB limit is indeed there; it's documented at http://docs.mongodb.org/manual/reference/limits/#Sharding%20Existing%20Collection%20Data%20Size
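If you want to check how close a collection is to that limit, a quick mongo shell sketch (the database and collection names are placeholders):

    // stats scaled to GB; compare the data "size" field against the 256 GB limit
    use mydb
    db.mycoll.stats(1024 * 1024 * 1024)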
I have found the limit to be based on reads/writes; after all, sharding is about increasing capacity, mainly for writes, while replica sets are more concerned with reads. However, using separate servers (nodes) for ranges of data (the shard key) can help reads too, so it has a knock-on effect on both.
For example, you could be using only 40% of your current server's memory with your current working set, but due to the volume of writes being sent to that single server, you could actually be seeing speed problems caused by IO. At that point you would take sharding into account.
So really I would personally say, and this question is heavily opinion-based, that you should shard when you feel you need more capacity for operations than is cost-effective for a single replica set.
I have known single replica set deployments that can handle what an entire cluster normally would, but it depends on how big your budget is. As a machine gets bigger, it gets more expensive.

Can't map file memory-mongo requires 64 bit build for larger datasets

I have a sharded cluster across 3 systems.
While inserting I get the error message:
cant map file memory-mongo requires 64 bit build for larger datasets
I know that a 32-bit machine has a size limit of 2 GB.
I have a few questions to ask.
The 2 GB limit is per system, so the total data should be 6 GB, since my sharding is done across 3 systems. So would it be only 2 GB or 6 GB?
Even though sharding is set up properly, is all the data being stored on a single system instead of being distributed across all three sharded systems?
Does sharding play any role in increasing the data-size limit?
Does chunk size play any vital role in performance?
I would not recommend you do anything with 32-bit MongoDB beyond running it on a development machine where you perhaps cannot run 64-bit. Once you hit the limit, the file becomes unusable.
The documentation states "Use 64 bit for production. This is important as if you hit the mmap size limit (exact limit varies but less than 2GB) you will be unable to write to the database (analogous to a disk full condition)."
Sharding is all about scaling out your data set across multiple nodes, so in answer to your question: yes, you have increased the possible size of your data set. Remember, though, that namespaces and indexes also take up space.
You haven't specified where your mongos resides. Where are you seeing the error - from a mongod or the mongos? I suspect it's a mongod, which would seem to indicate that all your data is going to the one mongod. I believe you need to look at pre-splitting the chunks - http://docs.mongodb.org/manual/administration/sharding/#splitting-chunks.
If you have a mongos, what does sh.status() return? Are chunks spread across all mongods?
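If it does turn out everything is landing on one shard, a rough mongo shell sketch of what I'd check and how a pre-split looks (the namespace, shard key field, and split points are made up for illustration):

    sh.status()                                  // shards, chunks, and where they live
    db.mycoll.getShardDistribution()             // per-shard document/data counts
    // pre-split on the shard key so writes spread across shards from the start
    sh.splitAt("mydb.mycoll", { userId: 1000 })
    sh.splitAt("mydb.mycoll", { userId: 2000 })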
For testing, I'd recommend a chunk size of 1 MB. In production, it's best to stick with the default of 64 MB unless you have a really important reason not to use the default and you really know what you are doing. If your chunk size is too small, you will be performing splits far too often.
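For reference, a rough sketch of how the chunk size is usually changed from the mongo shell (the 1 MB value is the testing-only setting mentioned above):

    // run against a mongos; the value is in MB and applies cluster-wide
    use config
    db.settings.save({ _id: "chunksize", value: 1 })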

Does auto-sharding in MongoDB work on shards with many small collections/small databases?

In the MongoDB documentation for auto-sharding it says: "Sharding is performed on a per-collection basis. Small collections need not be sharded."
Our business has many databases (~100), each with many small collections (~30), and each collection holds 1-3,000 documents. Our DB system is looking at approximately 100,000,000 page views per month.
In that scenario, will sharding ever activate, since the collections are never big enough, even though the DB usage and site traffic are certainly high enough to require load balancing? From the docs I can't seem to find a clear answer.
Whether it makes sense to shard depends a little bit on whether you have mostly writes or mostly reads to the database. Sharding is primarily used for write-scaling, but if you are not doing a lot of writes, then simply using replica sets with "slaveOk" for the reads might work just as well.
From the numbers you provided you seem to have about 9 million documents, but how large are they? If they easily fit in memory, then there is most likely not even going to be a need for replica sets beyond failover capability.
This is hard to answer without knowing more about your use case, but I'll give it a shot.
Are you sure sharding is what you need? What does your insert rate look like?
If you are going to have a static set of data, or even a relatively static set, then you probably don't need to shard; you could simply use more secondaries and enable slaveOK reads. The reads will be distributed to the various secondaries and scale up your read capacity.
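As a rough illustration of that slaveOK route from the mongo shell (the collection name and query are placeholders):

    rs.slaveOk()                                    // when connected directly to a secondary
    db.getMongo().setReadPref('secondaryPreferred') // read-preference style on newer shells
    db.mycoll.find({ status: "active" })            // these reads may now hit secondaries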
If that is not the case, and you do need to shard, then there are options. But first, to explain briefly and at a high level how automatic sharding works:
The mongos process is responsible for splitting and migrating chunks in general. These are two separate operations - splitting and balancing.
Splits occur when the mongos sees that a certain portion of the maximum chunk size has been written; it initiates a split if there is in fact enough data to warrant it. Over time, with enough data written, the number of chunks grows.
Balancing occurs when there is an imbalance of chunks (currently 8 in 2.0, though moving to a more dynamic heuristic in 2.2). The balancer migrates the chunks around the shards until a balance is achieved.
So, you need to be writing enough data relative to the max chunk size (default is 64MB in 2.0) to generate the chunks needed for the balancer to move them around appropriately. If that is not going to happen with your data, then you can look at:
Decreasing the chunk size (has drawbacks too - http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations)
Manually splitting/moving the chunks (a rough shell example follows after the links below)
For the manual instructions see:
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
http://www.mongodb.org/display/DOCS/Moving+Chunks
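As a rough illustration of what those manual steps look like in the mongo shell (the namespace, shard key value, and shard name are placeholders):

    sh.splitAt("mydb.mycoll", { customerId: 5000 })                 // split a chunk at a chosen key value
    sh.moveChunk("mydb.mycoll", { customerId: 5000 }, "shard0001")  // move the chunk containing that key
    sh.status()                                                     // verify the new chunk layout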