Does MongoDB always write to the primary shard and then rebalance?

use vsm;
sh.enableSharding('vsm');
sh.shardCollection('vsm.pricelist', {maker_id:1});
Ok, we enabled sharding for Database (vsm) and collection in this database (pricelist).
We are trying to write about 80 million documents to the 'pricelist' collection.
We have about 2000 different maker_ids, distributed uniformly.
We have three shards, and Shard002 is the PRIMARY shard for the 'vsm' database.
We write to the 'pricelist' collection from four application nodes, each with its own mongos.
And during the write to the 'pricelist' collection we see 100% CPU usage ONLY on Shard002!
We see the rebalancing process, and data migrates to Shard000 and Shard003. But Shard002 has high CPU usage and load average!
Shards are deployed on c4.xlarge EBS-optimized instances. dbdata is stored on io1 EBS volumes with 2000 IOPS.
It looks like MongoDB writes data only to one shard :( What are we doing wrong?

The problem
What you describe is usually an indication that you have chosen a poor shard key with maker_id, most likely one that is monotonically increasing.
What usually happens is that one shard is assigned the key range from x to infinity (Shard002 in your case). All new documents get written to that shard until it holds chunks in excess of the current migration threshold. Then the balancer kicks in and moves some chunks. The problem is that new documents still get written to that same shard.
The solution
An easy solution to that problem is to use a hashed key for sharding.
Now here comes the serious problem: you cannot change the shard key.
So what you have to do is make a backup of the sharded collection, drop it, reshard the collection using the hashed maker_id, and restore the backup into the new collection.
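A rough sketch of that procedure in the mongo shell, assuming the database and collection names from your question (the dump and restore themselves are run with mongodump/mongorestore from the command line, shown here only as comments):
// 1. back up the collection first, from the command line, e.g. mongodump --db vsm --collection pricelist
use vsm
db.pricelist.drop()  // 2. drop the badly sharded collection
sh.shardCollection('vsm.pricelist', {maker_id: 'hashed'})  // 3. reshard on the hashed maker_id
// 4. restore the backup, e.g. mongorestore --db vsm --collection pricelist dump/vsm/pricelist.bson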

Does MongoDB always write to the primary shard and then rebalance?
Yes, if you are relying on the auto balancer and loading huge amounts of data into an empty collection.
In your situation, you are relying on the auto balancer to do all the sharding / balancing work. I assume what you want is for the data to go to every shard as it gets loaded, so that no single shard carries all the CPU load.
This is how sharding / auto balancing takes place at a high level:
Create chunks of data using split http://docs.mongodb.org/manual/reference/command/split/
Move the chunks to other shards http://docs.mongodb.org/manual/reference/command/moveChunk/#dbcmd.moveChunk
Now, when the auto balancer is ON, these two steps occur after your data is already loaded or while it is still loading.
Solution
Create an empty collection (the collection your data is going to be loaded into) and execute the shard command on it.
Turn off the auto balancer http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#disable-the-balancer
Manually create empty chunks using split. http://docs.mongodb.org/manual/tutorial/create-chunks-in-sharded-cluster/
Move those empty chunks to different shards. http://docs.mongodb.org/manual/tutorial/migrate-chunks-in-sharded-cluster/
Start the load. This time all your data should go directly to its respective shards (see the shell sketch after this list).
Turn the balancer back ON (once the load is complete). http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#enable-the-balancer
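A hedged sketch of those steps in the mongo shell, assuming the 'vsm.pricelist' collection from the question; the split points and shard names are only illustrative:
sh.enableSharding('vsm')
sh.shardCollection('vsm.pricelist', {maker_id: 1})  // shard the still-empty collection
sh.stopBalancer()  // turn off the auto balancer
sh.splitAt('vsm.pricelist', {maker_id: 700})   // pre-create empty chunks at illustrative maker_id boundaries
sh.splitAt('vsm.pricelist', {maker_id: 1400})
sh.moveChunk('vsm.pricelist', {maker_id: 700}, 'shard0000')   // spread the empty chunks across the shards (shard names are examples)
sh.moveChunk('vsm.pricelist', {maker_id: 1400}, 'shard0001')
// ... run the data load from the application nodes ...
sh.startBalancer()  // turn the balancer back on once the load is complete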
You will have to test this approach using a small data set first, but I think this gives you enough information to get started.

Related

How to shard a collection with data inside a sharded MongoDB cluster

I want to shard a collection that already contains data. When I try sh.shardCollection("myDb.myCollection", {id:"hashed"}), the collection gets sharded, but the data is not spread across all the shards; it stays only on the primary shard. For example:
Empty collection after sharding (sh.status() result omitted): when data is then added, it spreads across all the shards.
Collection with existing data after sharding (sh.status() result omitted): when data is then added, it only goes to the primary shard.
My question is: how do I correctly shard a collection that already has data in MongoDB? Is there any alternative way?
I agree with @Wernfried Domscheit in the comments about the fact that the cluster will take care of distributing the data once the collection is sharded. As mentioned, that is done based on writing to the collection and happens over time. Your test may have too little data or too few writes to trigger the changes.
To your specific question about the initial distribution of chunks, this is covered in the documentation. Applying a hashed shard key on an empty collection in your first example is covered here:
The sharding operation creates empty chunks to cover the entire range of the shard key values and performs an initial chunk distribution. By default, the operation creates 2 chunks per shard and migrates across the cluster. You can use numInitialChunks option to specify a different number of initial chunks. This initial creation and distribution of chunks allows for faster setup of sharding.
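For example, a hashed shard key on an empty collection can request a specific number of initial chunks; this is only a sketch, and the chunk count is illustrative:
sh.shardCollection("myDb.myCollection", { id: "hashed" }, false, { numInitialChunks: 12 })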
And behavior on the collection with data is covered just above it here:
The sharding operation creates the initial chunk(s) to cover the entire range of the shard key values. The number of chunks created depends on the configured chunk size.
Both of these described behaviors match what you have demonstrated in your question.

How to efficiently shard a MongoDB collection that already has millions of documents?

I have a collection named order_error, which has over 60 million documents. Today I was trying to shard it. I have 3 replica sets. Initially there were no issues; the balancer was distributing the chunks among the shards. But eventually it started to consume all the RAM and then all the swap space too. Now everything is unresponsive. We can't follow this procedure in production. We need a better solution. How can I do the sharding in a better way?
If someone could help me with that, please let me know.
When you insert documents into an empty collection, initially all data will be written to the primary shard, so that alone will not solve your issue.
But you can use sh.splitAt on the empty collection to pre-split it.
Note that even if the collection is empty, it will take some time until the chunks are distributed over all the shards! When you split a chunk, it still remains on its current shard. Check with db.collection.getShardDistribution() whether the chunks are evenly distributed.
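A minimal sketch of that pre-split-and-verify flow in the mongo shell; the database name and split point are hypothetical, since the question does not state the shard key:
sh.splitAt("mydb.order_error", { _id: ObjectId("5f0000000000000000000000") })  // pre-split the still-empty collection at an illustrative boundary
db.order_error.getShardDistribution()  // later, check how documents and chunks are spread across the shards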

Is it good to add sharding on a single server?

Is it good to do sharding on a single machine/server? If the size of the MongoDB documents is above 10GB, will it perform well?
The key rule of sharding is: don't shard until you absolutely have to. If you're not having problems with performance now, you don't need to shard. Choosing a shard key can be a difficult process, and a poor choice means your data won't be balanced correctly between shards. Sharding can add severe overhead to your deployment that takes a lot to manage, since you will need additional mongod processes, config servers, and replica sets in order for it to be stable in production.
I'm assuming you mean your collections are 10GB. Depending on the size of your machine, a 10GB collection is not a lot for Mongo to handle. If you're having performance issues with queries, my first step would be to go through your mongo log and see if there are any queries you could add indexes for.
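For instance, a minimal sketch of that workflow in the mongo shell; the collection and field names are hypothetical:
db.setProfilingLevel(1, 100)  // record operations slower than 100 ms (the mongod log also reports them)
db.system.profile.find().sort({ ts: -1 }).limit(5)  // inspect the most recent slow operations
db.orders.createIndex({ customerId: 1, createdAt: -1 })  // add an index for a query that shows up as slow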

MongoDB Sharded cluster: Inserts only hitting one shard

We are using a cluster with 6 shards.
The collection uses a non-hashed key.
The documents are rather big and our chunk-size is set to 512MB.
Two huge bulk inserts hit our cluster but everything is inserted on a single shard.
This leads to 120% effective lock, while the other shards are chilling at 5% lock.
I think that the bulk inserts only append to the last chunk, since the inserts are ordered. Due to the heavy load there is no redistribution of chunks until the insert ends.
After the bulk insert, redistribution works nicely.
MongoDB version is 2.6.5.
How can I configure the config servers to automatically distribute bulk inserts?
I will edit the post if more information is required.
Thank You all!!!
As answered below:
pre-splitting is the best solution for us. This allows us to evenly distribute the whole set before insertion since we know the key space! Thank you!
Sounds like your shard key is monotonic? The documentation has a large section about bulk insert in sharded environments.
Essentially,
either pre-split the collection
or insert to different mongos (not for the initial insert)
and/or make sure that your shard key doesn't increase monotonically (for non-hashed collections, that's usually a good idea).

mongodb - Reclaim disk space regularly with no downtime

We have a replica set of 1 primary, 1 secondary, and 1 arbiter. We delete collections often, so I am looking for a fast way to reclaim the disk space used by deleted collections with no downtime. The current database size is close to 3TB.
I've been researching various ways of doing this; 2 common approaches are:
repairDatabase(): needs free space equal to the size of the used space in order to run. It takes the primary offline and then starts an initial sync on the secondary, which is a very lengthy process, during which only one node is available: read-only from the secondary during repairDatabase, and read/write during the initial sync.
Run an initial sync on a new node, then promote it to primary and retire the old one. Repeat the process for the secondary. With this option both primary and secondary stay available, but it is a very lengthy process and takes almost a week to run the initial sync twice.
Is there a better solution to reclaim disk space on a regular basis, and relatively faster than the above?
Note that every single collection is subject to deletion.
Thanks
There's no easy way to achieve this unless you design your DB structure to keep different collections in different databases, which in turn means storing them in different paths on your disk, as long as you have directoryPerDB set to true in your mongod.conf. This is a workaround, and depending on your app it might be impractical.
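For reference, a minimal sketch of that setting in a YAML-style mongod.conf (the dbPath is only illustrative):
storage:
  dbPath: /var/lib/mongodb  # illustrative data directory
  directoryPerDB: true      # each database gets its own subdirectory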
While it's true that dropping a collection won't free the disk space, it's also true that the used space is not lost. It will eventually be reused for new collections.
That being said, unless you are really short on space, don't reclaim that space. The CPU and I/O cost of doing that regularly is far more expensive than the storage capacity cost with every provider I know of.
I'd take a look at using MongoDB's sharding functionality to address some of your issues. To quote from the documentation:
Sharding is a method for storing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations.
While sharding is frequently used to balance throughput for a large collection across more servers, to avoid hot spots and spread the overall load, it's also useful for managing storage for large collections. In your specific case I'd investigate the use of shard tags to pin a collection to a specific shard.
Again, to quote the documentation, shard tags are useful to
isolate a specific subset of data on a specific set of shards.
For example, let's say you split your production environment into a couple of shards, shard1 and shard2. You could, using shard tags and the sharding tools, pin the collections that you frequently delete onto shard2. In this use case shard1 contains all your normal collections. When you then choose to reclaim disk storage via your second option, you'd perform this only on the shard that holds the deleted collections - that way you avoid having to recreate the more static data. It should run faster that way (how much faster is a function of how much data is in the deleted-collections shard at any given time).
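A rough sketch of that tagging approach in the mongo shell, using the hypothetical shard1/shard2 names from the example and an illustrative collection that is assumed to be sharded on _id:
sh.addShardTag("shard2", "volatile")  // tag the shard that should hold the frequently deleted collections
sh.addTagRange("mydb.tempEvents", { _id: MinKey }, { _id: MaxKey }, "volatile")  // pin the collection's whole key range to that tag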
It also has the secondary benefit that each shard (actually the replica set within each shard) requires smaller servers, since it only contains a subset of the overall data.
The specifics of the best way to do this will be driven by your exact use case - number and size of collections, insert, update, query and deletion frequency, etc. I described a simple 2 shard case but you can do this with many more shards. You can also have some shards running on higher performance hardware for collections that have more transaction volume.
I can't really do sharding justice within the limited space here other than to point you in the right direction to investigate it. MongoDB has a lot of good information in their documentation, and their 2 online DBA courses (which are free) get into this in some detail.
Some useful links:
http://docs.mongodb.org/manual/core/sharding-introduction/
http://docs.mongodb.org/manual/core/tag-aware-sharding/