How to make data in a sharded collection move away from a shard [migrated] - mongodb

I have a sharded collection with 3 shards (shard1, shard2, shard3). Each shard is a 3-node replica set (rs1, rs2, rs3).
I have a db called activity that has a large sharded collection called items (i.e., activity.items). The data in this collection is split across shard{1,2,3}.
I have another db called app and a collection called users (i.e., app.users). This is not a sharded collection; it is housed on shard1.
I want the data from activity.items that currently resides on shard1 to no longer reside there. I don't want any activity.items data on shard1; it should move to either shard2 or shard3. Or, if it's easier, I can spin up a shard4.
Is this possible? Any high-level guidance on which commands or example docs to look at would be greatly appreciated. I'm open to alternative solutions that achieve my goal of moving the activity.items collection data away from shard1.

You can use the moveChunk command.
This old script was used to move everything to one shard (the primary), but it should give you the idea. In your case, filter on just one shard as the source, and then for the destination create a loop that distributes those chunks evenly across the other shards. Alternatively, you can move them all to a single shard and let the balancer distribute the chunks.
database = 'activity'
collection = database + '.items'   // full namespace of the sharded collection
sh.stopBalancer()
use config
// note: on MongoDB 5.0+ config.chunks references collections by uuid instead of ns
primary = db.databases.findOne({_id: database}).primary
// move all chunks that are not already on the primary shard back to it
db.chunks.find({ns: collection, shard: {$ne: primary}}).forEach(function(chunk) {
    print('moving chunk from', chunk.shard, 'to', primary, '::', tojson(chunk.min), '-->', tojson(chunk.max));
    sh.moveChunk(collection, chunk.min, primary);
});
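For the original question (draining shard1 specifically), a similar sketch, assuming the shard names shard1/shard2/shard3 from the question and a pre-5.0 config.chunks layout where chunks are keyed by ns:

database = 'activity'
collection = database + '.items'
sh.stopBalancer()
use config
// round-robin the chunks currently on shard1 across the remaining shards
targets = ['shard2', 'shard3']
i = 0
db.chunks.find({ns: collection, shard: 'shard1'}).forEach(function(chunk) {
    var dest = targets[i % targets.length];
    print('moving chunk', tojson(chunk.min), '-->', tojson(chunk.max), 'to', dest);
    sh.moveChunk(collection, chunk.min, dest);
    i++;
});

Keep in mind that once the balancer is re-enabled it balances by chunk count across all shards, so without zone (tag) restrictions it may move activity.items chunks back onto shard1.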

Related

How shard a collection with data inside sharded cluster MongoDB

I want to shard a collection that already has data. When I try sh.shardCollection("myDb.myCollection", {id:"hashed"}), the collection gets sharded but its data is not spread across all the shards; it stays only on the primary shard. For example:
Empty collection sharded first (sh.status() result shown): when data is then added, it spreads across all the shards.
Collection sharded with existing data (sh.status() result shown): when data is added, it goes only to the primary shard.
My question is: how do I correctly shard a collection that already contains data in MongoDB? Is there any alternative way?
I agree with @Wernfried Domscheit in the comments that the cluster will take care of distributing the data once the collection is sharded. As mentioned, that distribution is driven by writes to the collection and happens over time. Your test may have too little data or too few writes to trigger the migrations.
To your specific question about the initial distribution of chunks, this is covered in the documentation. Applying a hashed shard key on an empty collection in your first example is covered here:
The sharding operation creates empty chunks to cover the entire range of the shard key values and performs an initial chunk distribution. By default, the operation creates 2 chunks per shard and migrates across the cluster. You can use numInitialChunks option to specify a different number of initial chunks. This initial creation and distribution of chunks allows for faster setup of sharding.
And behavior on the collection with data is covered just above it here:
The sharding operation creates the initial chunk(s) to cover the entire range of the shard key values. The number of chunks created depends on the configured chunk size.
Both of these described behaviors match what you have demonstrated in your question.
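To illustrate the two paths above, a small sketch (the namespaces and the numInitialChunks value are invented for the example):

// empty collection: the hashed shard key triggers initial chunk creation and
// an initial distribution across the shards
sh.enableSharding("myDb")
sh.shardCollection("myDb.emptyCollection", { id: "hashed" }, false, { numInitialChunks: 6 })

// collection that already contains data: the initial chunks are created on the
// primary shard and are only migrated by the balancer over time
sh.shardCollection("myDb.collectionWithData", { id: "hashed" })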

mongodb - one collection per shard

My system is built on multi-tenancy, and I intend to apply database sharding and replica sets to it. This is new to me, so I have some questions below:
Is it possible to keep a collection entirely on one shard? That is, instead of splitting a collection's documents between shards, I want to put one collection completely on one shard and another collection completely on another shard. My multi-tenant system is built on schema-per-tenant, so one collection represents one tenant; putting each of them entirely on one shard would make aggregation queries more reliable within that tenant's scope.
If MongoDB cannot support what I describe in question 1, how can I correctly aggregate the queried data across shards when a collection's documents are scattered?
I want to know the full extent of support provided by the DBMS rather than delegating the logic to the backend. Thank you very much.

Is MongoDB always write to primary Shard and then rebalance?

use vsm;
sh.enableSharding('vsm');
sh.shardCollection('vsm.pricelist', {maker_id:1});
OK, we enabled sharding for the database (vsm) and for a collection in this database (pricelist).
We are trying to write about 80 million documents to the 'pricelist' collection.
We have about 2000 different maker_ids, distributed uniformly.
We have three shards, and Shard002 is the PRIMARY for the 'vsm' database.
We write to the 'pricelist' collection from four application nodes, each running its own mongos.
During writes to the 'pricelist' collection we see 100% CPU usage ONLY on Shard002!
We do see the rebalancing process, and data migrates to Shard000 and Shard003, but Shard002 still has high CPU usage and load average!
The shards are deployed on c4.xlarge EBS-optimized instances, with the db data stored on io1 EBS volumes with 2000 IOPS.
It looks like MongoDB writes data to only one shard :( What are we doing wrong?
The problem
What you describe is usually an indication that you have chosen a poor shard key: maker_id is most likely monotonically increasing.
What usually happens is that one shard is assigned the key range from some value x to infinity (Shard002 in your case). All new documents get written to that shard until it holds chunks in excess of the current migration threshold. Then the balancer kicks in and moves some chunks, but the problem is that new documents are still written to that same shard.
The solution
An easy solution to this problem is to use a hashed shard key.
Now here comes the serious problem: you cannot change the shard key of an existing collection.
So what you have to do is make a backup of the sharded collection, drop it, shard it again using a hashed maker_id, and restore the backup into the new collection.
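A minimal sketch of that workflow; the backup/restore itself (e.g. mongodump/mongorestore) happens outside the shell, and the commands should be adapted to your deployment:

// 1. back up the collection first, e.g. with mongodump
// 2. drop the old collection and shard it again with a hashed key
use vsm
db.pricelist.drop()
sh.shardCollection('vsm.pricelist', { maker_id: 'hashed' })
// 3. restore the backup (e.g. with mongorestore); inserts now hash across
//    chunks on all shards instead of targeting a single growing range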
Is MongoDB always write to primary Shard and then rebalance?
Yes, if you are relying on the auto-balancer and loading huge amounts of data into an empty collection.
In your situation, you are relying on the auto-balancer to do all the sharding / balancing work. I assume what you want is for the data to be spread across the shards as it gets loaded, so that no single shard carries all the CPU load.
This is how sharding / auto-balancing takes place at a high level:
Create chunks of data using split http://docs.mongodb.org/manual/reference/command/split/
Move the chunks to other shards http://docs.mongodb.org/manual/reference/command/moveChunk/#dbcmd.moveChunk
Now, when the auto-balancer is ON, these two steps occur while your data is already loaded or still loading.
Solution
Create an empty collection (the one your data is going to be loaded into) and execute the shard command on it.
Turn off the auto balancer http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#disable-the-balancer
Manually create empty chunks using split. http://docs.mongodb.org/manual/tutorial/create-chunks-in-sharded-cluster/
Move those empty chunks to different shards. http://docs.mongodb.org/manual/tutorial/migrate-chunks-in-sharded-cluster/
Start the load. This time all your data should go directly to its respective shards.
Turn ON the balancer. (Once the load is complete) http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#enable-the-balancer
You will have to test this approach with a small data set first, but I think this gives you enough information to get started.
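A rough sketch of those steps, assuming a ranged shard key on maker_id and the shard names from the question; the split points are invented and should be chosen from your real key distribution:

sh.stopBalancer()
// pre-create empty chunks at chosen maker_id boundaries
sh.splitAt('vsm.pricelist', { maker_id: 700 })
sh.splitAt('vsm.pricelist', { maker_id: 1400 })
// place each empty range on a different shard before loading
sh.moveChunk('vsm.pricelist', { maker_id: 1 }, 'Shard000')
sh.moveChunk('vsm.pricelist', { maker_id: 700 }, 'Shard002')
sh.moveChunk('vsm.pricelist', { maker_id: 1400 }, 'Shard003')
// run the bulk load through mongos, then re-enable the balancer
sh.startBalancer()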

MongoDB 'Manually Sharding' for multi-tenancy

We are using Mongo to host a multi-tenant application. Each tenant is going to have their own database. To get around resource utilization issues the approach that we are taking is to shard by database (as opposed to by collection - if that is the correct term to use).
This means for every x tenants we will create a new 3-node replica set. So we may have for example 1000 tenants on 1 shard and another 1000 tenants on another shard.
My question is regarding the placement of databases for new signups. The approach we were going to take was to flag one shard as the 'active' shard and create all new tenant databases on that shard. When it reaches capacity, we create a new shard, flag that as the active shard, and continue on.
Can you choose which shard a new database is created on in Mongo directly? If left to Mongo, from what I understand, it will place databases in round-robin fashion when there is more than one shard, which may leave our shards imbalanced.
Is this the right approach or is there an alternative better approach?
You can use shard tags to force some collections to reside only on specific shards. So you could, for example, tag each shard with its serial number and tag the collections/databases you want to have on that shard with that tag, until the shard runs full, at which point you create a new shard, increment the counter, and use that for new data.
Another option then is to not enable sharding on the individual databases at all, and use the movePrimary command to force a specific shard to act as the primary shard for a specific database. Since the database won't be sharded, all its data will remain on its designated primary shard, which is exactly what you want.
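A hedged sketch of both options; the shard, database, and tag names (shard0001, tenant0042, tenantGroupA) are invented for illustration:

// Option 1: tag the "active" shard and pin a tenant's sharded collection to it
sh.addShardTag('shard0001', 'tenantGroupA')
sh.enableSharding('tenant0042')
sh.shardCollection('tenant0042.data', { _id: 1 })
sh.addTagRange('tenant0042.data', { _id: MinKey }, { _id: MaxKey }, 'tenantGroupA')

// Option 2: leave the tenant database unsharded and choose its primary shard explicitly
db.adminCommand({ movePrimary: 'tenant0042', to: 'shard0001' })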
That being said, it seems to me like this approach conflicts with the very concept of sharding, which is meant to evenly distribute data across multiple machines automatically.

Mongodb Sharding and Indexing

I have been struggling to deploy a large database.
I have deployed 3 shard clusters and started indexing my data.
However, it's been 16 days and I'm only halfway through.
The question is: should I import all the data into a non-sharded cluster first, activate sharding once the raw data is in the database, and then attach more clusters and start indexing? Would this auto-balance my data?
Or should I wait another 16 days for the current method I am using...
Edit:
Here is more explanation of the setup and the data that is being imported.
We have 160 million documents that look like this:
"_id" : ObjectId("5146ae7de4b0d58a864bcfda"),
"subject" : "<concept/resource/propert/122322xyz>",
"predicate" : "<concept/property/os/123ABCDXZYZ>",
"object" : "<http://host/uri_to_object_abcdy>"
Indexes: subject, predicate, object, subject > predicate, object > predicate
Shard keys: subject, predicate, object
Setup:
3 clusters on AWS (each with 3 Replica sets) with each node having 8 GiB RAM
(Config servers are within each cluster and Mongos is in a separate server)
The data gets imported by a Java program through the mongos.
What would be the ideal way to import this data, index it, and shard it (without waiting a month for the process to be completed)?
If you are doing a massive bulk insert, it is often faster to perform the insert without indexes and then index the collection afterwards. This has to do with the way Mongo manages index updates on the fly.
Also, MongoDB is particularly sensitive to memory while it indexes. Check the size of your indexes with db.stats() and hook your databases up to the MongoDB Monitoring Service (MMS).
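A minimal sketch of that order of operations; the collection name is hypothetical, and the compound indexes are a guess at the "subject > predicate" and "object > predicate" entries listed in the question:

// bulk load first (with only the shard-key index in place), then build the
// secondary indexes once the load has finished
db.items.createIndex({ subject: 1 })
db.items.createIndex({ predicate: 1 })
db.items.createIndex({ object: 1 })
db.items.createIndex({ subject: 1, predicate: 1 })
db.items.createIndex({ object: 1, predicate: 1 })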
In my experience, whenever MongoDB takes a lot more time than expected, it is due to one of two things:
Running out of physical memory or getting into a poor I/O pattern. MMS can help diagnose both; check the page faults graph in particular.
Operating on unindexed collections, which does not apply in your case.