How to Programmatically Pre-Split a GUID Based Shard Key with MongoDB

Let's say I am using a fairly standard 32 character hex GUID, and I have determined that, because it is randomly generated for my users, it is perfect for use as a shard key to horizontally scale writes to the MongoDB collection that I will be storing the user information in (and write scaling is my primary concern).
I also know that I will need to start with at least 4 shards, because of traffic projections and some benchmark work done with a test environment.
Finally, I have a decent idea of my initial data size (average document size * number of initial users) - which comes to around ~120GB.
I'd like to make the initial load nice and fast and utilize all 4 shards as much as possible. How do I pre-split this data so that I take advantage of the 4 shards and minimize the number of moves, splits etc. that need to happen on the shards during the initial data load?

We know the initial data size (120GB) and we know the default maximum chunk size in MongoDB is 64MB. If we divide 64MB into 120GB we get 1920 - so that is the minimum number of chunks we should look to start with. Conveniently, 2048 is a power of 16 divided by 2 (16^3 / 2), and given that the GUID (our shard key) is hex based, that's a much easier number to deal with than 1920 (see below).
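A quick back-of-the-envelope check of that math in the mongo shell (just arithmetic, nothing cluster-specific):
var dataSizeMB = 120 * 1024;  // ~120GB of initial data, expressed in MB
var chunkSizeMB = 64;         // default maximum chunk size
dataSizeMB / chunkSizeMB;     // 1920 - round up to 2048 (16^3 / 2) for clean hex boundaries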
NOTE: This pre-splitting must be done before any data is added to the collection. If you shard a collection that already contains data (via shardCollection()), MongoDB will create the initial chunks itself and you will then be running your splits while chunks already exist - that can lead to quite odd chunk distribution, so beware.
For the purposes of this answer, let's assume that the database is going to be called users and the collection is called userInfo. Let's also assume that the GUID will be written into the _id field. With those parameters we would connect to a mongos and run the following commands:
// first switch to the users DB
use users;
// now enable sharding for the users DB
sh.enableSharding("users");
// enable sharding on the relevant collection
sh.shardCollection("users.userInfo", {"_id" : 1});
// finally, disable the balancer (see below for options on a per-collection basis)
// this prevents migrations from kicking off and interfering with the splits by competing for metadata locks
sh.stopBalancer();
Now, per the calculation above, we need to split the GUID range into 2048 chunks. To do that we need at least 3 hex digits (16 ^ 3 = 4096) and we'll be putting them in the most significant digits (i.e. the 3 leftmost) for the ranges. Again, this should be run from a mongos shell
// Simply use a for loop for each digit
for ( var x=0; x < 16; x++ ){
    for( var y=0; y<16; y++ ) {
        // for the innermost loop we will increment by 2 to get 2048 total iterations
        // make this z++ for 4096 - that would give ~30MB chunks based on the original figures
        for ( var z=0; z<16; z+=2 ) {
            // now construct the GUID with zeroes for padding - handily the toString method takes an argument to specify the base
            var prefix = "" + x.toString(16) + y.toString(16) + z.toString(16) + "00000000000000000000000000000";
            // finally, use the split command to create the appropriate chunk
            db.adminCommand( { split : "users.userInfo" , middle : { _id : prefix } } );
        }
    }
}
Once that is done, let's check the state of play using the sh.status() helper:
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 3,
        "minCompatibleVersion" : 3,
        "currentVersion" : 4,
        "clusterId" : ObjectId("527056b8f6985e1bcce4c4cb")
  }
  shards:
        { "_id" : "shard0000", "host" : "localhost:30000" }
        { "_id" : "shard0001", "host" : "localhost:30001" }
        { "_id" : "shard0002", "host" : "localhost:30002" }
        { "_id" : "shard0003", "host" : "localhost:30003" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "users", "partitioned" : true, "primary" : "shard0001" }
                users.userInfo
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0001       2049
                        too many chunks to print, use verbose if you want to force print
We have our 2048 chunks (plus one extra thanks to the min/max chunks), but they are all still on the original shard because the balancer is off. So, let's re-enable the balancer:
sh.startBalancer();
This will immediately begin to balance out, and it will be relatively quick because all the chunks are empty, but it will still take a little while (much slower if it is competing with migrations from other collections). Once some time has elapsed, run sh.status() again and there you (should) have it - 2048 chunks all nicely split out across 4 shards and ready for an initial data load:
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 3,
        "minCompatibleVersion" : 3,
        "currentVersion" : 4,
        "clusterId" : ObjectId("527056b8f6985e1bcce4c4cb")
  }
  shards:
        { "_id" : "shard0000", "host" : "localhost:30000" }
        { "_id" : "shard0001", "host" : "localhost:30001" }
        { "_id" : "shard0002", "host" : "localhost:30002" }
        { "_id" : "shard0003", "host" : "localhost:30003" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "users", "partitioned" : true, "primary" : "shard0001" }
                users.userInfo
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0000       512
                                shard0002       512
                                shard0003       512
                                shard0001       513
                        too many chunks to print, use verbose if you want to force print
        { "_id" : "test", "partitioned" : false, "primary" : "shard0002" }
You are now ready to start loading data, but to absolutely guarantee that no splits or migrations happen until your data load is complete, you need to do one more thing - turn off the balancer and autosplitting for the duration of the import:
To disable all balancing, run this command from the mongos: sh.stopBalancer()
If you want to leave other balancing operations running, you can disable on a specific collection. Using the namespace above as an example: sh.disableBalancing("users.userInfo")
To turn off auto splitting during the load, you will need to restart each mongos you will be using to load the data with the --noAutoSplit option.
Once the import is complete, reverse the steps as needed (sh.startBalancer(), sh.enableBalancing("users.userInfo"), and restart the mongos without --noAutoSplit) to return everything to the default settings.
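For convenience, here is the whole "quiesce for the import" sequence in one place - a sketch using only the commands already mentioned above, with the collection name from this example:
// before the load
sh.stopBalancer();                        // all collections, or...
sh.disableBalancing("users.userInfo");    // ...just this collection
// (restart each mongos used for the load with --noAutoSplit)

// after the load
sh.startBalancer();                       // or sh.enableBalancing("users.userInfo")
// (restart the mongos without --noAutoSplit)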
Update: Optimizing for Speed
The approach above is fine if you are not in a hurry. As things stand, and as you will discover if you test this, the balancer is not very fast - even with empty chunks. Hence, the more chunks you create, the longer it will take to balance. I have seen it take more than 30 minutes to finish balancing 2048 chunks, though this will vary depending on the deployment.
That might be OK for testing, or for a relatively quiet cluster, but keeping the balancer off and ensuring that no other updates interfere will be much harder on a busy cluster. So, how do we speed things up?
The answer is to do some manual moves early, then split the chunks once they are on their respective shards. Note that this is only desirable with certain shard keys (like a randomly distributed UUID), or certain data access patterns, so be careful that you don't end up with poor data distribution as a result.
Using the example above we have 4 shards, so rather than doing all the splits, then balancing, we split into 4 instead. We then put one chunk on each shard by manually moving them, and then finally we split those chunks into the required number.
The ranges in the example above would look like this:
$min --> "40000000000000000000000000000000"
"40000000000000000000000000000000" --> "80000000000000000000000000000000"
"80000000000000000000000000000000" --> "c0000000000000000000000000000000"
"c0000000000000000000000000000000" --> $max
It only takes 3 split commands to create these 4 ranges, but since we have it, why not re-use the loop above in a simplified/modified form:
for ( var x=4; x < 16; x+=4){
    var prefix = "" + x.toString(16) + "0000000000000000000000000000000";
    db.adminCommand( { split : "users.userInfo" , middle : { _id : prefix } } );
}
Here's how things look now - we have our 4 chunks, all on shard0001:
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("53467e59aea36af7b82a75c1")
  }
  shards:
        { "_id" : "shard0000", "host" : "localhost:30000" }
        { "_id" : "shard0001", "host" : "localhost:30001" }
        { "_id" : "shard0002", "host" : "localhost:30002" }
        { "_id" : "shard0003", "host" : "localhost:30003" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "test", "partitioned" : false, "primary" : "shard0001" }
        { "_id" : "users", "partitioned" : true, "primary" : "shard0001" }
                users.userInfo
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0001       4
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : "40000000000000000000000000000000" } on : shard0001 Timestamp(1, 1)
                        { "_id" : "40000000000000000000000000000000" } -->> { "_id" : "80000000000000000000000000000000" } on : shard0001 Timestamp(1, 3)
                        { "_id" : "80000000000000000000000000000000" } -->> { "_id" : "c0000000000000000000000000000000" } on : shard0001 Timestamp(1, 5)
                        { "_id" : "c0000000000000000000000000000000" } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 Timestamp(1, 6)
We will leave the $min chunk where it is, and move the other three. You can do this programmatically, but it depends on where the chunks reside initially and how you have named your shards, so I will leave it manual here (see the sketch after these commands); it is not too onerous - just 3 moveChunk commands:
mongos> sh.moveChunk("users.userInfo", {"_id" : "40000000000000000000000000000000"}, "shard0000")
{ "millis" : 1091, "ok" : 1 }
mongos> sh.moveChunk("users.userInfo", {"_id" : "80000000000000000000000000000000"}, "shard0002")
{ "millis" : 1078, "ok" : 1 }
mongos> sh.moveChunk("users.userInfo", {"_id" : "c0000000000000000000000000000000"}, "shard0003")
{ "millis" : 1083, "ok" : 1 }
Let's double check, and make sure that the chunks are where we expect them to be:
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("53467e59aea36af7b82a75c1")
  }
  shards:
        { "_id" : "shard0000", "host" : "localhost:30000" }
        { "_id" : "shard0001", "host" : "localhost:30001" }
        { "_id" : "shard0002", "host" : "localhost:30002" }
        { "_id" : "shard0003", "host" : "localhost:30003" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "test", "partitioned" : false, "primary" : "shard0001" }
        { "_id" : "users", "partitioned" : true, "primary" : "shard0001" }
                users.userInfo
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0001       1
                                shard0000       1
                                shard0002       1
                                shard0003       1
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : "40000000000000000000000000000000" } on : shard0001 Timestamp(4, 1)
                        { "_id" : "40000000000000000000000000000000" } -->> { "_id" : "80000000000000000000000000000000" } on : shard0000 Timestamp(2, 0)
                        { "_id" : "80000000000000000000000000000000" } -->> { "_id" : "c0000000000000000000000000000000" } on : shard0002 Timestamp(3, 0)
                        { "_id" : "c0000000000000000000000000000000" } -->> { "_id" : { "$maxKey" : 1 } } on : shard0003 Timestamp(4, 0)
That matches our proposed ranges above, so all looks good. Now run the original loop above to split them "in place" on each shard and we should have a balanced distribution as soon as the loop finishes. One more sh.status() should confirm things:
mongos> for ( var x=0; x < 16; x++ ){
... for( var y=0; y<16; y++ ) {
... // for the innermost loop we will increment by 2 to get 2048 total iterations
... // make this z++ for 4096 - that would give ~30MB chunks based on the original figures
... for ( var z=0; z<16; z+=2 ) {
... // now construct the GUID with zeroes for padding - handily the toString method takes an argument to specify the base
... var prefix = "" + x.toString(16) + y.toString(16) + z.toString(16) + "00000000000000000000000000000";
... // finally, use the split command to create the appropriate chunk
... db.adminCommand( { split : "users.userInfo" , middle : { _id : prefix } } );
... }
... }
... }
{ "ok" : 1 }
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("53467e59aea36af7b82a75c1")
  }
  shards:
        { "_id" : "shard0000", "host" : "localhost:30000" }
        { "_id" : "shard0001", "host" : "localhost:30001" }
        { "_id" : "shard0002", "host" : "localhost:30002" }
        { "_id" : "shard0003", "host" : "localhost:30003" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "test", "partitioned" : false, "primary" : "shard0001" }
        { "_id" : "users", "partitioned" : true, "primary" : "shard0001" }
                users.userInfo
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0001       513
                                shard0000       512
                                shard0002       512
                                shard0003       512
                        too many chunks to print, use verbose if you want to force print
And there you have it - no waiting for the balancer, the distribution is already even.

Related

Sharding in a replicaset MongoDB

I have a MongoDB replica set: one primary, one secondary, and an arbiter to vote. I'm planning to implement sharding as the data is expected to grow exponentially. I find the MongoDB documentation on sharding difficult to follow. Could someone explain clearly how to set it up? Thanks in advance.
If you can get a replica set working, sharding is pretty simple. This is pretty much the MongoDB documentation repeated in fast forward:
Below is a sample setup: 3 config servers and 3 shards.
For the below example you can run all of it on one machine to see it all working.
If you need three shards, set up three replica sets (a sketch follows after this list). (Assuming the 3 primaries are 127.0.0.1:27000, 127.0.0.1:37000, 127.0.0.1:47000.)
Run 3 mongod instances as three config servers. (Assuming: 127.0.0.1:27020, 127.0.0.1:27021, 127.0.0.1:27022.)
Start mongos (note the s in mongos), letting it know where your config servers are. (ex: 127.0.0.1:27023)
Connect to mongos from the mongo shell and add the three primary mongod's of your 3 replica sets as the shards.
Enable sharding for your DB.
If required, enable sharding for a collection.
Select a shard key if required. (Very important that you do it right the first time!!!)
Check the shard status.
Pump data; connect to individual mongod primaries and see the data distributed across the three shards.
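As a rough sketch of the replica set step above (the set name rsA and the second member's port are assumptions for illustration - adjust to your own hosts), each set would be initiated from a mongo shell connected to its first member:
// run while connected to 127.0.0.1:27000; repeat for the other two sets with their own names/ports
rs.initiate({
    _id : "rsA",
    members : [
        { _id : 0, host : "127.0.0.1:27000" },
        { _id : 1, host : "127.0.0.1:27100" }  // hypothetical second member
    ]
});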
#start mongos with three configs:
mongos --port 27023 --configdb localhost:27020,localhost:27021,localhost:27022
mongos> sh.addShard("127.0.0.1:27000");
{ "shardAdded" : "shard0000", "ok" : 1 }
mongos> sh.addShard("127.0.0.1:37000");
{ "shardAdded" : "shard0001", "ok" : 1 }
mongos> sh.addShard("127.0.0.1:47000");
{ "shardAdded" : "shard0002", "ok" : 1 }
mongos> sh.enableSharding("db_to_shard");
{ "ok" : 1 }
mongos> use db_to_shard;
switched to db db_to_shard
mongos>
mongos> sh.shardCollection("db_to_shard.coll_to_shard", {collId: 1, createdDate: 1} );
{ "collectionsharded" : "db_to_shard.coll_to_shard", "ok" : 1 }
mongos> show databases;
admin (empty)
config 0.063GB
db_to_shard 0.078GB
mongos> sh.status();
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("557003eb4a4e61bb2ea0555b")
  }
  shards:
        { "_id" : "shard0000", "host" : "127.0.0.1:27000" }
        { "_id" : "shard0001", "host" : "127.0.0.1:37000" }
        { "_id" : "shard0002", "host" : "127.0.0.1:47000" }
  balancer:
        Currently enabled: yes
        Currently running: no
        Failed balancer rounds in last 5 attempts: 0
        Migration Results for the last 24 hours:
                No recent migrations
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "test", "partitioned" : false, "primary" : "shard0000" }
        { "_id" : "db_to_shard", "partitioned" : true, "primary" : "shard0000" }
                db_to_shard.coll_to_shard
                        shard key: { "collId" : 1, "createdDate" : 1 }
                        chunks:
                                shard0000       1
                        { "collId" : { "$minKey" : 1 }, "createdDate" : { "$minKey" : 1 } } -->> { "collId" : { "$maxKey" : 1 }, "createdDate" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)

Sharding on MongoDB what about the other collections?

I understand that 'hashed sharding' can be done at the collection level in a database, based on a key of the collection that is passed.
This ensures that records for that collection are distributed across all the shards.
I understand what happens with one collection. What about the other collections?
Does all the data of all the other tables get stored in one shard only?
Does it get replicated across all the shards?
Does it also get split and spread across all the shards?
The other collections will reside on a single shard (known as the primary shard) unless you decide to shard them also. The primary shard is set at the database level rather than the collection level, so all non-sharded collections in a particular database will have the same primary shard. You can see the primary for any given database in the sh.status() output, as per the example below:
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "version" : 4,
        "minCompatibleVersion" : 4,
        "currentVersion" : 5,
        "clusterId" : ObjectId("54185b2c2a2835b6e47f7984")
  }
  shards:
        { "_id" : "shard0000", "host" : "localhost:30000" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "shardTest", "partitioned" : true, "primary" : "shard0000" }
                shardTest.foo
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0000       1
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
        { "_id" : "bar", "partitioned" : true, "primary" : "shard0000" }
                bar.data
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0000       1
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
        { "_id" : "foo", "partitioned" : true, "primary" : "shard0000" }
                foo.data
                        shard key: { "_id" : 1 }
                        chunks:
                                shard0000       9
In this example there is only one shard (shard0000), and hence it is the primary for all the databases ("primary" : "shard0000") except config which is a special case (and resides on the config servers). The primary shard for a database is chosen when the database is created.
Hence, if you only had one shard, created all your databases first, and then added more shards later, all the databases you created before adding new shards will have their primary set to that first shard (there was nothing else to choose). Any databases created after you have multiple shards could end up with any shard as their primary; essentially it is selected using round robin, but each mongos will have its own idea about where it is in that round-robin selection.
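If you would rather not parse sh.status() output, the same information can be read straight from the config database through a mongos - a quick sketch (treat these collections as read-only):
// one document per database, showing its primary shard and whether it is partitioned
db.getSiblingDB("config").databases.find({}, { _id : 1, primary : 1, partitioned : 1 });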

MongoDB replicates data to all shards

I have 4 servers in a test environment which I use to test MongoDB replica and distribution:
RepSetA holds RepSetA1 and RepSetA2.
RepSetB holds RepSetB1 and RepSetB2.
All servers act as routers, RepSetA1 acts as a single config server.
I have a "Player" data (10,000 records, the object consists an "Id" and a "Name" fields), and I want it to be sharded (or distributed) between the replica sets, and cloned among the servers in the same replica set. So, just for a plain example:
Player1-5000: Exists in both RepSetA1 and RepSetA2, but not in RepSetB1 and RepSetB2.
Player5000-10000: Exists in both RepSetB1 and RepSetB2, but not in RepSetA1 and RepSetA2.
What I get instead is having all players in all 4 servers.
If I print the sharding status, I get the following:
mongos> db.printShardingStatus();
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
        { "_id" : "RepSetA", "host" : "RepSetA/MongoRepSetA1:27018,MongoRepSetA2:27018" }
        { "_id" : "RepSetB", "host" : "RepSetB/MongoRepSetB1:27018,MongoRepSetB2:27018" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "GamesDB", "partitioned" : true, "primary" : "RepSetA" }
                GamesDB.Player chunks:
                        RepSetA 2
                        { "_id" : { $minKey : 1 } } -->> { "_id" : 0 } on : RepSetA { "t" : 1000, "i" : 1 }
                        { "_id" : 0 } -->> { "_id" : { $maxKey : 1 } } on : RepSetA { "t" : 1000, "i" : 2 }
        { "_id" : "test", "partitioned" : false, "primary" : "RepSetB" }
        { "_id" : "EOO", "partitioned" : false, "primary" : "RepSetB" }
I used the following commands to build the shards:
db.adminCommand( { addShard : "RepSetA/MongoRepSetA1:27018,MongoRepSetA2:27018" } )
db.adminCommand( { addShard : "RepSetB/MongoRepSetB1:27018,MongoRepSetB2:27018" } )
db.runCommand( { enablesharding : "GamesDB" } );
db.runCommand( { shardcollection : "GamesDB.Player", key : { _id :1 } , unique : true} );
What am I doing wrong?
If you connect to your nodes through a mongos process, it will look like they all contain the data. From your output, it doesn't look like all data is available on all nodes: RepSetA holds 2 chunks and RepSetB should contain none. You can verify this by connecting directly to your nodes rather than through mongos.
By the way, if you're using MongoDB's ObjectId as your _id (shard key), consider sharding on another key, as this will cause all inserts to go to a single node because the key increases monotonically.
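As an illustration of that advice (assuming a new, empty collection and MongoDB 2.4 or later - you cannot change the shard key of a collection that is already sharded), a hashed shard key on _id avoids the monotonic insert pattern:
sh.shardCollection("GamesDB.Player", { "_id" : "hashed" });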
This is fine. It does not show that all data is on all servers. The output shows that all chunks (data) of GamesDB.Player are on shard RepSetA
GamesDB.Player chunks:
RepSetA 2
{ "_id" : { $minKey : 1 } } -->> { "_id" : 0 } on : RepSetA { "t" : 1000, "i" : 1 }
{ "_id" : 0 } -->> { "_id" : { $maxKey : 1 } } on : RepSetA { "t" : 1000, "i" : 2 }
This means that the balancer has not started to balance your chunks. The balancer only kicks in when there is an 8 chunk difference.
http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-Balancing
You can force balancing by manually splitting chunks (if you want to)
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
Or you can reduce the chunk size if you want to see balancing sooner.
http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations
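For reference, both options look roughly like this from a mongos shell (a sketch; the split value 5000 is just taken from the Player Id range in the question, and the chunk size value is in MB):
// manually split the existing chunk at a chosen shard key value
sh.splitAt("GamesDB.Player", { "_id" : 5000 });
// or lower the maximum chunk size cluster-wide so splits and migrations happen sooner
db.getSiblingDB("config").settings.save({ _id : "chunksize", value : 32 });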

How to Verify Sharding?

I am trying to shard MongoDB. I am done with the sharding configuration, but I am not sure how to verify that sharding is functional.
How do I check whether my data is getting sharded? Is there a query to verify/validate the shards?
You can also execute a simple command on your mongos router :
> use admin
> db.printShardingStatus();
which should output information about your shards, your sharded DBs and your sharded collections, as mentioned in the MongoDB documentation:
sharding version: { "_id" : 1, "version" : 2 }
shards:
      { "_id" : ObjectId("4bd9ae3e0a2e26420e556876"), "host" : "localhost:30001" }
      { "_id" : ObjectId("4bd9ae420a2e26420e556877"), "host" : "localhost:30002" }
      { "_id" : ObjectId("4bd9ae460a2e26420e556878"), "host" : "localhost:30003" }
databases:
      { "name" : "admin", "partitioned" : false,
        "primary" : "localhost:20001",
        "_id" : ObjectId("4bd9add2c0302e394c6844b6") }
        my chunks
      { "name" : "foo", "partitioned" : true,
        "primary" : "localhost:30002",
        "sharded" : { "foo.foo" : { "key" : { "_id" : 1 }, "unique" : false } },
        "_id" : ObjectId("4bd9ae60c0302e394c6844b7") }
        my chunks
          foo.foo { "_id" : { $minKey : 1 } } -->> { "_id" : { $maxKey : 1 } }
             on : localhost:30002 { "t" : 1272557259000, "i" : 1 }
MongoDB has detailed documentation on Sharding here ...
http://www.mongodb.org/display/DOCS/Sharding+Introduction
To answer your question (I think), see the portion on the config servers ...
Each config server has a complete copy of all chunk information. A two-phase commit is used to ensure the consistency of the configuration data among the config servers.
Basically, it is the config server's job to make sure everything gets sharded ... correctly.
Also, there are commands you can run and config collections you can query ...
db.runCommand( { listshards : 1 } );
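For example, the relevant config collections can be read directly from a mongos - a quick sketch (read-only queries):
db.getSiblingDB("config").shards.find();              // the shards registered in the cluster
db.getSiblingDB("config").databases.find();           // which databases are partitioned, and their primaries
db.getSiblingDB("config").chunks.find().limit(5);     // chunk ranges and the shard that owns each one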
Lots of help in the prez below too ...
http://www.slideshare.net/mongodb/mongodb-sharding-internals
http://www.10gen.com/video/mongosv2010/sharding
If you just want to check whether you are connected to a sharded cluster or not:
db.isMaster() can be used to detect that you are connected to a sharding router (mongos).
If db.isMaster().msg is "isdbgrid", you are connected to a sharded instance.
db.isMaster() can be run without authentication.
For checking the details of the shards, sh.status() also works; it has the same output as db.printShardingStatus().
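Putting that check into a small snippet (a sketch you could paste into the shell or a script):
var hello = db.isMaster();
if (hello.msg === "isdbgrid") {
    print("connected to a mongos - this is a sharded cluster");
} else {
    print("connected directly to a mongod, not via a mongos");
}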

Mongo sharding fails to split large collection between shards

I'm having problems with what seems to be a simple sharding setup in mongo.
I have two shards, a single mongos instance, and a single config server set up like this:
Machine A - 10.0.44.16 - config server, mongos
Machine B - 10.0.44.10 - shard 1
Machine C - 10.0.44.11 - shard 2
I have a collection called 'Seeds' that has a shard key 'SeedType' which is a field that is present on every document in the collection, and contains one of four values (take a look at the sharding status below). Two of the values have significantly more entries than the other two (two of them have 784,000 records each, and two have about 5,000).
The behavior I'm expecting to see is that records in the 'Seeds' collection with InventoryPOS will end up on one shard, and the ones with InventoryOnHand will end up on the other.
However, it seems that all the records for the two larger shard key values end up on the primary shard.
Here's my sharding status text (other collections removed for clarity):
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
        { "_id" : "shard0000", "host" : "10.44.0.11:27019" }
        { "_id" : "shard0001", "host" : "10.44.0.10:27017" }
  databases:
        { "_id" : "admin", "partitioned" : false, "primary" : "config" }
        { "_id" : "TimMulti", "partitioned" : true, "primary" : "shard0001" }
                TimMulti.Seeds chunks:
                        { "SeedType" : { $minKey : 1 } } -->> { "SeedType" : "PBI.AnalyticsServer.KPI" } on : shard0000 { "t" : 2000, "i" : 0 }
                        { "SeedType" : "PBI.AnalyticsServer.KPI" } -->> { "SeedType" : "PBI.Retail.InventoryOnHand" } on : shard0001 { "t" : 2000, "i" : 7 }
                        { "SeedType" : "PBI.Retail.InventoryOnHand" } -->> { "SeedType" : "PBI.Retail.InventoryPOS" } on : shard0001 { "t" : 2000, "i" : 8 }
                        { "SeedType" : "PBI.Retail.InventoryPOS" } -->> { "SeedType" : "PBI.Retail.SKU" } on : shard0001 { "t" : 2000, "i" : 9 }
                        { "SeedType" : "PBI.Retail.SKU" } -->> { "SeedType" : { $maxKey : 1 } } on : shard0001 { "t" : 2000, "i" : 10 }
Am I doing anything wrong?
Semi-unrelated question:
What is the best way to atomically transfer an object from one collection to another without blocking the entire mongo service?
Thanks in advance,
-Tim
Sharding really isn't meant to be used this way. You should choose a shard key with some variation (or make a compound shard key) so that MongoDB can make reasonable-size chunks. One of the points of sharding is that your application doesn't have to know where your data is.
If you want to manually shard, you should do that: start unlinked MongoDB servers and route things yourself from the client side.
Finally, if you're really dedicated to this setup, you could migrate the chunk yourself (there's a moveChunk command).
The balancer moves chunks based on how much is mapped in memory (run serverStatus and look at the "mapped" field). It can take a while; MongoDB doesn't want your data flying all over the place in production, so it's pretty conservative.
Semi-unrelated answer: you can't do it atomically with sharding (eval isn't atomic across multiple servers). You'll have to do a findOne, insert, remove.
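A minimal sketch of that findOne/insert/remove sequence from the mongo shell - the collection names and the query value are hypothetical, and note again that this is not atomic (a failure between the insert and the remove leaves the document in both places):
var doc = db.sourceColl.findOne({ _id : someId });   // someId: whatever identifies the document to move
if (doc !== null) {
    db.targetColl.insert(doc);
    db.sourceColl.remove({ _id : doc._id }, true);   // second argument true = remove at most one document
}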