Sharding in a replicaset MongoDB - mongodb

I have mongoDb replica set , One primary one secondary and an arbiter to vote. I'm planning to implement sharding as the data is expected to grow exponentially.I find difficult in following mongoDb document for sharding. Could someone explain it clearly to set it up. Thanks in advance.

If you could do accomplish replicaset, sharding is pretty simple. Pretty much repeating the mongo documentation in fast forward here:
Below is for a sample setup: 3 configDB and 3 shards
For the below example you can run all of it one machine to see it all working.
If you need three shards setup three replica sets. (Assuming 3 Primary's are 127:0.0.1:27000, 127.0.0.1:37000, 127.0.0.1:47000)
Run 3 instances mongod as three config servers. (Assuming: 127.0.0.1:27020, 127.0.0.1:27021, 127.0.0.1:270122)
Start mongos (note the s in mongos) letting it know where your config servers are. (ex: 127.0.0.1:27023)
Connect to mongos from mongo shell and add the three primary mongod's of your 3 replica sets as the shards.
Enable sharding for your DB.
If required enable sharding for a collection.
Select a shard key if required. (Very Important you do it right the first time!!!)
Check the shard status
Pump data; connect to individual mongod primarys and see the data distributed across the three shards.
#start mongos with three configs:
mongos --port 27023 --configdb localhost:27017,localhost:27018,localhost:27019
mongos> sh.addShard("127.0.0.1:27000");
{ "shardAdded" : "shard0000", "ok" : 1 }
mongos> sh.addShard("127.0.0.1:37000");
{ "shardAdded" : "shard0001", "ok" : 1 }
mongos> sh.addShard("127.0.0.1:47000");
{ "shardAdded" : "shard0002", "ok" : 1 }
mongos> sh.enableSharding("db_to_shard");
{ "ok" : 1 }
mongos> use db_to_shard;
switched to db db_to_shard
mongos>
mongos> sh.shardCollection("db_to_shard.coll_to_shard", {collId: 1, createdDate: 1} );
{ "collectionsharded" : "db_to_shard.coll_to_shard", "ok" : 1 }
mongos> show databases;
admin (empty)
config 0.063GB
db_to_shard 0.078GB
mongos> sh.status();
--- Sharding Status ---
sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("557003eb4a4e61bb2ea0555b")
}
shards:
{ "_id" : "shard0000", "host" : "127.0.0.1:27000" }
{ "_id" : "shard0001", "host" : "127.0.0.1:37000" }
{ "_id" : "shard0002", "host" : "127.0.0.1:47000" }
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
No recent migrations
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test", "partitioned" : false, "primary" : "shard0000" }
{ "_id" : "db_to_shard", "partitioned" : true, "primary" : "shard0000" }
db_to_shard.coll_to_shard
shard key: { "collId" : 1, "createdDate" : 1 }
chunks:
shard0000 1
{ "collId" : { "$minKey" : 1 }, "createdDate" : { "$minKey" : 1 } } -->> { "collId" : { "$maxKey" : 1 }, "createdDate" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)

Related

Sharded mongodb cluster use only one shard

I'm trying to create a sharded mongodb cluster, I have 3 VM to use and I was trying to get the following setup:
VM1) Node.js Application + mongos instance
VM2) CSRS PRIMARY + Shard
VM3) CSRS SECONDARY + Shard
I'm not interested in replication but I need sharding because I will have to query over large number of results.
To achieve what stated above I have done the following:
On both VM2 and VM3
mongod –configsvr --replSet configM4C –bind_ip <ipVM2/3> --port 27040
mongod –shardsvr --replSet shardM4C --bind_ip <ipVM2/3> --port 27041
Then, connected to the configM4C on VM2(mongo --host --port 27040) and run
rs.initiate({
_id:”configM4C”,
configsvr: true,
members: [
{_id: 0, host: “<ipVM2>:27040”},
{_id:1, host: “<ipVM3>:27040”}
]
})
Then, connected to shardM4C on VM2(mongo --host --port 27041) and run
rs.initiate({
_id: shardM4C,
members: [
{_id: 0, host: “<ip1>:27041”},
{_id:1, host: “<ip2>:27041”}
]
})
Started mongos on VM1
mongos –configdb configM4C/<ipVM2>:27040,<ipVM3>:27040 --port 27042
Connected to mongos on VM1(mongo --port 27042):
sh.addShard("<ipVM2>:27041)
sh.addShard("<ipVM3>:27041)
sh.enableSharding("pings")
sh.shardCollection("pings.pings", {provider:1, from_zone: 1, to_zone: 1})
Anyway at this point if I run db.pings.getShardDistribution() I get:
Shard shardM4C at shardM4C/<ipVM3>:27041,<ipVM2>:27041
data : 19.47MiB docs : 102876 chunks : 1
estimated data per chunk : 19.47MiB
estimated docs per chunk : 102876
Totals
data : 19.47MiB docs : 102876 chunks : 1
Shard shardM4C contains 100% data, 100% docs in cluster, avg obj size on shard : 198B
sh.status():
sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("5b58a6b4986cc2d94694a3f9")
}
shards:
{ "_id" : "shardM4C", "host" : "shardM4C/10.0.0.4:27041,10.0.0.7:27041", "state" : 1 }
active mongoses:
"3.6.3" : 1
autosplit:
Currently enabled: yes
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 5
Last reported error: Could not find host matching read preference { mode: "primary" } for set shardM4C
Time of Reported error: Wed Jul 25 2018 21:44:24 GMT+0000 (UTC)
Migration Results for the last 24 hours:
No recent migrations
databases:
{ "_id" : "config", "primary" : "config", "partitioned" : true }
config.system.sessions
shard key: { "_id" : 1 }
unique: false
balancing: true
chunks:
shardM4C 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shardM4C Timestamp(1, 0)
{ "_id" : "pings", "primary" : "shardM4C", "partitioned" : true }
pings.pings
shard key: { "provider" : 1 }
unique: false
balancing: true
chunks:
shardM4C 1
{ "provider" : { "$minKey" : 1 } } -->> { "provider" : { "$maxKey" : 1 } } on : shardM4C Timestamp(1, 0)
{ "_id" : "test", "primary" : "shardM4C", "partitioned" : false }
And if I run a query in the executionStats I get:
...
"queryPlanner" : {
"mongosPlannerVersion" : 1,
"winningPlan" : {
"stage" : "SINGLE_SHARD",
"shards" : [
{
"shardName" : "shardM4C",
"connectionString" : "shardM4C/<ipVM3>:27041,<ipVM2>:27041",
"serverInfo" : {
"host" : "dpiscaglia-2",
"port" : 27041,
"version" : "3.6.3",
"gitVersion" : "9586e557d54ef70f9ca4b43c26892cd55257e1a5"
},
...
Both results makes me think that the query is actually working on only one shard, and the second is used only for replication and as a backup(which is not what i want), am i correct?
If yes could you please help me understand what I did wrong?
Thank you very much in advance for your help
Have a good day

MongoDB Sharding - why my data isn't being distributed on all shards?

I created a MongoDB Cluster with 3 shards, each shards contain 3 mongod-processes. My Cluster also contains 3 mongos and 3 config servers
In the connection string I put the 3 mongos
mongodb://user:pass#mongos1:27017,mongos2:27017,mongos3:27017/mydatabase
In the picture you can see that dbShard_2 has 1.19GB of data while the others are almost empty with only 4KB. But on the charts you can see that there is also read/write operation on all shards. Is everything fine or did I have made some wrong configurations? shall I worry?
I left the Cloud Manager do the whole configuration for me, I didn't set these by myself manually.
Here you can check my sharding status
mongos> db.printShardingStatus();
--- Sharding Status ---
sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("XXXXX")
}
shards:
{ "_id" : "dbShard_0", "host" : "dbShard_0/dbnode-0.x-app.com:27000,dbnode-1.x-app.com:27000,dbnode-2.x-app.com:27000" }
{ "_id" : "dbShard_1", "host" : "dbShard_1/dbnode-0.x-app.com:27001,dbnode-2.x-app.com:27001,dbnode-2.x-app.com:27002" }
{ "_id" : "dbShard_2", "host" : "dbShard_2/dbnode-0.x-app.com:27002,dbnode-1.x-app.com:27001,dbnode-2.x-app.com:27003" }
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
No recent migrations
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "mydatabase-staging", "partitioned" : false, "primary" : "dbShard_2" }
{ "_id" : "mydatabase", "partitioned" : false, "primary" : "dbShard_2" }
{ "_id" : "test", "partitioned" : false, "primary" : "dbShard_0" }
Cluster

Sharding on MongoDB what about the other collections?

I understand that 'Hash Sharding' can be done on the collection level on a database on based on the key of the collection that is passed.
This ensures that records for that collection are distributed across all the shards.
I understand what happens with one collection. What about the other collections?
Does all the data of all the other tables get stored in one shard only?
Does it get replicated across all the shards?
Does it also get split and spread across all the shards?
The other collections will reside on a single shard (known as the primary shard) unless you decide to shard them also. The primary shard is set at the database level rather than collection, so all non-sharded collections in a particular database will all have the same primary shard. You can see the primary for any given database in the sh.status() output, as per the example below:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("54185b2c2a2835b6e47f7984")
}
shards:
{ "_id" : "shard0000", "host" : "localhost:30000" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "shardTest", "partitioned" : true, "primary" : "shard0000" }
shardTest.foo
shard key: { "_id" : 1 }
chunks:
shard0000 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
{ "_id" : "bar", "partitioned" : true, "primary" : "shard0000" }
bar.data
shard key: { "_id" : 1 }
chunks:
shard0000 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
{ "_id" : "foo", "partitioned" : true, "primary" : "shard0000" }
foo.data
shard key: { "_id" : 1 }
chunks:
shard0000 9
In this example there is only one shard (shard0000), and hence it is the primary for all the databases ("primary" : "shard0000") except config which is a special case (and resides on the config servers). The primary shard for a database is chosen when the database is created.
Hence, if you only had one shard, created all your databases first and then added more shards later, all the databases you created before adding new shards will have their primary set to that first shard (there was nothing else to choose). Any databases created after you have multiple shards could end up with any shard as their primary, essentially it is selected using round robin, but each mongos will have its own idea about where it is in that round robin selection.

MongoDB replicates data to all shards

I have 4 servers in a test environment which I use to test MongoDB replica and distribution:
RepSetA holds RepSetA1 and RepSetA2.
RepSetB holds RepSetB1 and RepSetB2.
All servers act as routers, RepSetA1 acts as a single config server.
I have a "Player" data (10,000 records, the object consists an "Id" and a "Name" fields), and I want it to be sharded (or distributed) between the replica sets, and cloned among the servers in the same replica set. So, just for a plain example:
Player1-5000: Exists in both RepSetA1 and RepSetA2, but not in RepSetB1 and RepSetB2.
Player5000-10000: Exists in both RepSetB1 and RepSetB2, but not in RepSetA1 and RepSetA2.
What I get instead is having all players in all 4 servers.
If I print the sharding status, I get the following:
mongos> db.printShardingStatus();
--- Sharding Status ---
sharding version: { "_id" : 1, "version" : 3 }
shards:
{ "_id" : "RepSetA", "host" : "RepSetA/MongoRepSetA1:27018,MongoRepSetA2:27018" }
{ "_id" : "RepSetB", "host" : "RepSetB/MongoRepSetB1:27018,MongoRepSetB2:27018" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "GamesDB", "partitioned" : true, "primary" : "RepSetA" }
GamesDB.Player chunks:
RepSetA 2
{ "_id" : { $minKey : 1 } } -->> { "_id" : 0 } on : RepSetA { "t" : 1000, "i" : 1 }
{ "_id" : 0 } -->> { "_id" : { $maxKey : 1 } } on : RepSetA { "t" : 1000, "i" : 2 }
{ "_id" : "test", "partitioned" : false, "primary" : "RepSetB" }
{ "_id" : "EOO", "partitioned" : false, "primary" : "RepSetB" }
I used the following commands to build the shards:
db.adminCommand( { addShard : "RepSetA/MongoRepSetA1:27018,MongoRepSetA2:27018" } )
db.adminCommand( { addShard : "RepSetB/MongoRepSetB1:27018,MongoRepSetB2:27018" } )
db.runCommand( { enablesharding : "GamesDB" } );
db.runCommand( { shardcollection : "GamesDB.Player", key : { _id :1 } , unique : true} );
What am I doing wrong?
If you connect through a mongos process to your nodes, it will look like all contain the data. From your output, It doesn't look like all data is available on all nodes. RepSetA holds 2 chunks and RepSetB should contain none. You can verify this by connecting directly with your nodes rather than through mongos.
By the way, if you're using MongoDBs ObjectId as your _id(shard key), consider to shard on another key as this will cause all inserts be made into one node as the key changes monoton.
This is fine. It does not show that all data is on all servers. The output shows that all chunks (data) of GamesDB.Player are on shard RepSetA
GamesDB.Player chunks:
RepSetA 2
{ "_id" : { $minKey : 1 } } -->> { "_id" : 0 } on : RepSetA { "t" : 1000, "i" : 1 }
{ "_id" : 0 } -->> { "_id" : { $maxKey : 1 } } on : RepSetA { "t" : 1000, "i" : 2 }
This means that the balancer has not started to balance your chunks. The balancer only kicks in when there is an 8 chunk difference.
http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-Balancing
You can force balancing by manually splitting chunks (if you want to)
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
Or you can reduce the chunk size if you want to see balancing sooner.
http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations

How to Verify Sharding?

I am trying to shard MongoDB. I am done with Sharding configuration, but I am not sure how to verify if sharding is functional.
How do i check whether my data is get sharded? Is there a query to verify/validate the shards?
You can also execute a simple command on your mongos router :
> use admin
> db.printShardingStatus();
which should output informations about your shards, your sharded dbs and your sharded collection as mentioned in the mongodb documentation
sharding version: { "_id" : 1, "version" : 2 }
shards:
{ "_id" : ObjectId("4bd9ae3e0a2e26420e556876"), "host" : "localhost:30001" }
{ "_id" : ObjectId("4bd9ae420a2e26420e556877"), "host" : "localhost:30002" }
{ "_id" : ObjectId("4bd9ae460a2e26420e556878"), "host" : "localhost:30003" }
databases:
{ "name" : "admin", "partitioned" : false,
"primary" : "localhost:20001",
"_id" : ObjectId("4bd9add2c0302e394c6844b6") }
my chunks
{ "name" : "foo", "partitioned" : true,
"primary" : "localhost:30002",
"sharded" : { "foo.foo" : { "key" : { "_id" : 1 }, "unique" : false } },
"_id" : ObjectId("4bd9ae60c0302e394c6844b7") }
my chunks
foo.foo { "_id" : { $minKey : 1 } } -->> { "_id" : { $maxKey : 1 } }
on : localhost:30002 { "t" : 1272557259000, "i" : 1 }
MongoDB has detailed documentation on Sharding here ...
http://www.mongodb.org/display/DOCS/Sharding+Introduction
To anwser you question (I think), see the portion on the config Servers ...
Each config server has a complete copy
of all chunk information. A two-phase
commit is used to ensure the
consistency of the configuration data
among the config servers.
Basically, it is the config server's job to make sure everything get sharded ... correctly.
Also, there are system collections you can query ...
db.runCommand( { listshards : 1 } );
Lots of help in the prez below too ...
http://www.slideshare.net/mongodb/mongodb-sharding-internals
http://www.10gen.com/video/mongosv2010/sharding
If you just want to check whether you are conencted to a sharded cluster or not:
db.isMaster() can be used to detect that you are connected to a sharding router (mongos).
If db.isMaster().msg is "isdbgrid", you are connected to a sharded instance.
db.isMaster() can be run without authentication.
For checking the details of the shards, sh.status() also works, which has the same output as db.printShardingStatus(); works.