I am trying to understand the following behavior in my sharding setup: as I continuously add data, it only grows on a single shard. How does MongoDB distribute data across different servers? Am I doing this correctly? I'm using MongoDB 2.4.1 on OS X 10.5.
As requested, sh.status() as follows:
mongos> sh.status()
sharding version: {
"_id" : 1,
"version" : 3,
"minCompatibleVersion" : 3,
"currentVersion" : 4,
"clusterId" : ObjectId("52787cc2c10fcbb58607b07f") }
shards:
{ "_id" : "shard0000", "host" : "xx.xx.xx.xxx:xxxxx" }
{ "_id" : "shard0001", "host" : "xx.xx.xx.xxx:xxxxx" }
{ "_id" : "shard0002", "host" : "xx.xx.xx.xxx:xxxxx" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "newdb", "partitioned" : true, "primary" : "shard0001" }
newdb.prov
shard key: { "_id" : 1, "jobID" : 1, "user" : 1 }
chunks:
shard0000 43
shard0001 50
shard0002 43
Looks like you have chosen a very poor shard key. You partitioned along the values of { "_id" : 1, "jobID" : 1, "user" : 1 } - this will not give a good distribution for inserts, because the ObjectId() values you are using for _id are monotonically increasing.
You want to select a shard key that represents how you access the data - it doesn't make sense to have two more fields after _id: since _id is unique, the other two fields will never actually be used to partition the data.
Did you perhaps intend to shard on jobID, user? It's hard to know what the best shard key would be in your case, but it's clear that all the inserts are going into the highest chunk (top value through maxKey) since every new _id is a higher value than the previous one.
Eventually they should be balanced to other shards, but only if the balancer is running, all your config servers are up, and your secondaries are caught up. It is best to pick a better shard key and have inserts distributed evenly across the cluster from the start.
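For example, a hashed _id key (available since 2.4) would spread inserts evenly, or you could use a compound key on the fields you actually query by. A rough sketch (newdb.prov2 is a hypothetical collection name, since an existing collection's shard key cannot be changed - you would have to dump and restore into a freshly sharded collection):
mongos> sh.shardCollection("newdb.prov2", { "_id" : "hashed" })
mongos> // or, if jobID and user reflect your access pattern:
mongos> sh.shardCollection("newdb.prov2", { "jobID" : 1, "user" : 1 })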
A MongoDB sharded cluster uses a "primary shard" to hold collection data in databases in which sharding has been enabled (with sh.enableSharding()) but the collection itself has not yet been sharded (with sh.shardCollection()). The mongos process chooses the primary shard automatically, unless the user states it explicitly as a parameter of sh.enableSharding().
However, what happens in DBs where sh.enableSharding() has not been executed yet? Is there some "global primary" for these cases? How can I know which one it is? sh.status() doesn't show information about it...
I'm using MongoDB 4.2 version.
Thanks!
The documentation says:
The mongos selects the primary shard when creating a new database by picking the shard in the cluster that has the least amount of data.
If enableSharding is called on a database which already exists, the above quote would define the location of the database prior to sharding being enabled on it.
sh.status() shows where the database is stored:
MongoDB Enterprise mongos> use foo
switched to db foo
MongoDB Enterprise mongos> db.foo.insert({a:1})
WriteResult({ "nInserted" : 1 })
MongoDB Enterprise mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("5eade78756d7ba8d40fc4317")
}
shards:
{ "_id" : "shard01", "host" : "shard01/localhost:14442,localhost:14443", "state" : 1 }
{ "_id" : "shard02", "host" : "shard02/localhost:14444,localhost:14445", "state" : 1 }
active mongoses:
"4.3.6" : 2
autosplit:
Currently enabled: yes
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
No recent migrations
databases:
{ "_id" : "config", "primary" : "config", "partitioned" : true }
{ "_id" : "foo", "primary" : "shard02", "partitioned" : false, "version" : { "uuid" : UUID("ff618243-f4b9-4607-8f79-3075d14d737d"), "lastMod" : 1 } }
{ "_id" : "test", "primary" : "shard01", "partitioned" : false, "version" : { "uuid" : UUID("4d76cf84-4697-4e8c-82f8-a0cfad87be80"), "lastMod" : 1 } }
foo is not partitioned and stored in shard02.
If enableSharding is called on a database which doesn't yet exist, the database is created and, if a primary shard is specified, that shard is used as the primary shard.
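If you want to control the primary shard rather than rely on the automatic choice, a sketch (the second argument to sh.enableSharding() is available starting in MongoDB 4.2.2; "mydb" is a hypothetical database name, and the shard names come from the output above):
mongos> sh.enableSharding("mydb", "shard01")                    // pin the primary shard when enabling sharding (4.2.2+)
mongos> db.adminCommand({ movePrimary: "foo", to: "shard01" })  // move an existing database's primary shard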
I am designing a MongoDB collection to store daily volume statistics.
Here is my document schema:
mongos> db.arq.findOne()
{
"_id" : ObjectId("553b78637e6962c36d67c728"),
"ip" : NumberLong(635860665),
"ts" : ISODate("2015-04-25T00:00:00Z"),
"values" : {
"07" : 2,
"12" : 1
},
"daily_ct" : 5
}
mongos>
And here are my indexes:
mongos> db.arq.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "Query_Volume.test"
},
{
"v" : 1,
"key" : {
"ip" : 1
},
"name" : "ip_1",
"ns" : "Query_Volume.test"
},
{
"v" : 1,
"key" : {
"ts" : 1
},
"name" : "ts_1",
"expireAfterSeconds" : 15552000,
"ns" : "Query_Volume.test"
}
]
mongos>
Note: I have a timestamp index since I need to use the TTL mechanism.
Do you have any suggestion for the shard key?
You have multiple options:
{ts: 1} Your timestamp. Data for adjacent time ranges will be located together, but the key is monotonically increasing, and I'm not sure whether the TTL index will clean up chunks evenly across shards. It means the write load switches from shard to shard: one shard takes a heavy write load while the other shards get no writes for new data. This pattern works nicely if you query contiguous time ranges, but it has downsides for writes.
{ts: "hashed"} Hash-based sharding. The data will be sharded more or less evenly across the shards. Hash-based sharding distributes the write load, but querying a range of data involves (more or less) all shards.
You will need to test what fits your reads and writes best. The shard key depends on the data structure and the read/write patterns of your application.
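A minimal sketch of the hashed option, assuming the Query_Volume.test namespace shown in the getIndexes() output above (if the collection is empty, shardCollection creates the hashed index for you, and it can coexist with the TTL index on { ts: 1 }):
mongos> sh.enableSharding("Query_Volume")
mongos> sh.shardCollection("Query_Volume.test", { "ts" : "hashed" })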
Background Information
I have the following shard cluster defined:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("547496dd009cc54d845c2ff1")
}
shards:
{ "_id" : "jjrs0", "host" : "jjrs0/mongohost1:27017" }
{ "_id" : "jjrs1", "host" : "jjrs1/mongohost2:27017" }
{ "_id" : "jjrs2", "host" : "jjrs2/mongohost3:27017" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "rtables", "partitioned" : true, "primary" : "jjrs1" }
rtables.widgets
shard key: { "location" : 1, "name" : 1 }
chunks:
jjrs1 1
{ "location" : { "$minKey" : 1 }, "name" : { "$minKey" : 1 } } -->> { "location" : { "$maxKey" : 1 }, "name" : { "$maxKey" : 1 } } on : jjrs1 Timestamp(1, 0)
{ "_id" : "test", "partitioned" : false, "primary" : "jjrs0" }
mongos>
I have 3 replicasets (each with just a primary for testing purposes).
I've defined "location" as being my shard key, where location will contain values like "CAN" for "Canada" and "USA" for "United States".
I am hoping to have a different location on different shards.
Shard Key
I've made what I *think* is a compound key. The location and name fields make up the key. A widget's name will always be unique.
What the Data / CSV Files Look Like
The Canada CSV files look like this:
location,name,rt_id,type
canada,can-widget111,123,the best widget
canada,can-widget222,1,the next best widget
The USA CSV looks like:
location,name,rt_id,type
usa,usa-widget1,24,test widget
usa,usa-widget2,25,widget widget
Problem
Either I'm misunderstanding how shard keys work, or I've set something up incorrectly... or maybe I'm not validating my data properly. In any case, here's what's happening:
I've imported all Canadian records into the primary shard for the "rtables" database, "jjrs1". This is the command I ran:
mongoimport -h mongohost2 --port 27017 -d rtables -c widgets --type csv /tmp/canada_rtables.csv --headerline
I started the mongo shell on this host and then checked the number of records in the rtables.widgets collection; it matches the number of records imported. Let's say 50.
Then I imported all records for the United States by running a similar command, like so:
mongoimport -h mongohost2 --port 27017 -d rtables -c widgets --type csv /tmp/usa_rtables.csv --headerline
I checked the records in the "primary" replica set jjrs1 and it had the new records in the collection.
The rtables databases on the other two replica sets (jjrs0 and jjrs2) are empty, as you can see below:
jjrs2:PRIMARY> use rtables
switched to db rtables
jjrs2:PRIMARY> show databases
admin (empty)
local 1.078GB
rtables (empty)
jjrs2:PRIMARY>
and
jjrs0:PRIMARY> show databases
admin (empty)
local 1.078GB
rtables (empty)
test 0.078GB
jjrs0:PRIMARY>
Questions
Am I correct in assuming that the data will be divided in such a way where all the Canadian content will be in one replicaset and the USA data in another?
If so, am I testing properly?
If my assumption is incorrect, can you please explain how the division of data is supposed to occur based on the shard key I've defined?
I've found the following post that might be related, but I haven't been able to apply the answer to my own questions:
Mongo sharding fails to split large collection between shards
I'm still mulling it over.
Thanks
EDIT 1
I'm thinking that maybe I should use a tag as a shard key?
Maybe something like this:
mongos> sh.addShardTag("jjrs1", "CAN")
mongos> sh.addShardTag("jjrs1", "USA")
mongos> sh.addShardTag("jjrs0", "JPN")
mongos> sh.addShardTag("jjrs2", "IND")
mongos> sh.addShardTag("jjrs2", "TAI")
mongos> sh.addShardTag("jjrs1", "VET")
mongos> sh.status()
The next question would be how to then associate every record that has "can" in the location field with jjrs1...
First of all:
mongoimport -h mongohost2 ...
Here is your first mistake. You are importing directly into a "regular" mongod host. You should have used a mongos for imports; the mongos holds the routing logic (which chunks reside on which shard).
Also, take a look at the following:
chunks:
jjrs1 1
{"location" : {"$minKey" : 1}, "name" : {"$minKey" : 1} } -->>
{"location" : {"$maxKey" : 1}, "name" : {"$maxKey" : 1} } on : jjrs1 Timestamp(1, 0)
This can be read as: "for any location and name values, the document should be routed to jjrs1."
This is no wonder. A chunk is a range of documents whose shard key values fall between certain bounds, and the default chunk size is 64MB. If the average document size is 64KB, for example, you will have to insert more than 1000 documents (say 500+ for the US and 500+ for Canada) before the chunk splits into two.
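As a side note, if you just want to see splits and migrations with a small test data set, you can lower the chunk size. A sketch, run against a mongos (this is a cluster-wide setting stored in the config database, so it only makes sense on a test cluster):
mongos> use config
mongos> db.settings.save({ _id : "chunksize", value : 1 })   // chunk size in MB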
Once the chunk does split, sh.status() may resemble something like:
chunks:
jjrs0 0
{"location" : {"$minKey" : 1}, "name" : {"$minKey" : 1} } -->>
{"location" : "canada", "name" : {"$maxKey" : 1} } on : jjrs0 Timestamp(1, 0)
jjrs1 1
{"location" : {"usa" : 1}, "name" : {"$minKey" : 1} } -->>
{"location" : {"$maxKey" : 1}, "name" : {"$maxKey" : 1} } on : jjrs1 Timestamp(1, 0)
I'm thinking that maybe I should use a tag as a shard key? Maybe something like this:
Well this is not a shard key, but a mechanism to associate tags with shards. This is a step in the right direction, but not enough.
For example, in the (imaginary) sh.status() above, once the chunk split into two, the "canadian" chunk was migrated to jjrs0. This may not be the desired behavior, assuming you want to route the "american" documents to jjrs0.
Why does it happen? Because you've missed declaring tag ranges. Without these the shard tags are just static cluster metadata.
You will also have to configure tag ranges as follows:
sh.addTagRange( "rtables.widgets",
{ country: "usa", name: MinKey }, { country: "usa", name: MaxKey }, "USA")
sh.addTagRange( "rtables.widgets",
{ country: "canada", name: MinKey }, { country: "canada", name: MaxKey }, "CAN")
To wrap it up, in order for the american documents to be on jjrs0 shard, and Canadian to be on jjrs1, do the following:
Optionally delete all widget documents from jjrs1.
Declare the shard tags as in the edit section of your post (for the sake of testing, CAN and USA are enough).
Declare the tag ranges as above.
Use a mongos for import.
Comment: for the rest of the countries - where no tags have been declared - the balancer will simply try to balance the cluster, meaning chunks will be routed/migrated more or less arbitrarily in order to even out the number of chunks among shards.
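Putting those steps together, a rough sketch (the mongos hostname is a placeholder since the question doesn't say where the mongos runs; the tags and ranges match the USA-on-jjrs0, Canada-on-jjrs1 layout described above):
mongos> sh.addShardTag("jjrs0", "USA")
mongos> sh.addShardTag("jjrs1", "CAN")
mongos> sh.addTagRange("rtables.widgets", { location: "usa", name: MinKey }, { location: "usa", name: MaxKey }, "USA")
mongos> sh.addTagRange("rtables.widgets", { location: "canada", name: MinKey }, { location: "canada", name: MaxKey }, "CAN")
and then import through the mongos rather than a shard:
mongoimport -h <mongos-host> --port 27017 -d rtables -c widgets --type csv /tmp/usa_rtables.csv --headerline
mongoimport -h <mongos-host> --port 27017 -d rtables -c widgets --type csv /tmp/canada_rtables.csv --headerline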
I understand that 'hash sharding' can be done at the collection level in a database, based on a key of the collection that is passed in.
This ensures that records for that collection are distributed across all the shards.
I understand what happens with one collection. What about the other collections?
Does all the data of all the other tables get stored in one shard only?
Does it get replicated across all the shards?
Does it also get split and spread across all the shards?
The other collections will reside on a single shard (known as the primary shard) unless you decide to shard them also. The primary shard is set at the database level rather than the collection level, so all non-sharded collections in a particular database share the same primary shard. You can see the primary for any given database in the sh.status() output, as per the example below:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("54185b2c2a2835b6e47f7984")
}
shards:
{ "_id" : "shard0000", "host" : "localhost:30000" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "shardTest", "partitioned" : true, "primary" : "shard0000" }
shardTest.foo
shard key: { "_id" : 1 }
chunks:
shard0000 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
{ "_id" : "bar", "partitioned" : true, "primary" : "shard0000" }
bar.data
shard key: { "_id" : 1 }
chunks:
shard0000 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
{ "_id" : "foo", "partitioned" : true, "primary" : "shard0000" }
foo.data
shard key: { "_id" : 1 }
chunks:
shard0000 9
In this example there is only one shard (shard0000), and hence it is the primary for all the databases ("primary" : "shard0000") except config which is a special case (and resides on the config servers). The primary shard for a database is chosen when the database is created.
Hence, if you only had one shard, created all your databases first and then added more shards later, all the databases you created before adding new shards will have their primary set to that first shard (there was nothing else to choose). Any databases created after you have multiple shards could end up with any shard as their primary; essentially it is selected round robin, but each mongos has its own idea about where it is in that round-robin rotation.
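If you later want to distribute one of those other collections too, or move a database's unsharded collections to a different shard, a sketch (otherColl is a hypothetical collection and shard0001 a hypothetical second shard; { "_id" : "hashed" } is just one possible key choice):
mongos> sh.shardCollection("shardTest.otherColl", { "_id" : "hashed" })
mongos> db.adminCommand({ movePrimary: "shardTest", to: "shard0001" })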
I'm having problems with what seems to be a simple sharding setup in mongo.
I have two shards, a single mongos instance, and a single config server set up like this:
Machine A - 10.0.44.16 - config server, mongos
Machine B - 10.0.44.10 - shard 1
Machine C - 10.0.44.11 - shard 2
I have a collection called 'Seeds' that has a shard key 'SeedType', a field that is present on every document in the collection and contains one of four values (take a look at the sharding status below). Two of the values have significantly more entries than the other two (two of them have 784,000 records each, and the other two have about 5,000 each).
The behavior I'm expecting to see is that records in the 'Seeds' collection with InventoryPOS will end up on one shard, and the ones with InventoryOnHand will end up on the other.
However, it seems that all records for both of the two larger shard key values end up on the primary shard.
Here's my sharding status text (other collections removed for clarity):
--- Sharding Status ---
sharding version: { "_id" : 1, "version" : 3 }
shards:
{ "_id" : "shard0000", "host" : "10.44.0.11:27019" }
{ "_id" : "shard0001", "host" : "10.44.0.10:27017" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "TimMulti", "partitioned" : true, "primary" : "shard0001" }
TimMulti.Seeds chunks:
{ "SeedType" : { $minKey : 1 } } -->> { "SeedType" : "PBI.AnalyticsServer.KPI" } on : shard0000 { "t" : 2000, "i" : 0 }
{ "SeedType" : "PBI.AnalyticsServer.KPI" } -->> { "SeedType" : "PBI.Retail.InventoryOnHand" } on : shard0001 { "t" : 2000, "i" : 7 }
{ "SeedType" : "PBI.Retail.InventoryOnHand" } -->> { "SeedType" : "PBI.Retail.InventoryPOS" } on : shard0001 { "t" : 2000, "i" : 8 }
{ "SeedType" : "PBI.Retail.InventoryPOS" } -->> { "SeedType" : "PBI.Retail.SKU" } on : shard0001 { "t" : 2000, "i" : 9 }
{ "SeedType" : "PBI.Retail.SKU" } -->> { "SeedType" : { $maxKey : 1 } } on : shard0001 { "t" : 2000, "i" : 10 }
Am I doing anything wrong?
Semi-unrelated question:
What is the best way to atomically transfer an object from one collection to another without blocking the entire mongo service?
Thanks in advance,
-Tim
Sharding really isn't meant to be used this way. You should choose a shard key with some variation (or make a compound shard key) so that MongoDB can make reasonable-size chunks. One of the points of sharding is that your application doesn't have to know where your data is.
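For example, a compound key that starts with SeedType but adds _id would give MongoDB split points within each SeedType value. A rough sketch (the shard key can't be changed on an existing collection, so the collection would have to be dropped or dumped and restored first):
mongos> sh.shardCollection("TimMulti.Seeds", { "SeedType" : 1, "_id" : 1 })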
If you want to manually shard, you should do that: start unlinked MongoDB servers and route things yourself from the client side.
Finally, if you're really dedicated to this setup, you could migrate the chunk yourself (there's a moveChunk command).
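A sketch of moving a chunk by hand (run against the mongos; the find document just has to fall inside the chunk you want moved, and the names come from the sh.status() output above):
mongos> db.adminCommand({ moveChunk: "TimMulti.Seeds", find: { SeedType: "PBI.Retail.InventoryPOS" }, to: "shard0000" })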
The balancer moves chunks based on how much is mapped in memory (run serverStatus and look at the "mapped" field). It can take a while; MongoDB doesn't want your data flying all over the place in production, so it's pretty conservative.
Semi-unrelated answer: you can't do it atomically with sharding (eval isn't atomic across multiple servers). You'll have to do a findOne, insert, remove.
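A sketch of that pattern in the shell (SeedsArchive is a hypothetical target collection; because this is not atomic, a crash between the insert and the remove can leave the document in both collections, so the reading side should tolerate duplicates or you should clean up on restart):
mongos> var doc = db.Seeds.findOne({ _id: ObjectId("...") })   // _id of the document to move (placeholder)
mongos> if (doc) { db.SeedsArchive.insert(doc); db.Seeds.remove({ _id: doc._id }, true) }   // true = remove just one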