Get shard id for a given value of the sharding key in MongoDB

I have two replica sets in a sharded cluster with documents sharded by the userId field.
Is there any way to query which shard (replica set) contains a given document (by _id or by the shard key field) without reimplementing the shard key hashing on the client side?

You can use explain() on a query to identify the shard for a document, by querying on the shard key.
The winning plan should have a SINGLE_SHARD stage with an equality query on the shard key, similar to the following (with some extra output trimmed for clarity):
> db.users.find({userId:123}).explain().queryPlanner.winningPlan
{
    "stage" : "SINGLE_SHARD",
    "shards" : [
        {
            "shardName" : "shard01",
            "plannerVersion" : 1,
            "namespace" : "test.users",
            "indexFilterSet" : false,
            "parsedQuery" : {
                "userId" : {
                    "$eq" : 123
                }
            },
        }
    ]
}
If you only want the shard name, you can use JavaScript notation to reference the full path:
> db.users.find({userId:123}).explain().queryPlanner.winningPlan.shards[0].shardName
shard01
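If you need this lookup often, you could wrap it in a small shell helper. This is only a sketch (getShardForUser is my own name, not a built-in), and it assumes an equality match on the shard key; for broadcast queries the winning plan is not SINGLE_SHARD and several shard names come back:
// Hypothetical helper: returns the shard name(s) the query is routed to,
// based on the explain() output shown above.
function getShardForUser(userId) {
    var plan = db.users.find({ userId: userId }).explain().queryPlanner.winningPlan;
    if (plan.stage === "SINGLE_SHARD") {
        return plan.shards[0].shardName;
    }
    // Scatter-gather plans (e.g. SHARD_MERGE) list one entry per shard involved.
    return plan.shards.map(function (s) { return s.shardName; });
}
> getShardForUser(123)
shard01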

Related

Mongo DB, document count mismatch for a collection

I have a collection User in mongo. When I do a count on this collection I got 13204951 documents
> db.User.count()
13204951
But when I tried to find the count of non-stale documents like this I got a count of 13208778
> db.User.find({"_id": {$exists: true, $ne: null}}).count()
13208778
> db.User.find({"UserId": {$exists: true, $ne: null}}).count()
13208778
I even tried to get the count of this collection using MongoEngine
user_list = set(User.objects().values_list('UserId'))
len(user_list)
13208778
Here are the indexes of this User collection
> db.User.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "user_db.User"
    },
    {
        "v" : 1,
        "unique" : true,
        "key" : {
            "UserId" : 1
        },
        "name" : "UserId_1",
        "ns" : "user_db.User",
        "sparse" : false,
        "background" : true
    }
]
Any pointers on how to debug the mismatch in counts from different queries?
Refer to this document:
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
Also, refer to this question. If you are not using a sharded cluster, you can refer to this question instead.
The basic idea is that db.{collection}.count() may take shortcuts to return a count quickly (for example, reading it from collection metadata), so it can be inaccurate; a count() with a query predicate has to evaluate the matching documents and should be accurate.
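A couple of cross-checks you can run from the shell (a sketch using the user_db.User namespace from the getIndexes() output; cleanupOrphaned only applies if this collection actually lives on a sharded cluster, and it has to be run against each shard's primary rather than through mongos):
// Count via the aggregation pipeline: it evaluates documents instead of
// trusting collection metadata, so it should agree with the query-based counts.
> db.User.aggregate([ { $group: { _id: null, n: { $sum: 1 } } } ])
// On a sharded cluster, remove orphaned documents left behind by migrations.
// Run on each shard's primary; it cleans one chunk range per call and returns
// a stoppedAtKey to continue from, so it may need to be repeated.
> db.adminCommand({ cleanupOrphaned: "user_db.User" })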

What is the best Mongodb Sharding key for my schema?

I am designing a MongoDB collection that saves daily volume statistics.
Here is my DB schema:
mongos> db.arq.findOne()
{
    "_id" : ObjectId("553b78637e6962c36d67c728"),
    "ip" : NumberLong(635860665),
    "ts" : ISODate("2015-04-25T00:00:00Z"),
    "values" : {
        "07" : 2,
        "12" : 1
    },
    "daily_ct" : 5
}
mongos>
And here are my indexes:
mongos> db.arq.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "Query_Volume.test"
    },
    {
        "v" : 1,
        "key" : {
            "ip" : 1
        },
        "name" : "ip_1",
        "ns" : "Query_Volume.test"
    },
    {
        "v" : 1,
        "key" : {
            "ts" : 1
        },
        "name" : "ts_1",
        "expireAfterSeconds" : 15552000,
        "ns" : "Query_Volume.test"
    }
]
mongos>
Note: I have a timestamp index since I need to use the TTL mechanism.
But do you have any suggestion for the shard key?
You have multiple options:
{ts: 1} Your timestamp. Data for a given time range will be located together, but the key is monotonically increasing, and I'm not sure whether the TTL index will clean up shard chunks. This means the write load switches from shard to shard: at any time one shard takes a high write load while the other shards receive no writes for new data. This pattern works nicely if you query contiguous time ranges but has downsides in writing.
{ts: "hashed"} Hash-based sharding. The data will be sharded more or less evenly across the shards. Hash-based sharding distributes the write load but involves all shards (more or less) when querying for data.
You will need to test what fits best for your reads and writes. The shard key depends on the data structure and the read/write patterns of your application.
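For reference, this is roughly how each option is declared (a sketch assuming the Query_Volume.test namespace from your getIndexes() output; a collection can only be sharded once, so you would pick one of the two):
// Option 1: range-based shard key on the timestamp.
sh.enableSharding("Query_Volume")
sh.shardCollection("Query_Volume.test", { ts: 1 })
// Option 2: hash-based shard key on the timestamp (on a fresh collection).
sh.shardCollection("Query_Volume.test", { ts: "hashed" })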

shard cluster doesn't seem to be dividing data based on keys on mongoDB prototype

Background Information
I have the following shard cluster defined:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
    "_id" : 1,
    "version" : 4,
    "minCompatibleVersion" : 4,
    "currentVersion" : 5,
    "clusterId" : ObjectId("547496dd009cc54d845c2ff1")
}
shards:
    { "_id" : "jjrs0", "host" : "jjrs0/mongohost1:27017" }
    { "_id" : "jjrs1", "host" : "jjrs1/mongohost2:27017" }
    { "_id" : "jjrs2", "host" : "jjrs2/mongohost3:27017" }
databases:
    { "_id" : "admin", "partitioned" : false, "primary" : "config" }
    { "_id" : "rtables", "partitioned" : true, "primary" : "jjrs1" }
        rtables.widgets
            shard key: { "location" : 1, "name" : 1 }
            chunks:
                jjrs1  1
                { "location" : { "$minKey" : 1 }, "name" : { "$minKey" : 1 } } -->> { "location" : { "$maxKey" : 1 }, "name" : { "$maxKey" : 1 } } on : jjrs1 Timestamp(1, 0)
    { "_id" : "test", "partitioned" : false, "primary" : "jjrs0" }
mongos>
I have 3 replicasets (each with just a primary for testing purposes).
I've defined "location" as being my shard key, where location will contain values like "CAN" for "Canada" and "USA" for "United States".
I am hoping to have a different location on different shards.
Shard Key
I've made what I *think* is a compound key. The location and name fields make up the key. A widget's name will always be unique.
What the Data / CSV Files Look Like
The Canada CSV files look like this:
location,name,rt_id,type
canada,can-widget111,123,the best widget
canada,can-widget222,1,the next best widget
The USA CSV looks like:
location,name,rt_id,type
usa,usa-widget1,24,test widget
usa,usa-widget2,25,widget widget
Problem
Either I'm misunderstanding how shard keys work, or I've set something up incorrectly... or maybe I'm not validating my data properly. In any case, here's what's happening:
I've imported all canada records into the primary shard for the "rtables" database "jjrs1". This is the command I ran:
mongoimport -h mongohost2 --port 27017 -d rtables -c widgets --type csv /tmp/canada_rtables.csv --headerline
I started mongo on this host and then checked the number of records in the rtables.widgets collection; it matches the number of records imported. Let's say 50.
Then I imported all records for the United States by running a similar command:
mongoimport -h mongohost2 --port 27017 -d rtables -c widgets --type csv /tmp/usa_rtables.csv --headerline
I checked the records in the "primary" replica set jjrs1 and it had the new records in the collection.
The other two databases on the other replica sets (jjrs0 and jjrs2) are empty. In fact, the rtables databases on both servers are empty, as you can see below:
jjrs2:PRIMARY> use rtables
switched to db rtables
jjrs2:PRIMARY> show databases
admin (empty)
local 1.078GB
rtables (empty)
jjrs2:PRIMARY>
and
jjrs0:PRIMARY> show databases
admin (empty)
local 1.078GB
rtables (empty)
test 0.078GB
jjrs0:PRIMARY>
Questions
Am I correct in assuming that the data will be divided in such a way that all the Canadian content will be in one replica set and the USA data in another?
If so, am I testing properly?
If my assumption is incorrect, can you please explain how the division of data is supposed to occur based on the shard key I've defined?
I've found the following post that might be related, but I haven't been able to apply the answer to my own questions:
Mongo sharding fails to split large collection between shards
I'm still mulling it over.
Thanks
EDIT 1
I'm thinking that maybe I should use a tag as a shard key?
Maybe something like this:
mongos> sh.addShardTag("jjrs1", "CAN")
mongos> sh.addShardTag("jjrs1", "USA")
mongos> sh.addShardTag("jjrs0", "JPN")
mongos> sh.addShardTag("jjrs2", "IND")
mongos> sh.addShardTag("jjrs2", "TAI")
mongos> sh.addShardTag("jjrs1", "VET")
mongos> sh.status()
The next question would be how to then associate every record that has "can" in the location field with jjrs1...
First of all:
mongoimport -h mongohost2 ...
Here is your first mistake. You are importing directly into a "regular" mongod host. You should have used mongos for imports, since it holds the routing logic (which chunks reside on which shard).
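In practice that means pointing mongoimport at the mongos router instead of at a shard member, e.g. (the mongos host below is a placeholder for wherever your router runs):
mongoimport -h mongoshost --port 27017 -d rtables -c widgets --type csv /tmp/canada_rtables.csv --headerline
mongoimport -h mongoshost --port 27017 -d rtables -c widgets --type csv /tmp/usa_rtables.csv --headerline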
Also, take a look at the following:
chunks:
    jjrs1  1
    { "location" : { "$minKey" : 1 }, "name" : { "$minKey" : 1 } } -->>
    { "location" : { "$maxKey" : 1 }, "name" : { "$maxKey" : 1 } } on : jjrs1 Timestamp(1, 0)
This can be read as: "for every value of location and name, the document is routed to jjrs1."
This is no wonder. A chunk is a sequence of documents whose shard key values fall within a certain range, and each chunk's default size is 64 MB. If the average document size is 64 KB, for example, you will have to insert more than 1000 documents (say 500+ for the USA and 500+ for Canada) before the chunk splits into two.
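For a small test data set you don't have to wait for 64 MB of inserts, by the way; you can pre-split at the boundary yourself. This is only a sketch using the sh.splitAt() helper, and the split point is my choice, not something your setup already has:
// Split the single chunk so that "canada" and "usa" documents end up in separate chunks.
sh.splitAt("rtables.widgets", { location: "usa", name: MinKey })
Either way, once the collection has more than one chunk, the balancer can start moving chunks between shards.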
Then, sh.status() may resemble something like:
chunks:
    jjrs0  1
    { "location" : { "$minKey" : 1 }, "name" : { "$minKey" : 1 } } -->>
    { "location" : "usa", "name" : { "$minKey" : 1 } } on : jjrs0 Timestamp(1, 0)
    jjrs1  1
    { "location" : "usa", "name" : { "$minKey" : 1 } } -->>
    { "location" : { "$maxKey" : 1 }, "name" : { "$maxKey" : 1 } } on : jjrs1 Timestamp(1, 0)
I'm thinking that maybe I should use a tag as a shard key? Maybe something like this:
Well this is not a shard key, but a mechanism to associate tags with shards. This is a step in the right direction, but not enough.
For example, using the (imaginary) sh.status() above, once the chunk has split into two, the "canadian" chunk may be migrated to jjrs0. This may not be the desired behavior, assuming you want to route the "american" documents to jjrs0.
Why does it happen? Because you've missed declaring tag ranges. Without these, the shard tags are just static cluster metadata.
You will also have to configure tag ranges as follows (note that the range fields must match the shard key, i.e. location, not country):
sh.addTagRange("rtables.widgets",
    { location: "usa", name: MinKey }, { location: "usa", name: MaxKey }, "USA")
sh.addTagRange("rtables.widgets",
    { location: "canada", name: MinKey }, { location: "canada", name: MaxKey }, "CAN")
To wrap it up, in order for the American documents to be on the jjrs0 shard and the Canadian ones on jjrs1, do the following:
Optionally delete all widget documents from jjrs1.
Declare the shard tags as in the edit section of your post (for sake of testing, CAN and USA are enough).
Declare the tag ranges as above.
Use a mongos for import.
Comment: for the rest of the countries, where tags have not been declared, an attempt will be made to balance the cluster, meaning chunks will be routed / migrated randomly in order to balance the number of chunks among shards.
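Once you have re-imported through mongos, you can check where the documents actually ended up. A sketch, run from the mongos shell:
use rtables
// Documents and chunks of rtables.widgets per shard:
db.widgets.getShardDistribution()
// Spot-check which shard a query on each location is routed to:
db.widgets.find({ location: "canada" }).explain().queryPlanner.winningPlan.shards[0].shardName
db.widgets.find({ location: "usa" }).explain().queryPlanner.winningPlan.shards[0].shardName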

Sharding on MongoDB what about the other collections?

I understand that 'hash sharding' can be done at the collection level in a database, based on the key of the collection that is passed.
This ensures that records for that collection are distributed across all the shards.
I understand what happens with one collection. What about the other collections?
Does all the data of all the other tables get stored in one shard only?
Does it get replicated across all the shards?
Does it also get split and spread across all the shards?
The other collections will reside on a single shard (known as the primary shard) unless you decide to shard them also. The primary shard is set at the database level rather than the collection level, so all non-sharded collections in a particular database will have the same primary shard. You can see the primary for any given database in the sh.status() output, as per the example below:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
    "_id" : 1,
    "version" : 4,
    "minCompatibleVersion" : 4,
    "currentVersion" : 5,
    "clusterId" : ObjectId("54185b2c2a2835b6e47f7984")
}
shards:
    { "_id" : "shard0000", "host" : "localhost:30000" }
databases:
    { "_id" : "admin", "partitioned" : false, "primary" : "config" }
    { "_id" : "shardTest", "partitioned" : true, "primary" : "shard0000" }
        shardTest.foo
            shard key: { "_id" : 1 }
            chunks:
                shard0000  1
                { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
    { "_id" : "bar", "partitioned" : true, "primary" : "shard0000" }
        bar.data
            shard key: { "_id" : 1 }
            chunks:
                shard0000  1
                { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
    { "_id" : "foo", "partitioned" : true, "primary" : "shard0000" }
        foo.data
            shard key: { "_id" : 1 }
            chunks:
                shard0000  9
In this example there is only one shard (shard0000), and hence it is the primary for all the databases ("primary" : "shard0000") except config which is a special case (and resides on the config servers). The primary shard for a database is chosen when the database is created.
Hence, if you only had one shard, created all your databases first, and then added more shards later, all the databases created before adding the new shards will have their primary set to that first shard (there was nothing else to choose). Databases created after you have multiple shards could end up with any shard as their primary; essentially it is selected using round robin, but each mongos has its own idea of where it is in that round-robin selection.
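If you later decide that a particular database's non-sharded collections should live on a different shard, the primary can be changed explicitly. A sketch only: movePrimary is issued through mongos, and both the database name "bar" and the target shard "shard0001" here are examples rather than values from the output above:
// Move the primary shard (and with it all non-sharded collections) of the "bar" database:
db.adminCommand({ movePrimary: "bar", to: "shard0001" })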

MongoDB Sharding Policy

I am trying to understand the following behavior displayed by my sharding setup. The data seems to only increase on a single shard as I continuously add data. How does MongoDB shard or distribute data across different servers? Am I doing this correctly? MongoDB version 2.4.1 used here on OS X 10.5.
As requested, sh.status() as follows:
mongos> sh.status()
sharding version: {
    "_id" : 1,
    "version" : 3,
    "minCompatibleVersion" : 3,
    "currentVersion" : 4,
    "clusterId" : ObjectId("52787cc2c10fcbb58607b07f")
}
shards:
    { "_id" : "shard0000", "host" : "xx.xx.xx.xxx:xxxxx" }
    { "_id" : "shard0001", "host" : "xx.xx.xx.xxx:xxxxx" }
    { "_id" : "shard0002", "host" : "xx.xx.xx.xxx:xxxxx" }
databases:
    { "_id" : "admin", "partitioned" : false, "primary" : "config" }
    { "_id" : "newdb", "partitioned" : true, "primary" : "shard0001" }
        newdb.prov
            shard key: { "_id" : 1, "jobID" : 1, "user" : 1 }
            chunks:
                shard0000  43
                shard0001  50
                shard0002  43
It looks like you have chosen a very poor shard key. You partitioned along the values of { "_id" : 1, "jobID" : 1, "user" : 1 }, which will not give a good distribution for inserts, since _id values are monotonically increasing (you are using ObjectId() values for _id).
You want to select a shard key that represents how you access the data. It doesn't make sense to have two more fields after _id: since _id is unique, the other two fields will never be used to partition the data.
Did you perhaps intend to shard on jobID, user? It's hard to know what the best shard key would be in your case, but it's clear that all the inserts are going into the highest chunk (top value through MaxKey), since every new _id is higher than the previous one.
Eventually they should be balanced to other shards, but only if the balancer is running, all your config servers are up, and the secondaries are caught up. It's best to pick a better shard key and have inserts distributed evenly across the cluster from the start.
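Since you cannot change the shard key of an existing collection in place (certainly not on 2.4), the usual path is to dump the data and reload it into a collection sharded on a better key. A sketch only; prov_v2 is a hypothetical new collection, and which key is right depends on your queries:
// Hashed _id spreads inserts evenly across shards:
sh.shardCollection("newdb.prov_v2", { _id: "hashed" })
// ...or shard on the fields you actually query by (a collection gets exactly one shard key):
sh.shardCollection("newdb.prov_v2", { jobID: 1, user: 1 })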