How to find the range of data on a MongoDB shard - mongodb

When sharding a MongoDB collection using range partitioning and letting MongoDB handle balancing, how can I determine the range of data stored on each shard (without querying)?
For example, if partitioning on a string field that took values "A"-"Z", how could I find what letters are being stored on each node?
I need auto-balancing, so I cannot use zones/tagging either.

Check whether db.printShardingStatus() or sh.status() gives you what you need. Both accept a boolean parameter that turns on verbose output, which lists every chunk with its shard key range and the shard that holds it.
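As a rough illustration of what that output conveys: each chunk is recorded in the config database's chunks collection as a document with min, max and shard fields. The sketch below is plain JavaScript that groups such documents by shard. The sample chunk documents, the shard names and the letter shard key are made up for illustration; on a real cluster the array would come from the cluster metadata rather than a literal, and the first and last chunks would be bounded by MinKey and MaxKey.

```javascript
// Hypothetical chunk metadata, shaped like documents in config.chunks
// (only the min, max and shard fields are used here).
const sampleChunks = [
  { min: { letter: "A" }, max: { letter: "H" }, shard: "shard0000" },
  { min: { letter: "H" }, max: { letter: "P" }, shard: "shard0001" },
  { min: { letter: "P" }, max: { letter: "T" }, shard: "shard0000" },
  { min: { letter: "T" }, max: { letter: "Z" }, shard: "shard0001" },
];

// Group chunk ranges by the shard that owns them.
// Each range is [min, max): min is inclusive, max is exclusive.
function rangesByShard(chunks, keyField) {
  const result = {};
  for (const chunk of chunks) {
    if (!result[chunk.shard]) {
      result[chunk.shard] = [];
    }
    result[chunk.shard].push([chunk.min[keyField], chunk.max[keyField]]);
  }
  return result;
}

console.log(rangesByShard(sampleChunks, "letter"));
```

With this sample, shard0000 ends up holding the ranges A to H and P to T, which shows that a shard's data need not be one contiguous range once the balancer has moved chunks around.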

Related

How shard a collection with data inside sharded cluster MongoDB

I want to shard a collection that already contains data. When I run sh.shardCollection("myDb.myCollection", {id:"hashed"}), the collection is sharded, but the data is not spread across all the shards; it stays on the primary shard only. For example:
Empty collection after sharding:
sh.status() result
When data is then added, it spreads across all the shards.
Collection with data after sharding:
sh.status() result
When data is added, it only goes to the primary shard.
My question is how to correctly shard a collection that already contains data in MongoDB. Is there an alternative way?
I agree with @Wernfried Domscheit's point in the comments that the cluster takes care of distributing the data once the collection is sharded. As mentioned, that happens over time and is driven by writes to the collection. Your test may have too little data or too few writes to trigger the changes.
To your specific question about the initial distribution of chunks, this is covered in the documentation. Applying a hashed shard key on an empty collection in your first example is covered here:
The sharding operation creates empty chunks to cover the entire range of the shard key values and performs an initial chunk distribution. By default, the operation creates 2 chunks per shard and migrates across the cluster. You can use numInitialChunks option to specify a different number of initial chunks. This initial creation and distribution of chunks allows for faster setup of sharding.
And behavior on the collection with data is covered just above it here:
The sharding operation creates the initial chunk(s) to cover the entire range of the shard key values. The number of chunks created depends on the configured chunk size.
Both of these described behaviors match what you have demonstrated in your question.

Mongodb: Determining shard key strategy on compound index

I have a collection with 170 million+ documents, and it is only going to increase. The collection is not that large, currently around 70 GB.
The collection has a compound index on two fields: {AgentId:1, PropertyId:1}. Typically, an import loads a huge file (millions of documents) belonging to a particular AgentId, while the PropertyId (non-numeric, nullable) is mostly a random unique value.
Currently I have two shards with a shard key of {_id: hashed}. I am planning to change the shard key to the compound index {AgentId:1, PropertyId:1} because I think it will improve query performance (most of the queries filter on AgentId). I am not sure whether a shard key can contain a nullable field. If that is a problem, the app will make sure that PropertyId is always a random number.
So I am looking to get a picture of:
How will the data be distributed to shards during insertion, and how are the chunk ranges calculated during insertion?
Since PropertyId is a random value, does the compound key fit the definition of a monotonically increasing value?
I am a newbie to MongoDB and wanted to know if I am on the right path.
Thanks
There is no automatic support in MongoDB for changing a shard key after sharding a collection.
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
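The pre-split step above can be sketched as picking split points from a sorted sample of shard key values, so that each resulting range holds roughly the same number of sampled keys. This is plain JavaScript with a hypothetical helper name (pickSplitPoints) and made-up string keys. It is only an illustration of the idea, not a MongoDB API; on a real cluster each split point would then be applied with a command such as sh.splitAt(), which is not shown here.

```javascript
// Pick up to (numChunks - 1) split points from sample key values so that each
// resulting range covers roughly the same number of sampled keys.
// pickSplitPoints is a hypothetical helper, not a MongoDB API.
// Assumes the keys are strings, so the default sort order applies.
function pickSplitPoints(sampleKeys, numChunks) {
  const sorted = [...sampleKeys].sort();
  const points = [];
  for (let i = 1; i < numChunks; i++) {
    const index = Math.floor((i * sorted.length) / numChunks);
    const candidate = sorted[index];
    // Skip missing or repeated candidates (empty sample, or many duplicate keys).
    if (candidate !== undefined && points[points.length - 1] !== candidate) {
      points.push(candidate);
    }
  }
  return points;
}

// Made-up sample of string shard key values.
const sample = ["agent01", "agent02", "agent03", "agent04",
                "agent05", "agent06", "agent07", "agent08"];
console.log(pickSplitPoints(sample, 4)); // [ 'agent03', 'agent05', 'agent07' ]
```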

Mongodb Range Based sharding

I would like to shard my collection across MongoDB shards using range-based sharding. My question is: if the shard key is a string field, how is a string-based shard key divided into chunks for range-based sharding?
You can divide a string across shards using tag aware sharding. You create the "tags" denoting the ranges of the key to assign to a specific shard. Mongo's balancer will handle the distribution of the data and when you write a query for the key in question Mongo will know to target only that shard.
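To make the idea concrete, here is a small sketch in plain JavaScript of how tag ranges map a string key to a shard. The tag names, shard names and ranges are invented for illustration, and shardForKey is a hypothetical helper rather than anything MongoDB exposes. Ranges include the lower bound and exclude the upper bound, which is how tag ranges are defined. The comparison relies on plain ASCII string ordering, which is an assumption that only holds for simple keys like these.

```javascript
// Hypothetical tag ranges on a string shard key: [min, max) -> tag.
const tagRanges = [
  { min: "A", max: "I", tag: "first" },
  { min: "I", max: "R", tag: "second" },
  { min: "R", max: "[", tag: "third" }, // "[" is the character after "Z" in ASCII
];

// Hypothetical assignment of tags to shards.
const shardForTag = { first: "shard0000", second: "shard0001", third: "shard0002" };

// Return the shard whose tag range contains the key, or null if none does.
function shardForKey(key) {
  for (const range of tagRanges) {
    if (key >= range.min && key < range.max) {
      return shardForTag[range.tag];
    }
  }
  return null;
}

console.log(shardForKey("Hadoop"));  // shard0000
console.log(shardForKey("MongoDB")); // shard0001
console.log(shardForKey("Zebra"));   // shard0002
```

A key that falls outside every tagged range (for example a lowercase string here) returns null in this sketch; in a real cluster such data is simply not pinned to any particular shard.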
For more information see the following URL from the vendor. sharding-introduction/

MongoDB and dynamic shard keys

I have been thinking about sharding with MongoDB and came across a use case which I haven't been able to figure out ... so here it is:
If I have documents that look like this one...
_id [Integer]
username [String]
password [String] <-- SHA1 hash
firstname [String]
lastname [String]
...and I now choose the password field as my shard key, it would be a good fit for sharding since it has a very high cardinality and would scale nicely. But the question remains, what happens if a user changes his password? Will the corresponding document be automatically migrated to a different chunk?
Does someone know how MongoDB handles cases like this one?
Thanks
No, shard keys are immutable.
Consider the mongo documentation, Can I change the shard key after sharding a collection?:
Can I change the shard key after sharding a collection?
No.
There is no automatic support in MongoDB for changing a shard key
after sharding a collection. This reality underscores the importance
of choosing a good shard key. If you must change a shard key
after sharding a collection, the best option is to:
dump all data from MongoDB into an external format.
drop the original sharded collection.
configure sharding using a more ideal shard key.
pre-split the shard key range to ensure initial even distribution.
restore the dumped data into MongoDB.
My understanding of your question is that you asked:
what happens if a user changes his password?
Not:
what happens if I change the shard key?
Completely different questions. For the second case the accepted answer is correct.
For your original question:
In sharded clusters, MongoDB has a component called the balancer. The balancer balances your shards and migrates chunks so that they are balanced in size where possible.
Please read: Sharded Cluster Balancer.
So, yes, if a user changes their password, the corresponding document will be automatically migrated to a different chunk, but only if the balancer thinks it is needed. The balancer takes care of this.
An important note: starting with MongoDB 4.2, the following statement no longer applies:
"Once inserted, a document's shard key value cannot be modified."
So, in answer to the question "Can the shard key be changed?":
Although you cannot select a different shard key for a sharded collection, starting in MongoDB 4.2, you can update a document's shard key value unless the shard key field is the immutable _id field.

Sharding GridFS on MongoDB

I'm reading up on GridFS and the possibility of sharding it across different machines.
According to the documentation here, the suggested shard key is chunks.files_id. This key is linked to the _id of the files collection, so it is incremental: every new file I save in GridFS gets a new, incremental _id.
The O'Reilly book "Scaling MongoDB" discourages using an incremental shard key, to avoid hotspots (the last shard receives all the writes and reads).
What is your suggestion for sharding the GridFS collection?
Has anybody experienced the hotspot problem?
Thank you.
You should shard on files_id to keep file chunks together, but you are correct that that will create a hotspot. If you can, use something other than ObjectId for _ids in the fs.files collection (probably MD5s would be better than ObjectIds).
We'll be adding hashing for sharding, which will solve this, but not until at least 2.0.
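The hotspot, and the effect of an MD5-style key, can be illustrated with a small simulation in plain JavaScript (run with Node.js, since it uses the built-in crypto module). It assumes three shards that each own one contiguous range of the key space, which is a simplification of how chunks are really laid out, and it compares an incrementing key with the MD5 of that key. All names, boundaries and numbers are illustrative.

```javascript
const crypto = require("crypto");

const NUM_SHARDS = 3;

// Range-partitioned incrementing key: the boundaries were chosen from
// existing data (ids below 3000), so shard 2 owns everything from 2000 upward.
function shardForIncrementingId(id) {
  if (id < 1000) return 0;
  if (id < 2000) return 1;
  return 2;
}

// Range-partitioned MD5 key: use the first 8 hex digits as a 32-bit number
// and split that number space into NUM_SHARDS equal ranges.
function shardForMd5(id) {
  const hex = crypto.createHash("md5").update(String(id)).digest("hex");
  const prefix = parseInt(hex.slice(0, 8), 16);
  return Math.floor((prefix / 2 ** 32) * NUM_SHARDS);
}

// Count how many of the new writes (ids 3000..3999) land on each shard.
function countWrites(shardFor) {
  const counts = new Array(NUM_SHARDS).fill(0);
  for (let id = 3000; id < 4000; id++) {
    counts[shardFor(id)] += 1;
  }
  return counts;
}

console.log("incrementing:", countWrites(shardForIncrementingId));
console.log("md5:", countWrites(shardForMd5));
```

With the incrementing key, all 1000 new writes land on the last shard, while the MD5-derived key spreads them across the three ranges. The trade-off is that the key no longer follows insertion order.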
You can shard GridFS data because GridFS is just two collections: chunks and files. Sharding GridFS is very useful. As for the GridFS shard key, choosing a random or incremental shard key is always a bad idea, because the data is not distributed evenly across shards. With an incremental shard key, all writes go to the last shard, which keeps growing; once the difference between shards reaches 10 or more chunks, the balancer moves data to other shards. Moving data to another shard is always an expensive task that should be avoided as much as possible.
So when you choose a shard key, you should care about even distribution of data.
Also, if you are lucky, maybe Kristina, the author of "Scaling MongoDB" (a great specialist in shard keys), will answer your question.
The documentation says that in common cases you should use the default index files_id:1, n:1 as the shard key:
There are different ways that GridFS can be sharded, depending on the need. One common way to shard, based on pre-existing indexes, is:
"files" collection is not sharded. All file records will live in 1 shard. It is highly recommended to make that shard very resilient (at least 3 node replica set)
"chunks" collection gets sharded using the existing index "files_id: 1, n: 1". Some files at the end of ranges may have their chunks split across shards, but most files will be fully contained within the same shard.
Currently, as of version 1.8.1, MongoDB only supports sharding GridFS on the "files_id" field, because md5 is used to verify the upload and that does not work across shards yet. So you cannot split a single file across shards.
Answer on the Google Group