How does the Mongo balancer work with regard to a hashed shard key - mongodb

Balancing data among nodes is a well-known concept and is clearly described in the manual. But how does the balancer work with the chunks of a hashed shard key? Can a chunk be migrated even if, according to the shard key, the documents in that chunk belong to their current shard? If so, why would documents end up on a shard they do not belong to? Or is the balancer irrelevant for a hashed shard key as long as the number of shards does not change?

A hashed shard key still behaves like a normal range-based shard key: its chunks can be migrated, split, etc. just like with a normal shard key.
The difference is that a hashed shard key involves an additional step when the collection is initially sharded. To paraphrase the docs page on Hashed sharding:
When sharding a populated collection:
The sharding operation creates the initial chunk(s) to cover the entire range of the shard key values. The number of chunks created depends on the configured chunk size.
When sharding an empty collection:
The sharding operation creates empty chunks to cover the entire range of the shard key values and performs an initial chunk distribution. By default, the operation creates 2 chunks per shard and migrates across the cluster.
In short, hashed sharding pre-splits the whole hashed keyspace, and then the balancer moves the chunks across the cluster as usual.
Other than the initial chunk splits, a hashed shard key functions just like a normal shard key.
Note that a hashed shard key can help if your underlying key is monotonically increasing: it prevents a "hot shard" and spreads your writes across the whole cluster. Using a hashed shard key is not a guarantee against jumbo chunks, though, since the cardinality of the hashed key reflects the cardinality of the underlying key.
For example, assume the hash algorithm is MD5 (128 bits in size). The MD5 space is potentially very large, but if the underlying (unhashed) key can only ever be A or B, hashing those two values will always yield A -> bf072e9119077b4e76437a93986787ef and B -> 30cf3d7d133b08543cb6c8933c29dfd7, making the effective cardinality extremely small.
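As a minimal mongo shell sketch, assuming a made-up test.events namespace and a deviceId field, hashed sharding with an explicit pre-split looks roughly like this on recent versions:

sh.enableSharding("test")
// Shard on a hashed key; for an empty collection this pre-splits the hashed
// keyspace and distributes the initial chunks across the shards.
sh.shardCollection("test.events", { deviceId: "hashed" }, false, { numInitialChunks: 8 })
sh.status()   // inspect how the initial chunks were spread

From that point on, the balancer treats these chunks exactly like range-based ones.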

Related

Inserting data into an empty sharded database in Mongo when the balancer is not enabled results in all data on one shard

We have 2 MongoDB shard servers (a 3-member replica set each).
We created a sharded collection and inserted 200k documents. The balancer was disabled during that window; we enabled it after the first test and started inserting again.
In the first test, all data was inserted into one shard and we got lots of warnings in the mongo log:
splitChunk cannot find chunk [{ articleId: MinKey, sessionId: MinKey },{ articleId: "59830791", sessionId: "fb0ccc50-3d6a-4fc9-aa66-e0ccf87306ea" }) to split, the chunk boundaries may be stale
The reason mentioned in the log was "possible low cardinality shard key".
After the second and third tests, when the balancer was on, data was balanced across both shards.
We did one more test and stopped the balancer again; in this test data was going into both shards even though the balancer was off (the pageIds/readerIds were repeated from the old tests along with some new IDs for both).
Could you please explain how this mechanism works? Shouldn't data go into both shards no matter whether the balancer is ON or OFF, as long as the key's cardinality is good?
The shard key is: (pageid) and (unique readerid)
Below are the insertion stats:
Page reads in the test window: 200k
Unique page IDs: 2,000
Unique sessions reading pages in the window: 70,000
Thanks in Advance!
When you enable sharding for a database, a primary shard is assigned to that database.
If you insert data while the balancer is disabled, all of the data goes into the primary shard. As your data grows, split points are calculated and chunks get created.
Since your balancer is disabled, all of those chunks remain on the same shard.
If the balancer is enabled, it will balance those chunks between the shards, which results in better data distribution.
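For reference, these are the standard mongo shell helpers for checking and toggling the balancer (nothing cluster-specific assumed here):

sh.getBalancerState()    // true if the balancer is enabled
sh.isBalancerRunning()   // true if a balancing round is in progress right now
sh.stopBalancer()        // disable balancing, e.g. for a bulk-load window
sh.startBalancer()       // re-enable it so chunks can be migrated between shards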
We did one more test and stopped the balancer again; in this test data was going into both shards even though the balancer was off (the pageIds/readerIds were repeated from the old tests along with some new IDs for both).
By that point the data has already been split into chunks, and those chunks are well distributed between the 2 shards. If your shard key values are also spread evenly across the chunk ranges, any new document goes into its respective chunk, which leads to even data distribution.
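To verify this yourself, here is a quick sketch (the mydb.pages namespace is hypothetical; substitute your own) of how to inspect where chunks and documents ended up:

use mydb
db.pages.getShardDistribution()   // per-shard document and chunk counts
sh.status()                       // chunk ranges and which shard owns each one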

Picking a shard key for MongoDB

I want to shard my MongoDB database. I have a high insert rate and want to distribute my documents evenly across two shards.
I have considered range-based sharding because I have range queries, but I cannot find a good way to pick the shard key.
{
    Timestamp: ISODate("2016-10-02T00:01:00.000Z"),
    Machine_ID: "100",
    Temperature: "50"
}
If this is my document and I have 100,000 different machines, would Machine_ID be a suitable shard key? And if so, how will MongoDB distribute it across the shards, i.e. do I have to specify the shard ranges myself, for example put Machine_ID 0-49,999 on shard A and 50,000-100,000 on shard B?
I think Machine_ID would be a suitable shard key if your subsequent queries will be per machine, i.e. fetching all the temperatures for a specific machine over a certain time range. You can read more about shard keys here: Choosing a shard key.
MongoDB has two kinds of sharding, hashed sharding and ranged sharding, which you can read more about here: Sharding strategies. That said, you don't need to specify the shard ranges yourself; MongoDB takes care of it. In particular, when the time comes to add a new shard, MongoDB will rearrange the chunks onto the new shard.
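As a rough mongo shell sketch of both options, assuming the collection lives at a hypothetical sensors.readings namespace (a collection can have only one shard key, so you would pick one of the two):

sh.enableSharding("sensors")
// Ranged sharding on Machine_ID keeps each machine's documents contiguous,
// which suits per-machine range queries.
sh.shardCollection("sensors.readings", { Machine_ID: 1 })
// Alternatively, hashed sharding on Machine_ID spreads writes more evenly;
// MongoDB picks the chunk boundaries in either case.
sh.shardCollection("sensors.readings", { Machine_ID: "hashed" })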
If your cluster has only two shards, then it isn't difficult to design for. However, if your data will continue to grow and you end up having a lot more shards, then the choice of shard key is more difficult.
For example, if some machines have many more records than others (e.g. one machine has 3,000 records, i.e. 3% of the total), that doesn't cause problems with only two shards. But if your data grows to the point where you need 100 shards, and one machine still holds 3% of the total, then Machine_ID is no longer a good choice: all of a single machine's records must live in a single chunk, which cannot be split or spread across several shards.
In that case, a better strategy might be to use a hash of the Timestamp - but it depends on the overall shape of your dataset.

How to balance data when the write load is too heavy

I deployed a sharded cluster of two shards with MongoDB version 3.0.3.
Unfortunately, I chose a monotonically increasing shard key:
{insertTime: 1}
When the data size was small and the write speed was slow, the balancer could keep the data balanced between the two shards. But now that the data has grown large and our write speed is much faster, balancing is far too slow to keep up.
The disk on one of the two shards, shard2, is now nearly full.
How can I solve this problem without stopping our service and application?
I strongly suggest that you change your shard key while it's not too late, to avoid the predictable death of your cluster.
When a shard key increases monotonically, all write operations are sent to a single shard. Its current top chunk grows and then splits, and you keep hammering the new top chunk until it splits again. At some point your cluster won't be balanced anymore, chunk migrations get triggered, and everything slows down even more.
MongoDB generates ObjectId values upon document creation to produce a unique identifier for the object. However, the most significant bits of data in this value represent a time stamp, which means that they increment in a regular and predictable pattern. Even though this value has high cardinality, when using this, any date, or other monotonically increasing number as the shard key, all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster.
You do not benefit from the good parts of sharding with this shard key. It can actually perform worse than a single node.
You should read this to select your new shard key and avoid the typical anti-patterns: http://docs.mongodb.org/manual/tutorial/choose-a-shard-key/
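For instance, one common way out of a monotonically increasing key is to hash it. A hypothetical sketch for a freshly created collection (the mydb.events namespace is made up):

// Hashing insertTime spreads inserts across the hashed keyspace instead of
// always hitting the chunk at the top of the range.
sh.shardCollection("mydb.events", { insertTime: "hashed" })

Keep in mind that on MongoDB 3.0 you cannot change the shard key of an existing collection in place; you would have to dump the data and restore it into a newly sharded collection.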
You could add a shard to the cluster to increase capacity.
From the docs:
You add shards to a sharded cluster after you create the cluster or any time that you need to add capacity to the cluster. If you have not created a sharded cluster, see Deploy a Sharded Cluster.
When adding a shard to a cluster, always ensure that the cluster has enough capacity to support the migration required for balancing the cluster without affecting legitimate production traffic.
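Adding a shard is a single command from a mongos. A sketch with hypothetical host names (the new shard should itself be a replica set):

sh.addShard("shard3rs/shard3a.example.net:27018,shard3b.example.net:27018,shard3c.example.net:27018")
sh.status()   // the balancer will start migrating chunks onto the new shard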

Relation between shard keys and chunks in MongoDB sharded cluster?

I can't really understand the shard key concept in a MongoDB sharded cluster, as I've just started learning MongoDB.
Citing the MongoDB documentation:
A chunk is a contiguous range of shard key values assigned to a particular shard. When they grow beyond the configured chunk size, a mongos splits the chunk into two chunks.
It seems that chunk size is something related to a particular shard, not to the cluster itself. Am I right?
Speaking about the cardinality of a shard key:
Consider the use of a state field as a shard key: The state key's value holds the US state for a given address document. This field has a low cardinality as all documents that have the same value in the state field must reside on the same shard, even if a particular state's chunk exceeds the maximum chunk size.
Since there are a limited number of possible values for the state field, MongoDB may distribute data unevenly among a small number of fixed chunks.
My question is how the shard key relates to the chunk size.
It seems to me that, with just two shard servers, it wouldn't be possible to distribute the data, because documents with the same value in the state field must reside on the same shard. With three documents whose states are Arizona, Indiana and Maine, how is the data distributed among just two shards?
In order to understand the answer to your question you need to understand range-based partitioning. If you have N documents, they will be partitioned into chunks, and the split points are determined based on your shard key.
With the shard key being some field in your documents, all the possible values of the shard key are considered, and the documents are (logically) split into chunks/ranges based on each document's shard key value.
In your example there are 50 possible values for "state" (okay, probably more like 52), so there can be at most 52 chunks. The default chunk size is 64 MB. Now imagine that you are sharding a collection with ten million documents of 1 KB each. Each chunk should not contain more than about 65K documents, so ten million documents should be split into more than 150 chunks, but we only have 52 distinct values for the shard key! So your chunks are going to be very large. Why is that a problem? In order to auto-balance chunks among shards the system needs to migrate chunks between shards, and if a chunk is too big, it can't be moved. And since it can't be split either, you'll be stuck with an unbalanced cluster.
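To see this in practice, here is a small sketch against the config database (the mydb.addresses namespace is hypothetical; on older MongoDB versions chunk documents are keyed by the ns field):

use config
db.chunks.find({ ns: "mydb.addresses" }).count()       // how many chunks the collection has
db.chunks.find({ ns: "mydb.addresses", jumbo: true })  // chunks flagged too big to split or move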
There is definitely a relationship between the shard key and chunk size. You want to choose a shard key with a high level of cardinality, that is, a shard key that can take many possible values, as opposed to something like State, which is basically locked into only 50 possible values. Low-cardinality shard keys like that can result in chunks that consist of only one shard key value, and such chunks cannot be split and moved to another shard in a balancing operation.
High cardinality of the shard key (like a person's phone number as opposed to their State or Zip Code) is essential to ensure even distribution of data. Low-cardinality shard keys can lead to larger chunks (because you have more contiguous values that need to be kept together) that cannot be split.

How does MongoDB do both sharding and replication at the same time?

For scaling/failover MongoDB uses a "replica set", where there is a primary and one or more secondary servers. The primary is used for writes; secondaries are used for reads. This is pretty much the master-slave pattern used in SQL databases.
If the primary goes down, one of the secondaries in the replica set takes its place.
So the issue of horizontal scaling and failover is taken care of. However, this does not seem to be a solution that allows for sharding. A true shard holds only a portion of the entire data, so if a secondary in a replica set is a shard, how can it qualify as primary when it doesn't have all of the data needed to service the requests?
Wouldn't we have to have a replica set for each one of the shards?
This is obviously a beginner question, so a link that visually or otherwise illustrates how this is done would be helpful.
Your assumption is correct: each shard contains a separate replica set. When a write request comes in, the mongos finds the right shard for it based on the shard key, and the data is written to the primary of the replica set backing that shard. This results in write scaling, as a (well-chosen) shard key should distribute writes over all your shards.
A shard is the sum of a primary and secondaries (replica set), so yes, you would have to have a replica set in each shard.
Each shard's portion of the data is held on its primary and replicated to the secondaries to maintain consistency. If the primary goes down, a secondary is elected to be the new primary; it has the same data as its predecessor, so it can begin serving immediately. That means the sharded data is still present and not lost.
You would typically map individual shards to separate replica sets.
See http://docs.mongodb.org/manual/core/sharded-clusters/ for an overview of MongoDB sharding.
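As a quick way to see this shard-to-replica-set mapping on a running cluster, you can run the following from a mongos (no cluster-specific names assumed; each shard's host string has the form "replicaSetName/host1:port,host2:port,..."):

db.adminCommand({ listShards: 1 })   // lists each shard with its replica set connection string
sh.status()                          // also shows how chunks are distributed across those shards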