picking a shardkey for mongodb

picking a shardkey for mongodb - mongodb

I want to shard my MongoDB database. I have a high insert rate and want to distribute my documents on two shards evenly.
I have considered rangebase sharding, because I have range queries; but I can not find a solution for picking a good shard key.
{
Timestamp : ISODate("2016-10-02T00:01:00.000Z"),
Machine_ID: "100",
Temperature:"50"
}
If this is my document and I have 100,000 different machines, would the Machine_ID be a suitable shardkey? And if so, how will MongoDB distribute it on the shards, i.e. do i have to specify the shard range myself? like put Machine_ID 0-49,999 on shard A, and 50,000-100,000 on shard B?

I think the Machine_ID would be a suitable shard key if your queries afterwards will be per Machine, i.e. get all the temperatures for a specific machine for a certain time range. Reading more about shard keys can be found here: Choosing shard key
MongoDB has two kinds of sharding: Hashed sharding and Range sharding which you can read more about here: Sharding strategies. Having said that, you don't need to specify the range of the shards yourself, mongo will take care of it. Especially when a time comes when you'll need to add a new shard, mongo will rearrange the chunks into the new shard.

If your cluster has only two shards, then it isn't difficult to design for. However, if your data will continue to grow and you end up having a lot more shards, then the choice of shard key is more difficult.
For example, if some machines have many more records than others (e.g. one machine has 3000 records i.e. 3% of the total), then that doesn't cause problems with only two shards. But if your data grows so that you need 100 shards, and one machine still has 3% of the total, then Machine_ID is no longer a good choice: because a single machine's records have to be a single chunk, and cannot be distributed across several shards.
In that case, a better strategy might be to use a hash of the Timestamp - but it depends on the overall shape of your dataset.

Related

How to balance data when write overload is too heavy

I deployed a sharded cluster of two shards with MongoDB version 3.0.3.
Unfortunately, I chose a monotonic shard key just like:
{insertTime: 1}
When data size was small and the write speed was slow, the balancer can balance the data between the two shards. But when the data size grows big and our write speed is much faster, the balancing speed is so slow.
Now, the hard disk's storage of one of the two shards called shard2 is near the limit.
How Can I solve this problem without stopping our service and application??

I strongly suggest that you change your shard key while it's not too late to do so to avoid the preditable death of your cluster.
When a shard key increase monotonically, all the writes operations are sent to a single shard. Thus, this shard will grow then split into 2 shards. You will continue to hammer one of them until it splits again. At some point, you cluster won't be balanced anymore and your cluster will trigger some chunk moves and slow down your cluster even more.
MongoDB generates ObjectId values upon document creation to produce a unique identifier for the object. However, the most significant bits of data in this value represent a time stamp, which means that they increment in a regular and predictable pattern. Even though this value has high cardinality, when using this, any date, or other monotonically increasing number as the shard key, all insert operations will be storing data into a single chunk, and therefore, a single shard. As a result, the write capacity of this shard will define the effective write capacity of the cluster.
You do not benefit from the good part of the sharding with this shard key. It's actually worst in performance than a single node.
You should read this to select your new shard key and avoid the typical anti patterns. http://docs.mongodb.org/manual/tutorial/choose-a-shard-key/

You could add a shard to the cluster to increase capacity.
From the docs:
You add shards to a sharded cluster after you create the cluster or any time that you need to add capacity to the cluster. If you have not created a sharded cluster, see Deploy a Sharded Cluster.
When adding a shard to a cluster, always ensure that the cluster has enough capacity to support the migration required for balancing the cluster without affecting legitimate production traffic.

Migrating chunks from primary shard to the other single shard takes too long

Each chunk move takes about 30-40 mins.
The shard key is a random looking but monotically increasing integer string which is a long sequence of digits. A "hashed" index is created for that field.
There are 150M documents each about 1.5Kb in size. The sharded collection has 10 indexes (some of them compound).
I have a total of ~11k chunks reported in sh.status(). So far I could only transfer 42 of them to the other shard.
The system consists of one mongos, one config server and one primary (mongod) shard and other (mongod) shard. All in the same server which has 8 cores and 32 GB ram.
I know the ideal is to use seperate machines but none of the CPUs were utilized so I thought it was good for a start.
What is your comment?
What do I need to investigate?
Is it normal?

As said on the mongodb documentation : " Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations."
You should definitely not have your shards on the same machine. It is useless. The interest of sharding is that you scale horizontaly. So if you shard on the same machine.... You are just killing your throughput.
Your database will be faster without sharding if you have one machine.
To avoid data loss, before using sharding you should use : raid (not 0), replicaset and then sharding.

Advise on mongodb sharding

I read mongodb sharding guide, but I am not sure what kind of shard key suits my application. Any suggestions are welcome.
My application is a huge database of network events. Each document has a time filed and a couple of network related values like IP address and port number. I have the insert rate of 100-1000 items per seconds. In my experience with a single mongod, one single shard has no problem with this insert rate.
But I extensively use aggregation framework on huge amount of data. All the aggregations has a time limit--i.e. mostly the recent month or the recent week. I tested the aggregations on one singe mongod and a query with response time of 5 minutes while the insertion is off could take as long as two hours if the 200 insert per seconds is activated.
Can I improve mongodb aggregation query response time by sharding?
If yes, I think I have to use time as the shard key, because in my application, every query has to be run on a time limit (e.g. Top IP addresses in the recent month) and if we can separate the shard that inserting is take place and the shard that the query is working the mongodb could work much faster.
But the documenation says
"If the shard key is a linearly increasing field, such as time, then all requests for a given time range will map to the same chunk, and thus the same shard. In this situation, a small set of shards may receive the majority of requests and the system would not scale very well."
So what shall I do?

Can I improve mongodb aggregation query response time by sharding?
Yes.
If you shard your database on different machines this will provide parallel computing power on aggregation functions. The important thing is here the distribution of the data in your shards. It should be uniform distributed.
If you choose your shard key as a monotonicly increasing(or decreasing) field of the document-like time- and use hashed sharding, this will provide uniform distribution over the cluster.

Difference between Sharding And Replication on MongoDB

I am just confuse about the Sharding and Replication that how they works..According to Definition
Replication: A replica set in MongoDB is a group of mongod processes that maintain the same data set.
Sharding: Sharding is a method for storing data across multiple machines.
As per my understanding if there is data of 75 GB then by replication (3 servers), it will store 75GB data on each servers means 75GB on Server-1, 75GB on server-2 and 75GB on server-3..(correct me if i am wrong)..and by sharding it will be stored as 25GB data on server-1, 25Gb data on server-2 and 25GB data on server-3.(Right?)...but then i encountered this line in the tutorial
Shards store the data. To provide high availability and data
consistency, in a production sharded cluster, each shard is a replica
set
As replica set is of 75GB but shard is of 25GB then how they can be equivalent...this makes me confuse a lot...I think i am missing something great in this. Please help me in this.

Lets try with this analogy. You are running the library.
As any person who has is running a library you have books in the library. You store all the books you have on the shelf. This is good, but your library became so good that your rival wants to burn it. So you decide to make many additional shelves in other places. There is one the most important shelf and whenever you add some new books you quickly add the same books to other shelves. Now if the rival destroys a shelf - this is not a problem, you just open another one and copy it with the books.
This is replication (just substitute library with application, shelf with a server, book with a document in the collection and your rival is just failed HDD on the server). It just makes additional copies of the data and if something goes wrong it automatically selects another primary.
This concept may help if you
want to scale reads (but they might lag behind the primary).
do some offline reads which do not touch main server
serve some part of the data for a specific region from a server from that specific region
But the main reason behind replication is data availability. So here you are right: if you have 75Gb of data and replicate it with 2 secondaries - you will get 75*3 Gb of data.
Look at another scenario. There is no rival so you do not want to make copy of your shelves. But right now you have another problem. You became so good that one shelf is not enough. You decide to distribute your books between many shelves. You decide to distribute them between shelves based on the author name (this is not be a good idea and read how to select sharding key here). So everything that starts with name less then K goes to one shelf everything that is K and more goes to another. This is sharding.
This concept may help you:
distribute a workload
be able to save data which much more then can fit on a single server
do map-reduce things
store more data in ram for faster queries
Here you are partially correct. If you have 75Gb, then in sum on all the servers there will be still 75 Gb, but it does not necessarily be divided equally.
But here is a problem with only sharding. Right now your rival appeared and he just came to one of your shelves and burned it. All the data on that shelf is lost. So you want to replicate every shard as well. Basically the notion that
each shard is a replica set
is not true. But if you are doing sharding you have to create a replication for every shard. Because the more shards you have, the bigger is the probability that at least one will die.

Answering Saad's followup answer:
Also you can have shards and replicas together on the same server, it is not recommended way of doing it. Each server should have a single role in the system. If for example you decide to have 2 shards and to replicate it 3 times, you will end up with 6 machines.
I know that this might sound too costly, but you have to remember that this is a commodity hardware and if the service you providing is already so good, that you think about high availability and does not fit one machine, then this is a rather cheap price to pay (in comparison to a dedicated one big machine).

I am writing it as an answer but actually its a question to #Salvador Sir's answer.
Like you said that in sharding 75 GB data "may be" stored as 25GB data on server-1, 25GB on server-2 and 25Gb on server-3. (this distribution depends on the Sharding Key)...then to prevent it from loss we also need to replicate the shard. so this means now every server contains it shards and also the replication of other shards present on other server..means Server-1 will have
1) Its own shard.
2) Replication of Shard present on server-2
3) Replication of Shard present on server-3
same goes with Server-2 and server-3. Am i right?..if this is the case then each server again have 75GB of data again. Right or wrong?

Since we want to make 3 shards and also replicate the data so following is the solution to the above problem.
r has shard and also replica set then in that case the failure of that server will lead to loss of replica set and shard.
However you can have the shard 1 and replica set (replica of shard 2 and shard 3) on same server but this is not advisable..

Sharding is like partition of data.
Lets say you have around 3GB of data, and you defined 3 shards, So each shard MIGHT take 1GB of data(And it truly depends on the shard key)
Why sharding is needed? Searching a specific data out of 3GB is 3 times complex than searching in 1GB of data. So its almost similar to partition. And sharding helps for fast accessing of data.
Now coming to Replica, Lets say you have the same 3GB of data without any replication(That means only a single copy of data exists) so if anything happens to that machine or the drive, your data is gone. So replication comes into picture to solve this problem, Lets say when you set up the DB, you have given your Replication as 3, which means the same 3GB of data is available 3 times(So the total size could be 9GB divided by each of 3GB copies). Replication helps for fail over.

Relation between shard keys and chunks in MongoDB sharded cluster?

I can't really understand the shard key concept in a MongoDB sharded cluster, as I've just started learning MongoDB.
Citing the MongoDB documentation:
A chunk is a contiguous range of shard key values assigned to a
particular shard. When they grow beyond the configured chunk size, a
mongos splits the chunk into two chunks.
It seems that chuck size is something related to a particular shard, not to the cluster itself. Am I right?
Speaking about the cardinality of a shard key:
Consider the use of a state field as a shard key:
The state key’s value
holds the US state for a given address document. This field has a low
cardinality as all documents that have the same value in the state
field must reside on the same shard, even if a particular state’s
chunk exceeds the maximum chunk size.
Since there are a limited number of possible values for the state field, MongoDB may distribute data unevenly among a small number of fixed chunks.
My question is how the shard key relates to the chunk size.
It seems to me that, having just two shard servers, it wouldn't be possible to distribute the data because same value in the state field must reside on the same shard. With three documents with states like Arizona, Indiana and Maine, how data is distributed among just two shards?

In order to understand the answer to your question you need to understand range based partitioning. If you have N documents they will be partitioned into chunks - the way the split points are determined is based on your shard key.
With shard key being some field in your document, all the possible values of the shard key will be considered and all the documents will be (logically) split into chunks/ranges, based on what value each document's shard key is.
In your example there are 50 possible values for "state" (okay, probably more like 52) so at most there can only be 52 chunks. Default chunk size is 64MB. Now imagine that you are sharding a collection with ten million documents which are 1K each. Each chunk should not contain more than about 65K documents. Ten million documents should be split into more than 150 chunks, but we only have 52 distinct values for the shard key! So your chunks are going to be very large. Why is that a problem? Well, in order to auto-balance chunk among shards the system needs to migrate chunks between shards and if the chunk is too big, it can't be moved. And since it can't be split, you'll be stuck with unbalanced cluster.

There is definitely a relationship between shard key and chunk size. You want to choose a shard key with a high level of cardinality. That is, you want a shard key that can have many possible values as opposed to something like State which is basically locked into only 50 possible values. Low cardinality shard keys like that can result in chunks that consist of only one of the shard key values and thus can not be split and moved to another shard in a balancing operation.
High cardinality of the shard key (like a person's phone number as opposed to their State or Zip Code) is essential to ensure even distribution of data. Low cardinality shard keys can lead to larger chunks (because you have more contiguous values that need to be kept together) that can not be split.