servers with different hard drive sizes scenario - mongodb

My scenario is that I have, for example, 2 servers (shards), one with a bigger hard drive than the other. So if one is 500GB and the other is 1TB and the first gets full with data, what happens when I add more data to the servers? Will the balancer know that the first is full and transfer the extra data from the first server to the second?

No. The balancer will try to partition the chunks evenly across all shards.
First, data is not written to one shard until it fills up. Over time you will probably have a similar number of chunks and a similar amount of data on both shards. This is why it is recommended to have similar server specs.
Nevertheless, if you want to partition your chunks in a ratio of two to one, you can do one of the following:
Change the Maximum Storage Size for a Given Shard. Use a ratio of 2 to 1.
Use Tag Aware Sharding, which is the more manageable and predictable option.
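As a rough sketch of the tag-aware option, the snippet below cuts a numeric shard key range one third of the way through and assigns the two parts to tagged shards. The shard names, tag names, namespace "mydb.users", key field "userId" and key bounds are all hypothetical, and the result only approximates a 2:1 data split if documents are spread evenly over the key range. It relies on the sh.addShardTag and sh.addTagRange shell helpers.

```javascript
// Sketch only: shard names, tags, namespace and key bounds are hypothetical.
function twoToOneRanges(keyName, min, max) {
  // Cut one third of the way through [min, max): the small shard gets the
  // first third of the key space, the large shard the remaining two thirds.
  var cut = min + Math.floor((max - min) / 3);
  function key(v) { var k = {}; k[keyName] = v; return k; }
  return [
    { tag: "small", min: key(min), max: key(cut) },
    { tag: "large", min: key(cut), max: key(max) }
  ];
}

function applyTags(shell, ns, ranges) {
  shell.addShardTag("shard0000", "small"); // assumed to be the 500 GB server
  shell.addShardTag("shard0001", "large"); // assumed to be the 1 TB server
  ranges.forEach(function (r) {
    shell.addTagRange(ns, r.min, r.max, r.tag);
  });
}

// `sh` only exists inside the mongo shell, so this is a no-op elsewhere.
if (typeof sh !== "undefined") {
  applyTags(sh, "mydb.users", twoToOneRanges("userId", 0, 3000000));
}
```

Run it in a shell connected to a mongos; the balancer then migrates chunks until each tagged range lives on a shard carrying the matching tag.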

Related

Chunk could not split when the chunk size is greater than the specified chunk size

Here is the situation:
There is a chunk with the shard key range [10001, 10030], but currently only one key (e.g. 10001) has data; the key range [10002, 10030] is empty. The chunk data is beyond 8M, and we then set the chunk size to 8M.
After we filled in data for the key range [10002, 10030], the chunk started to split and stopped at a key range like [10001, 10003]. It has two keys, and we wonder whether this is OK or not.
From the documentation on the official site, we thought that a chunk might NOT contain more than ONE key.
So, would you please help us confirm whether this is OK or not?
What we want is to split the chunks as much as possible, to make sure the data is balanced.
There is a notion called jumbo chunks. Every chunk which exceeds its specified size or has more documents than the configured maximum is considered a jumbo chunk.
Since MongoDB usually splits a chunk when about half the chunk size is reached, I take jumbo chunks as a sign that there is something rather wrong with the cluster.
The most likely reason for jumbo chunks is that one or more config servers weren't available for a time.
Metadata updates need to be written to all three config servers (they don't form a replica set), so metadata updates cannot be made while one of the config servers is down. Both chunk splitting and migration need a metadata update. So when one config server is down, a chunk cannot be split early enough; it will grow in size and ultimately become a jumbo chunk.
Jumbo chunks aren't automatically split, even when all three config servers are available. The reason for this is... Well, IMHO MongoDB plays it a little safe here. And jumbo chunks aren't moved, either. The reason for this is rather obvious: moving data which in theory can have any size > 16MB is simply too costly an operation.
Proceed at your own risk! You have been warned!
Since you can identify the jumbo chunks, they are pretty easy to deal with.
Simply identify the key range of the chunk and use it within
sh.splitFind("database.collection", query)
This will identify the chunk in question and split it in half, which is quite important. Please, please read Split Chunks in a Sharded Cluster and make sure you have understood all of it and the implications before trying to split the chunks manually.
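For illustration, here is a minimal sketch of the two manual split helpers. The shard key field name "userId" and the key values are assumptions made up for this example; only sh.splitFind and sh.splitAt themselves are real shell helpers.

```javascript
// Hypothetical namespace and shard key field; adjust to your collection.
var ns = "database.collection";

function splitAtMedian(shell, keyValue) {
  // splitFind locates the chunk holding a document that matches the query
  // and splits that chunk in half.
  return shell.splitFind(ns, { userId: keyValue });
}

function splitAtKey(shell, keyValue) {
  // splitAt splits the chunk containing keyValue exactly at keyValue.
  return shell.splitAt(ns, { userId: keyValue });
}

// `sh` only exists inside the mongo shell, so this is a no-op elsewhere.
if (typeof sh !== "undefined") {
  splitAtMedian(sh, 10001);
}
```

splitAtKey is the variant to reach for when you want to choose the boundary yourself, for example splitting a chunk covering keys 10001 and 10002 at 10002.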

what happens when maxsize is exceeded

When the size of the cluster rises, chunks are divided. Docs say that "the balancer will not move chunks off an overloaded shard. This must happen manually." So will redundant chunks of a shard that has reached the maxSize limit be moved to another shard that hasn't exceeded its maxSize, or will they stay on the same shard so that one must manually move those extra bytes and chunks off the shard?
Docs say that "the balancer will not move chunks off an overloaded shard. This must happen manually." So will redundant chunks of a shard that has reached the maxSize limit be moved to another shard that hasn't exceeded its maxSize, or will they stay on the same shard so that one must manually move those extra bytes and chunks off the shard?
This is specific to when you have set maxSize limit for a shard and that limit has been reached. The balancer will no longer migrate chunks to that shard, and it will remain "full" unless you manually move some chunks to another shard via sh.moveChunk(). The default behaviour is to have no maxSize set so shards can use as much disk space as is available.
My scenario is that I have, for example, 2 servers, one with a bigger hard drive than the other. So if one is 500GB and the other is 1TB and the first gets full with data, what happens when I add more data to the servers? Will the balancer know that the first is full and transfer the extra data from the first server to the second?
MongoDB balances data between shards on the basis of logical chunks that are contiguous ranges of values based on the shard key you have selected. By default a chunk represents roughly 64MB of data.
MongoDB is unaware of the underlying disk configuration, so if the server hosting shardA has twice as much disk space as the server hosting shardB, the balancer still only considers the number of chunks associated with each shard (not the actual disk usage). Ideally all shards should have a similar configuration in terms of hardware and disk space.
If you use the maxSize option to limit the storage on a specific shard, this setting only controls whether the balancer will move chunks to that shard once the maxSize has been reached.
For more information see Sharded Collection Balancing in the MongoDB documentation.
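The behaviour described above can be pictured with a toy model. This is not MongoDB's balancer code, just a sketch under stated assumptions: the field names, the threshold of 8 and the convention that maxSizeMB of 0 means "no limit" are all chosen for illustration. Note that physical disk size never appears in the model, only chunk counts and the optional maxSize.

```javascript
// Toy model of one balancing decision; not MongoDB's actual implementation.
// shards: [{ name, chunks, dataMB, maxSizeMB }], maxSizeMB === 0 means no limit.
function pickMigration(shards, threshold) {
  // A shard that has reached its maxSize is never chosen as a destination.
  const eligible = shards.filter(
    (s) => s.maxSizeMB === 0 || s.dataMB < s.maxSizeMB
  );
  if (eligible.length === 0) return null;

  // Source: most chunks overall. Destination: fewest chunks among eligible.
  const from = shards.reduce((a, b) => (b.chunks > a.chunks ? b : a));
  const to = eligible.reduce((a, b) => (b.chunks < a.chunks ? b : a));

  if (from === to || from.chunks - to.chunks < threshold) return null;
  return { from: from.name, to: to.name };
}
```

With an unlimited shardB holding far fewer chunks than shardA, the model proposes a move from shardA to shardB; once shardB's dataMB reaches its maxSizeMB, it proposes nothing, whatever the imbalance.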

Difference between Sharding And Replication on MongoDB

I am confused about how sharding and replication work. According to the definitions:
Replication: A replica set in MongoDB is a group of mongod processes that maintain the same data set.
Sharding: Sharding is a method for storing data across multiple machines.
As per my understanding, if there is 75 GB of data, then with replication (3 servers) it will store 75 GB on each server, meaning 75 GB on server-1, 75 GB on server-2 and 75 GB on server-3 (correct me if I am wrong), and with sharding it will be stored as 25 GB on server-1, 25 GB on server-2 and 25 GB on server-3 (right?). But then I encountered this line in the tutorial:
Shards store the data. To provide high availability and data
consistency, in a production sharded cluster, each shard is a replica
set
If a replica set holds 75 GB but a shard holds 25 GB, how can they be equivalent? This confuses me a lot. I think I am missing something important here. Please help me with this.
Let's try this analogy. You are running a library.
Like anyone running a library, you have books, and you store all of them on a shelf. This is good, but your library has become so good that your rival wants to burn it. So you decide to set up many additional shelves in other places. One shelf is the most important, and whenever you add new books to it you quickly add the same books to the other shelves. Now if the rival destroys a shelf, this is not a problem: you just open another one and copy the books onto it.
This is replication (just substitute application for library, server for shelf, a document in the collection for book, and a failed HDD on the server for your rival). It just makes additional copies of the data, and if something goes wrong it automatically selects another primary.
This concept may help if you want to:
scale reads (but they might lag behind the primary)
do some offline reads which do not touch the main server
serve some part of the data for a specific region from a server in that region
But the main reason behind replication is data availability. So here you are right: if you have 75 GB of data and replicate it with 2 secondaries, you will get 75*3 GB of data.
Look at another scenario. There is no rival, so you do not want to make copies of your shelves. But now you have another problem. You have become so good that one shelf is not enough. You decide to distribute your books between many shelves, based on the author's name (this is not a good idea; read how to select a shard key here). So everything with a name that sorts before K goes to one shelf, and everything from K onwards goes to another. This is sharding.
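The shelf rule in the analogy is a range lookup on the shard key. A minimal sketch, with made-up shelf names and a simplistic uppercase comparison, might look like this:

```javascript
// Hypothetical two-shelf layout from the analogy: names before "K" on the
// first shelf, "K" and later on the second. "\uffff" stands in for "no upper bound".
const shelves = [
  { name: "shelf1", min: "", max: "K" },
  { name: "shelf2", min: "K", max: "\uffff" },
];

function shelfFor(author) {
  const key = author.toUpperCase();
  // Each range is half-open: min is included, max is excluded.
  const match = shelves.find((r) => key >= r.min && key < r.max);
  return match.name;
}
```

Here shelfFor("Asimov") lands on shelf1 and shelfFor("King") on shelf2, and nothing in the rule guarantees that both shelves end up holding the same number of books.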
This concept may help you:
distribute a workload
store much more data than can fit on a single server
do map-reduce things
keep more data in RAM for faster queries
Here you are partially correct. If you have 75 GB, then summed over all the servers there will still be 75 GB, but it will not necessarily be divided equally.
But sharding alone has a problem. Suppose your rival appears, goes to one of your shelves and burns it. All the data on that shelf is lost. So you want to replicate every shard as well. Basically the notion that
each shard is a replica set
is not strictly true. But if you are sharding you should set up replication for every shard, because the more shards you have, the higher the probability that at least one will die.
Answering Saad's followup answer:
Although you can have shards and replicas together on the same server, this is not the recommended way of doing it. Each server should have a single role in the system. If, for example, you decide to have 2 shards and to replicate each 3 times, you will end up with 6 machines.
I know that this might sound too costly, but you have to remember that this is commodity hardware, and if the service you are providing is already so good that you are thinking about high availability and it does not fit on one machine, then this is a rather cheap price to pay (in comparison to one big dedicated machine).
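The numbers in this answer can be checked with a little arithmetic. The sketch below assumes one mongod per machine, that every replica set member holds a full copy of its shard (no arbiters), and an even split between shards, which, as noted above, real data does not guarantee.

```javascript
// Back-of-the-envelope sizing; function and field names are illustrative only.
function clusterFootprint(dataGB, shards, membersPerReplicaSet) {
  return {
    machines: shards * membersPerReplicaSet,     // one role per server
    avgPerShardGB: dataGB / shards,              // even split assumed
    totalStoredGB: dataGB * membersPerReplicaSet // every copy counted
  };
}
```

For 75 GB, replication alone (1 shard, 3 members) gives 3 machines with 75 GB each and 225 GB stored in total; 2 shards replicated 3 times gives 6 machines, about 37.5 GB per member, and again 225 GB in total.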
I am writing this as an answer, but it is actually a question about #Salvador Sir's answer.
As you said, with sharding the 75 GB of data "may be" stored as 25 GB on server-1, 25 GB on server-2 and 25 GB on server-3 (this distribution depends on the shard key). Then, to prevent data loss, we also need to replicate each shard. So this means every server now contains its own shard and also the replicas of the shards on the other servers, meaning server-1 will have:
1) Its own shard.
2) A replica of the shard on server-2
3) A replica of the shard on server-3
The same goes for server-2 and server-3. Am I right? If this is the case, then each server again has 75 GB of data. Right or wrong?
Since we want to make 3 shards and also replicate the data, the following is the solution to the above problem.
If a server has both a shard and a replica set, then the failure of that server will lead to the loss of both the replica set and the shard.
However, you can have shard 1 and the replicas of shard 2 and shard 3 on the same server, but this is not advisable.
Sharding is like partitioning data.
Let's say you have around 3 GB of data and you define 3 shards, so each shard MIGHT take 1 GB of data (it truly depends on the shard key).
Why is sharding needed? Searching for specific data in 3 GB is more work than searching in 1 GB, so it is similar to partitioning, and sharding helps with fast access to data.
Now, coming to replicas: let's say you have the same 3 GB of data without any replication (meaning only a single copy of the data exists), so if anything happens to that machine or drive, your data is gone. Replication solves this problem. Let's say that when you set up the DB you chose a replication factor of 3, which means the same 3 GB of data is available 3 times (so the total size would be 9 GB, made up of three 3 GB copies). Replication helps with failover.

Does auto-sharding in MongoDB work on shards with many small collections/small databases

In the MongoDB documentation for auto-sharding it says: "Sharding is performed on a per-collection basis. Small collections need not be sharded."
Our business has many databases (~100), with many small collections (~30), each with a document count of 1 - 3000. Our DB system is looking at approximately 100,000,000 page views per month.
In that scenario will sharding ever activate since the collections are never big enough even though the DB usage and site traffic is certainly high enough to require load balancing. From the docs I can't seem to find a clear answer.
Whether it makes sense to shard depends a little bit on whether you have mostly writes or reads to the database. Sharding is primarily used for write-scaling, but if you are not doing a lot of writes, then simply using replicasets with "slaveOkay" for the reads might work just as well.
From the numbers that you provided you seem to get about 9 million documents, but are they large documents? If they easily fit in memory, then there is most likely not even going to be a need for replicasets besides for failover capabilities.
This is hard to answer without knowing more about your use case, but I'll give it a shot.
Are you sure sharding is what you need? What does your insert rate look like?
If you are going to have a static set of data, or even a relatively static set, then you probably don't need to shard, you could simply use more secondaries and enable slaveOK reads. The reads will be distributed to the various secondaries and scale up your read capacity.
If that is not the case, and you do need to shard, then there are options. But first, to explain briefly and at a high level how automatic sharding works:
The mongos process is responsible for splitting and migrating chunks in general. These are two separate operations - splitting and balancing.
Splits occur when the mongos sees that a certain portion of the maximum chunk size has been written; it initiates a split if there is in fact enough data to warrant it. Over time, with enough data written, the number of chunks grows.
Balancing occurs when there is an imbalance of chunks between shards (the threshold is currently 8 in 2.0, though it is moving to a more dynamic heuristic in 2.2). The balancer migrates chunks between the shards until a balance is achieved.
So, you need to be writing enough data relative to the max chunk size (default is 64MB in 2.0) to generate the chunks needed for the balancer to move them around appropriately. If that is not going to happen with your data, then you can look at:
Decreasing the chunk size (has drawbacks too - http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations)
Manually split/move the chunks
For the manual instructions see:
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
http://www.mongodb.org/display/DOCS/Moving+Chunks
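To see why small collections may never give the balancer anything to move, a rough chunk-count estimate helps. This is a sketch under loud assumptions: the 1 KB average document size is made up, and real splits are triggered earlier than a full chunk, so treat the result as a rough lower bound.

```javascript
// Rough lower bound on how many chunks a collection occupies.
function estimateChunks(docCount, avgDocBytes, chunkSizeMB) {
  const totalMB = (docCount * avgDocBytes) / (1024 * 1024);
  // A sharded collection always has at least one chunk.
  return Math.max(1, Math.ceil(totalMB / chunkSizeMB));
}
```

A 3000-document collection at the assumed 1 KB per document is under 3 MB, so estimateChunks(3000, 1024, 64) is a single chunk; lowering the chunk size to 1 MB gives 3, which is still far below the imbalance needed to trigger the balancer.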

MongoDB chunks selection to move

Say a shard has 200 chunks on it and it is time to move some of those chunks to another shard:
1> How does MongoDB decide which chunks to move?
2> Is this move logic in the config server or in mongos?
3> How can I affect/control the chunk selection algorithm so that MongoDB moves chunks to other shards in a way that helps distribute my reads based on my users' access patterns?
Movement of chunks between shards is triggered by mongos. Mongos will move chunks under two circumstances. If one shard contains 9 or more chunks than any other, mongos will trigger a balancing round and redistribute the chunks between the other shards. In this situation, the chunks with the lowest shard keys will be moved. Additionally, if the top most chunk is split, mongos will move the chunk with the higher shard key to another shard.
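The selection rule described above can be sketched as follows. This is a simplified model, not the balancer's real code: chunks are represented only by a numeric lower-bound key, and the threshold of 9 is taken from the description above.

```javascript
// shards: { shardName: [chunkMinKey, ...] } with numeric chunk lower bounds.
function nextMove(shards, threshold = 9) {
  const names = Object.keys(shards);
  let most = names[0];
  let least = names[0];
  for (const n of names) {
    if (shards[n].length > shards[most].length) most = n;
    if (shards[n].length < shards[least].length) least = n;
  }
  // No balancing round unless the gap reaches the threshold.
  if (shards[most].length - shards[least].length < threshold) return null;
  // Move the chunk with the lowest shard key range off the fullest shard.
  const chunk = Math.min(...shards[most]);
  return { chunk, from: most, to: least };
}
```

Because the choice depends only on chunk counts and key order, the access pattern of your users plays no part in it, which is why steering reads requires manual moves or a custom balancer.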
One of the features of Mongo is that in a properly set-up sharded cluster, chunks are split and moved automatically such that your application does not need to be aware that it is interacting with a sharded database. Everything happens behind-the-scenes.
However, it is possible to split and move chunks manually using the "split" and "moveChunk" commands. Please see the mongo documentation for examples of how to use these commands: "Splitting Shard Chunks" http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks and "Moving Chunks" http://www.mongodb.org/display/DOCS/Moving+Chunks
There have been cases where users have written their own custom balancers tailored to their own applications, but this is not common, and is only attempted by the most advanced of users.
As an alternative, it is possible to give the balancer a window of time in which it may operate, or to disable it entirely. Some users will temporarily disable the balancer for maintenance, or give it a window so it is not competing for write locks at times when they expect their application to be putting the db under high load.
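A balancing window is just a start and stop time of day, and the one subtle point is a window that wraps past midnight. The helper below is a hypothetical illustration of that check, not part of MongoDB; it assumes times given as "HH:MM" strings.

```javascript
// Convert "HH:MM" (or "H:MM") to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// True if `now` falls inside [start, stop), allowing the window to wrap midnight.
function inWindow(now, start, stop) {
  const n = toMinutes(now);
  const s = toMinutes(start);
  const e = toMinutes(stop);
  return s <= e ? n >= s && n < e : n >= s || n < e;
}
```

With a window from 23:00 to 6:00, inWindow("23:30", "23:00", "6:00") is true while inWindow("12:00", "23:00", "6:00") is false, so balancing stays out of the way during the daytime peak.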
More information on the balancer is available in the "Balancing" and "Balancer window" sections of the "Sharding Administration" documentation.
http://www.mongodb.org/display/DOCS/Sharding+Administration
Hopefully the above resources will give you a better understanding of how sharding works, and how chunks are balanced between shards.