Chunk could not split when the chunk size is greater than the specified chunk size - MongoDB

Here is the situation:
There is a chunk with the shard key range [10001, 10030], but currently only one key (e.g. 10001) has data; the key range [10002, 10030] is just empty. The chunk's data is beyond 8M, and we have set the chunk size to 8M.
After we fill in data for the key range [10002, 10030], this chunk starts to split, and it stops at a key range like `[10001, 10003]`, i.e. it has two keys, and we wonder whether this is OK or not.
From the documentation on the official site we thought that a chunk might NOT contain more than ONE key.
So, could you please help us confirm whether this is OK or not?
What we want is to split the chunks into as many chunks as possible, so as to make sure the data is balanced.

There is a notion called jumbo chunks. Every chunk which exceeds its specified size, or has more documents than the configured maximum, is considered a jumbo chunk.
Since MongoDB usually splits a chunk when about half the chunk size is reached, I take jumbo chunks as a sign that something is rather wrong with the cluster.
The most likely reason for jumbo chunks is that one or more config servers weren't available for a time.
Metadata updates need to be written to all three config servers (they don't form a replica set), so metadata updates cannot be made while one of the config servers is down. Both chunk splitting and migration need a metadata update. So when one config server is down, a chunk cannot be split early enough; it will grow in size and ultimately become a jumbo chunk.
Jumbo chunks aren't automatically split, even when all three config servers are available. The reason for this is... well, IMHO MongoDB plays it a little safe here. And jumbo chunks aren't moved, either. The reason for this is rather obvious: moving data which in theory can have any size greater than 16MB is simply too costly an operation.
Proceed at your own risk! You have been warned!
Since you can identify the jumbo chunks, they are pretty easy to deal with.
Simply identify the key range of the chunk and use it within
sh.splitFind("database.collection", query)
This will identify the chunk in question and split it in half, which is quite important. Please, please read Split Chunks in a Sharded Cluster, and make sure you understand all of it and its implications, before trying to split chunks manually.
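As a rough illustration, assuming a sharded collection test.users with shard key { uid: 1 } (both names are invented for this example), locating a flagged chunk and splitting it could look like this in the mongo shell:

// Inspect the cluster metadata through a mongos.
use config
// Chunks the balancer has flagged as too large carry a "jumbo" flag.
db.chunks.find({ ns: "test.users", jumbo: true }).forEach(printjson)

// Pick a value that falls inside the offending chunk's range and let
// mongos split that chunk around its median key.
sh.splitFind("test.users", { uid: 12345 })

// Re-check the chunk layout afterwards.
sh.status()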

Related

servers with different hard drive sizes scenario

My scenario is that I have, for example, 2 servers (shards), one with a bigger hard drive than the other. So if one is 500GB and the other is 1TB and the first gets full with data, what happens when I add more data to the servers? Will the balancer know that the first is full and transfer the extra data from the first server to the second?
No. The balancer will try to partition the chunks evenly across all shards.
First, your smaller shard will not simply get filled up first. Over time you will probably have a similar amount of chunks and data on both shards. This is why it is recommended to use servers with similar specs.
Nevertheless, if you want to partition your chunks in a ratio of two to one, you can do one of the following:
Change the Maximum Storage Size for a Given Shard. Use a ratio of 2 to 1.
Use Tag Aware Sharding, which is the more manageable, predictable option.
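For illustration, here is roughly what the two options look like in the mongo shell; the shard names, namespace, and key ranges below are invented for the example:

// Option 1: cap the smaller shard at ~500 GB when adding it (maxSize is in MB).
db.adminCommand({ addShard: "rs_small/host1:27017", maxSize: 500000 })

// Option 2: tag-aware sharding - pin key ranges to tagged shards
// (assumes a numeric shard key named "uid"; adjust the ranges to your data).
sh.addShardTag("rs_small", "small")
sh.addShardTag("rs_big", "big")
sh.addTagRange("mydb.mycoll", { uid: MinKey }, { uid: 1000 }, "small")
sh.addTagRange("mydb.mycoll", { uid: 1000 }, { uid: MaxKey }, "big")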

How to remove chunks from mongodb shard

I have a collection where the shard key is a UUID (hexadecimal string). The collection is huge: 812 million documents, about 9600 chunks on 2 shards. For some reason I initially stored documents which had an integer instead of a UUID in the shard key field. Later I deleted them completely, and now all of my documents are sharded by UUID. But I am now facing a problem with chunk distribution. While I had documents with integers instead of UUIDs, the balancer created about 2700 chunks for these documents, and left all of them on one shard. When I deleted all these documents, the chunks were not deleted; they stayed empty and they will always be empty because I only use UUIDs now. Since the balancer distributes chunks relying on the chunk count per shard, not the document count or size, one of my shards takes 3 times more disk space than the other:
--- Sharding Status ---
db.click chunks:
set1 4863
set2 4784 // 2717 of them are empty
set1> db.click.count()
191488373
set2> db.click.count()
621237120
The sad thing here is that MongoDB does not provide commands to remove or merge chunks manually.
My main question is: would any of the following work to get rid of the empty chunks?
Stop the balancer. Connect to each config server, remove the ranges of the empty chunks from config.chunks, and also fix the minKey slice to end at the beginning of the first non-empty chunk. Start the balancer.
Seems risky, but as far as I can see, config.chunks is the only place where chunk information is stored.
Stop the balancer. Start a new mongod instance and connect it as a 3rd shard. Manually move all empty chunks to this new shard, then shut it down forever. Start the balancer.
Not sure, but as long as I don't use integer values in the shard key again, all queries should run fine.
Some might read this and think that the empty chunks are occupying space. That's not the case - chunks themselves take up no space - they are logical ranges of shard keys.
However, chunk balancing across shards is based on the number of chunks, not the size of each chunk.
You might want to add your voice to this ticket: https://jira.mongodb.org/browse/SERVER-2487
Since the MongoDB balancer only balances the number of chunks across shards, having too many empty chunks in a collection can cause the shards to be balanced by chunk count but severely unbalanced by data size per shard (e.g., as shown by db.myCollection.getShardDistribution()).
You need to identify the empty chunks and merge them into chunks that have data. This will eliminate the empty chunks. This is all documented in the MongoDB docs (at least for 3.2 and above, maybe even earlier versions).
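Roughly, that approach is to find the chunks that contain no data and merge each empty run into a neighbouring chunk. A sketch in the mongo shell, using the namespace db.click from the question and assuming the shard key field is named uuid (the merge bounds are placeholders - take the real min/max values from config.chunks, and the chunks being merged must be contiguous and on the same shard):

// List chunks of the collection that hold no data.
use config
db.chunks.find({ ns: "db.click" }).forEach(function(chunk) {
    var res = db.getSiblingDB("db").runCommand({
        dataSize: "db.click",
        keyPattern: { uuid: 1 },
        min: chunk.min,
        max: chunk.max
    });
    if (res.size === 0) {
        print("empty chunk: " + tojson(chunk.min) + " to " + tojson(chunk.max));
    }
});

// Merge a contiguous range of chunks (run against a mongos, admin database).
db.adminCommand({
    mergeChunks: "db.click",
    bounds: [ { uuid: "00000000000000000000000000000000" },
              { uuid: "10000000000000000000000000000000" } ]
})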

how to know or compute the current chunk size for a mongodb shard?

In fact, the following is what I really want to ask:
The configured chunk size is 1M. When one config server is down, the config servers as a whole become read-only. The current cluster has only one chunk. If I want to insert a lot of data into this cluster, more than 1M in total, can I successfully insert this data?
If yes, does that mean the real chunk size can be greater than the configured chunk size?
Thanks!
Short answer: yes you can, but you should fix your config server(s) asap to avoid unbalancing your shards for long.
Chunks are split automatically when they reach their size threshold - stated here.
However, during a config server failure, chunks cannot be split. Even if just one server fails. See here.
Edits
As stated by Sergio Tulentsev, you should fix your config server(s) before performing your insert. The system's metadata will remain read-only until then.
As Adam C's link points out, your shard will become unbalanced if you were to perform an insert like you describe before fixing your config server(s).
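To address the title question directly: a quick way to see the configured chunk size, and to measure how much data a particular chunk actually holds, is sketched below. The namespace test.data and shard key { k: 1 } are placeholders, and the min/max values would come from that chunk's entry in config.chunks:

// Configured chunk size in MB; no document means the default of 64 MB applies.
use config
db.settings.find({ _id: "chunksize" })

// Real data size of one chunk, given its boundaries.
db.getSiblingDB("test").runCommand({
    dataSize: "test.data",
    keyPattern: { k: 1 },
    min: { k: 0 },
    max: { k: 100 }
})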

Does auto-sharding in MongoDB work on shards with many small collections/small databases

In the MongoDB documentation for auto-sharding it says: "Sharding is performed on a per-collection basis. Small collections need not be sharded."
Our business has many databases (~100), each with many small collections (~30), each collection holding between 1 and 3000 documents. Our DB system is looking at approximately 100,000,000 page views per month.
In that scenario, will sharding ever activate, since the collections are never big enough, even though the DB usage and site traffic are certainly high enough to require load balancing? From the docs I can't seem to find a clear answer.
Whether it makes sense to shard depends a little bit on whether you mostly write to or read from the database. Sharding is primarily used for write scaling, but if you are not doing a lot of writes, then simply using replica sets with "slaveOkay" for the reads might work just as well.
From the numbers that you provided you seem to have about 9 million documents, but are they large documents? If they easily fit in memory, then there is most likely not even going to be a need for replica sets besides failover capabilities.
This is hard to answer without knowing more about your use case, but I'll give it a shot.
Are you sure sharding is what you need? What does your insert rate look like?
If you are going to have a static set of data, or even a relatively static one, then you probably don't need to shard; you could simply use more secondaries and enable slaveOK reads. The reads will be distributed to the various secondaries and scale up your read capacity.
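For example, allowing reads from secondaries in the mongo shell looks roughly like this; newer drivers express the same idea as a read preference such as secondaryPreferred, and the collection name below is made up:

// Allow this connection to read from a secondary.
rs.slaveOk()                  // shorthand in older shells
db.getMongo().setSlaveOk()    // equivalent per-connection setting

// Reads on this connection may now be served by secondaries.
db.pageviews.find({ site: "example.com" }).limit(10)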
If that is not the case, and you do need to shard, then there are options. But first, to explain briefly and at a high level how automatic sharding works:
The mongos process is responsible for splitting and migrating chunks in general. These are two separate operations - splitting and balancing.
Splits occur when the mongos sees that a certain portion of the maximum chunk size has been written to a chunk; it initiates a split if there is in fact enough data to warrant it. Over time, with enough data written, the number of chunks grows.
Balancing occurs when there is an imbalance of chunks (currently 8 in 2.0, though moving to a more dynamic heuristic in 2.2). The balancer migrates chunks around the shards until a balance is achieved.
So, you need to be writing enough data relative to the max chunk size (default is 64MB in 2.0) to generate the chunks needed for the balancer to move them around appropriately. If that is not going to happen with your data, then you can look at:
Decreasing the chunk size (has drawbacks too - http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations)
Manually split/move the chunks (a shell sketch of both options follows the links below)
For the manual instructions see:
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
http://www.mongodb.org/display/DOCS/Moving+Chunks
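A rough sketch of both options in the mongo shell (the namespace, shard key values, and shard name are invented for the example):

// Option 1: lower the cluster-wide chunk size, e.g. to 32 MB.
use config
db.settings.save({ _id: "chunksize", value: 32 })

// Option 2: split a chunk at a chosen shard key value and move it by hand.
sh.splitAt("mydb.small_coll", { _id: 5000 })
sh.moveChunk("mydb.small_coll", { _id: 5000 }, "shard0001")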

mongo DB Balancing

I have a collection with a shard key and an index enabled. But when I run balancing, the chunks are not moved for this collection, whereas the other collections' chunks are moving as expected to other machines. Only one chunk is moved from this collection.
Currently (this will change in the near future), the balancer will only start moving chunks when there is a sufficient imbalance (8 or more). If the chunk counts are closer than that, then there will be no movement. The number of chunks is dependent on the max chunk size (64MB at the time of writing this in 2.0.x) and the amount of data written. There is a split triggered every time a certain amount of data is written to a chunk.
So, if you are not adding data to the collection, or the data is not very large, it can take some time to create the number of chunks necessary to trigger a balancing round.
You can take this into your own hands by manually splitting and moving a chunk:
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
Or, you can add more data to trigger splits and the balancer will eventually kick in and move the chunks around for you.
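To see whether you are below that imbalance threshold, you can count the chunks per shard and check the balancer state from a mongos. A sketch, with mydb.mycoll standing in for your collection (the aggregation helper assumes a reasonably recent shell):

// Is the balancer enabled, and is a balancing round currently running?
sh.getBalancerState()
sh.isBalancerRunning()

// Chunks per shard for the collection in question.
use config
db.chunks.aggregate([
    { $match: { ns: "mydb.mycoll" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } }
])

// Document and data distribution across shards.
db.getSiblingDB("mydb").mycoll.getShardDistribution()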