Behaviour of the balancer in MongoDB sharding

I was experimenting with mongo sharding. The collection's shard key is {policyId, startTime}.
policyId - a Java UUID with a limited set of values (let's say 50)
startTime - a monotonically increasing timestamp.
After inserting around 30M documents (32 GB) into the collection, the data distribution was:
shard key: { "policyId" : 1, "startDate" : 1 }
unique: false
balancing: true
chunks:
sharda 63
shardb 138
During insertion, sh.isBalancerRunning() returned false. When I stopped inserting documents, the balancer started moving chunks, and after that I got an even distribution of data.
Below are my concerns/questions regarding the balancer:
1. It seems the balancer only becomes active and starts moving chunks once insertion stops. If I keep inserting data for a longer duration, more chunks will be created and the data will become more skewed, and chunk migration will then itself take more time to balance the shards. So how does mongo decide when to migrate chunks?
2. I noticed spikes in write latency once insertion passed about 20M docs. Does that mean the balancer was moving some of the chunks intermittently?
3. The count API gives inconsistent results during chunk migration because the balancer copies a chunk to another shard and then deletes the old copy. Should we expect the find API to also give incorrect results (duplicate docs)?
If possible, could anyone share documentation or a blog post on the mongo balancer for better understanding?

Your assumption (that the balancer only becomes active and starts moving chunks once insertion stops) is wrong. The balancer process automatically migrates chunks whenever there is an uneven distribution of a sharded collection's chunks across the shards.
Migration is not a continuous or steady process; automatic migration happens only when it is required. For details see https://docs.mongodb.com/v3.0/core/sharding-balancing/#sharding-migration-thresholds
Reads during migration will not give incorrect results: no duplicate records should come back via the find API.
For more about the balancer, see https://docs.mongodb.com/manual/core/sharding-balancer-administration/
About migration, see https://docs.mongodb.com/v3.0/core/sharding-chunk-migration/
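The migration-threshold rule from that v3.0 page can be sketched in a few lines. This is a simulation only (not a mongo API); the threshold values are the ones documented for the legacy balancer, and the chunk counts are the ones from the question:

```javascript
// Legacy migration-threshold rule (v3.0 docs): the balancer acts only
// when the chunk-count gap between the most- and least-loaded shards
// reaches a threshold that depends on the collection's total chunk count.
function migrationThreshold(totalChunks) {
  if (totalChunks < 20) return 2;
  if (totalChunks < 80) return 4;
  return 8;
}

function balancerShouldMigrate(chunkCounts) {
  const total = chunkCounts.reduce((a, b) => a + b, 0);
  const gap = Math.max(...chunkCounts) - Math.min(...chunkCounts);
  return gap >= migrationThreshold(total);
}

// The asker's distribution: sharda 63, shardb 138 -> gap 75, total 201
console.log(balancerShouldMigrate([63, 138])); // true: gap 75 >= threshold 8
```

So the balancer was already entitled to migrate during insertion; whether it keeps up depends on migration speed versus insert rate, not on inserts being stopped.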

There are various things to consider:
Default chunk size - 64 MB
Cardinality - if cardinality is high and no single shard-key value will grow beyond 64 MB over time (assume you store one or more years of data), then you don't have to worry. Otherwise you will probably have to increase the default chunk size.
Suppose you have 2 shards and the cardinality of a hashed key is 100: then roughly 50 values' worth of data will go to one shard and the other 50% to the other. With a ranged key, values 0-50 go to one shard and 50-100 to the other.
Now suppose your current chunk covering values A to F reaches 64 MB: the chunk will be split, and data may be moved to the other shard.
If your cardinality is low, the value A by itself can exceed 64 MB; such a chunk cannot be split and is marked as a jumbo chunk.
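The low-cardinality point can be sketched as follows. This is a toy simulation (not a real mongo API): a chunk can only split between distinct shard-key values, so a chunk holding a single value has no split point.

```javascript
// Candidate split points are the boundaries between distinct shard-key
// values inside a chunk; a split can occur before each later value.
function splitPoints(shardKeyValues) {
  const distinct = [...new Set(shardKeyValues)].sort();
  return distinct.slice(1);
}

// A chunk is "jumbo" when it exceeds the max chunk size but has no
// split point, i.e. all its documents share one shard-key value.
function isJumbo(shardKeyValues, chunkBytes, maxChunkBytes) {
  return chunkBytes > maxChunkBytes && splitPoints(shardKeyValues).length === 0;
}

const maxChunk = 64 * 1024 * 1024;
console.log(splitPoints(["A", "B", "C"]));              // [ 'B', 'C' ] -> splittable
console.log(isJumbo(["A", "A", "A"], 100e6, maxChunk)); // true: one value, > 64 MB
```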

Related

What happens if the chunk size goes beyond the limit (64 MB) for a single shard key in mongodb

We have a sharded cluster of mongodb. The shard key is sellerId, and we have nearly 20k sellers. We capture responses for sellers, and some sellers may have a huge response set. Now let's say seller 10001 has some very good listings and got millions of responses; in that case the single shard-key value 10001 holds huge data and goes beyond the default 64 MB. As per the mongo documentation there can be only one chunk per unique shard-key value. What will happen with this chunk? Does the chunk size automatically increase?

Inserting data into an empty sharded database in mongo when the balancer is not enabled results in all data on one shard

We have 2 mongo db shard servers (a 3-member replica set each).
We created a sharded collection and inserted 200k documents. The balancer was disabled in that window; we enabled it after the first test and started inserting again.
In the first test all data was inserted into one shard and we got lots of warnings in the mongo log:
splitChunk cannot find chunk [{ articleId: MinKey, sessionId: MinKey },{ articleId: "59830791", sessionId: "fb0ccc50-3d6a-4fc9-aa66-e0ccf87306ea" }) to split, the chunk boundaries may be stale
The reason mentioned in the log is a possible low-cardinality shard key.
After the second and third tests, when the balancer was on, data was balanced across both shards.
We did one more test and stopped the balancer again; this time data was going to both shards even though the balancer was off (the pageIds were reader ids repeated from the old tests, along with some new ids for both).
Could you please explain how this mechanism works? Data should go to both shards no matter whether the balancer is ON or OFF when the key's cardinality is good.
The shard key is (pageid, unique readerid).
Below are the insertion stats:
Pages read in the test window: 200k
Unique page IDs: 2000
Unique sessions reading pages in the window: 70000
Thanks in advance!
When you enable sharding for a database, a primary shard is assigned to that database.
If you insert data with the balancer disabled, all the data goes into the primary shard. The mongo split process calculates split points as your data grows, and chunks get created.
Since your balancer is disabled, all those chunks remain on the same shard.
If your balancer is enabled, it will balance those chunks between the shards, which results in better data distribution.
"We did one more test and stopped balancer again in this test, data was going in both shards even balancer was off (pageIds were reader ids which are repeated from old tests along with some new ids for both)"
By then the data is already distributed in chunks, and those chunks are well distributed between the 2 shards. Since the range of your shard key is also spread evenly among the chunks, any new document goes into its owning chunk, which leads to even data distribution.
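The routing behaviour this answer describes can be sketched like this. The ranges and shard names below are hypothetical (real routing metadata lives in config.chunks); the point is that mongos routes each write to whichever existing chunk owns the shard-key value, while the balancer only moves chunks around:

```javascript
// Hypothetical chunk ranges already spread over two shards.
const chunks = [
  { min: 0,  max: 50,  shard: "shardA" },
  { min: 50, max: 100, shard: "shardB" },
];

// mongos-style routing: find the chunk whose [min, max) range owns
// the shard-key value; the balancer plays no part in this decision.
function routeInsert(shardKeyValue) {
  const chunk = chunks.find(c => shardKeyValue >= c.min && shardKeyValue < c.max);
  return chunk.shard;
}

console.log(routeInsert(10)); // shardA
console.log(routeInsert(75)); // shardB -- even with the balancer off
```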

mongodb sharding - chunks do not have the same size

I am new to playing with mongodb.
Since I have to store about 50 million documents, I set up a mongodb shard cluster with two replica sets.
The document looks like this:
{
    "_id" : "predefined_unique_id",
    "appNr" : "abcde",
    "modifiedDate" : ISODate("2016-09-16T13:00:57.000Z"),
    "size" : NumberLong(803),
    "crc32" : NumberLong(538462645)
}
The shard key is appNr (chosen for query performance reasons: all documents with the same appNr have to stay within one chunk).
Usually multiple documents share the same appNr.
After loading about two million records, I see the chunks are evenly balanced by count; however, running db.my_collection.getShardDistribution() gives:
Shard rs0 at rs0/...
data : 733.97MiB docs : 5618348 chunks : 22
estimated data per chunk : 33.36MiB
estimated docs per chunk : 255379
Shard rs1 at rs1/...
data : 210.09MiB docs : 1734181 chunks : 19
estimated data per chunk : 11.05MiB
estimated docs per chunk : 91272
Totals
data : 944.07MiB docs : 7352529 chunks : 41
Shard rs0 contains 77.74% data, 76.41% docs in cluster, avg obj size on shard : 136B
Shard rs1 contains 22.25% data, 23.58% docs in cluster, avg obj size on shard : 127B
My question is: what settings should I change in order to get the data equally distributed between the shards? I would like to understand how the data gets split into chunks. I have defined a ranged shard key and a chunk size of 264 MB.
MongoDB uses the shard key associated with the collection to partition the data into chunks. A chunk consists of a subset of the sharded data, and each chunk has an inclusive lower and exclusive upper range based on the shard key.
The mongos routes writes to the appropriate chunk based on the shard-key value. MongoDB splits chunks when they grow beyond the configured chunk size; both inserts and updates can trigger a chunk split.
The smallest range a chunk can represent is a single unique shard key
value. A chunk that only contains documents with a single shard key
value cannot be split.
Chunk size has a major impact on the shards.
The default chunk size in MongoDB is 64 megabytes. We can increase or reduce it, but the chunk size should only be modified after considering the points below:
Small chunks lead to a more even distribution of data at the expense of more frequent migrations. This creates expense at the query routing (mongos) layer.
Large chunks lead to fewer migrations. This is more efficient both from the networking perspective and in terms of internal overhead at the query routing layer. But, these efficiencies come at the expense of a potentially uneven distribution of data.
Chunk size affects the Maximum Number of Documents Per Chunk to Migrate.
Chunk size affects the maximum collection size when sharding an existing collection. Post-sharding, chunk size does not constrain collection size.
Given this information and your shard key "appNr", the skew most likely comes from the chunk size.
Try lowering the chunk size from the 264 MB you have currently and see whether the document distribution changes. This is a trial-and-error approach, though, and may take a considerable amount of time and iterations.
Reference : https://docs.mongodb.com/v3.2/core/sharding-data-partitioning/
Hope it Helps!
I'll post my findings here - maybe they will have some further use.
The mongodb documentation says that a chunk gets split "when it grows beyond the specified chunk size".
I think the documentation is not fully accurate, or rather incomplete.
When mongo does auto-splitting, the splitVector command asks the primary shard for split points and then splits accordingly. This first happens when about 20% of the specified chunk size is reached and, if no split points are found, it retries at 40%, 60% and so on, so splitting should not wait for the max size.
In my case this worked fine for the first half of the chunks, but for the second half the split happened only after the max chunk size was exceeded. I still have to investigate why the split didn't happen earlier, as I see no reason for this behaviour.
After the chunks are split, the balancer starts. It divides the chunks equally across the shards without considering chunk size (a chunk with 0 documents counts the same as a chunk with 100 documents in this regard). The chunks are moved in the order of their creation.
My problem was that the second half of the chunks was almost twice the size of the first half, so as the balancer always moved the first half of the chunk collection to the other shard, the cluster became unbalanced.
I found a much better explanation here.
To fix it, I changed the shard key to "hashed".
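The count-vs-size effect described in these findings can be sketched with toy numbers (the chunk sizes below are made up for illustration):

```javascript
// The balancer evens out chunk *counts* and moves chunks in creation
// order, ignoring their size. If the second half of the chunks is
// twice as big as the first, the shards end up balanced by count but
// unbalanced by bytes.
const firstHalf  = Array(10).fill(16);  // 10 early chunks of 16 MB each
const secondHalf = Array(10).fill(32);  // 10 later chunks of 32 MB each

// Balancer-style outcome: equal chunk counts per shard, size ignored.
const shardA = firstHalf;               // 10 chunks, 160 MB
const shardB = secondHalf;              // 10 chunks, 320 MB

const sum = a => a.reduce((x, y) => x + y, 0);
console.log(shardA.length === shardB.length); // true: balanced by count
console.log(sum(shardA), sum(shardB));        // 160 320 -- unbalanced by size
```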

Migrating chunks from the primary shard to the only other shard takes too long

Each chunk move takes about 30-40 minutes.
The shard key is a random-looking but monotonically increasing integer string (a long sequence of digits), and a "hashed" index is created on that field.
There are 150M documents, each about 1.5 KB in size. The sharded collection has 10 indexes (some of them compound).
I have a total of ~11k chunks reported in sh.status(). So far I have only been able to transfer 42 of them to the other shard.
The system consists of one mongos, one config server, one primary (mongod) shard and one other (mongod) shard, all on the same server, which has 8 cores and 32 GB of RAM.
I know the ideal is to use separate machines, but none of the CPUs were being utilized, so I thought it was good for a start.
What is your comment? What do I need to investigate? Is this normal?
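Some rough arithmetic on the numbers in the question shows why this cannot finish in reasonable time at the observed pace:

```javascript
// ~11k chunks total; to even out two shards roughly half must move.
// At the observed 30-40 minutes per chunk move (midpoint 35 min):
const totalChunks = 11000;
const chunksToMove = totalChunks / 2;
const minutesPerChunk = 35;
const days = (chunksToMove * minutesPerChunk) / 60 / 24;
console.log(Math.round(days)); // ~134 days of non-stop migration
```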
As said on the mongodb documentation : " Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations."
You should definitely not have your shards on the same machine; it is useless. The whole point of sharding is horizontal scaling, so if you shard on the same machine you are just killing your throughput.
Your database will be faster without sharding if you only have one machine.
To avoid data loss, before resorting to sharding you should use: RAID (not RAID 0), then replica sets, and only then sharding.

How to remove chunks from a mongodb shard

I have a collection whose shard key is a UUID (hexadecimal string). The collection is huge: 812 million documents in about 9600 chunks on 2 shards.
For some reason I initially stored documents that had an integer instead of a UUID in the shard-key field. Later I deleted them completely, and now all of my documents are sharded by UUID. But I am now facing a problem with chunk distribution. While I had documents with integers instead of UUIDs, the balancer created about 2700 chunks for those documents and left all of them on one shard. When I deleted those documents, the chunks were not deleted; they stayed empty and will always remain empty because I only use UUIDs now. Since the balancer distributes chunks based on chunk count per shard, not document count or size, one of my shards takes 3 times more disk space than the other:
--- Sharding Status ---
db.click chunks:
set1 4863
set2 4784 // 2717 of them are empty
set1> db.click.count()
191488373
set2> db.click.count()
621237120
The sad thing here is that mongodb does not provide commands to remove or merge chunks manually.
My main question is: would either of the following work to get rid of the empty chunks?
1. Stop the balancer. Connect to each config server, remove the ranges of the empty chunks from config.chunks, and also fix the minKey slice so it ends at the beginning of the first non-empty chunk. Start the balancer.
This seems risky, but as far as I can see, config.chunks is the only place where chunk information is stored.
2. Stop the balancer. Start a new mongod instance and connect it as a 3rd shard. Manually move all the empty chunks to this new shard, then shut it down forever. Start the balancer.
I'm not sure about this one, but as long as I don't use integer values in the shard key again, all queries should run fine.
Some might read this and think that the empty chunks are occupying space. That's not the case - chunks themselves take up no space - they are logical ranges of shard keys.
However, chunk balancing across shards is based on the number of chunks, not the size of each chunk.
You might want to add your voice to this ticket: https://jira.mongodb.org/browse/SERVER-2487
Since the mongodb balancer only balances the chunk count across shards, having too many empty chunks in a collection can leave the shards balanced by chunk number but severely unbalanced by data size per shard (e.g., as shown by db.myCollection.getShardDistribution()).
You need to identify the empty chunks and merge them into neighbouring chunks that have data, which eliminates them. This is all now documented in the MongoDB docs (at least 3.2 and above, maybe even earlier).
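A sketch of that approach: scan the chunk metadata for runs of empty chunks and compute the bounds you would pass to the mergeChunks admin command, e.g. db.adminCommand({ mergeChunks: "db.click", bounds: [ <min>, <max> ] }). The chunk data below is made up for illustration; on a real cluster the ranges come from config.chunks and the doc counts from per-chunk queries:

```javascript
// Hypothetical chunk metadata: contiguous [min, max) ranges plus a
// document count for each chunk.
const chunks = [
  { min: 0,  max: 10, docs: 120 },
  { min: 10, max: 20, docs: 0 },   // empty
  { min: 20, max: 30, docs: 0 },   // empty
  { min: 30, max: 40, docs: 85 },
];

// Merge each run of empty chunks into its non-empty neighbour on the
// left, returning the [min, max] bounds of each merged range.
function mergeBounds(chunks) {
  const bounds = [];
  let i = 0;
  while (i < chunks.length) {
    if (chunks[i].docs === 0) {
      const start = Math.max(i - 1, 0);  // include the left neighbour
      let j = i;
      while (j < chunks.length && chunks[j].docs === 0) j++;
      bounds.push([chunks[start].min, chunks[j - 1].max]);
      i = j;
    } else i++;
  }
  return bounds;
}

console.log(mergeBounds(chunks)); // [ [ 0, 30 ] ]
```

Note that mergeChunks requires the chunks in the given bounds to be contiguous and on the same shard, so the balancer should be stopped while doing this.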