MongoDB sharding, how does it rebalance when adding new nodes?

I'm trying to understand MongoDB and the concept of sharding. Say we start with 2 nodes and partition customer data based on last name, where A through M is stored on node 1 and N through Z is stored on node 2. What happens when we want to scale out and add more nodes? I just don't see how that will work.

If you have 2 nodes, it doesn't mean that the data is partitioned into 2 chunks. It can be partitioned into, let's say, 10 chunks, with 6 of them on server 1 and the rest on server 2.
When you add another server, MongoDB is able to redistribute those chunks between the nodes of the new configuration.
You can read more in the official docs:
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
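If you want to see how the chunks are actually laid out, a quick sketch from the mongo shell (the database and collection names here are only placeholders):

    sh.status()                              // shards, databases, and chunk counts per shard
    use mydb
    db.customers.getShardDistribution()      // per-shard data size and chunk breakdown for one collection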

If there are multiple shards available, MongoDB will start migrating data to other shards
once you have a sufficient number of chunks. This migration is called balancing and is
performed by a process called the balancer. The balancer moves chunks from one shard to another.
For a balancing round to occur, a shard must have at least nine more chunks than the
least-populous shard. At that point, chunks will be migrated off the crowded shard
until it is even with the rest of the shards.
When you add a new node to the cluster, MongoDB redistributes those chunks among the nodes of the new configuration.
This is just a short extract; for a complete understanding of how MongoDB rebalances when adding a new node, read chapter 2, "Understanding Sharding", of Kristina Chodorow's book "Scaling MongoDB".
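As a rough sketch of what that looks like from the shell (the replica set and host names below are just placeholders), you can add a shard and watch the balancer even things out:

    sh.addShard("rs3/node3a:27017,node3b:27017,node3c:27017")   // register the new shard (a replica set)
    sh.getBalancerState()      // true if the balancer is enabled
    sh.isBalancerRunning()     // whether a balancing round is currently in progress
    sh.status()                // chunk counts per shard; watch them even out over time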

Related

MongoDB doesn't scale properly when adding new shard with collection already filled

My MongoDB sharded cluster's ingestion performance doesn't scale up when adding a new shard.
I have a small cluster setup with 1 mongos + 1 config replica set (3 nodes) + N shards replica sets (3 nodes each).
Mongos is on a dedicated Kubernetes node, and each mongod process hosting a shard has its own dedicated k8s node, while the config server processes run wherever they happen to be deployed.
The cluster is used mainly for GridFS file hosting, with a typical file being around 100 MB.
I am doing stress tests with 1, 2 and 3 shards to see if it scales properly, and it doesn't.
If I start a brand new cluster with 2 shards and run my test, it ingests files at (approx) twice the speed I had with 1 shard. But if I start the cluster with 1 shard, perform the test, then add 1 more shard (2 shards total) and perform the test again, the ingestion speed is approximately the same as before with 1 shard.
Looking at where the chunks go: when I start the cluster with 2 shards right away, the load is evenly balanced between the shards.
If I start with 1 shard and add a second one after some insertions, the chunks tend to all stay on the old shard, and the balancer has to move them to the second shard later.
Quick facts:
chunksize 1024 MB
sharding key is GridFS file_id, hashed
This is due to how hashed sharding and balancing works.
In an empty collection (from Shard an Empty Collection):
The sharding operation creates empty chunks to cover the entire range of the shard key values and performs an initial chunk distribution. By default, the operation creates 2 chunks per shard and migrates across the cluster.
So if you execute sh.shardCollection() on a cluster with x shards, it will create 2 chunks per shard and distribute them across the shards, totalling 2x chunks across the cluster. Since the collection is empty, moving the chunks around takes little time. Your ingestion will now be distributed evenly across the shards (assuming other things, e.g. good cardinality of the hashed field).
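As a rough illustration (the database name is an assumption; in a standard GridFS setup the chunks collection is <db>.fs.chunks and the field is files_id), sharding the empty collection with a hashed key might look like:

    sh.enableSharding("files")                                      // hypothetical database name
    sh.shardCollection("files.fs.chunks", { files_id: "hashed" })   // hashed shard key on the GridFS file id
    sh.status()   // on an empty collection, expect ~2 initial chunks per existing shard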
Now if you add a new shard after the chunks were created, that shard starts empty and the balancer will start to send chunks to it using the Migration Thresholds. In a populated collection, this process may take a while to finish.
If while the balancer is still moving chunks around (which may not be empty now) you do another ingestion, the cluster is now doing two different jobs at the same time: 1) ingestion, and 2) balancing.
When you're doing this with 1 shard and add another shard, it's likely that the chunks you're ingesting into are still located in shard 1 and haven't moved to the new shard yet, so most data will go into that shard.
Thus you should wait until the cluster is balanced after adding the new shard before doing another ingestion. After it's balanced, the ingestion load should be more evenly distributed.
Note: since your shard key is file_id, I'm assuming that each file is approximately the same size (~100 MB). If some files are much larger than others, some chunks will be busier than others as well.
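One way to check that the cluster has settled before the next ingestion run (again, the names are placeholders):

    sh.isBalancerRunning()                   // should return false once no migration round is active
    use files
    db.fs.chunks.getShardDistribution()      // per-shard data and chunk counts should be roughly even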

Choosing between MongoDB and ElasticSearch - Scaling/Sharding

I'm currently deciding between MongoDB and Elasticsearch as a backend to a logging and analytics platform. I plan to use a cluster of 5 Intel Xeon Quad Core servers with 64GB RAM and a 500GB NVMe drive in each. With 1 replica set, it should support 1TB+ of data I'm guessing.
From what I've read on Elasticsearch, the recommended set-up for the above servers would be 5-10 shards, but shards cannot be increased in the future without a huge migration. So maybe I can add 5 more servers/nodes to the cluster for the same index, but not 10 or 20, because I can't create more shards to spread across the new nodes/servers - correct?
MongoDB appears to automatically manage sharding based on a key value and redistribute those shards as more nodes get added. So does that mean that I can add 50 more servers to the cluster in the future and MongoDB will happily spread the data from this one index across all the servers?
I basically only need 1TB of storage right now, but don't want to paint myself into a corner, should this 1 dataset end up growing to 100TB.
Without starting Elasticsearch with 100 shards at the beginning, which seems inefficient and bad practice, how can it scale past 5/10 servers for this single dataset?
As Val said, you would normally have time based indices, so you can easily (in a performant way) remove data after a certain retention period. So as your requirements change over time, you change your shard number (normally through an index template).
Current versions of Elasticsearch support a _split API, which does what you are asking for: use 5 shards initially, but keep the option to split them later up to a preconfigured maximum (30, just as an example), so 5 -> 10 -> 30 would be options.
If you have 5 primary shards and a replication factor of 1, you could still spread the load over 10 nodes: writes go to the 5 primary shards and are replicated to the 5 replica shards, while reads can go to either copy. Elasticsearch's write/read model is generally different from MongoDB's.
PS disclaimer: I work for Elastic now, but I have used MongoDB in production for 5 years as well.

Difference between Sharding And Replication on MongoDB

I am just confused about sharding and replication and how they work. According to the definitions:
Replication: A replica set in MongoDB is a group of mongod processes that maintain the same data set.
Sharding: Sharding is a method for storing data across multiple machines.
As per my understanding, if there is 75 GB of data, then with replication (3 servers) it will store 75 GB of data on each server, meaning 75 GB on server-1, 75 GB on server-2, and 75 GB on server-3 (correct me if I am wrong), and with sharding it will be stored as 25 GB on server-1, 25 GB on server-2, and 25 GB on server-3 (right?). But then I encountered this line in the tutorial:
Shards store the data. To provide high availability and data consistency, in a production sharded cluster, each shard is a replica set.
A replica set holds 75 GB but a shard holds 25 GB, so how can the two be equivalent? This confuses me a lot; I think I am missing something important here. Please help me with this.
Let's try an analogy: you are running a library.
Like anyone who runs a library, you have books, and you store all the books you have on a shelf. This is good, but your library became so good that your rival wants to burn it. So you decide to set up many additional shelves in other places. There is one most important shelf, and whenever you add new books you quickly add the same books to the other shelves. Now if the rival destroys a shelf, this is not a problem: you just open another one and copy the books onto it.
This is replication (just substitute the library with an application, a shelf with a server, a book with a document in a collection, and your rival with a failed HDD on the server). It just makes additional copies of the data, and if something goes wrong it automatically selects another primary.
This concept may help if you
want to scale reads (but they might lag behind the primary).
do some offline reads which do not touch the main server
serve some part of the data for a specific region from a server in that specific region
But the main reason behind replication is data availability. So here you are right: if you have 75 GB of data and replicate it with 2 secondaries, you will get 75*3 GB of data.
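For example, a minimal three-member replica set (host names are made up) is created by starting three mongod processes with the same replica set name and initiating the set from one of them; every member ends up with a full copy of the data:

    rs.initiate({
        _id: "rs0",
        members: [
            { _id: 0, host: "server1:27017" },
            { _id: 1, host: "server2:27017" },
            { _id: 2, host: "server3:27017" }
        ]
    })
    rs.status()   // all three members hold the full 75 GB data set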
Look at another scenario. There is no rival, so you do not want to make copies of your shelves. But now you have another problem: you became so good that one shelf is not enough. You decide to distribute your books between many shelves. You decide to distribute them between shelves based on the author's name (this is not necessarily a good idea; read how to select a shard key here). So everything by an author whose name is less than K goes to one shelf, and everything that is K and above goes to another. This is sharding.
This concept may help you:
distribute a workload
be able to store much more data than can fit on a single server
do map-reduce things
store more data in RAM for faster queries
Here you are partially correct. If you have 75 GB, then in total across all the servers there will still be 75 GB, but it will not necessarily be divided equally.
But here is a problem with sharding alone. Right now your rival appears, comes to one of your shelves, and burns it. All the data on that shelf is lost. So you want to replicate every shard as well. Basically, the statement that
each shard is a replica set
is not automatically true. But if you are doing sharding, you have to set up replication for every shard, because the more shards you have, the higher the probability that at least one will die.
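In shell terms, that means each shard you register with mongos is itself a replica set (the set and host names below are invented):

    sh.addShard("shardA/a1:27017,a2:27017,a3:27017")   // shard 1 = replica set "shardA"
    sh.addShard("shardB/b1:27017,b2:27017,b3:27017")   // shard 2 = replica set "shardB"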
Answering Saad's followup answer:
Also, while you can have shards and replicas together on the same server, it is not the recommended way of doing it. Each server should have a single role in the system. If, for example, you decide to have 2 shards and to replicate them 3 times, you will end up with 6 machines.
I know that this might sound costly, but you have to remember that this is commodity hardware, and if the service you are providing is already so good that you are thinking about high availability and it does not fit on one machine, then this is a rather cheap price to pay (in comparison to one big dedicated machine).
I am writing this as an answer, but actually it is a question about Salvador Sir's answer.
Like you said, with sharding the 75 GB of data "may be" stored as 25 GB on server-1, 25 GB on server-2 and 25 GB on server-3 (this distribution depends on the shard key). Then, to prevent loss, we also need to replicate each shard. So this means every server now contains its own shard and also replicas of the shards present on the other servers, which means server-1 will have:
1) Its own shard.
2) A replica of the shard present on server-2
3) A replica of the shard present on server-3
The same goes for server-2 and server-3. Am I right? If this is the case, then each server again holds 75 GB of data. Right or wrong?
Since we want to make 3 shards and also replicate the data, the following addresses the above problem.
If a server holds a shard and also the replica of that same shard, then the failure of that server will lead to the loss of both the replica and the shard.
However, you can have shard 1 and a replica set (replicas of shard 2 and shard 3) on the same server, but this is not advisable.
Sharding is like partitioning of data.
Let's say you have around 3 GB of data and you define 3 shards, so each shard MIGHT hold 1 GB of data (it really depends on the shard key).
Why is sharding needed? Searching for specific data in 3 GB is roughly 3 times as expensive as searching in 1 GB of data, so it is very similar to partitioning. Sharding helps with fast access to data.
Now coming to replicas: let's say you have the same 3 GB of data without any replication (meaning only a single copy of the data exists), so if anything happens to that machine or drive, your data is gone. Replication solves this problem: say you set up the DB with a replication factor of 3, which means the same 3 GB of data is available 3 times (so the total size is 9 GB, split into three copies of 3 GB each). Replication helps with failover.

how to divide mongodb data without stopping service

I have a MongoDB instance running online, which contains about 200 million objects, and the file size is about 20 GB. I found that the insert speed has become very slow (about 2,000 per second, whereas it was more than 10,000 in the beginning), so I decided to divide the data to improve the insert speed.
I would like to know if I can divide the MongoDB data without stopping the service, and how.
You just described "sharding". Luckily for you, MongoDB has nice sharding features out of the box.
Your migration will consist of the following steps (a rough shell sketch follows the list):
Create 3 mongo config servers
Create more than 1 mongos router
Add your current replica set as a shard
Point your application to connect to your mongos
Configure shard keys for your current collections
Then, you are set to add shards as needed
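Roughly, the shell side of that migration might look like this sketch (the replica set names, hosts, database, and shard keys are placeholders, not your actual values):

    // run against a mongos router
    sh.addShard("rs0/db1:27017,db2:27017,db3:27017")       // existing replica set becomes the first shard
    sh.enableSharding("mydb")                              // allow sharding for the database
    sh.shardCollection("mydb.events", { _id: "hashed" })   // choose a shard key per collection
    // later, when more capacity is needed:
    sh.addShard("rs1/db4:27017,db5:27017,db6:27017")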
For detailed instructions, see 10gen's sharding overview
At MongoHQ, we convert replica sets to shards all the time. It should be quick, painless, and without downtime.

MongoDB chunks selection to move

If a shard has, say, 200 chunks on it and it is time to move some of those chunks to another shard:
1) How does MongoDB decide which chunks to move?
2) Is this move logic in the config server or in mongos?
3) How can I affect/control the chunk selection algorithm so that MongoDB moves chunks to other shards in a way that helps distribute my reads based on my users' access patterns?
Movement of chunks between shards is triggered by mongos. Mongos will move chunks under two circumstances. If one shard contains at least 9 more chunks than any other, mongos will trigger a balancing round and redistribute the chunks among the shards; in this situation, the chunks with the lowest shard keys will be moved. Additionally, if the topmost chunk is split, mongos will move the chunk with the higher shard key to another shard.
One of the features of Mongo is that in a properly set-up sharded cluster, chunks are split and moved automatically such that your application does not need to be aware that it is interacting with a sharded database. Everything happens behind-the-scenes.
However, it is possible to split and move chunks manually using the "split" and "moveChunk" commands. Please see the mongo documentation for examples of how to use these commands: "Splitting Shard Chunks" http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks and "Moving Chunks" http://www.mongodb.org/display/DOCS/Moving+Chunks. There have been cases where users have written their own custom balancers tailored to their own applications, but this is not common and is only attempted by the most advanced of users.
As an alternative, it is possible to give the balancer a window of time in which it may operate, or to disable it entirely. Some users will temporarily disable the balancer for maintenance, or give it a window so that it is not competing for write locks at times when they expect their application to be putting the db under high load.
More information on the balancer is available in the "Balancing" and "Balancer window" sections of the "Sharding Administration" documentation.
http://www.mongodb.org/display/DOCS/Sharding+Administration
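A sketch of both options, run through mongos (the window times below are just an example):

    use config
    db.settings.updateOne(
        { _id: "balancer" },
        { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },   // balance only during this window
        { upsert: true }
    )
    sh.stopBalancer()    // or disable balancing entirely, e.g. during maintenance
    sh.startBalancer()   // re-enable it afterwards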
Hopefully the above resources will give you a better understanding of how sharding works, and how chunks are balanced between shards.