MongoDB sharding concept - mongodb

I'm trying to figure out the mechanism of MongoDB's sharding.
Can someone tell me if I got this right?
Here is an example of shards, based on what I have figured out:
Sharded Cluster 1:
Shard1 - contains chunk1, chunk2 and chunk5 (replica set of a primary and two secondaries, so we have a backup for those chunks)
Shard2 - contains chunk3, chunk4 and chunk6 (single MongoDB instance, so we do not have any backup for those chunks)
Sharded Cluster 2:
Shard1 - contains chunk2, chunk3 and chunk6 (single MongoDB instance, so we do not have any backup for those chunks)
Shard2 - contains chunk1, chunk4 and chunk5 (replica set of a primary and two secondaries, so we have a backup for those chunks)

I would also like to add that sharding is basically done to increase I/O bandwidth and to partition in-memory data so that distributed caching can be used more efficiently.

Related

MongoDB sharding storage usage

I am reading about sharding in MongoDB. After understanding how it works, I have a very basic question regarding the storage space used by it.
Suppose, I have a server containing 1 GB of storage. Now assuming my data will grow beyond 1 GB, it won't be sufficient for my purpose. So, I add one more server and shard Mongo.
So now, let's say I have 2 servers, with storage space say 1 GB each, which are to be included in the cluster. If I perform sharding, then both of these servers will be used to distribute Mongo data. So, in total, I must have 2 GB storage available for Mongo. But, I find that the official sharding documentation mentions that shards are replica sets. If that is so, then wouldn't the addition of 1 GB server just mean that I have only 1 GB storage (like before) for actual MongoDB data and remaining 1 GB is just replicated data?
If my understanding is correct, then is there any way to not create a replica set? Can we use 2 GB storage from both the servers like a logical volume?
Otherwise, if my understanding is wrong, what is the correct thing?
The MongoDB sharding documentation says: "Sharding distributes data across the shards in the cluster, allowing each shard to contain a subset of the total cluster data. As the data set grows, additional shards increase the storage capacity of the cluster". See https://docs.mongodb.com/manual/sharding/ (storage capacity).
Since each shard holds a subset, the shards contain different sets of data. So a replica set can serve more than one purpose depending on usage: as a shard that stores a subset of the data, or as a set of replicas that keep backup copies of the data.
Sharding happens one level above replication.
When you use both sharding and replication, your cluster consists of many replica-sets and one replica-set consists of many mongod instances.
However, it is also possible to build a cluster from stand-alone mongod instances that are not replicated, or to mix shards implemented as replica sets with shards implemented as stand-alone mongod instances. (Since MongoDB 3.6, every shard must be deployed as a replica set.)
The accurate phrasing is "each shard is a replica set", not "the shards are replica sets".
I hope this helps.
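To make the storage arithmetic from the question concrete, here is a minimal sketch. The `Shard` class, the function names and the 1 GB figures are illustrative assumptions, not MongoDB APIs; real usable space also depends on indexes, the oplog and storage-engine overhead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    """One shard: a replica set, or a single node if it has one member."""
    name: str
    member_storage_gb: List[float]  # disk size of each member (hypothetical figures)

    def usable_gb(self) -> float:
        # Every member holds a full copy of the shard's data,
        # so the smallest member bounds what the shard can store.
        return min(self.member_storage_gb)


def cluster_capacity_gb(shards: List[Shard]) -> float:
    # Shards hold disjoint subsets of the data, so their capacities add up.
    return sum(shard.usable_gb() for shard in shards)


# Two 1 GB servers used as two single-member shards: capacities add up.
two_shards = [Shard("shard1", [1.0]), Shard("shard2", [1.0])]

# The same two servers used as one shard with two replicated members.
one_replicated_shard = [Shard("shard1", [1.0, 1.0])]

print(cluster_capacity_gb(two_shards))            # 2.0
print(cluster_capacity_gb(one_replicated_shard))  # 1.0
```

In other words, adding a shard adds capacity, while adding a member to an existing shard adds redundancy.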

MongoDB Sharding, Replication and Clustering

Based on my analysis, my understanding is as follows. Please correct me if it is wrong.
Sharding - Horizontal scaling: split the records of each collection into multiple chunks and store them across multiple machines, using a good shard key.
Replication - Replicate the data across multiple machines for high availability.
Clustering - As per the Mongo architecture, there is one master and multiple slave machines. Writes and sensitive reads are performed against the master, and other reads are performed against the slaves.
I am not able to correlate clustering with replication and sharding. Could someone please explain how they relate?
The term "clustering" is not normally used with MongoDB. Instead, its meaning is included in the term "sharding". A shard is a node or replica set holding only a portion of your data, yes. And a cluster is simply a collection of shards (plus supporting nodes, such as config servers and mongos routers).
Replication, in contrast, is done with replica sets, which have one primary node (master) while the other nodes are secondaries (slaves).

Why do individual shards in MongoDB report more delete operations compared to corresponding mongos in a sharded cluster?

So I have a production sharded MongoDB cluster that has 8 shards (replica sets) managed by mongos. Let's say I have 20 servers which are running my application and each of the servers runs a mongos process that manages the 8 shards.
Given this setup, when I check the number of ops on each of my mongos on the 20 servers, I can see that my number of inserts and deletes are in proportion - which is in accordance with my application logic. However, when I run mongostat --discover on the individual shards, I see that deletes are nearly 4x the number of inserts which violates both my application logic as well as the 1:1 ratio indicated by mongos. Straightforward intuition supports that mongos would write to only one shard and so the average ratio of inserts and deletes across individual shards should be the same as that on mongos (which the application directly writes to) unless mongos does something different internally with the shards.
Could anyone point me to any relevant info on why this would happen, or let me know if something could be wrong with my infra?
Thanks
The reason for this is that I was running the remove() queries through mongos without specifying my shard key. In that case, mongos does not know which shard to direct the query to, so it broadcasts the query to all the shards, which results in more delete operations than a targeted query would.
Check documentation for more information.
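The effect can be illustrated with a toy model of a router that counts the operations it sends to each shard. This is only a sketch: the `Router` class and the CRC-based `shard_for` lookup are hypothetical stand-ins, not how mongos is implemented, and the exact ratio seen in practice depends on the mix of targeted and broadcast operations.

```python
import zlib
from collections import Counter

NUM_SHARDS = 8  # matches the cluster described in the question


def shard_for(key: str) -> int:
    # Stand-in for the router's key-to-shard lookup (hypothetical hashing scheme).
    return zlib.crc32(key.encode()) % NUM_SHARDS


class Router:
    """Toy query router that counts the operations sent to each shard."""

    def __init__(self) -> None:
        self.inserts = Counter()
        self.deletes = Counter()

    def insert(self, shard_key: str) -> None:
        # An insert always carries the shard key, so it goes to exactly one shard.
        self.inserts[shard_for(shard_key)] += 1

    def delete(self, shard_key=None) -> None:
        if shard_key is not None:
            # Targeted: the filter includes the shard key.
            self.deletes[shard_for(shard_key)] += 1
        else:
            # Broadcast: without the shard key, every shard receives the delete.
            for shard in range(NUM_SHARDS):
                self.deletes[shard] += 1


router = Router()
for i in range(1000):
    router.insert(f"user-{i}")
    router.delete()  # filter does not include the shard key

total_inserts = sum(router.inserts.values())
total_deletes = sum(router.deletes.values())
print(total_inserts, total_deletes)  # 1000 8000
```

The application still sees one delete per insert at the router, while the shards together count one delete per shard for every broadcast operation.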

How to understand "The shards are replica sets."

When I put shard and Replica Set together, I am confused.
Why does the reference say that the shards are replica sets?
Do replica sets contain shards?
Can someone give me a conceptual explanation?
A replica set is a cluster of MongoDB servers that implements master-slave replication. The same data is kept on every member of the set, i.e. the master and the slave(s). The master is called the primary node and the slaves are called secondary nodes.
It replicates your data on multiple mongod instances so that the data survives a node failure. MongoDB also automatically elects a new primary from among the secondaries whenever the primary goes down.
Sharding is used to spread a large data set across multiple machines. So, to compare the two: sharded nodes do not contain the same data, whereas replicated nodes do.
Sharding has a different purpose: a large data set is spread across multiple machines.
Now, each subset of this large data set can also be replicated to multiple nodes, a primary and secondaries, to survive failures. So a shard can itself be a replica set with several members. The replica set behind a shard holds one subset of the large data set.
So multiple shards together make up the whole data set, which is divided into chunks. These chunks can be replicated within a shard by using a replica set.
You can find more details on this in the MongoDB manual.
Sharding happens one level above replication.
When you use both sharding and replication, your cluster consists of many replica-sets and one replica-set consists of many mongod instances.
However, it is also possible to build a cluster from stand-alone mongod instances that are not replicated, or to mix shards implemented as replica sets with shards implemented as stand-alone mongod instances. (Since MongoDB 3.6, every shard must be deployed as a replica set.)
The accurate phrasing is "each shard is a replica set", not "the shards are replica sets".
This is a language issue: in English, saying "the shards are replica sets" means the same as "each shard is a replica set" in this context.
So to explain, say you have a collection of names a-z. Shard 1 holds a-b. This shard is also a replica set which means it has automated failover and replication of that range as well. So sharding in this sense is a top level term that comes above replica sets.
Shards are used to break up a collection and store parts of it in different places. A shard does not have to be a replica set; it can be a single server. But to achieve reliability and avoid data loss, a replica set can be used as a shard instead of a single server. Then, if one of the servers in the replica set goes down, the others will still hold the data.
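The a-z example above can be sketched as a range lookup. Only the "shard 1 holds a-b" range comes from the answer; the other boundaries, the shard names and the `shard_for_name` helper are hypothetical, chosen just to show how a key maps to exactly one shard.

```python
import bisect

# Hypothetical range boundaries: each entry is the first letter a shard owns.
# shard1 holds a-b, shard2 holds c-m, shard3 holds n-z.
LOWER_BOUNDS = ["a", "c", "n"]
SHARDS = ["shard1", "shard2", "shard3"]


def shard_for_name(name: str) -> str:
    first = name[0].lower()
    if not "a" <= first <= "z":
        raise ValueError(f"no range covers {name!r}")
    # Pick the last range whose lower bound is <= the first letter.
    index = bisect.bisect_right(LOWER_BOUNDS, first) - 1
    return SHARDS[index]


for name in ["alice", "bob", "carol", "zed"]:
    print(name, "->", shard_for_name(name))
```

Each of the three shards could in turn be a replica set, so that every range is also replicated and has automated failover.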

How does MongoDB do both sharding and replication at the same time?

For scaling/failover, MongoDB uses a “replica set” where there is a primary and one or more secondary servers. The primary is used for writes. Secondaries are used for reads. This is pretty much the master-slave pattern used in SQL programming.
If the primary goes down, one of the secondaries takes its place.
So the issue of horizontal scaling and failover is taken care of. However, it seems this is not a solution that allows for sharding. A true shard holds only a portion of the entire data, so if a secondary in a replica set is a shard, how can it qualify as primary when it doesn’t have all of the data needed to service the requests?
Wouldn't we have to have a replica set for each one of the shards?
This is obviously a beginner question, so a link that illustrates, visually or otherwise, how this is done would be helpful.
Your assumption is correct: each shard contains a separate replica set. When a write request comes in, mongos finds the right shard for it based on the shard key, and the data is written to the primary of the replica set contained in that shard. This results in write scaling, as a (well-chosen) shard key should distribute writes over all your shards.
A shard is the sum of a primary and secondaries (a replica set), so yes, you would have to have a replica set in each shard.
The shard's portion of the entire data is held in the primary and replicated to the secondaries to maintain consistency. If the primary goes down, a secondary is elected to be the new primary and has the same data as its predecessor, so it can begin serving immediately. That means that the sharded data is still present and not lost.
You would typically map individual shards to separate replica sets.
See http://docs.mongodb.org/manual/core/sharded-clusters/ for an overview of MongoDB sharding.
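As a rough illustration of the answers above, the following toy model routes a write to one shard, copies it to every member of that shard's replica set, and then fails the primary. Everything here is an assumption made for the sketch: the `ReplicaSet` class, the member names and the CRC-based routing are hypothetical, replication is shown as immediate, and the "election" simply promotes the next live member, which is far simpler than what MongoDB actually does.

```python
import zlib


class ReplicaSet:
    """Toy replica set: one primary, the rest secondaries, all holding the same data."""

    def __init__(self, name, members):
        self.name = name
        self.alive = list(members)  # the first live member acts as primary
        self.data = {member: {} for member in members}

    @property
    def primary(self):
        return self.alive[0]

    def write(self, key, value):
        # The write lands on the primary and is copied to the secondaries.
        for member in self.alive:
            self.data[member][key] = value

    def read(self, key):
        return self.data[self.primary].get(key)

    def fail_primary(self):
        # Simplified election: the next live member is promoted.
        self.alive.pop(0)


# Two shards, each backed by its own replica set (hypothetical names).
shards = [
    ReplicaSet("rs0", ["rs0-a", "rs0-b", "rs0-c"]),
    ReplicaSet("rs1", ["rs1-a", "rs1-b", "rs1-c"]),
]


def route(shard_key: str) -> ReplicaSet:
    # Stand-in for the router choosing a shard from the shard key.
    return shards[zlib.crc32(shard_key.encode()) % len(shards)]


target = route("order-42")
target.write("order-42", {"total": 10})

old_primary = target.primary
target.fail_primary()

print(target.primary != old_primary)  # True
print(target.read("order-42"))        # {'total': 10}
```

The new primary only needs the data of its own shard, which it already has, so it never needs the whole data set to take over.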