How to understand "The shards are replica sets." - mongodb

When I put shard and Replica Set together, I am confused.
Why does the reference say that the shards are replica sets?
Do replica sets contains shards?
Can someone give me a conceptual explanation?

Replica Set is a cluster of MongoDB servers which implements Master - slave implementation. So, basically same data is shared between multiple replica i.e Master and Slave(s). Master is also termed as primary node and Slave(s) is/are considered as Secondary nodes.
It replicates your data on multiple mongo instances to solve/avoid fail overs. MongoDB also perform election of Primary node between secondary nodes automatically whenever Primary node goes down.
Sharding is used to store large data set between multiple machines. So basically, if you simply wants to compare Sharded nodes doesnt/may not contain same data where as Relicated nodes contains same data.
Sharding has different purpose,large data set is spread accross multiple machines.
Now, this large data set's subset can also be replicated to multiple nodes as primary and secondary to overcome failovers. So basically a shard can have multiple replica-set. These replica set of a shard contains subset of data for a large data set.
So, multiple shards can complete the whole large data set which are separated in the form of chunks. These chunks can be replicated within a Shard using Replica set.
You can also get more details related to this in MongoDB manual.

Sharding happens one level above replication.
When you use both sharding and replication, your cluster consists of many replica-sets and one replica-set consists of many mongod instances.
However, it is also possible to create a cluster of stand-alone mongod instances which are not replicated or have only some shards implemented as replica-sets and some shards implemented as stand-alone mongod instances.

Each shard is a replica set, not the shards are replica sets.
This is a language barrier, in English to say such a thing really means the same as "each shard is a replica set" in this context.
So to explain, say you have a collection of names a-z. Shard 1 holds a-b. This shard is also a replica set which means it has automated failover and replication of that range as well. So sharding in this sense is a top level term that comes above replica sets.

Shards are used to break a collection and store parts of it in different places. It is not necessary that a shard be a replica set, it can be a single server, but to achieve reliability and avoid loss of data, a replica set can be used as a shard instead of a single server. So, if one of the servers in the replica set goes down, the others will still hold the data.

Related

mongodb write concern for sharded cluster

From the mongodb docs:
mongos uses "majority" for the write concern of the shardCollection command and its helper sh.shardCollection().
In replica sets majority is Nodes/2 and round up.
Is it the majority of the config servers or what exactly?
I guess mongo doesn't specify very specifically what happens. If you dug one more page down, Write Concern, you'll read that "In sharded clusters, mongos instances will pass the write concern on to the shards."
Assuming your shard cluster is also a replica set e.g. P-S-S (primary-secondary-secondary) it should follow the same behavior as if you only had a single unsharded replica set. "For this replica set, calculated majority is two, and the write must propagate to the primary and one secondary to acknowledge the write concern to the client". The client here technically being mongos.
The other shard clusters do not need to acknowledge the write if they are not being written to. If you are writing to multiple shard clusters, then I'd assume both shard clusters need to ack the write similarly.

MongoDB Sharding, Replication and Clustering

Based on my analysis below is understanding, correct me if my understating is wrong.
Sharding - Horizontal scaling, split the records into multiple chunks and store across multiple machine with good shard key for all collections.
Replication - Replicate the data across multiple machine for high availability
Clustering - As per Mongo architecture there will be one Master and multiple slave machine. Write and sensitive read operation performs against Master and read operation performs against slaves.
I am not able to correlate Clustering with Replication and Sharding, could you please someone guide me how to relate them?
Term "clustering" is not normally used with mongodb. Instead, its meaning included in the term "sharding". A shard is a node/replicaset with only a portion of your data, yes. And cluster is simply a collection of shards (and supporting nodes, like config servers and mongos routers)
Whereas replication is done with replica sets, which have one primary node (master) and other nodes are secondaries (slaves).

MongoDB replication factors

I'm new to Mongo, have been using Cassandra for a while. I didn't find any clear answers from Mongo's docs. My questions below are all closely related.
1) How to specify the replication factors in Mongo?
In C* you can define replication factors (how many replicas) for the entire database and/or for each table. Mongo's newer replica set concept is similar to C*'s idea. But, I don't see how I can define the replication factors for each table/collection in Mongo. Where to define that in Mongo?
2) Replica sets and replication factors?
It seems you can define multiple replica sets in a Mongo cluster for your databases. A replica set can have up to 7 voting members. Does a 5-member replica set means the data is replicated 5 times? Or replica sets are only for voting the primary?
3) Replica sets for what collections?
The replica set configuration doc didn't mention anything about databases or collections. Is there a way to specify what collections a replica set is intended for?
4) Replica sets definition?
If I wanted to create a 15-node Mongo cluster and keeping 3 copies of each record, how to partition the nodes into multiple replica sets?
Thanks.
Mongo replication works by replicating the entire instance. It is not done at the individual database or collection level. All replicas contain a copy of the data except for arbiters. Arbiters do not hold any data and only participate in elections of a new primary. They are usually deployed to create enough of a majority that if an instance goes down a new instance can be elected as the primary.
Its pretty well explained here https://docs.mongodb.com/manual/replication/
Replication is referred to the process of ensuring that the same data is available on more than one Mongo DB Server. This is sometimes required for the purpose of increasing data availability.
Because if your main MongoDB Server goes down for any reason, there will be no access to the data. But if you had the data replicated to another server at regular intervals, you will be able to access the data from another server even if the primary server fails.
Another purpose of replication is the possibility of load balancing. If there are many users connecting to the system, instead of having everyone connect to one system, users can be connected to multiple servers so that there is an equal distribution of the load.
In MongoDB, multiple MongDB Servers are grouped in sets called Replica sets. The Replica set will have a primary server which will accept all the write operation from clients. All other instances added to the set after this will be called the secondary instances which can be used primarily for all read operations.

Mongo sharding confusion with replication turned on

Lets say I have 3 machines. Call them A, B, C. I have mongo replication turned on. So assume A is primary. B and C are secondary.
But now I would like to shard the database. But I believe all writes still go to A since A is primary. Is that true ? Can somebody explain the need of sharding and how it increases write throughput ?
I understand mongos will find the right shard. But the writes will have to be done on the primary. And rest other machines B, C will follow. Am I missing something.
I also don't understand the statement : "Each shard is a replica set". The shard is incomplete unless it is also the primary.
Sharding is completely different to replication.
I am now going to attempt to explain your confusion without making you more confused.
When sharding is taken into consideration a replica set is a replicated range of sharded information.
As such this means that every shard in the cluster is actually a replica set in itself holding a range (can hold differnt range chunks but that's another topic) of the sharded data and replicating itself across the members of that self contained replica set.
Imagine that a shard will be a primary of its own replica set with its own set of members and what not. This is what is meant by each shard being a replica set.
As such each shard will have its own self contained primary but at the same time each primary of each replica set of each shard can receive writes from a mongos if the range that the replica set holds matches the shard key target being sent down by the client.
I hope that makes sense.
Edit
This post might help: http://www.kchodorow.com/blog/2010/08/09/sharding-and-replica-sets-illustrated/
Does that apply here ? Say I want writes acknowledged from 2 mongod instances. Is that possible ? I still have trouble understanding. Can you please explain with an example ?
Ok let's take an example. You have 3 shards which are in fact 3 replica sets, called rs1 and rs2 and rs3, consisting of 3 nodes each (1 primary and 2 secondaries). If you want a write to be acknowledged by 2 of the members of the replica set then you can do that as you normally would:
db.users.insert({username:'sammaye'},{w:2})
Where username is the shard key.
The mongos this query is sent to will seek out the correct shard (node) which is in fact a replica set, connect to that replica set and then perform the insert as normal.
So as an example, rs2 actually holds the range of all usernames starting with the letters m-s. The mongos will use its internal mapping of rs2 (gotten when connecting to the replica set) along with your write/read concern to judge which members to read from and write to.
Tags will apply all the same too.
If mongos finds a shard, which is not "primary"(as in replication) is the write still performed in the secondary ?
You are still confusing yourself here, there is no "primary" like in replication, there is a seed (sometimes called a master) server for the sharded database but that is a different thing entirely. The mongos does not have to write to the master node in anyway, it is free to write to any part of the sharded cluster that your queries allow it.

How does MongoDB do both sharding and replication at the same time?

For scaling/failover mongodb uses a “replica set” where there is a primary and one or more secondary servers. Primary is used for writes. Secondaries are used for reads. This is pretty much master slave pattern used in SQL programming.
If the primary goes down a secondary in the cluster of secondaries takes its place.
So the issue of horizontally scaling and failover is taken care of. However, this is not a solution which allows for sharding it seems. A true shard holds only a portion of the entire data, so if the secondary in a replica set is shard how can it qualify as primary when it doesn’t have all of the data needed to service the requests ?
Wouldn't we have to have a replica set for each one of the shards?
This obviously a beginner question so a link that visually or otherwise illustrates how this is done would be helpful.
Your assumption is correct, each shard contains a separate replica set. When a write request comes in, MongoS finds the right shard for it based on the shard key, and the data is written to the Primary of the replica set contained in that shard. This results in write scaling, as a (well chosen) shard key should distribute writes over all your shards.
A shard is the sum of a primary and secondaries (replica set), so yes, you would have to have a replica set in each shard.
The portion of the entire data is held in the primary and it's shared with the secondaries to maintain consistency. If the primary goes out, a secondary is elected to be the new primary and has the same data as its predecessor to begin serving immediately. That means that the sharded data is still present and not lost.
You would typically map individual shards to separate replica sets.
See http://docs.mongodb.org/manual/core/sharded-clusters/ for an overview of MongoDB sharding.