How does MongoDB do both sharding and replication at the same time?

For scaling/failover MongoDB uses a “replica set” where there is a primary and one or more secondary servers. The primary is used for writes, and the secondaries can be used for reads. This is pretty much the master-slave pattern used in SQL databases.
If the primary goes down, a secondary in the cluster of secondaries takes its place.
So the issue of horizontal scaling and failover is taken care of. However, this does not seem to be a solution that allows for sharding. A true shard holds only a portion of the entire data, so if the secondary in a replica set is a shard, how can it qualify as primary when it doesn’t have all of the data needed to service the requests?
Wouldn't we have to have a replica set for each one of the shards?
This is obviously a beginner question, so a link that visually or otherwise illustrates how this is done would be helpful.

Your assumption is correct: each shard contains a separate replica set. When a write request comes in, mongos finds the right shard for it based on the shard key, and the data is written to the primary of the replica set backing that shard. This results in write scaling, as a (well chosen) shard key will distribute writes over all your shards.
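To make the routing concrete, here is a minimal mongo shell sketch of sharding a collection on a key; the database name mydb and the collection users are placeholders, and the commands are run through mongos:

sh.enableSharding("mydb")                           // allow collections in mydb to be sharded
sh.shardCollection("mydb.users", { username: 1 })   // range-shard the users collection on username
// From here on, a write sent to mongos is routed to the shard (replica set)
// whose chunk range contains the document's username value.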

A shard is the sum of a primary and secondaries (a replica set), so yes, you would have to have a replica set in each shard.
The portion of the entire data set is held on the primary and replicated to the secondaries to maintain consistency. If the primary goes down, a secondary is elected as the new primary; it has the same data as its predecessor, so it can begin serving immediately. That means the sharded data is still present and not lost.

You would typically map individual shards to separate replica sets.
See http://docs.mongodb.org/manual/core/sharded-clusters/ for an overview of MongoDB sharding.

Related

Guarantee consistency of data across microservices accessing a sharded cluster in MongoDB

My application is essentially a bunch of microservices deployed across Node.js instances. One service might write some data while a different service will read those updates. (As a specific example, I'm processing data that is inbound to my solution using a processing pipeline: stage 1 does something, stage 2 does something else to the same data, etc. It's a fairly common pattern.)
So, I have a large data set (~250GB now, and I've read that once a DB gets much larger than this size, it is impossible to introduce sharding to a database, at least, not without some major hoop jumping). I want to have a highly available DB, so I'm planning on a replica set with at least one secondary and an arbiter.
I am still researching my 'sharding' options, but I think that I can shard my data by the 'client' that it belongs to and so I think it makes sense for me to have 3 shards.
First question, if I am correct, if I have 3 shards and my replica set is Primary/Secondary/Arbiter (with Arbiter running on the Primary), I will have 6 instances of MongoDB running. There will be three primaries and three secondaries (with the Arbiter running on each Primary). Is this correct?
Second question. I've read conflicting info about what 'majority' means... If I have a Primary and Secondary and I'm writing using the 'majority' write acknowledgement, what happens when either the Primary or Secondary goes down? If the Arbiter is still there, the election can happen and I'll still have a Primary. But, does Majority refer to members of the replication set? Or to Secondaries? So, if I only have a Primary and I try to write with 'majority' option, will I ever get an acknowledgement? If there is only a Primary, then 'majority' would mean a write to the Primary alone triggers the acknowledgement. Or, would this just block until my timeout was reached and then I would get an error?
Third question... I'm assuming that as long as I do writes with 'majority' acknowledgement and do reads from all the Primaries, I don't need to worry about causally consistent data? I've read that doing reads from 'Secondary' nodes is not worth the effort. If reading from a Secondary, you have to worry about 'eventual consistency' and since writes are getting synchronized, the Secondaries are essentially seeing the same amount of traffic that the Primaries are. So there isn't any benefit to reading from the Secondaries. If that is the case, I can do all reads from the Primaries (using 'majority' read concern) and be sure that I'm always getting consistent data and the sharding I'm doing is giving me some benefits from distributing the load across the shards. Is this correct?
Fourth (and last) question... When are causally consistent sessions worthwhile? If I understand correctly, and I'm not sure that I do, then I think it is when I have a case like a typical web app (not some distributed application, like my current one), where there is just one (or two) nodes doing the reading and writing. In that case, I would use causally consistent sessions and do my writes to the Primary and reads from the Secondary. But, in that case, what would the benefit of reading from the Secondaries be, anyway? What am I missing? What is the use case for causally consistent sessions?
if I have 3 shards and my replica set is Primary/Secondary/Arbiter (with Arbiter running on the Primary), I will have 6 instances of MongoDB running. There will be three primaries and three secondaries (with the Arbiter running on each Primary). Is this correct?
A replica set Arbiter is still an instance of mongod. It's just that an Arbiter does not have a copy of the data and cannot become a Primary. You should have 3 instances per shard, which means 9 instances in total.
Since you mentioned that you would like to have a highly available database deployment, please note that the minimum recommended replica set members for production deployment would be a Primary with two Secondaries.
If I have a Primary and Secondary and I'm writing using the 'majority' write acknowledgement, what happens when either the Primary or Secondary goes down?
When either the Primary or Secondary becomes unavailable, a w: majority write will either:
Wait indefinitely,
Wait until the unavailable node is restored, or
Fail with a timeout.
This is because an Arbiter carries no data and is unable to acknowledge writes, but is still counted as a voting member. See also Write Concern for Replica Sets.
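As an illustration (the collection name and timeout value are arbitrary), this is what a majority write with a timeout looks like in the mongo shell; in a Primary/Secondary/Arbiter deployment with one data-bearing node down it would return a write concern timeout error rather than hang forever:

db.users.insertOne(
  { username: "alice" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }   // wait at most 5s for a data-bearing majority to acknowledge
)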
I can do all reads from the Primaries (using 'majority' read concern) and be sure that I'm always getting consistent data and the sharding I'm doing is giving me some benefits from distributing the load across the shards
Correct. MongoDB sharding scales horizontally by distributing load across shards, while MongoDB replication provides high availability.
If you read only from the Primary and also specify readConcern: majority, the application will read data that has been acknowledged by the majority of the replica set members. This data is durable in the event of a partition (i.e. it will not be rolled back). See also Read Concern 'majority'.
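For example (the collection and filter are placeholders), a majority read can be requested explicitly in the mongo shell:

db.users.find({ username: "alice" }).readConcern("majority")   // returns only majority-committed data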
What is the use case for causally consistent sessions?
Causal Consistency is used if the application requires an operation to be logically dependent on a preceding operation (causal). For example, a write operation that deletes all documents based on a specified condition and a subsequent read operation that verifies the delete operation have a causal relationship. This is especially important in a sharded cluster environment, where write operations may go to different replica sets.
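A rough sketch of such a session in the mongo shell (the database and collection names are invented for the example):

session = db.getMongo().startSession({ causalConsistency: true })
sdb = session.getDatabase("mydb")
sdb.orders.deleteMany({ status: "stale" })      // write: delete all documents matching the condition
sdb.orders.find({ status: "stale" }).count()    // read that is causally ordered after the delete
session.endSession()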

MongoDB SHARDING_FILTER in plan

I have a problem on a Sharded Cluster. I'm testing performance to compare Sharded Cluster versus Replica Set.
I inserted data into Shard 1 directly, without going through mongos, and then queried it with an aggregation query, but I cannot find it. The explain plan shows a "SHARDING_FILTER" stage on the Primary of the shard, but there is no such stage in the Secondary's explain plan.
What configuration controls this?
MongoDB version: 3.0.12
I inserted data into Shard 1 directly, without going through mongos, and then queried it with an aggregation query, but I cannot find it.
It's not entirely clear what your performance comparison is, but irrespective you should always interact with data via mongos for a sharded cluster.
The role of mongos includes keeping track of the sharded cluster metadata (as cached from the config servers), observing data inserts/updates/deletions, and routing requests. Bypassing mongos will lead to potential complications in collection/data visibility (as you have observed) because you are skipping some of the expected data management infrastructure for your sharded deployment.
The explain plan shows a "SHARDING_FILTER" stage on the Primary of the shard, but there is no such stage in the Secondary's explain plan.
Secondary reads are eventually consistent, so the state of data on a given secondary may not necessarily match the current sharded cluster metadata. This becomes more problematic with many shards: with a secondary read preference results can potentially be combined from secondaries with significant differences in replication lag.
For consistent queries for a sharded cluster you should always use primary reads (which is the default behaviour) via mongos. Queries against primaries through mongos may include a SHARDING_FILTER stage which filters result documents that are not owned by the current shard (for example, due to migrations in progress where documents need to transiently exist on both a donor and target shard).
As of MongoDB 3.4, secondaries do not have the ability to filter results because they would need to maintain a separate view of the cluster metadata which matches their eventually consistent state. There's a relevant Jira issue to watch/upvote: SERVER-5931 - Secondary reads in sharded clusters need stronger consistency. I currently would not recommend secondary reads in a sharded cluster (or in general) without careful consideration of the impact of eventual consistency on your use case. For the general case, please read Can I use more replica nodes to scale?.
What configuration controls this?
Use the default read preference (primary reads) and always interact with your sharded deployment through mongos.
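For instance, in a mongo shell connected to mongos you can make the (default) primary read preference explicit, either per connection or per query; db.users is a placeholder collection:

db.getMongo().setReadPref("primary")    // connection-wide: route reads to the shard primaries
db.users.find().readPref("primary")     // or per query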

How to understand "The shards are replica sets."

When I put shard and Replica Set together, I am confused.
Why does the reference say that the shards are replica sets?
Do replica sets contain shards?
Can someone give me a conceptual explanation?
A replica set is a cluster of MongoDB servers that implements a master-slave setup. Basically, the same data is shared between multiple replicas, i.e. the master and the slave(s). The master is also termed the primary node, and the slave(s) are considered secondary nodes.
It replicates your data on multiple mongod instances to survive failovers. MongoDB also automatically elects a new primary from among the secondaries whenever the current primary goes down.
Sharding is used to store a large data set across multiple machines. So basically, if you simply want to compare: sharded nodes do not (or may not) contain the same data, whereas replicated nodes contain the same data.
Sharding has a different purpose: a large data set is spread across multiple machines.
Now, each subset of this large data set can also be replicated to multiple nodes, as a primary and secondaries, to overcome failovers. So basically each shard is backed by its own replica set, and that replica set contains a subset of the large data set.
So, multiple shards together make up the whole large data set, which is split up in the form of chunks. These chunks are replicated within a shard by its replica set.
You can also get more details related to this in MongoDB manual.
Sharding happens one level above replication.
When you use both sharding and replication, your cluster consists of many replica-sets and one replica-set consists of many mongod instances.
However, it is also possible to create a cluster of stand-alone mongod instances which are not replicated or have only some shards implemented as replica-sets and some shards implemented as stand-alone mongod instances.
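For reference, this is roughly how shards are registered from a mongos; the replica set name and hostnames below are placeholders:

sh.addShard("rs1/host1.example.net:27018,host2.example.net:27018,host3.example.net:27018")   // a shard backed by a replica set
sh.addShard("host9.example.net:27018")   // a stand-alone mongod as a shard (only allowed in older MongoDB versions)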
It is "each shard is a replica set", not "the shards together are a replica set".
This is a language barrier: in English, the docs' phrasing really means the same as "each shard is a replica set" in this context.
So to explain: say you have a collection of names a-z, and Shard 1 holds a-b. That shard is also a replica set, which means it has automated failover and replication of that range as well. So sharding, in this sense, is a top-level term that sits above replica sets.
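To make that a-z example concrete (the namespace and split point are hypothetical), a range-sharded collection can be split so that one shard's replica set ends up owning the a-b chunk:

sh.shardCollection("mydb.people", { name: 1 })   // range-shard the collection on name
sh.splitAt("mydb.people", { name: "c" })         // chunk boundary: names before "c" form one chunk, which Shard 1 could own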
Shards are used to break a collection apart and store its parts in different places. It is not necessary that a shard be a replica set; it can be a single server. But to achieve reliability and avoid loss of data, a replica set can be used as a shard instead of a single server. That way, if one of the servers in the replica set goes down, the others will still hold the data.

Why is an arbiter needed for an election in a primary - secondary - arbiter MongoDB replica set?

Mongo docs list this three-member configuration: primary, secondary, arbiter, as the minimal architecture of a replica set.
Why would an arbiter be necessary there? If the primary fails, the secondary won't see the heartbeat, so it needs to become primary. In other words, why wouldn't a primary + secondary configuration be sufficient? This related question doesn't seem to address the issue, as it discusses larger numbers of nodes.
Suppose you have only two servers, one primary and one secondary.
If suddenly the secondary cannot reach the primary server, it could be that the primary is down (in that case the secondary should become primary), but it could just as well be a network issue that isolated the secondary (in which case the secondary is the one that is in fact down).
However, if you have an arbiter and the secondary cannot reach the primary but it CAN reach the arbiter, then the issue is with the primary, so the secondary must become the new primary. If it CANNOT reach the primary, nor the arbiter, then the secondary knows that the issue is that it is isolated/broken -poor secondary :(- so it must not become the primary.
If you bring the Arbiter down to its core, it is essentially a non-data-holding member used for voting.
One case for an Arbiter is as I state in the linked question, Why do we need an 'arbiter' in MongoDB replication?: to break the problems of CAP. But that is not its true purpose, since you could easily replace that Arbiter with a data-holding node and have the same effect.
However, an Arbiter will have a few benefits:
Small footprint
No data
No need to sync
Can vote instantly
Can be put literally anywhere in your network, on an app server or even alongside another secondary, to strengthen that part of your network (this comes into play with partitions).
So an Arbiter is extremely useful, even on one side of a partition (i.e. when you have no partitioning in your network).
Now to explain the base setup. An Arbiter is NOT strictly required; you could swap it for a data-holding node. But 3 data-holding nodes is not the minimum (3 voting members is the minimum you need to keep automatic failover); 2 data-holding nodes and 1 Arbiter is actually the minimum.
Now to answer:
In other words, why wouldn't a primary + secondary configuration be sufficient?
Because if one of those goes down there is only 50% of the vote left (2 - 1 = 1), and 50% is not classed as a sufficient majority for MongoDB to actually vote in a member (judged against the total number of configured voting members in your rs.config).
Also in this case MongoDB does not actually know if that last member is the last member. It needs other members to tell it otherwise.
So yes, this is why you need a third guy.
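For completeness, a minimal primary-secondary-arbiter configuration looks roughly like this in the mongo shell (the hostnames are placeholders); the arbiter is the third voting member that breaks the 50/50 tie described above:

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "node1.example.net:27017" },                    // data-bearing, can become primary
    { _id: 1, host: "node2.example.net:27017" },                    // data-bearing, can become primary
    { _id: 2, host: "node3.example.net:27017", arbiterOnly: true }  // votes only, holds no data
  ]
})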

Mongo sharding confusion with replication turned on

Let's say I have 3 machines. Call them A, B, C. I have mongo replication turned on. So assume A is primary, and B and C are secondaries.
But now I would like to shard the database. However, I believe all writes still go to A since A is primary. Is that true? Can somebody explain the need for sharding and how it increases write throughput?
I understand mongos will find the right shard. But the writes will have to be done on the primary, and the rest of the machines, B and C, will follow. Am I missing something?
I also don't understand the statement: "Each shard is a replica set". The shard is incomplete unless it is also the primary.
Sharding is completely different to replication.
I am now going to attempt to explain your confusion without making you more confused.
When sharding is taken into consideration, a replica set is a replicated range of the sharded information.
As such, every shard in the cluster is actually a replica set in itself, holding a range (it can hold multiple chunk ranges, but that's another topic) of the sharded data and replicating that data across the members of that self-contained replica set.
Imagine that each shard has a primary of its own replica set, with its own set of members and whatnot. This is what is meant by each shard being a replica set.
As such, each shard has its own self-contained primary, and at the same time the primary of each shard's replica set can receive writes from a mongos whenever the range that the replica set holds matches the shard key target being sent down by the client.
I hope that makes sense.
Edit
This post might help: http://www.kchodorow.com/blog/2010/08/09/sharding-and-replica-sets-illustrated/
Does that apply here ? Say I want writes acknowledged from 2 mongod instances. Is that possible ? I still have trouble understanding. Can you please explain with an example ?
OK, let's take an example. You have 3 shards which are in fact 3 replica sets, called rs1, rs2 and rs3, each consisting of 3 nodes (1 primary and 2 secondaries). If you want a write to be acknowledged by 2 of the members of the replica set, then you can do that as you normally would:
db.users.insert({ username: 'sammaye' }, { writeConcern: { w: 2 } })
Where username is the shard key.
The mongos this query is sent to will seek out the correct shard, which is in fact a replica set, connect to that replica set, and then perform the insert as normal.
So as an example, rs2 actually holds the range of all usernames starting with the letters m-s. The mongos will use its internal mapping of rs2 (obtained when connecting to the replica set), along with your write/read concern, to judge which members to read from and write to.
Tags will apply all the same too.
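If you want to see this mapping yourself, running sh.status() against a mongos prints the shards, the sharded collections, and which shard (replica set) owns which chunk range:

sh.status()   // lists shards and the chunk ranges each one owns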
If mongos finds a shard which is not the "primary" (as in replication), is the write still performed on the secondary?
You are still confusing yourself here: there is no "primary" like in replication at the shard-selection level. There is a primary shard (sometimes called a seed or master) for a sharded database, but that is a different thing entirely. The mongos does not have to write to that node in any way; it is free to write to any part of the sharded cluster that your queries allow it to.