How to avoid loss of internal state of a master during fail-over to new master during a network partition - distributed-computing

I was trying to implement a simple single master node against multiple backup nodes system to learn about distributed and fault tolerant architecture.
Currently this is what my system looks like:
N different nodes, each one identical. 1 master node running a simple webserver.
All nodes communicate with each other using simple heartbeat protocol and each maintain global state (count of nodes available, who is master, downtime and uptime of each other.)
If any node does not hear from master for some set time, if raises a alarm. If a consensus is reached that the master is down, new master is elected.
If the network of nodes gets partitioned.
And the master is in minor partition, then it will stop serving request and go down by itself after a set period of time. Minor group cannot elect master (some minimum nodes require to make decision)
New master gets selected in the major partition after a set time after not hearing from old master.
Now I am stuck with a problem, that is, in the step 4 above, there is a time gap where the old master is still serving the requests, while new master getting elected in the major node.
This seems can cause inconsistent data across the system if some client decided to write new data to old master. How we avoid this issue. Would be glad if someone points me to right direction.

Rather than accepting writes to the minority master, what you want is to simply reject writes to the old master in that case, and you can do so by attempting to verify its mastership with a majority of the cluster on each write. If the master is on the minority side of a partition, it will no longer be able to contact a majority of the cluster and so will not be able to acknowledge clients’ requests. This brief period of unavailability is preferable to losing acknowledged writes in quorum based systems.
You should read the Raft paper. You’re slowly moving towards an implementation of the Raft protocol, and it will probably answer many of the questions you might have alonggn the way.

Related

ActiveMQ Artemis cluster failover questions

I have a question in regards to Apache Artemis clustering with message grouping. This is also done in Kubernetes.
The current setup I have is 4 master nodes and 1 slave node. Node 0 is dedicated as LOCAL to handle message grouping and node 1 is the dedicated backup to node 0. Nodes 2-4 are REMOTE master nodes without backup nodes.
I've noticed that clients connected to nodes 2-4 is not failing over to the 3 other master nodes available when the connected Artemis node goes down, essentially not discovering the other nodes. Even after the original node comes back up, the client continues to fail to establish a connection. I've seen from a separate Stack Overflow post that master-to-master failover is not supported. Does this mean for every master node I need to create a slave node as well to handle the failover? Would this cause a two instance point of failure instead of however many nodes are within the cluster?
On a separate basic test using a cluster of two nodes with one master and one slave, I've observed that when I bring down the master node clients are connected to, the client doesn't failover to the slave node. Any ideas why?
As you note in your question, failover is only supported between a live and a backup. Therefore, if you wanted failover for clients which were connected to nodes 2-4 then those nodes would need backups. This is described in more detail in the ActiveMQ Artemis documentation.
It's worth noting that clustering and message grouping, while technically possible, is a somewhat odd pairing. Clustering is a way to improve overall message throughput using horizontal scaling. However, message grouping naturally serializes message consumption for each group (to maintain message order) which then decreases overall message throughput (perhaps severely depending on the use-case). A single ActiveMQ Artemis node can potentially handle millions of messages per second. It may be that you don't need the increased message throughput of a cluster since you're grouping messages.
I've often seen users simply assume they need a cluster to deal with their expected load without actually conducting any performance benchmarking. This can potentially lead to higher costs for development, testing, administration, and (especially) hardware, and in some use-cases it can actually yield worse performance. Please ensure you've thoroughly benchmarked your application and broker architecture to confirm the proposed design.

Is there any service registry without master-cluster/server?

I recently started to learn more about service registries and their usage in distributed architecture.
All the applications providing service registries that I found (etcd, Consul, or Zookeeper) are based on the same model: a master-server/cluster with leader election.
Correct me if I'm wrong but... doesn't this make the architecture less reliable ? In the sense that the master cluster brings a point-of-failure. To circumvent this we could always make a bigger cluster but it's more costly and/or less-performance effective.
My questions here are:
as all these service registries elect a leader — wouldn't it be possible to do the same without specifying the machines that form the master cluster but rather let them discover themselves through broadcasting and elect a leader or a leading group ?
does a service registry without master-server/cluster exists ?
and if not, what are the current limitations that prevent us from doing this ?
All of those services are based on one whitepaper - Google Chubby(https://ai.google/research/pubs/pub27897). The idea is to have fast and consistent configuration storage for distributed systems. To get there you need to eliminate a single point of failure. How you can do that? You introduce multiple machines storing the same data and also replicate the data. But in that case, considering unreliable network between those machines, how do you make sure that the data is consistent among nodes? You choose one of the nodes from the cluster to be Leader(using distributed leader election algorithm), if nodes have inconsistent values between them, it's a leaders job to pick the "correct" one. It looks like we've returned to a "single point of failure" situation, but in reality if the leader fails, the rest of the cluster just votes and promotes a new leader. So Leader role in those systems is NOT to be a Single point of truth, but rather a Single point of decision making

Why does a mongodb replica set need an odd number of voting members?

If find the replica set requirement a bit confusing, and I'm probably missing something obvious (like under which condition there are elections).
I understand that in normal operations you need quorum, and a voting takes place and to get a majority you need and odd numbers of machines.
But since we use a replica set for failover, if the master dies, then we are left with an even number of voting members, which based on my limited experience lengthen the time to elect a primary.
Also according to the documentation, the addition of a voting member doesn't start an election, it would seem that starting (booting) you replica set with an even number of nodes would make more sense?
So if we start say with 4 machines in the replica set, and one machine dies, there is a re-election with 3 machines, fast quorum. We add a machine back to get back to our normal operation state, no re-election and we are back to our normal operation conditions.
Can someone shed a light on this?
TL;DR: With single master systems, even partitions make it impossible to determine which remainder still has a majority, taking both systems down.
Let N be a cluster of four machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
Let M be a cluster of three machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
=> Same result at 3/4 of the cost.
Now, let's add an assumption or two:
We're also going to operate some kind of server application that uses the database
The network can be partitioned
Let's say you have two datacenters, one with two database instances and the backend server machines. If the connection to the backup center (which has one MongoDB instance) fails, you're still online.
Now if you added a second MongoDB instance at the backup data center, a network partition would, despite seemingly higher redundancy, yield lower availability since we'd lose the majority in case of a network partition and can't continue to operate.
=> Less availability at higher cost. But that doesn't answer the question yet.
Let's say you're really worried about availability: You have two data centers, with backend servers in both datacenters, anycast IPs, the whole deal. Now the network between the two DCs is partitioned, but some clients connect to DC A while other reach DC B. How do you now determine which datacenter may accept writes? It's not possible - this is why the odd number is necessary.
You don't actually need Anycast IPs, BGP or any fancy stuff for the problem to become real, any writing application (like a worker, a stale request, anything) would require later merging different writes, which is a completely different concurrency scheme.

Why ZooKeeper needs majority to run?

I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Lets say we have a very simple ensemble of 3 machines - A,B,C.
When A fails, new leader is elected - fine, everything works. When another one dies, lets say B, service is unavailable. Does it make sense? Why machine C cannot handle everything alone, until A and B are up again?
Since one machine is enough to do all the work (for example single machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed in this way? Is there a way to configure ZooKeeper that, for example ensemble is available always when at least one of N is up?
Edit:
Maybe there is a way to apply a custom algorithm of leader selection? Or define a size of quorum?
Thanks in advance.
Zookeeper is intended to distribute things reliably. If the network of systems becomes segmented, then you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved, it won't know what to do. If you have it refuse to operate when it's got less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically in a network failure you don't want the two parts of the system to continue as usual. you want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that one is to hold a shared resource, for instance a shared disk where the leader holds a lock, if you can see the lock you are part of the cluster if you don't you're out. If you are holding the lock you're the leader and if you don't your not. The problem with this approach is that you need that shared resource.
The other way to prevent a split-brain is majority count, if you get enough votes you are the leader. This still works with two nodes (for a quorum of 3) where the leader says it is the leader and the other node acting as a "witness" also agrees. This method is preferable as it can work in a shared nothing architecture and indeed that is what Zookeeper uses
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.

Why is Chubby lockserver not multi-master?

As I understand Chubby at any given time there are 5 chubby servers. One is the master and handles coordination of writes to the quorum, and the other 4 servers are read only and forward handling of writes to the master. Writes use Paxos to maintain consistency.
Can someone explain to me why there is a distinction between the master and the 4 replicas. Why isn't Chubby multi-master? This question could also apply to Zookeeper.
Having a single master is more efficient because nodes don't have to deal with as much contention.
Both Chubby and Zookeeper implement a distributed state-machine where the point of the system is to decide a total ordering on transitions from one state to the next. It can take a lot of messages (theoretically infinite messages) to resolve a contention when multiple nodes are proposing a transition at the same time.
Paxos (and thus Chubby) uses an optimization called a "distinguished leader" where the replicas forward writes to the distinguished leader to reduce contention. (I am assuming Chubby replicas do this. If not they could, but the designers merely push that responsibility to the client.) Zookeeper does this too.
Both Chubby and Zookeeper actually do handle multiple leaders because they must deal with a leader that doesn't know it has died and then comes back from the dead. For Chubby, this is the whole point of using Paxos: eventually one of the leaders will win. (Well theoretically it may not, but we engineers do practical things like randomized backoff to make that probability tolerably small.) Zookeeper, on the other hand, assigns a non-decreasing id to each leader; and any non-current leader's transitions are rejected. This means that when a leader dies, Zookeeper has to pause and go through a reconfiguration phase before accepting new transitions.