Why does Kubernetes HA need an odd number of masters?

I have a Kubernetes HA environment with three masters. As a test, I shut down two masters (killed the apiserver/kcm/scheduler processes), and the remaining master still worked fine. I could use kubectl to create a deployment successfully, and pods were scheduled to different nodes and started. So can anyone explain why an odd number of masters is advised? Thanks.

Because if you have an even number of servers, it's a lot easier to end up in a situation where the network breaks and you have exactly 50% on each side. With an odd number, you can't (easily) have a situation where more than one partition in the network thinks it has majority control.

Short answer: to get higher fault tolerance for etcd.
etcd uses Raft for leader election. An etcd cluster needs a majority of nodes, a quorum, to agree on a leader. For a cluster with n members, quorum is (n/2)+1 (integer division).
In terms of fault tolerance, adding a node to an odd-sized cluster actually decreases it. How? The number of nodes that may fail without losing quorum stays the same, but there are now more nodes that can fail, so the probability of losing quorum is higher than before.
For more on fault tolerance, see the official etcd documentation.
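The quorum arithmetic above can be checked in a few lines; this is just the (n/2)+1 formula from the answer, nothing etcd-specific:

```python
# Quorum and fault tolerance for an etcd cluster of n members.
# quorum(n) is the majority needed to agree; fault_tolerance(n) is how
# many members may fail while a majority still survives.

def quorum(n: int) -> int:
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    return n - quorum(n)

for n in range(1, 8):
    print(f"{n} members: quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
# Note that 3 and 4 members both tolerate only 1 failure: the extra even
# member adds nothing except another way to lose quorum.
```

This is why 3 → 4 buys you nothing: quorum rises from 2 to 3, so the tolerated failure count stays at 1.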

Related

Is it possible to overwrite the etcd quorum?

Due to a customer's requirements, I have to install k8s on two nodes (there are only two available nodes, so adding another one is not an option). In this situation, setting up etcd on only one node would cause a problem if that node went down; however, setting up an etcd cluster (one etcd per node) still would not solve the problem, since if one node went down the etcd cluster would not be able to achieve the majority vote for quorum. So I was wondering if it would be possible to override the "majority rule" so that the quorum could just be 1? Or is there any other way to solve this?
No, you cannot force it to lie to itself. What you see is what you get: two nodes provide the same redundancy as one.

Kubeadm replace node

I have a K8s cluster with 3 nodes (VMs), each acting as both master and worker ("untainted" masters) with etcd installed on all 3 nodes, installed via kubespray (a kubeadm-based tool).
Now I would like to replace one VM with another.
Is there a direct way to do so? The only workaround I see is adding 2 nodes (to always keep an odd number of etcd members) via kubespray's scale.yml, e.g. node4 and node5, then removing node3 and node5 and keeping node4.
I don't like that approach.
Any ideas are welcome.
Best regards
If you have 3 control plane nodes (please avoid using "master", 💪) you should be fine replacing 1 at a time. While one node is down, etcd still has 2 of its 3 members, which is a majority, so the cluster can keep making decisions and scheduling new workloads; you just have no tolerance for a second failure during the replacement.
The recommendation of 5 control plane nodes is based on the fact that you keep a majority to reach quorum on the state decisions 📝 for etcd even if two nodes go down at once.
In other words:
3 control plane nodes
Quorum is 2, so the cluster can tolerate the failure of one node and still make decisions.
A second failure loses quorum.
5 control plane nodes
Quorum is 3, so the cluster can tolerate the failure of two nodes and still make decisions.
A third failure loses quorum.
To summarize, Raft 🚣, the consensus protocol used by etcd, tolerates up to (N-1)/2 failures and requires a majority, or quorum, of (N/2)+1 (integer division). The recommended procedure is to replace nodes one at a time: bring one node down, bring a new one up, and wait for it to join the cluster (all control plane components) before moving on to the next.
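As a sanity check on that one-at-a-time procedure, here is a small sketch (plain quorum arithmetic, nothing kubespray-specific) of whether etcd stays available while exactly one member is down for replacement:

```python
def quorum(n: int) -> int:
    # Majority required by Raft for an n-member etcd cluster.
    return n // 2 + 1

def available_during_replacement(n: int) -> bool:
    # One member is down while being swapped out; the remaining members
    # must still form a majority for the cluster to keep making decisions.
    return (n - 1) >= quorum(n)

print(available_during_replacement(2))  # False: 1 of 2 is not a majority
print(available_during_replacement(3))  # True:  2 of 3 is a majority
print(available_during_replacement(5))  # True:  4 of 5 is a majority
```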

Is a StatefulSet for a single replica overkill?

I need to run my application with "at most once" semantics. It is absolutely crucial that at most one instance of my app is running at any given time, or none at all.
At first I was using a Deployment with a single replica, but then I realized that during a network partition we might inadvertently end up running more than one instance.
I stumbled upon StatefulSets while searching for at-most-once semantics in Kubernetes. On reading further, the examples dealt with containers that needed a persistent volume, and typically these containers ran with more than one replica. My application does not even use any volumes.
I also read about tolerations that evict the pod when its node is unreachable. Given that tolerations can handle the node-unreachable case, is a StatefulSet overkill for my use case?
My justification for the StatefulSet: even if the node becomes unreachable, the toleration seconds elapse, and the kubelet realizes it is cut off from the network and kills the processes, Kubernetes can spin up another instance in the meantime; I believe a StatefulSet prevents this corner case too.
Am I right? Is there any other approach to achieve this?
To quote a Kubernetes doc:
...StatefulSets maintain a sticky, stable identity for their
Pods...Guaranteeing an identity for each Pod helps avoid split-brain
side effects in the case when a node becomes unreachable (network
partition).
As described in the same doc, when a node becomes unreachable, StatefulSet Pods on it are marked "Unknown" and are not rescheduled unless forcefully deleted. Something to consider for proper recovery, if going this route.
So, yes - a StatefulSet may be more suitable for the given use case than a Deployment.
In my opinion, it won't be overkill to use a StatefulSet - choose the Kubernetes object that works best for your use case.
StatefulSets are not the recourse for at-most-once semantics; they are typically used for deploying stateful applications such as databases, which use the persistent identity of their pods to cluster among themselves.
We have faced issues similar to what you mentioned: we had implicitly assumed that the old pod would be fully deleted before the new instance was brought up.
One option is to use a combination of a preStop hook and an init container:
The preStop hook does the necessary cleanup (say, deleting an app-specific etcd key).
The init container waits until the etcd key disappears (with an upper bound).
References:
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
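The init-container side of that approach can be sketched as follows. This is a minimal sketch, assuming a `key_exists` callable that checks whether the app-specific key is still present (how you query etcd - an etcd client, `etcdctl`, etc. - is up to you and not part of the answer above):

```python
import time

def wait_until_absent(key_exists, timeout_s=300.0, poll_s=2.0, sleep=time.sleep):
    """Poll until key_exists() returns False (i.e. the old instance's preStop
    cleanup has run), or raise TimeoutError after timeout_s - the upper bound
    the answer mentions for the init container."""
    deadline = time.monotonic() + timeout_s
    while True:
        if not key_exists():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("previous instance's key is still present")
        sleep(poll_s)
```

On timeout the init container would exit non-zero, so the pod keeps retrying instead of starting a second active instance alongside a half-dead one.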
Another alternative is to try anti-affinity settings, but I am not very sure about that one.
It is absolutely crucial that only one instance of my app is running at any given time.
Use a leader election pattern to guarantee at most one active replica. If you run more than one replica with leader election, the other replicas stand by, ready to take over, which also covers network partition situations. This is how the components in the Kubernetes control plane solve this problem when only one active instance is needed.
Leader election algorithms in Kubernetes usually work by taking a lock (e.g. in etcd) with a timeout. Only the instance holding the lock is active. Before the lock times out, the leader either extends it or a new leader is elected. How it works depends on the implementation, but there is a guarantee of at most one leader: the active instance.
See e.g. Simple Leader Election with Kubernetes, which also describes how to solve this with a sidecar container.
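To make the lock-with-timeout idea concrete, here is a toy in-process model of a lease; it is an illustrative sketch only - real implementations do this with an atomic compare-and-swap against a shared store such as etcd or a Kubernetes Lease object (e.g. via client-go's leaderelection package):

```python
import threading
import time

class LeaderLease:
    """Toy model of lease-based leader election. At most one candidate holds
    the lease at a time; the holder must renew before the TTL expires, or
    another candidate may take over."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._holder = None
        self._expires = 0.0
        self._lock = threading.Lock()  # stands in for the store's atomicity

    def try_acquire(self, candidate, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            # The lease is free if unheld or expired; only then may it change hands.
            if self._holder is None or now >= self._expires:
                self._holder = candidate
            if self._holder == candidate:
                # Acquiring and renewing both push the expiry forward.
                self._expires = now + self.ttl_s
                return True
            return False
```

Each replica periodically calls `try_acquire` with its own identity and does work only while it returns True; a standby wins only after the current holder has missed renewals for a full TTL, which is exactly the "at most one active instance" guarantee described above.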
If your application is stateless, you should use a Deployment, not a StatefulSet. A StatefulSet can appear to be a way to get at most one instance during a network partition, but it is mostly intended for stateful replicated applications such as a cache or database cluster, even though it may solve your specific situation as well.

In a Kubernetes cluster, does the master node always need to run alone on a cluster node?

I am aware that it is possible to allow the master node to execute pods, and that is my concern, since the default configuration does not allow the master to run pods. Should I change it? What is the reason for the default configuration being this way?
If the change is appropriate in some situations, I would like to ask whether my cluster is one of them. It has only three nodes with exactly the same hardware, and more nodes will probably not be added in the foreseeable future. In my opinion, since I have three equal nodes, it is a waste of resources to use 1/3 of my cluster's computational power just to run the Kubernetes master. Am I right?
[Edit1]
I have found the following reason in the Kubernetes documentation.
Is security the only reason?
Technically, it doesn't need to run on a dedicated node. But for your Kubernetes cluster to run, you need your masters to work properly, and one way to keep them secure, stable, and performant is to use a separate node that runs only the master components and no regular pods. If you share the node with other pods, there are several ways they can impact the master. For example:
The other pods can impact the performance of the masters (network or disk latency, CPU cache pressure, etc.).
They might be a security risk (if someone manages to break out of another pod into the master node).
A badly written application can cause stability issues on the node.
While it can be seen as wasting resources, you can also see it as the price of stability for your master / Kubernetes cluster. And it doesn't have to waste 1/3 of your resources: depending on how you deploy your cluster, you can use different hosts for different node roles, for example a small host for the master and bigger hosts for the workers.
No, this is not required, but it is strongly recommended. Security is one aspect, but performance is another. etcd usually runs on the control plane nodes, and it tends to chug if it runs out of IOPS. So a rogue pod running application code could destabilize the control plane, which in turn reduces your ability to fix the problem.
When running small clusters for testing purposes, it is common to run everything (control plane and workloads) on a single node specifically to save money/complexity.

Kubernetes: priorities for selecting the leader master node

I have a Kubernetes cluster with three master nodes acting in active/passive mode. One of them is in a separate location with higher latency and is more prone to network problems.
I would like to be able to set priorities for which master node should be the leader, if possible.
Does such a configuration exist?
Thank you.
You cannot change the priority of master nodes in the election. Here is the current implementation.
If you have a separate location with higher latency, it might be better to create two clusters and use them in a Federation?