Due to a customer's requirements, I have to install Kubernetes on two nodes (there are only two nodes available, so adding a third is not an option). In this situation, running etcd on only one node would cause a problem if that node went down; however, running an etcd cluster (one member per node) would not solve the problem either, because if one node went down, the etcd cluster could not reach the majority vote for quorum. So I was wondering if it would be possible to override the "majority rule" so that the quorum could just be 1? Or is there any other way to solve this?
No, you cannot force it to lie to itself. What you see is what you get: two nodes provide the same redundancy as one.
Related
I have a K8s cluster with 3 nodes (VMs), each acting as both master and worker ("untainted masters"), with etcd installed on all 3 nodes, installed via Kubespray (a kubeadm-based tool).
Now I would like to replace one VM with another.
Is there a direct way to do so? The only workaround I see is adding 2 nodes (e.g. node4 and node5), so there is always an odd number of etcd members, via Kubespray's scale.yml, then removing node3 and node5 and keeping node4.
I don't like the approach.
Any ideas are welcome.
Best Regards
If you have 3 main (please avoid using master, 💪) control plane nodes you should be fine replacing 1 at a time: with 2 of 3 etcd members still up you keep quorum, so the cluster can keep making decisions and scheduling new workloads, and the existing workloads run fine. What you lose during the replacement is headroom: one more failure and there is no quorum.
The recommendation of 5 main nodes is based on the fact that you still have a majority to reach quorum on the state decisions 📝 for etcd even if two nodes go down. So while you are replacing one node, the cluster can still survive an additional unplanned failure.
In other words:
3 main nodes
Can tolerate a failure of one node.
Cannot tolerate a second failure during that window: quorum would be lost.
5 main nodes
Can tolerate failures of two nodes.
Can still make decisions with one node down, because there are 4 nodes still available.
If a second failure happens during that time, it is still tolerated: 3 of 5 is a quorum.
To summarize: Raft 🚣, the consensus protocol used by etcd, tolerates up to (N-1)/2 failures and requires a majority, or quorum, of ⌊N/2⌋+1 members. The recommended procedure is to update the nodes one at a time: bring one node down, bring another one up, and wait for it to join the cluster (all control plane components) before moving to the next.
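The arithmetic behind those formulas is easy to check directly. A minimal sketch in Python, just to illustrate the quorum and fault-tolerance numbers discussed above:

```python
def quorum(n):
    # Majority required for an n-member Raft/etcd cluster: floor(n/2) + 1
    return n // 2 + 1

def fault_tolerance(n):
    # Members that can fail while a majority survives: n - quorum(n) == (n - 1) // 2
    return n - quorum(n)

for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Note that 3 and 4 members both tolerate exactly one failure, which is why even cluster sizes buy you nothing.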
I need to run my application with "at most once" semantics. It is absolutely crucial that only one instance of my app is running at any given time, or none at all.
At first I was using the resource type "Deployment" with a single replica, but then I realized that during a network partition we might inadvertently be running more than one instance.
I stumbled upon "StatefulSets" while searching for at-most-once semantics in Kubernetes. On reading further, the examples dealt with cases where the containers needed a persistent volume, and typically those containers ran with more than one replica. My application is not even using any volumes.
I also read about tolerations that kill the pod if the node is unreachable. Given that tolerations can handle the pod-unreachable case, is a StatefulSet overkill for my use case?
My justification for a StatefulSet is this corner case: the node becomes unreachable, the toleration seconds elapse, and before the kubelet realizes it is cut off from the network and kills its processes, Kubernetes can spin up another instance. I believe a StatefulSet prevents this corner case too.
Am I right? Is there any other approach to achieve this?
To quote a Kubernetes doc:
...StatefulSets maintain a sticky, stable identity for their Pods... Guaranteeing an identity for each Pod helps avoid split-brain side effects in the case when a node becomes unreachable (network partition).
As described in the same doc, when a node becomes unreachable, StatefulSet Pods on that node are marked "Unknown" and are not rescheduled unless forcefully deleted. Something to consider for proper recovery if going this route.
So, yes - a StatefulSet may be more suitable for the given use case than a Deployment.
In my opinion, it won't be overkill to use a StatefulSet - choose the Kubernetes object that works best for your use case.
StatefulSets are not the resource for at-most-once semantics - they are typically used for deploying "stateful" applications like databases, which use the persistent identity of their pods to cluster among themselves.
We have faced issues similar to what you mentioned - we had implicitly assumed that an old pod would be fully deleted before the new instance came up.
One option is to use a combination of preStop hooks and init containers:
The preStop hook does the necessary cleanup (say, deleting an app-specific etcd key).
The init container waits until the etcd key disappears (with an upper bound).
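As a sketch of what the init container's wait could look like (Python; the `key_exists` callback is a hypothetical stand-in for an etcd client lookup, and the timeout values are illustrative assumptions, not from the question):

```python
import time

def wait_for_key_release(key_exists, timeout_s=60.0, poll_s=0.5,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll until key_exists() reports the old instance's key is gone,
    giving up after timeout_s (the upper bound mentioned above)."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if not key_exists():
            return True   # previous instance cleaned up; safe to start
        sleep(poll_s)
    return False          # bound reached; old key still present
```

If this returns False, the init container should exit non-zero so the kubelet retries the pod later, rather than starting it alongside a possibly still-live old instance.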
References:
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
One alternative is to try anti-affinity settings, but I am not very sure about that one.
It is absolutely crucial that only one instance of my app is running at any given time.
Use a leader election pattern to guarantee at most one active replica. If you use more than one replica together with leader election, the other replicas are standbys that can take over, including in network partition situations. This is how the Kubernetes control plane components solve this problem when only one active instance is needed.
Leader election in Kubernetes usually works by taking a lock (e.g. in etcd) with a timeout. Only the instance holding the lock is active. Before the lock times out, the leader either extends it, or a new leader is elected. The details depend on the implementation, but there is a guarantee that there is at most one leader - the active instance.
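A toy sketch of that lock-with-timeout idea (Python; in a real cluster the lease would live in etcd or as a coordination.k8s.io Lease object, not in memory - this only illustrates the renew/expire logic, and the replica names are made up):

```python
import time

class Lease:
    """Toy in-memory lease. In Kubernetes the lease would be an object in
    etcd / the API server; this class only shows the renew/expire rules."""

    def __init__(self, duration):
        self.duration = duration   # seconds the lock stays valid after each renewal
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, candidate, now=None):
        """Acquire or renew the lease. The holder may always renew; others
        may take over only after the current lease has expired."""
        now = time.monotonic() if now is None else now
        if self.holder is None or self.holder == candidate or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + self.duration
            return True
        return False

lease = Lease(duration=15.0)
assert lease.try_acquire("replica-a", now=0.0)       # a becomes leader
assert not lease.try_acquire("replica-b", now=5.0)   # b must wait: lease valid until t=15
assert lease.try_acquire("replica-a", now=10.0)      # a renews; now valid until t=25
assert lease.try_acquire("replica-b", now=30.0)      # a stopped renewing; b takes over
```

The timeout is what makes this safe across a partition: a leader that cannot reach the store to renew loses the lease, and must stop acting as leader before the timeout elapses.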
See e.g. Simple Leader Election with Kubernetes, which also describes how to solve this with a sidecar container.
If your application is stateless, you should use a Deployment and not a StatefulSet. A StatefulSet can appear to be a way to guarantee at most one instance during a network partition, but it is mostly meant for stateful replicated applications like caches or database clusters, even though it may solve your specific situation as well.
I have three DaemonSet pods, each containing a Hadoop ResourceManager container. One of the three is the active node, and the other two are standby nodes.
So there are two questions:
Is there a way to let Kubernetes know whether the Hadoop ResourceManager inside a pod is an active node or a standby node?
I want to control the rolling update so that the standby nodes are updated first and the active node last, to reduce the number of active-node switchovers, which carry risk.
Consider the following: Deployments, DaemonSets and ReplicaSets are abstractions meant to manage a uniform group of objects.
In your specific case, although you're running the same application, you can't say it's a uniform group of objects, as you have two types: active and standby.
There is no way to tell Kubernetes which is which if they're grouped in what is supposed to be a uniform set of objects.
As suggested by #wolmi, having them in a Deployment instead of a DaemonSet still leaves you with the issue that deployment strategies can't individually identify objects to control when they're updated, because of the aforementioned logic.
My suggestion would be that, in addition to using a Deployment with node affinity to ensure a highly available environment, you separate the active and standby objects into different Deployments/Services and base your rolling update strategy on that.
This ensures you update the standby nodes first, removing the risk of updating the active node before the others.
I think this is not the best way to do it. I totally understand that you use a DaemonSet to be sure Hadoop runs one-per-node in an HA environment, but you can achieve the same layout using a Deployment and affinity parameters, more concretely pod anti-affinity; then you can be sure only one Hadoop pod exists per K8s node.
With that new approach, you can use a replication controller to control the rolling update. Some resources from the documentation:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
https://kubernetes.io/docs/tasks/run-application/rolling-update-replication-controller/
For HA and quorum I will install three master/etcd nodes in three different data centers.
But I want to configure one node to never become the leader - it should only act as a follower for etcd quorum.
Is this possible?
I believe that, today, this is not a supported option, and it is not recommended.
What you want is a 3-node control plane (including etcd) where one node participates in leader election but never becomes leader and doesn't store data. You are looking for something like the ARBITER feature that exists in a MongoDB HA cluster.
The ARBITER feature is not supported in etcd; you might need to raise a PR to get it addressed.
The controller manager and scheduler always connect to the local apiserver. You might want to route those calls to the apiserver on the active master; you might need to open another PR with the Kubernetes community to get that addressed.
I have a Kubernetes HA environment with three masters. As a test, I shut down two masters (killed the apiserver/kcm/scheduler processes), and the single remaining master still worked well. I could use kubectl to create a deployment successfully, and some pods were scheduled to different nodes and started. So can anyone explain why an odd number of masters is advised? Thanks.
Because with an even number of servers, it's a lot easier to end up in a situation where the network breaks and you have exactly 50% on each side. With an odd number, you can't (easily) have a situation where more than one partition of the network thinks it has majority control.
Short answer: to have higher fault tolerance for etcd.
Etcd uses Raft for leader election. An etcd cluster needs a majority of nodes - a quorum - to agree on a leader. For a cluster with n members, the quorum is ⌊n/2⌋+1.
In terms of fault tolerance, adding a node to an odd-sized cluster actually makes things worse. How? The number of nodes that may fail without losing quorum stays the same, but there are now more nodes that can fail, which means the chance of losing quorum is actually higher than before.
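That claim can be made concrete with a back-of-the-envelope calculation (Python; treating node failures as independent with a per-node failure probability p is a simplifying assumption):

```python
from math import comb

def p_lose_quorum(n, p):
    """Probability that more than (n - 1) // 2 of n members are down,
    assuming each member fails independently with probability p."""
    tolerated = (n - 1) // 2
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(tolerated + 1, n + 1))

# With p = 1%, a 4-member cluster is roughly twice as likely as a
# 3-member cluster to lose quorum, despite tolerating the same 1 failure.
print(p_lose_quorum(3, 0.01))
print(p_lose_quorum(4, 0.01))
```

Both cluster sizes tolerate exactly one failure, but the 4-member cluster has more ways to suffer two failures at once, so the extra even member only adds risk.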
For fault tolerance please check this official etcd doc for more information.