K8s fault tolerance

I was going through the differences between Swarm and K8s, and one of the cons of Swarm is that it has limited fault-tolerance functionality. How does K8s achieve fault tolerance? Is it via a K8s multi-master setup? Please share your inputs.

Yes! To achieve fault tolerance in Kubernetes, it is recommended to have multiple control plane (master) nodes, and if you are running in a cloud provider, multiple availability zones are recommended as well.
The Control Plane’s components make global decisions about the cluster (for example, scheduling), as well as detecting and responding to cluster events (for example, starting up a new pod when a deployment’s replicas field is unsatisfied).
Basically, the control plane is composed of these components:
kube-apiserver - Exposes the Kubernetes API. It is the front end for the Kubernetes control plane.
etcd - Consistent key/value store used as Kubernetes' backing store for all cluster data.
kube-scheduler - Watches for newly created pods with no assigned node and selects a node for them to run on.
kube-controller-manager - Runs the controllers; among their responsibilities are maintaining the correct number of pods for every replication controller, populating Endpoints objects, and responding when nodes go down.
cloud-controller-manager - Interacts with the underlying cloud provider.
Every cluster needs at least one worker node; the worker nodes are responsible for running your workloads.
[Diagram of a Kubernetes cluster with all the components tied together]
For more info, see the Kubernetes documentation on cluster components.

Yes, all Kubernetes control plane components are either clustered (etcd), run a leader election (the controllers), or are stateless and run flat behind a load balancer (the apiserver). Traditionally you run three control plane nodes, but you can do five in some complex topologies.
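To make the multi-control-plane idea concrete, here is a minimal kubeadm configuration sketch for a stacked-etcd HA control plane. This is only an illustration: the load-balancer address cp.example.com:6443 is hypothetical, and the exact apiVersion depends on your kubeadm release.

    # Hypothetical HA setup: a load balancer in front of three control plane
    # nodes, each running the apiserver plus a co-located etcd member.
    apiVersion: kubeadm.k8s.io/v1beta3            # use the version your kubeadm supports
    kind: ClusterConfiguration
    controlPlaneEndpoint: "cp.example.com:6443"   # stable address of the load balancer
    etcd:
      local:
        dataDir: /var/lib/etcd                    # stacked etcd on each control plane node

You would pass a file like this to kubeadm init --config on the first control plane node and join the others with kubeadm join ... --control-plane, so losing one control plane node takes out neither the API endpoint nor the etcd quorum.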

Related

Why is there no concept of a node pool in Kubernetes?

I can see that GKE, AKS, and EKS all have a built-in node pool concept, but Kubernetes itself doesn't provide that support. What could be the reason behind this?
We usually need different node types for different requirements, such as:
Some pods are CPU- or memory-intensive and need optimized nodes.
Some pods run ML/AI workloads and need GPU-enabled nodes. These GPU-enabled nodes should be used only by certain pods, as they are expensive.
Some pods/jobs want to leverage spot/preemptible nodes to reduce cost.
Is there any specific reason Kubernetes doesn't have such support built in?
Node Pools are cloud-provider specific technologies/groupings.
Kubernetes is intended to be deployed on various infrastructures, including on-prem/bare metal. Node Pools would not mean anything in this case.
Node Pools generally are a way to provide Kubernetes with a group of identically configured nodes to use in the cluster.
You would specify the node you want using node selectors and/or taints/tolerations.
So you could taint nodes that have a GPU and then require pods to carry the matching toleration in order to schedule onto those nodes. Node pools wouldn't make a difference here: you could join a physical server to the cluster and taint that node in exactly the same way. Kubernetes would not see it any differently from a Google, Amazon, or Azure-based node registered to the cluster, other than some different labels and annotations on the node.
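As a sketch of that approach (the node name gpu-node-1 and the taint/label key gpu are made up for illustration), you could taint and label the GPU nodes and give only the GPU workloads the matching toleration and node selector:

    # Taint and label the GPU node first (names/keys are hypothetical):
    #   kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
    #   kubectl label nodes gpu-node-1 gpu=true
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-training-job
    spec:
      nodeSelector:
        gpu: "true"              # only consider nodes labelled gpu=true
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"     # allowed onto the tainted GPU nodes
      containers:
      - name: trainer
        image: nvidia/cuda:12.2.0-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1    # assumes the NVIDIA device plugin is installed

Ordinary pods without the toleration will simply never land on those nodes, whether they came from a cloud node pool or are bare-metal machines you joined yourself.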
As Blender Fox mentioned, node groups/pools are a cloud-provider-specific grouping concept.
In AWS we have node groups (managed or self-managed, backed by auto-scaling/target groups), while in GKE we have node pools.
You set up the Cluster Autoscaler and it scales the node count in the node pool or node group up and down.
If you are running Kubernetes on-prem there may not be a node pool option, as a node group is mostly a group of VMs in the cloud, while on-prem the bare-metal machines also work as worker nodes.
To scale up and down there is the Cluster Autoscaler in K8s (the CA adds or removes nodes from the cluster by creating/deleting VMs), which uses the cloud provider's node group API; on bare metal it may not work out of the box.
Each provider has its own implementation and logic, which is selected on the Kubernetes side by the --cloud-provider flag.
So if you are on an on-prem private cloud, you can write your own cloud client that implements that interface.
In short, node groups are not required by Kubernetes itself; they are mostly a cloud-provider-side implementation.
For the scenario in the question:
Some pods are CPU- or memory-intensive and need optimized nodes. Some pods run ML/AI workloads and need GPU-enabled nodes; these GPU-enabled nodes should be used only by certain pods as they are expensive. Some pods/jobs want to leverage spot/preemptible nodes to reduce cost.
You can use taints/tolerations, affinity, or node selectors as needed to schedule the pods onto the specific type of node, as in the sketch below.
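For example, here is a hedged sketch of using node affinity to prefer cheaper spot/preemptible capacity. The label key node-lifecycle with value spot is hypothetical; each provider exposes its own well-known labels for spot/preemptible nodes.

    apiVersion: v1
    kind: Pod
    metadata:
      name: batch-worker
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100                     # prefer, but do not require, spot nodes
            preference:
              matchExpressions:
              - key: node-lifecycle         # hypothetical label; use your provider's label
                operator: In
                values: ["spot"]
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "echo processing && sleep 3600"]

Because the affinity is only preferred, the pod still schedules onto on-demand nodes when no spot capacity is available.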

How to avoid downtime during scheduled maintenance window

I'm experiencing downtime whenever the GKE cluster gets upgraded during the maintenance window. My services (APIs) become unreachable for around 5 minutes.
The cluster location type is set to "Zonal", and all my pods have 2 replicas. The only affected pods seem to be the ones behind the nginx ingress controller.
Is there anything I can do to prevent this? I read that using Regional clusters should prevent downtimes in the control plane, but I'm not sure if it's related to my case. Any hints would be appreciated!
You mention "downtime", but is this downtime for you using the control plane (i.e. kubectl stops working), or downtime in the sense that end users of your services stop seeing them work?
A GKE upgrade upgrades two parts of the cluster: the control plane or master nodes, and the worker nodes. These are two separate upgrades although they can happen at the same time depending on your configuration of the cluster.
Regional clusters can help with that; they will cost more because you are running more nodes, but the upside is that the cluster is more resilient.
Going back to the earlier point about control plane vs node upgrades: the control plane upgrade does NOT affect the end-user/customer perspective. The services will remain running.
The node upgrade WILL affect the customer, so you should consider various techniques to ensure high availability and resiliency in your services.
A common technique is to increase the number of replicas and also to add pod anti-affinity. This ensures the pods are scheduled on different nodes, so when the node upgrade comes around it doesn't take the entire service out just because the scheduler placed all the replicas on the same node.
You mention the nginx ingress controller in your question. If you are using Helm to install it into your cluster, then out of the box it is not set up to use anti-affinity, so it is liable to be taken out of service if all of its replicas get scheduled onto the same node and that node is then marked for upgrade or similar.
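As a sketch of that technique (the Deployment name and the app: my-api labels are made up), pod anti-affinity on the hostname topology key forces the replicas onto different nodes:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-api
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-api
      template:
        metadata:
          labels:
            app: my-api
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: my-api
                topologyKey: kubernetes.io/hostname   # at most one replica per node
          containers:
          - name: api
            image: nginx:1.25
            ports:
            - containerPort: 80

With the replicas on separate nodes, draining one node for an upgrade leaves the other replicas serving traffic; a softer preferredDuringSchedulingIgnoredDuringExecution rule is an alternative if you have fewer nodes than replicas.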

What is the difference between kubernetes labels node-role.kubernetes.io/master and node-role.kubernetes.io/control-plane?

I am a newbie to Kubernetes. I see that one of my nodes' role is control-plane,master. What is the difference?
Is a master node a node running kube-apiserver?
Then what defines a control-plane node?
I am using kubectl 1.20.2 (kubeadm is also 1.20.2).
The old node-role.kubernetes.io/master label and taint key have been deprecated and will be replaced by node-role.kubernetes.io/control-plane; both are valid during a transition period. Adding them both ensures backward compatibility while also supporting tools that use the newer terminology.
The reason for the name change is that the Kubernetes project is moving away from wording that is considered offensive. A new working group, WG Naming, has been created to track this work; the word "master" was declared offensive and the recommendation "master" -> "control plane" has been accepted:
Within the Kubernetes codebase, the term “master” is often used in reference to the Kubernetes control plane, either as a whole or to some subset of the components within. We recommend “control plane” to refer to the set of components as a whole. We recommend context-specific alternatives when talking about individual components or the roles they serve.
As part of the Kubernetes ecosystem, kubeadm complies with this recommendation; more information is in KEP-2067: Rename the kubeadm "master" label and taint:
Kubeadm applies a "node-role" label to its control plane Nodes. Currently this label key is node-role.kubernetes.io/master and it should be renamed to node-role.kubernetes.io/control-plane. Kubeadm also uses the same "node-role" as key for a taint it applies on control plane Nodes. This taint key should also be renamed to "node-role.kubernetes.io/control-plane".
This is also mentioned in the Kubernetes v1.20.0 release notes.
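To illustrate (the node name cp-1 is hypothetical, and the exact set of taints depends on your kubeadm version), a kubeadm v1.20 control plane node typically carries both role labels during the transition period:

    apiVersion: v1
    kind: Node
    metadata:
      name: cp-1
      labels:
        node-role.kubernetes.io/master: ""          # legacy label, deprecated
        node-role.kubernetes.io/control-plane: ""   # newer label
    spec:
      taints:
      - key: node-role.kubernetes.io/master          # taint key, renamed later per KEP-2067
        effect: NoSchedule

That is why kubectl get nodes shows the role as control-plane,master: the ROLES column is derived from whichever node-role.kubernetes.io/* labels are present on the node.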
A Kubernetes cluster has always consisted of nodes. Think of nodes as machines that host your applications/[micro]services.
In the old days, the cluster was separated into two logical areas:
Masters or Master Nodes - the set of nodes that comprise the Control Plane, which manages worker nodes, pods, and basically the entire K8s cluster infrastructure;
Worker Nodes - the set of nodes that actually host your application (or application components/microservices, if you wish).
According to the kubernetes.io documentation:
Master is a legacy term, used as synonym for nodes hosting the control plane. The term is still being used by some provisioning tools, such as kubeadm, and managed services, to label nodes with kubernetes.io/role and control placement of control plane pods.
Another quote here:
hosts running these components (components of the Control Plane) were historically called masters.
Today, the new Kubernetes terminology splits these two (Control Plane nodes and Worker Nodes) into:
Control Plane components (previously master nodes), and
Worker Nodes, which host the actual application components.
Last but not least: the Control Plane components (formerly the master nodes) are not just the Kubernetes API Server but mostly these (though not restricted to them):
etcd
API Server
Scheduler
Controller Manager
Cloud Controller Manager

Why do we need more than 3 masters for a Kubernetes HA cluster?

I see that most of the K8s master components have a leader election process, except the apiserver. If only one node will be the leader at any point in time, why would we need more than 3 masters for a bigger K8s cluster?
The requirement of a minimum of 3 hosts comes from the fact that a Kubernetes HA cluster uses etcd for storing and syncing configuration, and etcd requires a minimum of 3 nodes to ensure HA. More generally, etcd wants an odd number of members (2n+1) so that a majority of n+1 can still form a quorum after n failures, and you should size the control plane accordingly when deploying a Kubernetes HA cluster.
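To make the quorum arithmetic concrete (this is standard etcd/Raft majority math, not anything Kubernetes-specific): a cluster of N etcd members needs floor(N/2) + 1 members available to keep accepting writes, so it tolerates N - (floor(N/2) + 1) member failures:
1 member: majority 1, tolerates 0 failures
2 members: majority 2, tolerates 0 failures (no better than 1)
3 members: majority 2, tolerates 1 failure
5 members: majority 3, tolerates 2 failures
That is why 3 is the usual minimum for HA, and why you go to 5 (not 4) when you need to survive two simultaneous failures.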
In a single-master setup, the master node manages the etcd database, API server, controller manager, and scheduler, along with the worker nodes. However, if that single master node fails, the cluster loses its control plane: existing workloads may keep running, but nothing new can be scheduled and failures cannot be repaired, so the cluster is effectively lost.
A multi-master setup, by contrast, provides high availability for a single cluster and improves performance, because all the masters behave like a unified data center.
A multi-master setup protects against a wide range of failure modes, from the loss of a single worker node to the failure of a master node's etcd service. By providing redundancy, a multi-master cluster gives your end users a highly available system.
Do not use a cluster with two master replicas. Consensus on a two-replica cluster requires both replicas to be running when changing persistent state. As a result, both replicas are needed, and a failure of any replica turns the cluster into a majority-failure state. A two-replica cluster is thus inferior, in terms of HA, to a single-replica cluster.
Here is some useful documentation: kubernetes-ha-cluster, creating-ha-cluster.
Articles: ha-cluster, ha.

How does Kubernetes guarantee the reliability of kube-proxy and the kubelet?

If kube-proxy is down, the pods on a Kubernetes node will not be able to communicate with the external world. Does Kubernetes do anything special to guarantee the reliability of kube-proxy?
Similarly, how does Kubernetes guarantee the reliability of the kubelet?
It guarantees their reliability by:
Having multiple nodes: if one kubelet crashes, one node goes down. Similarly, every node runs a kube-proxy instance, so losing one node means losing only that node's kube-proxy. Kubernetes is designed to handle node failures, and if you design the app running on Kubernetes to be scalable, you will not run it as a single instance but as multiple instances; kube-scheduler distributes your workload across multiple nodes, so your application remains accessible (see the sketch below).
Supporting a highly available setup: if you set up your Kubernetes cluster in high-availability mode properly, there won't be one master node but multiple, which means you can tolerate losing some master nodes. The managed Kubernetes offerings of the cloud providers are typically highly available.
These are the first two things that come to mind. However, this is a broad question, so I can go into details if you elaborate on what you mean by "reliability".
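As a minimal sketch of that first point (the name my-app and the image are placeholders): run the workload as a multi-replica Deployment behind a Service, so that losing one node only removes some of the replicas while the ones on the remaining nodes keep serving traffic.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 3                 # spread across nodes by the scheduler
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: app
            image: nginx:1.25
            ports:
            - containerPort: 80
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
    spec:
      selector:
        app: my-app               # routes to ready pods matching this selector
      ports:
      - port: 80
        targetPort: 80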