kubernetes - can we create 2 node master-only cluster with High availability - kubernetes

I am new to the Kubernetes and cluster.
I would like to bring up an High Availability Master Only Kubernetes Cluster(Need Not to!).
I have the 2 Instances/Servers running Kubernetes daemon, and running different kind of pods on both the Nodes.
Now I would like to somehow create the cluster and if the one of the host(2) down, then all the pods from that host(2) should move to the another host(1).
once the host(2) comes up. the pods should float back.
Please let me know if there is any way i can achieve this?

Since your requirement is to have a 2 node master-only cluster and also have HA capabilities then unfortunately there is no straightforward way to achieve it.
Reason being that a 2 node master-only cluster deployed by kubeadm has only 2 etcd pods (one on each node). This gives you no fault tolerance. Meaning if one of the nodes goes down, etcd cluster would lose quorum and the remaining k8s master won't be able to operate.
Now, if you were ok with having an external etcd cluster where you can maintain an odd number of etcd members then yes, you can have a 2 node k8s cluster and still have HA capabilities.

It is possible that master node serves also as a worker node however it is not advisable on production environments, mainly for performance reasons.
By default, kubeadm configures master node so that no workload can be run on it and only regular nodes, added later would be able to handle it. But you can easily override this default behaviour.
In order to enable workload to be scheduled also on master node you need to remove from it the following taint, which is added by default:
kubectl taint nodes --all node-role.kubernetes.io/master-
To install and configure multi-master kubernetes cluster you can follow this tutorial. It describes scenario with 3 master nodes but you can easily customize it to your needs.

Related

How does kube-proxy behave when it can't reach the master?

From what I've read about Kubernetes, if the master(s) die, the workers should still be able to function as normal (https://stackoverflow.com/a/39173007/281469), although no new scheduling will occur.
However, I've found this to not be the case when the master can also schedule worker pods. Take a 2-node cluster, where one node is a master and the other a worker, and the master has the taints removed:
If I shut down the master and docker exec into one of the containers on the worker I can see that:
nc -zv ip-of-pod 80
succeeds, but
nc -zv ip-of-service 80
fails half of the time. The Kubernetes version is v1.15.10, using iptables mode for kube-proxy.
I'm guessing that since the kube-proxy on the worker node can't connect to the apiserver, it will not remove the master node from the iptables rules.
Questions:
Is it expected behaviour that kube-proxy won't stop routing to pods on master nodes, or is there something "broken"?
Are any workarounds available for this kind of setup to allow the worker nodes to still function correctly?
I realise the best thing to do is separate the CP nodes but that's not viable for what I'm working on at the moment.
Is it expected behaviour that kube-proxy won't stop routing to pods on
master nodes, or is there something "broken"?
Are any workarounds
available for this kind of setup to allow the worker nodes to still
function correctly?
The cluster master plays the role of decision maker for the various activities in cluster's nodes. This can include scheduling workloads, managing the workloads' lifecycle, scaling etc.. Each node is managed by the master components and contains the services necessary to run pods. The services on a node typically includes the kube-proxy, container runtime and kubelet.
The kube-proxy component enforces network rules on nodes and helps kubernetes in managing the connectivity among Pods and Services. Also, the kube-proxy, acts as an egress-based load-balancing controller which keeps monitoring the the kubernetes API server and continually updates node's iptables subsystem based on it.
In simple terms, the master node only is aware of everything and is in charge of creating the list of routing rules as well based on node addition or deletion etc. kube-proxy plays a kind of enforcer whereby it takes charge of checking with master, syncing the information and enforcing the rules on the list.
If the master node(API server) is down, the cluster will not be able to respond to API commands or deploy nodes. If another master node is not available, there shall be no one else available who can instruct the worker nodes on change in work allocation and hence they shall continue to execute the operations that were earlier scheduled by the master until the time the master node is back and gives different instructions. Inline to it, kube-proxy shall also be unable to get the latest rules by sync up with master, however it shall not stop routing and shall continue to handle the networking and routing functionalities (uses the earlier iptable rules that were determined before the master node went down) that shall allow network communication to your pods provided all pods in worker nodes are still up and running.
Single master node based architecture is not a preferred deployment architecture for production. Considering that resilience and reliability is one of the major business goal of kubernetes, it is recommended as a best practice to have HA cluster based architecture to avoid single point of failure.
Once you remove taints, kubernetes scheduler don't need any tolerations to schedule pods on your master node. So it is as good as your worker node with control plane components running on it and you can also run your workload pods on this node (although its not a recommended practice).
Kube-proxy (https://kubernetes.io/docs/concepts/overview/components/#kube-proxy) is the component deployed on all the nodes of cluster and it handles the networking and routing connection to your pods. So, even if your master node is down kube-proxy still works fine on the worker node and it will route traffic to your pods running on worker node.
If all your pods are running in worker nodes (which are still up and running), then kube-proxy will continue to route traffic to your pods even via service.
There is nothing inherent in Kubernetes that would cause this. The master node role is just for humans, and if you've removed the taints then the nodes are just normal nodes. That said, remember that usual rules about scheduling and resource requests apply so if your pods don't all fit then things wouldn't be scheduled. It's possible your Kubernetes deploy system set up more specialized firewall rules or similar around the control plane nodes, but that would be dependent on that system.

Can I make sure that a "cordoned" node is the one deleted when I downsize the cluster in Azure AKS?

I need to downsize a cluster from 3 to 2 nodes.
I have critical pods running on some nodes (0 and 1). As I found that the last node (2) in the cluster is the one that has the non critical pods, I have "cordoned" it so it won't get any new ones.
I wonder is if I can make sure that that last node (2) is the one that will be removed when I go to Azure portal and downsize my cluster to 2 nodes (it is the last node and it is cordoned).
I have read that if I manually delete the node, the system will still consider there are 3 nodes running so it's important to use the cluster management to downsize it.
You cannot control which node will be removed when scaling down the AKS cluster.
However, there are some workarounds for that:
Delete the cordoned node manually via portal and than launch upgrade. It would try to add the node but with no success because the subnet has no space left.
Another option is to:
Set up cluster autoscaler with two nodes
Scale up the number of nodes in the UI
Drain the node you want to delete and wait for autoscaler do it's job
Here are some sources and useful info:
Scale the node count in an Azure Kubernetes Service (AKS) cluster
Support selection of nodes to remove when scaling down
az aks scale
Please let me know if that helped.

Kubernetes cluster recovery after linux host reboot

We are still in a design phase to move away from monolithic architecture towards Microservices with Docker and Kubernetes. We did some basic research on Docker and Kubernetes and got some understanding. We still have couple of open question considering we will be creating K8s cluster with multiple Linux hosts (due to some reason we can't think about Cloud right now) .
Consider a scenario where we have K8s Cluster spanning over multiple linux hosts (5+).
1) If one of the linux worker node crashes and once we bring it back, does enabling kubelet as part of systemctl in advance will be sufficient to bring up required K8s jobs so that it be detected by master again?
2) I believe once worker node is crashed (X pods), after the pod eviction timeout master will reschedule those X pods into some other healthy node(s). Once the node is UP it won't do any deployment of X pods as master already scheduled to other node but will be ready to accept new requests from Master.
Is this correct ?
Yes, should be the default behavior, check your Cluster deployment tool.
Yes, Kubernetes handles these things automatically for Deployments. For StatefulSets (with local volumes) and DaemonSets things can be node specific and Kubernetes will wait for the node to come back.
Better to create a test environment and see/test the failure scenarios

Why kubernetes taints the master node with "NoSchedule" by default?

A few days ago, I looked up why none of pods are being scheduled to the master node, and found this question: Allow scheduling of pods on Kubernetes master?
It tells that it is because the master node is tainted with "NoSchedule" effect, and gives the command to remove that taint.
But before I execute that command on my cluster, I want to understand why it was there in the first place.
Is there a reason why the master node should not run pods? Any best-practices it relates to?
The purpose of kubernetes is to deploy application easily and scale them based on the demand. The pod is a basic entity which runs the application and can be increased and decreased based on high and low demands respectively (Horizontal Pod Autoscalar).
These worker pods needs to be run on worker nodes specially if you’re looking at big application where your cluster might scale upto 100’s of nodes based on demand (Cluster Autoscalar). These increasing pods can put up pressure on your nodes and once they do you can always increase the worker node in cluster using cluster autoscalar.
Suppose, you made your master schedulable then the high memory and CPU pressure put your master at risk of crashing the master. Mind you can’t autoscale the master using autoscalar. This way you’re putting your whole cluster at risk. If you have single master then your will not be able to schedule anything if master crashed. If you have 3 master and one of them crashed, then the other two master has to take the extra load of scheduling and managing worker nodes and increasing the load on themselves and hence the increased risk of failure
Also, In case of larger cluster, you already need the master nodes with high resources just to manage your worker nodes. You can’t put additional load on master nodes to run the workload as well in that case. Please have a look at the setting up large cluster in kubernetes here
If you have manageable workload and you know it doesn’t increase beyond a certain level. You can make master schedulable. However for production cluster it is not recommended at all.
Primary role of master is cluster management. Already many components of k8 are running on master.Suppose If pods scheduled on master without limit of resources and pods are consuming all the resources( cpu or memory), then master and in turn whole cluster will be at risk.
So while designing Highly Available production cluster minimum 3 master, 3 etcd, 3 infra node are created and application pods are not scheduled on these nodes. Separate worker nodes added to assign workload.
Master is intended for cluster management tasks and should not be used to run workloads. In development and test environments it is ok to schedule pods on master servers but in production better to keep it only for cluster level management activities. Use workers or nodes to schedule workloads

Redistribute pods after adding a node in Kubernetes

What should I do with pods after adding a node to the Kubernetes cluster?
I mean, ideally I want some of them to be stopped and started on the newly added node. Do I have to manually pick some for stopping and hope that they'll be scheduled for restarting on the newly added node?
I don't care about affinity, just semi-even distribution.
Maybe there's a way to always have the number of pods be equal to the number of nodes?
For the sake of having an example:
I'm using juju to provision small Kubernetes cluster on AWS. One master and two workers. This is just a playground.
My application is apache serving PHP and static files. So I have a deployment, a service of type NodePort and an ingress using nginx-ingress-controller.
I've turned off one of the worker instances and my application pods were recreated on the one that remained working.
I then started the instance back, master picked it up and started nginx ingress controller there. But when I tried deleting my application pods, they were recreated on the instance that kept running, and not on the one that was restarted.
Not sure if it's important, but I don't have any DNS setup. Just added IP of one of the instances to /etc/hosts with host value from my ingress.
descheduler, a kuberenets incubator project could be helpful. Following is the introduction
As Kubernetes clusters are very dynamic and their state change over time, there may be desired to move already running pods to some other nodes for various reasons:
Some nodes are under or over utilized.
The original scheduling decision does not hold true any more, as taints or labels are added to or removed from nodes, pod/node affinity requirements are not satisfied any more.
Some nodes failed and their pods moved to other nodes.
New nodes are added to clusters.
There is automatic redistribution in Kubernetes when you add a new node. You can force a redistribution of single pods by deleting them and having a host based antiaffinity policy in place​. Otherwise Kubernetes will prefer using the new node for scheduling and thus achieve a redistribution over time.
What are your reasons for a manual triggered redistribution​?