Why do we need PodDisruptionBudget on AKS? - kubernetes

I am going to implement PDB on AKS. Can someone please tell me why do we need it when we can use node autoscaler.
Also, does PDB allow zero unavailability by creating a node when one of the nodes fails?

PDB allows you to set rules before evicting your pods from a node.
Let's say you have a 2 nodes cluster and a deployment with 1 replica and you want to update your nodes.
kubectl drain will cordon node 1 so no pods can be schedule on that node
kubectl drain will remove the pod schedule on node 1
kubelet will then deploy your pod over node 2
Now if you set a PDB with a minAvailable: 50%, that drain command would fail as it would violates the rule.
The pods is killed and then kubelet tries to schedule it somewhere.
PDB allows you to prevent downtime by budgeting pods before evicting them.
Scenario without PDB
You perform node 1 update and node 2 cannot host the evicted pod :
pod is killed on node 1
kubelet cannot schedule pod anywhere
autoscaling provisions a third node
pod is scheduled on that new node
During that whole time your evicted pod was not running anywhere and your application was down.

Related

pod status when k8s upgrading cluster

The k8s documentation says that kubeadm upgrade does not touch your workloads, only components internal to Kubernetes, but I don't understand the status of the Pods at this time.
There are different upgrade strategies, but I assume you want to upgrade your cluster with zero downtime.
In this case, the upgrade procedure at high level is the following:
Upgrade control plane nodes - should be executed one node at a time.
Upgrade worker nodes - should be executed one node at a time or few nodes at a time, without compromising the minimum required capacity for running your workloads.
It's important to prepare the node for maintenance by marking it 'unschedulable' and evicting the workloads (moving the workloads to other nodes):
$ kubectl drain <node-to-drain> --ignore-daemonsets
NOTE: If there are Pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet , the drain operation will be refused, unless you use the --force option.
As you can see in the Safely Drain a Node documentation:
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). Safe evictions allow the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified.
If you finished the upgrade procedure on this node, you need to bring the node back online by running:
$ kubectl uncordon <node name>
To sum up: kubectl drain changes the status of the Pods (workflow moves to another node).
Unlike kubectl drain, kubeadm upgrade does not touch/affect your workloads, only modifies components internal to Kubernetes.
Using "kube-scheduler" as an example, we can see what exactly happens to the control plane components when we run the kubeadm upgrade apply command:
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2021-03-07-15-42-15/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
The way Pods can be scaled up and down, cluster can also be scaled up and down to increase/decrease the Nodes.
While scaling a cluster, if you increase the size of a node pool, existing Pods will not be moved to newer nodes.
While scaling a cluster, if you manually decrease the size of a node pool, any Pods on deleted nodes will not be restarted on other nodes.
If autoscaling decreases the size of a node pool, any Pods on deleted nodes that aren't managed by a replication controller will not be lost.

How to reschedule the pod from node in kubernetes ( baremetal servers )?

Kubernetes nodes are getting unscheduled while i initiate the drain or cordon but the pods which is available on the node are not getting moved to different node immediately ?
i mean, these pods are not created by daemonset.
So, how come, Application running pod can make 100% available when a node getting faulty or with some issues ?
any inputs ?
command used :
To drain / cordon to make the node unavailable:
kubectl drain node1
kubectl cordon node1
To check the node status :
kubectl get nodes
To check the pod status before / after cordon or drain :
kubectl get pods -o wide
kubectl describe pod <pod-name>
Surprising part is , even node is unavailable, the pod status showing always running. :-)
Pods by itself doesn't migrate to another node.
You can use workload resources to create and manage multiple Pods for you. A controller for the resource handles replication and rollout and automatic healing in case of Pod failure. For example, if a Node fails, a controller notices that Pods on that Node have stopped working and creates a replacement Pod. The scheduler places the replacement Pod onto a healthy Node.
Some examples of controllers are:
deployment
daemonset
statefulsets
Check this link to more information.

kubectl drain and rolling update, downtime

Does kubectl drain first make sure that pods with replicas=1 are healthy on some other node?
Assuming the pod is controlled by a deployment, and the pods can indeed be moved to other nodes.
Currently as I see it only evict (delete pods) from the nodes, without scheduling them first.
In addition to Suresh Vishnoi answer:
If PodDisruptionBudget is not specified and you have a deployment with one replica, the pod will be terminated and then new pod will be scheduled on a new node.
To make sure your application will be available during node draining process you have to specify PodDisruptionBudget and create more replicas. If you have 1 pod with minAvailable: 30% it will refuse to drain with following error:
error when evicting pod "pod01" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Briefly that's how draining process works:
As explained in documentation kubectl drain command "safely evicts all of your pods from a node before you perform maintenance on the node and allows the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified”
Drain does two things:
cordons the node- it means that node is marked as unschedulable, so new pods cannot be scheduled on this node. Makes sense- if we know that node will be under maintenance there is no point to schedule a pod there and then reschedule it on another node due to maintenance. From Kubernetes perspective it adds a taint to the node: node.kubernetes.io/unschedulable:NoSchedule
evicts/ deletes the pods- after node is marked as unschedulable it tries to evict the pods that are running on the node. It uses Eviction API which takes PodDisruptionBudgets into account (if it's not supported it will delete pods). It calls DELETE method to K8S but considers GracePeriodSeconds so it lets a pod finish it's processes.
New Pods are scheduled, when the number of pods are not available (desired state != current state) in respective of draining or node failure.
With the PodDisruptionBudget resource you can manage the disruption during the draining of the node.
You can specify only one of maxUnavailable and minAvailable in a single PodDisruptionBudget. maxUnavailable can only be used to control the eviction of pods that have an associated controller managing them. In the examples below, “desired replicas” is the scale of the controller managing the pods being selected by the PodDisruptionBudget. https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
Example 1: With a minAvailable of 5, evictions are allowed as long as they leave behind 5 or more healthy pods among those selected by the PodDisruptionBudget’s selector.
Example 2: With a minAvailable of 30%, evictions are allowed as long as at least 30% of the number of desired replicas are healthy.
Example 3: With a maxUnavailable of 5, evictions are allowed as long as there are at most 5 unhealthy replicas among the total number of desired replicas.
Example 4: With a maxUnavailable of 30%, evictions are allowed as long as no more than 30% of the desired replicas are unhealthy.

How to identify pod eviction policy?

I have a Kubernetes cluster deployed on GCP with a single node, 4 CPU's and 15GB memory. There are a few pods with all the pods bound to the persistent volume by a persistent volume claim. I have observed that the pods have restarted automatically and the data in the persistent volume is lost.
After some research, I suspect that this could be because of the pod eviction policy. When I used kubectl describe pod , I noticed the below error.
0/1 nodes are available: 1 node(s) were not ready, 1 node(s) were out of disk space, 1 node(s) were unschedulable.
The restart policy of my pods is "always". So I think that the pods have restarted after being resource deprived.
How do I identify the pod eviction policy of my cluster and change it? so that this does not happen in the future
pod eviction policy of my cluster and change
These thresholds ( pod eviction) are flags of kubelet, you can tune these values according to your requirement. you can edit the kubelet config file, here is the detail config-file
Dynamic Kubelet Configuration allows you to edit these values in the live cluster
The restart policy of my pods is "always". So I think that the pods have restarted after being resource deprived.
Your pod has been rescheduled due to node's issue (not enough disk space ).
The restart policy of my pods is "always".
It means if the pod is not up and running then try to restart it .

How to gracefully remove a node from Kubernetes?

I want to scale up/down the number of machines to increase/decrease the number of nodes in my Kubernetes cluster. When I add one machine, I’m able to successfully register it with Kubernetes; therefore, a new node is created as expected. However, it is not clear to me how to smoothly shut down the machine later. A good workflow would be:
Mark the node related to the machine that I am going to shut down as unschedulable;
Start the pod(s) that is running in the node in other node(s);
Gracefully delete the pod(s) that is running in the node;
Delete the node.
If I understood correctly, even kubectl drain (discussion) doesn't do what I expect since it doesn’t start the pods before deleting them (it relies on a replication controller to start the pods afterwards which may cause downtime). Am I missing something?
How should I properly shutdown a machine?
List the nodes and get the <node-name> you want to drain or (remove from cluster)
kubectl get nodes
1) First drain the node
kubectl drain <node-name>
You might have to ignore daemonsets and local-data in the machine
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
2) Edit instance group for nodes (Only if you are using kops)
kops edit ig nodes
Set the MIN and MAX size to whatever it is -1
Just save the file (nothing extra to be done)
You still might see some pods in the drained node that are related to daemonsets like networking plugin, fluentd for logs, kubedns/coredns etc
3) Finally delete the node
kubectl delete node <node-name>
4) Commit the state for KOPS in s3: (Only if you are using kops)
kops update cluster --yes
OR (if you are using kubeadm)
If you are using kubeadm and would like to reset the machine to a state which was there before running kubeadm join then run
kubeadm reset
Find the node with kubectl get nodes. We’ll assume the name of node to be removed is “mynode”, replace that going forward with the actual node name.
Drain it with kubectl drain mynode
Delete it with kubectl delete node mynode
If using kubeadm, run on “mynode” itself kubeadm reset
Rafael. kubectl drain does work as you describe. There is some downtime, just as if the machine crashed.
Can you describe your setup? How many replicas do you have, and are you provisioned such that you can't handle any downtime of a single replica?
If the cluster is created by kops
1.kubectl drain <node-name>
now all the pods will be evicted
ignore daemeondet:
2.kubectl drain <node-name> --ignore-daemonsets --delete-local-data
3.kops edit ig nodes-3 --state=s3://bucketname
set max and min value of instance group to 0
4. kubectl delete node
5. kops update cluster --state=s3://bucketname --yes
Rolling update if required:
6. kops rolling-update cluster --state=s3://bucketname --yes
validate cluster:
7.kops validate cluster --state=s3://bucketname
Now the instance will be terminated.
The below command only works if you have a lot of replicas, disruption budgets, etc. - but helps a lot with improving cluster utilization. In our cluster we have integration tests kicked off throughout the day (pods run for an hour and then spin down) as well as some dev-workload (runs for a few days until a dev spins it down manually). I am running this every night and get from ~100 nodes in the cluster down to ~20 - which adds up to a fair amount of savings:
for node in $(kubectl get nodes -o name| cut -d "/" -f2); do
kubectl drain --ignore-daemonsets --delete-emptydir-data $node;
kubectl delete node $node;
done
Remove worker node from Kubernetes
kubectl get nodes
kubectl drain < node-name > --ignore-daemonsets
kubectl delete node < node-name >
When draining a node we can have the risk that the nodes remain unbalanced and that some processes suffer downtime. The purpose of this method is to maintain the load balance between nodes as much as possible in addition to avoiding downtime.
# Mark the node as unschedulable.
echo Mark the node as unschedulable $NODENAME
kubectl cordon $NODENAME
# Get the list of namespaces running on the node.
NAMESPACES=$(kubectl get pods --all-namespaces -o custom-columns=:metadata.namespace --field-selector spec.nodeName=$NODENAME | sort -u | sed -e "/^ *$/d")
# forcing a rollout on each of its deployments.
# Since the node is unschedulable, Kubernetes allocates
# the pods in other nodes automatically.
for NAMESPACE in $NAMESPACES
do
echo deployment restart for $NAMESPACE
kubectl rollout restart deployment/name -n $NAMESPACE
done
# Wait for deployments rollouts to finish.
for NAMESPACE in $NAMESPACES
do
echo deployment status for $NAMESPACE
kubectl rollout status deployment/name -n $NAMESPACE
done
# Drain node to be removed
kubectl drain $NODENAME
There exists some strange behaviors for me when kubectl drain. Here are my extra steps, otherwise DATA WILL LOST in my case!
Short answer: CHECK THAT no PersistentVolume is mounted to this node. If have some PV, see the following descriptions to remove it.
When executing kubectl drain, I noticed, some Pods are not evicted (they just did not appear in those logs like evicting pod xxx).
In my case, some are pods with soft anti-affinity (so they do not like to go to the remaining nodes), some are pods of StatefulSet of size 1 and wants to keep at least 1 pod.
If I directly delete that node (using the commands mentioned in other answers), data will get lost because those pods have some PersistentVolumes, and deleting a Node will also delete PersistentVolumes (if using some cloud providers).
Thus, please manually delete those pods one by one. After deleted, kuberentes will re-schedule the pods to other nodes (because this node is SchedulingDisabled).
After deleting all pods (excluding DaemonSets), please CHECK THAT no PersistentVolume is mounted to this node.
Then you can safely delete the node itself :)