Why kubernetes keeps pods in Error/Completed status - preemptible nodes, GKE - kubernetes

I have an issue with my GKE cluster. I am using two node pools: secondary - with a standard set of n1-highmem nodes, and primary - with preemptible n1-highmem nodes. The issue is that I have many pods in Error/Completed status which are not cleared by k8s, all of which ran on the preemptible pool. THESE PODS ARE NOT JOBS.
GKE documentation says that:
"Preemptible VMs are Compute Engine VM instances that are priced lower than standard VMs and provide no guarantee of availability. Preemptible VMs offer similar functionality to Spot VMs, but only last up to 24 hours after creation."
"When Compute Engine needs to reclaim the resources used by preemptible VMs, a preemption notice is sent to GKE. Preemptible VMs terminate 30 seconds after receiving a termination notice."
Ref: https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms
And from the kubernetes documentation:
"For failed Pods, the API objects remain in the cluster's API until a human or controller process explicitly removes them.
The Pod garbage collector (PodGC), which is a controller in the control plane, cleans up terminated Pods (with a phase of Succeeded or Failed), when the number of Pods exceeds the configured threshold (determined by terminated-pod-gc-threshold in the kube-controller-manager). This avoids a resource leak as Pods are created and terminated over time."
Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection
So, from my understanding, this set of nodes is replaced every 24 hours, which kills all the pods running on them, and depending on whether they shut down gracefully the pods end up in Completed or Error state. Nevertheless, Kubernetes is not clearing or removing them, so I have tons of pods in those statuses in my cluster, which is not expected at all.
I am attaching screenshots for reference.
Example kubectl describe pod output:
Status: Failed
Reason: Terminated
Message: Pod was terminated in response to imminent node shutdown.
Apart from that, no events, logs, etc.
GKE version:
1.24.7-gke.900
Both Node pools versions:
1.24.5-gke.600
Did anyone encounter such an issue, or know what's going on here? Is there a way to clear them other than writing a script and running it periodically?
I tried digging into the GKE logs, but I couldn't find anything. I also tried to look for answers in the docs, but failed.
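For reference, this is roughly how I list the affected pods (assuming the Error ones report a Failed phase, as in the describe output above, and the Completed ones a Succeeded phase):
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded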

On clusters running GKE version 1.20 and later with preemptible node pools, the kubelet graceful node shutdown feature is enabled by default. The kubelet notices the termination notice and gracefully terminates the Pods that are running on the node. If the Pods are part of a Deployment, the controller creates and schedules new Pods to replace the terminated ones.
During graceful Pod termination, the kubelet updates the status of the Pod, assigning a Failed phase and a Terminated reason to the terminated Pods. When the number of terminated Pods reaches a threshold, garbage collection cleans up the Pods.
On GKE version 1.21.3-gke.1200 and later you can also delete the shutdown Pods manually:
kubectl get pods --all-namespaces | grep -i NodeShutdown | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
kubectl get pods --all-namespaces | grep -i Terminated | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
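If you prefer not to parse kubectl output, a field-selector based variant should also work (assuming the stuck pods report a Failed phase, as in the describe output in the question):
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed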

Related

How to reschedule the pod from a node in Kubernetes (bare metal servers)?

Kubernetes nodes become unschedulable when I initiate a drain or cordon, but the pods running on the node are not moved to a different node immediately.
I mean, these pods are not created by a DaemonSet.
So how can an application's pods stay 100% available when a node becomes faulty or has issues?
Any inputs?
Commands used:
To drain / cordon and make the node unavailable:
kubectl drain node1
kubectl cordon node1
To check the node status:
kubectl get nodes
To check the pod status before / after cordon or drain:
kubectl get pods -o wide
kubectl describe pod <pod-name>
The surprising part is that even though the node is unavailable, the pod status always shows Running. :-)
Pods don't migrate to another node by themselves.
You can use workload resources to create and manage multiple Pods for you. A controller for the resource handles replication and rollout and automatic healing in case of Pod failure. For example, if a Node fails, a controller notices that Pods on that Node have stopped working and creates a replacement Pod. The scheduler places the replacement Pod onto a healthy Node.
Some examples of controllers are:
Deployment
DaemonSet
StatefulSet
Check this link for more information.
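As a minimal sketch (the name and image below are placeholders), running a workload through a Deployment instead of as a bare Pod is enough to get this self-healing behaviour:
kubectl create deployment my-app --image=nginx --replicas=3
# If the node running one of the replicas fails, the Deployment controller
# creates a replacement Pod and the scheduler places it on a healthy node.
kubectl get pods -l app=my-app -o wide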

AKS - incorrect Pod Status

I have an AKS cluster with two node pools. Node pool 1 has 3 nodes, and node pool 2 has 1 node - all Linux VMs. I noticed that after stopping the VMs and then doing kubectl get pods, the Pod status shows "Running" even though the VMs are not actually running. How is this possible?
This is the command I tried: kubectl get pods -n development -o=wide
The screenshot is given below. Though the VMs are not running, the Pod status shows "Running". However, trying to access the app using the public IP of the service resulted in
ERR_CONNECTION_TIMED_OUT
Here is a full thread (https://github.com/kubernetes/kubernetes/issues/55713) on this issue. The problem here is that by default a pod waits for 5 minutes before being evicted to another node when its current node becomes NotReady, but in this case none of the worker nodes are ready, and hence the pods are not getting evicted. Refer to the GitHub issue; there are some suggestions and solutions provided there.
What is actually going on is that the kubelet processes running on the nodes cannot report their status to the Kubernetes API server. Kubernetes will keep assuming that your Pods are running while the nodes associated with them are offline. The fact that all nodes are offline will in fact cause your Pods to not be running, hence not being accessible, causing the ERR_CONNECTION_TIMED_OUT.
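To see that 5-minute default on one of your pods (the pod name here is a placeholder), you can inspect its tolerations; on most clusters the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable entries with tolerationSeconds: 300 are what delay eviction:
kubectl get pod <pod-name> -n development -o jsonpath='{.spec.tolerations}'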
You can run kubectl get nodes to get the status of the nodes; they should show NotReady. Please check and let me know.
Also, can you please provide the output of kubectl get pods -A?

Delete all the contents from a kubernetes node

How do I delete all the contents from a Kubernetes node? Contents include deployments, replica sets etc. I tried to delete deployments separately, but Kubernetes recreates all the pods again. Is there any way to delete all the replica sets present on a node?
If you are testing things, the easiest way would be
kubectl delete deployment --all
Although if you are using minikube, the easiest would probably be to delete the machine and start again with a fresh node
minikube delete
minikube start
If we are talking about a production cluster, Kubernetes has a built-in feature to drain a node of the cluster, removing all the objects from that node safely.
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node. Safe evictions allow the pod’s containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified.
Note: By default kubectl drain will ignore certain system pods on the node that cannot be killed; see the kubectl drain documentation for more details.
When kubectl drain returns successfully, that indicates that all of the pods (except the ones excluded as described in the previous paragraph) have been safely evicted (respecting the desired graceful termination period, and without violating any application-level disruption SLOs). It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
First, identify the name of the node you wish to drain. You can list all of the nodes in your cluster with
kubectl get nodes
Next, tell Kubernetes to drain the node:
kubectl drain <node name>
Once it returns (without giving an error), you can power down the node (or equivalently, if on a cloud platform, delete the virtual machine backing the node). drain waits for graceful termination. You should not operate on the machine until the command completes.
If you leave the node in the cluster during the maintenance operation, you need to run
kubectl uncordon <node name>
afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.
Please, note that if there are any pods that are not managed by ReplicationController, ReplicaSet, DaemonSet, StatefulSet or Job, then drain will not delete any pods unless you use --force, as mentioned in the docs.
kubectl drain <node name> --force
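For completeness, a PodDisruptionBudget (the name and selector below are placeholders) is how you tell drain not to evict too many replicas of the same app at once; as mentioned above, drain respects it:
kubectl create poddisruptionbudget my-app-pdb --selector=app=my-app --min-available=1
kubectl get poddisruptionbudgets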
In case you are using minikube:
minikube delete --all
It will let you start a new clean cluster.
In case you run on Kubernetes:
kubectl delete pods,deployments -A --all
It will remove them from all namespaces; you can add more object types to the same command.
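For example, a variant that also sweeps the other common workload controllers (use with care - this really does delete everything in every namespace):
kubectl delete deployments,replicasets,statefulsets,daemonsets,jobs,cronjobs -A --all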
Kubernetes provides the Namespace object for isolation and separation of concerns. It is therefore recommended to apply all of your k8s resource objects (Deployment, ReplicaSet, Pods, Services and others) in a custom namespace.
Now, if you want to remove all of the relevant and related k8s resources, you just need to delete the namespace, which will remove all of these resources.
kubectl create namespace custom-namespace
kubectl create -f deployment.yaml --namespace=custom-namespace
kubectl delete namespaces custom-namespace
I have attached a link for further research.
Namespaces
I tried so many variations to delete old pods from tutorials, including everything here.
What finally worked for me was:
kubectl delete replicaset --all
Deleting them one at a time didn't seem to work; it was only with the --all flag that all pods were deleted without being recreated.

Kubernetes CronJob/pod terminated by occasional SIGTERM

I've had a series of cron jobs running on Google Container Engine and I've been hit with an occasional SIGTERM to one of my pods, triggered by my CronJob for an unknown reason.
I've only been able to find this out because Sentry logged my tasks receiving a SIGTERM. I was unable to find the terminated pod itself, even with
kubectl get po -a
I've looked at the Kubernetes nodes (which have autoscaling enabled) and none of them have been drained or deleted. The ages of all the nodes are still the same.
I've also set concurrencyPolicy: Forbid (does Kubernetes check for concurrency after it has initiated a pod, or before?)
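For reference, the relevant part of my CronJob looks roughly like this (the name, schedule and image are placeholders, and the batch/v1 API version is used here purely for illustration):
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron
spec:
  schedule: "*/10 * * * *"
  # Forbid: skip a new run if the previous Job has not finished yet
  # (the check happens when the new run is about to be created).
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo running scheduled task"]
EOF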

How to gracefully remove a node from Kubernetes?

I want to scale up/down the number of machines to increase/decrease the number of nodes in my Kubernetes cluster. When I add one machine, I’m able to successfully register it with Kubernetes; therefore, a new node is created as expected. However, it is not clear to me how to smoothly shut down the machine later. A good workflow would be:
Mark the node related to the machine that I am going to shut down as unschedulable;
Start the pod(s) that is running in the node in other node(s);
Gracefully delete the pod(s) that is running in the node;
Delete the node.
If I understood correctly, even kubectl drain (discussion) doesn't do what I expect since it doesn’t start the pods before deleting them (it relies on a replication controller to start the pods afterwards which may cause downtime). Am I missing something?
How should I properly shutdown a machine?
List the nodes and get the <node-name> you want to drain (or remove from the cluster):
kubectl get nodes
1) First drain the node
kubectl drain <node-name>
You might have to ignore DaemonSets and local data on the machine
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
2) Edit instance group for nodes (Only if you are using kops)
kops edit ig nodes
Set the MIN and MAX size to one less than their current values.
Just save the file (nothing extra to be done).
You might still see some pods on the drained node that belong to DaemonSets, such as the networking plugin, fluentd for logs, kube-dns/coredns, etc.
3) Finally delete the node
kubectl delete node <node-name>
4) Commit the state for KOPS in s3: (Only if you are using kops)
kops update cluster --yes
OR (if you are using kubeadm)
If you are using kubeadm and would like to reset the machine to the state it was in before running kubeadm join, then run
kubeadm reset
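Note that, per the kubeadm documentation, kubeadm reset does not clean up iptables or IPVS rules; if you need a fully clean slate, the docs suggest doing that yourself (run on the node as root):
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
# and, only if kube-proxy was running in IPVS mode:
ipvsadm -C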
Find the node with kubectl get nodes. We’ll assume the name of node to be removed is “mynode”, replace that going forward with the actual node name.
Drain it with kubectl drain mynode
Delete it with kubectl delete node mynode
If using kubeadm, run on “mynode” itself kubeadm reset
Rafael. kubectl drain does work as you describe. There is some downtime, just as if the machine crashed.
Can you describe your setup? How many replicas do you have, and are you provisioned such that you can't handle any downtime of a single replica?
If the cluster was created by kops:
1. kubectl drain <node-name>
Now all the pods will be evicted.
Ignore DaemonSets:
2. kubectl drain <node-name> --ignore-daemonsets --delete-local-data
3. kops edit ig nodes-3 --state=s3://bucketname
Set the max and min value of the instance group to 0.
4. kubectl delete node <node-name>
5. kops update cluster --state=s3://bucketname --yes
Rolling update if required:
6. kops rolling-update cluster --state=s3://bucketname --yes
Validate the cluster:
7. kops validate cluster --state=s3://bucketname
Now the instance will be terminated.
The below command only works if you have a lot of replicas, disruption budgets, etc. - but helps a lot with improving cluster utilization. In our cluster we have integration tests kicked off throughout the day (pods run for an hour and then spin down) as well as some dev-workload (runs for a few days until a dev spins it down manually). I am running this every night and get from ~100 nodes in the cluster down to ~20 - which adds up to a fair amount of savings:
for node in $(kubectl get nodes -o name | cut -d "/" -f2); do
  kubectl drain --ignore-daemonsets --delete-emptydir-data $node;
  kubectl delete node $node;
done
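A slightly safer variant (the label key and value below are placeholders) only touches nodes carrying a specific label, so you don't accidentally drain nodes running long-lived workloads:
for node in $(kubectl get nodes -l node-pool=integration -o name | cut -d "/" -f2); do
  kubectl drain --ignore-daemonsets --delete-emptydir-data $node;
  kubectl delete node $node;
done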
Remove worker node from Kubernetes:
kubectl get nodes
kubectl drain <node-name> --ignore-daemonsets
kubectl delete node <node-name>
When draining a node we run the risk that the nodes become unbalanced and that some processes suffer downtime. The purpose of this method is to maintain the load balance between nodes as much as possible, in addition to avoiding downtime.
# Mark the node as unschedulable.
echo "Marking node $NODENAME as unschedulable"
kubectl cordon "$NODENAME"
# Get the list of namespaces with pods running on the node.
NAMESPACES=$(kubectl get pods --all-namespaces -o custom-columns=:metadata.namespace --field-selector spec.nodeName="$NODENAME" | sort -u | sed -e "/^ *$/d")
# Force a rollout of every deployment in those namespaces.
# Since the node is unschedulable, Kubernetes allocates
# the replacement pods on other nodes automatically.
for NAMESPACE in $NAMESPACES
do
  echo "Restarting deployments in namespace $NAMESPACE"
  for DEPLOYMENT in $(kubectl get deployments -n "$NAMESPACE" -o name)
  do
    kubectl rollout restart "$DEPLOYMENT" -n "$NAMESPACE"
  done
done
# Wait for the deployment rollouts to finish.
for NAMESPACE in $NAMESPACES
do
  echo "Waiting for rollouts in namespace $NAMESPACE"
  for DEPLOYMENT in $(kubectl get deployments -n "$NAMESPACE" -o name)
  do
    kubectl rollout status "$DEPLOYMENT" -n "$NAMESPACE"
  done
done
# Drain the node to be removed.
kubectl drain "$NODENAME"
There are some strange behaviors for me when running kubectl drain. Here are my extra steps, otherwise DATA WILL BE LOST in my case!
Short answer: CHECK THAT no PersistentVolume is mounted on this node. If there are some PVs, see the following description to remove them.
When executing kubectl drain, I noticed that some Pods were not evicted (they just did not appear in the logs, like evicting pod xxx).
In my case, some were pods with soft anti-affinity (so they do not like to go to the remaining nodes), and some were pods of a StatefulSet of size 1 that wants to keep at least 1 pod.
If I directly delete that node (using the commands mentioned in other answers), data will be lost because those pods have some PersistentVolumes, and deleting a Node also deletes its PersistentVolumes (if using some cloud providers).
Thus, please manually delete those pods one by one. After they are deleted, Kubernetes will re-schedule the pods to other nodes (because this node is SchedulingDisabled).
After deleting all pods (excluding DaemonSets), please CHECK THAT no PersistentVolume is mounted on this node.
Then you can safely delete the node itself :)
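A quick way to perform that check (the node name mynode is a placeholder): list the pods still scheduled on the node and look for PersistentVolumeClaim mounts in their descriptions.
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=mynode
# For each pod listed, check whether it mounts a PersistentVolumeClaim:
kubectl describe pod <pod-name> -n <namespace> | grep -i -A1 ClaimName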