Job still running even when deleting nodes - kubernetes

I created a two-node cluster and then created a Job using the busybox image that sleeps for 300 seconds. I checked which node the Job's pod was running on using
kubectl get pods -o wide
I then deleted that node, but surprisingly the Job kept running to completion on the same node. Is this normal behavior? If not, how can I fix it?
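Roughly how the setup described above can be reproduced (the job name is arbitrary):
kubectl create job sleeper --image=busybox -- sleep 300
kubectl get pods -o wide   # shows which node the job's pod landed on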

Jobs aren't themselves scheduled or run on nodes. The role of a Job is just to define a policy: it makes sure that a pod with a certain specification exists and runs until the task completes, whether it finishes successfully or not.
When you create a Job, you declare a policy that the built-in job-controller acts on by creating a pod for it. The built-in kube-scheduler then sees this pod has no node assigned and patches it with a node's identity. The kubelet sees a pod bound to its own node identity and starts the container. While the container is running, the control plane knows that both the node and the pod still exist.
There are two ways of taking a node away: with a drain and without one. Taking a node away without draining it is equivalent to a network cut or a server crash: the api-server keeps the Node resource around for a while, but it stops being Ready, and its pods are then slowly terminated. Draining a node, by contrast, means preventing new pods from being scheduled onto it and deleting its existing pods, as with kubectl delete pod.
In both cases the pods end up deleted, and you are left with a Job that hasn't run to completion and has no pod. The job-controller therefore creates a new pod for the Job, the Job's failed-attempts counter increases by 1, and the loop starts over.
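For reference, a sketch of the two ways of taking a node away described above (my-node is a placeholder):
kubectl drain my-node --ignore-daemonsets   # cordon the node and gracefully evict its pods
kubectl delete node my-node                 # remove the Node object without draining it first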

Related

Auto delete CrashLoopBackOff pods in a deployment

In my Kubernetes cluster, there are multiple deployments in a namespace.
For one specific deployment, "CrashLoopBackOff" pods must not be allowed to linger.
So basically, when any pod gets into this state, I want it to be deleted; the creation of a new pod afterwards is already handled by the ReplicaSet.
I tried custom controllers, with the idea that the SharedInformer would report the state of the Pod and I would then delete it from that loop.
However, this makes the behaviour depend on the pod the custom controller itself runs on.
I also searched for an option that could be configured in the manifest itself, but could not find any.
I am pretty new to Kubernetes, so I need help implementing this behaviour.
Firstly, you should address the reason why the pod has entered the CrashLoopBackOff state rather than just deleting it. If you only delete it, you'll potentially just recreate the problem and end up deleting pods repeatedly. For example, if your pod is trying to access an external DB and that DB is down, it'll CrashLoop, and deleting and restarting the pod won't fix that.
Secondly, if you want to do this deleting in an automated manner, an easy way would be to run a CronJob resource that goes through your deployment and deletes the CrashLooped pods. You could set the cronjob to run once an hour or whatever schedule you wish.
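As a sketch of that approach (the namespace, label selector and schedule are placeholders, and whatever runs this command inside the CronJob would need a kubectl image plus RBAC permission to list and delete pods):
kubectl get pods -n my-namespace -l app=my-deployment --no-headers \
  | awk '$3 == "CrashLoopBackOff" {print $1}' \
  | xargs -r kubectl delete pod -n my-namespace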
Deleting the pod and waiting for a new one is effectively the same as restarting the deployment or the pod.
Kubernetes will automatically restart a CrashLoopBackOff pod when it fails; you can check the RESTARTS count:
NAME       READY   STATUS             RESTARTS   AGE
te-pod-1   0/1     CrashLoopBackOff   2          1m44s
These restarts are similar to what you mentioned:
when any pod gets to this state, I would want it to be deleted and later a new pod to be created which is already handled by the ReplicaSet.
If you want the pod to stop crashing entirely, rather than waiting for a new pod to come up, you have to roll back the deployment.
If there is an issue with your ReplicaSet (Deployment) itself, deleting the pod is useless: no matter how many times you delete and restart it, it will keep crashing until you check the logs and debug the real issue in the ReplicaSet (Deployment).
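A minimal sketch of that rollback (the deployment name is a placeholder):
kubectl rollout history deployment/my-deployment   # list the recorded revisions
kubectl rollout undo deployment/my-deployment      # roll back to the previous revision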

How to rollout without killing processes in K8s?

I'm using:
kubectl rollout restart deployment my_cool_workers
This terminates the workers and starts new ones.
However, I want to roll out in a way where, if something is running on a specific worker, the task is allowed to finish - I don't want to kill the tasks (so the worker should finish its current tasks but not accept new ones).
Meaning: roll out new workers -> old workers no longer accept traffic -> once an old worker is no longer running anything, terminate it.
How can this be done?
If a Pod gets killed, whether manually via kubectl or by any k8s controller such as during a deployment rollout, it instantly changes from the Running to the Terminating state. At the same time, the SIGTERM signal is sent to all containers inside that Pod.
While in the Terminating state, the containers of a Pod are not restarted if they exit. By contrast, whenever a container inside a Pod stops while the Pod is in the Running state, the container is restarted, because a Pod is supposed to keep running unless an error occurred.
For more information refer to this document.
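The answer above doesn't spell this out, but a pattern consistent with it is to have the workers stop accepting new tasks when they receive SIGTERM and to give them a long enough grace period to finish in-flight work before Kubernetes sends SIGKILL; a minimal sketch, assuming the workers already handle SIGTERM this way (the one-hour value is arbitrary):
kubectl patch deployment my_cool_workers \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":3600}}}}'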

Kubernetes limit number of retry

For some context, I'm building an API in Python that creates K8s Jobs from user input passed in ENV variables.
Sometimes the selected image does not exist or has been deleted, a Secret does not exist, or a Volume isn't created, so the Job ends up in a CrashLoopBackOff or ImagePullBackOff state.
First, I'm wondering whether resources are allocated to the Job while it is in this state.
If yes, I don't want the Job to loop forever and lock resources for a Job that will never start.
I've set backoffLimit to 0, but that only applies when the Job detects a failed Pod and launches another Pod to retry. In my case, I know that if a Pod fails for a Job, it's mostly due to OOM or code that fails because of the user input, so it will always fail and retrying is pointless.
But this doesn't limit the number of retries for CrashLoopBackOff or ImagePullBackOff. Is there a way to make the Job terminate or fail in that case? I don't want to kill it, just free the resources while keeping the events in (status.container.state.waiting.reason + status.container.state.waiting.message) or (status.container.state.terminated.reason + status.container.state.terminated.exit_code).
Is there an option I can set at creation time to limit the number of retries, so I can free the resources but keep the Job and its logs?
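For reference, a minimal sketch of the kind of Job described above (name, image and variable are placeholders):
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: user-task
spec:
  backoffLimit: 0            # do not retry failed pods
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: user-supplied-image:latest
        env:
        - name: USER_INPUT   # stands in for the user-provided settings
          value: "example"
EOF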
I have tested your first question and YES, even if a pod is in the CrashLoopBackOff state, the resources are still allocated to it! Here is my test: Are the Kubernetes requested resources by a pod still allocated to it when it is in crashLoopBackOff state?
Thanks for your question!
Long story short, unfortunately there is no such option in Kubernetes.
However, you can do this manually by checking whether the pod is in CrashLoopBackOff and then freeing its resources or simply deleting the pod itself.
The following script deletes any pod in the CrashLoopBackOff state from a specified namespace:
#!/bin/bash
# This script checks the given namespace and deletes pods in the 'CrashLoopBackOff' state
NAMESPACE="test"
delpods=$(sudo kubectl get pods -n ${NAMESPACE} |
  grep -i 'CrashLoopBackOff' |
  awk '{print $1}')
for i in ${delpods[@]}; do
  sudo kubectl delete pod $i --force=true --wait=false \
    --grace-period=0 -n ${NAMESPACE}
done
Since we have passed the option --grace-period=0, the pod won't automatically restart again.
But if, after using this script or running it as a job, you notice that the pod keeps restarting and falling back into the CrashLoopBackOff state for some reason, there is a workaround: change the restart policy of the pod:
A PodSpec has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always. restartPolicy applies to all Containers in the Pod. restartPolicy only refers to restarts of the Containers by the kubelet on the same node. Exited Containers that are restarted by the kubelet are restarted with an exponential back-off delay (10s, 20s, 40s …) capped at five minutes, and is reset after ten minutes of successful execution. As discussed in the Pods document, once bound to a node, a Pod will never be rebound to another node.
See more details in the documentation or from here.
And that is it! Happy hacking.
Regarding the first question, it is already answered by bguess here.

What happens when you drain nodes in a Kubernetes cluster?

I'd like some clarification, in preparation for maintenance, about what happens when you drain nodes in a Kubernetes cluster:
Here's what I know when you run kubectl drain MY_NODE:
Node is cordoned
Pods are gracefully shut down
You can opt to ignore DaemonSet pods because, if they are shut down, they'll just be re-spawned right away.
I'm confused as to what happens when a node is drained though.
Questions:
What happens to the pods? As far as I know, there's no 'live migration' of pods in Kubernetes.
Will the pods be shut down and then automatically started on another node? Or does this depend on my configuration? (i.e. could a pod be shut down via drain and not start up on another node?)
I would appreciate some clarification on this and any best practices or advice as well. Thanks in advance.
By default kubectl drain is non-destructive; you have to override that to change its behaviour. It runs with the following defaults:
--delete-local-data=false
--force=false
--grace-period=-1
--ignore-daemonsets=false
--timeout=0s
Each of these safeguards deals with a different category of potential destruction (local data, bare pods, graceful termination, daemonsets). Drain also respects pod disruption budgets to adhere to workload availability. Any non-bare pod will be recreated on a new node by its respective controller (e.g. daemonset controller, replication controller).
It's up to you whether you want to override that behaviour (for example, you might have a bare pod if running a Jenkins job; if you override by setting --force=true, drain will delete that pod and it won't be recreated). If you don't override it, the drain will block indefinitely (--timeout=0s means no timeout).
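For example, a sketch of overriding those safeguards explicitly (the flag values are illustrative, not recommendations):
kubectl drain MY_NODE --ignore-daemonsets --delete-local-data --force \
  --grace-period=60 --timeout=120s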
I just want to add a few things to eamon1234's answer:
You may find this useful as well:
Link to the official documentation (in case the default flags change etc.). According to it:
The 'drain' evicts or deletes all pods except mirror pods (which cannot be deleted through the API server). If there are DaemonSet-managed pods, drain will not proceed without --ignore-daemonsets, and regardless it will not delete any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores unschedulable markings. If there are any pods that are neither mirror pods nor managed by ReplicationController, ReplicaSet, DaemonSet, StatefulSet or Job, then drain will not delete any pods unless you use --force. --force will also allow deletion to proceed if the managing resource of one or more pods is missing.
Simple chart illustrating what actually happens when using kubectl drain.
Using kubectl drain with the --dry-run option may also be a good idea so you can see its outcome before any actual changes are applied, e.g.:
kubectl drain foo --force --dry-run
however it will not show any errors about existing local data or daemonsets, which you can see without the --dry-run flag:
... error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore) ...
We can use kubectl drain to safely evict all of our pods from a node before we perform maintenance on it.
If you want to update, patch, or perform any other kind of maintenance on the hardware/node, you should first drain all the pods (migrate pods from one node to another) with kubectl drain.
When kubectl drain returns successfully, it indicates that all of the pods have been safely evicted, and it is then safe to bring down the node.
After the maintenance work, we can use kubectl uncordon to tell Kubernetes that it can resume scheduling new pods onto the node.
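Putting that together, a typical maintenance flow might look like this (MY_NODE is a placeholder):
kubectl drain MY_NODE --ignore-daemonsets   # evict the pods and mark the node unschedulable
# ... perform the hardware / OS maintenance ...
kubectl uncordon MY_NODE                    # allow new pods to be scheduled on the node again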

Use kubectl to delete a node that has running pods on it

We are using Heat + Kubernetes (v0.19) to manage our apps. During a rolling update, starting the container sometimes always fails on a particular node, and the kubelet on that node keeps retrying and keeps failing, so the update hangs there, which is not the behavior we expect.
I found that using "kubectl delete node" to remove the node prevents pods from being scheduled onto it. But in our environment, the node to be deleted may have running pods on it.
So my question is:
After using "kubectl delete node" to remove the node, will the pods on that node still worked correctly ?
If you just want to cancel the rolling update, remove the failed pods, and try again later, I have found that it is best to stop the update loop with CTRL+C and then delete the replication controller corresponding to the new app version that is failing.
^C
kubectl delete replicationcontrollers your-app-v1.2.3
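As an aside not covered in the answer above, with a much newer kubectl than the v0.19 setup in the question you can check what is still running on a node before removing it; a sketch (MY_NODE is a placeholder):
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=MY_NODE
kubectl delete node MY_NODE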