https://kubernetes.io/docs/concepts/workloads/pods/pod/#force-deletion-of-pods
This section of the Kubernetes documentation points out that "Force deletions can be potentially dangerous for some pods", but doesn't really go into detail on the dangers.
I understand that force deleting a pod will immediately "deregister" the pod from the API before the kubelet confirms that the underlying containers have actually been terminated, which could leave a bunch of orphaned containers running if the kubelet fails to delete them. However, I don't know how to tell whether a pod is "dangerous" to force-delete before I do so, or if there is even a way to predict this.
Are there any guidelines on safely force-deleting a pod? Or is this just an inherently unsafe operation?
It really depends on your point of view.
From the point of view of the K8s control plane and etcd (which stores the cluster state), it's safe: the entry is simply removed from etcd.
However, the kubelet on the node is what actually has to kill the pod's containers (it picks up the deletion through the API server), and sometimes it might not be able to do so (most of the time it is).
A reason why it might not be able to kill the pod is that the container runtime (Docker or another runtime) isn't responding, or a Linux system resource isn't being released, which could be caused by anything from a deadlock to a hardware failure.
So most of the time it's safe, but there are a few specific cases where it's not, depending on the nature of your application and the state of your system.
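For reference (not covered in the original answer), a force deletion is usually issued with something like kubectl delete pod <pod-name> --grace-period=0 --force, which skips the graceful shutdown the pod would otherwise get. A minimal sketch of where that grace period lives in the pod spec; the pod name and image below are just placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-example        # placeholder name
spec:
  # Time the kubelet gives the containers to shut down cleanly on a normal
  # delete; a force delete with --grace-period=0 bypasses this entirely.
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    image: nginx:1.25           # placeholder image
```

If the kubelet cannot confirm the shutdown (runtime hung, node unreachable), the containers may keep running even though the API object is already gone, which is exactly the risk described above.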
Related
Kubernetes tends to assume apps are small/lightweight/stateless microservices which can be stopped on one node and restarted on another node with no downtime.
We have a slow-starting (20 min) legacy (stateful) application which, once running as a set of pods, should not be rescheduled without due cause. The reason is that all user sessions would be killed and users would have to log in again. There is NO way to serialize the sessions and externalize them. We want 3 instances of the pod.
Can we tell k8s not to move a pod unless absolutely necessary (i.e. it dies)?
Additional information:
The app is a Tomcat/Java monolith
Assume for the sake of argument we would like to run it in Kubernetes
We do have a liveness test endpoint available
There is no benefit if you tell k8s to use only one pod; that is not the "spirit" of k8s. In that case, it might be better to use a dedicated machine for your app.
But you can assign a pod to a special node - see Assigning Pods to Nodes. This should only be necessary when there are special hardware requirements (e.g. an AI microservice needs a GPU, which is only available on node xy).
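For completeness, a minimal sketch of what pinning a pod to a labelled node looks like; the label key/value, names and image are only illustrative (see the Assigning Pods to Nodes docs for richer options such as node affinity):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload            # illustrative name
spec:
  # Schedule only onto nodes that carry this label,
  # e.g. nodes that actually have a GPU.
  nodeSelector:
    hardware: gpu               # illustrative label
  containers:
  - name: app
    image: my-ai-service:1.0    # placeholder image
```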
k8s doesn't restart your pod for fun. It will restart it when there is a reason (node died, app died, ...), and I have never noticed a "random reschedule" in a cluster. Without further information (deployment, logs, cluster details) it is hard to say what exactly happened to you.
And regarding your comment: there are different deployment strategies; one of them starts a fresh instance and only kills the old one once the new one has started up successfully. Look here: Kubernetes deployment strategies
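A minimal sketch of that strategy (start the new instance first, only take the old one down once the new one is ready); the names and numbers are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-app              # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: legacy-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # bring up one extra pod first
      maxUnavailable: 0   # never take an old pod down before the new one is ready
  template:
    metadata:
      labels:
        app: legacy-app
    spec:
      containers:
      - name: app
        image: legacy-app:1.0   # placeholder image
```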
All points together:
Don't force your app onto a specific node - k8s will select the node intelligently.
There are normally no planned reschedules in k8s.
k8s will recreate pods only if there is a reason. Maybe your app didn't answer on the liveness endpoint? Or someone/something deleted your pod?
I am aware of how a ReplicaSet works and how it reconciles the state from its specification.
However, I am not completely sure what criteria a ReplicaSet uses to reconcile the state.
I took a look at the documentation to understand the scenarios.
https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/
One scenario is when the pod is down for any reason - an application issue.
Another is when the node is down.
What are all the other scenarios? If the pod is stuck and not making progress, will the ReplicaSet take care of that? Or does it just check whether the pod is alive or not?
If the pod is stuck and not making progress, will the ReplicaSet take care of that?
As long as the main process inside a container is running, the container is considered healthy by default and will be treated as such. If there is an application issue which prevents your application from working correctly but the main process is still running, you will be stuck with an "unhealthy" pod.
That is the reason why you want to implement a livenessProbe for your containers and specify what behavior represents a healthy state of the container. In such a scenario, failing the health check multiple times in a row (configurable) will result in the container being treated as failed and restarted (and, if the whole pod goes away, your ReplicaSet will replace it).
An example might be a simple HTTP GET request to some predefined path if you are running a web application in your pod (e.g. /api/health). Now, even if the main process is running, your application needs to respond to this health-check query periodically, otherwise it will be restarted.
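A minimal sketch of such a probe on a container; the path, port and timings are just illustrative values built around the /api/health example above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app                 # illustrative name
spec:
  containers:
  - name: web
    image: my-web-app:1.0       # placeholder image
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /api/health       # the health endpoint from the example above
        port: 8080
      initialDelaySeconds: 15   # give the app time to start
      periodSeconds: 10         # probe every 10 seconds
      failureThreshold: 3       # treat the container as failed after 3 consecutive failures
```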
If neither the Pod nor the Node is down, the Pod will only be treated as failed and replaced if you have a liveness probe defined.
If you don't have one implemented, k8s has no way of knowing that your Pod is not actually up and running.
Take a look at this doc page for more info.
OOMKilled issue - this kills the container, and the pod restarts it.
CPU limit issue - this causes throttling, which can surface as failed requests, but it does not restart the pod.
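For context, both behaviours come from the container's resource limits. A minimal illustrative sketch (all values are placeholders): exceeding the memory limit gets the container OOMKilled and restarted, while hitting the CPU limit only throttles it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limited-app             # illustrative name
spec:
  containers:
  - name: app
    image: my-app:1.0           # placeholder image
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"       # exceeding this -> the container is throttled, not restarted
        memory: "512Mi"   # exceeding this -> the container is OOMKilled and restarted
```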
I have a k8s cluster. In our cluster we do not want pods to get evicted, because pod eviction causes a lot of side effects for the applications running in them.
To prevent pod eviction, we have configured all the pods with the Guaranteed QoS class. I know that even with this, pod eviction can happen if there is resource starvation in the system. We have monitors to alert us when there is resource starvation within a pod or node, so we find out well before a pod gets evicted and can take measures before it happens.
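For reference, a pod gets the Guaranteed QoS class when every container has requests equal to limits for both CPU and memory. A minimal sketch, with illustrative names and values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-app          # illustrative name
spec:
  containers:
  - name: app
    image: my-app:1.0           # placeholder image
    resources:
      # requests == limits for both cpu and memory => Guaranteed QoS class
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "500m"
        memory: "1Gi"
```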
The other reason for pod eviction is the node going into a not-ready state: the kube-controller-manager then waits for the pod-eviction-timeout and evicts the pods after that timeout. We have a monitor to alert us when a node goes not-ready. After this alert we want to take some clean-up measures on the application side so that the application can end gracefully. This clean-up needs more than a few hours, but the pod-eviction-timeout is 5 minutes by default.
Is it fine to increase the pod-eviction-timeout to 300m? What are the impacts of increasing this timeout to such a value?
P.S.: I know that during this wait time, if the pod uses more resources, the kubelet can itself evict the pod. I want to know what other impacts waiting for such a long time has.
As #coderanger said, your limits are incorrect and this should be fixed instead of lowering the self-healing capabilities of Kubernetes.
If your pod dies, no matter what the issue was, by default it will be rescheduled based on your configuration.
If you are having a problem with this, then I would recommend reworking your architecture and rewriting the app to use Kubernetes as it's meant to be used.
if a pod is still being sent requests while it's unresponsive, you should put a load balancer in front of it or queue the requests;
if you have a problem with IPs changing after pod restarts, fix it by using DNS and a Service instead of connecting directly to the pod;
if your pod is being evicted, check why, and set appropriate requests and limits.
As for the node, there is a really nice blog post, Improving Kubernetes reliability: quicker detection of a Node down; it's the opposite of what you are thinking of doing, but it also explains why the default total of roughly 340s (node-monitor-grace-period plus pod-eviction-timeout) is already too long:
Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout; by default it's 5m, which in my opinion is too high, because even though the node is already marked as unhealthy, the kube controller manager won't remove the pods, so they will still be reachable through their Service and requests will fail.
If you still want to change the default values to higher ones, these are the settings to look at (defaults shown):
kubelet: node-status-update-frequency=10s
controller-manager: node-monitor-period=5s
controller-manager: node-monitor-grace-period=40s
controller-manager: pod-eviction-timeout=5m
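If the cluster was set up with kubeadm, one possible way to change these is through the kubelet configuration and extra flags on the kube-controller-manager. This is only a rough sketch, assuming kubeadm and that these flag names are still supported in your version (pod-eviction-timeout in particular has been deprecated in newer releases):

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    node-monitor-period: "5s"
    node-monitor-grace-period: "40s"
    pod-eviction-timeout: "300m"      # the value the question asks about
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"
```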
If you provide more details I'll try to help more.
I created a job in my Kubernetes cluster. The job takes a long time to finish, and I decided to cancel it, so I deleted the job, but I noticed the associated pod was NOT automatically deleted. Is this the expected behavior? Why is it not consistent with deployment deletion? Is there a way to make the pod be deleted automatically?
If you're deleting a deployment, chances are you don't want any of the underlying pods, so it deletes the pods by default. Also, the desired state of the pods would be unknown.
On the other hand, if you're deleting a pod, Kubernetes doesn't know what kind of replication controller may be attached to it or what it will do next. So it signals a shutdown to the container so that it can clean up gracefully. There may be requests still being handled by the pod, like a web request, and it would not be good to kill them if they only take a second to complete. This is what happens when you scale your pods down or roll out a new deployment and you don't want any of the users to experience downtime. This is in fact one of the benefits of Kubernetes, as opposed to a traditional application server which requires you to shut down the system to upgrade (or to play with load balancers to redirect traffic), which may negatively affect users.
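As background (not part of the original answer) on why pods can be cleaned up together with their owner: pods created by a Job (or a ReplicaSet) carry an ownerReference pointing at it, and the garbage collector uses that link when the owner is deleted with cascading deletion. A rough sketch of what that metadata looks like on a Job's pod; the names and UID are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: long-task-x7k2p                 # placeholder pod name
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: long-task                     # the Job that created this pod
    uid: 00000000-0000-0000-0000-000000000000   # placeholder UID
    controller: true
spec:
  containers:
  - name: worker
    image: my-task:1.0                  # placeholder image
```

Deleting the job with cascading deletion enabled (depending on your kubectl version, something like kubectl delete job long-task --cascade=foreground) should then remove such pods as well.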
How long does a pod persist without a replication controller?
I have run some pods that have a very simple purpose: they execute and then terminate. Other pods, like a database server pod, persist for quite a long time. However, after a day or so, the pod would terminate. I know Docker containers exit once their process has finished running, but why would my database pods keep running for a while and then randomly exit?
What controls the termination of a pod?
The easiest way for you to find a definitive answer to that question would be to run kubectl describe pod <podName> or kubectl get events. Any pod termination will have an associated event that you can use to diagnose the reason.
Pods may die for several reasons, ranging from errors within the container to a node going down for maintenance. You can usually set an appropriate restartPolicy, which will restart the pod if it fails (except in case of node failure). If you have multiple nodes and would like the pod to be restarted on a different node, you should use a higher-level controller like a ReplicaSet or Deployment.
For pods that are expected to terminate, a Job is better suited.
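A minimal sketch of such a Job for a run-to-completion pod; the name, image and policy values are only illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-task            # illustrative name
spec:
  backoffLimit: 3               # retry a failed pod up to 3 times
  template:
    spec:
      restartPolicy: OnFailure  # restart the container only if it fails; never restart on success
      containers:
      - name: task
        image: my-batch-task:1.0   # placeholder image
```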