When I run the shutdown -h now command to shut down a node in a Kubernetes cluster, the endpoint updates its state after about 40 seconds, but when I run kubectl delete pod POD-NAME, the endpoint updates its state very quickly. Can anyone explain why?
When you "shut down" a node, you should do it gracefully with kubectl drain. This evicts the pods in a controlled manner and is friendlier to your traffic.
The article Kubernetes best practices: terminating with grace has a detailed description of all the steps that happen when a Pod is gracefully terminated. For planned maintenance, use a graceful shutdown; for unplanned maintenance there is not much you can do.
I found the solution here:
It may be caused by the node being marked as NotReady only after a grace period. This is configurable. After that, the pod will be terminated and rescheduled.
--node-monitor-grace-period duration Default: 40s
Amount of time which we allow running Node to be unresponsive before marking it unhealthy. Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node status.
refer to: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
When I changed the value of --node-monitor-grace-period to 10s, the endpoint updated its state much more quickly.
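For reference, on a kubeadm-based cluster this flag is usually set in the kube-controller-manager static Pod manifest; here is a minimal sketch of the relevant excerpt (the file path, image tag and surrounding fields are assumptions, not taken from the question):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm layout; path is an assumption)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.27.0   # illustrative version
    command:
    - kube-controller-manager
    - --node-monitor-grace-period=10s   # default 40s; keep it a multiple of the kubelet's nodeStatusUpdateFrequency
    # ... all other existing flags stay unchanged ...
```

The kubelet watches the static Pod manifests, so saving the file restarts the controller manager with the new value.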
I know that the moment a pod receives a deletion request, it is removed from the service's endpoints and no longer receives new requests. However, I'm not sure whether the pod can still return a response to a request it received just before it was removed from the service's endpoints.
If the pod IP is missing from the service's endpoints, can it still respond to requests?
There are many reasons why Kubernetes might terminate a healthy container (for example, node drain, termination due to lack of resources on the node, rolling update).
Once Kubernetes has decided to terminate a Pod, a series of events takes place:
1 - Pod is set to the “Terminating” State and removed from the endpoints list of all Services
At this point, the pod stops getting new traffic. Containers running in the pod will not be affected.
2 - preStop Hook is executed
The preStop Hook is a special command or http request that is sent to the containers in the pod.
If your application doesn’t gracefully shut down when receiving a SIGTERM you can use this hook to trigger a graceful shutdown. Most programs gracefully shut down when receiving a SIGTERM, but if you are using third-party code or are managing a system you don’t have control over, the preStop hook is a great way to trigger a graceful shutdown without modifying the application.
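As an illustration, a preStop hook is declared on the container spec; a minimal sketch (the name, image, sleep duration and the commented HTTP path are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx                 # placeholder image
    lifecycle:
      preStop:
        exec:
          # A short sleep gives the endpoint removal time to propagate before shutdown continues.
          command: ["sh", "-c", "sleep 10"]
        # Alternatively (instead of exec), an httpGet handler can hit a shutdown endpoint in the app:
        # httpGet:
        #   path: /shutdown      # hypothetical path
        #   port: 8080
```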
3 - SIGTERM signal is sent to the pod
At this point, Kubernetes will send a SIGTERM signal to the containers in the pod. This signal lets the containers know that they are going to be shut down soon.
Your code should listen for this event and start shutting down cleanly at this point. This may include stopping any long-lived connections (like a database connection or WebSocket stream), saving the current state, or anything like that.
Even if you are using the preStop hook, it is important that you test what happens to your application if you send it a SIGTERM signal, so you are not surprised in production!
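As a sketch of what that looks like at the container level (image and script are purely illustrative), a shell entrypoint can trap SIGTERM and exit cleanly:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sigterm-demo             # hypothetical name
spec:
  containers:
  - name: app
    image: busybox               # placeholder image
    command: ["sh", "-c"]
    args:
      - |
        # Trap SIGTERM, do any cleanup, then exit 0 so Kubernetes never has to send SIGKILL.
        trap 'echo "SIGTERM received, shutting down"; exit 0' TERM
        while true; do sleep 1; done
```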
4 - Kubernetes waits for a grace period
At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It’s important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.
If your app finishes shutting down and exits before the terminationGracePeriod is done, Kubernetes moves to the next step immediately.
If your pod usually takes longer than 30 seconds to shut down, make sure you increase the grace period. You can do that by setting the terminationGracePeriodSeconds option in the Pod YAML.
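For instance, a Pod that needs up to two minutes to shut down could declare (names and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-shutdown-demo             # hypothetical name
spec:
  terminationGracePeriodSeconds: 120   # default is 30; raise it if shutdown regularly takes longer
  containers:
  - name: app
    image: nginx                       # placeholder image
```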
5 - SIGKILL signal is sent to pod, and the pod is removed
If the containers are still running after the grace period, they are sent the SIGKILL signal and forcibly removed. At this point, all Kubernetes objects are cleaned up as well.
I hope this gives a good idea of the Kubernetes termination lifecycle and how to handle a Pod termination gracefully.
Based on this article.
As long as the pod is deleted and its container stopped, it cannot respond to requests, regardless of whether it has already been removed from the service's endpoints.
If the pod's container is still alive, it can respond to requests, whether or not you reach it through the service.
I have a Cassandra StatefulSet in my Kubernetes cluster with a high terminationGracePeriod to handle data handover.
The problem is that when a host machine goes down, K8s waits for the whole terminationGracePeriod in the termination phase before rescheduling my pod on another node.
How can I make K8s ignore the terminationGracePeriod when the node is down and reschedule the pods immediately?
The problem is that when a host machine goes down, K8s waits for the whole terminationGracePeriod in the termination phase before rescheduling my pod on another node.
I think this is a wrong assumption. When a host machine goes down, the node health check is used to detect this, and that typically takes something like 5 minutes. Only after that are the pods rescheduled to other nodes.
See Node Condition and pod eviction:
If the status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout (an argument passed to the kube-controller-manager), then the node controller triggers API-initiated eviction for all Pods assigned to that node. The default eviction timeout duration is five minutes.
How can I make K8s ignore the terminationGracePeriod when the node is down and reschedule the pods immediately?
I don't think terminationGracePeriod is related to this. A pod gets a SIGTERM to shut down; only if it hasn't shut down successfully within the whole terminationGracePeriod is it killed with SIGKILL.
I have a k8s cluster, and we do not want the pods in it to get evicted, because pod eviction causes a lot of side effects for the applications running on them.
To prevent pod eviction, we have configured all the pods with the Guaranteed QoS class. I know that even with this, eviction can happen if there is resource starvation in the system. We have monitors to alert us when there is resource starvation within a pod or node, so we find out well before a pod gets evicted. This helps us take measures before it happens.
The other reason for pod eviction is when a node is in the not-ready state: kube-controller-manager checks the pod-eviction-timeout and evicts the pods after this timeout. We have a monitor to alert us when a node goes into the not-ready state. After this alert we want to do some clean-up on the application side so the application ends gracefully. This clean-up needs more than a few hours, but pod-eviction-timeout is 5 minutes by default.
Is it fine to increase the pod eviction timeout to 300m? What are the impacts of increasing this timeout to such a limit?
P.S.: I know that during this wait time, if the pod uses more resources, the kubelet can itself evict the pod. I want to know what other impacts there are of waiting for such a long time.
As @coderanger said, your limits are incorrect and this should be fixed instead of lowering the self-healing capabilities of Kubernetes.
If your pod dies, no matter what the issue was, by default it will be rescheduled based on your configuration.
If you are having a problem with this, then I would recommend redoing your architecture and rewriting the app to use Kubernetes the way it's supposed to be used:
If you are getting problems with a pod still being sent requests when it's unresponsive, implement a load balancer in front of it or queue the requests.
If you are having problems with IPs changing after pod restarts, fix this by using DNS and a Service instead of connecting directly to a pod.
If your pod is being evicted, check why, and set appropriate requests and limits.
As for the node, there is a really nice blog post, Improving Kubernetes reliability: quicker detection of a Node down; it's the opposite of what you are thinking of doing, but it also explains why 340s is too much:
Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout, by default it’s 5m which in my opinion is too high, because although the node is already marked as unhealthy the kube controller manager won’t remove the pods so they will be accessible through their service and requests will fail.
If you still want to raise the default values instead, these are the settings to look at:
kubelet: node-status-update-frequency=10s
controller-manager: node-monitor-period=5s
controller-manager: node-monitor-grace-period=40s
controller-manager: pod-eviction-timeout=5m
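As a rough sketch of where these settings live (a kubeadm-style layout is assumed; the paths and values are illustrative, not recommendations):

```yaml
# Kubelet side: KubeletConfiguration, typically /var/lib/kubelet/config.yaml on kubeadm nodes
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: 10s   # how often the kubelet posts its node status (default 10s)

# Controller-manager side: the other three values are command-line flags on kube-controller-manager,
# added to its static Pod manifest (/etc/kubernetes/manifests/kube-controller-manager.yaml), e.g.:
#   --node-monitor-period=5s
#   --node-monitor-grace-period=40s
#   --pod-eviction-timeout=5m0s
```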
If you provide more details I'll try to help more.
We have a use case in our application where we need the pod to terminate after it processes a request. The corresponding deployment will take care of spinning up a new pod to maintain the replica count.
I was exploring the use of liveness probes, but they only restart the container, not the pod.
Is there any other way to terminate the pod, at the Service or Deployment level?
You need to get familiar with Pod lifetime:
In general, Pods do not disappear until someone destroys them. This might be a human or a controller. The only exception to this rule is that Pods with a phase of Succeeded or Failed for more than some duration (determined by terminated-pod-gc-threshold in the master) will expire and be automatically destroyed.
In your case consider using Jobs.
Use a Job for Pods that are expected to terminate, for example, batch computations. Jobs are appropriate only for Pods with restartPolicy equal to OnFailure or Never.
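A minimal Job sketch (the name, image and command are placeholders), whose Pod simply exits when its work is done and is not restarted:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-one-request       # hypothetical name
spec:
  backoffLimit: 4                 # retries on failure before the Job is marked failed
  template:
    spec:
      restartPolicy: Never        # Jobs require OnFailure or Never
      containers:
      - name: worker
        image: busybox            # placeholder image
        # Placeholder for the real processing logic; when this command exits,
        # the Pod terminates and the Job records the completion.
        command: ["sh", "-c", "echo processing one request && sleep 5"]
```

If each incoming request should be handled by its own Pod, the application or a small controller would create one Job per request instead of relying on a Deployment to keep replicas alive.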
Please let me know if that helped.
What is the kubelet's strategy for managing containers on the machine if the connection with the master is broken due to a network issue? Is it possible to configure the kubelet to kill all containers in such a situation?
Nodes in Kubernetes check in with the master at regular intervals. If they fail to check in AND the master is still up, then the pod eviction timeout flag comes into play.
The cluster basically waits this long before the pods are rescheduled elsewhere. This wait is there because the machine may just be rebooting or doing something similar.
The flag is in the controller manager: --pod-eviction-timeout=5m0s: The grace period for deleting pods on failed nodes.
The second scenario is when the master goes down (or more specifically the controller-manager). If it stops responding, then the cluster will still function as-is without interruption.