I have a Cassandra StatefulSet in my Kubernetes cluster that currently has a high terminationGracePeriod to handle data handover.
The problem is that when a host machine goes down, K8s waits for the whole terminationGracePeriod in the termination phase before rescheduling my pod on another node.
How can I make K8s ignore the terminationGracePeriod when the node is down and reschedule pods immediately?
The problem is that when a host machine goes down, K8s waits for the whole terminationGracePeriod in the termination phase before rescheduling my pod on another node.
I think this is a wrong assumption. When a host machine goes down, the node health check is used to detect this, which typically takes around 5 minutes. Only after that are the pods scheduled to other nodes.
See Node Condition and pod eviction:
If the status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout (an argument passed to the kube-controller-manager), then the node controller triggers API-initiated eviction for all Pods assigned to that node. The default eviction timeout duration is five minutes.
How can I make K8s ignore the terminationGracePeriod when the node is down and reschedule pods immediately?
I don't think terminationGracePeriod is related to this. A pod gets a SIGTERM to shut down; only if it hasn't shut down successfully within the whole terminationGracePeriod is it killed with SIGKILL.
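For context, here is a minimal sketch of where that setting lives, assuming a Cassandra-style StatefulSet (all names and values are placeholders). The grace period only bounds how long the kubelet waits between SIGTERM and SIGKILL during a normal, graceful termination; it is not what delays rescheduling after a node failure.

    # Sketch only: placeholder names and image
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: cassandra
    spec:
      serviceName: cassandra
      replicas: 3
      selector:
        matchLabels:
          app: cassandra
      template:
        metadata:
          labels:
            app: cassandra
        spec:
          # Long grace period for data handover; only relevant to graceful shutdowns
          terminationGracePeriodSeconds: 1800
          containers:
          - name: cassandra
            image: cassandra:3.11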
Related
When I run the shutdown -h now command to shut down a node in a Kubernetes cluster, the endpoint updates its state after about 40 seconds, but when I run the command kubectl delete pod POD-NAME, the endpoint updates its state very quickly. Can anyone explain why?
When you "shutdown" a node, you should do it gracefully with kubectl drain. This will evict the pods in a controlled manner, which is more friendly to your traffic.
The article Kubernetes best practices: terminating with grace has a detailed description of all the steps that happen when a Pod is gracefully terminated. For planned maintenance, use a graceful shutdown; for unplanned maintenance you cannot do much.
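A rough sketch of that workflow (the node name is a placeholder):

    # Cordon and drain the node before maintenance; pods are evicted gracefully,
    # respecting their terminationGracePeriodSeconds and any PodDisruptionBudgets.
    kubectl drain <node-name> --ignore-daemonsets

    # ...perform the maintenance, then allow scheduling on the node again:
    kubectl uncordon <node-name>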
I found the solution here:
It may be caused by the node being marked as NotReady only after a grace period. This is configurable. After that, the pod goes into Terminating and is rescheduled.
--node-monitor-grace-period duration Default: 40s
Amount of time which we allow running Node to be unresponsive before marking it unhealthy. Must be N times more than kubelet's nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node status.
refer to: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
When I changed the value of --node-monitor-grace-period to 10s, the endpoint updated its state more quickly.
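To observe the effect, you can watch node conditions and Service endpoints while a node is taken offline (a sketch; the Service name is a placeholder):

    # Watch node conditions flip from Ready to NotReady/Unknown:
    kubectl get nodes -w

    # Watch a Service's endpoints get updated as pods are removed from rotation:
    kubectl get endpoints my-service -w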
I have set up a K8s cluster using EKS. The Cluster Autoscaler (CA) has been configured to increase/decrease the number of nodes based on resource availability for pods. The CA terminates a node if it is unneeded and its pods can be scheduled on another node. However, the CA terminates the node before rescheduling the pods on another node, so the pods only get scheduled on another node after the node has been terminated. Hence, there is some downtime for some services until the rescheduled pods become healthy.
How can I avoid the downtime by ensuring that the pods get scheduled on another node before the node gets terminated?
The graceful termination period for nodes is set to 10 minutes (the default).
You need to have multiple replicas of your application running. That will allow your application to survive even a sudden node death. You may also want to configure antiAffinity rules in your app manifest to ensure that the replicas reside on different nodes.
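A minimal sketch of such rules, assuming a generic Deployment (all names are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          affinity:
            podAntiAffinity:
              # Require replicas to land on different nodes so a single node
              # termination cannot take out every replica at once.
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: my-app
                topologyKey: kubernetes.io/hostname
          containers:
          - name: my-app
            image: my-app:1.0   # placeholder image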
I have a problem: I can't find how to change the parameter that controls how quickly pods are moved to another node when K8s detects that a node is down.
I found the parameter --sync-synchrionizes but I'm not sure.
Does anyone know how to do it?
You need to change kube-controller-manager.conf and update the following parameters (you can find the file in /etc/kubernetes/manifests):
node-status-update-frequency: 10s
node-monitor-period: 5s
node-monitor-grace-period: 40s
pod-eviction-timeout: 30s
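As a rough sketch, assuming a kubeadm-style static pod manifest (the exact file name and existing flags may differ on your cluster); note that node-status-update-frequency is actually a kubelet flag, not a controller-manager flag:

    # /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt, sketch only)
    spec:
      containers:
      - name: kube-controller-manager
        command:
        - kube-controller-manager
        - --node-monitor-period=5s
        - --node-monitor-grace-period=40s
        - --pod-eviction-timeout=30s
        # ...keep the other existing flags as they are

    # node-status-update-frequency is passed to the kubelet instead, e.g.
    # --node-status-update-frequency=10s on the kubelet command line or the
    # nodeStatusUpdateFrequency field in the kubelet config file.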
This is what happens when a node dies or goes offline:
The kubelet posts its status to the masters every --node-status-update-frequency=10s.
The node goes offline.
kube-controller-manager monitors all the nodes every --node-monitor-period=5s.
kube-controller-manager sees the node is unresponsive and allows the grace period --node-monitor-grace-period=40s before considering the node unhealthy. Note: this parameter should be N x node-status-update-frequency.
Once the node is marked unhealthy, kube-controller-manager removes its pods based on --pod-eviction-timeout=5m.
Now, if you tweak pod-eviction-timeout to, say, 30 seconds, it will still take about 70 seconds in total to evict the pods from the node, because node-status-update-frequency and node-monitor-grace-period count toward the total as well. You can tweak these variables too to lower the total node eviction time further.
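A rough timeline with those tweaked values (assuming defaults everywhere else):

    t = 0s     node goes offline; the kubelet stops posting status
    t ~ 40s    node-monitor-grace-period expires; node marked NotReady/Unknown
    t ~ 70s    pod-eviction-timeout (30s) expires; the pods on the node are
               evicted and recreated on other nodes by their controller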
Once a pod is scheduled to a particular node, it is never moved or shifted to another node; instead, a new pod is created on an available node.
If you don't have a Deployment or ReplicationController to manage the state (number of pods) of your application, the pod is lost forever. But if you are using a Deployment or another object responsible for maintaining the desired state, then when a node goes down it detects the change in the current state and creates a new pod on another node (depending on node capacity).
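A quick sketch of the difference (placeholder names and image; assumes a multi-node cluster):

    # A bare pod is not recreated if its node dies:
    kubectl run standalone --image=nginx:1.25

    # A Deployment-managed pod is; its ReplicaSet creates a replacement elsewhere:
    kubectl create deployment managed --image=nginx:1.25 --replicas=2

    # After the node running them fails and its pods are evicted:
    kubectl get pods -o wide -w
    #   standalone       -> gone (or stuck Terminating), never rescheduled
    #   managed-...      -> replacement pods created on other nodes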
I absolutely agree with praful above. It is quite challenging to evict the pods from a failed node and move them to another available node in 5 seconds; practically, it is not possible. You need to monitor the node status, allow a grace period to confirm that the node is indeed down, and then mark its status as unhealthy. Finally, the pods are moved to another active node.
You can tweak those node monitor parameters to much lower values, but the downside is that control plane performance takes a hit, as more connections are made between the kubelet and the API server.
I suggest you run 2 replicas of each pod so that your app is still available to serve user requests.
What is the kubelet's strategy for managing containers on the machine if the connection with the master is broken due to some network issue? Is it possible to configure the kubelet to kill all containers in such a situation?
Nodes in Kubernetes check in with the master at regular intervals. If they fail to check in AND the master is still up, then the pod eviction timeout flag comes into play.
It basically waits this long before the pods are rescheduled elsewhere in the cluster. It is common to wait in case the machine is just rebooting or something similar.
The flag is in the controller manager: --pod-eviction-timeout=5m0s: The grace period for deleting pods on failed nodes.
The second scenario is when the master goes down (or, more specifically, the controller-manager). If it stops responding, the cluster will still function as-is without interruption.
I need to scale down the number of pods in a replication controller. However, I need a clean scale-down:
Stop sending load to the pods that will be scaled down
Wait for the pods to finish handling the load they already have
Delete the pod
I do not want a pod to be deleted while it is still doing work. Is there a way to do that with Kubernetes?
Check out the Termination of Pods section in the pods user guide. You might wish to implement a preStop hook to ensure traffic is drained before the TERM signal is sent to the processes in the pod.
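A minimal sketch of such a hook, assuming the app exposes a hypothetical drain step (the command, image, and timings are placeholders); the preStop hook runs before SIGTERM is delivered:

    apiVersion: v1
    kind: Pod
    metadata:
      name: worker                       # placeholder name
    spec:
      terminationGracePeriodSeconds: 60  # must cover the drain time below
      containers:
      - name: worker
        image: my-worker:1.0             # placeholder image
        lifecycle:
          preStop:
            exec:
              # Hypothetical drain: stop accepting new work and wait for
              # in-flight work to finish before SIGTERM is sent.
              command: ["/bin/sh", "-c", "touch /tmp/draining && sleep 30"]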