According to the docs:
Failed containers that are restarted by the kubelet are restarted with an exponential back-off delay; the delay is in multiples of sync-frequency (0, 1x, 2x, 4x, 8x …), capped at 5 minutes, and is reset after 10 minutes of successful execution.
Is there any way to define a custom RestartPolicy? I want to minimize the back-off delay as much as possible and drop the exponential behavior.
As far as I can find, you can't even configure the RestartPolicy, let alone define a new one...
The back-off delay is not tunable because changing it could severely affect the reliability of the kubelet. Imagine you have some pods that keep crashing on a node: the kubelet would continuously restart all those pods/containers with no break, consuming a lot of resources.
Why do you want to change the restart backoff delay?
About customizing your RestartPolicy, according to the Kubernetes documentation:
Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
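For illustration, this is roughly where that field lives in a Deployment (names and image are placeholders); Always is the only value the template will accept, and even for a bare Pod (where OnFailure and Never are also valid) the back-off timing itself is not configurable:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment        # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      restartPolicy: Always       # the only value allowed in a Deployment template
      containers:
      - name: app
        image: nginx              # placeholder image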
You can see the detailed answer from @Rohit here.
I've spent over a full day trying to make sense of Kubernetes' resource management. Specifically, I'm trying to set up eviction thresholds and resource reservations in such a way that there is always at least 1GiB of memory available.
Going on the documentation regarding resource reservations and out-of-resource handling, I figured setting the following eviction policy would suffice:
--eviction-hard=memory.available<1Gi
However, in practice, this does not work at all, as the computation the kubelet does seems to be different from the computation the kernel does when it needs to determine whether or not the OOM killer needs to be invoked. E.g. when I load up my system with a bunch of pods running an artificial memory hog, I get the following report from free -m:
              total        used        free      shared  buff/cache   available
Mem:          15866       14628         161          53        1077         859
According to the kernel, there's 859 MiB of memory available. Yet, the kubelet does not invoke its eviction policy. In fact, I've been able to invoke the system OOM killer before the kubelet eviction policy was invoked, even when ramping up memory usage incredibly slowly (to allow the kubelet housekeeping control loop to sleep for 10 seconds, as per its default configuration).
I've found this script which used to be in Kubernetes documentation and is supposed to calculate the available memory in the same way the Kubelet does. I ran it in parallel to free -m above and got the following result:
memory.available_in_mb 1833
That's almost a 1000 MiB difference!
Now, I understand the difference in calculation is by design, but that leaves me with the obvious question: how can I reliably manage system resource usage so that the system OOM killer does not get invoked? What eviction policy can I set so the kubelet will start evicting pods when there's less than a gigabyte of memory available?
According to the documentation (https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/), you should add the kubelet flag --system-reserved=memory=1024Mi.
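If your kubelet is driven by a configuration file rather than command-line flags, the equivalent settings would look roughly like this (a sketch only; the reservation and threshold values are examples to tune per node):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: 1024Mi            # memory set aside for system daemons, outside pod accounting
evictionHard:
  memory.available: "1Gi"   # start hard-evicting pods below this threshold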
I have Sidekiq custom metrics coming from the Prometheus adapter, and I have set up an HPA using those queue metrics from Prometheus. When the number of jobs in the Sidekiq queue goes above, say, 1000, the HPA triggers 10 new pods, and each pod then executes 100 jobs from the queue. When the jobs are reduced to, say, 400, the HPA scales down. But when scale-down happens, the HPA kills pods, say 4 of them, while they are still running jobs (each pod was running 30-50 jobs). When the HPA deletes these 4 pods, the jobs running on them are also terminated, and those jobs are marked as failed in Sidekiq.
So what I want to achieve is to stop the HPA from deleting pods that are still executing jobs. Moreover, I want the HPA not to scale down even after the load is reduced to the minimum, and instead delete pods only when the Sidekiq queue metric reaches 0.
Is there any way to achieve this?
Honestly, this is a weird usage: you're wasting resources even when your traffic is in the cool-down phase, but since you didn't provide further details, here it is.
Actually, it's not possible to achieve exactly what you want, since the common behavior is to support a growing load against your workload. The only way to get close (and it is not recommended) is to raise the Kubernetes Controller Manager's horizontal-pod-autoscaler-downscale-stabilization flag to a higher value.
JFI, the doc warns you:
Note: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there could be complaints that the Horizontal Pod Autoscaler is not responsive to workload changes. However, if the delay value is set too short, the scale of the replicas set may keep thrashing as usual.
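For illustration, assuming kube-controller-manager runs as a static pod, raising that flag could look roughly like this (illustrative excerpt; the 30m value is just an example, and every other existing flag stays unchanged):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (illustrative excerpt)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-downscale-stabilization=30m   # example value; the default is 5m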
As per the discussion and the work done by @Hb_1993, it can be done with a pre-stop hook to delay the eviction, where the delay is based on the operation time or on some logic that knows whether the processing is done or not.
A pre-stop hook is a lifecycle hook which is invoked before a pod is terminated, so we can attach to this event and perform some logic, such as a ping check to make sure that our pod has completed processing the current request.
PS: take this solution with a pinch of salt, as it might not work in all cases or may produce unintended results.
To do this, we introduce a sleep in the preStop hook that delays the shutdown sequence.
More details can be found in this article: https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
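A minimal sketch of that idea, assuming a plain fixed delay is acceptable (the pod name, image and the 60-second value are placeholders; ideally the command would instead poll until the in-flight jobs have finished):

apiVersion: v1
kind: Pod
metadata:
  name: sidekiq-worker                    # placeholder name
spec:
  terminationGracePeriodSeconds: 120      # must be longer than the preStop delay
  containers:
  - name: worker
    image: my-sidekiq-image:latest        # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 60"]   # delay SIGTERM so current jobs can drain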
I have a Kubernetes cluster with 2 slave nodes and 1 master node. When a node goes down, it takes approximately 5 minutes for Kubernetes to notice the failure. I am using dynamic provisioning for volumes, and this delay is a bit too long for me. How can I reduce the failure-detection time?
I found a post about it:
https://fatalfailure.wordpress.com/2016/06/10/improving-kubernetes-reliability-quicker-detection-of-a-node-down/
At the bottom of the post, it says we can reduce the detection time by changing these parameters:
kubelet: node-status-update-frequency=4s (from 10s)
controller-manager: node-monitor-period=2s (from 5s)
controller-manager: node-monitor-grace-period=16s (from 40s)
controller-manager: pod-eviction-timeout=30s (from 5m)
I can change the node-status-update-frequency parameter on the kubelet, but I don't have any controller-manager program or command on the CLI. How can I change those parameters? Any other suggestions for reducing the detection time would be appreciated.
...but I don't have any controller-manager program or command on the CLI. How can I change those parameters?
You can change/add those parameters in the controller-manager systemd unit file and restart the daemon. Please check the man page for the controller-manager here.
If you deploy the controller-manager as a microservice (pod), check the manifest file for that pod and change the parameters in the container's command section (for example, as in the sketch below).
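A rough sketch of what that change could look like, using the values quoted from the post above (illustrative excerpt; keep every other flag as it is):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (illustrative excerpt)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --node-monitor-period=2s            # how often the node controller checks node status
    - --node-monitor-grace-period=16s     # how long before an unresponsive node is marked NotReady
    - --pod-eviction-timeout=30s          # how long to wait before evicting pods from a NotReady node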
It's actually kube-controller-manager. You may also decrease --attach-detach-reconcile-sync-period from 1m to 15 or 30 seconds for kube-controller-manager. This will allow for more speedy volumes attach-detach actions. How you change those parameters depends on how you set up the cluster.
I'm running v1.10 and I notice that kube-controller-manager's memory usage spikes and it OOMs all the time. It wouldn't be so bad if the system didn't slow to a crawl before this happens, though.
I tried modifying /etc/kubernetes/manifests/kube-controller-manager.yaml to set resources.limits.memory=1Gi, but the kube-controller-manager pod never seems to come back up.
Any other options?
There is a bug in kube-controller-manager, and it's fixed in https://github.com/kubernetes/kubernetes/pull/65339
First of all, you didn't provide information about the amount of memory you use per node.
Second, what do you mean by "the system didn't fall to a crawl"? Do you mean the nodes are swapping?
All Kubernetes masters and nodes are expected to have swap disabled - it's recommended by the Kubernetes community, as mentioned in the Kubernetes documentation.
Support for swap is non-trivial and degrades performance.
Turn off swap on every node by:
sudo swapoff -a
Finally,
resources.limits.memory=1Gi
is a limit applied per pod. These limits are hard limits: a pod that reaches this level of allocated memory can be OOM-killed, even if the node still has gigabytes of unallocated memory.
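For reference, this is roughly what such a limit looks like in the static pod manifest (illustrative excerpt; the request value is an assumption, and the 1Gi limit is simply the one from the question):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (illustrative excerpt)
spec:
  containers:
  - name: kube-controller-manager
    resources:
      requests:
        memory: 512Mi      # assumed request; informs scheduling and accounting, not a ceiling
      limits:
        memory: 1Gi        # hard ceiling; the container is OOM-killed if it exceeds this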
A similar question on SO has 10 answers that all say 'force delete the pod' -_-
Of course this is unacceptable, as it causes problems on the cluster: too many pods get stuck in 'Terminating', and often when you try to delete a random pod it gets stuck as well. It happens fairly randomly.
So how can I determine, first, why the 'termination' commands are issued, and second, how to find the culprit behind the freezes?
Is it the CNI? Core components like the kubelet or the controller-manager?
Logs don't show anything useful, and neither does 'describe pod'.
If your pods get terminated with no apparent cause, it could be that:
the node is under stress (memory, CPU)
the liveness condition is not respected (see the probe sketch below)
For these reasons, the kubelet kills some pods.
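For instance, a container with an aggressive liveness probe like the sketch below (all names and values are made up) will be killed and restarted as soon as its endpoint is slow to answer once, which can look like 'random' terminations:

apiVersion: v1
kind: Pod
metadata:
  name: probe-example          # placeholder name
spec:
  containers:
  - name: app
    image: my-app:latest       # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz         # assumed health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 1        # very tight timeout
      failureThreshold: 1      # a single slow reply restarts the container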
How can you determine the precise cause?
If you found the 'logs' and 'describe' commands useless, a monitoring system could help (e.g. InfluxDB + Grafana: https://github.com/kubernetes/heapster/tree/master/deploy/kube-config/influxdb).