Can we have --pod-eviction-timeout=300m? - kubernetes

I have a k8s cluster. In our cluster we do not want pods to get evicted, because pod eviction causes a lot of side effects for the applications running on them.
To prevent pod eviction from happening, we have configured all the pods with the Guaranteed QoS class. I know that even with this, pod eviction can happen if there is resource starvation in the system. We have monitors to alert us when there is resource starvation within a pod or node, so we get to know well before a pod gets evicted and can take measures in time.
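For reference, by Guaranteed QoS I mean that every container in the pod has its requests set equal to its limits, roughly like this minimal sketch (names and values are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example        # placeholder name
spec:
  containers:
  - name: app
    image: example/app:1.0        # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"               # requests == limits for every container => Guaranteed QoS
        memory: "512Mi"
```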
The other reason for pod eviction is when a node is in the not-ready state: kube-controller-manager checks the pod-eviction-timeout and evicts the pods after this timeout. We have a monitor to alert us when a node goes into the not-ready state. After this alert we want to run some clean-up on the application side, so the application can end gracefully. This clean-up needs more than a few hours, but pod-eviction-timeout is 5 minutes by default.
Is it fine to increase the pod eviction timeout to 300m? What are the impacts of increasing this timeout to such a value?
P.S.: I know that during this wait time, if the pod utilises more resources, the kubelet can itself evict the pod. I want to know what the other impacts of waiting for such a long time are.

As #coderanger said, your limits are incorrect and this should be fixed instead of lowering the self-healing capabilities of Kubernetes.
If your pod dies, no matter what the issue with it was, by default it will be rescheduled based on your configuration.
If you are having a problem with this then I would recommend redoing your architecture and rewriting the app to use Kubernetes how it's supposed to be used.
if you are having problems with a pod still being sent requests when it's unresponsive, you should implement a load balancer in front of it or queue the requests;
if you are having a problem with IPs changing after pod restarts, fix it by using DNS and a Service instead of connecting directly to a pod (see the sketch after this list);
if your pod is being evicted, check why, and set proper limits and requests.
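For the DNS/Service point above, a minimal sketch of putting a Service in front of the pods so clients use a stable DNS name instead of pod IPs (labels and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app            # reachable as my-app.<namespace>.svc.cluster.local
spec:
  selector:
    app: my-app           # placeholder label carried by the pods
  ports:
  - port: 80              # port clients connect to
    targetPort: 8080      # placeholder container port
```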
As for the node, there is a really nice blog post about Improving Kubernetes reliability: quicker detection of a Node down. It's the opposite of what you are thinking of doing, but it also mentions why 340s is too much:
Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout, by default it’s 5m which in my opinion is too high, because although the node is already marked as unhealthy the kube controller manager won’t remove the pods so they will be accessible through their service and requests will fail.
If you still want to change the default values, you can look into changing these (see the sketch after this list):
kubelet: node-status-update-frequency=10s
controller-manager: node-monitor-period=5s
controller-manager: node-monitor-grace-period=40s
controller-manager: pod-eviction-timeout=5m
to higher ones.
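As a sketch of where these live on a kubeadm-style cluster (paths are assumptions; adjust for your distribution), the controller-manager flags go into its static pod manifest and the kubelet setting into its KubeletConfiguration:

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm layout) -- excerpt only
spec:
  containers:
  - command:
    - kube-controller-manager
    - --node-monitor-period=5s
    - --node-monitor-grace-period=40s
    - --pod-eviction-timeout=5m0s      # the value you would raise
    # ... keep the rest of the existing flags unchanged ...
---
# Kubelet side, e.g. /var/lib/kubelet/config.yaml on kubeadm nodes
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"       # equivalent of --node-status-update-frequency
```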
If you provide more details I'll try to help more.

Related

Is it okay to change the pod eviction timeout? (k8s, openshift)

I want to know about the pod eviction timeout. I've already read the k8s and OpenShift manuals and some blogs,
but I couldn't find an article on the impact of reducing pod-eviction-timeout (default: 5m).
I think there is a reason why the default value is 5 minutes, but I can't find the reason.
Can you tell me how it will affect the k8s cluster if I change the setting?
(e.g., change pod-eviction-timeout to 2 minutes or less)
For reference: we have an OpenShift (OKD) cluster and it has many services.
Whether the 5m timeout is a valid choice or not depends on your services and your infrastructure.
There are multiple reasons for a pod to be evicted, like node pressure, scheduling priorities due to resource limits, priorityClasses, taints/tolerations, etc. Basically, pods will be evicted on some kind of failure or on some kind of scheduling event, which can also be initiated by a user.
If you change the timeout, Kubernetes will not wait as long before forcefully killing the processes during the eviction. That can lead to some unwanted behaviour with stateful services, because they may not have enough time to shut down gracefully, and the attached volume may not be available in time when the pod is scheduled on another node again. With stateless services everything is easier, so there won't be such problems.
In short: if you are running stateless services, this should not lead to any problems. If you have stateful services, it may cause problems, but that cannot be answered generally. You have to test it and see what happens, because you (and your team) know your services best.
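If graceful shutdown time is the main worry with stateful services, a minimal sketch of buying the pod more time on termination (the grace period and the cleanup command are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: stateful-example                 # placeholder
spec:
  terminationGracePeriodSeconds: 120     # time allowed between SIGTERM and SIGKILL
  containers:
  - name: db
    image: example/db:1.0                # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "/scripts/flush-and-drain.sh"]   # hypothetical cleanup script
```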

what is the use of vertical pod autoscaler "auto" mode

As far as I understand from the VPA documentation, the vertical pod autoscaler stops/restarts the pod based on the predicted request/limit lower/upper bounds and target.
In the "auto" mode it says that the pod will be stopped and restarted. However, I don't get the point of making a prediction and restarting the pod while it is still working: although we know that it might run out of resources eventually, it is still working, and we could wait to rescale it once it has really run out of memory/CPU. Isn't it more efficient to just wait for the pod to run out of memory/CPU and then restart it with the new predicted request?
Is recovering from a dead container more costly than stopping and restarting the pod ourselves? If yes, in what ways?
Isn't it more efficient to just wait for the pod to run out of
memory/CPU and then restart it with the new predicted request?
In my opinion this is not the best solution. If the pod tries to use more CPU than its limit, the container's CPU use is throttled; if the container tries to use more memory than its limit, Kubernetes OOM-kills the container. Because of limit overcommit, the sum of pod limits can be higher than the node capacity, so this can lead to memory exhaustion on the node and can cause the death of other workloads/pods.
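As a small illustration of that difference (values purely illustrative): the scheduler reserves the requests, CPU beyond the limit is only throttled, while memory beyond the limit gets the container OOM-killed.

```yaml
resources:
  requests:
    cpu: "250m"        # reserved for scheduling
    memory: "256Mi"
  limits:
    cpu: "500m"        # exceeding this only throttles the container
    memory: "512Mi"    # exceeding this gets the container OOM-killed
```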
Answering your question - VPA was designed to simplify those scenarios:
Vertical Pod Autoscaler (VPA) frees the users from necessity of
setting up-to-date resource limits and requests for the containers in
their pods. When configured, it will set the requests automatically
based on usage and thus allow proper scheduling onto nodes so that
appropriate resource amount is available for each pod. It will also
maintain ratios between limits and requests that were specified in
initial containers configuration.
In addition, VPA is not only responsible for scaling up but also for scaling down:
it can both down-scale pods that are over-requesting resources, and also up-scale pods that are under-requesting resources based on their usage over time.
Is recovering from a dead container more costly than stopping and
restarting the pod ourselves? If yes, in what ways?
Talking about the cost of recovering from a dead container: the main possible cost is the requests that can get lost during the OOM-killing process, as per the official docs.
As per the official documentation, VPA operates in these modes:
"Auto": VPA assigns resource requests on pod creation as well as
updates them on existing pods using the preferred update mechanism
 Currently this is equivalent to "Recreate".
"Recreate": VPA assigns resource requests on pod creation as well as
updates them on existing pods by evicting them when the requested
resources differ significantly from the new recommendation (respecting
the Pod Disruption Budget, if defined).
"Initial": VPA only assigns resource requests on pod creation and
never changes them later.
"Off": VPA does not automatically change resource requirements of the
pods.
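A minimal VPA object using the Auto update mode could look like this (assumes the VPA components/CRDs are installed; the target Deployment name is a placeholder):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # placeholder Deployment name
  updatePolicy:
    updateMode: "Auto"      # other accepted values: "Recreate", "Initial", "Off"
```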
NOTE:
VPA Limitations
VPA recommendation might exceed available resources, such as your cluster capacity or your team's quota. Not enough available resources may cause pods to go pending.
VPA in Auto or Recreate mode won’t evict pods with one replica as this would cause disruption.
Quick memory growth might cause the container to be out-of-memory killed. As out-of-memory killed pods aren't rescheduled, VPA won't apply new resources.
Please also take a look at some of the VPA Known limitations:
Updating running pods is an experimental feature of VPA. Whenever VPA updates the pod resources the pod is recreated, which causes all
running containers to be restarted. The pod may be recreated on a
different node.
VPA does not evict pods which are not run under a controller. For such pods Auto mode is currently equivalent to Initial.
VPA reacts to most out-of-memory events, but not in all situations.
Additional resources:
VERTICAL POD AUTOSCALING: THE DEFINITIVE GUIDE

Kubernetes Autoscaler: no downtime for deployments when downscaling is possible?

In a project, I'm enabling the cluster autoscaler functionality from Kubernetes.
According to the documentation, How does scale down work, I understand that when a node has been utilized below 50% of its capacity for a given time, it is removed together with all of its pods, which will be recreated on a different node if needed.
But the following problem can happen: what if all the pods related to a specific deployment are contained in a node that is being removed? That would mean users might experience downtime for the application of this deployment.
Is there a way to avoid that the scale down deletes a node whenever there is a deployment which only contains pods running on that node?
I have checked the documentation, and one possible (but not good) solution is to add an annotation to all of the pods hosting applications, but this clearly would not scale down the cluster in an optimal way.
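For reference, the annotation meant here is presumably the Cluster Autoscaler safe-to-evict one, set per pod, for example:

```yaml
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"   # tells the autoscaler not to remove the node hosting this pod
```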
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is Eviction?
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
Ok, but what if all pods get evicted at the same time on the node?
You can use a Pod Disruption Budget to make sure a minimum number of replicas is always working:
What is a PDB?
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use a PDB for your deployment, its pods should not get deleted all at once.
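A minimal sketch of such a PDB, assuming the deployment's pods carry an app: my-app label (on older clusters the API group is policy/v1beta1):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1              # or a percentage such as "50%"
  selector:
    matchLabels:
      app: my-app              # placeholder label selecting the deployment's pods
```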
But please note that if the node fails for some other reason (e.g. hardware failure), you will still experience downtime. If you really care about high availability, consider using pod anti-affinity to make sure the pods are not all scheduled on one node.
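A sketch of pod anti-affinity that asks the scheduler to spread the replicas across nodes (the label is a placeholder; use the required... variant instead of preferred... for a hard guarantee):

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app                          # placeholder label shared by the replicas
          topologyKey: kubernetes.io/hostname      # spread across nodes
```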
The same document you referred to has this:
How is Cluster Autoscaler different from CPU-usage-based node autoscalers? Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.

Kubernetes Deployment with Zero Down Time

As a learner of Kubernetes concepts, how they work, and how to deploy with them, I have a couple of cases which I don't know how to achieve. I am looking for advice or some guidelines on how to achieve them.
I am using the Google Cloud Platform. The current running flow is described below. A push to the google source repository triggers Cloud Build which creates a docker image and pushes the image to the running cluster nodes.
Case 1: I want traffic to be routed to the new pods only once they are up and running, and the old pods to be killed only after each of them completes its in-flight requests. Zero downtime is what I'm looking to achieve.
Case 2: What will happen if the disk usage of a running pod reaches 100% and, as in the Debian case, the inode count reaches full capacity? Will Kubernetes create new pods to manage this?
Case 3: How to manage pod-to-database connection limits?
As the other answer says, use liveness and readiness probes. Basically, a new pod is added to the Service pool, but it will only serve traffic after its readiness probe has passed. The old pod is removed from the Service pool, then drained and terminated. This happens in a rolling fashion, one pod at a time.
This really depends on the capacity of your cluster and the ability to schedule pods given the limits for the containers in them. For more about setting up limits for containers refer to here. In terms of the inode limit, if you reach it on a node, the kubelet won't be able to run any more pods on that node. The kubelet eviction manager also has a mechanism whereby it evicts the pods using the most inodes. You can also configure your eviction thresholds on the kubelet.
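As a sketch of the kind of kubelet eviction thresholds meant here, assuming the kubelet reads a KubeletConfiguration file (the percentages are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"       # evict pods when node filesystem space runs low
  nodefs.inodesFree: "5%"       # evict pods when free inodes on the node filesystem run low
```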
This would be more of a limitation at the OS level combined with your stateful application's configuration. You can keep this configuration in a ConfigMap. For example, for MySQL the option would be max_connections.
I can answer case 1 since I've done it myself.
Use Deployments with readinessProbes & livenessProbes.
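A minimal sketch of such a Deployment, combining a rolling update strategy with the probes (image, port, and health endpoint are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # never drop below the desired replica count
      maxSurge: 1            # bring one new pod up before an old one is removed
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: example/app:1.0           # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz               # placeholder health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz               # placeholder health endpoint
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
```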

Kubernetes: do evicted pods with no resource requests get rescheduled successfully?

I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction and rescheduling keep happening over and over? If so, is there some metric I can use to detect that this is happening?
A pod will be scheduled as long as there's an eligible node that can satisfy the requested resources. So if you do not specify a request, the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, and a limit is a constraint on a pod that is already running.
If you overcommit the actual resources on a node you will run into typical issues: if you overcommit on memory the node will start to swap, and with CPU there will just be a general slowdown. Either way the node and the pods on it can become unresponsive. This is difficult to deal with, and tools like requests and limits set up sane boundaries that keep you from taking things quite this far; instead you'll simply see the pod fail to schedule.
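To make the request/limit distinction concrete for the scenario in the question, one hedged option is to request roughly the typical usage and cap at the spike (numbers are illustrative only):

```yaml
resources:
  requests:
    memory: "10Mi"     # what the scheduler counts against the node's 1GB
  limits:
    memory: "100Mi"    # the ceiling at which the container gets OOM-killed
```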
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So, answers to your questions:
At most 10 pods will be scheduled onto your node.
If there is no free memory in the node, the evicted pods will stay pending. Also, k8s can simply evict a pod if it exceeds its limits when resources are needed for other pods and services.