Is it okay to change the pod eviction timeout? (k8s, OpenShift)

I want to know about the pod eviction timeout. I've already read the Kubernetes and OpenShift manuals and some blog posts, but I couldn't find an article on the impact of reducing pod-eviction-timeout (default: 5m).
I assume there is a reason the default value is 5 minutes, but I can't find it.
Can you tell me how it will affect the k8s cluster if I change this setting?
(e.g. setting pod-eviction-timeout to 2 minutes or less)
For reference: we have an OpenShift (OKD) cluster and it runs many services.

Whether the 5m timeout is a valid choice depends on your services and your infrastructure.
There are multiple reasons for a pod to be evicted: node pressure, scheduling decisions driven by resource limits, priorityClasses, taints/tolerations, etc. Basically, pods are evicted on some kind of failure or on some kind of scheduling event, which can also be initiated by a user.
If you lower the timeout, Kubernetes will not wait as long before forcefully killing the processes during eviction. That can lead to unwanted behaviour with stateful services: the pod may not have enough time to shut down gracefully, and the attached volume may not be available in time when the pod is rescheduled on another node. Stateless services are easier, so there won't be such problems.
In short: if you are running stateless services, this should not cause any problems. If you have stateful services, it may cause problems, but that cannot be answered in general; you have to test it and see what happens, because you (and your team) know your services best.
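If you only need a different eviction delay for particular workloads, note that newer Kubernetes/OKD releases handle this delay via taint-based evictions, so you can override it per pod instead of changing the cluster-wide flag. A minimal sketch, with an illustrative pod name, placeholder image and timings:

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app                  # illustrative name
    spec:
      containers:
      - name: app
        image: nginx:1.25                # placeholder image
      tolerations:
      # Evict this pod ~2 minutes after its node becomes NotReady/unreachable,
      # instead of waiting for the cluster-wide default of 5 minutes.
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 120
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 120

The same tolerations can also go into a Deployment's pod template, which lets you test a shorter delay on a few services before touching the cluster default.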

Related

Schedule as many pods as will fit in the cluster?

I've got a batch job to run: process a large number of media files. I have a Kubernetes cluster to run it on, but I don't want to change the size of the cluster. I want to run the processing as a low-priority job. Any time there are spare compute resources, they should work on media-processing. Any time there are other jobs that need resources, the media process should be suspended.
Currently, I'm running a Deployment with one replica for each node in my cluster. I defined a PriorityClass for the batch-job and a different PriorityClass (with higher priority) for everything else. That seems to be working to evict running batch-jobs when something else needs the resources.
I defined an affinity, specifically a weighted pod anti-affinity, to discourage the batch jobs from scheduling on the same machine.
The code itself is a queue-worker: it pulls one work-item off a shared queue and processes it and then goes back for the next. If it gets interrupted (because it's being evicted) the partial work is lost (which is fine).
This is working OK, but I'm leaving a lot of resources on the table, still. Is there some way to define my replica-count as "as many as you can schedule"? I could ask for far more replicas than the cluster can handle; would that be a good solution? Or are there problems with Kubernetes having 10 pods stuck "pending" for months at a time?
I think there's no harm in asking for more pods than the cluster can handle and keeping them pending forever. My only concern is whether the scheduler will be able to tell normal-priority pending pods apart from low-priority pending pods and give precedence to the more urgent ones.
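For illustration, the two priority levels described in the question could look roughly like this (names and values are made up):

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: batch-low                    # illustrative name
    value: 1000                          # lower value: these pods are preempted first
    preemptionPolicy: Never              # batch pods never preempt other workloads
    globalDefault: false
    description: "Low-priority media-processing batch jobs"
    ---
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: services-high                # illustrative name
    value: 100000                        # higher value: may preempt batch pods when resources are tight
    globalDefault: true
    description: "Everything else"

With this split, pending batch pods simply wait, while higher-priority pods can preempt running batch pods when they need the room.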
The pro way to go about this, IMHO, is to leverage the Prometheus adapter and use an HPA that targets the current capacity of your cluster via a Prometheus query. This gives you a continuous view of the cluster capacity and the ability to autoscale accordingly. This Medium article has a pretty good introduction to the concept.
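As a rough sketch of that idea, assuming prometheus-adapter is serving the external metrics API and exposes a metric here hypothetically named cluster_allocatable_cpu_cores:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: media-worker                          # illustrative name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: media-worker                        # the batch-worker Deployment
      minReplicas: 1
      maxReplicas: 100
      metrics:
      - type: External
        external:
          metric:
            name: cluster_allocatable_cpu_cores   # hypothetical metric exposed via prometheus-adapter
          target:
            type: AverageValue
            averageValue: "1"                     # roughly one worker per free core; tune for your workload

The exact query and metric name live in the adapter's configuration, so treat the numbers above as placeholders to tune against your own cluster.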

Scaling down video conference software in Kubernetes

I'm planning to deploy custom WebRTC videoconference software (based on Node.js, using WebSockets) with Kubernetes, but I have some doubts about scaling down this environment.
I'm planning to use a cloud-hosted Kubernetes service (GKE, EKS, AKS or any other) to be able to auto-scale the nodes in the cluster as demand increases and decreases. Scaling up is not the problem; my concern is scaling down.
The cluster will scale down based on some average CPU usage metric across the cluster, as I understand it, and if it tries to remove a node, it will start to drain it and stop accepting new connections, right? But now imagine that there's a videoconference still running on this "pending deletion" node. There are two problems:
1 - Stopping the node before the videoconference finishes (it will drop the meeting)
2 - With the draining behaviour during scale-down, the node will stop receiving new connections, so if someone tries to join this running videoconference, they will receive a timeout, right?
So, which is the best strategy to scale down nodes for a video conference solution? Any ideas?
Thanks
I would say this is not a matter of solving it at the Kubernetes level with some specific scaling strategy, but rather of the application's ability to handle such situations. It isn't even specific to Kubernetes. Imagine that you deploy it directly on compute instances that are also subject to autoscaling, and you'll end up in exactly the same situation when the load decreases and one of the instances is removed from the set.
You should rather ask yourself whether such an application is suitable to be deployed as a Kubernetes workload. I can imagine that such a videoconference session doesn't have to rely on a backend deployed on a single node only. You can even define affinity or anti-affinity rules to prevent your Pods from being scheduled on the same node. So if the whole application cluster is still up and running (its Pods are running on different nodes), the eviction of a limited subset of Pods should not have a big impact.
You can actually face the same issue with any other application, as the vast majority of them are based on some session that needs to be established between the client software and the server side. I would say it's the application's responsibility to handle such scenarios. If a user unexpectedly loses the connection, it should be possible to immediately redirect them to a running instance, e.g. a different Pod which is still able to accept new requests.
So basically, if the application is designed to be highly available, scaling in (when we talk about horizontal scaling we actually talk about scaling in and scaling out) the underlying VMs, or more specifically the Kubernetes nodes, shouldn't affect its high-availability capabilities. On the other hand, if it is not designed to be highly available, a solution such as Kubernetes probably won't help much.
There is no single best strategy for your use case. When a cloud provider scales down, it picks a node at random and kills it. It does not check whether that node has the lowest resource consumption and kill that one instead, so it might end up killing the node with the most pods running on it.
I would focus on how you want to schedule your pods. I would try to schedule them, if possible, on a node that already has running pods (pod inter-affinity), and I would set up a PodDisruptionBudget for all Deployments/StatefulSets/etc. (depending on how you run the pods). As a result, the cluster would only scale down a node when it has no pods running on it, and it would kill that node, because the pods on the other nodes are protected by a PDB.
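For example, the inter-affinity part could be a preferred pod affinity in the Deployment's pod template; the label and weight below are just placeholders:

    # Fragment of a pod template spec: prefer landing on nodes that already run
    # pods of the same app, so the remaining nodes can drain empty and be removed.
    affinity:
      podAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: videoconference          # assumed pod label
            topologyKey: kubernetes.io/hostname

This is only a sketch of the packing idea; whether packing or spreading is right depends on how much a single node failure would hurt your conferences.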

K8s - Schedule new pod before the old one is terminated

I have read up on the Kubernetes docs but I'm unable to get a clear answer on my question. I'm using the official cluster-autoscaler.
I have a deployment that specifies one replica should be running. When a pod is terminated (for example, because it was running on a node that is being scaled down), is the new pod scheduled before the termination begins or after the termination is done? The docs say that scheduling happens while terminating, but they don't mention at which phase.
To achieve a seamless node scale-down without disruption to any services, I would expect k8s to scale the pods up to N+1 replicas (at this point pods are scheduled only to nodes that are not scaling down) and then drain the node. Based on my testing, it first drains and then schedules any missing pods based on the configuration. Is it possible to configure this behaviour, or is this currently not possible?
From what I understand, seamless updates are easy with the RollingUpdate strategy. I have not found a similar "rolling" strategy for scale-down.
EDIT
TL;DR I'm looking for HA with a) a deployment of two or more replicas and b) a single-replica deployment
a) Can be achieved by using PDBs. Check out Fritz's answer. If you need pods scheduled on different nodes, leverage anti-affinity (Marc's answer).
b) If you're okay with short disruption, PDB is the official way to go. If you need a workaround, my answer can be of inspiration.
The scale-down behavior can be configured with what is called a Disruption Budget.
Alongside your Deployment manifest you can define a PodDisruptionBudget that sets the maxUnavailable or minAvailable number of Pods during voluntary disruptions such as draining nodes.
For how to do it, check out the K8s documentation.
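A PodDisruptionBudget is its own object that selects the Deployment's pods; a minimal sketch (name, label and number are illustrative):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb                    # illustrative name
    spec:
      minAvailable: 1                     # keep at least one pod up during voluntary disruptions (e.g. drain)
      selector:
        matchLabels:
          app: my-app                     # must match the Deployment's pod labels

With only one replica, note that a drain will then block until the replacement pod is up somewhere else, which is exactly the behaviour you want here.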
Below are some insights; hope this helps:
If you use a Deployment, then the controller makes sure you always have the desired number of replicas running, no fewer and no more. So when you kill a node (which hosts one of your replicas), the new pod will be scheduled after the termination of the original replica. It's up to you to anticipate this if it's planned maintenance.
If you have lots of nodes (meaning more than one) and want to achieve HA (high availability) for your Deployments, then you should have a look at pod affinity/anti-affinity. You can find out more in the official docs.
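As a sketch, a required pod anti-affinity in the pod template keeps the replicas on different nodes (the label is an assumption):

    # Fragment of a pod template spec: never co-locate two replicas of this app on the same node.
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: my-app                 # assumed pod label
          topologyKey: kubernetes.io/hostname

If strict separation would leave pods unschedulable on a small cluster, the preferred variant of the same rule is usually the safer choice.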
Hate to answer my own question, but an easy solution for a highly available service with only one pod (not wasting resources by running an idle replica) is to use a preStop hook (to make the action blocking if proper SIGTERM handling is not implemented) together with terminationGracePeriodSeconds set to enough time for the other instance to start.
Contrary to what has been said here, the scheduling happens while the pod is terminating. I did a quick test (should have done that along with reading the docs) in which I created a busybox (sh sleep 3600) deployment with one replica and terminationGracePeriodSeconds set to 240 seconds.
When the pod is deleted, it enters the Terminating state and stays in that state for up to 240 seconds. Immediately after the pod was marked as Terminating, a new pod was scheduled to replace it.
So the previous pod has time to finish whatever it is doing and the other one can seamlessly take its place.
I haven't tested how the networking behaves, since the LB will stop sending new requests, but I assume the downtime will be much lower than it would be without terminationGracePeriodSeconds set higher than the default.
Beware that this is not official by any means, but it serves as a workaround for my use case.
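For reference, a minimal sketch of that workaround (image, names and sleep durations are placeholders; the preStop sleep just buys time for the replacement pod to start):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: single-replica-app            # illustrative name
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: single-replica-app
      template:
        metadata:
          labels:
            app: single-replica-app
        spec:
          terminationGracePeriodSeconds: 240   # must exceed the preStop sleep
          containers:
          - name: app
            image: busybox:1.36           # placeholder image
            command: ["sh", "-c", "sleep 3600"]
            lifecycle:
              preStop:
                exec:
                  # Block termination so the old pod keeps serving while the new one starts.
                  command: ["sh", "-c", "sleep 60"]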

Can we have --pod-eviction-timeout=300m?

I have a k8s cluster. In our cluster we do not want pods to get evicted, because pod eviction causes a lot of side effects for the applications running on them.
To prevent pod eviction from happening, we have configured all the pods with the Guaranteed QoS class. I know that even with this, pod eviction can happen if there is resource starvation in the system. We have monitors to alert us when there is resource starvation within a pod or node, so we get to know well before a pod gets evicted. This helps us take measures before a pod gets evicted.
The other reason for pod eviction is the node being in a not-ready state: the kube-controller-manager checks the pod-eviction-timeout and evicts the pods after this timeout. We have a monitor to alert us when a node goes into the not-ready state. After this alert we want to take some measures to clean up on the application side, so the application can end gracefully. This clean-up needs more than a few hours, but pod-eviction-timeout is 5 minutes by default.
Is it fine to increase the pod eviction timeout to 300m? What are the impacts of increasing this timeout to such a value?
P.S.: I know that during this wait time, if the pod uses more resources, the kubelet itself can evict the pod. I want to know what the other impacts of waiting for such a long time are.
As #coderanger said, your limits are incorrect and this should be fixed instead of lowering the self-healing capabilities of Kubernetes.
If your pod dies, no matter what the issue was, by default it will be rescheduled based on your configuration.
If you are having a problem with this, then I would recommend redoing your architecture and rewriting the app to use Kubernetes the way it's supposed to be used.
If you are having problems with a pod still being sent requests while it's unresponsive, you should implement an LB in front or queue the requests,
if you are having problems with IPs changing after a pod restarts, this should be fixed by using DNS and a Service instead of connecting directly to a pod (see the sketch after this list),
if your pod is being evicted, check why, and set appropriate requests and limits.
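A minimal Service sketch for the DNS point above (name, label and ports are assumptions):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app                        # clients resolve my-app.<namespace>.svc via cluster DNS
    spec:
      selector:
        app: my-app                       # assumed pod label
      ports:
      - port: 80                          # port clients connect to
        targetPort: 8080                  # assumed container port

Clients then keep using the stable Service name and virtual IP, regardless of how often the pods behind it restart.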
As for the node, there is a really nice blog post about Improving Kubernetes reliability: quicker detection of a Node down. It's the opposite of what you are thinking of doing, but it also explains why 340s is too much:
Once the node is marked as unhealthy, the kube-controller-manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout; by default it's 5m, which in my opinion is too high, because even though the node is already marked as unhealthy, the kube-controller-manager won't remove the pods yet, so they will still be reachable through their Service and requests to them will fail.
If you still want to change the default values, you can look into raising these:
kubelet: node-status-update-frequency=10s
controller-manager: node-monitor-period=5s
controller-manager: node-monitor-grace-period=40s
controller-manager: pod-eviction-timeout=5m
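On kubeadm-style clusters these are flags on the component manifests; OKD manages its control plane through operators, so treat the following only as a sketch of where the controller-manager flags live:

    # /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm-style clusters, sketch only)
    spec:
      containers:
      - name: kube-controller-manager
        command:
        - kube-controller-manager
        - --node-monitor-period=5s
        - --node-monitor-grace-period=40s
        - --pod-eviction-timeout=5m0s
        # ...other existing flags left unchanged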
If you provide more details I'll try to help more.

Pod is still running after I delete the parent job

I created a job in my Kubernetes cluster. The job takes a long time to finish, so I decided to cancel it and deleted the job, but I noticed the associated pod is NOT automatically deleted. Is this the expected behavior? Why is it not consistent with deployment deletion? Is there a way to have the pod deleted automatically?
If you're deleting a deployment, chances are you don't want any of the underlying pods, so by default it most likely deletes the pods along with it. Also, the desired state of the pods would be unknown.
On the other hand, if you're deleting a pod, Kubernetes doesn't know what kind of replication controller may be attached to it or what that controller will do next. So it signals a shutdown to the container so that it can clean up gracefully. There may be processes still using the pod, like an in-flight web request, and it would not be good to kill them if they only need a second to complete. This is also what happens when you scale up your pods or roll out a new deployment and you don't want any users to experience downtime. It is in fact one of the benefits of Kubernetes, as opposed to a traditional application server that requires you to shut down the system to upgrade (or to play with load balancers to redirect traffic), which may negatively affect users.
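If the goal is to have the pods removed together with the job, explicitly asking for cascading deletion usually does it; a sketch, assuming kubectl v1.20 or newer (older versions use --cascade=true/false instead):

    # Delete the job and block until its pods are gone too
    kubectl delete job <job-name> --cascade=foreground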