High loads causing node to become NotReady?

I'm running several experiments in GCE with a Kubernetes cluster built with kOps. I can start my experiments and verify that they're running, but close to the end of the run the node responsible for generating the load for my cluster reports a status of "Unknown" for the "MemoryPressure", "DiskPressure" and "Ready" conditions.
Coincidentally, the pods that run on the node require the most resources towards the end of the run as well.
So my question is: is it possible that the node is unable to respond to requests from the kube-controller-manager or API server because of the load it is generating?
If so, how do I resolve this? My experiments can potentially render the node unresponsive for up to half an hour or more.
Thanks for any responses in advance.

Turns out one of my pods was consuming all the CPU on the node, causing the kubelet to become unresponsive. I set a limit on the pod's CPU time and that fixed the issue. I also added a kube-reserved setting to ensure the kubelet gets the CPU time it needs.
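For reference, a minimal sketch of the two changes (the actual values I used differ; these are illustrative): a CPU limit on the load-generating pod, and kubeReserved in the kOps cluster spec so the kubelet keeps some CPU for itself.

```yaml
# Pod spec: cap the load generator's CPU (illustrative values and names).
apiVersion: v1
kind: Pod
metadata:
  name: load-generator          # hypothetical name
spec:
  containers:
  - name: load
    image: load-image:latest    # placeholder image
    resources:
      requests:
        cpu: "3"
      limits:
        cpu: "3"                # leave roughly one core free on a 4-vCPU node
---
# kOps cluster spec fragment (goes under spec.kubelet in the cluster manifest):
# reserve CPU and memory for the kubelet and system daemons.
kubelet:
  kubeReserved:
    cpu: "200m"
    memory: "512Mi"
```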

If the load is growing because of a growing number of Pods, you can try Node autoscaling; the Cluster Autoscaler documentation describes how to set it up.
If only a few Pods consume all of a Node's resources, then the only way is to use Nodes with more CPU and memory.

Related

Advantage of multiple pods on the same node

I'm new to Kubernetes; I'm testing with Minikube locally. I need some advice about Kubernetes's horizontal scaling.
In the following scenario:
Cluster composed of only 1 node
There is only 1 pod on this node
Only one application running on this pod
Is there a benefit to deploying a new pod on this node only to scale my application?
If I understand correctly, pods share the system's resources. So if I deploy 2 pods instead of 1 on the same node, there will be no performance increase.
There will be no availability increase either, because if the node fails, the two pods will also go down.
Am I right about my two previous statements?
Thanks
Yes, you are right. Pods on the same node share the same CPU and memory resources and are therefore all expected to go down if the node fails.
But you also need to consider it at the pod level. There can be situations where the pod itself fails while the node is working fine. In such cases, multiple pods can help you serve better and keep the application highly available.
From a performance perspective, a larger number of pods can also serve requests in parallel, reducing latency for your application.
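As an illustration, running multiple copies of the application on the same node is just a matter of raising replicas on a Deployment; the name and image below are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app               # placeholder name
spec:
  replicas: 2                # two pods; if one crashes, the other keeps serving
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest # placeholder image
```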

Can we have --pod-eviction-timeout=300m?

I have a k8s cluster; in our cluster we do not want pods to get evicted, because pod eviction causes a lot of side effects for the applications running on them.
To prevent pod eviction from happening, we have configured all the pods with the Guaranteed QoS class. I know that even with this, eviction can happen if there is resource starvation in the system. We have monitors to alert us when there is resource starvation within a pod or node, so we know well before a pod gets evicted and can take measures beforehand.
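(For context, a pod gets the Guaranteed QoS class only when every container's CPU and memory requests equal its limits; an illustrative spec, not one of our actual ones:)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example   # illustrative name
spec:
  containers:
  - name: app
    image: my-app:latest     # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:                # equal to requests => Guaranteed QoS
        cpu: "500m"
        memory: "256Mi"
```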
The other reason for pod eviction is when a node is in the not-ready state: the kube-controller-manager checks the pod-eviction-timeout and evicts the pods after this timeout. We have a monitor to alert us when a node goes into the not-ready state. After this alert we want to take some measures to clean up on the application side so the application ends gracefully. This clean-up needs more than a few hours, but pod-eviction-timeout is 5 minutes by default.
Is it fine to increase the pod eviction timeout to 300m? What are the impacts of increasing this timeout to such a value?
P.S.: I know that during this wait time, if the pod uses more resources, the kubelet can itself evict the pod. I want to know what other impacts waiting for such a long time would have.
As @coderanger said, your limits are incorrect and this should be fixed instead of lowering the self-healing capabilities of Kubernetes.
If your pod dies, whatever the issue was, by default it will be rescheduled based on your configuration.
If you are having a problem with this, then I would recommend revisiting your architecture and adapting the app to use Kubernetes the way it's meant to be used:
if pods that are unresponsive are still being sent requests, put a load balancer in front of them or queue the requests,
if you have a problem with IPs changing after pod restarts, use DNS and a Service instead of connecting directly to a pod (see the sketch below),
if your pod is being evicted, check why and set requests and limits.
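A minimal Service sketch for the DNS point above; the name, selector and ports are placeholders. Pods behind the selector get a stable DNS name (e.g. my-app.default.svc.cluster.local) instead of clients tracking pod IPs:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app             # clients use this stable name, not pod IPs
spec:
  selector:
    app: my-app            # must match the pods' labels
  ports:
  - port: 80               # port exposed by the Service
    targetPort: 8080       # port the container listens on (placeholder)
```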
As for the node, there is a really nice blog post, "Improving Kubernetes reliability: quicker detection of a Node down". It is the opposite of what you are thinking of doing, but it also explains why the default 340s (node-monitor-grace-period plus pod-eviction-timeout) is already too much:
Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout; by default it's 5m, which in my opinion is too high, because even though the node is already marked as unhealthy the kube controller manager won't remove the pods, so they will still be reachable through their Service and requests will fail.
If you still want to raise the default values, you can look into changing these:
kubelet: node-status-update-frequency=10s
controller-manager: node-monitor-period=5s
controller-manager: node-monitor-grace-period=40s
controller-manager: pod-eviction-timeout=5m
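How you apply these depends on how your control plane is managed. On a kubeadm-based cluster, for example, a sketch like the following would raise them (the config API versions, and whether pod-eviction-timeout is still supported, depend on your Kubernetes release; on managed offerings these flags are usually not exposed at all):

```yaml
# kube-controller-manager flags via kubeadm ClusterConfiguration (sketch).
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    node-monitor-period: "5s"
    node-monitor-grace-period: "40s"
    pod-eviction-timeout: "300m"      # the value you are asking about
---
# kubelet side via KubeletConfiguration.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"
```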
If you provide more details I'll try to help more.

GKE Limit RAM & CPU

I'm using GKE (Google-managed Kubernetes) and I have a requirement to leave around 10% of memory on each Node idle, so that during burst workloads the pods already deployed on that Node can make use of those idle resources (within their limit range).
Basically, what I want to achieve is to avoid a scenario where pods get scheduled onto a Node until 100% of its resources are consumed. Assuming all the pods/services are using their allocated resources (set via requests), if one of the pods hits a burst workload, or gets restarted and needs more memory during boot-up, it should be able to make use of those idle resources.
After going through the documentation I came across this, but since GKE is a managed service these properties aren't exposed anywhere. Are there any other ways to achieve the same?
GKE is a managed service, and therefore you will not be able to customize worker node kubelet parameters like --eviction-hard or --system-reserved.
As a workaround, you need to calculate your pods' memory requests and memory limits and use them to configure a maximum number of pods per node. This way you can control the number of pods that run on each node, and the spare CPU and memory available to your pods in case of a burst.
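As an illustration with made-up numbers: if a node has roughly 3Gi of allocatable memory and you plan for about ten pods per node, you could size requests at around 90% of that and let limits provide the burst headroom:

```yaml
# Illustrative only: ten pods like this request ~2.7Gi of a ~3Gi-allocatable node,
# leaving ~10% headroom that the limits allow them to burst into.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example    # placeholder name
spec:
  containers:
  - name: app
    image: my-app:latest     # placeholder image
    resources:
      requests:
        memory: "270Mi"      # what the scheduler packs against
        cpu: "250m"
      limits:
        memory: "400Mi"      # burst ceiling above the request
        cpu: "500m"
```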

Kubernetes - NodeUnderMemoryPressure Issue

I'm very new to Kubernetes. We are using a Kubernetes cluster on Google Cloud Platform.
I have created the Cluster, Services, Pods and Replication Controllers.
I have created a Horizontal Pod Autoscaler based on CPU parameters.
Cluster details:
Default running node count is set to 3
3GB allocatable memory per node
After running for 1 hour, the Services and Nodes are showing NodeUnderMemoryPressure issues.
How do I resolve this?
If you need any more details, please ask.
Thanks
I don't know how much traffic is hitting your cluster, but I would highly recommend running Prometheus in your cluster.
Prometheus is an open-source monitoring and alerting tool, and integrates very well with Kubernetes.
This tool should give you a much better view of memory consumption and CPU usage, among many other metrics, and will allow you to effectively troubleshoot these types of issues.
There are several ways to address this issue, depending on the type of your workloads.
The easiest is simply to scale your nodes, but that can be useless if there is a memory leak. Even if you are not affected by one now, you should always consider the possibility of a memory leak, so the best practice is to always set memory limits for Pods and Namespaces.
Scale the cluster
If you have many pods running and none of them is much bigger than the others, it is useful to scale your cluster horizontally; this way the number of pods per node will go down and the NodeUnderMemoryPressure warning should disappear.
If you are running few Pods, or some of them can put the cluster under pressure on their own, then the only option is to scale the nodes vertically: add a new node pool with Compute Engine instances that have more memory, and possibly delete the old one.
If your workload is otherwise correct and memory only suffers because at certain moments of the day you receive 100 times the usual traffic and create more pods to support it, you should consider making use of the Autoscaler.
Check for memory leaks
On the other hand, if it is not a "healthy" situation and you have pods consuming way more RAM than expected, then you should follow grizzthedj's advice and understand why your Pods are consuming so much, and verify whether some of your containers are affected by a memory leak. In that case, scaling up the amount of RAM is useless, since at some point you will run out of it anyway.
Therefore, start by identifying which Pods are consuming too much and then troubleshoot why they behave this way. If you do not want to use Prometheus, simply get a shell into the container and check with the classic Linux commands.
Limit the RAM consumed by Pods
To prevent this from happening in the future, I advise you to always limit, in your YAML files, the amount of RAM your Pods can use. This way you keep them under control and you can be sure there is no risk that they cause the Kubernetes node agent (the kubelet) to fail because it runs out of memory.
Also consider limiting the CPU and introducing minimum requests for both RAM and CPU, to help the scheduler place Pods properly and avoid hitting NodeUnderMemoryPressure under high workload.
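A sketch of both levels mentioned above, with illustrative values: per-container requests and limits in the Pod spec, plus a namespace LimitRange that applies defaults to containers that do not set their own:

```yaml
# Per-container requests and limits (illustrative values and names).
apiVersion: v1
kind: Pod
metadata:
  name: bounded-app            # placeholder name
spec:
  containers:
  - name: app
    image: my-app:latest       # placeholder image
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
---
# Namespace-level defaults for containers that omit resources.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:                   # default limits
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:            # default requests
      memory: "256Mi"
      cpu: "250m"
```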

Kubernetes: do evicted pods with no resource requests get rescheduled successfully?

I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction/rescheduling cycle keep happening over and over? If so, is there some metric that I can use to detect that this is happening?
A pod will be scheduled as long as there's an eligible node that can satisfy the requested resources, so if you do not specify requests the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, while a limit applies to a pod that is already running.
If you overcommit the actual resources on a node you will run into the typical issues: if you overcommit on memory it will start to swap, and with CPU there will just be a general slowdown. Either way, the node and the pods on it will become unresponsive. That is difficult to deal with, and requests and limits set up sane boundaries that help you avoid taking things quite this far, where instead you would simply see a pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So, answers to your questions:
At most 10 pods will be scheduled onto your node. When only a memory limit is set and no request, the request defaults to the limit, so each pod effectively requests 100MB.
If there is no free memory on the node, the evicted pods will stay Pending. Also, k8s can simply evict a pod that exceeds its limits when resources are needed for other pods and services.
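To make the scenario concrete, here is roughly the pod shape being discussed: a 100Mi memory limit with the request left unset, in which case Kubernetes defaults the request to the limit (names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bursty-example        # illustrative name
spec:
  containers:
  - name: app
    image: my-app:latest      # placeholder image
    resources:
      limits:
        memory: "100Mi"       # request is left unset, so it defaults to 100Mi
```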