I am fairly new to Kubernetes and GKE (Google Container Engine) as a whole, so I was playing with the horizontal pod autoscaling and cluster autoscaling features. I hit my load balancer hard enough that the HPA scaled up enough pods to require more instances, and the cluster autoscaler added them. Eventually some pods ended up in Pending state, but the cluster had also reached its maximum number of instances, so those pods stayed Pending.
I then stopped the load test hoping everything would scale back down on its own, but it didn't. Looking at kubectl describe hpa, I would see errors like:
7m 18s 18 {horizontal-pod-autoscaler } Warning FailedGetMetrics failed to get CPU consumption and request: metrics obtained for 4/5 of pods
7m 18s 18 {horizontal-pod-autoscaler } Warning FailedComputeReplicas failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 4/5 of pods
There are actually only 4 pods running (and none in Pending state), and looking at the heapster logs (kubectl logs -f heapster-v1.1.0-<id> --namespace=kube-system heapster) I can see it is still looking for metrics for a pod that no longer exists (the mystery 5th pod it's complaining about).
The issue with this is that because it is missing the 5th pod, it can't finish getting the current CPU utilization for the 4 pods that are running, and thus horizontal pod autoscaling doesn't work.
Any ideas how to get out of a situation like this?
I've tried removing the HPA and creating it again, but it didn't help.
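(For reference, recreating the HPA amounted to something like the following; the deployment name and thresholds here are placeholders, not my actual values:)
# delete the existing HPA and recreate it against the same deployment
kubectl delete hpa my-app
kubectl autoscale deployment my-app --cpu-percent=50 --min=1 --max=10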
Related
I'm running a Kubernetes cluster of 20+ nodes, and one pod in a namespace got restarted. The pod was killed due to OOM with exit code 137 and restarted again as expected, but I would like to know which node the pod was running on earlier. Is there any place we could check the logs for that information, like tiller, kubelet, kube-proxy, etc.?
I would like to know which node the pod was running on earlier.
If a pod is killed with ExitCode: 137, e.g. because it used more memory than its limit, it is restarted on the same node, not rescheduled. To confirm this, check your metrics or container logs.
But Pods can also be killed due to over-committing a node, see e.g. How to troubleshoot Kubernetes OOM and CPU Throttle.
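A minimal sketch of how to check this, assuming a pod name of my-pod (a placeholder):
# shows the node the pod is running on (an in-place restart keeps it on the same node)
kubectl get pod my-pod -o wide
# Last State should show Reason: OOMKilled with Exit Code: 137 for the previous container
kubectl describe pod my-pod
# recent events for the pod (only available while the events are still retained)
kubectl get events --field-selector involvedObject.name=my-pod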
I have a GKE cluster with two nodepools. I turned on autoscaling on one of my nodepools but it does not seem to automatically scale down.
I have enabled HPA and that works fine. It scales the pods down to 1 when I don't see traffic.
The API is currently not getting any traffic so I would expect the nodes to scale down as well.
But it still runs the maximum 5 nodes despite some nodes using less than 50% of allocatable memory/CPU.
What did I miss here? I am planning to move these pods to bigger machines but to do that I need the node autoscaling to work to control the monthly cost.
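(For reference, node pool autoscaling on GKE is typically enabled with a command along these lines; the cluster/pool names and node bounds are placeholders:)
gcloud container clusters update my-cluster \
  --enable-autoscaling --node-pool my-pool \
  --min-nodes 1 --max-nodes 5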
There are many reasons that can cause the CA not to scale down successfully. To summarize, under normal conditions it should work like this:
Cluster Autoscaler periodically checks (every 10 seconds) the utilization of the nodes.
If the utilization factor is less than 0.5, the node is considered underutilized.
The node is then marked for removal and monitored for the next 10 minutes to make sure the utilization factor stays below 0.5.
If it is still underutilized after those 10 minutes, the node is removed by Cluster Autoscaler.
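A quick way to see what the autoscaler currently thinks of each node is its status ConfigMap and the related events (a rough sketch; availability and exact contents depend on your platform and version):
# the cluster autoscaler publishes its health and scale-down candidates here
kubectl describe configmap cluster-autoscaler-status -n kube-system
# events often state explicitly why a node was not removed
kubectl get events -n kube-system | grep -i cluster-autoscaler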
If that is not happening, then something else is preventing your nodes from being scaled down. In my experience, missing PDBs on kube-system pods are the most likely cause, but there are many possible reasons; here are the ones that typically cause downscaling issues:
1. PDBs are not applied to your kube-system pods. kube-system pods prevent Cluster Autoscaler from removing the nodes they run on. You can manually add Pod Disruption Budgets (PDBs) for the kube-system pods that can safely be rescheduled elsewhere; this can be done with the following command:
`kubectl create poddisruptionbudget PDB-NAME --namespace=kube-system --selector app=APP-NAME --max-unavailable 1`
2. Containers using local storage (volumes), even empty ones. Kubernetes prevents scale-down events on nodes with pods using local storage. Look for this kind of configuration, as it prevents Cluster Autoscaler from scaling down nodes (see the sketch after this list).
3. Pods annotated with cluster-autoscaler.kubernetes.io/safe-to-evict: "false". Look for pods with this annotation, as it prevents their node from being scaled down (setting it to "true" has the opposite effect and explicitly allows eviction).
4. Nodes annotated with cluster-autoscaler.kubernetes.io/scale-down-disabled: "true". Look for nodes with this annotation, as it prevents Cluster Autoscaler from removing them. These are the configurations I would suggest you check in order to get your cluster to scale down underutilized nodes.
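A rough sketch of how to check and adjust points 2–4 (my-pod and my-node are placeholders):
# 2./3. explicitly allow the autoscaler to evict a specific pod (e.g. one using local storage)
kubectl annotate pod my-pod cluster-autoscaler.kubernetes.io/safe-to-evict=true
# 4. check whether scale-down was explicitly disabled on a node, and remove the annotation if so
kubectl describe node my-node | grep scale-down-disabled
kubectl annotate node my-node cluster-autoscaler.kubernetes.io/scale-down-disabled-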
You can also look at this page, which explains the configurations that prevent downscaling; one of them may well be what is happening to you.
Cluster information:
Kubernetes version: 1.12.8-gke.10
Cloud being used: GKE
Installation method: gcloud
Host OS: (machine type) n1-standard-1
CNI and version: default
CRI and version: default
During node scaling, the HPA couldn't get the CPU metric.
At the same time, kubectl top pod and kubectl top node returned:
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
For more detail, here is the flow in which my problem occurs:
1. Suddenly many requests arrive at the GKE server (from a testing tool).
2. HPA detects that current CPU usage is above the target CPU usage (50%), so it tries to scale the pods up incrementally.
3. An Insufficient CPU warning occurs when creating pods, so GKE tries to scale the nodes up incrementally.
4. Soon the HPA fails to get the metric, and kubectl top node and kubectl top pod stop returning a response. At this point one or more OutOfcpu pods are found, and several pods are in ContainerCreating (from the Pending state).
5. After node scale-up is complete and some time has elapsed (about a few minutes), HPA starts to fetch the CPU metric successfully and tries to scale up/down based on the metric.
The same situation happens when nodes scale down.
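(For reference on the "incrementally" part: the HPA's documented formula is desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)], so with 4 pods averaging 100% CPU against the 50% target it would ask for ceil(4 * 100 / 50) = 8 replicas, but while the metric is missing it cannot compute any desired count at all.)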
This causes pod scaling to stop and leads to some failures in responding to clients' requests. Is this normal?
I think the HPA should be able to get the CPU metric (or other metrics) for running pods even during node scaling, to keep track of the optimal number of pods at any moment. Then, when node scaling is done, the HPA could create all the necessary pods at once (rather than incrementally).
Can I make my cluster work like this?
Maybe your nodes are running out of one resource, either memory or CPU. There are config maps that describe how add-ons are scaled depending on the cluster size. You need to edit the metrics-server-config config map in the kube-system namespace:
kubectl edit cm/metrics-server-config -n kube-system
You should add baseCPU, cpuPerNode, baseMemory and memoryPerNode to the NannyConfiguration (an extensive manual can be found here).
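As a rough sketch (the numbers are purely illustrative and need to be sized to your cluster), the edited data section of the config map ends up looking something like this:
NannyConfiguration: |-
  apiVersion: nannyconfig/v1alpha1
  kind: NannyConfiguration
  baseCPU: 100m        # flat CPU reserved for metrics-server
  cpuPerNode: 1m       # additional CPU per node in the cluster
  baseMemory: 150Mi    # flat memory reserved for metrics-server
  memoryPerNode: 4Mi   # additional memory per node in the cluster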
Heapster also suffers from the same OOM issue: there are too many pods to handle all the metrics within the assigned resources. Please modify heapster's config map accordingly:
kubectl edit cm/heapster-config -n kube-system
We are running a Kubernetes (1.9.4) cluster with 5 masters and 20 worker nodes. We are running one statefulset pod with replication 3, among other pods, in this cluster. Initially the statefulset pods were distributed across 3 nodes. However, pod-2 on node-2 got evicted due to disk pressure on node-2. When pod-2 was evicted it went to node-1, where pod-1 was already running and which was also already under disk pressure. As per our understanding, the kube-scheduler should not schedule a (non-critical) pod onto a node that is already under disk pressure. Is not scheduling pods to a node under disk pressure the default behavior, or is it allowed? At the same time we observed that node-0 had no disk issue, so we were hoping the pod evicted from node-2 would ideally land on node-0 instead of node-1, which is under disk pressure.
Another observation: when pod-2 on node-2 was evicted, we saw that the same pod was successfully scheduled, spawned and moved to the Running state on node-1. However, we still see a "Failed to admit pod" error on node-2, repeated many times, for the same pod-2 that was evicted. Is this an issue with the kube-scheduler?
Yes, the scheduler should not assign a new pod to a node with a DiskPressure condition.
However, I think you can approach this problem from a few different angles.
Look into the configuration of your scheduler:
./kube-scheduler --write-config-to kube-config.yaml
and check whether it needs any adjustments. You can find info about additional options for kube-scheduler here:
You can also configure additional scheduler(s) depending on your needs. A tutorial for that can be found here.
Check the logs:
kubectl logs: kube-scheduler event logs
journalctl -u kubelet: kubelet logs
/var/log/kube-scheduler.log (on the master)
Look more closely at Kubelet's Eviction Thresholds (soft and hard) and how much node memory capacity is set.
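For reference, these thresholds live in the kubelet's KubeletConfiguration; a minimal sketch, with example values rather than recommendations:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                 # evict immediately once a signal crosses these
  memory.available: "200Mi"
  nodefs.available: "10%"
evictionSoft:                 # evict only after the grace period below
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"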
Bear in mind that:
the kubelet may not observe resource pressure fast enough,
or
the kubelet may evict more pods than needed due to a stats collection timing gap.
Please check out my suggestions and let me know if they helped.
I have a Kubernetes cluster and when I try to scale a Deployment up to 8 pods, it gives an error message:
"0/3 nodes are available: 3 insufficient cpu."
After some time it shows 3/8 pods available and then 5/8 pods available with the same error, but it never reaches 8 pods.
Recently we introduced CPU limits on Pods.
What is the cause and solution for this error?
The scheduler is not able to schedule the pods on any of the 3 nodes because the required resources are not available on them.
This may be because the CPU request value of the pod is higher than the CPU available on the nodes, or because your nodes genuinely have no CPU capacity left to schedule new pods.
Check the available CPU capacity of the nodes and free some up by removing pods that are not required. Also reduce the CPU request value of the pod if one is specified.
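As a rough sketch, assuming a deployment called my-app (a placeholder):
# see how much CPU each node can allocate and how much is already requested
kubectl describe nodes | grep -A 8 'Allocated resources'
# lower the CPU request (and/or limit) on the deployment's containers
kubectl set resources deployment my-app --requests=cpu=250m --limits=cpu=500m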