When a pod can't be scheduled, what does the 3 in "Insufficient cpu (3)" refer to? - kubernetes

When I create a Pod that cannot be scheduled because there are no nodes with sufficient CPU to meet the Pod's CPU request, the events output from kubectl describe pod/... contain a message like No nodes are available that match all of the following predicates:: Insufficient cpu (3).
What does the (3) in Insufficient cpu (3) mean?
For example, if I try to create a pod that requests 24 CPU when all of my nodes only have 4 CPUs:
$ kubectl describe pod/large-cpu-request
Name: large-cpu-request
Namespace: default
Node: /
Labels: <none>
Annotations: <none>
Status: Pending
IP:
Controllers: <none>
Containers:
cpuhog:
...
Requests:
cpu: 24
...
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
23m 30s 84 default-scheduler Warning FailedScheduling No nodes are available that match all of the following predicates:: Insufficient cpu (3).
At other times I have seen event messages like No nodes are available that match all of the following predicates:: Insufficient cpu (2), PodToleratesNodeTaints (1) when a pod's resource requests were too high, so the 3 does not seem like a constant number - nor does it seem related to my 24 CPU request either.

The number in parentheses is a node count: your Pod doesn't fit on 3 nodes because of insufficient CPU. In your second example it doesn't fit on 2 nodes because of insufficient CPU and on 1 node because of a taint it doesn't tolerate (likely the master).
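To see which nodes are short on CPU, you can compare each node's allocatable CPU with what has already been requested on it, for example (the grep context range is just a convenience and may need adjusting for your kubectl version):
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu
kubectl describe nodes | grep -A 8 "Allocated resources"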

A pod can't be scheduled when it requests more CPU than is available in your cluster. For example, if you have 8 Kubernetes CPUs in total (see this page to calculate how many Kubernetes CPUs you have) and your existing pods have already consumed that much CPU, you can't schedule more pods until some of the existing pods are killed. When using the Horizontal Pod Autoscaler (HPA), a simple inequality should hold:
RESOURCE REQUEST CPU * HPA MAX PODS <= Total Kubernetes CPU
You can always tune these numbers. In my case, I adjusted the RESOURCE REQUEST CPU in my manifest file. It can be 200m, or 1000m (= 1 Kubernetes CPU).
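As an illustration of that inequality, here is a minimal sketch (the app name, image and values are made up; autoscaling/v2 assumes a reasonably recent cluster): with a request of 500m per pod and maxReplicas of 12, the cluster needs at least 6 Kubernetes CPUs available for the HPA to be able to scale out fully.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpuhog                    # hypothetical app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpuhog
  template:
    metadata:
      labels:
        app: cpuhog
    spec:
      containers:
      - name: cpuhog
        image: nginx              # placeholder image
        resources:
          requests:
            cpu: 500m             # RESOURCE REQUEST CPU
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpuhog
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpuhog
  minReplicas: 1
  maxReplicas: 12                 # HPA MAX PODS: 500m * 12 = 6 CPUs needed
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75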

Related

How to distribute CPU load among Kubernetes nodes

I have a kubernetes cluster in GCP with several nodes. Now I'm trying to install a monitoring agent on all of them, but the problem is that two of them have very high CPU load while the rest have low CPU load. How can I distribute this load among them?
Resource  Requests      Limits
--------  --------      ------
cpu       413m (21%)    0 (0%)
memory    266Mi (4%)    550Mi (9%)
--
Resource  Requests      Limits
--------  --------      ------
cpu       513m (26%)    0 (0%)
memory    266Mi (4%)    550Mi (9%)
--
Resource  Requests      Limits
--------  --------      ------
cpu       923m (98%)    145m (15%)
memory    501Mi (18%)   1135Mi (43%)
--
Resource  Requests      Limits
--------  --------      ------
cpu       913m (97%)    0 (0%)
memory    266Mi (10%)   550Mi (20%)
--
Resource  Requests      Limits
--------  --------      ------
cpu       903m (96%)    10m (1%)
memory    406Mi (15%)   780Mi (29%)
I have not defined any affinity rule, so I don't understand how k8s makes this distribution. I have also thought about increasing the machine types, but I don't know if that is the best option.
Any help please?
In general, the kube-scheduler automatically tries to distribute the load among the available worker nodes. Besides nodeAffinity, there are several more configurations that force the scheduling of a Pod onto specific nodes:
nodeSelector
taints and tolerations
In your case I would recommend checking which pods are assigned to which node. You can get this information via the following command:
kubectl get pods -A -o wide
In most Kubernetes configurations the master nodes are marked by default with the taint node-role.kubernetes.io/master:NoSchedule. This means that, apart from the pods of the Kubernetes control plane, no further workloads can be scheduled on your master nodes.
For further investigation, more information about your workloads and the nodes they are running on would be required.
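For reference, here is a minimal sketch of the nodeSelector mechanism mentioned above (pod name, image and node name are placeholders; kubernetes.io/hostname is a standard node label). Note that pinning a Pod to one node works against even distribution, so use it sparingly:
apiVersion: v1
kind: Pod
metadata:
  name: monitoring-agent                   # hypothetical name
spec:
  nodeSelector:
    kubernetes.io/hostname: worker-node-1  # placeholder node name
  containers:
  - name: agent
    image: nginx                           # placeholder image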

1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate

My kubernetes K3s cluster gives this error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 17m default-scheduler 0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate.
Warning FailedScheduling 17m default-scheduler 0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate.
In order to list the taints in the cluster I executed:
kubectl get nodes -o json | jq '.items[].spec'
which outputs:
{
  "podCIDR": "10.42.0.0/24",
  "podCIDRs": [
    "10.42.0.0/24"
  ],
  "providerID": "k3s://antonis-dell",
  "taints": [
    {
      "effect": "NoSchedule",
      "key": "node.kubernetes.io/disk-pressure",
      "timeAdded": "2021-12-17T10:54:31Z"
    }
  ]
}
{
  "podCIDR": "10.42.1.0/24",
  "podCIDRs": [
    "10.42.1.0/24"
  ],
  "providerID": "k3s://knodea"
}
When I use kubectl describe node antonis-dell I get:
Name: antonis-dell
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=antonis-dell
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=true
node-role.kubernetes.io/master=true
node.kubernetes.io/instance-type=k3s
Annotations: csi.volume.kubernetes.io/nodeid: {"ch.ctrox.csi.s3-driver":"antonis-dell"}
flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"f2:d5:6c:6a:85:0a"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.1.XX
k3s.io/hostname: antonis-dell
k3s.io/internal-ip: 192.168.1.XX
k3s.io/node-args: ["server"]
k3s.io/node-config-hash: YANNMDBIL7QEFSZANHGVW3PXY743NWWRVFKBKZ4FXLV5DM4C74WQ====
k3s.io/node-env:
{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/e61cd97f31a54dbcd9893f8325b7133cfdfd0229ff3bfae5a4f845780a93e84c","K3S_KUBECONFIG_MODE":"644"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 17 Dec 2021 12:11:39 +0200
Taints: node.kubernetes.io/disk-pressure:NoSchedule
where it seems that node has a disk-pressure taint.
This command doesn't work: kubectl taint node antonis-dell node.kubernetes.io/disk-pressure:NoSchedule- and it seems to me that even if it worked, it would not be a good solution, because the taint is assigned by the control plane.
Furthermore, at the end of the kubectl describe node antonis-dell output I observed this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FreeDiskSpaceFailed 57m kubelet failed to garbage collect required amount of images. Wanted to free 32967806976 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 52m kubelet failed to garbage collect required amount of images. Wanted to free 32500092928 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 47m kubelet failed to garbage collect required amount of images. Wanted to free 32190205952 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 42m kubelet failed to garbage collect required amount of images. Wanted to free 32196628480 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 37m kubelet failed to garbage collect required amount of images. Wanted to free 32190926848 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 2m21s (x7 over 32m) kubelet (combined from similar events): failed to garbage collect required amount of images. Wanted to free 30909374464 bytes, but freed 0 bytes
Maybe the disk-pressure is related to this? How can I delete the unwanted images?
Posting the answer as a community wiki, feel free to edit and expand.
The node.kubernetes.io/disk-pressure:NoSchedule taint indicates, as its name suggests, that the node is under disk pressure.
The kubelet detects disk pressure based on imagefs.available, imagefs.inodesFree, nodefs.available and nodefs.inodesFree (Linux only) observed on a Node. The observed values are then compared to the corresponding thresholds that can be set on the kubelet to determine whether the Node condition and taint should be added or removed.
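For reference, these thresholds map to the kubelet's evictionHard settings; a minimal KubeletConfiguration sketch using the documented defaults is shown below (on k3s these are usually passed via --kubelet-arg rather than a standalone config file):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"     # default memory eviction threshold
  nodefs.available: "10%"       # node filesystem (volumes, daemon logs, ...)
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"      # image / container-writable-layer filesystem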
More details on disk-pressure are available in Efficient Node Out-of-Resource Management in Kubernetes under How Does Kubelet Decide that Resources Are Low? section:
memory.available — A signal that describes the state of cluster memory. The default eviction threshold for memory is 100Mi. In other words, the kubelet starts evicting Pods when the available memory goes down to 100Mi.
nodefs.available — The nodefs is a filesystem used by the kubelet for volumes, daemon logs, etc. By default, the kubelet starts reclaiming node resources if nodefs.available < 10%.
nodefs.inodesFree — A signal that describes the state of the nodefs inode memory. By default, the kubelet starts evicting workloads if nodefs.inodesFree < 5%.
imagefs.available — The imagefs filesystem is an optional filesystem used by a container runtime to store container images and container-writable layers. By default, the kubelet starts evicting workloads if imagefs.available < 15%.
imagefs.inodesFree — The state of the imagefs inode memory. It has no default eviction threshold.
What to check
There are different things that can help, such as:
prune unused objects like images (with Docker CRI) - prune images. The docker image prune command allows you to clean up unused images. By default, docker image prune only cleans up dangling images. A dangling image is one that is not tagged and is not referenced by any container.
check whether files/logs on the node take up a lot of space.
check for any other reason why disk space is being consumed.
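A few hedged command examples that can help here (this cluster is k3s, which uses containerd rather than Docker, so crictl is the more likely tool; double-check what you prune on a production node):
df -h                          # see which filesystem is actually full
docker image prune -a          # Docker CRI: remove all unused images, not only dangling ones
k3s crictl rmi --prune         # containerd via k3s: remove images not referenced by any container
journalctl --vacuum-size=500M  # shrink systemd journal logs if they take a lot of space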

Azure Kubernetes pods showing high CPU usage when they get restarted or HPA kicks in?

We are using AKS version 1.19.11.
We have noticed that whenever a new rollout is placed for our deployments, a new pod is created as part of the HPA settings, or a pod gets restarted, we receive high CPU usage alerts.
For example, if a new pod is created as part of any of the above activities, will it take up more CPU than the allowed threshold? [The "maximum limit" of 1 core is specified in the deployment spec, and the apps are lightweight and don't need that much CPU anyway.] This in turn causes a sudden spike in Azure Monitor for a short time, and then it returns to normal.
Why are the pods taking more CPU during startup or creation?
If the pods are not using that much CPU, what could be the reason for this recurring issue?
hpa settings as below
Name: myapp
Namespace: myapp
Labels: app.kubernetes.io/managed-by=Helm
Annotations: meta.helm.sh/release-name: myapp
meta.helm.sh/release-namespace: myapp
CreationTimestamp: Mon, 26 Apr 2021 07:02:32 +0000
Reference: Deployment/myapp
Metrics: ( current / target )
resource cpu on pods (as a percentage of request): 5% (17m) / 75%
Min replicas: 5
Max replicas: 12
Deployment pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Adding the events captured when a new rollout was placed.
As per the events captured from the "myapp" namespace, a new deployment was rolled out for myapp, as below.
During the new pods' creation, more CPU spikes show up and we get an alert from Azure Monitor that it exceeds the threshold of 80% [of the "maximum limit" of 1 core specified in the deployment spec].
30m Normal SuccessfulDelete replicaset/myapp-1a2b3c4d5e Deleted pod: myapp-1a2b3c4d5e-9fmrk
30m Normal SuccessfulDelete replicaset/myapp-1a2b3c4d5e Deleted pod: myapp-1a2b3c4d5e-hfr8w
29m Normal SuccessfulDelete replicaset/myapp-1a2b3c4d5e Deleted pod: myapp-1a2b3c4d5e-l2pnd
31m Normal ScalingReplicaSet deployment/myapp Scaled up replica set myapp-5ddc98fb69 to 1
30m Normal ScalingReplicaSet deployment/myapp Scaled down replica set myapp-1a2b3c4d5e to 2
30m Normal ScalingReplicaSet deployment/myapp Scaled up replica set myapp-5ddc98fb69 to 2
30m Normal ScalingReplicaSet deployment/myapp Scaled down replica set myapp-1a2b3c4d5e to 1
30m Normal ScalingReplicaSet deployment/myapp Scaled up replica set myapp-5ddc98fb69 to 3
29m Normal ScalingReplicaSet deployment/myapp Scaled down replica set myapp-1a2b3c4d5e to 0
Alert settings
Period Over the last 15 mins
Value 100.274747
Operator GreaterThan
Threshold 80
I am not sure which metric you are looking at in AKS monitoring specifically, as you have not mentioned it, but it could be that when you deploy the pod or the HPA scales the replicas, AKS is showing the total resource usage of all replicas.
During the deployment, it's possible that at a certain stage all pods are in the running phase and consuming resources.
Are you checking the resources of one single pod, and is that going above the threshold?
As you have mentioned, the application is lightweight; however, it is possible that it initially takes more resources to start the process. In that case, you might have to check resource usage with profiling.
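One way to check whether a single pod (rather than the sum of old and new replicas during a rollout) actually crosses the limit is to watch per-container usage from the metrics server while a rollout happens, for example:
kubectl top pod -n myapp --containers
kubectl get hpa -n myapp -w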

Pod in status Pending but autoscaling is enabled, why doesn't it work?

Have you seen this behavior before?
I have a GKE cluster with 5 nodes and autoscaling enabled, as you can see below:
autoscaling:
  enabled: true
  maxNodeCount: 9
  minNodeCount: 1
config:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS
  machineType: n1-standard-1
  oauthScopes:
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/trace.append
  serviceAccount: default
initialNodeCount: 1
instanceGroupUrls:
- xxx
management:
  autoRepair: true
  autoUpgrade: true
name: default-pool
podIpv4CidrSize: 24
selfLink: xxxx
status: RUNNING
version: 1.13.7-gke.8
However, when I try to deploy one service I receive this error:
Warning FailedScheduling 106s default-scheduler 0/5 nodes are available: 3 Insufficient cpu, 4 node(s) didn't match node selector.
Warning FailedScheduling 30s (x3 over 106s) default-scheduler 0/5 nodes are available: 4 node(s) didn't match node selector, 5 Insufficient cpu.
Normal NotTriggerScaleUp 0s (x11 over 104s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector
And if I look at the stats of my resources, I don't see a problem with CPU, right?
kubectl top node
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-pre-cluster-1-default-pool-17d2178b-4g9f   106m         11%    1871Mi          70%
gke-pre-cluster-1-default-pool-17d2178b-g8l1   209m         22%    3042Mi          115%
gke-pre-cluster-1-default-pool-17d2178b-grvg   167m         17%    2661Mi          100%
gke-pre-cluster-1-default-pool-17d2178b-l9gt   122m         12%    2564Mi          97%
gke-pre-cluster-1-default-pool-17d2178b-ppfw   159m         16%    2830Mi          107%
So... if the problem is not CPU, what does this message mean?
And the other thing is: if there is a problem with resources, why doesn't the cluster scale up automatically?
Has anyone run into this before and can explain it to me? I don't understand.
Thank you so much
GKE's autoscaling functionality is based on Compute Engine instance groups. As such, it only pays attention to actual, dynamic resource usage (CPU, memory, etc), and not to the requests section in a Kubernetes pod template.
A GKE node that has 100% of its resources allocated (and therefore cannot schedule any more pods) is considered idle by the autoscaler if the software running in those pods isn't actually using the resources. If the software running in those pods is waiting for a "Pending" pod to start, then your workload is deadlocked.
Unfortunately, I know of no solution to this problem. If you control the pod templates that are being used to start the pods, you can try asking for less memory/CPU than your jobs actually need. But that might result in pods getting evicted.
GKE's autoscaler isn't particularly smart.
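To see the difference described above between allocated requests and actual usage, you can compare the two views for one of the nodes from your kubectl top output, for example (the grep range may need adjusting):
kubectl describe node gke-pre-cluster-1-default-pool-17d2178b-4g9f | grep -A 8 "Allocated resources"
kubectl top node gke-pre-cluster-1-default-pool-17d2178b-4g9f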
Could you check whether you have the entry "ZONE_RESOURCE_POOL_EXHAUSTED" in Stackdriver Logging?
It probably means that the zone you are using for your Kubernetes cluster is having resource problems.
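A sketch of how you could search for that entry from the command line (a plain text query acts as a global search in the logging filter language; adjust the freshness window as needed):
gcloud logging read "ZONE_RESOURCE_POOL_EXHAUSTED" --freshness=1d --limit=20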
Regards.

Cannot create a deployment that requests more than 2Gi memory

My deployment pod was evicted due to memory consumption:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Evicted 1h kubelet, gke-XXX-default-pool-XXX The node was low on resource: memory. Container my-container was using 1700040Ki, which exceeds its request of 0.
Normal Killing 1h kubelet, gke-XXX-default-pool-XXX Killing container with id docker://my-container:Need to kill Pod
I tried to grant it more memory by adding the following to my deployment yaml:
apiVersion: apps/v1
kind: Deployment
...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: my-container
        image: my-container:latest
        ...
        resources:
          requests:
            memory: "3Gi"
However, it failed to deploy:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4s (x5 over 13s) default-scheduler 0/3 nodes are available: 3 Insufficient memory.
Normal NotTriggerScaleUp 0s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added)
The deployment has only a single container.
I'm using GKE with autoscaling, the nodes in the default (and only) pool have 3.75 GB memory.
From trial and error, I found that the maximum memory I can request is "2Gi". Why can't I utilize the full 3.75 GB of a node with a single pod? Do I need nodes with bigger memory capacity?
Even though the node has 3.75 GB of total memory, it is very likely that the allocatable capacity is not the full 3.75 GB.
Kubernetes reserves some capacity for system services, to avoid containers consuming too many resources on the node and affecting the operation of those system services.
From the docs:
Kubernetes nodes can be scheduled to Capacity. Pods can consume all the available capacity on a node by default. This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources and lead to resource starvation issues on the node.
Because you are using GKE, which doesn't use the defaults, running the following command will show how much allocatable resource you have on the node:
kubectl describe node [NODE_NAME] | grep Allocatable -B 4 -A 3
From the GKE docs:
Allocatable resources are calculated in the following way:
Allocatable = Capacity - Reserved - Eviction Threshold
For memory resources, GKE reserves the following:
25% of the first 4GB of memory
20% of the next 4GB of memory (up to 8GB)
10% of the next 8GB of memory (up to 16GB)
6% of the next 112GB of memory (up to 128GB)
2% of any memory above 128GB
GKE reserves an additional 100 MiB memory on each node for kubelet eviction.
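As a rough worked example for the 3.75 GB nodes in this question (numbers are approximate):
Allocatable ~ 3.75 GB - (25% of 3.75 GB) - 100 MiB
            ~ 3.75 GB - 0.94 GB - 0.10 GB
            ~ 2.7 GB
System pods (kube-dns and any add-ons) also place requests against that allocatable amount, which is why in practice a request of about 2Gi is the largest that still fits on these nodes.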
As the error message suggests, scaling the cluster will not solve the problem, because each node's capacity is limited to a certain amount of memory and the pod needs more than that.
Each node will reserve some memory for Kubernetes system workloads (such as kube-dns, and also for any add-ons you select). That means you will not be able to access all the node's 3.75 Gi memory.
So to request that a pod has a 3Gi memory reserved for it, you will indeed need nodes with bigger memory capacity.
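If you go that route on GKE, the usual approach is to add a node pool with a larger machine type; a sketch (cluster name, pool name and machine type are placeholders, and you may also need --zone or --region):
gcloud container node-pools create bigger-pool \
  --cluster=my-cluster \
  --machine-type=n1-standard-4 \
  --num-nodes=1 \
  --enable-autoscaling --min-nodes=1 --max-nodes=3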