How to diagnose a stuck Kubernetes rollout / deployment? - kubernetes

It seems a deployment has gotten stuck. How can I diagnose this further?
kubectl rollout status deployment/wordpress
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
It's stuck on that for ages already. It is not terminating the two older pods:
kubectl get pods
NAME READY STATUS RESTARTS AGE
nfs-server-r6g6w 1/1 Running 0 2h
redis-679c597dd-67rgw 1/1 Running 0 2h
wordpress-64c944d9bd-dvnwh 4/4 Running 3 3h
wordpress-64c944d9bd-vmrdd 4/4 Running 3 3h
wordpress-f59c459fd-qkfrt 0/4 Pending 0 22m
wordpress-f59c459fd-w8c65 0/4 Pending 0 22m
And the events:
kubectl get events --all-namespaces
NAMESPACE LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
default 25m 2h 333 wordpress-686ccd47b4-4pbfk.153408cdba627f50 Pod Warning FailedScheduling default-scheduler No nodes are available that match all of the predicates: Insufficient cpu (1), Insufficient memory (2), MatchInterPodAffinity (1).
default 25m 2h 337 wordpress-686ccd47b4-vv9dk.153408cc8661c49d Pod Warning FailedScheduling default-scheduler No nodes are available that match all of the predicates: Insufficient cpu (1), Insufficient memory (2), MatchInterPodAffinity (1).
default 22m 22m 1 wordpress-686ccd47b4.15340e5036ef7d1c ReplicaSet Normal SuccessfulDelete replicaset-controller Deleted pod: wordpress-686ccd47b4-4pbfk
default 22m 22m 1 wordpress-686ccd47b4.15340e5036f2fec1 ReplicaSet Normal SuccessfulDelete replicaset-controller Deleted pod: wordpress-686ccd47b4-vv9dk
default 2m 22m 72 wordpress-f59c459fd-qkfrt.15340e503bd4988c Pod Warning FailedScheduling default-scheduler No nodes are available that match all of the predicates: Insufficient cpu (1), Insufficient memory (2), MatchInterPodAffinity (1).
default 2m 22m 72 wordpress-f59c459fd-w8c65.15340e50399a8a5a Pod Warning FailedScheduling default-scheduler No nodes are available that match all of the predicates: Insufficient cpu (1), Insufficient memory (2), MatchInterPodAffinity (1).
default 22m 22m 1 wordpress-f59c459fd.15340e5039d6c622 ReplicaSet Normal SuccessfulCreate replicaset-controller Created pod: wordpress-f59c459fd-w8c65
default 22m 22m 1 wordpress-f59c459fd.15340e503bf844db ReplicaSet Normal SuccessfulCreate replicaset-controller Created pod: wordpress-f59c459fd-qkfrt
default 3m 23h 177 wordpress.1533c22c7bf657bd Ingress Normal Service loadbalancer-controller no user specified default backend, using system default
default 22m 22m 1 wordpress.15340e50356eaa6a Deployment Normal ScalingReplicaSet deployment-controller Scaled down replica set wordpress-686ccd47b4 to 0
default 22m 22m 1 wordpress.15340e5037c04da6 Deployment Normal ScalingReplicaSet deployment-controller Scaled up replica set wordpress-f59c459fd to 2

You can use describe kubectl describe po wordpress-f59c459fd-qkfrt but from the message the pods cannot be scheduled in any of the nodes.
Provide more capacity, like try to add a node, to allow the pods to be scheduled.

The new deployment had a replica count of 3 while the previous had 2. I assumed I could set a high value for replica count and it would try to deploy as many replicas as it could before it reaches it's resource capacity. However this does not seem to be the case...

Related

Pod creation in EKS cluster fails with FailedScheduling error

I have created a new EKS cluster with 1 worker node in a public subnet. I am able to query node, connect to the cluster, and run pod creation command, however, when I am trying to create a pod it fails with the below error got by describing the pod. Please guide.
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 81s default-scheduler 0/1 nodes are available: 1 Too many pods. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Warning FailedScheduling 16m default-scheduler 0/2 nodes are available: 2 Too many pods, 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
Warning FailedScheduling 16m default-scheduler 0/3 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 3 Too many pods. preemption: 0/3 nodes are available: 1 No preemption victims found for incoming pod, 2 Preemption is not helpful for scheduling.
Warning FailedScheduling 14m (x3 over 22m) default-scheduler 0/2 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 1 node(s) were unschedulable, 2 Too many pods. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.
Warning FailedScheduling 12m default-scheduler 0/2 nodes are available: 1 Too many pods, 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
Warning FailedScheduling 7m14s default-scheduler no nodes available to schedule pods
Warning FailedScheduling 105s (x5 over 35m) default-scheduler 0/1 nodes are available: 1 Too many pods. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
I am able to get status of the node and it looks ready:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-12-61.ec2.internal Ready <none> 15m v1.24.7-eks-fb459a0
While troubleshooting I tried below options:
recreate the complete demo cluster - still the same error
try recreating pods with different images - still the same error
trying to increase to instance type to t3.micro - still the same error
reviewed security groups and other parameters in a cluster - Couldnt come to RCA
it's due to the node's POD limit or IP limit on Nodes.
So if we see official Amazon doc, t3.micro maximum 2 interface you can use and 2 private IP. Roughly you might be getting around 4 IPs to use and 1st IP get used by Node etc, There will be also default system PODs running as Daemon set and so.
Add new instance or upgrade to larger instance who can handle more pods.

How to delete a Kubernetes pod in Pending state without deleting a deployment?

I ran
kubectl rollout restart deployment <my-deployment-name>
in order to restart my single pod, launched under the deployment.
NAME READY STATUS RESTARTS AGE
<pod-name>-vf24n 1/1 Running 1 7d
<pod-name>-8fgqt 0/1 Pending 0 14m
Sadly, I had no available resources on my affinity node to perform this operation, so the new created pod stuck in Pending state.
Normal NotTriggerScaleUp 2m57s (x31 over 7m58s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added):
Warning FailedScheduling 47s (x9 over 8m) default-scheduler 0/3 nodes are available: 1 Insufficient memory, 2 node(s) didn't match Pod's node affinity/selector.
I would like to delete the Pending pod or somehow revert the rollout restart operation, without deleting the deployment. Is it possible?
kubectl rollout undo deployment/my-deployment-name will undo a rollout

Crashloopbackoff while creating nginx controller

I have installed Kubernetes on AWS-EC2 machines, the cluster has a master and 2 nodes connected to it
[root#k8-m deployments]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8-m Ready control-plane,master 107m v1.20.1
k8-n1 Ready <none> 101m v1.20.1
k8-n2 Ready <none> 91m v1.20.1
I have a requirement to install ingress controller for exposing the traffic outside and the chosen controller is nginx. Am creating the resources such as ns, service account, secret, rbac, config map, ap-rbac, daemon-set config taken from https://github.com/nginxinc/kubernetes-ingress.git.
After creating the properties for ingress controller, am seeing the pods going to crashloopbackoff state
[root#k8-m deployments]# kubectl get all -n nginx-ingress
NAME READY STATUS RESTARTS AGE
pod/nginx-ingress-555f75f85f-5vxf6 0/1 CrashLoopBackOff 7 11m
pod/nginx-ingress-7wmhw 0/1 CrashLoopBackOff 7 11m
pod/nginx-ingress-mss7v 0/1 CrashLoopBackOff 7 11m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/nginx-ingress 2 2 0 2 0 <none> 11m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/nginx-ingress 0/1 1 0 11m
NAME DESIRED CURRENT READY AGE
replicaset.apps/nginx-ingress-555f75f85f 1 1 0 11m
By describing the pod i get the below(pasting only the event details),
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 14m default-scheduler Successfully assigned nginx-ingress/nginx-ingress-555f75f85f-5vxf6 to k8-n2
Normal Pulled 14m kubelet Successfully pulled image "nginx/nginx-ingress:edge" in 2.456877779s
Normal Pulled 14m kubelet Successfully pulled image "nginx/nginx-ingress:edge" in 2.501405255s
Normal Pulled 13m kubelet Successfully pulled image "nginx/nginx-ingress:edge" in 2.63456627s
Normal Created 13m (x4 over 14m) kubelet Created container nginx-ingress
Normal Started 13m (x4 over 14m) kubelet Started container nginx-ingress
Normal Pulled 13m kubelet Successfully pulled image "nginx/nginx-ingress:edge" in 2.659821346s
Normal Pulling 12m (x5 over 14m) kubelet Pulling image "nginx/nginx-ingress:edge"
Warning BackOff 3m53s (x47 over 14m) kubelet Back-off restarting failed container
Am not able to see the logs though
Below are the executions while creating the nginx controller,
kubectl create -f common/ns-and-sa.yaml
kubectl create -f rbac/rbac.yaml
kubectl create -f rbac/ap-rbac.yaml
kubectl create -f common/default-server-secret.yaml
kubectl create -f common/nginx-config.yaml
kubectl create -f deployment/nginx-ingress.yaml
kubectl create -f daemon-set/nginx-ingress.yaml
Could anyone here advice me on this

Hashicorp vault on k8s: getting error 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity

I'm deploying ha vault on k8s (EKS) and getting this error on one of the vault pods, which I think is causing other pods to fail also :
This is the output of the kubectl get events:
search for : nodes are available: 1 Insufficient memory
26m Normal Created pod/vault-1 Created container vault
26m Normal Started pod/vault-1 Started container vault
26m Normal Pulled pod/vault-1 Container image "hashicorp/vault-enterprise:1.5.0_ent" already present on machine
7m40s Warning BackOff pod/vault-1 Back-off restarting failed container
2m38s Normal Scheduled pod/vault-1 Successfully assigned vault-foo/vault-1 to ip-10-101-0-103.ec2.internal
2m35s Normal SuccessfulAttachVolume pod/vault-1 AttachVolume.Attach succeeded for volume "pvc-acfc7e26-3616-4075-ab79-0c3f7b0f6470"
2m35s Normal SuccessfulAttachVolume pod/vault-1 AttachVolume.Attach succeeded for volume "pvc-19d03d48-1de2-41f8-aadf-02d0a9f4bfbd"
48s Normal Pulled pod/vault-1 Container image "hashicorp/vault-enterprise:1.5.0_ent" already present on machine
48s Normal Created pod/vault-1 Created container vault
99s Normal Started pod/vault-1 Started container vault
60s Warning BackOff pod/vault-1 Back-off restarting failed container
27m Normal TaintManagerEviction pod/vault-2 Cancelling deletion of Pod vault-foo/vault-2
28m Warning FailedScheduling pod/vault-2 0/4 nodes are available: 1 Insufficient memory, 4 Insufficient cpu.
28m Warning FailedScheduling pod/vault-2 0/5 nodes are available: 1 Insufficient memory, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 Insufficient cpu.
27m Normal Scheduled pod/vault-2 Successfully assigned vault-foo/vault-2 to ip-10-101-0-103.ec2.internal
27m Normal SuccessfulAttachVolume pod/vault-2 AttachVolume.Attach succeeded for volume "pvc-fb91141d-ebd9-4767-b122-da8c98349cba"
27m Normal SuccessfulAttachVolume pod/vault-2 AttachVolume.Attach succeeded for volume "pvc-95effe76-6e01-49ad-9bec-14e091e1a334"
27m Normal Pulling pod/vault-2 Pulling image "hashicorp/vault-enterprise:1.5.0_ent"
27m Normal Pulled pod/vault-2 Successfully pulled image "hashicorp/vault-enterprise:1.5.0_ent"
26m Normal Created pod/vault-2 Created container vault
26m Normal Started pod/vault-2 Started container vault
26m Normal Pulled pod/vault-2 Container image "hashicorp/vault-enterprise:1.5.0_ent" already present on machine
7m26s Warning BackOff pod/vault-2 Back-off restarting failed container
2m36s Warning FailedScheduling pod/vault-2 0/7 nodes are available: 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 4 Insufficient cpu.
114s Warning FailedScheduling pod/vault-2 0/8 nodes are available: 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 4 Insufficient cpu.
104s Warning FailedScheduling pod/vault-2 0/9 nodes are available: 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 Insufficient cpu.
93s Normal Scheduled pod/vault-2 Successfully assigned vault-foo/vault-2 to ip-10-101-0-82.ec2.internal
88s Normal SuccessfulAttachVolume pod/vault-2 AttachVolume.Attach succeeded for volume "pvc-fb91141d-ebd9-4767-b122-da8c98349cba"
88s Normal SuccessfulAttachVolume pod/vault-2 AttachVolume.Attach succeeded for volume "pvc-95effe76-6e01-49ad-9bec-14e091e1a334"
83s Normal Pulling pod/vault-2 Pulling image "hashicorp/vault-enterprise:1.5.0_ent"
81s Normal Pulled pod/vault-2 Successfully pulled image "hashicorp/vault-enterprise:1.5.0_ent"
38s Normal Created pod/vault-2 Created container vault
37s Normal Started pod/vault-2 Started container vault
38s Normal Pulled pod/vault-2 Container image "hashicorp/vault-enterprise:1.5.0_ent" already present on machine
4s Warning BackOff pod/vault-2 Back-off restarting failed container
2m38s Normal Scheduled pod/vault-agent-injector-d54bdc675-qwsmz Successfully assigned vault-foo/vault-agent-injector-d54bdc675-qwsmz to ip-10-101-2-91.ec2.internal
2m37s Normal Pulling pod/vault-agent-injector-d54bdc675-qwsmz Pulling image "hashicorp/vault-k8s:latest"
2m36s Normal Pulled pod/vault-agent-injector-d54bdc675-qwsmz Successfully pulled image "hashicorp/vault-k8s:latest"
2m36s Normal Created pod/vault-agent-injector-d54bdc675-qwsmz Created container sidecar-injector
2m35s Normal Started pod/vault-agent-injector-d54bdc675-qwsmz Started container sidecar-injector
28m Normal Scheduled pod/vault-agent-injector-d54bdc675-wz9ws Successfully assigned vault-foo/vault-agent-injector-d54bdc675-wz9ws to ip-10-101-0-87.ec2.internal
28m Normal Pulled pod/vault-agent-injector-d54bdc675-wz9ws Container image "hashicorp/vault-k8s:latest" already present on machine
28m Normal Created pod/vault-agent-injector-d54bdc675-wz9ws Created container sidecar-injector
28m Normal Started pod/vault-agent-injector-d54bdc675-wz9ws Started container sidecar-injector
3m22s Normal Killing pod/vault-agent-injector-d54bdc675-wz9ws Stopping container sidecar-injector
3m22s Warning Unhealthy pod/vault-agent-injector-d54bdc675-wz9ws Readiness probe failed: Get https://10.101.0.73:8080/health/ready: dial tcp 10.101.0.73:8080: connect: connection refused
3m18s Warning Unhealthy pod/vault-agent-injector-d54bdc675-wz9ws Liveness probe failed: Get https://10.101.0.73:8080/health/ready: dial tcp 10.101.0.73:8080: connect: no route to host
28m Normal SuccessfulCreate replicaset/vault-agent-injector-d54bdc675 Created pod: vault-agent-injector-d54bdc675-wz9ws
2m38s Normal SuccessfulCreate replicaset/vault-agent-injector-d54bdc675 Created pod: vault-agent-injector-d54bdc675-qwsmz
28m Normal ScalingReplicaSet deployment/vault-agent-injector Scaled up replica set vault-agent-injector-d54bdc675 to 1
2m38s Normal ScalingReplicaSet deployment/vault-agent-injector Scaled up replica set vault-agent-injector-d54bdc675 to 1
28m Normal EnsuringLoadBalancer service/vault-ui Ensuring load balancer
28m Normal EnsuredLoadBalancer service/vault-ui Ensured load balancer
26m Normal UpdatedLoadBalancer service/vault-ui Updated load balancer with new hosts
3m24s Normal DeletingLoadBalancer service/vault-ui Deleting load balancer
3m23s Warning PortNotAllocated service/vault-ui Port 32476 is not allocated; repairing
3m23s Warning ClusterIPNotAllocated service/vault-ui Cluster IP 172.20.216.143 is not allocated; repairing
3m22s Warning FailedToUpdateEndpointSlices service/vault-ui Error updating Endpoint Slices for Service vault-foo/vault-ui: failed to update vault-ui-crtg4 EndpointSlice for Service vault-foo/vault-ui: Operation cannot be fulfilled on endpointslices.discovery.k8s.io "vault-ui-crtg4": the object has been modified; please apply your changes to the latest version and try again
3m16s Warning FailedToUpdateEndpoint endpoints/vault-ui Failed to update endpoint vault-foo/vault-ui: Operation cannot be fulfilled on endpoints "vault-ui": the object has been modified; please apply your changes to the latest version and try again
2m52s Normal DeletedLoadBalancer service/vault-ui Deleted load balancer
2m39s Normal EnsuringLoadBalancer service/vault-ui Ensuring load balancer
2m36s Normal EnsuredLoadBalancer service/vault-ui Ensured load balancer
96s Normal UpdatedLoadBalancer service/vault-ui Updated load balancer with new hosts
28m Normal NoPods poddisruptionbudget/vault No matching pods found
28m Normal SuccessfulCreate statefulset/vault create Pod vault-0 in StatefulSet vault successful
28m Normal SuccessfulCreate statefulset/vault create Pod vault-1 in StatefulSet vault successful
28m Normal SuccessfulCreate statefulset/vault create Pod vault-2 in StatefulSet vault successful
2m40s Normal NoPods poddisruptionbudget/vault No matching pods found
2m38s Normal SuccessfulCreate statefulset/vault create Pod vault-0 in StatefulSet vault successful
2m38s Normal SuccessfulCreate statefulset/vault create Pod vault-1 in StatefulSet vault successful
2m38s Normal SuccessfulCreate statefulset/vault create Pod vault-2 in StatefulSet vault successful
And this is my helm :
# Vault Helm Chart Value Overrides
global:
enabled: true
tlsDisable: false
injector:
enabled: true
# Use the Vault K8s Image https://github.com/hashicorp/vault-k8s/
image:
repository: "hashicorp/vault-k8s"
tag: "latest"
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 256Mi
cpu: 250m
server:
# Use the Enterprise Image
image:
repository: "hashicorp/vault-enterprise"
tag: "1.5.0_ent"
# These Resource Limits are in line with node requirements in the
# Vault Reference Architecture for a Small Cluster
resources:
requests:
memory: 8Gi
cpu: 2000m
limits:
memory: 16Gi
cpu: 2000m
# For HA configuration and because we need to manually init the vault,
# we need to define custom readiness/liveness Probe settings
readinessProbe:
enabled: true
path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
livenessProbe:
enabled: true
path: "/v1/sys/health?standbyok=true"
initialDelaySeconds: 60
# extraEnvironmentVars is a list of extra environment variables to set with the stateful set. These could be
# used to include variables required for auto-unseal.
extraEnvironmentVars:
VAULT_CACERT: /vault/userconfig/vault-server-tls/vault.ca
# extraVolumes is a list of extra volumes to mount. These will be exposed
# to Vault in the path .
#extraVolumes:
# - type: secret
# name: tls-server
# - type: secret
# name: tls-ca
# - type: secret
# name: kms-creds
extraVolumes:
- type: secret
name: vault-server-tls
# This configures the Vault Statefulset to create a PVC for audit logs.
# See https://www.vaultproject.io/docs/audit/index.html to know more
auditStorage:
enabled: true
standalone:
enabled: false
# Run Vault in "HA" mode.
ha:
enabled: true
replicas: 3
raft:
enabled: true
setNodeId: true
config: |
ui = true
listener "tcp" {
address = "[::]:8200"
cluster_address = "[::]:8201"
tls_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
tls_key_file = "/vault/userconfig/vault-server-tls/vault.key"
tls_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
}
storage "raft" {
path = "/vault/data"
retry_join {
leader_api_addr = "http://vault-0.vault-internal:8200"
leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
}
retry_join {
leader_api_addr = "http://vault-1.vault-internal:8200"
leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
}
retry_join {
leader_api_addr = "http://vault-2.vault-internal:8200"
leader_ca_cert_file = "/vault/userconfig/vault-server-tls/vault.ca"
leader_client_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
leader_client_key_file = "/vault/userconfig/vault-server-tls/vault.key"
}
}
service_registration "kubernetes" {}
# Vault UI
ui:
enabled: true
serviceType: "LoadBalancer"
serviceNodePort: null
externalPort: 8200
# For Added Security, edit the below
#loadBalancerSourceRanges:
# - < Your IP RANGE Ex. 10.0.0.0/16 >
# - < YOUR SINGLE IP Ex. 1.78.23.3/32 >
what did I not configure right?
There are several issue here and they are all represented by the error messages like:
0/9 nodes are available: 1 Insufficient memory, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had volume node affinity conflict, 1 node(s) were unschedulable, 2 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 4 Insufficient cpu.
You got 9 Nodes but none of them are available for scheduling due to a different set of conditions. Note that each Node can be affected by multiple issues and so the numbers can add up to more than what you have on total nodes.
Let's break them down one by one:
Insufficient memory: Execute kubectl describe node <node-name> to check how much free memory is available there. Check the requests and limits of your pods. Note that Kubernetes will block the full amount of memory a pod requests regardless how much this pod uses.
Insufficient cpu: Analogical as above.
node(s) didn't match pod affinity/anti-affinity: Check your affinity/anti-affinity rules.
node(s) didn't satisfy existing pods anti-affinity rules: Same as above.
node(s) had volume node affinity conflict: Happens when pod was not able to be scheduled because it cannot connect to the volume from another Availability Zone. You can fix this by creating a storageclass for a single zone and than use that storageclass in your PVC.
node(s) were unschedulable: This is because the node is marked as Unschedulable. Which leads us to the next issue below:
node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate: This corresponds to the NodeCondition Ready = False. You can use kubectl describe node to check taints and kubectl taint nodes <node-name> <taint-name>- in order to remove them. Check the Taints and Tolerations for more details.
Also there is a GitHub thread with a similar issue that you may find useful.
Try checking/eliminating those issue one by one (starting from the first listed above) as they can make a "chain reaction" in some scenarios.

How to scale a kubernetes cluster while limiting costs on GCP

We have a GKE cluster set up on google cloud platform.
We have an activity that requires 'bursts' of computing power.
Imagine that we usually do 100 computations an hour on average, then suddently we need to be able to process 100000 in less then two minutes. However most of the time, everything is close to idle.
We do not want to pay for idle servers 99% of the time, and want to scale clusters depending on actual use (no data persistance needed, servers can be deleted afterwards). I looked up the documentation available on kubernetes regarding auto scaling, for adding more pods with HPA and adding more nodes with cluster autoscaler
However it doesn't seem like any of these solutions would actually reduce our costs or improve performances, because they do not seem to scale past the GCP plan:
Imagine that we have a google plan with 8 CPUs. My understanding is if we add more nodes with cluster autoscaler we will just instead of having e.g. 2 nodes using 4 CPUs each we will have 4 nodes using 2 cpus each. But the total available computing power will still be 8 CPU.
Same reasoning go for HPA with more pods instead of more nodes.
If we have the 8 CPU payment plan but only use 4 of them, my understanding is we still get billed for 8 so scaling down is not really useful.
What we want is autoscaling to change our payment plan temporarly (imagine from n1-standard-8 to n1-standard-16) and get actual new computing power.
I can't believe we are the only ones with this use case but I cannot find any documentation on this anywhere! Did I misunderstand something ?
TL;DR:
Create a small persistant node-pool
Create a powerfull node-pool that can be scaled to zero (and cease billing) while not in use.
Tools used:
GKE’s Cluster Autoscaling, Node selector, Anti-affinity rules and Taints and tolerations.
GKE Pricing:
From GKE Pricing:
Starting June 6, 2020, GKE will charge a cluster management fee of $0.10 per cluster per hour. The following conditions apply to the cluster management fee:
One zonal cluster per billing account is free.
The fee is flat, irrespective of cluster size and topology.
Billing is computed on a per-second basis for each cluster. The total amount is rounded to the nearest cent, at the end of each month.
From Pricing for Worker Nodes:
GKE uses Compute Engine instances for worker nodes in the cluster. You are billed for each of those instances according to Compute Engine's pricing, until the nodes are deleted. Compute Engine resources are billed on a per-second basis with a one-minute minimum usage cost.
Enters, Cluster Autoscaler:
automatically resize your GKE cluster’s node pools based on the demands of your workloads. When demand is high, cluster autoscaler adds nodes to the node pool. When demand is low, cluster autoscaler scales back down to a minimum size that you designate. This can increase the availability of your workloads when you need it, while controlling costs.
Cluster Autoscaler cannot scale the entire cluster to zero, at least one node must always be available in the cluster to run system pods.
Since you already have a persistent workload, this wont be a problem, what we will do is create a new node pool:
A node pool is a group of nodes within a cluster that all have the same configuration. Every cluster has at least one default node pool, but you can add other node pools as needed.
For this example I'll create two node pools:
A default node pool with a fixed size of one node with a small instance size (emulating the cluster you already have).
A second node pool with more compute power to run the jobs (I'll call it power-pool).
Choose the machine type with the power you need to run your AI Jobs, for this example I'll create a n1-standard-8.
This power-pool will have autoscaling set to allow max 4 nodes, minimum 0 nodes.
If you like to add GPUs you can check this great: Guide Scale to almost zero + GPUs.
Taints and Tolerations:
Only the jobs related to the AI workload will run on the power-pool, for that use a node selector in the job pods to make sure they run in the power-pool nodes.
Set a anti-affinity rule to ensure that two of your training pods cannot be scheduled on the same node (optimizing the price-performance ratio, this is optional depending on your workload).
Add a taint to the power-pool to avoid other workloads (and system resources) to be scheduled on the autoscalable pool.
Add the tolerations to the AI Jobs to let them run on those nodes.
Reproduction:
Create the Cluster with the persistent default-pool:
PROJECT_ID="YOUR_PROJECT_ID"
GCP_ZONE="CLUSTER_ZONE"
GKE_CLUSTER_NAME="CLUSTER_NAME"
AUTOSCALE_POOL="power-pool"
gcloud container clusters create ${GKE_CLUSTER_NAME} \
--machine-type="n1-standard-1" \
--num-nodes=1 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
Create the auto-scale pool:
gcloud container node-pools create ${GKE_BURST_POOL} \
--cluster=${GKE_CLUSTER_NAME} \
--machine-type=n1-standard-8 \
--node-labels=load=on-demand \
--node-taints=reserved-pool=true:NoSchedule \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=4 \
--zone=${GCP_ZONE} \
--project=${PROJECT_ID}
Note about parameters:
--node-labels=load=on-demand: Add a label to the nodes in the power pool to allow selecting them in our AI job using a node selector.
--node-taints=reserved-pool=true:NoSchedule: Add a taint to the nodes to prevent any other workload from accidentally being scheduled in this node pool.
Here you can see the two pools we created, the static pool with 1 node and the autoscalable pool with 0-4 nodes.
Since we don't have workload running on the autoscalable node-pool, it shows 0 nodes running (and with no charge while there is no node in execution).
Now we'll create a job that create 4 parallel pods that run for 5 minutes.
This job will have the following parameters to differentiate from normal pods:
parallelism: 4: to use all 4 nodes to enhance performance
nodeSelector.load: on-demand: to assign to the nodes with that label.
podAntiAffinity: to declare that we do not want two pods with the same label app: greedy-job running in the same node (optional).
tolerations: to match the toleration to the taint that we attached to the nodes, so these pods are allowed to be scheduled in these nodes.
apiVersion: batch/v1
kind: Job
metadata:
name: greedy-job
spec:
parallelism: 4
template:
metadata:
name: greedy-job
labels:
app: greedy-app
spec:
containers:
- name: busybox
image: busybox
args:
- sleep
- "300"
nodeSelector:
load: on-demand
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- greedy-app
topologyKey: "kubernetes.io/hostname"
tolerations:
- key: reserved-pool
operator: Equal
value: "true"
effect: NoSchedule
restartPolicy: OnFailure
Now that our cluster is in standby we will use the job yaml we just created (I'll call it greedyjob.yaml). This job will run four processes that will run in parallel and that will complete after about 5 minutes.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 42m v1.14.10-gke.27
$ kubectl get pods
No resources found in default namespace.
$ kubectl apply -f greedyjob.yaml
job.batch/greedy-job created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
greedy-job-2xbvx 0/1 Pending 0 11s
greedy-job-72j8r 0/1 Pending 0 11s
greedy-job-9dfdt 0/1 Pending 0 11s
greedy-job-wqct9 0/1 Pending 0 11s
Our job was applied, but is in pending, let's see what's going on in those pods:
$ kubectl describe pod greedy-job-2xbvx
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 28s (x2 over 28s) default-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
Normal TriggeredScaleUp 23s cluster-autoscaler pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/owilliam/zones/us-central1-b/instanceGroups/gke-autoscale-to-zero-clus-power-pool-564148fd-grp 0->1 (max: 4)}]
The pod can't be scheduled on the current node due to the rules we defined, this triggers a Scale Up routine on our power-pool. This is a very dynamic process, after 90 seconds the first node is up and running:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
greedy-job-2xbvx 0/1 Pending 0 93s
greedy-job-72j8r 0/1 ContainerCreating 0 93s
greedy-job-9dfdt 0/1 Pending 0 93s
greedy-job-wqct9 0/1 Pending 0 93s
$ kubectl nodes
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 44m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw Ready <none> 11s v1.14.10-gke.27
Since we set pod anti-affinity rules, the second pod can't be scheduled on the node that was brought up and triggers the next scale up, take a look at the events on the second pod:
$ k describe pod greedy-job-2xbvx
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal TriggeredScaleUp 2m45s cluster-autoscaler pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/owilliam/zones/us-central1-b/instanceGroups/gke-autoscale-to-zero-clus-power-pool-564148fd-grp 0->1 (max: 4)}]
Warning FailedScheduling 93s (x3 over 2m50s) default-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
Warning FailedScheduling 79s (x3 over 83s) default-scheduler 0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taints that the pod didn't tolerate.
Normal TriggeredScaleUp 62s cluster-autoscaler pod triggered scale-up: [{https://content.googleapis.com/compute/v1/projects/owilliam/zones/us-central1-b/instanceGroups/gke-autoscale-to-zero-clus-power-pool-564148fd-grp 1->2 (max: 4)}]
Warning FailedScheduling 3s (x3 over 68s) default-scheduler 0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.
The same process repeats until all requirements are satisfied:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
greedy-job-2xbvx 0/1 Pending 0 3m39s
greedy-job-72j8r 1/1 Running 0 3m39s
greedy-job-9dfdt 0/1 Pending 0 3m39s
greedy-job-wqct9 1/1 Running 0 3m39s
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 46m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw Ready <none> 2m16s v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q Ready <none> 28s v1.14.10-gke.27
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
greedy-job-2xbvx 0/1 Pending 0 5m19s
greedy-job-72j8r 1/1 Running 0 5m19s
greedy-job-9dfdt 1/1 Running 0 5m19s
greedy-job-wqct9 1/1 Running 0 5m19s
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 48m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2 Ready <none> 63s v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw Ready <none> 4m8s v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q Ready <none> 2m20s v1.14.10-gke.27
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
greedy-job-2xbvx 1/1 Running 0 6m12s
greedy-job-72j8r 1/1 Running 0 6m12s
greedy-job-9dfdt 1/1 Running 0 6m12s
greedy-job-wqct9 1/1 Running 0 6m12s
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 48m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2 Ready <none> 113s v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv Ready <none> 26s v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw Ready <none> 4m58s v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q Ready <none> 3m10s v1.14.10-gke.27
Here we can see that all nodes are now up and running (thus, being billed by second)
Now all jobs are running, after a few minutes the jobs complete their tasks:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
greedy-job-2xbvx 1/1 Running 0 7m22s
greedy-job-72j8r 0/1 Completed 0 7m22s
greedy-job-9dfdt 1/1 Running 0 7m22s
greedy-job-wqct9 1/1 Running 0 7m22s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
greedy-job-2xbvx 0/1 Completed 0 11m
greedy-job-72j8r 0/1 Completed 0 11m
greedy-job-9dfdt 0/1 Completed 0 11m
greedy-job-wqct9 0/1 Completed 0 11m
Once the task is completed, the autoscaler starts downsizing the cluster.
You can learn more about the rules for this process here: GKE Cluster AutoScaler
$ while true; do kubectl get nodes ; sleep 60; done
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 54m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2 Ready <none> 7m26s v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv Ready <none> 5m59s v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw Ready <none> 10m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q Ready <none> 8m43s v1.14.10-gke.27
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 62m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2 Ready <none> 15m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv Ready <none> 14m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw Ready <none> 18m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-sf6q NotReady <none> 16m v1.14.10-gke.27
Once conditions are met, autoscaler flags the node as NotReady and starts removing them:
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 64m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2 NotReady <none> 17m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv NotReady <none> 16m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw Ready <none> 20m v1.14.10-gke.27
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 65m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-39m2 NotReady <none> 18m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv NotReady <none> 17m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-qxkw NotReady <none> 21m v1.14.10-gke.27
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 66m v1.14.10-gke.27
gke-autoscale-to-zero-clus-power-pool-564148fd-ggxv NotReady <none> 18m v1.14.10-gke.27
NAME STATUS ROLES AGE VERSION
gke-autoscale-to-zero-cl-default-pool-9f6d80d3-x9lb Ready <none> 67m v1.14.10-gke.27
Here is the confirmation that the nodes were removed from GKE and from VMs(remember that every node is a Virtual Machine billed as Compute Engine):
Compute Engine: (note that gke-cluster-1-default-pool is from another cluster, I added it to the screenshot to show you that there is no other node from cluster gke-autoscale-to-zero other than the default persistent one.)
GKE:
Final Thoughts:
When scaling down, cluster autoscaler respects scheduling and eviction rules set on Pods. These restrictions can prevent a node from being deleted by the autoscaler. A node's deletion could be prevented if it contains a Pod with any of these conditions:
An application's PodDisruptionBudget can also prevent autoscaling; if deleting nodes would cause the budget to be exceeded, the cluster does not scale down.
You can note that the process is really fast, in our example it took around 90 seconds to upscale a node and 5 minutes to finish downscaling a standby node, providing a HUGE improvement in your billing.
Preemptible VMs can reduce even further your billing, but you will have to consider the kind of workload you are running:
Preemptible VMs are Compute Engine VM instances that last a maximum of 24 hours and provide no availability guarantees. Preemptible VMs are priced lower than standard Compute Engine VMs and offer the same machine types and options.
I know you are still considering the best architecture for your app.
Using APP Engine and IA Platform are optimal solutions as well, but since you are currently running your workload on GKE I wanted to show you an example as requested.
If you have any further questions let me know in the comments.