pod hangs in Pending state - kubernetes

I have a Kubernetes deployment in which I am trying to run 5 Docker containers inside a single pod on a single node. The pod hangs in the "Pending" state and is never scheduled. I do not mind running more than 1 pod, but I'd like to keep the number of nodes down. I assumed that 1 node with 1 CPU and 1.7G RAM would be enough for the 5 containers, and I have attempted to split the workload across them.
Initially I came to the conclusion that I have insufficient resources. I enabled autoscaling of nodes, which produced the following (see the kubectl describe pod output):
pod didn't trigger scale-up (it wouldn't fit if a new node is added)
Anyway, each Docker container runs a simple command for a fairly simple app. Ideally I would rather not deal with CPU and RAM resource allocation at all, but even after setting the CPU/memory requests and limits so that they don't add up to more than 1 CPU, I still get this (see kubectl describe po/test-529945953-gh6cl below):
No nodes are available that match all of the following predicates::
Insufficient cpu (1), Insufficient memory (1).
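For reference, each of the 5 containers declares its resources roughly like this in the deployment spec (reconstructed from the describe output below):

resources:
  requests:
    cpu: 100m      # 5 containers -> 500m requested by the pod
    memory: 375Mi  # 5 containers -> 1875Mi requested by the pod
  limits:
    cpu: 150m
    memory: 375Mi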
Below are various commands that show the state. Any help on what I'm doing wrong will be appreciated.
kubectl get all
user_s#testing-11111:~/gce$ kubectl get all
NAME READY STATUS RESTARTS AGE
po/test-529945953-gh6cl 0/5 Pending 0 34m
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/kubernetes 10.7.240.1 <none> 443/TCP 19d
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/test 1 1 1 0 34m
NAME DESIRED CURRENT READY AGE
rs/test-529945953 1 1 0 34m
user_s#testing-11111:~/gce$
kubectl describe po/test-529945953-gh6cl
user_s#testing-11111:~/gce$ kubectl describe po/test-529945953-gh6cl
Name: test-529945953-gh6cl
Namespace: default
Node: <none>
Labels: app=test
pod-template-hash=529945953
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"test-529945953","uid":"c6e889cb-a2a0-11e7-ac18-42010a9a001a"...
Status: Pending
IP:
Created By: ReplicaSet/test-529945953
Controlled By: ReplicaSet/test-529945953
Containers:
container-test2-tickers:
Image: gcr.io/testing-11111/testology:latest
Port: <none>
Command:
process_cmd
arg1
test2
Limits:
cpu: 150m
memory: 375Mi
Requests:
cpu: 100m
memory: 375Mi
Environment:
DB_HOST: 127.0.0.1:5432
DB_PASSWORD: <set to the key 'password' in secret 'cloudsql-db-credentials'> Optional: false
DB_USER: <set to the key 'username' in secret 'cloudsql-db-credentials'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b2mxc (ro)
container-kraken-tickers:
Image: gcr.io/testing-11111/testology:latest
Port: <none>
Command:
process_cmd
arg1
arg2
Limits:
cpu: 150m
memory: 375Mi
Requests:
cpu: 100m
memory: 375Mi
Environment:
DB_HOST: 127.0.0.1:5432
DB_PASSWORD: <set to the key 'password' in secret 'cloudsql-db-credentials'> Optional: false
DB_USER: <set to the key 'username' in secret 'cloudsql-db-credentials'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b2mxc (ro)
container-gdax-tickers:
Image: gcr.io/testing-11111/testology:latest
Port: <none>
Command:
process_cmd
arg1
arg2
Limits:
cpu: 150m
memory: 375Mi
Requests:
cpu: 100m
memory: 375Mi
Environment:
DB_HOST: 127.0.0.1:5432
DB_PASSWORD: <set to the key 'password' in secret 'cloudsql-db-credentials'> Optional: false
DB_USER: <set to the key 'username' in secret 'cloudsql-db-credentials'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b2mxc (ro)
container-bittrex-tickers:
Image: gcr.io/testing-11111/testology:latest
Port: <none>
Command:
process_cmd
arg1
arg2
Limits:
cpu: 150m
memory: 375Mi
Requests:
cpu: 100m
memory: 375Mi
Environment:
DB_HOST: 127.0.0.1:5432
DB_PASSWORD: <set to the key 'password' in secret 'cloudsql-db-credentials'> Optional: false
DB_USER: <set to the key 'username' in secret 'cloudsql-db-credentials'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b2mxc (ro)
cloudsql-proxy:
Image: gcr.io/cloudsql-docker/gce-proxy:1.09
Port: <none>
Command:
/cloud_sql_proxy
--dir=/cloudsql
-instances=testing-11111:europe-west2:testology=tcp:5432
-credential_file=/secrets/cloudsql/credentials.json
Limits:
cpu: 150m
memory: 375Mi
Requests:
cpu: 100m
memory: 375Mi
Environment: <none>
Mounts:
/cloudsql from cloudsql (rw)
/etc/ssl/certs from ssl-certs (rw)
/secrets/cloudsql from cloudsql-instance-credentials (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b2mxc (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
cloudsql-instance-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: cloudsql-instance-credentials
Optional: false
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
cloudsql:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
default-token-b2mxc:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-b2mxc
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
27m 17m 44 default-scheduler Warning FailedScheduling No nodes are available that match all of the following predicates:: Insufficient cpu (1), Insufficient memory (2).
26m 8s 150 cluster-autoscaler Normal NotTriggerScaleUp pod didn't trigger scale-up (it wouldn't fit if a new node is added)
16m 2s 63 default-scheduler Warning FailedScheduling No nodes are available that match all of the following predicates:: Insufficient cpu (1), Insufficient memory (1).
user_s#testing-11111:~/gce$
kubectl get nodes
user_s#testing-11111:~/gce$ kubectl get nodes
NAME STATUS AGE VERSION
gke-test-default-pool-abdf83f7-p4zw Ready 9h v1.6.7
kubectl get pods
user_s#testing-11111:~/gce$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-529945953-gh6cl 0/5 Pending 0 38m
kubectl describe nodes
user_s#testing-11111:~/gce$ kubectl describe nodes
Name: gke-test-default-pool-abdf83f7-p4zw
Role:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/fluentd-ds-ready=true
beta.kubernetes.io/instance-type=g1-small
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=default-pool
failure-domain.beta.kubernetes.io/region=europe-west2
failure-domain.beta.kubernetes.io/zone=europe-west2-c
kubernetes.io/hostname=gke-test-default-pool-abdf83f7-p4zw
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Tue, 26 Sep 2017 02:05:45 +0100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 26 Sep 2017 02:06:05 +0100 Tue, 26 Sep 2017 02:06:05 +0100 RouteCreated RouteController created a route
OutOfDisk False Tue, 26 Sep 2017 11:33:57 +0100 Tue, 26 Sep 2017 02:05:45 +0100 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Tue, 26 Sep 2017 11:33:57 +0100 Tue, 26 Sep 2017 02:05:45 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 26 Sep 2017 11:33:57 +0100 Tue, 26 Sep 2017 02:05:45 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Tue, 26 Sep 2017 11:33:57 +0100 Tue, 26 Sep 2017 02:06:05 +0100 KubeletReady kubelet is posting ready status. AppArmor enabled
KernelDeadlock False Tue, 26 Sep 2017 11:33:12 +0100 Tue, 26 Sep 2017 02:05:45 +0100 KernelHasNoDeadlock kernel has no deadlock
Addresses:
InternalIP: 10.154.0.2
ExternalIP: 35.197.217.1
Hostname: gke-test-default-pool-abdf83f7-p4zw
Capacity:
cpu: 1
memory: 1742968Ki
pods: 110
Allocatable:
cpu: 1
memory: 1742968Ki
pods: 110
System Info:
Machine ID: e6119abf844c564193495c64fd9bd341
System UUID: E6119ABF-844C-5641-9349-5C64FD9BD341
Boot ID: 1c2f2ea0-1f5b-4c90-9e14-d1d9d7b75221
Kernel Version: 4.4.52+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.11.2
Kubelet Version: v1.6.7
Kube-Proxy Version: v1.6.7
PodCIDR: 10.4.1.0/24
ExternalID: 6073438913956157854
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system fluentd-gcp-v2.0-k565g 100m (10%) 0 (0%) 200Mi (11%) 300Mi (17%)
kube-system heapster-v1.3.0-3440173064-1ztvw 138m (13%) 138m (13%) 301456Ki (17%) 301456Ki (17%)
kube-system kube-dns-1829567597-gdz52 260m (26%) 0 (0%) 110Mi (6%) 170Mi (9%)
kube-system kube-dns-autoscaler-2501648610-7q9dd 20m (2%) 0 (0%) 10Mi (0%) 0 (0%)
kube-system kube-proxy-gke-test-default-pool-abdf83f7-p4zw 100m (10%) 0 (0%) 0 (0%) 0 (0%)
kube-system kubernetes-dashboard-490794276-25hmn 100m (10%) 100m (10%) 50Mi (2%) 50Mi (2%)
kube-system l7-default-backend-3574702981-flqck 10m (1%) 10m (1%) 20Mi (1%) 20Mi (1%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
728m (72%) 248m (24%) 700816Ki (40%) 854416Ki (49%)
Events: <none>

As you can see in the output of your kubectl describe nodes command under Allocated resources:, 728m (72%) CPU and 700816Ki (40%) memory are already requested by the Pods running in the kube-system namespace on the node. The sum of the resource requests of your test Pod exceeds both the remaining CPU and the remaining memory on the node, which is exactly what the Events section of your kubectl describe po/[…] output reports.
If you want to keep all containers in a single pod, you need to reduce the resource requests of your containers or run the pod on a node with more CPU and memory. The better solution is to split your application into multiple pods, which lets the scheduler distribute them across multiple nodes.
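For illustration, one of the containers split out into its own Deployment could look roughly like this. This is a minimal sketch, not your exact manifest: the image, command, and CPU numbers are taken from the pod description above, while the Deployment name, labels, and the lowered memory request are assumptions (and on a 1.6 cluster the apiVersion would be apps/v1beta1 or extensions/v1beta1 instead of apps/v1):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test2-tickers          # hypothetical: one Deployment per worker container
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test2-tickers
  template:
    metadata:
      labels:
        app: test2-tickers
    spec:
      containers:
      - name: container-test2-tickers
        image: gcr.io/testing-11111/testology:latest
        command: ["process_cmd", "arg1", "test2"]
        resources:
          requests:
            cpu: 100m
            memory: 300Mi      # assumption: lowered so several pods fit on a 1.7G node
          limits:
            cpu: 150m
            memory: 375Mi

Each worker then becomes its own pod, so the scheduler (and the cluster autoscaler) can place them on different nodes instead of having to fit all five containers onto one.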

Related

What is the kubernetes equivalent of docker inspect?

When a Docker container is running, it is sometimes helpful to look at its runtime configuration.
What is the equivalent command for Kubernetes?
I searched SO for this and came up with some similar questions (see https://stackoverflow.com/search?q=What+is+the+kubernetes+equivalent), but not this exact question.
What's the kubectl equivalent of docker exec bash in Kubernetes?
Docker volume and kubernetes volume
Kubernetes is a container orchestrator, so you won't find container-level commands.
You can check the container logs:
kubectl logs pod-name
Mon Jan 1 00:00:00 UTC 2001 INFO 0
Mon Jan 1 00:00:01 UTC 2001 INFO 1
Mon Jan 1 00:00:02 UTC 2001 INFO 2
You can describe a pod to see pod details, as well as possible pull image errors:
kubectl describe pod nginx-deployment-1006230814-6winp
Name: nginx-deployment-1006230814-6winp
Namespace: default
Node: kubernetes-node-wul5/10.240.0.9
Start Time: Thu, 24 Mar 2016 01:39:49 +0000
Labels: app=nginx,pod-template-hash=1006230814
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nginx-deployment-1956810328","uid":"14e607e7-8ba1-11e7-b5cb-fa16" ...
Status: Running
IP: 10.244.0.6
Controllers: ReplicaSet/nginx-deployment-1006230814
Containers:
nginx:
Container ID: docker://90315cc9f513c724e9957a4788d3e625a078de84750f244a40f97ae355eb1149
Image: nginx
Image ID: docker://6f62f48c4e55d700cf3eb1b5e33fa051802986b77b874cc351cce539e5163707
Port: 80/TCP
QoS Tier:
cpu: Guaranteed
memory: Guaranteed
Limits:
cpu: 500m
memory: 128Mi
Requests:
memory: 128Mi
cpu: 500m
State: Running
Started: Thu, 24 Mar 2016 01:39:51 +0000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5kdvl (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
default-token-4bcbi:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-4bcbi
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
54s 54s 1 {default-scheduler } Normal Scheduled Successfully assigned nginx-deployment-1006230814-6winp to kubernetes-node-wul5
54s 54s 1 {kubelet kubernetes-node-wul5} spec.containers{nginx} Normal Pulling pulling image "nginx"
53s 53s 1 {kubelet kubernetes-node-wul5} spec.containers{nginx} Normal Pulled Successfully pulled image "nginx"
53s 53s 1 {kubelet kubernetes-node-wul5} spec.containers{nginx} Normal Created Created container with docker id 90315cc9f513
53s 53s 1 {kubelet kubernetes-node-wul5} spec.containers{nginx} Normal Started Started container with docker id 90315cc9f513
If you need to see low-level details about a container, use the Docker client (or whichever container runtime client applies) directly on the node for this purpose.
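Something close to docker inspect on the Kubernetes side is dumping the pod object itself; a small example, using the pod name from above:

kubectl get pod nginx-deployment-1006230814-6winp -o yaml   # full spec and status as stored in the API
kubectl get pod nginx-deployment-1006230814-6winp -o json   # same object as JSON, handy for scripting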

Pod staying in Pending state

I have a Kubernetes pod that is staying in the Pending state. When I describe the pod, I cannot see why it fails to start; all I see is Back-off restarting failed container.
This is what I can see when I describe the pod.
kubectl describe po jenkins-68d5474964-slpkj -n infrastructure
Name: jenkins-68d5474964-slpkj
Namespace: infrastructure
Priority: 0
PriorityClassName: <none>
Node: ip-172-20-120-29.eu-west-1.compute.internal/172.20.120.29
Start Time: Fri, 05 Feb 2021 17:10:34 +0100
Labels: app=jenkins
chart=jenkins-0.35.0
component=jenkins-jenkins-master
heritage=Tiller
pod-template-hash=2481030520
release=jenkins
Annotations: checksum/config=fc546aa316b7bb9bd6a7cbeb69562ca9f224dbfe53973411f97fea27e90cd4d7
Status: Pending
IP: 100.125.247.153
Controlled By: ReplicaSet/jenkins-68d5474964
Init Containers:
copy-default-config:
Container ID: docker://a6ce91864c181d4fc851afdd4a6dc2258c23e75bbed6981fe1cafad74a764ff2
Image: jenkins/jenkins:2.248
Image ID: docker-pullable://jenkins/jenkins#sha256:352f10079331b1e63c170b6f4b5dc5e2367728f0da00b6ad34424b2b2476426a
Port: <none>
Host Port: <none>
Command:
sh
/var/jenkins_config/apply_config.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 05 Feb 2021 17:15:16 +0100
Finished: Fri, 05 Feb 2021 17:15:36 +0100
Ready: False
Restart Count: 5
Limits:
cpu: 2560m
memory: 2Gi
Requests:
cpu: 50m
memory: 256Mi
Environment:
ADMIN_PASSWORD: <set to the key 'jenkins-admin-password' in secret 'jenkins'> Optional: false
ADMIN_USER: <set to the key 'jenkins-admin-user' in secret 'jenkins'> Optional: false
Mounts:
/usr/share/jenkins/ref/secrets/ from secrets-dir (rw)
/var/jenkins_config from jenkins-config (rw)
/var/jenkins_home from jenkins-home (rw)
/var/jenkins_plugins from plugin-dir (rw)
/var/run/docker.sock from docker-sock (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5tbbb (rw)
Containers:
jenkins:
Container ID:
Image: jenkins/jenkins:2.248
Image ID:
Ports: 8080/TCP, 50000/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--argumentsRealm.passwd.$(ADMIN_USER)=$(ADMIN_PASSWORD)
--argumentsRealm.roles.$(ADMIN_USER)=admin
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 2560m
memory: 2Gi
Requests:
cpu: 50m
memory: 256Mi
Environment:
JAVA_OPTS:
JENKINS_OPTS:
JENKINS_SLAVE_AGENT_PORT: 50000
ADMIN_PASSWORD: <set to the key 'jenkins-admin-password' in secret 'jenkins'> Optional: false
ADMIN_USER: <set to the key 'jenkins-admin-user' in secret 'jenkins'> Optional: false
Mounts:
/usr/share/jenkins/ref/plugins/ from plugin-dir (rw)
/usr/share/jenkins/ref/secrets/ from secrets-dir (rw)
/var/jenkins_config from jenkins-config (ro)
/var/jenkins_home from jenkins-home (rw)
/var/run/docker.sock from docker-sock (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5tbbb (rw)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
jenkins-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: jenkins
Optional: false
plugin-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
secrets-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
jenkins-home:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: jenkins
ReadOnly: false
default-token-5tbbb:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-5tbbb
Optional: false
docker-sock:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType:
QoS Class: Burstable
Node-Selectors: nodePool=ci
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m default-scheduler Successfully assigned infrastructure/jenkins-68d5474964-slpkj to ip-172-20-120-29.eu-west-1.compute.internal
Normal Started 5m (x4 over 7m) kubelet, ip-172-20-120-29.eu-west-1.compute.internal Started container
Normal Pulling 4m (x5 over 7m) kubelet, ip-172-20-120-29.eu-west-1.compute.internal pulling image "jenkins/jenkins:2.248"
Normal Pulled 4m (x5 over 7m) kubelet, ip-172-20-120-29.eu-west-1.compute.internal Successfully pulled image "jenkins/jenkins:2.248"
Normal Created 4m (x5 over 7m) kubelet, ip-172-20-120-29.eu-west-1.compute.internal Created container
Warning BackOff 2m (x14 over 6m) kubelet, ip-172-20-120-29.eu-west-1.compute.internal Back-off restarting failed container
Once I run helm upgrade for that release, I can see:
RESOURCES:
==> v1/ConfigMap
NAME DATA AGE
jenkins 5 441d
jenkins-configs 1 441d
jenkins-tests 1 441d
==> v1/Deployment
NAME READY UP-TO-DATE AVAILABLE AGE
jenkins 0/1 1 0 441d
==> v1/PersistentVolumeClaim
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
jenkins Bound pvc-8813319f-0d37-11ea-9864-0a7b1d347c8a 4Gi RWO aws-efs 441d
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
jenkins-7b85495f65-2w5mv 0/1 Init:0/1 3 2m9s
==> v1/Secret
NAME TYPE DATA AGE
jenkins Opaque 2 441d
jenkins-secrets Opaque 3 441d
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
jenkins LoadBalancer 100.65.2.235 a881a20a40d37... 8080:31962/TCP 441d
jenkins-agent ClusterIP 100.64.69.113 <none> 50000/TCP 441d
==> v1/ServiceAccount
NAME SECRETS AGE
jenkins 1 441d
==> v1beta1/ClusterRoleBinding
NAME AGE
jenkins-role-binding 441d
Can someone advise?
Right now you cannot get any logs with kubectl logs pod_name because the pod is still initializing.
When you use the kubectl logs command:
If the pod has multiple containers, you have to specify the container name explicitly.
If the pod has only one container, there is no need to specify the container name.
If you want the logs of an init container, you need to specify the init container's name.
In your case, the pod has one init container and it seems to be stuck.
Init Containers:
copy-default-config:
Command:
sh
/var/jenkins_config/apply_config.sh
You can check the log of this container.
kubectl logs jenkins-68d5474964-slpkj -c copy-default-config -n infrastructure
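Since the init container is in CrashLoopBackOff (Restart Count: 5), the logs of the previous, crashed attempt are often the useful ones; a hedged example, reusing the pod and namespace from the question:

kubectl get pod jenkins-68d5474964-slpkj -n infrastructure -o jsonpath='{.spec.initContainers[*].name}'
kubectl logs jenkins-68d5474964-slpkj -n infrastructure -c copy-default-config --previous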
For me, the deployment was in this state because the installPlugins list was incorrectly set in the values passed to the Helm chart.
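For reference, in the Jenkins Helm chart that list is a YAML array of plugin:version strings in the values file. A hedged sketch only; the exact key path (and its capitalization) depends on the chart version, and the plugin versions here are placeholders:

master:
  installPlugins:
    - kubernetes:1.18.2
    - workflow-aggregator:2.6
    - git:4.2.2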
Hope it helps :)

Why does the Kubernetes pod error message not refresh after the problem is fixed?

My pod shows a red (error) status with this message:
Usage of EmptyDir volume "agent" exceeds the limit "100Mi".
After I fixed the problem, the error did not disappear.
How do I make this error message disappear? This is the pod info:
dolphin#dolphins-MacBook-Pro ~ % kubectl describe pods soa-task-745d48d955-bd4j8
Name: soa-task-745d48d955-bd4j8
Namespace: dabai-fat
Priority: 0
Node: azshara-k8s03/172.19.150.82
Start Time: Tue, 17 Aug 2021 15:23:18 +0800
Labels: k8s-app=soa-task
pod-template-hash=745d48d955
Annotations: kubectl.kubernetes.io/restartedAt: 2021-04-20T07:42:03Z
Status: Running
IP: 172.30.184.3
IPs: <none>
Controlled By: ReplicaSet/soa-task-745d48d955
Init Containers:
init-agent:
Container ID: docker://25d947147300edba8bc1861d40cea314047674b74f82d7de9013eead41f1f20f
Image: registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai_fat/skywalking-agent:6.5.0
Image ID: docker-pullable://registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai_fat/skywalking-agent#sha256:eda5426bc7bc06fc184e740f4783f263f151ae25e55aae37eec8b67e5dbb2fb0
Port: <none>
Host Port: <none>
Command:
sh
-c
set -ex;mkdir -p /skywalking/agent;cp -r /opt/skywalking/agent/* /skywalking/agent;
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 17 Aug 2021 15:23:19 +0800
Finished: Tue, 17 Aug 2021 15:23:19 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/skywalking/agent from agent (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xnrwt (ro)
Containers:
soa-task:
Container ID: docker://5406a90606e0a3905fa8fa4827e19db0d8d58609c06c2fd4e756b718df5db3b9
Image: registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai-fat/soa-task:v1.0.0
Image ID: docker-pullable://registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai-fat/soa-task#sha256:aece36589aae4fdedcfb82d7e64e451e32ebb1169dccd2485f8fe4bd451944a8
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 17 Aug 2021 15:23:34 +0800
Ready: True
Restart Count: 0
Liveness: http-get http://:11028/actuator/liveness delay=120s timeout=30s period=10s #success=1 #failure=3
Readiness: http-get http://:11028/actuator/health delay=90s timeout=30s period=10s #success=1 #failure=3
Environment:
SKYWALKING_ADDR: dabai-skywalking-skywalking-oap.apm.svc.cluster.local:11800
APOLLO_META: <set to the key 'apollo.meta' of config map 'fat-config'> Optional: false
ENV: <set to the key 'env' of config map 'fat-config'> Optional: false
Mounts:
/opt/skywalking/agent from agent (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xnrwt (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
agent:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: 100Mi
default-token-xnrwt:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xnrwt
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 360s
node.kubernetes.io/unreachable:NoExecute op=Exists for 360s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13m default-scheduler Successfully assigned dabai-fat/soa-task-745d48d955-bd4j8 to azshara-k8s03
Normal Pulled 13m kubelet Container image "registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai_fat/skywalking-agent:6.5.0" already present on machine
Normal Created 13m kubelet Created container init-agent
Normal Started 13m kubelet Started container init-agent
Normal Pulling 13m kubelet Pulling image "registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai-fat/soa-task:v1.0.0"
Normal Pulled 13m kubelet Successfully pulled image "registry.cn-hangzhou.aliyuncs.com/dabai_app_k8s/dabai-fat/soa-task:v1.0.0"
Normal Created 13m kubelet Created container soa-task
Normal Started 13m kubelet Started container soa-task
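For context, the 100Mi cap that triggered the original message comes from a sizeLimit on the emptyDir volume, which in the pod spec looks roughly like this (a reconstruction from the Volumes section above):

volumes:
  - name: agent
    emptyDir:
      sizeLimit: 100Mi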

NoExecuteTaintManager falsely deleting Pod?

I am receiving NoExecuteTaintManager events that are deleting my pod, but I can't figure out why. The node is healthy and the Pod has the appropriate tolerations.
This is actually causing an infinite scale-up, because my Pod is set up to use 3 of the node's 4 CPUs and has a termination grace period > 0, so every terminating Pod forces a new node while the Cluster Autoscaler tries to keep replicas == 2.
How do I figure out which taint is causing it specifically, and then why the taint manager thinks the node has that taint? Currently the pods are being killed at exactly 600 seconds (the tolerationSeconds I set for node.kubernetes.io/unreachable and node.kubernetes.io/not-ready), yet the node does not appear to be in either of those conditions.
NAME READY STATUS RESTARTS AGE
my-api-67df7bd54c-dthbn 1/1 Running 0 8d
my-api-67df7bd54c-mh564 1/1 Running 0 8d
my-pod-6d7b698b5f-28rgw 1/1 Terminating 0 15m
my-pod-6d7b698b5f-2wmmg 1/1 Terminating 0 13m
my-pod-6d7b698b5f-4lmmg 1/1 Running 0 4m32s
my-pod-6d7b698b5f-7m4gh 1/1 Terminating 0 71m
my-pod-6d7b698b5f-8b47r 1/1 Terminating 0 27m
my-pod-6d7b698b5f-bb58b 1/1 Running 0 2m29s
my-pod-6d7b698b5f-dn26n 1/1 Terminating 0 25m
my-pod-6d7b698b5f-jrnkg 1/1 Terminating 0 38m
my-pod-6d7b698b5f-sswps 1/1 Terminating 0 36m
my-pod-6d7b698b5f-vhqnf 1/1 Terminating 0 59m
my-pod-6d7b698b5f-wkrtg 1/1 Terminating 0 50m
my-pod-6d7b698b5f-z6p2c 1/1 Terminating 0 47m
my-pod-6d7b698b5f-zplp6 1/1 Terminating 0 62m
14:22:43.678937 8 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: my-pod-6d7b698b5f-dn26n
14:22:43.679073 8 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"prod", Name:"my-pod-6d7b698b5f-dn26n", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TaintManagerEviction' Marking for deletion Pod prod/my-pod-6d7b698b5f-dn26n
# kubectl -n prod get pod my-pod-6d7b698b5f-8b47r -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
checksum/config: bcdc41c616f736849a6bef9c726eec9bf704ce7d2c61736005a6fedda0ee14d0
kubernetes.io/psp: eks.privileged
creationTimestamp: "2019-10-25T14:09:17Z"
deletionGracePeriodSeconds: 172800
deletionTimestamp: "2019-10-27T14:20:40Z"
generateName: my-pod-6d7b698b5f-
labels:
app.kubernetes.io/instance: my-pod
app.kubernetes.io/name: my-pod
pod-template-hash: 6d7b698b5f
name: my-pod-6d7b698b5f-8b47r
namespace: prod
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: my-pod-6d7b698b5f
uid: c6360643-f6a6-11e9-9459-12ff96456b32
resourceVersion: "2408256"
selfLink: /api/v1/namespaces/prod/pods/my-pod-6d7b698b5f-8b47r
uid: 08197175-f731-11e9-9459-12ff96456b32
spec:
containers:
- args:
- -c
- from time import sleep; sleep(10000)
command:
- python
envFrom:
- secretRef:
name: pix4d
- secretRef:
name: rabbitmq
image: python:3.7-buster
imagePullPolicy: Always
name: my-pod
ports:
- containerPort: 5000
name: http
protocol: TCP
resources:
requests:
cpu: "3"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-gv6q5
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-10-142-54-235.ec2.internal
nodeSelector:
nodepool: zeroscaling-gpu-accelerated-p2-xlarge
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 172800
tolerations:
- key: specialized
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 600
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 600
volumes:
- name: default-token-gv6q5
secret:
defaultMode: 420
secretName: default-token-gv6q5
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2019-10-25T14:10:40Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2019-10-25T14:11:09Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2019-10-25T14:11:09Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2019-10-25T14:10:40Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://15e2e658c459a91a86573c1096931fa4ac345e06f26652da2a58dc3e3b3d5aa2
image: python:3.7-buster
imageID: docker-pullable://python#sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
lastState: {}
name: my-pod
ready: true
restartCount: 0
state:
running:
startedAt: "2019-10-25T14:11:09Z"
hostIP: 10.142.54.235
phase: Running
podIP: 10.142.63.233
qosClass: Burstable
startTime: "2019-10-25T14:10:40Z"
# kubectl -n prod describe pod my-pod-6d7b698b5f-8b47r
Name: my-pod-6d7b698b5f-8b47r
Namespace: prod
Priority: 0
PriorityClassName: <none>
Node: ip-10-142-54-235.ec2.internal/10.142.54.235
Start Time: Fri, 25 Oct 2019 10:10:40 -0400
Labels: app.kubernetes.io/instance=my-pod
app.kubernetes.io/name=my-pod
pod-template-hash=6d7b698b5f
Annotations: checksum/config: bcdc41c616f736849a6bef9c726eec9bf704ce7d2c61736005a6fedda0ee14d0
kubernetes.io/psp: eks.privileged
Status: Terminating (lasts 47h)
Termination Grace Period: 172800s
IP: 10.142.63.233
Controlled By: ReplicaSet/my-pod-6d7b698b5f
Containers:
my-pod:
Container ID: docker://15e2e658c459a91a86573c1096931fa4ac345e06f26652da2a58dc3e3b3d5aa2
Image: python:3.7-buster
Image ID: docker-pullable://python#sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
Port: 5000/TCP
Host Port: 0/TCP
Command:
python
Args:
-c
from time import sleep; sleep(10000)
State: Running
Started: Fri, 25 Oct 2019 10:11:09 -0400
Ready: True
Restart Count: 0
Requests:
cpu: 3
Environment Variables from:
pix4d Secret Optional: false
rabbitmq Secret Optional: false
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-gv6q5 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-gv6q5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-gv6q5
Optional: false
QoS Class: Burstable
Node-Selectors: nodepool=zeroscaling-gpu-accelerated-p2-xlarge
Tolerations: node.kubernetes.io/not-ready:NoExecute for 600s
node.kubernetes.io/unreachable:NoExecute for 600s
specialized
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12m (x2 over 12m) default-scheduler 0/13 nodes are available: 1 Insufficient pods, 13 Insufficient cpu, 6 node(s) didn't match node selector.
Normal TriggeredScaleUp 12m cluster-autoscaler pod triggered scale-up: [{prod-worker-gpu-accelerated-p2-xlarge 7->8 (max: 13)}]
Warning FailedScheduling 11m (x5 over 11m) default-scheduler 0/14 nodes are available: 1 Insufficient pods, 1 node(s) had taints that the pod didn't tolerate, 13 Insufficient cpu, 6 node(s) didn't match node selector.
Normal Scheduled 11m default-scheduler Successfully assigned prod/my-pod-6d7b698b5f-8b47r to ip-10-142-54-235.ec2.internal
Normal Pulling 11m kubelet, ip-10-142-54-235.ec2.internal pulling image "python:3.7-buster"
Normal Pulled 10m kubelet, ip-10-142-54-235.ec2.internal Successfully pulled image "python:3.7-buster"
Normal Created 10m kubelet, ip-10-142-54-235.ec2.internal Created container
Normal Started 10m kubelet, ip-10-142-54-235.ec2.internal Started container
# kubectl -n prod describe node ip-10-142-54-235.ec2.internal
Name: ip-10-142-54-235.ec2.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=p2.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1b
kubernetes.io/hostname=ip-10-142-54-235.ec2.internal
nodepool=zeroscaling-gpu-accelerated-p2-xlarge
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 25 Oct 2019 10:10:20 -0400
Taints: specialized=true:NoExecute
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 25 Oct 2019 10:23:11 -0400 Fri, 25 Oct 2019 10:10:19 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 25 Oct 2019 10:23:11 -0400 Fri, 25 Oct 2019 10:10:19 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 25 Oct 2019 10:23:11 -0400 Fri, 25 Oct 2019 10:10:19 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 25 Oct 2019 10:23:11 -0400 Fri, 25 Oct 2019 10:10:40 -0400 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.142.54.235
ExternalIP: 3.86.112.24
Hostname: ip-10-142-54-235.ec2.internal
InternalDNS: ip-10-142-54-235.ec2.internal
ExternalDNS: ec2-3-86-112-24.compute-1.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 209702892Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 62872868Ki
pods: 58
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 200777747706
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 61209892Ki
pods: 58
System Info:
Machine ID: 0e76fec3e06d41a6bf2c49a18fbe1795
System UUID: EC29973A-D616-F673-6899-A96C97D5AE2D
Boot ID: 4bc510b6-f615-48a7-9e1e-47261ddf26a4
Kernel Version: 4.14.146-119.123.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.1
Kubelet Version: v1.13.11-eks-5876d6
Kube-Proxy Version: v1.13.11-eks-5876d6
ProviderID: aws:///us-east-1b/i-0f5b519aa6e38e04a
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
amazon-cloudwatch cloudwatch-agent-4d24j 50m (1%) 250m (6%) 50Mi (0%) 250Mi (0%) 12m
amazon-cloudwatch fluentd-cloudwatch-wkslq 50m (1%) 0 (0%) 150Mi (0%) 300Mi (0%) 12m
prod my-pod-6d7b698b5f-8b47r 3 (75%) 0 (0%) 0 (0%) 0 (0%) 14m
kube-system aws-node-6nr6g 10m (0%) 0 (0%) 0 (0%) 0 (0%) 13m
kube-system kube-proxy-wf8k4 100m (2%) 0 (0%) 0 (0%) 0 (0%) 13m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3210m (80%) 250m (6%)
memory 200Mi (0%) 550Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 13m kubelet, ip-10-142-54-235.ec2.internal Starting kubelet.
Normal NodeHasSufficientMemory 13m (x2 over 13m) kubelet, ip-10-142-54-235.ec2.internal Node ip-10-142-54-235.ec2.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 13m (x2 over 13m) kubelet, ip-10-142-54-235.ec2.internal Node ip-10-142-54-235.ec2.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 13m (x2 over 13m) kubelet, ip-10-142-54-235.ec2.internal Node ip-10-142-54-235.ec2.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 13m kubelet, ip-10-142-54-235.ec2.internal Updated Node Allocatable limit across pods
Normal Starting 12m kube-proxy, ip-10-142-54-235.ec2.internal Starting kube-proxy.
Normal NodeReady 12m kubelet, ip-10-142-54-235.ec2.internal Node ip-10-142-54-235.ec2.internal status is now: NodeReady
# kubectl get node ip-10-142-54-235.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
annotations:
node.alpha.kubernetes.io/ttl: "0"
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2019-10-25T14:10:20Z"
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: p2.xlarge
beta.kubernetes.io/os: linux
failure-domain.beta.kubernetes.io/region: us-east-1
failure-domain.beta.kubernetes.io/zone: us-east-1b
kubernetes.io/hostname: ip-10-142-54-235.ec2.internal
nodepool: zeroscaling-gpu-accelerated-p2-xlarge
name: ip-10-142-54-235.ec2.internal
resourceVersion: "2409195"
selfLink: /api/v1/nodes/ip-10-142-54-235.ec2.internal
uid: 2d934979-f731-11e9-89b8-0234143df588
spec:
providerID: aws:///us-east-1b/i-0f5b519aa6e38e04a
taints:
- effect: NoExecute
key: specialized
value: "true"
status:
addresses:
- address: 10.142.54.235
type: InternalIP
- address: 3.86.112.24
type: ExternalIP
- address: ip-10-142-54-235.ec2.internal
type: Hostname
- address: ip-10-142-54-235.ec2.internal
type: InternalDNS
- address: ec2-3-86-112-24.compute-1.amazonaws.com
type: ExternalDNS
allocatable:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: "200777747706"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 61209892Ki
pods: "58"
capacity:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: 209702892Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 62872868Ki
pods: "58"
conditions:
- lastHeartbeatTime: "2019-10-25T14:23:51Z"
lastTransitionTime: "2019-10-25T14:10:19Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2019-10-25T14:23:51Z"
lastTransitionTime: "2019-10-25T14:10:19Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2019-10-25T14:23:51Z"
lastTransitionTime: "2019-10-25T14:10:19Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2019-10-25T14:23:51Z"
lastTransitionTime: "2019-10-25T14:10:40Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
images:
- names:
- python#sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
- python:3.7-buster
sizeBytes: 917672801
- names:
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni#sha256:5b7e7435f88a86bbbdb2a5ecd61e893dc14dd13c9511dc8ace362d299259700a
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.5.4
sizeBytes: 290739356
- names:
- fluent/fluentd-kubernetes-daemonset#sha256:582770d951f81e0971e852089239ced0186e0bdc3226daf16b99ca4cc22de4f7
- fluent/fluentd-kubernetes-daemonset:v1.3.3-debian-cloudwatch-1.4
sizeBytes: 261867521
- names:
- amazon/cloudwatch-agent#sha256:877106acbc56e747ebe373548c88cd37274f666ca11b5c782211db4c5c7fb64b
- amazon/cloudwatch-agent:latest
sizeBytes: 131360039
- names:
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy#sha256:4767b441ddc424b0ea63c305b79be154f65fb15ebefe8a3b2832ce55aa6de2f0
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.13.8
sizeBytes: 80183964
- names:
- busybox#sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e
- busybox:latest
sizeBytes: 1219782
- names:
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64#sha256:bea77c323c47f7b573355516acf927691182d1333333d1f41b7544012fab7adf
- 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1
sizeBytes: 742472
nodeInfo:
architecture: amd64
bootID: 4bc510b6-f615-48a7-9e1e-47261ddf26a4
containerRuntimeVersion: docker://18.6.1
kernelVersion: 4.14.146-119.123.amzn2.x86_64
kubeProxyVersion: v1.13.11-eks-5876d6
kubeletVersion: v1.13.11-eks-5876d6
machineID: 0e76fec3e06d41a6bf2c49a18fbe1795
operatingSystem: linux
osImage: Amazon Linux 2
systemUUID: EC29973A-D616-F673-6899-A96C97D5AE2D
Unfortunately, I don't have an exact answer to your issue, but I may have a workaround.
I think I had the same issue with an Amazon EKS cluster, version 1.13.11: my pod triggered a node scale-up, was scheduled, ran for 300s and then was evicted:
74m Normal TaintManagerEviction pod/master-3bb760a7-b782-4138-b09f-0ca385db9ad7-workspace Marking for deletion Pod project-beta/master-3bb760a7-b782-4138-b09f-0ca385db9ad7-workspace
Interestingly, the same pod ran with no problem if it was scheduled on an existing node rather than a freshly created one.
From my investigation, it really looks like an issue with this specific Kubernetes version, maybe an edge case of the TaintBasedEvictions feature (I think it was enabled by default in Kubernetes 1.13).
To "fix" this issue I updated the cluster to version 1.14. After that, the mysterious pod evictions did not happen anymore.
So, if it's possible for you, I suggest updating your cluster to version 1.14 (together with the cluster-autoscaler).
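To narrow down which taint is being acted on, it can also help to compare the node's taints with the pod's tolerations at the moment the eviction event fires; a hedged example, reusing the node and pod names from the question:

kubectl get node ip-10-142-54-235.ec2.internal -o jsonpath='{.spec.taints}'
kubectl -n prod get pod my-pod-6d7b698b5f-8b47r -o jsonpath='{.spec.tolerations}'
kubectl get events --all-namespaces --field-selector reason=TaintManagerEviction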

cluster-autoscaler and dns-controller continuously evicting

I have just terminated an AWS K8s node.
K8s created a new one and scheduled new pods on it. Everything seems good so far.
But when I do:
kubectl get po -A
I get:
kube-system cluster-autoscaler-648b4df947-42hxv 0/1 Evicted 0 3m53s
kube-system cluster-autoscaler-648b4df947-45pcc 0/1 Evicted 0 47m
kube-system cluster-autoscaler-648b4df947-46w6h 0/1 Evicted 0 91m
kube-system cluster-autoscaler-648b4df947-4tlbl 0/1 Evicted 0 69m
kube-system cluster-autoscaler-648b4df947-52295 0/1 Evicted 0 3m54s
kube-system cluster-autoscaler-648b4df947-55wzb 0/1 Evicted 0 83m
kube-system cluster-autoscaler-648b4df947-57kv5 0/1 Evicted 0 107m
kube-system cluster-autoscaler-648b4df947-69rsl 0/1 Evicted 0 98m
kube-system cluster-autoscaler-648b4df947-6msx2 0/1 Evicted 0 11m
kube-system cluster-autoscaler-648b4df947-6pphs 0 18m
kube-system dns-controller-697f6d9457-zswm8 0/1 Evicted 0 54m
When I do:
kubectl describe pod -n kube-system dns-controller-697f6d9457-zswm8
I get:
➜ monitoring git:(master) ✗ kubectl describe pod -n kube-system dns-controller-697f6d9457-zswm8
Name: dns-controller-697f6d9457-zswm8
Namespace: kube-system
Priority: 0
Node: ip-172-20-57-13.eu-west-3.compute.internal/
Start Time: Mon, 07 Oct 2019 12:35:06 +0200
Labels: k8s-addon=dns-controller.addons.k8s.io
k8s-app=dns-controller
pod-template-hash=697f6d9457
version=v1.12.0
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Container dns-controller was using 48Ki, which exceeds its request of 0.
IP:
IPs: <none>
Controlled By: ReplicaSet/dns-controller-697f6d9457
Containers:
dns-controller:
Image: kope/dns-controller:1.12.0
Port: <none>
Host Port: <none>
Command:
/usr/bin/dns-controller
--watch-ingress=false
--dns=aws-route53
--zone=*/ZDOYTALGJJXCM
--zone=*/*
-v=2
Requests:
cpu: 50m
memory: 50Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from dns-controller-token-gvxxd (ro)
Volumes:
dns-controller-token-gvxxd:
Type: Secret (a volume populated by a Secret)
SecretName: dns-controller-token-gvxxd
Optional: false
QoS Class: Burstable
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Evicted 59m kubelet, ip-172-20-57-13.eu-west-3.compute.internal The node was low on resource: ephemeral-storage. Container dns-controller was using 48Ki, which exceeds its request of 0.
Normal Killing 59m kubelet, ip-172-20-57-13.eu-west-3.compute.internal Killing container with id docker://dns-controller:Need to kill Pod
And:
➜ monitoring git:(master) ✗ kubectl describe pod -n kube-system cluster-autoscaler-648b4df947-2zcrz
Name: cluster-autoscaler-648b4df947-2zcrz
Namespace: kube-system
Priority: 0
Node: ip-172-20-57-13.eu-west-3.compute.internal/
Start Time: Mon, 07 Oct 2019 13:26:26 +0200
Labels: app=cluster-autoscaler
k8s-addon=cluster-autoscaler.addons.k8s.io
pod-template-hash=648b4df947
Annotations: prometheus.io/port: 8085
prometheus.io/scrape: true
scheduler.alpha.kubernetes.io/tolerations: [{"key":"dedicated", "value":"master"}]
Status: Failed
Reason: Evicted
Message: Pod The node was low on resource: [DiskPressure].
IP:
IPs: <none>
Controlled By: ReplicaSet/cluster-autoscaler-648b4df947
Containers:
cluster-autoscaler:
Image: gcr.io/google-containers/cluster-autoscaler:v1.15.1
Port: <none>
Host Port: <none>
Command:
./cluster-autoscaler
--v=4
--stderrthreshold=info
--cloud-provider=aws
--skip-nodes-with-local-storage=false
--nodes=0:1:pamela-nodes.k8s-prod.sunchain.fr
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 100m
memory: 300Mi
Liveness: http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
AWS_REGION: eu-west-3
Mounts:
/etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-hld2m (ro)
Volumes:
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs/ca-certificates.crt
HostPathType:
cluster-autoscaler-token-hld2m:
Type: Secret (a volume populated by a Secret)
SecretName: cluster-autoscaler-token-hld2m
Optional: false
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/role=master
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned kube-system/cluster-autoscaler-648b4df947-2zcrz to ip-172-20-57-13.eu-west-3.compute.internal
Warning Evicted 11m kubelet, ip-172-20-57-13.eu-west-3.compute.internal The node was low on resource: [DiskPressure].
It seems to be a resource issue. The weird thing is that before I killed my EC2 instance, I didn't have this issue.
Why is it happening and what should I do? Do I really have to add more resources?
➜ scripts kubectl describe node ip-172-20-57-13.eu-west-3.compute.internal
Name: ip-172-20-57-13.eu-west-3.compute.internal
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.small
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-west-3
failure-domain.beta.kubernetes.io/zone=eu-west-3a
kops.k8s.io/instancegroup=master-eu-west-3a
kubernetes.io/hostname=ip-172-20-57-13.eu-west-3.compute.internal
kubernetes.io/role=master
node-role.kubernetes.io/master=
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 28 Aug 2019 09:38:09 +0200
Taints: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 28 Aug 2019 09:38:36 +0200 Wed, 28 Aug 2019 09:38:36 +0200 RouteCreated RouteController created a route
OutOfDisk False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Mon, 07 Oct 2019 14:14:32 +0200 Mon, 07 Oct 2019 14:11:02 +0200 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:35 +0200 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.20.57.13
ExternalIP: 35.180.187.101
InternalDNS: ip-172-20-57-13.eu-west-3.compute.internal
Hostname: ip-172-20-57-13.eu-west-3.compute.internal
ExternalDNS: ec2-35-180-187-101.eu-west-3.compute.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 7797156Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2013540Ki
pods: 110
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 7185858958
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1911140Ki
pods: 110
System Info:
Machine ID: ec2b3aa5df0e3ad288d210f309565f06
System UUID: EC2B3AA5-DF0E-3AD2-88D2-10F309565F06
Boot ID: f9d5417b-eba9-4544-9710-a25d01247b46
Kernel Version: 4.9.0-9-amd64
OS Image: Debian GNU/Linux 9 (stretch)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.3
Kubelet Version: v1.12.10
Kube-Proxy Version: v1.12.10
PodCIDR: 100.96.1.0/24
ProviderID: aws:///eu-west-3a/i-03bf1b26313679d65
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system etcd-manager-events-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 100Mi (5%) 0 (0%) 40d
kube-system etcd-manager-main-ip-172-20-57-13.eu-west-3.compute.internal 200m (10%) 0 (0%) 100Mi (5%) 0 (0%) 40d
kube-system kube-apiserver-ip-172-20-57-13.eu-west-3.compute.internal 150m (7%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-controller-manager-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-proxy-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-scheduler-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 750m (37%) 0 (0%)
memory 200Mi (10%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeHasNoDiskPressure 55m (x324 over 40d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal Node ip-172-20-57-13.eu-west-3.compute.internal status is now: NodeHasNoDiskPressure
Warning EvictionThresholdMet 10m (x1809 over 16d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal Attempting to reclaim ephemeral-storage
Warning ImageGCFailed 4m30s (x6003 over 23d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal (combined from similar events): wanted to free 652348620 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29
I think a better command to debug it is:
devops git:(master) ✗ kubectl get events --sort-by=.metadata.creationTimestamp -o wide
LAST SEEN TYPE REASON KIND SOURCE MESSAGE SUBOBJECT FIRST SEEN COUNT NAME
10m Warning ImageGCFailed Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal (combined from similar events): wanted to free 653307084 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29 23d 6004 ip-172-20-57-13.eu-west-3.compute.internal.15c4124e15eb1d33
2m59s Warning ImageGCFailed Node kubelet, ip-172-20-36-135.eu-west-3.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 639524044 bytes, but freed 0 bytes 7d9h 2089 ip-172-20-36-135.eu-west-3.compute.internal.15c916d24afe2c25
4m59s Warning ImageGCFailed Node kubelet, ip-172-20-33-81.eu-west-3.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 458296524 bytes, but freed 0 bytes 4d14h 1183 ip-172-20-33-81.eu-west-3.compute.internal.15c9f3fe4e1525ec
6m43s Warning EvictionThresholdMet Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal Attempting to reclaim ephemeral-storage 16d 1841 ip-172-20-57-13.eu-west-3.compute.internal.15c66e349b761219
41s Normal NodeHasNoDiskPressure Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal Node ip-172-20-57-13.eu-west-3.compute.internal status is now: NodeHasNoDiskPressure 40d 333 ip-172-20-57-13.eu-west-3.compute.internal.15bf05cec37981b6
Now df -h
admin#ip-172-20-57-13:/var/log$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 972M 0 972M 0% /dev
tmpfs 197M 2.3M 195M 2% /run
/dev/nvme0n1p2 7.5G 6.4G 707M 91% /
tmpfs 984M 0 984M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 984M 0 984M 0% /sys/fs/cgroup
/dev/nvme1n1 20G 430M 20G 3% /mnt/master-vol-09618123eb79d92c8
/dev/nvme2n1 20G 229M 20G 2% /mnt/master-vol-05c9684f0edcbd876
It looks like your master node is running low on storage: the df -h output shows the root filesystem at 91% usage, with less than 1 GB of ephemeral storage left.
You should free up some space on the node and master; that should get rid of your problem.
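A couple of commands that typically help in this situation, as a hedged sketch (run the docker command on the node itself, and review what prune will remove before confirming):

# on the node: remove unused containers/images that image GC could not reclaim
docker system prune -a

# from kubectl: delete the Failed/Evicted pod records so they stop cluttering kubectl get po
kubectl -n kube-system delete pod --field-selector=status.phase=Failed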