Getting "The node was low on resource: ephemeral-storage"? - kubernetes

I am trying to understand the master/node deployment concept on labs.play-with-k8s.com (https://labs.play-with-k8s.com/).
I have two nodes and one master.
The pod has the following configuration:
node1 ~]$ kubectl describe pod myapp-7f4dffc449-qh7pk
Name: myapp-7f4dffc449-qh7pk
Namespace: default
Priority: 0
Node: node3/192.168.0.16
Start Time: Tue, 07 Feb 2023 12:31:23 +0000
Labels: app=myapp
pod-template-hash=7f4dffc449
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/myapp-7f4dffc449
Containers:
myapp:
Container ID:
Image: changan1111/newdocker:latest
Image ID:
Port: 3000/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 1Gi
memory: 1Gi
Requests:
cpu: 500m
ephemeral-storage: 1Gi
memory: 1Gi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-t4nf7 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-t4nf7:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-t4nf7
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned default/myapp-7f4dffc449-qh7pk to node3
Normal Pulling 31s kubelet Pulling image "changan1111/newdocker:latest"
Warning Evicted 25s kubelet The node was low on resource: ephemeral-storage.
Warning ExceededGracePeriod 15s kubelet Container runtime did not kill the pod within specified grace period.
My yaml file is here: https://raw.githubusercontent.com/changan1111/UserManagement/main/kube/kube.yaml
I don't see anything wrong with it, but I am still getting "The node was low on resource: ephemeral-storage".
How do I resolve this?
Disk Usage:
overlay 10G 130M 9.9G 2% /
tmpfs 64M 0 64M 0% /dev
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sdb 64G 29G 36G 44% /etc/hosts
shm 64M 0 64M 0% /dev/shm
shm 64M 0 64M 0% /var/lib/docker/containers/403c120b0dd0909bd34e66d86c58fba18cd71468269e1aaa66e3244d331c3a1e/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/56dd63dad42dd26baba8610f70f1a0bd22fdaea36742c32deca3c196ce181851/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/50c4585ae8cc63de9077c1a58da67cc348c86a6643ca21a06b8998f94a2a2daf/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/6e9529ad6e6a836e77b17c713679abddf861fdc0e86946484dc2ec68a00ca2ff/mounts/shm
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/8e56095e-b0ec-4f13-a022-d29d04897410/volumes/kubernetes.io~secret/kube-proxy-token-j7sl8
shm 64M 0 64M 0% /var/lib/docker/containers/2b84d6dfebd4ea0c379588985cd43b623004632e71d63d07a39d521ddf694e8e/mounts/shm
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/1271ca18-97d0-48d2-9280-68eb8c57795f/volumes/kubernetes.io~secret/kube-router-token-rmpqv
shm 64M 0 64M 0% /var/lib/docker/containers/c4506095bf36356790795353862fc13b759d72af8edc0e4233341f2d3234fa02/mounts/shm
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/39885a73-d724-4be8-a9cf-3de8756c5b0c/volumes/kubernetes.io~secret/coredns-token-ckxbw
tmpfs 16G 12K 16G 1% /var/lib/kubelet/pods/8f137411-3af6-4e44-8be4-3e4f79570531/volumes/kubernetes.io~secret/coredns-token-ckxbw
shm 64M 0 64M 0% /var/lib/docker/containers/c32431f8e77652686f58e91aff01d211a5e0fb798f664ba675715005ee2cd5b0/mounts/shm
shm 64M 0 64M 0% /var/lib/docker/containers/3e284dd5f9b321301647eeb42f9dd82e81eb78aadcf9db7b5a6a3419504aa0e9/mount
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m16s default-scheduler Successfully assigned default/myapp-b5856bb-4znkj to node4
Normal Pulling 3m15s kubelet Pulling image "changan1111/newdocker:latest"
Normal Pulled 83s kubelet Successfully pulled image "changan1111/newdocker:latest" in 1m51.97169753s
Normal Created 28s kubelet Created container myapp
Normal Started 27s kubelet Started container myapp
Warning Evicted 1s kubelet Pod ephemeral local storage usage exceeds the total limit of containers 500Mi.
Normal Killing 1s kubelet Stopping container myapp
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      imagePullSecrets:
        - name: dockercreds
      containers:
        - name: myapp
          image: changan1111/newdocker:latest
          resources:
            limits:
              memory: "2Gi"
              cpu: "500m"
              ephemeral-storage: "2Gi"
            requests:
              ephemeral-storage: "1Gi"
              cpu: "500m"
              memory: "1Gi"
          ports:
            - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
      nodePort: 31110
  type: LoadBalancer

Worker nodes may be running out of disk space, in which case you should see something like "no space left on device" or "The node was low on resource: ephemeral-storage".
The mitigation is to specify a larger disk size for the node VMs when the cluster is created (e.g. during Composer environment creation if you are on a managed service).
Pod eviction and scheduling problems are side effects of Kubernetes limits and requests, usually caused by a lack of planning. See Understanding Kubernetes pod evicted and scheduling problems for more information.
Refer to the similar SO question on how to set a quota (limits.ephemeral-storage, requests.ephemeral-storage) to cap this, as otherwise any container can write any amount of data to its node's filesystem.
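For example, a namespace-level ResourceQuota along these lines caps how much ephemeral storage all pods in the namespace may request and use (a minimal sketch; the name and values are placeholders to adjust for your cluster):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-storage-quota   # placeholder name
  namespace: default
spec:
  hard:
    requests.ephemeral-storage: 2Gi   # total ephemeral-storage requests allowed in the namespace
    limits.ephemeral-storage: 4Gi     # total ephemeral-storage limits allowed in the namespace
Once such a quota is in place, the quota system also expects pods in that namespace to declare ephemeral-storage requests/limits, which keeps any single container from quietly filling the node's disk.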
Regarding the other warning, "Pod ephemeral local storage usage exceeds the total limit of containers 500Mi": this happens because you're putting an upper limit on ephemeral-storage usage by setting resources.limits.ephemeral-storage to 500Mi. Try removing limits.ephemeral-storage if that is safe, or change the value depending on your requirement.
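In the Deployment above that would mean a resources block along these lines (an illustrative fragment; size the ephemeral-storage limit from what the container actually writes to its writable layer, emptyDir volumes, and logs):
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
    ephemeral-storage: "1Gi"
  limits:
    cpu: "500m"
    memory: "2Gi"
    ephemeral-storage: "4Gi"   # raised; or drop this line entirely if no cap is needed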
Also see How to determine kubernetes pod ephemeral storage request and limit and how to Avoid running out of ephemeral storage space on your Kubernetes worker Nodes for more information.

Related

Minio Kubernetes installation no memory error

I am trying to install MinIO storage on my local Kubernetes cluster.
I am following the link below, but I am facing a "no memory" error with all types of install.
I am not sure how to set up the PersistentVolume in my case.
https://github.com/minio/operator/blob/master/README.md
I am trying to create a PersistentVolume so that enough space will be available at the path I am selecting:
cat pv.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

kubectl create -f pv.yaml

kubectl get sc
NAME                 PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
hostpath (default)   docker.io/hostpath             Delete          Immediate              false                  131m
local-storage        kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  56m
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-node
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storage-class: local-storage
  local:
    path: /mnt/d/minio
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - docker-desktop

kubectl create -f pvc.yaml
error: error parsing pvc.yaml: error converting YAML to JSON: yaml: line 8: mapping values are not allowed in this context
:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
docker-desktop Ready control-plane,master 126m v1.21.2
:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-558bd4d5db-j72z4 1/1 Running 1 128m
kube-system coredns-558bd4d5db-vw98z 1/1 Running 1 128m
kube-system etcd-docker-desktop 1/1 Running 1 128m
kube-system kube-apiserver-docker-desktop 1/1 Running 1 128m
kube-system kube-controller-manager-docker-desktop 1/1 Running 1 128m
kube-system kube-proxy-tqfnr 1/1 Running 1 128m
kube-system kube-scheduler-docker-desktop 1/1 Running 1 128m
kube-system storage-provisioner 1/1 Running 2 127m
kube-system vpnkit-controller 1/1 Running 12 127m
minio-operator console-6b6cf8946c-vxcqh 1/1 Running 0 76m
minio-operator minio-operator-69fd675557-s62nl 1/1 Running 0 76m
:/$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 251G 1.9G 237G 1% /
tmpfs 6.2G 401M 5.8G 7% /mnt/wsl
tools 477G 69G 409G 15% /init
none 6.1G 0 6.1G 0% /dev
none 6.2G 12K 6.2G 1% /run
none 6.2G 0 6.2G 0% /run/lock
none 6.2G 0 6.2G 0% /run/shm
none 6.2G 0 6.2G 0% /run/user
tmpfs 6.2G 0 6.2G 0% /sys/fs/cgroup
C:\ 477G 69G 409G 15% /mnt/c
D:\ 932G 132M 932G 1% /mnt/d
/dev/sdd 251G 2.7G 236G 2% /mnt/wsl/docker-desktop-data/isocache
none 6.2G 12K 6.2G 1% /mnt/wsl/docker-desktop/shared-sockets/host-services
/dev/sdc 251G 132M 239G 1% /mnt/wsl/docker-desktop/docker-desktop-proxy
/dev/loop0 396M 396M 0 100% /mnt/wsl/docker-desktop/cli-tools
I believe creating a PersistentVolume, using it in a namespace, and then using that namespace while creating a tenant should solve this issue. But I am stuck with the error that no memory is available.
As per the code:
if (memReqSize < minMemReq) {
  return {
    error: "The requested memory size must be greater than 2Gi",
    request: 0,
    limit: 0,
  };
}
You need 2 GB of RAM per node. Since you have 4 nodes, you need 8 GB of RAM for MinIO alone. It's likely that you don't have enough RAM to run this.
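In Kubernetes terms, that check roughly translates into a memory request of at least 2Gi per MinIO server pod, i.e. something like this fragment in each pod spec (an illustrative sketch, not the operator's exact output):
resources:
  requests:
    memory: 2Gi   # minimum enforced per MinIO server by the console check above
With four servers that adds up to 8Gi of memory requests, which the scheduler can only place if the nodes' allocatable memory covers it, so either give the Docker Desktop/WSL2 VM more RAM or create a smaller tenant.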

why the kubernetes dashboard pod is always pending

I checked the cluster info and found that the kubernetes-dashboard pod is pending:
[root@ops001 data]# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default my-nginx-756fb87568-rcdsm 0/1 Pending 0 81d
default my-nginx-756fb87568-vtf46 0/1 Pending 0 81d
default soa-room-service-768cfd68d-5zxgd 0/1 Pending 0 81d
kube-system coredns-89764d78c-mbcbz 0/1 Pending 0 123d
kube-system kubernetes-dashboard-74d7cc788-8fggl 0/1 Pending 0 15d
kube-system kubernetes-dashboard-74d7cc788-mk9c7 0/1 UnexpectedAdmissionError 0 123d
Is this a lack of resources? This is the detailed output:
[root@ops001 ~]# kubectl describe pod kubernetes-dashboard-74d7cc788-8fggl --namespace kube-system
Name: kubernetes-dashboard-74d7cc788-8fggl
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: <none>
Labels: k8s-app=kubernetes-dashboard
pod-template-hash=74d7cc788
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
seccomp.security.alpha.kubernetes.io/pod: docker/default
Status: Pending
IP:
Controlled By: ReplicaSet/kubernetes-dashboard-74d7cc788
Containers:
kubernetes-dashboard:
Image: gcr.azk8s.cn/google_containers/kubernetes-dashboard-amd64:v1.10.1
Port: 8443/TCP
Host Port: 0/TCP
Args:
--auto-generate-certificates
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 50m
memory: 100Mi
Liveness: http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/certs from kubernetes-dashboard-certs (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kubernetes-dashboard-token-pmxpf (ro)
Volumes:
kubernetes-dashboard-certs:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-certs
Optional: false
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kubernetes-dashboard-token-pmxpf:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-token-pmxpf
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node.kubernetes.io/not-ready:NoExecute for 360s
node.kubernetes.io/unreachable:NoExecute for 360s
Events: <none>
This is the node's top output:
[root@ops001 ~]# top
top - 23:45:57 up 244 days, 5:56, 7 users, load average: 3.45, 2.93, 3.77
Tasks: 245 total, 1 running, 244 sleeping, 0 stopped, 0 zombie
%Cpu(s): 38.6 us, 8.4 sy, 0.0 ni, 49.2 id, 3.4 wa, 0.0 hi, 0.4 si, 0.0 st
KiB Mem : 16266412 total, 3963688 free, 5617380 used, 6685344 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 10228760 avail Mem
The kube-scheduler service is up. I have no idea what is going wrong.
From what I can see, you have all pods in the Pending state, even coredns. This is the main reason why the dashboard doesn't work.
I would focus on dealing with that first; for this I'd recommend checking Troubleshooting kubeadm.
It will tell you to install a networking add-on, which can be found here.
You can also have a look at the question Kube-dns always in pending state.

cluster-autoscaler and dns-controller continuously evicting

I have just terminated an AWS K8s node.
K8s recreated a new one and installed new pods. Everything seems good so far.
But when I do:
kubectl get po -A
I get:
kube-system cluster-autoscaler-648b4df947-42hxv 0/1 Evicted 0 3m53s
kube-system cluster-autoscaler-648b4df947-45pcc 0/1 Evicted 0 47m
kube-system cluster-autoscaler-648b4df947-46w6h 0/1 Evicted 0 91m
kube-system cluster-autoscaler-648b4df947-4tlbl 0/1 Evicted 0 69m
kube-system cluster-autoscaler-648b4df947-52295 0/1 Evicted 0 3m54s
kube-system cluster-autoscaler-648b4df947-55wzb 0/1 Evicted 0 83m
kube-system cluster-autoscaler-648b4df947-57kv5 0/1 Evicted 0 107m
kube-system cluster-autoscaler-648b4df947-69rsl 0/1 Evicted 0 98m
kube-system cluster-autoscaler-648b4df947-6msx2 0/1 Evicted 0 11m
kube-system cluster-autoscaler-648b4df947-6pphs 0 18m
kube-system dns-controller-697f6d9457-zswm8 0/1 Evicted 0 54m
When I do:
kubectl describe pod -n kube-system dns-controller-697f6d9457-zswm8
I get:
➜ monitoring git:(master) ✗ kubectl describe pod -n kube-system dns-controller-697f6d9457-zswm8
Name: dns-controller-697f6d9457-zswm8
Namespace: kube-system
Priority: 0
Node: ip-172-20-57-13.eu-west-3.compute.internal/
Start Time: Mon, 07 Oct 2019 12:35:06 +0200
Labels: k8s-addon=dns-controller.addons.k8s.io
k8s-app=dns-controller
pod-template-hash=697f6d9457
version=v1.12.0
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Container dns-controller was using 48Ki, which exceeds its request of 0.
IP:
IPs: <none>
Controlled By: ReplicaSet/dns-controller-697f6d9457
Containers:
dns-controller:
Image: kope/dns-controller:1.12.0
Port: <none>
Host Port: <none>
Command:
/usr/bin/dns-controller
--watch-ingress=false
--dns=aws-route53
--zone=*/ZDOYTALGJJXCM
--zone=*/*
-v=2
Requests:
cpu: 50m
memory: 50Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from dns-controller-token-gvxxd (ro)
Volumes:
dns-controller-token-gvxxd:
Type: Secret (a volume populated by a Secret)
SecretName: dns-controller-token-gvxxd
Optional: false
QoS Class: Burstable
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Evicted 59m kubelet, ip-172-20-57-13.eu-west-3.compute.internal The node was low on resource: ephemeral-storage. Container dns-controller was using 48Ki, which exceeds its request of 0.
Normal Killing 59m kubelet, ip-172-20-57-13.eu-west-3.compute.internal Killing container with id docker://dns-controller:Need to kill Pod
And:
➜ monitoring git:(master) ✗ kubectl describe pod -n kube-system cluster-autoscaler-648b4df947-2zcrz
Name: cluster-autoscaler-648b4df947-2zcrz
Namespace: kube-system
Priority: 0
Node: ip-172-20-57-13.eu-west-3.compute.internal/
Start Time: Mon, 07 Oct 2019 13:26:26 +0200
Labels: app=cluster-autoscaler
k8s-addon=cluster-autoscaler.addons.k8s.io
pod-template-hash=648b4df947
Annotations: prometheus.io/port: 8085
prometheus.io/scrape: true
scheduler.alpha.kubernetes.io/tolerations: [{"key":"dedicated", "value":"master"}]
Status: Failed
Reason: Evicted
Message: Pod The node was low on resource: [DiskPressure].
IP:
IPs: <none>
Controlled By: ReplicaSet/cluster-autoscaler-648b4df947
Containers:
cluster-autoscaler:
Image: gcr.io/google-containers/cluster-autoscaler:v1.15.1
Port: <none>
Host Port: <none>
Command:
./cluster-autoscaler
--v=4
--stderrthreshold=info
--cloud-provider=aws
--skip-nodes-with-local-storage=false
--nodes=0:1:pamela-nodes.k8s-prod.sunchain.fr
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 100m
memory: 300Mi
Liveness: http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
AWS_REGION: eu-west-3
Mounts:
/etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-hld2m (ro)
Volumes:
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs/ca-certificates.crt
HostPathType:
cluster-autoscaler-token-hld2m:
Type: Secret (a volume populated by a Secret)
SecretName: cluster-autoscaler-token-hld2m
Optional: false
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/role=master
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned kube-system/cluster-autoscaler-648b4df947-2zcrz to ip-172-20-57-13.eu-west-3.compute.internal
Warning Evicted 11m kubelet, ip-172-20-57-13.eu-west-3.compute.internal The node was low on resource: [DiskPressure].
It seems to be a resource issue. The weird thing is that before I killed my EC2 instance, I didn't have this issue.
Why is it happening and what should I do? Is it mandatory to add more resources?
➜ scripts kubectl describe node ip-172-20-57-13.eu-west-3.compute.internal
Name: ip-172-20-57-13.eu-west-3.compute.internal
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.small
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-west-3
failure-domain.beta.kubernetes.io/zone=eu-west-3a
kops.k8s.io/instancegroup=master-eu-west-3a
kubernetes.io/hostname=ip-172-20-57-13.eu-west-3.compute.internal
kubernetes.io/role=master
node-role.kubernetes.io/master=
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 28 Aug 2019 09:38:09 +0200
Taints: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 28 Aug 2019 09:38:36 +0200 Wed, 28 Aug 2019 09:38:36 +0200 RouteCreated RouteController created a route
OutOfDisk False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Mon, 07 Oct 2019 14:14:32 +0200 Mon, 07 Oct 2019 14:11:02 +0200 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:35 +0200 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.20.57.13
ExternalIP: 35.180.187.101
InternalDNS: ip-172-20-57-13.eu-west-3.compute.internal
Hostname: ip-172-20-57-13.eu-west-3.compute.internal
ExternalDNS: ec2-35-180-187-101.eu-west-3.compute.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 7797156Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2013540Ki
pods: 110
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 7185858958
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1911140Ki
pods: 110
System Info:
Machine ID: ec2b3aa5df0e3ad288d210f309565f06
System UUID: EC2B3AA5-DF0E-3AD2-88D2-10F309565F06
Boot ID: f9d5417b-eba9-4544-9710-a25d01247b46
Kernel Version: 4.9.0-9-amd64
OS Image: Debian GNU/Linux 9 (stretch)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.3
Kubelet Version: v1.12.10
Kube-Proxy Version: v1.12.10
PodCIDR: 100.96.1.0/24
ProviderID: aws:///eu-west-3a/i-03bf1b26313679d65
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system etcd-manager-events-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 100Mi (5%) 0 (0%) 40d
kube-system etcd-manager-main-ip-172-20-57-13.eu-west-3.compute.internal 200m (10%) 0 (0%) 100Mi (5%) 0 (0%) 40d
kube-system kube-apiserver-ip-172-20-57-13.eu-west-3.compute.internal 150m (7%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-controller-manager-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-proxy-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-scheduler-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 750m (37%) 0 (0%)
memory 200Mi (10%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeHasNoDiskPressure 55m (x324 over 40d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal Node ip-172-20-57-13.eu-west-3.compute.internal status is now: NodeHasNoDiskPressure
Warning EvictionThresholdMet 10m (x1809 over 16d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal Attempting to reclaim ephemeral-storage
Warning ImageGCFailed 4m30s (x6003 over 23d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal (combined from similar events): wanted to free 652348620 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29
I think a better command to debug it is:
devops git:(master) ✗ kubectl get events --sort-by=.metadata.creationTimestamp -o wide
LAST SEEN TYPE REASON KIND SOURCE MESSAGE SUBOBJECT FIRST SEEN COUNT NAME
10m Warning ImageGCFailed Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal (combined from similar events): wanted to free 653307084 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29 23d 6004 ip-172-20-57-13.eu-west-3.compute.internal.15c4124e15eb1d33
2m59s Warning ImageGCFailed Node kubelet, ip-172-20-36-135.eu-west-3.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 639524044 bytes, but freed 0 bytes 7d9h 2089 ip-172-20-36-135.eu-west-3.compute.internal.15c916d24afe2c25
4m59s Warning ImageGCFailed Node kubelet, ip-172-20-33-81.eu-west-3.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 458296524 bytes, but freed 0 bytes 4d14h 1183 ip-172-20-33-81.eu-west-3.compute.internal.15c9f3fe4e1525ec
6m43s Warning EvictionThresholdMet Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal Attempting to reclaim ephemeral-storage 16d 1841 ip-172-20-57-13.eu-west-3.compute.internal.15c66e349b761219
41s Normal NodeHasNoDiskPressure Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal Node ip-172-20-57-13.eu-west-3.compute.internal status is now: NodeHasNoDiskPressure 40d 333 ip-172-20-57-13.eu-west-3.compute.internal.15bf05cec37981b6
Now df -h
admin@ip-172-20-57-13:/var/log$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 972M 0 972M 0% /dev
tmpfs 197M 2.3M 195M 2% /run
/dev/nvme0n1p2 7.5G 6.4G 707M 91% /
tmpfs 984M 0 984M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 984M 0 984M 0% /sys/fs/cgroup
/dev/nvme1n1 20G 430M 20G 3% /mnt/master-vol-09618123eb79d92c8
/dev/nvme2n1 20G 229M 20G 2% /mnt/master-vol-05c9684f0edcbd876
It looks like your master node is running low on storage: the root filesystem is 91% full, with only about 700 MB of ephemeral storage left.
You should free up some space on the node and master; that should get rid of your problem.
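Besides freeing space (unused images, logs, evicted pods), note that the eviction message says the container "was using 48Ki, which exceeds its request of 0": pods whose ephemeral-storage usage exceeds their request are among the first candidates when the kubelet relieves DiskPressure. Giving such pods an explicit request, roughly like this fragment (the values are placeholders), makes them less likely to be picked first:
resources:
  requests:
    ephemeral-storage: "100Mi"   # placeholder; size from what the container actually writes
  limits:
    ephemeral-storage: "500Mi"   # optional placeholder upper bound
That only changes the eviction ordering, though; with a 7.5G root volume at 91% usage and the recurring ImageGCFailed events, the real fix is still to clean up unused images or grow the root disk.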

Cluster-autoscaler not triggering scale-up on Daemonset deployment

I deployed the Datadog agent using the Datadog Helm chart which deploys a Daemonset in Kubernetes. However when checking the state of the Daemonset I saw it was not creating all pods:
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
datadog-agent-datadog 5 2 2 2 2 <none> 1h
When describing the Daemonset to figure out what was going wrong I saw it did not have enough resources:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x5 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 42s (x7 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Normal SuccessfulCreate 42s daemonset-controller Created pod: datadog-agent-7b2kp
However, I have the Cluster-autoscaler installed in the cluster and configured properly (It does trigger on regular Pod deployments that do not have enough resources to schedule), but it does not seem to trigger on the Daemonset:
I0424 14:14:48.545689 1 static_autoscaler.go:273] No schedulable pods
I0424 14:14:48.545700 1 static_autoscaler.go:280] No unschedulable pods
The AutoScalingGroup has enough nodes left.
Did I miss something in the configuration of the Cluster-autoscaler? What can I do to make sure it triggers on Daemonset resources as well?
Edit:
Describe of the Daemonset
Name: datadog-agent
Selector: app=datadog-agent
Node-Selector: <none>
Labels: app=datadog-agent
chart=datadog-1.27.2
heritage=Tiller
release=datadog-agent
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=datadog-agent
Annotations: checksum/autoconf-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
checksum/checksd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
checksum/confd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
Service Account: datadog-agent
Containers:
datadog:
Image: datadog/agent:6.10.1
Port: 8125/UDP
Host Port: 0/UDP
Limits:
cpu: 200m
memory: 256Mi
Requests:
cpu: 200m
memory: 256Mi
Liveness: http-get http://:5555/health delay=15s timeout=5s period=15s #success=1 #failure=6
Environment:
DD_API_KEY: <set to the key 'api-key' in secret 'datadog-secret'> Optional: false
DD_LOG_LEVEL: INFO
KUBERNETES: yes
DD_KUBERNETES_KUBELET_HOST: (v1:status.hostIP)
DD_HEALTH_PORT: 5555
Mounts:
/host/proc from procdir (ro)
/host/sys/fs/cgroup from cgroups (ro)
/var/run/docker.sock from runtimesocket (ro)
/var/run/s6 from s6-run (rw)
Volumes:
runtimesocket:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType:
procdir:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
cgroups:
Type: HostPath (bare host directory volume)
Path: /sys/fs/cgroup
HostPathType:
s6-run:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 33m (x6 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-144.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Normal SuccessfulCreate 33m daemonset-controller Created pod: datadog-agent-7b2kp
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-174.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-3-250.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
You can add a priorityClassName to your DaemonSet that points to a high-priority PriorityClass. Kubernetes will then preempt (remove) other pods in order to run the DaemonSet's pods. If that results in unschedulable pods, cluster-autoscaler should add a node to schedule them on.
See the docs (most examples here are based on them; for some pre-1.14 versions the apiVersion is likely a beta (1.11-1.13) or alpha (1.8-1.10) version instead):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority class for essential pods"
Apply it to your workload:
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: datadog-agent
spec:
  template:
    metadata:
      labels:
        app: datadog-agent
      name: datadog-agent
    spec:
      priorityClassName: high-priority
      serviceAccountName: datadog-agent
      containers:
        - image: datadog/agent:latest
          ############ Rest of template goes here
You should understand how cluster-autoscaler works. It is only responsible for adding or removing nodes; it is not responsible for creating or destroying pods. So in your case cluster-autoscaler doesn't help: even if it added one more node, the DaemonSet pods would still need to run on the existing nodes that don't have enough CPU. That's why it is not adding nodes.
What you should do is manually remove some pods from the occupied nodes. Then the DaemonSet pods can be scheduled.
Alternatively, you can reduce the CPU requests of the Datadog agent to, for example, 100m or 50m. That should be enough to start those pods.
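With the second option, the container resources in the DaemonSet's pod template would end up roughly like this (an illustrative fragment; with the Helm chart you would set the equivalent values in your values file rather than editing the DaemonSet directly):
containers:
  - name: datadog
    image: datadog/agent:6.10.1
    resources:
      requests:
        cpu: 100m       # reduced from 200m so the pod fits next to the ~1810m already requested
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 256Mi
On a 2000m-capacity node with 1810m already requested, a 100m request fits where the original 200m request did not.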

Minikube NodeUnderDiskPressure issue

I'm constantly running into NodeUnderDiskPressure in my pods that are running in Minikube. Using minikube ssh to see df -h, I'm using 50% max on all of my mounts. In fact, one is 50% and the other 5 are <10%.
$ df -h
Filesystem Size Used Avail Use% Mounted on
rootfs 7.3G 503M 6.8G 7% /
devtmpfs 7.3G 0 7.3G 0% /dev
tmpfs 7.4G 0 7.4G 0% /dev/shm
tmpfs 7.4G 9.2M 7.4G 1% /run
tmpfs 7.4G 0 7.4G 0% /sys/fs/cgroup
/dev/sda1 17G 7.5G 7.8G 50% /mnt/sda1
$ df -ih
Filesystem Inodes IUsed IFree IUse% Mounted on
rootfs 1.9M 4.1K 1.9M 1% /
devtmpfs 1.9M 324 1.9M 1% /dev
tmpfs 1.9M 1 1.9M 1% /dev/shm
tmpfs 1.9M 657 1.9M 1% /run
tmpfs 1.9M 14 1.9M 1% /sys/fs/cgroup
/dev/sda1 9.3M 757K 8.6M 8% /mnt/sda1
The problem usually just goes away after 1-5 minutes. Strangely, restarting Minikube doesn't seem to speed this up. I've tried removing all evicted pods but, again, disk usage doesn't actually look very high.
The docker images I'm using are just under 2GB and I'm trying to spin up just a few of them, so that should still leave me with plenty of headroom.
Here's some kubectl describe output:
$ kubectl describe po/consumer-lag-reporter-3832025036-wlfnt
Name: consumer-lag-reporter-3832025036-wlfnt
Namespace: default
Node: <none>
Labels: app=consumer-lag-reporter
pod-template-hash=3832025036
tier=monitor
type=monitor
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"consumer-lag-reporter-3832025036","uid":"342b0f72-9d12-11e8-a735...
Status: Pending
IP:
Created By: ReplicaSet/consumer-lag-reporter-3832025036
Controlled By: ReplicaSet/consumer-lag-reporter-3832025036
Containers:
consumer-lag-reporter:
Image: avery-image:latest
Port: <none>
Command:
/bin/bash
-c
Args:
newrelic-admin run-program python manage.py lag_reporter_runner --settings-module project.settings
Environment Variables from:
local-config ConfigMap Optional: false
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-sjprm (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-sjprm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-sjprm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 15s (x7 over 46s) default-scheduler No nodes are available that match all of the following predicates:: NodeUnderDiskPressure (1).
Is this a bug? Anything else I can do to debug this?
I tried:
Cleaning up evicted pods (with kubectl get pods -a)
Cleaning up unused images
(with minikube ssh + docker images)
Cleaning up all non-running containers (with
minikube ssh + docker ps -a)
The disk usage remained low, as shown in my question. I simply recreated the minikube cluster with the --disk-size flag and this solved my problem. The key thing to note is that even though df showed I was barely using any disk, making the disk even bigger helped.