Pods are getting killed and recreated stating - OutOfephemeral-storage? - kubernetes

My Pods are getting killed and recreated with the status OutOfephemeral-storage.
kubectl describe pod shows the message below:
Message: Pod Node didn't have enough resource: ephemeral-storage, requested: 53687091200, used: 0, capacity: 0
Node Capacity
Capacity:
cpu: 80
ephemeral-storage: 1845262880Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 790964944Ki
nvidia.com/gpu: 8
pods: 110
Allocatable:
cpu: 79900m
ephemeral-storage: 1700594267393
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 790612544Ki
nvidia.com/gpu: 8
pods: 110
Node disk usage:
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 1.7T 25G 1.7T 2% /
devtmpfs 378G 0 378G 0% /dev
tmpfs 378G 16K 378G 1% /dev/shm
tmpfs 378G 3.8M 378G 1% /run
tmpfs 378G 0 378G 0% /sys/fs/cgroup
Still, the pod gets rescheduled after some time. Any thoughts on why?

In most cases this happens because an excess of log messages is consuming the storage. The solution for that is to configure the Docker logging driver to limit the amount of saved logs:
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "10"
  }
}
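On a typical Linux host this configuration lives in /etc/docker/daemon.json, and Docker has to be restarted to pick it up. A minimal sketch, assuming a systemd-based node and that you are not overwriting an existing daemon.json with other settings:
$ sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{ "log-driver": "json-file", "log-opts": { "max-size": "100m", "max-file": "10" } }
EOF
$ sudo systemctl restart docker   # existing containers keep their old log settings; only new ones pick this up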
It is also worth mentioning that Docker takes a conservative approach to cleaning up unused objects (often referred to as "garbage collection"), such as images, containers, volumes, and networks: these objects are generally not removed unless you explicitly ask Docker to do so. This can cause Docker to use extra disk space.
It helped in my case to use Docker's prune commands, which clean up unused objects from the system. If you wish to clean up multiple object types at once, you can use docker system prune. See the Docker documentation for more about pruning.
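For illustration, the usual prune commands look like this (exact behaviour depends on your Docker version; volumes are only removed when you ask for them explicitly):
$ docker container prune              # remove all stopped containers
$ docker image prune -a               # remove images not used by any container
$ docker system prune                 # stopped containers, unused networks, dangling images, build cache
$ docker system prune -a --volumes    # more aggressive: also unused images and volumes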
The next possible scenario is that there are pods that use emptyDir volumes without storage quotas, which will fill up the storage. The solution for this is to set a quota to limit it:
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "1Gi"
Without this set, any container can write any amount of data to its node's filesystem.
For more details on how ephemeral storage works, please see Ephemeral Storage Consumption.
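As a rough sketch of how those limits fit into a full manifest (the pod name and image are made-up placeholders), a Pod that caps both its container's ephemeral storage and an emptyDir scratch volume could look like this:
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo                    # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9    # placeholder image
    resources:
      requests:
        ephemeral-storage: "1Gi"
      limits:
        ephemeral-storage: "1Gi"        # exceeding this gets the pod evicted
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 1Gi                    # caps the emptyDir volume itself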

The issue was with the filesystem; it was solved with the help of the following steps:
# systemctl stop kubelet
# systemctl stop docker
# umount -l /<MountFolder>
# fsck -y /dev/sdb1
# init 6

Related

Unschedulable 0/1 nodes are available insufficient ephemeral-storage

I have a strange issue.
The error that I'm getting is:
unschedulable 0/1 nodes are available insufficient ephemeral-storage
My resource requests per workflow that I run in Kubernetes are:
resources:
  requests:
    ephemeral-storage: 50Gi
    memory: 8Gi
And my node capacity is 100GiB per node.
When I run kubectl describe node <node-name> I get the following result:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 125m (3%) 0 (0%)
memory 8Gi (55%) 0 (0%)
ephemeral-storage 50Gi (56%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Do ephemeral-storage and memory overlap? What can be the issue? I cannot resolve it.
In the kubectl describe node output, Kubernetes believes it's used 50 GiB of disk ("ephemeral storage") and that's 56% of the available resources. That implies there's about 89 GiB of usable disk space, and about 39 GiB left, so less space than your container claims it needs.
If the node has a 100 GiB disk, space required by the operating system, Kubernetes, and any pulled images counts against that disk capacity before being considered available for ephemeral storage. You probably will never be able to run two Pods that both require 50 GiB of disk; with the OS overhead they will not both fit at the same time.
(It's also not impossible that the node has 100 GB and not 100 GiB storage. 100 * 10^9 is only 93 * 2^30, which would make that overhead about 4 GiB, which feels a little more typical to me.)
The easiest and "most Kubernetes" option here is to get another node, maybe via the cluster autoscaler. If you do control the node configuration, changing nodes to more like 120 GB storage would make these Pods fit better. Especially in an AWS/EKS context, current Kubernetes also supports generic ephemeral volumes which would let you get per-pod storage backed by a volume (on AWS, most likely an EBS volume) rather than fixed-size local disk.
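As a loose sketch of that last option (the pod name, image, and gp2 storage class are assumptions; use whatever EBS-backed StorageClass your cluster actually defines), a generic ephemeral volume looks roughly like this:
apiVersion: v1
kind: Pod
metadata:
  name: workflow-pod                    # hypothetical
spec:
  containers:
  - name: worker
    image: example.com/worker:latest    # placeholder
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2          # assumption: an EBS-backed class
          resources:
            requests:
              storage: 50Gi              # per-pod scratch backed by a volume, not node disk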

What does Kubelet use to determine the ephemeral-storage capacity of the node?

I have a Kubernetes cluster running on a VM. A truncated overview of the mounts is:
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20G 4.5G 15G 24% /
/dev/mapper/vg001-lv--docker 140G 33G 108G 23% /var/lib/docker
As you can see, I added an extra disk to store the Docker images and their volumes. However, when querying the node's capacity, the following is returned:
Capacity:
cpu: 12
ephemeral-storage: 20145724Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65831264Ki
nvidia.com/gpu: 1
pods: 110
ephemeral-storage is 20145724Ki, which is about 20G, referring to the disk mounted at /.
How does Kubelet calculate its ephemeral-storage? Is it simply looking at the disk space available at /? Or is it looking at another folder like /var/log/containers?
There is a similar post where the user eventually resorted to increasing the disk mounted at /.
Some theory
By default, Capacity and Allocatable for ephemeral-storage in a standard Kubernetes environment are sourced from the filesystem mounted at /var/lib/kubelet.
This is the default location for kubelet directory.
The kubelet supports the following filesystem partitions:
nodefs: The node's main filesystem, used for local disk volumes, emptyDir, log storage, and more. For example, nodefs contains /var/lib/kubelet/.
imagefs: An optional filesystem that container runtimes use to store container images and container writable layers.
Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet does not support other configurations.
From Kubernetes website about volumes:
The storage media (such as Disk or SSD) of an emptyDir volume is determined by the medium of the filesystem holding the kubelet root dir (typically /var/lib/kubelet).
The location of the kubelet directory can be configured by providing:
A command-line parameter during kubelet initialization:
--root-dir string
Default: /var/lib/kubelet
Via kubeadm with a config file, e.g.:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    root-dir: "/data/var/lib/kubelet"
Customizing kubelet:
To customize the kubelet you can add a KubeletConfiguration next to the ClusterConfiguration or InitConfiguration separated by --- within the same configuration file. This file can then be passed to kubeadm init.
When bootstrapping a Kubernetes cluster using kubeadm, the Capacity reported by kubectl get node is equal to the capacity of the disk mounted at /var/lib/kubelet.
However, with the standard kubeadm configuration, Allocatable will be reported as:
Allocatable = Capacity - 10% nodefs, since the kubelet has the following default hard eviction threshold:
nodefs.available<10%
It can be configured during kubelet initialization with:
--eviction-hard mapStringString
Default: imagefs.available<15%,memory.available<100Mi,nodefs.available<10%
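For example, the same thresholds can be set declaratively; a sketch (the values are just the kubelet defaults, and once evictionHard is set explicitly you should list every threshold you want to keep), with the KubeletConfiguration appended to the kubeadm config after a --- separator as described above:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    root-dir: "/data/var/lib/kubelet"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"     # default hard eviction threshold for nodefs
  imagefs.available: "15%"
  memory.available: "100Mi"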
Example
I set up a test environment for Kubernetes with a master node and two worker nodes (worker-1 and worker-2).
Both worker nodes have volumes of the same capacity: 50 GB.
Additionally, for the worker-1 node I mounted a second volume with a capacity of 20 GB at the path /var/lib/kubelet.
Then I created a cluster with kubeadm.
Result
From worker-1 node:
skorkin@worker-1:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 49G 2.8G 46G 6% /
...
/dev/sdb 20G 45M 20G 1% /var/lib/kubelet
and
Capacity:
cpu: 2
ephemeral-storage: 20511312Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 4027428Ki
pods: 110
The size of ephemeral-storage is the same as that of the volume mounted at /var/lib/kubelet.
From worker-2 node:
skorkin@worker-2:~# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 49G 2.7G 46G 6% /
and
Capacity:
cpu: 2
ephemeral-storage: 50633164Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 4027420Ki
pods: 110
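If you just want this one value rather than the full describe output, a jsonpath query should work (node names here are the ones from the example above):
$ kubectl get node worker-1 -o jsonpath='{.status.capacity.ephemeral-storage}'
20511312Ki
$ kubectl get node worker-2 -o jsonpath='{.status.allocatable.ephemeral-storage}'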

Why does DigitalOcean k8s node capacity show a value subtracted from the node pool config?

I'm running a 4vCPU 8GB Node Pool, but all of my nodes report this for Capacity:
Capacity:
cpu: 4
ephemeral-storage: 165103360Ki
hugepages-2Mi: 0
memory: 8172516Ki
pods: 110
I'd expect it to show 8388608Ki (the equivalent of 8192Mi/8Gi).
How come?
Memory can be reserved for both system services (system-reserved) and the kubelet itself (kube-reserved). https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/ has the details, but DigitalOcean is probably setting this up for you.
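For comparison, on a self-managed node these reservations would sit in the kubelet configuration along these lines (the amounts are purely illustrative, not DigitalOcean's actual settings); Allocatable is roughly Capacity minus these reservations and the eviction threshold:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:                 # set aside for the kubelet and container runtime
  cpu: "100m"
  memory: "200Mi"
systemReserved:               # set aside for OS system daemons
  cpu: "100m"
  memory: "200Mi"
evictionHard:
  memory.available: "100Mi"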

Minikube - how to increase ephemeral storage

I am trying to set up a copy of our app on my development machine using minikube, but I get an error showing up in the minikube dashboard:
0/1 nodes are available: 1 Insufficient ephemeral-storage
Any ideas as to how I fix this?
The relevant part of the yaml configuration file looks like so:
resources:
  requests:
    memory: 500Mi
    cpu: 1
    ephemeral-storage: 16Gi
  limits:
    memory: 4Gi
    cpu: 1
    ephemeral-storage: 32Gi
I have tried assigning extra disk space at startup with the following, but the error persists:
minikube start --disk-size 64g
The issue is that minikube can't resize the VM disk.
Depending on the hypervisor driver (xhyve, virtualbox, hyper-v) and the disk type (qcow2, sparse, raw, etc.), resizing the VM disk will be different. For example, if you have:
/Users/username/.minikube/machines/minikube/minikube.rawdisk
You can do something like this:
$ cd /Users/username/.minikube/machines/minikube
$ mv minikube.rawdisk minikube.img
$ hdiutil resize -size 64g minikube.img
$ mv minikube.img minikube.rawdisk
$ minikube start
$ minikube ssh
Then in the VM:
$ sudo resize2fs /dev/vda1 # <-- or the disk of your VM
Otherwise, if you don't care about the data in your VM:
$ rm -rf ~/.minikube
$ minikube start --disk-size 64g

Minikube NodeUnderDiskPressure issue

I'm constantly running into NodeUnderDiskPressure in my pods that are running in Minikube. Running df -h via minikube ssh shows I'm using at most 50% on all of my mounts. In fact, one is at 50% and the other five are under 10%.
$ df -h
Filesystem Size Used Avail Use% Mounted on
rootfs 7.3G 503M 6.8G 7% /
devtmpfs 7.3G 0 7.3G 0% /dev
tmpfs 7.4G 0 7.4G 0% /dev/shm
tmpfs 7.4G 9.2M 7.4G 1% /run
tmpfs 7.4G 0 7.4G 0% /sys/fs/cgroup
/dev/sda1 17G 7.5G 7.8G 50% /mnt/sda1
$ df -ih
Filesystem Inodes IUsed IFree IUse% Mounted on
rootfs 1.9M 4.1K 1.9M 1% /
devtmpfs 1.9M 324 1.9M 1% /dev
tmpfs 1.9M 1 1.9M 1% /dev/shm
tmpfs 1.9M 657 1.9M 1% /run
tmpfs 1.9M 14 1.9M 1% /sys/fs/cgroup
/dev/sda1 9.3M 757K 8.6M 8% /mnt/sda1
The problem usually just goes away after 1-5 minutes. Strangely, restarting Minikube doesn't seem to speed up this process. I've tried removing all evicted pods but, again, disk usage doesn't actually look very high.
The docker images I'm using are just under 2GB and I'm trying to spin up just a few of them, so that should still leave me with plenty of headroom.
Here's some kubectl describe output:
$ kubectl describe po/consumer-lag-reporter-3832025036-wlfnt
Name: consumer-lag-reporter-3832025036-wlfnt
Namespace: default
Node: <none>
Labels: app=consumer-lag-reporter
pod-template-hash=3832025036
tier=monitor
type=monitor
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"consumer-lag-reporter-3832025036","uid":"342b0f72-9d12-11e8-a735...
Status: Pending
IP:
Created By: ReplicaSet/consumer-lag-reporter-3832025036
Controlled By: ReplicaSet/consumer-lag-reporter-3832025036
Containers:
consumer-lag-reporter:
Image: avery-image:latest
Port: <none>
Command:
/bin/bash
-c
Args:
newrelic-admin run-program python manage.py lag_reporter_runner --settings-module project.settings
Environment Variables from:
local-config ConfigMap Optional: false
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-sjprm (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-sjprm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-sjprm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 15s (x7 over 46s) default-scheduler No nodes are available that match all of the following predicates:: NodeUnderDiskPressure (1).
Is this a bug? Anything else I can do to debug this?
I tried:
Cleaning up evicted pods (with kubectl get pods -a)
Cleaning up unused images (with minikube ssh + docker images)
Cleaning up all non-running containers (with minikube ssh + docker ps -a)
The disk usage remained low, as shown in my question. I simply recreated the minikube cluster with the --disk-size flag and this solved my problem. The key thing to note is that even though df showed I was barely using any disk, it helped to make the disk even bigger.
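For anyone hitting the same thing, the recreate step amounts to something like this (pick whatever size gives you headroom; note that minikube delete wipes the VM and everything in it):
$ minikube delete                  # removes the existing VM and its disk
$ minikube start --disk-size 64g   # recreate the cluster with a larger disk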