Is there a way to calculate the total disk space used by each pod on nodes? - kubernetes

Context
Our current context is the following: researchers are running HPC calculations on our Kubernetes cluster. Unfortunately, some pods cannot get scheduled because the container engine (here Docker) is not able to pull the images because the node is running out of disk space.
Hypotheses
Images too big
The first hypothesis is that the images are too big. This is probably the case, because we know that some images are bigger than 7 GB.
Datasets being decompressed locally
Our second hypothesis is that some people are downloading their datasets onto the node (e.g. with curl ...) and decompressing them there. This would generate the behavior we are observing.
Envisioned solution
I believe this problem is a good case for a DaemonSet that has access to the node's file system. Typically, this pod would calculate the total disk space used by each pod on the node and expose it as a Prometheus metric. From there it would be easy to set up alert rules to check which pods have grown a lot over a short period of time.
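To sketch the DaemonSet's measurement step: each pod's emptyDir volumes, logs and other ephemeral data live under a per-pod directory below /var/lib/kubelet/pods (the kubelet default path), so summing those directories with du gives a per-pod figure. A minimal sketch under that assumption; note it does not count image layers or container writable layers, which live under the container runtime's own directory:

```shell
# Print one line per pod directory: "<bytes> <pod-uid>".
# Pass an alternative root for testing; defaults to the kubelet's pod root.
pod_disk_usage() {
    root="${1:-/var/lib/kubelet/pods}"
    for pod_dir in "$root"/*/; do
        [ -d "$pod_dir" ] || continue
        kb=$(du -sk "$pod_dir" | cut -f1)
        echo "$((kb * 1024)) $(basename "$pod_dir")"
    done
}

# Example, on a node: pod_disk_usage | sort -rn | head
```

In the DaemonSet, /var/lib/kubelet/pods would be mounted into the pod via hostPath, and these numbers fed to a small Prometheus exporter.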
How to calculate the total disk space used by a pod?
The question then becomes: is there a way to calculate the total disk space used by a pod?
Does anyone have any experience with this?

Kubernetes does not track overall storage available. It only knows about emptyDir volumes and the filesystem backing those.
To see the node-level totals you can use the command below:
kubectl describe nodes
In the output, grep for ephemeral-storage, which is the node's virtual disk capacity; this partition is shared and consumed by Pods via emptyDir volumes, image layers, container logs and container writable layers.
Also check whether a process is still running and holding file descriptors and/or disk space (other processes may also be holding file descriptors that were never released); check whether that process is the kubelet.
You can verify by running $ ps -Af | grep xxxx
With Prometheus you can calculate it with the formula below:
sum(node_filesystem_size_bytes)
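Note that this formula reports total filesystem capacity, not consumption. For used space, a common PromQL variant subtracts the available bytes; the mountpoint filter below is an assumption about your node_exporter setup and may need adjusting:

```promql
# Total capacity across all reported filesystems (the formula above)
sum(node_filesystem_size_bytes)

# Used space on the root filesystem of each node
sum by (instance) (
  node_filesystem_size_bytes{mountpoint="/"}
  - node_filesystem_avail_bytes{mountpoint="/"}
)
```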
Please go through Get total and free disk space using Prometheus for more information.

Related

Ceph reports bucket space utilization and total cluster utilization that is inconsistent

I copied the contents of an older Ceph cluster to a new Ceph cluster using rclone. Because several of the buckets had tens of millions of objects in a single directory I had to enumerate these individually and use the "rclone copyto" command to move them. After copying, the number of objects match but the space utilization on the second Ceph cluster is much higher.
Each Ceph cluster is configured with the default triple redundancy.
The older Ceph cluster has 1.4PiB of raw capacity.
The older Ceph cluster has 526TB in total bucket utilization as reported by "radosgw-admin metadata bucket stats". The "ceph -s" status on this cluster shows 360TiB of object utilization with a total capacity of 1.4PiB for 77% space utilization. The two indicated quantities of 360TiB used in the cluster and 526TB used by buckets are significantly different. There isn't enough raw capacity on this cluster to hold 526TB.
After copying the contents to the new Ceph cluster, the total bucket utilization of 553TB is reflected in the "ceph -s" status as 503TiB. This is slightly higher than the bucket total of the source, which I assume is due to the larger drives' block sizes, but the status utilization matches the sum of the bucket utilization as expected. The number of objects in each bucket of the destination cluster also matches the source buckets.
Is there a setting in the first Ceph cluster that merges duplicate objects, like a simplistic compression? There isn't enough capacity in the first Ceph cluster to hold much over 500TB, so this seems like the only way this could happen. I assume that when two objects are the same, each bucket gets a symlink-like pointer to the same object. The new Ceph cluster doesn't seem to have this capability, or it's not set to behave this way.
The first cluster is Ceph version 13.2.6 and the second is version 17.2.3.

the value of container_fs_writes_bytes_total isn't correct in k8s?

If the application in the container writes or reads files without using a volume, the value of the container_fs_writes_bytes_total metric is always 0 when I upload one huge file and the application saves it to a folder. container_fs_reads_bytes_total is likewise always 0 when I download the uploaded file. Is there a way to get the disk I/O caused by the container?
By the way, I want to know the disk usage of each container in k8s. There is a container_fs_usage_bytes metric in cAdvisor, but its value isn't correct: if there are many containers, the metrics of each container are identical, and the value is the whole disk usage of the machine. Someone said to use kubelet_volume_stats_used_bytes, but it has the same problem.

Which component in Kubernetes is responsible for resource limits?

When a pod is created but no resource limit is specified, which component is responsible for calculating or assigning a resource limit to that pod? Is it the kubelet or Docker?
If a pod doesn't specify any resource limits, it's ultimately up to the Linux kernel scheduler on the node to assign CPU cycles or not to the process, and to OOM-kill either the pod or other processes on the node if memory use is excessive. Neither Kubernetes nor Docker will assign or guess at any sort of limits.
So if you have a process with a massive memory leak, and it gets scheduled on a very large but quiet instance with 256 GB of available memory, it will get to use almost all of that memory before it gets OOM-killed. If a second replica gets scheduled on a much smaller instance with only 4 GB, it's liable to fail much sooner. Usually you'll want to actually set limits for consistent behavior.
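To avoid that inconsistency you can set requests and limits explicitly on the container. A minimal sketch; the pod name, image and values are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: leaky-app                 # illustrative name
spec:
  containers:
  - name: app
    image: example/app:latest     # placeholder image
    resources:
      requests:
        memory: "256Mi"           # what the scheduler reserves on the node
        cpu: "250m"
      limits:
        memory: "1Gi"             # container is OOM-killed above this
        cpu: "1"                  # CPU is throttled, not killed
```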

limit the amount of memory kube-controller-manager uses

Running v1.10, and I notice that kube-controller-manager's memory usage spikes and then it OOMs all the time. It wouldn't be so bad if the system didn't slow to a crawl before this happens, though.
I tried modifying /etc/kubernetes/manifests/kube-controller-manager.yaml to have resources.limits.memory=1Gi, but the kube-controller-manager pod never seems to want to come back up.
Any other options?
There is a bug in kube-controller-manager, and it's fixed in https://github.com/kubernetes/kubernetes/pull/65339
First of all, you didn't mention how much memory each of your nodes has.
Second, what do you mean by the system "falling to a crawl": are the nodes swapping?
All Kubernetes masters and nodes are expected to have swap disabled - it's recommended by the Kubernetes community, as mentioned in the Kubernetes documentation.
Support for swap is non-trivial and degrades performance.
Turn off swap on every node by:
sudo swapoff -a
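Note that swapoff -a only lasts until the next reboot. To make it permanent, the swap entries in /etc/fstab also need to be commented out. A hedged sketch, written as a function so it can be tried on a copy of the file first; run it against /etc/fstab as root:

```shell
# Comment out every active swap entry in the given fstab-format file.
# Matches non-comment lines whose filesystem type field is "swap".
disable_swap_entries() {
    fstab="$1"
    sed -i.bak '/^[^#].*[[:space:]]swap[[:space:]]/s/^/#/' "$fstab"
}

# Example: disable_swap_entries /etc/fstab   (a .bak backup is kept)
```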
Finally,
resources.limits.memory=1Gi
sets the limit for the pod, and these limits are hard limits. A pod that reaches this level of allocated memory can be OOM-killed, even if the node still has gigabytes of unallocated memory.

Should I use SSD or HDD as local disks for kubernetes cluster?

Is it worth using SSD as boot disk? I'm not planning to access local disks within pods.
Also, GCP by default creates 100GB disk. If I use 20GB disk, will it cripple the cluster or it's OK to use smaller sized disks?
Why one or the other? Kubernetes (Google Container Engine) is mainly memory and CPU intensive unless your applications need huge throughput on the hard drives. If you want to save money, you can label the HDD nodes and use node affinity to tweak which pods go where, so you can have a few nodes with SSD and target them with the affinity labels.
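The label-and-affinity approach can be sketched like this; the label key and value are illustrative and would be set with something like kubectl label nodes <node> disktype=hdd:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job                  # illustrative name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype          # assumed node label
            operator: In
            values: ["hdd"]
  containers:
  - name: worker
    image: example/batch:latest    # placeholder image
```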
I would always recommend SSD considering the small difference in price and large difference in performance. Even if it just speeds up the deployment/upgrade of containers.
Reducing the disk size to what is required for running your Pods should save you more. I cannot give a general recommendation for disk size, since it depends on the OS you are using, how many Pods you will end up with on each node, and how big each Pod is going to be. To give an example: when I run CoreOS-based images with staging deployments for nginx, php and some application servers, I can reduce the disk size to 10 GB with ample free room (both for master and worker nodes). On the extreme side, if I run self-contained golang application containers without storage needs, each Pod will only require a few MB of space.