Unschedulable 0/1 nodes are available insufficient ephemeral-storage - kubernetes

I have one strange issue.
The error that I'm getting is:
unschedulable 0/1 nodes are available insufficient ephemeral-storage
My requests per workflow that I run in kubernetes are:
resources:
  requests:
    ephemeral-storage: 50Gi
    memory: 8Gi
And my node capacity is 100GiB per node.
When I run kubectl describe node <node-name> I get the following result:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         125m (3%)   0 (0%)
  memory                      8Gi (55%)   0 (0%)
  ephemeral-storage           50Gi (56%)  0 (0%)
  hugepages-1Gi               0 (0%)      0 (0%)
  hugepages-2Mi               0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
Do ephemeral-storage and memory overlap? What could the issue be? I cannot resolve it.

In the kubectl describe node output, Kubernetes believes it's used 50 GiB of disk ("ephemeral storage") and that's 56% of the available resources. That implies there's about 89 GiB of usable disk space, and about 39 GiB left, so less space than your container claims it needs.
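The arithmetic behind that estimate can be sketched in Python. The function name is made up for illustration, and because kubectl prints whole-number percentages the result is only approximate.

```python
def implied_allocatable(requested: float, percent: int) -> float:
    """Estimate a node's allocatable amount from one 'Requests' cell.

    kubectl prints `requested (percent%)`, so allocatable is roughly
    requested / (percent / 100). The percentage is a whole number,
    so this is an estimate rather than an exact figure.
    """
    return requested * 100 / percent


allocatable_gib = implied_allocatable(50, 56)  # from "50Gi (56%)"
remaining_gib = allocatable_gib - 50           # not yet requested by any Pod

print(f"allocatable ~ {allocatable_gib:.1f} GiB, remaining ~ {remaining_gib:.1f} GiB")
```

The remaining amount comes out below 50 GiB, which is why a second Pod requesting 50Gi cannot be scheduled.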
If the node has a 100 GiB disk, space required by the operating system, Kubernetes, and any pulled images counts against that disk capacity before being considered available for ephemeral storage. You probably will never be able to run two Pods that both require 50 GiB of disk; with the OS overhead they will not both fit at the same time.
(It's also not impossible that the node has 100 GB and not 100 GiB storage. 100 * 10^9 is only 93 * 2^30, which would make that overhead about 4 GiB, which feels a little more typical to me.)
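A quick way to check the GB versus GiB conversion, as a sketch; the helper name is illustrative only.

```python
GB = 10**9   # bytes in one gigabyte (decimal)
GIB = 2**30  # bytes in one gibibyte (binary)


def gb_to_gib(gigabytes: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return gigabytes * GB / GIB


disk_gib = gb_to_gib(100)              # a "100 GB" disk expressed in GiB
allocatable_gib = 50 * 100 / 56        # estimate from "50Gi (56%)"
overhead_gib = disk_gib - allocatable_gib

print(f"disk ~ {disk_gib:.1f} GiB, overhead ~ {overhead_gib:.1f} GiB")
```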
The easiest and "most Kubernetes" option here is to get another node, maybe via the cluster autoscaler. If you do control the node configuration, changing nodes to more like 120 GB storage would make these Pods fit better. Especially in an AWS/EKS context, current Kubernetes also supports generic ephemeral volumes which would let you get per-pod storage backed by a volume (on AWS, most likely an EBS volume) rather than fixed-size local disk.
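As a rough sketch of the generic ephemeral volume option, the Python below builds a Pod manifest whose scratch space comes from a per-pod volume claim instead of node-local disk. The pod name, image, mount path, and the `gp2` storage class are assumptions for illustration; substitute whatever your cluster actually provides.

```python
import json


def pod_with_ephemeral_volume(name, image, size, storage_class):
    """Build a Pod manifest that uses a generic ephemeral volume."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "main",
                "image": image,
                # Hypothetical mount path for the scratch data.
                "volumeMounts": [{"name": "scratch", "mountPath": "/scratch"}],
            }],
            "volumes": [{
                "name": "scratch",
                # The claim is created with the Pod and deleted with it.
                "ephemeral": {
                    "volumeClaimTemplate": {
                        "spec": {
                            "accessModes": ["ReadWriteOnce"],
                            "storageClassName": storage_class,
                            "resources": {"requests": {"storage": size}},
                        }
                    }
                },
            }],
        },
    }


# All four arguments are placeholders, not values from the question.
manifest = pod_with_ephemeral_volume("workflow-step", "busybox", "50Gi", "gp2")
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the output could be piped into `kubectl apply -f -`. With this layout the Pod no longer needs a 50Gi ephemeral-storage request against the node.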

Related

How to check maximum memory I can request for a pod in a kubernetes cluster?

I am using an EKS cluster with 4 nodes. There are multiple applications running in the kubernetes cluster(nearly 30-40 pods) with different cpu and memory requests.
Now I wish to increase the memory of one particular pod. How do I choose the maximum memory I can assign to that pod in my cluster?
My idea is to get the free memory on the Kubernetes nodes and, based on that, decide the maximum memory that I can assign to the pod.
I am trying the free command inside pods to check the available memory.
How can I get the free memory available in my EKS cluster nodes?
Note: There is no metrics server installed in my EKS cluster
There may be namespace-specific limits, which could be lower than what's available at the node level. In this case, you will have to consider the namespace limits.
However, if that's not the case, as a starting value, you may run the command below, look at the "Requests" column, and choose a request value for your pod that is lower than the available amount shown for your most utilized node.
kubectl get node --no-headers | while read -r node _; do echo ">>>> [$node]"; kubectl describe node "$node" | grep -A 3 Resource; done
>>>> [node-be-worker-1]
Resource Requests Limits
-------- -------- ------
cpu 300m (7%) 100m (2%)
memory 220Mi (2%) 220Mi (2%)
>>>> [node-be-worker-2]
Resource Requests Limits
-------- -------- ------
cpu 200m (5%) 100m (2%)
memory 150Mi (1%) 50Mi (0%)
>>>> [node-be-worker-3]
Resource Requests Limits
-------- -------- ------
cpu 400m (10%) 2100m (52%)
memory 420Mi (5%) 420Mi (5%)
>>>> [node-master-0]
Resource Requests Limits
-------- -------- ------
cpu 650m (32%) 100m (5%)
memory 50Mi (1%) 50Mi (1%)
Explanation: The command lists all the nodes and loops over to describe the nodes and then filters the lines for "Resource" and prints the string and the next 3 lines.
You should be able to tweak the above command to work for namespace too.
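To pick out the most utilized node from that output automatically, a small parser along these lines could work. It is a sketch that assumes the exact `>>>> [node]` header format produced by the loop above, and the sample text is a shortened copy of the output shown.

```python
import re

SAMPLE_OUTPUT = """\
>>>> [node-be-worker-1]
Resource Requests Limits
-------- -------- ------
cpu 300m (7%) 100m (2%)
memory 220Mi (2%) 220Mi (2%)
>>>> [node-be-worker-3]
Resource Requests Limits
-------- -------- ------
cpu 400m (10%) 2100m (52%)
memory 420Mi (5%) 420Mi (5%)
"""


def memory_request_percent(output: str) -> dict:
    """Map each node name to the percentage of memory already requested."""
    percents = {}
    node = None
    for line in output.splitlines():
        header = re.match(r">>>> \[(.+)\]", line)
        if header:
            node = header.group(1)
            continue
        # First "(N%)" on the memory row is the Requests column.
        row = re.match(r"\s*memory\s+\S+\s+\((\d+)%\)", line)
        if row and node is not None:
            percents[node] = int(row.group(1))
    return percents


percents = memory_request_percent(SAMPLE_OUTPUT)
busiest = max(percents, key=percents.get)
print(busiest, percents[busiest])
```

In practice you would feed it the real command output, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.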

Kubernetes ephemeral-storage of containers

Kubernetes has the concept of ephemeral-storage which can be applied by the deployment to a container like this:
limits:
  cpu: 500m
  memory: 512Mi
  ephemeral-storage: 100Mi
requests:
  cpu: 50m
  memory: 256Mi
  ephemeral-storage: 50Mi
Now, when applying this to a k8s 1.18 cluster (IBM Cloud managed k8s), I cannot see any changes when I look at a running container:
kubectl exec -it <pod> -n <namespace> -c nginx -- /bin/df
I would expect to see changes there. Am I wrong?
You can see the allocated resources by using kubectl describe node <insert-node-name-here> on the node that is running the pod of the deployment.
You should see something like this:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests      Limits
  --------                       --------      ------
  cpu                            1130m (59%)   3750m (197%)
  memory                         4836Mi (90%)  7988Mi (148%)
  ephemeral-storage              0 (0%)        0 (0%)
  hugepages-1Gi                  0 (0%)        0 (0%)
  hugepages-2Mi                  0 (0%)        0 (0%)
  attachable-volumes-azure-disk  0             0
When you requested 50Mi of ephemeral-storage it should show up under Requests.
When your pod tries to use more than the limit (100Mi), the pod will be evicted; if it is managed by a Deployment, a replacement pod is then created.
On the node side, any pod that uses more than its requested resources is subject to eviction when the node runs out of resources. In other words, Kubernetes never provides any guarantees of availability of resources beyond a Pod's requests.
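The request/limit rules above can be summarized in a deliberately simplified Python sketch. The function and its return strings are illustrative; the real kubelet also weighs factors such as Pod priority when choosing what to evict under pressure.

```python
def eviction_status(usage_mib: float, request_mib: float, limit_mib: float) -> str:
    """Classify a Pod's ephemeral-storage usage against its request and limit."""
    if usage_mib > limit_mib:
        return "evicted: usage exceeds the limit"
    if usage_mib > request_mib:
        return "at risk: above request, may be evicted if the node runs short"
    return "safe: within request"


# Request 50Mi and limit 100Mi, as in the question's manifest.
for usage in (30, 70, 120):
    print(usage, "Mi ->", eviction_status(usage, request_mib=50, limit_mib=100))
```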
In the Kubernetes documentation you can find more details on how ephemeral storage consumption management works.
Note that using kubectl exec with df command might not show actual use of storage.
According to kubernetes documentation:
The kubelet can measure how much local storage it is using. It does this provided that:
the LocalStorageCapacityIsolation feature gate is enabled (the feature is on by default), and
you have set up the node using one of the supported configurations for local ephemeral storage.
If you have a different configuration, then the kubelet does not apply resource limits for ephemeral local storage.
Note: The kubelet tracks tmpfs emptyDir volumes as container memory use, rather than as local ephemeral storage.

How can I dimension the Nodes (cpu, memory) in a Kind Cluster?

I am a newbie and I may be asking a stupid question, but I could not find answers on Kind or on Stack Overflow, so I dare to ask:
I run kind (Kubernetes-in-Docker) on an Ubuntu machine, with 32 GB memory and 120 GB disk.
I need to run a Cassandra cluster on this Kind cluster, and each node needs at least 0.5 CPU and 1GB memory.
When I look at the node, it gives this:
Capacity:
  cpu: 8
  ephemeral-storage: 114336932Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 32757588Ki
  pods: 110
Allocatable:
  cpu: 8
  ephemeral-storage: 114336932Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 32757588Ki
  pods: 110
So in theory, there are more than enough resources to go around. However, when I try to deploy the Cassandra deployment, the first Pod stays in 'Pending' status because of a lack of resources. And indeed, the Node resources look like this:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                100m (1%)  100m (1%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
The node does not actually get access to the available resources: it stays limited to 10% of a CPU and 50 MB of memory.
So, reading the exchange above and having read #887, I understand that I need to configure Docker on my host machine so that Docker allows the containers simulating the Kind nodes to grab more resources. But then, how can I give such parameters to Kind so that they are taken into account when creating the cluster?
Sorry for this post: I finally found out that the issue was related to the storageclass not being properly configured in the spec of the Cassandra cluster, and not related to the dimensioning of the nodes.
I changed the cassandra-statefulset.yaml file to indicate the 'standard' storageclass: this storageclass is provisioned by default on a KinD cluster since version 0.7. And it works fine.
Since Cassandra is resource hungry, and depending on the machine, you may have to increase the timeout parameters so that the Pods are not considered faulty during the deployment of the Cassandra cluster. I had to increase the timeouts from 15 and 5s, respectively, to 25 and 15s.
This topic should be closed.

Can I run a small project with Kubernetes on GCP with one node (g1-small)?

I’m voluntarily operating (developing and hosting) a community project. Meaning time and money are tight. Currently it runs on a bare-metal machine at AWS (t2.micro, (1 vCPU, 1 GB memory)).
For learning purposes I would like to containerize my application. Now I'm looking for hosting. The Google Cloud Platform seems to be the cheapest to me.
I set up a Kubernetes cluster with 1 node (1.10.9-gke.5, g1-small (1 vCPU shared, 1.7 GB memory)).
After I set up the one node Kubernetes cluster I checked how much memory and CPU is already used by the Kubernetes system. (Please see kubectl describe node).
I was wondering if I can run the following application with 30% CPU and 30% memory left on the node. Unfortunately I don't have experience with how much the container in my example will need in terms of resources. But having only 30% CPU and 30% memory left doesn't seem like much for my kind of application.
kubectl describe node
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system event-exporter-v0.2.3-54f94754f4-bznpk 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system fluentd-gcp-scaler-6d7bbc67c5-pbrq4 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system fluentd-gcp-v3.1.0-fjbz6 100m (10%) 0 (0%) 200Mi (17%) 300Mi (25%)
kube-system heapster-v1.5.3-66b7745959-4zbcl 138m (14%) 138m (14%) 301456Ki (25%) 301456Ki (25%)
kube-system kube-dns-788979dc8f-krrtt 260m (27%) 0 (0%) 110Mi (9%) 170Mi (14%)
kube-system kube-dns-autoscaler-79b4b844b9-vl4mw 20m (2%) 0 (0%) 10Mi (0%) 0 (0%)
kube-system kube-proxy-gke-spokesman-cluster-default-pool-d70d068f-wjtk 100m (10%) 0 (0%) 0 (0%) 0 (0%)
kube-system l7-default-backend-5d5b9874d5-cgczj 10m (1%) 10m (1%) 20Mi (1%) 20Mi (1%)
kube-system metrics-server-v0.2.1-7486f5bd67-ctbr2 53m (5%) 148m (15%) 154Mi (13%) 404Mi (34%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
681m (72%) 296m (31%) 807312Ki (67%) 1216912Ki (102%)
Here is my app:
PROD:
API: ASP.NET core 1.1 (microsoft/dotnet:1.1-runtime-stretch)
Frontend: Angular app (nginx:1.15-alpine)
Admin: Angular app (nginx:1.15-alpine)
TEST:
API: ASP.NET core 1.1 (microsoft/dotnet:1.1-runtime-stretch)
Frontend: Angular app (nginx:1.15-alpine)
Admin: Angular app (nginx:1.15-alpine)
SHARED
Database: Postgres (postgres:11-alpine)
Any suggestions are more than welcome.
Thanks in advance!
If you intend to run a containerized app on a single node, a GCE instance could be better to begin with.
When moving to GKE, check out GCP's guide explaining how much of each machine type's resources is allocatable before any workloads and kube-system pods are scheduled. You would still need an estimate of resource usage per app component or container, perhaps from monitoring your dev or GCE environment.
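Once you have such estimates, comparing them with what the node has left is simple arithmetic. The sketch below derives the headroom from the "Allocated resources" totals in the question (681m at 72%, 807312Ki at 67%). The per-component figures are invented placeholders, not measurements, and the helper names are hypothetical.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as '681m' or '1' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Parse a memory quantity such as '807312Ki' or '200Mi' into bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def headroom(requested: float, percent: int) -> float:
    """Estimate what is still unrequested, given 'requested (percent%)'."""
    return requested * 100 / percent - requested


cpu_free = headroom(cpu_millicores("681m"), 72)
mem_free = headroom(memory_bytes("807312Ki"), 67)

# PLACEHOLDER estimates per component: (cpu request, memory request).
estimates = {
    "api": ("100m", "150Mi"),
    "frontend": ("10m", "16Mi"),
    "postgres": ("50m", "100Mi"),
}
cpu_needed = sum(cpu_millicores(c) for c, _ in estimates.values())
mem_needed = sum(memory_bytes(m) for _, m in estimates.values())

print(f"CPU: need {cpu_needed}m, free ~ {cpu_free:.0f}m")
print(f"Memory: need {mem_needed / 2**20:.0f}Mi, free ~ {mem_free / 2**20:.0f}Mi")
```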
If you want to explore other alternatives on GCP for your app (e.g. App Engine supports .NET), here's a post with a decision tree that might help you. I also found this article/tutorial about running containers on App Engine and GKE, comparing both with load tests.

Cluster autoscaler not downscaling

I have a regional cluster set up in google kubernetes engine (GKE). The node group is a single vm in each region (3 total). I have a deployment with 3 replicas minimum controlled by a HPA.
The nodegroup is configured to be autoscaling (cluster autoscaling aka CA).
The problem scenario:
Update deployment image. Kubernetes automatically creates new pods and the CA identifies that a new node is needed. I now have 4.
The old pods get removed when all new pods have started, which means I have the exact same CPU request as the minute before. But after the 10 min maximum downscale time I still have 4 nodes.
The CPU requests for the nodes is now:
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
358m (38%) 138m (14%) 516896Ki (19%) 609056Ki (22%)
--
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
800m (85%) 0 (0%) 200Mi (7%) 300Mi (11%)
--
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
510m (54%) 100m (10%) 410Mi (15%) 770Mi (29%)
--
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
823m (87%) 158m (16%) 484Mi (18%) 894Mi (33%)
The 38% node is running:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system event-exporter-v0.1.9-5c8fb98cdb-8v48h 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system fluentd-gcp-v2.0.17-q29t2 100m (10%) 0 (0%) 200Mi (7%) 300Mi (11%)
kube-system heapster-v1.5.2-585f569d7f-886xx 138m (14%) 138m (14%) 301856Ki (11%) 301856Ki (11%)
kube-system kube-dns-autoscaler-69c5cbdcdd-rk7sd 20m (2%) 0 (0%) 10Mi (0%) 0 (0%)
kube-system kube-proxy-gke-production-cluster-default-pool-0fd62aac-7kls 100m (10%) 0 (0%) 0 (0%) 0 (0%)
I suspect it won't downscale because of heapster or kube-dns-autoscaler.
But the 85% node contains:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system fluentd-gcp-v2.0.17-s25bk 100m (10%) 0 (0%) 200Mi (7%) 300Mi (11%)
kube-system kube-proxy-gke-production-cluster-default-pool-7ffeacff-mh6p 100m (10%) 0 (0%) 0 (0%) 0 (0%)
my-deploy my-deploy-54fc6b67cf-7nklb 300m (31%) 0 (0%) 0 (0%) 0 (0%)
my-deploy my-deploy-54fc6b67cf-zl7mr 300m (31%) 0 (0%) 0 (0%) 0 (0%)
The fluentd and kube-proxy pods are present on every node, so I assume they are not needed without the node. Which means that my deployment could be relocated to the other nodes since it only has a request of 300m (31% since only 94% of node CPU is allocatable).
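That relocation reasoning can be checked with a small first-fit sketch over the CPU requests listed above. The 940m allocatable figure is an assumption derived from "94% of node CPU is allocatable" on a 1 vCPU node, and the node labels are made up.

```python
ALLOCATABLE_M = 940  # assumed allocatable CPU per node, in millicores


def can_relocate(pod_requests, free_by_node):
    """First-fit check: can every pod be placed on some other node?"""
    free = dict(free_by_node)  # work on a copy
    for pod in sorted(pod_requests, reverse=True):
        target = next((n for n, f in free.items() if f >= pod), None)
        if target is None:
            return False
        free[target] -= pod
    return True


# CPU requests of the three other nodes (38%, 54% and 87% in the output).
requested = {"node-38": 358, "node-54": 510, "node-87": 823}
free = {name: ALLOCATABLE_M - used for name, used in requested.items()}

print(can_relocate([300, 300], free))  # the two my-deploy pods
```

Under these assumptions the two 300m pods fit on the remaining nodes, so CPU capacity alone does not explain why the node is kept.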
So I figured that I'll check the logs. But if I run kubectl get pods --all-namespaces there is no pod visible on GKE for the CA. And if I use the command kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml it only tells me if it is about to scale, not why or why not.
Another option is to look at /var/log/cluster-autoscaler.log on the master node. I SSH'ed into all 4 nodes and only found a gcp-cluster-autoscaler.log.pos file that says: /var/log/cluster-autoscaler.log 0000000000000000 0000000000000000 meaning the file should be right there but is empty.
The last option, according to the FAQ, is to check the events for the pods, but as far as I can tell they are empty.
Does anyone know why it won't downscale, or at least where to find the logs?
Answering myself for visibility.
The problem is that the CA never considers moving anything unless all the requirements mentioned in the FAQ are met at the same time.
So let's say I have 100 nodes with 51% CPU requests. It still won't consider downscaling.
One solution is to increase the value at which CA checks, now 50%. But unfortunately that is not supported by GKE, see answer from google support #GalloCedrone:
Moreover I know that this value might sound too low and someone could be interested to keep as well a 85% or 90% to avoid your scenario.
Currently there is a feature request open to give the user the possibility to modify the flag "--scale-down-utilization-threshold", but it is not implemented yet.
The workaround I found is to decrease the CPU request (100m instead of 300m) of the pods and have the Horizontal Pod Autoscaler (HPA) create more on demand. This is fine for me but if your application is not suitable for many small instances you are out of luck. Perhaps a cron job that cordons a node if the total utilization is low?
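To see why lowering the request helps, here is a sketch of the utilization check for the node that held the two my-deploy pods. It looks at CPU only (memory is compared against the same threshold), uses the 50% default mentioned above, and assumes 940m of allocatable CPU per node.

```python
ALLOCATABLE_M = 940  # assumed allocatable CPU per node, in millicores


def is_scale_down_candidate(requested_m, allocatable_m, threshold=0.5):
    """True if the node's requested CPU is below the CA threshold."""
    return requested_m / allocatable_m < threshold


# fluentd + kube-proxy + two my-deploy pods
before = 100 + 100 + 300 + 300  # 300m requests: 800m in total
after = 100 + 100 + 100 + 100   # 100m requests: 400m in total

print(is_scale_down_candidate(before, ALLOCATABLE_M))  # about 85%
print(is_scale_down_candidate(after, ALLOCATABLE_M))   # about 43%
```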
I agree that according to [Documentation][1] it seems that "gke-name-cluster-default-pool" could be safely deleted, conditions:
The sum of cpu and memory requests of all pods running on this node is smaller than 50% of the node's allocatable.
All pods running on the node (except these that run on all nodes by default, like manifest-run pods or pods created by DaemonSets) can be moved to other nodes.
It doesn't have scale-down disabled annotation
Therefore the CA should remove it 10 minutes after it is considered unneeded.
However checking the [Documentation][2] I found:
What types of pods can prevent CA from removing a node?
[...]
Kube-system pods that are not run on the node by default, *
[..]
heapster-v1.5.2--- is running on the node and it is a Kube-system pod that is not run on the node by default.
I will update the answer if I discover more interesting information.
UPDATE
The fact that the node is the last one in the zone is not an issue.
To prove it I tested on a brand new cluster with 3 nodes each one in a different zone, one of them was without any workload apart from "kube-proxy" and "fluentd" and was correctly deleted even if it was bringing the size of the zone to zero.
[1]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
[2]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node