How do I fix Kubernetes NodeUnderDiskPressure errors?

After creating a simple hello world deployment, my pod status shows as "PENDING". When I run kubectl describe pod on the pod, I get the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14s (x6 over 29s) default-scheduler 0/1 nodes are available: 1 NodeUnderDiskPressure.
If I check on my node health, I get:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:33 -0700 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:33 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:43 -0700 KubeletHasDiskPressure kubelet has disk pressure
Ready True Fri, 27 Jul 2018 15:17:27 -0700 Fri, 27 Jul 2018 14:13:43 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
So it seems the issue is that "kubelet has disk pressure" but I can't really figure out what that means. I can't SSH into minikube and check on its disk space because I'm using VMWare Workstation with --vm-driver=none.

This is an old question, but since it still doesn't have an answer, I'll write mine.
I was facing this problem and my pods were getting evicted many times because of disk pressure, and commands such as df or du were not much help.
With the help of the answer that I wrote here, I found out that the main problem was the pods' log files: because K8s does not rotate them for you, they can grow to hundreds of gigabytes.
There are different log rotation methods available, but I am still looking for the best practice for K8s, so I can't suggest a specific one yet.
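If you suspect the container logs, a quick way to confirm it is to look for the biggest log files directly on the node. The paths below assume the default Docker json-file log driver, so they may differ on your setup, and <container-id> is a placeholder:
sudo du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -n 10   # largest container log files
sudo du -sh /var/log/pods/* 2>/dev/null | sort -h | tail -n 10                           # per-pod log directories
sudo truncate -s 0 /var/lib/docker/containers/<container-id>/<container-id>-json.log     # emergency: empty a runaway log in place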
I hope this can be helpful.

Personally I couldn't solve the problem using kube commands because ...
It was said to be due to an antivirus (McAfee).
Reinstalling the company-endorsed docker-desktop version solved the problem.

Had a similar issue.
My error log:
Warning FailedScheduling 3m23s default-scheduler 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had taint {node-role.kubernetes.io/controlplane: true}, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate
For me, the / partition was filled to 82%. Cleaning up some unwanted folders resolved the issue.
Commands used:
ssh username@IP_or_hostname (log in to the worker node)
df -h (check the disk usage)
rm -rf folder_name (delete the unwanted folder; this force-deletes it, so make sure you really want to delete it).
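If you are not sure which folders are worth cleaning, something like this points you at the biggest ones first; -x keeps du on the root filesystem:
sudo du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 20   # twenty largest directories on /
df -i /                                                          # disk pressure can also be inode exhaustion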
I hope this can save someone's time.

The community hinted at this in the comments above; I will try to consolidate it here.
The kubelet maps one or more eviction signals to a corresponding node condition.
If a hard eviction threshold has been met, or a soft eviction threshold has been met independent of its associated grace period, the kubelet reports a condition that reflects the node is under pressure.
DiskPressure - Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold
So the problem might be that there is not enough disk space, or that the filesystem has run out of inodes. You have to learn about the conditions of your environment and then apply them in your kubelet configuration.
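For example, the thresholds that turn into a DiskPressure condition are controlled by the kubelet's eviction flags; the values below are only illustrative, not a recommendation:
kubelet ... --eviction-hard=nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%
# with minikube the same settings can be passed through --extra-config, e.g.:
minikube start --vm-driver=none --extra-config=kubelet.eviction-hard="nodefs.available<5%,nodefs.inodesFree<5%"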
You do not need to SSH into minikube, since with --vm-driver=none you are running it directly on your host:
--vm-driver=none - option that runs the Kubernetes components on the host and not in a VM. Docker is required to use this driver, but no hypervisor. If you use --vm-driver=none, be sure to specify a bridge network for Docker. Otherwise it might change between network restarts, causing loss of connectivity to your cluster.
You might try to check if there are some issues related to the mentioned topics:
kubectl describe nodes
Look at df reports:
df -i
df -h
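Because --vm-driver=none reuses the host's Docker, it is also worth checking how much space images and stopped containers take (standard Docker commands; the prune commands delete data, so read the confirmation prompt):
docker system df        # space used by images, containers, local volumes and build cache
docker image prune -a   # remove images not used by any container
docker system prune     # remove stopped containers, dangling images and unused networks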
Some further reading so you can grasp the topic:
Configure Out Of Resource Handling - section Node Conditions.

Related

Kubernetes (k8s) and POD OOMKilled during non-working hours with no memory spikes

This is the situation:
first time on AWS, first time on K8s...
We have a microservice infrastructure.
We had memory issues with pods/containers on JDK 8, so we moved to JDK 11 with OpenJ9;
memory management is far better than before. The issue now is a strange one:
a microservice that is working fine with its assigned memory gets OOMKilled during non-working hours, at random times; today at 5:03 AM.
This POD is configured with QoS = Guaranteed and it is the only one, so eviction is not what is happening.
We also monitored (with Grafana) the other PODs on that NODE, and at the time of the kill none of them has a memory spike.
bitly (the one killed): [screenshot from Grafana]
config: [screenshot from Grafana]
oauth2: [screenshot from Grafana]
redis: [screenshot from Grafana]
customactivity: [screenshot from Grafana]
chat: [screenshot from Grafana]
Does someone have a suggestion where to look?
The OOMKilled is surely not due to an excessive memory request, in my opinion.
Update 15/05/2020:
Killed again this morning:
State: Running
Started: Fri, 15 May 2020 01:11:27 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 13 May 2020 07:32:46 +0000
Finished: Fri, 15 May 2020 01:11:27 +0000
Ready: True
Restart Count: 1
Limits:
memory: 250Mi
Requests:
cpu: 100m
memory: 250Mi
No reason for the OOMKilled: no memory peak or system overload.
The memory limit isn't the same thing as the JVM's heap size; the limit should be larger than the heap. You can use the JVM parameter MaxRAMPercentage or Xmx to leave memory spare for the non-heap parts.
See also https://akobor.me/posts/heap-size-and-resource-limits-in-kubernetes-for-jvm-applications
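As an illustration only (the 250Mi limit is taken from the question; JAVA_OPTS is just an assumed way your image passes flags to the JVM), you could either cap the heap explicitly below the limit or size it as a fraction of the container memory:
JAVA_OPTS="-Xmx160m -Xms160m"           # explicit heap well under the 250Mi limit, leaving room for metaspace, threads, GC, etc.
JAVA_OPTS="-XX:MaxRAMPercentage=70.0"   # heap as a percentage of the container memory limit (JDK 10+, also understood by recent OpenJ9)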

GCP Kubernetes behaviour on System.exit(1)

What is the default behaviour on GCP Kubernetes for Spring Boot applications when
System.exit(1)
is called? Will kubernetes recreate the container?
If not, how can I force recreation if my application crashes?
-Alex
On a non-zero exit code Kubernetes will restart the container inside the POD, and the Pod's state will change in the READY column.
If you want to terminate the pod gracefully, you can have a look at this: https://dzone.com/articles/gracefully-shutting-down-java-in-containers
Terminated: Indicates that the container completed its execution and has stopped running. A container enters into this when it has successfully completed execution or when it has failed for some reason. Regardless, a reason and exit code is displayed, as well as the container’s start and finish time. Before a container enters into Terminated, preStop hook (if any) is executed.
...
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 30 Jan 2019 11:45:26 +0530
Finished: Wed, 30 Jan 2019 11:45:26 +0530
...
Please have a look at the official Kubernetes docs: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
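A quick way to see this behaviour on your own cluster (the pod name my-app is a placeholder):
kubectl get pod my-app -o jsonpath='{.spec.restartPolicy}'   # Always by default, so a crashed container is restarted in place
kubectl get pod my-app                                       # the RESTARTS column goes up after each System.exit(1)
kubectl describe pod my-app                                  # Last State: Terminated with Exit Code 1 shows the previous crash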

How can I run multiple network interfaces on a k8s node?

Running Openshift 4.1 on K8s v1.13.4. I'm trying to add a second network (for NFS storage) to my compute nodes, and as soon as I do, the node stops reporting NodeReady.
See the kubelet logs below. I'm completely lost. How can I add another interface to my nodes?
v1.13.4
FieldPath:""}): type: 'Normal' reason: 'NodeReady' Node compute-0 status is now: NodeReady
Jun 26 05:41:22 compute-0 hyperkube[923]: E0626 05:41:22.367174 923 kubelet_node_status.go:380] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"Dis
...
Jun 26 05:41:22 compute-0 hyperkube[923]: [map[address:10.90.49.111 type:ExternalIP] map[type:ExternalIP address:10.90.51.94] map[address:10.90.49.111 type:InternalIP] map[address:10.90.51.94 type:InternalIP]]
Jun 26 05:41:22 compute-0 hyperkube[923]: doesn't match $setElementOrder list:
The resolution was to delete the node from the cluster with kubectl delete node compute-0, reboot it, and let Ignition rejoin it to the cluster.
This is a known bug.
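Roughly, the workaround looks like this (node name taken from the question; on OpenShift 4.x the Ignition/Machine Config process re-registers the node after the reboot):
kubectl delete node compute-0   # drop the stale node object from the cluster
# reboot compute-0, then watch it rejoin and become Ready again
kubectl get nodes -w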

Kubernetes service cannot send requests to itself

I have a service that, in some contexts, sends requests to itself.
I can reach the service from outside the cluster, but the self-requests fail (time-out).
Environment:
minikube v0.34.1
Linux version 4.15.0 (jenkins#jenkins) (gcc version 7.3.0 (Buildroot 2018.05)) #1 SMP Fri Feb 15 19:27:06 UTC 2019
I've been using https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#a-pod-cannot-reach-itself-via-service-ip as a troubleshooting guide, but I'm down to the step that says "seek help".
Troubleshooting results:
journalctl -u kubelet | grep -i hairpin
Feb 26 19:57:10 minikube kubelet[3066]: W0226 19:57:10.124151 3066 docker_service.go:540] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back to "hairpin-veth"
Feb 26 19:57:10 minikube kubelet[3066]: I0226 19:57:10.124295 3066 docker_service.go:236] Hairpin mode set to "hairpin-veth"
The troubleshooting guide indicates that "hairpin-veth" is OK.
for intf in /sys/devices/virtual/net/docker0/brif/veth*; do cat $intf/hairpin_mode; done
0
...
0
Note that the guide used /sys/devices/virtual/net/cbr0/brif/*, but in this version of minikube, the path is /sys/devices/virtual/net/docker0/brif/veth*. I'd like to understand why the paths are different, but it appears that hairpin_mode is not enabled.
The next step in the guide is: Seek help if none of above works out.
Am I correct in believing that I need to enable hairpin_mode?
If so, how do I do so?
It seems like a known issue; more information here:
As a workaround you can try:
minikube ssh -- sudo ip link set docker0 promisc on
Please share the results.
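After applying the workaround you can check that it took effect and retry the failing self-request (pod, service and port names are placeholders, and the test assumes the image ships wget):
minikube ssh -- ip link show docker0                                 # PROMISC should now appear in the interface flags
kubectl exec <pod-name> -- wget -qO- http://<service-name>:<port>/   # the self-request that was timing out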

kubelet saying node "master01" not found

I'm trying to set up my kubeadm cluster with three masters. I receive this problem from my init command...
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I am not using cgroupfs; I use systemd.
And my kubelet complains about not knowing its node name.
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where the issue is.
The issue can be caused by the Docker version, as only Docker versions up to 18.6 are supported by the latest Kubernetes version, i.e. v1.13.xx.
I also hit the same issue, but it got resolved after downgrading the Docker version from 18.9 to 18.6.
If the problem is not related to Docker, it might be because the kubelet service failed to establish a connection to the API server.
I would first of all check the status of the kubelet with systemctl status kubelet and consider restarting it with systemctl restart kubelet.
If this doesn't help, try re-installing kubeadm or running kubeadm init with another version (use the --kubernetes-version=X.Y.Z flag).
In my case, my k8s version was 1.21.1 and my Docker version was 19.03. I solved this bug by upgrading Docker to version 20.7.
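Since the question already mentions cgroupfs vs systemd, it is also worth ruling out a cgroup-driver mismatch between Docker and the kubelet; the config path below assumes a kubeadm-managed node:
docker info --format '{{.CgroupDriver}}'                      # driver Docker is actually using
grep -i cgroup /var/lib/kubelet/config.yaml                   # cgroupDriver the kubelet was configured with
journalctl -u kubelet --since "10 min ago" | grep -i cgroup   # mismatch errors usually show up here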