I run my cluster on GKE in Standard mode.
All services were fine until I tried to deploy a new service. The pod status is always "ContainerCreating",
and when I describe the pod I get:
Warning FailedMount 0s kubelet MountVolume.SetUp
failed for volume "kube-api-access-gr995" : write
/var/lib/kubelet/pods/867eb785-4347-4439-8a22-1be71d8985f5/volumes/kubernetes.io~projected/kube-api-access-gr995/..2022_07_27_17_22_26.024529456/namespace:
no space left on device
I already tried deleting the deployments and redeploying them, but that didn't work. It seems like my master's disk is full, but since I'm using GKE I can't SSH in or do anything about it.
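A minimal diagnostic sketch, assuming the full disk is on the worker node running the pod (the /var/lib/kubelet path in the event points at the node rather than the control plane); node and zone names below are placeholders, and on GKE Standard the nodes are ordinary Compute Engine VMs you can SSH into:

kubectl get nodes
kubectl describe node <node-name> | grep -i -A6 Conditions   # look for DiskPressure
gcloud compute ssh <node-name> --zone <zone>
df -h /var/lib/kubelet                                       # check boot-disk usage on the node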
I'm trying to install and configure Velero for kubernetes backup. I have followed the link to configure it in my GKE cluster. The installation went fine, but velero is not working.
I am using google cloud shell for running all my commands (I have installed and configured velero client in my google cloud shell)
On further inspection of the velero deployment and the velero pods, I found out that it is not able to pull the image from the Docker repository.
kubectl get pods -n velero
NAME READY STATUS RESTARTS AGE
velero-5489b955f6-kqb7z 0/1 Init:ErrImagePull 0 20s
Error from velero pod (kubectl describe pod) (output redacted for readability - only relevant info shown below)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 38s default-scheduler Successfully assigned velero/velero-5489b955f6-kqb7z to gke-gke-cluster1-default-pool-a354fba3-8674
Warning Failed 22s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Failed to pull image "velero/velero-plugin-for-gcp:v1.1.0": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 22s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Error: ErrImagePull
Normal BackOff 21s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Back-off pulling image "velero/velero-plugin-for-gcp:v1.1.0"
Warning Failed 21s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Error: ImagePullBackOff
Normal Pulling 8s (x2 over 37s) kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Pulling image "velero/velero-plugin-for-gcp:v1.1.0"
Command used to install velero: (some of the values are given as variables)
velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.1.0 \
--bucket $storagebucket \
--secret-file ~/velero-backup-storage-sa-key.json
Velero Version
velero version
Client:
Version: v1.4.2
Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5
<error getting server version: timed out waiting for server status request to be processed>
GKE version
v1.15.12-gke.2
Isn't this a Private Cluster? – mario 31 mins ago
@mario this is a private cluster, but I can deploy other services without any issues (e.g. I have deployed nginx successfully) – Sreesan 15 mins ago
Well, this is a known limitation of GKE Private Clusters. As you can read in the documentation:
Can't pull image from public Docker Hub
Symptoms
A Pod running in your cluster displays a warning in kubectl describe such as Failed to pull image: rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Potential causes
Nodes in a private cluster do not have outbound access to the public
internet. They have limited access to Google APIs and services,
including Container Registry.
Resolution
You cannot fetch images directly from Docker Hub. Instead, use images hosted on Container Registry. Note that while Container Registry's Docker Hub mirror is accessible from a private cluster, it should not be exclusively relied upon. The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub.
You can also compare it with this answer.
It can easily be verified on your own with a simple experiment. Try to run two different nginx deployments: the first based on the image nginx (which is equivalent to nginx:latest) and the second based on nginx:1.14.2.
While the first scenario works because the nginx:latest image can be pulled from Container Registry's Docker Hub mirror, which is accessible from a private cluster, any attempt to pull nginx:1.14.2 will fail, which you'll see in the Pod events. It happens because the kubelet is not able to find this version of the image in GCR, so it tries to pull it from the public Docker registry (https://registry-1.docker.io/v2/), which is not possible in Private Clusters. "The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub." - as you can read in the docs.
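A minimal sketch of that experiment (deployment names are arbitrary):

kubectl create deployment nginx-latest --image=nginx          # pulled from the Docker Hub mirror, works
kubectl create deployment nginx-pinned --image=nginx:1.14.2   # not cached in the mirror, ends up in ImagePullBackOff
kubectl get pods -w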
If you still have doubts, just SSH into your node and try to run the following commands:
curl https://cloud.google.com/container-registry/
curl https://registry-1.docker.io/v2/
While the first one works perfectly, the second one will eventually fail:
curl: (7) Failed to connect to registry-1.docker.io port 443: Connection timed out
The reason? "Nodes in a private cluster do not have outbound access to the public internet."
Solution?
You can search what is currently available in GCR here.
In many cases you should be able to get the required image if you don't specify its exact version (by default the latest tag is used). While that can help with nginx, unfortunately no version of velero/velero-plugin-for-gcp is currently available in Google Container Registry's Docker Hub mirror.
Granting private nodes outbound internet access by using Cloud NAT seems the only reasonable solution that can be applied in your case.
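A minimal sketch of enabling Cloud NAT for the cluster's network (the router, NAT, network, and region names below are placeholders, not values taken from the question):

gcloud compute routers create nat-router \
    --network default \
    --region us-central1
gcloud compute routers nats create nat-config \
    --router nat-router \
    --region us-central1 \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges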
I solved this problem by realizing that the versioning of velero/velero-plugin-for-gcp does not follow the versioning of velero/velero.
For example, at the time of writing the latest versions are velero/velero:v1.9.1 and velero/velero-plugin-for-gcp:v1.5.0.
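A sketch of an install command pinning matching versions (velero install accepts an --image flag for the server image; the bucket and key file are the same placeholders as in the question):

velero install \
    --provider gcp \
    --image velero/velero:v1.9.1 \
    --plugins velero/velero-plugin-for-gcp:v1.5.0 \
    --bucket $storagebucket \
    --secret-file ~/velero-backup-storage-sa-key.json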
After uninstalling Calico with kubectl delete -f calico.yaml, I am not able to create new pods in the cluster. Any new pods in the cluster are stuck in the ContainerCreating state. kubectl describe shows the errors below:
Warning FailedCreatePodSandBox 2m kubelet, 10.0.12.2 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "f15743177fd70c5eabf70c60be5b8b354e5346837d1b5d59bf99d1d1b5d6416c" network for pod "test-9465-768b57b5df-fv9d4": NetworkPlugin cni failed to set up pod "test-9465-768b57b5df-fv9d4_policy-demo" network: error getting ClusterInformation: connection is unauthorized: Unauthorized, failed to clean up sandbox container "f15743177fd70c5eabf70c60be5b8b354e5346837d1b5d59bf99d1d1b5d6416c" network for pod "test-9465-768b57b5df-fv9d4": NetworkPlugin cni failed to teardown pod "test-9465-768b57b5df-fv9d4_policy-demo" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
The main issue is that Calico has an init container but does not have a cleanup container.
To undeploy Calico, we have to do the usual kubectl delete -f <yaml>, and then delete the Calico conf file in each node's /etc/cni/net.d/ directory. This configuration file, along with other binaries, is loaded onto the host by the init container.
https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/
From this link we can see that the kubelet reads the configuration file from the default directory, and if there are multiple configuration files, it applies the CNI plugin from the config file that appears first in alphabetical order (why, oh god why??).
So, in our case, after uninstalling Calico its admin privileges are removed, but the nodes still try to apply Calico rules based on the config file picked up from the default directory; restarting the node also gets rid of the leftover iptables rules.
Removing the file and restarting the node solves the issue and we get back to normal behavior. Another way to solve the same problem, if you are on a managed Kubernetes cluster, is to simply terminate the node: the cloud infrastructure automatically boots up another node to keep the cluster at the same size, and the replacement no longer has the Calico configuration file.
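A minimal sketch of that cleanup, assuming the default Calico file names under /etc/cni/net.d/ (they may differ in your installation):

kubectl delete -f calico.yaml
# then, on each node:
sudo rm /etc/cni/net.d/10-calico.conflist /etc/cni/net.d/calico-kubeconfig
sudo systemctl restart kubelet    # or simply reboot / terminate the node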
My Jenkins X installation, mid-project, is now becoming very unstable. (Mainly) Jenkins pods are failing to start due to disk pressure.
Commonly, many pods are failing with
The node was low on resource: [DiskPressure].
or
0/4 nodes are available: 1 Insufficient cpu, 1 node(s) had disk pressure, 2 node(s) had no available volume zone.
Unable to mount volumes for pod "jenkins-x-chartmuseum-blah": timeout expired waiting for volumes to attach or mount for pod "jx"/"jenkins-x-chartmuseum-blah". list of unmounted volumes=[storage-volume]. list of unattached volumes=[storage-volume default-token-blah]
Multi-Attach error for volume "pvc-blah" Volume is already exclusively attached to one node and can't be attached to another
This may have become more pronounced with more preview builds for projects using npm and the massive node_modules directories it generates. I'm also not sure if Jenkins is cleaning up after itself.
Rebooting the nodes helps, but not for very long.
Let's approach this from the Kubernetes side.
There are a few things you could do to fix this:
As mentioned by @Vasily, check what is causing the disk pressure on the nodes. You may also need to check logs from:
kubectl logs: kube-scheduler event logs
journalctl -u kubelet: kubelet logs
/var/log/kube-scheduler.log
More on why those logs are relevant below.
Check your Eviction Thresholds. Adjust Kubelet and Kube-Scheduler configuration if needed. See what is happening with both of them (logs mentioned earlier might be useful now). More info can be found here
Check if you have a correctly running Horizontal Pod Autoscaler: kubectl get hpa
You can use standard kubectl commands to setup and manage your HPA.
Finally, the volume-related errors that you receive indicate that there might be a problem with the PVC and/or PV. Make sure your volume is in the same zone as the node. If you want to mount the volume into a pod, make sure the volume is not already exclusively attached to another node. More info can be found here and here. A few example commands for these checks are sketched below.
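A minimal sketch of those checks (the node name is a placeholder):

kubectl describe node <node-name> | grep -i -A6 Conditions   # look for DiskPressure=True
kubectl get hpa --all-namespaces
kubectl get pv -o wide                                       # compare the volumes' zone labels with the node's zone
kubectl get pvc --all-namespaces
kubectl get volumeattachments                                # shows which node each volume is currently attached to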
I did not test this myself because more info is needed in order to reproduce the whole scenario, but I hope the above suggestions will be useful.
Please let me know if that helped.
When deploying a service to Kubernetes/GKE kubectl describe pod indicates the following error (as occurring after the image was successfully pulled):
{kubelet <id>} Warning FailedSync Error syncing pod,
skipping: failed to "StartContainer" for "<id>" with CrashLoopBackOff:
"Back-off 20s restarting failed container=<id>"
{kubelet <id>}
spec.containers{id} Warning BackOff restarting failed docker container.
I have checked various log files (such as /var/log/kubelet.log and /var/log/docker.log) on the node where the pod is executing but did not find anything more specific.
What does the error message indicate, and how can I further diagnose and solve the problem?
The problem might be related to mounting a PD. I can both successfully docker run the image from Cloud Shell (without the PD) and mount the PD after adding it to a GCE VM instance. So apparently it's neither caused by the image nor the PD in isolation.
The root cause for this was apparently that the PD did not contain a directory which was the target of a symbolic link required by the application running inside the image. This caused the application to terminate and, as a result, the container to stop, which was apparently reported as a failed docker container in the shown log output.
After creating the directory (by attaching the drive to a separate VM instance and mounting it there just for that purpose) this specific problem disappears (only to be followed by this one for now :)
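A minimal sketch for diagnosing this kind of crash loop (the pod name is a placeholder); the terminated container's own output is usually more telling than the kubelet or docker logs:

kubectl describe pod <pod-name>      # shows the exit code and restart count of the failed container
kubectl logs <pod-name> --previous   # stdout/stderr of the last crashed container instance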
This is getting out of hand... I have a well-specced GKE cluster, yet I'm getting timeouts for mount paths. I have posted this issue on GitHub, but they said it would be better to post it on SO. Please fix this.
2m 2m 1 {scheduler } Scheduled Successfully assigned mongodb-shard1-master-gp0qa to gke-cluster-1-micro-a0f27b19-node-0p2j
1m 1m 1 {kubelet gke-cluster-1-micro-a0f27b19-node-0p2j} FailedMount Unable to mount volumes for pod "mongodb-shard1-master-gp0qa_default": Could not attach GCE PD "shard1-node1-master". Timeout waiting for mount paths to be created.
1m 1m 1 {kubelet gke-cluster-1-micro-a0f27b19-node-0p2j} FailedSync Error syncing pod, skipping: Could not attach GCE PD "shard1-node1-master". Timeout waiting for mount paths to be created.
This problem has been documented several times, for example here https://github.com/kubernetes/kubernetes/issues/14642. Kubernetes v1.3.0 should have a fix.
As a workaround (in GCP) you can restart your VMs.
Hope this helps!
It's possible that your GCE service account may not be authorized on your project. Try re-adding $YOUR_PROJECT_NUMBER-compute@developer.gserviceaccount.com as "Can edit" on the Permissions page of the Developers Console.
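An equivalent sketch using the gcloud CLI, assuming the Editor role is what needs to be restored (the project ID and number are placeholders):

gcloud projects add-iam-policy-binding $YOUR_PROJECT_ID \
    --member="serviceAccount:$YOUR_PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/editor"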
I ran into this recently, and the issue ended up being that the application running inside the docker container was actually shutting down immediately. This caused GCE to try to restart it, but the restart would fail when GCE tried to attach the disk (already attached).
So it seems like a bit of a bug in GCE, but don't go down the rabbit hole trying to figure that out; I ended up running things locally and debugging the crash using local volume mounts.
This is an old question, but I'd like to share how I fixed the problem: I manually un-mounted (detached) the problematic disks from their host via the Google Cloud console.
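The same operation can be sketched with the gcloud CLI (the instance, disk, and zone names are placeholders):

gcloud compute instances detach-disk <instance-name> --disk <disk-name> --zone <zone>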