This is getting out of hand. I have a well-specced GKE cluster, yet I keep getting timeouts waiting for mount paths to be created. I posted this issue on GitHub, but they said it would be better to post it on SO. Please help me fix this.
2m 2m 1 {scheduler } Scheduled Successfully assigned mongodb-shard1-master-gp0qa to gke-cluster-1-micro-a0f27b19-node-0p2j
1m 1m 1 {kubelet gke-cluster-1-micro-a0f27b19-node-0p2j} FailedMount Unable to mount volumes for pod "mongodb-shard1-master-gp0qa_default": Could not attach GCE PD "shard1-node1-master". Timeout waiting for mount paths to be created.
1m 1m 1 {kubelet gke-cluster-1-micro-a0f27b19-node-0p2j} FailedSync Error syncing pod, skipping: Could not attach GCE PD "shard1-node1-master". Timeout waiting for mount paths to be created.
This problem has been documented several times, for example here https://github.com/kubernetes/kubernetes/issues/14642. Kubernetes v1.3.0 should have a fix.
As a workaround (in GCP) you can restart your VMs.
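If you prefer doing that from the command line, resetting the node VM with gcloud should achieve the same thing (the node name below is taken from your events; the zone is a placeholder):
gcloud compute instances reset gke-cluster-1-micro-a0f27b19-node-0p2j --zone <your-zone>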
Hope this helps!
It's possible that your GCE service account may not be authorized on your project. Try re-adding $YOUR_PROJECT_NUMBER-compute@developer.gserviceaccount.com as "Can edit" on the Permissions page of the Developers Console.
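A rough gcloud equivalent, in case the console UI has moved around (project ID and project number are placeholders):
gcloud projects add-iam-policy-binding <your-project-id> --member serviceAccount:<your-project-number>-compute@developer.gserviceaccount.com --role roles/editor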
I ran into this recently, and the issue ended up being that the application running inside the Docker container was actually shutting down immediately. This caused GCE to try to restart it, but it would fail when GCE tried to attach the disk (already attached).
So it seems like a bit of a bug in GCE, but don't go down the rabbit hole trying to figure that out; I ended up running things locally and debugging the crash using local volume mounts.
This is an old question, but I'd like to share how I fixed the problem: I manually unmounted the problematic disks from their hosts via the Google Cloud console.
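The equivalent from the command line would be roughly (instance, disk and zone names are placeholders):
gcloud compute instances detach-disk <node-vm-name> --disk <stuck-disk-name> --zone <zone>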
Related
I run my cluster on GKE Standard mode.
All services were fine until I tried to deploy a new service.
The pod status is stuck at "ContainerCreating",
and when I describe the pod I get:
Warning FailedMount 0s kubelet MountVolume.SetUp
failed for volume "kube-api-access-gr995" : write
/var/lib/kubelet/pods/867eb785-4347-4439-8a22-1be71d8985f5/volumes/kubernetes.io~projected/kube-api-access-gr995/..2022_07_27_17_22_26.024529456/namespace:
no space left on device
I already tried deleting the deployments and redeploying, but it didn't work. It seems like my master's disk is full, but since I use GKE I can't SSH in or do anything about it.
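I assume I could at least confirm from kubectl alone whether a node is actually reporting disk pressure, something like (node name is a placeholder):
kubectl get nodes
kubectl describe node <node-name> | grep -i diskpressure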
I'm trying to run a simple flexvolume plugin driver on a Windows node to enable connectivity with an external SMB share. I followed the steps listed here:
https://github.com/microsoft/K8s-Storage-Plugins/tree/master/flexvolume/windows
I placed the driver plugin in the mentioned path, but the problem is that the plugin is not getting picked up by GKE. The error details are below.
Warning FailedMount 8s (x2 over 21s) kubelet, gke-windows-node-pool-e4e7a7bf-f2pc Unable to attach or mount volumes: unmounted volumes=[smb-volume], unattached volumes=[default-token-jf28b smb-volume]: failed to get Plugin from volumeSpec for volume "smb-volume" err=no volume plugin matched
Not sure what I'm missing here. Any help would be great. Thanks in Advance.
I just faced a similar issue on an on-prem kubeadm configuration and used Process Monitor to find the proper location where the kubelet.exe process looks for volume plugins.
As a result, my actual Windows node SMB preparation is:
curl -L https://github.com/microsoft/K8s-Storage-Plugins/releases/download/V0.0.3/flexvolume-windows.zip -o flexvolume-windows.zip
Expand-Archive flexvolume-windows.zip C:\var\lib\kubelet\usr\libexec\kubernetes\kubelet-plugins\volume\exec\
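To double-check that kubelet can actually see the extracted driver (and assuming kubelet runs as a Windows service named kubelet, which is not guaranteed on every setup):
# list the extracted vendor~driver folders kubelet should pick up
Get-ChildItem -Recurse C:\var\lib\kubelet\usr\libexec\kubernetes\kubelet-plugins\volume\exec\
# restart kubelet so it rescans the flexvolume plugin directory
Restart-Service kubelet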
My Jenkins X installation, mid-project, is now becoming very unstable. (Mainly) Jenkins pods are failing to start due to disk pressure.
Commonly, many pods are failing with
The node was low on resource: [DiskPressure].
or
0/4 nodes are available: 1 Insufficient cpu, 1 node(s) had disk pressure, 2 node(s) had no available volume zone.
Unable to mount volumes for pod "jenkins-x-chartmuseum-blah": timeout expired waiting for volumes to attach or mount for pod "jx"/"jenkins-x-chartmuseum-blah". list of unmounted volumes=[storage-volume]. list of unattached volumes=[storage-volume default-token-blah]
Multi-Attach error for volume "pvc-blah" Volume is already exclusively attached to one node and can't be attached to another
This may have become more pronounced with more preview builds for projects using npm and the massive node_modules directories it generates. I'm also not sure whether Jenkins is cleaning up after itself.
Rebooting the nodes helps, but not for very long.
Let's approach this from the Kubernetes side.
There are a few things you could do to fix this:
As mentioned by @Vasily, check what is causing the disk pressure on the nodes. You may also need to check logs from:
kubectl logs: kube-scheduler event logs
journalctl -u kubelet: kubelet logs
/var/log/kube-scheduler.log
More about why those logs matter below.
Check your Eviction Thresholds. Adjust Kubelet and Kube-Scheduler configuration if needed. See what is happening with both of them (logs mentioned earlier might be useful now). More info can be found here
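For example, on a self-managed node the eviction thresholds are ordinary kubelet flags (the values below are purely illustrative, and on GKE the kubelet configuration is managed for you):
kubelet --eviction-hard="memory.available<100Mi,nodefs.available<10%,imagefs.available<15%" --eviction-minimum-reclaim="nodefs.available=500Mi,imagefs.available=2Gi"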
Check whether you have a correctly running Horizontal Pod Autoscaler: kubectl get hpa
You can use standard kubectl commands to set up and manage your HPA.
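For example (the deployment name and target values below are placeholders):
kubectl get hpa                                                    # list autoscalers and their current/target metrics
kubectl autoscale deployment <your-deployment> --cpu-percent=80 --min=1 --max=5
kubectl describe hpa <your-deployment>                             # the events here often explain why scaling is not happening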
Finally, the volume-related errors that you receive indicate that there might be a problem with the PVC and/or PV. Make sure your volume is in the same zone as the node. If you want to mount the volume to a specific container, make sure it is not exclusively attached to another one. More info can be found here and here
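A few quick checks for that (PV and node names are placeholders; VolumeAttachment objects are only present when a CSI driver is in use):
kubectl get pvc,pv -o wide                       # confirm the claim is Bound and find the backing PV
kubectl describe pv <pv-name> | grep -i zone     # the disk's zone label should match the node's zone
kubectl get volumeattachment                     # shows which node currently holds the volume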
I did not test it myself because more info is needed in order to reproduce the whole scenario, but I hope the above suggestions will be useful.
Please let me know if that helped.
Worker node is getting into "NotReady" state with an error in the output of kubectl describe node:
ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded
Environment:
Ubuntu 16.04 LTS
Kubernetes version: v1.13.3
Docker version: 18.06.1-ce
There is a closed issue on this on the Kubernetes GitHub (k8 git), which was closed on the grounds of being related to a Docker issue.
Steps done to troubleshoot the issue:
kubectl describe node - the error in question was found (root cause isn't clear).
journalctl -u kubelet - shows this related message:
skipping pod synchronization - [container runtime status check may not have completed yet PLEG is not healthy: pleg has yet to be successful]
It is related to this open k8s issue: Ready/NotReady with PLEG issues.
Checked node health on AWS with CloudWatch - everything seems to be fine.
journalctl -fu docker.service - checked Docker for errors/issues; the output doesn't show any errors related to that.
systemctl restart docker - after restarting Docker, the node gets into the "Ready" state but becomes "NotReady" again within 3-5 minutes.
It all seems to have started when I deployed more pods to the node (close to its resource capacity, though I don't think it is a direct dependency) or was stopping/starting instances (after a restart it is OK, but after some time the node is NotReady again).
Questions:
What is the root cause of the error?
How to monitor that kind of issue and make sure it doesn't happen?
Are there any workarounds to this problem?
What is the root cause of the error?
From what I was able to find it seems like the error happens when there is an issue contacting Docker, either because it is overloaded or because it is unresponsive. This is based on my experience and what has been mentioned in the GitHub issue you provided.
How to monitor that kind of issue and make sure it doesn't happen?
There seems to be no clear mitigation or monitoring for this, but the best approach would be to make sure your node will not be overloaded with pods. I have seen that it does not always show up as disk or memory pressure on the node; it is probably a matter of Docker not having enough resources, so it fails to respond in time. The proposed solution is to set limits for your pods to prevent overloading the node.
In the case of managed Kubernetes on GKE (not sure, but other vendors probably have a similar feature) there is a feature called node auto-repair. It will not prevent node pressure or Docker-related issues, but when it detects an unhealthy node it can drain and recreate the node(s).
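On GKE it is enabled per node pool, roughly (pool, cluster and zone names are placeholders):
gcloud container node-pools update <pool-name> --cluster <cluster-name> --zone <zone> --enable-autorepair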
If you already have resource requests and limits set, it seems like the best way to make sure this does not happen is to increase the memory requests for your pods. This will mean fewer pods per node, and the actual memory used on each node should be lower.
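One way to bump the requests without hand-editing manifests, roughly (the deployment name, container name and values are placeholders):
kubectl set resources deployment <your-deployment> -c <container-name> --requests=memory=512Mi,cpu=250m --limits=memory=1Gi,cpu=500m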
Another way of monitoring/recognizing this is to SSH into the node and check the memory, the processes with ps, the syslog, and the output of docker stats --all.
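Roughly, once on the node (assuming an Ubuntu node like the one described):
free -m                            # overall memory usage on the node
ps aux --sort=-%mem | head -n 15   # largest memory consumers
tail -n 100 /var/log/syslog        # recent kubelet/docker messages
docker stats --all --no-stream     # per-container resource usage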
I got the same issue. I cordoned the node and evicted the pods,
then rebooted the server, and the node automatically came back into the Ready state.
When deploying a service to Kubernetes/GKE kubectl describe pod indicates the following error (as occurring after the image was successfully pulled):
{kubelet <id>} Warning FailedSync Error syncing pod,
skipping: failed to "StartContainer" for "<id>" with CrashLoopBackOff:
"Back-off 20s restarting failed container=<id>"
{kubelet <id>}
spec.containers{id} Warning BackOff restarting failed docker container.
I have checked various log files (such as /var/log/kubelet.log and /var/log/docker.log) on the node where the pod is executing, but did not find anything more specific.
What does the error message indicate, and how can I further diagnose and solve the problem?
The problem might be related to mounting a PD. I can both successfully docker run the image from Cloud Shell (without the PD) and mount the PD after adding it to a GCE VM instance. So apparently it's neither caused by the image nor by the PD in isolation.
The root cause of this was apparently that the PD did not contain a directory that was the target of a symbolic link required by the application running inside the image. This caused the application to terminate and, in effect, the container to stop, which apparently was reported as a failed Docker container in the log output shown above.
After creating the directory (by attaching the drive to a separate VM instance and mounting it there just for that purpose) this specific problem disappeared (only to be followed by this one for now :)
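For anyone needing to do the same, the attach/mount steps are roughly the following (instance, disk and zone names are placeholders, and the device path assumes the PD's device name matches the disk name):
gcloud compute instances attach-disk <temp-vm> --disk <pd-name> --zone <zone>
# then, on <temp-vm>:
sudo mkdir -p /mnt/pd && sudo mount /dev/disk/by-id/google-<pd-name> /mnt/pd
sudo mkdir -p /mnt/pd/<missing-symlink-target>   # create the directory the application expects
sudo umount /mnt/pd
gcloud compute instances detach-disk <temp-vm> --disk <pd-name> --zone <zone>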