Deploying service to GKE/Kubernetes leading to FailedSync error

When deploying a service to Kubernetes/GKE, kubectl describe pod indicates the following error (occurring after the image was successfully pulled):
{kubelet <id>} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "<id>" with CrashLoopBackOff: "Back-off 20s restarting failed container=<id>"
{kubelet <id>} spec.containers{id} Warning BackOff restarting failed docker container.
I have checked various log files (such as /var/log/kubelet.log and /var/log/docker.log) on the node where the pod is executing, but did not find anything more specific.
What does the error message indicate, and how can I further diagnose and solve the problem?
The problem might be related to mounting a PD. I can both successfully docker run the image from Cloud Shell (without the PD) and mount the PD after attaching it to a GCE VM instance, so apparently the issue is caused by neither the image nor the PD in isolation.

The root cause was apparently that the PD did not contain a directory which was the target of a symbolic link required by the application running inside the image. This caused the application to terminate and, as a consequence, the container to stop, which was apparently reported as a failed docker container in the events shown above.
After creating the directory (by attaching the disk to a separate VM instance and mounting it there just for that purpose), this specific problem disappeared (only to be followed by this one for now :)
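For anyone hitting a similar CrashLoopBackOff, a minimal diagnostic sketch using standard kubectl/docker commands (the pod name is a placeholder) that would have surfaced the application's exit reason:

kubectl logs <pod-name> --previous        # logs of the previous, crashed container instance
kubectl describe pod <pod-name>           # last state, exit code and restart count of the container
# On the node itself (docker-based clusters), inspect the exited container directly:
docker ps -a | grep <pod-name>
docker logs <container-id>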

Related

GKE Standard no space left on device

I run my cluster in GKE Standard mode.
All services were fine until I tried to deploy a new service.
The pod status is always "ContainerCreating",
and when I describe it I get:
Warning  FailedMount  0s  kubelet  MountVolume.SetUp failed for volume "kube-api-access-gr995" : write /var/lib/kubelet/pods/867eb785-4347-4439-8a22-1be71d8985f5/volumes/kubernetes.io~projected/kube-api-access-gr995/..2022_07_27_17_22_26.024529456/namespace: no space left on device
I have already tried deleting the deployments and redeploying, but that did not work. It seems like my master's disk is full, but since I use GKE I can't SSH in or do anything about it.
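A hedged diagnostic sketch (standard kubectl commands; the node name is a placeholder) for confirming whether the node running that kubelet is actually under disk pressure:

kubectl describe node <node-name> | grep -A 10 "Conditions:"          # look for DiskPressure=True
kubectl get events -A --field-selector reason=FailedMount             # recent FailedMount events cluster-wide
kubectl describe node <node-name> | grep -A 6 "Allocated resources:"  # requested vs. allocatable ephemeral storage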

fluentd daemon set container for papertrail failing to start in kubernetes cluster

I am trying to set up fluentd in a kubernetes cluster to aggregate logs in papertrail, as per the documentation provided here.
The configuration file is fluentd-daemonset-papertrail.yaml
It basically creates a daemon set for fluentd container and a config map for fluentd configuration.
When I apply the configuration, the pod is assigned to a node and the container is created. However, it is either not completing its initialization or the pod gets killed immediately after it is started.
As the pods are getting killed, I am losing the logs too, so I couldn't investigate the cause of the issue.
Looking through the events for the kube-system namespace shows the errors below:
Error: failed to start container "fluentd": Error response from daemon: OCI runtime create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/75026/ns/ipc\" caused \"lstat /proc/75026/ns/ipc: no such file or directory\"": unknown
Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9559643bf77e29d270c23bddbb17a9480ff126b0b6be10ba480b558a0733161c" network for pod "fluentd-papertrail-b9t5b": NetworkPlugin kubenet failed to set up pod "fluentd-papertrail-b9t5b_kube-system" network: Error adding container to network: failed to open netns "/proc/111610/ns/net": failed to Statfs "/proc/111610/ns/net": no such file or directory
I am not sure what's causing these errors. I would appreciate any help to understand and troubleshoot them.
Also, is it possible to look at logs/events that could tell us why a pod is given a terminate signal?
Please ensure that /etc/cni/net.d and /opt/cni/bin both exist and are correctly populated with the CNI configuration files and binaries on all nodes.
Take a look: sandbox.
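A minimal check, assuming the default kubelet CNI paths mentioned above (run it on every node):

ls /etc/cni/net.d/   # should contain at least one *.conf or *.conflist file
ls /opt/cni/bin/     # should contain the CNI plugin binaries (bridge, host-local, loopback, ...)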
With help from the Papertrail support team, I was able to resolve the issue by removing the entry below from the manifest file.
kubernetes.io/cluster-service: "true"
The above annotation seems to have been deprecated.
Relevant github issues:
https://github.com/fluent/fluentd-kubernetes-daemonset/issues/296
https://github.com/kubernetes/kubernetes/issues/72757
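A hedged sketch of removing that entry without editing the manifest by hand, assuming the label lives in the DaemonSet's pod template labels and the DaemonSet is named fluentd-papertrail (both assumptions; adjust to your setup):

kubectl -n kube-system patch daemonset fluentd-papertrail --type=json \
  -p='[{"op":"remove","path":"/spec/template/metadata/labels/kubernetes.io~1cluster-service"}]'
# "~1" is the JSON Pointer escape for the "/" in the key kubernetes.io/cluster-service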

Kubeflow: Image Pull --> no space left on device

Is there any way to clean all cached docker images, etc. from a kubernetes setup, to free up space on the master nodes?
I am trying to install a deployment, but Kubernetes reports “no space left on device” while pulling images.
I am kind of surprised that an 80 GB disk is not enough for one simple deployment, because the cluster has otherwise been completely emptied.
Does anyone have an idea how to wipe out all unused docker images, etc.?
Thanks a lot!
Successfully pulled image "tensorflow/serving:1.11.1"
Warning Failed 4m30s kubelet, 192.168.10.37 Failed to pull image "gcr.io/kubeflow-images-public/tf-model-server-http-proxy:v20180606-9dfda4f2": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /usr/lib/python3.5/idlelib/__pycache__/CodeContext.cpython-35.pyc: no space left on device
Warning Failed 4m27s (x3 over 4m29s) kubelet, 192.168.10.37 Error: ImagePullBackOff
You can run docker image prune to clean up unused images, or docker system prune to clean up all unused docker resources.
You can also configure the garbage collection feature of Kubernetes.
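A short sketch of what the cleanup can look like on the affected node (standard docker CLI flags):

docker image prune -a           # remove all images not referenced by any container (reclaims the most space)
docker system prune --volumes   # also remove stopped containers, unused networks, build cache and volumes

For automatic cleanup, the kubelet's image garbage collection can be tuned with the --image-gc-high-threshold and --image-gc-low-threshold flags.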

Kubernetes Error: System error: Activation of org.freedesktop.systemd1 timed out

I got an error when scheduling a pod through a ReplicationController:
failedSync {kubelet 10.9.8.21} Error syncing pod, skipping: API error (500): Cannot start container 20c2fe3a3e5b5204db4475d1ce6ea37b3aea6da0762a214b9fdb3d624fd5c32c: [8] System error: Activation of org.freedesktop.systemd1 timed out
The pod is scheduled but cannot run unless I re-deploy it with another image.
I'm using kubelet 1.0.1, CoreOS v773.1.0
The part that says Error syncing pod, skipping: API error means that kubelet got an error when trying to start a container for your Pod.
Since you use CoreOS, I think you are using rkt, not docker.
I think that rkt uses systemd to start containers.
And I think systemd crashes when the "unit" name starts with an underscore:
https://github.com/coreos/go-systemd/pull/49
So, maybe one of your pods or containers has a name that starts with an underscore. Change that.
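A quick, hedged way to act on that suggestion and scan pod and container names for underscores:

# Print "<pod>  <container names>" for every pod and flag any name containing an underscore
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' | grep '_'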

Could not attach GCE PD, Timeout waiting for mount paths

This is getting out of hand... I have good specs on GKE, yet I'm getting a timeout for mount paths. I have posted this issue on GitHub, but they said it would be better posted on SO. Please fix this...
2m 2m 1 {scheduler } Scheduled Successfully assigned mongodb-shard1-master-gp0qa to gke-cluster-1-micro-a0f27b19-node-0p2j
1m 1m 1 {kubelet gke-cluster-1-micro-a0f27b19-node-0p2j} FailedMount Unable to mount volumes for pod "mongodb-shard1-master-gp0qa_default": Could not attach GCE PD "shard1-node1-master". Timeout waiting for mount paths to be created.
1m 1m 1 {kubelet gke-cluster-1-micro-a0f27b19-node-0p2j} FailedSync Error syncing pod, skipping: Could not attach GCE PD "shard1-node1-master". Timeout waiting for mount paths to be created.
This problem has been documented several times, for example here https://github.com/kubernetes/kubernetes/issues/14642. Kubernetes v1.3.0 should have a fix.
As a workaround (in GCP) you can restart your VMs.
Hope this helps!
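If you take the restart route, a sketch of the equivalent gcloud command (instance name and zone are placeholders):

gcloud compute instances reset <instance-name> --zone <zone>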
It's possible that your GCE service account may not be authorized on your project. Try re-adding $YOUR_PROJECT_NUMBER-compute@developer.gserviceaccount.com as "Can edit" on the Permissions page of the Developers Console.
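The Developers Console UI has changed since then; a hedged modern equivalent using gcloud, where roles/editor corresponds to the old "Can edit" permission (project ID and number are placeholders):

gcloud projects add-iam-policy-binding <project-id> \
  --member="serviceAccount:<project-number>-compute@developer.gserviceaccount.com" \
  --role="roles/editor"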
I ran into this recently, and the issue ended up being that the application running inside the docker container was actually shutting down immediately. This caused GCE to try to restart it, but it would fail when GCE tried to attach the disk (it was already attached).
So it seems like a bit of a bug in GCE, but don't go down the rabbit hole trying to figure that out; I ended up running things locally and debugging the crash using local volume mounts.
This is an old question, but I would like to share how I fixed the problem: I manually un-mounted the problematic disks from their host via the Google Cloud console.
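A sketch of the same action from the command line, assuming gcloud access (instance, disk and zone names are placeholders):

gcloud compute instances detach-disk <instance-name> --disk <disk-name> --zone <zone>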