Image garbage collection failed: unable to find data for container /. Unable to start Kubernetes master - kubernetes

I have set up a Kubernetes cluster with one master and 3 minions, and it has been working for quite some time. But recently I noticed that the kubelet.log file had eaten most of the disk space, so I stopped the Kubernetes service and deleted the kubelet.log file. But when I restart the Kubernetes service it does not start. I found the below error in the log:
Image garbage collection failed: unable to find data for container /
Please let me know what the issue is.
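In the meantime, to keep kubelet.log from filling the disk again, a logrotate rule is one option; this is only a sketch, assuming the kubelet writes to /var/log/kubelet.log (the path and layout may differ on your setup):
# /etc/logrotate.d/kubelet -- hypothetical rotation rule for a file-logging kubelet
/var/log/kubelet.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate    # truncate in place so the running kubelet keeps its open file handle
}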

Related

k8s access denied when pulling local images

I use podman with a local registry. I am able to pull the images from the command line and can also see the manifest. When I deploy to my k8s cluster it fails to pull the image with an access denied error. Any ideas? I have Googled for days now but have not found an answer that works.
I run Ubuntu 22.04 on VMware, if that makes a difference. Thank you.
kubelet Failed to pull image "localhost:5001/datawire/aes:3.1.0": rpc error: code = Unknown desc = failed to pull and unpack image "localhost:5001/datawire/aes:3.1.0": failed to resolve reference "localhost:5001/datawire/aes:3.1.0": failed to do request: Head "http://localhost:5001/v2/datawire/aes/manifests/3.1.0": dial tcp 127.0.0.1:5001: connect: connection refused
When Kubernetes creates a new pod but cannot pull the required container image, it reports a "Failed to pull image" error. You should see it shortly after adding a new resource to your cluster with a command like "kubectl apply", and it appears in the Events section when you inspect the pod with "kubectl describe pod/my-pod".
Pull errors originate on the nodes in your cluster. Each node's kubelet is responsible for fetching the images required to satisfy a pod scheduling request, and when a node cannot download an image it reports that status back to the cluster control plane.
Images may fail to pull for a variety of reasons. The node's networking may have failed, or the cluster as a whole may be experiencing connectivity issues. If the node is online and the registry is up but pull errors continue to occur, a firewall or traffic-filtering rule may be blocking access.
Refer to doc1 and doc2 for more information.
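In this particular case the error is a refused connection to localhost:5001, so it is also worth confirming that the registry is actually listening there on the node that is pulling (with a cluster runtime, "localhost" resolves on the node, not on your workstation). A hedged set of checks, assuming the registry exposes the standard v2 API on port 5001:
# Run these on the node where the pod is scheduled, not on your workstation
curl -v http://localhost:5001/v2/       # a 200 or 401 here means the registry is reachable
ss -ltnp | grep 5001                    # confirm something is listening on port 5001
podman ps | grep 5001                   # confirm the local registry container is running and publishing the port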

Unable to enter a pod in the gke cluster

We have our k8s cluster set up with our app, including a neo4j DB deployment and other artifacts. Overnight, we started facing an issue in our GKE cluster when trying to enter or otherwise interact with any pod running in the cluster. The following is a sample of the error we get from the issued command:
error: unable to upgrade connection: Authorization error (user=kube-apiserver, verb=create, resource=nodes, subresource=proxy)
Our GKE cluster is created as Standard (not Autopilot); the versions are shown in the node pool details and cluster basics screenshots.
As said before, it was working fine despite the warning about the versions. However, we have not yet been able to identify what could have changed between the last time it worked and now.
Any clue as to what authorization setup might have changed to make it incompatible now is very welcome.
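One concrete thing worth checking is whether the kube-apiserver user still holds the nodes/proxy permission that exec and attach requests need; the sketch below is a hedged example of how that permission could be checked and, if missing, granted back (the ClusterRole name is illustrative, not from the original setup):
# Check whether the permission from the error message is currently granted
kubectl auth can-i create nodes/proxy --as=kube-apiserver
# Hypothetical RBAC objects that grant it; apply only if this matches your intended setup
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: apiserver-kubelet-proxy
rules:
- apiGroups: [""]
  resources: ["nodes/proxy"]
  verbs: ["create", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: apiserver-kubelet-proxy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: apiserver-kubelet-proxy
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: kube-apiserver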

401 Unauthorized error while trying to pull image from Google Container Registry

I am using Google Container Registry (GCR) to push and pull Docker images. I have created a deployment in Kubernetes with 3 replicas, and the deployment uses a Docker image pulled from GCR.
Out of the 3 replicas, 2 are pulling the image and running fine. But the third replica shows the error below, and the pod's status remains "ImagePullBackOff" or "ErrImagePull":
"Failed to pull image "gcr.io/xxx:yyy": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/xxx:yyy": failed to resolve reference "gcr.io/xxx:yyy": unexpected status code: 401 Unauthorized"
I am confused as to why only one of the replicas is showing the error while the other 2 are running without any issue. Can anyone please clarify this?
Thanks in advance!
ImagePullBackOff and ErrImagePull indicate that the image used by a container cannot be loaded from the image registry.
A 401 Unauthorized error might occur when you pull an image from a private Container Registry repository. To troubleshoot the error:
Identify the node that runs the pod: kubectl describe pod POD_NAME | grep "Node:"
Verify the node has the storage scope by running:
gcloud compute instances describe NODE_NAME --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)"
The node's access scope should contain at least one of the following:
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/devstorage.read_only
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/cloud-platform
Recreate the node pool that the node belongs to with sufficient scope. You cannot modify existing nodes; you must recreate the node with the correct scope.
Create a new node pool with the gke-default scope with the following command:
gcloud container node-pools create NODE_POOL_NAME --cluster=CLUSTER_NAME --zone=COMPUTE_ZONE --scopes="gke-default"
Or create a new node pool with only the storage scope:
gcloud container node-pools create NODE_POOL_NAME --cluster=CLUSTER_NAME --zone=COMPUTE_ZONE --scopes="https://www.googleapis.com/auth/devstorage.read_only"
Refer to the link for more information on the troubleshooting process.
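Once the new node pool is up, a hedged follow-up is to verify the scopes on one of its nodes and move the workload off the old pool so the failing replica gets rescheduled (the node and pool names below are placeholders):
# Verify the scopes on a node from the new pool
gcloud compute instances describe NEW_NODE_NAME --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)"
# Cordon and drain the node that was failing so its pods move to the new pool
kubectl cordon OLD_NODE_NAME
kubectl drain OLD_NODE_NAME --ignore-daemonsets --delete-emptydir-data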
You will need to set up a role for the cluster to access GCR images for pulling and pushing; see https://github.com/GoogleContainerTools/skaffold/issues/336
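If changing node scopes or roles is not an option, another common pattern is to pull with an explicit image pull secret built from a service-account key; a hedged sketch (the key file path and secret name are placeholders):
# Create a registry secret from a GCP service-account key (key.json is a placeholder path)
kubectl create secret docker-registry gcr-pull-secret \
  --docker-server=gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat key.json)" \
  --docker-email=unused@example.com
# Then reference it from the deployment's pod spec under spec.imagePullSecrets:
#   imagePullSecrets:
#   - name: gcr-pull-secret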

OKD unable to pull large images from internal registry right after deployment of microservices through Jenkins X

I am trying to deploy microservices in OKD through Jenkins X, and the deployment is successful every time.
But the pods go into an "ImagePullBackOff" error right after deployment and only come into the Running state after the pods are deleted; the error shows up in the pod's Events.
The images are being pulled from OKD's internal registry, and the image is 1.25 GB in size. The images are available in the internal registry when the pod tries to pull them.
I came across the "image-pull-progress-deadline" field, which needs to be updated in "/etc/origin/node/node-config.yaml" on all the nodes. I updated it on all the nodes but am still facing the same "ImagePullBackOff" error.
I am trying to restart the kubelet service, but that fails with a kubelet.service not found error:
[master ~]$ sudo systemctl status kubelet
Unit kubelet.service could not be found.
Please let me know whether restarting the kubelet service is necessary, and any suggestions to resolve the "ImagePullBackOff" issue are welcome.
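On OKD 3.x the kubelet does not run as a standalone kubelet.service; it is embedded in the node service, so the restart target is different. A hedged sketch of the node-config.yaml change and the restart (the service may be named atomic-openshift-node on some installs):
# /etc/origin/node/node-config.yaml -- kubeletArguments snippet (sketch)
kubeletArguments:
  image-pull-progress-deadline:
  - "30m"
# Restart the node service instead of kubelet.service
sudo systemctl restart origin-node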

Minions can't rejoin cluster on reboot of AWS instance

The Kubernetes cluster, running v1.3.4, starts a master and 2 minions.
The cluster starts fine, and pods can be started and controlled without issue.
As soon as one of the minions is rebooted, or any of its dependent services (such as the kubelet) is restarted, the minion will not rejoin the cluster.
The error from the kubelet service is of the form:
Aug 08 08:21:15 ip-10-16-1-20 kubelet[911]: E0808 08:21:15.955309 911 kubelet.go:2875] Error updating node status, will retry: error getting node "ip-10-16-1-20.us-west-2.compute.internal": nodes "ip-10-16-1-20.us-west-2.compute.internal" not found
The only way we can see to rectify this issue at the moment is to tear down the whole cluster and rebuild it.
UPDATE:
I had a look at the controller manager log and got the following:
W0815 13:36:39.087991 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
W0815 13:37:39.123811 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
E0815 13:37:39.133045 1 nodecontroller.go:434] pods "kube-proxy-ip-10-16-1-25.us-west-2.compute.internal" not found
This is actually a CoreOS issue, although it is difficult to ascertain what the problem actually is. It is more than likely the low-level OS host-resolution code being called from the AWS Go layers, but that is purely a guess. Upgrading the CoreOS AMI to a later version solved many of the issues we were facing.
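Before tearing the cluster down, it can also be worth confirming that the name the kubelet registers under matches what the API server expects; a hedged set of checks (the node name is taken from the log above, and the metadata endpoint is the standard AWS one):
# On the rebooted minion: check what hostname the kubelet will register as
hostname -f
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# From the master: check whether the node object exists at all
kubectl get nodes
kubectl describe node ip-10-16-1-20.us-west-2.compute.internal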