Kubernetes ImagePullBackOff errors when pulling Docker images

I have a 6.5 GB image in Google Container Registry. When I try to pull the image on a Kubernetes worker node via a deployment, an error occurs: ErrImagePull (or sometimes ImagePullBackOff). I used the describe command to see the error in detail. The error is described as: Failed to pull image "gcr.io/.../.. ": rpc error: code = Canceled desc = context canceled
What could the issue be, and how can I mitigate it?

It seems that the kubelet expects progress updates during the pull of a large image, but this currently isn't available by default with most container registries. It's not ideal behaviour, but judging from the responses on https://github.com/kubernetes/kubernetes/issues/59376 and on the related question "Kubernetes set a timeout limit on image pulls", people have been able to work around it by adjusting the timeout.

Use --image-pull-progress-deadline duration as a parameter when you start the kubelet.
This is documented in the kubelet documentation.
If no pulling progress is made before this deadline, the image pulling will be cancelled. (default 1m0s)
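As a rough sketch, assuming a kubeadm-provisioned node that uses the Docker runtime and reads extra kubelet flags from /etc/default/kubelet (both are assumptions, not something stated in the question), the deadline could be raised as shown here. The 30m value is an arbitrary example. The flag is Docker-specific and, as far as I know, was removed together with dockershim in Kubernetes 1.24, so check that your kubelet version still accepts it. Managed clusters may not let you change kubelet flags at all.

```shell
# Hypothetical contents of /etc/default/kubelet on a kubeadm-provisioned node.
# The file path and the 30m value are assumptions; adjust for your setup.
KUBELET_EXTRA_ARGS="--image-pull-progress-deadline=30m"

# After editing the file, restart the kubelet so it picks up the flag:
#   sudo systemctl restart kubelet
```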

Related

k8s access denied when pulling local images

I use podman with a local registry. I am able to pull the images from the command line and can also see the manifest. When I deploy to my k8s cluster, it fails to pull the image with an access denied error. Any ideas? I have Googled for days but have not found an answer that works.
I run Ubuntu 22.04 on VMware, in case that makes a difference. Thank you.
kubelet Failed to pull image "localhost:5001/datawire/aes:3.1.0": rpc error: code = Unknown desc = failed to pull and unpack image "localhost:5001/datawire/aes:3.1.0": failed to resolve reference "localhost:5001/datawire/aes:3.1.0": failed to do request: Head "http://localhost:5001/v2/datawire/aes/manifests/3.1.0": dial tcp 127.0.0.1:5001: connect: connection refused
When Kubernetes attempts to create a new pod but is unable to pull the required container image, an error message stating "Failed to pull image" will be displayed. When you try to add a new resource to your cluster with a command like "kubectl apply," you should see this right away. When you inspect the pod with "kubectl describe pod/my-pod," the error will appear in the Events.
Pull errors are reported by the nodes in your cluster. Each node's kubelet process is in charge of obtaining the images required to run the pods scheduled to that node. When a node is unable to download an image, the status is reported to the cluster control plane.
Images may fail to pull for a variety of reasons. It's possible that networking on your nodes failed, or that the cluster as a whole is experiencing connectivity issues. If you are online, the registry is up, and pull errors continue to occur, your firewall or traffic filtering may be blocking the requests.
Refer to doc1 and doc2 for more information.
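Since pulls can fail for several different reasons, it can help to sort the message from the pod's Events into a rough category first. The following is only a sketch: the function name and the category labels are hypothetical, and the patterns are simply the error strings quoted in the questions on this page, not an official mapping.

```shell
# Maps a "Failed to pull image" message to a rough category.
# Patterns are taken from error messages quoted on this page;
# the category labels are made up for illustration.
classify_pull_error() {
  case "$1" in
    *"connection refused"*) echo "registry-unreachable" ;;
    *"401 Unauthorized"*)   echo "auth-or-scope" ;;
    *"context canceled"*)   echo "pull-deadline" ;;
    *"Client.Timeout"*)     echo "network-timeout" ;;
    *)                      echo "unknown" ;;
  esac
}
```

You would pass it the message copied from the Events section of "kubectl describe pod/my-pod", for example: classify_pull_error 'dial tcp 127.0.0.1:5001: connect: connection refused'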

401 Unauthorized error while trying to pull image from Google Container Registry

I am using Google Container Registry (GCR) to push and pull Docker images. I have created a deployment in Kubernetes with 3 replicas. The deployment uses a Docker image pulled from GCR.
Out of the 3 replicas, 2 pull the image and run fine. But the third replica shows the error below, and the pod's status remains "ImagePullBackOff" or "ErrImagePull":
"Failed to pull image "gcr.io/xxx:yyy": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/xxx:yyy": failed to resolve reference "gcr.io/xxx:yyy": unexpected status code: 401 Unauthorized"
I am confused about why only one of the replicas shows the error while the other 2 run without any issue. Can anyone please clarify this?
Thanks in advance!
ImagePullBackOff and ErrImagePull indicate that the image used by a container cannot be loaded from the image registry.
A 401 Unauthorized error can occur when you pull an image from a private Container Registry repository. To troubleshoot the error:
1. Identify the node that runs the pod:
kubectl describe pod POD_NAME | grep "Node:"
2. Verify that the node has the storage scope:
gcloud compute instances describe NODE_NAME --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)"
The node's access scope should contain at least one of the following:
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/devstorage.read_only
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/cloud-platform
3. Recreate the node pool that the node belongs to with a sufficient scope. You cannot modify existing nodes; you must recreate the node with the correct scope.
To create a new node pool with the gke-default scope:
gcloud container node-pools create NODE_POOL_NAME --cluster=CLUSTER_NAME --zone=COMPUTE_ZONE --scopes="gke-default"
To create a new node pool with only the storage scope:
gcloud container node-pools create NODE_POOL_NAME --cluster=CLUSTER_NAME --zone=COMPUTE_ZONE --scopes="https://www.googleapis.com/auth/devstorage.read_only"
Refer to the link for more information on the troubleshooting process.
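The scope check above can be scripted. This is a minimal sketch: the helper name has_pull_scope is hypothetical, and it only looks for the two scopes listed above in whatever text it receives on stdin.

```shell
# Succeeds (exit 0) if the text on stdin contains one of the two scopes
# listed above, and fails (exit 1) otherwise.
# has_pull_scope is a hypothetical helper name, not a gcloud feature.
has_pull_scope() {
  grep -Eq 'auth/(devstorage\.read_only|cloud-platform)[[:space:]]*$'
}

# Intended usage (needs gcloud and real values for the placeholders):
#   gcloud compute instances describe NODE_NAME --zone=COMPUTE_ZONE \
#     --format="flattened(serviceAccounts[].scopes)" | has_pull_scope \
#     && echo "scope OK" || echo "scope missing"
```

If the check fails only for the node that runs the failing replica, that would be consistent with one node pool having been created with narrower scopes than the others.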
Hi, you will need to set up a role for the cluster so that it can access GCR images for pulling and pushing; see https://github.com/GoogleContainerTools/skaffold/issues/336

Controlling pod recovery from "Error: ImagePullBackOff" when Container Registry is also inaccessible

We had a major outage when both our container registry and the entire K8s cluster lost power. Because the cluster recovered faster than the container registry, my pod (part of a StatefulSet) got stuck in Error: ImagePullBackOff.
Is there a config setting to retry downloading the image from the container registry periodically, or to recover without manual intervention?
I looked at imagePullPolicy, but that does not apply to a situation where the container registry is unavailable.
The BackOff part of the ImagePullBackOff status means that Kubernetes keeps trying to pull the image from the registry, with an exponential back-off delay (10s, 20s, 40s, …). The delay between attempts is increased until it reaches a compiled-in limit of 300 seconds (5 minutes); more on this in the Kubernetes docs.
The backOffPeriod parameter for image pulls is a hard-coded constant in Kubernetes and unfortunately is not tunable at present, because it can affect node performance. The only alternative is to change it in the source code and build a custom kubelet binary.
There is still an open issue about making it adjustable.
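To make the schedule concrete, here is a small sketch that prints the delay before each retry, using only the numbers given above (start at 10 seconds, double each time, cap at 300 seconds). The function name backoff_schedule is hypothetical; it models the arithmetic and is not kubelet code.

```shell
# Prints the delay in seconds before each of the first N retries,
# doubling from 10 s and capping at 300 s as described above.
backoff_schedule() {
  attempts=$1
  delay=10
  cap=300
  out=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Append the current delay, space-separated.
    out="$out${out:+ }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

So once the cap is reached, a pod waiting on a recovered registry should be retried within about 5 minutes without manual intervention.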

One node for a GKE cluster cannot pull image from dockerhub

This is a very weird thing.
I created a private GKE cluster with a node pool of 3 nodes. Then I created a replica set with 3 pods. Some of these pods are scheduled to one node.
One of these pods always gets ImagePullBackOff. I checked the error:
Failed to pull image "bitnami/mongodb:3.6": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
And the pods scheduled to the remaining two nodes work well.
I SSHed into that node and ran docker pull, and everything was fine. I cannot find another way to troubleshoot this error.
I tried draining or deleting that node and letting the cluster recreate it, but it still does not work.
Help me, please.
Update:
According to the GCP documentation, a private cluster will fail to pull images from Docker Hub.
BUT the weirdest thing is ONLY ONE node is unable to pull the images.
There was a related bug reported in Kubernetes 1.11.
Make sure it does not apply to your case.

Heapster status stuck in Container Creating or Pending status

I am new to Kubernetes and have been working with it for the past month.
When setting up a cluster, I sometimes see Heapster get stuck in the Container Creating or Pending status. When this happens, the only fix we have found is to reinstall everything from scratch, which solves the problem. Afterwards, Heapster runs without any problem. But I do not think this is the optimal solution every time, so please help me solve the issue when it occurs again.
The Heapster image is pulled from GitHub. Right now the cluster is running fine, so I cannot send a screenshot of Heapster stuck in the Container Creating or Pending status.
Please suggest an alternative way to solve the problem if it occurs again.
Thanks in advance for your time.
A pod stuck in the Pending state can mean more than one thing. Next time it happens, run 'kubectl get pods' and then 'kubectl describe pod <pod-name>'. However, since it works sometimes, the most likely cause is that the cluster does not have enough resources on any of its nodes to schedule the pod. If the cluster is low on remaining resources, 'kubectl top nodes' and 'kubectl describe nodes' should give you an indication of this. (Or with GKE, if you are on Google Cloud, you often get a low-resource warning in the web UI console.)
(Or, if you are on Azure, be wary of https://github.com/Azure/ACS/issues/29 )
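The first two commands above can be chained so that only the stuck pods get described. A minimal sketch, assuming the default 'kubectl get pods' table (header row, STATUS in the third column, with the status spelled ContainerCreating); the helper name stuck_pods is hypothetical.

```shell
# Reads the default `kubectl get pods` table on stdin and prints the names
# of pods whose STATUS column is Pending or ContainerCreating.
# Assumes a header row is present (i.e. --no-headers was not used).
stuck_pods() {
  awk 'NR > 1 && ($3 == "Pending" || $3 == "ContainerCreating") { print $1 }'
}

# Intended usage (needs access to a cluster):
#   kubectl get pods | stuck_pods | while read -r pod; do
#     kubectl describe pod "$pod"
#   done
```

The Events section at the end of each description usually states why the pod could not be scheduled or started.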