Kubernetes containers CrashLoopBackOff [duplicate]

This question already has answers here:
My kubernetes pods keep crashing with "CrashLoopBackOff" but I can't find any log
(21 answers)
Closed 2 years ago.
I am new to Kubernetes and trying to learn, but I am stuck with an error I cannot find an explanation for. I am running Pods and Deployments in my cluster, and the CLI shows them running fine, but after a while they crash and the Pods have to be restarted.
Before posting here I did some research. The way I understood it, I should create a Deployment so that its ReplicaSet manages the Pods' lifecycle, instead of deploying Pods on their own. But as you can see, the Pods in the Deployment are crashing too.
kubectl get pods
NAME                        READY   STATUS             RESTARTS   AGE
operator-5bf8c8484c-fcmnp   0/1     CrashLoopBackOff   9          34m
operator-5bf8c8484c-phptp   0/1     CrashLoopBackOff   9          34m
operator-5bf8c8484c-wh7hm   0/1     CrashLoopBackOff   9          34m
operator-pod                0/1     CrashLoopBackOff   12         49m
kubectl describe pods operator
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/operator-pod to workernode
Normal Created 30m (x5 over 34m) kubelet, workernode Created container operator-pod
Normal Started 30m (x5 over 34m) kubelet, workernode Started container operator-pod
Normal Pulled 29m (x6 over 34m) kubelet, workernode Container image "operator-api_1:java" already present on machine
Warning BackOff 4m5s (x101 over 33m) kubelet, workernode Back-off restarting failed container
deployment yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: operator
  labels:
    app: java
spec:
  replicas: 3
  selector:
    matchLabels:
      app: call
  template:
    metadata:
      labels:
        app: call
    spec:
      containers:
      - name: operatorapi
        image: operator-api_1:java
        ports:
        - containerPort: 80
Can someone help me figure out how to debug this?

The most likely reason is that the main process in the container finishes its work and exits, so the container stops. Because a Deployment's pods always use restartPolicy: Always, the kubelet then restarts the container, and after repeated exits the pod is reported as CrashLoopBackOff.
To solve this, check the process running in the container and make sure it keeps running (for example a long-lived server rather than a one-off command). You can run the process in a loop inside the container, or set the container's command and args in deployment.yaml.
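To see why the restarts slow down over time, it helps to know the kubelet's restart schedule. Per the Kubernetes Pod lifecycle documentation, after a container exits the kubelet waits before restarting it: the delay starts at 10 seconds, doubles on each restart and is capped at five minutes, and it is reset once the container has run for 10 minutes without problems. The sketch below only models that documented default schedule; the function name is illustrative, and exact values may differ between Kubernetes versions.

```python
def restart_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Back-off delay (in seconds) before each of `restarts` successive restarts.

    Models the kubelet's documented default: start at `base`, double after each
    restart, never exceed `cap`. The reset after 10 minutes of healthy running
    is not modeled here.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays


if __name__ == "__main__":
    # Seven consecutive crashes: the delay reaches the five-minute cap.
    print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

With 9 restarts in 34 minutes, as in the output above, the pods have plausibly reached the capped five-minute wait, and that waiting period is what kubectl get pods reports as CrashLoopBackOff.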
Here is a reference on determining the reason for a pod failure:
https://kubernetes.io/docs/tasks/debug-application-cluster/determine-reason-pod-failure/

There are several ways to debug such a scenario and I recommend viewing Kubernetes documentation for best-practices. I typically have success with the following 2 approaches:
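Logs: You can view the logs for the application using the command below. Note that the filter has to match the pod template's label (app: call in the manifest above), not the Deployment's own app: java label:
kubectl logs -l app=call
If you have multiple containers within that pod, you can filter it down:
kubectl logs -l app=call -c operatorapi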
Events: You can get a lot of information from events as shown below (sorted by timestamp). Keep in mind that there could be a lot of noise in events depending on the number of apps and services that you may have so you have to filter it down further:
kubectl get events --sort-by='.metadata.creationTimestamp'
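If the event list is noisy, one way to filter it down is to ask kubectl for JSON (kubectl get events -o json) and keep only the Warning events for the pods you care about. A minimal sketch, assuming that JSON has already been captured as a string; the function names and the name-prefix filter are illustrative choices, not kubectl features.

```python
import json
from datetime import datetime


def _created(event: dict) -> datetime:
    # creationTimestamp is RFC 3339, e.g. "2020-05-01T10:00:00Z"; fromisoformat
    # needs an explicit offset instead of "Z" on older Python versions.
    return datetime.fromisoformat(event["metadata"]["creationTimestamp"].replace("Z", "+00:00"))


def warnings_for(events_json: str, name_prefix: str) -> list[str]:
    """'Reason: message' for Warning events whose object name starts with name_prefix, oldest first."""
    items = json.loads(events_json)["items"]
    selected = [
        e for e in items
        if e.get("type") == "Warning"
        and e.get("involvedObject", {}).get("name", "").startswith(name_prefix)
    ]
    return [f"{e['reason']}: {e['message']}" for e in sorted(selected, key=_created)]
```

For example, warnings_for(output, "operator") would keep the BackOff warnings for all the operator pods and drop Normal events such as Pulled or Started.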
Feel free to share the output from those two and I can help you debug further.
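A label filter such as -l app=... matches pod labels, and pods only receive the labels in the Deployment's pod template. In the question's manifest the Deployment object itself is labeled app: java, while the template (and spec.selector.matchLabels) use app: call, so a filter on app=java finds no pods. A minimal sketch of equality-based selector matching, with the labels copied from that manifest (set-based selectors such as in/notin are not covered, and the function name is illustrative):

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True if every key/value pair of an equality-based selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels taken from the Deployment manifest in the question.
deployment_labels = {"app": "java"}  # metadata.labels of the Deployment object only
pod_labels = {"app": "call"}         # spec.template.metadata.labels, copied onto each pod
match_labels = {"app": "call"}       # spec.selector.matchLabels

print(selector_matches(match_labels, pod_labels))     # True: the Deployment owns its pods
print(selector_matches({"app": "java"}, pod_labels))  # False: -l app=java selects nothing
```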

Related

Ingress-nginx is in CrashLoopBackOff after K8s upgrade

After upgrading Kubernetes node pool from 1.21 to 1.22, ingress-nginx-controller pods started crashing. The same deployment has been working fine in EKS. I'm just having this issue in GKE. Does anyone have any ideas about the root cause?
$ kubectl logs ingress-nginx-controller-5744fc449d-8t2rq -c controller
-------------------------------------------------------------------------------
NGINX Ingress controller
Release: v1.3.1
Build: 92534fa2ae799b502882c8684db13a25cde68155
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.19.10
-------------------------------------------------------------------------------
W0219 21:23:08.194770 8 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0219 21:23:08.194995 8 main.go:209] "Creating API client" host="https://10.1.48.1:443"
Ingress pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned infra/ingress-nginx-controller-5744fc449d-8t2rq to gke-infra-nodep-ffe54a41-s7qx
Normal Pulling 27m kubelet Pulling image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974"
Normal Started 27m kubelet Started container controller
Normal Pulled 27m kubelet Successfully pulled image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974" in 6.443361484s
Warning Unhealthy 26m (x6 over 26m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 502
Normal Killing 26m kubelet Container controller failed liveness probe, will be restarted
Normal Created 26m (x2 over 27m) kubelet Created container controller
Warning FailedPreStopHook 26m kubelet Exec lifecycle hook ([/wait-shutdown]) for Container "controller" in Pod "ingress-nginx-controller-5744fc449d-8t2rq_infra(c4c166ff-1d86-4385-a22c-227084d569d6)" failed - error: command '/wait-shutdown' exited with 137: , message: ""
Normal Pulled 26m kubelet Container image "registry.k8s.io/ingress-nginx/controller:v1.3.1@sha256:54f7fe2c6c5a9db9a0ebf1131797109bb7a4d91f56b9b362bde2abd237dd1974" already present on machine
Warning BackOff 7m7s (x52 over 21m) kubelet Back-off restarting failed container
Warning Unhealthy 2m9s (x55 over 26m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 502
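The FailedPreStopHook event above reports exit status 137. By the usual Unix convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9, SIGKILL. That is consistent with the container being force-killed, for example by the kubelet after the failed liveness probe shown in the Killing event, or by the kernel's OOM killer. A minimal sketch of that decoding with Python's signal module (signal numbers are platform-specific; this assumes Linux, and the function name is illustrative):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container exit status using the shell convention 128 + signal number."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            # Not a known signal number on this platform; report the raw status.
            return f"exited with status {code}"
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))  # killed by SIGKILL (signal 9)
    print(describe_exit_code(1))    # exited with status 1
```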
The Beta API versions (extensions/v1beta1 and networking.k8s.io/v1beta1) of Ingress are no longer served (removed) for GKE clusters created on versions 1.22 and later. Please refer to the official GKE ingress documentation for changes in the GA API version.
Also refer to the official Kubernetes documentation on API removals in Kubernetes v1.22 for more information.
Before upgrading your Ingress API as a client, make sure that every ingress controller that you use is compatible with the v1 Ingress API. See Ingress Prerequisites for more context about Ingress and ingress controllers.
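Before upgrading, one way to find manifests that still use the removed Ingress versions is to scan them for those apiVersion values. A minimal sketch, assuming the YAML has already been parsed into dictionaries (for example with PyYAML's yaml.safe_load_all, an extra dependency that is not part of the sketch); it only checks kind: Ingress objects, and the helper names and sample manifests are hypothetical.

```python
# Ingress API versions that are no longer served from Kubernetes 1.22 onwards.
REMOVED_INGRESS_APIS = {"extensions/v1beta1", "networking.k8s.io/v1beta1"}


def uses_removed_ingress_api(manifest: dict) -> bool:
    """True if a parsed manifest is an Ingress declared with a removed beta API version."""
    return manifest.get("kind") == "Ingress" and manifest.get("apiVersion") in REMOVED_INGRESS_APIS


def outdated_ingresses(manifests: list[dict]) -> list[str]:
    """Names of Ingress objects that still need to move to networking.k8s.io/v1."""
    return [
        m.get("metadata", {}).get("name", "<unnamed>")
        for m in manifests
        if uses_removed_ingress_api(m)
    ]


if __name__ == "__main__":
    docs = [
        {"apiVersion": "networking.k8s.io/v1beta1", "kind": "Ingress", "metadata": {"name": "old"}},
        {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress", "metadata": {"name": "new"}},
    ]
    print(outdated_ingresses(docs))  # ['old']
```

Changing apiVersion alone is not enough: the v1 schema also restructures the backend (serviceName/servicePort become service.name/service.port) and requires a pathType on each path.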
Also check the following possible causes of CrashLoopBackOff:
Increasing the initialDelaySeconds value of the livenessProbe may help, because it gives the container more time to start up and finish its initial work before the kubelet begins checking its health.
Check the container restart policy: the spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never. The default value is Always.
Out of memory or resources: try increasing the VM size. Containers may crash because they hit their memory limits; new ones are then started, their health checks fail, and the Ingress serves 502s.
Check whether externalTrafficPolicy=Local is set on the NodePort service; this setting prevents nodes from forwarding traffic to other nodes.
Refer to the GitHub issue "Document how to avoid 502s" (#34) for more information.
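To choose a value for initialDelaySeconds, it helps to estimate when the kubelet would first restart a container whose liveness probe never succeeds: the first probe runs roughly initialDelaySeconds after the container starts, then one probe runs every periodSeconds, and the container is restarted after failureThreshold consecutive failures. A rough sketch of that arithmetic, using the Kubernetes defaults (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3) and ignoring probe timeouts and scheduling jitter; the function name is illustrative.

```python
def earliest_liveness_restart(initial_delay: int = 0, period: int = 10, failure_threshold: int = 3) -> int:
    """Approximate seconds after container start at which a never-healthy container is restarted.

    Probes run at initial_delay, initial_delay + period, ...; the restart happens
    on the failure_threshold-th consecutive failure.
    """
    return initial_delay + (failure_threshold - 1) * period


if __name__ == "__main__":
    print(earliest_liveness_restart())                  # 20 with the Kubernetes defaults
    print(earliest_liveness_restart(initial_delay=60))  # 80
```

If the application needs longer than that to start serving its health endpoint, it will be killed in a loop even though nothing is actually wrong with it.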

kubernetes cannot pull a public image

Kubernetes cannot pull a public image. Standard images like nginx download successfully, but the image for my pet project does not. I'm using minikube to launch the Kubernetes cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway-deploumnet
  labels:
    app: api-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      containers:
      - name: api-gateway
        image: creatorsprodhouse/api-gateway:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 80
When I try to create a deployment, I get an error saying that Kubernetes cannot pull my public image.
$ kubectl get pods
result:
NAME READY STATUS RESTARTS AGE
api-gateway-deploumnet-599c784984-j9mf2 0/1 ImagePullBackOff 0 13m
api-gateway-deploumnet-599c784984-qzklt 0/1 ImagePullBackOff 0 13m
api-gateway-deploumnet-599c784984-csxln 0/1 ImagePullBackOff 0 13m
$ kubectl logs api-gateway-deploumnet-599c784984-csxln
result
Error from server (BadRequest): container "api-gateway" in pod "api-gateway-deploumnet-86f6cc5b65-xdx85" is waiting to start: trying and failing to pull image
What could be the problem? The standard images are downloading but my public one is not. Any help would be appreciated.
EDIT 1
$ kubectl describe pod api-gateway-deploumnet-599c784984-csxln
result:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m22s default-scheduler Successfully assigned default/api-gateway-deploumnet-849899786d-mq4td to minikube
Warning Failed 3m8s kubelet Failed to pull image "creatorsprodhouse/api-gateway:latest": rpc error: code = Unknown desc = context deadline exceeded
Warning Failed 3m8s kubelet Error: ErrImagePull
Normal BackOff 3m7s kubelet Back-off pulling image "creatorsprodhouse/api-gateway:latest"
Warning Failed 3m7s kubelet Error: ImagePullBackOff
Normal Pulling 2m53s (x2 over 8m21s) kubelet Pulling image "creatorsprodhouse/api-gateway:latest"
EDIT 2
If I pull the image directly with Docker, it works fine:
$ docker pull creatorsprodhouse/api-gateway:latest
result:
Digest: sha256:e664a9dd9025f80a3dd60d157ce1464d4df7d0f8a00538e6a137d44f9f9f12aa
Status: Downloaded newer image for creatorsprodhouse/api-gateway:latest
docker.io/creatorsprodhouse/api-gateway:latest
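The last line of the docker pull output shows that the short name creatorsprodhouse/api-gateway:latest resolves to Docker Hub (docker.io), so the image name itself looks fine; the "context deadline exceeded" error in the events means the pull from the minikube node timed out. A simplified sketch of that short-name expansion, which can help confirm which registry a node will contact. It follows Docker's defaults for the registry, the library/ prefix and the latest tag, but does not validate the full reference grammar; the function name is illustrative.

```python
def normalize_image(ref: str) -> str:
    """Expand a short image reference the way Docker does: default registry docker.io,
    'library/' for single-name images, and tag 'latest' when no tag or digest is given."""
    name, digest = ref, ""
    if "@" in ref:
        name, digest = ref.split("@", 1)
        digest = "@" + digest
    first, _, rest = name.partition("/")
    # The first path component is a registry host only if it looks like one.
    if rest and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", name
        if "/" not in path:
            path = "library/" + path
    # A tag is a ':' in the last path component (a ':' in the host is a port).
    if ":" not in path.rsplit("/", 1)[-1] and not digest:
        path += ":latest"
    return f"{registry}/{path}{digest}"


if __name__ == "__main__":
    print(normalize_image("creatorsprodhouse/api-gateway:latest"))
    # docker.io/creatorsprodhouse/api-gateway:latest
    print(normalize_image("nginx"))  # docker.io/library/nginx:latest
```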
EDIT 3
After advice to restart minikube
$ minikube stop
$ minikube delete --purge
$ minikube start --cni=calico
I started the pods.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m28s default-scheduler Successfully assigned default/api-gateway-deploumnet-849899786d-bkr28 to minikube
Warning FailedCreatePodSandBox 4m27s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7e112c92e24199f268ec9c6f3a6db69c2572c0751db9fd57a852d1b9b412e0a1" network for pod "api-gateway-deploumnet-849899786d-bkr28": networkPlugin cni failed to set up pod "api-gateway-deploumnet-849899786d-bkr28_default" network: failed to set bridge addr: could not add IP address to "cni0": permission denied, failed to clean up sandbox container "7e112c92e24199f268ec9c6f3a6db69c2572c0751db9fd57a852d1b9b412e0a1" network for pod "api-gateway-deploumnet-849899786d-bkr28": networkPlugin cni failed to teardown pod "api-gateway-deploumnet-849899786d-bkr28_default" network: running [/usr/sbin/iptables -t nat -D POSTROUTING -s 10.85.0.34 -j CNI-57e7da7379b524635074e6d0 -m comment --comment name: "crio" id: "7e112c92e24199f268ec9c6f3a6db69c2572c0751db9fd57a852d1b9b412e0a1" --wait]: exit status 2: iptables v1.8.4 (legacy): Couldn't load target `CNI-57e7da7379b524635074e6d0':No such file or directory
Try `iptables -h' or 'iptables --help' for more information.
I could not solve the problem with the suggested approaches. However, it worked when I ran minikube with a different driver:
$ minikube start --driver=none
--driver=none means that the cluster will run on your host instead of the standard --driver=docker which runs the cluster in docker.
It is better to run minikube with --driver=docker as it is safer and easier, but it didn't work for me as I could not download my images. For me personally it is ok to use --driver=none although it is a bit dangerous.
In general, if anyone knows what the problem is, please answer my question. In the meantime you can try to run minikube cluster on your host with the command I mentioned above.
In any case, thank you very much for your attention!

GCP GKE: View logs of terminated jobs/pods

I have a few cron jobs on GKE.
One of the pods terminated, and now I am trying to access its logs.
➣ $ kubectl get events
LAST SEEN TYPE REASON KIND MESSAGE
23m Normal SuccessfulCreate Job Created pod: virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
22m Normal SuccessfulDelete Job Deleted pod: virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
22m Warning DeadlineExceeded Job Job was active longer than specified deadline
23m Normal Scheduled Pod Successfully assigned default/virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42 to staging-cluster-default-pool-4b4827bf-rpnl
23m Normal Pulling Pod pulling image "gcr.io/my-repo/myimage:v8"
23m Normal Pulled Pod Successfully pulled image "gcr.io/my-repo/my-image:v8"
23m Normal Created Pod Created container
23m Normal Started Pod Started container
22m Normal Killing Pod Killing container with id docker://virulent-angelfish-cronjob:Need to kill Pod
23m Normal SuccessfulCreate CronJob Created job virulent-angelfish-cronjob-netsuite-proservices-1562220000
22m Normal SawCompletedJob CronJob Saw completed job: virulent-angelfish-cronjob-netsuite-proservices-1562220000
So the CronJob ran at least once.
I would like to see the pod's logs, but there is nothing there
➣ $ kubectl get pods
No resources found.
Given that my CronJob definition has:
failedJobsHistoryLimit: 1
successfulJobsHistoryLimit: 3
shouldn't at least one pod be there for me to do forensics?
Your pod is crashing or otherwise unhealthy
First, take a look at the logs of the current container:
kubectl logs ${POD_NAME} ${CONTAINER_NAME}
If your container has previously crashed, you can access the previous container’s crash log with:
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
Alternatively, you can run commands inside that container with exec:
kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}
Note: -c ${CONTAINER_NAME} is optional. You can omit it for pods that only contain a single container.
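When scripting these commands, the only difference between single- and multi-container pods is whether -c is passed. A small sketch that builds the argument lists (the helper names are hypothetical); the result can be passed to subprocess.run, e.g. subprocess.run(cmd, capture_output=True, text=True, check=True).

```python
import shlex
from typing import Optional


def kubectl_logs_cmd(pod: str, container: Optional[str] = None, previous: bool = False) -> list[str]:
    """Arguments for `kubectl logs`; -c is only needed when the pod has several containers."""
    cmd = ["kubectl", "logs"]
    if previous:
        cmd.append("--previous")  # logs of the previous (crashed) container instance
    cmd.append(pod)
    if container:
        cmd += ["-c", container]
    return cmd


def kubectl_exec_cmd(pod: str, command: list[str], container: Optional[str] = None) -> list[str]:
    """Arguments for `kubectl exec`; everything after `--` is run inside the container."""
    cmd = ["kubectl", "exec", pod]
    if container:
        cmd += ["-c", container]
    return cmd + ["--", *command]


if __name__ == "__main__":
    # Reproduces the Cassandra example that follows.
    print(shlex.join(kubectl_exec_cmd("cassandra", ["cat", "/var/log/cassandra/system.log"])))
```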
As an example, to look at the logs from a running Cassandra pod, you might run:
kubectl exec cassandra -- cat /var/log/cassandra/system.log
If none of these approaches work, you can find the host machine that the pod is running on and SSH into that host.
Finally, check the logs in Google Stackdriver Logging.
Debugging Pods
The first step in debugging a pod is taking a look at it. Check the current state of the pod and recent events with the following command:
kubectl describe pods ${POD_NAME}
Look at the state of the containers in the pod. Are they all Running? Have there been recent restarts?
Continue debugging depending on the state of the pods.
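To check container states and restarts from a script rather than by reading kubectl describe output, you can fetch the pod as JSON with kubectl get pod ${POD_NAME} -o json and read status.containerStatuses. Each entry there has the container's name, restartCount, its current state (running, waiting or terminated, with a reason such as CrashLoopBackOff) and its lastState. A minimal sketch; the summary format and function name are arbitrary choices.

```python
import json


def summarize_containers(pod_json: str) -> list[str]:
    """One line per container: current state (with reason), restart count, last exit code."""
    pod = json.loads(pod_json)
    lines = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # `state` holds exactly one key: running, waiting or terminated.
        state_name, state = next(iter(cs.get("state", {}).items()), ("unknown", {}))
        reason = state.get("reason", "")
        suffix = f" ({reason})" if reason else ""
        last_exit = cs.get("lastState", {}).get("terminated", {}).get("exitCode")
        lines.append(
            f"{cs['name']}: {state_name}{suffix}, "
            f"restarts={cs.get('restartCount', 0)}, last exit code={last_exit}"
        )
    return lines
```

A crash-looping container typically shows up as waiting (CrashLoopBackOff) with a growing restart count; the last exit code then tells you whether the process failed or simply finished.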
Debugging ReplicationControllers
ReplicationControllers are fairly straightforward. They can either create pods or they can’t. If they can’t create pods, then please refer to the instructions above to debug your pods.
You can also use kubectl describe rc ${CONTROLLER_NAME} to inspect events related to the replication controller.
Hope this helps you pinpoint the problem.
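You can use the --previous flag to get the logs of the previous instance of a container in the same pod, for example after it has crashed and been restarted.
So, you can use:
kubectl logs --previous virulent-angelfish-cronjob-netsuite-proservices-15622200008gc42
to get the logs of the container instance that ran before the current one. This only works while the pod object still exists; it cannot return logs for a pod that has already been deleted.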

Trying to create a Kubernetes deployment but it shows 0 pods available

I'm new to k8s, so some of my terminology might be off. But basically, I'm trying to deploy a simple web api: one load balancer in front of n pods (where right now, n=1).
However, when I try to visit the load balancer's IP address it doesn't show my web application. When I run kubectl get deployments, I get this:
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
tl-api 1 1 1 0 4m
Here's my YAML file. Let me know if anything looks off--I'm very new to this!
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tl-api
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tl-api
    spec:
      containers:
      - name: tl-api
        image: tlk8s.azurecr.io/devicecloudwebapi:v1
        ports:
        - containerPort: 80
      imagePullSecrets:
      - name: acr-auth
      nodeSelector:
        beta.kubernetes.io/os: windows
---
apiVersion: v1
kind: Service
metadata:
  name: tl-api
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: tl-api
Edit 2: When I try using ACS (which supports Windows), I get this:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned tl-api-3466491809-vd5kg to dc9ebacs9000
Normal SuccessfulMountVolume 11m kubelet, dc9ebacs9000 MountVolume.SetUp succeeded for volume "default-token-v3wz9"
Normal Pulling 4m (x6 over 10m) kubelet, dc9ebacs9000 pulling image "tlk8s.azurecr.io/devicecloudwebapi:v1"
Warning FailedSync 1s (x50 over 10m) kubelet, dc9ebacs9000 Error syncing pod
Normal BackOff 1s (x44 over 10m) kubelet, dc9ebacs9000 Back-off pulling image "tlk8s.azurecr.io/devicecloudwebapi:v1"
I then try examining the failed pod:
PS C:\users\<me>\source\repos\DeviceCloud\DeviceCloud\1- Presentation\DeviceCloud.Web.API> kubectl logs tl-api-3466491809-vd5kg
Error from server (BadRequest): container "tl-api" in pod "tl-api-3466491809-vd5kg" is waiting to start: trying and failing to pull image
When I run docker images I see the following:
REPOSITORY TAG IMAGE ID CREATED SIZE
devicecloudwebapi latest ee3d9c3e231d 24 hours ago 7.85GB
tlk8s.azurecr.io/devicecloudwebapi v1 ee3d9c3e231d 24 hours ago 7.85GB
devicecloudwebapi dev bb33ab221910 25 hours ago 7.76GB
Your problem is that the container image tlk8s.azurecr.io/devicecloudwebapi:v1 is in a private container registry. See the events at the bottom of the output of the following command:
$ kubectl describe po -l=app=tl-api
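The official Kubernetes docs describe how to resolve this issue in "Pull an Image from a Private Registry". Essentially:
Create a secret with kubectl create secret docker-registry, for example: kubectl create secret docker-registry acr-auth --docker-server=tlk8s.azurecr.io --docker-username=<username> --docker-password=<password>
Use it in your Deployment under the pod template's imagePullSecrets key (spec.template.spec.imagePullSecrets)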
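For reference, kubectl create secret docker-registry stores a Docker config JSON under the .dockerconfigjson key of a Secret of type kubernetes.io/dockerconfigjson, where auth is the base64 encoding of username:password. A minimal sketch that builds the same manifest, for example to inspect what the secret should contain. The credentials are placeholders and the helper names are hypothetical; the secret's name must match the imagePullSecrets entry (acr-auth in the question), and it must live in the same namespace as the pods.

```python
import base64
import json


def dockerconfigjson(server: str, username: str, password: str) -> str:
    """JSON stored under `.dockerconfigjson` in a kubernetes.io/dockerconfigjson Secret."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps({"auths": {server: {"username": username, "password": password, "auth": auth}}})


def secret_manifest(name: str, server: str, username: str, password: str) -> dict:
    """Secret manifest equivalent to `kubectl create secret docker-registry` (Secret data is base64)."""
    payload = dockerconfigjson(server, username, password).encode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": base64.b64encode(payload).decode()},
    }


if __name__ == "__main__":
    print(json.dumps(secret_manifest("acr-auth", "tlk8s.azurecr.io", "<username>", "<password>"), indent=2))
```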

Kubernetes Keeps Restarting Pods of StatefulSet in Minikube With "Need to kill pod"

Minikube version v0.24.1
kubernetes version 1.8.0
The problem I am facing is that I have several StatefulSets in minikube, each with one pod.
Sometimes when I start minikube, my pods start up initially but then keep being restarted by Kubernetes. They go from the ContainerCreating state, to Running, to Terminating, over and over.
Now I've seen kubernetes kill and restart things before if kubernetes detects disk pressure, memory pressure, or some other condition like that, but that's not the case here as these flags are not raised and the only message in the pod's event log is "Need to kill pod".
What's most confusing is that this issue doesn't happen all the time, and I'm not sure how to trigger it. My minikube setup will work for a week or more without this happening then one day I'll start minikube up and the pods for my statefulsets just keep restarting. So far the only workaround I've found is to delete my minikube instance and set it up again from scratch, but obviously this is not ideal.
Below is one of the StatefulSets whose pod keeps getting restarted. Its events show Kubernetes deleting the pod and creating it again, repeatedly. I'm unable to figure out why it keeps doing that and why it only gets into this state sometimes.
$ kubectl describe statefulsets mongo --namespace=storage
Name: mongo
Namespace: storage
CreationTimestamp: Mon, 08 Jan 2018 16:11:39 -0600
Selector: environment=test,role=mongo
Labels: name=mongo
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"apps/v1beta1","kind":"StatefulSet","metadata":{"annotations":{},"labels":{"name":"mongo"},"name":"mongo","namespace":"storage"},"...
Replicas: 1 desired | 1 total
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: environment=test
role=mongo
Containers:
mongo:
Image: mongo:3.4.10-jessie
Port: 27017/TCP
Command:
mongod
--replSet
rs0
--smallfiles
--noprealloc
Environment: <none>
Mounts:
/data/db from mongo-persistent-storage (rw)
mongo-sidecar:
Image: cvallance/mongo-k8s-sidecar
Port: <none>
Environment:
MONGO_SIDECAR_POD_LABELS: role=mongo,environment=test
KUBERNETES_MONGO_SERVICE_NAME: mongo
Mounts: <none>
Volumes: <none>
Volume Claims:
Name: mongo-persistent-storage
StorageClass:
Labels: <none>
Annotations: volume.alpha.kubernetes.io/storage-class=default
Capacity: 5Gi
Access Modes: [ReadWriteOnce]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulDelete 23m (x46 over 1h) statefulset delete Pod mongo-0 in StatefulSet mongo successful
Normal SuccessfulCreate 3m (x62 over 1h) statefulset create Pod mongo-0 in StatefulSet mongo successful
After some more digging, it seems there was a bug that can create duplicate controller revisions for the same StatefulSet:
https://github.com/kubernetes/kubernetes/issues/56355
The issue appears to have been fixed, and the fix seems to have been backported to Kubernetes 1.8 and included in 1.9, but minikube doesn't ship a fixed version yet. If your system enters this state, a workaround is to list the controller revisions like so:
$ kubectl get controllerrevisions --namespace=storage
NAME CONTROLLER REVISION AGE
mongo-68bd5cbcc6 StatefulSet/mongo 1 19h
mongo-68bd5cbcc7 StatefulSet/mongo 1 7d
and delete the duplicate controller revision for each StatefulSet:
$ kubectl delete controllerrevisions mongo-68bd5cbcc6 --namespace=storage
Alternatively, simply use Kubernetes 1.9 or later, which includes this bug fix.
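With many StatefulSets, spotting duplicates by eye gets tedious. A minimal sketch that parses the kubectl get controllerrevisions table and reports controllers with more than one revision carrying the same revision number. It deliberately does not decide which one to delete (in the example above, the newer mongo-68bd5cbcc6 was removed), and the function names are illustrative.

```python
from collections import defaultdict


def parse_revisions(table: str) -> list[tuple[str, str, int]]:
    """(name, controller, revision) rows from `kubectl get controllerrevisions` output."""
    rows = []
    for line in table.strip().splitlines()[1:]:  # skip the NAME CONTROLLER REVISION AGE header
        parts = line.split()
        if len(parts) >= 3:
            rows.append((parts[0], parts[1], int(parts[2])))
    return rows


def duplicate_revisions(rows: list[tuple[str, str, int]]) -> dict[tuple[str, int], list[str]]:
    """Map (controller, revision) to revision names, keeping only groups with duplicates."""
    groups: dict[tuple[str, int], list[str]] = defaultdict(list)
    for name, controller, revision in rows:
        groups[(controller, revision)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```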