Kubernetes CrashLoopBackOff default timing - kubernetes

What are the defaults for the Kubernetes CrashLoopBackOff?
Say, I have a pod:
kubectl run mynginx --image nginx -- echo hello
And I inspect its status:
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
mynginx 0/1 Pending 0 0s
mynginx 0/1 Pending 0 0s
mynginx 0/1 ContainerCreating 0 0s
mynginx 0/1 Completed 0 2s
mynginx 0/1 Completed 1 4s
mynginx 0/1 CrashLoopBackOff 1 5s
mynginx 0/1 Completed 2 20s
mynginx 0/1 CrashLoopBackOff 2 33s
mynginx 0/1 Completed 3 47s
mynginx 0/1 CrashLoopBackOff 3 59s
mynginx 0/1 Completed 4 97s
mynginx 0/1 CrashLoopBackOff 4 109s
This is "expected". Kubernetes starts a pod, it quits "too fast", Kubernetes schedules it again and then Kubernetes sets the state to CrashLoopBackOff.
Now, if I start a pod slightly differently:
kubectl run mynginx3 --image nginx -- /bin/bash -c "sleep 10; echo hello"
I get the following
kubectl get pods -w
NAME READY STATUS RESTARTS AGE
mynginx3 0/1 Pending 0 0s
mynginx3 0/1 Pending 0 0s
mynginx3 0/1 ContainerCreating 0 0s
mynginx3 1/1 Running 0 2s
mynginx3 0/1 Completed 0 12s
mynginx3 1/1 Running 1 14s
mynginx3 0/1 Completed 1 24s
mynginx3 0/1 CrashLoopBackOff 1 36s
mynginx3 1/1 Running 2 38s
mynginx3 0/1 Completed 2 48s
mynginx3 0/1 CrashLoopBackOff 2 62s
mynginx3 1/1 Running 3 75s
mynginx3 0/1 Completed 3 85s
mynginx3 0/1 CrashLoopBackOff 3 96s
mynginx3 1/1 Running 4 2m14s
mynginx3 0/1 Completed 4 2m24s
mynginx3 0/1 CrashLoopBackOff 4 2m38s
This is also expected.
But say I set the sleep to 24 hours: would I still get the same CrashLoopBackOff after the first two exits, and then after each subsequent exit?

Based on these docs:
The restartPolicy applies to all containers in the Pod. restartPolicy only refers to restarts of the containers by the kubelet on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, …), that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.
I read that as: a container that runs for longer than 10 minutes before exiting has its backoff timer reset, so it will not accumulate the growing delays that show up as a CrashLoopBackOff status.
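As a rough way to check this, you could run a container that stays up longer than the 10-minute threshold before exiting (the sleep value below is only illustrative):
kubectl run mysleeper --image nginx -- /bin/bash -c "sleep 660; echo hello"
kubectl get pod mysleeper -w
# If the docs above hold, every run exceeds 10 minutes, so the kubelet resets the backoff
# timer after each run and the delay between restarts should stay at the initial ~10s
# instead of growing toward the 5-minute cap the way it does in the examples above.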

Related

deploy exceptionless on k8s! Error Back-off restarting failed container

I got the Exceptionless Helm chart; my values.yaml is https://github.com/mypublicuse/myfile/blob/main/el-values.yaml
I got the following errors:
1:
Error: INSTALLATION FAILED: Deployment.apps "exceptionless-elasticsearch" is invalid: spec.template.spec.initContainers[0].image: Required value
So I edited elasticsearch.yaml and added:
spec:
  initContainers:
    - name: sysctl
      image: mydockerhost/busybox:1.35
so that Helm could install the chart.
2: After helm install, I found:
exless-nfsclient-nfs-subdir-external-provisioner-7fc86846fmlbgz 1/1 Running 0 52m
exceptionless-redis-85956947f-7vkpg 1/1 Running 0 49m
exceptionless-app-6547d4d88d-2hkbg 1/1 Running 0 49m
exceptionless-elasticsearch-76f6cc9b9-2jgks 1/1 Running 0 49m
exceptionless-jobs-web-hooks-7bb9d7477c-kpmwv 0/1 CrashLoopBackOff 14 (2m53s ago) 49m
exceptionless-jobs-event-notifications-844cb87665-bd7bt 0/1 CrashLoopBackOff 14 (2m53s ago) 49m
exceptionless-jobs-mail-message-647d6bd897-s8jmq 0/1 CrashLoopBackOff 14 (2m55s ago) 49m
exceptionless-jobs-event-usage-75c6d6d54d-m5rjr 0/1 CrashLoopBackOff 14 (2m46s ago) 49m
exceptionless-jobs-work-item-c74d77b55-th4g7 0/1 CrashLoopBackOff 14 (2m34s ago) 49m
exceptionless-jobs-daily-summary-6c99dfbc87-7zq5k 0/1 CrashLoopBackOff 14 (2m34s ago) 49m
exceptionless-jobs-event-posts-75777759b8-nsmbw 0/1 CrashLoopBackOff 14 (2m32s ago) 49m
exceptionless-jobs-close-inactive-sessions-b49595f49-hmfxm 0/1 CrashLoopBackOff 14 (2m14s ago) 49m
exceptionless-jobs-event-user-descriptions-5c9d5dc768-8h27z 0/1 CrashLoopBackOff 14 (2m16s ago) 49m
exceptionless-jobs-stack-event-count-54ffcfb4b6-gk6mz 0/1 CrashLoopBackOff 14 (2m ago) 49m
exceptionless-jobs-maintain-indexes-27669970-s28cg 0/1 CrashLoopBackOff 5 (94s ago) 4m30s
exceptionless-collector-5c774fd8ff-6ksvx 0/1 CrashLoopBackOff 2 (11s ago) 37s
exceptionless-api-66fc9cc659-zckzz 0/1 CrashLoopBackOff 3 (9s ago) 55s
The api, collector, and jobs pods are not succeeding.
I need help! Thanks!
The pod log only shows:
Back-off restarting failed container
Yes, just that!
I guess the program starts and immediately crashes, so ....
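Since "Back-off restarting failed container" is only the kubelet's backoff event, the actual crash reason usually has to be pulled from the failed container itself. A minimal sketch, using one of the pod names from the listing above:
kubectl logs exceptionless-api-66fc9cc659-zckzz --previous    # stdout/stderr of the last crashed attempt
kubectl describe pod exceptionless-api-66fc9cc659-zckzz       # exit code, reason and recent events
kubectl get pod exceptionless-api-66fc9cc659-zckzz -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'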

PostgreSQL-HA on Kubernetes recover from Volume Snapshot?

I have a Kubernetes VolumeSnapshot created as a backup of the pgsql-ha persistent volume.
I am able to recover the PVC by specifying the snapshot as the dataSource, and I am now trying to create a new pgsql-ha cluster with the Helm chart and attach this PVC to recover the data. This is the installation command:
helm install db-ha bitnami/postgresql-ha \
--set postgresql.password=$PWD \
--set persistence.existingClaim="pvc-restore-from-snapshot"
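For context, a PVC restored from a volume snapshot as described above would look roughly like this (the snapshot name, storage class and size are placeholders, not taken from the question):
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-restore-from-snapshot
spec:
  storageClassName: ebs-sc                 # placeholder; must match the CSI driver that took the snapshot
  dataSource:
    name: pgsql-ha-snapshot                # placeholder VolumeSnapshot name
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi                         # must be at least the size of the snapshot's source PVC
EOF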
Then the pgpool and both postgresql Pods show CrashLoopBackOff forever.
$ kubectl get pods --watch
NAME READY STATUS RESTARTS AGE
db-ha-pgpool-gradfergr43sfxv 0/1 Running 0 8s
db-ha-postgresql-0 0/1 Init:0/1 0 8s
db-ha-postgresql-1 0/1 Init:0/1 0 8s
db-ha-postgresql-1 0/1 PodInitializing 0 23s
db-ha-postgresql-0 0/1 PodInitializing 0 23s
db-ha-postgresql-1 0/1 Error 0 24s
db-ha-postgresql-0 0/1 Error 0 24s
db-ha-postgresql-1 0/1 Error 1 25s
db-ha-postgresql-0 0/1 Error 1 25s
db-ha-postgresql-1 0/1 CrashLoopBackOff 1 26s
db-ha-postgresql-0 0/1 CrashLoopBackOff 1 27s
From what I have read so far in this issue, persistence.existingClaim is only supported when the replica count is set to 1, which means the data can only be restored on a non-HA cluster, and pgsql-ha is currently unable to replicate a manually specified PVC.
So I'm wondering:
Whether that is the whole story and there is nothing I'm missing
Whether it's possible to modify the storageClass or even the provisioner (ebs-csi) so that the existing PVC can be used
Whether another workaround exists for this workflow
Many thanks!
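A sketch of what the single-replica restore mentioned above might look like (postgresql.replicaCount is an assumption about the chart's value name; verify it against your chart version):
helm install db-restore bitnami/postgresql-ha \
  --set postgresql.replicaCount=1 \
  --set postgresql.password=$PWD \
  --set persistence.existingClaim="pvc-restore-from-snapshot"
# Once this single instance is healthy, one possible (unverified) follow-up is a logical
# dump/restore into a fresh multi-replica release so that repmgr re-initialises replication
# on the other nodes itself.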

prometheus operator alertmanager-main-0 Pending, then disappearing and restarting

What happened?
Kubernetes version: 1.12
Prometheus Operator: release-0.1
I followed the README:
$ kubectl create -f manifests/
# It can take a few seconds for the above 'create manifests' command to fully create the following resources, so verify the resources are ready before proceeding.
$ until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; do date; sleep 1; echo ""; done
$ until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
$ kubectl apply -f manifests/ # This command sometimes may need to be done twice (to workaround a race condition).
Then I checked the pods, which showed:
[root@VM_8_3_centos /data/hansenwu/kube-prometheus/manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 66s
alertmanager-main-1 1/2 Running 0 47s
grafana-54f84fdf45-kt2j9 1/1 Running 0 72s
kube-state-metrics-65b8dbf498-h7d8g 4/4 Running 0 57s
node-exporter-7mpjw 2/2 Running 0 72s
node-exporter-crfgv 2/2 Running 0 72s
node-exporter-l7s9g 2/2 Running 0 72s
node-exporter-lqpns 2/2 Running 0 72s
prometheus-adapter-5b6f856dbc-ndfwl 1/1 Running 0 72s
prometheus-k8s-0 3/3 Running 1 59s
prometheus-k8s-1 3/3 Running 1 59s
prometheus-operator-5c64c8969-lqvkb 1/1 Running 0 72s
[root@VM_8_3_centos /data/hansenwu/kube-prometheus/manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 0/2 Pending 0 0s
grafana-54f84fdf45-kt2j9 1/1 Running 0 75s
kube-state-metrics-65b8dbf498-h7d8g 4/4 Running 0 60s
node-exporter-7mpjw 2/2 Running 0 75s
node-exporter-crfgv 2/2 Running 0 75s
node-exporter-l7s9g 2/2 Running 0 75s
node-exporter-lqpns 2/2 Running 0 75s
prometheus-adapter-5b6f856dbc-ndfwl 1/1 Running 0 75s
prometheus-k8s-0 3/3 Running 1 62s
prometheus-k8s-1 3/3 Running 1 62s
prometheus-operator-5c64c8969-lqvkb 1/1 Running 0 75s
I don't know why the pod alertmanager-main-0 goes Pending, disappears, and then restarts.
When I look at the events, they show:
72s Warning FailedCreate StatefulSet create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
72s Warning FailedCreate StatefulSet create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
Most likely Alertmanager does not get enough time to start correctly.
Have a look at this answer: https://github.com/coreos/prometheus-operator/issues/965#issuecomment-460223268
You can set the paused field to true and then modify the StatefulSet to check whether extending the liveness/readiness probes solves your issue.
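A rough sketch of that approach against the kube-prometheus resources above (resource and field names should be double-checked against your operator version):
kubectl -n monitoring patch alertmanager main --type merge -p '{"spec":{"paused":true}}'
# With the operator paused, the StatefulSet can be edited directly, e.g. to raise the
# liveness/readiness probes' initialDelaySeconds and timeoutSeconds:
kubectl -n monitoring edit statefulset alertmanager-main
# Revert once Alertmanager starts cleanly:
kubectl -n monitoring patch alertmanager main --type merge -p '{"spec":{"paused":false}}'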

kubernetes UnexpectedAdmissionError after rollout

I had a service failing to reply to some HTTP requests; digging through its logs, it seemed to be some sort of DNS failure when reaching a proxy service:
'proxy' failed to resolve 'proxy.default.svc.cluster.local' after 2 queries
I could not find anything obviously wrong, so I tried kubectl rollout restart deployment/backend.
Just after that these appeared in the pods list:
backend-54769cbb4-xkwf2 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-xlpgf 0/1 UnexpectedAdmissionError 0 4h4m
backend-54769cbb4-xmnr5 0/1 UnexpectedAdmissionError 0 4h7m
backend-54769cbb4-xmq5n 0/1 UnexpectedAdmissionError 0 4h7m
backend-54769cbb4-xphrw 0/1 UnexpectedAdmissionError 0 4h5m
backend-54769cbb4-xrmrq 0/1 UnexpectedAdmissionError 0 4h1m
backend-54769cbb4-xrmw8 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-xt4ck 0/1 UnexpectedAdmissionError 0 4h4m
backend-54769cbb4-xws8r 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-xx6r4 0/1 UnexpectedAdmissionError 0 4h5m
backend-54769cbb4-xxpfd 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-xzjql 0/1 UnexpectedAdmissionError 0 4h4m
backend-54769cbb4-xzzlk 0/1 UnexpectedAdmissionError 0 4h7m
backend-54769cbb4-z46ms 0/1 UnexpectedAdmissionError 0 4h5m
backend-54769cbb4-z4sl7 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-z6jpj 0/1 UnexpectedAdmissionError 0 4h5m
backend-54769cbb4-z6ngq 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-z8w4h 0/1 UnexpectedAdmissionError 0 4h5m
backend-54769cbb4-z9jqb 0/1 UnexpectedAdmissionError 0 4h3m
backend-54769cbb4-zbvqm 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-zcfxg 0/1 UnexpectedAdmissionError 0 4h3m
backend-54769cbb4-zcvqm 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-zf2f8 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-zgnkh 0/1 UnexpectedAdmissionError 0 4h7m
backend-54769cbb4-zhdr8 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-zhx6g 0/1 UnexpectedAdmissionError 0 4h7m
backend-54769cbb4-zj8f2 0/1 UnexpectedAdmissionError 0 4h3m
backend-54769cbb4-zjbwp 0/1 UnexpectedAdmissionError 0 4h5m
backend-54769cbb4-zjc8g 0/1 UnexpectedAdmissionError 0 4h3m
backend-54769cbb4-zjdcp 0/1 UnexpectedAdmissionError 0 4h4m
backend-54769cbb4-zkcrb 0/1 UnexpectedAdmissionError 0 4h7m
backend-54769cbb4-zlpll 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-zm2cx 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-zn7mr 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-znjkp 0/1 UnexpectedAdmissionError 0 4h3m
backend-54769cbb4-zpnk7 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-zrrl7 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-zsdsz 0/1 UnexpectedAdmissionError 0 4h4m
backend-54769cbb4-ztdx8 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-ztln6 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-ztplg 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-ztzfh 0/1 UnexpectedAdmissionError 0 4h2m
backend-54769cbb4-zvb8g 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-zwsr8 0/1 UnexpectedAdmissionError 0 4h7m
backend-54769cbb4-zwvxr 0/1 UnexpectedAdmissionError 0 4h5m
backend-54769cbb4-zwx6h 0/1 UnexpectedAdmissionError 0 4h6m
backend-54769cbb4-zz4bf 0/1 UnexpectedAdmissionError 0 4h1m
backend-54769cbb4-zzq6t 0/1 UnexpectedAdmissionError 0 4h2m
(and many more of these)
So I added two more nodes, and now everything seems fine except for this big list of pods in an error state which I don't understand. What is this UnexpectedAdmissionError, and what should I do about it?
Note: this is a DigitalOcean cluster
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:38:36Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
The following seems important: kubectl describe one_failed_pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m51s default-scheduler Successfully assigned default/backend-549f576d5f-xzdv4 to std-16gb-g7mo
Warning UnexpectedAdmissionError 2m51s kubelet, std-16gb-g7mo Update plugin resources failed due to failed to write checkpoint file "kubelet_internal_checkpoint": write /var/lib/kubelet/device-plugins/.543592130: no space left on device, which is unexpected.
I had the same issue; while describing one of the pods with UnexpectedAdmissionError I saw the following:
Update plugin resources failed due to failed to write deviceplugin checkpoint file "kubelet_internal_checkpoint": write /var/lib/kubelet/device-plugins/.525608957: no space left on device, which is unexpected.
And when describing the node:
OutOfDisk Unknown Tue, 30 Jun 2020 14:07:04 -0400 Tue, 30 Jun 2020 14:12:05 -0400 NodeStatusUnknown Kubelet stopped posting node status.
I resolved this by rebooting the node.
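Before (or instead of) rebooting, it may be worth confirming and freeing space on the node itself over SSH (paths below are the common defaults and may differ on your nodes):
df -h /var /var/lib/kubelet           # confirm which filesystem actually ran out of space
sudo journalctl --vacuum-size=200M    # trim old journal logs
docker system prune -af               # or: crictl rmi --prune, depending on the container runtime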
Because the pod was never started, you can't actually check its logs. However, describing the pod provided me with the error. We had some disk/CPU/memory utilization issues with the worker5 node.
kubectl get pods -A -o wide | grep -i err
kube-system coredns-autoscaler-79599b9dc6-6l8s8 0/1 UnexpectedAdmissionError 0 10h <none> worker5 <none> <none>
kube-system coredns-autoscaler-79599b9dc6-kzt9z 0/1 UnexpectedAdmissionError 0 10h <none> worker5 <none> <none>
kube-system coredns-autoscaler-79599b9dc6-tgkrc 0/1 UnexpectedAdmissionError 0 10h <none> worker5 <none> <none>
kubectl describe pod -n kube-system coredns-autoscaler-79599b9dc6-kzt9z
Reason: UnexpectedAdmissionError
Message: Pod Allocate failed due to failed to write checkpoint file "kubelet_internal_checkpoint": mkdir /var: file exists, which is unexpected
The first step was rebooting the node, which fixed the issue. The cause was that we had restored some backups to the new cluster, and the restore process triggered this issue.
Because the failed pods were part of a ReplicaSet, replacements were spawned on other worker nodes, so we simply deleted the stuck pods.
As a quick way to delete a lot of pods, you can use:
kubectl get pods -n namespace | grep -i Error | cut -d' ' -f 1 | xargs kubectl delete pod -n namespace
To delete all the erroneous pods in the entire cluster:
kubectl get pods -A | grep -i Error | awk '{print "-n", $1, $2}' | xargs -L1 kubectl delete pod
You can use the -A/--all-namespaces flag to get pods from all namespaces in the cluster.
However, if replacements are not being spawned automatically (which would be weird), you can run kubectl replace:
kubectl get pod coredns-autoscaler-79599b9dc6-6l8s8 -n kube-system -o yaml | kubectl replace --force -f -
For a more verbose read, please refer to kubectl replace --help and the following blog

How to debug Kubernetes the proper way?

I would like to run Istio to play around with, but I am facing issues with my local Kubernetes installation and I am stuck on how to debug it.
This is my current situation:
root@node1:/tmp/istio-0.1.5# kubectl get svc
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana 10.233.2.70 <pending> 3000:31202/TCP 1h
istio-egress 10.233.39.101 <none> 80/TCP 1h
istio-ingress 10.233.48.51 <pending> 80:30982/TCP,443:31195/TCP 1h
istio-manager 10.233.2.109 <none> 8080/TCP,8081/TCP 1h
istio-mixer 10.233.39.58 <none> 9091/TCP,9094/TCP,42422/TCP 1h
kubernetes 10.233.0.1 <none> 443/TCP 4h
prometheus 10.233.63.20 <pending> 9090:32170/TCP 1h
servicegraph 10.233.39.104 <pending> 8088:30814/TCP 1h
root@node1:/tmp/istio-0.1.5# kubectl get pods
NAME READY STATUS RESTARTS AGE
grafana-1261931457-3hx2p 0/1 Pending 0 1h
istio-ca-3887035158-6p3b0 0/1 Pending 0 1h
istio-egress-1920226302-vmlx1 0/1 Pending 0 1h
istio-ingress-2112208289-ctxj5 0/1 Pending 0 1h
istio-manager-2910860705-z28dp 0/2 Pending 0 1h
istio-mixer-2335471611-rsrhb 0/1 Pending 0 1h
prometheus-3067433533-l2m48 0/1 Pending 0 1h
servicegraph-3127588006-1k5rg 0/1 Pending 0 1h
kubectl get rs
NAME DESIRED CURRENT READY AGE
grafana-1261931457 1 1 0 1h
istio-ca-3887035158 1 1 0 1h
istio-egress-1920226302 1 1 0 1h
istio-ingress-2112208289 1 1 0 1h
istio-manager-2910860705 1 1 0 1h
istio-mixer-2335471611 1 1 0 1h
prometheus-3067433533 1 1 0 1h
servicegraph-3127588006 1 1 0 1h
kubectl get pods --show-labels
NAME READY STATUS RESTARTS AGE LABELS
grafana-1261931457-3hx2p 0/1 Pending 0 1h app=grafana,pod-template-hash=1261931457
istio-ca-3887035158-6p3b0 0/1 Pending 0 1h istio=istio-ca,pod-template-hash=3887035158
istio-egress-1920226302-vmlx1 0/1 Pending 0 1h istio=egress,pod-template-hash=1920226302
istio-ingress-2112208289-ctxj5 0/1 Pending 0 1h istio=ingress,pod-template-hash=2112208289
istio-manager-2910860705-z28dp 0/2 Pending 0 1h istio=manager,pod-template-hash=2910860705
istio-mixer-2335471611-rsrhb 0/1 Pending 0 1h istio=mixer,pod-template-hash=2335471611
prometheus-3067433533-l2m48 0/1 Pending 0 1h app=prometheus,pod-template-hash=3067433533
servicegraph-3127588006-1k5rg 0/1 Pending 0 1h app=servicegraph,pod-template-hash=3127588006
root@node1:/tmp/istio-0.1.5# kubectl get nodes --show-labels
NAME STATUS AGE VERSION LABELS
node1 Ready 5h v1.6.4+coreos.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node1,node-role.kubernetes.io/master=true,node-role.kubernetes.io/node=true
node2 Ready 5h v1.6.4+coreos.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node2,node-role.kubernetes.io/master=true,node-role.kubernetes.io/node=true
node3 Ready 5h v1.6.4+coreos.0 app=prometeus,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node3,node-role.kubernetes.io/node=true
node4 Ready 5h v1.6.4+coreos.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node4,node-role.kubernetes.io/node=true
Unfortunately, after reading most of the documentation, I have only found a few ways to debug an installation:
journalctl -r -u kubelet
kubectl get events
kubectl describe deployment
Is there any common workflow to debug a Kubernetes installation?
It's in the documentation. Follow the Pod troubleshooting steps:
https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
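For Pending pods like the ones above, that workflow usually starts with asking the scheduler why it cannot place them, for example:
kubectl describe pod istio-manager-2910860705-z28dp   # look for FailedScheduling in the Events section
kubectl get events --sort-by=.lastTimestamp
kubectl describe nodes | grep -iA3 taints              # node taints or insufficient CPU/memory are common causes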