kubernetes: when a pod is in CrashLoopBackOff status, why don't the related events update?

I'm testing Kubernetes behavior when a pod runs into errors.
I now have a pod in CrashLoopBackOff status caused by a failing liveness probe. From what I can see in the Kubernetes events, the pod goes into CrashLoopBackOff after three failed attempts and begins to back off restarting, but why don't the related Liveness probe failed events update?
➜ ~ kubectl describe pods/my-nginx-liveness-err-59fb55cf4d-c6p8l
Name: my-nginx-liveness-err-59fb55cf4d-c6p8l
Namespace: default
Priority: 0
Node: minikube/192.168.99.100
Start Time: Thu, 15 Jul 2021 12:29:16 +0800
Labels: pod-template-hash=59fb55cf4d
run=my-nginx-liveness-err
Annotations: <none>
Status: Running
IP: 172.17.0.3
IPs:
IP: 172.17.0.3
Controlled By: ReplicaSet/my-nginx-liveness-err-59fb55cf4d
Containers:
my-nginx-liveness-err:
Container ID: docker://edc363b76811fdb1ccacdc553d8de77e9d7455bb0d0fb3cff43eafcd12ee8a92
Image: nginx
Image ID: docker-pullable://nginx@sha256:353c20f74d9b6aee359f30e8e4f69c3d7eaea2f610681c4a95849a2fd7c497f9
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 15 Jul 2021 13:01:36 +0800
Finished: Thu, 15 Jul 2021 13:02:06 +0800
Ready: False
Restart Count: 15
Liveness: http-get http://:8080/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r7mh4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-r7mh4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 37m default-scheduler Successfully assigned default/my-nginx-liveness-err-59fb55cf4d-c6p8l to minikube
Normal Created 35m (x4 over 37m) kubelet Created container my-nginx-liveness-err
Normal Started 35m (x4 over 37m) kubelet Started container my-nginx-liveness-err
Normal Killing 35m (x3 over 36m) kubelet Container my-nginx-liveness-err failed liveness probe, will be restarted
Normal Pulled 31m (x7 over 37m) kubelet Container image "nginx" already present on machine
Warning Unhealthy 16m (x32 over 36m) kubelet Liveness probe failed: Get "http://172.17.0.3:8080/": dial tcp 172.17.0.3:8080: connect: connection refused
Warning BackOff 118s (x134 over 34m) kubelet Back-off restarting failed container
The BackOff event updated 118s ago, but the Unhealthy event last updated 16m ago?
And why is the Restart Count only 15 while the BackOff event has fired 134 times?
I'm using minikube and my deployment is like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx-liveness-err
spec:
  selector:
    matchLabels:
      run: my-nginx-liveness-err
  replicas: 1
  template:
    metadata:
      labels:
        run: my-nginx-liveness-err
    spec:
      containers:
      - name: my-nginx-liveness-err
        image: nginx
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 8080

I think you might be confusing Status Conditions and Events. Events don't "update", they just exist. It's a stream of event data from the controllers for debugging or alerting on. The Age column is the relative timestamp to the most recent instance of that event type, and you can see it does some basic de-duplication. Events also age out after a few hours to keep the database from exploding.
So your issue has nothing to do with the liveness probe; your container is crashing on startup.
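If you want to see the raw Event objects with their de-duplication counters and timestamps, rather than describe's aggregated view, you can list them directly. A quick sketch (the pod name is the one from the question):
kubectl get events --field-selector involvedObject.name=my-nginx-liveness-err-59fb55cf4d-c6p8l \
  -o custom-columns=REASON:.reason,COUNT:.count,FIRST:.firstTimestamp,LAST:.lastTimestamp
The COUNT column is the de-duplication counter; LAST is what feeds the Age column in describe output.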

Related

CrashLoopBackOff : Back-off restarting failed container for flask application

I am a beginner in kubernetes and was trying to deploy my flask application following this guide: https://medium.com/analytics-vidhya/build-a-python-flask-app-and-deploy-with-kubernetes-ccc99bbec5dc
I have successfully built a docker image and pushed it to dockerhub https://hub.docker.com/repository/docker/beatrix1997/kubernetes_flask_app
but am having trouble debugging a pod.
This is my yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetesflaskapp-deploy
  labels:
    app: kubernetesflaskapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubernetesflaskapp
  template:
    metadata:
      labels:
        app: kubernetesflaskapp
    spec:
      containers:
      - name: kubernetesflaskapp
        image: beatrix1997/kubernetes_flask_app
        ports:
        - containerPort: 5000
And this is the description of the pod:
Name: kubernetesflaskapp-deploy-5764bbbd44-8696k
Namespace: default
Priority: 0
Node: minikube/192.168.49.2
Start Time: Fri, 20 May 2022 11:26:33 +0100
Labels: app=kubernetesflaskapp
pod-template-hash=5764bbbd44
Annotations: <none>
Status: Running
IP: 172.17.0.12
IPs:
IP: 172.17.0.12
Controlled By: ReplicaSet/kubernetesflaskapp-deploy-5764bbbd44
Containers:
kubernetesflaskapp:
Container ID: docker://d500dc15e389190670a9273fea1d70e6bd6ab2e7053bd2480d114ad6150830f1
Image: beatrix1997/kubernetes_flask_app
Image ID: docker-pullable://beatrix1997/kubernetes_flask_app@sha256:1bfa98229f55b04f32a6b85d72860886abcc0f17295b14e173151a8e4b0f0334
Port: 5000/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 20 May 2022 11:58:38 +0100
Finished: Fri, 20 May 2022 11:58:38 +0100
Ready: False
Restart Count: 11
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zq8n7 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-zq8n7:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 33m default-scheduler Successfully assigned default/kubernetesflaskapp-deploy-5764bbbd44-8696k to minikube
Normal Pulled 33m kubelet Successfully pulled image "beatrix1997/kubernetes_flask_app" in 14.783413947s
Normal Pulled 33m kubelet Successfully pulled image "beatrix1997/kubernetes_flask_app" in 1.243534487s
Normal Pulled 32m kubelet Successfully pulled image "beatrix1997/kubernetes_flask_app" in 1.373217701s
Normal Pulling 32m (x4 over 33m) kubelet Pulling image "beatrix1997/kubernetes_flask_app"
Normal Created 32m (x4 over 33m) kubelet Created container kubernetesflaskapp
Normal Pulled 32m kubelet Successfully pulled image "beatrix1997/kubernetes_flask_app" in 1.239794774s
Normal Started 32m (x4 over 33m) kubelet Started container kubernetesflaskapp
Warning BackOff 3m16s (x138 over 33m) kubelet Back-off restarting failed container
I am using ubuntu as my OS if it matters at all.
Any help would be appreciated!
Many thanks!
I would check the following:
Check if your Docker image works in plain Docker first; you can run it with the docker run command (see the sketch below and the official docs).
If it doesn't work, debug the application itself first.
If it does, check the readiness and liveness probes; the official documentation covers how to configure them.
You can find more hints about debugging failing pods in the official troubleshooting docs.
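A minimal sketch of that local check, assuming the Flask app listens on port 5000 as in the manifest:
# run the image locally, mapping the container port to the host
docker run --rm -p 5000:5000 beatrix1997/kubernetes_flask_app
# in another terminal, verify the app answers
curl http://localhost:5000/
If the container exits immediately here too, the problem is in the image, not in Kubernetes.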
The error can be due to an issue in the application, as the reported reason is "Back-off restarting failed container". Please paste the following logs in the question for further clarification:
kubectl logs -n <NS> <pod-name>
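Because the container keeps restarting, the logs of the current instance are often empty; the previous instance's logs usually contain the actual crash output. For example, using the pod name from the describe output above:
kubectl logs -n default kubernetesflaskapp-deploy-5764bbbd44-8696k --previous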

Getting CrashLoopBackOff error when deploying a pod

I am new to Kubernetes and am trying to deploy a pod from a private registry. Whenever I deploy this yaml it goes into a crash loop. I added a sleep with a large value, thinking the lack of a long-running process might be the cause, but it still hasn't worked.
apiVersion: v1
kind: Pod
metadata:
  name: privetae-image-testing
spec:
  containers:
  - name: private-image-test
    image: buildforjenkin.azurecr.io/nginx:latest
    imagePullPolicy: IfNotPresent
    command: ['echo','success','sleep 1000000']
Here are the logs:
Name: privetae-image-testing
Namespace: default
Priority: 0
Node: docker-desktop/192.168.65.4
Start Time: Sun, 24 Oct 2021 15:52:25 +0530
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.1.1.49
IPs:
IP: 10.1.1.49
Containers:
private-image-test:
Container ID: docker://46520936762f17b70d1ec92a121269e90aef2549390a14184e6c838e1e6bafec
Image: buildforjenkin.azurecr.io/nginx:latest
Image ID: docker-pullable://buildforjenkin.azurecr.io/nginx@sha256:7250923ba3543110040462388756ef099331822c6172a050b12c7a38361ea46f
Port: <none>
Host Port: <none>
Command:
echo
success
sleep 1000000
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 24 Oct 2021 15:52:42 +0530
Finished: Sun, 24 Oct 2021 15:52:42 +0530
Ready: False
Restart Count: 2
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ld6zz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-ld6zz:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34s default-scheduler Successfully assigned default/privetae-image-testing to docker-desktop
Normal Pulled 17s (x3 over 33s) kubelet Container image "buildforjenkin.azurecr.io/nginx:latest" already present on machine
Normal Created 17s (x3 over 33s) kubelet Created container private-image-test
Normal Started 17s (x3 over 33s) kubelet Started container private-image-test
Warning BackOff 2s (x5 over 31s) kubelet Back-off restarting failed container
I am running the cluster on docker-desktop on windows. TIA
Notice you are using the standard nginx image? Try deleting your pod and re-applying with:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-testing
  labels:
    run: my-nginx
spec:
  restartPolicy: Always
  containers:
  - name: private-image-test
    image: buildforjenkin.azurecr.io/nginx:latest
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
      name: http
If your pod runs, you should be able to remote into it with kubectl exec -it private-image-testing -- sh, followed by wget -O- localhost, which should print you a welcome message. If it still fails, paste the output of kubectl logs -f -l run=my-nginx into your question.
Check my previous answer to understand step by step what's going on after you launch the container.
You are launching an nginx:latest container whose internal process runs forever, as it should, so the main process never exits. Then you override that with a command that (I will quote David) prints the words success and sleep 1000000, and having printed those words, then exits; 'sleep 1000000' here is just a third argument to echo, not a command that runs.
Instead of keeping your container running to serve traffic, you are explicitly shooting yourself in the foot by letting the process finish.
And sure, your command is executed and the container exits. Check that: it exited correctly with status 0, has done so 2 times already, and will keep doing so.
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 24 Oct 2021 15:52:42 +0530
Finished: Sun, 24 Oct 2021 15:52:42 +0530
You need to think about whether you really need command: ['echo','success','sleep 1000000']
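If the goal really was to print a message and then keep the container alive, the words would have to be joined into a single shell command. A sketch of what that would look like (note this still replaces the image's nginx entrypoint, so nginx itself would not serve):
command: ['sh', '-c', 'echo success && sleep 1000000']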

A question about pods running on the Kubernetes (k8s) platform: the pods are running but the containers are not ready

I built a k8s cluster on my virtual machines (CentOS 7) with VirtualBox:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master Ready control-plane,master 8d v1.21.2 192.168.0.186 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
k8s-worker01 Ready <none> 8d v1.21.2 192.168.0.187 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
k8s-worker02 Ready <none> 8d v1.21.2 192.168.0.188 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
And I ran some pods in the default namespace with a ReplicaSet several days ago.
They all worked fine at first, and then I shut down the VMs.
Today, after I restarted the VMs, I found that they are not working properly anymore:
kubectl get all
NAME READY STATUS RESTARTS AGE
pod/dnsutils 1/1 Running 3 5d13h
pod/kubapp-6qbfz 0/1 Running 0 5d13h
pod/kubapp-d887h 0/1 Running 0 5d13h
pod/kubapp-z6nw7 0/1 Running 0 5d13h
NAME DESIRED CURRENT READY AGE
replicaset.apps/kubapp 3 3 0 5d13h
Then I deleted the ReplicaSet and re-created it to recreate the pods.
And I ran this command to get more information:
[root@k8s-master ch04]# kubectl describe po kubapp-z887v
Name: kubapp-d887h
Namespace: default
Priority: 0
Node: k8s-worker02/192.168.0.188
Start Time: Fri, 23 Jul 2021 15:55:16 +0000
Labels: app=kubapp
Annotations: cni.projectcalico.org/podIP: 10.244.69.244/32
cni.projectcalico.org/podIPs: 10.244.69.244/32
Status: Running
IP: 10.244.69.244
IPs:
IP: 10.244.69.244
Controlled By: ReplicaSet/kubapp
Containers:
kubapp:
Container ID: docker://fc352ce4c6a826f2cf108f9bb9a335e3572509fd5ae2002c116e2b080df5ee10
Image: evalle/kubapp
Image ID: docker-pullable://evalle/kubapp@sha256:560c9c50b1d894cf79ac472a9925dc795b116b9481ec40d142b928a0e3995f4c
Port: <none>
Host Port: <none>
State: Running
Started: Fri, 23 Jul 2021 15:55:21 +0000
Ready: False
Restart Count: 0
Readiness: exec [ls /var/ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9rwr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-m9rwr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30m default-scheduler Successfully assigned default/kubapp-d887h to k8s-worker02
Normal Pulling 30m kubelet Pulling image "evalle/kubapp"
Normal Pulled 30m kubelet Successfully pulled image "evalle/kubapp" in 4.049160061s
Normal Created 30m kubelet Created container kubapp
Normal Started 30m kubelet Started container kubapp
Warning Unhealthy 11s (x182 over 30m) kubelet Readiness probe failed: ls: cannot access /var/ready: No such file or directory
I don't know why this happens or how I should fix it.
So here I am, asking you guys for help.
I am a k8s newbie, please give me a hand.
Thanks for paul-becotte's help and recommendation. I think I should post the definition of the pod:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  # here is the name of the replication controller (RC)
  name: kubapp
spec:
  replicas: 3
  # what pods the RC is operating on
  selector:
    matchLabels:
      app: kubapp
  # the pod template for creating new pods
  template:
    metadata:
      labels:
        app: kubapp
    spec:
      containers:
      - name: kubapp
        image: evalle/kubapp
        readinessProbe:
          exec:
            command:
            - ls
            - /var/ready
There is an example yaml definition at https://github.com/Evalle/k8s-in-action/blob/master/Chapter_4/kubapp-rs.yaml.
I don't know where to find the Dockerfile of the image evalle/kubapp,
and I don't know whether it has the /var/ready directory.
Look at your event
Warning Unhealthy 11s (x182 over 30m) kubelet Readiness probe failed: ls: cannot access /var/ready: No such file or directory
Your readiness probe is failing: it looks like it is checking for the existence of a file at /var/ready.
Your next step is "does that make sense? Is my container actually going to write a file at /var/ready when it's ready?" If so, you'll want to look at the logs from your pod and figure out why it's not writing the file. If it's NOT the correct check, look at the yaml you used to create your pod/deployment/replicaset and replace that check with something that does make sense.
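For illustration, this is the contract such a probe assumes: something inside the container creates the marker file once the app is ready. A hypothetical sketch (the wrapper command and the /app/kubapp path are made up for the example; the real image may work differently):
containers:
- name: kubapp
  image: evalle/kubapp
  # hypothetical wrapper: start the app in the background, wait for it to come up,
  # then create the marker the readiness probe checks for
  command: ['sh', '-c', '/app/kubapp & sleep 5 && touch /var/ready && wait']
  readinessProbe:
    exec:
      command: ['ls', '/var/ready']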

Argo workflow stuck in pending due to liveness probe fail?

I am trying to set up a Hyperledger Fabric network on Kubernetes by using this.
I am at the step where I am trying to create channels. I run the command argo submit output.yaml -v where output.yaml is the output of the command helm template channel-flow/ -f samples/simple/network.yaml -f samples/simple/crypto-config.yaml but with spec.securityContext added as follows:
...
spec:
  securityContext:
    runAsNonRoot: true
    #runAsUser: 8737 (I commented this out because I don't know my user ID; not sure if this could cause a problem)
  entrypoint: channels
...
My argo workflow ends up stuck in the pending state. I say this because I checked my orderer and peer logs and see no movement in them.
I referenced Argo sample workflows stuck in the pending state, and I started by getting the argo logs:
[user@vmmock3 fabric-kube]$ kubectl logs -n argo -l app=workflow-controller
time="2021-05-31T05:02:41.145Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:41.150Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:46.162Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:46.168Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:51.179Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:51.185Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:56.193Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:56.199Z" level=info msg="Update leases 200"
time="2021-05-31T05:03:01.213Z" level=info msg="Get leases 200"
time="2021-05-31T05:03:01.219Z" level=info msg="Update leases 200"
I tried describing the workflow-controller pod:
[user@vmmock3 fabric-kube]$ kubectl -n argo describe pod workflow-controller-57fcfb5df8-qvn74
Name: workflow-controller-57fcfb5df8-qvn74
Namespace: argo
Priority: 0
Node: hlf-pool1-8rnem/10.104.0.8
Start Time: Tue, 25 May 2021 13:44:56 +0800
Labels: app=workflow-controller
pod-template-hash=57fcfb5df8
Annotations: <none>
Status: Running
IP: 10.244.0.158
IPs:
IP: 10.244.0.158
Controlled By: ReplicaSet/workflow-controller-57fcfb5df8
Containers:
workflow-controller:
Container ID: containerd://78c7f8dcb0f3a3b861293559ae0a11b92ce6843065e6f9459556a6b7099c8961
Image: argoproj/workflow-controller:v3.0.5
Image ID: docker.io/argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v3.0.5
--namespaced
State: Running
Started: Mon, 31 May 2021 13:08:11 +0800
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Mon, 31 May 2021 12:59:05 +0800
Finished: Mon, 31 May 2021 13:03:04 +0800
Ready: True
Restart Count: 1333
Liveness: http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
Environment:
LEADER_ELECTION_IDENTITY: workflow-controller-57fcfb5df8-qvn74 (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from argo-token-hflpb (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
argo-token-hflpb:
Type: Secret (a volume populated by a Secret)
SecretName: argo-token-hflpb
Optional: false
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 7m44s (x3994 over 5d23h) kubelet Liveness probe failed: Get "http://10.244.0.158:6060/healthz": dial tcp 10.244.0.158:6060: connect: connection refused
Warning BackOff 3m46s (x16075 over 5d22h) kubelet Back-off restarting failed container
Could this failure be why my argo workflow is stuck in the pending state? How should I go about troubleshooting this?
EDIT: Output of kubectl get pods --all-namespaces (FYI these are being run on Digital Ocean):
[user@vmmock3 fabric-kube]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
argo argo-server-5695555c55-867bx 1/1 Running 1 6d19h
argo minio-58977b4b48-r2m2h 1/1 Running 0 6d19h
argo postgres-6b5c55f477-7swpp 1/1 Running 0 6d19h
argo workflow-controller-57fcfb5df8-qvn74 0/1 CrashLoopBackOff 1522 6d19h
default hlf-ca--atlantis-58bbd79d9d-x4mz4 1/1 Running 0 21h
default hlf-ca--karga-547dbfddc8-7w6b5 1/1 Running 0 21h
default hlf-ca--nevergreen-7ffb98484c-nlg4j 1/1 Running 0 21h
default hlf-orderer--groeifabriek--orderer0-0 1/1 Running 0 21h
default hlf-peer--atlantis--peer0-0 2/2 Running 0 21h
default hlf-peer--karga--peer0-0 2/2 Running 0 21h
default hlf-peer--nevergreen--peer0-0 2/2 Running 0 21h
kube-system cilium-2kjfz 1/1 Running 3 26d
kube-system cilium-operator-84bdd6f7b6-kp9vb 1/1 Running 1 6d20h
kube-system cilium-operator-84bdd6f7b6-pkkf9 1/1 Running 1 6d20h
kube-system coredns-55ff57f948-jb5jc 1/1 Running 0 6d20h
kube-system coredns-55ff57f948-r2q4g 1/1 Running 0 6d20h
kube-system csi-do-node-4r9gj 2/2 Running 0 26d
kube-system do-node-agent-sbc8b 1/1 Running 0 26d
kube-system kube-proxy-hpsc7 1/1 Running 0 26d
I will answer your question partially, as I'm not promising everything else will work fine; however, I know how to fix the issue with the argo workflow-controller pod.
Answer
In short, you need to update argo workflows to a newer version (at least 3.0.6, ideally 3.0.7, which is available) because it looks like a bug in version 3.0.5.
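For example, upgrading in place can be done by applying the newer release manifest over the existing install; this URL follows the same pattern as the official release manifests (adjust the version as needed):
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.0.7/install.yaml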
How I got there
First I installed argo version 3.0.5 (which is not production ready)
and ended up with workflow-controller pod restarts:
kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-645cf8bc47-sbnqv 1/1 Running 0 9m7s
workflow-controller-768565d958-9lftf 1/1 Running 2 9m7s
curl-pod 1/1 Running 0 6m47s
And saw the same liveness probe failure:
kubectl describe pod workflow-controller-768565d958-9lftf -n argo
Name: workflow-controller-768565d958-9lftf
Namespace: argo
Priority: 0
Node: worker1/10.186.0.3
Start Time: Tue, 01 Jun 2021 14:25:00 +0000
Labels: app=workflow-controller
pod-template-hash=768565d958
Annotations: <none>
Status: Running
IP: 10.244.1.151
IPs:
IP: 10.244.1.151
Controlled By: ReplicaSet/workflow-controller-768565d958
Containers:
workflow-controller:
Container ID: docker://4b797b57ae762f9fc3f7acdd890d25434a8d9f6f165bbb7a7bda35745b5f4092
Image: argoproj/workflow-controller:v3.0.5
Image ID: docker-pullable://argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v3.0.5
State: Running
Started: Tue, 01 Jun 2021 14:33:00 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Tue, 01 Jun 2021 14:29:00 +0000
Finished: Tue, 01 Jun 2021 14:33:00 +0000
Ready: True
Restart Count: 2
Liveness: http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
Environment:
LEADER_ELECTION_IDENTITY: workflow-controller-768565d958-9lftf (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ts9zf (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-ts9zf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m57s default-scheduler Successfully assigned argo/workflow-controller-768565d958-9lftf to worker1
Normal Pulled 57s (x3 over 8m56s) kubelet Container image "argoproj/workflow-controller:v3.0.5" already present on machine
Normal Created 57s (x3 over 8m56s) kubelet Created container workflow-controller
Normal Started 57s (x3 over 8m56s) kubelet Started container workflow-controller
Warning Unhealthy 57s (x6 over 6m57s) kubelet Liveness probe failed: Get "http://10.244.1.151:6060/healthz": dial tcp 10.244.1.151:6060: connect: connection refused
Normal Killing 57s (x2 over 4m57s) kubelet Container workflow-controller failed liveness probe, will be restarted
I also tested this endpoint with a pod in the same namespace based on the curlimages/curl image, which has curl built in.
Here's the pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  namespace: argo
  labels:
    app: curl
  name: curl-pod
spec:
  containers:
  - image: curlimages/curl
    name: curl-pod
    command: ['sh', '-c', 'while true ; do sleep ; done']
  dnsPolicy: ClusterFirst
  restartPolicy: Always
kubectl exec -it curl-pod -n argo -- curl http://10.244.1.151:6060/healthz
Which resulted in the same error:
curl: (7) Failed to connect to 10.244.1.151 port 6060: Connection refused
The next step was trying a newer version (3.10rc and then 3.0.7), and it succeeded!
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned argo/workflow-controller-74b4b5455d-skb2f to worker1
Normal Pulling 27m kubelet Pulling image "argoproj/workflow-controller:v3.0.7"
Normal Pulled 27m kubelet Successfully pulled image "argoproj/workflow-controller:v3.0.7" in 15.728042003s
Normal Created 27m kubelet Created container workflow-controller
Normal Started 27m kubelet Started container workflow-controller
And check it with curl:
kubectl exec -it curl-pod -n argo -- curl 10.244.1.169:6060/healthz
ok
The problem is due to the use of an old version of the workflow-controller. If you are following the docs, they download an old version of the workflow-controller, which ends up causing a lot of issues. Use the commands below instead,
or find the latest release on the argo-workflows releases page:
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.3.6/install.yaml

Readiness probe failed: timeout: failed to connect service ":8080" within 1s

I am trying to build and deploy microservices images to a single-node Kubernetes cluster running on my development machine using minikube. I am using the cloud-native microservices demo application Online Boutique by Google to understand the use of technologies like Kubernetes, Istio etc.
Link to github repo: microservices-demo
I have followed the entire installation process to locally build and deploy the microservices, and am able to access the web frontend through my browser. However, when I click on any of the product images, I see this error page.
HTTP Status: 500 Internal Server Error
On doing a check using kubectl get pods,
I realize that one of my pods (the recommendation service) has status CrashLoopBackOff.
Running kubectl describe pods recommendationservice-55b4d6c477-kxv8r:
Namespace: default
Priority: 0
Node: minikube/192.168.99.116
Start Time: Thu, 23 Jul 2020 19:58:38 +0530
Labels: app=recommendationservice
app.kubernetes.io/managed-by=skaffold-v1.11.0
pod-template-hash=55b4d6c477
skaffold.dev/builder=local
skaffold.dev/cleanup=true
skaffold.dev/deployer=kubectl
skaffold.dev/docker-api-version=1.40
skaffold.dev/run-id=49913ced-e8df-40a7-9336-a227b56bcb5f
skaffold.dev/tag-policy=git-commit
Annotations: <none>
Status: Running
IP: 172.17.0.14
IPs:
IP: 172.17.0.14
Controlled By: ReplicaSet/recommendationservice-55b4d6c477
Containers:
server:
Container ID: docker://2d92aa966a82fbe58c8f40f6ecf9d6d55c29f8081cb40e0423a2397e1419350f
Image: recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Image ID: docker://sha256:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 23 Jul 2020 21:09:33 +0530
Finished: Thu, 23 Jul 2020 21:09:53 +0530
Ready: False
Restart Count: 29
Limits:
cpu: 200m
memory: 450Mi
Requests:
cpu: 100m
memory: 220Mi
Liveness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Readiness: exec [/bin/grpc_health_probe -addr=:8080] delay=0s timeout=1s period=5s #success=1 #failure=3
Environment:
PORT: 8080
PRODUCT_CATALOG_SERVICE_ADDR: productcatalogservice:3550
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-sbpcx (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-sbpcx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-sbpcx
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 44m (x15 over 74m) kubelet, minikube Container image "recommendationservice:2216d526d249cc8363129aed9a09d752f9ad8f458e61e50a2a99c59d000606cb" already present on machine
Warning Unhealthy 9m33s (x99 over 74m) kubelet, minikube Readiness probe failed: timeout: failed to connect service ":8080" within 1s
Warning BackOff 4m25s (x294 over 72m) kubelet, minikube Back-off restarting failed container
In Events, I see Readiness probe failed: timeout: failed to connect service ":8080" within 1s.
What is the reason and how can I resolve this?
Thanks for the help!
Answer
The timeout of the Readiness Probe (1 second) was too short.
More Info
The relevant Readiness Probe is defined such that /bin/grpc_health_probe -addr=:8080 is run inside the server container.
You would expect a 1 second timeout to be sufficient for such a probe, but this is running on Minikube, so that could be affecting how long the probe takes to respond.
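A sketch of the corresponding fix, assuming the probe is defined in the deployment manifest as reflected in the describe output; the probe fields are standard Kubernetes, and the values here are illustrative:
readinessProbe:
  exec:
    command: ['/bin/grpc_health_probe', '-addr=:8080']
  periodSeconds: 5
  timeoutSeconds: 5   # raised from 1s so slow responses on Minikube don't mark the pod unready
The same timeoutSeconds adjustment would apply to the liveness probe, which uses the identical command.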