Argo sample workflows stuck in the pending state - argo-workflows

I am following the Argo Workflows Getting Started documentation. Everything goes smoothly until I run the first sample workflow as described in 4. Run Sample Workflows. The workflow just gets stuck in the Pending state:
vagrant@master:~$ argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
Name: hello-world-z4lbs
Namespace: default
ServiceAccount: default
Status: Pending
Created: Thu May 14 12:36:45 +0000 (now)
vagrant@master:~$ argo list
NAME STATUS AGE DURATION PRIORITY
hello-world-z4lbs Pending 27m 0s 0
Here it was mentioned that taints on the master node may be the problem, so I untainted the master node:
vagrant@master:~$ kubectl taint nodes --all node-role.kubernetes.io/master-
node/master untainted
taint "node-role.kubernetes.io/master" not found
taint "node-role.kubernetes.io/master" not found
Then I deleted the pending workflow and resubmitted it, but it got stuck in the pending state again.
The details of the newly submitted workflow that is also stuck:
vagrant@master:~$ kubectl describe workflow hello-world-8kvmb
Name: hello-world-8kvmb
Namespace: default
Labels: <none>
Annotations: <none>
API Version: argoproj.io/v1alpha1
Kind: Workflow
Metadata:
Creation Timestamp: 2020-05-14T13:57:44Z
Generate Name: hello-world-
Generation: 1
Managed Fields:
API Version: argoproj.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:generateName:
f:spec:
.:
f:arguments:
f:entrypoint:
f:templates:
f:status:
.:
f:finishedAt:
f:startedAt:
Manager: argo
Operation: Update
Time: 2020-05-14T13:57:44Z
Resource Version: 16780
Self Link: /apis/argoproj.io/v1alpha1/namespaces/default/workflows/hello-world-8kvmb
UID: aa82d005-b7ac-411f-9d0b-93f34876b673
Spec:
Arguments:
Entrypoint: whalesay
Templates:
Arguments:
Container:
Args:
hello world
Command:
cowsay
Image: docker/whalesay:latest
Name:
Resources:
Inputs:
Metadata:
Name: whalesay
Outputs:
Status:
Finished At: <nil>
Started At: <nil>
Events: <none>
While trying to get the workflow-controller logs I get the following error:
vagrant@master:~$ kubectl logs -n argo -l app=workflow-controller
Error from server (BadRequest): container "workflow-controller" in pod "workflow-controller-6c4787844c-lbksm" is waiting to start: ContainerCreating
The details for the corresponding workflow-controller pod:
vagrant@master:~$ kubectl -n argo describe pods/workflow-controller-6c4787844c-lbksm
Name: workflow-controller-6c4787844c-lbksm
Namespace: argo
Priority: 0
Node: node-1/192.168.50.11
Start Time: Thu, 14 May 2020 12:08:29 +0000
Labels: app=workflow-controller
pod-template-hash=6c4787844c
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/workflow-controller-6c4787844c
Containers:
workflow-controller:
Container ID:
Image: argoproj/workflow-controller:v2.8.0
Image ID:
Port: <none>
Host Port: <none>
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v2.8.0
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from argo-token-pz4fd (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
argo-token-pz4fd:
Type: Secret (a volume populated by a Secret)
SecretName: argo-token-pz4fd
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 7m17s (x4739 over 112m) kubelet, node-1 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 2m18s (x4950 over 112m) kubelet, node-1 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1bd1fd11dfe677c749b4a1260c29c2f8cff0d55de113d154a822e68b41f9438e" network for pod "workflow-controller-6c4787844c-lbksm": networkPlugin cni failed to set up pod "workflow-controller-6c4787844c-lbksm_argo" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
I am running Argo 2.8:
vagrant@master:~$ argo version
argo: v2.8.0
BuildDate: 2020-05-11T22:55:16Z
GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
GitTreeState: clean
GitTag: v2.8.0
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64
I have checked the cluster status and it looks OK:
vagrant@master:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 95m v1.18.2
node-1 Ready <none> 92m v1.18.2
node-2 Ready <none> 92m v1.18.2
As to the K8s cluster installation, I created it using Vagrant as described here, the only differences being:
libvirt as the provider
newer version of Ubuntu: generic/ubuntu1804
newer version of Calico: v3.14
Any idea why the workflows get stuck in the pending state and how to fix it?

Workflows start in the Pending state and then are moved through their steps by the workflow-controller pod (which is installed in the cluster as part of Argo).
The workflow-controller pod is stuck in ContainerCreating. kubectl describe po {workflow-controller pod} reveals a Calico-related network error.
As mentioned in the comments, it looks like a common Calico error. Once you clear that up, your hello-world workflow should execute just fine.
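For example (a rough sketch; the calico-node label and container name below are the ones used by the standard Calico manifests, so adjust if yours differ), you can watch both sides with:
kubectl -n kube-system get pods -l k8s-app=calico-node
kubectl -n kube-system logs -l k8s-app=calico-node -c calico-node --tail=50
kubectl -n argo get pods -l app=workflow-controller -w
Once the calico-node pods are Running and the workflow-controller pod leaves ContainerCreating, resubmitting the workflow should work.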
Note from OP: Further debugging confirms the Calico problem (the Calico pods are not all up and running):
vagrant@master:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
argo argo-server-84946785b-94bfs 0/1 ContainerCreating 0 3h59m
argo workflow-controller-6c4787844c-lbksm 0/1 ContainerCreating 0 3h59m
kube-system calico-kube-controllers-74d45555dd-zhkp6 0/1 CrashLoopBackOff 56 3h59m
kube-system calico-node-2n9kt 0/1 CrashLoopBackOff 72 3h59m
kube-system calico-node-b8sb8 0/1 Running 70 3h56m
kube-system calico-node-pslzs 0/1 CrashLoopBackOff 67 3h56m
kube-system coredns-66bff467f8-rmxsp 0/1 ContainerCreating 0 3h59m
kube-system coredns-66bff467f8-z4lbq 0/1 ContainerCreating 0 3h59m
kube-system etcd-master 1/1 Running 2 3h59m
kube-system kube-apiserver-master 1/1 Running 2 3h59m
kube-system kube-controller-manager-master 1/1 Running 2 3h59m
kube-system kube-proxy-k59ks 1/1 Running 2 3h59m
kube-system kube-proxy-mn96x 1/1 Running 1 3h56m
kube-system kube-proxy-vxj8b 1/1 Running 1 3h56m
kube-system kube-scheduler-master 1/1 Running 2 3h59m

For the calico CrashLoopBackOff: kubeadm uses the default interface eth0 to bootstrap the cluster,
but the eth0 interface is used by Vagrant (for SSH).
You can configure the kubelet to use the node's private IP address instead of eth0.
You'll have to do that on each node and then run vagrant reload.
sudo vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# Add the Environment line to 10-kubeadm.conf and replace your_node_ip with the node's private IP
Environment="KUBELET_EXTRA_ARGS=--node-ip=your_node_ip"
Hope it helps

Related

Kubernetes Dashboard CrashLoopBackOff: timeout error on Raspberry Pi cluster

This should be a simple task: I simply want to run the Kubernetes Dashboard on a clean install of Kubernetes on a Raspberry Pi cluster.
What I've done:
Setup the initial cluster (hostname, static ip, cgroup, swapspace, install and configure docker, install kubernetes, setup kubernetes network and join nodes)
I have flannel installed
I have applied the dashboard
A bunch of random testing trying to figure this out
Obviously, as seen below, the container in the dashboard pod is not working because it cannot access kubernetes-dashboard-csrf. I have no idea why this cannot be accessed; my only thought is that I missed a step when setting up the cluster. I've followed about 6 different guides without success, prioritizing the official guide. I have also seen quite a few people having the same or similar issues, but most have not posted a resolution. Thanks!
Nodes: kubectl get nodes
NAME STATUS ROLES AGE VERSION
gus3 Ready <none> 346d v1.23.1
juliet3 Ready <none> 346d v1.23.1
shawn4 Ready <none> 346d v1.23.1
vick4 Ready control-plane,master 346d v1.23.1
All Pods: kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-74ff55c5b-7j2xg 1/1 Running 27 346d
kube-system coredns-74ff55c5b-cb2x8 1/1 Running 27 346d
kube-system etcd-vick4 1/1 Running 2 169m
kube-system kube-apiserver-vick4 1/1 Running 2 169m
kube-system kube-controller-manager-vick4 1/1 Running 2 169m
kube-system kube-flannel-ds-gclmp 1/1 Running 0 11m
kube-system kube-flannel-ds-hshjv 1/1 Running 0 12m
kube-system kube-flannel-ds-kdd4w 1/1 Running 0 11m
kube-system kube-flannel-ds-wzhkt 1/1 Running 0 10m
kube-system kube-proxy-4t25v 1/1 Running 26 346d
kube-system kube-proxy-b6vbx 1/1 Running 26 346d
kube-system kube-proxy-jgj4s 1/1 Running 27 346d
kube-system kube-proxy-n65sl 1/1 Running 26 346d
kube-system kube-scheduler-vick4 1/1 Running 2 169m
kubernetes-dashboard dashboard-metrics-scraper-5b8896d7fc-99wfk 1/1 Running 0 77m
kubernetes-dashboard kubernetes-dashboard-897c7599f-qss5p 0/1 CrashLoopBackOff 18 77m
Resources: kubectl get all -n kubernetes-dashboard
NAME READY STATUS RESTARTS AGE
pod/dashboard-metrics-scraper-5b8896d7fc-99wfk 1/1 Running 0 79m
pod/kubernetes-dashboard-897c7599f-qss5p 0/1 CrashLoopBackOff 19 79m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/dashboard-metrics-scraper ClusterIP 172.20.0.191 <none> 8000/TCP 79m
service/kubernetes-dashboard ClusterIP 172.20.0.15 <none> 443/TCP 79m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/dashboard-metrics-scraper 1/1 1 1 79m
deployment.apps/kubernetes-dashboard 0/1 1 0 79m
NAME DESIRED CURRENT READY AGE
replicaset.apps/dashboard-metrics-scraper-5b8896d7fc 1 1 1 79m
replicaset.apps/kubernetes-dashboard-897c7599f 1 1 0 79m
Notice CrashLoopBackOff
Pod Details: kubectl describe pods kubernetes-dashboard-897c7599f-qss5p -n kubernetes-dashboard
Name: kubernetes-dashboard-897c7599f-qss5p
Namespace: kubernetes-dashboard
Priority: 0
Node: shawn4/192.168.10.71
Start Time: Fri, 17 Dec 2021 18:52:15 +0000
Labels: k8s-app=kubernetes-dashboard
pod-template-hash=897c7599f
Annotations: <none>
Status: Running
IP: 172.19.1.75
IPs:
IP: 172.19.1.75
Controlled By: ReplicaSet/kubernetes-dashboard-897c7599f
Containers:
kubernetes-dashboard:
Container ID: docker://894a354e40ca1a95885e149dcd75415e0f186ead3f2e05ec0787f4b1c7a29622
Image: kubernetesui/dashboard:v2.4.0
Image ID: docker-pullable://kubernetesui/dashboard@sha256:526850ae4ea9aba360e72b6df69fd3126b129d446efe83ac5250282b85f95b7f
Port: 8443/TCP
Host Port: 0/TCP
Args:
--auto-generate-certificates
--namespace=kubernetes-dashboard
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Fri, 17 Dec 2021 20:10:19 +0000
Finished: Fri, 17 Dec 2021 20:10:49 +0000
Ready: False
Restart Count: 19
Liveness: http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/certs from kubernetes-dashboard-certs (rw)
/tmp from tmp-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kubernetes-dashboard-token-wq9m8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubernetes-dashboard-certs:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-certs
Optional: false
tmp-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kubernetes-dashboard-token-wq9m8:
Type: Secret (a volume populated by a Secret)
SecretName: kubernetes-dashboard-token-wq9m8
Optional: false
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 21s (x327 over 79m) kubelet Back-off restarting failed container
Logs: kubectl logs -f -n kubernetes-dashboard kubernetes-dashboard-897c7599f-qss5p
2021/12/17 20:10:19 Starting overwatch
2021/12/17 20:10:19 Using namespace: kubernetes-dashboard
2021/12/17 20:10:19 Using in-cluster config to connect to apiserver
2021/12/17 20:10:19 Using secret token for csrf signing
2021/12/17 20:10:19 Initializing csrf token from kubernetes-dashboard-csrf secret
panic: Get "https://172.20.0.1:443/api/v1/namespaces/kubernetes-dashboard/secrets/kubernetes-dashboard-csrf": dial tcp 172.20.0.1:443: i/o timeout
goroutine 1 [running]:
github.com/kubernetes/dashboard/src/app/backend/client/csrf.(*csrfTokenManager).init(0x400055fae8)
/home/runner/work/dashboard/dashboard/src/app/backend/client/csrf/manager.go:41 +0x350
github.com/kubernetes/dashboard/src/app/backend/client/csrf.NewCsrfTokenManager(...)
/home/runner/work/dashboard/dashboard/src/app/backend/client/csrf/manager.go:66
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).initCSRFKey(0x40001fc080)
/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:502 +0x8c
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).init(0x40001fc080)
/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:470 +0x40
github.com/kubernetes/dashboard/src/app/backend/client.NewClientManager(...)
/home/runner/work/dashboard/dashboard/src/app/backend/client/manager.go:551
main.main()
/home/runner/work/dashboard/dashboard/src/app/backend/dashboard.go:95 +0x1dc
If you need any more information please ask!
UPDATE 12/29/21:
Fixed this issue by reinstalling the cluster to the newest versions of Kubernetes and Ubuntu.
Turned out there were several issues:
I was using Ubuntu Buster which is deprecated.
My client/server Kubernetes versions were +/-0.3 out of sync
I was following outdated instructions
I reinstalled the whole cluster following the official Kubernetes guide and, with a few snags along the way, it works!
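For anyone who hits this before resorting to a reinstall: the panic above is a plain pod-to-apiserver connectivity failure, so a quick (hypothetical) check is to curl the same service IP from a throwaway pod in the same namespace, e.g.:
kubectl run -n kubernetes-dashboard curl-test --rm -it --restart=Never --image=curlimages/curl -- curl -k -m 5 https://172.20.0.1:443/version
If that also times out, the problem is in the CNI/kube-proxy layer (flannel here), not in the dashboard itself.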

A question about pods running on the Kubernetes (k8s) platform: the pods are running but the containers are not ready

I built a k8s cluster on my virtual machines (CentOS 7) with VirtualBox:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master Ready control-plane,master 8d v1.21.2 192.168.0.186 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
k8s-worker01 Ready <none> 8d v1.21.2 192.168.0.187 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
k8s-worker02 Ready <none> 8d v1.21.2 192.168.0.188 <none> CentOS Linux 7 (Core) 3.10.0-1160.31.1.el7.x86_64 docker://20.10.7
I ran some pods in the default namespace with a ReplicaSet several days ago.
They all worked fine at first, and then I shut down the VMs.
Today, after I restarted the VMs, I found that they are not working properly anymore:
kubectl get all
NAME READY STATUS RESTARTS AGE
pod/dnsutils 1/1 Running 3 5d13h
pod/kubapp-6qbfz 0/1 Running 0 5d13h
pod/kubapp-d887h 0/1 Running 0 5d13h
pod/kubapp-z6nw7 0/1 Running 0 5d13h
NAME DESIRED CURRENT READY AGE
replicaset.apps/kubapp 3 3 0 5d13h
Then I deleted the ReplicaSet and re-created it to create the pods.
I ran this command to get more information:
[root@k8s-master ch04]# kubectl describe po kubapp-z887v
Name: kubapp-d887h
Namespace: default
Priority: 0
Node: k8s-worker02/192.168.0.188
Start Time: Fri, 23 Jul 2021 15:55:16 +0000
Labels: app=kubapp
Annotations: cni.projectcalico.org/podIP: 10.244.69.244/32
cni.projectcalico.org/podIPs: 10.244.69.244/32
Status: Running
IP: 10.244.69.244
IPs:
IP: 10.244.69.244
Controlled By: ReplicaSet/kubapp
Containers:
kubapp:
Container ID: docker://fc352ce4c6a826f2cf108f9bb9a335e3572509fd5ae2002c116e2b080df5ee10
Image: evalle/kubapp
Image ID: docker-pullable://evalle/kubapp@sha256:560c9c50b1d894cf79ac472a9925dc795b116b9481ec40d142b928a0e3995f4c
Port: <none>
Host Port: <none>
State: Running
Started: Fri, 23 Jul 2021 15:55:21 +0000
Ready: False
Restart Count: 0
Readiness: exec [ls /var/ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m9rwr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-m9rwr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30m default-scheduler Successfully assigned default/kubapp-d887h to k8s-worker02
Normal Pulling 30m kubelet Pulling image "evalle/kubapp"
Normal Pulled 30m kubelet Successfully pulled image "evalle/kubapp" in 4.049160061s
Normal Created 30m kubelet Created container kubapp
Normal Started 30m kubelet Started container kubapp
Warning Unhealthy 11s (x182 over 30m) kubelet Readiness probe failed: ls: cannot access /var/ready: No such file or directory
I don't know why this happens or how I should fix it.
So here I am, asking you guys for help.
I am a k8s newbie, please give me a hand.
Thanks to paul-becotte for the help and recommendation. I think I should post the definition of the ReplicaSet that creates the pods:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
# here is the name of the replication controller (RC)
name: kubapp
spec:
replicas: 3
# what pods the RC is operating on
selector:
matchLabels:
app: kubapp
# the pod template for creating new pods
template:
metadata:
labels:
app: kubapp
spec:
containers:
- name: kubapp
image: evalle/kubapp
readinessProbe:
exec:
command:
- ls
- /var/ready
There is an example YAML definition at https://github.com/Evalle/k8s-in-action/blob/master/Chapter_4/kubapp-rs.yaml.
I don't know where to find the Dockerfile of the image evalle/kubapp,
and I don't know whether it ever creates /var/ready.
Look at your event
Warning Unhealthy 11s (x182 over 30m) kubelet Readiness probe failed: ls: cannot access /var/ready: No such file or directory
Your readiness probe is failing: it looks like it is checking for the existence of a file at /var/ready.
Your next step is to ask: does that make sense? Is my container actually going to write a file at /var/ready when it's ready? If so, look at the logs from your pod and figure out why it's not writing the file. If it's NOT the correct check, look at the YAML you used to create your pod/deployment/replicaset and replace that check with something that does make sense. For example, see the sketch below.
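If /var/ready is meant to be created by hand (a pattern some tutorials use to toggle readiness manually), you could mark one of the pods ready with something like:
kubectl exec kubapp-d887h -- touch /var/ready
kubectl get pods -w
After the next probe run (every 10s here) that pod should flip to READY 1/1. Otherwise, adjust or remove the readinessProbe block in the ReplicaSet spec shown above so it reflects what the container actually does.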

k8s pod stuck in status "pending"

All new containers are stuck in status "pending". It does not seem to be a resource issue, since the total cluster utilization is about 10% cpu, 30% memory.
How do I get more insights into the issue?
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
cq-iam-boarding-77fd94dc94-8pc6f 1/1 Running 0 30h
cq-iam-demo-cloud-6b99f6544d-9v7j7 1/1 Running 0 30h
cq-iam-mpm-dev-8c6cc58fd-fczlw 1/1 Running 0 30h
cq-iam-proxy-86854cc78d-49gfw 0/1 Terminating 0 7h42m
cq-iam-proxy-86854cc78d-dqlz8 0/1 Terminating 0 7h36m
cq-iam-proxy-86854cc78d-m7zs2 0/1 Pending 0 5h22m
cq-launchpad-app-7b57c478b9-gqcxj 1/1 Running 0 13h
cq-management-api-7c689c7846-q9fz2 1/1 Running 0 29h
cq-opa-api-8458db697c-75rzd 1/1 Running 0 30h
cq-settings-app-6874885794-mspj9 1/1 Running 0 29h
node-debugger-aks-nodepool1-31127038-vmss000000-czt8s 0/1 Pending 0 8h
$ kubectl top pods
NAME CPU(cores) MEMORY(bytes)
cq-iam-boarding-77fd94dc94-8pc6f 2m 482Mi
cq-iam-demo-cloud-6b99f6544d-9v7j7 2m 507Mi
cq-iam-mpm-dev-8c6cc58fd-fczlw 2m 443Mi
cq-launchpad-app-7b57c478b9-gqcxj 0m 2Mi
cq-management-api-7c689c7846-q9fz2 1m 88Mi
cq-opa-api-8458db697c-75rzd 1m 17Mi
cq-settings-app-6874885794-mspj9 1m 2Mi
$ kubectl describe pod cq-iam-proxy-86854cc78d-m7zs2
Name: cq-iam-proxy-86854cc78d-m7zs2
Namespace: dev
Priority: 0
Node: aks-nodepool1-31127038-vmss000000/
Labels: app=cq-iam-proxy
pod-template-hash=86854cc78d
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/cq-iam-proxy-86854cc78d
Containers:
cq-iam-proxy:
Image: xxx.azurecr.io/karneval/cq-iam-proxy:1.0.14
Port: 80/TCP
Host Port: 0/TCP
Environment:
CQ_HOSTNAME: dev.hvt.zone
key1: TODO
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-pl6p4 (ro)
Conditions:
Type Status
PodScheduled True
Volumes:
default-token-pl6p4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-pl6p4
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Check the status of nodepool1:
nodepool is all good and running
there are three nodes which are all green (memory, disk, readiness)
Can you show the logs of the pod?
This is what I get when I print the pod logs:
$ kubectl logs cq-iam-proxy-86854cc78d-m7zs2
Error from server (NotFound): the server could not find the requested resource ( pods/log cq-iam-proxy-86854cc78d-m7zs2)
Please include the events of pods in Terminating status. There may be a clue there:
$ kubectl describe pod cq-iam-proxy-86854cc78d-49gfw
Name: cq-iam-proxy-86854cc78d-49gfw
Namespace: dev
Priority: 0
Node: aks-nodepool1-31127038-vmss000000/
Labels: app=cq-iam-proxy
pod-template-hash=86854cc78d
Annotations: <none>
Status: Terminating (lasts 2d18h)
Termination Grace Period: 30s
IP:
IPs: <none>
Controlled By: ReplicaSet/cq-iam-proxy-86854cc78d
Containers:
cq-iam-proxy:
Image: xxx.azurecr.io/karneval/cq-iam-proxy:1.0.14
Port: 80/TCP
Host Port: 0/TCP
Environment:
CQ_HOSTNAME: dev.hvt.zone
key1: TODO
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-pl6p4 (ro)
Conditions:
Type Status
PodScheduled True
Volumes:
default-token-pl6p4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-pl6p4
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
There are no events there? Is there anything in the logs of those two pods?
$ kubectl logs cq-iam-proxy-86854cc78d-dqlz8
Error from server (NotFound): the server could not find the requested resource ( pods/log cq-iam-proxy-86854cc78d-dqlz8)
This seems like a problem with the application itself.
It does not seem to be a problem with the application itself. I ran these two commands:
$ kubectl run --image=busybox myapp -- false
$ kubectl run --image=busybox myapp2 -- false
myapp was able to start
myapp2 is in pending mode (same as the other applications)
myapp 0/1 CrashLoopBackOff 5 11m
myapp2 0/1 Pending 0 9m26s
$ kubectl describe pod myapp
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned dev/myapp to aks-nodepool1-31127038-vmss000001
Normal Created 11m (x4 over 11m) kubelet Created container myapp
Normal Started 11m (x4 over 11m) kubelet Started container myapp
Normal Pulling 10m (x5 over 11m) kubelet Pulling image "busybox"
Normal Pulled 10m (x5 over 11m) kubelet Successfully pulled image "busybox"
Warning BackOff 95s (x47 over 11m) kubelet Back-off restarting failed container
$ kubectl describe pod myapp2
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned dev/myapp2 to aks-nodepool1-31127038-vmss000000
The only difference between myapp and myapp2 is that they have been scheduled on different nodes:
myapp was successfully started on node aks-nodepool1-31127038-vmss000001
myapp2 does not start on node aks-nodepool1-31127038-vmss000000
After two weeks the cluster healed itself.
The node aks-nodepool1-31127038-vmss000000 was problematic and would get stuck starting containers.
Next time I encounter this problem I will play with these commands to heal the node:
kubectl cordon my-node # Mark my-node as unschedulable
kubectl drain my-node # Drain my-node in preparation for maintenance
kubectl uncordon my-node # Mark my-node as schedulable
kubectl top node my-node # Show metrics for a given node
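A sketch of how I would apply them to the node that was stuck (the drain flags here are the common ones; adjust to your kubectl version):
kubectl cordon aks-nodepool1-31127038-vmss000000
kubectl drain aks-nodepool1-31127038-vmss000000 --ignore-daemonsets --delete-emptydir-data
# reboot or reimage the node (on AKS, e.g. through the VMSS), then:
kubectl uncordon aks-nodepool1-31127038-vmss000000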

Argo workflow stuck in pending due to liveness probe fail?

I am trying to set up a Hyperledger Fabric network on Kubernetes by using this.
I am at the step where I am trying to create channels. I run the command argo submit output.yaml -v where output.yaml is the output of the command helm template channel-flow/ -f samples/simple/network.yaml -f samples/simple/crypto-config.yaml but with spec.securityContext added as follows:
...
spec:
securityContext:
runAsNonRoot: true
#runAsUser: 8737 (I commented this out because I don't know my user ID; not sure if this could cause a problem)
entrypoint: channels
...
My Argo workflow ends up stuck in the Pending state. I say this because I checked my orderer and peer logs and see no activity in them.
I referenced Argo sample workflows stuck in the pending state and started by getting the Argo logs:
[user@vmmock3 fabric-kube]$ kubectl logs -n argo -l app=workflow-controller
time="2021-05-31T05:02:41.145Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:41.150Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:46.162Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:46.168Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:51.179Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:51.185Z" level=info msg="Update leases 200"
time="2021-05-31T05:02:56.193Z" level=info msg="Get leases 200"
time="2021-05-31T05:02:56.199Z" level=info msg="Update leases 200"
time="2021-05-31T05:03:01.213Z" level=info msg="Get leases 200"
time="2021-05-31T05:03:01.219Z" level=info msg="Update leases 200"
I tried describing the workflow-controller pod:
[user@vmmock3 fabric-kube]$ kubectl -n argo describe pod workflow-controller-57fcfb5df8-qvn74
Name: workflow-controller-57fcfb5df8-qvn74
Namespace: argo
Priority: 0
Node: hlf-pool1-8rnem/10.104.0.8
Start Time: Tue, 25 May 2021 13:44:56 +0800
Labels: app=workflow-controller
pod-template-hash=57fcfb5df8
Annotations: <none>
Status: Running
IP: 10.244.0.158
IPs:
IP: 10.244.0.158
Controlled By: ReplicaSet/workflow-controller-57fcfb5df8
Containers:
workflow-controller:
Container ID: containerd://78c7f8dcb0f3a3b861293559ae0a11b92ce6843065e6f9459556a6b7099c8961
Image: argoproj/workflow-controller:v3.0.5
Image ID: docker.io/argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v3.0.5
--namespaced
State: Running
Started: Mon, 31 May 2021 13:08:11 +0800
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Mon, 31 May 2021 12:59:05 +0800
Finished: Mon, 31 May 2021 13:03:04 +0800
Ready: True
Restart Count: 1333
Liveness: http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
Environment:
LEADER_ELECTION_IDENTITY: workflow-controller-57fcfb5df8-qvn74 (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from argo-token-hflpb (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
argo-token-hflpb:
Type: Secret (a volume populated by a Secret)
SecretName: argo-token-hflpb
Optional: false
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 7m44s (x3994 over 5d23h) kubelet Liveness probe failed: Get "http://10.244.0.158:6060/healthz": dial tcp 10.244.0.158:6060: connect: connection refused
Warning BackOff 3m46s (x16075 over 5d22h) kubelet Back-off restarting failed container
Could this failure be why my argo workflow is stuck in the pending state? How should I go about troubleshooting this?
EDIT: Output of kubectl get pods --all-namespaces (FYI these are being run on Digital Ocean):
[user@vmmock3 fabric-kube]$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
argo argo-server-5695555c55-867bx 1/1 Running 1 6d19h
argo minio-58977b4b48-r2m2h 1/1 Running 0 6d19h
argo postgres-6b5c55f477-7swpp 1/1 Running 0 6d19h
argo workflow-controller-57fcfb5df8-qvn74 0/1 CrashLoopBackOff 1522 6d19h
default hlf-ca--atlantis-58bbd79d9d-x4mz4 1/1 Running 0 21h
default hlf-ca--karga-547dbfddc8-7w6b5 1/1 Running 0 21h
default hlf-ca--nevergreen-7ffb98484c-nlg4j 1/1 Running 0 21h
default hlf-orderer--groeifabriek--orderer0-0 1/1 Running 0 21h
default hlf-peer--atlantis--peer0-0 2/2 Running 0 21h
default hlf-peer--karga--peer0-0 2/2 Running 0 21h
default hlf-peer--nevergreen--peer0-0 2/2 Running 0 21h
kube-system cilium-2kjfz 1/1 Running 3 26d
kube-system cilium-operator-84bdd6f7b6-kp9vb 1/1 Running 1 6d20h
kube-system cilium-operator-84bdd6f7b6-pkkf9 1/1 Running 1 6d20h
kube-system coredns-55ff57f948-jb5jc 1/1 Running 0 6d20h
kube-system coredns-55ff57f948-r2q4g 1/1 Running 0 6d20h
kube-system csi-do-node-4r9gj 2/2 Running 0 26d
kube-system do-node-agent-sbc8b 1/1 Running 0 26d
kube-system kube-proxy-hpsc7 1/1 Running 0 26d
I will answer your question partially; I'm not promising everything else will work fine, but I know how to fix the issue with the Argo workflow-controller pod.
Answer
In short, you need to update Argo Workflows to a newer version (at least 3.0.6, ideally 3.0.7, which is available), because this looks like a bug in version 3.0.5.
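A minimal sketch of what that upgrade could look like for a namespaced install like yours (the release URL is an assumption; pick whatever patch version is current):
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.0.7/namespace-install.yaml
kubectl -n argo rollout status deploy/workflow-controller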
How I got there
First I installed Argo version 3.0.5 (which is not production ready).
I ended up with workflow-controller pod restarts:
kubectl get pods -n argo
NAME READY STATUS RESTARTS AGE
argo-server-645cf8bc47-sbnqv 1/1 Running 0 9m7s
workflow-controller-768565d958-9lftf 1/1 Running 2 9m7s
curl-pod 1/1 Running 0 6m47s
And saw the same liveness probe failure:
kubectl describe pod workflow-controller-768565d958-9lftf -n argo
Name: workflow-controller-768565d958-9lftf
Namespace: argo
Priority: 0
Node: worker1/10.186.0.3
Start Time: Tue, 01 Jun 2021 14:25:00 +0000
Labels: app=workflow-controller
pod-template-hash=768565d958
Annotations: <none>
Status: Running
IP: 10.244.1.151
IPs:
IP: 10.244.1.151
Controlled By: ReplicaSet/workflow-controller-768565d958
Containers:
workflow-controller:
Container ID: docker://4b797b57ae762f9fc3f7acdd890d25434a8d9f6f165bbb7a7bda35745b5f4092
Image: argoproj/workflow-controller:v3.0.5
Image ID: docker-pullable://argoproj/workflow-controller@sha256:740dca63b11168490d9cc7b2d1b08c1364f4a4064e1d9b7a778ca2ab12a63158
Ports: 9090/TCP, 6060/TCP
Host Ports: 0/TCP, 0/TCP
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v3.0.5
State: Running
Started: Tue, 01 Jun 2021 14:33:00 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Tue, 01 Jun 2021 14:29:00 +0000
Finished: Tue, 01 Jun 2021 14:33:00 +0000
Ready: True
Restart Count: 2
Liveness: http-get http://:6060/healthz delay=90s timeout=1s period=60s #success=1 #failure=3
Environment:
LEADER_ELECTION_IDENTITY: workflow-controller-768565d958-9lftf (v1:metadata.name)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ts9zf (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-ts9zf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m57s default-scheduler Successfully assigned argo/workflow-controller-768565d958-9lftf to worker1
Normal Pulled 57s (x3 over 8m56s) kubelet Container image "argoproj/workflow-controller:v3.0.5" already present on machine
Normal Created 57s (x3 over 8m56s) kubelet Created container workflow-controller
Normal Started 57s (x3 over 8m56s) kubelet Started container workflow-controller
Warning Unhealthy 57s (x6 over 6m57s) kubelet Liveness probe failed: Get "http://10.244.1.151:6060/healthz": dial tcp 10.244.1.151:6060: connect: connection refused
Normal Killing 57s (x2 over 4m57s) kubelet Container workflow-controller failed liveness probe, will be restarted
I also tested this endpoint from a pod in the same namespace based on the curlimages/curl image, which has curl built in.
Here's the pod.yaml:
apiVersion: v1
kind: Pod
metadata:
namespace: argo
labels:
app: curl
name: curl-pod
spec:
containers:
- image: curlimages/curl
name: curl-pod
command: ['sh', '-c', 'while true; do sleep 30; done']
dnsPolicy: ClusterFirst
restartPolicy: Always
kubectl exec -it curl-pod -n argo -- curl http://10.244.1.151:6060/healthz
Which resulted in the same error:
curl: (7) Failed to connect to 10.244.1.151 port 6060: Connection refused
The next step was trying a newer version (3.1.0-rc and then 3.0.7), and it succeeded!
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 27m default-scheduler Successfully assigned argo/workflow-controller-74b4b5455d-skb2f to worker1
Normal Pulling 27m kubelet Pulling image "argoproj/workflow-controller:v3.0.7"
Normal Pulled 27m kubelet Successfully pulled image "argoproj/workflow-controller:v3.0.7" in 15.728042003s
Normal Created 27m kubelet Created container workflow-controller
Normal Started 27m kubelet Started container workflow-controller
And checked it with curl:
kubectl exec -it curl-pod -n argo -- curl 10.244.1.169:6060/healthz
ok
The problem is due to the usage of an old version of the workflow-controller. If you are following the docs, they download an old version of the workflow-controller, which ends up causing a lot of issues. Use the commands below instead,
or find the latest release here:
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.3.6/install.yaml
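After applying the manifest, you can verify the controller is healthy before resubmitting workflows, for example:
kubectl -n argo get pods
kubectl -n argo logs deploy/workflow-controller --tail=20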

Unable to setup Istio with minikube

I followed Istio's official documentation to set up Istio for the sample bookinfo app with minikube, but I'm getting the Unable to connect to the server: net/http: TLS handshake timeout error. These are the steps that I have followed (I have kubectl & minikube installed).
minikube start
curl -L https://git.io/getLatestIstio | sh -
cd istio-1.0.3
export PATH=$PWD/bin:$PATH
kubectl apply -f install/kubernetes/helm/istio/templates/crds.yaml
kubectl apply -f install/kubernetes/istio-demo-auth.yaml
kubectl get pods -n istio-system
This is the terminal output I'm getting
$ kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
grafana-9cfc9d4c9-xg7bh 1/1 Running 0 4m
istio-citadel-6d7f9c545b-lwq8s 1/1 Running 0 3m
istio-cleanup-secrets-69hdj 0/1 Completed 0 4m
istio-egressgateway-75dbb8f95d-k6xj2 1/1 Running 0 4m
istio-galley-6d74549bb9-mdc97 0/1 ContainerCreating 0 4m
istio-grafana-post-install-xz9rk 0/1 Completed 0 4m
istio-ingressgateway-6bd4957bc-vhbct 1/1 Running 0 4m
istio-pilot-7f8c49bbd8-x6bmm 0/2 Pending 0 4m
istio-policy-6c65d8cff4-hx2c7 2/2 Running 0 4m
istio-security-post-install-gjfj2 0/1 Completed 0 4m
istio-sidecar-injector-74855c54b9-nnqgx 0/1 ContainerCreating 0 3m
istio-telemetry-65cdd46d6c-rqzfw 2/2 Running 0 4m
istio-tracing-ff94688bb-hgz4h 1/1 Running 0 3m
prometheus-f556886b8-chdxw 1/1 Running 0 4m
servicegraph-778f94d6f8-9xgw5 1/1 Running 0 3m
$ kubectl describe pod istio-galley-6d74549bb9-mdc97
Error from server (NotFound): pods "istio-galley-5bf4d6b8f7-8s2z9" not found
Pod describe output:
$ kubectl -n istio-system describe pod istio-galley-6d74549bb9-mdc97
Name: istio-galley-6d74549bb9-mdc97
Namespace: istio-system
Node: minikube/172.17.0.4
Start Time: Sat, 03 Nov 2018 04:29:57 +0000
Labels: istio=galley
pod-template-hash=1690826493
Annotations: scheduler.alpha.kubernetes.io/critical-pod=
sidecar.istio.io/inject=false
Status: Pending
IP:
Controlled By: ReplicaSet/istio-galley-5bf4d6b8f7
Containers:
validator:
Container ID:
Image: gcr.io/istio-release/galley:1.0.0
Image ID:
Ports: 443/TCP, 9093/TCP
Host Ports: 0/TCP, 0/TCP
Command:
/usr/local/bin/galley
validator
--deployment-namespace=istio-system
--caCertFile=/etc/istio/certs/root-cert.pem
--tlsCertFile=/etc/istio/certs/cert-chain.pem
--tlsKeyFile=/etc/istio/certs/key.pem
--healthCheckInterval=2s
--healthCheckFile=/health
--webhook-config-file
/etc/istio/config/validatingwebhookconfiguration.yaml
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 10m
Liveness: exec [/usr/local/bin/galley probe --probe-path=/health --interval=4s] delay=4s timeout=1s period=4s #success=1 #failure=3
Readiness: exec [/usr/local/bin/galley probe --probe-path=/health --interval=4s] delay=4s timeout=1s period=4s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/istio/certs from certs (ro)
/etc/istio/config from config (ro)
/var/run/secrets/kubernetes.io/serviceaccount from istio-galley-service-account-token-9pcmv(ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
certs:
Type: Secret (a volume populated by a Secret)
SecretName: istio.istio-galley-service-account
Optional: false
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-galley-configuration
Optional: false
istio-galley-service-account-token-9pcmv:
Type: Secret (a volume populated by a Secret)
SecretName: istio-galley-service-account-token-9pcmv
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned istio-galley-5bf4d6b8f7-8t8qz to minikube
Normal SuccessfulMountVolume 1m kubelet, minikube MountVolume.SetUp succeeded for volume "config"
Normal SuccessfulMountVolume 1m kubelet, minikube MountVolume.SetUp succeeded for volume "istio-galley-service-account-token-9pcmv"
Warning FailedMount 27s (x7 over 1m) kubelet, minikube MountVolume.SetUp failed for volume "certs" : secrets "istio.istio-galley-service-account" not found
After some time:
$ kubectl describe pod istio-galley-6d74549bb9-mdc97
Unable to connect to the server: net/http: TLS handshake timeout
So I wait for the istio-sidecar-injector and istio-galley containers to get created, but if I run kubectl get pods -n istio-system again, or any other kubectl command, I get the Unable to connect to the server: net/http: TLS handshake timeout error.
Please help me with this issue.
PS: I'm running minikube on Ubuntu 16.04.
Thanks in advance.
Looks like you are running into a known issue: the secret istio.istio-galley-service-account is missing in your istio-system namespace. You can try the workaround as described:
Install as outlined in the docs: https://istio.io/docs/setup/kubernetes/minimal-install/. The missing secret is created by the Citadel pod, which isn't running due to the --set security.enabled=false flag; setting that to true starts Citadel and the secret is created.
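A couple of (assumed) checks to confirm this is what you are hitting: verify that Citadel is running and that the secret now exists, e.g.:
kubectl -n istio-system get pods | grep citadel
kubectl -n istio-system get secret istio.istio-galley-service-account
Once the secret exists, deleting the stuck istio-galley pod lets its replacement mount the certs volume and start.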
Problem resolved when I ran minikube start --memory=4048. Maybe it was a memory issue.
When using either istio-demo.yaml or istio-demo-auth.yaml, you'll find that a minimum of 4GB RAM is required to run Istio (particularly when you deploy its sample app, BookInfo, too). This is true whether you're running Minikube or Docker Desktop, and it is one of the gotchas that Meshery identifies and attempts to help those deploying Istio or other service meshes circumvent.
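In practice that means giving the VM at least 4GB before installing Istio, for example:
minikube start --memory=4096 --cpus=4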