kube-scheduler and kube-controller-manager restarting - kubernetes

I have a Kubernetes 1.15.3 setup.
My kube-controller-manager and kube-scheduler are restarting very frequently. This started happening after Kubernetes was upgraded to 1.15.3.
kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-5c98db65d4-nmt5d 1/1 Running 37 24d
coredns-5c98db65d4-tg4kx 1/1 Running 37 24d
etcd-ana01 1/1 Running 1 24d
kube-apiserver-ana01 1/1 Running 10 24d
kube-controller-manager-ana01 1/1 Running 477 9d
kube-flannel-ds-amd64-2srzb 1/1 Running 0 12d
kube-proxy-2hvcl 1/1 Running 0 23d
kube-scheduler-ana01 1/1 Running 518 9d
tiller-deploy-8557598fbc-kxntc 1/1 Running 0 11d
Here are the events for the kube-scheduler pod:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 39m (x500 over 23d) kubelet, ana01 Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
Warning BackOff 39m (x1873 over 23d) kubelet, ana01 Back-off restarting failed container
Normal Pulled 28m (x519 over 24d) kubelet, ana01 Container image "k8s.gcr.io/kube-scheduler:v1.15.3" already present on machine
Normal Created 28m (x519 over 24d) kubelet, ana01 Created container kube-scheduler
Normal Started 27m (x519 over 24d) kubelet, ana01 Started container kube-scheduler
The container logs are:
I0928 09:10:23.554335 1 serving.go:319] Generated self-signed cert in-memory
W0928 09:10:25.002268 1 authentication.go:387] failed to read in-cluster kubeconfig for delegated authentication: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0928 09:10:25.002523 1 authentication.go:249] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0928 09:10:25.002607 1 authentication.go:252] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0928 09:10:25.002947 1 authorization.go:177] failed to read in-cluster kubeconfig for delegated authorization: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0928 09:10:25.003116 1 authorization.go:146] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0928 09:10:25.021201 1 server.go:142] Version: v1.15.3
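For reference, here is a rough diagnostic sketch I can run on the master node. It assumes the default kubeadm layout (static pod manifests under /etc/kubernetes/manifests) and the insecure health port 10251 that the liveness probe above is hitting:
# Does the scheduler's health endpoint answer at all? This is the URL the liveness probe calls.
curl -v http://127.0.0.1:10251/healthz
# Logs of the previous (crashed) scheduler container, not the current one
kubectl -n kube-system logs kube-scheduler-ana01 --previous
# Liveness probe settings baked into the static pod manifest
sudo grep -A6 livenessProbe /etc/kubernetes/manifests/kube-scheduler.yaml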

Related

CoreDNS pods stuck in ContainerCreating - Kubernetes

I am still new to Kubernetes and I was trying to set up a cluster on bare-metal servers according to the official documentation.
Right now I am running a one-master, one-worker configuration, but I am struggling to get all the pods running once the cluster initializes. The main problem is the coredns pods, which are stuck in the ContainerCreating state.
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-78fcd69978-4vtsp 0/1 ContainerCreating 0 5s
kube-system coredns-78fcd69978-wtn2c 0/1 ContainerCreating 0 12h
kube-system etcd-dcpoth24213118 1/1 Running 4 12h
kube-system kube-apiserver-dcpoth24213118 1/1 Running 0 12h
kube-system kube-controller-manager-dcpoth24213118 1/1 Running 0 12h
kube-system kube-proxy-8282p 1/1 Running 0 12h
kube-system kube-scheduler-dcpoth24213118 1/1 Running 0 12h
kube-system weave-net-6zz2j 2/2 Running 0 12h
After checking the logs I noticed this error. The problem is I don't really know what the error is referring to.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19s default-scheduler Successfully assigned kube-system/coredns-78fcd69978-4vtsp to dcpoth24213118
Warning FailedCreatePodSandBox 13s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "2521c9dd723f3fc50b3510791a8c35cbc9ec19768468eb3da3367274a4dfcbba" network for pod "coredns-78fcd69978-4vtsp": networkPlugin cni failed to set up pod "coredns-78fcd69978-4vtsp_kube-system" network: error getting ClusterInformation: Get "https://[10.43.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: connect: no route to host, failed to clean up sandbox container "2521c9dd723f3fc50b3510791a8c35cbc9ec19768468eb3da3367274a4dfcbba" network for pod "coredns-78fcd69978-4vtsp": networkPlugin cni failed to teardown pod "coredns-78fcd69978-4vtsp_kube-system" network: error getting ClusterInformation: Get "https://[10.43.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.43.0.1:443: connect: no route to host]
Normal SandboxChanged 10s (x2 over 12s) kubelet Pod sandbox changed, it will be killed and re-created.
I'm running the Kubernetes cluster behind a corporate proxy. I've set the environment variables as follows:
export https_proxy=http://proxyIP:PORT
export http_proxy=http://proxyIP:PORT
export HTTP_PROXY="${http_proxy}"
export HTTPS_PROXY="${https_proxy}"
export NO_PROXY=localhost,127.0.0.1,master_node_IP,worker_node_IP,10.0.0.0/8,10.96.0.0/16
[root@dcpoth24213118 ~]# kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 12h
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 12h
[root@dcpoth24213118 ~]# ip r s
default via 6.48.248.129 dev eth1
6.48.248.128/26 dev eth1 proto kernel scope link src 6.48.248.145
10.32.0.0/12 dev weave proto kernel scope link src 10.32.0.1
10.155.0.0/24 via 6.48.248.129 dev eth1
10.228.0.0/24 via 6.48.248.129 dev eth1
10.229.0.0/24 via 6.48.248.129 dev eth1
10.250.0.0/24 via 6.48.248.129 dev eth1
I've got the Weave network plugin installed. The issue is that I cannot create any other pods; they all get stuck in the ContainerCreating state.
I've run out of ideas on how to fix it. Can someone give me a hint?
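Two things in the sandbox error above may be worth checking (only a sketch; the paths and ranges below are assumptions about my setup): the error references crd.projectcalico.org even though Weave is the CNI in use, which could point to a leftover Calico config, and it dials 10.43.0.1:443, a service IP that my NO_PROXY list does not cover.
# Any leftover CNI configs from an earlier Calico install? Weave should be the only one here.
ls -l /etc/cni/net.d/
# If the 10.43.0.0/16 range really is in use, exempt it (and the cluster domains) from the proxy
export NO_PROXY=$NO_PROXY,10.43.0.0/16,.svc,.cluster.local
# The container runtime has its own proxy settings; check whether Docker picked up NO_PROXY
sudo systemctl show docker --property=Environment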

How to fix incomplete pods when building kubeflow on kubernetes configured with rancher?

Hello, I have built Kubernetes on Rancher (single-Docker setup) and I want to install Kubeflow on top of it.
https://raw.githubusercontent.com/kubeflow/manifests/master/distributions/kfdef/kfctl_k8s_istio.v1.2.0.yaml
I imported the YAML file and installed it with kfctl. The problem is that the installation is incomplete and the main functions do not run.
taeil-kubeflow# kubectl get all -n kubeflow
NAME READY STATUS RESTARTS AGE
pod/admission-webhook-bootstrap-stateful-set-0 1/1 Running 2 32m
pod/admission-webhook-deployment-5cd7dc96f5-j8ptn 1/1 Running 0 31m
pod/application-controller-stateful-set-0 1/1 Running 0 32m
pod/argo-ui-65df8c7c84-2bfhd 1/1 Running 0 31m
pod/cache-deployer-deployment-5f4979f45-kfhfg 1/2 CrashLoopBackOff 4 31m
pod/cache-server-7859fd67f5-9lmrt 0/2 Init:0/1 0 31m
pod/centraldashboard-67767584dc-flb2n 1/1 Running 0 31m
pod/jupyter-web-app-deployment-67fb955745-49vzj 1/1 Running 0 31m
pod/katib-controller-7fcc95676b-s9lwd 1/1 Running 1 31m
pod/katib-db-manager-85db457c64-s4q5p 0/1 Error 4 31m
pod/katib-mysql-6c7f7fb869-4228x 0/1 Pending 0 31m
pod/katib-ui-65dc4cf6f5-pxs8g 1/1 Running 0 31m
pod/kfserving-controller-manager-0 2/2 Running 0 31m
pod/kubeflow-pipelines-profile-controller-797fb44db9-vstfv 1/1 Running 0 31m
pod/metacontroller-0 1/1 Running 0 32m
pod/metadata-db-6dd978c5b-qwbnz 0/1 Pending 0 31m
pod/metadata-envoy-deployment-67bd5954c-dw6rx 1/1 Running 0 31m
pod/metadata-grpc-deployment-577c67c96f-872lf 0/1 CrashLoopBackOff 1 31m
pod/metadata-writer-756dbdd478-dwwpc 2/2 Running 2 31m
pod/minio-54d995c97b-md886 0/1 Pending 0 31m
pod/ml-pipeline-7c56db5db9-856mr 1/2 CrashLoopBackOff 14 31m
pod/ml-pipeline-persistenceagent-d984c9585-248b4 2/2 Running 0 31m
pod/ml-pipeline-scheduledworkflow-5ccf4c9fcc-mv4vs 2/2 Running 0 31m
pod/ml-pipeline-ui-7ddcd74489-qv2gp 2/2 Running 0 31m
pod/ml-pipeline-viewer-crd-56c68f6c85-7rf6f 2/2 Running 3 31m
pod/ml-pipeline-visualizationserver-5b9bd8f6bf-jj5st 2/2 Running 0 31m
pod/mpi-operator-d5bfb8489-nzl2k 1/1 Running 0 31m
pod/mxnet-operator-7576d697d6-z2dc8 1/1 Running 0 31m
pod/mysql-74f8f99bc8-rpxcc 0/2 Pending 0 31m
pod/notebook-controller-deployment-5bb6bdbd6d-dq5nl 1/1 Running 0 31m
pod/profiles-deployment-56bc5d7dcb-k5cph 2/2 Running 0 31m
pod/pytorch-operator-847c8d55d8-8f79m 1/1 Running 0 31m
pod/seldon-controller-manager-6bf8b45656-jd682 1/1 Running 0 31m
pod/spark-operatorsparkoperator-fdfbfd99-6mhst 1/1 Running 0 32m
pod/spartakus-volunteer-558f8bfd47-l67zg 1/1 Running 0 31m
pod/tf-job-operator-58477797f8-qg2tl 1/1 Running 0 31m
pod/workflow-controller-64fd7cffc5-md54d 1/1 Running 0 31m
As you can see, problems occur in various pods such as cache-server, katib-mysql, metadata-db, minio, mysql, metadata-grpc, ml-pipeline, etc.
I'm guessing it's a persistent volume problem, but I don't know how to solve it specifically.
Please help me.
Below I add the describe output and logs for each affected pod.
pod : cache-server-7859fd67f5-9lmrt
status : init:0/1
describe:
Warning FailedMount 10m (x72 over 18h) kubelet Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-token kubeflow-pipelines-cachethe condition
Warning FailedMount 3m47s (x550 over 18h) kubelet MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found
log :
error: a container name must be specified for pod cache-server-7859fd67f5-9lmrt, choose one of: [server istio-proxy] or one of the init containers: [istio-init]
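(That error just means the pod has more than one container; a container name from the list in the message has to be passed with -c, for example:)
kubectl logs -n kubeflow cache-server-7859fd67f5-9lmrt -c server
kubectl logs -n kubeflow cache-server-7859fd67f5-9lmrt -c istio-proxy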
pod : katib-mysql-6c7f7fb869-4228x
status : Pending 0/1
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18h default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
pod : metadata-db-6dd978c5b-qwbnz
status : Pending 0/1
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18h default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
pod : minio-54d995c97b-md886
status : Pending 0/1
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18h default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
pod : mysql-74f8f99bc8-rpxcc
status : Pending 0/2
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18h default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
pod : cache-deployer-deployment-5f4979f45-kfhfg
status : crashloopbackoff 1/2
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 36m (x213 over 18h) kubelet Pulling image "gcr.io/ml-pipeline/cache-deployer:1.0.4"
Warning BackOff 68s (x4916 over 18h) kubelet Back-off restarting failed container
log :
error: a container name must be specified for pod cache-deployer-deployment-5f4979f45-kfhfg, choose one of: [main istio-proxy] or one of the init containers: [istio-init]
pod : katib-db-manager-85db457c64-s4q5p
status : crashloopbackoff 0/1
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 7m16s (x4260 over 18h) kubelet Back-off restarting failed container
Warning Unhealthy 2m13s (x1190 over 18h) kubelet Readiness probe failed:
log :
E0622 02:00:41.686168 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:00:46.674159 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:00:51.666117 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:00:56.690171 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:01.682194 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:06.674132 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:11.666146 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:16.690230 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:21.686129 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:26.674431 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:31.670133 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
E0622 02:01:36.690492 1 mysql.go:62] Ping to Katib db failed: dial tcp 10.43.25.138:3306: connect: connection refused
F0622 02:01:36.690581 1 main.go:83] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
goroutine 1 [running]:
github.com/kubeflow/katib/vendor/k8s.io/klog.stacks(0xc000216200, 0xc000230000, 0x89, 0xd0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:830 +0xb9
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).output(0xc72b20, 0xc000000003, 0xc00022a000, 0xc14079, 0x7, 0x53, 0x0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:781 +0x2da
github.com/kubeflow/katib/vendor/k8s.io/klog.(*loggingT).printf(0xc72b20, 0x3, 0x92bfcd, 0x20, 0xc0001edf48, 0x1, 0x1)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:678 +0x153
github.com/kubeflow/katib/vendor/k8s.io/klog.Fatalf(...)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/klog/klog.go:1209
main.main()
/go/src/github.com/kubeflow/katib/cmd/db-manager/v1beta1/main.go:83 +0x166
pod : metadata-grpc-deployment-577c67c96f-872lf
status : crashloopbackoff 0/1 or Error
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m46s (x5136 over 18h) kubelet Back-off restarting failed container
log :
2021-06-22 02:03:14.130165: F ml_metadata/metadata_store/metadata_store_server_main.cc:219] Non-OK-status: status status: Internal: mysql_real_connect failed: errno: 2002, error: Can't connect to MySQL server on 'metadata-db' (115)MetadataStore cannot be created with the given connection config.
pod : metadata-writer-756dbdd478-dwwpc
status : crashloopbackoff 1/2
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 6s (x3807 over 18h) kubelet Back-off restarting failed container
log :
error: a container name must be specified for pod metadata-writer-756dbdd478-dwwpc, choose one of: [main istio-proxy] or one of the init containers: [istio-init]
pod : ml-pipeline-7c56db5db9-856mr
status : crashloopbackoff 1/2
describe :
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 56m (x324 over 18h) kubelet Container image "gcr.io/ml-pipeline/api-server:1.0.4" already present on machine
Warning BackOff 6m1s (x4072 over 18h) kubelet Back-off restarting failed container
Warning Unhealthy 67s (x2939 over 18h) kubelet Readiness probe failed:
log :
error: a container name must be specified for pod ml-pipeline-7c56db5db9-856mr, choose one of: [ml-pipeline-api-server istio-proxy] or one of the init containers: [istio-init]
I found out there is a problem with dynamic volume provisioning, but I can't solve it. I tried to configure an NFS server and client, but it doesn't work.
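For reference, this is only a sketch of a static PersistentVolume that might let one of the pending claims bind on a single-node test setup; the name, size, and host path are assumptions and have to match what the pending PVC (for example katib-mysql) actually requests:
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql-pv              # hypothetical name
spec:
  capacity:
    storage: 10Gi                   # must be at least the PVC request
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data/katib-mysql     # assumed directory on the single node
EOF
# Check whether the claim bound and which ones are still pending
kubectl get pv
kubectl get pvc -n kubeflow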

Crashloopbackoff while creating nginx controller

I have installed Kubernetes on AWS EC2 machines; the cluster has a master and two nodes connected to it.
[root@k8-m deployments]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8-m Ready control-plane,master 107m v1.20.1
k8-n1 Ready <none> 101m v1.20.1
k8-n2 Ready <none> 91m v1.20.1
I have a requirement to install an ingress controller for exposing traffic outside the cluster, and the chosen controller is nginx. I am creating the resources (namespace, service account, secret, RBAC, config map, ap-rbac, daemon set) from the configs in https://github.com/nginxinc/kubernetes-ingress.git.
After creating the resources for the ingress controller, I see the pods going into the CrashLoopBackOff state:
[root@k8-m deployments]# kubectl get all -n nginx-ingress
NAME READY STATUS RESTARTS AGE
pod/nginx-ingress-555f75f85f-5vxf6 0/1 CrashLoopBackOff 7 11m
pod/nginx-ingress-7wmhw 0/1 CrashLoopBackOff 7 11m
pod/nginx-ingress-mss7v 0/1 CrashLoopBackOff 7 11m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/nginx-ingress 2 2 0 2 0 <none> 11m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/nginx-ingress 0/1 1 0 11m
NAME DESIRED CURRENT READY AGE
replicaset.apps/nginx-ingress-555f75f85f 1 1 0 11m
Describing the pod gives the following (pasting only the event details):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 14m default-scheduler Successfully assigned nginx-ingress/nginx-ingress-555f75f85f-5vxf6 to k8-n2
Normal Pulled 14m kubelet Successfully pulled image "nginx/nginx-ingress:edge" in 2.456877779s
Normal Pulled 14m kubelet Successfully pulled image "nginx/nginx-ingress:edge" in 2.501405255s
Normal Pulled 13m kubelet Successfully pulled image "nginx/nginx-ingress:edge" in 2.63456627s
Normal Created 13m (x4 over 14m) kubelet Created container nginx-ingress
Normal Started 13m (x4 over 14m) kubelet Started container nginx-ingress
Normal Pulled 13m kubelet Successfully pulled image "nginx/nginx-ingress:edge" in 2.659821346s
Normal Pulling 12m (x5 over 14m) kubelet Pulling image "nginx/nginx-ingress:edge"
Warning BackOff 3m53s (x47 over 14m) kubelet Back-off restarting failed container
I am not able to see the logs, though.
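One thing that sometimes still works when a container restarts too quickly is pulling the log of the previous (crashed) instance, for example:
kubectl logs -n nginx-ingress nginx-ingress-555f75f85f-5vxf6 --previous
kubectl describe pod -n nginx-ingress nginx-ingress-555f75f85f-5vxf6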
Below are the commands I executed while creating the nginx controller:
kubectl create -f common/ns-and-sa.yaml
kubectl create -f rbac/rbac.yaml
kubectl create -f rbac/ap-rbac.yaml
kubectl create -f common/default-server-secret.yaml
kubectl create -f common/nginx-config.yaml
kubectl create -f deployment/nginx-ingress.yaml
kubectl create -f daemon-set/nginx-ingress.yaml
Could anyone here advise me on this?

Add a node to cluster with Flannel : "cannot join network of a non running container"

I am adding a node to a Kubernetes cluster that uses Flannel. Here are the nodes in my cluster:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
jetson-80 NotReady <none> 167m v1.15.0
p4 Ready master 18d v1.15.0
This machine is reachable over the same network. When joining the cluster, Kubernetes pulls some images, among others k8s.gcr.io/pause:3.1, but for some reason it fails to pull them:
Warning FailedCreatePodSandBox 15d
kubelet,jetson-81 Failed create pod sandbox: rpc error: code = Unknown desc = failed pulling image "k8s.gcr.io/pause:3.1": Error response from daemon: Get https://k8s.gcr.io/v2/: read tcp 192.168.8.81:58820->108.177.126.82:443: read: connection reset by peer
The machine is connected to the internet, but only the wget command works, not ping.
I tried to pull the images elsewhere and copy them to the machine:
REPOSITORY TAG IMAGE ID CREATED SIZE
k8s.gcr.io/kube-proxy v1.15.0 d235b23c3570 2 months ago 82.4MB
quay.io/coreos/flannel v0.11.0-arm64 32ffa9fadfd7 6 months ago 53.5MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 20 months ago 742kB
Here is the list of pods on the master:
NAME READY STATUS RESTARTS AGE
coredns-5c98db65d4-gmsz7 1/1 Running 0 2d22h
coredns-5c98db65d4-j6gz5 1/1 Running 0 2d22h
etcd-p4 1/1 Running 0 2d22h
kube-apiserver-p4 1/1 Running 0 2d22h
kube-controller-manager-p4 1/1 Running 0 2d22h
kube-flannel-ds-amd64-cq7kz 1/1 Running 9 17d
kube-flannel-ds-arm64-4s7kk 0/1 Init:CrashLoopBackOff 0 2m8s
kube-proxy-l2slz 0/1 CrashLoopBackOff 4 2m8s
kube-proxy-q6db8 1/1 Running 0 2d22h
kube-scheduler-p4 1/1 Running 0 2d22h
tiller-deploy-5d6cc99fc-rwdrl 1/1 Running 1 17d
but it didn't work either. When I check the associated flannel pod kube-flannel-ds-arm64-4s7kk, I see:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 66s default-scheduler Successfully assigned kube-system/kube-flannel-ds-arm64-4s7kk to jetson-80
Warning Failed <invalid> kubelet, jetson-80 Error: failed to start container "install-cni": Error response from daemon: cannot join network of a non running container: 68ffc44cf8cd655234691b0362615f97c59d285bec790af40f890510f27ba298
Warning Failed <invalid> kubelet, jetson-80 Error: failed to start container "install-cni": Error response from daemon: cannot join network of a non running container: a196d8540b68dc7fcd97b0cda1e2f3183d1410598b6151c191b43602ac2faf8e
Warning Failed <invalid> kubelet, jetson-80 Error: failed to start container "install-cni": Error response from daemon: cannot join network of a non running container: 9d05d1fcb54f5388ca7e64d1b6627b05d52aea270114b5a418e8911650893bc6
Warning Failed <invalid> kubelet, jetson-80 Error: failed to start container "install-cni": Error response from daemon: cannot join network of a non running container: 5b730961cddf5cc3fb2af564b1abb46b086073d562bb2023018cd66fc5e96ce7
Normal Created <invalid> (x5 over <invalid>) kubelet, jetson-80 Created container install-cni
Warning Failed <invalid> kubelet, jetson-80 Error: failed to start container "install-cni": Error response from daemon: cannot join network of a non running container: 1767e9eb9198969329eaa14a71a110212d6622a8b9844137ac5b247cb9e90292
Normal SandboxChanged <invalid> (x5 over <invalid>) kubelet, jetson-80 Pod sandbox changed, it will be killed and re-created.
Warning BackOff <invalid> (x4 over <invalid>) kubelet, jetson-80 Back-off restarting failed container
Normal Pulled <invalid> (x6 over <invalid>) kubelet, jetson-80 Container image "quay.io/coreos/flannel:v0.11.0-arm64" already present on machine
I still can't identify if it's a Kubernetes or Flannel issue and haven't been able to solve it despite multiple attempts. Please let me know if you need me to share more details
EDIT:
Using kubectl describe pod -n kube-system kube-proxy-l2slz:
Normal Pulled <invalid> (x67 over <invalid>) kubelet, ahold-jetson-80 Container image "k8s.gcr.io/kube-proxy:v1.15.0" already present on machine
Normal SandboxChanged <invalid> (x6910 over <invalid>) kubelet, ahold-jetson-80 Pod sandbox changed, it will be killed and re-created.
Warning FailedSync <invalid> (x77 over <invalid>) kubelet, ahold-jetson-80 (combined from similar events): error determining status: rpc error: code = Unknown desc = Error: No such container: 03e7ee861f8f63261ff9289ed2d73ea5fec516068daa0f1fe2e4fd50ca42ad12
Warning BackOff <invalid> (x8437 over <invalid>) kubelet, ahold-jetson-80 Back-off restarting failed container
Your problem may be caused by the multiple sandbox containers on your node. Try restarting the kubelet:
$ systemctl restart kubelet
Check that you have generated a public key and copied it to the right node so the nodes can connect to each other (ssh-keygen).
Please make sure the firewall/security groups allow traffic on UDP port 58820.
Look at the flannel logs and see if there are any errors there but also look for "Subnet added: " messages. Each node should have added the other two subnets.
While running ping, try to use tcpdump to see where the packets get dropped.
Try src flannel0 (icmp), src host interface (udp port 58820), dest host interface (udp port 58820), dest flannel0 (icmp), docker0 (icmp).
Here is useful documentation: flannel-documentation.
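A rough sketch of those captures (eth0 as the host interface is an assumption; if flannel runs the VXLAN backend the port is 8472 instead of 58820, and the device is flannel.1 rather than flannel0):
# On the source node: does the ICMP packet enter the flannel device?
sudo tcpdump -ni flannel0 icmp
# On both hosts: does the encapsulated traffic cross the host interfaces?
sudo tcpdump -ni eth0 udp port 58820
# On the destination node: does it come back out on flannel0 / docker0?
sudo tcpdump -ni flannel0 icmp
sudo tcpdump -ni docker0 icmp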

Tiller pod crashes after Vagrant VM is powered off

I have set up a Vagrant VM, and installed Kubernetes and Helm.
vagrant@vagrant:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.7", GitCommit:"dd5e1a2978fd0b97d9b78e1564398aeea7e7fe92", GitTreeState:"clean", BuildDate:"2018-04-19T00:05:56Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.8", GitCommit:"c138b85178156011dc934c2c9f4837476876fb07", GitTreeState:"clean", BuildDate:"2018-05-21T18:53:18Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
vagrant@vagrant:~$ helm version
Client: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
After the first vagrant up that creates the VM, Tiller has no issues.
I power-off the VM with vagrant halt and reactivate it with vagrant up. Then Tiller starts to misbehave.
It has a lot of restarts and at some point it enters a CrashLoopBackOff state.
etcd-vagrant 1/1 Running 2 1h
heapster-5449cf95bd-h9xk8 1/1 Running 2 1h
kube-apiserver-vagrant 1/1 Running 2 1h
kube-controller-manager-vagrant 1/1 Running 2 1h
kube-dns-6f4fd4bdf-xclbb 3/3 Running 6 1h
kube-proxy-8n8tc 1/1 Running 2 1h
kube-scheduler-vagrant 1/1 Running 2 1h
kubernetes-dashboard-5bd6f767c7-lrdjp 1/1 Running 3 1h
tiller-deploy-78f96d6f9-cswbm 0/1 CrashLoopBackOff 8 38m
weave-net-948jt 2/2 Running 5 1h
I take a look at the pod's events and see that the liveness and readiness probes are failing.
vagrant@vagrant:~$ kubectl describe pod tiller-deploy-78f96d6f9-cswbm -n kube-system
Name: tiller-deploy-78f96d6f9-cswbm
Namespace: kube-system
Node: vagrant/10.0.2.15
Start Time: Wed, 23 May 2018 08:51:54 +0000
Labels: app=helm
name=tiller
pod-template-hash=349528295
Annotations: <none>
Status: Running
IP: 10.32.0.28
Controlled By: ReplicaSet/tiller-deploy-78f96d6f9
Containers:
tiller:
Container ID: docker://389470b95c46f0a5ba6b4b5457f212b0e6f3e3a754beb1aeae835260de3790a7
Image: gcr.io/kubernetes-helm/tiller:v2.9.1
Image ID: docker-pullable://gcr.io/kubernetes-helm/tiller@sha256:417aae19a0709075df9cc87e2fcac599b39d8f73ac95e668d9627fec9d341af2
Ports: 44134/TCP, 44135/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Wed, 23 May 2018 09:26:53 +0000
Finished: Wed, 23 May 2018 09:27:12 +0000
Ready: False
Restart Count: 8
Liveness: http-get http://:44135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:44135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
Environment:
TILLER_NAMESPACE: kube-system
TILLER_HISTORY_MAX: 0
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-fl44z (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-fl44z:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-fl44z
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 38m kubelet, vagrant MountVolume.SetUp succeeded for volume "default-token-fl44z"
Normal Scheduled 38m default-scheduler Successfully assigned tiller-deploy-78f96d6f9-cswbm to vagrant
Normal Pulled 29m (x2 over 38m) kubelet, vagrant Container image "gcr.io/kubernetes-helm/tiller:v2.9.1" already present on machine
Normal Killing 29m kubelet, vagrant Killing container with id docker://tiller:Container failed liveness probe.. Container will be killed and recreated.
Normal Created 29m (x2 over 38m) kubelet, vagrant Created container
Normal Started 29m (x2 over 38m) kubelet, vagrant Started container
Warning Unhealthy 28m (x2 over 37m) kubelet, vagrant Readiness probe failed: Get http://10.32.0.19:44135/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 17m (x30 over 37m) kubelet, vagrant Liveness probe failed: Get http://10.32.0.19:44135/liveness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Normal SuccessfulMountVolume 11m kubelet, vagrant MountVolume.SetUp succeeded for volume "default-token-fl44z"
Warning FailedCreatePodSandBox 10m (x7 over 11m) kubelet, vagrant Failed create pod sandbox.
Normal SandboxChanged 10m (x8 over 11m) kubelet, vagrant Pod sandbox changed, it will be killed and re-created.
Normal Pulled 10m kubelet, vagrant Container image "gcr.io/kubernetes-helm/tiller:v2.9.1" already present on machine
Normal Created 10m kubelet, vagrant Created container
Normal Started 10m kubelet, vagrant Started container
Warning Unhealthy 10m kubelet, vagrant Liveness probe failed: Get http://10.32.0.28:44135/liveness: dial tcp 10.32.0.28:44135: getsockopt: connection refused
Warning Unhealthy 10m kubelet, vagrant Readiness probe failed: Get http://10.32.0.28:44135/readiness: dial tcp 10.32.0.28:44135: getsockopt: connection refused
Warning Unhealthy 8m (x2 over 9m) kubelet, vagrant Liveness probe failed: Get http://10.32.0.28:44135/liveness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 8m (x2 over 9m) kubelet, vagrant Readiness probe failed: Get http://10.32.0.28:44135/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning BackOff 1m (x22 over 7m) kubelet, vagrant Back-off restarting failed container
After entering this state, it stays there.
Only after I delete the Tiller pod does it come up again, and then everything runs smoothly.
vagrant@vagrant:~$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
etcd-vagrant 1/1 Running 2 1h
heapster-5449cf95bd-h9xk8 1/1 Running 2 1h
kube-apiserver-vagrant 1/1 Running 2 1h
kube-controller-manager-vagrant 1/1 Running 2 1h
kube-dns-6f4fd4bdf-xclbb 3/3 Running 6 1h
kube-proxy-8n8tc 1/1 Running 2 1h
kube-scheduler-vagrant 1/1 Running 2 1h
kubernetes-dashboard-5bd6f767c7-lrdjp 1/1 Running 4 1h
tiller-deploy-78f96d6f9-tgx4z 1/1 Running 0 7m
weave-net-948jt 2/2 Running 5 1h
However, the events still show the same Unhealthy warnings:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m default-scheduler Successfully assigned tiller-deploy-78f96d6f9-tgx4z to vagrant
Normal SuccessfulMountVolume 8m kubelet, vagrant MountVolume.SetUp succeeded for volume "default-token-fl44z"
Normal Pulled 7m kubelet, vagrant Container image "gcr.io/kubernetes-helm/tiller:v2.9.1" already present on machine
Normal Created 7m kubelet, vagrant Created container
Normal Started 7m kubelet, vagrant Started container
Warning Unhealthy 7m kubelet, vagrant Readiness probe failed: Get http://10.32.0.28:44135/readiness: dial tcp 10.32.0.28:44135: getsockopt: connection refused
Warning Unhealthy 7m kubelet, vagrant Liveness probe failed: Get http://10.32.0.28:44135/liveness: dial tcp 10.32.0.28:44135: getsockopt: connection refused
Warning Unhealthy 1m (x6 over 3m) kubelet, vagrant Liveness probe failed: Get http://10.32.0.28:44135/liveness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 41s (x14 over 7m) kubelet, vagrant Readiness probe failed: Get http://10.32.0.28:44135/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Any insight would be appreciated.
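One experiment I am considering, since the probes above time out after only 1 second: give Tiller a longer probe timeout so it survives the slow start after the VM comes back up. This is only a sketch against the default tiller-deploy deployment, not a confirmed fix:
kubectl -n kube-system patch deployment tiller-deploy --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":5},
  {"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":5}
]'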