Splunk pods are going into CrashLoopBackOff - kubernetes-helm

I have installed the "splunk-connect-for-kubernetes" Helm chart from https://github.com/splunk/splunk-connect-for-kubernetes/, but in my case some of the pods are going into CrashLoopBackOff with the error logs below.
Can someone help me here?
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
lv-splunk-logging-5q9pf 0/1 CrashLoopBackOff 12 47m 192.168.54.241 las2-m***31 <none> <none>
lv-splunk-logging-nzzld 0/1 CrashLoopBackOff 13 47m 192.168.97.152 las2-m***18 <none> <none>
lv-splunk-logging-qjfgw 0/1 CrashLoopBackOff 13 47m 192.168.178.41 aws-m****03 <none> <none>
lv-splunk-logging-zmvxp 0/1 CrashLoopBackOff 13 47m 192.168.11.174 las2-***11 <none> <none>
kubectl logs
2023-02-06 14:42:08 +0000 [info]: init supervisor logger path=nil rotate_age=nil rotate_size=nil
2023-02-06 14:42:08 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
2023-02-06 14:42:08 +0000 [info]: gem 'fluentd' version '1.15.3'
2023-02-06 14:42:08 +0000 [info]: gem 'fluent-plugin-concat' version '2.4.0'
2023-02-06 14:42:08 +0000 [info]: gem 'fluent-plugin-jq' version '0.5.1'
2023-02-06 14:42:08 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '3.1.0'
2023-02-06 14:42:08 +0000 [info]: gem 'fluent-plugin-prometheus' version '2.0.2'
2023-02-06 14:42:08 +0000 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2023-02-06 14:42:08 +0000 [info]: gem 'fluent-plugin-splunk-hec' version '1.3.1'
2023-02-06 14:42:08 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2023-02-06 14:42:08 +0000 [INFO]: Reading bearer token from /var/run/secrets/kubernetes.io/serviceaccount/token
2023-02-06 14:42:11 +0000 [error]: config error file="/fluentd/etc/fluent.conf" error_class=Fluent::ConfigError error="Invalid Kubernetes API v1 endpoint https://10.96.0.1:443/api: Timed out connecting to server"
All other pods in the cluster are up and running.
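The config error above is a plain connection timeout from the fluentd pods to the in-cluster API endpoint (https://10.96.0.1:443), so it is worth checking whether the pod network on the affected nodes can reach that Service IP at all before changing any chart values. A minimal sketch of such a check, assuming the default kubernetes Service and a throwaway curl pod pinned to one of the failing nodes (the pod name and image are arbitrary examples; replace <failing-node> with a real node name):
# Confirm the API Service IP the fluentd pods are trying to reach
kubectl get svc kubernetes -n default
# Run a one-off pod on a failing node and probe the endpoint; a timeout here
# points at the pod network / kube-proxy / firewall rather than the chart itself
kubectl run api-check --rm -it --restart=Never --image=curlimages/curl \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<failing-node>"}}' \
  --command -- curl -sk -m 5 https://10.96.0.1:443/version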

Related

Kubelet certificate expired but worker nodes working fine - when will we see the issue

In my v1.23.1 test cluster I can see that the worker node certificate expired some time ago, but the worker nodes are still taking workload and are in Ready status.
How is this certificate being used, and when will we see an issue from the expired certificate?
# curl -v https://localhost:10250 -k 2>&1 |grep 'expire date'
* expire date: Oct 4 18:02:14 2021 GMT
# openssl x509 -text -noout -in /var/lib/kubelet/pki/kubelet.crt |grep -A2 'Validity'
Validity
Not Before: Oct 4 18:02:14 2020 GMT
Not After : Oct 4 18:02:14 2021 GMT
Update 1:
The cluster is running on-prem on CentOS Stream 8 and was built with the kubeadm tool. I was able to schedule workload on all the worker nodes: I created an nginx deployment, scaled it to 50 pods, and I can see nginx pods on all the worker nodes.
I can also reboot the worker nodes without any issue.
Update 2:
kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
W0303 11:17:18.261639 698383 utils.go:69] The recommended value for "resolvConf" in "KubeletConfiguration" is: /run/systemd/resolve/resolv.conf; the provided value is: /run/systemd/resolve/resolv.conf
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Jan 16, 2023 16:15 UTC 318d ca no
apiserver Jan 16, 2023 16:15 UTC 318d ca no
apiserver-kubelet-client Jan 16, 2023 16:15 UTC 318d ca no
controller-manager.conf Jan 16, 2023 16:15 UTC 318d ca no
front-proxy-client Jan 16, 2023 16:15 UTC 318d front-proxy-ca no
scheduler.conf Jan 16, 2023 16:15 UTC 318d ca no
CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Oct 02, 2030 18:44 UTC 8y no
front-proxy-ca Oct 02, 2030 18:44 UTC 8y no
Thanks
Update 3:
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
server10 Ready control-plane,master 519d v1.23.1
server11 Ready control-plane,master 519d v1.23.1
server12 Ready control-plane,master 519d v1.23.1
server13 Ready <none> 519d v1.23.1
server14 Ready <none> 519d v1.23.1
server15 Ready <none> 516d v1.23.1
server16 Ready <none> 516d v1.23.1
server17 Ready <none> 516d v1.23.1
server18 Ready <none> 516d v1.23.1
# kubectl get pods -o wide
nginx-dev-8677c757d4-4k9xp 1/1 Running 0 4d12h 10.203.53.19 server17 <none> <none>
nginx-dev-8677c757d4-6lbc6 1/1 Running 0 4d12h 10.203.89.120 server14 <none> <none>
nginx-dev-8677c757d4-ksckf 1/1 Running 0 4d12h 10.203.124.4 server16 <none> <none>
nginx-dev-8677c757d4-lrz9h 1/1 Running 0 4d12h 10.203.124.41 server16 <none> <none>
nginx-dev-8677c757d4-tllx9 1/1 Running 0 4d12h 10.203.151.70 server11 <none> <none>
# grep client /etc/kubernetes/kubelet.conf
client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
# ls -ltr /var/lib/kubelet/pki
total 16
-rw------- 1 root root 1679 Oct 4 2020 kubelet.key
-rw-r--r-- 1 root root 2258 Oct 4 2020 kubelet.crt
-rw------- 1 root root 1114 Oct 4 2020 kubelet-client-2020-10-04-14-50-21.pem
lrwxrwxrwx 1 root root 59 Jul 6 2021 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2021-07-06-01-44-10.pem
-rw------- 1 root root 1114 Jul 6 2021 kubelet-client-2021-07-06-01-44-10.pem
Those kubelet certificates are called kubelet serving certificates. They are used when the kubelet acts as a "server" instead of a "client".
For example, the kubelet serves metrics to the metrics-server. So if the metrics-server is configured to use secure TLS and those certificates have expired, the metrics-server cannot establish a proper connection to the kubelet to collect metrics. If you are using the Kubernetes Dashboard, it will not be able to show CPU and memory consumption on the page. That is when you will see the issue from those expired certificates.
Reference: https://kubernetes.io/docs/reference/access-authn-authz/kubelet-tls-bootstrapping/#client-and-serving-certificates
Those certificates will not auto-rotate when they expire, and they cannot be rotated with "kubeadm certs renew" either. To renew them, you need to add "serverTLSBootstrap: true" to your cluster configuration. With that in place, when a serving certificate expires the kubelet sends a CSR to the cluster, and you can approve it with "kubectl certificate approve" to renew the certificate.
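A rough sketch of that procedure on a kubeadm-managed cluster (the ConfigMap name below is an assumption: on releases of this vintage it usually carries a version suffix such as kubelet-config-1.23, so check what actually exists in kube-system):
# 1. Add "serverTLSBootstrap: true" to the cluster-wide kubelet configuration
kubectl -n kube-system edit cm kubelet-config-1.23
# 2. Add the same line to each node's local kubelet config and restart the kubelet
echo "serverTLSBootstrap: true" | sudo tee -a /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet
# 3. Approve the serving-certificate CSRs the kubelets now submit
kubectl get csr
kubectl certificate approve <csr-name>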

kubernetes master node is NotReady | Unable to register node "node-server", it is forbidden

I've been trying to troubleshoot my Kubernetes cluster since my master node is NotReady. I've followed guides on Stack Overflow and the Kubernetes troubleshooting guide, but I am not able to pinpoint the issue. I'm relatively new to Kubernetes.
Here's what I have tried:
# kubectl get nodes
NAME STATUS ROLES AGE VERSION
NodeName NotReady master 213d v1.16.2
# kubectl describe node NodeName
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Tue, 12 Jan 2021 15:41:24 +0530 Tue, 12 Jan 2021 15:41:24 +0530 CalicoIsUp Calico is running on this node
MemoryPressure Unknown Fri, 15 Jan 2021 16:40:54 +0530 Fri, 15 Jan 2021 16:48:07 +0530 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Fri, 15 Jan 2021 16:40:54 +0530 Fri, 15 Jan 2021 16:48:07 +0530 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Fri, 15 Jan 2021 16:40:54 +0530 Fri, 15 Jan 2021 16:48:07 +0530 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Fri, 15 Jan 2021 16:40:54 +0530 Fri, 15 Jan 2021 16:48:07 +0530 NodeStatusUnknown Kubelet stopped posting node status.
# sudo journalctl -u kubelet -n 100 --no-pager
Feb 26 12:23:03 devportal-test kubelet[11311]: E0226 12:23:03.581359 11311 kubelet.go:2267] node "devportal-test" not found
Feb 26 12:23:03 devportal-test kubelet[11311]: E0226 12:23:03.681814 11311 kubelet.go:2267] node "devportal-test" not found
Feb 26 12:23:03 devportal-test kubelet[11311]: E0226 12:23:03.782649 11311 kubelet.go:2267] node "devportal-test" not found
Feb 26 12:23:03 devportal-test kubelet[11311]: E0226 12:23:03.883846 11311 kubelet.go:2267] node "devportal-test" not found
Feb 26 12:23:03 devportal-test kubelet[11311]: I0226 12:23:03.912585 11311 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Feb 26 12:23:03 devportal-test kubelet[11311]: I0226 12:23:03.918664 11311 kubelet_node_status.go:72] Attempting to register node devportal-test
Feb 26 12:23:03 devportal-test kubelet[11311]: E0226 12:23:03.926545 11311 kubelet_node_status.go:94] Unable to register node "devportal-test" with API server: nodes "devportal-test" is forbidden: node "NodeName" is not allowed to modify node "devportal-test"
Feb 26 12:23:05 devportal-test kubelet[11311]: E0226 12:23:05.893160 11311 kubelet.go:2267] node "devportal-test" not found
Feb 26 12:23:05 devportal-test kubelet[11311]: E0226 12:23:05.993770 11311 kubelet.go:2267] node "devportal-test" not found
Feb 26 12:23:06 devportal-test kubelet[11311]: E0226 12:23:06.095640 11311 kubelet.go:2267] node "devportal-test" not found
Feb 26 12:23:06 devportal-test kubelet[11311]: E0226 12:23:06.147085 11311 controller.go:135] failed to ensure node lease exists, will retry in 7s, error: leases.coordination.k8s.io "devportal-test" is forbidden: User "system:node:NodeName" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": can only access node lease with the same name as the requesting node
# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-6d85fdfbd8-dkxjx 1/1 Terminating 8 213d
calico-kube-controllers-6d85fdfbd8-jsxjd 0/1 Pending 0 28d
calico-node-v5w2w 1/1 Running 8 213d
coredns-5644d7b6d9-g8rnl 1/1 Terminating 16 213d
coredns-5644d7b6d9-vgzg2 0/1 Pending 0 28d
coredns-5644d7b6d9-z8dzw 1/1 Terminating 16 213d
coredns-5644d7b6d9-zmcjr 0/1 Pending 0 28d
etcd-NodeName 1/1 Running 34 213d
kube-apiserver-NodeName 1/1 Running 85 213d
kube-controller-manager-NodeName 1/1 Running 790 213d
kube-proxy-gd5jx 1/1 Running 9 213d
kube-scheduler-NodeName 1/1 Running 800 213d
local-path-provisioner-56db8cbdb5-gqgqr 1/1 Running 3 44d
# kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:18:23Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:09:08Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
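The forbidden errors in the journalctl output above suggest the kubelet is trying to register as devportal-test while its client certificate still identifies it as system:node:NodeName, which typically happens when the node's hostname changed after it joined the cluster. A quick way to compare the two, assuming a standard kubeadm layout (the file paths below are the defaults and may differ on your setup):
# Hostname the kubelet will register under
hostname
# Identity embedded in the kubelet's client certificate
sudo openssl x509 -noout -subject -in /var/lib/kubelet/pki/kubelet-client-current.pem
# Any explicit hostname override handed to the kubelet by kubeadm, if the file exists
grep hostname-override /var/lib/kubelet/kubeadm-flags.env 2>/dev/null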

Kubernetes worker node is NotReady due to CNI plugin not initialized

I'm using kind to run a test kubernetes cluster on my local Macbook.
I found one of the nodes with status NotReady:
$ kind get clusters
mc
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
mc-control-plane Ready master 4h42m v1.18.2
mc-control-plane2 Ready master 4h41m v1.18.2
mc-control-plane3 Ready master 4h40m v1.18.2
mc-worker NotReady <none> 4h40m v1.18.2
mc-worker2 Ready <none> 4h40m v1.18.2
mc-worker3 Ready <none> 4h40m v1.18.2
The only interesting thing in kubectl describe node mc-worker is that the CNI plugin is not initialized:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 11 Aug 2020 16:55:44 -0700 Tue, 11 Aug 2020 12:10:16 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 11 Aug 2020 16:55:44 -0700 Tue, 11 Aug 2020 12:10:16 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 11 Aug 2020 16:55:44 -0700 Tue, 11 Aug 2020 12:10:16 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Tue, 11 Aug 2020 16:55:44 -0700 Tue, 11 Aug 2020 12:10:16 -0700 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady
message:Network plugin returns error: cni plugin not initialized
I have 2 similar clusters and this only occurs on this cluster.
Since kind uses the local Docker daemon to run these nodes as containers, I have already tried to restart the container (should be the equivalent of rebooting the node).
I have considered deleting and recreating the cluster, but there ought to be a way to solve this without recreating the cluster.
Here are the versions that I'm running:
$ kind version
kind v0.8.1 go1.14.4 darwin/amd64
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-30T20:19:45Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
How do you resolve this issue?
Most likely cause:
The docker VM is running out of some resource and cannot start CNI on that particular node.
You can poke around in the HyperKit VM by connecting to it:
From a shell:
screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
If that doesn't work for some reason:
docker run -it --rm --privileged --pid=host alpine nsenter -t 1 -m -u -n -i sh
Once in the VM:
# ps -Af
# free
# df -h
...
Then you can always increase the resource limits in the Docker Desktop UI (Preferences > Resources).
Finally, your node is, after all, running in a container, so you can connect to that container and see what kubelet errors you get:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6d881be79f4a kindest/node:v1.18.2 "/usr/local/bin/entr…" 32 seconds ago Up 29 seconds 127.0.0.1:57316->6443/tcp kind-control-plane
docker exec -it 6d881be79f4a bash
root@kind-control-plane:/# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/kind/systemd/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Wed 2020-08-12 02:32:16 UTC; 35s ago
Docs: http://kubernetes.io/docs/
Main PID: 768 (kubelet)
Tasks: 23 (limit: 2348)
Memory: 32.8M
CGroup: /docker/6d881be79f4a8ded3162ec6b5caa8805542ff9703fabf5d3d2eee204a0814e01/system.slice/kubelet.service
└─768 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet
/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --fail-swap-on=false --node-ip= --fail-swap-on=false
...
✌️
I encountered this scenario: the master was Ready but the worker nodes were not. After some investigation, I found out that /opt/cni/bin was empty - there were no network plugin binaries on my worker node hosts. So I installed the "kubernetes-cni.x86_64" package and restarted the kubelet service, which resolved the "NotReady" status of my worker nodes.
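For reference, on an RPM-based host that roughly amounts to the following (assuming the standard Kubernetes yum repository is already configured):
# Confirm the CNI plugin binaries are missing
ls /opt/cni/bin
# Install the reference CNI plugins and restart the kubelet
sudo yum install -y kubernetes-cni
sudo systemctl restart kubelet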
Stopping and disabling AppArmor and restarting the containerd service on that node will solve your issue:
root@node:~# systemctl stop apparmor
root@node:~# systemctl disable apparmor
root@node:~# systemctl restart containerd.service

Kubelet master stays in KubeletNotReady because of missing CNI

The cluster has been initialized with the pod network CIDR for Calico:
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --image-repository=someserver
Then I got calico.yaml v3.11 and applied it:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" apply -f calico.yaml
Right after, I checked the node status:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" get nodes
NAME STATUS ROLES AGE VERSION
master-1 NotReady master 7m21s v1.17.2
On describe I've got "cni config uninitialized", but I thought that Calico should have taken care of that?
MemoryPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
In fact I have nothing under /etc/cni/net.d/, so it seems something is missing:
ll /etc/cni/net.d/
total 0
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" -n kube-system get pods
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-f7lqq 0/1 Pending 0 3h
calico-node-f4xzh 0/1 Init:ImagePullBackOff 0 3h
coredns-7fb8cdf968-bbqbz 0/1 Pending 0 3h24m
coredns-7fb8cdf968-vdnzx 0/1 Pending 0 3h24m
etcd-master-1 1/1 Running 0 3h24m
kube-apiserver-master-1 1/1 Running 0 3h24m
kube-controller-manager-master-1 1/1 Running 0 3h24m
kube-proxy-9m879 1/1 Running 0 3h24m
kube-scheduler-master-1 1/1 Running 0 3h24m
As explained, I'm running through a local repo, and journalctl says:
kubelet[21935]: E0225 14:30:54.830683 21935 pod_workers.go:191] Error syncing pod cec2f72b-844a-4d6b-8606-3aff06d4a36d ("calico-node-f4xzh_kube-system(cec2f72b-844a-4d6b-8606-3aff06d4a36d)"), skipping: failed to "StartContainer" for "upgrade-ipam" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://repo:10000/v2/calico/cni/manifests/v3.11.2: no basic auth credentials"
kubelet[21935]: E0225 14:30:56.008989 21935 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
It feels like CNI is not the only issue.
The CoreDNS pods will stay Pending and the master will remain NotReady until the Calico pods are running successfully and CNI is set up properly.
It seems to be a network issue downloading the Calico Docker images from docker.io. You can pull the Calico images from docker.io, push them to your internal container registry, modify calico.yaml so the image references point at that registry, and finally apply the modified calico.yaml to the Kubernetes cluster.
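A sketch of that mirroring step, assuming Calico v3.11.2 as used in this question and a hypothetical internal registry at registry.internal:5000 (adjust the sed pattern to however images are referenced in your copy of the manifest):
# Pull each Calico image from docker.io, retag it for the internal registry, and push it
for img in cni node kube-controllers pod2daemon-flexvol; do
  sudo docker pull calico/$img:v3.11.2
  sudo docker tag calico/$img:v3.11.2 registry.internal:5000/calico/$img:v3.11.2
  sudo docker push registry.internal:5000/calico/$img:v3.11.2
done
# Point the manifest at the internal registry and re-apply it
sed -i 's#image: calico/#image: registry.internal:5000/calico/#' calico.yaml
kubectl apply -f calico.yaml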
So the issue behind Init:ImagePullBackOff was that it could not pull the images from my private repo automatically. I had to pull all the Calico images myself. Then I deleted the calico-node pod and it recreated itself with the newly pushed images:
sudo docker pull private-repo/calico/pod2daemon-flexvol:v3.11.2
sudo docker pull private-repo/calico/node:v3.11.2
sudo docker pull private-repo/calico/cni:v3.11.2
sudo docker pull private-repo/calico/kube-controllers:v3.11.2
sudo kubectl -n kube-system delete po/calico-node-y7g5
After that the node redid all the init phases, and:
sudo kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-qkf47 1/1 Running 0 11s
calico-node-mkcsr 1/1 Running 0 21m
coredns-7fb8cdf968-bgqvj 1/1 Running 0 37m
coredns-7fb8cdf968-v85jx 1/1 Running 0 37m
etcd-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-apiserver-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-controller-manager-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-proxy-9hkns 1/1 Running 0 37m
kube-scheduler-lin-1k8w1dv-vmh 1/1 Running 0 38m

coredns stuck in ContainerCreating status

I am trying to set up a basic k8s cluster.
After doing kubeadm init --pod-network-cidr=10.244.0.0/16, the coredns pods are stuck in ContainerCreating status:
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-2cnhj 0/1 ContainerCreating 0 43h
coredns-6955765f44-dnphb 0/1 ContainerCreating 0 43h
etcd-perf1 1/1 Running 0 43h
kube-apiserver-perf1 1/1 Running 0 43h
kube-controller-manager-perf1 1/1 Running 0 43h
kube-flannel-ds-amd64-smpbk 1/1 Running 0 43h
kube-proxy-6zgvn 1/1 Running 0 43h
kube-scheduler-perf1 1/1 Running 0 43h
OS-IMAGE: Ubuntu 16.04.6 LTS
KERNEL-VERSION: 4.4.0-142-generic
CONTAINER-RUNTIME: docker://19.3.5
Errors from the journalctl -xeu kubelet command:
Jan 02 10:31:44 perf1 kubelet[11901]: 2020-01-02 10:31:44.112 [INFO][10207] k8s.go 228: Using Calico IPAM
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118281 11901 cni.go:385] Error deleting kube-system_coredns-6955765f44-2cnhj/12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf from
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118828 11901 remote_runtime.go:128] StopPodSandbox "12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf" from runtime service failed:
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118872 11901 kuberuntime_manager.go:898] Failed to stop sandbox {"docker" "12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf"}
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118917 11901 kuberuntime_manager.go:676] killPodWithSyncResult failed: failed to "KillPodSandbox" for "e44bc42f-0b8d-40ad-82a9-334a1b1c8e40" with
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118939 11901 pod_workers.go:191] Error syncing pod e44bc42f-0b8d-40ad-82a9-334a1b1c8e40 ("coredns-6955765f44-2cnhj_kube-system(e44bc42f-0b8d-40ad-
Jan 02 10:31:47 perf1 kubelet[11901]: W0102 10:31:47.081709 11901 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "747c3cc9455a7d
Jan 02 10:31:47 perf1 kubelet[11901]: 2020-01-02 10:31:47.113 [INFO][10267] k8s.go 228: Using Calico IPAM
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.118526 11901 cni.go:385] Error deleting kube-system_coredns-6955765f44-dnphb/747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53 from
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119017 11901 remote_runtime.go:128] StopPodSandbox "747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53" from runtime service failed:
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119052 11901 kuberuntime_manager.go:898] Failed to stop sandbox {"docker" "747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53"}
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119098 11901 kuberuntime_manager.go:676] killPodWithSyncResult failed: failed to "KillPodSandbox" for "52ffb25e-06c7-4cc6-be70-540049a6be20" with
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119119 11901 pod_workers.go:191] Error syncing pod 52ffb25e-06c7-4cc6-be70-540049a6be20 ("coredns-6955765f44-dnphb_kube-system(52ffb25e-06c7-4cc6-
I have tried kubeadm reset as well but no luck so far.
Looks like the issue was because I tried switching from the Calico to the Flannel CNI. Following the steps mentioned in "Pods failed to start after switch cni plugin from flannel to calico and then flannel" resolved the issue for me.
Additionally, you may have to clear the contents of /etc/cni/net.d.
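Roughly, that cleanup looks like this on each affected node (it wipes the node's CNI configuration, so only do it as part of a deliberate CNI reinstall):
# Remove the stale CNI config left behind by the previous plugin, then restart the kubelet
sudo rm -rf /etc/cni/net.d/*
sudo systemctl restart kubelet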
CoreDNS will not start up before a CNI network is installed.
For flannel to work correctly, you must pass --pod-network-cidr=10.244.0.0/16 to kubeadm init.
Set /proc/sys/net/bridge/bridge-nf-call-iptables to 1 by running sysctl net.bridge.bridge-nf-call-iptables=1 to pass bridged IPv4 traffic to iptables’ chains. This is a requirement for some CNI plugins to work.
Make sure that your firewall rules allow UDP ports 8285 and 8472 traffic for all hosts participating in the overlay network.
Note that flannel works on amd64, arm, arm64, ppc64le and s390x under Linux. Windows (amd64) is claimed as supported in v0.11.0 but the usage is undocumented
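If you want the bridge-nf-call-iptables setting from the step above to persist across reboots, one common approach is to drop it into /etc/sysctl.d (the file name below is arbitrary):
echo "net.bridge.bridge-nf-call-iptables = 1" | sudo tee /etc/sysctl.d/99-kubernetes-cni.conf
sudo sysctl --system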
To deploy flannel as the CNI network:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml
After you have deployed flannel delete the core dns pods, Kubernetes will recreate the pods.
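Assuming the coredns pods still carry the default k8s-app=kube-dns label, that deletion is a single command:
kubectl -n kube-system delete pod -l k8s-app=kube-dns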
You have deployed flannel as the CNI, but the kubelet logs show that Kubernetes is using Calico:
[INFO][10207] k8s.go 228: Using Calico IPAM
Something is wrong with the container network; without it, coredns cannot come up.
You might have to reinstall with the correct CNI. Once the CNI is deployed successfully, CoreDNS gets deployed automatically.
So here is my solution:
First, CoreDNS runs on your master / control-plane nodes.
Now run ifconfig and check these two interfaces: cni0 and flannel.1.
Suppose cni0=10.244.1.1 and flannel.1=10.244.0.0; then your DNS pods will not be created.
It should be cni0=10.244.0.1 and flannel.1=10.244.0.0, which means cni0 must sit inside flannel.1's /24 subnet.
Run the following two commands to bring the interface down and remove it on your master/control-plane machines:
sudo ifconfig cni0 down;
sudo ip link delete cni0;
Now check via ifconfig; you will see new vethxxxxxxxx interfaces appear. This should fix your problem.