CoreDNS stuck in ContainerCreating status - kubernetes

I am trying to set up a basic k8s cluster.
After running kubeadm init --pod-network-cidr=10.244.0.0/16, the coredns pods are stuck in ContainerCreating status:
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-2cnhj 0/1 ContainerCreating 0 43h
coredns-6955765f44-dnphb 0/1 ContainerCreating 0 43h
etcd-perf1 1/1 Running 0 43h
kube-apiserver-perf1 1/1 Running 0 43h
kube-controller-manager-perf1 1/1 Running 0 43h
kube-flannel-ds-amd64-smpbk 1/1 Running 0 43h
kube-proxy-6zgvn 1/1 Running 0 43h
kube-scheduler-perf1 1/1 Running 0 43h
OS-IMAGE: Ubuntu 16.04.6 LTS
KERNEL-VERSION: 4.4.0-142-generic
CONTAINER-RUNTIME: docker://19.3.5
Errors from the journalctl -xeu kubelet command:
Jan 02 10:31:44 perf1 kubelet[11901]: 2020-01-02 10:31:44.112 [INFO][10207] k8s.go 228: Using Calico IPAM
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118281 11901 cni.go:385] Error deleting kube-system_coredns-6955765f44-2cnhj/12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf from
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118828 11901 remote_runtime.go:128] StopPodSandbox "12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf" from runtime service failed:
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118872 11901 kuberuntime_manager.go:898] Failed to stop sandbox {"docker" "12cd9435dc905c026bbdb4a1954fc36c82ede1d703b040a3052ab3370445abbf"}
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118917 11901 kuberuntime_manager.go:676] killPodWithSyncResult failed: failed to "KillPodSandbox" for "e44bc42f-0b8d-40ad-82a9-334a1b1c8e40" with
Jan 02 10:31:44 perf1 kubelet[11901]: E0102 10:31:44.118939 11901 pod_workers.go:191] Error syncing pod e44bc42f-0b8d-40ad-82a9-334a1b1c8e40 ("coredns-6955765f44-2cnhj_kube-system(e44bc42f-0b8d-40ad-
Jan 02 10:31:47 perf1 kubelet[11901]: W0102 10:31:47.081709 11901 cni.go:331] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "747c3cc9455a7d
Jan 02 10:31:47 perf1 kubelet[11901]: 2020-01-02 10:31:47.113 [INFO][10267] k8s.go 228: Using Calico IPAM
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.118526 11901 cni.go:385] Error deleting kube-system_coredns-6955765f44-dnphb/747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53 from
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119017 11901 remote_runtime.go:128] StopPodSandbox "747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53" from runtime service failed:
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119052 11901 kuberuntime_manager.go:898] Failed to stop sandbox {"docker" "747c3cc9455a7db202ab14576d15509d8ef6967c6349e9acbeff2207914d3d53"}
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119098 11901 kuberuntime_manager.go:676] killPodWithSyncResult failed: failed to "KillPodSandbox" for "52ffb25e-06c7-4cc6-be70-540049a6be20" with
Jan 02 10:31:47 perf1 kubelet[11901]: E0102 10:31:47.119119 11901 pod_workers.go:191] Error syncing pod 52ffb25e-06c7-4cc6-be70-540049a6be20 ("coredns-6955765f44-dnphb_kube-system(52ffb25e-06c7-4cc6-
I have tried kubeadm reset as well, but no luck so far.

It looks like the issue was caused by my switching from the Calico CNI to flannel. Following the steps in the question linked below resolved the issue for me:
Pods failed to start after switch cni plugin from flannel to calico and then flannel
Additionally, you may have to clear the contents of /etc/cni/net.d.
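To see whether an earlier CNI plugin left config files behind before clearing anything, a minimal sketch along these lines may help. The function name list_cni_configs is made up for illustration, and the list of extensions (.conf, .conflist, .json) is an assumption about which files the CNI loader picks up.

```shell
# List CNI config files in a directory (default: /etc/cni/net.d).
# Seeing files from more than one plugin here (for example both a calico
# and a flannel file) suggests that a previous CNI left config behind.
list_cni_configs() {
    dir="${1:-/etc/cni/net.d}"
    # Only look at the extensions assumed to be read as CNI config.
    for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
        [ -e "$f" ] && basename "$f"
    done
    return 0
}
```

Running list_cni_configs with no argument inspects /etc/cni/net.d on the current node; passing a directory inspects that directory instead.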

CoreDNS will not start up before a CNI network is installed.
For flannel to work correctly, you must pass --pod-network-cidr=10.244.0.0/16 to kubeadm init.
Set /proc/sys/net/bridge/bridge-nf-call-iptables to 1 by running sysctl net.bridge.bridge-nf-call-iptables=1 to pass bridged IPv4 traffic to iptables’ chains. This is a requirement for some CNI plugins to work.
Make sure that your firewall rules allow traffic on UDP ports 8285 and 8472 for all hosts participating in the overlay network.
Note that flannel works on amd64, arm, arm64, ppc64le and s390x under Linux. Windows (amd64) is claimed as supported in v0.11.0, but the usage is undocumented.
To deploy flannel as the CNI network:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml
After you have deployed flannel, delete the CoreDNS pods; Kubernetes will recreate them.
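The sysctl prerequisite above can be checked before deploying flannel. This is only a sketch: the function name bridge_nf_enabled is hypothetical, and the path is taken as a parameter so the check can also be pointed at a test file.

```shell
# Succeeds when bridged IPv4 traffic is passed to iptables, i.e. when
# /proc/sys/net/bridge/bridge-nf-call-iptables contains 1.
bridge_nf_enabled() {
    path="${1:-/proc/sys/net/bridge/bridge-nf-call-iptables}"
    # A missing or unreadable file also counts as "not enabled".
    [ -r "$path" ] && [ "$(cat "$path")" = "1" ]
}
```

If the check fails, run sysctl net.bridge.bridge-nf-call-iptables=1 as described above. Once flannel is running, kubectl -n kube-system delete pod -l k8s-app=kube-dns is one way to delete the CoreDNS pods so that they are recreated.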

You have deployed flannel as the CNI, but the kubelet logs show that Kubernetes is using Calico:
[INFO][10207] k8s.go 228: Using Calico IPAM
Something is wrong with the container network, and without a working network CoreDNS cannot start.
You might have to reinstall with the correct CNI. Once the CNI is deployed successfully, the CoreDNS pods come up automatically.

So here is my solution:
First, CoreDNS runs on your master (control-plane) nodes.
Now run ifconfig and check these two interfaces: cni0 and flannel.1.
If, for example, cni0=10.244.1.1 and flannel.1=10.244.0.0, the DNS pods will not be created.
It should be cni0=10.244.0.1 and flannel.1=10.244.0.0, which means cni0 must sit inside the /24 subnet of flannel.1.
Run the following two commands to bring the interface down and remove it from your master/control-plane machines:
sudo ifconfig cni0 down;
sudo ip link delete cni0;
Now check via ifconfig; you will see two more vethxxxxxxxx interfaces appear. This should fix your problem.
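The subnet rule described in this answer can be written as a small check. It is a sketch only: the function name same_slash24 is made up, it handles dotted IPv4 addresses only, and it assumes the /24 per-node subnet used in the example above.

```shell
# Succeeds when two IPv4 addresses share the same first three octets,
# i.e. when they fall into the same /24 subnet.
same_slash24() {
    # Strip the last octet from each address and compare what is left.
    [ "${1%.*}" = "${2%.*}" ]
}
```

With the addresses from the answer, same_slash24 10.244.0.1 10.244.0.0 succeeds and same_slash24 10.244.1.1 10.244.0.0 fails. The actual addresses on a node can be read with ip -4 addr show cni0 and ip -4 addr show flannel.1.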

Kubernetes worker node is NotReady due to CNI plugin not initialized

I'm using kind to run a test kubernetes cluster on my local Macbook.
I found one of the nodes with status NotReady:
$ kind get clusters
mc
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
mc-control-plane Ready master 4h42m v1.18.2
mc-control-plane2 Ready master 4h41m v1.18.2
mc-control-plane3 Ready master 4h40m v1.18.2
mc-worker NotReady <none> 4h40m v1.18.2
mc-worker2 Ready <none> 4h40m v1.18.2
mc-worker3 Ready <none> 4h40m v1.18.2
The only interesting thing in kubectl describe node mc-worker is that the CNI plugin is not initialized:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 11 Aug 2020 16:55:44 -0700 Tue, 11 Aug 2020 12:10:16 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 11 Aug 2020 16:55:44 -0700 Tue, 11 Aug 2020 12:10:16 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 11 Aug 2020 16:55:44 -0700 Tue, 11 Aug 2020 12:10:16 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Tue, 11 Aug 2020 16:55:44 -0700 Tue, 11 Aug 2020 12:10:16 -0700 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
I have 2 similar clusters and this only occurs on this cluster.
Since kind uses the local Docker daemon to run these nodes as containers, I have already tried to restart the container (should be the equivalent of rebooting the node).
I have considered deleting and recreating the cluster, but there ought to be a way to solve this without recreating the cluster.
Here are the versions that I'm running:
$ kind version
kind v0.8.1 go1.14.4 darwin/amd64
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-30T20:19:45Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
How do you resolve this issue?
Most likely cause:
The docker VM is running out of some resource and cannot start CNI on that particular node.
You can poke around in the HyperKit VM by connecting to it:
From a shell:
screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
If that doesn't work for some reason:
docker run -it --rm --privileged --pid=host alpine nsenter -t 1 -m -u -n -i sh
Once in the VM:
# ps -Af
# free
# df -h
...
Then you can always adjust the resource settings in the Docker UI.
Finally, your node is, after all, running in a container, so you can connect to that container and check which kubelet errors it reports:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6d881be79f4a kindest/node:v1.18.2 "/usr/local/bin/entr…" 32 seconds ago Up 29 seconds 127.0.0.1:57316->6443/tcp kind-control-plane
docker exec -it 6d881be79f4a bash
root@kind-control-plane:/# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/kind/systemd/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Wed 2020-08-12 02:32:16 UTC; 35s ago
Docs: http://kubernetes.io/docs/
Main PID: 768 (kubelet)
Tasks: 23 (limit: 2348)
Memory: 32.8M
CGroup: /docker/6d881be79f4a8ded3162ec6b5caa8805542ff9703fabf5d3d2eee204a0814e01/system.slice/kubelet.service
└─768 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --fail-swap-on=false --node-ip= --fail-swap-on=false
...
✌️
I encountered this scenario: the master was Ready but the worker nodes were not. After some investigation, I found that /opt/cni/bin was empty, so there was no network plugin on my worker node hosts. I installed the "kubernetes-cni.x86_64" package and restarted the kubelet service, which solved the "NotReady" status of my worker nodes.
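The empty /opt/cni/bin condition from this answer is easy to test for on a node. A minimal sketch, with a hypothetical function name and the directory passed as an optional argument:

```shell
# Succeeds only if the CNI plugin directory exists and is not empty.
cni_bin_dir_populated() {
    dir="${1:-/opt/cni/bin}"
    # ls -A lists every entry except . and .., so empty output means
    # the directory holds no plugin binaries at all.
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

For example, cni_bin_dir_populated || echo "no CNI plugins installed" reports the problem on the node where it is run.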
Stopping and disabling AppArmor and restarting the containerd service on that node will solve your issue:
root@node:~# systemctl stop apparmor
root@node:~# systemctl disable apparmor
root@node:~# systemctl restart containerd.service

Kubelet Master stays in KubeletNotReady because of cni missing

Kubelet has been initialized with the pod network for Calico:
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --image-repository=someserver
Then I got calico.yaml v3.11 and applied it:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" apply -f calico.yaml
Right after, I checked the node status:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" get nodes
NAME STATUS ROLES AGE VERSION
master-1 NotReady master 7m21s v1.17.2
On describe I get "cni config uninitialized", but I thought Calico should have taken care of that?
MemoryPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
In fact I have nothing under /etc/cni/net.d/, so it seems something was skipped?
ll /etc/cni/net.d/
total 0
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" -n kube-system get pods
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-f7lqq 0/1 Pending 0 3h
calico-node-f4xzh 0/1 Init:ImagePullBackOff 0 3h
coredns-7fb8cdf968-bbqbz 0/1 Pending 0 3h24m
coredns-7fb8cdf968-vdnzx 0/1 Pending 0 3h24m
etcd-master-1 1/1 Running 0 3h24m
kube-apiserver-master-1 1/1 Running 0 3h24m
kube-controller-manager-master-1 1/1 Running 0 3h24m
kube-proxy-9m879 1/1 Running 0 3h24m
kube-scheduler-master-1 1/1 Running 0 3h24m
As explained, I'm pulling from a local repo, and journalctl says:
kubelet[21935]: E0225 14:30:54.830683 21935 pod_workers.go:191] Error syncing pod cec2f72b-844a-4d6b-8606-3aff06d4a36d ("calico-node-f4xzh_kube-system(cec2f72b-844a-4d6b-8606-3aff06d4a36d)"), skipping: failed to "StartContainer" for "upgrade-ipam" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://repo:10000/v2/calico/cni/manifests/v3.11.2: no basic auth credentials"
kubelet[21935]: E0225 14:30:56.008989 21935 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
It feels like CNI is not the only issue.
The CoreDNS pods will stay Pending and the master will stay NotReady until the Calico pods are running successfully and the CNI is set up properly.
It seems to be a problem pulling the Calico images from docker.io. You can pull the Calico images from docker.io, push them to your internal container registry, modify the image references in calico.yaml to point at that registry, and finally apply the modified calico.yaml to the cluster.
So the issue with Init:ImagePullBackOff was that the images could not be pulled from my private repo automatically. I had to pull all the Calico images with docker myself. Then I deleted the calico pod, and it was recreated with the newly pulled images:
sudo docker pull private-repo/calico/pod2daemon-flexvol:v3.11.2
sudo docker pull private-repo/calico/node:v3.11.2
sudo docker pull private-repo/calico/cni:v3.11.2
sudo docker pull private-repo/calico/kube-controllers:v3.11.2
sudo kubectl -n kube-system delete po/calico-node-y7g5
After that the node went through the whole init phase again:
sudo kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-qkf47 1/1 Running 0 11s
calico-node-mkcsr 1/1 Running 0 21m
coredns-7fb8cdf968-bgqvj 1/1 Running 0 37m
coredns-7fb8cdf968-v85jx 1/1 Running 0 37m
etcd-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-apiserver-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-controller-manager-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-proxy-9hkns 1/1 Running 0 37m
kube-scheduler-lin-1k8w1dv-vmh 1/1 Running 0 38m
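The mirroring approach described above (pull from docker.io, retag, push to the internal registry) can be sketched as a helper that only prints the docker commands, so they can be reviewed before anything runs. The function name mirror_commands is hypothetical, and it assumes the upstream images live under calico/ on Docker Hub and keep the same path in the private registry.

```shell
# Print (not run) the docker commands that copy Calico images from
# Docker Hub into a private registry.
# Usage: mirror_commands REGISTRY TAG IMAGE_NAME...
mirror_commands() {
    registry="$1"
    tag="$2"
    shift 2
    for name in "$@"; do
        src="calico/$name:$tag"
        dst="$registry/calico/$name:$tag"
        echo "docker pull $src"
        echo "docker tag $src $dst"
        echo "docker push $dst"
    done
}
```

With the names used in this thread, mirror_commands private-repo v3.11.2 node cni kube-controllers pod2daemon-flexvol prints twelve commands; piping that output to sh executes them. The image fields in calico.yaml still need to point at the private registry afterwards.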

unable to recognize "kube-flannel.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"

I am running a K8s master (Ubuntu 16.04) and a node (Ubuntu 16.04) on Hyper-V VMs. I am neither able to join the node, nor are the coredns pods ready.
On k8s Worker Node:
admin1@POC-k8s-node1:~$ sudo kubeadm join 192.168.137.2:6443 --token s03usq.lrz343lolmrz00lf --discovery-token-ca-cert-hash sha256:5c6b88a78e7b303debda447fa6f7fb48e3746bedc07dc2a518fbc80d48f37ba4 --ignore-preflight-errors=all
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 19.03.5. Latest validated version: 18.09
[WARNING Port-10250]: Port 10250 is in use
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.16" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Activating the kubelet service
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.
error execution phase kubelet-start: error uploading crisocket: timed out waiting for the condition
To see the stack trace of this error execute with --v=5 or higher
admin1@POC-k8s-node1:~$ journalctl -u kubelet -f
Nov 21 05:28:15 POC-k8s-node1 kubelet[55491]: E1121 05:28:15.784713 55491 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:459: Failed to list *v1.Node: Unauthorized
Nov 21 05:28:15 POC-k8s-node1 kubelet[55491]: E1121 05:28:15.827982 55491 kubelet.go:2267] node "poc-k8s-node1" not found
Nov 21 05:28:15 POC-k8s-node1 kubelet[55491]: E1121 05:28:15.928413 55491 kubelet.go:2267] node "poc-k8s-node1" not found
Nov 21 05:28:15 POC-k8s-node1 kubelet[55491]: E1121 05:28:15.988489 55491 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.RuntimeClass: Unauthorized
Nov 21 05:28:16 POC-k8s-node1 kubelet[55491]: E1121 05:28:16.029295 55491 kubelet.go:2267] node "poc-k8s-node1" not found
Nov 21 05:28:16 POC-k8s-node1 kubelet[55491]: E1121 05:28:16.129571 55491 kubelet.go:2267] node "poc-k8s-node1" not found
Nov 21 05:28:16 POC-k8s-node1 kubelet[55491]: E1121 05:28:16.187178 55491 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSIDriver: Unauthorized
Nov 21 05:28:16 POC-k8s-node1 kubelet[55491]: E1121 05:28:16.230227 55491 kubelet.go:2267] node "poc-k8s-node1" not found
Nov 21 05:28:16 POC-k8s-node1 kubelet[55491]: E1121 05:28:16.330777 55491 kubelet.go:2267] node "poc-k8s-node1" not found
Nov 21 05:28:16 POC-k8s-node1 kubelet[55491]: E1121 05:28:16.386758 55491 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Unauthorized
Nov 21 05:28:16 POC-k8s-node1 kubelet[55491]: E1121 05:28:16.431420 55491 kubelet.go:2267] node "poc-k8s-node1" not found
root@POC-k8s-node1:/home/admin1# journalctl -xe -f
Nov 21 06:30:45 POC-k8s-node1 kubelet[75467]: E1121 06:30:45.670520 75467 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Unauthorized
Nov 21 06:30:45 POC-k8s-node1 kubelet[75467]: E1121 06:30:45.691050 75467 kubelet.go:2267] node "poc-k8s-node1" not found
Nov 21 06:30:45 POC-k8s-node1 kubelet[75467]: E1121 06:30:45.791249 75467 kubelet.go:2267] node "poc-k8s-node1" not found
Nov 21 06:30:45 POC-k8s-node1 kubelet[75467]: E1121 06:30:45.866004
On K8s Master:
root@POC-k8s-master:~# kubeadm config images pull
[config/images] Pulled k8s.gcr.io/kube-apiserver:v1.16.3
[config/images] Pulled k8s.gcr.io/kube-controller-manager:v1.16.3
[config/images] Pulled k8s.gcr.io/kube-scheduler:v1.16.3
[config/images] Pulled k8s.gcr.io/kube-proxy:v1.16.3
[config/images] Pulled k8s.gcr.io/pause:3.1
[config/images] Pulled k8s.gcr.io/etcd:3.3.15-0
[config/images] Pulled k8s.gcr.io/coredns:1.6.2
root@POC-k8s-master:~# export KUBECONFIG=/etc/kubernetes/admin.conf
root@POC-k8s-master:~# sysctl net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-iptables = 1
root@POC-k8s-master:~# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-5644d7b6d9-7xk42 0/1 Pending 0 91s
kube-system coredns-5644d7b6d9-mbrlx 0/1 Pending 0 91s
kube-system etcd-poc-k8s-master 1/1 Running 0 51s
kube-system kube-apiserver-poc-k8s-master 1/1 Running 0 32s
kube-system kube-controller-manager-poc-k8s-master 1/1 Running 0 47s
kube-system kube-proxy-xqb2d 1/1 Running 0 91s
kube-system kube-scheduler-poc-k8s-master 1/1 Running 0 38s
root@POC-k8s-master:~# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/c5d10c8/Documentation/kube-flannel.yml
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
unable to recognize "https://raw.githubusercontent.com/coreos/flannel/c5d10c8/Documentation/kube-flannel.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"
unable to recognize "https://raw.githubusercontent.com/coreos/flannel/c5d10c8/Documentation/kube-flannel.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"
unable to recognize "https://raw.githubusercontent.com/coreos/flannel/c5d10c8/Documentation/kube-flannel.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"
unable to recognize "https://raw.githubusercontent.com/coreos/flannel/c5d10c8/Documentation/kube-flannel.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"
unable to recognize "https://raw.githubusercontent.com/coreos/flannel/c5d10c8/Documentation/kube-flannel.yml": no matches for kind "DaemonSet" in version "extensions/v1beta1"
It seems you're using Kubernetes 1.16, where the DaemonSet API group changed to apps/v1 and extensions/v1beta1 is no longer served.
Update the link to this:
https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
There is also an issue about this:
https://github.com/kubernetes/website/issues/16441
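Before applying a manifest to a 1.16 or newer cluster, it can be scanned for the API version that triggers this error. The following is a rough sketch, not a YAML parser: the function name count_deprecated_daemonsets is made up, and it assumes apiVersion and kind appear unindented at the top of each document, with documents separated by --- lines.

```shell
# Read a multi-document manifest on stdin and print how many documents
# declare kind DaemonSet under apiVersion extensions/v1beta1.
count_deprecated_daemonsets() {
    awk '
        # Called at every document boundary: count the document if it
        # had both markers, then reset the per-document flags.
        function close_doc() { if (api && ds) n++; api = 0; ds = 0 }
        /^---/ { close_doc() }
        /^apiVersion:[ \t]*extensions\/v1beta1[ \t]*$/ { api = 1 }
        /^kind:[ \t]*DaemonSet[ \t]*$/ { ds = 1 }
        END { close_doc(); print n + 0 }
    '
}
```

A manifest can be piped in with curl -sL followed by its URL, or with cat for a local file. Note that apps/v1 also requires spec.selector on a DaemonSet, so switching to a manifest written for apps/v1 (as the updated link does) is safer than editing only the apiVersion line.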
Resolved the first part of the question by running "kubeadm reset" on the node, after which the join command worked. This only became possible once the second part of the question had been resolved, so thanks a lot, @Alireza David.

How to debug when Kubernetes nodes are in 'Not Ready' state

I initialized the master node and added 2 worker nodes, but only the master and one of the worker nodes show up when I run the following command:
kubectl get nodes
Also, both of these nodes are in the 'Not Ready' state.
What steps should I take to understand what the problem could be?
I can ping all the nodes from each of the other nodes.
The version of Kubernetes is 1.8.
The OS is CentOS 7.
I used the following repo to install Kubernetes:
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=http://yum.kubernetes.io/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=0
repo_gpgcheck=0
EOF
yum install kubelet kubeadm kubectl kubernetes-cni
First, describe nodes and see if it reports anything:
$ kubectl describe nodes
Look for conditions, capacity and allocatable:
Conditions:
Type Status
---- ------
OutOfDisk False
MemoryPressure False
DiskPressure False
Ready True
Capacity:
cpu: 2
memory: 2052588Ki
pods: 110
Allocatable:
cpu: 2
memory: 1950188Ki
pods: 110
If everything is alright here, SSH into the node and check the kubelet logs for anything it reports, such as certificate errors or authentication errors.
If kubelet is running as a systemd service, you can use
$ journalctl -u kubelet
Steps to debug:
If you face any issue in Kubernetes, the first step is to check whether Kubernetes' own system pods are running fine.
Command to check: kubectl get pods -n kube-system
If you see any pod crashing, check its logs.
If a node is in the NotReady state, check the network pod logs.
If you cannot resolve it with the above, follow the steps below:
kubectl get nodes # Check which node is not in the Ready state
kubectl describe node nodename # nodename is the node that is not in the Ready state
ssh to that node
systemctl status kubelet # Make sure kubelet is running
systemctl status docker # Make sure the docker service is running
journalctl -u kubelet # Check the logs in depth
Most probably you will find the error here. After fixing it, restart kubelet with the commands below:
systemctl daemon-reload
systemctl restart kubelet
If you still haven't found the root cause, check the following:
Make sure your node has enough disk space and memory. Check the space in the /var directory especially.
Commands to check: df -kh, free -m
Verify CPU utilization with the top command, and make sure no process is using an unexpected amount of memory.
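The first of the steps above, finding which node is not Ready, can be scripted by filtering the output of kubectl get nodes. A minimal sketch, with a hypothetical function name, that assumes the default column layout (NAME first, STATUS second):

```shell
# Read `kubectl get nodes` output on stdin, skip the header row, and
# print the NAME of every node whose STATUS is not exactly "Ready".
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Typical use is kubectl get nodes | not_ready_nodes, followed by kubectl describe node for each name it prints. Because the comparison is exact, a cordoned node with status Ready,SchedulingDisabled is listed as well.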
I was having a similar issue for a different reason:
Error:
cord@node1:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node1 Ready master 17h v1.13.5
node2 Ready <none> 17h v1.13.5
node3 NotReady <none> 9m48s v1.13.5
cord@node1:~$ kubectl describe node node3
Name: node3
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
Ready False Thu, 18 Apr 2019 01:15:46 -0400 Thu, 18 Apr 2019 01:03:48 -0400 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Addresses:
InternalIP: 192.168.2.6
Hostname: node3
cord@node3:~$ journalctl -u kubelet
Apr 18 01:24:50 node3 kubelet[54132]: W0418 01:24:50.649047 54132 cni.go:149] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: no 'plugins' key
Apr 18 01:24:50 node3 kubelet[54132]: W0418 01:24:50.649086 54132 cni.go:203] Unable to update cni config: No valid networks found in /etc/cni/net.d
Apr 18 01:24:50 node3 kubelet[54132]: E0418 01:24:50.649402 54132 kubelet.go:2192] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Apr 18 01:24:55 node3 kubelet[54132]: W0418 01:24:55.650816 54132 cni.go:149] Error loading CNI config list file /etc/cni/net.d/10-calico.conflist: error parsing configuration list: no 'plugins' key
Apr 18 01:24:55 node3 kubelet[54132]: W0418 01:24:55.650845 54132 cni.go:203] Unable to update cni config: No valid networks found in /etc/cni/net.d
Apr 18 01:24:55 node3 kubelet[54132]: E0418 01:24:55.651056 54132 kubelet.go:2192] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Apr 18 01:24:57 node3 kubelet[54132]: I0418 01:24:57.248519 54132 setters.go:72] Using node IP: "192.168.2.6"
Issue:
My file 10-calico.conflist was incorrect. I verified this against a different node and against the sample file "calico.conflist.template" in the same directory.
Resolution:
Fixing the file "10-calico.conflist" and restarting the service with "systemctl restart kubelet" resolved my issue:
NAME STATUS ROLES AGE VERSION
node1 Ready master 18h v1.13.5
node2 Ready <none> 18h v1.13.5
node3 Ready <none> 48m v1.13.5
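The kubelet warning in this answer ("no 'plugins' key") points at a config list file that lacks its top-level plugins array. A coarse text check for that is sketched below. The function name conflist_has_plugins is hypothetical, and the check only looks for the key, it does not validate the JSON.

```shell
# Succeeds if the given CNI config list file mentions a "plugins" key.
# This is a plain text match, not a JSON parser.
conflist_has_plugins() {
    grep -q '"plugins"[[:space:]]*:' "$1"
}
```

For example, conflist_has_plugins /etc/cni/net.d/10-calico.conflist || echo "plugins key missing" flags the broken file from this answer.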
I recently started using VMware Octant (https://github.com/vmware-tanzu/octant). I find it a better UI than the Kubernetes Dashboard. You can view the Kubernetes cluster and look at the details of the cluster and its pods, check the logs, and open a terminal into the pod(s).
I found that applying the pod network and rebooting both nodes did the trick for me:
kubectl apply -f [podnetwork].yaml
I recently had this issue. The known-issues page on the kind website (https://kind.sigs.k8s.io/docs/user/known-issues/) says the main problem mostly comes from a lack of memory allocated to Docker. They actually advise allocating 8GB to Docker; I went from 3GB to 6GB and it worked fine for me. This is the kind version I am running at the moment:
$ kind version
kind v0.10.0 go1.15.7 darwin/amd64
And this is the Docker version:
$ docker version
Client:
Cloud integration: 1.0.17
Version: 20.10.8
API version: 1.41
Go version: go1.16.6
Git commit: 3967b7d
Built: Fri Jul 30 19:55:20 2021
OS/Arch: darwin/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.8
API version: 1.41 (minimum version 1.12)
Go version: go1.16.6
Git commit: 75249d8
Built: Fri Jul 30 19:52:10 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.9
GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
runc:
Version: 1.0.1
GitCommit: v1.0.1-0-g4144b63
docker-init:
Version: 0.19.0
GitCommit: de40ad0
I hope this helps you or anyone facing the same issue.
Here is the output from kind:
$ k get node
NAME STATUS ROLES AGE VERSION
test2-control-plane Ready control-plane,master 4m42s v1.20.2

using network plugins "cni": cni config unintialized; Skipping pod

I created the Kubernetes cluster using kubeadm init.
I am getting error messages in /var/log/messages:
Oct 20 10:09:52 aws08 kubelet: I1020 10:09:52.015921 7116 docker_manager.go:1787] DNS ResolvConfPath exists: /var/lib/docker/containers/717adf7a8481637ac20a9ba103d8f97635a88bf05f18bd4299f0d164e48f2920/resolv.conf. Will attempt to add ndots option: options ndots:5
Oct 20 10:09:52 aws08 kubelet: I1020 10:09:52.015963 7116 docker_manager.go:2121] Calling network plugin cni to setup pod for kube-dns-2247936740-cjij4_kube-system(3b296413-96aa-11e6-8c40-02fff663a168)
Oct 20 10:09:52 aws08 kubelet: E1020 10:09:52.015982 7116 docker_manager.go:2127] Failed to setup network for pod "kube-dns-2247936740-cjij4_kube-system(3b296413-96aa-11e6-8c40-02fff663a168)" using network plugins "cni": cni config unintialized; Skipping pod
Oct 20 10:09:52 aws08 kubelet: I1020 10:09:52.018824 7116 docker_manager.go:1492] Killing container "717adf7a8481637ac20a9ba103d8f97635a88bf05f18bd4299f0d164e48f2920 kube-system/kube-dns-2247936740-cjij4" with 30 second grace period
The DNS pod is failing:
kube-system kube-dns-2247936740-j5rtc 0/3 ContainerCreating 21 1h
If I disable CNI, the DNS pod runs, but the DNS issue persists.
To disable CNI, comment out the KUBELET_NETWORK_ARGS line in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and restart the kubelet service:
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf --require-kubeconfig=true"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
# Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=100.64.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_EXTRA_ARGS=--v=4"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_EXTRA_ARGS
followed by:
sudo systemctl restart kubelet
I'm guessing that you forgot to set up the pod network.
From the documentation:
It is necessary to do this before you try to deploy any applications to your cluster, and before kube-dns will start up. Note also that kubeadm only supports CNI based networks and therefore kubenet based networks will not work.
You can install a pod network add-on with the following command:
kubectl apply -f <add-on.yaml>
Example:
kubectl create -f https://git.io/weave-kube
This installs the Weave Net add-on.
After you have done this, you might need to recreate the kube-dns pod.
CNI initialization should complete during kubelet initialization, so try restarting the kubelet service and make sure that the CNI configuration can be parsed correctly.