Single control-plane node shows NotReady - Kubernetes

What happened:
The master node does not show Ready anymore. Maybe that happened after a failed update (I downloaded kubeadm and kubelet in a far too high version).
s-rtk8s01 Ready Node 2y120d v1.14.1
s-rtk8s02 Ready Node 2y173d v1.14.1
s-rtk8s03 Ready Node 2y174d v1.14.1
s-rtk8s04 Ready Node 2y174d v1.14.1
s-rtk8s05 Ready Node 2y174d v1.14.1
s-rtk8sma01 NotReady,SchedulingDisabled master 2y174d v1.14.1
The scheduler does not show up (after it was deleted forcefully) in the list of pods, but docker ps shows that the static pods are being started in the background. The pods in the kube-system namespace:
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-hvh6b 1/1 Running 56 288d
coredns-fb8b8dccf-x5r5h 1/1 Running 58 302d
etcd-s-rtk8sma01 1/1 Running 45 535d
kube-apiserver-s-rtk8sma01 1/1 Running 13 535d
kube-controller-manager-s-rtk8sma01 1/1 Running 7 485d
kube-flannel-ds-2fmj4 1/1 Running 6 485d
kube-flannel-ds-5g47f 1/1 Running 5 485d
kube-flannel-ds-5k27n 1/1 Running 5 485d
kube-flannel-ds-cj967 1/1 Running 8 485d
kube-flannel-ds-drjff 1/1 Running 9 485d
kube-flannel-ds-v4sfg 1/1 Running 5 485d
kube-proxy-6ngn6 1/1 Running 11 535d
kube-proxy-85g6c 1/1 Running 10 535d
kube-proxy-gd5jb 1/1 Running 13 535d
kube-proxy-grvsk 1/1 Running 11 535d
kube-proxy-lpht9 1/1 Running 13 535d
kube-proxy-pmdmj 0/1 Pending 0 25h
The systemd logs for the kubelet show the following (I see those errors about the hostname casing and an error about a missing mirror pod - maybe the scheduler?):
kubelet_node_status.go:94] Unable to register node "s-rtk8sma01" with API server: nodes "s-rtk8sma01" is forbidden: node "S-RTK8SMA01" is not allowed to modify node "s-rtk8sma01"
setters.go:739] Error getting volume limit for plugin kubernetes.io/azure-disk
setters.go:739] Error getting volume limit for plugin kubernetes.io/cinder
setters.go:739] Error getting volume limit for plugin kubernetes.io/aws-ebs
setters.go:739] Error getting volume limit for plugin kubernetes.io/gce-pd
Generated UID "56ba6ffcb6b23178170f8063052292ee" pod "kube-scheduler" from /etc/kubernetes/manifests/kube-scheduler.yaml
Generated Name "kube-scheduler-s-rtk8sma01" for UID "56ba6ffcb6b23178170f8063052292ee" from URL /etc/kubernetes/manifests/kube-scheduler.yaml
Using namespace "kube-system" for pod "kube-scheduler-s-rtk8sma01" from /etc/kubernetes/manifests/kube-scheduler.yaml
Reading config file "/etc/kubernetes/manifests/kube-scheduler.yaml_bck"
Generated UID "56ba6ffcb6b23178170f8063052292ee" pod "kube-scheduler" from /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Generated Name "kube-scheduler-s-rtk8sma01" for UID "56ba6ffcb6b23178170f8063052292ee" from URL /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Using namespace "kube-system" for pod "kube-scheduler-s-rtk8sma01" from /etc/kubernetes/manifests/kube-scheduler.yaml_bck
Setting pods for source file
anager.go:445] Static pod "56ba6ffcb6b23178170f8063052292ee" (kube-scheduler-s-rtk8sma01/kube-system) does not have a corresponding mirror pod; skipping
anager.go:464] Status Manager: syncPod in syncbatch. pod UID: "24db95fbbd2e618dc6ed589132ed7158"
docker ps shows
aec23e01ee2a 2c4adeb21b4f "etcd --advertise-cl…" 7 hours ago Up 7 hours k8s_etcd_etcd-s-rtk8sma01_kube-system_24db95fbbd2e618dc6ed589132ed7158_59
97910491f3b2 20a2d7035165 "/usr/local/bin/kube…" 26 hours ago Up 26 hours k8s_kube-proxy_kube-proxy-pmdmj_kube-system_3e807b5e-041d-11eb-a61a-001dd8b72689_0
37d87cdd8886 k8s.gcr.io/pause:3.1 "/pause" 26 hours ago Up 26 hours k8s_POD_kube-proxy-pmdmj_kube-system_3e807b5e-041d-11eb-a61a-001dd8b72689_0
83a8af0407e5 cfaa4ad74c37 "kube-apiserver --ad…" 39 hours ago Up 39 hours k8s_kube-apiserver_kube-apiserver-s-rtk8sma01_kube-system_57d405cdab537a9a32ce375f1242e4b5_1
85250c421db4 k8s.gcr.io/pause:3.1 "/pause" 39 hours ago Up 39 hours k8s_POD_kube-apiserver-s-rtk8sma01_kube-system_57d405cdab537a9a32ce375f1242e4b5_1
984a3628068c 3fa2504a839b "kube-scheduler --bi…" 40 hours ago Up 40 hours k8s_kube-scheduler_kube-scheduler-s-rtk8sma01_kube-system_56ba6ffcb6b23178170f8063052292ee_7
4d5446906cc5 efb3887b411d "kube-controller-man…" 40 hours ago Up 40 hours k8s_kube-controller-manager_kube-controller-manager-s-rtk8sma01_kube-system_ffbb7c0e6913f72111f95f08ad36e944_3
544423226bed k8s.gcr.io/pause:3.1 "/pause" 40 hours ago Up 40 hours k8s_POD_kube-scheduler-s-rtk8sma01_kube-system_56ba6ffcb6b23178170f8063052292ee_4
a75feece56b5 k8s.gcr.io/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_etcd-s-rtk8sma01_kube-system_24db95fbbd2e618dc6ed589132ed7158_20
1b17cb3ef1c1 k8s.gcr.io/pause:3.1 "/pause" 2 days ago Up 2 days k8s_POD_kube-controller-manager-s-rtk8sma01_kube-system_ffbb7c0e6913f72111f95f08ad36e944_0
c7c7235ed0dc ff281650a721 "/opt/bin/flanneld -…" 2 months ago Up 2 months k8s_kube-flannel_kube-flannel-ds-v4sfg_kube-system_bc432e78-878f-11e9-9c4b-001dd8b72689_8
d56fe3708565 k8s.gcr.io/pause:3.1 "/pause" 2 months ago Up 2 months k8s_POD_kube-flannel-ds-v4sfg_kube-system_bc432e78-878f-11e9-9c4b-001dd8b72689_7
What you expected to happen:
The master becomes Ready again, and the static pods and DaemonSets are created again, so I can start to upgrade the cluster.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
I am really lost at this point. I have tried for many hours to find a solution by myself and hope to get a little help from the experts, to understand the problem and maybe find some kind of workaround.
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration:
OnPremise
OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Kernel (e.g. uname -a):
Linux S-RTK8SMA01 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Network plugin and version (if this is a network-related bug):
flannel quay.io/coreos/flannel:v0.11.0-amd64
Does anybody know how to fix those mirror pod problems, and how I can fix the problem with the node name casing?
What I have tried so far: I started the kubelet with a hostname override, but this did not have any effect.
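A minimal sketch, assuming the standard kubeadm file layout, of how to check which identity the kubelet is actually using versus the Node object (the uname output above shows the kernel hostname is the uppercase S-RTK8SMA01, while the Node object is lowercase):
hostnamectl set-hostname s-rtk8sma01   # align the kernel hostname with the lowercase node name
# check which system:node:<name> identity the kubelet authenticates as
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -subject
systemctl restart kubelet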

Related

Minikube Kubernetes pending pod on AWS EC2

I've tried many times to install Kubernetes on the latest stable Debian release on an AWS EC2 instance (2 vCPU, 4 GB RAM, 10 GB HD).
I've also tried to install it now on Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-1084-aws x86_64) on an AWS EC2 instance with the same VM compute configuration.
I've installed docker, kubectl, docker-cri, crictl and minikube, but I have an issue with the Kubernetes node not being ready and then pending pods. The blocking point for me is the CNI, as the coredns pods are pending and I see a few strange things in the logs, but I do not know how to solve them.
I've also tried to install Calico, as you will see from the calico pods. It's the first time I install Kubernetes and Minikube.
Minikube is started with the following command: minikube start --vm-driver=none
minikube version: v1.27.1
root@awsec2:~# minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
root@awsec2:~# docker version
Client:
Version: 20.10.7
API version: 1.41
Go version: go1.13.8
Git commit: 20.10.7-0ubuntu5~18.04.3
Built: Mon Nov 1 01:04:14 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
root@ip-172-31-37-142:~# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-awsec2-ip NotReady control-plane 10h v1.25.2 172.31.37.142 Ubuntu 18.04.6 LTS 5.4.0-1084-aws docker://20.10.7
root@aws:~# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default hello-minikube 0/1 Pending 0 10h
kube-system coredns-565d847f94-kmbdr 0/1 Pending 0 11h
kube-system etcd-ip-172-31-37-142 1/1 Running 1 (10h ago) 11h
kube-system kube-apiserver-ip-172-31-37-142 1/1 Running 1 (10h ago) 11h
kube-system kube-controller-manager-ip-172-31-37-142 1/1 Running 1 (10h ago) 11h
kube-system kube-proxy-dff99 1/1 Running 1 (10h ago) 11h
kube-system kube-scheduler-ip-172-31-37-142 1/1 Running 1 (10h ago) 11h
kube-system storage-provisioner 0/1 Pending 0 11h
tigera-operator tigera-operator-6675dc47f4-gngrn 1/1 Running 2 (7m ago) 10h
In the minikube logs output I've seen this error, but I do not know how to solve it:
==> kubelet <==
-- Logs begin at Tue 2022-10-18 21:26:09 UTC, end at Wed 2022-10-19 08:57:51 UTC. --
Oct 19 08:52:52 ip-172-31-37-142 kubelet[17361]: E1019 08:52:52.018304 17361 kubelet.go:2373] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
If someone could explain how to correct that, as it should be a very standard issue.
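The usual cause of "cni config uninitialized" is simply that no CNI add-on has been installed yet. A hedged sketch of the two common ways to get one (the flannel manifest URL is the upstream one and may change over time):
# option 1: let minikube deploy a CNI itself
minikube start --vm-driver=none --cni=flannel
# option 2: apply a CNI manifest directly
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml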
Now, I've found a workaround that will work for my test.
I've run Minikube with Kubernetes 1.23 instead of 1.24.
I've run this command:
minikube start --vm-driver=none --kubernetes-version=v1.23.0
I didn't set up the Calico CNI for now, and my node and a hello-world pod are running correctly.
I will test it like that, then I will try to upgrade to Kube 1.24.

How can I detect CNI type/version in Kubernetes cluster?

Is there a Kubectl command or config map in the cluster that can help me find what CNI is being used?
First of all, checking for the presence of exactly one config file in /etc/cni/net.d is a good start:
$ ls /etc/cni/net.d
10-flannel.conflist
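If jq is available, the plugin types can be read straight out of that file (a sketch, assuming the flannel conflist shown above):
$ jq -r '.plugins[].type' /etc/cni/net.d/10-flannel.conflist   # prints e.g. "flannel" and "portmap" for the file above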
ip a s or ifconfig is also helpful for checking the existence of network interfaces; e.g. the flannel CNI should set up a flannel.1 interface:
$ ip a s flannel.1
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether de:cb:d1:d6:e3:e7 brd ff:ff:ff:ff:ff:ff
inet 10.244.1.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::dccb:d1ff:fed6:e3e7/64 scope link
valid_lft forever preferred_lft forever
When creating a cluster, the CNI is typically installed using:
kubectl apply -f <add-on.yaml>
thus the networking pods will be called kube-flannel*, kube-calico* etc., depending on your networking configuration.
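So a quick, non-authoritative way to spot the CNI from the API side is to grep the pods for the usual plugin names:
$ kubectl get pods --all-namespaces -o wide | grep -Ei 'flannel|calico|canal|weave|cilium'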
Then crictl will help you inspect running pods and containers:
crictl pods
On a control-plane node in a healthy cluster you should have all pods in the Ready state:
crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
dc90dd87e18cf 3 minutes ago Ready coredns-6d4b75cb6d-r2j9s kube-system 0 (default)
d1ab9d0aa815a 3 minutes ago Ready kubernetes-dashboard-cd4778d69-xmtkz kube-system 0 (default)
0c151fdd92e71 3 minutes ago Ready coredns-6d4b75cb6d-bn8hr kube-system 0 (default)
40f18ce56f776 4 minutes ago Ready kube-flannel-ds-d4fd7 kube-flannel 0 (default)
0e390a68380a5 4 minutes ago Ready kube-proxy-r6cq2 kube-system 0 (default)
cd93e58d3bf70 4 minutes ago Ready kube-scheduler-c01 kube-system 0 (default)
266a33aa5c241 4 minutes ago Ready kube-apiserver-c01 kube-system 0 (default)
0910a7a73f5aa 4 minutes ago Ready kube-controller-manager-c01 kube-system 0 (default)
If your cluster is properly configured, you should be able to list containers using kubectl:
kubectl get pods -n kube-system
If kubectl is not working (kube-apiserver is not running), you can fall back to crictl.
On an unhealthy cluster, kubectl will show pods in the CrashLoopBackOff state. The crictl pods command will give you a similar picture, only displaying pods from a single node. Also check the documentation for common CNI errors.
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6d4b75cb6d-brb9d 0/1 ContainerCreating 0 25m
coredns-6d4b75cb6d-pcrcp 0/1 ContainerCreating 0 25m
kube-apiserver-cm01 1/1 Running 27 (18m ago) 26m
kube-apiserver-cm02 0/1 Running 31 (8m11s ago) 23m
kube-apiserver-cm03 0/1 CrashLoopBackOff 33 (2m22s ago) 26m
kube-controller-manager-cm01 0/1 CrashLoopBackOff 13 (50s ago) 24m
kube-controller-manager-cm02 0/1 CrashLoopBackOff 7 (15s ago) 24m
kube-controller-manager-cm03 0/1 CrashLoopBackOff 15 (3m45s ago) 26m
kube-proxy-2dvfg 0/1 CrashLoopBackOff 8 (97s ago) 25m
kube-proxy-7gnnr 0/1 CrashLoopBackOff 8 (39s ago) 25m
kube-proxy-cqmvz 0/1 CrashLoopBackOff 8 (19s ago) 25m
kube-scheduler-cm01 1/1 Running 28 (7m15s ago) 12m
kube-scheduler-cm02 0/1 CrashLoopBackOff 28 (4m45s ago) 18m
kube-scheduler-cm03 1/1 Running 36 (107s ago) 26m
kubernetes-dashboard-cd4778d69-g8jmf 0/1 ContainerCreating 0 2m27s
crictl ps will give you the containers (like docker ps); watch for a high number of attempts:
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
d54c6f1e45dea 2ae1ba6417cbc 2 seconds ago Running kube-proxy 1 347fef3ae1e98 kube-proxy-7gnnr
d6048ef9e30c7 d521dd763e2e3 41 seconds ago Running kube-apiserver 27 640658b58d1ae kube-apiserver-cm03
b6b8c7a24914e 3a5aa3a515f5d 41 seconds ago Running kube-scheduler 28 c7b710a0acf30 kube-scheduler-cm03
b0a480d2c1baf 586c112956dfc 42 seconds ago Running kube-controller-manager 8 69504853ab81b kube-controller-manager-cm03
and check logs using
crictl logs d54c6f1e45dea
Last but not least, the /opt/cni/bin/ path usually contains the binaries required for networking. Another path might be defined in the add-on setup or the CNI config.
$ ls /opt/cni/bin/
bandwidth bridge dhcp firewall flannel host-device host-local ipvlan loopback macvlan portmap ptp sbr static tuning vlan
Finally, crictl reads the /etc/crictl.yaml config; you should set the proper runtime and image endpoints to match your container runtime:
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 10
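If your runtime is not containerd, the socket differs; for example, CRI-O listens on its own socket (a sketch using the CRI-O default path, which can also be passed on the command line instead of editing the config):
$ crictl --runtime-endpoint unix:///var/run/crio/crio.sock pods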

Kubernetes can't access pod in multi worker nodes

I was following a tutorial on YouTube and the guy said that if you deploy your application in a multi-node setup and your service is of type NodePort, you don't have to worry about where your pod gets scheduled. You can access it with any node's IP address, like
worker1IP:servicePort or worker2IP:servicePort or workerNIP:servicePort
But I tried just now and this is not the case; I can only access the pod on the node where it is scheduled and deployed. Is this correct behavior?
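For reference, a minimal sketch of the setup the tutorial describes (the deployment and service names here are hypothetical):
kubectl create deployment hello --image=nginx
kubectl expose deployment hello --type=NodePort --port=80
kubectl get svc hello    # note the allocated nodePort, e.g. 80:31xxx/TCP
# with a healthy CNI and kube-proxy on every node, this should answer from any worker:
curl http://<anyWorkerIP>:<nodePort>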
kubectl version --short
> Client Version: v1.18.5
> Server Version: v1.18.5
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66bff467f8-6pt8s 0/1 Running 288 7d22h
coredns-66bff467f8-t26x4 0/1 Running 288 7d22h
etcd-redhat-master 1/1 Running 16 7d22h
kube-apiserver-redhat-master 1/1 Running 17 7d22h
kube-controller-manager-redhat-master 1/1 Running 19 7d22h
kube-flannel-ds-amd64-9mh6k 1/1 Running 16 5d22h
kube-flannel-ds-amd64-g2k5c 1/1 Running 16 5d22h
kube-flannel-ds-amd64-rnvgb 1/1 Running 14 5d22h
kube-proxy-gf8zk 1/1 Running 16 7d22h
kube-proxy-wt7cp 1/1 Running 9 7d22h
kube-proxy-zbw4b 1/1 Running 9 7d22h
kube-scheduler-redhat-master 1/1 Running 18 7d22h
weave-net-6jjd8 2/2 Running 34 7d22h
weave-net-ssqbz 1/2 CrashLoopBackOff 296 7d22h
weave-net-ts2tj 2/2 Running 34 7d22h
[root@redhat-master deployments]# kubectl logs weave-net-ssqbz -c weave -n kube-system
DEBU: 2020/07/05 07:28:04.661866 [kube-peers] Checking peer "b6:01:79:66:7d:d3" against list &{[{e6:c9:b2:5f:82:d1 redhat-master} {b2:29:9a:5b:89:e9 redhat-console-1} {e2:95:07:c8:a0:90 redhat-console-2}]}
Peer not in list; removing persisted data
INFO: 2020/07/05 07:28:04.924399 Command line options: map[conn-limit:200 datapath:datapath db-prefix:/weavedb/weave-net docker-api: expect-npc:true host-root:/host http-addr:127.0.0.1:6784 ipalloc-init:consensus=2 ipalloc-range:10.32.0.0/12 metrics-addr:0.0.0.0:6782 name:b6:01:79:66:7d:d3 nickname:redhat-master no-dns:true port:6783]
INFO: 2020/07/05 07:28:04.924448 weave 2.6.5
FATA: 2020/07/05 07:28:04.938587 Existing bridge type "bridge" is different than requested "bridged_fastdp". Please do 'weave reset' and try again
Update:
So basically the issue is that iptables is deprecated in RHEL 8. But after downgrading my OS to RHEL 7, I can still only access the NodePort on the node where the pod is deployed.

Kubeadm Failed to create SubnetManager: error retrieving pod spec for kube-system

No matter what I do, it seems I cannot get rid of this problem. I have installed Kubernetes using kubeadm many times quite successfully; however, adding a v1.16.0 node is giving me a heck of a headache.
O/S: Ubuntu 18.04.3 LTS
Kubernetes version: v1.16.0
Kubeadm version: Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:34:01Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"
A query of the cluster shows:
NAME STATUS ROLES AGE VERSION
kube-apiserver-1 Ready master 110d v1.15.0
kube-apiserver-2 Ready master 110d v1.15.0
kube-apiserver-3 Ready master 110d v1.15.0
kube-node-1 Ready <none> 110d v1.15.0
kube-node-2 Ready <none> 110d v1.15.0
kube-node-3 Ready <none> 110d v1.15.0
kube-node-4 Ready <none> 110d v1.16.0
kube-node-5 Ready,SchedulingDisabled <none> 3m28s v1.16.0
kube-node-databases Ready <none> 110d v1.15.0
I have temporarily disabled scheduling to the node until I can fix this problem. A query of the pod status in the kube-system namespace shows the problem:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-55zjs 1/1 Running 128 21d
coredns-fb8b8dccf-kzrpc 1/1 Running 144 21d
kube-flannel-ds-amd64-29xp2 1/1 Running 11 110d
kube-flannel-ds-amd64-hp7nq 1/1 Running 14 110d
kube-flannel-ds-amd64-hvdpf 0/1 CrashLoopBackOff 5 8m28s
kube-flannel-ds-amd64-jhhlk 1/1 Running 11 110d
kube-flannel-ds-amd64-k6dzc 1/1 Running 2 110d
kube-flannel-ds-amd64-lccxl 1/1 Running 21 110d
kube-flannel-ds-amd64-nnn7g 1/1 Running 14 110d
kube-flannel-ds-amd64-shss5 1/1 Running 7 110d
kubectl -n kube-system logs -f kube-flannel-ds-amd64-hvdpf
I1002 01:13:22.136379 1 main.go:514] Determining IP address of default interface
I1002 01:13:22.136823 1 main.go:527] Using interface with name ens3 and address 192.168.5.46
I1002 01:13:22.136849 1 main.go:544] Defaulting external address to interface address (192.168.5.46)
E1002 01:13:52.231471 1 main.go:241] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-hvdpf': Get https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-amd64-hvdpf: dial tcp 10.96.0.1:443: i/o timeout
Although I found a few hits about iptables issues and kernel routing, I don't understand why previous versions installed without a hitch but this version is giving me such a problem.
I have installed and destroyed this node quite a few times, yet the result is always the same.
Is anyone else having this issue, or does anyone have a solution?
This occurs when it is not able to look up the host. Add the below after the name: POD_NAMESPACE entry:
- name: KUBERNETES_SERVICE_HOST
value: "10.220.64.186" #ip address of the host where kube-apiservice is running
- name: KUBERNETES_SERVICE_PORT
value: "6443"
According to the documentation about the version skew policy:
kubelet
kubelet must not be newer than kube-apiserver, and may be up to two minor versions older.
Example:
kube-apiserver is at 1.13
kubelet is supported at 1.13, 1.12, and 1.11
That means that worker nodes with version v1.16.0 are not supported on a master node with version v1.15.0.
To fix this issue I recommend reinstalling the node with version v1.15.0 to match the rest of the cluster.
Optionally, you can upgrade the whole cluster to v1.16.1; however, there are some problems with running flannel as the network plugin on it at the moment. Please review this guide from the documentation before proceeding.
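A sketch of pinning the node back to v1.15.0 on Ubuntu, assuming the node was installed from the Kubernetes apt packages (the package revision suffix may differ):
apt-mark unhold kubelet kubeadm kubectl
apt-get install -y kubelet=1.15.0-00 kubeadm=1.15.0-00 kubectl=1.15.0-00
apt-mark hold kubelet kubeadm kubectl
systemctl restart kubelet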

Troubleshooting a NotReady node

I have one node that is giving me some trouble at the moment. I have not found a solution yet, but that might be a skill-level problem, Google coming up empty, or some unsolvable issue I have stumbled on. The latter is highly unlikely.
kubectl version v1.8.5
docker version 1.12.6
Doing some normal maintenance on my nodes I noticed the following:
NAME STATUS ROLES AGE VERSION
ip-192-168-4-14.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-143.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-174.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-182.ourdomain.pro Ready <none> 46d v1.8.5
ip-192-168-4-221.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-249.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-251.ourdomain.pro NotReady <none> 206d v1.8.5
On the NotReady node, I am unable to attach or exec into anything, which seems normal for a node in a NotReady state, unless I am misreading it. I am not able to look at any specific logs on that node for the same reason.
At this point, I restarted kubelet and attached myself to the logs simultaneously to see if anything out of the ordinary would appear.
I have attached the things I spent a day Googling, but I cannot confirm whether they are actually connected to the problem.
ERROR 1
unable to connect to Rkt api service
We are not using this so I put this on the ignore list.
ERROR 2
unable to connect to CRI-O api service
We are not using this so I put this on the ignore list.
ERROR 3
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
I have not been able to exclude this as a potential pitfall, but the things I have found thus far do not seem to relate to the version I am running.
ERROR 4
skipping pod synchronization - [container runtime is down PLEG is not healthy
I do not have an answer for this one except for the fact that the garbage collection error above appears a second time after this message.
ERROR 5
Registration of the rkt container factory failed
Not using this so it should fail unless I am mistaken.
ERROR 6
Registration of the crio container factory failed
Not using this so it should fail unless, again, I am mistaken.
ERROR 7
28087 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-rt7qp_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container
I found a GitHub ticket for this one, but it seems it has been fixed, so I am not sure how it relates.
ERROR 8
28087 kubelet_node_status.go:791] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-05-06 05:00:40.664331773 +0000 UTC LastTransitionTime:2019-05-06 05:00:40.664331773 +0000 UTC Reason:KubeletNotReady Message:container runtime is down}
And here the node goes into NotReady.
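Since ERROR 4 and ERROR 8 both point at the container runtime, a quick check (a sketch, assuming Docker as the runtime, which matches the docker 1.12.6 version above) would be:
systemctl status docker
journalctl -u docker --since "1 hour ago" | tail -n 50
docker ps    # if this hangs or errors, the "container runtime is down" message is consistent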
Last log messages and status
systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2019-05-06 05:00:39 UTC; 1h 58min ago
Docs: http://kubernetes.io/docs/
Main PID: 28087 (kubelet)
Tasks: 21
Memory: 42.3M
CGroup: /system.slice/kubelet.service
└─28087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manife...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310305 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310330 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310359 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "varl...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310385 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "cali...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310408 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "kube...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310435 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310456 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310480 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "ca-c...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310504 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-...
May 06 05:14:29 kube-master-1 kubelet[28087]: E0506 05:14:29.848530 28087 helpers.go:468] PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs
Here is the kubectl get po -o wide output.
NAME READY STATUS RESTARTS AGE IP NODE
docker-image-prune-fhjkl 1/1 Running 4 213d 100.96.67.87 ip-192-168-4-249
docker-image-prune-ltfpf 1/1 Running 4 213d 100.96.152.74 ip-192-168-4-143
docker-image-prune-nmg29 1/1 Running 3 213d 100.96.22.236 ip-192-168-4-221
docker-image-prune-pdw5h 1/1 Running 7 213d 100.96.90.116 ip-192-168-4-174
docker-image-prune-swbhc 1/1 Running 0 46d 100.96.191.129 ip-192-168-4-182
docker-image-prune-vtsr4 1/1 NodeLost 1 206d 100.96.182.197 ip-192-168-4-251
fluentd-es-4bgdz 1/1 Running 6 213d 192.168.4.249 ip-192-168-4-249
fluentd-es-fb4gw 1/1 Running 7 213d 192.168.4.14 ip-192-168-4-14
fluentd-es-fs8gp 1/1 Running 6 213d 192.168.4.143 ip-192-168-4-143
fluentd-es-k572w 1/1 Running 0 46d 192.168.4.182 ip-192-168-4-182
fluentd-es-lpxhn 1/1 Running 5 213d 192.168.4.174 ip-192-168-4-174
fluentd-es-pjp9w 1/1 Unknown 2 206d 192.168.4.251 ip-192-168-4-251
fluentd-es-wbwkp 1/1 Running 4 213d 192.168.4.221 ip-192-168-4-221
grafana-76c7dbb678-p8hzb 1/1 Running 3 213d 100.96.90.115 ip-192-168-4-174
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-g8xmp 2/2 Running 2 101d 100.96.22.234 ip-192-168-4-221
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-tvp4m 2/2 Running 2 101d 100.96.22.235 ip-192-168-4-221
prometheus-65b4b68d97-82vr7 1/1 Running 3 213d 100.96.90.87 ip-192-168-4-174
pushgateway-79f575d754-75l6r 1/1 Running 3 213d 100.96.90.83 ip-192-168-4-174
rabbitmq-cluster-58db9b6978-g6ssb 2/2 Running 4 181d 100.96.90.117 ip-192-168-4-174
replicator-56x7v 1/1 Running 3 213d 100.96.90.84 ip-192-168-4-174
traefik-ingress-6dc9779596-6ghwv 1/1 Running 3 213d 100.96.90.85 ip-192-168-4-174
traefik-ingress-6dc9779596-ckzbk 1/1 Running 4 213d 100.96.152.73 ip-192-168-4-143
traefik-ingress-6dc9779596-sbt4n 1/1 Running 3 213d 100.96.22.232 ip-192-168-4-221
Output of kubectl get po -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP
calico-kube-controllers-78f554c7bb-s7tmj 1/1 Running 4 213d 192.168.4.14
calico-node-5cgc6 2/2 Running 9 213d 192.168.4.249
calico-node-bbwtm 2/2 Running 8 213d 192.168.4.14
calico-node-clwqk 2/2 NodeLost 4 206d 192.168.4.251
calico-node-d2zqz 2/2 Running 0 46d 192.168.4.182
calico-node-m4x2t 2/2 Running 6 213d 192.168.4.221
calico-node-m8xwk 2/2 Running 9 213d 192.168.4.143
calico-node-q7r7g 2/2 Running 8 213d 192.168.4.174
cluster-autoscaler-65d6d7f544-dpbfk 1/1 Running 10 207d 100.96.67.88
kube-apiserver-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-apiserver-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-apiserver-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-controller-manager-ip-192-168-4-14 1/1 Running 5 213d 192.168.4.14
kube-controller-manager-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-controller-manager-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-dns-545bc4bfd4-rt7qp 3/3 Running 13 213d 100.96.19.197
kube-proxy-2bn42 1/1 Running 0 46d 192.168.4.182
kube-proxy-95cvh 1/1 Running 4 213d 192.168.4.174
kube-proxy-bqrhw 1/1 NodeLost 2 206d 192.168.4.251
kube-proxy-cqh67 1/1 Running 6 213d 192.168.4.14
kube-proxy-fbdvx 1/1 Running 4 213d 192.168.4.221
kube-proxy-gcjxg 1/1 Running 5 213d 192.168.4.249
kube-proxy-mt62x 1/1 Running 4 213d 192.168.4.143
kube-scheduler-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-scheduler-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-scheduler-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kubernetes-dashboard-7c5d596d8c-q6sf2 1/1 Running 5 213d 100.96.22.230
tiller-deploy-6d9f596465-svpql 1/1 Running 3 213d 100.96.22.231
I am a bit lost at this point about where to go from here. Any suggestions are welcome.
Most likely the kubelet is down.
Share the output from the command below:
journalctl -u kubelet
Also share the output from the command below:
kubectl get po -n kube-system -o wide
It appears the node is not able to communicate with the control plane.
You can try the steps below (see the sketch after this list):
detach the node from the cluster (cordon the node, drain the node and finally delete the node)
reset the node
rejoin the node to the cluster as a fresh node
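A sketch of those steps with kubectl/kubeadm (the node name is taken from the listing above; the join token and hash are placeholders):
kubectl cordon ip-192-168-4-251.ourdomain.pro
kubectl drain ip-192-168-4-251.ourdomain.pro --ignore-daemonsets --delete-local-data
kubectl delete node ip-192-168-4-251.ourdomain.pro
# on the node itself:
kubeadm reset
kubeadm join <api-server>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
# on newer kubeadm versions, running 'kubeadm token create --print-join-command' on a master prints a fresh join command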