Troubleshooting a NotReady node - Kubernetes

I have one node that is giving me some trouble at the moment. I have not found a solution yet, which might be a skill-level problem, Google coming up empty, or some genuinely unsolvable issue. The latter is highly unlikely.
kubectl version v1.8.5
docker version 1.12.6
Doing some normal maintenance on my nodes I noticed the following:
NAME STATUS ROLES AGE VERSION
ip-192-168-4-14.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-143.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-174.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-182.ourdomain.pro Ready <none> 46d v1.8.5
ip-192-168-4-221.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-249.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-251.ourdomain.pro NotReady <none> 206d v1.8.5
On the NotReady node, I am unable to attach or exec into anything, which seems normal for a NotReady state unless I am misreading it. I am not able to look at any specific logs on that node for the same reason.
At this point, I restarted kubelet and attached myself to the logs simultaneously to see if anything out of the ordinary would appear.
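For reference, a minimal way to do that on the affected node (a sketch):
systemctl restart kubelet
journalctl -u kubelet -f    # follow the kubelet log from the restart onwards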
I have attached the errors I spent a day Googling, but I cannot confirm whether they are actually connected to the problem.
ERROR 1
unable to connect to Rkt api service
We are not using this, so I put it on the ignore list.
ERROR 2
unable to connect to CRI-O api service
We are not using this, so I put it on the ignore list.
ERROR 3
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
I have not been able to exclude this as a potential pitfall, but the things I have found thus far do not seem to relate to the version I am running.
ERROR 4
skipping pod synchronization - [container runtime is down PLEG is not healthy
I do not have an answer for this one except for the fact that the garbage collection error above appears a second time after this message.
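Since the message says the container runtime is down, one thing worth checking on the node is whether Docker itself is still responsive (a sketch, run on the NotReady node):
systemctl status docker
docker ps                            # if this hangs or errors, the runtime itself is wedged
journalctl -u docker --since "1 hour ago"   # recent Docker daemon logs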
ERROR 5
Registration of the rkt container factory failed
Not using this so it should fail unless I am mistaken.
ERROR 6
Registration of the crio container factory failed
Not using this so it should fail unless, again, I am mistaken.
ERROR 7
28087 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-rt7qp_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container
I found a GitHub ticket for this one, but it seems to be fixed, so I am not sure how it relates.
ERROR 8
28087 kubelet_node_status.go:791] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-05-06 05:00:40.664331773 +0000 UTC LastTransitionTime:2019-05-06 05:00:40.664331773 +0000 UTC Reason:KubeletNotReady Message:container runtime is down}
And here the node goes into NotReady.
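At this point it can also help to inspect the node object from a working master, since the Ready condition carries the kubelet's own reason and message (a sketch):
kubectl describe node ip-192-168-4-251.ourdomain.pro     # check the Conditions and Events sections
kubectl get node ip-192-168-4-251.ourdomain.pro -o yaml  # full status, including lastHeartbeatTime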
Last log messages and status
systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2019-05-06 05:00:39 UTC; 1h 58min ago
Docs: http://kubernetes.io/docs/
Main PID: 28087 (kubelet)
Tasks: 21
Memory: 42.3M
CGroup: /system.slice/kubelet.service
└─28087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manife...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310305 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310330 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310359 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "varl...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310385 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "cali...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310408 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "kube...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310435 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310456 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310480 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "ca-c...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310504 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-...
May 06 05:14:29 kube-master-1 kubelet[28087]: E0506 05:14:29.848530 28087 helpers.go:468] PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs
Here is the kubectl get po -o wide output.
NAME READY STATUS RESTARTS AGE IP NODE
docker-image-prune-fhjkl 1/1 Running 4 213d 100.96.67.87 ip-192-168-4-249
docker-image-prune-ltfpf 1/1 Running 4 213d 100.96.152.74 ip-192-168-4-143
docker-image-prune-nmg29 1/1 Running 3 213d 100.96.22.236 ip-192-168-4-221
docker-image-prune-pdw5h 1/1 Running 7 213d 100.96.90.116 ip-192-168-4-174
docker-image-prune-swbhc 1/1 Running 0 46d 100.96.191.129 ip-192-168-4-182
docker-image-prune-vtsr4 1/1 NodeLost 1 206d 100.96.182.197 ip-192-168-4-251
fluentd-es-4bgdz 1/1 Running 6 213d 192.168.4.249 ip-192-168-4-249
fluentd-es-fb4gw 1/1 Running 7 213d 192.168.4.14 ip-192-168-4-14
fluentd-es-fs8gp 1/1 Running 6 213d 192.168.4.143 ip-192-168-4-143
fluentd-es-k572w 1/1 Running 0 46d 192.168.4.182 ip-192-168-4-182
fluentd-es-lpxhn 1/1 Running 5 213d 192.168.4.174 ip-192-168-4-174
fluentd-es-pjp9w 1/1 Unknown 2 206d 192.168.4.251 ip-192-168-4-251
fluentd-es-wbwkp 1/1 Running 4 213d 192.168.4.221 ip-192-168-4-221
grafana-76c7dbb678-p8hzb 1/1 Running 3 213d 100.96.90.115 ip-192-168-4-174
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-g8xmp 2/2 Running 2 101d 100.96.22.234 ip-192-168-4-221
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-tvp4m 2/2 Running 2 101d 100.96.22.235 ip-192-168-4-221
prometheus-65b4b68d97-82vr7 1/1 Running 3 213d 100.96.90.87 ip-192-168-4-174
pushgateway-79f575d754-75l6r 1/1 Running 3 213d 100.96.90.83 ip-192-168-4-174
rabbitmq-cluster-58db9b6978-g6ssb 2/2 Running 4 181d 100.96.90.117 ip-192-168-4-174
replicator-56x7v 1/1 Running 3 213d 100.96.90.84 ip-192-168-4-174
traefik-ingress-6dc9779596-6ghwv 1/1 Running 3 213d 100.96.90.85 ip-192-168-4-174
traefik-ingress-6dc9779596-ckzbk 1/1 Running 4 213d 100.96.152.73 ip-192-168-4-143
traefik-ingress-6dc9779596-sbt4n 1/1 Running 3 213d 100.96.22.232 ip-192-168-4-221
Output of kubectl get po -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP
calico-kube-controllers-78f554c7bb-s7tmj 1/1 Running 4 213d 192.168.4.14
calico-node-5cgc6 2/2 Running 9 213d 192.168.4.249
calico-node-bbwtm 2/2 Running 8 213d 192.168.4.14
calico-node-clwqk 2/2 NodeLost 4 206d 192.168.4.251
calico-node-d2zqz 2/2 Running 0 46d 192.168.4.182
calico-node-m4x2t 2/2 Running 6 213d 192.168.4.221
calico-node-m8xwk 2/2 Running 9 213d 192.168.4.143
calico-node-q7r7g 2/2 Running 8 213d 192.168.4.174
cluster-autoscaler-65d6d7f544-dpbfk 1/1 Running 10 207d 100.96.67.88
kube-apiserver-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-apiserver-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-apiserver-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-controller-manager-ip-192-168-4-14 1/1 Running 5 213d 192.168.4.14
kube-controller-manager-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-controller-manager-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-dns-545bc4bfd4-rt7qp 3/3 Running 13 213d 100.96.19.197
kube-proxy-2bn42 1/1 Running 0 46d 192.168.4.182
kube-proxy-95cvh 1/1 Running 4 213d 192.168.4.174
kube-proxy-bqrhw 1/1 NodeLost 2 206d 192.168.4.251
kube-proxy-cqh67 1/1 Running 6 213d 192.168.4.14
kube-proxy-fbdvx 1/1 Running 4 213d 192.168.4.221
kube-proxy-gcjxg 1/1 Running 5 213d 192.168.4.249
kube-proxy-mt62x 1/1 Running 4 213d 192.168.4.143
kube-scheduler-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-scheduler-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-scheduler-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kubernetes-dashboard-7c5d596d8c-q6sf2 1/1 Running 5 213d 100.96.22.230
tiller-deploy-6d9f596465-svpql 1/1 Running 3 213d 100.96.22.231
I am a bit lost at this point as to where to go from here. Any suggestions are welcome.

Most likely the kubelet is down.
Share the output from the command below:
journalctl -u kubelet
Also share the output from the command below:
kubectl get po -n kube-system -o wide
It appears that the node is not able to communicate with the control plane.
You can try the steps below (a command sketch follows):
detach the node from the cluster (cordon the node, drain the node, and finally delete the node)
reset the node
rejoin the node to the cluster as a fresh node
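A rough sketch of those steps (the node name is taken from the question; the kubeadm join arguments are placeholders you would fill in with your own cluster's endpoint, token, and CA hash):
kubectl cordon ip-192-168-4-251.ourdomain.pro
kubectl drain ip-192-168-4-251.ourdomain.pro --ignore-daemonsets --delete-local-data --force
kubectl delete node ip-192-168-4-251.ourdomain.pro
# then on the node itself:
kubeadm reset
kubeadm join <master-endpoint>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>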

Related

deploy exceptionless on k8s! Error Back-off restarting failed container

I got the Exceptionless Helm chart; my values.yaml is https://github.com/mypublicuse/myfile/blob/main/el-values.yaml
I got the following errors:
1:
Error: INSTALLATION FAILED: Deployment.apps "exceptionless-elasticsearch" is invalid: spec.template.spec.initContainers[0].image: Required value
So I edited elasticsearch.yaml and added:
spec:
  initContainers:
    - name: sysctl
      image: mydockerhost/busybox:1.35
With that, helm install succeeds.
2: After helm install, I found:
exless-nfsclient-nfs-subdir-external-provisioner-7fc86846fmlbgz 1/1 Running 0 52m
exceptionless-redis-85956947f-7vkpg 1/1 Running 0 49m
exceptionless-app-6547d4d88d-2hkbg 1/1 Running 0 49m
exceptionless-elasticsearch-76f6cc9b9-2jgks 1/1 Running 0 49m
exceptionless-jobs-web-hooks-7bb9d7477c-kpmwv 0/1 CrashLoopBackOff 14 (2m53s ago) 49m
exceptionless-jobs-event-notifications-844cb87665-bd7bt 0/1 CrashLoopBackOff 14 (2m53s ago) 49m
exceptionless-jobs-mail-message-647d6bd897-s8jmq 0/1 CrashLoopBackOff 14 (2m55s ago) 49m
exceptionless-jobs-event-usage-75c6d6d54d-m5rjr 0/1 CrashLoopBackOff 14 (2m46s ago) 49m
exceptionless-jobs-work-item-c74d77b55-th4g7 0/1 CrashLoopBackOff 14 (2m34s ago) 49m
exceptionless-jobs-daily-summary-6c99dfbc87-7zq5k 0/1 CrashLoopBackOff 14 (2m34s ago) 49m
exceptionless-jobs-event-posts-75777759b8-nsmbw 0/1 CrashLoopBackOff 14 (2m32s ago) 49m
exceptionless-jobs-close-inactive-sessions-b49595f49-hmfxm 0/1 CrashLoopBackOff 14 (2m14s ago) 49m
exceptionless-jobs-event-user-descriptions-5c9d5dc768-8h27z 0/1 CrashLoopBackOff 14 (2m16s ago) 49m
exceptionless-jobs-stack-event-count-54ffcfb4b6-gk6mz 0/1 CrashLoopBackOff 14 (2m ago) 49m
exceptionless-jobs-maintain-indexes-27669970-s28cg 0/1 CrashLoopBackOff 5 (94s ago) 4m30s
exceptionless-collector-5c774fd8ff-6ksvx 0/1 CrashLoopBackOff 2 (11s ago) 37s
exceptionless-api-66fc9cc659-zckzz 0/1 CrashLoopBackOff 3 (9s ago) 55s
The api, collector, and jobs pods are not successful.
I need help! Thanks!
The pod log is just:
Back-off restarting failed container
Yes, just that!
I guess the program starts and immediately crashes, so ...
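For CrashLoopBackOff with nothing useful in the current log, two commands usually reveal more: the pod's events and exit code, and the log of the previous (crashed) container instance. A sketch using one of the pod names above (add -n <namespace> if the release is not in the default namespace):
kubectl describe pod exceptionless-api-66fc9cc659-zckzz      # look at Events and Last State / Exit Code
kubectl logs exceptionless-api-66fc9cc659-zckzz --previous   # logs from the container run that crashed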

How to debug GKE internal network issue?

UPDATE 1:
Some more logs from api-servers:
https://gist.github.com/nvcnvn/47df8798e798637386f6e0777d869d4f
This question is more about a debugging method for the current GKE setup, but solutions are welcome too.
We're using GKE version 1.22.3-gke.1500.
We recently started facing an issue where commands like kubectl logs and kubectl exec don't work, and deleting a namespace takes forever.
Checking some services inside the cluster, it seems that some network operations just randomly fail. For example, metrics-server keeps crashing with these error logs:
message: "pkg/mod/k8s.io/client-go#v0.19.10/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.97.0.1:443/api/v1/nodes?resourceVersion=387681528": net/http: TLS handshake timeout"
HTTP requests also time out:
unable to fully scrape metrics: unable to fully scrape metrics from node gke-staging-n2d-standard-8-78c35b3a-6h16: unable to fetch metrics from node gke-staging-n2d-standard-8-78c35b3a-6h16: Get "http://10.148.15.217:10255/stats/summary?only_cpu_and_memory=true": context deadline exceeded
I also tried to restart (by kubectl delete) most of the pods in this list:
kubectl get pod
NAME READY STATUS RESTARTS AGE
event-exporter-gke-5479fd58c8-snq26 2/2 Running 0 4d7h
fluentbit-gke-gbs2g 2/2 Running 0 4d7h
fluentbit-gke-knz2p 2/2 Running 0 85m
fluentbit-gke-ljw8h 2/2 Running 0 30h
gke-metadata-server-dtnvh 1/1 Running 0 4d7h
gke-metadata-server-f2bqw 1/1 Running 0 30h
gke-metadata-server-kzcv6 1/1 Running 0 85m
gke-metrics-agent-4g56c 1/1 Running 12 (3h6m ago) 4d7h
gke-metrics-agent-hnrll 1/1 Running 13 (13h ago) 30h
gke-metrics-agent-xdbrw 1/1 Running 0 85m
konnectivity-agent-87bc84bb7-g9nd6 1/1 Running 0 2m59s
konnectivity-agent-87bc84bb7-rkhhh 1/1 Running 0 3m51s
konnectivity-agent-87bc84bb7-x7pk4 1/1 Running 0 3m50s
konnectivity-agent-autoscaler-698b6d8768-297mh 1/1 Running 0 83m
kube-dns-77d9986bd5-2m8g4 4/4 Running 0 3h24m
kube-dns-77d9986bd5-z4j62 4/4 Running 0 3h24m
kube-dns-autoscaler-f4d55555-dmvpq 1/1 Running 0 83m
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-8299 1/1 Running 0 11s
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-fp5u 1/1 Running 0 11s
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-rkdp 1/1 Running 0 11s
l7-default-backend-7db896cb4-mvptg 1/1 Running 0 83m
metrics-server-v0.4.4-fd9886cc5-tcscj 2/2 Running 82 33h
netd-5vpmc 1/1 Running 0 30h
netd-bhq64 1/1 Running 0 85m
netd-n6jmc 1/1 Running 0 4d7h
Some logs from the metrics server:
https://gist.github.com/nvcnvn/b77eb02705385889961aca33f0f841c7
If you cannot use kubectl to get info from your cluster, can you try to access it by using the RESTful API?
http://blog.madhukaraphatak.com/understanding-k8s-api-part-2/
Try to delete the "metrics-server" pods or get their logs using podman or a curl command.
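A sketch of what that could look like on GKE, assuming gcloud credentials still work even while kubectl misbehaves (CLUSTER_NAME and ZONE are placeholders, and the container name is assumed to be metrics-server):
APISERVER=https://$(gcloud container clusters describe CLUSTER_NAME --zone ZONE --format='value(endpoint)')
TOKEN=$(gcloud auth print-access-token)
# read the metrics-server pod log straight from the API server (-k skips CA verification for brevity)
curl -k -H "Authorization: Bearer $TOKEN" \
  "$APISERVER/api/v1/namespaces/kube-system/pods/metrics-server-v0.4.4-fd9886cc5-tcscj/log?container=metrics-server"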

pvc get stuck in pending waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually

I use Rook to build a Ceph cluster, but my PVC gets stuck in Pending. When I used kubectl describe pvc, I found this event from persistentvolume-controller:
waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
All my pods are in running state:
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-ntqk6 3/3 Running 0 14d
csi-cephfsplugin-pqxdw 3/3 Running 6 14d
csi-cephfsplugin-provisioner-c68f789b8-dt4jf 6/6 Running 49 14d
csi-cephfsplugin-provisioner-c68f789b8-rn42r 6/6 Running 73 14d
csi-rbdplugin-6pgf4 3/3 Running 0 14d
csi-rbdplugin-l8fkm 3/3 Running 6 14d
csi-rbdplugin-provisioner-6c75466c49-tzqcr 6/6 Running 106 14d
csi-rbdplugin-provisioner-6c75466c49-x8675 6/6 Running 17 14d
rook-ceph-crashcollector-compute08.dc-56b86f7c4c-9mh2j 1/1 Running 2 12d
rook-ceph-crashcollector-compute09.dc-6998676d86-wpsrs 1/1 Running 0 12d
rook-ceph-crashcollector-compute10.dc-684599bcd8-7hzlc 1/1 Running 0 12d
rook-ceph-mgr-a-69fd54cccf-tjkxh 1/1 Running 200 12d
rook-ceph-mon-at-8568b88589-2bm5h 1/1 Running 0 4d3h
rook-ceph-mon-av-7b4444c8f4-2mlpc 1/1 Running 0 4d1h
rook-ceph-mon-aw-7df9f76fcd-zzmkw 1/1 Running 0 4d1h
rook-ceph-operator-7647888f87-zjgsj 1/1 Running 1 15d
rook-ceph-osd-0-6db4d57455-p4cz9 1/1 Running 2 12d
rook-ceph-osd-1-649d74dc6c-5r9dj 1/1 Running 0 12d
rook-ceph-osd-2-7c57d4498c-dh6nk 1/1 Running 0 12d
rook-ceph-osd-prepare-compute08.dc-gxt8p 0/1 Completed 0 3h9m
rook-ceph-osd-prepare-compute09.dc-wj2fp 0/1 Completed 0 3h9m
rook-ceph-osd-prepare-compute10.dc-22kth 0/1 Completed 0 3h9m
rook-ceph-tools-6b4889fdfd-d6xdg 1/1 Running 0 12d
Here is the output of kubectl logs -n rook-ceph csi-cephfsplugin-provisioner-c68f789b8-dt4jf csi-provisioner:
I0120 11:57:13.283362 1 csi-provisioner.go:121] Version: v2.0.0
I0120 11:57:13.283493 1 csi-provisioner.go:135] Building kube configs for running in cluster...
I0120 11:57:13.294506 1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
I0120 11:57:13.294984 1 common.go:111] Probing CSI driver for readiness
W0120 11:57:13.296379 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I0120 11:57:13.299629 1 leaderelection.go:243] attempting to acquire leader lease rook-ceph/rook-ceph-cephfs-csi-ceph-com...
Here is the ceph status in toolbox container:
cluster:
id: 0b71fd4c-9731-4fea-81a7-1b5194e14204
health: HEALTH_ERR
Module 'dashboard' has failed: [('x509 certificate routines', 'X509_check_private_key', 'key values mismatch')]
Degraded data redundancy: 2/6 objects degraded (33.333%), 1 pg degraded, 1 pg undersized
1 pgs not deep-scrubbed in time
1 pgs not scrubbed in time
services:
mon: 3 daemons, quorum at,av,aw (age 4d)
mgr: a(active, since 4d)
osd: 3 osds: 3 up (since 12d), 3 in (since 12d)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 0 B
usage: 3.3 GiB used, 3.2 TiB / 3.2 TiB avail
pgs: 2/6 objects degraded (33.333%)
1 active+undersized+degraded
I think it’s because the cluster’s health is HEALTH_ERR, but I don’t know how to solve it... I currently use raw partitions to build the Ceph cluster: one partition on one node and two partitions on another node.
I found that a few pods have restarted several times, so I checked their logs. As for the csi-rbdplugin-provisioner pod, there is the same error in the csi-resizer, csi-attacher, and csi-snapshotter containers:
E0122 08:08:37.891106 1 leaderelection.go:321] error retrieving resource lock rook-ceph/external-resizer-rook-ceph-rbd-csi-ceph-com: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-resizer-rook-ceph-rbd-csi-ceph-com": dial tcp 10.96.0.1:443: i/o timeout
and a repeating error in csi-snapshotter:
E0122 08:08:48.420082 1 reflector.go:127] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: Failed to watch *v1beta1.VolumeSnapshotClass: failed to list *v1beta1.VolumeSnapshotClass: the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)
As for the mgr pod, there is a repeating record:
debug 2021-01-29T00:47:22.155+0000 7f10fdb48700 0 log_channel(cluster) log [DBG] : pgmap v28775: 1 pgs: 1 active+undersized+degraded; 0 B data, 337 MiB used, 3.2 TiB / 3.2 TiB avail; 2/6 objects degraded (33.333%)
It's also weird that the mon pods' names are at, av, and aw rather than a, b, and c. It seems like the mon pods were deleted and recreated several times, but I don't know why.
Thanks for any advice.
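The undersized/degraded PG looks consistent with having OSDs on only two hosts while the pool's replica size is 3 (Ceph's default failure domain is the host), but the toolbox can confirm that. A sketch, using the toolbox pod name from the listing above:
kubectl -n rook-ceph exec -it rook-ceph-tools-6b4889fdfd-d6xdg -- bash
# inside the toolbox shell:
ceph health detail        # explains each HEALTH_ERR item
ceph osd tree             # how the 3 OSDs are spread across hosts
ceph osd pool ls detail   # replicated size of the pool vs. number of OSD hosts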

Kubernetes can't access pod in multi worker nodes

I was following a tutorial on YouTube, and the guy said that if you deploy your application in a multi-node setup and your service is of type NodePort, you don't have to worry about which node your pod gets scheduled on. You can access it with any node's IP address, like
worker1IP:servicePort or worker2IP:servicePort or workerNIP:servicePort
But I just tried it, and this is not the case: I can only access the pod on the node where it is scheduled and deployed. Is this the correct behavior?
kubectl version --short
> Client Version: v1.18.5
> Server Version: v1.18.5
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66bff467f8-6pt8s 0/1 Running 288 7d22h
coredns-66bff467f8-t26x4 0/1 Running 288 7d22h
etcd-redhat-master 1/1 Running 16 7d22h
kube-apiserver-redhat-master 1/1 Running 17 7d22h
kube-controller-manager-redhat-master 1/1 Running 19 7d22h
kube-flannel-ds-amd64-9mh6k 1/1 Running 16 5d22h
kube-flannel-ds-amd64-g2k5c 1/1 Running 16 5d22h
kube-flannel-ds-amd64-rnvgb 1/1 Running 14 5d22h
kube-proxy-gf8zk 1/1 Running 16 7d22h
kube-proxy-wt7cp 1/1 Running 9 7d22h
kube-proxy-zbw4b 1/1 Running 9 7d22h
kube-scheduler-redhat-master 1/1 Running 18 7d22h
weave-net-6jjd8 2/2 Running 34 7d22h
weave-net-ssqbz 1/2 CrashLoopBackOff 296 7d22h
weave-net-ts2tj 2/2 Running 34 7d22h
[root@redhat-master deployments]# kubectl logs weave-net-ssqbz -c weave -n kube-system
DEBU: 2020/07/05 07:28:04.661866 [kube-peers] Checking peer "b6:01:79:66:7d:d3" against list &{[{e6:c9:b2:5f:82:d1 redhat-master} {b2:29:9a:5b:89:e9 redhat-console-1} {e2:95:07:c8:a0:90 redhat-console-2}]}
Peer not in list; removing persisted data
INFO: 2020/07/05 07:28:04.924399 Command line options: map[conn-limit:200 datapath:datapath db-prefix:/weavedb/weave-net docker-api: expect-npc:true host-root:/host http-addr:127.0.0.1:6784 ipalloc-init:consensus=2 ipalloc-range:10.32.0.0/12 metrics-addr:0.0.0.0:6782 name:b6:01:79:66:7d:d3 nickname:redhat-master no-dns:true port:6783]
INFO: 2020/07/05 07:28:04.924448 weave 2.6.5
FATA: 2020/07/05 07:28:04.938587 Existing bridge type "bridge" is different than requested "bridged_fastdp". Please do 'weave reset' and try again
Update:
So basically the issue is that iptables is deprecated in RHEL 8. But even after downgrading my OS to RHEL 7, I can still only access the NodePort on the node where the pod is deployed.
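Two things stand out in the listings above: both flannel and weave-net DaemonSets are present (two CNI plugins on one cluster usually conflict), and the weave FATA log itself asks for a weave reset on that node. A sketch of acting on that message, assuming the weave CLI can be installed on the node via Weave's documented install URL:
sudo curl -L git.io/weave -o /usr/local/bin/weave
sudo chmod +x /usr/local/bin/weave
sudo weave reset
kubectl delete pod weave-net-ssqbz -n kube-system   # let the DaemonSet recreate the pod cleanly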

kubelet.service: Service hold-off time over, scheduling restart

Context
We are currently using a few clusters with v1.8.7 (created months ago by developers who are no longer available) and are trying to upgrade to a higher version.
However, we wanted to try the same first on a cluster we use for experiments & POCs.
What we tried
In doing so, we tried to run a few kubeadm commands on one of the master nodes, but kubeadm was not found.
So, we tried installing it with these commands:
apt-get update && apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
apt-get update
apt-get install -y kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl
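In hindsight, pinning the install to the cluster's existing version would have avoided upgrading the kubelet in place. A sketch, assuming the 1.8.7-00 package is still available in the repo (list candidates with apt-cache madison kubelet):
apt-get install -y kubelet=1.8.7-00 kubeadm=1.8.7-00 kubectl=1.8.7-00
apt-mark hold kubelet kubeadm kubectl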
What happened
However, now that node has status NotReady and the kubelet service is failing.
Any pointers on how to fix this, and on what we should have done?
root@k8s-master-dev-0:/home/azureuser# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-dev-0 NotReady master 118d v1.8.7
k8s-master-dev-1 Ready master 118d v1.8.7
k8s-master-dev-2 Ready master 163d v1.8.7
k8s-agents-dev-0 Ready agent 163d v1.8.7
k8s-agents-dev-1 Ready agent 163d v1.8.7
k8s-agents-dev-2 Ready agent 163d v1.8.7
root@k8s-master-dev-0:/home/azureuser# systemctl status kubelet.service
● kubelet.service - Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: failed (Result: start-limit-hit) since Thu 2018-12-13 14:33:25 UTC; 18h ago
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Control process exited, code=exited status=2
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: Failed to start Kubelet.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Unit entered failed state.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: Stopped Kubelet.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Start request repeated too quickly.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: Failed to start Kubelet.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Unit entered failed state.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Failed with result 'start-limit-hit'.
The reason your kubelet went into a bad state is that you upgraded the kubelet package, which replaced the kubelet service file, so any changes you had made to it earlier were lost.
Here are a few things you can try:
Disable your swap memory: swapoff -a
Check your kubelet service file; for kubeadm it is located at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf. Check the value of --cgroup-driver and, if it is systemd, change it to cgroupfs (it must match the cgroup driver your container runtime uses), and then:
Reload the daemon and restart kubelet:
systemctl daemon-reload
systemctl restart kubelet
Now check whether your kubelet started or not; see the consolidated sketch below.
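Putting those steps together (a sketch; assumes Docker is the container runtime and the kubeadm drop-in path from the answer above):
docker info 2>/dev/null | grep -i cgroup                                   # driver Docker is actually using
grep cgroup-driver /etc/systemd/system/kubelet.service.d/10-kubeadm.conf   # driver the kubelet is told to use
# make the two match by editing the --cgroup-driver value in the drop-in, then:
swapoff -a
systemctl daemon-reload
systemctl restart kubelet
journalctl -u kubelet -f   # watch whether the kubelet stays up this time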
PS: A live upgrade of the kubeadm control plane should be done carefully; check my answer on how to upgrade kubeadm:
how to upgrade kubernetes from v1.10.0 to v1.10.11
Is it a clean Kubernetes cluster?
I think you should be careful with installing kubelet, kubeadm, and kubectl on a LIVE Kubernetes cluster.
Here you can find more information about installing kubelet on a live cluster:
https://kubernetes.io/docs/tasks/administer-cluster/reconfigure-kubelet/
Can you show me the output of:
kubectl get all --namespace kube-system
@wrogrammer
root@k8s-master-dev-0:/var/log/apt# kubectl get all --namespace kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ds/kube-proxy 6 6 5 6 5 beta.kubernetes.io/os=linux 164d
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/heapster 1 1 1 1 164d
deploy/kube-dns-v20 2 2 2 2 164d
deploy/kubernetes-dashboard 1 1 1 1 164d
deploy/tiller-deploy 1 1 1 1 164d
NAME DESIRED CURRENT READY AGE
rs/heapster-75f8df9884 1 1 1 164d
rs/heapster-7d6ffbf65 0 0 0 164d
rs/kube-dns-v20-5d9fdc7448 2 2 2 164d
rs/kubernetes-dashboard-8555bd85db 1 1 1 164d
rs/tiller-deploy-6677dc8d46 1 1 1 163d
rs/tiller-deploy-86d6cf59b 0 0 0 164d
NAME READY STATUS RESTARTS AGE
po/heapster-75f8df9884-nxn2z 2/2 Running 0 37d
po/kube-addon-manager-k8s-master-dev-0 1/1 Unknown 4 30d
po/kube-addon-manager-k8s-master-dev-1 1/1 Running 4 118d
po/kube-addon-manager-k8s-master-dev-2 1/1 Running 2 164d
po/kube-apiserver-k8s-master-dev-0 1/1 Unknown 4 30d
po/kube-apiserver-k8s-master-dev-1 1/1 Running 4 118d
po/kube-apiserver-k8s-master-dev-2 1/1 Running 2 164d
po/kube-controller-manager-k8s-master-dev-0 1/1 Unknown 6 30d
po/kube-controller-manager-k8s-master-dev-1 1/1 Running 4 118d
po/kube-controller-manager-k8s-master-dev-2 1/1 Running 4 164d
po/kube-dns-v20-5d9fdc7448-smf9s 3/3 Running 0 37d
po/kube-dns-v20-5d9fdc7448-vtjh4 3/3 Running 0 37d
po/kube-proxy-cklcx 1/1 Running 1 118d
po/kube-proxy-dldnd 1/1 Running 4 164d
po/kube-proxy-gg89s 1/1 NodeLost 3 163d
po/kube-proxy-mrkqf 1/1 Running 4 143d
po/kube-proxy-s95mm 1/1 Running 10 164d
po/kube-proxy-zxnb7 1/1 Running 2 164d
po/kube-scheduler-k8s-master-dev-0 1/1 Unknown 6 30d
po/kube-scheduler-k8s-master-dev-1 1/1 Running 6 118d
po/kube-scheduler-k8s-master-dev-2 1/1 Running 4 164d
po/kubernetes-dashboard-8555bd85db-4txtm 1/1 Running 0 37d
po/tiller-deploy-6677dc8d46-5n5cp 1/1 Running 0 37d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/heapster ClusterIP XX Redacted XX <none> 80/TCP 164d
svc/kube-dns ClusterIP XX Redacted XX <none> 53/UDP,53/TCP 164d
svc/kubernetes-dashboard NodePort XX Redacted XX <none> 80:31279/TCP 164d
svc/tiller-deploy ClusterIP XX Redacted XX <none> 44134/TCP 164d