kubelet.service: Service hold-off time over, scheduling restart - kubernetes

Context
We are currently running a few clusters on v1.8.7 (created months ago by developers who are no longer available) and are trying to upgrade to a higher version.
Before touching those, we wanted to try the same on a cluster we use for experiments and POCs.
What we tried
To do so, we tried to run a few kubeadm commands on one of the master nodes, but kubeadm was not installed there.
So we installed it with the following commands:
apt-get update && apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
apt-get update
apt-get install -y kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl
What happened
However, that node now has status NotReady and the kubelet service is failing.
Any pointers on how to fix this, and on what we should have done instead?
root@k8s-master-dev-0:/home/azureuser# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-dev-0 NotReady master 118d v1.8.7
k8s-master-dev-1 Ready master 118d v1.8.7
k8s-master-dev-2 Ready master 163d v1.8.7
k8s-agents-dev-0 Ready agent 163d v1.8.7
k8s-agents-dev-1 Ready agent 163d v1.8.7
k8s-agents-dev-2 Ready agent 163d v1.8.7
root@k8s-master-dev-0:/home/azureuser# systemctl status kubelet.service
● kubelet.service - Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: failed (Result: start-limit-hit) since Thu 2018-12-13 14:33:25 UTC; 18h ago
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Control process exited, code=exited status=2
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: Failed to start Kubelet.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Unit entered failed state.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: Stopped Kubelet.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Start request repeated too quickly.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: Failed to start Kubelet.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Unit entered failed state.
Dec 13 14:33:25 k8s-master-dev-0 systemd[1]: kubelet.service: Failed with result 'start-limit-hit'.

The reason your kubelet went into a bad state is that you upgraded the kubelet package, which replaces the kubelet service file; any changes you had made to it earlier were lost.
You can try the following:
Disable swap: swapoff -a
Check your kubelet service file; for kubeadm it is located at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf. Check the value of --cgroup-driver: if it is systemd and your container runtime uses cgroupfs, change it to cgroupfs (the kubelet's cgroup driver must match the container runtime's).
Then reload the daemon and restart kubelet:
systemctl daemon-reload
systemctl restart kubelet
Now check whether your kubelet started.
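To confirm whether the drivers actually mismatch before editing anything, a minimal check sketch (assuming Docker as the container runtime and the standard kubeadm drop-in path):
# cgroup driver the container runtime is using
docker info 2>/dev/null | grep -i 'cgroup driver'
# cgroup driver the kubelet is configured with
grep -- '--cgroup-driver' /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# the kubelet's own log usually names the exact startup error
journalctl -u kubelet --no-pager -n 50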
PS: A live upgrade of a kubeadm control plane should be done carefully; see my answer on how to upgrade kubeadm:
how to upgrade kubernetes from v1.10.0 to v1.10.11
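For reference, a rough sketch of that upgrade flow (version numbers are placeholders; upgrade one minor version at a time, control plane first):
# on the first master
kubeadm upgrade plan
kubeadm upgrade apply v1.9.11
# then move the node packages to the matching version
apt-mark unhold kubelet kubectl
apt-get install -y kubelet=1.9.11-00 kubectl=1.9.11-00
apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet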

Is it a clean Kubernetes cluster?
I think you should be careful with installing kubelet, kubeadm and kubectl on a LIVE Kubernetes cluster.
Here you can find more information about reconfiguring the kubelet on a live cluster:
https://kubernetes.io/docs/tasks/administer-cluster/reconfigure-kubelet/
Can you show me the output of:
kubectl get all --namespace kube-system

@wrogrammer
root@k8s-master-dev-0:/var/log/apt# kubectl get all --namespace kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ds/kube-proxy 6 6 5 6 5 beta.kubernetes.io/os=linux 164d
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/heapster 1 1 1 1 164d
deploy/kube-dns-v20 2 2 2 2 164d
deploy/kubernetes-dashboard 1 1 1 1 164d
deploy/tiller-deploy 1 1 1 1 164d
NAME DESIRED CURRENT READY AGE
rs/heapster-75f8df9884 1 1 1 164d
rs/heapster-7d6ffbf65 0 0 0 164d
rs/kube-dns-v20-5d9fdc7448 2 2 2 164d
rs/kubernetes-dashboard-8555bd85db 1 1 1 164d
rs/tiller-deploy-6677dc8d46 1 1 1 163d
rs/tiller-deploy-86d6cf59b 0 0 0 164d
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/heapster 1 1 1 1 164d
deploy/kube-dns-v20 2 2 2 2 164d
deploy/kubernetes-dashboard 1 1 1 1 164d
deploy/tiller-deploy 1 1 1 1 164d
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ds/kube-proxy 6 6 5 6 5 beta.kubernetes.io/os=linux 164d
NAME DESIRED CURRENT READY AGE
rs/heapster-75f8df9884 1 1 1 164d
rs/heapster-7d6ffbf65 0 0 0 164d
rs/kube-dns-v20-5d9fdc7448 2 2 2 164d
rs/kubernetes-dashboard-8555bd85db 1 1 1 164d
rs/tiller-deploy-6677dc8d46 1 1 1 163d
rs/tiller-deploy-86d6cf59b 0 0 0 164d
NAME READY STATUS RESTARTS AGE
po/heapster-75f8df9884-nxn2z 2/2 Running 0 37d
po/kube-addon-manager-k8s-master-dev-0 1/1 Unknown 4 30d
po/kube-addon-manager-k8s-master-dev-1 1/1 Running 4 118d
po/kube-addon-manager-k8s-master-dev-2 1/1 Running 2 164d
po/kube-apiserver-k8s-master-dev-0 1/1 Unknown 4 30d
po/kube-apiserver-k8s-master-dev-1 1/1 Running 4 118d
po/kube-apiserver-k8s-master-dev-2 1/1 Running 2 164d
po/kube-controller-manager-k8s-master-dev-0 1/1 Unknown 6 30d
po/kube-controller-manager-k8s-master-dev-1 1/1 Running 4 118d
po/kube-controller-manager-k8s-master-dev-2 1/1 Running 4 164d
po/kube-dns-v20-5d9fdc7448-smf9s 3/3 Running 0 37d
po/kube-dns-v20-5d9fdc7448-vtjh4 3/3 Running 0 37d
po/kube-proxy-cklcx 1/1 Running 1 118d
po/kube-proxy-dldnd 1/1 Running 4 164d
po/kube-proxy-gg89s 1/1 NodeLost 3 163d
po/kube-proxy-mrkqf 1/1 Running 4 143d
po/kube-proxy-s95mm 1/1 Running 10 164d
po/kube-proxy-zxnb7 1/1 Running 2 164d
po/kube-scheduler-k8s-master-dev-0 1/1 Unknown 6 30d
po/kube-scheduler-k8s-master-dev-1 1/1 Running 6 118d
po/kube-scheduler-k8s-master-dev-2 1/1 Running 4 164d
po/kubernetes-dashboard-8555bd85db-4txtm 1/1 Running 0 37d
po/tiller-deploy-6677dc8d46-5n5cp 1/1 Running 0 37d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/heapster ClusterIP XX Redacted XX <none> 80/TCP 164d
svc/kube-dns ClusterIP XX Redacted XX <none> 53/UDP,53/TCP 164d
svc/kubernetes-dashboard NodePort XX Redacted XX <none> 80:31279/TCP 164d
svc/tiller-deploy ClusterIP XX Redacted XX <none> 44134/TCP 164d

Related

pvc get stuck in pending waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually

I use rook to build a Ceph cluster, but my PVC gets stuck in Pending. When I run kubectl describe pvc, I see this event from persistentvolume-controller:
waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator
All my pods are in Running state:
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-ntqk6 3/3 Running 0 14d
csi-cephfsplugin-pqxdw 3/3 Running 6 14d
csi-cephfsplugin-provisioner-c68f789b8-dt4jf 6/6 Running 49 14d
csi-cephfsplugin-provisioner-c68f789b8-rn42r 6/6 Running 73 14d
csi-rbdplugin-6pgf4 3/3 Running 0 14d
csi-rbdplugin-l8fkm 3/3 Running 6 14d
csi-rbdplugin-provisioner-6c75466c49-tzqcr 6/6 Running 106 14d
csi-rbdplugin-provisioner-6c75466c49-x8675 6/6 Running 17 14d
rook-ceph-crashcollector-compute08.dc-56b86f7c4c-9mh2j 1/1 Running 2 12d
rook-ceph-crashcollector-compute09.dc-6998676d86-wpsrs 1/1 Running 0 12d
rook-ceph-crashcollector-compute10.dc-684599bcd8-7hzlc 1/1 Running 0 12d
rook-ceph-mgr-a-69fd54cccf-tjkxh 1/1 Running 200 12d
rook-ceph-mon-at-8568b88589-2bm5h 1/1 Running 0 4d3h
rook-ceph-mon-av-7b4444c8f4-2mlpc 1/1 Running 0 4d1h
rook-ceph-mon-aw-7df9f76fcd-zzmkw 1/1 Running 0 4d1h
rook-ceph-operator-7647888f87-zjgsj 1/1 Running 1 15d
rook-ceph-osd-0-6db4d57455-p4cz9 1/1 Running 2 12d
rook-ceph-osd-1-649d74dc6c-5r9dj 1/1 Running 0 12d
rook-ceph-osd-2-7c57d4498c-dh6nk 1/1 Running 0 12d
rook-ceph-osd-prepare-compute08.dc-gxt8p 0/1 Completed 0 3h9m
rook-ceph-osd-prepare-compute09.dc-wj2fp 0/1 Completed 0 3h9m
rook-ceph-osd-prepare-compute10.dc-22kth 0/1 Completed 0 3h9m
rook-ceph-tools-6b4889fdfd-d6xdg 1/1 Running 0 12d
Here is the output of kubectl logs -n rook-ceph csi-cephfsplugin-provisioner-c68f789b8-dt4jf csi-provisioner:
I0120 11:57:13.283362 1 csi-provisioner.go:121] Version: v2.0.0
I0120 11:57:13.283493 1 csi-provisioner.go:135] Building kube configs for running in cluster...
I0120 11:57:13.294506 1 connection.go:153] Connecting to unix:///csi/csi-provisioner.sock
I0120 11:57:13.294984 1 common.go:111] Probing CSI driver for readiness
W0120 11:57:13.296379 1 metrics.go:142] metrics endpoint will not be started because `metrics-address` was not specified.
I0120 11:57:13.299629 1 leaderelection.go:243] attempting to acquire leader lease rook-ceph/rook-ceph-cephfs-csi-ceph-com...
Here is the ceph status in the toolbox container:
cluster:
id: 0b71fd4c-9731-4fea-81a7-1b5194e14204
health: HEALTH_ERR
Module 'dashboard' has failed: [('x509 certificate routines', 'X509_check_private_key', 'key values mismatch')]
Degraded data redundancy: 2/6 objects degraded (33.333%), 1 pg degraded, 1 pg undersized
1 pgs not deep-scrubbed in time
1 pgs not scrubbed in time
services:
mon: 3 daemons, quorum at,av,aw (age 4d)
mgr: a(active, since 4d)
osd: 3 osds: 3 up (since 12d), 3 in (since 12d)
data:
pools: 1 pools, 1 pgs
objects: 2 objects, 0 B
usage: 3.3 GiB used, 3.2 TiB / 3.2 TiB avail
pgs: 2/6 objects degraded (33.333%)
1 active+undersized+degraded
I think it's because the cluster's health is HEALTH_ERR, but I don't know how to solve it... I currently use raw partitions to build the Ceph cluster: one partition on one node and two partitions on another node.
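To see what is driving the undersized/degraded warning (for example, whether the pool wants more replicas than there are OSD hosts), a minimal check sketch from the toolbox pod shown above:
# replica count ("size") requested by each pool
ceph osd pool ls detail
# how the 3 OSDs are spread across hosts
ceph osd tree
# full explanation behind HEALTH_ERR
ceph health detail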
I found that a few pods have restarted several times, so I checked their logs. For the csi-rbdplugin-provisioner pod, the same error appears in the csi-resizer, csi-attacher and csi-snapshotter containers:
E0122 08:08:37.891106 1 leaderelection.go:321] error retrieving resource lock rook-ceph/external-resizer-rook-ceph-rbd-csi-ceph-com: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/rook-ceph/leases/external-resizer-rook-ceph-rbd-csi-ceph-com": dial tcp 10.96.0.1:443: i/o timeout
and a repeating error in csi-snapshotter:
E0122 08:08:48.420082 1 reflector.go:127] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: Failed to watch *v1beta1.VolumeSnapshotClass: failed to list *v1beta1.VolumeSnapshotClass: the server could not find the requested resource (get volumesnapshotclasses.snapshot.storage.k8s.io)
As for the mgr pod, there is a repeating record:
debug 2021-01-29T00:47:22.155+0000 7f10fdb48700 0 log_channel(cluster) log [DBG] : pgmap v28775: 1 pgs: 1 active+undersized+degraded; 0 B data, 337 MiB used, 3.2 TiB / 3.2 TiB avail; 2/6 objects degraded (33.333%)
It's also weird that the mon pods' names are at, av and aw rather than a, b and c. It seems the mon pods were deleted and recreated several times, but I don't know why.
Thanks for any advice.

Kubernetes can't access pod in multi worker nodes

I was following a tutorial on YouTube, and the presenter said that if you deploy your application in a multi-node setup and your service is of type NodePort, you don't have to worry about which node your pod gets scheduled on. You can access it via any node's IP address, like
worker1IP:servicePort or worker2IP:servicePort or workerNIP:servicePort
But I just tried this and it is not the case; I can only access the pod on the node where it is scheduled and deployed. Is this the correct behavior?
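For reference, a minimal sketch of the behaviour being described (deployment name and port are placeholders): with a working kube-proxy and network plugin on every node, the same NodePort should answer on every node's IP, wherever the pod runs.
kubectl expose deployment my-app --type=NodePort --port=80
kubectl get svc my-app                 # note the 3xxxx port in PORT(S)
curl http://<worker1-ip>:<nodePort>    # should work...
curl http://<worker2-ip>:<nodePort>    # ...and so should this, regardless of pod placement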
kubectl version --short
> Client Version: v1.18.5
> Server Version: v1.18.5
kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-66bff467f8-6pt8s 0/1 Running 288 7d22h
coredns-66bff467f8-t26x4 0/1 Running 288 7d22h
etcd-redhat-master 1/1 Running 16 7d22h
kube-apiserver-redhat-master 1/1 Running 17 7d22h
kube-controller-manager-redhat-master 1/1 Running 19 7d22h
kube-flannel-ds-amd64-9mh6k 1/1 Running 16 5d22h
kube-flannel-ds-amd64-g2k5c 1/1 Running 16 5d22h
kube-flannel-ds-amd64-rnvgb 1/1 Running 14 5d22h
kube-proxy-gf8zk 1/1 Running 16 7d22h
kube-proxy-wt7cp 1/1 Running 9 7d22h
kube-proxy-zbw4b 1/1 Running 9 7d22h
kube-scheduler-redhat-master 1/1 Running 18 7d22h
weave-net-6jjd8 2/2 Running 34 7d22h
weave-net-ssqbz 1/2 CrashLoopBackOff 296 7d22h
weave-net-ts2tj 2/2 Running 34 7d22h
[root@redhat-master deployments]# kubectl logs weave-net-ssqbz -c weave -n kube-system
DEBU: 2020/07/05 07:28:04.661866 [kube-peers] Checking peer "b6:01:79:66:7d:d3" against list &{[{e6:c9:b2:5f:82:d1 redhat-master} {b2:29:9a:5b:89:e9 redhat-console-1} {e2:95:07:c8:a0:90 redhat-console-2}]}
Peer not in list; removing persisted data
INFO: 2020/07/05 07:28:04.924399 Command line options: map[conn-limit:200 datapath:datapath db-prefix:/weavedb/weave-net docker-api: expect-npc:true host-root:/host http-addr:127.0.0.1:6784 ipalloc-init:consensus=2 ipalloc-range:10.32.0.0/12 metrics-addr:0.0.0.0:6782 name:b6:01:79:66:7d:d3 nickname:redhat-master no-dns:true port:6783]
INFO: 2020/07/05 07:28:04.924448 weave 2.6.5
FATA: 2020/07/05 07:28:04.938587 Existing bridge type "bridge" is different than requested "bridged_fastdp". Please do 'weave reset' and try again
Update:
So basically the issue is that iptables is deprecated in RHEL 8. But even after downgrading my OS to RHEL 7, I can still only access the NodePort on the node where the pod is deployed.

Kubelet Master stays in KubeletNotReady because of cni missing

The cluster has been initialized with the pod network CIDR for Calico:
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --image-repository=someserver
Then I fetched calico.yaml v3.11 and applied it:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" apply -f calico.yaml
Right after, I checked the node status:
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" get nodes
NAME STATUS ROLES AGE VERSION
master-1 NotReady master 7m21s v1.17.2
On describe I get "cni config uninitialized", but I thought Calico should have taken care of that?
MemoryPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Fri, 21 Feb 2020 10:14:24 +0100 Fri, 21 Feb 2020 10:09:00 +0100 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
In fact I have nothing under /etc/cni/net.d/, so it seems something was skipped?
ll /etc/cni/net.d/
total 0
sudo kubectl --kubeconfig="/etc/kubernetes/admin.conf" -n kube-system get pods
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-f7lqq 0/1 Pending 0 3h
calico-node-f4xzh 0/1 Init:ImagePullBackOff 0 3h
coredns-7fb8cdf968-bbqbz 0/1 Pending 0 3h24m
coredns-7fb8cdf968-vdnzx 0/1 Pending 0 3h24m
etcd-master-1 1/1 Running 0 3h24m
kube-apiserver-master-1 1/1 Running 0 3h24m
kube-controller-manager-master-1 1/1 Running 0 3h24m
kube-proxy-9m879 1/1 Running 0 3h24m
kube-scheduler-master-1 1/1 Running 0 3h24m
As explained, I'm running through a local repo, and journalctl says:
kubelet[21935]: E0225 14:30:54.830683 21935 pod_workers.go:191] Error syncing pod cec2f72b-844a-4d6b-8606-3aff06d4a36d ("calico-node-f4xzh_kube-system(cec2f72b-844a-4d6b-8606-3aff06d4a36d)"), skipping: failed to "StartContainer" for "upgrade-ipam" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://repo:10000/v2/calico/cni/manifests/v3.11.2: no basic auth credentials"
kubelet[21935]: E0225 14:30:56.008989 21935 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
It feels like CNI is not the only issue.
The CoreDNS pods will stay Pending and the master will stay NotReady until the Calico pods are running successfully and CNI is set up properly.
It seems to be a network issue downloading the Calico Docker images from docker.io. You can pull the Calico images from docker.io, push them to your internal container registry, modify the images section of calico.yaml to refer to that registry, and finally apply the modified calico.yaml to the Kubernetes cluster (see the sketch below).
So the issue behind Init:ImagePullBackOff was that the node could not pull the images from my private repo automatically. I had to pull all the Calico images onto the node myself. Then I deleted the calico-node pod and it recreated itself with the newly pushed images:
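A rough sketch of that mirroring step, assuming private-repo is the registry the modified calico.yaml points at:
docker pull calico/cni:v3.11.2
docker tag calico/cni:v3.11.2 private-repo/calico/cni:v3.11.2
docker push private-repo/calico/cni:v3.11.2
# repeat for calico/node, calico/pod2daemon-flexvol and calico/kube-controllers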
sudo docker pull private-repo/calico/pod2daemon-flexvol:v3.11.2
sudo docker pull private-repo/calico/node:v3.11.2
sudo docker pull private-repo/calico/cni:v3.11.2
sudo docker pull private-repo/calico/kube-controllers:v3.11.2
sudo kubectl -n kube-system delete po/calico-node-y7g5
After that, the node redoes the whole init phase, and:
sudo kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5644fb7cf6-qkf47 1/1 Running 0 11s
calico-node-mkcsr 1/1 Running 0 21m
coredns-7fb8cdf968-bgqvj 1/1 Running 0 37m
coredns-7fb8cdf968-v85jx 1/1 Running 0 37m
etcd-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-apiserver-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-controller-manager-lin-1k8w1dv-vmh 1/1 Running 0 38m
kube-proxy-9hkns 1/1 Running 0 37m
kube-scheduler-lin-1k8w1dv-vmh 1/1 Running 0 38m

Troubleshooting a NotReady node

I have one node that is giving me some trouble at the moment. I haven't found a solution as of yet, but that might be a skill-level problem, Google coming up empty, or me having hit some unsolvable issue. The latter is highly unlikely.
kubectl version v1.8.5
docker version 1.12.6
Doing some normal maintenance on my nodes I noticed the following:
NAME STATUS ROLES AGE VERSION
ip-192-168-4-14.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-143.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-174.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-182.ourdomain.pro Ready <none> 46d v1.8.5
ip-192-168-4-221.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-249.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-251.ourdomain.pro NotReady <none> 206d v1.8.5
On the NotReady node, I am unable to attach or exec into pods, which seems normal in a NotReady state unless I am misreading it. I am not able to look at any pod-specific logs on that node for the same reason.
At this point, I restarted kubelet and attached myself to the logs simultaneously to see if anything out of the ordinary would appear.
Below are the errors I spent a day Googling, but I cannot confirm whether they are actually connected to the problem.
ERROR 1
unable to connect to Rkt api service
We are not using this so I put this on the ignore list.
ERROR 2
unable to connect to CRI-O api service
We are not using this so I put this on the ignore list.
ERROR 3
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
I have not been able to exclude this as a potential pitfall, but the things I have found thus far do not seem to relate to the version I am running.
ERROR 4
skipping pod synchronization - [container runtime is down PLEG is not healthy
I do not have an answer for this one except for the fact that the garbage collection error above appears a second time after this message.
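Since this message points at the container runtime rather than the kubelet itself, a minimal check sketch (assuming Docker, which this cluster runs):
systemctl status docker --no-pager
docker ps                              # does the daemon respond, and without long delays?
journalctl -u docker --no-pager -n 50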
ERROR 5
Registration of the rkt container factory failed
Not using this so it should fail unless I am mistaken.
ERROR 6
Registration of the crio container factory failed
Not using this so it should fail unless, again, I am mistaken.
ERROR 7
28087 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-rt7qp_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container
I found a GitHub ticket for this one, but it seems to be fixed, so I am not sure how it relates.
ERROR 8
28087 kubelet_node_status.go:791] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-05-06 05:00:40.664331773 +0000 UTC LastTransitionTime:2019-05-06 05:00:40.664331773 +0000 UTC Reason:KubeletNotReady Message:container runtime is down}
And here the node goes into NotReady.
Last log messages and status
systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2019-05-06 05:00:39 UTC; 1h 58min ago
Docs: http://kubernetes.io/docs/
Main PID: 28087 (kubelet)
Tasks: 21
Memory: 42.3M
CGroup: /system.slice/kubelet.service
└─28087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manife...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310305 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310330 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310359 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "varl...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310385 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "cali...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310408 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "kube...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310435 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310456 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310480 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "ca-c...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310504 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-...
May 06 05:14:29 kube-master-1 kubelet[28087]: E0506 05:14:29.848530 28087 helpers.go:468] PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs
Here is the kubectl get po -o wide output.
NAME READY STATUS RESTARTS AGE IP NODE
docker-image-prune-fhjkl 1/1 Running 4 213d 100.96.67.87 ip-192-168-4-249
docker-image-prune-ltfpf 1/1 Running 4 213d 100.96.152.74 ip-192-168-4-143
docker-image-prune-nmg29 1/1 Running 3 213d 100.96.22.236 ip-192-168-4-221
docker-image-prune-pdw5h 1/1 Running 7 213d 100.96.90.116 ip-192-168-4-174
docker-image-prune-swbhc 1/1 Running 0 46d 100.96.191.129 ip-192-168-4-182
docker-image-prune-vtsr4 1/1 NodeLost 1 206d 100.96.182.197 ip-192-168-4-251
fluentd-es-4bgdz 1/1 Running 6 213d 192.168.4.249 ip-192-168-4-249
fluentd-es-fb4gw 1/1 Running 7 213d 192.168.4.14 ip-192-168-4-14
fluentd-es-fs8gp 1/1 Running 6 213d 192.168.4.143 ip-192-168-4-143
fluentd-es-k572w 1/1 Running 0 46d 192.168.4.182 ip-192-168-4-182
fluentd-es-lpxhn 1/1 Running 5 213d 192.168.4.174 ip-192-168-4-174
fluentd-es-pjp9w 1/1 Unknown 2 206d 192.168.4.251 ip-192-168-4-251
fluentd-es-wbwkp 1/1 Running 4 213d 192.168.4.221 ip-192-168-4-221
grafana-76c7dbb678-p8hzb 1/1 Running 3 213d 100.96.90.115 ip-192-168-4-174
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-g8xmp 2/2 Running 2 101d 100.96.22.234 ip-192-168-4-221
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-tvp4m 2/2 Running 2 101d 100.96.22.235 ip-192-168-4-221
prometheus-65b4b68d97-82vr7 1/1 Running 3 213d 100.96.90.87 ip-192-168-4-174
pushgateway-79f575d754-75l6r 1/1 Running 3 213d 100.96.90.83 ip-192-168-4-174
rabbitmq-cluster-58db9b6978-g6ssb 2/2 Running 4 181d 100.96.90.117 ip-192-168-4-174
replicator-56x7v 1/1 Running 3 213d 100.96.90.84 ip-192-168-4-174
traefik-ingress-6dc9779596-6ghwv 1/1 Running 3 213d 100.96.90.85 ip-192-168-4-174
traefik-ingress-6dc9779596-ckzbk 1/1 Running 4 213d 100.96.152.73 ip-192-168-4-143
traefik-ingress-6dc9779596-sbt4n 1/1 Running 3 213d 100.96.22.232 ip-192-168-4-221
Output of kubectl get po -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP
calico-kube-controllers-78f554c7bb-s7tmj 1/1 Running 4 213d 192.168.4.14
calico-node-5cgc6 2/2 Running 9 213d 192.168.4.249
calico-node-bbwtm 2/2 Running 8 213d 192.168.4.14
calico-node-clwqk 2/2 NodeLost 4 206d 192.168.4.251
calico-node-d2zqz 2/2 Running 0 46d 192.168.4.182
calico-node-m4x2t 2/2 Running 6 213d 192.168.4.221
calico-node-m8xwk 2/2 Running 9 213d 192.168.4.143
calico-node-q7r7g 2/2 Running 8 213d 192.168.4.174
cluster-autoscaler-65d6d7f544-dpbfk 1/1 Running 10 207d 100.96.67.88
kube-apiserver-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-apiserver-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-apiserver-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-controller-manager-ip-192-168-4-14 1/1 Running 5 213d 192.168.4.14
kube-controller-manager-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-controller-manager-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-dns-545bc4bfd4-rt7qp 3/3 Running 13 213d 100.96.19.197
kube-proxy-2bn42 1/1 Running 0 46d 192.168.4.182
kube-proxy-95cvh 1/1 Running 4 213d 192.168.4.174
kube-proxy-bqrhw 1/1 NodeLost 2 206d 192.168.4.251
kube-proxy-cqh67 1/1 Running 6 213d 192.168.4.14
kube-proxy-fbdvx 1/1 Running 4 213d 192.168.4.221
kube-proxy-gcjxg 1/1 Running 5 213d 192.168.4.249
kube-proxy-mt62x 1/1 Running 4 213d 192.168.4.143
kube-scheduler-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-scheduler-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-scheduler-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kubernetes-dashboard-7c5d596d8c-q6sf2 1/1 Running 5 213d 100.96.22.230
tiller-deploy-6d9f596465-svpql 1/1 Running 3 213d 100.96.22.231
I am a bit lost at this point as to where to go from here. Any suggestions are welcome.
Most likely the kubelet is down.
Share the output of the command below:
journalctl -u kubelet
Share the output of the command below:
kubectl get po -n kube-system -o wide
It appears the node is not able to communicate with the control plane.
You can try the following steps (see the sketch below):
detach the node from the cluster (cordon the node, drain the node and finally delete the node)
reset the node
rejoin the node to the cluster as a fresh node
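A hedged sketch of those steps (the join parameters are placeholders; kubeadm reset wipes the node's Kubernetes state):
# from a healthy master
kubectl cordon ip-192-168-4-251.ourdomain.pro
kubectl drain ip-192-168-4-251.ourdomain.pro --ignore-daemonsets --delete-local-data
kubectl delete node ip-192-168-4-251.ourdomain.pro
# on the broken node itself
kubeadm reset
kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>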

Kubernetes - kube-system pods in master node keep restarting after worker node joins

I have followed this tutorial, this tutorial and this one, but have been facing the same issue for the last 3 days.
I am able to set up the master node correctly with the following steps:
kubeadm init
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
export kubever=$(kubectl version | base64 | tr -d '\n')
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$kubever"
and everything seems fine in
kubectl get all --namespace=kube-system
then,
on the worker node:
kubeadm join --token 864655.fdf6d0b389867b79 192.168.100.17:6443 --discovery-token-ca-cert-hash sha256:a2d840808b17b53b9612e6271ccde489f13dbede7d354f97188d0faa9e210af2
The output seems fine and is as below:
[preflight] Running pre-flight checks.
[WARNING FileExisting-crictl]: crictl not found in system path
[preflight] Starting the kubelet service
[discovery] Trying to connect to API Server "192.168.100.17:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://192.168.100.17:6443"
[discovery] Requesting info from "https://192.168.100.17:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "192.168.100.17:6443"
[discovery] Successfully established connection with API Server "192.168.100.17:6443"
This node has joined the cluster:
* Certificate signing request was sent to master and a response
was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the master to see this node join the cluster.
BUT as soon as I run this command, all hell breaks loose. The
kubectl get all --namespace=kube-system
starts showing that all pods are restarting all the time. The status keeps changing between Pending and Running, and at times some of the pods will even disappear or show ContainerCreating status, etc.
NAME READY STATUS RESTARTS AGE
po/etcd-ubuntu 0/1 Pending 0 0s
po/kube-controller-manager-ubuntu 0/1 Pending 0 0s
po/kube-dns-6f4fd4bdf-cmcfk 3/3 Running 0 13m
po/kube-proxy-2chb6 1/1 Running 0 13m
po/kube-scheduler-ubuntu 0/1 Pending 0 0s
po/weave-net-ptdxr 2/2 Running 0 11m
I have also tried the second tutorial, with flannel, and get the exact same issue.
My Set Up
I created two new VMs with a fresh installation of Ubuntu 17.10 on VMware, each with 2 processors/2 cores, 6 GB of RAM and a 50 GB hard disk. My physical machine is an i7-6700K with 32 GB of RAM.
I installed kubeadm, kubelet and docker on both of them and then followed the steps as mentioned above.
I have also tried switching between NAT and Bridge on VMware and nothing changed.
The initial IP of both VMs with bridge network was 192.168.100.12 and 192.168.100.17.
The hostname -I output for the master:
192.168.100.17 172.17.0.1 10.32.0.1 10.32.0.2
The hostname -I output for the worker node:
192.168.100.12 172.17.0.1 10.44.0.0 10.32.0.1
journalctl -xeu kubelet shows the following:
https://gist.github.com/saad749/9a771a3460bf88c274498b5bc4b7fd84
While trying with flannel (and still the same issue), the result from
kubectl describe nodes
is
https://gist.github.com/saad749/d24c453c8b4e663e9abf572a0fb38bf4
Am I missing any step before kubeadm init? Should I change the IP addresses (to what)? Are there any specific logs I should look into? Is there a more comprehensive tutorial for this?
All issues start after kubeadm join on the worker node; I can deploy things on the master node on its own and it works fine.
UPDATE:
Even after applying the suggestions from errordeveloper, the same issue persists.
I added the following flag to kubeadm init:
--apiserver-advertise-address 192.168.100.17
I updated the kubeadm.conf to the following and did a daemon-reload and restart:
https://gist.github.com/saad749/c7149c87ec3e75a40586f626cf04279a
and also tried changing the cluster DNS:
https://gist.github.com/saad749/5fa66bebc22841e58119333e75600e40
This is the output after initializing the master:
kube-master@ubuntu:~$ kubectl get pod --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system etcd-ubuntu 1/1 Running 0 22s 192.168.100.17 ubuntu
kube-system kube-apiserver-ubuntu 1/1 Running 0 29s 192.168.100.17 ubuntu
kube-system kube-controller-manager-ubuntu 1/1 Running 0 13s 192.168.100.17 ubuntu
kube-system kube-dns-6f4fd4bdf-wfqhb 3/3 Running 0 1m 10.32.0.7 ubuntu
kube-system kube-proxy-h4hz9 1/1 Running 0 1m 192.168.100.17 ubuntu
kube-system kube-scheduler-ubuntu 1/1 Running 0 34s 192.168.100.17 ubuntu
kube-system weave-net-fkgnh 2/2 Running 0 32s 192.168.100.17 ubuntu
The hostname -I and hostname -i results:
kube-master@ubuntu:~$ hostname -I
192.168.100.17 172.17.0.1 10.32.0.1 10.32.0.2 10.32.0.3 10.32.0.4 10.32.0.5 10.32.0.6 10.244.0.0 10.244.0.1
kube-master@ubuntu:~$ hostname -i
192.168.100.17
Results from:
kubectl describe nodes
https://gist.github.com/saad749/8f460650182a04d0ddf3158a52761a9a
The Internal IP seems correct now.
After joining from the second node, this happens:
kube-master@ubuntu:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ubuntu Ready master 49m v1.9.3
kube-master@ubuntu:~$ kubectl get pod --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system kube-controller-manager-ubuntu 0/1 Pending 0 0s <none> ubuntu
kube-system kube-dns-6f4fd4bdf-wfqhb 0/3 ContainerCreating 0 49m <none> ubuntu
kube-system kube-proxy-h4hz9 1/1 Running 0 49m 192.168.100.17 ubuntu
kube-system kube-scheduler-ubuntu 1/1 Running 0 1s 192.168.100.17 ubuntu
kube-system weave-net-fkgnh 2/2 Running 0 48m 192.168.100.17 ubuntu
ifconfig -a results:
https://gist.github.com/saad749/63a5a52bd3246ff72477b2aca7d158d0
journalctl -xeu kubelet results
https://gist.github.com/saad749/8a60870b35f93df8565e66cb208aff32
Sometimes, the pods' IP is shown as 192.168.100.12, which is the IP of the non-master second node.
kube-master@ubuntu:~$ kubectl get pod --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system etcd-ubuntu 0/1 Pending 0 0s <none> ubuntu
kube-system kube-apiserver-ubuntu 0/1 Pending 0 0s <none> ubuntu
kube-system kube-controller-manager-ubuntu 1/1 Running 0 0s 192.168.100.12 ubuntu
kube-system kube-dns-6f4fd4bdf-wfqhb 2/3 Running 0 3h 10.32.0.7 ubuntu
kube-system kube-proxy-h4hz9 1/1 Running 0 3h 192.168.100.12 ubuntu
kube-system kube-scheduler-ubuntu 0/1 Pending 0 0s <none> ubuntu
kube-system weave-net-fkgnh 2/2 Running 1 3h 192.168.100.17 ubuntu
kube-master@ubuntu:~$ kubectl get pod --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system kube-dns-6f4fd4bdf-wfqhb 3/3 Running 0 3h 10.32.0.7 ubuntu
kube-system kube-proxy-h4hz9 1/1 Running 0 3h 192.168.100.12 ubuntu
kube-system weave-net-fkgnh 2/2 Running 0 3h 192.168.100.12 ubuntu
kubectl describe nodes
Name: ubuntu
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=ubuntu
node-role.kubernetes.io/master=
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: node-role.kubernetes.io/master:NoSchedule
CreationTimestamp: Fri, 02 Mar 2018 08:21:47 -0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 02 Mar 2018 11:38:36 -0800 Fri, 02 Mar 2018 08:21:43 -0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 02 Mar 2018 11:38:36 -0800 Fri, 02 Mar 2018 08:21:43 -0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 02 Mar 2018 11:38:36 -0800 Fri, 02 Mar 2018 08:21:43 -0800 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Fri, 02 Mar 2018 11:38:36 -0800 Fri, 02 Mar 2018 11:28:25 -0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.100.12
Hostname: ubuntu
Capacity:
cpu: 4
memory: 6080832Ki
pods: 110
Allocatable:
cpu: 4
memory: 5978432Ki
pods: 110
System Info:
Machine ID: 59bf65b835b242a3aa182f4b8a542219
System UUID: 0C3C4D56-4747-D59E-EE09-F16F2793677E
Boot ID: 658b4a08-d724-425e-9246-2b41995ecc46
Kernel Version: 4.13.0-36-generic
OS Image: Ubuntu 17.10
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.9.3
Kube-Proxy Version: v1.9.3
ExternalID: ubuntu
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system kube-dns-6f4fd4bdf-wfqhb 260m (6%) 0 (0%) 110Mi (1%) 170Mi (2%)
kube-system kube-proxy-h4hz9 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system weave-net-fkgnh 20m (0%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
280m (7%) 0 (0%) 110Mi (1%) 170Mi (2%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Rebooted 12m (x814 over 2h) kubelet, ubuntu Node ubuntu has been rebooted, boot id: 16efd500-a2a5-446f-ba25-1187857996e0
Normal NodeHasNoDiskPressure 10m kubelet, ubuntu Node ubuntu status is now: NodeHasNoDiskPressure
Normal Starting 10m kubelet, ubuntu Starting kubelet.
Normal NodeAllocatableEnforced 10m kubelet, ubuntu Updated Node Allocatable limit across pods
Normal NodeHasSufficientDisk 10m kubelet, ubuntu Node ubuntu status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 10m kubelet, ubuntu Node ubuntu status is now: NodeHasSufficientMemory
Normal NodeNotReady 10m kubelet, ubuntu Node ubuntu status is now: NodeNotReady
Warning Rebooted 2m (x870 over 2h) kubelet, ubuntu Node ubuntu has been rebooted, boot id: 658b4a08-d724-425e-9246-2b41995ecc46
Warning Rebooted 15s (x60 over 10m) kubelet, ubuntu Node ubuntu has been rebooted, boot id: 16efd500-a2a5-446f-ba25-1187857996e0
What am I doing wrong?
So after following the advice from @errordeveloper and still hitting a wall, I was able to solve the issue, which turned out to be pretty simple.
Both my VMs had the same hostname.
hostname -f
would return
ubuntu
on both, and that apparently causes issues with Kubernetes.
I changed the name on my non-master node with
hostnamectl set-hostname kminion
and in the following files:
/etc/hostname
/etc/hosts
and everything went smoothly from then on!
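A quick way to verify that kind of fix (a small sketch; the node names are the ones used above):
hostname -f                   # must now differ between the two VMs
kubectl get nodes -o wide     # ubuntu and kminion should register as separate nodes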
Should I change the IP addresses (to what)?
Yes, this is typically the way to make things work on VMs where the default route is for NATed access to the Internet.
You want to use the IP of the bridged network; for your master that appears to be 192.168.100.17 (but please double-check).
First, please try using kubeadm init --apiserver-advertise-address 192.168.100.17, but that may not solve all of the issues.
In your output of kubectl describe nodes, I can see this:
Addresses:
InternalIP: 172.17.0.1
Hostname: ubuntu
So you probably want to make sure that the kubelet also doesn't use the NATed interface, for which you would need to use the kubelet's --node-ip flag.
However, there are other ways to fix this problem, e.g. if you can ensure that hostname -i returns the IP of the bridged interface (which you can do by tweaking /etc/hosts).
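A hedged sketch of wiring in --node-ip via the kubeadm drop-in (assuming the usual layout where ExecStart picks up $KUBELET_EXTRA_ARGS; the IP is the bridged address mentioned above):
# add to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf:
#   Environment="KUBELET_EXTRA_ARGS=--node-ip=192.168.100.17"
systemctl daemon-reload
systemctl restart kubelet
kubectl describe node ubuntu | grep -A3 Addresses   # InternalIP should now be 192.168.100.17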