Kubelet nodes set to NotReady after my VMs restarted - kubernetes

Here are my 3 VMs with nodes. I'm not sure when exactly it broke, but I'm assuming right when my VMs shut down and I had to power them back on.
NAME STATUS ROLES AGE VERSION
tjordy-k8-master.myipname NotReady master 99d v1.17.1
tjordy-k8-worker1.myipname NotReady <none> 99d v1.17.1
tjordy-k8-worker2.myipname NotReady <none> 99d v1.17.1
One of the main effects of this is that when I try to get logs from a pod or port-forward to a pod, I get a connection error.
error: error upgrading connection: error dialing backend: dial tcp 10.18.223.95:10250: connect: no route to host
Here is the describe from my master node:
Name: tjordy-k8-master.myipname
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=tjordy-k8-master.myipname
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 14 May 2020 02:23:12 -0700
Taints: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: tjordy-k8-master.myipname
AcquireTime: <unset>
RenewTime: Sat, 15 Aug 2020 11:51:16 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 14 May 2020 14:13:24 -0700 Thu, 14 May 2020 14:13:24 -0700 WeaveIsUp Weave pod has set this
MemoryPressure Unknown Sat, 15 Aug 2020 11:50:47 -0700 Fri, 21 Aug 2020 16:03:03 -0700 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Sat, 15 Aug 2020 11:50:47 -0700 Fri, 21 Aug 2020 16:03:03 -0700 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Sat, 15 Aug 2020 11:50:47 -0700 Fri, 21 Aug 2020 16:03:03 -0700 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Sat, 15 Aug 2020 11:50:47 -0700 Fri, 21 Aug 2020 16:03:03 -0700 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.18.223.22
Hostname: tjordy-k8-master.myipname
Capacity:
cpu: 2
ephemeral-storage: 100112644Ki
hugepages-2Mi: 0
memory: 3880788Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 92263812558
hugepages-2Mi: 0
memory: 3778388Ki
pods: 110
System Info:
Machine ID: b116c790be914ec08657e4cc260f0164
System UUID: 4216A453-81C5-3477-2710-CF356A1B0BFE
Boot ID: c73333b0-cd1c-40f2-8877-28a8a4b4bd05
Kernel Version: 3.10.0-957.10.1.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.8
Kubelet Version: v1.17.1
Kube-Proxy Version: v1.17.1
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default prometheus-operator-1589787700-prometheus-node-exporter-nb4fv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 95d
kube-system coredns-6955765f44-pk8jm 100m (5%) 0 (0%) 70Mi (1%) 170Mi (4%) 99d
kube-system coredns-6955765f44-xkfk6 100m (5%) 0 (0%) 70Mi (1%) 170Mi (4%) 99d
kube-system etcd-tjordy-k8-master.myipname 0 (0%) 0 (0%) 0 (0%) 0 (0%) 99d
kube-system kube-apiserver-tjordy-k8-master.myipname 250m (12%) 0 (0%) 0 (0%) 0 (0%) 99d
kube-system kube-controller-manager-tjordy-k8-master.myipname 200m (10%) 0 (0%) 0 (0%) 0 (0%) 99d
kube-system kube-proxy-xcg6h 0 (0%) 0 (0%) 0 (0%) 0 (0%) 99d
kube-system kube-scheduler-tjordy-k8-master.myipname 100m (5%) 0 (0%) 0 (0%) 0 (0%) 99d
kube-system weave-net-6fmv6 20m (1%) 0 (0%) 0 (0%) 0 (0%) 99d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 770m (38%) 0 (0%)
memory 140Mi (3%) 340Mi (9%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
I've tried restarting them but nothing changed. Not really sure what to do.
Edit:
I ran journalctl -u kubelet. There are thousands of lines, but most of them look like this:
Aug 21 19:44:42 3nxdomain kubelet[8060]: I0821 19:44:42.158482 8060 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
Aug 21 19:44:42 3nxdomain kubelet[8060]: E0821 19:44:42.189131 8060 kubelet.go:2263] node "3nxdomain" not found
Aug 21 19:44:42 3nxdomain kubelet[8060]: E0821 19:44:42.289338 8060 kubelet.go:2263] node "3nxdomain" not found
Aug 21 19:44:42 3nxdomain kubelet[8060]: E0821 19:44:42.390437 8060 kubelet.go:2263] node "3nxdomain" not found
Aug 21 19:44:42 3nxdomain kubelet[8060]: I0821 19:44:42.411680 8060 kubelet_node_status.go:70] Attempting to register node 3nxdomain
Aug 21 19:44:42 3nxdomain kubelet[8060]: E0821 19:44:42.413954 8060 kubelet_node_status.go:92] Unable to register node "3nxdomain" with API server: nodes "3nxdomain" is forbidden: node "tjordy-k8-master.myip" is not allowed to modify node "3nxdomain"
Aug 21 19:44:42 3nxdomain kubelet[8060]: E0821 19:44:42.490625 8060 kubelet.go:2263] node "3nxdomain" not found
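The log lines above suggest the kubelet is now trying to register under a different node name ("3nxdomain") than the one its credentials identify ("tjordy-k8-master.myip"), which would be consistent with the hostname changing when the VMs restarted. As a minimal sketch (the function name and regular expression are illustrative, not part of any Kubernetes tool), this Python snippet pulls both names out of such a line so a mismatch is easy to spot in a long journal:

```python
import re

# Matches the "forbidden" message logged when the identity in the kubelet's
# credentials differs from the node name it is trying to register.
MISMATCH = re.compile(
    r'node "(?P<identity>[^"]+)" is not allowed to modify node "(?P<attempted>[^"]+)"'
)


def find_name_mismatch(log_line):
    """Return (credential_identity, attempted_node_name), or None if absent."""
    match = MISMATCH.search(log_line)
    if match is None:
        return None
    return match.group("identity"), match.group("attempted")


# Sample line copied from the journalctl output above.
sample = (
    'Aug 21 19:44:42 3nxdomain kubelet[8060]: E0821 19:44:42.413954 8060 '
    'kubelet_node_status.go:92] Unable to register node "3nxdomain" with API '
    'server: nodes "3nxdomain" is forbidden: node "tjordy-k8-master.myip" is '
    'not allowed to modify node "3nxdomain"'
)

print(find_name_mismatch(sample))
```

If the two names differ, comparing them with the output of hostname on the VM is a reasonable next check. Whether restoring the old hostname or setting the kubelet's --hostname-override flag resolves it is an assumption based on this log, not a confirmed fix.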

Related

Why does my k8s cluster report a node as `overcommitted`?

I deployed my application to an AWS EKS cluster with 3 nodes. When I run describe, it shows me this message: (Total limits may be over 100 percent, i.e., overcommitted.). But based on the full output, very few resources appear to be allocated. Why do I see this message in the output?
$ kubectl describe node ip-192-168-54-184.ap-southeast-2.compute.internal
Name: ip-192-168-54-184.ap-southeast-2.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.medium
beta.kubernetes.io/os=linux
eks.amazonaws.com/capacityType=ON_DEMAND
eks.amazonaws.com/nodegroup=scalegroup
eks.amazonaws.com/nodegroup-image=ami-0ecaff41b4f38a650
failure-domain.beta.kubernetes.io/region=ap-southeast-2
failure-domain.beta.kubernetes.io/zone=ap-southeast-2b
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-54-184.ap-southeast-2.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=t3.medium
topology.kubernetes.io/region=ap-southeast-2
topology.kubernetes.io/zone=ap-southeast-2b
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 04 Mar 2021 22:27:50 +1100
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: ip-192-168-54-184.ap-southeast-2.compute.internal
AcquireTime: <unset>
RenewTime: Fri, 05 Mar 2021 09:13:16 +1100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Fri, 05 Mar 2021 09:11:33 +1100 Thu, 04 Mar 2021 22:27:50 +1100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 05 Mar 2021 09:11:33 +1100 Thu, 04 Mar 2021 22:27:50 +1100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 05 Mar 2021 09:11:33 +1100 Thu, 04 Mar 2021 22:27:50 +1100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 05 Mar 2021 09:11:33 +1100 Thu, 04 Mar 2021 22:28:10 +1100 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.54.184
ExternalIP: 13.211.200.109
Hostname: ip-192-168-54-184.ap-southeast-2.compute.internal
InternalDNS: ip-192-168-54-184.ap-southeast-2.compute.internal
ExternalDNS: ec2-13-211-200-109.ap-southeast-2.compute.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3970504Ki
pods: 17
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 1930m
ephemeral-storage: 18242267924
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3415496Ki
pods: 17
System Info:
Machine ID: ec246b12e91dc516024822fbcdac4408
System UUID: ec246b12-e91d-c516-0248-22fbcdac4408
Boot ID: 5c6a3d95-c82c-4051-bc90-6e732b0b5be2
Kernel Version: 5.4.91-41.139.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.6
Kubelet Version: v1.19.6-eks-49a6c0
Kube-Proxy Version: v1.19.6-eks-49a6c0
ProviderID: aws:///ap-southeast-2b/i-03c0417efb85b8e6c
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
cert-manager cert-manager-cainjector-9747d56-qwhjw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 10h
kube-system aws-node-m296t 10m (0%) 0 (0%) 0 (0%) 0 (0%) 10h
kube-system coredns-67997b9dbd-cgjdj 100m (5%) 0 (0%) 70Mi (2%) 170Mi (5%) 10h
kube-system kube-proxy-dc5fh 100m (5%) 0 (0%) 0 (0%) 0 (0%) 10h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 210m (10%) 0 (0%)
memory 70Mi (2%) 170Mi (5%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events: <none>
Let's quickly analyze the source code of the kubectl describe command, in particular the describeNodeResource function.
Inside the describeNodeResource(...) function we see this line:
w.Write(LEVEL_0, "Allocated resources:\n (Total limits may be over 100 percent, i.e., overcommitted.)\n")
No condition controls when this message is printed. It is an informational message that is printed every time, whether or not the node is actually overcommitted.
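To see whether a node really is overcommitted, look at the percentages in the Limits column rather than the header text. The following Python sketch recomputes them from the figures in the describe output above. The helper names are made up for illustration, only the quantity suffixes that appear in that output are handled, and truncating to a whole percent is an assumption that happens to reproduce the printed values:

```python
# Binary suffixes that appear in the describe output above.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def memory_bytes(quantity):
    """Convert a memory quantity such as '170Mi' or '3415496Ki' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '210m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(quantity) * 1000


def percent_of_allocatable(used, allocatable):
    """Share of allocatable, truncated to a whole percent."""
    return int(used / allocatable * 100)


cpu_requests = percent_of_allocatable(cpu_millicores("210m"), cpu_millicores("1930m"))
mem_requests = percent_of_allocatable(memory_bytes("70Mi"), memory_bytes("3415496Ki"))
mem_limits = percent_of_allocatable(memory_bytes("170Mi"), memory_bytes("3415496Ki"))

print(cpu_requests, mem_requests, mem_limits)
```

All three values are far below 100, so this node is not overcommitted. By contrast, a Limits figure above 100% means the limits add up to more than the node can provide.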

Node role is missing for Master node - Kubernetes installation done with the help of Kubespray

I did a clean installation of a Kubernetes cluster with 3 nodes (2 masters and 3 workers, i.e., the masters are also assigned as worker nodes).
After the successful installation, I got the roles below for the nodes. The node role is missing for the masters, as shown.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node1 Ready master 12d v1.18.5
node2 Ready master 12d v1.18.5
node3 Ready <none> 12d v1.18.5
inventory/mycluster/hosts.yaml
all:
  hosts:
    node1:
      ansible_host: 10.1.10.110
      ip: 10.1.10.110
      access_ip: 10.1.10.110
    node2:
      ansible_host: 10.1.10.111
      ip: 10.1.10.111
      access_ip: 10.1.10.111
    node3:
      ansible_host: 10.1.10.112
      ip: 10.1.10.112
      access_ip: 10.1.10.112
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
    vault:
      hosts:
        node1
        node2
        node3
Network Plugin : Flannel
Command used to invoke ansible:
ansible-playbook -i inventory/mycluster/hosts.yaml --become cluster.yml
How can I make the master nodes work as worker nodes as well?
Kubectl describe node1 output:
kubectl describe node node1
Name: node1
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=node1
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"a6:bb:9e:2a:7e:a8"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.1.10.110
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 01 Jul 2020 09:26:15 -0700
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: node1
AcquireTime: <unset>
RenewTime: Tue, 14 Jul 2020 06:39:58 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 10 Jul 2020 12:51:05 -0700 Fri, 10 Jul 2020 12:51:05 -0700 FlannelIsUp Flannel is running on this node
MemoryPressure False Tue, 14 Jul 2020 06:40:02 -0700 Fri, 03 Jul 2020 15:00:26 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 14 Jul 2020 06:40:02 -0700 Fri, 03 Jul 2020 15:00:26 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 14 Jul 2020 06:40:02 -0700 Fri, 03 Jul 2020 15:00:26 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 14 Jul 2020 06:40:02 -0700 Mon, 06 Jul 2020 10:45:01 -0700 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.1.10.110
Hostname: node1
Capacity:
cpu: 8
ephemeral-storage: 51175Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32599596Ki
pods: 110
Allocatable:
cpu: 7800m
ephemeral-storage: 48294789041
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31997196Ki
pods: 110
System Info:
Machine ID: c8690497b9704d2d975c33155c9fa69e
System UUID: 00000000-0000-0000-0000-AC1F6B96768A
Boot ID: 5e3eabe0-7732-4e6d-b25d-7eeec347d6c6
Kernel Version: 3.10.0-1127.13.1.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.12
Kubelet Version: v1.18.5
Kube-Proxy Version: v1.18.5
PodCIDR: 10.233.64.0/24
PodCIDRs: 10.233.64.0/24
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default httpd-deployment-598596ddfc-n56jq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d20h
kube-system coredns-dff8fc7d-lb6bh 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 3d17h
kube-system kube-apiserver-node1 250m (3%) 0 (0%) 0 (0%) 0 (0%) 12d
kube-system kube-controller-manager-node1 200m (2%) 0 (0%) 0 (0%) 0 (0%) 12d
kube-system kube-flannel-px8cj 150m (1%) 300m (3%) 64M (0%) 500M (1%) 3d17h
kube-system kube-proxy-6spl2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3d17h
kube-system kube-scheduler-node1 100m (1%) 0 (0%) 0 (0%) 0 (0%) 12d
kube-system kubernetes-metrics-scraper-54fbb4d595-28vvc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d20h
kube-system nodelocaldns-rxs4f 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 12d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 900m (11%) 300m (3%)
memory 205860Ki (0%) 856515840 (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
How can I make the master nodes work as worker nodes as well?
Remove the NoSchedule taint from the master nodes using the commands below:
kubectl taint node node1 node-role.kubernetes.io/master:NoSchedule-
kubectl taint node node2 node-role.kubernetes.io/master:NoSchedule-
After this, node1 and node2 will behave like worker nodes, and pods can be scheduled on them.
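Note that the describe output in the question already shows Taints: <none> for node1, so scheduling may not be what is missing. The ROLES column of kubectl get nodes is derived from node labels of the form node-role.kubernetes.io/<role>, so a node only displays the roles it is labelled with. A rough Python sketch of that derivation follows. It is a simplified illustration rather than kubectl's actual code, and the "node" role label in the second call is a hypothetical label used only as an example:

```python
ROLE_PREFIX = "node-role.kubernetes.io/"


def roles_from_labels(labels):
    """Build the ROLES string from a node's labels (simplified illustration)."""
    roles = sorted(
        key[len(ROLE_PREFIX):]
        for key in labels
        if key.startswith(ROLE_PREFIX) and len(key) > len(ROLE_PREFIX)
    )
    return ",".join(roles) if roles else "<none>"


# Labels copied from the describe output for node1 above.
node1_labels = {
    "beta.kubernetes.io/arch": "amd64",
    "beta.kubernetes.io/os": "linux",
    "kubernetes.io/arch": "amd64",
    "kubernetes.io/hostname": "node1",
    "kubernetes.io/os": "linux",
    "node-role.kubernetes.io/master": "",
}

print(roles_from_labels(node1_labels))

# Hypothetical: the same node with an additional role label.
with_extra_role = dict(node1_labels, **{"node-role.kubernetes.io/node": ""})
print(roles_from_labels(with_extra_role))
```

Under that reading, a command such as kubectl label node node1 node-role.kubernetes.io/node= would only change what the ROLES column displays. Whether pods can be scheduled on a node is governed by its taints, not by this label.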

How to Change Kubernetes Node Status from "Ready" to "NotReady" by changing CPU Utilization or memory utilization or Disk Pressure?

I have a Kubernetes cluster set up with 1 master and 1 worker node. For testing purposes, I increased CPU utilization and memory utilization up to 100%, but the node is still not getting the status "NotReady".
I am testing the pressure statuses. How can I change the status flag of MemoryPressure, DiskPressure, or PIDPressure to True?
Here are my master node conditions:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 27 Nov 2019 14:36:29 +0000 Wed, 27 Nov 2019 14:36:29 +0000 WeaveIsUp Weave pod has set this
MemoryPressure False Thu, 28 Nov 2019 07:36:46 +0000 Fri, 22 Nov 2019 13:30:38 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 28 Nov 2019 07:36:46 +0000 Fri, 22 Nov 2019 13:30:38 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 28 Nov 2019 07:36:46 +0000 Fri, 22 Nov 2019 13:30:38 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 28 Nov 2019 07:36:46 +0000 Fri, 22 Nov 2019 13:30:48 +0000 KubeletReady kubelet is posting ready status
Here is the pod info:
Non-terminated Pods: (8 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system coredns-5644d7b6d9-dm8v7 100m (5%) 0 (0%) 70Mi (0%) 170Mi (2%) 22d
kube-system coredns-5644d7b6d9-mz5rm 100m (5%) 0 (0%) 70Mi (0%) 170Mi (2%) 22d
kube-system etcd-ip-172-31-28-186.us-east-2.compute.internal 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d
kube-system kube-apiserver-ip-172-31-28-186.us-east-2.compute.internal 250m (12%) 0 (0%) 0 (0%) 0 (0%) 22d
kube-system kube-controller-manager-ip-172-31-28-186.us-east-2.compute.internal 200m (10%) 0 (0%) 0 (0%) 0 (0%) 22d
kube-system kube-proxy-cw8vv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d
kube-system kube-scheduler-ip-172-31-28-186.us-east-2.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 22d
kube-system weave-net-ct9zb 20m (1%) 0 (0%) 0 (0%) 0 (0%) 22d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 770m (38%) 0 (0%)
memory 140Mi (1%) 340Mi (4%)
ephemeral-storage 0 (0%) 0 (0%)
Here are the conditions for the worker node:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 28 Nov 2019 07:00:08 +0000 Thu, 28 Nov 2019 07:00:08 +0000 WeaveIsUp Weave pod has set this
MemoryPressure False Thu, 28 Nov 2019 07:39:03 +0000 Thu, 28 Nov 2019 07:00:00 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 28 Nov 2019 07:39:03 +0000 Thu, 28 Nov 2019 07:00:00 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 28 Nov 2019 07:39:03 +0000 Thu, 28 Nov 2019 07:00:00 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 28 Nov 2019 07:39:03 +0000 Thu, 28 Nov 2019 07:00:00 +0000 KubeletReady kubelet is posting ready status
There are several ways of getting a node into NotReady status, but not through Pods. When a Pod starts to consume too much memory, the kubelet will just kill that Pod, precisely to protect the node.
I guess you want to test what happens when a node goes down. In that case, you want to drain it. In other words, to simulate node issues, you should do:
kubectl drain NODE
Still, check kubectl drain --help to see what happens under which circumstances.
EDIT
I tried, actually, accessing the node and running stress on the node directly, and this is what happened within 20 seconds:
root@gke-klusta-lemmy-3ce02acd-djhm:/# stress --cpu 16 --io 8 --vm 8 --vm-bytes 2G
Checking on the node:
$ kubectl get no -w | grep gke-klusta-lemmy-3ce02acd-djhm
gke-klusta-lemmy-3ce02acd-djhm Ready <none> 15d v1.13.11-gke.14
gke-klusta-lemmy-3ce02acd-djhm Ready <none> 15d v1.13.11-gke.14
gke-klusta-lemmy-3ce02acd-djhm NotReady <none> 15d v1.13.11-gke.14
gke-klusta-lemmy-3ce02acd-djhm NotReady <none> 15d v1.13.11-gke.14
gke-klusta-lemmy-3ce02acd-djhm NotReady <none> 15d v1.13.11-gke.14
I am running pretty weak nodes: 1 CPU, 4 GB RAM.
Just bring down the kubelet service on the node that you want to be in NotReady status.
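As for the pressure conditions asked about in the question, the kubelet reports MemoryPressure or DiskPressure when an eviction signal crosses its threshold. The Python sketch below evaluates two of those signals. The threshold values (memory.available below 100Mi, nodefs.available below 10%) are the commonly documented Linux hard-eviction defaults and are an assumption here, because the cluster's actual kubelet configuration is not shown. The function and parameter names are illustrative only:

```python
MI = 1024 * 1024
GI = 1024 * MI


def pressure_conditions(
    mem_available_bytes,
    nodefs_available_bytes,
    nodefs_capacity_bytes,
    mem_threshold_bytes=100 * MI,      # assumed default: memory.available<100Mi
    nodefs_threshold_fraction=0.10,    # assumed default: nodefs.available<10%
):
    """Return which pressure conditions would be True for the given signals."""
    return {
        "MemoryPressure": mem_available_bytes < mem_threshold_bytes,
        "DiskPressure": nodefs_available_bytes
        < nodefs_capacity_bytes * nodefs_threshold_fraction,
    }


# A node with plenty of headroom: neither condition is set.
healthy = pressure_conditions(2 * GI, 50 * GI, 100 * GI)

# Only 50Mi of memory and 5% of the disk left: both conditions are set.
starved = pressure_conditions(50 * MI, 5 * GI, 100 * GI)

print(healthy)
print(starved)
```

These signals are based on available memory and disk, not on CPU utilization, which is one plausible reason why driving CPU to 100% alone leaves all the conditions at False.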

How to remove kube taints from worker nodes: Taints node.kubernetes.io/unreachable:NoSchedule

I was able to remove the taint from the master, but my two worker nodes, installed on bare metal with kubeadm, keep the unreachable taint even after I issue the command to remove it. The command reports the taint as removed, but the removal is not permanent: when I check, the taint is still there. I also tried patching the taints to an empty list, but this did not work. The only things I found on SO or anywhere else deal with the master or assume these commands work.
UPDATE: I checked the timestamp of the taint, and it is added again the moment it is deleted. So in what sense is the node unreachable? I can ping it. Are there any Kubernetes diagnostics I can run to find out why it is unreachable? I checked that I can ping both ways between the master and worker nodes. Which log would show which component cannot connect?
kubectl describe no k8s-node1 | grep -i taint
Taints: node.kubernetes.io/unreachable:NoSchedule
Tried:
kubectl patch node k8s-node1 -p '{"spec":{"Taints":[]}}'
And
kubectl taint nodes --all node.kubernetes.io/unreachable:NoSchedule-
node/k8s-node1 untainted
node/k8s-node2 untainted
error: taint "node.kubernetes.io/unreachable:NoSchedule" not found
The result says untainted for the two worker nodes, but then I see the taints again when I grep:
kubectl describe no k8s-node1 | grep -i taint
Taints: node.kubernetes.io/unreachable:NoSchedule
$ k get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 10d v1.14.2
k8s-node1 NotReady <none> 10d v1.14.2
k8s-node2 NotReady <none> 10d v1.14.2
UPDATE: I found someone who had the same problem and could only fix it by resetting the cluster with kubeadm:
https://forum.linuxfoundation.org/discussion/846483/lab2-1-kubectl-untainted-not-working
I sure hope I don't have to do that every time the worker nodes get tainted.
k describe node k8s-node2
Name: k8s-node2
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=k8s-node2
kubernetes.io/os=linux
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"d2:xx:61:c3:xx:16"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.xx.1.xx
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 05 Jun 2019 11:46:12 +0700
Taints: node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Fri, 14 Jun 2019 10:34:07 +0700 Fri, 14 Jun 2019 10:35:09 +0700 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Fri, 14 Jun 2019 10:34:07 +0700 Fri, 14 Jun 2019 10:35:09 +0700 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Fri, 14 Jun 2019 10:34:07 +0700 Fri, 14 Jun 2019 10:35:09 +0700 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Fri, 14 Jun 2019 10:34:07 +0700 Fri, 14 Jun 2019 10:35:09 +0700 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.10.10.xx
Hostname: k8s-node2
Capacity:
cpu: 2
ephemeral-storage: 26704124Ki
memory: 4096032Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 24610520638
memory: 3993632Ki
pods: 110
System Info:
Machine ID: 6e4e4e32972b3b2f27f021dadc61d21
System UUID: 6e4e4ds972b3b2f27f0cdascf61d21
Boot ID: abfa0780-3b0d-sda9-a664-df900627be14
Kernel Version: 4.4.0-87-generic
OS Image: Ubuntu 16.04.3 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.3
Kubelet Version: v1.14.2
Kube-Proxy Version: v1.14.2
PodCIDR: 10.xxx.10.1/24
Non-terminated Pods: (18 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
heptio-sonobuoy sonobuoy-systemd-logs-daemon-set-6a8d92061c324451-hnnp9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d1h
istio-system istio-pilot-7955cdff46-w648c 110m (5%) 2100m (105%) 228Mi (5%) 1224Mi (31%) 6h55m
istio-system istio-telemetry-5c9cb76c56-twzf5 150m (7%) 2100m (105%) 228Mi (5%) 1124Mi (28%) 6h55m
istio-system zipkin-8594bbfc6b-9p2qc 0 (0%) 0 (0%) 1000Mi (25%) 1000Mi (25%) 6h55m
knative-eventing webhook-576479cc56-wvpt6 0 (0%) 0 (0%) 1000Mi (25%) 1000Mi (25%) 6h45m
knative-monitoring elasticsearch-logging-0 100m (5%) 1 (50%) 0 (0%) 0 (0%) 3d20h
knative-monitoring grafana-5cdc94dbd-mc4jn 100m (5%) 200m (10%) 100Mi (2%) 200Mi (5%) 3d21h
knative-monitoring kibana-logging-7cb6b64bff-dh8nx 100m (5%) 1 (50%) 0 (0%) 0 (0%) 3d20h
knative-monitoring kube-state-metrics-56f68467c9-vr5cx 223m (11%) 243m (12%) 176Mi (4%) 216Mi (5%) 3d21h
knative-monitoring node-exporter-7jw59 110m (5%) 220m (11%) 50Mi (1%) 90Mi (2%) 3d22h
knative-monitoring prometheus-system-0 0 (0%) 0 (0%) 400Mi (10%) 1000Mi (25%) 3d20h
knative-serving activator-6cfb97bccf-bfc4w 120m (6%) 2200m (110%) 188Mi (4%) 1624Mi (41%) 6h45m
knative-serving autoscaler-85749b6c48-4wf6z 130m (6%) 2300m (114%) 168Mi (4%) 1424Mi (36%) 6h45m
knative-serving controller-b49d69f4d-7j27s 100m (5%) 1 (50%) 100Mi (2%) 1000Mi (25%) 6h45m
knative-serving networking-certmanager-5b5d8f5dd8-qjh5q 100m (5%) 1 (50%) 100Mi (2%) 1000Mi (25%) 6h45m
knative-serving networking-istio-7977b9bbdd-vrpl5 100m (5%) 1 (50%) 100Mi (2%) 1000Mi (25%) 6h45m
kube-system canal-qbn67 250m (12%) 0 (0%) 0 (0%) 0 (0%) 10d
kube-system kube-proxy-phbf5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 10d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1693m (84%) 14363m (718%)
memory 3838Mi (98%) 11902Mi (305%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
The problem was that swap was turned on on the worker nodes, so the kubelet exited on startup. This was evident from the syslog file under /var. The taint will keep getting re-added until this is resolved. Perhaps someone can comment on the implications of allowing the kubelet to run with swap on?
kubelet[29207]: F0616 06:25:05.597536 29207 server.go:265] failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps contained: [Filename#011#011#011#011Type#011#011Size#011Used#011Priority /dev/xvda5 partition#0114191228#0110#011-1]
Jun 16 06:25:05 k8s-node2 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jun 16 06:25:05 k8s-node2 systemd[1]: kubelet.service: Unit entered failed state.
Jun 16 06:25:05 k8s-node2 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jun 16 06:25:15 k8s-node2 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jun 16 06:25:15 k8s-node2 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jun 16 06:25:15 k8s-node2 systemd[1]: Started kubelet: The Kubernetes Node Agent.
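The kubelet error above quotes the contents of /proc/swaps, so a quick way to confirm the diagnosis on other nodes is to check whether that file lists any active swap device. Below is a minimal Python sketch. The function name is illustrative, and the sample text is reconstructed from the log line above on the assumption that each #011 stands for a tab and that the header and the device entry sit on separate lines:

```python
def active_swap_devices(proc_swaps_text):
    """Return the device names listed in /proc/swaps-style text."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header; every further line is one device.
    return [line.split()[0] for line in lines[1:] if line.strip()]


# Reconstructed from the kubelet log message above (assumed layout).
sample = (
    "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"
    "/dev/xvda5 partition\t4191228\t0\t-1\n"
)

print(active_swap_devices(sample))

# On a real node the text would come from the file itself, for example:
#   with open("/proc/swaps") as f:
#       print(active_swap_devices(f.read()))
```

An empty list means no swap is active. If devices are listed, swapoff -a turns swap off until the next reboot, and removing the swap entry from /etc/fstab keeps it off afterwards. The alternative named in the error message is to start the kubelet with --fail-swap-on set to false.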

Problem with Kubernetes in Google Cloud stuck with ContainerCreating status

I'm having a problem with my GKE cluster, all the pods are stuck with ContainerCreating status. When I run the kubectl get events I see this error:
Failed create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Does anyone know what is happening? I can't find a solution anywhere.
EDIT
I saw this post https://github.com/kubernetes/kubernetes/issues/44273 saying that GKE instances that are smaller than the default Google instance type for GKE (n1-standard-1) can have network problems. So I changed my instances to the default type, but without success. Here are my node and pod descriptions:
Name: gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-bgb6
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/fluentd-ds-ready=true
beta.kubernetes.io/instance-type=n1-standard-1
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=pool-nodes-dev
failure-domain.beta.kubernetes.io/region=southamerica-east1
failure-domain.beta.kubernetes.io/zone=southamerica-east1-a
kubernetes.io/hostname=gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-bgb6
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp: Thu, 27 Sep 2018 20:27:47 -0300
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
KernelDeadlock False Fri, 28 Sep 2018 09:58:58 -0300 Thu, 27 Sep 2018 20:27:16 -0300 KernelHasNoDeadlock kernel has no deadlock
FrequentUnregisterNetDevice False Fri, 28 Sep 2018 09:58:58 -0300 Thu, 27 Sep 2018 20:32:18 -0300 UnregisterNetDevice node is functioning properly
NetworkUnavailable False Thu, 27 Sep 2018 20:27:48 -0300 Thu, 27 Sep 2018 20:27:48 -0300 RouteCreated NodeController create implicit route
OutOfDisk False Fri, 28 Sep 2018 09:59:03 -0300 Thu, 27 Sep 2018 20:27:47 -0300 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 28 Sep 2018 09:59:03 -0300 Thu, 27 Sep 2018 20:27:47 -0300 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 28 Sep 2018 09:59:03 -0300 Thu, 27 Sep 2018 20:27:47 -0300 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 28 Sep 2018 09:59:03 -0300 Thu, 27 Sep 2018 20:27:47 -0300 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 28 Sep 2018 09:59:03 -0300 Thu, 27 Sep 2018 20:28:07 -0300 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.0.2
ExternalIP:
Hostname: gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-bgb6
Capacity:
cpu: 1
ephemeral-storage: 98868448Ki
hugepages-2Mi: 0
memory: 3787608Ki
pods: 110
Allocatable:
cpu: 940m
ephemeral-storage: 47093746742
hugepages-2Mi: 0
memory: 2702168Ki
pods: 110
System Info:
Machine ID: 1e8e0ecad8f5cc7fb5851bc64513d40c
System UUID: 1E8E0ECA-D8F5-CC7F-B585-1BC64513D40C
Boot ID: 971e5088-6bc1-4151-94bf-b66c6c7ee9a3
Kernel Version: 4.14.56+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.2
Kubelet Version: v1.10.7-gke.2
Kube-Proxy Version: v1.10.7-gke.2
PodCIDR: 10.0.32.0/24
ProviderID: gce://aditumpay/southamerica-east1-a/gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-bgb6
Non-terminated Pods: (11 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system event-exporter-v0.2.1-5f5b89fcc8-xsvmg 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system fluentd-gcp-scaler-7c5db745fc-vttc9 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system fluentd-gcp-v3.1.0-sz8r8 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system heapster-v1.5.3-75486b456f-sj7k8 138m (14%) 138m (14%) 301856Ki (11%) 301856Ki (11%)
kube-system kube-dns-788979dc8f-99xvh 260m (27%) 0 (0%) 110Mi (4%) 170Mi (6%)
kube-system kube-dns-788979dc8f-9sz2b 260m (27%) 0 (0%) 110Mi (4%) 170Mi (6%)
kube-system kube-dns-autoscaler-79b4b844b9-6s8x2 20m (2%) 0 (0%) 10Mi (0%) 0 (0%)
kube-system kube-proxy-gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-bgb6 100m (10%) 0 (0%) 0 (0%) 0 (0%)
kube-system kubernetes-dashboard-598d75cb96-6nhcd 50m (5%) 100m (10%) 100Mi (3%) 300Mi (11%)
kube-system l7-default-backend-5d5b9874d5-8wk6h 10m (1%) 10m (1%) 20Mi (0%) 20Mi (0%)
kube-system metrics-server-v0.2.1-7486f5bd67-fvddz 53m (5%) 148m (15%) 154Mi (5%) 404Mi (15%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 891m (94%) 396m (42%)
memory 817952Ki (30%) 1391392Ki (51%)
Events: <none>
The other node:
Name: gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-m7bz
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/fluentd-ds-ready=true
beta.kubernetes.io/instance-type=n1-standard-1
beta.kubernetes.io/os=linux
cloud.google.com/gke-nodepool=pool-nodes-dev
failure-domain.beta.kubernetes.io/region=southamerica-east1
failure-domain.beta.kubernetes.io/zone=southamerica-east1-a
kubernetes.io/hostname=gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-m7bz
Annotations: node.alpha.kubernetes.io/ttl=0
volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp: Thu, 27 Sep 2018 20:30:05 -0300
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
KernelDeadlock False Fri, 28 Sep 2018 10:11:03 -0300 Thu, 27 Sep 2018 20:29:34 -0300 KernelHasNoDeadlock kernel has no deadlock
FrequentUnregisterNetDevice False Fri, 28 Sep 2018 10:11:03 -0300 Thu, 27 Sep 2018 20:34:36 -0300 UnregisterNetDevice node is functioning properly
NetworkUnavailable False Thu, 27 Sep 2018 20:30:06 -0300 Thu, 27 Sep 2018 20:30:06 -0300 RouteCreated NodeController create implicit route
OutOfDisk False Fri, 28 Sep 2018 10:11:49 -0300 Thu, 27 Sep 2018 20:30:05 -0300 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 28 Sep 2018 10:11:49 -0300 Thu, 27 Sep 2018 20:30:05 -0300 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 28 Sep 2018 10:11:49 -0300 Thu, 27 Sep 2018 20:30:05 -0300 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 28 Sep 2018 10:11:49 -0300 Thu, 27 Sep 2018 20:30:05 -0300 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 28 Sep 2018 10:11:49 -0300 Thu, 27 Sep 2018 20:30:25 -0300 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.0.3
ExternalIP:
Hostname: gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-m7bz
Capacity:
cpu: 1
ephemeral-storage: 98868448Ki
hugepages-2Mi: 0
memory: 3787608Ki
pods: 110
Allocatable:
cpu: 940m
ephemeral-storage: 47093746742
hugepages-2Mi: 0
memory: 2702168Ki
pods: 110
System Info:
Machine ID: f1d5cf2a0b2c5472cf6509778a7941a7
System UUID: F1D5CF2A-0B2C-5472-CF65-09778A7941A7
Boot ID: f35bebb8-acd7-4a2f-95d6-76604638aef9
Kernel Version: 4.14.56+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.2
Kubelet Version: v1.10.7-gke.2
Kube-Proxy Version: v1.10.7-gke.2
PodCIDR: 10.0.33.0/24
ProviderID: gce://aditumpay/southamerica-east1-a/gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-m7bz
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default aditum-payment-7d966c494c-wpk2t 100m (10%) 0 (0%) 0 (0%) 0 (0%)
default aditum-portal-dev-5c69d76bb6-n5d5b 100m (10%) 0 (0%) 0 (0%) 0 (0%)
default aditum-vtexapi-5c758fcfb7-rhvsn 100m (10%) 0 (0%) 0 (0%) 0 (0%)
default admin-mongo-dev-7d9f7f7d46-rrj42 100m (10%) 0 (0%) 0 (0%) 0 (0%)
default mongod-0 200m (21%) 0 (0%) 200Mi (7%) 0 (0%)
kube-system fluentd-gcp-v3.1.0-pgwfx 0 (0%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-m7bz 100m (10%) 0 (0%) 0 (0%) 0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 700m (74%) 0 (0%)
memory 200Mi (7%) 0 (0%)
Events: <none>
All of the cluster's pods are stuck:
NAMESPACE NAME READY STATUS RESTARTS AGE
default aditum-payment-7d966c494c-wpk2t 0/1 ContainerCreating 0 13h
default aditum-portal-dev-5c69d76bb6-n5d5b 0/1 ContainerCreating 0 13h
default aditum-vtexapi-5c758fcfb7-rhvsn 0/1 ContainerCreating 0 13h
default admin-mongo-dev-7d9f7f7d46-rrj42 0/1 ContainerCreating 0 13h
default mongod-0 0/1 ContainerCreating 0 13h
kube-system event-exporter-v0.2.1-5f5b89fcc8-xsvmg 0/2 ContainerCreating 0 13h
kube-system fluentd-gcp-scaler-7c5db745fc-vttc9 0/1 ContainerCreating 0 13h
kube-system fluentd-gcp-v3.1.0-pgwfx 0/2 ContainerCreating 0 16h
kube-system fluentd-gcp-v3.1.0-sz8r8 0/2 ContainerCreating 0 16h
kube-system heapster-v1.5.3-75486b456f-sj7k8 0/3 ContainerCreating 0 13h
kube-system kube-dns-788979dc8f-99xvh 0/4 ContainerCreating 0 13h
kube-system kube-dns-788979dc8f-9sz2b 0/4 ContainerCreating 0 13h
kube-system kube-dns-autoscaler-79b4b844b9-6s8x2 0/1 ContainerCreating 0 13h
kube-system kube-proxy-gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-bgb6 0/1 ContainerCreating 0 13h
kube-system kube-proxy-gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-m7bz 0/1 ContainerCreating 0 13h
kube-system kubernetes-dashboard-598d75cb96-6nhcd 0/1 ContainerCreating 0 13h
kube-system l7-default-backend-5d5b9874d5-8wk6h 0/1 ContainerCreating 0 13h
kube-system metrics-server-v0.2.1-7486f5bd67-fvddz 0/2 ContainerCreating 0 13h
One of the stuck pods:
Name: aditum-payment-7d966c494c-wpk2t
Namespace: default
Node: gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-m7bz/10.0.0.3
Start Time: Thu, 27 Sep 2018 20:30:47 -0300
Labels: io.kompose.service=aditum-payment
pod-template-hash=3852270507
Annotations: kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container aditum-payment
Status: Pending
IP:
Controlled By: ReplicaSet/aditum-payment-7d966c494c
Containers:
aditum-payment:
Container ID:
Image: gcr.io/aditumpay/aditumpaymentwebapi:latest
Image ID:
Port: 5000/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Environment:
CONNECTIONSTRING: <set to the key 'CONNECTIONSTRING' of config map 'aditum-payment-config'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qsc9k (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-qsc9k:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qsc9k
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 3m (x1737 over 13h) kubelet, gke-aditum-k8scluster--pool-nodes-dev-500ebc8b-m7bz Failed create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Thanks!
Sorry for taking so long to respond. It was a very silly problem. After I contacted Google Cloud support, I noticed that my NAT machine was not working properly. The PrivateAccess route was passing through my NAT. Thanks, everyone, for the help.
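One way to confirm this kind of outbound connectivity problem is to check, from a shell on the affected node, whether the registry endpoint named in the FailedCreatePodSandBox event answers at all. The following is only a sketch: the function name registry_reachable and the 10-second default timeout are illustrative assumptions, and it assumes curl is available on the node.

```shell
# Hypothetical helper (not from the original post).
# Succeeds if the URL returns any HTTP response within the timeout, and
# fails on connection errors or timeouts. An HTTP error status such as 401
# still counts as "reachable", because curl is run without --fail.
# $1 = URL to probe, $2 = timeout in seconds (defaults to 10).
registry_reachable() {
  curl --silent --show-error --max-time "${2:-10}" --output /dev/null "$1"
}
```

For example, running `registry_reachable "https://k8s.gcr.io/v2/" && echo reachable || echo unreachable` on the node would be expected to print "unreachable" while outbound traffic is broken, which matches the Client.Timeout error in the pod events.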
In addition to the description of your nodes, it can depend on where you are launching them from.
As mentioned in kubernetes/minikube issue 2148 or kubernetes/minikube issue 3142, pulling images from gcr.io won't work from China.
The workaround in that case is to pull the image from another source and tag it with the name the node expects:
minikube ssh \
"docker pull registry.cn-hangzhou.aliyuncs.com/google-containers/pause-amd64:3.0
docker tag registry.cn-hangzhou.aliyuncs.com/google-containers/pause-amd64:3.0 gcr.io/google_containers/pause-amd64:3.0"
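The same pull-and-tag pattern can be wrapped in a small function for other images. This is a sketch under assumptions: the function name retag_from_mirror and the DOCKER override variable are hypothetical, and it has to run in a shell where the node's Docker daemon is reachable (on minikube, inside minikube ssh).

```shell
# Hypothetical helper (not from the original answer).
# Pulls an image from a mirror, then tags it under the name the kubelet
# expects, so the node finds the image locally instead of pulling it.
# $1 = image reference on the mirror, $2 = expected image reference.
# Set DOCKER=echo for a dry run that only prints the docker arguments.
retag_from_mirror() {
  mirror_image="$1"
  expected_image="$2"
  "${DOCKER:-docker}" pull "$mirror_image" &&
    "${DOCKER:-docker}" tag "$mirror_image" "$expected_image"
}
```

With the image names from the command above, the call would be `retag_from_mirror registry.cn-hangzhou.aliyuncs.com/google-containers/pause-amd64:3.0 gcr.io/google_containers/pause-amd64:3.0`. The tag step runs only if the pull succeeds.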