kubectl get nodes shows NotReady - kubernetes

I have installed two nodes kubernetes 1.12.1 in cloud VMs, both behind internet proxy. Each VMs have floating IPs associated to connect over SSH, kube-01 is a master and kube-02 is a node. Executed export:
no_proxy=127.0.0.1,localhost,10.157.255.185,192.168.0.153,kube-02,192.168.0.25,kube-01
before running kubeadm init, but I am getting the following status for kubectl get nodes:
NAME STATUS ROLES AGE VERSION
kube-01 NotReady master 89m v1.12.1
kube-02 NotReady <none> 29s v1.12.2
Am I missing any configuration? Do I need to add 192.168.0.153 and 192.168.0.25 in respective VM's /etc/hosts?

Looks like pod network is not installed yet on your cluster . You can install weave for example with below command
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
After a few seconds, a Weave Net pod should be running on each Node and any further pods you create will be automatically attached to the Weave network.
You can install pod networks of your choice . Here is a list
after this check
$ kubectl describe nodes
check all is fine like below
Conditions:
Type Status
---- ------
OutOfDisk False
MemoryPressure False
DiskPressure False
Ready True
Capacity:
cpu: 2
memory: 2052588Ki
pods: 110
Allocatable:
cpu: 2
memory: 1950188Ki
pods: 110
next ssh to the pod which is not ready and observe kubelet logs. Most likely errors can be of certificates and authentication.
You can also use journalctl on systemd to check kubelet errors.
$ journalctl -u kubelet

Try with this
Your coredns is in pending state check with the networking plugin you have used and check the proper addons are added
check kubernates troubleshooting guide
https://kubernetes.io/docs/setup/independent/troubleshooting-kubeadm/#coredns-or-kube-dns-is-stuck-in-the-pending-state
https://kubernetes.io/docs/concepts/cluster-administration/addons/
And install the following with those
And check
kubectl get pods -n kube-system

On the off chance it might be the same for someone else, in my case, I was using the wrong AMI image to create the nodegroup.

Run
journalctl -u kubelet
Then check at node logs, if you get below error, disable the sawp using swapoff -a
"Failed to run kubelet" err="failed to run Kubelet: running with swap on is not supported, please disable swap! or set --fa
Main process exited, code=exited, status=1/FAILURE

Related

Kubernetes ALL workloads fail when deploying a single update

After I update the backend code (pushing update to gcr.io), I delete the pod. Usually a new pod spins up.
But after today the whole cluster just breaks down. I really cannot comprehend what is happening here (I did not touch any of the other items).
I am really looking in the dark here. Where do I start looking?
I see that the logs show:
0/2 nodes are available: 2 node(s) had taints that the pod didn't tolerate.
when I look this up:
kubectl describe node | grep -i taint
Taints: node.kubernetes.io/unreachable:NoSchedule
Taints: node.kubernetes.io/unreachable:NoSchedule
But I have no clue what this is or how they even get there.
EDIT:
It looks like I need to remove the taints, but I am not able to (taint not found?)
kubectl taint nodes --all node-role.kubernetes.io/unreachable-
taint "node-role.kubernetes.io/unreachable" not found
taint "node-role.kubernetes.io/unreachable" not found
Likely problem with the nodes. Debug with some of these (sample):
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 1d v1.14.2
k8s-node1 NotReady <none> 1d v1.14.2
k8s-node2 NotReady <none> 1d v1.14.2 <-- Does it say NotReady?
$ kubectl describe node k8s-node1
...
# Do you see something like this? What's the event message?
MemoryPressure...
DiskPressure...
PIDPressure...
Check if the kubelet is running on every node (it might be crashing and restarting)
ssh k8s-node1
# ps -Af | grep kubelet
# systemctl status kubelet
# journalctl -xeu kubelet
Nuclear option:
If you are using a node pool, delete your nodes and let the autoscaler restart brand new nodes.
Related question/answer.
✌️

Kubernetes HA master set up

I have made a HA Kubernetes cluster. FIrst I added a node and joined the other node as master role.
I basically did the multi etcd set up. This worked fine for me. I did the fail over testing which also worked fine. Now the problem is once I am done working, I drained and deleted the other node and then I shut down the other machine( a VM on GCP). But then my kubectl commands dont work... Let me share the steps:
kubectl get node(when multi node is set up)
NAME STATUS ROLES AGE VERSION
instance-1 Ready <none> 17d v1.15.1
instance-3 Ready <none> 25m v1.15.1
masternode Ready master 18d v1.16.0
kubectl get node ( when I shut down my other node)
root#masternode:~# kubectl get nodes
The connection to the server k8smaster:6443 was refused - did you specify the right host or port?
Any clue?
After reboot the server you need to do some step below:
sudo -i
swapoff -a
exit
strace -eopenat kubectl version

kubernetes worker node in "NotReady" status

I am trying to setup my first cluster using Kubernetes 1.13.1. The master got initialized okay, but both of my worker nodes are NotReady. kubectl describe node shows that Kubelet stopped posting node status on both worker nodes. On one of the worker nodes I get log output like
> kubelet[3680]: E0107 20:37:21.196128 3680 kubelet.go:2266] node
> "xyz" not found.
Here is the full details:
I am using Centos 7 & Kubernetes 1.13.1.
Initializing was done as follows:
[root#master ~]# kubeadm init --apiserver-advertise-address=10.142.0.4 --pod-network-cidr=10.142.0.0/24
Successfully initialized the cluster:
You can now join any number of machines by running the following on each node
as root:
`kubeadm join 10.142.0.4:6443 --token y0epoc.zan7yp35sow5rorw --discovery-token-ca-cert-hash sha256:f02d43311c2696e1a73e157bda583247b9faac4ffb368f737ee9345412c9dea4`
deployed the flannel CNI:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The join command worked fine.
[kubelet-start] Activating the kubelet service
[tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
[patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "node01" as an annotation
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the master to see this node join the cluster.
Result of kubectl get nodes:
[root#master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 9h v1.13.1
node01 NotReady <none> 9h v1.13.1
node02 NotReady <none> 9h v1.13.1
on both nodes:
[root#node01 ~]# service kubelet status
Redirecting to /bin/systemctl status kubelet.service
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2019-01-08 04:49:20 UTC; 32s ago
Docs: https://kubernetes.io/docs/
Main PID: 4224 (kubelet)
Memory: 31.3M
CGroup: /system.slice/kubelet.service
└─4224 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfi
`Jan 08 04:54:10 node01 kubelet[4224]: E0108 04:54:10.957115 4224 kubelet.go:2266] node "node01" not found`
I appreciate your advise on how to troubleshoot this.
The previous answer sounds correct. You can verify that by running
kubectl describe node node01 on the master, or wherever kubectl is correctly configured.
It seems like the reason of this error is due to incorrect subnet. In Flannel documentation it is written that you should use /16 not /24 for pod network.
NOTE: If kubeadm is used, then pass --pod-network-cidr=10.244.0.0/16
to kubeadm init to ensure that the podCIDR is set.
I tried to run kubeadm with /24 and although I had nodes in Ready state the flannel pods did not run properly which resulted in some issues.
You can check if your flannel pods are running properly by:
kubectl get pods -n kube-system if the status is other than running then it is incorrect behavior. In this case you can check details by running kubectl describe pod PODNAME -n kube-system. Try changing the subnet and update us if that fixed the problem.
I ran into almost the same problem, and in the end I found that the reason was that the firewall was not turned off. You can try the following commands:
sudo ufw disable
or
systemctl disable firewalld
or
setenforce 0

kubernetes pods stuck at containercreating

I have a raspberry pi cluster (one master , 3 nodes)
My basic image is : raspbian stretch lite
I already set up a basic kubernetes setup where a master can see all his nodes (kubectl get nodes) and they're all running.
I used a weave network plugin for the network communication
When everything is all setup i tried to run a nginx pod (first with some replica's but now just 1 pod) on my cluster as followed
kubectl run my-nginx --image=nginx
But somehow the pod get stuck in the status "Container creating" , when i run docker images i can't see the nginx image being pulled. And normally an nginx image is not that large so it had to be pulled already by now (15 minutes).
The kubectl describe pods give the error that the pod sandbox failed to create and kubernetes will rec-create it.
I searched everything about this issue and tried the solutions on stackoverflow (reboot to restart cluster, searched describe pods , new network plugin tried it with flannel) but i can't see what the actual problem is.
I did the exact same thing in Virtual box (just ubuntu not ARM ) and everything worked.
First i thougt it was a permission issue because i run everything as a normal user , but in vm i did the same thing and nothing changed.
Then i checked kubectl get pods --all-namespaces to verify that the pods for the weaver network and kube-dns are running and also nothing wrong over there .
Is this a firewall issue in Raspberry pi ?
Is the weave network plugin not compatible (even the kubernetes website says it is) with arm devices ?
I 'am guessing there is an api network problem and thats why i can't get my pod runnning on a node
[EDIT]
Log files
kubectl describe podName
>
> Name: my-nginx-9d5677d94-g44l6 Namespace: default Node: kubenode1/10.1.88.22 Start Time: Tue, 06 Mar 2018 08:24:13
> +0000 Labels: pod-template-hash=581233850
> run=my-nginx Annotations: <none> Status: Pending IP: Controlled By: ReplicaSet/my-nginx-9d5677d94 Containers:
> my-nginx:
> Container ID:
> Image: nginx
> Image ID:
> Port: 80/TCP
> State: Waiting
> Reason: ContainerCreating
> Ready: False
> Restart Count: 0
> Environment: <none>
> Mounts:
> /var/run/secrets/kubernetes.io/serviceaccount from default-token-phdv5 (ro) Conditions: Type Status
> Initialized True Ready False PodScheduled True
> Volumes: default-token-phdv5:
> Type: Secret (a volume populated by a Secret)
> SecretName: default-token-phdv5
> Optional: false QoS Class: BestEffort Node-Selectors: <none> Tolerations: node.kubernetes.io/not-ready:NoExecute for
> 300s
> node.kubernetes.io/unreachable:NoExecute for 300s Events: Type Reason Age From
> Message ---- ------ ---- ----
> ------- Normal Scheduled 5m default-scheduler Successfully assigned my-nginx-9d5677d94-g44l6 to kubenode1 Normal
> SuccessfulMountVolume 5m kubelet, kubenode1 MountVolume.SetUp
> succeeded for volume "default-token-phdv5" Warning
> FailedCreatePodSandBox 1m kubelet, kubenode1 Failed create pod
> sandbox. Normal SandboxChanged 1m kubelet, kubenode1
> Pod sandbox changed, it will be killed and re-created.
kubectl logs podName
Error from server (BadRequest): container "my-nginx" in pod "my-nginx-9d5677d94-g44l6" is waiting to start: ContainerCreating
journalctl -u kubelet gives this error
Mar 12 13:42:45 kubeMaster kubelet[16379]: W0312 13:42:45.824314 16379 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 13:42:45 kubeMaster kubelet[16379]: E0312 13:42:45.824816 16379 kubelet.go:2104] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
The problem seems to be with my network plugin. In my /etc/systemd/system/kubelet.service.d/10.kubeadm.conf . the flags for the network plugins are present ? environment= kubelet_network_args --cni-bin-dir=/etc/cni/net.d
--network-plugin=cni
Thank you all for responding to my question.
I solved my problem now. For anyone who has come to my question in the future the solution was as followed.
I cloned my raspberry pi images because i wanted a basicConfig.img for when i needed to add a new node to my cluster of when one gets down.
Weave network (the plugin i used) got confused because on every node and master the os had the same machine-id. When i deleted the machine id and created a new one (and reboot the nodes) my error got fixed.
The commands to do this was
sudo rm /etc/machine-id
sudo rm /var/lib/dbus/machine-id
sudo dbus-uuidgen --ensure=/etc/machine-id
Once again my patience was being tested. Because my kubernetes setup was normal and my raspberry pi os was normal. I founded this with the help of someone in the kubernetes community. This again shows us how important and great our IT community is. To the people of the future who will come to this question. I hope this solution will fix your error and will decrease the amount of time you will be searching after a stupid small thing.
You can see if it's network related by finding the node trying to pull the image:
kubectl describe pod <name> -n <namespace>
SSH to the node, and run docker pull nginx on it. If it's having issues pulling the image manually, then it might be network related.

kubectl get nodes not showing workers

I am following this tutorial with 2 vms running CentOS7. Everything looks fine (no errors during installation/setup) but I can't see my nodes.
NOTE:
I am running this on VMWare VMs
kub1 is my master and kub2 my worker node
kubectl get nodes output:
[root#kub1 ~]# kubectl cluster-info
Kubernetes master is running at http://kub1:8080
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
[root#kub2 ~]# kubectl cluster-info
Kubernetes master is running at http://kub1:8080
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
nodes:
[root#kub1 ~]# kubectl get nodes
[root#kub1 ~]# kubectl get nodes -a
[root#kub1 ~]#
[root#kub2 ~]# kubectl get nodes -a
[root#kub2 ~]# kubectl get no
[root#kub2 ~]#
cluster events:
[root#kub1 ~]# kubectl get events -a
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
1h 1h 1 kub2.local Node Normal Starting {kube-proxy kub2.local} Starting kube-proxy.
1h 1h 1 kub2.local Node Normal Starting {kube-proxy kub2.local} Starting kube-proxy.
1h 1h 1 kub2.local Node Normal Starting {kubelet kub2.local} Starting kubelet.
1h 1h 1 node-kub2 Node Normal Starting {kubelet node-kub2} Starting kubelet.
1h 1h 1 node-kub2 Node Normal Starting {kubelet node-kub2} Starting kubelet.
/var/log/messages:
kubelet.go:1194] Unable to construct api.Node object for kubelet: can't get ip address of node node-kub2: lookup node-kub2: no such host
QUESTION: any idea why my nodes are not shown using "kubectl get nodes"?
My issue was that the KUBELET_HOSTNAME on /etc/kubernetes/kubeletvalue didn't match the hostname.
I commented that line, then restarted the services and I could see my worker after that.
hope that helps
Not sure about your scenario, but I have solved it after 3-4 hours of efforts.
Solved
I was facing this issue, because my docker cgroup driver was different than kubernetes cgroup driver.
Just updated it to cgroupfs using following commands mentioned in doc.
cat << EOF > /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
Restart docker service service docker restart.
Reset kubernetes on slave node: kubeadm reset
Joined master again: kubeadm join <><>
It was visible on master using kubectl get nodes.
I had a similar problem after installing k8s using kubespray on fedora31, and to debug the issue, tried to run a random container directly using docker run that failed with:
docker: Error response from daemon: cgroups: cgroup mountpoint does not exist: unknown.
this is a known problem cause by cgroup version on fedora 31, and the fix is to update grub to use the previous version:
sudo dnf install grubby
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"