kubectl get pods returns inconsistent results

When I execute kubectl get pods, I get different output for the same pod.
For example:
$ kubectl get pods -n ha-rabbitmq
NAME READY STATUS RESTARTS AGE
rabbitmq-ha-0 1/1 Running 0 85m
rabbitmq-ha-1 1/1 Running 9 84m
rabbitmq-ha-2 1/1 Running 0 50m
When I execute the same command again, I get a different result:
$ kubectl get pods -n ha-rabbitmq
NAME READY STATUS RESTARTS AGE
rabbitmq-ha-0 0/1 CrashLoopBackOff 19 85m
rabbitmq-ha-1 1/1 Running 9 85m
rabbitmq-ha-2 1/1 Running 0 51m
I have 2 master nodes and 5 worker nodes, initialized with kubeadm. Each master node runs one instance of the built-in (stacked) etcd pod.
Result of kubectl get nodes:
$ kubectl get nodes -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-meb-master1 Ready master 14d v1.14.3 10.30.29.11 <none> Ubuntu 18.04.2 LTS 4.15.0-51-generic docker://18.9.5
k8s-meb-master2 Ready master 14d v1.14.3 10.30.29.12 <none> Ubuntu 18.04.2 LTS 4.15.0-51-generic docker://18.9.6
k8s-meb-worker1 Ready <none> 14d v1.14.3 10.30.29.13 <none> Ubuntu 18.04.2 LTS 4.15.0-51-generic docker://18.9.5
k8s-meb-worker2 Ready <none> 14d v1.14.3 10.30.29.14 <none> Ubuntu 18.04.2 LTS 4.15.0-51-generic docker://18.9.5
k8s-meb-worker3 Ready <none> 14d v1.14.3 10.30.29.15 <none> Ubuntu 18.04.2 LTS 4.15.0-51-generic docker://18.9.5
k8s-meb-worker4 Ready <none> 14d v1.14.2 10.30.29.16 <none> Ubuntu 18.04.2 LTS 4.15.0-51-generic docker://18.9.5
k8s-meb-worker5 Ready <none> 5d19h v1.14.2 10.30.29.151 <none> Ubuntu 18.04 LTS 4.15.0-20-generic docker://18.9.5
Could this issue be related to unsynchronized contents of /var/lib/etcd/ on the master nodes?

Your pods are in the CrashLoopBackOff state.
That means that some containers inside the pod are exiting (the main process exits) and the pod gets restarted over and over again.
Depending on when you run the get po command, you might see your pod as Running (the process hasn't exited yet) or CrashLoopBackOff (Kubernetes is waiting before restarting your pod).
You can confirm this is the case by looking at the Restarts counter in the output.
I suggest you have a look at the restarting pods' logs to get an idea of why they're failing.
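For example, something like the following (using the rabbitmq-ha-0 pod from your output) usually shows why the container keeps exiting:
$ kubectl -n ha-rabbitmq describe pod rabbitmq-ha-0    # check the Events section and the Last State / exit code
$ kubectl -n ha-rabbitmq logs rabbitmq-ha-0 --previous # logs of the previous, crashed container instance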

It seems there is an etcd inconsistency between the control-plane nodes due to an incomplete etcd restoration. Please refer to this link on how to do it properly: https://medium.com/@pranaybhardwaj007/etcd-backup-and-restore-in-ha-mode-8722b97d440d
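To check whether the two etcd members actually agree on cluster membership, one option is to run etcdctl inside one of the stacked etcd pods. A rough sketch, assuming the default kubeadm certificate paths, the pod name etcd-k8s-meb-master1, and that the etcd image ships a shell:
$ kubectl -n kube-system exec etcd-k8s-meb-master1 -- sh -c \
  "ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
   --cert=/etc/kubernetes/pki/etcd/server.crt \
   --key=/etc/kubernetes/pki/etcd/server.key \
   member list -w table"
If the member list differs between the two masters (for example, each etcd thinks it is a single-member cluster), the restore was indeed incomplete.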

Related

Minikube Kubernetes pending pod on AWS EC2

I've tried many times to install Kubernetes on the latest stable Debian release on an AWS EC2 instance (2 vCPU, 4 GB RAM, 10 GB HD).
I've also now tried Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-1084-aws x86_64) on an AWS EC2 instance with the same compute configuration.
I've installed docker, kubectl, docker-cri, crictl and minikube, but the Kubernetes node is NotReady and the pods stay Pending. The blocking point for me is the CNI: the coredns pods are Pending and I see a few strange things in the logs, but I don't know how to solve it.
I've also tried to install Calico, as you can see from the Calico pods. It's the first time I'm installing Kubernetes and Minikube.
Minikube is started with the following command: minikube start --vm-driver=none
minikube version: v1.27.1
root@awsec2:~# minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
root@awsec2:~# docker version
Client:
Version: 20.10.7
API version: 1.41
Go version: go1.13.8
Git commit: 20.10.7-0ubuntu5~18.04.3
Built: Mon Nov 1 01:04:14 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
root@ip-172-31-37-142:~# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-awsec2-ip NotReady control-plane 10h v1.25.2 172.31.37.142 Ubuntu 18.04.6 LTS 5.4.0-1084-aws docker://20.10.7
root@aws:~# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default hello-minikube 0/1 Pending 0 10h
kube-system coredns-565d847f94-kmbdr 0/1 Pending 0 11h
kube-system etcd-ip-172-31-37-142 1/1 Running 1 (10h ago) 11h
kube-system kube-apiserver-ip-172-31-37-142 1/1 Running 1 (10h ago) 11h
kube-system kube-controller-manager-ip-172-31-37-142 1/1 Running 1 (10h ago) 11h
kube-system kube-proxy-dff99 1/1 Running 1 (10h ago) 11h
kube-system kube-scheduler-ip-172-31-37-142 1/1 Running 1 (10h ago) 11h
kube-system storage-provisioner 0/1 Pending 0 11h
tigera-operator tigera-operator-6675dc47f4-gngrn 1/1 Running 2 (7m ago) 10h
In the minikube logs output I've seen this error, but I don't know how to solve it:
==> kubelet <==
-- Logs begin at Tue 2022-10-18 21:26:09 UTC, end at Wed 2022-10-19 08:57:51 UTC. --
Oct 19 08:52:52 ip-172-31-37-142 kubelet[17361]: E1019 08:52:52.018304 17361 kubelet.go:2373] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized"
I'd appreciate it if someone could explain how to correct this, as it should be a very standard issue.
Now, I've found a workaround that will work for my test.
I've run Minikube with Kubernetes 1.23 instead of 1.24.
I've run this command :
minikube start --vm-driver=none --kubernetes-version=v1.23.0
I didn't set up the Calico CNI this time, and my node and hello-world pod are now running correctly.
I will test it like that, then I will try to upgrade to Kube 1.24.
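For what it's worth, the "cni config uninitialized" error in the kubelet log normally goes away once a CNI plugin is actually installed and has written its config under /etc/cni/net.d/. A rough sketch using the Tigera/Calico operator manifest that also appears further down this page (the custom-resources.yaml step is an assumption based on the standard operator install flow):
$ kubectl create -f https://projectcalico.docs.tigera.io/manifests/tigera-operator.yaml
$ kubectl create -f https://projectcalico.docs.tigera.io/manifests/custom-resources.yaml
$ ls /etc/cni/net.d/    # a Calico conflist should appear here once the plugin is up
$ kubectl get nodes     # the node should turn Ready shortly afterwards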

How to make k8s imagePullPolicy = never work?

I have followed the instructions on this blog to create a simple container image and deploy it in a k8s cluster.
However, in my case the pods do not run:
student@master:~$ k get pod -o wide -l app=hello-python --field-selector spec.nodeName=master
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hello-python-58547cf485-7l8dg 0/1 ErrImageNeverPull 0 2m26s 192.168.219.126 master <none> <none>
hello-python-598c594dc5-4c9zd 0/1 ErrImageNeverPull 0 2m26s 192.168.219.67 master <none> <none>
student@master:~$ sudo podman images hello-python
REPOSITORY TAG IMAGE ID CREATED SIZE
localhost/hello-python latest 11cf1e5a86b1 50 minutes ago 941 MB
student@master:~$ hostname
master
student@master:~$
I understand why it may not work on the worker node, but why does it not work on the master node, the same node where the image is cached?
student@master:~$ k describe pod hello-python-58547cf485-7l8dg | grep -A 10 'Events:'
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned default/hello-python-58547cf485-7l8dg to master
Warning Failed 8m7s (x12 over 10m) kubelet Error: ErrImageNeverPull
Warning ErrImageNeverPull 4m59s (x27 over 10m) kubelet Container image "localhost/hello-python:latest" is not present with pull policy of Never
student@master:~$
My question is: how can I make the pod run on the master node with imagePullPolicy: Never, given that the image in question is available on the master node, as the podman images output attests?
EDIT 1
I am using a k8s cluster running on two VMs deployed in GCE. It was set up with a script provided in the context of the Linux Foundation Kubernetes Developer course LFD259.
EDIT 2
The master node is allowed to run workloads - this is how the LFD259 course sets it up. For example:
student@master:~$ k create deployment xyz --image=httpd
deployment.apps/xyz created
student@master:~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xyz-6c6bd4cd89-qn4zr 1/1 Running 0 5m37s 192.168.171.66 worker <none> <none>
student@master:~$
student@master:~$ k scale deployment xyz --replicas=10
deployment.apps/xyz scaled
student@master:~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xyz-6c6bd4cd89-c2xv4 1/1 Running 0 73s 192.168.219.71 master <none> <none>
xyz-6c6bd4cd89-g89k2 0/1 ContainerCreating 0 73s <none> master <none> <none>
xyz-6c6bd4cd89-jfftl 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-kbdnq 1/1 Running 0 73s 192.168.219.106 master <none> <none>
xyz-6c6bd4cd89-nm6rt 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-qn4zr 1/1 Running 0 7m22s 192.168.171.66 worker <none> <none>
xyz-6c6bd4cd89-vts6x 1/1 Running 0 73s 192.168.171.84 worker <none> <none>
xyz-6c6bd4cd89-wd2ls 1/1 Running 0 73s 192.168.171.127 worker <none> <none>
xyz-6c6bd4cd89-wv4jn 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-xvtlm 0/1 ContainerCreating 0 73s <none> master <none> <none>
student@master:~$
It depends on how you've set up your Kubernetes cluster. I assume you've installed it with kubeadm. By default the master is not schedulable for workloads, and as I understand it the image you're talking about only exists on the master node, right? If that's the case you can't start a pod with that image, because it only exists on the master node, which doesn't allow workloads by default.
If you were to copy the image to the worker node, your given command should work.
However, if you want to make your master node schedulable, just remove its taint with the command below (you may need to amend the key if it differs from yours):
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
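For completeness, a minimal sketch of a Deployment that pins the pod to the master and uses the Never pull policy (the image and node names are taken from the question; note that the image also has to be present in the container runtime the kubelet actually uses, e.g. containerd or CRI-O, not only in podman's local storage):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-python
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-python
  template:
    metadata:
      labels:
        app: hello-python
    spec:
      nodeName: master                    # schedule directly onto the master node
      containers:
      - name: hello-python
        image: localhost/hello-python:latest
        imagePullPolicy: Never            # never pull; the image must already exist on the node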

There are 2 networking components installed on the master node, Weave and Calico. How can I completely remove Calico from my Kubernetes cluster?

Weave overlaps with the host's IP address and its pod is stuck in the CrashLoopBackOff state. I need to remove Calico first, as I have no idea how two networking modules can work on the master at the same time!
emo@master:~$ sudo kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-64897985d-dw6ch 0/1 ContainerCreating 0
kube-system coredns-64897985d-xr6br 0/1 ContainerCreating 0
kube-system etcd-master 1/1 Running 26 (14m ago)
kube-system kube-apiserver-master 1/1 Running 26 (12m ago)
kube-system kube-controller-manager-master 1/1 Running 4 (20m ago)
kube-system kube-proxy-g98ph 1/1 Running 3 (20m ago)
kube-system kube-scheduler-master 1/1 Running 4 (20m ago)
kube-system weave-net-56n8k 1/2 CrashLoopBackOff 76 (54s ago)
tigera-operator tigera-operator-b876f5799-sqzf9 1/1 Running 6 (5m57s ago)
master:
emo@master:~$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master Ready control-plane,master 6d19h v1.23.5 192.168.71.132 <none> Ubuntu 20.04.3 LTS 5.4.0-81-generic containerd://1.5.5
You may need to re-build your cluster after cleaning it up.
First, run kubectl delete for all the manifests you have applied to configure calico and weave. (e.g. kubectl delete -f https://projectcalico.docs.tigera.io/manifests/tigera-operator.yaml)
Then run kubeadm reset and clean out /etc/cni/net.d/ to delete all of your CNI configurations. After that, you also need to reboot the server to clear some stale ip link records, or remove them manually with ip link delete {name}.
A fresh installation should then go smoothly.
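Put together, a rough sketch of the cleanup (the interface names at the end are only examples of what Calico and Weave typically leave behind):
$ kubectl delete -f https://projectcalico.docs.tigera.io/manifests/tigera-operator.yaml
$ sudo kubeadm reset
$ sudo rm -f /etc/cni/net.d/*
$ sudo reboot               # or remove the stale interfaces by hand, e.g.:
$ sudo ip link delete vxlan.calico
$ sudo ip link delete weave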

Why is the pod running on the master nodes?

My kubernetes cluster looks as follow:
k get nodes
NAME STATUS ROLES AGE VERSION
k8s-1 Ready master 2d22h v1.16.2
k8s-2 Ready master 2d22h v1.16.2
k8s-3 Ready master 2d22h v1.16.2
k8s-4 Ready master 2d22h v1.16.2
k8s-5 Ready <none> 2d22h v1.16.2
k8s-6 Ready <none> 2d22h v1.16.2
k8s-7 Ready <none> 2d22h v1.16.2
As you can see, the cluster consists of 4 masters and 3 worker nodes.
These are the running pods:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default greeter-service-v1-8d97f9bcd-2hf4x 2/2 Running 0 47h 10.233.69.7 k8s-6 <none> <none>
default greeter-service-v1-8d97f9bcd-gnsvp 2/2 Running 0 47h 10.233.65.3 k8s-2 <none> <none>
default greeter-service-v1-8d97f9bcd-lkt6p 2/2 Running 0 47h 10.233.68.9 k8s-7 <none> <none>
default helloweb-77c9476f6d-7f76v 2/2 Running 0 47h 10.233.64.3 k8s-1 <none> <none>
default helloweb-77c9476f6d-pj494 2/2 Running 0 47h 10.233.69.8 k8s-6 <none> <none>
default helloweb-77c9476f6d-tnqfb 2/2 Running 0 47h 10.233.70.7 k8s-5 <none> <none>
Why are the pods greeter-service-v1-8d97f9bcd-gnsvp and helloweb-77c9476f6d-7f76v running on the master?
By default, there is no restriction on a Pod being scheduled on a master unless there is a taint like node-role.kubernetes.io/master:NoSchedule.
You can verify if there is any taint on master node using
kubectl describe node k8s-1
or
kubectl get node k8s-1 -o jsonpath='{.spec.taints[*]}' && echo
If you want to add the taint, use the command below:
kubectl taint node k8s-1 node-role.kubernetes.io/master="":NoSchedule
After adding taint, no new pods will be scheduled on this node unless there is matching toleration on Pod spec.
Read more about Taints and Tolerations in the Kubernetes documentation.
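For illustration, a minimal Pod spec with a matching toleration (the pod name and image are hypothetical):
apiVersion: v1
kind: Pod
metadata:
  name: runs-on-master
spec:
  tolerations:
  - key: "node-role.kubernetes.io/master"  # must match the taint key used above
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx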

Kubeadm Failed to create SubnetManager: error retrieving pod spec for kube-system

No matter what I do, it seems I cannot get rid of this problem. I have installed Kubernetes using kubeadm many times quite successfully; however, adding a v1.16.0 node is giving me a heck of a headache.
O/S: Ubuntu 18.04.3 LTS
Kubernetes version: v1.16.0
Kubeadm version: Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:34:01Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"
A query of the cluster shows:
NAME STATUS ROLES AGE VERSION
kube-apiserver-1 Ready master 110d v1.15.0
kube-apiserver-2 Ready master 110d v1.15.0
kube-apiserver-3 Ready master 110d v1.15.0
kube-node-1 Ready <none> 110d v1.15.0
kube-node-2 Ready <none> 110d v1.15.0
kube-node-3 Ready <none> 110d v1.15.0
kube-node-4 Ready <none> 110d v1.16.0
kube-node-5 Ready,SchedulingDisabled <none> 3m28s v1.16.0
kube-node-databases Ready <none> 110d v1.15.0
I have temporarily disabled scheduling to the node until I can fix this problem. A query of the pod status in the kube-system namespace shows the problem:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-55zjs 1/1 Running 128 21d
coredns-fb8b8dccf-kzrpc 1/1 Running 144 21d
kube-flannel-ds-amd64-29xp2 1/1 Running 11 110d
kube-flannel-ds-amd64-hp7nq 1/1 Running 14 110d
kube-flannel-ds-amd64-hvdpf 0/1 CrashLoopBackOff 5 8m28s
kube-flannel-ds-amd64-jhhlk 1/1 Running 11 110d
kube-flannel-ds-amd64-k6dzc 1/1 Running 2 110d
kube-flannel-ds-amd64-lccxl 1/1 Running 21 110d
kube-flannel-ds-amd64-nnn7g 1/1 Running 14 110d
kube-flannel-ds-amd64-shss5 1/1 Running 7 110d
$ kubectl -n kube-system logs -f kube-flannel-ds-amd64-hvdpf
I1002 01:13:22.136379 1 main.go:514] Determining IP address of default interface
I1002 01:13:22.136823 1 main.go:527] Using interface with name ens3 and address 192.168.5.46
I1002 01:13:22.136849 1 main.go:544] Defaulting external address to interface address (192.168.5.46)
E1002 01:13:52.231471 1 main.go:241] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-hvdpf': Get https://10.96.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-ds-amd64-hvdpf: dial tcp 10.96.0.1:443: i/o timeout
Although I found a few hits about iptables issues and kernel routing, I don't understand why previous versions installed without a hitch but this version is giving me such a problem.
I have installed this node and destroyed it quite a few times yet the result is always the same.
Anyone else having this issue or has a solution?
This occurs when it is not able to look up the host. Add the entries below to the flannel container's environment, after the name: POD_NAMESPACE entry:
- name: KUBERNETES_SERVICE_HOST
  value: "10.220.64.186" # IP address of the host where the kube-apiserver is running
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
According to the documentation on the version skew policy:
kubelet
kubelet must not be newer than kube-apiserver, and may be up to two minor versions older.
Example:
kube-apiserver is at 1.13
kubelet is supported at 1.13, 1.12, and 1.11
That means that a worker node at version v1.16.0 is not supported with a master node at version v1.15.0.
To fix this issue, I recommend reinstalling the node with version v1.15.0 to match the rest of the cluster.
Optionally, you can upgrade the whole cluster to v1.16.1; however, there are some problems with running flannel as the network plugin at the moment. Please review the upgrade guide in the documentation before proceeding.
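If you go the reinstall route, a rough sketch of pinning the node back to v1.15.0 on Ubuntu 18.04 (the package versions assume the standard apt.kubernetes.io repository):
$ sudo apt-mark unhold kubelet kubeadm kubectl
$ sudo apt-get install -y --allow-downgrades kubelet=1.15.0-00 kubeadm=1.15.0-00 kubectl=1.15.0-00
$ sudo apt-mark hold kubelet kubeadm kubectl
$ sudo systemctl restart kubelet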