Using flannel as the CNI in Kubernetes, I am trying to implement a network for pod-to-pod communication across different Vagrant VMs. I am using https://raw.githubusercontent.com/coreos/flannel/v0.9.0/Documentation/kube-flannel.yml to create the flannel pods. But the kube-flannel pods go into CrashLoopBackOff and do not start.
[root#flnode-04 ~]# kubectl get pods -o wide --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
diamanti-system collectd-v0.5-flnode-04 1/1 Running 0 3h 192.168.30.14 flnode-04
diamanti-system collectd-v0.5-flnode-05 1/1 Running 0 3h 192.168.30.15 flnode-05
diamanti-system collectd-v0.5-flnode-06 1/1 Running 0 3h 192.168.30.16 flnode-06
diamanti-system provisioner-d4kvf 1/1 Running 0 3h 192.168.30.16 flnode-06
kube-system kube-flannel-ds-2kqpv 0/1 CrashLoopBackOff 1 18m 192.168.30.14 flnode-04
kube-system kube-flannel-ds-xgqdm 0/1 CrashLoopBackOff 1 18m 192.168.30.16 flnode-06
kube-system kube-flannel-ds-z59jz 0/1 CrashLoopBackOff 1 18m 192.168.30.15 flnode-05
Here are the logs of one of the pods:
[root#flnode-04 ~]# kubectl logs kube-flannel-ds-2kqpv --namespace=kube-system
I0327 10:28:44.103425 1 main.go:483] Using interface with name mgmt0 and address 192.168.30.14
I0327 10:28:44.105609 1 main.go:500] Defaulting external address to interface address (192.168.30.14)
I0327 10:28:44.138132 1 kube.go:130] Waiting 10m0s for node controller to sync
I0327 10:28:44.138213 1 kube.go:283] Starting kube subnet manager
I0327 10:28:45.138509 1 kube.go:137] Node controller sync successful
I0327 10:28:45.138588 1 main.go:235] Created subnet manager: Kubernetes Subnet Manager - flnode-04
I0327 10:28:45.138596 1 main.go:238] Installing signal handlers
I0327 10:28:45.138690 1 main.go:348] Found network config - Backend type: vxlan
I0327 10:28:45.138767 1 vxlan.go:119] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
panic: assignment to entry in nil map
goroutine 1 [running]:
github.com/coreos/flannel/subnet/kube.(*kubeSubnetManager).AcquireLease(0xc420010cd0, 0x7f5314399bd0, 0xc420347880, 0xc4202213e0, 0x6, 0xf54, 0xc4202213e0)
/go/src/github.com/coreos/flannel/subnet/kube/kube.go:239 +0x1f7
github.com/coreos/flannel/backend/vxlan.(*VXLANBackend).RegisterNetwork(0xc4200b3480, 0x7f5314399bd0, 0xc420347880, 0xc420010c30, 0xc4200b3480, 0x0, 0x0, 0x4d0181)
/go/src/github.com/coreos/flannel/backend/vxlan/vxlan.go:141 +0x44e
main.main()
/go/src/github.com/coreos/flannel/main.go:278 +0x8ae
What exactly is the reason for the flannel pods going into CrashLoopBackOff, and what is the solution?
I was able to solve the problem by running the command:
kubectl annotate node appserv9 flannel.alpha.coreos.com/public-ip=10.10.10.10 --overwrite=true
Reason for the bug: an assignment to a nil map in the flannel code (the node had no annotations yet, so there was no map to write the public-ip key into).
It doesn't matter which IP you give, but this command has to be run individually on all the nodes so that the assignment above no longer hits a nil map.
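A small sketch (not part of the original answer) that applies the same annotation to every node in one go, using each node's InternalIP as the value:
# Annotate every node with its own InternalIP so flannel's kube subnet manager
# finds the public-ip annotation it expects (any IP would do, per the note above).
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  ip=$(kubectl get node "$node" -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
  kubectl annotate node "$node" flannel.alpha.coreos.com/public-ip="$ip" --overwrite=true
done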
If you deploy the cluster with kubeadm init --pod-network-cidr=<network/mask>, this network/mask should match the Network value in the ConfigMap in kube-flannel.yaml.
My ConfigMap looks like:
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
data:
  ...
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
So the network/mask should equal 10.244.0.0/16
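For completeness, with this default flannel configuration the matching init call would simply be (all other flags omitted):
kubeadm init --pod-network-cidr=10.244.0.0/16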
I'm trying to understand how Kubernetes works, so I tried this operation on my minikube:
~ kubectl delete pod --all -n kube-system
pod "coredns-f9fd979d6-5n4b6" deleted
pod "etcd-minikube" deleted
pod "kube-apiserver-minikube" deleted
pod "kube-controller-manager-minikube" deleted
pod "kube-proxy-879lg" deleted
pod "kube-scheduler-minikube" deleted
Okay, the pods are deleted as expected. But if I then run kubectl get pods -n kube-system I see:
NAME READY STATUS RESTARTS AGE
coredns-f9fd979d6-5d25r 1/1 Running 0 50s
etcd-minikube 1/1 Running 0 50s
kube-apiserver-minikube 1/1 Running 0 50s
kube-controller-manager-minikube 1/1 Running 0 50s
kube-proxy-nlw69 1/1 Running 0 43s
kube-scheduler-minikube 1/1 Running 0 49s
Okay. I thought they were recreated by a ReplicaSet or DaemonSet:
➜ ~ kubectl get ds -n kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-proxy 1 1 1 1 1 kubernetes.io/os=linux 18m
➜ ~ kubectl get rs -n kube-system
NAME DESIRED CURRENT READY AGE
coredns-f9fd979d6 1 1 1 18m
That is true for coredns and kube-proxy. But what about the others (apiserver, etcd, controller-manager and scheduler)? Why are they still alive?
The control plane pods run as static Pods. Static Pods are not managed by control-plane controllers such as DaemonSets and ReplicaSets; they are instead managed directly by the kubelet daemon on the local node, which recreates them from their manifest files whenever they are deleted.
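You can see this for yourself: in a kubeadm-based setup (minikube uses kubeadm under the hood), the static Pod manifests live in /etc/kubernetes/manifests on the node, and the kubelet recreates any Pod whose manifest file is present there. A quick check, assuming a standard minikube install:
minikube ssh -- ls /etc/kubernetes/manifests
This typically lists etcd.yaml, kube-apiserver.yaml, kube-controller-manager.yaml and kube-scheduler.yaml.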
I start k8s on my local machine and find most pods are not ready:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-65dbdb44db-zrfpm 0/1 Running 0 3m9s
kube-system dashboard-metrics-scraper-545bbb8767-nfnbg 1/1 Running 1 16m
kube-system kube-flannel-ds-amd64-nm7kr 0/1 CrashLoopBackOff 5 16m
kube-system kubernetes-dashboard-65665f84db-rqxpv 0/1 CrashLoopBackOff 4 118s
kube-system metrics-server-869ffc99cd-6fhfl 0/1 CrashLoopBackOff 5 16m
Then I check the status of flannel:
kubectl logs kube-flannel-ds-amd64-nm7kr -n kube-system
The log tells me that something is unauthorized:
I0809 10:23:51.307347 1 main.go:518] Determining IP address of default interface
I0809 10:23:51.308840 1 main.go:531] Using interface with name wlo1 and address 192.168.1.102
I0809 10:23:51.308894 1 main.go:548] Defaulting external address to interface address (192.168.1.102)
W0809 10:23:51.308917 1 client_config.go:517] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
E0809 10:23:51.620449 1 main.go:243] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-ds-amd64-nm7kr': Unauthorized
Then I check the coredns:
E0809 10:28:06.807582 1 reflector.go:153] pkg/mod/k8s.io/client-go#v0.17.2/tools/cache/reflector.go:105: Failed to list *v1.Namespace: Unauthorized
Then I check the kubernetes-dashboard:
2020/08/09 10:28:06 Starting overwatch
2020/08/09 10:28:06 Using namespace: kube-system
2020/08/09 10:28:06 Using in-cluster config to connect to apiserver
2020/08/09 10:28:06 Using secret token for csrf signing
2020/08/09 10:28:06 Initializing csrf token from kubernetes-dashboard-csrf secret
panic: Unauthorized
goroutine 1 [running]:
github.com/kubernetes/dashboard/src/app/backend/client/csrf.(*csrfTokenManager).init(0xc00036f0e0)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/csrf/manager.go:41 +0x446
github.com/kubernetes/dashboard/src/app/backend/client/csrf.NewCsrfTokenManager(...)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/csrf/manager.go:66
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).initCSRFKey(0xc000114080)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:501 +0xc6
github.com/kubernetes/dashboard/src/app/backend/client.(*clientManager).init(0xc000114080)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:469 +0x47
github.com/kubernetes/dashboard/src/app/backend/client.NewClientManager(...)
/home/travis/build/kubernetes/dashboard/src/app/backend/client/manager.go:550
main.main()
/home/travis/build/kubernetes/dashboard/src/app/backend/dashboard.go:105 +0x20d
I am using Kubernetes 1.18.2.
So what is going wrong here?
I am building a Kubernetes cluster following this tutorial, and I am having trouble accessing the Kubernetes dashboard. I already created another question about it that you can see here, but while digging into my cluster I came to think that the problem might be somewhere else, and that's why I am creating a new question.
I start my master, by running the following commands:
> kubeadm reset
> kubeadm init --apiserver-advertise-address=[MASTER_IP] > file.txt
> tail -2 file.txt > join.sh # I keep this file for later
> kubectl apply -f https://git.io/weave-kube/
> kubectl -n kube-system get pod
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-kb2zq 0/1 Pending 0 2m46s
coredns-fb8b8dccf-nnc5n 0/1 Pending 0 2m46s
etcd-kubemaster 1/1 Running 0 93s
kube-apiserver-kubemaster 1/1 Running 0 93s
kube-controller-manager-kubemaster 1/1 Running 0 113s
kube-proxy-lxhvs 1/1 Running 0 2m46s
kube-scheduler-kubemaster 1/1 Running 0 93s
Here we can see that I have two coredns pods stuck in Pending state forever, and when I run the command:
> kubectl -n kube-system describe pod coredns-fb8b8dccf-kb2zq
I can see the following Warning in the Events part:
FailedScheduling: 0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
Since it is a Warning and not an Error, and since, as a Kubernetes newbie, taints do not mean much to me, I tried to connect a node to the master (using the previously saved command):
> cat join.sh
kubeadm join [MASTER_IP]:6443 --token [TOKEN] \
--discovery-token-ca-cert-hash sha256:[ANOTHER_TOKEN]
> ssh [USER]@[WORKER_IP] 'bash' < join.sh
This node has joined the cluster.
On the master, I check that the node is connected:
> kubectl get nodes
NAME STATUS ROLES AGE VERSION
kubemaster NotReady master 13m v1.14.1
kubeslave1 NotReady <none> 31s v1.14.1
And I check my pods:
> kubectl -n kube-system get pod
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-kb2zq 0/1 Pending 0 14m
coredns-fb8b8dccf-nnc5n 0/1 Pending 0 14m
etcd-kubemaster 1/1 Running 0 13m
kube-apiserver-kubemaster 1/1 Running 0 13m
kube-controller-manager-kubemaster 1/1 Running 0 13m
kube-proxy-lxhvs 1/1 Running 0 14m
kube-proxy-xllx4 0/1 ContainerCreating 0 2m16s
kube-scheduler-kubemaster 1/1 Running 0 13m
We can see that another kube-proxy pod has been created and is stuck in ContainerCreating status.
And when I do a describe again:
kubectl -n kube-system describe pod kube-proxy-xllx4
I can see multiple identical Warnings in the Events part:
FailedCreatePodSandBox: rpc error: code = Unknown desc = failed pulling image "k8s.gcr.io/pause:3.1": Get https://k8s.gcr.io/v1/_ping: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:43133->[::1]:53: read: connection refused
Here are my repositories:
docker image ls
REPOSITORY TAG
k8s.gcr.io/kube-proxy v1.14.1
k8s.gcr.io/kube-apiserver v1.14.1
k8s.gcr.io/kube-controller-manager v1.14.1
k8s.gcr.io/kube-scheduler v1.14.1
k8s.gcr.io/coredns 1.3.1
k8s.gcr.io/etcd 3.3.10
k8s.gcr.io/pause 3.1
And so, for the dashboard part, I tried to start it with the command
> kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/master/aio/deploy/recommended/kubernetes-dashboard.yaml
But the dashboard pod is stuck in Pending state.
kubectl -n kube-system get pod
NAME READY STATUS RESTARTS AGE
coredns-fb8b8dccf-kb2zq 0/1 Pending 0 40m
coredns-fb8b8dccf-nnc5n 0/1 Pending 0 40m
etcd-kubemaster 1/1 Running 0 38m
kube-apiserver-kubemaster 1/1 Running 0 38m
kube-controller-manager-kubemaster 1/1 Running 0 39m
kube-proxy-lxhvs 1/1 Running 0 40m
kube-proxy-xllx4 0/1 ContainerCreating 0 27m
kube-scheduler-kubemaster 1/1 Running 0 38m
kubernetes-dashboard-5f7b999d65-qn8qn 1/1 Pending 0 8s
So, even though my problem originally was that I cannot access my dashboard, I guess that the real problem is deeper than that.
I know that I just put a lot of information here, but I am a k8s beginner and I am completely lost on this.
I experienced this issue of coredns pods stuck in Pending state when setting up my own cluster, and I resolved it by adding a pod network.
It looks like the nodes are tainted as not-ready because no network add-on is installed. Installing the add-on removes the taint and the Pods can then be scheduled. In my case, adding flannel fixed the issue.
EDIT: There is a note about this in the official k8s documentation - Create cluster with kubeadm:
The network must be deployed before any applications. Also, CoreDNS will not start up before a network is installed. kubeadm only supports Container Network Interface (CNI) based networks (and does not support kubenet).
Actually it is the opposite of a deep or serious issue; this is a trivial one. Whenever you see a pod stuck in Pending state, it means the scheduler is having a hard time scheduling it, mostly because there are not enough resources on the node.
In your case the node has a taint, and your pod doesn't have the corresponding toleration. What you have to do is describe the node and get the taint:
kubectl describe node | grep -i taints
Note: you might have more than one taint, so you might want to do kubectl describe no NODE instead, since with grep you will only see one taint.
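An alternative (my suggestion, not part of the original answer) that lists every taint on every node at once:
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints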
Once you get the taint, which will be something like hello=world:NoSchedule (i.e. key=value:effect), you will have to add a tolerations section to your Deployment. This is an example Deployment so you can see how it should look:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 10
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        ports:
        - containerPort: 80
          name: http
      tolerations:
      - effect: NoExecute #NoSchedule, PreferNoSchedule
        key: node
        operator: Equal
        value: not-ready
        tolerationSeconds: 3600
As you can see, there is a tolerations section in the yaml. So, if I had a node with the node=not-ready:NoExecute taint, no pod would be able to be scheduled on that node unless it had this toleration.
You can also remove the taint if you don't need it. To remove a taint, describe the node, get the key of the taint and run:
kubectl taint node NODE key-
Hope it makes sense. Just add this section to your deployment, and it will work.
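For instance, if the taint reported on the single control-plane node is node-role.kubernetes.io/master:NoSchedule (the kubeadm default; an assumption here, your describe output may differ), the removal command following the pattern above would be:
kubectl taint node kubemaster node-role.kubernetes.io/master-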
Set up the flannel network tool by running the following commands:
$ sysctl net.bridge.bridge-nf-call-iptables=1
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/62e44c867a2846fefb68bd5f178daf4da3095ccb/Documentation/kube-flannel.yml
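After applying the manifest, you can check that the flannel DaemonSet pods come up; the app=flannel label below is the one used by the coreos kube-flannel manifests:
kubectl -n kube-system get pods -l app=flannel -o wide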
I'm setting up a Kubernetes cluster at a customer site.
I've done this process multiple times before, including dealing with Vagrant specifics, and I've consistently been able to get a K8s cluster up and running without too much fuss.
Now, at this customer I'm doing the same, but I've been running into a lot of issues while setting things up, which is completely unexpected.
Compared to other places where I've set up Kubernetes, the only obvious difference I see is that there is a proxy server I constantly have to battle with. Nothing that a NO_PROXY env var hasn't been able to handle.
The main issue I'm facing is setting up Canal (Calico + Flannel).
For some reason, on Masters 2 and 3 it just won't start.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system canal-2pvpr 2/3 CrashLoopBackOff 7 14m 10.136.3.37 devmn2.cpdprd.pt
kube-system canal-rdmnl 2/3 CrashLoopBackOff 7 14m 10.136.3.38 devmn3.cpdprd.pt
kube-system canal-swxrw 3/3 Running 0 14m 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-apiserver-devmn1.cpdprd.pt 1/1 Running 1 1h 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-apiserver-devmn2.cpdprd.pt 1/1 Running 1 4h 10.136.3.37 devmn2.cpdprd.pt
kube-system kube-apiserver-devmn3.cpdprd.pt 1/1 Running 1 1h 10.136.3.38 devmn3.cpdprd.pt
kube-system kube-controller-manager-devmn1.cpdprd.pt 1/1 Running 0 15m 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-controller-manager-devmn2.cpdprd.pt 1/1 Running 0 15m 10.136.3.37 devmn2.cpdprd.pt
kube-system kube-controller-manager-devmn3.cpdprd.pt 1/1 Running 0 15m 10.136.3.38 devmn3.cpdprd.pt
kube-system kube-dns-86f4d74b45-vqdb4 0/3 ContainerCreating 0 1h <none> devmn2.cpdprd.pt
kube-system kube-proxy-4j7dp 1/1 Running 1 2h 10.136.3.38 devmn3.cpdprd.pt
kube-system kube-proxy-l2wpm 1/1 Running 1 2h 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-proxy-scm9g 1/1 Running 1 2h 10.136.3.37 devmn2.cpdprd.pt
kube-system kube-scheduler-devmn1.cpdprd.pt 1/1 Running 1 1h 10.136.3.36 devmn1.cpdprd.pt
kube-system kube-scheduler-devmn2.cpdprd.pt 1/1 Running 1 4h 10.136.3.37 devmn2.cpdprd.pt
kube-system kube-scheduler-devmn3.cpdprd.pt 1/1 Running 1 1h 10.136.3.38 devmn3.cpdprd.pt
Looking for the specific error, I've come to find out that the issue is with the kube-flannel container, which is throwing an error:
[exXXXXX#devmn1 ~]$ kubectl logs canal-rdmnl -n kube-system -c kube-flannel
I0518 16:01:22.555513 1 main.go:487] Using interface with name ens192 and address 10.136.3.38
I0518 16:01:22.556080 1 main.go:504] Defaulting external address to interface address (10.136.3.38)
I0518 16:01:22.565141 1 kube.go:130] Waiting 10m0s for node controller to sync
I0518 16:01:22.565167 1 kube.go:283] Starting kube subnet manager
I0518 16:01:23.565280 1 kube.go:137] Node controller sync successful
I0518 16:01:23.565311 1 main.go:234] Created subnet manager: Kubernetes Subnet Manager - devmn3.cpdprd.pt
I0518 16:01:23.565331 1 main.go:237] Installing signal handlers
I0518 16:01:23.565388 1 main.go:352] Found network config - Backend type: vxlan
I0518 16:01:23.565440 1 vxlan.go:119] VXLAN config: VNI=1 Port=0 GBP=false DirectRouting=false
E0518 16:01:23.565619 1 main.go:279] Error registering network: failed to acquire lease: node "devmn3.cpdprd.pt" pod cidr not assigned
I0518 16:01:23.565671 1 main.go:332] Stopping shutdownHandler...
I just can't understand why.
Some relevant info:
My clusterCIDR and podCIDR are: 192.168.151.0/25 (I know, it's weird, don't ask unless it's a huge issue)
I've set up etcd as a systemd service
I've modified the kube-controller-manager.yaml to change the mask size to 25 (otherwise the IP mentioned before wouldn't work).
I'm installing everything with kubeadm. One weird thing I did notice was that, when viewing the config (kubeadm config view), much of the information that I had set up in the kubeadm config.yaml (for kubeadm init) was not present in the config view, including the paths to the etcd certs.
I'm also not sure why that happened, but I've fixed it (hopefully) by editing the kubeadm config map (kubectl edit cm kubeadm-config -n kube-system) and saving it.
Still no luck with canal.
Can anyone help me figure out what's wrong?
I have documented pretty much every step of the configuration I've done, so if required I may be able to provide it.
EDIT:
In the meantime I figured out that indeed my master2 and master3 do not have a podCIDR associated. Why would this happen? And how can I add it?
Try to edit:
/etc/kubernetes/manifests/kube-controller-manager.yaml
and add
--allocate-node-cidrs=true
--cluster-cidr=192.168.151.0/25
then, reload kubelet.
I found this information here and it was useful for me.
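As a quick check (not part of the original answer), you can confirm whether each node now has a podCIDR assigned:
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR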
I am in the process of implementing an HA solution for Kubernetes Master nodes in a CentOS7 env.
My env looks like :
K8S_Master1 : 172.16.16.5
K8S_Master2 : 172.16.16.51
HAProxy : 172.16.16.100
K8S_Minion1 : 172.16.16.50
etcd Version: 3.1.7
Kubernetes v1.5.2
CentOS Linux release 7.3.1611 (Core)
My etcd cluster is set up properly and is in a working state.
[root#master1 ~]# etcdctl cluster-health
member 282a4a2998aa4eb0 is healthy: got healthy result from http://172.16.16.51:2379
member dd3979c28abe306f is healthy: got healthy result from http://172.16.16.5:2379
member df7b762ad1c40191 is healthy: got healthy result from http://172.16.16.50:2379
My K8S config for Master1 is :
[root#master1 ~]# cat /etc/kubernetes/apiserver
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:4001"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.100.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"
[root#master1 ~]# cat /etc/kubernetes/config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"
[root#master1 ~]# cat /etc/kubernetes/controller-manager
KUBE_CONTROLLER_MANAGER_ARGS="--leader-elect"
[root#master1 ~]# cat /etc/kubernetes/scheduler
KUBE_SCHEDULER_ARGS="--leader-elect"
As for Master2, I have configured it as:
[root#master2 kubernetes]# cat apiserver
KUBE_API_ADDRESS="--address=0.0.0.0"
KUBE_ETCD_SERVERS="--etcd_servers=http://127.0.0.1:4001"
KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.100.0.0/16"
KUBE_ADMISSION_CONTROL="--admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota"
[root#master2 kubernetes]# cat config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://127.0.0.1:8080"
[root#master2 kubernetes]# cat scheduler
KUBE_SCHEDULER_ARGS=""
[root#master2 kubernetes]# cat controller-manager
KUBE_CONTROLLER_MANAGER_ARGS=""
Note that --leader-elect is only configured on Master1 as I want Master1 to be the leader.
My HAProxy config is simple:
frontend K8S-Master
    bind 172.16.16.100:8080
    default_backend K8S-Master-Nodes

backend K8S-Master-Nodes
    mode http
    balance roundrobin
    server master1 172.16.16.5:8080 check
    server master2 172.16.16.51:8080 check
Now I have directed my minion to connect to the Load Balancer IP rather than directly to the Master IP.
The config on the Minion is:
[root#minion kubernetes]# cat /etc/kubernetes/config
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=0"
KUBE_ALLOW_PRIV="--allow_privileged=false"
KUBE_MASTER="--master=http://172.16.16.100:8080"
On both Master nodes, I see the minion/node status as Ready
[root#master1 ~]# kubectl get nodes
NAME STATUS AGE
172.16.16.50 Ready 2h
[root#master2 ~]# kubectl get nodes
NAME STATUS AGE
172.16.16.50 Ready 2h
I set up an example nginx pod using:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
I created the ReplicationController on Master1 using:
[root#master1 ~]# kubectl create -f nginx.yaml
And on both Master nodes, I was able to see the pods created.
[root#master1 ~]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-jwpxd 1/1 Running 0 29m
nginx-q613j 1/1 Running 0 29m
[root#master2 ~]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-jwpxd 1/1 Running 0 29m
nginx-q613j 1/1 Running 0 29m
Now, logically thinking, if I were to take down the Master1 node and delete the pods via Master2, Master2 should recreate the pods. So this is what I do.
On Master1 :
[root#master1 ~]# systemctl stop kube-scheduler ; systemctl stop kube-apiserver ; systemctl stop kube-controller-manager
On Master2 :
[root#slave1 kubernetes]# kubectl delete po --all
pod "nginx-l7mvc" deleted
pod "nginx-r3m58" deleted
Now Master2 should recreate the pods, since the ReplicationController is still up. But the new Pods are stuck in:
[root#master2 kubernetes]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-l7mvc 1/1 Terminating 0 13m
nginx-qv6z9 0/1 Pending 0 13m
nginx-r3m58 1/1 Terminating 0 13m
nginx-rplcz 0/1 Pending 0 13m
I've waited a long time, but the pods are stuck in this state.
However, when I restart the services on Master1:
[root#master1 ~]# systemctl start kube-scheduler ; systemctl start kube-apiserver ; systemctl start kube-controller-manager
Then I see progress on Master1:
NAME READY STATUS RESTARTS AGE
nginx-qv6z9 0/1 ContainerCreating 0 14m
nginx-rplcz 0/1 ContainerCreating 0 14m
[root#slave1 kubernetes]# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-qv6z9 1/1 Running 0 15m
nginx-rplcz 1/1 Running 0 15m
Why doesn't Master2 recreate the pods? This is the confusion I am trying to clear up. I've come a long way towards a fully functional HA setup, and it seems like I'm almost there if only I can figure out this puzzle.
In my opinion, the problem comes from the fact that Master2 does not have the --leader-elect flag enabled. Only one scheduler process and one controller-manager process can be active at a time, and that is the reason for --leader-elect: the flag makes the replicas "compete" to decide which scheduler and controller-manager process is active at any given moment. Since you did not set the flag on both master nodes, there are two active scheduler and controller-manager processes, and thus the conflicts you are experiencing. In order to fix the issue, I advise you to enable this flag on all of the master nodes.
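A minimal sketch of what Master2's config files would then look like, mirroring Master1 (paths and variable names as shown in the question):
# /etc/kubernetes/scheduler on Master2
KUBE_SCHEDULER_ARGS="--leader-elect"
# /etc/kubernetes/controller-manager on Master2
KUBE_CONTROLLER_MANAGER_ARGS="--leader-elect"
Then restart the services so they pick up the flag:
systemctl restart kube-scheduler kube-controller-manager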
What is more, according to the k8s documentation https://kubernetes.io/docs/tasks/administer-cluster/highly-available-master/#best-practices-for-replicating-masters-for-ha-clusters:
Do not use a cluster with two master replicas. Consensus on a two replica cluster requires both replicas running when changing persistent state. As a result, both replicas are needed and a failure of any replica turns cluster into majority failure state. A two-replica cluster is thus inferior, in terms of HA, to a single replica cluster.