Rancher 2.0 - Troubleshooting and fixing the “Controller Manager Unhealthy” issue (Kubernetes)

I have a problem with the controller-manager and scheduler not responding that does not match the GitHub issues I've found (rancher#11496, azure#173, …).
Two days ago one pod on one node of our 3-node HA cluster caused a memory overflow. After that the Rancher web app was not accessible; we found the offending pod and scaled it to 0 via kubectl, but figuring everything out took some time.
Since then the Rancher web app has been working properly, but there are continuous alerts about the controller-manager and scheduler not working. The alerts are not consistent: sometimes both components report healthy, sometimes their health-check URLs refuse connections.
NAME                 STATUS      MESSAGE                                                                                      ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-2               Healthy     {"health": "true"}
etcd-1               Healthy     {"health": "true"}
Restarting the controller-manager and scheduler on the affected node has not helped. Even reloading all of the components with
docker restart kube-apiserver kubelet kube-controller-manager kube-scheduler kube-proxy
was not effective either.
Can someone please help me figure out the steps towards troubleshooting and fixing this issue without downtime on running containers?
Nodes are hosted on DigitalOcean on servers with 4 Cores and 8GB of RAM each (Ubuntu 16, Docker 17.03.3).
Thanks in advance!

The first place to look would be your logs. Can you export the following log and attach it?
/var/log/kube-controller-manager.log
The controller manager registers itself as an endpoint, so you will need to do a "get endpoints". Can you run the following:
kubectl -n kube-system get endpoints kube-controller-manager
and
kubectl -n kube-system describe endpoints kube-controller-manager
and
kubectl -n kube-system get endpoints kube-controller-manager -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
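If leader election is still working, the last command prints the leader annotation as JSON. A rough idea of what healthy output looks like (the holder identity and timestamps below are made up for illustration):
{"holderIdentity":"kube-controller-manager-node1","leaseDurationSeconds":15,"acquireTime":"2019-01-10T09:31:00Z","renewTime":"2019-01-10T09:35:12Z","leaderTransitions":3}
If renewTime is stale or the annotation is missing, the controller-manager is not renewing its lease, which matches the Unhealthy status above.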

Run these commands on the master nodes (this applies to clusters where the control plane runs from static pod manifests in /etc/kubernetes/manifests; they comment out the --port=0 flag so the insecure health-check ports come back):
sed -i 's|- --port=0|#- --port=0|' /etc/kubernetes/manifests/kube-scheduler.yaml
sed -i 's|- --port=0|#- --port=0|' /etc/kubernetes/manifests/kube-controller-manager.yaml
systemctl restart kubelet
After the kubelet restarts, the problem should be resolved.
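To verify the fix took effect, assuming the default insecure health ports (10251 for the scheduler, 10252 for the controller-manager) were not overridden in your manifests, something like this should now answer:
curl -s http://127.0.0.1:10251/healthz    # kube-scheduler, expect "ok"
curl -s http://127.0.0.1:10252/healthz    # kube-controller-manager, expect "ok"
kubectl get componentstatuses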

Related

Kubernetes pod failed to update

We use GitLab CI/CD to deploy pods via Kubernetes. However, the updated pod is always stuck in Pending and the deleted pod is always stuck in Terminating.
The controller-manager and scheduler are both okay.
If I describe the pending pod, it shows that it is scheduled but nothing else.
This is the pending pod's logs:
$ kubectl logs -f robo-apis-dev-7b79ccf74b-nr9q2 -n xxx -f
Error from server (BadRequest): container "robo-apis-dev" in pod "robo-apis-dev-7b79ccf74b-nr9q2" is waiting to start: ContainerCreating
What could be the issue? Our Kubernetes cluster never had this issue before.
Okay, it turns out we used to have an NFS server backing our PVCs. We recently moved to AWS EKS and cleaned up the NFS servers, but apparently some nodes still had resources referencing the NFS server. Once we temporarily rolled the NFS server back, the pods started moving to the Running state.
The issue was discussed here: Orphaned pod - https://github.com/kubernetes/kubernetes/issues/60987
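If you suspect the same kind of leftover state, a rough way to spot orphaned pods on a node is to grep the kubelet logs and look at the leftover pod directories (the exact log wording varies by Kubernetes version, and the path assumes the default kubelet data dir):
# on the affected node
journalctl -u kubelet | grep -i orphaned
ls /var/lib/kubelet/pods/    # pod UIDs left on disk that no longer exist in the API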

Nginx Kubernetes POD stays in ContainerCreating

I was able to set up a Kubernetes cluster on CentOS 7 with one master and two worker nodes; however, when I try to deploy a pod with nginx, the pod stays in ContainerCreating forever and doesn't seem to get out of it.
For the pod network I am using Calico.
Can you please help me resolve this issue? I don't feel comfortable moving forward without resolving it; I have been checking forums etc. for the last two days, and this is my last resort.
[root@kube-master ~]# kubectl get pods --all-namespaces
(screenshot of the get pods result omitted)
However, when I run describe pod, I see the below error for the nginx container under the Events section.
Warning  FailedCreatePodSandBox  41s (x8 over 11m)  kubelet, kube-worker1
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ac77a42270009cba0c508e2fd82a84d6caef287bdb117d288d5193960b52abcb" network for pod "nginx-6db489d4b7-2r4d2": networkPlugin cni failed to set up pod "nginx-6db489d4b7-2r4d2_default" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get http:///var/run/cilium/cilium.sock/v1/config: dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Hope you can help here.
Edit 1:
The ip address of the master VM is 192.168.40.133
I used the below command to initialize kubeadm:
kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address 192.168.40.133
I used the below command to install the pod network:
kubectl apply -f https://docs.projectcalico.org/v3.11/manifests/calico.yaml
The kubeadm init above gave me the join command, which I used to join the workers to the cluster.
All the VMs are connected with host and bridged network adapters.
Your pod subnet (specified by --pod-network-cidr) clashes with the network your VMs are on: the two have to be distinct. Use something else for the pod subnet, for example 10.244.0.0/16, and then edit calico.yaml before applying it, as described in the official docs:
POD_CIDR="10.244.0.0/16"
kubeadm init --pod-network-cidr=${POD_CIDR} --apiserver-advertise-address 192.168.40.133
curl https://docs.projectcalico.org/manifests/calico.yaml -O
sed -i -e "s?192.168.0.0/16?${POD_CIDR}?g" calico.yaml
kubectl apply -f calico.yaml
Hope this helps :)
Note: you don't really need to specify the --apiserver-advertise-address flag; kubeadm will correctly detect the machine's main IP most of the time.
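If you want to double-check for this kind of overlap, a small sketch (it assumes a kubeadm-style cluster where kube-controller-manager was started with --cluster-cidr):
# the network the VMs actually sit on
ip -4 addr show
ip route
# the pod CIDR the control plane was configured with
kubectl cluster-info dump | grep -m 1 cluster-cidr
The two ranges should not overlap.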

Kubernetes coredns readiness probe failed

I have set up a Kubernetes cluster with one master (kube-master) and 2 worker nodes (kube-node-01 and kube-node-02).
All was running fine... now, after a Debian stretch -> buster upgrade, my coredns pods are failing with CrashLoopBackOff for some reason.
I did a kubectl describe and the error is: Readiness probe failed: HTTP probe failed with statuscode: 503
The readiness URL looks suspicious to me: http-get http://:8080/health delay=0s timeout=1s period=10s #success=1 #failure=3 ... there is no hostname!? Is that correct?
The liveness probe also does not have a hostname.
All VMs can ping each other.
Any ideas?
I hit a similar issue when I upgraded my host machine to Ubuntu 18.04, which uses systemd-resolved as its local DNS resolver. The nameserver field in /etc/resolv.conf then points at the local IP address 127.0.0.53, which causes CoreDNS to fail to start (it detects a forwarding loop).
You can find the details at the following link:
https://github.com/coredns/coredns/blob/master/plugin/loop/README.md#troubleshooting-loops-in-kubernetes-clusters
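For kubeadm-based clusters, the fix that README suggests is to point the kubelet at the real upstream resolv.conf instead of the systemd-resolved stub. A minimal sketch, assuming systemd-resolved and the default kubeadm kubelet config path:
# point kubelet at the real resolv.conf instead of the 127.0.0.53 stub
sudo sed -i 's|^resolvConf:.*|resolvConf: /run/systemd/resolve/resolv.conf|' /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet
# recreate the crashing coredns pods so they pick up the new setting
kubectl -n kube-system delete pod -l k8s-app=kube-dns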
I just hit this problem myself. Apparently the lack of a hostname in the health-check URL is OK.
What got me ahead was:
microk8s.inspect
The output said there was a problem with forwarding in iptables. Since I have firewalld on my system, I temporarily disabled it:
systemctl stop firewalld
and then disabled DNS in microk8s and enabled it again (for some unknown reason the DNS pod didn't come up on its own):
microk8s.disable dns
microk8s.enable dns
After that it started without any issues.
I would start troubleshooting by verifying the kubelet agent on the master and worker nodes, in order to exclude any intercommunication issue between cluster nodes while the rest of the core runtime pods are up and running, since the kubelet is what actually executes the liveness and readiness probes:
systemctl status kubelet -l
journalctl -u kubelet
The health-check URLs mentioned in the question are fine; they are predefined in the CoreDNS deployment by design.
Ensure that the CNI plugin pods are functioning and that the cluster overlay network handles pod-to-pod traffic, as CoreDNS is very sensitive to any issue affecting cluster networking (a quick check is sketched below).
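A quick way to check both points (the grep pattern depends on which CNI plugin you actually installed, so adjust it accordingly):
# are the CNI plugin pods healthy?
kubectl -n kube-system get pods -o wide | grep -Ei 'calico|flannel|weave|cilium'
# what is CoreDNS itself logging?
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50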
In addition to @Hang Du's answer about the CoreDNS loop issue, I encourage you to look at the official Kubernetes DNS debugging documentation for more on investigating CoreDNS problems.

How to check if Kubernetes cluster is running fine

I have a Kubernetes cluster running. I used:
kubeadm init --apiserver-advertise-address=192.168.20.101 --pod-network-cidr=10.244.0.0/16
This is working okay. Now I'm putting this in a script, and I only want to execute kubeadm init again if my cluster is not running fine. How can I check whether a Kubernetes cluster is healthy, so that I can recreate it if it is not?
You can use the following command to do that:
[root@ip-10-0-1-19]# kubectl cluster-info
Kubernetes master is running at https://10.0.1.197:6443
KubeDNS is running at https://10.0.1.197:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
This shows that your master is running fine at the given URL.
kubectl get cs
The above command displays the health of the controller-manager, scheduler, and etcd; all of them will show as Healthy if your cluster is fine:
# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true"}
You should also check the nodes in the cluster:
kubectl get nodes
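Since the goal is to gate kubeadm init in a script, here is a minimal sketch of such a check, assuming kubectl is configured on the machine running the script (what exactly counts as "running fine" is up to you):
#!/bin/bash
# (re)initialize only if the API server is not reachable
if kubectl cluster-info >/dev/null 2>&1; then
    echo "cluster is reachable, skipping kubeadm init"
    kubectl get nodes    # quick visual check of node status
else
    echo "cluster not reachable, (re)initializing"
    # kubeadm reset -f   # uncomment to wipe any broken leftover state first
    kubeadm init --apiserver-advertise-address=192.168.20.101 --pod-network-cidr=10.244.0.0/16
fi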

"Waiting for tearing down pods" when Kubernetes turns down

I have a Kubernetes cluster installed in my Ubuntu machines. It consists of three machines: one master/node and two nodes.
When I turn down the cluster, it never stops printing "waiting for tearing down pods":
root@kubernetes01:~/kubernetes/cluster# KUBERNETES_PROVIDER=ubuntu ./kube-down.sh
Bringing down cluster using provider: ubuntu
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
No resources found
No resources found
service "kubernetes" deleted
No resources found
waiting for tearing down pods
waiting for tearing down pods
waiting for tearing down pods
... (the same line repeats indefinitely)
There are no pods or services running when I turn it down. In the end I have to force-stop it by killing processes and stopping services.
First, we have to find out which replication controllers (rc) are still running:
kubectl get rc --namespace=kube-system
Then we have to delete the running rc:
kubectl delete rc above_running_rc_name --namespace=kube-system
After that, the cluster-down script "KUBERNETES_PROVIDER=ubuntu ./kube-down.sh" will run without getting stuck on "waiting for tearing down pods".
Example:
root@ubuntu:~/kubernetes/cluster# KUBERNETES_PROVIDER=ubuntu ./kube-down.sh
Bringing down cluster using provider: ubuntu
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
No resources found
No resources found
service "kubernetes" deleted
No resources found
waiting for tearing down pods
waiting for tearing down pods
^C
root@ubuntu:~/kubernetes/cluster# kubectl get rc --namespace=kube-system
CONTROLLER                    CONTAINER(S)           IMAGE(S)                                                      SELECTOR                       REPLICAS   AGE
kubernetes-dashboard-v1.0.1   kubernetes-dashboard   gcr.io/google_containers/kubernetes-dashboard-amd64:v1.0.1   k8s-app=kubernetes-dashboard   1          44m
root@ubuntu:~/kubernetes/cluster#
root@ubuntu:~/kubernetes/cluster# kubectl delete rc kubernetes-dashboard-v1.0.1 --namespace=kube-system
replicationcontroller "kubernetes-dashboard-v1.0.1" deleted
root@ubuntu:~/kubernetes/cluster# KUBERNETES_PROVIDER=ubuntu ./kube-down.sh
Bringing down cluster using provider: ubuntu
Identity added: /root/.ssh/id_rsa (/root/.ssh/id_rsa)
No resources found
No resources found
service "kubernetes" deleted
No resources found
Cleaning on master 172.27.59.208
26979
etcd stop/waiting
Connection to 172.27.59.208 closed.
Connection to 172.27.59.208 closed.
Connection to 172.27.59.208 closed.
Cleaning on node 172.27.59.233
2165
flanneld stop/waiting
Connection to 172.27.59.233 closed.
Connection to 172.27.59.233 closed.
Done
You can find out which pods it is waiting for by running:
kubectl get pods --show-all --all-namespaces
That's what the script does: https://github.com/kubernetes/kubernetes/blob/1c80864913e4b9da957c45eef005b06dba68cec3/cluster/ubuntu/util.sh#L689
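If you just want to clear everything the teardown is waiting on in one go, a hedged shortcut (this deletes every replication controller in kube-system, so only use it when you really intend to tear the cluster down):
# remove all replication controllers in kube-system, then retry the teardown
kubectl delete rc --all --namespace=kube-system
kubectl get pods --all-namespaces    # confirm nothing is left behind
KUBERNETES_PROVIDER=ubuntu ./kube-down.sh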