I've set up a k8s cluster on Digital Ocean Ubuntu 18.04 LTS droplets using Calico on top of a WireGuard VPN, and was able to set up nginx-ingress with Traefik as the external LB. I'm now at the step of setting up distributed storage using Rook Ceph, following the quickstart at https://rook.io/docs/rook/master/ceph-quickstart.html, but it seems like the monitors never reach a quorum (even when it's just one). Actually, monitor a reaches quorum by itself, but neither the operator nor any other monitors seem to know that, and the operator hangs when trying to check the status.
I've tried troubleshooting network issues all the way from WireGuard to Calico and ufw. I've even set ufw to temporarily allow all traffic by default, just to make sure I wasn't allowing one port while the traffic was actually on another interface (I have wg0, eth1, tunl0 and the Calico interfaces).
Then I followed the Ceph troubleshooting guide, unsuccessfully: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#recovering-a-monitor-s-broken-monmap
I've been at this for 4 days and I'm out of solutions.
Here's how I set up the storage cluster:
cd cluster/examples/kubernetes/ceph
kubectl apply -f common.yaml
kubectl apply -f operator.yaml
kubectl apply -f cluster-test.yaml
Running kubectl get pods returns
NAME READY STATUS RESTARTS AGE
pod/rook-ceph-agent-9ws2p 1/1 Running 0 24s
pod/rook-ceph-agent-v6v9n 1/1 Running 0 24s
pod/rook-ceph-agent-x2jv4 1/1 Running 0 24s
pod/rook-ceph-mon-a-74cc6db5c8-8s5l5 1/1 Running 0 9s
pod/rook-ceph-operator-7cd5d8bd4c-pclxp 1/1 Running 0 25s
pod/rook-discover-24cfj 1/1 Running 0 24s
pod/rook-discover-6xsnp 1/1 Running 0 24s
pod/rook-discover-hj4tc 1/1 Running 0 24s
However, when I try to check the status of the monitors from the operator pod, I get:
#This hangs forever
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph status
#This hangs forever
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph ping mon.a
#This returns [errno 2] error calling ping_monitor
#Which I guess it should, because mon.b does not / should not exist
#But I expected a response such as "mon.b does not exist"
kubectl exec -it rook-ceph-operator-7cd5d8bd4c-pclxp ceph ping mon.b
Pinging the monitor pod from the operator works just fine by the way
Operator logs
https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-operator-log
Monitor a logs
https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-mon-a-log
Monitor a status, obtained directly from the monitor pod via socket
https://gist.github.com/figassis/0a3f499f5e3f79a430c9bd58718fd29f#file-mon-a-status
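For context, pulling a monitor's status over its admin socket from inside the mon pod looks roughly like this (the socket filename under /var/run/ceph is an assumption and may differ):
ceph daemon mon.a mon_status
#or, pointing at the socket file explicitly
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok mon_status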
You can execute the ceph status command inside the ceph toolbox pod.
https://github.com/rook/rook/blob/master/Documentation/ceph-toolbox.md
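A minimal sketch of that, assuming the toolbox.yaml manifest from the same examples directory used in the question and the default rook-ceph namespace (the toolbox pod name is whatever the deployment generates):
kubectl apply -f toolbox.yaml
kubectl -n rook-ceph get pod -l app=rook-ceph-tools
kubectl -n rook-ceph exec -it <toolbox-pod-name> -- ceph status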
Related
My pod is stuck in ContainerCreating status with this message:
Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "483590313b7fd092fe5eeec92356152721df3e971d942174464ac5a3f1529898" network for pod "my-nginx": networkPlugin cni failed to set up pod "my-nginx_default" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "483590313b7fd092fe5eeec92356152721df3e971d942174464ac5a3f1529898", failed to clean up sandbox container "483590313b7fd092fe5eeec92356152721df3e971d942174464ac5a3f1529898" network for pod "my-nginx": networkPlugin cni failed to teardown pod "my-nginx_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
The state of the worker node is Ready, but the output of kubectl get pods -n kube-system seems to have issues:
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-6dfcd885bf-ktbhb 1/1 Running 0 22h
calico-node-4fs2v 0/1 Init:RunContainerError 1 22h
calico-node-l9qvc 0/1 Running 0 22h
coredns-f9fd979d6-8pzcd 1/1 Running 0 23h
coredns-f9fd979d6-v4cq8 1/1 Running 0 23h
etcd-k8s-master 1/1 Running 1 23h
kube-apiserver-k8s-master 1/1 Running 128 23h
kube-controller-manager-k8s-master 1/1 Running 4 23h
kube-proxy-bwtwj 0/1 CrashLoopBackOff 342 23h
kube-proxy-stq7q 1/1 Running 1 23h
kube-scheduler-k8s-master 1/1 Running 4 23h
And the result of the command kubectl -n kube-system logs kube-proxy-bwtwj was:
failed to try resolving symlinks in path "/var/log/pods/kube-system_kube-proxy-bwtwj_1a0f4b93-cc6f-46b9-bf29-125feba593cb/kube-proxy/348.log": lstat /var/log/pods/kube-system_kube-proxy-bwtwj_1a0f4b93-cc6f-46b9-bf29-125feba593cb/kube-proxy/348.log: no such file or directory
I see two topics here:
The default --pod-network-cidr for Calico is 192.168.0.0/16. You can use a different one, but always make sure that there are no overlaps. However, I have tested with the default one and my cluster runs with no problems. In order to start over with a proper config, you should remove the node and clean up the control plane (a minimal cleanup sketch follows the commands below). Then proceed with:
kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
kubectl apply -f https://docs.projectcalico.org/v3.14/manifests/calico.yaml
After that, join your worker nodes with kubeadm join
Use sudo where/if needed. All necessary details can be found in the official documentation.
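For reference, a minimal sketch of the remove/clean-up step mentioned above, assuming a kubeadm-based setup (node names are placeholders):
kubectl drain <node-name> --delete-local-data --force --ignore-daemonsets
kubectl delete node <node-name>
#then, on the node being removed and on the control plane
sudo kubeadm reset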
The failed to try resolving symlinks error means that kubelet is looking for the pod logs in the wrong directory. In order to fix it you need to pass the --log-dir=/var/log flag to kubelet. After adding the flag, run systemctl daemon-reload and restart the kubelet. This has to be done on all of your nodes.
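A minimal sketch of adding the flag, assuming the kubelet is managed by systemd and reads extra flags from /etc/default/kubelet (on RPM-based distros the file is usually /etc/sysconfig/kubelet; merge with any flags already set there):
#/etc/default/kubelet
KUBELET_EXTRA_ARGS=--log-dir=/var/log
sudo systemctl daemon-reload
sudo systemctl restart kubelet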
Make sure you deploy Calico before joining other nodes to your cluster. When you have other nodes in your cluster, calico-kube-controllers sometimes gets pushed to a worker node, which can lead to issues.
You need to carefully check the logs of the calico-node pods.
In my case I had some other network interfaces, and the autodetection mechanism in Calico was detecting the wrong interface (IP address).
You need to consult this documentation https://projectcalico.docs.tigera.io/reference/node/configuration.
What I did in my case was simply:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=cidr=172.16.8.0/24
where 172.16.8.0/24 is my "working network" CIDR.
After this, all calico-nodes restarted and suddenly everything was fine.
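For reference, other autodetection methods can be set the same way; the interface name and address below are only examples:
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth1
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=can-reach=8.8.8.8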
We have 2 clusters on GKE: dev and production. I tried to run this command on dev cluster:
gcloud beta container clusters update "dev" --update-addons=NodeLocalDNS=ENABLED
Everything went great: the node-local-dns pods are running and all works. The next morning I decided to run the same command on the production cluster, and node-local-dns fails to run. I noticed that both PILLAR__LOCAL__DNS and PILLAR__DNS__SERVER in the yaml aren't changed to proper IPs. I tried to change those variables in the config yaml, but GKE keeps overwriting them back to the PILLAR__DNS__SERVER placeholders...
The only difference between clusters is that dev runs on 1.15.9-gke.24 and production 1.15.11-gke.1.
Apparently version 1.15.11-gke.1 has a bug.
I recreated it first on 1.15.11-gke.1 and can confirm that node-local-dns Pods fall into CrashLoopBackOff state:
node-local-dns-28xxt 0/1 CrashLoopBackOff 5 5m9s
node-local-dns-msn9s 0/1 CrashLoopBackOff 6 8m17s
node-local-dns-z2jlz 0/1 CrashLoopBackOff 6 10m
When I checked the logs:
$ kubectl logs -n kube-system node-local-dns-msn9s
2020/04/07 21:01:52 [FATAL] Error parsing flags - Invalid localip specified - "__PILLAR__LOCAL__DNS__", Exiting
Solution:
Upgrading to 1.15.11-gke.3 helped. First you need to upgrade your master node and then your node pool (a minimal sketch of the upgrade commands follows the output below). It looks like on this version everything runs nice and smoothly:
$ kubectl get daemonsets -n kube-system node-local-dns
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
node-local-dns 3 3 3 3 3 addon.gke.io/node-local-dns-ds-ready=true 44m
$ kubectl get pods -n kube-system -l k8s-app=node-local-dns
NAME READY STATUS RESTARTS AGE
node-local-dns-8pjr5 1/1 Running 0 11m
node-local-dns-tmx75 1/1 Running 0 19m
node-local-dns-zcjzt 1/1 Running 0 19m
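For reference, a minimal sketch of that two-step upgrade (the cluster name, node pool name and zone are placeholders):
gcloud container clusters upgrade production --master --cluster-version=1.15.11-gke.3 --zone=<zone>
gcloud container clusters upgrade production --node-pool=default-pool --cluster-version=1.15.11-gke.3 --zone=<zone>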
As for manually fixing this particular daemonset yaml file, I wouldn't recommend it, as you can be sure that GKE's auto-repair and auto-upgrade features will overwrite it sooner or later anyway.
I hope it was helpful.
I am using Docker Desktop to set up Kubernetes. I have used the below command to install the Kubernetes dashboard on Mac:
kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/master/aio/test-resources/kubernetes-dashboard-local.yaml
When I use kubectl get pod --namespace=kube-system:
NAME READY STATUS RESTARTS AGE
coredns-6dcc67dcbc-2qdls 1/1 Running 0 4h50m
coredns-6dcc67dcbc-nqm76 1/1 Running 0 4h50m
etcd-docker-desktop 1/1 Running 0 4h49m
kube-apiserver-docker-desktop 1/1 Running 0 4h50m
kube-controller-manager-docker-desktop 1/1 Running 0 4h49m
kube-proxy-pq9pv 1/1 Running 0 4h50m
kube-scheduler-docker-desktop 1/1 Running 0 4h49m
kubernetes-dashboard-local-599bb4877f-6nnkz 0/1 ErrImageNeverPull 0 138m
kubernetes-metrics-scraper-head-787ff8f87-rrq67 1/1 Running 0 138m
I wanted to know what that "ErrImageNeverPull" status of the pod means. It is not even allowing me to describe/delete it by that name.
kubectl describe pod kubernetes-dashboard-local-599bb4877f-6nnkz
Error from server (NotFound): pods "kubernetes-dashboard-local-599bb4877f-6nnkz" not found
How do I fix or get rid of that so that I can successfully proceed further?
That YAML file specifies:
image: kubernetes/kubernetes-dashboard-amd64:head
imagePullPolicy: Never
So ErrImageNeverPull means that (a) that exact image name doesn't exist on the node where the pod is scheduled, and (b) imagePullPolicy: Never tells it to not try to fetch it.
Since the pod is not in the default namespace, you need to provide the kubectl --namespace kube-system option to every command that tries to interact with it (not just get pod but also describe pod, delete deployment, etc.).
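For example (the pod name is taken from the question; the deployment name is inferred from it and may differ):
kubectl --namespace kube-system describe pod kubernetes-dashboard-local-599bb4877f-6nnkz
kubectl --namespace kube-system delete deployment kubernetes-dashboard-local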
It looks like you've pulled a deployment spec from inside the dashboard's test tree which is intended to be used by a developer actively working on the dashboard code. The installation instructions have a different YAML file to use. (This link to the GitHub repo is probably more stable than the link to the version-specific YAML file that's there.)
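A minimal sketch of installing from the regular deployment manifest instead (the version tag in the URL is an assumption; check the dashboard README for the current one):
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml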
I am following this tutorial with 2 vms running CentOS7. Everything looks fine (no errors during installation/setup) but I can't see my nodes.
NOTE:
I am running this on VMWare VMs
kub1 is my master and kub2 my worker node
kubectl cluster-info output:
[root@kub1 ~]# kubectl cluster-info
Kubernetes master is running at http://kub1:8080
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
[root@kub2 ~]# kubectl cluster-info
Kubernetes master is running at http://kub1:8080
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
nodes:
[root@kub1 ~]# kubectl get nodes
[root@kub1 ~]# kubectl get nodes -a
[root@kub1 ~]#
[root@kub2 ~]# kubectl get nodes -a
[root@kub2 ~]# kubectl get no
[root@kub2 ~]#
cluster events:
[root@kub1 ~]# kubectl get events -a
LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
1h 1h 1 kub2.local Node Normal Starting {kube-proxy kub2.local} Starting kube-proxy.
1h 1h 1 kub2.local Node Normal Starting {kube-proxy kub2.local} Starting kube-proxy.
1h 1h 1 kub2.local Node Normal Starting {kubelet kub2.local} Starting kubelet.
1h 1h 1 node-kub2 Node Normal Starting {kubelet node-kub2} Starting kubelet.
1h 1h 1 node-kub2 Node Normal Starting {kubelet node-kub2} Starting kubelet.
/var/log/messages:
kubelet.go:1194] Unable to construct api.Node object for kubelet: can't get ip address of node node-kub2: lookup node-kub2: no such host
QUESTION: any idea why my nodes are not shown using "kubectl get nodes"?
My issue was that the KUBELET_HOSTNAME value in /etc/kubernetes/kubelet didn't match the hostname.
I commented out that line, then restarted the services, and I could see my worker after that.
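A quick way to compare the two, assuming the packaged CentOS config file lives at /etc/kubernetes/kubelet:
hostname
grep KUBELET_HOSTNAME /etc/kubernetes/kubelet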
Hope that helps.
Not sure about your scenario, but I solved it after 3-4 hours of effort.
Solved
I was facing this issue because my Docker cgroup driver was different from the Kubernetes cgroup driver.
I just updated it to cgroupfs using the following commands mentioned in the docs.
cat << EOF > /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
Restart the Docker service: service docker restart.
Reset Kubernetes on the slave node: kubeadm reset
Join the master again: kubeadm join <><>
The node was then visible on the master using kubectl get nodes.
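A quick way to compare the two drivers (a sketch; the kubelet config path assumes a kubeadm-style setup and may differ):
docker info --format '{{.CgroupDriver}}'
grep cgroupDriver /var/lib/kubelet/config.yaml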
I had a similar problem after installing k8s using kubespray on Fedora 31. To debug the issue, I tried to run a random container directly using docker run, which failed with:
docker: Error response from daemon: cgroups: cgroup mountpoint does not exist: unknown.
This is a known problem caused by the cgroup version on Fedora 31, and the fix is to update grub to use the previous version (a reboot is needed for the new kernel arguments to take effect):
sudo dnf install grubby
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
I have got 2 VM nodes. Both see each other either by hostname (through /etc/hosts) or by IP address. One has been provisioned with kubeadm as a master, the other as a worker node. Following the instructions (http://kubernetes.io/docs/getting-started-guides/kubeadm/) I have added weave-net. The list of pods looks like the following:
vagrant@vm-master:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-vm-master 1/1 Running 0 3m
kube-system kube-apiserver-vm-master 1/1 Running 0 5m
kube-system kube-controller-manager-vm-master 1/1 Running 0 4m
kube-system kube-discovery-982812725-x2j8y 1/1 Running 0 4m
kube-system kube-dns-2247936740-5pu0l 3/3 Running 0 4m
kube-system kube-proxy-amd64-ail86 1/1 Running 0 4m
kube-system kube-proxy-amd64-oxxnc 1/1 Running 0 2m
kube-system kube-scheduler-vm-master 1/1 Running 0 4m
kube-system kubernetes-dashboard-1655269645-0swts 1/1 Running 0 4m
kube-system weave-net-7euqt 2/2 Running 0 4m
kube-system weave-net-baao6 1/2 CrashLoopBackOff 2 2m
CrashLoopBackOff appears for each worker node connected. I have spent several hours playing with network interfaces, but it seems the network is fine. I have found a similar question, where the answer advised looking into the logs, with no follow-up. So, here are the logs:
vagrant@vm-master:~$ kubectl logs weave-net-baao6 -c weave --namespace=kube-system
2016-10-05 10:48:01.350290 I | error contacting APIServer: Get https://100.64.0.1:443/api/v1/nodes: dial tcp 100.64.0.1:443: getsockopt: connection refused; trying with blank env vars
2016-10-05 10:48:01.351122 I | error contacting APIServer: Get http://localhost:8080/api: dial tcp [::1]:8080: getsockopt: connection refused
Failed to get peers
What am I doing wrong? Where do I go from there?
I ran into the same issue too. It seems Weave wants to connect to the Kubernetes cluster IP address, which is virtual. Just run kubectl get svc to find the cluster IP. It should give you something like this:
$ kubectl get svc
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes 100.64.0.1 <none> 443/TCP 2d
Weave picks up this IP and tries to connect to it, but the worker nodes do not know anything about it. A simple route will solve this issue. On all your worker nodes, execute:
route add 100.64.0.1 gw <your real master IP>
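You can verify which path a worker will use with something like:
ip route get 100.64.0.1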
This happens with a single-node setup, too. I tried several things like reapplying the configuration and recreating the cluster, but the most stable way at the moment is to perform a full tear down (as described in the docs) and bring the cluster up again.
I use these scripts for relaunching the cluster:
down.sh
#!/bin/bash
systemctl stop kubelet;
docker rm -f -v $(docker ps -q);
find /var/lib/kubelet | xargs -n 1 findmnt -n -t tmpfs -o TARGET -T | uniq | xargs -r umount -v;
rm -r -f /etc/kubernetes /var/lib/kubelet /var/lib/etcd;
up.sh
#!/bin/bash
systemctl start kubelet
kubeadm init
# kubectl taint nodes --all dedicated- # single node!
kubectl create -f https://git.io/weave-kube
Edit: I would also give other pod networks a try, like Calico, if this is a Weave-related issue.
The most common causes for this may be:
- presence of a firewall (e.g. firewalld on CentOS)
- network configuration (e.g. default NAT interface on VirtualBox)
Currently kubeadm is still alpha, and this is one of the issues that has already been reported by many of the alpha testers. We are looking into fixing this by documenting the most common problems; such documentation is going to be ready closer to the beta version.
Right now there exists a VirtualBox+Vagrant+Ansible reference implementation for Ubuntu and CentOS that provides solutions for the firewall, SELinux and VirtualBox NAT issues.
/usr/local/bin/weave reset
was the fix for me. Hope it's useful. And yes, make sure SELinux is set to disabled and firewalld is not running on RedHat/CentOS releases (a quick check sketch follows the pod listing below).
kube-system weave-net-2vlvj 2/2 Running 3 11d
kube-system weave-net-42k6p 1/2 Running 3 11d
kube-system weave-net-wvsk5 2/2 Running 3 11d
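A quick way to check both on RedHat/CentOS, as mentioned above:
getenforce
systemctl status firewalld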