Kubernetes nodes keep rebooting when using Rook volumes

Several days ago I ran into a problem where my nodes kept rebooting constantly.
My stack:
1 master, 2 workers k8s-cluster built with kubeadm (v1.17.1-00)
Ubuntu 18.04 x86_64 4.15.0-74-generic
Flannel cni plugin (v0.11.0)
Rook (v1.2) CephFS for storage. Ceph was deployed in the same cluster where my application lives.
I was able to run the Ceph cluster, but when I tried to deploy my application, which used my Rook volumes, my pods suddenly started to die.
I got this message when I ran the kubectl describe pods/<name> command:
Pod sandbox changed, it will be killed and re-created
In the k8s events I got:
<Node name> has been rebooted
After some time the node comes back to life but then dies again within 2-3 minutes.
I tried draining the node and rejoining it to the cluster, but after that another node started getting the same error.
I looked into the system error logs of a failed node with journalctl -p 3
and found that the logs were flooded with this message: kernel: cache_from_obj: Wrong slab cache. inode_cache but object is from ceph_inode_info.
After googling this problem, I found this issue:
https://github.com/coreos/bugs/issues/2616
It turned out that CephFS just doesn't work with some versions of the Linux kernel!
For me, neither of these worked:
Ubuntu 19.04 x86_64 5.0.0-32-generic
Ubuntu 18.04 x86_64 4.15.0-74-generic

Solution
CephFS doesn't work with some versions of the Linux kernel. Upgrade your kernel. I finally got it working on Ubuntu 18.04 x86_64 5.0.0-38-generic.
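If it helps, a rough way to check and upgrade the kernel on each Ubuntu 18.04 node (a sketch; linux-generic-hwe-18.04 is the stock HWE meta-package, and you should drain the node before rebooting):
kubectl get nodes -o wide        # the KERNEL-VERSION column shows what each node is running
uname -r                         # or check directly on the node
sudo apt update && sudo apt install -y linux-generic-hwe-18.04   # pulls in a newer 5.x HWE kernel
kubectl drain <node-name> --ignore-daemonsets                    # <node-name> is a placeholder
sudo reboot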
The GitHub issue that helped me:
https://github.com/coreos/bugs/issues/2616
This is indeed a tricky issue. I struggled to find a solution and spent a lot of time trying to understand what was happening. I hope this information helps someone, because there is not much about it on Google.

Related

Rancher RKE2 CNI Plugin Not Initialized

I've run into this error message while installing Rancher's RKE2 tools on my Ubuntu 20.04 (Hirsute) VirtualBox VM (running on a Windows 10 laptop). I've allocated 128 MB of video memory, 145 GB of storage, and 8196 MB of memory.
Container runtime network not ready: NetworkReady=false reason NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
My research (Googling and trying to read Rancher's documentation) keeps leading me down rabbit holes, but nothing I read clarifies the problem or what "CNI plugin not initialized" means. I'm not sure where to go from here.
What I have tried:
Installing the flannel kube
Updating /etc/NetworkDevices/rke2-canal.conf as described in the RKE2 documentation
I also noticed a taint on my node when I run kubectl describe node: node.kubernetes.io/not-ready:NoSchedule. Again, I feel like I am chasing my own tail when I research this issue: the people who reported the same taint are not all using the same platform, VM, or Kubernetes distribution as I am, so I am lost as to whether the solutions might apply to me.
Where should I go from here? I'm lost.
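Not an answer, but a few node-level checks that often narrow "cni plugin not initialized" down on a default RKE2 server install (the paths below are the RKE2 defaults and are assumptions; adjust them if you changed the data directory):
ls /var/lib/rancher/rke2/agent/etc/cni/net.d/        # should contain a canal conflist once the CNI is deployed
ls /etc/cni/net.d/                                   # the kubelet's standard CNI config location
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -n kube-system | grep canal
journalctl -u rke2-server -e                         # look for image pull or helm-install errors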

Kubectl connection refused existing cluster

Hope someone can help me.
To describe the situation briefly: I have a self-managed k8s cluster running on 3 machines (1 master, 2 worker nodes). In order to make it HA, I attempted to add a second master to the cluster.
After some failed attempts, I found out that I needed to add a controlPlaneEndpoint setting to the kubeadm-config ConfigMap. So I did, using masternodeHostname:6443.
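For reference, roughly what that edit looks like (a sketch only; masternodeHostname:6443 is the value mentioned above, and for real HA the endpoint would normally point at a load balancer rather than a single master):
kubectl -n kube-system edit configmap kubeadm-config
# then, inside the ClusterConfiguration key, add a line like:
#   controlPlaneEndpoint: "masternodeHostname:6443"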
I generated the certificate and join command for the second master, and after running it on the second master machine, it failed with:
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
Checking the first master now, I get connection refused for its IP on port 6443, so I cannot run any kubectl commands.
I tried recreating the .kube folder, with all the config copied there; no luck.
I restarted kubelet and docker.
The containers running on the cluster seem OK, but I am locked out of any cluster configuration (the dashboard is down and kubectl commands are not working).
Is there any way I can make it work again, without losing any of the configuration or the deployments already present?
Thanks! Sorry if it’s a noob question.
Cluster information:
Kubernetes version: 1.15.3
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: RHEL 7
CNI and version: weave 0.3.0
CRI and version: containerd 1.2.6
This is an old, known problem with Kubernetes 1.15 [1,2].
It is caused by a short etcd timeout period. As far as I'm aware it is a hard-coded value in the source and cannot be changed (a feature request to make it configurable is open for version 1.22).
Your best bet would be to upgrade to a newer version and recreate your cluster.
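Since kubectl is refused on the first master, node-level checks are about all that is left to see whether the etcd and kube-apiserver static pods are still alive. A rough sketch, assuming the containerd setup described above (the container ID is a placeholder taken from the ps output):
sudo crictl ps -a | grep -E 'kube-apiserver|etcd'   # are the control-plane containers running or exited?
sudo crictl logs <etcd-container-id>
sudo journalctl -u kubelet -e                       # the kubelet log usually shows why a static pod keeps failing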

Failed to create pod sandbox kubernetes cluster

I have the Weave network plugin.
Inside my folder /etc/cni/net.d there is a 10-weave.conf:
{
  "name": "weave",
  "type": "weave-net",
  "hairpinMode": true
}
My Weave pods are running and the DNS pod is also running.
But when I want to run a pod, like a simple nginx which will pull an nginx image,
the pod gets stuck at ContainerCreating, and describe pod gives me the error: failed to create pod sandbox.
When I run journalctl -u kubelet I get this error:
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Is my network plugin not configured correctly?
I used this command to configure my Weave network:
kubectl apply -f https://git.io/weave-kube-1.6
After this didn't work I also tried this command:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
I even tried Flannel and that gives me the same error.
The system I am setting Kubernetes up on is a Raspberry Pi.
I am trying to build a Raspberry Pi cluster with 3 nodes and 1 master with Kubernetes.
Does anyone have ideas on this?
Thank you all for responding to my question. I have solved my problem now. For anyone who comes to this question in the future, the solution was as follows.
I cloned my Raspberry Pi images because I wanted a basicConfig.img for when I needed to add a new node to my cluster or when one goes down.
The Weave network (the plugin I used) got confused because on every node and master the OS had the same machine-id. When I deleted the machine-id and created a new one (and rebooted the nodes), my error was fixed. The commands to do this were:
sudo rm /etc/machine-id
sudo rm /var/lib/dbus/machine-id
sudo dbus-uuidgen --ensure=/etc/machine-id
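After regenerating, it is worth confirming that every node now reports a different ID before rejoining them:
cat /etc/machine-id   # run on each node; each one should print a different value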
Once again my patience was tested, because my Kubernetes setup was normal and my Raspberry Pi OS was normal. I found this with the help of someone in the Kubernetes community, which again shows how important and great our IT community is. To the people who come to this question in the future: I hope this solution fixes your error and reduces the amount of time you spend searching for such a small thing.
Looking at the pertinent code in Kubernetes and in CNI, the specific error you see seems to indicate that it cannot find any files ending in .json, .conf or .conflist in the directory given.
This makes me think it could be something as simple as the conf file not being present on all the hosts, so I would verify that as a first step.
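A quick way to do that check on every host (assuming the default /etc/cni/net.d location used above):
ls -l /etc/cni/net.d/                 # the kubelet only picks up files ending in .conf, .conflist or .json
cat /etc/cni/net.d/10-weave.conf      # make sure the contents match on every node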

k8s (single-node) not working after restart

I installed Kubernetes on Ubuntu 16.04 (a VirtualBox VM), a single node with the master tainted. It worked well, but after I restarted my VM it is not working any more.
kubectl commands no longer work; they throw this error:
The connection to the server localhost:8001 was refused - did you specify the
right host or port?
It looks similar to this thread, but the solution is not working for me.
When I try "sudo docker ps -a", all kube pods are showing in Exited status.
Any help/pointers, please? Thanks in advance.
I've been having the same issue with my Rancher 2 setup. I have two nodes in one cluster. One of my node servers was restarted and never reconnected to my cluster, even though Docker and the containers were running fine.
One of the things I tried was reducing the number of workloads that can run on one node. I had increased it to 400, so I put it back to 100. That's when I got my first breakthrough on what could be happening with my downed node: I got the error "Path /var/lib/docker is mounted on / but it is not a shared or slave mount." A quick search led me to a similar issue on the Rancher GitHub page. Basically, a workaround by superseb fixed my issue. I SSHed into my node and ran:
> mount --make-rshared /
> docker start kubelet
Your issue might be different, but you could be having this same shared-mount problem.
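If you want to verify that before applying the workaround, mount propagation can be inspected directly (findmnt ships with util-linux on most distributions; this is just an optional check):
findmnt -o TARGET,PROPAGATION /                 # "shared" here means the workaround is not needed
findmnt -o TARGET,PROPAGATION /var/lib/docker   # only prints output if /var/lib/docker is its own mount point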

Minions can't rejoin cluster on reboot of AWS instance

The Kubernetes cluster, using v1.3.4, starts a master and 2 minions.
The cluster starts fine and pods can be started and controlled without issue.
As soon as one of the minions is rebooted, or any of the dependent services, such as kubelet, is restarted, the minions will not rejoin the cluster.
The error from the kubelet service is of the form:
Aug 08 08:21:15 ip-10-16-1-20 kubelet[911]: E0808 08:21:15.955309 911 kubelet.go:2875] Error updating node status, will retry: error getting node "ip-10-16-1-20.us-west-2.compute.internal": nodes "ip-10-16-1-20.us-west-2.compute.internal" not found
The only way we can see to rectify this issue at the moment is to tear down the whole cluster and rebuild it.
UPDATE:
I had a look at the controller manager log and got the following:
W0815 13:36:39.087991 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
W0815 13:37:39.123811 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
E0815 13:37:39.133045 1 nodecontroller.go:434] pods "kube-proxy-ip-10-16-1-25.us-west-2.compute.internal" not found
This is actually a CoreOS issue, although it is difficult to ascertain what the problem actually is. It is more than likely the low-level OS host-resolution code being called from the AWS Go layers, but that is purely a guess. Upgrading the CoreOS AMI to a later version solved many of the issues we were facing.
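As a follow-up for anyone hitting the same "node not found" loop: it can help to confirm that the name the kubelet registers with matches what the API server expects. A rough sketch (the node name is copied from the log above; the metadata URL is the standard EC2 endpoint):
kubectl get nodes                                               # is the rebooted minion listed at all?
kubectl get node ip-10-16-1-20.us-west-2.compute.internal       # the name from the kubelet error above
hostname -f                                                     # on the minion: the name it will register with
curl -s http://169.254.169.254/latest/meta-data/local-hostname  # what EC2 reports as the private DNS name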