Unable to set up a multi-node Kubernetes cluster using kubeadm (Vagrant)

I have been setting up a multi-node Kubernetes cluster using kubeadm. The setup consists of 1 master and 1 worker node. I created the VMs using Vagrant.
I followed the docs,
https://kubernetes.io/docs/setup/independent/install-kubeadm/
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm
Created 2 VMs using Vagrant.
IPs: Master - 192.168.33.10, Worker - 192.168.1.21 (both on host-only networks)
I have tried 2 scenarios:
Case 1:
Ran kubeadm init --pod-network-cidr=10.244.0.0/16 successfully with all pods running.
Installed "Canal" pod network add on.
Followed all the instructions given at the end of the successful kubeadm init command.
SSHed into the 2nd VM and ran the kubeadm join .. command, and it gets stuck at "[preflight] Running pre-flight checks"
Case 2:
Repeated the same process with the flag --apiserver-advertise-address=192.168.33.10
Successfully ran the command kubeadm init --apiserver-advertise-address=192.168.33.10
But when I ran the command kubectl get nodes, it only showed the master node (I expected the worker node to show up too).
Kindly help me understand how I can complete this setup. Thank you.

I have a GitHub repository which does exactly what you want. I am pretty sure you will get the idea from it. If anything is not clear, please ask in a comment or update the original post.
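For what it's worth, the two IPs in the question are on different subnets (192.168.33.10 vs 192.168.1.21), which by itself can stall the join. As a rough sketch (not taken from the repository above; the worker address 192.168.33.11 is just an example), a Vagrant setup like this usually works once both VMs sit on the same host-only subnet:
# on the master (host-only IP 192.168.33.10)
kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=192.168.33.10
# on the worker, after giving it an address on the same subnet (e.g. 192.168.33.11)
kubeadm join 192.168.33.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>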

Related

Kubectl connection refused existing cluster

Hope someone can help me.
To describe the situation in short, I have a self-managed k8s cluster running on 3 machines (1 master, 2 worker nodes). In order to make it HA, I attempted to add a second master to the cluster.
After some failed attempts, I found out that I needed to add a controlPlaneEndpoint entry to the kubeadm-config ConfigMap. So I did, with masternodeHostname:6443.
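Concretely, the change was roughly this (masternodeHostname stands in for our first master's actual hostname):
kubectl -n kube-system edit configmap kubeadm-config
# then, under the ClusterConfiguration section, add:
#   controlPlaneEndpoint: "masternodeHostname:6443"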
I generated the certificate and join command for the second master, and after running it on the second master machine, it failed with
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
Checking the first master now, I get connection refused for the IP on port 6443. So I cannot run any kubectl commands.
Tried recreating the .kube folder, with all the config copied there, no luck.
Restarted kubelet, docker.
The containers running on the cluster seem ok, but I am locked out of any cluster configuration (dashboard is down, kubectl commands not working).
Is there any way I make it work again? Not losing any of the configuration or the deployments already present?
Thanks! Sorry if it’s a noob question.
Cluster information:
Kubernetes version: 1.15.3
Cloud being used: (put bare-metal if not on a public cloud) bare-metal
Installation method: kubeadm
Host OS: RHEL 7
CNI and version: weave 0.3.0
CRI and version: containerd 1.2.6
This is an old, known problem with Kubernetes 1.15 [1,2].
It is caused by a short etcd timeout period. As far as I'm aware it is a hard-coded value in the source and cannot be changed (a feature request to make it configurable is open for version 1.22).
Your best bet would be to upgrade to a newer version and recreate your cluster.
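If you do end up recreating it, a minimal sketch of that path looks like the following (this wipes the node's cluster state, so back up anything you need first; the endpoint and version are illustrative):
# on the control-plane node, after installing a newer kubeadm/kubelet
kubeadm reset
kubeadm init --control-plane-endpoint "masternodeHostname:6443" --upload-certs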

Kubernetes V1.16.8 doesn't support 'node-role' label using "--node-labels=node-role.kubernetes.io/master="

Upgrading a kube-aws v1.15.5 cluster to the next version, 1.16.8.
Use Case:
I want to keep the same node labels for the master and worker nodes that I am using in v1.15.
When I tried to upgrade the cluster to v1.16, --node-labels is restricted from using 'node-role'.
If I keep the node role as "node-role.kubernetes.io/master", the kubelet fails to start after the upgrade. If I remove the label, the kubectl get node output shows <none> under ROLES for the upgraded node.
How do I reproduce?
Before the upgrade I took a backup with 'cp /etc/sysconfig/kubelet /etc/sysconfig/kubelet-bkup' and removed "-role" from the live file. Once the upgrade completed, I put the original kubelet sysconfig back with 'mv /etc/sysconfig/kubelet-bkup /etc/sysconfig/kubelet'. Now I can see the node role as Master/Worker even after a kubelet service restart.
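In case it helps to see it spelled out, the edit amounts to something like this (the exact variable layout inside /etc/sysconfig/kubelet depends on the kube-aws templates, so treat it as a sketch):
grep node-labels /etc/sysconfig/kubelet
# the v1.16 kubelet refuses to self-apply labels under node-role.kubernetes.io/*,
# so the temporary edit effectively turns
#   --node-labels=node-role.kubernetes.io/master=
# into
#   --node-labels=node.kubernetes.io/master=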
The problem I'm facing now:
Although I performed the upgrade on the existing cluster successfully, the cluster runs in AWS under the kube-aws model, so the ASG spins up a new node whenever the Cluster Autoscaler triggers it.
But the new node fails to join the cluster, since the node label "node-role.kubernetes.io/master" exists in the code base.
How can I add the node-role label dynamically as part of the ASG scaling process? Any solution would be appreciated.
Note:
(kubeadm, kubelet, kubectl) - v1.16.8
I have sorted out the issue. I created a Python program that watches node events. Whenever the ASG spins up a new node, the node joins the cluster with an empty role, and the Python code then adds the appropriate label to the node dynamically.
I have also built a Docker image based on the Python node-labelling script, and it runs as a pod. The pod is deployed into the cluster and does the job of labelling the new nodes.
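In effect, once a new worker registers, the watcher does the equivalent of this (the node name is a placeholder; unlike the kubelet, an admin client is allowed to set node-role labels):
kubectl label node <new-node-name> node-role.kubernetes.io/worker= --overwrite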
See my solution on GitHub:
https://github.com/kubernetes/kubernetes/issues/91664
I have also published it as a Docker image, which is publicly available:
https://hub.docker.com/r/shaikjaffer/node-watcher
Thanks,
Jaffer

Cannot get fabric8 to start on local development machine (OSX or Linux)

I'm trying to give fabric8 a shot, but I'm having issues getting it to start on a local machine running minikube and VirtualBox (I've attempted this on both Linux and OSX). I'm able to get all but one of the pods to start (after manually increasing minikube's VM RAM to 8 GB). The exposecontroller won't start and gives me the following error in the logs:
I0415 14:29:43.431944 1 exposecontroller.go:47] Using build: '2.3.2'
F0415 14:29:43.492059 1 exposecontroller.go:66] failed to create new strategy: failed to create node port expose strategy: failed to list nodes: nodes is forbidden: User "system:serviceaccount:fabric8:exposecontroller" cannot list nodes at the cluster scope
Here are the commands I'm running:
minikube start --cpus=5 --disk-size=50g --memory=8000
curl -sS http://get.fabric8.io/download.txt | bash
gofabric8 start
I also tried creating an OAuth secret via GitHub (using bogus IP address info for the redirect URL), but this doesn't make sense to me because I don't have a domain... Then I ran these:
minikube start --vm-driver=xhyve --cpus=5 --disk-size=50g --memory=8000
minikube addons enable ingress
gofabric8 deploy --package system -n fabric8
That resulted in the exposecontroller working, but then additional pods (keycloak, for example) were created and failed to start.
I've spent hours trying to get this to work and am about to give up. The documentation on GitHub differs from fabric8's site documentation and I just can't get it to work. If someone is able to help, I would greatly appreciate it.
Note:
I've attempted to follow the instructions here:
http://fabric8.io/guide/getStarted/gofabric8.html
Additionally, I attempted to follow this:
https://github.com/fabric8io/fabric8-platform/blob/master/INSTALL.md
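One more data point: the error reads like an RBAC problem (the fabric8:exposecontroller service account is forbidden from listing nodes). I haven't verified it, but as a blunt, overly broad test, something like this should show whether missing permissions are the cause:
kubectl create clusterrolebinding exposecontroller-admin --clusterrole=cluster-admin --serviceaccount=fabric8:exposecontroller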

Failed to create pod sandbox kubernetes cluster

I am using the Weave network plugin.
Inside the folder /etc/cni/net.d there is a 10-weave.conf:
{
  "name": "weave",
  "type": "weave-net",
  "hairpinMode": true
}
My Weave pods are running and the DNS pod is also running.
But when I want to run a pod, like a simple nginx which will pull an nginx image,
the pod gets stuck at ContainerCreating, and describe pod gives me the error "failed create pod sandbox".
When I run journalctl -u kubelet I get this error:
cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Is my network plugin not configured correctly?
I used this command to configure my Weave network:
kubectl apply -f https://git.io/weave-kube-1.6
After this didn't work, I also tried this command:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
I even tried flannel and that gives me the same error.
The system I am setting Kubernetes up on is a Raspberry Pi.
I am trying to build a Raspberry Pi cluster with 3 nodes and 1 master running Kubernetes.
Does anyone have ideas on this?
Thank you all for responding to my question. I have solved my problem now. For anyone who comes to my question in the future, the solution was as follows.
I cloned my Raspberry Pi images because I wanted a basicConfig.img for when I needed to add a new node to my cluster or when one goes down.
Weave Net (the plugin I used) got confused because the OS on every node and the master had the same machine-id. When I deleted the machine-id and created a new one (and rebooted the nodes), my error was fixed. The commands to do this were:
sudo rm /etc/machine-id
sudo rm /var/lib/dbus/machine-id
sudo dbus-uuidgen --ensure=/etc/machine-id
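As a quick check (not part of the original fix), each node should report a distinct id after the reboot:
cat /etc/machine-id    # run on every node; the values must all differ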
Once again my patience was tested, because my Kubernetes setup was fine and my Raspberry Pi OS was fine. I found this with the help of someone in the Kubernetes community, which again shows how important and great our IT community is. To the people of the future who come to this question: I hope this solution fixes your error and cuts down the time you spend searching for a stupid small thing.
Looking at the pertinent code in Kubernetes and in CNI, the specific error you see seems to indicate that it cannot find any files ending in .json, .conf or .conflist in the directory given.
This makes me think it could be something like the conf file not being present on all the hosts, so I would verify that as a first step.
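For example, a quick check along these lines on each node would confirm both that a config file exists and that the kubelet has picked it up (commands are illustrative):
ls -l /etc/cni/net.d/
journalctl -u kubelet | grep -i cni | tail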

k8s (single-node) not working after restart

I installed Kubernetes on Ubuntu 16.04 (a VirtualBox VM) - a single node with the master tainted. It worked well, but after I restarted my VM it is not working any more.
kubectl commands are not working any more; they throw this error -
The connection to the server localhost:8001 was refused - did you specify the
right host or port?
It looks similar to this thread, but the solution is not working for me.
When I try "sudo docker ps -a", all the kube containers are shown in Exited status.
Any help/pointers, please? Thanks in advance.
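One way to see why those containers exited is to pull their logs directly (not from the original post; the container id is a placeholder):
sudo docker ps -a --filter name=kube-apiserver
sudo docker logs <container-id>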
I've been having the same issue with my Rancher 2 setup. I have two nodes in one cluster. One of my node servers was restarted and never reconnected to my cluster, even though Docker and the containers were running fine.
One of the things I tried was reducing the number of workloads that can run on one node. I had increased it to 400, so I put it back to 100. That's when I got my first breakthrough on what could be happening with my downed node. I got the error "Path /var/lib/docker is mounted on / but it is not a shared or slave mount." A quick search led me to a similar issue on the Rancher GitHub page. Basically a workaround by superseb fixed my issue. I SSHed into my node and ran:
> mount --make-rshared /
> docker start kubelet
Your issue might be different, but you could be having this same shared-mount problem.
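As a follow-up note (not from the original answer): mount --make-rshared / does not survive a reboot, so it may need to be re-applied or scripted at boot. The current propagation setting can be checked with:
findmnt -o TARGET,PROPAGATION /    # should report "shared" once the workaround is applied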