Unable to run pods on new node - kubernetes

I had to replace a node (server) with a new one, keeping the same node name. What I did was:
master> kubectl delete no srv1 (removing old node)
srv1> kubeadm join... (joining new node)
After the new node joined the cluster, no pods could be created:
Warning FailedCreatePodSandBox 16s kubelet, srv1 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b85728b51a18533e9d57f6a1b1808dbb5ad72bff4d516217de04e7dad4ce358d" network for pod "dpl-6f56777485-6jzm6": NetworkPlugin cni failed to set up pod "dpl-6f56777485-6jzm6_default" network: failed to set bridge addr: "cni0" already has an IP address different from 10.244.16.1/24

Ideally, when performing a task like "replacing a node", the following steps should be taken:
Drain the node: kubectl drain NODE_NAME
Reset that node: run kubeadm reset on the old node (optional step, if the old node is still accessible)
Finally, delete it: kubectl delete node NODE_NAME
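Putting those together, a minimal sketch (the node name and the drain flag are placeholders/assumptions, adjust to your workloads):
kubectl drain srv1 --ignore-daemonsets    # on the master: evict workloads first
kubeadm reset                             # on the old node, if it is still reachable
kubectl delete node srv1                  # on the master: remove the Node object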
Things to consider when replacing an old node with a new one:
The new node should have the same name as the old node, i.e. the output of echo $HOSTNAME should remain the same.
The new node should have the same IP as the old one.
This is because these two make up the node's identity.
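A quick sanity check on the replacement node before joining (the interface name eth0 is an assumption, substitute your own):
echo $HOSTNAME                  # must match the old node's name
ip -4 addr show dev eth0        # must show the old node's IP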
Finally, in a scenario where you have already performed kubectl delete node ... and replaced the node with a new one, you can reset the Flannel networking as follows:
curl -LO https://raw.githubusercontent.com/coreos/flannel/62e44c867a2846fefb68bd5f178daf4da3095ccb/Documentation/kube-flannel.yml
kubectl delete -f kube-flannel.yml
[perform the below on the nodes that are having problems]
sudo ip link del cni0
sudo ip link del flannel.1
sudo systemctl restart network
[re-apply network plugin]
kubectl apply -f kube-flannel.yml
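Once that is done, a quick hedged check that the overlay is healthy again (pod names and namespace depend on the manifest version you applied):
kubectl get pods -n kube-system -o wide | grep flannel    # the flannel pod on each node should be Running
kubectl get nodes                                         # all nodes should report Ready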

Related

Kubernetes network plugin

I have installed a Kubernetes cluster of 3 nodes with the calico network plugin.
For some reason I decided to remove Kubernetes completely and reinstall it with a different network plugin: Flannel.
All seemed fine until I tried to deploy my first container.
kubectl describe pod/cassandra returns the following error:
Unknown desc = [failed to set up sandbox container "957f68c3cbe9b230b0e2bd6729a12c340f903de568622e28e335f7b48563a445" network for pod "cassandra-d7db46b86-dz7ck": networkPlugin cni failed to set up pod "cassandra-d7db46b86-dz7ck_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"), failed to clean up sandbox container "957f68c3cbe9b230b0e2bd6729a12c340f903de568622e28e335f7b48563a445" network for pod "cassandra-d7db46b86-dz7ck": networkPlugin cni failed to teardown pod "cassandra-d7db46b86-dz7ck_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")]
Normal SandboxChanged 3s (x3 over 18s) kubelet, <node name> Pod sandbox changed, it will be killed and re-created.
Reading the errors, it seems that the Calico plugin is still being used by Kubernetes, even though I removed it and installed the Flannel plugin.
How can I clean up this mess?
Clear the IP routes: ip route flush proto bird
Remove all Calico links on all nodes:
ip link list | grep cali | awk '{print $2}' | cut -c 1-15 | xargs -I {} ip link delete {}
Remove the ipip module: modprobe -r ipip
Remove the Calico configs:
rm /etc/cni/net.d/10-calico.conflist && rm /etc/cni/net.d/calico-kubeconfig
Restart the kubelet service.
After this, install Flannel.
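For the Flannel install itself, a minimal sketch reusing the manifest URL referenced earlier in this thread (pin whichever revision matches your cluster version):
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/62e44c867a2846fefb68bd5f178daf4da3095ccb/Documentation/kube-flannel.yml
kubectl get pods -n kube-system | grep flannel    # wait until the flannel pods are Running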
Can you try to rejoin (remove from the cluster and join it again) the compute/slave nodes? It worked for one of my cases before.

Nginx Kubernetes POD stays in ContainerCreating

I was able to set up a Kubernetes cluster on CentOS 7 with one master and two worker nodes; however, when I try to deploy a pod with nginx, the pod stays in ContainerCreating forever and never gets out of that state.
For the pod network I am using Calico.
Can you please help me resolve this issue? For some reason I don't feel comfortable moving forward without resolving it; I have been checking forums for the last two days and reaching out to you is my last resort.
[root@kube-master ~]# kubectl get pods --all-namespaces
(screenshot of the kubectl get pods output)
However, when I run kubectl describe pod I see the below error for the nginx container under the events section.
Warning FailedCreatePodSandBox 41s (x8 over 11m) kubelet,
kube-worker1 (combined from similar events): Failed to create pod
sandbox: rpc error: code = Unknown desc = failed to set up sandbox
container
"ac77a42270009cba0c508e2fd82a84d6caef287bdb117d288d5193960b52abcb"
network for pod "nginx-6db489d4b7-2r4d2": networkPlugin cni failed to
set up pod "nginx-6db489d4b7-2r4d2_default" network: unable to connect
to Cilium daemon: failed to create cilium agent client after 30.000000
seconds timeout: Get http:///var/run/cilium/cilium.sock/v1/config:
dial unix /var/run/cilium/cilium.sock: connect: no such file or
directory
Hope you can help here.
Edit 1:
The ip address of the master VM is 192.168.40.133
Used the below command to initialize the kubeadm:
kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address 192.168.40.133
Used the below command to install the pod network:
kubectl apply -f https://docs.projectcalico.org/v3.11/manifests/calico.yaml
The kubeadm init above gave me the join command that I used to join the workers into the cluster.
All the VMs are connected to host and bridged network adapters.
Your pod subnet (specified by --pod-network-cidr) clashes with the network your VMs are located in: these two have to be distinct. Use something else for the pod subnet, for example 10.244.0.0/16, and then edit calico.yaml before applying it, as described in the official docs:
POD_CIDR="10.244.0.0/16"
kubeadm init --pod-network-cidr=${POD_CIDR} --apiserver-advertise-address 192.168.40.133
curl https://docs.projectcalico.org/manifests/calico.yaml -O
sed -i -e "s?192.168.0.0/16?${POD_CIDR}?g" calico.yaml
kubectl apply -f calico.yaml
hope this helps :)
Note: you don't really need to specify the --apiserver-advertise-address flag: kubeadm will correctly detect the main IP of the machine most of the time.
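To sanity-check that the two ranges really are distinct after kubeadm init, something like the following may help (assumes a kubeadm cluster where the controller manager is started with --cluster-cidr, and the VM network from the question):
kubectl cluster-info dump | grep -m 1 cluster-cidr    # should print the pod CIDR, e.g. 10.244.0.0/16
ip route | grep 192.168.40.                           # the VM network, which must not overlap with the pod CIDR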

My kubernetes cluster IP address changed and now kubectl will no longer connect

Running under Ubuntu, I used kubeadm init to set up my cluster (master node) and copied /etc/kubernetes/admin.conf to $HOME/.kube/config, and all was well when using kubectl.
However, after a reboot my master node's IP address changed, so it no longer matches what is in $HOME/.kube/config and kubectl can no longer connect.
So how do I regenerate admin.conf now that I have a new IP address? Running kubeadm init again would just kill everything, which is not what I want.
I found this solution on the internet and it works for me:
systemctl stop kubelet docker
cd /etc/
mv kubernetes kubernetes-backup
mv /var/lib/kubelet /var/lib/kubelet-backup
mkdir -p kubernetes
cp -r kubernetes-backup/pki kubernetes
rm kubernetes/pki/{apiserver.*,etcd/peer.*}
systemctl start docker
kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd
#Run "kubeadm reset" on all nodes if was this error "error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
[ERROR Port-10250]: Port 10250 is in use
[ERROR FileAvailable--etc-kubernetes-pki-ca.crt]: /etc/kubernetes/pki/ca.crt already exists"
cp kubernetes/admin.conf ~/.kube/config
kubectl get nodes --sort-by=.metadata.creationTimestamp
kubectl delete node $(kubectl get nodes -o jsonpath='{.items[?(@.status.conditions[0].status=="Unknown")].metadata.name}')
kubectl get pods --all-namespaces
After these steps, join your slave (worker) nodes to the master again.
Reference: https://medium.com/@juniarto.samsudin/ip-address-changes-in-kubernetes-master-node-11527b867e88
The following command can be used to regenerate admin.conf
kubeadm alpha phase kubeconfig admin --apiserver-advertise-address <new_ip>
However, if you use an IP instead of a hostname, your API server certificate will be invalid. So either regenerate your certs (kubeadm alpha phase certs renew apiserver), use hostnames instead of IPs, or add the --insecure-skip-tls-verify flag when using kubectl.
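On newer kubeadm releases the alpha phase commands have been promoted, so rough equivalents would be (a sketch only, verify the subcommands and flags against your kubeadm version; old cert/kubeconfig files may need to be moved aside first, since kubeadm skips files that already exist):
kubeadm init phase certs apiserver --apiserver-advertise-address <new_ip>    # regenerate the apiserver certificate
kubeadm init phase kubeconfig admin                                          # regenerate /etc/kubernetes/admin.conf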
You do not want to use kubeadm reset. That will reset everything and you would have to start configuring your cluster again.
Well, in your scenario, please have a look at the steps below:
nano /etc/hosts (update the new IP against YOUR_HOSTNAME)
nano /etc/kubernetes/config (configuration settings related to your cluster); in this file, look for the following params and update them accordingly:
KUBE_MASTER="--master=http://YOUR_HOSTNAME:8080"
KUBE_ETCD_SERVERS="--etcd-servers=http://YOUR_HOSTNAME:2379" #2379 is default port
nano /etc/etcd/etcd.conf (config related to etcd):
KUBE_ETCD_SERVERS="--etcd-servers=http://YOUR_HOSTNAME/WHERE_EVER_ETCD_HOSTED:2379"
2379 is the default port for etcd, and you can define multiple etcd servers here, comma-separated.
Restart the kubelet, apiserver and etcd services.
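A hedged example of that restart, assuming the older systemd-managed (non-kubeadm) layout these config paths imply; the service names may differ on your distro:
systemctl restart etcd kube-apiserver kubelet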
It is good practice to use hostnames instead of IPs to avoid such scenarios.
Hope it helps!

Unable to see joined nodes in Kubernetes master

This is my worker node:
root@ivu:~# kubeadm join 10.16.70.174:6443 --token hl36mu.0uptj0rp3x1lfw6n --discovery-token-ca-cert-hash sha256:daac28160d160f938b82b8c720cfc91dd9e6988d743306f3aecb42e4fb114f19 --ignore-preflight-errors=swap
[preflight] Running pre-flight checks.
[WARNING Swap]: running with swap on is not supported. Please disable swap
[WARNING FileExisting-crictl]: crictl not found in system path
Suggestion: go get github.com/kubernetes-incubator/cri-tools/cmd/crictl
[discovery] Trying to connect to API Server "10.16.70.174:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.16.70.174:6443"
[discovery] Requesting info from "https://10.16.70.174:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "10.16.70.174:6443"
[discovery] Successfully established connection with API Server "10.16.70.174:6443"
This node has joined the cluster:
* Certificate signing request was sent to master and a response
was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the master to see this node join the cluster.
While checking on the master node using the command kubectl get nodes, I am only able to see the master:
ivum01@ivum01-HP-Pro-3330-SFF:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ivum01-hp-pro-3330-sff Ready master 36m v1.10.0
To answer the questions:
docker, kubelet, kubeadm and kubectl installed fine
kubectl get nodes cannot see the newly added node; of course, kubectl get pods --all-namespaces has no results for this node either
docker on the new node shows no activity from the kubeadm command (no Kubernetes images pulled, no containers running for it)
most important: kubelet is not running on the worker node
Running kubelet gives this output:
Failed to get system container stats for "/user.slice/user-1000.slice/session-1.scope": failed to get cgroup stats for "/user.slice/user-1000.slice/session-1.scope": failed to get container info for "/user.slice/user-1000.slice/session-1.scope": unknown container "/user.slice/user-1000.slice/session-1.scope"
Same as this issue describes.
Tearing down and resetting the cluster (kubeadm reset) and redoing the join fixed it in my case.
I had this problem and it was solved by ensuring that the cgroup driver on the worker nodes was also set properly.
check with:
docker info | grep -i cgroup
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
set it with:
sed -i "s/cgroup-driver=systemd/cgroup-driver=cgroupfs/g" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
then restart the kubelet service and rejoin the cluster:
systemctl daemon-reload
systemctl restart kubelet
kubeadm reset
kubeadm join ...
Info from docs: https://kubernetes.io/docs/tasks/tools/install-kubeadm/#configure-cgroup-driver-used-by-kubelet-on-master-node
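As a side note, on recent kubeadm/kubelet versions the cgroup driver is normally set via the cgroupDriver field of the KubeletConfiguration rather than a --cgroup-driver flag in the drop-in above; a hedged way to check it (the path is the kubeadm default):
grep cgroupDriver /var/lib/kubelet/config.yaml    # typically 'cgroupDriver: systemd' on recent setups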

How to change name of a kubernetes node

I have a running node in a kubernetes cluster. Is there a way I can change its name?
I have tried to
delete the node using kubectl delete
change the name in the node's manifest
add the node back.
But the node won't start.
Anyone know how it should be done?
Thanks
Usually it's the kubelet that is responsible for registering the node under a particular name, so you should make changes to your node's kubelet configuration and then it should pop up as a new node.
Changing the node's name is not possible at the moment; it requires you to remove and rejoin the node.
You need to make sure the hostname is changed to the new name, remove the node, reset it and rejoin it.
(You will notice that with the command kubectl edit node you will get an error if you try to save a changed name:
A copy of your changes has been stored to "/tmp/kubectl-edit-qlh54.yaml"
error: At least one of apiVersion, kind and name was changed
)
Ideally you will have already removed the running pods from it.
You can try to run kubectl drain <node_name_to_rename>. Proceed at your own risk if that doesn't complete. --ignore-daemonsets can be used to ignore possible issues for pods that cannot be evicted.
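The hostname change itself is done outside Kubernetes; on a systemd-based distro such as the CentOS 7 setup below, a sketch would be (the new name is a placeholder):
hostnamectl set-hostname new-worker-name
Then proceed with the delete/reset/rejoin steps below.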
In short, for a node that has been renamed and is out of the cluster on CentOS 7:
kubectl delete node <original-nodename>
Then on the node that you want to rejoin, as root:
kubeadm reset
check the output and see if it applies to your setup (for potential further cleanup).
Now generate the join command on the master node:
export KUBECONFIG=/etc/kubernetes/admin.conf #(or wherever you have it)
kubeadm token create --print-join-command
Run the output on the worker node you have just reset:
kubeadm join <masternode_ip_address>:6443 --token somegeneratedtoken --discovery-token-ca-cert-hash sha256:somesha256hashthatyougotfromtheabovecommand
If you run kubectl get nodes, it should now show up with the new name.
output in my case:
W0220 10:43:23.286109 11473 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.17" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.
Run 'kubectl get nodes' on the control-plane to see this node join the cluster.
Enjoy your renamed node!
Based on source: https://www.youtube.com/watch?v=TqoA9HwFLVU