Failed to remove etcd after reset kubeadm - kubernetes

When I try to run kubeadm reset -f, it reports that the etcd member cannot be removed and must be removed manually:
failed to remove etcd member: error syncing endpoints with etc: etcdclient: no available endpoints. Please manually remove this etcd member using etcdctl

Is this a control-plane (master) node?
If not: simply running kubectl delete node <node_id> should suffice (see reference below). This will update etcd and take care of the rest of the cleanup. You'll still have to diagnose what caused the node to fail to reset in the first place if you're hoping to re-add it... but that's a separate problem. See, for example, this discussion on a related issue:
If the node has hard-failed and you cannot call kubeadm reset on it, manual steps are required. You'd have to:
Remove the control-plane IP from the kubeadm-config CM ClusterStatus
Remove the etcd member using etcdctl
Delete the Node object using kubectl (if you don't want the Node around anymore)
1 and 2 apply only to control-plane nodes.
Hope this helps. If you are dealing with a master node, the rough shape of the commands is sketched below.
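A minimal sketch of those manual steps, assuming a kubeadm-built stacked-etcd cluster with the default certificate paths; the member ID, node names, and the exact way etcdctl is invoked are placeholders to adapt to your environment:

# 1. Remove the failed control-plane endpoint from the ClusterStatus section
kubectl -n kube-system edit cm kubeadm-config

# 2. Find and remove the stale etcd member, running etcdctl inside a healthy etcd pod
kubectl -n kube-system exec etcd-<healthy-master> -- sh -c \
  "ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key member list"

kubectl -n kube-system exec etcd-<healthy-master> -- sh -c \
  "ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key member remove <failed-member-id>"

# 3. Delete the Node object if you no longer want it around
kubectl delete node <failed-node-name>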

Related

Recovering Kubernetes cluster without certs

I have the following scenario in the lab and would like to see if it's possible to recover. The cluster is broken, but that's expected, since I was testing how far I could go with breaking the cluster and still be able to recover.
Env:
Kubernetes 1.16.3
Kubespray
I was experimenting a bit and don't have any data on this cluster but I am still very curious if it's possible to recover. I have a healthy 3 node etcd cluster with the original configuration (all namespaces, workloads, configmaps etc). I don't have the original SSL certs for the control plane.
I removed all nodes from the cluster (kubeadm reset). I have the original manifests and kubelet config and am trying to re-init the master nodes. It went better than I expected, but it's not where I want it to be.
After successful kubeadm init, the kubelet and control plane containers start successfully but the corresponding pods are not created. I am able to use the kube API with kubectl and see the nodes, namespaces, deployments, etc.
In the kube-system namespace all daemonsets still exist but the pods won't start with the following message:
49m Warning FailedCreate daemonset/kube-proxy Error creating: Timeout: request did not complete within requested timeout
The kubelet logs the following regarding the control plane pods:
Jul 21 22:30:02 k8s-master-4 kubelet[13791]: E0721 22:30:02.088787 13791 kubelet.go:1664] Failed creating a mirror pod for "kube-scheduler-k8s-master-4_kube-system(3e128801ef687b022f6c8ae175c9c56d)": Timeout: request did not complete within requested timeout
Jul 21 22:30:53 k8s-master-4 kubelet[13791]: E0721 22:30:53.089517 13791 kubelet.go:1664] Failed creating a mirror pod for "kube-controller-manager-k8s-master-4_kube-system(da5cfae13814fa171a320ce0605de98f)": Timeout: request did not complete within requested timeout
During the kubeadm reset/init process I already perform some extra steps to get to where I am now (deleting service accounts to reset the tokens, deleting some configmaps such as kubeadm's).
My question is: is it possible to recover the control plane without the certs? Even if it's a complicated but still possible process, I would like to know.
All help appreciated
Henro
is it possible to recover the control plane without the certs.
Yes, you should be able to. The certs 🔏 are required, but they don't have to be the very same ones that you created the cluster with initially. All the certificates, including the CA, can be rotated across the board. The kubelet even supports certificate auto-rotation. The configurations need to match everywhere, though: the CA needs to be the same one that signed the CSRs, and the keys/certs need to be created from the same CSRs. 🔑
Also, all the components need to use the same CA and be able to authenticate with the API server (kube-controller-manager, kube-scheduler, etc) 🔐. I'm not entirely sure about the logs that you are seeing but it looks like the kube-controller-manager and kube-scheduler are not able to authenticate and join the cluster. So I would take a look at their cert configurations:
/etc/kubernetes/kube-controller-manager.conf
/etc/kubernetes/kube-scheduler.conf
Also, you would find every PKI component that you need to verify under /etc/kubernetes/pki
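If you do end up regenerating everything with kubeadm, a rough sketch of what that could look like on a master, assuming kubeadm defaults (back up the existing /etc/kubernetes first):

# keep a copy of the current PKI and kubeconfigs
sudo cp -r /etc/kubernetes /etc/kubernetes.bak
# regenerate all certificates (an existing CA in /etc/kubernetes/pki is reused)
sudo kubeadm init phase certs all
# regenerate the admin, kubelet, controller-manager and scheduler kubeconfigs
sudo kubeadm init phase kubeconfig all
# restart the kubelet so the static control-plane pods pick up the new files
sudo systemctl restart kubelet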
✌️

Kubernetes V1.16.8 doesn't support 'node-role' label using "--node-labels=node-role.kubernetes.io/master="

I am upgrading a kube-aws v1.15.5 cluster to the next version, 1.16.8.
Use Case:
I want to keep the same node labels for master and worker nodes as I'm using in v1.15.
When I tried to upgrade the cluster to v1.16, --node-labels was restricted from using 'node-role' labels.
If I keep the label "node-role.kubernetes.io/master", the kubelet fails to start after the upgrade. If I remove the label, the kubectl get node output shows <none> for the upgraded node.
How do I reproduce?
Before the upgrade I took a backup with 'cp /etc/sysconfig/kubelet /etc/sysconfig/kubelet-bkup', removed the '-role' label from the active file, and once the upgrade completed I put the edited kubelet sysconfig back with 'mv /etc/sysconfig/kubelet-bkup /etc/sysconfig/kubelet'. Now I can see the node role as Master/Worker even after a kubelet service restart.
The problem I'm facing now:
I performed the upgrade on the existing cluster successfully. The cluster is running in AWS using the kube-aws model, so the ASG spins up a new node whenever the Cluster Autoscaler triggers it.
But the new node fails to join the cluster, since the node label "node-role.kubernetes.io/master" still exists in the code base.
How can I add the node-role label dynamically when the ASG scales out? Any solution would be appreciated.
Note:
(Kubeadm, kubelet, kubectl )- v1.16.8
I have sorted out the issue. I created a Python program that watches node events. Whenever the ASG spins up a new node, the node initially has an empty role after joining the cluster, and the Python code then adds the appropriate label to the node dynamically.
I have also packaged the Python node-labelling script as a Docker image that runs as a pod. The pod is deployed into the cluster and does the job of labelling the new nodes.
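For context, the v1.16 restriction only stops the kubelet from self-assigning node-role.kubernetes.io labels via --node-labels; an external client with the right RBAC can still set them. The watcher essentially automates this one-liner for every node that joins (the node name and role are placeholders):

kubectl label node <new-node-name> node-role.kubernetes.io/worker=""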
See my solution in this GitHub issue:
https://github.com/kubernetes/kubernetes/issues/91664
I have also published the Docker image, which is publicly available:
https://hub.docker.com/r/shaikjaffer/node-watcher
Thanks,
Jaffer

Replacing dead master in Kubernetes 1.15 cluster with stacked control plane

I have a Kubernetes cluster with a 3-master stacked control plane, so each master also has its own etcd instance running locally. The problem I am trying to solve is this:
"If one master dies such that it cannot be restarted, how do I replace it?"
Currently, when I try to add the replacement master into the cluster, I get the following error while running kubeadm join:
[check-etcd] Checking that the etcd cluster is healthy
I0302 22:43:41.968068 9158 local.go:66] [etcd] Checking etcd cluster health
I0302 22:43:41.968089 9158 local.go:69] creating etcd client that connects to etcd pods
I0302 22:43:41.986715 9158 etcd.go:106] etcd endpoints read from pods: https://10.0.2.49:2379,https://10.0.225.90:2379,https://10.0.247.138:2379
error execution phase check-etcd: error syncing endpoints with etc: dial tcp 10.0.2.49:2379: connect: no route to host
The 10.0.2.49 node is the one that died. These nodes are all running in an AWS AutoScaling group, so I don't have control over the addresses.
I have drained and deleted the dead master node using kubectl drain and kubectl delete; and I have used etcdctl to make sure the dead node was not in the member list.
Why is it still trying to connect to that node's etcd?
It is still trying to connect to the member because etcd maintains a list of members in its store; that's how it knows which peers vote on quorum decisions. I don't believe etcd is unique in that way; most distributed key-value stores track their member list.
The fine manual shows how to remove a dead member, but it also warns to add a new member before removing unhealthy ones.
There is also a project, etcdadm, that is designed to smooth over some of the rough edges of etcd cluster management, but I haven't used it enough to say what it is and isn't good at.
The problem turned out to be that the failed node was still listed in the ConfigMap. Further investigation led me to the following thread, which discusses the same problem:
https://github.com/kubernetes/kubeadm/issues/1300
The solution that worked for me was to edit the ConfigMap manually.
kubectl -n kube-system get cm kubeadm-config -o yaml > tmp-kubeadm-config.yaml
Manually edit tmp-kubeadm-config.yaml to remove the dead server from the ClusterStatus apiEndpoints section
kubectl -n kube-system apply -f tmp-kubeadm-config.yaml
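For reference, the ClusterStatus section of the kubeadm-config ConfigMap looks roughly like this on a kubeadm 1.15 cluster (the host names below are illustrative); delete the whole entry for the dead master:

apiEndpoints:
  ip-10-0-2-49:            # the dead master; remove this entire entry
    advertiseAddress: 10.0.2.49
    bindPort: 6443
  ip-10-0-225-90:
    advertiseAddress: 10.0.225.90
    bindPort: 6443
  ip-10-0-247-138:
    advertiseAddress: 10.0.247.138
    bindPort: 6443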
I believe updating the etcd member list is still necessary to ensure cluster stability, but it wasn't the full solution.

What is the way to make kubernetes nodes have `providerID` spec after creation (manually)?

I'm expecting kubectl get nodes <node> -o yaml to show spec.providerID (see reference below) once the kubelet has been given the additional flag --provider-id=provider://nodeID. I've used the /etc/default/kubelet file to add more flags to the command line when the kubelet is started/restarted. (This is a k8s 1.16 cluster.) I can see the additional flags via a systemctl status kubelet --no-pager call, so the file is respected.
However, I've not seen the value returned by the kubectl get node <node> -o yaml call. I was thinking it had to be because the node was already registered, but I believe the kubelet re-registers when it starts up. I've seen a log line via journalctl -u kubelet suggesting that it has gone through registration.
How can I add a provider ID to a node manually?
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#nodespec-v1-core
How a kubelet is configured on the node itself is separate (AFAIK) from its definition in the master control plane, which is responsible for updating state in the central etcd store; so it's possible for these to fall out of sync. That is, you need to communicate with the control plane to update its records.
In addition to Subramanian's suggestion, kubectl patch node would also work, and has the added benefit of being easily reproducible/scriptable compared to manually editing the YAML manifest; it also leaves a "paper trail" in your shell history should you need to refer back. Take your pick :) For example,
$ kubectl patch node my-node -p '{"spec":{"providerID":"foo"}}'
node/my-node patched
$ kubectl describe node my-node | grep ProviderID
ProviderID: foo
Hope this helps!
You can edit the node object and add the providerID information under the spec section.
kubectl edit node <Node Name>
...
spec:
  podCIDR: ...
  providerID: provider://nodeID    # the value you passed to --provider-id
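Whichever way you set it, you can confirm that the field is now populated with something like:

kubectl get node <node-name> -o jsonpath='{.spec.providerID}'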

How to restart master node in kubernetes

I have a Kubernetes cluster with 3 masters and 3 workers, and I want to restart one of the masters to update the operating system of the master machine.
Can I just reboot the machine directly from the console with reboot, or do some steps need to be done before the reboot to avoid the risk of an outage and data loss?
If you need to reboot a node (such as for a kernel upgrade, libc upgrade, hardware repair, etc.), and the downtime is brief, then when the kubelet restarts, it will attempt to restart the pods scheduled to it. If the reboot takes longer (the default time is 5 minutes, controlled by --pod-eviction-timeout on the controller-manager), then the node controller will terminate the pods that are bound to the unavailable node. If there is a corresponding replica set (or replication controller), then a new copy of the pod will be started on a different node. So, in the case where all pods are replicated, upgrades can be done without special coordination, assuming that not all nodes will go down at the same time.
If you want more control over the upgrading process, you may use the following workflow:
Use kubectl drain to gracefully terminate all pods on the node while marking the node as unschedulable:
kubectl drain $NODENAME
This keeps new pods from landing on the node while you are trying to get them off.
For pods with a replica set, the pod will be replaced by a new pod which will be scheduled to a new node. Additionally, if the pod is part of a service, then clients will automatically be redirected to the new pod.
For pods with no replica set, you need to bring up a new copy of the pod, and assuming it is not part of a service, redirect clients to it.
Perform maintenance work on the node.
Make the node schedulable again:
kubectl uncordon $NODENAME
Additionally, if the node is hosting etcd, then you need to be extra careful in terms of rolling upgrades of etcd and backing up the data.
Take a backup of etcd first if the node hosts it. You can use the built-in command to back up the data, like this:
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /tmp/snapshot-pre-boot.db
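If you want a quick sanity check that the snapshot file is readable (optional), something like this should work:

ETCDCTL_API=3 etcdctl snapshot status /tmp/snapshot-pre-boot.db --write-out=table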
Now drain the node (on a control-plane node you will likely also need --ignore-daemonsets):
kubectl drain <master01>
Do the system update/patches and reboot.
Now uncordon the node back to the cluster
kubectl uncordon <master01>
Whenever you wish to reboot the OS on a particular node (master or worker), Kubernetes is not aware of that action in advance; it simply keeps all cluster-related state in the etcd key-value store, backing up the most recent data. To prepare carefully for a node reboot, you should put the node into maintenance: drain it so nothing new is scheduled there and the existing pods are gracefully terminated.
If your workloads are defined with a set of replicas, the ReplicationController (or ReplicaSet) guarantees that the specified number of pod replicas is running at any one time across the available nodes: it simply re-spawns pods that fail a health check or are deleted or terminated, to match the desired replica count. In the case of master nodes that host etcd, you need to be extra careful in terms of rolling upgrades of etcd and backing up the data.
1. Back up a single master
As mentioned previously, we need to back up etcd. In addition to that, we need the certificates and, optionally, the kubeadm configuration file so the master can easily be restored. If you set up your cluster using kubeadm (with no special configuration) you can do it similar to this:
Backup certificates:
$ sudo cp -r /etc/kubernetes/pki backup/
Make etcd snapshot:
$ sudo docker run --rm -v $(pwd)/backup:/backup \
--network host \
-v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
--env ETCDCTL_API=3 \
k8s.gcr.io/etcd-amd64:3.2.18 \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
snapshot save /backup/etcd-snapshot-latest.db
Backup kubeadm-config:
$ sudo cp /etc/kubeadm/kubeadm-config.yaml backup/
Note that the contents of the backup folder should then be stored somewhere safe, where they can survive even if the master is completely destroyed. You might want to use e.g. AWS S3 (or similar) for this.
There are three commands in the example and all of them should be run on the master node. The first one copies the folder containing all the certificates that kubeadm creates. These certificates are used for secure communication between the various components in a Kubernetes cluster. The second one takes a snapshot of the etcd key-value store. The final command is optional and only relevant if you use a configuration file for kubeadm. Storing this file makes it easy to initialize the master with the exact same configuration as before when restoring it.
If the master update goes wrong, you can then simply restore the old version of the master node.
You can also automate etcd backups.
Doing a single backup manually may be a good first step but you really need to make regular backups for them to be useful. The easiest way to do this is probably to take the commands from the example above, create a small script and a cron job that runs the script every now and then. But since we are running Kubernetes anyway, use a Kubernetes CronJob. This would allow you to keep track of the backup jobs inside Kubernetes just like you monitor your workloads.
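A minimal sketch of what such a CronJob could look like, assuming a kubeadm-style stacked etcd listening on 127.0.0.1:2379 and the certificate paths used above; the schedule, image tag, backup hostPath, and the master label/toleration are illustrative and need adapting to your cluster (batch/v1beta1 is the CronJob API version in this era of Kubernetes):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"        # every six hours; adjust to taste
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/master: ""   # run on a master that hosts etcd
          tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
          containers:
          - name: etcd-backup
            image: k8s.gcr.io/etcd-amd64:3.2.18
            env:
            - name: ETCDCTL_API
              value: "3"
            command: ["/bin/sh", "-c"]
            args:
            - >
              etcdctl --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt
              --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
              snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H-%M-%S).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
              type: Directory
          - name: backup
            hostPath:
              path: /var/backups/etcd
              type: DirectoryOrCreate

You would still want to ship the resulting snapshot files off the node (for example to S3), as noted above.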
You can find more information here: backups-kubernetes.
2. Next step is to mark a node unschedulable, run this command:
$ kubectl drain $NODENAME
The kubectl drain command should only be issued to a single node at a time. However, you can run multiple kubectl drain commands for different nodes in parallel, in different terminals or in the background. Multiple drain commands running concurrently will still respect the PodDisruptionBudget you specify.
3. Execute the system update or patch and reboot.
4. Finally uncordon the node back to the cluster, execute command below:
$ kubectl uncordon $NODENAME
On GCP there is an auto-upgrading nodes option, which makes managing node updates easier.
You can read more about Kubernetes node maintenance here: node-maintenance.