Kubernetes kubeadm reset error - unable to reset - kubernetes

I initialized Kubernetes using kubeadm, and now when I try to reset it using kubeadm reset I am getting the following error. I searched several forums but couldn't find any answers.
> "level":"warn","ts":"2020-05-28T11:57:52.940+0200","caller":"clientv3/retry_interceptor.go:61","msg":"retrying
> o
> f unary invoker
> failed","target":"endpoint://client-e6d5f25b-0ed2-400f-b4d7-2ccabb09a838/192.168.178.200:2379","a
> ttempt":0,"error":"rpc error: code = Unknown desc = etcdserver:
> re-configuration failed due to not enough started
> members"}
The master node status is showing as NotReady and I have not been able to reset the network plugin (weave).
ubuntu@ubuntu-nuc-masternode:~$ kubectl get nodes
NAME         STATUS                        ROLES    AGE   VERSION
ubuntu-nuc   NotReady,SchedulingDisabled   master   20h   v1.18.3
I tried forcing the reset but it hasn't worked. Any help is much appreciated.

This seems to be the reported issue kubeadm reset takes more than 50 seconds to retry deleting the last etcd member, which was moved here.
The fix was committed in kubeadm: skip removing last etcd member in reset phase on May 28th.
What type of PR is this?
/kind bug
What this PR does / why we need it:
If this is the last etcd member of the cluster, it cannot be removed due to "not enough started members". Skip it, as the cluster will be destroyed in the next phase; otherwise the retries with exponential backoff will take more than 50 seconds to proceed.
Which issue(s) this PR fixes:
Fixes kubernetes/kubeadm#2144
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
kubeadm: during "reset" do not remove the only remaining stacked etcd member from the cluster and just proceed with the cleanup of the local etcd storage.

Related

Airflow KubernetesPodOperator Losing Connection to Worker Pod

Experiencing an odd issue with KubernetesPodOperator on Airflow 1.1.14.
Essentially for some jobs Airflow is losing contact with the pod it creates.
[2021-02-10 07:30:13,657] {taskinstance.py:1150} ERROR - ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
When I check logs in kubernetes with kubectl logs I can see that the job carried on past the connection broken error.
The connection broken error seems to happen exactly 1 hour after the last logs that Airflow pulls from the pod (we do have a 1 hour config on connections), but the pod keeps running happily in the background.
I've seen this behaviour repeatedly, and it tends to happen with longer running jobs with a gap in the log output, but I have no other leads. Happy to update the question if certain specifics are missing.
As I mentioned in the comments section, you can try setting the operator's get_logs parameter to False (the default value is True).
Take a look: airflow-connection-broken, airflow-connection-issue.
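For illustration, a minimal sketch of what that looks like on the operator itself; the task id, image and dag object below are made-up placeholders, not taken from the original question (Airflow 1.10.x import path):

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

long_job = KubernetesPodOperator(
    task_id="long_running_job",         # hypothetical task name
    name="long-running-job",
    namespace="default",
    image="my-registry/my-job:latest",  # assumed image
    get_logs=False,                     # do not stream pod logs over the long-lived connection
    dag=dag,                            # assumes a `dag` object defined elsewhere
)

With get_logs=False the operator only polls the pod status instead of tailing its log stream, so an idle log connection can no longer break the task.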

ceph 15.2.4 -- authentication changes -- unable to reattach with kubernetes

Yesterday my teammates found a way to disable cephx authentication cluster-wide (2-server cluster) in order to bypass issues that were preventing us from joining a 3rd server. Unfortunately they were uncertain which of the steps taken led to the successful addition. I'm requesting assistance getting my Ceph cluster operational again. Yesterday we left off after editing /etc/ceph/ceph.conf, turning authentication back on there, then copying the file to /var/lib/ceph///config and ensuring permissions were set to 644.
This got one command to work that previously had not been working: ceph osd df correctly shows all 24 OSDs again, but I cannot run ceph osd status or ceph orch status.
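For reference, turning cephx back on normally means having all three auth options set to cephx in the [global] section of ceph.conf on every host (and in the copy the daemons actually read) before restarting them. This is a generic sketch of those settings, not your exact file:

[global]
        auth_cluster_required = cephx
        auth_service_required = cephx
        auth_client_required = cephx

Once all daemons have restarted with matching settings, orchestrator-backed commands such as ceph orch status should start responding again when the mgr can authenticate.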

How to search for previous revisions using Microk8s etcd instance?

I'm trying to search previous versions of my secrets inside the MicroK8s etcd instance, but the number of revisions changes every time I refresh my screen and I don't know why.
When I try to access an older version I get the error below:
etcdctl --endpoints=127.0.0.1:2380 get --rev=9133 -w fields /registry/secrets/default/mysql-test-password
{"level":"warn","ts":"2020-09-14T13:40:08.594Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8b3e59a8-efd2-4f77-a96f-5ec3c451b9b7/127.0.0.1:2380","attempt":0,"error":"rpc error: code = OutOfRange desc = etcdserver: mvcc: required revision has been compacted"}
Error: etcdserver: mvcc: required revision has been compacted
I also added the following configuration to my etcd config file and restarted the service, but it didn't help:
--auto-compaction-mode=periodic
--auto-compaction-retention=72h
It seems that every time I refresh my screen the number of revisions increases a lot without me doing anything.
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":15322,"raft_term":7}
1 second later
student@desktop:~$ etcdctl --endpoints=127.0.0.1:2380 get /registry/secrets/default/mysql-root-password -w json
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":16412,"raft_term":7}
Has someone faced something like that?
"etcdserver: mvcc: required revision has been compacted." is not an error, it's an expected message when a watch is attempted to be established at a resource version that has already been compacted.
Something is wrong with your etcd cluster, this is almost certainly not an apiserver problem. There's an alarm you might need to disable.
Also remember that you have to update etcd version to 3.0.11 or later -
https://github.com/kubernetes/kubernetes/issues/45506.
Take a look: mvcc-issues, reloader.
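As a rough sketch, checking for and clearing an alarm with etcdctl looks like this (endpoint copied from the question; adjust it if your MicroK8s etcd client port differs):

ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2380 alarm list
ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2380 endpoint status -w table
ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2380 alarm disarm

alarm list shows any active alarms (for example NOSPACE), endpoint status shows the current revision and DB size, and alarm disarm clears whatever alarms are set.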

Azure Service Fabric Cluster Update

I have a cluster in Azure and it failed to update automatically, so I'm trying a manual update. I tried via the portal and it failed, so I kicked off an update using PowerShell, which also failed. The update starts, then just sits at "UpdatingUserConfiguration", and after an hour or so fails with a timeout. I have removed all application types and checked my certs for "NETWORK SERVICE". The cluster is a single node type with 5 VMs, running Windows.
Error
Set-AzureRmServiceFabricUpgradeType : Code: ClusterUpgradeFailed,
Message: Cluster upgrade failed. Reason Code: 'UpgradeDomainTimeout',
Upgrade Progress:
'{"upgradeDescription":{"targetCodeVersion":"6.0.219.9494","targetConfigVersion":"1","upgradePolicyDescription":{"upgradeMode":"UnmonitoredAuto","forceRestart":false,"upgradeReplicaSetCheckTimeout":"37201.09:59:01","kind":"Rolling"}},"targetCodeVersion":"6.0.219.9494","targetConfigVersion":"1","upgradeState":"RollingBackCompleted","upgradeDomains":[{"name":"1","state":"Completed"},{"name":"2","state":"Completed"},{"name":"3","state":"Completed"},{"name":"4","state":"Completed"}],"rollingUpgradeMode":"UnmonitoredAuto","upgradeDuration":"02:02:07","currentUpgradeDomainDuration":"00:00:00","unhealthyEvaluations":[],"currentUpgradeDomainProgress":{"upgradeDomainName":"","nodeProgressList":[]},"startTimestampUtc":"2018-05-17T03:13:16.4152077Z","failureTimestampUtc":"2018-05-17T05:13:23.574452Z","failureReason":"UpgradeDomainTimeout","upgradeDomainProgressAtFailure":{"upgradeDomainName":"1","nodeProgressList":[{"nodeName":"_mstarsf10_1","upgradePhase":"PreUpgradeSafetyCheck","pendingSafetyChecks":[{"kind":"EnsureSeedNodeQuorum"}]}]}}'.
Any ideas on what I can do about an "EnsureSeedNodeQuorum" error?
The root cause was having only 3 seed nodes in the cluster, a result of the cluster being built with a VM scale set that had "overprovision" set to true. Lesson learned: remember to set "overprovision" to false.
I ended up deleting the cluster and scale set and recreating them using my stored ARM template.
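For anyone recreating a Service Fabric scale set from an ARM template, the setting in question lives directly under the scale set's properties. A trimmed, illustrative fragment (the name parameter and apiVersion are placeholders, and other required properties are omitted):

{
  "type": "Microsoft.Compute/virtualMachineScaleSets",
  "apiVersion": "2018-06-01",
  "name": "[parameters('vmNodeType0Name')]",
  "properties": {
    "overprovision": false
  }
}

With overprovisioning disabled, the scale set never creates and then deletes extra VMs, which avoids seed nodes landing on instances that later disappear.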

how to update cluster status in dataproc

I changed my initialization script after creating a cluster with 2 worker nodes for Spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to run apt-get update before apt-get install, so Dataproc reported an error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes again, it doesn't work anymore, giving the following message:
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes are still added, but they don't seem to be detected by a running spark application at first because no more executors are added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors are added. So it seems everything is working fine again except that the cluster's status is still ERROR, and I cannot increase or decrease the number of worker nodes anymore.
How can I update the cluster status back to normal (RUNNING)?
In your case, ERROR indicates that the workflow to reconfigure the cluster has failed and Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfigure attempt will succeed, so further updates are disallowed. You can, however, still submit jobs.
Your best bet is to delete it and start over.
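A sketch of what "delete and start over" looks like with gcloud; the cluster name comes from the error message above, while the region, worker count and bucket path for the fixed init script are placeholder assumptions:

gcloud dataproc clusters delete cluster-1 --region=us-central1

gcloud dataproc clusters create cluster-1 \
    --region=us-central1 \
    --num-workers=2 \
    --initialization-actions=gs://my-bucket/fixed-init-script.sh

Recreating the cluster also re-runs the corrected initialization script (the one that now calls apt-get update before apt-get install) on every node, so the new workers come up consistently.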