Error restoring Rancher: This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready

I am trying to back up and restore a Rancher server (single-node install), as described here.
After taking the backup, I shut down the Rancher server node, started a new Rancher container on a new node (same network, but a different IP address), and restored from the backup file.
After restoring, I logged in to the Rancher UI, and it showed the error in the title.
I then checked the Rancher server logs, which showed the following:
2019-10-05 16:41:32.197641 I | http: TLS handshake error from 127.0.0.1:38388: EOF
2019-10-05 16:41:32.202442 I | http: TLS handshake error from 127.0.0.1:38380: EOF
2019-10-05 16:41:32.210378 I | http: TLS handshake error from 127.0.0.1:38376: EOF
2019-10-05 16:41:32.211106 I | http: TLS handshake error from 127.0.0.1:38386: EOF
2019/10/05 16:42:26 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:44:34 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:48:50 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019-10-05 16:50:19.114475 I | mvcc: store.index: compact 75951
2019-10-05 16:50:19.137825 I | mvcc: finished scheduled compaction at 75951 (took 22.527694ms)
2019-10-05 16:55:19.120803 I | mvcc: store.index: compact 76282
2019-10-05 16:55:19.124813 I | mvcc: finished scheduled compaction at 76282 (took 2.746382ms)
After that, I checked the logs on the master nodes and found that the Rancher agent still tries to connect to the old Rancher server (the old IP address) instead of the new one, which makes the cluster unavailable.
How can I fix this?

You need to re-register the node in Rancher using the following steps.
1) Update the server-url setting in Rancher by going to Global -> Settings -> server-url
This should be the full URL, including https://
2) Then use this script to re-register the node in Rancher: https://github.com/mattmattox/cluster-agent-tool
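Before saving the new server-url, it may help to sanity-check that the value really is a full https:// URL. The following is only a minimal sketch: the function name is_full_https_url and the example hostname are hypothetical, and Rancher itself may apply different or stricter validation.

```python
from urllib.parse import urlparse


def is_full_https_url(value):
    """Return True if value is an absolute https:// URL that names a host."""
    parsed = urlparse(value.strip())
    # A bare IP address or hostname has no scheme, so it is rejected here.
    return parsed.scheme == "https" and bool(parsed.netloc)
```

For example, is_full_https_url("https://rancher.example.com") is true, while a bare address such as "192.168.94.154" is rejected because the scheme is missing.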

Hyperledger fabric chaincode connection with peer getting dropped

I have a Hyperledger Fabric network, version 2.4.4, running on Kubernetes. The peers and other components run behind an Istio ingress. The chaincode runs in a dind (docker-in-docker) container and connects to the peer through its URL. The problem is that the chaincode connection is dropped after a few minutes. Below are the logs:
2022-07-14T04:31:13.057Z info [c-api:lib/handler.js] [assetschannel-ddc183b4] Calling chaincode Invoke() succeeded. Sending COMPLETED message back to peer
2022-07-14T04:33:04.197Z error [c-api:lib/handler.js] Chat stream with peer - on error: %j "Error: 14 UNAVAILABLE: Connection dropped\n at Object.callErrorFromStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/call.js:31:26)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client.js:391:49)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)\n at /usr/local/src/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78\n at processTicksAndRejections (node:internal/process/task_queues:78:11)"
I did set the following environment variables in the peer pod to keep the connection alive:
CORE_CHAINCODE_KEEPALIVE: 60000
CORE_PEER_KEEPALIVE_CLIENT_INTERVAL: 600s
CORE_PEER_KEEPALIVE_CLIENT_TIMEOUT: 2s
CORE_PEER_KEEPALIVE_DELIVERYCLIENT_INTERVAL: 20s
CORE_PEER_KEEPALIVE_MININTERVAL: 15s
but this did not resolve the issue.
Any suggestions would be appreciated.
It appears to be an issue with the AWS ELB. Its idle timeout was set to 60s, which broke the connection between the chaincode and the peer whenever there was no traffic between them. Increasing the idle timeout fixed the issue.
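The arithmetic behind this can be sketched as a comparison between a keepalive interval and the load balancer's idle timeout. This is a rough illustration only: the helper names are hypothetical, the parser handles just simple values such as 600s or 15m, and whether a particular keepalive ping actually resets the ELB idle timer is an assumption that the thread does not confirm.

```python
import re

# Seconds per unit for the simple duration strings used in the settings above.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Convert a simple duration such as '600s' or '15m' to seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


def pings_before_idle_timeout(keepalive_interval, idle_timeout):
    """True if a keepalive would be sent before the idle timeout expires."""
    return parse_duration(keepalive_interval) < parse_duration(idle_timeout)
```

With the values from the question, pings_before_idle_timeout("600s", "60s") is false: a 600s client interval leaves the connection silent for far longer than the 60s idle timeout, which is consistent with the drops described.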

OC Cluster never goes up Error: timed out waiting for the condition

Whenever I try to bring the cluster up using "oc cluster up", I get the error below.
Kindly help me fix this.
[mano@mano ~]$ oc cluster up
Getting a Docker client ...
Checking if image openshift/origin-control-plane:v3.11 is available ...
Checking type of volume mount ...
Determining server IP ...
Checking if OpenShift is already running ...
Checking for supported Docker version (=>1.22) ...
Checking if insecured registry is configured properly in Docker ...
Checking if required ports are available ...
Checking if OpenShift client is configured properly ...
Checking if image openshift/origin-control-plane:v3.11 is available ...
Starting OpenShift using openshift/origin-control-plane:v3.11 ...
I0923 13:40:32.364326 15396 config.go:40] Running "create-master-config"
I0923 13:40:59.938492 15396 config.go:46] Running "create-node-config"
I0923 13:41:10.721711 15396 flags.go:30] Running "create-kubelet-flags"
I0923 13:41:18.241285 15396 run_kubelet.go:49] Running "start-kubelet"
I0923 13:41:23.016238 15396 run_self_hosted.go:181] Waiting for the kube-apiserver to be ready ...
E0923 13:46:23.023479 15396 run_self_hosted.go:571] API server error: Get https://127.0.0.1:8443/healthz?timeout=32s: dial tcp 127.0.0.1:8443: connect: connection refused ()
Error: timed out waiting for the condition
OC version
[mano@mano ~]$ oc version
oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
I followed this article: https://github.com/openshift/origin/blob/release-3.11/docs/cluster_up_down.md
but still had no luck.

Minikube is slow and unresponsive

Today, for no apparent reason, minikube is taking a very long time to respond to commands sent via kubectl.
Occasionally it even fails:
kubectl get pods
Unable to connect to the server: net/http: TLS handshake timeout
How can I diagnose this?
Some logs from minikube logs:
==> kube-scheduler <==
I0527 14:16:55.809859 1 serving.go:319] Generated self-signed cert in-memory
W0527 14:16:56.256478 1 authentication.go:387] failed to read in-cluster kubeconfig for delegated authentication: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0527 14:16:56.256856 1 authentication.go:249] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0527 14:16:56.257077 1 authentication.go:252] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0527 14:16:56.257189 1 authorization.go:177] failed to read in-cluster kubeconfig for delegated authorization: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
W0527 14:16:56.257307 1 authorization.go:146] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0527 14:16:56.264875 1 server.go:142] Version: v1.14.1
I0527 14:16:56.265228 1 defaults.go:87] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W0527 14:16:56.286959 1 authorization.go:47] Authorization is disabled
W0527 14:16:56.286982 1 authentication.go:55] Authentication is disabled
I0527 14:16:56.286995 1 deprecated_insecure_serving.go:49] Serving healthz insecurely on [::]:10251
I0527 14:16:56.287397 1 secure_serving.go:116] Serving securely on 127.0.0.1:10259
I0527 14:16:57.417028 1 controller_utils.go:1027] Waiting for caches to sync for scheduler controller
I0527 14:16:57.524378 1 controller_utils.go:1034] Caches are synced for scheduler controller
I0527 14:16:57.827438 1 leaderelection.go:217] attempting to acquire leader lease kube-system/kube-scheduler...
E0527 14:17:10.865448 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-scheduler: Get https://localhost:8443/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0527 14:17:43.418910 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-scheduler: Get https://localhost:8443/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I0527 14:18:01.447065 1 leaderelection.go:227] successfully acquired lease kube-system/kube-scheduler
I0527 14:18:29.044544 1 leaderelection.go:263] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0527 14:18:38.999295 1 server.go:252] lost master
E0527 14:18:39.204637 1 leaderelection.go:306] error retrieving resource lock kube-system/kube-scheduler: Get https://localhost:8443/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
lost lease
Update:
To work around this issue I just did a minikube delete and minikube start, and the performance issue was resolved.
As a solution has been found, I am posting this as Community Wiki for future users.
1) Debug issues with minikube by adding the --v flag to set the log level (0, 1, 2, 3, 7).
For example: minikube start --v=1 (the minikube debugging documentation lists 0 as INFO, 1 as WARNING, 2 as ERROR, and 3 and 7 as libmachine logging and libmachine debug logging).
More detailed information here
2) Use the logs command: minikube logs
3) Because Minikube runs in a virtual machine, it is sometimes better to delete minikube and start it again (this helped in this case).
minikube delete
minikube start
4) It might get slow due to a lack of resources.
By default, Minikube uses 2048MB of memory and 2 CPUs. More details about this can be found here
In addition, you can make Minikube allocate more resources with the command
minikube start --cpus 4 --memory 8192
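To get a quick overview of what minikube logs contains, the output can be summarised by severity. The sketch below assumes the klog-style header seen in the scheduler log above (a severity letter I, W, E or F followed by the date and time); the function names are hypothetical and the log text is expected to have been saved to a string first.

```python
import re
from collections import Counter

# Matches a klog-style line header such as "E0527 14:17:10.865448".
_KLOG_HEADER = re.compile(r"^([IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d+")


def severity_counts(log_text):
    """Count log lines per severity letter (I, W, E, F)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _KLOG_HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def api_timeout_lines(log_text):
    """Return the lines that report a client timeout towards the API server."""
    return [
        line
        for line in log_text.splitlines()
        if _KLOG_HEADER.match(line) and "Client.Timeout exceeded" in line
    ]
```

A growing number of lines returned by api_timeout_lines would suggest that the API server itself is slow to answer, which fits the resource explanation in step 4, although it does not prove it.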

Scaling-Master / Unable to perform initial IP allocation check

I started with 3 master nodes and increased the count to 5. I am trying to add the new members to the existing cluster. My apiserver container stops working with the following error:
E1106 20:44:18.977854 1 cacher.go:274] unexpected ListAndWatch error: k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/storage/cacher.go:215: Failed to list *storage.StorageClass: client: etcd cluster is unavailable or misconfigured
I1106 20:44:19.043807 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52142: EOF
I1106 20:44:19.072129 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52148: EOF
I1106 20:44:19.084461 1 logs.go:41] http: TLS handshake error from 10.0.118.9:52150: EOF
F1106 20:44:19.103677 1 controller.go:128] Unable to perform initial IP allocation check: unable to refresh the service IP block: client: etcd cluster is unavailable or misconfigured
From the already working master nodes I can see the new member:
azureuser@k8s-master-50639053-0:~$ etcdctl member list
99673c60d6c07e0e: name=k8s-master-50639053-2 peerURLs=http://10.0.118.7:2380 clientURLs=
b130aa7583380f88: name=k8s-master-50639053-3 peerURLs=http://10.0.118.8:2380 clientURLs=
b4b196cc0c9fca4a: name=k8s-master-50639053-1 peerURLs=http://10.0.118.6:2380 clientURLs=
c264b3b67880db3f: name=k8s-master-50639053-0 peerURLs=http://10.0.118.5:2380 clientURLs=
e6e511de7d665829: name=k8s-master-50639053-4 peerURLs=http://10.0.118.9:2380 clientURLs=
If I check the cluster health, I get:
azureuser@k8s-master-50639053-0:~$ etcdctl cluster-health
member 99673c60d6c07e0e is healthy: got healthy result from http://10.0.118.7:2379
member b4b196cc0c9fca4a is healthy: got healthy result from http://10.0.118.6:2379
member c264b3b67880db3f is healthy: got healthy result from http://10.0.118.5:2379
member fd36b7acc85d92b8 is unhealthy: got unhealthy result from http://10.0.118.9:2379
cluster is healthy
It works if I stop the etcd service on the new master node and run etcd manually:
sudo etcd --listen-client-urls http://10.0.118.9:2379 --advertise-client-urls http://10.0.118.9:2379 --listen-peer-urls http://10.0.118.9:2380
Could someone help me?
Thanks.
Update: According to GitHub, it's due to certificates, and it's not currently supported by acs-engine.
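When comparing the two etcdctl outputs by eye it is easy to miss a member, so a small script can cross-check the IDs. This is a sketch under the assumption that the output uses the etcdctl v2 text format quoted above; the function names are hypothetical.

```python
import re


def listed_member_ids(member_list_output):
    """Member IDs from `etcdctl member list` lines of the form '<id>: name=...'."""
    return set(re.findall(r"^([0-9a-f]{16}):", member_list_output, flags=re.M))


def reported_health(cluster_health_output):
    """Map member ID to True (healthy) or False (unhealthy)."""
    pairs = re.findall(
        r"^member ([0-9a-f]{16}) is (healthy|unhealthy)",
        cluster_health_output,
        flags=re.M,
    )
    return {member_id: state == "healthy" for member_id, state in pairs}


def compare(member_list_output, cluster_health_output):
    """Return (listed but missing from health check, in health check but not listed)."""
    listed = listed_member_ids(member_list_output)
    reported = set(reported_health(cluster_health_output))
    return listed - reported, reported - listed
```

Applied to the output in the question, this reports b130aa7583380f88 and e6e511de7d665829 as listed but absent from the health check, and fd36b7acc85d92b8 as health-checked but not listed. That may only mean the two commands were run at different points during the membership change, but it is worth confirming.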

mesos masters keep restarting

I have 3 Mesos masters, version 0.26.0, set up with a quorum of 2. When I start them, they keep restarting, even before I bring up any frameworks or slaves.
Here are the errors I'm seeing:
F0322 19:36:56.009903 51459 master.cpp:1368] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
E0322 19:37:18.300568 41095 process.cpp:1911] Failed to shutdown socket with fd 26: Transport endpoint is not connected
There's no firewall running.
I start them with supervisord and the following command:
/usr/sbin/mesos-master --cluster=int --log_dir=/var/log/mesos/int --quorum=2 --port=5050 --work_dir=/tmp/mesos/work/int --zk=zk://intMesosMaster01:2181,intMesosMaster02:2181,intMesosMaster03:2181/mesos
Zookeeper is up and running fine with 3 nodes. It's in use for other projects and has no issues at all with them.
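One thing that might be worth ruling out, although nothing in the question confirms it as the cause, is whether each master can open a TCP connection to the others on the master port (5050 in the command above). A minimal sketch follows; the function name can_connect is hypothetical, and it only tests TCP reachability, not whether Mesos is advertising the right address.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run from each master, a call such as can_connect("intMesosMaster02", 5050) would show whether the peers are reachable. This assumes the masters run on the same hosts as the ZooKeeper nodes named in --zk, which the question does not state.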