RKE2 VIP cluster not responding when only 1 master is available - kubernetes

Node IP        Role        OS
192.x.x.11     Master 1    RHEL8
192.x.x.12     Master 2    RHEL8
192.x.x.13     Master 3    RHEL8
192.x.x.16     VIP         -

Use-Cases

No. of Masters Ready/Running    Expected                                                    Actual
3 Masters                       Ingress created with VIP IP and ping to VIP should work    VIP is working
2 Masters                       Ingress created with VIP IP and ping to VIP should work    VIP is working
1 Master                        Ingress created with VIP IP and ping to VIP should work    VIP is not working, kubectl is not responding
I have created an RKE2 HA cluster with kube-vip. The cluster works fine as long as at least 2 masters are Running, but I want to test a use case where only 1 master is available: the VIP should still respond to ping, and any ingress created with the VIP address should still work.
In my case, when 2 masters are down I see an issue with the kube-vip-ds pod. When I check its logs with crictl I get the errors below. Can someone suggest how to resolve this?
E0412 12:32:20.733320 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: etcdserver: request timed out
E0412 12:32:20.733715 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: etcdserver: request timed out
E0412 12:32:25.812202 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
E0412 12:32:25.830219 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
E0412 12:33:27.204128 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-cp-lock)
E0412 12:33:27.504957 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-svcs-lock)
E0412 12:34:29.346104 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-cp-lock)
E0412 12:34:29.354454 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-svcs-lock)
Thanks.

Check whether you are running a stacked etcd datastore as part of your Kubernetes cluster (embedded etcd on the control-plane nodes is the RKE2 default).
etcd needs a quorum of (n/2)+1 members to serve requests; with 3 members the quorum is 2, so the cluster tolerates only one failure. In your case, with 2 masters down, etcd has lost quorum, so the API server stops responding, kube-vip cannot renew its leader lease, and the cluster is non-operational.
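For reference, here is a minimal sketch of how you could confirm the quorum loss directly on the surviving master (the API server is down, so kubectl won't help). The container name filter and certificate paths are assumptions for a default RKE2 install and may need adjusting for your environment:
# Find the local etcd container and ask etcdctl for cluster health.
ETCD_CTR=$(crictl ps --name etcd -q | head -n1)
crictl exec "$ETCD_CTR" etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health --cluster
# With only 1 of 3 members reachable, quorum (2) is lost: etcd refuses writes,
# the apiserver times out, and kube-vip cannot renew its lease, which matches
# the "etcdserver: request timed out" lines in the log above.
If the single-surviving-master scenario really has to keep working, the usual options are an external datastore or manually rebuilding etcd around the surviving node (RKE2 has a --cluster-reset server flag for that kind of recovery, if I recall correctly), but that is a manual recovery step, not automatic failover.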

Related

Getting readiness probe failures due to connection refused in tidb pods only in tidb cluster

We have been running a TiDB cluster in Kubernetes, and it has been working fine. But suddenly I am getting the following issue, only in the new StatefulSet pod tidb-tidb-1, after scaling the tidb-tidb StatefulSet. Interestingly, tidb-tidb-2 is running, and all the other PD and TiKV pods are also running fine. I have checked that the PD URL is not reachable from the problematic pod, but it is fine from the other pods. Can you please help me solve this issue?
tidb-tidb-1 logs:
[2021/04/11 16:15:44.526 +00:00] [WARN] [base_client.go:180] ["[pd] failed to get cluster id"]
[2021/04/11 16:15:48.527 +00:00] [WARN] [base_client.go:180] ["[pd] failed to get cluster id"] [error="[PD:client:ErrClientGetMember]error:rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: i/o timeout\" target:test-tidb-pd:2379 status:CONNECTING
Could you please show the namespace information?
kubectl get all -n <namespace> -o wide
Please also check the node information:
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-isolation-restriction
And please check the network: can the two nodes ping each other successfully? The error suggests a connectivity problem:
transport: Error while dialing dial tcp: i/o timeout
You can also test reachability from the pods directly, as in the sketch below.
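A minimal sketch of that connectivity check, assuming the PD service name from your logs (test-tidb-pd), a container named tidb, and that curl is available in the image; substitute your own namespace:
# From the problematic pod:
kubectl -n <namespace> exec tidb-tidb-1 -c tidb -- curl -sv --max-time 5 http://test-tidb-pd:2379/pd/api/v1/members
# From a working pod, for comparison:
kubectl -n <namespace> exec tidb-tidb-2 -c tidb -- curl -sv --max-time 5 http://test-tidb-pd:2379/pd/api/v1/members
# A timeout from tidb-tidb-1 but a member list from tidb-tidb-2 points at
# node-level networking (CNI, node routes, or a NetworkPolicy) rather than PD itself.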

Error instantiating chaincode in Hyperledger Fabric 1.4 over AKS kubernetes

I am trying to instantiate the "sacc" chaincode (which comes with the fabric samples) in a Hyperledger Fabric network deployed on Kubernetes in AKS. After hours of trying different adjustments, I've not been able to finish the task. I'm always getting the error:
Error: could not assemble transaction, err proposal response was not
successful, error code 500, msg timeout expired while starting
chaincode sacc:0.1 for transaction
Please note that there is no transaction ID in the error (I've googled some similar cases, but in all of them there was an ID for the transaction; not in my case, even though the error is the same).
The message in the orderer:
2019-07-23 12:40:13.649 UTC [orderer.common.broadcast] Handle -> WARN
047 Error reading from 10.1.0.45:52550: rpc error: code = Canceled
desc = context canceled 2019-07-23 12:40:13.649 UTC [comm.grpc.server]
1 -> INFO 048 streaming call completed {"grpc.start_time":
"2019-07-23T12:34:13.591Z", "grpc.service": "orderer.AtomicBroadcast",
"grpc.method": "Broadcast", "grpc.peer_address": "10.1.0.45:52550",
"error": "rpc error: code = Canceled desc = context canceled",
"grpc.code": "Canceled", "grpc.call_duration": "6m0.057953469s"}
I am calling the instantiation from inside a CLI peer container, defining the variables CORE_PEER_LOCALMSPID, CORE_PEER_TLS_ROOTCERT_FILE, CORE_PEER_MSPCONFIGPATH, CORE_PEER_ADDRESS and ORDERER_CA with appropriate values before issuing the instantiation call:
peer chaincode instantiate -o <orderer-address> -n sacc -v 0.1 -c '{"Args":["init","hi","1"]}' -C mychannelname --tls 'true' --cafile $ORDERER_CA
All the peers have declared a dockersocket volume pointing to
/run/docker.sock
All the peers have declared the variable CORE_VM_ENDPOINT to unix:///host/var/run/docker.sock
All the orgs were joined to the channel
All the peers have the chaincode installed
I can't see any further message/error in the CLI, nor in the orderer, nor in the peers involved in the channel.
Any ideas on what could be going wrong, or how I could continue troubleshooting the problem? Is it possible to see the logs of the Docker container that is being created by the peers? How?
Thanks
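On the last question (seeing the logs of the container the peer creates): because the peers mount the node's Docker socket, the chaincode build/launch containers are created on the AKS node itself, not as Kubernetes pods, so kubectl logs will not show them. A hedged sketch, run on the node hosting the peer; the dev- prefix is the Fabric naming convention, and the exact container name below is illustrative:
# List chaincode containers created through the mounted Docker socket.
docker ps -a --filter "name=dev-"
# Inspect the logs of the sacc chaincode container (name is illustrative).
docker logs dev-peer0-org1-sacc-0.1
# If no dev-* container ever appears, the peer never reached the Docker
# daemon (re-check CORE_VM_ENDPOINT and the docker.sock hostPath mount),
# which commonly ends in exactly this "timeout expired while starting
# chaincode" error.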

RabbitMQ Generic server rabbit_connection_tracking terminating

I have a RabbitMQ cluster running on Kubernetes.
I found this error in one of the RabbitMQ pods:
[error] <0.13899.4> ** Generic server rabbit_connection_tracking terminating
** Last message in was {'$gen_cast',{connection_closed,[{name,<<"10.182.60.4:52930 -> 10.182.60.93:5672">>},{pid,<0.13926.4>},{node,'rabbit#10.182.60.93'},{client_properties,[]}]}}
** When Server state == nostate
** Reason for termination ==
** {{aborted,{no_exists,['tracked_connection_on_node_rabbit#10.182.60.93',{'rabbit#10.182.60.93',<<"10.182.60.4:52930 -> 10.182.60.93:5672">>}]}},[{mnesia,abort,1,[{file,"mnesia.erl"},{line,355}]},{rabbit_connection_tracking,unregister_connection,1,[{file,"src/rabbit_connection_tracking.erl"},{line,282}]},{rabbit_connection_tracking,handle_cast,2,[{file,"src/rabbit_connection_tracking.erl"},{line,101}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
It occurs in this pod only.
Where:
Kubernetes node count: 2
RabbitMQ pod count: 3
10.182.60.4 is one of the K8s nodes
10.182.60.93 is the RabbitMQ pod that has the error
The error doesn't affect the flow of messages, but I can't understand what it means, or why it occurs in this pod only.
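The termination reason ({aborted,{no_exists,[...]}}) means mnesia aborted because the connection-tracking table for that node did not exist when the connection_closed event arrived. A hedged way to compare state across the members (the pod name is a placeholder; use the pod backing rabbit@10.182.60.93):
# Check cluster membership as seen from the affected member.
kubectl exec <rabbitmq-pod> -- rabbitmqctl cluster_status
# List the mnesia tables on this node and look for tracked_connection_on_node_...
kubectl exec <rabbitmq-pod> -- rabbitmqctl eval 'mnesia:system_info(tables).'
# If the tracked_connection_* tables are missing only on this member, that is
# consistent with the error being local to this pod while message flow is unaffected.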

Kubernetes kubelet error updating node status

I'm running a Kubernetes cluster in AWS via EKS. Everything appears to be working as expected, but I'm checking through all the logs to verify. I hopped onto one of the worker nodes and noticed a bunch of errors when looking at the kubelet service:
Oct 09 09:42:52 ip-172-26-0-213.ec2.internal kubelet[4226]: E1009 09:42:52.335445 4226 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-172-26-0-213.ec2.internal": Unauthorized
Oct 09 10:03:54 ip-172-26-0-213.ec2.internal kubelet[4226]: E1009 10:03:54.831820 4226 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-172-26-0-213.ec2.internal": Unauthorized
Nodes are all showing as Ready, but I'm not sure why those errors are appearing. I have 3 worker nodes and all 3 show the same kubelet errors (the hostnames are different, obviously).
Additional information: it would appear that the error is coming from this line in kubelet_node_status.go:
node, err := kl.heartbeatClient.CoreV1().Nodes().Get(string(kl.nodeName), opts)
if err != nil {
    return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
}
From the workers I can execute get nodes using kubectl just fine:
kubectl get --kubeconfig=/var/lib/kubelet/kubeconfig nodes
NAME STATUS ROLES AGE VERSION
ip-172-26-0-58.ec2.internal Ready <none> 1h v1.10.3
ip-172-26-1-193.ec2.internal Ready <none> 1h v1.10.3
Turns out this is not an issue. Official reply from AWS regarding these errors:
The kubelet will regularly report node status to the Kubernetes API. When it does so, it needs an authentication token generated by the aws-iam-authenticator. The kubelet will invoke the aws-iam-authenticator and store the token in its global cache. In EKS this authentication token expires after 21 minutes.
The kubelet doesn't understand token expiry times, so it will attempt to reach the API using the token in its cache. When the API returns the Unauthorized response, there is a retry mechanism to fetch a new token from the aws-iam-authenticator and retry the request.
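In other words, the Unauthorized lines are just the cached token expiring and being refreshed. If you want to verify that yourself, a small sketch (the kubeconfig path is the EKS default used above; the cluster name is a placeholder):
# Confirm the kubelet kubeconfig uses the aws-iam-authenticator exec plugin.
grep -A6 'exec:' /var/lib/kubelet/kubeconfig
# Generate a token by hand; the ExecCredential JSON includes an
# expirationTimestamp, after which the kubelet has to fetch a new one.
aws-iam-authenticator token -i <cluster-name>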

Kubernetes Replication Controller Integration Test Failure

I am seeing the following Kubernetes integration tests fail pretty consistently, about 90% of the time, on RHEL 7.2, Fedora 24, and CentOS 7.1:
test/integration/garbagecollector
test/integration/replicationcontroller
They seem to be due to an etcd failure. My online queries lead me to believe this may also encompass an apiserver issue. My setup is simple: I install/start Docker, install Go, clone the kubernetes repo from GitHub, use hack/install-etcd.sh from the repo and add it to PATH, get ginkgo, gomega and go-bindata, then run 'make test-integration'. I don't manually change anything or add any custom files/configs. Has anyone run into these issues and know of a solution? The only mention of this issue I have seen online was deemed a flake and has no listed solution, but I run into it on almost every single test run. Pieces of the error are below; I can give more if needed:
Garbage Collector:
*many lines from garbagecollector.go that look good*
I0920 14:42:39.725768 11823 garbagecollector.go:479] create storage for resource { v1 secrets}
I0920 14:42:39.725786 11823 garbagecollector.go:479] create storage for resource { v1 serviceaccounts}
I0920 14:42:39.725803 11823 garbagecollector.go:479] create storage for resource { v1 services}
I0920 14:43:09.565529 11823 trace.go:61] Trace "List *rbac.ClusterRoleList" (started 2016-09-20 14:42:39.565113203 -0400 EDT):
[2.564µs] [2.564µs] About to list etcd node
[30.000353492s] [30.000350928s] Etcd node listed
[30.000361771s] [8.279µs] END
E0920 14:43:09.566770 11823 cacher.go:258] unexpected ListAndWatch error: pkg/storage/cacher.go:198: Failed to list *rbac.RoleBinding: client: etcd cluster is unavailable or misconfigured
*repeats over and over, with a different resource failing to list*
Replication Controller:
I0920 14:35:16.907283 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907293 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907298 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907303 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907307 10482 replication_controller.go:481] replication controller worker shutting down
E0920 14:35:16.948417 10482 util.go:45] Metric for replication_controller already registered
--- FAIL: TestUpdateLabelToBeAdopted (30.07s)
replicationcontroller_test.go:270: Failed to create replication controller rc: Timeout: request did not complete within allowed duration
E0920 14:44:06.820506 12053 storage_rbac.go:116] unable to initialize clusterroles: client: etcd cluster is unavailable or misconfigured
There are no files in /var/log that even start with kube.
Thanks in advance!
I increased the limit on the number of file descriptors and haven't seen this issue since, so I'm going to go ahead and call this solved.
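For anyone hitting the same thing, a rough sketch of that fix; the numbers are illustrative, not a recommendation:
ulimit -n           # show the current soft limit (often 1024 on RHEL/CentOS)
ulimit -n 65536     # raise it for the current shell before running the tests
make test-integration
# To make the change persistent, add matching "soft nofile" / "hard nofile"
# entries under /etc/security/limits.d/ (or in /etc/security/limits.conf).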