RabbitMQ Generic server rabbit_connection_tracking terminating - kubernetes

I have a RabbitMQ cluster running on Kubernetes.
I found this error in one of RabbitMQ pods:
[error] <0.13899.4> ** Generic server rabbit_connection_tracking terminating
** Last message in was {'$gen_cast',{connection_closed,[{name,<<"10.182.60.4:52930 -> 10.182.60.93:5672">>},{pid,<0.13926.4>},{node,'rabbit#10.182.60.93'},{client_properties,[]}]}}
** When Server state == nostate
** Reason for termination ==
** {{aborted,{no_exists,['tracked_connection_on_node_rabbit#10.182.60.93',{'rabbit#10.182.60.93',<<"10.182.60.4:52930 -> 10.182.60.93:5672">>}]}},[{mnesia,abort,1,[{file,"mnesia.erl"},{line,355}]},{rabbit_connection_tracking,unregister_connection,1,[{file,"src/rabbit_connection_tracking.erl"},{line,282}]},{rabbit_connection_tracking,handle_cast,2,[{file,"src/rabbit_connection_tracking.erl"},{line,101}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
It occurs in this pod only.
Where:
Kubernetes nodes count: 2
Rabbitmq pods count: 3
10.182.60.4 is one of the K8s nodes
10.182.60.93 is the RabbitMQ pod that has the error.
The error doesn't affect the flow of messages, but I can't understand what it means, or why it occurs in this pod only!
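The stack trace says the Mnesia table tracked_connection_on_node_rabbit#10.182.60.93 did not exist when the tracker tried to remove a closed connection from it. A quick way to see which tracking tables actually exist on that node is sketched below (assuming rabbitmqctl is available inside the pod; the pod name is a placeholder):
# list all Mnesia tables known to this node
kubectl exec <rabbitmq-pod-name> -- rabbitmqctl eval 'mnesia:system_info(tables).'
# list the connections the broker currently sees, with the node they belong to
kubectl exec <rabbitmq-pod-name> -- rabbitmqctl list_connections name node
If the per-node tracked_connection_on_node_... table is missing only on this node, that would be consistent with the error showing up in this pod only.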

Related

How to install Eclipse-che in azure Kubernetes cluster

I'm trying to install Eclipse Che by following this blog: https://che.eclipseprojects.io/2022/07/25/#karatkep-installing-eclipse-che-on-aks.html,
yet after following all the steps I'm not able to install Eclipse Che successfully.
1)
After running this command:
kubectl logs -l app.kubernetes.io/component=che-operator -n eclipse-che -f
these are the errors I'm facing:
logs: Waited for 1.034843163s due to client-side throttling, not priority and fairness, request: GET:https://10.1.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
time="2022-09-12T14:08:29Z" level=info msg="Successfully reconciled."
2) The che-gateway pod is failing:
che-gateway-7d54ccdd59-bblw6 3/4 CrashLoopBackOff 18 (2m51s ago) 70m
Description: the oauth-proxy container keeps failing (CrashLoopBackOff).
Logs of the oauth-proxy container:
invalid configuration:
missing setting: login-url
missing setting: redeem-url
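The two missing settings are the oauth-proxy's identity-provider endpoints, which the operator normally fills in from the CheCluster custom resource; if the OIDC/auth values from the blog were not applied, the proxy starts with no login-url or redeem-url. As a first check, a sketch (field and container names may differ between Che versions; the pod name is the one from the question) is to look at the auth section of the CR and at the arguments the operator passed to the oauth-proxy container:
# show the auth-related settings the operator is working from
kubectl get checluster -n eclipse-che -o yaml | grep -iA10 'auth:'
# show how the oauth-proxy container in the gateway pod was configured
kubectl -n eclipse-che get pod che-gateway-7d54ccdd59-bblw6 -o yaml | grep -A15 'name: oauth-proxy'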

RKE2 VIP cluster not responding when only 1 master is available

Node IP     | Role     | OS
192.x.x.11  | Master 1 | RHEL8
192.x.x.12  | Master 2 | RHEL8
192.x.x.13  | Master 3 | RHEL8
192.x.x.16  | VIP      | -
Use-Cases:
No of Masters Ready/Running | Expected                                                 | Actual
3 Masters                   | Ingress created with VIP IP and ping to VIP should work | VIP is working
2 Masters                   | Ingress created with VIP IP and ping to VIP should work | VIP is working
1 Master                    | Ingress created with VIP IP and ping to VIP should work | VIP is not working, kubectl is not responding
I have created an RKE2 HA cluster with kube-vip, and the cluster works fine only when at least 2 masters are running. I want to test a use case where only 1 master is available: the VIP should still respond to ping and any ingress created with the VIP address should work.
In my case, when 2 masters are down I'm facing an issue with the kube-vip-ds pod. When I check the logs using the crictl command I get the errors below; can someone suggest how to resolve this issue?
E0412 12:32:20.733320 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: etcdserver: request timed out
E0412 12:32:20.733715 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: etcdserver: request timed out
E0412 12:32:25.812202 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
E0412 12:32:25.830219 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
E0412 12:33:27.204128 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-cp-lock)
E0412 12:33:27.504957 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-svcs-lock)
E0412 12:34:29.346104 1 leaderelection.go:322] error retrieving resource lock kube-system/plndr-cp-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-cp-lock)
E0412 12:34:29.354454 1 leaderelection.go:325] error retrieving resource lock kube-system/plndr-svcs-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io plndr-svcs-lock)
Thanks.
Kindly check whether you are running a stacked etcd datastore as part of your k8s cluster.
etcd needs a quorum of (n/2)+1 members to stay writable, so a 3-member cluster needs at least 2 masters running and tolerates only one failure. In your case, with 2 masters down, quorum is lost and the cluster is non-operational.
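One way to confirm that etcd, rather than kube-vip itself, is the blocker is to query etcd directly on the surviving master. A sketch, assuming a default RKE2 layout where etcd runs as a container with etcdctl available in its image and the client certificates live under /var/lib/rancher/rke2/server/tls/etcd/ (adjust paths if your install differs):
# find the etcd container on the remaining master
crictl ps --name etcd
# ask the local etcd member for its health; with quorum lost this will time out or report unhealthy
crictl exec <etcd-container-id> etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health
If that confirms it, the limitation is etcd quorum, not kube-vip: the 1-master use case can only work with a topology that keeps a datastore quorum (for example a single-server cluster or an external datastore).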

Enable custom kubernetes scheduler for a namespace

I have a k8s Job that brings up multiple pods. This job is used for load testing, so all the pods need to come up at the same time; the job shouldn't start until there are enough nodes for all of its pods to be scheduled.
I came across kube-batch (https://github.com/kubernetes-sigs/kube-batch) to do this kind of scheduling. I have a couple of questions:
1. How can I enable kube-batch for only one namespace in a cluster?
2. I installed kube-batch by following the tutorial, but the pods are failing on startup with the error below. How can I resolve this?
I1204 20:07:55.911393 1 allocate.go:96] Queue <default> is overused, ignore it.
I1204 20:07:55.911399 1 allocate.go:194] Leaving Allocate ...
I1204 20:07:55.911407 1 backfill.go:41] Enter Backfill ...
I1204 20:07:55.911413 1 backfill.go:71] Leaving Backfill ...
E1204 20:07:55.911521 1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework/session.go:368
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/plugins/gang/gang.go:154
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework/framework.go:58
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:102
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:85
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x148 pc=0x10ab979]
I'm not sure what you are trying to achieve is doable. In my opinion, what you can do is modify the pod's Dockerfile to include supervisord, and then in the supervisord configuration specify the commands you want to run when the pod enters the running state, using supervisord's priority setting.
Example
[program:api]
directory=/usr/local
command=go run main.go
priority=100
autostart=true
autorestart=true
stderr_logfile=/var/log/api.err.log
stdout_logfile=/var/log/api.out.log
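Regarding question 1: kube-batch, like any secondary scheduler, is selected per workload rather than per namespace, via spec.schedulerName in the pod template (the kube-batch tutorial also pairs this with a PodGroup for gang scheduling). A minimal sketch, with placeholder names and image, of a Job whose pods are handed to kube-batch; limiting it to one namespace is then a matter of only setting schedulerName on workloads in that namespace:
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: load-test           # placeholder name
  namespace: load-testing   # placeholder namespace
spec:
  parallelism: 10
  completions: 10
  template:
    spec:
      schedulerName: kube-batch   # hand these pods to the custom scheduler instead of default-scheduler
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox            # placeholder image
        command: ["sh", "-c", "sleep 60"]
EOF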

Kubernetes kubelet error updating node status

I'm running a Kubernetes cluster in AWS via EKS. Everything appears to be working as expected, but I'm just checking through all the logs to verify. I hopped onto one of the worker nodes and noticed a bunch of errors when looking at the kubelet service:
Oct 09 09:42:52 ip-172-26-0-213.ec2.internal kubelet[4226]: E1009 09:42:52.335445 4226 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-172-26-0-213.ec2.internal": Unauthorized
Oct 09 10:03:54 ip-172-26-0-213.ec2.internal kubelet[4226]: E1009 10:03:54.831820 4226 kubelet_node_status.go:377] Error updating node status, will retry: error getting node "ip-172-26-0-213.ec2.internal": Unauthorized
Nodes are all showing as Ready, but I'm not sure why those errors are appearing. I have 3 worker nodes and all 3 have the same kubelet errors (the hostnames are different, obviously).
Additional information: it would appear that the error is coming from this line in kubelet_node_status.go:
node, err := kl.heartbeatClient.CoreV1().Nodes().Get(string(kl.nodeName), opts)
if err != nil {
    return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
}
From the workers I can execute get nodes using kubectl just fine:
kubectl get --kubeconfig=/var/lib/kubelet/kubeconfig nodes
NAME                           STATUS    ROLES     AGE    VERSION
ip-172-26-0-58.ec2.internal    Ready     <none>    1h     v1.10.3
ip-172-26-1-193.ec2.internal   Ready     <none>    1h     v1.10.3
Turns out this is not an issue. Official reply from AWS regarding these errors:
The kubelet will regularly report node status to the Kubernetes API. When it does so it needs an authentication token generated by the aws-iam-authenticator. The kubelet will invoke the aws-iam-authenticator and store the token in its global cache. In EKS this authentication token expires after 21 minutes.
The kubelet doesn't understand token expiry times, so it will attempt to reach the API using the token in its cache. When the API returns the Unauthorized response, there is a retry mechanism to fetch a new token from aws-iam-authenticator and retry the request.
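In other words, the Unauthorized entries are just the token-expiry/refresh cycle. If you want to see the mechanism on a worker node, a quick sketch (assuming the standard EKS worker AMI layout; the cluster name is a placeholder): the kubeconfig the kubelet uses delegates authentication to aws-iam-authenticator via an exec plugin, and you can generate a token by hand the same way the kubelet does:
# show the exec plugin section the kubelet uses for authentication
grep -A8 'exec:' /var/lib/kubelet/kubeconfig
# generate a short-lived token manually, the same way the kubelet does
aws-iam-authenticator token -i <your-eks-cluster-name>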

Kubernetes Replication Controller Integration Test Failure

I am seeing the following Kubernetes integration tests fail pretty consistently, about 90% of the time, on RHEL 7.2, Fedora 24, and CentOS 7.1:
test/integration/garbagecollector
test/integration/replicationcontroller
They seem to be due to an etcd failure, and my online searches lead me to believe this may also involve an apiserver issue. My setup is simple: I install/start docker, install go, clone the kubernetes repo from GitHub, use hack/install-etcd.sh from the repo and add it to PATH, get ginkgo, gomega and go-bindata, then run 'make test-integration'. I don't manually change anything or add any custom files/configs. Has anyone run into these issues and found a solution? The only mention of this issue I have seen online was deemed a flake and has no listed solution, but I run into it on almost every test run. Pieces of the error are below; I can give more if needed:
Garbage Collector:
(... many lines from garbagecollector.go that look good ...)
I0920 14:42:39.725768 11823 garbagecollector.go:479] create storage for resource { v1 secrets}
I0920 14:42:39.725786 11823 garbagecollector.go:479] create storage for resource { v1 serviceaccounts}
I0920 14:42:39.725803 11823 garbagecollector.go:479] create storage for resource { v1 services}
I0920 14:43:09.565529 11823 trace.go:61] Trace "List *rbac.ClusterRoleList" (started 2016-09-20 14:42:39.565113203 -0400 EDT):
[2.564µs] [2.564µs] About to list etcd node
[30.000353492s] [30.000350928s] Etcd node listed
[30.000361771s] [8.279µs] END
E0920 14:43:09.566770 11823 cacher.go:258] unexpected ListAndWatch error: pkg/storage/cacher.go:198: Failed to list *rbac.RoleBinding: client: etcd cluster is unavailable or misconfigured
(... repeats over and over with a different resource failing to list ...)
Replication Controller:
I0920 14:35:16.907283 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907293 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907298 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907303 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907307 10482 replication_controller.go:481] replication controller worker shutting down
E0920 14:35:16.948417 10482 util.go:45] Metric for replication_controller already registered
--- FAIL: TestUpdateLabelToBeAdopted (30.07s)
replicationcontroller_test.go:270: Failed to create replication controller rc: Timeout: request did not complete within allowed duration
E0920 14:44:06.820506 12053 storage_rbac.go:116] unable to initialize clusterroles: client: etcd cluster is unavailable or misconfigured
There are no files in /var/log that even start with kube.
Thanks in advance!
I increased the limits on the number of file descriptors and haven't seen this issue since, so I'm going to go ahead and call this solved.
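For reference, a sketch of the kind of change meant here (the 65536 value is illustrative, not necessarily what was used): the integration tests start local etcd and apiserver instances that open many sockets and files, so the per-process open-file limit has to be raised for the shell that runs them:
# check the current per-process open-file limit
ulimit -n
# raise it for this session, then run the tests
ulimit -n 65536
make test-integration
# to make it permanent, add nofile entries to /etc/security/limits.conf, e.g.:
# *  soft  nofile  65536
# *  hard  nofile  65536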