RabbitMQ randomly disconnecting application consumers in a Kubernetes/Istio environment - kubernetes

Issue:
My company has recently moved workers from Heroku to Kubernetes. We previously used a Heroku-managed add-on (CloudAMQP) for our RabbitMQ brokers. This worked perfectly and we never saw issues with dropped consumer connections.
Now that our workloads live in Kubernetes deployments on separate nodegroups, we are seeing daily dropped consumer connections, causing messages to go unprocessed by our applications in Kubernetes. Our new RabbitMQ brokers still live in CloudAMQP, but they are no longer managed Heroku add-ons.
Errors on the consumer side just indicate an unexpected disconnect, with no additional details.
No evident errors at the Istio Envoy proxy level.
We do not have an Istio egress gateway, so no destination rules are set there.
No evident errors on the RabbitMQ server.
Remediation Attempts:
We have read all the Stack Overflow/GitHub issues about the unexpected-disconnect errors we are seeing. Nothing we have found has remediated the issue.
Our first remediation attempt was to set the heartbeat to 0 (disabling heartbeats) on both our RabbitMQ server and the consumer. This did not fix anything; connections are still randomly dropping. CloudAMQP actually suggests disabling heartbeats, because they rely heavily on TCP keepalive.
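If the broker relies on TCP keepalive once heartbeats are disabled, it is worth verifying that the consumer's socket actually has aggressive OS-level keepalive enabled. A minimal Python sketch (Linux option names; the timing values are illustrative defaults, not CloudAMQP's recommendations):

```python
import socket

def enable_keepalive(sock, idle_s=60, interval_s=30, probes=4):
    """Turn on OS-level TCP keepalive for an existing socket.

    The TCP_KEEP* option names are Linux-specific, hence the hasattr
    guard; idle_s/interval_s/probes here are example values only.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # Linux only
        # Idle time before the first keepalive probe is sent
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        # Gap between successive probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        # Number of failed probes before the connection is dropped
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Most AMQP clients expose the underlying socket (or a socket-configuration hook) where a helper like this can be applied; if the idle timeout killing the connections is shorter than the OS default keepalive idle time (often 7200s on Linux), the defaults will never fire in time.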
We also created a scheduled message that the consumer simply logs every five minutes, to keep the connection active. This has been a band-aid for whatever the real issue is; it is not perfect, but we have seen a reduction in disconnects.
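For reference, the band-aid above amounts to a small timer that periodically invokes the client's publish call. A sketch, where `publish` and the payload are hypothetical stand-ins for the actual client code:

```python
import threading

def start_keepalive(publish, interval_s=300.0):
    """Periodically hand a no-op payload to `publish` so the connection
    sees regular traffic.

    `publish` stands in for your client's publish callable (e.g. a thin
    wrapper around basic_publish); the payload and the five-minute
    interval are illustrative.
    """
    def tick():
        publish(b"keepalive")  # consumer just logs and acks this message
        t = threading.Timer(interval_s, tick)
        t.daemon = True        # don't keep the process alive on shutdown
        t.start()
    tick()
```

Note that this masks an idle-timeout problem rather than fixing it: anything in the path (Envoy, a NAT gateway, a load balancer) that silently drops idle TCP connections will still drop them the moment traffic pauses longer than its timeout.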
What we think the issue is:
We have researched why this might be happening and are homing in on TCP keepalive settings, either within Kubernetes or in our Istio Envoy proxy's outbound connection settings.
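If the sidecar's outbound idle handling does turn out to be the culprit, one thing to experiment with is an Istio DestinationRule that enables TCP keepalive toward the broker. A hypothetical sketch (the hostname and timing values are made up):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: rabbitmq-keepalive
spec:
  host: your-instance.rmq.cloudamqp.com   # hypothetical broker hostname
  trafficPolicy:
    connectionPool:
      tcp:
        tcpKeepalive:
          time: 60s      # idle time before the first keepalive probe
          interval: 30s  # gap between probes
          probes: 4      # failed probes before the connection is closed
```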
Any ideas on how we can troubleshoot this further, or what we might be missing in diagnosing this?
Thanks!

Related

How to recover JMS inbound-gateway containers on Active MQ server failure when number of retry is changed to limited

A JMS inbound gateway is used for request processing on the worker side. The CustomMessageListenerContainer class is configured to impose a limited back-off max-attempts.
In some scenarios, when the ActiveMQ server does not respond before the max-attempts limit is reached, the container is stopped with the message below:
"Stopping container for destination 'senExtractWorkerInGateway': back-off policy does not allow for further attempts."
Wondering whether any configuration is available to recover these containers once ActiveMQ is available again.
A sample configuration is given below.
<int-jms:inbound-gateway
    id="senExtractWorkerInGateway"
    container-class="com.test.batch.worker.CustomMessageListenerContainer"
    connection-factory="jmsConnectionFactory"
    correlation-key="JMSCorrelationID"
    request-channel="senExtractProcessingWorkerRequestChannel"
    request-destination-name="senExtractRequestQueue"
    reply-channel="senExtractProcessingWorkerReplyChannel"
    default-reply-queue-name="senExtractReplyQueue"
    auto-startup="false"
    concurrent-consumers="25"
    max-concurrent-consumers="25"
    reply-timeout="1200000"
    receive-timeout="1200000"/>
You could probably emit some ApplicationEvent from the applyBackOffTime() of your CustomMessageListenerContainer when the super call returns false. That way you would know that something is wrong with the ActiveMQ connection. At that moment you also need to stop() your senExtractWorkerInGateway - just autowire it into some controlling service as a Lifecycle. When you are done fixing the connection problem, you just need to start that senExtractWorkerInGateway back up. The CustomMessageListenerContainer is going to be started automatically.

High number of socket descriptors in RMQ

We recently had an issue in our RabbitMQ server where it was unable to accept new connections and was dropping TCP connections.
We didn't see any spike in our channels or consumers.
Socket descriptors (SD) and Erlang processes shot up in a short span of time, causing RabbitMQ to get stuck, and no new connections were established after that.
We do not see any significant increase in channels, connections, or consumers that would establish a link with the sudden increase in SDs and Erlang processes.
RMQ VERSION: 3.7.14
Erlang version: Erlang 21.3.8.1
RMQ is running on Kubernetes as a StatefulSet.
[Screenshots showed the RMQ Erlang process spike and the sockets used.]
After restarting the server it works fine, but the issue resurfaces.
I suggest you check the server's half-open connections. It seems you can end up in that kind of situation if you have aggressive reconnects from the client side: clients create connections, then reconnect again and again.
Also, even if you have the same number of consumers, there can be an increased number of publishers.
So my suggestion here is to check the logs and metrics on reconnects to RabbitMQ.

Why am I experiencing endless connection timeouts using quarkus microprofile reactive rest client

At some point in my Quarkus app's life (under Kubernetes), it begins getting endless connection timeouts from multiple different hosts (the timeout is configured to be 1 second). From this point on, the app never recovers until I restart the k8s pod.
These endless connection timeouts are not due to the hosts, since other apps in the cluster do not suffer from this; also, a restart of my app fixes the problem.
I am declaring multiple hosts (base-uri) through the Quarkus application.properties. (Maybe it's using a single Vert.x/Netty event loop, and that's wrong?)

azure websocket connection through kubernetes, many disconnects with code 1006

A Node.js server on Kubernetes receives many WebSocket connections - all is fine, but from time to time an abrupt disconnect occurs (code 1006).
Then every few minutes, the server disconnects from all clients (all disconnects have code 1006).
Important to note that this happens to all replicas at the same time, indicating the cause is external to the servers (and the clients). Could it be the application gateway?
How can I debug this further?
Switching from the default Azure Application Gateway to NGINX solved this problem.

How can I fix frequent, but intermittent TLS handshake timeouts in kubectl?

I'm encountering TLS handshake timeout when trying to perform a number of operations against a local Kubernetes cluster on macOS 10.14.6. The errors show up when doing any kubectl action, any helm action (including helm init and helm version), as well as during deployments.
I've tried rebooting Docker for Mac, as well as rebooting the physical host machine, and wiping and recreating the cluster (which is difficult given that deployments will spontaneously fail because of the TLS handshake issue). I've also made sure that my major/minor/patch versions for kubectl (1.14.3 client, 1.14.6 server) and helm (2.9.1) all match those being used by known-good local deployments in the office.
I've also reviewed the firewall rules on my machine, but haven't found anything that would obviously cause this kind of issue.
Additionally, I've browsed many of the threads discussing this on the issue trackers for k8s itself as well as Helm, plus the questions already on SO, but these overwhelmingly concern Azure's AKS, while I'm working on a local setup.
Finally, I've made sure that enough resources are allocated to actually run the target applications -- in this case 16GB of RAM (which I've tried unsuccessfully upgrading to 24GB) as well as 8 CPU cores.
This problem seems to show up at random, and while it's most often manifested as a TLS handshake timeout, it will also occasionally interrupt an established connection, with skaffold run commands sometimes crashing out with "transport closed." It also doesn't seem to be caused by any missing certs, since the commands eventually succeed -- but the success rate is very low, of the order of 1 in 10 calls.