High number of socket descriptors in RabbitMQ

We recently had an issue with our RabbitMQ server: it was unable to accept new connections and was dropping incoming TCP connections.
We didn't see any spike in our channels or consumers.
Socket descriptors (SD) and Erlang processes shot up in a short span of time, causing RabbitMQ to get stuck; no new connections could be established after that.
We do not see any significant increase in channels, connections, or consumers that would explain the sudden increase in SDs and Erlang processes.
RabbitMQ version: 3.7.14
Erlang version: 21.3.8.1
RabbitMQ runs on Kubernetes as a StatefulSet.
[Graph: RabbitMQ Erlang process spike]
[Graph: sockets used]
After restarting the server it works fine, but the problem keeps resurfacing.

I suggest checking the server for half-open connections. You can end up in this situation if clients reconnect aggressively: they create connections, then reconnect again and again.
Also, even if the number of consumers is unchanged, the number of publishers may have increased.
So my suggestion is to check the logs and metrics for reconnects to RabbitMQ.
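One way to act on this suggestion is to tally TCP connection states on the broker host. The sketch below is only an illustration: it assumes Linux-style `netstat -ant` output (state in the sixth column), and the sample text and the function name `count_tcp_states` are made up for the example. A growing number of SYN_RECV entries (handshake not completed) or CLOSE_WAIT entries (peer closed, local side has not) would be consistent with clients that reconnect aggressively, though it does not prove it.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connection states in `netstat -ant`-style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol ("tcp" or "tcp6");
        # on Linux the sixth column holds the connection state.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical sample, shaped like Linux `netstat -ant` output.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:5672            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:5672           10.0.0.9:41234          ESTABLISHED
tcp        0      0 10.0.0.5:5672           10.0.0.9:41236          SYN_RECV
tcp        0      0 10.0.0.5:5672           10.0.0.9:41238          SYN_RECV
tcp        0      0 10.0.0.5:5672           10.0.0.8:50122          CLOSE_WAIT
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```

Running the same tally at intervals and comparing it with the connection count RabbitMQ itself reports may show whether sockets are accumulating outside established AMQP connections.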

Related

grpc unary-stream with redis pubsub - degradation with too many clients

We have a Python gRPC server (grpcio with asyncio) that performs server-side streaming of data consumed from Redis Pub/Sub (using aioredis 2.x), combining up to 25 channels per stream. With low traffic everything works fine, but as soon as we reach 2000+ concurrent streams, message delivery starts falling behind.
Some setup details and what we tried so far:
Client connections to gRPC are load-balanced across a Kubernetes cluster by the Ingress-NGINX controller, and scaling out (we tried 9 pods with 10 process instances each) doesn't seem to help at all, even though the load is distributed evenly.
We are running a five-node Redis 7.x cluster with 96 threads per replica.
When we connect to Redis with the CLI client while gRPC falls behind, individual channels are on time while the gRPC streams lag.
Messages are small (40 B), with a variable rate of anywhere between 20 and 200 per second on each stream.
aioredis seems to open a new connection for each Pub/Sub subscriber, even though we use a capped connection pool for each gRPC instance.
Memory and CPU utilisation are not dramatic, nor is network I/O, so we're not bottlenecked there.
We tried an identical setup with a very similar gRPC server written in Rust, with similar results.
@mike_t, as you mentioned in the comment, switching from Redis Pub/Sub to ZeroMQ helped resolve the issue.
ZeroMQ (also known as ØMQ, 0MQ, or zmq) is an open-source universal messaging library that looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast.
You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks.
It has a score of language APIs and runs on most operating systems.
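The fan-out pattern mentioned above can be illustrated in-process with nothing but asyncio. This is not ZeroMQ's API and not the poster's code; `FanOutHub` and its methods are hypothetical names for the sketch. The idea is that a single upstream reader per channel calls `publish`, and every client stream drains its own bounded queue, so the number of upstream connections does not grow with the number of streams.

```python
import asyncio


class FanOutHub:
    """Deliver each published message to every subscriber's own queue."""

    def __init__(self, max_queue=1000):
        self._subscribers = set()
        self._max_queue = max_queue

    def subscribe(self):
        """Register a new subscriber and return its bounded queue."""
        queue = asyncio.Queue(maxsize=self._max_queue)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._subscribers.discard(queue)

    def publish(self, message):
        """Push to all subscribers; return how many had a full queue and were skipped."""
        dropped = 0
        for queue in self._subscribers:
            try:
                queue.put_nowait(message)
            except asyncio.QueueFull:
                # A slow consumer loses this message instead of stalling the others.
                dropped += 1
        return dropped


async def demo():
    hub = FanOutHub(max_queue=2)
    first, second = hub.subscribe(), hub.subscribe()
    hub.publish(b"tick-1")
    hub.publish(b"tick-2")
    return [await first.get(), await first.get(),
            await second.get(), await second.get()]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Dropping on a full queue is one possible policy for slow consumers; disconnecting them or blocking the publisher are alternatives with different trade-offs.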

RabbitMQ randomly disconnecting application consumers in a Kubernetes/Istio environment

Issue:
My company has recently moved workers from Heroku to Kubernetes. We previously used a Heroku-managed add-on (CloudAMQP) for our RabbitMQ brokers. This worked perfectly and we never saw issues with dropped consumer connections.
Now that our workloads live in Kubernetes deployments on separate nodegroups, we are seeing daily dropped consumer connections, causing our messages to not be processed by our applications living in Kubernetes. Our new RabbitMQ brokers live in CloudAMQP but are not managed Heroku add-ons.
Errors on the consumer side just indicate an Unexpected disconnect. No additional details.
No errors are evident at the Istio Envoy proxy level.
We do not have an Istio Egress, so no destination rules are set here.
No errors are evident on the RabbitMQ server.
Remediation Attempts:
Read all StackOverflow/GitHub issues for the Unexpected errors we are seeing. Nothing we have found has remediated the issue.
Our first attempt to remediate was to change the heartbeat to 0 (disabling heartbeats) on our RabbitMQ server and consumer. This did not fix anything; connections still drop randomly. CloudAMQP also suggests disabling this, because they rely heavily on TCP keepalive.
Created a message that just logs on the consumer every five minutes, to keep the connection active. This has been a band-aid fix for whatever the real issue is. It is not perfect, but we have seen a reduction in disconnects.
What we think the issue is:
We have researched why this might be happening and are homing in on TCP keepalive settings, either within Kubernetes or in the outbound connection settings of our Istio Envoy proxy.
Any ideas on how we can troubleshoot this further, or what we might be missing here to diagnose?
Thanks!
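For reference while investigating keepalive settings, this is roughly what enabling TCP keepalive on a client socket looks like. It is a sketch, not a fix confirmed for this setup: the timer values are illustrative, `enable_keepalive` is a hypothetical helper, and whether a given AMQP client library lets you reach the underlying socket or pass socket options is an assumption to verify. On Linux the default idle time before the first probe is 7200 seconds, so if something on the path drops idle connections sooner, a shorter idle time may help.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive and, where the platform exposes them, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names exist on Linux; other platforms may lack some of them,
    # in which case the OS defaults stay in effect.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print("keepalive on:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
```

Note that keepalive probes only help if they traverse the same hop that is timing out; a sidecar proxy terminates the application's TCP connection locally, so the proxy's own upstream keepalive settings may be what matters.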

azure websocket connection through kubernetes, many disconnects with code 1006

A Node.js server on Kubernetes gets many WebSocket connections. All is fine, but from time to time an abrupt disconnect occurs (code 1006).
Then every few minutes, the server disconnects from all clients (all disconnects have code 1006).
Important to note that this happens to all replicas at the same time, indicating the cause is external to the servers (and the clients). Could it be the application gateway?
How can I debug this further?
Changing from the default azure application gateway to nginx solved this problem.

MongoDB connection fails on multiple app servers

We use MongoDB with the mgo driver for Go. Two app servers connect to MongoDB, which runs alongside the apps (Go binaries). MongoDB runs as a replica set, and each server connects to the primary or a secondary depending on the replica's current state.
We experienced the error "SocketException handling request, closing client connection: 9001 socket exception" on one of the MongoDB servers, which caused the connection to MongoDB from our apps to die. After that, the replica set continued to function, but the connection on our second server (where the error didn't happen) died as well.
In the golang logs it was manifested as:
read tcp 10.10.0.5:37698->10.10.0.7:27017: i/o timeout
Why did this happen? How can this be prevented?
As I understand it, mgo connects to the whole replica set from the URL (it detects the whole topology from a single instance's URL), but why did the connection dying on one of the servers kill it on the second one?
Edit:
Full package path that is used "gopkg.in/mgo.v2"
Unfortunately I can't share the MongoDB log files here, but besides the SocketException they don't contain anything useful. There is some indication of lock contention, where the lock acquisition time is sometimes quite high, but nothing beyond that.
MongoDB does some heavy indexing at times, but there weren't any unusual spikes recently, so it's nothing beyond normal.
First, the mgo driver you are using, gopkg.in/mgo.v2, developed by Gustavo Niemeyer (hosted at https://github.com/go-mgo/mgo), is not maintained anymore.
Instead, use the community-supported fork github.com/globalsign/mgo. This one continues to get patched and evolve.
Its changelog includes "Improved connection handling", which seems directly related to your issue.
Its details can be read here https://github.com/globalsign/mgo/pull/5 which points to the original pull request https://github.com/go-mgo/mgo/pull/437:
If mongoServer fail to dial server, it will close all sockets that are alive, whether they're currently use or not.
There are two cons:
Inflight requests will be interrupt rudely.
All sockets closed at the same time, and likely to dial server at the same time. Any occasional fail in the massive dial requests (high concurrency scenario) will make all sockets closed again, and repeat...(It happened in our production environment)
So I think sockets currently in use should closed after idle.
Note that github.com/globalsign/mgo has a backward-compatible API; besides the fixes and patches, it basically just adds a few new features. This means you should be able to just change the import paths, and everything should work without further changes.

Docker blocking outgoing connections on high load?

We have a node.js web server that makes some outgoing http requests to an external API. It's running in docker using dokku.
After some time under load (30 req/s), these outgoing requests stop getting responses.
Here's a graph I made while testing with constant req/s:
incoming and outgoing are the numbers of concurrent requests (not the number of initialized requests). (It's hard to see in the graph, but both are fairly constant at ~10 requests.)
response time is for external requests only.
You can clearly see that they start failing all of a sudden (hitting our 1000ms timeout).
The more req/s we send, the faster we run into this problem, so we must have some sort of limit we're getting closer to with each request.
I used netstat -ant | tail -n +3 | wc -l on the host to get the number of open connections, but it was only ~450 (most of them TIME_WAIT). That shouldn't hit the socket limit. We aren't hitting any RAM or CPU limits, either.
I also tried running the same app on the same machine outside docker and it only happens in docker.
It could be due to the Docker userland proxy. If you are running a recent version of Docker, try running the daemon with the --userland-proxy=false option. This will make Docker handle port forwarding with just iptables and there is less overhead.