Why am I experiencing endless connection timeouts using quarkus microprofile reactive rest client - vert.x

At some point in my Quarkus app's lifetime (running under Kubernetes), it starts getting endless connection timeouts from multiple different hosts (the timeout is configured to 1 second). From that point on, the app never recovers until I restart the k8s pod.
These connection timeouts are not caused by the hosts: other apps in the cluster do not suffer from them, and restarting my app fixes the problem.
I am declaring multiple hosts (base-uri) through the Quarkus application.properties. (Maybe it's using a single Vert.x/Netty event loop and that's wrong?)
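
One way to check the "it's not the hosts" reasoning from inside the affected pod is to time plain TCP connects to the same hosts, outside the REST client. The following is only a minimal diagnostic sketch in Python (not Quarkus code); `probe_connect` and `probe_all` are hypothetical helper names, and the 1-second default simply mirrors the timeout mentioned above. If these probes succeed while the app keeps timing out, that points at the client or event loop rather than the network.

```python
import socket
import time


def probe_connect(host, port, timeout=1.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refusals and DNS failures
        return False, time.monotonic() - start, repr(exc)


def probe_all(targets, timeout=1.0):
    """Probe each (host, port) pair and return a dict keyed by the pair."""
    return {
        target: probe_connect(target[0], target[1], timeout)
        for target in targets
    }
```

The targets would be the same host/port pairs configured as base-uri values; running the probe both while the app is healthy and while it is stuck gives a baseline to compare against.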

Related

RabbitMQ randomly disconnecting application consumers in a Kubernetes/Istio environment

Issue:
My company has recently moved workers from Heroku to Kubernetes. We previously used a Heroku-managed add-on (CloudAMQP) for our RabbitMQ brokers. This worked perfectly and we never saw issues with dropped consumer connections.
Now that our workloads live in Kubernetes deployments on separate nodegroups, we are seeing daily dropped consumer connections, causing our messages to not be processed by our applications living in Kubernetes. Our new RabbitMQ brokers live in CloudAMQP but are not managed Heroku add-ons.
Errors on the consumer side just indicate an Unexpected disconnect. No additional details.
No errors are evident at the Istio Envoy proxy level.
We do not have an Istio Egress, so no destination rules are set here.
No errors are evident on the RabbitMQ server.
Remediation Attempts:
Read all StackOverflow/GitHub issues for the Unexpected errors we are seeing. Nothing we have found has remediated the issue.
Our first attempt to remediate was to change the heartbeat to 0 (disabling heartbeats) on our RabbitMQ server and consumer. This did not fix anything; connections still drop randomly. CloudAMQP also suggests disabling this, because they rely heavily on TCP keepalive.
Created a message that just logs on the consumer every five minutes, to keep the connection active. This has been a band-aid fix for whatever the real issue is. It is not perfect, but we have seen a reduction in disconnects.
What we think the issue is:
We have researched why this might be happening and are homing in on network TCP keepalive settings, either within Kubernetes or in our Istio Envoy proxy's outbound connection settings.
Any ideas on how we can troubleshoot this further, or what we might be missing here to diagnose?
Thanks!
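
To make the TCP keepalive angle above concrete, here is a minimal sketch of what enabling keepalive on a client socket looks like at the socket level. It is written in Python for illustration only and is not tied to any particular RabbitMQ client; `enable_keepalive` is a hypothetical helper, and the 60/10/5 values are assumptions that would need to be shorter than whatever idle timeout sits in the path (if that is in fact the cause). The three tuning options use their Linux names and are skipped on platforms that lack them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive and tune probe timing where supported."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux option names; getattr() skips any the platform does not define.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the drop
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Most AMQP client libraries expose a socket-options hook or connection parameter for this, so in practice the same settings would go through the library rather than a raw socket.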

K8s graceful upgrade of service with long-running connections

tl;dr: I have a server that handles WebSocket connections. The nature of the workload is that it is necessarily stateful (i.e., each connection has long-running state). Each connection can last ~20m-4h. Currently, I only deploy new revisions of this service at off hours to avoid interrupting users too much.
I'd like to move to a new model where deploys happen whenever, and the services gracefully drain connections over the course of ~30 minutes (typically the frontend can find a "good" time to make that switch over within 30 minutes, and if not, we just forcibly disconnect them). I can do that pretty easily with K8s by setting terminationGracePeriodSeconds.
However, what's less clear is how to do rollouts such that new connections only go to the most recent deployment. Suppose I have five replicas running. Normal deploys have an undesirable mode where a client is on R1 (replica 1) and then K8s deploys R1' (upgraded version) and terminates R1; frontend then reconnects and gets routed to R2; R2 terminates, frontend reconnects, gets routed to R3.
Is there any easy way to ensure that after the upgrade starts, new clients get routed only to the upgraded versions? I'm already running Istio (though not using very many of its features), so I could imagine doing something complicated with some custom deployment infrastructure (currently just using Helm) that spins up a new deployment, cuts over new connections to the new deployment, and gracefully drains the old deployment... but I'd rather keep it simple (just Helm running in CI) if possible.
Any thoughts on this?
This is already how things work with normal Services. Once a pod is terminating, it has already been removed from the Endpoints. You'll probably need to raise maxSurge (the "max burst") in the Deployment's rolling update settings to 100%, so that it spawns all the new pods at once and then starts the shutdown process on all the old ones.
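
As a rough illustration of that suggestion, the sketch below builds the relevant Deployment fields as a patch document. It is a sketch under assumptions, not a tested rollout recipe: `surge_all_patch` is a hypothetical helper, `maxUnavailable: 0` is an added assumption (the answer only mentions the surge setting), and the 1800-second grace period just mirrors the ~30 minute drain window from the question.

```python
import json


def surge_all_patch(grace_seconds=1800):
    """Build a Deployment patch that surges all new pods before draining."""
    return {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    "maxSurge": "100%",    # start every new pod at once
                    "maxUnavailable": 0,   # assumption: keep old pods until new are ready
                },
            },
            "template": {
                "spec": {
                    # time a terminating pod gets to drain its connections
                    "terminationGracePeriodSeconds": grace_seconds,
                },
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(surge_all_patch(), indent=2))
```

The same three fields can equally be set directly in the Helm chart's Deployment template; the server still has to handle SIGTERM itself and keep serving existing connections during the grace period.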

azure websocket connection through kubernetes, many disconnects with code 1006

A Node.js server on Kubernetes gets many WebSocket connections. All is fine, but from time to time an abrupt disconnect occurs (code 1006).
Then every few minutes, the server disconnects from all clients (all disconnects have code 1006).
Important to note that this happens to all replicas at the same time, indicating the cause is external to the servers (and the clients). Could it be the application gateway?
How can I debug this further?
Changing from the default azure application gateway to nginx solved this problem.

Service Fabric upgrades keep active connections alive

I am trying to upgrade an application deployed to service fabric.
How can I only upgrade nodes that have no active connections and wait for the busy nodes to finish before upgrading them?
Most of the time, you don't really have to worry about upgrades at the node level, as the SF runtime handles them internally if configured in Monitored mode. This is what we've been using with a high level of success, and we never really had to do much. It also fit our requirement that all upgrade domains (nodes) have to match our health state policies before being considered healthy.
If you want more advanced control over your upgrades, like using request draining etc., have a look at the info as mentioned here. But to be honest, we've been quite happy with just using Monitored mode and investigating why stuff fails if it does. We had some apps with a long background task running as a stateful actor that sometimes failed an upgrade, and almost always it was due to an issue in the background task itself rather than anything to do with Service Fabric.
Service Fabric knew when no active connections or background tasks were running and only then upgraded the nodes, and we could actually see the nodes that were temporarily 'stuck' waiting for an active background task to finish.

Getting error no such device or address on kubernetes pods

I have some .NET Core applications running as microservices in GKE (Google Kubernetes Engine).
Usually everything works fine, but sometimes, if my microservice isn't in use, something happens that makes my application shut down (same behavior as CTRL + C in a terminal).
I know that this is Kubernetes behavior, but if I send a request to an application that is not running, my first request returns the error "No such Device or Address" or a timeout error.
I will post some logs and setups:
The key to what's happening is this logged error:
TNS: Connect timeout occured ---> OracleInternal.Network....
Since your application is not being used, the Oracle database just shuts down its idle connection. To solve this problem, you have a few options:
Handle the disconnection inside your application to just reconnect.
Define a livenessProbe to restart the pod automatically once the application is down.
Make your application do something with the connection from time to time -> this can be done with a probe too.
Configure your Oracle database not to close idle connections.
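
The first and third options in the list above (reconnect on failure, and touching the connection from time to time) can be sketched roughly as follows. This is an illustrative Python sketch rather than .NET or Oracle driver code: `ReconnectingClient`, its `connect` factory, and the `OSError` default are all hypothetical stand-ins for whatever connection type and exception the real driver uses.

```python
import time


class ReconnectingClient:
    """Wrap a connection factory and reconnect when a call fails."""

    def __init__(self, connect, retries=3, delay=0.5, errors=(OSError,)):
        self._connect = connect    # hypothetical zero-argument factory
        self._retries = retries    # extra attempts after the first failure
        self._delay = delay        # seconds to wait between attempts
        self._errors = errors      # exceptions treated as "connection lost"
        self._conn = None

    def call(self, operation):
        """Run operation(conn); on a connection error, reconnect and retry."""
        last_error = None
        for attempt in range(self._retries + 1):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except self._errors as exc:
                last_error = exc
                self._conn = None  # drop the stale connection
                if attempt < self._retries:
                    time.sleep(self._delay)
        raise last_error

    def ping(self, cheap_operation):
        """Keep-alive hook: run a cheap operation so the link is not idle."""
        return self.call(cheap_operation)
```

Calling `ping` from a timer or from a probe endpoint covers the "do something with the connection from time to time" option, while `call` covers reconnecting after the database has already dropped the connection.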