How to recover JMS inbound-gateway containers after an ActiveMQ server failure when the number of retries is limited - spring-batch

A JMS inbound gateway is used for request processing on the worker side. The CustomMessageListenerContainer class is configured with a limited number of back-off attempts.
In some scenarios, when the ActiveMQ server does not respond before the max-attempts limit is reached, the container is stopped with the message below:
"Stopping container for destination 'senExtractWorkerInGateway': back-off policy does not allow for further attempts."
Is there any configuration available to recover these containers once ActiveMQ becomes available again?
A sample configuration is given below.
<int-jms:inbound-gateway
id="senExtractWorkerInGateway"
container-class="com.test.batch.worker.CustomMessageListenerContainer"
connection-factory="jmsConnectionFactory"
correlation-key="JMSCorrelationID"
request-channel="senExtractProcessingWorkerRequestChannel"
request-destination-name="senExtractRequestQueue"
reply-channel="senExtractProcessingWorkerReplyChannel"
default-reply-queue-name="senExtractReplyQueue"
auto-startup="false"
concurrent-consumers="25"
max-concurrent-consumers="25"
reply-timeout="1200000"
receive-timeout="1200000"/>

You can probably emit some ApplicationEvent from the applyBackOffTime() of your CustomMessageListenerContainer when the super call returns false. This way you would know that something is wrong with the ActiveMQ connection. At that moment you also need to stop() your senExtractWorkerInGateway - just autowire it into some controlling service as a Lifecycle. When you are done fixing the connection problem, you just need to start that senExtractWorkerInGateway back up. The CustomMessageListenerContainer is then started automatically.
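A minimal sketch of that idea, assuming the custom container extends Spring's DefaultMessageListenerContainer and is instantiated as a Spring bean (so the ApplicationEventPublisherAware callback is honored). The BrokerUnavailableEvent type and the GatewayRecoveryService bean are illustrative names, not part of the original configuration:

import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.ApplicationEvent;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.context.ApplicationEventPublisherAware;
import org.springframework.context.Lifecycle;
import org.springframework.context.event.EventListener;
import org.springframework.jms.listener.DefaultMessageListenerContainer;
import org.springframework.stereotype.Service;
import org.springframework.util.backoff.BackOffExecution;

// CustomMessageListenerContainer.java
public class CustomMessageListenerContainer extends DefaultMessageListenerContainer
        implements ApplicationEventPublisherAware {

    private ApplicationEventPublisher publisher;

    @Override
    public void setApplicationEventPublisher(ApplicationEventPublisher publisher) {
        this.publisher = publisher;
    }

    @Override
    protected boolean applyBackOffTime(BackOffExecution execution) {
        boolean canContinue = super.applyBackOffTime(execution);
        if (!canContinue && this.publisher != null) {
            // Back-off attempts are exhausted: broadcast so a controlling service can react.
            this.publisher.publishEvent(new BrokerUnavailableEvent(this));
        }
        return canContinue;
    }

    // Hypothetical event type used only for this sketch.
    public static class BrokerUnavailableEvent extends ApplicationEvent {
        public BrokerUnavailableEvent(Object source) {
            super(source);
        }
    }
}

// GatewayRecoveryService.java
@Service
public class GatewayRecoveryService {

    private final Lifecycle senExtractWorkerInGateway;

    public GatewayRecoveryService(@Qualifier("senExtractWorkerInGateway") Lifecycle gateway) {
        this.senExtractWorkerInGateway = gateway;
    }

    @EventListener
    public void onBrokerUnavailable(CustomMessageListenerContainer.BrokerUnavailableEvent event) {
        // Stop the whole gateway, not just the container, once the back-off gives up.
        this.senExtractWorkerInGateway.stop();
    }

    // Call this (e.g. from a management endpoint or a broker health check) once ActiveMQ is back;
    // the custom container is started again as part of the gateway's lifecycle.
    public void restartGateway() {
        this.senExtractWorkerInGateway.start();
    }
}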

Related

RabbitMQ randomly disconnecting application consumers in a Kubernetes/Istio environment

Issue:
My company has recently moved workers from Heroku to Kubernetes. We previously used a Heroku-managed add-on (CloudAMQP) for our RabbitMQ brokers. This worked perfectly and we never saw issues with dropped consumer connections.
Now that our workloads live in Kubernetes deployments on separate nodegroups, we are seeing daily dropped consumer connections, causing our messages to not be processed by our applications living in Kubernetes. Our new RabbitMQ brokers live in CloudAMQP but are not managed Heroku add-ons.
Errors on the consumer side just indicate an unexpected disconnect. No additional details.
No evident errors at the Istio Envoy proxy level.
We do not have an Istio egress gateway, so no destination rules are set here.
No evident errors on the RabbitMQ server.
Remediation Attempts:
We have read all the Stack Overflow/GitHub issues for the unexpected-disconnect errors we are seeing. Nothing we have found has remediated the issue.
Our first attempt to remediate was to change the heartbeat to 0 (disabling heartbeats) on our RabbitMQ server and consumers. This did not fix anything; connections are still randomly dropping. CloudAMQP also suggests disabling heartbeats, because they rely heavily on TCP keepalive.
We also created a message that the consumer simply logs every five minutes, just to keep the connection active. This has been a band-aid fix for whatever the real issue is. It is not perfect, but we have seen a reduction in disconnects.
What we think the issue is:
We have researched why this might be happening and are homing in on TCP keepalive settings, either within Kubernetes or in our Istio Envoy proxy's outbound connection settings.
Any ideas on how we can troubleshoot this further, or what we might be missing here to diagnose?
Thanks!
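As an aside on the keepalive work-around described above: if the consumers were on the Java RabbitMQ client (the question does not say which client library is in use, so this is purely illustrative), TCP keepalive could be enabled on the client socket instead of relying on a periodic dummy message, roughly like this:

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class KeepAliveConnection {

    public static Connection open(String amqpUri) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setUri(amqpUri); // e.g. the CloudAMQP connection URI

        // Enable TCP keepalive on the underlying socket, matching CloudAMQP's advice
        // to rely on keepalive rather than AMQP heartbeats.
        factory.setSocketConfigurator(socket -> socket.setKeepAlive(true));

        // Heartbeats stay disabled here to mirror the setup described in the question.
        factory.setRequestedHeartbeat(0);

        // Recover automatically if an intermediary still drops the connection.
        factory.setAutomaticRecoveryEnabled(true);
        factory.setNetworkRecoveryInterval(5000); // milliseconds between recovery attempts

        return factory.newConnection();
    }
}

Whether the drops originate in Kubernetes or in Envoy's idle-connection handling, client-side keepalive plus automatic recovery at least narrows the blast radius while the root cause is investigated.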

Should/could a Kubernetes pod process.exit(1) itself, or is it better to use a liveness probe?

I have an important service running in Kubernetes.
You can picture it as a dispatcher service which connects to a publisher and dispatch the information to a RabbitMQ queue:
[Service1: publisher] -> [Service2: dispatcher] -> [Service3: RabbitMQ]
(everything is in the same namespace, there is one replica for each service and they are set as Stateful Sets)
If the connection between publisher and dispatcher is down, publisher will buffer the messages, all good.
However, if the connection between dispatcher and RabbitMQ is down, the messages will be lost.
Thus, when I lose the connection to RabbitMQ, I'd like to somehow process.exit(1) the dispatcher so it instantly stops receiving messages from the publisher, and the messages get buffered there instead.
I'm also thinking about doing it in a more "k8s way" by setting up a liveness probe, but I'm afraid this could take some time before it detects the failure and restarts the pod (without "DDoSing" my pod every 1 sec). Because to set up this probe I would have to "listen for disconnect" anyway, and if I already know I am disconnected, why should I wait for k8s to take action (and lose precious messages) when I could simply kill the service?
I know the question might be a bit vague but I'm also asking for some hints / best practices here. Thanks.
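For what it's worth, the fail-fast approach described above (the dispatcher's language is not specified, so this Java sketch using the RabbitMQ client is purely illustrative) amounts to registering a shutdown listener on the connection and exiting only when the close was not initiated by the application itself:

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class FailFastDispatcher {

    public static Connection connect(String amqpUri) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setUri(amqpUri);
        Connection connection = factory.newConnection();

        // If the broker side drops the connection, exit immediately so the
        // orchestrator restarts the pod and the publisher keeps buffering.
        connection.addShutdownListener(cause -> {
            if (!cause.isInitiatedByApplication()) {
                System.exit(1);
            }
        });
        return connection;
    }
}

With something like this in place, a liveness probe only needs to cover the "stuck but not crashed" case rather than the disconnect itself.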

Blocking a Service Fabric service shutdown externally

I'm going to write a quick little SF service to report endpoints from a service to a load balancer. That part is easy and well understood. FabricClient. Find Services. Discover endpoints. Do stuff with load balancer API.
But I'd like to be able to deal with a graceful drain and shutdown situation. Basically, catch and block the shutdown of a SF service until after my app has had a chance to drain connections to it from the pool.
There's no API I can find to accomplish this. But I kinda bet one of the services would let me do this. Resource manager. Cluster manager. Whatever.
Is anybody familiar with how to pull this off?
From what I know, this isn't possible in the way you've described.
A Service Fabric service can be shut down for multiple reasons: rebalancing, errors, outage, upgrade, etc. Depending on the type of service (stateful or stateless) they have slightly different shutdown routines (see more), but in general, if the service replica is shut down gracefully, the OnCloseAsync method is invoked. Inside this method the replica can perform a safe cleanup. There is also a second case - when the replica is forcibly terminated. Then the OnAbort method is called, and there are no clear statements in the documentation about the guarantees you have inside OnAbort.
Going back to your case I can suggest the following pattern:
When the replica is about to shut down, inside OnCloseAsync or OnAbort it calls lbservice and reports that it is shutting down.
The lbservice then reconfigures the load balancer to exclude this replica from request processing.
The replica completes all in-flight requests and shuts down.
Please note that you would need to implement a startup mechanism too, i.e. when the replica starts it reports to lbservice that it is active now (see the sketch after this list).
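A minimal, framework-agnostic sketch of those notification calls (written in Java only for illustration; an actual Service Fabric service would invoke the equivalent from its OnCloseAsync/OnAbort and startup hooks, and the lbservice URL and endpoint here are purely hypothetical):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper a replica could call on startup and during OnCloseAsync/OnAbort.
public class LbServiceNotifier {

    // Placeholder address for the lbservice described in the pattern above.
    private static final String LB_SERVICE_URL = "http://lbservice:8080/replicas";

    private final HttpClient http = HttpClient.newHttpClient();
    private final String replicaId;

    public LbServiceNotifier(String replicaId) {
        this.replicaId = replicaId;
    }

    // Startup mechanism: report that this replica may receive traffic again.
    public void reportActive() throws Exception {
        send("ACTIVE");
    }

    // Shutdown step: ask lbservice to drain this replica before it goes away.
    public void reportShuttingDown() throws Exception {
        send("DRAINING");
    }

    private void send(String state) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(LB_SERVICE_URL + "/" + replicaId))
                .header("Content-Type", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(state))
                .build();
        // lbservice is expected to reconfigure the load balancer accordingly.
        http.send(request, HttpResponse.BodyHandlers.ofString());
    }
}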
In the meantime, I'd like to note that Service Fabric already implements these mechanics. Here is an example of how API Management can be used with Service Fabric, and here is an example of how the Reverse Proxy can be used to access Service Fabric services from the outside.
EDIT 2018-10-08
In order to receive notifications about changes to service endpoints in general, you can try to use the FabricClient.ServiceManagementClient.ServiceNotificationFilterMatched event.
There is a similar situation solved in this question.

What happens when Eureka instance skips a heartbeat against a Eureka server with self preservation turned off?

Consider this set-up:
Eureka server with self preservation mode disabled i.e. enableSelfPreservation: false
2 Eureka instances each for 2 services (say service#1 and service#2). Total 4 instances.
And one of the instances (say srv#1inst#1, an instance of service#1) sent a heartbeat, but it did not reach the Eureka server.
AFAIK, the following actions take place in sequence on the server side:
ServerStep1: Server observes that a particular instance has missed a heartbeat.
ServerStep2: Server marks the instance for eviction.
ServerStep3: Server's eviction scheduler (which runs periodically) evicts the instance from registry.
Now on instance (srv#1inst#1) side:
InstanceStep1: It skips a heartbeat.
InstanceStep2: It realizes heartbeat did not reach Eureka Server. It retries with exponential back-off.
AFAIK, the eviction and registration do not happen immediately. The Eureka server runs a separate scheduler for each task periodically.
I have some questions related to this process:
Are the sequences correct? If not, what did I miss?
Is the assumption about eviction and registration scheduler correct?
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How will that affect the response to the service#2 instance's request for a fresh registry? How will it affect the eviction scheduler?
This question was answered by qiangdavidliu in one of the issues in Eureka's GitHub repository.
I'm adding his explanations here for the sake of completeness.
Before I answer the questions specifically, here's some high level information regarding heartbeats and evictions (based on default configs):
instances are only evicted if they miss 3 consecutive heartbeats
(most) heartbeats do not retry; they are best-effort every 30s. The only time a heartbeat will retry is if there is a thread-level error on the heartbeating thread (i.e. Timeout or RejectedExecution), but this should be very rare.
Let me try to answer your questions:
Are the sequences correct? If not, what did I miss?
A: The sequences are correct, with the above clarifications.
Is the assumption about eviction and registration scheduler correct?
A: The eviction is handled by an internal scheduler. The registration is processed by the handler thread for the registration request.
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
A: There are a few things here:
until the instance is actually evicted, it will be part of the result
eviction does not involve changing the instance's status, it merely removes the instance from the registry
the server holds 30s caches of the state of the world, and it is this cache that's returned. So the exact result as seen by the client, in an eviction scenario, still depends on when it falls within the cache's update cycle (see the timing sketch at the end of this answer).
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How will that affect the response to the service#2 instance's request for a fresh registry? How will it affect the eviction scheduler?
A: again a few things:
When the actual eviction happens, we check each evictee's time to see if it is eligible to be evicted. If an instance is able to renew its heartbeats before this event, then it is no longer a target for eviction.
The 3 events in question (evaluation of eviction eligibility at eviction time, updating the heartbeat status of an instance, generation of the result to be returned to the read operations) all happen asynchronously and their result will depend on the evaluation of the above described criteria at execution time.
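To make that asynchrony concrete, here is a back-of-the-envelope sketch of the worst-case window during which an instance that has stopped heartbeating can still be returned to clients. The interval values are the commonly cited Eureka defaults (30s heartbeats, 90s lease expiration, 60s eviction timer, 30s server response cache, 30s client registry fetch); they are assumptions, so substitute your own configuration:

// Rough worst-case staleness estimate for a dead instance, using assumed Eureka defaults.
public class EurekaStalenessEstimate {

    static final int LEASE_EXPIRATION_SECONDS = 90;        // roughly 3 missed 30s heartbeats
    static final int EVICTION_TIMER_PERIOD_SECONDS = 60;   // server-side eviction scheduler period
    static final int RESPONSE_CACHE_REFRESH_SECONDS = 30;  // server-side read cache refresh
    static final int CLIENT_REGISTRY_FETCH_SECONDS = 30;   // client-side registry fetch interval

    public static void main(String[] args) {
        int worstCaseSeconds = LEASE_EXPIRATION_SECONDS
                + EVICTION_TIMER_PERIOD_SECONDS
                + RESPONSE_CACHE_REFRESH_SECONDS
                + CLIENT_REGISTRY_FETCH_SECONDS;
        // About 210s in the worst case: the instance stays in registry copies (still marked UP)
        // until every layer - lease expiry, eviction run, server cache, client cache - catches up.
        System.out.println("Worst-case staleness ~= " + worstCaseSeconds + "s");
    }
}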

Partition is in quorum loss

I have a Service Fabric application that has a stateless web api and a stateful service with two partitions. The stateless web api defines a web api controller and uses ServiceProxy.Create to get a remoting proxy for the stateful service. The remoting call puts a message into a reliable queue.
The stateful service will dequeue the messages from the queue every X minutes.
I am looking at the Service Fabric explorer and my application has been in an error state for the past few days. When I drill down into the details the stateful service has the following error:
Error event: SourceId='System.FM', Property='State'. Partition is in quorum loss.
Looking at the explorer, I see that my primary replica is up and running and there seems to be a single ActiveSecondary, but the other two replicas show IdleSecondary and they keep going into a Standby / InBuild state. I cannot figure out why this is happening.
What are some of the reasons my other secondaries keep failing to get to an ActiveSecondary state / causing this quorum loss?
Try resetting the cluster.
I was facing the same issue with a single partition for my service.
The error was fixed by resetting the cluster.
Have you checked the Windows Event Log on the nodes for additional error message?
I had a similar problem, except I was using a ReliableDictionary. Did you properly implement IEquatable<T> and IComparable<T>? In my case, my T had a dictionary field, and I was calling Equals on the dictionary directly instead of comparing its keys and values. Same thing for GetHashCode.
The clue in the event logs was this message: Assert=Cannot update an item that does not exist (null). - it only happened when I edited a key in the ReliableDictionary.