What happens when a Eureka instance skips a heartbeat against a Eureka server with self-preservation turned off? - spring-cloud

Consider this set-up:
Eureka server with self-preservation mode disabled, i.e. enableSelfPreservation: false
2 Eureka instances for each of 2 services (say service#1 and service#2), 4 instances in total.
And one of the instances (say srv#1inst#1, an instance of service#1) sent a heartbeat, but it did not reach the Eureka server.
AFAIK, the following actions take place in sequence on the server side:
ServerStep1: Server observes that a particular instance has missed a heartbeat.
ServerStep2: Server marks the instance for eviction.
ServerStep3: Server's eviction scheduler (which runs periodically) evicts the instance from registry.
Now on the instance (srv#1inst#1) side:
InstanceStep1: It skips a heartbeat.
InstanceStep2: It realizes the heartbeat did not reach the Eureka server. It retries with exponential back-off.
AFAIK, the eviction and registration do not happen immediately; the Eureka server runs a separate scheduler for each task periodically.
I have some questions related to this process:
Are the sequences correct? If not, what did I miss?
Is the assumption about eviction and registration scheduler correct?
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How will that affect the response to the instance of service#2's request for a fresh registry? How will it affect the eviction scheduler?

This question was answered by qiangdavidliu in one of the issues of Eureka's GitHub repository.
I'm adding his explanations here for the sake of completeness.
Before I answer the questions specifically, here's some high level information regarding heartbeats and evictions (based on default configs):
instances are only evicted if they miss 3 consecutive heartbeats
(most) heartbeats do not retry; they are best-effort every 30s. The only time a heartbeat will retry is if there is a thread-level error on the heartbeating thread (i.e. Timeout or RejectedExecution), but this should be very rare. (The lease settings behind these defaults are shown in the sketch below.)
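For reference, those defaults correspond to the client's lease settings. A minimal Spring Cloud sketch follows (assuming the Spring Cloud Netflix Eureka client; overriding the bean is optional, since the same values are usually set via properties):

import org.springframework.cloud.commons.util.InetUtils;
import org.springframework.cloud.netflix.eureka.EurekaInstanceConfigBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch only (assumes the Spring Cloud Netflix Eureka client). These are the
// defaults the answer refers to: a 30s heartbeat and a 90s lease expiry,
// i.e. roughly 3 missed heartbeats before the server considers eviction.
@Configuration
public class EurekaLeaseDefaultsSketch {

    @Bean
    public EurekaInstanceConfigBean eurekaInstanceConfigBean(InetUtils inetUtils) {
        EurekaInstanceConfigBean config = new EurekaInstanceConfigBean(inetUtils);
        config.setLeaseRenewalIntervalInSeconds(30);    // heartbeat interval (default)
        config.setLeaseExpirationDurationInSeconds(90); // ~3 missed heartbeats (default)
        return config;
    }
}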
Let me try to answer your questions:
Are the sequences correct? If not, what did I miss?
A: The sequences are correct, with the above clarifications.
Is the assumption about eviction and registration scheduler correct?
A: The eviction is handled by an internal scheduler. The registration is processed by the handler thread for the registration request.
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
A: There are a few things here:
until the instance is actually evicted, it will be part of the result
eviction does not involve changing the instance's status, it merely removes the instance from the registry
the server holds 30s caches of the state of the world, and it is this cache that's returned. So the exact result as seen by the client, in an eviction scenario, still depends on when it falls within the cache's update cycle (see the lookup sketch below).
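To make the read path concrete: a consumer such as an instance of service#2 typically asks for instances through DiscoveryClient, and what it gets back reflects the server's response cache rather than the instant state of the eviction scheduler. A minimal sketch (the service id "service-1" is a stand-in for service#1):

import java.util.List;

import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.discovery.DiscoveryClient;
import org.springframework.stereotype.Service;

// Minimal sketch (assumes a standard Spring Cloud client of the same Eureka server).
// Whether srv#1inst#1 still shows up here depends on the server's ~30s response
// cache, not only on whether the eviction has already run.
@Service
public class Service1Lookup {

    private final DiscoveryClient discoveryClient;

    public Service1Lookup(DiscoveryClient discoveryClient) {
        this.discoveryClient = discoveryClient;
    }

    public List<ServiceInstance> service1Instances() {
        // Returns whatever the (cached) registry view contains at this moment.
        return discoveryClient.getInstances("service-1");
    }
}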
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How that will affect the response to instance of service#2's request for fresh registry? How will it affect the eviction scheduler?
A: Again, a few things:
When the actual eviction happens, we check each evictee's time to see if it is eligible to be evicted. If an instance is able to renew its heartbeats before this event, then it is no longer a target for eviction.
The 3 events in question (evaluation of eviction eligibility at eviction time, updating the heartbeat status of an instance, generation of the result to be returned to the read operations) all happen asynchronously, and their results will depend on the evaluation of the above described criteria at execution time. A simplified illustration of the eviction-time check follows.
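The sketch below is deliberately simplified and is not Eureka's actual code; it only mirrors the rule that a renewal arriving before the evictor runs removes the instance from the eviction candidates:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration only - not Eureka's implementation. Eligibility is
// re-evaluated when the evictor runs, so a heartbeat (or heartbeat retry) that
// arrives first keeps the instance in the registry.
public class EvictionRuleSketch {

    private static final long LEASE_EXPIRY_MS = 90_000; // ~3 missed 30s heartbeats

    private final Map<String, Long> lastRenewalMs = new ConcurrentHashMap<>();

    // Invoked by the heartbeat handler thread (e.g. the retried heartbeat of srv#1inst#1).
    public void renew(String instanceId) {
        lastRenewalMs.put(instanceId, System.currentTimeMillis());
    }

    // Invoked periodically by the eviction scheduler.
    public void evictExpired() {
        long now = System.currentTimeMillis();
        lastRenewalMs.entrySet().removeIf(entry -> now - entry.getValue() > LEASE_EXPIRY_MS);
    }
}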

Related

How to recover JMS inbound-gateway containers on Active MQ server failure when number of retry is changed to limited

A JMS inbound gateway is used for request processing on the worker side. The CustomMessageListenerContainer class is configured with a back-off policy that limits the maximum number of attempts.
In some scenarios, when the ActiveMQ server is not responding before the max attempts limit is reached, the container is stopped with the message below.
"Stopping container for destination 'senExtractWorkerInGateway': back-off policy does not allow for further attempts."
Wondering if there is any configuration available to recover these containers once ActiveMQ is available again.
A sample configuration is given below.
<int-jms:inbound-gateway
id="senExtractWorkerInGateway"
container-class="com.test.batch.worker.CustomMessageListenerContainer"
connection-factory="jmsConnectionFactory"
correlation-key="JMSCorrelationID"
request-channel="senExtractProcessingWorkerRequestChannel"
request-destination-name="senExtractRequestQueue"
reply-channel="senExtractProcessingWorkerReplyChannel"
default-reply-queue-name="senExtractReplyQueue"
auto-startup="false"
concurrent-consumers="25"
max-concurrent-consumers="25"
reply-timeout="1200000"
receive-timeout="1200000"/>
You probably can emit some ApplicationEvent from the applyBackOffTime() of your CustomMessageListenerContainer when the super call returns false. This way you would know that something is wrong with the ActiveMQ connection. At that moment you also need to stop() your senExtractWorkerInGateway - just autowire it into some controlling service as a Lifecycle. When you are done fixing the connection problem, you just need to start this senExtractWorkerInGateway back up; the CustomMessageListenerContainer is going to be started automatically. A rough sketch of the container override is shown below.
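A rough sketch of that idea (assumptions: your CustomMessageListenerContainer extends DefaultMessageListenerContainer, the container gets an ApplicationEventPublisher injected, and BackOffExhaustedEvent is your own event type, not a Spring class):

import org.springframework.context.ApplicationEvent;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.context.ApplicationEventPublisherAware;
import org.springframework.jms.listener.DefaultMessageListenerContainer;

// Sketch only. When the back-off policy is exhausted, super.applyBackOffTime()
// returns false and the container is about to stop itself; we publish an event
// so a controlling service can stop and later restart the gateway.
public class CustomMessageListenerContainer extends DefaultMessageListenerContainer
        implements ApplicationEventPublisherAware {

    private ApplicationEventPublisher publisher;

    @Override
    public void setApplicationEventPublisher(ApplicationEventPublisher publisher) {
        this.publisher = publisher;
    }

    @Override
    protected boolean applyBackOffTime() {
        boolean shouldContinue = super.applyBackOffTime();
        if (!shouldContinue && publisher != null) {
            publisher.publishEvent(new BackOffExhaustedEvent(this)); // hypothetical event type
        }
        return shouldContinue;
    }

    // Hypothetical event used for illustration.
    public static class BackOffExhaustedEvent extends ApplicationEvent {
        public BackOffExhaustedEvent(Object source) {
            super(source);
        }
    }
}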

Will mongock work correctly with kubernetes replicas?

Mongock looks very promising. We want to use it inside a kubernetes service that has multiple replicas that run in parallel.
We are hoping that when our service is deployed, the first replica will acquire the mongockLock and all of its ChangeLogs/ChangeSets will be completed before the other replicas attempt to run them.
We have a single instance of mongodb running in our kubernetes environment, and we want the mongock ChangeLogs/ChangeSets to execute only once.
Will the mongockLock guarantee that only one replica will run the ChangeLogs/ChangeSets to completion?
Or do I need to enable transactions (or some other configuration)?
I am going to provide the short answer first and then the long one. I suggest you read the long one too in order to understand it properly.
Short answer
By default, Mongock guarantees that the ChangeLogs/ChangeSets will be run by only one pod at a time: the one owning the lock.
Long answer
What really happens behind the scenes (if it's not configured otherwise) is that when a pod takes the lock, the others will try to acquire it too, but they can't, so they are forced to wait for a while (configurable, but 4 mins by default) as many times as the lock is configured (3 times by default). After this, if it's not able to acquire it and there are still pending changes to apply, Mongock will throw a MongockException, which should mean the JVM startup fails (what happens by default in Spring).
This is fine in Kubernetes, because it ensures the pods will be restarted.
So now, assuming the pods start again and the changeLogs/changeSets are already applied, the pods start successfully because they don't even need to acquire the lock, as there are no pending changes to apply. (A conceptual sketch of this acquire-wait-fail behaviour follows.)
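The sketch below is conceptual only - it is not Mongock's code or API; the try count and wait time are just the defaults mentioned above:

import java.time.Duration;

// Conceptual sketch only - not Mongock's implementation. It illustrates the
// "wait up to N times, then fail the startup" behaviour described above.
public class MigrationLockSketch {

    static final int MAX_TRIES = 3;                              // default mentioned above
    static final Duration WAIT_PER_TRY = Duration.ofMinutes(4);  // default mentioned above

    void runMigrationsOrFail(DistributedLock lock, Runnable pendingChangeSets) {
        for (int attempt = 1; attempt <= MAX_TRIES; attempt++) {
            if (lock.tryAcquire(WAIT_PER_TRY)) {
                try {
                    pendingChangeSets.run();   // changeLogs/changeSets run by exactly one pod
                    return;
                } finally {
                    lock.release();
                }
            }
        }
        // Still pending changes and no lock: fail the JVM startup so Kubernetes restarts the pod.
        throw new IllegalStateException("Could not acquire migration lock");
    }

    // Hypothetical lock abstraction used for illustration.
    interface DistributedLock {
        boolean tryAcquire(Duration maxWait);
        void release();
    }
}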
Potential problem with MongoDB without transaction support and Frameworks like Spring
Now, assuming the lock and the mutual exclusion are clear, I'd like to point out a potential issue that needs to be mitigated by the changeLog/changeSet design.
This issue applies if you are in an environment such as Kubernetes, which has a pod initialisation time, your migration takes longer than that initialisation time, and the Mongock process is executed before the pod becomes ready/healthy (and is a condition for it). This last condition is highly desired, as it ensures the application runs with the right version of the data.
In this situation, imagine the pod starts the Mongock process. After the Kubernetes initialisation time, the process is still not finished, but Kubernetes stops the JVM abruptly. This means that some changeSets were successfully executed, some others not even started (no problem, they will be processed in the next attempt), but one changeSet was partially executed and marked as not done. This is the potential issue. The next time Mongock runs, it will see the changeSet as pending and it will execute it from the beginning. If you haven't designed your changeLogs/changeSets accordingly, you may experience some unexpected results, because some part of the data process covered by that changeSet has already taken place and it will happen again.
This somehow needs to be mitigated: either with the help of mechanisms like transactions, with a changeLog/changeSet design that takes this into account, or both.
Mongock currently provides transactions with "all or nothing", but it doesn't really help much, as it will retry every time from scratch and will probably end up in an infinite loop. The next version 5 will provide transactions per ChangeLog and ChangeSet, which, together with good organisation, is the right solution for this.
Meanwhile, this issue can be addressed by following these design suggestions; a sketch of an idempotent changeSet is shown below.
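For illustration only, an idempotent changeSet might look like this (assumptions: the Mongock v4 annotation API, a driver that injects MongoDatabase, and made-up collection/field names):

import com.github.cloudyrock.mongock.ChangeLog;
import com.github.cloudyrock.mongock.ChangeSet;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;

// Sketch only (Mongock v4 annotations assumed; collection and field names are made up).
// The filter only matches documents that have not been migrated yet, so re-running
// the changeSet after a partial execution neither duplicates nor corrupts data.
@ChangeLog(order = "001")
public class DatabaseChangeLog {

    @ChangeSet(order = "001", id = "addDefaultStatus", author = "dev-team")
    public void addDefaultStatus(MongoDatabase db) {
        db.getCollection("users").updateMany(
                Filters.exists("status", false),   // skip documents already migrated
                Updates.set("status", "ACTIVE"));
    }
}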
Just to follow up... Mongock's locking mechanism works fine with replicas. To solve the "long-running script" problem, we will run our Mongock scripts from a Kubernetes initContainer. K8s will wait for the initContainers to finish before it starts the pod's main service containers.
For transactions, we will follow the advice above of making our scripts idempotent.

Is there downtime when a partition is moved to a new node?

Service Fabric offers the capability to rebalance partitions whenever a node is removed or added to the cluster. The Service Fabric Cluster Resource Manager will move one or more partitions to this node so more work can be done.
Imagine a reliable actor service which has thousands of actors running who are distributed across multiple partitions. If the Resource Manager decides to move one or more partitions, will this cause any downtime? Or does rebalancing partitions work the same as upgrading a service?
They act pretty much the same way. The main difference I can point out is that upgrades might affect only the services being upgraded, while re-balancing might affect multiple services at once. During an upgrade, the cluster might re-balance the services as well to fit the new service instance on a node.
Adding or removing nodes I would compare more with node failures. In any of these cases, services will be rebalanced because the cluster capacity changed, not because the service metric/load changed.
The main difference between a node failure and cluster scaling (add/remove node) is that rebalancing will take the service states into account during the process: when an infrastructure notification comes in telling that a node is being shut down (for updates or maintenance, or scaling down), SF will ask the infrastructure to wait so it can prepare for this announced 'failure', and then start re-balancing the services.
Even though re-balancing cares about the service states for a scale-down, it should not be considered more reliable than a node failure. The infrastructure will wait for a while before shutting down the node (the limit it can wait depends on the reliability tier you defined for your cluster) while SF checks whether the services meet health conditions, like shutting down services and creating new ones and checking that they will run fine without errors. If this process takes too long, these services might be killed once the timeout is reached and the infrastructure proceeds with the changes. Also, the new instances of the services might fail on the new nodes, forcing the services to move again.
When you design the services, it is safer to treat re-balancing as a node failure, because in the end it is not much different: your services will move around, data stored in memory will be lost if not persisted, the service address will change, etc. The services should have replicated data, and the clients should always use retry logic and refresh the service location to reduce the downtime, as in the sketch below.
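The sketch below is conceptual and does not use a Service Fabric API; "EndpointResolver" is a hypothetical abstraction standing in for whatever naming/resolution mechanism your client uses. The point is only that the address is re-resolved on every retry, so a partition that moved to another node is picked up after a short delay:

import java.time.Duration;

// Conceptual sketch (no Service Fabric API used): retry logic that re-resolves
// the service address on every failure, so a moved replica is found again.
public class ResilientCaller {

    // Hypothetical abstraction used for illustration.
    interface EndpointResolver {
        String resolveCurrentEndpoint();   // e.g. ask the naming service again
    }

    interface CallWithEndpoint<T> {
        T execute(String endpoint) throws Exception;
    }

    <T> T callWithRetry(EndpointResolver resolver, CallWithEndpoint<T> call,
                        int maxAttempts, Duration backoff) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String endpoint = resolver.resolveCurrentEndpoint(); // refresh location each time
            try {
                return call.execute(endpoint);
            } catch (Exception e) {
                last = e;                         // the replica may have moved; wait and retry
                Thread.sleep(backoff.toMillis());
            }
        }
        throw last;
    }
}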
The main difference between a service upgrade and service rebalancing is that during an upgrade all replicas from all partitions get turned off on a particular node. According to the documentation here, balancing is done on a per-replica basis, i.e. only some replicas from some partitions will get moved, so there shouldn't be any outage.

Warmup services on upgrade in Service Fabric

We are wondering if there is a built-in way to warm up services as part of the service upgrades in Service Fabric, similar to the various ways you could warm up e.g. IIS based app pools before they are hit by requests. Ideally we want the individual services to perform some warm-up tasks as part of their initialization (could be cache loading, recovery etc.) before being considered as started and available for other services to contact. This warmup should be part of the upgrade domain processing so the upgrade process should wait for the warmup to be completed and the service reported as OK/Ready.
How are others handling such scenarios, i.e. controlling the process for signalling to Service Fabric that a specific service is fully started and ready to be contacted by other services?
In the health policy there's this concept:
HealthCheckWaitDurationSec The time to wait (in seconds) after the upgrade has finished on the upgrade domain before Service Fabric evaluates the health of the application. This duration can also be considered as the time an application should be running before it can be considered healthy. If the health check passes, the upgrade process proceeds to the next upgrade domain. If the health check fails, Service Fabric waits for an interval (the UpgradeHealthCheckInterval) before retrying the health check again until the HealthCheckRetryTimeout is reached. The default and recommended value is 0 seconds.
Source
This is a fixed wait period though.
You can also emit Health events yourself. For instance, you can report health 'Unknown' while warming up. And adjust your health policy (HealthCheckWaitDurationSec) to check this.
Reporting health can help. You can't report Unknown, you must report Error very early on, then clear the Error when your service is ready. Warning and Ok do not impact upgrade. To clear the Error, your service can report health state Ok, RemoveWhenExpired=true, low TTL (read more on how to report).
You must increase HealthCheckRetryTimeout based on the max warm-up time. Otherwise, if a health check is performed and the cluster is evaluated to Error, the upgrade will fail (and roll back or pause, per your policy).
So, the order of events is as follows (a rough sketch of this flow comes after the list):
your service reports Error - "Warming up in progress"
upgrade waits for fixed HealthCheckWaitDurationSec (you can set this to min time to warm up)
upgrade performs health checks: if the service hasn't yet warmed up, the health state is Error, so upgrade retries until either HealthCheckRetryTimeout is reached or your service is not in Error anymore (warm up completed and your service cleared the Error).
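As a rough illustration of that sequence on the service side, the logic could look like this ("HealthReporter" is a hypothetical wrapper around whatever health-reporting API your Service Fabric SDK exposes; only the Error/Ok states and the RemoveWhenExpired/TTL idea come from the answer above):

import java.time.Duration;

// Rough illustration of the warm-up/health sequence above. HealthReporter is a
// hypothetical abstraction, not a Service Fabric API.
public class WarmupHealthFlow {

    interface HealthReporter {
        void reportError(String property, String description);
        // Ok report with RemoveWhenExpired=true and a low TTL, which clears the earlier Error.
        void reportOkRemoveWhenExpired(String property, Duration ttl);
    }

    void startWithWarmup(HealthReporter health, Runnable warmup) {
        // 1. Report Error as early as possible so the upgrade health check cannot pass yet.
        health.reportError("Warmup", "Warming up in progress");

        // 2. Do the actual warm-up work (cache loading, recovery, ...).
        warmup.run();

        // 3. Clear the Error: Ok with RemoveWhenExpired=true and a low TTL.
        health.reportOkRemoveWhenExpired("Warmup", Duration.ofSeconds(30));
    }
}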

Zookeeper Failover Strategies

We are a young team building an application using Storm and Kafka.
We have a common Zookeeper ensemble of 3 nodes which is used by both Storm and Kafka.
I wrote a test case to test Zookeeper failovers:
1) Check all the three nodes are running and confirm one is elected as a Leader.
2) Using the Zookeeper unix client, created a znode and set a value. Verify the value is reflected on the other nodes.
3) Modify the znode. Set a value in one node and verify the other nodes have the change reflected.
4) Kill one of the worker nodes and make sure the master/leader is notified about the crash.
5) Kill the leader node. Verify out of other two nodes, one is elected as a leader.
Do I need to add any more test cases? Any additional ideas/suggestions/pointers to add? (A small sketch of how I automate steps 2 and 3 is included below.)
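For reference, steps 2 and 3 can be automated with the ZooKeeper Java client roughly like this (a minimal sketch; the connection strings, path and values are made up, and error handling is omitted):

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch for steps 2 and 3: write through one server, read back through
// another to verify the data is replicated across the ensemble. (A follower can
// lag slightly; call sync() first for a strict check.)
public class ZnodeReplicationCheck {

    public static void main(String[] args) throws Exception {
        ZooKeeper writer = new ZooKeeper("zk-node-1:2181", 15000, event -> { });
        ZooKeeper reader = new ZooKeeper("zk-node-2:2181", 15000, event -> { });

        // Step 2: create a znode with a value via one node.
        writer.create("/failover-test", "v1".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Step 3: modify the value and verify the change is visible from another node.
        writer.setData("/failover-test", "v2".getBytes(StandardCharsets.UTF_8), -1);
        byte[] seenByOtherNode = reader.getData("/failover-test", false, null);
        System.out.println("Value seen via zk-node-2: "
                + new String(seenByOtherNode, StandardCharsets.UTF_8));

        writer.close();
        reader.close();
    }
}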
From the documentation
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
more on setting up automatic failover