We have an Akka Cluster application (sharding some actors). Sometimes, when we deploy and our application should be shut down, we see logs like this:
Coordinated shutdown phase [cluster-sharding-shutdown-region] timed out after 10000 milliseconds
This happens on the first deploy after more than 2 days since the last deploy (on Mondays, for example). We ask the Akka node to leave the cluster with the JMX helper, and we also have the following code:
actorSystem.registerOnTermination {
  logger.error("Gracefully shutdown of node")
  System.exit(0)
}
So when this error happens, the node eventually leaves the cluster (or at least it closes the JMX entry point used to manage the Akka cluster), but the process doesn't finish and the log "Gracefully shutdown of node" doesn't appear. When this happens we need to shut down the Java process manually (we handle this with supervisor) and redeploy.
I know the timeout can be tuned through config, but what are the implications of increasing it? Why does coordinated shutdown sometimes time out? What happens when coordinated shutdown times out?
Any clue would be appreciated :D
Thank you
What happens after the timeout? Quoting from the Akka documentation:
If tasks are not completed within a configured timeout (see reference.conf) the next phase will be started anyway. It is possible to configure recover=off for a phase to abort the rest of the shutdown process if a task fails or is not completed within the timeout.
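For reference, each phase's timeout can be tuned in your own configuration; a minimal sketch for the phase named in the log above (the 30 s value is only an example):
akka.coordinated-shutdown.phases.cluster-sharding-shutdown-region {
  timeout = 30 s   # your log above shows the current value, 10000 ms
  recover = on     # set to off to abort the remaining phases if this one fails or times out
}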
Why might the shutdown time out? Quite possibly you have a deadlock somewhere; in that case, increasing the timeout wouldn't help. It may also very well be that you simply need more time to shut down, in which case you must increase the timeout.
But the following could be more closely related to your problem:
By default, the JVM is not forcefully stopped (it will be stopped if all non-daemon threads have been terminated). To enable a hard System.exit as a final action you can configure:
akka.coordinated-shutdown.exit-jvm = on
So you can turn this on, which should take care of the "shut down the Java process manually" step.
Nevertheless, the hard question is why the shutdown times out in the first place. I guess the trick above lets you survive for a while, but you'd better spend some time finding the actual cause.
We used to face this problem (one of the coordinated shutdown phases timing out) with a short-lived application.
Use case where we faced this:
Application joins existing akka cluster
Does some work
Leaves the cluster
But at step 3 the member's status was still Joining or WeaklyUp, and if you look at the task added for PhaseClusterLeave, it only removes a member from the cluster if its status is Up.
Snippet from ClusterDaemon.scala, which is invoked when the ClusterLeave phase runs:
def leaving(address: Address): Unit = {
  // only try to update if the node is available (in the member ring)
  if (latestGossip.members.exists(m ⇒ m.address == address && m.status == Up)) {
    val newMembers = latestGossip.members map { m ⇒ if (m.address == address) m.copy(status = Leaving) else m } // mark node as LEAVING
    val newGossip = latestGossip copy (members = newMembers)
    updateLatestGossip(newGossip)
    logInfo("Marked address [{}] as [{}]", address, Leaving)
    publishMembershipState()
    // immediate gossip to speed up the leaving process
    gossip()
  }
}
To solve this problem, we ended up writing our own CoordinatedShutdown, which you can refer to here: CswCoordinatedShutdown.scala
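For illustration only (not the actual CswCoordinatedShutdown linked above), a minimal sketch of registering an extra task in the cluster-leave phase with the standard CoordinatedShutdown addTask API might look like this; the task name and the down-if-not-Up decision are assumptions of the sketch:
import akka.Done
import akka.actor.{ActorSystem, CoordinatedShutdown}
import akka.cluster.{Cluster, MemberStatus}
import scala.concurrent.Future

def registerForceLeaveTask(system: ActorSystem): Unit =
  CoordinatedShutdown(system).addTask(
    CoordinatedShutdown.PhaseClusterLeave, "down-self-if-not-up") { () =>
    val cluster = Cluster(system)
    val selfIsUp = cluster.state.members.exists(m =>
      m.uniqueAddress == cluster.selfUniqueAddress && m.status == MemberStatus.Up)
    // leaving() above only acts on Up members, so a Joining/WeaklyUp member
    // can be downed instead to get it removed from the member ring
    if (!selfIsUp) cluster.down(cluster.selfAddress)
    Future.successful(Done)
  }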
We have applications that work with Kafka (MSK). We noticed that once a pod starts to shut down (during autoscaling or a deployment), the app container loses all active connections: the SIGTERM signal causes Kuma to close all connections immediately, which causes data loss due to unfinished sessions (which don't get closed gracefully) on the app side, and after that we get connection errors to the Kafka brokers.
Does anyone have an idea how to make Kuma wait some time once it gets the SIGTERM signal, to let the sessions close gracefully?
Or maybe a way to let the app know about the shutdown before Kuma does?
Or any other idea?
This is a known issue that is being fixed in the upcoming 1.7 release: https://github.com/kumahq/kuma/pull/4229
I just switched from ForkPool to gevent with a concurrency of 5 as the pool method for Celery workers running in Kubernetes pods. After the switch I've been getting a non-recoverable error in the worker:
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
The broker logs give basically the same message:
2021-11-01 22:26:17.251 [warning] <0.18574.1> Consumer None4 on channel 1 has timed out waiting for delivery acknowledgement. Timeout used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
I have CELERY_ACK_LATE set up, but I was not familiar with the need to set a timeout for the acknowledgement period, and that never happened before with processes. Tasks can be fairly long (60-120 seconds sometimes), but I can't find a specific setting to allow for that.
I've read a post in another forum where a user set the timeout in the broker configuration to a huge number (like 24 hours) and was still having the same problem, so that makes me think there may be something else going on.
Any ideas or suggestions on how to make the worker more resilient?
For future reference, it seems that newer RabbitMQ versions (3.8+) introduced a tight default for consumer_timeout (the logs above show 1800000 ms, i.e. 30 minutes).
The solution I found (which has also been added to the Celery docs not long ago, here) was to just set consumer_timeout to a large number in RabbitMQ.
In this question, someone mentions setting consumer_timeout to false, so that using a large number is not needed, but apparently there are some specifics regarding the configuration format for that to work.
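If I remember the RabbitMQ consumers guide correctly, disabling the timeout entirely has to go into the Erlang-term advanced.config rather than the ini-style rabbitmq.conf; treat the following as an unverified sketch:
%% advanced.config - deactivates the delivery acknowledgement timeout
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].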
I'm running RabbitMQ in k8s and just did something like:
rabbitmq.conf: |
  consumer_timeout = 31622400000
The accepted answer is correct. However, if you have an existing RabbitMQ server running and do not want to restart it, you can set the configuration value dynamically by running the following command on the RabbitMQ server:
rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
This will set the new timeout to 10 hours (36000000 ms). For this to take effect, you need to restart your workers, though; existing worker connections will continue to use the old timeout.
You can check the current configured timeout value as well:
rabbitmqctl eval 'application:get_env(rabbit, consumer_timeout).'
If you are running RabbitMQ via the Docker image, here's how to set the value: simply add -e RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-rabbit consumer_timeout 36000000" to your docker run, or set the environment variable RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS to "-rabbit consumer_timeout 36000000".
Hope this helps!
I'm trying to verify that shutdown completes cleanly on Kubernetes, with a .NET Core 2.0 app.
I have an app which can run in two "modes" - one using ASP.NET Core and one as a kind of worker process. Both use Console logger output and JSON logger output (which ends up in Elasticsearch via a Filebeat sidecar container) to indicate startup and shutdown progress.
Additionally, I have console output that writes directly to stdout when a SIGTERM or Ctrl-C is received and shutdown begins.
Locally, the app works flawlessly - I get the direct console output, then the logger output flowing to stdout on Ctrl+C (on Windows).
My experiment scenario:
App deployed to GCS k8s cluster (using helm, though I imagine that doesn't make a difference)
Using kubectl logs -f to stream logs from the specific container
Killing the pod from GCS cloud console site, or deleting the resources via helm delete
Dockerfile is FROM microsoft/dotnet:2.1-aspnetcore-runtime and has ENTRYPOINT ["dotnet", "MyAppHere.dll"], so not wrapped in a bash process or anything
Not specifying terminationGracePeriodSeconds, so I guess it defaults to 30 sec
Observing output returned
Results:
The API pod log streaming showed just the immediate console output, "[SIGTERM] Stop signal received", but not the other Console logger output about the shutdown process
The worker pod log streaming showed a little more - the same console output and some Console logger output about the shutdown process
The JSON logs didn't seem to pick up any of the shutdown log output
My conclusions:
I don't know whether Kubernetes is allowing the process to complete before terminating it, or just issuing SIGTERM and then killing things very quickly. I think it should be waiting, but then why is there no complete console logger output?
I don't know whether console output gets cut off from the stdout log stream at some point before the process finally terminates
I would guess that the JSON stuff doesn't come through to ES because Filebeat running in the sidecar terminates even if there's outstanding stuff in the files left to send
I would like to know:
Can anyone advise on points 1 and 2 above?
Any ideas for a way to allow a little extra time or leeway for the sidecar to send stuff up - e.g. a pod container termination order, a delay on shutdown for that container, etc.?
SIGTERM does indeed signal termination. The less obvious part is that when the SIGTERM handler returns, everything is considered finished.
The fix is to not return from the SIGTERM handler until the app has finished shutting down. For example, using a ManualResetEvent and Wait()ing it in the handler.
I've started to look into this for my own purposes and came across your question over a year after it was posted... This is a bit late, but have you tried GraceTerm?
There is an associated NuGet package for it.
From the description...
Graceterm middleware provides implementation to ensure graceful shutdown of AspNet Core applications. The basic concept is: After application received a SIGTERM (a signal asking it to terminate), Graceterm will hold it alive till all pending requests are completed or a timeout occur.
I haven't personally tried this yet, but it does look promising.
Try adding STOPSIGNAL SIGINT to your Dockerfile.
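Based on the Dockerfile described in the question, that would look roughly like this:
FROM microsoft/dotnet:2.1-aspnetcore-runtime
# make the container runtime send SIGINT instead of the default SIGTERM on stop
STOPSIGNAL SIGINT
ENTRYPOINT ["dotnet", "MyAppHere.dll"]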
Consider this set-up:
Eureka server with self preservation mode disabled i.e. enableSelfPreservation: false
2 Eureka instances each for 2 services (say service#1 and service#2). Total 4 instances.
And one of the instances (say srv#1inst#1, an instance of service#1) sent a heartbeat, but it did not reach the Eureka server.
AFAIK, following actions take place in sequence on Server side:
ServerStep1: Server observes that a particular instance has missed a heartbeat.
ServerStep2: Server marks the instance for eviction.
ServerStep3: Server's eviction scheduler (which runs periodically) evicts the instance from registry.
Now on instance (srv#1inst#1) side:
InstanceStep1: It skips a heartbeat.
InstanceStep2: It realizes the heartbeat did not reach the Eureka server and retries with exponential back-off.
AFAIK, eviction and registration do not happen immediately; the Eureka server runs separate schedulers for both tasks periodically.
I have some questions related to this process:
Are the sequences correct? If not, what did I miss?
Is the assumption about eviction and registration scheduler correct?
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How will that affect the response to the service#2 instance's request for a fresh registry? How will it affect the eviction scheduler?
This question was answered by qiangdavidliu in one of the issues of eureka's GitHub repository.
I'm adding his explanations here for the sake of completeness.
Before I answer the questions specifically, here's some high level information regarding heartbeats and evictions (based on default configs):
instances are only evicted if they miss 3 consecutive heartbeats
(most) heartbeats do not retry; they are best-effort every 30s. The only time a heartbeat will retry is if there is a thread-level error on the heartbeating thread (i.e. Timeout or RejectedExecution), but this should be very rare.
Let me try to answer your questions:
Are the sequences correct? If not, what did I miss?
A: The sequences are correct, with the above clarifications.
Is the assumption about eviction and registration scheduler correct?
A: The eviction is handled by an internal scheduler. The registration is processed by the handler thread for the registration request.
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
A: There are a few things here:
until the instance is actually evicted, it will be part of the result
eviction does not involve changing the instance's status, it merely removes the instance from the registry
the server holds 30s caches of the state of the world, and it is this cache that's returned. So the exact result as seen by the client in an eviction scenario still depends on where it falls within the cache's update cycle.
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How will that affect the response to the service#2 instance's request for a fresh registry? How will it affect the eviction scheduler?
A: again a few things:
When the actual eviction happens, we check each evictee's time to see if it is eligible to be evicted. If an instance is able to renew its heartbeats before this event, it is no longer a target for eviction.
The three events in question (evaluation of eviction eligibility at eviction time, updating the heartbeat status of an instance, and generation of the result returned to read operations) all happen asynchronously, and their outcome depends on the evaluation of the criteria described above at execution time.
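For reference, assuming a Spring Cloud Netflix setup (the thread does not say which client/server wrapper is in use), these are the property names behind the intervals described above, shown with their usual defaults:
eureka:
  instance:
    leaseRenewalIntervalInSeconds: 30        # heartbeat interval
    leaseExpirationDurationInSeconds: 90     # about 3 missed heartbeats before the instance becomes an eviction candidate
  server:
    evictionIntervalTimerInMs: 60000         # how often the eviction scheduler runs
    responseCacheUpdateIntervalMs: 30000     # the 30s "state of the world" cache mentioned above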
I am running CP 3.2 in distributed mode, and some of the connectors, even those defined with "tasks.max": "1", have tasks in the "UNASSIGNED" state. Increasing the memory allocated to the worker and restarting it solved the problem for me, as did adding one more worker.
It's OK for me if connectors with "tasks.max" > 1 have some tasks in the "UNASSIGNED" state, but if I define only one task it should be in the "RUNNING" state.
I need to understand under what conditions a task goes into the "UNASSIGNED" state and how to fix it (make it RUNNING).
Regards,
Aradhya
A task goes into the UNASSIGNED state when a successful shutdown has occurred on the worker task that was assigned to run this connector task. This is regardless of the total number of tasks the connector is supposed to spawn (the tasks.max property). You can track this in the code by following the calls to the onShutdown method in AbstractHerder. Transitioning to the UNASSIGNED state requires that no failure has happened, no exception has been thrown by the running worker task, and that a normal shutdown has been triggered.
Is there a reason your connector task might be stopping at the very start of its regular iteration loop? Can you give a bit more information? Is it a source or a sink?
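While gathering that information, the Connect REST API (assuming the default port 8083, as used elsewhere in this thread) can show the current state of the connector and its tasks, including any error trace, and can restart an individual task:
curl -s localhost:8083/connectors/<connector-name>/status
curl -X POST localhost:8083/connectors/<connector-name>/tasks/0/restart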
In my case, my connector went to UNASSIGNED just because I ran 2 or 4 connectors in parallel; while Debezium started working on the connectors it got confused and stopped working on one of them, i.e. that connector went into the UNASSIGNED state.
Maybe this happens because all my connectors were collecting heavy data from several databases in different locations; when I checked the Debezium log it showed the connector being stopped, and finally, once Debezium's tolerance limit was crossed, it stopped the MySQL connection.
To solve this issue I just restarted Docker Compose or the connector; both helped put my connector back into the RUNNING state:
docker-compose restart
or
curl -X POST localhost:8083/connectors/<connector-name>/restart