When does a task in a connector move to unassigned state? - apache-kafka

I am running CP 3.2 in distributed mode, and some of the connectors, even those defined with "tasks.max": "1", have a task in the "UNASSIGNED" state. Increasing the memory allocated to the worker and restarting it solved the problem, as did adding one more worker.
It is fine with me if a connector with "tasks.max" > 1 has some tasks in the "UNASSIGNED" state, but if I define only one task it should be in the "RUNNING" state.
I need to understand under which conditions a task goes into the "UNASSIGNED" state and how to fix it (get it back to RUNNING).

A task goes into the UNASSIGNED state if a successful shutdown has occurred on the worker task that was assigned to run this connector task. This is regardless of the total number of tasks the connector is supposed to spawn (the tasks.max property). You can track this in the code by following the calls to the onShutdown method in AbstractHerder. Transitioning to the UNASSIGNED state requires that no failure has happened, no exception has been thrown by the running worker task, and a normal shutdown has been triggered.
Is there a reason that your connector task might be stopping at the very start of its regular iteration loop? Can you give a bit more information? Is it a Source or a Sink?
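For reference, the per-task state (RUNNING, UNASSIGNED, FAILED, ...) and the worker each task is assigned to can be read from the Connect REST status endpoint. A minimal sketch, assuming the default REST port 8083 and a hypothetical connector named my-connector:
import requests

status = requests.get("http://localhost:8083/connectors/my-connector/status").json()
print("connector state:", status["connector"]["state"])
for task in status["tasks"]:
    # Each entry reports the task id, its state and the worker it is assigned to.
    print(task["id"], task["state"], task["worker_id"])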

In my case the connector went to UNASSIGNED just because I ran 2 to 4 connectors in parallel; while Debezium started working on the connectors it got confused and stopped working on one of them, i.e. that connector went to the UNASSIGNED state.
This may have happened because all of my connectors were collecting heavy data from different databases in different locations; when I checked the Debezium log it showed that it was stopping the connector, and finally, once Debezium's tolerance limit was crossed, it stopped the MySQL connection.
To solve this issue, restarting either Docker Compose or the connector put my connector back into the running state:
docker-compose restart
or
curl -X POST localhost:8083/connectors/<connector-name>/restart

Related

Apache Kafka Connect Task Restart

I am new to Kafka Connect. I am writing a script that detects failed Kafka Connect tasks and restarts them, but the restart API that Apache Kafka provides doesn't say whether the task was actually restarted; we only know that the restart command was successfully sent.
I want a response indicating whether the task was successfully restarted or not. I could add a wait and check the task's health after the restart command is issued, but the status API also has a delay in reflecting the actual state.
How can I achieve this? Is there a way to synchronously restart the connector tasks?
There are two restart endpoints, by the way: /connectors/<name>/restart and /connectors/<name>/tasks/<task-id>/restart.
But, in my experience, polling the status is the best you can do, with a retry limit on whatever process is sending the restart request.
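As an illustration of that polling approach, here is a minimal sketch, assuming the default REST port 8083 and a hypothetical connector named my-connector; the endpoints used are the standard GET /connectors/<name>/status and POST /connectors/<name>/tasks/<id>/restart:
import time
import requests

BASE = "http://localhost:8083"
CONNECTOR = "my-connector"   # hypothetical connector name
MAX_RETRIES = 5              # give up after this many restart attempts
POLL_INTERVAL = 10           # seconds to wait before re-checking the status

def failed_task_ids():
    status = requests.get(f"{BASE}/connectors/{CONNECTOR}/status").json()
    return [t["id"] for t in status["tasks"] if t["state"] == "FAILED"]

for attempt in range(MAX_RETRIES):
    failed = failed_task_ids()
    if not failed:
        print("all tasks healthy")
        break
    for task_id in failed:
        # The restart call returns before the task is actually up again.
        requests.post(f"{BASE}/connectors/{CONNECTOR}/tasks/{task_id}/restart")
    time.sleep(POLL_INTERVAL)  # give the worker time to reflect the new state
else:
    print("tasks still failing after retries")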

Batch account node restarted unexpectedly

I am using an Azure Batch account to run sqlpackage.exe in order to move databases from one server to another. A task that started 6 days ago was suddenly restarted and began from the beginning after 4 days of running (the databases are extremely large). The task had run uninterruptedly up until then and should have continued for about 1-2 more days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during the execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostic settings configured, so I could only look at the metrics, which showed a short period during which there wasn't any node.
What can cause this behavior - a restart while the node is still running?
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check system logs to give you details on why the node restarted. There is no guarantee that a compute node will have 100% uptime as there may be hardware faults or other service interruptions.
In general, it's best practice to have long running tasks checkpoint progress combined with a retry policy. Programs that can reload state can pick up at the time of the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.
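As an illustration of that checkpoint-and-resume pattern (a sketch in Python rather than the PowerShell used in the question; the checkpoint file and the list of databases are hypothetical):
import json
import os

CHECKPOINT_FILE = "checkpoint.json"          # hypothetical path on the node
databases = ["db-001", "db-002", "db-003"]   # hypothetical work items

def move_database(name):
    # Placeholder for the real work, e.g. invoking sqlpackage.exe for one database.
    print(f"moving {name}")

# Reload progress if the task was rescheduled after a node restart.
done = set()
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE) as f:
        done = set(json.load(f))

for db in databases:
    if db in done:
        continue                             # already finished before the restart
    move_database(db)
    done.add(db)
    # Persist progress after each unit of work so a retry can resume here.
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(sorted(done), f)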

Kafka Connect - how to get a failed task to restart with a new configuration

Whenever we restart a failed task, it will ALWAYS pick up the config it had at the time of the failure and run with that, and THEN it picks up the new config and runs with that as well.
We have connect jobs that we pause, update config, and then resume. This works fine, unless the task has failed.
If we restart a failed task, even if the connector has an updated config, the task will launch with the old config, run to completion/failure, and only then will a new task be launched with the new config.
This can cause various data issues if you really don't want that old task to run with that config.
Any ideas how to restart a connector with a failed task, with a new config, and NOT have the old config get invoked?
(running Kafka v2.5, btw)
I don't know if it would make sense for the task to pick up the latest config.
For instance, let's assume that your connector fires up 10 distinct tasks and 1 of them fails. It wouldn't make sense to have the remaining 9 tasks of the connector running with the older config while the failed task runs the newest config once it is restarted.
I would say that in cases where you want to use a new/different configuration when a task fails, it might make more sense to restart the connector rather than the individual task(s):
POST /connectors/connector-name/restart HTTP/1.1
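To illustrate that suggestion, a minimal sketch (assuming the default REST port 8083, a hypothetical connector named my-connector and hypothetical config values): update the configuration via PUT /connectors/<name>/config, which triggers a reconfiguration, and then restart the connector rather than an individual task:
import requests

BASE = "http://localhost:8083"
NAME = "my-connector"   # hypothetical connector name

new_config = {          # hypothetical configuration values
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "my-topic",
}

# PUT .../config creates or updates the connector configuration.
requests.put(f"{BASE}/connectors/{NAME}/config", json=new_config).raise_for_status()

# Restart the connector (not an individual task) so it comes up with the new config.
requests.post(f"{BASE}/connectors/{NAME}/restart").raise_for_status()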
I was having this problem and managed to "fix" it through a bit of randomness.
I increased the number of tasks in the connector and then reduced it again, and it seemed to pick up the new configuration.
It was really random.
I do know that the restart did not work for me.

CoordinatedShutdown timeout on Akka cluster application

We have an Akka cluster application (sharding some actors). Sometimes, when we deploy and our application should be shut down, we see logs like this:
Coordinated shutdown phase [cluster-sharding-shutdown-region] timed out after 10000 milliseconds
This happens on the first deploy after more than 2 days since the last deploy (on Mondays, for example). We ask the Akka node to quit the cluster with the JMX helper, and we have the following code too:
actorSystem.registerOnTermination {
  logger.error("Gracefully shutdown of node")
  System.exit(0)
}
So when this error happens, the node eventually leaves the cluster (or at least it closes the JMX entry point for managing the Akka cluster), but the process doesn't finish and the log line "Gracefully shutdown of node" doesn't appear. When this happens we need to shut down the Java process manually (we handle this with supervisor) and redeploy.
I know the timeout can be tuned through config, but what are the implications of increasing this timeout? Why does coordinated shutdown sometimes time out? And what happens when it does?
What happens after timeout? Quoting from Akka documentation:
If tasks are not completed within a configured timeout (see reference.conf) the next phase will be started anyway. It is possible to configure recover=off for a phase to abort the rest of the shutdown process if a task fails or is not completed within the timeout.
Why might the shutdown time out? Quite possibly you have a deadlock somewhere. In that case, increasing the timeout wouldn't help. It may also very well be that you simply need more time for shutdown. In that case, you must increase the timeout.
But more relevant to your problem could be the following:
By default, the JVM is not forcefully stopped (it will be stopped if all non-daemon threads have been terminated). To enable a hard System.exit as a final action you can configure:
akka.coordinated-shutdown.exit-jvm = on
So you can turn this on, which should solve the "shutdown the java process manually" step.
Nevertheless, the hard question is to find out why the shutdown times out in the first place. I guess with the above trick you can survive for some time, but you'd better spend some time to find the actual cause.
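On tuning the timeout mentioned above: the per-phase timeouts live under akka.coordinated-shutdown.phases in reference.conf. A sketch for the phase named in the log message (the 30s value is only an example; the default is 10s):
akka.coordinated-shutdown.phases.cluster-sharding-shutdown-region.timeout = 30s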
We used to face this problem (a timeout in one of the coordinated shutdown phases) with a short-lived application.
The use case where we faced this:
1. The application joins an existing Akka cluster
2. Does some work
3. Leaves the cluster
But at step 3 the status of the member was still Joining or WeaklyUp, and if you look at the task added for PhaseClusterLeave, it only removes a member from the cluster if its status is Up.
Snippet from ClusterDaemon.scala, which is invoked when the ClusterLeave phase runs:
def leaving(address: Address): Unit = {
  // only try to update if the node is available (in the member ring)
  if (latestGossip.members.exists(m ⇒ m.address == address && m.status == Up)) {
    val newMembers = latestGossip.members map { m ⇒ if (m.address == address) m.copy(status = Leaving) else m } // mark node as LEAVING
    val newGossip = latestGossip copy (members = newMembers)
    updateLatestGossip(newGossip)
    logInfo("Marked address [{}] as [{}]", address, Leaving)
    publishMembershipState()
    // immediate gossip to speed up the leaving process
    gossip()
  }
}
To solve this problem, we ended up writing our own CoordinatedShutdown, which you can refer to here: CswCoordinatedShutdown.scala

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I did the upgrade of Concourse from 3.4.0 to 3.5.0, all running jobs suddenly changed their state from running to errored. I can see the string 'no workers' appearing at the start of their logs now. Starting the jobs manually, or having them triggered by the next changes, worked without any problem.
The upgrade of Concourse itself was successful.
I was watching what BOSH did at the time, and I saw that this change of job states took place all at once while either the web or the db VM was being upgraded (I don't know which one). I am pretty sure that the worker VMs had not yet been touched by BOSH.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM, it's possible that it was out of service for long enough that all workers expired. Workers continuously heartbeat, and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished, but if scheduling happened before they heartbeated again, that would cause those errors.