Kafka Connect - how to get a failed task to restart with a new configuration - apache-kafka

Whenever we restart a failed task, it ALWAYS picks up the config it had at the time of the failure and runs with that, and only THEN picks up the new config and runs with that as well.
We have Connect jobs that we pause, update the config for, and then resume. This works fine unless the task has failed.
If we restart a failed task, even if the connector has an updated config, the task launches with the old config and runs to completion/failure; then a new task is launched with the new config.
This can cause various data issues if you really don't want the old task to run with the old config.
Any ideas on how to restart a connector that has a failed task with a new config, and NOT have the old config get invoked?
(running Kafka v2.5, btw)

I don't know if it would make sense for the task to pick up the latest config.
For instance, let's assume that your connector fires up 10 distinct tasks and 1 of them fails. It wouldn't make sense to have the remaining 9 tasks of the connector running with the older config while the failed task runs with the newest config once it is restarted.
I would say that in cases where you want to use a new/different configuration when a task fails, it might make more sense to restart the connector and not the individual task(s):
POST /connectors/connector-name/restart HTTP/1.1
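
To make the sequence concrete, here is a rough Python sketch that builds the REST calls for "update the config, then restart the connector". It only uses endpoints from the Connect REST API (PUT /connectors/{name}/config, POST /connectors/{name}/restart, POST /connectors/{name}/tasks/{id}/restart). The base URL, the connector name and the helper names build_requests and send_all are placeholders I made up for illustration. As far as I know, on versions before 3.0 the connector restart endpoint restarts only the connector instance and not its tasks, so the sketch optionally adds per-task restarts. I have not verified that this avoids the old config being run on 2.5.

```python
import json
import urllib.request

# Assumption: the Connect worker's REST API listens here; adjust for your cluster.
BASE_URL = "http://localhost:8083"


def build_requests(connector, new_config, failed_task_ids=(), base_url=BASE_URL):
    """Return the REST calls, in order, for a config update followed by restarts."""
    root = f"{base_url}/connectors/{connector}"
    body = json.dumps(new_config).encode("utf-8")
    requests = [
        # PUT /connectors/{name}/config replaces the connector configuration.
        urllib.request.Request(
            f"{root}/config",
            data=body,
            method="PUT",
            headers={"Content-Type": "application/json"},
        ),
        # POST /connectors/{name}/restart restarts the connector.
        urllib.request.Request(f"{root}/restart", method="POST"),
    ]
    # POST /connectors/{name}/tasks/{id}/restart restarts a single task.
    for task_id in failed_task_ids:
        requests.append(
            urllib.request.Request(f"{root}/tasks/{task_id}/restart", method="POST")
        )
    return requests


def send_all(requests, timeout=10):
    """Send each request in order; urlopen raises HTTPError on a 4xx/5xx reply."""
    for req in requests:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

Usage would be something like send_all(build_requests("connector-name", new_config, failed_task_ids=[0])), where new_config is the full connector config as a dict.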

I was having this problem and managed to "fix" it more or less by accident.
I increased the number of tasks in the connector and then reduced it again, and it seemed to pick up the new configuration.
It was really random.
I do know that the restart did not work for me.

Related

Apache Kafka Connect Task Restart

I am new to Kafka Connect. I am writing a script that detects failed Kafka Connect tasks and restarts them. But the restart API that Apache Kafka provides doesn't say whether the task was actually restarted; we only know that the restart command was sent successfully.
I want a response that tells me whether the task was successfully restarted or not. I could put in a wait condition and check the task health after the restart command is issued, but the health API also has a delay in reflecting the actual status.
How can I achieve this? Is there a way to synchronously restart the connector tasks?
There are two restart endpoints, by the way - connector/restart and task-id/restart.
But, in my experience, polling for the health is the best you can do, with a retry-limit on whatever process is sending the restart event.
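
A minimal sketch of that polling loop in Python, assuming the worker's REST API is at http://localhost:8083 and that the task restart has already been requested. It reads GET /connectors/{name}/status and looks at the "state" of the matching entry in "tasks". The function names, the attempt limit and the delay are my own choices, not anything Connect prescribes. Because the status can lag behind, the loop keeps polling while the task still shows FAILED and only stops early on RUNNING.

```python
import json
import time
import urllib.request


def fetch_status(connector, base_url="http://localhost:8083"):
    """GET /connectors/{name}/status and return the decoded JSON body."""
    url = f"{base_url}/connectors/{connector}/status"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_task(connector, task_id, fetch=fetch_status,
                  max_attempts=10, delay_s=3.0, sleep=time.sleep):
    """Poll until the task reports RUNNING or the attempts run out.

    Returns the last state seen, or None if the task never showed up.
    """
    state = None
    for attempt in range(max_attempts):
        status = fetch(connector)
        for task in status.get("tasks", []):
            if task.get("id") == task_id:
                state = task.get("state")
        if state == "RUNNING":
            return state
        # A stale FAILED right after the restart is expected, so keep trying.
        if attempt < max_attempts - 1:
            sleep(delay_s)
    return state
```

If wait_for_task returns anything other than "RUNNING", the calling script can count that as a failed restart and decide whether to send another restart or give up.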

Task Scheduler did not launch task "\abc" because instance "(GUID)" of the same task is already running

I have been scratching my head for the last 2 days because of this issue. The error is intermittent on the production server: sometimes the task scheduler works and sometimes it does not.
The same settings work on the development server.
I also checked the execution policy on both servers and it looks the same.
In your second screenshot, you can choose "Stop the existing instance" in the last dropdown list (the rule that applies if the task is already running). Then the retry option might trigger your task again correctly.

spring-batch job monitoring and restart

I am new to spring-batch and have a few questions:
I have a question about restarts. As per the documentation, the restart feature is enabled by default. What I am not clear on is whether I need to write any extra code for a restart. If so, I am thinking of adding a scheduled job that looks for failed processes and restarts them.
I understand spring-batch-admin is deprecated. However, we cannot use spring-cloud-data-flow right now. Is there any other alternative to monitor and restart jobs on demand?
The restart feature that you mention only determines whether a job is restartable. It doesn't mean Spring Batch will restart a failed job for you automatically.
Instead, it provides the following building blocks for developers to do this on their own:
JobExplorer to find out the id of the job execution that you want to restart
JobOperator to restart a job execution given a job execution id
Also, a restartable job can only be restarted if its status is FAILED. So if you want to restart a job that was running and stopped because of a server breakdown, you first have to find that job and update its job execution status, and the status of all of its step executions, to FAILED before you can restart it. (See this for more information.) One solution is to implement a SmartLifecycle that uses the above building blocks to achieve this.

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I did the upgrade of concourse from 3.4.0 to 3.5.0, suddenly all running jobs changed their state from running to errored. I can see the string 'no workers' appearing at the start of their log now. Starting the jobs manually or triggered by the next changes didn't have any problem.
The upgrade of concourse itself was successful.
I was watching what bosh did at the time and I saw this change of job states took place all at once while either the web or the db VM was upgraded (I don't know which one). I am pretty sure that the worker VMs were not touched yet by bosh.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM, it's possible that it was out of service for long enough that all workers expired. Workers continuously heartbeat, and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished, but if scheduling happened before they heartbeat again, that would cause those errors.

Quartz job doesn't restart after instance fail

I have Quartz 1.8.5 running in a clustered environment (2 nodes, persistence, clustered, JobStoreCMT).
Now I schedule several jobs to run every day at a specific hour.
I set request recovery to true for each of these jobs (jobDetail.setRequestsRecovery(true)).
I see that the flag is set to 1 in the QRTZ_JOB_DETAILS table.
What I want is that when a node fails (the JBoss server is restarted, for example), the other live node restarts the failed job. But this doesn't happen.
What am I doing wrong / not doing?
Thanks.
Have you tried to update to the latest Quartz? There is a version 2.1.6 out already.
Otherwise, what you're doing seems to be right.