I am new to Kafka Connect. I am writing a script which detects kafka connect failed tasks and restarts them. But the restart api which apache kafka has provided doesn't say if the task is actually restarted or not, we just know that the restart command was successfully sent.
I want a response if the task was successfully restarted/not. I could put a wait condition and check the task health after the restart command was issued, but the health api also has a delay in reflecting the actual status.
How can I achieve this. Is there a way to synchronously restart the connector tasks?
There is two restart endpoints, by the way - connector/restart and task-id/restart.
But, in my experience, polling for the health is the best you can do, with a retry-limit on whatever process is sending the restart event.
Related
I have a pod running in kubernetes / aws cloud. Due to limited configuration options in a custom deployment process (not my fault!!) I cannot start the symfony messenger as you usually would start it. What I have to do after a deployment is log into the shell and manually do
bin/console messenger:consume my_kafka_messages
Of course as soon as the pod for any reason is automatically restarted my worker isn't running. So until we can change the company deployment process I have to make sure to at least get notice if the worker isn't running.
Is there any option to e.g. run an symfony command which checks if the worker is running? If that was possible I could let the system start the worker or at least send me a notification.
How about
bin/console debug:messenger
?
If I do that and get e.g. this output is this sign that the worker is running? Or is it just the configuration of a worker, which could run, if it were started and may or may not run currently?
$ bin/console deb:mess
Messenger
=========
events
------
The following messages can be dispatched:
--------------------------------------------------
#codeCoverageIgnore
App\Domain\KafkaEvents\ProductPictureCollection
handled by App\Handler\ProductPictureHandler
--------------------------------------------------
Of course I can do a crude approach and check the db, which logs the processed datasets. But t is always possible that for e.g. 5 days there are no data to process. In that case I would get false alarms although everything is fine.
So checking directly if the worker is running would be much better, but I have no idea how to do it.
I am new to spring-batch, got few questions:-
I have got a question about the restart. As per documentation, the restart feature is enabled by default. What I am not clear is do I need to do any extra code for a restart? If so, I am thinking of adding a scheduled job that looks at failed processes and restarts them?
I understand spring-batch-admin is deprecated. However, we cannot use spring-cloud-data-flow right now. Is there any other alternative to monitor and restart jobs on demand?
The restart that you mention only means if a job is restartable or not .It doesn't mean Spring Batch will help you to restart the failed job automatically.
Instead, it provides the following building blocks for developers for achieving this task on their own :
JobExplorer to find out the id of the job execution that you want to restart
JobOperator to restart a job execution given a job execution id
Also , a restartable job can only be restarted if its status is FAILED. So if you want to restart a running job that was stop running because of the server breakdown , you have to first find out this running job and update its job execution status and all of its task execution status to FAILED first in order to restart it. (See this for more information). One of the solution is to implement a SmartLifecycle which use the above building blocks to achieve this goal.
Whenever we restart a failed task, it will ALWAYS pick up the config it had at the time of the failure, and run with that.. and THEN it picks up the new config.. and runs that as well.
We have connect jobs that we pause, update config, and then resume. This works fine, unless the task has failed.
If we restart a failed task, even if the connector has an updated config, the task will launch with the old config.. run to completion/failure.. then a new task will be launched with the new config.
This can cause various data/etc issues.. if you really don't want that old task to run with that config.
Any ideas how to restart a connector with a failed task.. with a new config.. and NOT have the old config get invoked?
(running Kafka v2.5, btw)
I don't know if it would make sense for the task to pick up the latest config.
For instance, let's assume that your connector fires up 10 distinct tasks and 1 of them fails. It won't make sense to have the remaining 9 tasks of the connector running with the older config while the failed task runs the newest config once it is restarted.
I would say that in cases you want to use a new/different configuration file when a task fails, it might make more sense to restart the connector and not the individual task(s):
POST /connectors/connector-name/restart HTTP/1.1
I was having this problem and managed to "fix" this by a bit of randomness.
I increased the number of Tasks in the connector and then reduced it again and it seemed to have picked up a new configuration.
Was really random.
I do know the restart did not work for me
I'm trying to verify that shutdown is completing cleanly on Kubernetes, with a .NET Core 2.0 app.
I have an app which can run in two "modes" - one using ASP.NET Core and one as a kind of worker process. Both use Console and JSON-which-ends-up-in-Elasticsearch-via-Filebeat-sidecar-container logger output which indicate startup and shutdown progress.
Additionally, I have console output which writes directly to stdout when a SIGTERM or Ctrl-C is received and shutdown begins.
Locally, the app works flawlessly - I get the direct console output, then the logger output flowing to stdout on Ctrl+C (on Windows).
My experiment scenario:
App deployed to GCS k8s cluster (using helm, though I imagine that doesn't make a difference)
Using kubectl logs -f to stream logs from the specific container
Killing the pod from GCS cloud console site, or deleting the resources via helm delete
Dockerfile is FROM microsoft/dotnet:2.1-aspnetcore-runtime and has ENTRYPOINT ["dotnet", "MyAppHere.dll"], so not wrapped in a bash process or anything
Not specifying a terminationGracePeriodSeconds so guess it defaults to 30 sec
Observing output returned
Results:
The API pod log streaming showed just the immediate console output, "[SIGTERM] Stop signal received", not the other Console logger output about shutdown process
The worker pod log streaming showed a little more - the same console output and some Console logger output about shutdown process
The JSON logs didn't seem to pick any of the shutdown log output
My conclusions:
I don't know if Kubernetes is allowing the process to complete before terminating it, or just issuing SIGTERM then killing things very quick. I think it should be waiting, but then, why no complete console logger output?
I don't know if console output is cut off when stdout log streaming at some point before processes finally terminates?
I would guess that the JSON stuff doesn't come through to ES because filebeat running in the sidecar terminates even if there's outstanding stuff in files to send
I would like to know:
Can anyone advise on points 1,2 above?
Any ideas for a way to allow a little extra time or leeway for the sidecar to send stuff up, like a pod container termination order, delay on shutdown for that container, etc?
SIGTERM does indeed signal termination. The less obvious part is that when the SIGTERM handler returns, everything is considered finished.
The fix is to not return from the SIGTERM handler until the app has finished shutting down. For example, using a ManualResetEvent and Wait()ing it in the handler.
I've started to look into this for my own purposes and have come across your question over a year after it was posted... This is a bit late, but have you tried GraceTerm?
There is an associated NuGET package for this.
From the description...
Graceterm middleware provides implementation to ensure graceful shutdown of AspNet Core applications. The basic concept is: After application received a SIGTERM (a signal asking it to terminate), Graceterm will hold it alive till all pending requests are completed or a timeout occur.
I haven't personally tried this yet, but it does look promising.
Try add STOPSIGNAL SIGINT to your Dockerfile
I am running CP3.2 in a distributed mode and some of the connector which are defined even with "tasks.max": "1" have task "UNASSIGNED" state. I have increased the memory allocated to worker and restart the worker has solved me the problem or adding one more worker has solved this.
Its ok for me if "tasks.max" > 1 have some task in "UNASSIGNED" state but if I define only one task its should be in "RUNNING" state.
But I need to understand in what all condition does the task goes to "UNASSIGNED" state and how to solve this (make it running).
Regards,
Aradhya
A task goes into UNASSIGNED state if a successful shutdown has occurred on the worker task that has been assigned to run this connector task. This is regardless of the total number of tasks this connector is supposed to spawn (tasks.max property). You may track this in the code by following the calls to onShutdown method in AbstractHerder. Transitioning to UNASSIGNED state requires that no failure has happened or exception has been thrown by the running worker task and that normal shutdown has been triggered.
Is there a reason that your connector task might be stopping at the very start of its regular iteration loop? Can you give a bit more information? Is it a Source or a Sink?
In my case my connector went to UNASSIGNED just because i ran 2 or 4 connectors parallel and while debezium start working on connectors it gets confuse and stop working on that connector i.e connector went to UNASSIGNED state in my case.
May be this happens just because my all connectors are collecting heavy data from different -2 databases of different location and when i check debezium log it shows that stopping connector and then finally if tolerance limit of debizium cross it stop MYSQL connection
For solving this issue i just restart docker compose or connector both help me to put my connector to running state:
docker-compose restart
or
curl -X POST localhost:8083/connectors/<connector-name>/restart