Kafka Connect: Automatically terminating after processing all data - apache-kafka

I want to backup and restore a huge amount of data in a Kafka topic to various destinations (file, another topic, S3, ...) using Kafka Connect. However, it runs in a streaming mode and hence never terminates. But in my scenario it should exit automatically after processing all data that is currently in the topic (it is ensured in my context that all producers are shut down before the backup starts).
Is there any option or parameter so that a Kafka Connect connector terminates automatically after all current data has been processed and, e.g., stored in a file?

AFAIK there is no such option. You can create a "watchdog" that checks the lag on your Kafka Connect group.id and, once the lag has been fully processed (i.e. reaches 0), shuts the process down.
This is how we do it in our company: every 3-6 hours we start a consumer that processes the accumulated lag, creates a file, and then terminates. The file is then uploaded to its destination.
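As a rough illustration, a minimal watchdog sketch using Kafka's Java AdminClient could look like the following; the consumer group name (sink connectors typically use connect-&lt;connector name&gt;) and the broker address are assumptions to adapt to your setup:

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConnectLagWatchdog {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets committed by the sink connector's consumer group ("connect-file-sink" is a placeholder).
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("connect-file-sink")
                         .partitionsToOffsetAndMetadata().get();

            // Latest end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();

            long totalLag = committed.entrySet().stream()
                    .mapToLong(e -> ends.get(e.getKey()).offset() - e.getValue().offset())
                    .sum();

            System.out.println("Total lag: " + totalLag);
            if (totalLag == 0) {
                // All current data has been processed; this is the point where you would
                // stop the connector, e.g. via the Connect REST API or by stopping the worker.
                System.out.println("Lag is zero - safe to shut the connector down.");
            }
        }
    }
}
```

Running this periodically (e.g. from cron) and shutting the worker down once the lag reaches zero gives the terminate-after-backlog behaviour described above.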

Related

Apache Flink Streaming Job: deployment patterns

We want to use Apache Flink for a streaming job: read from one Kafka topic and write to another. The infrastructure will be deployed to Kubernetes, and I want to restart the job on every PR merged to the master branch.
Therefore, I wonder whether Flink guarantees that resubmitting the job will continue the data stream from the last processed message. This matters because one of the job's most important features is message deduplication over a time window.
What are the patterns for updating streaming jobs in Apache Flink? Should I just stop the old job and submit the new one?
My suggestion would be to simply try it.
Deploy your app manually and then stop it. Run the kafka-consumer-groups script to find your consumer group. Then restart/upgrade the app and run the command again with the same group. If the lag goes down (as it should) rather than resetting to the beginning of the topic, then it's working as expected, as it would for any Kafka consumer.
read from one Kafka topic and write to another.
Ideally, Kafka Streams is used for this.
Kafka consumer offsets are saved as part of the checkpoint. So as long as your workflow is running in exactly-once mode, and your Kafka source is properly configured (e.g. you've set a group id), then restarting your job from the last checkpoint or savepoint will guarantee no duplicate records in the destination Kafka topic.
If you're stopping/restarting the job as part of your CI processing, then you'd want to:
Stop with a savepoint.
Restart from the savepoint.
You could also set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true (so that a checkpoint is taken when the job is terminated), enable externalized checkpoints, and then restart from the last checkpoint, but the savepoint approach is easier.
And you'd want to have some regular process that removes older savepoints, though the size of the savepoints will be very small (only the consumer offsets, for your simple workflow).
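For reference, a minimal sketch of the checkpoint-based variant in the DataStream API (assuming a Flink 1.15-era release; method and option names may differ slightly in other versions):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once checkpoints every 60s; the Kafka consumer offsets are part of every checkpoint.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Retain the last checkpoint when the job is cancelled so it can be used for restarts.
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Take a final checkpoint even after some tasks have finished (the option mentioned above).
        Configuration conf = new Configuration();
        conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);
        env.configure(conf);

        // ... build the Kafka source -> dedup -> Kafka sink pipeline here, then call env.execute(...)
    }
}
```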

confluent kafka dotnet consumer retry

If any message in Kafka fails to be read, I want to save those messages to a new topic and retry the reading process a certain number of times. Meanwhile, I need to pause reading at the original location. How can I do that?

Standby tasks not writing updates to .checkpoint files

I have a Kafka Streams application that is configured to have 1 standby replica created for each task, and two instances of the application running. When the application starts, it writes .checkpoint files for each of the partitions it is responsible for, covering partitions owned by both active and standby tasks.
When a new Kafka event is sent to be processed by the application, the instance containing the active task for the partition updates the offsets in its .checkpoint file. However, the .checkpoint file for the standby task on the second instance is never updated; it remains at the old offset.
I believe this is causing the OffsetOutOfRangeExceptions we see thrown when we rebalance, which results in tasks being torn down and recreated from scratch.
Am I right in thinking that offsets should be written for partitions in both standby and active tasks?
Is this an indication that my standby tasks are not consuming, or could it simply be that they are unable to write the offsets?
Any ideas what could be causing this behaviour?
Streams version: 2.3.1
This issue has been fixed in Kafka 2.4.0, which resolves the following bug: https://issues.apache.org/jira/browse/KAFKA-8755
Note: the issue appears to affect only applications that are configured with topology.optimization="all" (StreamsConfig.OPTIMIZE).
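For context, the configuration described in the question looks roughly like this (application id and broker address are placeholders; the string key is used for the optimization setting because the constant name changed between Streams versions):

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class StandbyStreamsConfig {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder brokers
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);              // one standby task per active task
        props.put("topology.optimization", StreamsConfig.OPTIMIZE);           // "all" - the setting the note above refers to
        return props;
    }
}
```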

Why does my Kafka Connect sink cluster have only one worker processing messages?

I've recently set up a local Kafka installation on my computer for testing and development purposes:
3 brokers
One input topic
A Kafka Connect sink between the topic and Elasticsearch
I managed to configure it in standalone mode, so everything is on localhost, and Kafka Connect was started using the ./connect-standalone.sh script.
What I'm trying to do now is to run my connectors in distributed mode, so that the Kafka messages can be split across both workers.
I've started the two workers (still everything on the same machine), but when I send messages to my Kafka topic, only one worker (the last one started) processes them.
So my question is: why is only one worker processing Kafka messages instead of both?
When I kill one of the workers, the other one takes over the message flow, so I think the cluster is set up correctly.
What I think:
I don't put keys in my Kafka messages; could it be related to this?
I'm running everything on localhost; can distributed mode work this way? (I've correctly configured the fields that must be unique, such as rest.port.)
Resolved:
From the Kafka documentation:
The division of work between tasks is shown by the partitions that each task is assigned.
If you don't use partitions (i.e. you push all messages to the same partition), the workers won't be able to divide the messages between them.
You don't need to use message keys; you can simply push your messages to different partitions in a round-robin fashion (see the sketch below).
See: https://docs.confluent.io/current/connect/concepts.html#distributed-workers
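A small sketch of what that can look like with the Java producer; the topic name, partition count, and broker address are assumptions (note that with null keys the default partitioner already spreads records over partitions, so explicit partition assignment is only needed for strict round-robin):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RoundRobinProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        int numPartitions = 3; // assumed partition count of the hypothetical "input-topic"
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                int partition = i % numPartitions; // cycle over the partitions explicitly
                producer.send(new ProducerRecord<>("input-topic", partition, null, "message-" + i));
            }
        }
    }
}
```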

Put() vs Flush() in Kafka Connector Sink Task

I am trying to send data in batches to a NoSQL database using a Kafka sink connector. I am following the https://kafka.apache.org/documentation/#connect documentation and am confused about where the logic for sending records has to be implemented. Please help me understand how the records are processed internally and whether put() or flush() should be used to process the records in a batch.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus its flush(...) method doesn't need to do anything. Confluent's Elasticsearch sink connector, on the other hand, simply accumulates the messages across a series of put(...) calls and only writes them to Elasticsearch during flush(...).
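To make the two styles concrete, here is a hedged sketch of a sink task that buffers records in put(...) and writes them in flush(...); the class and helper names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingNoSqlSinkTask extends SinkTask {

    private final List<SinkRecord> buffer = new ArrayList<>();

    @Override
    public void start(Map<String, String> props) {
        // Open the connection to the (hypothetical) NoSQL store here.
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Called repeatedly by the worker with the next batch of consumed records.
        buffer.addAll(records);
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Called just before the worker commits offsets: write everything buffered so far.
        writeBatchToStore(buffer);
        buffer.clear();
    }

    private void writeBatchToStore(List<SinkRecord> records) {
        // Placeholder for the actual bulk write to the target database.
    }

    @Override
    public void stop() {
        // Close connections and release resources.
    }

    @Override
    public String version() {
        return "0.1.0";
    }
}
```

The JDBC-style alternative would instead write each batch inside put(...) itself and leave flush(...) empty.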
The frequency with which the offsets are committed for source and sink connectors is controlled by the connector's offset.flush.interval.ms configuration property. The default is to commit offsets every 60 seconds, which is infrequent enough to improve performance and reduce overhead, but frequent enough to cap the potential amount of re-processing should the connector task unexpectedly die. Note that when the connector is shut down gracefully or experiences an exception, Kafka Connect will always have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not have a chance to commit the offsets identifying which messages had been processed. Thus, only after restarting from such a failure might the connector re-process some messages that it handled just prior to the failure. And because messages may be seen more than once, the way the sink applies them should be idempotent. Take all of this plus your connector's behavior into account when determining appropriate values for this setting.
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.