If any message in Kafka fails to be read, I want to save it to a new topic and retry reading it a certain number of times. Meanwhile, I need to pause reading at the original position. How can I do that?
We want to use Apache Flink for a streaming job that reads from one Kafka topic and writes to another. The infrastructure will be deployed to Kubernetes. I want to restart the job on every PR merge to the master branch.
Therefore, I wonder whether Flink guarantees that resubmitting the job will continue the data stream from the last processed message. This matters because one of the job's most important features is message deduplication over a time window.
What are the patterns of updating streaming jobs for Apache Flink? Should I just stop the old job and submit the new one?
My suggestion would be to simply try it.
Deploy your app manually and then stop it. Run the kafka-consumer-groups script to find your consumer group and note its offsets. Then restart/upgrade the app, and run the command again for the same group. If the lag goes down (as it should), rather than resetting to the beginning of the topic, then it's working as expected, as it would for any Kafka consumer.
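To make the before/after comparison concrete, here is a minimal Python sketch. It assumes you have copied the per-partition CURRENT-OFFSET and LOG-END-OFFSET values from the kafka-consumer-groups output into plain dictionaries by hand; the function names are hypothetical and nothing here talks to Kafka.

```python
from typing import Dict


def lag_per_partition(committed: Dict[int, int], log_end: Dict[int, int]) -> Dict[int, int]:
    """Lag = log end offset minus committed offset, per partition."""
    # A partition without a committed offset is counted from 0 here (an assumption).
    return {p: end - committed.get(p, 0) for p, end in log_end.items()}


def resumed_without_reset(before: Dict[int, int], after: Dict[int, int]) -> bool:
    """True if no partition's committed offset moved backwards across the restart."""
    return all(after.get(p, 0) >= offset for p, offset in before.items())


# Committed offsets noted before stopping the app and after restarting it.
before = {0: 120, 1: 95}
after = {0: 150, 1: 95}
print(lag_per_partition(after, {0: 150, 1: 100}))
print(resumed_without_reset(before, after))
```

If a committed offset drops back towards the start of the partition after the restart, the group was reset rather than resumed.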
Regarding "read from one Kafka topic and write to another": ideally, Kafka Streams is used for this.
Kafka consumer offsets are saved as part of the checkpoint. So as long as your workflow is running in exactly-once mode, and your Kafka source is properly configured (e.g. you've set a group id), then restarting your job from the last checkpoint or savepoint will guarantee no duplicate records in the destination Kafka topic.
If you're stopping/restarting the job as part of your CI processing, then you'd want to:
Stop with a savepoint.
Restart from that savepoint.
You could also set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true (so that a checkpoint is taken when the job is terminated), enable externalized checkpoints, and then restart from the last checkpoint, but the savepoint approach is easier.
And you'd want to have some regular process that removes older savepoints, though the size of the savepoints will be very small (only the consumer offsets, for your simple workflow).
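The cleanup process could be as small as the following Python sketch. It assumes the savepoints are written to a local directory, one subdirectory per savepoint with a name starting with "savepoint-"; the function name and the choice to keep the newest three are illustrative, and savepoints on S3 or HDFS would need the corresponding client instead of pathlib.

```python
import shutil
from pathlib import Path
from typing import List


def prune_savepoints(root: Path, keep: int = 3) -> List[Path]:
    """Delete all but the `keep` most recently modified savepoint directories."""
    if keep < 1:
        raise ValueError("keep at least one savepoint to restart from")
    savepoints = sorted(
        (p for p in root.iterdir() if p.is_dir() and p.name.startswith("savepoint-")),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    stale = savepoints[keep:]
    for path in stale:
        shutil.rmtree(path)
    return stale
```

Running it from the same CI job that restarts the application, after the new job has started successfully, avoids deleting a savepoint that is still needed for a rollback.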
I am new to Kafka.
Let's say I have one Kafka topic, topic1 (replication factor = 1, partitions = 1), and one consumer (a Java process) reading topic1 from the beginning (earliest). The consumer runs fine for some time, but later it hangs for some reason and is killed by an admin.
If I restart the consumer, it will read from the beginning again, leading to data duplication. How should I handle this use case?
NOTE: I am aware that if the consumer is written to read from latest, we will not get duplicated data. Is there any solution other than this?
Consumers will only reset to the beginning of the topic when auto.offset.reset=earliest and no committed offset exists for the group, that is, when
you have auto commits disabled and don't manually commit offsets,
and you don't manually seek the consumer upon startup (seeking lets you track offsets externally from Kafka).
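The decision described above can be modelled in a few lines of Python. This is a simplified sketch of the rule, not the Kafka client itself: the function and parameter names are hypothetical, and it ignores the case where a committed offset has fallen out of the retained range.

```python
from typing import Optional


def starting_offset(
    committed: Optional[int],
    seek_to: Optional[int],
    auto_offset_reset: str,
    log_start: int,
    log_end: int,
) -> int:
    """Where a restarted consumer begins reading a partition."""
    if seek_to is not None:
        return seek_to  # explicit seek, e.g. from an externally stored offset
    if committed is not None:
        return committed  # resume from the group's committed offset
    if auto_offset_reset == "earliest":
        return log_start  # only now does the consumer start from the beginning
    if auto_offset_reset == "latest":
        return log_end
    raise LookupError("no committed offset and auto.offset.reset=none")


print(starting_offset(committed=320, seek_to=None, auto_offset_reset="earliest", log_start=0, log_end=500))
```

With a committed offset of 320 the consumer resumes at 320 even though auto.offset.reset is earliest; the reset policy only applies when nothing was committed.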
I want to back up and restore a huge amount of data in a Kafka topic to various destinations (file, another topic, S3, ...) using Kafka Connect. However, it runs in streaming mode and hence never terminates. In my scenario it should exit automatically after processing all data that is currently in the topic (in my context it is ensured that all producers are shut down before the backup starts).
Is there any option/ parameter so that a Kafka Connect connector automatically terminates after all current data is processed and e.g. stored in a file?
AFAIK there is no such option. You can create a "watchdog" that checks the lag of your Kafka Connect group.id and, once the lag has been processed (i.e. lag = 0), shuts down the process.
This is how we do it in our company: every 3-6 hours we start a consumer that processes the accumulated lag, creates a file and then terminates. The file is then uploaded to another destination.
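A watchdog of this kind might look like the Python sketch below. The get_lag and stop callables are hypothetical hooks that you would have to supply yourself (for example, one that parses the kafka-consumer-groups output and one that stops the connector or worker); requiring several consecutive zero readings is an extra safety margin added here, not something the answer prescribes.

```python
import time
from typing import Callable


def wait_until_caught_up(
    get_lag: Callable[[], int],
    stop: Callable[[], None],
    poll_seconds: float = 5.0,
    stable_polls: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the group's total lag and call `stop` once it has stayed at zero."""
    zero_streak = 0
    while zero_streak < stable_polls:
        # Any non-zero reading resets the streak.
        zero_streak = zero_streak + 1 if get_lag() == 0 else 0
        if zero_streak < stable_polls:
            sleep(poll_seconds)
    stop()


# Dry run with canned lag readings instead of a real cluster.
readings = iter([40, 12, 0, 0, 0])
wait_until_caught_up(lambda: next(readings), lambda: print("stopping"), sleep=lambda _: None)
```

This only terminates reliably because the producers are shut down first; with live producers the lag may never settle at zero.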
I am confused about why Flume data is recoverable when using a file channel but not when using a memory channel.
I tried a Kafka sink. When I took Flume down while it was reading, Kafka read the data (file) in the channel properly, and when Flume was restarted, the pipeline continued delivering data reliably. So in what sense is the memory channel not recoverable? In which cases would I need to recover data in the channel, especially if Flume starts reading the file from a saved offset?
You can restart Kafka to check if the messages are lost.
Offset concept:
This depends on Flume's transaction handling. While Flume is restarting, some transactions might get committed even though their processing failed due to connection loss.
For example: you have a transaction that requires some processing, after which you store the result in a database, and transaction.commit() is called even when the Flume sink throws an exception. In that case you will lose data during the restart: the processing logic throws an exception, but the transaction is committed and the offset is advanced.
So it is safer to save the offset before restarting. You should follow these steps:
Export the offset
Stop Flume
Import the offset
Start Flume
I am trying to send data in batches to a NoSQL database using a Kafka sink connector. I am following the documentation at https://kafka.apache.org/documentation/#connect and am confused about where the logic for sending records has to be implemented. Please help me understand how the records are processed internally, and whether put() or flush() should be used to process the records in a batch.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus the flush(...) method doesn't need to do anything. Confluent's Elasticsearch sink connector, on the other hand, simply accumulates all of the messages from a series of put(...) calls and only writes them to Elasticsearch during flush(...).
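The accumulate-then-flush style can be illustrated with a small Python model. This is not the Kafka Connect API (which is Java); the class name and the write_batch callback are hypothetical stand-ins for a sink task and its data store client.

```python
from typing import Any, Callable, List


class BufferingSink:
    """Toy model of a sink task that buffers in put() and writes in flush()."""

    def __init__(self, write_batch: Callable[[List[Any]], None]) -> None:
        self._write_batch = write_batch
        self._buffer: List[Any] = []

    def put(self, records: List[Any]) -> None:
        # Called repeatedly with whatever the framework has consumed so far.
        self._buffer.extend(records)

    def flush(self) -> None:
        # Called just before offsets are committed. The buffer is cleared only
        # after a successful write, so a failed write keeps the records.
        if self._buffer:
            self._write_batch(list(self._buffer))
            self._buffer.clear()


store: List[List[Any]] = []
sink = BufferingSink(store.append)
sink.put(["a", "b"])
sink.put(["c"])
sink.flush()
print(store)
```

A sink in the other style would call write_batch directly inside put() and leave flush() empty.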
The frequency at which offsets are committed for source and sink connectors is controlled by the offset.flush.interval.ms configuration property, which is set on the Kafka Connect worker rather than on an individual connector. The default is to commit offsets every 60 seconds, which is infrequent enough to keep overhead low but frequent enough to cap the potential amount of re-processing should the connector task unexpectedly die. Note that when the connector is shut down gracefully or experiences an exception, Kafka Connect will always have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not have a chance to commit the offsets identifying which messages had been processed. Thus, only after restarting from such a failure will the connector potentially re-process some messages that it handled just prior to the failure. And because messages may therefore be delivered more than once, writing them to the data store should be idempotent. Take all of this plus your connectors' behavior into account when determining appropriate values for this setting.
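To see why idempotent writes matter, consider this small Python simulation of a worker that is killed after writing some records but before committing their offsets. The record layout, the offsets and both write functions are made up for illustration; the point is only the difference between keyed upserts and blind appends.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int, int, str]  # (topic, partition, offset, value)


def upsert(store: Dict[Tuple[str, int, int], str], batch: List[Record]) -> None:
    """Idempotent write: the key is derived from the record's coordinates."""
    for topic, partition, offset, value in batch:
        store[(topic, partition, offset)] = value


def append(log: List[str], batch: List[Record]) -> None:
    """Non-idempotent write: every delivery adds another row."""
    for _, _, _, value in batch:
        log.append(value)


records = [("orders", 0, i, f"v{i}") for i in range(5)]
first_run = records[:4]  # offsets 0-3 were written, but the committed position was still 2
replay = records[2:]     # after the restart, delivery resumes at offset 2

store: Dict[Tuple[str, int, int], str] = {}
log: List[str] = []
for batch in (first_run, replay):
    upsert(store, batch)
    append(log, batch)

print(len(store), len(log))
```

The keyed store ends up with one entry per record, while the append-only log contains the replayed records twice.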
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.