Correct me if my understanding is wrong.
Spring Kafka batch support matters when we are consuming records from topics and writing them to a file system or a database. The write happens in batch mode on the file-system or database end, not on the Kafka end. That is why Spring Kafka batch support came into the picture.
The same applies when reading data in batches from a file and writing it to Kafka topics: here, too, the batching happens on the file end.
So if we are neither consuming data from a file nor writing to a file between the consumer and producer processes, is it the same as using a normal Kafka message listener and message writer?
I read that the polling configuration decides how data is consumed in batches in Kafka. Even if we enable batch listener mode, that only relates to acknowledging multiple messages at once.
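For context, this is roughly what a batch listener looks like in Spring Kafka; a minimal sketch, assuming spring-kafka is on the classpath (the topic name my-topic and the String key/value types are placeholders). The container hands the listener everything returned by a single poll(), and the acknowledgement then covers that whole batch:

    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
    import org.springframework.kafka.core.ConsumerFactory;
    import org.springframework.stereotype.Component;

    @Configuration
    class BatchListenerConfig {

        // Enable batch mode: the container passes the whole result of one poll() to the listener.
        @Bean
        ConcurrentKafkaListenerContainerFactory<String, String> batchFactory(
                ConsumerFactory<String, String> consumerFactory) {
            ConcurrentKafkaListenerContainerFactory<String, String> factory =
                    new ConcurrentKafkaListenerContainerFactory<>();
            factory.setConsumerFactory(consumerFactory);
            factory.setBatchListener(true);
            return factory;
        }
    }

    @Component
    class FileWritingListener {

        // Receives all records returned by one poll(); acknowledgement covers the whole batch.
        @KafkaListener(topics = "my-topic", containerFactory = "batchFactory")
        void onBatch(List<ConsumerRecord<String, String>> records) {
            // write the whole batch to the file system / database in one operation
        }
    }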
Related
If any message in Kafka fails to be read, I want to save such messages to a new topic and retry reading them a certain number of times. Meanwhile, I need to pause reading at the original location. How can I do that?
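One possible approach, assuming Spring Kafka is in use (as in the question above): a DefaultErrorHandler retries the failed record in place, seeking the consumer back so the partition effectively stays paused at the failed offset, and once the retries are exhausted a DeadLetterPublishingRecoverer publishes the record to another topic. The ".retry" topic suffix and the retry count below are only illustrative, not a definitive implementation:

    import org.apache.kafka.common.TopicPartition;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
    import org.springframework.kafka.listener.DefaultErrorHandler;
    import org.springframework.util.backoff.FixedBackOff;

    @Configuration
    class RetryConfig {

        @Bean
        DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
            // After retries are exhausted, publish the failed record to "<topic>.retry"
            // (the suffix is only an example; the default recoverer uses "<topic>.DLT").
            DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
                    template,
                    (record, ex) -> new TopicPartition(record.topic() + ".retry", record.partition()));
            // Retry the same record 3 times, 1 second apart; while retrying, the container
            // seeks back so consumption does not move past the failed offset.
            return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L));
        }
    }

Register the handler on your listener container factory via factory.setCommonErrorHandler(errorHandler) if it is not picked up automatically by your framework configuration.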
I have a Flink job which reads data from Kafka topics and writes it to HDFS. There are some problems with checkpoints; for example, after stopping the Flink job some files stay in a pending state, and there are other issues with the checkpoints that are written to HDFS as well.
I want to try Kafka Streams for the same type of pipeline, Kafka to HDFS. I found the following problem: https://github.com/confluentinc/kafka-connect-hdfs/issues/365
Could you please tell me how to resolve it?
Could you also tell me where Kafka Streams keeps its files for recovery?
Kafka Streams only interacts between topics of the same cluster, not with external systems.
The Kafka Connect HDFS2 connector maintains offsets in an internal offsets topic. Older versions of it maintained offsets in the filenames and used a write-ahead log to ensure file delivery.
I want to back up and restore a huge amount of data in a Kafka topic to various destinations (file, another topic, S3, ...) using Kafka Connect. However, it runs in streaming mode and hence never terminates. But in my scenario it should exit automatically after processing all of the data that is currently in the topic (it is ensured in my context that all producers are shut down before the backup starts).
Is there any option/parameter so that a Kafka Connect connector automatically terminates after all current data is processed and, e.g., stored in a file?
AFAIK there is no such option. You can create a "watchdog" that checks the lag of your Kafka Connect group.id, and once the lag is fully processed, i.e. reaches 0, you shut the process down.
This is how we do it in our company: every 3-6 hours we start a consumer that works off the accumulated lag, creates a file and then terminates. The file is then uploaded to another destination.
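A minimal sketch of such a watchdog using the Kafka AdminClient; the bootstrap server and the group id connect-my-sink (sink connectors use connect-<connector name> by default) are placeholders, and how you then stop the connector, e.g. via the Connect REST API, is up to you:

    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class ConnectLagWatchdog {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // committed offsets of the Connect sink's consumer group (name is hypothetical)
                Map<TopicPartition, OffsetAndMetadata> committed =
                        admin.listConsumerGroupOffsets("connect-my-sink")
                             .partitionsToOffsetAndMetadata().get();
                // latest end offsets of the same partitions
                Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                        .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
                Map<TopicPartition, ListOffsetsResultInfo> latest =
                        admin.listOffsets(latestSpec).all().get();
                // total lag = sum over partitions of (end offset - committed offset)
                long lag = committed.entrySet().stream()
                        .mapToLong(e -> latest.get(e.getKey()).offset() - e.getValue().offset())
                        .sum();
                if (lag == 0) {
                    // everything currently in the topic has been processed:
                    // safe to stop the connector (e.g. via the Connect REST API)
                    System.out.println("Lag is 0, shutting the connector down");
                }
            }
        }
    }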
I am trying to send data in batches to a NoSQL database using a Kafka sink connector. I am following the https://kafka.apache.org/documentation/#connect documentation and am confused about where the logic for sending records has to be implemented. Please help me understand how records are processed internally and whether put() or flush() should be used to process the records in a batch.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus the flush(...) method doesn't need to do anything. Confluent's ElasticSearch sink connector, on the other hand, simply accumulates all of the messages from a series of put(...) calls and only writes them to Elasticsearch during flush(...).
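To make the two hooks concrete, here is a minimal sketch of a sink task that accumulates records in put(...) and writes them out in flush(...); the class name and the commented-out writeBatchToStore helper are hypothetical, and a real task would also deal with retries and rebalances:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.connect.sink.SinkRecord;
    import org.apache.kafka.connect.sink.SinkTask;

    public class MyNoSqlSinkTask extends SinkTask {

        private final List<SinkRecord> buffer = new ArrayList<>();

        @Override
        public void start(Map<String, String> props) {
            // open the connection to the NoSQL store here (details omitted)
        }

        @Override
        public void put(Collection<SinkRecord> records) {
            // called repeatedly with the latest batch of consumed records;
            // either write them immediately or accumulate them for flush()
            buffer.addAll(records);
        }

        @Override
        public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
            // called just before Kafka Connect commits the offsets:
            // push everything accumulated so far to the NoSQL store
            // writeBatchToStore(buffer);  // hypothetical helper
            buffer.clear();
        }

        @Override
        public void stop() {
            // close the connection to the NoSQL store
        }

        @Override
        public String version() {
            return "1.0";
        }
    }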
The frequency with which offsets are committed for source and sink connectors is controlled by the worker's offset.flush.interval.ms configuration property. The default is to commit offsets every 60 seconds, which is infrequent enough to improve performance and reduce overhead, but frequent enough to cap the potential amount of re-processing should a connector task unexpectedly die. Note that when the connector is shut down gracefully or experiences an exception, Kafka Connect will always have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not have a chance to commit the offsets identifying which messages had been processed. Thus, only after restarting from such a failure will the connector potentially re-process some messages that it had handled just prior to the failure. And it's because messages will potentially be seen at least once that message processing should be idempotent. Take all of this, plus your connectors' behavior, into account when determining appropriate values for this setting.
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.
I have a simple Kafka configuration:
the default console producer writes to one topic with one partition
a custom processor (which implements org.apache.kafka.streams.processor.Processor) reads from Kafka and processes the messages.
During processing I validate a batch of messages; if validation fails, I log the key of the first message of the failed batch.
Is it possible in Kafka to query for a message with a given key/offset? Or to get the 100 messages after that message?
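For what it's worth, Kafka has no key-based lookup, but given the offset of the message you logged, a plain consumer can seek to it and read the records that follow. A minimal sketch, assuming a single-partition topic named my-topic, String keys/values, and 12345 as an example offset:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class SeekAndRead {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "reread-failed-batch");  // hypothetical group id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");          // cap each poll at 100 records

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("my-topic", 0);         // single-partition topic from the question
                consumer.assign(List.of(tp));
                consumer.seek(tp, 12345L);                                     // offset of the logged first failed message
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                records.forEach(r ->
                        System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value()));
            }
        }
    }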