When is the Flume memory channel not recoverable, and why? - apache-kafka

I am confused about how Flume data is recoverable when using the file channel but not with the memory channel.
I tried a Kafka sink: when I brought Flume down while it was reading, the data in the channel still reached Kafka properly, and when Flume was restarted, the pipeline continued delivering data reliably. So how is the memory channel not recoverable? In which case would I need to recover data in the channel, especially if Flume resumes reading the file from a saved offset?

You can restart Kafka to check whether the messages are lost.
Offset concept:
This depends on how Flume handles transactions. During a Flume restart, some transactions might get committed even though the processing failed due to a connection loss.
For example: you have a transaction that requires some processing, after which you store the result in a DB, and transaction.commit() is called even when the Flume sink throws an exception. In that case you will lose data during the restart process, because your processing logic threw an exception while the transaction was still committed and the offset was increased.
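To make the failure mode concrete, here is a minimal sketch of the safe ordering in a custom Flume sink using the standard Flume Transaction API: commit only after the processing step has succeeded, otherwise roll back so the event stays in the channel (writeToDb is a hypothetical helper, not a Flume API):

```java
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.sink.AbstractSink;

public class DbSink extends AbstractSink {
    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            Event event = channel.take();
            if (event != null) {
                writeToDb(event); // hypothetical processing step; may throw
            }
            tx.commit(); // commit only after processing has succeeded
            return Status.READY;
        } catch (Exception e) {
            tx.rollback(); // return the event to the channel instead of losing it
            return Status.BACKOFF;
        } finally {
            tx.close();
        }
    }

    private void writeToDb(Event event) {
        // hypothetical: store event.getBody() in a database
    }
}
```

Committing before (or regardless of) the write is exactly the anti-pattern described above: the offset advances even though the event was never stored.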
So it is safer to take an offset snapshot before the restart process. You should follow these steps:
1. Export the group's offsets
2. Stop Flume
3. Import the offsets
4. Start Flume
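Steps 1 and 3 can be done with the Kafka AdminClient. Below is a minimal sketch, assuming a group id of "flume-group" and a local broker; in practice you would persist the snapshot to a file between the export and the import:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetSnapshot {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Step 1 (before stopping Flume): export the group's committed offsets.
            Map<TopicPartition, OffsetAndMetadata> offsets =
                    admin.listConsumerGroupOffsets("flume-group") // assumed group id
                         .partitionsToOffsetAndMetadata().get();
            offsets.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));

            // Step 3 (after Flume is stopped, before restart): import the snapshot.
            // alterConsumerGroupOffsets only succeeds while the group is inactive.
            admin.alterConsumerGroupOffsets("flume-group", offsets).all().get();
        }
    }
}
```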

Related

confluent kafka dotnet consumer retry

If any message in Kafka fails to be read, I want to save these messages in a new topic and retry the reading process a certain number of times. Meanwhile, I need to pause reading at the original location. How can I do that?

How to handle kafka consumer failures

I am trying to understand how to handle failed consumer records. How do we know there is a record failure? What I am seeing is that when record processing fails in the consumer with a runtime exception, the consumer keeps retrying. But when the next record is available to process, it commits the offset of the latest record, which is expected. My question is how we know about the failed record. In older messaging systems, failed messages were rolled back to the queue and processing stopped there; we then knew the queue was down and could take action.
I can record the failed record into some DB table, but what happens if this recording fails?
I can move failures to an error/dead-letter queue, but again, what happens if this move fails?
I am using Kafka 2.6 with Spring Boot 2.3.4. Any help would be appreciated.
Sounds like you need to disable auto commits and commit the offsets manually once your definition of "successfully processed" is achieved. If you involve external processes like a database, then you will also need to increase the Kafka client timeouts so the broker doesn't think the consumer is dead while it waits on error logging/handling.
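As a sketch of that pattern with the plain Java client (the topic, group id, and broker address are assumptions; in Spring Kafka you would get the same effect with a manual AckMode):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-group");            // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // take over offset management
        // Give slow error handling (DB writes, dead-letter publishes) room before a rebalance.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, the offset is never committed
                }
                consumer.commitSync(); // commit only after the batch is fully processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // hypothetical business logic, including any dead-letter handling
    }
}
```

Because the commit only happens after processing, a failed record is re-delivered on the next poll after a restart rather than silently skipped.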

Kafka Connect: Automatically terminating after processing all data

I want to backup and restore a huge amount of data in a Kafka topic to various destinations (file, another topic, S3, ...) using Kafka Connect. However, it runs in a streaming mode and hence never terminates. But in my scenario it should exit automatically after processing all data that is currently in the topic (it is ensured in my context that all producers are shut down before the backup starts).
Is there any option/parameter so that a Kafka Connect connector automatically terminates after all current data is processed and, e.g., stored in a file?
AFAIK there is no such option. You can create a "watchdog" that checks the lag on your Kafka Connect group.id and, once the lag is fully processed (i.e. equals 0), shuts down the process.
This is how we do it in our company: we start a consumer every 3-6 hours to process the lag, create a file, and then terminate; the file is then uploaded to the other destination.
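A minimal sketch of such a watchdog using the Kafka AdminClient (the broker address and group id are assumptions; a sink connector's consumer group defaults to connect-<connector name>):

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConnectLagWatchdog {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        String groupId = "connect-backup-sink"; // assumed connector group id

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the Connect sink's consumer group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Current end offset of each assigned partition.
            Map<TopicPartition, OffsetSpec> latest = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> ends = admin.listOffsets(latest).all().get();

            long totalLag = committed.entrySet().stream()
                    .mapToLong(e -> ends.get(e.getKey()).offset() - e.getValue().offset())
                    .sum();

            if (totalLag == 0) {
                System.out.println("Lag is zero - safe to stop the connector/worker.");
                // e.g. stop the connector via Connect's REST API, or kill the worker process
            } else {
                System.out.println("Remaining lag: " + totalLag);
            }
        }
    }
}
```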

Put() vs Flush() in Kafka Connector Sink Task

I am trying to send data in batches to a NoSQL database using a Kafka sink connector. I am following the https://kafka.apache.org/documentation/#connect documentation and am confused about where the logic for sending records has to be implemented. Please help me understand how the records are processed internally and whether put() or flush() should be used to process the records in a batch.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus the flush(...) method doesn't need to do anything. Confluent's ElasticSearch sink connector, on the other hand, simply accumulates all of the messages for a series of put(...) calls and only writes them to Elasticsearch during flush(...).
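To illustrate the second pattern, here is a minimal SinkTask sketch that accumulates records in put(...) and writes them out in flush(...); it is a sketch of the approach, not Confluent's actual implementation, and the store and its bulk-write helper are hypothetical:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingNoSqlSinkTask extends SinkTask {
    private final List<SinkRecord> buffer = new ArrayList<>();

    @Override
    public void start(Map<String, String> config) {
        // open the connection to the hypothetical NoSQL store here
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Called repeatedly with each batch consumed from Kafka; just accumulate...
        buffer.addAll(records);
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // ...and write everything out just before Connect commits the offsets.
        writeBatchToStore(buffer); // hypothetical bulk write; must succeed or throw
        buffer.clear();
    }

    @Override
    public void stop() {
        // close connections to the store
    }

    @Override
    public String version() {
        return "0.1";
    }

    private void writeBatchToStore(List<SinkRecord> records) {
        // hypothetical: bulk insert into the NoSQL database
    }
}
```

If flush(...) throws, Connect does not commit the offsets, so the buffered records are redelivered via put(...) after recovery.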
The frequency that the offsets are committed for source and sink connectors is controlled by the connector's offset.flush.interval.ms configuration property. The default is to commit offsets every 60 seconds, which is infrequent enough to improve performance and reduce overhead, but is frequent enough to cap the potential amount of re-processing should the connector task unexpectedly die. Note that when the connector is shut down gracefully or experiences an exception, Kafka Connect will always have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not have a chance to commit the offsets identifying which messages had been processed. Thus, only when restarting after such a failure will the connector potentially re-process some messages that it had already processed just prior to the failure. And because messages will potentially be seen at least once, message processing should be idempotent. Take all of this plus your connectors' behavior into account when determining appropriate values for this setting.
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.

Flume uses an HDFS sink. How to ensure data integrity when HDFS is not available?

When HDFS is not available, is there an approach to ensure data safety? The scenario is: Kafka source, Flume memory channel, HDFS sink. What if the Flume service goes down: can it store the offsets of the topic's partitions and consume from the right position after recovery?
Usually (with the default configuration), Kafka stores topic offsets for all consumers. If you start the Flume source with the same group id (one of the consumer properties), Kafka will start sending messages from that group's last committed offset. But messages that have already been read from Kafka and stored in your memory channel will be lost on an HDFS sink failure.
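One way to see where a restarted source would resume is to ask Kafka for the group's committed offsets; a minimal sketch (the group id, topic, and partition are assumptions):

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ResumePositionCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "flume");                   // same group id as the Flume source
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0); // assumed topic/partition
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(Set.of(tp));
            OffsetAndMetadata om = committed.get(tp);
            // A restarted source in this group resumes from exactly this offset; anything
            // it had already taken past this point into a memory channel is gone.
            System.out.println("group will resume " + tp + " at "
                    + (om == null ? "the auto.offset.reset position" : om.offset()));
        }
    }
}
```

This is exactly why the memory channel is the weak link: Kafka's offset tracking only knows what the source committed, not what is still sitting unflushed in the channel.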