Is there a way to commit manually with Kafka Streams?
Usually, with a plain KafkaConsumer, I do something like this:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        // process records
    }
    consumer.commitAsync();
}
Here I'm calling commit manually. I don't see a similar API for KStream.
Commits are handled by Streams internally and are fully automatic, so there is usually no reason to commit manually. Note that Streams handles this differently from consumer auto-commit: in fact, auto-commit is disabled for the internally used consumer, and Streams manages commits "manually". The reason is that commits can only happen at certain points during processing to ensure no data can get lost (there are many internal dependencies with regard to updating state and flushing results).
For more frequent commits, you can reduce the commit interval via the StreamsConfig parameter commit.interval.ms.
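For example, a minimal sketch of setting a shorter interval (the 100 ms value and application id are just illustrations):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app"); // hypothetical app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Commit roughly every 100 ms instead of the default 30 seconds (at-least-once).
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);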
Nevertheless, manual commits are possible indirectly via the low-level Processor API. You can use the context object that is provided via the init() method to call context#commit(). Note that this is only a "request to Streams" to commit as soon as possible; it's not issuing a commit directly.
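A minimal sketch, assuming the classic Processor API (topology wiring omitted; the class name is illustrative):

import org.apache.kafka.streams.processor.AbstractProcessor;

public class CommitRequestingProcessor extends AbstractProcessor<String, String> {
    @Override
    public void process(String key, String value) {
        // ... process the record ...
        // Only a request: Streams commits at the next safe point, not immediately.
        context().commit();
    }
}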
Related
I have a service that consumes from Kafka and stores the data in a database. The logic, simplified:
Flux<ReceiverRecord<String, byte[]>> kafkaFlux = KafkaReceiver.create(options).receive();
kafkaFlux.flatMap(r -> store(r))                                      // IO operation, store to database
         .subscribe(record -> record.receiverOffset().acknowledge()); // ack the record
The flatMap can reorder the flux. Based on the Reactor Kafka documentation, acknowledge() could therefore ack a record that hasn't been stored to the database yet:
https://projectreactor.io/docs/kafka/snapshot/api/reactor/kafka/receiver/ReceiverOffset.html
Acknowledges the ReceiverRecord associated with this offset. The offset will be committed automatically based on the commit configuration parameters ReceiverOptions.commitInterval() and ReceiverOptions.commitBatchSize(). When an offset is acknowledged, it is assumed that all records in this partition up to and including this offset have been processed. All acknowledged offsets are committed if possible when the receiver Flux terminates.
How can I guarantee at-least-once semantics without blocking the stream?
Starting with version 1.3.8, commits can be performed out of order and the framework will defer the commits as needed, until any "gaps" are filled. This removes the need for applications to keep track of offsets and commit them in the right order.
You can set maxDeferredCommits on your ReceiverOptions to enable the out-of-order commits feature.
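A minimal sketch of enabling it (the bound and topic name are illustrative):

ReceiverOptions<String, byte[]> options =
    ReceiverOptions.<String, byte[]>create(consumerProps)
        .maxDeferredCommits(100) // upper bound on acks that may be held back out of order
        .subscription(Collections.singleton("my-topic"));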
In the documentation :
BATCH: Commit the offset when all the records returned by the poll() have been processed.
MANUAL: The message listener is responsible to acknowledge() the Acknowledgment. After that, the same semantics as BATCH are applied.
If the offset is committed when all the records returned by the poll() have been processed in both cases, then I don't see the difference. Can you give me a scenario where MANUAL ack mode behaves differently?
If I use MANUAL mode and don't call acknowledge() within my KafkaListener, would that be the same as BATCH mode? And if I do call acknowledge(), what would change?
Maybe I don't get the difference between the commit and acknowledge notions in Spring Kafka.
In a perfect world, when your application is always up, you don't need those commits at all, because the Kafka consumer keeps track of the offset internally between poll() calls. There might also be cases where you really don't need to commit on every single batch delivered to you. That's where MANUAL comes to the rescue: with BATCH mode you have no control and the framework performs the commit for you anyway; with MANUAL you may decide to commit now or later, somewhere after a couple of batches have been processed.
It is called acknowledge because we might not perform a commit immediately, but rather store it in memory for a subsequent poll cycle. The commit must be performed exactly on the Kafka consumer thread.
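For illustration, a hedged sketch of a MANUAL-mode listener (the topic name is hypothetical, and the container factory must be configured with AckMode.MANUAL):

@KafkaListener(topics = "my-topic")
public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
    // ... process the record ...
    // Marks the record as processed; the actual commit happens later,
    // on the consumer thread, with BATCH-like semantics.
    ack.acknowledge();
}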
I am currently working on the deployment of a distributed stream processing chain using Kafka, but not the Kafka Streams library. I've created a kind of node which can be executed, takes a topic as input, processes the obtained data, and sends it to an output topic. The node is a simple consumer/producer pair associated with a unique upstream partition. The producer is idempotent, and the processing is done in a transaction context such as:
producer.initTransactions();
try {
    producer.beginTransaction();
    // process
    producer.commitTransaction();
} catch (KafkaException e) {
    producer.abortTransaction();
}
I also used the producer.sendOffsetsToTransaction() method to ensure an atomic commit for the consumer.
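For reference, a hedged sketch of how the consumed offsets can be committed inside the transaction (topic name, group id, and the transform() helper are illustrative):

Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
try {
    producer.beginTransaction();
    for (ConsumerRecord<String, String> record : records) {
        producer.send(new ProducerRecord<>("output-topic", record.key(), transform(record.value())));
        // Commit the offset *after* this record as part of the same transaction.
        offsets.put(new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1));
    }
    producer.sendOffsetsToTransaction(offsets, "my-consumer-group");
    producer.commitTransaction();
} catch (KafkaException e) {
    producer.abortTransaction();
}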
I would like to use a key-value store to keep the state of my nodes (I was thinking about MapDB, which looks simple to use).
But I wonder: if I update my state inside the transaction, with a map.put(key, value) for example, will the transaction ensure that the state is updated exactly once?
Thank you very much
Kafka only promises exactly-once for its own components, i.e., when I produce X to the output topic, I will also commit X's offset on the input topic. Either both succeed or both fail, i.e., it is atomic.
So whatever you do between consuming and producing is entirely on you to make exactly-once, unless you use the state store provided by Kafka itself. That is available to you if you use Kafka Streams.
If you cannot switch to Kafka Streams, it is still possible to ensure exactly-once yourself if you track Kafka's offsets in MapDB and add sufficient checks.
For example, assuming you are trying to do deduplication here, see the sketch below.
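A hedged sketch, assuming MapDB 3.x and that the Kafka offset is committed only after db.commit() succeeds (all names are illustrative):

// Track the highest processed offset per partition in MapDB, so that records
// replayed after a crash between the DB write and the offset commit are skipped.
DB db = DBMaker.fileDB("state.db").transactionEnable().make();
ConcurrentMap<String, Long> processedOffsets =
    db.hashMap("offsets", Serializer.STRING, Serializer.LONG).createOrOpen();

for (ConsumerRecord<String, String> record : records) {
    String key = record.topic() + "-" + record.partition();
    Long last = processedOffsets.get(key);
    if (last != null && record.offset() <= last) {
        continue; // seen before: treat as duplicate and skip
    }
    // ... process and store the record ...
    processedOffsets.put(key, record.offset());
    db.commit(); // persist the dedup state before committing the Kafka offset
}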
This is just one way of doing things - assuming that whatever you put in mapDB is committed right away. Even if not, you can always consult the "source of truth" - which are the topics here - and reconstruct the lost data.
In our code, we plan to commit the offset manually. Our processing of the data is long-running, hence we follow the pattern suggested before:
Read the records
Process the records in their own thread
Pause the consumer
Continue polling the paused consumer so that it stays alive
When the records are processed, commit the offsets
When the commit is done, resume the consumer
The code looks somewhat like this:
Future<?> task = null; // the in-flight processing task
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(kafkaConfig.getTopicPolling());
    if (!records.isEmpty()) {
        task = pool.submit(new ProcessorTask(processor, createRecordsList(records)));
    }
    if (shouldPause(task)) {
        consumer.pause(listener.getPartitions());
    }
    if (isDoneProcessing(task)) {
        consumer.commitSync();
        consumer.resume(listener.getPartitions());
    }
}
If you notice, we commit using commitSync() (without any parameters).
Since the consumer is paused, in subsequent iterations we would get no records, but commitSync() would still happen later. In that case, which offsets would it try to commit? I have read the Definitive Guide and googled, but cannot find any information about it.
I think we should save the offsets explicitly, but I am not sure whether the current code would be an issue.
Any information would be helpful.
Thanks,
Prateek
If you call consumer.commitSync() with no parameters, it commits the offsets of the last records returned by the most recent poll(), for all assigned partitions. Since you can receive many messages in a single poll(), you might want finer control over the commit and explicitly commit a specific offset, such as that of the latest message your consumer has successfully processed. This can be done by calling commitSync(Map<TopicPartition, OffsetAndMetadata> offsets).
You can see the syntax for the two ways to call commitSync here in the Consumer Javadoc http://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#commitSync()
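A hedged sketch of the explicit variant, assuming record is the last record that was successfully processed:

Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
// Commit the position of the *next* record to consume, hence offset() + 1.
offsets.put(new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1));
consumer.commitSync(offsets);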
TL;DR
When a Flume source fails to push a transaction to the next channel in the pipeline, does it always keep event instances for the next try?
In general, is it safe to have a stateful Flume interceptor, where processing of events depends on previously processed events?
Full problem description:
I am considering the possibility of leveraging guarantees offered by Apache Kafka regarding the way topic partitions are distributed among consumers in a consumer group to perform streaming deduplication in an existing Flume-based log consolidation architecture.
Using the Kafka Source for Flume and custom routing to Kafka topic partitions, I can ensure that every event that should go to the same logical "deduplication queue" will be processed by a single Flume agent in the cluster (for as long as there are no agent stops/starts within the cluster). I have the following setup using a custom-made Flume interceptor:
[KafkaSource with deduplication interceptor]-->(MemoryChannel)-->[HDFSSink]
It seems that when the Flume Kafka source runner is unable to push a batch of events to the memory channel, the event instances that are part of the batch are passed again to my interceptor's intercept() method. In this case, it was easy to add a tag (in the form of a Flume event header) to processed events to distinguish actual duplicates from events in a failed batch that got re-processed.
However, I would like to know if there is any explicit guarantee that Event instances in failed transactions are kept for the next try or if there is the possibility that events are read again from the actual source (in this case, Kafka) and re-built from zero. In that case, my interceptor will consider those events to be duplicates and discard them, even though they were never delivered to the channel.
EDIT
This is how my interceptor distinguishes an Event instance that was already processed from a non-processed event:
public Event intercept(Event event) {
    Map<String, String> headers = event.getHeaders();
    // tagHeaderName is the name of the header used to tag events, never null
    if (!tagHeaderName.isEmpty()) {
        // Don't look further if the event was already processed...
        if (headers.get(tagHeaderName) != null) {
            return event;
        }
        // Mark it as processed otherwise...
        headers.put(tagHeaderName, "");
    }
    // Continue processing of event...
    return event;
}
I encountered a similar issue:
When a sink write fails, the Kafka Source still holds the data that has already been processed by the interceptors. On the next attempt, that data is sent to the interceptors and gets processed again and again. From reading the KafkaSource code, I believe this is a bug.
My interceptor strips some information from the original message and modifies it. Due to this bug, the retry mechanism will never work as expected.
So far, there is no easy solution.