custom Flume interceptor: intercept() method called multiple times for the same Event - apache-kafka

TL;DR
When a Flume source fails to push a transaction to the next channel in the pipeline, does it always keep event instances for the next try?
In general, is it safe to have a stateful Flume interceptor, where processing of events depends on previously processed events?
Full problem description:
I am considering the possibility of leveraging guarantees offered by Apache Kafka regarding the way topic partitions are distributed among consumers in a consumer group to perform streaming deduplication in an existing Flume-based log consolidation architecture.
Using the Kafka Source for Flume and custom routing to Kafka topic partitions, I can ensure that every event that should go to the same logical "deduplication queue" will be processed by a single Flume agent in the cluster (for as long as there are no agent stops/starts within the cluster). I have the following setup using a custom-made Flume interceptor:
[KafkaSource with deduplication interceptor]-->()MemoryChannel)-->[HDFSSink]
It seems that when the Flume Kafka source runner is unable to push a batch of events to the memory channel, the event instances that are part of the batch are passed again to my interceptor's intercept() method. In this case, it was easy to add a tag (in the form of a Flume event header) to processed events to distinguish actual duplicates from events in a failed batch that got re-processed.
However, I would like to know if there is any explicit guarantee that Event instances in failed transactions are kept for the next try or if there is the possibility that events are read again from the actual source (in this case, Kafka) and re-built from zero. In that case, my interceptor will consider those events to be duplicates and discard them, even though they were never delivered to the channel.
EDIT
This is how my interceptor distinguishes an Event instance that was already processed from a non-processed event:
public Event intercept(Event event) {
Map<String,String> headers = event.getHeaders();
// tagHeaderName is the name of the header used to tag events, never null
if( !tagHeaderName.isEmpty() ) {
// Don't look further if event was already processed...
if( headers.get(tagHeaderName)!=null )
return event;
// Mark it as processed otherwise...
else
headers.put(tagHeaderName, "");
}
// Continue processing of event...
}

I encountered the similar issue:
When a sink write failed, Kafka Source still hold the data that has already been processed by interceptors. In next attempt, those data will send to interceptors, and get processed again and again. By reading the KafkaSource's code, I believe it's bug.
My interceptor will strip some information from origin message, and will modify the origin message. Due to this bug, the retry mechanism will never work as expected.
So far, The is no easy solution.

Related

Reconsume Kafka Message that failed during processing due to DB error

I am new to Kafka and would like to seek advice on what is the best practice to handle such scenario.
Scenario:
I have a spring boot application that has a consumer method that is listening for messages via the #KafkaListner annotation. Once an incoming message has occurred, the consumer method will process the message, which simply performs database updates to different tables via JdbcTemplate.
If the updates to the tables are successful, I will manually commit the message by calling the acknowledge() method. If the database update fails, instead of calling the acknowledge() method, I will call the nack() method with a given duration (E.g. 10 seconds) such that the message will reappear again to be consumed.
Things to note
I am not concerned with the ordering of the messages. Whatever event comes I just have to consume and process it, that's all.
I am only given a topic (no retryable topic and no dead letter topic)
Here is the problem
If I do the above method, my consumer becomes inconsistent. Let's say if I call the nack() method with a duration of 1min, meaning to say after 1 min, the same message will reappear.
Within this 1 min, there could "x" number of incoming messages to be consumed and processed. The observation made was none of these messages are getting consumed and processed.
What I want to know
Hence, I hope someone will advise me what I am doing wrongly and what is the best practice / way to handle such scenarios.
Thanks!
Records are always received in order; there is no way to defer the current record until later, but continue to process other records after this one when consuming from a single topic.
Kafka topics are a linear log and not a queue.
You would need to send it to another topic; the #RetryableTopic (non-blocking retrties) feature is specifically designed for this use case.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
You could also increase the container concurrency so at least you could continue to process records from other partitions.

Kafka - understand when all consumers performed seek to LATEST

I have multiple consumers in a consumer group (single Spring Boot app). I use onPartitionsAssigned callback to reset every consumer offset to LATEST (for assigned partitions).
The issue is that I need to know when ALL consumers in a consumer group performed this seek and do some logic AFTER that.
My current understanding is that every consumer performs seek and start processing independently - i.e. there is no any common synchronization point.
It would be great is someone can provide guidance whether this is possible at all.
UPDATE. Let me explain why is it needed.
I have a logic to send HTTP request (to another service) to request some data to be submitted to Kafka. But before sending this request I need to make sure that all consumers are at the latest offset already. Because if this request is sent earlier - some data submitted by this 2nd service may be lost - i.e. if it's submitted before a consumer completed reset to LATEST offset.
There is nothing built in; you could use a CountDownLatch set to the concurrency of the container and count it down for each one.
If the concurrency might change, you can get a reference to the concurrent container from the KafkaListenerEndpointRegistry bean.
int count = ((ConcurrentMessageListenerContainer<?, ?>) registry.getListenerContainer(id))
.getConcurrency();

Update state in a Kafka stream chain without using Kafka Streams in an EOS way

I am currently working on the deployment of a distributed stream process chain using Kafka but not Kafka stream library. I've created a kind of node which can be executed and take as input a topic, process the obtained data and send it to an output topic. The node is a simple consumer/producer couple which is associated to a unique upstream partition. The producer is idempotent, the processing is done in a transaction context such as :
producer.initTransaction();
try
{
producer.beginTransaction();
//process
producer.commitTransaction();
}
catch (KafkaException e)
{
producer.abortTransaction();
}
I also used the producer.sendoffsetstotransaction method to ensure an atomic commit for the consumer.
I would like to use a key-value store for keeping the state of my nodes (i was thinking about MapDB which looks simple to use).
But I wonder if I update my state inside the transaction with a map.put(key, value) for example, will the transaction ensure that the state will be updated exactly-once ?
Thank you very much
Kafka only promises exactly once for its components - i.e. When I produce X to output-topic, I will also commit X to input-topic. Either both succeeds or both fails - i.e. Atomic.
So whatever you do between consuming and producing is totally on you to ensure the exactly-once. UNLESS, you use the state-store provided by Kafka itself. That is available to you if you use Kafka-streams.
If you cannot switch to kafka streams, it is still possible to ensure exactly once yourself if you track kafka's offsets in mapDB and add sufficient checks.
For eg, assuming you are trying to do deduplication here,
This is just one way of doing things - assuming that whatever you put in mapDB is committed right away. Even if not, you can always consult the "source of truth" - which are the topics here - and reconstruct the lost data.

Kafka Consumes unprocessable messages - How to reprocess broken messages later?

We are implementing a Kafka Consumer using Spring Kafka. As I understand correctly if processing of a single message fails, there is the option to
Don't care and just ACK
Do some retry handling using a RetryTemplate
If even this doesn't work do some custom failure handling using a RecoveryCallback
I am wondering what your best practices are for that. I think of simple application exceptions, such as DeserializationException (for JSON formatted messages) or longer local storage downtime, etc. Meaning there is needed some extra work, like a hotfix deployment, to fix the broken application to be able to re-process the faulty messages.
Since losing messages (i. e. not processing them) is not an option for us, the only option left is IMO to store the faulty messages in some persistence store, e. g. another "faulty messages" Kafka topic for example, so that those events can be processed again at a later time and there is no need to stop event processing totally.
How do you handle these scenarios?
One example is Spring Cloud Stream, which can be configured to publish failed messages to another topic errors.foo; users can then copy them back to the original topic to try again later.
This logic is done in the recovery callback.
We have a use case where we can't drop any messages at all, even for faulty messages. So when we encounter a faulty message, we will send a default message in place of that faulty record and at the same time send the message to a failed-topic for retry later.

Concurrent writes for event sourcing on top of Kafka

I've been considering to use Apache Kafka as the event store in an event sourcing configuration. The published events will be associated to specific resources, delivered to a topic associated to the resource type and sharded into partitions by resource id. So for instance a creation of a resource of type Folder and id 1 would produce a FolderCreate event that would be delivered to the "folders" topic in a partition given by sharding the id 1 across the total number of partitions in the topic. Even though I don't know how to handle concurrent events that make the log inconsistent.
The simplest scenario would be having two concurrent actions that can invalidate each other such as one to update a folder and one to destroy that same folder. In that case the partition for that topic could end up containing the invalid sequence [FolderDestroy, FolderUpdate]. That situation is often fixed by versioning the events as explained here but Kafka does not support such feature.
What can be done to ensure the consistency of the Kafka log itself in those cases?
I think it's probably possible to use Kafka for event sourcing of aggregates (in the DDD sense), or 'resources'. Some notes:
Serialise writes per partition, using a single process per partition (or partitions) to manage this. Ensure you send messages serially down the same Kafka connection, and use ack=all before reporting success to the command sender, if you can't afford rollbacks. Ensure the producer process keeps track of the current successful event offset/version for each resource, so it can do the optimistic check itself before sending the message.
Since a write failure might be returned even if the write actually succeeded, you need to retry writes and deal with deduplication by including an ID in each event, say, or reinitialize the producer by re-reading (recent messages in) the stream to see whether the write actually worked or not.
Writing multiple events atomically - just publish a composite event containing a list of events.
Lookup by resource id. This can be achieved by reading all events from a partition at startup (or all events from a particular cross-resource snapshot), and storing the current state either in RAM or cached in a DB.
https://issues.apache.org/jira/browse/KAFKA-2260 would solve 1 in a simpler way, but seems to be stalled.
Kafka Streams appears to provide a lot of this for you. For example, 4 is a KTable, which you can have your event producer use one to work out whether an event is valid for the current resource state before sending it.