Update state in a Kafka stream chain without using Kafka Streams in an EOS way - apache-kafka

I am currently working on the deployment of a distributed stream-processing chain using Kafka, but not the Kafka Streams library. I've created a kind of node which can be executed, takes a topic as input, processes the data it receives, and sends the result to an output topic. Each node is a simple consumer/producer pair associated with a single upstream partition. The producer is idempotent and the processing is done in a transactional context, as follows:
producer.initTransactions();
try {
    producer.beginTransaction();
    // process the polled records and send the results to the output topic
    producer.commitTransaction();
} catch (KafkaException e) {
    producer.abortTransaction();
}
I also use the producer.sendOffsetsToTransaction() method to ensure an atomic offset commit for the consumer.
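For context, that call sits inside the same transaction, roughly like this (a simplified sketch, not my exact code: records comes from consumer.poll(), process() is a placeholder for the actual processing, and the sendOffsetsToTransaction overload shown uses the ConsumerGroupMetadata available in Kafka 2.5+):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;

try {
    producer.beginTransaction();
    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
    for (ConsumerRecord<String, String> record : records) {
        // process the record and send the result to the output topic
        producer.send(new ProducerRecord<>("output-topic", record.key(), process(record.value())));
        // remember the next offset to consume for this partition
        offsets.put(new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1));
    }
    // commit the consumer offsets as part of the same transaction
    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
    producer.commitTransaction();
} catch (KafkaException e) {
    producer.abortTransaction();
}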
I would like to use a key-value store to keep the state of my nodes (I was thinking about MapDB, which looks simple to use).
But I wonder: if I update my state inside the transaction, with a map.put(key, value) for example, will the transaction ensure that the state is updated exactly once?
Thank you very much

Kafka only promises exactly-once for its own components, i.e. when I produce X to output-topic, I will also commit X's offset on input-topic. Either both succeed or both fail, i.e. the operation is atomic.
So whatever you do between consuming and producing is entirely up to you to make exactly-once, unless you use the state store provided by Kafka itself, which is available if you use Kafka Streams.
If you cannot switch to Kafka Streams, it is still possible to ensure exactly-once yourself if you track Kafka's offsets in MapDB and add sufficient checks.
For example, assuming you are trying to do deduplication here:
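A rough sketch of that idea, assuming MapDB 3.x, keyed by partition number with the last processed offset as the value ("state.db" and "last-offsets" are just placeholder names, and records comes from your consumer.poll() call):

import java.util.concurrent.ConcurrentMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

// Open (or create) a file-backed MapDB store with explicit commits.
DB db = DBMaker.fileDB("state.db").transactionEnable().make();
ConcurrentMap<Integer, Long> lastOffsets = db
        .hashMap("last-offsets", Serializer.INTEGER, Serializer.LONG)
        .createOrOpen();

for (ConsumerRecord<String, String> record : records) {
    Long last = lastOffsets.get(record.partition());
    if (last != null && record.offset() <= last) {
        continue; // this offset was already processed, skip the duplicate
    }
    // ... process the record and produce the result ...
    lastOffsets.put(record.partition(), record.offset());
}
db.commit(); // persist the updated offsets in MapDB

Before processing a record you check the stored offset, and after processing you update it and commit the MapDB transaction.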
This is just one way of doing things, and it assumes that whatever you put in MapDB is committed right away. Even if it is not, you can always consult the "source of truth", which here is the topics themselves, and reconstruct the lost data.

Related

Will Kafka Streams guarantee at-least-once processing in stateful processors even when exactly-once is disabled?

This question comes to mind because we are running Kafka Streams applications without EOS enabled due to infra constraints. We are unsure of the behavior when doing custom logic using the transformer/processor API with changelogged state stores.
Say we are using the following topology to de-duplicate records before sending them downstream:
[topic] -> [flatTransformValues + state store] -> [...(downstream)]
The transformer here compares incoming records against the state store and only forwards + updates the record when there is a value change, so for messages [A:1], [A:1], [A:2], we expect downstream to only get [A:1], [A:2].
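For reference, a rough sketch of what such a transformer looks like (simplified, not my exact code; "dedup-store" is a placeholder store name):

import java.util.Collections;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class DedupTransformer implements ValueTransformerWithKey<String, String, Iterable<String>> {
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        // "dedup-store" is a changelogged state store attached to this transformer
        store = (KeyValueStore<String, String>) context.getStateStore("dedup-store");
    }

    @Override
    public Iterable<String> transform(String key, String value) {
        String previous = store.get(key);
        if (value.equals(previous)) {
            return Collections.emptyList();          // unchanged value: drop it
        }
        store.put(key, value);                       // update the store...
        return Collections.singletonList(value);     // ...and forward downstream
    }

    @Override
    public void close() {}
}

It is wired in with stream.flatTransformValues(DedupTransformer::new, "dedup-store"), with the store created with changelogging enabled.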
The question is: when failures happen, is it possible that [A:2] gets stored in the state store's changelog while downstream does not receive the message, so that any retry reading [A:2] will discard the record and it is lost forever?
If not, please tell me what mechanism prevents this from happening. One way I think it could work is if Kafka Streams produces to the changelog topics and commits offsets only after producing to downstream succeeds?
Much appreciated!

Is it possible to configure/code a Kafka consumer application for "Exactly Once" failure recovery w/o calling Producer methods?

Is it possible to configure/code a Kafka consumer application to unilaterally implement "Exactly Once Semantics" to handle failure recovery (i.e., resume where left off after a comm failure, etc) independent of producer code (calling KafkaProducer methods, etc)?
After some googling, it appears all the "Exactly Once Semantics" (EOS) demos I've found (at least so far) involve calling methods on both producer and consumer instances within the same application to accomplish this.
Here's an example: https://www.baeldung.com/kafka-exactly-once
Can an independent consumer/client application be configured for EOS failure recovery/resume - independent of producer code (i.e., calling KafkaProducer methods, etc)?
If so, can you point me to an example?
No, an independent consumer can not be configured to consume messages from Kafka exactly-once.
You can either have it as "at-most-once" or "at-least-once". Making it exactly-once highly depends on what the consumer is doing with the data and how and when you commit the messages back to Kafka.
You would have to implement this on your own. As an example, you could have a look at the implementation of Spark Structured Streaming (also: the spark-sql-kafka library), which makes use of write-ahead logs in order to ensure exactly-once semantics.
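To make the "how and when you commit" point concrete, here is a minimal at-least-once sketch that commits offsets only after processing succeeds (a generic illustration; processRecord() and the topic, group and broker names stand in for your own logic and configuration):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    processRecord(record); // your processing / write to the sink
                }
                // Commit only after processing: a crash before this line means the batch
                // is re-read and re-processed (at-least-once, duplicates are possible).
                consumer.commitSync();
            }
        }
    }

    private static void processRecord(ConsumerRecord<String, String> record) {
        // placeholder for the actual work
    }
}

Committing before processing instead would give you at-most-once.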
Although the other answer is correct, I would state it briefly in a slightly different fashion:
the target / sink needs to be idempotent (a KV store, or an UPSERT into something like Kudu)
and the source replayable.
Quoting from this blog, which explains it well IMHO: https://www.waitingforcode.com/apache-spark-structured-streaming/fault-tolerance-apache-spark-structured-streaming/read:
"...
Indeed, neither the replayable source nor commit log don't guarantee
exactly-once processing itself. What if the batch commit fails ? As
told previously, the engine will detect the last committed offsets as
offsets to reprocess and output once again the processed data to the
sink. It'll obviously lead to a duplicated output. But it'd be the
case only when the writes and the sink aren't idempotent.
An idempotent write is the one that generates the same written data
for given input. The idempotent sink is the one that writes given
generated row only once, even if it's sent multiple times. A good
example of such sink are key-value data stores. Now, if the writer is
idempotent, obviously it generates the same keys every time and since
the row identification is key-based, the whole process is idempotent.
Together with replayable source it guarantees exactly-once end-2-end
processing.
..."
As a native English speaker I'm not 100% sure the "don't" is correct, but I think we can get the drift.
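To make the key-value / upsert idea concrete, an idempotent JDBC write might look like this (a sketch using Postgres-style ON CONFLICT; connection, key, value, and the table and column names are all assumed to exist in your own code):

import java.sql.PreparedStatement;

// Writing the same (key, value) pair twice leaves the table in the same state,
// which is what makes replays after a failure harmless.
String upsert = "INSERT INTO sink_table (record_key, record_value) VALUES (?, ?) "
              + "ON CONFLICT (record_key) DO UPDATE SET record_value = EXCLUDED.record_value";
try (PreparedStatement stmt = connection.prepareStatement(upsert)) {
    stmt.setString(1, key);
    stmt.setString(2, value);
    stmt.executeUpdate();
}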

How to ensure exactly once semantics while processing kafka messages in Apache Storm

I needed exactly-once delivery in my app. I explored Kafka and realised that to have messages produced exactly once, I have to set enable.idempotence=true in the producer config. This also sets acks=all, making the producer resend messages until all in-sync replicas have acknowledged them. To ensure that the consumer neither processes duplicates nor leaves any message unprocessed, it is advised to commit the processing output and the offset to an external database in the same database transaction, so that either both are persisted or neither is, avoiding duplicate and missed processing.
In the consumer, a message is left unprocessed if the consumer first commits it but fails before processing it, and a message is processed more than once if the consumer first processes it but fails before committing it.
Q1. Now I was wondering how I can imitate the same with Apache Storm. I guess exactly-once production of messages can be ensured by setting enable.idempotence=true in the KafkaBolt's producer properties. Am I right?
I was also wondering how I can avoid missed and duplicate message processing in Storm. For example, this doc page says that if I anchor a tuple (by passing it as the first parameter to OutputCollector.emit()) and then pass the tuple to OutputCollector.ack() or OutputCollector.fail(), Storm will avoid data loss. This is what it says exactly:
Now that you understand the reliability algorithm, let's go over all the failure cases and see how in each case Storm avoids data loss:
A tuple isn't acked because the task died: In this case the spout tuple ids at the root of the trees for the failed tuple will time out and be replayed.
Acker task dies: In this case all the spout tuples the acker was tracking will time out and be replayed.
Spout task dies: In this case the source that the spout talks to is responsible for replaying the messages. For example, queues like Kestrel and RabbitMQ will place all pending messages back on the queue when a client disconnects.
Q2. I guess this ensures that messages are not left unprocessed, but it does not avoid duplicate processing of messages. Am I correct about this? Also, is there anything else that Storm offers to ensure exactly-once semantics, like Kafka does, that I am missing?
Regarding Q1: Yes, you can get the same behavior from the KafkaBolt by setting that property, the KafkaBolt simply wraps a KafkaProducer.
Regarding semantics on the consuming side, you have the same options with Storm as you do with Kafka. When you read a message from Kafka, you can choose to commit before or after you do your processing (e.g. write to a database). If you do it before, and the program crashes, you will lose the message. Let's call this at-most-once processing. If you do it after, you risk processing the same message twice if the program crashes after the processing but before the commit, called at-least-once processing.
So, regarding Q2: Yes, using anchored tuples and acking will provide you with at-least-once semantics. Not using anchored tuples would give you at-most-once.
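For illustration, anchoring and acking in a bolt looks roughly like this (a sketch using Storm 2.x signatures; the field name and the toUpperCase() processing are placeholders):

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AnchoredBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String result = input.getStringByField("value").toUpperCase(); // placeholder processing
            collector.emit(input, new Values(result)); // anchor the output to the input tuple
            collector.ack(input);                      // mark the input as fully processed
        } catch (Exception e) {
            collector.fail(input);                     // trigger a replay from the spout
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("value"));
    }
}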
Yes, there is something else Storm offers to ensure exactly once semantics called Trident, but it requires you to write your topology differently, and your data store has to be adapted to it so message deduplication can happen. See the documentation at https://storm.apache.org/releases/2.0.0/Trident-tutorial.html.
Also just to caution you: when documentation for Storm (or Kafka) talks about exactly-once semantics, there are some assumptions made about what kind of processing you'll do. For example, when Storm's Trident docs talk about exactly-once, there's an assumption that you'll adapt your database so that, given a message, you can decide whether it has already been stored. When Kafka's documentation talks about exactly-once, the assumption is that your processing will be reading from Kafka, doing some computation (most likely with no side effects) and writing back to Kafka.
This is just to say that for some types of processing, you may still need to pick between at-least-once and at-most-once. If you can make your processing idempotent, at-least-once is a good option.
Finally if your processing fits the "read from Kafka, do computation, write to Kafka" model, you can likely get nicer semantics out of Kafka Streams than Storm, as Storm can't provide the exactly-once semantics Kafka can provide in that case.

When to use Kafka transactional API?

I was trying to understand Kafka's transactional API. This link defines the atomic read-process-write cycle as follows:
First, let’s consider what an atomic read-process-write cycle means. In a nutshell, it means that if an application consumes a message A at offset X of some topic-partition tp0, and writes message B to topic-partition tp1 after doing some processing on message A such that B = F(A), then the read-process-write cycle is atomic only if messages A and B are considered successfully consumed and published together, or not at all.
It further says the following:
Using vanilla Kafka producers and consumers configured for at-least-once delivery semantics, a stream processing application could lose exactly once processing semantics in the following ways:
The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.
We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
Finally, in distributed environments, applications will crash or—worse!—temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.”
We designed transaction APIs in Kafka to solve the second and third problems. Transactions enable exactly-once processing in read-process-write cycles by making these cycles atomic and by facilitating zombie fencing.
Doubts:
Points 2 and 3 above describe when message duplication can occur, which is dealt with using the transactional API. Does the transactional API also help to avoid message loss in any scenario?
Most online examples of the Kafka transactional API (for example, here and here) involve:
while (true) {
    ConsumerRecords records = consumer.poll(Long.MAX_VALUE);
    producer.beginTransaction();
    for (ConsumerRecord record : records)
        producer.send(producerRecord("outputTopic", record)); // producerRecord(...) is a helper from the example
    producer.sendOffsetsToTransaction(currentOffsets(consumer), group); // currentOffsets(...) likewise
    producer.commitTransaction();
}
This is basically a read-process-write loop. So is the transactional API useful only in a read-process-write loop?
This article gives an example of the transactional API in a non-read-process-write scenario:
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record1);
    producer.send(record2);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    producer.close();
} catch (KafkaException e) {
    producer.abortTransaction();
}
It says:
This allows a producer to send a batch of messages to multiple partitions such that either all messages in the batch are eventually visible to any consumer or none are ever visible to consumers.
Is this example correct, and does it show another way to use the transactional API, different from the read-process-write loop? (Note that it also does not commit offsets to the transaction.)
In my application, I simply consume messages from Kafka, do some processing and log them to the database. That is my whole pipeline.
a. So, I guess this is not a read-process-write cycle. Is the Kafka transactional API of any use to my scenario?
b. Also, I need to ensure that each message is processed exactly once. I guess setting enable.idempotence=true in the producer will suffice and I don't need the transactional API, right?
c. I may run multiple instances of the pipeline, but I am not writing the processing output to Kafka. So I guess this will never involve zombies (duplicate producers writing to Kafka). So, I guess the transactional API won't help me to avoid the duplicate-processing scenario, right? (I might have to persist both the offset and the processing output to the database in the same database transaction, and read the offset back on restart to avoid duplicate processing.)
a. So, I guess this is not a read-process-write cycle. Is the Kafka transactional API of any use to my scenario?
It is a read-process-write cycle, except you are writing to a database instead of Kafka. Kafka has its own transaction manager, so writing inside a transaction with idempotency enables exactly-once processing, assuming you can resume the state of your consume-write processor correctly. You cannot do that with a DB because the DB's transaction manager doesn't sync with Kafka's. What you can do instead is make sure that even if Kafka transactions are not atomic with respect to your database, they are still eventually consistent.
Let's assume your consumer reads, writes to the DB and then acks. If the DB fails you don't ack and you can resume normally based on the offset. If the ack fails you will process twice and save to the DB twice. If you can make this operation idempotent, then you are safe. This means that your processor must be pure and the DB has to dedupe: processing the same message twice should always lead to the same result on the DB.
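One way to sketch that dedupe/idempotence, assuming a JDBC sink: persist the result keyed by (topic, partition, offset) in a single database transaction, so that reprocessing the same record after a crash overwrites the same row instead of creating a duplicate (the table, column and method names here are made up; the SQL uses Postgres-style ON CONFLICT):

import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Persist the processing result together with the source coordinates in ONE DB transaction.
// Because the row is keyed by (topic, partition, offset), a replayed record simply
// overwrites the same row, which makes the write idempotent.
void persist(Connection conn, ConsumerRecord<String, String> record, String result) throws Exception {
    conn.setAutoCommit(false);
    try (PreparedStatement stmt = conn.prepareStatement(
            "INSERT INTO processed (topic, partition, kafka_offset, result) VALUES (?, ?, ?, ?) "
          + "ON CONFLICT (topic, partition, kafka_offset) DO UPDATE SET result = EXCLUDED.result")) {
        stmt.setString(1, record.topic());
        stmt.setInt(2, record.partition());
        stmt.setLong(3, record.offset());
        stmt.setString(4, result);
        stmt.executeUpdate();
        conn.commit();
    } catch (Exception e) {
        conn.rollback();
        throw e;
    }
}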
b. Also, I need to ensure that each message is processed exactly once. I guess setting enable.idempotence=true in the producer will suffice and I don't need the transactional API, right?
Assuming that you respect the requirements from point a, exactly-once processing with persistence in a different store also requires that, between your initial write and the duplicate, no other change has happened to the objects you are saving. Imagine a value written as X, then some other actor changes it to Y, then the message is reprocessed and changes it back to X. This can be avoided, for example, by making your database table a log, similar to a Kafka topic.
c. I may run multiple instances of the pipeline, but I am not writing the processing output to Kafka. So I guess this will never involve zombies (duplicate producers writing to Kafka). So, I guess the transactional API won't help me to avoid the duplicate-processing scenario, right? (I might have to persist both the offset and the processing output to the database in the same database transaction, and read the offset back on restart to avoid duplicate processing.)
It is the producer writing to the topic you consume from that may create zombie messages. That producer needs to play nicely with Kafka so that zombies are ignored. The transactional API, together with your consumer, will make sure that this producer writes atomically and that your consumer reads committed messages, albeit not atomically. If you want exactly-once, idempotency is enough. If the messages are supposed to be written atomically, you need transactions too. Either way your read-write/consume-produce processor needs to be pure and you have to dedupe. Your DB is also part of this processor, since the DB is the one that actually persists.
I've looked around a bit on the internet; maybe this link helps you: processing guarantees
The links you posted, exactly once semantics and transactions in kafka, are great.

kafka Java API Consumer and producer Offset value comparison?

I have a requirement to match the Kafka producer offset value to the consumer offset using the Java API.
I am new to Kafka. Could anyone suggest how to proceed with this?
Depending on your exact use case there are a couple of ways that you could go about this, but all of them will require an external system.
First off, Confluent offers the Confluent Control Center as part of their commercial offering; this would probably be the easiest way to go about it, if you are willing to spend the money.
If that is not for you, then you'd need to implement some sort of system to keep track of what you are producing and what you are consuming. For example, you could simply use a database, take topic, partition and offset as the primary key, and have columns for produced_at and consumed_at.
Every time your producer writes a message to the cluster, you have it update the produced_at column (look at ProducerInterceptor). Same on the consumer side: you could implement an interceptor that confirms having read the message, or confirm from the consumer itself once the message has successfully been processed.
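A rough sketch of the producer-side interceptor (the commented-out trackingStore call is a stand-in for whatever DAO writes to the tracking table described above):

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ProducedAtInterceptor implements ProducerInterceptor<String, byte[]> {

    @Override
    public ProducerRecord<String, byte[]> onSend(ProducerRecord<String, byte[]> record) {
        return record; // pass the record through unchanged
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        if (exception == null && metadata != null) {
            // Record (topic, partition, offset, now()) in the produced_at column.
            // trackingStore is a stand-in for your own persistence code:
            // trackingStore.markProduced(metadata.topic(), metadata.partition(), metadata.offset());
        }
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

You register it when building the producer with props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, ProducedAtInterceptor.class.getName());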
Or, if you don't need every message confirmed, you could just implement regular checkpointing every 10k messages or so and trust that the consumer read everything up to the last offset it confirmed.
There's also the possibility of injecting checkpoint messages into the stream at regular intervals; when the consumer sees one of these it triggers an action. Again, you have to trust the consumer that it got everything in between the checkpoints.
As I said initially, it all depends on your exact use case, if you give us more detail I'm sure we can come up with something that works for you.
Update:
If you want to retrieve the offset after sending a message to Kafka, you need to check the Future that the producer returns from send(); this will contain the offset.
// Send message and store the future
Future<RecordMetadata> messageFuture =
        producer.send(new ProducerRecord<String, byte[]>(topic, serialize(currentMessage)));
producer.flush();

// As flush blocks until all operations have been completed (regardless of success
// or failure), we can be sure that our future is available at this point.
try {
    RecordMetadata metaData = messageFuture.get();
    System.out.println("Sent message with offset: " + metaData.offset());
} catch (Exception e) {
    // do some error handling
}
You can also expose the offsets of the producer and the consumer via Java Management Beans (JMX). Thereby you can do the comparison in real time using the JConsole tool shipped with the JDK.
Read about Gauge to see how to expose the offset position of the producer and the consumer.
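A minimal sketch of such an MBean (the names are made up; you would update the fields from your producer callback and consumer poll loop, then browse myapp:type=OffsetTracker in JConsole):

import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

// OffsetTrackerMBean.java: the management interface that JConsole will display
public interface OffsetTrackerMBean {
    long getLastProducedOffset();
    long getLastConsumedOffset();
}

// OffsetTracker.java: update these fields from your producer callback / consumer loop
public class OffsetTracker implements OffsetTrackerMBean {
    private volatile long lastProducedOffset = -1L;
    private volatile long lastConsumedOffset = -1L;

    public void producedUpTo(long offset) { lastProducedOffset = offset; }
    public void consumedUpTo(long offset) { lastConsumedOffset = offset; }

    @Override public long getLastProducedOffset() { return lastProducedOffset; }
    @Override public long getLastConsumedOffset() { return lastConsumedOffset; }

    public static OffsetTracker register() throws Exception {
        OffsetTracker tracker = new OffsetTracker();
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(tracker, new ObjectName("myapp:type=OffsetTracker"));
        return tracker;
    }
}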