End-to-end Exactly-once processing in Apache Flink - apache-kafka

Apache Flink guarantees exactly-once processing upon failure and recovery by resuming the job from a checkpoint, the checkpoint being a consistent snapshot of the distributed data stream and operator state (Chandy-Lamport algorithm for distributed snapshots). This guarantees exactly-once processing upon failover.
During normal cluster operation, how does Flink guarantee exactly-once processing? For instance, given a Flink source that reads from an external system (say Kafka), how does Flink guarantee that each event is read exactly once from the source? Is there any kind of application-level acking between the event source and the Flink source? Also, how does Flink guarantee that events are propagated exactly once from upstream operators to downstream operators? Does that require some kind of acking for received events as well?

Flink does not guarantee that every event is read once from the sources. Instead, it guarantees that every event affects the managed state exactly once.
Checkpoints include the source offsets, and during a checkpoint restore, the sources are rewound and some events may be replayed. That's fine, because the checkpoint included the state throughout the job that had resulted from reading everything up to the offsets that were stored in the checkpoint, and nothing beyond those offsets.
Thus Flink's exactly-once guarantee requires replayable sources. Exactly-once messaging between operators depends on TCP.
Guaranteeing that the sinks don't receive duplicated results further requires transactional sinks. Flink commits transactions as part of checkpointing.
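To make the "read more than once, affect state exactly once" point concrete, here is a minimal sketch in plain Python. It is not Flink code: `CountingJob`, `Checkpoint`, and the in-memory `log` are hypothetical stand-ins for an operator, a checkpoint, and a replayable source such as a Kafka partition.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Consistent snapshot: a source offset plus the state built from all events before it."""
    offset: int
    state: dict


@dataclass
class CountingJob:
    """Toy single-operator job that counts events per key (hypothetical, not the Flink API)."""
    log: list                                   # replayable source
    offset: int = 0                             # next position to read
    state: dict = field(default_factory=dict)   # managed state: count per key
    reads: int = 0                              # total number of reads, including replays

    def process(self, n):
        """Read and apply up to n events from the current offset."""
        for _ in range(n):
            if self.offset >= len(self.log):
                return
            key = self.log[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            self.reads += 1

    def checkpoint(self):
        return Checkpoint(self.offset, dict(self.state))

    def restore(self, cp):
        """Rewind the source and roll the state back to the snapshot."""
        self.offset = cp.offset
        self.state = dict(cp.state)


log = ["a", "b", "a", "c", "a", "b"]
job = CountingJob(log)
job.process(3)
cp = job.checkpoint()
job.process(2)         # progress made after the checkpoint...
job.restore(cp)        # ...is discarded by a simulated failure and restore
job.process(len(log))  # the two lost events are replayed, then the rest is read
```

Two events are read twice (`job.reads` ends up larger than `len(log)`), yet the final counts are the same as in a failure-free run, because the state was rolled back to exactly the point the restored offset describes.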


Is it possible to configure/code a Kafka consumer application for "Exactly Once" failure recovery w/o calling Producer methods?

Is it possible to configure/code a Kafka consumer application to unilaterally implement "Exactly Once Semantics" to handle failure recovery (i.e., resume where left off after a comm failure, etc) independent of producer code (calling KafkaProducer methods, etc)?
After some googling, it appears all the "Exactly Once Semantics" (EOS) demos I've found (at least so far) involve calling methods on both producer and consumer instances within the same application to accomplish this.
Here's an example: https://www.baeldung.com/kafka-exactly-once
Can an independent consumer/client application be configured for EOS failure recovery/resume - independent of producer code (i.e., calling KafkaProducer methods, etc)?
If so, can you point me to an example?
No, an independent consumer cannot be configured to consume messages from Kafka exactly once.
You can have either "at-most-once" or "at-least-once". Making it exactly-once depends heavily on what the consumer does with the data and on how and when you commit the offsets back to Kafka.
You would have to implement this on your own. As an example, you could look at the implementation of Spark Structured Streaming (also: the spark-sql-kafka library), which makes use of write-ahead logs to ensure exactly-once semantics.
Although the other answer is correct, I would state this briefly in a slightly different way:
the target / sink needs to be idempotent (a KV store, or an upsert to something like Kudu)
and the source replayable.
This blog post explains it well, IMHO: https://www.waitingforcode.com/apache-spark-structured-streaming/fault-tolerance-apache-spark-structured-streaming/read. Quoting from it:
Indeed, neither the replayable source nor commit log don't guarantee
exactly-once processing itself. What if the batch commit fails ? As
told previously, the engine will detect the last committed offsets as
offsets to reprocess and output once again the processed data to the
sink. It'll obviously lead to a duplicated output. But it'd be the
case only when the writes and the sink aren't idempotent.
An idempotent write is the one that generates the same written data
for given input. The idempotent sink is the one that writes given
generated row only once, even if it's sent multiple times. A good
example of such sink are key-value data stores. Now, if the writer is
idempotent, obviously it generates the same keys every time and since
the row identification is key-based, the whole process is idempotent.
Together with replayable source it guarantees exactly-once end-2-end
As a native English speaker I'm not 100% sure the "don't" is correct, but I think we get the drift.
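The difference between an idempotent and a non-idempotent sink under replay can be sketched in a few lines of Python. `AppendSink` and `UpsertSink` are hypothetical in-memory stand-ins (for, say, a log file and a key-value store), and using the record offset as the key is an assumption made for illustration.

```python
class AppendSink:
    """Non-idempotent sink: every write adds a new row."""
    def __init__(self):
        self.rows = []

    def write(self, key, value):
        self.rows.append((key, value))


class UpsertSink:
    """Idempotent sink: writing the same key again overwrites the row."""
    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value


def run_batch(records, sink):
    """Idempotent write: the same input always yields the same key and value."""
    for offset, payload in records:
        sink.write(offset, payload.upper())


batch = [(0, "x"), (1, "y"), (2, "z")]
append_sink, upsert_sink = AppendSink(), UpsertSink()
for sink in (append_sink, upsert_sink):
    run_batch(batch, sink)
    run_batch(batch, sink)  # replay of the same batch after a failed commit
```

After the replay, `append_sink.rows` holds every row twice, while `upsert_sink.rows` holds each key once. That is the sense in which a replayable source plus an idempotent sink gives an end-to-end exactly-once result.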

How to ensure exactly once semantics while processing kafka messages in Apache Storm

I need exactly-once delivery in my app. I explored Kafka and realised that to have each message produced exactly once, I have to set enable.idempotence=true in the producer config. This also sets acks=all, making the producer resend a message until all in-sync replicas have committed it. To ensure that the consumer neither processes a message twice nor leaves any message unprocessed, it is advised to commit the processing output and the offset to an external database in the same database transaction, so that either both are persisted or neither is, avoiding both duplicate and missed processing.
In the consumer, a message is left unprocessed if the consumer first commits it but fails before processing it, and a message is processed more than once if the consumer first processes it but fails before committing it.
Q1. Now I am wondering how I can imitate the same with Apache Storm. I guess exactly-once production of messages can be ensured by setting enable.idempotence=true in KafkaBolt. Am I right?
I am also wondering how I can avoid missed and duplicate message processing in Storm. For example, this doc page says that if I anchor a tuple (by passing it as the first parameter to OutputCollector.emit()) and then pass the tuple to OutputCollector.ack() or OutputCollector.fail(), Storm will avoid data loss. This is exactly what it says:
Now that you understand the reliability algorithm, let's go over all the failure cases and see how in each case Storm avoids data loss:
A tuple isn't acked because the task died: In this case the spout tuple ids at the root of the trees for the failed tuple will time out and be replayed.
Acker task dies: In this case all the spout tuples the acker was tracking will time out and be replayed.
Spout task dies: In this case the source that the spout talks to is responsible for replaying the messages. For example, queues like Kestrel and RabbitMQ will place all pending messages back on the queue when a client disconnects.
Q2. I guess this ensures that message is not left unprocessed, but does not avoid duplicate processing of messages. Am I correct with this? Also is there anything else that Storm offers to ensure exactly once semantics like kafka that I am missing?
Regarding Q1: Yes, you can get the same behavior from the KafkaBolt by setting that property; the KafkaBolt simply wraps a KafkaProducer.
Regarding semantics on the consuming side, you have the same options with Storm as you do with Kafka. When you read a message from Kafka, you can choose to commit before or after you do your processing (e.g. write to a database). If you do it before, and the program crashes, you will lose the message. Let's call this at-most-once processing. If you do it after, you risk processing the same message twice if the program crashes after the processing but before the commit, called at-least-once processing.
So, regarding Q2: Yes, using anchored tuples and acking will give you at-least-once semantics. Not anchoring tuples would give you at-most-once.
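The two orderings can be simulated without Kafka or Storm. In this sketch the `log` list, the `committed` dict, and the `Crash` exception are hypothetical stand-ins for a topic partition, the committed offset, and a process failure between the two steps.

```python
class Crash(Exception):
    """Simulated process failure between commit and processing."""


def consume(log, committed, processed, commit_first, crash_at=None):
    """Consume from the committed offset; optionally crash between the two steps."""
    while committed["offset"] < len(log):
        offset = committed["offset"]
        if commit_first:
            committed["offset"] = offset + 1    # commit, then process
            if offset == crash_at:
                raise Crash
            processed.append(log[offset])
        else:
            processed.append(log[offset])       # process, then commit
            if offset == crash_at:
                raise Crash
            committed["offset"] = offset + 1


def run(log, commit_first):
    """Run once with a crash at offset 1, then restart from the committed offset."""
    committed, processed = {"offset": 0}, []
    try:
        consume(log, committed, processed, commit_first, crash_at=1)
    except Crash:
        pass
    consume(log, committed, processed, commit_first)
    return processed


at_most_once = run(["m0", "m1", "m2"], commit_first=True)
at_least_once = run(["m0", "m1", "m2"], commit_first=False)
```

With commit-first, `m1` is never processed (at-most-once); with process-first, `m1` is processed twice (at-least-once).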
Yes, there is something else Storm offers to ensure exactly once semantics called Trident, but it requires you to write your topology differently, and your data store has to be adapted to it so message deduplication can happen. See the documentation at https://storm.apache.org/releases/2.0.0/Trident-tutorial.html.
Also, just to caution you: when documentation for Storm (or Kafka) talks about exactly-once semantics, some assumptions are made about what kind of processing you'll do. For example, when Storm's Trident docs talk about exactly-once, there's an assumption that you'll adapt your database so that you can decide, given a message, whether it has already been stored. When Kafka's documentation talks about exactly-once, the assumption is that your processing will read from Kafka, do some computation (most likely with no side effects), and write back to Kafka.
This is just to say that for some types of processing, you may still need to pick between at-least-once and at-most-once. If you can make your processing idempotent, at-least-once is a good option.
Finally, if your processing fits the "read from Kafka, do computation, write to Kafka" model, you can likely get nicer semantics out of Kafka Streams than Storm, as Storm can't provide the exactly-once semantics Kafka can provide in that case.

Apache NiFi & Kafka Integration

I am not sure whether this question has already been addressed somewhere, but I couldn't find a helpful answer anywhere on the internet.
I am trying to integrate Apache NiFi with Kafka, consuming data from Kafka using Apache NiFi. Below are a few questions that come to mind before proceeding.
Q-1) Our use case is: read data from Kafka in real time, parse the data, do some basic validations on it, and then push it to HBase. I know Apache NiFi is the right candidate for this kind of processing, but how easy is it to build the workflow if the JSON we are processing is complex? We initially thought of doing this in Java code, but later realised it can be done with minimum effort in NiFi. Please note that 80% of the data we process from Kafka would be simple JSON, but 20% would be complex (involving arrays).
Q-2) The trickiest part of writing a Kafka consumer is handling offsets properly. How will Apache NiFi handle offsets while consuming from Kafka topics? How would offsets be committed properly if a rebalance is triggered during processing? Frameworks like Spring-Kafka provide options to commit offsets (to some extent) if a rebalance is triggered in the middle of processing. How does NiFi handle this?
I have deployed a number of pipelines in a 3-node NiFi cluster in production, one of which is similar to your use case.
Q-1) It's simple to build a pipeline for your use case. Since you didn't mention the types of tasks involved in processing the JSON, I'm assuming generic tasks. Generic tasks involving JSON include schema validation, which can be done with the ValidateRecord processor; transformation, using the JoltTransformRecord processor; extraction of attribute values, using EvaluateJsonPath; and conversion of JSON to some other format, say Avro, using the ConvertJSONToAvro processor.
NiFi gives you the flexibility to scale each stage/processor in the pipeline independently. For example, if transformation using JoltTransformRecord is time consuming, you can scale it to run N concurrent tasks on each node by configuring Concurrent Tasks under the Scheduling tab.
Q-2) As far as the ConsumeKafka_2_0 processor is concerned, offset management is handled by committing the NiFi processor session first and then the Kafka offsets, which means we have an at-least-once guarantee by default.
When Kafka triggers a rebalance of consumers for a given partition, the processor quickly commits (processor session and Kafka offsets) whatever it has received and returns the consumer to the pool for reuse.
ConsumeKafka_2_0 handles committing offsets when the members of the consumer group change or the subscription of the members changes. This can occur when processes die, new process instances are added, or old instances come back to life after a failure. It also handles cases where the number of partitions of a subscribed topic is administratively adjusted.

KafkaIO checkpoint - how to commit offsets to Kafka

I'm running a job using the Beam KafkaIO source in Google Dataflow and cannot find an easy way to persist offsets across job restarts (the job update option is not enough; I need to restart the job).
Comparing Beam's KafkaIO against PubSubIO (or to be precise comparing PubsubCheckpoint with KafkaCheckpointMark) I can see that checkpoint persistence is not implemented in KafkaIO (KafkaCheckpointMark.finalizeCheckpoint method is empty) whereas it's implemented in PubsubCheckpoint.finalizeCheckpoint which does acknowledgement to PubSub.
Does this mean I have no means of reliably managing Kafka offsets on job restarts with minimum effort?
Options I have considered so far:
1. Implement my own logic for persisting offsets. This sounds complicated, and I'm using Beam through Scio in Scala.
2. Do nothing, but that would result in many duplicates on job restarts (the topic has a 30-day retention period).
3. Enable auto-commit, but that would result in lost messages, which is even worse.
There are two options: enable commitOffsetsInFinalize() in KafkaIO, or alternatively enable auto-commit in the Kafka consumer configuration. Note that while commitOffsetsInFinalize() is more in sync with what has been processed in Beam than Kafka's auto-commit, it does not provide strong exactly-once processing guarantees. Imagine a two-stage pipeline: Dataflow finalizes the Kafka reader after the first stage, without waiting for the second stage to complete. If you restart the pipeline from scratch at that time, you would not process the records that completed the first stage but had not yet been processed by the second. The issue is no different for PubsubIO.
Regarding option (2): you can configure KafkaIO to start reading from a specific timestamp (assuming the Kafka server supports it, i.e. version 0.10+). But that does not look any better than enabling auto_commit.
That said, KafkaIO should support finalize. It might be simpler to use than enabling auto_commit (where you need to think about frequency etc.). We haven't had many users asking for it. Please mention it on user@beam.apache.org if you can.
[Update: I am adding support for committing offsets to KafkaCheckpointMark in PR 4481]

Spark/Spark Streaming in production without HDFS

I have been developing applications using Spark/Spark Streaming, but so far I have always used HDFS for file storage. However, I have reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.
Depending on the streaming source (I've been using Kafka), you do not need to use checkpoints etc.
Since Spark 1.3 there has been a direct approach with several benefits:
Simplified Parallelism: No need to create multiple input Kafka streams
and union-ing them. With directStream, Spark Streaming will create as
many RDD partitions as there is Kafka partitions to consume, which
will all read data from Kafka in parallel. So there is one-to-one
mapping between Kafka and RDD partitions, which is easier to
understand and tune.
Efficiency: Achieving zero-data loss in the first approach required
the data to be stored in a Write Ahead Log, which further replicated
the data. This is actually inefficient as the data effectively gets
replicated twice - once by Kafka, and a second time by the Write Ahead
Log. This second approach eliminate the problem as there is no
receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka’s high level API
to store consumed offsets in Zookeeper. This is traditionally the way
to consume data from Kafka. While this approach (in combination with
write ahead logs) can ensure zero data loss (i.e. at-least once
semantics), there is a small chance some records may get consumed
twice under some failures. This occurs because of inconsistencies
between data reliably received by Spark Streaming and offsets tracked
by Zookeeper. Hence, in this second approach, we use simple Kafka API
that does not use Zookeeper and offsets tracked only by Spark
Streaming within its checkpoints. This eliminates inconsistencies
between Spark Streaming and Zookeeper/Kafka, and so each record is
received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more here:
Approach 2.
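
The one-to-one mapping and the "offsets tracked by the framework" idea from the quoted text can be sketched in plain Python. This is not Spark code: `OffsetRange` here is a hypothetical named tuple, and `plan_batch` with its `max_per_partition` cap is an assumption made for illustration.

```python
from typing import NamedTuple


class OffsetRange(NamedTuple):
    """Slice of one Kafka partition that a single batch partition will read."""
    topic: str
    partition: int
    from_offset: int    # inclusive
    until_offset: int   # exclusive


def plan_batch(topic, consumed, latest, max_per_partition):
    """Build one offset range per Kafka partition, starting where the last batch ended."""
    ranges = []
    for partition in sorted(latest):
        start = consumed.get(partition, 0)
        end = min(latest[partition], start + max_per_partition)
        ranges.append(OffsetRange(topic, partition, start, end))
    return ranges


consumed = {0: 10, 1: 4}         # offsets recorded by the framework, not by ZooKeeper
latest = {0: 25, 1: 6, 2: 3}     # newest offsets currently available per partition
batch = plan_batch("events", consumed, latest, max_per_partition=10)
replayed = plan_batch("events", consumed, latest, max_per_partition=10)
```

Because a batch is fully described by its offset ranges, recomputing it after a failure reads exactly the same records again, which is what lets the framework avoid the mismatch between received data and externally tracked offsets.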