I'm running a job using the Beam KafkaIO source on Google Dataflow and cannot find an easy way to persist offsets across job restarts (the job update option is not enough; I need to restart the job).
Comparing Beam's KafkaIO against PubsubIO (or, to be precise, comparing KafkaCheckpointMark with PubsubCheckpoint), I can see that checkpoint persistence is not implemented in KafkaIO (the KafkaCheckpointMark.finalizeCheckpoint method is empty), whereas it is implemented in PubsubCheckpoint.finalizeCheckpoint, which acknowledges the messages to Pub/Sub.
Does this mean I have no means of reliably managing Kafka offsets on job restarts with minimum effort?
Options I considered so far:
1. Implement my own logic for persisting offsets - sounds complicated; I'm using Beam through Scio in Scala.
2. Do nothing, but that would result in many duplicates on job restarts (the topic has a 30-day retention period).
3. Enable auto-commit, but that would result in lost messages, so even worse.
There are two options: enable commitOffsetsInFinalize() in KafkaIO, or alternatively enable auto-commit in the Kafka consumer configuration. Note that while commitOffsetsInFinalize() is more in sync with what has been processed in Beam than Kafka's auto-commit, it does not provide a strong exactly-once processing guarantee. Imagine a two-stage pipeline: Dataflow finalizes the Kafka reader after the first stage, without waiting for the second stage to complete. If you restart the pipeline from scratch at that time, you would not process the records that completed the first stage but haven't been processed by the second. The issue is no different for PubsubIO.
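For illustration, a minimal sketch of the commitOffsetsInFinalize() route, assuming a Beam release recent enough to include it; the broker address, topic, and group id are placeholders:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CommitInFinalizeExample {
      public static void main(String[] args) {
        Map<String, Object> consumerConfig = new HashMap<>();
        consumerConfig.put("group.id", "my-group"); // placeholder: a group id is required for offset commits

        Pipeline p = Pipeline.create();
        p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092") // placeholder
            .withTopic("my-topic")               // placeholder
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withConsumerConfigUpdates(consumerConfig)
            // ask the runner to commit Kafka offsets when it finalizes checkpoints
            .commitOffsetsInFinalize());
        // ... downstream transforms ...
        p.run();
      }
    }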
Regarding option (2): you can configure KafkaIO to start reading from a specific timestamp (assuming the Kafka server supports it, i.e. version 0.10+). But that does not look any better than enabling auto-commit.
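A sketch of that approach for completeness (withStartReadTime() takes a joda-time Instant; connection details as in the snippet above, and the timestamp is a placeholder):

    import org.joda.time.Instant;

    // the same read() builder as above, but seeking by timestamp
    // instead of committing offsets
    KafkaIO.Read<String, String> read = KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")
        .withTopic("my-topic")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withStartReadTime(Instant.parse("2018-01-20T00:00:00Z")); // placeholder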
That said, KafkaIO should support finalize; it might be simpler to use than enabling auto-commit (need to think about frequency etc.). We haven't had many users asking for it. Please mention it on user@beam.apache.org if you can.
[Update: I am adding support for committing offsets to KafkaCheckpointMark in PR 4481]
Related
Sometimes we encounter lag in our Kafka consumer due to external issues.
A Flink job will always consume Kafka history (delayed data) with exactly-once semantics, but here's our scenario:
We want to skip delayed data when the Kafka consumer lag is too large, so that our downstream service gets the latest data in time.
I am thinking of setting a window period to do it. How should I code this?
I'd say your least painful option is to always read all the messages, but not process the ones you want to skip (discard them as early as possible). Just reading and discarding without any further processing is really fast.
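A minimal sketch of that idea, assuming your records carry an event-time field you can compare against the wall clock; the Event type and the five-minute threshold are placeholders:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SkipDelayedRecords {
      // hypothetical record type carrying an event-time field
      public static class Event {
        public long eventTimeMillis;
        public String payload;
      }

      public static void main(String[] args) throws Exception {
        final long maxLagMs = 5 * 60 * 1000L; // placeholder threshold: 5 minutes
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Event> source = env.fromElements(new Event()); // stand-in for the Kafka source
        source
            // keep only records that are fresh enough; everything else is dropped immediately
            .filter(e -> System.currentTimeMillis() - e.eventTimeMillis <= maxLagMs)
            .print();
        env.execute("skip-delayed-records");
      }
    }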
You could stop the Flink job and use the kafka-consumer-groups CLI from Kafka to seek the consumer group forward (assuming Flink is using one, rather than maintaining offsets itself).
When the job restarts, it'll start from the new offset location
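For illustration, an invocation along these lines (the server, group, and topic are placeholders; the standard tool also accepts --dry-run to preview the change and --to-datetime to seek to a point in time instead):

    kafka-consumer-groups --bootstrap-server broker:9092 \
      --group my-flink-group --topic my-topic \
      --reset-offsets --to-latest --execute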
I would like to write a simple Flink application that reads from a Kafka queue and processes the message and stores the output to an external system, with at least once semantics and without using checkpoints. I would like to avoid checkpoints because if the Kafka offsets are checkpointed, then all intermediate state will have to be checkpointed as well. In other words, I want the application to be as stateless as possible.
The way I envision at least once to work is the following:
1. a source reads from Kafka
2. processing happens
3. the output is stored to the external system
4. the message is acknowledged to Kafka
Note that:
If 2. or 3. fail, and the app restarts, the same message will be processed again (good)
If 2. and 3. succeed, 4. fails and the app restarts, we will have stored the result twice (acceptable)
Based on https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration, the only way to get at least once (or the stronger exactly once) guarantees is by using checkpoints.
It seems that the core of the issue is that 4. needs to communicate back to 1. to ack to Kafka, which cannot happen in standard Flink, but should be possible using stateful functions.
To put it all together, the question is:
Is it possible to achieve at-least-once semantics using Kafka in Flink without using checkpoints?
According to the documentation you already linked it says:
"Checkpointing disabled: if checkpointing is disabled, the Flink Kafka Consumer relies on the automatic periodic offset committing capability of the internally used Kafka clients. Therefore, to disable or enable offset committing, simply set the enable.auto.commit / auto.commit.interval.ms keys to appropriate values in the provided Properties configuration."
As your goal is to disable checkpointing, you could set
enable.auto.commit=true
auto.commit.interval.ms=??? // use a time high enough such that your steps 2. and 3. are covered.
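A hedged sketch of wiring those properties into the Flink Kafka consumer; the broker address, group id, topic, and 60-second interval are placeholders, and the interval should be chosen to comfortably cover your steps 2. and 3.:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class AutoCommitConsumer {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", "my-group");             // placeholder
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "60000"); // placeholder interval

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // checkpointing is deliberately NOT enabled; offsets are committed by the Kafka client
        env.addSource(new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props))
            .print();
        env.execute("auto-commit-example");
      }
    }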
Apache Flink guarantees exactly-once processing upon failure and recovery by resuming the job from a checkpoint, the checkpoint being a consistent snapshot of the distributed data stream and operator state (the Chandy-Lamport algorithm for distributed snapshots). This guarantees exactly-once on failover.
In the case of normal cluster operation, how does Flink guarantee exactly-once processing? For instance, given a Flink source that reads from an external source (say Kafka), how does Flink guarantee that the event is read exactly once from the source? Is there any kind of application-level acking between the event source and the Flink source? Also, how does Flink guarantee that events are propagated exactly once from upstream operators to downstream operators? Does that require any kind of acking for received events as well?
Flink does not guarantee that every event is read once from the sources. Instead, it guarantees that every event affects the managed state exactly once.
Checkpoints include the source offsets, and during a checkpoint restore, the sources are rewound and some events may be replayed. That's fine, because the checkpoint included the state throughout the job that had resulted from reading everything up to the offsets that were stored in the checkpoint, and nothing beyond those offsets.
Thus Flink's exactly-once guarantee requires replayable sources. Exactly-once messaging between operators relies on TCP.
Guaranteeing that the sinks don't receive duplicated results further requires transactional sinks. Flink commits transactions as part of checkpointing.
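For reference, a minimal sketch of enabling that behaviour via the standard API (the 10-second interval is a placeholder):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // take a consistent snapshot every 10 seconds; aligned barriers give exactly-once state updates
    env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);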
I'm using Spring Kafka to interface with my Kafka instance. Assume that I have a single topic with, say, 2+ partitions.
In instances where, for example, my Spring Kafka-based application crashes (or even rebalances) and then comes back online with messages waiting in the topic, I'm currently using a strategy where the latest committed offsets for each partition are stored in an external store; on a consumer's assignment to a partition, I look up the stored offset and seek to it to resume processing.
(This is based on a strategy I'd read about in an O'Reilly book.)
Is there a better way of handling this situation in order to implement "exactly once" semantics and not to miss any waiting messages? Or is there a better/more idiomatic way with Spring Kafka to handle this situation?
Thanks in advance.
Is there a reason you don't checkpoint your offsets to Kafka itself?
Generally, your options for "exactly once" processing are:
1. Store your offsets and your side effects together transactionally. This is only possible if your side effects go into a transaction-capable system, say a database (see the sketch below).
2. Use Kafka transactions. This is a simplified variant of 1, as long as your side effects go to the same Kafka cluster you read from.
3. Come up with a scheme that allows you to detect and disregard duplicates downstream of your Kafka pipeline (aka idempotence).
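As a minimal sketch of option 1, assuming a Spring Kafka version (2.3+) where ConsumerSeekAware's other callbacks have default implementations; OffsetStore is a hypothetical abstraction over whatever transactional store holds your results and offsets, and the topic name is a placeholder:

    import java.util.Map;
    import org.apache.kafka.common.TopicPartition;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.kafka.listener.ConsumerSeekAware;

    public class ExternallyStoredOffsetListener implements ConsumerSeekAware {

      // hypothetical abstraction over the transactional store holding results + offsets
      public interface OffsetStore {
        long lastProcessedOffset(TopicPartition tp);
      }

      private final OffsetStore offsets;

      public ExternallyStoredOffsetListener(OffsetStore offsets) {
        this.offsets = offsets;
      }

      @Override
      public void onPartitionsAssigned(Map<TopicPartition, Long> assignments,
                                       ConsumerSeekCallback callback) {
        // resume each partition just past the last offset whose result was committed
        assignments.keySet().forEach(tp ->
            callback.seek(tp.topic(), tp.partition(), offsets.lastProcessedOffset(tp) + 1));
      }

      @KafkaListener(topics = "my-topic") // placeholder topic
      public void listen(String value) {
        // process the record, then persist the result together with its offset
        // in a single transaction against the external store
      }
    }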
I am not sure whether this question is already addressed somewhere, but I couldn't find a helpful answer anywhere on the internet.
I am trying to integrate Apache NiFi with Kafka - consuming data from Kafka using Apache NiFi. Below are a few questions that come to mind before proceeding with this.
Q-1) The use case we have is: read data from Kafka in real time, parse the data, do some basic validations on the data and later push the data to HBase. I know Apache NiFi is the right candidate for doing this kind of processing, but how easy is it to build the workflow if the JSON we are processing is a complex one? We were initially thinking of doing the same using Java code, but later realised this can be done with minimum effort in NiFi. Please note, 80% of the data we are processing from Kafka would be simple JSONs, but 20% would be complex ones (involving arrays).
Q-2) The trickiest part of writing a Kafka consumer is handling the offsets properly. How will Apache NiFi handle offsets while consuming from Kafka topics? How would offsets be properly committed in case rebalancing is triggered in the middle of processing? Frameworks like Spring-Kafka provide options to commit the offsets (to some extent) in that case. How does NiFi handle this?
I have deployed a number of pipelines in a 3-node NiFi cluster in production, one of which is similar to your use case.
Q-1) It's very simple and easy to build a pipeline for your use case. Since you didn't mention the types of tasks involved in processing a JSON, I'm assuming generic tasks: schema validation, which can be achieved with the ValidateRecord processor; transformation, using the JoltTransformRecord processor; extraction of attribute values, using EvaluateJsonPath; conversion of JSON to some other format, say Avro, using the ConvertJSONToAvro processor; etc.
NiFi gives you the flexibility to scale each stage/processor in the pipeline independently. For example, if transformation using JoltTransformRecord is time-consuming, you can scale it to run N concurrent tasks on each node by configuring Concurrent Tasks under the Scheduling tab.
Q-2) As far as the ConsumeKafka_2_0 processor is concerned, offset management is handled by committing the NiFi processor session first and then the Kafka offsets, which means we have an at-least-once guarantee by default.
When Kafka triggers a rebalance of consumers for a given partition, the processor quickly commits (processor session and Kafka offsets) whatever it has got and returns the consumer to the pool for reuse.
ConsumeKafka_2_0 handles committing offsets when members of the consumer group change or the subscription of the members changes. This can occur when processes die, new process instances are added, or old instances come back to life after failure. It also takes care of cases where the number of partitions of the subscribed topic is administratively adjusted.