Exactly once in flink kafka producer and consumer - apache-kafka

I am trying to achieve an Exactly-Once semantics in Flink-Kafka integration. I have my producer module as below:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.enableCheckpointing(1000)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(1000) //Gap after which next checkpoint can be written.
env.getCheckpointConfig.setCheckpointTimeout(4000) //Checkpoints have to complete within 4secs
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1) //Only 1 checkpoints can be executed at a time
env.getCheckpointConfig.enableExternalizedCheckpoints(
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION) //Checkpoints are retained if the job is cancelled explicitly
//env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10)) //Number of restart attempts, Delay in each restart
val myProducer = new FlinkKafkaProducer[String](
"topic_name", // target topic
new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()), // serialization schema
getProperties(), // producer config
FlinkKafkaProducer.Semantic.EXACTLY_ONCE) //Producer Config
Consumer Module:
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
val consumer = new FlinkKafkaConsumer[String]("topic_name", new SimpleStringSchema(), properties)
I am generating few records and pushing it to this producer. The records are in below fashion:
1
2
3
4
5
6
..
..
and so on. So suppose while pushing this data, the producer was able to push the data till the 4th record and due to some failure it went down so when it is up and running again, will it push the record from 5th onwards? Are my properties enough for that?
I will be adding one property on the consumer side as per this link mentioned by the first user. Should I add Idempotent property on the producer side as well?
My Flink version is 1.13.5, Scala 2.11.12 and I am using Flink Kafka connector 2.11.
I think I am not able to commit the transactions using the EXACTLY_ONCE because checkpoints are not written at the mentioned path. Attaching screenshots of the Web UI:
Do I need to set any property for that?

For the producer side, Flink Kafka Consumer would bookkeeper the current offset in the distributed checkpoint, and if the consumer task failed, it will restarted from the latest checkpoint and re-emit from the offset recorded in the checkpoint. For example, suppose the latest checkpoint records offset 3, and after that flink continue to emit 4, 5 and then failover, then Flink would continue to emit records from 4. Notes that this would not cause duplication since the state of all the operators are also fallback to the state after processed records 3.
For the producer side, Flink use two-phase commit [1] to achieve exactly-once. Roughly Flink Producer would relies on Kafka's transaction to write data, and only commit data formally after the transaction is committed. Users could use Semantics.EXACTLY_ONCE to enable this functionality.
[1] https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
[2] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/kafka/#fault-tolerance

Related

Topic and partition discovery for Kafka consumer

I am fairly new to Flink and Kafka and have some data aggregation jobs written in Scala which run in Apache Flink, the jobs consume data from Kafka perform aggregation and produce results back to Kafka.
I need the jobs to consume data from any new Kafka topic created while the job is running which matches a pattern. I got this working by setting the following properties for my consumer
val properties = new Properties()
properties.setProperty(“bootstrap.servers”, “my-kafka-server”)
properties.setProperty(“group.id”, “my-group-id”)
properties.setProperty(“zookeeper.connect”, “my-zookeeper-server”)
properties.setProperty(“security.protocol”, “PLAINTEXT”)
properties.setProperty(“flink.partition-discovery.interval-millis”, “500”);
properties.setProperty(“enable.auto.commit”, “true”);
properties.setProperty(“auto.offset.reset”, “earliest”);
val consumer = new FlinkKafkaConsumer011[String](Pattern.compile(“my-topic-start-.*”), new SimpleStringSchema(), properties)
The consumer works fine and consumes data from existing topics which start with “my-topic-start-”
When I publish data against a new topic say for example “my-topic-start-test1” for the first time, my consumer does not recognise the topic until after 500 milliseconds after the topic was created, this is based on the properties.
When the consumer identifies the topic it does not read the first data record published and starts reading subsequent records so effectively I loose that first data record every time data is published against a new topic.
Is there a setting I am missing or is it how Kafka works? Any help would be appreciated.
Thanks
Shravan
I think part of the issue is my producer was creating topic and publishing message in one go, so by the time consumer discovers new partition that message has already been produced.
As a temporary solution I updated my producer to create the topic if it does not exists and then publish a message (make it 2 step process) and this works.
Would be nice to have a more robust consumer side solution though :)

PySpark and Kafka "Set are gone. Some data may have been missed.."

I'm running PySpark using a Spark cluster in local mode and I'm trying to write a streaming DataFrame to a Kafka topic.
When I run the query, I get the following message:
java.lang.IllegalStateException: Set(topicname-0) are gone. Some data may have been missed..
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
This is my code:
query = (
output_stream
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "ratings-cleaned")
.option("checkpointLocation", "checkpoints-folder")
.start()
)
sleep(2)
print(query.status)
This error message typically shows up when some messages/offsets were removed from the source topic since the last run of the query. The removal happened due to the cleanup policy, such as retention time.
Imagine your topic has messages with offsets 0, 1, 2 which have all been processed by the application. The checkpoint files stores that last offset 2 to remember continue with offset 3 next time it starts.
After some time, messages with offset 3, 4, 5 were produced to the topic but messages with offset 0, 1, 2, 3 were removed from the topic due to its retention.
Now, when restarting your spark structured streaming job it tries to fetch 3 based on its checkpoint files but realises that only the message with offset 4 is available. In exactly that case it will throw this exception.
You can solve this by
setting .option("failOnDataLoss", "false") in your readStream operation, or
delete existing checkpoint files
According to the Structured Streaming + Kafka Integration Guide the option failOnDataLoss is described as:
"Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. Batch queries will always fail if it fails to read any data from the provided offsets due to lost data."
On top of the answers above, Bartosz Konieczny posted a more detailed reason. The first part of the error message says the Set() is empty; that is a set of topic partitions (hence the -0 at the end). That means the partition to which the Spark cluster subscribed has been deleted. My guess is the Kafka setup was restarted. The Spark queries are using some default checkpoint folder that assumes the Kafka setup was not restarted.
This error message hints at issues with the checkpoints. During development, this can be caused by using an old checkpoints folder with an updated query.
If this is in a development environment and you don't need to save the state of the previous query, you can just remove the checkpoints folder (checkpoints-folder in the code example) and rerun your query.

Pause and resume KafkaConsumer in SparkStreaming

:)
I've ended myself in a (strange) situation where, briefly, I don't want to consume any new record from Kafka, so pause the sparkStreaming consumption (InputDStream[ConsumerRecord]) for all partitions in the topic, do some operations and finally, resume consuming records.
First of all... is this possible?
I've been trying sth like this:
var consumer: KafkaConsumer[String, String] = _
consumer = new KafkaConsumer[String, String](properties)
consumer.subscribe(java.util.Arrays.asList(topicName))
consumer.pause(consumer.assignment())
...
consumer.resume(consumer.assignment())
but I got this:
println(s"Assigned partitions: $consumer.assignment()") --> []
println(s"Paused partitions: ${consumer.paused()}") --> []
println(s"Partitions for: ${consumer.partitionsFor(topicNAme)}") --> [Partition(topic=topicAAA, partition=0, leader=1, replicas=[1,2,3], partition=1, ... ]
Any help to understand what I'm missing and why I'm getting empty results when it's clear the consumer has partitions assigned will be welcomed!
Versions:
Kafka: 0.10
Spark: 2.3.0
Scala: 2.11.8
Yes it is possible
Add check pointing in your code and pass persistent storage (local disk,S3,HDFS) path
and whenever you start/resume your job it will pickup the Kafka Consumer group info with consumer offsets from the check pointing and start processing from where it was stopped.
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
Spark Check-=pointing is mechanism not only for saving the offset but also save the serialize state of your DAG of your Stages and Jobs. So whenever you restart your job with new code it would
Read and process the serialized data
Clean the cached DAG stages if there are any code changes in your Spark App
Resume processing from the new data with latest code.
Now here reading from disk is just a one time operation required by Spark to load the Kafka Offset, DAG and the old incomplete processed data.
Once it has done it will always keep on saving the data to disk on default or specified checkpoint interval.
Spark streaming provides an option to specifying Kafka group id but Spark structured stream does not.

How to set Kafka offset for consumer?

Assume there are already 10 data in my topic and now I start my consumer written Flink, the consumer will consume the 11th data.
Therefore, I have 3 questions:
How to get the number of partitions of current topic and the offset of each partition respectively?
How to set the starting position for each partition for consumer manually?
If the Flink consumer crashes and after several minutes it gets recovered. How will the consumer know where to restart?
Any help is appreciated. The sample codes (I tried FlinkKafkaConsumer08, FlinkKafkaConsumer10 but all exception.):
public class kafkaConsumer {
public static void main(String[] args) throws Exception {
// create execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.95.2:9092");
properties.setProperty("group.id", "test");
properties.setProperty("auto.offset.reset", "earliest");
FlinkKafkaConsumer09<String> myConsumer = new FlinkKafkaConsumer09<>(
"game_event", new SimpleStringSchema(), properties);
DataStream<String> stream = env.addSource(myConsumer);
stream.map(new MapFunction<String, String>() {
private static final long serialVersionUID = -6867736771747690202L;
#Override
public String map(String value) throws Exception {
return "Stream Value: " + value;
}
}).print();
env.execute();
}
}
And pom.xml:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>1.6.1</version>
</dependency>
In order consume messages from a partition starting from a particular offset you can refer to the Flink Documentationl:
You can also specify the exact offsets the consumer should start from
for each partition:
Map<KafkaTopicPartition, Long> specificStartOffsets = new HashMap<>();
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L);
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 1), 31L);
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 2), 43L);
myConsumer.setStartFromSpecificOffsets(specificStartOffsets);
The above example configures the consumer to start from the specified
offsets for partitions 0, 1, and 2 of topic myTopic. The offset values
should be the next record that the consumer should read for each
partition. Note that if the consumer needs to read a partition which
does not have a specified offset within the provided offsets map, it
will fallback to the default group offsets behaviour (i.e.
setStartFromGroupOffsets()) for that particular partition.
Note that these start position configuration methods do not affect the
start position when the job is automatically restored from a failure
or manually restored using a savepoint. On restore, the start position
of each Kafka partition is determined by the offsets stored in the
savepoint or checkpoint (please see the next section for information
about checkpointing to enable fault tolerance for the consumer).
In case one of the consumers crashes, once it recovers Kafka will refer to consumer_offsets topic in order to continue processing the messages from the point it was left to before crashing. consumer_offsets is a topic which is used to store meta-data information regarding committed offsets for each triple (topic, partition, group). It is also periodically compacted so that only latest offsets are available. You can also refer to Flink's Kafka Connectors Metrics:
Flink’s Kafka connectors provide some metrics through Flink’s metrics
system to analyze the behavior of the connector. The producers export
Kafka’s internal metrics through Flink’s metric system for all
supported versions. The consumers export all metrics starting from
Kafka version 0.9. The Kafka documentation lists all exported metrics
in its documentation.
In addition to these metrics, all consumers expose the current-offsets
and committed-offsets for each topic partition. The current-offsets
refers to the current offset in the partition. This refers to the
offset of the last element that we retrieved and emitted successfully.
The committed-offsets is the last committed offset.
The Kafka Consumers in Flink commit the offsets back to Zookeeper
(Kafka 0.8) or the Kafka brokers (Kafka 0.9+). If checkpointing is
disabled, offsets are committed periodically. With checkpointing, the
commit happens once all operators in the streaming topology have
confirmed that they’ve created a checkpoint of their state. This
provides users with at-least-once semantics for the offsets committed
to Zookeeper or the broker. For offsets checkpointed to Flink, the
system provides exactly once guarantees.
The offsets committed to ZK or the broker can also be used to track
the read progress of the Kafka consumer. The difference between the
committed offset and the most recent offset in each partition is
called the consumer lag. If the Flink topology is consuming the data
slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.

How can I know that I have consumed all of a Kafka Topic?

I am using Flink v1.4.0. I am consuming data from a Kafka Topic using a Kafka FLink Consumer as per the code below:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<>(...);
myConsumer.setStartFromEarliest(); // start from the earliest record possible
myConsumer.setStartFromLatest(); // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour
DataStream<String> stream = env.addSource(myConsumer);
...
Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)
Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:
The difference between the committed offset and the most recent offset in
each partition is called the consumer lag. If the Flink topology is consuming
the data slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.
So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
Kafka it's used as a streaming source and a stream does not have an end.
If im not wrong, Flink's Kafka connector pulls data from a Topic each X miliseconds, because all kafka consumers are Active consumers, Kafka does not notify you if there's new data inside a topic
So, in your case, just set a timeout and if you don't read data in that time, you have readed all of the data inside your topic.
Anyways, if you need to read a batch of finite data, you can use some of Flink's Windows or introduce some kind of marks inside your Kafka topic, to delimit the start and the of the batch.