FlinkKafkaConsumer082 auto.offset.reset setting doesn't work? - apache-kafka

I have a Flink streaming program which reads data from a Kafka topic. In the program, auto.offset.reset is set to "smallest". When I test in the IDE (IntelliJ IDEA), the program always reads data from the beginning of the topic. Then I set up a Flink/Kafka cluster and produced some data into the Kafka topic. The first time I ran the streaming job, it read data from the beginning of the topic. But after I stopped the job and ran it again, it no longer read data from the beginning of the topic. How can I make the program always read data from the beginning of the topic?
Properties properties = new Properties();
properties.put("bootstrap.servers", kafkaServers);
properties.put("zookeeper.connect", zkConStr);
properties.put("group.id", group);
properties.put("topic", topics);
properties.put("auto.offset.reset", offset);
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer082<String>(topics, new SimpleStringSchema(), properties));

If you want to always read from the beginning, you need to disable checkpointing in your stream context.
Also disable automatic offset committing at the level of the consumer properties:
enable.auto.commit=false or auto.commit.enable=false (depending on the Kafka version)
Another way: you can keep checkpointing for failover but generate a new group.id whenever you need to read from the beginning (and clean up ZooKeeper from time to time).
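For illustration, here is a minimal sketch of that second approach, reusing the variable names from the question (kafkaServers, zkConStr, topics, env); the randomly generated group.id is my assumption, not something prescribed by Flink:
Properties properties = new Properties();
properties.put("bootstrap.servers", kafkaServers);
properties.put("zookeeper.connect", zkConStr);
// a fresh group.id has no committed offsets, so auto.offset.reset=smallest
// takes effect on every run and the job reads the topic from the beginning
properties.put("group.id", "my-job-" + java.util.UUID.randomUUID());
properties.put("auto.offset.reset", "smallest");
properties.put("auto.commit.enable", "false"); // Kafka 0.8 name; newer clients use enable.auto.commit

DataStream<String> stream = env
    .addSource(new FlinkKafkaConsumer082<String>(topics, new SimpleStringSchema(), properties));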

Related

Exactly once in flink kafka producer and consumer

I am trying to achieve exactly-once semantics in the Flink-Kafka integration. My producer module is as below:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.enableCheckpointing(1000)
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(1000) //Gap after which next checkpoint can be written.
env.getCheckpointConfig.setCheckpointTimeout(4000) //Checkpoints have to complete within 4secs
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1) //Only 1 checkpoint can be in flight at a time
env.getCheckpointConfig.enableExternalizedCheckpoints(
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION) //Checkpoints are retained if the job is cancelled explicitly
//env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10)) //Number of restart attempts, Delay in each restart
val myProducer = new FlinkKafkaProducer[String](
"topic_name", // target topic
new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema()), // serialization schema
getProperties(), // producer config
FlinkKafkaProducer.Semantic.EXACTLY_ONCE) // producer semantic
Consumer Module:
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("zookeeper.connect", "localhost:2181")
val consumer = new FlinkKafkaConsumer[String]("topic_name", new SimpleStringSchema(), properties)
I am generating a few records and pushing them to this producer. The records look like this:
1
2
3
4
5
6
..
..
and so on. Suppose that while pushing this data the producer was able to push data up to the 4th record, and due to some failure it went down. When it is up and running again, will it push records from the 5th onwards? Are my properties enough for that?
I will be adding one property on the consumer side as per this link mentioned by the first user. Should I add the idempotence property on the producer side as well?
My Flink version is 1.13.5 with Scala 2.11.12, and I am using the Flink Kafka connector built for Scala 2.11.
I think I am not able to commit the transactions using EXACTLY_ONCE because checkpoints are not written at the mentioned path. I am attaching screenshots of the Web UI.
Do I need to set any property for that?
For the consumer side, the Flink Kafka consumer bookkeeps the current offset in the distributed checkpoint. If the consumer task fails, it restarts from the latest checkpoint and re-emits records from the offset recorded in that checkpoint. For example, suppose the latest checkpoint records offset 3, and after that Flink continues to emit 4 and 5 and then fails over; Flink would then continue to emit records from 4. Note that this does not cause duplication, since the state of all operators is also rolled back to the state after processing record 3.
For the producer side, Flink uses a two-phase commit [1] to achieve exactly-once. Roughly, the Flink producer relies on Kafka's transactions to write data, and only commits the data formally after the transaction is committed. Users can use Semantic.EXACTLY_ONCE to enable this functionality [2].
[1] https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
[2] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/kafka/#fault-tolerance
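As a rough illustration of the producer side (in Java, for consistency with the other examples in this thread), here is a sketch of what could sit behind getProperties() in the question; the broker address and timeout value are assumptions. Two things commonly matter for EXACTLY_ONCE: the producer's transaction.timeout.ms must not exceed the broker's transaction.max.timeout.ms (15 minutes by default), and the downstream consumer presumably needs the property the question alludes to, isolation.level=read_committed, so it only reads committed data:
Properties producerProps = new Properties();
producerProps.setProperty("bootstrap.servers", "localhost:9092");
// keep this at or below the broker's transaction.max.timeout.ms (default 15 min)
producerProps.setProperty("transaction.timeout.ms", "900000");

FlinkKafkaProducer<String> myProducer = new FlinkKafkaProducer<>(
    "topic_name",                                                  // target topic
    new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()), // serialization schema
    producerProps,                                                 // producer config
    FlinkKafkaProducer.Semantic.EXACTLY_ONCE);                     // producer semantic

// consumer side: only read data from committed transactions
properties.setProperty("isolation.level", "read_committed");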

Topic and partition discovery for Kafka consumer

I am fairly new to Flink and Kafka and have some data aggregation jobs written in Scala which run in Apache Flink. The jobs consume data from Kafka, perform aggregation, and produce results back to Kafka.
I need the jobs to consume data from any new Kafka topic created while the job is running that matches a pattern. I got this working by setting the following properties for my consumer:
val properties = new Properties()
properties.setProperty("bootstrap.servers", "my-kafka-server")
properties.setProperty("group.id", "my-group-id")
properties.setProperty("zookeeper.connect", "my-zookeeper-server")
properties.setProperty("security.protocol", "PLAINTEXT")
properties.setProperty("flink.partition-discovery.interval-millis", "500")
properties.setProperty("enable.auto.commit", "true")
properties.setProperty("auto.offset.reset", "earliest")
val consumer = new FlinkKafkaConsumer011[String](Pattern.compile("my-topic-start-.*"), new SimpleStringSchema(), properties)
The consumer works fine and consumes data from existing topics which start with "my-topic-start-".
When I publish data to a new topic, say "my-topic-start-test1", for the first time, my consumer does not recognise the topic until 500 milliseconds after the topic was created, based on the properties above.
When the consumer identifies the topic it does not read the first data record published and only starts reading subsequent records, so effectively I lose the first record every time data is published to a new topic.
Is there a setting I am missing, or is this how Kafka works? Any help would be appreciated.
Thanks
Shravan
I think part of the issue is that my producer was creating the topic and publishing the message in one go, so by the time the consumer discovered the new partition, that message had already been produced.
As a temporary solution I updated my producer to create the topic if it does not exist and then publish the message (making it a two-step process), and this works.
Would be nice to have a more robust consumer side solution though :)
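For reference, a rough sketch of such a two-step producer using the plain Kafka Java client; the topic name, partition count, and sleep interval are all assumptions for illustration:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TwoStepProducer {
    public static void main(String[] args) throws Exception {
        String topic = "my-topic-start-test1"; // example topic matching the consumed pattern

        Properties props = new Properties();
        props.put("bootstrap.servers", "my-kafka-server");

        // Step 1: create the topic explicitly before any record is sent
        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(Collections.singleton(new NewTopic(topic, 1, (short) 1))).all().get();
        } catch (Exception e) {
            // topic may already exist; ignore
        }

        // Give the running Flink job at least one discovery interval to notice the new topic
        // (flink.partition-discovery.interval-millis is 500 ms in the question)
        Thread.sleep(1000);

        // Step 2: publish the first record only after the topic exists
        try (KafkaProducer<String, String> producer =
                 new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
            producer.send(new ProducerRecord<>(topic, "first record")).get();
        }
    }
}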

Flink kafka consumer fetch messages from specific partition

We want to achieve parallelism while reading messages from Kafka, hence we wanted to specify a partition number in FlinkKafkaConsumer. Instead, it reads messages from all partitions in Kafka rather than a specific partition number. Below is sample code:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "Message-Test-Consumers");
properties.setProperty("partition", "1"); //not sure about this syntax.
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<String>("EventLog", new SimpleStringSchema(), properties);
Please suggest any better option to get the parallelism.
I don't believe there is a mechanism to restrict which partitions Flink will read from. Nor do I see how this would help you achieve your goal of reading from the partitions in parallel, which Flink does regardless.
The Flink Kafka source connector reads from all available partitions, in parallel. Simply set the parallelism of the kafka source connector to whatever parallelism you desire, keeping in mind that the effective parallelism cannot exceed the number of partitions. In this way, each instance of Flink's Kafka source connector will read from one or more partitions. You can also configure the kafka consumer to automatically discover new partitions that may be created while the job is running.
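As a rough sketch of that suggestion, reusing the names from the question (env is the StreamExecutionEnvironment; the parallelism of 4 and the discovery interval are assumptions):
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "Message-Test-Consumers");
// optional: also pick up partitions added to the topic while the job is running
properties.setProperty("flink.partition-discovery.interval-millis", "10000");

FlinkKafkaConsumer<String> kafkaConsumer =
    new FlinkKafkaConsumer<>("EventLog", new SimpleStringSchema(), properties);

// each of the 4 source subtasks reads one or more of EventLog's partitions;
// parallelism above the partition count leaves some subtasks idle
DataStream<String> stream = env
    .addSource(kafkaConsumer)
    .setParallelism(4);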

How can I know that I have consumed all of a Kafka Topic?

I am using Flink v1.4.0. I am consuming data from a Kafka topic using a Flink Kafka consumer, as per the code below:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> stream = env
    .addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));

FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<>(...);
myConsumer.setStartFromEarliest(); // start from the earliest record possible
myConsumer.setStartFromLatest(); // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour
DataStream<String> stream = env.addSource(myConsumer);
...
Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)
Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:
The difference between the committed offset and the most recent offset in
each partition is called the consumer lag. If the Flink topology is consuming
the data slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.
So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
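If you do want to compare the offsets yourself, here is a rough sketch using the plain Kafka Java client (not Flink-specific). It assumes a consumer group whose offsets are committed to Kafka (0.9+ connectors); the 0.8 connector in the question commits them to ZooKeeper instead. The group name "test" is taken from the question's properties:
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class LagChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<byte[], byte[]> consumer =
                 new KafkaConsumer<>(props, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {

            // offsets the "test" group has committed, per partition
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("test").partitionsToOffsetAndMetadata().get();
            // current log-end offsets of the same partitions
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());

            committed.forEach((tp, offset) -> {
                if (offset == null) return; // partition with no committed offset yet
                long lag = endOffsets.get(tp) - offset.offset();
                System.out.println(tp + " lag = " + lag); // lag == 0 everywhere => fully caught up
            });
        }
    }
}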
Kafka is used as a streaming source, and a stream does not have an end.
If I'm not wrong, Flink's Kafka connector pulls data from a topic every X milliseconds, because all Kafka consumers are active consumers; Kafka does not notify you when there is new data inside a topic.
So, in your case, just set a timeout, and if you don't read any data within that time, you have read all of the data inside your topic.
Anyway, if you need to read a finite batch of data, you can use one of Flink's windows or introduce some kind of markers inside your Kafka topic to delimit the start and the end of the batch.

Kafka High Level Consumer Fetch All Messages From Topic Using Java API (Equivalent to --from-beginning)

I am testing the Kafka high-level consumer using the ConsumerGroupExample code from the Kafka site. I would like to retrieve all the existing messages on the topic called "test" that I have in the Kafka server config. Looking at other blogs, auto.offset.reset should be set to "smallest" to be able to get all messages:
private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId) {
    Properties props = new Properties();
    props.put("zookeeper.connect", a_zookeeper);
    props.put("group.id", a_groupId);
    props.put("auto.offset.reset", "smallest");
    props.put("zookeeper.session.timeout.ms", "10000");
    return new ConsumerConfig(props);
}
The question I really have is this: what is the equivalent Java api call for the High Level Consumer that is the equivalent of:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Basically, every time a new consumer group tries to consume a topic, it will read messages from the beginning. If you are consuming from the beginning each time just for testing purposes, then every time you initialise your consumer with a new group.id it will read the messages from the beginning. Here's how I did it:
properties.put("group.id", UUID.randomUUID().toString());
and it read messages from the beginning each time!
Looks like you need to use the "low level SimpleConsumer API"
For most applications, the high level consumer Api is good enough.
Some applications want features not exposed to the high level consumer
yet (e.g., set initial offset when restarting the consumer). They can
instead use our low level SimpleConsumer Api. The logic will be a bit
more complicated and you can follow the example in here.
This example worked for getting all messages from a topic with the following arguments (note that the port is the Kafka port, not the ZooKeeper port; the topics are set up as in that example):
10 my-replicated-topic 0 localhost 9092
Specifically, there is a method to get readOffset which takes kafka.api.OffsetRequest.EarliestTime():
long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
Here is another post may provide some alternate ideas on how to sort this out: How to get data from old offset point in Kafka?
To fetch messages from the beginning, you can do this:
import kafka.utils.ZkUtils;
ZkUtils.maybeDeletePath("zkhost:zkport", "/consumers/group.id");
then just proceed with the usual consumption code...
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("auto.offset.reset", "earliest");
props.put("group.id", UUID.randomUUID().toString());
These properties will help you.
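Putting those properties into a complete example with the modern consumer API (the topic name "test" and the broker address are placeholders), a minimal sketch could look like this:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", UUID.randomUUID().toString()); // fresh group => no committed offsets
        props.put("auto.offset.reset", "earliest");          // so "earliest" applies on every run

        try (KafkaConsumer<String, String> consumer =
                 new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer())) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
            }
        }
    }
}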