We want to achieve parallelism while reading messages from Kafka, so we wanted to specify a partition number in FlinkKafkaConsumer. Currently it reads messages from all partitions in Kafka instead of a specific partition number. Below is sample code:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "Message-Test-Consumers");
properties.setProperty("partition", "1"); //not sure about this syntax.
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<String>("EventLog", new SimpleStringSchema(), properties);
Please suggest any better option to get the parallelism.
I don't believe there is a mechanism to restrict which partitions Flink will read from. Nor do I see how this would help you achieve your goal of reading from the partitions in parallel, which Flink does regardless.
The Flink Kafka source connector reads from all available partitions, in parallel. Simply set the parallelism of the kafka source connector to whatever parallelism you desire, keeping in mind that the effective parallelism cannot exceed the number of partitions. In this way, each instance of Flink's Kafka source connector will read from one or more partitions. You can also configure the kafka consumer to automatically discover new partitions that may be created while the job is running.
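The way partitions spread across parallel source instances can be sketched with a simplified round-robin model. Flink's real `KafkaTopicPartitionAssigner` also factors in a topic-dependent start index, so the exact subtask that owns a given partition differs in practice; the class and method names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionAssignmentSketch {
    // Simplified model of how a parallel Kafka source spreads partitions
    // across its subtasks: partition p goes to subtask (p % parallelism).
    static Map<Integer, List<Integer>> assign(int numPartitions, int parallelism) {
        Map<Integer, List<Integer>> bySubtask = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            bySubtask.computeIfAbsent(p % parallelism, k -> new ArrayList<>()).add(p);
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        // 6 partitions read with parallelism 3: each subtask owns 2 partitions.
        System.out.println(assign(6, 3));
        // With parallelism 8 and only 6 partitions, only 6 subtasks get work,
        // which is why effective parallelism cannot exceed the partition count.
        System.out.println(assign(6, 8).size()); // prints 6
    }
}
```

Note how raising the parallelism past the partition count buys nothing: the extra subtasks simply receive no partitions.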
Related
I am fairly new to Flink and Kafka and have some data aggregation jobs written in Scala which run in Apache Flink. The jobs consume data from Kafka, perform aggregation, and produce results back to Kafka.
I need the jobs to consume data from any new Kafka topic created while the job is running which matches a pattern. I got this working by setting the following properties for my consumer:
val properties = new Properties()
properties.setProperty("bootstrap.servers", "my-kafka-server")
properties.setProperty("group.id", "my-group-id")
properties.setProperty("zookeeper.connect", "my-zookeeper-server")
properties.setProperty("security.protocol", "PLAINTEXT")
properties.setProperty("flink.partition-discovery.interval-millis", "500")
properties.setProperty("enable.auto.commit", "true")
properties.setProperty("auto.offset.reset", "earliest")
val consumer = new FlinkKafkaConsumer011[String](Pattern.compile("my-topic-start-.*"), new SimpleStringSchema(), properties)
The consumer works fine and consumes data from existing topics which start with “my-topic-start-”
When I publish data against a new topic, say for example "my-topic-start-test1", for the first time, my consumer does not recognise the topic until 500 milliseconds after it was created, which matches the discovery interval set in the properties.
When the consumer identifies the topic, it does not read the first data record published and instead starts reading from subsequent records, so effectively I lose the first data record every time data is published against a new topic.
Is there a setting I am missing or is it how Kafka works? Any help would be appreciated.
Thanks
Shravan
I think part of the issue is that my producer was creating the topic and publishing the message in one go, so by the time the consumer discovered the new partition, that message had already been produced.
As a temporary solution I updated my producer to create the topic if it does not exist and then publish the message (making it a two-step process), and this works.
Would be nice to have a more robust consumer side solution though :)
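The race described above can be sketched with a toy model of a topic's log: a record appended before the consumer discovers the topic is invisible if the newly discovered partition is read from the "latest" position, but not if it is read from "earliest". All names and values here are illustrative, not Kafka API calls:

```java
import java.util.ArrayList;
import java.util.List;

public class DiscoveryRaceSketch {
    // Reading a toy log either from the beginning or from its current end.
    static List<String> readFrom(List<String> log, boolean fromEarliest) {
        // "latest" means: start at the current end of the log, so anything
        // already written (e.g. the first record of a brand-new topic,
        // produced before the 500 ms discovery tick) is skipped.
        int start = fromEarliest ? 0 : log.size();
        return new ArrayList<>(log.subList(start, log.size()));
    }

    public static void main(String[] args) {
        List<String> topicLog = new ArrayList<>();
        topicLog.add("first-record"); // produced together with topic creation

        // Consumer discovers the topic later and starts from "latest":
        System.out.println(readFrom(topicLog, false)); // prints [] - record lost

        // Starting newly discovered partitions from "earliest" avoids the loss:
        System.out.println(readFrom(topicLog, true)); // prints [first-record]
    }
}
```

This is why the two-step producer works: creating the topic first gives the discovery interval a chance to elapse before the first real record is written.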
I'd like to re-read all Kafka events programmatically. I know there is an Application Reset Tool, but from what I understand, that requires me to shut down my application. I can't shut the application down in my use-case.
Is there a way to make my application re-read all events on a Kafka topic? Examples or code snippets would be much appreciated. Preferably but not necessarily using Kafka Streams.
You cannot re-read a topic with Kafka Streams, but with "plain" Kafka you can position a consumer at any valid offset.
Something like
// assumes an existing KafkaConsumer<?, ?> named consumer
final List<TopicPartition> partitions = new ArrayList<>();
// get all partitions for the topic
for (final PartitionInfo partition : consumer.partitionsFor("your_topic")) {
    partitions.add(new TopicPartition("your_topic", partition.partition()));
}
// take manual control of the assignment, then rewind to the start
consumer.assign(partitions);
consumer.seekToBeginning(partitions);
Consumers are required to stop in order to avoid running into race conditions between consumers committing offsets and AdminClient altering offsets.
If you wish to keep the consumer group id, you can use Kafka Consumer seek APIs to look for the earliest offsets. Then AdminClient can be used to alter consumer group offsets.
kafka-consumer-groups --reset-offsets implementation should be a good example on how to accomplish this: https://github.com/apache/kafka/blob/85b6545b8159885c57ab67e08b7185be8a607988/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L446-L469
Otherwise, using another consumer group id should be enough to consume from the beginning, if your auto.offset.reset is set to earliest.
Right now my Kafka producer is sinking all the messages to a single partition of a Kafka topic which actually has more than one partition.
How can I create a producer that uses the default partitioner and distributes the messages among the different partitions of the topic?
Code snippet of my kafka producer:
Properties props = new Properties();
props.put(ProducerConfig.RETRIES_CONFIG, 0);
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.ACKS_CONFIG, "all");
I am using flink kafka producer to sink the messages on kafka topic.
speStream.addSink(
new FlinkKafkaProducer011(kafkaTopicName,
new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
props,
FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
With the default partitioner, messages are assigned a partition using the following logic:
keyed messages: a hash of the key is generated and based on that a partition is selected. That means messages with the same key will end up on the same partition
unkeyed messages: round robin is used to assign partitions
One explanation for the behaviour you see is that you're using the same key for all your messages; with the default partitioner, they will then all end up on the same partition.
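The two rules above can be sketched in plain Java. Kafka's real default partitioner uses a murmur2 hash of the serialized key rather than `hashCode`, so this is only an illustration of the principle, with made-up class and method names:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PartitionerSketch {
    private final int numPartitions;
    private final AtomicInteger roundRobin = new AtomicInteger(0);

    PartitionerSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    int partitionFor(String key) {
        if (key == null) {
            // unkeyed: rotate through the partitions round robin
            return roundRobin.getAndIncrement() % numPartitions;
        }
        // keyed: hash of the key, so equal keys always land on the same
        // partition (Kafka actually uses murmur2, not hashCode)
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        PartitionerSketch p = new PartitionerSketch(4);
        // Same key, same partition every time:
        System.out.println(p.partitionFor("user-42") == p.partitionFor("user-42")); // true
        // Null keys rotate through partitions:
        System.out.println(p.partitionFor(null)); // 0
        System.out.println(p.partitionFor(null)); // 1
    }
}
```

So if every message carries the same key (or the serialization wrapper derives the same key for all of them), the keyed branch pins everything to one partition, which matches the behaviour in the question.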
Solved this by changing the flinkproducer to
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,new SimpleStringSchema(), props));
I am using Flink v1.4.0. I am consuming data from a Kafka topic using a Flink Kafka consumer as per the code below:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<>(...);
myConsumer.setStartFromEarliest(); // start from the earliest record possible
myConsumer.setStartFromLatest(); // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour
DataStream<String> stream = env.addSource(myConsumer);
...
Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)
Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:
The difference between the committed offset and the most recent offset in
each partition is called the consumer lag. If the Flink topology is consuming
the data slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.
So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
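The lag comparison itself is simple arithmetic: given the most recent offset and the committed offset per partition (values you could obtain from the exposed metrics), total lag is just the sum of the differences. The offset values below are made up:

```java
import java.util.HashMap;
import java.util.Map;

public class ConsumerLagSketch {
    // lag per partition = (most recent offset) - (committed offset);
    // a total lag of zero means the consumer has caught up with the topic.
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committed) {
        long lag = 0;
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            lag += e.getValue() - committed.getOrDefault(e.getKey(), 0L);
        }
        return lag;
    }

    public static void main(String[] args) {
        Map<Integer, Long> end = new HashMap<>();
        end.put(0, 100L);
        end.put(1, 250L);
        Map<Integer, Long> committed = new HashMap<>();
        committed.put(0, 100L);
        committed.put(1, 240L); // partition 1 is 10 records behind
        System.out.println(totalLag(end, committed)); // prints 10
    }
}
```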
Kafka is used as a streaming source, and a stream does not have an end.
If I'm not wrong, Flink's Kafka connector polls data from a topic every X milliseconds, because all Kafka consumers are active consumers; Kafka does not notify you when there is new data inside a topic.
So, in your case, just set a timeout: if you don't read any data within that time, you have read all of the data inside your topic.
Anyway, if you need to read a finite batch of data, you can use one of Flink's windows or introduce some kind of marker records into your Kafka topic to delimit the start and the end of the batch.
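The marker idea can be sketched as a consumer loop that stops when it sees an agreed sentinel record. The marker value and class names here are just examples, agreed between producer and consumer:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchMarkerSketch {
    static final String END_OF_BATCH = "__END_OF_BATCH__"; // agreed with the producer

    // Consume records until the end-of-batch marker appears.
    static List<String> readBatch(List<String> records) {
        List<String> batch = new ArrayList<>();
        for (String r : records) {
            if (END_OF_BATCH.equals(r)) {
                break; // the producer signalled that the finite batch is complete
            }
            batch.add(r);
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String> topic = List.of("a", "b", "c", END_OF_BATCH, "next-batch");
        System.out.println(readBatch(topic)); // prints [a, b, c]
    }
}
```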
I have a Flink streaming program which reads data from a Kafka topic. In the program, auto.offset.reset is set to "smallest". When testing in the IDE (IntelliJ IDEA), the program always reads data from the beginning of the topic. Then I set up a Flink/Kafka cluster and produced some data into the Kafka topic. The first time I ran the streaming job, it read data from the beginning of the topic. But after I stopped the streaming job and ran it again, it did not read data from the beginning of the topic. How can I make the program always read data from the beginning of the topic?
Properties properties = new Properties();
properties.put("bootstrap.servers", kafkaServers);
properties.put("zookeeper.connect", zkConStr);
properties.put("group.id", group);
properties.put("topic", topics);
properties.put("auto.offset.reset", offset);
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer082<String>(topics, new SimpleStringSchema(), properties));
If you want to always read from the beginning, you need to disable checkpointing in your stream context.
Also disable it at the level of the consumer properties:
enable.auto.commit=false or auto.commit.enable=false (depending on the Kafka version)
Another way:
you can keep checkpointing for failover but generate a new group.id whenever you need to read from the beginning (just clean up ZooKeeper occasionally)