I am using Flink's FlinkKafkaConsumer010 with Kafka version 1.1, and I want to get the offset lag info in my code.
Flink Kafka Connector Metric
committedOffsets: The last successfully committed offsets to Kafka, for each partition. A particular partition's metric can be specified by topic name and partition id.
currentOffsets: The consumer's current read offset, for each partition. A particular partition's metric can be specified by topic name and partition id.
val endOffset = new PartitionOffsetsRetrieverImpl(
kafkaConsumer,
kafkaAdminClient,
"group_" + System.currentTimeMillis()
).endOffsets(util.Arrays.asList(topicPartition)).get(topicPartition)
Here kafkaConsumer is an org.apache.kafka.clients.consumer.KafkaConsumer object and kafkaAdminClient is an org.apache.kafka.clients.admin.AdminClient object.
This could be improved, because it creates a new Kafka consumer instance on every call, which is not efficient.
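For reference, lag for a partition is just the end offset minus the group's committed offset. Here is a minimal sketch of that idea using only the plain consumer API (committed() and endOffsets() both exist on KafkaConsumer in the 1.x clients); the class and method names are illustrative, not part of any library:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class LagChecker {
    // Illustrative helper: lag = end offset - committed offset for one partition.
    public static long lagFor(String bootstrapServers, String groupId, TopicPartition tp) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // committed() returns the last committed offset of *this* group
        props.put("group.id", groupId);
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
            OffsetAndMetadata committed = consumer.committed(tp);
            return end - (committed == null ? 0L : committed.offset());
        }
    }
}

To avoid the inefficiency mentioned above, keep one long-lived consumer (or AdminClient) around instead of constructing a new one per lookup.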
Related
Earlier, Kafka used to store consumer offsets in ZooKeeper, but since around Kafka 0.9 it stores consumer offsets in an internal topic.
As stated in this post -
Kafka brokers use an internal topic named __consumer_offsets that
keeps track of what messages a given consumer group last successfully
processed. As we know, each message in a Kafka topic has a partition
ID and an offset ID attached to it.
But a topic is not like a DB table, which can be queried for data based on some input. So I am wondering how this is efficient at all, and how exactly Kafka retrieves the offsets for a particular partition for a particular consumer group.
Kafka Streams or an in-memory hashtable can make a compacted topic behave very much like a KV database store.
The consumer offsets topic is a compacted topic whose messages are keyed by group ID, topic, and partition. When you give a group.id in the client, the broker acting as group coordinator can look up those keys in the topic and return all currently committed offsets for all partitions of the group. The consumer then looks up the offsets for its assigned partitions in the returned map.
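As a toy illustration of why that lookup is cheap (this is a sketch of the idea, not Kafka's actual internal data structures): replaying a compacted, keyed topic into a hashtable leaves exactly one latest value per key, so the coordinator can answer by key in constant time. Requires Java 16+ for the record:

import java.util.HashMap;
import java.util.Map;

public class OffsetTableSketch {
    // Mirrors how offset-commit messages are keyed: (group, topic, partition).
    record OffsetKey(String group, String topic, int partition) {}

    public static void main(String[] args) {
        Map<OffsetKey, Long> table = new HashMap<>();
        // Replaying the log: a later commit for the same key overwrites the
        // earlier one, which is exactly the state that log compaction preserves.
        table.put(new OffsetKey("my-group", "myTopic", 0), 23L);
        table.put(new OffsetKey("my-group", "myTopic", 0), 42L); // newer commit wins
        // Lookup by key is O(1), which is why "querying a topic" is efficient here.
        System.out.println(table.get(new OffsetKey("my-group", "myTopic", 0)));
    }
}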
It's not a question of "better". Removing the ZooKeeper dependency was always the goal, and it is finally production-ready as of Kafka 3.3.1.
I am trying to write a simple Beam pipeline that starts consuming data from the earliest offsets existing in the partitions of each Kafka Topic.
I have not been able to figure out how to consume data from the earliest possible offsets in a topic.
By default (auto.offset.reset = latest, with no previously committed offsets), KafkaConsumer instances consume data from the latest offset of each partition. This means they'll start consuming only new messages published to the topic.
If you want your pipeline to start by consuming the earliest available Kafka offsets, you can achieve this with the withStartReadTime parameter:
p.apply(KafkaIO.<String, String>read()
    .withBootstrapServers(KAFKA_BOOTSTRAP_SERVER)
    .withTopic(KAFKA_TOPIC)
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    // By reading from EPOCH (org.joda.time.Instant, which Beam uses),
    // you'll ensure the earliest messages are consumed
    .withStartReadTime(Instant.EPOCH));
And that should do it!
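Depending on your Beam version, another option (a sketch, assuming KafkaIO.Read#withConsumerConfigUpdates is available, roughly Beam 2.13+) is to tell the underlying Kafka consumer to reset to the earliest offset whenever the group has no committed position:

p.apply(KafkaIO.<String, String>read()
    .withBootstrapServers(KAFKA_BOOTSTRAP_SERVER)
    .withTopic(KAFKA_TOPIC)
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    // Only applies when the consumer group has no committed offsets yet
    .withConsumerConfigUpdates(
        Collections.<String, Object>singletonMap("auto.offset.reset", "earliest")));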
I know that every message in a Kafka topic partition has its own offset number, and that the partition maintains the ordering of those offsets.
But if I have a Kafka consumer group (or a single Kafka consumer) reading a particular topic partition, how does it keep track of up to which offset messages have been read, and who maintains this offset counter?
If the consumer goes down, how will a new consumer know to start reading from the next unread (or unacknowledged) offset?
The information about consumer groups is all stored in the internal Kafka topic __consumer_offsets. Whenever a new group tries to read data from a topic, it checks its offset position in that internal topic, which has its cleanup policy set to compact. The compaction keeps this topic small.
Kafka comes with a command line tool kafka-consumer-groups.sh that helps you understand which information is stored for each consumer group.
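For example (the group name and broker address below are placeholders):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group

The --describe output lists, per partition, the group's current committed offset, the log-end offset, and the resulting lag.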
More information is given in the Kafka Documentation on offset tracking.
Assume there are already 10 records in my topic; when I now start my Flink consumer, it consumes starting from the 11th record.
Therefore, I have 3 questions:
How do I get the number of partitions of the current topic, and the offset of each partition?
How do I manually set the starting position of each partition for the consumer?
If the Flink consumer crashes and recovers after several minutes, how will it know where to restart?
Any help is appreciated. Here is the sample code (I also tried FlinkKafkaConsumer08 and FlinkKafkaConsumer010, but both threw exceptions):
import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;

public class KafkaConsumerJob {
    public static void main(String[] args) throws Exception {
        // create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "192.168.95.2:9092");
        properties.setProperty("group.id", "test");
        properties.setProperty("auto.offset.reset", "earliest");

        FlinkKafkaConsumer09<String> myConsumer = new FlinkKafkaConsumer09<>(
                "game_event", new SimpleStringSchema(), properties);

        DataStream<String> stream = env.addSource(myConsumer);
        stream.map(new MapFunction<String, String>() {
            private static final long serialVersionUID = -6867736771747690202L;

            @Override
            public String map(String value) throws Exception {
                return "Stream Value: " + value;
            }
        }).print();
        env.execute();
    }
}
And pom.xml:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>1.6.1</version>
</dependency>
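As an aside, the exceptions may simply come from missing dependencies: flink-clients alone does not pull in the streaming API or the Kafka connector. A sketch of the additional entries, assuming Flink 1.6.1 and Scala 2.11:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.6.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.9_2.11</artifactId>
    <version>1.6.1</version>
</dependency>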
In order to consume messages from a partition starting at a particular offset, you can refer to the Flink documentation:
You can also specify the exact offsets the consumer should start from
for each partition:
Map<KafkaTopicPartition, Long> specificStartOffsets = new HashMap<>();
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L);
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 1), 31L);
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 2), 43L);
myConsumer.setStartFromSpecificOffsets(specificStartOffsets);
The above example configures the consumer to start from the specified
offsets for partitions 0, 1, and 2 of topic myTopic. The offset values
should be the next record that the consumer should read for each
partition. Note that if the consumer needs to read a partition which
does not have a specified offset within the provided offsets map, it
will fallback to the default group offsets behaviour (i.e.
setStartFromGroupOffsets()) for that particular partition.
Note that these start position configuration methods do not affect the
start position when the job is automatically restored from a failure
or manually restored using a savepoint. On restore, the start position
of each Kafka partition is determined by the offsets stored in the
savepoint or checkpoint (please see the next section for information
about checkpointing to enable fault tolerance for the consumer).
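For the scenario in the question (the 10 pre-existing records being skipped), the connector also offers coarser-grained start positions; a one-line sketch using the myConsumer instance from the code above:

// Ignore any committed group offsets and read every partition from the beginning.
// With checkpointing enabled this only applies to fresh starts, not to restores.
myConsumer.setStartFromEarliest();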
In case one of the consumers crashes, once it recovers Kafka will refer to the __consumer_offsets topic in order to continue processing messages from the point where it left off before crashing. __consumer_offsets is a topic used to store metadata about committed offsets for each triple (topic, partition, group). It is also periodically compacted so that only the latest offsets are available. You can also refer to Flink's Kafka connector metrics:
Flink’s Kafka connectors provide some metrics through Flink’s metrics
system to analyze the behavior of the connector. The producers export
Kafka’s internal metrics through Flink’s metric system for all
supported versions. The consumers export all metrics starting from
Kafka version 0.9. The Kafka documentation lists all exported metrics
in its documentation.
In addition to these metrics, all consumers expose the current-offsets
and committed-offsets for each topic partition. The current-offsets
refers to the current offset in the partition. This refers to the
offset of the last element that we retrieved and emitted successfully.
The committed-offsets is the last committed offset.
The Kafka Consumers in Flink commit the offsets back to Zookeeper
(Kafka 0.8) or the Kafka brokers (Kafka 0.9+). If checkpointing is
disabled, offsets are committed periodically. With checkpointing, the
commit happens once all operators in the streaming topology have
confirmed that they’ve created a checkpoint of their state. This
provides users with at-least-once semantics for the offsets committed
to Zookeeper or the broker. For offsets checkpointed to Flink, the
system provides exactly once guarantees.
The offsets committed to ZK or the broker can also be used to track
the read progress of the Kafka consumer. The difference between the
committed offset and the most recent offset in each partition is
called the consumer lag. If the Flink topology is consuming the data
slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.
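As for the first question (partition count and per-partition offsets), the Flink connector does not expose these directly, but a plain Kafka consumer can; a minimal sketch (partitionsFor() and endOffsets() exist on KafkaConsumer since 0.10.1; the topic and broker address are taken from the question):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class TopicInspector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.95.2:9092");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<PartitionInfo> partitions = consumer.partitionsFor("game_event");
            System.out.println("partition count: " + partitions.size());
            List<TopicPartition> tps = partitions.stream()
                .map(p -> new TopicPartition(p.topic(), p.partition()))
                .collect(Collectors.toList());
            // End offset = the offset the next produced record would receive
            Map<TopicPartition, Long> end = consumer.endOffsets(tps);
            end.forEach((tp, off) -> System.out.println(tp + " end offset: " + off));
        }
    }
}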
I'm using sarama-cluster (a Kafka consumer client written in Go).
On the broker, my topic's partition offset was 11000 and my consumer group's partition offset was 10100.
Then I ran my cluster consumer, but it consumed nothing. (This was one to two days after those offsets were committed.)
But when I produced messages to the topic's partitions, it did consume them (in each partition).
The number of messages was 901.
Why does my consumer cluster only seem to start consuming when I produce messages?
My consumer setting was auto.offset.reset = latest.
This is because of your offset reset setting. auto.offset.reset = latest means that, when your consumer group has no valid committed offset, it waits for the newest records. Since you started the consumer one to two days after the offsets were committed, the committed offsets had likely expired in the meantime (older brokers defaulted offsets.retention.minutes to one day), so the reset policy applied. If you want to consume from the beginning, use auto.offset.reset = earliest.
The official Kafka documentation: https://kafka.apache.org/0110/documentation.html