Kafka Consumer not consuming from last committed offset after restart - apache-kafka

I have a consumer polling a subscribed topic. It consumes each message, does some processing (within seconds), pushes the result to a different topic, and commits the offset.
There are 5000 messages in total:
before restart - consumed 2900 messages and committed the offset
after restart - started consuming from offset 0.
Even though the consumer is created with the same consumer group, it started processing messages from offset 0.
kafka version (strimzi) > 2.0.0
kafka-python == 2.0.1

We don't know how many partitions your topic has, but consumers created within the same consumer group consume records from different partitions. Two consumers in one consumer group cannot consume from the same partition, and if you add a consumer, the group coordinator triggers a rebalance to reassign the partitions among the consumers.
I think the offset 0 comes from the property auto.offset.reset, which can be:
latest: Start at the latest offset in log
earliest: Start with the earliest record.
none: Throw an exception when there is no existing offset data.
But this property kicks in only if your consumer group doesn't have a valid offset committed.
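For reference, here is a minimal sketch of a consume-process-produce loop with kafka-python that commits only after each message has been handled. The broker address, topic names, group id, and the process function are placeholders I made up, not values from the question. The one thing to check against your own code is that group_id is set and stays identical across restarts, since committed offsets are stored per group.

```python
# Placeholder settings: adjust the broker address and group id to your setup.
CONSUMER_CONFIG = {
    "bootstrap_servers": "localhost:9092",
    "group_id": "my-processing-group",  # must be the same on every restart
    "enable_auto_commit": False,         # we commit manually after processing
    "auto_offset_reset": "earliest",     # used only if no valid committed offset exists
}


def process(value: bytes) -> bytes:
    """Stand-in for the per-message work described in the question."""
    return value.upper()


def run(input_topic: str = "input-topic", output_topic: str = "output-topic") -> None:
    # Imported here so this file can be loaded without kafka-python installed.
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(input_topic, **CONSUMER_CONFIG)
    producer = KafkaProducer(bootstrap_servers=CONSUMER_CONFIG["bootstrap_servers"])
    for message in consumer:
        producer.send(output_topic, value=process(message.value))
        producer.flush()   # make sure the result is written before committing
        consumer.commit()  # commit the group's current position (blocking)
```

Calling run() needs a reachable broker. If the group still restarts from offset 0 with a setup like this, the retention settings are the next thing I would look at.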
N.B.: Records in a topic are subject to a retention period (the log.retention.ms property), so messages could be deleted from the log before your consumer gets to them.
Question: since you consume messages from one topic, process the data, and write the result to another topic, why didn't you use Kafka Streams?

Related

Kafka consumer - how does rebalance work if one consumer fails

I'm using AWS Kafka MSK and I have a topic with 2 partitions.
I also have 2 consumers which are part of the same consumer group.
I'm wondering what will happen in the following case:
Consumer A - took messages 1 - 100
Consumer B - took messages 101 - 200
Consumer A failed
Consumer B succeeded
What happens to the messages 1 - 100?
Will Kafka's automatic rebalance assign consumer B to read messages 1 - 100?
Or will the new consumer that starts up in place of Consumer A read the messages?
Thanks in advance.
Offset ranges are for partitions, not topics.
This scenario is not possible for a fresh consumer application unless one of the following is true:
Offsets 0-100 of the partition assigned to consumer B have been removed due to retention
Your code calls the seek method to skip those offsets
On the other hand, suppose the consumer group already existed, had consumed none of the records of the partition assigned to consumer A (say, it had failed before), and had committed offset 100 of the other partition. In this case, perhaps the same thing would happen; the consumer group might fail reading offset 0 of the "first" partition.
When any consumer instance fails, the group will rebalance. Depending on how you handle errors/failures, the previously healthy instance may then be assigned both partitions, and then fail consuming the "first" partition again (since it runs the same code that died previously). Alternatively, you can write the code to catch and ignore processing exceptions and optionally send the bad records to a dead-letter queue. When failures are logged or ignored this way, you commit offsets past the failing records and skip them.
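The "ignore and dead-letter" approach can be sketched without any Kafka client at all. The function below is a plain-Python illustration with made-up names (consume_with_dead_letters, strict_handler); in a real application the records would come from poll() and the dead letters would go to a separate topic.

```python
from typing import Callable, List, Optional, Tuple

Record = Tuple[int, str]           # (offset, value) from one partition
DeadLetter = Tuple[int, str, str]  # (offset, value, error message)


def consume_with_dead_letters(
    records: List[Record],
    handler: Callable[[str], None],
) -> Tuple[Optional[int], List[DeadLetter]]:
    """Handle records in order and park failures instead of crashing.

    Returns the offset to commit (last handled offset + 1, or None for
    an empty batch) together with the dead-lettered records.
    """
    dead_letters: List[DeadLetter] = []
    offset_to_commit: Optional[int] = None
    for offset, value in records:
        try:
            handler(value)
        except Exception as exc:  # deliberately broad: this is only a sketch
            dead_letters.append((offset, value, str(exc)))
        # Advance past the record whether it succeeded or was dead-lettered.
        offset_to_commit = offset + 1
    return offset_to_commit, dead_letters


def strict_handler(value: str) -> None:
    """Hypothetical handler that rejects one kind of record."""
    if value == "bad":
        raise ValueError("cannot parse record")


batch = [(0, "ok"), (1, "bad"), (2, "ok")]
commit_offset, dlq = consume_with_dead_letters(batch, strict_handler)
print(commit_offset, dlq)  # 3 [(1, 'bad', 'cannot parse record')]
```

Because the commit position moves past the bad record, a consumer that takes over the partition after a rebalance does not trip over the same record again.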

What consumer offset will be set if auto.offset.reset=earliest but topic has no messages

I have Kafka server version 2.4 and set log.retention.hours=168 (so that messages in the topic get deleted after 7 days) and auto.offset.reset=earliest (so that if the consumer doesn't find a last committed offset, it processes from the beginning). Since I am using Kafka 2.4, offsets.retention.minutes defaults to 10080 (I am not setting this property in my application).
My Topic data is : 1,2,3,4,5,6,7,8,9,10
current consumer offset before shutting down consumer: 10
End offset:10
last committed offset by consumer: 10
So let's say my consumer has not been running for the past 7 days and I start it on the 8th day. My last committed offset will have expired (due to offsets.retention.minutes=10080) and the topic messages will also have been deleted (due to log.retention.hours=168).
So I wanted to know what consumer offset will be set by auto.offset.reset=earliest now?
Although no data is available in the Kafka topic, your brokers still know the "next" offset within that partition. In your case, both the first and the last offset of this topic are 10, even though it does not contain any data.
Therefore, your consumer, which has already committed offset 10, will try to continue reading from there when started again, independent of the consumer configuration auto.offset.reset.
Your example gets even more interesting when your topic has had offsets, say, up to 15 while the consumer was shut down after committing offset 10. Now imagine all offsets were removed from the topic due to the retention policy. If you start your consumer then, and only then, the consumer configuration auto.offset.reset comes into effect, as stated in the documentation:
"What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted)"
As long as the Kafka topic is empty there is no offset "set" for the consumer. The consumer just tries to find the next available offset, either based on
the last committed offset or,
in case the last committed offset does not exist anymore, the configuration given through auto.offset.reset.
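To make the "empty topic still has a next offset" point concrete, here is a toy model of a single partition. PartitionLog and its methods are invented for illustration and are not a Kafka API; the only assumption is that offsets keep growing and retention only moves the start of the log forward.

```python
class PartitionLog:
    """Toy model of one partition: offsets only ever grow."""

    def __init__(self):
        self.records = {}      # offset -> value, for records still retained
        self.start_offset = 0  # earliest offset still available
        self.end_offset = 0    # offset the next appended record will get

    def append(self, value):
        self.records[self.end_offset] = value
        self.end_offset += 1

    def purge_before(self, offset):
        """Simulate retention deleting every record below `offset`."""
        offset = min(offset, self.end_offset)
        for old in range(self.start_offset, offset):
            self.records.pop(old, None)
        self.start_offset = max(self.start_offset, offset)

    def is_valid_position(self, committed):
        """A committed offset is usable if it lies within [start, end]."""
        return self.start_offset <= committed <= self.end_offset


log = PartitionLog()
for value in range(1, 11):        # ten records, offsets 0..9
    log.append(value)
log.purge_before(log.end_offset)  # retention removes all of them

# The partition is empty, yet start and end offsets are both 10.
print(log.start_offset, log.end_offset, log.is_valid_position(10))  # 10 10 True

for value in range(11, 16):       # five more records arrive (offsets 10..14)
    log.append(value)
log.purge_before(log.end_offset)  # and are purged later as well

# A committed offset of 10 now points at data that no longer exists.
print(log.start_offset, log.end_offset, log.is_valid_position(10))  # 15 15 False
```

In the first case the committed offset is still usable; in the second it is out of range, which is the situation where auto.offset.reset is consulted.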
Just as an additional note: even though the messages seem to get cleaned up by the retention policy, you may still see some data in the topic, as discussed in "Data still remains in Kafka topic even after retention time/size".
Once the consumer group's committed offsets are deleted from the log, auto.offset.reset takes precedence and consumers start consuming data from the beginning.
My Topic data is : 1,2,3,4,5,6,7,8,9,10
If the topic has the above data, the consumer will start from the beginning, and all records 1 to 10 will be consumed.
My Topic data is : 11,12,13,14,15,16,17,18,19,20
In this case, if old data has been purged due to retention, the consumer will reset the offset to earliest (the earliest offset available at that time) and start consuming from there. In this scenario it will consume everything from 11 to 20 (since 1 to 10 are purged).

How multiple consumers from different consumer groups read from same partition?

I have a use case where I have 2 consumers in different consumer groups (cg1 and cg2) subscribing to the same topic (Topic A) with 4 partitions.
What happens if both consumers are reading from the same partition and one of them fails while the other one has committed the offset?
In Kafka, offset management is done per consumer group and per partition.
If you have two consumer groups reading the same topic, and even the same partition, a commit from one consumer group will not have any impact on the other consumer group. The consumer groups are completely decoupled.
Within a consumer group, a topic partition is read by only one consumer at a time. A single consumer may be assigned several partitions, but two consumers of the same group can't read the same partition.
Example: if Consumer 1 of Consumer Group 1 is assigned a partition, no other consumer of Consumer Group 1 reads that partition.
Offset management used to be done by ZooKeeper; current versions use an internal topic instead:
__consumer_offsets: Every consumer group maintains its offsets per topic partition. Since v0.9, the information about committed offsets for every consumer group is stored in this internal topic (prior to v0.9 this information was stored in ZooKeeper).
When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. Finally, the offset manager will send a successful offset commit response to the consumer, only when all the replicas of the offsets topic receive the offsets.
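A rough way to picture the compacted offsets topic: every commit is appended as a record keyed by (group, topic, partition), and compaction keeps only the newest value per key. The sketch below models just that idea in plain Python; the record layout and the compact function are simplifications of mine, not the real on-disk format.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str, int]  # (consumer group, topic, partition)


def compact(commit_log: List[Tuple[Key, int]]) -> Dict[Key, int]:
    """Keep only the most recent committed offset for each key."""
    latest: Dict[Key, int] = {}
    for key, offset in commit_log:
        latest[key] = offset  # later commits overwrite earlier ones
    return latest


# Both groups read partition 0 of the same topic and commit independently.
commit_log = [
    (("cg1", "TopicA", 0), 5),
    (("cg2", "TopicA", 0), 2),
    (("cg1", "TopicA", 0), 9),
]
committed = compact(commit_log)
print(committed[("cg1", "TopicA", 0)])  # 9
print(committed[("cg2", "TopicA", 0)])  # 2
```

Since the group id is part of the key, a commit by cg1 can never overwrite the position of cg2, even on the same partition.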
Two consumers from two different consumer groups (cg1 and cg2) can read data from the same topic simultaneously.
Before Kafka 0.9: offset management was taken care of by ZooKeeper.
Since Kafka 0.9: the offsets of each consumer group are stored in the __consumer_offsets topic.
Offsets are used to keep track of consumers (how many records each has consumed). Let's say consumer-1 consumes 10 records and consumer-2 consumes 20 records, and suddenly consumer-1 dies. Whenever consumer-1 comes back up, it will start reading from the 11th record onward.

Kafka Consumer configuration - How does auto.offset.reset control the message consumption

I'm trying to understand how ConsumerConfig.auto.offset.reset = latest would affect message consumption.
For example, I have a producer sending 100 messages initially at time t1, and then my consumer is up and running at t1+30 sec. Would my consumer consume only the messages published after t1+30 sec, or will it consume messages published from t1 onwards?
It depends.
auto.offset.reset only applies when there is no stored offset for the consumer group.
It applies under the following conditions:
the first time a consumer group consumes
if a consumer doesn't commit any offsets, the next time it is started
if the consumer group's offsets have expired (after 7 days by default with modern brokers)
if the message the stored offset points to has been removed due to message retention policies (an attempt to read a message that has been purged triggers the application of the rule)
If a consumer commits an offset, it will start at the last committed offset the next time it is started.
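The rules above boil down to a small decision function. This is a simplified model for one partition, not client code: starting_offset and NoOffsetForPartition are names I invented, and log_start / log_end stand for the earliest available offset and the next offset to be written.

```python
from typing import Optional


class NoOffsetForPartition(Exception):
    """Raised when auto.offset.reset is 'none' and no usable offset exists."""


def starting_offset(
    committed: Optional[int], log_start: int, log_end: int, reset: str
) -> int:
    """Return the offset a consumer in the group would start reading from."""
    # A committed offset that still lies inside the log always wins.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    # Otherwise (nothing committed, or committed offset out of range)
    # the reset policy decides.
    if reset == "earliest":
        return log_start
    if reset == "latest":
        return log_end
    if reset == "none":
        raise NoOffsetForPartition("no valid committed offset for this group")
    raise ValueError(f"unknown auto.offset.reset value: {reset!r}")


# 100 messages were published at t1; a brand-new group starts at t1+30 sec.
print(starting_offset(None, 0, 100, "latest"))    # 100: only later messages
print(starting_offset(None, 0, 100, "earliest"))  # 0: all 100 messages
# A group that committed offset 40 earlier resumes there under either policy.
print(starting_offset(40, 0, 100, "latest"))      # 40
```

So for the example in the question: with latest, a group starting for the first time at t1+30 sec sees only messages published after it has started; with earliest it also gets the 100 messages from t1.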

Kafka multiple consumer

When we have multiple consumers reading from a topic with a single partition, is there any possibility that all the consumers will get all the messages?
I have created two consumers with manual offset commit. I started the first consumer and after 2 minutes started the 2nd consumer. The second consumer is reading messages from where the 1st consumer stopped reading. Is there any possibility that the 2nd consumer will read all the messages from the beginning? I'm new to Kafka, please help me out.
In your consumer, you would be using commitSync, which commits the offsets returned by the last poll. Now, when you start your 2nd consumer, since it is in the same consumer group, it will read messages from the last committed offset.
The messages your consumer consumes depend on the consumer group it belongs to. Suppose you have 2 partitions and 2 consumers in a single consumer group; then each consumer reads from a different partition, which helps to achieve parallelism.
So, if you want your 2nd consumer to read from the beginning, you can do one of 2 things:
a) Put the 2nd consumer in a different consumer group. For this consumer group, there won't be any offset stored anywhere, so the auto.offset.reset config will decide the starting offset. Set auto.offset.reset to earliest (reset to the earliest offset) or to latest (reset to the latest offset).
b) Seek to the start of all partitions your consumer is assigned by using: consumer.seekToBeginning(consumer.assignment())
Documentation: https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seekToBeginning-java.util.Collection-
https://kafka.apache.org/documentation/#consumerconfigs
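If you are on kafka-python rather than the Java client, option b) could look roughly like the helper below. It relies on KafkaConsumer's assignment(), poll() and seek_to_beginning() methods; the helper name rewind_to_beginning and the max_polls guard are my own additions. With subscribe(), partitions are only assigned once the consumer has polled, which is why the helper polls first.

```python
def rewind_to_beginning(consumer, timeout_ms=1000, max_polls=10):
    """Seek every assigned partition back to its earliest available offset.

    `consumer` is expected to behave like kafka-python's KafkaConsumer.
    Returns the set of partitions that were rewound.
    """
    # Wait until the group coordinator has handed us some partitions.
    for _ in range(max_polls):
        if consumer.assignment():
            break
        consumer.poll(timeout_ms=timeout_ms)
    partitions = consumer.assignment()
    if not partitions:
        raise RuntimeError("no partitions assigned; is the topic subscribed?")
    # Any records fetched while waiting are read again after the seek.
    consumer.seek_to_beginning(*partitions)
    return partitions
```

Usage would be something like rewind_to_beginning(consumer) right after creating the consumer and before the normal poll loop. Note that this rewinds on every start, so it is mainly useful for replay or debugging.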
A partition is always assigned to exactly one consumer within a single consumer group, regardless of how many consumers the group has. This means only that consumer can read its data, and the others won't consume from it until the partition is assigned to them. When the consumer goes down, a partition rebalance happens and the partition is assigned to another consumer. Since you are performing manual commits, the new consumer will start reading from the committed offset.
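A small simulation of that behaviour, assuming a plain round-robin assignment (Kafka's real assignors are more elaborate, and the assign function and consumer names here are made up):

```python
def assign(partitions, consumers):
    """Round-robin: every partition gets exactly one owner in the group."""
    members = sorted(consumers)
    owners = {}
    for index, partition in enumerate(sorted(partitions)):
        owners[partition] = members[index % len(members)]
    return owners


# Offsets committed by the group so far, per partition.
committed = {0: 100, 1: 42}

before = assign([0, 1], ["consumer-a", "consumer-b"])
print(before)  # {0: 'consumer-a', 1: 'consumer-b'}

# consumer-a dies, the group rebalances, and consumer-b takes over partition 0.
after = assign([0, 1], ["consumer-b"])
print(after)   # {0: 'consumer-b', 1: 'consumer-b'}

# The new owner resumes each partition at the group's committed offset.
resume = {p: (owner, committed[p]) for p, owner in after.items()}
print(resume)  # {0: ('consumer-b', 100), 1: ('consumer-b', 42)}
```

The committed offsets belong to the group, not to the individual consumer, which is why whoever owns the partition after the rebalance continues from the same place.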