KafkaConsumer CPP API assign() with auto commit - apache-kafka

I have a C++ Kafka consumer which uses assign() to specify the partitions rather than subscribe(), which I am fine with. Because of this, no rebalancing takes place, which I am also fine with.
Question 1:
I want to understand how auto commit works here. Let's say there are 2 consumers, both with the same groupId. Both of them will get all the updates, but how will the commit happen here? If there is only one consumer, the commit happens under the consumer group id. But how does it work with 2 consumers? I don't see any commit failures in these cases either.
Question 2:
How does rd_kafka_offsets_store() work when I assign partitions? Does it play well with assign(), or should I use subscribe() in these cases?

Two non-subscribing consumers with the same group.id will commit offsets for their assigned partitions without any correlation or conflict resolution; if they're assigned the same partitions, they will overwrite each other's commits.
Either use unique group.ids or subscribe to the topics.
rd_kafka_offsets_store() works the same way with assign() or subscribe(), namely by storing (in memory) the offset to commit on the next auto or manual commit.
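For illustration, here is a minimal Java sketch of the overwrite scenario (the librdkafka C++ consumer behaves the same way); the broker address, topic name, and group id are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder broker
        props.put("group.id", "shared-group");                // same id in both processes
        props.put("enable.auto.commit", "true");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manual assignment: no group rebalancing, so a second process with the
            // same code reads the very same partition.
            consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // Auto commit stores the latest polled offset under "shared-group";
                // whichever process commits last wins, silently overwriting the other.
            }
        }
    }
}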

Related

Does it make a difference to use assign instead of subscribe in Kafka consumers that use the same group_id?

When I use assign to consume messages from Kafka with two consumers that use the same group_id, will they affect each other?
If another consumer uses the same group_id but with subscribe, will it affect the above consumers?
Based on the documentation, mixing assign (manual partition assignment) and subscribe (dynamic partition assignment) isn't possible, but it's not clear what the problems are exactly or how it will fail.
Two consumers with the same group_id work independently, but this can cause offset commit conflicts.
See Manual Partition Assignment in the API docs:
Each consumer acts independently even if it shares a groupId with another consumer. To avoid offset commit conflicts, you should usually ensure that the groupId is unique for each consumer instance. Note that it isn't possible to mix manual partition assignment (i.e. using assign) with dynamic partition assignment through topic subscription (i.e. using subscribe).
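As a rough sketch of that recommendation (broker, topic, and partitions are placeholders), each manually assigned instance gets its own groupId so committed offsets never collide:

import java.util.Arrays;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManuallyAssignedConsumer {
    public static KafkaConsumer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");       // placeholder broker
        // A unique group.id per instance keeps committed offsets from colliding.
        props.put("group.id", "reader-" + UUID.randomUUID());
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Manual assignment only: never mix this with subscribe() on the same instance.
        consumer.assign(Arrays.asList(
                new TopicPartition("my-topic", 0),
                new TopicPartition("my-topic", 1)));
        return consumer;
    }
}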

Commit issue with Kafka

I am working on a module where there is a producer, and we are using Kafka as a queue to feed the produced data to a consumer.
Now, in the consumer, I am trying to implement an at-least-once messaging scenario.
For this I have to poll the messages from Kafka and then consume them. After consuming, I call consumer.commitAsync(offset, callback).
I want to know what will happen in the following cases:
Case 1) When the commitAsync() API is never called (suppose there was an exception just before calling it). I was assuming the message would be delivered to the consumer again, but that is not happening; the consumer never gets that data again.
Case 2) When the consumer reboots.
Below is the code snippet of the properties set on the consumer:
private Properties getConsumerProperties() {
    // Config keys such as BOOTSTRAP_SERVERS_CONFIG are statically imported from
    // org.apache.kafka.clients.consumer.ConsumerConfig.
    final Properties props = new Properties();
    props.put(BOOTSTRAP_SERVERS_CONFIG, "server");
    props.put(GROUP_ID_CONFIG, "groupName");
    props.put(ENABLE_AUTO_COMMIT_CONFIG, false);
    props.put(HEARTBEAT_INTERVAL_MS_CONFIG, heartBeatinterval);
    props.put(METADATA_MAX_AGE_CONFIG, metaDataMaxAge);
    props.put(SESSION_TIMEOUT_MS_CONFIG, sessionTimeout);
    props.put(AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
    props.put(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    return props;
}
Now in the consumer, based on a property, I have 3 topics and create 3 consumers, one for each topic (as there are 3 partitions and 3 Kafka brokers).
For consuming the data, I identify the packet based on a property when it is received from Kafka and pass it to the relevant topic (I have separate thread pools for the different topics; I create tasks based on the property in the packet and submit them to the relevant pool). In the tasks, after processing, I call consumer.commitAsync(offset, callback).
I was expecting the same message to be pulled again from Kafka when commitAsync is not called for some packet, but to my surprise it is not coming back. Am I missing something? Is there any setting we need to make on the Kafka side as well for at-least-once?
Please suggest.
There are a couple of things to be addressed in your question.
Before I get to the suggestions on how to achieve at-least-once behavior, I'll try and address the 2 cases:
Case 1) When the commitAsync() API is never called (suppose there was an exception just before calling it). I was assuming the message would be delivered to the consumer again, but that is not happening; the consumer never gets that data again.
The reason your consumer does not get the old data could be the enable.auto.commit property, which is set to true by default and commits the offsets regularly in the background. Because of this, the consumer on subsequent runs will find an offset to work with and will just wait for new data/messages to arrive.
Case 2) When the consumer reboots.
This would also be similar, i.e. if the consumer after rebooting finds a committed offset to work with, it will start consuming from that offset, whether the offset was committed automatically because enable.auto.commit was set to true or by invoking commitAsync()/commitSync() explicitly.
Now, moving to the part on how to achieve at-least-once behavior - I could think of the following 2 ways:
If you want to take control of committing offsets, then set the "enable.auto.commit" property to false and then invoke commitSync() or commitAsync() with retries handled in the Callback function.
Note: The choice of Synchronous vs Asynchronous commit will depend on your latency budget and any other requirements. So, not going too much into those details here.
The other option is to utilise the automatic offset commit feature i.e. setting enable.auto.commit to true and auto.commit.interval.ms to an acceptable number (again, based on your requirements on how often would you like to commit offsets).
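A minimal sketch of the first option, reusing getConsumerProperties() from the question (with enable.auto.commit left at false); process() and the topic name are placeholders. The offsets are committed only after processing succeeds, so an uncommitted batch is redelivered after a restart:

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(getConsumerProperties())) {
    consumer.subscribe(Collections.singletonList("my-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record);   // placeholder for your business logic
        }
        // Commit only after processing succeeds; if process() throws, nothing is
        // committed and the same records are redelivered -> at-least-once.
        consumer.commitSync();
    }
}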
I think the default behaviour of Kafka is centered around at-least-once semantics, so it should be fairly straightforward.
I hope this helps!

auto-offset-reset=latest does not work in spring-kafka

I have a use case where I want the consumer to always start from the latest offset. I don't need to commit offsets for this consumer. This is not possible to achieve with spring-kafka, as a new consumer group always commits newly assigned partitions. Then, on subsequent starts of the program, the consumer reads from this stored offset, and not from the latest. In other words, only the very first start with a new consumer group behaves correctly, i.e. consuming from the latest. The problem is in KafkaMessageListenerContainer$ListenerConsumer.onPartitionsAssigned()
For reference, I set the following in Spring Boot:
spring.kafka.listener.ack-mode=manual
spring.kafka.consumer.auto-offset-reset=latest
spring.kafka.consumer.enable-auto-commit=false
That code was added to solve some nasty race conditions when a repartition occurred while a new consumer group started consuming; it could cause lost or duplicate records, depending on configuration.
It was felt best to commit the initial offset to avoid these conditions.
I agree, though, that if the user takes complete responsibility for offsets (with a MANUAL AckMode) then we should probably not do that commit; it's up to the user code to deal with the race (in your case, you don't care about lost records).
Feel free to open a GitHub issue (contributions are welcome).
In the meantime, you can avoid the situation by having your listener implement ConsumerSeekAware and seek to the topic/partition ends during assignment.
Another alternative is to use a UUID for the group.id each time; and you will always start at the topic end.
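For example, a rough sketch of the ConsumerSeekAware approach mentioned above, assuming a spring-kafka version in which the interface's other methods have default implementations; the topic and group id are placeholders:

import java.util.Map;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.ConsumerSeekAware;

public class LatestOnlyListener implements ConsumerSeekAware {

    @Override
    public void onPartitionsAssigned(Map<TopicPartition, Long> assignments,
                                     ConsumerSeekCallback callback) {
        // Jump to the end of every assigned partition, ignoring any committed offset.
        assignments.keySet().forEach(tp -> callback.seekToEnd(tp.topic(), tp.partition()));
    }

    @KafkaListener(topics = "my-topic", groupId = "latest-only")
    public void listen(String message) {
        System.out.println("received: " + message);
    }
}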

kafka subscribe commit offset manually

I am using Kafka 0.9 and am confused by the behavior of subscribe.
Why does it expect a group.id with subscribe?
Do we need to commit the offset manually using commitSync? Even if I don't do that, I see that it always starts from the latest.
Is there a way to replay the messages from the beginning?
Why does it expect a group.id with subscribe?
The concept of consumer groups is used by Kafka to enable parallel consumption of topics: every message will be delivered once per consumer group, no matter how many consumers are actually in that group. This is why the group parameter is mandatory; without a group, Kafka would not know how this consumer should be treated in relation to other consumers that might subscribe to the same topic.
Whenever you start a consumer it will join a consumer group, and based on how many other consumers are in this group it will then be assigned partitions to read from. For these partitions it then checks whether a last read offset is known; if one is found, it will start reading messages from that point.
If no offset is found, the parameter auto.offset.reset controls whether reading starts at the earliest or latest message in the partition.
Do we need to commit the offset manually using commitSync? Even if I don't do that, I see that it always starts from the latest.
Whether or not you need to commit the offset depends on the value you choose for the parameter enable.auto.commit. By default this is set to true, which means the consumer will automatically commit its offset regularly (how often is defined by auto.commit.interval.ms). If you set this to false, then you will need to commit the offsets yourself.
This default behavior is probably also what is causing your "problem" where your consumer always starts with the latest message. Since the offset was auto-committed it will use that offset.
Is there a way to replay the messages from the beginning?
If you want to start reading from the beginning every time, you can call seekToBeginning, which will reset to the first message in all subscribed partitions if called without parameters, or just those partitions that you pass in.
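A short sketch of that approach ("my-topic" is a placeholder; the Collection-based signature shown here is from newer clients, while the 0.9 client takes varargs):

consumer.subscribe(Collections.singletonList("my-topic"));
// Poll until the group join completes and partitions are assigned.
while (consumer.assignment().isEmpty()) {
    consumer.poll(Duration.ofMillis(100));
}
consumer.seekToBeginning(consumer.assignment());   // rewind every assigned partition
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
}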

FlinkKafkaConsumer08 offset control

I want to use FlinkKafkaConsumer08 to read a Kafka topic. The messages are commands in event-sourcing terms. I want to start from the end, not read the messages already in the topic.
I suppose there is a way to tell FlinkKafkaConsumer08 to start from the end.
How?
edit
I have tried setting the "auto.offset.reset" property to "largest", with no result. I have tried enableCheckpointing too.
I have tried setting "auto.commit.interval.ms" to 1000. Then, at least, messages that have been previously processed are not processed again. This is a big improvement, as commands are at least not executed twice, but it would be much better to discard old command messages. The solution I will adopt is to discard old messages based on date and return an error.
The auto.offset.reset property is only used if Kafka cannot find committed offsets in Kafka/ZooKeeper for the current consumer group. Thus, if you're reusing a consumer group, this property will most likely not be respected. However, starting the Kafka consumer in a new consumer group should do the trick.
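For instance, a rough sketch along those lines (topic, addresses, and job name are placeholders): give the job a fresh group.id on every start so there are no committed offsets, and auto.offset.reset=largest takes effect.

import java.util.Properties;
import java.util.UUID;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class LatestOnlyJob {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.setProperty("zookeeper.connect", "localhost:2181");   // required by the 0.8 consumer
        // Fresh group id on every start: no committed offsets exist, so
        // auto.offset.reset=largest applies and the job starts from the end.
        props.setProperty("group.id", "commands-" + UUID.randomUUID());
        props.setProperty("auto.offset.reset", "largest");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new FlinkKafkaConsumer08<>("commands", new SimpleStringSchema(), props))
           .print();
        env.execute("latest-only-job");
    }
}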