kafka offset management auto vs manual - apache-kafka

I'm working on a Spring Boot application that uses Kafka Streams. I want to manage the Kafka offsets myself and commit an offset only after a message has been processed successfully. This is important so I can be certain I won't lose messages even if Kafka is restarted or ZooKeeper is down. My current situation is that when Kafka goes down and comes back up, my consumer starts from the beginning and re-consumes all the previous messages.
Also, what is the difference between managing the Kafka offsets automatically using autoCommitOffset and managing them manually using HBase, ZooKeeper, or checkpoints?
And what are the benefits of managing them manually if there is an automatic configuration we can use?

You have no guarantee of durability with auto commit
Older Kafka clients did use ZooKeeper for offset storage, but offsets are now stored in the broker itself to minimize dependencies. The Kafka Streams API has no way to integrate offset storage outside of Kafka, so if you choose to keep offsets in external storage you must use the Consumer API to look up, seek to, and commit offsets yourself; even then, you can still end up with less-than-optimal message processing guarantees.
my current situation is when my Kafka is down and up my consumer starts from the beginning and consumes all the previous messages
Sounds like you set auto.offset.reset=earliest and you never commit any offsets at all...
The auto commit setting does a periodic commit, not "automatic after reading any message".
If you want to guarantee delivery, then you need at least acks=1 on the producer and an explicit commitSync in the consumer after the message has been processed.
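As a rough sketch with the plain Java clients (topic and group names here are made up for illustration), the relevant pieces are the producer acks setting and an explicit commitSync once processing has succeeded:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProduceConsume {
    public static void main(String[] args) {
        // Producer side: wait for the leader (acks=1) or all in-sync replicas (acks=all) to confirm the write.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("acks", "all");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));   // hypothetical topic
        }

        // Consumer side: disable auto-commit and commit only after processing succeeds.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "my-group");               // hypothetical group id
        consumerProps.put("enable.auto.commit", "false");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // ... process the record; let it throw on failure so the offset is never committed ...
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();   // offsets advance only after successful processing
                }
            }
        }
    }
}
```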

Related

If my service consumes Kafka messages, can kafka somehow lose my offsets?

I have a service that connects to Kafka as a message consumer, and for every message I read I commit that message's offset, so that if my service shuts down and restarts it will resume reading from the last committed message onwards. My understanding is that the committed offset is maintained by Kafka.
Now my question is: do I have to worry about the offset? Can Kafka somehow lose that information, so that when the service restarts it starts reading from the beginning or the end of the topic depending on my initial offset config? Or, if Kafka loses my offset, will it also have lost all the messages in the topic, so that it's fine to read from the beginning?
Note: I use spring-kafka on the service, but not sure if that is relevant to the question.
In most cases where you have an active consumer (with manual or auto-committing), you don't need to worry about it.
The case where you do need to consider the behavior of the auto.offset.reset setting is when the offsets.retention.minutes period on the broker has elapsed while your consumer group(s) are inactive. When this happens, Kafka compacts the __consumer_offsets topic and removes any offsets stored for those inactive groups.
Losing offsets doesn't affect the source topic. The topics you consume from have their own independent retention settings, and their messages can be removed as well (or not), depending on how you've configured them.
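For illustration, the client-side knob in question is just a consumer property (the retention window itself, offsets.retention.minutes, is a broker-side setting, not something the client controls); the group id here is a placeholder:

```java
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-group");          // hypothetical group id
// Where to start when the group has no committed offset, e.g. a brand-new group
// or one whose offsets expired after offsets.retention.minutes of inactivity:
props.put("auto.offset.reset", "earliest"); // or "latest" / "none"
```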

confluent kafka - Rate limiting

Rate limiting: since Kafka can produce messages at a much higher rate than MQ can consume them, is there some configuration we can set on the Kafka consumer to enable a rate-limited transfer and protect the stability of MQ?
Also, exactly-once semantics: I understand that Kafka supports exactly-once semantics, which would stop the re-transfer of messages that have already been consumed. Can someone guide me on how to set up this configuration?
We are using the Confluent Kafka enterprise version in our organization.
Rate limiting: Kafka is pull-based, so your consumer can read messages at its own pace and transfer them into MQ (but if the second system is constantly slower, the backlog of unprocessed messages in Kafka will grow over time).
Exactly-once semantics: to ensure exactly-once semantics on the consumer you need to commit the read offset manually, and only once the message has been successfully processed. (The default behavior is an automatic commit of the read offset after a small interval, which can lead to loss of a message if a failure happens after the offset commit but before processing of the message has finished.)
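A hedged sketch of that bridge pattern: manual commit after the message has reached MQ, plus a crude per-consumer rate limit. The forwardToMq method, topic name and group id are placeholders, not a real API:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToMqBridge {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "mq-bridge");            // hypothetical group id
        props.put("enable.auto.commit", "false");      // commit only after MQ has the message
        props.put("max.poll.records", "100");          // cap the batch size per poll
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("source-topic"));   // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    forwardToMq(record.value());   // placeholder for the MQ send
                    Thread.sleep(10);              // crude rate limit: roughly 100 msgs/sec per consumer
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();         // offsets move forward only after MQ accepted the batch
                }
            }
        }
    }

    private static void forwardToMq(String message) {
        // hand the message to MQ here
    }
}
```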

Preventing message loss with Kafka High Level Consumer 0.8.x

A typical kafka consumer looks like the following:
kafka-broker ---> kafka-consumer ----> downstream-consumer like Elastic-Search
And according to the documentation for Kafka High Level Consumer:
The ‘auto.commit.interval.ms’ setting is how often updates to the consumed offsets are written to ZooKeeper
It seems that there can be message loss if the following two things happen:
Offsets are committed just after some messages are retrieved from kafka brokers.
Downstream consumers (say Elastic-Search) fail to process the most recent batch of messages OR the consumer process itself is killed.
It would be most ideal if offsets were not committed automatically on a time interval but instead committed through an API call. That way the kafka-consumer can commit offsets only after it receives an acknowledgement from the downstream-consumer that the messages were successfully consumed. There could be some replay of messages (if the kafka-consumer dies before committing offsets), but at least there would be no message loss.
Please let me know if such an API exists in the High Level Consumer.
Note: I am aware of the Low Level Consumer API in 0.8.x version of Kafka but I do not want to manage everything myself when all I need is just one simple API in High Level Consumer.
Ref:
AutoCommitTask.run(), look for commitOffsetsAsync
SubscriptionState.allConsumed()
There is a commitOffsets() API in the High Level Consumer API that can be used to solve this.
Also set the option "auto.commit.enable" to "false" so that offsets are never committed automatically by the Kafka consumer.
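A rough sketch of that pattern with the old 0.8.x high-level consumer; the topic and group names are placeholders, and the exact API details should be checked against your client version:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class ManualCommitHighLevelConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("group.id", "my-group");              // hypothetical group id
        props.put("auto.commit.enable", "false");       // never commit automatically

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("my-topic", 1));
        ConsumerIterator<byte[], byte[]> it = streams.get("my-topic").get(0).iterator();

        while (it.hasNext()) {
            byte[] message = it.next().message();
            // ... hand the message to the downstream-consumer (e.g. Elastic-Search)
            // and wait for its acknowledgement ...
            connector.commitOffsets();                   // commit only after the downstream ack
        }
    }
}
```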

Apache Kafka Consumer group and Simple Consumer

I am new to Kafka. What I've understood so far regarding consumers is that there are basically two types of implementation:
1) The High level consumer/consumer group
2) Simple Consumer
The most important part about the high-level abstraction is that it is used when you don't want to handle the offsets yourself, while the Simple consumer provides much better control over offset management. What confuses me is this: what if I want to run the consumer in a multithreaded environment and also want to have control over the offsets? If I use a consumer group, does that mean I must read from the last offset stored in ZooKeeper? Is that the only option I have?
For the most part, the high-level consumer API does not let you control the offset directly.
When the consumer group is first created, you can tell it whether to start with the oldest or newest message that kafka has stored using the auto.offset.reset property.
You can also control when the high-level consumer commits new offsets to zookeeper by setting auto.commit.enable to false.
Since the high-level consumer stores the offsets in zookeeper, your app could access zookeeper directly and manipulate the offsets - but it would be outside of the high-level consumer API.
Your question was a little confusing but you can use the simple consumer in a multi-threaded environment. That's what the high-level consumer does.
In Apache Kafka 0.9 and 0.10, consumer group management is handled entirely within Kafka itself, by a broker (for coordination) and an internal topic (for state storage).
When a consumer group first subscribes to a topic the setting of auto.offset.reset determines where consumers begin to consume messages (http://kafka.apache.org/documentation.html#newconsumerconfigs)
You can register a ConsumerRebalanceListener to receive a notification when a particular consumer is assigned topics/partitions.
Once the consumer is running, you can use seek, seekToBeginning and seekToEnd to read messages from a specific offset. seek affects the next poll for that consumer, and the new position is stored on the next commit (e.g. commitSync, commitAsync, or when auto.commit.interval.ms elapses, if auto-commit is enabled).
The consumer javadocs mention more specific situations: http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
You can combine the group management provided by Kafka with manual management of offsets via seek(..) once partitions are assigned.
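A sketch of that combination, where Kafka assigns the partitions and the application decides where to start reading within each one; the topic name, group id and the offset lookup are placeholders:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekOnAssignment {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");            // hypothetical group id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // commit or persist the current positions here if needed
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Kafka decided which partitions we own; we decide where to start reading.
                for (TopicPartition tp : partitions) {
                    consumer.seek(tp, lookupOffsetFor(tp));  // placeholder lookup, e.g. from your own store
                }
            }
        });

        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                // process, then commitSync() or persist the offsets yourself
            }
        }
    }

    private static long lookupOffsetFor(TopicPartition tp) {
        return 0L;   // illustrative only
    }
}
```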

Simple-Kafka-consumer message delivery duplication

I am trying to implement a simple Producer --> Kafka --> Consumer application in Java. I am able to produce and consume messages successfully, but the problem occurs when I restart the consumer: some of the already-consumed messages get picked up again from Kafka (not all messages, but a few of the most recently consumed ones).
I have set autooffset.reset=largest in my consumer and my autocommit.interval.ms property is set to 1000 milliseconds.
Is this 'redelivery of some already consumed messages' a known problem, or is there any other settings that I am missing here?
Basically, is there a way to ensure none of the previously consumed messages are getting picked up/consumed by the consumer?
Kafka uses ZooKeeper to store consumer offsets. Since ZooKeeper operations are pretty slow, it's not advisable to commit an offset after the consumption of every message.
It's possible to add a shutdown hook to the consumer that manually commits the topic offsets before exit. However, this won't help in certain situations (such as a JVM crash or kill -9). To guard against those situations, I'd advise implementing custom commit logic that commits the offset locally after processing each message (to a file or a local database), and also commits the offset to ZooKeeper every 1000 ms. Upon consumer startup, both of these locations should be queried, and the maximum of the two values used as the consumption offset.
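A sketch of that idea, using the modern consumer API for illustration (the 0.8 high-level consumer in the question doesn't expose seek): the offset is written to a local file after every processed record, offsets are also committed to Kafka roughly every second, and on startup the consumer resumes from whichever position is further ahead. The file path, topic and group id are placeholders, and a single partition is assumed for simplicity:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LocalPlusBrokerOffsets {
    private static final Path OFFSET_FILE = Paths.get("/tmp/consumer-offset.txt"); // placeholder path

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");               // hypothetical group id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
            public void onPartitionsRevoked(Collection<TopicPartition> parts) { }
            public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                for (TopicPartition tp : parts) {
                    OffsetAndMetadata meta = consumer.committed(tp);       // offset stored in Kafka for this group
                    long committed = (meta == null) ? 0L : meta.offset();
                    long local = readLocalOffset();                        // offset we recorded ourselves
                    consumer.seek(tp, Math.max(committed, local));         // resume from whichever is further ahead
                }
            }
        });

        long lastCommit = System.currentTimeMillis();
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                // ... process the record ...
                writeLocalOffset(record.offset() + 1);             // cheap local write after every message
            }
            if (System.currentTimeMillis() - lastCommit > 1000) {  // commit to Kafka roughly every second
                consumer.commitAsync();
                lastCommit = System.currentTimeMillis();
            }
        }
    }

    private static long readLocalOffset() {
        try {
            return Long.parseLong(new String(Files.readAllBytes(OFFSET_FILE), StandardCharsets.UTF_8).trim());
        } catch (IOException | NumberFormatException e) {
            return 0L;   // no local record yet
        }
    }

    private static void writeLocalOffset(long offset) throws IOException {
        Files.write(OFFSET_FILE, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
    }
}
```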