How does Spring Kafka manual commit work in batch listener mode - apache-kafka

I have a topic with 2 partitions. I am using Kafka batch listener mode in my consumer application. Since I am using a single consumer application, I will receive messages from both partitions. Once the consumer application has processed that list of messages, I want to manually commit the largest offset of each partition.
If I use MANUAL_IMMEDIATE mode, will it commit the highest offset of each partition? If not, what approach should I use?

Yes; acknowledgment.acknowledge() will commit the highest offset, for each partition, among the records received in the batch.
However, the container will do it automatically for you if you use the default AckMode.BATCH. This is simpler than dealing with manual acks.
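As a minimal sketch (assuming Spring for Apache Kafka 2.3+, String keys/values, and the topic/group names used here; process() is a hypothetical handler), a batch listener with MANUAL_IMMEDIATE acks looks roughly like this:

import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;
import org.springframework.kafka.support.Acknowledgment;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setBatchListener(true); // deliver the whole poll result as a List
    factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
    return factory;
}

@KafkaListener(topics = "my-topic", groupId = "my-group")
public void listen(List<ConsumerRecord<String, String>> records, Acknowledgment ack) {
    records.forEach(this::process);
    ack.acknowledge(); // commits the highest received offset for each partition in this batch
}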

Related

kafka offset management auto vs manual

I'm working on a Spring Boot application that uses Kafka Streams. In my application, I want to manage Kafka offsets and commit an offset only after a message has been processed successfully. This matters because I must not lose messages even if Kafka is restarted or ZooKeeper is down. Currently, when my Kafka goes down and comes back up, my consumer starts from the beginning and consumes all the previous messages.
Also, what is the difference between managing Kafka offsets automatically using autoCommitOffset and managing them manually using HBase, ZooKeeper, or checkpoints?
And what are the benefits of managing them manually if there is an automatic configuration we can use?
You have no guarantee of durability with auto commit.
Older Kafka clients did use ZooKeeper for offset storage, but now it is all in the broker to minimize dependencies. The Kafka Streams API has no way to integrate offset storage outside of Kafka itself, so you must use the Consumer API to look up and seek/commit offsets to external storage if you choose to do so; however, you can still end up with less-than-optimal message processing.
my current situation is when my Kafka is down and up my consumer starts from the beginning and consumes all the previous messages
Sounds like you set auto.offset.reset=earliest and you never commit any offsets at all...
The auto commit setting does a periodic commit, not "automatic after reading any message".
If you want to guarantee delivery, then you need to set at least acks=1 in the producer and actually call commitSync in the consumer.
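As a minimal sketch of that consumer side (bootstrap server, group id, and topic name are assumptions; process() is a hypothetical handler): auto commit is disabled, and offsets are committed only after the polled records have been processed.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("events"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            process(record); // hypothetical handler; let it throw to avoid committing on failure
        }
        if (!records.isEmpty()) {
            consumer.commitSync(); // commit only after the whole poll was processed successfully
        }
    }
}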

Kafka Topic ordering when scaling up the partitions

Consider that the producers create messages for the users of a system and that ordering matters at the user level.
My producers add messages to a topic that has two partitions, and I hash on the user_id so that all messages of a given user land in the same partition, which guarantees their order.
How can I scale up the system and add more partitions to the topic while keeping the order of the messages?
How does Kafka treat the messages that were already produced before the repartitioning?
What will happen to the messages that were consumed but whose offsets were not yet committed back to Kafka?
1. Use a TreeSet (an ordered set) to cache messages on the consumer client and keep them for a minute or so. Kafka only guarantees ordering within a single partition, and I think the producer cannot guarantee ordering across partitions either.
2. If you do not commit the offset manually, the next fetch request will return the same messages. In any case, you should make message processing idempotent on the consumer side, even after you have committed the offset.
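To make the keying described in the question concrete, here is a minimal sketch (topic name, user id, and bootstrap server are assumptions): the user_id is used as the record key, so the default partitioner hashes it and all messages of one user go to the same partition.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    String userId = "user-42"; // hypothetical user id used as the partitioning key
    producer.send(new ProducerRecord<>("user-events", userId, "payload"));
}

Note that the default partitioner takes the key hash modulo the current partition count, which is exactly why adding partitions changes where new messages of an existing user land.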

Consume messages without committing from Kafka 10 consumer

I have a requirement to read messages from a topic, batch them, and push the batch to an external system. If the batch fails for any reason, I need to consume the same set of messages again and repeat the process. So, for every batch, the from and to offsets for each partition are stored in a database. In order to achieve this, I am creating one Kafka consumer per partition by assigning the partition to the reader; based on the previously stored offsets, the consumers seek to that position and start reading. I have turned off auto commit and I don't commit offsets from the consumer. For every batch, I create a new consumer per partition, read messages from the last stored offset, and publish them to the external system. Do you see any problems with consuming messages without committing offsets and using the same consumer group across batches, given that at any point there won't be more than one consumer per partition?
Your design seems reasonable to me.
Committing offsets to Kafka is just a convenient built-in mechanism within Kafka to keep track of offsets. However, there is no requirement whatsoever to use it -- you can use any other mechanism to track offsets, too (like using a DB as in your case).
Furthermore, if you assign partitions manually, there will be no group management anyway, so the group.id parameter has no effect. See http://docs.confluent.io/current/clients/consumer.html for more details.
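A minimal sketch of that pattern (topic name and bootstrap server are assumptions; loadOffsetFromDb() and saveOffsetToDb() are hypothetical helpers backed by your database): the consumer is assigned one partition, seeks to the stored offset, reads a batch, and never commits to Kafka.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

TopicPartition partition = new TopicPartition("my-topic", 0);

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.assign(Collections.singletonList(partition)); // no group management, no rebalancing
    consumer.seek(partition, loadOffsetFromDb(partition));  // resume from the offset stored in the DB
    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(5));
    // push the batch to the external system; only on success store the new end offset:
    // saveOffsetToDb(partition, consumer.position(partition));
}

This sketch uses the poll(Duration) overload of newer clients; with the 0.10 client mentioned in the question, poll(long) takes a timeout in milliseconds instead.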
With Kafka version 2 I achieved this behaviour without needing a database to store the offsets.
The following is a configuration for Spring Boot Kafka, but it should also work with any Kafka consumer API.
spring:
  kafka:
    bootstrap-servers: ...
    consumer:
      value-deserializer: ...
      max-poll-records: 1000
      enable-auto-commit: false
      fetch-min-size: 262144 # 1/4 MB
      group-id: ...
      fetch-max-wait: 10000 # we will consume every 10s, or when 1/4 MB or 1000 records are accumulated
      auto-offset-reset: earliest
    listener:
      type: batch
      concurrency: 7
      ack-mode: manual
This gives me the messages in batches of at most 1000 records (depending on load). I then write these records asynchronously to a database and count how many success callbacks I get. If the number of successful writes equals the received batch size, I acknowledge the batch, i.e. I commit the offset. This design was very reliable even in a high-load production environment.
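A minimal sketch of that listener (the topic name is an assumption, and writeAsync() is a hypothetical method returning a CompletableFuture per record): the batch is acknowledged only if every asynchronous write reported success.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;

@KafkaListener(topics = "my-topic")
public void onBatch(List<ConsumerRecord<String, String>> records, Acknowledgment ack) {
    AtomicInteger successes = new AtomicInteger();
    List<CompletableFuture<Void>> futures = records.stream()
            .map(r -> writeAsync(r)                        // hypothetical async DB write
                    .thenRun(successes::incrementAndGet)   // count success callbacks
                    .exceptionally(ex -> null))            // failed writes simply don't count
            .collect(Collectors.toList());
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join(); // wait for all writes
    if (successes.get() == records.size()) {
        ack.acknowledge(); // all writes succeeded: commit the batch's offsets
    }
    // otherwise leave the offsets uncommitted so the batch can be reprocessed after a restart/rebalance
}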

Does a Kafka Consumer receive a list of offsets first, before receiving the bytes/data?

I'm quite new to Apache Kafka and I'm currently reading Learning Apache Kafka, 2ed, (2015). Chapter 3, paragraph Kafka Design fundamentals says the following:
Consumers always consume messages from a particular partition sequentially and also acknowledge the message offset. This acknowledgement implies that the consumer has consumed all prior messages. Consumers issue an asynchronous pull request containing the offset of the message to be consumed to the broker and get the buffer of bytes.
I'm a bit thrown off by the word 'acknowledge'. Do I understand correctly that Kafka sends the offsets first, and the consumer then uses that list of offsets to request the data it has not consumed yet?
Thanks in advance,
Nick
On startup, KafkaConsumer issues an offset lookup request to the brokers for the consumer group configured on this consumer. If valid offsets are returned, those are used; otherwise, the consumer uses an initial offset according to the auto.offset.reset parameter.
Afterwards, offsets are maintained mainly in memory within the consumer. Each poll() sends the current offset to the broker, and when the broker replies, the consumer updates its in-memory offsets.
Additionally, the in-memory offsets are committed/acked to the broker from time to time. This can happen automatically within poll() if auto commit is enabled, or commitSync()/commitAsync() must be called explicitly to send the offsets to the broker for reliable storage.
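As a small illustrative sketch of that flow (topic name and connection settings are assumptions): position() reflects the consumer's in-memory offset, which advances with each poll(), while committed() returns what was last stored on the broker until an explicit or automatic commit updates it.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "reader-group");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    TopicPartition tp = new TopicPartition("my-topic", 0);
    consumer.assign(Collections.singletonList(tp));
    consumer.poll(Duration.ofSeconds(1));                 // fetch records; advances the in-memory position
    long inMemory = consumer.position(tp);                // next offset the consumer will fetch
    OffsetAndMetadata onBroker = consumer.committed(tp);  // last committed offset on the broker (null if none)
    consumer.commitSync();                                // explicitly persist the in-memory offset
}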

read kafka message starting from a specific offset using high level API

I hope I am not making a mistake, but I remember that the Kafka documentation mentioned that with the high-level API you can't start reading messages from a specific offset, though it said this would change.
Is it now possible, using the high-level API, to read messages from a specific partition and a specific offset? Could you please give me an example of how to do it?
I am using kafka 0.8.1.1.
Thanks in advance.
You can do that with kafka 0.9:
http://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the same partition more than once, the latest offset will be used on the next poll(). Note that you may lose data if this API is arbitrarily used in the middle of consumption, to reset the fetch offsets.
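For example, a minimal sketch with the 0.9 consumer (topic, partition, offset, and group id mirror the ZooKeeper example below; the bootstrap server is an assumption):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "spark-app");
props.put("enable.auto.commit", "false");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    TopicPartition tp = new TopicPartition("topicname", 0);
    consumer.assign(Collections.singletonList(tp));       // take the partition without group management
    consumer.seek(tp, 10L);                                // start reading from offset 10
    ConsumerRecords<String, String> records = consumer.poll(1000); // 0.9 API: timeout in milliseconds
}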
Kafka 0.8.1.1 can use ZooKeeper to store offsets for each consumer group. If you configure your consumer to commit offsets to ZooKeeper, then you just need to manually set the starting offset for the topic and partition under ZooKeeper for your consumer group.
You need to connect to ZooKeeper and use the set command:
set /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)
E.g., setting offset 10 for partition 0 of topicname for the spark-app consumer group:
set /consumers/spark-app/offsets/topicname/0 10
When a consumer starts to consume messages from Kafka, it always starts from the last committed offset. If this last committed offset is not valid for any reason, the consumer applies the logic defined by the auto.offset.reset configuration property.
Hope this helps.