Overview: I have a Lambda setup with a Kafka consumer that polls messages from a Kafka cluster and indexes them into an Elasticsearch domain. The Lambda is invoked every minute, which means the client ID of the Kafka consumer changes, but the Kafka group ID always remains the same. One key thing to note is that the consumer has auto-commit disabled and that we manually commit the offsets once the messages have been successfully indexed. In the case of unsuccessful indexing, our code just catches the error, logs it, and does not commit those messages.
However, I've noticed that when indexing fails, although the consumer does not commit the unindexed messages, it doesn't see those messages on subsequent polls either. Any messages published after the unindexed messages occupy the offsets of the uncommitted messages.
To recreate this scenario, I ran a small test and saved the logs:
https://pastebin.com/6aDzbvD4
From the logs, you can see that one record of size 415 KB was polled at 16:06 and failed to get indexed into ES. On subsequent polls after 16:06, the uncommitted message is nowhere to be seen. At 16:11, I published another message of size 3 KB to the Kafka topic, which was successfully polled by the consumer, albeit at the same offset as the previous uncommitted message. This new message was indexed and committed successfully.
I'm new to Kafka and trying to understand why this could happen. I've also set the config value auto.offset.reset to "earliest", so the consumer should be able to see the earliest uncommitted messages. I'd appreciate any help regarding this!
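To make the setup concrete, here is a minimal sketch of the poll/index/commit pattern described above using the plain Java consumer; indexDocument and the error handling are placeholders, not the actual Lambda code:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToEsIndexer {

    // Assumes a consumer created with enable.auto.commit=false, a fixed group.id,
    // and auto.offset.reset=earliest, as described in the question.
    static void pollAndIndex(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
        boolean allIndexed = true;
        for (ConsumerRecord<String, String> record : records) {
            try {
                indexDocument(record.value()); // hypothetical Elasticsearch indexing call
            } catch (Exception e) {
                // Log only and skip the commit. Note that poll() has already advanced the
                // consumer's in-memory position past these records, so this same consumer
                // instance will not see them again unless it seeks back or restarts.
                allIndexed = false;
                System.err.println("Indexing failed at offset " + record.offset() + ": " + e);
            }
        }
        if (allIndexed && !records.isEmpty()) {
            consumer.commitSync(); // commits the offsets returned by the last poll
        }
    }

    static void indexDocument(String json) { /* placeholder */ }
}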
Related
I use Spring Kafka to insert messages into a database, processing them as a batch with a ConcurrentKafkaListenerContainerFactory container. When an error occurs:
For a bad message, I send that message to another topic.
If the connection failed or timed out, I roll back both the database transaction and the producer transaction to prevent a false positive.
What I don't understand is the assignmentCommitOption option: how does it work, and what is the difference between ALWAYS, NEVER, LATEST_ONLY, and LATEST_ONLY_NO_TX?
If there is no current committed offset for a partition that is assigned, this option controls whether or not to commit an initial offset during the assignment.
It is really only useful when using auto.offset.reset=latest.
Consider this scenario.
Application comes up and is assigned a "new" partition; the consumer will be positioned at the end of the topic.
No records are received from that topic/partition and the application is stopped.
A record is then published to the topic/partition and the consumer application restarted.
Since there is still no committed offset for the partition, it will again be positioned at the end and we won't receive the published record.
This may be what you want, but it may not be.
Setting the option to ALWAYS, LATEST_ONLY, or LATEST_ONLY_NO_TX (default) will cause the initial position to be committed during assignment so the published record will be received.
The _NO_TX variant commits the offset via the Consumer; the other commits it via a transactional producer.
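For reference, a minimal sketch of where this option is set on the container factory in Spring Kafka; the consumerFactory bean is assumed to exist and the names are placeholders:

import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // With auto.offset.reset=latest, commit an initial offset on assignment when the
    // group has no committed offset yet, so records published while the group was
    // absent are not skipped on the next start.
    factory.getContainerProperties()
           .setAssignmentCommitOption(ContainerProperties.AssignmentCommitOption.ALWAYS);
    return factory;
}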
I am a beginner in Kafka. I understand that multiple consumers with the same group ID can't consume messages from the same partition in a topic. I am wondering what may happen if multiple Kafka consumers from a consumer group read the same message from a partition, and why it's a bad thing.
Obviously, processing the same record multiple times is almost never intended, but it mostly comes down to offset management.
If multiple consumers in a group read the same message and commit its offset to indicate it has been successfully processed, then the final commit (from the slowest consumer) always wins. Meanwhile, the other consumers would already have moved on to processing other data.
When that happens and any consumer client restarts, it has to rewind to the last committed offset, despite having already processed messages after it.
How does Kafka guarantee that consumers don't read a single message twice?
Or is the above scenario possible?
Could the same message be read twice by single or by multiple consumers?
There are several scenarios that cause a consumer to consume duplicate messages:
The producer published the message successfully but failed to receive the acknowledgment, which causes it to retry the same message.
The producer publishes a batch of messages but the publish partially fails. In that case, it retries and resends the same batch again, which causes duplicates.
Consumers receive a batch of messages from Kafka and manually commit their offsets (enable.auto.commit=false).
If a consumer fails before committing to Kafka, the next time it will consume the same records again, which reproduces duplicates on the consumer side.
To guarantee that duplicate messages are not consumed, the job's execution and the offset commit must be atomic, which gives exactly-once delivery semantics on the consumer side.
You can use the parameters below to achieve exactly-once semantics, but please understand that this comes with a performance trade-off.
Enable idempotence on the producer side, which guarantees that the same message is not published twice:
enable.idempotence=true
Define the transaction isolation level (isolation.level) as read_committed:
isolation.level=read_committed
In Kafka Streams, the above can be achieved by setting the exactly-once processing guarantee (processing.guarantee=exactly_once), which makes the processing a single atomic unit.
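As an illustration, a minimal sketch of these settings with the plain Java clients; the broker address, group ID, and serializer choices are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
// The broker de-duplicates retried writes from this producer per partition.
producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
// Only read records from committed transactions.
consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");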
Idempotent
Idempotent delivery enables the producer to write a message to Kafka exactly once to a particular partition of a topic during the lifetime of a single producer, without data loss and with ordering preserved per partition.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
Unlike a plain fire-and-forget send, the producer wraps its writes with beginTransaction, commitTransaction, and abortTransaction (in case of failure), while the consumer sets isolation.level to either read_committed or read_uncommitted:
read_committed: consumers will only ever read committed data.
read_uncommitted: read all messages in offset order without waiting for transactions to be committed.
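A minimal sketch of a transactional consume-process-produce loop where the consumed offsets are committed inside the producer transaction; the output topic name and the transactional.id mentioned in the comment are placeholders:

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalPipeline {

    // Assumes the producer was created with enable.idempotence=true and a transactional.id,
    // and the consumer with isolation.level=read_committed and enable.auto.commit=false.
    static void consumeTransformProduce(KafkaConsumer<String, String> consumer,
                                        KafkaProducer<String, String> producer) {
        producer.initTransactions();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // The consumed offsets are committed atomically with the produced records.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // neither the records nor the offsets become visible
            }
        }
    }
}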
Please refer to the reference for more details.
It is absolutely possible if you don't make your consuming process idempotent.
For example, say you are implementing at-least-once delivery semantics: you first process messages and then commit offsets. It is possible that you cannot commit the offsets because of a server failure or a rebalance (maybe your consumer is revoked at that time). So the next time you poll, you will get the same messages twice.
To be precise, this is what Kafka guarantees:
Kafka guarantees the order of messages within a partition
Produced messages are considered "committed" when they have been written to the partition on all of its in-sync replicas
Messages that are committed will not be lost as long as at least one replica remains alive
Consumers can only read messages that are committed
Regarding consuming messages, the consumers keep track of their progress in a partition by saving the last offset read in an internal compacted Kafka topic.
Kafka consumers can automatically commit the offset if enable.auto.commit is enabled. However, that will give "at most once" semantics. Hence, usually the flag is disabled and the developer commits the offset explicitly once the processing is complete.
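As an illustration, a minimal sketch of explicit per-record commits with the Java consumer, assuming enable.auto.commit=false and a consumer that is already subscribed; processRecord is a placeholder:

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommitConsumer {

    static void consumeWithManualCommit(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
            processRecord(record); // placeholder for the actual processing
            // Commit the offset of the next record to read, only after processing succeeded.
            consumer.commitSync(Collections.singletonMap(
                    new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1)));
        }
    }

    static void processRecord(ConsumerRecord<String, String> record) { /* placeholder */ }
}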
I'm considering using Apache Kafka and I could not find any information about durable subscriptions. Let's say I have an expiration of 5 seconds for messages in my partition. Now if a consumer fails and reconnects after 5 seconds, the messages it missed will be gone. Even worse, it won't know that it missed a message. The durable subscription pattern solves this by saving the message for the consumer that failed or was disconnected. Is a similar feature implemented in Kafka?
This is not supported by Kafka. But you can of course always increase your retention time, and thus limit the probability that a consumer misses messages.
Furthermore, if you set auto.offset.reset to none you will get an exception that informs you if a consumer misses any messages. Hence, it is possible to get informed if this happens.
Last but not least, it might be possible, to use a compacted topic -- this would ensure, that messages are not deleted until you explicitly write a so-called tombstone message. Note, that records must have unique keys to use a compacted topic.
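For illustration, a sketch of creating a compacted topic with the AdminClient and writing a tombstone (a record with a null value) for a key; the broker address, topic name, and key are placeholders:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.config.TopicConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompactedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(adminProps)) {
            // cleanup.policy=compact keeps the latest record per key instead of deleting by time.
            NewTopic topic = new NewTopic("events-compacted", 3, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // A record with a null value is a tombstone: compaction eventually removes the key.
            producer.send(new ProducerRecord<>("events-compacted", "some-key", null));
        }
    }
}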
I am trying to implement a simple Producer-->Kafka-->Consumer application in Java. I am able to produce as well as consume messages successfully, but the problem occurs when I restart the consumer: some of the already consumed messages are picked up again by the consumer from Kafka (not all messages, but a few of the last consumed ones).
I have set autooffset.reset=largest in my consumer and my autocommit.interval.ms property is set to 1000 milliseconds.
Is this 'redelivery of some already consumed messages' a known problem, or is there any other settings that I am missing here?
Basically, is there a way to ensure none of the previously consumed messages are getting picked up/consumed by the consumer?
Kafka uses Zookeeper to store consumer offsets. Since Zookeeper operations are pretty slow, it's not advisable to commit the offset after consuming every message.
It's possible to add a shutdown hook to the consumer that manually commits the topic offset before exit. However, this won't help in certain situations (like a JVM crash or kill -9). To guard against those situations, I'd advise implementing custom commit logic that commits the offset locally after processing each message (to a file or a local database) and also commits the offset to Zookeeper every 1000 ms. Upon consumer startup, both locations should be queried, and the maximum of the two values should be used as the consumption offset.
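A rough sketch of that idea with the modern Java consumer (which commits offsets to Kafka rather than to Zookeeper directly); the local file path, topic, and partition are placeholders:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class DualOffsetTracking {
    static final Path LOCAL = Path.of("/tmp/offset-my-topic-0"); // placeholder local store

    static void run(KafkaConsumer<String, String> consumer) throws Exception {
        TopicPartition tp = new TopicPartition("my-topic", 0);
        consumer.assign(Collections.singletonList(tp));

        // On startup, use the maximum of the locally stored offset and the broker-side
        // committed offset as the starting position.
        long local = Files.exists(LOCAL)
                ? Long.parseLong(Files.readString(LOCAL, StandardCharsets.UTF_8).trim()) : -1L;
        OffsetAndMetadata committed = consumer.committed(Collections.singleton(tp)).get(tp);
        long start = Math.max(local, committed == null ? -1L : committed.offset());
        if (start >= 0) consumer.seek(tp, start);

        long lastCommit = System.currentTimeMillis();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
            for (ConsumerRecord<String, String> record : records) {
                process(record); // placeholder for real processing
                // Persist the next offset to read locally after every record.
                Files.writeString(LOCAL, Long.toString(record.offset() + 1), StandardCharsets.UTF_8);
            }
            // Commit to the broker roughly every 1000 ms, as the answer suggests for Zookeeper.
            if (System.currentTimeMillis() - lastCommit >= 1000) {
                consumer.commitSync();
                lastCommit = System.currentTimeMillis();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* placeholder */ }
}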