I use Spring Kafka to process messages in batches and insert them into a database, using a ConcurrentKafkaListenerContainerFactory. When an error occurs because of a bad message, I send that message to another topic.
If the connection fails or times out, I roll back both the database transaction and the producer transaction to prevent false positives.
But I don't understand the assignmentCommitOption option: how does it work, and what is the difference between ALWAYS, NEVER, LATEST_ONLY, and LATEST_ONLY_NO_TX?
If there is no current committed offset for a partition that is assigned, this option controls whether or not to commit an initial offset during the assignment.
It is really only useful when using auto.offset.reset=latest.
Consider this scenario.
Application comes up and is assigned a "new" partition; the consumer will be positioned at the end of the topic.
No records are received from that topic/partition and the application is stopped.
A record is then published to the topic/partition and the consumer application restarted.
Since there is still no committed offset for the partition, it will again be positioned at the end and we won't receive the published record.
This may be what you want, but it may not be.
Setting the option to ALWAYS, LATEST_ONLY, or LATEST_ONLY_NO_TX (default) will cause the initial position to be committed during assignment so the published record will be received.
The _NO_TX variant commits the offset via the Consumer; the others commit it via a transactional producer.
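For example, a minimal sketch of setting the option on the container factory (the consumer factory wiring is assumed to exist elsewhere in your configuration):

import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // Commit the initial position during assignment only when there is no
    // committed offset and auto.offset.reset=latest; the _NO_TX variant
    // commits via the consumer rather than a transactional producer.
    factory.getContainerProperties().setAssignmentCommitOption(
            ContainerProperties.AssignmentCommitOption.LATEST_ONLY_NO_TX);
    return factory;
}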
I am using Spring Kafka and have a requirement where I have to listen to a DLQ topic and put the message onto another topic after a few minutes. Here I acknowledge a message only when it has been put onto the other topic; otherwise I do not commit it, and I call kafkaListenerEndpointRegistry.stop(), which stops my Kafka consumer. A scheduled cron job then runs every 3 minutes and starts the consumer with kafkaListenerEndpointRegistry.start(); since auto.offset.reset is set to earliest, the consumer gets all messages from the previously uncommitted offset and checks their eligibility to be put onto the other topic.
This approach works fine for small volumes, but for very large volumes I am not seeing the expected retries in both topics, so I suspect this might be happening because I use kafkaListenerEndpointRegistry.stop() to stop the consumer. If I were able to seek to the beginning offset of each partition and get all messages from the uncommitted offset, I wouldn't have to stop and start my consumer.
For this, I tried ConsumerSeekAware.onPartitionsAssigned, calling callback.seekToBeginning() to reset the offsets. But it seems this also consumes all committed offsets, which puts a huge load on my services. Is there anything I am missing, or does seekToBeginning always read all messages (committed and uncommitted)?
Also, is there any way to trigger partition assignment manually while the Kafka consumer is running, so that it goes through the onPartitionsAssigned method?
since auto.offset.reset is set to earliest, the consumer gets all messages from the previously uncommitted offset
auto.offset.reset is meaningless if there is a committed offset; it only determines the behavior when there is no committed offset.
does seekToBeginning always read all messages (committed and uncommitted)?
Kafka maintains two pointers for a consumer: the current position and the committed offset. seek has nothing to do with the committed offset; seekToBeginning just moves the position to the earliest record, so the next poll will return all records.
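To make the two pointers concrete, a snippet (assuming consumer is an already-assigned KafkaConsumer and tp is a TopicPartition it owns):

// the committed offset: where the group would resume after a restart (may be null)
OffsetAndMetadata committed = consumer.committed(Collections.singleton(tp)).get(tp);
// the position: where the next poll() will read from in this session
long position = consumer.position(tp);
// seekToBeginning moves only the position; the committed offset is untouched,
// so the next poll() returns all retained records, committed or not
consumer.seekToBeginning(Collections.singleton(tp));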
This approach works fine for small volumes, but for very large volumes I am not seeing the expected retries in both topics, so I suspect this might be happening because I use kafkaListenerEndpointRegistry.stop() to stop the consumer.
That should not be a problem; you might want to consider using a container-stopping error handler instead: throw an exception and the container will stop itself (you should also set the stopImmediate container property).
https://docs.spring.io/spring-kafka/docs/current/reference/html/#container-stopping-error-handlers
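A minimal sketch of the wiring (assuming Spring Kafka 2.8+, where the handler is CommonContainerStoppingErrorHandler; earlier versions have ContainerStoppingErrorHandler):

// stop the container when the listener throws, instead of stopping it in application code
factory.setCommonErrorHandler(new CommonContainerStoppingErrorHandler());
// stop right away rather than after processing the remaining records from the poll
factory.getContainerProperties().setStopImmediate(true);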
I have a Kafka wrapper library that uses transactions on the produce side only; the library does not cover the consumer. The producer publishes to multiple topics. The goal is transactionality: the produce should either succeed, meaning exactly one copy of the message is written to each topic, or fail, meaning the message is written to no topic. The users of the library are applications running on Kubernetes pods, so the pods can fail or restart frequently. Also, the partition is not set explicitly when sending a message.
My question is: how should I choose the transactional.id for producers? My first idea is to simply generate a UUID at object initialization, and also set transaction.timeout.ms to some reasonable value (a few seconds). That way, if a producer is terminated by a pod restart, consumers don't get locked on the transaction forever.
Are there any flaws in this strategy? Is there a smarter way to do this? Also, I cannot ask the library user for some kind of id.
A UUID can be used in your library to generate a transaction id for your producers. I am not really sure what you mean by: "That way, if a producer gets terminated due to pod restart, the consumers don't get locked on the transaction forever."
The consumer is never really "stuck". Say the producer goes down after writing a message to one topic (so the transaction is not yet committed); the consumer will then behave in one of the following ways:
If isolation.level is set to read_committed, the consumer will never process the message (since the message is not committed). It will still read the next committed message that comes along.
If isolation.level is set to read_uncommitted, the message will be read and processed (defeating the purpose of the transaction in the first place).
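For the transactional.id itself, a minimal sketch of what the library side could look like (the "my-lib-" prefix is made up for illustration):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// a fresh transactional.id per producer instance, as proposed in the question
props.put("transactional.id", "my-lib-" + UUID.randomUUID());
// abort dangling transactions quickly if a pod dies mid-transaction
props.put("transaction.timeout.ms", "10000");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();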
Overview: I have a lambda set up with a Kafka consumer that polls messages from a Kafka cluster and indexes them into an Elasticsearch domain. The lambda is invoked every minute, which means the client id of the Kafka consumer changes, but the Kafka group id always remains the same. One key thing to note is that the consumer has auto-commit disabled and that we manually commit the offsets once the messages have been successfully indexed. In the case of unsuccessful indexing, our code just catches the error, logs it, and does not commit those messages.
However, I've noticed that when indexing fails, while the consumer does not commit the unindexed messages, it doesn't see those messages on subsequent polls either. Any messages published after the unindexed messages occupy the offsets of the uncommitted messages.
To recreate this scenario, I ran a small test and saved the logs:
https://pastebin.com/6aDzbvD4
From the logs, you can see that one record of size 415 KB was polled at 16:06 and failed to be indexed into ES. On subsequent polls after 16:06, the uncommitted message is nowhere to be seen. At 16:11, I published another message, of size 3 KB, to Kafka, which was successfully polled by the consumer, albeit at the same offset as the previous uncommitted message. This new message was indexed and committed successfully.
I'm new to Kafka and am trying to understand why this could happen. I've also set the config value auto.offset.reset to "earliest", so the consumer should be able to see the earliest uncommitted messages. I'd appreciate any help with this!
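For reference, my poll/commit pattern looks roughly like this (a sketch, not the actual lambda code; indexToElasticsearch stands in for the real indexing call):

ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
try {
    indexToElasticsearch(records);  // placeholder for the ES indexing call
    consumer.commitSync();          // commit offsets only after successful indexing
} catch (Exception e) {
    log.error("Indexing failed; offsets were not committed", e);
}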
Issue we were facing:
In our system, we log a ticket in the database with status NEW and also put it on a Kafka queue for further processing. Processors pick those tickets from the Kafka queue, process them, and update the status accordingly. We found that some tickets are left in the NEW state forever, so we were wondering whether the tickets fail to be produced to the queue or fail to be consumed.
Message loss / duplication scenarios (and some other related points):
So I started digging exhaustively into all the ways we can face message loss and duplication in Kafka. Below I have listed, in this post, all the possible message loss and duplication scenarios I could find:
How data loss can occur in different approaches to handle all replicas down
Handle by waiting for the leader to come back online
Messages sent between the time all replicas went down and the leader coming back online are lost.
Handle by electing a new broker as leader once it comes online
If the new broker is out of sync with the previous leader, all data written between the time that broker went down and when it was elected the new leader will be lost. As additional brokers come back up, they will see that they have committed messages that do not exist on the new leader and drop those messages.
How data loss can occur when leader goes down, while other replicas may be up
In this case, the Kafka controller will detect the loss of the leader and elect a new leader from the pool of in sync replicas. This may take a few seconds and result in LeaderNotAvailable errors from the client. However, no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.
When a consumer may fail to consume a message
If Kafka is configured to keep messages for a day and a consumer is down for longer than a day, the consumer will lose messages.
Evaluating different approaches to consumer consistency
A message might not be processed when the consumer is configured to receive each message at most once.
A message might be duplicated / processed twice when the consumer is configured to receive each message at least once.
No message is processed multiple times or left unprocessed if the consumer is configured to receive each message exactly once.
Kafka provides the guarantees below as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers.
Kafka makes the following guarantees about data consistency and availability:
Messages sent to a topic partition will be appended to the commit log in the order they are sent,
a single consumer instance will see messages in the order they appear in the log,
a message is ‘committed’ when all in sync replicas have applied it to their log, and
any committed message will not be lost, as long as at least one in sync replica is alive.
Approach I came up with:
After reading several articles, I felt I should do the following:
If a message is not enqueued, the producer should resend it
For this, the producer should listen for an acknowledgement of each message sent. If no acknowledgement is received, it can retry sending the message.
Producer should be async with callback:
As explained in the last example here.
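A minimal sketch of the async send with a callback (the retry handling here is illustrative; with retries enabled, the client also retries transient failures on its own):

producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        // no acknowledgement: log and resend, or route the record to a retry queue
        log.error("Send failed for record {}", record, exception);
    } else {
        log.debug("Acked at partition {}, offset {}", metadata.partition(), metadata.offset());
    }
});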
How to avoid duplicates in case the producer retries sending
To avoid duplicates in the queue, set enable.idempotence=true in the producer configs. This makes the producer ensure that exactly one copy of each message is written. It requires the following properties on the producer (a combined sketch follows this list):
max.in.flight.requests.per.connection<=5
retries>0
acks=all (an ack is obtained only once all in-sync replicas have committed the message)
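Put together, a sketch of the idempotent producer configuration:

props.put("enable.idempotence", "true");
// implied by idempotence in recent client versions; shown here for clarity
props.put("acks", "all");
props.put("retries", Integer.MAX_VALUE);
props.put("max.in.flight.requests.per.connection", 5);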
Producer should be transactional
As explained here.
Set the transactional id to a unique id:
producerProps.put("transactional.id", "prod-1");
Because we've enabled idempotence, Kafka will use this transaction id as part of its algorithm to deduplicate any message this producer sends, ensuring idempotency.
Use transaction semantics: init, begin, commit, close
As explained here:
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record1);
    producer.send(record2);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    // another producer with the same transactional.id is active; this instance must close
    producer.close();
} catch (KafkaException e) {
    // any other send/commit error: abort so the records never become visible to read_committed consumers
    producer.abortTransaction();
}
Consumer should be transactional
consumerProps.put("isolation.level", "read_committed");
This ensures that the consumer doesn't read any transactional messages before the transaction completes.
Manually commit offsets in the consumer
As explained here
Process records and save offsets atomically
Say by atomically saving both the record-processing output and the offset to a database. For this, we need to set auto-commit on the database connection to false and commit manually after persisting both the processing output and the offset. This also requires setting enable.auto.commit to false on the consumer.
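A sketch of the atomic save with plain JDBC (the table names and SQL are made up for illustration; assume a surrounding method that declares SQLException):

connection.setAutoCommit(false);
try (PreparedStatement saveResult = connection.prepareStatement(
         "INSERT INTO processed_results(msg_key, msg_value) VALUES (?, ?)");
     PreparedStatement saveOffset = connection.prepareStatement(
         "UPDATE consumer_offsets SET next_offset = ? WHERE topic = ? AND part = ?")) {
    saveResult.setString(1, record.key());
    saveResult.setString(2, process(record.value()));  // placeholder processing
    saveResult.executeUpdate();
    saveOffset.setLong(1, record.offset() + 1);        // the next offset to read
    saveOffset.setString(2, record.topic());
    saveOffset.setInt(3, record.partition());
    saveOffset.executeUpdate();
    connection.commit();  // output and offset become durable atomically
} catch (SQLException e) {
    connection.rollback();  // undo both writes together
    throw e;
}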
Read the initial offset (say, to resume after a recovery or restart) from the database
Seek the consumer to this offset and then read from that position.
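And a sketch of restoring the position from the database on (re)assignment (the "tickets" topic name is made up and loadOffsetFromDb is a hypothetical helper):

consumer.subscribe(Collections.singleton("tickets"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // offsets are already persisted per record, nothing extra to do here
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        for (TopicPartition tp : partitions) {
            consumer.seek(tp, loadOffsetFromDb(tp));  // hypothetical DB lookup
        }
    }
});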
Doubts I have:
(Some of these doubts might be basic and resolvable just by writing code, but I want word from experienced Kafka developers.)
Does the consumer need to read the offset from the database only for the initial read (i.e., the first read after consumer recovery), or for all reads? I feel it needs to read the offset from the database only on restarts, as explained here.
Do we have to opt for manual partitioning? Does this approach work only with auto partitioning off? I have this doubt because this example explains storing offsets in MySQL by specifying partitions explicitly.
Do we need both producer-side Kafka transactions and consumer-side database transactions (for storing offsets and processing records atomically)? I feel that for producer idempotence we need the producer to have a unique transaction id, and for that we need to use the Kafka transactional API (init, begin, commit). As a counterpart, the consumer also needs to set isolation.level to read_committed. However, can we ensure no message loss and no duplicate processing without using Kafka transactions? Or are they absolutely necessary?
Should we persist the offset to an external DB, as explained above and here,
or send the offset to the transaction, as explained here (also, I didn't get what exactly it means to send an offset to a transaction),
or follow the sync-async commit combo explained here?
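(From the docs, "sending offsets to the transaction" appears to mean that the producer commits the consumer's offsets as part of its own transaction, so produced records and consumed offsets become visible atomically, roughly like this sketch, where offsetsToCommit is a hypothetical helper:)

producer.beginTransaction();
for (ConsumerRecord<String, String> record : records) {
    producer.send(process(record));  // placeholder processing
}
// the consumed offsets are committed inside the same transaction as the sends
producer.sendOffsetsToTransaction(offsetsToCommit(records), consumer.groupMetadata());
producer.commitTransaction();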
I feel message loss / duplication scenarios 1 and 2 are handled by points 1 to 4 of the approach I explained above.
I feel message loss / duplication scenario 3 is handled by point 6 of the approach I explained above.
How do we implement the different consumer-consistency approaches stated in message loss / duplication scenario 4? Is there any configuration for this, or does it need custom logic inside the consumer?
Message loss / duplication scenario 5 says: "Kafka provides these guarantees as long as you are producing to one partition and consuming from one partition." Is this something to worry about while building a correct system?
Is any consideration in the approach I came up with above unnecessary or redundant? Did I miss any necessary consideration? Did I miss any message loss / duplication scenarios?
Is there any other standard / recommended / preferable approach to ensure no message loss and no duplicate processing than what I described above?
Do I have to actually code the above approach using the Kafka APIs? Or is there a higher-level API built on top of the Kafka API that makes it easy to ensure no message loss and no duplicate processing?
Looking at the issue we were facing (as stated at the very beginning), we were wondering whether we could recover lost/unprocessed messages from the files in which Kafka stores messages. However, that isn't correct, right?
(I'm extremely sorry for such an exhaustive post, but I wanted to write a question that asks all the related questions in one place, allowing the big picture of how to build a system around Kafka to emerge.)
I am trying to implement a simple Producer --> Kafka --> Consumer application in Java. I can produce as well as consume messages successfully, but the problem occurs when I restart the consumer: some of the already-consumed messages get picked up again by the consumer from Kafka (not all of the messages, but a few of the last consumed ones).
I have set autooffset.reset=largest in my consumer, and my autocommit.interval.ms property is set to 1000 milliseconds.
Is this "redelivery of some already consumed messages" a known problem, or is there some other setting I am missing?
Basically, is there a way to ensure that none of the previously consumed messages are picked up/consumed again by the consumer?
Kafka uses ZooKeeper to store consumer offsets. Since ZooKeeper operations are pretty slow, it's not advisable to commit the offset after consuming every message.
It's possible to add a shutdown hook to the consumer that manually commits the topic offset before exit. However, this won't help in certain situations (like a JVM crash or kill -9). To guard against those situations, I'd advise implementing custom commit logic that commits the offset locally after processing each message (to a file or local database) and also commits the offset to ZooKeeper every 1000 ms. On consumer startup, both of these locations should be queried, and the maximum of the two values used as the consumption offset.
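Translated to the modern consumer API, the dual-commit idea looks roughly like this (a sketch; readLocalOffset/writeLocalOffset are hypothetical helpers over a local file or embedded database):

consumer.assign(Collections.singleton(tp));  // tp: the TopicPartition this instance owns

// on startup: resume from the later of the two stored offsets
OffsetAndMetadata broker = consumer.committed(Collections.singleton(tp)).get(tp);
long resumeFrom = Math.max(readLocalOffset(tp), broker == null ? 0L : broker.offset());
consumer.seek(tp, resumeFrom);

long lastCommit = System.currentTimeMillis();
while (true) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
        process(record);                            // placeholder processing
        writeLocalOffset(tp, record.offset() + 1);  // local commit after every message
    }
    if (System.currentTimeMillis() - lastCommit >= 1000) {
        consumer.commitSync();                      // broker-side commit every 1000 ms
        lastCommit = System.currentTimeMillis();
    }
}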