We are using Kafka 0.10... I'm seeing some conflicting information online (and in the documentation) regarding how offsets are managed in Kafka when enable.auto.commit is true. Does the same poll() method that retrieves messages also handle the commits at the configured intervals?
If I retrieve messages from poll() in a single-threaded application and process the messages to completion (including handling errors) in the SAME thread, meaning poll() will not be invoked again until after my processing is complete, then I presume there is no risk of losing messages, correct? This only works if poll() attempts the commit at the subsequent invocation (if auto.commit.interval.ms has passed, of course). If the commits are done immediately upon receiving the messages (prior to my app processing them), this will not work for us.
This is important, as I want to be certain we won't lose messages if we use the automatic commit policy. Duplicate messages are tolerable for us, we just have no tolerance for lost data.
Thanks for the clarification!
Does the same poll() method that retrieves messages also handle the commits at the configured intervals?
Yes. (If enable.auto.commit=true.)
If I retrieve messages from poll() in a single-threaded application and process the messages to completion (including handling errors) in the SAME thread, meaning poll() will not be invoked again until after my processing is complete, then I presume there is no risk of losing messages, correct?
Yes.
This only works if poll() attempts the commit at the subsequent invocation (if the auto.commit.interval.ms has passed, of course)
This is exactly how it is done.
See here for further details: http://docs.confluent.io/current/clients/consumer.html
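To make the timing concrete, here is a minimal toy model of the behaviour described above. It is not the real KafkaConsumer; the class and method names are made up for illustration, and it only assumes what the answer states: the commit is attempted inside poll(), and it covers the records handed out by earlier polls.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a consumer with enable.auto.commit=true (hypothetical, not the Kafka API).
class AutoCommitModel {
    private final int logSize;      // number of records in the partition
    private final long intervalMs;  // plays the role of auto.commit.interval.ms
    private long position = 0;      // next offset that poll() will hand out
    private long committed = 0;     // last committed offset
    private long lastCommitMs = 0;  // time of the last commit

    AutoCommitModel(int logSize, long intervalMs) {
        this.logSize = logSize;
        this.intervalMs = intervalMs;
    }

    // Commits first (only what earlier polls handed out), then fetches new records.
    List<Long> poll(long nowMs, int maxRecords) {
        if (nowMs - lastCommitMs >= intervalMs) {
            committed = position;
            lastCommitMs = nowMs;
        }
        List<Long> batch = new ArrayList<>();
        while (batch.size() < maxRecords && position < logSize) {
            batch.add(position++);
        }
        return batch;
    }

    long committed() {
        return committed;
    }
}
```

In this model, a crash between two polls leaves the committed offset behind the records currently being processed, so they are delivered again after a restart (duplicates, but no loss).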
In the documentation:
BATCH: Commit the offset when all the records returned by the poll()
have been processed.
MANUAL: The message listener is responsible to acknowledge() the
Acknowledgment. After that, the same semantics as BATCH are applied.
If the offset is committed when all the records returned by the poll() have been processed in both cases, then I don't get the difference. Can you give me a scenario where MANUAL ack mode is used differently?
If I use MANUAL mode and don't call acknowledge() within my KafkaListener, would that be the same as BATCH mode? And if I call acknowledge(), what would change?
Maybe I don't get the difference between the notions of commit and acknowledge in Spring Kafka.
In a perfect world, where your application is always up, you don't need those commits at all, because the Kafka consumer keeps track of the offset internally between poll calls. There may also be cases where you don't need to commit on every single batch delivered to you. That's where MANUAL comes to the rescue. With BATCH mode you have no control, and the framework performs the commit for you anyway. With MANUAL you may decide to commit now or later, for example after a couple of batches have been processed.
It is called acknowledge because we might not perform a commit immediately, but rather store it in memory for the subsequent poll cycle. The commit must be performed on the Kafka consumer thread.
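The split between acknowledging and committing can be sketched with a small toy model. This is not Spring Kafka's implementation; the class and method names are hypothetical, and it only assumes what is said above: acknowledge() records the intent in memory, and the commit itself happens later on the consumer thread.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy model of deferred acknowledgment (hypothetical names, not the Spring Kafka API).
class DeferredAckModel {
    private final ConcurrentLinkedQueue<Long> pendingAcks = new ConcurrentLinkedQueue<>();
    private long committed = -1; // highest offset committed so far, -1 means none

    // May be called from any thread: only remembers the offset, commits nothing.
    void acknowledge(long offset) {
        pendingAcks.add(offset);
    }

    // Called only on the consumer thread, once per poll cycle: performs the commit.
    void onPollCycle() {
        Long offset;
        while ((offset = pendingAcks.poll()) != null) {
            committed = Math.max(committed, offset);
        }
    }

    long committed() {
        return committed;
    }
}
```

Under these assumptions, never calling acknowledge() means nothing is ever queued, so nothing is committed, which is where MANUAL differs from BATCH.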
In my Kafka Streams application I have a single processor that is scheduled to produce output messages every 60 seconds. The output message is built from messages that come from a single input topic. Sometimes the output message is bigger than the configured limit on the broker (1MB by default). An exception is thrown and the application shuts down. Commit interval is set to default (60s).
In such case I would expect that on the next run all messages that were consumed during those 60s preceding the crash would be re-consumed. But in reality the offset of those messages is committed and the messages are not processed again on the next run.
Reading answers to similar questions it seems to me that the offset should not be committed. When I increase commit interval to 120s (processor still punctuates every 60s) then it works as expected and the offset is not committed.
I am using default processing guarantee but I have also tried exactly_once. Both have the same result. Calling context.commit() from processor seems to have no effect on the issue.
Am I doing something wrong here?
The contract of a Processor in Kafka Streams is that you have fully processed an input record and called forward() for all corresponding output messages before process() returns. This contract implies that Kafka Streams is allowed to commit the corresponding offset after process() returns.
It seems you "buffer" messages in memory within process() to emit them later. This violates the contract. If you want to "buffer" messages, you should attach a state store to the Processor and put all those messages into the store (cf. https://kafka.apache.org/25/documentation/streams/developer-guide/processor-api.html#state-stores). The store is managed by Kafka Streams for you and it's fault-tolerant. This way, after an error the state will be recovered and you don't lose any data (even if the input messages are not reprocessed).
I doubt that setting the commit interval to 120 seconds actually works as expected for all cases, because there is no alignment between when a commit happens and when punctuation is called.
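The effect of the contract can be illustrated with a toy model. It does not use the Kafka Streams API; every name in it is hypothetical, and the "store" is just a list that is assumed to survive a restart the way a fault-tolerant state store would.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a buffering processor that crashes before it can emit.
// It is NOT the Kafka Streams API; all names here are hypothetical.
class BufferingCrashModel {

    // Returns the records still available for emission after a crash and restart.
    static List<String> recoverAfterCrash(List<String> input, boolean bufferInStore) {
        List<String> store = new ArrayList<>();  // stands in for a fault-tolerant state store
        List<String> memory = new ArrayList<>(); // plain field inside the processor
        int committedOffset = 0;

        for (String record : input) {
            if (bufferInStore) {
                store.add(record);
            } else {
                memory.add(record);
            }
            committedOffset++; // process() returned, so the offset may be committed
        }

        // Crash before punctuation: the heap is lost, the store and the offset survive.
        memory = new ArrayList<>();

        // Restart: resume from the committed offset, so nothing is re-read here.
        for (int i = committedOffset; i < input.size(); i++) {
            memory.add(input.get(i));
        }
        return bufferInStore ? store : memory;
    }
}
```

With the in-memory buffer the restarted processor has nothing left to emit, while the store-backed variant still holds every buffered record.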
Some of this will depend on the client you are using and whether it's based on librdkafka.
Some of the answer will also depend on how you are "looping" over the "poll" method. A typical example will look like the code under "Automatic Offset Committing" at https://kafka.apache.org/23/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
But this assumes quite a rapid poll loop (100ms + processing time) and an auto.commit.interval.ms of 1000ms (the default is usually 5000ms).
If I read your question correctly, you seem to be consuming messages once per 60 seconds?
Something to be aware of is that the behavior of a Kafka client is quite tied to how frequently poll is called (some libraries wrap poll inside something like a "Consume" method). Calling poll frequently is important in order to appear "alive" to the broker. You will get other exceptions if you do not poll at least every max.poll.interval.ms (default 5 min). It can lead to clients being kicked out of their consumer groups.
Anyway, to the point... auto.commit.interval.ms is just a maximum. If a message has been accepted/acknowledged or StoreOffset has been used, then, on poll, the client can decide to update the offset on the broker, maybe because a client-side buffer size was hit or because of some other semantic.
Another thing to look at (especially if using a librdkafka-based client; others have something similar) is enable.auto.offset.store (default true). This will "Automatically store offset of last message provided to application", so every time you poll/consume a message from the client it will StoreOffset. If you also use auto commit, your offset may move in ways you might not expect.
See https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md for the full set of config for librdkafka.
There are many, many ways of consuming/acknowledging. I think for your case, the comment for max.poll.interval.ms on the config page might be relevant.
"Note: It is recommended to set enable.auto.offset.store=false for long-time processing applications and then explicitly store offsets (using offsets_store()) after message processing"
Sorry that this "answer" is a bit long winded. I hope there are some threads for you to pull on.
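As a rough illustration of the settings discussed in this answer, a librdkafka-based consumer that processes slowly might be configured along these lines. The property names come from the librdkafka configuration page linked above; the values are only illustrative and should be adapted to your processing time.

```
# librdkafka consumer properties (values are illustrative, not recommendations)
enable.auto.commit=true
auto.commit.interval.ms=5000
# do not store the offset automatically when a message is handed to the application
enable.auto.offset.store=false
# maximum allowed gap between two poll/consume calls
max.poll.interval.ms=300000
```

With enable.auto.offset.store=false the application has to store each offset itself after processing; the exact name of that call depends on the language binding you use.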
I read the docs on using the pause and resume methods for a Kafka consumer, and they seem easy enough to implement. However, do I need another thread to continue calling the poll() method while paused, to meet the heartbeat requirements and not trigger a rebalance?
My consumer runs SQL scripts after polling the topic, and depending on the messages returned, the scripts may take longer than the current session.timeout.ms interval (we have increased this value, but the time the scripts take to run can vary quite a bit, and regardless of the interval we will exceed it at times). I also want to avoid a rebalance, as safe ordering and data integrity are more important than throughput and error detection.
From version 0.10.1.0, the heartbeat is sent via a separate thread, so pausing your processing thread wouldn't affect the heartbeat thread.
You can check this for more information.
Yes, you need to continue calling poll() on the consumer, even if you pause all partitions, or it will be kicked out of any consumer group it's a member of and its assigned partitions will transfer to another consumer. As to which thread ends up calling poll(), that doesn't matter (so long as only a single thread interacts with the consumer at a time).
Quoting from KIP-62:
max.poll.interval.ms. This config sets the maximum delay between client calls to poll(). When the timeout expires, the consumer will stop sending heartbeats and send an explicit LeaveGroup request.
In the Kafka consumer documentation https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html it states that care needs to be taken to make sure poll is called every so often, or the broker will assume the consumer is dead.
The most reliable procedure was pretty complicated:
For use cases where message processing time varies unpredictably,
neither of these options may be sufficient. The recommended way to
handle these cases is to move message processing to another thread,
which allows the consumer to continue calling poll while the processor
is still working. Some care must be taken to ensure that committed
offsets do not get ahead of the actual position. Typically, you must
disable automatic commits and manually commit processed offsets for
records only after the thread has finished handling them (depending on
the delivery semantics you need). Note also that you will need to
pause the partition so that no new records are received from poll
until after thread has finished handling those previously returned.
Does spring kafka handle this for me under the hood?
The heartbeat is mentioned only very briefly in the documentation. Apparently the heartbeat is handled on a different thread, by the underlying Kafka client rather than by Spring Kafka itself.
Since version 0.10.1.0 heartbeats are sent on a background thread
You can also read this github issue to read more about the heartbeat.
What would be the better approach when implementing a Kafka consumer?
The objective is to read from Kafka and write back to a DB. Millions of rows.
Approach 1:
Per partition, per consumer: wait for the message to be consumed (i.e. written back to the DB), then proceed to the next one in the polling loop.
Approach 2:
Per partition, per consumer: send the record to a worker thread or thread pool to be written back to the DB, commit the offset later on, and keep on polling. Offset management needs to be taken care of. Here we don't wait for the message to be written back to the DB; we just keep on polling and pass each message to a worker thread.
Any insights on both of them?
Thanks
Approach 1:
The approach is applicable only if you can estimate the message processing time; otherwise it is not recommended.
Problem: The main problem with this approach is keeping the consumer alive. If you wait for the messages to be completely processed before calling poll() again, you have to make sure the consumer is still considered alive when it next calls poll(), because Kafka maintains a property named "session.timeout.ms" and the broker acts on its value. In clients before 0.10.1.0, where heartbeats are only sent from within poll(), a consumer that does not call poll() again within "session.timeout.ms" is marked dead by the broker and kicked out of the group. When the consumer then finishes processing and calls poll() again, it is treated as a new member and is given the records again, starting from the same offset as before. In this scenario the consumer can get stuck in an infinite loop in which it never advances its offset.
Possible solution 1: To use this approach you need a good value for the property "session.timeout.ms", with the following side effects:
1: Value too low: The consumer will be marked dead as described above and will never advance its offset. Messages will still be processed, but every time it finishes a batch it will receive the previous messages plus any new ones again.
2: Value too high: The broker will be very late in detecting a genuine consumer failure, which can result in record duplication and will affect the overall throughput.
Possible solution 2: (Only valid for version 0.10.1.0 and later) Official fix by Kafka in release 0.10.1.0.
This release introduced two notable things: a new property "max.poll.interval.ms" that sets the maximum delay between client calls to poll(), and a background thread that is responsible for keeping the consumer alive. So when the consumer calls poll() and then gets busy processing messages, the internal background thread keeps sending heartbeats and the consumer stays alive. However, the background thread itself only keeps going while the "max.poll.interval.ms" timeout has not expired. It waits for the consumer to call poll() within "max.poll.interval.ms"; if that does not happen, it sends a leave request and stops.
Again, the tricky part of this solution is finding a suitable value for "max.poll.interval.ms". This is very important: it is the time for which the background thread will keep the heartbeat alive without an explicit call to poll().
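The two regimes described for Approach 1 can be summarised in a small decision function. This is a deliberately simplified sketch with hypothetical names: it ignores real failures of the consumer process and only models how long the consumer may stay away from poll() before it is removed from the group.

```java
// Simplified model of consumer liveness (hypothetical names, not a Kafka API).
class ConsumerLivenessModel {

    // Returns true if the consumer is still a group member after spending
    // processingMs between two consecutive poll() calls.
    static boolean staysInGroup(boolean backgroundHeartbeat, long processingMs,
                                long sessionTimeoutMs, long maxPollIntervalMs) {
        if (backgroundHeartbeat) {
            // 0.10.1.0 and later: heartbeats continue during processing,
            // so only the gap between polls is checked.
            return processingMs <= maxPollIntervalMs;
        }
        // Older clients: heartbeats are only sent from poll(), so a long
        // processing step looks like a dead consumer.
        return processingMs <= sessionTimeoutMs;
    }
}
```

For example, 60 seconds of processing with a 30 second session timeout removes an old client from the group, but a newer client survives as long as max.poll.interval.ms is larger than 60 seconds.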
Approach 2: Using a worker thread is a good idea, but then you have to maintain an internal queue or some validation for received messages, which can be complex, and you also need to use manual commits instead of auto commits. For more information about commits see this and search for the heading "Commits and Offsets".
Problem: The main problem with this approach is keeping track of which messages were received and which were processed successfully. When your consumer receives a message, it passes the message to a worker thread, commits the offset, and moves on to receive more messages. During this process you have to take care of the following issues:
What if the message is received and the offset committed, but the worker thread later fails to process the message for whatever reason? How do you get that message again?
What if messages are received by the consumer but there are no free worker threads to process them?
Solution: There are different ways to resolve the above issues. One is to use an internal queue to hold the messages, together with manual commits that are sent only when a worker thread reports successful processing of a message. However, a very careful implementation is required, because this can lead to complex code and can also result in memory management or threading issues.
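One possible building block for such manual commits is a small tracker that only reports an offset as committable once every earlier record has been processed. The class below is a hypothetical helper, not part of any Kafka library, and it assumes the usual convention that the committed offset is the next offset to read.

```java
import java.util.TreeSet;

// Hypothetical helper: tracks which offsets of one partition the workers have finished.
class OffsetTracker {
    private final TreeSet<Long> done = new TreeSet<>();
    private long nextToCommit; // first offset not yet known to be processed

    OffsetTracker(long startOffset) {
        this.nextToCommit = startOffset;
    }

    // Called by a worker thread after it has successfully processed a record.
    synchronized void markProcessed(long offset) {
        if (offset >= nextToCommit) {
            done.add(offset);
        }
    }

    // Offset that is safe to commit: every record below it has been processed.
    synchronized long committableOffset() {
        while (done.remove(nextToCommit)) {
            nextToCommit++;
        }
        return nextToCommit;
    }
}
```

If offsets 1 and 2 finish before offset 0, the committable offset stays at 0 until 0 is done, so a failed or slow record is never skipped by a commit.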
Suggestion: Depending on your requirements, you can use one approach or the other, implementing fixes for the possible issues described above. However, I would recommend a more robust solution: partition pause/resume. In very abstract terms, your consumer should do the following steps:
1: poll() for messages.
2: Pause all the respective topics/partitions.
3: Assign the messages to worker threads and wait for their processing.
4: Keep calling poll(). As the partitions are paused, no extra messages will be received, while the consumer is kept alive. (Make sure no new topic is registered during this point.)
5: Once all worker threads have reported message processing success/failure, commit the offsets accordingly.
6: Resume all the partitions.
Note: There can be better ways or other solutions possible depending upon your scenario and requirements. It's just an idea or one of the possible solutions.
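The six steps can be sketched as a loop against a fake consumer. This is only a simulation under stated assumptions: FakeConsumer and its methods are made up for the example and merely mimic the idea of poll, pause, resume and commit; the real client's methods take partitions and timeouts and have different signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PauseResumeLoopModel {

    // Minimal fake consumer for a single partition (hypothetical, not the Kafka API).
    static final class FakeConsumer {
        private final int logSize;
        private int position = 0;
        private boolean paused = false;
        int committed = 0;
        int pollCalls = 0;

        FakeConsumer(int logSize) { this.logSize = logSize; }

        // Hands out records unless paused; every call counts as a sign of life.
        List<Integer> poll(int maxRecords) {
            pollCalls++;
            List<Integer> batch = new ArrayList<>();
            while (!paused && batch.size() < maxRecords && position < logSize) {
                batch.add(position++);
            }
            return batch;
        }

        void pause() { paused = true; }
        void resume() { paused = false; }
        void commit(int offset) { committed = offset; }
    }

    static FakeConsumer run(int logSize, int batchSize) {
        FakeConsumer consumer = new FakeConsumer(logSize);
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            while (consumer.committed < logSize) {
                List<Integer> batch = consumer.poll(batchSize);          // step 1
                if (batch.isEmpty()) break;
                consumer.pause();                                        // step 2
                Future<?> result = worker.submit(() -> process(batch));  // step 3
                while (!result.isDone()) {                               // step 4
                    consumer.poll(batchSize); // returns nothing while paused
                    Thread.sleep(1);
                }
                result.get(); // rethrows a worker failure before anything is committed
                consumer.commit(batch.get(batch.size() - 1) + 1);        // step 5
                consumer.resume();                                       // step 6
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
        return consumer;
    }

    // Stand-in for slow work such as a database write.
    static void process(List<Integer> batch) {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the commit in step 5 only happens after the worker has finished, a failure in the worker leaves the offset where it was and the batch is delivered again, which matches the at-least-once behaviour the steps aim for.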