The last messages are not consumed from a kafka topic even though they should be, leaving a constant consumer lag - apache-kafka

I have a very basic Kafka consumer which needs to consume data from a 32-partition topic with a large amount of data on each partition.
It manages to consume most data from that topic, but once we get towards the end of each partition, it does not quite reach the end of it and always keeps a small lag instead of reaching the latest offset for that partition.
Every time I restart my consumer, it consumes from a few of those partitions, reducing the lag to 0, but not all of them.
Here is the smallest consuming code that reproduces this error:
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
    "group.id": "group-id",
})
consumer.subscribe(["topic"])
while True:
    batch = consumer.consume(timeout=1, num_messages=100)
    if batch:
        consumer.commit(batch[-1])

After explicitly setting fetch.min.bytes to 1 to make sure my broker was not holding data back, and refactoring my original code, I noticed that I was only committing the last message received in the batch. For some reason I had subconsciously assumed that all messages in one batch came from the same partition, but I was wrong!
Making sure to commit an offset for every partition that contributed at least one message to the batch fixed my issue:
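# Keep the last message seen for each partition in the batch
# (later messages overwrite earlier ones under the same key).
partitions_to_commit = {m.partition(): m for m in batch}
for message in partitions_to_commit.values():
    consumer.commit(message)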
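As a variation on the fix above, the offsets to commit can be computed per (topic, partition) and then committed together. The following is only a sketch: `FakeMessage` and `next_offsets` are hypothetical names, and `FakeMessage` merely imitates the `topic()`, `partition()` and `offset()` accessors of a confluent_kafka message so that the example runs without a broker.

```python
# Hypothetical stand-in for a confluent_kafka message, so the sketch runs
# without a broker. It only imitates topic(), partition() and offset().
class FakeMessage:
    def __init__(self, topic, partition, offset):
        self._topic, self._partition, self._offset = topic, partition, offset

    def topic(self):
        return self._topic

    def partition(self):
        return self._partition

    def offset(self):
        return self._offset


def next_offsets(batch):
    """Map (topic, partition) -> offset to commit (highest offset seen + 1)."""
    result = {}
    for m in batch:
        key = (m.topic(), m.partition())
        result[key] = max(result.get(key, 0), m.offset() + 1)
    return result


# A batch that mixes two partitions, as consume() may return.
batch = [
    FakeMessage("topic", 0, 10),
    FakeMessage("topic", 1, 7),
    FakeMessage("topic", 0, 11),
]
offsets = next_offsets(batch)
print(offsets)  # {('topic', 0): 12, ('topic', 1): 8}
```

With the real client, these pairs could presumably be turned into `TopicPartition(topic, partition, offset)` objects and passed in a single `consumer.commit(offsets=..., asynchronous=False)` call instead of one commit per partition.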

Related

Multiple consumers with same group id

I am a beginner in Kafka. I understand that multiple consumers with the same group id can't consume messages from the same partition of a topic. I am wondering what would happen if multiple Kafka consumers from a consumer group read the same message from a partition, and why it's a bad thing.
Obviously, processing the same record multiple times is almost never intended, but it mostly comes down to offset management.
If multiple consumers in a group read the same message and commit its offset to indicate that it has been processed successfully, then the final commit (from the slowest consumer) always wins. Meanwhile, the other consumers would have already continued processing other data.
When that happens and any consumer client restarts, it would need to rewind to the last committed offset, despite having already processed messages after it.
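The "last commit wins" effect can be illustrated with a toy in-memory model. This is not the Kafka API; the `commit` function, the group name and the offsets are all made up for the illustration.

```python
# Toy in-memory model (not the Kafka API): one committed offset is stored
# per (group, partition) pair.
committed = {}


def commit(group, partition, offset):
    # Whatever commit arrives last is stored; the maximum is not kept.
    committed[(group, partition)] = offset


# Hypothetical scenario: consumers A and B in the same group both read partition 0.
commit("g", 0, 50)  # fast consumer A has processed offsets up to 49
commit("g", 0, 20)  # slow consumer B commits later, having processed up to 19

# After a restart, consumption resumes from the stored offset.
restart_position = committed[("g", 0)]
print(restart_position)  # 20
```

In this scenario, offsets 20 to 49 would be processed a second time after the restart.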

Kafka to Kafka -> reading source kafka topic multiple times

I am new to Kafka, and I have a configuration with a source Kafka topic whose messages have the default retention of 7 days. I have 3 brokers, and the topic has 1 partition and a replication factor of 1.
When I consume messages from the source Kafka topic and write them to my target Kafka topic, the messages arrive in the same order. Now my question: when I try to reprocess all the messages from my source Kafka into my target Kafka, the target does not receive any messages. I know that duplication should be avoided, but let's say I have 100 messages in my source Kafka and I expect 200 messages in my target Kafka after running the job twice. Instead, I get 100 messages on the first run, and the second run returns nothing.
Can someone please explain why this happens and what the mechanism behind it is?
A Kafka consumer reads data from the partitions of a topic. Within a consumer group, each partition is read by only one consumer at a time.
Once a consumer group has read a message and moved its offset past it, the group does not receive that message again unless it seeks back or its offsets are reset. Let me first explain the current offset. When we call the poll method, Kafka sends us some messages. Let us assume we have 100 records in the partition. The initial position of the current offset is 0. We make our first call and receive 100 messages. Now Kafka moves the current offset to 100.
The current offset points just past the last record that Kafka has already sent to the consumer in the most recent poll; once committed, it is also where the group resumes on its next run. So, the consumer doesn't get the same record twice because of the current offset. Please go through the following URL for a complete understanding.
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/
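To make the effect of the committed offset concrete, here is a toy in-memory model of the two runs from the question. It is not the Kafka API: `run`, the list standing in for the partition, and the group names `copy-job` and `copy-job-v2` are invented for the illustration, and starting from 0 for an unknown group is meant to mimic auto.offset.reset=earliest.

```python
# Toy in-memory model (not the Kafka API) of one partition with 100 records.
log = [f"msg-{i}" for i in range(100)]
committed = {}  # group id -> next offset to read


def run(group):
    """One 'run' of the copy job: read from the group's committed offset to the end."""
    start = committed.get(group, 0)  # unknown group starts at 0, like "earliest"
    records = log[start:]
    committed[group] = len(log)      # commit the position after the last record
    return records


first = run("copy-job")      # first run reads everything
second = run("copy-job")     # same group: resumes at offset 100, nothing left
rerun = run("copy-job-v2")   # a different (hypothetical) group id starts over
print(len(first), len(second), len(rerun))  # 100 0 100
```

Under this model, getting 200 messages in the target would require the second run to use a different group id, or to have the original group's offsets reset before it starts.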

KafkaConsumer resume partition cannot continue to receive uncommitted messages

I'm using one topic, one partition, and one consumer; the Kafka client version is 0.10.
I got two different results:
If I pause the partition first, then produce a message and invoke the resume method, KafkaConsumer polls the uncommitted message successfully.
But if I produce a message first and don't commit its offset, then pause the partition and, after several seconds, invoke the resume method, KafkaConsumer does not receive the uncommitted message. I checked on the Kafka server using kafka-consumer-groups.sh, which shows LOG-END-OFFSET minus CURRENT-OFFSET = LAG = 1.
I have been trying to figure this out for two days. I repeated these tests many times, and the results are always the same. I need some suggestions, or for someone to tell me whether this is Kafka's intended behavior.
For your observation #2: if you restart the application, it will be given all records from the last committed offset onward, i.e. including the missing record. If your consumer again does not commit, the record will be sent again when the application registers its consumer with Kafka on the next restart. This is expected.
Assuming you are using consumer.poll(), it provides a hybrid streaming interface, i.e. it accumulates data coming into Kafka for the given duration and hands it to the consumer for processing once the duration has elapsed. This continuous accumulation happens in the background and does not depend on whether you have committed the offset or not.
KafkaConsumer
The position of the consumer gives the offset of the next record that
will be given out. It will be one larger than the highest offset the
consumer has seen in that partition. It automatically advances every
time the consumer receives messages in a call to poll(long).

Kafka manual offset management issue

While implementing manual offset management, I encountered the following issue: (using 0.9)
In order to manage the offsets manually, for each consumed record, I retrieve the current offset of the record and commit the new offset (currentOffset + 1, since the offset reset strategy is "latest").
When a new consumer group is created, it has no explicit offsets (offset is "unknown"), therefore, if it didn't consume messages from all existing partitions before it is stopped, it will have committed offsets for only part of the partitions (the ones the consumer got messages from), while the offset for the rest of the partitions will still be "unknown".
When the consumer is started again, it gets only some of the messages that were produced while it was down (only the ones from the partitions that had a committed offset), the messages from partitions with "unknown" offset are lost and will never be consumed due to the offset reset strategy.
Since it's unacceptable in my case to miss any messages once a consumer group is created, I'd like to explicitly commit an offset for each partition before starting consumption.
To do that I found two options:
Use low level consumer to send an offset request.
Use high level consumer, call consumer.poll(0) (to trigger the assignment), then call consumer.assignment(), and for each TopicPartition call consumer.committed(topicPartition); consumer.seekToEnd(topicPartition); consumer.position(topicPartition) and eventually commit all offsets.
Both are more complex and noisy than I'd expect (I'd expect a simpler API I could use to get the log end position for all partitions assigned to a consumer).
Any thoughts or ideas for a better implementation would be appreciated.
Thanks.
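The selection step behind the second option (commit the log end position for every partition whose offset is still "unknown") can be sketched as a pure function. This is written in Python rather than the Java API the question uses, and it is only a sketch: `initial_commits` is a hypothetical helper, and the `OFFSET_INVALID` constant is assumed to mirror the sentinel that the confluent_kafka client uses for "no committed offset".

```python
# Assumed to mirror confluent_kafka.OFFSET_INVALID (sentinel for "no committed offset").
OFFSET_INVALID = -1001


def initial_commits(committed, end_offsets):
    """Return {partition: end_offset} for partitions that have no committed offset yet.

    committed:   {partition: committed offset, or OFFSET_INVALID if unknown}
    end_offsets: {partition: current log end offset}
    """
    return {
        p: end
        for p, end in end_offsets.items()
        if committed.get(p, OFFSET_INVALID) < 0
    }


# Hypothetical state: partition 0 already has a committed offset,
# partition 1 is explicitly unknown, partition 2 is missing entirely.
committed = {0: 42, 1: OFFSET_INVALID}
end_offsets = {0: 50, 1: 17, 2: 3}
to_commit = initial_commits(committed, end_offsets)
print(to_commit)  # {1: 17, 2: 3}
```

With a real client, the two input mappings would have to come from the consumer itself (in confluent_kafka, presumably via `consumer.committed(...)` and `consumer.get_watermark_offsets(...)`), and the result would then be committed before consumption starts.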
Which consumer API to use depends entirely on where you commit your offsets.
If your offsets are stored in the Kafka broker, then you should definitely use the high-level consumer API; it gives you more control over offsets.
If you keep offsets in ZooKeeper, then you can use any of the old consumer APIs, like
List<KafkaStream<byte[], byte[]>> streams =
    consumer.createMessageStreamsByFilter(new Whitelist(topicRegex), 1);

Consume messages without committing from Kafka 10 consumer

I have a requirement to read messages from a topic, batch them, and push each batch to an external system. If a batch fails for any reason, I need to consume the same set of messages again and repeat the process. So for every batch, the from and to offsets for each partition are stored in a database. To achieve this, I create one Kafka consumer per partition by assigning the partition to the reader; based on the previously stored offsets, the consumers seek to that position and start reading. I have turned off auto commit and I don't commit offsets from the consumer. For every batch, I create a new consumer per partition, read messages from the last stored offset, and publish to the external system. Do you see any problems with consuming messages without committing offsets and using the same consumer group across batches, given that at any point there won't be more than one consumer per partition?
Your design seems reasonable to me.
Committing offsets to Kafka is just a convenient built-in mechanism within Kafka to keep track of offsets. However, there is no requirement whatsoever to use it -- you can use any other mechanism to track offsets, too (like using a DB as in your case).
Furthermore, if you assign partitions manually, there will be no group management anyway. So parameter group.id has no effect. See http://docs.confluent.io/current/clients/consumer.html for more details.
In Kafka version 2, I achieved this behaviour without needing a database to store the offsets.
The following is a configuration for spring-boot-kafka, but it should also work with any Kafka consumer API.
spring:
  kafka:
    bootstrap-servers: ...
    consumer:
      value-deserializer: ...
      max-poll-records: 1000
      enable-auto-commit: false
      fetch-min-size: 262144 # 1/4 MB
      group-id: ...
      fetch-max-wait: 10000 # we will consume every 10 s, or when 1/4 MB or 1000 records have accumulated
      auto-offset-reset: earliest
    listener:
      type: batch
      concurrency: 7
      ack-mode: manual
This gives me the messages in batches of at most 1000 records (depending on load). I then write these records asynchronously to a database and count how many success callbacks I get. If the number of successful writes equals the received batch size, I acknowledge the batch, i.e. I commit the offset. This design was very reliable even in a high-load production environment.
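The "acknowledge only when every write succeeded" rule described above can be sketched independently of Spring. This is a Python illustration, not the answer's actual listener code: `process_batch`, `write_record` and `acknowledge` are hypothetical names, with `acknowledge` standing in for whatever commits the batch's offsets.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(records, write_record, acknowledge):
    """Write all records concurrently; acknowledge only if every write succeeded."""
    def safe_write(record):
        try:
            write_record(record)
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=8) as pool:
        successes = sum(pool.map(safe_write, records))

    if successes == len(records):
        acknowledge()  # e.g. commit the batch's offsets
        return True
    return False       # not acknowledged, so the batch can be redelivered


# Usage with in-memory stand-ins for the database and the acknowledgment.
stored, acked = [], []
ok = process_batch(["a", "b", "c"], stored.append, lambda: acked.append("batch-1"))


def flaky_write(record):
    if record == "b":
        raise IOError("simulated write failure")


failed = process_batch(["a", "b", "c"], flaky_write, lambda: acked.append("batch-2"))
print(ok, failed, acked)  # True False ['batch-1']
```

Because a partially written batch is not acknowledged and may be delivered again, the writes would need to be idempotent (or otherwise tolerate duplicates) for this design to be safe.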