Horizontally Scaled Kafka Consumers consuming different offsets - apache-kafka

I've been developing a Kafka consumer application (C#, running in Kubernetes) and have been running it as a single node for a while, consuming from a single topic.
I noticed today that the topic I have been consuming from was quite full: my continuously running consumer was at offsets around ~38k (in general, ignoring partition differences), but the records my producer was putting on the topic (also ignoring partition differences) were around offsets ~58k.
I decided to scale out a second consumer pod with the same code and config all around (group id, etc.).
When it came online, it logged that it was processing messages in the ~58k offset range. I considered that this might just be a different partition, but I can see the same partition in both logs (with different offsets).
I was under the impression that if multiple consumers had the same group id, message consumption would be balanced between them, in order.
In other words, why wouldn't my second (or n-th) consumer come online and process messages in the same offset range as my first consumer, which has been running for days?
I did eye some of the IConsumer settings such as:
https://docs.confluent.io/platform/current/clients/confluent-kafka-dotnet/_site/api/Confluent.Kafka.ConsumerConfig.html#Confluent_Kafka_ConsumerConfig_QueuedMinMessages
which seems to specify the minimum number of messages to keep in a "local consumer queue" (default: 100,000), but I don't know whether this actually means that ConsumerA has laid claim to 100k+ messages and ConsumerB is naturally starting 100k further down the line.
Other notes:
What limited access I have to the administrative tools (Control Center) shows that my consumer group is about 900k messages behind.
Control Center says my topic has 60 partitions.
Auto-commit is not disabled (default: true).
Regardless of the auto-commit setting, I am also calling _consumer.Commit(msg) in the finally{} block after processing each individual message (a sketch of this loop follows these notes).
I don't want to kill my long-running consumer (which is still processing like a champ) in case there's an offset retention problem and I would "miss" all messages in the delta between the two.
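For reference, here is a minimal sketch of the kind of consume/commit loop described in the notes above, using Confluent.Kafka; the broker address, topic name, and message types are placeholders, not the actual application:
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "broker:9092",          // placeholder
    GroupId = "my-consumer-group",             // same group id on every pod
    EnableAutoCommit = true,                   // left at the default, as in the notes above
    AutoOffsetReset = AutoOffsetReset.Earliest
};

using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("my-topic");                // placeholder topic name

while (true)
{
    var msg = consumer.Consume(TimeSpan.FromSeconds(1));
    if (msg == null) continue;                 // nothing fetched within the timeout
    try
    {
        // process msg.Message.Value here
    }
    finally
    {
        consumer.Commit(msg);                  // explicit per-message commit, in addition to auto-commit
    }
}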

Related

Apache Kafka Cleanup while consuming messages

Playing around with Apache Kafka and its retention mechanism, I'm thinking about the following situation:
A consumer fetches the first batch of messages with offsets 1-5
The cleaner deletes the first 10 messages, so the topic now holds offsets 11-15
In the next poll, the consumer fetches the next batch with offsets 11-15
As you can see, the consumer has lost offsets 6-10.
Question: is such a situation possible at all? In other words, will the cleaner execute while there is an active consumer? If yes, is the consumer able to somehow recognize that gap?
Yes, such a scenario can happen. The exact steps will be a bit different:
Consumer fetches messages 1-5
Messages 1-10 are deleted
Consumer tries to fetch message 6 but this offset is out of range
Consumer uses its offset reset policy auto.offset.reset to find a new valid offset.
If set to latest, the consumer moves to the end of the partition
If set to earliest the consumer moves to offset 11
If set to none (or any other invalid value), the consumer throws an exception
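In the Confluent.Kafka .NET client (the client used in the main question above) the same policy is exposed as ConsumerConfig.AutoOffsetReset; a minimal sketch, with placeholder broker and group id:
var config = new ConsumerConfig
{
    BootstrapServers = "broker:9092",              // placeholder
    GroupId = "my-consumer-group",                 // placeholder
    // Latest   -> jump to the end of the partition
    // Earliest -> jump to the first offset still available (offset 11 in the example above)
    // Error    -> the equivalent of none: Consume surfaces an error instead of silently resetting
    AutoOffsetReset = AutoOffsetReset.Earliest
};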
To avoid such scenarios, you should monitor the lead of your consumer group. It's similar to the lag, but the lead indicates how far from the start of the partition the consumer is. Being near the start has the risk of messages being deleted before they are consumed.
If consumers are near the limits, you can dynamically add more consumers or increase the topic retention size/time if needed.
Setting auto.offset.reset to none will make the consumer throw an exception if this happens; the other values only log it.
Question: is such a situation possible at all? Will the cleaner execute while there is an active consumer?
Yes, if the messages have exceeded the TTL (time-to-live, i.e. retention) period before they are consumed, this situation is possible.
Is the consumer able to somehow recognize that gap?
If you suspect your configuration (high consumer lag, low retention) might lead to this, the consumer application should track its offsets. The kafka-consumer-groups.sh command gives you the position of every consumer in a consumer group as well as how far behind the end of the log each one is.
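If you would rather detect the gap from the application itself (for example the C# consumer in the main question), one possible approach, sketched here with Confluent.Kafka and placeholder names, is to compare the group's committed offset against the partition's low watermark; if the committed offset is below the low watermark, the messages in between were deleted before they were read:
using System;
using Confluent.Kafka;

var config = new ConsumerConfig { BootstrapServers = "broker:9092", GroupId = "my-consumer-group" }; // placeholders
using var consumer = new ConsumerBuilder<Ignore, Ignore>(config).Build();

var tp = new TopicPartition("my-topic", 0);        // placeholder topic/partition
var timeout = TimeSpan.FromSeconds(10);

// last committed offset for this group on this partition
var committed = consumer.Committed(new[] { tp }, timeout)[0];

// low/high watermarks: the earliest and next offsets currently stored on the broker
var watermarks = consumer.QueryWatermarkOffsets(tp, timeout);

if (committed.Offset != Offset.Unset && committed.Offset.Value < watermarks.Low.Value)
{
    // everything between the committed offset and the low watermark was deleted by retention
    Console.WriteLine($"Gap on {tp}: committed {committed.Offset}, earliest available {watermarks.Low}");
}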

How does a Kafka Consumer behave if a Producer goes down? What happens to the data in the interval when the producer is down?

I just want to know how the consumer is able to consume data when the producer is down. Let's say the producer keeps sending logs to the consumer at a steady rate and then goes down from 8 AM to 6 PM. How does the consumer work in such a case, and is there a way for the consumer to get the data that would have been sent during 8 AM - 6 PM if the producer had been up?
In Apache Kafka there is no relationship between how the producer and the consumer behave.
Acting as a messaging system, Kafka decouples the producer from the consumer by providing an asynchronous communication channel.
The producer can send messages at its own pace and the consumer can read these messages in real time or later, at its own pace (different from the producer's).
The messages are saved in a topic living in the Kafka cluster, and each message has a position in the topic partition (the offset).
Of course, it's possible to tune when messages are deleted from the topic if the consumer isn't online for a long time to read them.
You can configure the broker to retain messages based on size (deleting the oldest once a limit is reached) or based on time (deleting messages older than a retention period, which can be days, weeks, or months).
Furthermore, the consumer is also able to rewind the stream of messages in the topic, actually re-reading the messages if needed.
Finally, the consumer can also seek to a specific position in the topic partition, either by offset or by specifying a time.
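With the Confluent.Kafka client, for example, rewinding or jumping to a timestamp might look like the following sketch; topic, partition, offset, and timestamp are placeholder values:
using System;
using Confluent.Kafka;

var config = new ConsumerConfig { BootstrapServers = "broker:9092", GroupId = "my-consumer-group" }; // placeholders
using var consumer = new ConsumerBuilder<string, string>(config).Build();

var tp = new TopicPartition("my-topic", 0);                        // placeholder

// take ownership of the partition explicitly and start from the beginning (rewind)
consumer.Assign(new TopicPartitionOffset(tp, Offset.Beginning));

// ... or seek to an absolute offset
consumer.Seek(new TopicPartitionOffset(tp, 12345));                // placeholder offset

// ... or find the first offset at/after a wall-clock time and seek there
var ts = new TopicPartitionTimestamp(tp, new Timestamp(DateTime.UtcNow.AddHours(-1)));
var offsets = consumer.OffsetsForTimes(new[] { ts }, TimeSpan.FromSeconds(10));
consumer.Seek(offsets[0]);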
The Kafka documentation has a nice diagram that shows the novelty of Kafka in a succinct way.
Without Kafka, the situation is something like this: we have multiple servers (e.g. frontend servers, DB servers, chat servers), and on the other side we have various metrics and monitoring tools (e.g. a DB monitor, a UI monitor). Direct one-to-one communication between the different servers and collectors might work for smaller systems, but in terms of scalability it breaks down pretty quickly once the system has surpassed a certain threshold. Kafka solves this problem by decoupling the senders and receivers: both of them talk through the Kafka brokers instead of talking to each other directly.
So, in your case the consumer would simply ask the broker if there is any new data on the topic it is subscribed to. As the producer is down, and assuming there is no data in the queue, the broker would reply that there is nothing to be consumed. The consumer would keep polling at a fixed interval in an endless loop and do nothing. Whenever the producer comes back up and starts pumping out data, the consumer starts receiving (and processing) it. There are more involved cases where you might lose data if the retention period for a particular topic has passed and the consumer hasn't processed the backlog, but I don't think that's a concern for you at this point of your journey.

Certain partitions seem to take precedence when a consumer is reading from multiple partitions

I have a service which reads from a Kafka topic using librdkafka. I've noticed that if the consumer shuts down for a while, some log entries build up in Kafka (this is perfectly fine and expected).
What's weird is that sometimes, when I start the consumer back up and look at the pending log entries by partition, partitions assigned to the same consumer seem to catch up at different rates.
For example, say I have a consumer X and it claims partitions 30 through 50. When the consumer starts, there are 10,000 entries pending on each.
What I see is that the pending entries for 30-40 trend downward while the pending entries for 41-50 grow. When 30-40 finally hits zero (or gets close enough to zero), 41-50 starts trending downward.
Why is this happening? Is it a client feature or a server feature?
The way Kafka works, the consumer keeps switching through its assigned partitions to fetch data. However, the client is smart enough to switch to and handle only as many partitions at a time as the consumer's capacity allows; if your consumer were more powerful (in terms of server performance) it would take on a few more partitions at once, but either way it will pick up the remaining partitions in a second pass after being done with the first ones.
In summary: if you create X partitions you might expect the consumer to go through all of them one by one before revisiting the first one, but that would hurt performance because of the extra effort spent switching.
In your case, I understand that since the other partitions also carry business data you don't want them delayed heavily, so I suggest reducing the number of partitions.

Can we have a retention period of zero in the Kafka broker?

Does a retention period of zero make sense in a Kafka broker?
We want to quickly forward messages from the producer to the consumer via the Kafka broker, served from the buffer cache/page cache on the broker machine without flushing to disk. We do not need replication and assume our broker will never crash.
When a message is produced to a Kafka topic it is written to disk. Once the message has been consumed, the offset of this message is committed by the consumer (if you are using the high-level consumer API); however, there is no functionality that deletes only the messages that have been consumed (many consumers may subscribe to the same topic, and some of them might have consumed a given message while others might not have).
What I would suggest in your case is to set a short retention period (by default it is set to 7 days), but one that still allows a reasonable amount of time for your consumer to consume the messages. To do this, you simply need to configure the following parameter in server.properties:
log.retention.ms=X
Note that there is no guarantee that the deleted message(s) have been successfully consumed by your consumer(s). For example, if you set the retention period to 2 seconds (i.e. log.retention.ms=2000) and your consumer crashes, then every message which is sent to the topic while the consumer is down will be lost.
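As a concrete (hypothetical) illustration, keeping messages for roughly one minute could look like this in server.properties; the values are examples only:
log.retention.ms=60000
log.retention.check.interval.ms=10000
Note that log.retention.check.interval.ms controls how often the broker checks for segments to delete (default 5 minutes), and retention is applied per log segment rather than per message, so log.segment.bytes / log.roll.ms also influence how quickly old data actually disappears. A very short retention period therefore does not behave like "delete as soon as consumed".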

Kafka Topic Lag keeps increasing gradually when Message Size is huge

I am using the Kafka Streams Processor API to construct a Kafka Streams application that retrieves messages from a Kafka topic. I have two consumer applications with the same Kafka Streams configuration; the only difference is the message size. The first one has messages of 2000 characters (3KB) while the second one has messages of 34000 characters (60KB).
In my second consumer application I am getting a lot of lag, which increases gradually with the traffic, while my first application is able to process its messages over the same period without any lag.
My Streams configuration parameters are as below:
application.id=Application1
default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
num.stream.threads=1
commit.interval.ms=10
topology.optimization=all
Thanks
In order to consume messages faster, you may need to increase the number of partitions (if that hasn't been done already, depending on the current value), and do one of the following two things:
1) increase the value of the config num.stream.threads within your application
or
2) start several instances of the application with the same consumer group (the same application.id).
For me, increasing num.stream.threads is preferable (up to the number of CPUs on the machine your app runs on). Try gradually increasing this value, e.g. going from 4 to 6 to 8, and monitor the consumer lag of your application.
By increasing num.stream.threads your app will be able to consume messages in parallel, assuming you have enough partitions.
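As a concrete (hypothetical) illustration of option 1, assuming the machine has at least 4 CPUs and the topic has at least 4 partitions, the configuration in the question would change to:
num.stream.threads=4
Each stream thread runs its own consumer, so with 4 threads and at least 4 partitions the work is spread across 4 consumers inside a single application instance.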