Kafka topic partition has missing offsets

I have a Flink streaming application that consumes data from a Kafka topic with 3 partitions. Even though the application is running continuously and working without any obvious errors, I see lag in the consumer group for the Flink app on all 3 partitions.
./kafka-consumer-groups.sh --bootstrap-server $URL --all-groups --describe
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
group-1 topic-test 0 9566 9568 2 - - -
group-1 topic-test 1 9672 9673 1 - - -
group-1 topic-test 2 9508 9509 1 - - -
If I send new records, they get processed, but the lag still exists. I tried to view the last few records for partition 0 and this is what I got (omitting the message part):
./kafka-console-consumer.sh --topic topic-test --bootstrap-server $URL --property print.offset=true --partition 0 --offset 9560
Offset:9560
Offset:9561
Offset:9562
Offset:9563
Offset:9564
Offset:9565
The log-end-offset value is at 9568 and the current offset is at 9566. Why are these offsets not available in the console consumer and why does this lag exist?
There were a few instances where I noticed missing offsets. Example:
Offset:2344
Offset:2345
Offset:2347
Offset:2348
Why did the offset jump from 2345 to 2347 (skipping 2346)? Does this have something to do with how the producer is writing to the topic?

You can describe your topic to see any configuration that was set when it was created. If log compaction is enabled through log.cleanup.policy=compact, the runtime behaviour will be different. The lag you see can come from the log cleaner's compaction-lag settings, and the missing offsets may be due to messages produced with a key but a null value (tombstones), which compaction removes.
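As an illustration of the null-value case, here is a minimal sketch with the Java producer (the broker address is an assumption; the topic name is taken from the question):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TombstoneSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // A keyed record with a null value is a tombstone: on a compacted topic the
            // cleaner eventually removes it (after delete.retention.ms), so the offset it
            // occupied appears as a gap to consumers reading the partition.
            producer.send(new ProducerRecord<>("topic-test", "some-key", null));
            producer.flush();
        }
    }
}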
Configuring The Log Cleaner
The log cleaner is enabled by default. This will start the pool of cleaner threads. To enable log cleaning on a particular topic, add the log-specific property log.cleanup.policy=compact.
The log.cleanup.policy property is a broker configuration setting defined in the broker's server.properties file; it affects all of the topics in the cluster that do not have a configuration override in place. The log cleaner can be configured to retain a minimum amount of the uncompacted "head" of the log. This is enabled by setting the compaction time lag log.cleaner.min.compaction.lag.ms.
This can be used to prevent messages newer than a minimum message age from being subject to compaction. If not set, all log segments are eligible for compaction except for the last segment, i.e. the one currently being written to. The active segment will not be compacted even if all of its messages are older than the minimum compaction time lag.
The log cleaner can be configured to ensure a maximum delay after which the uncompacted "head" of the log becomes eligible for log compaction, via log.cleaner.max.compaction.lag.ms.
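To confirm which of these settings are in effect on your topic, here is a hedged sketch using Kafka's Java AdminClient (the broker address is an assumption; note that at the topic level the property names drop the log./log.cleaner. prefixes, e.g. cleanup.policy and min.compaction.lag.ms):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class TopicConfigCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "topic-test");
            Config config = admin.describeConfigs(Collections.singleton(topic))
                                 .all().get().get(topic);
            // Topic-level overrides of the broker-wide settings quoted above
            System.out.println("cleanup.policy        = " + config.get("cleanup.policy").value());
            System.out.println("min.compaction.lag.ms = " + config.get("min.compaction.lag.ms").value());
        }
    }
}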

The lag is calculated from the latest offset committed by the Kafka consumer (lag = latest offset - latest committed offset). In general, Flink commits Kafka offsets when it performs a checkpoint, so there is always some lag if you check it using the consumer-groups command.
That doesn't mean that Flink hasn't consumed and processed all the messages in the topic/partition; it just means that it has not committed them yet.
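As a minimal sketch of that mechanism (the broker address and checkpoint interval are assumptions, and the exact connector class depends on your Flink version): with checkpointing enabled, the Flink Kafka source commits offsets back to the group once per checkpoint, and those committed offsets are what kafka-consumer-groups.sh reports.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class LagSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // offsets are committed back to Kafka once per checkpoint

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "group-1");

        FlinkKafkaConsumer<String> source =
                new FlinkKafkaConsumer<>("topic-test", new SimpleStringSchema(), props);
        source.setCommitOffsetsOnCheckpoints(true); // the default when checkpointing is enabled

        env.addSource(source).print();
        env.execute("lag-sketch");
    }
}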

Related

Current offset behavior when set by kafka-consumer-groups to earliest?

I have a kafka topic with 25 partitions and the cluster has been running for 5 months.
As per my understanding, for each partition of a given topic the offsets start from 0, 1, 2, ... (unbounded).
I see log-end-offset at a very high value (right now -> 1230628032)
I created a new consumer group with the offset set to earliest, so I expected a client in that consumer group to start reading from offset 0.
The command which I used to create a new consumer group with offset to earliest:
kafka-consumer-groups --bootstrap-server <IP_address>:9092 --reset-offsets --to-earliest --topic some-topic --group to-earliest-cons --execute
I see the consumer group being created. I expected the current-offset to be 0; however, when I described the consumer group, the current offset was very high, at the moment 1143755193.
The record retention period set is for 7 days (standard value).
My question is: why isn't the first offset from which a consumer in this consumer group will read 0? Does it have something to do with data retention?
Can anyone help me understand this?
It is exactly data retention. It is highly probable that Kafka has already removed the old messages with offset 0 from your partitions, so it doesn't make sense to start from 0. Instead, Kafka will set the offset to the earliest message still available in your partition. You can check those offsets using:
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <IP_address>:9092 --topic some-topic --time -2
You will probably see values really close to what you're seeing as new consumer offset.
You can also try and set offset explicitly to 0:
./kafka-consumer-groups.sh --bootstrap-server <IP_address>:9092 --reset-offsets --to-offset 0 --topic some-topic --group to-earliest-cons --execute
However, you will see a warning that offset 0 does not exist, and it will use a higher value (the aforementioned earliest available message):
New offset (0) is lower than earliest offset for topic partition some-topic. Value will be set to 1143755193
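If you prefer to check those earliest offsets programmatically rather than with GetOffsetShell, here is a hedged sketch using the Java consumer's beginningOffsets (the broker placeholder and partition count follow the question):

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class EarliestOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "<IP_address>:9092"); // same placeholder as above
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (int p = 0; p < 25; p++) {                      // the topic has 25 partitions
                partitions.add(new TopicPartition("some-topic", p));
            }
            // beginningOffsets returns the first offset still on disk per partition,
            // i.e. where a --to-earliest reset will actually land once retention has run
            consumer.beginningOffsets(partitions)
                    .forEach((tp, offset) -> System.out.println(tp + " -> " + offset));
        }
    }
}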

Kafka Consumer left consumer group

I've run into a problem using Kafka. Any help is much appreciated!
I have ZooKeeper and Kafka clusters, 3 nodes each, in Docker Swarm. The Kafka broker configuration is shown below.
KAFKA_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
KAFKA_MIN_INSYNC_REPLICAS: 2
KAFKA_NUM_PARTITIONS: 8
KAFKA_REPLICA_SOCKET_TIMEOUT_MS: 30000
KAFKA_REQUEST_TIMEOUT_MS: 30000
KAFKA_COMPRESSION_TYPE: "gzip"
KAFKA_JVM_PERFORMANCE_OPTS: "-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80"
KAFKA_HEAP_OPTS: "-Xmx768m -Xms768m -XX:MetaspaceSize=96m"
My case:
20x producers constantly producing messages to a Kafka topic
1x consumer that reads and logs messages
I kill a Kafka node (docker container stop), so the cluster now has 2 Kafka broker nodes (the 3rd will start and rejoin the cluster automatically)
Now the consumer is not consuming messages anymore, because it left the consumer group due to rebalancing
Is there any mechanism to tell the consumer to rejoin the group after rebalancing?
Logs:
INFO 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] Attempt to heartbeat failed since group is rebalancing
WARN 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
@Rostyslav Whenever the consumer makes a call to read messages, it performs 2 major calls:
Poll
Commit
Poll basically fetches records from the Kafka topic, and commit tells Kafka to save them as read messages so that they're not read again. While polling, a few parameters play a major role:
max_poll_records
max_poll_interval_ms
FYI: the variable names are per the Python API.
Hence, whenever we try to read messages from the consumer, each call to poll must be followed by another within max_poll_interval_ms, and the next call is made only after the fetched records (as defined by max_poll_records) have been processed. So whenever max_poll_records are not processed within max_poll_interval_ms, we get this error.
In order to overcome this issue, we need to alter one of the two variables. Altering max_poll_interval_ms can be hectic, since some records may take longer to process than others. I always advise playing with max_poll_records as the fix for this issue, as shown in the sketch below. That works for me.
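Here is a minimal sketch of that tuning with the Java client (broker, group id, and topic are assumptions; in the Java API the parameters are called max.poll.records and max.poll.interval.ms):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollTuningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "loggingGroup");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // default 500; smaller batches
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // default 5 minutes
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic")); // assumed topic
            while (true) {
                // Poll: fetch at most max.poll.records; all of them must be processed and
                // poll() called again within max.poll.interval.ms, or the member is evicted
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
                consumer.commitSync(); // Commit: mark the fetched records as read
            }
        }
    }
}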

How to solve a problem with checkpointed invalid __consumer_offsets and producer epoch on partitions of __transaction_state

I have two kinds of log entries in server.log
First kind:
WARN Resetting first dirty offset of __consumer_offsets-6 to log start offset 918 since the checkpointed offset 903 is invalid. (kafka.log.LogCleanerManager$)
Second kind:
INFO [TransactionCoordinator id=3] Initialized transactionalId Source: AppService Kafka consumer -> Not empty string filter -> CDMEvent mapper -> (NonNull CDMEvent filter -> Map -> Sink: Kafka CDMEvent producer, Nullable CDMEvent filter -> Map -> Sink: Kafka Error producer)-bddeaa8b805c6e008c42fc621339b1b9-2 with producerId 78004 and producer epoch 23122 on partition __transaction_state-45 (kafka.coordinator.transaction.TransactionCoordinator)
I found a suggestion that removing the checkpoint file might help:
https://medium.com/@anishekagarwal/kafka-log-cleaner-issues-80a05e253b8a
"What we gathered was to:
stop the broker
remove the log cleaner checkpoint file
( cleaner-offset-checkpoint )
start the broker
that solved the problem for us."
Is it safe to try that with all the checkpoint files (cleaner-offset-checkpoint, log-start-offset-checkpoint, recovery-point-offset-checkpoint, replication-offset-checkpoint), or is it not advisable for any of them?
I stopped each broker, moved cleaner-offset-checkpoint to a backup location, and started the broker without that file. The brokers started cleanly, deleted a lot of excess segments, and no longer log:
WARN Resetting first dirty offset of __consumer_offsets to log start offset since the checkpointed offset is invalid
any more. Obviously, this issue/defect https://issues.apache.org/jira/browse/KAFKA-6266 is not solved yet, even in 2.0.
However, that didn't compact the consumer offsets according to expectations. offsets.retention.minutes defaults to 10080 (7 days), and I tried setting it explicitly to 5040, but it didn't help: there are still messages more than one month old. Since log.cleaner.enable is true by default, they should be compacted, but they are not. The only remaining option is to set cleanup.policy for the __consumer_offsets topic to delete again, but that is the action that triggered the problem in the first place, so I am a bit reluctant to do it.
The problem I described in "No Kafka Consumer Group listed by kafka-consumer-groups.sh" is also not resolved by this. Obviously something is preventing kafka-consumer-groups.sh from reading the __consumer_offsets topic (when issued with the --bootstrap-server option; otherwise it reads from ZooKeeper) and displaying results. Kafka Tool does this without problems, and I believe the two problems are connected.
The reason I think the topic is not compacted is that it has messages with exactly the same key (and even timestamp) that are older than the broker settings should allow. Kafka Tool also ignores certain records and doesn't interpret them as consumer groups in that display; the fact that kafka-consumer-groups.sh ignores all of them is probably due to some corruption of those records.

Kafka: bizarre assignment of partitions in a topic

I'm not sure how to explain the issue I'm facing with Kafka, but I'll try my best. I have a set of 4 consumers in the same consumer group, named:
absolutegrounds.helper.processor
consuming from a topic with 5 partitions; therefore three consumers in the group are assigned 1 partition each and one consumer is assigned 2 partitions, in order to distribute the 5 partitions fairly among the 4 consumers.
But for some reason I cannot figure out, the initial assignment turned into only 2 consumers being assigned all the available partitions, i.e. 1 consumer with 3 partitions and 1 consumer with 2 partitions. However, there are theoretically still 4 consumers in the same consumer group:
[medinen#ocvlp-rks001 kafka_2.11-0.10.0.1]$ ./bin/kafka-run-class.sh kafka.admin.ConsumerGroupCommand --new-consumer --bootstrap-server localhost:9092 --describe --group absolutegrounds.helper.processor
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
absolutegrounds.helper.processor AG_TASK_SOURCE 0 27286 31535 4249 consumer-1_/10.132.9.128
absolutegrounds.helper.processor AG_TASK_SOURCE 1 28015 28045 30 consumer-1_/10.132.9.128
absolutegrounds.helper.processor AG_TASK_SOURCE 2 35437 40091 4654 consumer-1_/10.132.9.128
absolutegrounds.helper.processor AG_TASK_SOURCE 3 31765 31874 109 consumer-1_/10.132.8.23
absolutegrounds.helper.processor AG_TASK_SOURCE 4 33279 38003 4724 consumer-1_/10.132.8.23
The most bizarre behaviour is that the other 2 consumers left out of the consumer group (as per the response from Kafka above) seem to still be consuming from the topic, judging by the logs of my application, although I cannot find them anywhere in the consumer group. Even more bizarre is the fact that one of them is supposedly assigned all the partitions in the topic, whereas the other is assigned only partition 4. See the logs from the application (this is a Spring Boot application using Spring Kafka):
First left consumer:
- - 08/05/2017 12:27:29.119 - [-kafka-consumer-1] INFO o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [AG_TASK_SOURCE-0, AG_TASK_SOURCE-1, AG_TASK_SOURCE-2, AG_TASK_SOURCE-3, AG_TASK_SOURCE-4] for group absolutegrounds.helper.processor
Second left consumer:
- - 08/05/2017 12:27:19.044 - [-kafka-consumer-1] INFO o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [AG_TASK_SOURCE-4] for group absolutegrounds.helper.processor
Trying to understand the reasoning behind this behaviour, I've taken a look into the topic that stores all the offsets for the consumers:
__consumer_offsets
using this command:
kafka/kafka_2.11-0.10.0.1/bin/kafka-console-consumer.sh --consumer.config /tmp/consumer.config --formatter "kafka.coordinator.GroupMetadataManager\$GroupMetadataMessageFormatter" --zookeeper ocvlp-rks003:2181 --topic __consumer_offsets --from-beginning | grep "absolutegrounds.helper.processor"
and this is what I've found:
absolutegrounds.helper.processor::[absolutegrounds.helper.processor,consumer,Stable,Map(consumer-1-170fb8f6-c8d3-4782-8940-350673b859cb -> [consumer-1-170fb8f6-c8d3-4782-8940-350673b859cb,consumer-1,/10.132.8.23,10000], consumer-1-b8d3afc0-159e-4660-bc65-faf68900c332 -> [consumer-1-b8d3afc0-159e-4660-bc65-faf68900c332,consumer-1,/10.132.9.128,10000], consumer-1-dddf10ad-187b-4a29-9996-e05edaad3caf -> [consumer-1-dddf10ad-187b-4a29-9996-e05edaad3caf,consumer-1,/10.132.8.22,10000], consumer-1-2e4069f6-f3a8-4ede-a4f4-aadce6a3adb7 -> [consumer-1-2e4069f6-f3a8-4ede-a4f4-aadce6a3adb7,consumer-1,/10.132.9.129,10000])]
absolutegrounds.helper.processor::[absolutegrounds.helper.processor,consumer,Stable,Map(consumer-1-66de4a46-538c-425f-8e95-5a00ff5eb5fd -> [consumer-1-66de4a46-538c-425f-8e95-5a00ff5eb5fd,consumer-1,/10.132.9.129,10000])]
absolutegrounds.helper.processor::[absolutegrounds.helper.processor,consumer,Stable,Map(consumer-1-5b96166e-e528-48f7-8f6e-18a67328eae6 -> [consumer-1-5b96166e-e528-48f7-8f6e-18a67328eae6,consumer-1,/10.132.9.128,10000], consumer-1-dcfff37a-8ad3-403c-a070-cca82a1f6d21 -> [consumer-1-dcfff37a-8ad3-403c-a070-cca82a1f6d21,consumer-1,/10.132.8.23,10000])]
absolutegrounds.helper.processor::[absolutegrounds.helper.processor,consumer,Stable,Map(consumer-1-5b96166e-e528-48f7-8f6e-18a67328eae6 -> [consumer-1-5b96166e-e528-48f7-8f6e-18a67328eae6,consumer-1,/10.132.9.128,10000], consumer-1-dcfff37a-8ad3-403c-a070-cca82a1f6d21 -> [consumer-1-dcfff37a-8ad3-403c-a070-cca82a1f6d21,consumer-1,/10.132.8.23,10000])]
From the response of Kafka, I can see that at some point in time all 4 consumers were properly distributed across the partitions:
absolutegrounds.helper.processor::[absolutegrounds.helper.processor,consumer,Stable,Map(consumer-1-170fb8f6-c8d3-4782-8940-350673b859cb -> [consumer-1-170fb8f6-c8d3-4782-8940-350673b859cb,consumer-1,/10.132.8.23,10000], consumer-1-b8d3afc0-159e-4660-bc65-faf68900c332 -> [consumer-1-b8d3afc0-159e-4660-bc65-faf68900c332,consumer-1,/10.132.9.128,10000], consumer-1-dddf10ad-187b-4a29-9996-e05edaad3caf -> [consumer-1-dddf10ad-187b-4a29-9996-e05edaad3caf,consumer-1,/10.132.8.22,10000], consumer-1-2e4069f6-f3a8-4ede-a4f4-aadce6a3adb7 -> [consumer-1-2e4069f6-f3a8-4ede-a4f4-aadce6a3adb7,consumer-1,/10.132.9.129,10000])]
However, at some point later, the assignment changed to the current scenario where only 2 of the 4 consumers in the consumer group are assigned partitions.
I've struggled to understand what could have led to this situation, but I cannot find a valid answer to figure it out and fix it.
Can anyone help here? Thanks.

How can we run multiple kafka consumers through command line?

I am testing Kafka performance using the shell scripts provided in the Kafka package. I have created a topic with 10 partitions and am pumping in data as shown below:
./bin/kafka-producer-perf-test.sh --topic test-topic --num-records 9000000 --record-size 300 --throughput 250000 --producer-props bootstrap.servers=110.17.14.302:9092 acks=1 max.in.flight.requests.per.connection=1 batch.size=5000
Now I want to consume the data I am pumping in from multiple consumers, not just a single one. So I started using kafka-consumer-perf-test.sh. This is what I was doing:
./bin/kafka-consumer-perf-test.sh --zookeeper localhost:2181 --topic test-topic --group test1
Is there any way to run multiple Kafka consumers in a single consumer group through the command line, with each of those consumers working on different partitions, using kafka-consumer-perf-test.sh? I am working with Kafka version 0.10.1.0.
I saw this SO post, but it doesn't say where to configure how many consumers we want to run and which partitions they will work on.
Update:
This is the error I saw:
./bin/kafka-consumer-perf-test.sh --zookeeper 110.27.14.10:2181 --messages 50 --topic test-topic --threads 1
[2017-01-11 22:34:09,785] WARN [ConsumerFetcherThread-perf-consumer-14195_kafka-cluster-3098529006-zeidk-1484174043509-46a51434-2-0], Error in fetch kafka.consumer.ConsumerFetcherThread$FetchRequest@54fb48b6 (kafka.consumer.ConsumerFetcherThread)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:93)
at kafka.network.BlockingChannel.readCompletely(BlockingChannel.scala:129)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:120)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:99)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:83)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:132)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:131)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:130)
at kafka.consumer.ConsumerFetcherThread.fetch(ConsumerFetcherThread.scala:109)
at kafka.consumer.ConsumerFetcherThread.fetch(ConsumerFetcherThread.scala:29)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
Just run the same command (i.e., ./bin/kafka-consumer-perf-test.sh) multiple times in different consoles.
About partition assignment: Kafka will do this automatically for you if you use consumer groups.
If you want to do manual partition assignment, you cannot use consumer groups. For this you cannot use kafka-consumer-perf-test.sh; you need to write your own consumer (see the sketch after the JavaDoc link below).
Read JavaDoc here: https://kafka.apache.org/0101/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
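For completeness, here is a hedged sketch of such a hand-written consumer using manual assignment (this uses the modern Java client's poll(Duration); broker and topic are taken from the question). With assign() there is no group coordination, so each process reads exactly the partitions you give it:

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ManualAssignConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "110.17.14.302:9092"); // broker from the question
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // assign() bypasses group management entirely: this process reads exactly
            // these partitions; start more copies with different partition lists
            consumer.assign(Arrays.asList(
                    new TopicPartition("test-topic", 0),
                    new TopicPartition("test-topic", 1)));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
                }
            }
        }
    }
}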