kafka consumer keeps looping over a bunch of messages after CommitFailedException - apache-kafka

I am running a multi-threaded Kafka 0.9.0.1 consumer (the new consumer API).
The way I generate a client.id is by combining the hostname the consumer is running on, an AtomicInteger counter, and the PID of the process.
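For reference, a client.id scheme like the one described might look like the minimal sketch below. The class name `ClientIds`, the `-` separator, and passing the host and PID in as arguments are assumptions made for illustration; the original code is not shown in the question.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ClientIds {
    // Process-wide counter so that every consumer thread gets a distinct suffix.
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // host and pid are supplied by the caller (hypothetical signature),
    // which keeps the lookup of both values out of this sketch.
    static String next(String host, long pid) {
        return host + "-" + COUNTER.incrementAndGet() + "-" + pid;
    }
}
```

Note that client.id is only a label used in logs, metrics, and quotas. The member id that appears in UNKNOWN_MEMBER_ID errors is assigned by the group coordinator each time a consumer joins the group, so it is a separate value from client.id.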
I am running into issues when I have to stop the consumer and restart it. The consumer keeps trying to process the offsets that were not consumed by the previous run (about 100 of them), but it keeps failing with this message:
2016-10-21 14:22:55,293 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Marking the coordinator 2147483647 dead.
2016-10-21 14:22:55,295 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Error UNKNOWN_MEMBER_ID occurred while committing offsets for group x.cg
2016-10-21 14:22:55,296 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:552)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:493)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:665)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:644)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:274)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:134)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorKnown(AbstractCoordinator.java:184)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:886)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
at com.kfc.kafka.consumer.KFCConsumer$KafkaConsumerRunner.run(KFCConsumer.java:102)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-21 14:22:55,397 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to join group x.cg failed due to unknown member id, resetting and retrying.
.........
2016-10-21 14:22:58,124 [pool-3-thread-3] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to heart beat failed since the group is rebalancing, try to re-join group.
From the Kafka broker log, I see a lot of rebalances happening:
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordin

It turned out that the external components our consumer interacts with were causing long, recurring pauses (slow network, problems in those components, and so on).
The solution was to split our consumer into three consumers, each with its own consumer group and its own Kafka configuration (heartbeat.interval.ms, session.timeout.ms, request.timeout.ms, max.partition.fetch.bytes).
Having three separate consumers with custom values for these properties got rid of the problem.
The general advice is to avoid heavy external communication inside the consumer, because it makes the consumer's behavior less predictable. When you do need external communication, make sure the Kafka consumer configuration is in line with the SLAs of the external components.
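As a rough illustration of lining the configuration up with an SLA, the sketch below derives the four properties from an assumed worst-case processing time per poll. The class name, the 2x headroom factor, the 10-second margin, and the 256 KB fetch size are assumptions chosen for the example, not values from the original answer. In the 0.9 consumer, heartbeats are only sent from within poll(), so session.timeout.ms effectively has to cover the time spent processing a batch.

```java
import java.util.Properties;

final class ConsumerTimeouts {
    // worstCaseBatchMs: assumed upper bound on the time one poll() batch spends
    // waiting on the external component (taken from that component's SLA).
    static Properties forGroup(String groupId, int worstCaseBatchMs) {
        int sessionTimeoutMs = worstCaseBatchMs * 2;       // headroom over the SLA (assumed factor)
        int heartbeatIntervalMs = sessionTimeoutMs / 3;    // usually kept at or below a third of the session timeout
        int requestTimeoutMs = sessionTimeoutMs + 10_000;  // must stay above the session timeout (assumed margin)

        Properties p = new Properties();
        p.setProperty("group.id", groupId);
        p.setProperty("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        p.setProperty("heartbeat.interval.ms", Integer.toString(heartbeatIntervalMs));
        p.setProperty("request.timeout.ms", Integer.toString(requestTimeoutMs));
        // A smaller fetch size means fewer records per poll(), so less work between heartbeats.
        p.setProperty("max.partition.fetch.bytes", Integer.toString(256 * 1024));
        return p;
    }
}
```

Each of the three consumers would call something like `ConsumerTimeouts.forGroup(...)` with its own group name and its own worst-case figure. The broker-side setting group.max.session.timeout.ms caps the session timeout a consumer may request, so it may need raising as well.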

Related

kafka broker Connection to INTERNAL_BROKER_DNS failed

I have 3 Kafka brokers in MSK. Two of them give the error below:
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file LOG_DIR/__amazon_msk_connect_offsets_debezium-kafka-connector-x/leader-epoch-checkpoint
Caused by: java.io.IOException: No space left on device
And the other broker gives the error below:
[2022-10-18 11:50:43,383] INFO [GroupCoordinator 1]: Member connect-1-97b56132-1f54-46a4-91f1-8d31e61e18a9 in group __amazon_msk_connect_cluster_debezium-kafka-connector-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,383] INFO [GroupCoordinator 1]: Stabilized group __amazon_msk_connect_cluster_debezium-kafka-connector-x generation 1115 (__consumer_offsets-18) (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,387] INFO [GroupCoordinator 1]: Assignment received from leader for group __amazon_msk_connect_cluster_debezium-kafka-connector-x for generation 1115 (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,387] INFO [GroupCoordinator 1]: Preparing to rebalance group __amazon_msk_connect_cluster_debezium-kafka-connector-x in state PreparingRebalance with old generation 1115 (__consumer_offsets-18) (reason: error when storing group assignment during SyncGroup (member: connect-1-9c20e001-f852-4007-8614-78a5c27207f6)) (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,876] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={test-0=PartitionData(fetchOffset=149547, logStartOffset=7151, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), test-0=PartitionData(fetchOffset=74123, logStartOffset=826, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=INVALID, epoch=INITIAL), rackId=) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to INTERNAL_BROKER_DNS (id: 3 rack: null) failed.
I want to know why I'm getting the above error, "Error in response for fetch request". What does it mean?
"No space left on device": you need a larger EBS volume, or you need to remove log segments from it manually, assuming you can SSH to the broker. If not, contact MSK Support.
Because some brokers are failing for lack of storage, the other brokers are simply unable to connect to them, which is why the replica fetch request fails.

Kafka built on Kubernetes gets stuck while consuming

I installed Kafka (3.1.0) as a StatefulSet on Kubernetes and created a topic. We send data to this topic from outside Kubernetes through HAProxy, and an application consumes it.
Everything looks normal. The topic works fine and there is no problem with the consumer.
But if I try to consume this topic through a different group, Kafka starts to get clogged. This doesn't always happen; it only happens after you start consuming and leave a few times.
Now I will try to explain this with an example. (I replaced some special parts with ...)
There is a Topic and it is already being consumed. The name of the group is "test1".
1- I am now joining this Topic with console-consumer as a new consumer.
Server Logs:
[2022-03-21 06:11:27,538] INFO [GroupCoordinator 0]: Dynamic member with unknown member id joins group console-consumer-18856 in Empty state. Created a new member id consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:27,637] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 0 (__consumer_offsets-39) (reason: Adding new member consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:30,639] INFO [GroupCoordinator 0]: Stabilized group console-consumer-18856 generation 1 (__consumer_offsets-39) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:30,748] INFO [GroupCoordinator 0]: Assignment received from leader consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 for group console-consumer-18856 for generation 1. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
2- Let's look at our groups
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
console-consumer-18856
test1
3- Now let's stop the consumer. (ctrl + c)
[2022-03-21 06:20:36,646] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 1 (__consumer_offsets-39) (reason: Removing member consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:36,646] INFO [GroupCoordinator 0]: Group console-consumer-18856 with generation 2 is now empty (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:36,647] INFO [GroupCoordinator 0]: Member MemberMetadata(memberId=consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641, groupInstanceId=None, clientId=consumer-console-consumer-18856-1, clientHost=/..., sessionTimeoutMs=10000, rebalanceTimeoutMs=300000, supportedProtocols=List(range)) has left group console-consumer-18856 through explicit `LeaveGroup` request (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:57,243] INFO [GroupMetadataManager brokerId=0] Group console-consumer-18856 transitioned to Dead in generation 2 (kafka.coordinator.group.GroupMetadataManager)
4- Let's look at our groups. (The group only disappears from the list once the "transitioned to Dead in generation 2" line has been logged; until then it still appears.)
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
test1
Everything is normal up to this point: a consumer joins the topic with a new group and leaves again. However, the situation changes when we repeat this a few times.
1- Let's re-enter the same Topic with a different group.
[2022-03-21 07:02:46,377] INFO [GroupCoordinator 0]: Dynamic member with unknown member id joins group console-consumer-43677 in Empty state. Created a new member id consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:46,510] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-43677 in state PreparingRebalance with old generation 0 (__consumer_offsets-38) (reason: Adding new member consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:49,511] INFO [GroupCoordinator 0]: Stabilized group console-consumer-43677 generation 1 (__consumer_offsets-38) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:49,650] INFO [GroupCoordinator 0]: Assignment received from leader consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 for group console-consumer-43677 for generation 1. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
2- Let's look at our groups
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
console-consumer-43677
test1
3- Now let's stop the consumer. (ctrl + c)
Here the problem starts. No exit log is seen after stopping the consumer. Sometimes a log like the one below may appear.
[2022-03-21 07:03:30,045] INFO [GroupCoordinator 0]: Member consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 in group console-consumer-43677 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
And Kafka gets stuck. I can no longer consume from it at all. Operations like create and delete no longer work; only list and describe work.
1- If I try to delete the group, it won't let me, because even though I stopped the consumer (ctrl+c), the leave process never actually happened.
bin/kafka-consumer-groups.sh --bootstrap-server ... --delete --group console-consumer-43677
Error: Deletion of some consumer groups failed:
* Group 'console-consumer-43677' could not be deleted due to: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=deleteConsumerGroups, deadlineMs=1647846924163, tries=1, nextAllowedTryMs=1647846924266) timed out at 1647846924166 after 1 attempt(s)
2- If I try to create a new topic:
bin/kafka-topics.sh --create --topic test-topic ...
Error while executing topic command : Call(callName=createTopics, deadlineMs=1647849153162, tries=2, nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)
[2022-03-21 10:52:33,170] ERROR org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, deadlineMs=1647849153162, tries=2, nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: createTopics
(kafka.admin.TopicCommand$)
3- If I try to join the topic with a consumer again:
It gives timeout errors, and the trace logs show "re-join group" errors. I don't think this has anything to do with settings like the session timeout or heartbeats, because Kafka shouldn't lock up even if I can't consume from it.
Redeploying is the only way to fix the situation. But why does the error occur? Is this a bug? A race condition? Is there a solution? Is it specific to Kubernetes? Is it related to Kafka 3.1.0?

ProducerFencedException Processing Kafka Stream

I'm using Kafka 1.1.0. A Kafka Streams application consistently throws this exception (albeit with different messages):
WARN o.a.k.s.p.i.RecordCollectorImpl#onCompletion:166 - task [0_0] Error sending record (key KEY value VALUE timestamp TIMESTAMP) to topic OUTPUT_TOPIC due to Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.; No more records will be sent and no more offsets will be recorded for this task.
WARN o.a.k.s.p.i.AssignedStreamsTasks#closeZombieTask:202 - stream-thread [90556797-3a33-4e35-9754-8a63200dc20e-StreamThread-1] stream task 0_0 got migrated to another thread already. Closing it as zombie.
WARN o.a.k.s.p.internals.StreamThread#runLoop:752 - stream-thread [90556797-3a33-4e35-9754-8a63200dc20e-StreamThread-1] Detected a task that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Trying to rejoin the consumer group now.
org.apache.kafka.streams.errors.TaskMigratedException: StreamsTask taskId: 0_0
ProcessorTopology:
KSTREAM-SOURCE-0000000000:
topics:
[INPUT_TOPIC]
children: [KSTREAM-PEEK-0000000001]
KSTREAM-PEEK-0000000001:
children: [KSTREAM-MAP-0000000002]
KSTREAM-MAP-0000000002:
children: [KSTREAM-SINK-0000000003]
KSTREAM-SINK-0000000003:
topic:
OUTPUT_TOPIC
Partitions [INPUT_TOPIC-0]
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:238)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:918)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:798)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: task [0_0] Abort sending since producer got fenced with a previous record
I'm not sure what is causing this exception. When I restart the application, it appears to process a few records successfully before failing with the same exception. Strangely enough, the records are processed successfully several times even though the stream is set to exactly-once processing. Here is the stream configuration:
Properties streamProperties = new Properties();
streamProperties.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
streamProperties.put(StreamsConfig.APPLICATION_ID_CONFIG, service.getName());
streamProperties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");
//Should be DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG - but that field is private.
streamProperties.put("default.production.exception.handler", ErrorHandler.class);
streamProperties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerUrl);
streamProperties.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
streamProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
streamProperties.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
streamProperties.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
Out of the three servers, only two generate relevant logs when restarting the streams application. Here are logs from the first server:
[2018-05-09 14:42:14,635] INFO [GroupCoordinator 1]: Member INPUT_TOPIC-09dd8ac8-2cd6-4dd1-b963-63ea804c8fcc-StreamThread-1-consumer-3fedb398-91fe-480a-b5ee-1b5879d0956c in group INPUT_TOPIC has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:14,636] INFO [GroupCoordinator 1]: Preparing to rebalance group INPUT_TOPIC with old generation 1 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:14,636] INFO [GroupCoordinator 1]: Group INPUT_TOPIC with generation 2 is now empty (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,848] INFO [GroupCoordinator 1]: Preparing to rebalance group INPUT_TOPIC with old generation 2 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,848] INFO [GroupCoordinator 1]: Stabilized group INPUT_TOPIC generation 3 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,871] INFO [GroupCoordinator 1]: Assignment received from leader for group INPUT_TOPIC for generation 3 (kafka.coordinator.group.GroupCoordinator)
And from the second server:
[2018-05-09 14:42:16,228] INFO [TransactionCoordinator id=0] Initialized transactionalId INPUT_TOPIC-0_0 with producerId 2010 and producer epoch 37 on partition __transaction_state-37 (kafka.coordinator.transaction.TransactionCoordinator)
[2018-05-09 14:44:22,121] INFO [TransactionCoordinator id=0] Completed rollback ongoing transaction of transactionalId: INPUT_TOPIC-0_0 due to timeout (kafka.coordinator.transaction.TransactionCoordinator)
[2018-05-09 14:44:42,263] ERROR [ReplicaManager broker=0] Error processing append operation on partition OUTPUT_TOPIC-0 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.ProducerFencedException: Producer's epoch is no longer valid. There is probably another producer with a newer epoch. 37 (request epoch), 38 (server epoch)
It appears that the first server sees that the consumer has failed and removes it from the consumer group before it is registered with the second server. Any ideas what could be causing the consumer to fail, or how to handle this failure gracefully? It's possible that it is this bug; does anyone know of a possible workaround?
I'm not sure what caused the problem, but reducing max.poll.records to 1 fixed it.
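One possible reading of why this helps, not confirmed by the original poster: the second broker's log shows the transaction being rolled back due to a timeout shortly before the producer is fenced, which would fit a batch of records taking longer to process than the transaction may stay open. Under that assumption, the sketch below picks the largest max.poll.records value that fits a time budget. The class name, both method names, and the idea of using the transaction timeout as the budget are hypothetical.

```java
import java.util.Properties;

final class PollSizing {
    // Largest batch whose processing fits into budgetMs, never less than one record.
    // perRecordMs is an assumed worst-case processing time for a single record.
    static int maxPollRecords(long perRecordMs, long budgetMs) {
        if (perRecordMs <= 0) {
            throw new IllegalArgumentException("perRecordMs must be positive");
        }
        long fit = budgetMs / perRecordMs;
        return (int) Math.max(1L, Math.min((long) Integer.MAX_VALUE, fit));
    }

    // Stores the value under the plain consumer key, the same key that
    // ConsumerConfig.MAX_POLL_RECORDS_CONFIG resolves to.
    static Properties apply(Properties streamProperties, int maxPollRecords) {
        streamProperties.put("max.poll.records", maxPollRecords);
        return streamProperties;
    }
}
```

For example, `PollSizing.apply(streamProperties, PollSizing.maxPollRecords(90_000, 60_000))` ends up setting max.poll.records to 1, the value that worked here.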

Kafka Streams - Rebalancing exception in Kafka 1.0.0

In Kafka Streams 1.0.0, I ran into a strange error. My streams app ingests a Kafka topic and emits multiple aggregations to different state stores. The app works on a cluster of 2 nodes, but the moment a third one is added, it crashes on all the nodes.
The following is the exception that I get:
2017-12-12 15:32:55 ERROR Kafka010Base:47 - Exception caught in thread c-7-aq32-5648256f-9142-49e2-98c0-da792e6da48e-StreamThread-5
org.apache.kafka.common.KafkaException: Unexpected error from SyncGroup: The server experienced an unexpected error when processing the request
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:566)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:539)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:808)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:788)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:204)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:127)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:506)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:268)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:214)
Looking further on the brokers, I saw another exception:
[2017-12-12 17:28:36,822] INFO [GroupCoordinator 90]: Preparing to rebalance group c-7-aq32 with old generation 25 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:40,500] INFO [GroupCoordinator 90]: Stabilized group c-7-aq32 generation 26 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:42,290] INFO [GroupCoordinator 90]: Assignment received from leader for group c-7-aq32 for generation 26 (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:42,300] ERROR [GroupMetadataManager brokerId=90] Appending metadata message for group c-7-aq32 generation 26 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.group.GroupMetadataManager)
[2017-12-12 17:28:42,301] INFO [GroupCoordinator 90]: Preparing to rebalance group c-7-aq32 with old generation 26 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:52,301] INFO [GroupCoordinator 90]: Member c-7-aq32-6138ec53-4aff-4596-8f4b-44ae6f5d72da-StreamThread-13-consumer-e0cc0931-0619-4908-82c2-28f7bf9bace9 in group c-7-aq32 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
Now I need help understanding why I am getting RecordTooLargeException.
Changing the following property on the broker fixed the issue:
message.max.bytes=5242880 (5 MB)
The default value of this property is about 1 MB. When a consumer group rebalances, the group coordinator appends the group's metadata, including the partition assignments, to the __consumer_offsets topic. If that metadata record is larger than the limit, the append fails with RecordTooLargeException on the broker.
Although the question is old, I recently faced a similar issue.

Why does a Kafka consumer take a long time to start consuming?

We start a Kafka consumer, listening on a topic which may not yet be created (topic auto creation is enabled though).
Not long thereafter a producer is publishing messages on that topic.
However, it takes some time for the consumer to notice this: 5 minutes to be exact. At this point the consumer revokes its partitions and rejoins the consumer group. Kafka re-stabilizes the group. Looking at the time-stamps of the consumer vs. kafka logs, this process is initiated at the consumer side.
I suppose this is expected behavior, but I would like to understand it. Is this actually a rebalance going on (from 0 to 1 partition)? If we created topics up front, would this not happen?
2017-02-01 08:36:45.692 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Revoking previously assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:36:45.692 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions revoked:[]
2017-02-01 08:36:45.693 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : (Re-)joining group tps-kafka-partitioning
2017-02-01 08:36:45.738 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : Successfully joined group tps-kafka-partitioning with generation 1
2017-02-01 08:36:45.747 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Setting newly assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:36:45.749 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions assigned:[]
2017-02-01 08:41:45.540 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Revoking previously assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:41:45.544 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions revoked:[]
2017-02-01 08:41:45.544 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : (Re-)joining group tps-kafka-partitioning
Kafka logs:
[2017-02-01 08:41:45,546] INFO [GroupCoordinator 1001]: Preparing to restabilize group tps-kafka-partitioning with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:41:45,546] INFO [GroupCoordinator 1001]: Stabilized group tps-kafka-partitioning generation 2 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:41:45,551] INFO [GroupCoordinator 1001]: Assignment received from leader for group tps-kafka-partitioning for generation 2 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:42:14,636] INFO [GroupCoordinator 1001]: Preparing to restabilize group tps-kafka-group-id with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:42:14,636] INFO [GroupCoordinator 1001]: Stabilized group tps-kafka-group-id generation 2 (kafka.coordinator.GroupCoordinator)
This is probably due to the default value of the metadata.max.age.ms parameter, which controls how often the consumer forces a refresh of topic metadata. The default is 300000 ms (5 minutes), which matches the delay you are seeing.
When you start the consumer against a nonexistent topic, the brokers auto-create the topic, but this takes a little time (leader election and so on). So when your consumer requests metadata for that topic, it gets a LEADER_NOT_AVAILABLE warning and can't fetch any messages.
Once metadata.max.age.ms has elapsed, the consumer refreshes its metadata, successfully this time, and starts reading messages. This does not depend on a producer writing messages to the topic; it is purely a consumer-side thing.
If you start your consumer with metadata.max.age.ms set to, for example, 1000 ms, you should see a much shorter delay until messages are consumed.
Also, if you create topics up front, or start the producer before the consumer, this behavior should not happen at all.
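A minimal sketch of the shorter refresh interval, assuming the consumer is configured through a plain java.util.Properties object. The class and method names are made up for the example; metadata.max.age.ms is the real consumer property.

```java
import java.util.Properties;

final class MetadataRefresh {
    // Returns a copy of the given consumer properties with metadata.max.age.ms overridden,
    // leaving the caller's original Properties untouched.
    static Properties withMetadataMaxAge(Properties consumerProps, long maxAgeMs) {
        Properties copy = new Properties();
        copy.putAll(consumerProps);
        copy.setProperty("metadata.max.age.ms", Long.toString(maxAgeMs));
        return copy;
    }
}
```

A very low value makes the consumer notice new topics and partitions sooner, at the cost of more frequent metadata requests to the brokers, so something like 1000 ms is probably better suited to tests than to production.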