I'm using Kafka 1.1.0. A Kafka Streams application consistently throws this exception (albeit with different messages):
WARN o.a.k.s.p.i.RecordCollectorImpl#onCompletion:166 - task [0_0] Error sending record (key KEY value VALUE timestamp TIMESTAMP) to topic OUTPUT_TOPIC due to Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.; No more records will be sent and no more offsets will be recorded for this task.
WARN o.a.k.s.p.i.AssignedStreamsTasks#closeZombieTask:202 - stream-thread [90556797-3a33-4e35-9754-8a63200dc20e-StreamThread-1] stream task 0_0 got migrated to another thread already. Closing it as zombie.
WARN o.a.k.s.p.internals.StreamThread#runLoop:752 - stream-thread [90556797-3a33-4e35-9754-8a63200dc20e-StreamThread-1] Detected a task that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Trying to rejoin the consumer group now.
org.apache.kafka.streams.errors.TaskMigratedException: StreamsTask taskId: 0_0
ProcessorTopology:
KSTREAM-SOURCE-0000000000:
topics:
[INPUT_TOPIC]
children: [KSTREAM-PEEK-0000000001]
KSTREAM-PEEK-0000000001:
children: [KSTREAM-MAP-0000000002]
KSTREAM-MAP-0000000002:
children: [KSTREAM-SINK-0000000003]
KSTREAM-SINK-0000000003:
topic:
OUTPUT_TOPIC
Partitions [INPUT_TOPIC-0]
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:238)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:918)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:798)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: task [0_0] Abort sending since producer got fenced with a previous record
I'm not sure what is causing this exception. When I restart the application, it appears to process a few records successfully before failing with the same exception. Strangely enough, the records end up being processed several times even though the stream is configured for exactly-once processing. Here is the stream configuration:
Properties streamProperties = new Properties();
streamProperties.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
streamProperties.put(StreamsConfig.APPLICATION_ID_CONFIG, service.getName());
streamProperties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");
//Should be DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG - but that field is private.
streamProperties.put("default.production.exception.handler", ErrorHandler.class);
streamProperties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerUrl);
streamProperties.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
streamProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
streamProperties.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
streamProperties.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
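For reference, the ErrorHandler wired in under default.production.exception.handler is not shown in the question. Such a handler is expected to implement org.apache.kafka.streams.errors.ProductionExceptionHandler; a minimal sketch (purely illustrative, not the asker's actual class) could look like this:
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;

// Illustrative sketch only -- not the actual ErrorHandler from the question.
public class ErrorHandler implements ProductionExceptionHandler {
    @Override
    public void configure(final Map<String, ?> configs) {
        // nothing to configure in this sketch
    }

    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        // Log the failed record and keep processing; return FAIL instead to stop the stream thread.
        System.err.println("Failed to produce record to " + record.topic() + ": " + exception);
        return ProductionExceptionHandlerResponse.CONTINUE;
    }
}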
Out of the three servers, only two generate relevant logs when restarting the streams application. Here are logs from the first server:
[2018-05-09 14:42:14,635] INFO [GroupCoordinator 1]: Member INPUT_TOPIC-09dd8ac8-2cd6-4dd1-b963-63ea804c8fcc-StreamThread-1-consumer-3fedb398-91fe-480a-b5ee-1b5879d0956c in group INPUT_TOPIC has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:14,636] INFO [GroupCoordinator 1]: Preparing to rebalance group INPUT_TOPIC with old generation 1 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:14,636] INFO [GroupCoordinator 1]: Group INPUT_TOPIC with generation 2 is now empty (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,848] INFO [GroupCoordinator 1]: Preparing to rebalance group INPUT_TOPIC with old generation 2 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,848] INFO [GroupCoordinator 1]: Stabilized group INPUT_TOPIC generation 3 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,871] INFO [GroupCoordinator 1]: Assignment received from leader for group INPUT_TOPIC for generation 3 (kafka.coordinator.group.GroupCoordinator)
And from the second server:
[2018-05-09 14:42:16,228] INFO [TransactionCoordinator id=0] Initialized transactionalId INPUT_TOPIC-0_0 with producerId 2010 and producer epoch 37 on partition __transaction_state-37 (kafka.coordinator.transaction.TransactionCoordinator)
[2018-05-09 14:44:22,121] INFO [TransactionCoordinator id=0] Completed rollback ongoing transaction of transactionalId: INPUT_TOPIC-0_0 due to timeout (kafka.coordinator.transaction.TransactionCoordinator)
[2018-05-09 14:44:42,263] ERROR [ReplicaManager broker=0] Error processing append operation on partition OUTPUT_TOPIC-0 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.ProducerFencedException: Producer's epoch is no longer valid. There is probably another producer with a newer epoch. 37 (request epoch), 38 (server epoch)
It appears that the first server sees that the consumer has failed and removes it from the consumer group before it is registered with the second server. Any ideas what could be causing the consumer to fail? Or any ideas on how to handle this failure gracefully? It's possible that it is this bug; does anyone know of a possible workaround?
I'm not sure what caused the problem, but reducing max.poll.records to 1 fixed it.
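For anyone applying the same workaround, it amounts to changing the consumer setting that already appears in the configuration above, e.g.
streamProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1); // was 10; now one record per poll()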
Related
I installed Kafka (3.1.0) as a StatefulSet on Kubernetes and then created a topic. Data is sent to this topic from outside Kubernetes through an HAProxy, and we consume it with an application.
Everything looks normal: the topic works fine and there is no problem with that consumer.
But if I try to consume the same topic with a different group, Kafka starts to get stuck. This doesn't always happen; it only happens after a consumer joins and leaves a few times.
Now I will try to explain this with an example. (I replaced some sensitive parts with ...)
There is a topic and it is already being consumed; the name of that group is "test1".
1- I now join this topic with the console consumer, as a new consumer.
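(The exact command isn't shown here; it would presumably be something like the line below, where omitting --group makes the tool generate a console-consumer-<id> group automatically.)
bin/kafka-console-consumer.sh --bootstrap-server ... --topic <TOPIC>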
Server Logs:
[2022-03-21 06:11:27,538] INFO [GroupCoordinator 0]: Dynamic member with unknown member id joins group console-consumer-18856 in Empty state. Created a new member id consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:27,637] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 0 (__consumer_offsets-39) (reason: Adding new member consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:30,639] INFO [GroupCoordinator 0]: Stabilized group console-consumer-18856 generation 1 (__consumer_offsets-39) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:30,748] INFO [GroupCoordinator 0]: Assignment received from leader consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 for group console-consumer-18856 for generation 1. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
2- Let's look at our groups
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
console-consumer-18856
test1
3- Now let's stop the consumer. (ctrl + c)
[2022-03-21 06:20:36,646] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 1 (__consumer_offsets-39) (reason: Removing member consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:36,646] INFO [GroupCoordinator 0]: Group console-consumer-18856 with generation 2 is now empty (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:36,647] INFO [GroupCoordinator 0]: Member MemberMetadata(memberId=consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641, groupInstanceId=None, clientId=consumer-console-consumer-18856-1, clientHost=/..., sessionTimeoutMs=10000, rebalanceTimeoutMs=300000, supportedProtocols=List(range)) has left group console-consumer-18856 through explicit `LeaveGroup` request (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:57,243] INFO [GroupMetadataManager brokerId=0] Group console-consumer-18856 transitioned to Dead in generation 2 (kafka.coordinator.group.GroupMetadataManager)
4- Let's look at our groups again. (The group disappears from the list only once the "transitioned to Dead in generation 2" line has appeared; until then it still shows up.)
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
test1
Everything is normal up to this point: a consumer group joins and leaves the topic cleanly. However, the situation changes when we repeat this join-and-leave cycle a few times.
1- Let's join the same topic again with a different group.
[2022-03-21 07:02:46,377] INFO [GroupCoordinator 0]: Dynamic member with unknown member id joins group console-consumer-43677 in Empty state. Created a new member id consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:46,510] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-43677 in state PreparingRebalance with old generation 0 (__consumer_offsets-38) (reason: Adding new member consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:49,511] INFO [GroupCoordinator 0]: Stabilized group console-consumer-43677 generation 1 (__consumer_offsets-38) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:49,650] INFO [GroupCoordinator 0]: Assignment received from leader consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 for group console-consumer-43677 for generation 1. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
2- Let's look at our groups
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
console-consumer-43677
test1
3- Now let's stop the consumer. (ctrl + c)
This is where the problem starts: no leave-group log appears after stopping the consumer. Occasionally a log like the one below shows up instead.
2022-03-21 07:03:30,045] INFO [GroupCoordinator 0]: Member consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 in group console-consumer-43677 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
And Kafka gets stuck. I can no longer consume at all, and operations like create and delete no longer work; only list and describe still work.
1- If I try to delete the group, it won't let me, because even though I stopped the consumer (ctrl+c), the leave never actually completed.
bin/kafka-consumer-groups.sh --bootstrap-server ... --delete --group console-consumer-43677
Error: Deletion of some consumer groups failed:
* Group 'console-consumer-43677' could not be deleted due to: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=deleteConsumerGroups, deadlineMs=1647846924163, tries=1, nextAllowedTryMs=1647846924266) timed out at 1647846924166 after 1 attempt(s)
2- If I try to create a new topic:
bin/kafka-topics.sh --create --topic test-topic ...
Error while executing topic command : Call(callName=createTopics, deadlineMs=1647849153162, tries=2, nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)
[2022-03-21 10:52:33,170] ERROR org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, deadlineMs=1647849153162, tries=2, nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: createTopics
(kafka.admin.TopicCommand$)
3- If I try to join the topic with a consumer again, I get timeout errors, and the trace logs show "re-join group" errors.
I don't think this has anything to do with settings like session.timeout.ms or heartbeat.interval.ms; even if I couldn't consume, Kafka shouldn't lock up entirely.
Redeploying is the only way to fix the situation. But why does the error occur? Is this a bug? A race condition? Is there a solution? Is it specific to Kubernetes, or is it related to Kafka 3.1.0?
I am using Kafka with a Spring Boot application to consume messages. After reading some messages from the topic, I receive this error:
[2020-07-07 17:05:32,265] INFO [GroupCoordinator 1001]: Member consumer-1-708fe639-905d-432a-b6ba-17bf58adcf03 in group biker-service-prod has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2020-07-07 17:05:32,267] INFO [GroupCoordinator 1001]: Preparing to rebalance group biker-service-prod in state PreparingRebalance with old generation 460 (__consumer_offsets-7) (reason: removing member consumer-1-708fe639-905d-432a-b6ba-17bf58adcf03 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
[2020-07-07 17:05:32,267] INFO [GroupCoordinator 1001]: Group biker-service-prod with generation 461 is now empty (__consumer_offsets-7) (kafka.coordinator.group.GroupCoordinator)
This is my config on the consumer side:
spring.kafka.listener.poll-timeout=3000000
spring.kafka.consumer.heartbeat-interval=500
spring.kafka.consumer.fetch-max-wait=3000000
spring.kafka.consumer.auto-commit-interval=1000
How can I fix it?
The broker has decided your consumer is dead, as it didn't call poll() within the needed interval.
Take a look at these consumer-side parameters:
max.poll.interval.ms
session.timeout.ms
These determine when your consumer is considered dead, which produces the messages you are seeing. Pay attention to how long you spend processing each message; if it exceeds the configured max.poll.interval.ms, you should either:
Decrease the time needed to process a message so that poll() is called again in time,
or
Increase the max.poll.interval.ms and session.timeout.ms intervals.
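As a rough sketch of what that tuning can look like in plain consumer properties (the values here are placeholders to adapt to your actual processing time; in Spring Boot these keys are usually passed through the spring.kafka.consumer.properties.* map):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties props = new Properties();
// allow up to 10 minutes between poll() calls before the consumer is kicked out of the group
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600000);
// broker evicts the member if no heartbeat arrives within this window
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
// heartbeats should be well below session.timeout.ms (commonly about a third of it)
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000);
// alternatively, fetch fewer records per poll() so each batch finishes sooner
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);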
We have 5 Kafka 1.0.0 clusters:
4 of them are made of 3 nodes and are in different regions of the world,
the last one is made of 5 nodes and is an aggregate-only cluster.
We are using MirrorMaker (referred to below as MM) to read from the regional clusters and copy the data to the aggregate cluster in our HQ datacenter.
Not being sure where to run it, we currently have two setups in our production environment:
MM in the region: reading locally and pushing to the aggregate cluster in the remote data center (DC), then committing offsets locally. I tend to call this the push mode (pushing the data).
MM in the DC of the aggregate cluster: reading the data remotely and writing it locally, then committing the offsets on the remote DC.
What happened is that the entire DC hosting our aggregate cluster got completely isolated from a network point of view, and in both cases we ended up with duplicated records in our aggregate cluster.
Push mode = MM local to the regional cluster, pushing data to remote aggregate cluster
MM started to throw errors like this:
WARN [Producer clientId=producer-1] Got error produce response with correlation id 674364 on topic-partition <topic>-4, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
then:
WARN [Producer clientId=producer-1] Connection to node 1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
which is ok so far because of idempotence.
But finally we got errors like:
ERROR Error when sending message to topic debug_sip_callback-delivery with key: null, value: 1640 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for <topic>-4: 30032 ms has passed since batch creation plus linger time
ERROR Error when sending message to topic <topic> with key: null, value: 1242 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
java.lang.IllegalStateException: Producer is closed forcefully.
causing MM to stop, and I think this is what caused the duplicates (I need to dig into the code, but it could be that it lost the idempotence state and, on restart, resumed from the previously committed offsets).
Pull mode = MM local to the aggregate cluster, pulling data from remote regional cluster
MM instances (with logs at INFO level in this case) started seeing the broker as dead:
INFO [Consumer clientId=mirror-maker-region1-agg-0, groupId=mirror-maker-region1-agg] Marking the coordinator kafka1.region1.internal:9092 (id: 2147483646 rack: null) dead (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
At the same time on the broker side, we got:
INFO [GroupCoordinator 1]: Member mirror-maker-region1-agg-0-de2af312-befb-4af7-b7b0-908ca8ecb0ed in group mirror-maker-region1-agg has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
...
INFO [GroupCoordinator 1]: Group mirror-maker-region1-agg with generation 42 is now empty (__consumer_offsets-2) (kafka.coordinator.group.GroupCoordinator)
Later on MM side, a lot of:
WARN [Consumer clientId=mirror-maker-region1-agg-0, groupId=mirror-maker-region1-agg] Connection to node 2 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
and finally when network came back:
ERROR [Consumer clientId=mirror-maker-region1-agg-0, groupId=mirror-maker-region1-agg] Offset commit failed on partition <topic>-dr-8 at offset 382424879: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
i.e., it could not commit in region1 the offsets of the data already written to the aggregate cluster because of the rebalance, and after the rebalance it resumed from the previously committed offsets, causing duplicates.
Configuration
Our MM instances are configured like this:
For our consumer:
bootstrap.servers=kafka1.region1.internal:9092,kafka2.region1.internal:9092,kafka3.region1.internal:9092
group.id=mirror-maker-region-agg
auto.offset.reset=earliest
isolation.level=read_committed
For our producer:
bootstrap.servers=kafka1.agg.internal:9092,kafka2.agg.internal:9092,kafka3.agg.internal:9092,kafka4.agg.internal:9092,kafka5.agg.internal:9092
compression.type=none
request.timeout.ms=30000
max.block.ms=60000
linger.ms=15000
max.request.size=1048576
batch.size=32768
buffer.memory=134217728
retries=2147483647
max.in.flight.requests.per.connection=1
acks=all
enable.idempotence=true
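(For context: these two property files are what the MM instances are started with. The exact launch command isn't shown above; it would typically be along the following lines, with the file names and topic pattern being placeholders.)
bin/kafka-mirror-maker.sh --consumer.config consumer.properties --producer.config producer.properties --whitelist "<topic-pattern>"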
Any idea how we can get "only once" delivery (on top of the exactly-once features) in the case of DCs isolated for 30 minutes?
In Kafka Streams 1.0.0, we saw a strange error come my way. My streams app ingests a Kafka topic and emits multiple aggregations to different state stores. The app works on a cluster of 2 nodes, but the moment a third one is added, the streams app crashes on all the nodes.
Following is the exception that I get:
2017-12-12 15:32:55 ERROR Kafka010Base:47 - Exception caught in thread c-7-aq32-5648256f-9142-49e2-98c0-da792e6da48e-StreamThread-5
org.apache.kafka.common.KafkaException: Unexpected error from SyncGroup: The server experienced an unexpected error when processing the request
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:566)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:539)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:808)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:788)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:204)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:127)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:506)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:268)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:214)
Taking a further look at the brokers, I saw another exception:
[2017-12-12 17:28:36,822] INFO [GroupCoordinator 90]: Preparing to rebalance group c-7-aq32 with old generation 25 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:40,500] INFO [GroupCoordinator 90]: Stabilized group c-7-aq32 generation 26 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:42,290] INFO [GroupCoordinator 90]: Assignment received from leader for group c-7-aq32 for generation 26 (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:42,300] ERROR [GroupMetadataManager brokerId=90] Appending metadata message for group c-7-aq32 generation 26 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.group.GroupMetadataManager)
[2017-12-12 17:28:42,301] INFO [GroupCoordinator 90]: Preparing to rebalance group c-7-aq32 with old generation 26 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:52,301] INFO [GroupCoordinator 90]: Member c-7-aq32-6138ec53-4aff-4596-8f4b-44ae6f5d72da-StreamThread-13-consumer-e0cc0931-0619-4908-82c2-28f7bf9bace9 in group c-7-aq32 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
Now, I need help understanding why I would be getting a RecordTooLargeException.
Changing the following property on the brokers fixed the issue:
message.max.bytes=5242880 (5 MB)
The default value for this property is 1 MB. When the consumers in a consumer group rebalance, each consumer writes metadata about its assigned and revoked partitions back to Kafka. If this metadata is larger than 1 MB, the broker rejects the write and this shows up as a RecordTooLargeException.
Although the question is old, I have also faced a similar issue recently.
I am running a multi-threaded Kafka 0.9 consumer (the new consumer API).
The way I generate a client.id is by combining the hostname the consumer is running on, an AtomicInteger, and the PID of the process.
I run into issues when I have to stop the consumer and restart it. The consumer keeps trying to process the offsets that were not consumed by the previous run (about 100 of them), but it keeps failing with this message:
2016-10-21 14:22:55,293 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Marking the coordinator 2147483647 dead.
2016-10-21 14:22:55,295 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Error UNKNOWN_MEMBER_ID occurred while committing offsets for group x.cg
2016-10-21 14:22:55,296 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:552)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:493)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:665)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:644)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:274)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:134)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorKnown(AbstractCoordinator.java:184)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:886)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
at com.kfc.kafka.consumer.KFCConsumer$KafkaConsumerRunner.run(KFCConsumer.java:102)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-21 14:22:55,397 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to join group x.cg failed due to unknown member id, resetting and retrying.
.........
2016-10-21 14:22:58,124 [pool-3-thread-3] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to heart beat failed since the group is rebalancing, try to re-join group.
From the Kafka log, I see a lot of rebalances happening:
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordin
It turns out we were having long, recurring pauses (slow network, problems with external components, etc.) in the external components our consumer was interacting with.
The solution was to split our consumer into three consumers with different consumer groups and different Kafka configs (heartbeat.interval.ms, session.timeout.ms, request.timeout.ms, max.partition.fetch.bytes).
Having three consumers with custom values for those properties got rid of the problem.
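As a rough sketch of what such a per-workload override can look like (the values and group name below are purely illustrative, not the actual production settings):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

Properties slowConsumerProps = new Properties();
slowConsumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "x.cg.slow-external"); // separate group per workload (hypothetical name)
// tolerate longer gaps caused by slow external calls
slowConsumerProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
slowConsumerProps.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000);
slowConsumerProps.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
// pull smaller batches so each poll()-and-process cycle stays short
slowConsumerProps.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 262144);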
The general lesson is not to do a lot of external communication from within the consumer, as it increases the uncertainty of the Kafka consumer's behavior; when you do have external communication, make sure the Kafka consumer configs are in line with the SLAs of the external components.