kafka broker Connection to INTERNAL_BROKER_DNS failed - apache-kafka

I have 3 Kafka brokers in MSK. Two of them give the error below.
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file LOG_DIR/__amazon_msk_connect_offsets_debezium-kafka-connector-x/leader-epoch-checkpoint
Caused by: java.io.IOException: No space left on device
The third broker gives the error below.
[2022-10-18 11:50:43,383] INFO [GroupCoordinator 1]: Member connect-1-97b56132-1f54-46a4-91f1-8d31e61e18a9 in group __amazon_msk_connect_cluster_debezium-kafka-connector-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,383] INFO [GroupCoordinator 1]: Stabilized group __amazon_msk_connect_cluster_debezium-kafka-connector-x generation 1115 (__consumer_offsets-18) (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,387] INFO [GroupCoordinator 1]: Assignment received from leader for group __amazon_msk_connect_cluster_debezium-kafka-connector-x for generation 1115 (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,387] INFO [GroupCoordinator 1]: Preparing to rebalance group __amazon_msk_connect_cluster_debezium-kafka-connector-x in state PreparingRebalance with old generation 1115 (__consumer_offsets-18) (reason: error when storing group assignment during SyncGroup (member: connect-1-9c20e001-f852-4007-8614-78a5c27207f6)) (kafka.coordinator.group.GroupCoordinator)
[2022-10-18 11:50:43,876] WARN [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={test-0=PartitionData(fetchOffset=149547, logStartOffset=7151, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), test-0=PartitionData(fetchOffset=74123, logStartOffset=826, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=INVALID, epoch=INITIAL), rackId=) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to INTERNAL_BROKER_DNS (id: 3 rack: null) failed.
I want to know why I'm getting the Error in response for fetch request error above. What does it mean?

No space left on device - you need a larger EBS volume, or you need to remove log segments from the disk manually, assuming you can SSH into the broker. If not, contact MSK Support.
Given that two of the brokers are failing because they are out of storage, the remaining broker will simply be unable to connect to them; that is what the Error in response for fetch request / Connection to INTERNAL_BROKER_DNS failed messages on the third broker mean.
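If you want to confirm from the client side which brokers are out of space before resizing the volume, the Admin API can report per-broker log-dir usage. A minimal sketch (requires a reasonably recent kafka-clients, roughly 2.7+ for allDescriptions(); the bootstrap address and broker ids are placeholders):
import java.util.Arrays;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.LogDirDescription;
import org.apache.kafka.clients.admin.ReplicaInfo;

public class CheckBrokerLogDirs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder: use your MSK bootstrap brokers string here.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "BOOTSTRAP_BROKERS:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Broker ids 1, 2, 3 are placeholders for the three MSK brokers.
            Map<Integer, Map<String, LogDirDescription>> byBroker =
                    admin.describeLogDirs(Arrays.asList(1, 2, 3)).allDescriptions().get();

            byBroker.forEach((brokerId, dirs) -> dirs.forEach((path, desc) -> {
                long replicaBytes = desc.replicaInfos().values().stream()
                        .mapToLong(ReplicaInfo::size).sum();
                // A non-null error here (e.g. KafkaStorageException) points at the sick broker.
                System.out.printf("broker %d dir %s error=%s replicaBytes=%d%n",
                        brokerId, path, desc.error(), replicaBytes);
            }));
        }
    }
}
Once a broker reports KafkaStorageException, the usual remedies are a larger volume or tighter retention so old segments get deleted.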

Related

Kafka built on Kubernetes gets stuck while consuming

I installed Kafka (3.1.0) as a StatefulSet on Kubernetes and created a topic. Data arrives from outside the Kubernetes cluster through HAProxy and is produced to this topic, and we consume it with an application.
Everything looks normal: the topic works fine and there is no problem with that consumer.
But if I try to consume this topic with a different group, Kafka starts to get clogged. This doesn't always happen; it only happens after joining as a consumer a few times and leaving.
Now I will try to explain this with an example. (I replaced some special parts with ...)
There is a Topic and it is already being consumed. The name of the group is "test1".
1- I am now joining this Topic with console-consumer as a new consumer.
Server Logs:
[2022-03-21 06:11:27,538] INFO [GroupCoordinator 0]: Dynamic member with unknown member id joins group console-consumer-18856 in Empty state. Created a new member id consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:27,637] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 0 (__consumer_offsets-39) (reason: Adding new member consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:30,639] INFO [GroupCoordinator 0]: Stabilized group console-consumer-18856 generation 1 (__consumer_offsets-39) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:11:30,748] INFO [GroupCoordinator 0]: Assignment received from leader consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 for group console-consumer-18856 for generation 1. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
2- Let's look at our groups
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
console-consumer-18856
test1
3- Now let's stop the consumer. (ctrl + c)
[2022-03-21 06:20:36,646] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-18856 in state PreparingRebalance with old generation 1 (__consumer_offsets-39) (reason: Removing member consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:36,646] INFO [GroupCoordinator 0]: Group console-consumer-18856 with generation 2 is now empty (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:36,647] INFO [GroupCoordinator 0]: Member MemberMetadata(memberId=consumer-console-consumer-18856-1-7898e8a9-e182-4d1c-8c62-556181ca0641, groupInstanceId=None, clientId=consumer-console-consumer-18856-1, clientHost=/..., sessionTimeoutMs=10000, rebalanceTimeoutMs=300000, supportedProtocols=List(range)) has left group console-consumer-18856 through explicit `LeaveGroup` request (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 06:20:57,243] INFO [GroupMetadataManager brokerId=0] Group console-consumer-18856 transitioned to Dead in generation 2 (kafka.coordinator.group.GroupMetadataManager)
4- Let's look at our groups again. (Of course, only once the "transitioned to Dead in generation 2" line has appeared; otherwise the group still shows up in the list.)
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
test1
Everything is normal up to this point: a consumer group joins and leaves the topic cleanly. However, the situation changes when we repeat this join-and-leave process a few times.
1- Let's re-enter the same Topic with a different group.
[2022-03-21 07:02:46,377] INFO [GroupCoordinator 0]: Dynamic member with unknown member id joins group console-consumer-43677 in Empty state. Created a new member id consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 and request the member to rejoin with this id. (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:46,510] INFO [GroupCoordinator 0]: Preparing to rebalance group console-consumer-43677 in state PreparingRebalance with old generation 0 (__consumer_offsets-38) (reason: Adding new member consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:49,511] INFO [GroupCoordinator 0]: Stabilized group console-consumer-43677 generation 1 (__consumer_offsets-38) with 1 members (kafka.coordinator.group.GroupCoordinator)
[2022-03-21 07:02:49,650] INFO [GroupCoordinator 0]: Assignment received from leader consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 for group console-consumer-43677 for generation 1. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
2- Let's look at our groups
bin/kafka-consumer-groups.sh --bootstrap-server ... --list
console-consumer-43677
test1
3- Now let's stop the consumer. (ctrl + c)
Here the problem starts. No exit log is seen after stopping the consumer. Sometimes a log like the one below may appear.
2022-03-21 07:03:30,045] INFO [GroupCoordinator 0]: Member consumer-console-consumer-43677-1-8809ea17-b557-4571-8c27-8cdbe417e052 in group console-consumer-43677 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
At this point Kafka gets stuck. I can no longer consume at all, and operations like create and delete no longer work; only list and describe still work.
1- If I try to delete the group, it won't let me, because even though I stopped the consumer (ctrl+c), it apparently never actually left the group.
bin/kafka-consumer-groups.sh --bootstrap-server ... --delete --group console-consumer-43677
Error: Deletion of some consumer groups failed:
* Group 'console-consumer-43677' could not be deleted due to: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=deleteConsumerGroups, deadlineMs=1647846924163, tries=1, nextAllowedTryMs=1647846924266) timed out at 1647846924166 after 1 attempt(s)
2- If I try to create a new topic:
bin/kafka-topics.sh --create --topic test-topic ...
Error while executing topic command : Call(callName=createTopics, deadlineMs=1647849153162, tries=2, nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)
[2022-03-21 10:52:33,170] ERROR org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, deadlineMs=1647849153162, tries=2, nextAllowedTryMs=1647849153263) timed out at 1647849153163 after 2 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: createTopics
(kafka.admin.TopicCommand$)
3- If I try to join the topic with a consumer again, it gives timeout errors, and the trace logs show repeated "re-join group" messages. I don't think this has anything to do with settings like session.timeout.ms or the heartbeat, because even if I couldn't consume, Kafka shouldn't lock up entirely.
Redeploying is the only way to fix the situation. But why does the error occur? Is this a bug? A race condition? Is there a solution? Is it specific to Kubernetes, or related to Kafka 3.1.0?

Unexpected failing/rebalancing of consumers

Using Apache Kafka 2.1.0 and spring-kafka 2.1.7, we are getting error messages like the following on our spring-kafka consumer-clients:
2019-01-13 23:01:34.019 consumer-1-C-1 LogContext$KafkaLogger.error SEVERE: [Consumer clientId=consumer-2, groupId=kafka-consumer-group-x] Offset commit failed on partition topic-x-16 at offset 57882: The coordinator is not aware of this member.
A few seconds before this error, we can see the following log messages on one of the Kafka brokers:
[2019-01-13 23:01:17,329] INFO [GroupCoordinator 2]: Member consumer-30-13dc06ff-aed2-4e4e-a66d-2d60d79ac526 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:17,330] INFO [GroupCoordinator 2]: Preparing to rebalance group kafka-consumer-group-x in state PreparingRebalance with old generation 1370 (__consumer_offsets-40) (reason: removing member consumer-30-13dc06ff-aed2-4e4e-a66d-2d60d79ac526 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:17,330] INFO [GroupCoordinator 2]: Member consumer-20-ba370e86-e1cc-4261-a73c-78cea1b00479 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:17,335] INFO [GroupCoordinator 2]: Member consumer-32-be8807df-b88f-4cc9-bddf-bed772d1244f in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:17,335] INFO [GroupCoordinator 2]: Member consumer-17-3e34f026-894e-40dc-916b-d169a43da135 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:17,335] INFO [GroupCoordinator 2]: Member consumer-31-4dd9cb6e-09e9-47db-9610-37e0ab5633e0 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:17,335] INFO [GroupCoordinator 2]: Member consumer-18-90175650-1224-4f22-9350-246e17e75367 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,332] INFO [GroupCoordinator 2]: Member consumer-19-663239af-9702-4e59-ad3d-f8202e9d579d in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,347] INFO [GroupCoordinator 2]: Member consumer-22-c54fb4c0-1fa1-4d9f-91fc-1da6df41b227 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,347] INFO [GroupCoordinator 2]: Member consumer-25-3bfd915c-8bd1-454b-85e3-60212b4c568e in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,347] INFO [GroupCoordinator 2]: Member consumer-27-cbb97ebf-b5cd-4cfa-991a-5302462ddab9 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,615] INFO [GroupCoordinator 2]: Member consumer-24-37fbcc73-e8c6-4820-ad56-580fd88f5a10 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,618] INFO [GroupCoordinator 2]: Member consumer-21-eea1b841-202e-4ebe-bdde-007775d001dd in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,636] INFO [GroupCoordinator 2]: Member consumer-28-881da47e-87c9-4675-9f88-e3b33748cff1 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,708] INFO [GroupCoordinator 2]: Member consumer-26-375880ee-b2a9-4ece-8eee-987d282956d8 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,708] INFO [GroupCoordinator 2]: Member consumer-23-492417e9-f3cb-4bec-bbac-130895356907 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,731] INFO [GroupCoordinator 2]: Member consumer-29-64732e9a-2c2b-44fb-a8a5-f606462a4201 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:18,947] INFO [GroupCoordinator 2]: Member consumer-10-fdd0ca92-3604-46de-9e2b-97ca41d36150 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,228] INFO [GroupCoordinator 2]: Member consumer-3-feb6986d-79af-4c64-a8f8-2dbb3bdb73c3 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,257] INFO [GroupCoordinator 2]: Member consumer-2-0345e5d5-86fc-4df0-bd39-c35b75514cea in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,257] INFO [GroupCoordinator 2]: Member consumer-1-c301f59f-8a56-4bdb-a5ef-dc163232d378 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,257] INFO [GroupCoordinator 2]: Member consumer-13-56aea64a-ecca-45e7-9474-b8f1163d01c8 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,266] INFO [GroupCoordinator 2]: Member consumer-9-3ee76e0e-86f1-4c0c-85cc-d07721bf36b1 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,273] INFO [GroupCoordinator 2]: Member consumer-4-9fa81414-870d-444d-b5d1-c38ce5c157a8 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,296] INFO [GroupCoordinator 2]: Member consumer-14-8236578f-b60d-4199-b621-913d025149d1 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,656] INFO [GroupCoordinator 2]: Member consumer-12-2921b7de-1721-460f-adbf-4fb6951cca22 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,665] INFO [GroupCoordinator 2]: Member consumer-11-09d7015c-cc33-464e-93ac-fb270f209b3f in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,667] INFO [GroupCoordinator 2]: Member consumer-5-b3fe06ff-8ef4-4d60-8571-68b7cfee12bc in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,722] INFO [GroupCoordinator 2]: Member consumer-15-5af82ca6-0ebf-463e-b9c5-4bbde513453d in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,754] INFO [GroupCoordinator 2]: Member consumer-7-c1e2bf89-c7c5-4363-b099-191956ed1c89 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,848] INFO [GroupCoordinator 2]: Member consumer-6-9b3be0e4-c1be-4d6a-98b1-caa9d095c403 in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,848] INFO [GroupCoordinator 2]: Member consumer-16-0f48ad44-402a-4706-9d78-9d0d5077a56d in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,848] INFO [GroupCoordinator 2]: Member consumer-8-0496aa54-79f7-41b8-8f31-7823ed72f16a in group kafka-consumer-group-x has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:19,848] INFO [GroupCoordinator 2]: Group kafka-consumer-group-x with generation 1371 is now empty (__consumer_offsets-40) (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:35,226] INFO [GroupCoordinator 2]: Preparing to rebalance group kafka-consumer-group-x in state PreparingRebalance with old generation 1371 (__consumer_offsets-40) (reason: Adding new member consumer-1-7787a334-acf2-4534-bc19-78af35371bfb) (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:38,227] INFO [GroupCoordinator 2]: Stabilized group kafka-consumer-group-x generation 1372 (__consumer_offsets-40) (kafka.coordinator.group.GroupCoordinator)
[2019-01-13 23:01:38,239] INFO [GroupCoordinator 2]: Assignment received from leader for group kafka-consumer-group-x for generation 1372 (kafka.coordinator.group.GroupCoordinator)
As we don't see any errors while processing the messages, or any sign that processing takes too much time, we can't explain these sudden rebalances.
Does anyone have a hint where this could come from?
The configuration for our consumer is mostly default, with enable.auto.commit=false and AckMode.RECORD.
Reasons to rebalance in Kafka:
1. A new consumer joins the group
2. A consumer leaves the group (clean shutdown)
3. New partitions are added
4. A consumer appears to be dead from Kafka's point of view
Reasons for the 4th:
4.1. The consumer couldn't poll within max.poll.interval.ms (a long-running process)
4.2. The consumer couldn't send a heartbeat to Kafka within session.timeout.ms (normally the heartbeat thread runs every heartbeat.interval.ms, 3 seconds by default)
Your situation looks like 4.2. There can be various reasons for that. To solve the problem you can increase session.timeout.ms (default 10 seconds), as sketched below. Another option is to make sure the heartbeat thread can run as expected (avoid high I/O wait, balance the load, etc.).
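For example, with the plain Java consumer the relevant timeouts can be raised like this (a sketch with illustrative values; with spring-kafka the same keys go into the consumer factory properties):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RebalanceTuning {
    // Illustrative values only; tune them against your actual processing and network behaviour.
    static Properties timeoutOverrides() {
        Properties props = new Properties();
        // Time the coordinator waits for a heartbeat before evicting the member (default 10 s).
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
        // The background heartbeat thread should fire well inside the session timeout (default 3 s).
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000");
        // Maximum allowed gap between poll() calls for long-running record processing (default 5 min).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");
        return props;
    }
}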
I am fairly sure you are running into KAFKA-7196.
You should upgrade your server to 2.0.1 or later.
As a workaround, you might try configuring a random client.id every time you start up, but this can have some unwanted side effects.
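If you want to try that workaround, it is just a matter of overriding client.id at startup; a minimal sketch (the prefix is arbitrary):
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RandomClientId {
    // A sketch of the workaround mentioned above: give each process start a fresh client.id.
    static Properties withRandomClientId(Properties baseConsumerProps) {
        Properties props = new Properties();
        props.putAll(baseConsumerProps);
        props.put(ConsumerConfig.CLIENT_ID_CONFIG, "consumer-" + UUID.randomUUID());
        return props;
    }
}
Keep in mind that client metrics and quotas are keyed by client.id, which is one possible "unwanted side effect" of randomizing it.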

ProducerFencedException Processing Kafka Stream

I'm using Kafka 1.1.0. A Kafka Streams application consistently throws this exception (albeit with different messages):
WARN o.a.k.s.p.i.RecordCollectorImpl#onCompletion:166 - task [0_0] Error sending record (key KEY value VALUE timestamp TIMESTAMP) to topic OUTPUT_TOPIC due to Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.; No more records will be sent and no more offsets will be recorded for this task.
WARN o.a.k.s.p.i.AssignedStreamsTasks#closeZombieTask:202 - stream-thread [90556797-3a33-4e35-9754-8a63200dc20e-StreamThread-1] stream task 0_0 got migrated to another thread already. Closing it as zombie.
WARN o.a.k.s.p.internals.StreamThread#runLoop:752 - stream-thread [90556797-3a33-4e35-9754-8a63200dc20e-StreamThread-1] Detected a task that got migrated to another thread. This implies that this thread missed a rebalance and dropped out of the consumer group. Trying to rejoin the consumer group now.
org.apache.kafka.streams.errors.TaskMigratedException: StreamsTask taskId: 0_0
ProcessorTopology:
KSTREAM-SOURCE-0000000000:
topics:
[INPUT_TOPIC]
children: [KSTREAM-PEEK-0000000001]
KSTREAM-PEEK-0000000001:
children: [KSTREAM-MAP-0000000002]
KSTREAM-MAP-0000000002:
children: [KSTREAM-SINK-0000000003]
KSTREAM-SINK-0000000003:
topic:
OUTPUT_TOPIC
Partitions [INPUT_TOPIC-0]
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:238)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:918)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:798)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: task [0_0] Abort sending since producer got fenced with a previous record
I'm not sure what is causing this exception. When I restart the application, it appears to successfully process a few records before failing with the same exception. Strangely enough, some records are successfully processed several times even though the stream is set to exactly-once processing. Here is the stream configuration:
Properties streamProperties = new Properties();
streamProperties.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
streamProperties.put(StreamsConfig.APPLICATION_ID_CONFIG, service.getName());
streamProperties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");
//Should be DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG - but that field is private.
streamProperties.put("default.production.exception.handler", ErrorHandler.class);
streamProperties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerUrl);
streamProperties.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
streamProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
streamProperties.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
streamProperties.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
Out of the three servers, only two generate relevant logs when restarting the streams application. Here are logs from the first server:
[2018-05-09 14:42:14,635] INFO [GroupCoordinator 1]: Member INPUT_TOPIC-09dd8ac8-2cd6-4dd1-b963-63ea804c8fcc-StreamThread-1-consumer-3fedb398-91fe-480a-b5ee-1b5879d0956c in group INPUT_TOPIC has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:14,636] INFO [GroupCoordinator 1]: Preparing to rebalance group INPUT_TOPIC with old generation 1 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:14,636] INFO [GroupCoordinator 1]: Group INPUT_TOPIC with generation 2 is now empty (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,848] INFO [GroupCoordinator 1]: Preparing to rebalance group INPUT_TOPIC with old generation 2 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,848] INFO [GroupCoordinator 1]: Stabilized group INPUT_TOPIC generation 3 (__consumer_offsets-29) (kafka.coordinator.group.GroupCoordinator)
[2018-05-09 14:42:15,871] INFO [GroupCoordinator 1]: Assignment received from leader for group INPUT_TOPIC for generation 3 (kafka.coordinator.group.GroupCoordinator)
And from the second server:
[2018-05-09 14:42:16,228] INFO [TransactionCoordinator id=0] Initialized transactionalId INPUT_TOPIC-0_0 with producerId 2010 and producer epoch 37 on partition __transaction_state-37 (kafka.coordinator.transaction.TransactionCoordinator)
[2018-05-09 14:44:22,121] INFO [TransactionCoordinator id=0] Completed rollback ongoing transaction of transactionalId: INPUT_TOPIC-0_0 due to timeout (kafka.coordinator.transaction.TransactionCoordinator)
[2018-05-09 14:44:42,263] ERROR [ReplicaManager broker=0] Error processing append operation on partition OUTPUT_TOPIC-0 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.ProducerFencedException: Producer's epoch is no longer valid. There is probably another producer with a newer epoch. 37 (request epoch), 38 (server epoch)
It appears that the first server sees that the consumer has failed and removes it from the consumer group before it is registered with the second server. Any ideas what could be causing the consumer to fail? Or any ideas on how to handle this failure gracefully? It's possible that it is this bug; does anyone know of a possible workaround?
I'm not sure what caused it, but reducing max.poll.records to 1 fixed the problem.
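In terms of the configuration shown in the question, that is a one-line change; a sketch (the fencing explanation in the comment is an inference from the broker logs above, not something the answer confirms):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class PollRecordsFix {
    // Applied to the stream configuration shown earlier (previously 10).
    static Properties withSingleRecordPolls(Properties streamProperties) {
        // Smaller batches keep each poll/process/commit cycle short, which presumably keeps
        // the transactional producer from being timed out and fenced while records are in flight
        // (compare the "Completed rollback ongoing transaction ... due to timeout" broker log).
        streamProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);
        return streamProperties;
    }
}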

Kafka Streams - Rebalancing exception in Kafka 1.0.0

In Kafka Streams 1.0.0, I ran into a strange error. My streams app ingests a Kafka topic and emits multiple aggregations into different state stores. The app works on a cluster of 2 nodes, but the moment a third node is added the streams app crashes on all of the nodes.
The following is the exception that I get:
2017-12-12 15:32:55 ERROR Kafka010Base:47 - Exception caught in thread c-7-aq32-5648256f-9142-49e2-98c0-da792e6da48e-StreamThread-5
org.apache.kafka.common.KafkaException: Unexpected error from SyncGroup: The server experienced an unexpected error when processing the request
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:566)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:539)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:808)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:788)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:204)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:127)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:506)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:353)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:268)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:214)
Taking a further look at the brokers, I saw another exception:
[2017-12-12 17:28:36,822] INFO [GroupCoordinator 90]: Preparing to rebalance group c-7-aq32 with old generation 25 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:40,500] INFO [GroupCoordinator 90]: Stabilized group c-7-aq32 generation 26 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:42,290] INFO [GroupCoordinator 90]: Assignment received from leader for group c-7-aq32 for generation 26 (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:42,300] ERROR [GroupMetadataManager brokerId=90] Appending metadata message for group c-7-aq32 generation 26 failed due to org.apache.kafka.common.errors.RecordTooLargeException, returning UNKNOWN error code to the client (kafka.coordinator.group.GroupMetadataManager)
[2017-12-12 17:28:42,301] INFO [GroupCoordinator 90]: Preparing to rebalance group c-7-aq32 with old generation 26 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
[2017-12-12 17:28:52,301] INFO [GroupCoordinator 90]: Member c-7-aq32-6138ec53-4aff-4596-8f4b-44ae6f5d72da-StreamThread-13-consumer-e0cc0931-0619-4908-82c2-28f7bf9bace9 in group c-7-aq32 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
Now, I need help understanding why I am getting a RecordTooLargeException.
Although the question is old, I also faced a similar issue recently. Changing the following property on the brokers fixed it:
message.max.bytes: 5242880 (5 MB)
The default value for this property is about 1 MB. When the consumers in a group rebalance, the group's assignment metadata for all assigned and revoked partitions is written back to Kafka (the __consumer_offsets topic). If that metadata record is larger than the limit, the broker rejects the append with a RecordTooLargeException, which is exactly the GroupMetadataManager error shown above.
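To double-check what a broker is currently using for that property, you can describe its config with the admin client; a minimal sketch (on a 1.0.x broker the value itself still has to be changed in server.properties and the broker restarted, since dynamic broker configs arrived in later versions):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckMessageMaxBytes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "BROKER:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Broker id 90 is taken from the log lines above; adjust to your broker id.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "90");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get().get(broker);
            // Defaults to roughly 1 MB; the answer above raised it to 5242880.
            System.out.println(config.get("message.max.bytes"));
        }
    }
}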

kafka consumer keeps looping over a bunch of messages after CommitFailedException

I am running a multi-threaded Kafka 0.9.x consumer (the new consumer API).
The way I generate a client.id is by combining "the hostname the consumer is running on" + "an AtomicInteger" + "the PID of the process", roughly as sketched below.
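A sketch of that id scheme (illustrative only, not the actual code):
import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.util.concurrent.atomic.AtomicInteger;

public class ClientIds {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // Builds an id like "myhost-3-12345" from hostname + counter + PID.
    static String nextClientId() throws Exception {
        String hostname = InetAddress.getLocalHost().getHostName();
        // RuntimeMXBean.getName() is typically "pid@hostname" on HotSpot JVMs.
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        return hostname + "-" + COUNTER.incrementAndGet() + "-" + pid;
    }
}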
I am running into issues when I have to stop and restart the consumer. The consumer keeps trying to process the offsets that were not consumed by the previous run (about 100 of them), but it keeps failing with this message:
2016-10-21 14:22:55,293 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Marking the coordinator 2147483647 dead.
2016-10-21 14:22:55,295 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Error UNKNOWN_MEMBER_ID occurred while committing offsets for group x.cg
2016-10-21 14:22:55,296 [pool-3-thread-6] ERROR o.a.k.c.c.i.ConsumerCoordinator : Offset commit failed.
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed due to group rebalance
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:552)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:493)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:665)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:644)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:274)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:134)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorKnown(AbstractCoordinator.java:184)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:886)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
at com.kfc.kafka.consumer.KFCConsumer$KafkaConsumerRunner.run(KFCConsumer.java:102)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-21 14:22:55,397 [pool-3-thread-6] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to join group x.cg failed due to unknown member id, resetting and retrying.
.........
2016-10-21 14:22:58,124 [pool-3-thread-3] INFO o.a.k.c.c.i.AbstractCoordinator : Attempt to heart beat failed since the group is rebalancing, try to re-join group.
From the Kafka log, I see a lot of rebalances happening.
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,196] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,200] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:18,952] INFO [GroupCoordinator 1]: Preparing to restabilize group x.cg with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,233] INFO [GroupCoordinator 1]: Stabilized group x.cg generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordinator)
[2016-10-21 21:28:48,243] INFO [GroupCoordinator 1]: Assignment received from leader for group x.cg for generation 2 (kafka.coordinator.GroupCoordin
It turns out we were having long, recurring pauses (slow network, problems with external components, etc.) in the external systems our consumer was interacting with.
The solution was to split our consumer into three consumers with different consumer groups and different Kafka configs (heartbeat.interval.ms, session.timeout.ms, request.timeout.ms, max.partition.fetch.bytes).
Having three consumers with custom values for those properties got rid of the problem.
The general idea is not to do a lot of external communication inside the consumer, as this increases the uncertainty in consumer behavior; and when you do have external communication, make sure the consumer configs are in line with the SLAs of the external components.
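A sketch of what "three consumers with their own groups and timeouts" can look like in code (group ids and numbers are placeholders, not the values the answer used):
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class SplitConsumerConfigs {
    // One override set per kind of downstream work; each set feeds its own KafkaConsumer instance.
    static Properties groupConfig(String groupId, int sessionTimeoutMs,
                                  int requestTimeoutMs, int maxPartitionFetchBytes) {
        Properties props = new Properties();
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, sessionTimeoutMs);
        // Heartbeats should fire several times per session timeout.
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, sessionTimeoutMs / 3);
        // On the older consumers, request.timeout.ms should stay above session.timeout.ms.
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, requestTimeoutMs);
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, maxPartitionFetchBytes);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder groups for three kinds of downstream work with different SLAs.
        Properties fastExternalCalls = groupConfig("x.cg.fast", 10_000, 40_000, 1_048_576);
        Properties slowExternalCalls = groupConfig("x.cg.slow", 30_000, 60_000, 1_048_576);
        Properties largeMessages     = groupConfig("x.cg.bulk", 20_000, 50_000, 5_242_880);
    }
}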