Kafka continuous rebalancing when slow consumer - apache-kafka

I've been trying to test our Kafka for some negative scenarios and one of them is a very slow consumer. I set Thread.sleep(15000) inside my #KafkaListener method(it's spring-kafka) and set the concurrency to 3. I have 1 topic with 1 partition.
I put 10 messages into the topic and started the service.
When 3 consumers start, they all get to (Re-)joining group point,
but then only one of them(let's assume it's consumer-2) will get to:
Successfully joined group with generation X
And start slowly consuming the messages.
(by the way, I use MANUAL_IMMEDIATE Ack mode, but it is reproducible even when I don't add Acknowledgement argument to the listener and don't acknowledge messages).
What I see next is as follows:
Up untill all messages are rpocessed by consumer-2, every 3 seconds(default heartbeat interval) I've got a message in the console:
AbstractCoordinator$HeartbeatResponseHandler: [Consumer clientId=consumer-2, groupId=pixel-group] Attempt to heartbeat failed since group is rebalancing
I wonder why it happens. Only after all 10 messages are processes, there will be another rebalancing after which all 3 consumers will print:
Successfully joined group with generation X
And one of them will be assigned a partition and there will be no heartbeat problems any longer.
This only happens when I set the sleep interval to a value higher than heartbeat interval. It usually happens once when all consumers are starting, but they will set up successfully shortly afterwards.
So, to sum it up it seems that:
If consumer processing time > heartbeat interval time - all but first consumer can't finish rebalancing(they probably can't talk to their slow Leader).
What I fail to understand is why this heartbeat error is so persistent?
Why can't the rest of the consumers finish rebalancing somewhere in between Leader's message consumption if the sleep if longer than heartbeat?
UPDATE
Kafka version 2.12-2.2.0
Spring-Kafka 2.2.3.RELEASE

... concurrency to 3. I have 1 topic with 1 partition ...
You need at least as many partitions as consumers - a partition can only be consumed by one consumer.
What version are you using? Since KIP-62 (Kafka 0.10.1.0), heartbeats are sent in the background by the kafka-clients. So a rebalance should only occur if the listener takes longer than max.poll.interval.ms. Google KIP-62 for more information.
EDIT
You should see logs like this, while your listener is sleeping...
2019-06-13 09:47:52.008 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Heartbeat thread started
...
2019-06-13 09:47:52.072 INFO 61914 --- [ main] com.example.Rbgh664Application : Sleeping for 15
2019-06-13 09:47:55.120 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Sending Heartbeat request to coordinator localhost:9092 (id: 2147483647 rack: null)
2019-06-13 09:47:55.121 TRACE 61914 --- [hread | rbgh664] org.apache.kafka.clients.NetworkClient : [Consumer clientId=source, groupId=rbgh664] Sending HEARTBEAT {group_id=rbgh664,generation_id=45,member_id=source-82297cee-063a-4e0d-89d0-15cfbd6ef680} with correlation id 10 to node 2147483647
2019-06-13 09:47:55.226 TRACE 61914 --- [hread | rbgh664] org.apache.kafka.clients.NetworkClient : [Consumer clientId=source, groupId=rbgh664] Completed receive from node 2147483647 for HEARTBEAT with correlation id 10, received {throttle_time_ms=0,error_code=0}
2019-06-13 09:47:55.227 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Received successful Heartbeat response
2019-06-13 09:47:58.120 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Sending Heartbeat request to coordinator localhost:9092 (id: 2147483647 rack: null)
2019-06-13 09:47:58.120 TRACE 61914 --- [hread | rbgh664] org.apache.kafka.clients.NetworkClient : [Consumer clientId=source, groupId=rbgh664] Sending HEARTBEAT {group_id=rbgh664,generation_id=45,member_id=source-82297cee-063a-4e0d-89d0-15cfbd6ef680} with correlation id 11 to node 2147483647
2019-06-13 09:47:58.225 TRACE 61914 --- [hread | rbgh664] org.apache.kafka.clients.NetworkClient : [Consumer clientId=source, groupId=rbgh664] Completed receive from node 2147483647 for HEARTBEAT with correlation id 11, received {throttle_time_ms=0,error_code=0}
2019-06-13 09:47:58.226 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Received successful Heartbeat response
2019-06-13 09:48:01.203 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Sending Heartbeat request to coordinator localhost:9092 (id: 2147483647 rack: null)
2019-06-13 09:48:01.204 TRACE 61914 --- [hread | rbgh664] org.apache.kafka.clients.NetworkClient : [Consumer clientId=source, groupId=rbgh664] Sending HEARTBEAT {group_id=rbgh664,generation_id=45,member_id=source-82297cee-063a-4e0d-89d0-15cfbd6ef680} with correlation id 12 to node 2147483647
2019-06-13 09:48:01.310 TRACE 61914 --- [hread | rbgh664] org.apache.kafka.clients.NetworkClient : [Consumer clientId=source, groupId=rbgh664] Completed receive from node 2147483647 for HEARTBEAT with correlation id 12, received {throttle_time_ms=0,error_code=0}
2019-06-13 09:48:01.310 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Received successful Heartbeat response
2019-06-13 09:48:04.285 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Sending Heartbeat request to coordinator localhost:9092 (id: 2147483647 rack: null)
2019-06-13 09:48:04.286 TRACE 61914 --- [hread | rbgh664] org.apache.kafka.clients.NetworkClient : [Consumer clientId=source, groupId=rbgh664] Sending HEARTBEAT {group_id=rbgh664,generation_id=45,member_id=source-82297cee-063a-4e0d-89d0-15cfbd6ef680} with correlation id 13 to node 2147483647
2019-06-13 09:48:04.390 TRACE 61914 --- [hread | rbgh664] org.apache.kafka.clients.NetworkClient : [Consumer clientId=source, groupId=rbgh664] Completed receive from node 2147483647 for HEARTBEAT with correlation id 13, received {throttle_time_ms=0,error_code=0}
2019-06-13 09:48:04.390 DEBUG 61914 --- [hread | rbgh664] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=source, groupId=rbgh664] Received successful Heartbeat response
20
...

Related

Confluent Control Center failure: Unable to fetch consumer offsets for cluster id

I am running confluent platform (version 6.1.1). I deploy the following components: 3 Brokers, 3 ZK, Schema Registry, 3 Kafka Connect, KSQL and Confluent Control Center (CCC).
The CCC has entered into a failed state and I have difficulties to bring it back.
To make things cleaner, I have created another EC2 instance (m4.2xlarge) where I configured new CCC with the aim to connect it to the current cluster. New CCC has exactly the same configuration as the failed one, but with a different confluent.controlcenter.id.
I start the CCC and it is running. I can access the CCC UI but it is not working properly: the pages are loading too long, it keeps showing the changing state of the connect cluster (sometimes healthy, sometimes not), it keeps showing the changing state of the brokers (sometimes healthy, sometimes not)
For example it looks like this (see screenshots):
After running certain amount of time, it is automatically restarted and keeps restarting every 5-7 minutes.
When it is started, I see a bunch of new topics created in the Kafka cluster.
After that in the control-center.log I see :
INFO [main] Setting offsets for topic=_confluent-monitoring (io.confluent.controlcenter.KafkaHelper)
INFO [main] found 12 topicPartitions for topic=_confluent-monitoring (io.confluent.controlcenter.KafkaHelper)
INFO [main] Setting offsets for topic=_confluent-metrics (io.confluent.controlcenter.KafkaHelper)
INFO [main] found 12 topicPartitions for topic=_confluent-metrics (io.confluent.controlcenter.KafkaHelper)
INFO [main] action=starting topology=command (io.confluent.controlcenter.ControlCenter)
INFO [main] waiting for streams to be in running state REBALANCING (io.confluent.command.CommandStore)
INFO [main] Streams state RUNNING (io.confluent.command.CommandStore)
INFO [main] action=started topology=command (io.confluent.controlcenter.ControlCenter)
INFO [main] action=starting operation=command-migration (io.confluent.controlcenter.ControlCenter)
INFO [main] action=completed operation=command-migration (io.confluent.controlcenter.ControlCenter)
INFO [main] action=starting topology=monitoring (io.confluent.controlcenter.ControlCenter)
INFO [main] action=started topology=monitoring (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Health Check (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Alert Manager (io.confluent.controlcenter.ControlCenter)
INFO [main] Starting Consumer Offsets Fetch (io.confluent.controlcenter.ControlCenter)
INFO [control-center-heartbeat-0] current clusterId=lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] broker id set has changed new={1001=[10.251.xx.xx:9093 (id: 1001 rack: null)], 1002=[10.251.xx.xx:9093 (id: 1002 rack: null)], 1003=[10.251.xx.xx:9093 (id: 1003 rack: null)]} removed={} (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] new controller=10.251.xx.xx:9093 (id: 1002 rack: null) (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [main] Initial capacity 128, increased by 64, maximum capacity 2147483647. (io.confluent.rest.ApplicationServer)
INFO [main] Adding listener: http://0.0.0.0:9021 (io.confluent.rest.ApplicationServer)
INFO [main] x509=X509#3a8ead9(ip-44-135-xx-xx.eu-central-1.compute.internal,h=[ip-44-135-xx-xx.eu-central-1.compute.internal],w=[]) for Server#7c8b37a8[provider=null,keyStore=file:///var/kafka-ssl/server.keystore.jks,trustStore=file:///var/kafka-ssl/client.truststore.jks] (org.eclipse.jetty.util.ssl.SslContextFactory)
INFO [main] x509=X509#3831f4c2(caroot,h=[eu-central-1.compute.internal],w=[]) for Server#7c8b37a8[provider=null,keyStore=file:///var/kafka-ssl/server.keystore.jks,trustStore=file:///var/kafka-ssl/client.truststore.jks] (org.eclipse.jetty.util.ssl.SslContextFactory)
INFO [main] jetty-9.4.38.v20210224; built: 2021-02-24T20:25:07.675Z; git: 288f3cc74549e8a913bf363250b0744f2695b8e6; jvm 11.0.13+8-LTS (org.eclipse.jetty.server.Server)
INFO [main] DefaultSessionIdManager workerName=node0 (org.eclipse.jetty.server.session)
INFO [main] No SessionScavenger set, using defaults (org.eclipse.jetty.server.session)
INFO [main] node0 Scavenging every 660000ms (org.eclipse.jetty.server.session)
INFO [main] Started o.e.j.s.ServletContextHandler#1ef5cde4{/,[jar:file:/usr/share/java/acl/acl-6.1.1.jar!/io/confluent/controlcenter/rest/static],AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler)
INFO [main] Started o.e.j.s.ServletContextHandler#5401c6a8{/ws,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler)
INFO [main] Started NetworkTrafficServerConnector#5d6b5d3d{HTTP/1.1, (http/1.1)}{0.0.0.0:9021} (org.eclipse.jetty.server.AbstractConnector)
INFO [main] Started #36578ms (org.eclipse.jetty.server.Server)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.count type=monitoring cluster= value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.rate type=monitoring cluster= value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.timestamp type=monitoring cluster= value=NaN (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=monitoring-input-topic-progress-.min type=monitoring cluster= value=1.7976931348623157E308 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.count type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.rate type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=0.0 (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.timestamp type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=NaN (io.confluent.controlcenter.util.StreamProgressReporter)
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-1] name=metrics-input-topic-progress-lCRehAk0RqmLR04nhXKHtA.min type=metrics cluster=lCRehAk0RqmLR04nhXKHtA value=1.7976931348623157E308 (io.confluent.controlcenter.util.StreamProgressReporter)
WARN [control-center-heartbeat-0] misconfigured topic=_confluent-command config=segment.bytes value=1073741824 expected=134217728 (io.confluent.controlcenter.healthcheck.HealthCheck)
WARN [control-center-heartbeat-0] misconfigured topic=_confluent-command config=delete.retention.ms value=86400000 expected=259200000 (io.confluent.controlcenter.healthcheck.HealthCheck)
INFO [control-center-heartbeat-0] misconfigured topic=_confluent-metrics config=min.insync.replicas value=1 expected=2 (io.confluent.controlcenter.healthcheck.HealthCheck)
WARN [control-center-heartbeat-1] Unable to fetch consumer offsets for cluster id lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.data.ConsumerOffsetsFetcher)
java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupDescriptions(ConsumerOffsetsDao.java:220)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupOffsets(ConsumerOffsetsDao.java:58)
at io.confluent.controlcenter.data.ConsumerOffsetsFetcher.run(ConsumerOffsetsFetcher.java:73)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_confluent-ksql-eim_ksql_non_prodquery_CSAS_SDL_STMTS_GG_347 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.DisconnectException: Cancelled describeConsumerGroups request with correlation id 168 due to node 1001 being disconnected
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=connect-mongo-dci-grid-partner-test11 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeConsumerGroups
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_confluent-ksql-eim_ksql_non_prodquery_CSAS_SDL_STMTS_UPWARD_GG_355 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeConsumerGroups
WARN [kafka-admin-client-thread | adminclient-3] failed fetching description for consumerGroup=_eim_c3_non_prod-4 (io.confluent.controlcenter.data.ConsumerOffsetsDao)
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeConsumerGroups, deadlineMs=1654853629184, tries=1, nextAllowedTryMs=1654853629324) timed out at 1654853629224 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeConsumerGroups
...
and so on...
WARN [control-center-heartbeat-1] Unable to fetch consumer offsets for cluster id lCRehAk0RqmLR04nhXKHtA (io.confluent.controlcenter.data.ConsumerOffsetsFetcher)
java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupDescriptions(ConsumerOffsetsDao.java:220)
at io.confluent.controlcenter.data.ConsumerOffsetsDao.getAllConsumerGroupOffsets(ConsumerOffsetsDao.java:58)
at io.confluent.controlcenter.data.ConsumerOffsetsFetcher.run(ConsumerOffsetsFetcher.java:73)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
and so on...
In the control-center-kafka.log I see:
INFO [control-center-heartbeat-1] Kafka version: 6.1.1-ce (org.apache.kafka.common.utils.AppInfoParser)
INFO [control-center-heartbeat-1] Kafka commitId: 73deb3aeb1f8647c (org.apache.kafka.common.utils.AppInfoParser)
INFO [control-center-heartbeat-1] Kafka startTimeMs: 1654853610852 (org.apache.kafka.common.utils.AppInfoParser)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-monitoring-message-rekey-store-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-monitoring-trigger-event-rekey-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-MonitoringStream-ONE_MINUTE-repartition-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.2:9093 (id: 1002 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Resetting offset for partition _eim_c3_non_prod-4-aggregatedTopicPartitionTableWindows-ONE_MINUTE-repartition-7 to position FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[10.251.6.1:9093 (id: 1001 rack: null)], epoch=0}}. (org.apache.kafka.clients.consumer.internals.SubscriptionState)
and so on ...
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1003: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-3-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1001: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-10-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-5-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=1478925475, epoch=1) to node 1003: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
INFO [kafka-coordinator-heartbeat-thread | _eim_c3_non_prod-4] [Consumer clientId=_eim_c3_non_prod-4-b6c9d6bd-717d-4559-bcfe-a4c9be647b7f-StreamThread-6-consumer, groupId=_eim_c3_non_prod-4] Error sending fetch request (sessionId=1947312909, epoch=1) to node 1002: (org.apache.kafka.clients.FetchSessionHandler)
org.apache.kafka.common.errors.DisconnectException
and so on ...
Any ideas what can be wrong here?

Same offset and partition record is getting consumed twice causing duplicates

I am trying to consume records using application written in spring-kafka. I am facing very unique condition and not able to understand why this is happening ?
My consumer application is running with 2 concurrency meaning 2 consumer thread subscribed to topic having two partitions.I am consuming records and placing it into table using upsert with offset, partitions and insert timestamp.
I am seeing duplicate values with same offset and partition in the table which should not occur. There is no difference in the timestamp value, means insert occurred at the same time. I am not sure how is it possible? I don't see any issue in the log as well. I am not sure what is happening at the Producer end but we can't have 2 values at the same offset anyway, so not sure whether this is an issue at consumer end of producer end.Any suggestion or thought which would help me to triage this issue?
Kafka log :
I don't see anything unusual in the log as well.
14:29:56.318 [main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka version: 2.4.0
14:29:56.318 [main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka commitId: 77a89fcf8d7fa018
14:29:56.318 [main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka startTimeMs: 1604154596318
14:29:56.319 [main] INFO org.apache.kafka.clients.consumer.KafkaConsumer - [Consumer clientId=consumer-csp-prov-emerald-test-2, groupId=csp-prov-emerald-test] Subscribed to topic(s): kaas.pe.enrollment.csp.ts2
14:29:57.914 [org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1] INFO org.apache.kafka.clients.Metadata - [Consumer clientId=consumer-csp-prov-emerald-test-2, groupId=csp-prov-emerald-test] Cluster ID: 6cbv7QOaSW6j1vXrOCE4jA
14:29:57.914 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] INFO org.apache.kafka.clients.Metadata - [Consumer clientId=consumer-csp-prov-emerald-test-1, groupId=csp-prov-emerald-test] Cluster ID: 6cbv7QOaSW6j1vXrOCE4jA
14:29:57.915 [org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-2, groupId=csp-prov-emerald-test] Discovered group coordinator apslp1563.uhc.com:9093 (id: 2147483574 rack: null)
14:29:57.915 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-1, groupId=csp-prov-emerald-test] Discovered group coordinator apslp1563.uhc.com:9093 (id: 2147483574 rack: null)
14:29:57.923 [org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-2, groupId=csp-prov-emerald-test] (Re-)joining group
14:29:57.924 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-1, groupId=csp-prov-emerald-test] (Re-)joining group
14:29:58.121 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-1, groupId=csp-prov-emerald-test] (Re-)joining group
14:29:58.125 [org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-2, groupId=csp-prov-emerald-test] (Re-)joining group
14:30:13.127 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-1, groupId=csp-prov-emerald-test] Finished assignment for group at generation 23: {consumer-csp-prov-emerald-test-1-19d92ba5-5dc3-433d-b967-3cf1ce1b4174=org.apache.kafka.clients.consumer.ConsumerPartitionAssignor$Assignment#d7e2a1f, consumer-csp-prov-emerald-test-2-5833c212-7031-4ab1-944b-7e26f7d7a293=org.apache.kafka.clients.consumer.ConsumerPartitionAssignor$Assignment#53c3aad3}
14:30:13.131 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-1, groupId=csp-prov-emerald-test] Successfully joined group with generation 23
14:30:13.131 [org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1] INFO org.apache.kafka.clients.consumer.internals.AbstractCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-2, groupId=csp-prov-emerald-test] Successfully joined group with generation 23
14:30:13.134 [org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-2, groupId=csp-prov-emerald-test] Adding newly assigned partitions: kaas.pe.enrollment.csp.ts2-1
14:30:13.134 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-1, groupId=csp-prov-emerald-test] Adding newly assigned partitions: kaas.pe.enrollment.csp.ts2-0
14:30:13.143 [org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-1, groupId=csp-prov-emerald-test] Setting offset for partition kaas.pe.enrollment.csp.ts2-0 to the committed offset FetchPosition{offset=500387, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=apslp1559.uhc.com:9093 (id: 69 rack: null), epoch=37}}
14:30:13.143 [org.springframework.kafka.KafkaListenerEndpointContainer#0-1-C-1] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=consumer-csp-prov-emerald-test-2, groupId=csp-prov-emerald-test] Setting offset for partition kaas.pe.enrollment.csp.ts2-1 to the committed offset FetchPosition{offset=499503, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=apslp1562.uhc.com:9093 (id: 72 rack: null), epoch=36}}
Code :
#KafkaListener(topics = "${app.topic}", groupId = "${app.group_id_config}")
public void run(ConsumerRecord<String, GenericRecord> record, Acknowledgment acknowledgement) throws Exception {
try {
prov_tin_number = record.value().get("providerTinNumber").toString();
prov_tin_type = record.value().get("providerTINType").toString();
enroll_type = record.value().get("enrollmentType").toString();
vcp_prov_choice_ind = record.value().get("vcpProvChoiceInd").toString();
error_flag = "";
dataexecutor.peStremUpsertTbl(prov_tin_number, prov_tin_type, enroll_type, vcp_prov_choice_ind, error_flag,
record.partition(), record.offset());
acknowledgement.acknowledge();
}catch (Exception ex) {
System.out.println(record);
System.out.println(ex.getMessage());
}
}

Kafka Streams Topology stuck in num.stream.threads=50 and 100 partitions

There is a topology:
kStreamBuilder.stream(kafkaProperties.getInboundTopicName(), consumed)
.filterNot((k,v) -> Objects.isNull(v))
.transform(() -> new CustomTransformer(...))
.transform(() -> new AnotherTransformer(...))
.to(kafkaProperties.getOutTopicName(), resultProduced);
with configured
num.stream.threads: 50
On startup application stuck with constantly logging messages(I'm not 100% sure it stuck but after 20 minutes there are no changes in the state and CPU, network usage is very high) :
State transition from RUNNING to PARTITIONS_REVOKED
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-1-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-2-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-3-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-4-consumer, groupId=group_id] (Re-)joining group
AbstractCoordinator : [Consumer clientId=consumer_id-StreamThread-5-consumer, groupId=group_id] (Re-)joining group
etc.
Topic has 100 partitions.
What we noticed: every transformer uses it's own persistentStateStore. After replacing it to inMemoryStateStore there still were logs written above but after ~3 minutes topology started successfully.
Kafka streams version 2.1.0.
Broker version 1.1.0

Kafka could not create internal topics after 32 topics created

We have 9 microservices that creates 32 topics (2 of them from beginning , 30 of them from internal) , after I make a new join kafka gets down. Is there any limitation that only 32 topics can be created with Kafka, or how can I solve this?
Thank you for your time.
Started SpringBootCounterMS in 4.652 seconds (JVM running for 7.59)
2018-07-04 10:39:29.513 INFO 14956 --- [-StreamThread-1] o.a.k.s.p.internals.StreamThread : stream-thread [counter-service-3ca6bb7e-addd-445e-a22b-8b7be1b3b6c7-StreamThread-1] State transition from PARTITIONS_ASSIGNED to RUNNING
2018-07-04 10:39:29.514 INFO 14956 --- [-StreamThread-1] org.apache.kafka.streams.KafkaStreams : stream-client [counter-service-3ca6bb7e-addd-445e-a22b-8b7be1b3b6c7]State transition from REBALANCING to RUNNING
2018-07-04 10:39:30.579 INFO 14956 --- [-StreamThread-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=counter-service-3ca6bb7e-addd-445e-a22b-8b7be1b3b6c7-StreamThread-1-consumer, groupId=counter-service] Marking the coordinator localhost:9092 (id: 2147483647 rack: null) dead
2018-07-04 10:39:30.599 WARN 14956 --- [-StreamThread-2] o.a.k.s.p.i.InternalTopicManager : stream-thread [main] Could not create internal topics: Empty response for client request. Retry #0
Updating JVM to a 64bit one solves this problem. It is a outofmemory error.

Why does a Kafka consumer take a long time to start consuming?

We start a Kafka consumer, listening on a topic which may not yet be created (topic auto creation is enabled though).
Not long thereafter a producer is publishing messages on that topic.
However, it takes some time for the consumer to notice this: 5 minutes to be exact. At this point the consumer revokes its partitions and rejoins the consumer group. Kafka re-stabilizes the group. Looking at the time-stamps of the consumer vs. kafka logs, this process is initiated at the consumer side.
I suppose this is expected behavior but I would like to understand this. Is this actually a re-balancing going on (from 0 to 1 partition)? If we'd create topics upfront, would this not happen?
2017-02-01 08:36:45.692 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Revoking previously assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:36:45.692 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions revoked:[]
2017-02-01 08:36:45.693 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : (Re-)joining group tps-kafka-partitioning
2017-02-01 08:36:45.738 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : Successfully joined group tps-kafka-partitioning with generation 1
2017-02-01 08:36:45.747 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Setting newly assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:36:45.749 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions assigned:[]
2017-02-01 08:41:45.540 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.ConsumerCoordinator : Revoking previously assigned partitions [] for group tps-kafka-partitioning
2017-02-01 08:41:45.544 INFO 7 --- [afka-consumer-1] o.s.k.l.KafkaMessageListenerContainer : partitions revoked:[]
2017-02-01 08:41:45.544 INFO 7 --- [afka-consumer-1] o.a.k.c.c.internals.AbstractCoordinator : (Re-)joining group tps-kafka-partitioning
kafka logs
[2017-02-01 08:41:45,546] INFO [GroupCoordinator 1001]: Preparing to restabilize group tps-kafka-partitioning with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:41:45,546] INFO [GroupCoordinator 1001]: Stabilized group tps-kafka-partitioning generation 2 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:41:45,551] INFO [GroupCoordinator 1001]: Assignment received from leader for group tps-kafka-partitioning for generation 2 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:42:14,636] INFO [GroupCoordinator 1001]: Preparing to restabilize group tps-kafka-group-id with old generation 1 (kafka.coordinator.GroupCoordinator)
[2017-02-01 08:42:14,636] INFO [GroupCoordinator 1001]: Stabilized group tps-kafka-group-id generation 2 (kafka.coordinator.GroupCoordinator)
This is probably due to the default value of the parameter metadata.max.age.ms which controls how often the consumer forces a refresh of metadata for a topic.
What happens when you start the consumer up with a non existing topic is that the brokers autocreate this topic, but this takes a little bit of time with leader election etc., so when your consumer requests metadata for that topic it gets a LEADER_NOT_AVAILABLE warning and can't fetch any messages.
After the timeout mentioned above is reached the consumer refreshes metadata, successfully this time around and starts reading messages. This is not dependent on a producer writing messages to the topic, it is purely a consumer thing.
If you start your consumer with for example 1000ms timeout, you should see a much shorter delay until messages are consumed.
Also, if you create topics up front, or start the producer before the consumer, this behavior should not happen at all.