Kafka Streams - Failed to Rebalance Error - apache-kafka

I have a basic Kafka Streams application that reads from an in_topic, performs a rolling aggregate, and performs a join to publish to an out_topic. It had been running fine for weeks, but it crashed this morning and will no longer start. I do not think it has anything to do with the code. The logs prior to the error are:
2019-01-21 17:46:32,803 localhost org.apache.kafka.clients.producer.KafkaProducer: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] Instantiated a transactional producer.
2019-01-21 17:46:32,803 localhost org.apache.kafka.clients.producer.KafkaProducer: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] Overriding the default acks to all since idempotence is enabled.
2019-01-21 17:46:32,818 localhost org.apache.kafka.common.utils.AppInfoParser: Kafka version : 2.0.0
2019-01-21 17:46:32,818 localhost org.apache.kafka.common.utils.AppInfoParser: Kafka commitId : 3402a8361b734732
2019-01-21 17:46:32,832 localhost org.apache.kafka.clients.producer.internals.TransactionManager: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] ProducerId set to -1 with epoch -1
2019-01-21 17:47:32,833 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Error caught during partition assignment, will abort the current process and re-throw at the end of rebalance: {}
org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.
2019-01-21 17:47:32,843 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] partition assignment took 60062 ms.
current active tasks: []
current standby tasks: []
previous active tasks: []
2019-01-21 17:47:32,845 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] State transition from PARTITIONS_ASSIGNED to PENDING_SHUTDOWN
2019-01-21 17:47:32,845 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Shutting down
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.KafkaStreams: stream-client [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804] State transition from REBALANCING to ERROR
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.KafkaStreams: stream-client [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804] All stream threads have died. The instance will be in error state and should be closed.
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Shutdown complete
Exception in thread "rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Failed to rebalance.
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:870)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:810)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.
None of the Kafka settings/configs have changed, and all of the brokers are available. My Kafka version is 2.0. I am able to read from the in_topic with the console consumer, so everything upstream of this application is fine. All help is appreciated.

Our project hit the same timeout failure after we upgraded to Kafka 2.1, and we don't know the reason yet.
Our temporary workaround is to disable the exactly_once config, which skips initializing transactional state.
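For reference, a minimal sketch of the relevant configuration lines for that workaround; the bootstrap address is a placeholder, and the application ID is taken from the logs above:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rtt-healthscore-stream");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder address
// at_least_once (the default) means no transactional producer is created,
// so the app no longer has to initialize transactional state on startup
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.AT_LEAST_ONCE);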

We also got this error after an upgrade to 2.1 (and, I think, also when we previously upgraded to earlier versions).
We run in a Kubernetes environment where, after a rolling upgrade, brokers may change IP address. From the broker log:
[2019-02-20 02:20:20,085] WARN [TransactionCoordinator id=1001] Connection to node 0 (khaki-joey-kafka-0.khaki-joey-kafka-headless.hyperspace-dev/10.233.124.181:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2019-02-20 02:20:57,205] WARN [TransactionCoordinator id=1001] Connection to node 1 (khaki-joey-kafka-1.khaki-joey-kafka-headless.hyperspace-dev/10.233.122.67:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
I can see the transaction coordinator is still using stale IP addresses for the two brokers that were restarted after it (a day after the upgrade).
Possible options:
As the answer above says, switch off exactly-once for your Streams app. It then doesn't use transactions and everything seems to work. Not helpful if you require EOS or some other client code requires transactions.
Restart any brokers that are reporting warnings to force them to re-resolve the IP addresses. They would need to be restarted in a way that doesn't change their own IP addresses, which is usually not possible in Kubernetes.
Defect raised: KAFKA-7958 - Transactions are broken with Kubernetes-hosted brokers.
Update 2019-02-20: This may have been resolved in Kafka 2.1.1 (Confluent 5.1.2), released today. See the linked issue.

It was resolved after upgrading. See the Kafka Streams developer guide for the dependencies:
https://kafka.apache.org/25/documentation/streams/developer-guide/write-streams.html
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>2.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.5.0</version>
</dependency>
<!-- Optionally include Kafka Streams DSL for Scala for Scala 2.12 -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams-scala_2.12</artifactId>
    <version>2.5.0</version>
</dependency>
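With those dependencies in place, a Streams application of the shape described in the question (consume, rolling aggregate, join, publish) looks roughly like the sketch below. The topic names match the question, but the serdes, value types, and join logic are illustrative rather than the original code:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class HealthScoreApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rtt-healthscore-stream"); // from the logs above
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");         // placeholder address

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Long> input = builder.stream("in_topic",
                Consumed.with(Serdes.String(), Serdes.Long()));

        // rolling aggregate: running sum per key, materialized as a KTable
        KTable<String, Long> totals = input
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                .reduce(Long::sum, Materialized.with(Serdes.String(), Serdes.Long()));

        // join each incoming record with its running total and publish the result
        input.join(totals, (value, total) -> value + "/" + total)
             .to("out_topic", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}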

Related

What does the message "The Critical Analyzer detected slow paths on the broker" mean in the Artemis broker?

Setup: I have an Artemis broker HA cluster with 3 brokers. The HA policy is replication. Each broker is running in its own VM.
Problem: When I leave my brokers running for long time, usually after 5-6 hours, I get the below error.
2022-11-21 21:32:37,902 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager is expired on path 0
2022-11-21 21:32:37,902 INFO [org.apache.activemq.artemis.core.server] AMQ224107: The Critical Analyzer detected slow paths on the broker. It is recommended that you enable trace logs on org.apache.activemq.artemis.utils.critical while you troubleshoot this issue. You should disable the trace logs when you have finished troubleshooting.
2022-11-21 21:32:37,902 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager#46d59067 is not responsive
2022-11-21 21:32:37,969 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: *******************************************************************************
Complete Thread dump "Thread-517 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7#437da279)" Id=602 TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack#75f49105
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.SynchronousQueue$TransferStack#75f49105
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
What does this really mean? I understand that the Critical Analyzer sees an error and halts the broker, but what is causing this error?
You may take a look at the documentation. Basically, the broker detects an issue and shuts itself down before it becomes unresponsive. Setting the critical-analyzer-policy to LOG may give you more clues about the issue.

Kafka streams apps on same machine

I run into issues when starting two Streams applications on the same machine. It has to do with the cooperative rebalancing protocol. For some reason, when the second one comes up, the first one crashes with:
[streams-StreamThread-1] INFO org.apache.kafka.streams.processor.internals.StreamThread - stream-thread [streams-stream-1-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD
[streams-StreamThread-1] INFO org.apache.kafka.streams.KafkaStreams - stream-client [streams-app.id-1] State transition from RUNNING to ERROR
[streams-StreamThread-1] ERROR org.apache.kafka.streams.KafkaStreams - stream-client [streams-1] All stream threads have died. The instance will be in error state and should be closed.
[streams-1] INFO org.apache.kafka.streams.processor.internals.StreamThread - stream-thread [streams-StreamThread-1] Shutdown complete
[streams-StreamThread-1] ERROR app.id - Uncaught exception in Streams thread (streams-StreamThread-1)
java.lang.IllegalStateException: Assignor supporting the COOPERATIVE protocol violates its requirements
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.validateCooperativeAssignment(ConsumerCoordinator.java:668)
I have tried using a separate state.dir on the two instances but that didn’t seem to work for me either.
Edit (can be ignored, see first comment): Further, I notice the same behavior when trying to start two consumers with the CooperativeStickyAssignor configured. Is there something about the cooperative rebalancing configuration that I am missing?
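For reference, a minimal sketch of pointing each instance at its own state directory via the public StreamsConfig keys (the application ID, bootstrap address, and paths below are illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-app.id");   // same app.id on both instances
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder address
// e.g. pass -Dinstance.id=1 to the first JVM and -Dinstance.id=2 to the second,
// so each instance writes its RocksDB state to its own directory
props.put(StreamsConfig.STATE_DIR_CONFIG,
        "/tmp/kafka-streams-" + System.getProperty("instance.id", "1"));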

Auto restart Quarkus Microservice after broker unavailability

I have a very simple Quarkus microservice which uses SmallRye Reactive Messaging (Kafka). Sometimes my Kafka broker goes down and I get the following logs:
2020-09-24 04:04:27,067 WARN [org.apa.kaf.cli.NetworkClient] (kafka-producer-network-thread | producer-1) [Producer clientId=producer-1] Bootstrap broker xxxxxxx.xxxx.xxx:2202 (id: -1 rack: null) disconnected
2020-09-24 04:04:27,083 WARN [org.apa.kaf.cli.NetworkClient] (kafka-producer-network-thread | producer-3) [Producer clientId=producer-3] Connection to node -1 (xxxxx.xxxx.xxxx.fr/XX.XX.XX.XXX:2202) could not be established. Broker may not be available.
After the broker has been restarted, I have to manually restart my microservice. Is it possible to give the microservice the capability to re-consume the new incoming messages without any manual action?
Thank you!
If you are using the KafkaProducer and KafkaConsumer APIs, they automatically reconnect once the broker is up again.
Please ensure that your application does not throw an exception and kill the thread; if you keep the thread alive, it will reconnect. Catch all exceptions in the consumer thread to ensure it does not exit due to a runtime exception.
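As an illustration of that advice, a minimal sketch of a consumer loop that survives a broker outage (broker address, group ID, and topic name are placeholders, and this uses the plain consumer API rather than SmallRye):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResilientConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "my-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                try {
                    // poll() returns empty batches while the broker is down and resumes
                    // delivering records once the broker is reachable again
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println(record.key() + " -> " + record.value());
                    }
                } catch (RuntimeException e) {
                    // keep the polling thread alive: log the failure and continue
                    // instead of letting an uncaught exception kill the thread
                    System.err.println("Error while polling/processing: " + e);
                }
            }
        }
    }
}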

Kafka brokers not starting up

I previously had a 2-broker cluster of Kafka 0.10.1 running correctly on my development servers with ZooKeeper 3.3.6.
I recently tried upgrading the broker version to the latest Kafka 2.3.0, but it didn't start.
Not much has changed in the configuration.
Can anybody point me to what could possibly be going wrong? Why are the brokers not starting?
Changed server.properties on broker server 1
broker.id=1
log.dirs=/mnt/kafka_2.11-2.3.0/logs
zookeeper.connect=local1:2181,local2:2181
listeners=PLAINTEXT://local1:9092
advertised.listeners=PLAINTEXT://local1:9092
Changed server.properties on broker server 2
broker.id=2
log.dirs=/mnt/kafka_2.11-2.3.0/logs
zookeeper.connect=local1:2181,local2:2181
listeners=PLAINTEXT://local2:9092
advertised.listeners=PLAINTEXT://local2:9092
NOTE:
1. ZooKeeper is running on both servers.
2. Kafka directories, namely /brokers, /brokers/ids, /consumers etc., are getting created.
3. Nothing is getting registered under /brokers/ids. ZooKeeper CLI get /brokers/ids returns
[]
4. Command lsof -i tcp:9092 returns
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 18290 cass 118u IPv6 52133 0t0 TCP local2:9092 (LISTEN)
5. logs/server.log has no errors logged.
6. No more logs are getting appended to server.log.
Server logs
[2019-07-01 10:56:14,534] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-07-01 10:56:14,801] INFO Awaiting socket connections on local2:9092. (kafka.network.Acceptor)
[2019-07-01 10:56:14,829] INFO [SocketServer brokerId=1] Created data-plane acceptor and processors for endpoint : EndPoint(local2,9092,ListenerName(PLAINTEXT),PLAINTEXT) (kafka.network.SocketServer)
[2019-07-01 10:56:14,830] INFO [SocketServer brokerId=1] Started 1 acceptor threads for data-plane (kafka.network.SocketServer)
[2019-07-01 10:56:14,850] INFO [ExpirationReaper-1-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-01 10:56:14,851] INFO [ExpirationReaper-1-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-01 10:56:14,851] INFO [ExpirationReaper-1-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-01 10:56:14,852] INFO [ExpirationReaper-1-ElectPreferredLeader]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-01 10:56:14,860] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
[2019-07-01 10:56:14,892] INFO Creating /brokers/ids/1 (is it secure? false) (kafka.zk.KafkaZkClient)
From the docs regarding ZooKeeper
Stable version
The current stable branch is 3.4 and the latest release of that branch is 3.4.9.
Upgrading the ZooKeeper version to the latest 3.5.5 helped, and the Kafka broker started correctly.
It would have been great if the docs had stated the incompatibility with the previous ZooKeeper version.
PS: Answer added to help anyone stuck with a similar issue caused by the ZooKeeper version.

Zookeeper: Connection request from old client will be dropped if server is in r-o mode

storm version: 0.8.2
zookeeper version: 3.4.5.
We have a small storm cluster (1 nimbus and 3 supervisors), so using just 1 zookeeper instance that's co-located with storm nimbus.
Infrequently we start getting the following errors in the zookeeper logs and our storm cluster comes to a standstill.
2014-04-05 13:27:32,885 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#197] - Accepted socket connection from /10.0.1.183:56121
2014-04-05 13:27:32,886 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#793] - Connection request from old client /10.0.1.183:56121; will be dropped if server is in r-o mode
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#832] - Client attempting to renew session 0x1452dd02834002e at /10.0.1.183:56121
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#595] - Established session 0x1452dd02834002e with negotiated timeout 40000 for client /10.0.1.183:56121
On the storm end we start seeing the following in supervisor and worker logs:
2014-04-05 11:37:29 ConnectionStateManager [WARN] There are no ConnectionStateListeners registered.
2014-04-05 11:37:29 cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2014-04-05 11:37:31 ClientCnxn [WARN] Session 0x1452dd028340015 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-04-05 11:37:42 CuratorFrameworkImpl [ERROR] Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
Do we need to downgrade zookeeper to 3.3.3 or is there a known issue/config that we're missing?
We also experienced several issues with Storm 0.9 and ZooKeeper 3.4.x, though not exactly the one you describe.
The Storm mailing list is also reporting such incompatibility issues:
https://mail.google.com/mail/u/0/#search/label%3Astorm+zookeeper+3.4/144313a45ba069b5
https://mail.google.com/mail/u/0/#search/label%3Astorm+zookeeper+3.4/1447d95d10ce7582
The latter one points us to this Storm pull request, which should hopefully let us use ZK 3.4.x with future versions of Storm once it is released:
https://github.com/apache/incubator-storm/pull/29
Until then, I would recommend downgrading ZK to 3.3.6 (you may install a separate ZK instance for Storm if you absolutely need ZK 3.4.x for another system). You could also clone the Storm code and merge that pull request locally, or compile the latest version of trunk, but that's a bit adventurous and more tiresome than just waiting for those nice folks to deliver a new release for us :)
A workaround for this situation is to clear Storm's data directory (configured in storm.yaml => storm.local.dir), then restart the supervisor. I did that in my test environment by clearing Storm's data directory and restarting the nimbus and supervisor.
I think it's caused by a previous crash of the Storm cluster, from which the supervisor cannot recover.