I run into issues when starting two Streams applications from the same machine. It has to do with the cooperative rebalancing protocol: for some reason, when the second one comes up, the first one crashes with:
[streams-StreamThread-1] INFO org.apache.kafka.streams.processor.internals.StreamThread - stream-thread [streams-stream-1-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD
[streams-StreamThread-1] INFO org.apache.kafka.streams.KafkaStreams - stream-client [streams-app.id-1] State transition from RUNNING to ERROR
[streams-StreamThread-1] ERROR org.apache.kafka.streams.KafkaStreams - stream-client [streams-1] All stream threads have died. The instance will be in error state and should be closed.
[streams-1] INFO org.apache.kafka.streams.processor.internals.StreamThread - stream-thread [streams-StreamThread-1] Shutdown complete
[streams-StreamThread-1] ERROR app.id - Uncaught exception in Streams thread (streams-StreamThread-1)
java.lang.IllegalStateException: Assignor supporting the COOPERATIVE protocol violates its requirements
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.validateCooperativeAssignment(ConsumerCoordinator.java:668)
I have tried using a separate state.dir on the two instances but that didn’t seem to work for me either.
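For reference, this is roughly how I wire in the separate state.dir on each instance (the app id, bootstrap servers, and paths below are placeholders for my actual values; the topology itself is unchanged):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class InstanceLauncher {
    // Starts one instance with its own state directory, e.g.
    // /tmp/kafka-streams/instance-1 vs. /tmp/kafka-streams/instance-2.
    static KafkaStreams start(Topology topology, String stateDir) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-app.id");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.STATE_DIR_CONFIG, stateDir);

        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        return streams;
    }
}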
Can be ignored, see first comment: Further, I notice the same behavior when trying to start two consumers with the CooperativeStickyAssignor configured. Is there something regarding cooperative rebalancing configuration that I am missing?
Related
Setup: I have an Artemis broker HA cluster with 3 brokers. The HA policy is replication. Each broker is running in its own VM.
Problem: When I leave my brokers running for a long time, usually after 5-6 hours, I get the error below.
2022-11-21 21:32:37,902 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager is expired on path 0
2022-11-21 21:32:37,902 INFO [org.apache.activemq.artemis.core.server] AMQ224107: The Critical Analyzer detected slow paths on the broker. It is recommended that you enable trace logs on org.apache.activemq.artemis.utils.critical while you troubleshoot this issue. You should disable the trace logs when you have finished troubleshooting.
2022-11-21 21:32:37,902 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager#46d59067 is not responsive
2022-11-21 21:32:37,969 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: *******************************************************************************
Complete Thread dump
"Thread-517 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7#437da279)" Id=602 TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack#75f49105
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.SynchronousQueue$TransferStack#75f49105
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
What does this really mean? I understand that the critical analyzer sees an error and halts the broker, but what is causing this error?
You may take a look at the documentation. Basically, the broker is detecting some issue and shutting itself down before it becomes too unresponsive. By setting the critical analyzer policy to LOG (the <critical-analyzer-policy> element in broker.xml) you might get more clues about the issue.
I have an Apache Kafka Streams application. I notice that it sometimes shuts down when a rebalance occurs, with no real reason for the shutdown. It doesn't even throw an exception.
Here are some logs from one such shutdown:
[2022-03-08 17:13:37,024] INFO [Consumer clientId=svc-stream-collector-StreamThread-1-consumer, groupId=svc-stream-collector] Adding newly assigned partitions: (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2022-03-08 17:13:37,024] ERROR stream-thread [svc-stream-collector-StreamThread-1] A Kafka Streams client in this Kafka Streams application is requesting to shutdown the application (org.apache.kafka.streams.processor.internals.StreamThread)
[2022-03-08 17:13:37,030] INFO stream-client [svc-stream-collector] State transition from REBALANCING to PENDING_ERROR (org.apache.kafka.streams.KafkaStreams)
old state:REBALANCING new state:PENDING_ERROR
[2022-03-08 17:13:37,031] INFO [Consumer clientId=svc-stream-collector-StreamThread-1-consumer, groupId=svc-stream-collector] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2022-03-08 17:13:37,032] INFO stream-thread [svc-stream-collector-StreamThread-1] Informed to shut down (org.apache.kafka.streams.processor.internals.StreamThread)
[2022-03-08 17:13:37,032] INFO stream-thread [svc-stream-collector-StreamThread-1] State transition from PARTITIONS_REVOKED to PENDING_SHUTDOWN (org.apache.kafka.streams.processor.internals.StreamThread)
[2022-03-08 17:13:37,067] INFO stream-thread [svc-stream-collector-StreamThread-1] Thread state is already PENDING_SHUTDOWN, skipping the run once call after poll request (org.apache.kafka.streams.processor.internals.StreamThread)
[2022-03-08 17:13:37,067] WARN stream-thread [svc-stream-collector-StreamThread-1] Detected that shutdown was requested. All clients in this app will now begin to shutdown (org.apache.kafka.streams.processor.internals.StreamThread)
I suspect it's because there are no newly assigned partitions in the log line below:
[2022-03-08 17:13:37,024] INFO [Consumer clientId=svc-stream-collector-StreamThread-1-consumer, groupId=svc-stream-collector] Adding newly assigned partitions: (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
However, I'm not exactly sure why this error occurs. Any help would be appreciated.
I have a basic Kafka Streams application that reads from an in_topic, performs a rolling aggregate, and performs a join to publish to an out_topic. This has been running fine for weeks, but it crashed this morning and will no longer start. I do not think it has anything to do with the code. The logs prior to the error are:
2019-01-21 17:46:32,803 localhost org.apache.kafka.clients.producer.KafkaProducer: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] Instantiated a transactional producer.
2019-01-21 17:46:32,803 localhost org.apache.kafka.clients.producer.KafkaProducer: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] Overriding the default acks to all since idempotence is enabled.
2019-01-21 17:46:32,818 localhost org.apache.kafka.common.utils.AppInfoParser: Kafka version : 2.0.0
2019-01-21 17:46:32,818 localhost org.apache.kafka.common.utils.AppInfoParser: Kafka commitId : 3402a8361b734732
2019-01-21 17:46:32,832 localhost org.apache.kafka.clients.producer.internals.TransactionManager: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] ProducerId set to -1 with epoch -1
2019-01-21 17:47:32,833 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Error caught during partition assignment, will abort the current process and re-throw at the end of rebalance: {}
org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.
2019-01-21 17:47:32,843 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] partition assignment took 60062 ms.
current active tasks: []
current standby tasks: []
previous active tasks: []
2019-01-21 17:47:32,845 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] State transition from PARTITIONS_ASSIGNED to PENDING_SHUTDOWN
2019-01-21 17:47:32,845 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Shutting down
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.KafkaStreams: stream-client [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804] State transition from REBALANCING to ERROR
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.KafkaStreams: stream-client [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804] All stream threads have died. The instance will be in error state and should be closed.
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Shutdown complete
Exception in thread "rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Failed to rebalance.
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:870)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:810)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.
None of the Kafka settings/configs have changed, and all of the brokers are available. My Kafka version is 2.0. I am able to read from the in_topic with the console consumer, so everything upstream of this application is fine. All help is appreciated.
Our project has the same timeout failure after we upgraded to Kafka 2.1, and we don't know the reason yet.
Our temporary workaround is to disable the exactly_once config, which skips initializing the transactional state.
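For what it's worth, the change boils down to a single Streams property (a minimal sketch; the rest of our configuration is unchanged):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {
    static Properties guaranteeProps() {
        Properties props = new Properties();
        // Before: exactly-once, which makes Streams create transactional producers
        // and initialize transactional state (the step that times out here).
        // props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        // Temporary workaround: fall back to the default at-least-once guarantee.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.AT_LEAST_ONCE);
        return props;
    }
}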
We also got these errors after an upgrade to 2.1 (and, I think, also when we previously upgraded to earlier versions).
We run in a Kubernetes environment where, after a rolling upgrade, brokers may change IP address. From the broker log:
[2019-02-20 02:20:20,085] WARN [TransactionCoordinator id=1001] Connection to node 0 (khaki-joey-kafka-0.khaki-joey-kafka-headless.hyperspace-dev/10.233.124.181:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2019-02-20 02:20:57,205] WARN [TransactionCoordinator id=1001] Connection to node 1 (khaki-joey-kafka-1.khaki-joey-kafka-headless.hyperspace-dev/10.233.122.67:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
I can see the transaction coordinator is still using stale IP addresses for the two brokers that were restarted after it (a day after the upgrade).
Possible options:
As this answer says, switch off exactly-once for your Streams app. It then doesn't use transactions and everything seems to work OK. Not helpful if you require EOS or if some other client code requires transactions.
Restart any brokers that are reporting warnings, to force them to re-resolve the IP addresses. They would need to be restarted in a way that they don't change IP address themselves, which is not usually possible in Kubernetes.
Defect raised Issue KAFKA-7958 - Transactions are broken with kubernetes hosted brokers
Update 2019-02-20: This may have been resolved in Kafka 2.1.1 (Confluent 5.1.2), released today. See the linked issue.
It's resolved after an upgrade:
https://kafka.apache.org/25/documentation/streams/developer-guide/write-streams.html
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>2.5.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.5.0</version>
</dependency>
<!-- Optionally include Kafka Streams DSL for Scala for Scala 2.12 -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams-scala_2.12</artifactId>
<version>2.5.0</version>
</dependency>
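For reference, the skeleton from the linked guide, built against these 2.5.0 dependencies, looks roughly like this (application id, topic names, and bootstrap servers are placeholders):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MinimalStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "minimal-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Copy records from the input topic to the output topic unchanged.
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Close cleanly when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}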
I have configured a ConcurrentMessageListenerContainer with a concurrency of 3 to consume from 3 partitions, and also a KafkaTemplate with a producerFactory that produces messages to 3 partitions. The Spring beans are configured with a destroy-method that invokes stop() on the consumer listener container and on the producer at application shutdown. After shutdown, the log shown below suggests the consumers have stopped, but there is no indication of whether the producer has stopped or not.
[kafkaContainer-0-C-1] INFO org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer - Consumer stopped
[kafkaContainer-2-C-1] INFO org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer - Consumer stopped
[kafkaContainer-1-C-1] INFO org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer - Consumer stopped
However, the application does not shut down completely. Running jstack shows 3 ThreadPoolTaskScheduler threads still running in the background. A snippet of the jstack output:
"ThreadPoolTaskScheduler-1" #74 prio=5 os_prio=0 tid=0x00007f8564f23800 nid=0x77e5 waiting on condition [0x00007f8525292000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000ec8f0808> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The above block is printed once per MessageListener; that is, if I set the concurrency to 5, the jstack output contains the above block 5 times. So I think that even though the consumer listener is logged as stopped, internally it is not shutting down completely.
Am I missing something about shutting down the producer and consumer properly?
You don't say which version you are using.
This is fixed in 2.1.0, 2.0.2 and 1.3.2.
You don't need to stop() the container from a destroy() method; the context will stop the consumer when it is closed.
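For illustration, here is a Java-config sketch of that approach (assuming Spring Kafka 2.x package locations; bean names, topic, and group id are placeholders). The container is a lifecycle-managed (SmartLifecycle) bean, so the context starts and stops it for you; DefaultKafkaProducerFactory likewise closes its shared producer when its bean is destroyed.

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.ConcurrentMessageListenerContainer;
import org.springframework.kafka.listener.ContainerProperties;
import org.springframework.kafka.listener.MessageListener;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        return new DefaultKafkaConsumerFactory<>(props);
    }

    // No destroy-method needed: the container implements SmartLifecycle, so the
    // application context stops it automatically when the context is closed.
    @Bean
    public ConcurrentMessageListenerContainer<String, String> kafkaContainer(
            ConsumerFactory<String, String> consumerFactory) {
        ContainerProperties containerProps = new ContainerProperties("my-topic");
        containerProps.setMessageListener(
                (MessageListener<String, String>) rec -> System.out.println(rec.value()));
        ConcurrentMessageListenerContainer<String, String> container =
                new ConcurrentMessageListenerContainer<>(consumerFactory, containerProps);
        container.setConcurrency(3);
        return container;
    }
}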
Today I ran into an issue: broker 2 no longer exists in ZooKeeper. I thought broker 2 was down, but it is still running fine. I checked the ZooKeeper log,
and it only mentioned "Established session 0x153514345be0321 with negotiated timeout 6000 for client broker 2".
From the removed broker's server and state-change logs:
[2016-03-08 06:00:00,257] ERROR Controller 2 epoch 19 initiated state change for partition [robotEvents,186] from OfflinePartition to OnlinePartition failed (state.change.logger)
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [***,186] due to: aborted leader election for partition [****,186] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 2 went through a soft failure and another controller was elected with epoch 20..
A lot of these kinds of errors occur. From the source code, this error seems to be a normal part of leader election, but it doesn't explain why broker 2 was removed from ZooKeeper. Any idea?
Thanks in advance