Kafka Stream failed to rebalance on startup - apache-kafka

I'm trying to run a Kafka Streams application and I'm running into the following exception when Kafka Streams starts.
Here are the configurations I'm using on Kafka 0.10.1:
final Map<String, String> properties = new HashMap<>();
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, "some app id");
properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.put(StreamsConfig.CLIENT_ID_CONFIG, "some app id");
properties.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
properties.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, "org.apache.kafka.common.serialization.Serdes$StringSerde");
properties.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, "org.apache.kafka.common.serialization.Serdes$StringSerde");
The exception that I'm getting:
Exception in thread "StreamThread-1" org.apache.kafka.streams.errors.StreamsException: stream-thread [StreamThread-1] Failed to rebalance
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:410)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:242)
Caused by: java.lang.NullPointerException
at java.io.File.<init>(File.java:360)
at org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:157)
at org.apache.kafka.streams.state.internals.RocksDBStore.init(RocksDBStore.java:163)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.init(MeteredKeyValueStore.java:85)
at org.apache.kafka.streams.state.internals.CachingKeyValueStore.init(CachingKeyValueStore.java:62)
at org.apache.kafka.streams.processor.internals.AbstractTask.initializeStateStores(AbstractTask.java:81)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:120)
at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:633)
at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:660)
at org.apache.kafka.streams.processor.internals.StreamThread.access$100(StreamThread.java:69)
at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:124)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:407)
... 1 more
It seems like I need to set state.dir, but that doesn't seem to work; on Kafka 0.10.1 it already defaults to /tmp/kafka-streams. Does anyone have any ideas?
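For reference, here is a minimal, self-contained sketch of what explicitly setting state.dir could look like on 0.10.1 (the application id, topics, and directory path are placeholders, and this is not necessarily the fix for the NPE above; it only shows the setting the question refers to):
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class StateDirExample {
    public static void main(String[] args) {
        final Map<String, String> properties = new HashMap<>();
        properties.put(StreamsConfig.APPLICATION_ID_CONFIG, "some-app-id"); // placeholder
        properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
        properties.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, "org.apache.kafka.common.serialization.Serdes$StringSerde");
        properties.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, "org.apache.kafka.common.serialization.Serdes$StringSerde");
        // Point the state stores at an explicit directory that exists and is writable,
        // instead of relying on the /tmp/kafka-streams default.
        properties.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");

        final KStreamBuilder builder = new KStreamBuilder();
        builder.stream("input-topic").to("output-topic"); // placeholder topology

        final KafkaStreams streams = new KafkaStreams(builder, new StreamsConfig(properties));
        streams.start();
    }
}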

Related

kafka client getting TOPIC_AUTHORIZATION_FAILED on cluster restart

I am using an SSL-enabled Kafka cluster to consume and publish messages. Below is the tech stack.
spring-kafka : 2.6.6
spring-boot : 2.4.3
Kafka properties:
kafka:
  bootstrap-servers: ${BOOTSTRAP-SERVERS-HOST}
  subscription-topic: TEST
  properties:
    security.protocol: SSL
    ssl.truststore.location: ${SUBSCRIPTION_TRUSTSTORE_PATH}
    ssl.truststore.password: ${SUBSCRIPTION_TRUSTSTORE_PWD}
    ssl.keystore.location: ${SUBSCRIPTION_KEYSTORE_PATH}
    ssl.keystore.password: ${SUBSCRIPTION_KEYSTORE_PWD}
Issue:
The Kafka client application comes up, connects to the Kafka cluster, and consumes and publishes messages as expected.
Now we stop the Kafka broker/cluster, and the following error is logged:
could not be established. Broker may not be available.
This is fine and expected, as the broker/cluster is down.
Now we start the broker/cluster again, and the error below starts appearing: the Kafka consumer stops consuming messages from the topic, although the Kafka publisher is still able to send messages to it. (An application restart resolves the issue.)
I'm trying to understand the root cause; any help is much appreciated.
2022-01-13 13:34:52.078 [TEST.CONSUMER-GROUP-0-C-1] ERROR--SUBSCRIPTION - -org.apache.kafka.clients.Metadata.checkUnauthorizedTopics - [Consumer clientId=consumer-TEST.CONSUMER-GROUP-1, groupId=TEST.CONSUMER-GROUP] Topic authorization failed for topics [TEST]
2022-01-13 13:34:52.078 [TEST.CONSUMER-GROUP-0-C-1] ERROR- -SUBSCRIPTION - -org.springframework.core.log.LogAccessor.error - Authorization Exception and no authorizationExceptionRetryInterval set
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [TEST]
2022-01-13 13:34:52.081 [TEST.CONSUMER-GROUP-0-C-1] ERROR- IRVS-SUBSCRIPTION - -org.springframework.core.log.LogAccessor.error - Fatal consumer exception; stopping container
2022-01-13 13:34:52.083 [TEST.CONSUMER-GROUP-0-C-1] INFO - IRVS-SUBSCRIPTION - -org.springframework.scheduling.concurrent.ExecutorConfigurationSupport.shutdown - Shutting down ExecutorService
The above issue was resolved after setting the authorizationExceptionRetryInterval container property, so the container retries on the TopicAuthorizationException instead of treating it as fatal and stopping. Below is an example illustrating this:
@Bean
ConcurrentKafkaListenerContainerFactory<Object, Object> kafkaListenerContainerFactory(
        ConcurrentKafkaListenerContainerFactoryConfigurer configurer,
        ConsumerFactory<Object, Object> kafkaConsumerFactory) {
    ConcurrentKafkaListenerContainerFactory<Object, Object> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConcurrency(2);
    configurer.configure(factory, kafkaConsumerFactory);
    // other setters like error handler, retry handler
    // setting the authorization exception retry interval
    factory.getContainerProperties()
            .setAuthorizationExceptionRetryInterval(Duration.ofSeconds(5L));
    return factory;
}

Flink with kafka issue: Timeout expired while fetching topic metadata

I tried submitting a simple Flink job to consume messages from Kafka, but within less than a minute of submitting it, the job fails with the following Kafka exception. I have Kafka 2.12 running on my local machine and I have configured the topic that this job consumes from.
public static void main(String[] args) throws Exception {
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> kafkaData = env
            .addSource(new FlinkKafkaConsumer<String>("test-topic",
                    new SimpleStringSchema(), properties));
    kafkaData.print();
    env.execute("Aggregation Job");
}
Here's the exception:
Job has been submitted with JobID 5cc30fe72f685406126e2f5a26f10341
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 5cc30fe72f685406126e2f5a26f10341)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
...
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
I saw another question on Stack Overflow, but it does not resolve the problem. I have not configured any SSL on the Kafka broker. Any suggestions would be appreciated.
I had this same issue today. In my case, the problem was that I had failed to put my Flink application in a VPC (my MSK cluster lives in a VPC). After editing the Flink application and moving it into the appropriate VPC, the problem went away.
I realize this question is a few months old, but I figured I'd post my findings in case anyone else comes across this from a Google search like I did.

Kafka producer timeout exception occurs randomly

I am using the Kafka config below for one of my producers, and the functionality works fine.
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hostaddress:9092");
props.put(ProducerConfig.CLIENT_ID_CONFIG,"usertest");
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, "3");
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432);
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1600);
But I get a timeout exception randomly: everything works for one to two hours, then I suddenly get the following timeout exception for a few records.
In my test run, the producer sent around 20k messages and the consumer received 18,978.
2019-09-24 13:45:43,106 ERROR c.j.b.p.UserProducer$1 [http-nio-8185-exec-13] Send failed for record ProducerRecord(topic=user_test_topic, partition=null, headers=RecordHeaders(headers = [], isReadOnly = false), key=UPDATE_USER, value=CreatePartnerSite [userid=3, name=user123, email=testuser#gmail.com, phone=1234567890]], timestamp=null)
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
2019-09-24 13:45:43,107 ERROR c.j.b.s.UserServiceImpl [http-nio-8185-exec-13] failed to puplish
Try increasing the "max.block.ms" producer config to more than 60000 ms.
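For illustration, a minimal sketch of what that could look like, extending the props object from the producer config shown in the question (the 120000 ms value is only an example, not a recommendation):
// Allow send() to block waiting for metadata/buffer space for up to 120000 ms
// instead of the default 60000 ms.
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "120000");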

Member stream consumer failed, removing it from group

I have a Streams application that throws the following exception:
Exception in thread "<StreamsApp>-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=topic1, partition=0, offset=1
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:240)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:918)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:798)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720)
Caused by: org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort sending since an error caught with a previous record (key null value {...} timestamp 1530812658459) to topic topic2 due to Failed to update metadata after 60000 ms.
You can increase producer parameter `retries` and `retry.backoff.ms` to avoid this error.
In the Streams app I have the following configs:
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 5);
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), 300000);
props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 300000);
I checked the Kafka broker logs (I have a single Kafka broker) and see the following entries related to this:
INFO [GroupCoordinator 1001]: Member <StreamsApp>-StreamThread-1-consumer-49d0a5b3-be2a-4b5c-a4ab-ced7a2484a02 in group <StreamsApp> has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2018-07-03 14:39:23,893] INFO [GroupCoordinator 1001]: Preparing to rebalance group <StreamsApp> with old generation 1 (__consumer_offsets-46) (kafka.coordinator.group.GroupCoordinator)
[2018-07-03 14:39:23,893] INFO [GroupCoordinator 1001]: Group <StreamsApp> with generation 2 is now empty (__consumer_offsets-46) (kafka.coordinator.group.GroupCoordinator)
I read somewhere that this is related to the consumer not calling poll() for quite some time, so it is kicked out by the consumer coordinator, and that the new consumer uses heartbeats as the failure-detection protocol. I am not sure whether this can be the reason, since I am using Kafka version 1.1.0 and Streams version 1.1.0 as well.
How can I avoid this failure scenario? For now I have to restart the Streams application every time this happens.
UPDATE-1:
I am trying to handle this StreamsException by enclosing the main method in a try-catch block, but I can't catch the exception. What can be the reason for this, and how can I catch it and exit the application? Currently the Streams app is in a halted state and not doing anything after this exception.
UPDATE-2:
Following this and this I have updated my code to this:
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);

streams.setUncaughtExceptionHandler((Thread thread, Throwable throwable) -> {
    log.error("EXITING");
    log.error(throwable.getMessage());
    streams.close(5, TimeUnit.SECONDS);
    latch.countDown();
    System.exit(-1);
});
Now the exception is handled and logged. However, the Streams app does not exit (it's still running in the terminal, in a halted state). Ctrl+C doesn't kill it; I have to get the pid of the process and call kill on it.
This is how I have stopped the Streams app and exited it:
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);

streams.setUncaughtExceptionHandler((Thread thread, Throwable throwable) -> {
    log.error("EXITING");
    log.error(throwable.getMessage());
    latch.countDown();
});

Runtime.getRuntime().addShutdownHook(new Thread(streams::close));

try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    log.info("Exiting application with status code 1");
    System.exit(1);
}

log.info("Exiting application with status code 0");
streams.close(5, TimeUnit.SECONDS);
System.exit(0);
For the exception and related broker logs:
Reduce max.poll.records so poll() is called more often. (Confluent help forum).
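For reference, a minimal sketch of what that could look like with the prefixed Streams configs shown in the question (the values are illustrative only):
// Fewer records per poll() means smaller processing batches between polls,
// so poll() is called again within max.poll.interval.ms and the member is
// less likely to be removed from the group.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
// Optionally give each batch more headroom as well (illustrative value).
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600000);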
I haven't tried this yet. Will update once I test this out.
I had the same problem. Even restarting the process couldn't get the consumers to rejoin; I had to restart Kafka every time to get all the consumers back. I adjusted the relevant configuration, but the problem continued to occur when a large number of messages were written:
props.put("max.poll.interval.ms", 30601000);
props.put("session.timeout.ms", 100000);
props.put("max.poll.records", 100);
I use manual commits: once I have consumed and processed a message, I commit the offset. The whole transaction usually takes only a few seconds, yet I still set max.poll.interval.ms to 30 minutes. I hope someone can explain what is going on.
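For context, the manual-commit pattern described above is roughly the following (a standalone sketch with placeholder broker, topic, and group names, assuming a client recent enough to have poll(Duration)):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group"); // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic")); // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process the record; the whole batch must finish well within
                    // max.poll.interval.ms or the member is removed from the group.
                }
                consumer.commitSync(); // commit offsets only after the batch is processed
            }
        }
    }
}
Whatever processing happens inside the loop simply has to stay comfortably below max.poll.interval.ms; otherwise even a 30-minute setting only delays the eviction.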

Storm - KafkaSpout fails on open()

For the past two days, I've been trying to implement a KafkaSpout within our topology. Here is some important information.
All three services are running on the same instance. Kafka's brokers use the default port 9092, with advertised.listeners set to PLAINTEXT://localhost:9092. Zookeeper uses the default client port 2181, and the Storm Nimbus host name has been set to localhost as well.
A custom Kafka producer creates log messages successfully, and by using the zkCli Zookeeper script I've verified under the /brokers path that the partitions and other relevant information are stored correctly.
However, I keep getting an error when activating, and afterwards monitoring, the topology.
Here is the source code of the Storm topology I’ve implemented:
BrokerHosts hosts = new ZkHosts("127.0.0.1:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "bytes", "/kafkastorm/", "bytes" + UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.zkServers = Arrays.asList("127.0.0.1");
spoutConfig.zkPort = 2181;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("bytes", kafkaSpout);
builder.setBolt("byteSize", new KafkaByteProcessingBolt()).shuffleGrouping("bytes");
StormTopology topology = builder.createTopology();
Config config = new Config();
StormSubmitter.submitTopology("topology", config, topology);
However, the error message I keep getting when executing the bin/storm monitor <topology_name> -m bytes is the following:
Exception in thread "main" java.lang.IllegalArgumentException: stream: default not found
at org.apache.storm.utils.Monitor.metrics(Monitor.java:223)
at org.apache.storm.utils.Monitor.metrics(Monitor.java:159)
at org.apache.storm.command.monitor$_main.doInvoke(monitor.clj:36)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at org.apache.storm.command.monitor.main(Unknown Source)
By inspecting the worker logs (the worker.log file), I've concluded that the KafkaSpout fails in the open() method.
java.lang.NoClassDefFoundError: org/apache/curator/RetryPolicy
at org.apache.storm.kafka.KafkaSpout.open(KafkaSpout.java:75) ~[storm-kafka-1.0.2.jar:1.0.2]
at org.apache.storm.daemon.executor$fn__7990$fn__8005.invoke(executor.clj:604) ~[storm-core-1.0.2.jar:1.0.2]
at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:482) [storm-core-1.0.2.jar:1.0.2]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]
Caused by: java.lang.ClassNotFoundException: org.apache.curator.RetryPolicy
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[?:1.8.0_101]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_101]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) ~[?:1.8.0_101]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_101]
... 5 more
Could someone explain what might be the reason for the KafkaSpout to fail in the open() method?
I would really appreciate your help!
From the error "java.lang.NoClassDefFoundError: org/apache/curator/RetryPolicy" it appears that curator-client.jar is missing from the classpath.
Please check whether the link below helps you:
https://github.com/abhinavg6/Kafka-Storm-Conscomp/issues/1