Flink with kafka issue: Timeout expired while fetching topic metadata - apache-kafka

I tried submitting a simple Flink job that consumes messages from Kafka, but within less than a minute of submission the job fails with the Kafka exception below. I have Kafka 2.12 running on my local machine, and the topic this job consumes from has been created.
public static void main(String[] args) throws Exception {
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "127.0.0.1:9092");

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<String> kafkaData = env
            .addSource(new FlinkKafkaConsumer<String>("test-topic",
                    new SimpleStringSchema(), properties));

    kafkaData.print();
    env.execute("Aggregation Job");
}
Here's the exception:
Job has been submitted with JobID 5cc30fe72f685406126e2f5a26f10341
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 5cc30fe72f685406126e2f5a26f10341)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
...
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
I saw another question on Stack Overflow, but it does not resolve the problem. I have not configured SSL on the Kafka broker. Any suggestions would be appreciated.

I had this same issue today. In my case, the problem was that I failed to put my flink application in a VPC (my MSK cluster lives in a VPC). After editing the flink application and moving it into the appropriate VPC, the problem went away.
I realize this question is a few months old, but I figured I'd post my findings in case anyone else happens to come across this from a Google search like I did.
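If the same timeout shows up without MSK or a VPC in the picture, one quick check is whether topic metadata is reachable at all from the machine running the job. Below is a minimal sketch of an AdminClient probe (the class name MetadataProbe is hypothetical; the bootstrap address and topic name are reused from the question, and AdminClient ships with the regular kafka-clients dependency):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class MetadataProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // same bootstrap address the Flink job uses
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "10000");
        try (AdminClient admin = AdminClient.create(props)) {
            // fails with a TimeoutException if topic metadata cannot be fetched
            TopicDescription desc = admin.describeTopics(Collections.singleton("test-topic"))
                    .all().get().get("test-topic");
            System.out.println("test-topic has " + desc.partitions().size() + " partition(s)");
        }
    }
}

If this probe also times out, the problem is connectivity or advertised listeners rather than anything Flink-specific.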

Related

kafka client getting TOPIC_AUTHORIZATION_FAILED on cluster restart

I am using an SSL-enabled Kafka cluster to consume and publish messages. Below is the tech stack.
spring-kafka : 2.6.6
spring-boot : 2.4.3
Kafka properties
kafka:
  bootstrap-servers: ${BOOTSTRAP-SERVERS-HOST}
  subscription-topic: TEST
  properties:
    security.protocol: SSL
    ssl.truststore.location: ${SUBSCRIPTION_TRUSTSTORE_PATH}
    ssl.truststore.password: ${SUBSCRIPTION_TRUSTSTORE_PWD}
    ssl.keystore.location: ${SUBSCRIPTION_KEYSTORE_PATH}
    ssl.keystore.password: ${SUBSCRIPTION_KEYSTORE_PWD}
Issue:
The Kafka client application is up, connected to the Kafka cluster, and consuming and publishing messages as expected.
When we stop the Kafka broker/cluster, the following error is logged:
could not be established. Broker may not be available.
This is fine and expected, since the broker/cluster is down.
When we start the broker/cluster again, the error below starts appearing: the Kafka consumer stops consuming messages from the topic, although the Kafka publisher is still able to send messages to it. (An application restart resolves the issue.)
I am trying to understand the root cause; any help is much appreciated.
2022-01-13 13:34:52.078 [TEST.CONSUMER-GROUP-0-C-1] ERROR--SUBSCRIPTION - -org.apache.kafka.clients.Metadata.checkUnauthorizedTopics - [Consumer clientId=consumer-TEST.CONSUMER-GROUP-1, groupId=TEST.CONSUMER-GROUP] Topic authorization failed for topics [TEST]
2022-01-13 13:34:52.078 [TEST.CONSUMER-GROUP-0-C-1] ERROR- -SUBSCRIPTION - -org.springframework.core.log.LogAccessor.error - Authorization Exception and no authorizationExceptionRetryInterval set
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [TEST]
2022-01-13 13:34:52.081 [TEST.CONSUMER-GROUP-0-C-1] ERROR- IRVS-SUBSCRIPTION - -org.springframework.core.log.LogAccessor.error - Fatal consumer exception; stopping container
2022-01-13 13:34:52.083 [TEST.CONSUMER-GROUP-0-C-1] INFO - IRVS-SUBSCRIPTION - -org.springframework.scheduling.concurrent.ExecutorConfigurationSupport.shutdown - Shutting down ExecutorService
The issue above was resolved after setting the authorization exception retry interval (authorizationExceptionRetryInterval).
Below is an example illustrating this:
@Bean
ConcurrentKafkaListenerContainerFactory<Object, Object> kafkaListenerContainerFactory(
        ConcurrentKafkaListenerContainerFactoryConfigurer configurer,
        ConsumerFactory<Object, Object> kafkaConsumerFactory) {
    ConcurrentKafkaListenerContainerFactory<Object, Object> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConcurrency(2);
    configurer.configure(factory, kafkaConsumerFactory);
    // other setters like error handler, retry handler
    // setting the authorization exception retry interval
    factory.getContainerProperties()
            .setAuthorizationExceptionRetryInterval(Duration.ofSeconds(5L));
    return factory;
}

Flink SQL does not honor "table.exec.source.idle-timeout" setting

I have a Flink job running FlinkSQL with the following setup:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
final EnvironmentSettings settings =
        EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
final StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);

env.setMaxParallelism(env.getParallelism() * 8);
env.getConfig().setAutoWatermarkInterval(config.autowatermarkInterval());

final TableConfig tConfig = tEnv.getConfig();
tConfig.setIdleStateRetention(Duration.ofMinutes(60));
tConfig.getConfiguration().setString("table.exec.source.idle-timeout", "180000 ms");
To test this locally with a Kafka source, I fired a few events into the Flink job. The Flink UI shows it produced one watermark. I then waited 3 minutes to see whether watermarks advance without sending new events (i.e. an idle partition). However, no watermark advancement occurred.
Note: I use a local Kafka broker with three partitions, and my test data is keyed and hence always sent to the same partition. However, I am not seeing watermarks advance even though the other partitions are idle and I wait 3 minutes.
Is there any place in the job UI where I could see whether the 3-minute value I set is actually picked up? Am I using the right units (seconds vs. ms)?
Is there anything else I could check to test this setting?
We are running Flink 1.12.1.
Update: I see this exception in my Flink SQL job under Exceptions. I wonder if there is a version mismatch.
2021-10-26 16:38:14
java.lang.NoClassDefFoundError: org/apache/kafka/common/requests/OffsetsForLeaderEpochRequest$PartitionData
at org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochClient.lambda$null$0(OffsetsForLeaderEpochClient.java:52)
at java.base/java.util.Optional.ifPresent(Unknown Source)
at org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochClient.lambda$prepareRequest$1(OffsetsForLeaderEpochClient.java:51)
at java.base/java.util.HashMap.forEach(Unknown Source)
at org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochClient.prepareRequest(OffsetsForLeaderEpochClient.java:51)
at org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochClient.prepareRequest(OffsetsForLeaderEpochClient.java:37)
at org.apache.kafka.clients.consumer.internals.AsyncClient.sendAsyncRequest(AsyncClient.java:37)
at org.apache.kafka.clients.consumer.internals.Fetcher.lambda$validateOffsetsAsync$5(Fetcher.java:798)
at java.base/java.util.HashMap.forEach(Unknown Source)
at org.apache.kafka.clients.consumer.internals.Fetcher.validateOffsetsAsync(Fetcher.java:774)
at org.apache.kafka.clients.consumer.internals.Fetcher.validateOffsetsIfNeeded(Fetcher.java:498)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateFetchPositions(KafkaConsumer.java:2328)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1271)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1235)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1168)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaConsumerThread.run(KafkaConsumerThread.java:249)
The issue was that this setting does not work in Flink 1.12.0 or 1.12.1. I had to upgrade to Flink 1.13.2 and the setting was honored and worked as expected.
The exception was a red herring and not consistently reproducible.
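As for checking whether the value is picked up: one way to at least confirm the client-side setting is to read it back from the TableConfig and log it. A minimal sketch, assuming the tEnv from the snippet above (this only verifies what the client submits, not how the runtime uses it):

// read the option back and log it to confirm the client-side value
String idleTimeout = tEnv.getConfig().getConfiguration()
        .getString("table.exec.source.idle-timeout", "<not set>");
System.out.println("table.exec.source.idle-timeout = " + idleTimeout);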

Member stream consumer failed removing it from group

I have a Streams application that throws the following exception:
Exception in thread "<StreamsApp>-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=topic1, partition=0, offset=1
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:240)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:411)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:918)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:798)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720)
Caused by: org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort sending since an error caught with a previous record (key null value {...} timestamp 1530812658459) to topic topic2 due to Failed to update metadata after 60000 ms.
You can increase producer parameter `retries` and `retry.backoff.ms` to avoid this error.
In Streams app I have the following configs:
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 5);
props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRY_BACKOFF_MS_CONFIG), 300000);
props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 300000);
I checked the Kafka broker logs (I have a single broker) and see the following related entries:
INFO [GroupCoordinator 1001]: Member <StreamsApp>-StreamThread-1-consumer-49d0a5b3-be2a-4b5c-a4ab-ced7a2484a02 in group <StreamsApp> has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2018-07-03 14:39:23,893] INFO [GroupCoordinator 1001]: Preparing to rebalance group <StreamsApp> with old generation 1 (__consumer_offsets-46) (kafka.coordinator.group.GroupCoordinator)
[2018-07-03 14:39:23,893] INFO [GroupCoordinator 1001]: Group <StreamsApp> with generation 2 is now empty (__consumer_offsets-46) (kafka.coordinator.group.GroupCoordinator)
I read somewhere that it is related to the consumer not calling poll() for quite some time, so it gets kicked out by the group coordinator, and that newer consumers use heartbeats as the failure-detection mechanism instead. I am not sure this can be the reason, since I am using Kafka 1.1.0 and Streams 1.1.0.
How can I avoid this failure scenario? For now I have to restart the Streams application every time this happens.
UPDATE-1:
I am trying to handle this StreamsException by enclosing main in a try-catch block, but I can't catch the exception. What could be the reason, and how can I catch it and exit the application? Currently the Streams app is in a halted state and does nothing after this exception.
UPDATE-2:
Following this and this I have updated my code to this:
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);
streams.setUncaughtExceptionHandler((Thread thread, Throwable throwable) -> {
    log.error("EXITING");
    log.error(throwable.getMessage());
    streams.close(5, TimeUnit.SECONDS);
    latch.countDown();
    System.exit(-1);
});
Now the exception is handled and logged. However, the Streams app does not exit (it is still running in the terminal in a halted state). Ctrl+C doesn't kill it; I have to find the PID of the process and call kill on it.
This is how I stopped the Streams app and exited it:
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);
streams.setUncaughtExceptionHandler((Thread thread, Throwable throwable) -> {
    log.error("EXITING");
    log.error(throwable.getMessage());
    latch.countDown();
});
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    log.info("Exiting application with status code 1");
    System.exit(1);
}
log.info("Exiting application with status code 0");
streams.close(5, TimeUnit.SECONDS);
System.exit(0);
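For reference on why this version exits while the earlier one hangs: the uncaught-exception handler now only releases the latch instead of calling close() and System.exit() from inside the stream thread; the main thread then falls through latch.await(), closes the streams client with a timeout, and exits, with the JVM shutdown hook as a safety net.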
For the exception and related broker logs:
Reduce max.poll.records so poll() is called more often. (Confluent help forum).
I haven't tried this yet. Will update once I test this out.
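For anyone who wants to try it, that consumer setting can be passed through a Streams application with the consumer prefix. A minimal sketch alongside the existing producer overrides (the values are illustrative, not recommendations):

// fewer records per poll() means the thread gets back to poll() sooner
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
// optionally give record processing more headroom between polls
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600000);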
I had the same problem. Even restarting the process couldn't get the consumers to rejoin; I have to restart Kafka every time for all the consumers. I adjusted the relevant configuration, but the problem kept occurring on the consumers when a large number of messages were written.
props.put("max.poll.interval.ms",30601000); props.put("session.timeout.ms",100000); props.put("max.poll.records",100);
I use manual commits: once I have consumed and processed a message, I commit the offset. The whole transaction usually takes only a few seconds, yet I still set max.poll.interval.ms to 30 minutes. I hope someone can explain this.

Kafka Producer error Expiring 10 record(s) for TOPIC:XXXXXX: 6686 ms has passed since batch creation plus linger time

Kafka Version : 0.10.2.1,
Kafka Producer error Expiring 10 record(s) for TOPIC:XXXXXX: 6686 ms has passed since batch creation plus linger time
org.apache.kafka.common.errors.TimeoutException: Expiring 10 record(s) for TOPIC:XXXXXX: 6686 ms has passed since batch creation plus linger time
This exception occurs because you are queueing records at a much faster rate than they can be sent.
When you call the send method, the ProducerRecord will be stored in an internal buffer for sending to the broker. The method returns immediately once the ProducerRecord has been buffered, regardless of whether it has been sent.
Records are grouped into batches for sending to the broker, to reduce the transport overhead per message and increase throughput.
Once a record is added into a batch, there is a time limit for sending that batch to ensure that it has been sent within a specified duration. This is controlled by the Producer configuration parameter, request.timeout.ms, which defaults to 30 seconds. See related answer
If the batch has been queued longer than the timeout limit, the exception will be thrown. Records in that batch will be removed from the send queue.
Producer configs block.on.buffer.full, metadata.fetch.timeout.ms and timeout.ms have been removed. They were initially deprecated in Kafka 0.9.0.0.
Therefore, try increasing request.timeout.ms.
If you still have throughput-related problems, you can also refer to the following blog.
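A minimal sketch of the relevant producer properties (the broker address and values are illustrative and should be tuned to your workload):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092"); // illustrative address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// give the broker more time to acknowledge a batch before it is expired
props.put("request.timeout.ms", "60000");
// smaller batches and a short linger reduce how long records wait in the send queue
props.put("batch.size", "16384");
props.put("linger.ms", "5");
Producer<String, String> producer = new KafkaProducer<>(props);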
This issue arises when the brokers/topics/partitions cannot be reached by the producer, or when the producer times out while records are waiting in the queue.
I found that you can encounter this issue even with live brokers. In my case, the topic partition leaders were pointing to inactive broker IDs. To fix this, you have to migrate those leaders to active brokers.
Use the topic reassignment tool for the impacted topics.
Topic Migration: https://kafka.apache.org/21/documentation.html#basic_ops_automigrate
I had the same message and fixed it by cleaning the Kafka data from ZooKeeper. After that it worked.
I faced the same issue in an AKS cluster; simply restarting the Kafka and ZooKeeper servers resolved it.
FOR KAFKA DOCKER CASE
I spent a lot of time trying to find out what happened, including changing server.properties, producer.properties, and my code (in Eclipse). None of that worked for me (I send messages from my laptop to Kafka running in Docker on a Linux server).
I cleaned Kafka and ZooKeeper and reinstalled them via docker-compose.yml (I'm a newbie). Please look at my docker-compose.yml file and see how I changed these IPs to my Linux server's IP.
[Screenshots: the bitnami/kafka and wurstmeister/kafka entries in docker-compose.yml before and after changing the listener IPs to 10.5.1.30, which is my Linux server's IP address.]
After that, I ran my code, and here is the result: [screenshot]
full code:
import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SimpleProducer {
    public static void main(String[] args) throws Exception {
        try {
            String topicName = "demo";
            Properties props = new Properties();
            props.put("bootstrap.servers", "10.5.1.30:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer = new KafkaProducer<String, String>(props);
            Future<RecordMetadata> f = producer.send(new ProducerRecord<String, String>(topicName, "Eclipse3"));
            System.out.println("Message sent successfully, total of message is: " + f.get().toString());
            producer.close();
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
        System.out.println("Successful");
    }
}
Hope that helps. Peace !!!
Say a topic has 100 partitions (0-99). Kafka lets you produce records to a topic by specifying a particular partition. I faced this issue when trying to produce to a partition > 99, because the brokers reject such records.
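For illustration, producing to an explicit partition looks like this (a sketch with hypothetical topic and partition values; the partition index must be smaller than the topic's partition count, otherwise the record can never be routed and its batch eventually expires):

// hypothetical example: "my-topic" must actually have a partition with index 42
int partition = 42;
producer.send(new ProducerRecord<>("my-topic", partition, "some-key", "some-value"));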
We tried everything, but no luck:
Decreased the producer batch size and increased request.timeout.ms.
Restarted the target Kafka cluster; still no luck.
Checked replication on the target Kafka cluster; that was working fine as well.
Added retries and retry.backoff.ms in the producer properties.
Added linger.ms in the producer properties as well.
Finally, in our case there was an issue with the Kafka cluster itself: from 2 of the servers we were unable to fetch metadata.
When we changed the target Kafka cluster to our dev box, it worked fine.

Storm - KafkaSpout fails on open()

For the past two days, I've been trying to implement a KafkaSpout within our topology. Here is some important information.
All three services are running on the same instance. Kafka's brokers use the default port 9092, with advertised.listeners set to PLAINTEXT://localhost:9092. ZooKeeper uses the default client port 2181, and the Storm Nimbus host name has been set to localhost as well.
A custom Kafka producer creates log messages successfully, and by using the zkCli ZooKeeper script I've verified that under the /brokers path the partitions and other relevant information are stored correctly.
However, I keep getting an error when activating, and afterwards monitoring, the topology.
Here is the source code of the Storm topology I’ve implemented:
BrokerHosts hosts = new ZkHosts("127.0.0.1:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "bytes", "/kafkastorm/", "bytes" + UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.zkServers = Arrays.asList("127.0.0.1");
spoutConfig.zkPort = 2181;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("bytes", kafkaSpout);
builder.setBolt("byteSize", new KafkaByteProcessingBolt()).shuffleGrouping("bytes");
StormTopology topology = builder.createTopology();
Config config = new Config();
StormSubmitter.submitTopology("topology", config, topology);
However, the error message I keep getting when executing the bin/storm monitor <topology_name> -m bytes is the following:
Exception in thread "main" java.lang.IllegalArgumentException: stream: default not found
at org.apache.storm.utils.Monitor.metrics(Monitor.java:223)
at org.apache.storm.utils.Monitor.metrics(Monitor.java:159)
at org.apache.storm.command.monitor$_main.doInvoke(monitor.clj:36)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at org.apache.storm.command.monitor.main(Unknown Source)
By inspecting the worker logs (the worker.log file), I've concluded that the KafkaSpout fails in the open() method.
java.lang.NoClassDefFoundError: org/apache/curator/RetryPolicy
at org.apache.storm.kafka.KafkaSpout.open(KafkaSpout.java:75) ~[storm-kafka-1.0.2.jar:1.0.2]
at org.apache.storm.daemon.executor$fn__7990$fn__8005.invoke(executor.clj:604) ~[storm-core-1.0.2.jar:1.0.2]
at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:482) [storm-core-1.0.2.jar:1.0.2]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]
Caused by: java.lang.ClassNotFoundException: org.apache.curator.RetryPolicy
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[?:1.8.0_101]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_101]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) ~[?:1.8.0_101]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_101]
... 5 more
Could someone explain what might be the reason for the KafkaSpout failing in the open() method?
I would really appreciate your help!
From the error "java.lang.NoClassDefFoundError: org/apache/curator/RetryPolicy" it appears that curator-client.jar is missing from the worker classpath.
Please check whether the link below helps:
https://github.com/abhinavg6/Kafka-Storm-Conscomp/issues/1