Apache Flume: kafka.consumer.ConsumerTimeoutException - apache-kafka

I'm trying to build a pipeline with Apache Flume:
spooldir -> kafka channel -> hdfs sink
Events reach the Kafka topic without problems, and I can see them with a kafkacat request. But the Kafka channel can't deliver them to HDFS via the sink. The error is:
Timed out while waiting for data to come from Kafka
Full log:
2016-02-26 18:25:17,125 (SinkRunner-PollingRunner-DefaultSinkProcessor-SendThread(zoo02:2181)) [DEBUG - org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:717)] Got ping response for sessionid: 0x2524a81676d02aa after 0ms
2016-02-26 18:25:19,127 (SinkRunner-PollingRunner-DefaultSinkProcessor-SendThread(zoo02:2181)) [DEBUG - org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:717)] Got ping response for sessionid: 0x2524a81676d02aa after 1ms
2016-02-26 18:25:21,129 (SinkRunner-PollingRunner-DefaultSinkProcessor-SendThread(zoo02:2181)) [DEBUG - org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:717)] Got ping response for sessionid: 0x2524a81676d02aa after 0ms
2016-02-26 18:25:21,775 (SinkRunner-PollingRunner-DefaultSinkProcessor) [DEBUG - org.apache.flume.channel.kafka.KafkaChannel$KafkaTransaction.doTake(KafkaChannel.java:327)] Timed out while waiting for data to come from Kafka
kafka.consumer.ConsumerTimeoutException
    at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:69)
    at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
    at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
    at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
    at org.apache.flume.channel.kafka.KafkaChannel$KafkaTransaction.doTake(KafkaChannel.java:306)
    at org.apache.flume.channel.BasicTransactionSemantics.take(BasicTransactionSemantics.java:113)
    at org.apache.flume.channel.BasicChannelSemantics.take(BasicChannelSemantics.java:95)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:374)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:745)
My Flume config is:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c2
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/alex/spoolFlume
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://10.12.0.1:54310/logs/flumetest/
a1.sinks.k1.hdfs.filePrefix = flume-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 1000
a1.channels.c2.brokerList=kafka10:9092,kafka11:9092,kafka12:9092
a1.channels.c2.topic=flume_test_001
a1.channels.c2.zookeeperConnect=zoo00:2181,zoo01:2181,zoo02:2181
# Bind the source and sink to the channel
a1.sources.r1.channels = c2
a1.sinks.k1.channel = c2
With a memory channel instead of the Kafka channel, everything works fine.
Thanks for any ideas in advance!

ConsumerTimeoutException means that no new message has been available for a while; it does not mean the connection to Kafka timed out.
From the Kafka documentation (http://kafka.apache.org/documentation.html):
consumer.timeout.ms (default: -1): throw a timeout exception to the consumer if no message is available for consumption after the specified interval.

Kafka's ConsumerConfig class has the "consumer.timeout.ms" configuration property, which Kafka sets by default to -1. Any new Kafka Consumer is expected to override the property with a suitable value.
Below is a reference from the Kafka documentation:
consumer.timeout.ms (default: -1)
By default, this value is -1 and a consumer blocks indefinitely if no new message is available for consumption. By setting the value to a positive integer, a timeout exception is thrown to the consumer if no message is available for consumption after the specified timeout value.
When Flume creates a Kafka channel, it sets consumer.timeout.ms to 100, as seen in the Flume logs at the INFO level. That explains why we see a ton of these ConsumerTimeoutExceptions.
level: INFO Post-validation flume configuration contains configuration for agents: [agent]
level: INFO Creating channels
level: DEBUG Channel type org.apache.flume.channel.kafka.KafkaChannel is a custom type
level: INFO Creating instance of channel c1 type org.apache.flume.channel.kafka.KafkaChannel
level: DEBUG Channel type org.apache.flume.channel.kafka.KafkaChannel is a custom type
level: INFO Group ID was not specified. Using flume as the group id.
level: INFO {metadata.broker.list=kafka:9092, request.required.acks=-1, group.id=flume,
zookeeper.connect=zookeeper:2181, **consumer.timeout.ms=100**, auto.commit.enable=false}
level: INFO Created channel c1
Going by the Flume user guide on Kafka channel settings, I tried to override this value by specifying the line below, but that doesn't seem to work:
agent.channels.c1.kafka.consumer.timeout.ms=5000
Also, we did a load test, pushing data through the channel constantly, and this exception didn't occur during the tests.

I read Flume's source code and found that Flume reads the value of the key "timeout" to set "consumer.timeout.ms".
So you can configure "consumer.timeout.ms" like this:
agent1.channels.kafka_channel.timeout=-1
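Applied to the agent and channel names in the question above (agent a1, channel c2), that would look like the sketch below; -1 makes the consumer block until data arrives instead of throwing, and 60000 is only an example value if you would rather have the take() give up after a bounded wait:
# sketch using the question's agent/channel names
a1.channels.c2.timeout = -1
# or, for example, wait up to 60 seconds before giving up:
# a1.channels.c2.timeout = 60000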

Related

How to set consumer config values for Kafka Mirrormaker-2 2.6.1?

I am attempting to use mirrormaker 2 to replicate data between AWS Managed Kafkas (MSK) in 2 different AWS regions - one in eu-west-1 (CLOUD_EU) and the other in us-west-2 (CLOUD_NA), both running Kafka 2.6.1. For testing I am currently trying just to replicate topics 1 way, from EU -> NA.
I am starting a mirrormaker connect cluster using ./bin/connect-mirror-maker.sh and a properties file (included)
This works fine for topics with small messages on them, but one of my topics has binary messages up to 20 MB in size. When I try to replicate that topic, I get an error every 30 seconds:
[2022-04-21 13:47:05,268] INFO [Consumer clientId=consumer-29, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: {}. (org.apache.kafka.clients.FetchSessionHandler:481)
org.apache.kafka.common.errors.DisconnectException
When logging in DEBUG to get more information we get
[2022-04-21 13:47:05,267] DEBUG [Consumer clientId=consumer-29, groupId=null] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.NetworkClient:784)
[2022-04-21 13:47:05,268] DEBUG [Consumer clientId=consumer-29, groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=consumer-29, correlationId=35) due to node 2 being disconnected (org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient:593)
It gets stuck in a loop constantly disconnecting with request timeout every 30s and then trying again.
Looking at this, I suspect the problem is that request.timeout.ms is at its default (30 s) and the consumer times out trying to read the topic with its many large messages.
I followed the guide at https://github.com/apache/kafka/tree/trunk/connect/mirror to try to configure the consumer properties; however, no matter what I set, the consumer's timeout remains fixed at the default, confirmed both by Kafka printing its config in the log and by timing the interval between the disconnect messages. For example, I set:
CLOUD_EU.consumer.request.timeout.ms=120000
in the properties file that I start MM-2 with.
Based on various guides I have found while looking into this, I have also tried:
CLOUD_EU.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.override.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.override.request.timeout.ms=120000
None of which have worked.
How can I change the consumer's request.timeout.ms setting? The log is approximately 10,000 lines long, but everywhere the ConsumerConfig is printed it shows request.timeout.ms = 30000.
Properties file I am using:
# specify any number of cluster aliases
clusters = CLOUD_EU, CLOUD_NA
# connection information for each cluster
CLOUD_EU.bootstrap.servers = kafka.eu-west-1.amazonaws.com:9092
CLOUD_NA.bootstrap.servers = kafka.us-west-2.amazonaws.com:9092
# enable and configure individual replication flows
CLOUD_EU->CLOUD_NA.enabled = true
CLOUD_EU->CLOUD_NA.topics = METRICS_ATTACHMENTS_OVERSIZE_EU
CLOUD_NA->CLOUD_EU.enabled = false
replication.factor=3
tasks.max = 1
############################# Internal Topic Settings #############################
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
############################ Kafka Settings ###################################
# CLOUD_EU cluster over writes
CLOUD_EU.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.session.timeout.ms=150000

Flink Kafka Sink org.apache.kafka.common.errors.UnsupportedVersionException ERROR

Versions: Flink 1.11.3, Kafka 2.1.1.
My Flink data pipeline is Kafka (source) -> Flink -> Kafka (sink).
When I first submit the job, it works well,
but after the JobManager or TaskManagers fail and are restarted, they throw exceptions:
2020-12-31 10:35:23.831 [objectOperator -> Sink: objectSink (1/1)] WARN o.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer - Encountered error org.apache.kafka.common.errors.InvalidTxnStateException: The producer attempted a transactional operation in an invalid state. while recovering transaction KafkaTransactionState [transactionalId=objectOperator -> Sink: objectSink-bcabd9b643c47ab46ace22db2e1285b6-3, producerId=14698, epoch=7]. Presumably this transaction has been already committed before
2020-12-31 10:35:23.919 [userOperator -> Sink: userSink (1/1)] WARN org.apache.flink.runtime.taskmanager.Task - userOperator -> Sink: userSink (1/1) (2a5a171aa335f444740b4acfc7688d7c) switched from RUNNING to FAILED.
org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
2020-12-31 10:35:24.131 [objectOperator -> Sink: objectSink (1/1)] WARN org.apache.flink.runtime.taskmanager.Task - objectOperator -> Sink: objectSink (1/1) (07fe747a81b31e016e88ea6331b31433) switched from RUNNING to FAILED.
org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default producerId at version 1
I don't know why this error occurs.
My Kafka producer code:
// imports added for completeness; Longs is assumed to be Guava's helper, JsonUtils is my own class
import java.util.Properties;
import com.google.common.primitives.Longs;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.setProperty("bootstrap.servers", servers);
props.setProperty("transaction.timeout.ms", "30000");
// exactly-once: the sink wraps writes in Kafka transactions tied to Flink checkpoints
FlinkKafkaProducer<CountModel> producer = new FlinkKafkaProducer<CountModel>(
        topic,
        (record, timestamp) -> new ProducerRecord<>(
                topic, Longs.toByteArray(record.getUserInKey()), JsonUtils.toJsonBytes(record)),
        props,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
I don't think it's a version issue.
It seems that no one has experienced the same error as me
Each producer is assigned a unique PID when it is initialized. This PID is transparent to the application and is not exposed to the user at all. For a given PID, the sequence number increases from 0, and each topic-partition has its own independent sequence number. When the producer sends data, it attaches a sequence number to each message, and the broker uses it to check whether the data is duplicated. The PID is globally unique, and a new PID is assigned when a producer restarts after a failure; this is one of the reasons idempotence cannot be achieved across sessions.
If you resume from a savepoint, the previous producerId is reused, whereas a new session generates 1000 new producerIds (these IDs last for the entire session, equivalent to the default value), so the producerId will be non-default.

Kafka Connect AWS S3 sink connector doesn't read from topic

I have a simple standalone S3 sink connector. Here is the relevant part of the worker configuration properties:
plugin.path = <plugins directory>
bootstrap.servers = <List of servers on Amazon MKS>
security.protocol = SSL
...
It works fine when I connect it to a locally running Kafka. However, when I connect it to a Kafka broker on AWS (with SSL), it doesn't consume anything. No errors, nothing. As if the topic were empty:
[2020-01-30 10:50:03,597] INFO Started S3 connector task with assigned partitions: [] (io.confluent.connect.s3.S3SinkTask:116)
[2020-01-30 10:50:03,598] INFO WorkerSinkTask{id=xxx} Sink task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:302)
When I enabled DEBUG mode in connect-log4j.properties, I started seeing lots of error messages:
Completed connection to node -2. Fetching API versions. (org.apache.kafka.clients.NetworkClient:914)
Initiating API versions fetch from node -2. (org.apache.kafka.clients.NetworkClient:928)
Connection with YYY disconnected (org.apache.kafka.common.network.Selector:607)
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
...
Node -2 disconnected. (org.apache.kafka.clients.NetworkClient:894)
Initialize connection to node XXX (id: -3 rack: null) for sending metadata request (org.apache.kafka.clients.NetworkClient:1125)
Initiating connection to node XXX (id: -3 rack: null) using address XXX (org.apache.kafka.clients.NetworkClient:956)
Am I missing something in the SSL configuration? Note that a manually created org.apache.kafka.clients.consumer.KafkaConsumer can successfully read from this topic with only "security.protocol = SSL" set.
EDIT:
Here are the connector properties:
name = my-connector
connector.class = io.confluent.connect.s3.S3SinkConnector
topics = some_topic
timestamp.extractor = Record
locale = de_DE
timezone = UTC
storage.class = io.confluent.connect.s3.storage.S3Storage
partitioner.class = io.confluent.connect.storage.partitioner.HourlyPartitioner
format.class = io.confluent.connect.s3.format.bytearray.ByteArrayFormat
s3.bucket.name = some-s3-bucket
s3.compression.type = gzip
flush.size = 3
s3.region = eu-central-1
I had a similar problem, which was solved after I additionally specified the security protocol for the consumer (besides the global one). So just add
consumer.security.protocol = SSL
to the configuration properties.
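For reference, a sketch of how the relevant part of the worker properties would look with both settings; the rest of the file stays as in the question:
# global setting, as already present in the question
security.protocol = SSL
# consumer-prefixed override that is passed on to the sink task's consumer
consumer.security.protocol = SSL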

Flume Kafka sink not able to write complete messages to Kafka Broker

I have written a process where I generate messages through a custom Flume source and use the Flume Kafka sink provided by Hortonworks to write them to Kafka brokers.
During this process I noticed that if the Kafka broker is already running and I then start my Flume agent, it delivers every message to the Kafka broker properly; but when I start the Kafka broker while the Flume agent is already running, the Kafka broker does not receive all of the messages.
When I run the Kafka console consumer to check the count of messages received, I noticed it drops a few records from the beginning and a few from the end.
I have tried multiple combinations in flume.conf, but it is still not working as expected.
Below are the configuration parameters I have provided in flume.conf:
agent.channels = firehose-channel
agent.sources = stress-source
agent.sinks = kafkasink
#################################
# Benchmark Souce Configuration #
#################################
agent.sources.stress-source.type=com.kohls.flume.source.stress.BenchMarkTestScenriao
agent.sources.stress-source.size=5000
agent.sources.stress-source.maxTotalEvents=30000
agent.sources.stress-source.batchSize=200
agent.sources.stress-source.throughputThreshold=4000
agent.sources.stress-source.throughputControlSeconds=1
agent.sources.stress-source.channels=firehose-channel
#################################
# Firehose Channel Configuration #
#################################
agent.channels.firehose-channel.type = file
agent.channels.firehose-channel.checkpointDir = /data/flume/checkpoint
agent.channels.firehose-channel.dataDirs = /data/flume/data
agent.channels.firehose-channel.capacity = 10000
agent.channels.firehose-channel.transactionCapacity = 10000
agent.channels.firehose-channel.useDualCheckpoints=1
agent.channels.firehose-channel.backupCheckpointDir=/data/flume/backup
############################################
# Firehose Sink Configuration - Kafka Sink #
############################################
agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.topic = backoff_test_17
agent.sinks.kafkasink.channel=firehose-channel
agent.sinks.kafkasink.brokerList = sandbox.hortonworks.com:6667
agent.sinks.kafkasink.batchsize = 200
agent.sinks.kafkasink.requiredAcks = 1
agent.sinks.kafkasink.kafka.producer.type = async
agent.sinks.kafkasink.kafka.batch.num.messages = 200
I have also analyzed the Flume log and noticed that the Flume metrics properly show the PUT and TAKE counts.
Please let me know if anyone has any pointers to solve this issue. Thanks in advance for your help.

Kafka unrecoverable if broker dies

We have a Kafka cluster with three brokers (node ids 0, 1, 2) and a ZooKeeper setup with three nodes.
We created a topic "test" on this cluster with 20 partitions and replication factor 2. We are using the Java producer API to send messages to this topic. One of the Kafka brokers intermittently goes down, after which it is unrecoverable. To simulate the case, we killed one of the brokers manually. As per the Kafka architecture it is supposed to recover on its own, but that is not happening. When I describe the topic on the console, I see the number of ISRs reduced to one for a few of the partitions, since one of the brokers was killed. Now, whenever we try to push messages via the producer API (either the Java client or the console producer), we encounter a SocketTimeoutException. A quick look into the logs shows "Unable to fetch the metadata".
WARN [2015-07-01 22:55:07,590] [ReplicaFetcherThread-0-3][] kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-3],
Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 23711; ClientId: ReplicaFetcherThread-0-3;
ReplicaId: 0; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [zuluDelta,2] -> PartitionFetchInfo(11409,1048576),[zuluDelta,14] -> PartitionFetchInfo(11483,1048576).
Possible cause: java.nio.channels.ClosedChannelException
[2015-07-01 23:37:40,426] WARN Fetching topic metadata with correlation id 0 for topics [Set(test)] from broker [id:1,host:abc-0042.yy.xxx.com,port:9092] failed (kafka.client.ClientUtils$)
java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
at kafka.utils.Utils$.read(Utils.scala:380)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Any leads will be appreciated...
From your error "Unable to fetch metadata", the most likely cause is that you have set bootstrap.servers in the producer to only the broker that has died.
Ideally, you should have more than one broker in the bootstrap.servers list, because if one of the brokers fails (or is unreachable) the others can still give you the metadata.
FYI: metadata is the information about a particular topic that tells how many partitions it has, their leader brokers, follower brokers, etc.
So, when a keyed message is produced to a partition, its corresponding leader broker is the one the messages are sent to.
From your question, your ISR set has only one broker. You could try setting bootstrap.servers to that broker.
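As an illustration only, here is a minimal sketch of a producer configured with several bootstrap servers so that metadata can still be fetched when one broker is down. The broker host names are placeholders (not taken from the question), and the modern Java producer API is assumed:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MultiBrokerProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // list several brokers so metadata can still be fetched if one of them is down
        props.put("bootstrap.servers", "broker0:9092,broker1:9092,broker2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // topic name matches the question's "test" topic; key and value are dummies
            producer.send(new ProducerRecord<>("test", "some-key", "some-value"));
        }
    }
}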