Flume Kafka sink not able to write complete messages to Kafka Broker - apache-kafka

I have written a process where I'm generating messages thru custom flume source and Flume Kafka sink provided by Hortonworks to write into Kafka brokers.
During this process i have noticed that if KAFKA broker is already running and then i start my Flume agent it delivers each and every message to the Kafka broker properly but when i starts Kafka broker when Flume agent is already running, KAFKA broker is not able to receive all the messages.
When i run Kafka Console consumer to check the counts of messages received i noticed it is dropping few records from beginning and few records from the end.
I have tried multiple mix and match in Flume.conf but still it is working as expected.
Below are the configuration parameter which i have provided to
Flume.conf -
agent.channels = firehose-channel
agent.sources = stress-source
agent.sinks = kafkasink
#################################
# Benchmark Souce Configuration #
#################################
agent.sources.stress-source.type=com.kohls.flume.source.stress.BenchMarkTestScenriao
agent.sources.stress-source.size=5000
agent.sources.stress-source.maxTotalEvents=30000
agent.sources.stress-source.batchSize=200
agent.sources.stress-source.throughputThreshold=4000
agent.sources.stress-source.throughputControlSeconds=1
agent.sources.stress-source.channels=firehose-channel
#################################
# Firehose Channel Configuration #
#################################
agent.channels.firehose-channel.type = file
agent.channels.firehose-channel.checkpointDir = /data/flume/checkpoint
agent.channels.firehose-channel.dataDirs = /data/flume/data
agent.channels.firehose-channel.capacity = 10000
agent.channels.firehose-channel.transactionCapacity = 10000
agent.channels.firehose-channel.useDualCheckpoints=1
agent.channels.firehose-channel.backupCheckpointDir=/data/flume/backup
############################################
# Firehose Sink Configuration - Kafka Sink #
############################################
agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.topic = backoff_test_17
agent.sinks.kafkasink.channel=firehose-channel
agent.sinks.kafkasink.brokerList = sandbox.hortonworks.com:6667
agent.sinks.kafkasink.batchsize = 200
agent.sinks.kafkasink.requiredAcks = 1
agent.sinks.kafkasink.kafka.producer.type = async
agent.sinks.kafkasink.kafka.batch.num.messages = 200
I have also tried to analyses the flume log and noticed that the flume metrics are properly showing the PUT and TAKE count.
Please let me know if anyone has any pointer to solve this issue. Appreciating your help in advance.

Related

How Apache Flume and Kafka works together?

Regarding this configuration my understanding is flume is reading message to kafka topic source-topic , push this message/event to kafka channel/topic test-topic and then sink consume it and write it to ElasticSearch.
To test this flow, I explicitly pushed 1 message/event to kafka topic source-topic and was expecting this event on sink side. But it did not work for me.
Then I did some debugging on it and thought message / event must be in kafka channel. But when I tried to run the bin/kafka-topics.sh --list --zookeeper localhost:2181 command then it did not return test-topic on console.
Now my question is , is this channel name is not kafka topic ?
if not then how can I query the event from kafka channel or may be if someone can help me to understand this flow.
test.sources = ks
test.sinks = es
test.channels = kc
# SOURCES
test.sources.ks.type = org.apache.flume.source.kafka.KafkaSource
test.sources.ks.zookeeperConnect = 127.0.0.1:2181
test.sources.ks.topic = source-topic
test.sources.ks.groupId = cst
test.sources.ks.batchSize = 1000
test.sources.ks.batchDurationMillis = 1000
test.sources.ks.kafka.consumer.timeout.ms = 100
test.sources.ks.kafka.auto.offset.reset = smallest
# sink
test.sinks.es.type = org.es.TestElasticSearchSink
test.sinks.es.hostNames = 127.0.0.1:9200
test.sinks.es.indexName = test-idx
test.sinks.es.batchSize = 1000
test.sinks.es.iaCacheLifetime = 20
# Normal channel
test.channels.kc.type = org.kc.TestKafkaChannel
test.channels.kc.capacity = 10000
test.channels.kc.transactionCapacity = 1000
test.channels.kc.brokerList = 127.0.0.1:9092
test.channels.kc.topic = test-topic
test.channels.kc.zookeeperConnect = 127.0.0.1:2181
test.channels.kc.parseAsFlumeEvent = false
test.channels.kc.readSmallestOffset = true
test.channels.kc.groupId = test-flume
You will probably want to pre-create all necessary Kafka topics before starting Flume. However, it's not clear what is org.kc.TestKafkaChannel, or org.es.TestElasticSearchSink. Flume has provided classes for both of these (Kafka channel +Elasticsearch sink), I believe, so anything "not working" would begin in either of your "custom" classes here...
Alternatively, Kafka Connect already has an Elasticsearch sink connector, so you don't need an intermediate Kafka topic just to send data between Kafka and Elasticsearch. Logstash would work as well.

How to set consumer config values for Kafka Mirrormaker-2 2.6.1?

I am attempting to use mirrormaker 2 to replicate data between AWS Managed Kafkas (MSK) in 2 different AWS regions - one in eu-west-1 (CLOUD_EU) and the other in us-west-2 (CLOUD_NA), both running Kafka 2.6.1. For testing I am currently trying just to replicate topics 1 way, from EU -> NA.
I am starting a mirrormaker connect cluster using ./bin/connect-mirror-maker.sh and a properties file (included)
This works fine for topics with small messages on them, but one of my topic has binary messages up to 20MB in size. When I try to replicate that topic I get an error every 30 seconds
[2022-04-21 13:47:05,268] INFO [Consumer clientId=consumer-29, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: {}. (org.apache.kafka.clients.FetchSessionHandler:481)
org.apache.kafka.common.errors.DisconnectException
When logging in DEBUG to get more information we get
[2022-04-21 13:47:05,267] DEBUG [Consumer clientId=consumer-29, groupId=null] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.NetworkClient:784)
[2022-04-21 13:47:05,268] DEBUG [Consumer clientId=consumer-29, groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=consumer-29, correlationId=35) due to node 2 being disconnected (org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient:593)
It gets stuck in a loop constantly disconnecting with request timeout every 30s and then trying again.
Looking at this, I suspect that the problem is the request.timeout.ms is on the default (30s) and it times out trying to read the topic with many large messages.
I followed the guide at https://github.com/apache/kafka/tree/trunk/connect/mirror to attempt to configure the consumer properties, however, no matter what I set, the timeout for the consumer remains fixed at the default, confirmed both by kafka outputting its config in the log and by timing how long between the disconnect messages. e.g. I set:
CLOUD_EU.consumer.request.timeout.ms=120000
In the properties that I start MM-2 with.
based on various guides I have found while looking at this, I have also tried
CLOUD_EU.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.override.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.override.request.timeout.ms=120000
None of which have worked.
How can I change the consumer request.timeout setting? The log is approx 10,000 lines long, but everywhere where the ConsumerConfig is logged out it logs request.timeout.ms = 30000
Properties file I am using:
# specify any number of cluster aliases
clusters = CLOUD_EU, CLOUD_NA
# connection information for each cluster
CLOUD_EU.bootstrap.servers = kafka.eu-west-1.amazonaws.com:9092
CLOUD_NA.bootstrap.servers = kafka.us-west-2.amazonaws.com:9092
# enable and configure individual replication flows
CLOUD_EU->CLOUD_NA.enabled = true
CLOUD_EU->CLOUD_NA.topics = METRICS_ATTACHMENTS_OVERSIZE_EU
CLOUD_NA->CLOUD_EU.enabled = false
replication.factor=3
tasks.max = 1
############################# Internal Topic Settings #############################
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
############################ Kafka Settings ###################################
# CLOUD_EU cluster over writes
CLOUD_EU.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.session.timeout.ms=150000

How sink topic, kafka to kafka, using Flume?

I am trying transfer log from topic to another topic. I need connect Kafka to Kafka using Flume. Take a look below:
#
# Flume Conf
#
a1.sources = s1
a1.sinks = k1
a1.channels = c1
# Kafka Source
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.kafka.bootstrap.servers = kafka:9092
a1.sources.s1.kafka.topics = apache
# Kafka Sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = kafka:9092
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000000
a1.channels.c1.transactionCapacity = 1000000
# Bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
The sink are not creating.
If you want to replicate data from one Kafka cluster to another, there are better ways than Flume, including:
MirrorMaker, as #cricket_007 mentioned (open source, part of Apache Kafka)
Confluent's Replicator (commercial tool, 30 day free trial)
uReplicator (open sourced from Uber)
Mirus (open sourced from Salesforce)
Brucke (open source)
If you want a really dirty hack, you can also do something with kafkacat and nc.
Disclaimer: I work for Confluent.

Streaming Kafka Messages to MySQL Database

I want to write Kafka messages to MySQL database. There is an example in this link. In that example, apache flume used for consume messages and writing it to MySQL. I'm using same code and when i run the flume-ng agent and event always becomes null
And my flume.conf.properties file is:
agent.sources=kafkaSrc
agent.channels=channel1
agent.sinks=jdbcSink
agent.channels.channel1.type=org.apache.flume.channel.kafka.KafkaChannel
agent.channels.channel1.brokerList=localhost:9092
agent.channels.channel1.topic=kafkachannel
agent.channels.channel1.zookeeperConnect=localhost:2181
agent.channels.channel1.capacity=10000
agent.channels.channel1.transactionCapacity=1000
agent.channels.channel1.parseAsFlumeEvent=false
agent.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSrc.channels = channel1
agent.sources.kafkaSrc.zookeeperConnect = localhost:2181
agent.sources.kafkaSrc.topic = kafka-mysql
agent.sinks.jdbcSink.type = com.stratio.ingestion.sink.jdbc.JDBCSink
agent.sinks.jdbcSink.connectionString = jdbc:mysql://127.0.0.1:3306/test?useSSL=false
agent.sinks.jdbcSink.username=root
agent.sinks.jdbcSink.password=pass
agent.sinks.jdbcSink.batchSize = 10
agent.sinks.jdbcSink.channel =channel1
agent.sinks.jdbcSink.sqlDialect=MYSQL
agent.sinks.jdbcSink.driver=com.mysql.jdbc.Driver
agent.sinks.jdbcSink.sql=INSERT INTO kafkamsg(msg) VALUES(${body:varchar})
Where I'm wrong?
Thanks.
In my referance example, flume listens kafka for kafka-mysql topic. But this code works for kafkachannel topic. So we need to produce messages to kafkachannel topic, i don't know why.

Apache Flume: kafka.consumer.ConsumerTimeoutException

I'm trying to build pipeline with Apache Flume:
spooldir -> kafka channel -> hdfs sink
Events go to kafka topic without problems and I can see them with kafkacat request. But kafka channel can't write files to hdfs via sink. The error is:
Timed out while waiting for data to come from Kafka
Full log:
2016-02-26 18:25:17,125
(SinkRunner-PollingRunner-DefaultSinkProcessor-SendThread(zoo02:2181))
[DEBUG -
org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:717)]
Got ping response for sessionid: 0x2524a81676d02aa after 0ms
2016-02-26 18:25:19,127
(SinkRunner-PollingRunner-DefaultSinkProcessor-SendThread(zoo02:2181))
[DEBUG -
org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:717)]
Got ping response for sessionid: 0x2524a81676d02aa after 1ms
2016-02-26 18:25:21,129
(SinkRunner-PollingRunner-DefaultSinkProcessor-SendThread(zoo02:2181))
[DEBUG -
org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:717)]
Got ping response for sessionid: 0x2524a81676d02aa after 0ms
2016-02-26 18:25:21,775
(SinkRunner-PollingRunner-DefaultSinkProcessor) [DEBUG -
org.apache.flume.channel.kafka.KafkaChannel$KafkaTransaction.doTake(KafkaChannel.java:327)]
Timed out while waiting for data to come from Kafka
kafka.consumer.ConsumerTimeoutException at
kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:69)
at
kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
at
kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
at
org.apache.flume.channel.kafka.KafkaChannel$KafkaTransaction.doTake(KafkaChannel.java:306)
at
org.apache.flume.channel.BasicTransactionSemantics.take(BasicTransactionSemantics.java:113)
at
org.apache.flume.channel.BasicChannelSemantics.take(BasicChannelSemantics.java:95)
at
org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:374)
at
org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
My FlUME's config is:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c2
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/alex/spoolFlume
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://10.12.0.1:54310/logs/flumetest/
a1.sinks.k1.hdfs.filePrefix = flume-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 1000
a1.channels.c2.brokerList=kafka10:9092,kafka11:9092,kafka12:9092
a1.channels.c2.topic=flume_test_001
a1.channels.c2.zookeeperConnect=zoo00:2181,zoo01:2181,zoo02:2181
# Bind the source and sink to the channel
a1.sources.r1.channels = c2
a1.sinks.k1.channel = c2
With memory channel instead of kafka channel all works good.
Thanks for any ideas in advance!
ConsumerTimeoutException means there is no new message for a long time, doesn't mean connect time out for Kafka.
http://kafka.apache.org/documentation.html
consumer.timeout.ms -1 Throw a timeout exception to the consumer if no message is available for consumption after the specified interval
Kafka's ConsumerConfig class has the "consumer.timeout.ms" configuration property, which Kafka sets by default to -1. Any new Kafka Consumer is expected to override the property with a suitable value.
Below is a reference from Kafka documentation :
consumer.timeout.ms -1
By default, this value is -1 and a consumer blocks indefinitely if no new message is available for consumption. By setting the value to a positive integer, a timeout exception is thrown to the consumer if no message is available for consumption after the specified timeout value.
When Flume creates a Kafka channel, it is setting the timeout.ms value to 100, as seen on the Flume logs at the INFO level. That explains why we see a ton of these ConsumerTimeoutExceptions.
level: INFO Post-validation flume configuration contains configuration for agents: [agent]
level: INFO Creating channels
level: DEBUG Channel type org.apache.flume.channel.kafka.KafkaChannel is a custom type
level: INFO Creating instance of channel c1 type org.apache.flume.channel.kafka.KafkaChannel
level: DEBUG Channel type org.apache.flume.channel.kafka.KafkaChannel is a custom type
level: INFO Group ID was not specified. Using flume as the group id.
level: INFO {metadata.broker.list=kafka:9092, request.required.acks=-1, group.id=flume,
zookeeper.connect=zookeeper:2181, **consumer.timeout.ms=100**, auto.commit.enable=false}
level: INFO Created channel c1
Going by the Flume user guide on Kafka channel settings, I tried to override this value by specifying the below, but that doesn't seem to work though:
agent.channels.c1.kafka.consumer.timeout.ms=5000
Also, we did a load test with pounding data through the channels constantly, and this exception didn't occur during the tests.
I read flume's source code, and found that flume reads value of the key "timeout" for "consumer.timeout.ms".
So you can config the value for "consumer.timeout.ms" like this:
agent1.channels.kafka_channel.timeout=-1