HTTP source configuration doesn't work for Flume - apache-kafka

I am a beginner with Apache Flume. I am trying to pull data from a REST API, take it through Flume, and send it to a Kafka topic, but it is not working so far. The configuration I tried is shown below. There is a test GET API at localhost:8080/kafka/publish/ on the system, and I am trying to get data from it. I pulled the configuration below from the Flume documentation.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 8080
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
a1.sources.r1.handler.nickname = random props
a1.sources.r1.HttpConfiguration.sendServerVersion = false
a1.sources.r1.ServerConnector.idleTimeout = 300
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = simple
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy
Can anyone help me solve this? What is the problem here?
The logs are added below:
2020-12-03 11:16:17,696 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateConfigFilterSet(FlumeConfiguration.java:623)] Agent configuration for 'a1' has no configfilters.
2020-12-03 11:16:17,713 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.isValid(FlumeConfiguration.java:373)] Agent configuration for 'a1' does not contain any valid channels. Marking it as invalid.
2020-12-03 11:16:17,714 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:154)] Agent configuration invalid for agent 'a1'. It will be removed.
2020-12-03 11:16:17,715 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:163)] Post-validation flume configuration contains configuration for agents: []
2020-12-03 11:16:17,718 (conf-file-poller-0) [WARN - org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:139)] No configuration found for this host:a1
2020-12-03 11:16:17,730 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:162)] Starting new configuration:{ sourceRunners:{} sinkRunners:{} channels:{} }
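Going by the validation warnings, the agent is rejected because channel c1 is declared but never given a type and no sinks list (a1.sinks = k1) is declared, so the configuration is dropped before the HTTP source ever starts. A minimal sketch of the missing pieces, assuming a memory channel and reusing the names from the question (the capacity numbers are only illustrative):
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# the channel needs a type, otherwise the agent is marked invalid
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Also note that the Flume HTTP source listens for incoming HTTP POST requests on its port rather than polling a remote GET endpoint, so binding it to 8080 would likely clash with the API already running there.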

Related

Apache flume with kafka source, kafka sink and memory channel - throwing UNKNOWN_TOPIC_OR_PARTITION

I am new to Apache Flume (https://flume.apache.org/). For one of my use cases, I need to move data from a Kafka topic on one cluster (bootstrap: bootstrap1, topic: topic1) to a topic with a different name in a different cluster (bootstrap: bootstrap2, topic: topic2). There are other use cases in the same project for which Flume fits best, and I need to use the same Flume pipeline for this use case, even though there are other options for copying from Kafka to Kafka.
I tried the configs below; the results are noted for each option.
#1: telnet to Kafka sink (bootstrap2, topic2) --> works perfectly.
configs:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = topic2
a1.sinks.k1.kafka.bootstrap.servers = bootstrap2
a1.sinks.k1.kafka.flumeBatchSize = 100
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#2: Kafka as source (bootstrap1, topic1) and logger as sink --> works perfectly.
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 10
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = bootstrap1
a1.sources.r1.kafka.topics = topic1
a1.sources.r1.kafka.consumer.group.id = flume-gis-consumer
a1.sources.r1.backoffSleepIncrement = 1000
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#3: Kafka as source (bootstrap1, topic1) and Kafka as sink (bootstrap2, topic2) --> gives the error shown below the config.
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 10
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = bootstrap1
a1.sources.r1.kafka.topics = topic1
a1.sources.r1.kafka.consumer.group.id = flume-gis-consumer1
a1.sources.r1.backoffSleepIncrement = 1000
# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = topic2
a1.sinks.k1.kafka.bootstrap.servers = bootstrap2
a1.sinks.k1.kafka.flumeBatchSize = 100
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Error:
(kafka-producer-network-thread | producer-1) [WARN - org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleCompletedMetadataResponse(NetworkClient.java:968)] [Producer clientId=producer-1] Error while fetching metadata with correlation id 85 : {topic1=UNKNOWN_TOPIC_OR_PARTITION}
The above error is shown continuously.
Error upon terminating the flume-ng command:
(SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:158)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to publish events
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:268)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.EventDeliveryException: Could not send event
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:234)
... 3 more
Seeking help from the Stack Overflow community on:
What config is going wrong here? The Kafka topics exist in their respective clusters. (Options 1 and 2 work fine, and I can see messages flowing from source to sink.)
Why is the producer thread trying to produce into the source Kafka topic?
I encountered the same issue today. My case is even worse because I host two topics on a single Kafka cluster.
It is really misleading that the producer thread in the Kafka sink produces back into the Kafka source topic.
I fixed the issue by setting allowTopicOverride to false on the Kafka sink.
Quote from the Kafka sink section of the Flume documentation:
allowTopicOverride: Default is true. When set, the sink will allow a message to be produced into a topic specified by the topicHeader property (if provided).
topicHeader: When set in conjunction with allowTopicOverride will produce a message into the value of the header named using the value of this property. Care should be taken when using in conjunction with the Kafka Source topicHeader property to avoid creating a loopback.
And from the Kafka source section:
setTopicHeader: Default is true. When set to true, stores the topic of the retrieved message into a header, defined by the topicHeader property.
So by default, Apache Flume stores the Kafka source topic in the topicHeader of each event, and the Kafka sink by default writes to the topic specified in topicHeader.
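In config terms, the fix would look something like this on the sink from option #3 (a sketch, keeping the values from the question):
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = topic2
a1.sinks.k1.kafka.bootstrap.servers = bootstrap2
# ignore the topic header set by the Kafka source and always produce to kafka.topic
a1.sinks.k1.allowTopicOverride = false
Alternatively, setting setTopicHeader = false on the Kafka source (or giving the sink a different topicHeader) breaks the same loopback.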

File reader configuration doesn't work for Flume

I am new to Flume and was trying my first experiment with it. I am trying to read data from a file using Flume and send it to a Kafka topic.
The configuration is pulled from a tutorial website and is shown below.
a1.sources = r1
a1.sinks = sample
a1.channels = sample-channel
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f \data.txt
a1.sources.r1.logStdErr = true
a1.channels.sample-channel.type = memory
a1.channels.sample-channel.capacity = 1000
a1.channels.sample-channel.transactionCapacity = 100
a1.sources.r1.channels = sample-channel
a1.sinks.sample.topic = sample
a1.sinks.sample.brokerList = 127.0.0.1:9092
a1.sinks.sample.requiredAcks = 1
a1.sinks.sample.batchSize = 20
a1.sinks.sample.channel = sample-channel
But this does not do anything. It isn't throwing any errors, only a few warnings. The log is shown below.
2020-12-03 12:01:17,265 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateConfigFilterSet(FlumeConfiguration.java:623)] Agent configuration for 'a1' has no configfilters.
2020-12-03 12:01:17,291 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSinks(FlumeConfiguration.java:884)] Could not configure sink sample due to: Component has no type. Cannot configure. sample
org.apache.flume.conf.ConfigurationException: Component has no type. Cannot configure. sample
at org.apache.flume.conf.ComponentConfiguration.configure(ComponentConfiguration.java:76)
at org.apache.flume.conf.sink.SinkConfiguration.configure(SinkConfiguration.java:44)
at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSinks(FlumeConfiguration.java:867)
at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.isValid(FlumeConfiguration.java:383)
at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.access$000(FlumeConfiguration.java:228)
at org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:153)
at org.apache.flume.conf.FlumeConfiguration.<init>(FlumeConfiguration.java:133)
at org.apache.flume.node.PropertiesFileConfigurationProvider.getFlumeConfiguration(PropertiesFileConfigurationProvider.java:194)
at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:97)
at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
How can I solve this?
As the error says, the sink sample has no type, so Flume doesn't know what to do with it.
You're missing
a1.sinks.sample.type = org.apache.flume.sink.kafka.KafkaSink
Personally, I'd suggest Filebeat or Telegraf over Flume for taking files to Kafka
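For completeness, a sketch of what the sink block might look like once it has a type, rewritten with the newer kafka.* property names (the topic/brokerList/requiredAcks/batchSize spellings used in the question are the older, deprecated ones):
a1.sinks.sample.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.sample.kafka.topic = sample
a1.sinks.sample.kafka.bootstrap.servers = 127.0.0.1:9092
a1.sinks.sample.kafka.producer.acks = 1
a1.sinks.sample.channel = sample-channel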

How to use Flume to get data from Kafka

Now, I have to use Flume to get data from Kafka. What I want to achieve is to consume the data from Kafka every 30 minutes and write it to a file, which I can then use for online learning. The config is as follows:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.zookeeperConnect = host
a1.sources.r1.kafka.topics = topic
# Use a channel which buffers events in memory
a1.channels.c1.type = file
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /home/flume_log
a1.sinks.k1.sink.rollInterval=30
a1.sinks.k1.channel = c1
Does the config have something wrong? When I run test.conf, although I have produced some data, I can't see it in the file. All the files are empty.
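One thing that stands out, as a hedged suggestion: kafka.zookeeperConnect is not how the Flume Kafka source (1.7+) is pointed at the cluster; it takes the broker list via kafka.bootstrap.servers, as in the other configs on this page. A sketch of the source block (host:9092 is a placeholder for your broker address):
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
# point the source at the Kafka brokers, not at ZooKeeper
a1.sources.r1.kafka.bootstrap.servers = host:9092
a1.sources.r1.kafka.topics = topic
a1.sources.r1.channels = c1
Also note that file_roll's sink.rollInterval is in seconds, so 30 rolls a new file every 30 seconds; something like 1800 would match the 30-minute requirement.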

org.apache.kafka.common.errors.RecordTooLargeException in Flume Kafka Sink

I am trying to read data from a JMS source and push it into a Kafka topic. While doing that, after a few hours I observed that the push frequency to the Kafka topic became almost zero, and after some initial analysis I found the following exception in the Flume logs.
28 Feb 2017 16:35:44,758 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:158) - Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to publish events
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:252)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1399305 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
at org.apache.kafka.clients.producer.KafkaProducer$FutureFailure.<init>(KafkaProducer.java:686)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:449)
at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:212)
... 3 more
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1399305 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
My Flume logs show the currently set value for max.request.size as 1048576, which is clearly much less than 1399305. Increasing max.request.size may eliminate these exceptions, but I am unable to find the correct place to update that value.
My flume.config:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.channels.c1.type = file
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.capacity = 100000000
a1.channels.c1.checkpointDir = /data/flume/apache-flume-1.7.0-bin/checkpoint
a1.channels.c1.dataDirs = /data/flume/apache-flume-1.7.0-bin/data
a1.sources.r1.type = jms
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = some context urls
a1.sources.r1.connectionFactory = some_queue
a1.sources.r1.providerURL = some_url
#a1.sources.r1.providerURL = some_url
a1.sources.r1.destinationType = QUEUE
a1.sources.r1.destinationName = some_queue_name
a1.sources.r1.userName = some_user
a1.sources.r1.passwordFile= passwd
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = some_kafka_topic
a1.sinks.k1.kafka.bootstrap.servers = some_URL
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.flumeBatchSize = 1
a1.sinks.k1.channel = c1
Any help will be really appreciated!
This change has to be done at Kafka.
Update the Kafka producer configuration file producer.properties with a larger value like
max.request.size=10000000
It seems I have resolved my issue. As suspected, increasing max.request.size eliminated the exception. For updating such Kafka sink (producer) properties, Flume provides the constant prefix kafka.producer., and we can append any Kafka producer property to this prefix.
So mine goes as:
a1.sinks.k1.kafka.producer.max.request.size = 5271988
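As a general pattern (a sketch, not specific to this queue setup), any Kafka producer property can be passed through the same prefix on the sink:
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# everything after the kafka.producer. prefix is handed to the underlying Kafka producer as-is
a1.sinks.k1.kafka.producer.max.request.size = 5271988
a1.sinks.k1.kafka.producer.compression.type = snappy
Depending on the broker configuration, the broker-side message.max.bytes (or per-topic max.message.bytes) may also need to be raised before messages this large are accepted.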

Flume. Line deserializer adds unicode symbols to loglines in Kafka channel

I am using Flume with the following config to parse nginx logs and put them into Kafka.
#define sources, channels and sink
a1.sources = r1
a1.channels = c2
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /spool/upload_flume
a1.sources.r1.fileSuffix = .DONE
a1.sources.r1.basenameHeader = false
a1.sources.r1.fileHeader = false
a1.sources.r1.batchSize = 1000
a1.sources.r1.deserializer.maxLineLength = 11000
a1.sources.r1.decodeErrorPolicy = IGNORE
a1.sources.r1.deserializer.outputCharset = UTF-8
#define channels
a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.brokerList=kafka10:9092,kafka11:9092,kafka12:9092
a1.channels.c2.topic = test001_logs
a1.channels.c2.zookeeperConnect = kafka10:2181,kafka11:2181,kafka12:2181
a1.channels.c2.parseAsFlumeEvent = true
# Bind the source and sink to the channel
a1.sources.r1.channels = c2
For some reason, the resulting entries in the Kafka topic have unicode symbols prepended to the log lines. For example:
\00\F4176.124.146.227 1469439200.715 ...
\00\DE185.18.5.6 1469439200.715 3146510 ...
\00\B0176.15.87.26 1469439200.717 80674 ...
Why does this happen, and how can I avoid this problem?
Thanks in advance!
Update: If I use Kafka as a sink for a memory channel with the same spoolDir settings, there are no unicode additions in the resulting entries in the Kafka topic. But this doesn't look like the right solution, because I have to use additional resources for the memory channel.
Try
a1.channels.c2.parseAsFlumeEvent = false
With parseAsFlumeEvent = true, the Kafka channel stores each record as an Avro-serialized Flume event (headers plus body), so a plain Kafka consumer sees those serialized header bytes in front of the log line.
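That is, with the rest of the channel block from the question left unchanged (a sketch):
a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.brokerList = kafka10:9092,kafka11:9092,kafka12:9092
a1.channels.c2.topic = test001_logs
a1.channels.c2.zookeeperConnect = kafka10:2181,kafka11:2181,kafka12:2181
# write only the raw event body to Kafka instead of an Avro-wrapped Flume event
a1.channels.c2.parseAsFlumeEvent = false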