It complains if a source is not specified in the configuration. According to the docs:
The Kafka channel can be used for multiple scenarios:
With Flume source and sink - it provides a reliable and highly available channel for events
With Flume source and interceptor but no sink - it allows writing Flume events into a Kafka topic, for use by other apps
With Flume sink, but no source - it is a low-latency, fault-tolerant way to send events from Kafka to Flume sinks such as HDFS, HBase or Solr
https://flume.apache.org/FlumeUserGuide.html
I'm interested in scenario 3, but there is no example for it in the official Flume documentation.
Regards
The Flume agent source can be omitted from the Flume config on newer versions of CDH (5.14 in my case); only a warning is issued.
You can provide a dummy name for the source, like:
agent.sources = dummySource
agent.sinks = hdfsSink
agent.channels = kafkaChnl
and just provide the configurations for hdfsSink and kafkaChnl
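A minimal sketch of the rest of that config, reusing the component names above; the broker addresses, topic, consumer group and HDFS path are placeholders you would replace with your own:
# Kafka channel pulls events straight from the topic; no Flume source is actually used
agent.channels.kafkaChnl.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.kafkaChnl.kafka.bootstrap.servers = broker1:9092,broker2:9092
agent.channels.kafkaChnl.kafka.topic = events_topic
agent.channels.kafkaChnl.kafka.consumer.group.id = flume_hdfs_group
# HDFS sink drains the channel
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = kafkaChnl
agent.sinks.hdfsSink.hdfs.path = /flume/events/%Y-%m-%d
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true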
We are currently on HDF (Hortonworks DataFlow) 3.3.1, which bundles Kafka 2.0.0, and we are trying to use Kafka Connect in distributed mode to launch a Google Cloud Pub/Sub sink connector.
We are planning on sending some metadata back to a Kafka topic and need to integrate a Kafka producer into the flush() function of the sink task's Java code.
Would this have a negative impact on the process where Kafka Connect commits offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
Also, how does Kafka Connect get the bootstrap server list from the configuration when it is not specified in the connector properties for either the sink or the source? I need to use the same bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding the bootstrap server list as a property and parsing it in the connector's Java code. I would like to use the bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously, adding more blocking code makes the other processes slower overall.
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
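Those worker-level settings live in the file you start the worker with, along the lines of the sketch below (values are placeholders):
# connect-distributed.properties (worker config, not connector config)
bootstrap.servers=broker1:9092,broker2:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# offset.storage.topic, config.storage.topic and status.storage.topic also go here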
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs is the only way. (Feel free to file a Kafka JIRA requesting such a feature of exposing the worker configs, though.)
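As a sketch of that workaround, the sink connector config can carry a custom key (metadata.bootstrap.servers below is a made-up name, not a built-in property) that the task reads in start() and uses to build its producer; in distributed mode these key/value pairs go in the JSON body of the REST request that creates the connector:
name=pubsub-sink
connector.class=<your Pub/Sub sink connector class>
tasks.max=1
topics=app-events
# custom property, read via props.get("metadata.bootstrap.servers") in SinkTask.start()
metadata.bootstrap.servers=broker1:9092,broker2:9092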
Messages are not getting posted to the output topic.
Is there any cluster-level configuration required to use the Kafka Streams APIs? For information, the Kafka version is > 0.10.0.0.
I'm trying to write from a Flume source to a Kafka topic. There is a Kafka channel in Flume, and in this Cloudera post the author says that the Kafka channel may be used
To write to Kafka directly from Flume sources without additional buffering.
But when I try to exclude the sink from my configuration, Flume says
An error occurred while validating this configuration: Component tier1.sinks: Property value missing.
Do I really need to write to the Kafka channel and read it back just to write again to a Kafka sink? That seems strange to me...
No, you don't need to do that. Please show me your config file.
A sample in Flume 1.7 goes like this:
source config etc...
agent1.channels.channel_sample.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.channel_sample.kafka.bootstrap.servers = hostnameorip:9092,hostnameorip:9092
agent1.channels.channel_sample.kafka.topic = topic_sample
agent1.channels.channel_sample.kafka.consumer.group.id = consumer_group_sample
If you don't need any sink binding to this channel_sample, kafka.consumer.group.id is not essential.
See https://flume.apache.org/FlumeUserGuide.html#kafka-channel for more
Take care: there is a mistake in the document; the default value of kafka.consumer.auto.offset.reset is earliest, not latest.
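A complete no-sink sketch, assuming an exec source tailing a file; the file path, hostnames and topic name are placeholders, and whether an omitted sinks list is accepted silently or with a warning depends on your Flume/CDH version:
agent1.sources = src_sample
agent1.channels = channel_sample
# any source will do; an exec source tailing a log file is just an example
agent1.sources.src_sample.type = exec
agent1.sources.src_sample.command = tail -F /var/log/app/app.log
agent1.sources.src_sample.channels = channel_sample
agent1.channels.channel_sample.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.channel_sample.kafka.bootstrap.servers = hostnameorip:9092,hostnameorip:9092
agent1.channels.channel_sample.kafka.topic = topic_sample
# write the raw event body so non-Flume consumers can read the topic directly
agent1.channels.channel_sample.parseAsFlumeEvent = false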
I have a task of configuring a simple pipeline for app log ingestion.
A prerequisite for this pipeline is to use Kafka as the transport.
As I understand it, Flume has a built-in capability for ingesting log files.
Is there a way to use Flume as a producer and have it pass its output to a Kafka topic?
Yes, you can use Flume as a producer for Kafka.
Have a look at this API provided by Flume: https://flume.apache.org/releases/content/1.6.0/apidocs/org/apache/flume/sink/kafka/KafkaSink.html
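A sketch of such a producer-style agent for the 1.6.0 KafkaSink linked above; the agent name, spool directory, broker list and topic are placeholders (Flume 1.7+ renames brokerList/topic to kafka.bootstrap.servers/kafka.topic):
producer1.sources = applogs
producer1.channels = mem-channel
producer1.sinks = kafka-sink
# spooling-directory source picks up completed log files dropped into a directory
producer1.sources.applogs.type = spooldir
producer1.sources.applogs.spoolDir = /var/log/app/spool
producer1.sources.applogs.channels = mem-channel
producer1.channels.mem-channel.type = memory
producer1.channels.mem-channel.capacity = 10000
producer1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
producer1.sinks.kafka-sink.brokerList = broker1:9092,broker2:9092
producer1.sinks.kafka-sink.topic = app-logs
producer1.sinks.kafka-sink.channel = mem-channel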
Yes, you can, as specified in the previous response.
Just want to add that you need a configuration similar to:
# Sources, channels, and sinks are defined per
# agent name, in this case flume1.
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
# For each source, channel, and sink, set
# standard properties.
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = ...
In Flume, I have a Kafka channel from which I can read and write data.
How does the read and write performance of the Kafka channel change if I replace the Kafka source and Kafka sink with an Avro source and Avro sink?
In my opinion, by replacing the Kafka source with an Avro source, I will be unable to read data in parallel from multiple partitions of the Kafka broker, as there is no consumer group specified for an Avro source. Please correct me if I am wrong.
In Flume, the Avro RPC source binds to a specified TCP port of a network interface, so only one Avro source of one of the Flume agents running on a single machine can ever receive events sent to this port.
Avro source is meant to connect two or more Flume agents together: one or more Avro sinks connect to a single Avro source.
As you point out, using Kafka as a source allows for events to be received by several consumer groups. However, my experience with Flume 1.6.0 is that it is faster to push events from one Flume agent to another on a remote host through Avro RPC rather than through Kafka.
So I ended up with the following setup for log data collection:
[Flume agent on remote collected node] =Avro RPC=> [Flume agent in central cluster] =Kafka=> [multiple consumer groups in central cluster]
This way, I got better log ingestion and processing throughput and I also could encrypt and compress log data between remote sites and central cluster. This may however change when Flume adds support for the new protocol introduced by Kafka 0.9.0 in a future version, possibly making Kafka more usable as the front interface of the central cluster with remote data collection nodes (see here).
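A sketch of those two hops; the agent names, hostnames, ports, file path and topic are made up for illustration, and the Kafka channel property names shown are the Flume 1.7+ ones (1.6 used brokerList/topic/zookeeperConnect instead):
# remote collection node: local source -> file channel -> Avro sink pointing at the central agent
remote.sources = applog
remote.channels = file-ch
remote.sinks = to-central
remote.sources.applog.type = exec
remote.sources.applog.command = tail -F /var/log/app/app.log
remote.sources.applog.channels = file-ch
remote.channels.file-ch.type = file
remote.sinks.to-central.type = avro
remote.sinks.to-central.hostname = central-flume.example.com
remote.sinks.to-central.port = 4545
remote.sinks.to-central.compression-type = deflate
# ssl = true plus keystore/truststore settings would add the encryption mentioned above
remote.sinks.to-central.channel = file-ch
# central agent: Avro source -> Kafka channel, consumed by the downstream consumer groups
central.sources = from-remotes
central.channels = kafka-ch
central.sources.from-remotes.type = avro
central.sources.from-remotes.bind = 0.0.0.0
central.sources.from-remotes.port = 4545
central.sources.from-remotes.compression-type = deflate
central.sources.from-remotes.channels = kafka-ch
central.channels.kafka-ch.type = org.apache.flume.channel.kafka.KafkaChannel
central.channels.kafka-ch.kafka.bootstrap.servers = broker1:9092,broker2:9092
central.channels.kafka-ch.kafka.topic = collected-logs
central.channels.kafka-ch.parseAsFlumeEvent = false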