How to use Flume's Kafka Channel without specifying a source

I have an existing Kafka topic and a Flume agent that reads from it and writes to HDFS. I want to reconfigure my Flume agent so that it moves away from the existing setup (a Kafka source and file channel feeding an HDFS sink) and uses a Kafka channel instead.
I read in the Cloudera documentation that it is possible to achieve this with only a Kafka channel and an HDFS sink (without a Flume source), unless I have got the wrong end of the stick. So I tried to create this configuration, but it isn't working; the Flume process doesn't even start on the box.
# Test
test.channels = kafka-channel
test.sinks = hdfs-sink
test.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
test.channels.kafka-channel.kafka.bootstrap.servers = localhost:9092
test.channels.kafka-channel.kafka.topic = test
test.channels.kafka-channel.parseAsFlumeEvent = false
test.sinks.hdfs-sink.channel = kafka-channel
test.sinks.hdfs-sink.type = hdfs
test.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8082/data/test/
I'm using:
HDP Quickstart VM 2.6.3
Flume version 1.5.2
The HDFS directory does exist
ps -ef | grep flume only returns a process once I add a kafka-source, but that can't be right, because doing so creates an infinite loop for any messages published to the topic.
Is it possible to use only a Kafka channel and an HDFS sink, or do I need a kafka-source plus some other configuration change that prevents the infinite loop of messages?
Kafka-source -> kafka-channel -> HDFS Sink - This doesn't seem right to me.

After digging around a bit I noticed that Ambari didn't create any Flume conf files for the specified agent. Ambari seems to only create/update the Flume config if I specify test.sources = kafka-source. Once I added this to the Flume config (via Ambari), the config was created on the box and the Flume agent started successfully.
The final flume config looked like this:
test.sources = kafka-source
test.channels = kafka-channel
test.sinks = hdfs-sink
test.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
test.channels.kafka-channel.kafka.bootstrap.servers = localhost:9092
test.channels.kafka-channel.kafka.topic = test
test.channels.kafka-channel.parseAsFlumeEvent = false
test.sinks.hdfs-sink.channel = kafka-channel
test.sinks.hdfs-sink.type = hdfs
test.sinks.hdfs-sink.hdfs.path = hdfs:///data/test
Notice that I didn't set any properties on the source (doing so would cause the infinite-loop issue I mentioned in my question); it just needs to be declared so that Ambari creates the Flume config and starts the agent.

This doesn't directly answer your question about Flume, but in general, since you're already using Apache Kafka, this pattern is best solved with Kafka Connect (which is part of Apache Kafka).
There is a Kafka Connect HDFS connector which is simple to use; see the guide linked here.
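For reference, a minimal sketch of a standalone-mode properties file, assuming the Confluent HDFS sink connector referenced above; the connector name, topic, HDFS URL, and flush size are illustrative, not taken from the question:
name=hdfs-sink-test
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test
hdfs.url=hdfs://localhost:8020
flush.size=1000
This would be launched with connect-standalone.sh together with a worker properties file pointing at the same Kafka cluster.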

Related

MSK, IAM, and Kafka Java Api

For some reason I can't get my connection quite right to MSK via the Kafka Java API. I can get producers/consumers to work with MSK using Conduktor and the Kafka CLI tools. However, when I try to hook up my Scala code I can't get it to work. I am using the following config to connect via Conduktor and the Kafka CLI tools:
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
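For context, a client properties file like the one above can be passed to the stock console tools roughly as follows (the file name, topic, and broker address are illustrative):
kafka-console-producer.sh --bootstrap-server b-1.example.amazonaws.com:9098 --topic test --producer.config client.properties
kafka-console-consumer.sh --bootstrap-server b-1.example.amazonaws.com:9098 --topic test --from-beginning --consumer.config client.properties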
For my Scala application I am setting up producers/consumers using a similar pattern:
def props: Properties = {
  val p = new Properties()
  ....
  p.setProperty("security.protocol", "SASL_SSL")
  p.setProperty("sasl.mechanism", "AWS_MSK_IAM")
  p.setProperty("sasl.jaas.config", "software.amazon.msk.auth.iam.IAMLoginModule required;")
  p.setProperty("sasl.client.callback.handler.class", "software.amazon.msk.auth.iam.IAMClientCallbackHandler")
  p
}
val PRODUCER = new KafkaConsumer[AnyRef, AnyRef](props)
The code works when I omit the security config lines and run against a local instance of Kafka, but when I try to hit MSK it seems that the consumer isn't constructed correctly and I get the following error:
java.lang.IllegalStateException: You can only check the position for partitions assigned to this consumer.
However, the locally running instance works, which makes me think I'm not setting something up correctly in the config to connect to MSK.
I am trying to follow this tutorial, and I am using Scala 2.11 and Kafka version 2.4.1. I also added aws-msk-iam-auth (1.1.0) to my build.sbt. Any thoughts or solutions?
This turned out not to be a problem with my AWS connection, which I confirmed by adding some logging as explained here. My problem lies in the differences between my locally running version of Kafka and MSK; I am still trying to understand those differences.
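For completeness, here is a minimal sketch of the consumer setup the question describes, assuming the aws-msk-iam-auth library is on the classpath; the broker address, group id, and topic are made up:
import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object MskConsumerSketch {
  def props: Properties = {
    val p = new Properties()
    // Illustrative values; replace with your MSK bootstrap brokers and group id
    p.setProperty("bootstrap.servers", "b-1.example.amazonaws.com:9098")
    p.setProperty("group.id", "example-group")
    p.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    p.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    p.setProperty("security.protocol", "SASL_SSL")
    p.setProperty("sasl.mechanism", "AWS_MSK_IAM")
    p.setProperty("sasl.jaas.config", "software.amazon.msk.auth.iam.IAMLoginModule required;")
    p.setProperty("sasl.client.callback.handler.class", "software.amazon.msk.auth.iam.IAMClientCallbackHandler")
    p
  }

  def main(args: Array[String]): Unit = {
    val consumer = new KafkaConsumer[String, String](props)
    // Subscribe before polling; calling position() on partitions that are not
    // assigned is what raises the IllegalStateException quoted above
    consumer.subscribe(Collections.singletonList("test"))
    val records = consumer.poll(Duration.ofSeconds(5))
    records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()
  }
}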

How to configure Flume with Kafka channel without source?

Flume complains if a source is not specified in the configuration. According to the documentation:
The Kafka channel can be used for multiple scenarios:
With Flume source and sink - it provides a reliable and highly available channel for events
With Flume source and interceptor but no sink - it allows writing Flume events into a Kafka topic, for use by other apps
With Flume sink, but no source - it is a low-latency, fault-tolerant way to send events from Kafka to Flume sinks such as HDFS, HBase or Solr
https://flume.apache.org/FlumeUserGuide.html
I'm interested in scenario 3, but there is no example of it in the official Flume documentation.
The Flume agent source can be omitted from the Flume config on newer versions of CDH (5.14 in my case); only a warning is issued.
You can provide a dummy name for the source, like:
agent.sources = dummySource
agent.sinks = hdfsSink
agent.channels = kafkaChnl
and then just provide configuration for hdfsSink and kafkaChnl, as in the sketch below.
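Putting that together, a minimal sketch of a full agent config might look like this; the topic, broker address, and HDFS path are illustrative:
agent.sources = dummySource
agent.channels = kafkaChnl
agent.sinks = hdfsSink
# dummySource is only declared, never configured
agent.channels.kafkaChnl.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.kafkaChnl.kafka.bootstrap.servers = localhost:9092
agent.channels.kafkaChnl.kafka.topic = test
agent.channels.kafkaChnl.parseAsFlumeEvent = false
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = kafkaChnl
agent.sinks.hdfsSink.hdfs.path = hdfs:///data/test
agent.sinks.hdfsSink.hdfs.fileType = DataStream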

Flume 1.7 & Kafka - How to restart at the beginning of the topic?

I'm using a Flume 1.7 Kafka source to pull data out of Apache Kafka into my AbstractSink. In the past I could restart from the beginning of the topic by deleting the group's offsets with ./kafka-consumer-groups.sh --delete, but since Flume 1.7 (apparently) uses the "new" consumer, attempting ./kafka-consumer-groups.sh --delete now gives the following error message:
Option [delete] is not valid with [new-consumer]. Note that there's no need to delete group metadata for the new consumer as it is automatically deleted when the last member leaves
So, what is the recommended way to achieve the desired behavior (re-processing the data from the beginning of the topic)?
Here is part of my flume config:
myagent.sources.my-kafka-source.type = org.apache.flume.source.kafka.KafkaSource
myagent.sources.my-kafka-source.kafka.bootstrap.servers = kafka.example.net:9092
myagent.sources.my-kafka-source.kafka.consumer.group.id = my-gid
myagent.sources.my-kafka-source.kafka.topics = my.topic
myagent.sources.my-kafka-source.kafka.auto.offset.reset = earliest
myagent.sources.my-kafka-source.channels = my_channel
Flume does not offer direct support for rewinding, although Kafka does ship with KafkaConsumer#seek, which allows you to re-consume messages. It seems you have to use a new group id to do this, which requires restarting the Flume agent.
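In practice that amounts to changing the group id in the source config while keeping auto.offset.reset = earliest, then restarting the agent; the new group id below is illustrative:
myagent.sources.my-kafka-source.kafka.consumer.group.id = my-gid-rewind-001
myagent.sources.my-kafka-source.kafka.auto.offset.reset = earliest
Because the new group has no committed offsets, earliest makes the source start again from the beginning of the topic.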

Flume KafkaChannel without KafkaSink?

I'm trying to write from a Flume source to a Kafka topic. There is a Kafka channel in Flume, and in this Cloudera post the author says that the Kafka channel may be used
To write to Kafka directly from Flume sources without additional buffering.
But when I try to exclude the sink from my configuration, Flume says:
An error occurred while validating this configuration: Component tier1.sinks: Property value missing.
Do I really need to write to the Kafka channel and read it back just to write to a Kafka sink again? That seems strange to me...
No, you don't need to do that. Please show me your config file.
A sample in Flume 1.7 goes like this:
source config etc...
agent1.channels.channel_sample.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.channel_sample.kafka.bootstrap.servers = hostnameorip:9092,hostnameorip:9092
agent1.channels.channel_sample.kafka.topic = topic_sample
agent1.channels.channel_sample.kafka.consumer.group.id = consumer_group_sample
If you don't need any sink bound to channel_sample, kafka.consumer.group.id is not essential.
See https://flume.apache.org/FlumeUserGuide.html#kafka-channel for more
Take care: there is a mistake in the documentation; the default value of kafka.consumer.auto.offset.reset is earliest, not latest.
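For the source-to-Kafka scenario quoted from the Cloudera post, a sketch might simply bind a source straight to the channel and declare no sinks; the netcat source and port below are made up, and channel_sample is configured as in the sample above:
agent1.sources = src_sample
agent1.channels = channel_sample
agent1.sources.src_sample.type = netcat
agent1.sources.src_sample.bind = 0.0.0.0
agent1.sources.src_sample.port = 44444
agent1.sources.src_sample.channels = channel_sample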

Is it possible to use Flume as a Kafka producer for log ingestion?

I have a task of configuring a simple pipeline for app log ingestion.
A prerequisite for this pipeline is to use Kafka as the transport.
As I understand it, Flume has a built-in capability for ingesting log files.
Is there a way to use Flume as a producer and have it pass its output on to a Kafka topic?
Yes, you can use Flume as a producer for Kafka.
Have a look at this API provided by Flume: https://flume.apache.org/releases/content/1.6.0/apidocs/org/apache/flume/sink/kafka/KafkaSink.html
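For illustration, a sketch of an agent that tails an application log and publishes each line to Kafka via the Kafka sink, assuming Flume 1.7+ property names; the agent name, file path, topic, and broker address are made up:
a1.sources = tail-source
a1.channels = mem-channel
a1.sinks = kafka-sink
a1.sources.tail-source.type = TAILDIR
a1.sources.tail-source.filegroups = f1
a1.sources.tail-source.filegroups.f1 = /var/log/myapp/app.log
a1.sources.tail-source.channels = mem-channel
a1.channels.mem-channel.type = memory
a1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafka-sink.kafka.bootstrap.servers = localhost:9092
a1.sinks.kafka-sink.kafka.topic = app-logs
a1.sinks.kafka-sink.channel = mem-channel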
Yes, you can, as specified in the previous response.
I just want to add that you need a configuration similar to:
# Sources, channels, and sinks are defined per
# agent name, in this case flume1.
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
# For each source, channel, and sink, set
# standard properties.
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = ...