Flume KafkaChannel without KafkaSink? - apache-kafka

I'm trying to write from a Flume source to a Kafka topic. Flume has a Kafka channel, and in this Cloudera post the author says that the Kafka channel may be used
To write to Kafka directly from Flume sources without additional buffering.
But when I try to exclude the sink from my configuration, Flume says
An error occurred while validating this configuration: Component tier1.sinks: Property value missing.
Do I really need to write to the Kafka channel and then read it back, only to write it again through a Kafka sink? That seems strange to me...

No, you don't need to do that; please show me your config file.
A sample in Flume 1.7 goes like this:
source config etc...
agent1.channels.channel_sample.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.channel_sample.kafka.bootstrap.servers = hostnameorip:9092,hostnameorip:9092
agent1.channels.channel_sample.kafka.topic = topic_sample
agent1.channels.channel_sample.kafka.consumer.group.id = consumer_group_sample
If you don't have any sink bound to channel_sample, kafka.consumer.group.id is not essential.
See https://flume.apache.org/FlumeUserGuide.html#kafka-channel for more
Take care: there is a mistake in the documentation; the default value of kafka.consumer.auto.offset.reset is actually earliest, not latest.

Related

How to configure Flume with Kafka channel without source?

It complains if a source is not specified in the configuration. According to the doc:
The Kafka channel can be used for multiple scenarios:
With Flume source and sink - it provides a reliable and highly available channel for events
With Flume source and interceptor but no sink - it allows writing Flume events into a Kafka topic, for use by other apps
With Flume sink, but no source - it is a low-latency, fault tolerant way to send events from Kafka to Flume sinks such as HDFS, HBase or Solr
https://flume.apache.org/FlumeUserGuide.html
I'm interested in scenario 3, but there is no example for it in the official Flume doc.
Regards
The Flume agent source can be omitted from the Flume config on newer versions of CDH (5.14 in my case); only a warning is issued.
You can provide a dummy name for the source, like:
agent.sources = dummySource
agent.sinks = hdfsSink
agent.channels = kafkaChnl
and just provide the configurations for hdfsSink and kafkaChnl, as sketched below.
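A rough sketch of such a configuration, assuming placeholder broker, topic, and HDFS path values rather than anything from your environment, could look like:
agent.sources = dummySource
agent.channels = kafkaChnl
agent.sinks = hdfsSink
agent.channels.kafkaChnl.type = org.apache.flume.channel.kafka.KafkaChannel
agent.channels.kafkaChnl.kafka.bootstrap.servers = hostnameorip:9092
agent.channels.kafkaChnl.kafka.topic = topic_sample
agent.channels.kafkaChnl.parseAsFlumeEvent = false
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = kafkaChnl
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/data/topic_sample
parseAsFlumeEvent = false is the usual setting when plain Kafka producers, rather than another Flume agent, write to the topic.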

In a Kafka connector, how do I get the bootstrap-server address my Kafka Connect is currently using?

I'm developing a Kafka sink connector on my own. My deserializer is JSONConverter. However, when someone sends bad JSON data to my connector's topic, I want to skip that record and send it to a specific topic of my company.
My confusion is: I can't find any API that lets me get my Connect worker's bootstrap.servers. (I know it's in Confluent's etc directory, but it's not a good idea to hard-code the path to connect-distributed.properties just to read the bootstrap.servers.)
So, the question: is there another way to conveniently get the value of bootstrap.servers in my connector program?
Instead of trying to send the "bad" records from a SinkTask to Kafka yourself, use the dead letter queue feature that was added to Kafka Connect in Apache Kafka 2.0.
You can configure the Connect runtime to automatically dump records that failed to be processed to a configured topic acting as a DLQ.
For more details, see KIP-298, which added this feature.
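A hedged sketch of the relevant properties, which go in the sink connector's configuration (the DLQ topic name here is a placeholder):
errors.tolerance = all
errors.deadletterqueue.topic.name = my-connector-dlq
errors.deadletterqueue.topic.replication.factor = 1
errors.deadletterqueue.context.headers.enable = true
errors.log.enable = true
With errors.tolerance = all, records that fail conversion (for example, malformed JSON) are routed to the DLQ topic instead of failing the task, and the context headers record why each record was rejected.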

How to dump avro data from Kafka topic and read it back in Java/Scala

We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro and the schema is stored in the Schema Registry.
We tried the following strategies:
Using kafka-console-consumer with StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates JSON which also includes some raw bytes, for example when deserializing a BigDecimal. We didn't even know which parsing option to choose (it is not Avro, it is not plain JSON).
Other unsuitable strategies:
deploying a custom Kafka consumer would require us to package and place that code on some production server, since we are talking about our production cluster. It just takes too long. After all, isn't the Kafka console consumer already a consumer with configurable options?
Potentially suitable strategies
Using a Kafka Connect sink. We didn't find a simple way to reset the consumer offset, since apparently the consumer created by the connector stays active even when we delete the sink.
Isn't there a simple, easy way to dump the content of the values (not the schema) of a Kafka topic containing Avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus the correct Avro Java API.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use the regular console consumer. You would use kafka-avro-console-consumer, which deserializes the binary Avro data into JSON for you to read on the console. You can redirect its output to a file, e.g. > topic.txt, and read it later.
If you did use the console consumer, you can't parse the Avro immediately because you still need to extract the schema ID from the data (4 bytes after the first "magic byte"), then use the Schema Registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read the file as the console consumer writes it expects one entire schema at the head of the file, not just an ID pointing into the registry on every line. (The basic Avro library doesn't know anything about the registry either.)
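If you do end up decoding those raw bytes yourself, a minimal Java sketch along these lines shows the idea; the registry URL is a placeholder, error handling is omitted, and getById is deprecated in newer registry clients in favour of getSchemaById:
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class WireFormatDecoder {
    // Decodes a single record value written in the Confluent wire format:
    // 1 magic byte (0x0), a 4-byte schema ID, then the Avro-encoded payload.
    public static GenericRecord decode(byte[] value, SchemaRegistryClient registry) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(value);
        if (buf.get() != 0x0) {
            throw new IllegalArgumentException("Not in Confluent wire format");
        }
        int schemaId = buf.getInt();
        Schema schema = registry.getById(schemaId); // fetches the writer schema from the registry
        BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(value, buf.position(), value.length - buf.position(), null);
        return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        SchemaRegistryClient registry =
                new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
        // the byte[] value would come from a ConsumerRecord<byte[], byte[]>
        // fetched with ByteArrayDeserializer
    }
}
In practice, io.confluent.kafka.serializers.KafkaAvroDeserializer does the same work for you if you would rather not touch the wire format directly.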
The only things configurable about the console consumer are the formatter and the registry. You can add decoders by additionally putting them on the CLASSPATH.
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See Schema Registry documentation
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileSink Connector as well
We didn't find a simple way to reset the consumer offset
The connector name controls whether a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest a standalone connector, where you can set the offset.storage.file.filename property to control how the offsets are stored.
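As a rough sketch, assuming placeholder file paths, topic name, and registry URL, a standalone worker plus a FileStreamSink job for a quick-and-dirty dump of record values could look like:
# worker.properties (standalone Connect worker)
bootstrap.servers=localhost:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
offset.storage.file.filename=/tmp/connect.offsets

# file-sink.properties (dumps the string form of each value to a file)
name=avro-dump
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
topics=my-topic
file=/tmp/my-topic.txt
Run it with connect-standalone worker.properties file-sink.properties; picking a new connector name starts a fresh consumer group (connect-<name>), which is the simplest way to re-read the topic from scratch.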
KIP-199 discusses resetting consumer offsets for Connect, but the feature isn't implemented.
However, did you see that Kafka 0.11 added a way to reset consumer offsets?
Alternative options include Apache NiFi and StreamSets; both integrate with the Schema Registry and can parse Avro data to transport it to numerous systems.
One option to consider, along with cricket_007's, is to simply replicate data from one cluster to another. You can use Apache Kafka MirrorMaker to do this, or Replicator from Confluent. Both give you the option of selecting certain topics to be replicated from one cluster to another, such as a test environment.
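As a rough example of the MirrorMaker (v1) approach, with placeholder file names and topic:
kafka-mirror-maker.sh --consumer.config source-cluster.properties --producer.config target-cluster.properties --whitelist "production-topic"
where source-cluster.properties holds bootstrap.servers and group.id for the production cluster, and target-cluster.properties holds bootstrap.servers for the test cluster.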

How to use Flume's Kafka Channel without specifying a source

I have an existing Kafka topic and a Flume agent that reads from it and writes to HDFS. I want to reconfigure my Flume agent so it moves away from the existing setup (a Kafka source and file channel to an HDFS sink) and uses a Kafka channel instead.
I read in the Cloudera documentation that it is possible to achieve this by using only a Kafka channel and an HDFS sink (without a Flume source), unless I have got the wrong end of the stick. So I tried to create this configuration, but it isn't working. It's not even starting the Flume process on the box.
# Test
test.channels = kafka-channel
test.sinks = hdfs-sink
test.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
test.channels.kafka-channel.kafka.bootstrap.servers = localhost:9092
test.channels.kafka-channel.kafka.topic = test
test.channels.kafka-channel.parseAsFlumeEvent = false
test.sinks.hdfs-sink.channel = kafka-channel
test.sinks.hdfs-sink.type = hdfs
test.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8082/data/test/
I'm using:
HDP Quickstart VM 2.6.3
Flume version 1.5.2
The HDFS directory does exist
ps -ef | grep flume only returns a process once I add a kafka-source, but this can't be right because doing so creates an infinite loop for any messages published onto the topic.
Is it possible to only use a Kafka Channel and HDFS Sink or do I need to use a kafka-source but change some other configurations that will prevent the infinite loop of messages?
Kafka-source -> kafka-channel -> HDFS Sink - This doesn't seem right to me.
After digging around a bit I noticed that Ambari didn't create any Flume conf files for the specified agent. Ambari seems to only create/update the Flume config if I specify test.sources = kafka-source. Once I added this to the Flume config (via Ambari), the config was created on the box and the Flume agent started successfully.
The final flume config looked like this:
test.sources=kafka-source
test.channels = kafka-channel
test.sinks = hdfs-sink
test.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
test.channels.kafka-channel.kafka.bootstrap.servers = localhost:9092
test.channels.kafka-channel.kafka.topic = test
test.channels.kafka-channel.parseAsFlumeEvent = false
test.sinks.hdfs-sink.channel = kafka-channel
test.sinks.hdfs-sink.type = hdfs
test.sinks.hdfs-sink.hdfs.path = hdfs:///data/test
Notice I didn't set any of the properties on the source (that would cause the infinite loop issue I mentioned in my question); it just needs to be declared so that Ambari creates the Flume config and starts the agent.
This doesn't directly answer your question about Flume, but in general since you're already using Apache Kafka this pattern is best solved using Kafka Connect (which is part of Apache Kafka).
There is a Kafka Connect HDFS connector which is simple to use, as per this guide.
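A hedged sketch of the Confluent HDFS sink connector configuration, with placeholder topic, namenode URL, and flush size:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test
hdfs.url=hdfs://localhost:8020
flush.size=1000
hdfs.url and flush.size are the key connector-specific settings; the Kafka consumer plumbing is handled by the Connect worker.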

Is it possible to use flume as kafka producer for log ingestion?

I have a task of configuring a simple pipeline for app log ingestion.
A prerequisite for this pipeline is to use kafka as the transport protocol.
As I understand it, Flume has a built-in capability for ingesting log files.
Is there a way to use flume as a producer, and have it pass its output onto a kafka topic?
Yes, you can use Flume as a producer for Kafka.
Have a look at this API provided by Flume: https://flume.apache.org/releases/content/1.6.0/apidocs/org/apache/flume/sink/kafka/KafkaSink.html
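For example, a minimal agent that tails an application log and produces to a Kafka topic might be sketched like this, using Flume 1.7+ property names; the file path, topic, and broker are placeholders:
a1.sources = logsrc
a1.channels = memch
a1.sinks = kafkasink
a1.sources.logsrc.type = exec
a1.sources.logsrc.command = tail -F /var/log/app/app.log
a1.sources.logsrc.channels = memch
a1.channels.memch.type = memory
a1.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafkasink.kafka.topic = app-logs
a1.sinks.kafkasink.kafka.bootstrap.servers = hostnameorip:9092
a1.sinks.kafkasink.channel = memch
You could also use the Kafka channel discussed in the other questions here instead of a separate channel and sink.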
Yes, you can, as specified in the previous response.
I just want to add that you need configuration similar to:
# Sources, channels, and sinks are defined per
# agent name, in this case flume1.
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
# For each source, channel, and sink, set
# standard properties.
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = ...