Insert filename to kafka through Flume spool directory source - apache-kafka

I'm using spooldir as the Flume source with a Kafka sink. Is there any way I can transfer both the file content and the filename to Kafka?
For example, if the filename is test.txt and the content is hello world, I need to get:
hello world
test.txt

Some sources allow adding the name of the file as a header of the Flume event created from the input data; that's the case for the spooldir source.
And some sinks allow configuring the serializer used for writing the data, such as the HDFS one; in that case, I've read there exists a header_and_text serializer (never tested it). Nevertheless, the Kafka sink does not expose parameters for doing that.
So, IMHO, you have two options:
Configure the spooldir source to add the above-mentioned header with the file name, and develop a custom interceptor in charge of modifying the data with that header value (see the configuration sketch after these two options). Interceptors are pieces of code running at the output of the sources that "intercept" the events and modify them before they are effectively put into the Flume channel.
Modify the data you send to the spooldir source by adding a first data line about the file name.
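A minimal sketch of the first option, assuming a Flume agent named a1; the interceptor class, directories, topic, and broker address are illustrative, and the interceptor itself is something you would have to write:

a1.sources = src1
a1.channels = ch1
a1.sinks = k1

# Spooldir source: basenameHeader adds the file name as an event header named "basename"
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /var/spool/flume
a1.sources.src1.basenameHeader = true
a1.sources.src1.basenameHeaderKey = basename
# Hypothetical custom interceptor that copies the "basename" header into the event body
a1.sources.src1.interceptors = i1
a1.sources.src1.interceptors.i1.type = com.example.FilenameInterceptor$Builder
a1.sources.src1.channels = ch1

a1.channels.ch1.type = memory

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = my-topic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.channel = ch1

The interceptor is needed because, by default, the Kafka sink only writes the event body; the file-name header would otherwise never reach the topic.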

Related

FilePulse SourceConnector

I would like to continuously read a CSV file into ksqlDB with the FilePulse source connector, but it does not work correctly; either
a) the connector reads the file only once, or
b) the connector reads all the data from the file, but then there are duplicates in the Kafka topic (every time the connector reads the appended file, it inserts all the data from the file into the topic, not only the new data).
Is there any option to solve this (to continuously read only the appended data from the file, or to remove the duplicates from the Kafka topic)?
Thank you
To my knowledge, the file source connector doesn't track the file content. The connector only sees a modified file, so it reads the whole thing on any update. Otherwise, reading the file once is the expected behavior; you would need to reset your consumer offsets or handle the duplicates in your processing logic, for example by making a table in ksqlDB.
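A minimal ksqlDB sketch of that idea, assuming the rows land in a topic called csv-topic and the Kafka record key carries a unique row id (topic, column names, and formats are illustrative):

CREATE TABLE csv_rows (id VARCHAR PRIMARY KEY, col1 VARCHAR, col2 VARCHAR)
  WITH (KAFKA_TOPIC='csv-topic', VALUE_FORMAT='DELIMITED');

Because a table is keyed, re-reading the whole file just overwrites each row with the same value instead of surfacing duplicates to downstream queries.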
If you want to tail a file for appends, other options like the spooldir connector or Filebeat/Fluentd would be preferred (and are actually documented as production-grade solutions for reading files into Kafka); a minimal Filebeat sketch follows.
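For instance, a Filebeat sketch that tails files and ships only appended lines to Kafka (paths, topic, and broker address are illustrative):

filebeat.inputs:
  - type: log
    paths:
      - /data/input/*.csv

output.kafka:
  hosts: ["localhost:9092"]
  topic: "csv-topic"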
Disclaimer: I'm the author of Connect FilePulse
Connect FilePulse is probably not the best solution for continuously reading files, and, as already mentioned in the other answers, it might be a good idea to use solutions like Filebeat, Fluentd, or Logstash.
But FilePulse actually does support continuous reading using the LocalRowFileInputReader with the reader's property read.max.wait.ms. Here is an older answer to a question similar to yours: Stackoverflow: How can be configured kafka-connect-file-pulse for continuous reading of a text file?

MQTT converter trouble to perform transformation on values

I have been struggling for some days to convert values arriving as a ByteArray from an MQTT source connector into a String. Our standalone properties file looks like this:
# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true
When we try to perform some transformation on the values, it looks like the data in Kafka is kept as ByteArrays, so it is not possible to perform any operation on the values. Is there a way to convert the ByteArray to a String?
What we tried is to change the parameters as follows:
converter.encoding=UTF-8
value.converter=org.apache.kafka.connect.storage.StringConverter
We had no luck. Our input is a Python string similar to:
{'id': 42, 'cost': 4000}
Any suggestions on how to configure the properties file?
EDIT: As requested, I am providing more information. We moved to distributed mode with a cluster of 3 brokers, and we start the connector as follows:
{
"name":"MqttSourceConnector",
"config":{
"connector.class":"io.confluent.connect.mqtt.MqttSourceConnector",
"tasks.max":"2",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.converters.ByteArrayConverter",
"mqtt.server.uri":"ssl://XXXXXX.XXXX.amazonaws.com:1883",
"mqtt.topics":"/mqtt/topic",
"mqtt.ssl.trust.store.path":"XXXXXXXXXX.jks",
"mqtt.ssl.trust.store.password":"XXXXX",
"mqtt.ssl.key.store.path":"XXXXXXXXX.jks",
"mqtt.ssl.key.store.password":"XXXXXX",
"mqtt.ssl.key.password":"XXXXXXXX",
"max.retry.time.ms":86400000,
"mqtt.connect.timeout.seconds":2,
"mqtt.keepalive.interval.seconds":4
}
}
What I receive as key/value is:
/mqtt/topic:"stringified json with kafka topic and key"
What I would like to see:
Topic: kafka.topic.from.the.mqtt.paylod.string
Key: key_from_mqtt_string
Of course, the MQTT payload should be JSON for us, but I can't manage to convert it.
Single Message Transforms usually require a schema in the data.
Connectors like this one, and others that connect to message queues, generally require the ByteArrayConverter. After that, you need to apply a schema to the data, at which point you can start to manipulate the fields.
I wrote about one way of doing this here, in which you ingest the raw bytes to a Kafka topic and then apply a schema using a stream processor (ksqlDB in my example, but you can use other options if you want).
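A minimal ksqlDB sketch of that pattern, assuming the ByteArrayConverter bytes are UTF-8 JSON text; the topic name is illustrative and the fields match the example payload above:

-- Expose the raw topic as a single string column
CREATE STREAM mqtt_raw (payload VARCHAR)
  WITH (KAFKA_TOPIC='mqtt-source-topic', VALUE_FORMAT='KAFKA');

-- Pull typed fields out of the JSON text
CREATE STREAM mqtt_typed AS
  SELECT CAST(EXTRACTJSONFIELD(payload, '$.id') AS INT) AS id,
         CAST(EXTRACTJSONFIELD(payload, '$.cost') AS BIGINT) AS cost
  FROM mqtt_raw EMIT CHANGES;

Note that the example payload shown above is a Python dict repr with single quotes; it would need to be produced as actual JSON (e.g. via json.dumps) for any JSON parsing to succeed.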

Integrating a large XML file with Kafka

The XML file (~100 MB) is a batch export by an external system of its entire database (the batch export runs every 6 hours).
I cannot change the integration to use, for example, a Debezium connector.
I have access only to the XML file.
What would be the best solution to consume the file with Apache Kafka?
Or an architecture to send individual messages from the XML file, with an XSD schema?
Isn't receiving its content as a single very large message a bad thing for the architecture?
The default max.message.bytes configuration at the broker and topic level in Kafka is about 1 MB, and it is not advisable to significantly increase that configuration, as Kafka is not optimized to handle large messages.
I see two options to solve this:
Before loading the XML into Kafka, split it into chunks that each represent an individual row of the database. In addition, use a typesafe format (such as Avro) in combination with a Schema Registry to tell potential consumers how to read the data.
Depending on what needs to be done with the large XML file, you could also store the XML in a resilient location (such as HDFS) and only provide the location path in a Kafka message. That way, a consumer can read the paths from the Kafka topic and do its processing on them.
Writing a Kafka producer that unmarshals the XML file to Java objects and sends the serialized objects in Avro format to the cluster was the solution for me.
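A rough Java sketch of that approach, assuming a JAXB-annotated DatabaseExport class generated from the export's XSD; the class names, Avro schema, topic, and endpoints below are illustrative, not from the original setup:

import jakarta.xml.bind.JAXBContext;            // javax.xml.bind on older stacks
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.io.File;
import java.util.Properties;

public class XmlExportProducer {
    // Illustrative Avro schema for one exported database row
    private static final Schema ROW_SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Row\",\"fields\":[" +
        "{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"payload\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        // Unmarshal the whole export once (DatabaseExport/Row are hypothetical JAXB classes)
        DatabaseExport export = (DatabaseExport) JAXBContext
                .newInstance(DatabaseExport.class)
                .createUnmarshaller()
                .unmarshal(new File("/data/export.xml"));

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // One small, schema'd message per database row instead of one ~100 MB blob
            for (DatabaseExport.Row row : export.getRows()) {
                GenericRecord record = new GenericData.Record(ROW_SCHEMA);
                record.put("id", row.getId());
                record.put("payload", row.getPayload());
                producer.send(new ProducerRecord<>("db-export", row.getId(), record));
            }
        }
    }
}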

Apache NiFi : Validate the FlowFile data created by ConsumeKafka

I am pretty new to NiFi. We have the setup done already where we are able to consume the Kafka messages.
In the NiFi UI, I created a processor with ConsumeKafka_0_10. When the messages are published (by a different process), my processor is able to pick up the required data/messages properly.
I go to "Data provenance" and can see that the correct data is received.
However, I want the next process to be some kind of validator that reads the flowfile from ConsumeKafka and does basic validation (a user-supplied script would be good).
How do we do that, or which processor works here?
Also, is there any way to convert the flowfile content into CSV or JSON format?
You have a few options. Depending on the flowfile content format, you can use ValidateRecord with a *Reader record reader controller service configured to validate it. If you already have a script to do this in Groovy/Javascript/Ruby/Python, ExecuteScript is also a solution.
Similarly, to convert the flowfile content into CSV or JSON, use a ConvertRecord processor with a ScriptedReader and a CSVRecordSetWriter or JsonRecordSetWriter to output the correct format. These processors use the Apache NiFi record structure internally to convert between arbitrary input/output formats with high performance. Further reading is available at blogs.apache.org/nifi and bryanbende.com.

Display the contents of a folder using Kafka

I have a folder that contains two files. I want to display the names of the two files, and when I add another file, to display its name as well. My question is: can I do that with Kafka?
It's a strange use case for Kafka. I think this should be done by adding Apache Flume. Flume can poll a directory and, when it discovers new files, send them to Kafka; you can then process those messages to recover the file names.
Does that solve your problem?