Apache NiFi: Validate the FlowFile data created by ConsumeKafka

I am pretty new to NiFi. We already have a setup in place where we are able to consume Kafka messages.
In the NiFi UI, I created a processor with ConsumeKafka_0_10. When messages are published (by a different process), my processor picks up the required data/messages properly.
I go to "Data provenance" and can see that the correct data is received.
However, I want the next step to be some kind of validator that reads the flowfile from ConsumeKafka and does basic validation (a user-supplied script would be fine).
How do we do that, or which processor works here?
Also, is there any way to convert the flowfile content into CSV or JSON format?

You have a few options. Depending on the flowfile content format, you can use ValidateRecord with a *Reader record reader controller service configured to validate it. If you already have a script to do this in Groovy/Javascript/Ruby/Python, ExecuteScript is also a solution.
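If you go the ExecuteScript route, a minimal Jython (Python) sketch could look like the one below. It assumes the flowfile content is expected to be JSON and simply routes flowfiles that fail to parse to the failure relationship; session, REL_SUCCESS and REL_FAILURE are variables that ExecuteScript binds for you.

# Hypothetical ExecuteScript (Jython) validator: route flowfiles with valid JSON content to success
import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

class ReadContent(InputStreamCallback):
    def __init__(self):
        self.text = None
    def process(self, inputStream):
        # read the entire flowfile content as a UTF-8 string
        self.text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

flowFile = session.get()
if flowFile is not None:
    callback = ReadContent()
    session.read(flowFile, callback)
    try:
        json.loads(callback.text)
        session.transfer(flowFile, REL_SUCCESS)
    except ValueError:
        session.transfer(flowFile, REL_FAILURE)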
Similarly, to convert the flowfile content into CSV or JSON, use a ConvertRecord processor with a ScriptedReader and a CSVRecordSetWriter or JsonRecordSetWriter to output the correct format. These processors use the Apache NiFi record structure internally to convert between arbitrary input/output formats with high performance. Further reading is available at blogs.apache.org/nifi and bryanbende.com.
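Purely to illustrate the kind of record-by-record transformation ConvertRecord performs (NiFi itself needs no code for this, only the reader/writer controller services), here is a standalone Python sketch with made-up field names that turns a set of JSON records into CSV:

# Standalone illustration of a JSON-to-CSV record conversion; field names are hypothetical
import csv
import io
import json

records = json.loads('[{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]')

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(records)
print(output.getvalue())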

Related

MQTT converter: trouble performing transformations on values

I have been struggling for some days to convert values coming from an MQTT Source Connector as ByteArray into a String. Our standalone properties file looks like this:
# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true
When we try to perform some transformation on the values, it looks like Kafka holds the data as ByteArrays, so it is not possible to perform any operation on the values. Is there a way to convert the ByteArray to a String?
What we tried is to change the parameters as follows:
converter.encoding=UTF-8
value.converter=org.apache.kafka.connect.storage.StringConverter
We had no luck. Our input is a Python string similar to:
{'id': 42, 'cost': 4000}
Any suggestion on how to configure the properties file?
EDIT: As requested, I am providing more information. We moved to distributed mode with a cluster of 3 brokers, and we start the connector as follows:
{
  "name": "MqttSourceConnector",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "2",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "mqtt.server.uri": "ssl://XXXXXX.XXXX.amazonaws.com:1883",
    "mqtt.topics": "/mqtt/topic",
    "mqtt.ssl.trust.store.path": "XXXXXXXXXX.jks",
    "mqtt.ssl.trust.store.password": "XXXXX",
    "mqtt.ssl.key.store.path": "XXXXXXXXX.jks",
    "mqtt.ssl.key.store.password": "XXXXXX",
    "mqtt.ssl.key.password": "XXXXXXXX",
    "max.retry.time.ms": 86400000,
    "mqtt.connect.timeout.seconds": 2,
    "mqtt.keepalive.interval.seconds": 4
  }
}
What I receive as key and value is:
/mqtt/topic:"stringified json with kafka topic and key"
What I would like to see:
Topic: kafka.topic.from.the.mqtt.payload.string
Key: key_from_mqtt_string
Of course, the MQTT payload should be JSON for us, but I can't manage to convert it.
Single Message Transforms usually require a schema in the data.
Connectors like this one, and others connecting from message queues, generally require the ByteArrayConverter. After that you need to apply a schema to the data, after which you can start to manipulate the fields.
I wrote about one way of doing this here, in which you ingest the raw bytes to a Kafka topic and then apply a schema using a stream processor (ksqlDB in my example, but you can use other options if you want).
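As a rough illustration of that pattern outside ksqlDB, a sketch using the kafka-python client might look like the following; the topic names and broker address are placeholders, and it assumes the raw bytes are UTF-8 encoded JSON:

# Sketch: consume raw bytes from one topic, parse them as JSON, republish to a "parsed" topic.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "mqtt-raw",                         # hypothetical topic holding the ByteArrayConverter output
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    payload = json.loads(message.value.decode("utf-8"))  # bytes -> str -> dict
    producer.send("mqtt-parsed", value=payload)          # downstream consumers get structured JSON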

Write data in JSON to Kafka using Avro and the Confluent Schema Registry with NiFi

I would like to write data in JSON format to a Kafka cluster using Avro and a Confluent Schema Registry. I already created a schema in the Confluent Schema Registry, which looks like this:
The JSON looks like this:
In NiFi I'm currently using the PublishKafkaRecord_2_6 processor, which is configured like this:
To process the JSON I'm using a JsonTreeReader, which is configured like this:
To write to Kafka we are using an AvroRecordSetWriter, which is configured like this:
When I have a look at what is written into Kafka, I get something cryptic like this:
Can somebody maybe point out my mistake?
Thanks in advance.
Avro is a binary format that happens to have a JSON representation. The AvroRecordSetWriter uses the binary format only. To write JSON out, you'd need to use the JsonRecordSetWriter instead.
So all you're seeing, as far as I can tell, is the binary representation of your messages in the Kafka topic. That is the expected behavior when using the Avro writer with that processor.
FWIW, this is precisely the pattern you should be using with Kafka in most cases. The binary format is smaller and more compact, so Kafka will process it faster on both ends. You can also set it up not to write the schema into the binary blob, so it will be really compact, and then have downstream systems use a convention based on topic name or something similar to pick a schema.
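If you want to verify that those cryptic bytes really are valid Avro, a minimal sketch with the confluent-kafka Python client could look like this. It assumes the writer embeds the Confluent Schema Registry reference in each message (the Confluent wire format); the topic, group id and URLs are placeholders:

# Sketch: read the Avro-encoded messages back, letting the Schema Registry decode them.
from confluent_kafka.avro import AvroConsumer

consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "avro-check",
    "schema.registry.url": "http://localhost:8081",
})
consumer.subscribe(["my-avro-topic"])   # hypothetical topic name

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print("Error:", msg.error())
        continue
    print(msg.value())                  # decoded back into a dict, i.e. your original JSON structure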

Integrating a large XML file with Kafka

The XML file (~100 MB) is a batch export by an external system of its entire database (the batch export runs every 6 hours).
I cannot change the integration to use, for example, a Debezium connector.
I only have access to the XML file.
What would be the best solution to consume the file with Apache Kafka?
Or is there an architecture for sending the XML file's content as individual messages with an XSD schema?
Isn't receiving its content as a single large message a bad thing for the architecture?
The default max.message.bytes configuration at the broker and topic level in Kafka is set to about 1 MB, and it is not advisable to significantly increase that configuration, as Kafka is not optimized to handle large messages.
I see two options to solve this:
Before loading the XML into Kafka, split it into chunks that each represent an individual row of the database. In addition, use a type-safe format (such as Avro) in combination with a Schema Registry to tell potential consumers how to read the data. A sketch of this option follows below.
Depending on what needs to be done with the large XML file, you could also store the XML in a resilient location (such as HDFS) and only provide the location path in a Kafka message. That way, a consumer can consume the paths from the Kafka topic and do its processing on them.
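A rough Python sketch of the first option, using a streaming parse so the 100 MB document never has to fit into a single message; the element name, topic and broker address are assumptions, and serializing each row with Avro and a schema registry (as suggested above) would replace the raw tostring call:

# Sketch: stream a large XML export and publish one Kafka message per database row.
import xml.etree.ElementTree as ET
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# iterparse streams the file, so the whole document is never held in memory at once
for event, elem in ET.iterparse("export.xml", events=("end",)):
    if elem.tag == "row":                              # hypothetical per-record element
        producer.send("db-export", ET.tostring(elem))  # one small message per row
        elem.clear()                                   # free the parsed element

producer.flush()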
Writing a Kafka producer that unmarshals the XML file into Java objects and sends the serialized objects in Avro format to the cluster was the solution for me.

Read Avro buffer-encoded messages with ConsumeKafka (NiFi)

I am new to NiFi (and don't have much experience with Kafka), and I am trying to consume the messages that the producer is generating. To do this job, I am using the ConsumeKafka processor in NiFi.
The messages are arriving (I can see them on the queue), but when I check the queue and try to view the messages, I can only see the content in hex format (in the original format view I just get a message that says: No viewer is registered for this content type).
The messages that the producer is sending are Avro-buffer encoded (this is the reference I have followed: https://blog.mimacom.com/apache-kafka-with-node-js/). When I check the consumer from the console, each message has this format:
02018-09-21T08:37:44.587Z #02018-09-21T08:37:44.587Z #
I have read that the UpdateRecord processor can help to change the hex code to plain text, but I can't make it happen.
How can I configure this UpdateRecord processor?
Cheers
Instead of ConsumeKafka, it is better to use the ConsumeKafkaRecord processor appropriate to the Kafka version you're using, configure its Record Reader with an AvroReader, and set its Record Writer to the writer of your choice.
Once that's done, you have to configure the AvroReader controller service with a schema registry. You can use AvroSchemaRegistry, where you would specify the schema for the Avro messages that you're receiving in Kafka.
A quick look at this tutorial would help you achieve what you want: https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
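Under the hood, that AvroReader + AvroSchemaRegistry combination does something equivalent to the following fastavro sketch; the schema and values here are made up purely to show the shape of things (the sample timestamp mirrors the console output above):

# Sketch of what AvroReader does: decode a schemaless Avro payload with a schema you supply.
import io
from fastavro import schemaless_reader, schemaless_writer

schema = {
    "type": "record",
    "name": "Reading",
    "fields": [
        {"name": "timestamp", "type": "string"},
        {"name": "value", "type": "double"},
    ],
}

# round-trip a sample record to show that the bytes on the wire are not human-readable
buf = io.BytesIO()
schemaless_writer(buf, schema, {"timestamp": "2018-09-21T08:37:44.587Z", "value": 4.2})
print(buf.getvalue())                    # binary Avro, which is why NiFi can only show it as hex
buf.seek(0)
print(schemaless_reader(buf, schema))    # back to a plain dict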

StreamSets Design Of Ingestion

Dears,
I am considering options for how to use StreamSets properly in a given generic data hub architecture:
I have several data types (CSV, TSV, JSON, binary from IoT) that need to be captured by CDC and saved into a Kafka topic in their as-is format, and then sunk to the HDFS data lake as-is.
Then another StreamSets pipeline will consume from this Kafka topic, convert (depending on the data type) to a common JSON format, perform validation, masking, metadata handling, etc., and save to another Kafka topic.
The same JSON messages will be saved into the HDFS data lake in Avro format for batch processing.
I will then use Spark Streaming to consume the same JSON messages for real-time processing, assuming the JSON data is all ready and can be further enriched with other data for scalable complex transformations.
I have not used StreamSets for further processing, relying instead on Spark Streaming for scalable complex transformations, which is not part of the SLA management (as Spark jobs are not triggered from within StreamSets). Also, I could not use the Kafka Schema Registry with Avro in this design to validate the JSON schema; the JSON schema is validated based on custom logic embedded into StreamSets as JavaScript.
What can be done better in the above design?
Thanks in advance...
Your pipeline design looks good.
However, I would recommend consolidating several of those steps using Striim.
Striim has built-in CDC (change data capture) from all the sources you listed, plus databases.
It has native Kafka integration, so you can write to and read from Kafka in the same pipeline.
Striim also has built-in caches and processing operators for enrichment, so you don't need to write Spark code to do enrichment. Everything is done through our simple UI.
You can try it out here:
https://striim.com/instant-download
Full disclosure: I'm a PM at Striim.