I have been struggling for some days to convert values coming from an MQTT Source Connector as ByteArray into a String. Our standalone properties file looks like this:
# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true
When we try to apply a transformation to the values, it looks like the data arrives in Kafka as ByteArrays, so it is not possible to perform any operation on the value. Is there a way to convert the ByteArray to a String with the StringConverter?
What we tried is changing the parameters as follows:
converter.encoding=UTF-8
value.converter=org.apache.kafka.connect.storage.StringConverter
We had no luck. Our input is a Python string similar to:
{'id': 42, 'cost': 4000}
Any suggestion on how to configure the property file?
EDIT: As requested, I am providing more information. We moved to distributed mode, with a cluster of 3 brokers, and we start the connector as follows:
{
  "name": "MqttSourceConnector",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "2",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "mqtt.server.uri": "ssl://XXXXXX.XXXX.amazonaws.com:1883",
    "mqtt.topics": "/mqtt/topic",
    "mqtt.ssl.trust.store.path": "XXXXXXXXXX.jks",
    "mqtt.ssl.trust.store.password": "XXXXX",
    "mqtt.ssl.key.store.path": "XXXXXXXXX.jks",
    "mqtt.ssl.key.store.password": "XXXXXX",
    "mqtt.ssl.key.password": "XXXXXXXX",
    "max.retry.time.ms": 86400000,
    "mqtt.connect.timeout.seconds": 2,
    "mqtt.keepalive.interval.seconds": 4
  }
}
What I receive as key and value is:
/mqtt/topic:"stringified json with kafka topic and key"
What I would like to see:
Topic: kafka.topic.from.the.mqtt.payload.string
Key: key_from_mqtt_string
Of course, the MQTT payload should be JSON for us, but I can't manage to convert it.
Single Message Transforms usually require a schema in the data.
Connectors like this, and others that connect to message queues, generally require the ByteArrayConverter. After that you need to apply a schema to the data, at which point you can start to manipulate the fields.
I wrote about one way of doing this here, in which you ingest the raw bytes into a Kafka topic and then apply a schema using a stream processor (ksqlDB in my example, but you can use other options if you want).
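To make that concrete, here is a minimal sketch of the same idea without ksqlDB: a small Python process that consumes the raw bytes, parses them as JSON, and re-produces them to a second topic where the fields are addressable. The broker address and the topic names mqtt-raw and mqtt-parsed are placeholders, and it assumes the MQTT payload is strict JSON (double-quoted keys):

import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "mqtt-reshaper",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["mqtt-raw"])             # hypothetical topic fed by the MQTT connector
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value().decode("utf-8"))   # raw bytes -> dict
    # fields are now addressable, e.g. re-key by "id" and forward as JSON
    producer.produce("mqtt-parsed",
                     key=str(record.get("id")),
                     value=json.dumps(record).encode("utf-8"))
    producer.flush()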
I would like to write JSON data to a Kafka cluster using Avro and a Confluent Schema Registry. I already created a schema in the Schema Registry, which looks like this:
The JSON looks like this:
In NiFi I'm currently using the PublishKafkaRecord_2_6 processor, which is configured like this:
To process the JSON I'm using a JsonTreeReader, which is configured like this:
To write to Kafka we are using an AvroRecordSetWriter, which is configured like this:
When I look at what is written into Kafka I get something cryptic like this:
Can somebody maybe point out my mistake?
Thanks in advance.
Avro is a binary format that happens to have a JSON representation. The AvroRecordSetWriter uses the binary format only. To write JSON out, you'd need to use the JSON record set writer.
So all you're seeing, as far as I can tell, is the binary representation of your messages in the Kafka topics. That is the expected behavior when using the Avro writer with that processor.
FWIW, this is precisely the pattern you should be using with Kafka in most cases. The binary format is smaller and more compact, so Kafka will handle it faster on both ends. You can also set it up not to write the schema into the binary blob, making it even more compact, and then have downstream systems use a convention (based on the topic name, for example) to pick a schema.
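If you want to sanity-check what landed in the topic, a rough sketch of reading it back with the Confluent Python client is below. It assumes the records were written in the Confluent wire format with a schema reference (i.e. the writer was configured against the Schema Registry rather than embedding the full schema); the URLs, group id, and topic name are placeholders:

from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

sr = SchemaRegistryClient({"url": "http://localhost:8081"})      # placeholder registry URL
consumer = DeserializingConsumer({
    "bootstrap.servers": "localhost:9092",                       # placeholder broker
    "group.id": "avro-check",
    "auto.offset.reset": "earliest",
    "value.deserializer": AvroDeserializer(sr),
})
consumer.subscribe(["my-avro-topic"])                            # placeholder topic

msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.value())   # the Avro record decoded into a Python dict
consumer.close()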
I need to mirror records from a topic on cluster A to a topic on cluster B while adding a field to the records as they are mirrored (e.g. with InsertField).
I do not control cluster A (but could request changes there) and have full control of cluster B.
I know that cluster A is sending serialised JSON.
I am using MirrorMaker with the Kafka Connect framework to do the mirroring, and I am trying to use the InsertField transformation to add data to the records as they pass through.
My configuration looks like this:
connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
topics=.*
source.cluster.alias=upstream
source.cluster.bootstrap.servers=source:9092
target.cluster.bootstrap.servers=target:9092
# ByteArrayConverter to keep MirrorMaker from re-encoding messages
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
transforms=InsertSource1
transforms.InsertSource1.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource1.static.field=test_inser
transforms.InsertSource1.static.value=test_value
name=somerandomname
This configuration fails with the following error:
org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [field insertion]
Is there a way to achieve this without writing a custom transform (I use Python and am not familiar with Java)?
Thanks a lot
In the current version of Apache Kafka (2.6.0), you cannot apply the InsertField single message transformation (SMT) to MirrorMaker 2.0 records.
Explanation
MirrorMaker 2.0 is based on the Kafka Connect framework; internally, the MirrorMaker 2.0 driver sets up a MirrorSourceConnector.
Source connectors apply SMTs immediately after polling records. There are no converters (e.g. ByteArrayConverter or JsonConverter) at this step: converters are applied after the SMTs.
The SourceRecord value is represented as a byte array with the BYTES_SCHEMA schema, while the InsertField transformation requires a Type.STRUCT schema for records that have a schema.
So, since the record cannot be treated as a Struct, the transformation is not applied.
References
KIP-382: MirrorMaker 2.0
How to Use Single Message Transforms in Kafka Connect
Additional resources
Docker-compose playground for MirrorMaker 2.0
As commented, the ByteArrayConverter carries no Struct/schema information, so the transform you're using (adding a field) cannot be applied.
This does not mean that no transforms can be used, however.
If you're sending JSON messages, you must send schema and payload information.
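For reference, "schema and payload information" means the envelope that the JsonConverter produces and expects when schemas are enabled, i.e. with

value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true

each message looks roughly like this (the field names here are only illustrative):

{
  "schema": {
    "type": "struct",
    "fields": [
      {"type": "string", "optional": false, "field": "some_field"}
    ],
    "optional": false
  },
  "payload": {"some_field": "some_value"}
}

That envelope is what gives Connect a Struct that field-level transforms can work on.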
I have configured Kafka Connect workers to run in a cluster and am able to get DB data. I have stored the DB data in Kafka topics in JSON format, using the JsonConverter to serialize the data.
On viewing the DB data in the Kafka console consumer, I can see that the UserCreatedon column value is displayed as an integer. The data type of the UserCreatedon column in the DB is int64 (Unix epoch time), which is why the timestamp value is displayed as an int by the Kafka consumer.
Is there any way to send a schema during connector creation? I want UserCreatedon to be displayed in timestamp format instead of as an int.
Sample output
{"schema":{"type":"struct","fields":[{"type":"string","optional":false,"field":"NAME"},{"type":"int64","optional":true,"name":"org.apache.kafka.connect.data.Timestamp","version":1,"field":"UserCreatedON"}],"optional":false},"payload":{"NAME":"UserProvision","UserCreatedon ":1567688965261}}
Any support is much appreciated.
You have not mentioned what type of connector you are using to bring data from the DB into Kafka. Kafka Connect supports transformations (SMTs):
Single Message Transformations (SMTs) are applied to messages as they flow through Connect. SMTs transform inbound messages after a source connector has produced them, but before they are written to Kafka.
See here
Specifically for your case, you can use the TimestampConverter transformation.
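A hedged sketch of the relevant connector properties, assuming the field is named UserCreatedon as in your payload (the transform alias "ts" and the date format are arbitrary choices):

transforms=ts
transforms.ts.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.ts.field=UserCreatedon
transforms.ts.target.type=string
transforms.ts.format=yyyy-MM-dd HH:mm:ss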
I am trying to set up a Kafka consumer to process data from Kafka Streams. I was able to set up the connection to the stream and the data is visible, but it's a mixture of special characters and ASCII.
I am using the built-in Kafka console consumer, but have also tried the Python confluent-kafka client. The only constraint is that the SASL_PLAINTEXT security protocol with SCRAM-SHA-256 must be used. I am open to using other methods to parse the output (just not Java, if possible).
Kafka Console
bin/kafka-console-consumer.sh --bootstrap-server server:9092 \
--topic TOPIC --from-beginning --consumer.config=consumer.properties
Confluent Kafka Python
topics = "TOPIC"
conf = {
"bootstrap.servers": "server:9092",
"group.id": "group",
"security.protocol": "SASL_PLAINTEXT",
"sasl.mechanisms" : "SCRAM-SHA-256",
}
c = Consumer(conf)
c.subscribe([topics])
running = True
while running:
message = c.poll()
print(message.value())
c.close()
Output
PLE9K1PKH3S0MAY38ChangeRequest : llZYMEgVmq2CHG:Infra RequestKSUSMAINCHANGEKC-10200-FL01DATA_MISSINGCHGUSD
DATA_MISSINGDATA_MISSINGUSD
CANCEL
▒▒12SLM:Measurement"Schedule(1 = 0)USDUSD▒▒▒
l▒l▒V?▒▒▒
llZYMEgVmq
company_team team_nameTEAM###SGP000000140381PPL000002020234
Latha M▒>▒>▒ChangeRequest
hello:1234543534 cloud abcdef▒▒▒
▒Ի▒
▒▒▒
John Smithjs12345SGP000000140381▒NPPL000002020234
▒Ի▒
I am trying to parse the data on standard output first, but the end goal is to get the parsed data into a database. Any advice would be appreciated.
It seems like your messages are encoded in a binary format. To print them you will need to set up a binary decoder and pass them through it. If they were produced with a specific schema, you might also need to deserialize the objects using the Schema Registry, which holds the schema for the given topic. You are looking at something along the lines of:
import io
from avro.io import BinaryDecoder  # from the Apache Avro Python package
message_bytes = io.BytesIO(message.value())
decoder = BinaryDecoder(message_bytes)
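To actually get a record out of the decoder you also need the writer's schema and a DatumReader. A minimal sketch, assuming a local copy of the schema is available (the file name user.avsc is hypothetical):

from avro.io import DatumReader
from avro.schema import parse

schema = parse(open("user.avsc").read())   # hypothetical local copy of the writer's schema
reader = DatumReader(schema)
record = reader.read(decoder)              # decoder from the snippet above
print(record)

Note that if the messages were produced with the Confluent serializers, each value is prefixed with a magic byte and a 4-byte schema id, so you would either strip that header or use the Schema Registry deserializer instead.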
As jaivalis has mentioned, there appears to be a mismatch between your producers and the consumer you are using to ingest the data. Kafka Streams exposes two properties for controlling the serialization and deserialization of data that passes through the topology: default.value.serde and default.key.serde. I recommend reviewing your streams application's configuration to find a suitable deserializer for the consumer to use.
https://kafka.apache.org/documentation/#streamsconfigs
Do note, however, that these serdes may be overridden by your streams application implementation. Be sure to review your implementation as well to ensure you have found the correct serialization format.
https://kafka.apache.org/21/documentation/streams/developer-guide/datatypes.html#overriding-default-serdes
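For example, if the streams application's configuration contains something like the following (these are the standard built-in serde class names), the topic holds plain strings and a plain string deserializer on your consumer would match:

default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde=org.apache.kafka.common.serialization.Serdes$StringSerde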
I am pretty new to NiFi. We already have a setup in place where we are able to consume Kafka messages.
In the NiFi UI, I created a processor with ConsumeKafka_0_10. When the messages are published (by a different process), my processor is able to pick up the required data/messages properly.
I go to "Data provenance" and can see that the correct data is received.
However, I want the next step to be a validator that reads the flowfile from ConsumeKafka and does some basic validation (a user-supplied script should be fine).
How do we do that, or which processor works here?
Also, is there any way to convert the flowfile input format into CSV or JSON format?
You have a few options. Depending on the flowfile content format, you can use ValidateRecord with a *Reader record reader controller service configured to validate it. If you already have a script to do this in Groovy/Javascript/Ruby/Python, ExecuteScript is also a solution.
Similarly, to convert the flowfile content into CSV or JSON, use a ConvertRecord processor, with a ScriptedReader and a CSVRecordSetWriter or JsonRecordSetWriter to output in the correct format. These processors use the Apache NiFi record structure internally to convert between arbitrary input/output formats with high performance. Further reading is available at blogs.apache.org/nifi and bryanbende.com.
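If you go the ExecuteScript route, a minimal Jython sketch is shown below. It assumes the flowfile content is JSON and applies a made-up rule (requiring an "id" key); session, REL_SUCCESS and REL_FAILURE are provided by the processor's script bindings:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class ValidateJson(StreamCallback):
    def __init__(self):
        self.valid = False
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        try:
            obj = json.loads(text)
            self.valid = "id" in obj                          # hypothetical validation rule
        except ValueError:
            self.valid = False
        outputStream.write(bytearray(text.encode("utf-8")))   # pass content through unchanged

flowFile = session.get()
if flowFile is not None:
    callback = ValidateJson()
    flowFile = session.write(flowFile, callback)
    session.transfer(flowFile, REL_SUCCESS if callback.valid else REL_FAILURE)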