Error With RowKey Definition on Confluent BigTable Sink Connector

I'm trying to use the BigTable Sink Connector from Confluent to read data from Kafka and write it into my BigTable instance, but I'm receiving the following error message:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:614)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:329)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.connect.errors.ConnectException: Error with RowKey definition: Row key definition was defined, but received, deserialized kafka key is not a struct. Unable to construct a row key.
at io.confluent.connect.bigtable.client.RowKeyExtractor.getRowKey(RowKeyExtractor.java:69)
at io.confluent.connect.bigtable.client.BufferedWriter.addWriteToBatch(BufferedWriter.java:84)
at io.confluent.connect.bigtable.client.InsertWriter.write(InsertWriter.java:47)
at io.confluent.connect.bigtable.BaseBigtableSinkTask.put(BaseBigtableSinkTask.java:99)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:586)
... 10 more
Due to some technical limitations, the message producer cannot set a key on the messages it produces, so I'm using transforms to take fields from the payload and set them as the message key.
Here's my connector payload:
{
"name" : "DATALAKE.BIGTABLE.SINK.QUEUEING.ZTXXD",
"config" : {
"connector.class" : "io.confluent.connect.gcp.bigtable.BigtableSinkConnector",
"key.converter" : "org.apache.kafka.connect.storage.StringConverter",
"value.converter" : "org.apache.kafka.connect.json.JsonConverter",
"topics" : "APP-DATALAKE-QUEUEING-ZTXXD_DATALAKE-V1",
"transforms" : "HoistField,AddKeys,ExtractKey",
"gcp.bigtable.project.id" : "bigtable-project-id",
"gcp.bigtable.instance.id" : "bigtable-instance-id",
"gcp.bigtable.credentials.json" : "XXXXX",
"transforms.ExtractKey.type" : "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.HoistField.field" : "raw_data_cf",
"transforms.ExtractKey.field" : "KEY1,ATT1",
"transforms.HoistField.type" : "org.apache.kafka.connect.transforms.HoistField$Value",
"transforms.AddKeys.type" : "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.AddKeys.fields" : "KEY1,ATT1",
"row.key.definition" : "KEY1,ATT1",
"table.name.format" : "raw_ZTXXD_DATALAKE",
"consumer.override.group.id" : "svc-datalake-KAFKA_2_BIGTABLE",
"confluent.topic.bootstrap.servers" : "xxxxxx:9092",
"input.data.format" : "JSON",
"confluent.topic" : "_dsp-confluent-license",
"input.key.format" : "STRING",
"key.converter.schemas.enable" : "false",
"confluent.topic.security.protocol" : "SASL_SSL",
"row.key.delimiter" : "/",
"confluent.topic.sasl.jaas.config" : "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"XXXXX\" password=\"XXXXXX\";",
"value.converter.schemas.enable" : "false",
"auto.create.tables" : "true",
"auto.create.column.families" : "true",
"confluent.topic.sasl.mechanism" : "PLAIN"
}
}
And here's my message produced to Kafka:
{
"MANDT": "110",
"KEY1": "1",
"KEY2": null,
"ATT1": "1M",
"ATT2": "0000000000",
"TABLE_NAME": "ZTXXD_DATALAKE",
"IUUC_OPERATION": "I",
"CREATETIMESTAMP": "2022-01-24T20:26:45.247Z"
}
My transforms perform three operations:
HoistField puts my payload inside a two-level structure (the Connect docs for BigTable say that the connector expects a two-level structure in order to infer the column families).
AddKeys adds the columns that I consider keys to the message key.
ExtractKey removes the field names from the key that AddKeys created, leaving only the values themselves.
I've been reading the documentation for this connector for Bigtable and it's not clear to me if the connector works well with the JSON format. Could you let me know?

JSON should work, but...
deserialized kafka key is not a struct
This is because you have set schemas.enable=false on the value converter, so the deserialized value has no schema. HoistField then produces a Java Map rather than a Connect Struct, and the key that ValueToKey builds from it is not a Struct either.
If you're not able to use the Schema Registry and switch the serialization format, then you'll need to find a way to get the REST Proxy to infer the schema of the JSON message before it produces the data (I don't think it can). Otherwise, your records need to include schema and payload fields, and you need to enable schemas on the converters. Explained here
Another option: there may be a transform project around that sets the schema of the record, but it's not built in (it's not part of SetSchemaMetadata).
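To make the "schema and payload fields" option concrete, here is a minimal Python sketch of a producer-side helper that wraps a flat record in the envelope JsonConverter reads when schemas.enable=true. The helper name to_connect_envelope is hypothetical, and treating every field as an optional string is an assumption based on the sample message above; adjust the types to your data.

```python
import json


def to_connect_envelope(record: dict) -> str:
    """Wrap a flat record in the schema/payload envelope used by JsonConverter."""
    # Assumption: every field is an optional string, as in the sample message.
    fields = [
        {"field": name, "type": "string", "optional": True}
        for name in record
    ]
    envelope = {
        "schema": {"type": "struct", "fields": fields, "optional": False},
        "payload": record,
    }
    return json.dumps(envelope)


message = {
    "MANDT": "110",
    "KEY1": "1",
    "KEY2": None,
    "ATT1": "1M",
    "ATT2": "0000000000",
}
wrapped = to_connect_envelope(message)
print(wrapped)
```

With messages shaped like this and value.converter.schemas.enable set to true, the value should arrive in the transform chain as a Struct, so ValueToKey can build a Struct key.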

Related

Kafka connect RabbitMQ unable to use insert field transform: Only Struct objects supported for [field insertion], found: [B

I'm trying to use the InsertField kafka connect transformation with rabbitmq connector.
my configuration:
"config": {
"connector.class": "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
"confluent.topic.bootstrap.servers": "kafka:29092",
"topic.creation.default.replication.factor": 1,
"topic.creation.default.partitions": 1,
"tasks.max": "2",
"kafka.topic": "test",
"rabbitmq.queue": "events",
"rabbitmq.host": "rabbitmq",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"transforms": "InsertField",
"transforms.InsertField.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.InsertField.static.field": "MessageSource",
"transforms.InsertField.static.value": "Kafka Connect framework"
}
I have also tried using ByteArrayConverter as the value converter. Using Python, I send a message as follows:
msg = json.dumps(body)
self.channel.basic_publish(exchange="", routing_key="events", body=msg)
Using encode() to turn it into a byte array does not work either.
The exception I'm receiving is:
Caused by: org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [field insertion], found: [B
at org.apache.kafka.connect.transforms.util.Requirements.requireStruct(Requirements.java:52)
at org.apache.kafka.connect.transforms.InsertField.applyWithSchema(InsertField.java:162)
at org.apache.kafka.connect.transforms.InsertField.apply(InsertField.java:133)
at org.apache.kafka.connect.runtime.TransformationChain.lambda$apply$0(TransformationChain.java:50)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 11 more
I understand the error and thought that using JsonConverter will solve it, but I was wrong. I've also used "value.converter.schemas.enable" : "false" to no avail.
Would appreciate any help. I don't mind sending the data in json form or bytes form, I just want a key:value pair to be added to the event.
Thanks
As the error indicates, you can only insert fields into Structs. To get a Struct from the String/Bytes schemas that the RabbitMQ connector produces, you must chain a HoistField transform before the InsertField one.
To get any Struct from JsonConverter, your JSON needs two top-level fields named schema and payload, and the connector needs
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true"
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained/
Alternatively, use Kafka headers for "source" information, rather than trying to inject it into the value.
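As a rough illustration of the HoistField-before-InsertField suggestion, the Python sketch below builds the transform section of the connector config. The helper name with_hoist_then_insert and the hoisted field name "message" are hypothetical; the transform class names are the standard Kafka Connect ones already used in the question.

```python
import json


def with_hoist_then_insert(config: dict, hoisted_field: str) -> dict:
    """Return a copy of the config that hoists the raw value into a Struct before InsertField."""
    updated = dict(config)
    updated.update({
        # Order matters: HoistField must run first so InsertField sees a Struct.
        "transforms": "HoistField,InsertField",
        "transforms.HoistField.type": "org.apache.kafka.connect.transforms.HoistField$Value",
        "transforms.HoistField.field": hoisted_field,
        "transforms.InsertField.type": "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.InsertField.static.field": "MessageSource",
        "transforms.InsertField.static.value": "Kafka Connect framework",
    })
    return updated


base = {
    "connector.class": "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
    "kafka.topic": "test",
    "rabbitmq.queue": "events",
}
config = with_hoist_then_insert(base, hoisted_field="message")
print(json.dumps(config, indent=2))
```

The original message body ends up nested under the hoisted field, next to the inserted MessageSource field.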

Handling empty/invalid Mqtt Messages with Kafka Connect

I am trying to ingest data from Mqtt into Kafka. Unfortunately, some of those Mqtt-Messages are either empty or invalid JSON. I assume that is what leads to the following exception:
{
"name": "source_mqtt_alarms",
"connector": {
"state": "RUNNING",
"worker_id": "-redacted-:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "-redacted-:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException:
Tolerance exceeded in error handler\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:196)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:122)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.convertTransformedRecord(WorkerSourceTask.java:314)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:340)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:264)\n\t
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)\n\t
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:235)\n\t
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\t
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\t
at java.base/java.lang.Thread.run(Thread.java:834)\n
Caused by: org.apache.kafka.connect.errors.DataException: Conversion error: null value for field that is required and has no default value\n\t
at org.apache.kafka.connect.json.JsonConverter.convertToJson(JsonConverter.java:611)\n\t
at org.apache.kafka.connect.json.JsonConverter.convertToJsonWithEnvelope(JsonConverter.java:592)\n\t
at org.apache.kafka.connect.json.JsonConverter.fromConnectData(JsonConverter.java:346)\n\t
at org.apache.kafka.connect.storage.Converter.fromConnectData(Converter.java:63)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.lambda$convertTransformedRecord$2(WorkerSourceTask.java:314)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:146)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:180)\n\t
... 11 more\n"
}
],
"type": "source"
}
From what I've learned so far, it looks like the incoming (empty/invalid) messages do not contain values that are declared as non-optional, which leads to the exception above.
My question would be, where is the connector taking that expectation from? It says "null value for field that is required and has no default value", but how is that field required if the schema is (I assume) created per message?
Additional information:
I am using the Lenses.io Stream Reactor Mqtt Source Connector. The configuration is as follows:
{
"name": "source_mqtt_alarms",
"config": {
"topics": "alarms",
"connect.mqtt.kcql": "INSERT INTO alarms SELECT * FROM `-redacted-/+/alarms` WITHCONVERTER=`com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter`",
"connect.mqtt.client.id": "kafka_connect_alarms",
"tasks.max": 1,
"connector.class": "com.datamountaineer.streamreactor.connect.mqtt.source.MqttSourceConnector",
"connect.mqtt.service.quality": 2,
"connect.mqtt.hosts": "ssl://-redacted-:8883",
"connect.mqtt.ssl.ca.cert": "/usr/share/certs/cumu.crt",
"connect.mqtt.ssl.cert": "/usr/share/certs/mqtt.crt",
"connect.mqtt.ssl.key": "/usr/share/certs/mqtt.pem",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": true,
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": true
}
}
Edit: I just went through the logs of the Kafka Connect worker and they give a bit more information. Prior to the exception above, I get a lot of these:
[2021-05-26 08:27:19,552] ERROR Error handling message with id:0 on topic:-redacted-/alarms (com.datamountaineer.streamreactor.connect.mqtt.source.MqttManager)
java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:430)
at scala.collection.immutable.Nil$.head(List.scala:427)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter$.convert(JsonSimpleConverter.scala:76)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter$.convert(JsonSimpleConverter.scala:70)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter.convert(JsonSimpleConverter.scala:37)
at com.datamountaineer.streamreactor.connect.mqtt.source.MqttManager.messageArrived(MqttManager.scala:110)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.deliverMessage(CommsCallback.java:514)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.handleMessage(CommsCallback.java:417)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.run(CommsCallback.java:214)
at java.base/java.lang.Thread.run(Thread.java:834)

Cassandra Sink Connector for Confluent Platform

I am trying to run the Cassandra sink connector for Confluent Platform. The cassandra-sink.json file is as below:
{
"name" : "cassandra-sink",
"config" : {
"connector.class" : "io.confluent.connect.cassandra.CassandraSinkConnector",
"tasks.max" : "1",
"topics" : "topic1",
"cassandra.contact.points" : "127.0.0.1",
"cassandra.keyspace" : "test",
"confluent.topic.bootstrap.servers": "127.0.0.1:9092",
"cassandra.write.mode" : "Update",
"connect.cassandra.port":"127.0.0.1:9042"
}
}
I installed the connector with confluent-hub install confluentinc/kafka-connect-cassandra:latest as per the link.
I am able to load the file, but when I check the status I get the error below. I am unable to figure out what the issue is.
FAILED worker_id:127.0.0.1:8083,trace:com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed
com.datastax.driver.core.exceptions.TransportException: [/127.0.0.1:9042] Cannot connect
com.datastax.driver.core.ControlConnection.reconnectInternal
com.datastax.driver.core.ControlConnection.connect
com.datastax.driver.core.Cluster$Manager.negotiateProtocolVersionAndConnect
com.datastax.driver.core.Cluster$Manager.init
com.datastax.driver.core.Cluster.init
com.datastax.driver.core.SessionManager.initAsync
com.datastax.driver.core.SessionManager.executeAsync
com.datastax.driver.core.AbstractSession.execute
io.confluent.connect.cassandra.CassandraSessionImpl.executeStatement
io.confluent.connect.cassandra.CassandraSinkConnector.doStart
io.confluent.connect.cassandra.CassandraSinkConnector.start
org.apache.kafka.connect.runtime.WorkerConnector.doStart
org.apache.kafka.connect.runtime.WorkerConnector.start
org.apache.kafka.connect.runtime.WorkerConnector.transitionTo
org.apache.kafka.connect.runtime.Worker.startConnector
org.apache.kafka.connect.runtime.distributed.DistributedHerder.startConnector
org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1300
org.apache.kafka.connect.runtime.distributed.DistributedHerder$14
org.apache.kafka.connect.runtime.distributed.DistributedHerder$14
java.util.concurrent.FutureTask.run
java.util.concurrent.ThreadPoolExecutor.runWorker
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run
Please guide.

Kafka Connect issue when reading from a RabbitMQ queue

I'm trying to read data into my topic from a RabbitMQ queue using the Kafka connector with the configuration below:
{
"name" : "RabbitMQSourceConnector1",
"config" : {
"connector.class" : "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max" : "1",
"kafka.topic" : "rabbitmqtest3",
"rabbitmq.queue" : "taskqueue",
"rabbitmq.host" : "localhost",
"rabbitmq.username" : "guest",
"rabbitmq.password" : "guest",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true"
}
}
But I'm having trouble converting the source stream to JSON format, as I'm losing the original message.
Original:
{'id': 0, 'body': '010101010101010101010101010101010101010101010101010101010101010101010'}
Received:
{"schema":{"type":"bytes","optional":false},"payload":"eyJpZCI6IDEsICJib2R5IjogIjAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMDEwMTAxMCJ9"}
Does anyone have an idea why this is happening?
EDIT: I tried to convert the message to String using the "value.converter": "org.apache.kafka.connect.storage.StringConverter", but the result is the same:
11/27/19 4:07:37 PM CET , 0 , [B#1583a488
EDIT2:
I'm now receiving the JSON file, but the content is still encoded in Base64.
Any idea how to convert it back to UTF-8 directly?
{
"name": "adls-gen2-sink",
"config": {
"connector.class":"io.confluent.connect.azure.datalake.gen2.AzureDataLakeGen2SinkConnector",
"tasks.max":"1",
"topics":"rabbitmqtest3",
"flush.size":"3",
"format.class":"io.confluent.connect.azure.storage.format.json.JsonFormat",
"value.converter":"org.apache.kafka.connect.converters.ByteArrayConverter",
"internal.value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"topics.dir":"sw66jsoningest",
"confluent.topic.bootstrap.servers":"localhost:9092",
"confluent.topic.replication.factor":"1",
"partitioner.class" : "io.confluent.connect.storage.partitioner.DefaultPartitioner"
}
}
UPDATE:
I got the solution, considering this flow:
Message (JSON) --> RabbitMq (ByteArray) --> Kafka (ByteArray) -->ADLS (JSON)
I used this converter on the RabbitMQ to Kafka connector to decode the message from Base64 to UTF8.
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter"
Afterwards I treated the message as a String and saved it as a JSON.
"value.converter":"org.apache.kafka.connect.storage.StringConverter",
"format.class":"io.confluent.connect.azure.storage.format.json.JsonFormat",
Many thanks!
If you set "schemas.enable": "false", you shouldn't be getting the schema and payload fields.
If you want no translation to happen at all, use ByteArrayConverter.
If your data is just a plain string (which includes JSON), use StringConverter.
It's not clear how you're printing the resulting message, but it looks like you're printing the byte array and not decoding it to a String.
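The Base64 text in the "Received" output is just the original bytes as JsonConverter renders a bytes-typed value. A small Python sketch, using a shortened stand-in for the message in the question, shows that decoding the payload recovers the JSON text:

```python
import base64
import json

# Stand-in for the message published to RabbitMQ (shortened body, for illustration).
original = json.dumps({"id": 1, "body": "0101"}).encode("utf-8")

# Shape of the record seen in the question: a bytes schema with a Base64 payload.
envelope = {
    "schema": {"type": "bytes", "optional": False},
    "payload": base64.b64encode(original).decode("ascii"),
}

# Decoding the payload gives back the original UTF-8 JSON text.
decoded = base64.b64decode(envelope["payload"]).decode("utf-8")
recovered = json.loads(decoded)
print(recovered)
```

So nothing is lost in transit; the bytes only need to be passed through untouched (ByteArrayConverter) or read as text (StringConverter) instead of being re-serialized as a JSON bytes field.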

org.apache.kafka.connect.errors.DataException: Invalid JSON for array default value: "null"

I am trying to use the confluent Kafka s3 connector using confluent-4.1.1.
s3-sink
"value.converter.schema.registry.url": "http://localhost:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter"
When I run Kafka connectors for the s3 sink, I get this error message:
ERROR WorkerSinkTask{id=singular-s3-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
org.apache.kafka.connect.errors.DataException: Invalid JSON for array default value: "null"
at io.confluent.connect.avro.AvroData.defaultValueFromAvro(AvroData.java:1649)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1562)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1443)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1443)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1323)
at io.confluent.connect.avro.AvroData.toConnectData(AvroData.java:1047)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:87)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:468)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:301)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
My Schema contains only 1 array type field and its schema is like this
{"name":"item_id","type":{"type":"array","items":["null","string"]},"default":[]}
I am able to see the deserialized message using the kafka-avro-console-consumer command. I have seen a similar question but in his case, he was using Avro serializer for key also.
./confluent-4.1.1/bin/kafka-avro-console-consumer --topic singular_custom_postback --bootstrap-server localhost:9092 --max-messages 2
"item_id":[{"string":"15552"},{"string":"37810"},{"string":"38061"}]
"item_id":[]
I cannot put the entire output I get from the console consumer as it contains sensitive user information, so I have added the only array type field in my schema.
Thanks in advance.
io.confluent.connect.avro.AvroData.defaultValueFromAvro(AvroData.java:1649) is called while converting the Avro schema of the message you read into the Connect sink's internal schema. I believe it is not related to the data of your message. That is why the AbstractKafkaAvroDeserializer can successfully deserialize your message (e.g. via kafka-avro-console-consumer): your message is a valid Avro message. The above exception may occur if your default value is null while null is not a valid value for your field. E.g.
{
"name":"item_id",
"type":{
"type":"array",
"items":[
"string"
]
},
"default": null
}
I would suggest remotely debugging Connect to see exactly what is failing.
Same problem as the question that you have linked to.
In the source code, you can see this condition.
case ARRAY: {
if (!jsonValue.isArray()) {
throw new DataException("Invalid JSON for array default value: " + jsonValue.toString());
}
The exception can be thrown when the schema type is defined, as in your case, as type: "array", but the value itself is null (or any other non-array type) rather than an actual array, regardless of what you have defined as your schema default value. The default is only applied when the items element isn't there at all, not when "items": null.
Other than that, I would suggest a schema like so, i.e. a record object, not just a named array, with a default of an empty array, not null.
{
"type" : "record",
"name" : "Items",
"namespace" : "com.example.avro",
"fields" : [ {
"name" : "item_id",
"type" : {
"type" : "array",
"items" : [ "null", "string" ]
},
"default": []
} ]
}
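For a quick offline check of a schema, here is a rough Python analogue of the condition quoted from the source code above. It is only a sketch of the idea (an array-typed field must declare an array default), not the connector's actual implementation, and the function name check_array_default is hypothetical.

```python
import json


def check_array_default(field: dict) -> None:
    """Raise ValueError if an array-typed field declares a default that is not a JSON array."""
    field_type = field["type"]
    is_array = isinstance(field_type, dict) and field_type.get("type") == "array"
    if is_array and "default" in field and not isinstance(field["default"], list):
        raise ValueError(
            "Invalid JSON for array default value: " + json.dumps(field["default"])
        )


good_field = {
    "name": "item_id",
    "type": {"type": "array", "items": ["null", "string"]},
    "default": [],
}
bad_field = {
    "name": "item_id",
    "type": {"type": "array", "items": ["string"]},
    "default": None,
}

check_array_default(good_field)  # passes silently
try:
    check_array_default(bad_field)
except ValueError as err:
    print(err)
```

Running a check like this over each field of the registered schema can help narrow down which default the converter is rejecting.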