Handling empty/invalid MQTT messages with Kafka Connect

I am trying to ingest data from MQTT into Kafka. Unfortunately, some of those MQTT messages are either empty or invalid JSON. I assume that is what leads to the following exception:
{
"name": "source_mqtt_alarms",
"connector": {
"state": "RUNNING",
"worker_id": "-redacted-:8083"
},
"tasks": [
{
"id": 0,
"state": "FAILED",
"worker_id": "-redacted-:8083",
"trace": "org.apache.kafka.connect.errors.ConnectException:
Tolerance exceeded in error handler\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:196)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:122)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.convertTransformedRecord(WorkerSourceTask.java:314)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:340)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:264)\n\t
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)\n\t
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:235)\n\t
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\t
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\t
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\t
at java.base/java.lang.Thread.run(Thread.java:834)\n
Caused by: org.apache.kafka.connect.errors.DataException: Conversion error: null value for field that is required and has no default value\n\t
at org.apache.kafka.connect.json.JsonConverter.convertToJson(JsonConverter.java:611)\n\t
at org.apache.kafka.connect.json.JsonConverter.convertToJsonWithEnvelope(JsonConverter.java:592)\n\t
at org.apache.kafka.connect.json.JsonConverter.fromConnectData(JsonConverter.java:346)\n\t
at org.apache.kafka.connect.storage.Converter.fromConnectData(Converter.java:63)\n\t
at org.apache.kafka.connect.runtime.WorkerSourceTask.lambda$convertTransformedRecord$2(WorkerSourceTask.java:314)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:146)\n\t
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:180)\n\t
... 11 more\n"
}
],
"type": "source"
}
From what I've learned so far, it looks like the incoming (empty/invalid) messages do not contain values that are declared as non-optional, which leads to the exception above.
My question is: where does the connector get that expectation from? It says "null value for field that is required and has no default value", but how can a field be required if the schema is (I assume) created per message?
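For reference, with schemas.enable=true the JsonConverter wraps each record in a schema/payload envelope roughly like the sketch below (the field name is made up). As far as I can tell, the schema is whatever the source-side converter built for that particular message, and a field marked "optional": false without a default is exactly what produces the error above when its value ends up null.
{
  "schema": {
    "type": "struct",
    "optional": false,
    "fields": [
      { "field": "alarmCode", "type": "int32", "optional": false }
    ]
  },
  "payload": { "alarmCode": 17 }
}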
Additional information:
I am using the Lenses.io Stream Reactor MQTT Source Connector. The configuration is as follows:
{
"name": "source_mqtt_alarms",
"config": {
"topics": "alarms",
"connect.mqtt.kcql": "INSERT INTO alarms SELECT * FROM `-redacted-/+/alarms` WITHCONVERTER=`com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter`",
"connect.mqtt.client.id": "kafka_connect_alarms",
"tasks.max": 1,
"connector.class": "com.datamountaineer.streamreactor.connect.mqtt.source.MqttSourceConnector",
"connect.mqtt.service.quality": 2,
"connect.mqtt.hosts": "ssl://-redacted-:8883",
"connect.mqtt.ssl.ca.cert": "/usr/share/certs/cumu.crt",
"connect.mqtt.ssl.cert": "/usr/share/certs/mqtt.crt",
"connect.mqtt.ssl.key": "/usr/share/certs/mqtt.pem",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": true,
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": true,
}
}
Edit: I just went through the logs of the Kafka Connect worker and it gives a bit more information. Prior to the exception above, I get a lot of these:
[2021-05-26 08:27:19,552] ERROR Error handling message with id:0 on topic:-redacted-/alarms (com.datamountaineer.streamreactor.connect.mqtt.source.MqttManager)
java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:430)
at scala.collection.immutable.Nil$.head(List.scala:427)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter$.convert(JsonSimpleConverter.scala:76)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter$.convert(JsonSimpleConverter.scala:70)
at com.datamountaineer.streamreactor.connect.converters.source.JsonSimpleConverter.convert(JsonSimpleConverter.scala:37)
at com.datamountaineer.streamreactor.connect.mqtt.source.MqttManager.messageArrived(MqttManager.scala:110)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.deliverMessage(CommsCallback.java:514)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.handleMessage(CommsCallback.java:417)
at org.eclipse.paho.client.mqttv3.internal.CommsCallback.run(CommsCallback.java:214)
at java.base/java.lang.Thread.run(Thread.java:834)
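For completeness: the "Tolerance exceeded in error handler" wrapper in the first trace comes from Connect's RetryWithToleranceOperator, which is governed by the errors.* connector properties. I assume adding something like the following to the connector config would make the framework skip records that fail in its converter stage, though it would not catch the MqttManager error above, since that one is raised inside the connector before Connect's conversion runs:
"errors.tolerance": "all",
"errors.log.enable": true,
"errors.log.include.messages": true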

Related

Kafka Connect RabbitMQ unable to use InsertField transform: Only Struct objects supported for [field insertion], found: [B

I'm trying to use the InsertField Kafka Connect transformation with the RabbitMQ connector.
My configuration:
"config": {
"connector.class": "io.confluent.connect.rabbitmq.RabbitMQSourceConnector",
"confluent.topic.bootstrap.servers": "kafka:29092",
"topic.creation.default.replication.factor": 1,
"topic.creation.default.partitions": 1,
"tasks.max": "2",
"kafka.topic": "test",
"rabbitmq.queue": "events",
"rabbitmq.host": "rabbitmq",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"transforms": "InsertField",
"transforms.InsertField.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.InsertField.static.field": "MessageSource",
"transforms.InsertField.static.value": "Kafka Connect framework"
}
I have also tried using the ByteArrayConverter as the value converter. Using Python, I send a message as follows:
msg = json.dumps(body)
self.channel.basic_publish(exchange="", routing_key="events", body=msg)
Transforming it into a byte array with encode() does not work either.
The exception I'm receiving is:
Caused by: org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [field insertion], found: [B
at org.apache.kafka.connect.transforms.util.Requirements.requireStruct(Requirements.java:52)
at org.apache.kafka.connect.transforms.InsertField.applyWithSchema(InsertField.java:162)
at org.apache.kafka.connect.transforms.InsertField.apply(InsertField.java:133)
at org.apache.kafka.connect.runtime.TransformationChain.lambda$apply$0(TransformationChain.java:50)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 11 more
I understand the error and thought that using the JsonConverter would solve it, but I was wrong. I've also tried "value.converter.schemas.enable": "false" to no avail.
Would appreciate any help. I don't mind sending the data in json form or bytes form, I just want a key:value pair to be added to the event.
Thanks
As the error indicates, you can only insert fields into Structs. To get a Struct from the RabbitMQ connector's String/Bytes schemas, you must chain a HoistField transform before the InsertField one.
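A sketch of the chained transforms (the hoisted field name "message" is an arbitrary choice):
"transforms": "HoistField,InsertField",
"transforms.HoistField.type": "org.apache.kafka.connect.transforms.HoistField$Value",
"transforms.HoistField.field": "message",
"transforms.InsertField.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.InsertField.static.field": "MessageSource",
"transforms.InsertField.static.value": "Kafka Connect framework"
HoistField wraps the existing bytes/string value into a single-field Struct (or a Map, if the value is schemaless), which InsertField can then add a field to.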
To get a Struct out of the JsonConverter, your JSON needs two top-level fields named schema and payload, and the connector needs
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true"
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained/
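The envelope the JsonConverter expects (and produces) looks roughly like this sketch, with made-up field names:
{
  "schema": {
    "type": "struct",
    "optional": false,
    "fields": [
      { "field": "event_type", "type": "string", "optional": false },
      { "field": "count", "type": "int64", "optional": true }
    ]
  },
  "payload": { "event_type": "click", "count": 3 }
}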
Alternatively, use Kafka headers for "source" information, rather than trying to inject into the value
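For example, assuming a Connect version recent enough to ship the header SMTs (Kafka 3.0+), something like:
"transforms": "AddSourceHeader",
"transforms.AddSourceHeader.type": "org.apache.kafka.connect.transforms.InsertHeader",
"transforms.AddSourceHeader.header": "MessageSource",
"transforms.AddSourceHeader.value.literal": "Kafka Connect framework"
This attaches the information as a record header, so the value does not need to be a Struct at all.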

MongoDB Kafka Connect ChangeStreamHandler does not support truncatedArrays

I am using the ChangeStreamHandler in the MongoDB Kafka sink connector to stream changes from a MongoDB source collection to a sink collection:
"change.data.capture.handler": "com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler"
On update events from the source MongoDB collection, the change stream handler fails with this exception:
ERROR Unable to process record SinkRecord{kafkaOffset=3, timestampType=CreateTime} ConnectRecord{topic='quickstart.sampleData', kafkaPartition=0, key={"_id": {"_data": "8262A5CD4B000000012B022C0100296E5A1004B80560BF7F114B04962A5F523CEAB5D046645F6964006462A5CC9B84956FD488691BF10004"}}, keySchema=Schema{STRING}, value={"_id": {"_data": "8262A5CD4B000000012B022C0100296E5A1004B80560BF7F114B04962A5F523CEAB5D046645F6964006462A5CC9B84956FD488691BF10004"}, "operationType": "update", "clusterTime": {"$timestamp": {"t": 1655033163, "i": 1}}, "ns": {"db": "quickstart", "coll": "sampleData"}, "documentKey": {"_id": {"$oid": "62a5cc9b84956fd488691bf1"}}, "updateDescription": {"updatedFields": {"hello": "moto"}, "removedFields": [], "truncatedArrays": []}}, valueSchema=Schema{STRING}, timestamp=1655033166742, headers=ConnectHeaders(headers=)} (com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData)
org.apache.kafka.connect.errors.DataException: Warning unexpected field(s) in updateDescription [truncatedArrays]. {"updatedFields": {"hello": "moto"}, "removedFields": [], "truncatedArrays": []}. Cannot process due to risk of data loss.
at com.mongodb.kafka.connect.sink.cdc.mongodb.operations.OperationHelper.getUpdateDocument(OperationHelper.java:99)
at com.mongodb.kafka.connect.sink.cdc.mongodb.operations.Update.perform(Update.java:57)
at com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler.handle(ChangeStreamHandler.java:84)
at com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData.lambda$buildWriteModelCDC$3(MongoProcessedSinkRecordData.java:99)
at java.base/java.util.Optional.flatMap(Optional.java:294)
Below is the Change stream event received on the sink side
{"schema":{"type":"string","optional":false},"payload":"{\"_id\": {\"_data\": \"8262A5CD4B000000012B022C0100296E5A1004B80560BF7F114B04962A5F523CEAB5D046645F6964006462A5CC9B84956FD488691BF10004\"}, \"operationType\": \"update\", \"clusterTime\": {\"$timestamp\": {\"t\": 1655033163, \"i\": 1}}, \"ns\": {\"db\": \"quickstart\", \"coll\": \"sampleData\"}, \"documentKey\": {\"_id\": {\"$oid\": \"62a5cc9b84956fd488691bf1\"}}, \"updateDescription\": {\"updatedFields\": {\"hello\": \"moto\"}, \"removedFields\": [], \"truncatedArrays\": []}}"}
Looking at the code in
com.mongodb.kafka.connect.sink.cdc.mongodb.operations.OperationHelper.getUpdateDocument(OperationHelper.java:99)
it shows that the updateDescription handling only covers updatedFields and removedFields; support for truncatedArrays is not present.
Is this a bug, or do I need to tune my source connector to somehow stop sending truncatedArrays in change events?
I had the same issue and could solve it by setting the following configuration on the source connector:
"change.stream.full.document": "updateLookup"
A full example:
{
"name": "mongo-simple-source",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"connection.uri": "yourMongodbUri",
"database": "yourDataBase",
"collection": "yourCollection",
"change.stream.full.document": "updateLookup"
}
}

Debezium Connector filter "partly" working

We have a Debezium connector that runs without any errors. Two filtering conditions are applied; one of them works as intended, but the other one seems to have no effect. These are the important parts of the config:
"connector.class": "io.debezium.connector.oracle.OracleConnector",
"transforms.filter.topic.regex": "topicname",
"database.connection.adapter": "logminer",
"transforms": "filter",
"schema.include.list": "xxxx",
"transforms.filter.type": "io.debezium.transforms.Filter",
"transforms.filter.language": "jsr223.groovy",
"tombstones.on.delete": "false",
"transforms.filter.condition": "value.op == \"c\" && value.after.QUEUELOCATIONTYPE == 5",
"table.include.list": "xxxxxx",
"skipped.operations": "u,d,r",
"snapshot.mode": "initial",
"topics": "xxxxxxx"
As you can see, we want to keep records that have op "c" and QUEUELOCATIONTYPE 5. In the Kafka topic all the records have the op field set to "c", but the second condition does not work: there are records with QUEUELOCATIONTYPE 2, 3, 4, etc.
A sample record is given below.
"payload": {
"before": null,
"after": {
"EVENTOBJECTID": "749dc9ea-a7aa-44c2-9af7-10574769c7db",
"QUEUECODE": "STDQSTDBKP",
"STATE": 6,
"RECORDDATE": 1638964344000,
"RECORDREQUESTOBJECTID": "32b7f617-60e8-4020-98b0-66f288433031",
"QUEUELOCATIONTYPE": 4,
"RETRYCOUNT": 0,
"RECORDCHANNELCODE": null,
"MESSAGEBROKERSERVERID": 1
},
"op": "c",
"ts_ms": 1638953572392,
"transaction": null
}
}
What could be the problem? Even though I didn't expect it to work, I've tried switching the order of the conditions. There are no errors and the connector is running.
OK, solved it. I was using a pre-created config. While reading the documentation, I saw that "skipped.operations": "u,d,r" is not an Oracle connector configuration; it comes from the MySQL documentation. So I deleted it and changed the connector name (cached data can often cause problems). Looks like it's working now.
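For reference, the filter part of the working config is essentially unchanged, just without skipped.operations (connection and table settings omitted):
"transforms": "filter",
"transforms.filter.type": "io.debezium.transforms.Filter",
"transforms.filter.language": "jsr223.groovy",
"transforms.filter.topic.regex": "topicname",
"transforms.filter.condition": "value.op == \"c\" && value.after.QUEUELOCATIONTYPE == 5"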

How to migrate consumer offsets using MirrorMaker 2.0?

With Kafka 2.7.0, I am using MirrorMaker 2.0 as a Kafka Connect connector to replicate all the topics from the primary Kafka cluster to the backup cluster.
All the topics are being replicated perfectly except __consumer_offsets. Below is the connector configuration:
{
"name": "test-connector",
"config": {
"connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
"topics.blacklist": "some-random-topic",
"replication.policy.separator": "",
"source.cluster.alias": "",
"target.cluster.alias": "",
"exclude.internal.topics":"false",
"tasks.max": "10",
"key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"source.cluster.bootstrap.servers": "xx.xx.xxx.xx:9094",
"target.cluster.bootstrap.servers": "yy.yy.yyy.yy:9094",
"topics": "test-topic-from-primary,primary-kafka-connect-offset,primary-kafka-connect-config,primary-kafka-connect-status,__consumer_offsets"
}
}
In a similar question here, the accepted answer says the following:
Add this in your consumer.config:
exclude.internal.topics=false
And add this in your producer.config:
client.id=__admin_client
Where do I add these in my configuration?
The connector configuration properties listed here do not include a property named client.id; I have set exclude.internal.topics to false, though.
Is there something I am missing here?
UPDATE
I learned that Kafka 2.7 and above supports automated consumer offset sync using MirrorCheckpointTask as mentioned here.
I have created a connector for this having the below configurations:
{
"name": "mirror-checkpoint-connector",
"config": {
"connector.class": "org.apache.kafka.connect.mirror.MirrorCheckpointConnector",
"sync.group.offsets.enabled": "true",
"source.cluster.alias": "",
"target.cluster.alias": "",
"exclude.internal.topics":"false",
"tasks.max": "10",
"key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"source.cluster.bootstrap.servers": "xx.xx.xxx.xx:9094",
"target.cluster.bootstrap.servers": "yy.yy.yyy.yy:9094",
"topics": "__consumer_offsets"
}
}
Still no help.
Is this the correct approach? Is there something needed?
You do not want to replicate __consumer_offsets. The offsets in the destination cluster will not match those of the source cluster, for various reasons.
MirrorMaker 2 provides the ability to do offset translation: it populates the destination cluster with translated offsets generated from the source cluster. https://cwiki.apache.org/confluence/display/KAFKA/KIP-545%3A+support+automated+consumer+offset+sync+across+clusters+in+MM+2.0
__consumer_offsets is ignored by default
topics.exclude = [.*[\-\.]internal, .*\.replica, __.*]
You'll need to override this config if you really want that topic mirrored; a sketch follows.
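A sketch of that override on the MirrorSourceConnector, assuming a version that exposes topics.exclude (older releases call it topics.blacklist); dropping the __.* pattern from the default list keeps the other exclusions but lets __consumer_offsets through:
"topics.exclude": ".*[\\-\\.]internal,.*\\.replica"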

Kafka Connect: Topic shows 3x the number of events than expected

We are using Kafka Connect JDBC to sync tables between two databases (Debezium would be perfect for this but is out of the question).
The sync in general works fine, but there seem to be 3x as many events/messages stored in the topics as expected.
What could be the reason for this?
Some additional information:
The target database contains the expected number of records (the count of messages in the topics divided by 3).
Most of the topics are split into 3 partitions (the key is set via SMT, the DefaultPartitioner is used).
JDBC Source Connector
{
"name": "oracle_source",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:oracle:thin:#dbdis01.allesklar.de:1521:stg_cdb",
"connection.user": "****",
"connection.password": "****",
"schema.pattern": "BBUCH",
"topic.prefix": "oracle_",
"table.whitelist": "cdc_companies, cdc_partners, cdc_categories, cdc_additional_details, cdc_claiming_history, cdc_company_categories, cdc_company_custom_fields, cdc_premium_custom_field_types, cdc_premium_custom_fields, cdc_premiums, cdc, cdc_premium_redirects, intermediate_oz_data, intermediate_oz_mapping",
"table.types": "VIEW",
"mode": "timestamp+incrementing",
"incrementing.column.name": "id",
"timestamp.column.name": "ts",
"key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"validate.non.null": false,
"numeric.mapping": "best_fit",
"db.timezone": "Europe/Berlin",
"transforms":"createKey, extractId, dropTimestamp, deleteTransform",
"transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields": "id",
"transforms.extractId.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractId.field": "id",
"transforms.dropTimestamp.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.dropTimestamp.blacklist": "ts",
"transforms.deleteTransform.type": "de.meinestadt.kafka.DeleteTransformation"
}
}
JDBC Sink Connector
{
"name": "postgres_sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url": "jdbc:postgresql://writer.branchenbuch.psql.integration.meinestadt.de:5432/branchenbuch",
"connection.user": "****",
"connection.password": "****",
"key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.schemas.enable": true,
"insert.mode": "upsert",
"pk.mode": "record_key",
"pk.fields": "id",
"delete.enabled": true,
"auto.create": true,
"auto.evolve": true,
"topics.regex": "oracle_cdc_.*",
"transforms": "dropPrefix",
"transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex": "oracle_cdc_(.*)",
"transforms.dropPrefix.replacement": "$1"
}
}
Strange Topic Count
This isn't an answer per se, but it's easier to format here than in the comments box.
It's not clear why you'd be getting duplicates. Some possibilities would be:
You have more than one instance of the connector running
You have one instance of the connector running but have previously run other instances which loaded the same data to the topic
Data's coming from multiple tables and being merged into one topic (not possible here based on your config, but if you were using Single Message Transform to modify target-topic name could be a possibility)
In terms of investigation I would suggest:
Isolate the problem by splitting the connector into one connector per table.
Examine each topic and locate examples of the duplicate messages. See if there is a pattern to which topics have duplicates. KSQL will be useful here:
SELECT ROWKEY, COUNT(*) FROM source GROUP BY ROWKEY HAVING COUNT(*) > 1
I'm guessing at ROWKEY (the key of the Kafka message) - you'll know your data and which columns should be unique and can be used to detect duplicates.
Once you've found a duplicate message, use kafkacat to examine the duplicate instances. Do they have the exact same Kafka message timestamp?
For more back and forth, StackOverflow isn't such an appropriate platform - I'd recommend heading to http://cnfl.io/slack and the #connect channel.