I'm using Kafka Connect with the MongoDB source connector (MongoSourceConnector) and the following configuration:
curl -X PUT http://localhost:8083/connectors/mongo-source2/config -H "Content-Type: application/json" -d '{
"name":"mongo-source2",
"tasks.max":1,
"connector.class":"com.mongodb.kafka.connect.MongoSourceConnector",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.storage.StringConverter",
"connection.uri":"mongodb://xxx:xxx#localhost:27017/mydb",
"database":"mydb",
"collection":"claimmappingrules.66667777-8888-9999-0000-666677770000",
"pipeline":"[{\"$addFields\": {\"something\":\"xxxx\"} }]",
"transforms":"dropTopicPrefix",
"transforms.dropTopicPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropTopicPrefix.regex":".*",
"transforms.dropTopicPrefix.replacement":"my-topic"
}'
For some reason, when I consume messages, I'm getting a weird key:
"_id": {
"_data": "825DFD2A53000000012B022C0100296E5A1004060C0FB7484A4990A7363EF5F662CF8D465A5F6964005A1003F9974744D06AFB498EF8D78370B0CD440004"
}
I have no idea where it came from. My Mongo document's _id is a UUID, and when consuming messages I expected to see the documentKey field as my consumer key.
Here is a message example of what the connector published into kafka:
{
"_id": {
"_data": "825DFD2A53000000012B022C0100296E5A1004060C0FB7484A4990A7363EF5F662CF8D465A5F6964005A1003F9974744D06AFB498EF8D78370B0CD440004"
},
"operationType": "replace",
"clusterTime": {
"$timestamp": {
"t": 1576872531,
"i": 1
}
},
"fullDocument": {
"_id": {
"$binary": "+ZdHRNBq+0mO+NeDcLDNRA==",
"$type": "03"
},
...
},
"ns": {
"db": "security",
"coll": "users"
},
"documentKey": {
"_id": {
"$binary": "+ZdHRNBq+0mO+NeDcLDNRA==",
"$type": "03"
}
}
}
The documentation on schema configuration for Kafka Connect is really limited. I know it is too late to reply, but lately I have been facing the same issue and found the solution by trial and error.
I added these two properties to my MongoDB Kafka Connect configuration -
"output.format.key": "schema",
"output.schema.key": "{\"name\":\"sampleId\",\"type\":\"record\",\"namespace\":\"com.mongoexchange.avro\",\"fields\":[{\"name\":\"documentKey._id\",\"type\":\"string\"}]}",
Even after this, I still don't know whether using the change stream's resume_token as the key for Kafka partition assignment has any significance for performance, or what happens when the resume_token expires after a long period of inactivity.
P.S. - The final version of my Kafka Connect configuration for MongoDB as a source is this -
{
"tasks.max": 1,
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"connection.uri": "mongodb://example-mongodb-0:27017,example-mongodb-1:27017,example-mongodb-2:27017/?replicaSet=replicaSet",
"database": "exampleDB",
"collection": "exampleCollection",
"output.format.key": "schema",
"output.schema.key": "{\"name\":\"ClassroomId\",\"type\":\"record\",\"namespace\":\"com.mongoexchange.avro\",\"fields\":[{\"name\":\"documentKey._id\",\"type\":\"string\"}]}",
"change.stream.full.document": "updateLookup",
"copy.existing": "true",
"topic.prefix": "mongodb"
}
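To verify the key, you can consume the topic with the key printed. A minimal sketch, assuming the topic name is mongodb.exampleDB.exampleCollection (topic.prefix plus database and collection) and a broker on localhost:9092:
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic mongodb.exampleDB.exampleCollection \
  --from-beginning \
  --property print.key=true \
  --property key.separator=" | "
With output.format.key set to schema and the key schema above, the key should now be derived from documentKey._id instead of the change stream resume token.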
I am producing simple plaintext JSON-like data to Kafka with the plain kafka-console-producer command, and I want to sink this data to a database table. I have tried many ways to do this, but I always get a deserializer error or an "unknown magic byte" error.
There is no serialization or schema validation on the producer side, but the data always has the same shape.
We can't change the producer configs to add a serializer either.
schema :
{
"type": "record",
"name": "people",
"namespace": "com.cena",
"doc": "This is a sample Avro schema to get you started. Please edit",
"fields": [
{
"name": "first_name",
"type": "string",
"default":null
},
{
"name": "last_name",
"type": "string",
"default":null
},
{
"name": "town",
"type": "string",
"default":null
},
{
"name": "country_code",
"type": "string",
"default":null
},
{
"name": "mobile_number",
"type": "string",
"default":null
}
]
}
Connector :
{
"name": "JdbcSinkConnecto",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"table.name.format": "people",
"topics": "people",
"tasks.max": "1",
"transforms": "RenameField",
"transforms.RenameField.renames": "\"town:city,mobile_number:msisdn\"",
"transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"connection.url": "jdbc:postgresql://localhost:5432/postgres",
"connection.password": "postgres",
"connection.user": "postgres",
"insert.mode": "insert",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schema.registry.url": "http://http://localhost:8081"
}
data sample :
{"first_name": "some_name","last_name": "Family","town": "some_city","country_code": "+01","mobile_number": "some_number"}
Is there a way to use Kafka Connect for this?
with simple kafka-console-producer
That doesn't use Avro, so I'm not sure why you added an Avro schema to the question.
You also don't show the value.converter value, so it's unclear whether that data is truly JSON or Avro...
You are required to add a schema to the data for the JDBC sink. If you use plain JSON and kafka-console-producer, then you need data that looks like {"schema": ... , "payload": { your data here } }, and you need value.converter.schemas.enable=true when the value.converter class is JsonConverter (see the sketch below).
ref. Converters and Serializers Deep Dive
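For the plain-JSON route, every line given to kafka-console-producer has to carry that envelope. A rough sketch for your data sample (the struct field list here is my assumption, mirroring the Avro schema's fields as optional strings):
{
  "schema": {
    "type": "struct",
    "name": "people",
    "optional": false,
    "fields": [
      { "field": "first_name", "type": "string", "optional": true },
      { "field": "last_name", "type": "string", "optional": true },
      { "field": "town", "type": "string", "optional": true },
      { "field": "country_code", "type": "string", "optional": true },
      { "field": "mobile_number", "type": "string", "optional": true }
    ]
  },
  "payload": {
    "first_name": "some_name",
    "last_name": "Family",
    "town": "some_city",
    "country_code": "+01",
    "mobile_number": "some_number"
  }
}
(all on one line when you paste it into the console producer)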
If you want to use Avro, then use kafka-avro-console-producer ... This still accepts JSON inputs, but serializes to Avro (and will fix your magic byte error)
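For example (a sketch only; the Schema Registry URL and the people.avsc file name are assumptions, and note that "default": null on a plain "string" field is not valid Avro, so you would need to drop the defaults or change the types to ["null", "string"]):
kafka-avro-console-producer --bootstrap-server localhost:9092 --topic people \
  --property schema.registry.url=http://localhost:8081 \
  --property value.schema="$(cat people.avsc)"
then paste JSON records like your data sample, one per line.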
Another option would be to use ksqlDB to first parse the JSON into a defined STREAM with typed and named fields, then you can run the Connector from it in embedded mode
By the way, StringConverter does not use the Schema Registry, so remove the schema.registry.url property for it... And if you want to use a registry, don't put http:// twice.
I am new to Kafka Connect.
I am trying to sync the change stream from one Mongo collection to another using Kafka connectors, for both insert and update operations.
Source config-
{
"name": "mongo-sourceV2",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"connection.uri": "mongodb://mongo1:27017/?replicaSet=rs0",
"database": "quickstart",
"collection": "transactionV2",
"pipeline": "[{\"$match\":{\"operationType\": { \"$in\": [ \"update\",\"insert\" ]}}}]"
}}
Sink Config -
{
"name": "mongo-sinkV2",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"connection.uri": "mongodb://mongo1:27017/?replicaSet=rs0",
"database": "quickstart",
"collection": "transactionV1",
"topics": "quickstart.transactionV2",
"errors.tolerance": "all",
"errors.log.enable": true,
"mongo.errors.tolerance": "all",
"mongo.errors.log.enable": true,
"change.data.capture.handler": "com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler"
}}
Kafka Topic Event for update -
{"schema":{"type":"string","optional":false},"payload":"{\"_id\": {\"_data\": \"8262A512F7000000012B022C0100296E5A1004195DB8CC822F4A4FAE4ECCE5917B98A946645F6964006462A512A5F74E67E722B3B6760004\"}, \"operationType\": \"update\", \"clusterTime\": {\"$timestamp\": {\"t\": 1654985463, \"i\": 1}}, \"ns\": {\"db\": \"quickstart\", \"coll\": \"transactionV2\"}, \"documentKey\": {\"_id\": {\"$oid\": \"62a512a5f74e67e722b3b676\"}}, \"updateDescription\": {\"updatedFields\": {\"amount\": 10001}, \"removedFields\": [], \"truncatedArrays\": []}}"}
My inserts are streaming fine, but the updates are failing on the sink connector side with this exception:
[2022-06-11 22:11:07,195] ERROR Unable to process record SinkRecord{kafkaOffset=9, timestampType=CreateTime} ConnectRecord{topic='quickstart.transactionV2', kafkaPartition=0, key={"_id": {"_data": "8262A512F7000000012B022C0100296E5A1004195DB8CC822F4A4FAE4ECCE5917B98A946645F6964006462A512A5F74E67E722B3B6760004"}}, keySchema=Schema{STRING}, value={"_id": {"_data": "8262A512F7000000012B022C0100296E5A1004195DB8CC822F4A4FAE4ECCE5917B98A946645F6964006462A512A5F74E67E722B3B6760004"}, "operationType": "update", "clusterTime": {"$timestamp": {"t": 1654985463, "i": 1}}, "ns": {"db": "quickstart", "coll": "transactionV2"}, "documentKey": {"_id": {"$oid": "62a512a5f74e67e722b3b676"}}, "updateDescription": {"updatedFields": {"amount": 10001}, "removedFields": [], "truncatedArrays": []}}, valueSchema=Schema{STRING}, timestamp=1654985467191, headers=ConnectHeaders(headers=)} (com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData)
org.apache.kafka.connect.errors.DataException: Warning unexpected field(s) in updateDescription [truncatedArrays]. {"updatedFields": {"amount": 10001}, "removedFields": [], "truncatedArrays": []}. Cannot process due to risk of data loss.
at com.mongodb.kafka.connect.sink.cdc.mongodb.operations.OperationHelper.getUpdateDocument(OperationHelper.java:99)
at com.mongodb.kafka.connect.sink.cdc.mongodb.operations.Update.perform(Update.java:57)
at com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler.handle(ChangeStreamHandler.java:84)
at com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData.lambda$buildWriteModelCDC$3(MongoProcessedSinkRecordData.java:99)
at java.base/java.util.Optional.flatMap(Optional.java:294)
at com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData.lambda$buildWriteModelCDC$4(MongoProcessedSinkRecordData.java:99)
I got an answer from the MongoDB community. The issue is in the pipeline:
https://www.mongodb.com/community/forums/t/mongodb-kafka-connect-changestreamhandler-do-not-support-truncatedarrays/169214/3
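Based on that thread, a possible workaround (my assumption, not verified here) is to strip the unsupported field in the source pipeline so the ChangeStreamHandler only sees updatedFields and removedFields:
"pipeline": "[{\"$match\": {\"operationType\": {\"$in\": [\"update\", \"insert\"]}}}, {\"$unset\": \"updateDescription.truncatedArrays\"}]"
($unset as a change stream pipeline stage requires MongoDB 4.2 or later.)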
I have a Postgres database with a table "product" that has a 1-to-n relationship with "sales_Channel", so one Product can have multiple SalesChannels. Now I want to transfer it to Elasticsearch and keep it up to date, so I am using Debezium and Kafka. Transferring the single tables to ES is no problem; I can query for SalesChannels and Products. But I need Products with all their SalesChannels attached as a result. How do I get Debezium to transfer this?
mapping for Product
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"_doc": {
"properties": {
"id": {
"type": "integer"
}
}
}
}
}
sink for Product
{
"name": "es-sink-product",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "product",
"connection.url": "http://elasticsearch:9200",
"transforms": "unwrap,key",
"transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.drop.deletes": "false",
"transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.key.field": "id",
"key.ignore": "false",
"type.name": "_doc",
"behavior.on.null.values": "delete"
}
}
You either need to use the outbox pattern, see https://debezium.io/documentation/reference/1.2/configuration/outbox-event-router.html (a config sketch follows after the links below),
or you can use aggregate objects, see
https://github.com/debezium/debezium-examples/tree/master/jpa-aggregations
https://github.com/debezium/debezium-examples/tree/master/kstreams-fk-join
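For the outbox route, the source side is the regular Debezium Postgres connector plus the EventRouter SMT. A minimal sketch (connection details and the outbox table name are assumptions; the table itself must have the id/aggregatetype/aggregateid/type/payload layout described in the linked docs and be populated by your application):
{
  "name": "outbox-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "mydb",
    "database.server.name": "pg",
    "table.whitelist": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
  }
}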
How can I make the Kafka Connect JDBC connector use a predefined Avro schema? It creates a new schema version when the connector is created. I am reading from DB2 and writing into a Kafka topic.
I am setting the schema name and version during creation, but it does not work! Here are my connector settings:
{
"name": "kafka-connect-jdbc-db2-tst-2",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "1",
"connection.url": "jdbc:db2://mydb2:50000/testdb",
"connection.user": "DB2INST1",
"connection.password": "12345678",
"query":"SELECT CORRELATION_ID FROM TEST.MYVIEW4 ",
"mode": "incrementing",
"incrementing.column.name": "CORRELATION_ID",
"validate.non.null": "false",
"topic.prefix": "tst-4" ,
"auto.register.schemas": "false",
"use.latest.version": "true",
"transforms": "RenameField,SetSchemaMetadata",
"transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.RenameField.renames": "CORRELATION_ID:id",
"transforms.SetSchemaMetadata.type": "org.apache.kafka.connect.transforms.SetSchemaMetadata$Value",
"transforms.SetSchemaMetadata.schema.name": "foo.bar.MyMessage",
"transforms.SetSchemaMetadata.schema.version": "1"
}
}
And here are the schemas: v1 is mine, and v2 was created by the JDBC source connector:
$ curl localhost:8081/subjects/tst-4-value/versions/1 | jq .
{
"subject": "tst-4-value",
"version": 1,
"id": 387,
"schema": "{"type":"record","name":"MyMessage",
"namespace":"foo.bar","fields":[{"name":"id","type":"int"}]}"
}
$ curl localhost:8081/subjects/tst-4-value/versions/2 | jq .
{
"subject": "tst-4-value",
"version": 2,
"id": 386,
"schema": "{"type":"record","name":"MyMessage","namespace":"foo.bar",
"fields":[{"name":"id","type":"int"}],
"connect.version":1,
"connect.name":"foo.bar.MyMessage"
}"
}
Any idea how I can force the Kafka connector to use my schema?
Thanks in advance.
I am working on a design: MySQL -> Debezium -> Kafka -> Flink -> Kafka -> Kafka Connect JDBC -> MySQL. The following is a sample message I write from Flink (I also tried using the Kafka console producer):
{
"schema": {
"type": "struct",
"fields": [
{
"type": "int64",
"optional": false,
"field": "id"
},
{
"type": "string",
"optional": true,
"field": "name"
}
],
"optional": true,
"name": "user"
},
"payload": {
"id": 1,
"name": "Smith"
}
}
but Connect fails in JsonConverter:
DataException: JsonConverter with schemas.enable requires "schema" and "payload" fields and may not contain additional fields. If you are trying to deserialize plain JSON data, set schemas.enable=false in your converter configuration.
at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:338)
I have debugged it, and in the method public SchemaAndValue toConnectData(String topic, byte[] value) the value is null. My sink configuration is:
{
"name": "user-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "user",
"connection.url": "jdbc:mysql://localhost:3306/my_db?verifyServerCertificate=false",
"connection.user": "root",
"connection.password": "root",
"auto.create": "true",
"insert.mode": "upsert",
"pk.fields": "id",
"pk.mode": "record_value"
}
}
Can someone please help me on this issue?
I think the issue is not related to the value serialization of the Kafka message. It is rather a problem with the key of the message.
What is your key.converter? I think it is the same as your value.converter (org.apache.kafka.connect.json.JsonConverter). Your key might be a simple String that doesn't contain schema and payload fields.
Try changing key.converter to org.apache.kafka.connect.storage.StringConverter
For Kafka Connect you set default converters, but you can also set a specific one for a particular connector configuration (which overrides the default). For that you have to modify your config request:
{
"name": "user-sink",
"config": {
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "user",
"connection.url": "jdbc:mysql://localhost:3306/my_db?verifyServerCertificate=false",
"connection.user": "root",
"connection.password": "root",
"auto.create": "true",
"insert.mode": "upsert",
"pk.fields": "id",
"pk.mode": "record_value"
}
}
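For example, you can apply it with the Connect REST API (a sketch; localhost:8083 and the connector name user-sink are assumptions). Note that PUT .../config takes the flat config object, without the name/config wrapper:
curl -X PUT http://localhost:8083/connectors/user-sink/config -H "Content-Type: application/json" -d '{
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
  "tasks.max": "1",
  "topics": "user",
  "connection.url": "jdbc:mysql://localhost:3306/my_db?verifyServerCertificate=false",
  "connection.user": "root",
  "connection.password": "root",
  "auto.create": "true",
  "insert.mode": "upsert",
  "pk.fields": "id",
  "pk.mode": "record_value"
}'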