Kafka Connect produces CDC events with null values when reading from MongoDB with Debezium

When reading a Kafka topic that contains many CDC events produced by Kafka Connect with Debezium, where the data source is a MongoDB collection with a TTL index, I see that some of the events are null; they appear in between the deletion events. What does this really mean?
As I understand it, all CDC events should have the CDC event structure, even the deletion events, so why are there events with a null value?
null,
{
  "after": null,
  "patch": null,
  "source": {
    "version": "0.9.3.Final",
    "connector": "mongodb",
    "name": "test",
    "rs": "rs1",
    "ns": "testestest",
    "sec": 1555060472,
    "ord": 297,
    "h": 1196279425766381600,
    "initsync": false
  },
  "op": "d",
  "ts_ms": 1555060472177
},
null,
{
  "after": null,
  "patch": null,
  "source": {
    "version": "0.9.3.Final",
    "connector": "mongodb",
    "name": "test",
    "rs": "rs1",
    "ns": "testestest",
    "sec": 1555060472,
    "ord": 298,
    "h": -2199232943406075600,
    "initsync": false
  },
  "op": "d",
  "ts_ms": 1555060472177
}
I use the Debezium MongoDB connector (https://debezium.io/docs/connectors/mongodb/) without flattening any events, with the following configuration:
{
  "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
  "mongodb.hosts": "live.xxx.xxx:27019",
  "mongodb.name": "testmongodb",
  "collection.whitelist": "testest",
  "tasks.max": 4,
  "snapshot.mode": "never",
  "poll.interval.ms": 15000
}

These are so-called tombstone events, used so that log compaction can correctly remove deleted records - see https://kafka.apache.org/documentation/#compaction. From the Kafka documentation:
Compaction also allows for deletes. A message with a key and a null payload will be treated as a delete from the log. This delete marker will cause any prior message with that key to be removed (as would any new message with that key), but delete markers are special in that they will themselves be cleaned out of the log after a period of time to free up space. The point in time at which deletes are no longer retained is marked as the "delete retention point" in the diagram there.
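If the tombstones are unwanted downstream, they can be filtered out in the consumer (skip records with a null value), suppressed at the source (recent Debezium versions expose a tombstones.on.delete connector property), or left alone so that compaction reaps them. A sketch of the relevant settings, with the retention value chosen purely for illustration:

# Debezium connector property (recent versions): do not emit a tombstone after each delete event
"tombstones.on.delete": "false"

# Topic-level compaction settings that govern how long delete markers are retained
cleanup.policy=compact
delete.retention.ms=86400000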

Related

MongoDB Kafka Connect - Sink connector failing on updates

I am new to Kafka Connect.
I am trying to sync the change stream from one Mongo collection to another using Kafka connectors, for both insert and update operations.
Source config:
{
  "name": "mongo-sourceV2",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb://mongo1:27017/?replicaSet=rs0",
    "database": "quickstart",
    "collection": "transactionV2",
    "pipeline": "[{\"$match\":{\"operationType\": { \"$in\": [ \"update\",\"insert\" ]}}}]"
  }
}
Sink config:
{
  "name": "mongo-sinkV2",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "connection.uri": "mongodb://mongo1:27017/?replicaSet=rs0",
    "database": "quickstart",
    "collection": "transactionV1",
    "topics": "quickstart.transactionV2",
    "errors.tolerance": "all",
    "errors.log.enable": true,
    "mongo.errors.tolerance": "all",
    "mongo.errors.log.enable": true,
    "change.data.capture.handler": "com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler"
  }
}
Kafka topic event for an update:
{"schema":{"type":"string","optional":false},"payload":"{\"_id\": {\"_data\": \"8262A512F7000000012B022C0100296E5A1004195DB8CC822F4A4FAE4ECCE5917B98A946645F6964006462A512A5F74E67E722B3B6760004\"}, \"operationType\": \"update\", \"clusterTime\": {\"$timestamp\": {\"t\": 1654985463, \"i\": 1}}, \"ns\": {\"db\": \"quickstart\", \"coll\": \"transactionV2\"}, \"documentKey\": {\"_id\": {\"$oid\": \"62a512a5f74e67e722b3b676\"}}, \"updateDescription\": {\"updatedFields\": {\"amount\": 10001}, \"removedFields\": [], \"truncatedArrays\": []}}"}
My inserts are streaming fine, but the updates are failing on the sink connector side with this exception:
[2022-06-11 22:11:07,195] ERROR Unable to process record SinkRecord{kafkaOffset=9, timestampType=CreateTime} ConnectRecord{topic='quickstart.transactionV2', kafkaPartition=0, key={"_id": {"_data": "8262A512F7000000012B022C0100296E5A1004195DB8CC822F4A4FAE4ECCE5917B98A946645F6964006462A512A5F74E67E722B3B6760004"}}, keySchema=Schema{STRING}, value={"_id": {"_data": "8262A512F7000000012B022C0100296E5A1004195DB8CC822F4A4FAE4ECCE5917B98A946645F6964006462A512A5F74E67E722B3B6760004"}, "operationType": "update", "clusterTime": {"$timestamp": {"t": 1654985463, "i": 1}}, "ns": {"db": "quickstart", "coll": "transactionV2"}, "documentKey": {"_id": {"$oid": "62a512a5f74e67e722b3b676"}}, "updateDescription": {"updatedFields": {"amount": 10001}, "removedFields": [], "truncatedArrays": []}}, valueSchema=Schema{STRING}, timestamp=1654985467191, headers=ConnectHeaders(headers=)} (com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData)
org.apache.kafka.connect.errors.DataException: Warning unexpected field(s) in updateDescription [truncatedArrays]. {"updatedFields": {"amount": 10001}, "removedFields": [], "truncatedArrays": []}. Cannot process due to risk of data loss.
at com.mongodb.kafka.connect.sink.cdc.mongodb.operations.OperationHelper.getUpdateDocument(OperationHelper.java:99)
at com.mongodb.kafka.connect.sink.cdc.mongodb.operations.Update.perform(Update.java:57)
at com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler.handle(ChangeStreamHandler.java:84)
at com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData.lambda$buildWriteModelCDC$3(MongoProcessedSinkRecordData.java:99)
at java.base/java.util.Optional.flatMap(Optional.java:294)
at com.mongodb.kafka.connect.sink.MongoProcessedSinkRecordData.lambda$buildWriteModelCDC$4(MongoProcessedSinkRecordData.java:99)
I got an update from the MongoDB community: the issue is in the pipeline.
https://www.mongodb.com/community/forums/t/mongodb-kafka-connect-changestreamhandler-do-not-support-truncatedarrays/169214/3
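Based on the error message, one workaround (an assumption on my part, not a quote from that thread) is to strip the unsupported field in the source pipeline so the ChangeStreamHandler never sees it, for example:

"pipeline": "[{\"$match\": {\"operationType\": {\"$in\": [\"update\", \"insert\"]}}}, {\"$unset\": \"updateDescription.truncatedArrays\"}]"

The $unset stage is only available in change stream pipelines on MongoDB 4.2 and later.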

How can I select internal fields of CDC JSON to be Key of record in Kafka using SMT?

I have tried using the SMT configurations ValueToKey and ExtractField$Key on the following CDC JSON data, but because the id field is nested, it gives me an error saying the field is not recognized. How can I access the nested fields?
"before": null,
"after": {
"id": 4,
"salary": 5000
},
"source": {
"version": "1.5.0.Final",
"connector": "mysql",
"name": "Try-",
"ts_ms": 1623834752000,
"snapshot": "false",
"db": "mysql_db",
"sequence": null,
"table": "EmpSalary",
"server_id": 1,
"gtid": null,
"file": "binlog.000004",
"pos": 374,
"row": 0,
"thread": null,
"query": null
},
"op": "c",
"ts_ms": 1623834752982,
"transaction": null
}
Configuration Used:
transforms=createKey,extractInt
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.extractInt.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractInt.field=id
key.converter.schemas.enable=false
value.converter.schemas.enable=false
With these transformations and the changes in the properties file, could I make this work?
Unfortunately, accessing nested fields is not possible without using a different transform.
If you want to use the built-in ones, you'd need to extract the after state before you can access its fields:
transforms=extractAfterState,createKey,extractInt
# Add these
transforms.extractAfterState.type=io.debezium.transforms.ExtractNewRecordState
# since you cannot get the ID from null events
transforms.extractAfterState.drop.tombstones=true
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.extractInt.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractInt.field=id

Does Kafka Connect provide data provenance?

I am new to Kafka Connect. I have used tools like NiFi for some time now. Those tools provide data provenance for auditing and other purposes, to understand what happened to a piece of data, but I couldn't find any similar feature in Kafka Connect. Does that feature exist for Kafka Connect? Or is there some way of handling data provenance in Kafka Connect so as to understand what happened to the data?
A CDC tool may help with your auditing needs; otherwise you will have to build your own logic using a single message transform (SMT) - see the sketch after the payload example below. For example, using a Debezium connector, this is what you will get as the message payload for every change event:
{
  "payload": {
    "before": null,
    "after": {
      "id": 1,
      "first_name": "7b789a503dc96805dc9f3dabbc97073b",
      "last_name": "8428d131d60d785175954712742994fa",
      "email": "68d0a7ccbd412aa4c1304f335b0edee8@example.com"
    },
    "source": {
      "version": "1.1.0.Final",
      "connector": "postgresql",
      "name": "localhost",
      "ts_ms": 1587303655422,
      "snapshot": "true",
      "db": "cdcdb",
      "schema": "cdc",
      "table": "customers",
      "txId": 2476,
      "lsn": 40512632,
      "xmin": null
    },
    "op": "c",
    "ts_ms": 1587303655424,
    "transaction": null
  }
}
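If lightweight provenance is enough (which topic, partition, and offset each record came through, and when), the stock InsertField transform can stamp that onto every record on the sink side. A minimal sketch, assuming it is added to a sink connector's configuration; the transform class and its property names are the built-in org.apache.kafka.connect.transforms.InsertField, while the field names and static value are placeholders:

transforms=provenance
transforms.provenance.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.provenance.topic.field=__source_topic
transforms.provenance.partition.field=__source_partition
transforms.provenance.offset.field=__source_offset
transforms.provenance.timestamp.field=__ingested_at
transforms.provenance.static.field=__pipeline
transforms.provenance.static.value=connect-cluster-1

Note that the offset field is only applicable to sink connectors.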

Kafka JDBC Source connector Insert or Update

I configured a Kafka JDBC source connector in order to push the changed records (inserts or updates) from a PostgreSQL database onto a Kafka topic.
I use "timestamp+incrementing" mode, and it seems to work fine.
I didn't configure a JDBC sink connector because I'm using a Kafka consumer that listens on the topic.
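For reference, the source configuration is assumed to look something like the following sketch (property names from the Confluent JDBC source connector; connection details and column names are placeholders, not taken from the question):

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:postgresql://localhost:5432/mydb",
  "table.whitelist": "author",
  "mode": "timestamp+incrementing",
  "incrementing.column.name": "id",
  "timestamp.column.name": "entity_modify_date",
  "topic.prefix": "jdbc-",
  "poll.interval.ms": 5000
}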
The message on the topic is JSON. Here is an example:
{
  "schema": {
    "type": "struct",
    "fields": [
      {
        "type": "int64",
        "optional": false,
        "field": "id"
      },
      {
        "type": "int64",
        "optional": true,
        "name": "org.apache.kafka.connect.data.Timestamp",
        "version": 1,
        "field": "entity_create_date"
      },
      {
        "type": "int64",
        "optional": true,
        "name": "org.apache.kafka.connect.data.Timestamp",
        "version": 1,
        "field": "entity_modify_date"
      },
      {
        "type": "int32",
        "optional": true,
        "field": "entity_version"
      },
      {
        "type": "string",
        "optional": true,
        "field": "firstname"
      },
      {
        "type": "string",
        "optional": true,
        "field": "lastname"
      }
    ],
    "optional": false,
    "name": "author"
  },
  "payload": {
    "id": 1,
    "entity_create_date": 1600287236682,
    "entity_modify_date": 1600287236682,
    "entity_version": 1,
    "firstname": "George",
    "lastname": "Orwell"
  }
}
As you can see, there is no information about whether this change was captured by the source connector because of an insert or an update.
I need this information. How can I solve this?
You can't get that information using the JDBC Source connector, unless you do something bespoke in the source schema and triggers.
This is one of the reasons why log-based CDC is generally a better way to get events from the source database; other reasons include:
capturing deletes
capturing the type of operation
capturing all events, not just what's there at the time when the connector polls.
For more details on the nuances of this, see this blog or a talk based on the same.
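For comparison, a minimal Debezium PostgreSQL source configuration along these lines (hostnames, credentials, and table names are placeholders) would emit an op field ("c", "u", "d") with every change event:

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "localhost",
  "database.port": "5432",
  "database.user": "postgres",
  "database.password": "secret",
  "database.dbname": "mydb",
  "database.server.name": "pgserver",
  "table.include.list": "public.author",
  "plugin.name": "pgoutput"
}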
Using a CDC-based approach as suggested by @Robin Moffatt may be the proper way to handle your requirement. Check out https://debezium.io/
However, looking at your table data, you could use "entity_create_date" and "entity_modify_date" in your consumer to determine whether the message is an insert or an update: if "entity_create_date" = "entity_modify_date" then it's an insert, else it's an update.
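A minimal consumer-side sketch of that check, assuming the JSON value shown above and Jackson for parsing (the class and method names are illustrative, not part of any connector API):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class InsertOrUpdate {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Treat the record as an insert when the create and modify timestamps match,
    // otherwise as an update (the heuristic described above).
    static boolean isInsert(String recordValue) throws Exception {
        JsonNode payload = MAPPER.readTree(recordValue).path("payload");
        return payload.path("entity_create_date").asLong()
                == payload.path("entity_modify_date").asLong();
    }

    public static void main(String[] args) throws Exception {
        String value = "{\"payload\":{\"id\":1,"
                + "\"entity_create_date\":1600287236682,"
                + "\"entity_modify_date\":1600287236682}}";
        System.out.println(isInsert(value) ? "insert" : "update");
    }
}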

MongoSource from Kafka connect create weird _data key

I'm using Kafka Connect - MongoSource with the following configuration:
curl -X PUT http://localhost:8083/connectors/mongo-source2/config -H "Content-Type: application/json" -d '{
  "name":"mongo-source2",
  "tasks.max":1,
  "connector.class":"com.mongodb.kafka.connect.MongoSourceConnector",
  "key.converter":"org.apache.kafka.connect.storage.StringConverter",
  "value.converter":"org.apache.kafka.connect.storage.StringConverter",
  "connection.uri":"mongodb://xxx:xxx@localhost:27017/mydb",
  "database":"mydb",
  "collection":"claimmappingrules.66667777-8888-9999-0000-666677770000",
  "pipeline":"[{\"$addFields\": {\"something\":\"xxxx\"} }]",
  "transforms":"dropTopicPrefix",
  "transforms.dropTopicPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.dropTopicPrefix.regex":".*",
  "transforms.dropTopicPrefix.replacement":"my-topic"
}'
For some reason, when I consume messages, I'm getting a weird key:
"_id": {
"_data": "825DFD2A53000000012B022C0100296E5A1004060C0FB7484A4990A7363EF5F662CF8D465A5F6964005A1003F9974744D06AFB498EF8D78370B0CD440004"
}
I have no idea where it came from. My Mongo document's _id is a UUID. When consuming messages, I expected to see the documentKey field as my consumer key.
Here is an example of a message the connector published into Kafka:
{
  "_id": {
    "_data": "825DFD2A53000000012B022C0100296E5A1004060C0FB7484A4990A7363EF5F662CF8D465A5F6964005A1003F9974744D06AFB498EF8D78370B0CD440004"
  },
  "operationType": "replace",
  "clusterTime": {
    "$timestamp": {
      "t": 1576872531,
      "i": 1
    }
  },
  "fullDocument": {
    "_id": {
      "$binary": "+ZdHRNBq+0mO+NeDcLDNRA==",
      "$type": "03"
    },
    ...
  },
  "ns": {
    "db": "security",
    "coll": "users"
  },
  "documentKey": {
    "_id": {
      "$binary": "+ZdHRNBq+0mO+NeDcLDNRA==",
      "$type": "03"
    }
  }
}
The documentation related to schemas for the Kafka Connect configuration is really limited out there. I know it is too late to reply, but lately I have also been facing the same issue and found the solution by trial and error.
I added these two configurations to my MongoDB Kafka Connect configuration:
"output.format.key": "schema",
"output.schema.key": "{\"name\":\"sampleId\",\"type\":\"record\",\"namespace\":\"com.mongoexchange.avro\",\"fields\":[{\"name\":\"documentKey._id\",\"type\":\"string\"}]}",
But even after this, I still don't know whether using the change stream's resume_token as the key for Kafka partition allotment has any significance in terms of performance, or what happens when the resume_token expires after a long period of inactivity.
P.S. - The final version of my Kafka Connect configuration for MongoDB as a source is this:
{
  "tasks.max": 1,
  "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "org.apache.kafka.connect.storage.StringConverter",
  "connection.uri": "mongodb://example-mongodb-0:27017,example-mongodb-1:27017,example-mongodb-2:27017/?replicaSet=replicaSet",
  "database": "exampleDB",
  "collection": "exampleCollection",
  "output.format.key": "schema",
  "output.schema.key": "{\"name\":\"ClassroomId\",\"type\":\"record\",\"namespace\":\"com.mongoexchange.avro\",\"fields\":[{\"name\":\"documentKey._id\",\"type\":\"string\"}]}",
  "change.stream.full.document": "updateLookup",
  "copy.existing": "true",
  "topic.prefix": "mongodb"
}