I am new to Kafka Connect. I have used tools like NiFi for some time now, and those tools provide data provenance for auditing and for understanding what happened to a piece of data. I couldn't find a similar feature in Kafka Connect. Does that feature exist for Kafka Connect, or is there some way of handling data provenance in Kafka Connect so I can understand what happened to the data?
A CDC tool may help with your auditing needs; otherwise you will have to build custom logic using a single message transform (SMT). For example, with the Debezium connector, this is the message payload you get for every change event:
{
  "payload": {
    "before": null,
    "after": {
      "id": 1,
      "first_name": "7b789a503dc96805dc9f3dabbc97073b",
      "last_name": "8428d131d60d785175954712742994fa",
      "email": "68d0a7ccbd412aa4c1304f335b0edee8@example.com"
    },
    "source": {
      "version": "1.1.0.Final",
      "connector": "postgresql",
      "name": "localhost",
      "ts_ms": 1587303655422,
      "snapshot": "true",
      "db": "cdcdb",
      "schema": "cdc",
      "table": "customers",
      "txId": 2476,
      "lsn": 40512632,
      "xmin": null
    },
    "op": "c",
    "ts_ms": 1587303655424,
    "transaction": null
  }
}
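If you need to carry provenance or audit metadata on each record yourself, one option is the built-in InsertField SMT. A minimal sketch (the field names and the static value below are only illustrative), added to the connector configuration:
"transforms": "provenance",
"transforms.provenance.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.provenance.static.field": "audit_source",
"transforms.provenance.static.value": "postgres-cdc-connector",
"transforms.provenance.timestamp.field": "audit_ingested_at"
This stamps every record value with a fixed origin label and the record timestamp; anything richer (full lineage, per-hop history) would need a custom SMT or a stream processor.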
I am producing simple plain-text, JSON-like data to Kafka with the plain kafka-console-producer command, and I want to sink this data to a database table. I have tried many ways to do this, but I always get a deserializer error or an "unknown magic byte" error.
There is no serialization or schema validation on the producer side, but the data is always of the same type.
We also can't change the producer configs to add a serializer.
Schema:
{
  "type": "record",
  "name": "people",
  "namespace": "com.cena",
  "doc": "This is a sample Avro schema to get you started. Please edit",
  "fields": [
    { "name": "first_name", "type": "string", "default": null },
    { "name": "last_name", "type": "string", "default": null },
    { "name": "town", "type": "string", "default": null },
    { "name": "country_code", "type": "string", "default": null },
    { "name": "mobile_number", "type": "string", "default": null }
  ]
}
Connector:
{
  "name": "JdbcSinkConnecto",
  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
  "table.name.format": "people",
  "topics": "people",
  "tasks.max": "1",
  "transforms": "RenameField",
  "transforms.RenameField.renames": "\"town:city,mobile_number:msisdn\"",
  "transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "connection.url": "jdbc:postgresql://localhost:5432/postgres",
  "connection.password": "postgres",
  "connection.user": "postgres",
  "insert.mode": "insert",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "key.converter.schema.registry.url": "http://http://localhost:8081"
}
Data sample:
{"first_name": "some_name","last_name": "Family","town": "some_city","country_code": "+01","mobile_number": "some_number"}
Is there a way to use Kafka Connect for this?
"with simple kafka-console-producer"
That doesn't use Avro, so I'm not sure why you added an Avro schema to the question.
You also don't show your value.converter setting, so it's unclear whether the data on the topic is truly JSON or Avro...
You are required to add a schema to the data for the JDBC sink. If you use plain JSON and kafka-console-producer, then you need data that looks like {"schema": ... , "payload": { your data here } }, and you need value.converter.schemas.enable=true when the value.converter is the JsonConverter.
ref. Converters and Serializers Deep Dive
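For your example record, the JSON-with-schema envelope would look roughly like this (pretty-printed here for readability; kafka-console-producer treats every line as one message, so it has to be sent as a single line):
{
  "schema": {
    "type": "struct",
    "name": "people",
    "optional": false,
    "fields": [
      { "type": "string", "optional": true, "field": "first_name" },
      { "type": "string", "optional": true, "field": "last_name" },
      { "type": "string", "optional": true, "field": "town" },
      { "type": "string", "optional": true, "field": "country_code" },
      { "type": "string", "optional": true, "field": "mobile_number" }
    ]
  },
  "payload": {
    "first_name": "some_name",
    "last_name": "Family",
    "town": "some_city",
    "country_code": "+01",
    "mobile_number": "some_number"
  }
}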
If you want to use Avro, then use kafka-avro-console-producer ... It still accepts JSON input, but serializes it to Avro (and will fix your magic-byte error).
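For example (the schema file name is a placeholder, and flag names can vary slightly between Confluent Platform versions):
kafka-avro-console-producer \
  --broker-list localhost:9092 \
  --topic people \
  --property schema.registry.url=http://localhost:8081 \
  --property value.schema="$(cat people.avsc)"
Each line you then type (e.g. your data sample) is serialized as Avro against that schema. Note that your schema declares "default": null on plain "string" fields; Avro may reject that, since a null default normally requires a union type such as ["null", "string"].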
Another option would be to use ksqlDB to first parse the JSON into a defined STREAM with typed and named fields; then you can run the connector from ksqlDB in embedded mode.
By the way, StringConverter does not use the Schema Registry, so remove the schema.registry.url property for it... And if you do want to use a registry, don't put http:// twice.
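Putting that together, the converter section of your sink config would look roughly like one of the following, depending on what actually ends up on the topic (the registry URL here is assumed to be a local one):
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081"
or, for JSON with an embedded schema:
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true"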
I want to rename the value of the field "id". I have a use case where a producer publishes messages to a topic (product-topic) containing either "id": "test.product.mobile" or "id": "test.product.computer".
My HTTP sink connector consumes the messages from this topic, and I want to transform them by renaming the field's value.
For example:
if the producer sends "id": "test.product.mobile", I want it replaced with "id": "test.product.iPhone"
if the producer sends "id": "test.product.computer", I want it replaced with "id": "test.product.mac"
I'm using the HTTP sink connector and the bundled transforms to replace the field value, but it's not working as expected. Please find the connector configuration below:
{
  "connector.class": "io.confluent.connect.http.HttpSinkConnector",
  "confluent.topic.bootstrap.servers": "localhost:9092",
  "topics": "product-topic",
  "tasks.max": "1",
  "http.api.url": "http://localhost:8080/product/create",
  "reporter.bootstrap.servers": "localhost:9092",
  "reporter.error.topic.name": "error-responses",
  "reporter.result.topic.name": "success-responses",
  "reporter.error.topic.replication.factor": "1",
  "confluent.topic.replication.factor": "1",
  "errors.tolerance": "all",
  "value.converter.schemas.enable": "false",
  "batch.json.as.array": "true",
  "name": "Product-Connector",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "reporter.result.topic.replication.factor": "1",
  "transforms": "RenameField",
  "transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Key",
  "transforms.RenameField.renames": "id:test.product.iPhone"
}
The producer sends messages like the ones below:
{
  "id": "test.product.mobile",
  "price": "1232"
}
{
  "id": "test.product.computer",
  "price": "2032"
}
Expected Output:
{
  "id": "test.product.iPhone",
  "price": "1232"
}
{
  "id": "test.product.mac",
  "price": "2032"
}
I referred to the Confluent Kafka docs for renaming a field, but that example only covers replacing the field name, not its value. Can someone please help me with this use case: what needs to change to rename the field's value?
Thanks in advance!
No, it's not possible to replace field value text with any of the included SMTs, outside of masking.
You could write (or find) your own SMT, but otherwise, the recommended pattern for this is a KStreams/ksqlDB process.
Or, simply have your initial Kafka producer send the values that you want to sink to the HTTP server.
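For reference, a bare-bones custom SMT for this could look something like the sketch below. It is only an illustration, not a drop-in solution: the package/class names and the hard-coded mapping are made up, and it assumes schemaless JSON values (value.converter.schemas.enable=false), so the record value arrives as a Map.
package com.example.smt; // hypothetical package

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

// Rewrites the value of the "id" field, e.g. "test.product.mobile" -> "test.product.iPhone".
public class RenameIdValue<R extends ConnectRecord<R>> implements Transformation<R> {

    private static final Map<String, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put("test.product.mobile", "test.product.iPhone");
        MAPPING.put("test.product.computer", "test.product.mac");
    }

    @Override
    public R apply(R record) {
        Object value = record.value();
        if (!(value instanceof Map)) {
            return record; // pass through anything we don't understand
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> updated = new HashMap<>((Map<String, Object>) value);
        Object id = updated.get("id");
        if (id instanceof String && MAPPING.containsKey(id)) {
            updated.put("id", MAPPING.get(id));
        }
        // Copy the record, swapping in the modified value map
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                record.valueSchema(), updated,
                record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // no custom configuration in this sketch
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // nothing to configure
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}
After packaging it as a JAR and placing it on the Connect worker's plugin path, you would reference it from the connector with "transforms.RenameField.type": "com.example.smt.RenameIdValue" instead of the ReplaceField transform.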
I have tried using the SMT configuration ValueToKey and ExtractField$Key on the following CDC JSON data, but because the id field is nested, I get an error saying the field is not recognized. How can I make the nested (internal) fields accessible?
"before": null,
"after": {
"id": 4,
"salary": 5000
},
"source": {
"version": "1.5.0.Final",
"connector": "mysql",
"name": "Try-",
"ts_ms": 1623834752000,
"snapshot": "false",
"db": "mysql_db",
"sequence": null,
"table": "EmpSalary",
"server_id": 1,
"gtid": null,
"file": "binlog.000004",
"pos": 374,
"row": 0,
"thread": null,
"query": null
},
"op": "c",
"ts_ms": 1623834752982,
"transaction": null
}
Configuration Used:
transforms=createKey,extractInt
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.extractInt.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractInt.field=id
key.converter.schemas.enable=false
value.converter.schemas.enable=false
How could I make this work with these transformations and changes in the properties file?
Unfortunately, accessing nested fields is not possible without using a different transform.
If you want to use the built-in ones, you'd need to extract the after state before you can access its fields:
transforms=extractAfterState,createKey,extractInt
# Add these
transforms.extractAfterState.type=io.debezium.transforms.ExtractNewRecordState
# since you cannot get the ID from null events
transforms.extractAfterState.drop.tombstones=true
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.extractInt.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractInt.field=id
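Once ExtractNewRecordState has run, the Debezium envelope is unwrapped to just the after state, so the later transforms can see id at the top level. For your sample event the record value becomes roughly:
{
  "id": 4,
  "salary": 5000
}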
I configured a Kafka Connect JDBC source connector to push records changed (inserted or updated) in a PostgreSQL database onto a Kafka topic.
I use "timestamp+incrementing" mode, and it seems to work fine.
I didn't configure a JDBC sink connector because I'm using a Kafka consumer that listens on the topic.
The message on the topic is JSON. This is an example:
{
  "schema": {
    "type": "struct",
    "fields": [
      { "type": "int64", "optional": false, "field": "id" },
      { "type": "int64", "optional": true, "name": "org.apache.kafka.connect.data.Timestamp", "version": 1, "field": "entity_create_date" },
      { "type": "int64", "optional": true, "name": "org.apache.kafka.connect.data.Timestamp", "version": 1, "field": "entity_modify_date" },
      { "type": "int32", "optional": true, "field": "entity_version" },
      { "type": "string", "optional": true, "field": "firstname" },
      { "type": "string", "optional": true, "field": "lastname" }
    ],
    "optional": false,
    "name": "author"
  },
  "payload": {
    "id": 1,
    "entity_create_date": 1600287236682,
    "entity_modify_date": 1600287236682,
    "entity_version": 1,
    "firstname": "George",
    "lastname": "Orwell"
  }
}
As you can see, there is no information about whether this change was captured by the source connector because of an insert or an update.
I need this information. How can I solve this?
You can't get that information using the JDBC Source connector, unless you do something bespoke in the source schema and triggers.
This is one of the reasons why log-based CDC is generally a better way to get events from the source database; other benefits include:
capturing deletes
capturing the type of operation
capturing all events, not just the state that happens to be there when the connector polls.
For more details on the nuances of this see this blog or a talk based on the same.
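For illustration only (host, credentials and table names below are placeholders, and some property names differ between Debezium versions, e.g. table.include.list was table.whitelist in older releases), a Debezium PostgreSQL source connector is configured roughly like this, and every change event it emits carries before/after state plus an op field (c, u, d) identifying the operation:
{
  "name": "postgres-cdc-source",
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "localhost",
  "database.port": "5432",
  "database.user": "postgres",
  "database.password": "postgres",
  "database.dbname": "postgres",
  "database.server.name": "pgserver1",
  "table.include.list": "public.author"
}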
Using a CDC-based approach, as suggested by @Robin Moffatt, may be the proper way to handle your requirement. Check out https://debezium.io/
However, looking at your table data, you could use "entity_create_date" and "entity_modify_date" in your consumer to determine whether a message is an insert or an update: if "entity_create_date" = "entity_modify_date" then it's an insert, else it's an update.
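A minimal sketch of that check, assuming the consumer has already deserialized the payload section of the message into a Map (the class and method names here are made up):
import java.util.Map;

// Hypothetical consumer-side helper; field names come from the JDBC source message above.
public final class ChangeTypeDetector {

    private ChangeTypeDetector() {
    }

    // Returns true when the record looks like an insert, false when it looks like an update.
    public static boolean isInsert(Map<String, Object> payload) {
        long created = ((Number) payload.get("entity_create_date")).longValue();
        long modified = ((Number) payload.get("entity_modify_date")).longValue();
        return created == modified; // equal timestamps -> insert, otherwise update
    }
}
This of course relies on the application always maintaining both columns correctly, which is exactly the kind of assumption a log-based CDC approach avoids.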
When reading a Kafka topic that contains lots of CDC events produced by Kafka Connect using Debezium, where the data source is a MongoDB collection with a TTL, I saw that some of the CDC events are null; they appear in between the deletion events. What does that really mean?
As I understand it, all CDC events should have the CDC event structure, even the deletion events, so why are there events with a null value?
null,
{
  "after": null,
  "patch": null,
  "source": {
    "version": "0.9.3.Final",
    "connector": "mongodb",
    "name": "test",
    "rs": "rs1",
    "ns": "testestest",
    "sec": 1555060472,
    "ord": 297,
    "h": 1196279425766381600,
    "initsync": false
  },
  "op": "d",
  "ts_ms": 1555060472177
},
null,
{
  "after": null,
  "patch": null,
  "source": {
    "version": "0.9.3.Final",
    "connector": "mongodb",
    "name": "test",
    "rs": "rs1",
    "ns": "testestest",
    "sec": 1555060472,
    "ord": 298,
    "h": -2199232943406075600,
    "initsync": false
  },
  "op": "d",
  "ts_ms": 1555060472177
}
I use https://debezium.io/docs/connectors/mongodb/ without flattening any event, and use the config as follows:
{
  "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
  "mongodb.hosts": "live.xxx.xxx:27019",
  "mongodb.name": "testmongodb",
  "collection.whitelist": "testest",
  "tasks.max": 4,
  "snapshot.mode": "never",
  "poll.interval.ms": 15000
}
These are so-called tombstone events, which allow log compaction to clean up deleted records - see https://kafka.apache.org/documentation/#compaction, which explains:
Compaction also allows for deletes. A message with a key and a null payload will be treated as a delete from the log. This delete marker will cause any prior message with that key to be removed (as would any new message with that key), but delete markers are special in that they will themselves be cleaned out of the log after a period of time to free up space. The point in time at which deletes are no longer retained is marked as the "delete retention point" in the above diagram.
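If the tombstones get in your way, the usual options are to skip null-valued records in the consumer, or to stop emitting them at the source with Debezium's tombstones.on.delete setting, if your connector version supports it (verify the property name against your version). Your config would then look roughly like:
{
  "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
  "mongodb.hosts": "live.xxx.xxx:27019",
  "mongodb.name": "testmongodb",
  "collection.whitelist": "testest",
  "tasks.max": 4,
  "snapshot.mode": "never",
  "poll.interval.ms": 15000,
  "tombstones.on.delete": "false"
}
Keep in mind that without tombstones, log compaction can no longer purge the deleted keys from the topic.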