Kafka Connect - JDBC Avro connector: how to define a custom schema registry - apache-kafka

I was following a tutorial on Kafka Connect, and I am wondering whether it is possible to define a custom schema registry for a topic whose data comes from a MySQL table.
I can't find where to define it in my JSON connector config, and I don't want to create a new version of that schema after it has been created.
My MySQL table, called stations, has this schema:
Field | Type
---------------+-------------
code | varchar(4)
date_measuring | timestamp
attributes | varchar(256)
where attributes contains JSON data rather than a plain string (I have to use that type because the JSON fields in attributes are variable).
My connector config is:
{
"value.converter.schema.registry.url": "http://localhost:8081",
"_comment": "The Kafka topic will be made up of this prefix, plus the table name ",
"key.converter.schema.registry.url": "http://localhost:8081",
"name": "jdbc_source_mysql_stations",
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"transforms": [
"ValueToKey"
],
"transforms.ValueToKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.ValueToKey.fields": [
"code",
"date_measuring"
],
"connection.url": "jdbc:mysql://localhost:3306/db_name?useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC",
"connection.user": "confluent",
"connection.password": "**************",
"table.whitelist": [
"stations"
],
"mode": "timestamp",
"timestamp.column.name": [
"date_measuring"
],
"validate.non.null": "false",
"topic.prefix": "mysql-"
}
and it creates this schema:
{
"subject": "mysql-stations-value",
"version": 1,
"id": 23,
"schema": "{\"type\":\"record\",\"name\":\"stations\",\"fields\":[{\"name\":\"code\",\"type\":\"string\"},{\"name\":\"date_measuring\",\"type\":{\"type\":\"long\",\"connect.version\":1,\"connect.name\":\"org.apache.kafka.connect.data.Timestamp\",\"logicalType\":\"timestamp-millis\"}},{\"name\":\"attributes\",\"type\":\"string\"}],\"connect.name\":\"stations\"}"
}
where the "attributes" field is, of course, a string.
Instead, I would like to apply this other schema:
{
"fields": [
{
"name": "code",
"type": "string"
},
{
"name": "date_measuring",
"type": {
"connect.name": "org.apache.kafka.connect.data.Timestamp",
"connect.version": 1,
"logicalType": "timestamp-millis",
"type": "long"
}
},
{
"name": "attributes",
"type": {
"type": "record",
"name": "AttributesRecord",
"fields": [
{
"name": "H1",
"type": "long",
"default": 0
},
{
"name": "H2",
"type": "long",
"default": 0
},
{
"name": "H3",
"type": "long",
"default": 0
},
{
"name": "H",
"type": "long",
"default": 0
},
{
"name": "Q",
"type": "long",
"default": 0
},
{
"name": "P1",
"type": "long",
"default": 0
},
{
"name": "P2",
"type": "long",
"default": 0
},
{
"name": "P3",
"type": "long",
"default": 0
},
{
"name": "P",
"type": "long",
"default": 0
},
{
"name": "T",
"type": "long",
"default": 0
},
{
"name": "Hr",
"type": "long",
"default": 0
},
{
"name": "pH",
"type": "long",
"default": 0
},
{
"name": "RX",
"type": "long",
"default": 0
},
{
"name": "Ta",
"type": "long",
"default": 0
},
{
"name": "C",
"type": "long",
"default": 0
},
{
"name": "OD",
"type": "long",
"default": 0
},
{
"name": "TU",
"type": "long",
"default": 0
},
{
"name": "MO",
"type": "long",
"default": 0
},
{
"name": "AM",
"type": "long",
"default": 0
},
{
"name": "N03",
"type": "long",
"default": 0
},
{
"name": "P04",
"type": "long",
"default": 0
},
{
"name": "SS",
"type": "long",
"default": 0
},
{
"name": "PT",
"type": "long",
"default": 0
}
]
}
}
],
"name": "stations",
"namespace": "com.mycorp.mynamespace",
"type": "record"
}
Any suggestion please?
If it's not possible, I suppose I'll have to use Kafka Streams to write to another topic, though I would rather avoid that.
Thanks in advance!

I don't think you're asking about using a "custom" registry (which you'd configure with the two schema.registry.url lines that say which registry you're using), but rather how to parse the data and apply a schema after the record is pulled from the database.
The main options are to write your own Transform or to use Kafka Streams. There is a SetSchemaMetadata transform, but I'm not sure it will do what you want (parse a string into an Avro record).
Alternatively, if you must store JSON data in a single database column, maybe you shouldn't use MySQL and should instead use a document database, which has more flexible data constraints.
Otherwise, you can use BLOB rather than varchar and put binary Avro data into that column, but then you'd still need a custom deserializer to read the data.
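To make the "parse the string and apply a schema" step concrete, here is a minimal sketch in Python of the logic a custom Transform or a Kafka Streams job would have to implement. It is not Kafka Connect code: the function name expand_attributes and the dict-based row are assumptions for illustration. The field names and the default of 0 are taken from the target schema in the question.

```python
import json

# Field names of AttributesRecord in the target schema from the question.
ATTRIBUTE_FIELDS = [
    "H1", "H2", "H3", "H", "Q", "P1", "P2", "P3", "P", "T", "Hr", "pH",
    "RX", "Ta", "C", "OD", "TU", "MO", "AM", "N03", "P04", "SS", "PT",
]


def expand_attributes(row):
    """Return a copy of `row` whose "attributes" JSON string is replaced by
    a dict with every expected field, using 0 where the JSON omits one.

    Hypothetical helper: a real Transform would build a Connect Struct
    against a fixed schema instead of a plain dict.
    """
    raw = row.get("attributes")
    parsed = json.loads(raw) if raw else {}
    attributes = {name: parsed.get(name, 0) for name in ATTRIBUTE_FIELDS}
    return {**row, "attributes": attributes}


row = {"code": "ST01", "date_measuring": 1700000000000,
       "attributes": '{"H1": 5, "pH": 7}'}
expanded = expand_attributes(row)
print(expanded["attributes"]["H1"], expanded["attributes"]["Q"])
```

Because every record ends up with the same fixed set of fields, the value schema stays stable and no new schema version is needed when a row omits some attributes.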

Related

Add MySQL column comment as a metadata to Avro schema through the Debezium connector

Kafka Connect is used through the Confluent platform and io.debezium.connector.mysql.MySqlConnector is used as the Debezium connector.
In my case, MySQL table includes columns with sensitive data and these columns must be tagged as sensitive for further use.
SHOW FULL COLUMNS FROM astronauts;
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
| orderid | int | NULL | YES | | NULL | | select,insert,update,references | |
| name | varchar(100) | utf8mb4_0900_ai_ci | NO | | NULL | | select,insert,update,references | sensitive column |
+---------+--------------+--------------------+------+-----+---------+-------+---------------------------------+------------------+
Notice the MySQL comment for the name column.
Based on this table, I would like to have this Avro schema in the Schema registry:
{
"connect.name": "dbserver1.inventory.astronauts.Envelope",
"connect.version": 1,
"fields": [
{
"default": null,
"name": "before",
"type": [
"null",
{
"connect.name": "dbserver1.inventory.astronauts.Value",
"fields": [
{
"default": null,
"name": "orderid",
"type": [
"null",
"int"
]
},
{
"name": "name",
"type": {
"MY_CUSTOM_ATTRIBUTE": "sensitive column",
"type": "string"
}
}
],
"name": "Value",
"type": "record"
}
]
},
{
"default": null,
"name": "after",
"type": [
"null",
"Value"
]
},
{
"name": "source",
"type": {
"connect.name": "io.debezium.connector.mysql.Source",
"fields": [
{
"name": "version",
"type": "string"
},
{
"name": "connector",
"type": "string"
},
{
"name": "name",
"type": "string"
},
{
"name": "ts_ms",
"type": "long"
},
{
"default": "false",
"name": "snapshot",
"type": [
{
"connect.default": "false",
"connect.name": "io.debezium.data.Enum",
"connect.parameters": {
"allowed": "true,last,false,incremental"
},
"connect.version": 1,
"type": "string"
},
"null"
]
},
{
"name": "db",
"type": "string"
},
{
"default": null,
"name": "sequence",
"type": [
"null",
"string"
]
},
{
"default": null,
"name": "table",
"type": [
"null",
"string"
]
},
{
"name": "server_id",
"type": "long"
},
{
"default": null,
"name": "gtid",
"type": [
"null",
"string"
]
},
{
"name": "file",
"type": "string"
},
{
"name": "pos",
"type": "long"
},
{
"name": "row",
"type": "int"
},
{
"default": null,
"name": "thread",
"type": [
"null",
"long"
]
},
{
"default": null,
"name": "query",
"type": [
"null",
"string"
]
}
],
"name": "Source",
"namespace": "io.debezium.connector.mysql",
"type": "record"
}
},
{
"name": "op",
"type": "string"
},
{
"default": null,
"name": "ts_ms",
"type": [
"null",
"long"
]
},
{
"default": null,
"name": "transaction",
"type": [
"null",
{
"connect.name": "event.block",
"connect.version": 1,
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "total_order",
"type": "long"
},
{
"name": "data_collection_order",
"type": "long"
}
],
"name": "block",
"namespace": "event",
"type": "record"
}
]
}
],
"name": "Envelope",
"namespace": "dbserver1.inventory.astronauts",
"type": "record"
}
Notice the custom schema field named MY_CUSTOM_ATTRIBUTE.
Debezium 2.0 supports schema doc from column comments [DBZ-5489]; however, I personally think the doc field attribute is not appropriate, since:
any implementation of a schema registry or system that processes the schemas is free to drop those fields when encoding/decoding and its fully spec compliant
Additionally, the doc field is solely intended to provide information to a user of the schema and is not intended as a form of metadata that downstream programs can rely on
source: https://avro.apache.org/docs/1.10.2/spec.html#Schema+Resolution
Based on the Avro schema docs, custom attributes for Avro schemas are allowed and these attributes are known as metadata:
A JSON object, of the form:
{"type": "typeName" ...attributes...}
where typeName is either a primitive or derived type name, as defined below. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.
source: https://avro.apache.org/docs/1.10.2/spec.html#schemas
I think Debezium transformations might be a solution; however, I have the following problems:
I have no idea how to get MySQL column comments in my custom transformation.
As far as I know, org.apache.kafka.connect.data.SchemaBuilder does not allow adding custom attributes, just doc and the predefined fields.
Here are several native transformations for reference: https://github.com/apache/kafka/tree/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/
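For the downstream side of this, here is a minimal sketch in Python of how a consumer could pick up such a custom attribute from the registered Avro schema JSON. It does not address how to get the attribute into the schema in the first place. The function name find_tagged_fields is an assumption, and the schema below is a trimmed copy of the one in the question.

```python
def find_tagged_fields(schema, attribute, path=""):
    """Walk an Avro schema (parsed JSON) and return {field path: value}
    for every field type that carries the given custom attribute.

    Named-type references such as "Value" are plain strings and are not
    followed; this is a sketch, not a full schema resolver.
    """
    found = {}
    if isinstance(schema, list):  # union: inspect every branch
        for branch in schema:
            found.update(find_tagged_fields(branch, attribute, path))
    elif isinstance(schema, dict):
        if attribute in schema:
            found[path] = schema[attribute]
        for field in schema.get("fields", []):
            child = f"{path}.{field['name']}" if path else field["name"]
            found.update(find_tagged_fields(field["type"], attribute, child))
    return found


# Trimmed version of the desired schema from the question.
envelope = {
    "type": "record",
    "name": "Envelope",
    "fields": [
        {"name": "before", "default": None, "type": ["null", {
            "type": "record",
            "name": "Value",
            "fields": [
                {"name": "orderid", "default": None, "type": ["null", "int"]},
                {"name": "name", "type": {
                    "type": "string",
                    "MY_CUSTOM_ATTRIBUTE": "sensitive column"}},
            ],
        }]},
        {"name": "after", "default": None, "type": ["null", "Value"]},
    ],
}

tagged = find_tagged_fields(envelope, "MY_CUSTOM_ATTRIBUTE")
print(tagged)
```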

How can I save Kafka message Key in document for MongoDB Sink?

Right now I have a MongoDB Sink and it saves the value of incoming AVRO messages correctly.
I need it to save the Kafka Message Key in the document.
I have tried org.apache.kafka.connect.transforms.HoistField$Key in order to add the key to the value that is being saved, but this did nothing. It did work when using ProvidedInKeyStrategy, but I don't want my _id to be the Kafka message Key.
My configuration:
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"connection.uri": "mongodb://mongo1",
"database": "mongodb",
"collection": "sink",
"topics": "topics.foo",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"transforms": "hoistKey",
"transforms.hoistKey.type":"org.apache.kafka.connect.transforms.HoistField$Key",
"transforms.hoistKey.field":"kafkaKey"
}
Kafka message schema:
{
"type": "record",
"name": "Smoketest",
"namespace": "some_namespace",
"fields": [
{
"name": "timestamp",
"type": "int",
"logicalType": "timestamp-millis"
}
]
}
Kafka key schema:
[
{
"type": "enum",
"name": "EnvironmentType",
"namespace": "some_namespace",
"doc": "DEV",
"symbols": [
"Dev",
"Test",
"Accept",
"Sandbox",
"Prod"
]
},
{
"type": "record",
"name": "Key",
"namespace": "some_namespace",
"doc": "The standard Key type that is used as key",
"fields": [
{
"name": "conversation_id",
"doc": "The first system producing an event sets this field",
"type": "string"
},
{
"name": "broker_key",
"doc": "The key of the broker",
"type": "string"
},
{
"name": "user_id",
"doc": "User identification",
"type": [
"null",
"string"
]
},
{
"name": "application",
"doc": "The name of the application",
"type": [
"null",
"string"
]
},
{
"name": "environment",
"doc": "The type of environment",
"type": "type.EnvironmentType"
}
]
}
]
Using https://github.com/f0xdx/kafka-connect-wrap-smt I can now wrap all the data from the kafka message into a single document to save in my mongodb sink.
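To illustrate why HoistField$Key had no visible effect while a wrapping transform does, here is a rough Python sketch. It models a record as a (key, value) pair. HoistField$Key only nests the key under a new field name and leaves the value, which is what the sink stores, untouched. The wrap function and its "key" and "value" output field names are assumptions for illustration, not the actual output layout of kafka-connect-wrap-smt.

```python
def hoist_key(record, field):
    """Mimic HoistField$Key: nest the key under `field`; value unchanged."""
    key, value = record
    return ({field: key}, value)


def wrap(record):
    """Mimic a wrap-style transform: copy key and value into the new value.

    The "key"/"value" field names are hypothetical.
    """
    key, value = record
    return (key, {"key": key, "value": value})


record = ({"conversation_id": "c1", "broker_key": "b1"}, {"timestamp": 1})

hoisted = hoist_key(record, "kafkaKey")
wrapped = wrap(record)

print(hoisted[1])  # the value the sink would store is unchanged
print(wrapped[1])  # the value now also carries the key
```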

Can I use a single s3-sink connector with the same timestamp field name when different topics use different Avro schemas?

schema for topic t1
{
"type": "record",
"name": "Envelope",
"namespace": "t1",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "Value",
"fields": [
{
"name": "id",
"type": {
"type": "long",
"connect.default": 0
},
"default": 0
},
{
"name": "createdAt",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
}
],
"connect.name": "t1.Value"
}
],
"default": null
},
{
"name": "after",
"type": [
"null",
"Value"
],
"default": null
}
],
"connect.name": "t1.Envelope"
}
schema for topic t2
{
"type": "record",
"name": "Value",
"namespace": "t2",
"fields": [
{
"name": "id",
"type": {
"type": "long",
"connect.default": 0
},
"default": 0
},
{
"name": "createdAt",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
}
],
"connect.name": "t2.Value"
}
s3-sink Connector configuration
connector.class=io.confluent.connect.s3.S3SinkConnector
behavior.on.null.values=ignore
s3.region=us-west-2
partition.duration.ms=1000
flush.size=1
tasks.max=3
timezone=UTC
topics.regex=t1,t2
aws.secret.access.key=******
locale=US
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
value.converter.schemas.enable=false
name=s3-sink-connector
aws.access.key.id=******
errors.tolerance=all
value.converter=org.apache.kafka.connect.json.JsonConverter
storage.class=io.confluent.connect.s3.storage.S3Storage
key.converter=org.apache.kafka.connect.storage.StringConverter
s3.bucket.name=s3-sink-connector-bucket
path.format=YYYY/MM/dd
timestamp.extractor=RecordField
timestamp.field=after.createdAt
With this connector configuration I get an error for the t2 topic: "createdAt field does not exist".
If I set timestamp.field=createdAt, then the same error, "createdAt field does not exist", is thrown for the t1 topic.
How can I point to the "createdAt" field in both schemas at the same time, using the same connector for both?
Is it possible to achieve this with a single s3-sink connector configuration?
If so, which properties do I have to use?
If there is any other way to do this, please suggest that as well.
All topics need the same timestamp field; there's no way to configure topic-to-field mappings.
Your t2 schema doesn't have an after field, so you need to run two separate connectors.
The field is also required to be present in all records; otherwise the partitioner won't work.
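One way to keep two connectors manageable is to generate both configurations from a shared base, so that only the name, topic, and timestamp.field differ. A minimal sketch in Python follows. The connector names s3-sink-t1 and s3-sink-t2 are made up for illustration, and the base is a subset of the properties from the question.

```python
import json

# Shared settings, copied from the connector configuration in the question.
BASE_CONFIG = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "partitioner.class":
        "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "s3.bucket.name": "s3-sink-connector-bucket",
    "s3.region": "us-west-2",
    "path.format": "YYYY/MM/dd",
    "partition.duration.ms": "1000",
    "timestamp.extractor": "RecordField",
    "locale": "US",
    "timezone": "UTC",
}


def connector_config(name, topic, timestamp_field):
    """Build one connector's config; only three properties vary."""
    return {**BASE_CONFIG,
            "name": name,
            "topics": topic,
            "timestamp.field": timestamp_field}


# Hypothetical connector names; one connector per schema shape.
configs = [
    connector_config("s3-sink-t1", "t1", "after.createdAt"),
    connector_config("s3-sink-t2", "t2", "createdAt"),
]

for config in configs:
    print(json.dumps(config, indent=2))
```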

debezium - schema registry issue

I'm using the AWS schema registry with Debezium.
In the Debezium config I set the server name to mysql-db01, so Debezium creates a topic with this server name to hold metadata about the server and schema changes.
When I deployed the connector, I got a schema like this in the schema registry:
{
"type": "record",
"name": "SchemaChangeKey",
"namespace": "io.debezium.connector.mysql",
"fields": [
{
"name": "databaseName",
"type": "string"
}
],
"connect.name": "io.debezium.connector.mysql.SchemaChangeKey"
}
Then another version was immediately created, like this:
{
"type": "record",
"name": "SchemaChangeValue",
"namespace": "io.debezium.connector.mysql",
"fields": [
{
"name": "source",
"type": {
"type": "record",
"name": "Source",
"fields": [
{
"name": "version",
"type": "string"
},
{
"name": "connector",
"type": "string"
},
{
"name": "name",
"type": "string"
},
{
"name": "ts_ms",
"type": "long"
},
{
"name": "snapshot",
"type": [
{
"type": "string",
"connect.version": 1,
"connect.parameters": {
"allowed": "true,last,false"
},
"connect.default": "false",
"connect.name": "io.debezium.data.Enum"
},
"null"
],
"default": "false"
},
{
"name": "db",
"type": "string"
},
{
"name": "sequence",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "table",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "server_id",
"type": "long"
},
{
"name": "gtid",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "file",
"type": "string"
},
{
"name": "pos",
"type": "long"
},
{
"name": "row",
"type": "int"
},
{
"name": "thread",
"type": [
"null",
"long"
],
"default": null
},
{
"name": "query",
"type": [
"null",
"string"
],
"default": null
}
],
"connect.name": "io.debezium.connector.mysql.Source"
}
},
{
"name": "databaseName",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "schemaName",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "ddl",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "tableChanges",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "Change",
"namespace": "io.debezium.connector.schema",
"fields": [
{
"name": "type",
"type": "string"
},
{
"name": "id",
"type": "string"
},
{
"name": "table",
"type": {
"type": "record",
"name": "Table",
"fields": [
{
"name": "defaultCharsetName",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "primaryKeyColumnNames",
"type": [
"null",
{
"type": "array",
"items": "string"
}
],
"default": null
},
{
"name": "columns",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "Column",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "jdbcType",
"type": "int"
},
{
"name": "nativeType",
"type": [
"null",
"int"
],
"default": null
},
{
"name": "typeName",
"type": "string"
},
{
"name": "typeExpression",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "charsetName",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "length",
"type": [
"null",
"int"
],
"default": null
},
{
"name": "scale",
"type": [
"null",
"int"
],
"default": null
},
{
"name": "position",
"type": "int"
},
{
"name": "optional",
"type": [
"null",
"boolean"
],
"default": null
},
{
"name": "autoIncremented",
"type": [
"null",
"boolean"
],
"default": null
},
{
"name": "generated",
"type": [
"null",
"boolean"
],
"default": null
}
],
"connect.name": "io.debezium.connector.schema.Column"
}
}
}
],
"connect.name": "io.debezium.connector.schema.Table"
}
}
],
"connect.name": "io.debezium.connector.schema.Change"
}
}
}
],
"connect.name": "io.debezium.connector.mysql.SchemaChangeValue"
}
These two schemas do not match (the first is the key schema, SchemaChangeKey, and the second is the value schema, SchemaChangeValue), so the AWS schema registry does not allow the connector to register the second version. But the second version is the actual schema for the connector.
To solve this issue, I deleted the schema in the schema registry, deleted the connector, and re-deployed the connector; then it worked.
But I'm trying to understand why the schema got two different versions the very first time.
I used the following key/value converters on the source and sink connectors to make it work.
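As a rough illustration of why the second schema would be rejected as a new version of the first, here is a minimal Python sketch of one backward-compatibility rule: a field that is new in the later schema needs a default. This assumes the registry applies a backward-style check. The function name is made up, and the field types are simplified because only names and defaults matter here.

```python
def fields_breaking_backward_compat(old_schema, new_schema):
    """Return names of fields that are new in `new_schema` and have no
    default, so data written with `old_schema` could not be read."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [field["name"] for field in new_schema["fields"]
            if field["name"] not in old_names and "default" not in field]


# First registered version: the key schema from the question.
key_schema = {
    "type": "record",
    "name": "SchemaChangeKey",
    "fields": [{"name": "databaseName", "type": "string"}],
}

# Second version: the value schema, with field types simplified.
value_schema = {
    "type": "record",
    "name": "SchemaChangeValue",
    "fields": [
        {"name": "source",
         "type": {"type": "record", "name": "Source", "fields": []}},
        {"name": "databaseName", "type": ["null", "string"], "default": None},
        {"name": "schemaName", "type": ["null", "string"], "default": None},
        {"name": "ddl", "type": ["null", "string"], "default": None},
        {"name": "tableChanges", "type": {"type": "array", "items": "string"}},
    ],
}

breaking = fields_breaking_backward_compat(key_schema, value_schema)
print(breaking)
```

Beyond the missing defaults, the two records also have different names, which suggests that the key and value schemas were registered under the same schema name rather than as two separate schemas.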
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"internal.key.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.key.converter.schemas.enable": "false",
"internal.value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.value.converter.schemas.enable": "false",
"value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.region": "ap-south-1",
"key.converter.schemaAutoRegistrationEnabled": "true",
"value.converter.schemaAutoRegistrationEnabled": "true",
"key.converter.avroRecordType": "GENERIC_RECORD",
"value.converter.avroRecordType": "GENERIC_RECORD",
"key.converter.registry.name": "bhuvi-debezium",
"value.converter.registry.name": "bhuvi-debezium",

org.apache.avro.AvroTypeException: Unknown union branch EventId

I am trying to convert JSON to Avro using 'kafka-avro-console-producer' and publish it to a Kafka topic.
I am able to do that for flat JSON/schemas, but for the schema and JSON given below I get the error
"org.apache.avro.AvroTypeException: Unknown union branch EventId".
Any help would be appreciated.
Schema :
{
"type": "record",
"name": "Envelope",
"namespace": "CoreOLTPEvents.dbo.Event",
"fields": [{
"name": "before",
"type": ["null", {
"type": "record",
"name": "Value",
"fields": [{
"name": "EventId",
"type": "long"
}, {
"name": "CameraId",
"type": ["null", "long"],
"default": null
}, {
"name": "SiteId",
"type": ["null", "long"],
"default": null
}],
"connect.name": "CoreOLTPEvents.dbo.Event.Value"
}],
"default": null
}, {
"name": "after",
"type": ["null", "Value"],
"default": null
}, {
"name": "op",
"type": "string"
}, {
"name": "ts_ms",
"type": ["null", "long"],
"default": null
}],
"connect.name": "CoreOLTPEvents.dbo.Event.Envelope"
}
And Json input is like below :
{
"before": null,
"after": {
"EventId": 12,
"CameraId": 10,
"SiteId": 11974
},
"op": "C",
"ts_ms": null
}
In my case I can't alter the schema; I can only alter the JSON so that it works.
If you are using the Avro JSON format, the input you have is slightly off. For unions, non-null values need to be specified such that the type information is listed: https://avro.apache.org/docs/current/spec.html#json_encoding
See below for an example which I think should work.
{
"before": null,
"after": {
"CoreOLTPEvents.dbo.Event.Value": {
"EventId": 12,
"CameraId": {
"long": 10
},
"SiteId": {
"long": 11974
}
}
},
"op": "C",
"ts_ms": null
}
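If the input has to be rewritten programmatically, the wrapping rule can be sketched in Python as below. This is a minimal sketch, not part of any Avro library: it only handles records, primitives, and two-branch ["null", X] unions, which is all the schema in the question uses, and the function names are made up. The schema is a copy of the question's schema with the connect.name attributes left out for brevity.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double",
              "bytes", "string"}


def qualify(name, namespace):
    """Return the full name of a named type."""
    return name if "." in name or not namespace else f"{namespace}.{name}"


def collect_records(schema, namespace, named):
    """Index every record by full name so references can be resolved."""
    if isinstance(schema, list):
        for branch in schema:
            collect_records(branch, namespace, named)
    elif isinstance(schema, dict) and schema.get("type") == "record":
        namespace = schema.get("namespace", namespace)
        named[qualify(schema["name"], namespace)] = (schema, namespace)
        for field in schema["fields"]:
            collect_records(field["type"], namespace, named)


def branch_name(branch, namespace):
    """Name used as the wrapper key for a non-null union branch."""
    if isinstance(branch, dict):
        if branch["type"] == "record":
            return qualify(branch["name"], branch.get("namespace", namespace))
        return branch["type"]
    return branch if branch in PRIMITIVES else qualify(branch, namespace)


def encode(value, schema, named, namespace):
    if isinstance(schema, list):  # union: null stays null, else wrap
        if value is None:
            return None
        branch = next(b for b in schema if b != "null")
        return {branch_name(branch, namespace):
                encode(value, branch, named, namespace)}
    if isinstance(schema, str):
        if schema in PRIMITIVES:
            return value
        record, record_ns = named[qualify(schema, namespace)]
        return encode_record(value, record, named, record_ns)
    if schema["type"] == "record":
        return encode_record(value, schema, named,
                             schema.get("namespace", namespace))
    return value  # e.g. {"type": "long", ...}


def encode_record(value, record, named, namespace):
    return {f["name"]: encode(value.get(f["name"]), f["type"], named, namespace)
            for f in record["fields"]}


def to_avro_json(value, schema):
    named = {}
    collect_records(schema, None, named)
    return encode(value, schema, named, None)


nullable_long = ["null", "long"]
schema = {
    "type": "record", "name": "Envelope",
    "namespace": "CoreOLTPEvents.dbo.Event",
    "fields": [
        {"name": "before", "default": None, "type": ["null", {
            "type": "record", "name": "Value", "fields": [
                {"name": "EventId", "type": "long"},
                {"name": "CameraId", "type": nullable_long, "default": None},
                {"name": "SiteId", "type": nullable_long, "default": None}]}]},
        {"name": "after", "type": ["null", "Value"], "default": None},
        {"name": "op", "type": "string"},
        {"name": "ts_ms", "type": nullable_long, "default": None},
    ],
}
plain = {"before": None,
         "after": {"EventId": 12, "CameraId": 10, "SiteId": 11974},
         "op": "C", "ts_ms": None}
result = to_avro_json(plain, schema)
print(result)
```

The wrapper key for the record branch is the record's full name, which is why the nested Value record, inheriting the Envelope namespace, is written as CoreOLTPEvents.dbo.Event.Value.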
Another thing to try is removing "connect.name": "CoreOLTPEvents.dbo.Event.Value" and "connect.name": "CoreOLTPEvents.dbo.Event.Envelope" from the schema, in case the tool only accepts the keys {'namespace', 'aliases', 'fields', 'name', 'type', 'doc'} on a record type. (The Avro specification itself permits extra attributes as metadata.)
Could you try the schema below and see whether you are able to produce the message?
{
"type": "record",
"name": "Envelope",
"namespace": "CoreOLTPEvents.dbo.Event",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "Value",
"fields": [
{
"name": "EventId",
"type": "long"
},
{
"name": "CameraId",
"type": [
"null",
"long"
],
"default": null
},
{
"name": "SiteId",
"type": [
"null",
"long"
],
"default": null
}
]
}
],
"default": null
},
{
"name": "after",
"type": [
"null",
"Value"
],
"default": null
},
{
"name": "op",
"type": "string"
},
{
"name": "ts_ms",
"type": [
"null",
"long"
],
"default": null
}
]
}