Kafka Connect JDBC sink connector - tables without a primary key - apache-kafka

I want to replicate 100+ tables from an old MySQL server to PostgreSQL. I have set up Debezium, which is running fine. Now I want to set up a JDBC sink connector to PostgreSQL. I also need to enable DELETEs.
In this case, how can I configure my sink connector for the tables without a primary key?
It should replicate inserts, updates, and deletes.

I have used Debezium and Kafka Connect to Postgres. I have specified the key column with pk.mode=record_key and pk.fields=name_of_pk_column, because Kafka Connect needs it so it can delete (delete.enabled=true) and update. You can also set auto.evolve=true.
If you set delete.enabled to true, you have to specify the PK. The key column must be the same in the source and destination tables.
But this is not handy if you need to transfer more than a few tables.
I have not tried it, but I think a single message transformation (SMT) might obtain the key from the Kafka message, transform it (maybe rename it), and use it in the Kafka Connect Postgres settings.
I actually have 1:1 databases on both sides, so the PK column name is all I have to worry about.
If you want, I can provide my settings for Kafka Connect and the Kafka Postgres connector...
connector-pg.properties
name=sink-postgres
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
#tasks.max=2
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true
# my Debezium topic with CDC from the source DB
topics=pokladna.public.books
connection.url=jdbc:postgresql://localhost:5434/postgres
connection.user=postgres
connection.password=postgres
dialect.name=PostgreSqlDatabaseDialect
# target table in the same DB, in this case given by the URL
table.name.format=books_kafka
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
auto.create=true
auto.evolve=true
# plain insert mode does not handle updates
insert.mode=upsert
delete.enabled=true
# PK column name, the same in the target and source DB
pk.fields=id
pk.mode=record_key
I tried to use the ExtractField SMT, but it didn't work for me yet. I still have no clue how to extract the primary key from the Kafka message automatically. Maybe I have to write my own SMT...
Actually I'm gonna start a new SO topic about it. Haha
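For tables that do have a unique column in the value, one option (untested here, just a sketch) is to build the record key in the sink itself with the built-in ValueToKey and ExtractField SMTs, extending connector-pg.properties above. The column name id is an assumption, and the RecordIsTombstone predicate (Kafka 2.6+) keeps the key-building transforms away from Debezium's delete tombstones:

transforms=unwrap,createKey,extractKey
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
# copy the (assumed) "id" column from the value into the key...
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.createKey.predicate=isTombstone
transforms.createKey.negate=true
# ...then pull it out of the key struct so pk.mode=record_key sees a simple key
transforms.extractKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractKey.field=id
transforms.extractKey.predicate=isTombstone
transforms.extractKey.negate=true
predicates=isTombstone
predicates.isTombstone.type=org.apache.kafka.connect.transforms.predicates.RecordIsTombstone

Note this only really covers inserts and updates: for deletes, the tombstone must arrive with the same key, which for a truly keyless table usually means setting message.key.columns on the Debezium source connector rather than building the key in the sink.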

Related

How to get database.server.name for Kafka Debezium MySQL connector?

EDITING the question:
I'm trying to configure a Debezium MySQL Kafka connector, taking as an example:
https://debezium.io/documentation/reference/stable/connectors/mysql.html#mysql-example-configuration
I have:
hostname: "ec2-xxx.compute.amazonaws.com"
database: mycooldb (with all my tables inside)
Then I set the following properties like:
"database.hostname": "ec2-xxx.compute.amazonaws.com"
"database.include.list": "mycooldb"
And Debezium has another property called "database.server.name". How can I find the server name value on the MySQL server?
A server can have multiple databases, and in database.include.list I can include a list of databases.
database.hostname is for the hostname or the IP.
I'm not sure what database.server.name is, or how to get its value from the MySQL server. In the scenario where I include multiple databases in database.include.list, what's the value for database.server.name?
What is the difference between database.server.name and database.hostname?
Per the docs:
database.hostname: IP address or host name of the MySQL database server
database.server.name: Logical name that identifies and provides a namespace for the particular MySQL database server/cluster in which Debezium is capturing changes. The logical name should be unique across all other connectors, since it is used as a prefix for all Kafka topic names that receive events emitted by this connector. Only alphanumeric characters, hyphens, dots and underscores must be used in the database server logical name.
So database.hostname must be the host/IP of where the database can be found. database.server.name could be fred or foobar or sales or anything else. It's just a logical name for that database, and is used (as described above) in the Kafka topic.
Without database.server.name you'd have the potential problem of ingesting a table called foo from two different databases using two different Debezium connectors, with both trying to store it in a Kafka topic called foo. Hence the comment in the docs that database.server.name "…provides a namespace".
Edit: In regards to your comments, my answer is still accurate. The docs detail topic naming, specifically the fact that the MySQL database name is used in part of the topic as is the database.server.name. If you connect to the same MySQL host (let's say we call it database.server.name=fred), and pull data from two databases on it called sales and warehouse, and each has a table called audit, you'd have two resulting Kafka topics:
fred.sales.audit
fred.warehouse.audit
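For illustration, a minimal source connector sketch along those lines (Debezium 1.x property names; the hostname is from the question above, while the credentials, server id, and history topic are placeholders):

{
  "name": "mysql-source-fred",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "ec2-xxx.compute.amazonaws.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "*****",
    "database.server.id": "184054",
    "database.server.name": "fred",
    "database.include.list": "sales,warehouse",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.fred"
  }
}

Changing database.server.name to anything else simply changes the first segment of those topic names; nothing on the MySQL server itself has to match it.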

Is there a way of telling a sink connector in Kafka Connect how to look for schema entries

I have successfully set up Kafka Connect in distributed mode locally with the Confluent BigQuery connector. The topics are being made available to me by another party; I am simply moving these topics into my Kafka Connect on my local machine, and then to the sink connector (and thus into BigQuery).
Because the topics are created by someone else, the schema registry is also managed by them. So in my config I set "schema.registry.url": "https://url-to-schema-registry", but we have multiple topics which all use the same schema entry, which is located at, let's say, https://url-to-schema-registry/subjects/generic-entry-value/versions/1.
What is happening, however, is that Connect is looking for the schema entry based on the topic name. So let's say my topic is my-topic. Connect is looking for the entry at this URL: https://url-to-schema-registry/subjects/my-topic-value/versions/1. But instead, I want to use the entry located at https://url-to-schema-registry/subjects/generic-entry-value/versions/1, and I want to do so for any and all topics.
How can I make this change? I have tried looking at this doc: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#configuration-details as well as this class: https://github.com/confluentinc/schema-registry/blob/master/schema-serializer/src/main/java/io/confluent/kafka/serializers/subject/TopicRecordNameStrategy.java
but this looks to be a config parameter for the schema registry itself (which I have no control over), not the sink connector. Unless I'm not configuring something correctly.
Is there a way for me to configure my sink connector to look for a specified schema entry like generic-entry-value/versions/..., instead of the default format topic-name-value/versions/...?
The strategy is configurable at the connector level.
e.g.
value.converter.value.subject.name.strategy=...
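For example, in the sink connector's own configuration (the registry URL is from the question; the Avro converter is an assumption, and RecordNameStrategy is just one of the built-ins, deriving the subject from the Avro record's full name rather than from a fixed string):

value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=https://url-to-schema-registry
value.converter.value.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy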
The only built-in strategies, however, are for Topic and/or RecordName lookups. You'll need to write your own strategy class for a static lookup of "generic-entry" if you cannot otherwise copy this "generic-entry-value" schema into new subjects.
e.g.
# get the output of this into a file, then wrap it as {"schema": "<escaped schema string>"}
curl ... https://url-to-schema-registry/subjects/generic-entry-value/versions/1/schema
# upload it again, where "new-entry" is the name of the other topic
curl -XPOST -H "Content-Type: application/vnd.schemaregistry.v1+json" -d @schema.json https://url-to-schema-registry/subjects/new-entry-value/versions

Can a Kafka sink connector consume multiple topics into multiple tables with a standalone configuration?

I have read that a Kafka Connect source can produce multiple topics from a database (one topic per table). I have a PostgreSQL database with many tables, and one Kafka source is enough for now. But can I declare only a single JDBC Kafka sink to consume all the topics into topic-based destination tables, for example all tables from PostgreSQL into a single MS SQL Server database? It is time-consuming if, for example, I have 200 tables in one database and must create 200 sink connectors, one for each table, even though I only need to declare the source once.
Yes, you can use Debezium to snapshot one database and all of its tables, send them over Kafka, and dump them to any other sink connector (including MS SQL Server).
How many connectors you need to run, or how many tables you create on the destination, is ultimately up to your own configuration.
And standalone mode doesn't matter here, but distributed mode is preferred anyway, even if you are only using one machine.
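As an illustration, a minimal sketch of one JDBC sink fanning a whole set of Debezium topics out to per-topic tables (the dbserver1 topic prefix, SQL Server URL, and credentials are assumptions; the RegexRouter rewrite lets ${topic} resolve to just the table name):

name=sink-mssql-all-tables
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
connection.url=jdbc:sqlserver://localhost:1433;databaseName=target_db
connection.user=sa
connection.password=********
# one sink subscribes to every CDC topic from the source server
topics.regex=dbserver1.*
transforms=unwrap,route
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
# rewrite "server.database.table" topic names to just "table"
# (backslashes are doubled because .properties files treat \ as an escape)
transforms.route.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.route.regex=([^.]+)\\.([^.]+)\\.([^.]+)
transforms.route.replacement=$3
# each rewritten topic lands in a destination table of the same name
table.name.format=${topic}
auto.create=true
insert.mode=insert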

Kafka Connect Schema evolution when columns are removed

Let's say we have a setup as follows.
Schema evolution compatibility is set to BACKWARD.
A JDBC Source Connector polls data from the DB and writes it to a Kafka topic. An HDFS Sink Connector reads messages from the Kafka topic and writes them to HDFS in Avro format.
The following is the flow as I understand it.
The JDBC Source Connector queries the DB and generates Schema V1 from the JDBC metadata of the ResultSet. V1 has col1, col2, col3. Schema V1 is registered in the Schema Registry.
The source connector polls data from the DB and writes messages to the Kafka topic with the V1 schema.
(Question 1) When the HDFS Sink connector reads messages from the topic, does it validate the messages against the V1 schema from the Schema Registry?
Next, the DB schema is changed. Column "col3" is removed from the table.
The next time the JDBC Source polls the DB, it sees that the schema has changed, generates a new Schema V2 (col1, col2), and registers V2 in the Schema Registry.
The source connector continues polling data and writes to the Kafka topic with the V2 schema.
Now the Kafka topic can have messages in both the V1 and V2 schemas.
(Question 2) When the HDFS Sink connector reads a message, does it now validate messages against Schema V2?
Is this the case addressed in the Confluent documentation under Backward Compatibility?:
[https://docs.confluent.io/current/schema-registry/avro.html#schema-evolution-and-compatibility]
An example of a backward compatible change is a removal of a field. A consumer that was developed to process events without this field will be able to process events written with the old schema and contain the field – the consumer will just ignore that field.
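To make the example concrete, the two registered value schemas would look roughly like this (record name and field types are illustrative, not what the JDBC source actually generates):

V1:
{"type": "record", "name": "TableRow", "fields": [
  {"name": "col1", "type": "int"},
  {"name": "col2", "type": "string"},
  {"name": "col3", "type": "string"}
]}

V2 (col3 removed, which the registry accepts as a backward compatible change):
{"type": "record", "name": "TableRow", "fields": [
  {"name": "col1", "type": "int"},
  {"name": "col2", "type": "string"}
]}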
The registry only validates when a new schema is registered.
Therefore, it's only if/when the source connector detects a change that validation occurs, on the registry side.
As for the HDFS connector, there is a separate schema.compatibility property that applies a projection over records held in memory and any new records. When you get a record with a new schema and a backwards compatible update, then all messages not yet flushed will be updated to hold the new schema when an Avro container file is written.
Aside: just because the registry thinks a change is backwards compatible doesn't guarantee the sink connector does... The validation within the source code is different, and we've had multiple issues with it :/
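For reference, a minimal sketch of where that property sits in an HDFS sink configuration (topic name, HDFS URL, and flush size are placeholders):

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=my_jdbc_topic
hdfs.url=hdfs://namenode:8020
flush.size=1000
format.class=io.confluent.connect.hdfs.avro.AvroFormat
# controls the projection described above: NONE, BACKWARD, FORWARD or FULL
schema.compatibility=BACKWARD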

Kafka Connect HDFS (Azure) Persist Avro Values AND String Keys

I have configured Kafka Connect HDFS to work on Azure Data Lake, however I just noticed that the keys (Strings) are not being persisted in any way, only the Avro values.
When I think about this, I suppose it makes sense, as the partitioning I want to apply in the data lake is not related to the key, and I have not specified a new Avro schema which incorporates the key String into the existing Avro value schema.
Now, within the configuration I supply when running the connect-distributed.sh script, I have (among other settings)
...
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://<ip>:<port>
...
But within the actual sink connector that I set up using curl I simply specify the output format as
...
"format.class": "io.confluent.connect.hdfs.avro.AvroFormat"
...
so the connector just assumes that the Avro value is to be written.
So I have two questions. How do I tell the connector that it should save the key along with the value as part of a new Avro schema, and where do I define this schema?
Note that this is an Azure HDInsight cluster and so is not a Confluent Kafka solution (though I would have access to open source Confluent code such as Kafka Connect HDFS)