Connecting Kafka connect source directly to the kafka connect sink - apache-kafka

I want to sync two tables from two distinct database services with Kafka Connect Source and Kafka Connect Sink.
Kafka Connect Source reads data from source database and publishes changes into a topic named TOP1, and Kafka Connect Sink subscribed into the TOP1 and should write the change into the destination database.
The source and destination database are MSSQL and I use Debezium connector for SQL Server.
I created Kafka Connect Source with following configuration:
{
"name": "sql-source",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"database.server.name": "TEST",
"database.dbname": "test_source",
"database.hostname": "172.1x.xx.xx",
"database.port": "1433",
"database.user": "sa",
"database.password": "xxxxx",
"database.instance": "MSSQLSERVER",
"database.history.kafka.bootstrap.servers": "kafka01.xxxx.dev:9092",
"database.history.kafka.topic": "schema-changes.inventory",
"value.converter.schema.registry.url":"http://kafka01.xxxx.dev:8081"
}
}
This works great and publish any changes(insert, update, delete) with following schema:
{
"before": {...},
"after": {...},
"source": {...}
}
But how should I create Kafka Connect Sink configuration that in destination database I have exactly the same data as source database.
When a record inserted in source the same insert in destination, a record deleted in source the same record delete from destination and also update in source results update in the destination.

Related

Configure Debezium CDC -> Kafka -> JDBC Sink (Multiple tables) Question

We have around 100 tables in SQL server DB(Application DB) which needs to be synced to SQL server DB(for Analytics) in near Realtime.
Future use case: Scale the Proof of Concept for 30 Source DBs to one destination DB(for Analytics) in near Realtime.
I am thinking to use one sink connector or few sink connectors for multiple tables. Please let me know if this is a good idea.
But I am not sure how to configure the sink to cater for multiple tables especially that each table might have its own primary key. Internet seems to have very simple examples of sink connector but not addressing complex use cases.
Debezium CDC(Source) config
{ "name": "wwi",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.dbname": "************************",
"database.history": "io.debezium.relational.history.MemoryDatabaseHistory",
"database.hostname": "**********************",
"database.password": "**********************",
"database.port": "1433",
"database.server.name": "******",
"database.user": "*********",
"decimal.handling.mode": "string",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true",
"snapshot.mode": "schema_only",
"table.include.list": "Sales.Orders,Warehouse.StockItems",
"tasks.max": "1",
"tombstones.on.delete": "false",
"transforms": "route,unwrap",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter.schemas.enable": "true",
"value.convertor": "org.apache.kafka.connect.json.JsonConverter"
}
}
JDBC Sink config
{
"name": "sqlsinkcon",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "orders",
"tasks.max": "1",
"auto.evolve": "true",
"connection.user": "********",
"auto.create": "true",
"connection.url": "jdbc:sqlserver://************",
"insert.mode": "upsert",
"pk.mode":"record_key",
"pk.fields":"OrderID",
"db.name": "kafkadestination"
}
}
The sink will write one table, per consumed topic. topics or topics.regex can be used to consume multiple topics at once.
Regarding scalability (or at least, fault tolerance), I prefer one sink task, with one topic (therefore writing to one table). Otherwise, if you consume multiple topics, and connector fails, then it'll potentially crash all the task threads due to the consumer rebalancing.
Also, using JSON / plaintext formats in Kafka isn't most optimal in terms of network bandwidth. I'd suggest a binary format like Avro or Protobuf.

No CDC generated by Kafka Debezium connector for Postgres

I succeed generating CDC in a Postgres DB.
Today, when I use same step to try to set up Kafka Debezium connector for another Postgres DB.
First I ran
POST http://localhost:8083/connectors
with body:
{
"name": "postgres-kafkaconnector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
"database.server.name": "postgres_server",
"table.include.list": "public.products",
"plugin.name": "pgoutput"
}
}
which succeed without error.
Then I ran
GET http://localhost:8083/connectors/postgres-kafkaconnector/status
to check status. It returns this result without any error:
{
"name": "postgres-kafkaconnector",
"connector": {
"state": "RUNNING",
"worker_id": "10.xx.xx.xx:8083"
},
"tasks": [
{
"id": 0,
"state": "RUNNING",
"worker_id": "10.xx.xx.xx:8083"
}
],
"type": "source"
}
However, this time, when I updated anything in the products table. No CDC got generated.
Any idea? Any suggestion for helping further debug would be appreciate. Thanks!
Found the issue! It is because my Kafka Connector postgres-kafkaconnector was initially pointing to a DB (stage1), then I switched to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, they are using same configuration properties in the Kafka Connect I deployed in the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since this connector with different DB config shared same above Kafka configuration properties, nd the database table schemas are same,
it became mess due to sharing same Kafka offset.
One simple way to fix is when deploying Kafka connector to test on different DBs, using different names such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2 to avoid Kafka topic offset mess.

Is there any way to use MongoSourceConnector for multiple database with single kafka topic?

I am using MongoSourceConnector to connect kafka topic with mongo database collection. For single database with single kafka topic it's working fine, but is there any way that i could do a connection for multiple mongo database with single kafka topic.
If you are running kafka-connect in distributed mode then you can create a another connector config file with the above mentioned config
I am not really sure about multiple databases and a single Kafka topic but you can surely listen to multiple databases change streams and push data to topics. Since topic creation depends on the database_name.collection_name, so you will have more topics.
You can provide the Regex to listen to multiple databases in the pipeline.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-names_.*/}},{\"ns.coll\":{\"$regex\":/^collection_name$/}}]}}]"
Here is the complete Kafka connector configuration.
Mongo to Kafka source connector
{
"name": "mongo-to-kafka-connect",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"publish.full.document.only": "true",
"tasks.max": "3",
"key.converter.schemas.enable": "false",
"topic.creation.enable": "true",
"poll.await.time.ms": 1000,
"poll.max.batch.size": 100,
"topic.prefix": "any prefix for topic name",
"output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
"connection.uri": "mongodb://<username>:<password>#ip:27017,ip:27017,ip:27017,ip:27017/?authSource=admin&replicaSet=xyz&tls=true",
"value.converter.schemas.enable": "false",
"copy.existing": "true",
"topic.creation.default.replication.factor": 3,
"topic.creation.default.partitions": 3,
"topic.creation.compacted.cleanup.policy": "compact",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"mongo.errors.log.enable": "true",
"heartbeat.interval.ms": 10000,
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-names_.*/}},{\"ns.coll\":{\"$regex\":/^collection_name$/}}]}}]"
}
}
You can get more details from official docs.
https://www.mongodb.com/docs/kafka-connector/current/source-connector/
https://docs.confluent.io/platform/current/connect/index.html

Kafka Connect: streaming changes from Postgres to topics using debezium

I'm pretty new to Kafka and Kafka Connect world. I am trying to implement CDC using Kafka (on MSK), Kafka Connect (using the Debezium connector for PostgreSQL) and an RDS Postgres instance. Kafka Connect runs in a K8 pod in our cluster deployed in AWS.
Before diving into the details of the configuration used, I'll try to summarise the problem:
Once the connector starts, it sends messages to the topic as expected (snahpshot)
Once we make any change to a table (Create, Update, Delete), no messages are sent to the topic. We would expect to see messages about the changes made to the table.
My connector config looks like:
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.user": "root",
"database.dbname": "insights",
"slot.name": "cdc_organization",
"tasks.max": "1",
"column.blacklist": "password, access_key, reset_token",
"database.server.name": "insights",
"database.port": "5432",
"plugin.name": "wal2json_rds_streaming",
"schema.whitelist": "public",
"table.whitelist": "public.kafka_connect_cdc_test",
"key.converter.schemas.enable": "false",
"database.hostname": "de-test-sre-12373.cbplqnioxomr.eu-west-1.rds.amazonaws.com",
"database.password": "MYSECRETPWD",
"value.converter.schemas.enable": "false",
"name": "source-postgres",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"snapshot.mode": "initial"
}
We have tried different configurations for the plugin.name property: wal2josn, wal2json_streaming and wal2json_rds_streaming.
There's no problem of connection between the connector and the DB as we already saw messages flowing through as soon as the connector starts.
Is there a configuration issue with the connector described above that prevent us to see messages related to new changes appearing in the topic?
Thanks
Your connector config looks a bit confusing. I'm pretty new to Kafka as well so I don't really know the issue but this is my connector config that works for me.
{
"name":"<connector_name>",
"config": {
"connector.class":"io.debezium.connector.postgresql.PostgresConnector",
"database.server.name":"<server>",
"database.port":"5432",
"database.hostname":"<host>",
"database.user":"<user>",
"database.dbname":"<password>",
"tasks.max":"1",
"database.history.kafka.boostrap.servers":"localhost:9092",
"database.history.kafka.topic":"<kafka_topic_name>",
"plugin.name":"pgoutput",
"include.schema.changes":"true"
}
}
If this configuration didn't work aswell, try look up the log console; sometimes the error isn't the last write of the console

Is it possible to sink kafka message generated by debezium to snowflake

I use debezium-ui repo to testing debezium mysql cdc feature, the message can normally stream
into kafka, the request body of to create mysql connect are as follows:
{
"name": "inventory-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "dbzui-db-mysql",
"database.port": "3306",
"database.user": "mysqluser",
"database.password": "mysql",
"database.server.id": "184054",
"database.server.name": "inventory-connector-mysql",
"database.include.list": "inventory",
"database.history.kafka.bootstrap.servers": "dbzui-kafka:9092",
"database.history.kafka.topic": "dbhistory.inventory"
}
}
And then I need to sink the kafka message into snowflake, the data warehouse my team use. I create a snowflake sink connector to sink it, the request body are as follows:
{
"name": "kafka2-04",
"config": {
"connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
"tasks.max": 1,
"topics": "inventory-connector-mysql.inventory.orders",
"snowflake.topic2table.map": "inventory-connector-mysql.inventory.orders:tbl_orders",
"snowflake.url.name": "**.snowflakecomputing.com",
"snowflake.user.name": "kafka_connector_user_1",
"snowflake.private.key": "*******",
"snowflake.private.key.passphrase": "",
"snowflake.database.name": "kafka_db",
"snowflake.schema.name": "kafka_schema",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "com.snowflake.kafka.connector.records.SnowflakeJsonConverter",
"header.converter": "org.apache.kafka.connect.storage.SimpleHeaderConverter",
"value.converter.schemas.enable":"true"
}
}
But after it runs,the data sink into my snowflake is like: data in snowflake, the schema in snowflake table is different from mysql table. Is my sink connector config is incorrect or it is impossible to sink kafka data generated by debezium with SnowflakeSinkConnector.
This is default behavior in Snowflake and it is documented here:
Every Snowflake table loaded by the Kafka connector has a schema consisting of two VARIANT columns:
RECORD_CONTENT. This contains the Kafka message.
RECORD_METADATA. This contains metadata about the message, for example, the topic from which the message was read.
If Snowflake creates the table, then the table contains only these two columns. If the user creates the table for the Kafka Connector to add rows to, then the table can contain more than these two columns (any additional columns must allow NULL values because data from the connector does not include values for those columns).