I succeed generating CDC in a Postgres DB.
Today, when I use same step to try to set up Kafka Debezium connector for another Postgres DB.
First I ran
POST http://localhost:8083/connectors
with body:
{
"name": "postgres-kafkaconnector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
"database.server.name": "postgres_server",
"table.include.list": "public.products",
"plugin.name": "pgoutput"
}
}
which succeed without error.
Then I ran
GET http://localhost:8083/connectors/postgres-kafkaconnector/status
to check status. It returns this result without any error:
{
"name": "postgres-kafkaconnector",
"connector": {
"state": "RUNNING",
"worker_id": "10.xx.xx.xx:8083"
},
"tasks": [
{
"id": 0,
"state": "RUNNING",
"worker_id": "10.xx.xx.xx:8083"
}
],
"type": "source"
}
However, this time, when I updated anything in the products table. No CDC got generated.
Any idea? Any suggestion for helping further debug would be appreciate. Thanks!
Found the issue! It is because my Kafka Connector postgres-kafkaconnector was initially pointing to a DB (stage1), then I switched to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, they are using same configuration properties in the Kafka Connect I deployed in the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since this connector with different DB config shared same above Kafka configuration properties, nd the database table schemas are same,
it became mess due to sharing same Kafka offset.
One simple way to fix is when deploying Kafka connector to test on different DBs, using different names such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2 to avoid Kafka topic offset mess.
Related
We have around 100 tables in SQL server DB(Application DB) which needs to be synced to SQL server DB(for Analytics) in near Realtime.
Future use case: Scale the Proof of Concept for 30 Source DBs to one destination DB(for Analytics) in near Realtime.
I am thinking to use one sink connector or few sink connectors for multiple tables. Please let me know if this is a good idea.
But I am not sure how to configure the sink to cater for multiple tables especially that each table might have its own primary key. Internet seems to have very simple examples of sink connector but not addressing complex use cases.
Debezium CDC(Source) config
{ "name": "wwi",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.dbname": "************************",
"database.history": "io.debezium.relational.history.MemoryDatabaseHistory",
"database.hostname": "**********************",
"database.password": "**********************",
"database.port": "1433",
"database.server.name": "******",
"database.user": "*********",
"decimal.handling.mode": "string",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true",
"snapshot.mode": "schema_only",
"table.include.list": "Sales.Orders,Warehouse.StockItems",
"tasks.max": "1",
"tombstones.on.delete": "false",
"transforms": "route,unwrap",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter.schemas.enable": "true",
"value.convertor": "org.apache.kafka.connect.json.JsonConverter"
}
}
JDBC Sink config
{
"name": "sqlsinkcon",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "orders",
"tasks.max": "1",
"auto.evolve": "true",
"connection.user": "********",
"auto.create": "true",
"connection.url": "jdbc:sqlserver://************",
"insert.mode": "upsert",
"pk.mode":"record_key",
"pk.fields":"OrderID",
"db.name": "kafkadestination"
}
}
The sink will write one table, per consumed topic. topics or topics.regex can be used to consume multiple topics at once.
Regarding scalability (or at least, fault tolerance), I prefer one sink task, with one topic (therefore writing to one table). Otherwise, if you consume multiple topics, and connector fails, then it'll potentially crash all the task threads due to the consumer rebalancing.
Also, using JSON / plaintext formats in Kafka isn't most optimal in terms of network bandwidth. I'd suggest a binary format like Avro or Protobuf.
I have a Postgres DB with CDC setup.
I deployed the Kafka Debezium connector 1.8.0.Final for a Postgres DB by
POST http://localhost:8083/connectors
with body:
{
"name": "postgres-kafkaconnector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
"database.server.name": "postgres_server",
"table.include.list": "public.products",
"plugin.name": "pgoutput"
}
}
I noticed some strange things.
In same table, when I update rows, some rows can generate CDC, but other rows cannot generate CDC.
And those rows are very similar except for id and identifier are different.
-- Updating this row can generate CDC
UPDATE public.products
SET identifier = 'GET /api/accounts2'
WHERE id = '90c21719-ce41-4523-8ad1-ed6b21ecfaf1';
-- Updating this row cannot generate CDC
UPDATE public.products
SET identifier = 'GET /api/notworking/accounts2'
WHERE id = '22f5ebf3-9594-493d-8aa6-649d9fbcefd2';
I checked my Kafka Connect container log, there is no error neither.
Any idea?
Found the issue! It is because my Kafka Connector postgres-kafkaconnector was initially pointing to a DB (stage1), then I switched to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, they are using same configuration properties in the Kafka Connect I deployed in the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since this connector with different DB config shared same above Kafka configuration properties, nd the database table schemas are same,
it became mess due to sharing same Kafka offset.
One simple way to fix is when deploying Kafka connector to test on different DBs, using different names such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2 to avoid Kafka topic offset mess.
I want to sync two tables from two distinct database services with Kafka Connect Source and Kafka Connect Sink.
Kafka Connect Source reads data from source database and publishes changes into a topic named TOP1, and Kafka Connect Sink subscribed into the TOP1 and should write the change into the destination database.
The source and destination database are MSSQL and I use Debezium connector for SQL Server.
I created Kafka Connect Source with following configuration:
{
"name": "sql-source",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"database.server.name": "TEST",
"database.dbname": "test_source",
"database.hostname": "172.1x.xx.xx",
"database.port": "1433",
"database.user": "sa",
"database.password": "xxxxx",
"database.instance": "MSSQLSERVER",
"database.history.kafka.bootstrap.servers": "kafka01.xxxx.dev:9092",
"database.history.kafka.topic": "schema-changes.inventory",
"value.converter.schema.registry.url":"http://kafka01.xxxx.dev:8081"
}
}
This works great and publish any changes(insert, update, delete) with following schema:
{
"before": {...},
"after": {...},
"source": {...}
}
But how should I create Kafka Connect Sink configuration that in destination database I have exactly the same data as source database.
When a record inserted in source the same insert in destination, a record deleted in source the same record delete from destination and also update in source results update in the destination.
I'm experiencing performance issues when replicating tables using Debezium and Kafka Connect.
The slow replication is only experienced during the initial snapshot of the database. One table that I tested with, contains 3.4m rows and the replication took 2 hours to complete.
At this stage, the entire database was locked and I was unable to commit data to other tables that were not being synced at the time.
My configuration (Debezium config deployed via curl request):
{
"name": "debezium-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "redacted",
"database.port": "3306",
"database.user": "redacted",
"database.password": "redacted",
"database.server.id": "54005",
"database.server.name": "redacted",
"database.include.list": "redacted",
"table.include.list": "redacted",
"database.history.consumer.security.protocol":"SSL",
"database.history.producer.security.protocol":"SSL",
"database.history.kafka.bootstrap.servers": "redacted",
"database.history.kafka.topic": "schema-changes.debezium-test",
"snapshot.mode": "when_needed",
"max.queue.size": 81290,
"max.batch.size": 20480
}
}
Kafka Connect configuration that was changed:
CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
CONNECT_OFFSET_FLUSH_TIMEOUT_MS: 60000
Questions:
1 - How can I improve performance during the initial snapshot of the database?
2 - How can I replicate a limited number of tables from a database, without locking the entire database?
if you can make sure that database schema will not change during snapshot process then you can avoid locking the database via https://debezium.io/documentation/reference/1.3/connectors/mysql.html#mysql-property-snapshot-locking-mode
Also check https://debezium.io/documentation/reference/1.3/connectors/mysql.html#mysql-property-min-row-count-to-stream-results option, there might be also some performance change using it properly.
You can also max.batch.size together with max.queue.size even more than you have it right now.
I'm streaming my PostgreSQL using confluent (./kafka-avro-console-consumer). Successfully streaming, but Kafka only showed INSERT activity. Other than that like DELETE, UPDATE, CREATE TABLE, didn't stream on my Kafka consume.
I did:
Install PostgreSQL including making role until making tables
Install debezium CDC for PostgreSQL (I didn't use json/jdbc at all)
Install wal2json
Making connector to confluent
Automatically: topics are made after successfully deploy the connector
Stream topics ./kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic debezium.public.emp_bio --from-beginning
This is my connector
{
"name": "postgres-connector-1",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "localhost",
"database.port": "5432",
"database.user": "dbuser1",
"database.password": "password",
"database.dbname":"testdb",
"database.server.name": "debezium",
"database.whitelist": "testdb",
"plugin.name": "wal2json",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "postgres-hist-test",
"include.schema.changes": "true"
}
}
Expected: I can stream other activities like DELETE, DROP, UPDATE, CREATE from my PostgreSQL
Error: Sorry, there is no error message anywhere!