Debezium streams slowly after stopping and restarting connector - postgresql

I observe that row updates are streamed slowly after I stop and restart the Debezium connector. It sometimes takes almost nine minutes before row updates stream instantaneously again. At the start of this period, a row update takes over a minute to appear; the delay then gradually shrinks until streaming is instantaneous. This "waiting period" is sometimes shorter.
I'm using the official Debezium Docker image. My database is PostgreSQL 15.
Here's how I create the connector:
curl -H 'Content-Type: application/json' 127.0.0.1:9083/connectors --data '
{
"name": "my-name",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"topic.prefix": "prefix",
"database.hostname": "myhost.net",
"database.port": "5432",
"database.user": "root",
"database.password": "root",
"database.server.name": "my_db",
"database.dbname" : "my_db",
"table.include.list": "my_schema.my_table",
"plugin.name": "pgoutput",
"heartbeat.interval.ms": 100
}
}'
The table in table.include.list is updated infrequently, but I have other tables that are updated frequently, so I had to use heartbeat.interval.ms as per this.
Why are row updates streamed so slowly in this scenario? How can I recover from this scenario?

Related

Some rows in the Postgres table can generate CDC while others cannot

I have a Postgres DB with CDC set up.
I deployed the Kafka Debezium connector 1.8.0.Final for a Postgres DB by
POST http://localhost:8083/connectors
with body:
{
"name": "postgres-kafkaconnector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
"database.server.name": "postgres_server",
"table.include.list": "public.products",
"plugin.name": "pgoutput"
}
}
I noticed something strange.
In the same table, updating some rows generates CDC events, but updating other rows does not.
The rows are very similar; only id and identifier differ.
-- Updating this row can generate CDC
UPDATE public.products
SET identifier = 'GET /api/accounts2'
WHERE id = '90c21719-ce41-4523-8ad1-ed6b21ecfaf1';
-- Updating this row cannot generate CDC
UPDATE public.products
SET identifier = 'GET /api/notworking/accounts2'
WHERE id = '22f5ebf3-9594-493d-8aa6-649d9fbcefd2';
I checked my Kafka Connect container log, and there is no error there either.
Any idea?
Found the issue! My Kafka connector postgres-kafkaconnector was initially pointing to one DB (stage1), and then I switched it to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, both setups used the same configuration properties from the Kafka Connect worker I deployed at the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since this connector, now with a different DB config, shared the Kafka configuration properties above, and the database table schemas are the same,
the two setups ended up sharing the same Kafka offsets, which made a mess.
One simple fix when deploying a Kafka connector to test against different DBs is to use different connector names, such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2, to avoid the offset mess.
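The naming fix above can be sketched as a small script that derives one connector definition per environment. This is only an illustration: the helper connector_for, the stage hostnames, and the choice to also vary database.server.name are assumptions, not part of the original setup. Kafka Connect stores source offsets under the connector name, so distinct names keep offsets recorded for stage1 from being applied to stage2.

```python
import json


def connector_for(stage: str, hostname: str) -> dict:
    """Build a Debezium Postgres connector definition for one environment.

    `stage` and `hostname` are hypothetical inputs; the remaining values
    mirror the configuration shown in the question.
    """
    return {
        # A distinct name per database gives each connector its own offsets.
        "name": f"postgres-kafkaconnector-{stage}",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "tasks.max": "1",
            "database.hostname": hostname,
            "database.port": "5432",
            "database.dbname": "my_db",
            "database.user": "xxx",
            "database.password": "xxx",
            # Assumption: also vary the logical server name so topics do not collide.
            "database.server.name": f"postgres_server_{stage}",
            "table.include.list": "public.products",
            "plugin.name": "pgoutput",
        },
    }


stage1 = connector_for("stage1", "stage1.example.com")
stage2 = connector_for("stage2", "stage2.example.com")

# Each body can be sent to POST http://localhost:8083/connectors.
print(json.dumps(stage2, indent=2))
```

Each generated body is registered separately, so switching databases means creating a new connector rather than editing the existing one in place.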

No CDC generated by Kafka Debezium connector for Postgres

I succeeded in generating CDC events for one Postgres DB.
Today I used the same steps to set up the Kafka Debezium connector for another Postgres DB.
First I ran
POST http://localhost:8083/connectors
with body:
{
"name": "postgres-kafkaconnector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
"database.server.name": "postgres_server",
"table.include.list": "public.products",
"plugin.name": "pgoutput"
}
}
which succeeded without error.
Then I ran
GET http://localhost:8083/connectors/postgres-kafkaconnector/status
to check the status. It returned this result without any error:
{
"name": "postgres-kafkaconnector",
"connector": {
"state": "RUNNING",
"worker_id": "10.xx.xx.xx:8083"
},
"tasks": [
{
"id": 0,
"state": "RUNNING",
"worker_id": "10.xx.xx.xx:8083"
}
],
"type": "source"
}
However, this time, when I updated anything in the products table, no CDC events were generated.
Any idea? Any suggestions to help debug this further would be appreciated. Thanks!
Found the issue! My Kafka connector postgres-kafkaconnector was initially pointing to one DB (stage1), and then I switched it to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, both setups used the same configuration properties from the Kafka Connect worker I deployed at the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since this connector, now with a different DB config, shared the Kafka configuration properties above, and the database table schemas are the same,
the two setups ended up sharing the same Kafka offsets, which made a mess.
One simple fix when deploying a Kafka connector to test against different DBs is to use different connector names, such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2, to avoid the offset mess.
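Another way to keep the two environments apart, sketched here under assumptions, is to run a separate Kafka Connect worker cluster per database, each with its own group.id and storage topics. The topic names, the stage labels, the default bootstrap address, and the helper worker_properties are made up for illustration; only the property keys come from Kafka Connect's distributed worker configuration.

```python
def worker_properties(stage: str, bootstrap: str = "localhost:9092") -> str:
    """Render distributed-worker settings that are unique to one environment.

    The naming scheme is hypothetical; what matters is that no two
    environments share a group id or a storage topic.
    """
    props = {
        "bootstrap.servers": bootstrap,
        "group.id": f"connect-cluster-{stage}",
        "config.storage.topic": f"connect-configs-{stage}",
        "offset.storage.topic": f"connect-offsets-{stage}",
        "status.storage.topic": f"connect-status-{stage}",
    }
    # Emit Java-properties style lines, one setting per line.
    return "\n".join(f"{key}={value}" for key, value in props.items())


print(worker_properties("stage2"))
```

This is heavier than renaming the connector, so it mainly makes sense when the environments should not share a worker cluster at all.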

SMTs to create a Kafka connector string partition key through connector config

I've been implementing a Kafka connector for PostgreSQL (I'm using the Debezium Kafka connector and running all the pieces through Docker). I need a custom partition key, so I've been using SMTs to achieve this. However, the approach that I'm using creates a Struct, and I need the key to be a string. This article runs through how to set up the partition key as an int, but I can't access the config file to set up the appropriate transforms. Currently my Kafka connector looks like this:
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{
"name": "connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "password",
"database.dbname" : "postgres",
"database.server.name": "postgres",
"table.include.list": "public.table",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "schema-changes.table",
"transforms": "routeRecords,unwrap,createkey",
"transforms.routeRecords.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.routeRecords.regex": "(.*)",
"transforms.routeRecords.replacement": "table",
"transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState",
"transforms.createkey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createkey.fields": "id",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter"
}
}'
I know that I have to extract the value of the column, but I'm just not sure how.
ValueToKey creates a Struct from a list of fields, as documented.
You need one more transform to extract a specific field from that Struct, as shown in the linked post:
org.apache.kafka.connect.transforms.ExtractField$Key
Note: This does not "set" the partition of the actual Kafka record, only the key, which is then hashed by the Producer to get the partition.
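As a rough sketch of that extra step, the transform chain from the question can be extended with one more entry. The alias extractkey is an assumed name chosen for illustration; the field property names which member of the key Struct to keep. With the StringConverter from the question as key.converter, the extracted value should then be written in its string form.

```python
import json

# Transform settings from the question, plus one extra step at the end.
transforms = {
    # Order matters: the key Struct must exist before a field is pulled from it.
    "transforms": "routeRecords,unwrap,createkey,extractkey",
    "transforms.routeRecords.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.routeRecords.regex": "(.*)",
    "transforms.routeRecords.replacement": "table",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    # ValueToKey copies `id` from the value into a Struct key.
    "transforms.createkey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createkey.fields": "id",
    # `extractkey` is a hypothetical alias; ExtractField$Key replaces the
    # Struct key with the single field named below.
    "transforms.extractkey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractkey.field": "id",
}

# Merge these entries into the "config" object of the connector request.
print(json.dumps(transforms, indent=2))
```

When these entries are pasted into the curl command, the single quotes around the JSON body keep the shell from expanding `$Key`.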

Performance issues when replicating tables with Kafka Connect & Debezium to Kafka

I'm experiencing performance issues when replicating tables using Debezium and Kafka Connect.
The slow replication occurs only during the initial snapshot of the database. One table that I tested with contains 3.4 million rows, and its replication took 2 hours to complete.
During this time, the entire database was locked and I was unable to commit data to other tables that were not being synced.
My configuration (Debezium config deployed via curl request):
{
"name": "debezium-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "redacted",
"database.port": "3306",
"database.user": "redacted",
"database.password": "redacted",
"database.server.id": "54005",
"database.server.name": "redacted",
"database.include.list": "redacted",
"table.include.list": "redacted",
"database.history.consumer.security.protocol":"SSL",
"database.history.producer.security.protocol":"SSL",
"database.history.kafka.bootstrap.servers": "redacted",
"database.history.kafka.topic": "schema-changes.debezium-test",
"snapshot.mode": "when_needed",
"max.queue.size": 81290,
"max.batch.size": 20480
}
}
Kafka Connect configuration that was changed:
CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
CONNECT_OFFSET_FLUSH_TIMEOUT_MS: 60000
Questions:
1 - How can I improve performance during the initial snapshot of the database?
2 - How can I replicate a limited number of tables from a database, without locking the entire database?
If you can make sure that the database schema will not change during the snapshot process, then you can avoid locking the database via https://debezium.io/documentation/reference/1.3/connectors/mysql.html#mysql-property-snapshot-locking-mode
Also check the https://debezium.io/documentation/reference/1.3/connectors/mysql.html#mysql-property-min-row-count-to-stream-results option; using it properly may also change performance.
You can also increase max.batch.size together with max.queue.size beyond the values you have right now.
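The three suggestions above could be combined into one configuration overlay, roughly as follows. The numeric values are assumptions picked only to show the shape of the change, not tuned recommendations, and base_config repeats just the relevant subset of the settings from the question.

```python
import json

# Subset of the connector settings from the question.
base_config = {
    "snapshot.mode": "when_needed",
    "max.queue.size": 81290,
    "max.batch.size": 20480,
}

# Assumed example values; tune them against your own workload.
snapshot_tuning = {
    # Only safe if no schema change happens while the snapshot runs.
    "snapshot.locking.mode": "none",
    # Row-count threshold above which tables are streamed during the snapshot.
    "min.row.count.to.stream.results": 1000,
    # Raise both values together.
    "max.batch.size": 40960,
    "max.queue.size": 163840,
}

# Later keys win, so the tuning values override the originals.
tuned_config = {**base_config, **snapshot_tuning}

# The queue has to be able to hold more than one batch.
assert tuned_config["max.queue.size"] > tuned_config["max.batch.size"]

print(json.dumps(tuned_config, indent=2))
```

Larger batch and queue sizes increase the memory the Connect worker needs, so the heap may have to grow along with them.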

Kafka cannot stream database activity

I'm streaming my PostgreSQL database using Confluent (./kafka-avro-console-consumer). Streaming works, but Kafka only shows INSERT activity. Other activity, such as DELETE, UPDATE, and CREATE TABLE, doesn't show up in my Kafka consumer.
What I did:
Installed PostgreSQL, including creating a role and the tables
Installed Debezium CDC for PostgreSQL (I didn't use json/jdbc at all)
Installed wal2json
Created the connector in Confluent
Topics were created automatically after the connector was deployed successfully
Streamed the topic with ./kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic debezium.public.emp_bio --from-beginning
This is my connector:
{
"name": "postgres-connector-1",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "localhost",
"database.port": "5432",
"database.user": "dbuser1",
"database.password": "password",
"database.dbname":"testdb",
"database.server.name": "debezium",
"database.whitelist": "testdb",
"plugin.name": "wal2json",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "postgres-hist-test",
"include.schema.changes": "true"
}
}
Expected: I can stream other activity, such as DELETE, DROP, UPDATE, and CREATE, from my PostgreSQL database
Error: Sorry, there is no error message anywhere!