Performance issues when replicating tables with Kafka Connect & Debezium to Kafka - apache-kafka

I'm experiencing performance issues when replicating tables using Debezium and Kafka Connect.
The slowdown occurs only during the initial snapshot of the database. One table I tested with contains 3.4 million rows, and its replication took 2 hours to complete.
During the snapshot, the entire database was locked, and I was unable to commit data to other tables that were not being synced at the time.
My configuration (Debezium config deployed via curl request):
{
"name": "debezium-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "redacted",
"database.port": "3306",
"database.user": "redacted",
"database.password": "redacted",
"database.server.id": "54005",
"database.server.name": "redacted",
"database.include.list": "redacted",
"table.include.list": "redacted",
"database.history.consumer.security.protocol":"SSL",
"database.history.producer.security.protocol":"SSL",
"database.history.kafka.bootstrap.servers": "redacted",
"database.history.kafka.topic": "schema-changes.debezium-test",
"snapshot.mode": "when_needed",
"max.queue.size": 81290,
"max.batch.size": 20480
}
}
Kafka Connect configuration that was changed:
CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
CONNECT_OFFSET_FLUSH_TIMEOUT_MS: 60000
Questions:
1 - How can I improve performance during the initial snapshot of the database?
2 - How can I replicate a limited number of tables from a database, without locking the entire database?

If you can make sure that the database schema will not change during the snapshot process, you can avoid locking the database via the snapshot.locking.mode property: https://debezium.io/documentation/reference/1.3/connectors/mysql.html#mysql-property-snapshot-locking-mode
Also check the min.row.count.to.stream.results option (https://debezium.io/documentation/reference/1.3/connectors/mysql.html#mysql-property-min-row-count-to-stream-results); setting it properly may also change snapshot performance.
You can also increase max.batch.size together with max.queue.size beyond the values you have right now.
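A minimal sketch of applying these suggestions to the connector config before posting it. The helper name `tune_snapshot` and the numeric values are illustrative assumptions, not recommended settings. The check reflects the documented requirement that max.queue.size be larger than max.batch.size.

```python
import json

def tune_snapshot(config, batch_size=40960, queue_size=163840, lock_free=False):
    """Return a copy of a Debezium MySQL connector config with snapshot tuning applied.

    The default sizes are placeholders for illustration only.
    """
    if queue_size <= batch_size:
        raise ValueError("max.queue.size must be larger than max.batch.size")
    tuned = dict(config)
    tuned["max.batch.size"] = batch_size
    tuned["max.queue.size"] = queue_size
    if lock_free:
        # Only safe if no schema changes happen while the snapshot runs.
        tuned["snapshot.locking.mode"] = "none"
    return tuned

base = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "max.queue.size": 81290,
    "max.batch.size": 20480,
}
tuned = tune_snapshot(base, lock_free=True)
print(json.dumps(tuned, indent=2))
```

The original dict is left untouched, so the tuned and untuned configs can be compared before redeploying.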

Debezium Postgres Connector "After applying the include/exclude list filters, no changes will be captured"

I am using Debezium (a Kafka connector) to capture Postgres database changes, and I am getting an error from Debezium. Does anyone know what the error below means and can perhaps suggest a fix?
A bit more debugging info:
I tried both "schema.include.list": "banking" and "database.include.list": "banking"; neither works.
I tried debezium/connect:1.4 and it works, but debezium/connect:1.5+ does not (1.9 is the highest version available, and it fails with the same error as below).
Postgres|dbserver1|snapshot After applying the include/exclude list filters, no changes will be captured. Please check your configuration! [io.debezium.relational.RelationalDatabaseSchema]
I have verified (in the logs) that Kafka (and schema registry etc.) is running properly, that the Debezium connector seems to have started, and that Postgres is working properly with the database and tables created.
Below is the Debezium connector configuration:
{
"name": "banking-postgres-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname" : "banking",
"database.server.name": "dbserver1",
"database.include.list": "banking",
"tasks.max": "1",
"table.include.list": "public.x_account,public.x_party,public.x_product,public.x_transaction"
}
}
After many hours of debugging (plus some advice from @OneCricketeer, which pointed me in the right direction), I managed to get a configuration that works with debezium/connect:1.9. The solution was to rely on the defaults by removing these configuration items:
database.include.list
schema.include.list
This final working Debezium configuration is as follows:
{
"name": "banking-postgres-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "postgres",
"database.dbname" : "banking",
"database.server.name": "dbserver1",
"tasks.max": "1"
}
}
This does point to a minor gap in the Debezium documentation and code:
The documentation should provide valid values (examples) for either "schema.include.list" or "database.include.list" since adding the database name alone for either value does not seem to work for postgres...
It would be great to get more information from the logs... the warning message was both hard to find/understand and left one with little recourse. Since this is a situation where no data is captured, it may warrant a higher importance (log an error?)
NOTE: I offer the above as humble suggestions because I find Debezium to be an OUTSTANDING product!
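As a rough illustration of how include filters narrow the set of captured tables: the sketch below approximates the documented behavior (comma-separated regular expressions matched against the whole schema name or the whole schema.table name). It is not Debezium's actual code, and `captured_tables` is a hypothetical helper. Under these assumptions, a schema filter holding the database name (banking) matches none of the tables in the public schema.

```python
import re

def captured_tables(tables, schema_include=None, table_include=None):
    """Approximate include filtering; `tables` is a list of (schema, table) pairs."""
    def matches(patterns, name):
        # Each comma-separated pattern must match the entire name.
        return any(re.fullmatch(p.strip(), name) for p in patterns.split(","))

    kept = []
    for schema, table in tables:
        if schema_include and not matches(schema_include, schema):
            continue
        if table_include and not matches(table_include, f"{schema}.{table}"):
            continue
        kept.append(f"{schema}.{table}")
    return kept

tables = [("public", "x_account"), ("public", "x_party")]
# Database name used as a schema filter: nothing is left to capture.
print(captured_tables(tables, schema_include="banking"))
# Schema name plus a schema-qualified table name.
print(captured_tables(tables, schema_include="public", table_include="public.x_account"))
```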
I had a similar issue myself. The structure of the database server when viewed in pgAdmin is:
Servers
  Server Name (Local)
    Databases (9)
      database_1
        ...
        Schemas (1)
          public
            ...
            Tables (56)
              table_1
              ...
I have these config values set:
debezium.source.database.dbname = database_1
debezium.source.schema.include.list=public
Then I tried each of the following separately:
debezium.source.table.include.list=table_1
debezium.source.table.include.list=public.table_1
But with no success; I get the same warning in the log.

Some rows in the Postgres table can generate CDC while others cannot

I have a Postgres DB with CDC setup.
I deployed the Kafka Debezium connector 1.8.0.Final for a Postgres DB by
POST http://localhost:8083/connectors
with body:
{
"name": "postgres-kafkaconnector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
"database.server.name": "postgres_server",
"table.include.list": "public.products",
"plugin.name": "pgoutput"
}
}
I noticed something strange.
In the same table, updating some rows generates CDC events, but updating other rows does not.
The rows are very similar; only id and identifier differ.
-- Updating this row can generate CDC
UPDATE public.products
SET identifier = 'GET /api/accounts2'
WHERE id = '90c21719-ce41-4523-8ad1-ed6b21ecfaf1';
-- Updating this row cannot generate CDC
UPDATE public.products
SET identifier = 'GET /api/notworking/accounts2'
WHERE id = '22f5ebf3-9594-493d-8aa6-649d9fbcefd2';
I checked my Kafka Connect container log, and there is no error there either.
Any ideas?
Found the issue! My Kafka connector postgres-kafkaconnector was initially pointing to one DB (stage1), and then I switched it to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, both used the same storage topics, configured in the Kafka Connect worker I deployed at the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since the connector kept the same name and the same Kafka Connect storage topics while pointing at a different DB, and the table schemas in both databases are the same, it reused the offsets stored for the first DB, which made a mess.
One simple fix: when deploying the connector to test against different DBs, use different names such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2 to avoid mixing up the stored offsets.
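A minimal sketch of that naming scheme, building one registration body per stage so that each connector gets its own name (Kafka Connect stores source offsets per connector name). The helper `registration_payload` and the stage hostnames are hypothetical; the remaining values are taken from the question.

```python
import json

def registration_payload(stage, hostname, base_config):
    """Build a Kafka Connect registration body whose connector name is unique per stage."""
    config = dict(base_config)
    config["database.hostname"] = hostname
    return {"name": f"postgres-kafkaconnector-{stage}", "config": config}

base_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.port": "5432",
    "database.dbname": "my_db",
    "database.server.name": "postgres_server",
    "table.include.list": "public.products",
    "plugin.name": "pgoutput",
}

# Hostnames below are placeholders, not real servers.
payloads = [
    registration_payload("stage1", "stage1.example.com", base_config),
    registration_payload("stage2", "stage2.example.com", base_config),
]
for payload in payloads:
    print(json.dumps(payload))
```

Each printed body can then be sent with POST http://localhost:8083/connectors, as in the question.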

No CDC generated by Kafka Debezium connector for Postgres

I previously succeeded in generating CDC events from a Postgres DB.
Today, I tried to use the same steps to set up the Kafka Debezium connector for another Postgres DB.
First I ran
POST http://localhost:8083/connectors
with body:
{
"name": "postgres-kafkaconnector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
"database.server.name": "postgres_server",
"table.include.list": "public.products",
"plugin.name": "pgoutput"
}
}
which succeeded without error.
Then I ran
GET http://localhost:8083/connectors/postgres-kafkaconnector/status
to check status. It returns this result without any error:
{
"name": "postgres-kafkaconnector",
"connector": {
"state": "RUNNING",
"worker_id": "10.xx.xx.xx:8083"
},
"tasks": [
{
"id": 0,
"state": "RUNNING",
"worker_id": "10.xx.xx.xx:8083"
}
],
"type": "source"
}
However, this time, when I updated anything in the products table, no CDC events were generated.
Any ideas? Any suggestions for debugging this further would be appreciated. Thanks!
Found the issue! My Kafka connector postgres-kafkaconnector was initially pointing to one DB (stage1), and then I switched it to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, both used the same storage topics, configured in the Kafka Connect worker I deployed at the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since the connector kept the same name and the same Kafka Connect storage topics while pointing at a different DB, and the table schemas in both databases are the same, it reused the offsets stored for the first DB, which made a mess.
One simple fix: when deploying the connector to test against different DBs, use different names such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2 to avoid mixing up the stored offsets.
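As the question shows, a status of RUNNING does not prove that change events are flowing, but it is still a useful first check. A small sketch of checking the status response programmatically; `unhealthy_parts` is a hypothetical helper, and the sample is the response shown in the question.

```python
import json

def unhealthy_parts(status):
    """Return (name, state) pairs for the connector and any tasks that are not RUNNING."""
    problems = []
    if status["connector"]["state"] != "RUNNING":
        problems.append(("connector", status["connector"]["state"]))
    for task in status.get("tasks", []):
        if task["state"] != "RUNNING":
            problems.append((f"task-{task['id']}", task["state"]))
    return problems

sample = json.loads("""
{
  "name": "postgres-kafkaconnector",
  "connector": {"state": "RUNNING", "worker_id": "10.xx.xx.xx:8083"},
  "tasks": [{"id": 0, "state": "RUNNING", "worker_id": "10.xx.xx.xx:8083"}],
  "type": "source"
}
""")
print(unhealthy_parts(sample))
```

An empty list here only rules out a failed connector or task; stale offsets like the ones described above still have to be checked separately.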

Kafka Connect: streaming changes from Postgres to topics using debezium

I'm pretty new to the Kafka and Kafka Connect world. I am trying to implement CDC using Kafka (on MSK), Kafka Connect (using the Debezium connector for PostgreSQL) and an RDS Postgres instance. Kafka Connect runs in a K8s pod in our cluster deployed in AWS.
Before diving into the details of the configuration used, I'll try to summarise the problem:
Once the connector starts, it sends messages to the topic as expected (snapshot).
Once we make any change to a table (Create, Update, Delete), no messages are sent to the topic. We would expect to see messages about the changes made to the table.
My connector config looks like:
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.user": "root",
"database.dbname": "insights",
"slot.name": "cdc_organization",
"tasks.max": "1",
"column.blacklist": "password, access_key, reset_token",
"database.server.name": "insights",
"database.port": "5432",
"plugin.name": "wal2json_rds_streaming",
"schema.whitelist": "public",
"table.whitelist": "public.kafka_connect_cdc_test",
"key.converter.schemas.enable": "false",
"database.hostname": "de-test-sre-12373.cbplqnioxomr.eu-west-1.rds.amazonaws.com",
"database.password": "MYSECRETPWD",
"value.converter.schemas.enable": "false",
"name": "source-postgres",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"snapshot.mode": "initial"
}
We have tried different values for the plugin.name property: wal2json, wal2json_streaming and wal2json_rds_streaming.
There is no connection problem between the connector and the DB, as we already saw messages flowing through as soon as the connector started.
Is there a configuration issue with the connector described above that prevents us from seeing messages for new changes in the topic?
Thanks
Your connector config looks a bit confusing. I'm pretty new to Kafka as well, so I don't really know what the issue is, but this is the connector config that works for me.
{
"name":"<connector_name>",
"config": {
"connector.class":"io.debezium.connector.postgresql.PostgresConnector",
"database.server.name":"<server>",
"database.port":"5432",
"database.hostname":"<host>",
"database.user":"<user>",
"database.password":"<password>",
"database.dbname":"<dbname>",
"tasks.max":"1",
"database.history.kafka.bootstrap.servers":"localhost:9092",
"database.history.kafka.topic":"<kafka_topic_name>",
"plugin.name":"pgoutput",
"include.schema.changes":"true"
}
}
If this configuration doesn't work either, try looking through the console log; sometimes the error isn't the last thing written to the console.
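One difference between the two configs is naming: the question uses the older whitelist/blacklist property names, which Debezium 1.3 renamed to include.list/exclude.list. The sketch below translates the old suffixes; `modernize_keys` is a hypothetical helper, it assumes a Debezium version that accepts the newer names, and it is not claimed to fix the missing change events.

```python
LEGACY_SUFFIXES = {
    ".whitelist": ".include.list",
    ".blacklist": ".exclude.list",
}

def modernize_keys(config):
    """Return a copy of the config with whitelist/blacklist keys renamed."""
    renamed = {}
    for key, value in config.items():
        for old, new in LEGACY_SUFFIXES.items():
            if key.endswith(old):
                key = key[: -len(old)] + new
                break
        renamed[key] = value
    return renamed

legacy = {
    "schema.whitelist": "public",
    "table.whitelist": "public.kafka_connect_cdc_test",
    "column.blacklist": "password, access_key, reset_token",
    "plugin.name": "wal2json_rds_streaming",
}
print(modernize_keys(legacy))
```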

table.include.list configuration parameter not working in Debezium Postgres Connector

I am using Debezium Postgres connector to capture changes in postgres tables.
The documentation for this connector
https://debezium.io/documentation/reference/connectors/postgresql.html
mentions a configuration parameter
table.include.list
However, when I set the value of this parameter to 'config.abc', changes from both tables in the config schema (namely abc and def) are still streamed.
The reason I want to do this is that I want to create a separate connector for each of the 2 tables, to split the load and stream changes faster.
Is this a known issue? Is there any way to overcome it?
The same problem here (Debezium Version: 1.1.0.Final and postgresql:11.13.0). Below is my configuration:
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"transforms.unwrap.delete.handling.mode": "rewrite",
"slot.name": "debezium_planning",
"transforms": "unwrap,extractInt",
"include.schema.changes": "false",
"decimal.handling.mode": "string",
"database.schema": "partsmaster",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.converters.LongConverter",
"database.user": "************",
"database.dbname": "smartng-db",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"database.server.name": "ASLog.m3d.smartng-db",
"database.port": "***********",
"plugin.name": "pgoutput",
"column.exclude.list": "create_date, create_user, change_date, change_user",
"transforms.extractInt.field": "planning_id",
"database.hostname": "************",
"database.password": "*************",
"name": "ASLog.m3d.source-postgres-smartng-planning",
"transforms.unwrap.add.fields": "op,table,lsn,source.ts_ms",
"table.include.list": "partsmaster.planning",
"snapshot.mode": "never"
}
A change in another table partsmaster.basic is causing this connector to fail because the attribute planning_id is not available in table partsmaster.basic.
I need separate connectors for each table. table.include.list is not working, and neither is table.exclude.list.
My last resort would be to use a separate Postgres publication for each connector. But I still hope to find an immediate configuration solution here, or to be pointed to what I am missing. In a previous version I used table.whitelist without problems.
Solution:
Create a separate publication for each table:
CREATE PUBLICATION dbz_planning FOR TABLE partsmaster.planning;
Drop the previous replication slot:
select pg_drop_replication_slot('debezium_planning');
Change your connector configuration so that publication.name matches the publication created above:
{
"slot.name": "dbz_planning",
"publication.name": "dbz_planning"
}
Et voilà!
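A minimal sketch of generating the per-table pieces together, so that the publication name in the SQL and in the connector config stays identical. The helper `per_table_setup` and the `dbz_` prefix convention are assumptions for illustration. Identifiers are not quoted or validated, so this only suits simple lowercase names.

```python
def per_table_setup(schema, table, prefix="dbz"):
    """Return the CREATE PUBLICATION statement and matching connector settings for one table."""
    publication = f"{prefix}_{table}"
    sql = f"CREATE PUBLICATION {publication} FOR TABLE {schema}.{table};"
    config = {
        "slot.name": publication,         # one replication slot per connector
        "publication.name": publication,  # must match the publication created by the SQL
        "table.include.list": f"{schema}.{table}",
    }
    return sql, config

sql, config = per_table_setup("partsmaster", "planning")
print(sql)
print(config)
```

Calling it once per table (for example for partsmaster.basic as well) gives each connector its own slot and publication.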