Configure Debezium CDC -> Kafka -> JDBC Sink (multiple tables)

We have around 100 tables in a SQL Server DB (the application DB) which need to be synced to another SQL Server DB (for analytics) in near real time.
Future use case: scale the proof of concept to 30 source DBs feeding one destination DB (for analytics) in near real time.
I am thinking of using one sink connector, or a few sink connectors, for multiple tables. Please let me know if this is a good idea.
However, I am not sure how to configure the sink to cater for multiple tables, especially since each table might have its own primary key. The internet seems to have only very simple sink connector examples that don't address complex use cases.
Debezium CDC (source) config
{ "name": "wwi",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.dbname": "************************",
"database.history": "io.debezium.relational.history.MemoryDatabaseHistory",
"database.hostname": "**********************",
"database.password": "**********************",
"database.port": "1433",
"database.server.name": "******",
"database.user": "*********",
"decimal.handling.mode": "string",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true",
"snapshot.mode": "schema_only",
"table.include.list": "Sales.Orders,Warehouse.StockItems",
"tasks.max": "1",
"tombstones.on.delete": "false",
"transforms": "route,unwrap",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter.schemas.enable": "true",
"value.convertor": "org.apache.kafka.connect.json.JsonConverter"
}
}
JDBC Sink config
{
  "name": "sqlsinkcon",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "tasks.max": "1",
    "auto.evolve": "true",
    "connection.user": "********",
    "auto.create": "true",
    "connection.url": "jdbc:sqlserver://************",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "OrderID",
    "db.name": "kafkadestination"
  }
}

The sink connector writes one table per consumed topic. topics or topics.regex can be used to consume multiple topics with a single connector.
Regarding scalability (or at least fault tolerance), I prefer one sink connector with one topic (therefore writing to one table). Otherwise, if you consume multiple topics and the connector fails, it can potentially take down all the task threads due to the consumer rebalancing.
Also, using JSON / plaintext formats in Kafka isn't the most efficient use of network bandwidth. I'd suggest a binary format such as Avro or Protobuf.
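For illustration only, here is a minimal sketch (not a tested configuration) of a single JDBC sink consuming several of the routed Debezium topics via topics.regex; the regex, converters and masked connection values are placeholders. With pk.mode set to record_key and pk.fields left empty, the connector takes each table's primary key from the fields of that table's record key (Debezium keys carry the source table's PK columns), so tables with different key columns can share one connector; table.name.format maps each topic to its own destination table.
{
  "name": "sqlsinkcon-multi",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics.regex": "Orders|StockItems",
    "table.name.format": "${topic}",
    "connection.url": "jdbc:sqlserver://************",
    "connection.user": "********",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "",
    "auto.create": "true",
    "auto.evolve": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "tasks.max": "1"
  }
}
The Avro converters assume the source connector is also switched to Avro with a Schema Registry, as suggested above; if the topics stay in JSON with schemas enabled, keep JsonConverter on the sink instead.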

Related

Slow down Kafka Connect JDBC source connector - control throughput

I want to control the throughput of a JDBC source Kafka connector.
I have a lot of data stored in a PostgreSQL table and I want to ingest it into a Kafka topic. However, I would like to avoid a huge "peak" in the ingestion.
My config looks like:
{
  "name": "my-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "topic.prefix": "my-topic",
    "connection.url": "jdbc:postgresql://localhost:5432/my-db",
    "connection.user": "user",
    "connection.password": "password",
    "mode": "timestamp",
    "timestamp.column.name": "time",
    "poll.interval.ms": "10000",
    "batch.max.rows": "100",
    "query": "SELECT * FROM my-table",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}
I guess I would need to play with these parameters:
poll.interval.ms
batch.max.rows
but I don't understand how they impact the throughput; with these values it goes really fast.
How can I configure it properly to slow it down?
Edit: the idea is similar to KIP-731 and its proposal to limit the record rate.
You currently have batch.max.rows=100, which is the default:
https://docs.confluent.io/kafka-connectors/jdbc/current/source-connector/source_config_options.html#connector
Once 100 rows are included in the batch, the connector sends the batch to the Kafka topic. If you want to increase throughput, you should try increasing this value.
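As a rough, untested sketch of the opposite direction (values are illustrative only), the two knobs mentioned in the question can be tightened so that each poll publishes fewer rows and new queries run less often; note that neither setting gives a hard records-per-second cap, which is what KIP-731 proposes:
{
  "name": "my-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "topic.prefix": "my-topic",
    "connection.url": "jdbc:postgresql://localhost:5432/my-db",
    "connection.user": "user",
    "connection.password": "password",
    "mode": "timestamp",
    "timestamp.column.name": "time",
    "poll.interval.ms": "60000",
    "batch.max.rows": "10",
    "query": "SELECT * FROM my-table",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}
A large existing backlog may still be drained batch after batch once a query has run, so for a strict rate limit you would likely need external throttling such as broker-side client quotas.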

No CDC generated by Kafka Debezium connector for Postgres

I previously succeeded in generating CDC events from a Postgres DB.
Today, I used the same steps to try to set up a Kafka Debezium connector for another Postgres DB.
First I ran
POST http://localhost:8083/connectors
with body:
{
  "name": "postgres-kafkaconnector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "example.com",
    "database.port": "5432",
    "database.dbname": "my_db",
    "database.user": "xxx",
    "database.password": "xxx",
    "database.server.name": "postgres_server",
    "table.include.list": "public.products",
    "plugin.name": "pgoutput"
  }
}
which succeeded without error.
Then I ran
GET http://localhost:8083/connectors/postgres-kafkaconnector/status
to check the status. It returned this result without any error:
{
  "name": "postgres-kafkaconnector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "10.xx.xx.xx:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "10.xx.xx.xx:8083"
    }
  ],
  "type": "source"
}
However, this time, when I update anything in the products table, no CDC events are generated.
Any idea? Any suggestions to help debug this further would be appreciated. Thanks!
Found the issue! My Kafka connector postgres-kafkaconnector was initially pointing to one DB (stage1), and then I switched it to another DB (stage2) by updating
"database.hostname": "example.com",
"database.port": "5432",
"database.dbname": "my_db",
"database.user": "xxx",
"database.password": "xxx",
However, both setups used the same configuration properties in the Kafka Connect deployment I set up at the very beginning:
config.storage.topic
offset.storage.topic
status.storage.topic
Since the connector with the different DB config shared the same storage topics above, and the database table schemas are the same, things got into a mess because the same Kafka offsets were reused.
One simple fix when deploying a Kafka connector to test against different DBs is to use a different connector name per environment, such as postgres-kafkaconnector-stage1 and postgres-kafkaconnector-stage2, to avoid mixing up offsets.
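For example (using the REST endpoints already shown above; the stage2 hostname and server name are illustrative), you could remove the old registration and re-register the connector under a stage-specific name so it gets its own offsets; giving each environment its own database.server.name also keeps the topic names separate:
DELETE http://localhost:8083/connectors/postgres-kafkaconnector
POST http://localhost:8083/connectors
with body:
{
  "name": "postgres-kafkaconnector-stage2",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "stage2.example.com",
    "database.port": "5432",
    "database.dbname": "my_db",
    "database.user": "xxx",
    "database.password": "xxx",
    "database.server.name": "postgres_server_stage2",
    "table.include.list": "public.products",
    "plugin.name": "pgoutput"
  }
}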

Is there any way to use MongoSourceConnector for multiple database with single kafka topic?

I am using MongoSourceConnector to connect a Kafka topic with a MongoDB collection. For a single database with a single Kafka topic it works fine, but is there any way I could connect multiple Mongo databases to a single Kafka topic?
If you are running Kafka Connect in distributed mode, you can create another connector config file with the above-mentioned config.
I am not really sure about multiple databases going into a single Kafka topic, but you can certainly listen to the change streams of multiple databases and push the data to topics. Since topic creation depends on database_name.collection_name, you will end up with more topics.
You can provide a regex in the pipeline to listen to multiple databases:
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-names_.*/}},{\"ns.coll\":{\"$regex\":/^collection_name$/}}]}}]"
Here is the complete Kafka connector configuration.
Mongo to Kafka source connector
{
  "name": "mongo-to-kafka-connect",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "publish.full.document.only": "true",
    "tasks.max": "3",
    "key.converter.schemas.enable": "false",
    "topic.creation.enable": "true",
    "poll.await.time.ms": 1000,
    "poll.max.batch.size": 100,
    "topic.prefix": "any prefix for topic name",
    "output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
    "connection.uri": "mongodb://<username>:<password>@ip:27017,ip:27017,ip:27017,ip:27017/?authSource=admin&replicaSet=xyz&tls=true",
    "value.converter.schemas.enable": "false",
    "copy.existing": "true",
    "topic.creation.default.replication.factor": 3,
    "topic.creation.default.partitions": 3,
    "topic.creation.compacted.cleanup.policy": "compact",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "mongo.errors.log.enable": "true",
    "heartbeat.interval.ms": 10000,
    "pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-names_.*/}},{\"ns.coll\":{\"$regex\":/^collection_name$/}}]}}]"
  }
}
You can get more details from official docs.
https://www.mongodb.com/docs/kafka-connector/current/source-connector/
https://docs.confluent.io/platform/current/connect/index.html

Kafka MongoDB Sink only one record from table

I am using ksqlDB, where I have created a table from a stream. When I fire a select query on that table it gives me all the records properly. Now I want to sink that table into MongoDB, and I am able to create a sink from the Kafka table to MongoDB. But somehow it sinks only one record into MongoDB, whereas the table has 100 records. Below is my MongoDB sink connector.
{
  "name": "MongoSinkConnectorConnector_1",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "topics": "FEEDS",
    "connection.uri": "mongodb://xxx:xxx@x.x.x.x:27017/",
    "database": "xxx",
    "max.num.retries": "1000000",
    "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
    "value.projection.type": "allowlist",
    "value.projection.list": "id",
    "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
    "buffer.capacity": "20000",
    "value.converter.schema.registry.url": "http://x.x.x.x:8081",
    "key.converter.schemas.enable": "false",
    "insert.mode": "upsert"
  }
}
I am not able to understand the reason behind this. Any help appreciated. Thank you.
You can set the "batch.size" property in the sink connector properties, and you can also use a better write model strategy, which you can read about in the MongoDB Source Connector official documentation: https://docs.confluent.io/cloud/current/connectors/cc-mongo-db-source.html

table.include.list configuration parameter not working in Debezium Postgres Connector

I am using Debezium Postgres connector to capture changes in postgres tables.
The documentation for this connector
https://debezium.io/documentation/reference/connectors/postgresql.html
mentions a configuration parameter
table.include.list
However, when I set the value of this parameter to 'config.abc', changes from both tables in the config schema (namely abc and def) still get streamed.
The reason I want to do this is that I want to create 2 separate connectors, one for each of the 2 tables, to split the load and get faster change data streaming.
Is this a known issue? Is there any way to overcome this?
I have the same problem here (Debezium version 1.1.0.Final and PostgreSQL 11.13.0). Below is my configuration:
{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "transforms.unwrap.delete.handling.mode": "rewrite",
  "slot.name": "debezium_planning",
  "transforms": "unwrap,extractInt",
  "include.schema.changes": "false",
  "decimal.handling.mode": "string",
  "database.schema": "partsmaster",
  "transforms.unwrap.drop.tombstones": "false",
  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter": "org.apache.kafka.connect.converters.LongConverter",
  "database.user": "************",
  "database.dbname": "smartng-db",
  "transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
  "database.server.name": "ASLog.m3d.smartng-db",
  "database.port": "***********",
  "plugin.name": "pgoutput",
  "column.exclude.list": "create_date, create_user, change_date, change_user",
  "transforms.extractInt.field": "planning_id",
  "database.hostname": "************",
  "database.password": "*************",
  "name": "ASLog.m3d.source-postgres-smartng-planning",
  "transforms.unwrap.add.fields": "op,table,lsn,source.ts_ms",
  "table.include.list": "partsmaster.planning",
  "snapshot.mode": "never"
}
A change in another table, partsmaster.basic, is causing this connector to fail because the attribute planning_id is not available in partsmaster.basic.
I need separate connectors for each table. table.include.list is not working, and neither is table.exclude.list.
My last resort would be to use a separate Postgres publication for each connector, but I still hope to find an immediate configuration solution here, or to be pointed at what I am missing. In a previous version I used table.whitelist without problems.
Solution:
Create a separate publication for each table:
CREATE PUBLICATION dbz_planning FOR TABLE partsmaster.planning;
Drop the previous replication slot:
select pg_drop_replication_slot('debezium_planning');
Then change your connector configuration:
{
  "slot.name": "dbz_planning",
  "publication.name": "dbz_planning"
}
et voilà!
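Applied to the other table, the same pattern would look roughly like this sketch (connector and publication names are illustrative, connection values masked as above): a dedicated publication plus a second connector with its own replication slot and include list:
CREATE PUBLICATION dbz_basic FOR TABLE partsmaster.basic;
{
  "name": "ASLog.m3d.source-postgres-smartng-basic",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "slot.name": "dbz_basic",
    "publication.name": "dbz_basic",
    "table.include.list": "partsmaster.basic",
    "snapshot.mode": "never",
    "database.hostname": "************",
    "database.port": "***********",
    "database.dbname": "smartng-db",
    "database.user": "************",
    "database.password": "*************",
    "database.server.name": "ASLog.m3d.smartng-db"
  }
}
Each connector then streams only its own table through its own slot and publication, so a change in partsmaster.basic can no longer break the planning connector.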