How can I increase tasks.max for the Debezium MySQL connector? - apache-kafka

I tried setting the following property in the configuration of the Debezium MySQL connector:
'tasks.max=50'
But the connector logs show the error below:
'java.lang.IllegalArgumentException: Only a single connector task may be started'
I am using MSK Connect with a custom Debezium plugin and Debezium version 1.8.

It's not possible.
The database binlog must be read sequentially, by a single task.
Run multiple connectors for different tables if you want to distribute the workload.
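As a minimal sketch of the "multiple connectors" suggestion, the Python helper below splits a list of tables into groups and builds one single-task Debezium MySQL connector definition per group. The function name, the base name, the starting server id 5400 and the table names are hypothetical; the property keys are the Debezium 1.x MySQL connector settings, and each connector is assumed to need its own database.server.id, database.server.name and database history topic.

```python
import json


def split_connectors(base_name, tables, n_connectors, common):
    """Build one single-task connector definition per group of tables.

    `common` holds the settings shared by every connector (host, port,
    credentials, history bootstrap servers, ...).
    """
    # Round-robin the tables into at most n_connectors groups.
    groups = [tables[i::n_connectors] for i in range(n_connectors)]
    configs = []
    for i, group in enumerate(g for g in groups if g):
        configs.append({
            "name": f"{base_name}-{i}",
            "config": {
                **common,
                "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                "tasks.max": "1",  # the connector rejects more than one task
                # Assumption: ids starting at 5400 are free on the MySQL server.
                "database.server.id": str(5400 + i),
                "database.server.name": f"{base_name}{i}",
                "table.include.list": ",".join(group),
                "database.history.kafka.topic": f"schema-changes.{base_name}{i}",
            },
        })
    return configs


if __name__ == "__main__":
    shared = {"database.hostname": "mysql", "database.port": "3306"}
    tables = ["inventory.orders", "inventory.customers", "inventory.products"]
    print(json.dumps(split_connectors("inv", tables, 2, shared), indent=2))
```

Each generated definition could then be submitted as a separate connector. Note that every connector still reads the binlog on its own and filters for its tables, so this spreads the processing rather than the binlog read.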

Related

Custom Connector for Apache Kafka

I am looking to write a custom connector for Apache Kafka to connect to a SQL database and get CDC data. I would like to write a custom connector so that I can connect to multiple databases using one connector, because all the marketplace connectors only offer one database per connector.
First question: Is it possible to connect to multiple databases using one custom connector? Also, in that custom connector, can I define which topics the data should go to?
Second question: Can I write a custom connector in .NET, or does it have to be Java? Is there an example of a custom CDC connector for a database in .NET that I can look at?
There are no .NET examples. The Kafka Connect API is Java only, and not specific to Confluent.
Source is here - https://github.com/apache/kafka/tree/trunk/connect
Dependency here - https://search.maven.org/artifact/org.apache.kafka/connect-api
"looking to write a custom connector ... to connect to SQL database to get CDC data"
You could extend or contribute to Debezium, if you really wanted this feature.
"connect to multiple databases using one custom connector"
If you mean database servers, then not really, no. Your URL would have to be unique per connector task, and there isn't an API to map a task number to a config value. If you mean one server and multiple database schemas, then I also don't think it is really possible to properly "distribute" them within a single connector with multiple tasks (which is why the database.names config in Debezium currently supports only one name).
"explored Debezium but it won't work for us because we have a microservices architecture with more than 1000 databases for many clients, and Debezium creates one topic for each table, which means it is going to be a massive architecture"
Kafka can handle thousands of topics fine. If you run the connector processes in Kubernetes, as an example, then they're centrally deployable, scalable, and configurable from there.
However, I still have concerns over you needing all databases to capture CDC events.
It was also previously suggested that you use Maxwell.

Kafka connect confluent jdbc does not control session pool in MSSQL database

I am working with Kafka Connect and the Confluent JDBC connector. I integrated a source connector with MSSQL, and a few days ago the operations team warned us that there is a high number of sessions in the "sleeping" state in the database. I need to control those sessions, but apparently the connector (Confluent JDBC) doesn't have properties for that in its configuration.
Do you have any ideas to correct this problem?
Kafka Connect will run a minimum of one task per connector. Connectors are isolated from each other, apart from sharing a runtime environment.
Therefore, if you have 27 connectors sourcing from the same database, you will have a minimum of 27 connections to the database.
If you can't reduce the number of connectors (e.g. by having one connector pull from multiple tables), then the only option I think you have is to speak to your DBA about enforcing some kind of resource management on the RDBMS side. For example, on Oracle the Resource Manager option can be used for this.
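To make the "one connector pulling from multiple tables" idea concrete, here is a small Python sketch that builds a single JDBC source connector definition covering several tables, plus a helper for the lower bound on connections described above. The function names, the connection URL, the table names and the mode settings are hypothetical; table.whitelist, connection.url and tasks.max are existing JDBC source connector settings, and the actual number of sessions the connector opens may be higher than the lower bound.

```python
JDBC_SOURCE = "io.confluent.connect.jdbc.JdbcSourceConnector"


def consolidated_config(name, connection_url, tables, extra=None):
    """Return one connector definition that polls every table in `tables`."""
    config = {
        "connector.class": JDBC_SOURCE,
        "connection.url": connection_url,
        "table.whitelist": ",".join(tables),
        "tasks.max": "1",
    }
    # Caller-supplied settings (mode, column names, ...) override the defaults.
    config.update(extra or {})
    return {"name": name, "config": config}


def min_connections(tasks_per_connector):
    """Lower bound on database connections: at least one per running task."""
    return sum(max(1, tasks) for tasks in tasks_per_connector)


if __name__ == "__main__":
    # Hypothetical URL and tables, for illustration only.
    definition = consolidated_config(
        "mssql-source",
        "jdbc:sqlserver://db-host:1433;databaseName=sales",
        ["dbo.orders", "dbo.customers"],
        {"mode": "timestamp", "timestamp.column.name": "updated_at"},
    )
    print(definition)
    print(min_connections([1] * 27), "vs", min_connections([1]))
```

Comparing min_connections for 27 single-task connectors against one consolidated connector shows where the session count comes from; whether consolidation is practical depends on how different the per-table settings need to be.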

Configuring Kafka Connect Postgres Debezium CDC plugin

I am trying to use Kafka Connect to read changes in a Postgres DB.
I have Kafka running on my local system and I want to use the Kafka Connect API in standalone mode to read the Postgres server DB changes.
connect-standalone.sh connect-standalone.properties dbezium.properties
I would appreciate it if someone could help me set up the configuration properties for the Debezium Postgres CDC connector.
https://www.confluent.io/connector/debezium-postgresql-cdc-connector/
I am following the documentation below to construct the properties:
https://debezium.io/docs/connectors/postgresql/#how-the-postgresql-connector-works
The name of the Kafka topics takes by default the form
serverName.schemaName.tableName, where serverName is the logical name
of the connector as specified with the database.server.name
configuration property
Here is what I have come up with for dbezium.properties:
name=cdc_demo
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
plugin.name=wal2json
slot.name=debezium
slot.drop_on_stop=false
database.hostname=localhost
database.port=5432
database.user=postgress
database.password=postgress
database.dbname=test
time.precision.mode=adaptive
database.sslmode=disable
Let's say I create a PG schema named demo and a table named suppliers.
Do I then need to create a topic named test.demo.suppliers so that this plugin can push the data to it?
Also, can someone suggest a Docker image that has the Postgres server together with a suitable replication plugin such as wal2json? I am having a hard time configuring Postgres and the CDC plugin myself.
Check out the tutorial with associated Docker Compose and sample config.
The topic you've come up with sounds correct, but if you have your Kafka broker configured to auto-create topics (which is the default behaviour IIRC) then it will get created for you and you don't need to pre-create it.
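For illustration, the naming rule quoted in the question (serverName.schemaName.tableName) can be sketched in Python as below. The helper names are hypothetical. Note that the prefix comes from database.server.name rather than database.dbname, and the properties posted in the question do not set it, so this sketch assumes database.server.name=test has been added to the file.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def topic_for(props, schema, table):
    """Default topic name: <database.server.name>.<schema>.<table>."""
    server = props["database.server.name"]
    return f"{server}.{schema}.{table}"


if __name__ == "__main__":
    # Assumption: database.server.name=test was added to dbezium.properties.
    sample = """
    name=cdc_demo
    database.dbname=test
    database.server.name=test
    """
    print(topic_for(parse_properties(sample), "demo", "suppliers"))
```

With that assumption the result matches the test.demo.suppliers topic from the question; with a different logical server name the prefix changes accordingly.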

kafka-connect jdbc distributed mode

We are building a Kafka Connect application using the JDBC source connector in timestamp+incrementing mode. We tried standalone mode and it works as expected. Now we would like to switch to distributed mode.
When we have a single Hive table as a source, how will the tasks be distributed among the workers?
The problem we faced is that when we run the application as multiple instances, every instance queries the table and fetches the same rows again. Will parallelism work in this case? If so,
how will the tasks coordinate with each other on the current status of the table?
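For a single source table, the parameter tasks.max makes no difference to the kafka-connect-jdbc source connector: it assigns whole tables to tasks, so one table is read by at most one task. There is no occurrence of this property name in the source code of the JDBC connector project, because it is a Kafka Connect framework setting.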
Consult JDBC source config options for the available properties for this connector.
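As a rough model of why extra tasks do not help with one table, the Python sketch below assumes that a source connector hands out whole tables to tasks and never splits a single table between them. This is a simplified illustration with a hypothetical function name, not the connector's actual code, and the real grouping order may differ.

```python
def assign_tables(tables, tasks_max):
    """Distribute whole tables over at most `tasks_max` tasks.

    Assumption: a table is never split, so the number of useful tasks
    is capped by the number of tables.
    """
    n_tasks = min(len(tables), tasks_max)
    groups = [[] for _ in range(n_tasks)]
    for i, table in enumerate(tables):
        groups[i % n_tasks].append(table)
    return groups


if __name__ == "__main__":
    # One table: a single task regardless of tasks_max.
    print(assign_tables(["orders"], 8))
    # Three tables, two tasks.
    print(assign_tables(["a", "b", "c"], 2))
```

Under this model, raising tasks_max for a single table still yields one task, which is consistent with seeing no effect from the setting.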

Postgres streaming using JDBC Kafka Connect

I am trying to stream changes in my Postgres database using the Kafka Connect JDBC Connector. I am running into issues upon startup as the database is quite big and the query dies every time as rows change in between.
What is the best practice for starting off the JDBC Connector on really huge tables?
Assuming you can't pause the workload on the database that you're streaming the contents in from to allow the initialisation to complete, I would look at Debezium.
In fact, depending on your use case, I would look at Debezium regardless :) It lets you do true CDC against Postgres (and MySQL and MongoDB), and it is a Kafka Connect plugin just like the JDBC Connector, so you retain all the benefits of that.