Custom Connector for Apache Kafka - apache-kafka

I am looking to write a custom connector for Apache Kafka that connects to a SQL database to capture CDC data. I would like to write a custom connector so that I can connect to multiple databases using one connector, because all the marketplace connectors only offer one database per connector.
First question: Is it possible to connect to multiple databases using one custom connector? Also, in that custom connector, can I define which topics the data should go to?
Second question: Can I write a custom connector in .NET, or does it have to be Java? Is there an example in .NET that I can look at of a custom CDC connector for a database?

There are no .NET examples. The Kafka Connect API is Java only, and not specific to Confluent.
Source is here - https://github.com/apache/kafka/tree/trunk/connect
Dependency here - https://search.maven.org/artifact/org.apache.kafka/connect-api
"looking to write a custom connector ... to connect to SQL database to get CDC data"
You could extend or contribute to Debezium, if you really wanted this feature.
"connect to multiple databases using one custom connector"
If you mean database servers, then not really, no. Your URL would have to be unique per connector task, and there isn't an API to map a task number to a config value. If you mean one server with multiple database schemas, then I also don't think it is really possible to properly "distribute" them within a single connector with multiple tasks (which is why the database.names config in Debezium currently supports only one name).
"explored Debezium but it won't work for us because we have a microservices architecture and more than 1000 databases for many clients, and Debezium creates one topic for each table, which means it is going to be a massive architecture"
Kafka can handle thousands of topics fine. If you run the connector processes in Kubernetes, as an example, then they're centrally deployable, scalable, and configurable from there.
However, I still have concerns over you needing all databases to capture CDC events.
Maxwell was also suggested previously.
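On the "can I define which topics the data should go to" part: one option that does not need a custom connector is a routing transform in the connector config. The sketch below shows RegexRouter settings (a transform that ships with Kafka Connect) and a small Python approximation of its renaming rule so you can preview the result. The topic names, the regex and the replacement are made up for illustration, and the Python function is only my approximation of the transform, not its implementation.

```python
import re

# Settings that could be merged into a connector config. RegexRouter ships with
# Kafka Connect; the regex and replacement below are illustrative assumptions.
ROUTING = {
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": r"[^.]+\.[^.]+\.([^.]+\.[^.]+)",
    "transforms.route.replacement": "cdc.$1",
}


def route(topic, regex, replacement):
    """Approximate the renaming rule: rename only if the whole topic matches."""
    match = re.fullmatch(regex, topic)
    if match is None:
        return topic
    # Java writes group references as $1; Python templates use \g<1>.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)


def preview(topics):
    """Map each original topic name to the name it would be routed to."""
    regex = ROUTING["transforms.route.regex"]
    replacement = ROUTING["transforms.route.replacement"]
    return {topic: route(topic, regex, replacement) for topic in topics}
```

With these made-up names, preview(["srv01.tenant42.dbo.orders"]) maps the topic to "cdc.dbo.orders", so every tenant's orders table would land in one topic. Whether merging tenants like that is acceptable depends on whether your consumers can still tell the source database apart from the record itself.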

Related

AWS: What is the right way of PostgreSQL integration with Kinesis?

My aim is to be notified about DB data updates. For this reason, I want to build the following chain: PostgreSQL -> Kinesis -> Lambda.
But I am not sure how to notify Kinesis properly about DB changes.
I saw a few examples where people use PostgreSQL triggers to send data to Kinesis.
Some people use the wal2json approach.
So I have some doubts about which option to choose, which is why I am looking for advice.
You can leverage Debezium to do the same.
Debezium connectors can also be integrated into your own code using Debezium Engine, and you can add transformation or filtering logic (if you need it) before pushing the changes out to Kinesis.
Here's a link that explains the Debezium Postgres Connector.
Debezium Server (which, I believe, uses Debezium Engine internally) is another option.
As of now it supports Kinesis, Google Pub/Sub, and Apache Pulsar as targets for CDC from the databases that Debezium supports.
Here is an article that you can refer to for step-by-step configuration of Debezium Server:
https://xyzcoder.github.io/2021/02/19/cdc-using-debezium-server-mysql-kinesis.html
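For the last hop of the PostgreSQL -> Kinesis -> Lambda chain, here is a rough sketch of a Lambda handler that unpacks Debezium change events from a Kinesis batch. It assumes the events are written as JSON and that the envelope may or may not be wrapped in a "payload" field depending on how the converter is configured. The function names are mine, and what you do with each change is up to you.

```python
import base64
import json


def decode_change(record):
    """Decode one Kinesis record of a Lambda event into a small change summary."""
    # Lambda delivers the Kinesis payload base64-encoded under kinesis.data.
    raw = base64.b64decode(record["kinesis"]["data"])
    event = json.loads(raw)
    # Assumption: with schemas enabled the envelope sits under "payload",
    # otherwise the envelope fields are at the top level.
    payload = event.get("payload", event)
    return {
        "op": payload.get("op"),  # c=create, u=update, d=delete, r=snapshot read
        "table": (payload.get("source") or {}).get("table"),
        "before": payload.get("before"),
        "after": payload.get("after"),
    }


def handler(event, context):
    """Lambda entry point: return the decoded changes of the batch."""
    return [decode_change(record) for record in event.get("Records", [])]
```

In a real function you would replace the return value with whatever notification you need (for example, only reacting when "op" is "u" on a given table).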

Kafka connect confluent jdbc does not control session pool in MSSQL database

I am working with Kafka Connect and the Confluent JDBC connector. I integrated a source connector with MSSQL, and a few days ago the operations team warned us that there is a high number of sessions in the "sleeping" state in the database. I need to control those sessions, but apparently the connector (Confluent JDBC) doesn't have properties for that in its configuration.
Do you have any ideas to correct this problem?
Kafka Connect will run a minimum of one task per connector. Connectors are isolated from one another, other than sharing a runtime environment.
Therefore if you have 27 connectors sourcing from the same database, you will have a minimum of 27 connections to the database.
If you can't reduce the number of connectors (e.g. by having one connector pull from multiple tables), then the only option I think you have is to speak to your DBA about enforcing some kind of resource management on the RDBMS side. For example, on Oracle the Resource Manager option can be used for this.
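To illustrate the "one connector pulling from multiple tables" idea, here is a small sketch that folds several single-table JDBC source configs into one config per connection URL by combining their table.whitelist values. The helper name is hypothetical. It also assumes the merged tables can share the same mode and column settings, and that capping tasks.max at 1 is acceptable for throughput.

```python
def consolidate(configs):
    """Merge single-table JDBC source configs that share a connection URL."""
    merged = {}
    for cfg in configs:
        url = cfg["connection.url"]
        if url not in merged:
            # Start from the first config seen for this URL. Assumption: one
            # task is enough, which keeps the number of task connections low.
            merged[url] = {**cfg, "table.whitelist": "", "tasks.max": "1"}
        target = merged[url]
        tables = [t for t in target["table.whitelist"].split(",") if t]
        for table in cfg["table.whitelist"].split(","):
            table = table.strip()
            if table and table not in tables:
                tables.append(table)
        target["table.whitelist"] = ",".join(tables)
    return list(merged.values())
```

Each resulting dict can be submitted as the config of a single connector in place of the original ones. As far as I know each task holds its own database connection, so fewer connectors and tasks should mean fewer sessions, but verify that against your connector version.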

Is there a way to connect to multiple databases in multiple hosts using Kafka Connect?

I need to get data from Informix databases using Kafka Connect. The scenario is this: I have 50 Informix databases residing on 50 hosts. What I have understood from reading about Kafka Connect is that we need to install Kafka Connect on each host to get the data from the database residing on that host. My question is this: is there a way to create the connectors centrally for these 50 hosts, instead of installing Kafka Connect on each of them, and pull data from the databases?
Kafka Connect JDBC does not have to run on the database host, just as other JDBC clients don't, so you can have a Kafka Connect cluster that is larger or smaller than your database pool.
Informix seems to have a thing called "CDC Replication Engine for Kafka", however, which might be worth looking into, as CDC overall causes less load on the database.
You don’t need any additional software installed on the system where the Informix server is running. I am not fully clear about the question or the type of operation you are planning to do. If you are planning to set up a real-time replication scenario, then you may have to invoke the CDC API. A one-time setup of the CDC API at the server is needed, and then these APIs can be invoked using any Informix database driver API. If you are planning to read existing data from table(s) and pump it into a Kafka topic, then no additional setup is needed on the server side. You could connect to all 50 database server(s) from a single program (remotely) and then pump those records to the Kafka topic(s). Choose the Informix database driver based on the programming language you are using.
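As a sketch of the "create the connectors centrally" idea: one Kafka Connect cluster exposes a REST API, and a connector can be created or updated with a PUT to /connectors/{name}/config. The code below only builds one request per host and sends nothing. The host names, port 9088, database name, INFORMIXSERVER value and Connect URL are placeholders I made up, and it assumes the Informix JDBC driver is installed on the Connect workers so that the JDBC source connector can use it.

```python
import json
from urllib.request import Request


def informix_url(host, port, database, server):
    """Build an Informix JDBC URL (all four values here are placeholders)."""
    return f"jdbc:informix-sqli://{host}:{port}/{database}:INFORMIXSERVER={server}"


def connector_request(connect_url, host):
    """Build the REST request that creates or updates one connector for a host."""
    name = f"informix-{host}"
    config = {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": informix_url(host, 9088, "stores", "ol_informix"),
        "mode": "bulk",
        "topic.prefix": f"{host}.",
        "tasks.max": "1",
    }
    return Request(
        f"{connect_url}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_requests(connect_url, hosts):
    """One request per database host, all aimed at the same Connect cluster."""
    return [connector_request(connect_url, host) for host in hosts]
```

Passing each request to urllib.request.urlopen would submit it to the Connect cluster. Credentials and a more suitable mode than "bulk" are left out on purpose, since they depend on your tables.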

Kafka and IIDR CDC

I am trying to build a CDC pipeline using DB2 -> IBM CDC -> Kafka,
and I am trying to figure out the right way to set this up.
I tried the following:
1. Set up a 3-node Kafka cluster on Linux, on-prem.
2. Installed the IIDR CDC software on Linux, on-prem, using the setup-iidr-11.4.0.1-5085-linux-x86.bin file. The CDC instance is up and running.
The various online documentation suggests installing the 'IIDR management console' to configure the source datastore, the CDC server configuration, and the Kafka subscription configuration to build the pipeline.
Currently I do not have the management console installed.
A few questions on this:
1. Is there any alternative to the IBM CDC management console for setting up the Kafka-CDC pipeline?
2. How can I get the IIDR management console? And if we install it on our local Windows desktop and try to connect to CDC/Kafka, which are on remote Linux servers, will it work?
3. Is there any other method to set up data ingestion from IIDR CDC to Kafka?
I am fairly new to CDC/IIDR, please help!
I own the development of the IIDR Kafka target for our CDC Replication product.
Management Console is the best way to set up the subscription initially. You can install it on a Windows box.
Technically, I believe you can also use our scripting language, called CHCCLP, to set up a subscription. But I recommend using the GUI.
Here are links to our resources on our IIDR (CDC) Kafka Target. Search for the "Kafka" section.
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/IIDR%20Wiki
An example of setting up a subscription and replicating is this video
https://ibm.box.com/s/ur8jokg6tclsx5fcav5g86a3n57mqtd5
Management Console and Access Server can be obtained from IBM Fix Central.
I have installed MC/Access Server on my VM and on my personal Windows box to use it against my Linux VMs. You will need network connectivity, of course.
You can definitely follow up with our Support and they'll be able to sort you out. Plus, we have docs in our knowledge centre on MC, starting here: https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.mcadminguide.doc/concepts/overview_of_cdc.html
You'll find our Kafka target is very flexible: it comes with five different formats for writing data into Kafka, and you can choose to capture data in an audit format or in the Kafka-compaction-compatible "key, null for a delete" method.
Additionally you can even use the product to write several records to several different topics in several formats from a single insert operation. This is useful if some of your consumer apps want JSON and others Avro binary. Additionally you can use this to put all the data to more secure topics, and write out just some of the data to topics that more people have access to.
We even have customers who encrypt columns in flight when replicating.
Finally, the product's transformations can be parallelized even if you choose to use only one producer to write out data.
Actually, one more finally: we additionally provide the option to use a special consumer that provides database ACID semantics for data written into Kafka and shredded across topics and partitions. It re-orders the data. We call it the transactionally consistent consumer. It provides operation order, bookmarks for restarting applications, and parallelism for performance, while still giving ordered, exactly-once, deduplicated consumption of data.
From my talk at the Kafka Summit...
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions
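For readers new to the compaction-compatible "key, null for a delete" convention mentioned above, here is a generic sketch of how a consumer could rebuild the current table state from such a stream. It is not IIDR code and does not use any IIDR API. The record shape (a key plus a value that is None for deletes) is an assumption made for illustration.

```python
def materialize(records):
    """Replay (key, value) records into the latest state per key.

    A value of None is treated as a tombstone: the row for that key was
    deleted, so the key is removed from the state.
    """
    state = {}
    for key, value in records:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

Replaying an insert, an update and a delete for the same key leaves that key absent from the result, which is the same end state log compaction is meant to preserve.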

Couchdb changes to Apache Kafka

I want to have all of the changes to a CouchDB database in Kafka at application run time, as they arrive. Is there any reliable existing tool for that?
You could try Kafka Connect. Also, Confluent Platform provides a long list of different connectors for Kafka Connect.
I'm not a CouchDB user, but you may choose one of the applicable source connectors here or create your own Kafka CouchDB source connector.
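If you end up writing your own, the core of it would probably be reading CouchDB's continuous _changes feed and forwarding each change. Below is a rough sketch of just that part. The send callable stands in for whatever Kafka producer you use (it is a placeholder, not a real client API), and the function names are mine.

```python
import json


def changes_url(base_url, database, since="now"):
    """URL of the continuous changes feed, including full documents."""
    return f"{base_url}/{database}/_changes?feed=continuous&include_docs=true&since={since}"


def forward_changes(lines, send):
    """Parse continuous-feed lines and pass each change to send(key, value).

    Returns the last sequence seen so the caller can resume from it.
    """
    last_seq = None
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line:  # blank lines are heartbeats
            continue
        change = json.loads(line)
        if "id" not in change:  # e.g. a closing {"last_seq": ...} line
            last_seq = change.get("last_seq", last_seq)
            continue
        # Deleted documents are forwarded as a None value for their key.
        value = None if change.get("deleted") else json.dumps(change.get("doc"))
        send(change["id"], value)
        last_seq = change["seq"]
    return last_seq
```

You would feed it the response of an HTTP request to changes_url(...) line by line and store the returned sequence somewhere durable, so a restart can pass it back as since.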