Streaming database (MariaDB) changes into another database table using Apache Kafka and the Debezium connector

What I am aiming to do is stream data changes into a new database table using Apache Kafka along with the Debezium connectors, but I don't have the slightest idea how to achieve it. I know how to start Kafka and ZooKeeper, create topics, and subscribe to a topic, but I am unfamiliar with all the next steps. How can I achieve data streaming and capture that data into a new database table using Change Data Capture (CDC)?

Debezium only sources data into Kafka; it won't read from Kafka or write to a new database.
You can refer to an old blog post of theirs that uses the JDBC Sink Kafka Connector to write to a new server:
https://debezium.io/blog/2017/09/25/streaming-to-another-database/
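To make the end-to-end flow concrete, here is a minimal sketch (not a drop-in config) of registering both connectors against the Kafka Connect REST API, assumed here to run on localhost:8083. The connector names, hostnames, credentials, table and topic names are placeholders, and the exact Debezium property names vary by version (older releases use database.server.name and database.history.*, newer ones topic.prefix and schema.history.internal.*):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnectors {

    // Debezium source: captures row changes from MariaDB/MySQL into Kafka topics.
    // Connector class and property names are version-dependent placeholders.
    static final String SOURCE_CONFIG = """
        {
          "name": "inventory-source",
          "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mariadb",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "secret",
            "database.server.id": "184054",
            "topic.prefix": "dbserver1",
            "table.include.list": "inventory.customers",
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": "schema-changes.inventory"
          }
        }""";

    // JDBC sink: reads the change topic and writes rows into the target database.
    static final String SINK_CONFIG = """
        {
          "name": "inventory-sink",
          "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "connection.url": "jdbc:mariadb://target-db:3306/inventory_copy?user=app&password=secret",
            "topics": "dbserver1.inventory.customers",
            "insert.mode": "upsert",
            "pk.mode": "record_key",
            "auto.create": "true"
          }
        }""";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (String body : new String[] {SOURCE_CONFIG, SINK_CONFIG}) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8083/connectors"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}
```

As the linked post shows, you would typically also add Debezium's unwrap transform (ExtractNewRecordState, called UnwrapFromEnvelope in that 2017 post) so the sink receives flat rows rather than the full change-event envelope, and make sure both connectors agree on the key/value converters.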

Related

Is it a good practice to use the existing topic for multiple connectors?

I am using the Debezium PostgreSQL connector to get the users table into a Kafka topic.
I have a JDBC Sink Connector that then reads the data from the topic and pushes it into its own database.
Now I need a subset of the data for another microservice's database, so I am planning to write another JDBC Sink Connector.
The question: is it good practice to use the existing users table topic? If yes, how can I make sure the new JDBC connector gets a snapshot of the entire users table?
 
If Debezium snapshotted the table and data hasn't been lost in the topic due to retention, then that's what any sink or other consumer will read.
Each sink connector with a unique name tracks its own offsets on the topic. Nothing bad will happen with multiple consumers reading the same topic; this is how Kafka is intended to be used.
You may need to set consumer.auto.offset.reset=earliest on the Connect worker so the sink reads from the start of the topic.
To get a subset of fields, you'll need the ReplaceField transform - https://docs.confluent.io/platform/current/connect/transforms/replacefield.html#replacefield
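A hedged sketch of what that second sink connector's configuration might look like, reusing the existing users topic and keeping only a subset of fields with ReplaceField. The topic, connection URL and field names are placeholders, and older Connect versions spell the option whitelist instead of include:

```java
import java.util.Map;

public class SubsetSinkConfig {
    public static void main(String[] args) {
        // Config for a second JDBC sink on the same users topic.
        // POST this (as JSON, under a new connector name) to the Connect REST API.
        // Any unwrap transform / converter settings your existing users sink uses
        // would be needed here too.
        Map<String, String> config = Map.of(
            "connector.class", "io.confluent.connect.jdbc.JdbcSinkConnector",
            "topics", "pg-server.public.users",      // the existing users topic
            "connection.url", "jdbc:postgresql://users-svc-db:5432/users_subset?user=app&password=secret",
            "insert.mode", "upsert",
            "pk.mode", "record_key",
            // Keep only the fields this microservice needs
            // (older Connect versions use "whitelist" instead of "include").
            "transforms", "subset",
            "transforms.subset.type", "org.apache.kafka.connect.transforms.ReplaceField$Value",
            "transforms.subset.include", "id,email,created_at",
            // Per-connector override so this sink starts from the beginning of the topic;
            // requires connector.client.config.override.policy=All on the worker,
            // otherwise set consumer.auto.offset.reset=earliest in the worker config.
            "consumer.override.auto.offset.reset", "earliest"
        );
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Give the connector a new name so it gets its own consumer group and reads the topic from the beginning independently of the first sink.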

Kafka Connect topic message modification before writing to the sink database

I have set up Kafka Connect between my source and destination. For example,
I have a table in MySQL which I want to send to MongoDB; I have set up MySQL as the source and MongoDB as the sink, and it's working fine.
My MySQL table has a column called 'download_link', which holds an S3 download link for a PDF. With Kafka Connect as set up now, that link goes straight to MongoDB. What I need is: after I receive the message from the MySQL source, I want to execute Python code which downloads the PDF file and extracts the text from it, so that what goes into MongoDB isn't the link but the extracted text. Is it possible to do something like this?
Can someone provide some resources on how I can achieve this?
"I want to execute a python code ..."
Kafka Connect cannot do this.
Since you tagged apache-kafka-streams, refer to this post - Does Kafka python API support stream processing?
You would run your Python stream processor after the source connector, send the data to new topic(s), then use a Connect sink on those topics.
Keep in mind that Kafka messages have a maximum size, so extracting large PDF text blobs and persisting them in the topic(s) might not be the best idea. Instead, you could have the MongoDB writer application download the PDF before writing to the database, but as stated, you'd need to write Java to use Kafka Connect for that. Otherwise, you're left with a separate Python process that consumes from Kafka and writes to Mongo.
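For illustration only, here is what that middle consume-transform-produce step can look like; the same structure applies to a Python client (e.g. confluent-kafka), and the topic names and the downloadAndExtractText helper below are hypothetical:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class PdfTextEnricher {

    // Hypothetical helper: download the PDF from the link and extract its text
    // (e.g. with a PDF library); not implemented here.
    static String downloadAndExtractText(String downloadLink) {
        return "extracted text for " + downloadLink;
    }

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "pdf-enricher");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("mysql.documents"));   // topic written by the source connector
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Assumes the value is a plain string holding the download link;
                    // a real record would be JSON/Avro and need proper parsing.
                    String enriched = downloadAndExtractText(record.value());
                    // Publish to a new topic that the MongoDB sink connector reads instead.
                    producer.send(new ProducerRecord<>("documents.with-text", record.key(), enriched));
                }
            }
        }
    }
}
```

The MongoDB sink connector would then be pointed at the documents.with-text topic rather than the original one.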

What should the Kafka Serde configuration be when we use Kafka Streams?

We are using a JDBC source connector to sync data from a table to a Kafka topic (call this Topic 1). Since this captures only inserts and updates, we have added a trigger to capture deletes. The trigger captures the deleted record and writes it to a new table, which gets synced to another Kafka topic (call this Topic 2).
We have configured the JDBC source connector to use AvroConverter.
Now we have written Kafka Streams logic that consumes data from Topic 2 and publishes to Topic 1. My question is: what should the serializer and deserializer configuration be for the Kafka Streams logic? Is it OK to use KafkaAvroSerializer and KafkaAvroDeserializer?
I was going through the AvroConverter code (https://github.com/confluentinc/schema-registry/blob/master/avro-converter/src/main/java/io/confluent/connect/avro/AvroConverter.java) to see if I could get some ideas. I navigated the GitHub code for quite some time, but I was not able to conclude whether using KafkaAvroSerializer and KafkaAvroDeserializer is the right approach in the Kafka Streams logic. Can someone please help me?
Why does your JDBC connector only capture inserts and updates?
EDITED: We use the SQL Server Debezium connector (rather than the Confluent JDBC source connector) and it performs well even on deletes. Pay attention to the query modes specifically.
Maybe try switching to this connector; you might end up with one problem solved and a single stream containing all the relevant events.
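If you do switch, a rough sketch of a Debezium SQL Server source configuration follows; hostnames, credentials, database and table names are placeholders, property names differ between Debezium versions, and SQL Server CDC has to be enabled on the database and table first. Because Debezium emits delete events (and tombstones) itself, the trigger and the second topic would no longer be needed:

```java
public class SqlServerDebeziumConfig {
    // Connector JSON to POST to the Connect REST API; property names differ
    // between Debezium versions (e.g. database.names vs. database.dbname,
    // topic.prefix vs. database.server.name), so check the docs for yours.
    static final String CONFIG = """
        {
          "name": "users-cdc-source",
          "config": {
            "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
            "database.hostname": "mssql",
            "database.port": "1433",
            "database.user": "debezium",
            "database.password": "secret",
            "database.names": "appdb",
            "topic.prefix": "appdb",
            "table.include.list": "dbo.users",
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": "schema-changes.appdb"
          }
        }""";

    public static void main(String[] args) {
        System.out.println(CONFIG);
    }
}
```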

Streaming database data to Kafka topic without using a connector

I have a use case where I have to push all my MySQL database data to a Kafka topic. Now, I know I can get this up and running using a Kafka connector, but I want to understand how it all works internally without using a connector. In my Spring Boot project I have already created a Kafka producer file where I set all my configuration, create a producer record, and so on.
Has anyone tried this approach before? Can anyone throw some light on this?
Create entities using Spring Data JPA for the tables and send the data to the topic, e.g. using findAll. Use a scheduler for fetching the data and sending it to the topic. You can add your own logic for fetching from the DB and separate logic for sending it to the Kafka topic: fetch using an auto-increment column, fetch using a last-updated timestamp, or do a bulk fetch. The same logic as the JDBC connector can be implemented.
Kafka Connect will do this in an optimized way.
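A minimal sketch of that approach, assuming spring-boot-starter-data-jpa and spring-kafka are on the classpath, @EnableScheduling is set on the application class, and a JSON value serializer is configured for the KafkaTemplate; the Customer entity, topic name and repository method are placeholders illustrating the auto-increment variant:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical entity for the table being exported, with an auto-increment key.
@Entity
class Customer {
    @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    private String name;
    public Long getId() { return id; }
    public String getName() { return name; }
}

interface CustomerRepository extends JpaRepository<Customer, Long> {
    List<Customer> findByIdGreaterThanOrderByIdAsc(long id);
}

@Component
public class CustomerExporter {

    private final CustomerRepository repository;
    private final KafkaTemplate<String, Customer> kafkaTemplate;
    private long lastSeenId = 0L;   // in-memory "offset"; a real version would persist this

    public CustomerExporter(CustomerRepository repository,
                            KafkaTemplate<String, Customer> kafkaTemplate) {
        this.repository = repository;
        this.kafkaTemplate = kafkaTemplate;
    }

    // Poll the table on a schedule and publish new rows, mimicking the
    // JDBC source connector's incrementing mode.
    @Scheduled(fixedDelay = 5000)
    public void export() {
        for (Customer customer : repository.findByIdGreaterThanOrderByIdAsc(lastSeenId)) {
            kafkaTemplate.send("customers", String.valueOf(customer.getId()), customer);
            lastSeenId = customer.getId();
        }
    }
}
```

Note that this only captures inserts on an incrementing key; updates and deletes need the timestamp or trigger tricks mentioned above, which is exactly what the connectors already handle for you.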

Is it possible to use Kafka Connect to mirror an RDBMS table to a Kafka Stream?

I know it's possible to push updates from a database to a Kafka stream using Kafka Connect. My question is, can I create a consumer to write changes from that same stream back into the table without creating an infinite loop?
I'm assuming if I create a consumer that writes updates into the database table, it would trigger Connect to push that update to the stream, etc. Is there a way around this so I can mirror a database table to a stream?
You can stream from a Kafka topic to a database using the JDBC Sink connector for Kafka Connect.
You'd need to code your business logic for avoiding an infinite replication loop into either the connectors or your consumer (see the sketch after this list). For example:
JDBC Source connector uses a WHERE clause to only pull records with a flag set to indicate they are the original record
Custom Single Message Transform in the source connector to drop records with a flag set to indicate they are not the original record
Stream application (e.g. KSQL / Kafka Streams) processes the inbound stream of all database changes to filter out only those with a flag set to indicate they are the original record
The last option is inefficient, though, because you're still streaming everything from the database.
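A minimal sketch of the Streams option, assuming each change event carries some flag (here a hypothetical "origin" field) and using plain String serdes for brevity; a real application would use a proper JSON or Avro serde and inspect the parsed field rather than the raw string:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OriginalRecordFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "original-record-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // All changes captured from the table by the source connector.
        KStream<String, String> changes = builder.stream("db.changes");
        changes
            // Keep only events flagged as originating outside the mirror loop;
            // the "origin" field and value are placeholders for whatever flag you add.
            .filter((key, value) -> value != null && value.contains("\"origin\":\"application\""))
            // Topic that the consumer / sink writes back to the table from.
            .to("db.changes.original");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Your consumer (or JDBC sink) would then read only from db.changes.original, so the rows it writes back never re-enter the loop.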
Yes. It is possible to configure synchronisation/replication.