Is it possible to use Kafka Connect to mirror an RDBMS table to a Kafka Stream?

I know it's possible to push updates from a database to a Kafka stream using Kafka Connect. My question is, can I create a consumer to write changes from that same stream back into the table without creating an infinite loop?
I'm assuming if I create a consumer that writes updates into the database table, it would trigger Connect to push that update to the stream, etc. Is there a way around this so I can mirror a database table to a stream?

You can stream from a Kafka topic to a database using the JDBC Sink connector for Kafka Connect.
You'd need to build the business logic for avoiding an infinite replication loop into either the connectors or your consumer. For example:
- The JDBC Source connector uses a WHERE clause to pull only records with a flag set to indicate they are the original record (a hedged config sketch follows this list).
- A custom Single Message Transform in the source connector drops records whose flag indicates they are not the original record.
- A stream processing application (e.g. KSQL / Kafka Streams) processes the inbound stream of all database changes and keeps only those with a flag set to indicate they are the original record. This is inefficient, because you're still streaming everything from the database.
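A minimal sketch of the first option, assuming the Confluent JDBC Source connector in query mode; the table, the is_original flag column, and the connection details are hypothetical placeholders (the subselect is there because, in incremental modes, the connector appends its own WHERE clause to the query):

```json
{
  "name": "jdbc-source-originals-only",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/mydb",
    "connection.user": "connect",
    "connection.password": "connect-secret",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "query": "SELECT * FROM (SELECT * FROM users WHERE is_original = true) originals",
    "topic.prefix": "users-mirror",
    "poll.interval.ms": "5000"
  }
}
```

The consumer that writes back into the table would then insert its rows with is_original = false, so they never re-enter the topic.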

Yes. It is possible to configure synchronisation/replication.

Related

Is it a good practice to use the existing topic for multiple connectors?

I am using the Debezium PostgreSQL connector to get the users table into a Kafka Topic.
I have a JDBC Sink connector that then reads the data from the topic and pushes it into its own database.
Now I need a subset of the data for another microservice's database, so I am planning to write another JDBC Sink connector.
The question: is it good practice to use the existing users table topic? If yes, then how can I make sure that the new JDBC connector gets a snapshot of the entire users table?
 
If Debezium snapshotted the table and data hasn't been lost in the topic due to retention, then that's what any sink or other consumer will read.
Each sink connector (identified by its unique name) tracks its own offsets on the topic. Nothing bad will happen with multiple consumers reading the same topic; this is how Kafka is intended to be used.
You may need to ensure consumer.auto.offset.reset=earliest so that Connect reads from the start of the topic.
To get a subset of fields, you'll need to use the ReplaceField Single Message Transform to keep or drop specific fields (a hedged config sketch follows) - https://docs.confluent.io/platform/current/connect/transforms/replacefield.html#replacefield
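A hedged sketch of what that second sink connector could look like, assuming the Confluent JDBC Sink connector and the ReplaceField transform; the topic name, field list, and connection details are placeholders (newer Connect versions use include/exclude for this transform, older ones whitelist/blacklist):

```json
{
  "name": "jdbc-sink-microservice-b",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "dbserver1.public.users",
    "connection.url": "jdbc:postgresql://service-b-db:5432/serviceb",
    "connection.user": "connect",
    "connection.password": "connect-secret",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "auto.create": "true",
    "transforms": "subset",
    "transforms.subset.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.subset.include": "id,email,created_at"
  }
}
```

If the topic carries Debezium's change-event envelope rather than plain rows, you would typically also chain Debezium's ExtractNewRecordState transform ahead of ReplaceField.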

Kafka Connect: a sink connector for writing multiple tables from one topic

I'd like to use Kafka Connect to detect changes on a Postgres DB via CDC for a group of tables, and push them as messages into one single topic, with the key being the logical key of the main table.
This would allow the consumer to consume the data changes in the right order and apply them to a destination DB.
Are there Source and Sink connectors allowing me to achieve this goal?
I'm using Debezium CDC Source Connector for Postgres... which I can configure to route all the messages for all the tables into one single topic.
But then I'm not able to find a Sink connector capable of consuming the messages and writing to the right table depending on the schema of the message.
There's no out-of-the-box property on a specific sink connector for this, if that's what you're looking for.
You can use a Single Message Transform to extract part of the record and set the outgoing topic name, so a single sink connector can route records to different tables (see the sketch after the link below).
Example : https://docs.confluent.io/platform/current/connect/transforms/extracttopic.html
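A hedged sketch of how that could look on the sink side, assuming Confluent's ExtractTopic transform is installed and each message carries a field (here called table_name, a placeholder) holding the destination table; the JDBC Sink then maps the rewritten topic name to a table via table.name.format:

```json
{
  "name": "jdbc-sink-multi-table",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "all_db_changes",
    "connection.url": "jdbc:postgresql://target-db:5432/target",
    "connection.user": "connect",
    "connection.password": "connect-secret",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "transforms": "routeByTable",
    "transforms.routeByTable.type": "io.confluent.connect.transforms.ExtractTopic$Value",
    "transforms.routeByTable.field": "table_name",
    "table.name.format": "${topic}"
  }
}
```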

Dealing with data loss in Kafka Connect

I understand that Kafka Connect can be deployed in cluster mode, and that workers move data between a data source and a Kafka topic. What I want to know is: if a worker fails while moving data from the data source to the Kafka topic, would there be data loss? If there would be data loss, how can we get the data back from the connector, or will Kafka Connect automatically deal with it?
This depends on the source and whether it supports offset tracking.
For example, lines in a file, rows in a database with a primary ID / timestamp, or an idempotent API call can be re-read repeatedly from the same starting position (although, in each case, the underlying data also needs to be immutable for this to work consistently).
The Kafka Connect SourceTask API has a mechanism for committing tracked "offsets" (different from Kafka topic offsets), so a restarted task can resume from where a failed one left off (see the sketch below).
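As a rough illustration (not any specific connector's code), here is a minimal, hypothetical SourceTask sketch showing how those source offsets are tracked and used to resume after a worker failure; the filename/line fields and the readLine helper are made up for the example:

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical source task illustrating Connect's source-offset tracking.
public class LineNumberSourceTask extends SourceTask {

    // Identifies which source the offsets belong to (e.g. a file name or table name).
    private Map<String, String> sourcePartition;
    private long nextLine = 0L;

    @Override
    public void start(Map<String, String> props) {
        sourcePartition = Collections.singletonMap("filename", props.get("filename"));

        // On (re)start, ask Connect for the last committed offset for this partition.
        Map<String, Object> lastOffset = context.offsetStorageReader().offset(sourcePartition);
        if (lastOffset != null) {
            // Resume where the failed task left off, so a worker crash does not lose data.
            nextLine = ((Number) lastOffset.get("line")).longValue();
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        String value = readLine(nextLine);              // hypothetical read from the source
        if (value == null) {
            Thread.sleep(1000);
            return null;                                // nothing new yet
        }
        Map<String, Long> sourceOffset = Collections.singletonMap("line", nextLine + 1);
        nextLine++;
        // Connect stores sourcePartition/sourceOffset alongside the record and
        // periodically commits them, so the position survives task restarts.
        return Collections.singletonList(
                new SourceRecord(sourcePartition, sourceOffset, "my-topic",
                        Schema.STRING_SCHEMA, value));
    }

    @Override
    public void stop() { }

    @Override
    public String version() { return "0.1"; }

    private String readLine(long line) { return null; } // stub for the example
}
```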

What should the Kafka Serde configuration be when we use Kafka Streams?

We are using a JDBC source connector to sync data from a table to a topic (call this Topic 1) in Kafka. As we know this captures only inserts and updates, we have added a trigger to capture deletes. This trigger captures the deleted record and writes to a new table which gets synced to another Kafka topic (call this Topic 2).
We have configured the JDBC source connector to use AvroConverter.
Now we have written a Kafka streams logic that consumes data from this Topic 2 and publishes to Topic 1. My question is what should be the serializer and deserializer configuration for the Kafka streams logic? Is it ok to use KafkaAvroSerializer and KafkaAvroDeserializer?
I was going through the AvroConverter code (https://github.com/confluentinc/schema-registry/blob/master/avro-converter/src/main/java/io/confluent/connect/avro/AvroConverter.java) to see if I could get some ideas. I navigated the GitHub code for quite some time, but I was not able to conclude whether using KafkaAvroSerializer and KafkaAvroDeserializer is the right choice in the Kafka Streams logic. Can someone please help me?
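For reference, a minimal sketch of what an Avro-based Streams topology like the one described might look like, assuming Confluent's GenericAvroSerde (which wraps KafkaAvroSerializer/KafkaAvroDeserializer) and String keys; the topic names, application id, and URLs are placeholders:

```java
import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Collections;
import java.util.Properties;

public class DeleteForwarder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "delete-forwarder");  // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        // GenericAvroSerde wraps KafkaAvroSerializer/KafkaAvroDeserializer and
        // looks schemas up in Schema Registry, like the connector's AvroConverter does.
        GenericAvroSerde valueSerde = new GenericAvroSerde();
        valueSerde.configure(
                Collections.singletonMap("schema.registry.url", "http://localhost:8081"),
                false);                                                      // false = value serde

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, GenericRecord> deletes = builder.stream(
                "topic2", Consumed.with(Serdes.String(), valueSerde));       // delete-capture topic

        deletes.to("topic1", Produced.with(Serdes.String(), valueSerde));    // main topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```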
Why does your JDBC connector only capture inserts and updates?
EDITED: We use the Debezium SQL Server connector (instead of the Confluent JDBC source connector) and it performs well even on deletes. Pay attention to the query modes specifically.
Maybe try switching to this connector and you might end up with one problem solved, having only one stream containing all the relevant events.

In Kafka, how to handle deleted rows from source table that are already reflected in Kafka topic?

I am using a JDBC source connector with mode timestamp+incrementing to fetch a table from Postgres, using Kafka Connect. The updates to the data are reflected in the Kafka topic, but deletions of records have no effect. So, my questions are:
Is there some way to handle deleted records?
How do I handle records that are deleted from the table but are still present in the Kafka topic?
The recommendation is to either 1) adjust your source database to be append/update only as well, marking rows as deleted via a boolean or timestamp column that is then filtered out when Kafka Connect queries the table.
If your database is running out of space, then you can delete old records, which should already have been processed by Kafka
Option 2) Use CDC tools to capture delete events immediately, rather than missing them in a periodic table scan. Debezium is a popular option for Postgres (a hedged config sketch follows).
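A minimal sketch of option 2, assuming the Debezium PostgreSQL connector (property names follow Debezium 1.x naming; connection details and table names are placeholders). Debezium emits a delete event, followed by a tombstone, when a row is removed:

```json
{
  "name": "postgres-cdc-users",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "debezium-secret",
    "database.dbname": "mydb",
    "database.server.name": "dbserver1",
    "plugin.name": "pgoutput",
    "table.include.list": "public.users",
    "tombstones.on.delete": "true"
  }
}
```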
A Kafka topic can be seen as an "append-only" log. It keeps all messages for as long as you like, but Kafka is not built to delete individual messages out of a topic.
In the scenario you are describing, it is common for the downstream application (consuming the topic) to handle the information about a deleted record.
As an alternative, you could set the cleanup.policy of your topic to compact, which means it will eventually keep only the latest value for each key. If you define the key of a message as the primary key of the Postgres table, your topic will eventually delete the record when you produce a message with the same key and a null value (a tombstone; see the sketch at the end of this answer) into the topic. However, I am not sure whether your connector is flexible enough to do this.
Depending on what you do with the data in the Kafka topic, this may still not solve your problem, as the downstream application will still read both records: the original one and the null message representing the deleted record.
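A minimal sketch of that tombstone approach, assuming a compacted topic and that the message key is the table's primary key; the topic name, key, and broker address are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TombstoneExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The topic would need cleanup.policy=compact, e.g. created with:
            //   kafka-topics --create --topic users --partitions 1 --replication-factor 1 \
            //       --config cleanup.policy=compact --bootstrap-server localhost:9092
            // A null value with the row's primary key as the message key is a tombstone;
            // after compaction, earlier records with that key are removed from the topic.
            producer.send(new ProducerRecord<>("users", "42", null));
        }
    }
}
```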