Debezium connector failover mechanism - PostgreSQL

I'm learning about Debezium connectors and I'm using Debezium for PostgreSQL. I have a small question to clarify.
Imagine a situation like this. I have a Debezium connector for a table called tableA, and changes happening on that table are published to a topic called topicA. The connector works without any issue and changes are published to the topic. Now suppose that for some reason I need to delete my connector and start a new connector with the same configuration, for the same table, publishing to the same topic. So there is a time gap between stopping my connector and starting a new one with the same config. What happens to the data that changes on tableA during that time?
Will the new connector start from where the old one stopped, or what will happen?

Dushan, the answer depends on how the connector stops. The various scenarios are described here:
https://debezium.io/documentation/reference/stable/connectors/postgresql.html#postgresql-kafka-connect-process-stops-gracefully
In the ideal case, the Log Sequence Number (LSN) is recorded in the Kafka Connect offsets topic. Unless that topic is re-created or its messages expire, the LSN offsets are preserved, and on restart a connector with the same name will resume from that position.
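For illustration, here is a minimal sketch of the configuration that matters when re-creating the connector, written as a Java map purely for readability (in practice this is the JSON you POST to the Kafka Connect REST API); hostnames, credentials, and the table/slot names are placeholders. Reusing the same connector name lets the new connector find its stored LSN offsets (they are keyed by connector name), and reusing the same slot.name means the replication slot keeps retaining WAL while the connector is down, so changes made during the gap are still captured after the restart.

import java.util.Map;

public class ConnectorConfigSketch {
    // Illustrative only: these are the settings you would submit (as JSON) to the
    // Kafka Connect REST API when re-creating the connector. Values are placeholders.
    static final Map<String, String> CONFIG = Map.of(
        "name", "tablea-connector",              // keep the same name: stored offsets are keyed by it
        "connector.class", "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname", "postgres",         // placeholder
        "database.port", "5432",
        "database.user", "debezium",             // placeholder
        "database.password", "secret",           // placeholder
        "database.dbname", "mydb",               // placeholder
        "topic.prefix", "dbserver1",             // keep the same ("database.server.name" on older versions)
        "table.include.list", "public.tableA",
        "slot.name", "tablea_slot"               // keep the same: the slot retains WAL while the connector is down
    );
}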

Related

Kafka Connect Reread Entire File for KSQLDB Debugging or KSQLDB possibility to insert all events after query creation?

I'm starting to develop with ksqlDB alongside Kafka Connect. Kafka Connect is awesome, everything works well, and it does not re-read records it detects as already read in the past (extremely useful for production). But for developing and debugging ksqlDB queries it is necessary to replay the data, as ksqlDB creates table entries on the fly from emitted changes; if nothing is replayed, the query under test stays empty. Any advice on how to replay a CSV file with Kafka Connect after the file has been ingested for the first time? Maybe ksqlDB has the possibility to re-read the whole topic after the table is created. Does anyone have an answer for a beginner?
Create the source connector with a different name, or give the CSV file a new name. Both should cause it to be re-read.
I also had a problem in my configuration file, and ksqlDB did not recognize the
SET 'auto.offset.reset'='earliest';
option. Use the above command (note the single quotes) to force ksqlDB to read the whole topic from the beginning for queries created by a subsequent table/stream creation command. You have to set it manually every time you connect via the ksql CLI or a client.

Dealing with data loss in Kafka Connect

I understand that Kafka Connect can be deployed in cluster mode, and that workers move data between a data source and a Kafka topic. What I want to know is: if a worker fails while moving data from the data source to the Kafka topic, would there be data loss? And if there would be data loss, how can we get the data back from the connector, or will Kafka Connect deal with it automatically?
This depends on the source and whether it supports offset tracking.
For example, lines in a file, rows in a database with a primary ID / timestamp, or an idempotent API call can be re-read from a known starting position (although, in each case, the underlying data also needs to be immutable for this to work consistently).
The Kafka Connect SourceTask API has a call to commit tracked "offsets" (different from Kafka topic offsets).
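As a rough sketch of how that works, here is a hypothetical file-based SourceTask that attaches a source offset to every record and reads the last committed offset back on startup; the config key, topic name, and readLine helper are made up for illustration.

import java.util.*;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical file-based source task illustrating Connect's offset tracking.
public class LineFileSourceTask extends SourceTask {
    private Map<String, String> sourcePartition;  // identifies "which file"
    private long nextLine = 0;                    // resumes from the stored offset
    private String filename;

    @Override
    public void start(Map<String, String> props) {
        filename = props.get("file");             // hypothetical config key
        sourcePartition = Collections.singletonMap("filename", filename);
        // Ask Connect for the last committed offset for this partition, if any.
        Map<String, Object> stored = context.offsetStorageReader().offset(sourcePartition);
        if (stored != null) {
            nextLine = (Long) stored.get("line");
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        String line = readLine(filename, nextLine); // hypothetical helper
        if (line == null) {
            Thread.sleep(1000);
            return Collections.emptyList();
        }
        Map<String, Long> sourceOffset = Collections.singletonMap("line", ++nextLine);
        // Connect periodically commits sourceOffset to its offsets topic;
        // after a worker crash, start() sees the last committed value and resumes from there.
        return Collections.singletonList(new SourceRecord(
            sourcePartition, sourceOffset,
            "lines-topic", Schema.STRING_SCHEMA, line));
    }

    @Override
    public void stop() { }

    @Override
    public String version() { return "0.1"; }

    private String readLine(String file, long lineNo) {
        return null; // stub: a real implementation would read the file at lineNo
    }
}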

What should the Kafka serde configuration be when we use Kafka Streams?

We are using a JDBC source connector to sync data from a table to a topic (call this Topic 1) in Kafka. As we know this captures only inserts and updates, we have added a trigger to capture deletes. This trigger captures the deleted record and writes to a new table which gets synced to another Kafka topic (call this Topic 2).
We have configured the JDBC source connector to use AvroConverter.
Now we have written Kafka Streams logic that consumes data from this Topic 2 and publishes to Topic 1. My question is: what should the serializer and deserializer configuration be for the Kafka Streams logic? Is it OK to use KafkaAvroSerializer and KafkaAvroDeserializer?
I was going through the AvroConverter code (https://github.com/confluentinc/schema-registry/blob/master/avro-converter/src/main/java/io/confluent/connect/avro/AvroConverter.java) to see if I could get some ideas. I navigated the GitHub code for quite some time, but I could not conclude whether using KafkaAvroSerializer and KafkaAvroDeserializer is the right choice in the Kafka Streams logic. Can someone please help me?
Why does your JDBC connector only capture inserts and updates?
EDITED: We use the SQL Server Debezium connector (instead of the Confluent JDBC source connector) and it performs well even on deletes. Pay attention to the query modes specifically.
Maybe try switching to this connector; you might end up with one problem solved, having only one stream containing all the relevant events.
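On the original serde question: data written by Connect's AvroConverter is Schema-Registry Avro, so in Kafka Streams the usual counterpart is Confluent's Avro serde (GenericAvroSerde or SpecificAvroSerde), which wraps KafkaAvroSerializer/KafkaAvroDeserializer. A minimal sketch, assuming String keys and placeholder topic names, application id, and registry URL:

import java.util.Map;
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;

public class DeleteForwarder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "delete-forwarder");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        // GenericAvroSerde wraps KafkaAvroSerializer / KafkaAvroDeserializer,
        // so it matches data written by Connect's AvroConverter.
        GenericAvroSerde valueSerde = new GenericAvroSerde();
        valueSerde.configure(
            Map.of("schema.registry.url", "http://localhost:8081"),          // placeholder
            false /* isKey */);

        StreamsBuilder builder = new StreamsBuilder();
        // Read the delete-capture topic and forward its records to the main topic.
        KStream<String, GenericRecord> deletes =
            builder.stream("topic2", Consumed.with(Serdes.String(), valueSerde)); // placeholder topic names
        deletes.to("topic1", Produced.with(Serdes.String(), valueSerde));

        new KafkaStreams(builder.build(), props).start();
    }
}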

In Kafka, how to handle deleted rows from a source table that are already reflected in a Kafka topic?

I am using a JDBC source connector with mode timestamp+incrementing to fetch a table from Postgres, using Kafka Connect. Updates to the data are reflected in the Kafka topic, but deleting records has no effect. So, my questions are:
Is there some way to handle deleted records?
How do I handle records that are deleted from the table but still present in the Kafka topic?
The recommendation is to either 1) make your source database append/update only as well, marking deletions via a boolean or timestamp column that is filtered out when Kafka Connect queries the table.
If your database is running out of space, then you can delete old records, which should already have been processed by Kafka.
Or 2) use CDC tools to capture delete events immediately rather than missing them in a periodic table scan. Debezium is a popular option for Postgres.
A Kafka topic can be seen as an "append-only" log. It keeps all messages for as long as you like, but Kafka is not built to delete individual messages from a topic.
In the scenario you are describing, it is common for the downstream application (consuming the topic) to handle the information about a deleted record.
As an alternative, you could set the cleanup.policy of your topic to compact, which means it will eventually keep only the latest value for each key. If you now define the key of a message as the primary key of the Postgres table, your topic will eventually delete the record when you produce a message with the same key and a null value (a tombstone) into the topic.
However, I am not sure whether your connector is flexible enough to do this.
And depending on what you do with the data in the Kafka topic, this could still not be a solution to your problem, as the downstream application will still read both records: the original one and the null message marking the deletion.
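A minimal sketch of producing such a tombstone yourself, assuming a compacted topic keyed by the table's primary key; the topic name and key are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TombstoneProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = primary key of the deleted row, value = null (tombstone).
            // On a topic with cleanup.policy=compact, compaction eventually removes
            // all earlier records with this key, and later the tombstone itself.
            producer.send(new ProducerRecord<>("postgres.table_a", "42", null)); // placeholder topic/key
            producer.flush();
        }
    }
}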

How can I use Kafka like a relational database?

Good day, and sorry for my poor English. I have an issue; can you help me understand how I can use Kafka and Kafka Streams like a database?
My problem is that I have some microservices, and each service keeps its data in its own database. For reporting purposes I need to collect the data in one place, and for this I chose Kafka. I use Debezium, maybe you know it (change data capture), so each table in a relational database becomes a topic in Kafka. And I wrote an application with Kafka Streams (I joined the streams with each other); so far so good. Example: I have topics for ORDER and ORDER_DETAILS, and at some point an event will arrive that should join with these topics, but I don't know when it will come: maybe after minutes, or after months, or after years. How can I get the data in the ORDER and ORDER_DETAILS topics after a month or a year? Is it the right way to keep data in a topic infinitely? Can you give me some advice; maybe there are some solutions?
The event will come as soon as there is a change in the database.
Typically, the changes to the database tables are pushed as messages to the topic.
Each and every update to the database will be a Kafka message. Since there is a message for every update, you might be interested in only the latest value (update) for any given key, which will mostly be the primary key.
In this case, you can retain the topic infinitely (retention.ms=-1) but compact it (cleanup.policy=compact) in order to save space.
You may also be interested in configuring segment.ms and/or segment.bytes to further tune how the topic is retained and compacted.
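A minimal sketch of creating such a topic with Kafka's AdminClient, assuming placeholder topic name, partition/replica counts, and segment timing:

import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic orders = new NewTopic("ORDER", 3, (short) 3)                // placeholder name/sizing
                .configs(Map.of(
                    TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT, // keep latest value per key
                    TopicConfig.RETENTION_MS_CONFIG, "-1",                        // keep the log forever
                    TopicConfig.SEGMENT_MS_CONFIG, "3600000"));                   // placeholder: roll segments hourly so compaction can run
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}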