Kafka Connect: How to handle database schema/table changes - PostgreSQL

Wondering if there is a documented process for handling database schema changes. I am using the Debezium source connector for Postgres and the Confluent JDBC Sink connector to replicate database changes. I need to make some changes to the database, as below:
Add new columns to an existing table
Modify a database column type and rename a column
I am not sure what the best way to do this is. The solution I can think of is:
Stop source connector
Wait for sinks to consume all messages
Upgrade the databases
Start source and sink connector

Debezium will automatically add new fields to the record schema for new columns, so you would update your consumers and downstream systems first to prepare for those events. There is no need to stop the source connector.
If you change types and names, you might run into backwards-incompatible schema changes, and these operations are generally not recommended. Instead, always add new columns but "deprecate" and don't use the old ones. After you are done reading events from those old columns in all other systems, drop those columns.
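For illustration, a rename handled this way might look like the following in PostgreSQL (the table and column names are hypothetical):

    -- Expand: add the new column alongside the old one
    ALTER TABLE customers ADD COLUMN full_name VARCHAR(255);

    -- Backfill and have writers populate both columns, so Debezium events
    -- carry the old and the new field side by side
    UPDATE customers SET full_name = customer_name WHERE full_name IS NULL;

    -- Contract: only after every downstream consumer has switched to
    -- full_name, drop the deprecated column
    ALTER TABLE customers DROP COLUMN customer_name;

Until the final DROP, every change the connector emits is additive, which should keep the registered schemas backward compatible.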

Related

Schema evolution along with message transformations using confluent_kafka and schema-registry

I need to transform messages, along with handling schema evolution, into a target MySQL DB. I can't use the Sink connector here because it supports schema evolution but not complex transformations.
I have a table in the MySQL source like below:
id, name, created_at
1, shoaib, 2022-01-01
2, ahmed, 2022-02-01
In the target MySQL I want to replicate that table with some transformations. The target table would look like:
id, name, created_at, isDeleted
1, shoaib, 2022-01-01, 0
2, ahmed, 2022-02-01, 0
Whenever a row gets deleted in the source, the isDeleted column here should be updated to 1.
This is a very simple transformation I just put in.
So, I decided not to use the Sink connector's SMTs because they offer only very basic transformations. I went with the confluent_kafka library using Python.
I am able to transform data as needed and load it into the target MySQL, but I am not able to make the relevant schema changes automatically in the target using Schema Registry with the confluent_kafka library. Schema versions are getting updated in the registry, but how do I propagate those changes to the target DB if I'm not using the Sink connector?
propagate those changes to the target DB if I'm not using the Sink connector
You would need to run an ALTER TABLE statement on your own.
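As a rough sketch (the subject name, connection details, and the Avro-to-MySQL type mapping below are all assumptions, and only added columns are handled), you could diff the latest registered schema against the target table and issue the ALTER TABLE yourself:

    import json
    import pymysql
    from confluent_kafka.schema_registry import SchemaRegistryClient

    # Hypothetical names -- adjust to your environment
    REGISTRY_URL = "http://localhost:8081"
    SUBJECT = "source.mydb.users-value"
    TARGET_DB = "target_db"
    TARGET_TABLE = "users"

    # Deliberately simplistic Avro -> MySQL type mapping
    TYPE_MAP = {"string": "VARCHAR(255)", "int": "INT", "long": "BIGINT",
                "boolean": "TINYINT(1)", "float": "FLOAT", "double": "DOUBLE"}

    def avro_type(field):
        # Unions like ["null", "string"] -> take the non-null branch
        t = field["type"]
        if isinstance(t, list):
            t = next(x for x in t if x != "null")
        return t if isinstance(t, str) else "string"

    registry = SchemaRegistryClient({"url": REGISTRY_URL})
    latest = registry.get_latest_version(SUBJECT)
    fields = json.loads(latest.schema.schema_str)["fields"]

    conn = pymysql.connect(host="localhost", user="app", password="secret",
                           database=TARGET_DB)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT column_name FROM information_schema.columns "
            "WHERE table_schema = %s AND table_name = %s",
            (TARGET_DB, TARGET_TABLE))
        existing = {row[0].lower() for row in cur.fetchall()}

        # Add any column the registry knows about but the table does not;
        # renames and type changes still need to be handled by hand
        for field in fields:
            if field["name"].lower() not in existing:
                col_type = TYPE_MAP.get(avro_type(field), "VARCHAR(255)")
                cur.execute(f"ALTER TABLE {TARGET_TABLE} "
                            f"ADD COLUMN {field['name']} {col_type} NULL")
    conn.commit()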
I suggest looking at the Debezium docs on how it can insert a __deleted attribute into the data from the source connector; then you should be able to use an SMT to convert the boolean into an INT (or just keep a boolean column in the database rather than a number).
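For reference, the source-connector settings that produce that __deleted field look roughly like this (the transform alias is arbitrary, and exact option names vary a little between Debezium versions):

    transforms=unwrap
    transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
    transforms.unwrap.drop.tombstones=false
    # "rewrite" turns delete events into normal records with __deleted=true
    transforms.unwrap.delete.handling.mode=rewrite

Your Python consumer (or a sink-side SMT) can then map that flag onto the isDeleted column.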
Or you can use Python to write a new schema to a new Kafka topic (creating a new subject in the registry), which can then be consumed by the Sink connector to write to (and evolve) the table.

Audit data changes with Debezium

I have a use case where I want to audit the DB table data changes into another table for compliance purposes. Primarily, any changes to the data like Inserts/Updates/Deletes should be audited. I found different options like JaVers, Hibernate Envers, Database triggers, and Debezium.
I am avoiding JaVers and Hibernate Envers, as they will not capture data changes that happen through direct SQL queries or through other applications. The other issue I see is that we would need to add the audit-related code to the main application code within the same transaction boundary.
I am also avoiding the usage of database triggers as we are not using triggers at all for any of the deployments.
That leaves me with Debezium, which is promising. But the only concern I have is that we need to use Kafka to leverage Debezium. Is Kafka necessary to use Debezium if both the primary table and the audit table sit in the same DB instance?
Debezium is perfect for auditing, but given that it is a source connector, it represents just one part of the data pipeline in your use case. You will capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or local disk, and then you need a sink connector (e.g. Camel Kafka SQL or JDBC, kafka-connect-jdbc) to insert into the target table.
For the same-transaction-boundary requirement you can use the Outbox Pattern if eventual consistency is fine. There is also an Outbox Event Router SMT that is part of the project, configured as sketched below.
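A minimal, hedged sketch of wiring that SMT into the Debezium connector configuration (the alias is arbitrary, and the routing column follows the SMT's default outbox table layout):

    transforms=outbox
    transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
    # route events to topics based on the aggregatetype column of the outbox table
    transforms.outbox.route.by.field=aggregatetype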
Note that Debezium can also run embedded in a standalone Java application, storing the offset on local disk, but you lose the HA capability given by Kafka Connect running in distributed mode. With the embedded mode, you are also switching from a configuration-driven approach to a code-driven one.
I found Debezium to be a very comprehensive solution, and it is open source, backed by Red Hat. That gives it not only credibility, but also some assurance that it is going to be supported.
It provides rich configuration to whitelist or blacklist databases/tables/columns (with wildcard patterns), along with controls to limit the data captured from really large columns.
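As a hedged example of that filtering (the names are made up, and older Debezium releases use the whitelist/blacklist spellings of these options):

    connector.class=io.debezium.connector.mysql.MySqlConnector
    database.include.list=inventory
    table.include.list=inventory.orders,inventory.customers
    column.exclude.list=inventory.customers.ssn
    # truncate a very large text column in the emitted change events
    column.truncate.to.1024.chars=inventory.customers.notes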
Since it is driven from the binlogs, you get not only the current state but also the previous state. This is ideal for audit trails, and you can customise building a proper sync to Elastic (one topic per table).
Use of Kafka is necessary to account for HA and for latency when bulk updates are made on the DB, even though the primary and audit tables are in the same DB.

How to update table schema when there is new Avro schema for Kafka data in Flink?

We are consuming a Kafka topic in the Flink application using Flink Table API.
When we first submit the application, we read the latest schema from our custom registry and then create a Kafka DataStream and Table using the Avro schema. My data serializer implementation works similarly to the Confluent Schema Registry one, checking the schema ID and then looking the schema up in the registry, so we can apply the correct schema at runtime.
However, I do not know how to update the table schema and re-execute the SQL without redeploying the job. Is there a way to have a background thread that checks for schema changes and, if there are any, pauses the current execution, updates the table schema, and re-executes the SQL?
This would be particularly useful for continuous delivery of schema changes to the applications. We already have a compatibility check in place.
TL;DR you don't need to change anything to get it working in most cases.
In Avro, there is the concept of the reader and the writer schema. The writer schema is the schema that was used to generate the Avro record, and it is encoded into the payload (in most cases as an ID).
The reader schema is the one your application uses to make sense of the data; if you do a particular calculation, you are using a specific set of fields of an Avro record.
Now the good part: Avro transparently translates the writer schema into the reader schema as long as the two are compatible. So, provided your schemas are fully compatible, there is always a way to transform the writer schema into your reader schema.
So if the schema of the records changes in the background while the application is running, the DeserializationSchema fetches the new writer schema and infers a new mapping to the reader schema. Your query will not notice any change.
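To make the writer/reader resolution concrete, here is a small standalone sketch; it uses the fastavro library and made-up field names rather than the Flink APIs from the question:

    import io
    import fastavro

    # Schema the producer used (v1)
    writer_schema = {
        "type": "record", "name": "User",
        "fields": [{"name": "id", "type": "long"},
                   {"name": "name", "type": "string"}],
    }

    # Schema the consumer was built against (v2, adds a defaulted field)
    reader_schema = {
        "type": "record", "name": "User",
        "fields": [{"name": "id", "type": "long"},
                   {"name": "name", "type": "string"},
                   {"name": "email", "type": ["null", "string"], "default": None}],
    }

    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, writer_schema, {"id": 1, "name": "alice"})
    buf.seek(0)

    # Schema resolution fills the missing field in from its default
    record = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
    print(record)  # {'id': 1, 'name': 'alice', 'email': None}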
This approach falls short if you actually want to enrich the schema in your application; for example, if you always want to add a calculated field and return all other fields. Then a newly added field will not be picked up, since effectively your reader schema would have to change. In this case, you either need to restart the job or use a generic record schema.

Use an already existing table with the Confluent JDBC Sink Connector

According to the examples from the Confluent docs (https://docs.confluent.io/3.1.1/connect/connect-jdbc/docs/sink_connector.html) I am trying to get a solution working where I can reuse my already existing table from a previous legacy system, and all messages from a certain topic shall be written into it via upsert.
So in general, based on the example from Confluent, how could I write all messages from the topic orders into a table called "myOrders" (which already exists), instead of auto-creating a new table in my database with the same name as the topic?
The table can be auto created if it does not exist by setting auto.create to true in the configuration.
However, out of the box, the JDBC Sink Connector can reuse your "myOrders" table.
Please see https://docs.confluent.io/current/connect/kafka-connect-jdbc/sink-connector/index.html#auto-creation-and-auto-evoluton
This applies to the current documentation and the Confluent Platform 3.1.1 version you linked.
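With a recent connector version, a hedged sketch of such a sink configuration might look like this (table.name.format maps the orders topic onto the existing myOrders table; the connection details and key settings are assumptions):

    name=jdbc-sink-orders
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    topics=orders
    connection.url=jdbc:postgresql://localhost:5432/legacydb
    # reuse the existing table instead of creating one named after the topic
    table.name.format=myOrders
    auto.create=false
    auto.evolve=false
    # upsert semantics need a primary key definition; "id" is a placeholder
    insert.mode=upsert
    pk.mode=record_value
    pk.fields=id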

Kafka JDBC connector not picking up new commits

I am currently using a Kafka JDBC connector to poll records from an Oracle DB. The connector properties are set to use timestamp mode, and we have provided a simple SELECT query in the properties (not using a WHERE clause) - based on my understanding this should work.
However, when instantiating the connector, I can see that the initial query does pull out all of the records it should and publishes them to the Kafka consumer - but any new commits to the Oracle DB are not picked up, and the connector just sits polling without finding any new info, maintaining its offset.
No exceptions are being thrown in the connector, and there is no indication of a problem other than it not picking up the new commits in the DB.
One thing of note, which I have been unable to prove makes a difference, is that the fields in the Oracle DB are all nullable. I have tested changing that for the timestamp field, and it had no effect; the same behaviour continued. I have also tested bulk mode, and it works fine and does pick up new commits, though I cannot use bulk mode as we cannot duplicate the records for the system.
Does anyone have any idea why the connector is unable to pick up new commits for timestamp mode?
What does your properties file look like? You need to make sure to use an incrementing column or a timestamp column.
If you are using a timestamp column, is it getting updated on the commit?
Regarding nulls, you can tweak your query to COALESCE the null column to a value. Alternatively, I think there is a setting to allow nullable columns.
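For reference, a timestamp-mode source configuration along the lines the answer describes (the table, column and connection details are made up); validate.non.null=false is the setting that tolerates a nullable timestamp column:

    name=jdbc-source-oracle
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1
    mode=timestamp
    # must be set (or bumped) on every insert and update for new rows to be seen
    timestamp.column.name=LAST_UPDATED
    # the connector appends its own WHERE clause to this query for the timestamp filter
    query=SELECT * FROM ORDERS
    topic.prefix=oracle-orders
    # skip the NOT NULL validation on the timestamp column
    validate.non.null=false
    poll.interval.ms=5000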