Kafka-connect topic.prefix without the table name - apache-kafka

I'm using the JDBC source connector. My table names contain special characters (e.g. $) that are acceptable to the DB engine, but when I run Kafka Connect with the configuration below it tries to create the Kafka topic from this prefix plus the table name, and the special characters in the table name are not necessarily acceptable to Kafka. Is it possible to use a static target topic name instead of relying on the source table name?
"topic.prefix":"blah-"

I ended up using a Kafka Connect transformation (SMT) like the one below to make it work. I'm still not sure whether the SMT adds any performance overhead, but it works for now:
"transforms":"dropPrefix",
"transforms.dropPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex":"(.*)\\$",
"transforms.dropPrefix.replacement":"$1"

Related

Kafka connect: How to handle database schema/table changes

Wondering if there is a documented process on how to handle database schema changes. I am using the Debezium source connector for Postgres and the Confluent JDBC Sink connector to replicate database changes. I need to make some changes in the database, as below:
Add new columns to an existing table
Modify a database column type and update its name.
I am not sure what the best way to do this is. The solution that I can think of is:
Stop source connector
Wait for sinks to consume all messages
Upgrade the databases
Start source and sink connector
Debezium will automatically add new fields to the record schema for new columns, so you would update your consumers and downstream systems first to prepare for those events. No need to stop the source...
If you change types and names, then you might run into backward-incompatible schema changes, and these operations are generally not recommended. Instead, always add new columns and "deprecate" the old ones rather than changing them; once all other systems have finished reading events from those old columns, drop them.
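On the sink side, the Confluent JDBC Sink Connector can pick up newly added columns automatically when auto.evolve is enabled (it only issues ALTER TABLE ... ADD for new fields; it never drops or renames columns). A sketch of the relevant settings, with the connection URL and topic name as placeholder assumptions and the usual Debezium envelope unwrapping omitted for brevity:
"connector.class":"io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url":"jdbc:postgresql://localhost:5432/targetdb",
"topics":"dbserver1.public.orders",
"auto.create":"true",
"auto.evolve":"true",
"insert.mode":"upsert",
"pk.mode":"record_key"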

Audit data changes with Debezium

I have a use case where I want to audit the DB table data changes into another table for compliance purposes. Primarily, any changes to the data like Inserts/Updates/Deletes should be audited. I found different options like JaVers, Hibernate Envers, Database triggers, and Debezium.
I am avoiding JaVers and Hibernate Envers, as they will not capture data changes made through direct SQL queries or through other applications. The other issue I see is that we would need to add audit-related code to the main application, within the same transaction boundary.
I am also avoiding database triggers, as we do not use triggers at all in any of our deployments.
That leaves me with Debezium, which is promising. The only concern I have is that we need to use Kafka to leverage Debezium. Is Kafka necessary for using Debezium if both the primary table and the audit table sit in the same DB instance?
Debezium is perfect for auditing, but given that it is a source connector, it represents just one part of the data pipeline in your use case. You will capture every table change event (c=create, r=read, u=update, d=delete), store it on a Kafka topic or on local disk, and then you need a sink connector (e.g. the Camel Kafka SQL or JDBC connectors, or kafka-connect-jdbc) to insert it into the target table.
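As a rough illustration of that sink leg, a JDBC Sink Connector writing the change events into an audit table could be configured along these lines. The topic, table, and connection values are placeholders; Debezium's ExtractNewRecordState SMT flattens the change-event envelope while keeping the operation type and timestamp, and rewrites deletes so they also land in the audit table:
"connector.class":"io.confluent.connect.jdbc.JdbcSinkConnector",
"topics":"dbserver1.public.orders",
"connection.url":"jdbc:postgresql://localhost:5432/mydb",
"table.name.format":"orders_audit",
"insert.mode":"insert",
"auto.create":"true",
"transforms":"unwrap",
"transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.add.fields":"op,source.ts_ms",
"transforms.unwrap.delete.handling.mode":"rewrite"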
For the same-transaction-boundary requirement you can use the Outbox pattern, if eventual consistency is acceptable. There is also an Outbox Event Router SMT component that is part of the project.
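For reference, the Outbox Event Router is applied as an SMT on the Debezium source connector; a minimal sketch using its default routing field (aggregatetype) and default topic naming:
"transforms":"outbox",
"transforms.outbox.type":"io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.route.topic.replacement":"outbox.event.${routedByValue}"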
Note that Debezium can also run embedded in a standalone Java application, storing the offsets on local disk, but you lose the HA capability given by Kafka Connect running in distributed mode. With the embedded mode, you are also switching from a configuration-driven approach to a code-driven one.
I found Debezium to be a very comprehensive solution, and it is open source, backed by Red Hat. That gives it not only credibility, but also the assurance that it will continue to be supported.
It provides rich configuration for including/excluding (whitelisting/blacklisting) databases, tables, and columns (with wildcard patterns), along with controls to limit the data captured from very large columns.
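For example (a sketch only; newer Debezium releases use include/exclude option names in place of whitelist/blacklist, and the table and column names here are placeholders):
"table.include.list":"public.orders,public.customers",
"column.exclude.list":"public.customers.ssn",
"column.truncate.to.1024.chars":"public.orders.notes"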
Since it is driven by the binlogs, you get not only the current state but also the previous state. This is ideal for audit trails, and you can customize it to build a proper sync to Elastic, with one topic per table.
Use of Kafka is necessary to account for HA and for latency when bulk updates are made on the DB, even though the primary and audit tables are in the same DB instance.

use already existing table for Confluent JDBC Sink Connector

Following the examples from the Confluent docs (https://docs.confluent.io/3.1.1/connect/connect-jdbc/docs/sink_connector.html), I am trying to get a solution working where I can reuse an already existing table from a previous legacy system, with all messages from a certain topic written into it via upsert.
So in general, based on the Confluent example, how could I write all messages from the topic orders into an already existing table called "myOrders", instead of auto-creating a new table in my database with the same name as the topic?
The table can be auto created if it does not exist by setting auto.create to true in the configuration.
However, out of the box the JDBC Sink Connector can also reuse your existing "myOrders" table, as long as the record schema is compatible with it.
Please see https://docs.confluent.io/current/connect/kafka-connect-jdbc/sink-connector/index.html#auto-creation-and-auto-evoluton
This applies to the current documentation and the Confluent Platform 3.1.1 version you linked.
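To point the connector at the existing table instead of one named after the topic, a configuration along these lines could work; this is a sketch, with the connection URL and key column (id) as assumptions. table.name.format overrides the topic-to-table mapping, and insert.mode set to upsert gives the upsert behaviour asked about:
"connector.class":"io.confluent.connect.jdbc.JdbcSinkConnector",
"topics":"orders",
"connection.url":"jdbc:postgresql://localhost:5432/legacydb",
"table.name.format":"myOrders",
"insert.mode":"upsert",
"pk.mode":"record_key",
"pk.fields":"id",
"auto.create":"false"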

Does Debezium support capturing Postgres schema change events?

Does Debezium support capturing Postgres schema changes like 'alter table xxx add/drop/alter column xxx'?
This seems like an old question, but in any case the short answer is yes. Check out the documentation here: https://debezium.io/documentation/reference/connectors/postgresql.html
With some exceptions:
The PostgreSQL connector retrieves schema information as part of the events sent by the logical decoding plug-in. However, the connector does not retrieve information about which columns compose the primary key. The connector obtains this information from the JDBC metadata (side channel). If the primary key definition of a table changes (by adding, removing or renaming primary key columns), there is a tiny period of time when the primary key information from JDBC is not synchronized with the change event that the logical decoding plug-in generates. During this tiny period, a message could be created with an inconsistent key structure. To prevent this inconsistency, update primary key structures as follows:
Put the database or an application into a read-only mode.
Let Debezium process all remaining events.
Stop Debezium.
Update the primary key definition in the relevant table.
Put the database or the application into read/write mode.
Restart Debezium.

How do I get data for more than one table in Talend using Oracle CDC?

We are trying to connect Talend to our Oracle 12c database using CDC. The tOracleCDC component uses Oracle XStream to do the actual change data capture work. The issue is that when creating the CDC endpoint in Oracle one creates an "Outbound Server" which listens for changes on a number of tables, or even a number of whole schemas.
In Talend, when configuring the tOracleCDC component, one of the required fields is "Table Using CDC", which in the generated Java code is used to filter the incoming change records with something like "TableName".equalsIgnoreCase(...)
This means that we can only get changes for a single table for a given XStream connection (and each connection will require a unique outbound server object in the database).
We must be missing something; how can we pull changes for multiple tables in Talend?
Thanks!
The solution is to use an empty string as the table name in the Table Using CDC field. This will cause the templating engine to not emit the table name check that was causing this problem.
I could not find this documented anywhere, so it might be unsupported, but examining the templates shows that it is the intended behavior.