Using SMTs to create a string partition key for a Kafka connector through connector config - apache-kafka

I've been implementing a Kafka connector for PostgreSQL (I'm using the Debezium Kafka connector and running all the pieces through Docker). I need a custom partition key, so I've been using an SMT to achieve this. However, the approach I'm using creates a Struct, and I need it to be a string. This article runs through how to set up the partition key as an int, but I can't access the config file to set up the appropriate transforms. Currently my Kafka connector looks like this:
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{
"name": "connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "postgres",
"database.password": "password",
"database.dbname" : "postgres",
"database.server.name": "postgres",
"table.include.list": "public.table",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "schema-changes.table",
"transforms": "routeRecords,unwrap,createkey",
"transforms.routeRecords.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.routeRecords.regex": "(.*)",
"transforms.routeRecords.replacement": "table",
"transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState",
"transforms.createkey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createkey.fields": "id"
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter"
}
}'
I know that I have to extract the value of the column, but I'm just not sure how.

ValueToKey creates a Struct from a list of fields, as documented.
You need one more transform to extract a specific field from a Struct, as shown in the linked post:
org.apache.kafka.connect.transforms.ExtractField$Key
Note: This does not "set" the partition of the actual Kafka record, only the key, which is then hashed by the producer to determine the partition.
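For example, a minimal sketch of how the transform chain could look with an ExtractField$Key step appended (the extractkey alias and its placement are illustrative, not from the original config); the key then becomes the bare id value, which the StringConverter serializes as a string:
"transforms": "routeRecords,unwrap,createkey,extractkey",
"transforms.createkey.type": "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createkey.fields": "id",
"transforms.extractkey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractkey.field": "id",
"key.converter": "org.apache.kafka.connect.storage.StringConverter"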

Related

Configure Debezium CDC -> Kafka -> JDBC Sink (Multiple tables)

We have around 100 tables in a SQL Server DB (application DB) which need to be synced to a SQL Server DB (for analytics) in near real time.
Future use case: scale the proof of concept for 30 source DBs to one destination DB (for analytics) in near real time.
I am thinking of using one sink connector or a few sink connectors for multiple tables. Please let me know if this is a good idea.
But I am not sure how to configure the sink to cater for multiple tables, especially since each table might have its own primary key. The internet seems to have very simple sink connector examples, but nothing addressing complex use cases.
Debezium CDC(Source) config
{ "name": "wwi",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.dbname": "************************",
"database.history": "io.debezium.relational.history.MemoryDatabaseHistory",
"database.hostname": "**********************",
"database.password": "**********************",
"database.port": "1433",
"database.server.name": "******",
"database.user": "*********",
"decimal.handling.mode": "string",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true",
"snapshot.mode": "schema_only",
"table.include.list": "Sales.Orders,Warehouse.StockItems",
"tasks.max": "1",
"tombstones.on.delete": "false",
"transforms": "route,unwrap",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter.schemas.enable": "true",
"value.convertor": "org.apache.kafka.connect.json.JsonConverter"
}
}
JDBC Sink config
{
"name": "sqlsinkcon",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "orders",
"tasks.max": "1",
"auto.evolve": "true",
"connection.user": "********",
"auto.create": "true",
"connection.url": "jdbc:sqlserver://************",
"insert.mode": "upsert",
"pk.mode":"record_key",
"pk.fields":"OrderID",
"db.name": "kafkadestination"
}
}
The sink will write one table per consumed topic. topics or topics.regex can be used to consume multiple topics at once.
Regarding scalability (or at least fault tolerance), I prefer one sink task with one topic (therefore writing to one table). Otherwise, if you consume multiple topics and the connector fails, it can potentially crash all the task threads due to consumer rebalancing.
Also, using JSON / plaintext formats in Kafka isn't the most optimal in terms of network bandwidth. I'd suggest a binary format like Avro or Protobuf.
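For illustration, a minimal sketch of a multi-topic sink along those lines (the connector name and topic regex are assumptions; it relies on the routed topic names matching the table names, and on pk.mode=record_key with pk.fields left unset so each table's primary key is taken from its record key):
{
"name": "sqlsinkcon-multi",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics.regex": "Orders|StockItems",
"tasks.max": "1",
"connection.url": "jdbc:sqlserver://************",
"connection.user": "********",
"insert.mode": "upsert",
"pk.mode": "record_key",
"auto.create": "true",
"auto.evolve": "true"
}
}
Whether to use one connector over topics.regex or one connector per topic comes down to the fault-tolerance trade-off described above.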

Sink data from DLQ topic to another table as CLOB

I'm using a connector to sink records from a topic to a DB. There are also some values redirected to a DLQ (dead letter queue). The records in the DLQ may contain wrong types, wrong sizes, non-Avro values, etc. What I want to do is sink all of those records to an Oracle DB table. This table will only have two columns: a CLOB for the entire message and a record date.
To sink from Kafka, we need a schema. Since this topic will contain many types of records, I can't create a proper schema (or can I?). I just want to sink the messages as a whole; how can I achieve this?
I've tried it with this schema and connector:
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema": "{\"type\":\"record\",\"name\":\"DLQ_TEST\",\"namespace\":\"DLQ_TEST\",\"fields\":[
{\"name\":\"VALUE\",\"type\":[\"null\",\"string\"]},
{\"name\":\"RECORDDATE\",\"type\":[\"null\",\"long\"]}]}"}' http://server:8071/subjects/DLQ_INSERT-value/versions
curl -i -X PUT -H "Content-Type:application/json" http://server:8083/connectors/sink_DLQ_INSERT/config -d
'{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url": "jdbc:oracle:thin:#oracleserver:1357/VITORCT",
"table.name.format": "GLOBAL.DLQ_TEST_DLQ",
"connection.password": "${file:/kafka/vty/pass.properties:vitweb_pwd}",
"connection.user": "${file:/kafka/vty/pass.properties:vitweb_user}",
"tasks.max": "1",
"topics": "DLQ_TEST_dlq",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "true",
"auto.create": "false",
"insert.mode": "insert",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter"
}'
I don't exactly understand how to make the connector use this schema.

table.include.list configuration parameter not working in Debezium Postgres Connector

I am using Debezium Postgres connector to capture changes in postgres tables.
The documentation for this connector
https://debezium.io/documentation/reference/connectors/postgresql.html
mentions a configuration parameter
table.include.list
However, even when I set the value of this parameter to 'config.abc', changes from both tables in the config schema (namely abc and def) are still getting streamed.
The reason I want to do this is that I want to create 2 separate connectors, one for each of the 2 tables, to split the load and get faster change data streaming.
Is this a known issue? Any way to overcome this?
I have the same problem here (Debezium version: 1.1.0.Final and postgresql:11.13.0). Below is my configuration:
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"transforms.unwrap.delete.handling.mode": "rewrite",
"slot.name": "debezium_planning",
"transforms": "unwrap,extractInt",
"include.schema.changes": "false",
"decimal.handling.mode": "string",
"database.schema": "partsmaster",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.converters.LongConverter",
"database.user": "************",
"database.dbname": "smartng-db",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"database.server.name": "ASLog.m3d.smartng-db",
"database.port": "***********",
"plugin.name": "pgoutput",
"column.exclude.list": "create_date, create_user, change_date, change_user",
"transforms.extractInt.field": "planning_id",
"database.hostname": "************",
"database.password": "*************",
"name": "ASLog.m3d.source-postgres-smartng-planning",
"transforms.unwrap.add.fields": "op,table,lsn,source.ts_ms",
"table.include.list": "partsmaster.planning",
"snapshot.mode": "never"
}
A change in another table partsmaster.basic is causing this connector to fail because the attribute planning_id is not available in table partsmaster.basic.
I need separate connectors for each table. table.include.list is not working, and neither is table.exclude.list.
My last resort would be to use a separate Postgres publication for each connector. But I still hope I can find an immediate configuration solution here, or be pointed to what I am missing. In a previous version I used table.whitelist without problems.
Solution:
Create separate publications for each table:
CREATE PUBLICATION dbz_planning FOR TABLE partsmaster.planning;
Drop the previous replication slot:
select pg_drop_replication_slot('debezium_planning');
Change your connector configurations:
{
"slot.name": "dbz_planning",
"publication.name": "planning_publication"
}
et voilà!
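For completeness, a sketch of how a second, per-table connector could mirror this for partsmaster.basic (the slot, publication, and connector names below are illustrative, not from the original post):
CREATE PUBLICATION dbz_basic FOR TABLE partsmaster.basic;
{
"name": "ASLog.m3d.source-postgres-smartng-basic",
"slot.name": "dbz_basic",
"publication.name": "dbz_basic",
"table.include.list": "partsmaster.basic"
}
Each connector then reads only from its own publication and replication slot.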

How to make Instaclustr Kafka Sink Connector work with Avro serialized value to postgres?

I have a Kafka topic with Avro-serialized values.
I am trying to set up a JDBC (Postgres) sink connector to dump these messages into a Postgres table.
But I am getting the below error:
"org.apache.kafka.common.config.ConfigException: Invalid value io.confluent.connect.avro.AvroConverter for configuration value.converter: Class io.confluent.connect.avro.AvroConverter could not be found."
My Sink.json is
{"name": "postgres-sink",
"config": {
"connector.class":"io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max":"1",
"topics": "<topic_name>",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "instaclustr_schema_registry_host:8085",
"connection.url": "jdbc:postgresql://postgres:5432/postgres?currentSchema=local",
"connection.user": "postgres",
"connection.password": "postgres",
"auto.create": "true",
"auto.evolve":"true",
"pk.mode":"none",
"table.name.format": "<table_name>"
}
}
Also, I have made changes in connect-distributed.properties (bootstrap servers).
The command I am running is:
curl -X POST -H "Content-Type: application/json" --data @postgres-sink.json https://<instaclustr_schema_registry_host>:8083/connectors
io.confluent.connect.avro.AvroConverter is not part of the Apache Kafka distribution. You can either just run Apache Kafka as part of Confluent Platform (which ships with the converter and is easier) or you can download it separately and install it yourself.
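For example, if you install it separately, one common route is to pull it from Confluent Hub onto each Connect worker and make sure the worker's plugin.path covers the install directory (the version tag and paths below are illustrative):
confluent-hub install confluentinc/kafka-connect-avro-converter:latest
# in connect-distributed.properties, for example:
plugin.path=/usr/share/java,/usr/share/confluent-hub-components
Then restart the Connect workers so the new converter classes are picked up.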

Kafka cannot stream database activity

I'm streaming my PostgreSQL database using Confluent (./kafka-avro-console-consumer). Streaming works, but Kafka only shows INSERT activity. Other activity such as DELETE, UPDATE, and CREATE TABLE doesn't show up in my Kafka consumer.
I did:
Install PostgreSQL, including creating the role and the tables
Install Debezium CDC for PostgreSQL (I didn't use json/jdbc at all)
Install wal2json
Make the connector in Confluent
Topics are created automatically after successfully deploying the connector
Stream the topics: ./kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic debezium.public.emp_bio --from-beginning
This is my connector
{
"name": "postgres-connector-1",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "localhost",
"database.port": "5432",
"database.user": "dbuser1",
"database.password": "password",
"database.dbname":"testdb",
"database.server.name": "debezium",
"database.whitelist": "testdb",
"plugin.name": "wal2json",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "postgres-hist-test",
"include.schema.changes": "true"
}
}
Expected: I can stream other activities like DELETE, DROP, UPDATE, CREATE from my PostgreSQL
Error: Sorry, there is no error message anywhere!