Kafka topic seems to function first time only. Why? - apache-kafka

I am working with Kafka Connect (using the Confluent implementation) and am seeing strange behavior. I configure a source connector to pull data from a DB table and populate a topic. This works.
But if I delete the topic, remove the source config, and then re-apply the config (perhaps adding another column to the query), the topic does not get populated. If I change the topic name to something I haven't used before, it works. I am using Postman to set the configuration, though I don't believe that matters here.
My Connect config:
{
  "name": "my-jdbc-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:db2://db2server.mycompany.com:4461/myDB",
    "connection.user": "dbUser",
    "connection.password": "dbPass",
    "dialect.name": "Db2DatabaseDialect",
    "mode": "timestamp",
    "query": "select fname, lname, custId, custRegion, lastUpdate from CustomerMaster",
    "timestamp.column.name": "lastUpdate",
    "table.types": "TABLE",
    "topic.prefix": "master.customer"
  }
}

The Kafka JDBC source connector keeps a high-watermark offset on the timestamp column, i.e. lastUpdate in your case. This offset does not depend on the topic; it is keyed by the connector name, so even if you delete the JDBC connector and recreate it with the same name, it will keep using the same high watermark. That is why recreating the topic does not load the data again.
To reprocess the whole data set, you can follow any of these ways:
Drop the topic and delete the JDBC connector, recreate the topic, and create the JDBC connector with a different name; or
Delete the JDBC connector and recreate it with the same name but with "mode": "bulk". It will dump the whole table into the topic again; once it has loaded, you can switch the mode back to timestamp; or
Update lastUpdate for all records to the current timestamp.
Please refer to the JDBC connector configuration details:
https://docs.confluent.io/current/connect/kafka-connect-jdbc/source-connector/source_config_options.html
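As a rough sketch of the bulk-mode option, assuming the Connect REST API is reachable on localhost:8083 (the host and port are assumptions here):

# Delete the existing connector (its stored offset is keyed by this name)
curl -X DELETE http://localhost:8083/connectors/my-jdbc-connector

# Recreate it with mode=bulk so the whole table is dumped again
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors --data '{
  "name": "my-jdbc-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:db2://db2server.mycompany.com:4461/myDB",
    "connection.user": "dbUser",
    "connection.password": "dbPass",
    "dialect.name": "Db2DatabaseDialect",
    "mode": "bulk",
    "query": "select fname, lname, custId, custRegion, lastUpdate from CustomerMaster",
    "table.types": "TABLE",
    "topic.prefix": "master.customer"
  }
}'

# Once the bulk load has finished, PUT the original timestamp-mode config
# back to http://localhost:8083/connectors/my-jdbc-connector/config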

Related

PLC4X OPCUA - Kafka Connector

I want to use the PLC4X Connector (https://www.confluent.io/hub/apache/kafka-connect-plc4x-plc4j) to connect OPC UA (Prosys Simulation Server) with Kafka.
However, I cannot really find any documentation that describes the Kafka Connect configuration options.
I tried to connect to the Prosys OPC UA simulation server and then stream the data to a Kafka topic.
I managed to simply send the data and consume it, but I want to use a schema and the Avro converter.
The output from my Python consumer on the sink side looks like this, which also seems a bit strange to me:
b'Struct{fields=Struct{ff=-5.4470555688606E8,hhh=Sean Ray MD},timestamp=1651838599206}'
How can I use the PLC4X connector with the Avro converter and a Schema?
Thanks!
{
  "connector.class": "org.apache.plc4x.kafka.Plc4xSourceConnector",
  "default.topic": "plcTestTopic",
  "connectionString": "opcua.tcp://127.0.0.1:12345",
  "tasks.max": "2",
  "sources": "machineA",
  "sources.machineA.connectionString": "opcua:tcp://127.0.0.1:12345",
  "sources.machineA.jobReferences": "jobA",
  "jobs": "jobA",
  "jobs.jobA.interval": "5000",
  "jobs.jobA.fields": "job1,job2",
  "jobs.jobA.fields.job1": "ns=2;i=2",
  "jobs.jobA.fields.job2": "ns=2;i=3"
}
When using a schema with Avro and the Confluent Schema Registry, the following settings should be used. You can also choose to use different settings for keys and values.
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://127.0.0.1:8081
value.converter.schema.registry.url=http://127.0.0.1:8081
key.converter.schemas.enable=true
value.converter.schemas.enable=true
Sample configuration files are also available in the PLC4X Github repository.
https://github.com/apache/plc4x/tree/develop/plc4j/integrations/apache-kafka/config
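If you run Connect in distributed mode and submit the connector over the REST API, the same converter settings can also be applied per connector instead of worker-wide. A minimal sketch, assuming the Schema Registry is at http://127.0.0.1:8081 and keeping the sources/jobs settings from the question unchanged:

{
  "connector.class": "org.apache.plc4x.kafka.Plc4xSourceConnector",
  "default.topic": "plcTestTopic",
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "http://127.0.0.1:8081",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://127.0.0.1:8081"
}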

Is there a way of telling a sink connector in Kafka Connect how to look for schema entries

I have successfully set up Kafka Connect in distributed mode locally with the Confluent BigQuery connector. The topics are being made available to me by another party; I am simply moving these topics into my Kafka Connect on my local machine, and then to the sink connector (and thus into BigQuery).
Because the topics are created by someone else, the schema registry is also managed by them. So in my config I set "schema.registry.url": https://url-to-schema-registry, but we have multiple topics which all use the same schema entry, which is located at, let's say, https://url-to-schema-registry/subjects/generic-entry-value/versions/1.
What is happening, however, is that Connect is looking for the schema entry based on the topic name. So let's say my topic is my-topic. Connect is looking for the entry at this URL: https://url-to-schema-registry/subjects/my-topic-value/versions/1. But instead, I want to use the entry located at https://url-to-schema-registry/subjects/generic-entry-value/versions/1, and I want to do so for any and all topics.
How can I make this change? I have tried looking at this doc: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#configuration-details as well as this class: https://github.com/confluentinc/schema-registry/blob/master/schema-serializer/src/main/java/io/confluent/kafka/serializers/subject/TopicRecordNameStrategy.java
but this looks to be a config parameter for the schema registry itself (which I have no control over), not the sink connector. Unless I'm not configuring something correctly.
Is there a way for me to configure my sink connector to look for a specified schema entry like generic-entry-value/versions/..., instead of the default format topic-name-value/versions/...?
The strategy is configurable at the connector level.
e.g.
value.converter.value.subject.name.strategy=...
However, the only built-in strategies are for Topic and/or RecordName lookups. You'll need to write your own strategy class for a static lookup of "generic-entry", unless you can simply copy the "generic-entry-value" schema into new subjects, e.g.
# get the output of this to a file (the registry returns the raw schema;
# wrap it as {"schema": "..."} before posting it again)
curl ... https://url-to-schema-registry/subjects/generic-entry-value/versions/1/schema
# upload it again, where "new-entry" is the name of the other topic
curl -XPOST -H "Content-Type: application/vnd.schemaregistry.v1+json" -d @schema.json https://url-to-schema-registry/subjects/new-entry-value/versions
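For completeness, here is a sketch of the per-connector override mentioned above, assuming a hypothetical custom class com.example.StaticSubjectNameStrategy that always returns "generic-entry-value" (both the class name and the exact sink settings are illustrative, not verified config):

{
  "name": "bigquery-sink",
  "config": {
    "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
    "topics": "my-topic",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "https://url-to-schema-registry",
    "value.converter.value.subject.name.strategy": "com.example.StaticSubjectNameStrategy"
  }
}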

Debezium topics and schema registry subject descriptions

When I create a Debezium connector, it creates many Kafka topics and schema registry subjects.
I am not sure what these topics and subjects are or what their purpose is.
My connector configuration:
{
  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
  "snapshot.locking.mode": "minimal",
  "database.user": "XXXXX",
  "tasks.max": "3",
  "database.history.kafka.bootstrap.servers": "XX:9092",
  "database.history.kafka.topic": "history.cdc.fkw.supply.marketplace.fk_sp_generic_checklist",
  "database.server.name": "cdc.fkw.supply.marketplace.fk_sp_generic_checklist",
  "heartbeat.interval.ms": "5000",
  "database.port": "3306",
  "table.whitelist": "fk_sp_generic_checklist.entity_checklist",
  "database.hostname": "abc.kcloud.in",
  "database.password": "XXXXXX",
  "database.history.kafka.recovery.poll.interval.ms": "5000",
  "name": "cdc.fkw.supply.marketplace1.fk_sp_generic_checklist.connector",
  "database.history.skip.unparseable.ddl": "true",
  "errors.tolerance": "all",
  "database.whitelist": "fk_sp_generic_checklist",
  "snapshot.mode": "when_needed"
}
Subjects created in the schema registry:
1) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist-key
2) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist-value
3) cdc.fkw.supply.marketplace.fk_sp_generic_checklist-key
4) cdc.fkw.supply.marketplace.fk_sp_generic_checklist-value
5) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-key
6) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-value
7) tr.cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-value
The Kafka topics which got created are:
1) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist
2) cdc.fkw.supply.marketplace.fk_sp_generic_checklist
3) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist
4) history.cdc.fkw.supply.marketplace.fk_sp_generic_checklist
Questions:
What is the purpose of the subjects and topics based on my above connector configuration?
What if I deleted my connector and again created a new one with the same name and same database.tables? Will the data ingest from the beginning?
Is there a way to delete the entire connector and create a new one with the same name but as a fresh connector? (This is in case I messed up with some settings and then want to delete the existing data and create a fresh one)
What is the purpose of the [...] topics based on my above connector configuration
Obviously, Debezium reads each database table into one topic.
Otherwise, you seem to have asked this already - What are the extra topics created when creating a Debezium source connector
[purpose of the] subjects
The subjects are all created because of your key.converter and value.converter configs (which are not shown). They would not exist if, for example, you had configured JsonConverter instead of using the Schema Registry.
You have a -key and a -value schema for each topic that the connector is using; they map to the Kafka record key-value pairs. This is not unique to Debezium. The tr.cdc... one seems to be extra: it doesn't refer to anything in the config shown, nor does it have an associated topic.
Sidenote: Avro keys are usually discouraged unless you have a specific need for them; keys are often IDs or simple values used for comparison, partitioning, and compaction. If you modify a complex Avro key in any way (e.g. add/remove/rename fields), the serialized key changes, so consumers that expect those records to stay in order with (and in the same partition as) earlier records for the same logical key will have issues.
Delete and re-create ... Will it start from the beginning
With the same name, no. Source connectors store their progress in the internal Kafka Connect offsets topic, and the Debezium database history topic also comes into effect, I assume. You would need to manually change those events to reset which database records get read.
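For example, you can inspect what the connector has stored so far; a sketch assuming kcat (kafkacat) is installed and the worker uses the default connect-offsets topic name:

# keys look like ["<connector-name>", {<source partition>}], values hold the stored offset
kcat -C -b localhost:9092 -t connect-offsets -f 'key=%k value=%s\n'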
Delete and start fresh.
Yes, deletes are possible. Refer to the Connect REST API's DELETE method, then read the answer to (2) above.
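A sketch, assuming the Connect REST API is reachable on localhost:8083 (host and port are assumptions):

# Remove the connector. Note this does NOT remove its entries in the offsets topic
# or the database history topic, so a new connector with the same name resumes where it left off.
curl -X DELETE http://localhost:8083/connectors/cdc.fkw.supply.marketplace1.fk_sp_generic_checklist.connector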

Kafka connect JDBC sink connector - tables without Primary Key

I want to replicate 100+ tables from an old MySQL server to PostgreSQL. I have set up Debezium, which is running fine. Now I want to set up a JDBC sink connector to PostgreSQL. I also need to enable DELETEs.
In this case, how can I configure my sink connector for the tables without a primary key?
It should replicate Insert, Update, and delete.
I have used Debezium and Kafka Connect to Postgres. I specified the key column with pk.mode=record_key and pk.fields=name_of_pk_column, because Kafka Connect needs it to be able to delete (delete.enabled=true) and update. You can also set auto.evolve=true.
If you set delete.enabled to true, you have to specify the PK. The key column must be the same in the source and destination tables.
But this is not handy if you need to transfer more than a few tables.
I have not tried it, but I think a single message transform (SMT) might obtain the key from the Kafka message, transform it (maybe rename it), and use it in the Kafka Connect Postgres settings.
I actually have 1:1 databases on both sides, so the PK column name is the only thing I have to worry about.
If you want, I can provide my settings for Kafka Connect and the Postgres sink:
connector-pg.properties
name=sink-postgres
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
#tasks.max=2
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true
# my Debezium topic with the CDC stream from the source DB
topics=pokladna.public.books
connection.url=jdbc:postgresql://localhost:5434/postgres
connection.user=postgres
connection.password=postgres
dialect.name=PostgreSqlDatabaseDialect
# target table in the same DB, in this case given by the URL
table.name.format=books_kafka
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
auto.create=true
auto.evolve=true
# plain insert does not work for updates
insert.mode=upsert
delete.enabled=true
# PK column name, the same in target and source DB
pk.fields=id
pk.mode=record_key
Try to use the ExtractField SMT; for me it hasn't worked yet. I still have no clue how to extract the primary key from the Kafka message automatically. Maybe I have to write my own SMT...
Actually, I'm going to start a new SO topic about it. Haha
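For reference, the built-in ExtractField transform mentioned above would be wired in roughly like this (a sketch only, assuming the record key is a Struct containing an id field; it replaces the transforms line in connector-pg.properties):

# chain a key transform after unwrap
transforms=unwrap,extractKey
transforms.extractKey.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.extractKey.field=id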

Include the key from a Kafka message with a connect sink HDFS connector

I'm using the Kafka Connect HDFS sink connector to write to HDFS from Kafka, and it is working fine. My messages look like this:
key: my-key
value: {
"name": "helen"
}
My use case is that I need to append the key of my messages to the events I send to HDFS.
The problem is that the key doesn't appear in the value payload, so I cannot use:
"partitioner.class":
"io.confluent.connect.hdfs.partitioner.FieldPartitioner",
"partition.field.name": "key",
My question is: how can I add the key to the message I send to HDFS, or how can I partition based on the key?
Out of the box, you can't (the same goes for S3 Connect); this is just based on the way the code is written, not a limitation of the Connect framework.
Edit - For the S3 sink, at least, I think there is now a property to include the keys
At the very least, you would need to build and add this SMT to your Connect workers, which will "move" the key, topic, and partition over into the "value" of the Connect record before writing to storage:
https://github.com/jcustenborder/kafka-connect-transform-archive
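A rough sketch of how that transform might be wired into the HDFS sink config once the JAR from that repository is on the worker's plugin path (the transform class name and the resulting "key" field come from that project and should be checked against its documentation):

"transforms": "archive",
"transforms.archive.type": "com.github.jcustenborder.kafka.connect.archive.Archive",
"partitioner.class": "io.confluent.connect.hdfs.partitioner.FieldPartitioner",
"partition.field.name": "key"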