Debezium topics and schema registry subject descriptions

When I create a Debezium connector, it creates many Kafka topics and Schema Registry subjects.
I am not sure what these topics and subjects are or what their purpose is.
My connector configuration:
{
  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
  "snapshot.locking.mode": "minimal",
  "database.user": "XXXXX",
  "tasks.max": "3",
  "database.history.kafka.bootstrap.servers": "XX:9092",
  "database.history.kafka.topic": "history.cdc.fkw.supply.marketplace.fk_sp_generic_checklist",
  "database.server.name": "cdc.fkw.supply.marketplace.fk_sp_generic_checklist",
  "heartbeat.interval.ms": "5000",
  "database.port": "3306",
  "table.whitelist": "fk_sp_generic_checklist.entity_checklist",
  "database.hostname": "abc.kcloud.in",
  "database.password": "XXXXXX",
  "database.history.kafka.recovery.poll.interval.ms": "5000",
  "name": "cdc.fkw.supply.marketplace1.fk_sp_generic_checklist.connector",
  "database.history.skip.unparseable.ddl": "true",
  "errors.tolerance": "all",
  "database.whitelist": "fk_sp_generic_checklist",
  "snapshot.mode": "when_needed"
}
Subjects created in the Schema Registry:
1) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist-key
2) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist-value
3) cdc.fkw.supply.marketplace.fk_sp_generic_checklist-key
4) cdc.fkw.supply.marketplace.fk_sp_generic_checklist-value
5) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-key
6) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-value
7) tr.cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-value
The Kafka topics that were created are:
1) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist
2) cdc.fkw.supply.marketplace.fk_sp_generic_checklist
3) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist
4) history.cdc.fkw.supply.marketplace.fk_sp_generic_checklist
Questions:
What is the purpose of the subjects and topics based on my above connector configuration?
What if I delete my connector and then create a new one with the same name and the same database/tables? Will the data be ingested from the beginning?
Is there a way to delete the entire connector and create a new one with the same name, but as a fresh connector? (This is in case I messed up some settings and want to delete the existing data and start fresh.)

What is the purpose of the [...] topics based on my above connector configuration
Obviously, Debezium reads each database-table into one topic.
Otherwise, you seem to have asked this already - What are the extra topics created when creating a Debezium source connector
[purpose of the] subjects
The subjects are all created because of your key.converter and value.converter configs (which are not shown). They are optional; no subjects would be registered if you had configured the JsonConverter, for example, instead of a Schema Registry-backed converter.
You have a -key and a -value schema for each topic that the connector is using; they map to the Kafka record key and value. This is not unique to Debezium. The tr.cdc... one seems to be extra; it doesn't refer to anything in the config shown, nor does it have an associated topic.
Sidenote: Avro keys are usually discouraged unless you have a specific purpose for them; keys are often IDs or other simple values used for comparison, partitioning, and compaction. If you modify a complex Avro key in any way (e.g. add/remove/rename fields), consumers that expect records for the same logical key to stay in order on the same partition will have issues.
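To make that concrete, here is a small, hypothetical Python sketch (JSON serialization and an MD5 hash stand in for Avro and Kafka's murmur2 default partitioner) showing how changing a structured key's schema changes the serialized bytes, and therefore the partition the same logical entity lands on:
import hashlib
import json

def partition_for(key_bytes, num_partitions=6):
    # Stand-in for the default partitioner: the partition is derived from a
    # hash of the *serialized* key bytes, not the logical field values.
    return int.from_bytes(hashlib.md5(key_bytes).digest()[:4], "big") % num_partitions

key_v1 = json.dumps({"id": 42}, sort_keys=True).encode()                  # original key schema
key_v2 = json.dumps({"id": 42, "region": "EU"}, sort_keys=True).encode()  # field added to the key

print(partition_for(key_v1), partition_for(key_v2))
# The two serializations generally hash to different partitions, so ordering
# and compaction guarantees for id=42 are silently broken.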
Delete and re-create ... Will it start from the beginning
With the same name, no. Source connectors store their progress in the internal Kafka Connect offsets topic, and the Debezium database history topic presumably also comes into play. You would need to manually change (or remove) those records to reset which database records get read.
Delete and start fresh.
Yes, deletes are possible. Refer to the Connect REST API's DELETE HTTP method. Then see the answer to (2) above.
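For reference, deleting a connector is a single HTTP call; a minimal sketch (assuming the Connect worker's REST API is reachable on localhost:8083 and Python's requests library) looks like this:
import requests

CONNECT_URL = "http://localhost:8083"  # hypothetical Connect worker address
NAME = "cdc.fkw.supply.marketplace1.fk_sp_generic_checklist.connector"

# Removes the connector definition only. This does NOT clear the stored source
# offsets, the database history topic, or the data/schemas already produced,
# which is why re-creating it with the same name resumes where it left off.
resp = requests.delete(f"{CONNECT_URL}/connectors/{NAME}")
resp.raise_for_status()
To truly start fresh with the same name, you would additionally have to remove the connector's entry from the Connect offsets topic and delete or recreate the topic named in database.history.kafka.topic before re-creating the connector.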

Related

Mongo Kafka Connector Collection Listen Limitations

We have several collections in Mongo based on n tenants and want the Kafka connector to only watch specific collections.
Below is my mongosource.properties file where I have added the pipeline filter to listen only to specific collections. It works:
pipeline=[{$match:{"ns.coll":{"$in":["ecom-tesla-cms-instance","ca-tesla-cms-instance","ecom-tesla-cms-page","ca-tesla-cms-page"]}}}]
The collections will grow in the future to maybe 200 collections which have to be watched. I wanted to know the three things below:
Is there some performance impact with one connector listening to a huge number of collections?
Is there any limit on the collections one connector can watch?
What would be the best practice: to run one connector listening to 100 collections, or 10 different connectors listening to 10 collections each?
Best practice would say to run many connectors, where "many" depends on your ability to maintain the overhead of them all.
Reason being - one connector creates a single point of failure (per task, but only one task should be assigned to any collection at a time, to prevent duplicates). If the Connect task fails with a non-retryable error, then that will halt the connector's tasks completely, and stop reading from all collections assigned to that connector.
You could also try Debezium, which might have less resource usage than the Mongo Source Connector since it acts as a replica rather than querying the collection at an interval.
You can listen to change streams from multiple MongoDB collections; you just need to provide a suitable regex for the collection names in the pipeline. You can even exclude one or more collections by providing a regex for the names you don't want to listen to.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
You can also exclude any given database using $nin, if you don't want to listen to any change streams from it.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/,\"$nin\":[/^any_database_name$/]}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
Coming to your questions:
Is there some performance impact with one connector listening to huge number of collections?
To my knowledge I don't think so, since it is not mentioned anywhere in the docs. You can listen to multiple mongo collections using a single connector.
Is there any limit on the collections one connector can watch?
Again to my knowledge there is no limit mentioned in docs.
What would be the best practice, to run one connector listening to 100 collections or 10 different connectors listening to 10 collections each?
From my point of view, creating N Kafka connectors, one for each collection, will be an operational overhead. Make sure you provide fault tolerance using the recommended configurations; just don't rely on the connector's default configuration.
Here is the basic Kafka connector configuration.
Mongo to Kafka source connector
{
  "name": "mongo-to-kafka-connect",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "publish.full.document.only": "true",
    "tasks.max": "3",
    "key.converter.schemas.enable": "false",
    "topic.creation.enable": "true",
    "poll.await.time.ms": 1000,
    "poll.max.batch.size": 100,
    "topic.prefix": "any prefix for topic name",
    "output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
    "connection.uri": "mongodb://<username>:<password>@ip:27017,ip:27017,ip:27017,ip:27017/?authSource=admin&replicaSet=xyz&tls=true",
    "value.converter.schemas.enable": "false",
    "copy.existing": "true",
    "topic.creation.default.replication.factor": 3,
    "topic.creation.default.partitions": 3,
    "topic.creation.compacted.cleanup.policy": "compact",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "mongo.errors.log.enable": "true",
    "heartbeat.interval.ms": 10000,
    "pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
  }
}
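To register this connector you just POST the JSON document above to the Connect REST API; a minimal sketch (assuming a worker at localhost:8083 and the document saved to mongo-to-kafka-connect.json, both hypothetical) is:
import json
import requests

with open("mongo-to-kafka-connect.json") as f:  # the {"name": ..., "config": ...} document above
    connector = json.load(f)

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(resp.status_code, resp.json())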
You can get more details from official docs.
Mongo docs: https://www.mongodb.com/docs/kafka-connector/current/source-connector/
Confluent docs: https://docs.confluent.io/platform/current/connect/index.html
Regex: https://www.mongodb.com/docs/manual/reference/operator/query/regex/#mongodb-query-op.-regex

PLC4X OPCUA - Kafka Connector

I want to use the PLC4X Connector (https://www.confluent.io/hub/apache/kafka-connect-plc4x-plc4j) to connect OPC UA (Prosys Simulation Server) with Kafka.
However, I really cannot find any website that describes the Kafka Connect configuration options.
I tried to connect to the Prosys OPC UA simulation server and then stream the data to a Kafka topic.
I managed to simply send the data and consume it; however, I want to use a schema and the Avro converter.
My output from my Python sink connector looks like this, which seems a bit strange to me too:
b'Struct{fields=Struct{ff=-5.4470555688606E8,hhh=Sean Ray MD},timestamp=1651838599206}'
How can I use the PLC4X connector with the Avro converter and a Schema?
Thanks!
{
  "connector.class": "org.apache.plc4x.kafka.Plc4xSourceConnector",
  "default.topic": "plcTestTopic",
  "connectionString": "opcua:tcp://127.0.0.1:12345",
  "tasks.max": "2",
  "sources": "machineA",
  "sources.machineA.connectionString": "opcua:tcp://127.0.0.1:12345",
  "sources.machineA.jobReferences": "jobA",
  "jobs": "jobA",
  "jobs.jobA.interval": "5000",
  "jobs.jobA.fields": "job1,job2",
  "jobs.jobA.fields.job1": "ns=2;i=2",
  "jobs.jobA.fields.job2": "ns=2;i=3"
}
When using a schema with Avro and the Confluent Schema Registry, the following settings should be used. You can also choose different settings for keys and values.
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://127.0.0.1:8081
value.converter.schema.registry.url=http://127.0.0.1:8081
key.converter.schemas.enable=true
value.converter.schemas.enable=true
Sample configuration files are also available in the PLC4X Github repository.
https://github.com/apache/plc4x/tree/develop/plc4j/integrations/apache-kafka/config
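Since the question mentions reading the topic from Python, here is a sketch of consuming the Avro records once the converter settings above are in place (assuming the confluent-kafka client, a local broker, and the topic/registry addresses from the configs above):
from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

schema_registry = SchemaRegistryClient({"url": "http://127.0.0.1:8081"})

consumer = DeserializingConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "plc4x-avro-reader",
    "auto.offset.reset": "earliest",
    "value.deserializer": AvroDeserializer(schema_registry),
})
consumer.subscribe(["plcTestTopic"])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    # With the AvroConverter on the Connect side, the value arrives as a plain
    # dict (e.g. {"fields": {...}, "timestamp": ...}) rather than a Struct{...} string.
    print(msg.value())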

Is there a way to use Kafka Connect with REST Proxy?

Kafka Connect source and sink connectors provide practically ideal feature set for configuring a data pipeline without writing any code. In my case I wanted to use it for integrating data from several DB servers (producers) located on the public Internet.
However, some producers don't have direct access to the Kafka brokers, as their network/firewall configuration allows traffic to a specific host only (port 443). And unfortunately I cannot really change these settings.
My thought was to use the Confluent REST Proxy, but I learned that Kafka Connect uses the KafkaProducer API, so it needs direct access to the brokers.
I found a couple possible workarounds but none is perfect:
SSH Tunnel - as described in: Consume from a Kafka Cluster through SSH Tunnel
Use REST Proxy but replace Kafka Connect with custom producers, mentioned in How can we configure kafka producer behind a firewall/proxy?
Use the sslh demultiplexer to route the traffic to a broker (but just one broker)
Has anyone faced similar challenge? How did you solve it?
Sink Connectors (ones that write to external systems) do not use the Producer API.
That being said, you could use some HTTP Sink Connector that issues POST requests to the REST Proxy endpoint. It's not ideal, but it would address the problem. Note: This means you have two clusters - one that you are consuming from in order to issue HTTP requests via Connect, and the other behind the proxy.
Overall, I don't see how the question is unique to Connect, since you'd have similar issues with any other attempt to write the data to Kafka through the only open HTTPS port.
As @OneCricketeer recommended, I tried an HTTP Sink Connector with the REST Proxy approach.
I managed to configure the Confluent HTTP Sink connector as well as an alternative one (github.com/llofberg/kafka-connect-rest) to work with the Confluent REST Proxy.
I'm adding the connector configurations in case it saves some time for anyone trying this approach.
Confluent HTTP Sink connector
{
  "name": "connector-sink-rest",
  "config": {
    "topics": "test",
    "tasks.max": "1",
    "connector.class": "io.confluent.connect.http.HttpSinkConnector",
    "headers": "Content-Type:application/vnd.kafka.json.v2+json",
    "http.api.url": "http://rest:8082/topics/test",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter.schemas.enable": "false",
    "batch.prefix": "{\"records\":[",
    "batch.suffix": "]}",
    "batch.max.size": "1",
    "regex.patterns": "^~$",
    "regex.replacements": "{\"value\":~}",
    "regex.separator": "~",
    "confluent.topic.bootstrap.servers": "localhost:9092",
    "confluent.topic.replication.factor": "1"
  }
}
Kafka Connect REST connector
{
  "name": "connector-sink-rest-v2",
  "config": {
    "connector.class": "com.tm.kafka.connect.rest.RestSinkConnector",
    "tasks.max": "1",
    "topics": "test",
    "rest.sink.url": "http://rest:8082/topics/test",
    "rest.sink.method": "POST",
    "rest.sink.headers": "Content-Type:application/vnd.kafka.json.v2+json",
    "transforms": "velocityEval",
    "transforms.velocityEval.type": "org.apache.kafka.connect.transforms.VelocityEval$Value",
    "transforms.velocityEval.template": "{\"records\":[{\"value\":$value}]}",
    "transforms.velocityEval.context": "{}"
  }
}
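For reference, the envelope those batch/transform settings build is just the REST Proxy v2 JSON produce format, so you can sanity-check the endpoint independently of Connect; a short sketch (assuming the same rest:8082 address and Python's requests library):
import requests

resp = requests.post(
    "http://rest:8082/topics/test",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    json={"records": [{"value": {"hello": "world"}}]},  # same {"records":[{"value":...}]} envelope
)
print(resp.status_code, resp.json())  # partition/offset per record on success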

Kafka topic seems to function first time only. Why?

I am working with Kafka Connect (using the Confluent implementation) and am seeing a strange behavior. I configure a source connection to pull data from a DB table, and populate a topic. This works.
But, if I delete the topic, remove the Source config, and then reset the config (perhaps adding another column to the query) the topic does not get populated. If I change the topic name to something I haven't used before, it works. I am using Postman to set the configuration, though I don't believe that matters here.
My Connect config:
{
  "name": "my-jdbc-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:db2://db2server.mycompany.com:4461/myDB",
    "connection.user": "dbUser",
    "connection.password": "dbPass",
    "dialect.name": "Db2DatabaseDialect",
    "mode": "timestamp",
    "query": "select fname, lname, custId, custRegion, lastUpdate from CustomerMaster",
    "timestamp.column.name": "lastUpdate",
    "table.types": "TABLE",
    "topic.prefix": "master.customer"
  }
}
The Kafka JDBC source connector keeps a high watermark on the timestamp column, i.e. lastUpdate in your case. This does not depend on the topic; even if you delete the JDBC connector and recreate it with the same name, it will still use the same high watermark, because the stored offset is keyed by the connector name. So even if you recreate the topic, the data will not be loaded again.
There are a few ways to reprocess the whole data set; you can follow any of these:
Drop the topic and delete the JDBC connector, recreate the topic, and create the JDBC connector with a different name; or
Delete the JDBC connector and recreate it with the same name but with "mode": "bulk". It will dump the whole table into the topic again. Once it has loaded, you can switch the mode back to timestamp.
Please refer to the JDBC connector configuration details:
https://docs.confluent.io/current/connect/kafka-connect-jdbc/source-connector/source_config_options.html
Alternatively, update lastUpdate for all records to the current timestamp so the timestamp query picks everything up again.
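The "high watermark" here is really the source offset that Connect stores per connector name in its offsets topic; a quick way to see what is stored is to read that topic. A sketch, assuming the default connect-offsets topic name and the confluent-kafka Python client:
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "offset-inspector",        # hypothetical group id
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["connect-offsets"])     # default offset.storage.topic; adjust if renamed

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    # Keys look like ["my-jdbc-connector", {...source partition...}]; values hold
    # the stored timestamp/incrementing offset. A recreated connector with the same
    # name resumes from these, which is why the topic stays empty the second time.
    print(msg.key(), msg.value())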

Kafka connector to read from CSV and convert into Avro

Is there any Kafka connector which reads from CSV and converts it into Avro before pushing it to the topic?
I have gone through the well-known https://github.com/jcustenborder/kafka-connect-spooldir, but it only reads and pushes to the topic.
I am planning to modify the code base for my custom use, but before I make the changes I just wanted to check if such a connector is already available.
kafka-connect-spooldir does do exactly what you describe. When you run it, you just need to set Kafka Connect to use the Avro converter. For example:
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
See https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained for more information about how converters and connectors relate.
Edit in response to your comment:
When I am using kafka-console-consumer I am seeing the data as
103693(2018-03-11T09:19:17Z Sugar - assa8.7
and when I am using kafka-avro-console-consumer the format is
{"order_id":{"string":"1035"},"customer_id":{"string":"93"},"order_ts":{"string":"2018-03-11T09:19:17Z"},"product":{"string":"Sugar - assa"},"order_total_usd":{"string":"8.7"}}.
This shows that it is Avro data on your topic. The whole point of kafka-avro-console-consumer is that it decodes the binary Avro data and renders it in a plain format. The output from kafka-console-consumer shows the raw Avro, parts of which may look human-readable (Sugar - assa) but others clearly not (103693).
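If you are curious what that raw output actually is: Confluent's Avro serialization prefixes every record with a zero magic byte and a 4-byte schema ID before the binary Avro payload, which is why kafka-console-consumer prints a mix of readable and unreadable characters. A tiny sketch (with a made-up byte string) to illustrate the framing:
import struct

def describe_confluent_avro(raw):
    # Confluent wire format: 1 magic byte (0x00), a 4-byte big-endian schema ID,
    # then the binary Avro payload.
    magic = raw[0]
    schema_id = struct.unpack(">I", raw[1:5])[0]
    print(f"magic={magic}, schema_id={schema_id}, payload={raw[5:]!r}")

describe_confluent_avro(b"\x00\x00\x00\x00\x07" + b"\x08103693...")  # made-up example bytes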