Is there a way to use Kafka Connect with REST Proxy?

Kafka Connect source and sink connectors provide a practically ideal feature set for configuring a data pipeline without writing any code. In my case I wanted to use it to integrate data from several DB servers (producers) located on the public Internet.
However, some producers don't have direct access to the Kafka brokers, because their network/firewall configuration allows traffic to one specific host only (port 443). And unfortunately I cannot really change these settings.
My thought was to use the Confluent REST Proxy, but I learned that Kafka Connect uses the KafkaProducer API, so it needs direct access to the brokers.
I found a couple of possible workarounds, but none is perfect:
SSH tunnel - as described in: Consume from a Kafka Cluster through SSH Tunnel
Use the REST Proxy but replace Kafka Connect with custom producers, as mentioned in How can we configure kafka producer behind a firewall/proxy?
Use an sslh demultiplexer to route the traffic to a broker (but just one broker)
Has anyone faced a similar challenge? How did you solve it?

Sink Connectors (ones that write to external systems) do not use the Producer API.
That being said, you could use an HTTP Sink Connector that issues POST requests to the REST Proxy endpoint. It's not ideal, but it would address the problem. Note: this means you have two clusters - one that you are consuming from in order to issue HTTP requests via Connect, and the other behind the proxy.
Overall, I don't see how the question is unique to Connect, since you'd have similar issues with any other attempt to write the data to Kafka through the only open HTTPS port.
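For reference, whichever HTTP sink is used, it ultimately has to issue requests in the REST Proxy's v2 embedded-JSON produce format. A minimal sketch of such a request, assuming the proxy is reachable at http://rest:8082 and the target topic is test:
# produce one JSON record to topic "test" through the Confluent REST Proxy
curl -X POST http://rest:8082/topics/test \
  -H "Content-Type: application/vnd.kafka.json.v2+json" \
  -d '{"records":[{"value":{"someKey":"someValue"}}]}'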

As @OneCricketeer recommended, I tried the HTTP Sink Connector with REST Proxy approach.
I managed to configure the Confluent HTTP Sink connector, as well as an alternative one (github.com/llofberg/kafka-connect-rest), to work with the Confluent REST Proxy.
I'm adding the connector configurations in case they save some time for anyone trying this approach.
Confluent HTTP Sink connector
{
  "name": "connector-sink-rest",
  "config": {
    "topics": "test",
    "tasks.max": "1",
    "connector.class": "io.confluent.connect.http.HttpSinkConnector",
    "headers": "Content-Type:application/vnd.kafka.json.v2+json",
    "http.api.url": "http://rest:8082/topics/test",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter.schemas.enable": "false",
    "batch.prefix": "{\"records\":[",
    "batch.suffix": "]}",
    "batch.max.size": "1",
    "regex.patterns": "^~$",
    "regex.replacements": "{\"value\":~}",
    "regex.separator": "~",
    "confluent.topic.bootstrap.servers": "localhost:9092",
    "confluent.topic.replication.factor": "1"
  }
}
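With this configuration, the regex replacement wraps each record value as {"value":<record>} and the batch prefix/suffix add the surrounding envelope, so a record whose value is {"someKey":"someValue"} (an example value, not from the original post) should be posted to the REST Proxy with a body like:
{"records":[{"value":{"someKey":"someValue"}}]}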
Kafka Connect REST connector
{
  "name": "connector-sink-rest-v2",
  "config": {
    "connector.class": "com.tm.kafka.connect.rest.RestSinkConnector",
    "tasks.max": "1",
    "topics": "test",
    "rest.sink.url": "http://rest:8082/topics/test",
    "rest.sink.method": "POST",
    "rest.sink.headers": "Content-Type:application/vnd.kafka.json.v2+json",
    "transforms": "velocityEval",
    "transforms.velocityEval.type": "org.apache.kafka.connect.transforms.VelocityEval$Value",
    "transforms.velocityEval.template": "{\"records\":[{\"value\":$value}]}",
    "transforms.velocityEval.context": "{}"
  }
}


Mongo Kafka Connector Collection Listen Limitations

We have several collections in Mongo, based on n tenants, and want the Kafka connector to only watch specific collections.
Below is my mongosource.properties file, where I have added the pipeline filter to listen only to specific collections. It works:
pipeline=[{"$match":{"ns.coll":{"$in":["ecom-tesla-cms-instance","ca-tesla-cms-instance","ecom-tesla-cms-page","ca-tesla-cms-page"]}}}]
The collections will grow in the future, maybe to 200 collections which have to be watched. I wanted to know the below three things:
Is there some performance impact with one connector listening to a huge number of collections?
Is there any limit on the collections one connector can watch?
What would be the best practice: to run one connector listening to 100 collections, or 10 different connectors listening to 10 collections each?
Best practice would say to run many connectors, where "many" depends on your ability to maintain the overhead of them all.
Reason being - one connector creates a single point of failure (per task, but only one task should be assigned to any collection at a time, to prevent duplicates). If the Connect task fails with a non-retryable error, then that will halt the connector's tasks completely, and stop reading from all collections assigned to that connector.
You could also try Debezium, which might have less resource usage than the Mongo Source Connector since it acts as a replica rather than querying the collection at an interval.
You can listen to multiple change streams from multiple Mongo collections; you just need to provide a suitable regex for the collection names in the pipeline. You can even exclude a collection (or collections) by providing a regex for the ones from which you don't want to listen to any change streams.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
You can even exclude any given database using $nin, if you don't want to listen to any of its change streams.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/,\"$nin\":[/^any_database_name$/]}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
Coming to your questions:
Is there some performance impact with one connector listening to a huge number of collections?
To my knowledge I don't think so, since it is not mentioned anywhere in the docs as a limitation. You can listen to multiple Mongo collections using a single connector.
Is there any limit on the collections one connector can watch?
Again, to my knowledge there is no limit mentioned in the docs.
What would be the best practice: to run one connector listening to 100 collections, or 10 different connectors listening to 10 collections each?
From my point of view it would be overhead to create N Kafka connectors, one per collection. Just make sure you provide fault tolerance using the recommended configurations, and don't rely on the connector's default configuration.
Here is the basic Kafka connector configuration.
Mongo to Kafka source connector
{
  "name": "mongo-to-kafka-connect",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "publish.full.document.only": "true",
    "tasks.max": "3",
    "key.converter.schemas.enable": "false",
    "topic.creation.enable": "true",
    "poll.await.time.ms": 1000,
    "poll.max.batch.size": 100,
    "topic.prefix": "any prefix for topic name",
    "output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
    "connection.uri": "mongodb://<username>:<password>@ip:27017,ip:27017,ip:27017,ip:27017/?authSource=admin&replicaSet=xyz&tls=true",
    "value.converter.schemas.enable": "false",
    "copy.existing": "true",
    "topic.creation.default.replication.factor": 3,
    "topic.creation.default.partitions": 3,
    "topic.creation.compacted.cleanup.policy": "compact",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "mongo.errors.log.enable": "true",
    "heartbeat.interval.ms": 10000,
    "pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
  }
}
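To register this configuration, it can be posted to the Connect REST API - a sketch, assuming the Connect worker listens on localhost:8083 and the JSON above is saved as mongo-source.json (a placeholder file name):
# create the Mongo source connector from the JSON file above
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @mongo-source.json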
You can get more details from the official docs:
Mongo docs: https://www.mongodb.com/docs/kafka-connector/current/source-connector/
Confluent docs: https://docs.confluent.io/platform/current/connect/index.html
Regex: https://www.mongodb.com/docs/manual/reference/operator/query/regex/#mongodb-query-op.-regex

PLC4X OPC UA - Kafka Connector

I want to use the PLC4X Connector (https://www.confluent.io/hub/apache/kafka-connect-plc4x-plc4j) to connect OPC UA (Prosys Simulation Server) with Kafka.
However, I really cannot find any website that describes the Kafka Connect configuration options.
I tried to connect to the Prosys OPC UA simulation server and then stream the data to a Kafka topic.
I managed to simply send the data and consume it; however, I want to use a schema and the Avro converter.
The output from my Python sink connector looks like this, which also seems a bit strange to me:
b'Struct{fields=Struct{ff=-5.4470555688606E8,hhh=Sean Ray MD},timestamp=1651838599206}'
How can I use the PLC4X connector with the Avro converter and a Schema?
Thanks!
{
  "connector.class": "org.apache.plc4x.kafka.Plc4xSourceConnector",
  "default.topic": "plcTestTopic",
  "connectionString": "opcua:tcp://127.0.0.1:12345",
  "tasks.max": "2",
  "sources": "machineA",
  "sources.machineA.connectionString": "opcua:tcp://127.0.0.1:12345",
  "sources.machineA.jobReferences": "jobA",
  "jobs": "jobA",
  "jobs.jobA.interval": "5000",
  "jobs.jobA.fields": "job1,job2",
  "jobs.jobA.fields.job1": "ns=2;i=2",
  "jobs.jobA.fields.job2": "ns=2;i=3"
}
When using a schema with Avro and the Confluent Schema Registry, the following settings should be used. You can also choose to use different settings for keys and values.
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://127.0.0.1:8081
value.converter.schema.registry.url=http://127.0.0.1:8081
key.converter.schemas.enable=true
value.converter.schemas.enable=true
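If you register the connector through the Connect REST API rather than via the worker properties file, the same converter settings can also be set per connector inside the JSON config - a minimal sketch, assuming the same Schema Registry address as above:
{
  "connector.class": "org.apache.plc4x.kafka.Plc4xSourceConnector",
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "http://127.0.0.1:8081",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://127.0.0.1:8081"
}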
Sample configuration files are also available in the PLC4X Github repository.
https://github.com/apache/plc4x/tree/develop/plc4j/integrations/apache-kafka/config

Debezium topics and schema registry subject descriptions

When I create a Debezium connector, it creates many Kafka topics and Schema Registry subjects.
I am not sure what these topics and subjects are, or what their purpose is.
My connector configuration:
{
  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
  "snapshot.locking.mode": "minimal",
  "database.user": "XXXXX",
  "tasks.max": "3",
  "database.history.kafka.bootstrap.servers": "XX:9092",
  "database.history.kafka.topic": "history.cdc.fkw.supply.marketplace.fk_sp_generic_checklist",
  "database.server.name": "cdc.fkw.supply.marketplace.fk_sp_generic_checklist",
  "heartbeat.interval.ms": "5000",
  "database.port": "3306",
  "table.whitelist": "fk_sp_generic_checklist.entity_checklist",
  "database.hostname": "abc.kcloud.in",
  "database.password": "XXXXXX",
  "database.history.kafka.recovery.poll.interval.ms": "5000",
  "name": "cdc.fkw.supply.marketplace1.fk_sp_generic_checklist.connector",
  "database.history.skip.unparseable.ddl": "true",
  "errors.tolerance": "all",
  "database.whitelist": "fk_sp_generic_checklist",
  "snapshot.mode": "when_needed"
}
Subjects created in the Schema Registry:
1) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist-key
2) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist-value
3) cdc.fkw.supply.marketplace.fk_sp_generic_checklist-key
4) cdc.fkw.supply.marketplace.fk_sp_generic_checklist-value
5) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-key
6) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-value
7) tr.cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-value
The Kafka topics that got created are:
1) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist
2) cdc.fkw.supply.marketplace.fk_sp_generic_checklist
3) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist
4) history.cdc.fkw.supply.marketplace.fk_sp_generic_checklist
Questions:
What is the purpose of the subjects and topics based on my above connector configuration?
What if I delete my connector and create a new one with the same name and the same database/tables? Will the data be ingested from the beginning?
Is there a way to delete the entire connector and create a new one with the same name, but as a fresh connector? (This is in case I mess up some settings and then want to delete the existing data and start fresh.)
What is the purpose of the [...] topics based on my above connector configuration
Debezium reads each database table into its own topic.
Otherwise, you seem to have asked this already - What are the extra topics created when creating a debezium source connector
[purpose of the] subjects
The subjects are all created because of your key.converter and value.converter configs (which are not shown). They are optional; for example, there would be no subjects if you had configured JSONConverter instead of using the Schema Registry.
You have a -key and a -value schema for each topic that the connector is using, which map to the Kafka record key-value pairs. This is not unique to Debezium. The tr.cdc... one seems to be extra; it doesn't refer to anything in the config shown, nor does it have an associated topic name.
Side note: Avro keys are usually discouraged unless you have a specific purpose for them; keys are often IDs or simple values that are used for comparison, partitioning, and compaction. If you modify a complex Avro key in any way (e.g. add/remove/rename fields), consumers that expect those records to stay in order with, and compact against, earlier records under the previous key will have issues.
Delete and re-create ... Will it start from the beginning
With the same name, no. Source connectors use the internal Kafka Connect offsets topic, and the Debezium history topic also comes into effect, I assume. You would need to manually change those stored events to reset which database records get read.
Delete and start fresh.
Yes, deletes are possible - refer to the Connect REST API's DELETE HTTP method. Then read the answer to (2) above.
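For example, deleting and re-creating the connector through the Connect REST API could look like this - a sketch, assuming the Connect worker is on localhost:8083 and connector.json is a placeholder file holding the {"name": ..., "config": {...}} payload:
# delete the existing connector (name taken from the configuration above)
curl -X DELETE http://localhost:8083/connectors/cdc.fkw.supply.marketplace1.fk_sp_generic_checklist.connector
# note: this does not remove the topics, the registry subjects, or the committed source offsets
# re-create the connector from a JSON payload
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d @connector.json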

pubSubSource: Receiving the same message twice

Description
I have a pubSubSource connector in Kafka Connect distributed mode that simply reads from a Pub/Sub subscription and writes into a Kafka topic. The issue is that even if I publish a single message to GCP Pub/Sub, I get this message written twice to my Kafka topic.
How to reproduce
Deploy Kafka and Kafka connect
Create a connector with the below pubSubSource configuration:
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
  "name": "pubSubSource",
  "config": {
    "connector.class": "com.google.pubsub.kafka.source.CloudPubSubSourceConnector",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "tasks.max": "1",
    "cps.subscription": "pubsub-test-sub",
    "kafka.topic": "kafka-sub-topic",
    "cps.project": "test-project123",
    "gcp.credentials.file.path": "/tmp/gcp-creds/account-key.json"
  }
}'
Below are the Kafka Connect worker configurations:
"plugin.path": "/usr/share/java,/usr/share/confluent-hub-components"
"key.converter": "org.apache.kafka.connect.json.JsonConverter"
"value.converter": "org.apache.kafka.connect.json.JsonConverter"
"key.converter.schemas.enable": "false"
"value.converter.schemas.enable": "false"
"internal.key.converter": "org.apache.kafka.connect.json.JsonConverter"
"internal.value.converter": "org.apache.kafka.connect.json.JsonConverter"
"config.storage.replication.factor": "1"
"offset.storage.replication.factor": "1"
"status.storage.replication.factor": "1"
Publish a message to the PubSub topic using the below command:
gcloud pubsub topics publish test-topic --message='{"someKey":"someValue"}'
Read messages from the destination Kafka topic:
/usr/bin/kafka-console-consumer --bootstrap-server xx.xxx.xxx.xx:9092 --topic kafka-topic --from-beginning
# Output
{"someKey":"someValue"}
{"someKey":"someValue"}
Why is this happening, is there something that I am doing wrong?
I found the info below at https://cloud.google.com/pubsub/docs/faq and it seems you are facing the same issue. Could you try producing a larger message and see if the result is the same?
Details from the link:
Why are there too many duplicate messages?
Pub/Sub guarantees at-least-once message delivery, which means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and Pub/Sub is retrying the message delivery. This can be observed in the monitoring metrics pubsub.googleapis.com/subscription/pull_ack_message_operation_count for pull subscriptions, and pubsub.googleapis.com/subscription/push_request_count for push subscriptions. Look for elevated expired or webhook_timeout values in the /response_code. This is particularly likely if there are many small messages, since Pub/Sub may batch messages internally and a partially acknowledged batch will be fully redelivered.
Another possibility is that the subscriber is not acknowledging some messages because the code path processing those specific messages fails, and the Acknowledge call is never made; or the push endpoint never responds or responds with an error.
How do I detect duplicate messages?
Pub/Sub assigns a unique message_id to each message, which can be used to detect duplicate messages received by the subscriber. This will not, however, allow you to detect duplicates resulting from multiple publish requests on the same data. Detecting those will require a unique message identifier to be provided by the publisher. See Pub/Sub I/O for further discussion.
Sometimes, due to acknowledgement delay, Google Pub/Sub keeps retrying the messages to the Kafka broker. This issue can usually be avoided by configuring the acknowledgement deadline on the Google Pub/Sub subscription. This ensures your messages get enough time to be acknowledged, and the duplicates issue gets resolved.
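For example, the deadline on the subscription used above could be raised with gcloud - a sketch, where the 60-second value is just an illustration:
# extend the acknowledgement deadline on the Pub/Sub subscription
gcloud pubsub subscriptions update pubsub-test-sub --ack-deadline=60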
For more details, you can check out this page from the Confluent documentation:
https://docs.confluent.io/kafka-connect-gcp-pubsub/current/overview.html#too-many-duplicates

How to Connect Kafka to Postgres in Heroku

I have some Kafka consumers and producers running through my Kafka instance on my Heroku cluster. I'm looking to create a data sink connector to connect Kafka to PostgreSQL, to put data FROM Kafka TO my Heroku PostgreSQL instance. Pretty much like the Heroku docs, but one way.
I can't figure out the steps I need to take to achieve this.
The docs say to look at the GitLab or Confluence Ecosystem page, but I can't find any mention of Postgres in these.
Looking in the Confluent Kafka Connectors library, there seems to be something from Debezium, but I'm not running Confluent.
The diagram in the Heroku docs mentions a JDBC connector? I found this Postgres JDBC driver - should I be using this?
I'm happy to create a consumer and update Postgres manually as the data comes in, if that's what's needed, but I feel that Kafka to Postgres must be a common enough interface that there should be something out there to manage this.
I'm just looking for some high level help or examples to set me on the right path.
Thanks
You're almost there :)
Bear in mind that Kafka Connect is part of Apache Kafka, and you get a variety of connectors. Some (e.g. Debezium) are community projects from Red Hat, others (e.g. JDBC Sink) are community projects from Confluent.
The JDBC Sink connector will let you stream data from Kafka to a database with a JDBC driver - such as Postgres.
Here's an example configuration:
{
  "connector.class"    : "io.confluent.connect.jdbc.JdbcSinkConnector",
  "key.converter"      : "org.apache.kafka.connect.storage.StringConverter",
  "connection.url"     : "jdbc:postgresql://postgres:5432/",
  "connection.user"    : "postgres",
  "connection.password": "postgres",
  "auto.create"        : true,
  "auto.evolve"        : true,
  "insert.mode"        : "upsert",
  "pk.mode"            : "record_key",
  "pk.fields"          : "MESSAGE_KEY"
}
Here's a walkthrough and a couple of videos that you might find useful:
Kafka Connect in Action: JDBC Sink
ksqlDB and the Kafka Connect JDBC Sink
Do I actually need to install anything?
Kafka Connect comes with Apache Kafka. You need to install the JDBC connector.
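For example, with the Confluent Hub client the JDBC connector can be installed into the worker's plugin path like this - a sketch; you may prefer to pin a specific version instead of latest:
# install the Kafka Connect JDBC connector plugin via the Confluent Hub client
confluent-hub install confluentinc/kafka-connect-jdbc:latest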
Do I actually need to write any code?
No - just the configuration, similar to what I quoted above.
Can I just call the Connect endpoint, which comes with Kafka?
Once you've installed the connector, you run Kafka Connect (a binary that ships with Apache Kafka) and then use the REST endpoint to create the connector using the configuration.
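For example - a sketch, assuming a Connect worker listening on localhost:8083; the connector name and topics value are placeholders you'd set yourself, and the rest is the configuration quoted above:
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "postgres-sink",
    "config": {
      "connector.class"    : "io.confluent.connect.jdbc.JdbcSinkConnector",
      "topics"             : "my-topic",
      "key.converter"      : "org.apache.kafka.connect.storage.StringConverter",
      "connection.url"     : "jdbc:postgresql://postgres:5432/",
      "connection.user"    : "postgres",
      "connection.password": "postgres",
      "auto.create"        : true,
      "auto.evolve"        : true,
      "insert.mode"        : "upsert",
      "pk.mode"            : "record_key",
      "pk.fields"          : "MESSAGE_KEY"
    }
  }'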