List All the Kafka connect workers running in a group in distributed mode - apache-kafka

I want to list all the Kafka Connect workers running in a single group in distributed mode.
My use case is that if a worker is closed or killed for any reason, I can recreate it so that it rejoins the group. Also, all the workers will be running on different machines on a network, so I need a central view of all the available workers in a group.
But I couldn't find a way to list all the workers (the leader node and the follower nodes). Could somebody tell me how to do that?

When a worker leaves or joins a Kafka Connect cluster, the tasks get rebalanced automatically. Therefore you should only need to monitor each Kafka Connect worker process in isolation on its host machine and restart it if you see a failure.
You can find out which workers are currently executing a connector's tasks using the REST API:
$ curl -s "http://localhost:8083/connectors?expand=info&expand=status" | jq '.'
"status": {
"name": "source-datagen-01",
"connector": {
"state": "RUNNING",
"worker_id": "kafka-connect-03:8083"
},
"tasks": [
{
"id": 0,
"state": "RUNNING",
"worker_id": "kafka-connect-03:8083"
},
{
"id": 1,
"state": "RUNNING",
"worker_id": "kafka-connect-02:8083"
},
{
"id": 2,
"state": "RUNNING",
"worker_id": "kafka-connect-01:8083"
},
{
"id": 3,
"state": "RUNNING",
"worker_id": "kafka-connect-03:8083"
},
I don't believe there's a REST API call to view the workers in the cluster regardless of connector execution.
You could parse the Kafka Connect worker logs in the cluster to determine when workers leave and join the cluster, and also inspect the Kafka Connect status topic's messages - both are shown in this doc. However, as noted, Kafka Connect itself will do the rebalancing when a worker leaves or joins.
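If all you need is the distinct set of workers that currently hold assignments, you can flatten that same status output (a minimal sketch, assuming jq is installed and a worker's REST API is reachable on localhost:8083; idle workers with no connectors or tasks will not appear):
# List the distinct worker_ids that currently host a connector or its tasks.
$ curl -s "http://localhost:8083/connectors?expand=status" \
    | jq -r '.[].status | [.connector.worker_id, (.tasks[].worker_id)] | .[]' \
    | sort -u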

Related

Debezium Server MongoDB connector Enabling the Signalling doesn't work

I run the Debezium Server with source MongoDB and sink Kinesis on k8s. After enabling the signalling by setting
DEBEZIUM_SOURCE_SIGNAL_DATA_COLLECTION="test.debezium_signaling"
DEBEZIUM_SOURCE_COLLECTION_INCLUDE_LIST: "<some other existing collections>,test.debezium_signaling"
and restarting the pod and inserting
{
  "_id": {
    "$oid": "***"
  },
  "type": "execute-snapshot",
  "data": "{'data-collections': ['test.new_collection']}"
}
nothing happens, and Debezium seems to be stuck at checking current members of replica set at ...
There aren't any errors; any idea what the issue could be?
I also enabled debug logs, but there aren't any log entries indicating that the signaling record is picked up by the server.

Mongo Kafka Connector Collection Listen Limitations

We have several collections in Mongo, based on n tenants, and want the Kafka connector to only watch specific collections.
Below is my mongosource.properties file where I have added the pipeline filter to listen only to specific collections. It works:
pipeline=[{$match:{"ns.coll":{"$in":["ecom-tesla-cms-instance","ca-tesla-cms-instance","ecom-tesla-cms-page","ca-tesla-cms-page"]}}}]
The collections will grow in the future to maybe 200 collections which have to be watched, so I wanted to know three things:
Is there some performance impact with one connector listening to a huge number of collections?
Is there any limit on the collections one connector can watch?
What would be the best practice: to run one connector listening to 100 collections, or 10 different connectors listening to 10 collections each?
Best practice would say to run many connectors, where "many" depends on your ability to maintain the overhead of them all.
Reason being - one connector creates a single point of failure (per task, but only one task should be assigned to any collection at a time, to prevent duplicates). If the Connect task fails with a non-retryable error, then that will halt the connector's tasks completely, and stop reading from all collections assigned to that connector.
You could also try Debezium, which might have less resource usage than the Mongo Source Connector since it acts as a replica rather than querying the collection at an interval.
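If a Connect task does fail with a non-retryable error, it stays FAILED until it is restarted, so a common mitigation is to watch the status endpoint and restart failed tasks (a minimal sketch; the connector name mongo-source and the worker address localhost:8083 are illustrative assumptions):
# Show any FAILED tasks for the connector, then restart one by id via the Connect REST API.
$ curl -s http://localhost:8083/connectors/mongo-source/status | jq '.tasks[] | select(.state == "FAILED")'
$ curl -s -X POST http://localhost:8083/connectors/mongo-source/tasks/0/restart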
You can listen to change streams from multiple Mongo collections; you just need to provide a suitable regex for the collection names in the pipeline. You can even exclude a collection or collections by providing a regex for the ones from which you don't want to listen to any change streams.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
You can even exclude any given database using $nin, if you don't want to listen for any change stream from it.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/,\"$nin\":[/^any_database_name$/]}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
Coming to your questions:
Is there some performance impact with one connector listening to a huge number of collections?
To my knowledge, I don't think so, since it is not mentioned anywhere in the docs. You can listen to multiple Mongo collections using a single connector.
Is there any limit on the collections one connector can watch?
Again, to my knowledge there is no limit mentioned in the docs.
What would be the best practice: to run one connector listening to 100 collections, or 10 different connectors listening to 10 collections each?
From my point of view it would be an overhead to create N Kafka connectors, one for each collection. Make sure you provide fault tolerance using the recommended configurations, and don't just rely on the connector's default configuration.
Here is the basic Kafka connector configuration.
Mongo to Kafka source connector
{
  "name": "mongo-to-kafka-connect",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "publish.full.document.only": "true",
    "tasks.max": "3",
    "key.converter.schemas.enable": "false",
    "topic.creation.enable": "true",
    "poll.await.time.ms": 1000,
    "poll.max.batch.size": 100,
    "topic.prefix": "any prefix for topic name",
    "output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
    "connection.uri": "mongodb://<username>:<password>@ip:27017,ip:27017,ip:27017,ip:27017/?authSource=admin&replicaSet=xyz&tls=true",
    "value.converter.schemas.enable": "false",
    "copy.existing": "true",
    "topic.creation.default.replication.factor": 3,
    "topic.creation.default.partitions": 3,
    "topic.creation.compacted.cleanup.policy": "compact",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "mongo.errors.log.enable": "true",
    "heartbeat.interval.ms": 10000,
    "pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^collection_.*/}}]}}]"
  }
}
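For completeness, a config like the one above is typically registered by POSTing it to any worker's REST API (a minimal sketch; the file name mongo-source.json and the worker address localhost:8083 are assumptions, while the connector name comes from the config above):
# Save the JSON above as mongo-source.json and register it with the Connect cluster.
$ curl -s -X POST -H "Content-Type: application/json" \
    --data @mongo-source.json \
    http://localhost:8083/connectors | jq '.'
# Verify the connector and its tasks are RUNNING.
$ curl -s http://localhost:8083/connectors/mongo-to-kafka-connect/status | jq '.'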
You can get more details from the official docs:
Mongo docs: https://www.mongodb.com/docs/kafka-connector/current/source-connector/
Confluent docs: https://docs.confluent.io/platform/current/connect/index.html
Regex: https://www.mongodb.com/docs/manual/reference/operator/query/regex/#mongodb-query-op.-regex

Is there a way to map the RabbitMQ routing key in Kafka connector configurations when registering RabbitMQSourceConnector?

We have microservices which communicate with each other via RabbitMQ.
For example: we have services ServiceA, ServiceB, ServiceC.
ServiceA sends an ADto message to ServiceB via RabbitMQ with routing key service.a.100
ServiceB sends a BDto message to ServiceC via RabbitMQ with routing key service.b.200
ADto message:
{
  "#dto": "A",
  "quantity": 100,
  "price": 10
}
BDto message:
{
  "#dto": "B",
  "quantity": 200,
  "price": 20
}
We need to publish the ADto and BDto messages to another service, ServiceD, which uses Kafka.
To publish the RabbitMQ ADto and BDto messages to Kafka, we are planning to use Kafka connectors, e.g. the RabbitMQSourceConnector that Confluent provides.
RabbitMQSourceConnector pulls messages from a RabbitMQ queue and writes them into Kafka topics.
We need to get the routing keys in Kafka because the routing key carries an identifier (e.g. service.a.100, where we need 'a' and '100') that we use to identify messages; the identifier is not in the message body itself. Kafka, however, does not support routing keys and instead uses topics and partitions.
Is there a way to map the routing key in the connector configuration when registering the Confluent RabbitMQSourceConnector in the Kafka Connect cluster?

Is there a way to use Kafka Connect with REST Proxy?

Kafka Connect source and sink connectors provide a practically ideal feature set for configuring a data pipeline without writing any code. In my case I wanted to use it for integrating data from several DB servers (producers) located on the public Internet.
However, some producers don't have direct access to the Kafka brokers, as their network/firewall configuration allows traffic to a specific host only (port 443). And unfortunately I cannot really change these settings.
My thought was to use Confluent REST Proxy, but I learned that Kafka Connect uses the KafkaProducer API, so it needs direct access to the brokers.
I found a couple of possible workarounds, but none is perfect:
SSH tunnel - as described in: Consume from a Kafka Cluster through SSH Tunnel
Use REST Proxy but replace Kafka Connect with custom producers, mentioned in How can we configure kafka producer behind a firewall/proxy?
Use an SSHL demultiplexer to route the traffic to a broker (but just one broker)
Has anyone faced a similar challenge? How did you solve it?
Sink Connectors (ones that write to external systems) do not use the Producer API.
That being said, you could use some HTTP Sink Connector that issues POST requests to the REST Proxy endpoint. It's not ideal, but it would address the problem. Note: This means you have two clusters - one that you are consuming from in order to issue HTTP requests via Connect, and the other behind the proxy.
Overall, I don't see how the question is unique to Connect, since you'd have similar issues with any other attempt to write the data to Kafka through the only open HTTPS port.
As @OneCricketeer recommended, I tried the HTTP Sink Connector with REST Proxy approach.
I managed to configure the Confluent HTTP Sink connector as well as an alternative one (github.com/llofberg/kafka-connect-rest) to work with Confluent REST Proxy.
I'm adding the connector configurations in case it saves some time for anyone trying this approach.
Confluent HTTP Sink connector
{
  "name": "connector-sink-rest",
  "config": {
    "topics": "test",
    "tasks.max": "1",
    "connector.class": "io.confluent.connect.http.HttpSinkConnector",
    "headers": "Content-Type:application/vnd.kafka.json.v2+json",
    "http.api.url": "http://rest:8082/topics/test",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter.schemas.enable": "false",
    "batch.prefix": "{\"records\":[",
    "batch.suffix": "]}",
    "batch.max.size": "1",
    "regex.patterns": "^~$",
    "regex.replacements": "{\"value\":~}",
    "regex.separator": "~",
    "confluent.topic.bootstrap.servers": "localhost:9092",
    "confluent.topic.replication.factor": "1"
  }
}
Kafka Connect REST connector
{
  "name": "connector-sink-rest-v2",
  "config": {
    "connector.class": "com.tm.kafka.connect.rest.RestSinkConnector",
    "tasks.max": "1",
    "topics": "test",
    "rest.sink.url": "http://rest:8082/topics/test",
    "rest.sink.method": "POST",
    "rest.sink.headers": "Content-Type:application/vnd.kafka.json.v2+json",
    "transforms": "velocityEval",
    "transforms.velocityEval.type": "org.apache.kafka.connect.transforms.VelocityEval$Value",
    "transforms.velocityEval.template": "{\"records\":[{\"value\":$value}]}",
    "transforms.velocityEval.context": "{}"
  }
}
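To sanity-check the wiring, you can reproduce by hand the kind of request these sinks end up sending to the REST Proxy (a sketch using the rest:8082 host and test topic from the configs above; the record value is just an illustrative payload):
# POST a record in the REST Proxy v2 JSON envelope, mirroring what the sink connector sends.
$ curl -s -X POST \
    -H "Content-Type: application/vnd.kafka.json.v2+json" \
    --data '{"records":[{"value":{"hello":"world"}}]}' \
    http://rest:8082/topics/test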

Distributed Official Mongodb Kafka Source Connector with Multiple tasks Not working

I am running Apache Kafka on my Windows machine with two Kafka Connect workers (ports 8083, 8084) and one topic with three partitions (replication factor of one).
My issue is that I can see the fail-over to the other Kafka Connect worker whenever I shut one of them down, but load balancing is not happening because the number of tasks is always ONE.
I am using the official MongoDB Kafka connector as a source (change stream) with tasks.max=6.
I tried updating MongoDB with multiple threads so that it would push more data into Kafka Connect and perhaps make Kafka Connect create more tasks. Even under a higher volume of data, the task count remains one.
How did I confirm that only one task is running? Through the API "http://localhost:8083/connectors/mongodb-connector/status":
Response:
{
  "name": "mongodb-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "xx.xx.xx.xx:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "xx.xx.xx.xx:8083"
    }
  ],
  "type": "source"
}
Am I missing something here? Why aren't more tasks created?
It seems this is the behavior of the official MongoDB Kafka source connector. This is the answer I got on another forum from Ross Lawley (MongoDB developer):
Prior to 1.2.0 only a single task was supported by the sink connector.
The Source connector still only supports a single task, this is because it uses a single Change Stream cursor. This is enough to watch and publish changes cluster wide, database wide or down to a single collection.
I raised this ticket: https://jira.mongodb.org/browse/KAFKA-121
I got the following response:
The source connector will only ever produce a single task.
This is by design as the source connector is backed by a change stream. Change streams internally use the same data as used by replication engine and as such should be able to scale as the database does.
There are no plans to allow multiple cursors, however, should you feel that this is not meeting your requirements, then you can configure multiple connectors and each would have its own change stream cursor.
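As a sketch of that workaround, you could register several copies of the source connector, each backed by its own change stream scoped to a single collection (the connector names, worker address, credentials, and the test database with orders and customers collections are illustrative assumptions):
# Register one source connector per collection; each one gets its own change stream cursor.
$ for COLL in orders customers; do
    curl -s -X POST -H "Content-Type: application/json" \
      http://localhost:8083/connectors \
      --data '{
        "name": "mongo-source-'"$COLL"'",
        "config": {
          "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
          "connection.uri": "mongodb://<username>:<password>@ip:27017/?replicaSet=xyz",
          "database": "test",
          "collection": "'"$COLL"'"
        }
      }'
  done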