Apply a Quota to a Kafka Connect consumer group

I have Kafka Connect JDBC sink connectors writing to various databases and I'd like to throttle the traffic to one database. The Kafka quotas feature can set a consumer_byte_rate quota for a client ID, but Kafka Connect client IDs look like consumer-1234 and are dynamically assigned to connectors. So if my sink connector is rebalanced, it will be assigned all new client IDs. I tried setting a quota using my sink connector consumer group ID as the client ID, but that doesn't work. Is there any way to set a quota for a Kafka Connect consumer group?

If you upgrade to Apache Kafka 2.3 you'll benefit from KIP-411: Make default Kafka Connect worker task client IDs distinct. You can see an example of it in action here. However, you'd have to test whether the resulting client ID is deterministic, since quotas can't be wildcarded.
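As a minimal sketch, assume a sink connector named jdbc-sink whose first task's consumer ends up with a KIP-411-style client ID of connector-consumer-jdbc-sink-0 (check your worker logs or the consumer group description for the exact value before relying on it). You could then set a per-client quota:

# throttle that task's consumer to ~1 MB/s; older brokers take quota changes
# via --zookeeper, newer brokers also accept --bootstrap-server
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
  --add-config 'consumer_byte_rate=1048576' \
  --entity-type clients --entity-name connector-consumer-jdbc-sink-0

Note that quotas are per client ID, so each task would need its own entry.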

Related

How does the Kafka GroupCoordinator play a role in Kafka Connect?

In Kafka Connect distributed mode, we need to submit Kafka Connect source and sink configurations through the REST API. How does Kafka Connect leverage the Kafka group coordinator to form a consumer group?
Kafka Connect sink connectors form consumer groups using the regular consumer API, with a group.id of connect-<name>, where <name> is the name you've configured in the connector config.
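For example, with a hypothetical sink connector named mongo-sink, the group would be connect-mongo-sink and you can inspect it like any other consumer group:

# shows each task's assigned partitions, offsets, lag, and client IDs
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group connect-mongo-sink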

Kafka Connect best practices for topic compaction

I am using Debezium, which makes use of Kafka Connect.
Kafka Connect exposes a couple of topics that need to be created:
OFFSET_STORAGE_TOPIC
This environment variable is required when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector offsets. The topic should have many partitions, be highly replicated (e.g., 3x or more) and should be configured for compaction.
STATUS_STORAGE_TOPIC
This environment variable should be provided when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector status. The topic can have multiple partitions, should be highly replicated (e.g., 3x or more) and should be configured for compaction.
Does anyone have any specific recommended compaction configs for these topics?
e.g.
is it enough to set just:
cleanup.policy: compact
unclean.leader.election.enable: true
or also:
min.compaction.lag.ms: 60000
segment.ms: 1800000
min.cleanable.dirty.ratio: 0.01
delete.retention.ms: 100
The defaults should be fine, and Connect will create/configure those topics on its own unless you preconfigure those topics with those settings.
These are the only cases I can think of where you'd adjust the compaction settings:
a Connect group lingering on the topic longer than you want it to, for example a source connector that doesn't start immediately after a long downtime because it's still processing the offsets topic
your Connect cluster not accurately reporting its state, or the tasks not rebalancing appropriately (because the status topic is in a bad state)
The __consumer_offsets (compacted) topic is what sink connectors use, and it is configured separately and applies to all consumers, not only Connect.
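If you do want to preconfigure the topics yourself instead of letting Connect create them, a minimal sketch (topic names, partition counts, and replication factor are placeholders matching the usual defaults) would be:

# offsets topic: many partitions, replicated, compacted
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic connect-offsets --partitions 25 --replication-factor 3 \
  --config cleanup.policy=compact
# repeat for the config topic (1 partition) and the status topic (5 partitions),
# both also with cleanup.policy=compact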

How can I configure my connector to run in a specific worker group in a multi-cluster distributed Kafka Connect environment?

As per the documentation, the worker service is started before adding connectors. Suppose I am running worker-a with group.id "cluster-a" and worker-b with group.id "cluster-b" on three distributed VMs. What configuration makes a connector choose its worker group?
Suppose I need to configure the Debezium MySQL connector's tasks to run on cluster-a and all of the JDBC connector's tasks on cluster-b. How should I do it?
Thanks in advance.
The workers in each Connect group specified by group.id communicate with each other over HTTP via their rest.advertised.listener (similar to the brokers). Each Connect cluster also requires its own unique config, offset, and status topics.
You'd HTTP POST to one of the group's rest.port endpoints, and the tasks will be distributed within that group; see the sketch below.
However, if you only have 3 machines, there's really no need to set up two separate Connect clusters (JDBC and Debezium tasks can run in the same cluster).
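As a sketch (hosts, ports, and the connector config are placeholders; required Debezium database.* settings are omitted): assuming a cluster-a worker's REST interface is at worker-a-host:8083 and a cluster-b worker's is at worker-b-host:8083, you'd submit each connector to the cluster you want it to run on:

# send the Debezium MySQL connector to cluster-a
curl -X POST -H "Content-Type: application/json" \
  --data '{"name": "mysql-source", "config": {"connector.class": "io.debezium.connector.mysql.MySqlConnector", "tasks.max": "1"}}' \
  http://worker-a-host:8083/connectors
# send the JDBC sink connector to cluster-b the same way,
# e.g. against http://worker-b-host:8083/connectors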

How to make a Data Pipeline from MQTT to KAFKA Broker to MongoDB?

How can I make a data pipeline? I am sending data from MQTT to a Kafka topic using a source connector, and on the other side I have connected the Kafka broker to MongoDB using a sink connector. I am having trouble making a data pipeline that goes from MQTT to Kafka and then to MongoDB. Both connectors are working properly individually. How can I integrate them?
Here are my MQTT source connector config (node 1), the message published from MQTT, a Kafka consumer, my MongoDB sink connector config (node 2), and the data in MongoDB (screenshots).
It is hard to tell what exactly the problem is without more logs. Please provide your Connect config as well, and check the /status of your connectors. I still don't understand exactly what issue you are facing: you say the MQTT source connector is sending messages successfully to the Kafka topic, and the MongoDB sink connector is successfully reading that Kafka topic and writing to your MongoDB, which is your pipeline. So where is the error? Is your Kafka the same Kafka, or separate Kafka clusters? Both look like localhost, but is it the same machine?
Please elaborate and explain what you are expecting. What does "pipeline" mean in your words?
You need both connectors to share the same Kafka cluster. What do node 1 and node 2 mean, are they separate Kafka instances? Your connectors need to connect to the same Kafka cluster in order to share the data inside the Kafka topic, one for input and one for output. Share your bootstrap server parameters and your Kafka server.properties as well.
In order to run two different Connect clusters against the same Kafka cluster, you need to set different internal topics for each Connect cluster:
config.storage.topic
offset.storage.topic
status.storage.topic
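For example (topic and group names are placeholders), both worker configs can point at the same bootstrap.servers while keeping their internal topics separate:

# connect-distributed-a.properties
bootstrap.servers=localhost:9092
group.id=connect-cluster-a
config.storage.topic=connect-a-configs
offset.storage.topic=connect-a-offsets
status.storage.topic=connect-a-status

# connect-distributed-b.properties
bootstrap.servers=localhost:9092
group.id=connect-cluster-b
config.storage.topic=connect-b-configs
offset.storage.topic=connect-b-offsets
status.storage.topic=connect-b-status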

Kafka 2.0 - Kafka Connect Sink - Creating a Kafka Producer

We are currently on HDF (Hortonworks Dataflow) 3.3.1 which bundles Kafka 2.0.0 and are trying to use Kafka Connect in distributed mode to launch a Google Cloud PubSub Sink connector.
We are planning on sending back some metadata into a Kafka Topic and need to integrate a Kafka producer into the flush() function of the Sink task java code.
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
Also, how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source? I need to use the same Bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously, adding more blocking code makes the other processes slower overall.
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
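That worker file is where the cluster connection lives, e.g. (values are illustrative):

# connect-distributed.properties (worker config, not a connector config)
bootstrap.servers=broker1:9092,broker2:9092
group.id=connect-cluster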
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs is the only way. (Feel free to open a Kafka JIRA requesting such a feature of exposing the worker configs, though.)
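For example (the key metadata.bootstrap.servers is made up for illustration, not something Connect itself interprets), you could add it to the sink config you POST and read it back in the task's start(Map<String, String> props):

{
  "name": "pubsub-sink",
  "config": {
    "connector.class": "<your PubSub sink connector class>",
    "topics": "source-topic",
    "metadata.bootstrap.servers": "broker1:9092,broker2:9092"
  }
}

Your task would then call props.get("metadata.bootstrap.servers") in start() and build its producer from that value.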