In Kafka Connect, how do I connect to multiple Kafka clusters?

I set up a Kafka Connect cluster in distributed mode and I want it to connect to multiple Kafka CLUSTERS, not just multiple brokers.
Target brokers can be set with bootstrap.servers in connect-distributed.properties.
So, at first, I set broker1 from kafka-cluster-A like below:
bootstrap.servers=broker1:9092
This worked fine.
And then, I added broker2 from kafka-cluster-B like below:
bootstrap.servers=broker1:9092,broker2:9092
So these two brokers are in different clusters, and this didn't work at all.
There was no error; the worker just hung, and requests such as creating a connector through the REST API got no response.
How can I connect with multiple kafka clusters?

As far as I know, you can only connect a Kafka Connect worker to one Kafka cluster.
If you have data on different clusters that you want to handle with Kafka Connect then run multiple Kafka Connect worker processes.
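For example, a rough sketch of what that could look like: two worker properties files, one per cluster, each started as its own process. The file names, group.ids, internal topic names, and REST ports below are placeholders; only the two broker addresses come from the question.

# worker-a.properties -> points at kafka-cluster-A
bootstrap.servers=broker1:9092
group.id=connect-cluster-a
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs-a
offset.storage.topic=connect-offsets-a
status.storage.topic=connect-status-a
rest.port=8083

# worker-b.properties -> points at kafka-cluster-B
bootstrap.servers=broker2:9092
group.id=connect-cluster-b
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs-b
offset.storage.topic=connect-offsets-b
status.storage.topic=connect-status-b
rest.port=8084

# start each worker as a separate process (use different REST ports if they share a host):
# bin/connect-distributed.sh worker-a.properties
# bin/connect-distributed.sh worker-b.properties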

Related

Using Kafka Connect in distributed mode, where are internal topics supposed to exist

As a follow-up to my previous question here, Attempting to run Kafka Connect in distributed mode locally, problem with internal topics, I have started to figure out what might really be going on (I'm learning Kafka as I go).
Kafka Connect, one way or another, requires three internal topics: config, offset, and status. Are these topics supposed to exist in the Kafka cluster where I am consuming data from? For context, what I'm doing is someone else has a Kafka cluster set up that has topics (messages?) for me to consume. I spin up a Kafka Connect cluster on my local machine (to test) and this local instance (we'll call it that going forward) then connects to the remote Kafka cluster (we'll call it the remote cluster) by way of me typing in the bootstrap servers, some callback handler classes, and a connect.jaas file.
Do these three topics need to already exist on the remote cluster? Here I have been trying to create them on my own broker on my local instance, but through continued research, I'm seeing maybe these three internal topics need to be on the remote cluster (where I'm getting my data from). Does the owner of the remote Kafka cluster need to create these three topics for me? Where would they create them exactly? What if their cluster is not a Kafka Connect cluster specifically?
The topics need to be created on the cluster defined by bootstrap.servers in the Connect worker properties. This can be local or remote, depending on what data you actually want the connector tasks to send/receive. Individual Connect tasks cannot override which brokers are used (it is not possible, for example, to use a source connector to write to multiple Kafka clusters).
Recent versions of Kafka Connect will automatically create those internal topics, if the worker is authorized to do so. Otherwise, yes, they'll need to be created using kafka-topics --create with appropriate partition counts and replication factors.
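For example, something like the following (a sketch only; the topic names must match the worker's config.storage.topic / offset.storage.topic / status.storage.topic settings, the partition counts and replication factors should fit your environment, and older kafka-topics versions take --zookeeper instead of --bootstrap-server):

kafka-topics --create --bootstrap-server broker:9092 --topic connect-configs --partitions 1 --replication-factor 3 --config cleanup.policy=compact
kafka-topics --create --bootstrap-server broker:9092 --topic connect-offsets --partitions 25 --replication-factor 3 --config cleanup.policy=compact
kafka-topics --create --bootstrap-server broker:9092 --topic connect-status --partitions 5 --replication-factor 3 --config cleanup.policy=compact

All three topics should be compacted (cleanup.policy=compact), and the config topic should have a single partition.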
If your data exists in a remote Kafka cluster, the only reason to run a local instance is if you want to use MirrorMaker, for example.
What if their cluster is not a Kafka Connect cluster specifically?
Unclear what this means. Kafka Connect is a client just like a Kafka Streams app or normal producer or consumer. It doesn't store topics itself.

How can I point my Confluent installation at a different Kafka cluster?

I installed Confluent and it comes with its own Kafka.
I want to switch from the bundled Kafka to a different cluster.
Which .properties (or other) file must I change to point to a different Kafka?
Thanks in advance.
In your Kafka Connect worker configuration, you need to set bootstrap.servers to point to the broker(s) on your source Kafka cluster.
You can only connect to one source Kafka cluster per Kafka Connect worker. If you need to stream data from multiple Kafka clusters, you would run multiple Kafka Connect workers.
Edit: If you're using the Confluent CLI, then the Kafka Connect worker config is taken from etc/schema-registry/connect-avro-distributed.properties.
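For example, the line to change in that file would look something like this (the broker hostnames are placeholders for your own source cluster):

bootstrap.servers=your-broker1:9092,your-broker2:9092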

Is scaling Kafka Connect the same as scaling a Kafka consumer?

We need to pull data from Kafka and write it into AWS S3. The Kafka cluster is managed by a separate department and we only have access to a specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message-processing logic.
Normally when we run a Kafka consumer we can run multiple JVMs with the same consumer group for scalability. The consumer JVMs can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have (for example):
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id which identifies them as members of the same cluster (and thus eligible for sharing workload of tasks out across them). More config details for distributed mode here.
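As a rough sketch (the group.id, topic names, and hostnames below are placeholders), every worker node that should join the same Connect cluster would share the same values for these settings:

# identical on every worker node in the Connect cluster
group.id=connect-cluster-1
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status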
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out more simple (you just add additional nodes, but the execution & config remains the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks and connectors, as required.
My understanding is that if you only have a single machine, you should only launch one Kafka Connect instance, and configure the tasks.max property to the amount of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, you can run it in distributed mode.
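For the S3 use case in the question, the parallelism is then requested via tasks.max in the connector configuration submitted to the worker's REST API. This is only a sketch: the connector class and settings shown are from Confluent's S3 sink connector, and the connector name, topic, bucket, and region are placeholders.

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
        "name": "s3-sink",
        "config": {
          "connector.class": "io.confluent.connect.s3.S3SinkConnector",
          "topics": "my-topic",
          "tasks.max": "20",
          "s3.bucket.name": "my-bucket",
          "s3.region": "us-east-1",
          "storage.class": "io.confluent.connect.s3.storage.S3Storage",
          "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
          "flush.size": "1000"
        }
      }'

With 20 partitions and tasks.max=20, Connect can run up to 20 tasks, one per partition, spread across however many workers are in the cluster.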

Run Kafka and Kafka-connect on different servers?

I want to know if Kafka and Kafka Connect can run on different servers. So a connector would be started on server A and send data from a Kafka topic on server B to HDFS or S3, etc. Thanks.
Yes, and for production deployments this is typically recommended for resource reasons. Generally you'd deploy a cluster of Kafka brokers (3+ for HA), and then a cluster of Kafka Connect workers (as many as needed for throughput capacity / resilience), all on separate nodes.
For more details, see the Confluent Enterprise Reference Architecture.
Yes, you can.
My Kafka brokers and Kafka Connect applications run on different machines, writing data to HDFS. You just have to list the brokers in bootstrap.servers in the worker properties file (config/connect-distributed.properties or config/connect-standalone.properties) instead of localhost:9092.
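For example (the hostnames below are placeholders), you would point the worker at the remote brokers and start it as usual:

# config/connect-distributed.properties
bootstrap.servers=remote-broker1:9092,remote-broker2:9092,remote-broker3:9092

# then start the worker
bin/connect-distributed.sh config/connect-distributed.properties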

Connecting Storm to a remote Kafka cluster: what would happen if new brokers are added?

We are working on an application that uses Storm to pull data from a remote Kafka cluster. As the two clusters lie in different environments, there is an issue with network connectivity between them. In simple terms, by default the remote ZooKeeper and Kafka brokers do not allow connections from our Storm worker/supervisor nodes. To allow that, we need firewall access to be granted.
My concern is: what would happen if new brokers or ZooKeeper nodes are added to the remote cluster? I understand that we don't have to specify all the ZooKeeper nodes in order to consume, but say they add a few brokers and we need to consume from a partition that is served by those new nodes. What would be the impact on the running Storm application?