In the Kafka Streams Developer Guide it says:
Kafka Streams applications can only communicate with a single Kafka cluster
specified by this config value. Future versions of Kafka Streams will
support connecting to different Kafka clusters for reading input streams and
writing output streams.
Does this mean that my whole application can only connect to a single Kafka Cluster or each instance of KafkaStreams can only connect to a single cluster?
Could I create multiple KafkaStreams instances with different properties that connect to different clusters?
It means that a single application can only connect to one cluster.
You cannot read a topic from cluster A and write the result of your computation to cluster B.
It's not possible to read two topics from two different clusters with the same instance.
Could I create multiple KafkaStreams instances with different properties that connect to different clusters?
Yes, absolutely. But those different instances will be different applications. (Think "consumer groups".)
Update:
Within a single JVM, you can create as many KafkaStreams instances as you like. You can also configure them to connect to different clusters (and you can use the same KStreamBuilder for all of them if you want to do the same processing).
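A minimal sketch of that idea, with made-up topic names, application IDs, and broker addresses, using the newer StreamsBuilder API in place of KStreamBuilder:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;

    public class TwoClustersInOneJvm {

        // Each KafkaStreams instance gets its own application.id and its own
        // (single) bootstrap.servers value, i.e. its own Kafka cluster.
        static Properties props(String applicationId, String bootstrapServers) {
            Properties p = new Properties();
            p.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
            p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            return p;
        }

        // The same processing logic, defined once and built per instance.
        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic")
                   .mapValues(v -> v.toString().toUpperCase())
                   .to("output-topic");
            return builder.build();
        }

        public static void main(String[] args) {
            // Two instances in one JVM, each tied to a different cluster;
            // effectively they are two separate applications.
            KafkaStreams onClusterA = new KafkaStreams(buildTopology(), props("app-on-a", "cluster-a:9092"));
            KafkaStreams onClusterB = new KafkaStreams(buildTopology(), props("app-on-b", "cluster-b:9092"));
            onClusterA.start();
            onClusterB.start();
        }
    }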
Just to add to the excellent answer from @Matthias J. Sax.
Does this mean that my whole application can only connect to a single Kafka Cluster or each instance of KafkaStreams can only connect to a single cluster?
I think there are two questions here.
It depends on the definition of "my whole application": it could simply be a single KafkaStreams instance, multiple instances in a single JVM, or perhaps multiple KafkaStreams instances in a JVM inside a Docker container that is executed as a pod. Whatever it is, "my whole application" is a bit too broad and not very precise.
The point is that there is no way to create a KafkaStreams instance that talks to multiple Kafka clusters, since the configuration is a map of key-value properties with a single bootstrap.servers entry. Just by that, you could answer your own question, couldn't you?
Being unable to use two or more Kafka clusters in a single Kafka Streams application is one of the differences between Kafka Streams and Spark Structured Streaming: the latter can read from as many Kafka clusters as you want, so you can build pipelines between different Kafka clusters.
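For comparison, here is a rough sketch of what that looks like with Spark Structured Streaming's Java API; the broker addresses, topic names, and checkpoint path are made up for illustration, and the spark-sql-kafka integration is assumed to be on the classpath:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TwoClusterSparkJob {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("two-cluster-job").getOrCreate();

            // One streaming source per Kafka cluster.
            Dataset<Row> fromClusterA = spark.readStream().format("kafka")
                    .option("kafka.bootstrap.servers", "cluster-a:9092")
                    .option("subscribe", "topic-a")
                    .load();
            Dataset<Row> fromClusterB = spark.readStream().format("kafka")
                    .option("kafka.bootstrap.servers", "cluster-b:9092")
                    .option("subscribe", "topic-b")
                    .load();

            // Combine both streams and write the result to yet another cluster.
            fromClusterA.union(fromClusterB)
                    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
                    .writeStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "cluster-c:9092")
                    .option("topic", "combined-topic")
                    .option("checkpointLocation", "/tmp/two-cluster-job-checkpoint")
                    .start()
                    .awaitTermination();
        }
    }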
Are there any tools or services that will allow users to generate Kafka events based on a specific schema? I want to do stress testing on my Kafka topic.
One tool I found is kafka-connect-datagen, but how do I configure the bootstrap server there when my topics are spread across multiple clusters? Looking for some recommendations.
my topics are spread across multiple clusters
Kafka topics cannot be spread across more than one cluster. Their partitions can only be spread across the brokers within one cluster, which needs only one bootstrap address.
The DataGen connector will work fine against a single cluster. Run it with a different config file for each cluster.
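For completeness, the same kind of load can also be generated from a small plain-Java producer loop, one instance per cluster. A minimal sketch (the broker address, topic name, and message shape are made up; substitute payloads that match your schema):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class LoadGenerator {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "cluster-a:9092");   // point one generator at one cluster
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 1_000_000; i++) {
                    // Replace this payload with one that matches your schema.
                    producer.send(new ProducerRecord<>("stress-test-topic", "key-" + i, "{\"value\":" + i + "}"));
                }
            }
        }
    }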
A similar question has been answered before, but the solution doesn't work for my use case.
We run 2 Kafka clusters, one in each of 2 separate DCs. Our overall incoming traffic is split between these 2 DCs.
I'd be running a separate Kafka Streams app in each DC to transform that data, and I want to write the results to a Kafka topic in a single DC.
How can I achieve that?
Ultimately we'd be indexing the Kafka topic data in Druid. It's not possible to run separate Druid clusters since we are trying to aggregate the data.
I've read that it's not possible with a single Kafka Streams app. Is there a way I can use another Kafka Streams app to read from the DC1 cluster and write to the DC2 Kafka cluster?
As you wrote yourself, you cannot use the Kafka Streams API to read from Kafka cluster A and write to a different Kafka cluster B.
Instead, if you want to move data between Kafka clusters (whether it's in the same DC or across DCs) you should use a tool such as Apache Kafka's Mirror Maker or Confluent Replicator.
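Conceptually, those replication tools boil down to a consumer against cluster A chained to a producer against cluster B. A toy sketch of that idea in plain Java (the broker addresses and topic name are made up; it is not a replacement for MirrorMaker or Replicator, which also take care of scaling, offset tracking, and failure handling):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    public class ClusterBridge {
        public static void main(String[] args) {
            // Consumer reads from the source cluster (DC1).
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "dc1-kafka:9092");
            consumerProps.put("group.id", "dc1-to-dc2-bridge");
            consumerProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
            consumerProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

            // Producer writes to the target cluster (DC2).
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "dc2-kafka:9092");
            producerProps.put("key.serializer", ByteArraySerializer.class.getName());
            producerProps.put("value.serializer", ByteArraySerializer.class.getName());

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Collections.singletonList("transformed-topic"));
                while (true) {
                    ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<byte[], byte[]> record : records) {
                        producer.send(new ProducerRecord<>("transformed-topic", record.key(), record.value()));
                    }
                }
            }
        }
    }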
I need to replicate Kafka topics from one cluster to another using a Java process. New messages arriving at the source cluster should also be replicated to the destination cluster. Is there any simple Java code to do this?
Can the two Apache Kafka Connect metadata topics, config.storage.topic and offset.storage.topic, be shared safely between two or more separate Kafka Connect clusters?
While the documentation states for each that:
This must be the same for all workers with the same group.id.
It's not clear from the documentation whether these topics essentially need to be owned by a single Kafka Connect cluster.
Broadly speaking, I can imagine that my Connect group.id could be used as a prefix for the items in the topic, but it's not clear whether it is, nor whether it's safe to rely on that as a way to separate Connect clusters from each other's metadata.
Each cluster needs its own dedicated topics.
If you have multiple Kafka Connect clusters trying to use the same config/offset/status topics (even with different group.id set), you'll get a ton of errors and/or unexpected behaviour from Kafka Connect.
We need to pull data from Kafka and write it into AWS S3. The Kafka cluster is managed by a separate department, and we have access to only a specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message processing logic.
Normally when we run a Kafka consumer we can run multiple JVMs with the same consumer group for scalability, and those JVMs can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of a topic with 20 partitions, you could have, for example:
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id which identifies them as members of the same cluster (and thus eligible for sharing workload of tasks out across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out simpler (you just add additional nodes, but the execution & config remain the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple connectors and tasks, as required.
My understanding is that if you only have a single machine, you should launch only one Kafka Connect instance and configure the tasks.max property to the degree of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, it is possible to run it in distributed mode.