Generate Kafka Events Based on a Schema - apache-kafka

Are there any tools or services that allow users to generate Kafka events based on a specific schema? I want to do stress testing on my Kafka topic.
One tool I found is kafka-connect-datagen, but it seems I can't configure the bootstrap server there, as my topics are spread across multiple clusters. Looking for some recommendations.

my topics are spread across multiple clusters
A Kafka topic cannot be spread across more than one cluster. Its partitions can only be spread across the brokers of a single cluster, which needs only one bootstrap address.
The DataGen connector works fine against a single cluster. Run a separate instance with a different worker config file for each cluster, as sketched below.
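To illustrate (a minimal sketch, not a definitive setup): the bootstrap address lives in the Connect worker configuration, not in the DataGen connector config, so you point one worker at each cluster and reuse the same connector file. The hostnames, topic name, and quickstart schema below are placeholder assumptions.

    # worker-cluster-a.properties -- one worker config per Kafka cluster
    bootstrap.servers=cluster-a-broker1:9092
    key.converter=org.apache.kafka.connect.storage.StringConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    offset.storage.file.filename=/tmp/connect-cluster-a.offsets

    # datagen-orders.properties -- the DataGen connector, reusable against any cluster
    name=datagen-orders
    connector.class=io.confluent.kafka.connect.datagen.DatagenConnector
    kafka.topic=orders-load-test
    quickstart=orders
    max.interval=10
    tasks.max=4

Then run connect-standalone.sh worker-cluster-a.properties datagen-orders.properties, and repeat with a worker-cluster-b.properties pointing at the second cluster's bootstrap address.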

Related

Same consumer group (S3 sink connector) across two different Kafka Connect clusters

I'm migrating Kafka connectors from an ECS cluster to a new cluster running on Kubernetes. I successfully migrated the Postgres source connectors by deleting them and recreating them against the exact same replication slots. They keep writing to the same topics in the same Kafka cluster, and the S3 connector in the old cluster continues to read from those topics and write records into S3. Everything works as usual.
But now, to move the AWS S3 sink connectors, I first created a non-critical S3 connector in the new cluster with the same name as the one in the old cluster. I was going to wait a few minutes before deleting the old one to avoid missing data. To my surprise, it looks like (based on the UI provided by akhq.io) the worker running that new S3 connector joins the same existing consumer group. I was fully expecting to get duplicated data. Based on the Confluent doc,
All Workers in the cluster use the same three internal topics to share connector configurations, offset data, and status updates. For this reason all distributed worker configurations in the same Connect cluster must have matching config.storage.topic, offset.storage.topic, and status.storage.topic properties.
So based on this "same Connect cluster" wording, I thought having the same consumer group id only works within the same Connect cluster. But from my observation, it seems like you can have multiple consumers in different Connect clusters belonging to the same consumer group?
Based on this article, __consumer_offsets is used by consumers, and unlike the other internal "offset"-related topics, it doesn't carry any Connect cluster name designation.
Does that mean I could simply create the S3 sink connectors in the new Kubernetes cluster and then delete the ones in the ECS cluster without duplicating or missing data (as long as they have the same name, and therefore the same consumer group)? I'm not sure if this is the pattern people usually use.
I'm not familiar with running a Kafka Connect cluster, but I understand that it is a cluster of connectors that is independent of the Kafka cluster itself.
In that case, since the connectors use the same Kafka cluster and you are just moving them from ECS to k8s, it should work as you describe. Both the consumer offsets and the internal Kafka Connect offsets are stored in the Kafka cluster, so it doesn't really matter where the connectors run as long as they connect to the same Kafka cluster. They should restart from the same position, or behave as additional replicas of the same connector, regardless of where they are running.
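If you want to double-check this before deleting the old connector, one hedged way to do it (assuming default settings; the connector name and broker address below are placeholders) is to inspect the sink's consumer group directly in the shared Kafka cluster, since a sink connector commits offsets under a group named connect-<connector name> regardless of which Connect cluster hosts it:

    # Describe the sink connector's consumer group in the shared Kafka cluster.
    # Workers from both the ECS and the Kubernetes Connect clusters show up here
    # as members of this one group if they run a connector with the same name.
    kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
      --describe --group connect-my-s3-sink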

Kafka Streams read & write to separate clusters

A similar question has been answered before, but the solution doesn't work for my use case.
We run 2 Kafka clusters, one in each of 2 separate DCs. Our overall incoming traffic is split between these 2 DCs.
I'd be running a separate Kafka Streams app in each DC to transform that data, and I want to write to a Kafka topic in a single DC.
How can I achieve that?
Ultimately we'd be indexing the Kafka topic data in Druid. It's not possible to run separate Druid clusters since we are trying to aggregate the data.
I've read that it's not possible with a single Kafka Streams app. Is there a way I can use another Kafka Streams app to read from DC1 and write to the DC2 Kafka cluster?
As you wrote yourself, you cannot use the Kafka Streams API to read from Kafka cluster A and write to a different Kafka cluster B.
Instead, if you want to move data between Kafka clusters (whether in the same DC or across DCs), you should use a tool such as Apache Kafka's MirrorMaker or Confluent Replicator.
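As a rough sketch of the MirrorMaker 2 approach (Kafka 2.4+; the cluster aliases, bootstrap addresses, and topic name are assumptions): run your Streams app inside each DC against its local cluster, then replicate the output topic from DC1 into DC2 and index the DC2 copies in Druid.

    # connect-mirror-maker.properties -- replicate the Streams output topic from dc1 to dc2
    clusters = dc1, dc2
    dc1.bootstrap.servers = dc1-broker1:9092
    dc2.bootstrap.servers = dc2-broker1:9092

    # one-way replication only
    dc1->dc2.enabled = true
    dc1->dc2.topics = transformed-events
    dc2->dc1.enabled = false

Start it with connect-mirror-maker.sh connect-mirror-maker.properties. Note that with the default replication policy the copy lands in DC2 under a source-prefixed name (dc1.transformed-events), so point Druid at that topic or adjust the replication policy.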

Is scaling Kafka Connect the same as scaling Kafka consumers?

We need to pull data from Kafka and write it into AWS S3. The Kafka cluster is managed by a separate department, and we only have access to a specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message-processing logic.
Normally when we run Kafka consumers we can run multiple JVMs with the same consumer group for scalability. The JVMs of a given consumer group can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have (for example):
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
To run Kafka Connect in distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id, which identifies them as members of the same cluster (and thus eligible for sharing the workload of tasks across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in distributed mode, as it makes scale-out simpler (you just add additional nodes, but the execution & config remain the same).
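For illustration, a minimal connect-distributed.properties sketch (topic names, group id, and broker address are placeholders); every worker that should belong to the same Connect cluster uses the same values:

    # connect-distributed.properties -- identical on every worker node of this Connect cluster
    bootstrap.servers=broker1:9092
    group.id=connect-cluster-1

    # the three internal topics must match across all workers in the cluster
    config.storage.topic=connect-configs
    offset.storage.topic=connect-offsets
    status.storage.topic=connect-status

    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter

Start each node with connect-distributed.sh connect-distributed.properties; scaling out is just starting another worker with the same file on a new machine.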
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple connectors and tasks, as required.
My understanding is that if you only have a single machine, you should only launch one Kafka Connect instance and configure the tasks.max property to the amount of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you: using separate threads within the same process via tasks.max will give you the same, if not better, performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, it is possible to run it in distributed mode.
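As a hedged example of how tasks.max maps onto partitions (the connector class and property names follow the Confluent S3 sink connector; bucket, region, topic, and sizes are placeholders): with 20 partitions, up to 20 tasks can consume in parallel, and in distributed mode Connect spreads those tasks over however many workers have joined the group.

    # s3-sink.properties -- one task per partition for a 20-partition topic
    name=s3-sink
    connector.class=io.confluent.connect.s3.S3SinkConnector
    topics=my-topic
    tasks.max=20
    s3.bucket.name=my-bucket
    s3.region=us-east-1
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.json.JsonFormat
    flush.size=1000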

How to connect to multiple clusters in a single Kafka Streams application?

In the Kafka Streams Developer Guide it says:
Kafka Streams applications can only communicate with a single Kafka cluster specified by this config value. Future versions of Kafka Streams will support connecting to different Kafka clusters for reading input streams and writing output streams.
Does this mean that my whole application can only connect to a single Kafka Cluster or each instance of KafkaStreams can only connect to a single cluster?
Could I create multiple KafkaStreams instances with different properties that connect to different clusters?
It means that a single application can only connect to one cluster.
You cannot read a topic from cluster A and write the result of your computation to cluster B.
It's not possible to read two topics from two different clusters with the same instance.
Could I create multiple KafkaStreams instances with different properties that connect to different clusters?
Yes, absolutely. But those different instances will be different applications. (Think "consumer groups".)
Update:
Within a single JVM, you can create as many KafkaStreams instances as you like. You can also configure them to connect to different clusters (and you can use the same KStreamBuilder for all of them if you want to do the same processing).
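A minimal sketch of that idea (topic names, application ids, and bootstrap addresses are made up; it uses the current StreamsBuilder API, the successor of KStreamBuilder):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;

    public class TwoClustersExample {

        // The same processing logic, defined once and built fresh for each instance.
        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic");
            return builder.build();
        }

        // Run the topology against one cluster as its own, independent application.
        static KafkaStreams start(String appId, String bootstrapServers) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, appId);
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            KafkaStreams streams = new KafkaStreams(buildTopology(), props);
            streams.start();
            return streams;
        }

        public static void main(String[] args) {
            // Two KafkaStreams instances in one JVM, each talking to a different cluster.
            // Each one reads and writes only within its own cluster.
            KafkaStreams onClusterA = start("app-cluster-a", "cluster-a-broker1:9092");
            KafkaStreams onClusterB = start("app-cluster-b", "cluster-b-broker1:9092");
        }
    }

Each instance is its own application (its own application.id, and therefore its own consumer group), and each one still reads and writes only within its own cluster.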
Just to add to the excellent answer from @Matthias J. Sax.
Does this mean that my whole application can only connect to a single Kafka Cluster or each instance of KafkaStreams can only connect to a single cluster?
I think there are two questions here.
It depends on the definition of "my whole application": it could simply be a single KafkaStreams instance, or multiple instances in a single JVM, or perhaps multiple KafkaStreams instances in a single JVM inside a Docker container that is executed as a pod. Whatever it is, "my whole application" is a bit too broad and not very precise.
The point is that there is no way you can create a KafkaStreams instance that could talk to multiple Kafka clusters (since the configuration is through properties that are key-value pairs in a map) and so just by this you could answer your own question, couldn't you?
Being unable to use two or more Kafka clusters in a Kafka Streams application is one of the differences between Kafka Streams and Spark Structured Streaming (with the latter being able to use as many Kafka clusters as you want and so you could build pipelines between different Kafka clusters).

How to span Kafka partitions across multiple VMs?

I am familiar with a basic Kafka setup. I want to span a single Kafka instance across 2 VMs such that some partitions are on one VM and the rest are on the other. Please tell me how to configure this kind of system.
What do you mean by "spanning a Kafka instance across 2 VMs"? What you can do is run two separate Kafka broker instances, one on each VM. They should be configured to connect to the same ZooKeeper cluster. When you create a new topic with a specific number of partitions, Kafka will spread those partitions over the 2 VMs.
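A minimal sketch of that setup (broker ids, hostnames, and the ZooKeeper address are placeholders; recent Kafka versions use --bootstrap-server with kafka-topics.sh, while older ZooKeeper-based versions used --zookeeper):

    # server.properties on VM 1
    broker.id=0
    listeners=PLAINTEXT://vm1:9092
    log.dirs=/var/lib/kafka-logs
    zookeeper.connect=zk1:2181

    # server.properties on VM 2
    broker.id=1
    listeners=PLAINTEXT://vm2:9092
    log.dirs=/var/lib/kafka-logs
    zookeeper.connect=zk1:2181

    # Create a topic; its 4 partitions are spread across the two brokers.
    kafka-topics.sh --bootstrap-server vm1:9092 --create --topic my-topic \
      --partitions 4 --replication-factor 2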