I am fairly new to Kafka. I am designing and developing a Kafka solution to read Oracle tables (Kafka Connect or CDC is not an option in my organization yet) and send that large number of records to Kafka brokers, where they will eventually be picked up by consumers and sent to the target system. Can my Kafka producer and consumer code live in the same Spring Boot app? Or do I need AppName-Producer and AppName-Consumer as two separate Spring Boot apps so that they can be scaled separately?
Let's say I need 3 producers + 3 consumers (as part of one consumer group). Does that mean I need to deploy 6 instances of the same app, or 3 producer-app and 3 consumer-app instances as separate applications?
The idea behind brokers (Kafka, ActiveMQ, etc.) is to share data asynchronously between different applications. This is one of the distributed architecture patterns. It means that different applications (within the enterprise or from a third party) share the same data.
Now, to answer your questions:
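You don't need to deploy 6 instances of the same app.
You can deploy the consumer and the producer in the same application. If each instance runs one producer and one consumer, 3 instances already give you 3 producers and 3 consumers; splitting them into two apps only matters if you need to scale the two sides independently.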
Related
We are building a Flink application which will be deployed to AWS Kinesis Data Analytics (KDA). This application will consume from Kafka and write to S3.
Our setup is as follows:
We have a Kafka bootstrap server (MSK) with several topics.
We are planning to have multiple Flink applications deployed on KDA. All these applications will be part of the same consumer group.
We want to do the following:
Assume we have 10 Kafka topics (topic 1 through topic 10).
Assume we have 5 Flink applications (app 1 through app 5).
Initially we will assign applications to topics (e.g. app 1 will consume from topics 1 and 2, app 2 will consume from topics 3 and 4, and so on).
We will store this in a config system (say, a CRUD application), and each Flink app, when it comes alive, should be able to see which topics it should consume from based on its name. (This part we are able to do.)
Assume, suddenly there is a huge surge in the number of messages coming through topic 4 for example. We will update the config system to point App 4 which is consuming from topic 7 and topic 8 to instead consume from topic 7 and topic 4.
We want the Flink app to stop consuming from the old topic and start consuming from the new topic without re-deploying the Flink app. We will have a poller which can inform the Flink app that it should consume from a different topic. The issue is making the Flink app stop consuming from the old topic and start consuming from the new topic without re-deployment.
Is there any way to do this? As far as my research goes, the only way to make the Flink app read from a new topic is to redeploy it. But I want to check whether someone has figured out another way.
Conversely: will this situation be handled automatically if we make all 5 Flink applications listen to all 10 topics? I mean, if there is a sudden surge in one of the topics, will the Flink applications rebalance themselves to dedicate more resources to reading from the hot topic, since they are all part of the same consumer group?
Flink's Kafka consumer does not support stopping consumption from a topic (without a restart), but it does support dynamic topic and partition discovery. See https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/datastream/kafka/#dynamic-partition-discovery for details.
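To make the "discover but never drop" behaviour concrete, here is a toy model in plain Python. It is not Flink code: the function name, topic names and pattern are all hypothetical, and it only assumes that the source was started with a fixed topic pattern and that discovery is enabled (in Flink this is controlled by the partition.discovery.interval.ms property described in the linked docs).

```python
import re


def discover_new_topics(pattern, cluster_topics, subscribed):
    """Return topics that match `pattern` and are not subscribed yet.

    Illustrative helper only; it mimics one discovery pass.
    """
    regex = re.compile(pattern)
    return sorted(
        t for t in cluster_topics
        if regex.fullmatch(t) and t not in subscribed
    )


# Hypothetical state: the app already reads topic-7 and topic-8.
subscribed = {"topic-7", "topic-8"}

# Topics currently visible on the cluster (hypothetical names).
cluster_topics = {"topic-1", "topic-4", "topic-7", "topic-8", "audit-log"}

# The pattern is fixed when the job starts; it cannot be changed later
# without a restart, so it must already cover topic-4.
newly_found = discover_new_topics(r"topic-(4|7|8)", cluster_topics, subscribed)

# Discovery only ever adds topics; nothing is removed from the set.
subscribed |= set(newly_found)

print(newly_found)
print(sorted(subscribed))
```

Note that topic-8 stays in the subscribed set: under this model, getting the app to stop reading a topic still needs a restart, which matches the answer above.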
We are deploying Kafka consumers based on the Java API in separate VMs, grouped by usage: probably 3-4 consumers (not in the same group) per VM, depending on the throughput of these consumers.
Is it best to use this method, or to deploy the consumers using Docker containers? Any pointers would be helpful.
Though you can use the Confluent REST Proxy and other tools, my question is about consumer deployment.
A VM has too much overhead for simply running one or a few JVM applications. If you have a container platform, that would be preferred, and it would start the app faster than provisioning new VMs per app.
I have written a Streams application that reads from a topic with 10 partitions on a cluster of 5 brokers. I have tried multiple combinations, such as 10 application instances (on 10 different machines) with 1 stream thread each, and 5 instances with 2 threads each. But for some reason, when I check in Kafka Manager, the 1:1 mapping between partitions and stream threads is not happening. Some of the threads pick up 2 partitions while others pick up none. Can you please help me with this? All threads are part of the same group and subscribed to only one topic.
The Kafka Streams version we are using is 0.11.0.2 and the broker version is 0.10.0.2.
Thanks for your help
Maybe you are hitting https://issues.apache.org/jira/browse/KAFKA-7144 -- I would recommend upgrading to the latest version.
Note: you do not need to upgrade your brokers.
We need to pull data from Kafka and write it to AWS S3. Kafka is managed by a separate department and we have access to only a specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message processing logic.
Normally, when we run a Kafka consumer, we can run multiple JVMs with the same consumer group for scalability. The consumer JVMs of a given group can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have, for example:
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
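As a rough illustration of the three layouts above, the sketch below spreads 20 partition ids over a given number of workers in round-robin fashion. It is a simplified model, not Kafka Connect's actual assignment logic; the function name is hypothetical, and it assumes enough tasks are configured for the work to be spread evenly.

```python
def spread_partitions(num_partitions, num_workers):
    """Assign partition ids to workers round-robin (illustrative model only)."""
    assignment = {worker: [] for worker in range(num_workers)}
    for partition in range(num_partitions):
        assignment[partition % num_workers].append(partition)
    return assignment


# The three layouts from the list above: 1, 5 and 20 workers.
for workers in (1, 5, 20):
    counts = [len(parts) for parts in spread_partitions(20, workers).values()]
    print(workers, "worker(s), partitions per worker:", counts)
```

Adding workers beyond the partition count would leave the extra ones idle in this model, which is why 20 is the practical upper bound for a 20-partition topic.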
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id which identifies them as members of the same cluster (and thus eligible for sharing workload of tasks out across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out simpler (you just add additional nodes, but the execution & config remains the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks and connectors, as required.
My understanding is that if you only have a single machine, you should launch only one Kafka Connect instance and set the tasks.max property to the amount of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, you can run it in distributed mode.
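For reference, here is a sketch of what a connector definition with tasks.max set to 20 might look like, built as a Python dict and serialized to JSON. It assumes the Confluent S3 sink connector, which the question does not name; the connector name, topic, bucket, region and flush size are placeholders rather than values from the original post.

```python
import json

# Assumption: the Confluent S3 sink connector is installed on the workers.
# All names and values below are hypothetical placeholders.
connector = {
    "name": "s3-sink-example",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        # Upper bound on parallel tasks; 20 matches the 20 partitions.
        "tasks.max": "20",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}

# JSON body that would be sent to a Connect worker.
payload = json.dumps(connector, indent=2)
print(payload)
```

In distributed mode, a body like this is submitted with an HTTP POST to the /connectors endpoint of any worker's REST API (port 8083 by default), and the cluster then spreads the tasks across the workers.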
So far, I have been using Spring Boot apps (with Spring Cloud Stream) and Kafka running without any supporting infrastructure (PaaS).
Since our corporate platform is running on Kubernetes, we need to move those Spring Boot apps into K8s to allow the apps to scale and so on. Obviously there will be more than one instance of every application, so we will define a consumer group per application to ensure that every message is delivered and processed exactly once per application.
Kafka will be running outside Kubernetes.
Now my doubt is: since the apps deployed on K8s are accessed through the K8s service that abstracts the underlying pods, and individual application pods can't be accessed directly from outside the K8s cluster, Kafka won't know how to call individual instances of the consumer group to deliver the messages, will it?
How can I make them work together?
Kafka brokers do not push data to clients. Rather, clients call poll() and pull data from the brokers. As long as the consumers can connect to the bootstrap servers, and the brokers advertise (via the advertised.listeners setting) an IP and port that the clients can reach, it will all work fine.
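The pull model can be shown with a small in-memory stand-in. This is not a Kafka client: ToyBroker and ToyConsumer are hypothetical classes that only mimic the direction of the calls, with the consumer tracking its own offset and the broker never calling the consumer.

```python
class ToyBroker:
    """In-memory stand-in for a broker: stores records, never calls clients."""

    def __init__(self):
        self.log = []

    def append(self, record):
        self.log.append(record)

    def fetch(self, offset, max_records):
        # Serve whatever the client asks for; the broker keeps no callback.
        return self.log[offset:offset + max_records]


class ToyConsumer:
    """Client that opens the connection and pulls records at its own pace."""

    def __init__(self, broker):
        self.broker = broker  # outbound connection, initiated by the client
        self.offset = 0       # position tracked on the consumer side

    def poll(self, max_records=2):
        batch = self.broker.fetch(self.offset, max_records)
        self.offset += len(batch)
        return batch


broker = ToyBroker()
for i in range(3):
    broker.append(f"msg-{i}")

consumer = ToyConsumer(broker)
first = consumer.poll()   # ['msg-0', 'msg-1']
second = consumer.poll()  # ['msg-2']
third = consumer.poll()   # [] because nothing new has arrived
print(first, second, third)
```

Because every call goes from consumer to broker, the pods only need outbound connectivity to the brokers; the Kubernetes service in front of the pods plays no part in message delivery.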
Can Spring Cloud Data Flow solve your requirement to control the number of instances deployed?
Also, there is a community-released Spring Cloud Data Flow server for OpenShift:
https://github.com/donovanmuller/spring-cloud-dataflow-server-openshift