We are building a Flink application that will be deployed to AWS Kinesis Data Analytics (KDA). This application will consume from Kafka and write to S3.
Our setup is as follows:
We have a Kafka bootstrap server (MSK) with several topics.
We are planning to have multiple Flink applications deployed on KDA. All these applications will be part of the same consumer group.
We want to do the following:
Assume we have 10 Kafka topics (topic 1 through topic 10).
Assume we have 5 Flink applications (app 1 through app 5).
Initially we will assign applications to topics (e.g., app 1 will consume from topics 1 and 2, app 2 from topics 3 and 4, and so on).
We will store this in a config system (say, a CRUD application), and each Flink app, when it starts, should be able to look up which topics it should consume from based on its name. (This part we are able to do.)
Now assume there is suddenly a huge surge in the number of messages coming through one topic, say topic 4. We will update the config system so that app 4, which is consuming from topics 7 and 8, instead consumes from topics 7 and 4.
We want the Flink app to stop consuming from the old topic and start consuming from the new topic without re-deploying the Flink app. We will have a poller which can inform the Flink app that it should consume from a different topic. The issue is making the Flink app stop consuming from the old topic and start consuming from the new topic without re-deployment.
Is there any way to do this? As far as my research goes, the only way to make a Flink app read from a new topic is to redeploy it, but I want to check whether someone has figured out another way.
Conversely: will this situation be handled automatically if we make all 5 Flink applications listen to all 10 topics? That is, if there is a sudden surge in one of the topics, will the Flink applications rebalance themselves to dedicate more resources to reading from the hot topic, since they are all part of the same consumer group?
Flink's Kafka consumer does not support stopping consumption from a topic (without a restart), but it does support dynamic topic and partition discovery. See https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/datastream/kafka/#dynamic-partition-discovery for details.
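To make the distinction concrete, here is a minimal plain-Python sketch (not Flink code) that diffs an app's old and new topic assignment. The function name and the topic names are hypothetical. Per the answer above, the "removed" side is the part that needs a restart, while the "added" side is what discovery could pick up, assuming the app's subscription (for example a topic pattern) already covers the new topic.

```python
def diff_assignment(old_topics, new_topics):
    """Return the topics gained and lost when an app's assignment changes."""
    old_set, new_set = set(old_topics), set(new_topics)
    return {
        "added": sorted(new_set - old_set),      # candidates for dynamic discovery
        "removed": sorted(old_set - new_set),    # would require a restart to drop
    }


# Example from the question: app 4 moves from topics 7 and 8 to topics 7 and 4.
# The "topic-N" names are placeholders, not names from the original setup.
change = diff_assignment(["topic-7", "topic-8"], ["topic-7", "topic-4"])
print(change)  # {'added': ['topic-4'], 'removed': ['topic-8']}
```

A poller could run a diff like this on each config refresh and only trigger a redeploy when the "removed" list is non-empty.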
Related
I am fairly new to Kafka. I am designing and developing a Kafka solution to read Oracle tables (Kafka Connect or CDC is not an option in my organization yet) and send that large number of records to Kafka brokers, where they will eventually be picked up by consumers and sent to the target system. Can my Kafka producer and consumer code live in the same Spring Boot app? Or do I need to have my AppName-Producer and AppName-Consumer as two separate Spring Boot apps so that they can be scaled separately?
Let's say I need to have 3 producers + 3 consumers (as part of one consumer group). Does that mean I need to deploy 6 instances of the same app? Or should 3 producer apps and 3 consumer apps be deployed as separate applications?
The idea behind brokers (Kafka, ActiveMQ, etc.) is to share data asynchronously between different applications. This is one of the distributed architecture patterns. It means that different applications (within the enterprise or from a third party) share the same data.
"Let's say, I need to have 3 producers + 3 consumers (as part of one consumer group), so will that mean I need to deploy 6 instances of the same app? or 3 producer-app and 3 consumer-app to be deployed as separate applications?"
Now, to answer your questions:
You don't need to deploy 6 instances of the same app.
You can deploy the consumer and the producer in the same application.
I was using Kafka 0.9 and recently migrated to Kafka 1.0, but the client I am using is still 0.9. Irrespective of this, I am facing a problem where our consumers intermittently stop consuming from one or two of the partitions.
I have 5 consumers reading from 24 partitions. These are consumer JVM threads created by an application deployed on a single server. Frequently, one of the consumers (threads) will stop reading from one of the partitions it is assigned.
E.g., one consumer thread would be reading from partitions 1, 2, 3, and 4. It will stop reading from partition 1, and lag builds up on that partition. I have to restart the consumer for it to start picking up messages from that particular partition again.
I want to understand the issue here.
My consumer configuration:
session.timeout.ms=150000
request.timeout.ms=300000
max.partition.fetch.bytes=153600
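For reference, here is a rough sketch of how 24 partitions might be split across 5 consumers in one group. It is a plain-Python illustration of a range-style split (contiguous blocks, with the remainder going to the first consumers), similar in spirit to Kafka's range assignor but not the client's actual code. The function name is hypothetical and partitions are numbered from 0.

```python
def range_assign(num_partitions, num_consumers):
    """Split partitions 0..n-1 into contiguous ranges, one range per consumer."""
    base, extra = divmod(num_partitions, num_consumers)
    assignment, start = [], 0
    for consumer in range(num_consumers):
        # The first `extra` consumers each take one additional partition.
        size = base + (1 if consumer < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment


assignment = range_assign(24, 5)
for i, parts in enumerate(assignment):
    print(f"consumer-{i}: {parts}")
```

With 24 partitions and 5 consumers, four consumers own 5 partitions each and one owns 4, so a single stalled partition only affects part of one thread's share.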
I have a NiFi cluster, and Kafka is also installed on it.
I created one topic with 5 partitions and started consuming that topic with one group-id, so that each partition would deliver unique messages.
I then created 5 ConsumeKafka_1_0 processors, with the intent of getting unique messages on each consumer. But only 2 of the ConsumeKafka_1_0 processors are consuming all the messages; the rest are sitting idle.
Next, I started 5 command-line Kafka consumers. With those, I could see that all the partitions were receiving messages and that the command-line consumers were consuming them in round-robin fashion.
I also described the Kafka group and saw that only 2 of the NiFi ConsumeKafka_1_0 processors are consuming all 5 partitions and the rest are idle (see the snapshot).
Could you please let me know what I am doing wrong with the NiFi consumer processor?
Note: I am using NiFi 1.5 and Kafka 1.0.
I've written this article which explains how the integration with Kafka works:
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
The Apache Kafka client (used by NiFi) is what assigns partitions to the consumers.
Typically if you had a 5 node NiFi cluster, with 1 ConsumeKafka processor on the canvas with 1 concurrent task, then each node would be consuming 1 partition.
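As a back-of-the-envelope check of the statement above, the sketch below counts how many consumer instances end up in one group and how many can actually hold a partition. It assumes every node runs every processor with the same group id, and relies on the Kafka rule that a partition is consumed by at most one member of a group. The function name and the 5-node figure for the multi-processor case are assumptions for illustration only.

```python
def consumer_usage(nodes, processors, concurrent_tasks, partitions):
    """Count consumers in the group and bound how many can own a partition."""
    consumers = nodes * processors * concurrent_tasks
    max_active = min(consumers, partitions)  # at most one consumer per partition
    return {
        "consumers": consumers,
        "max_active": max_active,
        "min_idle": consumers - max_active,
    }


# The case from the answer: 5 nodes, 1 processor, 1 concurrent task, 5 partitions.
print(consumer_usage(nodes=5, processors=1, concurrent_tasks=1, partitions=5))
# {'consumers': 5, 'max_active': 5, 'min_idle': 0}
```

With 5 such processors sharing one group id on the same hypothetical 5-node cluster, the function reports 25 consumers competing for 5 partitions, so at least 20 of them stay idle.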
I have a Storm cluster of 5 nodes and a Kafka cluster installed on the same nodes.
storm version: 1.2.1
kafka version: 1.1.0
I also have a Kafka topic with 10 partitions.
Now I want to consume this topic's data and process it with Storm, but the message consumption speed is really strange.
For testing purposes, my Storm topology has only one component, a Kafka spout, and I always set the Kafka spout parallelism to 10 so that each partition is read by exactly one thread.
When I run this topology on just 1 worker, all partitions are read quickly and the lag is almost the same for each (very small).
When I run this topology on 2 workers, 5 partitions are read quickly, but the other 5 partitions are read very slowly.
When I run this topology on 3 or 4 workers, 7 partitions are read quickly and the other 3 partitions are read very slowly.
When I run this topology on 5 or more workers, 8 partitions are read quickly and the other 2 partitions are read slowly.
Another strange thing is that when I use a different consumer group id when configuring the Kafka spout, the test result may be different.
For example, when I use a specific group id and run the topology on 5 workers, only 2 partitions are read quickly, just the opposite of the test using another group id.
I have written a simple Java app that calls the high-level Kafka Java API. I ran it on each of the 5 Storm nodes and found that it can consume data very quickly from every partition, so a network issue can be excluded.
Has anyone met the same problem before? Or does anyone have an idea of what may cause such a strange problem?
Thanks!
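When comparing runs like these, it can help to compute per-partition lag the same way each time: log end offset minus the group's committed offset. The sketch below is plain Python with made-up offsets (they are not measurements from this cluster), and both function names are hypothetical.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition = log end offset - committed offset (0 if none committed)."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}


def slow_partitions(lag, threshold):
    """Return the partitions whose lag exceeds the given threshold."""
    return sorted(p for p, value in lag.items() if value > threshold)


# Hypothetical offsets for three partitions, for illustration only.
end_offsets = {0: 1000, 1: 1000, 2: 1000}
committed = {0: 990, 1: 400, 2: 995}

lag = partition_lag(end_offsets, committed)
print(lag)                        # {0: 10, 1: 600, 2: 5}
print(slow_partitions(lag, 100))  # [1]
```

Recording which partition ids fall behind for each worker count and group id would show whether the slow set follows particular partitions or particular workers.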
I am very new to Kafka and I am dabbling about with it.
Say I have Kafka running on a Debian machine and I have managed to create a topic with 100 messages on it.
After that initial burst of activity (i.e. placing 100 messages onto the topic via some Kafka producer), the topic just sits there idle with nothing happening (no consumers consuming and no producers producing).
I am aware of a Message Retention Policy setting, which I believe has a default value of 7 days. Let's say those 7 days pass, and the messages are indeed removed from the Topic, but what about the Topic itself?
Will Kafka eventually kill that Topic?
Also, what happens when I manually go and pull out the power cord for the machine that Kafka is running on? Will the Topic be discarded? Or will I still have my topic after I start up the machine, run ZooKeeper and create a Kafka Broker?
Any light on this matter would be appreciated.
Thank you
No, Kafka will keep the topic. It would be a bad idea for Kafka to delete topics on its own.
Before version 1.0.0 the topic deletion option (delete.topic.enable) was set to false by default. So it wasn't even possible to delete it without changing the config.
So the answer to your question is that Kafka never deletes topics on its own.
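The difference between message retention and topic deletion can be sketched with a toy model. This is plain Python, not Kafka code: real Kafka applies retention to log segments rather than to individual messages, and the topic name, function name, and hour-based timestamps here are assumptions. The 168-hour value mirrors the 7-day default mentioned in the question.

```python
RETENTION_HOURS = 168  # 7 days, the default retention mentioned in the question


def apply_retention(topics, now_hours, retention_hours=RETENTION_HOURS):
    """Drop expired messages from every topic; never drop the topic itself."""
    return {
        name: [ts for ts in timestamps if now_hours - ts < retention_hours]
        for name, timestamps in topics.items()
    }


# Toy model: a topic maps to the write times (in hours) of its messages.
topics = {"my-topic": [0] * 100}  # 100 messages written at hour 0

after_8_days = apply_retention(topics, now_hours=8 * 24)
print(after_8_days)  # {'my-topic': []}
```

After 8 days the messages are gone, but "my-topic" is still present with an empty log, which matches the behaviour described in the answer.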