How exactly does the Apache NiFi ConsumeKafka_1_0 processor work?

I have a NiFi cluster, and Kafka is also installed there.
I created one topic with 5 partitions and started consuming it with a single group ID, so that each partition delivers unique messages.
I then created 5 ConsumeKafka_1_0 processors, intending each consumer to receive unique messages. But only 2 of the ConsumeKafka_1_0 processors are consuming all the messages; the rest are sitting idle.
Next, I started 5 command-line Kafka consumers. There I could see that all the partitions were receiving messages, and the command-line consumers consumed them in round-robin fashion.
I also tried describing the Kafka consumer group, and saw that only 2 of the NiFi ConsumeKafka_1_0 processors are consuming all 5 partitions while the rest are idle (see the snapshot).
Could you please let me know what I am doing wrong with the NiFi consumer processor?
Note: I am using NiFi version 1.5 and Kafka version 1.0.

I've written this article which explains how the integration with Kafka works:
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
The Apache Kafka client (used by NiFi) is what assigns partitions to the consumers.
Typically if you had a 5 node NiFi cluster, with 1 ConsumeKafka processor on the canvas with 1 concurrent task, then each node would be consuming 1 partition.
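To make the arithmetic concrete, here is a minimal sketch that simulates range-style partition assignment for a single topic. It is not the Kafka client itself: the `range_assign` and `nifi_consumers` helpers and the member names are hypothetical, and it assumes every node, processor, and concurrent task contributes one consumer to the same group. Real member IDs come from the group coordinator, and the actual result depends on the configured assignment strategy.

```python
def range_assign(num_partitions, consumers):
    """Simulate range-style assignment for one topic.

    Members are sorted by id; each gets num_partitions // n partitions,
    and the first num_partitions % n members get one extra.
    """
    members = sorted(consumers)
    base, extra = divmod(num_partitions, len(members))
    assignment = {}
    start = 0
    for i, member in enumerate(members):
        count = base + (1 if i < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment


def nifi_consumers(nodes, processors, concurrent_tasks):
    """Hypothetical helper: one consumer id per node, processor, and task."""
    return [
        f"node{n}-proc{p}-task{t}"
        for n in range(1, nodes + 1)
        for p in range(1, processors + 1)
        for t in range(1, concurrent_tasks + 1)
    ]


# 5 nodes, 1 processor, 1 concurrent task: 5 consumers for 5 partitions.
one_each = range_assign(5, nifi_consumers(5, 1, 1))

# Assumed variant: 5 processors on the same 5 nodes, so 25 consumers
# compete for 5 partitions and most of them receive nothing.
crowded = range_assign(5, nifi_consumers(5, 5, 1))
idle = [member for member, parts in crowded.items() if not parts]

print(one_each)
print(len(idle), "of", len(crowded), "consumers are idle")
```

Under these assumptions, adding more processors with the same group ID than there are partitions only adds idle consumers; a single processor per cluster is usually enough.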

Related

Apache flink: Dynamically change the consumer topic

We are building a Flink application which will be deployed to AWS Kinesis Data Analytics (KDA). This application will consume from Kafka and write to S3.
Our setup is as follows:
We have a Kafka bootstrap server (MSK) with several topics.
We are planning to have multiple Flink applications deployed on KDA. All these applications will be part of the same consumer group.
We want to do the following:
Assume we have 10 Kafka topics (topic 1 through topic 10).
Assume we have 5 Flink applications (app 1 through app 5).
Initially we will assign applications to topics (ex: app 1 will consume from topic 1 and 2, app 2 will consume from topic 3 and 4 and so on).
We will store this in a config system (say, a CRUD application), and each Flink app, when it comes alive, should be able to see which topics it should consume from based on its name. (This part we are able to do.)
Assume, suddenly there is a huge surge in the number of messages coming through topic 4 for example. We will update the config system to point App 4 which is consuming from topic 7 and topic 8 to instead consume from topic 7 and topic 4.
We want the Flink app to stop consuming from the old topic and start consuming from the new topic without re-deploying the Flink app. We will have a poller which can inform the Flink app that it should consume from a different topic. The issue is making the Flink app stop consuming from the old topic and start consuming from the new topic without re-deployment.
Is there any way to do this? As far as my research goes, the only way to make the Flink app read from a new topic is to redeploy it. But I want to check whether someone has figured out another way.
Conversely: will this situation be handled automatically if we make all 5 Flink applications listen to all 10 topics? I mean, if there is a sudden surge in one of the topics, will the Flink applications rebalance themselves to dedicate more resources to reading from the hot topic, since they are all part of the same consumer group?
Flink's Kafka consumer does not support stopping consumption from a topic (without a restart), but it does support dynamic topic and partition discovery. See https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/datastream/kafka/#dynamic-partition-discovery for details.
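As a rough illustration of what pattern-based discovery means, the sketch below models one discovery pass in plain Python. It is a simulation, not the Flink API: the `discover` function and the `orders-N` topic names are hypothetical, and the rule that already-subscribed topics are never dropped is simply a model of the limitation described above.

```python
import re


def discover(topic_pattern, cluster_topics, subscribed):
    """One discovery pass: add newly matching topics, never drop existing ones."""
    regex = re.compile(topic_pattern)
    newly_found = sorted(
        topic
        for topic in cluster_topics
        if regex.fullmatch(topic) and topic not in subscribed
    )
    subscribed.update(newly_found)
    return newly_found


subscribed = set()
pattern = r"orders-\d+"

# First pass: two matching topics exist.
first = discover(pattern, {"orders-1", "orders-2", "audit"}, subscribed)

# Later pass: a new matching topic has appeared and is picked up.
second = discover(pattern, {"orders-1", "orders-2", "orders-3", "audit"}, subscribed)

# A topic that disappears from the listing is not unsubscribed in this model.
third = discover(pattern, {"orders-1"}, subscribed)

print(first, second, third, sorted(subscribed))
```

In this model, pointing an app at an additional topic works if the new topic matches the pattern, but taking a topic away still needs a restart.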

How can I run Kafka Consumer processor instance on multiple nodes with Apache Nifi

Currently we are using Apache NiFi to consume messages via a Kafka consumer. The output of the Kafka consumer is connected to a Hive processor.
I'm looking into how to run a Kafka consumer instance on a NiFi cluster.
I have a 3-node NiFi cluster and a Kafka topic which has 3 partitions. I want the Kafka consumer to be able to run on each node, so that each consumer can poll messages from one of the topic's partitions.
After I started the Kafka consumer processor, I can see that the Kafka consumer always runs on a single node, not on all nodes.
Is there any configuration that I missed?
NiFi uses the Apache Kafka client which is what performs the assignment of consumers to partitions. When you start the processor, assuming you have it set to 1 concurrent task, then you should have 1 consumer on each node of your cluster, and each consumer should get assigned a different partition.
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka

Kafka consumer is not reading from only one partition out of 4

I was using Kafka 0.9 and recently migrated to Kafka 1.0, but the client I am using is still 0.9. Irrespective of this, I have been facing a problem where our consumers intermittently stop consuming from one or two of the partitions.
I have 5 consumers reading from 24 partitions; these are consumer JVM threads created by an application deployed on a single server. Frequently, one of the consumers (threads) will stop reading from one of the partitions it was consuming from.
E.g., one consumer thread would be reading from partitions 1, 2, 3, and 4. It will stop reading from partition 1 and end up building lag. I have to restart the consumer for it to start picking up messages from that particular partition again.
I want to understand the issue here.
My consumer configuration
session.timeout.ms=150000
request.timeout.ms=300000
max.partition.fetch.bytes=153600

storm-kafka-client spout consume message at different speed for different partition

I have a storm cluster of 5 nodes and a kafka cluster installed on the same nodes.
storm version: 1.2.1
kafka version: 1.1.0
I also have a kafka topic of 10 partitions.
Now, I want to consume this topic's data and process it with Storm. But the message consumption speed is really strange.
For testing, my Storm topology has only one component, a Kafka spout, and I always set the Kafka spout parallelism to 10, so that each partition is read by only one thread.
When I run this topology on just 1 worker, all partitions are read quickly and the lag is almost the same (very small).
When I run this topology on 2 workers, 5 partitions are read quickly, but the other 5 partitions are read very slowly.
When I run this topology on 3 or 4 workers, 7 partitions are read quickly and the other 3 partitions are read very slowly.
When I run this topology on more than 5 workers, 8 partitions are read quickly and the other 2 partitions are read slowly.
Another strange thing is that when I use a different consumer group ID when configuring the Kafka spout, the test result may be different.
For example, when I use a specific group ID and run the topology on 5 workers, only 2 partitions are read quickly, just the opposite of the test using another group ID.
I have written a simple Java app that calls the high-level Kafka Java API. I ran it on each of the 5 Storm nodes and found it can consume data very quickly from every partition, so a network issue can be excluded.
Has anyone met the same problem before? Or does anyone have an idea of what may cause such a strange problem?
Thanks!

Why Nifi consumerKafka_0_10 Processor receives flowfile less than total flowfile?

I have 1 producer (PublishKafka_0_10 processor) and 1 consumer (ConsumeKafka_0_10 processor) to receive flowfiles from a Kafka cluster.
In the NiFi admin UI, I see that the total output of the producer is 7 packages, but the consumer receives only 4 packages. I also used kafka-console-consumer.sh to view the packages from the producer, and it displays all 7 packages.
I don't know why or where the ConsumeKafka_0_10 processor lost 3 packages.
I use a Kafka cluster with 3 nodes and a NiFi cluster with 3 nodes too.
A couple of things to check...
The ConsumeKafka processor defaults to the latest offset the first time you run it, so if you started PublishKafka first and then ConsumeKafka, it's possible that a few messages were published before the consumer started, and the consumer then started at the offset of message 4.
Also check if you have a Message Demarcator in ConsumeKafka. If you do then it will be placing more than one message into a flow file.
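Both effects can be shown with a small simulation. This is only a sketch, not NiFi or Kafka code: the `consume_from` and `bundle` helpers are hypothetical, the point at which the consumer joins is assumed, and the fixed `batch_size` is a simplification (in practice, how many messages end up in one flow file depends on what each poll returns).

```python
def consume_from(log, start_offset):
    """Return the messages a consumer sees when it starts at start_offset."""
    return log[start_offset:]


def bundle(messages, demarcator, batch_size):
    """Join up to batch_size messages into one flow file body using a demarcator."""
    return [
        demarcator.join(messages[i:i + batch_size])
        for i in range(0, len(messages), batch_size)
    ]


# 7 messages published to the topic.
log = [f"msg-{i}" for i in range(1, 8)]

# Case 1: the consumer starts at "latest" after 3 messages already exist,
# so it only ever sees the remaining 4.
late = consume_from(log, 3)

# Case 2: the consumer sees all 7 messages, but a demarcator packs
# several messages into each flow file (assumed batch size of 2).
flow_files = bundle(log, "\n", 2)

print(len(late), "messages seen by the late consumer")
print(len(flow_files), "flow files holding", len(log), "messages")
```

In both cases the count shown for the consumer is lower than 7 even though no message was actually lost to a consumer that reads from the beginning, which matches what the console consumer displayed.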