How to monitor consumer lag in Kafka via JMX? - apache-kafka

I have a Kafka setup that includes a JMX exporter for Prometheus. I'm looking for a metric that gives the offset lag by topic and group id. I'm running Kafka 2.2.0.
Some resources online point to a metric called kafka.consumer, but I have no such metric in my setup.
From my JMX terminal:
$>domains
#following domains are available
JMImplementation
com.sun.management
java.lang
java.nio
java.util.logging
jdk.management.jfr
kafka
kafka.cluster
kafka.controller
kafka.coordinator.group
kafka.coordinator.transaction
kafka.log
kafka.network
kafka.server
kafka.utils
I am, however, able to see the data I need by using the following command:
root@kafka-0:/kafka# bin/kafka-consumer-groups.sh --describe --group benchmark_consumer_group --bootstrap-server localhost:9092
Consumer group 'benchmark_consumer_group' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
benchmark_topic_10B 2 2795128 54223220 51428092 - - -
benchmark_topic_10B 9 4 4 0 - - -
benchmark_topic_10B 6 7 7 0 - - -
benchmark_topic_10B 7 5 5 0 - - -
benchmark_topic_10B 0 2834028 54224939 51390911 - - -
benchmark_topic_10B 1 15342331 54222342 38880011 - - -
benchmark_topic_10B 4 5 5 0 - - -
benchmark_topic_10B 5 6 6 0 - - -
benchmark_topic_10B 8 8 8 0 - - -
benchmark_topic_10B 3 4 4 0 - - -
But that does not help, since I need to track it from a metric. Also, this command takes about 25 seconds to execute, which makes it unreasonable to use as a source for metrics.
My guess is that the kafka.consumer metric does not exist in version 2.2.0 and was replaced with another. However, I can't find any up-to-date resources online on how and where to get that metric.

You can give Kafka Minion ( https://github.com/cloudworkz/kafka-minion ) a try. While Kafka Minion internally works similarly to Burrow (it consumes the __consumer_offsets topic to get consumer group offsets), it has several advantages for your use case.
Advantages of Kafka Minion over Burrow for your case:
Has native Prometheus support (no additional deployment necessary just to expose metrics to Prometheus)
Has a sample Grafana dashboard
Has additional metrics (such as the last commit timestamp for a consumergroup:topic:partition combination, commit rates, info about the cleanup policy; you can list all consumer groups for a given topic, etc.)
No ZooKeeper dependency (which also means that consumers which still commit offsets to ZooKeeper are not supported)
High availability support (!!). Burrow has the problem that it always exposes metrics, which will be wrong when it has just started consuming the __consumer_offsets topic. Therefore you cannot run it in HA mode. This is a problem when you want to set up alerts based on consumer group lag.
Kafka Minion does not support multiple clusters, which reduces complexity in the code and for the end user. You can obviously still deploy one Kafka Minion instance per cluster.
Disclaimer: I am the author of Kafka Minion, and I am still looking for more feedback from other users. I intend to actively maintain and develop the exporter for my projects, for the company I work for, and for the community.
To answer your question regarding what you are seeing with the kafka-consumer-groups.sh shell script: this won't work, as it cannot report lag for inactive consumers, which is a bit counterproductive.

The kafka.consumer JMX metrics are only present on the consumer processes themselves, not on the Kafka broker processes. Note that you would not get the kafka.consumer metrics from consumers using a consumer library other than the Java one.
Currently, there are no JMX metrics for consumer lag available from the Kafka broker itself. There are other solutions commonly used for monitoring consumer lag, such as Burrow by LinkedIn. There are also a few open-source projects, such as kafka9.offsets, that expose consumer lag metrics via JMX, but they may not be updated to work with the latest Kafka.
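If polling the offsets is acceptable, you can also compute the lag yourself with the Java clients and expose it to Prometheus however you like. A minimal sketch, assuming a broker at localhost:9092 and the group from the question (a Kafka 2.0+ client is required for listConsumerGroupOffsets; error handling omitted):

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        String bootstrap = "localhost:9092";          // assumed broker address
        String groupId = "benchmark_consumer_group";  // group from the question

        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);

        try (AdminClient admin = AdminClient.create(adminProps)) {
            // Committed offsets for the group, keyed by topic-partition
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            Properties consProps = new Properties();
            consProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
            consProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            consProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consProps)) {
                // Log-end offsets for the same partitions
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());
                committed.forEach((tp, om) ->
                        System.out.printf("%s-%d lag=%d%n",
                                tp.topic(), tp.partition(), endOffsets.get(tp) - om.offset()));
            }
        }
    }
}

This is the same LOG-END-OFFSET minus CURRENT-OFFSET the shell script prints, but it runs in-process, so it can back a Prometheus gauge without forking a JVM on every scrape.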

Related

How exactly does the Apache NiFi ConsumeKafka_1_0 processor work

I have a NiFi cluster, and Kafka is also installed there.
I created one topic with 5 partitions and started consuming that topic with one group id, so that each partition's messages go to exactly one consumer.
I then created 5 ConsumeKafka_1_0 processors with the intent of getting unique messages on each consumer side. But only 2 of the ConsumeKafka_1_0 processors are consuming all the messages; the rest are sitting idle.
Next I started 5 command-line Kafka consumers, and this time I could see that all the partitions were getting messages and the command-line consumers consumed them in round-robin fashion.
I also described the Kafka group, and it showed that only 2 of the NiFi ConsumeKafka_1_0 processors were consuming from all 5 partitions while the rest were idle; see the snapshot.
Would you please let me know what I am doing wrong here with the NiFi consumer processor.
Note: the NiFi version is 1.5 and the Kafka version is 1.0.
I've written this article which explains how the integration with Kafka works:
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
The Apache Kafka client (used by NiFi) is what assigns partitions to the consumers.
Typically, if you had a 5-node NiFi cluster with 1 ConsumeKafka processor on the canvas and 1 concurrent task, then each node would be consuming 1 partition.
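To make that concrete, here is a minimal sketch with the plain Java client showing the group coordinator handing each consumer instance a subset of partitions; the broker address, group id, and topic name are assumed placeholders, and poll(Duration) requires a 2.x client (on a 1.x client use poll(long)):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AssignmentDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // assumed
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic
            consumer.poll(Duration.ofSeconds(5)); // the first poll joins the group
            // The subset of partitions the coordinator handed to this instance
            System.out.println("Assigned: " + consumer.assignment());
        }
    }
}

Run several copies with the same group id and the printed assignments partition the topic between them; each ConsumeKafka task is essentially such an instance.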

Kafka broker goes out of sync for random partitions

I have a setup of 4 Kafka brokers. Each partition in each topic in my setup has a replication factor of 2. All partitions are balanced: leaders and followers are uniformly distributed.
This setup has been running for over 6 months.
While monitoring the setup via Kafka Manager, I see that 8% of my partitions are under-replicated.
All these partitions were assigned to the same set of replicas, and every partition which was assigned to this set of replicas is displayed as under-replicated.
Let's call this set of brokers [1,2], i.e. brokers 1 and 2. The ISR for all these partitions is [1] right now.
Both brokers 1 and 2 are up and running. All other partitions have the ISR count as expected.
The script bin/kafka-topics.sh also shows 8% of partitions to be under-replicated.
But the Jolokia metric UnderReplicatedPartitions is 0.
I need help answering:
Is there an issue?
Why is there an inconsistency between the Jolokia metric and the Kafka console?
How can I fix the issue?
I can't say anything about the "jolokia metric", but we experienced the same because we had a "slow" broker which was lagging behind in replicating the data.
"Slow" meaning that the replication requests sometimes breached the broker-wide configuration replica.lag.time.max.ms, which defaults to 10 seconds and is described as:
"If a follower hasn't sent any fetch requests or hasn't consumed up to the leaders log end offset for at least this time, the leader will remove the follower from isr"
Slightly increasing this configuration solved the problem for us.
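If you want to verify the value currently in effect before changing it, you can read it from a broker with the Java AdminClient. A minimal sketch (the broker address and the broker id "1" are assumptions for illustration):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class BrokerConfigCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed

        try (AdminClient admin = AdminClient.create(props)) {
            // Broker-level configs for broker id "1" (assumed id)
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get().get(broker);
            System.out.println("replica.lag.time.max.ms = "
                    + config.get("replica.lag.time.max.ms").value());
        }
    }
}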

storm-kafka-client spout consumes messages at different speeds for different partitions

I have a Storm cluster of 5 nodes and a Kafka cluster installed on the same nodes.
Storm version: 1.2.1
Kafka version: 1.1.0
I also have a Kafka topic with 10 partitions.
Now I want to consume this topic's data and process it with Storm, but the consumption speed is really strange.
For testing, my Storm topology has only one component - the Kafka spout - and I always set the Kafka spout parallelism to 10, so that each partition is read by only one thread.
When I run this topology on just 1 worker, all partitions are read quickly and the lag is almost the same (very small).
When I run this topology on 2 workers, 5 partitions are read quickly, but the other 5 partitions are read very slowly.
When I run this topology on 3 or 4 workers, 7 partitions are read quickly and the other 3 partitions are read very slowly.
When I run this topology on more than 5 workers, 8 partitions are read quickly and the other 2 partitions are read slowly.
Another strange thing: when I use a different consumer group id when configuring the Kafka spout, the test result may be different.
For example, when I use a specific group id and run the topology on 5 workers, only 2 partitions can be read quickly - just the opposite of the test with another group id.
I have written a simple Java app that calls the high-level Kafka Java API. I ran it on each of the 5 Storm nodes and found it can consume data very quickly from every partition, so a network issue can be excluded.
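For reference, a minimal sketch of such a plain high-level consumer, printing how many records each partition delivers per poll; the broker address, topic, and group id are assumed placeholders (poll(Duration) needs a 2.x client; on 1.1.0 use poll(long)):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PlainConsumerTest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "plain-test-group");        // assumed
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // Print how many records each partition delivered in this poll
                records.partitions().forEach(tp ->
                        System.out.println(tp + " -> " + records.records(tp).size()));
            }
        }
    }
}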
Has anyone met the same problem before, or has any idea of what may cause such a strange problem?
Thanks!

How to run Kafka on different machines

For the last 10 days I have been trying to set up Kafka on two different machines:
Server32
Server56
Below is the list of tasks I have done so far:
Configured ZooKeeper and started it on both servers with
server.1=Server32_IP:2888:3888
server.2=Server56_IP:2888:3888
I also changed the server and server-1 properties as below:
broker.id=0 port=9092 log.dir=/tmp/kafka0-logs
host.name=Server32
zookeeper.connect=Server32_IP:9092,Server56_IP:9062
& server-1
broker.id=1 port=9062 log.dir=/tmp/kafka1-logs
host.name=Server56
zookeeper.connect=Server32_IP:9092,Server56_IP:9062
server.properties I ran on Server32.
server-1.properties I ran on Server56.
The problem is: when I start a producer on both servers and try to consume from either one, it works. BUT
when I stop either server, the other one is not able to send the details.
Please help me by explaining the process.
Running 2 ZooKeeper nodes is not fault tolerant. If one of them is stopped, the system will not work. Unlike Kafka brokers, ZooKeeper needs a quorum (or majority) of the configured nodes in order to work. This is why ZooKeeper is typically deployed with an odd number of instances (nodes). Since 1 of 2 nodes is not a majority, it really is no better than running a single ZooKeeper. You need at least 3 ZooKeeper nodes to tolerate a failure, because 2 of 3 is a majority, so the system will stay up.
Kafka is different: you can have any number of Kafka brokers, and if they are configured correctly and you create your topics with a replication factor of 2 or greater, then the Kafka cluster can continue if you take any one of the broker nodes down, even if it's just 1 of 2.
There's a lot of information missing here, like the Kafka version and whether you're using the new consumer APIs or the old ones. I'm assuming you're probably using a new version of Kafka like 0.10.x along with the new client APIs. With the new client APIs, consumer offsets are stored on the Kafka brokers and not in ZooKeeper as in the older versions. I think your issue here is that you created your topics with a replication factor of 1, and coincidentally the Kafka broker you shut down was hosting the only replica, so you won't be able to produce or consume messages. You can confirm the health of your topics by running the command:
kafka-topics.sh --zookeeper ZHOST:2181 --describe
You might want to increase the replication factor to 2; that way you might be able to get away with one broker failing. Ideally you would have 3 or more Kafka brokers with a replication factor of 2 or higher (obviously not more than the number of brokers in your cluster). Refer to the link below:
https://kafka.apache.org/documentation/#basic_ops_increase_replication_factor
"For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log."
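You can also inspect leaders, replicas, and ISR programmatically with the Java AdminClient (available since Kafka 0.11); the broker address and topic name below are assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class TopicHealthCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "Server32_IP:9092"); // assumed
        String topic = "my-topic"; // assumed topic name

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton(topic))
                                         .all().get().get(topic);
            // Each partition reports its leader, replica set, and in-sync replicas
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}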

Kafka 0.10 quickstart: consumer fails when "primary" broker is brought down

So I'm trying the Kafka quickstart as per the main documentation. I got the multi-broker cluster example all set up and tested per the instructions, and it works. For example, bringing down one broker, the producer and consumer can still send and receive.
However, as per the example, we set up 3 brokers and bring down broker 2 (with broker id = 1). Now if I bring up all brokers again but bring down broker 1 (with broker id = 0), the consumer just hangs. This only happens with broker 1 (id = 0); it does not happen with broker 2 or 3. I'm testing this on Windows 7.
Is there something special about broker 1? Looking at the configs, they are exactly the same between all 3 brokers except for the id, port number and log file location.
I thought it was just a problem with the provided console consumer, which doesn't take a broker list, so I wrote a simple Java consumer as per their documentation using the default setup but specifying the list of brokers in the "bootstrap.servers" property, but no dice; I still get the same problem.
The moment I start up broker 1 (broker id = 0), the consumers just resume working. This isn't highly available/fault-tolerant behavior for the consumer... any help on how to set up an HA/fault-tolerant consumer?
The producers don't seem to have an issue.
If you follow the quickstart, the created topic has only one partition with one replica, which is hosted on the first broker by default, namely broker 1. That's why the consumer fails when you bring down this broker.
Try to create a topic with multiple replicas (specifying --replication-factor when creating the topic) and rerun your test to see whether it brings higher availability.
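For completeness, the same can be done programmatically; a minimal sketch using the Java AdminClient against the quickstart's three brokers (the ports and topic name mirror the quickstart, and a 0.11+ client is required):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // The three quickstart brokers
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "localhost:9092,localhost:9093,localhost:9094");

        try (AdminClient admin = AdminClient.create(props)) {
            // 1 partition, replication factor 3: survives the loss of any single broker
            NewTopic topic = new NewTopic("my-replicated-topic", 1, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}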