Gathering `kafka.producer` metrics using JMX - apache-kafka

I have a Kafka broker running, which I am monitoring with JMX.
This broker runs in a Docker container as a process started with kafka-server-start.sh. JMX port 9999 is exposed and is set through an environment variable.
When I connect to the JMX port and try to list all the domains, I get the following:
kafka
kafka.cluster
kafka.controller
kafka.coordinator.group
kafka.coordinator.transaction
kafka.log
kafka.network
kafka.server
kafka.utils
I don't see kafka.producer, which is understandable because the producers for this Kafka broker are N different applications, but at this point I am confused.
How do I get the kafka.producer metrics as well?
Do I have to expose the kafka.producer metrics in each of the N applications acting as a producer, or is there some configuration that starts gathering kafka.producer metrics on the broker itself?
What is the correct way of doing this? Please help.

Yes, you are correct: to capture the producer JMX metrics, you need to enable JMX in every process that runs a Kafka producer instance.

It might be helpful to rephrase producing as writing over an unreliable network in this context.
From this perspective, the most reasonable place to measure writing characteristics seems to be the client itself (i.e. in each "application" as you call it).
If messages between the producer and the broker are lost, you can still send stats to a local "metric store" for example (e.g. you could see a "spike" in record-retry-rate or some other relevant metric).
Additionally, pairing Kafka producer metrics with other local metrics might be extremely useful (JVM stats, detailed business metrics and so on). Keep in mind that in a production environment the client will almost certainly run on a different machine than the broker, and might be affected by different factors.
If you intend to monitor your client application (which will most likely happen anyway), then I'd simply do it there (i.e. the standard way).
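As a rough sketch of what collecting from each producer application could look like: assuming (this is not stated in the answers above) that every producer JVM runs a Jolokia agent exposing JMX over HTTP, a collector could read one producer metric per application as below. The base URL and the client id are hypothetical; record-retry-rate is the metric mentioned above.

```python
import json
import urllib.request


def producer_metric_request(client_id, attribute):
    """Build a Jolokia 'read' request for one producer-metrics attribute."""
    return {
        "type": "read",
        "mbean": f"kafka.producer:type=producer-metrics,client-id={client_id}",
        "attribute": attribute,
    }


def read_producer_metric(base_url, client_id, attribute, timeout_s=5.0):
    """POST the read request to a Jolokia agent and return the metric value.

    base_url is an assumption, e.g. "http://producer-host:8778/jolokia/".
    """
    body = json.dumps(producer_metric_request(client_id, attribute)).encode("utf-8")
    request = urllib.request.Request(
        base_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        reply = json.load(response)
    if reply.get("status") != 200:
        raise RuntimeError(f"Jolokia read failed: {reply}")
    return reply["value"]
```

Each of the N applications would be polled separately, for example read_producer_metric("http://producer-host:8778/jolokia/", "orders-app", "record-retry-rate"), since the broker does not hold these values.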

Related

How to add health check for topics in KafkaStreams api

I have a critical Kafka application that needs to be up and running all the time. The source topics are created by the Debezium Kafka Connect connector for the MySQL binlog. Unfortunately, many things can go wrong with this setup. Debezium connectors often fail and need to be restarted, and then so do my apps (because, without throwing any exception, they just hang and stop consuming). My manual way of discovering the failure is to check the Kibana logs and then consume the suspicious topic from a terminal. I can mimic this in code, but that is obviously not best practice. Is there anything in the Kafka Streams API that lets me do such a health check, and check other parts of the Kafka cluster?
Another point that bothers me is if I can keep the stream alive and rejoin the topics when connectors are up again.
You can check the Kafka Streams state to see whether it is rebalancing or running, which would indicate healthy operation. However, if no data is getting into the topology, I would assume no errors would occur either, so you then need to look up the health of your upstream dependencies.
Overall, it sounds like you might want to invest some time in monitoring tools like Consul or Sensu, which can run local service health checks and send out alerts when services go down. Or, at the very least, Elasticsearch alerting.
As far as Kafka health checking goes, you can do that in several ways:
Are the broker and ZooKeeper processes running? (SSH to the node, check processes)
Are the broker and ZooKeeper ports open? (use a Socket connection)
Are there important JMX metrics you can track? (Metricbeat)
Can you find an active Controller broker? (use AdminClient#describeCluster)
Is there a required minimum number of brokers you would like to respond as part of the Controller metadata? (which can be obtained from AdminClient)
Do the topics that you use have the proper configuration (retention, min-isr, replication-factor, partition count, etc)? (again, use AdminClient)
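The "are the ports open" check from the list above can be sketched with a plain socket connection. This is only a minimal illustration using the Python standard library; the host names in the usage note are hypothetical, and an open port only shows that something is listening, not that the broker is healthy.

```python
import socket


def port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_endpoints(endpoints, timeout_s=2.0):
    """Map each (host, port) pair to whether its port accepted a connection."""
    return {
        (host, port): port_open(host, port, timeout_s) for host, port in endpoints
    }
```

For example, check_endpoints([("kafka-1", 9092), ("zk-1", 2181)]) would return one boolean per endpoint, which a scheduler or alerting tool could then act on.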

During rolling upgrade/restart, how to detect when a kafka broker is "done"?

I need to automate a rolling restart of a kafka cluster (3 kafka brokers). I can easily do it manually - restart one after the other, while checking the log to see when it's fine (e.g., when the new process has joined the cluster).
What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, all topics up-to-date and such? In my restart script, I have access to the metrics, but to be frank, I did not really see one there which gives me a clear picture.
Another way would be to ask what a good "readiness" probe would be that does not simply check some TCP/IP port, but looks at the actual server...
I would suggest exposing JMX metrics and tracking the following for cluster health:
the controller count (must be 1 over the whole cluster)
under replicated partitions (should be zero for a healthy cluster)
unclean leader elections (if you don't disable this in server.properties, make sure there are none in the metric counts)
ISR shrinks within a reasonable time period, like a 10-minute window (should be none)
Also, Yelp has tooling for rolling restarts implemented in Python. It requires Jolokia JMX agents installed on the brokers, and it polls the metrics to make sure some of the above conditions are true.
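The four conditions listed above can be sketched as one check over values that have already been collected. This is not Yelp's tooling, just a minimal illustration; the dictionary keys are hypothetical names, and how the values are gathered (JMX, Jolokia, something else) is left out.

```python
def health_violations(metrics):
    """Return a list of problems; an empty list means all four conditions hold.

    `metrics` is a plain dict of cluster-wide values gathered elsewhere:
      active_controllers  sum of ActiveControllerCount over all brokers
      under_replicated    sum of UnderReplicatedPartitions over all brokers
      unclean_elections   unclean leader elections seen in the window
      isr_shrinks         ISR shrinks seen in the window (e.g. last 10 minutes)
    """
    problems = []
    if metrics["active_controllers"] != 1:
        problems.append(
            f"expected exactly 1 active controller, got {metrics['active_controllers']}"
        )
    if metrics["under_replicated"] != 0:
        problems.append(f"{metrics['under_replicated']} under-replicated partitions")
    if metrics["unclean_elections"] != 0:
        problems.append(f"{metrics['unclean_elections']} unclean leader elections")
    if metrics["isr_shrinks"] != 0:
        problems.append(f"{metrics['isr_shrinks']} ISR shrinks in the window")
    return problems
```

A restart script could refuse to continue while health_violations(...) returns a non-empty list.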
Assuming your cluster was healthy at the beginning of the restart operation, at a minimum, after each broker restart, you should ensure that the under-replicated partition count returns to zero before restarting the next broker.
As the previous responders mentioned, there is existing code out there to automate this. I don't use Jolokia myself, but my solution (which I'm working on now) also uses JMX metrics.
Kafka Utils by Yelp is one of the best tools for detecting when a Kafka broker is "done". Specifically, kafka_rolling_restart is the tool that gets broker details from ZooKeeper and URP (Under Replicated Partitions) metrics from each broker. After a broker is restarted, the total URP count across the Kafka cluster is collected periodically, and when it reaches zero, the next broker is restarted. The controller broker is restarted last.
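The procedure described in these answers (restart one broker, wait until the cluster-wide URP count is back to zero, restart the controller last) can be sketched as follows. This is not the kafka_rolling_restart implementation; the restart and total_urp callables are hypothetical hooks you would supply, and restart is assumed to block until the broker process is back up.

```python
import time


def wait_for_zero_urp(total_urp, timeout_s=600.0, poll_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll total_urp() until it returns 0; return False if timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if total_urp() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def rolling_restart(brokers, controller, restart, total_urp, **wait_kwargs):
    """Restart brokers one at a time, the controller last.

    After each restart, wait for the cluster-wide under-replicated
    partition count to return to zero before touching the next broker.
    """
    order = [b for b in brokers if b != controller]
    order += [b for b in brokers if b == controller]
    restarted = []
    for broker in order:
        restart(broker)
        if not wait_for_zero_urp(total_urp, **wait_kwargs):
            raise TimeoutError(
                f"URP did not return to zero after restarting broker {broker}"
            )
        restarted.append(broker)
    return restarted
```

Passing sleep and clock as parameters is only there to make the loop easy to test without real delays.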

How many bootstrap servers to provide for large Kafka cluster

I have a use case where my Kafka cluster will have 1000 brokers, and I am writing a Kafka client.
In order to write the client, I need to provide a list of brokers.
The question is: what are the recommended guidelines for providing the broker list in the client?
Is there any proxy-like service available in Kafka which we can give to the client?
- that proxy will know all the brokers in the cluster and connect the client to the appropriate broker.
- like in the Redis world, where we have twemproxy (nutcracker)
- can confluent-rest-api act as a proxy?
Is it recommended to provide a specific number of brokers in the client, for example a list of 3 brokers even though the cluster has 1000 nodes?
- what if the provided brokers crash?
- what if the provided brokers restart and their location/IP changes?
The list of broker URLs you pass to the client is only used to bootstrap the client. The client will automatically learn about all other available brokers, and also connect to the correct brokers it needs to "talk to".
Thus, if the client is already running and those brokers go down, the client will not even notice. Only if all those brokers are down at the same time and you start up the client will the client "hang", as it cannot connect to the cluster, and eventually time out.
It's recommended to provide at least 3 broker URLs to "survive" the outage of 2 brokers. But you can also provide more if you need a higher level of resilience.
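To illustrate the point above, a client configuration for a 1000-broker cluster still only needs a short bootstrap list. The sketch below builds the value for the bootstrap.servers setting; the host names are hypothetical, and which brokers you pick (here simply the first three) is an arbitrary choice for the example.

```python
def bootstrap_servers(brokers, count=3):
    """Return a comma-separated bootstrap list using the first `count` brokers."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return ",".join(brokers[:count])


# Hypothetical addresses for a 1000-broker cluster.
all_brokers = [f"broker-{i}.example.internal:9092" for i in range(1, 1001)]

# Only three entries are handed to the client; it discovers the rest itself.
client_config = {"bootstrap.servers": bootstrap_servers(all_brokers)}
```

Raising count trades a longer list for more tolerance to bootstrap brokers being down at client startup.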

How to monitor apache kafka using nagios?

Is there a way to monitor my Kafka cluster using Nagios? Any working plugin, API or whatever to check broker status, partition status, memory status, current offset and all valuable metrics from my cluster?
We are using Nagios to monitor Kafka JMX metrics (we use JMXeval, but you can use any of your favorite JMX monitoring scripts for Nagios), where we can find many useful metrics like memory, lag, number of offline partitions, and so on.
I highly recommend reading this article about Kafka monitoring, where you can find many useful tips on what to monitor - https://blog.serverdensity.com/how-to-monitor-kafka/
Because JMX is disabled by default, you need to enable it first. You can follow the instructions in "Enable JMX on Kafka Brokers".
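Whatever script reads the JMX value, a Nagios plugin reports its result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a one-line status message. The sketch below is not JMXeval; it only shows that mapping for a metric where higher is worse, such as the number of offline partitions. The metric name and thresholds are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(name, value, warn, crit):
    """Return (exit_code, status_line) for a metric where higher is worse.

    `value` is None when the metric could not be read (e.g. JMX unreachable).
    """
    if value is None:
        return UNKNOWN, f"UNKNOWN - {name} could not be read"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    # Text after the pipe is Nagios performance data: label=value;warn;crit
    return code, f"{LABELS[code]} - {name}={value} | {name}={value};{warn};{crit}"
```

A plugin script would print the status line and then call sys.exit with the returned code, for example after evaluate("offline_partitions", value, 1, 1).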

Understanding kafka broker vs zookeeper

I notice that when sending messages to kafka (a producer) the samples show connecting to port 9092 -- writing directly to a broker. When consuming, the examples show connecting to port 2181, presumably zookeeper.
The latter makes sense--I want to read from "the cluster", letting zookeeper figure out which broker the client should communicate with, and managing such things as knowing who's alive/dead in the cluster.
Why wouldn't publish/writes work the same way, i.e. write to "the cluster" (via zookeeper)?
Am I understanding this correctly, that for producing I'm bypassing zookeeper (cluster knowledge) and must know producer nodes (and presumably figure out what to do if one fails)?
The "high level consumer" of Kafka uses Zookeeper to keep track of which partitions each member of a consumer group is consuming, and sometimes to track which offsets were read in which partition. Since access to Zookeeper is required anyway, we may as well use it to figure out where the brokers are...
In the new consumer (coming soon in the next release), Zookeeper is no longer needed, and consumers connect directly to brokers, just like producers currently do.