Impact of starting the Kafka service on bootup - apache-kafka

I have configured the Kafka service to start automatically on bootup. I want to understand the impact of doing so:
1. If the service starts on all Kafka servers at the same time.
2. If the Kafka service starts before all ZooKeeper servers have started.
3. If the Kafka service starts after some delay.
Are there any other impacts of starting the Kafka service automatically on bootup?

The only significant impact would be on the machine's I/O usage. The order in which the brokers start doesn't matter, but if they start before ZooKeeper, they will fail to start at all.
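One way to avoid the "started before ZooKeeper" case is to have the boot script wait until ZooKeeper's client port accepts connections before launching the broker. The following is a minimal sketch, not a definitive implementation: the function name `wait_for_tcp` and its parameters are made up for illustration, and a successful TCP connect only shows that something is listening, not that the ensemble is healthy.

```python
import socket
import time


def wait_for_tcp(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A startup wrapper could call, for example, `wait_for_tcp("server1", 2181)` and only then run `kafka-server-start.sh`; the hostname here is a placeholder.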

Related

Kafka cluster continues to run without ZooKeeper

I have a five-node Kafka cluster (Confluent 5.5 community edition) with 3 ZooKeeper nodes, each on a different AWS instance.
While doing failover testing, I noticed that the Kafka cluster works fine even if all ZooKeeper nodes are down.
I was able to produce, consume, and also create new consumers.
Why does the Kafka cluster not stop if it cannot connect to any ZooKeeper nodes?
What would be the possible issues if we are unaware of such a failure scenario in production and the Kafka cluster continues to run without ZooKeeper connectivity?
How do we handle such a scenario?
Broker leader election, topic creation, and simple ACLs (if you use them) still depend on ZooKeeper. Other basic functions that rely only on the Kafka bootstrap protocols might still work, sure. There should definitely be broker logs indicating that the connection was lost.
Ideally you'd have basic process health checking and incident management software, so that you don't miss critical services going down in prod.
How to handle? Restart ZooKeeper...
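As an illustration of a basic process health check, ZooKeeper answers the four-letter command `ruok` with `imok` on its client port. The sketch below assumes that command is enabled on the servers (from ZooKeeper 3.5 onwards it has to be allowed through `4lw.commands.whitelist`); the function name `zk_ruok` is hypothetical. Note that `imok` only says the server process is running, not that it is part of a working quorum.

```python
import socket


def zk_ruok(host, port=2181, timeout_s=2.0):
    """Send ZooKeeper's 'ruok' four-letter command; True if the reply is 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # Read up to four bytes; the server closes the connection after replying.
            while len(reply) < 4:
                chunk = sock.recv(4 - len(reply))
                if not chunk:
                    break
                reply += chunk
            return reply == b"imok"
    except OSError:
        # Refused, unreachable, or timed out all count as "not ok".
        return False
```

A monitoring job could run this against every ZooKeeper node and raise an alert when any of them returns False.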

Kafka won't start if a Zookeeper node is down

I have Kafka and Zookeeper co-located on the same servers, with multiple nodes.
In Kafka's server.properties, I have a line like
zookeeper.connect=server1:2181,server2:2181...
The problem is that Kafka will not start until all of the ZooKeeper nodes are available. Otherwise, I get errors like "fatal error during Kafka startup" and "Timed out waiting for connection while in state: CONNECTING", even though the other ZooKeeper nodes are up.
This makes it challenging to script startup of each node independently, since the startup scripts on one node are dependent on the state of other nodes.
First: is this expected behavior or am I doing something wrong? Suppose I have 3 nodes in Zookeeper cluster; all 3 nodes have to be up for Kafka to start? That seems counterintuitive, since a larger cluster would actually increase the chance of failure on startup rather than provide more resiliency.
Second: What's a good solution for this? Is the only approach to make Kafka on each node wait until Zookeeper is fully up on all nodes?
As far as I know, this is a prerequisite for Kafka to start up correctly, and I don't think it's too much of a burden. If the ZooKeeper cluster itself is already having problems at startup time, Kafka might run into problems too, so ensuring that the ZooKeeper cluster is healthy is a good initial check, IMHO.
A way to get around this limitation is to configure a single-node Zookeeper cluster, and tell Kafka to use that cluster. After the fact, you can grow the zookeeper cluster to 3 or more nodes, while Kafka is already up and running. More details can be found here:
Adding new ZooKeeper node in Kafka cluster?
For the record, Kafka itself is completely fine if the Zookeeper cluster goes down once it's up and running. It just wouldn't be able to accept new producer/consumer connections or create topics, but the current ones that are active on the cluster continue to work just fine.
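To script the "is ZooKeeper up yet?" check from the same setting Kafka uses, a startup script could parse the `zookeeper.connect` value and see which nodes respond. This is a rough sketch under a few assumptions: the helper names are invented, IPv6 literals are not handled, and `probe` is whatever reachability check you choose to supply (a TCP connect, for instance).

```python
def parse_zk_connect(value):
    """Split a zookeeper.connect value into ([(host, port), ...], chroot)."""
    # An optional chroot path follows the host list, e.g. "a:2181,b:2181/kafka".
    hosts_part, slash, chroot = value.strip().partition("/")
    nodes = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep:
            nodes.append((host, int(port)))
        else:
            nodes.append((entry, 2181))  # assume the default ZooKeeper client port
    return nodes, (slash + chroot if slash else "")


def split_by_reachability(nodes, probe):
    """Call probe(host, port) once per node; return (reachable, unreachable)."""
    up, down = [], []
    for node in nodes:
        (up if probe(*node) else down).append(node)
    return up, down
```

The script can then decide whether to start the broker, for example only once the reachable list covers a majority of the ensemble.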
We hit the same problem in our production environment.
It turned out to be a bug (ZOOKEEPER-2184) in the ZooKeeper client library that Kafka uses to talk to ZooKeeper.
Our Kafka version is 1.1.1, which uses zookeeper-3.4.10.jar.
After we replaced it with zookeeper-3.4.13.jar, Kafka restarted successfully.

During rolling upgrade/restart, how to detect when a Kafka broker is "done"?

I need to automate a rolling restart of a Kafka cluster (3 Kafka brokers). I can easily do it manually: restart one broker after the other, while checking the log to see when it's fine (e.g., when the new process has joined the cluster).
What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, with all topics up to date and so on? In my restart script, I have access to the metrics, but to be frank, I did not really see one there that gives me a clear picture.
Another way to put it would be to ask what a good "readiness" probe would be, one that does not simply check some TCP/IP port but looks at the actual server...
I would suggest exposing JMX metrics and tracking the following for cluster health:
the controller count (must be 1 over the whole cluster)
under-replicated partitions (should be zero for a healthy cluster)
unclean leader elections (if you don't disable them in server.properties, make sure there are none in the metric counts)
ISR shrinks within a reasonable time period, like a 10-minute window (should be none)
Also, Yelp has tooling for rolling restarts implemented in Python. It requires Jolokia JMX agents installed on the brokers, and it polls the metrics to make sure some of the above conditions are true.
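The four checks listed above could be combined into a single health predicate roughly like this. It is only a sketch: the dictionary keys are invented names, and how the numbers get collected (JMX, Jolokia, an exporter) is left to you. On a typical broker they would come from MBeans such as `kafka.controller:type=KafkaController,name=ActiveControllerCount`, `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`, `kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec` and `kafka.server:type=ReplicaManager,name=IsrShrinksPerSec`.

```python
def health_problems(now, before):
    """Compare two per-broker metric snapshots taken one window apart.

    Both arguments map broker id -> dict with the (hypothetical) keys
    active_controller_count, under_replicated_partitions,
    unclean_leader_elections and isr_shrinks (the last two are running counts).
    Returns a list of human-readable problems; empty means healthy.
    """
    problems = []
    controllers = sum(m["active_controller_count"] for m in now.values())
    if controllers != 1:
        problems.append(f"expected exactly 1 active controller, found {controllers}")
    urp = sum(m["under_replicated_partitions"] for m in now.values())
    if urp:
        problems.append(f"{urp} under-replicated partitions")
    for key, label in (("unclean_leader_elections", "unclean leader elections"),
                       ("isr_shrinks", "ISR shrinks")):
        # Counters reset when a broker restarts, so negative deltas are clamped to 0.
        delta = sum(max(0, now[b][key] - before.get(b, now[b])[key]) for b in now)
        if delta:
            problems.append(f"{delta} {label} during the window")
    return problems
```

Taking the two snapshots ten minutes apart gives the "10-minute window" for the shrink and unclean-election counts.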
Assuming your cluster was healthy at the beginning of the restart operation, at a minimum, after each broker restart, you should ensure that the under-replicated partition count returns to zero before restarting the next broker.
As the previous responders mentioned, there is existing code out there to automate this. I don’t use Jolokia, myself, but my solution (which I’m working on now) also uses JMX metrics.
Kafka Utils by Yelp is one of the best tools for detecting when a Kafka broker is "done". Specifically, kafka_rolling_restart is the tool that gets broker details from ZooKeeper and URP (Under Replicated Partitions) metrics from each broker. After a broker is restarted, the total URP count across the Kafka cluster is collected periodically, and when it goes to zero, the tool restarts the next broker. The controller broker is restarted last.
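The loop described above (restart one broker, wait for the cluster-wide URP count to reach zero, move on, controller last) can be sketched as follows. This is not the Yelp implementation: `restart` and `total_urp` are callables you would have to provide, and all names and defaults here are illustrative assumptions.

```python
import time


def rolling_restart(broker_ids, controller_id, restart, total_urp,
                    poll_s=10.0, timeout_s=600.0, sleep=time.sleep):
    """Restart brokers one at a time, waiting for URP == 0 between restarts.

    restart(broker_id) restarts one broker; total_urp() returns the current
    under-replicated partition count summed over the cluster.
    Returns the order in which brokers were restarted.
    """
    # Non-controller brokers first, the controller last.
    order = [b for b in broker_ids if b != controller_id]
    if controller_id in broker_ids:
        order.append(controller_id)
    for broker in order:
        restart(broker)
        waited = 0.0
        while total_urp() != 0:
            if waited >= timeout_s:
                raise TimeoutError(
                    f"URP did not return to zero after restarting broker {broker}")
            sleep(poll_s)
            waited += poll_s
    return order
```

One caveat with this sketch: right after `restart` returns, the URP count may still read zero for a moment, so a real script would probably also confirm that the broker has rejoined the cluster before trusting the first zero reading.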

Gathering `kafka.producer` metrics using JMX

I have a Kafka broker running, which I am monitoring with JMX.
This broker runs in a Docker container as a process started with kafka-server-start.sh. JMX port 9999 is exposed and set through an environment variable.
When I connect to the JMX port and try to list all the domains, I get the following:
kafka
kafka.cluster
kafka.controller
kafka.coordinator.group
kafka.coordinator.transaction
kafka.log
kafka.network
kafka.server
kafka.utils
I don't see kafka.producer, which is understandable because the producers for this Kafka broker are N different applications, but at this point I am confused.
How do I get the kafka.producer metrics as well?
Do I have to expose the kafka.producer metrics in each of the N applications acting as a producer, or is there some configuration that starts gathering kafka.producer metrics on the broker itself?
What is the correct way of doing this? Please help.
Yes, you are correct: to capture the producer JMX metrics, you need to enable JMX in every process that runs a Kafka producer instance.
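For JVM-based producer applications, enabling JMX usually comes down to passing the standard JDK remote-management system properties when the process is launched. The helper below is a hypothetical sketch that builds those flags for a launcher script; it assumes a trusted network, since it turns off JMX authentication and SSL. Once JMX is on, the producer's own metrics appear under MBeans named like `kafka.producer:type=producer-metrics,client-id=<client.id>`.

```python
def jmx_jvm_flags(port, hostname=None):
    """Build JVM flags that expose remote JMX on the given port.

    Authentication and SSL are disabled here, which is only acceptable on a
    trusted network; this is an illustrative assumption, not a recommendation.
    """
    flags = [
        "-Dcom.sun.management.jmxremote",
        f"-Dcom.sun.management.jmxremote.port={port}",
        "-Dcom.sun.management.jmxremote.authenticate=false",
        "-Dcom.sun.management.jmxremote.ssl=false",
    ]
    if hostname:
        # Often needed when the application runs inside a container or behind NAT.
        flags.append(f"-Djava.rmi.server.hostname={hostname}")
    return flags


def producer_metrics_mbean(client_id):
    """Object name under which a Java producer registers its top-level metrics."""
    return f"kafka.producer:type=producer-metrics,client-id={client_id}"
```

Each of the N producer applications would get its own port (or its own host), and the monitoring system would then scrape every one of them in addition to the broker.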
It might be helpful to rephrase producing as writing over an unreliable network in this context.
From this perspective, the most reasonable place to measure writing characteristics seems to be the client itself (i.e. in each "application" as you call it).
If messages between the producer and the broker are lost, you can still send stats to a local "metric store" for example (e.g. you could see a "spike" in record-retry-rate or some other relevant metric).
Additionally, pairing Kafka producer metrics with additional, local metrics might be extremely useful (JVM stats, detailed business metrics and so on). Keep in mind that the client will almost definitely run on a different machine in a production environment and might be affected by different factors than the broker itself.
If you intend to monitor your client application (which will most likely happen anyway), then I'd simply do it there (i.e. the standard way).

Flink with zookeeper: Service temporarily unavailable due to an ongoing leader election. Please refresh

I want to run a Flink cluster in high-availability mode, so I configured the Flink configuration files as described in JobManager High Availability. When I start the ZooKeeper quorum using start-zookeeper-quorum.sh, I am able to start two ZooKeeper servers (peers) on two machines. But when I start the Flink cluster with 2 JobManagers, the Flink web UI shows the message "Service temporarily unavailable due to an ongoing leader election. Please refresh."
What does this message mean? Is there a way to specify the leader in the configuration file?
The problem is with your ZooKeeper installation: your ZK nodes cannot elect a leader. Also, two nodes is not the best choice. You should have at least 3 instances, or a larger odd number.
You should check the ZooKeeper admin docs, for instance here.
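The reason for the "odd number, at least 3" advice is simple majority arithmetic: a ZooKeeper ensemble needs more than half of its servers to be up to elect a leader. A small sketch (the function name is made up for illustration):

```python
def zk_quorum(ensemble_size):
    """Return (servers needed for a majority, failures the ensemble can tolerate)."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    majority = ensemble_size // 2 + 1
    return majority, ensemble_size - majority


for n in (2, 3, 4, 5):
    needed, tolerated = zk_quorum(n)
    print(f"{n} servers: need {needed} up, tolerate {tolerated} down")
```

With 2 servers both have to be up, so the pair tolerates no failures at all; 3 servers tolerate one failure, and a fourth server adds nothing over three.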