How to start Zookeeper and then Kafka? - apache-kafka

I'm getting started with Confluent Platform which requires to run Zookeeper (zookeeper-server-start /etc/kafka/zookeeper.properties) and then Kafka (kafka-server-start /etc/kafka/server.properties). I am writing an Upstart script that should run both Kafka and Zookeeper. The issue is that Kafka should block until Zookeeper is ready (because it depends on it) but I can't find a reliable way to know when Zookeeper is ready. Here are some attempts in pseudo-code after running the Zookeeper server start:
Use a hardcoded block
sleep 5
Does not work reliably on slower computers and/or waits longer than needed.
Check when something (hopefully Zookeeper) is running on port 2181
wait until $(echo stat | nc localhost ${port}) is not none
This did not seem to work as it doesn't wait long enough for Zookeeper to accept a Kafka connection.
Check the logs
wait until specific string in zookeeper log is found
This is sketchy and there isn't even a string that cannot also be found on error too (e.g. "binding to port [...]").
Is there a reliable way to know when Zookeeper is ready to accept a Kafka connection? Otherwise, I will have to resort to a combination of 1 and 2.

I found that using a timer is not reliable. the second option (waiting for the port) worked for me:
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties && \
while ! nc -z localhost 2181; do sleep 0.1; done && \
bin/kafka-server-start.sh -daemon config/server.properties

The Kafka error message from your comment is definitely relevant:
FATAL [Kafka Server 0], Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/0. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering.
This indicates that ZooKeeper is up and running, and Kafka was able to connect to it. As I would have expected, technique #2 was sufficient for verifying that ZooKeeper is ready to accept connections.
Instead, the problem appears to be on the Kafka side. It has registered a ZooKeeper ephemeral node to represent the starting Kafka broker. An ephemeral node is deleted automatically when the client's ZooKeeper session expires (e.g. the process terminates so it stops heartbeating to ZooKeeper). However, this is based on timeouts. If the Kafka broker restarts rapidly, then after restart, it sees that a znode representing that broker already exists. To the new process start, this looks like there is already a broker started and registered at that path. Since brokers are expected to have unique IDs, it aborts.
Waiting for a period of time past the ZooKeeper session expiration is an appropriate response to this problem. If necessary, you could potentially tune the session expiration to happen faster as discussed in the ZooKeeper Administrator's Guide. (See discussion of tickTime, minSessionTimeout and maxSessionTimeout.) However, tuning session expiration to something too rapid could cause clients to experience spurious session expirations during normal operations.
I have less knowledge on Kafka, but perhaps there is also something that can be done on the Kafka side. I know that some management tools like Apache Ambari take steps to guarantee assignment of a unique ID to each broker on provisioning.

Related

Kafka - Error on specific consumer -Broker not available

We have deployed multiple Kafka consumers in container's clusters. All are working properly except for one, which is throwing warning "Connection to node 0 could not be established. Broker may not be available", however, this error appears only in one of the containers, and this consumer is running in the same network and server of the others. So I have ruled out issues with kafka server configuration.
I tried changing the groupid of the consumer and I got it working for some minutes, but now warn is appearing again. I consume all topics used by this consumer from a bash shell and I can consume.
Having into account the above context, I think it could be due to bad practice in the consumer software code, also, it could be about offsets got damaged. How could I identify if are there some of this kind using kafka logs?
You can exec into the container and netcat the broker's advertised addresses to verify connectivity.
You can also use the Kafka shell scripts to verify consuming functionality, as always.
Corrupted offsets would prevent any consumer from reading, not only one. Bad code practices wouldn't show up in logs
If you have the container running "on same server as others", I'd suggest working with affinity rules and constraints to spread your applications onto multiple servers before placing on the same machine

Trouble with Apache Kafka to Allow External Connections

I'm just having a difficult time with Kafka right now, but I feel like I'm close.
I have two VMs on FreeNAS running locally. Both Running Ubuntu 18.04 LTS.
VM Graylog: 192.168.1.25. Running Graylog Server. Working well retrieving rsyslogs and apache from itself.
VM Kafka: 192.168.1.16. Running Kafka.
My goal is to have VM Graylog pull logs from VM Kafka, via a Graylog Kafka UDP input. The secondary goal is to replicate this, except tha the Kafka instance will sit on my VPS server feeding apache logs from a website. Of course, I want to test this in a dev environment first.
I am able to have my VM Kafka server successfully listen through this line of code:
/opt/kafka_2.13-2.6.0/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic rsyslog_kafka --from-beginning
This is my 60-kafka.conf file:
module(load="omkafka")
template(name="json"
type="list"
option.json="on") {
constant(value="{")
constant(value="\"#timestamp\":\"") property(name="timereported" dateFormat="rfc33$
constant(value="\",\"#version\":\"1")
constant(value="\",\"message\":\"") property(name="msg")
constant(value="\",\"host\":\"") property(name="hostname")
constant(value="\",\"severity\":\"") property(name="syslogseverity-text")
constant(value="\",\"facility\":\"") property(name="syslogfacility-text")
constant(value="\",\"programname\":\"") property(name="programname")
constant(value="\",\"procid\":\"") property(name="procid")
constant(value="\"}\n")
}
action(
broker=["192.168.1.16:9092"]
type="omkafka"
topic="rsyslog_kafka"
template="json"
)
I'm using the default server.properties file which doesn't contain any listeners, just the defaults. I do understand I need to set the listeners and advertised.listeners.
I've attempted the following settings to no avail:
Attempt 1:
listeners = PLAINTEXT://localhost:9092
advertised.listeners=PLAINTEXT://192.168.1.16:9092
Attempt 2:
listeners = PLAINTEXT://127.0.0.1:9092
advertised.listeners=PLAINTEXT://192.168.1.16:9092
This after reloading both Kafka and Rsyslog and confirming their statuses are active.
Example errors when attempting to read messages.
Bunch of these
[2020-08-20 00:52:42,248] WARN [Consumer clientId=consumer-console-consumer-70205-1, groupId=console-consumer-70205] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
Followed by an infinite amount of these:
[2020-08-20 00:48:50,598] WARN [Consumer clientId=consumer-console-consumer-11975-1, groupId=console-consumer-11975] Error while fetching metadata with correlation id 254 : {rsyslog_kafka=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
I feel like I'm close. Perhaps there is something I'm just understanding. I've read lots of similar articles where they say just replace the IP addresses with your server. I feel like I've done that, with no success.
You need to set listeners to PLAINTEXT://0.0.0.0:9092 in order to bind externally.
The advertised listener ought to be set to an address that your consumers will be able to use to discover the cluster
Note: Docker Compose might be easier than VMs

getting "org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)"

I have installed zookeeper and kafka,
first step :
running zookeeper by the following commands :
bin/zkServer.sh start
bin/zkCli.sh
second step :
running kafka server
bin/kafka-server-start.sh config/server.properties
kafka should run at localhost:9092
but I am getting the following error :
WARN Unexpected error from /0:0:0:0:0:0:0:1; closing connection (org.apache.kafka.common.network.Selector)
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
I am following the following link :
Link1
Link2
I am new to kafka ,please help me to set it up.
1195725856 is GET[space] encoded as a big-endian, four-byte integer (see here for more information on how that works). This indicates that HTTP traffic is being sent to Kafka port 9092, but Kafka doesn't accept HTTP traffic, it only accepts its own protocol (which takes the first four bytes as the receive size, hence the error).
Since the error is received on startup, it is likely benign and may indicate a scanning service or similar on your network scanning ports with protocols that Kafka doesn't understand.
In order to find the cause, you can find where the HTTP traffic is coming from using tcpdump:
tcpdump -i any -w trap.pcap dst port 9092
# ...wait for logs to appear again, then ^C...
tcpdump -qX -r trap.pcap | less +/HEAD
Overall though, this is probably annoying but harmless. At least Kafka isn't actually allocating/dirtying the memory. :-)
Try to reset socket.request.max.bytes value in $KAFKA_HOME/config/server.properties file to more than your packet size and restart kafka server.
My initial guess would be that you might be trying to receive a request that is too large. The maximum size is the default size for socket.request.max.bytes, which is 100MB. So if you have a message which is bigger than 100MB try to increase the value of this variable under server.properties and make sure to restart the cluster before trying again.
If the above doesn't work, then most probably you are trying to connect to a non-SSL-listener.
If you are using the default broker of the port, you need to verify that :9092 is the SSL listener port on that broker.
For example,
listeners=SSL://:9092
advertised.listeners=SSL://:9092
inter.broker.listener.name=SSL
should do the trick for you (Make sure you restart Kafka after re-configuring these properties).
This is how I resolved this issue after installing a Kafka, ELK and Kafdrop set up:
First stop every application one by one that interfaces with Kakfa
to track down the offending service.
Resolve the issue with that application.
In my set up it was Metricbeats.
It was resolved by editing the Metricbeats kafka.yml settings file located in modules.d sub folder:
Ensuring the Kafka advertised.listener in server.properties was
referenced in the hosts property.
Uncomment the metricsets and client_id properties.
The resulting kafka.yml looks like:
# Module: kafka
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/7.6/metricbeat-module-kafka.html
# Kafka metrics collected using the Kafka protocol
- module: kafka
metricsets:
- partition
- consumergroup
period: 10s
hosts: ["[your advertised.listener]:9092"]
client_id: metricbeat
The answer is most likely in one of the 2 areas
a. socket.request.max.bytes
b. you are using a non SSL end point to connect the producer and the consumer too.
Note: the port you run it really does not matter. Make sure if you have an ELB the ELB is returning all the healthchecks to be successful.
In my case i had an AWS ELB fronting KAFKA. I had specified the Listernet Protocol as TCP instead of Secure TCP. This caused the issue.
#listeners=PLAINTEXT://:9092
inter.broker.listener.name=INTERNAL
listeners=INTERNAL://:9093,EXTERNAL://:9092
advertised.listeners=EXTERNAL://<AWS-ELB>:9092,INTERNAL://<EC2-PRIVATE-DNS>:9093
listener.security.protocol.map=INTERNAL:SASL_PLAINTEXT,EXTERNAL:SASL_PLAINTEXT
sasl.enabled.mechanisms=PLAIN
sasl.mechanism.inter.broker.protocol=PLAIN
Here is a snippet of my producer.properties and consumer.properties for testing externally
bootstrap.servers=<AWS-ELB>:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
In my case, some other application was already sending data to port 9092, hence the starting of server failed. Closing the application resolved this issue.
Please make sure that you use .security.protocol=plaintext or you have mismatch server security compared to the clients trying to connect.

Maximum value for zookeeper.connection.timeout.ms

Right now we are running kafka in AWS EC2 servers and zookeeper is also running on separate EC2 instances.
We have created a service (system units ) for kafka and zookeeper to make sure that they are started in case the server gets rebooted.
The problem is sometimes zookeeper severs are little late in starting and kafka brokers by that time getting terminated.
So to deal with this issue we are planning to increase the zookeeper.connection.timeout.ms to some high number like 10 mins, at the broker side. Is this a good approach ?
Are there any size effect of increasing the zookeeper.connection.timeout.ms timeout in zookeeper ?
Increasing zookeeper.connection.timeout.ms may or may not handle your problem in hand but there is a possibility that it will take longer time to detect a broker soft failure.
Couple of things you can do:
1) You must alter the System to launch the kafka to delay by 10 mins (the time you wanted to put in zookeper timeout).
2) We are using HDP cluster which automatically takes care of such scenarios.
Here is an explanation from Kafka FAQs:
During a broker soft failure, e.g., a long GC, its session on ZooKeeper may timeout and hence be treated as failed. Upon detecting this situation, Kafka will migrate all the partition leaderships it currently hosts to other replicas. And once the broker resumes from the soft failure, it can only act as the follower replica of the partitions it originally leads.
To move the leadership back to the brokers, one can use the preferred-leader-election tool here. Also, in 0.8.2 a new feature will be added which periodically trigger this functionality (details here).
To reduce Zookeeper session expiration, either tune the GC or increase zookeeper.session.timeout.ms in the broker config.
https://cwiki.apache.org/confluence/display/KAFKA/FAQ
Hope this helps

What happens if Zookeeper fails completely?

we have setup a Kafka/Zookeeper Cluster consisting of 3 Brokers. We have one producer, sending messages to one specific Kafka topic and a few consumer groups reading from said topic. Those consumers perform a leader election via Zookeeper for themselves (independent from Kafka).
The versions used are:
Kafka: 0.9.0.1
Zookeeper: 3.4.6 (included in the Kafka-Package)
All processes are managed by Supervisor. So far, everything works just fine. What we tried now (for testing purposes) was to simply kill off all Zookeeper processes and see what happens.
As we expected, our consumer processes couldn't connect to Zookeeper anymore. But unexpectedly, the Kafka Brokers still worked. Our producer didn't complain at all and was still able to write into the topic. While I couldn't use kafka/bin/kafka-topics.sh or similar, since they all require a zookeeper-parameter, I could still see the actual size of the topic-log grow. After restarting the zookeeper processes, everything again worked just like before.
What we couldn't figure out is now... what actually happened there?
We thought, Kafka would require a working Zookeeper-Connection and we couldn't find any explanation for this behaviour online.
When you have one node of zookeeper, broker will not be able to contact zookeeper, after broker discovers zookeeper is not reachable, broker also will become unreachable. Hence the producer and consumer.
In case of producer it starts dropping(reject the record). In case of consumer it can happen that, the read record which is not ack'ed may end up processing again when broker is up and ready...
in case of 3node zk one node failure is acceptable as quorum is still satisfied... but cant afford the 2node failures which will lead to the above consequences...