ZooKeeper not getting started - apache-zookeeper

I am facing an issue while starting ZooKeeper.
My zoo.cfg file is:
# The number of milliseconds of each tick
tickTime=2000
dataDir=/Users/admin/Documents/delete/zookeeper/zookeeper-3.4.6/zookeeperdata/1
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
I don't see any error while starting ZooKeeper:
nohup ./bin/zkServer.sh start zoo.cfg
JMX enabled by default
Using config: /Users/admin/Documents/delete/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
I can also see a new process ID:
cat /Users/admin/Documents/delete/zookeeper/zookeeper-3.4.6/zookeeperdata/1/zookeeper_server.pid
14120
But when I check the status of the process, I get the following error:
bin/zkServer.sh status
JMX enabled by default
Using config: /Users/admin/Documents/delete/zookeeper/zookeeper-3.4.6/bin/../conf/zoo.cfg
Error contacting service. It is probably not running
Could you please help?

server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2890:3890
This means that you're setting up a ZooKeeper ensemble, and one of the rules for a ZooKeeper ensemble is that the servers need to form a majority before they are allowed to answer requests. In other words, a ZooKeeper server is not considered running until it has formed a majority with its peers.
To get an informative answer from status you need at least 2 out of 3 servers running for a 3-server ensemble. Either remove these lines from your config or start another server. (And make sure the servers have different 'dataDir' and 'myid' values.)
(Something a lot of people misunderstand: the required majority is not a majority of the started servers but a majority of the servers listed in the configuration.)
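If you do want to run a local 3-server ensemble, each server needs its own config file (with a distinct dataDir and clientPort, since they share one host) and a myid file in its dataDir matching its server.N entry. A rough sketch; the dataDir paths for servers 2 and 3 and the config names zoo1.cfg/zoo2.cfg/zoo3.cfg are only assumptions:
# server 1 uses the dataDir from the question; 2 and 3 are assumed to be sibling directories
echo 1 > /Users/admin/Documents/delete/zookeeper/zookeeper-3.4.6/zookeeperdata/1/myid
echo 2 > /Users/admin/Documents/delete/zookeeper/zookeeper-3.4.6/zookeeperdata/2/myid
echo 3 > /Users/admin/Documents/delete/zookeeper/zookeeper-3.4.6/zookeeperdata/3/myid
# then start each server with its own config (each pointing at its own dataDir and clientPort)
./bin/zkServer.sh start zoo1.cfg
./bin/zkServer.sh start zoo2.cfg
./bin/zkServer.sh start zoo3.cfg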

Related

Change storm zookeeper server in storm.yaml

I want to point my Storm cluster to 3 ZooKeeper nodes. From my understanding, I should just change the storm.yaml configuration on the Nimbus and supervisor nodes without stopping any of the nodes. Is that correct?
How should I validate my change?
I think you need to restart Nimbus and the supervisors once you've updated storm.yaml in order for them to pick up the change.
I'm assuming you currently have 1 Zookeeper node? In that case, you can validate your change after updating to 3 by stopping the Zookeeper node and checking that your Nimbus and supervisors keep running.
An alternative is to enable debug logging for org.apache.curator; it will most likely log which nodes it is connecting to.
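For reference, the relevant storm.yaml section would look roughly like this (the hostnames here are placeholders for your three ZooKeeper nodes):
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
  - "zk3.example.com"
storm.zookeeper.port: 2181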

Kafka - zookeeper doesn't run with others

I have a problem with Apache Kafka.
I have 4 clusters where I want to install Kafka instances. On 3 of the clusters it works: they can produce and consume messages between each other, and the ZooKeepers work fine. But on the 4th cluster I can't run ZooKeeper connected to the other ZooKeepers. If I set only the local server (0.0.0.0:2888:3888) in zoo.cfg, ZooKeeper runs in standalone mode, but as soon as I add the other servers I get an error:
./zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
Error contacting service. It is probably not running.
How can I fix this error? I will add that I can ping the servers, so they can see each other.

getting "org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)"

I have installed ZooKeeper and Kafka.
First step: run ZooKeeper with the following commands:
bin/zkServer.sh start
bin/zkCli.sh
Second step: run the Kafka server:
bin/kafka-server-start.sh config/server.properties
Kafka should run at localhost:9092, but I am getting the following error:
WARN Unexpected error from /0:0:0:0:0:0:0:1; closing connection (org.apache.kafka.common.network.Selector)
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
I am following the following link :
Link1
Link2
I am new to Kafka; please help me set it up.
1195725856 is GET[space] encoded as a big-endian, four-byte integer (see here for more information on how that works). This indicates that HTTP traffic is being sent to Kafka port 9092, but Kafka doesn't accept HTTP traffic, it only accepts its own protocol (which takes the first four bytes as the receive size, hence the error).
Since the error is received on startup, it is likely benign and probably indicates that a scanning service or similar on your network is probing the port with a protocol that Kafka doesn't understand.
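If you want to verify the arithmetic yourself, a quick check (assuming a Python interpreter is available):
python3 -c 'print(int.from_bytes(b"GET ", "big"))'
# prints 1195725856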
In order to find the cause, you can find where the HTTP traffic is coming from using tcpdump:
tcpdump -i any -w trap.pcap dst port 9092
# ...wait for logs to appear again, then ^C...
tcpdump -qX -r trap.pcap | less +/HEAD
Overall though, this is probably annoying but harmless. At least Kafka isn't actually allocating/dirtying the memory. :-)
Try setting the socket.request.max.bytes value in the $KAFKA_HOME/config/server.properties file to more than your packet size and restart the Kafka server.
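For example (the value is in bytes; 104857600 is the 100 MB default, and 200 MB here is purely illustrative):
# in $KAFKA_HOME/config/server.properties
socket.request.max.bytes=209715200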
My initial guess would be that you might be trying to receive a request that is too large. The limit is set by socket.request.max.bytes, which defaults to 100 MB. So if you have a message bigger than 100 MB, try to increase the value of this property in server.properties, and make sure to restart the cluster before trying again.
If the above doesn't work, then most probably you are trying to connect to a non-SSL-listener.
If you are using the default port on the broker, you need to verify that :9092 is the SSL listener port on that broker.
For example,
listeners=SSL://:9092
advertised.listeners=SSL://:9092
inter.broker.listener.name=SSL
should do the trick for you (Make sure you restart Kafka after re-configuring these properties).
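On the client side, the producer/consumer configuration then has to match that listener's protocol. A minimal sketch, with the truststore path and password as placeholders:
security.protocol=SSL
ssl.truststore.location=/path/to/client.truststore.jks
ssl.truststore.password=changeit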
This is how I resolved this issue after installing a Kafka, ELK and Kafdrop setup:
First, stop every application that interfaces with Kafka, one by one, to track down the offending service.
Resolve the issue with that application.
In my setup it was Metricbeat.
It was resolved by editing Metricbeat's kafka.yml settings file, located in the modules.d subfolder: ensure the Kafka advertised.listeners value from server.properties is referenced in the hosts property, and uncomment the metricsets and client_id properties.
The resulting kafka.yml looks like:
# Module: kafka
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/7.6/metricbeat-module-kafka.html
# Kafka metrics collected using the Kafka protocol
- module: kafka
  metricsets:
    - partition
    - consumergroup
  period: 10s
  hosts: ["[your advertised.listener]:9092"]
  client_id: metricbeat
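Assuming a standard package install, you can then sanity-check the configuration with Metricbeat's own CLI:
metricbeat modules list   # confirm the kafka module shows up as enabled
metricbeat test config    # verify the overall configuration parses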
The answer is most likely in one of these 2 areas:
a. socket.request.max.bytes
b. you are using a non-SSL endpoint to connect the producer and the consumer to.
Note: the port you run it on really does not matter. Make sure that if you have an ELB, the ELB reports all the health checks as successful.
In my case I had an AWS ELB fronting Kafka. I had specified the listener protocol as TCP instead of Secure TCP. This caused the issue.
#listeners=PLAINTEXT://:9092
inter.broker.listener.name=INTERNAL
listeners=INTERNAL://:9093,EXTERNAL://:9092
advertised.listeners=EXTERNAL://<AWS-ELB>:9092,INTERNAL://<EC2-PRIVATE-DNS>:9093
listener.security.protocol.map=INTERNAL:SASL_PLAINTEXT,EXTERNAL:SASL_PLAINTEXT
sasl.enabled.mechanisms=PLAIN
sasl.mechanism.inter.broker.protocol=PLAIN
Here is a snippet of my producer.properties and consumer.properties for testing externally:
bootstrap.servers=<AWS-ELB>:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
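With those client properties saved to a file, a quick external smoke test could look like this (the topic name is just an example; newer Kafka versions use --bootstrap-server instead of --broker-list):
kafka-console-producer.sh --broker-list <AWS-ELB>:9092 --topic smoke-test --producer.config producer.properties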
In my case, some other application was already sending data to port 9092, so the server failed to start. Closing that application resolved the issue.
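To track down which process that was, you can check what is using the port, for example:
lsof -i :9092         # shows processes with sockets on port 9092
ss -tpn | grep 9092   # Linux alternative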
Please make sure that you use security.protocol=PLAINTEXT, or otherwise check that the server's security configuration matches that of the clients trying to connect.

ZooKeeper cluster of 2 nodes - strange behavior when one node is down programmatically

When I have two nodes operational, everything works as expected:
[dmitry#zk2-prod]/etc/supervisor.d% sudo /opt/zookeeper/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Mode: leader
However, as soon as I stop one of the nodes, zk1-prod (via supervisord's supervisorctl):
[dmitry#zk2-prod]/etc/supervisor.d% sudo /opt/zookeeper/bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running
however:
[dmitry#zk2-prod]/etc/supervisor.d% sudo supervisorctl status
zookeeper RUNNING pid 4838, uptime 0:04:01
As soon as I bring the slave back, I immediately get the first output again (Mode: leader).
[dmitry#zk2-prod]/etc/supervisor.d% ps aufx G zoo
89:zookeep+ 4838 0.2 1.4 2970424 56816 ? Sl 19:32 0:00 \_ java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/bin/../build/classes:/opt/zookeeper/bin/../build/lib/*.jar:/opt/zookeeper/bin/../lib/slf4j-log4j12-1.6.1.jar:/opt/zookeeper/bin/../lib/slf4j-api-1.6.1.jar:/opt/zookeeper/bin/../lib/netty-3.10.5.Final.jar:/opt/zookeeper/bin/../lib/log4j-1.2.16.jar:/opt/zookeeper/bin/../lib/jline-0.9.94.jar:/opt/zookeeper/bin/../zookeeper-3.4.10.jar:/opt/zookeeper/bin/../src/java/lib/*.jar:/opt/zookeeper/bin/../conf: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /opt/zookeeper/bin/../conf/zoo.cfg
Do I need 3 instances at least so org.apache.zookeeper.server.quorum.QuorumPeerMain can select a leader?
I thought one instance would be able to elect itself as leader and continue serving requests.
Am I missing something?
Do I need 3 instances at least so
org.apache.zookeeper.server.quorum.QuorumPeerMain can select a leader?
Yes, to tolerate the event of losing one server.
In a ZooKeeper quorum, as long as a majority of the servers are available, the ZooKeeper service will be available. A server cannot elect itself as a leader.
In this case, where 2 servers form the ensemble, 2 is the majority. When one is lost, the majority-making member is lost along with it. Losing the majority is considered a failure of the quorum.
The 3-server scenario is much easier to explain: if one server is lost, 2 still remain to maintain the majority, but losing 2 means losing the majority-making member of this 3-member quorum, which leads to unavailability of the ZooKeeper service.
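The required majority for an N-server ensemble is floor(N/2) + 1, which you can tabulate quickly (assuming Python is available):
python3 -c 'print({n: n // 2 + 1 for n in (1, 2, 3, 4, 5)})'
# {1: 1, 2: 2, 3: 2, 4: 3, 5: 3} -> a 2-node ensemble needs both nodes, so it tolerates no failures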

How to start Zookeeper and then Kafka?

I'm getting started with Confluent Platform, which requires running Zookeeper (zookeeper-server-start /etc/kafka/zookeeper.properties) and then Kafka (kafka-server-start /etc/kafka/server.properties). I am writing an Upstart script that should run both Kafka and Zookeeper. The issue is that Kafka should block until Zookeeper is ready (because it depends on it), but I can't find a reliable way to know when Zookeeper is ready. Here are some attempts, in pseudo-code, after starting the Zookeeper server:
Use a hard-coded delay
sleep 5
Does not work reliably on slower computers and/or waits longer than needed.
Check when something (hopefully Zookeeper) is running on port 2181
wait until $(echo stat | nc localhost ${port}) is not none
This did not seem to work as it doesn't wait long enough for Zookeeper to accept a Kafka connection.
Check the logs
wait until specific string in zookeeper log is found
This is sketchy, and there isn't even a string that can't also appear when there is an error (e.g. "binding to port [...]").
Is there a reliable way to know when Zookeeper is ready to accept a Kafka connection? Otherwise, I will have to resort to a combination of 1 and 2.
I found that using a timer is not reliable. The second option (waiting for the port) worked for me:
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties && \
while ! nc -z localhost 2181; do sleep 0.1; done && \
bin/kafka-server-start.sh -daemon config/server.properties
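A variant that waits until ZooKeeper actually answers requests, rather than just accepting the TCP connection, is to poll the "ruok" four-letter-word command until it returns "imok" (note that on ZooKeeper 3.5+ the command has to be whitelisted via 4lw.commands.whitelist):
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties && \
while [ "$(echo ruok | nc localhost 2181)" != "imok" ]; do sleep 0.1; done && \
bin/kafka-server-start.sh -daemon config/server.properties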
The Kafka error message from your comment is definitely relevant:
FATAL [Kafka Server 0], Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer) java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/0. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering.
This indicates that ZooKeeper is up and running, and Kafka was able to connect to it. As I would have expected, technique #2 was sufficient for verifying that ZooKeeper is ready to accept connections.
Instead, the problem appears to be on the Kafka side. It has registered a ZooKeeper ephemeral node to represent the starting Kafka broker. An ephemeral node is deleted automatically when the client's ZooKeeper session expires (e.g. the process terminates, so it stops heartbeating to ZooKeeper). However, this is based on timeouts. If the Kafka broker restarts rapidly, then after the restart it sees that a znode representing that broker already exists. To the newly started process, this looks like there is already a broker started and registered at that path. Since brokers are expected to have unique IDs, it aborts.
Waiting for a period of time past the ZooKeeper session expiration is an appropriate response to this problem. If necessary, you could potentially tune the session expiration to happen faster as discussed in the ZooKeeper Administrator's Guide. (See discussion of tickTime, minSessionTimeout and maxSessionTimeout.) However, tuning session expiration to something too rapid could cause clients to experience spurious session expirations during normal operations.
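For illustration only (these values are not recommendations), the relevant zoo.cfg knobs look like this:
tickTime=2000
# minSessionTimeout defaults to 2 * tickTime
minSessionTimeout=4000
# maxSessionTimeout defaults to 20 * tickTime (40000); lowered here only as an example
maxSessionTimeout=10000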
I have less knowledge on Kafka, but perhaps there is also something that can be done on the Kafka side. I know that some management tools like Apache Ambari take steps to guarantee assignment of a unique ID to each broker on provisioning.