How to get more information about the Zookeeper status in Confluent - apache-zookeeper

I am setting up a ZooKeeper cluster, and I start the ZooKeeper bundled with Confluent separately using:
./bin/zookeeper-server-start etc/kafka/zookeeper.properties
and I want to get the status of ZooKeeper.
I searched online, and every answer uses:
./zkServer.sh status
But I can't find zkServer.sh in Confluent.
I know that I can use ./bin/confluent status to get the status. But I want more detailed information about ZooKeeper, like the following:
./zkServer.sh status
JMX enabled by default
Using config: /opt/../conf/zoo.cfg
Mode: follower
How can I do that?

You can use the Four Letter Words instead to get the same information, or more. The output from stat:
$ echo "stat" | nc <ZOOKEEPER-IP-ADDRESS> 2181
Zookeeper version: 3.4.10
Clients:
/192.168.1.2:49618[1](queued=0,recved=1304,sent=1304)
/192.168.1.3:53484[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/0/15
Received: 1330
Sent: 1329
Connections: 2
Outstanding: 0
Zxid: 0x1000001ee
Mode: leader
Node count: 435
The output from conf:
$ echo "conf" | nc <ZOOKEEPER-IP-ADDRESS> 2181
clientPort=2181
dataDir=/var/zookeeper/data
dataLogDir=/var/log/zookeeper
tickTime=2000
maxClientCnxns=0
minSessionTimeout=4000
maxSessionTimeout=40000
serverId=3
initLimit=20
syncLimit=5
electionAlg=3
electionPort=3888
quorumPort=2888
peerType=0
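To get something close to the `zkServer.sh status` summary, the `Mode` line can be filtered out of the four-letter-word reply. The following is a minimal sketch, assuming `nc` and `awk` are installed; the function names `zk_field` and `zk_status` are made up for this example. It uses `srvr` rather than `stat` because on ZooKeeper 3.5.3 and later only `srvr` is enabled by default (the others have to be listed in `4lw.commands.whitelist`).

```shell
# zk_field NAME: read four-letter-word output on stdin and print the value
# of the first "NAME: value" line (for example Mode or Zxid).
zk_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

# zk_status HOST [PORT]: hypothetical wrapper that prints the server mode,
# similar to the last line of `zkServer.sh status`.
zk_status() {
    mode=$(echo srvr | nc "$1" "${2:-2181}" | zk_field Mode)
    echo "Mode: ${mode:-unknown}"
}
```

Usage would look like `zk_status <ZOOKEEPER-IP-ADDRESS>` or `echo stat | nc <ZOOKEEPER-IP-ADDRESS> 2181 | zk_field Zxid`.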

Related

ZooKeeper server cluster: Mode: follower and Mode: leader change every couple of minutes

We have a ZooKeeper cluster with 3 nodes.
When we run the following commands:
echo stat | nc zookeeper_server01 2181 | grep Mode
echo stat | nc zookeeper_server02 2181 | grep Mode
echo stat | nc zookeeper_server03 2181 | grep Mode
we see that zookeeper_server03 is the leader and the others are in Mode: follower.
But we noticed that the state changes every couple of minutes: after 4 minutes zookeeper_server01 became the leader and the others were in Mode: follower,
then after 6 minutes zookeeper_server02 became the leader, and so on.
My question is: is this strange behavior normal?
Our production Kafka cluster uses these ZooKeeper servers, so we are worried about this.
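To confirm how often the leader really changes, the `Mode` check can be sampled with timestamps and reduced to the moments where the leader differs from the previous sample. This is only a sketch, assuming `nc`, `awk` and `grep` are available; the function names `sample_leader` and `leader_changes` are invented here, and it records changes without explaining their cause.

```shell
# sample_leader HOST...: print "HH:MM:SS host" for every host that
# currently reports "Mode: leader".
sample_leader() {
    for h in "$@"; do
        if echo srvr | nc "$h" 2181 | grep -q '^Mode: leader'; then
            echo "$(date +%H:%M:%S) $h"
        fi
    done
}

# leader_changes: read "TIMESTAMP HOST" lines on stdin and print only the
# lines where HOST differs from the previous line.
leader_changes() {
    awk '$2 != prev { print; prev = $2 }'
}

# Example (runs until interrupted):
#   while sleep 30; do
#       sample_leader zookeeper_server01 zookeeper_server02 zookeeper_server03
#   done | leader_changes
```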

Trying to run Apache/NIFI on Zookeeper from Confluent

I am trying to run Apache NiFi on the ZooKeeper shipped with Confluent. NiFi ver 1.11.3 is installed in /opt/nifi by unpacking the tar archive; Confluent is the community edition, ver 5.3, installed from the Confluent repo https://packages.confluent.io/rpm/5.3.
NiFi works with its integrated ZooKeeper, and it also works if I download ZooKeeper separately from the Apache ZooKeeper site. Confluent Kafka also works with both the separate ZooKeeper and the NiFi-integrated one. But I cannot make NiFi work with the ZooKeeper from Confluent.
In logs I see only one warning which is:
WARN Received packet at server of unknown type 15 (org.apache.zookeeper.server.ZooKeeperServer)
The config file is the same for all three ZooKeeper servers:
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=myhost1:2888:3888
server.2=myhost2:2888:3888
server.3=myhost3:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
I do not think that Confluent really changed anything in their ZooKeeper. What could be the reason for this error?
As @BryanBende said:
NiFi 1.11.x (In our particular case, 1.11.4) requires ZK 3.5, please
confirm the version of ZK used in Confluent platform, IF ITS 3.4 THEN
ITS NOT GOING TO WORK – Bryan Bende Mar 27 at 14:35
The typical error you will see in the ZooKeeper log:
Oct 08 17:22:23 some-pro-zk03 zookeeper-server-start[14136]: [2020-10-08 17:22:23,275] INFO Accepted socket connection from /10.10.10.1:53794 (org.apache.zookeeper.server.NIOServerCnxnFactory)
Oct 08 17:22:23 some-pro-zk03 zookeeper-server-start[14136]: [2020-10-08 17:22:23,275] INFO Refusing session request for client /10.10.10.1:53794 as it has seen zxid 0x400000000 our last zxid is 0x300000004 client must try another server (org.apache.zookeeper.server.ZooKeeperServer)
Oct 08 17:22:23 some-pro-zk03 zookeeper-server-start[14136]: [2020-10-08 17:22:23,275] INFO Closed socket connection for client /10.159.164.93:53794 (no session established for client) (org.apache.zookeeper.server.NIOServerCnxn)
The typical error you will see in the client log:
2020-10-07 16:00:09,112 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
It took me a full day of work (reading logs!) to find this error.
So, check your ZooKeeper version.
For ZooKeeper 3.5 and newer:
echo srvr | nc localhost 2181
For ZooKeeper older than 3.5:
echo stat | nc localhost 2181
You can also use telnet.
For ZooKeeper 3.5 and newer:
telnet localhost 2181
srvr
For ZooKeeper older than 3.5:
telnet localhost 2181
stat
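Since the point of the check is whether the server is 3.5 or newer, the version can be pulled out of the first line of the reply and compared. A minimal sketch, assuming the reply starts with a `Zookeeper version: X.Y.Z...` line as in the outputs quoted on this page; the helper names `zk_version` and `zk_at_least_3_5` are hypothetical.

```shell
# zk_version: read `srvr` (or `stat`) output on stdin and print the dotted
# version number, e.g. "3.4.10".
zk_version() {
    sed -n 's/^Zookeeper version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# zk_at_least_3_5 VERSION: succeed (exit status 0) if VERSION is 3.5 or newer.
zk_at_least_3_5() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 5 ]; }
}

# Example:
#   v=$(echo srvr | nc localhost 2181 | zk_version)
#   zk_at_least_3_5 "$v" || echo "ZooKeeper $v is older than 3.5"
```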

Why is Zookeeper not re-electing new leader in Apache Nifi Cluster?

My architecture is as follows.
2 Servers:
Server 1: running Apache Nifi + Zookeeper (Not embedded)
Server 2: running Apache Nifi + Zookeeper (Not embedded)
To test failover, I shut down the server that has been elected as Cluster Coordinator.
In this case, ZooKeeper should automatically elect the one remaining server as leader. But it keeps failing and goes into a continuous loop of trying to connect to the first server.
Zookeeper Logs in Server 2 when leader (Server 1) went down:
2019-10-22 18:44:01,135 [myid:2] - WARN [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:02,925 [myid:2] - WARN [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:03,320 [myid:2] - WARN [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@677] - Cannot open channel to 1 at election address ec2-server-1.compute-1.amazonaws.com/172.xx.x.x:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
Server 2 Config files:
zoo.cfg
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=ec2-server-1.compute-1.amazonaws.com:2888:3888
server.2=0.0.0.0:2888:3888
nifi.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-2.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi
Server 1 Config files:
zoo.cfg
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=0.0.0.0:2888:3888
server.2=ec2-server-2.compute-1.amazonaws.com:2888:3888
nifi.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-1.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi
What am I doing wrong?
You need at least three nodes to be able to handle the failure of one node.
Check the Admin guide:
Clustered (Multi-Server) Setup For reliable ZooKeeper service, you
should deploy ZooKeeper in a cluster known as an ensemble. As long as
a majority of the ensemble are up, the service will be available.
Because Zookeeper requires a majority, it is best to use an odd number
of machines. For example, with four machines ZooKeeper can only handle
the failure of a single machine; if two machines fail, the remaining
two machines do not constitute a majority. However, with five machines
ZooKeeper can handle the failure of two machines.
A simpler explanation here also
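The majority rule in the quoted passage can be worked through as arithmetic: an ensemble of N servers needs floor(N/2) + 1 servers up, so it tolerates N minus that many failures. A small sketch (the function names are made up for illustration):

```shell
# zk_majority N: smallest number of servers that forms a majority of N.
zk_majority() {
    echo $(( $1 / 2 + 1 ))
}

# zk_tolerated_failures N: how many servers may fail while a majority remains.
zk_tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 4 5; do
    echo "$n servers: majority $(zk_majority "$n"), tolerates $(zk_tolerated_failures "$n") failure(s)"
done
```

With 2 servers the majority is 2, so no failure is tolerated, which matches the behavior described in the question; 3 and 4 servers both tolerate one failure, and 5 servers tolerate two.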

How to validate zookeeper quorum

How do I verify that all the nodes in a ZooKeeper ensemble are part of the quorum and are healthy? The manual mentions "ruok", but that still doesn't say whether a ZooKeeper node is part of the quorum and in sync with the rest.
You can use the srvr command documented in The Four Letter Words to get more detailed status information about each ZooKeeper server in the ensemble. See below for sample output from a 3-node cluster, with hosts named ubuntu1, ubuntu2 and ubuntu3.
The Mode field will tell you if that particular server is the leader or a follower. The Zxid field refers to the ZooKeeper cluster's internal transaction ID used for tracking state changes to the tree of znodes. In a healthy cluster, you'll see one leader, multiple followers, and all nodes will generally be close to one another in the zxid value.
> for x in ubuntu1 ubuntu2 ubuntu3; do echo $x; echo srvr|nc $x 2181; echo; done
ubuntu1
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 3/9/21
Received: 9
Sent: 8
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: follower
Node count: 6
ubuntu2
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: leader
Node count: 6
ubuntu3
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: follower
Node count: 6
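The same check can be summarized automatically by counting the `Mode` lines and the distinct `Zxid` values in the combined output. This is a rough sketch, assuming the `srvr` output format shown above; `zk_quorum_check` is a made-up name. On a busy cluster the zxids may differ slightly between servers at any instant, so more than one distinct value is not by itself a sign of trouble.

```shell
# zk_quorum_check: read the concatenated `srvr` output of all servers on
# stdin, print leader/follower counts and the number of distinct Zxid
# values, and exit non-zero unless exactly one leader was seen.
zk_quorum_check() {
    awk -F': ' '
        $1 == "Mode" && $2 == "leader"   { leaders++ }
        $1 == "Mode" && $2 == "follower" { followers++ }
        $1 == "Zxid" && !seen[$2]++      { zxids++ }
        END {
            printf "leaders=%d followers=%d distinct_zxids=%d\n", leaders, followers, zxids
            exit (leaders == 1 ? 0 : 1)
        }'
}

# Example:
#   for x in ubuntu1 ubuntu2 ubuntu3; do echo srvr | nc "$x" 2181; done | zk_quorum_check
```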

How to get zxid of a zookeeper server?

ZooKeeper assigns each transaction a unique number called the zxid. It has two parts: an epoch and a counter. I could find the epoch value in ZooKeeper's data directory, but I can't find the counter. Does anyone know where I can find it?
In general, how do I get the zxid of a ZooKeeper server?
Turns out it's pretty easy:
echo srvr | nc localhost 2181
Looking at the current status of the ZooKeeper server also shows the zxid, as answered in another post. First I ran telnet:
telnet localhost 2181
Then I sent the following data to the server:
stats
and received the following information:
Zookeeper version: 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 00:39 GMT
Clients:
/127.0.0.1:54864[1](queued=0,recved=6030,sent=6033)
/192.168.80.1:55675[0](queued=0,recved=1,sent=0)
/192.168.80.1:54769[1](queued=0,recved=432,sent=432)
Latency min/avg/max: 0/0/35
Received: 7104
Sent: 7114
Connections: 3
Outstanding: 0
Zxid: 0xd0
Mode: standalone
Node count: 148
Connection to host lost.
As you can see, the zxid on my ZooKeeper server is currently 0xd0.
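To get the two parts the question asks about, the reported value can be split with shell arithmetic: the zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are the counter. A minimal sketch, with invented helper names:

```shell
# zxid_epoch ZXID: high 32 bits of the zxid (the epoch).
zxid_epoch() {
    echo $(( $1 >> 32 ))
}

# zxid_counter ZXID: low 32 bits of the zxid (the counter within the epoch).
zxid_counter() {
    echo $(( $1 & 0xFFFFFFFF ))
}

# Example:
#   zxid=$(echo srvr | nc localhost 2181 | awk -F': ' '$1 == "Zxid" { print $2 }')
#   echo "epoch=$(zxid_epoch "$zxid") counter=$(zxid_counter "$zxid")"
```

For the value 0xd0 above this gives epoch 0 and counter 208; a clustered value such as 0x100000004 gives epoch 1 and counter 4.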