Zookeeper unable to talk to new Kafka broker

In an attempt to reduce the storage on my AWS instance, I decided to launch a new, smaller instance and set up Kafka again from scratch using the Ansible playbook we had from before. I then terminated the old, larger instance and assigned its IP address, which the other brokers were already configured to use, to my new instance.
However, when tailing my ZooKeeper logs I see this error:
2018-04-13 14:17:34,884 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@810] - Connection broken for id 1, my id = 2, error =
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:153)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.net.SocketInputStream.read(SocketInputStream.java:211)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)
2018-04-13 14:17:34,885 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@813] - Interrupting SendWorker
2018-04-13 14:17:34,884 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@727] - Interrupted while waiting for message on queue
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
I double-checked that all 3 Kafka broker IP addresses are correctly listed in these locations, and I restarted all their services to be safe.
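To see which quorum peer keeps dropping, the "Connection broken" warnings can be tallied per peer pair. This is only a rough Python sketch: it assumes the message wording shown in the log excerpt above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the warning text seen in the ZooKeeper log excerpt above.
BROKEN_RE = re.compile(r"Connection broken for id (\d+), my id = (\d+)")


def count_broken_peers(lines):
    """Count 'Connection broken' warnings per (local myid, remote id) pair."""
    counts = Counter()
    for line in lines:
        match = BROKEN_RE.search(line)
        if match:
            remote_id, local_id = int(match.group(1)), int(match.group(2))
            counts[(local_id, remote_id)] += 1
    return counts


# Fragment of the warning from the question, repeated to mimic a flapping peer.
sample = ["WARN - Connection broken for id 1, my id = 2, error ="] * 3
print(count_broken_peers(sample))
```

If a single remote id dominates the tally on every node, that peer (here, id 1) is the one to inspect first.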


Root cause for Connection broken for id 1, my id = 3, error =

I am using Confluent 4 for the Kafka and ZooKeeper installation.
On our Kafka cluster (3 brokers and 3 ZooKeeper nodes running on 3 AWS instances)
we are seeing the set of warnings below recorded repeatedly in the broker's server.log file.
We have not observed any functional issues from this yet, but we cannot find the root cause, and we are not sure whether it might eventually affect the clients or other broker nodes. The warnings are:
[2018-04-03 12:00:40,707] WARN Interrupted while waiting for message on queue (org.apache.zookeeper.server.quorum.QuorumCnxManager)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
[2018-04-03 12:00:40,707] WARN Connection broken for id 1, my id = 3, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
[2018-04-03 12:00:40,708] WARN Interrupting SendWorker (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2018-04-03 12:00:40,707] WARN Send worker leaving thread (org.apache.zookeeper.server.quorum.QuorumCnxManager)
This set of warnings repeats and is observed on all 3 Kafka nodes.
If anyone has any idea why these warnings are generated, please let me know.
Thanks in advance.
This sounds like a known issue with newer versions of ZooKeeper. Check out this JIRA: https://issues.apache.org/jira/browse/ZOOKEEPER-2938
In my case, I was replacing a ZK node and didn't realize the old one was still running, so I ended up with two nodes with the same "myid".
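A duplicate "myid" like this can be caught mechanically once the myid value from each node has been collected. A minimal Python sketch, assuming a hypothetical host-to-myid mapping (the host names below are invented for illustration; how the values are gathered is left out):

```python
from collections import defaultdict


def find_duplicate_myids(myid_by_host):
    """Return {myid: [hosts...]} for every myid claimed by more than one host."""
    hosts_by_id = defaultdict(list)
    for host, myid in myid_by_host.items():
        hosts_by_id[int(myid)].append(host)
    return {
        myid: sorted(hosts)
        for myid, hosts in hosts_by_id.items()
        if len(hosts) > 1
    }


# Hypothetical inventory: the replaced node "zk-old" was never shut down.
collected = {"zk-old": 1, "zk-new": 1, "zk-b": 2, "zk-c": 3}
print(find_duplicate_myids(collected))
```

An empty result means every node has a unique id; anything else lists the hosts that need to be reconciled.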

Kafka Zookeeper Connection drop continuously

I have set up a 3-node Kafka cluster and a 3-node ZooKeeper cluster on separate nodes. Using Kafka I can produce and consume messages successfully and run commands like kafka-topics.sh to get topic lists and their information from ZooKeeper, but there are some errors in the Kafka server.log file. The following warning appears continuously:
[2018-02-18 21:50:01,241] WARN Client session timed out, have not heard from server in 320190154ms for sessionid 0x161a94b101f0001 (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:01,242] INFO Client session timed out, have not heard from server in 320190154ms for sessionid 0x161a94b101f0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:01,343] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2018-02-18 21:50:01,989] INFO Opening socket connection to server zookeeper3/ Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,008] INFO Socket connection established to zookeeper3/, initiating session (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,042] INFO Session establishment complete on server zookeeper3/, sessionid = 0x161a94b101f0001, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,042] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2018-02-18 21:59:31,570] INFO [Group Metadata Manager on Broker 102]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
It seems the Kafka sessions in ZooKeeper expire periodically!
The ZooKeeper logs contain the following warnings, too:
2018-02-18 18:20:06,149 [myid:1] - WARN [NIOServerCxn.Factory:] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x161a94b101f0001, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
2018-02-18 18:20:06,151 [myid:1] - INFO [NIOServerCxn.Factory:] - Closed socket connection for client / which had sessionid 0x161a94b101f0001
2018-02-18 18:20:06,781 [myid:1] - WARN [NIOServerCxn.Factory:] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x161a94b101f0002, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
2018-02-18 18:20:06,782 [myid:1] - INFO [NIOServerCxn.Factory:] - Closed socket connection for client / which had sessionid 0x161a94b101f0002
2018-02-18 18:37:29,127 [myid:1] - INFO [NIOServerCxn.Factory:] - Accepted socket connection from /
2018-02-18 18:37:29,139 [myid:1] - INFO [NIOServerCxn.Factory:] - Client attempting to establish new session at /
2018-02-18 18:37:29,143 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer@687] - Established session 0x161a94b101f0003 with negotiated timeout 30000 for client /
2018-02-18 18:37:29,432 [myid:1] - INFO [NIOServerCxn.Factory:] - Closed socket connection for client / which had sessionid 0x161a94b101f0003
I think it's because ZooKeeper can't get heartbeats from the Kafka nodes. The following is the ZooKeeper zoo.cfg:
and the customized Kafka server.properties setting:
listeners = PLAINTEXT://kafka1:9092
I use the same ZooKeeper cluster for Hadoop HA without any problem. I suspect something is wrong with the Kafka properties listeners and advertised.listeners; I read the Kafka documentation but couldn't understand what they mean.
In the hosts file on every machine, the hostnames zookeeper1 to zookeeper3 and kafka1 to kafka3 are defined and reachable with ping. I removed the following lines from hosts: localhost hostname
I don't think that could cause the problem.
Kafka version: 0.11
Zookeeper version: 3.4.10
Can anyone help?
We were facing a similar issue with Kafka. As @Soheil pointed out, it was due to a major GC running.
When a major GC runs, Kafka is sometimes unable to send its heartbeat to ZooKeeper. For us the major GC was running almost once every 15 seconds. After taking a heap dump, we realized it was caused by a metrics memory leak in Kafka.
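One way to check whether pauses of this kind line up with the reconnects is to compare the silence reported in the client warning against the negotiated session timeout from the same log. A rough Python sketch, assuming the message wording shown in the server.log excerpt in the question; the function name is illustrative only:

```python
import re

# Patterns follow the ClientCnxn messages quoted in the question.
SILENCE_RE = re.compile(
    r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-fA-F]+)"
)
TIMEOUT_RE = re.compile(
    r"sessionid = (0x[0-9a-fA-F]+), negotiated timeout = (\d+)"
)


def sessions_exceeding_timeout(lines):
    """Map sessionid -> (longest silence ms, negotiated timeout ms)
    for sessions whose longest reported silence exceeded the timeout."""
    silence, timeout = {}, {}
    for line in lines:
        match = SILENCE_RE.search(line)
        if match:
            sid = match.group(2)
            silence[sid] = max(silence.get(sid, 0), int(match.group(1)))
        match = TIMEOUT_RE.search(line)
        if match:
            timeout[match.group(1)] = int(match.group(2))
    return {
        sid: (ms, timeout[sid])
        for sid, ms in silence.items()
        if sid in timeout and ms > timeout[sid]
    }


# Message fragments taken from the server.log excerpt in the question.
log_lines = [
    "Client session timed out, have not heard from server in 320190154ms "
    "for sessionid 0x161a94b101f0001",
    "Session establishment complete on server zookeeper3/, "
    "sessionid = 0x161a94b101f0001, negotiated timeout = 6000",
]
print(sessions_exceeding_timeout(log_lines))
```

Sessions that show up here were silent for longer than ZooKeeper allows, which is consistent with (but does not prove) a long stop-the-world pause on the broker.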

Flink: HA mode killing leading jobmanager terminating standby jobmanagers

I am trying to get Flink to run in HA mode using Zookeeper, but when I try to test it by killing the leader JobManager all my standby jobmanagers get killed too.
So instead of a standby jobmanager taking over as the new Leader, they all get killed which isn't supposed to happen.
My setup:
4 servers, 3 of those servers have Zookeeper running, but only 1 server will host all the JobManagers.
ad011.local: Zookeeper + Jobmanagers
ad012.local: Zookeeper + Taskmanager
ad013.local: Zookeeper
ad014.local: nothing interesting
My masters file looks like this:
My flink-conf.yaml:
jobmanager.rpc.address: ad011.local
blob.server.port: 6130,6131,6132
jobmanager.heap.mb: 512
taskmanager.heap.mb: 128
taskmanager.numberOfTaskSlots: 4
parallelism.default: 2
taskmanager.tmp.dirs: /var/flink/data
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 8789,8790,8791
high-availability: zookeeper
high-availability.zookeeper.quorum: ad011.local:2181,ad012.local:2181,ad013.local:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.cluster-id: /cluster-one
high-availability.storageDir: /var/flink/recovery
high-availability.jobmanager.port: 50000,50001,50002
When I start Flink using the start-cluster.sh script, I see my 3 JobManagers running, and in the web UI they all point to ad011.local:8081, which is the leader. Which is okay I guess?
I then test the failover by killing the leader with kill, and all my standby JobManagers stop too.
This is what I see in my standby JobManager logs:
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@ad011.local:50002/user/jobmanager.
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService@72d546c8.
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@ad011.local:50002/user/jobmanager on port 8083
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,645 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@ad011.local:50000/user/jobmanager:f7dc2c48-dfa5-45a4-a63e-ff27be21363a.
2017-09-29 08:08:41,651 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,722 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Received leader address but not running in leader ActorSystem. Cancelling registration.
2017-09-29 09:26:13,472 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ad011.local:50000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-09-29 09:26:14,274 INFO org.apache.flink.runtime.jobmanager.JobManager - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2017-09-29 09:26:14,284 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at
Any help would be appreciated.
Solved it by starting my cluster with ./bin/start-cluster.sh directly instead of through service files (which call the same script); apparently the service file kills the other JobManagers.

Leader election with Curator and Zookeeper

I am running 3 instances of ZooKeeper and the config is this:
I am using the leader election example code given here:
The code runs fine with TestingServer, but when I change the connection string to ",,", I get these exceptions:
[main-SendThread(] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /, server:
[main-SendThread(] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server, sessionid = 0x3521552283c0000, negotiated timeout = 40000
[main-EventThread] INFO org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
[main-SendThread(] INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x3521552283c0000, likely server has closed socket, closing socket connection and attempting reconnect
[main-EventThread] INFO org.apache.curator.framework.imps.EnsembleTracker - New config event received: null
[main-EventThread] ERROR org.apache.curator.framework.imps.CuratorFrameworkImpl - Background exception was not retry-able or retry gave up
at java.io.ByteArrayInputStream.<init>(ByteArrayInputStream.java:106)
at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:163)
at org.apache.curator.framework.imps.EnsembleTracker.access$200(EnsembleTracker.java:48)
at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:134)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:829)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:611)
at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:151)
at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:210)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:619)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:528)
[main-EventThread] INFO org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
What could be the issue?
I am hitting the same issue. I think it might be related to the ZooKeeper 3.5.1 ClientCnxn. Even after reverting to Curator 2.6.0, I still see the same stack trace. A GET_CONFIG event type is sent without the event data.
My stack trace looks like this:
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
! java.lang.NullPointerException: null
! at java.io.ByteArrayInputStream.<init>(ByteArrayInputStream.java:106)
! at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:163)
! at org.apache.curator.framework.imps.EnsembleTracker.access$200(EnsembleTracker.java:48)
! at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:134)
! at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:829)
! at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:611)
! at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:151)
! at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:210)
! at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:619)
! at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:528)
If you use ZooKeeper 3.5.1, then curator-recipes 3.2.1+ fixes this issue.
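That version advice can be turned into a simple guard, for example in a build or startup check. A small Python sketch, assuming plain dotted numeric version strings; the 3.2.1 threshold comes from the answer above and has not been checked against Curator's release notes, and both function names are made up:

```python
def parse_version(text):
    """Turn '3.2.1' into (3, 2, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def curator_ok_for_zk_351(curator_version, minimum="3.2.1"):
    """True if the curator-recipes version meets the suggested minimum."""
    return parse_version(curator_version) >= parse_version(minimum)


for candidate in ("2.6.0", "3.2.1", "3.10.0"):
    print(candidate, curator_ok_for_zk_351(candidate))
```

Comparing tuples of integers rather than raw strings avoids ranking "3.10.0" below "3.2.1".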

Zookeeper: Connection request from old client will be dropped if server is in r-o mode

storm version: 0.82
zookeeper version: 3.4.5.
We have a small storm cluster (1 nimbus and 3 supervisors), so using just 1 zookeeper instance that's co-located with storm nimbus.
Infrequently we start getting the following errors in the zookeeper logs and our storm cluster comes to a standstill.
2014-04-05 13:27:32,885 [myid:] - INFO [NIOServerCxn.Factory:
ory#197] - Accepted socket connection from /
2014-04-05 13:27:32,886 [myid:] - WARN [NIOServerCxn.Factory:
93] - Connection request from old client /; will be dropped if server is in r-o mode
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:
32] - Client attempting to renew session 0x1452dd02834002e at /
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:
95] - Established session 0x1452dd02834002e with negotiated timeout 40000 for client /
On the storm end we start seeing the following in supervisor and worker logs:
2014-04-05 11:37:29 ConnectionStateManager [WARN] There are no ConnectionStateListeners registered.
2014-04-05 11:37:29 cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2014-04-05 11:37:31 ClientCnxn [WARN] Session 0x1452dd028340015 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-04-05 11:37:42 CuratorFrameworkImpl [ERROR] Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(Curat
at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
Do we need to downgrade zookeeper to 3.3.3 or is there a known issue/config that we're missing?
We also experienced several issues with Storm 0.9 and ZooKeeper 3.4.X, though not exactly the one you describe.
The Storm mailing list also reports such incompatibility issues:
The latter points to this Storm pull request, which should hopefully let us use ZK 3.4.X with a future Storm release:
Until then, I would recommend downgrading ZK to 3.3.6 (you can install a separate ZK instance just for Storm if you absolutely need ZK 3.4.X for another system). You could also clone the Storm code and merge that pull request locally, or compile the latest version of the trunk, but that's a bit adventurous and more tiresome than waiting for those nice folks to deliver a new release for us :)
A workaround for this situation is to clear Storm's data directory (configured in storm.yaml as storm.local.dir), then restart the supervisor. I did that in my test environment by clearing Storm's data directory and restarting the nimbus and supervisor.
I think it's caused by a previous crash of the Storm cluster, after which the supervisor cannot recover from that state.