CuratorFrameworkImpl - Background exception was not retry-able or retry gave up - apache-zookeeper

Curator framework version - 4.3.0, Zookeeper version - 5.5.0
We have deployed apache atlas on Kubernetes and it uses Zookeeper to elect one out of two atlas pods as a leader.
We are running three zookeeper pods (3 node cluster) and one pod going down should not create any issue. When one zookeeper pod is down, the zookeeper cluster is still healthy and there is one zookeeper leader available. I tested this by exec'ing into a zookeeper pod and checking zookeeper status. But curator framework throws the following error -
[main:] ~ Background exception was not retry-able or retry gave up (CuratorFrameworkImpl:685)
java.net.UnknownHostException: zookeeper-2.zookeeper-headless.atlas.svc.cluster.local: Name or service not known
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:196)
at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:101)
at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:57)
at org.apache.curator.ConnectionState.reset(ConnectionState.java:201)
at org.apache.curator.ConnectionState.start(ConnectionState.java:111)
at org.apache.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:214)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:314)
at org.apache.atlas.web.service.CuratorFactory.initializeCuratorFramework(CuratorFactory.java:88)
at org.apache.atlas.web.service.CuratorFactory.<init>(CuratorFactory.java:78)
at org.apache.atlas.web.service.CuratorFactory.<init>(CuratorFactory.java:73)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:142)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:89)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateBean(AbstractAutowireCapableBeanFactory.java:1152)
zookeeperConnectionString = "zookeeper-0.zookeeper-headless.atlas.svc.cluster.local:2181,zookeeper-1.zookeeper-headless.atlas.svc.cluster.local:2181,zookeeper-2.zookeeper-headless.atlas.svc.cluster.local:2181"
and the problem we are facing is, when we try to run leaderLatch.start() it does not return any error but the corresponding znode is not created in zookeeper.

The reason you see that error is on Kubernetes when a pod is restarted its DNS record also gets removed for a short time until the pod comes up again. In your case, there will not be an issue cause curator will connect to another ZK server in your CS.

Related

Zookeeper - Error the current epoch, is older than the last zxid

I am using a zookeeper ensemble of 3 nodes running 3.4.13. Sometimes after reboot of machine zookeeper is not starting in one of the node and I am seeing the below errors in logs
2019-08-19 04:18:36,906 [myid:2] - ERROR [main:QuorumPeer#692] - Unable to load database on disk
java.io.IOException: The current epoch, 7, is older than the last zxid, 34359738370
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:674)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:635)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
2019-08-19 04:18:36,908 [myid:2] - ERROR [main:QuorumPeerMain#92] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:693)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:635)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
Caused by: java.io.IOException: The current epoch, 7, is older than the last zxid, 34359738370
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:674)
... 4 more----
I have seen ZOOKEEPER-2354 and the symptoms look similar.
support#platform2:/var/lib/zookeeper/version-2$ sudo cat acceptedEpoch
8support#platform2:/var/lib/zookeeper/version-2$ sudo cat currentEpoch
7support#platform2:/var/lib/zookeeper/version-2$ sudo cat currentEpoch.tmp
8support#platform2
The above issue states the issue is fixed in 3.4.6 but I am observing the same in 3.4.13.
Can someone let me know how can I recover the zookeeper node from this?
This has been discussed in zookeeper mailing thread. Relevant quote from that thread
With the other two zookeeper servers running I stopped the zookeeper
in the broken node and the deleted all the contents inside
/var/lib/zookeeper/version-2 and started the zookeeper back on the
node. It is running fine now and got all the data from the other
servers.

After few successful updates to solr, throws SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper

I am using solr version 7.3.1 and using it with 3 external zookeeper nodes.
Below is my zookeeper config for one of the node, all have similar config:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/<user>/zookeeper/zookdata/zk2/
clientPort=2182
server.1=localhost:2666:3666
server.2=localhost:2667:3667
server.3=localhost:2668:3668
Then, I am using these 3 nodes to start solr. In my application I am using localhost:2182,localhost:2183 to connect to solr, using below code.
List<String> zkHosts = Arrays.asList(solrZkHostPort.split(","));
CloudSolrClient.Builder builder = new CloudSolrClient.Builder(zkHosts, Optional.empty());
solrClient = builder.build();
I am using multiple spark executors to update document to solr. It works fine for some 1100-1300 records update after that update fails with below exception:
Caused by: org.apache.solr.common.SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper localhost:2182,localhost:2183 within 10000 ms
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:183)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:120)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:110)
at org.apache.solr.common.cloud.ZkStateReader.<init>(ZkStateReader.java:285)
at org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.connect(ZkClientClusterStateProvider.java:155)
at org.apache.solr.client.solrj.impl.CloudSolrClient.connect(CloudSolrClient.java:399)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:828)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:818)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
at com.package.SomeApplicationClass
... 16 more
Caused by: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper localhost:2182,localhost:2183 within 10000 ms
at org.apache.solr.common.cloud.ConnectionManager.waitForConnected(ConnectionManager.java:232)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:175)
... 26 more
I get below exception too, not sure though if this has any significance:
18/09/20 12:45:40 WARN ClientCnxn: Session 0x0 for server localhost/127.0.0.1:2183, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
any idea what could be the problem? Do I need to change zookeeper config or they way I create solr client needs to change?
Noob mistake.
My spark job was creating more then 1000 spark executors and for each executor, solr client was getting created and I was not closing the solrClient. closed solrclient after executor completed.

Schema Registry won't start after upgrading to Confluent 4.1

I have recently upgraded Confluent to 4.1 but schema registry seems to have some issues. On confluent start schema-registry (and consequently ksql-server) cannot start.
Here's the error I get in the logs of schema-registry:
[2018-04-20 11:27:38,426] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:65)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:203)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:63)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:41)
at io.confluent.rest.Application.createServer(Application.java:165)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:43)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: io.confluent.kafka.schemaregistry.storage.exceptions.StoreException: Failed to write Noop record to kafka store.
at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:139)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:201)
... 4 more
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreException: Failed to write Noop record to kafka store.
at io.confluent.kafka.schemaregistry.storage.KafkaStore.getLatestOffset(KafkaStore.java:423)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.waitUntilKafkaReaderReachesLastOffset(KafkaStore.java:276)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:137)
... 5 more
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:77)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.getLatestOffset(KafkaStore.java:418)
... 7 more
Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
[2018-04-20 11:27:38,430] INFO Shutting down schema registry (io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry:726)
[2018-04-20 11:27:38,430] INFO [kafka-store-reader-thread-_schemas]: Shutting down (io.confluent.kafka.schemaregistry.storage.KafkaStoreReaderThread:66)
[2018-04-20 11:27:38,431] INFO [kafka-store-reader-thread-_schemas]: Stopped (io.confluent.kafka.schemaregistry.storage.KafkaStoreReaderThread:66)
[2018-04-20 11:27:38,440] INFO [kafka-store-reader-thread-_schemas]: Shutdown completed (io.confluent.kafka.schemaregistry.storage.KafkaStoreReaderThread:66)
[2018-04-20 11:27:38,446] INFO KafkaStoreReaderThread shutdown complete. (io.confluent.kafka.schemaregistry.storage.KafkaStoreReaderThread:227)
I have no clue why this error is reported and the error messages are not that meaningful to me.
After failing, confluent start schema-registry and confluent start ksql-server bring both services up but when starting KSQL I get the following warning:
**************** WARNING ******************
Remote server address may not be valid:
Error issuing GET to KSQL server
Caused by: java.net.ConnectException: Connection refused (Connection refused)
Caused by: Could not connect to the server.
*******************************************
When trying to run a command (e.g. show tables;) the following error is reported:
ksql> show tables;
Error issuing POST to KSQL server
Caused by: java.net.ConnectException: Connection refused (Connection refused)
Caused by: Could not connect to the server.
EDIT: I've fixed this by destroying current run (confluent destroy) but it would be interesting if someone could explain this issue.
From the info you've posted it feels like you may have had some zombie processes or bad data somewhere, though I can't be sure.
The Schema Registry was complaining that it couldn't write a message to Kafka, because the Kafka broker was complaining that it didn't own the topic partition the Schema Registry was writing to. This might of been caused by a previous Kafka broker, (from the old install), still running.
Did you confluent stop before upgrading?
Using confluent destroy, as you did, to flatted/reset the installation is always a good option, as long as you're not precious about your data. Checking for spurious processes, (or using the old 'reboot machine' trick), can also be a good place to start when things aren't behaving as you'd expect.
Glad its all sorted now :D
Andy

Root cause for Connection broken for id 1, my id = 3, error =

I am using Confluent 4 for kafka and zookeeper installation.
On our Kafka Cluster environment (of 3 brokers and 3 zookeeper nodes running on 3 aws instances)
we are seeing a set of below warnings, repeatedly getting recorded in the broker's server.log file.
We have not observed any functionality issues due to this yet, but we are not able to find the root cause and there may be a chance in future it will affect the clients or other broker nodes. We are not sure yet about this. Below is the set of warnings
[2018-04-03 12:00:40,707] WARN Interrupted while waiting for message on queue (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
[2018-04-03 12:00:40,707] WARN Connection broken for id 1, my id = 3, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
[2018-04-03 12:00:40,708] WARN Interrupting SendWorker (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2018-04-03 12:00:40,707] WARN Send worker leaving thread (org.apache.zookeeper.server.quorum.QuorumCnxManager)
This set of warnings get repeated and getting observed in all 3 kafka nodes.
If anyone has any idea about why this warning gets generate, then please let me know.
Thanks in advance.
This sounds like a known issue with newer version of Zk, Check out this JIRA https://issues.apache.org/jira/browse/ZOOKEEPER-2938
In my case, I was replacing a ZK node and the old one was still running which I didn't realize. So I had created 2x nodes with the same "myid".

Zookeeper: Connection request from old client will be dropped if server is in r-o mode

storm version: 0.82
zookeeper version: 3.4.5.
We have a small storm cluster (1 nimbus and 3 supervisors), so using just 1 zookeeper instance that's co-located with storm nimbus.
Infrequently we start getting the following errors in the zookeeper logs and our storm cluster comes to a standstill.
2014-04-05 13:27:32,885 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFact
ory#197] - Accepted socket connection from /10.0.1.183:56121
2014-04-05 13:27:32,886 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#7
93] - Connection request from old client /10.0.1.183:56121; will be dropped if server is in r-o mode
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#8
32] - Client attempting to renew session 0x1452dd02834002e at /10.0.1.183:56121
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#5
95] - Established session 0x1452dd02834002e with negotiated timeout 40000 for client /10.0.1.183:561
21
On the storm end we start seeing the following in supervisor and worker logs:
2014-04-05 11:37:29 ConnectionStateManager [WARN] There are no ConnectionStateListeners registered.
2014-04-05 11:37:29 cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2014-04-05 11:37:31 ClientCnxn [WARN] Session 0x1452dd028340015 for server null, unexpected error,
losing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-04-05 11:37:42 CuratorFrameworkImpl [ERROR] Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(Curat
rFrameworkImpl.java:380)
at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl
java:49)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
Do we need to downgrade zookeeper to 3.3.3 or is there a known issue/config that we're missing?
We also experienced several issues with Storm 0.9 and Zookeeper 3.4.X, even though not exactly the one you describe.
Storm mailing list are also reporting such incompatibility issues:
https://mail.google.com/mail/u/0/#search/label%3Astorm+zookeeper+3.4/144313a45ba069b5
https://mail.google.com/mail/u/0/#search/label%3Astorm+zookeeper+3.4/1447d95d10ce7582
This later one is pointing us to this Storm pull request, which should hopefully let us use ZK 3.4.X with future versions of Storm when it will be released:
https://github.com/apache/incubator-storm/pull/29
Until then, I would recommend downgrading ZK to 3.3.6 (you may install a specific separate instance of ZK for Storm if you absolutely need ZK 3.4.X for another system). You could also clone the Storm code and merge that pull request locally or compile the latest version of the trunk, but that's a bit adventurous and more tiresome than just waiting for those nice folks to just deliver a new release for us :)
A workaround for this situation is to clear storm's data directory (configured in strom.yaml==>storm.local.dir), then restart the supervisor. I did that in my test environment by clear storm's data directory and restart the nimbus and supervisor.
I think it's caused by a previous crash of the storm cluster, and the supervisor can not recovery from such a spot.