I ran into a full-disk (100%) issue, which corrupted the core. Following advice I found by searching, I removed all snapshots from version-2 and datalog/version-2.
Right now /var/zookeeper contains only the myid file and the __backup folder.
When I try to start ZooKeeper, I get the error below:
2016-06-12 12:43:36,512 [myid:4] - ERROR [main:FileTxnSnapLog#210] - Parent /search/cluster1/overseer/queue missing for /search/cluster1/overseer/queue/qn-0000000288
2016-06-12 12:43:36,514 [myid:4] - ERROR [main:QuorumPeer#453] - Unable to load database on disk
java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /search/cluster1/overseer/queue
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /search/cluster1/overseer/queue
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
... 6 more
2016-06-12 12:43:36,514 [myid:4] - ERROR [main:QuorumPeerMain#89] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:454)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /search/cluster1/overseer/queue
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
... 4 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /search/cluster1/overseer/queue
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
... 6 more
Could you please help me fix this?
Thank you.
Your ZooKeeper appears to be in an inconsistent state: while replaying its transaction log, it is trying to create a node whose parent is missing. Look in the zoo.cfg file to check that your ZooKeeper dataDir settings are consistent.
Or this could be a known ZooKeeper issue:
https://issues.apache.org/jira/browse/ZOOKEEPER-1813
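One way to check that consistency is to list what is actually on disk under the configured directories. The following is only a minimal sketch, assuming a standard zoo.cfg with a dataDir key and an optional dataLogDir key. The helper names read_zoo_cfg and list_zk_files are made up for illustration and are not part of ZooKeeper.

```python
import os


def read_zoo_cfg(path):
    """Parse key=value lines from a zoo.cfg-style file, skipping comments."""
    settings = {}
    with open(path) as cfg:
        for line in cfg:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def list_zk_files(settings):
    """Return (snapshots, txn_logs) found under dataDir and dataLogDir."""
    data_dir = settings["dataDir"]
    # Transaction logs go to dataDir unless dataLogDir is set separately.
    log_dir = settings.get("dataLogDir", data_dir)
    snapshots, txn_logs = [], []
    for base in {data_dir, log_dir}:
        version_dir = os.path.join(base, "version-2")
        if not os.path.isdir(version_dir):
            continue
        for name in os.listdir(version_dir):
            full = os.path.join(version_dir, name)
            if name.startswith("snapshot."):
                snapshots.append(full)
            elif name.startswith("log."):
                txn_logs.append(full)
    return sorted(snapshots), sorted(txn_logs)
```

If list_zk_files(read_zoo_cfg("/path/to/zoo.cfg")) returns log.* files but no snapshot.* files on a server that has been running for a while, the remaining logs likely depend on a snapshot that is gone, which would be consistent with the missing-parent error above.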
Related
I have a Flink cluster running on minikube: 1 jobmanager and 3 taskmanagers.
I am using the Kubernetes HA service to handle jobmanager leader election.
When I kill the jobmanager to simulate a crash, the taskmanagers cannot connect to the new jobmanager.
They always try to connect to the previous IP address of the jobmanager that was terminated.
Here is the exception:
2021-05-05 12:14:28.126 [flink-akka.actor.default-dispatcher-3] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-7 - Association with remote system [akka.tcp://flink#172.17.0.7:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#172.17.0.7:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
2021-05-05 12:14:28.131 [flink-akka.actor.default-dispatcher-3] ERROR o.a.f.runtime.rest.handler.cluster.ClusterOverviewHandler - Unhandled exception.
org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:386)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:61)
at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:53)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not connect to rpc endpoint under address akka.tcp://flink#172.17.0.7:6123/user/rpc/resourcemanager_0.
at org.apache.flink.runtime.rpc.akka.AkkaRpcService.lambda$resolveActorAddress$10(AkkaRpcService.java:570)
at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:59)
... 5 common frames omitted
Caused by: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not connect to rpc endpoint under address akka.tcp://flink#172.17.0.7:6123/user/rpc/resourcemanager_0.
... 7 common frames omitted
Caused by: akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink#172.17.0.7:6123/), Path(/user/rpc/resourcemanager_0)]
at akka.actor.ActorSelection.$anonfun$resolveOne$1(ActorSelection.scala:71)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:81)
at akka.dispatch.BatchingExecutor.execute(BatchingExecutor.scala:120)
at akka.dispatch.BatchingExecutor.execute$(BatchingExecutor.scala:114)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:80)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:556)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:593)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:582)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:104)
at akka.remote.EndpointWriter.postStop(Endpoint.scala:606)
at akka.actor.Actor.aroundPostStop(Actor.scala:536)
at akka.actor.Actor.aroundPostStop$(Actor.scala:536)
at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:458)
at akka.actor.dungeon.FaultHandling.finishTerminate(FaultHandling.scala:210)
at akka.actor.dungeon.FaultHandling.terminate(FaultHandling.scala:172)
at akka.actor.dungeon.FaultHandling.terminate$(FaultHandling.scala:142)
at akka.actor.ActorCell.terminate(ActorCell.scala:429)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:533)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:549)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:283)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I started ZooKeeper and Kafka servers on my machine, in that order. Both servers establish their connections, but the connections drop after a few minutes with errors.
ZOOKEEPER ERROR LOG -
'''2020-05-11 14:32:36,908 [myid:] - INFO [SyncThread:0:FileTxnLog#284] - Creating new log file: log.1
2020-05-11 14:36:05,170 [myid:] - WARN [NIOWorkerThread-1:NIOServerCnxn#373] - Close of session 0x10000cc587b0003
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:324)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-05-11 14:36:10,589 [myid:] - INFO [SessionTracker:ZooKeeperServer#600] - Expiring session 0x10000cc587b0003, timeout of 6000ms exceeded
'''
KAFKA ERROR LOG - Below are the Kafka server error logs
'''[2020-05-11 14:33:41,077] INFO [KafkaServer id=0] started (kafka.server.KafkaServer)
[2020-05-11 14:36:04,762] ERROR Error while writing to checkpoint file C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint (kafka.server.LogDirFailureChannel)
java.nio.file.FileAlreadyExistsException: C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint.tmp -> C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:81)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
.....
[2020-05-11 14:36:04,776] INFO [ReplicaManager broker=0] Stopping serving replicas in dir C:\Kafka\Kafkakafka-logs (kafka.server.ReplicaManager)
[2020-05-11 14:36:04,776] ERROR [ReplicaManager broker=0] Error while writing to highwatermark file in directory C:\Kafka\Kafkakafka-logs (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint
Caused by: java.nio.file.FileAlreadyExistsException: C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint.tmp -> C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:81)
'''
I'm following this guide to configure a Mesos cluster with 3 master nodes and 3 slave nodes. However, when I start ZooKeeper on the masters, I get the following error log:
2017-07-05 09:46:18,568 - INFO [main:FileSnap#83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.100000016
2017-07-05 09:46:18,606 - ERROR [main:FileTxnSnapLog#210] - Parent /mesos/log_replicas missing for /mesos/log_replicas/0000000002
2017-07-05 09:46:18,607 - ERROR [main:QuorumPeer#453] - Unable to load database on disk
java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
... 6 more
2017-07-05 09:46:18,610 - ERROR [main:QuorumPeerMain#89] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:454)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
... 4 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
... 6 more
When the slaves are started, they obviously cannot discover the masters, since they cannot connect to ZooKeeper. The slaves give this error:
I0705 09:33:43.593530 25710 provisioner.cpp:410] Provisioner recovery complete
I0705 09:33:43.593668 25710 slave.cpp:5970] Finished recovery
W0705 09:33:53.529522 25717 group.cpp:494] Timed out waiting to connect to ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
I0705 09:33:53.530243 25717 group.cpp:510] ZooKeeper session expired
W0705 09:34:03.532635 25710 group.cpp:494] Timed out waiting to connect to ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
I0705 09:34:03.533331 25710 group.cpp:510] ZooKeeper session expired
Any ideas on how to troubleshoot this?
Reinstalling the master nodes solved the first problem.
I still had the second problem, where the slaves could not find ZooKeeper. The documentation seems to indicate that slaves can discover the master nodes on their own, but that was not working for me. However, once I pointed the slaves at the ZooKeeper nodes in /etc/mesos/zk, it started working.
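For reference, /etc/mesos/zk normally holds a single zk:// URL listing every ZooKeeper node. Here is a small sketch of how that line is put together, assuming the default client port 2181 and the /mesos znode path. The IP addresses are placeholders and build_zk_url is just an illustrative helper.

```python
def build_zk_url(hosts, port=2181, znode="/mesos"):
    """Build the zk:// URL that goes into /etc/mesos/zk on each node."""
    endpoints = ",".join(f"{host}:{port}" for host in hosts)
    return f"zk://{endpoints}{znode}"


# Placeholder addresses; substitute the real ZooKeeper (master) nodes.
print(build_zk_url(["10.0.0.1", "10.0.0.2", "10.0.0.3"]))
```

The resulting line has to be identical on the masters and the slaves, otherwise they end up looking in different places.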
When I try to start ZooKeeper, I get the error below:
[2017-05-24 11:52:31,633] ERROR Last transaction was partial. (org.apache.zookeeper.server.persistence.Util)
[2017-05-24 11:52:31,634] ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.ZooKeeperServerMain)
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:585)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:604)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:570)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:652)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:158)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:283)
at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:410)
at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:118)
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:119)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:87)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:53)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Could you please help me fix this?
Thank you.
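This sometimes happens when the snapshots in ZooKeeper's dataDir are corrupt.
You can fix this by cleaning out the contents of dataDir and restarting ZooKeeper. Back the directory up first, since on a standalone server this discards the stored state.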
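Before wiping the whole directory, it may help to find out which file is damaged. The EOFException above is thrown while reading a transaction log file header, which suggests a log file that was created but never fully written. The sketch below flags such files. It assumes the header is 16 bytes and begins with the ASCII magic "ZKLG", as in the ZooKeeper 3.4 source, so verify that against your version. The function name is illustrative.

```python
import os

HEADER_SIZE = 16        # assumed: int magic + int version + long dbid
TXNLOG_MAGIC = b"ZKLG"  # assumed magic bytes at the start of each log file


def find_suspect_logs(version_dir):
    """Return (path, reason) pairs for log.* files with a bad or short header."""
    suspects = []
    for name in sorted(os.listdir(version_dir)):
        if not name.startswith("log."):
            continue
        path = os.path.join(version_dir, name)
        size = os.path.getsize(path)
        if size < HEADER_SIZE:
            suspects.append((path, f"only {size} bytes, header incomplete"))
            continue
        with open(path, "rb") as f:
            if f.read(4) != TXNLOG_MAGIC:
                suspects.append((path, "unexpected magic bytes"))
    return suspects
```

Moving a flagged file out of the version-2 directory is a less destructive thing to try first, though whether ZooKeeper then starts cleanly depends on what else was lost.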
I am trying to run ZooKeeper, but I get an error: Failed to start role.
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:196)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:156)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79)
2016-03-01 10:55:38,873 ERROR org.apache.zookeeper.server.quorum.QuorumPeerMain: Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:454)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:156)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79)
Caused by: java.io.FileNotFoundException: /var/lib/zookeeper/version-2/log.d00015690 (Permission denied)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:574)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:196)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
Any help, please!
Looks like you don't have permission to open this file. Have you tried running ZooKeeper as root/sudo?
Caused by: java.io.FileNotFoundException: /var/lib/zookeeper/version-2/log.d00015690 (Permission denied)
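Running as root is one option, but it may be worth first checking whether the file is simply owned by a different account than the one ZooKeeper runs under (for example, if it was once started as root). A minimal sketch for that check, meant to be run as the ZooKeeper service user. describe_access is an illustrative helper, not a ZooKeeper tool.

```python
import os
import stat


def describe_access(path):
    """Summarize who owns a file and whether the current user can read it."""
    info = os.stat(path)
    return {
        "path": path,
        "owner_uid": info.st_uid,
        "group_gid": info.st_gid,
        "mode": stat.filemode(info.st_mode),  # e.g. "-rw-r--r--"
        "readable_by_me": os.access(path, os.R_OK),
    }
```

Calling describe_access("/var/lib/zookeeper/version-2/log.d00015690") and getting readable_by_me as False with an owner_uid that is not the service user would point to an ownership problem rather than anything wrong with ZooKeeper itself.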