taskManager could not connect to the newly elected jobmanager - kubernetes

I have a Flink cluster running on minikube : 1 jobmanager and 3 taskmanagers.
I am using Kubernetes Ha service to handle jobmanager leader election.
when i am trying to kill the jobmanager to simulate a crash , the taskmanager could not connect
the new jobmanager it try always to connect the previous ip address of the jobmanager that was terminated.
here is the exception :
2021-05-05 12:14:28.126 [flink-akka.actor.default-dispatcher-3] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-7 - Association with remote system [akka.tcp://flink#172.17.0.7:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#172.17.0.7:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
2021-05-05 12:14:28.131 [flink-akka.actor.default-dispatcher-3] ERROR o.a.f.runtime.rest.handler.cluster.ClusterOverviewHandler - Unhandled exception.
org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:386)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:61)
at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:53)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not connect to rpc endpoint under address akka.tcp://flink#172.17.0.7:6123/user/rpc/resourcemanager_0.
at org.apache.flink.runtime.rpc.akka.AkkaRpcService.lambda$resolveActorAddress$10(AkkaRpcService.java:570)
at scala.concurrent.java8.FuturesConvertersImpl$CF$$anon$1.accept(FutureConvertersImpl.scala:59)
... 5 common frames omitted
Caused by: org.apache.flink.runtime.rpc.exceptions.RpcConnectionException: Could not connect to rpc endpoint under address akka.tcp://flink#172.17.0.7:6123/user/rpc/resourcemanager_0.
... 7 common frames omitted
Caused by: akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink#172.17.0.7:6123/), Path(/user/rpc/resourcemanager_0)]
at akka.actor.ActorSelection.$anonfun$resolveOne$1(ActorSelection.scala:71)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:81)
at akka.dispatch.BatchingExecutor.execute(BatchingExecutor.scala:120)
at akka.dispatch.BatchingExecutor.execute$(BatchingExecutor.scala:114)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:80)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:556)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:593)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:582)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:104)
at akka.remote.EndpointWriter.postStop(Endpoint.scala:606)
at akka.actor.Actor.aroundPostStop(Actor.scala:536)
at akka.actor.Actor.aroundPostStop$(Actor.scala:536)
at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:458)
at akka.actor.dungeon.FaultHandling.finishTerminate(FaultHandling.scala:210)
at akka.actor.dungeon.FaultHandling.terminate(FaultHandling.scala:172)
at akka.actor.dungeon.FaultHandling.terminate$(FaultHandling.scala:142)
at akka.actor.ActorCell.terminate(ActorCell.scala:429)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:533)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:549)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:283)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Related

Getting started with keycloak and kubernetes

I am trying to run keycloak on kubernetes so I follow this example on my kubernetes cluster. But when I run command:
kubectl create -f https://raw.githubusercontent.com/keycloak/keycloak-quickstarts/latest/kubernetes-examples/keycloak.yaml
I got database error after 2 minutes. Why I can not run this simple example without error ?
00:15:29,210 INFO [org.keycloak.connections.infinispan.DefaultInfinispanConnectionProviderFactory] (ServerService Thread Pool -- 66) Node name: keycloak-b6b94bd59-v96tk, Site name: null
00:17:39,565 WARN [org.jboss.jca.core.connectionmanager.pool.strategy.OnePool] (ServerService Thread Pool -- 66) IJ000604: Throwable while attempting to get a new connection: null: javax.resource.ResourceException: IJ031084: Unable to create connection
at org.jboss.ironjacamar.jdbcadapters#1.4.23.Final//org.jboss.jca.adapters.jdbc.local.LocalManagedConnectionFactory.createLocalManagedConnection(LocalManagedConnectionFactory.java:345)
at org.jboss.ironjacamar.jdbcadapters#1.4.23.Final//org.jboss.jca.adapters.jdbc.local.LocalManagedConnectionFactory.getLocalManagedConnection(LocalManagedConnectionFactory.java:352)
at org.jboss.ironjacamar.jdbcadapters#1.4.23.Final//org.jboss.jca.adapters.jdbc.local.LocalManagedConnectionFactory.createManagedConnection(LocalManagedConnectionFactory.java:287)
at org.jboss.ironjacamar.impl#1.4.23.Final//org.jboss.jca.core.connectionmanager.pool.mcp.SemaphoreConcurrentLinkedDequeManagedConnectionPool.createConnectionEventListener(SemaphoreConcurrentLinkedDequeManagedConnectionPool.java:1322)
at org.jboss.ironjacamar.impl#1.4.23.Final//org.jboss.jca.core.connectionmanager.pool.mcp.SemaphoreConcurrentLinkedDequeManagedConnectionPool.getConnection(SemaphoreConcurrentLinkedDequeManagedConnectionPool.java:499)
at org.jboss.ironjacamar.impl#1.4.23.Final//org.jboss.jca.core.connectionmanager.pool.AbstractPool.getSimpleConnection(AbstractPool.java:632)
at org.jboss.ironjacamar.impl#1.4.23.Final//org.jboss.jca.core.connectionmanager.pool.AbstractPool.getConnection(AbstractPool.java:604)
at org.jboss.ironjacamar.impl#1.4.23.Final//org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:624)
at org.jboss.ironjacamar.impl#1.4.23.Final//org.jboss.jca.core.connectionmanager.tx.TxConnectionManagerImpl.getManagedConnection(TxConnectionManagerImpl.java:440)
at org.jboss.ironjacamar.impl#1.4.23.Final//org.jboss.jca.core.connectionmanager.AbstractConnectionManager.allocateConnection(AbstractConnectionManager.java:789)
at org.jboss.ironjacamar.jdbcadapters#1.4.23.Final//org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:151)
at org.jboss.as.connector#21.0.2.Final//org.jboss.as.connector.subsystems.datasources.WildFlyDataSource.getConnection(WildFlyDataSource.java:64)
at org.keycloak.keycloak-model-jpa#12.0.4//org.keycloak.connections.jpa.DefaultJpaConnectionProviderFactory.getConnection(DefaultJpaConnectionProviderFactory.java:371)
at org.keycloak.keycloak-model-jpa#12.0.4//org.keycloak.connections.jpa.updater.liquibase.lock.LiquibaseDBLockProvider.lazyInit(LiquibaseDBLockProvider.java:65)
at org.keycloak.keycloak-model-jpa#12.0.4//org.keycloak.connections.jpa.updater.liquibase.lock.LiquibaseDBLockProvider.lambda$waitForLock$2(LiquibaseDBLockProvider.java:96)
at org.keycloak.keycloak-server-spi-private#12.0.4//org.keycloak.models.utils.KeycloakModelUtils.suspendJtaTransaction(KeycloakModelUtils.java:654)
at org.keycloak.keycloak-model-jpa#12.0.4//org.keycloak.connections.jpa.updater.liquibase.lock.LiquibaseDBLockProvider.waitForLock(LiquibaseDBLockProvider.java:94)
at org.keycloak.keycloak-services#12.0.4//org.keycloak.services.resources.KeycloakApplication$1.run(KeycloakApplication.java:136)
at org.keycloak.keycloak-server-spi-private#12.0.4//org.keycloak.models.utils.KeycloakModelUtils.runJobInTransaction(KeycloakModelUtils.java:228)
at org.keycloak.keycloak-services#12.0.4//org.keycloak.services.resources.KeycloakApplication.startup(KeycloakApplication.java:129)
at org.keycloak.keycloak-wildfly-extensions#12.0.4//org.keycloak.provider.wildfly.WildflyPlatform.onStartup(WildflyPlatform.java:29)
at org.keycloak.keycloak-services#12.0.4//org.keycloak.services.resources.KeycloakApplication.<init>(KeycloakApplication.java:115)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.jboss.resteasy.resteasy-jaxrs#3.13.2.Final//org.jboss.resteasy.core.ConstructorInjectorImpl.construct(ConstructorInjectorImpl.java:152)
at org.jboss.resteasy.resteasy-jaxrs#3.13.2.Final//org.jboss.resteasy.spi.ResteasyProviderFactory.createProviderInstance(ResteasyProviderFactory.java:2815)
at org.jboss.resteasy.resteasy-jaxrs#3.13.2.Final//org.jboss.resteasy.spi.ResteasyDeployment.createApplication(ResteasyDeployment.java:371)
at org.jboss.resteasy.resteasy-jaxrs#3.13.2.Final//org.jboss.resteasy.spi.ResteasyDeployment.startInternal(ResteasyDeployment.java:283)
at org.jboss.resteasy.resteasy-jaxrs#3.13.2.Final//org.jboss.resteasy.spi.ResteasyDeployment.start(ResteasyDeployment.java:93)
at org.jboss.resteasy.resteasy-jaxrs#3.13.2.Final//org.jboss.resteasy.plugins.server.servlet.ServletContainerDispatcher.init(ServletContainerDispatcher.java:140)
at org.jboss.resteasy.resteasy-jaxrs#3.13.2.Final//org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.init(HttpServletDispatcher.java:42)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.LifecyleInterceptorInvocation.proceed(LifecyleInterceptorInvocation.java:117)
at org.wildfly.extension.undertow#21.0.2.Final//org.wildfly.extension.undertow.security.RunAsLifecycleInterceptor.init(RunAsLifecycleInterceptor.java:78)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.LifecyleInterceptorInvocation.proceed(LifecyleInterceptorInvocation.java:103)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.ManagedServlet$DefaultInstanceStrategy.start(ManagedServlet.java:305)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.ManagedServlet.createServlet(ManagedServlet.java:145)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.DeploymentManagerImpl$2.call(DeploymentManagerImpl.java:588)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.DeploymentManagerImpl$2.call(DeploymentManagerImpl.java:559)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:42)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43)
at org.wildfly.extension.undertow#21.0.2.Final//org.wildfly.extension.undertow.security.SecurityContextThreadSetupAction.lambda$create$0(SecurityContextThreadSetupAction.java:105)
at org.wildfly.extension.undertow#21.0.2.Final//org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at org.wildfly.extension.undertow#21.0.2.Final//org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at org.wildfly.extension.undertow#21.0.2.Final//org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at org.wildfly.extension.undertow#21.0.2.Final//org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1530)
at io.undertow.servlet#2.2.2.Final//io.undertow.servlet.core.DeploymentManagerImpl.start(DeploymentManagerImpl.java:601)
at org.wildfly.extension.undertow#21.0.2.Final//org.wildfly.extension.undertow.deployment.UndertowDeploymentService.startContext(UndertowDeploymentService.java:97)
at org.wildfly.extension.undertow#21.0.2.Final//org.wildfly.extension.undertow.deployment.UndertowDeploymentService$1.run(UndertowDeploymentService.java:78)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at org.jboss.threads#2.4.0.Final//org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
at org.jboss.threads#2.4.0.Final//org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1990)
at org.jboss.threads#2.4.0.Final//org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1486)
at org.jboss.threads#2.4.0.Final//org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1377)
at java.base/java.lang.Thread.run(Thread.java:834)
at org.jboss.threads#2.4.0.Final//org.jboss.threads.JBossThread.run(JBossThread.java:513)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at com.mysql.jdbc#8.0.22//com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
at com.mysql.jdbc#8.0.22//com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
at com.mysql.jdbc#8.0.22//com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
at com.mysql.jdbc#8.0.22//com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:456)
at com.mysql.jdbc#8.0.22//com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:246)
at com.mysql.jdbc#8.0.22//com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:198)
at org.jboss.ironjacamar.jdbcadapters#1.4.23.Final//org.jboss.jca.adapters.jdbc.local.LocalManagedConnectionFactory.createLocalManagedConnection(LocalManagedConnectionFactory.java:321)
... 57 more
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at com.mysql.jdbc#8.0.22//com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61)
at com.mysql.jdbc#8.0.22//com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105)
at com.mysql.jdbc#8.0.22//com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151)
at com.mysql.jdbc#8.0.22//com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167)
at com.mysql.jdbc#8.0.22//com.mysql.cj.protocol.a.NativeSocketConnection.connect(NativeSocketConnection.java:89)
at com.mysql.jdbc#8.0.22//com.mysql.cj.NativeSession.connect(NativeSession.java:144)
at com.mysql.jdbc#8.0.22//com.mysql.cj.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:956)
at com.mysql.jdbc#8.0.22//com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:826)
... 61 more
Caused by: java.net.ConnectException: Connection timed out (Connection timed out)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.base/java.net.Socket.connect(Socket.java:609)
at com.mysql.jdbc#8.0.22//com.mysql.cj.protocol.StandardSocketFactory.connect(StandardSocketFactory.java:155)
at com.mysql.jdbc#8.0.22//com.mysql.cj.protocol.a.NativeSocketConnection.connect(NativeSocketConnection.java:63)
... 64 more
00:17:39,573 FATAL [org.keycloak.services] (ServerService Thread Pool -- 66) Error during startup: java.lang.RuntimeException: Failed to connect to database
your error is clearly informing keylock is trying to connect with the database but can't get it there might be an issue with your configuration file or environment you are passing.
You can try this out : https://github.com/harsh4870/Keycloack-postgres-kubernetes-deployment
let me know if these files don't work. it's not for a production use case but you can use it for setting up the dev environment.

Connection refused to Schema Registry

I have installed the new version of confluent i.e 5.4 and since after that I am unable to connect to the confluent, my schema registry also gets terminated untimely.
Today when I started the confluent and tried to produce the data, I recieved the following error:
2020-03-05 12:25:00,453] ERROR Failed to send HTTP request to endpoint: http://localhost:8081/subjects/avro-key/versions (io.confluent.kafka.schemaregistry.client.rest.RestService:245)
java.net.ConnectException: Connection refused (Connection refused)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
at java.base/java.net.Socket.connect(Socket.java:609)
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1248)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1362)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1337)
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:241)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:322)
at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:422)
at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:414)
at io.confluent.kafka.schemaregistry.client.rest.RestService.registerSchema(RestService.java:400)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.registerAndGetId(CachedSchemaRegistryClient.java:140)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.register(CachedSchemaRegistryClient.java:196)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.register(CachedSchemaRegistryClient.java:172)
at io.confluent.kafka.serializers.AbstractKafkaAvroSerializer.serializeImpl(AbstractKafkaAvroSerializer.java:71)
at io.confluent.kafka.formatter.AvroMessageReader.readMessage(AvroMessageReader.java:199)
at kafka.tools.ConsoleProducer$.main(ConsoleProducer.scala:55)
at kafka.tools.ConsoleProducer.main(ConsoleProducer.scala)'
Updated the question with the Schema-registry logs:
INFO Logging initialized #865ms to org.eclipse.jetty.util.log.Slf4jLog (org.eclipse.jetty.util.log:169)
[2020-03-09 12:35:51,851] INFO Adding listener: http://0.0.0.0:8081 (io.confluent.rest.ApplicationServer:316)
[2020-03-09 12:35:52,366] INFO Created schema registry namespace localhost:2181 /schema_registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryConfig:709)
[2020-03-09 12:35:53,329] INFO Initializing KafkaStore with broker endpoints: PLAINTEXT://LAP-LIN-897:9092 (io.confluent.kafka.schemaregistry.storage.KafkaStore:108)
[2020-03-09 12:38:03,215] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:77)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:248)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:75)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.configureBaseApplication(SchemaRegistryRestApplication.java:90)
at io.confluent.rest.Application.configureHandler(Application.java:217)
at io.confluent.rest.ApplicationServer.doStart(ApplicationServer.java:185)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:43)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: Timed out trying to create or validate schema topic configuration
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:177)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:119)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:246)
... 6 more
Caused by: java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:170)
... 8 more
'

Kafka - SaslAuthenticationException: Failed to configure SaslClientAuthenticator

I connect my app to Kafka through SASL_SSL using keytab. At first it worked well, but once it got an error. How to fix this and avoid such errors in future? stacktrace is bellow.
java.io.IOException: Channel could not be created for socket java.nio.channels.SocketChannel[closed]
at org.apache.kafka.common.network.Selector.buildAndAttachKafkaChannel(Selector.java:298)
at org.apache.kafka.common.network.Selector.registerChannel(Selector.java:280)
at org.apache.kafka.common.network.Selector.connect(Selector.java:215)
at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
at org.apache.kafka.clients.NetworkClient.access$700(NetworkClient.java:64)
at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1035)
at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:920)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:508)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.kafka.common.KafkaException: org.apache.kafka.common.errors.SaslAuthenticationException: Failed to configure SaslClientAuthenticator
at org.apache.kafka.common.network.SaslChannelBuilder.buildChannel(SaslChannelBuilder.java:205)
at org.apache.kafka.common.network.Selector.buildAndAttachKafkaChannel(Selector.java:289)
... 10 common frames omitted
Caused by: org.apache.kafka.common.errors.SaslAuthenticationException: Failed to configure SaslClientAuthenticator
Caused by: org.apache.kafka.common.KafkaException: Principal could not be determined from Subject, this may be a transient failure due to Kerberos re-login
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.firstPrincipal(SaslClientAuthenticator.java:441)
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.<init>(SaslClientAuthenticator.java:135)
at org.apache.kafka.common.network.SaslChannelBuilder.buildClientAuthenticator(SaslChannelBuilder.java:244)
at org.apache.kafka.common.network.SaslChannelBuilder.buildChannel(SaslChannelBuilder.java:194)
at org.apache.kafka.common.network.Selector.buildAndAttachKafkaChannel(Selector.java:289)
at org.apache.kafka.common.network.Selector.registerChannel(Selector.java:280)
at org.apache.kafka.common.network.Selector.connect(Selector.java:215)
at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:864)
at org.apache.kafka.clients.NetworkClient.access$700(NetworkClient.java:64)
at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1035)
at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:920)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:508)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.base/java.lang.Thread.run(Thread.java:834)

Apache Flink Zookeeper Checkpoint Conflict After Upgrade

I recently upgraded an Apache Flink cluster from 1.3.2 to 1.4.2 and now I get following exception in Zookeeper Logs
2018-06-19 18:45:34,658 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor#649] - Got user-level KeeperException when processing sessionid:0x363f3574f420001 type:create cxid:0x2284d zxid:0x900016e9d txntype:-1 reqpath:n/a Error Path:/flink/cluster_one/checkpoints/5e8ad58b9f2ef81b155d0e15b23d2365/0000000000000010305/81783429-1405-4609-9cdb-ce9dc95b8272 Error:KeeperErrorCode = NodeExists for /flink/cluster_one/checkpoints/5e8ad58b9f2ef81b155d0e15b23d2365/0000000000000010305/81783429-1405-4609-9cdb-ce9dc95b8272
Because of this the Apache Flink node where this is thrown is repeatedly kicked out of the cluster and then rejoins.
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#tm03-dev:6124/user/taskmanager#-50864731]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
... 1 more
06/22/2018 16:11:29 Job execution switched to status FAILING.
java.lang.Exception: Cannot deploy task Source: Custom Source -> StringToJSONEvents (1/1) (a78bb6a5d7c90a7b5cee785bc4ac2426) - TaskManager (37d2c90dd316c87974dda50e0b4525d6 # tm03-dev (dataPort=6125)) not responding after a timeout of 10000 ms
at org.apache.flink.runtime.executiongraph.Execution.lambda$deploy$3(Execution.java:529)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#tm03-dev:6124/user/taskmanager#-50864731]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
... 1 more
06/22/2018 16:11:29 FilterLoginFailedEvents -> Sink: SendRequestToService(1/1) switched to CANCELING
06/22/2018 16:11:29 FilterLoginFailedEvents -> Sink: SendRequestToService(1/1) switched to CANCELED
In TaskManager logs this keeps repeating
2018-06-22 15:59:45.168 [main-SendThread(ip-10-11-21-15.domain:2181)] DEBUG o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x163f3574f380005 after 0ms
2018-06-22 15:59:58.513 [main-SendThread(ip-10-11-21-15.domain:2181)] DEBUG o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x163f3574f380005 after 0ms
2018-06-22 16:00:10.658 [flink-akka.actor.default-dispatcher-8172] INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#flink-jobmanager01-flink-dev.domain.com:26229/user/jobmanager (attempt 9305, timeout: 30000 milliseconds)
2018-06-22 16:00:11.858 [main-SendThread(ip-10-11-21-15.domain:2181)] DEBUG o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x163f3574f380005 after 0ms
2018-06-22 16:00:14.971 [flink-akka.actor.default-dispatcher-9313] DEBUG akka.remote.transport.ProtocolStateActor flink-akka.remote.default-remote-dispatcher-15 - Association between local [tcp://flink#tm03-flink-dev:25422] and remote [tcp://flink#flink-jobmanager01-flink-dev.domain.com:26229] was disassociated because the ProtocolStateActor failed: Unknown
2018-06-22 16:00:14.971 [flink-akka.actor.default-dispatcher-9313] DEBUG akka.remote.transport.ProtocolStateActor flink-akka.remote.default-remote-dispatcher-15 - Association between local [tcp://flink#tm03-flink-dev:25422] and remote [tcp://flink#flink-jobmanager01-flink-dev.domain.com:26229] was disassociated because the ProtocolStateActor failed: Unknown
2018-06-22 16:00:14.971 [flink-akka.actor.default-dispatcher-9313] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-15 - Association with remote system [akka.tcp://flink#flink-jobmanager01-flink-dev.domain.com:26229] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2018-06-22 16:00:14.971 [flink-akka.actor.default-dispatcher-9313] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-15 - Association with remote system [akka.tcp://flink#flink-jobmanager01-flink-dev.domain.com:26229] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2018-06-22 16:00:14.971 [flink-akka.actor.default-dispatcher-9313] DEBUG akka.remote.transport.netty.NettyTransport New I/O worker #2 - Remote connection to [flink-jobmanager01-flink-dev.domain.com/10.11.21.24:26229] was disconnected because of [id: 0x735f1b8b, /10.11.21.13:25422 :> flink-jobmanager01-flink-dev.domain.com/10.11.21.24:26229] DISCONNECTED
2018-06-22 16:00:14.971 [flink-akka.actor.default-dispatcher-9313] DEBUG akka.remote.transport.netty.NettyTransport New I/O worker #2 - Remote connection to [flink-jobmanager01-flink-dev.domain.com/10.11.21.24:26229] was disconnected because of [id: 0x735f1b8b, /10.11.21.13:25422 :> flink-jobmanager01-flink-dev.domain.com/10.11.21.24:26229] DISCONNECTED
2018-06-22 16:00:25.199 [main-SendThread(ip-10-11-21-15.domain:2181)] DEBUG o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x163f3574f380005 after 0ms
2018-06-22 16:00:38.539 [main-SendThread(ip-10-11-21-15.domain:2181)] DEBUG o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x163f3574f380005 after 0ms
2018-06-22 16:00:40.679 [flink-akka.actor.default-dispatcher-8172] INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#flink-jobmanager01-flink-dev.domain.com:26229/user/jobmanager (attempt 9306, timeout: 30000 milliseconds)
Another clue, on the leader JobManager such 2 messages are displayed in the log every few seconds:
2018-06-27 17:33:46.632 [jobmanager-future-thread-2] DEBUG o.a.flink.runtime.rest.handler.legacy.metrics.MetricFetcher - Could not retrieve QueryServiceGateway.
java.util.concurrent.CompletionException: akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink#tm03-dev:6124/), Path(/user/MetricQueryService_64bde0e9e6f3f0a906a30e88c261c9d7)]
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:442)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
at scala.concurrent.Promise$class.failure(Promise.scala:104)
at scala.concurrent.impl.Promise$DefaultPromise.failure(Promise.scala:157)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:68)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:76)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:75)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:558)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:595)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:584)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:98)
at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:353)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink#tm03-dev:6124/), Path(/user/MetricQueryService_64bde0e9e6f3f0a906a30e88c261c9d7)]
... 27 common frames omitted
2018-06-27 17:34:01.625 [flink-akka.actor.default-dispatcher-19] DEBUG org.apache.flink.runtime.webmonitor.RuntimeMonitorHandler - Error while handling request.
java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.NotFoundException: Could not find job 93d6fa4fb5b2355bb03253cb80d81ef5.
at org.apache.flink.runtime.rest.handler.legacy.AbstractExecutionGraphRequestHandler.lambda$handleJsonRequest$0(AbstractExecutionGraphRequestHandler.java:70)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.lambda$getExecutionGraph$0(ExecutionGraphCache.java:130)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:444)
at akka.dispatch.OnComplete.internal(Future.scala:259)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
at org.apache.flink.runtime.jobmanager.MemoryArchivist$$anonfun$handleMessage$1.applyOrElse(MemoryArchivist.scala:123)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at org.apache.flink.runtime.jobmanager.MemoryArchivist.aroundReceive(MemoryArchivist.scala:65)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.rest.NotFoundException: Could not find job 93d6fa4fb5b2355bb03253cb80d81ef5.
... 53 common frames omitted
What does this exception mean?
Are we supposed to erase contents of Zookeeper folders (high-availability.zookeeper.path.root) before upgrade?
This question was related to Apache Flink 1.4.2 akka.actor.ActorNotFound.
After we
restarted the JobManagers
increased TaskManager Memory size from 1GB to 2GB
The issue seems to be gone and the cluster is working fine now.

zookeeper messos configuration issue

I'm following this guide to configure messos 3 node master and 3 node slave cluster. However when I start master zookeepers I get following error log
2017-07-05 09:46:18,568 - INFO [main:FileSnap#83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.100000016
2017-07-05 09:46:18,606 - ERROR [main:FileTxnSnapLog#210] - Parent /mesos/log_replicas missing for /mesos/log_replicas/0000000002
2017-07-05 09:46:18,607 - ERROR [main:QuorumPeer#453] - Unable to load database on disk
java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
... 6 more
2017-07-05 09:46:18,610 - ERROR [main:QuorumPeerMain#89] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:454)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
... 4 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
... 6 more
When slaves are started obviously it cannot discover the masters since it cannot connect to zookeeper. Slaves gives this error
I0705 09:33:43.593530 25710 provisioner.cpp:410] Provisioner recovery complete
I0705 09:33:43.593668 25710 slave.cpp:5970] Finished recovery
W0705 09:33:53.529522 25717 group.cpp:494] Timed out waiting to connect to ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
I0705 09:33:53.530243 25717 group.cpp:510] ZooKeeper session expired
W0705 09:34:03.532635 25710 group.cpp:494] Timed out waiting to connect to ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
I0705 09:34:03.533331 25710 group.cpp:510] ZooKeeper session expired
Any ideas how to troubleshoot this.
Reinstalling master nodes solved the first problem.
Still I had the 2nd problem, where slaves could not find zookeeper. Documentation seems to indicate slaves could discover the master nodes. Was not working for me. However when I pointed zookeeper nodes in slaves in (/etc/mesos/zk) it started working