We have an application in production.
Logs are suffixed with the date.
When I look at the sizes of the logs, I notice that some of them are less than 1 KB with the following contents:
16:54:44,497 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) com.arjuna.ats.arjuna.exceptions.ObjectStoreException: ShadowingStore::read_state err$
16:54:44,538 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) ARJUNA012154: RecoverAtomicAction: transaction 0:ffff0ab2c460:5954aed8:558ecf50:989db$
16:54:44,540 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) ARJUNA012155: RecoverAtomicAction - tried to move failed activation log 0:ffff0ab2c46$
16:54:44,541 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) com.arjuna.ats.arjuna.exceptions.ObjectStoreException: ShadowingStore::read_state err$
16:54:44,544 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) ARJUNA012154: RecoverAtomicAction: transaction 0:ffff0ab2c460:5954aed8:558ecf50:989d2$
16:54:44,545 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) ARJUNA012155: RecoverAtomicAction - tried to move failed activation log 0:ffff0ab2c46$
16:59:04,549 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) com.arjuna.ats.arjuna.exceptions.ObjectStoreException: ShadowingStore::read_state err$
16:59:04,552 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) ARJUNA012154: RecoverAtomicAction: transaction 0:ffff0ab2c460:5954aed8:558ecf50:98ef3$
16:59:04,553 WARN [com.arjuna.ats.arjuna] (Periodic Recovery) ARJUNA012155: RecoverAtomicAction - tried to move failed activation log 0:ffff0ab2c46$
I am certain that:
The application was up and running the whole time.
The application was doing a lot of work (so it is not possible for the logs to be empty).
The other, full log files do not contain these lines (com.arjuna.ats.arjuna.exceptions.ObjectStoreException).
I also note that:
The timestamps in these logs start at 16:54, which means that even if an exception had stopped the logging, we should still have earlier entries.
Apparently this might be a bug, but according to the issues I found about it, the whole application stops.
Related
A JobManager and a TaskManager are running on a single VM, and Kafka runs on the same server as well.
I have 10 tasks; each reads from a different Kafka topic, processes the messages, and writes back to Kafka.
Sometimes I find my TaskManager is down and nothing is working. I tried to figure out the problem by checking the logs, and I believe it is a problem with the Kafka connection. (Or maybe a network problem? But everything is on a single server.)
What I want to ask is: if I lose the connection to Kafka for a short period, what happens? Why do the tasks fail, and, most importantly, why does the TaskManager crash?
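For context, each task is essentially a Kafka-in / Kafka-out pipeline along these lines. This is only a minimal sketch: the bootstrap address, the input topic, the serialization, and the processing step are placeholders, and it assumes the newer KafkaSource/KafkaSink API, which matches the "Sink: Writer -> Sink: Committer" operators in the logs below.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CpuAlgoPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read raw messages from an input topic (topic name is a placeholder;
        // the group id is the one that appears in the logs).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("tele.metrics.cpu")
                .setGroupId("cpualgosgroup1")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Write the processed results back to Kafka.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("tele.alerts.cpu")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                // Stand-in for the real analysis step.
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                .sinkTo(sink);

        env.execute("CPUTemperatureAnalysisAlgorithm");
    }
}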
Some logs:
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-8] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,692 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=telefilter1-0, groupId=telefilter1] Cancelled in-flight FETCH request with correlation id 3630156 due to node 0 being disconnected (elapsed time since creation: 61648ms, elapsed time since send: 61648ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159429 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Cancelled in-flight FETCH request with correlation id 2344708 due to node 0 being disconnected (elapsed time since creation: 51184ms, elapsed time since send: 51184ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159430 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-15] Received invalid metadata error in produce request on partition tele.alerts.cpu-4 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-8] Received invalid metadata error in produce request on partition tele.alerts.cpu-6 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
and then
2022-11-26 23:35:56,673 WARN org.apache.flink.runtime.taskmanager.Task [] - CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
...
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
2022-11-26 23:35:56,682 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0).
2022-11-26 23:35:57,199 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0).
2022-11-26 23:35:57,202 WARN org.apache.flink.runtime.taskmanager.Task [] - TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
Why does the TaskExecutor lose its connection to the JobManager?
If I don't care about data loss, how should I configure the Kafka clients and Flink recovery? I just want the Kafka clients not to die. In particular, I don't want my tasks or TaskManagers to crash. If I lose the connection, is it possible to configure Flink to just wait? If we can't read, wait; and if we can't write back to Kafka, just wait?
The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
Sounds like the server is somewhat overloaded. But you could try increasing the heartbeat timeout.
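In case it helps, here is a minimal sketch of that kind of tuning, assuming a reasonably recent Flink release. heartbeat.timeout is a cluster option that normally lives in flink-conf.yaml (it is only spelled out through the Configuration API here), the numbers are placeholders rather than recommendations, and the restart strategy is the job-level piece that makes a failed task restart the job after a delay instead of leaving it down.

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RecoveryTuningSketch {
    public static void main(String[] args) {
        // Cluster side: give the TaskManager more slack before the JobManager
        // declares its heartbeat lost (the default is 50000 ms). Normally set in
        // flink-conf.yaml as heartbeat.timeout.
        Configuration conf = new Configuration();
        conf.setString("heartbeat.timeout", "300000");

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);

        // Job side: keep restarting with a delay instead of letting one failed
        // task take the whole job down. Attempts/delay are placeholders.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.seconds(30)));

        // The Kafka-side timeouts (request.timeout.ms, delivery.timeout.ms, ...)
        // are passed as producer/consumer properties on the source/sink builders.

        // ... build the Kafka pipeline and call env.execute() as usual ...
    }
}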
I am trying to sync a 10M-row table from both an Oracle and a Postgres database to Snowflake, but both syncs keep failing at some point.
It looks like after a 200 MB file has been processed in Snowflake, it is unable to gzip the data. Below is the error log:
2022-08-03 05:20:19 INFO i.a.w.i.DefaultAirbyteDestination(cancel):125 - Attempting to cancel destination process...
2022-08-03 05:20:19 INFO i.a.w.i.DefaultAirbyteDestination(cancel):130 - Destination process exists, cancelling...
2022-08-03 05:20:19 INFO i.a.w.g.DefaultReplicationWorker(run):175 - One of source or destination thread complete. Waiting on the other.
2022-08-03 05:20:19 INFO i.a.w.g.DefaultReplicationWorker(run):177 - Source and destination threads complete.
2022-08-03 05:20:19 INFO i.a.w.i.DefaultAirbyteDestination(cancel):132 - Cancelled destination process!
2022-08-03 05:20:19 INFO i.a.w.g.DefaultReplicationWorker(cancel):459 - Cancelling source...
2022-08-03 05:20:19 INFO i.a.w.i.DefaultAirbyteSource(cancel):142 - Attempting to cancel source process...
2022-08-03 05:20:19 INFO i.a.w.i.DefaultAirbyteSource(cancel):147 - Source process exists, cancelling...
2022-08-03 05:20:19 WARN i.a.c.i.LineGobbler(voidCall):86 - airbyte-source gobbler IOException: Stream closed. Typically happens when cancelling a job.
2022-08-03 05:20:19 INFO i.a.w.i.DefaultAirbyteSource(cancel):149 - Cancelled source process!
2022-08-03 05:20:19 INFO i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):195 Interrupting worker thread...
2022-08-03 05:20:19 INFO i.a.w.t.TemporalAttemptExecution(lambda$getCancellationChecker$3):198 Cancelling completable future...
2022-08-03 05:20:19 WARN i.a.w.t.CancellationHandler$TemporalCancellationHandler(checkAndHandleCancellation):53 Job either timed out or was cancelled.
2022-08-03 05:20:19 WARN i.a.w.t.CancellationHandler$TemporalCancellationHandler(checkAndHandleCancellation):53 Job either timed out or was cancelled.
I have an Apache Kafka Streams application. I notice that it sometimes shuts down when a rebalance occurs, with no real reason for the shutdown. It doesn't even throw an exception.
Here are some logs from one such occurrence:
[2022-03-08 17:13:37,024] INFO [Consumer clientId=svc-stream-collector-StreamThread-1-consumer, groupId=svc-stream-collector] Adding newly assigned partitions: (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2022-03-08 17:13:37,024] ERROR stream-thread [svc-stream-collector-StreamThread-1] A Kafka Streams client in this Kafka Streams application is requesting to shutdown the application (org.apache.kafka.streams.processor.internals.StreamThread)
[2022-03-08 17:13:37,030] INFO stream-client [svc-stream-collector] State transition from REBALANCING to PENDING_ERROR (org.apache.kafka.streams.KafkaStreams)
old state:REBALANCING new state:PENDING_ERROR
[2022-03-08 17:13:37,031] INFO [Consumer clientId=svc-stream-collector-StreamThread-1-consumer, groupId=svc-stream-collector] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2022-03-08 17:13:37,032] INFO stream-thread [svc-stream-collector-StreamThread-1] Informed to shut down (org.apache.kafka.streams.processor.internals.StreamThread)
[2022-03-08 17:13:37,032] INFO stream-thread [svc-stream-collector-StreamThread-1] State transition from PARTITIONS_REVOKED to PENDING_SHUTDOWN (org.apache.kafka.streams.processor.internals.StreamThread)
[2022-03-08 17:13:37,067] INFO stream-thread [svc-stream-collector-StreamThread-1] Thread state is already PENDING_SHUTDOWN, skipping the run once call after poll request (org.apache.kafka.streams.processor.internals.StreamThread)
[2022-03-08 17:13:37,067] WARN stream-thread [svc-stream-collector-StreamThread-1] Detected that shutdown was requested. All clients in this app will now begin to shutdown (org.apache.kafka.streams.processor.internals.StreamThread)
I suspect it's because there are no newly assigned partitions in the log line below:
[2022-03-08 17:13:37,024] INFO [Consumer clientId=svc-stream-collector-StreamThread-1-consumer, groupId=svc-stream-collector] Adding newly assigned partitions: (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
However, I'm not exactly sure why this error occurs. Any help would be appreciated.
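One thing that may help with "it doesn't even throw an exception": installing an explicit StreamsUncaughtExceptionHandler (available in Kafka Streams 2.8+) logs whatever a stream thread died with and lets you choose the reaction (replace the thread, shut down the client, or shut down the whole application). It won't necessarily intercept a shutdown that is decided during partition assignment rather than by a failed thread, but it is a cheap way to rule that out. A minimal sketch, with a placeholder topology and broker address:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;

import java.util.Properties;

public class CollectorAppSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "svc-stream-collector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic"); // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);

        // Log the cause of a dying stream thread and decide how to react.
        // REPLACE_THREAD keeps the application alive after transient failures;
        // SHUTDOWN_APPLICATION reproduces the behaviour seen in the logs above.
        streams.setUncaughtExceptionHandler(exception -> {
            System.err.println("Stream thread died: " + exception);
            return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD;
        });

        streams.start();
    }
}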
We are migrating existing .war files from JBoss EAP 7.1 to JBoss EAP 7.2. Deployment completes successfully on 7.2, but when the user tries to access the webpage, the browser downloads it instead of rendering it.
==============================================================================================
20:48:10,999 DEBUG [io.undertow.request.error-response] (default task-1) Setting error code 500 for exchange HttpServerExchange{ GET /pfgcEdmCustomUIWebApp/useropco.html request {Connection=[keep-alive], Accept=[text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,/;q=0.8,application/signed-exchange;v=b3;q=0.9], Accept-Language=[en-US,en;q=0.9], Accept-Encoding=[gzip, deflate], User-Agent=[Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36], Upgrade-Insecure-Requests=[1], Host=[njinfmdmt02.performance.pfgc.com:18080]} response {}}: java.lang.RuntimeException
at io.undertow.server.HttpServerExchange.setStatusCode(HttpServerExchange.java:1410) [undertow-core-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:317) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.servlet.handlers.ServletInitialHandler.access$100(ServletInitialHandler.java:81) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:138) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:135) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:48) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at org.wildfly.extension.undertow.security.SecurityContextThreadSetupAction.lambda$create$0(SecurityContextThreadSetupAction.java:105)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:272) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:81) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:104) [undertow-servlet-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.server.Connectors.executeRootHandler(Connectors.java:360) [undertow-core-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:830) [undertow-core-2.0.15.Final-redhat-00001.jar:2.0.15.Final-redhat-00001]
at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1985)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1487)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1349)
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_222]
20:48:14,283 DEBUG [org.apache.activemq.artemis.core.server.impl.QueueImpl] (Thread-7 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5#67650b6d)) Scanning for expires on jms.queue.DLQ
20:48:14,283 DEBUG [org.apache.activemq.artemis.core.server.impl.QueueImpl] (Thread-7 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5#67650b6d)) Scanning for expires on jms.queue.DLQ done
20:48:44,284 DEBUG [org.apache.activemq.artemis.core.server.impl.QueueImpl] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5#67650b6d)) Scanning for expires on jms.queue.DLQ
20:48:44,284 DEBUG [org.apache.activemq.artemis.core.server.impl.QueueImpl] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5#67650b6d)) Scanning for expires on jms.queue.DLQ done
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) PeriodicRecovery: background thread Status <== SCANNING
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) PeriodicRecovery: background thread scanning
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) Periodic recovery first pass at Mon, 2 Nov 2020 20:48:51
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) processing /StateManager/BasicAction/TwoPhaseCoordinator/AtomicAction transactions
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery)
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) AtomicActionRecoveryModule first pass
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) processing /StateManager/BasicAction/TwoPhaseCoordinator/AtomicAction transactions
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery)
20:48:51,826 DEBUG [com.arjuna.ats.txoj] (Periodic Recovery) TORecoveryModule - first pass
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery)
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery)
20:48:51,826 DEBUG [com.arjuna.ats.arjuna] (Periodic Recovery) XARecoveryModule state change IDLE->FIRST_PASS
20:48:51,826 DEBUG [com.arjuna.ats.jta] (Periodic Recovery) Local XARecoveryModule - first pass
20:48:51,826 DEBUG [org.jboss.jca.core.tx.jbossts.XAResourceRecoveryImpl] (Periodic Recovery) Recovery user name=mdmadm
20:48:51,831 DEBUG [org.jboss.jca.core.tx.jbossts.XAResourceRecoveryImpl] (Periodic Recovery) Recovery Subject=Subject:
Principal: mdmadm
Private Credential: javax.resource.spi.security.PasswordCredential#81cfdca
20:48:51,831 DEBUG [org.jboss.jca.core.tx.jbossts.XAResourceRecoveryImpl] (Periodic Recovery) Open managed connection (Subject:
Principal: mdmadm
Private Credential: javax.resource.spi.security.PasswordCredential#81cfdca
)
Our Flink cluster has two JobManagers. Recently the job often goes down whenever the JobManager leader switches, and Flink can't recover the previous job after the switch. The job also does not start automatically when I restart the Flink cluster, so I have to start it manually. According to the log, whenever a new JobManager leader is elected, the connection to the new leader is refused, and this leads to the failure to start the job. On our JobManager server I can't find port 58088 open and listening. I wonder if the communication between Flink and ZooKeeper has a problem. We are using Flink 1.0.3.
What could be the reason for this? Is it a Flink bug? Thanks.
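One observation on the log below, in case it is relevant: the final FileNotFoundException is thrown by LocalFileSystem, i.e. /apps/flink/recovery is being opened as a local path, and for ZooKeeper HA the storage directory is supposed to live on storage every JobManager can reach (HDFS, NFS, etc.). A sketch of the relevant keys follows; the names are the old recovery.* ones from the 1.0.x/1.1.x generation as I recall them (newer releases call them high-availability.*), they normally go in flink-conf.yaml, the quorum and path are placeholders, and the Configuration API is used here only to spell them out.

import org.apache.flink.configuration.Configuration;

public class HaConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Enable ZooKeeper-based high availability.
        conf.setString("recovery.mode", "zookeeper");
        conf.setString("recovery.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
        // Must be a directory that BOTH JobManagers can read, e.g. HDFS or NFS,
        // not a path that exists only on one machine.
        conf.setString("recovery.zookeeper.storageDir", "hdfs:///flink/recovery");
    }
}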
Log:
2017-03-09 15:32:41,211 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-03-09 15:32:41,243 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#172.27.163.235:58088/user/jobmanager:36e428ba-0af3-4e39-90d4-106b7779f94a.
2017-03-09 15:32:41,318 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink#172.27.163.235:58088]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.27.163.235:58088
2017-03-09 15:32:41,325 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink#172.27.163.235:58088/), Path(/user/jobmanager)]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-09 15:32:48,029 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink#172.27.163.227:36876/user/jobmanager was granted leadership with leader session ID Some(ff50dc37-048e-4d95-93f5-df788c06725c).
2017-03-09 15:32:48,037 INFO org.apache.flink.runtime.jobmanager.JobManager - Delaying recovery of all jobs by 10000 milliseconds.
2017-03-09 15:32:48,038 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#172.27.163.227:36876/user/jobmanager:ff50dc37-048e-4d95-93f5-df788c06725c.
2017-03-09 15:32:49,038 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app87 (akka.tcp://flink#172.27.165.66:32781/user/taskmanager) as e3a9364cd8deeebf8a757d3979c5ae55. Current number of registered hosts is 1. Current number of alive task slots is 4.
2017-03-09 15:32:49,044 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app83 (akka.tcp://flink#172.27.165.58:40972/user/taskmanager) as 0c980a4c64189d975aa71cb97b1ecb7c. Current number of registered hosts is 2. Current number of alive task slots is 8.
2017-03-09 15:32:49,427 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app27 (akka.tcp://flink#172.27.164.5:50762/user/taskmanager) as 7dcc90275bd63cbcda8361bfe00cb6e8. Current number of registered hosts is 3. Current number of alive task slots is 12.
2017-03-09 15:32:49,676 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app26 (akka.tcp://flink#172.27.163.245:41734/user/taskmanager) as ab62098118261dcaa2d218ea17aa8117. Current number of registered hosts is 4. Current number of alive task slots is 16.
2017-03-09 15:32:49,916 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app84 (akka.tcp://flink#172.27.165.60:53871/user/taskmanager) as 012f186f437e7ba95111ff61d206dae6. Current number of registered hosts is 5. Current number of alive task slots is 20.
2017-03-09 15:32:49,930 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app85 (akka.tcp://flink#172.27.165.62:50068/user/taskmanager) as 68506e37647dfbff11ae193f20a7b624. Current number of registered hosts is 6. Current number of alive task slots is 24.
2017-03-09 15:32:50,658 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app86 (akka.tcp://flink#172.27.165.64:57339/user/taskmanager) as c1e922599fae53e6edc78a2add4edb61. Current number of registered hosts is 7. Current number of alive task slots is 28.
2017-03-09 15:32:50,780 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app25 (akka.tcp://flink#172.27.163.241:45878/user/taskmanager) as 3ee2f5d3cb8003df5531d444bd11890c. Current number of registered hosts is 8. Current number of alive task slots is 32.
2017-03-09 15:32:58,054 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover all jobs.
2017-03-09 15:32:58,083 ERROR org.apache.flink.runtime.jobmanager.JobManager - Fatal error: Failed to recover jobs.
java.io.FileNotFoundException: /apps/flink/recovery/submittedJobGraphb6357063f81b (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:52)
at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143)
at org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:51)
at org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraphs(ZooKeeperSubmittedJobGraphStore.java:173)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply$mcV$sp(JobManager.scala:433)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:429)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:429)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:429)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:425)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:425)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)