JobManager doesn't automatically redirect all requests to the remaining / running TaskManager

Problem Description
Two machines (203 and 204) form a standalone-mode HA Flink v1.6.1 cluster; each machine runs both a JobManager and a TaskManager (2 task slots).
After I start a job on the JobManager node (e.g. ./flink run ../examples/streaming/SocketWindowWordCount.jar --hostname 10.1.2.9 --port 9000), I kill the TaskManager instance that is running it.
In the Web Dashboard I can see the job being cancelled and then failing (see the Web Dashboard screenshot).
flink-conf.yaml
state.backend: filesystem
state.checkpoints.dir: hdfs://10.1.2.109:8020/wulin/flink-checkpoints
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/flink/deploy/webTmp
web.log.path: /home/flink/deploy/log
io.tmp.dirs: /home/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink
high-availability.storageDir: hdfs://10.1.2.109:8020/wulin
security.kerberos.login.principal: xxxx
security.kerberos.login.keytab: /home/ctu/flink/flink-1.6/conf/user.keytab
Full logs (attached):
log-standalonesession-203
log-taskexecutor-203
log-standalonesession-204
Exception
When I kill the working TaskManager, I get an exception like this:
2018-12-28 11:04:27,877 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,660 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hz203/10.0.0.203:42861
2018-12-28 11:04:28,660 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Closing TaskExecutor connection 0f41bca09600cd25000e19801076fa1f because: The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager dcf3bb5b7ed2208cf45b658d212fd8d2 from the SlotManager.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (88aa62ad152f4df6b39a969dd32c0249) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot 0f41bca09600cd25000e19801076fa1f_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1116)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-12-28 11:04:28,680 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (61f55876e79934d515c163d095d706a6) switched from state RUNNING to FAILING.
Submit job
Running ./bin/flink run -d ./examples/streaming/SocketWindowWordCount.jar --port 9000 --hostname 10.1.2.9 produces JobManager logs like this:
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291)
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291) switched from state CREATED to RUNNING.
2018-12-28 19:20:01,356 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,359 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,364 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e33a40832a3922897470fb76bcf76b29}]
2018-12-28 19:20:01,367 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink#hz203:46596/user/resourcemanager(b22f96303e74df23645fe4567f884b9e)
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/5cdb91c15ee12ec6e74256eed10b5291/job_manager_lock.
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6#akka.tcp://flink#hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,431 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6#akka.tcp://flink#hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,432 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: b22f96303e74df23645fe4567f884b9e.
2018-12-28 19:20:01,433 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{e33a40832a3922897470fb76bcf76b29}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-12-28 19:20:01,434 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 5cdb91c15ee12ec6e74256eed10b5291 with allocation id AllocationID{f7a24e609e2ec618ccb456076049fa3b}.
2018-12-28 19:20:01,510 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,511 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: Socket Stream -> Flat Map (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,674 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:01,708 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:43,267 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-513fbe1e6ddf69d10689eccf4c65da97 from hz203/10.0.0.203:6124
2018-12-28 19:20:48,339 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-dd915bb9821ff6ced34dd5e489966b674de5a48f-7ea2600930e5fc5a4fbb7d47ee198789 from hz203/10.0.0.203:6124
2018-12-28 19:20:52,623 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-0bd1ab86fa4cc54daeb472079bfbea8c from hz203/10.0.0.203:6124
Kill TM
The question body is limited to 30,000 characters, so please see the attached JobManager logs from when the TaskManager is killed.

The logs indicate that your RestartStrategy has depleted its restart attempts or that no RestartStrategy has been configured. Please check whether you specified a RestartStrategy in your program via env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L)) or in flink-conf.yaml via restart-strategy: fixed-delay. If you want to learn more about Flink's restart strategies, check out the documentation.
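For illustration, a minimal sketch of both options, using the values mentioned above (10 attempts, no delay between them); adjust them to your own failover needs.

In the job code (Scala API):

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// restart the job up to 10 times, with 0 ms delay between attempts
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L))

Or, equivalently, in flink-conf.yaml:

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 0 s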

Related

Losing connection to Kafka. What happens?

A JobManager and a TaskManager are running on a single VM; Kafka also runs on the same server.
I have 10 tasks that all read from different Kafka topics, process messages, and write back to Kafka.
Sometimes I find my TaskManager is down and nothing is working. I tried to figure out the problem by checking the logs, and I believe it is a problem with the Kafka connection (or maybe a network problem, but everything is on a single server).
What I want to ask is: if I lose the connection to Kafka for a short period, what happens? Why are the tasks failing, and most importantly, why does the TaskManager crash?
Some logs:
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-8] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,692 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=telefilter1-0, groupId=telefilter1] Cancelled in-flight FETCH request with correlation id 3630156 due to node 0 being disconnected (elapsed time since creation: 61648ms, elapsed time since send: 61648ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159429 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Cancelled in-flight FETCH request with correlation id 2344708 due to node 0 being disconnected (elapsed time since creation: 51184ms, elapsed time since send: 51184ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159430 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-15] Received invalid metadata error in produce request on partition tele.alerts.cpu-4 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-8] Received invalid metadata error in produce request on partition tele.alerts.cpu-6 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
and then
2022-11-26 23:35:56,673 WARN org.apache.flink.runtime.taskmanager.Task [] - CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
...
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
2022-11-26 23:35:56,682 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0).
2022-11-26 23:35:57,199 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0).
2022-11-26 23:35:57,202 WARN org.apache.flink.runtime.taskmanager.Task [] - TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
Why does the TaskExecutor lose its connection to the JobManager?
If I don't care about any data loss, how should I configure the Kafka clients and Flink's recovery? I just want the Kafka client not to die. In particular, I don't want my tasks or TaskManagers to crash. If I lose the connection, is it possible to configure Flink to simply wait? If we can't read, wait; and if we can't write back to Kafka, just wait?
The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
Sounds like the server is somewhat overloaded. But you could try increasing the heartbeat timeout.
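For reference, the heartbeat settings live in flink-conf.yaml; a hedged sketch (the enlarged timeout is purely illustrative, Flink's defaults are a 10 s interval and a 50 s timeout):

# how often heartbeat requests are sent, in milliseconds (Flink default)
heartbeat.interval: 10000
# how long to wait for a heartbeat before the peer is considered dead, in milliseconds (default 50000)
heartbeat.timeout: 180000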

Kubernetes Flink TaskManager not reconnecting to the new JobManager

I set up a Flink cluster on Kubernetes (K8s).
The JobManager (JM) pod started.
The TaskManager (TM) pod keeps giving the error below:
2020-10-31 11:08:54,895 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager/10.96.28.96:6123
2020-10-31 11:08:54,906 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink#flink-jobmanager:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink#flink-jobmanager:6123/user/rpc/resourcemanager_*.
2020-10-31 11:09:14,946 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink#flink-jobmanager:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink#flink-jobmanager:6123/user/rpc/resourcemanager_*.
2020-10-31 11:09:24,946 INFO akka.remote.transport.ProtocolStateActor [] - No response from remote for outbound association. Associate timed out after [20000 ms].
2020-10-31 11:09:24,947 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink#flink-jobmanager:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#flink-jobmanager:6123]] Caused by: [No response from remote for outbound association. Associate timed out after [20000 ms].]
2020-10-31 11:09:24,956 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException: connection timed out: flink-jobmanager/10.96.28.96:6123
2020-10-31 11:09:24,967 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink#flink-jobmanager:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink#flink-jobmanager:6123/user/rpc/resourcemanager_*.
2020-10-31 11:09:45,006 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink#flink-jobmanager:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink#flink-jobmanager:6123/user/rpc/resourcemanager_*.
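No answer is recorded here, but the repeated "Could not resolve ResourceManager address" messages suggest the TaskManager cannot reach the JobManager's RPC endpoint behind the flink-jobmanager address. As a hedged sketch (the service name and port are taken from the log above, not from a verified manifest), the TaskManagers' flink-conf.yaml has to point at a Kubernetes Service that fronts whichever JobManager pod is currently running:

# flink-conf.yaml on the TaskManagers (illustrative)
jobmanager.rpc.address: flink-jobmanager
jobmanager.rpc.port: 6123

If that Service's selector no longer matches the new JobManager pod, or its endpoints are stale, the TaskManagers keep retrying against a dead address, which is exactly the pattern in the log above.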

To achieve Kafka-Mule integration using Mule 4, I installed ZooKeeper and Kafka on my machine; both server connections drop after some time

I started by creating the ZooKeeper and Kafka servers respectively on my machine. The connection gets established for both servers but drops after a few minutes, throwing the errors below.
ZOOKEEPER ERROR LOG-
'''2020-05-11 14:32:36,908 [myid:] - INFO [SyncThread:0:FileTxnLog#284] - Creating new log file: log.1
2020-05-11 14:36:05,170 [myid:] - WARN [NIOWorkerThread-1:NIOServerCnxn#373] - Close of session 0x10000cc587b0003
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:324)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-05-11 14:36:10,589 [myid:] - INFO [SessionTracker:ZooKeeperServer#600] - Expiring session 0x10000cc587b0003, timeout of 6000ms exceeded
'''
KAFKA ERROR LOG - Below are the Kafka server error logs:
'''[2020-05-11 14:33:41,077] INFO [KafkaServer id=0] started (kafka.server.KafkaServer)
[2020-05-11 14:36:04,762] ERROR Error while writing to checkpoint file C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint (kafka.server.LogDirFailureChannel)
java.nio.file.FileAlreadyExistsException: C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint.tmp -> C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:81)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
.....
[2020-05-11 14:36:04,776] INFO [ReplicaManager broker=0] Stopping serving replicas in dir C:\Kafka\Kafkakafka-logs (kafka.server.ReplicaManager)
[2020-05-11 14:36:04,776] ERROR [ReplicaManager broker=0] Error while writing to highwatermark file in directory C:\Kafka\Kafkakafka-logs (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.KafkaStorageException: Error while writing to checkpoint file C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint
Caused by: java.nio.file.FileAlreadyExistsException: C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint.tmp -> C:\Kafka\Kafkakafka-logs\replication-offset-checkpoint
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:81)
'''

Flink can't access the leader JobManager and fails to start the job

Our Flink cluster has two JobManagers. Recently the job often goes down whenever the JobManager leader switches, and Flink can't recover the previous job after the switch. The job also can't start automatically when I restart the Flink cluster, so I have to start it manually. According to the log, it seems that whenever a new JobManager leader is elected, the connection to the new leader is refused, and this leads to the failure to start the job. On our JobManager server I can't find an active port 58088 open. I wonder whether the communication between Flink and ZooKeeper has some problem. We are using flink-1.0.3.
What could be the possible reason for this? Is it a Flink bug? Thanks.
Log:
2017-03-09 15:32:41,211 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-03-09 15:32:41,243 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#172.27.163.235:58088/user/jobmanager:36e428ba-0af3-4e39-90d4-106b7779f94a.
2017-03-09 15:32:41,318 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink#172.27.163.235:58088]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.27.163.235:58088
2017-03-09 15:32:41,325 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink#172.27.163.235:58088/), Path(/user/jobmanager)]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-09 15:32:48,029 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink#172.27.163.227:36876/user/jobmanager was granted leadership with leader session ID Some(ff50dc37-048e-4d95-93f5-df788c06725c).
2017-03-09 15:32:48,037 INFO org.apache.flink.runtime.jobmanager.JobManager - Delaying recovery of all jobs by 10000 milliseconds.
2017-03-09 15:32:48,038 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#172.27.163.227:36876/user/jobmanager:ff50dc37-048e-4d95-93f5-df788c06725c.
2017-03-09 15:32:49,038 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app87 (akka.tcp://flink#172.27.165.66:32781/user/taskmanager) as e3a9364cd8deeebf8a757d3979c5ae55. Current number of registered hosts is 1. Current number of alive task slots is 4.
2017-03-09 15:32:49,044 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app83 (akka.tcp://flink#172.27.165.58:40972/user/taskmanager) as 0c980a4c64189d975aa71cb97b1ecb7c. Current number of registered hosts is 2. Current number of alive task slots is 8.
2017-03-09 15:32:49,427 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app27 (akka.tcp://flink#172.27.164.5:50762/user/taskmanager) as 7dcc90275bd63cbcda8361bfe00cb6e8. Current number of registered hosts is 3. Current number of alive task slots is 12.
2017-03-09 15:32:49,676 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app26 (akka.tcp://flink#172.27.163.245:41734/user/taskmanager) as ab62098118261dcaa2d218ea17aa8117. Current number of registered hosts is 4. Current number of alive task slots is 16.
2017-03-09 15:32:49,916 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app84 (akka.tcp://flink#172.27.165.60:53871/user/taskmanager) as 012f186f437e7ba95111ff61d206dae6. Current number of registered hosts is 5. Current number of alive task slots is 20.
2017-03-09 15:32:49,930 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app85 (akka.tcp://flink#172.27.165.62:50068/user/taskmanager) as 68506e37647dfbff11ae193f20a7b624. Current number of registered hosts is 6. Current number of alive task slots is 24.
2017-03-09 15:32:50,658 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app86 (akka.tcp://flink#172.27.165.64:57339/user/taskmanager) as c1e922599fae53e6edc78a2add4edb61. Current number of registered hosts is 7. Current number of alive task slots is 28.
2017-03-09 15:32:50,780 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at app25 (akka.tcp://flink#172.27.163.241:45878/user/taskmanager) as 3ee2f5d3cb8003df5531d444bd11890c. Current number of registered hosts is 8. Current number of alive task slots is 32.
2017-03-09 15:32:58,054 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover all jobs.
2017-03-09 15:32:58,083 ERROR org.apache.flink.runtime.jobmanager.JobManager - Fatal error: Failed to recover jobs.
java.io.FileNotFoundException: /apps/flink/recovery/submittedJobGraphb6357063f81b (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:52)
at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143)
at org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:51)
at org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraphs(ZooKeeperSubmittedJobGraphStore.java:173)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply$mcV$sp(JobManager.scala:433)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:429)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:429)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:429)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:425)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:425)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Why can't I connect to Kafka/Zookeeper? (In a Docker)

I'm running Kafka (0.10.0.0) in a Docker container on a Mac (with docker-machine). I derived my Dockerfile from Spotify's, which means Kafka and Zookeeper run in the same image.
My instance starts cleanly, and poking around inside it, everything appears normal.
Docker maps ports 2181 and 9092 to high ports 32822 and 32820 in this case. From outside my running Kafka container I am able to successfully telnet 192.168.99.100 32822 (where 192.168.99.100 is the IP of my docker-machine). From there I can issue a zookeeper command and get the expected output.
It all seems so encouraging, but... I then try this code:
import kafka.admin.AdminUtils
import kafka.utils.ZkUtils

val numPartitions = 4
val replicationFactor = 1
val topicConfig = new java.util.Properties
// zookeeper = "192.168.99.100:32822"  (ZooKeeper connect string; `zookeeper` and `topic` are defined elsewhere)
// session timeout 10000 ms, connection timeout 10000 ms, ZooKeeper security disabled
val zkClient = ZkUtils(zookeeper, 10000, 10000, false)
try {
  AdminUtils.createTopic(zkClient, topic, numPartitions, replicationFactor, topicConfig)
} catch {
  case k: kafka.common.TopicExistsException => // do nothing... topic already exists
}
zkClient.close()
Running it produces this error output:
DEBUG ZkConnection - Creating new ZookKeeper instance to connect to 192.168.99.100:32822.
INFO ZkEventThread - Starting ZkClient event thread.
INFO ZooKeeper - Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
INFO ZooKeeper - Client environment:host.name=172.25.42.82
INFO ZooKeeper - Client environment:java.version=1.8.0_60
INFO ZooKeeper - Client environment:java.vendor=Oracle Corporation
INFO ZooKeeper - Client environment:java.home=/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre
INFO ZooKeeper - Client environment:java.class.path=/usr/local/Cellar/sbt/0.13.11/libexec/sbt-launch.jar
INFO ZooKeeper - Client environment:java.library.path=/Users/wmy965/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
INFO ZooKeeper - Client environment:java.io.tmpdir=/var/folders/ph/ccz4n1qs62n0bn8mqdg94gswt1jlwk/T/
INFO ZooKeeper - Client environment:java.compiler=<NA>
INFO ZooKeeper - Client environment:os.name=Mac OS X
INFO ZooKeeper - Client environment:os.arch=x86_64
INFO ZooKeeper - Client environment:os.version=10.11.5
INFO ZooKeeper - Client environment:user.name=wmy965
INFO ZooKeeper - Client environment:user.home=/Users/wmy965
INFO ZooKeeper - Client environment:user.dir=/Users/wmy965/git/LateKafka
INFO ZooKeeper - Initiating client connection, connectString=192.168.99.100:32822 sessionTimeout=10000 watcher=org.I0Itec.zkclient.ZkClient#55397e3
DEBUG ClientCnxn - zookeeper.disableAutoWatchReset is false
DEBUG ZkClient - Awaiting connection to Zookeeper server
INFO ZkClient - Waiting for keeper state SyncConnected
INFO ClientCnxn - Opening socket connection to server 192.168.99.100/192.168.99.100:32822. Will not attempt to authenticate using SASL (unknown error)
WARN ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
DEBUG ClientCnxnSocketNIO - Ignoring exception during shutdown input
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:780)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:399)
at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:200)
at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1185)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1110)
DEBUG ClientCnxnSocketNIO - Ignoring exception during shutdown output
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:797)
at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:407)
at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:207)
at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1185)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1110)
INFO ClientCnxn - Opening socket connection to server 192.168.99.100/192.168.99.100:32822. Will not attempt to authenticate using SASL (unknown error)
INFO ClientCnxn - Socket connection established to 192.168.99.100/192.168.99.100:32822, initiating session
DEBUG ClientCnxn - Session establishment request sent on 192.168.99.100/192.168.99.100:32822
INFO ClientCnxn - Session establishment complete on server 192.168.99.100/192.168.99.100:32822, sessionid = 0x155225c51720000, negotiated timeout = 10000
DEBUG ZkClient - Received event: WatchedEvent state:SyncConnected type:None path:null
INFO ZkClient - zookeeper state changed (SyncConnected)
DEBUG ZkClient - Leaving process event
DEBUG ZkClient - State is SyncConnected
DEBUG ClientCnxn - Reading reply sessionid:0x155225c51720000, packet:: clientPath:null serverPath:null finished:false header:: 1,8 replyHeader:: 1,1,-101 request:: '/brokers/ids,F response:: v{}
It looks like I can't connect (presumably to zookeeper). Why not?
With the newer Kafka clients, the producer must be able to reach Kafka (in Docker) under the name Kafka knows itself by. Kafka sends its container hostname (a uuid; you can see it in /etc/hosts inside the Kafka Docker container) and expects the client to connect back using that name.
Summary:
Map the Kafka container's uuid hostname to the docker-machine IP in /etc/hosts on macOS.
To help you, here is how to change the /etc/hosts file on a Mac:
https://www.tekrevue.com/tip/edit-hosts-file-mac-os-x/
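As an illustration (the hostname is hypothetical; use whatever name actually appears in /etc/hosts inside your Kafka container), the entry on the Mac would look like:

# /etc/hosts on macOS: map the Kafka container's advertised hostname to the docker-machine IP
192.168.99.100   <kafka-container-hostname>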
A cleaner approach would be to set advertised.listeners=host-ip:port in the Kafka server.properties file, since advertised.host.name and advertised.port are deprecated.
If you set the host IP to 0.0.0.0, it will listen for requests from anywhere, but that is insecure.
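For reference, a hedged server.properties sketch using the docker-machine IP and the host-mapped port from the question (adjust both to your own setup):

# Kafka server.properties (illustrative)
# bind inside the container on all interfaces
listeners=PLAINTEXT://0.0.0.0:9092
# advertise the address that clients outside Docker can actually reach
advertised.listeners=PLAINTEXT://192.168.99.100:32820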