Spark 0.9.0: worker keeps dying in standalone mode when job fails - scala

I am new to Spark. I am running Spark in standalone mode on my Mac. I bring up the master and the worker, and both come up fine. The master's log file looks like:
...
14/02/25 18:52:43 INFO Slf4jLogger: Slf4jLogger started
14/02/25 18:52:43 INFO Remoting: Starting remoting
14/02/25 18:52:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077]
14/02/25 18:52:43 INFO Master: Starting Spark master at spark://Shirishs-MacBook-Pro.local:7077
14/02/25 18:52:43 INFO MasterWebUI: Started Master web UI at http://192.168.1.106:8080
14/02/25 18:52:43 INFO Master: I have been elected leader! New state: ALIVE
14/02/25 18:53:03 INFO Master: Registering worker Shirishs-MacBook-Pro.local:53956 with 4 cores, 15.0 GB RAM
The worker log looks like:
14/02/25 18:53:02 INFO Slf4jLogger: Slf4jLogger started
14/02/25 18:53:02 INFO Remoting: Starting remoting
14/02/25 18:53:02 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@192.168.1.106:53956]
14/02/25 18:53:02 INFO Worker: Starting Spark worker 192.168.1.106:53956 with 4 cores, 15.0 GB RAM
14/02/25 18:53:02 INFO Worker: Spark home: /Users/shirish_kumar/Developer/spark-0.9.0-incubating
14/02/25 18:53:02 INFO WorkerWebUI: Started Worker web UI at http://192.168.1.106:8081
14/02/25 18:53:02 INFO Worker: Connecting to master spark://Shirishs-MacBook-Pro.local:7077...
14/02/25 18:53:03 INFO Worker: Successfully registered with master spark://Shirishs-MacBook-Pro.local:7077
Now, when I submit a job, the job fails to execute (because of a class-not-found error), but the worker also dies. Here is the master log:
14/02/25 18:55:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
14/02/25 18:55:52 INFO Master: Launching driver driver-20140225185552-0000 on worker worker-20140225185302-192.168.1.106-53956
14/02/25 18:55:55 INFO Master: Registering worker Shirishs-MacBook-Pro.local:53956 with 4 cores, 15.0 GB RAM
14/02/25 18:55:55 INFO Master: Attempted to re-register worker at same address: akka.tcp://sparkWorker@192.168.1.106:53956
14/02/25 18:55:55 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:55:57 INFO Master: akka.tcp://driverClient@192.168.1.106:53961 got disassociated, removing it.
14/02/25 18:55:57 INFO Master: akka.tcp://driverClient@192.168.1.106:53961 got disassociated, removing it.
14/02/25 18:55:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40192.168.1.106%3A53962-2#-21389169] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/02/25 18:55:57 INFO Master: akka.tcp://driverClient@192.168.1.106:53961 got disassociated, removing it.
14/02/25 18:55:57 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077] -> [akka.tcp://driverClient@192.168.1.106:53961]: Error [Association failed with [akka.tcp://driverClient@192.168.1.106:53961]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://driverClient@192.168.1.106:53961]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /192.168.1.106:53961
]
...
...
14/02/25 18:55:57 INFO Master: akka.tcp://driverClient@192.168.1.106:53961 got disassociated, removing it.
14/02/25 18:56:03 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:10 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:18 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:25 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:33 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
14/02/25 18:56:40 WARN Master: Got heartbeat from unregistered worker worker-20140225185555-192.168.1.106-53956
...
The worker log looks like this:
14/02/25 18:55:52 INFO Worker: Asked to launch driver driver-20140225185552-0000
2014-02-25 18:55:52.534 java[11415:330b] Unable to load realm info from SCDynamicStore
14/02/25 18:55:52 INFO DriverRunner: Copying user jar file:/Users/shirish_kumar/Developer/spark_app/SimpleApp to /Users/shirish_kumar/Developer/spark-0.9.0-incubating/work/driver-20140225185552-0000/SimpleApp
14/02/25 18:55:53 INFO DriverRunner: Launch Command: "/Library/Java/JavaVirtualMachines/jdk1.7.0_40.jdk/Contents/Home/bin/java" "-cp" ":/Users/shirish_kumar/Developer/spark-0.9.0-incubating/work/driver-20140225185552-0000/SimpleApp:/Users/shirish_kumar/Developer/spark-0.9.0-incubating/conf:/Users/shirish_kumar/Developer/spark-0.9.0-incubating/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop1.0.4.jar" "-Xms512M" "-Xmx512M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker@192.168.1.106:53956/user/Worker" "SimpleApp"
14/02/25 18:55:55 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val)
scala.MatchError: FAILED (of class scala.Enumeration$Val)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/02/25 18:55:55 INFO Worker: Starting Spark worker 192.168.1.106:53956 with 4 cores, 15.0 GB RAM
14/02/25 18:55:55 INFO Worker: Spark home: /Users/shirish_kumar/Developer/spark-0.9.0-incubating
14/02/25 18:55:55 INFO WorkerWebUI: Started Worker web UI at http://192.168.1.106:8081
14/02/25 18:55:55 INFO Worker: Connecting to master spark://Shirishs-MacBook-Pro.local:7077...
14/02/25 18:55:55 INFO Worker: Successfully registered with master spark://Shirishs-MacBook-Pro.local:7077
After this, the worker is shown as dead in the web UI.
My question is: has anyone encountered this problem? The worker should not die just because a job fails.

Check your Spark work folder (e.g. $SPARK_HOME/work).
There you can see the exact error for that particular driver.
For me it was a class-not-found exception. Just give the fully qualified class name for the application's main class (include the package name too), as in the sketch below.
Then clear out the work directory and launch your application in standalone mode again.
This should work.
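An illustrative sketch only (the package name com.example is hypothetical): because the main class lives in a package, the job must be submitted as com.example.SimpleApp, not just SimpleApp.

package com.example

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).sum()) // trivial action so the job does something
    sc.stop()
  }
}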

You have to specify the path to your JAR files.
Programmatically, you can do it this way:
sparkConf.set("spark.jars", "file:/myjar1,file:/myjarN")
This implies you first have to compile a JAR file.
You also have to link the dependent JARs. There are multiple ways of automating that, but they are well beyond the scope of this question. A fuller sketch follows.
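A fuller sketch with illustrative jar paths; spark.jars takes a comma-separated list of jars to ship to the cluster.

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("SimpleApp")
  // comma-separated list of jars; these paths are placeholders
  .set("spark.jars", "file:/path/to/myjar1.jar,file:/path/to/myjarN.jar")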


JobManager doesn't automatically redirect all requests to the remaining / running TaskManager

Problem Description
2 computers (203, 204)
I created a standalone-mode HA Flink v1.6.1 cluster.
Each computer runs a JobManager and a TaskManager (2 task slots).
After I start a job (the SocketWindowWordCount.jar example, via ./flink run ../examples/streaming/SocketWindowWordCount.jar --hostname 10.1.2.9 --port 9000) on the JobManager node, I kill the working TaskManager instance.
In the Web Dashboard I can see the job being cancelled and then failed.
flink-conf.yaml
state.backend: filesystem
state.checkpoints.dir: hdfs://10.1.2.109:8020/wulin/flink-checkpoints
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/flink/deploy/webTmp
web.log.path: /home/flink/deploy/log
io.tmp.dirs: /home/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink
high-availability.storageDir: hdfs://10.1.2.109:8020/wulin
security.kerberos.login.principal: xxxx
security.kerberos.login.keytab: /home/ctu/flink/flink-1.6/conf/user.keytab
full logs
log-standalonesession-203
log-taskexecutor-203
log-standalonesession-204
exception
Killing the working TM produces an exception like this:
2018-12-28 11:04:27,877 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,660 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hz203/10.0.0.203:42861
2018-12-28 11:04:28,660 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Closing TaskExecutor connection 0f41bca09600cd25000e19801076fa1f because: The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager dcf3bb5b7ed2208cf45b658d212fd8d2 from the SlotManager.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (88aa62ad152f4df6b39a969dd32c0249) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot 0f41bca09600cd25000e19801076fa1f_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1116)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-12-28 11:04:28,680 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (61f55876e79934d515c163d095d706a6) switched from state RUNNING to FAILING.
submit job
Running ./bin/flink run -d ./examples/streaming/SocketWindowWordCount.jar --port 9000 --hostname 10.1.2.9 produces JM logs like this:
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291)
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291) switched from state CREATED to RUNNING.
2018-12-28 19:20:01,356 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,359 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,364 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e33a40832a3922897470fb76bcf76b29}]
2018-12-28 19:20:01,367 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@hz203:46596/user/resourcemanager(b22f96303e74df23645fe4567f884b9e)
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/5cdb91c15ee12ec6e74256eed10b5291/job_manager_lock.
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,431 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,432 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: b22f96303e74df23645fe4567f884b9e.
2018-12-28 19:20:01,433 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{e33a40832a3922897470fb76bcf76b29}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-12-28 19:20:01,434 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 5cdb91c15ee12ec6e74256eed10b5291 with allocation id AllocationID{f7a24e609e2ec618ccb456076049fa3b}.
2018-12-28 19:20:01,510 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,511 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: Socket Stream -> Flat Map (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,674 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:01,708 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:43,267 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-513fbe1e6ddf69d10689eccf4c65da97 from hz203/10.0.0.203:6124
2018-12-28 19:20:48,339 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-dd915bb9821ff6ced34dd5e489966b674de5a48f-7ea2600930e5fc5a4fbb7d47ee198789 from hz203/10.0.0.203:6124
2018-12-28 19:20:52,623 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-0bd1ab86fa4cc54daeb472079bfbea8c from hz203/10.0.0.203:6124
kill TM
The question body is limited to 30,000 characters; please see the linked JM logs from when the TM is killed.
The logs indicate that your RestartStrategy has depleted its restart attempts or that no RestartStrategy has been configured. Please check whether you specified a RestartStrategy in your program via env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L)) or in flink-conf.yaml via restart-strategy: fixed-delay. If you want to learn more about Flink's restart strategies check out the documentation.
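For reference, a minimal sketch of the programmatic route (Scala API), using the values quoted in the answer above: up to 10 restart attempts with no delay between them.

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Retry the failed job up to 10 times, with 0 ms between attempts.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L))
// The flink-conf.yaml equivalent is: restart-strategy: fixed-delay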

Hortonworks, Eclipse and Kerberos Client (Authentication, HOW?)

Hello everybody, we have a kerberized HDP (Hortonworks) cluster. We can run Spark jobs from spark-submit (CLI) and Talend Big Data, but not from Eclipse.
We have a Windows client machine where Eclipse is installed and the MIT Kerberos client for Windows is configured (TGT configuration). The goal is to run a Spark job using Eclipse. The portion of the Java code related to Spark is operational and has been tested via the CLI. The relevant part of the job's code is below.
private void setConfigurationProperties()
{
    try {
        sConfig.setAppName("abcd-name");
        sConfig.setMaster("yarn-client");
        sConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        sConfig.set("spark.hadoop.yarn.resourcemanager.address", "rs.abcd.com:8032");
        sConfig.set("spark.hadoop.yarn.resourcemanager.scheduler.address", "rs.abcd.com:8030");
        sConfig.set("spark.hadoop.mapreduce.jobhistory.address", "rs.abcd.com:10020");
        sConfig.set("spark.hadoop.yarn.app.mapreduce.am.staging-dir", "/dir");
        sConfig.set("spark.executor.memory", "2g");
        sConfig.set("spark.executor.cores", "4");
        sConfig.set("spark.executor.instances", "24");
        sConfig.set("spark.yarn.am.cores", "24");
        sConfig.set("spark.yarn.am.memory", "16g");
        sConfig.set("spark.eventLog.enabled", "true");
        sConfig.set("spark.eventLog.dir", "hdfs:///spark-history");
        sConfig.set("spark.shuffle.memoryFraction", "0.4");
        sConfig.set("spark.hadoop.mapreduce.application.framework.path", "/hdp/apps/version/mapreduce/mapreduce.tar.gz#mr-framework");
        sConfig.set("spark.local.dir", "/tmp");
        sConfig.set("spark.hadoop.yarn.resourcemanager.principal", "rm/_HOST@ABCD.COM");
        sConfig.set("spark.hadoop.mapreduce.jobhistory.principal", "jhs/_HOST@ABCD.COM");
        sConfig.set("spark.hadoop.dfs.namenode.kerberos.principal", "nn/_HOST@ABCD.COM");
        sConfig.set("spark.hadoop.fs.defaultFS", "hdfs://hdfs.abcd.com:8020");
        sConfig.set("spark.hadoop.dfs.client.use.datanode.hostname", "true");
    } catch (Exception e) {
        // a try block needs a catch (or finally) clause to compile
        e.printStackTrace();
    }
}
When we run the code the following error pops up:
17/04/05 23:37:06 INFO Remoting: Starting remoting
17/04/05 23:37:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@1.1.1.1:54356]
17/04/05 23:37:06 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54356.
17/04/05 23:37:06 INFO SparkEnv: Registering MapOutputTracker
17/04/05 23:37:06 INFO SparkEnv: Registering BlockManagerMaster
17/04/05 23:37:06 INFO DiskBlockManager: Created local directory at C:\tmp\blockmgr-baee2441-1977-4410-b52f-4275ff35d6c1
17/04/05 23:37:06 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
17/04/05 23:37:06 INFO SparkEnv: Registering OutputCommitCoordinator
17/04/05 23:37:07 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/04/05 23:37:07 INFO SparkUI: Started SparkUI at http://1.1.1.1:4040
17/04/05 23:37:07 INFO RMProxy: Connecting to ResourceManager at rs.abcd.com/1.1.1.1:8032
17/04/05 23:37:07 ERROR SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
17/04/05 23:37:07 INFO SparkUI: Stopped Spark web UI at http://1.1.1.1:4040
Please guide us on how to specify the Kerberos authentication method (instead of SIMPLE) in the Java code, or how to instruct the client to request Kerberos authentication. What should the whole process look like, and what would be the right approach?
Thank you
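A hedged sketch, not from this thread: one common approach is to log in from a keytab via Hadoop's UserGroupInformation before creating the SparkContext, so the Hadoop RPC layer negotiates KERBEROS rather than SIMPLE. The principal and keytab path below are placeholders.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
// placeholder principal and keytab path
UserGroupInformation.loginUserFromKeytab("user@ABCD.COM", "/path/to/user.keytab")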

Unable to connect to Spark master

I start my DataStax Cassandra instance with Spark:
dse cassandra -k
I then run this program (from within Eclipse):
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object Start {
  def main(args: Array[String]): Unit = {
    println("***** 1 *****")
    val sparkConf = new SparkConf().setAppName("Start").setMaster("spark://127.0.0.1:7077")
    println("***** 2 *****")
    val sparkContext = new SparkContext(sparkConf)
    println("***** 3 *****")
  }
}
And I get the following output
***** 1 *****
***** 2 *****
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/29 15:27:50 INFO SparkContext: Running Spark version 1.5.2
15/12/29 15:27:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/29 15:27:51 INFO SecurityManager: Changing view acls to: nayan
15/12/29 15:27:51 INFO SecurityManager: Changing modify acls to: nayan
15/12/29 15:27:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nayan); users with modify permissions: Set(nayan)
15/12/29 15:27:52 INFO Slf4jLogger: Slf4jLogger started
15/12/29 15:27:52 INFO Remoting: Starting remoting
15/12/29 15:27:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.1.88:55126]
15/12/29 15:27:53 INFO Utils: Successfully started service 'sparkDriver' on port 55126.
15/12/29 15:27:53 INFO SparkEnv: Registering MapOutputTracker
15/12/29 15:27:53 INFO SparkEnv: Registering BlockManagerMaster
15/12/29 15:27:53 INFO DiskBlockManager: Created local directory at /private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/blockmgr-21a96671-c33e-498c-83a4-bb5c57edbbfb
15/12/29 15:27:53 INFO MemoryStore: MemoryStore started with capacity 983.1 MB
15/12/29 15:27:53 INFO HttpFileServer: HTTP File server directory is /private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/spark-fce0a058-9264-4f2c-8220-c32d90f11bd8/httpd-2a0efcac-2426-49c5-982a-941cfbb48c88
15/12/29 15:27:53 INFO HttpServer: Starting HTTP Server
15/12/29 15:27:53 INFO Utils: Successfully started service 'HTTP file server' on port 55127.
15/12/29 15:27:53 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/29 15:27:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/29 15:27:53 INFO SparkUI: Started SparkUI at http://10.0.1.88:4040
15/12/29 15:27:54 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/29 15:27:54 INFO AppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:7077...
15/12/29 15:27:54 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster@127.0.0.1:7077] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/29 15:28:14 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1f22aef0 rejected from java.util.concurrent.ThreadPoolExecutor@176cb4af[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
So something is happening during the creation of the Spark context.
When I look in $DSE_HOME/logs/spark, it is empty. Not sure where else to look.
It turns out that the problem was the Spark library version AND the Scala version. DataStax was running Spark 1.4.1 and Scala 2.10.5, while my Eclipse project was using 1.5.2 and 2.11.7 respectively.
Note that BOTH the Spark library and Scala versions appear to have to match. I tried other combinations, but it only worked when both matched. For an sbt build, the pinning would look like the sketch below.
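A small sketch of that pinning in build.sbt (the versions are the ones quoted above; your cluster's may differ):

// build.sbt: match the Scala and Spark versions to what the DSE cluster runs
scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"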
I am getting pretty familiar with this part of your posted error:
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://...
It can have numerous causes, pretty much all related to misconfigured IPs. First I would do whatever zero323 says, then here's my two cents: I have solved my own problems recently by using IP addresses, not hostnames, and the only config I use in a simple standalone cluster is SPARK_MASTER_IP.
Setting SPARK_MASTER_IP in $SPARK_HOME/conf/spark-env.sh on your master should then lead the master web UI to show the IP address you set:
spark://your.ip.address.numbers:7077
And your SparkConf setup can refer to that, for example:
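A minimal sketch, using the placeholder address above:

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("Start")
  .setMaster("spark://your.ip.address.numbers:7077") // exactly the URL the master web UI shows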
Having said that, I am not familiar with your specific implementation, but I notice two occurrences in the error containing:
/private/var/folders/pd/6rxlm2js10gg6xys5wm90qpm0000gn/T/
Have you looked there to see if there is a logs directory? Is that where $DSE_HOME points? Alternatively, connect to the driver where it creates its web UI:
INFO SparkUI: Started SparkUI at http://10.0.1.88:4040
and you should see a link to an error log there somewhere.
More on the IP vs. hostname question: this very old bug is marked as Resolved, but I have not figured out what they mean by Resolved, so I just tend toward IP addresses.

How to set remoteHost in spark RetryingBlockFetcher IOException

I apologize for such an extremely long post, but I wanted to be better understood.
I have built up my cluster, where the master is on a different machine than the workers. The workers are allocated on a quite powerful machine. No firewall is applied between these two machines.
URL: spark://MASTER_IP:7077
Workers: 10
Cores: 10 Total, 0 Used
Memory: 40.0 GB Total, 0.0 B Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Before launching the app, the worker log file looks like this (an example from one worker):
15/03/06 18:52:19 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/03/06 18:52:19 INFO SecurityManager: Changing view acls to: szymon
15/03/06 18:52:19 INFO SecurityManager: Changing modify acls to: szymon
15/03/06 18:52:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(szymon); users with modify permissions: Set(szymon)
15/03/06 18:52:20 INFO Slf4jLogger: Slf4jLogger started
15/03/06 18:52:20 INFO Remoting: Starting remoting
15/03/06 18:52:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240]
15/03/06 18:52:20 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240]
15/03/06 18:52:20 INFO Utils: Successfully started service 'sparkWorker' on port 42240.
15/03/06 18:52:20 INFO Worker: Starting Spark worker WORKER_MACHINE_IP:42240 with 1 cores, 4.0 GB RAM
15/03/06 18:52:20 INFO Worker: Spark home: /home/szymon/spark
15/03/06 18:52:20 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/03/06 18:52:20 INFO WorkerWebUI: Started WorkerWebUI at http://WORKER_MACHINE_IP:8081
15/03/06 18:52:20 INFO Worker: Connecting to master spark://MASTER_IP:7077...
15/03/06 18:52:20 INFO Worker: Successfully registered with master spark://MASTER_IP:7077
I launch my application on the cluster (on the master machine):
./bin/spark-submit --class SimpleApp --master spark://MASTER_IP:7077 --executor-memory 3g --total-executor-cores 10 code/trial_2.11-0.9.jar
My app is then fetched by the workers; this is an example of the log output for one worker (on WORKER_MACHINE):
15/03/06 18:07:45 INFO ExecutorRunner: Launch command: "/usr/java/jdk1.8.0_31/bin/java" "-cp" "::/home/machine/spark/conf:/home/machine/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-hadoop2.4.0.jar" "-Dspark.driver.port=56753" "-Dlog4j.configuration=file:////home/machine/spark/conf/log4j.properties" "-Dspark.driver.host=MASTER_IP" "-Xms3072M" "-Xmx3072M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkDriver@MASTER_IP:56753/user/CoarseGrainedScheduler" "4" "WORKER_MACHINE_IP" "1" "app-20150306181450-0000" "akka.tcp://sparkWorker@WORKER_MACHINE_IP:45288/user/Worker"
The app wants to connect to localhost at address 127.0.0.1 instead of MASTER_IP (I believe).
How can this be fixed?
15/03/06 18:58:52 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to localhost/127.0.0.1:56545
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
The problem seems to be caused by the createClient method in TransportClientFactory (in spark-network-common_2.10-1.2.1-sources.jar), where String remoteHost ends up set to localhost:
/**
* Create a {@link TransportClient} connecting to the given remote host / port.
*
* We maintains an array of clients (size determined by spark.shuffle.io.numConnectionsPerPeer)
* and randomly picks one to use. If no client was previously created in the randomly selected
* spot, this function creates a new client and places it there.
*
* Prior to the creation of a new TransportClient, we will execute all
* {@link TransportClientBootstrap}s that are registered with this factory.
*
* This blocks until a connection is successfully established and fully bootstrapped.
*
* Concurrency: This method is safe to call from multiple threads.
*/
public TransportClient createClient(String remoteHost, int remotePort) throws IOException {
// Get connection from the connection pool first.
// If it is not found or not active, create a new one.
final InetSocketAddress address = new InetSocketAddress(remoteHost, remotePort);
.
.
.
clientPool.clients[clientIndex] = createClient(address);
}
Here is the spark-env.sh file on the workers' side:
export SPARK_HOME=/home/szymon/spark
export SPARK_MASTER_IP=MASTER_IP
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_LOCAL_IP=WORKER_MACHINE_IP
export SPARK_DRIVER_HOST=WORKER_MACHINE_IP
export SPARK_LOCAL_DIRS=/home/szymon/spark/slaveData
export SPARK_WORKER_INSTANCES=10
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_DIR=/home/szymon/spark/work
And on the master:
export SPARK_MASTER_IP=MASTER_IP
export SPARK_LOCAL_IP=MASTER_IP
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_JAVA_OPTS="-Dlog4j.configuration=file:////home/szymon/spark/conf/log4j.properties -Dspark.driver.host=MASTER_IP"
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=10"
This is the full log output, with more details:
15/03/06 18:58:50 INFO Worker: Asked to launch executor app-20150306190555-0000/0 for Simple Application
15/03/06 18:58:50 INFO ExecutorRunner: Launch command: "/usr/java/jdk1.8.0_31/bin/java" "-cp" "::/home/szymon/spark/conf:/home/szymon/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-hadoop2.4.0.jar" "-Dspark.driver.port=49407" "-Dlog4j.configuration=file:////home/szymon/spark/conf/log4j.properties" "-Dspark.driver.host=MASTER_IP" "-Xms3072M" "-Xmx3072M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkDriver@MASTER_IP:49407/user/CoarseGrainedScheduler" "0" "WORKER_MACHINE_IP" "1" "app-20150306190555-0000" "akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240/user/Worker"
15/03/06 18:58:50 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/03/06 18:58:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/06 18:58:51 INFO SecurityManager: Changing view acls to: szymon
15/03/06 18:58:51 INFO SecurityManager: Changing modify acls to: szymon
15/03/06 18:58:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(szymon); users with modify permissions: Set(szymon)
15/03/06 18:58:51 INFO Slf4jLogger: Slf4jLogger started
15/03/06 18:58:51 INFO Remoting: Starting remoting
15/03/06 18:58:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@WORKER_MACHINE_IP:52038]
15/03/06 18:58:51 INFO Utils: Successfully started service 'driverPropsFetcher' on port 52038.
15/03/06 18:58:52 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/03/06 18:58:52 INFO SecurityManager: Changing view acls to: szymon
15/03/06 18:58:52 INFO SecurityManager: Changing modify acls to: szymon
15/03/06 18:58:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(szymon); users with modify permissions: Set(szymon)
15/03/06 18:58:52 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/03/06 18:58:52 INFO Slf4jLogger: Slf4jLogger started
15/03/06 18:58:52 INFO Remoting: Starting remoting
15/03/06 18:58:52 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/03/06 18:58:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@WORKER_MACHINE_IP:37114]
15/03/06 18:58:52 INFO Utils: Successfully started service 'sparkExecutor' on port 37114.
15/03/06 18:58:52 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver@MASTER_IP:49407/user/CoarseGrainedScheduler
15/03/06 18:58:52 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240/user/Worker
15/03/06 18:58:52 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240/user/Worker
15/03/06 18:58:52 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
15/03/06 18:58:52 INFO Executor: Starting executor ID 0 on host WORKER_MACHINE_IP
15/03/06 18:58:52 INFO SecurityManager: Changing view acls to: szymon
15/03/06 18:58:52 INFO SecurityManager: Changing modify acls to: szymon
15/03/06 18:58:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(szymon); users with modify permissions: Set(szymon)
15/03/06 18:58:52 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@MASTER_IP:49407/user/MapOutputTracker
15/03/06 18:58:52 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@MASTER_IP:49407/user/BlockManagerMaster
15/03/06 18:58:52 INFO DiskBlockManager: Created local directory at /home/szymon/spark/slaveData/spark-b09c3727-8559-4ab8-ab32-1f5ecf7aeaf2/spark-0c892a4d-c8b9-4144-a259-8077f5316b52/spark-89577a43-fb43-4a12-a305-34b267b01f8a/spark-7ad207c4-9d37-42eb-95e4-7b909b71c687
15/03/06 18:58:52 INFO MemoryStore: MemoryStore started with capacity 1589.8 MB
15/03/06 18:58:52 INFO NettyBlockTransferService: Server created on 51205
15/03/06 18:58:52 INFO BlockManagerMaster: Trying to register BlockManager
15/03/06 18:58:52 INFO BlockManagerMaster: Registered BlockManager
15/03/06 18:58:52 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@MASTER_IP:49407/user/HeartbeatReceiver
15/03/06 18:58:52 INFO CoarseGrainedExecutorBackend: Got assigned task 0
15/03/06 18:58:52 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/06 18:58:52 INFO Executor: Fetching http://MASTER_IP:57850/jars/trial_2.11-0.9.jar with timestamp 1425665154479
15/03/06 18:58:52 INFO Utils: Fetching http://MASTER_IP:57850/jars/trial_2.11-0.9.jar to /home/szymon/spark/slaveData/spark-b09c3727-8559-4ab8-ab32-1f5ecf7aeaf2/spark-0c892a4d-c8b9-4144-a259-8077f5316b52/spark-411cd372-224e-44c1-84ab-b0c3984a6361/fetchFileTemp7857926599487994869.tmp
15/03/06 18:58:52 INFO Utils: Copying /home/szymon/spark/slaveData/spark-b09c3727-8559-4ab8-ab32-1f5ecf7aeaf2/spark-0c892a4d-c8b9-4144-a259-8077f5316b52/spark-411cd372-224e-44c1-84ab-b0c3984a6361/-19284804851425665154479_cache to /home/szymon/spark/work/app-20150306190555-0000/0/./trial_2.11-0.9.jar
15/03/06 18:58:52 INFO Executor: Adding file:/home/szymon/spark/work/app-20150306190555-0000/0/./trial_2.11-0.9.jar to class loader
15/03/06 18:58:52 INFO TorrentBroadcast: Started reading broadcast variable 0
15/03/06 18:58:52 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to localhost/127.0.0.1:56545
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:593)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:593)
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:587)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.org$apache$spark$broadcast$TorrentBroadcast$$anonfun$$getRemote$1(TorrentBroadcast.scala:126)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply(TorrentBroadcast.scala:136)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply(TorrentBroadcast.scala:136)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:56545
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
15/03/06 18:58:52 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
15/03/06 18:58:57 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to localhost/127.0.0.1:56545
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:56545
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
.
.
.
15/03/06 19:00:22 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
15/03/06 19:00:24 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@WORKER_MACHINE_IP:37114] -> [akka.tcp://sparkDriver@MASTER_IP:49407] disassociated! Shutting down.
15/03/06 19:00:24 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@MASTER_IP:49407] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/03/06 19:00:24 INFO Worker: Asked to kill executor app-20150306190555-0000/0
15/03/06 19:00:24 INFO ExecutorRunner: Runner thread for executor app-20150306190555-0000/0 interrupted
15/03/06 19:00:24 INFO ExecutorRunner: Killing process!
15/03/06 19:00:25 INFO Worker: Executor app-20150306190555-0000/0 finished with state KILLED exitStatus 1
15/03/06 19:00:25 INFO Worker: Cleaning up local directories for application app-20150306190555-0000
15/03/06 19:00:25 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@WORKER_MACHINE_IP:37114] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/03/06 19:00:25 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40WORKER_MACHINE_IP%3A45806-2#1549100100] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
There is a warning, which I believe is not relevant to this issue:
15/03/06 18:07:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
You can try setting conf.set("spark.driver.host", "<client host>"), where the client host is the host on which you start spark-shell or your other script.
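Spelled out as a sketch, using MASTER_IP (the placeholder used throughout this question) as the host where spark-submit runs:

import org.apache.spark.SparkConf

// Pin the driver's advertised host so executors do not fall back to localhost/127.0.0.1.
val conf = new SparkConf()
  .setAppName("SimpleApp")
  .setMaster("spark://MASTER_IP:7077")
  .set("spark.driver.host", "MASTER_IP")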

An HTTP server starts when launching a Spark jar on a machine; what is it?

I want to use machine A to submit my Spark job to the cluster; A has no Spark environment, just Java. When I launch the jar, an HTTP server starts:
[steven#bj-230 ~]$ java -jar helloCluster.jar SimplyApp
log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/06/10 16:54:54 INFO SparkEnv: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/06/10 16:54:54 INFO SparkEnv: Registering BlockManagerMaster
14/06/10 16:54:54 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140610165454-4393
14/06/10 16:54:54 INFO MemoryStore: MemoryStore started with capacity 1055.1 MB.
14/06/10 16:54:54 INFO ConnectionManager: Bound socket to port 59981 with id = ConnectionManagerId(bj-230,59981)
14/06/10 16:54:54 INFO BlockManagerMaster: Trying to register BlockManager
14/06/10 16:54:54 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager bj-230:59981 with 1055.1 MB RAM
14/06/10 16:54:54 INFO BlockManagerMaster: Registered BlockManager
14/06/10 16:54:54 INFO HttpServer: Starting HTTP Server
14/06/10 16:54:54 INFO HttpBroadcast: Broadcast server started at http://10.10.10.230:59233
14/06/10 16:54:54 INFO SparkEnv: Registering MapOutputTracker
14/06/10 16:54:54 INFO HttpFileServer: HTTP File server directory is /tmp/spark-bfdd02f1-3c02-4233-854f-af89542b9acf
14/06/10 16:54:54 INFO HttpServer: Starting HTTP Server
14/06/10 16:54:54 INFO SparkUI: Started Spark Web UI at http://bj-230:4040
14/06/10 16:54:54 INFO SparkContext: Added JAR hdfs://master:8020/tmp/helloCluster.jar at hdfs://master:8020/tmp/helloCluster.jar with timestamp 1402390494838
14/06/10 16:54:54 INFO AppClient$ClientActor: Connecting to master spark://master:7077...
So, what is the purpose of this server? And if I am behind a NAT, is it possible to use this machine A to submit my job to the remote cluster?
By the way, this execution failed. Error log:
14/06/10 16:55:05 INFO SparkDeploySchedulerBackend: Executor app-20140610165321-0005/7 removed: Command exited with code 1
14/06/10 16:55:05 ERROR AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/10 16:55:05 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/10 16:55:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
The Spark driver starts a few HTTP endpoints:
It provides a web console that shows the job's progress. This HTTP endpoint has a default port of 4040, which can be changed with the configuration option spark.ui.port. You connect to it with your browser at http://your_host:4040 and you will be able to follow the job. It is only alive while the driver runs.
There is an additional HTTP endpoint that provides a file download service for the jars declared as dependencies. The workers contact the driver to download the list of dependencies. This uses a randomly assigned port, so the driver must be on a network routable from the Spark workers. A sketch of pinning these ports follows.
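If the driver sits behind a NAT, one workaround to consider is pinning the driver-side ports so they can be forwarded. This is a hedged sketch for Spark 1.x (the era of this question); the port numbers are illustrative.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.port", "4040")          // job-progress web console
  .set("spark.driver.port", "51000")     // driver RPC endpoint the executors call back to
  .set("spark.fileserver.port", "51001") // HTTP file server that serves dependency jars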