Spark&hbase: java.io.IOException: Connection reset by peer - scala

I would appreciate your help.
While implementing Spark Streaming from Kafka to HBase (code is attached), we ran into “java.io.IOException: Connection reset by peer” (full log is attached).
The issue appears when we write to HBase with dynamic allocation enabled in the Spark settings. If we write the data to HDFS (a Hive table) instead of HBase, or if dynamic allocation is disabled, there are no errors.
We have tried changing the ZooKeeper connections, the Spark executor idle timeout, and the network timeout. We have also tried changing the shuffle block transfer service (NIO), but the error is still there. If we set the min/max number of executors for dynamic allocation to fewer than 80, there are no problems either.
What could the problem be? There are many nearly identical reports in Jira and on Stack Overflow, but none of the suggestions help.
Versions:
HBase 1.2.0-cdh5.14.0
Kafka 3.0.0-1.3.0.0.p0.40
SPARK2 2.2.0.cloudera2-1.cdh5.12.0.p0.232957
hbase-client/hbase-spark(org.apache.hbase) 1.2.0-cdh5.11.1
Spark settings:
--num-executors=80
--conf spark.sql.shuffle.partitions=200
--conf spark.driver.memory=32g
--conf spark.executor.memory=32g
--conf spark.executor.cores=4
Cluster:
1+8 nodes, 70 CPU, 755Gb RAM, x10 HDD,
Log:
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 717 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 717 successfully in removeExecutor
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 717 has been removed (new total is 26)
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 705.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 705 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 705 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 705 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(705, lang32.ca.sbrf.ru, 22805, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 705 has been removed (new total is 25)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 705 successfully in removeExecutor
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 716.
18/04/09 13:51:56 INFO scheduler.DAGScheduler: Executor lost: 716 (epoch 45)
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 716 from BlockManagerMaster.
18/04/09 13:51:56 INFO cluster.YarnClusterScheduler: Executor 716 on lang32.ca.sbrf.ru killed by driver.
18/04/09 13:51:56 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(716, lang32.ca.sbrf.ru, 28678, None)
18/04/09 13:51:56 INFO spark.ExecutorAllocationManager: Existing executor 716 has been removed (new total is 24)
18/04/09 13:51:56 INFO storage.BlockManagerMaster: Removed 716 successfully in removeExecutor
18/04/09 13:51:56 WARN server.TransportChannelHandler: Exception in connection from /10.116.173.65:57542
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
18/04/09 13:51:56 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.116.173.65:57542 is closed
18/04/09 13:51:56 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 548.

Try setting these two parameters. Also try caching the DataFrame before writing to HBase.
spark.network.timeout
spark.executor.heartbeatInterval
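
As a minimal sketch of how these two settings relate: the Spark configuration documentation notes that spark.executor.heartbeatInterval should be significantly less than spark.network.timeout. The values below (600s and 60s) and the helper names are assumptions chosen for illustration only, not values recommended in this thread.

```scala
object TimeoutSettings {
  // Hypothetical values for illustration; tune them for your own cluster.
  val settings: Map[String, String] = Map(
    "spark.network.timeout" -> "600s",
    "spark.executor.heartbeatInterval" -> "60s"
  )

  // Parses a value such as "600s" into seconds (this sketch only handles the "s" suffix).
  private def seconds(value: String): Long = value.stripSuffix("s").toLong

  // The heartbeat interval is expected to be well below the network timeout.
  def heartbeatBelowTimeout(s: Map[String, String]): Boolean =
    seconds(s("spark.executor.heartbeatInterval")) < seconds(s("spark.network.timeout"))

  // Renders the settings as spark-submit arguments, sorted by key.
  def asSubmitArgs(s: Map[String, String]): Seq[String] =
    s.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

Calling TimeoutSettings.asSubmitArgs(TimeoutSettings.settings) yields the --conf pairs to append to the spark-submit command.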

Please see my related answer here: What are possible reasons for receiving TimeoutException: Futures timed out after [n seconds] when working with Spark
It also took me a while to understand why Cloudera states the following:
Dynamic allocation and Spark Streaming
If you are using Spark Streaming, Cloudera recommends that you disable
dynamic allocation by setting spark.dynamicAllocation.enabled to false
when running streaming applications.
Reference: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_dynamic_allocation_streaming
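
The recommendation quoted above could be applied roughly as follows. This is a configuration sketch, not the asker's code: it assumes spark-core is on the classpath, the application name is hypothetical, and the executor count simply mirrors the --num-executors=80 from the question.

```scala
import org.apache.spark.SparkConf

object StreamingConf {
  // Sketch only: a fixed-size executor pool for a streaming job.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("kafka-to-hbase-streaming")          // hypothetical name
      .set("spark.dynamicAllocation.enabled", "false") // per the Cloudera note above
      .set("spark.executor.instances", "80")           // mirrors --num-executors=80
}
```

With dynamic allocation off, the driver no longer removes idle executors, which is the activity visible in the log right before the connection reset.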

Related

ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up. - Spark standalone cluster

A Spark job (Scala/S3) ran fine for a few runs on a standalone cluster with spark-submit, but after a few runs it started failing with the error below. There were no changes to the code. It creates a connection to spark-master, but the application is killed almost immediately with the reason “All masters are unresponsive! Giving up”.
22/03/20 05:33:39 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-master:7077...
22/03/20 05:33:39 INFO TransportClientFactory: Successfully created connection to spark-master/xx.x.x.xxx:7077 after 42 ms (0 ms spent in bootstraps)
22/03/20 05:33:59 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-master:7077...
22/03/20 05:34:19 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-master:7077...
22/03/20 05:34:39 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
22/03/20 05:34:39 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
22/03/20 05:34:39 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33139.
22/03/20 05:34:39 INFO NettyBlockTransferService: Server created on a1326e4ae4bb:33139
22/03/20 05:34:39 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/03/20 05:34:39 INFO SparkUI: Stopped Spark web UI at http://xxxxxxxxxxxxx:4040
22/03/20 05:34:39 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, a1326e4ae4bb, 33139, None)
22/03/20 05:34:39 INFO StandaloneSchedulerBackend: Shutting down all executors
22/03/20 05:34:39 INFO BlockManagerMasterEndpoint: Registering block manager a1326e4ae4bb:33139 with 1168.8 MiB RAM, BlockManagerId(driver, a1326e4ae4bb, 33139, None)
22/03/20 05:34:39 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
22/03/20 05:34:39 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, a1326e4ae4bb, 33139, None)
22/03/20 05:34:39 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, a1326e4ae4bb, 33139, None)
22/03/20 05:34:39 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master
22/03/20 05:34:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/20 05:34:39 INFO MemoryStore: MemoryStore cleared
22/03/20 05:34:39 INFO BlockManager: BlockManager stopped
22/03/20 05:34:39 INFO BlockManagerMaster: BlockManagerMaster stopped
22/03/20 05:34:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/20 05:34:40 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:281)

Spark: Disconnected from Spark cluster! Waiting for reconnection

Folks!
I am attempting to move Spark processing to a cluster (Standalone). Previously, jobs ran successfully on a cluster set up with 1 Worker node + Master on the same machine. For example, one of the jobs ran without any issues on 4 GB and 2 cores. The same goes for local[*] mode.
Now I have set up the cluster in Kubernetes with 24.0 GB RAM and 6 cores (3 Workers + Master), and I am getting errors. For that simple job I am using all of the resources available in the cluster.
Spark 2.2.0, client mode.
spark-submit \
  --name RSS_Analysys \
  --class org.MainClass \
  --conf "spark.driver.extraJavaOptions=-Ddata=file:///aws/efs/data/" \
  --conf spark.executor.memory=8g \
  --conf spark.executor.cores=2 \
  --master spark://AWS-ELB:7077 \
  --deploy-mode client \
  --packages com.squareup.okhttp:okhttp:2.7.5 \
  file:///aws/efs/app/rss_app.jar
Driver output:
17/07/20 16:42:13 INFO TaskSetManager: Finished task 199.0 in stage 95.0 (TID 4017) in 4 ms on 172.12.0.1 (executor 2) (200/200)
17/07/20 16:42:13 INFO TaskSchedulerImpl: Removed TaskSet 95.0, whose tasks have all completed, from pool
17/07/20 16:42:13 INFO DAGScheduler: ShuffleMapStage 95 (head at RSSProcessor.scala:76) finished in 0.670 s
17/07/20 16:42:13 INFO DAGScheduler: looking for newly runnable stages
17/07/20 16:42:13 INFO DAGScheduler: running: Set(ShuffleMapStage 96)
17/07/20 16:42:13 INFO DAGScheduler: waiting: Set(ResultStage 97)
17/07/20 16:42:13 INFO DAGScheduler: failed: Set()
17/07/20 16:42:25 WARN StandaloneAppClient$ClientEndpoint: Connection to 172.12.0.1:7077 failed; waiting for master to reconnect...
17/07/20 16:42:25 WARN StandaloneSchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
17/07/20 16:42:26 ERROR TaskSchedulerImpl: Lost executor 0 on 172.12.0.2: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/07/20 16:42:26 ERROR TaskSchedulerImpl: Lost executor 1 on 172.12.0.3: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/07/20 16:42:26 INFO DAGScheduler: Executor lost: 0 (epoch 29)
17/07/20 16:42:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/07/20 16:42:26 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_9_0 !
17/07/20 16:42:26 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, 172.12.0.2, 42948, None)
17/07/20 16:42:26 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
17/07/20 16:42:26 INFO DAGScheduler: Shuffle files lost for executor: 0 (epoch 29)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 93 is now unavailable on executor 0 (124/200, false)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 95 is now unavailable on executor 0 (126/200, false)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 94 is now unavailable on executor 0 (0/1, false)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 92 is now unavailable on executor 0 (0/1, false)
17/07/20 16:42:26 INFO DAGScheduler: Executor lost: 1 (epoch 35)
17/07/20 16:42:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/07/20 16:42:26 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 172.12.0.3, 37556, None)
17/07/20 16:42:26 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
17/07/20 16:42:26 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 35)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 91 is now unavailable on executor 1 (0/1, false)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 93 is now unavailable on executor 1 (65/200, false)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 95 is now unavailable on executor 1 (41/200, false)
17/07/20 16:42:26 ERROR TaskSchedulerImpl: Lost executor 2 on 172.12.0.1: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/07/20 16:42:26 WARN TaskSetManager: Lost task 0.0 in stage 96.0 (TID 3617, 172.12.0.1, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/07/20 16:42:26 INFO DAGScheduler: Executor lost: 2 (epoch 41)
17/07/20 16:42:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/07/20 16:42:26 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 172.12.0.1, 33264, None)
17/07/20 16:42:26 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/07/20 16:42:26 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 41)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 93 is now unavailable on executor 2 (0/200, false)
17/07/20 16:42:26 INFO ShuffleMapStage: ShuffleMapStage 95 is now unavailable on executor 2 (0/200, false)
17/07/20 17:11:26 INFO BlockManagerInfo: Removed broadcast_50_piece0 on 172.12.0.4:36275 in memory (size: 12.1 KB, free: 366.2 MB)
17/07/20 17:11:26 INFO BlockManagerInfo: Removed broadcast_45_piece0 on 172.12.0.4:36275 in memory (size: 5.5 KB, free: 366.2 MB)
17/07/20 17:11:26 INFO BlockManagerInfo: Removed broadcast_49_piece0 on 172.12.0.4:36275 in memory (size: 12.1 KB, free: 366.2 MB)
Interestingly, 12 seconds passed before the error (maybe timeout settings are required?):
17/07/20 16:42:13 INFO DAGScheduler: failed: Set()
17/07/20 16:42:25 WARN StandaloneAppClient$ClientEndpoint: Connection to 172.12.0.1:7077 failed; waiting for master to reconnect...
Before that, I can see that some stages completed.
Any help or advice is highly appreciated.
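
The 12-second gap mentioned above can be checked directly from the log timestamps. The following is a small sketch using only java.time; the object and method names are hypothetical, and it assumes every log line starts with a 17-character "yy/MM/dd HH:mm:ss" timestamp, as in the driver output shown.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

object LogGap {
  // Spark's default log4j layout prefixes each line with "yy/MM/dd HH:mm:ss".
  private val fmt = DateTimeFormatter.ofPattern("yy/MM/dd HH:mm:ss")

  // Reads the timestamp from the first 17 characters of a log line.
  def timestampOf(line: String): LocalDateTime =
    LocalDateTime.parse(line.take(17), fmt)

  // Seconds elapsed between two log lines.
  def gapSeconds(earlier: String, later: String): Long =
    Duration.between(timestampOf(earlier), timestampOf(later)).getSeconds
}
```

Applied to the two lines quoted above (16:42:13 and 16:42:25), gapSeconds returns 12.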

Spark cluster can't assign resources from remote scala application

So, I've been trying to get Spark with Scala off the ground. I've written a simple test program, which just extends the SparkPi example a bit:
import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  test()
}

// Estimates Pi by Monte Carlo sampling.
// Returns Array(estimate, elapsed nanoseconds, relative error).
def calcPi(spark: SparkContext, args: Array[String], numSlices: Long): Array[Double] = {
  val start = System.nanoTime()
  val slices = if (args.length > 0) args(0).toInt else 2
  val n = math.min(numSlices * slices, Int.MaxValue).toInt // avoid overflow
  val count = spark.parallelize(1 until n, slices).map { i =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }.reduce(_ + _)
  val piVal = 4.0 * count / n
  println("Pi is roughly " + piVal)
  spark.stop()
  val end = System.nanoTime()
  Array(piVal, end - start, (piVal - Math.PI) / Math.PI)
}

def test(): Unit = {
  val conf = new SparkConf().setAppName("Pi Test")
  conf.setSparkHome("/usr/local/spark")
  conf.setMaster("spark://<URL_OF_SPARK_CLUSTER>:7077")
  conf.set("spark.executor.memory", "512m")
  conf.set("spark.cores.max", "1")
  // Pin the ports Spark would otherwise pick at random.
  conf.set("spark.blockManager.port", "33291")
  conf.set("spark.executor.port", "33292")
  conf.set("spark.broadcast.port", "33293")
  conf.set("spark.fileserver.port", "33294")
  conf.set("spark.driver.port", "33296")
  conf.set("spark.replClassServer.port", "33297")
  val sc = new SparkContext(conf)
  val pi = calcPi(sc, Array(), 1000)
  for (item <- pi) {
    println(item)
  }
}
I then made sure that ports 33291-33300 are open on my machine.
When I run the program, it successfully hits the Spark cluster and seems to assign cores.
But when the program gets to the point where it's actually running the Hadoop job, the application logs say:
15/12/07 11:50:21 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at BotDetector.scala:49), which has no missing parents
15/12/07 11:50:21 INFO MemoryStore: ensureFreeSpace(1840) called with curMem=0, maxMem=2061647216
15/12/07 11:50:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1840.0 B, free 1966.1 MB)
15/12/07 11:50:21 INFO MemoryStore: ensureFreeSpace(1194) called with curMem=1840, maxMem=2061647216
15/12/07 11:50:21 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1194.0 B, free 1966.1 MB)
15/12/07 11:50:21 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.5.106:33291 (size: 1194.0 B, free: 1966.1 MB)
15/12/07 11:50:21 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/12/07 11:50:21 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at BotDetector.scala:49)
15/12/07 11:50:21 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/12/07 11:50:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:50:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/0 is now EXITED (Command exited with code 1)
15/12/07 11:52:22 INFO SparkDeploySchedulerBackend: Executor app-20151207175020-0003/0 removed: Command exited with code 1
15/12/07 11:52:22 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor added: app-20151207175020-0003/1 on worker-20151207173821-10.240.0.7-33295 (10.240.0.7:33295) with 5 cores
15/12/07 11:52:22 INFO SparkDeploySchedulerBackend: Granted executor ID app-20151207175020-0003/1 on hostPort 10.240.0.7:33295 with 5 cores, 512.0 MB RAM
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/1 is now LOADING
15/12/07 11:52:23 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/1 is now RUNNING
15/12/07 11:52:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
and when I go onto the remote server and look at the worker logs, they say:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/apache-tez-0.7.0-src/tez-dist/target/tez-0.7.0/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/07 17:50:21 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/12/07 17:50:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/07 17:50:21 INFO spark.SecurityManager: Changing view acls to: hduser,jschirmer
15/12/07 17:50:21 INFO spark.SecurityManager: Changing modify acls to: hduser,jschirmer
15/12/07 17:50:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser, jschirmer); users with modify permissions: Set(hduser, jschirmer)
15/12/07 17:50:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/12/07 17:50:22 INFO Remoting: Starting remoting
15/12/07 17:50:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.240.0.7:33292]
15/12/07 17:50:22 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 33292.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:146)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:245)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:97)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:159)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
... 4 more
15/12/07 17:52:22 INFO util.Utils: Shutdown hook called
I've tried setting the driver and executor ports to explicitly open ports, with the same result. It's unclear what the problem is. Does anyone have any advice?
Also, note that if I compile this exact same code to a fat jar, and copy it to the remote server, and run it through spark-submit, then it runs successfully. I do have a yarn configuration defined on my server, and I'm open to running spark-yarn, but my understanding is that this cannot be done from a remote server, since you specify master as yarn-cluster, and there's no place to put the host in the config.
It seems you have a firewall problem. First, check whether all the required ports are open in your cluster. Spark also picks some ports at random, so you need to fix those ports in your configuration and open them as well; only then can you use Spark remotely.
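
One way to test the firewall hypothesis is a plain TCP connect from a worker machine to the ports pinned on the driver. This is a sketch using only java.net; the object name, method name, and default timeout are hypothetical. In the worker log above, the executor times out while setting up an RPC endpoint reference, which would be consistent with the driver not being reachable from the worker.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

object PortCheck {
  // Returns true if a TCP connection to host:port succeeds within timeoutMs.
  def isReachable(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      // Covers refused connections, timeouts, and unresolvable hosts.
      case _: IOException => false
    } finally {
      socket.close()
    }
  }
}
```

For example, running PortCheck.isReachable("<driver-host>", 33296) on a worker while the driver is up checks the spark.driver.port value set in the question. A false result suggests the port is blocked or the driver address is not routable from that worker.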

java.io.EOFException on Spark EC2 Cluster when submitting job programmatically

I really need your help to understand what I'm doing wrong.
The intent of my experiment is to run a Spark job programmatically instead of using ./spark-shell or ./spark-submit (both of these work for me).
Environment:
I've created a Spark cluster with 1 master and 1 worker using the ./spark-ec2 script.
The cluster looks good; however, when I try to run the following code packaged in a jar:
val logFile = "file:///root/spark/bin/README.md"
val conf = new SparkConf()
conf.setAppName("Simple App")
conf.setJars(List("file:///root/spark/bin/hello-apache-spark_2.10-1.0.0-SNAPSHOT.jar"))
conf.setMaster("spark://ec2-54-89-51-36.compute-1.amazonaws.com:7077")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(_.contains("a")).count()
val numBs = logData.filter(_.contains("b")).count()
println(s"1. Lines with a: $numAs, Lines with b: $numBs")
I get an exception:
[info] Running com.paycasso.SimpleApp
14/09/05 14:50:29 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/09/05 14:50:29 INFO SecurityManager: Changing view acls to: root
14/09/05 14:50:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root)
14/09/05 14:50:30 INFO Slf4jLogger: Slf4jLogger started
14/09/05 14:50:30 INFO Remoting: Starting remoting
14/09/05 14:50:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@ip-10-224-14-90.ec2.internal:54683]
14/09/05 14:50:30 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@ip-10-224-14-90.ec2.internal:54683]
14/09/05 14:50:30 INFO SparkEnv: Registering MapOutputTracker
14/09/05 14:50:30 INFO SparkEnv: Registering BlockManagerMaster
14/09/05 14:50:30 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140905145030-85cb
14/09/05 14:50:30 INFO MemoryStore: MemoryStore started with capacity 589.2 MB.
14/09/05 14:50:30 INFO ConnectionManager: Bound socket to port 47852 with id = ConnectionManagerId(ip-10-224-14-90.ec2.internal,47852)
14/09/05 14:50:30 INFO BlockManagerMaster: Trying to register BlockManager
14/09/05 14:50:30 INFO BlockManagerInfo: Registering block manager ip-10-224-14-90.ec2.internal:47852 with 589.2 MB RAM
14/09/05 14:50:30 INFO BlockManagerMaster: Registered BlockManager
14/09/05 14:50:30 INFO HttpServer: Starting HTTP Server
14/09/05 14:50:30 INFO HttpBroadcast: Broadcast server started at http://**.***.**.**:49211
14/09/05 14:50:30 INFO HttpFileServer: HTTP File server directory is /tmp/spark-e2748605-17ec-4524-983b-97aaf2f94b30
14/09/05 14:50:30 INFO HttpServer: Starting HTTP Server
14/09/05 14:50:31 INFO SparkUI: Started SparkUI at http://ip-10-224-14-90.ec2.internal:4040
14/09/05 14:50:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/05 14:50:32 INFO SparkContext: Added JAR file:///root/spark/bin/hello-apache-spark_2.10-1.0.0-SNAPSHOT.jar at http://**.***.**.**:46491/jars/hello-apache-spark_2.10-1.0.0-SNAPSHOT.jar with timestamp 1409928632274
14/09/05 14:50:32 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-89-51-36.compute-1.amazonaws.com:7077...
14/09/05 14:50:32 INFO MemoryStore: ensureFreeSpace(163793) called with curMem=0, maxMem=617820979
14/09/05 14:50:32 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 160.0 KB, free 589.0 MB)
14/09/05 14:50:32 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140905145032-0005
14/09/05 14:50:32 INFO AppClient$ClientActor: Executor added: app-20140905145032-0005/0 on worker-20140905141732-ip-10-80-90-29.ec2.internal-57457 (ip-10-80-90-29.ec2.internal:57457) with 2 cores
14/09/05 14:50:32 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140905145032-0005/0 on hostPort ip-10-80-90-29.ec2.internal:57457 with 2 cores, 512.0 MB RAM
14/09/05 14:50:32 INFO AppClient$ClientActor: Executor updated: app-20140905145032-0005/0 is now RUNNING
14/09/05 14:50:33 INFO FileInputFormat: Total input paths to process : 1
14/09/05 14:50:33 INFO SparkContext: Starting job: count at SimpleApp.scala:26
14/09/05 14:50:33 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:26) with 1 output partitions (allowLocal=false)
14/09/05 14:50:33 INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:26)
14/09/05 14:50:33 INFO DAGScheduler: Parents of final stage: List()
14/09/05 14:50:33 INFO DAGScheduler: Missing parents: List()
14/09/05 14:50:33 INFO DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:26), which has no missing parents
14/09/05 14:50:33 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:26)
14/09/05 14:50:33 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/09/05 14:50:36 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-10-80-90-29.ec2.internal:36966/user/Executor#2034537974] with ID 0
14/09/05 14:50:36 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: ip-10-80-90-29.ec2.internal (PROCESS_LOCAL)
14/09/05 14:50:36 INFO TaskSetManager: Serialized task 0.0:0 as 1880 bytes in 8 ms
14/09/05 14:50:37 INFO BlockManagerInfo: Registering block manager ip-10-80-90-29.ec2.internal:59950 with 294.9 MB RAM
14/09/05 14:50:38 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/09/05 14:50:38 WARN TaskSetManager: Loss was due to java.io.EOFException
java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744)
at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1032)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:147)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
What I'm actually doing is calling "sbt run", so I assemble the Scala project and run it.
By the way, I run that project on the master host, so the driver is definitely visible to the worker host.
Any help is appreciated. It's very strange that such a simple example doesn't work on a cluster. Using ./spark-submit is not convenient, I believe.
Thanks in advance.
After wasting a lot of time, I found the problem. Even though I don't use Hadoop/HDFS in my application, the Hadoop client matters. The problem was the hadoop-client version: it differed from the Hadoop version Spark was built for. Spark was built for Hadoop 1.2.1, but my application used 2.4.
Once I changed the hadoop-client version to 1.2.1 in my app, I was able to execute Spark code on the cluster.
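
In an sbt build, the fix described above might look like the fragment below. This is a sketch of a build.sbt entry, not the answerer's actual build file; the rest of the build (Scala version, Spark dependency) is not shown in the thread and is left out here.

```scala
// build.sbt fragment (sketch): pin hadoop-client to the Hadoop version
// that the cluster's Spark distribution was built for (1.2.1 in this answer).
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.2.1"
```

The point is that the driver and the executors must agree on the Hadoop serialization classes, since the stack trace fails while deserializing a Hadoop FileSplit.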

An HTTP server starts when launching a Spark jar on a machine; what is it?

I want to use machine A to submit my Spark job to the cluster. A has no Spark environment, just Java. When I launch the jar, an HTTP server starts:
[steven@bj-230 ~]$ java -jar helloCluster.jar SimplyApp
log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/06/10 16:54:54 INFO SparkEnv: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/06/10 16:54:54 INFO SparkEnv: Registering BlockManagerMaster
14/06/10 16:54:54 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140610165454-4393
14/06/10 16:54:54 INFO MemoryStore: MemoryStore started with capacity 1055.1 MB.
14/06/10 16:54:54 INFO ConnectionManager: Bound socket to port 59981 with id = ConnectionManagerId(bj-230,59981)
14/06/10 16:54:54 INFO BlockManagerMaster: Trying to register BlockManager
14/06/10 16:54:54 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager bj-230:59981 with 1055.1 MB RAM
14/06/10 16:54:54 INFO BlockManagerMaster: Registered BlockManager
14/06/10 16:54:54 INFO HttpServer: Starting HTTP Server
14/06/10 16:54:54 INFO HttpBroadcast: Broadcast server started at http://10.10.10.230:59233
14/06/10 16:54:54 INFO SparkEnv: Registering MapOutputTracker
14/06/10 16:54:54 INFO HttpFileServer: HTTP File server directory is /tmp/spark-bfdd02f1-3c02-4233-854f-af89542b9acf
14/06/10 16:54:54 INFO HttpServer: Starting HTTP Server
14/06/10 16:54:54 INFO SparkUI: Started Spark Web UI at http://bj-230:4040
14/06/10 16:54:54 INFO SparkContext: Added JAR hdfs://master:8020/tmp/helloCluster.jar at hdfs://master:8020/tmp/helloCluster.jar with timestamp 1402390494838
14/06/10 16:54:54 INFO AppClient$ClientActor: Connecting to master spark://master:7077...
So, what is the purpose of this server? And if I am behind a NAT, is it possible to use machine A to submit my job to a remote cluster?
By the way, this execution fails. Error log:
14/06/10 16:55:05 INFO SparkDeploySchedulerBackend: Executor app-20140610165321-0005/7 removed: Command exited with code 1
14/06/10 16:55:05 ERROR AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/10 16:55:05 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/10 16:55:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
The Spark driver starts a few HTTP endpoints:
It provides a web console that shows the job progress. This HTTP endpoint has a default port of 4040, which can be changed with the configuration option spark.ui.port. You connect to it with your browser at http://your_host:4040 to follow the job. It is only alive while the driver runs.
There is an additional HTTP endpoint that provides a file download service for the jars declared as dependencies. The workers contact the driver to download the dependencies. This port is assigned at random. Therefore, the driver must be on a network routable from the Spark workers.
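
As a configuration sketch of the points above, the driver-side settings can be pinned in the application's SparkConf. It assumes spark-core is on the classpath. The master URL comes from the question's log, while the application name, port numbers, and host address are placeholders, and spark.driver.host and spark.driver.port are included here as an assumption about what a NAT setup would additionally need.

```scala
import org.apache.spark.SparkConf

object DriverEndpoints {
  // Sketch only: pins the driver-side ports so they can be opened or forwarded.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("helloCluster")                // hypothetical name
      .setMaster("spark://master:7077")          // master URL from the log above
      .set("spark.ui.port", "4041")              // web console; default is 4040
      .set("spark.driver.host", "203.0.113.10")  // placeholder: an address the workers can route to
      .set("spark.driver.port", "33296")         // placeholder fixed port
}
```

Pinning ports does not by itself make a driver behind a NAT reachable: the workers still have to be able to open connections to the address given in spark.driver.host.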