Spark broadcast/serialize error - Scala

I've created a CLI driver for a Spark version of a Mahout job called "item similarity", with several tests that all work fine on local[4] Spark standalone. The code even reads and writes to clustered HDFS. But switching to clustered Spark exposes a problem that seems tied to a broadcast and/or serialization.
The code uses HashBiMap, which is a Guava (Java) class. Two of these are created for every Mahout DRM (a distributed matrix), for bi-directional lookup of row and column IDs. They are created once and then broadcast for access everywhere.
When I run this on clustered Spark I get the error below. At one point we were using HashMaps and they seemed to work on the cluster, so I suspect something about the HashBiMap is causing the problem. I'm also suspicious that it may have to do with serialization in the broadcast. Here is a snippet of the code and the error.
// create BiMaps for bi-directional lookup of an ID by either Mahout ID or external ID;
// broadcast them for access in distributed processes, so they are not recalculated in every task.
// rowIDDictionary is a HashBiMap[String, Int]
val rowIDDictionary = asOrderedDictionary(rowIDs) // creates the HashBiMap in a non-distributed manner
val rowIDDictionary_bcast = mc.broadcast(rowIDDictionary)

val columnIDDictionary = asOrderedDictionary(columnIDs) // creates the HashBiMap in a non-distributed manner
val columnIDDictionary_bcast = mc.broadcast(columnIDDictionary)

val indexedInteractions =
  interactions.map { case (rowID, columnID) => // <<<<<<<<<<< this is the stage being submitted before the error
    val rowIndex = rowIDDictionary_bcast.value.get(rowID)       // HashBiMap#get returns the value directly
    val columnIndex = columnIDDictionary_bcast.value.get(columnID)
    rowIndex -> columnIndex
  }
The error seems to happen while executing interactions.map, when the _bcast vals are first accessed. Any idea where to start looking?
14/06/26 11:23:36 INFO scheduler.DAGScheduler: Submitting Stage 9 (MappedRDD[17] at map at TextDelimitedReaderWriter.scala:83), which has no missing parents
14/06/26 11:23:36 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 9 (MappedRDD[17] at map at TextDelimitedReaderWriter.scala:83)
14/06/26 11:23:36 INFO scheduler.TaskSchedulerImpl: Adding task set 9.0 with 2 tasks
14/06/26 11:23:36 INFO scheduler.TaskSetManager: Starting task 9.0:0 as TID 16 on executor 0: occam4 (PROCESS_LOCAL)
14/06/26 11:23:36 INFO scheduler.TaskSetManager: Serialized task 9.0:0 as 2418 bytes in 0 ms
14/06/26 11:23:36 INFO scheduler.TaskSetManager: Starting task 9.0:1 as TID 17 on executor 0: occam4 (PROCESS_LOCAL)
14/06/26 11:23:36 INFO scheduler.TaskSetManager: Serialized task 9.0:1 as 2440 bytes in 0 ms
14/06/26 11:23:36 WARN scheduler.TaskSetManager: Lost TID 16 (task 9.0:0)
14/06/26 11:23:36 WARN scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException
java.lang.NullPointerException
at com.google.common.collect.HashBiMap.seekByKey(HashBiMap.java:180)
at com.google.common.collect.HashBiMap.put(HashBiMap.java:230)
at com.google.common.collect.HashBiMap.put(HashBiMap.java:218)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:102)
at org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:165)
at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1775)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at org.apache.spark.scheduler.ShuffleMapTask$.deserializeInfo(ShuffleMapTask.scala:69)
at org.apache.spark.scheduler.ShuffleMapTask.readExternal(ShuffleMapTask.scala:138)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1814)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1773)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1327)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)

It looks like you are using Kryo serialization; are you using it in your local tests as well? You may wish to explicitly register HashBiMap with Kryo so that it falls back to Java serialization, since Kryo's generic map serialization for HashBiMap doesn't succeed (the NullPointerException above is thrown while Kryo's MapSerializer rebuilds the HashBiMap).
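A minimal sketch of such a registrator (the class name MyKryoRegistrator is a placeholder; adapt it to your project):

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import com.google.common.collect.HashBiMap
import org.apache.spark.serializer.KryoRegistrator

// Make Kryo serialize HashBiMap with plain Java serialization instead of
// its generic MapSerializer, which is what fails in the stack trace above.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[HashBiMap[_, _]], new JavaSerializer())
  }
}

Then point Spark at it when building the context (use the fully qualified class name if the registrator lives in a package):

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "MyKryoRegistrator")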

Related

Out of memory exception or worker node lost during the Spark Scala job

I am executing a Spark Scala job using spark-shell, and the problem I am facing is that at the end of the final stage (stage 5) it allocates 50 tasks: 49 complete very quickly, but the 50th runs for 5 minutes and then fails with an out-of-memory error. I am using SPARK_MAJOR_VERSION=2.
I am using the command below:
spark-shell --master yarn \
  --conf spark.driver.memory=30G \
  --conf spark.executor.memory=40G \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.sql.broadcastTimeout=36000 \
  --conf spark.shuffle.compress=true \
  --conf spark.executor.heartbeatInterval=3600s \
  --conf spark.executor.instances=160
Starting from the above configuration, I have also tried setting dynamic allocation to true and starting the driver and executor memory at 1 GB. I have 6.78 TB of RAM and 1300 VCores in total (this is my entire Hadoop hardware).
The table I am reading is 40 GB, and I am joining 6 tables to that 40 GB table, so overall it might be about 60 GB. Spark initializes 4 stages for this, and it fails at the end of the final stage. I am using Spark SQL to execute the query.
Below are the errors:
19/04/26 14:29:02 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 125967 ms exceeds timeout 120000 ms
19/04/26 14:29:02 ERROR YarnScheduler: Lost executor 2 on worker03.some.com: Executor heartbeat timed out after 125967 ms
19/04/26 14:29:02 WARN TaskSetManager: Lost task 5.0 in stage 2.0 (TID 119, worker03.some.com, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 125967 ms
19/04/26 14:29:02 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 126225 ms exceeds timeout 120000 ms
19/04/26 14:29:02 ERROR YarnScheduler: Lost executor 1 on ncednhpwrka0008.devhadoop.charter.com: Executor heartbeat timed out after 126225 ms
19/04/26 14:29:02 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e1223_1556277056929_0976_01_000003 on host: worker03.some.com. Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e1223_1556277056929_0976_01_000003
Exit code: 52
Shell output: main : command provided 1
main : run as user is svc-bd-xdladmrw-dev
main : requested yarn user is svc-bd-xdladmrw-dev
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /data/00/yarn/local/nmPrivate/application_1556277056929_0976/container_e1223_1556277056929_0976_01_000003/container_e1223_1556277056929_0976_01_000003.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...
Container exited with a non-zero exit code 52. Last 4096 bytes of stderr :
0 in stage 2.0 (TID 119)
19/04/26 14:27:37 INFO HadoopRDD: Input split: hdfs://datadev/data/dev/HIVE_SCHEMA/somedb.db/sbscr_usge_cycl_key_xref/000000_0_copy_2:0+6623042
19/04/26 14:27:37 INFO OrcRawRecordMerger: min key = null, max key = null
19/04/26 14:27:37 INFO ReaderImpl: Reading ORC rows from hdfs://datadev/data/dev/HIVE_SCHEMA/somedb.db/sbscr_usge_cycl_key_xref/000000_0_copy_2 with {include: [true, true, true], offset: 0, length: 9223372036854775807}
19/04/26 14:29:00 ERROR Executor: Exception in task 5.0 in stage 2.0 (TID 119)
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/04/26 14:29:00 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 119,5,main]
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/04/26 14:29:00 INFO DiskBlockManager: Shutdown hook called
19/04/26 14:29:00 INFO ShutdownHookManager: Shutdown hook called
19/04/26 14:29:02 ERROR YarnScheduler: Lost executor 2 on worker03.some.com: Container marked as failed: container_e1223_1556277056929_0976_01_000003 on host: worker03.some.com. Exit status: 52. Diagnostics: Exception from container-launch.
Can anyone let me know if I am doing anything wrong here, like the memory allocation or something? Please suggest any alternatives to complete this job without getting the out-of-memory exception or worker-node-lost error. Any help or info is greatly appreciated.
Thanks!
at the end of the final stage (stage 5) it allocates 50 tasks: 49 complete very quickly, but the 50th runs for 5 minutes and then fails with an out-of-memory error
The table I am reading is 40 GB, and I am joining 6 tables to that 40 GB table
This sounds like skewed data to me: most of the keys used for joining are in one partition, so instead of spreading the work among multiple executors, Spark uses just one and overloads it. That affects both memory consumption and performance.
There are a few ways to deal with it; see these questions and the salting sketch after the links:
Skewed dataset join in Spark?
How to repartition a dataframe in Spark scala on a skewed column?
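One common technique, sketched below, is to "salt" the join key. The sketch assumes Spark SQL DataFrames; the names big, small, and the join column "key" are placeholders, not anything from the question:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Placeholder inputs: any two DataFrames sharing a join column "key".
val big: DataFrame = spark.table("big_table")     // large, skewed side
val small: DataFrame = spark.table("small_table") // small side

val numSalts = 16

// Tag every row on the skewed side with a random salt in [0, numSalts).
val bigSalted = big.withColumn("salt", (rand() * numSalts).cast("int"))

// Replicate the small side once per salt value so every salted key still matches.
val saltValues = spark.range(numSalts).select($"id".cast("int").as("salt"))
val smallSalted = small.crossJoin(saltValues)

// Join on (key, salt): each hot key is now spread across up to numSalts partitions.
val joined = bigSalted.join(smallSalted, Seq("key", "salt")).drop("salt")

The cost is numSalts copies of the small side, which is usually cheap compared to one executor holding an entire hot key.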

Spark standalone mode: Submitting jobs programmatically

I am new to Spark and I am trying to submit the "quick-start" job from my app. I try to emulate standalone mode by starting a master and a slave on my localhost.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "/opt/spark-2.0.0-bin-hadoop2.7/README.md"
    val conf = new SparkConf().setAppName("SimpleApp")
    conf.setMaster("spark://10.49.30.77:7077")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, lines with b: %s".format(numAs, numBs))
  }
}
I run my Spark app from my IDE (IntelliJ).
Looking at the logs (on the worker node), it seems Spark cannot find the job's classes.
16/09/15 17:50:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1912.0 B, free 366.3 MB)
16/09/15 17:50:58 INFO TorrentBroadcast: Reading broadcast variable 1 took 137 ms
16/09/15 17:50:58 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 366.3 MB)
16/09/15 17:50:58 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
1. Does this mean the job resources (classes) are not transmitted to the slave node?
2. For standalone mode, must I submit jobs using the "spark-submit" CLI? If so, how can I submit Spark jobs from an application (for example, a webapp)?
3. Also, an unrelated question: I see in the logs that the driver program starts a server on port 4040. What is the purpose of this? The driver program being the client, why does it start this service?
16/09/15 17:50:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/09/15 17:50:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/09/15 17:50:53 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.49.30.77:4040
You should either set the paths to your job's resources in SparkConf using the setJars method, or provide them to the spark-submit command with the --jars option when running from the CLI.
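If you want to stay programmatic, a minimal sketch (the jar path below is a placeholder for wherever your build writes the application jar):

import org.apache.spark.{SparkConf, SparkContext}

// Ship the jar containing SimpleApp (including its anonymous function
// classes such as SimpleApp$$anonfun$1) to the executors.
val conf = new SparkConf()
  .setAppName("SimpleApp")
  .setMaster("spark://10.49.30.77:7077")
  .setJars(Seq("target/scala-2.11/simple-app_2.11-1.0.jar")) // placeholder path
val sc = new SparkContext(conf)

With the jar shipped, the executors can load the closure classes and the ClassNotFoundException should go away.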

Spark cluster can't assign resources from remote scala application

So, I've been trying to get off the ground running Spark with Scala. I've written a simple test program which just extends the SparkPi example a bit:
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

def main(args: Array[String]): Unit = {
  test()
}

def calcPi(spark: SparkContext, args: Array[String], numSlices: Long): Array[Double] = {
  val start = System.nanoTime()
  val slices = if (args.length > 0) args(0).toInt else 2
  val n = math.min(numSlices * slices, Int.MaxValue).toInt // avoid overflow
  val count = spark.parallelize(1 until n, slices).map { i =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }.reduce(_ + _)
  val piVal = 4.0 * count / n
  println("Pi is roughly " + piVal)
  spark.stop()
  val end = System.nanoTime()
  Array(piVal, end - start, (piVal - Math.PI) / Math.PI)
}

def test(): Unit = {
  val conf = new SparkConf().setAppName("Pi Test")
  conf.setSparkHome("/usr/local/spark")
  conf.setMaster("spark://<URL_OF_SPARK_CLUSTER>:7077")
  conf.set("spark.executor.memory", "512m")
  conf.set("spark.cores.max", "1")
  conf.set("spark.blockManager.port", "33291")
  conf.set("spark.executor.port", "33292")
  conf.set("spark.broadcast.port", "33293")
  conf.set("spark.fileserver.port", "33294")
  conf.set("spark.driver.port", "33296")
  conf.set("spark.replClassServer.port", "33297")
  val sc = new SparkContext(conf)
  val pi = calcPi(sc, Array(), 1000)
  for (item <- pi) {
    println(item)
  }
}
I then made sure that ports 33291-33300 are open on my machine.
When I run the program, it successfully hits the Spark cluster and seems to be assigned cores.
But when the program gets to the point where it's actually running the job, the application logs say:
15/12/07 11:50:21 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at BotDetector.scala:49), which has no missing parents
15/12/07 11:50:21 INFO MemoryStore: ensureFreeSpace(1840) called with curMem=0, maxMem=2061647216
15/12/07 11:50:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1840.0 B, free 1966.1 MB)
15/12/07 11:50:21 INFO MemoryStore: ensureFreeSpace(1194) called with curMem=1840, maxMem=2061647216
15/12/07 11:50:21 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1194.0 B, free 1966.1 MB)
15/12/07 11:50:21 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.5.106:33291 (size: 1194.0 B, free: 1966.1 MB)
15/12/07 11:50:21 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/12/07 11:50:21 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at BotDetector.scala:49)
15/12/07 11:50:21 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/12/07 11:50:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:50:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/0 is now EXITED (Command exited with code 1)
15/12/07 11:52:22 INFO SparkDeploySchedulerBackend: Executor app-20151207175020-0003/0 removed: Command exited with code 1
15/12/07 11:52:22 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor added: app-20151207175020-0003/1 on worker-20151207173821-10.240.0.7-33295 (10.240.0.7:33295) with 5 cores
15/12/07 11:52:22 INFO SparkDeploySchedulerBackend: Granted executor ID app-20151207175020-0003/1 on hostPort 10.240.0.7:33295 with 5 cores, 512.0 MB RAM
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/1 is now LOADING
15/12/07 11:52:23 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/1 is now RUNNING
15/12/07 11:52:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
and when I go onto the remote server and look at the worker logs, they say:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/apache-tez-0.7.0-src/tez-dist/target/tez-0.7.0/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/07 17:50:21 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/12/07 17:50:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/07 17:50:21 INFO spark.SecurityManager: Changing view acls to: hduser,jschirmer
15/12/07 17:50:21 INFO spark.SecurityManager: Changing modify acls to: hduser,jschirmer
15/12/07 17:50:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser, jschirmer); users with modify permissions: Set(hduser, jschirmer)
15/12/07 17:50:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/12/07 17:50:22 INFO Remoting: Starting remoting
15/12/07 17:50:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher#10.240.0.7:33292]
15/12/07 17:50:22 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 33292.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:146)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:245)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:97)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:159)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
... 4 more
15/12/07 17:52:22 INFO util.Utils: Shutdown hook called
I've tried setting the driver and executor ports to explicitly opened ports, with the same result. It's unclear what the problem is. Does anyone have any advice?
Also, note that if I compile this exact same code into a fat jar, copy it to the remote server, and run it through spark-submit, it runs successfully. I do have a YARN configuration defined on my server, and I'm open to running Spark on YARN, but my understanding is that this cannot be done from a remote server, since you specify the master as yarn-cluster and there's no place to put the host in the config.
It seems you have a firewall problem. First, check that all the required ports are open in your cluster. Spark also picks some ports at random, so you need to pin those ports to fixed values for your cluster; only then can you use Spark remotely.
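One concrete thing to try, based on the executor timing out while fetching driver properties (an educated guess, not something the logs confirm): pin the address and port the executors use to reach back to the driver, and open that port too.

// Sketch: fix the driver's advertised address and port so firewall rules can match them.
// 192.168.5.106 is the driver address that appears in the application logs above;
// substitute whatever address the workers can actually route to.
conf.set("spark.driver.host", "192.168.5.106")
conf.set("spark.driver.port", "33296") // already set above; must be open on the driver machine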

Why, after some stages, are all the tasks assigned to one machine (executor) in Spark?

I encountered this problem: when I run a machine learning task on Spark, after some stages all the tasks are assigned to one machine (executor), and the stage execution gets slower and slower.
[the Spark conf setting]
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setMaster(sparkMaster)
  .setAppName("ModelTraining")
  .setSparkHome(sparkHome)
  .setJars(List(jarFile))
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "LRRegistrator")
conf.set("spark.storage.memoryFraction", "0.7")
conf.set("spark.executor.memory", "8g")
conf.set("spark.cores.max", "150")
conf.set("spark.speculation", "true")
conf.set("spark.storage.blockManagerHeartBeatMs", "300000")

val sc = new SparkContext(conf)
val lines = sc.textFile("hdfs://xxx:52310" + inputPath, 3)
val trainset = lines.map(parseWeightedPoint).repartition(50).persist(StorageLevel.MEMORY_ONLY)
[the WARN log from Spark]
14/09/19 10:26:23 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(45, TS-BH109, 48384, 0)
14/09/19 10:27:18 WARN TaskSetManager: Lost TID 726 (task 14.0:9)
14/09/19 10:29:03 WARN SparkDeploySchedulerBackend: Ignored task status update (737 state FAILED) from unknown executor Actor[akka.tcp://sparkExecutor#TS-BH96:33178/user/Executor#-913985102] with ID 39
14/09/19 10:29:03 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(30, TS-BH136, 28518, 0)
14/09/19 11:01:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(47, TS-BH136, 31644, 0) with no recent heart beats: 47765ms exceeds 45000ms
Any suggestions?
Can you post the executor log -- anything suspect there? In particular, you have Lost TID 726 (task 14.0:9). Further up in the driver log you should see which executor TID 726 was assigned to -- I'd check the error log on that machine and see if anything of interest shows up there.
My guess (and it's only a guess) is that your executor is crashing. At that point the system will try to launch a new executor, but that is generally slow. In the meantime, the current task might get resent to an existing executor, hosing your system further.

Apache Spark -- MLlib -- Collaborative filtering

I'm trying to use MLlib for my collaborative filtering.
I encounter the following error in my Scala program when I run it on Apache Spark 1.0.0.
14/07/15 16:16:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/15 16:16:31 WARN LoadSnappy: Snappy native library not loaded
14/07/15 16:16:31 INFO FileInputFormat: Total input paths to process : 1
14/07/15 16:16:38 WARN TaskSetManager: Lost TID 10 (task 80.0:0)
14/07/15 16:16:38 WARN TaskSetManager: Loss was due to java.lang.UnsatisfiedLinkError
java.lang.UnsatisfiedLinkError: org.jblas.NativeBlas.dposv(CII[DII[DII)I
at org.jblas.NativeBlas.dposv(Native Method)
at org.jblas.SimpleBlas.posv(SimpleBlas.java:369)
at org.jblas.Solve.solvePositive(Solve.java:68)
at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$2.apply(ALS.scala:522)
at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$2.apply(ALS.scala:509)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:156)
at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:509)
at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:445)
at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:444)
at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
14/07/15 16:16:38 ERROR TaskSchedulerImpl: Lost executor 0 on maroki.office.mkechinov.ru: Uncaught exception
14/07/15 16:16:38 WARN TaskSetManager: Lost TID 12 (task 80.0:0)
14/07/15 16:16:42 WARN TaskSetManager: Lost TID 18 (task 80.0:1)
14/07/15 16:16:42 WARN TaskSetManager: Loss was due to fetch failure from null
14/07/15 16:16:42 WARN TaskSetManager: Loss was due to fetch failure from null
14/07/15 16:16:43 WARN TaskSetManager: Lost TID 25 (task 80.1:0)
14/07/15 16:16:43 WARN TaskSetManager: Loss was due to java.lang.UnsatisfiedLinkError
How can I solve this error?
The Spark documentation clearly mentions that MLlib uses native libraries, which need to be present on the nodes (that is, they do not come with the Spark installation):
MLlib uses the jblas linear algebra library, which itself depends on native Fortran routines. You may need to install the gfortran runtime library if it is not already present on your nodes. MLlib will throw a linking error if it cannot detect these libraries automatically.
You have to make sure that the libgfortran library exists on all nodes.
For Debian/Ubuntu, use:
sudo apt-get install libgfortran3
For CentOS, use:
sudo yum install gcc-gfortran
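As a quick way to verify the fix on a node (a sketch, not part of the original answer): solve a tiny positive-definite system with jblas from a scala or spark-shell prompt, which exercises the same native dposv routine that ALS calls in the stack trace above.

import org.jblas.{DoubleMatrix, Solve}

// Solves A * x = b for a 2x2 positive-definite A. If libgfortran (or the
// jblas native bindings) is missing, this throws the same
// UnsatisfiedLinkError on NativeBlas.dposv seen in the ALS stack trace.
val a = new DoubleMatrix(Array(Array(4.0, 1.0), Array(1.0, 3.0)))
val b = new DoubleMatrix(Array(2.0, 4.0))
val x = Solve.solvePositive(a, b)
println(x) // expect roughly [0.18; 1.27]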