spark-submit in local mode: executor cannot fetch JAR - scala

I am trying to run an example from the Spark docs:
https://spark.apache.org/docs/1.2.0/quick-start.html
Whenever I run the code from the Self-Contained Applications section, I get the following output:
16/08/28 13:18:30 INFO SparkContext: Running Spark version 1.5.1
16/08/28 13:18:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/28 13:18:31 INFO SecurityManager: Changing view acls to: alejandrohernandez
16/08/28 13:18:31 INFO SecurityManager: Changing modify acls to: alejandrohernandez
16/08/28 13:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(alejandrohernandez); users with modify permissions: Set(alejandrohernandez)
16/08/28 13:18:31 INFO Slf4jLogger: Slf4jLogger started
16/08/28 13:18:31 INFO Remoting: Starting remoting
16/08/28 13:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#192.168.15.3:56988]
16/08/28 13:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 56988.
16/08/28 13:18:31 INFO SparkEnv: Registering MapOutputTracker
16/08/28 13:18:31 INFO SparkEnv: Registering BlockManagerMaster
16/08/28 13:18:31 INFO DiskBlockManager: Created local directory at /private/var/folders/lb/78w91_l123n0cvprhmldkxhc0000gp/T/blockmgr-be8bedf7-96fe-425b-8344-c668110905eb
16/08/28 13:18:31 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/08/28 13:18:31 INFO HttpFileServer: HTTP File server directory is /private/var/folders/lb/78w91_l123n0cvprhmldkxhc0000gp/T/spark-a122037d-3228-4e53-b3dd-6d7213187df0/httpd-e3388b36-1605-4cc5-a4c1-def1b7660570
16/08/28 13:18:31 INFO HttpServer: Starting HTTP Server
16/08/28 13:18:31 INFO Utils: Successfully started service 'HTTP file server' on port 56989.
16/08/28 13:18:31 INFO SparkEnv: Registering OutputCommitCoordinator
16/08/28 13:18:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/08/28 13:18:31 INFO SparkUI: Started SparkUI at http://192.168.15.3:4040
16/08/28 13:18:31 INFO SparkContext: Added JAR file:/Users/alejandrohernandez/repos/AssetBreakdownUploader/target/scala-2.10/AssetBreakdownUploader-0.1-SNAPSHOT.jar at http://192.168.15.3:56989/jars/AssetBreakdownUploader-0.1-SNAPSHOT.jar with timestamp 1472408311863
16/08/28 13:18:31 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/08/28 13:18:31 INFO Executor: Starting executor ID driver on host localhost
16/08/28 13:18:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 56990.
16/08/28 13:18:31 INFO NettyBlockTransferService: Server created on 56990
16/08/28 13:18:31 INFO BlockManagerMaster: Trying to register BlockManager
16/08/28 13:18:31 INFO BlockManagerMasterEndpoint: Registering block manager localhost:56990 with 530.0 MB RAM, BlockManagerId(driver, localhost, 56990)
16/08/28 13:18:31 INFO BlockManagerMaster: Registered BlockManager
16/08/28 13:18:32 INFO MemoryStore: ensureFreeSpace(108600) called with curMem=0, maxMem=555755765
16/08/28 13:18:32 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 106.1 KB, free 529.9 MB)
16/08/28 13:18:32 INFO MemoryStore: ensureFreeSpace(11386) called with curMem=108600, maxMem=555755765
16/08/28 13:18:32 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 11.1 KB, free 529.9 MB)
16/08/28 13:18:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:56990 (size: 11.1 KB, free: 530.0 MB)
16/08/28 13:18:32 INFO SparkContext: Created broadcast 0 from textFile at Main.scala:12
16/08/28 13:18:33 INFO FileInputFormat: Total input paths to process : 1
16/08/28 13:18:33 INFO SparkContext: Starting job: count at Main.scala:13
16/08/28 13:18:33 INFO DAGScheduler: Got job 0 (count at Main.scala:13) with 1 output partitions
16/08/28 13:18:33 INFO DAGScheduler: Final stage: ResultStage 0(count at Main.scala:13)
16/08/28 13:18:33 INFO DAGScheduler: Parents of final stage: List()
16/08/28 13:18:33 INFO DAGScheduler: Missing parents: List()
16/08/28 13:18:33 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at Main.scala:13), which has no missing parents
16/08/28 13:18:33 INFO MemoryStore: ensureFreeSpace(3224) called with curMem=119986, maxMem=555755765
16/08/28 13:18:33 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 529.9 MB)
16/08/28 13:18:33 INFO MemoryStore: ensureFreeSpace(1925) called with curMem=123210, maxMem=555755765
16/08/28 13:18:33 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1925.0 B, free 529.9 MB)
16/08/28 13:18:33 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:56990 (size: 1925.0 B, free: 530.0 MB)
16/08/28 13:18:33 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
16/08/28 13:18:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at Main.scala:13)
16/08/28 13:18:33 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/08/28 13:18:33 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2258 bytes)
16/08/28 13:18:33 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/08/28 13:18:33 INFO Executor: Fetching http://192.168.15.3:56989/jars/AssetBreakdownUploader-0.1-SNAPSHOT.jar with timestamp 1472408311863
16/08/28 13:19:33 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/08/28 13:19:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/08/28 13:19:33 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
16/08/28 13:19:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/08/28 13:19:33 INFO TaskSchedulerImpl: Cancelling stage 0
16/08/28 13:19:33 INFO DAGScheduler: ResultStage 0 (count at Main.scala:13) failed in 60.069 s
16/08/28 13:19:33 INFO DAGScheduler: Job 0 failed: count at Main.scala:13, took 60.144276 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
at com.ooyala.uploader.Main$.main(Main.scala:13)
at com.ooyala.uploader.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/08/28 13:19:33 INFO SparkContext: Invoking stop() from shutdown hook
16/08/28 13:19:33 INFO SparkUI: Stopped Spark web UI at http://192.168.15.3:4040
16/08/28 13:19:33 INFO DAGScheduler: Stopping DAGScheduler
16/08/28 13:19:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/08/28 13:19:33 INFO MemoryStore: MemoryStore cleared
16/08/28 13:19:33 INFO BlockManager: BlockManager stopped
16/08/28 13:19:33 INFO BlockManagerMaster: BlockManagerMaster stopped
16/08/28 13:19:33 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/08/28 13:19:33 INFO SparkContext: Successfully stopped SparkContext
16/08/28 13:19:33 INFO ShutdownHookManager: Shutdown hook called
16/08/28 13:19:33 INFO ShutdownHookManager: Deleting directory /private/var/folders/lb/78w91_l123n0cvprhmldkxhc0000gp/T/spark-a122037d-3228-4e53-b3dd-6d7213187df0
During execution, the job hangs for about a minute at this point:
16/08/28 13:22:21 INFO Executor: Fetching http://192.168.15.3:57015/jars/AssetBreakdownUploader-0.1-SNAPSHOT.jar with timestamp 1472408540577
and then the timeout occurs. Any idea what could be causing this?
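
The log shows the executor, which runs on localhost in this setup, timing out while it connects to the driver's HTTP file server at 192.168.15.3. One way to narrow this down is to check whether that address accepts TCP connections from the same machine at all. The sketch below is only a diagnostic and not part of Spark; `ReachabilityCheck` and `canConnect` are hypothetical names, and the host and port should be taken from the "Fetching http://..." line of a run that is still in progress, since the port changes on every run.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object ReachabilityCheck {
  // Hypothetical helper: returns true if a TCP connection to host:port
  // can be opened within timeoutMs milliseconds, false otherwise.
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      // Any connection failure (timeout, refused, unknown host) becomes false.
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Default values are copied from the log above; pass your own as arguments.
    val host = if (args.length > 0) args(0) else "192.168.15.3"
    val port = if (args.length > 1) args(1).toInt else 56989
    println(s"$host:$port reachable: ${canConnect(host, port)}")
  }
}
```

If this reports the address as unreachable while the job is hanging, the machine probably cannot reach its own advertised IP (a VPN, a firewall, or a stale network interface are plausible suspects, though the log alone does not confirm any of them). In that case, binding Spark to the loopback address, for example by exporting `SPARK_LOCAL_IP=127.0.0.1` before calling spark-submit, may be worth trying.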

Related

Word2Vec on Spark Scala

I'm trying to use Word2Vec from MLlib so that I can apply k-means afterwards. I'm using Scala 2.10.5 and Spark 1.6.3. This is my code (after tokenization):
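import org.apache.spark.ml.feature.Word2Vec

// Configure Word2Vec on the tokenized column
val word2Vec = new Word2Vec()
  .setMinCount(2)
  .setInputCol("FilteredFeauturesEntities")
  .setOutputCol("Word2VecFeatures")
  .setVectorSize(1000)
// Fit the model, then add the vector column to the DataFrame
val model = word2Vec.fit(CleanedTokenizedDataFrame)
val word2VecDataFrame = model.transform(CleanedTokenizedDataFrame)
word2VecDataFrame.show()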
I don't get any specific error, but my job never reaches the final lines.
This is the log output:
18/02/05 15:39:32 INFO TaskSetManager: Finished task 4.0 in stage 4.0 (TID 23) in 3143 ms on dhadlx122.haas.xxxxxx (2/9)
18/02/05 15:39:32 INFO TaskSetManager: Starting task 5.1 in stage 4.0 (TID 28, dhadlx121.haas.xxxxxx, partition 5,NODE_LOCAL, 2329 bytes)
18/02/05 15:39:32 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 20) in 3217 ms on dhadlx121.haas.xxxxxx (3/9)
18/02/05 15:39:32 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 22) in 3309 ms on dhadlx123.haas.xxxxxx (4/9)
18/02/05 15:39:32 INFO TaskSetManager: Finished task 2.0 in stage 4.0 (TID 21) in 3677 ms on dhadlx121.haas.xxxxxx (5/9)
18/02/05 15:39:33 INFO TaskSetManager: Finished task 6.0 in stage 4.0 (TID 25) in 3901 ms on dhadlx126.haas.xxxxxx (6/9)
18/02/05 15:39:33 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (dhadlx127.haas.xxxxxx:48384) with ID 6
18/02/05 15:39:33 INFO BlockManagerMasterEndpoint: Registering block manager dhadlx127.haas.xxxxxx:37909 with 5.3 GB RAM, BlockManagerId(6, dhadlx127.haas.xxxxxx, 37909)
18/02/05 15:39:33 INFO TaskSetManager: Lost task 5.1 in stage 4.0 (TID 28) on executor dhadlx121.haas.xxxxxx: java.lang.NullPointerException (null) [duplicate 1]
18/02/05 15:39:33 INFO TaskSetManager: Starting task 5.2 in stage 4.0 (TID 29, dhadlx128.haas.xxxxxx, partition 5,RACK_LOCAL, 2329 bytes)
18/02/05 15:39:33 INFO TaskSetManager: Finished task 7.0 in stage 4.0 (TID 27) in 2948 ms on dhadlx125.haas.xxxxxx (7/9)
18/02/05 15:39:34 INFO TaskSetManager: Lost task 5.2 in stage 4.0 (TID 29) on executor dhadlx128.haas.xxxxxx: java.lang.NullPointerException (null) [duplicate 2]
18/02/05 15:39:34 INFO TaskSetManager: Starting task 5.3 in stage 4.0 (TID 30, dhadlx127.haas.xxxxxx, partition 5,RACK_LOCAL, 2329 bytes)
18/02/05 15:39:35 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on dhadlx127.haas.xxxxxx:37909 (size: 26.4 KB, free: 5.3 GB)
18/02/05 15:39:35 INFO TaskSetManager: Finished task 3.0 in stage 4.0 (TID 19) in 6321 ms on dhadlx120.haas.xxxxxx (8/9)
18/02/05 15:39:36 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on dhadlx127.haas.xxxxxx:37909 (size: 58.9 KB, free: 5.3 GB)
18/02/05 15:39:40 INFO TaskSetManager: Lost task 5.3 in stage 4.0 (TID 30) on executor dhadlx127.haas.xxxxxx: java.lang.NullPointerException (null) [duplicate 3]
18/02/05 15:39:40 ERROR TaskSetManager: Task 5 in stage 4.0 failed 4 times; aborting job
18/02/05 15:39:40 INFO YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
18/02/05 15:39:40 INFO YarnScheduler: Cancelling stage 4
18/02/05 15:39:40 INFO DAGScheduler: ShuffleMapStage 4 (map at Word2Vec.scala:161) failed in 11.037 s
18/02/05 15:39:40 INFO DAGScheduler: Job 3 failed: collect at Word2Vec.scala:170, took 11.058049 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 4.0 failed 4 times, most recent failure: Lost task 5.3 in stage 4.0 (TID 30, dhadlx127.haas.xxxxxx): java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
at java.util.regex.Matcher.reset(Matcher.java:309)
at java.util.regex.Matcher.<init>(Matcher.java:229)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:385)
at SemanticAnalysis.App$$anonfun$extractPattern$1$1.apply(App.scala:63)
at SemanticAnalysis.App$$anonfun$extractPattern$1$1.apply(App.scala:63)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1844)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1857)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1928)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:934)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:323)
at org.apache.spark.rdd.RDD.collect(RDD.scala:933)
at org.apache.spark.mllib.feature.Word2Vec.learnVocab(Word2Vec.scala:170)
at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:284)
at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:149)
at SemanticAnalysis.App$.main(App.scala:126)
at SemanticAnalysis.App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
at java.util.regex.Matcher.reset(Matcher.java:309)
at java.util.regex.Matcher.<init>(Matcher.java:229)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:385)
at SemanticAnalysis.App$$anonfun$extractPattern$1$1.apply(App.scala:63)
at SemanticAnalysis.App$$anonfun$extractPattern$1$1.apply(App.scala:63)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
18/02/05 15:39:40 INFO SparkContext: Invoking stop() from shutdown hook
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
18/02/05 15:39:40 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
18/02/05 15:39:40 INFO SparkUI: Stopped Spark web UI at http://xxx.xx.xx.xxx:xxxx
18/02/05 15:39:40 INFO YarnClientSchedulerBackend: Interrupting monitor thread
18/02/05 15:39:40 INFO YarnClientSchedulerBackend: Shutting down all executors
18/02/05 15:39:40 INFO YarnClientSchedulerBackend: Asking each executor to shut down
18/02/05 15:39:40 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
18/02/05 15:39:40 INFO YarnClientSchedulerBackend: Stopped
18/02/05 15:39:40 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/05 15:39:40 INFO MemoryStore: MemoryStore cleared
18/02/05 15:39:40 INFO BlockManager: BlockManager stopped
18/02/05 15:39:40 INFO BlockManagerMaster: BlockManagerMaster stopped
18/02/05 15:39:40 INFO SparkContext: Successfully stopped SparkContext
18/02/05 15:39:40 INFO ShutdownHookManager: Shutdown hook called
18/02/05 15:39:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-e769e7c5-4336-45bd-97cd-e0731803f45f
18/02/05 15:39:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-f427cf4c-4236-4e57-a304-6be2a52932f3
18/02/05 15:39:40 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/05 15:39:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-f427cf4c-4236-4e57-a304-6be2a52932f3/httpd-0ab9e5ee-930e-4a48-be77-f5a6d2b01250
18/02/05 15:39:40 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
18/02/05 15:39:40 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
18/02/05 15:39:40 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Moreover, the same code works for a small example in the same working environment:
package BIGDATA
/**
 * @author ${user.name}
 */
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
object App {
  def main(args : Array[String]) {
    val conf = new SparkConf()
      .setAppName("SEMANTIC ANALYSIS - TEST")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._
    println("====================================================")
    println("READING DATA")
    println("====================================================")
    val pattern: scala.util.matching.Regex = "(([\\w\\.-]+#[\\w\\.-]+)|((X|A|x|a)\\d{6})|(MA\\d{7}\\w|MA\\d{7}|FR\\d{8}\\w)|(w+\\..*(\\.com|fr))|([|\\[\\]!\\(\\)?,;:#&*#_=\\/]*))".r
    def extractPattern(pattern: scala.util.matching.Regex) = udf(
      (title: String) => pattern.replaceAllIn(title, "")
    )
    val df = Seq(
      (8, "Hi I heard about Spark x163021. Now, let’s use trained model by loading it. We need to import KMeansModel in order to use it for loading the model from file."),
      (64, "I wish Java could use case classes. Above is a very naive example in which we use training dataset as input data too. In real world we will train a model, save it and later use it for predicting clusters of input data."),
      (-27, "Logistic regression models are neat. Here is how you can save a trained model and later load it for prediction.")
    ).toDF("number", "word").select($"number", $"word",
      extractPattern(pattern)($"word").alias("NewWord"))
    println("====================================================")
    println("FEATURE TRANSFORMERS")
    println("====================================================")
    val tokenizer = new Tokenizer()
      .setInputCol("NewWord")
      .setOutputCol("FeauturesEntities")
    val TokenizedDataFrame = tokenizer.transform(df)
    val remover = new StopWordsRemover()
      .setInputCol("FeauturesEntities")
      .setOutputCol("FilteredFeauturesEntities")
    val CleanedTokenizedDataFrame = remover.transform(TokenizedDataFrame)
    CleanedTokenizedDataFrame.show()
    println("====================================================")
    println("WORD2VEC : LEARN A MAPPING FROM WORDS TO VECTORS")
    println("====================================================")
    // Learn a mapping from words to Vectors.
    val word2Vec = new Word2Vec()
      .setMinCount(2)
      .setInputCol("FilteredFeauturesEntities")
      .setOutputCol("Word2VecFeatures")
      .setVectorSize(1000)
    val model = word2Vec.fit(CleanedTokenizedDataFrame)
    val word2VecDataFrame = model.transform(CleanedTokenizedDataFrame)
    word2VecDataFrame.show()
  }
}
What's wrong with the first example? Thanks!
Your code never reaches Word2Vec. It fails on the udf call because the word column contains nulls. For example:
val df = Seq((1, null), (2, "foo bar")).toDF("id", "word")
df.select(extractPattern(pattern)($"word").alias("NewWord")).show
will fail in the same way:
java.lang.NullPointerException
at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
at java.util.regex.Matcher.reset(Matcher.java:309)
at java.util.regex.Matcher.<init>(Matcher.java:229)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
Clean your data using na.drop before you proceed and, in general, use regexp_replace rather than a udf.
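If you would rather keep a udf, the function inside it has to tolerate null input. Below is a minimal sketch in plain Scala (no Spark needed to run it) of a null-safe version of the replacement. The object name NullSafeClean and the shortened pattern are made up for illustration; the real pattern from the question would go in its place. As far as I know, a Scala udf can wrap a function that returns an Option, with None showing up as null in the output column, but treat that as an assumption to check on your Spark version.

```scala
import scala.util.matching.Regex

object NullSafeClean {
  // Simplified stand-in for the question's pattern (assumption):
  // matches ids such as x163021, i.e. one of X/A/x/a followed by six digits.
  val pattern: Regex = "(X|A|x|a)\\d{6}".r

  // Option(title) is None when title is null, so replaceAllIn
  // is never called with a null input and cannot throw the NPE.
  def stripPattern(title: String): Option[String] =
    Option(title).map(t => pattern.replaceAllIn(t, ""))
}
```

Calling NullSafeClean.stripPattern(null) gives None instead of throwing, while non-null titles are cleaned as before.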

Spark Submit not able to pick classpath from jar

I have created a Spark job that reads data from one Cassandra table and inserts it into another. I am using Gradle to build the jar file, and I have managed to create a jar that contains all the dependencies. I trigger the Spark job with the command below:
spark-submit --class DataMigration OrderAnalytics.jar
All required jars are present inside OrderAnalytics.jar, i.e. under lib/**, but I am still getting the NoClassDefFoundError below:
17/08/19 22:56:38 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
And the manifest in META-INF looks like this:
Manifest-Version: 1.0
Main-Class: DataMigration
Class-Path: lib/spark-sql_2.11-2.2.0.jar lib/spark-cassandra-connector
_2.11-2.0.3.jar lib/univocity-parsers-2.2.1.jar lib/spark-sketch_2.11
-2.2.0.jar lib/spark-core_2.11-2.2.0.jar lib/spark-catalyst_2.11-2.2.
0.jar lib/spark-tags_2.11-2.2.0.jar lib/parquet-column-1.8.2.jar lib/
parquet-hadoop-1.8.2.jar lib/jackson-databind-2.6.5.jar lib/xbean-asm
5-shaded-4.4.jar lib/unused-1.0.0.jar lib/jsr166e-1.1.0.jar lib/commo
ns-beanutils-1.9.3.jar lib/joda-time-2.3.jar lib/joda-convert-1.2.jar
lib/scala-reflect-2.11.8.jar lib/avro-1.7.7.jar lib/avro-mapred-1.7.
7-hadoop2.jar lib/chill_2.11-0.8.0.jar lib/chill-java-0.8.0.jar lib/h
adoop-client-2.6.5.jar lib/spark-launcher_2.11-2.2.0.jar lib/spark-ne
twork-common_2.11-2.2.0.jar lib/spark-network-shuffle_2.11-2.2.0.jar
lib/spark-unsafe_2.11-2.2.0.jar lib/jets3t-0.9.3.jar lib/curator-reci
pes-2.6.0.jar lib/javax.servlet-api-3.1.0.jar lib/commons-lang3-3.5.j
ar lib/commons-math3-3.4.1.jar lib/jsr305-1.3.9.jar lib/jul-to-slf4j-
1.7.16.jar lib/jcl-over-slf4j-1.7.16.jar lib/log4j-1.2.17.jar lib/slf
4j-log4j12-1.7.16.jar lib/compress-lzf-1.0.3.jar lib/snappy-java-1.1.
2.6.jar lib/lz4-1.3.0.jar lib/RoaringBitmap-0.5.11.jar lib/json4s-jac
kson_2.11-3.2.11.jar lib/jersey-client-2.22.2.jar lib/jersey-common-2
.22.2.jar lib/jersey-server-2.22.2.jar lib/jersey-container-servlet-2
.22.2.jar lib/jersey-container-servlet-core-2.22.2.jar lib/netty-3.9.
9.Final.jar lib/stream-2.7.0.jar lib/metrics-core-3.1.2.jar lib/metri
cs-jvm-3.1.2.jar lib/metrics-json-3.1.2.jar lib/metrics-graphite-3.1.
2.jar lib/jackson-module-scala_2.11-2.6.5.jar lib/ivy-2.4.0.jar lib/o
ro-2.0.8.jar lib/pyrolite-4.13.jar lib/py4j-0.10.4.jar lib/commons-cr
ypto-1.0.0.jar lib/janino-3.0.0.jar lib/commons-compiler-3.0.0.jar li
b/antlr4-runtime-4.5.3.jar lib/commons-codec-1.10.jar lib/parquet-com
mon-1.8.2.jar lib/parquet-encoding-1.8.2.jar lib/parquet-format-2.3.1
.jar lib/parquet-jackson-1.8.2.jar lib/jackson-core-2.6.5.jar lib/com
mons-collections-3.2.2.jar lib/commons-compress-1.4.1.jar lib/avro-ip
c-1.7.7.jar lib/avro-ipc-1.7.7-tests.jar lib/kryo-shaded-3.0.3.jar li
b/hadoop-common-2.6.5.jar lib/hadoop-hdfs-2.6.5.jar lib/hadoop-mapred
uce-client-app-2.6.5.jar lib/hadoop-yarn-api-2.6.5.jar lib/hadoop-map
reduce-client-core-2.6.5.jar lib/hadoop-mapreduce-client-jobclient-2.
6.5.jar lib/hadoop-annotations-2.6.5.jar lib/leveldbjni-all-1.8.jar l
ib/httpcore-4.3.3.jar lib/httpclient-4.3.6.jar lib/activation-1.1.1.j
ar lib/mx4j-3.0.2.jar lib/mail-1.4.7.jar lib/bcprov-jdk15on-1.51.jar
lib/java-xmlbuilder-1.0.jar lib/curator-framework-2.6.0.jar lib/zooke
eper-3.4.6.jar lib/guava-16.0.1.jar lib/json4s-core_2.11-3.2.11.jar l
ib/javax.ws.rs-api-2.0.1.jar lib/hk2-api-2.4.0-b34.jar lib/javax.inje
ct-2.4.0-b34.jar lib/hk2-locator-2.4.0-b34.jar lib/javax.annotation-a
pi-1.2.jar lib/jersey-guava-2.22.2.jar lib/osgi-resource-locator-1.0.
1.jar lib/jersey-media-jaxb-2.22.2.jar lib/validation-api-1.1.0.Final
.jar lib/jackson-module-paranamer-2.6.5.jar lib/xz-1.0.jar lib/minlog
-1.3.0.jar lib/objenesis-2.1.jar lib/commons-cli-1.2.jar lib/xmlenc-0
.52.jar lib/commons-httpclient-3.1.jar lib/commons-io-2.4.jar lib/com
mons-lang-2.6.jar lib/commons-configuration-1.6.jar lib/protobuf-java
-2.5.0.jar lib/gson-2.2.4.jar lib/hadoop-auth-2.6.5.jar lib/curator-c
lient-2.6.0.jar lib/htrace-core-3.0.4.jar lib/jetty-util-6.1.26.jar l
ib/xercesImpl-2.9.1.jar lib/hadoop-mapreduce-client-common-2.6.5.jar
lib/hadoop-mapreduce-client-shuffle-2.6.5.jar lib/hadoop-yarn-common-
2.6.5.jar lib/base64-2.3.8.jar lib/json4s-ast_2.11-3.2.11.jar lib/sca
lap-2.11.0.jar lib/hk2-utils-2.4.0-b34.jar lib/aopalliance-repackaged
-2.4.0-b34.jar lib/javassist-3.18.1-GA.jar lib/commons-digester-1.8.j
ar lib/commons-beanutils-core-1.8.0.jar lib/apacheds-kerberos-codec-2
.0.0-M15.jar lib/xml-apis-1.3.04.jar lib/hadoop-yarn-client-2.6.5.jar
lib/hadoop-yarn-server-common-2.6.5.jar lib/hadoop-yarn-server-nodem
anager-2.6.5.jar lib/jaxb-api-2.2.2.jar lib/jackson-jaxrs-1.9.13.jar
lib/jackson-xc-1.9.13.jar lib/guice-3.0.jar lib/scala-compiler-2.11.0
.jar lib/javax.inject-1.jar lib/jline-0.9.94.jar lib/apacheds-i18n-2.
0.0-M15.jar lib/api-asn1-api-1.0.0-M20.jar lib/api-util-1.0.0-M20.jar
lib/jettison-1.1.jar lib/stax-api-1.0-2.jar lib/aopalliance-1.0.jar
lib/cglib-2.2.1-v20090111.jar lib/scala-xml_2.11-1.0.1.jar lib/scala-
parser-combinators_2.11-1.0.1.jar lib/scala-library-2.11.8.jar lib/sl
f4j-api-1.7.16.jar lib/netty-all-4.0.43.Final.jar lib/jackson-core-as
l-1.9.13.jar lib/jackson-mapper-asl-1.9.13.jar lib/jackson-annotation
s-2.6.5.jar lib/commons-net-3.1.jar lib/paranamer-2.6.jar
UPDATE
As a few of the comments and the answer by Allison Berman suggested, I tried the following:
C:\Dev-Tra\OrderAnalytics\build\libs>spark-submit --jars OrderAnalytics.jar \ --class example.DataMigration
Error: Cannot load main class from JAR file:/C:/
Run with --help for usage help or --verbose for debug output
C:\Dev-Tra\OrderAnalytics\build\libs>spark-submit --jars OrderAnalytics.jar --class example.DataMigration
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:160)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:274)
at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:151)
at org.apache.spark.launcher.Main.main(Main.java:86)
According to the Spark documentation, however, the command should be as below. With it the job starts, but it is not able to find all the dependent jars:
C:\Dev-Tra\OrderAnalytics\build\libs>spark-submit --class example.DataMigration OrderAnalytics.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/21 21:59:36 INFO SparkContext: Running Spark version 2.2.0
17/08/21 21:59:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/21 21:59:36 INFO SparkContext: Submitted application: DataMigration
17/08/21 21:59:36 INFO SecurityManager: Changing view acls to: ram
17/08/21 21:59:36 INFO SecurityManager: Changing modify acls to: ram
17/08/21 21:59:36 INFO SecurityManager: Changing view acls groups to:
17/08/21 21:59:36 INFO SecurityManager: Changing modify acls groups to:
17/08/21 21:59:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ram); groups with view permissions: Set(); users with modify permissions: Set(ram); groups with modify permissions: Set()
17/08/21 21:59:37 INFO Utils: Successfully started service 'sparkDriver' on port 62239.
17/08/21 21:59:37 INFO SparkEnv: Registering MapOutputTracker
17/08/21 21:59:37 INFO SparkEnv: Registering BlockManagerMaster
17/08/21 21:59:37 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/08/21 21:59:37 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/08/21 21:59:37 INFO DiskBlockManager: Created local directory at C:\Users\ram\AppData\Local\Temp\blockmgr-38ef35e6-219e-450c-b7da-c8075464a232
17/08/21 21:59:37 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/08/21 21:59:37 INFO SparkEnv: Registering OutputCommitCoordinator
17/08/21 21:59:37 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/08/21 21:59:37 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.101:4040
17/08/21 21:59:38 INFO SparkContext: Added JAR file:/C:/Dev-Tra/OrderAnalytics/build/libs/OrderAnalytics.jar at spark://192.168.1.101:62239/jars/OrderAnalytics.jar with timestamp 1503332978023
17/08/21 21:59:38 INFO Executor: Starting executor ID driver on host localhost
17/08/21 21:59:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 62248.
17/08/21 21:59:38 INFO NettyBlockTransferService: Server created on 192.168.1.101:62248
17/08/21 21:59:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/08/21 21:59:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.101, 62248, None)
17/08/21 21:59:38 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.101:62248 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.101, 62248, None)
17/08/21 21:59:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.101, 62248, None)
17/08/21 21:59:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.101, 62248, None)
17/08/21 21:59:40 INFO Native: Could not load JNR C Library, native system calls through this library will not be available (set this logger level to DEBUG to see the full stack trace).
17/08/21 21:59:40 INFO ClockFactory: Using java.lang.System clock to generate timestamps.
17/08/21 21:59:41 WARN NettyUtil: Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead.
17/08/21 21:59:41 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
17/08/21 21:59:41 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
17/08/21 21:59:42 INFO SparkContext: Starting job: runJob at RDDFunctions.scala:36
17/08/21 21:59:42 INFO DAGScheduler: Got job 0 (runJob at RDDFunctions.scala:36) with 4 output partitions
17/08/21 21:59:42 INFO DAGScheduler: Final stage: ResultStage 0 (runJob at RDDFunctions.scala:36)
17/08/21 21:59:42 INFO DAGScheduler: Parents of final stage: List()
17/08/21 21:59:42 INFO DAGScheduler: Missing parents: List()
17/08/21 21:59:42 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at DataMigration.scala:18), which has no missing parents
17/08/21 21:59:42 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 12.3 KB, free 366.3 MB)
17/08/21 21:59:42 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.8 KB, free 366.3 MB)
17/08/21 21:59:42 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.101:62248 (size: 5.8 KB, free: 366.3 MB)
17/08/21 21:59:42 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
17/08/21 21:59:42 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at DataMigration.scala:18) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
17/08/21 21:59:42 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/08/21 21:59:42 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, NODE_LOCAL, 17002 bytes)
17/08/21 21:59:42 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/08/21 21:59:42 INFO Executor: Fetching spark://192.168.1.101:62239/jars/OrderAnalytics.jar with timestamp 1503332978023
17/08/21 21:59:42 INFO TransportClientFactory: Successfully created connection to /192.168.1.101:62239 after 18 ms (0 ms spent in bootstraps)
17/08/21 21:59:42 INFO Utils: Fetching spark://192.168.1.101:62239/jars/OrderAnalytics.jar to C:\Users\ram\AppData\Local\Temp\spark-73cbbbe8-9e06-4a11-976a-a766305d4148\userFiles-3e4c9dea-6273-4d9e-a17b-c807aa0e3da5\fetchFileTemp7196411614488839489.tmp
17/08/21 21:59:43 INFO Executor: Adding file:/C:/Users/ram/AppData/Local/Temp/spark-73cbbbe8-9e06-4a11-976a-a766305d4148/userFiles-3e4c9dea-6273-4d9e-a17b-c807aa0e3da5/OrderAnalytics.jar to class loader
17/08/21 21:59:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:152)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:174)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.twitter.jsr166e.LongAdder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 14 more
17/08/21 21:59:44 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, NODE_LOCAL, 15334 bytes)
17/08/21 21:59:44 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/21 21:59:44 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:152)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:174)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.twitter.jsr166e.LongAdder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 14 more
17/08/21 21:59:44 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
17/08/21 21:59:44 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:152)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:174)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/08/21 21:59:44 INFO TaskSchedulerImpl: Cancelling stage 0
17/08/21 21:59:44 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/08/21 21:59:44 INFO TaskSchedulerImpl: Stage 0 was cancelled
17/08/21 21:59:44 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on localhost, executor driver: java.lang.NoClassDefFoundError (com/twitter/jsr166e/LongAdder) [duplicate 1]
17/08/21 21:59:44 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/08/21 21:59:44 INFO DAGScheduler: ResultStage 0 (runJob at RDDFunctions.scala:36) failed in 1.894 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:152)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:174)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.twitter.jsr166e.LongAdder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 14 more
Driver stacktrace:
17/08/21 21:59:44 INFO DAGScheduler: Job 0 failed: runJob at RDDFunctions.scala:36, took 2.152376 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:152)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:174)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.twitter.jsr166e.LongAdder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 14 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2075)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:36)
at example.DataMigration$.main(DataMigration.scala:20)
at example.DataMigration.main(DataMigration.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsSupport$class.$init$(OutputMetricsUpdater.scala:107)
at org.apache.spark.metrics.OutputMetricsUpdater$TaskMetricsUpdater.<init>(OutputMetricsUpdater.scala:152)
at org.apache.spark.metrics.OutputMetricsUpdater$.apply(OutputMetricsUpdater.scala:75)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:174)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.twitter.jsr166e.LongAdder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 14 more
17/08/21 21:59:51 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
17/08/21 21:59:52 INFO SerialShutdownHooks: Successfully executed shutdown hook: Clearing session cache for C* connector
17/08/21 21:59:52 INFO SparkContext: Invoking stop() from shutdown hook
17/08/21 21:59:52 INFO SparkUI: Stopped Spark web UI at http://192.168.1.101:4040
17/08/21 21:59:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/08/21 21:59:52 INFO MemoryStore: MemoryStore cleared
17/08/21 21:59:52 INFO BlockManager: BlockManager stopped
17/08/21 21:59:52 INFO BlockManagerMaster: BlockManagerMaster stopped
17/08/21 21:59:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/08/21 21:59:52 INFO SparkContext: Successfully stopped SparkContext
17/08/21 21:59:52 INFO ShutdownHookManager: Shutdown hook called
17/08/21 21:59:52 INFO ShutdownHookManager: Deleting directory C:\Users\ram\AppData\Local\Temp\spark-73cbbbe8-9e06-4a11-976a-a766305d4148
Can anyone please tell me why Spark is not able to pick up the jar on the class path, or how to resolve this problem?
Thanks
Indrajit is correct: you need to include the package. I had similar issues when I left my files in the default package. Make your folder structure match the layout described at http://www.scala-sbt.org/0.13/docs/Directories.html
Add a new folder YOUR_PACKAGE in src/main/scala or src/main/java and put DataMigration in YOUR_PACKAGE. Make sure the first line of DataMigration is:
package YOUR_PACKAGE
Your spark-submit will then be:
spark-submit --jars OrderAnalytics.jar \
--class YOUR_PACKAGE.DataMigration
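To check whether a fully qualified name such as YOUR_PACKAGE.DataMigration, or the com.twitter.jsr166e.LongAdder class from the stack trace in the question, is actually visible on a given classpath, a small stand-alone check can help. This is only a sketch: ClasspathCheck and isLoadable are made-up names for illustration and are not part of Spark or the Cassandra connector.

```java
final class ClasspathCheck {

    /** Returns true if the named class can be located by this JVM's class loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // ClassNotFoundException: the class is not on the classpath at all
            // LinkageError (e.g. NoClassDefFoundError): found, but a dependency is missing
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace; pass another name as the first argument
        String name = args.length > 0 ? args[0] : "com.twitter.jsr166e.LongAdder";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Run it with the same jars on the classpath that are handed to spark-submit. If it prints false for a class, that jar (or one of its dependencies) is probably not being shipped.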

MongoDB Spark Connector : mongo-spark cannot find collection

I am getting an error while trying to read data from a collection.
My MongoDB instance is hosted on 192.168.1.2, while my Spark instance is hosted on 192.168.1.1. The code is:
package org.sparkexample;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

public class WordCountTask {
    public static void main(String[] args) {
        System.out.println("arg : " + args[0]);
        //checkArgument(args.length > 1, "Please provide the path of input file as first parameter.");
        new WordCountTask().run(args[0]);
    }

    public void run(String inputFilePath) {
        SparkSession spark = SparkSession.builder()
            .master("spark://192.168.1.1:7077")
            .appName("MongoSparkConnectorIntro")
            .config("spark.mongodb.input.uri", "mongodb://192.168.1.2/local.Test")
            .config("spark.mongodb.output.uri", "mongodb://192.168.1.2/local.Test")
            .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaMongoRDD<Document> rdd = MongoSpark.load(jsc);
        System.out.println("******************************************");
        System.out.println("The count is : ");
        System.out.println(rdd.count());
        System.out.println(rdd.first().toJson());
        System.out.println("******************************************");
        jsc.close();
    }
}
The error (or rather, the info message) I get is:
INFO MongoSamplePartitioner: Could not find collection (Test), using a single partition
Because of this, the .first() call fails. However, the collection does exist and I am able to access it. Can anyone tell me what is going wrong?
The full log is:
; ui acls disabled; users with view permissions: Set(mklrjv); groups with view permissions: Set(); users with modify permissions: Set(mklrjv); groups with modify permissions: Set()
17/04/10 18:17:09 INFO Utils: Successfully started service 'sparkDriver' on port 34048.
17/04/10 18:17:09 INFO SparkEnv: Registering MapOutputTracker
17/04/10 18:17:09 INFO SparkEnv: Registering BlockManagerMaster
17/04/10 18:17:09 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/04/10 18:17:09 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/04/10 18:17:09 INFO DiskBlockManager: Created local directory at C:\Users\mrajeev\AppData\Local\Temp\blockmgr-17cba028-2757-4f48-88ea-f8c7b33ccba9
17/04/10 18:17:09 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/04/10 18:17:09 INFO SparkEnv: Registering OutputCommitCoordinator
17/04/10 18:17:09 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/04/10 18:17:09 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.1:4040
17/04/10 18:17:09 INFO SparkContext: Added JAR file:/C:/Projects/SparkJava/target/uber-first-example-1.0-SNAPSHOT.jar at spark://192.168.1.1:34048/jars/uber-first-example-1.0-SNAPSHOT.jar with timestamp 1491828429769
17/04/10 18:17:09 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.1:7077...
17/04/10 18:17:10 INFO TransportClientFactory: Successfully created connection to /192.168.1.1:7077 after 55 ms (0 ms spent in bootstraps)
17/04/10 18:17:10 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20170410181710-0013
17/04/10 18:17:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170410181710-0013/0 on worker-20170410150028-192.168.1.1-33151 (192.168.1.1:33151) with 4 cores
17/04/10 18:17:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170410181710-0013/0 on hostPort 192.168.1.1:33151 with 4 cores, 1024.0 MB RAM
17/04/10 18:17:10 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34070.
17/04/10 18:17:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170410181710-0013/0 is now RUNNING
17/04/10 18:17:10 INFO NettyBlockTransferService: Server created on 10.78.130.134:34070
17/04/10 18:17:10 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/04/10 18:17:10 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.1, 34070, None)
17/04/10 18:17:10 INFO BlockManagerMasterEndpoint: Registering block manager 10.78.130.134:34070 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.1, 34070, None)
17/04/10 18:17:10 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.1, 34070, None)
17/04/10 18:17:10 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.1, 34070, None)
17/04/10 18:17:11 INFO EventLoggingListener: Logging events to file:/C:/tmp/spark-events/app-20170410181710-0013
17/04/10 18:17:11 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
17/04/10 18:17:11 INFO SharedState: Warehouse path is 'file:/C:/Projects/SparkJava/spark-warehouse/'.
17/04/10 18:17:12 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
17/04/10 18:17:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 216.0 B, free 366.3 MB)
17/04/10 18:17:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 402.0 B, free 366.3 MB)
17/04/10 18:17:12 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.78.130.134:34070 (size: 402.0 B, free: 366.3 MB)
17/04/10 18:17:12 INFO SparkContext: Created broadcast 0 from broadcast at MongoSpark.scala:499
******************************************
The count is :
17/04/10 18:17:13 INFO cluster: Cluster created with settings {hosts=[10.78.130.149:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
17/04/10 18:17:13 INFO cluster: Cluster description not yet available. Waiting for 30000 ms before timing out
17/04/10 18:17:13 INFO connection: Opened connection [connectionId{localValue:1, serverValue:107}] to 192.168.1.2:27017
17/04/10 18:17:13 INFO cluster: Monitor thread successfully connected to server with description ServerDescription{address=192.168.1.2:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 2]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, roundTripTimeNanos=1176218}
17/04/10 18:17:13 INFO MongoClientCache: Creating MongoClient: [192.168.1.2:27017]
17/04/10 18:17:13 INFO connection: Opened connection [connectionId{localValue:2, serverValue:108}] to 192.168.1.2:27017
17/04/10 18:17:13 INFO MongoSamplePartitioner: Could not find collection (Test), using a single partition
17/04/10 18:17:13 INFO SparkContext: Starting job: count at WordCountTask.java:31
17/04/10 18:17:13 INFO DAGScheduler: Got job 0 (count at WordCountTask.java:31) with 1 output partitions
17/04/10 18:17:13 INFO DAGScheduler: Final stage: ResultStage 0 (count at WordCountTask.java:31)
17/04/10 18:17:13 INFO DAGScheduler: Parents of final stage: List()
17/04/10 18:17:13 INFO DAGScheduler: Missing parents: List()
17/04/10 18:17:13 INFO DAGScheduler: Submitting ResultStage 0 (MongoRDD[0] at RDD at MongoRDD.scala:52), which has no missing parents
17/04/10 18:17:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 366.3 MB)
17/04/10 18:17:13 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1855.0 B, free 366.3 MB)
17/04/10 18:17:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.78.130.134:34070 (size: 1855.0 B, free: 366.3 MB)
17/04/10 18:17:13 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/04/10 18:17:13 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MongoRDD[0] at RDD at MongoRDD.scala:52)
17/04/10 18:17:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/04/10 18:17:15 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.1.1:34090) with ID 0
17/04/10 18:17:15 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.78.130.134, executor 0, partition 0, ANY, 6112 bytes)
17/04/10 18:17:15 INFO BlockManagerMasterEndpoint: Registering block manager 10.78.130.134:34108 with 366.3 MB RAM, BlockManagerId(0, 192.168.1.1, 34108, None)
17/04/10 18:17:18 INFO MongoClientCache: Closing MongoClient: [192.168.1.2:27017]
17/04/10 18:17:18 INFO connection: Closed connection [connectionId{localValue:2, serverValue:108}] to 192.168.1.2:27017 because the pool has been closed.
17/04/10 18:17:47 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.78.130.134:34108 (size: 1855.0 B, free: 366.3 MB)
17/04/10 18:17:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.78.130.134:34108 (size: 402.0 B, free: 366.3 MB)
17/04/10 18:17:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 34054 ms on 192.168.1.1 (executor 0) (1/1)
17/04/10 18:17:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/04/10 18:17:49 INFO DAGScheduler: ResultStage 0 (count at WordCountTask.java:31) finished in 35.409 s
17/04/10 18:17:49 INFO DAGScheduler: Job 0 finished: count at WordCountTask.java:31, took 35.653876 s
0
17/04/10 18:17:49 INFO SparkContext: Starting job: first at WordCountTask.java:32
17/04/10 18:17:49 INFO DAGScheduler: Got job 1 (first at WordCountTask.java:32) with 1 output partitions
17/04/10 18:17:49 INFO DAGScheduler: Final stage: ResultStage 1 (first at WordCountTask.java:32)
17/04/10 18:17:49 INFO DAGScheduler: Parents of final stage: List()
17/04/10 18:17:49 INFO DAGScheduler: Missing parents: List()
17/04/10 18:17:49 INFO DAGScheduler: Submitting ResultStage 1 (MongoRDD[0] at RDD at MongoRDD.scala:52), which has no missing parents
17/04/10 18:17:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.2 KB, free 366.3 MB)
17/04/10 18:17:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1926.0 B, free 366.3 MB)
17/04/10 18:17:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.78.130.134:34070 (size: 1926.0 B, free: 366.3 MB)
17/04/10 18:17:49 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:996
17/04/10 18:17:49 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MongoRDD[0] at RDD at MongoRDD.scala:52)
17/04/10 18:17:49 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/04/10 18:17:49 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, 10.78.130.134, executor 0, partition 0, ANY, 6194 bytes)
17/04/10 18:17:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.78.130.134:34108 (size: 1926.0 B, free: 366.3 MB)
17/04/10 18:17:49 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 56 ms on 192.168.1.1 (executor 0) (1/1)
17/04/10 18:17:49 INFO DAGScheduler: ResultStage 1 (first at WordCountTask.java:32) finished in 0.057 s
17/04/10 18:17:49 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/04/10 18:17:49 INFO DAGScheduler: Job 1 finished: first at WordCountTask.java:32, took 0.076634 s
Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.first(RDD.scala:1366)
at org.apache.spark.api.java.JavaRDDLike$class.first(JavaRDDLike.scala:538)
at org.apache.spark.api.java.AbstractJavaRDDLike.first(JavaRDDLike.scala:45)
at org.sparkexample.WordCountTask.run(WordCountTask.java:32)
at org.sparkexample.WordCountTask.main(WordCountTask.java:14)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/04/10 18:17:49 INFO SparkContext: Invoking stop() from shutdown hook
17/04/10 18:17:49 INFO SparkUI: Stopped Spark web UI at http://192.168.1.1:4040
17/04/10 18:17:49 INFO StandaloneSchedulerBackend: Shutting down all executors
17/04/10 18:17:49 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/04/10 18:17:49 WARN TransportChannelHandler: Exception in connection from /10.78.130.134:34132
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
17/04/10 18:17:49 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/04/10 18:17:49 WARN TransportChannelHandler: Exception in connection from /10.78.130.134:34113
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
17/04/10 18:17:49 WARN TransportChannelHandler: Exception in connection from /10.78.130.134:34090
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
17/04/10 18:17:49 INFO MemoryStore: MemoryStore cleared
17/04/10 18:17:49 INFO BlockManager: BlockManager stopped
17/04/10 18:17:49 INFO BlockManagerMaster: BlockManagerMaster stopped
17/04/10 18:17:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/04/10 18:17:49 INFO SparkContext: Successfully stopped SparkContext
17/04/10 18:17:49 INFO ShutdownHookManager: Shutdown hook called
17/04/10 18:17:49 INFO ShutdownHookManager: Deleting directory C:\Users\mklrjv\AppData\Local\Temp\spark-3213c1b3-9a85-42b0-ba04-6e0e46a90d98
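One thing worth double-checking in a setup like this is which database and collection the connection URI actually names. The connector documentation describes the last URI segment as database.collection, so mongodb://192.168.1.2/local.Test should mean database local and collection Test. The sketch below splits a URI that way with plain java.net.URI. MongoNamespace and fromUri are hypothetical helper names, this is not the connector's own parsing code, and it is only meant for simple single-host URIs like the one in the question.

```java
import java.net.URI;

final class MongoNamespace {
    final String database;
    final String collection;

    private MongoNamespace(String database, String collection) {
        this.database = database;
        this.collection = collection;
    }

    /** Splits the path of a mongodb:// URI into database and collection at the first dot. */
    static MongoNamespace fromUri(String uri) {
        String path = URI.create(uri).getPath();   // e.g. "/local.Test"
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("URI names no database: " + uri);
        }
        String namespace = path.substring(1);      // drop the leading '/'
        int dot = namespace.indexOf('.');
        if (dot < 0) {
            // Only a database was given; the collection must then come from other settings
            return new MongoNamespace(namespace, "");
        }
        return new MongoNamespace(namespace.substring(0, dot), namespace.substring(dot + 1));
    }

    public static void main(String[] args) {
        MongoNamespace ns = fromUri("mongodb://192.168.1.2/local.Test");
        // prints "database=local, collection=Test"
        System.out.println("database=" + ns.database + ", collection=" + ns.collection);
    }
}
```

Comparing the printed names, including upper and lower case, with what the mongo shell lists for that server is a cheap way to rule out a simple naming mismatch.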

Apache Spark: using spark-submit to transfer files from windows to cluster

Here's what I'm trying to do:
Use spark-submit to submit a packaged / compiled (using sbt 0.13.12) Scala program to my virtualized "cluster" running HDP 2.4 (Spark 1.6.0, Scala 2.10.5) in VirtualBox.
Use the --files option to copy a text file "foo.txt" (located in the project root) from the "submitting" Windows machine (which also runs Spark 1.6.0 and Scala 2.10.5) to the working directories of the executors (as described by spark-submit -h).
Pass the text file as the first argument to my application.
Finally, read in the file and count the lines.
The command for submitting is
spark-submit ^
--class boern.spark.SparkMeApp ^
--master "spark://127.0.0.1:7077" ^
--files "foo.txt" ^
target/scala-2.11/sparkme-project_2.11-1.0.jar foo.txt
The interesting part of the code is
val fileName = args(0)
println(s"argument 0 is $fileName")
val lines = sc.textFile(fileName).cache
val c = lines.count /** line 37 */
The error (short version) I'm getting is:
INFO DAGScheduler: Job 0 failed: count at SparkMeApp.scala:37, Exception, Job aborted: java.io.FileNotFoundException: File file:/E:/myProject/foo.txt does not exist
After two days of combined "brute-forcing" and reading documentation, I am still lost. Am I wrong in thinking that sc.textFile(fileName).cache is executed on the workers, and that everything not preceded by sc runs on the master? Is using SparkFiles the way to go?
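The file:/E:/myProject/foo.txt in the error suggests that the bare name foo.txt was turned into an absolute file: URI on the submitting machine, and that the executors then tried to open that same Windows path on their own local disk. The following plain-Java sketch only illustrates that resolution step. RelativePathDemo and resolve are made-up names and no Spark API is involved, so treat it as an analogy for what appears to happen, not as Spark's actual code path.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

final class RelativePathDemo {

    /** Resolves a bare file name against a working directory and returns the file: URI. */
    static URI resolve(String workingDir, String fileName) {
        Path base = Paths.get(workingDir).toAbsolutePath();
        return base.resolve(fileName).normalize().toUri();
    }

    public static void main(String[] args) {
        // The JVM's working directory plays the role of the directory spark-submit was started from
        String cwd = System.getProperty("user.dir");
        System.out.println("foo.txt resolves to " + resolve(cwd, "foo.txt"));
    }
}
```

A URI built this way is only meaningful on the machine that produced it. Files shipped with --files are documented to be looked up on the executors through SparkFiles.get(fileName), which returns the local download location instead of the driver-side path.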
Stacktrace
E:\myProject\>spark-submit --verbose --class boern.spark.SparkMeApp --master "spark://127.0.0.1:7077" --files "foo.txt" target/scala-2.11/sparkme-project_2.11-1.0.jar foo.txt
Using properties file: null
Parsed arguments:
master spark://127.0.0.1:7077
deployMode null
executorMemory null
executorCores null
totalExecutorCores null
propertiesFile null
driverMemory null
driverCores null
driverExtraClassPath null
driverExtraLibraryPath null
driverExtraJavaOptions null
supervise false
queue null
numExecutors null
files file:/E:/myProject/foo.txt
pyFiles null
archives null
mainClass boern.spark.SparkMeApp
primaryResource file:/E:/myProject/target/scala-2.11/sparkme-project_2.11-1.0.jar
name boern.spark.SparkMeApp
childArgs [foo.txt]
jars null
packages null
packagesExclusions null
repositories null
verbose true
Spark properties used, including those specified through
--conf and those from the properties file null:
Main class:
boern.spark.SparkMeApp
Arguments:
foo.txt
System properties:
SPARK_SUBMIT -> true
spark.files -> file:/E:/myProject/foo.txt
spark.app.name -> boern.spark.SparkMeApp
spark.jars -> file:/E:/myProject/target/scala-2.11/sparkme-project_2.11-1.0.jar
spark.submit.deployMode -> client
spark.master -> spark://127.0.0.1:7077
Classpath elements:
file:/E:/myProject/target/scala-2.11/sparkme-project_2.11-1.0.jar
Working directory is E:\myProject\sbtmanual
Files:
\CONF.ENI
\mw.csv
\mw_out.csv
\pagefile.sys
\temp.rds
args:
foo.txt
config set.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/15 14:36:21 INFO SparkContext: Running Spark version 1.6.0
16/09/15 14:36:22 INFO SecurityManager: Changing view acls to: Boern
16/09/15 14:36:22 INFO SecurityManager: Changing modify acls to: Boern
16/09/15 14:36:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Boern); users with modify permissions: Set(Boern)
16/09/15 14:36:22 INFO Utils: Successfully started service 'sparkDriver' on port 59716.
16/09/15 14:36:23 INFO Slf4jLogger: Slf4jLogger started
16/09/15 14:36:23 INFO Remoting: Starting remoting
16/09/15 14:36:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.56.1:59729]
16/09/15 14:36:23 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 59729.
16/09/15 14:36:23 INFO SparkEnv: Registering MapOutputTracker
16/09/15 14:36:23 INFO SparkEnv: Registering BlockManagerMaster
16/09/15 14:36:23 INFO DiskBlockManager: Created local directory at C:\Users\Boern\AppData\Local\Temp\blockmgr-c7ee2dab-ea00-4ae5-9f06-c6ab74f135e5
16/09/15 14:36:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/09/15 14:36:23 INFO SparkEnv: Registering OutputCommitCoordinator
16/09/15 14:36:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/09/15 14:36:23 INFO Utils: Successfully started service 'SparkUI' on port 4041.
16/09/15 14:36:23 INFO SparkUI: Started SparkUI at http://192.168.56.1:4041
16/09/15 14:36:23 INFO HttpFileServer: HTTP File server directory is C:\Users\Boern\AppData\Local\Temp\spark-2736b20a-fc90-40e8-a7ad-2d8cac8001f2\httpd-14abb177-9801-403c-9df9-84afb2e87d70
16/09/15 14:36:23 INFO HttpServer: Starting HTTP Server
16/09/15 14:36:23 INFO Utils: Successfully started service 'HTTP file server' on port 59746.
16/09/15 14:36:23 INFO SparkContext: Added JAR file:/E:/myProject/target/scala-2.11/sparkme-project_2.11-1.0.jar at http://192.168.56.1:59746/jars/sparkme-project_2.11-1.0.jar with timestamp 1473942983631
16/09/15 14:36:23 INFO Utils: Copying E:\myProject\sbtmanual\foo.txt to C:\Users\Boern\AppData\Local\Temp\spark-2736b20a-fc90-40e8-a7ad-2d8cac8001f2\userFiles-7849db02-01ff-40ea-9250-62b87d854f4c\foo.txt
16/09/15 14:36:23 INFO SparkContext: Added file file:/E:/myProject/foo.txt at http://192.168.56.1:59746/files/foo.txt with timestamp 1473942983695
16/09/15 14:36:23 INFO AppClient$ClientEndpoint: Connecting to master spark://127.0.0.1:7077...
16/09/15 14:36:34 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160915123633-0015
16/09/15 14:36:34 INFO AppClient$ClientEndpoint: Executor added: app-20160915123633-0015/0 on worker-20160915105800-10.0.2.15-44537 (10.0.2.15:44537) with 4 cores
16/09/15 14:36:34 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160915123633-0015/0 on hostPort 10.0.2.15:44537 with 4 cores, 1024.0 MB RAM
16/09/15 14:36:34 INFO AppClient$ClientEndpoint: Executor updated: app-20160915123633-0015/0 is now RUNNING
16/09/15 14:36:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59781.
16/09/15 14:36:34 INFO NettyBlockTransferService: Server created on 59781
16/09/15 14:36:34 INFO BlockManagerMaster: Trying to register BlockManager
16/09/15 14:36:34 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.56.1:59781 with 511.1 MB RAM, BlockManagerId(driver, 192.168.56.1, 59781)
16/09/15 14:36:34 INFO BlockManagerMaster: Registered BlockManager
16/09/15 14:36:34 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
sc set.
argument 0 is foo.txt
16/09/15 14:36:34 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 208.5 KB, free 208.5 KB)
16/09/15 14:36:34 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.3 KB, free 227.8 KB)
16/09/15 14:36:34 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.56.1:59781 (size: 19.3 KB, free: 511.1 MB)
16/09/15 14:36:34 INFO SparkContext: Created broadcast 0 from textFile at SparkMeApp.scala:39
16/09/15 14:36:34 INFO FileInputFormat: Total input paths to process : 1
16/09/15 14:36:34 INFO SparkContext: Starting job: count at SparkMeApp.scala:41
16/09/15 14:36:34 INFO DAGScheduler: Got job 0 (count at SparkMeApp.scala:41) with 2 output partitions
16/09/15 14:36:34 INFO DAGScheduler: Final stage: ResultStage 0 (count at SparkMeApp.scala:41)
16/09/15 14:36:34 INFO DAGScheduler: Parents of final stage: List()
16/09/15 14:36:34 INFO DAGScheduler: Missing parents: List()
16/09/15 14:36:34 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at SparkMeApp.scala:39), which has no missing parents
16/09/15 14:36:34 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.9 KB, free 230.7 KB)
16/09/15 14:36:34 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1752.0 B, free 232.4 KB)
16/09/15 14:36:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.56.1:59781 (size: 1752.0 B, free: 511.1 MB)
16/09/15 14:36:34 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/09/15 14:36:34 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at SparkMeApp.scala:39)
16/09/15 14:36:34 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/09/15 14:36:37 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (BoernsPC:59783) with ID 0
16/09/15 14:36:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, BoernsPC, partition 0,PROCESS_LOCAL, 2286 bytes)
16/09/15 14:36:37 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, BoernsPC, partition 1,PROCESS_LOCAL, 2286 bytes)
16/09/15 14:36:47 INFO BlockManagerMasterEndpoint: Registering block manager BoernsPC:48448 with 511.5 MB RAM, BlockManagerId(0, BoernsPC, 48448)
16/09/15 14:36:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on BoernsPC:48448 (size: 1752.0 B, free: 511.5 MB)
16/09/15 14:36:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on BoernsPC:48448 (size: 19.3 KB, free: 511.5 MB)
16/09/15 14:36:49 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, BoernsPC): java.io.FileNotFoundException: File file:/E:/myProject/foo.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/09/15 14:36:49 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor BoernsPC: java.io.FileNotFoundException (File file:/E:/myProject/foo.txt does not exist) [duplicate 1]
16/09/15 14:36:49 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, BoernsPC, partition 0,PROCESS_LOCAL, 2286 bytes)
16/09/15 14:36:49 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 3, BoernsPC, partition 1,PROCESS_LOCAL, 2286 bytes)
16/09/15 14:36:49 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3) on executor BoernsPC: java.io.FileNotFoundException (File file:/E:/myProject/foo.txt does not exist) [duplicate 2]
16/09/15 14:36:49 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, BoernsPC, partition 1,PROCESS_LOCAL, 2286 bytes)
16/09/15 14:36:49 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2) on executor BoernsPC: java.io.FileNotFoundException (File file:/E:/myProject/foo.txt does not exist) [duplicate 3]
16/09/15 14:36:49 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 5, BoernsPC, partition 0,PROCESS_LOCAL, 2286 bytes)
16/09/15 14:36:49 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 5) on executor BoernsPC: java.io.FileNotFoundException (File file:/E:/myProject/foo.txt does not exist) [duplicate 4]
16/09/15 14:36:49 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 6, BoernsPC, partition 0,PROCESS_LOCAL, 2286 bytes)
16/09/15 14:36:49 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor BoernsPC: java.io.FileNotFoundException (File file:/E:/myProject/foo.txt does not exist) [duplicate 5]
16/09/15 14:36:49 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 7, BoernsPC, partition 1,PROCESS_LOCAL, 2286 bytes)
16/09/15 14:36:49 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7) on executor BoernsPC: java.io.FileNotFoundException (File file:/E:/myProject/foo.txt does not exist) [duplicate 6]
16/09/15 14:36:49 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job
16/09/15 14:36:49 INFO TaskSchedulerImpl: Cancelling stage 0
16/09/15 14:36:49 INFO TaskSchedulerImpl: Stage 0 was cancelled
16/09/15 14:36:49 INFO DAGScheduler: ResultStage 0 (count at SparkMeApp.scala:41) failed in 14,616 s
16/09/15 14:36:49 INFO DAGScheduler: Job 0 failed: count at SparkMeApp.scala:41, took 14,694943 s
16/09/15 14:36:49 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor BoernsPC: java.io.FileNotFoundException (File file:/E:/myProject/foo.txt does not exist) [duplicate 7]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 7, BoernsPC): java.io.FileNotFoundException: File file:/E:/myProject/foo.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1143)
at boern.spark.SparkMeApp$.main(SparkMeApp.scala:41)
at boern.spark.SparkMeApp.main(SparkMeApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: File file:/E:/myProject/foo.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/09/15 14:36:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/09/15 14:36:49 INFO SparkContext: Invoking stop() from shutdown hook
16/09/15 14:36:49 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4041
16/09/15 14:36:49 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/09/15 14:36:49 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/09/15 14:36:49 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/15 14:36:49 INFO MemoryStore: MemoryStore cleared
16/09/15 14:36:49 INFO BlockManager: BlockManager stopped
16/09/15 14:36:49 INFO BlockManagerMaster: BlockManagerMaster stopped
16/09/15 14:36:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/15 14:36:49 INFO SparkContext: Successfully stopped SparkContext
16/09/15 14:36:49 INFO ShutdownHookManager: Shutdown hook called
16/09/15 14:36:49 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/09/15 14:36:49 INFO ShutdownHookManager: Deleting directory C:\Users\Boern\AppData\Local\Temp\spark-2736b20a-fc90-40e8-a7ad-2d8cac8001f2
16/09/15 14:36:49 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/09/15 14:36:49 INFO ShutdownHookManager: Deleting directory C:\Users\Boern\AppData\Local\Temp\spark-2736b20a-fc90-40e8-a7ad-2d8cac8001f2\httpd-14abb177-9801-403c-9df9-84afb2e87d70

java.io.EOFException on Spark EC2 Cluster when submitting job programmatically

I really need your help to understand what I'm doing wrong.
The intent of my experiment is to run a Spark job programmatically instead of using ./spark-shell or ./spark-submit (both of these work for me).
Environment:
I've created a Spark cluster with 1 master and 1 worker using the ./spark-ec2 script.
The cluster looks good; however, when I try to run the following code, packaged in a jar:
import org.apache.spark.{SparkConf, SparkContext}

// Input file, given as a local path
val logFile = "file:///root/spark/bin/README.md"
val conf = new SparkConf()
conf.setAppName("Simple App")
// Ship the application jar to the executors
conf.setJars(List("file:///root/spark/bin/hello-apache-spark_2.10-1.0.0-SNAPSHOT.jar"))
// Connect to the standalone master directly instead of going through spark-submit
conf.setMaster("spark://ec2-54-89-51-36.compute-1.amazonaws.com:7077")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(_.contains("a")).count()
val numBs = logData.filter(_.contains("b")).count()
println(s"1. Lines with a: $numAs, Lines with b: $numBs")
I get an exception:
[info] Running com.paycasso.SimpleApp
14/09/05 14:50:29 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/09/05 14:50:29 INFO SecurityManager: Changing view acls to: root
14/09/05 14:50:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root)
14/09/05 14:50:30 INFO Slf4jLogger: Slf4jLogger started
14/09/05 14:50:30 INFO Remoting: Starting remoting
14/09/05 14:50:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@ip-10-224-14-90.ec2.internal:54683]
14/09/05 14:50:30 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@ip-10-224-14-90.ec2.internal:54683]
14/09/05 14:50:30 INFO SparkEnv: Registering MapOutputTracker
14/09/05 14:50:30 INFO SparkEnv: Registering BlockManagerMaster
14/09/05 14:50:30 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140905145030-85cb
14/09/05 14:50:30 INFO MemoryStore: MemoryStore started with capacity 589.2 MB.
14/09/05 14:50:30 INFO ConnectionManager: Bound socket to port 47852 with id = ConnectionManagerId(ip-10-224-14-90.ec2.internal,47852)
14/09/05 14:50:30 INFO BlockManagerMaster: Trying to register BlockManager
14/09/05 14:50:30 INFO BlockManagerInfo: Registering block manager ip-10-224-14-90.ec2.internal:47852 with 589.2 MB RAM
14/09/05 14:50:30 INFO BlockManagerMaster: Registered BlockManager
14/09/05 14:50:30 INFO HttpServer: Starting HTTP Server
14/09/05 14:50:30 INFO HttpBroadcast: Broadcast server started at http://**.***.**.**:49211
14/09/05 14:50:30 INFO HttpFileServer: HTTP File server directory is /tmp/spark-e2748605-17ec-4524-983b-97aaf2f94b30
14/09/05 14:50:30 INFO HttpServer: Starting HTTP Server
14/09/05 14:50:31 INFO SparkUI: Started SparkUI at http://ip-10-224-14-90.ec2.internal:4040
14/09/05 14:50:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/09/05 14:50:32 INFO SparkContext: Added JAR file:///root/spark/bin/hello-apache-spark_2.10-1.0.0-SNAPSHOT.jar at http://**.***.**.**:46491/jars/hello-apache-spark_2.10-1.0.0-SNAPSHOT.jar with timestamp 1409928632274
14/09/05 14:50:32 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-89-51-36.compute-1.amazonaws.com:7077...
14/09/05 14:50:32 INFO MemoryStore: ensureFreeSpace(163793) called with curMem=0, maxMem=617820979
14/09/05 14:50:32 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 160.0 KB, free 589.0 MB)
14/09/05 14:50:32 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140905145032-0005
14/09/05 14:50:32 INFO AppClient$ClientActor: Executor added: app-20140905145032-0005/0 on worker-20140905141732-ip-10-80-90-29.ec2.internal-57457 (ip-10-80-90-29.ec2.internal:57457) with 2 cores
14/09/05 14:50:32 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140905145032-0005/0 on hostPort ip-10-80-90-29.ec2.internal:57457 with 2 cores, 512.0 MB RAM
14/09/05 14:50:32 INFO AppClient$ClientActor: Executor updated: app-20140905145032-0005/0 is now RUNNING
14/09/05 14:50:33 INFO FileInputFormat: Total input paths to process : 1
14/09/05 14:50:33 INFO SparkContext: Starting job: count at SimpleApp.scala:26
14/09/05 14:50:33 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:26) with 1 output partitions (allowLocal=false)
14/09/05 14:50:33 INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:26)
14/09/05 14:50:33 INFO DAGScheduler: Parents of final stage: List()
14/09/05 14:50:33 INFO DAGScheduler: Missing parents: List()
14/09/05 14:50:33 INFO DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:26), which has no missing parents
14/09/05 14:50:33 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:26)
14/09/05 14:50:33 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/09/05 14:50:36 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-10-80-90-29.ec2.internal:36966/user/Executor#2034537974] with ID 0
14/09/05 14:50:36 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: ip-10-80-90-29.ec2.internal (PROCESS_LOCAL)
14/09/05 14:50:36 INFO TaskSetManager: Serialized task 0.0:0 as 1880 bytes in 8 ms
14/09/05 14:50:37 INFO BlockManagerInfo: Registering block manager ip-10-80-90-29.ec2.internal:59950 with 294.9 MB RAM
14/09/05 14:50:38 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/09/05 14:50:38 WARN TaskSetManager: Loss was due to java.io.EOFException
java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2744)
at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1032)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:147)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
What I'm actually doing is calling "sbt run", so I assemble the Scala project and run it.
By the way, I run the project on the master host, so the driver is definitely visible to the worker host.
Any help is appreciated. It's very strange that such a simple example doesn't work on a cluster. I don't find using ./spark-submit convenient.
Thanks in advance.
After wasting a lot of time, I found the problem. Even though my application doesn't use Hadoop/HDFS directly, the hadoop-client version matters. The hadoop-client version in my application was different from the Hadoop version Spark was built for: Spark was built for Hadoop 1.2.1, but my application pulled in 2.4.
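One way to confirm a mismatch like this is to print the Hadoop version that is actually on the classpath, once from the application and once from ./spark-shell on the cluster, and compare the two. The following is a minimal sketch; the object name HadoopVersionCheck is hypothetical, and it assumes hadoop-client (or Spark's bundled Hadoop classes) is on the classpath:

```scala
import org.apache.hadoop.util.VersionInfo

// Hypothetical helper object, not part of the original application.
object HadoopVersionCheck {
  def main(args: Array[String]): Unit = {
    // VersionInfo reports the version of the Hadoop jars loaded by this JVM.
    println(s"Hadoop version on classpath: ${VersionInfo.getVersion}")
  }
}
```

In ./spark-shell, evaluating org.apache.hadoop.util.VersionInfo.getVersion should give the version Spark itself was built against; if the two values differ, the task deserialization failure above is a plausible symptom.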
After changing the hadoop-client version to 1.2.1 in my app, I was able to execute Spark code on the cluster.
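For an sbt project, the fix could look roughly like the build.sbt fragment below. This is only a sketch: the hadoop-client version 1.2.1 comes from the answer above, while the Scala and Spark version numbers are assumptions that should be replaced with whatever the cluster actually runs.

```scala
// build.sbt (sketch)

// Assumption: Scala 2.10.x, matching the _2.10 suffix of the application jar.
scalaVersion := "2.10.4"

// Assumption: placeholder Spark version; use the version installed on the cluster.
val sparkVersion = "1.0.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  // Must match the Hadoop version the cluster's Spark build was compiled for.
  "org.apache.hadoop" % "hadoop-client" % "1.2.1"
)
```

Pinning hadoop-client explicitly keeps the driver's Hadoop classes the same as those on the executors, so objects such as FileSplit are serialized and deserialized by the same Hadoop version.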