Spark Internal REST API: Unable to find dependent jars

I am trying to submit a Spark program using the Spark internal REST API.
The request used to submit the program is below. The required supporting jars are in place.
curl -X POST http://quickstart.cloudera:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "SampleSparkProgramApp" ],
  "appResource" : "file:///home/cloudera/test_sample_example/spark-example.jar",
  "clientSparkVersion" : "1.5.0",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "com.example.SampleSparkProgram",
  "sparkProperties" : {
    "spark.jars" : "file:///home/cloudera/test_sample_example/lib/mongo-hadoop-core-1.0-snapshot.jar,file:///home/cloudera/test_sample_example/lib/mongo-java-driver-3.0.4.jar,file:///home/cloudera/test_sample_example/lib/lucene-analyzers-common-5.4.0.jar,file:///home/cloudera/test_sample_example/lib/lucene-core-5.2.1.jar",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "MyJob",
    "spark.eventLog.enabled": "true",
    "spark.submit.deployMode" : "client",
    "spark.master" : "spark://quickstart.cloudera:6066"
  }
}'
The class com.mongodb.hadoop.MongoInputFormat is available in mongo-hadoop-core-1.0-snapshot.jar, and that jar is added to the request under the key "spark.jars".
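As a minimal sketch (not part of the job itself, and assuming only the class name taken from the stack trace below), a probe like this at the top of the main class would confirm whether the driver can actually load the class:

import scala.util.Try

object ClasspathProbe {
  def main(args: Array[String]): Unit = {
    // Hypothetical probe: reports whether the MongoDB Hadoop input format class
    // can be loaded by the driver's classloader at startup.
    val loaded = Try(Class.forName("com.mongodb.hadoop.MongoInputFormat"))
    println(s"com.mongodb.hadoop.MongoInputFormat loadable: ${loaded.isSuccess}")
  }
}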
I am getting the error below in the Spark UI logs.
1.5.0-cdh5.5.0 stderr log page for driver-20160121040910-0026
Back to Master
Bytes 0 - 12640 of 12640
Launch Command: "/usr/java/jdk1.7.0_67-cloudera/jre/bin/java" "-cp" "/usr/lib/spark/sbin/../conf/:/usr/lib/spark/lib/spark-assembly-1.5.0-cdh5.5.0-hadoop2.6.0-cdh5.5.0.jar:/etc/hadoop/conf/:/usr/lib/spark/sbin/../lib/spark-assembly.jar:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/*:/usr/lib/hive/lib/*:/usr/lib/flume-ng/lib/*:/usr/lib/paquet/lib/*:/usr/lib/avro/lib/*" "-Xms1024M" "-Xmx1024M" "-Dspark.eventLog.enabled=true" "-Dspark.driver.supervise=false" "-Dspark.app.name=MyJob" "-Dspark.jars=file:///home/cloudera/test_sample_example/lib/mongo-hadoop-core-1.0-snapshot.jar,file:///home/cloudera/test_sample_example/lib/mongo-java-driver-3.0.4.jar,file:///home/cloudera/test_sample_example/lib/lucene-analyzers-common-5.4.0.jar,file:///home/cloudera/test_sample_example/lib/lucene-core-5.2.1.jar" "-Dspark.master=spark://quickstart.cloudera:7077" "-Dspark.submit.deployMode=client" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker#182.162.106.131:7078/user/Worker" "/var/run/spark/work/driver-20160121040910-0026/spark-example.jar" "com.example.SampleSparkProgram" "SampleSparkProgramApp"
========================================
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/01/21 04:09:16 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 182.162.106.131 instead (on interface eth1)
16/01/21 04:09:16 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/01/21 04:09:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/21 04:09:21 INFO spark.SecurityManager: Changing view acls to: root
16/01/21 04:09:21 INFO spark.SecurityManager: Changing modify acls to: root
16/01/21 04:09:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/01/21 04:09:26 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/01/21 04:09:26 INFO Remoting: Starting remoting
16/01/21 04:09:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://Driver#182.162.106.131:38181]
16/01/21 04:09:27 INFO Remoting: Remoting now listens on addresses: [akka.tcp://Driver#182.162.106.131:38181]
16/01/21 04:09:27 INFO util.Utils: Successfully started service 'Driver' on port 38181.
16/01/21 04:09:27 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker#182.162.106.131:7078/user/Worker
16/01/21 04:09:28 INFO spark.SparkContext: Running Spark version 1.5.0-cdh5.5.0
16/01/21 04:09:28 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker#182.162.106.131:7078/user/Worker
16/01/21 04:09:28 INFO spark.SecurityManager: Changing view acls to: root
16/01/21 04:09:28 INFO spark.SecurityManager: Changing modify acls to: root
16/01/21 04:09:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/01/21 04:09:29 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/01/21 04:09:29 INFO Remoting: Starting remoting
16/01/21 04:09:29 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#182.162.106.131:35467]
16/01/21 04:09:29 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver#182.162.106.131:35467]
16/01/21 04:09:29 INFO util.Utils: Successfully started service 'sparkDriver' on port 35467.
16/01/21 04:09:29 INFO spark.SparkEnv: Registering MapOutputTracker
16/01/21 04:09:30 INFO spark.SparkEnv: Registering BlockManagerMaster
16/01/21 04:09:30 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-6b24e210-4002-4e28-ac60-c2ecc497b914
16/01/21 04:09:30 INFO storage.MemoryStore: MemoryStore started with capacity 534.5 MB
16/01/21 04:09:31 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-bc7bc70e-af91-44cb-a764-8c6d1d9b3acc/httpd-65b7bbf1-af6d-4252-8629-95fcb60f706f
16/01/21 04:09:31 INFO spark.HttpServer: Starting HTTP Server
16/01/21 04:09:31 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/01/21 04:09:31 INFO server.AbstractConnector: Started SocketConnector#0.0.0.0:38126
16/01/21 04:09:31 INFO util.Utils: Successfully started service 'HTTP file server' on port 38126.
16/01/21 04:09:31 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/01/21 04:09:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/01/21 04:09:33 INFO server.AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
16/01/21 04:09:33 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/01/21 04:09:33 INFO ui.SparkUI: Started SparkUI at http://182.162.106.131:4040
16/01/21 04:09:33 INFO spark.SparkContext: Added JAR hdfs:///user/cloudera/sample_example/lib/mongo-hadoop-core-1.0-snapshot.jar at hdfs:///user/cloudera/sample_example/lib/mongo-hadoop-core-1.0-snapshot.jar with timestamp 1453378173778
16/01/21 04:09:33 INFO spark.SparkContext: Added JAR hdfs:///user/cloudera/sample_example/lib/mongo-java-driver-3.0.4.jar at hdfs:///user/cloudera/sample_example/lib/mongo-java-driver-3.0.4.jar with timestamp 1453378173782
16/01/21 04:09:33 INFO spark.SparkContext: Added JAR hdfs:///user/cloudera/sample_example/lib/lucene-analyzers-common-5.4.0.jar at hdfs:///user/cloudera/sample_example/lib/lucene-analyzers-common-5.4.0.jar with timestamp 1453378173783
16/01/21 04:09:33 INFO spark.SparkContext: Added JAR hdfs:///user/cloudera/sample_example/lib/lucene-core-5.2.1.jar at hdfs:///user/cloudera/sample_example/lib/lucene-core-5.2.1.jar with timestamp 1453378173783
16/01/21 04:09:34 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/21 04:09:34 INFO client.AppClient$ClientEndpoint: Connecting to master spark://quickstart.cloudera:7077...
16/01/21 04:09:35 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160121040935-0025
16/01/21 04:09:36 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38749.
16/01/21 04:09:36 INFO netty.NettyBlockTransferService: Server created on 38749
16/01/21 04:09:36 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/01/21 04:09:36 INFO storage.BlockManagerMasterEndpoint: Registering block manager 182.162.106.131:38749 with 534.5 MB RAM, BlockManagerId(driver, 182.162.106.131, 38749)
16/01/21 04:09:36 INFO storage.BlockManagerMaster: Registered BlockManager
16/01/21 04:09:40 INFO scheduler.EventLoggingListener: Logging events to file:/tmp/spark-events/app-20160121040935-0025
16/01/21 04:09:40 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/01/21 04:09:40 INFO analyser.NaiveByesAnalyserFactory: ENTERING
16/01/21 04:09:40 INFO dao.MongoDataExtractor: ENTERING
16/01/21 04:09:41 INFO dao.MongoDataExtractor: EXITING
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: com/mongodb/hadoop/MongoInputFormat
at com.examples.dao.MongoDataExtractor.getData(MongoDataExtractor.java:35)
at com.examples.analyser.NaiveByesAnalyserFactory.getNaiveByesAnalyserFactory(NaiveByesAnalyserFactory.java:27)
at com.example.SampleSparkProgram.main(SampleSparkProgram.java:24)
... 6 more
Caused by: java.lang.ClassNotFoundException: com.mongodb.hadoop.MongoInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 9 more
16/01/21 04:09:41 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/01/21 04:09:41 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/01/21 04:09:41 INFO ui.SparkUI: Stopped Spark web UI at http://182.162.106.131:4040
16/01/21 04:09:41 INFO scheduler.DAGScheduler: Stopping DAGScheduler
16/01/21 04:09:41 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
16/01/21 04:09:41 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
16/01/21 04:09:41 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/01/21 04:09:41 INFO storage.MemoryStore: MemoryStore cleared
16/01/21 04:09:41 INFO storage.BlockManager: BlockManager stopped
16/01/21 04:09:41 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/01/21 04:09:41 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/01/21 04:09:41 INFO spark.SparkContext: Successfully stopped SparkContext
16/01/21 04:09:41 INFO util.ShutdownHookManager: Shutdown hook called
16/01/21 04:09:41 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-bc7bc70e-af91-44cb-a764-8c6d1d9b3acc

Related

Spark MongoDB connector V10 issue

I am having a problem connecting to MongoDB using Spark Structured Streaming.
Here is my Python code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# from lib.logger import Log4j

if __name__ == "__main__":
    spark = (SparkSession
             .builder
             .appName("Streaming from mongo db")
             .master("local[3]")
             .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_10.0.5:3.3.1')
             .config("spark.streaming.stopGracefullyOnShutdown", "true")
             # .config("spark.sql.shuffle.partitions", 3)
             .getOrCreate())

    read_from_mongo = (spark
                       .readStream
                       .format("mongodb")
                       .option("uri", "mongodb://admin:admin#localhost:27017")
                       .option("database", "first_db")
                       .option("collection", "first_collection")
                       .load()
                       .writeStream
                       .format("console")
                       .trigger(continuous="1 second")
                       .outputMode("append"))

    y = read_from_mongo.start()
I am running the script using spark-submit file_name.py, and the output I'm getting is the following:
22/11/16 16:04:21 INFO SparkContext: Running Spark version 3.3.1
22/11/16 16:04:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/11/16 16:04:21 INFO ResourceUtils: ==============================================================
22/11/16 16:04:21 INFO ResourceUtils: No custom resources configured for spark.driver.
22/11/16 16:04:21 INFO ResourceUtils: ==============================================================
22/11/16 16:04:21 INFO SparkContext: Submitted application: Streaming from mongo db
22/11/16 16:04:21 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/11/16 16:04:21 INFO ResourceProfile: Limiting resource is cpu
22/11/16 16:04:21 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/11/16 16:04:21 INFO SecurityManager: Changing view acls to: mustaphaaminedebbih
22/11/16 16:04:21 INFO SecurityManager: Changing modify acls to: mustaphaaminedebbih
22/11/16 16:04:21 INFO SecurityManager: Changing view acls groups to:
22/11/16 16:04:21 INFO SecurityManager: Changing modify acls groups to:
22/11/16 16:04:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mustaphaaminedebbih); groups with view permissions: Set(); users with modify permissions: Set(mustaphaaminedebbih); groups with modify permissions: Set()
22/11/16 16:04:21 INFO Utils: Successfully started service 'sparkDriver' on port 52063.
22/11/16 16:04:21 INFO SparkEnv: Registering MapOutputTracker
22/11/16 16:04:21 INFO SparkEnv: Registering BlockManagerMaster
22/11/16 16:04:21 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/11/16 16:04:21 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/11/16 16:04:21 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/11/16 16:04:21 INFO DiskBlockManager: Created local directory at /private/var/folders/yt/jz2t42md7qx68kjlydk19x5w0000gn/T/blockmgr-53c25f51-35a7-44ab-bc1d-bb16a7ae050c
22/11/16 16:04:22 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
22/11/16 16:04:22 INFO SparkEnv: Registering OutputCommitCoordinator
22/11/16 16:04:22 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/11/16 16:04:22 INFO Executor: Starting executor ID driver on host 192.168.9.44
22/11/16 16:04:22 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
22/11/16 16:04:22 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52065.
22/11/16 16:04:22 INFO NettyBlockTransferService: Server created on 192.168.9.44:52065
22/11/16 16:04:22 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/11/16 16:04:22 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.9.44, 52065, None)
22/11/16 16:04:22 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.9.44:52065 with 434.4 MiB RAM, BlockManagerId(driver, 192.168.9.44, 52065, None)
22/11/16 16:04:22 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.9.44, 52065, None)
22/11/16 16:04:22 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.9.44, 52065, None)
22/11/16 16:04:22 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/11/16 16:04:22 INFO SharedState: Warehouse path is 'file:/Users/mustaphaaminedebbih/Desktop/Scrambling/Spark%20Streaming/Streaming%20From%20Mongodb/spark-warehouse'.
Traceback (most recent call last):
File "/Users/mustaphaaminedebbih/Desktop/Scrambling/Spark Streaming/Streaming From Mongodb/test.py", line 19, in
read_from_mongo = (spark
File "/Users/mustaphaaminedebbih/spark3/spark-3.3.1-bin-hadoop3/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 469, in load
File "/Users/mustaphaaminedebbih/spark3/spark-3.3.1-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in call
File "/Users/mustaphaaminedebbih/spark3/spark-3.3.1-bin-hadoop3/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
File "/Users/mustaphaaminedebbih/spark3/spark-3.3.1-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.load.
java.lang.ClassNotFoundException:
Failed to find data source: mongodb. Please find packages at
https://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:587)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:157)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:144)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.ClassNotFoundException: mongodb.DefaultSource
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:435)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:661)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:661)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:661)
14 more
22/11/16 16:04:23 INFO SparkContext: Invoking stop() from shutdown hook
22/11/16 16:04:23 INFO SparkUI: Stopped Spark web UI at http://192.168.9.44:4040
22/11/16 16:04:23 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/11/16 16:04:23 INFO MemoryStore: MemoryStore cleared
22/11/16 16:04:23 INFO BlockManager: BlockManager stopped
22/11/16 16:04:23 INFO BlockManagerMaster: BlockManagerMaster stopped
22/11/16 16:04:23 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/11/16 16:04:23 INFO SparkContext: Successfully stopped SparkContext
22/11/16 16:04:23 INFO ShutdownHookManager: Shutdown hook called
22/11/16 16:04:23 INFO ShutdownHookManager: Deleting directory /private/var/folders/yt/jz2t42md7qx68kjlydk19x5w0000gn/T/spark-2ac1c680-c1d8-46e1-bfae-40c82cee015f
22/11/16 16:04:23 INFO ShutdownHookManager: Deleting directory /private/var/folders/yt/jz2t42md7qx68kjlydk19x5w0000gn/T/spark-1816dfc9-2a95-4361-88d1-025c999f1514
22/11/16 16:04:23 INFO ShutdownHookManager: Deleting directory /private/var/folders/yt/jz2t42md7qx68kjlydk19x5w0000gn/T/spark-1816dfc9-2a95-4361-88d1-025c999f1514/pyspark-095a2954-0010-4da3-b658-2483dcc79afc
I tried almost every possible solution, but nothing worked.

unable to print scala word count

I'm trying to make a Scala program that counts the number of words in a txt file and prints the final count (on Cloudera, using Spark).
import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleWordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Word Count")
    val sc = new SparkContext(conf)
The program recognises the file location as correct: when I put in a false location, it flags it as an error.
    scala.io.Source.fromFile("/home/cloudera/Books/book1.txt")
      .getLines
      .flatMap(_.split("\\W+"))
      .foldLeft(Map.empty[String, Int]) {
        (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
I've tried different ways to print the count here but got errors, e.g.
System.Out.Println(count)
[error] /home/cloudera/src/main/scala/SimpleWordCount.scala:19:21: type mismatch;
[error] found : Unit
[error] required: scala.collection.immutable.Map[String,Int]
System.out.println(word,count)
type mismatch;
[error] found : Unit
[error] required: scala.collection.immutable.Map[String,Int]
[error] System.out.println(word,count)
}
I added the following line to check whether the program was running:
System.out.println("This is working over here !!!!!!!!!#$%%E^$%^%%$%#$^%")
}
}
When I run the code, it produces the following output:
cloudera#quickstart ~]$ spark-submit --master=local[*] --class=SimpleWordCount /home/cloudera/target/scala-2.10/wordcount_2.10-1.0.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.12.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/04 08:13:09 INFO spark.SparkContext: Running Spark version 1.6.0
18/12/04 08:13:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/12/04 08:13:10 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 192.168.182.129 instead (on interface eth3)
18/12/04 08:13:10 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/12/04 08:13:10 INFO spark.SecurityManager: Changing view acls to: cloudera
18/12/04 08:13:10 INFO spark.SecurityManager: Changing modify acls to: cloudera
18/12/04 08:13:10 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); users with modify permissions: Set(cloudera)
18/12/04 08:13:10 INFO util.Utils: Successfully started service 'sparkDriver' on port 34679.
18/12/04 08:13:11 INFO slf4j.Slf4jLogger: Slf4jLogger started
18/12/04 08:13:11 INFO Remoting: Starting remoting
18/12/04 08:13:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.182.129:47272]
18/12/04 08:13:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriverActorSystem#192.168.182.129:47272]
18/12/04 08:13:11 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 47272.
18/12/04 08:13:11 INFO spark.SparkEnv: Registering MapOutputTracker
18/12/04 08:13:11 INFO spark.SparkEnv: Registering BlockManagerMaster
18/12/04 08:13:11 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-1f139fce-3b13-4c07-bed6-7b35f82ccc6a
18/12/04 08:13:11 INFO storage.MemoryStore: MemoryStore started with capacity 530.3 MB
18/12/04 08:13:11 INFO spark.SparkEnv: Registering OutputCommitCoordinator
18/12/04 08:13:11 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/12/04 08:13:11 INFO server.AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
18/12/04 08:13:11 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
18/12/04 08:13:11 INFO ui.SparkUI: Started SparkUI at http://192.168.182.129:4040
18/12/04 08:13:11 INFO spark.SparkContext: Added JAR file:/home/cloudera/target/scala-2.10/wordcount_2.10-1.0.jar at spark://192.168.182.129:34679/jars/wordcount_2.10-1.0.jar with timestamp 1543939991769
18/12/04 08:13:11 INFO executor.Executor: Starting executor ID driver on host localhost
18/12/04 08:13:11 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58495.
18/12/04 08:13:11 INFO netty.NettyBlockTransferService: Server created on 58495
18/12/04 08:13:11 INFO storage.BlockManagerMaster: Trying to register BlockManager
18/12/04 08:13:11 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:58495 with 530.3 MB RAM, BlockManagerId(driver, localhost, 58495)
18/12/04 08:13:11 INFO storage.BlockManagerMaster: Registered BlockManager
The PrintLn command seems to be working here
This is working over here !!!!!!!!!#$%%E^$%^%%$%#$^%
18/12/04 08:13:12 INFO spark.SparkContext: Invoking stop() from shutdown hook
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
18/12/04 08:13:12 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.182.129:4040
18/12/04 08:13:12 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/12/04 08:13:12 INFO storage.MemoryStore: MemoryStore cleared
18/12/04 08:13:12 INFO storage.BlockManager: BlockManager stopped
18/12/04 08:13:12 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
18/12/04 08:13:12 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/12/04 08:13:12 INFO spark.SparkContext: Successfully stopped SparkContext
18/12/04 08:13:12 INFO util.ShutdownHookManager: Shutdown hook called
18/12/04 08:13:12 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
18/12/04 08:13:12 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e468a57d-10e2-472e-98f2-f3701f6a4b1a
Maybe this can help you.
The problem is that you can't return a print from inside the foldLeft, but you can add one and then return the updated map.
val lines = List(
  "Hello, World!",
  "Goodbye, World!",
  "Hello, Hadoop!"
)

val wordCount =
  lines
    .flatMap(_.split("\\W+"))
    .foldLeft(Map.empty[String, Int]) { (count, word) =>
      println(s"DEBUG count: $count for word: '$word'.")
      count + (word -> (count.getOrElse(word, 0) + 1))
    }

val formattedWordCount =
  wordCount
    .map(tuple => s"${tuple._1} -> ${tuple._2}")
    .mkString("\n", "\n", "\n")

println(s"Final Word Count: $formattedWordCount")
Output
DEBUG count: Map() for word: 'Hello'.
DEBUG count: Map(Hello -> 1) for word: 'World'.
DEBUG count: Map(Hello -> 1, World -> 1) for word: 'Goodbye'.
DEBUG count: Map(Hello -> 1, World -> 1, Goodbye -> 1) for word: 'World'.
DEBUG count: Map(Hello -> 1, World -> 2, Goodbye -> 1) for word: 'Hello'.
DEBUG count: Map(Hello -> 2, World -> 2, Goodbye -> 1) for word: 'Hadoop'.
Final Word Count:
Hello -> 2
World -> 2
Goodbye -> 1
Hadoop -> 1

I can't debug my program in IntelliJ IDEA CE

I get "Disconnected from the target VM, address: '127.0.0.1:39989', transport: 'socket'" in IntelliJ IDEA CE and can't debug my program. Any suggestions?
Connected to the target VM, address: '127.0.0.1:39989', transport: 'socket'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/12/29 17:29:47 INFO SparkContext: Running Spark version 2.1.2
17/12/29 17:29:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/12/29 17:29:49 WARN Utils: Your hostname, ashfaq-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
17/12/29 17:29:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/12/29 17:29:49 INFO SecurityManager: Changing view acls to: ashfaq
17/12/29 17:29:49 INFO SecurityManager: Changing modify acls to: ashfaq
17/12/29 17:29:49 INFO SecurityManager: Changing view acls groups to:
17/12/29 17:29:49 INFO SecurityManager: Changing modify acls groups to:
17/12/29 17:29:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ashfaq); groups with view permissions: Set(); users with modify permissions: Set(ashfaq); groups with modify permissions: Set()
17/12/29 17:29:51 INFO Utils: Successfully started service 'sparkDriver' on port 46133.
17/12/29 17:29:51 INFO SparkEnv: Registering MapOutputTracker
17/12/29 17:29:51 INFO SparkEnv: Registering BlockManagerMaster
17/12/29 17:29:51 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/12/29 17:29:51 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/12/29 17:29:51 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b3b48105-28be-4781-a395-c7e83cc72e8c
17/12/29 17:29:51 INFO MemoryStore: MemoryStore started with capacity 393.1 MB
17/12/29 17:29:51 INFO SparkEnv: Registering OutputCommitCoordinator
17/12/29 17:29:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/12/29 17:29:53 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
17/12/29 17:29:53 INFO Executor: Starting executor ID driver on host localhost
17/12/29 17:29:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33583.
17/12/29 17:29:54 INFO NettyBlockTransferService: Server created on 10.0.2.15:33583
17/12/29 17:29:54 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/12/29 17:29:54 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 33583, None)
17/12/29 17:29:54 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:33583 with 393.1 MB RAM, BlockManagerId(driver, 10.0.2.15, 33583, None)
17/12/29 17:29:54 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 33583, None)
17/12/29 17:29:54 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 33583, None)
17/12/29 17:29:58 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 236.5 KB, free 392.8 MB)
17/12/29 17:29:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.9 KB, free 392.8 MB)
17/12/29 17:29:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.2.15:33583 (size: 22.9 KB, free: 393.1 MB)
17/12/29 17:29:59 INFO SparkContext: Created broadcast 0 from textFile at scalaApp.scala:13
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ashfaq/Desktop/saclaAPP/data/UserPurchaseHistory.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1968)
at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
at ScalaApp$.main(scalaApp.scala:18)
at ScalaApp.main(scalaApp.scala)
17/12/29 17:29:59 INFO SparkContext: Invoking stop() from shutdown hook
17/12/29 17:29:59 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
17/12/29 17:29:59 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.0.2.15:33583 in memory (size: 22.9 KB, free: 393.1 MB)
17/12/29 17:29:59 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/12/29 17:30:00 INFO MemoryStore: MemoryStore cleared
17/12/29 17:30:00 INFO BlockManager: BlockManager stopped
17/12/29 17:30:00 INFO BlockManagerMaster: BlockManagerMaster stopped
17/12/29 17:30:00 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/12/29 17:30:00 INFO SparkContext: Successfully stopped SparkContext
17/12/29 17:30:00 INFO ShutdownHookManager: Shutdown hook called
Disconnected from the target VM, address: '127.0.0.1:39989', transport: 'socket'
17/12/29 17:30:00 INFO ShutdownHookManager: Deleting directory /tmp/spark-58667739-7c15-4665-8ede-fde9c3ff1d83
Process finished with exit code 1
It looks like you are trying to open a file which doesn't exist. The first line of the error message says so:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ashfaq/Desktop/saclaAPP/data/UserPurchaseHistory.csv

Spark streaming with Kafka connector stopping

I am getting started with Spark Streaming. I want to consume a stream from Kafka using the sample code I found in the Spark documentation: https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html
Here is my code:
object SparkStreaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Test_kafka_spark").setMaster("local[*]") // local parallelism 1
    val ssc = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9093",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "test",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("spark")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.map(record => (record.key, record.value))
  }
}
All seemed to start well, but the job stopped immediately. The logs follow:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/04/19 14:37:37 INFO SparkContext: Running Spark version 2.1.0
17/04/19 14:37:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/19 14:37:37 WARN Utils: Your hostname, thibaut-Precision-M4600 resolves to a loopback address: 127.0.1.1; using 10.192.176.101 instead (on interface eno1)
17/04/19 14:37:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/04/19 14:37:37 INFO SecurityManager: Changing view acls to: thibaut
17/04/19 14:37:37 INFO SecurityManager: Changing modify acls to: thibaut
17/04/19 14:37:37 INFO SecurityManager: Changing view acls groups to:
17/04/19 14:37:37 INFO SecurityManager: Changing modify acls groups to:
17/04/19 14:37:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(thibaut); groups with view permissions: Set(); users with modify permissions: Set(thibaut); groups with modify permissions: Set()
17/04/19 14:37:37 INFO Utils: Successfully started service 'sparkDriver' on port 41046.
17/04/19 14:37:37 INFO SparkEnv: Registering MapOutputTracker
17/04/19 14:37:37 INFO SparkEnv: Registering BlockManagerMaster
17/04/19 14:37:37 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/04/19 14:37:37 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/04/19 14:37:37 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-266e2f13-0eb2-40a8-9d2f-d50797099a29
17/04/19 14:37:37 INFO MemoryStore: MemoryStore started with capacity 879.3 MB
17/04/19 14:37:37 INFO SparkEnv: Registering OutputCommitCoordinator
17/04/19 14:37:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/04/19 14:37:38 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.192.176.101:4040
17/04/19 14:37:38 INFO Executor: Starting executor ID driver on host localhost
17/04/19 14:37:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39207.
17/04/19 14:37:38 INFO NettyBlockTransferService: Server created on 10.192.176.101:39207
17/04/19 14:37:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/04/19 14:37:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.192.176.101, 39207, None)
17/04/19 14:37:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.192.176.101:39207 with 879.3 MB RAM, BlockManagerId(driver, 10.192.176.101, 39207, None)
17/04/19 14:37:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.192.176.101, 39207, None)
17/04/19 14:37:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.192.176.101, 39207, None)
17/04/19 14:37:38 WARN KafkaUtils: overriding enable.auto.commit to false for executor
17/04/19 14:37:38 WARN KafkaUtils: overriding auto.offset.reset to none for executor
17/04/19 14:37:38 WARN KafkaUtils: overriding executor group.id to spark-executor-test
17/04/19 14:37:38 WARN KafkaUtils: overriding receive.buffer.bytes to 65536 see KAFKA-3135
17/04/19 14:37:38 INFO SparkContext: Invoking stop() from shutdown hook
17/04/19 14:37:38 INFO SparkUI: Stopped Spark web UI at http://10.192.176.101:4040
17/04/19 14:37:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/04/19 14:37:38 INFO MemoryStore: MemoryStore cleared
17/04/19 14:37:38 INFO BlockManager: BlockManager stopped
17/04/19 14:37:38 INFO BlockManagerMaster: BlockManagerMaster stopped
17/04/19 14:37:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/04/19 14:37:38 INFO SparkContext: Successfully stopped SparkContext
17/04/19 14:37:38 INFO ShutdownHookManager: Shutdown hook called
17/04/19 14:37:38 INFO ShutdownHookManager: Deleting directory /tmp/spark-f28a1361-58ba-416b-ac8e-11da0044c1f2
Thanks for any help.
It appears you haven't started your StreamingContext.
Try adding these two lines at the end:
ssc.start
ssc.awaitTermination
You did not call any action on the DStream, so nothing gets executed (map is a transformation and is lazy). You also need to start the StreamingContext.
Please look at this complete example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
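Putting both suggestions together, a minimal sketch of the code from the question with an output operation added and the context started (same Kafka parameters as above) might look like this:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SparkStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test_kafka_spark").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9093",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "test",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Array("spark"), kafkaParams)
    )

    // print() is an output operation, so each batch is actually executed.
    stream.map(record => (record.key, record.value)).print()

    // Start the streaming computation and block until it terminates.
    ssc.start()
    ssc.awaitTermination()
  }
}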

When reading a text file from the file system, Spark still tries to connect to HDFS

I just created a DC/OS cluster and am trying to run a simple Spark task that reads data from /mnt/mesos/sandbox.
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Simple Application")

    println("STARTING JOB!")

    val sc = new SparkContext(conf)
    val rdd = sc.textFile("file:///mnt/mesos/sandbox/foo")
    println(rdd.count)

    println("ENDING JOB!")
  }
}
And I'm deploying the app using
dcos spark run --submit-args='--conf spark.mesos.uris=https://dripit-spark.s3.amazonaws.com/foo --class SimpleApp https://dripit-spark.s3.amazonaws.com/foobar-assembly-1.0.jar' --verbose
Unfortunately, the task keeps failing with the following exception:
I0701 18:47:35.782994 30997 logging.cpp:188] INFO level logging started!
I0701 18:47:35.783197 30997 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-S2","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/dripit-spark.s3.amazonaws.com\/foobar-assembly-1.0.jar"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/dripit-spark.s3.amazonaws.com\/foo"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-S2\/frameworks\/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-0002\/executors\/driver-20160701184530-0001\/runs\/67b94f34-a9d3-4662-bedc-8578381e9305"}
I0701 18:47:35.784752 30997 fetcher.cpp:379] Fetching URI 'https://dripit-spark.s3.amazonaws.com/foobar-assembly-1.0.jar'
I0701 18:47:35.784791 30997 fetcher.cpp:250] Fetching directly into the sandbox directory
I0701 18:47:35.784818 30997 fetcher.cpp:187] Fetching URI 'https://dripit-spark.s3.amazonaws.com/foobar-assembly-1.0.jar'
I0701 18:47:35.784835 30997 fetcher.cpp:134] Downloading resource from 'https://dripit-spark.s3.amazonaws.com/foobar-assembly-1.0.jar' to '/var/lib/mesos/slave/slaves/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-S2/frameworks/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-0002/executors/driver-20160701184530-0001/runs/67b94f34-a9d3-4662-bedc-8578381e9305/foobar-assembly-1.0.jar'
W0701 18:47:36.057448 30997 fetcher.cpp:272] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: https://dripit-spark.s3.amazonaws.com/foobar-assembly-1.0.jar
I0701 18:47:36.057673 30997 fetcher.cpp:456] Fetched 'https://dripit-spark.s3.amazonaws.com/foobar-assembly-1.0.jar' to '/var/lib/mesos/slave/slaves/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-S2/frameworks/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-0002/executors/driver-20160701184530-0001/runs/67b94f34-a9d3-4662-bedc-8578381e9305/foobar-assembly-1.0.jar'
I0701 18:47:36.057696 30997 fetcher.cpp:379] Fetching URI 'https://dripit-spark.s3.amazonaws.com/foo'
I0701 18:47:36.057714 30997 fetcher.cpp:250] Fetching directly into the sandbox directory
I0701 18:47:36.057741 30997 fetcher.cpp:187] Fetching URI 'https://dripit-spark.s3.amazonaws.com/foo'
I0701 18:47:36.057770 30997 fetcher.cpp:134] Downloading resource from 'https://dripit-spark.s3.amazonaws.com/foo' to '/var/lib/mesos/slave/slaves/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-S2/frameworks/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-0002/executors/driver-20160701184530-0001/runs/67b94f34-a9d3-4662-bedc-8578381e9305/foo'
W0701 18:47:36.114565 30997 fetcher.cpp:272] Copying instead of extracting resource from URI with 'extract' flag, because it does not seem to be an archive: https://dripit-spark.s3.amazonaws.com/foo
I0701 18:47:36.114600 30997 fetcher.cpp:456] Fetched 'https://dripit-spark.s3.amazonaws.com/foo' to '/var/lib/mesos/slave/slaves/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-S2/frameworks/c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-0002/executors/driver-20160701184530-0001/runs/67b94f34-a9d3-4662-bedc-8578381e9305/foo'
I0701 18:47:36.307576 31006 exec.cpp:143] Version: 0.28.1
I0701 18:47:36.310127 31022 exec.cpp:217] Executor registered on slave c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-S2
16/07/01 18:47:37 INFO SparkContext: Running Spark version 1.6.1
16/07/01 18:47:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/01 18:47:37 WARN SparkConf:
SPARK_JAVA_OPTS was detected (set to '-Dspark.mesos.executor.docker.image=mesosphere/spark:1.0.0-1.6.1-2 ').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with conf/spark-defaults.conf to set defaults for an application
- ./spark-submit with --driver-java-options to set -X options for a driver
- spark.executor.extraJavaOptions to set -X options for executors
- SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
16/07/01 18:47:37 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to '-Dspark.mesos.executor.docker.image=mesosphere/spark:1.0.0-1.6.1-2 ' as a work-around.
16/07/01 18:47:37 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to '-Dspark.mesos.executor.docker.image=mesosphere/spark:1.0.0-1.6.1-2 ' as a work-around.
16/07/01 18:47:37 INFO SecurityManager: Changing view acls to: root
16/07/01 18:47:37 INFO SecurityManager: Changing modify acls to: root
16/07/01 18:47:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/07/01 18:47:37 INFO Utils: Successfully started service 'sparkDriver' on port 47358.
16/07/01 18:47:38 INFO Slf4jLogger: Slf4jLogger started
16/07/01 18:47:38 INFO Remoting: Starting remoting
16/07/01 18:47:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#10.0.1.107:54467]
16/07/01 18:47:38 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54467.
16/07/01 18:47:38 INFO SparkEnv: Registering MapOutputTracker
16/07/01 18:47:38 INFO SparkEnv: Registering BlockManagerMaster
16/07/01 18:47:38 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-96092a9a-3164-4d65-8c0b-df5403abb056
16/07/01 18:47:38 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/07/01 18:47:38 INFO SparkEnv: Registering OutputCommitCoordinator
16/07/01 18:47:38 INFO Server: jetty-8.y.z-SNAPSHOT
16/07/01 18:47:38 INFO AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
16/07/01 18:47:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/01 18:47:38 INFO SparkUI: Started SparkUI at http://10.0.1.107:4040
16/07/01 18:47:38 INFO HttpFileServer: HTTP File server directory is /tmp/spark-37696e45-5e8b-4328-81e6-deec1f185d75/httpd-69184304-7ffd-4420-b020-5f8a1bafecbd
16/07/01 18:47:38 INFO HttpServer: Starting HTTP Server
16/07/01 18:47:38 INFO Server: jetty-8.y.z-SNAPSHOT
16/07/01 18:47:38 INFO AbstractConnector: Started SocketConnector#0.0.0.0:49074
16/07/01 18:47:38 INFO Utils: Successfully started service 'HTTP file server' on port 49074.
16/07/01 18:47:38 INFO SparkContext: Added JAR file:/mnt/mesos/sandbox/foobar-assembly-1.0.jar at http://10.0.1.107:49074/jars/foobar-assembly-1.0.jar with timestamp 1467398858626
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#log_env#712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#log_env#716: Client environment:host.name=ip-10-0-1-107.eu-west-1.compute.internal
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#log_env#723: Client environment:os.name=Linux
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#log_env#724: Client environment:os.arch=4.1.7-coreos-r1
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#log_env#725: Client environment:os.version=#2 SMP Thu Nov 5 02:10:23 UTC 2015
I0701 18:47:38.778355 103 sched.cpp:164] Version: 0.25.0
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#log_env#733: Client environment:user.name=(null)
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#log_env#741: Client environment:user.home=/root
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#log_env#753: Client environment:user.dir=/opt/spark/dist
2016-07-01 18:47:38,778:6(0x7f74cafc9700):ZOO_INFO#zookeeper_init#786: Initiating client connection, host=master.mesos:2181 sessionTimeout=10000 watcher=0x7f74d587c600 sessionId=0 sessionPasswd=<null> context=0x7f7540003f70 flags=0
2016-07-01 18:47:38,786:6(0x7f74c6ec0700):ZOO_INFO#check_events#1703: initiated connection to server [10.0.7.83:2181]
2016-07-01 18:47:38,787:6(0x7f74c6ec0700):ZOO_INFO#check_events#1750: session establishment complete on server [10.0.7.83:2181], sessionId=0x155a57d07f60050, negotiated timeout=10000
I0701 18:47:38.788107 99 group.cpp:331] Group process (group(1)#10.0.1.107:35064) connected to ZooKeeper
I0701 18:47:38.788147 99 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0701 18:47:38.788162 99 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I0701 18:47:38.789402 99 detector.cpp:156] Detected a new leader: (id='1')
I0701 18:47:38.789512 99 group.cpp:674] Trying to get '/mesos/json.info_0000000001' in ZooKeeper
I0701 18:47:38.790228 99 detector.cpp:481] A new leading master (UPID=master#10.0.7.83:5050) is detected
I0701 18:47:38.790293 99 sched.cpp:262] New master detected at master#10.0.7.83:5050
I0701 18:47:38.790473 99 sched.cpp:272] No credentials provided. Attempting to register without authentication
I0701 18:47:38.792147 97 sched.cpp:641] Framework registered with c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-0002-driver-20160701184530-0001
16/07/01 18:47:38 INFO CoarseMesosSchedulerBackend: Registered as framework ID c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-0002-driver-20160701184530-0001
16/07/01 18:47:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38752.
16/07/01 18:47:38 INFO NettyBlockTransferService: Server created on 38752
16/07/01 18:47:38 INFO BlockManagerMaster: Trying to register BlockManager
16/07/01 18:47:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.1.107:38752 with 511.1 MB RAM, BlockManagerId(driver, 10.0.1.107, 38752)
16/07/01 18:47:38 INFO BlockManagerMaster: Registered BlockManager
16/07/01 18:47:39 INFO CoarseMesosSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/07/01 18:47:39 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 117.2 KB, free 117.2 KB)
16/07/01 18:47:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.6 KB, free 129.8 KB)
16/07/01 18:47:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.1.107:38752 (size: 12.6 KB, free: 511.1 MB)
16/07/01 18:47:39 INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:13
16/07/01 18:47:39 INFO CoarseMesosSchedulerBackend: Mesos task 4 is now TASK_RUNNING
16/07/01 18:47:39 INFO CoarseMesosSchedulerBackend: Mesos task 2 is now TASK_RUNNING
16/07/01 18:47:39 INFO CoarseMesosSchedulerBackend: Mesos task 0 is now TASK_RUNNING
16/07/01 18:47:39 INFO CoarseMesosSchedulerBackend: Mesos task 1 is now TASK_RUNNING
16/07/01 18:47:39 INFO CoarseMesosSchedulerBackend: Mesos task 3 is now TASK_RUNNING
16/07/01 18:47:39 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn1. Check your hdfs-site.xml file to ensure namenodes are configured properly.
16/07/01 18:47:39 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn2. Check your hdfs-site.xml file to ensure namenodes are configured properly.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: namenode1.hdfs.mesos
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:240)
at org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.getProxy(ConfiguredFailoverProxyProvider.java:124)
at org.apache.hadoop.io.retry.RetryInvocationHandler.<init>(RetryInvocationHandler.java:74)
at org.apache.hadoop.io.retry.RetryInvocationHandler.<init>(RetryInvocationHandler.java:65)
at org.apache.hadoop.io.retry.RetryProxy.create(RetryProxy.java:58)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:152)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:579)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:524)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:146)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:427)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at SimpleApp$.main(SimpleApp.scala:15)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:123)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: namenode1.hdfs.mesos
... 48 more
16/07/01 18:47:39 INFO SparkContext: Invoking stop() from shutdown hook
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/07/01 18:47:39 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/07/01 18:47:40 INFO SparkUI: Stopped Spark web UI at http://10.0.1.107:4040
16/07/01 18:47:40 INFO CoarseMesosSchedulerBackend: Shutting down all executors
16/07/01 18:47:40 INFO CoarseMesosSchedulerBackend: Asking each executor to shut down
I0701 18:47:40.051103 111 sched.cpp:1771] Asked to stop the driver
I0701 18:47:40.051283 96 sched.cpp:1040] Stopping framework 'c4bf7f81-1cf7-413a-b9be-8dc3b36137ee-0002-driver-20160701184530-0001'
16/07/01 18:47:40 INFO CoarseMesosSchedulerBackend: driver.run() returned with code DRIVER_STOPPED
16/07/01 18:47:40 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/07/01 18:47:40 INFO MemoryStore: MemoryStore cleared
16/07/01 18:47:40 INFO BlockManager: BlockManager stopped
16/07/01 18:47:40 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/01 18:47:40 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/01 18:47:40 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/01 18:47:40 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/07/01 18:47:40 INFO SparkContext: Successfully stopped SparkContext
16/07/01 18:47:40 INFO ShutdownHookManager: Shutdown hook called
16/07/01 18:47:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-37696e45-5e8b-4328-81e6-deec1f185d75/httpd-69184304-7ffd-4420-b020-5f8a1bafecbd
16/07/01 18:47:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-37696e45-5e8b-4328-81e6-deec1f185d75
Why is Spark trying to connect to HDFS even though the file URI scheme is explicitly set to file://?
I thought that sc.textFile("file:///") does not require an HDFS setup.
Spark always uses the Hadoop API to access a file, regardless of whether that file is local or in HDFS.
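You can see this in the stack trace above: the failure happens in JobConf.getWorkingDirectory, which resolves fs.defaultFS from the inherited hdfs-site.xml before the file:// path is even opened. A minimal sketch of what SimpleApp presumably looks like (the app name and input path are assumptions) shows why even a purely local read still goes through the Hadoop API:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleApp"))
    // Even with an explicit file:// URI, textFile goes through hadoopFile,
    // which builds a Hadoop JobConf; JobConf.getWorkingDirectory then resolves
    // fs.defaultFS -- and that is where the unresolvable HA namenode hosts fail.
    val lines = sc.textFile("file:///tmp/input.txt")   // hypothetical local input
    println(lines.count())                             // count() triggers getPartitions
    sc.stop()
  }
}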
I think the problem is that your Spark installation is inheriting an invalid HDFS configuration and is hitting this bug: https://issues.apache.org/jira/browse/SPARK-11227
You should try the workarounds listed in that ticket to see if any of them work for you:
Use an older Spark version (< 1.5.0).
Disable HA in the HDFS configuration (or override fs.defaultFS from Spark, as in the sketch below).
Note that Spark will still use HDFS to write the intermediate results of the stages (in your case, I guess the partial counts).
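As one concrete way to try the second workaround, you can force the default filesystem to be the local one, so the driver never tries to resolve namenode1.hdfs.mesos / namenode2.hdfs.mesos at all. This is only a sketch based on Spark's spark.hadoop.* property forwarding, not something taken verbatim from the ticket; the app name and input path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleAppLocalFs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SimpleAppLocalFs")
      // Properties prefixed with spark.hadoop. are copied into the Hadoop
      // Configuration, so this overrides whatever the inherited hdfs-site.xml says.
      .set("spark.hadoop.fs.defaultFS", "file:///")
    val sc = new SparkContext(conf)
    println(sc.textFile("file:///tmp/input.txt").count())  // hypothetical local input
    sc.stop()
  }
}

The same setting should also be usable without touching the code, by adding "spark.hadoop.fs.defaultFS" : "file:///" to the sparkProperties of the REST submission request, since those properties end up in the driver's Spark configuration.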