Remote Spark Connection - Scala: NullPointerException on registerBlockManager

Spark Master and Worker are both running on localhost. I started the Master and Worker nodes with the following command:
sbin/start-all.sh
Logs for Master node invocation:
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/jre/bin/java -cp /Users/gaurishi/spark/spark-2.3.1-bin-hadoop2.7/conf/:/Users/gaurishi/spark/spark-2.3.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host 186590dbe5bd.ant.abc.com --port 7077 --webui-port 8080
Logs for Worker node invocation:
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/jre/bin/java -cp /Users/gaurishi/spark/spark-2.3.1-bin-hadoop2.7/conf/:/Users/gaurishi/spark/spark-2.3.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://186590dbe5bd.ant.abc.com:7077
I have the following configuration in conf/spark-env.sh:
SPARK_MASTER_HOST=186590dbe5bd.ant.abc.com
Content of /etc/hosts:
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
127.0.0.1 186590dbe5bd.ant.abc.com
The Scala code that I am running to establish the remote Spark connection:
val sparkConf = new SparkConf()
  .setAppName(AppConstants.AppName)
  .setMaster("spark://186590dbe5bd.ant.abc.com:7077")
val sparkSession = SparkSession.builder()
  .appName(AppConstants.AppName)
  .config(sparkConf)
  .enableHiveSupport()
  .getOrCreate()
While executing the code from the IDE, I am getting the following exception in the console:
2018-10-04 18:58:38,488 INFO [main] storage.BlockManagerMaster (Logging.scala:logInfo(54)) - Registering BlockManager BlockManagerId(driver, 192.168.0.38, 56083, None)
2018-10-04 18:58:38,491 ERROR [main] spark.SparkContext (Logging.scala:logError(91)) - Error initializing SparkContext.
java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:64)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:227)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:518)
………
2018-10-04 18:58:38,496 INFO [main] spark.SparkContext (Logging.scala:logInfo(54)) - SparkContext already stopped.
2018-10-04 18:58:38,492 INFO [dispatcher-event-loop-3] scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint (Logging.scala:logInfo(54)) - OutputCommitCoordinator stopped!
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:64)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:227)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:518)
………
Logs from /logs/master show the following error:
18/10/04 18:58:18 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local class incompatible: stream classdesc serialVersionUID = 1835832137613908542, local class serialVersionUID = -1329125091869941550
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
…………
Spark Versions:
Spark: spark-2.3.1-bin-hadoop2.7
Build dependencies:
Scala: 2.11
Spark-hive: 2.2.2
Maven: org.spark-project.hive hive-metastore = 1.x
What changes should be made to successfully connect to Spark remotely? Thanks.
Complete Log:
Console.log
Spark Master-Node.log

Related

Spark application erroring out because of driver OutOfMemory even though lots of free memory is still available

Can someone please help me figure out why my simple Spark application requires huge driver memory? Even though I allocated about 112 GB, my application fails at about 67 GB.
Thanks in advance.
The Spark driver is using huge memory for running a simple application.
I allocated about 112 GB of memory for my application when doing spark-submit.
At the start of the job, I see the message below in the logs:
1019 [main] INFO org.apache.spark.storage.memory.MemoryStore - MemoryStore started with capacity 67.0 GiB
My application fails with this error message:
java.lang.IllegalStateException: dag-scheduler-event-loop has already been stopped accidentally.
at org.apache.spark.util.EventLoop.post(EventLoop.scala:107)
at org.apache.spark.scheduler.DAGScheduler.taskStarted(DAGScheduler.scala:283)
at org.apache.spark.scheduler.TaskSetManager.prepareLaunchingTask(TaskSetManager.scala:539)
at org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$2(TaskSetManager.scala:478)
at scala.Option.map(Option.scala:230)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:455)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2(TaskSchedulerImpl.scala:395)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2$adapted(TaskSchedulerImpl.scala:390)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$1(TaskSchedulerImpl.scala:390)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:381)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20(TaskSchedulerImpl.scala:587)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20$adapted(TaskSchedulerImpl.scala:582)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16(TaskSchedulerImpl.scala:582)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16$adapted(TaskSchedulerImpl.scala:555)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:555)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.$anonfun$makeOffers$5(CoarseGrainedSchedulerBackend.scala:359)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:955)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:351)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:162)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
59376410 [dispatcher-CoarseGrainedScheduler] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Cancelling stage 1
59376410 [dispatcher-CoarseGrainedScheduler] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Killing all running tasks in stage 1: Stage cancelled
59376415 [dispatcher-CoarseGrainedScheduler] ERROR org.apache.spark.scheduler.DAGSchedulerEventProcessLoop - DAGSchedulerEventProcessLoop failed; shutting down SparkContext
Scala code snippet:
val df = spark.read.parquet(data_path)
df.rdd.foreachPartition(p => {
  // code to process the partition...
})
Job submit command:
spark-submit --master "spark://x.x.x.x:7077" --driver-cores=4 --driver-memory=112G --conf spark.driver.maxResultSize=0 --conf spark.rpc.message.maxSize=2047 --conf spark.driver.host=x.x.x.x --class myclass.processor --packages "..,org.apache.hadoop:hadoop-azure:3.3.1" --deploy-mode client
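A note on the 67.0 GiB figure: it is roughly what Spark's unified memory manager would report for a 112 GB heap, assuming the default spark.memory.fraction of 0.6 and the fixed 300 MB system reservation. The sketch below just reproduces that arithmetic with illustrative hard-coded values; it does not read a real SparkConf.
// Rough estimate of the MemoryStore capacity Spark logs at startup.
object UnifiedMemoryEstimate {
  def main(args: Array[String]): Unit = {
    val heapBytes      = 112L * 1024 * 1024 * 1024 // --driver-memory=112G
    val reservedBytes  = 300L * 1024 * 1024        // fixed 300 MB system reserve
    val memoryFraction = 0.6                       // spark.memory.fraction default
    val unified = (heapBytes - reservedBytes) * memoryFraction
    // Prints ~67.0 GiB, matching "MemoryStore started with capacity 67.0 GiB"
    println(f"Unified memory: ${unified / math.pow(1024, 3)}%.1f GiB")
  }
}
So the MemoryStore capacity itself is expected for a 112 GB heap; the rest of the heap sits outside the unified region.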

Cannot connect Apache Spark to MongoDB with SSL

I have successfully installed Apache Spark on Ubuntu 18.04 and added the mongo-spark-connector to my Spark installation. I am currently trying to connect to a MongoDB cluster I have set up externally. I can connect to my MongoDB cluster with SSL enabled from various different sources (SSL is required by the MongoDB database server), but when I try to connect through Spark, the connection times out. To connect through SSL I usually use a private key (pk.pem) and a CA certificate (ca.crt).
I have done the following setup:
I have converted the private key file from PEM format to PKCS12 format
I have converted the CA certificate into PEM format
I have created a new keystore and added my newly formatted PKCS12 file (using keytool)
I have created a new truststore and added my CA certificate in PEM format (using keytool)
I currently start my script with the following command:
spark-submit \
  --driver-java-options "-Djavax.net.ssl.trustStore=/path/to/truststore.ks -Djavax.net.ssl.trustStorePassword=tspassword -Djavax.net.ssl.keyStore=/path/to/keystore.ks -Djavax.net.ssl.keyStorePassword=kspassword" \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/path/to/truststore.ks -Djavax.net.ssl.trustStorePassword=tspassword -Djavax.net.ssl.keyStore=/path/to/keystore.ks -Djavax.net.ssl.keyStorePassword=kspassword" \
  script.py
The following is my PySpark code:
mongo_url = 'mongodb://<user>:<pass>@<host>:<port>/db.collection?replicaSet=replica-set-name&ssl=true&authSource=test&readPreference=nearest'
mongo_df = sqlContext.read.format('com.mongodb.spark.sql.DefaultSource').option('uri', mongo_url).load()
When I execute the script, the connection times out with the following output:
19/11/14 15:36:47 INFO SparkContext: Created broadcast 0 from broadcast at MongoSpark.scala:542
19/11/14 15:36:47 INFO cluster: Cluster created with settings {hosts=[<host>:<port>], mode=MULTIPLE, requiredClusterType=REPLICA_SET, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500, requiredReplicaSetName='haip-replica-set'}
19/11/14 15:36:47 INFO cluster: Adding discovered server <host>:<port> to client view of cluster
19/11/14 15:36:47 INFO MongoClientCache: Creating MongoClient: [<host>:<port>]
19/11/14 15:36:47 INFO cluster: No server chosen by com.mongodb.client.internal.MongoClientDelegate$1@140f4273 from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=<host>:<port>, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
19/11/14 15:36:47 INFO cluster: Exception in monitor thread while connecting to server mongodb-data-1.haip.io:27017
com.mongodb.MongoSocketReadException: Prematurely reached end of stream
at com.mongodb.internal.connection.SocketStream.read(SocketStream.java:112)
at com.mongodb.internal.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:580)
at com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:445)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:299)
at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:259)
at com.mongodb.internal.connection.CommandHelper.sendAndReceive(CommandHelper.java:83)
at com.mongodb.internal.connection.CommandHelper.executeCommand(CommandHelper.java:33)
at com.mongodb.internal.connection.InternalStreamConnectionInitializer.initializeConnectionDescription(InternalStreamConnectionInitializer.java:105)
at com.mongodb.internal.connection.InternalStreamConnectionInitializer.initialize(InternalStreamConnectionInitializer.java:62)
at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:129)
at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:117)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/home/user/script.py", line 16, in <module>
mongo_df = sqlContext.read.format('com.mongodb.spark.sql.DefaultSource').option('uri', mongo_url).load()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches com.mongodb.client.internal.MongoClientDelegate$1@140f4273. Client view of cluster state is {type=REPLICA_SET, servers=[{address=<host>:<port>, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketReadException: Prematurely reached end of stream}}]
at com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:408)
at com.mongodb.internal.connection.BaseCluster.selectServer(BaseCluster.java:123)
at com.mongodb.internal.connection.AbstractMultiServerCluster.selectServer(AbstractMultiServerCluster.java:54)
at com.mongodb.client.internal.MongoClientDelegate.getConnectedClusterDescription(MongoClientDelegate.java:147)
at com.mongodb.client.internal.MongoClientDelegate.createClientSession(MongoClientDelegate.java:100)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.getClientSession(MongoClientDelegate.java:277)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:181)
at com.mongodb.client.internal.MongoDatabaseImpl.executeCommand(MongoDatabaseImpl.java:186)
at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:155)
at com.mongodb.client.internal.MongoDatabaseImpl.runCommand(MongoDatabaseImpl.java:150)
at com.mongodb.spark.MongoConnector$$anonfun$1.apply(MongoConnector.scala:237)
at com.mongodb.spark.MongoConnector$$anonfun$1.apply(MongoConnector.scala:237)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector$$anonfun$withDatabaseDo$1.apply(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:157)
at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:174)
at com.mongodb.spark.MongoConnector.hasSampleAggregateOperator(MongoConnector.scala:237)
at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator$lzycompute(MongoRDD.scala:221)
at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator(MongoRDD.scala:221)
at com.mongodb.spark.sql.MongoInferSchema$.apply(MongoInferSchema.scala:68)
at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:97)
at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
19/11/14 15:37:17 INFO SparkContext: Invoking stop() from shutdown hook
19/11/14 15:37:17 INFO MongoClientCache: Closing MongoClient: [<host>:<port>]
Spark version: 2.4.4
Scala version: 2.11
Mongo Java Driver version: 3.11.2
Mongo Spark Connector version: 2.11-2.4.1
The keystore and truststore were in an incorrect format (.ks). Once I converted them to the correct format (.jks), it worked.
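For anyone verifying the same thing: the JDK's standard java.security.KeyStore API refuses to load a store that is not really in JKS format, so a small check like the sketch below (path and password are placeholders) can confirm the format before wiring the store into spark-submit.
import java.io.FileInputStream
import java.security.KeyStore

// Minimal sketch: confirm the store actually loads as JKS.
object KeyStoreCheck {
  def main(args: Array[String]): Unit = {
    val ks = KeyStore.getInstance("JKS")
    val in = new FileInputStream("/path/to/keystore.jks")
    try ks.load(in, "kspassword".toCharArray) // throws IOException for a non-JKS file
    finally in.close()
    println(s"Loaded JKS keystore with ${ks.size()} entries")
  }
}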

Remote Spark Connection - Scala: Could not find BlockManagerMaster

Spark Master and Worker are both running on localhost. I started the Master and Worker nodes with the following command:
sbin/start-all.sh
Logs for Master node invocation:
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/jre/bin/java -cp /Users/gaurishi/spark/spark-2.3.1-bin-hadoop2.7/conf/:/Users/gaurishi/spark/spark-2.3.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host 192.168.0.38 --port 7077 --webui-port 8080
Logs for Worker node invocation:
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/jre/bin/java -cp /Users/gaurishi/spark/spark-2.3.1-bin-hadoop2.7/conf/:/Users/gaurishi/spark/spark-2.3.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://192.168.0.38:7077
I have the following configuration in conf/spark-env.sh:
SPARK_MASTER_HOST=192.168.0.38
Content of /etc/hosts:
127.0.0.1 localhost
::1 localhost
255.255.255.255 broadcasthost
The Scala code that I am invoking to establish the remote Spark connection:
val sparkConf = new SparkConf()
  .setAppName(AppConstants.AppName)
  .setMaster("spark://192.168.0.38:7077")
val sparkSession = SparkSession.builder()
  .appName(AppConstants.AppName)
  .config(sparkConf)
  .enableHiveSupport()
  .getOrCreate()
While executing the code from the IDE, I am getting the following exception in the console:
2018-10-04 14:43:33,426 ERROR [main] spark.SparkContext (Logging.scala:logError(91)) - Error initializing SparkContext.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
........
Caused by: org.apache.spark.SparkException: Could not find BlockManagerMaster.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:157)
at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:132)
.......
2018-10-04 14:43:33,432 INFO [stop-spark-context] spark.SparkContext (Logging.scala:logInfo(54)) - Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
........
Caused by: org.apache.spark.SparkException: Could not find BlockManagerMaster.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:157)
at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:132)
........
Logs from /logs/master show the following error:
18/10/04 14:43:13 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local class incompatible: stream classdesc serialVersionUID = 1835832137613908542, local class serialVersionUID = -1329125091869941550
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1885)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
.......
What changes should be made to connect to Spark remotely?
Spark Versions:
Spark: spark-2.3.1-bin-hadoop2.7
Build dependencies:
Scala: 2.11
Spark-hive: 2.2.2
Maven: org.spark-project.hive hive-metastore = 1.x
Logs:
Console log
Spark Master-Node log
I know this is an old post, but I'm sharing my answer to save someone else precious time.
I was facing a similar issue two days ago, and after a lot of hacking I found that the root cause of the problem was the Scala version I was using in my Maven project.
I was using Spark 2.4.3, which internally uses Scala 2.11, while the Scala project I was using was compiled with Scala 2.12. This Scala version mismatch was the reason for the above error.
When I downgraded the Scala version in my Maven project, it started working. Hope it helps.
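For example, in sbt terms (the same idea applies to a Maven <properties> section), a minimal build that keeps the Scala binary version aligned with the Spark distribution might look like this; the exact versions are illustrative:
// build.sbt — pin scalaVersion to what your Spark build ships with
// (Spark 2.4.3 prebuilt distributions use Scala 2.11).
ThisBuild / scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // %% appends the Scala binary suffix (_2.11) to the artifact name,
  // so the dependencies cannot silently drift to a 2.12 build.
  "org.apache.spark" %% "spark-core" % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.4.3" % "provided"
)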

Run simple Spark code in Scala IDE

I want to use Scala IDE and run Spark code on Windows 7. I have already installed Scala IDE and started by creating a Scala project. So I need to know: are there any instructions for running the following code in Scala IDE?
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "D:/Spark_Installation/eclipse-ws/Scala/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
      .setMaster("spark://myhost:7077")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
When I run this code, I get the following errors:
15/03/26 11:59:55 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@myhost:7077/user/Master...
15/03/26 11:59:58 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@myhost:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@myhost:7077
15/03/26 11:59:58 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@myhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: myhost
15/03/26 12:00:15 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@myhost:7077/user/Master...
15/03/26 12:00:17 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@myhost:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@myhost:7077
15/03/26 12:00:17 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@myhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: myhost
15/03/26 12:00:35 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster@myhost:7077/user/Master...
15/03/26 12:00:37 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@myhost:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@myhost:7077
15/03/26 12:00:37 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@myhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: myhost
15/03/26 12:00:55 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/03/26 12:00:55 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
15/03/26 12:00:55 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
Do you have a Spark master set up? If not, have a look at this:
http://spark.apache.org/docs/1.2.1/submitting-applications.html#master-urls
You would mostly want to use
local[*]
which uses every core your local computer has, instead of:
spark://myhost:7077
The spark:// URL assumes you have a Spark master set up at myhost:7077.
If you are running as a local standalone cluster, then please use local[*]; that means: use all the cores your machine has. So now your SparkConf object creation would look like the following:
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
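Putting that together, a minimal self-contained version of the original example that runs entirely in-process (no standalone master or worker needed) could look like the sketch below; the file path is a placeholder for any text file on your machine:
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: run SimpleApp against the in-process local[*] master.
object SimpleAppLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local[*]") // use all local cores; no spark://host:7077 needed
    val sc = new SparkContext(conf)
    try {
      val logData = sc.textFile("D:/Spark_Installation/eclipse-ws/Scala/README.md", 2).cache()
      val numAs = logData.filter(_.contains("a")).count()
      val numBs = logData.filter(_.contains("b")).count()
      println(s"Lines with a: $numAs, Lines with b: $numBs")
    } finally sc.stop()
  }
}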

Monitoring a task in Apache Spark

I start the Spark master using: ./sbin/start-master.sh
as described at:
http://spark.apache.org/docs/latest/spark-standalone.html
I then submit the Spark job:
sh ./bin/spark-submit \
--class simplespark.Driver \
--master spark://localhost:7077 \
C:\\Users\\Adrian\\workspace\\simplespark\\target\\simplespark-0.0.1-SNAPSHOT.jar
How can I run a simple app which demonstrates a parallel task running?
When I view http://localhost:4040/executors/ and http://localhost:8080/ there are no tasks running:
The .jar I'm running (simplespark-0.0.1-SNAPSHOT.jar) just contains a single Scala object :
package simplespark

import org.apache.spark.SparkContext

object Driver {
  def main(args: Array[String]) {
    val conf = new org.apache.spark.SparkConf()
      .setMaster("local")
      .setAppName("knn")
      .setSparkHome("C:\\spark-1.1.0-bin-hadoop2.4\\spark-1.1.0-bin-hadoop2.4")
      .set("spark.executor.memory", "2g");
    val sc = new SparkContext(conf);
    val l = List(1)
    sc.parallelize(l)
    while (true) {}
  }
}
Update: When I change --master spark://localhost:7077 \ to --master spark://Adrian-PC:7077 \
I can see updates on the Spark UI.
I have also updated Driver.scala to use the default context, as I'm not sure whether I set it up correctly for submitting Spark jobs:
package simplespark

import org.apache.spark.SparkContext

object Driver {
  def main(args: Array[String]) {
    System.setProperty("spark.executor.memory", "2g")
    val sc = new SparkContext();
    val l = List(1)
    val c = sc.parallelize(List(2, 3, 5, 7)).count()
    println(c)
    sc.stop
  }
}
On the Spark console I repeatedly receive the same message:
14/12/26 20:08:32 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
So it appears that the Spark job is not reaching the master?
Update 2: After I start the worker (thanks to Lomig Mégard's comment below) using:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://Adrian-PC:7077
I receive an error:
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Executor app-20141227212351-0003/8 removed: java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor added: app-20141227212351-0003/9 on worker-20141227211411-Adrian-PC-58199 (Adrian-PC:58199) with 4 cores
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Granted executor ID app-20141227212351-0003/9 on hostPort Adrian-PC:58199 with 4 cores, 2.0 GB RAM
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor updated: app-20141227212351-0003/9 is now RUNNING
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor updated: app-20141227212351-0003/9 is now FAILED (java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified)
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Executor app-20141227212351-0003/9 removed: java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified
14/12/27 21:23:52 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
14/12/27 21:23:52 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED
14/12/27 21:23:52 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at Driver.scala:14)
14/12/27 21:23:52 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
I'm running the scripts on Windows using Cygwin. To fix this error I copied the Spark installation to the Cygwin C:\ drive. But then I received a new error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
You have to start the actual computation to see the job.
val c = sc.parallelize(List(2, 3, 5, 7)).count()
println(c)
Here, count is called an action; you need at least one action to begin a job. You can find the list of available actions in the Spark docs.
The other methods are called transformations. They are lazily executed.
Don't forget to stop the context at the end, instead of your infinite loop, with sc.stop().
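As a small illustration of the distinction, assuming an existing SparkContext sc (the values here are illustrative):
val rdd     = sc.parallelize(List(2, 3, 5, 7)) // builds an RDD; nothing runs yet
val doubled = rdd.map(_ * 2)                   // transformation: lazily recorded
val total   = doubled.reduce(_ + _)            // action: triggers the actual job
println(total)                                 // 34
sc.stop()                                      // clean shutdown instead of while(true){}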
Edit: For the updated question, you are allocating more memory to the executor than is available on the worker. The defaults should be fine for simple tests.
You also need to have a running worker linked to your master. See this doc to start it.
./sbin/start-master.sh
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT