Spark worker nodes timeout - scala

When I run my Spark app using sbt run with a configuration pointing at the master of a remote cluster, nothing useful gets executed by the workers, and the following warning is printed repeatedly in the sbt run log:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
This is what my Spark config looks like:
@transient lazy val conf: SparkConf = new SparkConf()
  .setMaster("spark://master-ip:7077")
  .setAppName("HelloWorld")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "12g")
@transient lazy val sc: SparkContext = new SparkContext(conf)
val lines = sc.textFile("hdfs://master-public-dns:9000/test/1000.csv")
I know this warning usually appears when the cluster is misconfigured and the workers either don't have the resources or weren't started in the first place. However, according to my Spark UI (on master-ip:8080), the worker nodes seem to be alive with sufficient RAM and CPU cores. They even try to execute my app, but they exit and leave this in the stderr log:
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled;
users with view permissions: Set(ubuntu, myuser);
groups with view permissions: Set(); users with modify permissions: Set(ubuntu, myuser); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
...
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
... 8 more
ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Any ideas?

Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
Could you telnet to this port on this IP from a worker? Maybe your driver machine has multiple network interfaces; try setting SPARK_LOCAL_IP in $SPARK_HOME/conf/spark-env.sh.
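If editing spark-env.sh isn't convenient, here is a minimal sketch of an equivalent fix from the driver code; 10.0.0.5 is a hypothetical address on an interface the workers can actually route to (the log above suggests they cannot reach 192.168.0.11):
import org.apache.spark.SparkConf

// Hypothetical address: replace with an interface that is reachable
// from the worker nodes.
val conf = new SparkConf()
  .setMaster("spark://master-ip:7077")
  .setAppName("HelloWorld")
  .set("spark.driver.host", "10.0.0.5")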

Related

Hudi: Access to timeline server times out in embedded mode

I am testing Hudi 0.5.3 (supported by AWS Athena) by running it with Spark in embedded mode, i.e. with unit tests. At first the tests succeeded, but now they're failing due to a timeout when accessing Hudi's timeline server.
The following is based on the Hudi Getting Started guide.
Spark Session setup:
private val spark = addSparkConfigs(SparkSession.builder()
    .appName("spark testing")
    .master("local"))
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.ui.port", "4041")
  .enableHiveSupport()
  .getOrCreate()
Code which causes the timeout exception:
val inserts = convertToStringList(dataGen.generateInserts(10))
var df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
The timeout and the exception thrown:
170762 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating remote view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661. Server=xxx:59520
170766 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating InMemory based view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661
170769 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://xxx:59520/v1/hoodie/view/datafiles/beforeoron/latest/?partition=americas%2Funited_states%2Fsan_francisco&maxinstant=20201221180946&basepath=%2Fvar%2Ffolders%2Fz9%2F_9mf84p97hz1n45b0gnpxlj40000gp%2FT%2FHudiQuickStartSpec-hudi_trips_cow2193648737745630661&lastinstantts=20201221180946&timelinehash=70f7aa073fa3d86033278a59cbda71c6488f4883570d826663ebb51934a25abf)
246649 [Executor task launch worker for task 47] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesFromParams(RemoteHoodieTableFileSystemView.java:223)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesBeforeOrOn(RemoteHoodieTableFileSystemView.java:230)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFilesBeforeOrOn(PriorityBasedFileSystemView.java:134)
at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$19c2c1bb$1(HoodieBloomIndex.java:201)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
I wasn't able to experiment with different ports for Hudi's timeline server, as I couldn't find the config setting that controls the port.
Any ideas why access to the timeline server times out?
The problem turned out to be rooted in the way Hudi resolves the Spark driver host. It seems that although it starts and binds its web server to localhost, Hudi's client subsequently uses the IP address to make calls to the server it started.
5240 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Starting Javalin ...
5348 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Listening on http://localhost:59520/
...
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
The solution is to configure the "spark.driver.host" setting explicitly. The following worked for me:
private val spark = addSparkConfigs(SparkSession.builder()
    .appName("spark testing")
    .master("local"))
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.driver.host", "localhost")
  .config("spark.ui.port", "4041")
  .enableHiveSupport()
  .getOrCreate()
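An alternative that may also sidestep the issue in unit tests (hedged: hoodie.embed.timeline.server is a Hudi write config, not something from the quickstart guide, and the log above shows Hudi falling back to an in-memory view anyway) is to disable the embedded timeline server for the write:
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  // Assumption: disabling the embedded timeline server avoids the HTTP
  // round-trip to the driver host entirely.
  option("hoodie.embed.timeline.server", "false").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)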

Unable to submit Pyspark code via ExecuteSparkInteractive processor in Apache NiFi

I am new to Python and the Apache ecosystem. I am trying to submit PySpark code via the ExecuteSparkInteractive processor in Apache NiFi. I do not have detailed knowledge of any of the components being used here; I am only Googling and doing hit-and-trial.
This way I have successfully configured and started Spark, NiFi, and Livy on EMR, and I am able to submit PySpark code via Livy in an interactive session.
However, nothing happens when I configure ExecuteSparkInteractive to submit PySpark code via Livy. The Livy session manager shows nothing, and there are no errors visible in the ExecuteSparkInteractive processor.
This is my configuration for LivySessionController:
This is the sample code I submit under properties in ExecuteSparkInteractive:
import random
from pyspark import SparkConf, SparkContext

# create SparkContext using standalone mode
conf = SparkConf().setMaster("local").setAppName("SimpleETL")
sc = SparkContext.getOrCreate(conf)

NUM_SAMPLES = 100000

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Here is the code that works for me in an interactive session:
import json, pprint, requests, textwrap

host = 'http://localhost:8998'
data = {'kind': 'pyspark'}
headers = {'Content-Type': 'application/json'}
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)

# Get the session URL
session_url = host + r.headers['Location']
sn_r = requests.get(session_url, headers=headers)
statements_url = session_url + '/statements'
data = {
    'code': textwrap.dedent("""
        import random
        from pyspark import SparkConf, SparkContext

        # create SparkContext using standalone mode
        conf = SparkConf().setMaster("local").setAppName("SimpleETL")
        sc = SparkContext.getOrCreate(conf)

        NUM_SAMPLES = 100000

        def sample(p):
            x, y = random.random(), random.random()
            return 1 if x*x + y*y < 1 else 0

        count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
        print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
        """)
}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
These are the log excerpts from nifi-app.log:
#After starting the processor
2018-07-18 06:38:11,768 INFO [NiFi Web Server-112] o.a.n.c.s.StandardProcessScheduler Starting ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:38:11,770 INFO [Monitor Processore Lifecycle Thread-1] o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run with 1 threads
2018-07-18 06:38:11,883 INFO [Flow Service Tasks Thread-1] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@36fb0996 // Another save pending = false
2018-07-18 06:38:57,106 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog@12830e23 checkpointed with 0 Records and 0 Swap Files in 7 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 2 millis), max Transaction ID -1
#After stopping the processor
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.c.s.StandardProcessScheduler Stopping ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7]
2018-07-18 06:39:09,835 INFO [NiFi Web Server-106] o.a.n.controller.StandardProcessorNode Stopping processor: class org.apache.nifi.processors.livy.ExecuteSparkInteractive
2018-07-18 06:39:09,838 INFO [Timer-Driven Process Thread-9] o.a.n.c.s.TimerDrivenSchedulingAgent Stopped scheduling ExecuteSparkInteractive[id=ac05cd49-0164-1000-6793-2df960eb8de7] to run
2018-07-18 06:39:09,917 INFO [Flow Service Tasks Thread-2] o.a.nifi.controller.StandardFlowService Saved flow controller org.apache.nifi.controller.FlowController@36fb0996 // Another save pending = false
Interestingly, when I enable LivySessionController in NiFi, the Livy UI shows two new sessions: the one created first shows the "idle" state, while the later one (the one with the greater session id) keeps showing the "starting" state even after several refreshes. Call them session ids 1 and 2, respectively. Session id 2 then changes state from "starting" to "shutting_down" to "dead". As soon as it is dead, a new session (session id 3) is created with state "starting", which later becomes "idle". Below are log excerpts from these 3 sessions:
#Livy 1st session:
18/07/18 06:33:58 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!
18/07/18 06:33:58 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-84-145.ec2.internal:4040
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Shutting down all executors
18/07/18 06:33:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
18/07/18 06:33:58 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
18/07/18 06:33:58 INFO YarnClientSchedulerBackend: Stopped
18/07/18 06:33:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/07/18 06:33:59 INFO MemoryStore: MemoryStore cleared
18/07/18 06:33:59 INFO BlockManager: BlockManager stopped
18/07/18 06:33:59 INFO BlockManagerMaster: BlockManagerMaster stopped
18/07/18 06:33:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/07/18 06:33:59 INFO SparkContext: Successfully stopped SparkContext
#Livy 2nd session:
18/07/18 06:34:30 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
#Livy 3rd session:
18/07/18 06:36:15 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
A few things here.
Livy session controller:
Make sure you see 2 sessions per node when you enable the controller service, and both sessions must be in the running state on the Spark UI (but not performing any operation until the Python code runs from NiFi). If you see unusual behavior, focus on getting that fixed first. A possible action: add a StandardSSLContextService controller, set up the keystore and truststore, and use it in LivySessionController (under the property SSL Context Service).
Within the Python code:
I think you don't have to import SparkConf or SparkContext, and you don't need to create conf and sc either. You only need to import SparkSession, as below:
from pyspark.sql import SparkSession
and you can simply use spark (it's available by default as the Spark session variable),
e.g. spark.sql(""" ...sql-statement... """), or spark.sparkContext for sc.
The last thing: you mentioned that "Livy session manager shows nothing, and there are no errors visible in ExecuteSparkInteractive processor."
For this, you can add a dummy processor like UpdateAttribute after the ExecuteSparkInteractive processor and keep it in disabled mode. You also have to direct the output from the Spark interactive processor to UpdateAttribute in all 3 states (success, failure, wait). This way you will be able to see what the outcome is after the PySpark code runs within NiFi. Refer to the diagram below for a sample.
I hope this will help you fix your issues.
Sample Nifi template to test PySpark code

Spark and Amazon S3 not setting credentials in executors

I'm writing a Spark program that reads from and writes to Amazon S3. My problem is that it works if I execute in local mode (--master local[6]), but if I execute on the cluster (on other machines) I get an error with the credentials:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 33, mmdev02.stratio.com): com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
My code is as follows:
val conf = new SparkConf().setAppName("BackupS3")
val sc = SparkContext.getOrCreate(conf)
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.buffer.dir", "/var/tmp/spark")
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true");
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
I can write to Amazon S3 but cannot read! I also had to pass some properties when I do spark-submit, because my region is Frankfurt and I had to enable V4 signing:
--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
I tried passing the credentials this way too. If I put them in hdfs-site.xml on every machine, it works.
My question is: how can I do it from code? Why are the executors not getting the configuration I pass them from the code?
I'm using Spark 1.5.2, hadoop-aws 2.7.1 and aws-java-sdk 1.7.4.
Thanks
Don't hard-code the secret keys; that leads to loss of secrets.
If you are running in EC2, your secrets will be picked up automatically from the IAM feature; the client asks a magic web server for session secrets.
...which means it may be that Spark's automatic credential propagation is getting in the way. Unset your AWS_ environment variables before submitting the work.
If you set these properties explicitly in your code, the values will only be visible to the driver process. The executors will not have a chance to pick up those credentials.
If you had set them in actual config file like core-site.xml, they will propagate.
Your code would work in local mode because all operations are happening in a single process.
Why it works on a cluster with small files but not large ones (*): the code could also work on unpartitioned files, where read operations are performed in the driver and partitions are then broadcast to executors. On partitioned files, where executors read individual partitions, the credentials won't be set on the executors, so it fails.
Best to use standard mechanisms for passing credentials, or better yet, use EC2 roles and IAM policies in your cluster as EricJ's answer suggests. By default, if you do not provide credentials, EMRFS will look up temporary credentials via EC2 instance metadata service.
(*) I am still learning about this myself, and I may need to revise this answer as I learn more
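As a concrete sketch of the "standard mechanisms" route (my illustration, not the asker's verified fix): properties prefixed with spark.hadoop. are copied into the Hadoop Configuration that Spark ships to executors, unlike values set on sc.hadoopConfiguration after the context is already running.
import org.apache.spark.{SparkConf, SparkContext}

// ACCESS_KEY and SECRET_KEY are hypothetical placeholders; load real
// values from a secure store rather than hard-coding them.
val conf = new SparkConf()
  .setAppName("BackupS3")
  // spark.hadoop.* entries travel with the job configuration, so the
  // executors see them when they open fs.s3a paths.
  .set("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
  .set("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
  .set("spark.hadoop.fs.s3a.endpoint", "s3-eu-central-1.amazonaws.com")
val sc = SparkContext.getOrCreate(conf)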

Simple Spark program eats all resources

I have a server with a Spark master and a slave running on it. Spark was built manually with the following flags:
build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package
I'm trying to execute the following simple program remotely:
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("testApp").setMaster("spark://sparkserver:7077")
  val sc = new SparkContext(conf)
  println(sc.parallelize(Array(1, 2, 3)).reduce((a, b) => a + b))
}
Spark dependency:
"org.apache.spark" %% "spark-core" % "1.6.1"
Log on program executing:
16/04/12 18:45:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My cluster WebUI:
Why does such a simple application use all available resources?
P.S. I also noticed that if I allocate more memory for my app (e.g. 10 GB), the following logs appear many times:
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now RUNNING
16/04/12 19:23:40 INFO AppClient$ClientEndpoint: Executor updated: app-20160412182336-0008/208 is now EXITED (Command exited with code 1)
I think the reason is the connection between the master and the slave. This is how I set up the master and slave (on the same machine):
sbin/start-master.sh
sbin/start-slave.sh spark://sparkserver:7077
P.P.S. When I connect to the Spark master with spark-shell, all is good:
spark-shell --master spark://sparkserver:7077
By default, YARN will allocate all "available" resources if YARN dynamic resource allocation is set to true and your job still has queued tasks. You can also look at your YARN configuration, namely the number of executors and the memory allocated to each one, and tune them to your needs.
In the file spark-defaults.conf, set: spark.cores.max=4
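Equivalently, a minimal sketch of setting the same cap from the application itself, assuming the standalone master from the question (the cap of 4 cores is arbitrary):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("testApp")
  .setMaster("spark://sparkserver:7077")
  // Cap the total cores this app may claim; by default a standalone
  // app takes every core the cluster can offer.
  .set("spark.cores.max", "4")
val sc = new SparkContext(conf)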
It was a driver issue. The driver (my Scala app) was run on my local computer, and the workers had no access to it. As a result, all resources were eaten by attempts to reconnect to the driver.

Spark atop of Docker not accepting jobs

I'm trying to make a hello world example work with Spark + Docker; here is my code.
object Generic {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://172.17.0.3:7077", "Generic", "/opt/spark-0.9.0")
    val NUM_SAMPLES = 100000
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random * 2 - 1
      val y = Math.random * 2 - 1
      if (x * x + y * y < 1) 1.0 else 0.0
    }.reduce(_ + _)
    println("Pi is roughly " + 4 * count / NUM_SAMPLES)
  }
}
When I run sbt run, I get
14/05/28 15:19:58 INFO client.AppClient$ClientActor: Connecting to master spark://172.17.0.3:7077...
14/05/28 15:20:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I checked both the cluster UI, where I have 3 nodes that each have 1.5g of memory, and the namenode UI, where I see the same thing.
The docker logs show no output from the workers and the following from the master
14/05/28 21:20:38 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@10.0.3.1:48085]: Error [Association failed with [akka.tcp://spark@10.0.3.1:48085]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@10.0.3.1:48085]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /10.0.3.1:48085
]
This happens a couple times, and then the program times out and dies with
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Spark cluster looks down
When I did a tcpdump over the docker0 interface, it looked like the workers and the master nodes were talking.
However, the Spark console works.
If I set sc as val sc = new SparkContext("local", "Generic", System.getenv("SPARK_HOME")), the program runs.
I've been there. The issue looks like the Akka actor subsystem in Spark is binding to a different interface than docker0.
While your master IP is spark://172.17.0.3:7077,
Akka is binding to akka.tcp://spark@10.0.3.1:48085.
If your masters/slaves are Docker containers, they should be communicating through the docker0 interface in the 172.17.x.x range.
Try providing the master and slaves with their correct local IP using the env config SPARK_LOCAL_IP. See config docs for details.
In our docker setup for Spark 0.9 we are using this command to start the slaves:
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_IP -i $LOCAL_IP
Which directly provides the local IP to the worker.
For running Spark on Docker it's crucial to:
1. Expose all necessary ports
2. Set the correct spark.broadcast.factory
3. Handle Docker aliases
Without handling all 3 issues, the Spark cluster parts (master, worker, driver) can't communicate. You can read closely about every issue at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html or use a container ready for Spark from https://registry.hub.docker.com/u/epahomov/docker-spark/
You have to check the firewall if you are on a Windows host and make sure java.exe is allowed to access the public network, or change DockerNAT to private. In general, the worker must be able to connect back to the driver (the program you submitted).
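As with the earlier answers, a hedged sketch of advertising a driver address the containers can route to; 172.17.42.1 is a hypothetical docker0 gateway, not an address from the question:
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical docker0 gateway address; substitute the interface the
// containers can actually reach the driver host on.
val conf = new SparkConf()
  .setMaster("spark://172.17.0.3:7077")
  .setAppName("Generic")
  .setSparkHome("/opt/spark-0.9.0")
  .set("spark.driver.host", "172.17.42.1")
val sc = new SparkContext(conf)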