Number of Executors in Spark Local Mode - scala

So I am running a spark job in local mode.
I use the following command to run the job
spark-submit --master local[*] --driver-memory 256g --class main.scala.mainClass target/scala-2.10/spark_proj-assembly-1.0.jar 0 large.csv 100 outputFolder2 10
I am running this on a machine with 32 cores and 256GB RAM. When creating the conf I use the following code
val conf = new SparkConf().setMaster("local[*]").setAppName("My App")
Now I know that in local mode, Spark runs everything inside a single JVM, but does that mean it launches only one driver and uses it as the executor as well? In my timeline it shows one "executor driver" added.
And when I go to the Executors page, there is just one executor with 32 cores assigned to it.
Is this the default behavior? I was expecting Spark to launch one executor per core instead of just one executor that gets all the cores. If someone can explain the behavior, that would be great.

Is this the default behavior?
In local mode, your driver + executors are, as you've said, created inside a single JVM process. What you see isn't a separate executor; it is a view of how many cores your job has at its disposal. Usually, when running in local mode, you should only see the driver in the Executors view.
If you look at the code for LocalSchedulerBackend, you'll see the following comment:
/**
* Used when running a local version of Spark where the executor, backend, and master all run in
* the same JVM. It sits behind a [[TaskSchedulerImpl]] and handles launching tasks on a single
* Executor (created by the [[LocalSchedulerBackend]]) running locally.
*/
In other words, there is a single executor, running in the same JVM as the driver, that handles all tasks.
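If you want to see this from inside the application itself, here is a minimal sketch (assuming Spark 2.x, where SparkContext exposes a status tracker) that prints the registered executors; in local mode it reports a single entry, the driver acting as the executor:

import org.apache.spark.{SparkConf, SparkContext}

object LocalExecutorCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("local-executor-check")
    val sc = new SparkContext(conf)

    // In local mode this prints a single entry: the "driver" acting as the executor.
    sc.statusTracker.getExecutorInfos.foreach { info =>
      println(s"executor host=${info.host} runningTasks=${info.numRunningTasks}")
    }

    sc.stop()
  }
}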

Related

Apache Spark Configuration in Scala [duplicate]

I found some code to start spark locally with:
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val ctx = new SparkContext(conf)
What does the [*] mean?
From the doc:
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed
cluster, or local to run locally with one thread, or local[N] to run
locally with N threads. You should start by using local for testing.
And from here:
local[*] Run Spark locally with as many worker threads as logical
cores on your machine.
Master URL : Meaning
local : Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[K,F] : Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable)
local[*] : Run Spark locally with as many worker threads as logical cores on your machine.
local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.
spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters with Zookeeper. The list must have all the master hosts in the high availability cluster set up with Zookeeper. The port must be whichever each master is configured to use, which is 7077 by default.
mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever you have configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
https://spark.apache.org/docs/latest/submitting-applications.html
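As a quick illustration of how these URL forms are used from code (the values are placeholders, not recommendations):

import org.apache.spark.sql.SparkSession

// Illustrative only: any of the master URL forms from the list above can go here.
val spark = SparkSession.builder()
  .appName("master-url-demo")
  .master("local[4,2]")            // 4 worker threads, up to 2 task failures tolerated
  // .master("local[*]")           // one thread per logical core
  // .master("spark://host:7077")  // standalone cluster master
  .getOrCreate()

println(spark.sparkContext.master) // prints the master URL actually in use
spark.stop()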
Some additional Info
Do not run Spark Streaming programs locally with master configured as "local" or "local[1]". This allocates only one CPU for tasks and if a receiver is running on it, there is no resource left to process the received data. Use at least "local[2]" to have more cores.
From -Learning Spark: Lightning-Fast Big Data Analysis
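To make that concrete, a minimal streaming sketch (the localhost:9999 socket source is just a placeholder, e.g. fed by nc -lk 9999); with local[1] the receiver would occupy the only thread and no batch would ever be processed:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]": one thread is taken by the socket receiver,
// the other stays free to process the received batches.
val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-local-demo")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
lines.map(_.length).print()

ssc.start()
ssc.awaitTermination()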
Master URL
You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.
The URL says how many threads can be used in total:
local uses 1 thread only.
local[n] uses n threads.
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of spark.task.maxFailures.
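For example, a small sketch (the numbers are arbitrary examples) showing what these forms translate to at runtime:

import org.apache.spark.{SparkConf, SparkContext}

// What "local[*]" resolves to on this JVM:
println(Runtime.getRuntime.availableProcessors())

val conf = new SparkConf()
  .setAppName("thread-count-demo")
  .setMaster("local[*]")       // all logical cores
  // .setMaster("local[8]")    // exactly 8 threads
  // .setMaster("local[8,3]")  // 8 threads, spark.task.maxFailures = 3

val sc = new SparkContext(conf)
println(sc.defaultParallelism)  // equals the thread count, unless spark.default.parallelism overrides it
sc.stop()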
You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.
The URL says how many threads can be used in total:
local uses 1 thread only.
local[n] uses n threads.
local[*] uses as many threads as the machine on which you are running your application has.
You can check this with lscpu on a Linux machine:
[ie#mapr2 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
If your machine has 56 logical CPUs, then your Spark job will be partitioned into 56 parts.
NOTE: it may be the case that your spark-defaults.conf file caps the default parallelism (to 10, for example); in that case your partition count will match the default value set in the config.
local[N, maxFailures] (called local-with-retries) with N being * or the number of threads to use (as explained above) and maxFailures being the value of spark.task.maxFailures.
Without *, Spark uses the number of threads you specify (plain local uses a single thread).
With *, Spark uses all the threads available to run the program.
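If you suspect spark-defaults.conf is capping the parallelism, a quick check from code (a sketch only; the value 56 is just an example) would be:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("parallelism-check")
// Uncomment to override whatever spark-defaults.conf sets:
// conf.set("spark.default.parallelism", "56")

val sc = new SparkContext(conf)
println(sc.getConf.getOption("spark.default.parallelism")) // None unless set somewhere
println(sc.defaultParallelism)                             // effective default parallelism
println(sc.parallelize(1 to 1000).getNumPartitions)        // partitions of a fresh RDD
sc.stop()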

Spark Standalone Cluster deployMode = "cluster": Where is my Driver?

I have researched this for a significant amount of time and find answers that seem to be for a slightly different question than mine.
UPDATE: The Spark docs say the Driver runs on a cluster Worker in deployMode: cluster. This does not seem to be true when you don't use spark-submit.
My Spark 2.3.3 cluster is running fine. I see the GUI at "http://master-address:8080"; there are 2 idle workers, as configured.
I have a Scala application that creates a context and starts a Job. I do not use spark-submit, I start the Job programmatically and this is where many answers diverge from my question.
In "my-app" I create a new SparkConf, with the following code (slightly abbreviated):
conf.setAppName("my-job")
conf.setMaster("spark://master-address:7077")
conf.set("deployMode", "cluster")
// other settings like driver and executor memory requests
// the driver and executor memory requests are for all mem on the slaves, more than
// mem available on the launching machine with "my-app"
val jars = listJars("/path/to/lib")
conf.setJars(jars)
…
When I launch the job I see 2 executors running on the 2 nodes/workers/slaves. The logs show their IP address and calls them executor 0 and 1.
With a YARN cluster I would expect the "Driver" to run on/in the YARN master, but I am using the Spark Standalone master. Where is the Driver part of the job running? If it runs on a random worker or elsewhere, is there a way to find it from the logs?
Where is my Spark Driver executing? Does deployMode = cluster work when not using spark-submit? Evidence shows a cluster with one master (on the same machine as executor 0) and 2 Workers. It also shows identical memory usage on both Workers during the job. From the logs I know both Workers are running Executors. Where is the Driver?
The "Driver" creates and broadcasts some large data structures, so the need for an answer is more critical than with the more typical tiny Drivers.
Where is the driver running? How do I find it given logs and monitoring? I can't reconcile what I see with the docs; they contradict each other.
This is answered by the official documentation:
In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
In other words, the driver uses an arbitrary worker node, so on such a small cluster it is likely to be co-located with one of the executors. And to anticipate the follow-up question: this behavior is not configurable; you just have to make sure that the cluster has the capacity to start both the required executors and the driver with its requested memory and cores.
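If you need to pin down which machine got the driver, one low-tech approach (a sketch, not the only way) is to log the driver's host early in the application; in cluster deploy mode that output lands in the driver's stderr under whichever worker was chosen, which you can reach from the standalone master UI (port 8080 by default) via the application's driver entry:

import java.net.InetAddress
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// spark.driver.host is filled in by Spark once the driver is up.
println(s"spark.driver.host = ${sc.getConf.get("spark.driver.host", "unset")}")
println(s"driver JVM runs on = ${InetAddress.getLocalHost.getHostName}")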

How to change number of executors in local mode?

Is it possible to set multiple executors for Spark Streaming application in a local mode using some Spark Conf settings?
For now, I cannot see any changes in the Spark UI in terms of performance or executor count when I change the spark.executor.instances parameter to 4, for example.
Local mode is a development tool where all components are simulated on a single machine. Since a single JVM means a single executor, changing the number of executors is simply not possible, and spark.executor.instances is not applicable.
All you can do in local mode is increase the number of threads by modifying the master URL: local[n], where n is the number of threads.
Local mode is by definition a "pseudo-cluster" that runs in a single JVM. That means the maximum number of executors is 1.
If you want to experiment with multiple executors on a local machine, what you can do is create a standalone cluster with a couple of workers running on your local machine. The number of running worker instances is then the maximum number of executors for your tasks.
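A rough sketch of that setup (assuming you have already started a standalone master and worker(s) on this machine with sbin/start-master.sh and sbin/start-worker.sh, or start-slave.sh on older releases; the numbers are examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("multi-executor-on-one-machine")
  .setMaster("spark://localhost:7077")  // standalone master, no longer local mode
  .set("spark.executor.cores", "4")     // cores per executor
  .set("spark.cores.max", "16")         // total cores -> up to 4 executors of 4 cores each
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)
println(sc.statusTracker.getExecutorInfos.length) // now more than one entry (driver + executors)
sc.stop()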
spark.executor.instances is not honoured in local mode.
Reference - https://books.japila.pl/apache-spark-internals/local/?h=local
Local-Mode: In this non-distributed single-JVM deployment mode, Spark spawns all the execution components - driver, executor, LocalSchedulerBackend, and master - in the same single JVM. The default parallelism is the number of threads as specified in the master URL. This is the only mode where a driver is used for execution.
So you can increase the number of threads in the JVM to n by passing the master URL as local[n].

Spark-Submit execution time

I have developed a Scala program on Spark which connects to a MySQL database to pull about 250K records and process them. When I execute the application from the IDE itself (IntelliJ) it takes about 1 minute to complete the job, whereas if I submit it through spark-submit from my terminal it takes 4 minutes.
Scala Code
val sparkSession = SparkSession.builder()
  .appName("credithistory")
  .master("local[*]")
  .getOrCreate()
From Terminal
spark-submit --master local[*] .....
Are there any changes I should make, or is this normal behaviour? I have local[*] in the code and I'm also supplying it from the terminal.
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
This is from the Spark documentation page (link).
You can adjust the number K, for example "local[4]" or "local[8]", according to your CPU's capabilities.

Spark mesos cluster mode is slower than local mode

I submit the same jar to run using both local mode and Mesos cluster mode, and found that for some identical stages, local mode takes only several milliseconds to finish whereas cluster mode takes seconds!
Listed below is one example: stage 659.
local mode:
659
Streaming job from [output operation 1, batch time 17:45:50]
map at KafkaHelper.scala:35 +details
2016/03/22 17:46:31 11 ms
mesos cluster mode:
659
Streaming job from [output operation 1, batch time 18:01:20]
map at KafkaHelper.scala:35 +details
2016/03/22 18:09:33 3 s
And I found from the Spark UI that Mesos cluster mode consistently takes 4 seconds to finish the foreachRDD jobs. Why is that? Are there any submit command options that can help with this?
Thanks a bunch in advance!
That behavior depends on multiple factors. You don't specify what kind of job you run, in which cluster mode, and with which settings. If Spark is not installed on the slaves, you'll see an overhead because the distribution needs to be downloaded, etc.
Furthermore, the jars you're using need to be distributed to the executors, which can add to the startup time as well.
As said, this all depends on how you run Spark on Mesos.
See
http://spark.apache.org/docs/latest/running-on-mesos.html
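If the overhead really is the distribution download, one thing to try (a sketch only; the master URL, package URI, and paths are placeholders for your environment) is to tell the Mesos executors where to fetch or find Spark:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("mesos-streaming-job")
  .setMaster("mesos://zk://zk-host:2181/mesos")      // placeholder master URL
  // Either let executors fetch a prebuilt Spark package from a fast, nearby source ...
  .set("spark.executor.uri", "hdfs:///packages/spark-x.y.z-bin-hadoop2.7.tgz")
  // ... or, if Spark is pre-installed on every agent, point at that install instead:
  // .set("spark.mesos.executor.home", "/opt/spark")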