How can I submit multiple jobs in a Spark Standalone cluster? - scala

I have a machine with Apache Spark. The machine has 64 GB RAM and 16 cores.
My objective in each Spark job:
1. Download a gz file from a remote server
2. Extract the gz to get a CSV file (1 GB max)
3. Process the CSV file in Spark and save some stats.
Currently I submit one job for each file received, as follows:
./spark-submit --class ClassName --executor-cores 14 --num-executors 3 --driver-memory 4g --executor-memory 4g jar_path
I wait for this job to complete and then start a new job for the next file.
Now I want to utilise the 64 GB of RAM by running multiple jobs in parallel.
I can assign 4g of RAM to each job, and I want to queue new jobs when enough jobs are already running.
How can I achieve this?

You should submit multiple jobs from different threads:
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
and configure pool properties (set schedulingMode to FAIR):
https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
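As a rough sketch of what "multiple jobs from different threads" can look like inside one application, the code below enables FAIR mode and triggers one action per input file from separate threads. The object name `ConcurrentJobs`, the helper `countAll`, the pool name `csvPool`, and the idea of passing the CSV paths as program arguments are all assumptions made for illustration; the master is expected to come from spark-submit.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object ConcurrentJobs {
  // Counts the rows of each CSV file; every Future runs on its own thread,
  // so each count() is scheduled as a separate Spark job.
  def countAll(spark: SparkSession, paths: Seq[String]): Seq[(String, Long)] = {
    val jobs = paths.toList.map { path =>
      Future {
        // Local properties are per thread; "csvPool" is a hypothetical pool name.
        spark.sparkContext.setLocalProperty("spark.scheduler.pool", "csvPool")
        path -> spark.read.option("header", "true").csv(path).count()
      }
    }
    Await.result(Future.sequence(jobs), Duration.Inf)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("concurrent-jobs")
      .config("spark.scheduler.mode", "FAIR") // FIFO is the default within an application
      .getOrCreate()
    try countAll(spark, args.toSeq).foreach { case (p, n) => println(s"$p: $n rows") }
    finally spark.stop()
  }
}
```

A pool that is not declared in the fair scheduler XML file gets default settings, so the pool properties from the second link still need to be configured if specific weights or minimum shares are wanted.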
From the Spark documentation:
https://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications. However, to allow multiple concurrent
users, you can control the maximum number of resources each
application will use. By default, it will acquire all cores in the
cluster, which only makes sense if you just run one application at a
time. You can cap the number of cores by setting spark.cores.max ...
By default, a single job takes all the resources. We need to cap the resources per job so that there is room to run other jobs as well. Below is a command you can use to submit a Spark job:
bin/spark-submit --class classname --master spark://hjvm1:6066 --deploy-mode cluster --driver-memory 500M --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test.jar
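To queue submissions so that only a fixed number run at once, one option is a small launcher that feeds spark-submit commands through a fixed-size thread pool. This is a minimal sketch, not a definitive implementation: `SubmitQueue`, `submitCommand`, `runAll`, the extra input-file argument, and the flag values are hypothetical and would need tuning for the actual machine.

```scala
import java.util.concurrent.{Callable, Executors}
import scala.sys.process.Process

object SubmitQueue {
  // Hypothetical helper: builds the spark-submit command for one input file.
  // Flag values are assumptions; spark.cores.max caps the cores per application.
  def submitCommand(jarPath: String, inputFile: String): Seq[String] = Seq(
    "./spark-submit",
    "--class", "ClassName",
    "--driver-memory", "4g",
    "--executor-memory", "4g",
    "--conf", "spark.cores.max=1",
    jarPath, inputFile
  )

  // Runs at most maxParallel commands at a time; the others wait in the pool's queue.
  // `run` executes one command and returns its exit code.
  def runAll(commands: Seq[Seq[String]],
             maxParallel: Int,
             run: Seq[String] => Int = cmd => Process(cmd).!): Seq[Int] = {
    val pool = Executors.newFixedThreadPool(maxParallel)
    try {
      val futures = commands.toList.map { cmd =>
        pool.submit(new Callable[Int] { def call(): Int = run(cmd) })
      }
      futures.map(_.get()) // blocks until every submission has finished
    } finally pool.shutdown()
  }
}
```

For example, `SubmitQueue.runAll(files.map(f => SubmitQueue.submitCommand("jar_path", f)), maxParallel = 4)` would keep four submissions in flight and hold the rest back until a slot frees up.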

Related

spark write parquet to HDFS very slow on multi node

My spark-submit runs well with --master local[*],
but when I run it on my multi-node cluster with
--master ip of master:port --deploy-mode client
the app runs fine until it writes to HDFS as Parquet. Then it never finishes: no error messages, nothing, it just keeps running.
I tracked down the blocking part of the app. It is:
resultDataFrame.write.parquet(path)
I also tried:
resultDataFrame.repartition(1).write.parquet(path)
but the result is still the same.
Thank you in advance for the help.
I can see you are using local[*] as the master, which runs the Spark job in local mode and cannot use cluster resources.
If you are running the Spark job on a cluster, look at the spark-submit options, such as setting the master to yarn and the deploy mode to cluster, as in the command below.
spark-submit --class <main-class> --master yarn --deploy-mode cluster \
  --conf <key>=<value> \
  ... # other options
  <application-jar> [application-arguments]
Once you run the Spark job with the yarn master and the cluster deploy mode, it will try to utilize all cluster resources.

Executor is taking more memory than defined

spark-submit --num-executors 10 --executor-memory 5g --master yarn --executor-cores 3 --class com.octro.hbase.hbase_final /home/hadoop/testDir/nikunj/Hbase_data_maker/target/Hbase_data_maker-0.0.1-SNAPSHOT-jar-with-dependencies.jar main_user_profile
This is my command to execute my spark code on the cluster.
With this command, my YARN page shows the total memory allocated as 71 GB.
I searched the internet for the reason but didn't find a clear explanation.
Later I figured out that it uses the formula
No. of executors * (executor memory + 2) + 1
that is, 10 * (5 + 2) + 1 = 71.
The plus 1 is for the main container. But why the extra 2 GB by default?
It was because of the 2 GB memory overhead specified in Spark's configuration file.
That's why each executor was taking 2 GB more.
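The 71 GB figure can be checked with a little arithmetic: each of the 10 executors takes its 5 GB plus the 2 GB overhead, and 1 GB more goes to the main container. The sketch below only restates that calculation; `YarnMemory` and `totalGb` are hypothetical names, and the overhead and main-container sizes are the values reported in this post, not general defaults.

```scala
object YarnMemory {
  // Total (GB) = executors * (executor memory + per-executor overhead) + main container.
  def totalGb(numExecutors: Int, executorMemoryGb: Int, overheadGb: Int, mainContainerGb: Int): Int =
    numExecutors * (executorMemoryGb + overheadGb) + mainContainerGb

  def main(args: Array[String]): Unit =
    // 10 * (5 + 2) + 1
    println(totalGb(numExecutors = 10, executorMemoryGb = 5, overheadGb = 2, mainContainerGb = 1)) // prints 71
}
```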

Spark-Submit execution time

I have developed a Scala program on Spark that connects to a MySQL database, pulls about 250K records, and processes them. When I execute the application from the IDE (IntelliJ) it takes about 1 minute to complete, whereas if I submit it through spark-submit from my terminal it takes 4 minutes.
Scala Code
val sparkSession = SparkSession.builder()
  .appName("credithistory")
  .master("local[*]")
  .getOrCreate()
From Terminal
spark-submit --master local[*] .....
Do I have to change anything, or is this normal behaviour? I set local[*] in the code and also supply it from the terminal.
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
This is from the Spark documentation.
You can adjust the value of K,
for example local[4] or local[8], to match your CPU.
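One way to choose K is to cap the requested thread count at the number of cores the JVM reports. This is only a sketch under that assumption; `LocalMaster` and `masterUrl` are hypothetical names, not part of Spark.

```scala
object LocalMaster {
  // Builds a local[K] master URL with K clamped between 1 and the available core count.
  def masterUrl(requested: Int,
                available: Int = Runtime.getRuntime.availableProcessors()): String = {
    val k = math.max(1, math.min(requested, available))
    s"local[$k]"
  }
}
```

The resulting string can be passed to `--master` or to `.master(...)` on the session builder, for example `LocalMaster.masterUrl(8)`.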

Number of Executors in Spark Local Mode

So I am running a spark job in local mode.
I use the following command to run the job
spark-submit --master local[*] --driver-memory 256g --class main.scala.mainClass target/scala-2.10/spark_proj-assembly-1.0.jar 0 large.csv 100 outputFolder2 10
I am running this on a machine with 32 cores and 256 GB RAM. When creating the conf I use the following code:
val conf = new SparkConf().setMaster("local[*]").setAppName("My App")
Now I know that in local mode Spark runs everything inside a single JVM, but does that mean it launches only one driver and uses it as the executor as well? My timeline shows one executor, the driver, added.
And when I go to the Executors page, there is just one executor with 32 cores assigned to it.
Is this the default behavior? I was expecting Spark to launch one executor per core instead of a single executor that gets all the cores. If someone can explain the behavior, that would be great.
Is this the default behavior?
In local mode, your driver and executors are, as you've said, created inside a single JVM process. What you see isn't a separate executor; it is a view of how many cores your job has at its disposal. Usually, when running in local mode, you should see only the driver in the Executors view.
If you look at the code for LocalSchedulerBackend, you'll see the following comment:
/**
 * Used when running a local version of Spark where the executor, backend, and master all run in
 * the same JVM. It sits behind a [[TaskSchedulerImpl]] and handles launching tasks on a single
 * Executor (created by the [[LocalSchedulerBackend]]) running locally.
 */
We have a single executor, in the same JVM instance, which handles all tasks.

Spark repartition() function increases number of tasks per executor, how to increase number of executors

I'm working on an IBM server with 30 GB RAM (12 cores). I have given all the cores to Spark, but it still uses only 1 core. When loading the file, I succeeded with the command
val name_db_rdd = sc.textFile("input_file.csv",12)
and was able to give all 12 cores to the initial jobs, but I also want the intermediate operations to be split across the executors so that they use all 12 cores.
Image - description
val new_rdd = rdd.repartition(12)
As you can see in this image, only 1 executor is running, and the repartition function splits the data into many tasks on that one executor.
It depends on how you're launching the job, but you probably want to add --num-executors to your command line when you launch your Spark job.
Something like
spark-submit \
--num-executors 10 \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 1 \
might work well for you.
Have a look at Running Spark on YARN for more details, though some of the switches mentioned there are YARN-specific.