Spark-Submit execution time - scala

I have developed a Scala program on Spark which connects to a MySQL database to pull about 250K records and process them. When I execute the application from the IDE itself (IntelliJ) it takes about 1 minute to complete the job, whereas if I submit it through spark-submit from my terminal it takes 4 minutes.
Scala Code
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("credithistory")
  .master("local[*]")
  .getOrCreate()
From Terminal
spark-submit --master local[*] .....
Are there any changes I should make, or is this normal behaviour? I have local[*] in the code, and I am also supplying it from the terminal.

local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
This is from the Spark documentation (link).
You can adjust the value of K, for example local[4] or local[8], according to your CPU.
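For illustration, a minimal sketch of pinning K explicitly instead of using local[*] (the value 8 here is only an assumption; match it to your machine's core count):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("credithistory")
  .master("local[8]") // K = 8 worker threads instead of local[*]
  .getOrCreate()

Also note that a master set programmatically in the code takes precedence over the --master flag passed to spark-submit.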

Related

Writing using Pyspark and MongoDB Spark Connector gets stuck on Databricks

I'm using the MongoDB Spark Connector (2.12:3.0.1) to write data when running a Databricks (runtime 9.1 LTS ML, Spark 3.1.2, Scala 2.12) job from a notebook using PySpark. I'm able to run the job successfully when sampling a smaller number of rows, but when I run at full scale (180M rows) the job seems to get stuck after roughly 1.5 hours without throwing any error.
To clarify: the Spark process stays alive, but nothing seems to happen on either the Spark nodes or the MongoDB side. On the MongoDB side I see writes drop to 0, while at the same time Ganglia shows node utilization dropping to near 0. The job in the Spark UI is in a running state, with just a few last tasks still shown as running, though nothing is progressing.
My initial code:
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite") \
.option("database", database) \
.option("collection", destination_collection).save()
After investigating a bit, it seemed like the root cause could be timeouts occurring on the MongoDB side, possibly related to the writeConcern (w/wTimeoutMS), so I added the following options to test whether I would get an exception with a very small allowed timeout. I didn't, so I guess the option is not being applied correctly.
My refactored code:
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite") \
.option("database", database) \
.option("collection", destination_collection) \
.option("writeConcern.w", 2) \
.option("writeConcern.wTimeoutMS", 1).save()
Has anyone else encountered this issue and found a proper solution?

How Can I submit multiple jobs in Spark Standalone cluster?

I have a machine with Apache Spark. The machine has 64GB RAM and 16 cores.
My Objective in each spark job
1. Download a gz file from a remote server
2. Extract gz to get csv file (1GB max)
3. Process csv file in spark and save some stats.
Currently I am submitting one job for each file received by doing the following:
./spark-submit --class ClassName --executor-cores 14 --num-executors 3 --driver-memory 4g --executor-memory 4g jar_path
Then I wait for this job to complete before starting a new job for the next file.
Now I want to utilise the 64GB of RAM by running multiple jobs in parallel.
I can assign 4g of RAM to each job, and I want my jobs to be queued when enough jobs are already running.
How Can I achieve this?
You should submit multiple jobs from different threads:
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
and configure pool properties (set schedulingMode to FAIR):
https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
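A minimal sketch of that approach, assuming spark.scheduler.mode is set to FAIR; the file paths and the count() action are only illustrative stand-ins for your real processing:

import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val spark = SparkSession.builder()
  .appName("parallel-file-jobs")
  .config("spark.scheduler.mode", "FAIR") // fair scheduling within this application
  .getOrCreate()

// Each file becomes an independent Spark job launched from its own thread,
// so the scheduler can interleave their tasks instead of running them strictly FIFO.
val files = Seq("file1.csv", "file2.csv", "file3.csv")
val jobs = files.map { path =>
  Future {
    spark.read.option("header", "true").csv(path).count()
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)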
From Spark Doc:
https://spark.apache.org/docs/latest/spark-standalone.html#resource-scheduling:
The standalone cluster mode currently only supports a simple FIFO
scheduler across applications. However, to allow multiple concurrent
users, you can control the maximum number of resources each
application will use. By default, it will acquire all cores in the
cluster, which only makes sense if you just run one application at a
time. You can cap the number of cores by setting spark.cores.max ...
By default, Spark uses all the resources for a single job. We need to limit the resources so that there is room to run other jobs as well. Below is a command you can use to submit a Spark job:
bin/spark-submit --class classname --master spark://hjvm1:6066 --deploy-mode cluster --driver-memory 500M --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test.jar

Number of Executors in Spark Local Mode

So I am running a Spark job in local mode.
I use the following command to run the job:
spark-submit --master local[*] --driver-memory 256g --class main.scala.mainClass target/scala-2.10/spark_proj-assembly-1.0.jar 0 large.csv 100 outputFolder2 10
I am running this on a machine with 32 cores and 256GB RAM. When creating the conf I use the following code:
val conf = new SparkConf().setMaster("local[*]").setAppName("My App")
Now, I know that in local mode Spark runs everything inside a single JVM, but does that mean it launches only one driver and uses it as the executor as well? In my timeline it shows one executor, "driver", added.
And when I go to the Executors page, there is just one executor with 32 cores assigned to it.
Is this the default behavior? I was expecting Spark to launch one executor per core instead of just one executor that gets all the cores. If someone can explain this behavior, that would be great.
Is this the default behavior?
In local mode, your driver + executors are, as you've said, created inside a single JVM process. What you see isn't a separate executor, it is a view of how many cores your job has at its disposal. Usually when running under local mode, you should only be seeing the driver in the executors view.
If you look at the code for LocalSchedulerBackend, you'll see the following comment:
/**
 * Used when running a local version of Spark where the executor, backend, and master all run in
 * the same JVM. It sits behind a [[TaskSchedulerImpl]] and handles launching tasks on a single
 * Executor (created by the [[LocalSchedulerBackend]]) running locally.
 */
That is, we have a single executor, running in the same JVM, which handles all tasks.
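A minimal sketch to observe this (assuming a Spark 2.x SparkSession; with the 1.x SparkConf shown in the question the same checks work on the SparkContext directly): the default parallelism follows the thread count, while only the driver shows up as an executor.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-mode-check")
  .getOrCreate()
val sc = spark.sparkContext

println(sc.defaultParallelism)           // number of local worker threads (e.g. 32)
println(sc.getExecutorMemoryStatus.size) // 1: the driver JVM is the only "executor"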

A master url must be set to your configuration (Spark scala on AWS)

This is what I wrote via IntelliJ. I plan on eventually writing larger Spark Scala files.
Anyway, I uploaded it to an AWS cluster that I had made. The "master" line, line 11, was master("local"). I ran into this error.
The second picture is the error that was returned by AWS when the job did not run successfully. I changed line 11 to "yarn" instead of local (see the first picture for its current state).
It is still returning the same error. I put in the following flags when I uploaded it manually:
--steps Type=CUSTOM_JAR,Name="SimpleApp"
It worked two weeks ago. My friend did almost exactly the same thing as me. I am not sure why it isn't working.
I am looking for both a brief explanation and an answer. It looks like I need a little more knowledge of how Spark works.
I am working with Amazon EMR.
I think on line 9 you are creating a SparkContext with the "old way" approach from Spark 1.6.x and older versions - you need to set the master in the default configuration file (usually conf/spark-defaults.conf) or pass it to spark-submit (it is required for new SparkConf()).
On line 10 you are creating a "spark" context with SparkSession, which is the approach since Spark 2.0.0. So in my opinion your problem is line 9: I think you should remove it and work with SparkSession, or set the required configuration for the SparkContext in case you need sc.
You can access the SparkContext with sparkSession.sparkContext.
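As a minimal sketch of that approach (the app name is illustrative): build only a SparkSession, leave the master to spark-submit or conf/spark-defaults.conf, and take the SparkContext from the session where you need sc.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SimpleApp")
  .getOrCreate()

val sc = spark.sparkContext // replaces the separately constructed SparkContext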
If you still want to use SparkConf, you need to define the master programmatically:
val sparkConf = new SparkConf()
  .setAppName("spark-application-name")
  .setMaster("local[4]")
  .set("spark.executor.memory", "512m")
or with a declarative approach in conf/spark-defaults.conf:
spark.master local[4]
spark.executor.memory 512m
or simply at runtime:
./bin/spark-submit --name "spark-application-name" --master local[4] --executor-memory 512m your-spark-job.jar
Try using the below code:
val spark = SparkSession.builder().master("spark://ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:xxxx").appName("example").getOrCreate()
You need to provide the proper URL for your AWS cluster.

Spark : check your cluster UI to ensure that workers are registered

I have a simple program in Spark:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("spark://10.250.7.117:7077").setAppName("Simple Application").set("spark.cores.max", "2")
    val sc = new SparkContext(conf)
    val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")

    // first get the first 10 records
    println("Getting the first 10 records: ")
    ratingsFile.take(10)

    // get the number of records in the movie ratings file
    println("The number of records in the movie list are : ")
    ratingsFile.count()
  }
}
When I try to run this program from the spark-shell, i.e. I log into the name node (Cloudera installation) and run the commands sequentially in the spark-shell:
val ratingsFile = sc.textFile("hdfs://hostname:8020/user/hdfs/mydata/movieLens/ds_small/ratings.csv")
println("Getting the first 10 records: ")
ratingsFile.take(10)
println("The number of records in the movie list are : ")
ratingsFile.count()
I get correct results, but if I try to run the program from Eclipse, no resources are assigned to the program, and in the console log all I see is:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Also, in the Spark UI, I see this:
(screenshot: the job keeps running)
Also, it should be noted that this version of spark was installed with Cloudera (hence no worker nodes show up).
What should I do to make this work?
EDIT:
I checked the HistoryServer and these jobs don't show up there (even in incomplete applications)
I have done configuration and performance tuning for many Spark clusters, and this is a very common/normal message to see when you are first prepping/configuring a cluster to handle your workloads.
This is unequivocally due to insufficient resources for the job to be launched. The job is requesting one of:
more memory per worker than is allocated to it (1GB)
more CPUs than are available on the cluster
Finally figured out what the answer is.
When deploying a spark program on a YARN cluster, the master URL is just yarn.
So in the program, the Spark context should just look like:
val conf = new SparkConf().setAppName("SimpleApp")
Then this Eclipse project should be built using Maven, and the generated jar should be deployed to the cluster by copying it over and then running the following command:
spark-submit --master yarn --class "SimpleApp" Recommender_2-0.0.1-SNAPSHOT.jar
This means that running directly from Eclipse would not work.
You can check your cluster's worker node cores; your application can't exceed that. For example, suppose you have two worker nodes with 4 cores each, and you have 2 applications to run. Then you can give each application 4 cores to run its job.
You can set it like this in the code:
SparkConf sparkConf = new SparkConf().setAppName("JianSheJieDuan")
    .set("spark.cores.max", "4");
It works for me.
There are also some other causes of this same error message besides those posted here.
For a Spark-on-Mesos cluster, make sure you have Java 8 or a newer Java version on the Mesos slaves.
For Spark standalone, make sure you have Java 8 (or newer) on the workers.
You don't have any workers to execute the job. There are no cores available for the job to execute, and that's the reason the job's state is still 'Waiting'.
If you have no workers registered with Cloudera, how will the jobs execute?