SparkContext params for pyspark application on cluster - pyspark

I'm trying to run a pyspark application on a cluster and am not sure how to parallelize its execution. When I run the application locally, I initialize the SparkContext as:
sc = SparkContext("local", "appname")
When I run on the cluster, I changed this to:
sc = SparkContext(os.environ['MASTER'], 'appname')
where 'MASTER' is an environment variable set to the master node of the cluster (e.g. spark://node-1:7077). The application starts to run, but then it just stalls (it runs fine on the cluster when I set master to 'local'). My submission script has the following settings:
#SBATCH -N 20
#SBATCH --ntasks-per-node 4
#SBATCH --cpus-per-task 2
...
spark-submit --total-executor-cores 160 --executor-memory 1024G app.py
Any help would be greatly appreciated. Thanks
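One way to keep the same script usable both locally and on the cluster is to read the master URL from the environment and fall back to local mode when it is not set. This is a minimal sketch, assuming the submission script exports a MASTER variable as described in the question; resolve_master is a hypothetical helper, not part of PySpark.

```python
import os


def resolve_master(env=None, default="local[*]"):
    """Return the master URL from the MASTER variable, or a local default.

    `env` is any mapping; it defaults to os.environ. The helper name and the
    "local[*]" fallback are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    master = env.get("MASTER", "").strip()
    return master or default


if __name__ == "__main__":
    # In the application this value would be passed on, for example:
    #   sc = SparkContext(resolve_master(), "appname")
    print(resolve_master({"MASTER": "spark://node-1:7077"}))
    print(resolve_master({}))
```

If the master is instead passed with spark-submit --master, the master argument can be left out of the SparkContext call so that the value from the command line is used.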

Related

Pyspark submit master yarn cluster deploy - logs location

I submitted a pyspark job with the spark-submit command on a Hadoop cluster. The command is as follows:
spark-submit --master yarn --deploy-mode cluster --driver-memory 1g --num-executors 2 --executor-memory 1g --executor-cores 2 --py-files module_stm_extracts.py,module_table_compare.py datacheck.py
The job completed, but I never got the application id in the console.
How do I find the application log, so that I can review it?
You can find it in the YARN Resource Manager web UI, which by default is accessible on port 8088 of the master node: http://<master_node_ip>:8088
Or you can list the applications from the command line:
yarn application -list -appStates ALL
Then, with the application ID, get the log with the following command:
yarn logs -applicationId <application_id>
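The lookup above can be scripted. The sketch below assumes only that YARN application IDs have the form application_<digits>_<digits>; it pulls such IDs out of any captured text (console output, a saved listing) and builds the matching yarn logs command. The helper names and the sample line are made up for illustration.

```python
import re

# YARN application IDs look like application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def find_application_ids(text):
    """Return the distinct application IDs found in text, in order of appearance."""
    seen = []
    for match in APP_ID_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def yarn_logs_command(application_id):
    """Build the argument list for fetching the aggregated logs of one application."""
    return ["yarn", "logs", "-applicationId", application_id]


if __name__ == "__main__":
    # Hypothetical sample line, not real cluster output.
    sample = "Submitted application application_1500000000000_0042 to the cluster"
    for app_id in find_application_ids(sample):
        print(" ".join(yarn_logs_command(app_id)))
```

The returned list can be handed to subprocess.run on a machine where the yarn client is installed.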

How to pass external configuration file to pyspark(Spark 2.x) program?

When I run my pyspark program in the interactive shell, the script is able to read the configuration file (config.ini).
But when I run the same script using the spark-submit command with master yarn and deploy mode cluster, it fails with an error saying the config file does not exist. I checked the YARN log and see the same error there. Below is the command for running the pyspark job.
spark2-submit --master yarn --deploy-mode cluster test.py /home/sys_user/ask/conf/config.ini
The spark2-submit command provides a --properties-file parameter; you can use it to make the properties file available to the job.
e.g. spark2-submit --master yarn --deploy-mode cluster --properties-file $CONF_FILE_NAME pyspark_script.py
Pass the ini file in the spark.files parameter:
.config('spark.files', 'config/local/config.ini') \
Then read it in pyspark:
from pyspark import SparkFiles
with open(SparkFiles.get('config.ini')) as config_file:
    print(config_file.read())
It works for me.
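In yarn cluster mode the driver runs on a cluster node, so a path that exists on the submitting machine may not exist there. Files shipped with the job (for example through --files) are typically placed in the container's working directory under their base name. The sketch below is one way to cope with both situations using only the standard library; load_ini is a hypothetical helper, and the fallback to the base name is an assumption about where the shipped file ends up.

```python
import configparser
import os


def load_ini(path):
    """Parse an INI file, trying the given path first and then its base name.

    The second candidate covers the case where the file was shipped with the
    job and now sits in the current working directory.
    """
    candidates = [path, os.path.basename(path)]
    for candidate in candidates:
        if os.path.isfile(candidate):
            parser = configparser.ConfigParser()
            parser.read(candidate)
            return parser
    raise FileNotFoundError("config file not found, tried: %s" % candidates)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py /home/sys_user/ask/conf/config.ini
    if len(sys.argv) > 1:
        config = load_ini(sys.argv[1])
        print(config.sections())
```

With this approach the script keeps accepting the original path as its argument, and the same code works in the interactive shell and on the cluster.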

Setting spark.app.name for PySpark kernel with Jupyter Notebook

I am running a Jupyter Notebook server with PySpark (as explained here) on a Hadoop cluster with YARN. I noticed that each Spark application launched via a new notebook appears in the Spark Web UI as an application named "PySparkShell" (which corresponds to the "spark.app.name" configuration).
My problem is that I sometimes have many notebooks running in Jupyter, but all of them appear in Spark's Web UI with the same generic name of "PySparkShell". I know I can change the default name to something else, and I also know that I cannot change the app name once a SparkContext has been created. My question is: can I make it so that each application is given a different name when the kernel starts? (Preferably something that will help me connect the notebook name, e.g. 'Untitled.ipynb', to its Spark application name or ID.)
UPDATE: added a code snippet of my run command for the notebook
export DAEMON_PORT=8880
ANACONDA_PATH=/opt/cloudera/parcels/Anaconda/bin
export PATH=$ANACONDA_PATH:$PATH
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=$DAEMON_PORT"
pyspark2 \
--executor-memory 5g \
--executor-cores 4 \
--driver-memory 20g \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=0 \
--conf spark.dynamicAllocation.maxExecutors=40
In the first few lines, where you create your SparkContext(), you can include a config object. You can use the config object to set various settings by chaining set('property_name', 'property_value') calls.
I'll demonstrate by also setting the executor memory:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('Your_Project_name').set("spark.executor.memory", "5g")
sc = SparkContext(conf=conf)
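To tell several notebooks apart in the web UI, the name passed to setAppName can be derived from the notebook's file name. The sketch below only builds such a name; make_app_name, the "jupyter" prefix, and the timestamp format are all assumptions for illustration, and the notebook name has to be supplied by hand because this sketch does not try to discover it.

```python
import re
import time


def make_app_name(notebook_name, prefix="jupyter", now=None):
    """Build an application name such as jupyter-Untitled-20240101-120000.

    `now` is seconds since the epoch; None means the current time.
    """
    stem = re.sub(r"\.ipynb$", "", notebook_name)
    # Replace characters that are awkward in UIs and log paths.
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", stem).strip("_") or "notebook"
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    return "%s-%s-%s" % (prefix, safe, stamp)


if __name__ == "__main__":
    print(make_app_name("Untitled.ipynb"))
    # With PySpark available, the name would be applied before the context
    # is created, for example:
    #   conf = SparkConf().setAppName(make_app_name("Untitled.ipynb"))
    #   sc = SparkContext(conf=conf)
```

As the question notes, the name cannot be changed once a SparkContext exists, so this only helps when the notebook creates the context itself.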

Pushing my application.config file to my spark job worker nodes

My spark job is failing and it looks like the reason is that my configuration file is not found on the worker node.
My config file is currently in:
/src/main/resources/application.conf
I copied the file to the root folder where I run the spark-submit command and I did this:
spark-submit --class "com.path.to.main.MainClass" --master local[*] --files application.conf /path/to/jar.jar
That didn't seem to work either as I got the same error.
What am I doing wrong?

Running application doesn't appear in the Spark web UI, but it runs

I need your help. I created 2 apps (one using the Spray framework, and another that receives messages from Kafka and sends them to Cassandra).
Both run all the time and should never stop.
I'm running in standalone mode on the server and my configuration is:
- In spark-env.sh:
SPARK_MASTER_IP=MYIP
SPARK_EXECUTOR_CORES=2
SPARK_MASTER_PORT=7077
SPARK_EXECUTOR_MEMORY=4g
#SPARK_WORKER_PORT=65000
MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
SPARK_LOCAL_IP=MYIP
SPARK_MASTER_WEBUI_PORT=8080
- In spark-defaults.conf:
spark.master spark://MYIPMASTER:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark-1.6.1-bin-hadoop2.6/spark-events
spark.history.fs.logDirectory /opt/spark-1.6.1-bin-hadoop2.6/logs
spark.io.compression.codec lzf
spark.cassandra.connection.host MYIPMASTER
spark.cassandra.auth.username LOGIN
spark.cassandra.auth.password PASSWORD
I can access both pages:
MYIP:8080/ and MYIP:4040/
But on http://MYIP:8080/ I see only my workers; I can't see my running application.
When I submit, I use this:
/opt/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class MYCLASS --verbose --conf spark.eventLog.enable=true --conf spark.master.ui.port=8080 --master local[2] /opt/spark-1.6.1-bin-hadoop2.6/jars/MYJAR.jar
Why?
Could you help me?
Thanks a lot :)
In your spark-submit command you are passing --master local[2], which submits the application in local mode. If you want to run it on your standalone cluster, you should pass the Spark master URL in the master option, i.e. --master spark://MYIPMASTER:7077
For the master, spark-submit honors the settings in the following order of precedence (highest first):
1. The master URL set in your application code, i.e. SparkSession.builder().master("...")
2. The --master parameter of the spark-submit command
3. The default configuration in your spark-defaults.conf
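The precedence described above can be modelled in a few lines. This is only an illustration of the ordering, not Spark's own code; effective_master is a hypothetical helper and each argument stands for one of the three places where the master can be set.

```python
def effective_master(code_master=None, cli_master=None, defaults_master=None):
    """Return the first master URL that is set, in order of precedence.

    code_master:     value set in application code
    cli_master:      value of spark-submit --master
    defaults_master: spark.master from spark-defaults.conf
    """
    for candidate in (code_master, cli_master, defaults_master):
        if candidate:
            return candidate
    return None


if __name__ == "__main__":
    # The situation in the question: local[2] on the command line wins over
    # the standalone URL in spark-defaults.conf.
    print(effective_master(cli_master="local[2]",
                           defaults_master="spark://MYIPMASTER:7077"))
```

In the question's case the command-line value local[2] takes effect, which is consistent with the application not showing up on the standalone master's page.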
Mode: Standalone cluster
1> bin/spark-submit --class com.deepak.spark.App ../spark-0.0.2-SNAPSHOT.jar --master spark://172.29.44.63:7077
This did not work, because --master was specified after the jar.
2> bin/spark-submit --class com.deepak.spark.App --master spark://172.29.44.63:7077 ../spark-0.0.2-SNAPSHOT.jar
This worked.
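The difference between the two commands comes down to position: everything after the application jar is handed to the application as its own arguments instead of being read by spark-submit. The sketch below is a simplified model of that split, not spark-submit's real parser. It assumes the application file is the first non-option token ending in .jar or .py, so it would misfire on option values with those suffixes (for example the value of --py-files); split_submit_args is a hypothetical helper.

```python
def split_submit_args(args, suffixes=(".jar", ".py")):
    """Split an argument list into (submit options, application file, app args).

    Simplification: the application file is taken to be the first token that
    does not start with "-" and ends with one of the given suffixes.
    """
    for i, token in enumerate(args):
        if not token.startswith("-") and token.endswith(suffixes):
            return args[:i], token, args[i + 1:]
    return args, None, []


if __name__ == "__main__":
    wrong = ("--class com.deepak.spark.App ../spark-0.0.2-SNAPSHOT.jar "
             "--master spark://172.29.44.63:7077").split()
    right = ("--class com.deepak.spark.App --master spark://172.29.44.63:7077 "
             "../spark-0.0.2-SNAPSHOT.jar").split()
    for label, args in (("1>", wrong), ("2>", right)):
        options, app, app_args = split_submit_args(args)
        print(label, "options:", options, "app:", app, "app args:", app_args)
```

For the first command the --master flag ends up among the application arguments, so it never reaches spark-submit.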