Setting spark.app.name for PySpark kernel with Jupyter Notebook

I am running a Jupyter Notebook server with PySpark (as explained here) on a Hadoop cluster with YARN. I noticed that each Spark application launched via a new notebook appears in the Spark Web UI as an application named "PySparkShell" (which corresponds to the "spark.app.name" configuration).
My problem is that I sometimes have many notebooks running in Jupyter, but all of them appear in Spark's Web UI with the same generic name, "PySparkShell". I know I can change the default name to something else, and I also know that I cannot change the app name once a SparkContext has been created. My question is: can I make it so that each application is given a different name when the kernel starts? (Preferably something that will help me connect the notebook name, e.g. 'Untitled.ipynb', to its Spark application name or ID.)
UPDATE: added a code snippet of my run command for the notebook
export DAEMON_PORT=8880
ANACONDA_PATH=/opt/cloudera/parcels/Anaconda/bin
export PATH=$ANACONDA_PATH:$PATH
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=$DAEMON_PORT"
pyspark2 \
--executor-memory 5g \
--executor-cores 4 \
--driver-memory 20g \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=0 \
--conf spark.dynamicAllocation.maxExecutors=40

In the first few lines, where you create your SparkContext(), you can pass in a config object. You can use the config object to apply various settings by chaining set('property_name', 'property_value') calls.
I'll demonstrate by setting the app name and the executor memory:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('Your_Project_name').set("spark.executor.memory", "5g")
sc = SparkContext(conf=conf)  # pass conf by keyword; the first positional argument is the master URL
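One way to tie each application to its notebook is to derive the app name from the notebook's file name and recreate the context under that name. The following is only a minimal sketch under stated assumptions: build_app_name and recreate_context are hypothetical helper names, the "user:notebook" naming scheme is just an example, the notebook path is typed in by hand (the sketch does not try to discover it), and it assumes the kernel was started through pyspark so that a context named sc already exists and has to be stopped first.

```python
import getpass
import os


def build_app_name(notebook_path, user=None):
    """Build an app name such as 'alice:Untitled' from a notebook path.

    Hypothetical helper: the 'user:notebook' scheme is only an example.
    """
    notebook = os.path.splitext(os.path.basename(notebook_path))[0]
    return "{}:{}".format(user or getpass.getuser(), notebook)


def recreate_context(app_name, old_sc=None):
    """Stop the existing SparkContext (if any) and start one named app_name."""
    # Imported lazily so build_app_name stays usable without PySpark installed.
    from pyspark import SparkConf, SparkContext

    if old_sc is not None:
        old_sc.stop()  # the name cannot be changed on a live context
    conf = SparkConf().setAppName(app_name)
    return SparkContext(conf=conf)


# In the first notebook cell (assumes `sc` was created by the pyspark launcher):
# sc = recreate_context(build_app_name("Untitled.ipynb"), old_sc=sc)
```

Since the old context is stopped, this should be done before any other work in the notebook.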

Related

How can I run uncompiled Spark Scala/spark-shell code as a Dataproc job?

Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL, where I assume a SparkContext is already available.
For some of these lightweight use cases I can equivalently use PySpark and submit with gcloud dataproc jobs submit pyspark but sometimes I need easier access to Scala/Java libraries such as directly creating a org.apache.hadoop.fs.FileSystem object inside of map functions. Is there any easy way to submit such "spark-shell" equivalent jobs directly from a command-line using Dataproc Jobs APIs?
At the moment, there isn't a specialized top-level Dataproc Job type for uncompiled Spark Scala, but under the hood, spark-shell is just using the same mechanisms as spark-submit to run a specialized REPL driver: org.apache.spark.repl.Main. Thus, combining this with the --files flag available in gcloud dataproc jobs submit spark, you can just write snippets of Scala that you may have tested in a spark-shell or notebook session, and run that as your entire Dataproc job, assuming job.scala is a local file on your machine:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files job.scala \
-- -i job.scala
Just like any other file, you can also specify any Hadoop-compatible path in the --files argument as well, such as gs:// or even hdfs://, assuming you've already placed your job.scala file there:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files gs://${BUCKET}/job.scala \
-- -i job.scala
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files hdfs:///tmp/job.scala \
-- -i job.scala
If you've staged your job file onto the Dataproc master node via an init action, you'd use file:/// to specify that the file is found on the cluster's local filesystem instead of your local filesystem where you're running gcloud:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
--class org.apache.spark.repl.Main \
--files file:///tmp/job.scala \
-- -i job.scala
Note in all cases, the file becomes a local file in the working-directory of the main driver job, so the argument to "-i" can just be a relative path to the filename.
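If you submit such jobs often, the commands above can be assembled programmatically. The following is a rough sketch only: repl_job_command is a hypothetical helper, and the cluster and bucket names are placeholders. It builds the same argument list as the gcloud invocations above and derives the "-i" argument from the last path component of whatever was given to --files.

```python
import posixpath
import shlex


def repl_job_command(cluster, script_uri):
    """Return the gcloud argument list that runs script_uri through the REPL driver."""
    # --files stages the script into the driver's working directory,
    # so -i only needs the bare file name.
    script_name = posixpath.basename(script_uri)
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        "--cluster", cluster,
        "--class", "org.apache.spark.repl.Main",
        "--files", script_uri,
        "--", "-i", script_name,
    ]


# Hypothetical cluster and bucket names, for illustration only.
for uri in ["job.scala", "gs://my-bucket/job.scala", "file:///tmp/job.scala"]:
    print(shlex.join(repl_job_command("my-cluster", uri)))
```

The list can be passed to subprocess.run as-is; shlex.join is used here only to print a copy-pasteable command line.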

scala spark to read file from hdfs cluster

I am learning to develop Spark applications using Scala, and I am taking my very first steps.
I have my Scala IDE on Windows; it is configured and runs smoothly when reading files from the local drive. However, I have access to a remote HDFS cluster and Hive database, and I want to develop, try, and test my applications against that Hadoop cluster... but I don't know how :(
If I try
val rdd=sc.textFile("hdfs://masternode:9000/user/hive/warehouse/dwh_db_jrtf.db/discipline")
I will get an error that contains:
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "MyLap/11.22.33.44"; destination host is: "masternode":9000;
Can anyone guide me please ?
You can use SBT to package your code into a .jar file. Copy the file to a cluster node with scp, then submit it there with spark-submit:
spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
You can't access your cluster from your Windows machine that way.

How to pass external configuration file to pyspark(Spark 2.x) program?

When I run the PySpark program in the interactive shell, the script is able to read the configuration file (config.ini).
But when I run the same script using spark-submit with master yarn and deploy mode cluster, it fails with an error saying the config file does not exist. I checked the YARN log and see the same error there. Below is the command I use to run the PySpark job:
spark2-submit --master yarn --deploy-mode cluster test.py /home/sys_user/ask/conf/config.ini
The spark2-submit command provides a --properties-file parameter; you can use it to make the properties file available to the spark-submit job.
e.g. spark2-submit --master yarn --deploy-mode cluster --properties-file $CONF_FILE_NAME pyspark_script.py
Pass the ini file in the spark.files parameter:
.config('spark.files', 'config/local/config.ini') \
Read in pyspark:
from pyspark import SparkFiles
with open(SparkFiles.get('config.ini')) as config_file:
    print(config_file.read())
It works for me.
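In cluster deploy mode the driver runs on a cluster node, so a path that only exists on the submitting machine cannot be opened there; the file has to be shipped with the job and located through SparkFiles. As a minimal sketch of reading the shipped ini file into a parser rather than printing it (parse_config and load_shipped_config are hypothetical helper names, and the section and key names are made up for the example):

```python
import configparser


def parse_config(text):
    """Parse ini-formatted text into a ConfigParser."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def load_shipped_config(file_name):
    """Load an ini file that was distributed with the job via spark.files."""
    # Imported lazily so parse_config can be tried without PySpark installed.
    from pyspark import SparkFiles

    with open(SparkFiles.get(file_name)) as config_file:
        return parse_config(config_file.read())


# The section and key names below are invented for the example.
sample = parse_config("[database]\nhost = db.example.com\nport = 5432\n")
print(sample["database"]["host"], sample.getint("database", "port"))

# Inside the Spark job, once the SparkContext exists:
# config = load_shipped_config("config.ini")
```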

Why is the application name defined in code not taken to display in RUNNING Applications in YARN UI?

This is the relevant part of my Spark application where I set the application's name using appName.
import org.apache.spark.sql.SparkSession
object sample extends App {
val spark = SparkSession.
builder().
appName("Cortex-Batch"). // <-- application name
enableHiveSupport().
getOrCreate()
I check the name of the Spark application in the Hadoop YARN cluster under RUNNING Applications and don't see the name I defined in the code. Why?
I use spark-submit with a property file using --properties-file as follows:
/usr/hdp/current/spark2-client/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.jpmc.cortex.LoadCortexDataLake \
--verbose \
--properties-file /home/e707698/cortex-batch.properties \
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar \
/home/e707698/cortex-data-lake-batch.jar "/tmp/clickfiles1" "cortex_dev.xpo_click1"
Instead, the app name given in property file is taken. I tried to remove the property from the properties file, but then the name is the full class name of the Spark application, i.e. /com/jpmc/cortex/LoadCortexDataLake.
What could I be missing?
--name works. I can now see the name I pass to spark-submit with --name under RUNNING Applications in YARN.
When we run Spark in cluster mode, the YARN application is created before the SparkContext is created, hence we need to pass the app name with --name in the spark-submit command.
In client mode we can set the app name in the program, e.g. SparkSession.builder().appName("Default App Name").
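To make the placement of --name concrete, here is a small sketch that assembles a cluster-mode command line. cluster_submit_command is a hypothetical helper; the jar path, class name and arguments are the ones from the question, and the app name is the one the question sets in code.

```python
import shlex


def cluster_submit_command(app_jar, main_class, app_name, app_args=()):
    """Return spark-submit arguments for YARN cluster mode with an explicit --name."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", app_name,  # set on the command line, before the jar
        "--class", main_class,
        app_jar,
        *app_args,
    ]


command = cluster_submit_command(
    "/home/e707698/cortex-data-lake-batch.jar",
    "com.jpmc.cortex.LoadCortexDataLake",
    "Cortex-Batch",
    ["/tmp/clickfiles1", "cortex_dev.xpo_click1"],
)
print(shlex.join(command))
```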

Running applications don't appear in the Spark web UI, but they run

I need your help. I created 2 apps (one using the Spray framework, and another that receives messages from Kafka and sends them to Cassandra).
Both run all the time and should never stop.
I'm in standalone mode on the server and my conf is:
- In spark_env.sh :
SPARK_MASTER_IP=MYIP
SPARK_EXECUTOR_CORES=2
SPARK_MASTER_PORT=7077
SPARK_EXECUTOR_MEMORY=4g
#SPARK_WORKER_PORT=65000
MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
SPARK_LOCAL_IP=MYIP
SPARK_MASTER_WEBUI_PORT=8080
- In spark-defaults.conf :
spark.master spark://MYIPMASTER:7077
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark-1.6.1-bin-hadoop2.6/spark-events
spark.history.fs.logDirectory /opt/spark-1.6.1-bin-hadoop2.6/logs
spark.io.compression.codec lzf
spark.cassandra.connection.host MYIPMASTER
spark.cassandra.auth.username LOGIN
spark.cassandra.auth.password PASSWORD
I can access both pages:
MYIP:8080/ and MYIP:4040/
But on http://MYIP:8080/ I see only my workers; I can't see my running application.
When I submit, I use this:
/opt/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class MYCLASS --verbose --conf spark.eventLog.enable=true --conf spark.master.ui.port=8080 --master local[2] /opt/spark-1.6.1-bin-hadoop2.6/jars/MYJAR.jar
Why ?
Could you help me?
Thanks a lot :)
In your spark-submit command you are passing --master local[2], which submits the application in local mode. If you want to run it on the standalone cluster that you have running, you should pass the Spark master URL in the master option, i.e. --master spark://MYIPMASTER:7077
As for the master, Spark applies the setting in the following order of precedence, highest first:
The master URL set in your application code, i.e. SparkSession.builder().master("...")
The --master parameter of the spark-submit command
The default configuration in your spark-defaults.conf
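The precedence order above can be pictured as a simple "first value that is set wins" lookup. This is only an illustration of the rule, not Spark's actual code, and resolve_master is a hypothetical name.

```python
def resolve_master(in_code=None, on_command_line=None, in_defaults=None):
    """Return the master URL that takes effect, checking the sources in order."""
    for candidate in (in_code, on_command_line, in_defaults):
        if candidate is not None:
            return candidate
    return None


# The situation from the question: spark-defaults.conf points at the cluster,
# but --master local[2] on the command line overrides it.
print(resolve_master(on_command_line="local[2]",
                     in_defaults="spark://MYIPMASTER:7077"))
```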
Mode: Standalone cluster
1> bin/spark-submit --class com.deepak.spark.App ../spark-0.0.2-SNAPSHOT.jar --master spark://172.29.44.63:7077 did not work, because the master was specified after the jar; everything after the jar is passed to the application as its own arguments.
2> bin/spark-submit --class com.deepak.spark.App --master spark://172.29.44.63:7077 ../spark-0.0.2-SNAPSHOT.jar worked.
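The difference between the two commands can be shown with a simplified model of how the argument list is split. This is a sketch under assumptions, not spark-submit's real parser: split_submit_args is a hypothetical helper, it treats every option as taking one value except for a small assumed set of switches, and it ignores the --option=value form and repeated options.

```python
# Options assumed to take no value in this simplified model.
SWITCHES = {"--verbose", "-v", "--supervise", "--help", "-h", "--version"}


def split_submit_args(args):
    """Split arguments into (launcher options, application jar, application args)."""
    options = {}
    i = 0
    # Options are only recognised until the first non-option token (the jar).
    while i < len(args) and args[i].startswith("-"):
        flag = args[i]
        if flag in SWITCHES:
            options[flag] = True
            i += 1
        else:
            options[flag] = args[i + 1] if i + 1 < len(args) else None
            i += 2
    primary = args[i] if i < len(args) else None
    return options, primary, list(args[i + 1:])


not_working = ["--class", "com.deepak.spark.App",
               "../spark-0.0.2-SNAPSHOT.jar",
               "--master", "spark://172.29.44.63:7077"]
working = ["--class", "com.deepak.spark.App",
           "--master", "spark://172.29.44.63:7077",
           "../spark-0.0.2-SNAPSHOT.jar"]

print(split_submit_args(not_working))  # --master lands in the application args
print(split_submit_args(working))      # --master is seen as a launcher option
```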