How to connect to the spark master remotely using the spark shell? - scala

./dse spark --master spark://sparkMasterIp:port --name "testConnection" --conf "spark.cassandra.connection.host=cassandraHost1, cassandraHost2" --conf spark.app.name=TryShell --conf spark.broadcast.port=54001
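If you are not on DSE, roughly the same thing works with the stock spark-shell; the following is only a sketch (the master URL, host names and the spark-cassandra-connector coordinates are placeholders you would adjust to your own versions):
./bin/spark-shell --master spark://sparkMasterIp:port --name "testConnection" --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 --conf spark.cassandra.connection.host=cassandraHost1,cassandraHost2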

Related

Spark submit truncates arguments in yarn cluster mode

I am running a Spark application on a YARN cluster in cluster deploy mode using the following command
spark-submit --conf spark.executor.memory=24g --conf spark.master=yarn --conf spark.submit.deployMode=cluster --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 --conf spark.files=file:///opt/configurations/app.conf --class com.example.HelloWorld --queue sample_q file:///opt/jars/example.jar '{"sample":{}}'
This command is not passing the entire argument to the HelloWorld class.
Main method argument passed : {"sample":{
Main method argument expected: {"sample":{}}
The same command runs properly in client deploy mode
spark-submit --conf spark.executor.memory=24g --conf spark.master=yarn --conf spark.submit.deployMode=client --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 --conf spark.files=file:///opt/configurations/app.conf --class com.example.HelloWorld --queue sample_q file:///opt/jars/example.jar '{"sample":{}}'
Upon inspecting the launch_container.sh script on the YARN worker node, it was found that the command there also contained the truncated string (--arg '{\"sample\":{')
Spark Version: 2.3
Hadoop Version: 2.7.3
YARN treats {{ and }} as parameter expansion markers, so any occurrence is interpreted as an environment-variable reference and replaced with the corresponding value. Since no such environment variable exists, the matched text is replaced with nothing and the argument ends up truncated.
This causes an issue only in cluster deploy mode, because there the driver itself is launched inside a YARN container.
Reference: YarnApplicationConstants
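A workaround that is sometimes used (a sketch, not part of the original answer) is to keep literal braces out of the command line entirely, for example by base64-encoding the JSON before submitting and decoding it again (e.g. with java.util.Base64) inside HelloWorld's main method:
# encode the argument so launch_container.sh never sees {{ or }}
ARG=$(echo -n '{"sample":{}}' | base64)
spark-submit --conf spark.master=yarn --conf spark.submit.deployMode=cluster --class com.example.HelloWorld --queue sample_q file:///opt/jars/example.jar "$ARG"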

Pyspark submit master yarn cluster deploy - logs location

I submitted a pyspark job with the spark-submit command on a Hadoop cluster. The command is as follows
spark-submit --master yarn --deploy-mode cluster --driver-memory 1g --num-executors 2 --executor-memory 1g --executor-cores 2 --py-files module_stm_extracts.py,module_table_compare.py datacheck.py
The job completed, but I never got the application ID in the console.
How do I find the application log so that I can review it?
You can find it in the YARN Resource Manager web UI; by default it is accessible on port 8088 of the master node: http://<master_node_ip>:8088
Or you can list the applications from the command line:
yarn application -list -appStates ALL
And with the application ID, get the log with the following command:
yarn logs -applicationId <application_id>

Accessing Azure Databricks data from Kubernetes

I am running the following command from a Kubernetes cluster to access a file from Azure Databricks
spark-submit --packages io.delta:delta-core_2.12:0.7.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore" script.py
I am getting this error.
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
Do I need to install any jars from hadoop-azure? Please guide me.
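The missing class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem comes from the hadoop-azure module (the ABFS connector), so that jar and its dependencies need to be on the classpath. One way to pull it in at submit time is via --packages; this is only a sketch and the Hadoop version shown is an assumption that has to match the Hadoop libraries your Spark build ships with:
spark-submit --packages io.delta:delta-core_2.12:0.7.0,org.apache.hadoop:hadoop-azure:3.2.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" --conf "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore" script.py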

Kubernetes spark-submit

I am trying to use Kubernetes as the cluster manager for Spark. I also want to ship the container logs to Splunk. I have a monitoring stack deployed (fluent-bit, Prometheus, etc.) in the same namespace, and the way it works is that if your pod has a certain environment variable set, it will start reading the logs and push them to Splunk.
What I am not able to find is how to set such an environment variable and populate it:
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://my-kube-cluster.com \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
....
....
....
--conf spark.kubernetes.driverEnv.UID="set it to spark driver pod id" \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
To configure additional Spark driver pod environment variables, you can pass additional --conf spark.kubernetes.driverEnv.EnvironmentVariableName=EnvironmentVariableValue options (please refer to the docs for more details).
To configure additional Spark executor pod environment variables, you can pass additional --conf spark.executorEnv.EnvironmentVariableName=EnvironmentVariableValue options (please refer to the docs for more details).
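For example (a sketch; SPLUNK_INDEX and its value are placeholders for whatever variable your log shipper looks for):
bin/spark-submit \
--deploy-mode cluster \
--master k8s://https://my-kube-cluster.com \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.driverEnv.SPLUNK_INDEX=spark-pi \
--conf spark.executorEnv.SPLUNK_INDEX=spark-pi \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar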
Hope it helps.

Setting spark.app.name for PySpark kernel with Jupyter Notebook

I am running a Jupyter Notebook server with PySpark (as explained here) on a Hadoop cluster with YARN. I noticed that each Spark application launched via a new notebook appears in the Spark Web UI as an application named "PySparkShell" (which corresponds to the "spark.app.name" configuration).
My problem is that I sometimes have many notebooks running in Jupyter, but all of them appear in Spark's Web UI with the same generic name of "PySparkShell". I know I can change the default name to something else, and I also know that I cannot change the app name once a SparkContext has been created. My question is: can I make it so that each application is given a different name when the kernel starts? (Preferably something that will help me connect the notebook name, i.e. 'Untitled.ipynb', to its Spark application name or ID.)
UPDATE: added a code snippet of my run command for the notebook
export DAEMON_PORT=8880
ANACONDA_PATH=/opt/cloudera/parcels/Anaconda/bin
export PATH=$ANACONDA_PATH:$PATH
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=$DAEMON_PORT"
pyspark2 \
--executor-memory 5g \
--executor-cores 4 \
--driver-memory 20g \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=0 \
--conf spark.dynamicAllocation.maxExecutors=40
In the first few lines where you create your SparkContext() you can include a config object. You can use the config object to set various settings by chaining set('property_name', 'property_value') calls.
I'll demonstrate by setting the executor memory:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('Your_Project_name').set("spark.executor.memory", "5g")
sc = SparkContext(conf=conf)
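If the goal is a different name per notebook, one option (only a sketch, not tied to any particular Jupyter setup) is to derive the app name from something unique at kernel start, such as a timestamp or the user name, before the SparkContext is created:
from datetime import datetime
from pyspark import SparkConf, SparkContext

# build a unique application name when the kernel starts,
# e.g. "jupyter-2021-03-01T10:15:30"
app_name = "jupyter-" + datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

conf = SparkConf().setAppName(app_name).set("spark.executor.memory", "5g")
sc = SparkContext(conf=conf)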