I'm trying to connect my Scala kernel in a notebook environment to an existing Apache Spark 3.0 cluster.
I've tried the following methods of integrating Scala into a notebook environment:
Jupyter Scala (Almond)
Spylon Kernel
Apache Zeppelin
Polynote
In each of these Scala environments I've tried to connect to an existing cluster using the following script:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("spark://<ipaddress>:7077")
  .getOrCreate()
However, when I go to the Web UI at localhost:8080, I don't see anything running on the cluster.
I am able to connect to the cluster using pyspark, but need help with connecting Scala to the cluster.
I have been trying to read/write from Synapse Spark pools to a MongoDB Atlas server. I have tried PyMongo, but I'm more interested in using the MongoDB Spark connector. However, in the install procedure they use this command:
./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
--packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
The problem I'm facing is that Synapse Spark pools allow for Spark session configuration but not for the --packages command or use of the Spark shell. How can I accomplish this installation inside a Spark pool?
This can be solved by installing the jar directly under workspace packages. Download the connector jar, upload it to Synapse as a workspace package, and then attach it to the Spark pool.
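Once the jar is attached, a minimal sketch of the session configuration in a Synapse notebook might look like this (assuming the 3.x connector and a placeholder Atlas connection string):

from pyspark.sql import SparkSession

# Placeholder Atlas URI; the connector jar is assumed to be attached to the
# pool as a workspace package, so no --packages flag is needed.
spark = SparkSession.builder \
    .config("spark.mongodb.input.uri", "mongodb+srv://<user>:<password>@<cluster>/test.myCollection") \
    .config("spark.mongodb.output.uri", "mongodb+srv://<user>:<password>@<cluster>/test.myCollection") \
    .getOrCreate()

df = spark.read.format("mongo").load()  # "mongo" is the short name registered by the 3.x connector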
We are looking into using Spark as a big data processing framework in Azure Synapse Analytics with notebooks. I want to set up a local development environment/sandbox on my own computer similar to that, interacting with Azure Data Lake Storage Gen 2.
For installing Spark I'm using WSL with an Ubuntu distro (Spark seems to be easier to manage in Linux).
For notebooks I'm using Jupyter Notebook with Anaconda.
Both components work fine by themselves, but I can't manage to connect the notebook to my local Spark cluster in WSL. I tried the following:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master("local[1]") \
.appName("Python Spark SQL basic example") \
.getOrCreate()
When examining the spark object it outputs
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.0
Master local[1]
AppName Python Spark SQL basic example
The Spark UI link points to http://host.docker.internal:4040/jobs/. Also, when examining the UI for Spark in WSL, I can't see any connection. I think there is something I'm missing or not understanding about how pyspark works. Any help to clarify this would be much appreciated.
You are connecting to a local instance, which in this case is the native Windows side running Jupyter:
.master("local[1]")
Instead, you should connect to your WSL cluster:
.master("spark://localhost:7077") # assuming default port
I'm new to apache-spark and I'm experiencing some issues while trying to connect from my local machine to a remote server that contains a working Spark instance.
I successfully managed to connect via an SSH tunnel to that server using JSch, but I get the following error when trying to connect to Spark:
Exception in thread "main" java.lang.NoSuchMethodError:
scala.Predef$.$scope()Lscala/xml/TopScope$; at
org.apache.spark.ui.jobs.AllJobsPage.<init>(AllJobsPage.scala:39) at
org.apache.spark.ui.jobs.JobsTab.<init>(JobsTab.scala:38) at
org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:65) at
org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:82) at
org.apache.spark.ui.SparkUI$.create(SparkUI.scala:220) at
org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:162) at
org.apache.spark.SparkContext.<init>(SparkContext.scala:452) at
server.Server$.main(Server.scala:45) at
server.Server.main(Server.scala)
This is my Scala code:
import org.apache.spark.{SparkConf, SparkContext}

// Point the driver at the remote standalone master.
val conf = new SparkConf().setAppName("Test").setMaster("spark://xx.xxx.xxx.x:7077")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5)).count()
println(rdd)
Line 45, shown as (Server.scala:45) in the error, is the one with new SparkContext(conf).
On both the local and remote machines I'm using Scala ~2.11.6. In my local pom.xml file I imported scala 2.11.6, plus spark-core_2.10 and spark-sql_2.10, both ~2.1.1. On my server I installed Spark ~2.1.1. On the server I also managed to set up the master as the local machine by editing conf/spark-env.sh.
Of course, I managed to test the server's Spark and it works just fine.
What am I doing wrong?
From the docs of setMaster:
The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
If you run it from the Spark cluster itself (as I understand you are), you should use local[n].
In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter pyspark notebook.
Pyspark kernel config:
PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'
Following is the command to initialize the cluster:
gcloud dataproc clusters create my-dataproc-cluster --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh --num-workers 2 --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m --worker-machine-type=n1-standard-4 --master-machine-type=n1-standard-4
This is an old bug with Spark shells and YARN that I thought was fixed in SPARK-15782, but apparently this case was missed.
The suggested workaround is adding
import os
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))
before your import.
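With the jar shipped to the session, the import in the notebook should then resolve; a tiny usage check (names as in the graphframes Python package):

from graphframes import GraphFrame  # should now import without errors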
I found another way to add packages which works in a Jupyter notebook:
spark = SparkSession.builder \
    .appName("Python Spark SQL") \
    .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
    .getOrCreate()
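If the session comes up with the package resolved, a small usage sketch like the following should work (the tiny vertex/edge DataFrames here are made-up examples, following the column-name convention GraphFrames expects):

from graphframes import GraphFrame

# Vertices need an "id" column; edges need "src" and "dst" columns.
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()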
If you can use EMR Notebooks, then you can install additional Python libraries/dependencies using the install_pypi_package() API within the notebook. These dependencies (including transitive dependencies, if any) will be installed on all executor nodes.
More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
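For example, based on the AWS docs linked above, something like this inside an EMR notebook cell installs a PyPI package for the current session (note this only covers the Python side of graphframes; the JVM jar still has to be available on the cluster):

sc.install_pypi_package("graphframes")  # install the Python bindings for this session
sc.list_packages()                      # verify what is available on the executors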
The simplest way to start Jupyter with pyspark and graphframes is to launch Jupyter from pyspark with the additional package attached.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this is also that if you later want to run your code via spark-submit, you can use the same start command.
I succeeded in connecting to MongoDB from Spark, using the mongo-spark connector from a Databricks notebook in Python.
Right now I am configuring the MongoDB URI in an environment variable, but that is not flexible, since I want to change the connection parameters right in my notebook.
I read in the connector documentation that it is possible to override any values set in the SparkConf.
How can I override the values from python?
You don't need to set anything in the SparkConf beforehand*.
You can pass any configuration options to the DataFrame reader or writer, e.g.:
df = sqlContext.read \
    .option("uri", "mongodb://example.com/db.coll") \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load()
* This was added in 0.2
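A matching write-side sketch, passing the option on the writer the same way (the URI is a placeholder):

df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://example.com/db.coll") \
    .mode("append") \
    .save()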