Prevent pyspark from using in-memory session/docker - pyspark

We are looking into using Spark as the big data processing framework in Azure Synapse Analytics with notebooks. I want to set up a similar local development environment/sandbox on my own computer that interacts with Azure Data Lake Storage Gen2.
For installing Spark I'm using WSL with an Ubuntu distro (Spark seems to be easier to manage in Linux).
For notebooks I'm using Jupyter Notebook with Anaconda.
Both components work fine by themselves, but I can't manage to connect the notebook to my local Spark cluster in WSL. I tried the following:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[1]") \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
When examining the spark object it outputs
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.0
Master local[1]
AppName Python Spark SQL basic example
The Spark UI link points to http://host.docker.internal:4040/jobs/. Also, when examining the Spark UI in WSL, I can't see any connection. I think there is something I'm missing or not understanding about how pyspark works. Any help clarifying this would be much appreciated.

You are connecting to a local instance, which in this case is the native Windows Python running Jupyter:
.master("local[1]")
Instead, you should connect to your WSL cluster:
.master("spark://localhost:7077") # assuming default port

Related

Apache Spark Pool Mongodb connector

I have been trying to read/write with Synapse Spark pools to a MongoDB Atlas server. I have tried PyMongo, but I'm more interested in using the MongoDB Spark connector. However, in the install procedure they use this command:
./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
--packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
The problem I'm facing is that Synapse Spark pools allow for Spark session configuration but not for the --packages command or use of the Spark shell. How can I accomplish this installation inside a Spark pool?
This can be solved by installing the jar directly under workspace packages. Download the connector's jar, upload it to Synapse as a workspace package, and then add it to the Spark pool.
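Once the jar is attached to the pool, the connection URI can be supplied per read rather than via shell flags; a minimal sketch (the URI is a placeholder, and the format string assumes the 3.x connector from the question):
# Read from MongoDB Atlas using the connector jar installed as a workspace/pool package.
# Replace the URI with your Atlas connection string.
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb+srv://<user>:<password>@<cluster>/test.myCollection") \
    .load()
df.printSchema()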

jupyter notebook connecting to Apache Spark 3.0

I'm trying to connect my Scala kernel in a notebook environment to an existing Apache Spark 3.0 cluster.
I've tried the following methods of integrating Scala into a notebook environment:
Jupyter Scala (Almond)
Spylon Kernel
Apache Zeppelin
Polynote
In each of these Scala environments I've tried to connect to an existing cluster using the following script:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("spark://<ipaddress>:7077")
  .getOrCreate()
However, when I go to the web UI at localhost:8080, I don't see anything running on the cluster.
I am able to connect to the cluster using pyspark, but need help with connecting Scala to the cluster.

What is the correct way to install the delta module in python?

What is the correct way to install the delta module in Python?
In the example they import the module
from delta.tables import *
but I did not find the correct way to install the module in my virtual env.
Currently I am using this Spark param:
"spark.jars.packages": "io.delta:delta-core_2.11:0.5.0"
As the correct answer is hidden in the comments of the accepted solution, I thought I'd add it here.
You need to create your Spark session with some extra settings, and then you can import delta:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
    .master("local") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

from delta.tables import *
Annoyingly, your IDE will of course complain about this since the package isn't installed, and you will also be operating without autocomplete and type hints. I'm sure there's a workaround and I will update if I come across it.
The package itself is on their GitHub here, and the readme suggests you can pip install it, but that doesn't work. In theory you could clone it and install it manually.
Because Delta's Python code is stored inside a jar and loaded by Spark, the delta module cannot be imported until the SparkSession/SparkContext is created.
To run Delta locally with PySpark, you need to follow the official documentation.
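For reference, a minimal sketch of that documented local setup (assuming the delta-spark pip package, which provides the configure_spark_with_delta_pip helper used in the answers below):
# pip install delta-spark pyspark   (versions must match, e.g. delta-spark 1.0.x with PySpark 3.1)
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .appName("DeltaLocal") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# configure_spark_with_delta_pip adds the matching io.delta:delta-core jar to spark.jars.packages
spark = configure_spark_with_delta_pip(builder).getOrCreate()

from delta.tables import *  # import only after the session has been created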
This works for me, but only when executing the script directly (python <script_file>), not with pytest or unittest.
To solve this problem, you need to add this environment variable:
PYSPARK_SUBMIT_ARGS='--packages io.delta:delta-core_2.12:1.0.0 pyspark-shell'
Use the Scala and Delta versions that match your case. With this environment variable, I can run pytest or unittest via the CLI without any problem:
from unittest import TestCase

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


class TestClass(TestCase):
    builder = SparkSession.builder.appName("MyApp") \
        .master("local[*]") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    def test_create_delta_table(self):
        self.spark.sql("""CREATE TABLE IF NOT EXISTS <tableName> (
            <field1> <type1>)
            USING DELTA""")
The function configure_spark_with_delta_pip appends a config option to the builder object:
.config("spark.jars.packages", "io.delta:delta-core_<scala_version>:<delta_version>")
Here's how you can install Delta Lake & PySpark with conda.
Make sure you have Java installed (I use SDKMAN to manage multiple Java versions)
Install Miniconda
Pick Delta Lake & PySpark versions that are compatible. For example, Delta Lake 1.2 is compatible with PySpark 3.2.
Create a YAML file with the required dependencies, here is an example from the delta-examples repo I created.
Create the environment with a command like conda env create -f envs/mr-delta.yml
Activate the conda environment with conda activate mr-delta
Here is an example notebook. Note that it starts with the following code:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
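As a quick sanity check that the environment works, a minimal sketch using that session (the path below is a placeholder):
# Write a small DataFrame as a Delta table and read it back (the path is a placeholder).
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
spark.read.format("delta").load("/tmp/delta-table").show()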
In my case the issue was that I had a cluster running on a Databricks Runtime lower than 6.1:
https://docs.databricks.com/delta/delta-update.html
The Python API is available in Databricks Runtime 6.1 and above.
After changing the Databricks Runtime to 6.4, the problem disappeared.
To do that: click Clusters -> pick the one you are using -> Edit -> pick Databricks Runtime 6.1 or above.

Dataproc: Jupyter pyspark notebook unable to import graphframes package

In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter pyspark notebook.
Pyspark kernel config:
PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'
Following is the command used to initialize the cluster:
gcloud dataproc clusters create my-dataproc-cluster --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh --num-workers 2 --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m --worker-machine-type=n1-standard-4 --master-machine-type=n1-standard-4
This is an old bug with Spark shells and YARN that I thought was fixed in SPARK-15782, but apparently this case was missed.
The suggested workaround is adding
import os
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))
before your import.
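After that import succeeds, a minimal usage sketch (assuming the session is available as spark; the tiny DataFrames are placeholders):
from graphframes import GraphFrame

# Minimal placeholder data: vertices need an "id" column, edges need "src" and "dst" columns.
vertices_df = spark.createDataFrame([("a",), ("b",)], ["id"])
edges_df = spark.createDataFrame([("a", "b")], ["src", "dst"])

g = GraphFrame(vertices_df, edges_df)
g.inDegrees.show()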
I found another way to add packages which works in a Jupyter notebook:
spark = SparkSession.builder \
    .appName("Python Spark SQL") \
    .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
    .getOrCreate()
If you can use EMR Notebooks, then you can install additional Python libraries/dependencies using the install_pypi_package() API within the notebook. These dependencies (including transitive dependencies, if any) will be installed on all executor nodes.
More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
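For example, a minimal sketch based on the linked docs (the package name is just illustrative):
# Inside an EMR notebook cell: install a PyPI package for this notebook session.
sc.install_pypi_package("networkx")
sc.list_packages()  # verify what is now available on the executors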
The simplest way to start Jupyter with pyspark and graphframes is to launch Jupyter from pyspark with the additional package attached.
Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this approach is that if you later want to run your code via spark-submit, you can use the same start command.

How to change a mongo-spark connection configuration from a databricks python notebook

I succeeded at connecting to MongoDB from Spark, using the mongo-spark connector from a Databricks notebook in Python.
Right now I am configuring the MongoDB URI in an environment variable, but that is not flexible, since I want to change the connection parameters right in my notebook.
I read in the connector documentation that it is possible to override any values set in the SparkConf.
How can I override the values from python?
You don't need to set anything in the SparkConf beforehand*.
You can pass any configuration options to the DataFrame Reader or Writer, e.g.:
df = sqlContext.read \
    .option("uri", "mongodb://example.com/db.coll") \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load()
* This was added in 0.2
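The same pattern works for writes; a minimal sketch (the URI is a placeholder):
# Per-writer options override anything set in the SparkConf.
df.write \
    .option("uri", "mongodb://example.com/db.coll") \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()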