Dataproc: Jupyter pyspark notebook unable to import graphframes package - pyspark

In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter PySpark notebook.
Pyspark kernel config:
PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'
The following command initializes the cluster:
gcloud dataproc clusters create my-dataproc-cluster --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh --num-workers 2 --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m --worker-machine-type=n1-standard-4 --master-machine-type=n1-standard-4

This is an old bug with Spark shells and YARN that I thought was fixed in SPARK-15782, but apparently this case was missed.
The suggested workaround is to add
import os
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))
before your import.
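The jar path in the workaround can be derived from the package coordinate rather than typed by hand. The following is a minimal sketch, assuming the default Ivy directory (~/.ivy2) and the <group>_<artifact>-<version>.jar naming seen in the path above; ivy_jar_path is a hypothetical helper, not part of PySpark.

```python
import os


def ivy_jar_path(coordinate, ivy_home="~/.ivy2"):
    """Return the path where the jar for `coordinate` is expected to be cached.

    Assumes the `<group>_<artifact>-<version>.jar` naming shown in the
    workaround above and the default Ivy home directory.
    """
    group, artifact, version = coordinate.split(":")
    jar_name = f"{group}_{artifact}-{version}.jar"
    return os.path.join(os.path.expanduser(ivy_home), "jars", jar_name)


path = ivy_jar_path("graphframes:graphframes:0.2.0-spark2.0-s_2.11")
print(os.path.basename(path))

# In the notebook, where the PySpark kernel already defines `sc`:
# sc.addPyFile(path)
```

If the file does not exist at that location on your cluster, check the jars directory by hand before relying on the helper.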

I found another way to add packages that works in a Jupyter notebook:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Python Spark SQL") \
    .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
    .getOrCreate()
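spark.jars.packages takes a comma-separated list of Maven coordinates in groupId:artifactId:version form, so a malformed coordinate is worth catching before the session starts. Here is a small sketch of such a check; validate_coordinate and packages_value are hypothetical helper names. Note that the coordinate in the cluster creation command in the question has four colon-separated parts, which may be worth checking.

```python
def validate_coordinate(coordinate):
    """Raise ValueError unless `coordinate` is groupId:artifactId:version."""
    parts = coordinate.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected groupId:artifactId:version, got {coordinate!r}"
        )
    return coordinate


def packages_value(coordinates):
    """Join validated coordinates into a value for spark.jars.packages."""
    return ",".join(validate_coordinate(c) for c in coordinates)


value = packages_value(["graphframes:graphframes:0.5.0-spark2.1-s_2.11"])
print(value)

# The result can then be passed to the builder, for example:
# SparkSession.builder.config("spark.jars.packages", value)
```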

If you can use EMR Notebooks, you can install additional Python libraries and dependencies using the install_pypi_package() API within the notebook. These dependencies (including any transitive dependencies) will be installed on all executor nodes.
More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html

The simplest way to start Jupyter with PySpark and graphframes is to launch Jupyter from pyspark with the additional package attached.
Open your terminal, set the two environment variables, and start pyspark with the graphframes package:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
The advantage of this approach is that if you later want to run your code via spark-submit, you can reuse the same --packages argument.

Related

Prevent pyspark from using in-memory session/docker

We are looking into using Spark as big data processing framework in Azure Synapse Analytics with notebooks. I want to set up a local development environment/sandbox on my own computer similar to that, interacting with Azure Data Lake Storage Gen 2.
For installing Spark I'm using WSL with an Ubuntu distro (Spark seems to be easier to manage in Linux).
For notebooks I'm using Jupyter Notebook with Anaconda.
Both components work fine by themselves, but I can't manage to connect the notebook to my local Spark cluster in WSL. I tried the following:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.master("local[1]") \
.appName("Python Spark SQL basic example") \
.getOrCreate()
When examining the spark object it outputs
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.0
Master local[1]
AppName Python Spark SQL basic example
The Spark UI link points to http://host.docker.internal:4040/jobs/. Also, when examining the UI for Spark in WSL, I can't see any connection. I think there is something I'm missing or not understanding about how PySpark works. Any help to clarify would be much appreciated.
You are connecting to a local instance, which in this case is the native Windows environment running Jupyter:
.master("local[1]")
Instead, you should connect to your WSL cluster:
.master("spark://localhost:7077") # assuming default port
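A small sketch of how the master URL could be assembled and sanity-checked before building the session. It assumes a standalone master on the default port 7077, as in the answer; standalone_master_url is a hypothetical helper.

```python
def standalone_master_url(host, port=7077):
    """Build a spark://host:port URL for a standalone Spark master."""
    host = host.strip()
    if not host or "://" in host:
        raise ValueError(f"expected a bare host name, got {host!r}")
    return f"spark://{host}:{port}"


print(standalone_master_url("localhost"))

# Pass the result to SparkSession.builder.master(...) in place of "local[1]".
```

The exact URL to use is shown at the top of the standalone master's web UI, so it is worth comparing the generated string against that page.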

jupyter notebook connecting to Apache Spark 3.0

I'm trying to connect my Scala kernel in a notebook environment to an existing Apache Spark 3.0 cluster.
I've tried the following ways of integrating Scala into a notebook environment:
Jupyter Scala (Almond)
Spylon Kernel
Apache Zeppelin
Polynote
In each of these Scala environments I've tried to connect to an existing cluster using the following script:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("spark://<ipaddress>:7077")
.getOrCreate()
However, when I go to the web UI at localhost:8080, I don't see anything running on the cluster.
I am able to connect to the cluster using pyspark, but need help with connecting Scala to the cluster.

What is the correct way to install the delta module in python?

What is the correct way to install the delta module in python??
In the example they import the module
from delta.tables import *
but I did not find the correct way to install the module in my virtual env.
Currently I am using this Spark param:
"spark.jars.packages": "io.delta:delta-core_2.11:0.5.0"
As the correct answer is hidden in the comments of the accepted solution, I thought I'd add it here.
You need to create your spark context with some extra settings and then you can import delta:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder \
    .master("local") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
from delta.tables import *
Annoyingly, your IDE will of course complain about this because the package isn't installed, and you will also be working without autocomplete and type hints. I'm sure there's a workaround, and I will update if I come across it.
The package itself is on their GitHub here, and the readme suggests you can pip install it, but that doesn't work. In theory you could clone it and install it manually.
Because Delta's Python code is stored inside a jar and loaded by Spark, the delta module cannot be imported until a SparkSession/SparkContext is created.
To run Delta locally with PySpark, you need to follow the official documentation.
This works for me, but only when executing the script directly (python <script_file>), not with pytest or unittest.
To solve this problem, you need to add this environment variable:
PYSPARK_SUBMIT_ARGS='--packages io.delta:delta-core_2.12:1.0.0 pyspark-shell'
Use the Scala and Delta versions that match your setup. With this environment variable set, I can run pytest or unittest from the CLI without any problem:
from unittest import TestCase

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


class TestClass(TestCase):
    # Build the session once at class level so every test shares it.
    builder = SparkSession.builder.appName("MyApp") \
        .master("local[*]") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    def test_create_delta_table(self):
        self.spark.sql("""CREATE TABLE IF NOT EXISTS <tableName> (
            <field1> <type1>)
            USING DELTA""")
The function configure_spark_with_delta_pip appends a config option to the builder object:
.config("spark.jars.packages", "io.delta:delta-core_<scala_version>:<delta_version>")
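The PYSPARK_SUBMIT_ARGS variable mentioned above can also be set from Python, as long as that happens before PySpark launches its JVM. The sketch below builds the value from a Scala and a Delta version; delta_submit_args is a hypothetical helper, and the default versions are simply the ones used earlier in this answer. For pytest, one possible place for this code is a conftest.py at the root of the test directory.

```python
import os


def delta_submit_args(scala_version="2.12", delta_version="1.0.0"):
    """Build a PYSPARK_SUBMIT_ARGS value that pulls in delta-core."""
    coordinate = f"io.delta:delta-core_{scala_version}:{delta_version}"
    # The trailing "pyspark-shell" is part of the value shown above.
    return f"--packages {coordinate} pyspark-shell"


# Must run before the first SparkSession or SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = delta_submit_args()
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```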
Here's how you can install Delta Lake & PySpark with conda.
Make sure you have Java installed (I use SDKMAN to manage multiple Java versions)
Install Miniconda
Pick Delta Lake & PySpark versions that are compatible. For example, Delta Lake 1.2 is compatible with PySpark 3.2.
Create a YAML file with the required dependencies, here is an example from the delta-examples repo I created.
Create the environment with a command like conda env create -f envs/mr-delta.yml
Activate the conda environment with conda activate mr-delta
Here is an example notebook. Note that it starts with the following code:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
In my case the issue was I had a Cluster running on a Databricks Runtime lower than 6.1
https://docs.databricks.com/delta/delta-update.html
The Python API is available in Databricks Runtime 6.1 and above.
After changing the Databricks Runtime to 6.4 problem disappeared.
To do that: Click clusters -> Pick the one you are using -> Edit -> Pick Databricks Runtime 6.1 and above

H2o Package not found Scala Sparkling Water

I am trying to run Sparkling Water on my Local instance of Spark 2.1.0.
I followed the H2O documentation for Sparkling Water. But when I try to execute
sparkling-shell.cmd
I get the following error:
The filename, directory name, or volume label syntax is incorrect.
I looked into the batch file, and the error occurs when the following command is executed:
C:\Users\Mansoor\libs\spark\spark-2.1.0/bin/spark-shell.cmd --jars C:\Users\Mansoor\libs\H2o\sparkling\bin\../assembly/build/libs/sparkling-water-assembly_2.11-2.1.0-all.jar --driver-memory 3G --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m"
When I remove --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m", Spark starts but I am unable to import the packages of H2o.
import org.apache.spark.h2o._
error: object h2o is not a member of package org.apache.spark
I tried everything I could but was unable to solve this issue. Could someone help me with this? Thanks
Please try to correct your path:
C:\Users\Mansoor\libs\spark\spark-2.1.0/bin/spark-shell.cmd --jars C:\Users\Mansoor\libs\H2o\sparkling\bin\..\assembly\build\libs\sparkling-water-assembly_2.11-2.1.0-all.jar --driver-memory 3G --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m"
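To see what a consistently separated jar path looks like, the mixed path can be normalized with Python's ntpath module, which applies Windows path rules on any platform. This is only an illustrative sketch: it shows the normalized form and does not by itself confirm that the mixed separators are the cause of the batch-file error.

```python
import ntpath

# Jar path from the original command, mixing "\" and "/" separators.
mixed = (
    r"C:\Users\Mansoor\libs\H2o\sparkling\bin"
    r"\../assembly/build/libs/sparkling-water-assembly_2.11-2.1.0-all.jar"
)

# normpath converts "/" to "\" and collapses the "bin\.." segment.
normalized = ntpath.normpath(mixed)
print(normalized)
```

The normalized path drops the bin\.. detour, so it points to the same file as the corrected path above.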
There is also a doc page about RSparkling on Windows, which may contain other troubleshooting tips:
https://github.com/h2oai/sales-engineering/tree/master/megan/RSparklingAndWindows
The problem is with the spark-shell command when submitting jars. A workaround is to modify spark-defaults.conf.
Add the spark.driver.extraClassPath and spark.executor.extraClassPath parameters to the spark-defaults.conf file as follows:
spark.driver.extraClassPath \path\to\jar\sparkling-water-assembly_<version>-all.jar
spark.executor.extraClassPath \path\to\jar\sparkling-water-assembly_<version>-all.jar
Then remove --jars \path\to\jar\sparkling-water-assembly_<version>-all.jar from sparkling-shell2.cmd.

use an external library in pyspark job in a Spark cluster from google-dataproc

I have a spark cluster I created via google dataproc. I want to be able to use the csv library from databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:
I started a ssh session with the master node of my cluster, then I input:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Then it launched a pyspark shell in which I input:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs:/xxxx/foo.csv')
df.show()
And it worked.
My next step is to launch this job from my main machine using the command:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py
But here it does not work and I get an error. I think it is because I did not give --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried 10 different ways to pass it and none of them worked.
My questions are:
was the databricks csv library installed after I typed pyspark --packages com.databricks:spark-csv_2.11:1.2.0
can I write a line in my job.py in order to import it?
or what params should I give to my gcloud command to import it or install it?
Short Answer
There are quirks in the ordering of arguments: --packages isn't accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.
Long Answer
So, this is actually a different issue than the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; it appears that without Dataproc explicitly recognizing --packages as a special spark-submit-level flag, it tries to pass it after the application arguments so that spark-submit lets the --packages fall through as an application argument rather than properly parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:
# Doesn't work if job.py depends on that package.
spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0
But switching the order of the arguments does work again, even though in the pyspark case, both orderings work:
# Works with dependencies on that package.
spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py
So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means it's not actually a 100% compatible migration. This might be something to follow up with on the Spark side.
Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Note that the --properties must come before the my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. Hope that works for you! Note that the equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py.
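When the submission is scripted, the ordering constraint can be encoded once so it is not forgotten. Below is a minimal sketch that assembles the argument list with --properties placed ahead of the job file; dataproc_pyspark_submit is a hypothetical helper, and the cluster name is the placeholder used above.

```python
def dataproc_pyspark_submit(cluster, job_file, packages):
    """Return the gcloud argv with --properties placed before the job file."""
    return [
        "gcloud", "beta", "dataproc", "jobs", "submit", "pyspark",
        "--cluster", cluster,
        "--properties", f"spark.jars.packages={packages}",
        job_file,  # kept last so the flags above are not read as job arguments
    ]


argv = dataproc_pyspark_submit(
    "my-dataproc-cluster",
    "my_job.py",
    "com.databricks:spark-csv_2.11:1.2.0",
)
print(" ".join(argv))

# To actually submit the job:
# import subprocess
# subprocess.run(argv, check=True)
```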
In addition to @Dennis's answer:
Note that if you need to load multiple external packages, you need to specify a custom delimiter so that the commas in the package list are not treated as separators between properties, like so:
--properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1
Note the ^#^ right before the package list.
See gcloud topic escaping for more details.
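The ^#^ prefix tells gcloud to split the properties list on # instead of on commas. A short sketch of building such a value follows; properties_flag_value is a hypothetical helper, and it assumes the chosen delimiter does not occur in any property name or value.

```python
def properties_flag_value(properties, delimiter="#"):
    """Format a dict of properties for --properties with a custom delimiter."""
    pairs = [f"{name}={value}" for name, value in properties.items()]
    for pair in pairs:
        if delimiter in pair:
            raise ValueError(f"delimiter {delimiter!r} occurs in {pair!r}")
    # The ^<delimiter>^ prefix declares the delimiter used in the rest of the value.
    return f"^{delimiter}^" + delimiter.join(pairs)


flag_value = properties_flag_value({
    "spark.jars.packages": (
        "org.elasticsearch:elasticsearch-spark_2.10:2.3.2,"
        "com.databricks:spark-avro_2.10:2.0.1"
    ),
})
print(flag_value)
```

With more than one property in the dict, the entries are joined with # rather than a comma, which leaves the commas inside the package list intact.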