Can sparkmagic be used outside of IPython? - pyspark

I'm using a Jupyter notebook with the sparkmagic extension, but I can only access the Spark cluster by creating a PySpark kernel. The conflict is that I can't use my Python 3 environment (with its installed Python packages) in the PySpark kernel, and I can't use the Spark context in the Python 3 kernel.
I don't know how to pull packages into sparkmagic, so can I use the PySpark support that sparkmagic implements from a Python 3 kernel? Or are there any other options?

Both kernels - PySpark and the default IPython - can be used with a Python 3 interpreter for PySpark. The interpreter can be specified in ~/.sparkmagic/config.json. This is standard Spark configuration and is simply passed by sparkmagic to the Livy server running on the Spark master node.
"session_configs": {
"conf": {
"spark.pyspark.python":"python3"
}
}
spark.pyspark.python: Python binary executable to use for PySpark in both driver and executors.
python3 must in this case be available as a command on the PATH of each node in the Spark cluster. You can also install it into a custom directory on each node and specify the full path: "spark.pyspark.python": "/Users/hadoop/python3.8/bin/python"
All Spark conf options can be passed this way.
There are two ways to make tensorflow importable (see the sketch after this list):
install it on all Spark machines (master and workers) via python3 -m pip install tensorflow
zip it, upload it, and pass the remote path through sparkmagic via the spark.submit.pyFiles setting. This accepts a path on S3, HDFS, or the master node file system (not a path on your machine).
See answer about --py-files
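Either way, a quick sanity check confirms the package is actually importable on the executors. This is only a minimal sketch, assuming the sc SparkContext that the sparkmagic PySpark kernel provides:
def tensorflow_version(_):
    # import inside the task so the import happens on the executor, not the driver
    import tensorflow as tf
    return tf.__version__

# a single version string here means tensorflow is visible on every executor that ran a task
print(sc.range(4).map(tensorflow_version).distinct().collect())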

Related

Change Python version Livy uses in an EMR cluster

I am aware of "Change Apache Livy's Python Version" and "How do i setup Pyspark in Python 3 with spark-env.sh.template".
I have also seen the Livy documentation.
However, none of that works: Livy keeps using Python 2.7 no matter what.
This is running Livy 0.6.0 on an EMR cluster.
I have changed the PYSPARK_PYTHON environment variable to /usr/bin/python3 for the hadoop user, my user, root, and ec2-user. Logging into the EMR master node via ssh and running pyspark starts Python 3 as expected, but Livy keeps using Python 2.7.
I added export PYSPARK_PYTHON=/usr/bin/python3 to the /etc/spark/conf/spark-env.sh file. Livy keeps using Python 2.7.
I added "spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/bin/python3" and "spark.executorEnv.PYSPARK_PYTHON":"/usr/bin/python3" to the items listed below and in every case . Livy keeps using python2.7.
sparkmagic config.json and config_other_settings.json files before starting a PySpark kernel Jupyter
Session Properties in the sparkmagic %manage_spark Jupyter widget. Livy keeps using python2.7.
%%spark config cell-magic before the line-magic %spark add --session test --url http://X.X.X.X:8998 --auth None --language python
Note: this works without any issues on another EMR cluster running Livy 0.7.0. I have gone over all of the settings on that cluster and cannot find what is different; I did not have to do any of this there, Livy just used Python 3 by default.
How exactly do I get Livy to use python3 instead of python2?
I finally found an answer just after posting.
I ran the following in a cell of a PySpark-kernel Jupyter session, before running any other code, so that the PySpark session started on the remote EMR cluster via Livy picked it up:
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3"
    }
}
Simply adding "spark.pyspark.python": "python3" to the ~/.sparkmagic/config.json or config_other_settings.json also worked.
It is confusing that this does not match the official Livy documentation.
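As a quick way to confirm the setting took effect, something like this can be run as the first code cell after the %%configure cell. It is only a sketch, assuming the sc that the Livy-backed PySpark session provides:
import sys

# Python used by the driver that Livy started on the cluster
print(sys.version)

# Python used by the executors; both should now report 3.x
print(sc.range(2).map(lambda _: sys.version).distinct().collect())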

Adding PostgreSQL JDBC Driver to all-spark-notebook using docker-compose

I'm a beginner with Docker and Spark with Python, and I'm trying out some Spark examples, extracting data from a local PostgreSQL database. I'm experimenting locally on a Windows 10 machine running Ubuntu 20.04 LTS. My docker-compose version is 1.28.
I keep running into the same issue, however: how do I add such-and-such a driver to my Docker images? In this case, it's the PostgreSQL JDBC driver. My question is very similar to this question, but I'm using docker-compose instead of plain docker.
Here is the docker-compose section for the all-spark-notebook image:
services:
  spark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    working_dir: /home/$USER/work
    volumes:
      - $PWD/work:/home/$USER/work
    environment:
      PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 --jars /usr/share/java/postgresql.jar pyspark-shell
The --packages entry is necessary to get my Kafka integration to work in Jupyter (and it does). The --jars entry is my attempt to reference the PostgreSQL JDBC driver installed in the Ubuntu LTS terminal using:
sudo apt-get install libpostgresql-jdbc-java libpostgresql-jdbc-java-doc
In Python, I've tried this:
import findspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.jars", "/usr/share/java/postgresql.jar")

findspark.init()
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .appName("My App") \
    .getOrCreate()

dataframe = spark.read.format('jdbc').options(
    url="jdbc:postgresql://host.docker.internal:5432/postgres?user=user&password=***",
    database='postgres',
    dbtable='cloud.some-table'
).load()
dataframe.show()
But, I get the following error message:
java.sql.SQLException: No suitable driver
just like the referenced previous poster.
Any ideas? This should be easy, but I'm struggling.
OK, since nobody has come back with an answer, I'll post what worked for me (in the end). I'm not claiming this is the correct way to do it, and I'm happy for someone to post a better answer, but it may get someone out of trouble.
Since different configurations (and versions!) require different solutions, I'll define my setup first. I'm using Docker Desktop for Windows 10 with Docker Engine v20.10.5, and I'm managing my containers using docker-compose version 1.29.0. I'm using the latest all-spark-notebook (whatever version that is) and the postgresql-42.2.19 JDBC driver.
I'll also say that this is running on my local Windows machine with Ubuntu LTS installed, and is for experimentation only.
The trick that worked for me was:
a) use a --packages entry for the JDBC driver with Spark. This way, Spark installs the package from Maven at runtime (when you create the Spark instance within Jupyter):
    volumes:
      - $PWD/work:/home/$USER/work
    environment:
      PYSPARK_SUBMIT_ARGS: --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1,org.postgresql:postgresql:42.2.19 --driver-class-path /home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar pyspark-shell
b) understand where the package jars are unpacked, and use that directory to tell Spark where to find the associated jar. In my case, I used this code to start Spark within the Jupyter notebook:
spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", "/home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.19.jar") \
    .appName("My App") \
    .getOrCreate()
One other thing to note: this can be a bit flaky. If Spark decides it needs to re-pull the files from Maven (it will do this the first time around, obviously), the library isn't picked up and the connection fails. However, running docker-compose stop and docker-compose up -d to recycle the containers and re-running the Python script makes the connection happy. I don't pretend to know why, but my suspicion is that, the way I have things set up, there's some ordering dependency there.
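For completeness, here is roughly what the JDBC read looks like once the driver jar is on the classpath. This is only an illustrative sketch (the table name and credentials are placeholders); explicitly naming the driver class is not strictly required, but it can help avoid the "No suitable driver" error when Spark cannot match the JDBC URL to a registered driver:
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host.docker.internal:5432/postgres")
      .option("driver", "org.postgresql.Driver")   # explicit driver class
      .option("dbtable", "cloud.some_table")       # placeholder table name
      .option("user", "user")                      # placeholder credentials
      .option("password", "***")
      .load())
df.show()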

Pyspark / pyspark kernels not working in jupyter notebook

Here are installed kernels:
$ jupyter-kernelspec list
Available kernels:
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala
apache_toree_sql /usr/local/share/jupyter/kernels/apache_toree_sql
pyspark3kernel /usr/local/share/jupyter/kernels/pyspark3kernel
pysparkkernel /usr/local/share/jupyter/kernels/pysparkkernel
python3 /usr/local/share/jupyter/kernels/python3
sparkkernel /usr/local/share/jupyter/kernels/sparkkernel
sparkrkernel /usr/local/share/jupyter/kernels/sparkrkernel
A new notebook was created but fails with
The code failed because of a fatal error:
Error sending http request and maximum retry encountered..
There is no [error] message in the jupyter console
If you use sparkmagic to connect your Jupyter notebook, you also need to start Livy, which is the API service sparkmagic uses to talk to your Spark cluster.
Download Livy from Apache Livy and unzip it.
Check that the SPARK_HOME environment variable is set; if not, set it to your Spark installation directory.
Run the Livy server with <livy_home>/bin/livy-server in the shell/command line.
Now go back to your notebook; you should be able to run Spark code in a cell. (A quick connectivity check is sketched below.)
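If the kernel still reports the "Error sending http request" failure, it can help to verify that Livy is actually reachable before digging further. A minimal sketch, assuming Livy runs on its default port 8998 on the same machine:
import requests

# GET /sessions is a cheap way to confirm the Livy REST API is up
resp = requests.get("http://localhost:8998/sessions")
print(resp.status_code)   # 200 means the server answered
print(resp.json())        # lists any existing Livy sessions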

Run specific virtualenv on dataproc cluster at spark-submit like in vanilla Spark

When I'm running on a vanilla Spark cluster and want to run a pyspark script against a specific virtualenv, I can create the virtual environment, install packages as needed, and then zip the environment into a file, say venv.zip.
Then, at runtime, I can execute
spark-submit --archives venv.zip#VENV --master yarn script.py
and then, as long as I run
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python" inside of script.py (sketched below), the code will run against the virtual environment, and Spark will handle provisioning the virtual environment to all of my nodes.
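For reference, a minimal sketch of what the top of such a script.py looks like (the VENV alias comes from the --archives venv.zip#VENV flag above; the app name is a placeholder):
import os

# must run before the SparkContext/SparkSession is created, so the executors
# are launched with the interpreter from the shipped archive
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("script").getOrCreate()  # placeholder app name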
When I do this on dataproc, first, the hadoop-style hash aliasing doesn't work, and second, running
gcloud dataproc jobs submit pyspark script.py --archives venv.zip --cluster <CLUSTER_NAME>
with os.environ["PYSPARK_PYTHON"] = "venv.zip/bin/python" will produce:
Error from python worker:
venv/bin/python: 1: venv.zip/bin/python: Syntax error: word unexpected (expecting ")")
It's clearly seeing my Python executable and trying to run against it, but there really appears to be some sort of parsing error. What gives? Is there any way to point Dataproc at a live Python executable the way you can with a vanilla Spark cluster?
It turns out I was distributing Python binaries built on one OS to nodes running another, was boneheaded enough not to notice I was doing so, and the incompatibility was causing the crash.

jupyter pyspark outputs: No module name sknn.mlp

I have a one-worker-node Spark HDInsight cluster. I need to use the scikit-neuralnetwork and vaderSentiment modules in a PySpark Jupyter notebook.
I installed the libraries using the commands below:
cd /usr/bin/anaconda/bin/
export PATH=/usr/bin/anaconda/bin:$PATH
conda update matplotlib
conda install Theano
pip install scikit-neuralnetwork
pip install vaderSentiment
Next, I open a pyspark terminal and I am able to successfully import the modules.
Now, when I open a Jupyter PySpark notebook, the same import fails.
Just to add, I am able to import pre-installed modules from Jupyter, like import pandas.
The installation goes to:
admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "vaderSentiment"
/usr/bin/anaconda/lib/python2.7/site-packages/vaderSentiment
/usr/local/lib/python2.7/dist-packages/vaderSentiment
For pre-installed modules:
admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "pandas"
/usr/bin/anaconda/pkgs/pandas-0.17.1-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/pandas-0.16.2-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/bokeh-0.9.0-np19py27_0/Examples/bokeh/compat/pandas
/usr/bin/anaconda/Examples/bokeh/compat/pandas
/usr/bin/anaconda/lib/python2.7/site-packages/pandas
The sys.executable path is the same in both Jupyter and the terminal:
print(sys.executable)
/usr/bin/anaconda/bin/python
Any help would be greatly appreciated.
The issue is that while you are installing it on the headnode (one of the VMs), you are not installing it on all the other VMs (worker nodes). When the PySpark app for Jupyter gets created, it runs in YARN cluster mode, and so the application master starts on a random worker node.
One way of installing the libraries in all worker nodes would be to create a script action that runs against worker nodes and installs the necessary libraries:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
Do note that there are two Python installations in the cluster, and you have to refer to the Anaconda installation explicitly. Installing scikit-neuralnetwork would look something like this:
sudo /usr/bin/anaconda/bin/pip install scikit-neuralnetwork
The second way of doing this is to simply ssh into the worker nodes from the headnode. First, ssh into the headnode, then figure out the worker node IPs by going to Ambari at https://YOURCLUSTER.azurehdinsight.net/#/main/hosts. Then ssh to 10.0.0.# and execute the installation commands yourself on each worker node.
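Whichever installation route you take, a quick check like the sketch below (run from the Jupyter PySpark kernel, which provides sc) shows on which hosts the package is actually importable:
def probe(_):
    # runs on whichever worker node the task is scheduled on
    import socket
    try:
        import vaderSentiment  # noqa: F401
        ok = True
    except ImportError:
        ok = False
    return (socket.gethostname(), ok)

# spread tasks over several partitions so every worker is likely to be hit
print(sorted(set(sc.range(100, numSlices=20).map(probe).collect())))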
I did this for scikit-neuralnetwork and, while it does import correctly, it throws an error saying it cannot create a file in ~/.theano. Because YARN runs PySpark sessions as the nobody user, Theano cannot create its config file. After a little digging around, I see that there's a way to change where Theano writes/looks for its config file. Please also take care of that while doing the installation: http://deeplearning.net/software/theano/library/config.html#envvar-THEANORC
I forgot to mention: to modify an environment variable, you need to set it when creating the PySpark session. Execute this in the Jupyter notebook:
%%configure -f
{
    "conf": {
        "spark.executorEnv.THEANORC": "{YOURPATH}",
        "spark.yarn.appMasterEnv.THEANORC": "{YOURPATH}"
    }
}
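Once the session comes back up with that configuration, a quick check (again only a sketch, relying on the sc the PySpark kernel provides) shows whether the variable actually reached the application master and the executors:
import os

# both should print the THEANORC path from the %%configure cell above
print(os.environ.get("THEANORC"))                                                   # application master / driver
print(sc.range(2).map(lambda _: os.environ.get("THEANORC")).distinct().collect())   # executors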
Thanks!
The easy way to resolve this was:
Create a bash script:
cd /usr/bin/anaconda/bin/
export PATH=/usr/bin/anaconda/bin:$PATH
conda update matplotlib
conda install Theano
pip install scikit-neuralnetwork
pip install vaderSentiment
Copy the bash script created above to any container in an Azure storage account.
While creating the HDInsight Spark cluster, use a script action and provide the above path as the URL, e.g. https://sa-account-name.blob.core.windows.net/containername/path-of-installation-file.sh
Install it on both head nodes and worker nodes.
Now, open Jupyter and you should be able to import the modules.