ToreeInstall ERROR | Unknown interpreter PySpark. Toree cannot install PySpark

When I install PySpark for Jupyter Notebook, I use this command:
jupyter toree install --kernel_name=tanveer --interpreters=PySpark --python="/usr/lib/python3.6"
But I get this error:
[ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
So I don't know what the problem is. I have set up Toree's Scala and SQL interpreters successfully. Thanks.

Toree version 0.3.0 removed support for PySpark and SparkR:
Removed support for PySpark and Spark R in Toree (use specific kernels)
Release notes here: incubator-toree release notes
I am not sure what "use specific kernels" means and continue to look for a Jupyter PySpark kernel.

As also mentioned in Lee's answer, Toree version 0.3.0 removed support for PySpark and SparkR. As per their release notes, they ask you to "use specific kernels". For PySpark, this means manually installing pyspark to use it with Jupyter.
The steps are simple, as follows:
Install pyspark, either by running pip install pyspark or by downloading the Apache Spark binary package and decompressing it into a specific folder.
Add the following 3 environment variables. How to do this depends on your OS; for example, on macOS I added the following lines to ~/.bash_profile:
export SPARK_HOME=<path_to_your_installed_spark_files>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
That's it. To start your PySpark Jupyter Notebook, simply run pyspark from your command line and choose the Python kernel.
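As a quick sanity check (a minimal sketch; getOrCreate() simply reuses any session the pyspark launcher already created, or builds a local one), you can run this in the first notebook cell:
# Minimal sanity check for the PySpark + Jupyter setup described above.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.range(5).show()  # should print a one-column DataFrame with ids 0-4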
Refer to https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788835367/1/ch01lvl1sec17/installing-jupyter
or
https://opensource.com/article/18/11/pyspark-jupyter-notebook for more detailed instructions.

Related

pyspark pip installation hive-site.xml

I have installed pyspark with pipenv install pyspark, and I run pyspark after activating the pipenv shell.
I am able to open the pyspark terminal and run some Spark code,
but I am trying to figure out how to enable Hive: where do I need to place hive-site.xml (with the MySQL metastore properties)? I cannot see any spark/conf folder in which to place it.
Unfortunately the existing application relies heavily on the Pipfile, so I have to stick with pipenv install pyspark.
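This is not answered here, but as a hedged sketch: with a pip/pipenv install, the pyspark package directory itself acts as SPARK_HOME, so a conf folder created inside it (or any directory pointed to by the standard SPARK_CONF_DIR environment variable) is the usual place for hive-site.xml. Locating that directory from inside the pipenv shell might look like this:
# Sketch: find where the pip-installed pyspark package lives and print a
# candidate conf directory for hive-site.xml (assumption: Spark reads
# $SPARK_HOME/conf or $SPARK_CONF_DIR for site configuration files).
import os
import pyspark

spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)                         # e.g. .../site-packages/pyspark
print(os.path.join(spark_home, "conf"))   # candidate location for hive-site.xml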

Jupyter for Scala with spylon-kernel without having to install Spark

Based on web searches and strong recommendations, I am trying to run Jupyter locally for Scala (using spylon-kernel).
I was able to create a notebook, but while trying to run a Scala code snippet I see the message "initializing scala interpreter", and in the console I see this error:
ValueError: Couldn't find Spark, make sure SPARK_HOME env is set or Spark is in an expected location (e.g. from homebrew installation).
I am not planning to install Spark. Is there a way I can still use Jupyter for Scala without installing Spark?
I am new to Jupyter and the ecosystem. Pardon me for the amateur question.
Thanks

PySpark always uses the system's Python

My system has two Pythons:
① the system's Python:
/usr/bin/python
② the user's (Anaconda) Python:
~/anaconda3/envs/Python3.6/bin/python3
Now I have a cluster with my desktop (master) and laptop (slave).
The PySpark shell works fine in its different modes if I set this:
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
in both nodes' ~/.bashrc.
However, I want to use it with Jupyter Notebook, so I set this in each node's ~/.bashrc instead:
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Then I get an error in the log (the worker appears to be running Python 2.7 while the driver runs 3.6).
My Spark version is:
spark-3.0.0-preview2-bin-hadoop3.2
I have read all the answers in
environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
and
different version 2.7 than that in driver 3.6 in jupyter/all-spark-notebook
But no luck.
I guess the slave's Python 2.7 is the system's Python, not Anaconda's Python.
How can I force Spark's slave node to use Anaconda's Python?
Thanks~!
Jupyter is looking for IPython; you probably only have IPython installed in your system Python.
To use Jupyter with a different Python version, you need a Python version manager (pyenv) and a Python environment manager (virtualenv). Together they let you choose which Python version to use and which environment to install Jupyter into, with fully isolated Python versions and packages.
Install ipykernel and Jupyter in your chosen Python environment.
After you finish the steps above, you need to make sure that the Spark worker switches to your chosen Python version and environment every time the Spark ResourceManager launches a worker executor. To switch the Python version and environment when the worker executor starts, make sure a small script runs right after the Spark ResourceManager sshes into the worker:
go to the python environment directory
source 'whatever/bin/activate'
After you have done the steps above, the Spark worker executor should run your chosen Python version and Jupyter environment.
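A commonly suggested alternative is to point the workers at the Anaconda interpreter from the driver side. This is only a sketch: it assumes the Anaconda interpreter exists at the same absolute path on every node, it must run before any SparkSession/SparkContext is created in the notebook, and the master URL below is a placeholder. The spark.executorEnv.PYSPARK_PYTHON property is the same one used in the Zeppelin answer further down.
# Sketch: force the workers onto the Anaconda interpreter (path taken from
# the question; it must exist at the same absolute path on every node).
import os
from pyspark.sql import SparkSession

anaconda_python = os.path.expanduser("~/anaconda3/envs/Python3.6/bin/python3")
os.environ["PYSPARK_PYTHON"] = anaconda_python  # read when the context is created

spark = (
    SparkSession.builder
    .master("spark://<master-host>:7077")  # placeholder; use your cluster's master URL
    .config("spark.executorEnv.PYSPARK_PYTHON", anaconda_python)  # also export it to executors
    .getOrCreate()
)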

How to add customized jar in Jupyter Notebook in Scala

I need to use a third party jar (mysql) in my Scala script, if I use spark shell, I can specify the jar in the starting command like below:
spark2-shell --driver-class-path mysql-connector-java-5.1.15.jar --jars /opt/cloudera/parcels/SPARK2/lib/spark2/jars/mysql-connector-java-5.1.15.jar
However, how can I do this in a Jupyter notebook? I remember there is a magic to do it in PySpark, but I am using Scala, and I can't change the environment settings of the kernel I am using.
I have the solution now, and it is very simple indeed:
Use a Toree-based Scala kernel (which is what I am using).
Use the %AddJar magic in a notebook cell and run it; the jar will be downloaded and voila!
That's it.
%AddJar http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.15/mysql-connector-java-5.1.15.jar

Setting Specific Python in Zeppelin Interpreter

What do I need to do, beyond setting "zeppelin.pyspark.python", to make a Zeppelin interpreter use a specific Python executable?
Background:
I'm using Apache Zeppelin connected to a Spark+Mesos cluster. The cluster's worked fine for several years. Zeppelin is new and works fine in general.
But I'm unable to import numpy within functions applied to an RDD in pyspark. When I use Python's subprocess module to locate the Python executable, it shows that the code is being run in the system's Python, not in the virtualenv it needs to be in.
So I've seen a few questions on this issue that say the fix is to set "zeppelin.pyspark.python" to point to the correct python. I've done that and restarted the interpreter a few times. But it is still using the system Python.
Is there something additional I need to do? This is using Zeppelin 0.7.
On an older, custom snapshot build of Zeppelin I've been using on an EMR cluster, I set the following two properties to use a specific virtualenv:
"zeppelin.pyspark.python": "/path/to/bin/python",
"spark.executorEnv.PYSPARK_PYTHON": "/path/to/bin/python"
With your venv activated, check the interpreter path from Python:
(my_venv)$ python
>>> import sys
>>> sys.executable
# http://localhost:8080/#/interpreters
# search for 'python'
# set `zeppelin.python` to output of `sys.executable`