The Pyspark always use the system‘s python - pyspark

We know that a system has two Python:
①system's python
/usr/bin/python
②user's python
~/anaconda3/envs/Python3.6/bin/python3
Now I have a cluster with my Desktop(master) and Laptop(slave).
It's OK for different mode of PysparkShell if I set like this:
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
for both two nodes' ~/.bashrc
However,I want to configure it with jupyter notebook.So I set like this in each node's
~/.bashrc
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
then I get the log
My Spark version is:
spark-3.0.0-preview2-bin-hadoop3.2
I have read all the answers in
environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
and
different version 2.7 than that in driver 3.6 in jupyter/all-spark-notebook
But no luck.
I guess slave's python2.7 is from system's python.not from anaconda's python.
How to force spark's slave node to use anaconda's python?
Thanks~!

Jupiyter is looking for ipython, you probably only have ipython installed in your system python.
In order to use jupyter in different python version. You need to use python version manager (pyenv), and python environment manager(virtualenv), together you can choose which version of python you are going to use and which environment you are going to install jupyter, and fully isolated python versions and packages.
Install ipykernel in your chosen python environment and install jupyter.
After you finish above step. You need to make sure that the Spark worker will switch to your chosen python version and environment every time Spark ReourceManager launches a worker executor. In order to swtich python version and environment when the Spark worker executor, you need to make sure that a little script ran right after the Spark Resource Manager ssh into worker:
go to the python environment directory
source 'whatever/bin/activate'
After you have done above steps, you should have chosen python version and jupyter ran by Spark worker executor.

Related

pyspark pip installation hive-site.xml

I have installed pyspark using (pipenv install pyspark) and type pyspark after activating 'pipenv shell'
I can able to open pyspark terminal and able to run few spark code.
but I am trying to figure out to enable Hive (for that where I need to place hive-site.xml (with mysql metastore properties) and not able to see any spark/config folder in order to place hive-site.xml).
Unfortunately the existing application much relied on Pipefile (so i have to follow pipenv install pyspark)

Different Python version between Dataproc master and worker nodes

I had created a dataproc cluster with Anaconda as optional component and created a virtual env. in that. Now when running a pyspark py file on master node I'm getting this error -
Exception: Python in worker has different version 2.7 than that in
driver 3.6, PySpark cannot run with different minor versions.Please
check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
are correctly set.
I need RDKit package inside the virtual env. and with that python 3x version gets installed. The following commands on my master node and then the python version changes.
conda create -n my-venv -c rdkit rdkit=2019.*
conda activate my-venv
conda install -c conda-forge rdkit
How can I solve this?
There's a few things here:
The 1.3 (default) image uses conda with Python 2.7. I recommend switching to 1.4 (--image-version 1.4) which uses conda with Python 3.6.
If this library will be needed on the workers you can use this initialization action to apply the change consistently to all nodes.
Pyspark does not currently support virtualenvs, but this support is coming. Currently you can run pyspark program from within a virtualenv, but this will not mean workers will run inside the virtualenv. Is it possible to apply your changes to the base conda environment without virtualenv?
Additional info can be found here https://cloud.google.com/dataproc/docs/tutorials/python-configuration

ToreeInstall ERROR | Unknown interpreter PySpark. toree can not install PySpark

When I install PySpark for Jupyter notebook, I using this cmd:
jupyter toree install --kernel_name=tanveer --interpreters=PySpark --python="/usr/lib/python3.6"
But, I get the tips of
[ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
So I don't know what a problem. I have set up Toree's Scala and SQL successfully. thinks
Toree version 0.3.0 removed support for PySpark and SparkR:
Removed support for PySpark and Spark R in Toree (use specific kernels)
Release notes here: incubator-toree release notes
I am not sure what "use specific kernels" means and continue to look for a Jupyter PySpark kernel.
As also mentioned in Lee's answer, Toree version 0.3.0 removed support for PySpark and SparkR. As per their release notes, they asked to "use specific kernels". For PySpark, this means manually install pyspark to be used with Jupyter.
Steps are simple as follow:
Install pyspark. Either by pip install pyspark, or by download Apache Spark binary package and decompress into a specific folder.
Add the following 3 environment variables. How to do this depends on your OS. For example, on my MacOS, I added the following lines to the file ~/.bash_profile
export SPARK_HOME=<path_to_your_installed_spark_files>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
That's it. To start your PySpark Jupyter Notebook, simply run "pyspark" from your command line, and choose "Python" kernel
Refer to https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788835367/1/ch01lvl1sec17/installing-jupyter
or
https://opensource.com/article/18/11/pyspark-jupyter-notebook for more detailed instructions.

Setting Specific Python in Zeppelin Interpreter

What do I need to do beyond setting "zeppelin.pyspark.python" to make a Zeppelin interpreter us a specific Python executable?
Background:
I'm using Apache Zeppelin connected to a Spark+Mesos cluster. The cluster's worked fine for several years. Zeppelin is new and works fine in general.
But I'm unable to import numpy within functions applied to an RDD in pyspark. When I use Python subprocess to locate the Python executable, it shows that the code is being run in the system's Python, not in the virutalenv it needs to be in.
So I've seen a few questions on this issue that say the fix is to set "zeppelin.pyspark.python" to point to the correct python. I've done that and restarted the interpreter a few times. But it is still using the system Python.
Is there something additional I need to do? This is using Zeppelin 0.7.
On an older, custom snapshot build of Zeppelin I've been using on an EMR cluster, I set the following two properties to use a specific virtualenv:
"zeppelin.pyspark.python": "/path/to/bin/python",
"spark.executorEnv.PYSPARK_PYTHON": "/path/to/bin/python"
When you are in your activated venv in python:
(my_venv)$ python
>>> import sys
>>> sys.executable
# http://localhost:8080/#/interpreters
# search for 'python'
# set `zeppelin.python` to output of `sys.executable`

virtualenv can't execute on my system?

I'm trying to create a virtual environment to deploy a Flask app. However, when I try to create a virtual environment using virtualenv, I get this error:
Using base prefix '//anaconda'
New python executable in /Users/sydney/Desktop/ptproject/venv/bin/python
ERROR: The executable /Users/sydney/Desktop/ptproject/venv/bin/python is not functioning
ERROR: It thinks sys.prefix is '/Users/sydney/Desktop/ptproject' (should be '/Users/sydney/Desktop/ptproject/venv')
ERROR: virtualenv is not compatible with this system or executable
I think that I installed virtualenv using conda. When I use which virtualenv, I get this
//anaconda/bin/virtualenv
Is this an incorrect location for virtualenv? I can't figure out what else the problem would be. I don't understand the error log at all.
It turns out that virtualenv just doesn't work correctly with conda. For example:
https://github.com/conda/conda/issues/1367
(A workaround is proposed at the end of that thread, but it looks like you may be seeing a slightly different error, so maybe it won't work for you.)
Instead of deploying your app with virtualenv, why not just use a proper conda environment? Conda environments are more general (and powerful) than those provided by virtualenv.
For example, to create a new environment with python-2.7 and flask in it:
conda create -n my-new-env flask python=2.7