I created a Dataproc cluster with Anaconda as an optional component and created a virtual environment in it. Now, when I run a PySpark .py file on the master node, I get this error:
Exception: Python in worker has different version 2.7 than that in
driver 3.6, PySpark cannot run with different minor versions.Please
check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
are correctly set.
I need the RDKit package inside the virtual environment, and installing it brings in a Python 3.x version. I ran the following commands on my master node, and after that the Python version changes.
conda create -n my-venv -c rdkit rdkit=2019.*
conda activate my-venv
conda install -c conda-forge rdkit
How can I solve this?
There are a few things here:
The 1.3 (default) image uses conda with Python 2.7. I recommend switching to 1.4 (--image-version 1.4) which uses conda with Python 3.6.
If this library will be needed on the workers, you can use this initialization action to apply the change consistently to all nodes.
PySpark does not currently support virtualenvs, but support is coming. For now you can run a PySpark program from within a virtualenv, but that does not mean the workers will run inside it. Is it possible to apply your changes to the base conda environment without a virtualenv?
Additional information can be found here: https://cloud.google.com/dataproc/docs/tutorials/python-configuration
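For reference, the exception in the question is raised because the driver and the workers disagree on their major.minor Python version. Below is a minimal sketch of that comparison in plain Python. The function names are hypothetical and this is an illustration of the rule, not PySpark's own code.

```python
import sys


def major_minor(version_info=sys.version_info):
    """Return 'X.Y' for a version tuple, e.g. (3, 6, 9) -> '3.6'."""
    return "%d.%d" % (version_info[0], version_info[1])


def same_minor_version(driver_version, worker_version):
    """True when driver and worker agree on major.minor (patch may differ)."""
    return major_minor(driver_version) == major_minor(worker_version)


# The situation from the question: a Python 3.6 driver and Python 2.7 workers.
print(same_minor_version((3, 6, 9), (2, 7, 15)))  # False
# Same minor version, different patch level: this combination is accepted.
print(same_minor_version((3, 6, 9), (3, 6, 5)))   # True
```

Printing major_minor() on the master and on a worker node is a quick way to see which interpreter each side actually picks up.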
Related
I am trying to run a simple DAG with only a dummy operator and a Databricks operator (DatabricksRunNowOperator), just to test it. I uploaded the DAG into the Airflow container, but the Databricks operator is not part of the core Airflow package. I installed it locally with pip install apache-airflow-providers-databricks. As a result, the package is not present in the container and an error occurs.
Does anyone know how I can provide the mentioned package to the Airflow container?
If you use Docker Compose as recommended by the official Airflow documentation on Docker setup, you can specify additional dependencies with the _PIP_ADDITIONAL_REQUIREMENTS environment variable (it can also be put into the .env file in the same folder). For example, I have the following in my testing environment:
_PIP_ADDITIONAL_REQUIREMENTS="apache-airflow-providers-databricks==2.4.0rc1"
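To confirm that the provider actually landed in the container, a small check can be run with the container's Python. This is only a sketch: module_available is a hypothetical helper name, and airflow.providers.databricks is the import path the provider package exposes.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if Python can locate the module.

    For a dotted name, find_spec imports the parent packages first, so a
    missing parent (for example 'airflow' itself) surfaces as ImportError.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        return False
```

Calling module_available("airflow.providers.databricks") inside the container should return True once the extra requirement has been installed, and False if the package was only installed on the host.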
We know that a system can have two Pythons:
①system's python
/usr/bin/python
②user's python
~/anaconda3/envs/Python3.6/bin/python3
Now I have a cluster with my desktop (master) and laptop (slave).
The PySpark shell works in its different modes if I set this:
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
in ~/.bashrc on both nodes.
However, I want to use it with Jupyter Notebook, so I set this in each node's ~/.bashrc:
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
then I get the log
My Spark version is:
spark-3.0.0-preview2-bin-hadoop3.2
I have read all the answers in
environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
and
different version 2.7 than that in driver 3.6 in jupyter/all-spark-notebook
But no luck.
I guess the slave's Python 2.7 is the system Python, not Anaconda's Python.
How can I force Spark's slave node to use Anaconda's Python?
Thanks~!
Jupyter is looking for ipython; you probably only have ipython installed in your system Python.
To use Jupyter with a different Python version, you need a Python version manager (pyenv) and a Python environment manager (virtualenv). Together they let you choose which Python version to use and which environment to install Jupyter in, with fully isolated Python versions and packages.
Install ipykernel in your chosen python environment and install jupyter.
After you finish the step above, you need to make sure that the Spark worker switches to your chosen Python version and environment every time the Spark resource manager launches a worker executor. To do that, make sure a small script runs right after the Spark resource manager SSHes into the worker:
go to the python environment directory
source 'whatever/bin/activate'
After these steps, the Spark worker executors should run with your chosen Python version and environment, the same one Jupyter uses.
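A different approach that is sometimes used from a notebook, offered here as an assumption rather than as part of the answer above: set PYSPARK_PYTHON in the driver process before the SparkContext is created, so that executors are asked to launch that interpreter. The helper name is hypothetical, and the path is the one from the question.

```python
import os


def pin_worker_python(interpreter_path, environ=os.environ):
    """Point PYSPARK_PYTHON at one interpreter, expanding a leading '~'."""
    resolved = os.path.expanduser(interpreter_path)
    environ["PYSPARK_PYTHON"] = resolved
    return resolved


# Path taken from the question. '~' is expanded on the driver, so the
# resulting absolute path has to exist on every worker as well.
pin_worker_python("~/anaconda3/envs/Python3.6/bin/python3")
# A SparkSession / SparkContext would only be created after this point.
```

If the home directory differs between the nodes, passing an absolute path that is valid everywhere avoids the expansion issue.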
When I install PySpark for Jupyter Notebook, I use this command:
jupyter toree install --kernel_name=tanveer --interpreters=PySpark --python="/usr/lib/python3.6"
But I get this message:
[ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
So I don't know what the problem is. I have set up Toree's Scala and SQL interpreters successfully. Thanks.
Toree version 0.3.0 removed support for PySpark and SparkR:
Removed support for PySpark and Spark R in Toree (use specific kernels)
Release notes here: incubator-toree release notes
I am not sure what "use specific kernels" means and continue to look for a Jupyter PySpark kernel.
As also mentioned in Lee's answer, Toree version 0.3.0 removed support for PySpark and SparkR. Their release notes say to "use specific kernels". For PySpark, this means manually installing pyspark to be used with Jupyter.
The steps are as follows:
Install pyspark, either with pip install pyspark or by downloading the Apache Spark binary package and decompressing it into a specific folder.
Add the following 3 environment variables. How to do this depends on your OS. For example, on macOS, I added the following lines to the file ~/.bash_profile:
export SPARK_HOME=<path_to_your_installed_spark_files>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
That's it. To start your PySpark Jupyter Notebook, simply run "pyspark" from your command line and choose the "Python" kernel.
Refer to https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788835367/1/ch01lvl1sec17/installing-jupyter
or
https://opensource.com/article/18/11/pyspark-jupyter-notebook for more detailed instructions.
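If editing the shell profile is not convenient, the same three variables can be assembled in Python and handed to a child process. This is a sketch under assumptions: pyspark_notebook_env is a hypothetical helper, and "/opt/spark" is a placeholder for wherever Spark was unpacked.

```python
import os


def pyspark_notebook_env(spark_home, base_env=None):
    """Return a copy of the environment with the three variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


# An empty base environment is used here only to keep the printout short.
env = pyspark_notebook_env("/opt/spark", base_env={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

In practice the default base environment would be kept so that PATH is preserved, and the result passed as the env argument of subprocess.run(["pyspark"], env=...).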
I'm trying to create a virtual environment to deploy a Flask app. However, when I try to create a virtual environment using virtualenv, I get this error:
Using base prefix '//anaconda'
New python executable in /Users/sydney/Desktop/ptproject/venv/bin/python
ERROR: The executable /Users/sydney/Desktop/ptproject/venv/bin/python is not functioning
ERROR: It thinks sys.prefix is '/Users/sydney/Desktop/ptproject' (should be '/Users/sydney/Desktop/ptproject/venv')
ERROR: virtualenv is not compatible with this system or executable
I think that I installed virtualenv using conda. When I use which virtualenv, I get this
//anaconda/bin/virtualenv
Is this an incorrect location for virtualenv? I can't figure out what else the problem would be. I don't understand the error log at all.
It turns out that virtualenv just doesn't work correctly with conda. For example:
https://github.com/conda/conda/issues/1367
(A workaround is proposed at the end of that thread, but it looks like you may be seeing a slightly different error, so maybe it won't work for you.)
Instead of deploying your app with virtualenv, why not just use a proper conda environment? Conda environments are more general (and powerful) than those provided by virtualenv.
For example, to create a new environment with python-2.7 and flask in it:
conda create -n my-new-env flask python=2.7
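To check which kind of environment an interpreter is running in, a short heuristic can help. The sketch below assumes that conda environments contain a conda-meta directory under their prefix and that virtual environments report a sys.prefix different from sys.base_prefix (older virtualenv releases used sys.real_prefix instead, which this does not cover). The function name is hypothetical.

```python
import os
import sys


def environment_kind(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Classify the interpreter rooted at prefix as conda, venv, or system."""
    if os.path.isdir(os.path.join(prefix, "conda-meta")):
        return "conda"
    if prefix != base_prefix:
        return "virtualenv/venv"
    return "system"


print(sys.prefix, "->", environment_kind())
```

Run with the interpreter inside my-new-env, this would be expected to report "conda".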
I'm trying to imagine a workflow that could be applied on a scientific work environment. My work involves doing some scientific coding, basically with Python, pandas, numpy and friends. Sometimes I have to use some modules that are not common standards in the scientific community and sometimes I have to integrate some compiled code in my chain of simulations. The code I run is most of the time parallelised with IPython notebook.
What do I find interesting about docker?
The fact that I could create a Docker image containing my code and its working environment. I can then send the image to my colleagues without asking them to change their work environment, e.g., install an outdated version of a module so that they can run my code.
A rough draft of the workflow I have in mind goes something as follows:
Develop locally until I have a version I want to share with somebody.
Build a Docker image, possibly with a hook from a git repo.
Share the image.
Can somebody give me some pointers on what I should take into account to develop this workflow further? One point that intrigues me: can code running in a Docker container launch parallel processes on the several cores of the machine, e.g., an IPython notebook connected to a cluster?
Docker can launch multiple processes/threads on multiple cores. Running multiple processes may require a supervisor (see: https://docs.docker.com/articles/using_supervisord/).
You should probably build an image that contains the things you always use and use it as a base for all your projects. (This would save you the pain of writing a complete Dockerfile each time.)
Why not develop directly in a container and use the commit command to save your progress to a local Docker registry? Then share the final image with your colleagues.
How to make a local registry: https://blog.codecentric.de/en/2014/02/docker-registry-run-private-docker-image-repository/
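On the question of using several cores from inside a container, a quick way to see how many CPUs a Python process may actually run on is sketched below. It relies on os.sched_getaffinity, which exists on Linux only, and falls back to os.cpu_count elsewhere; the function name is hypothetical.

```python
import os


def usable_cpu_count():
    """Number of CPUs this process may run on (affinity-aware on Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


print("CPUs available to this process:", usable_cpu_count())
```

Run inside the container, this reflects CPU pinning such as docker run --cpuset-cpus. A CPU time quota set with --cpus is not visible through affinity, so treat the number as an upper bound when sizing a pool of parallel workers.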
Even though you'll have a full container, I think a package manager like conda can still be a solid part of the base image for your workflow.
FROM ubuntu:14.04
RUN apt-get update && apt-get install curl -y
# Install miniconda
RUN curl -LO http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
RUN bash Miniconda-latest-Linux-x86_64.sh -p /miniconda -b
RUN rm Miniconda-latest-Linux-x86_64.sh
ENV PATH=/miniconda/bin:${PATH}
RUN conda update -y conda
* from nice example showing docker + miniconda + flask
Wrt doing source activate <env> in the Dockerfile you need to:
RUN /bin/bash -c "source activate <env> && <do something in the env>"