Import PySpark error in Anaconda venv on Dataproc - pyspark

I have spun up a Dataproc cluster with Anaconda as an additional component. I created a virtual environment in Anaconda and installed RDKit inside it. Now, when I open a Python terminal and run this:
from pyspark import SparkContext
It throws an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'
I can install PySpark inside the Anaconda venv, and then it works, but I want to use the PySpark that is pre-installed on Dataproc. How can I resolve this?

To use Dataproc's PySpark in a new Conda environment, you need to install the package at file:///usr/lib/spark/python inside that environment:
conda create -c rdkit -n rdkit-env rdkit
conda activate rdkit-env
sudo "${CONDA_PREFIX}/bin/pip" install -e "file:///usr/lib/spark/python"
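After the install, a quick check from inside the activated environment can confirm which copy of a module Python actually imports. The sketch below is a minimal example: module_file is a hypothetical helper name, and the expectation that pyspark resolves to a path under /usr/lib/spark/python is an assumption based on the editable install above, not something verified here.

```python
import importlib


def module_file(name):
    """Return the file a module is loaded from, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Built-in modules and namespace packages may have no __file__ (or None).
    return getattr(module, "__file__", None)


if __name__ == "__main__":
    # Assumption: with the editable install in place, this should print a path
    # under /usr/lib/spark/python; None means pyspark is still not importable.
    print(module_file("pyspark"))
```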

Related

bcc: ImportError cannot import name BPF

I am getting the following error when trying to run the example hello_world.py.
Traceback (most recent call last):
File "/usr/share/bcc/examples/hello_world.py", line 9, in <module>
from bcc import BPF
ImportError: cannot import name BPF
I installed bcc from source (link).
I also installed both the python bcc bindings packages, python-bcc and python3-bcc but no luck.
I am running Ubuntu 18.04 and kernel version 4.15.0-117-generic.
What am I missing here?
On Ubuntu 20.04, I ran the following command to fix it:
sudo apt-get install bpfcc-tools linux-headers-$(uname -r)
I found the issue. I was using pyenv to manage my Python versions, so Python was looking for files in the wrong locations.
$ python -c 'import site; print(site.getsitepackages())'
['/home/sagar/.pyenv/versions/3.6.6/lib/python3.6/site-packages']
I then tried the python3 command, which was not installed by pyenv, and I don't get the above error.
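To see which interpreter each command name resolves to, something like the following sketch may help. resolve_commands is a hypothetical helper; the idea that python and python3 resolve to different paths when a version manager such as pyenv puts its own directory first on PATH is an assumption about a typical setup, not a statement about this machine.

```python
import os
import shutil


def resolve_commands(names):
    """Map each command name to the real path it resolves to on PATH, or None."""
    resolved = {}
    for name in names:
        path = shutil.which(name)
        # realpath follows symlinks, so aliases of the same binary compare equal.
        resolved[name] = os.path.realpath(path) if path else None
    return resolved


if __name__ == "__main__":
    for name, path in resolve_commands(["python", "python3"]).items():
        print(name, "->", path)
```

If the two paths differ, packages installed for one interpreter will generally not be visible to the other.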

How to execute the right Python to import the installed tensorflow.transform package?

The version of my Python is 2.7.13.
I run the following in Jupyter Notebook.
Firstly, I installed the packages
%%bash
pip uninstall -y google-cloud-dataflow
pip install --upgrade --force tensorflow_transform==0.15.0 apache-beam[gcp]
Then,
%%bash
pip freeze | grep -e 'flow\|beam'
I can see that the package tensorflow-transform is installed.
apache-beam==2.19.0
tensorflow==2.1.0
tensorflow-datasets==1.2.0
tensorflow-estimator==2.1.0
tensorflow-hub==0.6.0
tensorflow-io==0.8.1
tensorflow-metadata==0.15.2
tensorflow-probability==0.8.0
tensorflow-serving-api==2.1.0
tensorflow-transform==0.15.0
However, when I tried to import it, I got a warning and an error:
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/api/_v1/estimator/__init__.py:12: The name tf.estimator.inputs is deprecated. Please use tf.compat.v1.estimator.inputs instead.
ImportError Traceback (most recent call last)
<ipython-input-3-26a4792d0a76> in <module>()
1 import tensorflow as tf
----> 2 import tensorflow_transform as tft
3 import shutil
4 print(tf.__version__)
ImportError: No module named tensorflow_transform
After some investigation, I think I have some ideas of the problem.
I run this:
%%bash
pip show tensorflow_transform| grep Location
This is the output
Location: /home/jupyter/.local/lib/python3.5/site-packages
I tried to modify $PATH by adding /home/jupyter/.local/lib/python3.5/site-packages to the beginning of it. However, I still failed to import tensorflow_transform.
Based on the above and the following information, I think that when I run the import command, it executes under Python 2.7, not Python 3.5:
import sys
print('\n'.join(sys.path))
/usr/lib/python2.7
/usr/lib/python2.7/plat-x86_64-linux-gnu
/usr/lib/python2.7/lib-tk
/usr/lib/python2.7/lib-old
/usr/lib/python2.7/lib-dynload
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
/usr/local/lib/python2.7/dist-packages/IPython/extensions
/home/jupyter/.ipython
Also,
import sys
sys.executable
'/usr/bin/python2'
I think the problem is that the tensorflow_transform package was installed in /home/jupyter/.local/lib/python3.5/site-packages, but when I run "import", Python searches /usr/local/lib/python2.7/dist-packages for the package rather than /home/jupyter/.local/lib/python3.5/site-packages, so even updating $PATH does not help. Am I right?
I tried to upgrade my python, but
%%bash
pip install upgrade python
Defaulting to user installation because normal site-packages is not writeable
Then I added --user, but Python does not seem to have been upgraded:
%%bash
pip install --user upgrade python
%%bash
python -V
Python 2.7.13
Any solution?
It seems to me that your Jupyter notebook is not using the right Python environment.
Perhaps you installed the package under version 3.5 but the notebook uses the other version, so it cannot find the library.
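One detail behind this mismatch: Python resolves imports through the sys.path of the running interpreter, not through the shell's PATH, which would explain why editing PATH did not help. The sketch below (Python 3, all names hypothetical) creates a throwaway module in a temporary directory and checks whether it can be found after each change.

```python
import importlib
import importlib.util
import os
import sys
import tempfile

MODULE_NAME = "path_demo_module_abc123"  # hypothetical throwaway module name


def path_vs_sys_path():
    """Return (found after editing PATH, found after editing sys.path)."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, MODULE_NAME + ".py"), "w") as handle:
            handle.write("VALUE = 1\n")

        old_path = os.environ.get("PATH")
        os.environ["PATH"] = tmp + os.pathsep + (old_path or "")
        try:
            importlib.invalidate_caches()
            via_env_path = importlib.util.find_spec(MODULE_NAME) is not None

            sys.path.insert(0, tmp)
            try:
                importlib.invalidate_caches()
                via_sys_path = importlib.util.find_spec(MODULE_NAME) is not None
            finally:
                sys.path.remove(tmp)
        finally:
            # Restore the original PATH so the demo leaves no trace.
            if old_path is None:
                del os.environ["PATH"]
            else:
                os.environ["PATH"] = old_path
    return via_env_path, via_sys_path


if __name__ == "__main__":
    print(path_vs_sys_path())  # Expected: (False, True)
```

This only illustrates the lookup mechanism. Pointing one Python version at another version's site-packages is generally not a good idea; installing the package for the interpreter the notebook uses is the safer route.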
In VS Code, you can pick the other interpreter by clicking on "Python (your version)" at the bottom left.
You can also do this via:
Ctrl+Shift+P > Select Python Interpreter to start Jupyter Server
If that does not work, make sure that the package you are trying to import is installed under the correct Python environment.
If not, open a terminal, activate the environment, and install it using:
pip install packagename
For example, I did the same thing here (note: I'm using Anaconda):
[Screenshot: installing tensorflow_transform]
After installation, you can import it directly in your code like this:
[Screenshot: importing tensorflow_transform]
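A common way to avoid installing into the wrong environment is to invoke pip through the interpreter that is actually running the notebook. The sketch below only builds and prints the command; pip_install_command is a hypothetical helper, and the actual install call is left commented out because it needs network access and changes the environment.

```python
import subprocess
import sys


def pip_install_command(package, user=False):
    """Build a pip command bound to the interpreter running this code."""
    command = [sys.executable, "-m", "pip", "install"]
    if user:
        command.append("--user")
    command.append(package)
    return command


if __name__ == "__main__":
    command = pip_install_command("tensorflow_transform==0.15.0")
    print(" ".join(command))
    # Uncomment to actually install into the current kernel's environment:
    # subprocess.check_call(command)
```

Because the command starts with sys.executable, the package lands in the site-packages of the same interpreter that later runs the import.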

installing python package in sagemaker sparkmagic pyspark notebook

I want to install new libraries in a running kernel (not by bootstrapping). I'm able to create a SageMaker notebook that is connected to an EMR cluster, but installing packages is a headache.
I am unable to install packages on the notebook. I've tried several methods, such as installing packages via the terminal in JupyterLab.
$ conda install numba
The installation seems to work fine on the conda_pytorch_p36 notebook, but the packages are not installed on the SparkMagic (PySpark) notebook.
Error code:
An error was encountered:
No module named numba
Traceback (most recent call last):
ImportError: No module named numba
The Jupyter shell command (!pip) also fails, but only in the PySpark notebook:
!pip install keras
Error:
An error was encountered:
invalid syntax (<stdin>, line 1)
File "<stdin>", line 1
!pip install keras
^
SyntaxError: invalid syntax
Based on an answer in a GitHub post, I also tried this, but it did not work either:
from subprocess import call
call("pip install dm-sonnet".split(" "))
When you run $ conda install numba via the terminal in JupyterLab, the installation actually succeeds, but in your local environment. When you use SparkMagic as your kernel, the code in the cells always runs on the Spark cluster, not in the local notebook environment. To run the content of a cell locally, write %%local at the beginning of the cell. Everything in that cell will then run locally, and the installed module will be available.
Otherwise, you should install the module on the remote Spark cluster.
Read more here:
https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Pyspark%20Kernel.ipynb
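To see where a given cell actually executes, a small probe like the sketch below can be run twice: once in a normal cell and once in a cell that starts with %%local. where_am_i is a hypothetical helper, and the expectation that the two runs report different hosts assumes the notebook instance and the Spark driver are separate machines.

```python
import platform
import socket
import sys


def where_am_i():
    """Describe the host and interpreter executing this code."""
    return {
        "hostname": socket.gethostname(),
        "executable": sys.executable,
        "python_version": platform.python_version(),
    }


# In a SparkMagic PySpark notebook, put %%local on the first line of the cell
# to run this on the notebook instance instead of on the cluster.
print(where_am_i())
```

If the two runs report different interpreters, a package installed from the JupyterLab terminal is only visible to the one reported by the %%local run.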

nbformat error trying to open jupyter notebook

I am trying to install Jupyter Notebook in macOS Mojave.
I have tried with Anaconda, but although anaconda3 is apparently installed correctly, with the correct path in my bash profile, I get:
$ jupyter notebook
-bash: jupyter: command not found
I have also tried with
python3 -m pip install --upgrade pip
python3 -m pip install jupyter
but after I try to open jupyter notebook I get this error:
$ jupyter notebook
Traceback (most recent call last):
File "/Users/danielavargasrobles/miniconda3/bin/jupyter-notebook", line 7, in <module>
from notebook.notebookapp import main
File "/Users/danielavargasrobles/miniconda3/lib/python3.6/site-packages/notebook/notebookapp.py", line 83, in <module>
from .services.contents.manager import ContentsManager
File "/Users/danielavargasrobles/miniconda3/lib/python3.6/site-packages/notebook/services/contents/manager.py", line 17, in <module>
from nbformat import sign, validate as validate_nb, ValidationError
ModuleNotFoundError: No module named 'nbformat'
Does anyone have an idea of what is happening?
Thank you!
Did you install Jupyter inside your environment?
I had a similar issue: Jupyter was launching because it was installed outside the venv, but because it was not installed inside the environment, it led to the nbformat issue.
conda activate your_env
conda install jupyter
should do the trick.
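To confirm whether the active environment really contains the pieces Jupyter needs, a check along these lines may help. It is a sketch: missing_modules is a hypothetical helper, and the module list is taken from the traceback above rather than from any official requirement list.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means both modules are visible to this interpreter.
    print(missing_modules(["notebook", "nbformat"]))
```

Run it with the Python of the activated environment, so that the result reflects that environment rather than a different installation.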

Can't import TensorFlow in Anaconda

I've installed TensorFlow successfully from the Anaconda command prompt, but when I use IPython or Spyder (Anaconda) to import it, I get this error:
<ipython-input-2-41389fad42b5> in <module>()
----> 1 import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
When I check pip list, I can't find TensorFlow in the list.
During the installation, I ran this:
C:> conda create -n tensorflow python=3.5 anaconda
Please check that you followed all the steps:
conda create -n tensorflow python=3.5
activate tensorflow
pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/windows/cpu/tensorflow-1.2.1-cp35-cp35m-win_amd64.whl
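After these steps, a quick check from inside IPython or Spyder can show which environment the running interpreter belongs to. The sketch below reads CONDA_DEFAULT_ENV, which conda sets on activation, and sys.prefix; active_conda_env is a hypothetical helper, and the variable may be absent if the application was launched without activating an environment.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the active conda environment name, or None if none is recorded."""
    if environ is None:
        environ = os.environ
    return environ.get("CONDA_DEFAULT_ENV") or None


if __name__ == "__main__":
    print("conda env:", active_conda_env())
    # sys.prefix is the root directory of the interpreter that is running.
    print("interpreter prefix:", sys.prefix)
```

If the reported environment is not the one where TensorFlow was installed, the import will fail even though the installation itself succeeded.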
When opening Spyder from Anaconda, please select "tensorflow" as the environment. There is an "Applications on" option just above the list of IDEs; select "tensorflow" there and this should work. You can also change your PATH environment variable to
"XXXX\AppData\Local\Continuum\Anaconda3\envs\tensorflow"