python module not accessible from EMR notebook - pyspark

I am using an EMR notebook attached to my cluster for some experimentation. I needed to install some Python modules for testing, specifically spacy and its data module en_core_web_sm.
I ssh'ed into the master and core nodes and downloaded the modules on each one individually. However, I am not able to import them from my EMR notebook. I get the following error:
An error was encountered:
No module named 'spacy'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'spacy'
I know there is a way to install them just for the scope of EMR notebook, but this wouldn't suffice in a production scenario, so please avoid answers which suggest notebook installing as mentioned in this guide : https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
Please let me know if I am missing some setup steps. Appreciate your response.

You can use bootstrap actions to install additional modules when creating your EMR cluster:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
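As a rough sketch of what that looks like when the cluster is created from Python: boto3's EMR run_job_flow call accepts a BootstrapActions list, and each entry points at a script in S3. The action name and the S3 path below are hypothetical placeholders, and the snippet only builds the entry rather than calling AWS.

```python
def spacy_bootstrap_action(script_path):
    """Build one entry for the BootstrapActions list of boto3's EMR run_job_flow call."""
    return {
        "Name": "install-spacy",  # hypothetical label, shown in the EMR console
        "ScriptBootstrapAction": {
            "Path": script_path,  # S3 location of the shell script to run on every node
            "Args": [],           # extra command-line arguments for the script, if any
        },
    }


# Hypothetical S3 location; replace with wherever your script actually lives.
action = spacy_bootstrap_action("s3://my-bucket/bootstrap/install_spacy.sh")
print(action)
# It would then be passed as: emr_client.run_job_flow(..., BootstrapActions=[action])
```

Because bootstrap actions run on every node while the cluster is being provisioned, the modules end up on the core nodes as well as the master.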

I was able to solve this by changing the bootstrap script to use sudo instead of --user. (You could also run the commands below manually.)
Before, I was running
pip3 install spacy --user
python3 -m spacy download en --user
I changed that script to
sudo pip3 install spacy
sudo python3 -m spacy download en
To verify this solution quickly, run the following command from your EMR notebook (to compare before and after):
sc.list_packages()
You should see an output similar to this
SparkSession available as 'spark'.
Package Version
-------------------------- ----------
beautifulsoup4 4.9.0
blis 0.4.1
boto 2.49.0
catalogue 1.0.0
certifi 2020.4.5.2
chardet 3.0.4
cymem 2.0.3
en-core-web-sm 2.3.0
idna 2.9
importlib-metadata 1.6.1
jmespath 0.9.5
lxml 4.5.0
murmurhash 1.0.2
mysqlclient 1.4.2
nltk 3.4.5
nose 1.3.4
numpy 1.16.5
pip 9.0.1
plac 1.1.3
preshed 3.0.2
py-dateutil 2.2
python37-sagemaker-pyspark 1.3.0
pytz 2019.3
PyYAML 5.3.1
requests 2.24.0
setuptools 28.8.0
six 1.13.0
soupsieve 1.9.5
spacy 2.3.0
srsly 1.0.2
thinc 7.4.1
tqdm 4.46.1
urllib3 1.25.9
wasabi 0.6.0
wheel 0.29.0
windmill 1.6
zipp 3.1.0
This is not the best possible solution IMO, since the first warning that gets displayed after using sudo is
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
If anyone has a better solution, please feel free to post it.
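A likely explanation for the difference, which is an assumption and not something confirmed above: a --user install lands in the per-user site directory of whoever ran pip, so a notebook kernel running as a different OS user never sees it. A minimal sketch for checking this from the notebook, using only the standard library:

```python
import site
import sys


def user_site_status():
    """Report which interpreter is running and whether its per-user site directory is in use."""
    user_site = site.getusersitepackages()
    return {
        "interpreter": sys.executable,
        "user_site": user_site,
        "enabled": bool(site.ENABLE_USER_SITE),
        "on_sys_path": user_site in sys.path,
    }


print(user_site_status())
```

If the user_site path printed in the notebook differs from the one printed in an SSH session on the master node, packages installed with --user from that session will not be importable in the notebook.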

Related

Conda not using latest package version

I am using Anaconda (the latest version as of September). I ran 'conda update --all' to get the latest packages, in particular SciPy. But when I import scipy from a Python shell, it imports version 1.7.3, not the latest one (1.9.1).
I am not fluent with the conda framework. I tried running conda install scipy, which did not change anything, and conda install scipy=1.9.1, which seems to hang.
I am using the base environment. This is a fresh install of the latest Anaconda (with packages updated via conda, without any use of pip that could interfere).
Listing the packages via conda yields:
>>> conda list scipy
# packages in environment at /home/***/anaconda3:
#
# Name Version Build Channel
scipy 1.7.3 py39hc147768_0
However, when I look at the content of the anaconda3/pkgs folder:
>>> ls anaconda3/pkgs/ | grep scipy
anaconda3/pkgs/scipy-1.7.3-py39hc147768_0.conda
anaconda3/pkgs/scipy-1.9.1-py39h14f4228_0.conda
anaconda3/pkgs/scipy-1.7.3-py39hc147768_0 (contains Scipy 1.7.3)
anaconda3/pkgs/scipy-1.9.1-py39h14f4228_0 (contains Scipy 1.9.1)
anaconda3/pkgs/scipy-1.9.1-py39h14f4228_09jfxaf1g (empty folder)
anaconda3/pkgs/scipy-1.9.1-py39h14f4228_0dydy5wnw (empty folder)
So I am assuming that conda has both SciPy 1.7.3 and 1.9.1. But why can't I import the latest one?
How can I correct this situation?
EDIT: creating a new environment and reinstalling the packages as needed solves my problem. However, how come the base environment is stuck with the earlier version ?
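For context, the pkgs folder is conda's package cache, so a version appearing there has been downloaded or extracted but is not necessarily linked into the environment. What gets imported is whatever sits in the environment's own site-packages. A small sketch for confirming which copy Python actually loads (the helper name where_is is made up for illustration, and json stands in for scipy so the snippet runs anywhere):

```python
import importlib


def where_is(module_name):
    """Import a module and report the version and file path Python actually resolved."""
    module = importlib.import_module(module_name)
    return {
        "name": module_name,
        "version": getattr(module, "__version__", None),
        "file": getattr(module, "__file__", None),
    }


# "json" is a stdlib stand-in; replace it with "scipy" in the affected environment.
print(where_is("json"))
```

If the reported file path is inside anaconda3/lib/.../site-packages and the version is 1.7.3, the base environment simply has 1.7.3 linked, regardless of what the cache holds.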

ModuleNotFoundError thrown by using pytest in virtualenv in python 3.10

I am new to pytest and tried to write a first test case. Since I am new to programming overall, I developed without a virtual environment (naughty, naughty) using the globally installed python version 3.9.13.
The structure of my program is like this:
mypkg/
    sub_mypkg/
        file_a.py
    pytest.ini
    testing/
        __init__.py
        test_a.py
in which file_a.py imports pandas among other modules. The tests in test_a.py try, among other things, to run file_a.py. pytest.ini adds the root_dir to the PYTHONPATH.
In this setup pytest ran smoothly without any errors.
I then installed a virtualenv using python 3.11 in this project and installed all necessary modules (including pytest) in it and uninstalled pytest globally. After activating the virtualenv and running pytest from the terminal a ModuleNotFoundError was thrown for pandas.
Here is a list of the modules I use in venv:
Package Version
--------------- -------
attrs 22.1.0
colorama 0.4.6
contourpy 1.0.6
cycler 0.11.0
et-xmlfile 1.1.0
exceptiongroup 1.0.4
fonttools 4.38.0
iniconfig 1.1.1
kiwisolver 1.4.4
matplotlib 3.6.2
numpy 1.23.4
openpyxl 3.0.10
packaging 21.3
pandas 1.5.1
Pillow 9.3.0
pip 22.3.1
pluggy 1.0.0
pyparsing 3.0.9
pytest 7.2.0
python-dateutil 2.8.2
pytz 2022.6
PyYAML 6.0
setuptools 56.0.0
six 1.16.0
tomli 2.0.1
I checked that pandas was installed and I could import pandas in the REPL.
Also I could run the test line by line in the REPL.
Furthermore, I checked that the problem was not pandas itself. If I changed the placement of the import pandas with for example import numpy in file_a.py, it threw the ModuleNotFoundError for numpy.
I tried to use different versions of python in my venv (python 3.11.0, 3.10.7, 3.9.13, 3.9.6). Interestingly, pytest ran only inside the venv for python 3.9.13 (The one I developed it in).
I tried to include and delete __init__.py files in all directories and also tried different versions of pytest (6.2.5, 7.0.0).
I checked sys.path that the right root_dir was included.
Thanks in advance for your answers!
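One thing worth ruling out here (an assumption, not something the question confirms) is that the pytest command found on the PATH belongs to a different interpreter than the one in the activated venv. A minimal sketch that can be run from the REPL and from inside a test, so the two results can be compared:

```python
import sys


def interpreter_report():
    """Describe the interpreter executing this code and whether it is a virtual environment."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # In a venv/virtualenv, sys.prefix points at the environment and differs from base_prefix.
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
    }


print(interpreter_report())
```

If the executable differs between the REPL and the test run, invoking the tests as python -m pytest from the activated venv forces pytest to use that environment's interpreter.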

Package list in EMR master node versus package list in EMR Notebook

I have one EMR cluster up and running. In it, I have one Jupyter Notebook with pyspark kernel.
For the master node, I am able to SSH into it. I am able to install Python packages in the master node easily, such as :
pip install pandas
which I can then verify with pip freeze.
However, when I go to the PySpark notebook and run sc.list_packages(), I see a different list of packages. Some packages have different versions than on the master node, and some (such as pandas) do not appear at all.
Here is the list of pip freeze in master node SSH.
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.1
boto==2.49.0
click==7.1.2
Cython==0.29.30
docutils==0.14
jmespath==0.10.0
joblib==0.15.1
lockfile==0.11.0
lxml==4.5.1
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.6
pandas==1.3.5
py-dateutil==2.2
py4j==0.10.9.5
pybind11==2.9.2
pyspark==3.3.0
pystache==0.5.4
python-daemon==2.2.3
python-dateutil==2.8.2
python37-sagemaker-pyspark==1.3.0
pytz==2020.1
PyYAML==5.3.1
regex==2020.6.8
scipy==1.7.3
simplejson==3.2.0
six==1.13.0
soupsieve==1.9.5
tqdm==4.46.1
windmill==1.6
And here is the package list in the PySpark notebook using sc.list_packages():
aws-cfn-bootstrap (2.0)
beautifulsoup4 (4.9.1)
boto (2.49.0)
click (7.1.2)
docutils (0.14)
jmespath (0.10.0)
joblib (0.15.1)
lockfile (0.11.0)
lxml (4.5.1)
mysqlclient (1.4.2)
nltk (3.5)
nose (1.3.4)
numpy (1.16.5)
pip (9.0.1)
py-dateutil (2.2)
pystache (0.5.4)
python-daemon (2.2.3)
python37-sagemaker-pyspark (1.3.0)
pytz (2020.1)
PyYAML (5.3.1)
regex (2020.6.8)
setuptools (28.8.0)
simplejson (3.2.0)
six (1.13.0)
soupsieve (1.9.5)
tqdm (4.46.1)
UNKNOWN (1.3.5)
wheel (0.29.0)
windmill (1.6)
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
You are using pip version 9.0.1, however version 22.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Note that pandas, scipy and pip are different. Why are they different? How do I upgrade or update the list in the PySpark notebook?
Log into the master node and run sudo docker ps -a. You should see a container named something like emr/jupyter-notebook:6.0.3 and that's where your Jupyter Notebook is running; it is not running in the master node.
Any packages you install on the master node will not be visible to the Jupyter Notebook, which is why your package lists do not match. To install packages for the Jupyter Notebook, I use a requirements file listing the packages I want and invoke a bootstrap action script that installs them. One important detail: if you pin a package version, it must be supported by the Python version running in the container. To find out which version that is, run this in a notebook cell:
import sys
print(sys.version)
To find the latest packages that go with a specific version of Python, I highly recommend using Anaconda. For example
conda create --name requests python=3.7.9 matplotlib
will tell me the latest version of matplotlib that works with Python 3.7.9
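To check afterwards that the notebook really has what the requirements file asked for, something like the following could be run in a notebook cell. This is only a sketch: it handles simple name==version lines, the helper name check_pins is made up, and importlib.metadata needs Python 3.8 or newer (on 3.7 the importlib_metadata backport offers the same API).

```python
from importlib import metadata


def check_pins(requirements_text):
    """Map each 'name==version' pin to a (wanted, installed) pair; installed is None if missing."""
    results = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, wanted = (part.strip() for part in line.split("==", 1))
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        results[name] = (wanted, installed)
    return results


# Example input; the package names and versions are taken from the lists above.
print(check_pins("pandas==1.3.5\nnumpy==1.21.6\n"))
```

Any entry whose installed value is None, or differs from the wanted one, points at a package the bootstrap script did not install into the notebook's Python.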

How to install TensorFlow on Python 3.7

Trying:
D:\Users\Downloads>pip install tensorflow
ERROR: Could not find a version that satisfies the requirement tensorflow (from versions: none)
ERROR: No matching distribution found for tensorflow
Windows 10 OS
I get the same error inside a venv, too:
(venv) C:\Users\KvaksManYT>pip install --upgrade tensorflow
ERROR: Could not find a version that satisfies the requirement tensorflow (from versions: none)
ERROR: No matching distribution found for tensorflow
I would recommend using a virtual environment, which you can set up with pip install virtualenv. Then, depending on your OS, create and activate an environment.
python3 -m venv /path/to/new/virtual/environment
Then, activate this environment using,
source ./venv/bin/activate
Now, you can install any Python packages you want.
pip install tensorflow==2.0.0
You can install TensorFlow by following these steps
Ubuntu/Linux, macOS, Windows
Inside a virtualenv you do not need to specify the pip version
For a system install, you need to specify the pip version
Upgrade pip first:
pip install --upgrade pip
#virtualenv install
pip install --upgrade tensorflow
#system install
pip3 install --user --upgrade tensorflow
reference https://www.tensorflow.org/install/pip
I had the same problem with Windows 10 x64, and it was caused because I was using the wrong Python version, both globally and in the venv. I found questions on the issue multiple times on the internet, including yours.
Be sure to use Python versions 3.5-3.8, as per the requirements, and also a 64-bit (x64) build, not a 32-bit (x32) one.
Namely, I ran into this error using both
a venv with 3.9.1 x64 (python --version),
and my globally installed 3.8.2 x32 (python3 --version).
So, I downloaded the x64-version of Python 3.8.6 from here.
Note that the command venv does not allow specifying the python version used in the virtual environment,
as per an answer on this question. So I used virtualenv, which I obviously had to install in my global Python version first.
To specify the Python version used in the venv, I used the command virtualenv, as in:
virtualenv --python="C:\Users\me\AppData\Local\Programs\Python\Python38\python.exe" myvenv
where you have to give the path to the newly downloaded Python distribution you want to use, if there are several on your PC (for example, I had Python38-32 and Python39 folders in that directory).
Check Python versions in virtual environment
After I activate my myvenv, created as above, I verify the Python versions as follows:
python3 --version
> Python 3.8.2
python --version
> Python 3.8.6
Then, running the following
import struct
print(struct.calcsize("P") * 8)
within either python3 or python shows me whether that interpreter is 32-bit or 64-bit, as per this answer. Here python returns 64, so that is the one you want to use (not python3).
Finally, within the virtual environment, you can run
pip install --upgrade tensorflow
and it will download and install. (Meanwhile, pip3 install --upgrade tensorflow would still return your error inside and outside the virtual environment.)
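The two checks above (Python version and bitness) can be combined into one quick test. This is a sketch under the assumptions stated in this answer: the 3.5-3.8 range is the one quoted here and will differ for other TensorFlow releases, and the function names are made up for illustration.

```python
import struct
import sys


def pointer_bits():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


def looks_tensorflow_ready(version=None, bits=None, min_version=(3, 5), max_version=(3, 8)):
    """Check (major, minor) against the assumed supported range and require a 64-bit build."""
    version = tuple(sys.version_info[:2]) if version is None else tuple(version[:2])
    bits = pointer_bits() if bits is None else bits
    return min_version <= version <= max_version and bits == 64


print(sys.version_info[:2], pointer_bits(), looks_tensorflow_ready())
```

Run it with the same interpreter that pip belongs to, since python and python3 can point at different installations, as they did here.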

problem accessing odoo from ubuntu terminal 18

I'm trying to start Odoo from the command line, but I get critical errors. One of them is:
odoo.modules.module: Couldn't load module web
odoo.modules.module: The 'odoo.addons.web' package was not installed in a way that PackageLoader understands.
ERROR? odoo.service.server: Failed to load server-wide module web.
so I can't access odoo with the command ./odoo-bin
Do you know how I can solve the problem?
You must uninstall Jinja2 and reinstall it at this version: Jinja2==2.10.1
pip3 uninstall jinja2
and
pip3 install Jinja2==2.10.1
In my case I had this issue in OpenERP 8 with Python 2.7.
I had updated Jinja2 to a version that is not supported by Python 2.7, so I just downgraded it from 2.11 to 2.10 and it works.