Watson Studio Very Slow - ibm-cloud

I am using a Jupyter notebook on Watson Studio, and installing Python libraries takes a very long time.
For example, the following lines of code take about half an hour to execute:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
# libraries for displaying images
from IPython.display import Image
from IPython.core.display import HTML
# transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize
!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library
What causes this?

pip install folium
The root conda environment where you install the custom package must be very complex indeed (the price you pay for universal coverage that caters to the needs of many users), and conda is a particularly slow installer because it performs a lot of I/O operations (even on NVMe drives it is as slow as you observe).
This is why Kaggle (and we) have switched to pip for all but a few Python packages, and so should you (if you can find your package on PyPI, or at least on GitHub).
Also try unpinning the old version of the library you wanted to use (released nearly three years ago!) and use the latest one instead; far fewer packages will need to be reinstalled that way:
So use:
pip install folium==0.11.0
or simply:
pip install folium

Related

Conda not using latest package version

I am using Anaconda (the latest version, from September). I ran conda update --all to get the latest packages, in particular SciPy. But when I import scipy from a Python shell, it imports version 1.7.3, not the latest one (1.9.1).
I am not fluent with the conda framework. I tried running conda install scipy, which did not change anything, and conda install scipy=1.9.1 (which seems to hang).
I am using the base environment. This is a fresh install of the latest Anaconda (with packages updated via conda, without any use of pip that could interfere).
Listing the packages via conda yields:
>>> conda list scipy
# packages in environment at /home/***/anaconda3:
#
# Name Version Build Channel
scipy 1.7.3 py39hc147768_0
However, when I look at the content of the anaconda3/pkgs folder:
>>> ls anaconda3/pkgs/ | grep scipy
anaconda3/pkgs/scipy-1.7.3-py39hc147768_0.conda
anaconda3/pkgs/scipy-1.9.1-py39h14f4228_0.conda
anaconda3/pkgs/scipy-1.7.3-py39hc147768_0 (contains Scipy 1.7.3)
anaconda3/pkgs/scipy-1.9.1-py39h14f4228_0 (contains Scipy 1.9.1)
anaconda3/pkgs/scipy-1.9.1-py39h14f4228_09jfxaf1g (empty folder)
anaconda3/pkgs/scipy-1.9.1-py39h14f4228_0dydy5wnw (empty folder)
So I am assuming that conda has both SciPy 1.7.3 and 1.9.1. But why can't I import the latest one?
How can I correct this situation?
EDIT: Creating a new environment and reinstalling the packages as needed solves my problem. However, why is the base environment stuck with the earlier version?
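The workaround from the EDIT can be sketched as follows (the environment name is a placeholder, not from the original post):

```shell
# Create a fresh environment pinned to the SciPy version you actually want
# ("scipy-new" is a placeholder name).
conda create --name scipy-new python=3.9 scipy=1.9.1

# Activate it and confirm which version gets imported.
conda activate scipy-new
python -c "import scipy; print(scipy.__version__)"
```

Because the new environment is resolved from scratch, there is no existing pin in the base environment holding SciPy back.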

Package list in EMR master node versus package list in EMR Notebook

I have one EMR cluster up and running. In it, I have one Jupyter Notebook with pyspark kernel.
I am able to SSH into the master node and install Python packages there easily, such as:
pip install pandas
which I can then verify succeeded with pip freeze.
However, in the pyspark notebook, sc.list_packages() shows a different list of packages. Some packages have different versions than on the master node, and some (such as pandas) do not appear at all.
Here is the output of pip freeze on the master node (via SSH):
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.1
boto==2.49.0
click==7.1.2
Cython==0.29.30
docutils==0.14
jmespath==0.10.0
joblib==0.15.1
lockfile==0.11.0
lxml==4.5.1
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.6
pandas==1.3.5
py-dateutil==2.2
py4j==0.10.9.5
pybind11==2.9.2
pyspark==3.3.0
pystache==0.5.4
python-daemon==2.2.3
python-dateutil==2.8.2
python37-sagemaker-pyspark==1.3.0
pytz==2020.1
PyYAML==5.3.1
regex==2020.6.8
scipy==1.7.3
simplejson==3.2.0
six==1.13.0
soupsieve==1.9.5
tqdm==4.46.1
windmill==1.6
And here is the package list in the PySpark notebook using sc.list_packages():
aws-cfn-bootstrap (2.0)
beautifulsoup4 (4.9.1)
boto (2.49.0)
click (7.1.2)
docutils (0.14)
jmespath (0.10.0)
joblib (0.15.1)
lockfile (0.11.0)
lxml (4.5.1)
mysqlclient (1.4.2)
nltk (3.5)
nose (1.3.4)
numpy (1.16.5)
pip (9.0.1)
py-dateutil (2.2)
pystache (0.5.4)
python-daemon (2.2.3)
python37-sagemaker-pyspark (1.3.0)
pytz (2020.1)
PyYAML (5.3.1)
regex (2020.6.8)
setuptools (28.8.0)
simplejson (3.2.0)
six (1.13.0)
soupsieve (1.9.5)
tqdm (4.46.1)
UNKNOWN (1.3.5)
wheel (0.29.0)
windmill (1.6)
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
You are using pip version 9.0.1, however version 22.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Note that pandas, scipy, and pip differ. Why are they different, and how do I upgrade or update the list in the PySpark notebook?
Log into the master node and run sudo docker ps -a. You should see a container named something like emr/jupyter-notebook:6.0.3; that is where your Jupyter Notebook is running, not on the master node itself.
Any packages you install on the master node will therefore not be visible to the Jupyter Notebook, which is why your package lists do not match. To install packages in the Jupyter Notebook, I use a requirements file listing the packages I want, and invoke a bootstrap action script that installs them. One important detail: if you pin a package version, make sure it is supported by the Python version running in the container. To find out, run a cell in the Jupyter Notebook:
import sys
print(sys.version)
To find the latest packages compatible with a specific version of Python, I highly recommend using Anaconda. For example:
conda create --name requests python=3.7.9 matplotlib
will tell me the latest version of matplotlib that works with Python 3.7.9.
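The requirements-file approach described above might look roughly like this as an EMR bootstrap action script (the S3 bucket and file paths are placeholders, not from the original post):

```shell
#!/bin/bash
# Hypothetical bootstrap action: copy a requirements file from S3 and
# install its packages so the notebook environment can see them.
aws s3 cp s3://my-bucket/notebook-requirements.txt /tmp/requirements.txt
sudo python3 -m pip install -r /tmp/requirements.txt
```

Alternatively, recent EMR Notebook releases support installing a package from within a notebook cell via sc.install_pypi_package("pandas"), which targets the notebook's own environment rather than the master node.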

How to execute the right Python to import the installed tensorflow.transform package?

My Python version is 2.7.13.
I run the following in a Jupyter notebook.
First, I installed the packages:
%%bash
pip uninstall -y google-cloud-dataflow
pip install --upgrade --force tensorflow_transform==0.15.0 apache-beam[gcp]
Then,
%%bash
pip freeze | grep -e 'flow\|beam'
I can see that the package tensorflow-transform is installed:
apache-beam==2.19.0
tensorflow==2.1.0
tensorflow-datasets==1.2.0
tensorflow-estimator==2.1.0
tensorflow-hub==0.6.0
tensorflow-io==0.8.1
tensorflow-metadata==0.15.2
tensorflow-probability==0.8.0
tensorflow-serving-api==2.1.0
tensorflow-transform==0.15.0
However, when I try to import it, I get a warning and an error:
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/api/_v1/estimator/__init__.py:12: The name tf.estimator.inputs is deprecated. Please use tf.compat.v1.estimator.inputs instead.
ImportErrorTraceback (most recent call last)
<ipython-input-3-26a4792d0a76> in <module>()
1 import tensorflow as tf
----> 2 import tensorflow_transform as tft
3 import shutil
4 print(tf.__version__)
ImportError: No module named tensorflow_transform
After some investigation, I think I have some idea of the problem.
I ran this:
%%bash
pip show tensorflow_transform| grep Location
This is the output:
Location: /home/jupyter/.local/lib/python3.5/site-packages
I tried to modify $PATH by adding /home/jupyter/.local/lib/python3.5/site-packages to the beginning of it. However, I still failed to import tensorflow_transform.
Based on the above and the following information, I think that when I run the import command, it executes Python 2.7, not Python 3.5:
import sys
print('\n'.join(sys.path))
/usr/lib/python2.7
/usr/lib/python2.7/plat-x86_64-linux-gnu
/usr/lib/python2.7/lib-tk
/usr/lib/python2.7/lib-old
/usr/lib/python2.7/lib-dynload
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
/usr/local/lib/python2.7/dist-packages/IPython/extensions
/home/jupyter/.ipython
Also,
import sys
sys.executable
'/usr/bin/python2'
I think the problem is that the tensorflow_transform package was installed in /home/jupyter/.local/lib/python3.5/site-packages, but when I run import, Python searches /usr/local/lib/python2.7/dist-packages rather than /home/jupyter/.local/lib/python3.5/site-packages, so even updating $PATH does not help. Am I right?
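One general way to sidestep this kind of interpreter mismatch (a sketch, not specific to this machine) is to invoke pip through the very interpreter the notebook is running, via sys.executable:

```python
import subprocess
import sys

# The notebook resolves imports with the interpreter at sys.executable,
# so invoking pip through that exact interpreter guarantees the package
# lands in a site-packages directory this kernel actually searches.
print(sys.executable)

result = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True,
)
print(result.stdout.strip())

# To install a package into this same interpreter, you would run e.g.:
# subprocess.run([sys.executable, "-m", "pip", "install", "tensorflow_transform"])
```

With this pattern it does not matter which pip happens to be first on $PATH; the install always matches the running kernel.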
I tried to upgrade my Python, but:
%%bash
pip install upgrade python
Defaulting to user installation because normal site-packages is not writeable
Then, I added --user, but it seems that Python was not actually upgraded:
%%bash
pip install --user upgrade python
%%bash
python -V
Python 2.7.13
Any solution?
It seems to me that your Jupyter notebook is not using the right Python environment.
Perhaps you installed the package under version 3.5,
but the notebook uses the other one, so it cannot find the library.
You can pick the other interpreter by clicking on Python (your version) at the bottom left.
(screenshot: VS Code "Select Python Environment")
However, you can also do this via:
CTRL+SHIFT+P > Select Python Interpreter to start Jupyter Server
If that does not work, make sure the package you are trying to import is installed under the correct Python environment.
If not, open a terminal, activate the environment, and install it using:
pip install packagename
For example, I did the same thing here (note: I'm using Anaconda):
(screenshot: installing tensorflow_transform)
After installation, you can import it in your code directly:
(screenshot: importing tensorflow_transform)
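To verify which environment the active interpreter resolves a package from, you can query the import machinery directly (shown here with the standard-library json module as a stand-in for the package you care about):

```python
import importlib.util

# find_spec returns None when the active interpreter cannot see the
# package; otherwise spec.origin shows where it would be imported from.
spec = importlib.util.find_spec("json")
print(spec.origin if spec else "not installed in this environment")
```

If this prints a path under the wrong environment (or "not installed"), the notebook kernel and the environment you installed into do not match.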

how to import h5py on datalab?

Does anybody know how to install h5py on Datalab? pip install h5py doesn't work. apt-get install python-h5py works in the shell, but the package is not recognized in a Datalab notebook!
Thanks
It is true that !pip install h5py will let you install the library, but unfortunately, even after a successful installation, the import will fail.
The issue is rooted in an ongoing python-future issue ("surrogateescape handler broken when encountering UnicodeEncodeError") that manifests in Datalab because the underlying OS uses an 'ANSI_X3.4-1968' filesystem encoding.
As a hacky workaround, you may remove line 60 (the run_tests call) from h5py's __init__.py by running the following command from within a notebook cell:
!sed -i.bak '/run_tests/d' /usr/local/lib/python2.7/dist-packages/h5py/__init__.py
Just make sure you have first installed the package with !pip install h5py in a notebook cell.
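You can check whether your environment is affected by inspecting the filesystem encoding mentioned above:

```python
import sys

# On affected Datalab instances this reports 'ANSI_X3.4-1968' (i.e. ASCII);
# on a healthy setup it is usually 'utf-8'.
enc = sys.getfilesystemencoding()
print(enc)
```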

Trouble installing basemap on ipython

I've got IPython installed for both Python 3.3 and Python 3.4.
When I try to install Basemap using conda install basemap, I keep receiving the same two errors saying there is a conflict:
Hint: the following combinations of packages create a conflict with the
remaining packages:
- python 3.3*
- basemap
Hint: the following combinations of packages create a conflict with the
remaining packages:
- python 3.4*
- basemap
Is Basemap not supported by those two versions of Python? Do I need to move to Python 2.7 to get Basemap to work? Or is there a different way to install Basemap for IPython?
Thanks!
Basemap is now available for Python 3.3 on Windows, so this should no longer be an issue for you. I was having problems on Python 3.5 and had to use pip to install the wheel file from here:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
I worked with the Basemap group to find a way to install it:
https://github.com/matplotlib/basemap/issues/249
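Installing a downloaded wheel looks roughly like this (the filename below is a made-up example; use the one matching your Python version and architecture from the page linked above):

```shell
# Hypothetical wheel filename for Python 3.5 on 64-bit Windows.
pip install basemap-1.0.8-cp35-none-win_amd64.whl
```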