Jupyter Notebook PySpark kernel referencing an older pip version than the host machine's site-packages - pyspark

I am using a Jupyter Notebook provided by an AWS managed service called EMR Studio. My understanding is that these notebooks are hosted on the EC2 instances I provision as part of my EMR cluster, with the PySpark kernel running on the task nodes.
Currently, when I run sc.list_packages() I see that pip is at version 9.0.1, whereas if I SSH onto the master node and run pip list I see pip at version 20.2.2. I have issues running sc.install_pypi_package() because of the older pip version in the notebook.
If I run import pip and then pip in a notebook cell, I see that the module is located at:
<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/__init__.py'>
I assume this is most likely inside a virtualenv of some sort running as an application on the task node, but I am unsure of this and have no concrete evidence of how the virtualenv is provisioned, if there is one.
If I run sc.uninstall_package('pip') and then sc.list_packages(), I see pip at version 20.2.2, which is what I want to start from. The module path is the same as above.
How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?
If I import a package like numpy, I see that its module lives in a different location than pip does. Any reason for this?
<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/__init__.py'>
As for pip 9.0.1, the only reference I can find at the moment is /lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl. One directory above this I see a file called virtualenv-15.1.0-py2.7.egg-info which, when I cat it, states that it upgrades to pip 9.0.1. I have tried removing the pip 9.0.1 wheel file and replacing it with a pip 20.2.2 wheel, but that broke the PySpark kernel's ability to provision properly. There is also a virtualenv.py file which references __version__ = "15.1.0".

I was able to find a solution for updating pip, setuptools, and wheel in the virtualenv that PySpark uses.
First I had to determine where pip 9 comes from. After SSH'ing to my EMR master node, I changed into the root directory (cd /) and ran sudo find . -name "pip*" to recursively search for pip files.
In my scenario there is a pip 9 wheel located at:
./usr/lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl
Searching a bit more in /usr/lib/python2.7/site-packages, there is a virtualenv.py that is invoked to create the virtualenv, explained further below.
Within the PySpark notebook session using %%info shows that the virtualenv is created from this file path (thanks Parag):
'spark.pyspark.virtualenv.bin.path': '/usr/bin/virtualenv'
Running cat /usr/bin/virtualenv shows that the virtualenv is invoked via the following script:
#!/usr/bin/python
import virtualenv
virtualenv.main()
This version of python in /usr/bin is python2.7. At the terminal I ran the following commands in sequence:
/usr/bin/python
import virtualenv
virtualenv
This outputs:
<module 'virtualenv' from '/usr/lib/python2.7/site-packages/virtualenv.py'>
I have sometimes seen a virtualenv.pyc file in /usr/lib/python2.7/site-packages/ being used here instead, but other users suggest that .pyc files can safely be deleted, since they are regenerated from the .py file.
On the EMR master node I ran /usr/bin/virtualenv, which shows the flags that can be used. First I ran /usr/bin/virtualenv --verbose ./myVE, which shows that pip 9.0.1 is packaged into the virtualenv I created. If I instead run /usr/bin/virtualenv --verbose --download ./myVE2, the output shows that updated versions of pip, setuptools, and wheel are downloaded from Artifactory (our private PyPI mirror) into the virtualenv. We use /etc/pip.conf to set the index-url and trusted host so Artifactory is used instead of PyPI.
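For reference, the /etc/pip.conf we use looks roughly like this (a minimal sketch; the URL and hostname are placeholders for your own Artifactory instance):
sudo tee /etc/pip.conf <<'EOF'
[global]
index-url = https://artifactory.example.com/artifactory/api/pypi/pypi-remote/simple
trusted-host = artifactory.example.com
EOF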
At this point it seems that the EMR cluster's virtualenv.py by default does not download updated wheels from Artifactory/PyPI, and instead uses the wheel files located in /usr/lib/python2.7/site-packages/virtualenv_support/*.whl.
Running cat /usr/lib/python2.7/site-packages/virtualenv.py shows that this version of virtualenv is 15.1.0, which is very outdated (a 2016 release).
Reading more into virtualenv.py shows that the main() function has a block of code as follows:
parser.add_option(
    "--download",
    dest="download",
    action="store_true",
    help="Download preinstalled packages from PyPI.",
)
I compared this virtualenv.py file on my EMR master to the official release of virtualenv==15.1.0 from PyPI (https://pypi.org/project/virtualenv/15.1.0/). I downloaded the tar.gz, unpacked it on my local machine, and diffed the virtualenv.py it contains against the EMR cluster's copy. Only a couple of lines differ. The main difference is that in the official virtualenv.py, the parser.add_option call above includes default=True; the EMR cluster's file does not:
parser.add_option(
    "--download",
    dest="download",
    action="store_true",
    default=True,
    help="Download preinstalled packages from PyPI.",
)
From here, I copied the EMR cluster's virtualenv.py and updated that line to set default=True. I then used this updated virtualenv.py in an EMR bootstrap script so that the file is replaced on all node types (master/core/task).
The bootstrap script does the following:
sudo rm /usr/lib/python2.7/site-packages/virtualenv.pyc
sudo rm /usr/lib/python2.7/site-packages/virtualenv.py
sudo aws s3 cp <UPDATED_VIRTUALENV_S3_PATH> /usr/lib/python2.7/site-packages/
Ensure that the file copied from S3 is named exactly virtualenv.py, in case issues arise from the filename not being kept the same.
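For completeness, the bootstrap script is attached at cluster creation time; a hedged sketch of the relevant CLI fragment (the script name, bucket, and remaining options are placeholders):
aws emr create-cluster \
  --name "my-cluster" \
  --bootstrap-actions Path=s3://my-bucket/replace-virtualenv.sh,Name=replace-virtualenv \
  ... (remaining cluster options)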
Now when I start a PySpark kernel, spark.pyspark.virtualenv.bin.path invokes the updated virtualenv.py and I can confirm that pip is at a much higher version (20+), which is what I was looking to achieve.
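With the newer pip in place, package installs from the notebook work as intended; for example (an illustration only, the package and version are arbitrary):
sc.install_pypi_package("pandas==1.3.5")
sc.list_packages()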

Related

Use Python library in Slurm job

I want to run a job on Slurm and my Python script needs the evaluate package, which I have on my local machine. I don't know if I could change the Python path on the server to match the one on my local machine, and even if I could, I'm afraid I might break the system.
So I followed this answer and included a requirements.txt file with just evaluate==0.1.2 in it, but I get even more errors:
load GCC/10.2.0 (PATH, MANPATH, INFOPATH, LIBRARY_PATH, LD_LIBRARY_PATH, STD COMP VARS)
load ROCM/5.1.1 (PATH, MANPATH, LD_LIBRARY_PATH, LIBRARY_PATH, C_INCLUDE_PATH)
Set INTEL compilers as MPI wrappers backend
load mkl/2018.4 (LD_LIBRARY_PATH)
load PYTHON/3.7.4 (PATH, MANPATH, LD_LIBRARY_PATH, LIBRARY_PATH, PKG_CONFIG_PATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH, PYTHONHOME, PYTHONPATH)
/var/spool/slurmd/job216863/slurm_script: line 12: virtualenv: command not found
/var/spool/slurmd/job216863/slurm_script: line 16: /env/bin/activate: No such file or directory
ERROR: Could not find a version that satisfies the requirement evaluate==0.1.2 (from versions: none)
ERROR: No matching distribution found for evaluate==0.1.2
Traceback (most recent call last):
  File "eval_comet.py", line 1, in <module>
    from evaluate import load
ModuleNotFoundError: No module named 'evaluate'
Most of the time, the Python versions on HPCs are old. My university's HPC cluster has Python 3.7. If you wish to create a Python virtual environment (not conda) with a newer version, there is a trick.
Activate the Anaconda module; some systems use module load and some use load, depending on your organisation.
[s.1915438#sl2 ~]$ module load anaconda/2021.05
[s.1915438#sl2 ~]$ conda create -n surrogate python=3.8
Here I created a conda environment named surrogate with Python 3.8; you can choose any version. Now you can activate the conda environment and check the Python version.
[s.1915438#sl2 ~]$ source activate surrogate
(surrogate) [s.1915438#sl2 ~]$ which python
~/.conda/envs/surrogate/bin/python
(surrogate) [s.1915438#sl2 ~]$ python --version
Python 3.8.13
Now navigate to the directory where you want your Python virtual environment and create it with the following commands.
(surrogate) [s.1915438#sl2 s.1915438]$ mkdir modulus_pysdf
(surrogate) [s.1915438#sl2 s.1915438]$ cd modulus_pysdf/
(surrogate) [s.1915438#sl2 modulus_pysdf]$ python3 -m venv modulus_pysdf
Log out (Ctrl+D) from the server to exit the conda environment and then log in again. Remember, in my case the path to the Python virtual environment is /scratch/s.1915438/modulus_pysdf.
This is how I activate the Python virtual environment:
[s.1915438#sl2 ~]$ cd /scratch/s.1915438
[s.1915438#sl2 s.1915438]$ cd modulus_pysdf/
[s.1915438#sl2 modulus_pysdf]$ source modulus_pysdf/bin/activate
Now I can check the Python version and the path.
(modulus_pysdf) [s.1915438#sl2 modulus_pysdf]$ python --version
Python 3.8.13
(modulus_pysdf) [s.1915438#sl2 modulus_pysdf]$ which python
/scratch/s.1915438/modulus_pysdf/modulus_pysdf/bin/python
As usual, I can install any package using pip. For example, to install evaluate from PyPI:
pip install evaluate
Or, if you have a requirements.txt file, you can do this. See this answer for more details.
cat requirements.txt | grep -Eo '(^[^#]+)' | xargs -n 1 pip install
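Putting it together, the Slurm job script then only needs to activate this venv before running the Python script. A minimal sketch, assuming the module name, paths, and script name from the setup above (adjust resource options to your cluster):
#!/bin/bash
#SBATCH --job-name=eval_comet
#SBATCH --time=01:00:00
module load anaconda/2021.05
source /scratch/s.1915438/modulus_pysdf/modulus_pysdf/bin/activate
python eval_comet.py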

Package list in EMR master node versus package list in EMR Notebook

I have an EMR cluster up and running. In it, I have a Jupyter notebook with the PySpark kernel.
I am able to SSH into the master node and can easily install Python packages there, such as:
pip install pandas
which I can then verify was successful with pip freeze.
However, when I go to the PySpark notebook and use sc.list_packages(), I see a different list of packages there. Some packages have different versions than on the master node. Some packages (such as pandas) do not appear at all.
Here is the pip freeze output from an SSH session on the master node:
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.1
boto==2.49.0
click==7.1.2
Cython==0.29.30
docutils==0.14
jmespath==0.10.0
joblib==0.15.1
lockfile==0.11.0
lxml==4.5.1
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.6
pandas==1.3.5
py-dateutil==2.2
py4j==0.10.9.5
pybind11==2.9.2
pyspark==3.3.0
pystache==0.5.4
python-daemon==2.2.3
python-dateutil==2.8.2
python37-sagemaker-pyspark==1.3.0
pytz==2020.1
PyYAML==5.3.1
regex==2020.6.8
scipy==1.7.3
simplejson==3.2.0
six==1.13.0
soupsieve==1.9.5
tqdm==4.46.1
windmill==1.6
And here is the package list in the PySpark notebook using sc.list_packages():
aws-cfn-bootstrap (2.0)
beautifulsoup4 (4.9.1)
boto (2.49.0)
click (7.1.2)
docutils (0.14)
jmespath (0.10.0)
joblib (0.15.1)
lockfile (0.11.0)
lxml (4.5.1)
mysqlclient (1.4.2)
nltk (3.5)
nose (1.3.4)
numpy (1.16.5)
pip (9.0.1)
py-dateutil (2.2)
pystache (0.5.4)
python-daemon (2.2.3)
python37-sagemaker-pyspark (1.3.0)
pytz (2020.1)
PyYAML (5.3.1)
regex (2020.6.8)
setuptools (28.8.0)
simplejson (3.2.0)
six (1.13.0)
soupsieve (1.9.5)
tqdm (4.46.1)
UNKNOWN (1.3.5)
wheel (0.29.0)
windmill (1.6)
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
You are using pip version 9.0.1, however version 22.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Note that pandas, scipy and pip are different. Why are they different? How do I upgrade or update the list in the PySpark notebook?
Log into the master node and run sudo docker ps -a. You should see a container named something like emr/jupyter-notebook:6.0.3; that's where your Jupyter notebook is running. It is not running on the master node itself.
If you install packages on the master node, the Jupyter notebook will not see them; that is why your package lists do not match. To install packages into the Jupyter notebook, I use a requirements file containing the packages I want, and a bootstrap action script that installs them. One important detail: if you pin a package version, it must be supported by the Python version running in the container. To find that out, run a cell in the Jupyter notebook:
import sys
print(sys.version)
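For illustration, the bootstrap action script can be as simple as the following (a hedged sketch; the S3 path is a placeholder and the pip invocation may need to match the Python inside the container):
#!/bin/bash
aws s3 cp s3://my-bucket/requirements.txt /tmp/requirements.txt
sudo python3 -m pip install -r /tmp/requirements.txt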
To find the latest package versions compatible with a specific version of Python, I highly recommend using Anaconda. For example,
conda create --name requests python=3.7.9 matplotlib
will tell me the latest version of matplotlib that works with Python 3.7.9.
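To read back the version conda actually resolved (the environment name comes from the command above):
conda list -n requests matplotlib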

How can I use PIL in Python?

I have successfully installed Pillow:
chris#MBPvonChristoph sources % python3 -m pip install --upgrade Pillow
Collecting Pillow
Using cached Pillow-9.0.1-1-cp310-cp310-macosx_11_0_arm64.whl (2.7 MB)
Installing collected packages: Pillow
Successfully installed Pillow-9.0.1
but when I try to use it in PyCharm I get:
Traceback (most recent call last):
  File "/Users/chris/PycharmProjects/pythonProject2/main.py", line 1, in <module>
    from PIL import Image
ModuleNotFoundError: No module named 'PIL'
or when using it in Blender I get:
ModuleNotFoundError: No module named 'PIL'
I am not a pro at installing Python libraries... so obviously I did something wrong. But how do I fix that?
Maybe I should mention that I am working on an M1 MacBook.
It looks like you may need to repoint PyCharm to your installed Python interpreter.
Go to the command line and find your Python interpreter path. On Windows you can run where python in your command line and it will show you where your Python and packages are installed. You could also start Python directly in the command line and find the paths from there. For example, open the command line, then:
python
Press Enter to activate Python; within the interpreter you can then run:
import sys
for x in sys.path:
    print(x)
In PyCharm, make sure you point to the path discovered in step 1 and select that as your Python interpreter. Check out the examples here: https://www.jetbrains.com/help/pycharm/configuring-python-interpreter.html#add-existing-interpreter
This should work. I'm not sure about all the steps you took, but if you installed Python with PyCharm on top of your regular Python installation, I would recommend:
finding all the paths from step 1
uninstalling Python through the system uninstaller
checking whether the folders found in the paths step still exist
if they do, deleting those as well
starting over with just one Python installation
repointing PyCharm to that
First, uninstall the old package:
pip uninstall PIL
After uninstalling, upgrade pip and reinstall Pillow:
python3 -m pip install --upgrade pip
python3 -m pip install --upgrade Pillow
or
brew install Pillow
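Either way, a quick check from the terminal confirms which installation provides PIL (adjust python3 to the exact interpreter PyCharm or Blender uses):
python3 -c "import PIL; print(PIL.__version__, PIL.__file__)"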

How do I install Scala in Jupyter IPython Notebook?

Here are a few links that I went to, and I did exactly what they said. I don't know what I'm doing wrong.
https://github.com/alexarchambault/jupyter-scala
https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
https://github.com/apache/incubator-toree
http://jcrudy.github.io/blog/html/2013/12/08/introduction_to_iscala.html
None of this is working. It may be something about how my node is configured; I just don't know. Please help.
I tried the following with Jupyterhub notebook and it works seamlessly:
# Step 1: Install spylon kernel
pip install spylon-kernel
# Step 2: create a kernel spec
python -m spylon_kernel install
# Step 3: start jupyter notebook
jupyter notebook
PS: to list all installed kernels, you can run the following command:
jupyter kernelspec list
You can use the information given here.
Ensure you have IPython 3 installed. ipython --version should return a
value >= 3.0. If that's not the case, a quick way to set it up is to
install the Anaconda Python distribution and then run
$ pip install --upgrade "ipython[all]"
ipython --version should then return a value >= 3.0.
Download the Jupyter Scala binaries for Scala 2.10 (txz or zip) or
Scala 2.11 (txz or zip), and unpack them in a safe place. Then run the
jupyter-scala program (or jupyter-scala.bat on Windows) it contains
once. That will set up the Jupyter Scala kernel for the current user.
Check that Jupyter/IPython knows about Jupyter Scala by running
$ jupyter kernelspec list
This should print, among others, a line like
scala211
(or scala210, depending on the Scala version you chose).
Then run either IPython console with
$ ipython console --kernel scala211
and start using the Jupyter Scala kernel straightaway, or run Jupyter
Notebook with
$ jupyter notebook
and create Scala 2.11 notebooks by choosing Scala 2.11 in the dropdown
in the upper right of the Jupyter Notebook start page.
Note: Since IPython has now been replaced by Jupyter, we replaced ipython in the above commands with jupyter.
I've just run:
conda create --name base2 --clone base (creates an env just like base)
conda activate base2 (moves to the new env)
conda install -c conda-forge spylon-kernel
python -m spylon_kernel install --user (creates a kernel spec for the Jupyter notebook)
jupyter-notebook
...and it works just fine.
I'm using:
Anaconda 4.7.12
Jupyter-notebook 6.0.1
Ubuntu 18.04
ipykernel 5.1.3
ipython 7.9.0
ipython_genutils 0.2.0
jupyter_client 5.3.4
jupyter_core 4.6.0
traitlets 4.3.3
To test it, a simple Scala cell such as def suma(a: Int) = a + 3 works.
I can't add a comment to Heapify's answer, but his solution worked for JupyterLab on Windows without problems.
I cut and pasted his code into an Anaconda PowerShell prompt:
pip install spylon-kernel
python -m spylon_kernel install
jupyter notebook
Then I refreshed my Anaconda launcher and the spylon project option was available.
The answer for Linux can be found here.
Install Scala. Add these lines to ~/.bashrc
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin:$PATH
Follow these instructions from the
GitHub site:
Download and unpack the pre-packaged binaries for Scala 2.11. Unpack
each downloaded archive and, from a console, go to the bin
sub-directory of the directory it contains. Then run the following to
set up the corresponding Scala kernel:
./jove-scala --kernel-spec
Make sure Spark is installed locally and that SPARK_HOME is exported in your .profile/environment file.
If not, you might get stuck with the following message:
"Intitializing Scala interpreter ..."
without any result.
On Mac, I needed only 3 commands to add Scala and run it with Spark (which I had already installed) in my Jupyter notebook:
pip install spylon-kernel
python -m spylon_kernel install
ipython notebook
Once you run them in your terminal, you'll have spylon-kernel available in your notebook, which can be used as a Scala notebook.
spylon-kernel hasn't seen an update in years. These days it's much better to use almond.
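For reference, almond installs through coursier; a hedged sketch based on the almond documentation (verify the current command and Scala version flags against the docs, and install the cs launcher from get-coursier.io first):
cs launch almond -- --install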

How can I make a list of installed packages in a certain virtualenv?

You can cd to YOUR_ENV/lib/pythonX.X/site-packages/ and have a look, but is there a more convenient way?
pip freeze lists all the installed packages, including those from the system environment.
You can list only the packages local to the virtualenv with
pip freeze --local
or
pip list --local
This option works irrespective of whether you have global site packages visible in the virtualenv.
Note that restricting the virtualenv to not use global site packages isn't the answer to the problem, because the question is on how to separate the two lists, not how to constrain our workflow to fit limitations of tools.
Credit to @gvalkov's comment here. See also pip issue 85.
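To see the difference between the two lists, a quick experiment (an illustration only; requests is an arbitrary package and the venv is created with global site packages visible):
python3 -m venv --system-site-packages demo
demo/bin/pip install requests
demo/bin/pip freeze           # includes globally installed packages
demo/bin/pip freeze --local   # only packages installed into demo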
Calling the pip command inside a virtualenv should list the packages visible/available in the isolated environment. Make sure to use a recent version of virtualenv, which applies the --no-site-packages option by default; this way virtualenv creates a Python environment without access to packages installed in the system Python.
Next, make sure you use the pip command provided inside the virtualenv (YOUR_ENV/bin/pip), or simply activate the virtualenv (source YOUR_ENV/bin/activate) as a convenient way to call the proper python and pip commands:
~/Projects$ virtualenv --version
1.9.1
~/Projects$ virtualenv -p /usr/bin/python2.7 demoenv2.7
Running virtualenv with interpreter /usr/bin/python2.7
New python executable in demoenv2.7/bin/python2.7
Also creating executable in demoenv2.7/bin/python
Installing setuptools............................done.
Installing pip...............done.
~/Projects$ cd demoenv2.7/
~/Projects/demoenv2.7$ bin/pip freeze
wsgiref==0.1.2
~/Projects/demoenv2.7$ bin/pip install commandlineapp
Downloading/unpacking commandlineapp
Downloading CommandLineApp-3.0.7.tar.gz (142kB): 142kB downloaded
Running setup.py egg_info for package commandlineapp
Installing collected packages: commandlineapp
Running setup.py install for commandlineapp
Successfully installed commandlineapp
Cleaning up...
~/Projects/demoenv2.7$ bin/pip freeze
CommandLineApp==3.0.7
wsgiref==0.1.2
What's strange in my output is that the package 'wsgiref' is visible inside the virtualenv. It comes from my system Python. I currently do not know why, but maybe it is different on your system.
In Python 3:
pip list
An empty venv contains:
Package Version
---------- -------
pip 19.2.3
setuptools 41.2.0
To create a new environment
python3 -m venv your_foldername_here
Activate
cd your_foldername_here
source bin/activate
Deactivate
deactivate
You can also stand in a parent folder and give the virtual environment a name/folder (python3 -m venv name_of_venv).
venv is a subset of virtualenv and has shipped with Python since 3.3.
To list the installed packages in the virtualenv:
Step 1:
workon envname
Step 2:
pip freeze
This will display all installed packages and their versions.
If you're still a bit confused about virtualenv, you might not see how to combine the great tips from the answers by Ioannis and Sascha. This is the basic command you need:
/YOUR_ENV/bin/pip freeze --local
That can easily be used elsewhere. For example, here is a convenient and complete snippet, suited for getting all the local packages installed in all the environments you set up via virtualenvwrapper:
cd ${WORKON_HOME:-~/.virtualenvs}
for dir in *; do [ -d "$dir" ] && "$dir/bin/pip" freeze --local > "/tmp/$dir.fl"; done
more /tmp/*.fl
Why don't you try pip list?
For reference, I'm using pip version 19.1 on Python version 3.7.3.
If you are using pip 19.0.3 and Python 3.7.4, then run the pip list command in your virtualenv. It will show all installed packages with their respective versions.
.venv/bin/pip freeze worked for me in bash.
In my case the Flask version was only visible inside the venv, so I had to go to
C:\Users\\AppData\Local\flask\venv\Scripts>pip freeze --local
Using the python3 executable directly, from:
Git Bash:
winpty my_venv_dir/bin/python -m pip freeze
Linux:
my_venv_dir/bin/python -m pip freeze