File not found in EMR step (shell script) - PySpark

I have a shell script as below, which I have added as a step in EMR.
#!/bin/sh -ex
unzip /home/hadoop/test.zip
ls /home/hadoop
pwd
sudo pip3 install pyspark
pytest /home/hadoop/test/read_pre_to_processed_test.py
pytest /home/hadoop/test/read_processed_published_test.py
pytest /home/hadoop/test/raw_to_snowflake_test.py
When I run the step, it says:
ERROR: file or directory not found: /home/hadoop/test/read_pre_to_processed_test.py
It is unable to detect any of the pytest files in EMR. Can someone guide me?
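One thing worth checking (an assumption, since the step's working directory and the zip layout aren't shown): unzip extracts into the current working directory by default, and for an EMR step that is not necessarily /home/hadoop. Pinning the destination with -d and listing the result shows where the files actually land:
#!/bin/sh -ex
# Extract to an explicit destination rather than the step's working directory
unzip -o /home/hadoop/test.zip -d /home/hadoop
# Confirm the test files are where pytest expects them
ls -R /home/hadoop/test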

Related

Use Python library in Slurm job

I want to run a job on Slurm, and my Python script needs the evaluate package, which I have on my local machine. I don't know if I could change the Python path on the server to match the one on my local machine, and even if I could, I'm afraid I might break the system.
So I followed this answer and included a requirements.txt file containing just evaluate==0.1.2, and I get even more errors:
load GCC/10.2.0 (PATH, MANPATH, INFOPATH, LIBRARY_PATH, LD_LIBRARY_PATH, STD COMP VARS)
load ROCM/5.1.1 (PATH, MANPATH, LD_LIBRARY_PATH, LIBRARY_PATH, C_INCLUDE_PATH)
Set INTEL compilers as MPI wrappers backend
load mkl/2018.4 (LD_LIBRARY_PATH)
load PYTHON/3.7.4 (PATH, MANPATH, LD_LIBRARY_PATH, LIBRARY_PATH, PKG_CONFIG_PATH, C_INCLUDE_PATH, CPLUS_INCLUDE_PATH, PYTHONHOME, PYTHONPATH)
/var/spool/slurmd/job216863/slurm_script: line 12: virtualenv: command not found
/var/spool/slurmd/job216863/slurm_script: line 16: /env/bin/activate: No such file or directory
ERROR: Could not find a version that satisfies the requirement evaluate==0.1.2 (from versions: none)
ERROR: No matching distribution found for evaluate==0.1.2
Traceback (most recent call last):
File "eval_comet.py", line 1, in <module>
from evaluate import load
ModuleNotFoundError: No module named 'evaluate'
Most of the time, the Python version on HPC clusters is old. My university's HPC cluster has Python 3.7. If you wish to create a Python virtual environment (not conda) with a newer version, there is a trick.
First, load the Anaconda module. Some systems use module load and some use load, depending on your organisation.
[s.1915438@sl2 ~]$ module load anaconda/2021.05
[s.1915438@sl2 ~]$ conda create -n surrogate python=3.8
Here I created a Conda environment named surrogate with Python 3.8; you can choose any version you like. Now activate the Conda environment and check the Python version.
[s.1915438@sl2 ~]$ source activate surrogate
(surrogate) [s.1915438@sl2 ~]$ which python
~/.conda/envs/surrogate/bin/python
(surrogate) [s.1915438@sl2 ~]$ python --version
Python 3.8.13
Now navigate to the directory where you want the Python virtual environment to live and create it with the following commands.
(surrogate) [s.1915438@sl2 s.1915438]$ mkdir modulus_pysdf
(surrogate) [s.1915438@sl2 s.1915438]$ cd modulus_pysdf/
(surrogate) [s.1915438@sl2 modulus_pysdf]$ python3 -m venv modulus_pysdf
Log out of the server (Ctrl+D) to exit the Conda environment, then log in again. Remember, in my case the path to the Python virtual environment was /scratch/s.1915438/modulus_pysdf.
This is how I will activate the Python virtual environment.
[s.1915438@sl2 ~]$ cd /scratch/s.1915438
[s.1915438@sl2 s.1915438]$ cd modulus_pysdf/
[s.1915438@sl2 modulus_pysdf]$ source modulus_pysdf/bin/activate
Now I can check the Python version and the path.
(modulus_pysdf) [s.1915438@sl2 modulus_pysdf]$ python --version
Python 3.8.13
(modulus_pysdf) [s.1915438@sl2 modulus_pysdf]$ which python
/scratch/s.1915438/modulus_pysdf/modulus_pysdf/bin/python
As usual, I can install any package with pip. For example, to install evaluate from PyPI:
pip install evaluate
Or, if you have a requirements.txt file, you can do this (it strips comments and installs each requirement one at a time, so one failing package does not abort the rest). See this for more details.
cat requirements.txt | grep -Eo '(^[^#]+)' | xargs -n 1 pip install
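To tie this back to the question, here is a minimal sketch of a Slurm batch script that uses such a venv (the venv path is the one from this walkthrough; the job name, resources, and script name are placeholders to adapt):
#!/bin/bash
#SBATCH --job-name=eval_comet
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
# Activate the venv created above; packages were installed into it with pip beforehand
source /scratch/s.1915438/modulus_pysdf/modulus_pysdf/bin/activate
# Run the script that needs the evaluate package
python eval_comet.py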

Jupyter Notebook PySpark Kernel referencing lowered pip version from host machine site-packages

I am using a Jupyter Notebook provided by an AWS managed service called EMR Studio. My understanding of how these notebooks work is that they are hosted on EC2 instances that I provision as part of my EMR cluster, with the PySpark kernel specifically using the task nodes.
Currently, when I run the command sc.list_packages() I see that pip is at version 9.0.1, whereas if I SSH onto the master node and run pip list I see pip at version 20.2.2. I have issues running the command sc.install_pypi_package() due to the older pip version in the notebook.
In a notebook cell, if I run import pip and then pip, I see that the module is located at
<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/__init__.py'>
I am assuming this is most likely inside a virtualenv of some sort running as an application on the task node? I am unsure of this, and I have no concrete evidence of how the virtualenv is provisioned, if there is one.
If I run sc.uninstall_package('pip') and then sc.list_packages(), I see pip at version 20.2.2, which is what I want to start off with. The module path is the same as previously mentioned.
How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?
If I import a package like numpy, I see that the module lives in a different location from pip. Any reason for this?
<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/__init__.py'>
As for pip 9.0.1, the only reference I can find at the moment is /lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl. One directory above this I see a file called virtualenv-15.1.0-py2.7.egg-info which, when I cat it, states that it upgrades to pip 9.0.1. I have tried removing the pip 9.0.1 wheel and replacing it with a pip 20.2.2 wheel, which caused issues with the PySpark kernel provisioning properly. There is also a virtualenv.py file which references __version__ = "15.1.0".
I was able to find a solution for updating pip, setuptools, and wheel in the virtualenv that PySpark uses.
I first had to determine how pip 9 is being sourced. After SSHing to my EMR master node, I changed directory to the root (cd /) and ran sudo find . -name "pip*" to search recursively for where pip files might live.
In my scenario there is a pip 9 wheel located at:
./usr/lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl
Searching around a bit more in /usr/lib/python2.7/site-packages, there is a virtualenv.py that is being invoked to create the virtualenv; it is explained in more detail below.
Within the PySpark notebook session using %%info shows that the virtualenv is created from this file path (thanks Parag):
'spark.pyspark.virtualenv.bin.path': '/usr/bin/virtualenv'
Running cat /usr/bin/virtualenv shows that the virtualenv is being invoked from the following commands:
#!/usr/bin/python
import virtualenv
virtualenv.main()
This version of python in /usr/bin is python2.7. At the terminal I ran the following in sequence (the last two lines inside the interpreter):
$ /usr/bin/python
>>> import virtualenv
>>> virtualenv
This outputs:
<module 'virtualenv' from '/usr/lib/python2.7/site-packages/virtualenv.py'>
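From the same interpreter session you can confirm the version directly (the __version__ attribute is the one referenced elsewhere in this question):
>>> virtualenv.__version__
'15.1.0'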
I have sometimes seen a virtualenv.pyc file being used here, located in /usr/lib/python2.7/site-packages/, but I have seen other users suggest that .pyc files can be deleted.
On the EMR master node I ran /usr/bin/virtualenv, which prints the flags that can be used. First I ran /usr/bin/virtualenv --verbose ./myVE, which shows that pip 9.0.1 is packaged into the virtualenv I created. If I instead run /usr/bin/virtualenv --verbose --download ./myVE2, the output shows that updated versions of pip, setuptools, and wheel are downloaded from Artifactory (our private PyPI mirror) into the virtualenv. There is a /etc/pip.conf that we use to set the index-url and trusted host so that Artifactory is used instead of PyPI.
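In short, the two invocations compared (myVE and myVE2 are just scratch directory names from this experiment):
/usr/bin/virtualenv --verbose ./myVE               # bundles the local pip 9.0.1 wheel
/usr/bin/virtualenv --verbose --download ./myVE2   # downloads current pip, setuptools, wheel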
At this point it seems that the EMR cluster's virtualenv.py by default does not download updated wheels from Artifactory/PyPI, and instead uses the wheel files bundled in /usr/lib/python2.7/site-packages/virtualenv_support/*.whl.
Running cat /usr/lib/python2.7/site-packages/virtualenv.py shows that this version of virtualenv is 15.1.0, which is very outdated (a 2016 release).
Reading more into virtualenv.py shows that the main() function has a block of code as follows:
parser.add_option(
    "--download",
    dest="download",
    action="store_true",
    help="Download preinstalled packages from PyPI.",
)
I compared this virtualenv.py file on my EMR master node to the official release of virtualenv==15.1.0 from PyPI (https://pypi.org/project/virtualenv/15.1.0/). I downloaded the tar.gz, unzipped it on my local machine, and found a virtualenv.py file in the unzipped folder. Comparing the official virtualenv.py to the EMR cluster's copy with diff, only a couple of lines differ. The main difference is that in the official file, the parser.add_option block above includes default=True,; the EMR cluster's file does not:
parser.add_option(
    "--download",
    dest="download",
    action="store_true",
    default=True,
    help="Download preinstalled packages from PyPI.",
)
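Equivalently, running diff against the official copy surfaces just that line (paths and line numbers here are illustrative):
$ diff /usr/lib/python2.7/site-packages/virtualenv.py virtualenv-15.1.0/virtualenv.py
829a830
>     default=True,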
From here, I copied the EMR cluster's virtualenv.py and updated that line to set default=True,. I then shipped this updated virtualenv.py in an EMR bootstrap script so that the file is replaced on all node types (master/core/task).
The bootstrap script does the following:
sudo rm /usr/lib/python2.7/site-packages/virtualenv.pyc
sudo rm /usr/lib/python2.7/site-packages/virtualenv.py
sudo aws s3 cp <UPDATED_VIRTUALENV_S3_PATH> /usr/lib/python2.7/site-packages/
Make sure the file copied from S3 ends up named exactly virtualenv.py; a different filename could cause issues.
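Put together, a minimal sketch of the bootstrap script (the S3 path placeholder is kept from the steps above; naming the destination file explicitly avoids the filename concern):
#!/bin/bash
set -e
# Remove the stale compiled copy and the old module
sudo rm -f /usr/lib/python2.7/site-packages/virtualenv.pyc
sudo rm -f /usr/lib/python2.7/site-packages/virtualenv.py
# Drop in the patched virtualenv.py (the copy with default=True for --download)
sudo aws s3 cp <UPDATED_VIRTUALENV_S3_PATH> /usr/lib/python2.7/site-packages/virtualenv.py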
Now when I start up a PySpark kernel, spark.pyspark.virtualenv.bin.path invokes the updated virtualenv.py file, and I can confirm that pip is at a much higher version (20+), which is what I was looking to achieve.

Python virtualenv ImportError: No module named 'zlib'

I am on an Ubuntu machine, which has Python 2.7.6 as its default python. It also has Python 3.4.3, with both versions located in /usr/bin/.
I have downloaded virtualenv and virtualenvwrapper. I then downloaded the current version of Python, 3.5.1. In its directory I ran the following commands:
./configure
make
make test
sudo make altinstall
Python 3.5.1 is now installed in /usr/local/bin/.
I now run the following commands:
mkvirtualenv test1
mkvirtualenv test2 -p /usr/bin/python3
mkvirtualenv test3 -p /usr/local/bin/python3.5
Environment test1 was successfully created with Python 2.7.6, and environment test2 with Python 3.4.3. However, test3 fails with the following error:
ImportError: No module named 'zlib'
I've seen it mentioned that zlib needs to be installed when compiling Python in the first place, though make test didn't seem to report any problems. Do I just need to download and compile zlib from www.zlib.net and recompile Python 3.5?
zlib is a built-in module for Python 3.5.
I think you just need to re-compile Python 3.5...
See this link about Python virtualenvs:
https://www.reddit.com/r/linux4noobs/comments/3uwk76/help_using_python_in_linux/
Get the Python source and extract it:
wget https://www.python.org/ftp/python/3.5.0/Python-3.5.0.tgz
tar xvf Python-3.5.0.tgz
Configure for a local install:
cd Python-3.5.0/
./configure --prefix=$HOME/python35
make
If it complains about missing dependencies, install them, run make clean, and repeat. Finally:
make install
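On Ubuntu, the dependency that usually triggers this particular error is the zlib development headers; a sketch of the fix, assuming that is what is missing:
# Install the zlib headers (and the usual build tools) before recompiling
sudo apt-get install -y zlib1g-dev build-essential
cd Python-3.5.0/
make clean
./configure --prefix=$HOME/python35
make
make install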

Installing psycopg2 failed with python 3.2 but not with 3.4

First of all, I am sorry for asking a question that has been asked a million times; however, I couldn't resolve my issue.
TL;DR:
psycopg2 builds in Python3.4 virtualenv, but not in Python3.2; suspected dev packages missing, where can I get dev packages for old python releases?
Long story:
I need to write code for Python 3.2, using Django with the PostgreSQL database engine.
Ubuntu 15.04 ships with Python 3.4 by default, so I built Python 3.2 from source:
$ python3.2 --version
Python 3.2.6
The virtualenv is created like so:
myproject $ virtualenv -p python3.2 venv
New python executable in venv/bin/python3.2
Also creating executable in venv/bin/python
Installing setuptools, pip...done.
Installing requirements:
myproject $ source venv/bin/activate
(venv)myproject $ pip install psycopg2
The output log can be found in pastebin.
What I have read there:
GCC finishes whatever it's doing without error messages; the last non-error line is
running install_lib
then it fails with
Command /home/julka/LP/myproject/venv/bin/python3.2 -c "import setuptools, tokenize;__file__='/tmp/pip-build-cbu7a3/psycopg2/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-cyu65o-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/julka/LP/myproject/venv/include/site/python3.2 failed with error code 1 in /tmp/pip-build-cbu7a3/psycopg2
while it was compiling some library.
Building with Python 3.4:
julka@Pyragas-vo2:~/LP/myproject$ virtualenv -p python3.4 venv
Running virtualenv with interpreter /usr/bin/python3.4
Using base prefix '/usr'
New python executable in venv/bin/python3.4
Also creating executable in venv/bin/python
Installing setuptools, pip...done.
julka@Pyragas-vo2:~/LP/myproject$ source venv/bin/activate
(venv)julka@Pyragas-vo2:~/LP/myproject$ pip install psycopg2 -vv > psycopg2.log
Successfully installed psycopg2
Cleaning up...
(venv)julka@Pyragas-vo2:~/LP/myproject$
Successful installation. I have put the log in pastebin.
Maybe I'm missing some kind of development files for Python 3.2; what do I need to check?
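One avenue worth checking (an assumption, since the failing pastebin log isn't reproduced here): psycopg2 compiles against libpq and the Python C headers, so both need to be present for the 3.2 build:
# psycopg2 builds against libpq; on Ubuntu the headers come from:
sudo apt-get install -y libpq-dev build-essential
# For a source-built Python, the C headers live under the install prefix;
# check they exist for your 3.2 build (path is an example, adjust to your prefix)
ls /usr/local/include/python3.2m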

Installing the BigQuery command-line tool

I tried installing the BigQuery command-line tool under Linux using "easy_install bigquery" as well as manually via "python setup.py install".
I got the message "Finished processing dependencies for Bigquery." without an error.
Still, when I type "bq", I get the message "command not found".
Is there anything else to do?
Can you try running the easy_install command with the --record=log.txt flag? When it completes, it will write the list of installed files to log.txt.
E.g.
$ sudo easy_install --record=log.txt --upgrade bigquery
....
Installing bq script to /usr/local/bin
....
$ cat log.txt
/usr/local/bin/bq
You might also try the --verbose option.
I had the same issue; you can try:
pip install --upgrade google-cloud-bigquery
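If the package is installed but the shell still reports "command not found", the install directory is most likely missing from PATH; a quick check (/usr/local/bin is only an example location):
# Locate the bq script and make sure its directory is on PATH
which bq || find /usr -name bq -type f 2>/dev/null
export PATH="$PATH:/usr/local/bin"
hash -r   # clear the shell's command lookup cache, then try bq again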