How to make Python version executables global across multiple pyenv-virtualenv virtual environments - pyenv

A pyenv Python version (e.g. 3.10.4) has the "normal" expected Python executables associated with it (e.g., pip, 2to3, pydoc)
$ ls "${PYENV_ROOT}/versions/3.10.4/bin"
2to3 idle idle3.10 pip3 pydoc pydoc3.10 python-config python3-config python3.10-config
2to3-3.10 idle3 pip pip3.10 pydoc3 python python3 python3.10 python3.10-gdb.py
and a pyenv-virtualenv virtual environment has only the executables that one would get inside the virtual environment's directory structure
$ pyenv virtualenv 3.10.4 venv
$ ls "${PYENV_ROOT}/versions/venv"
bin include lib lib64 pyvenv.cfg
$ ls "${PYENV_ROOT}/versions/venv/bin/"
Activate.ps1 activate activate.csh activate.fish pip pip3 pip3.10 pydoc python python3 python3.10
by default after creation, the venv virtual environment doesn't know about the executables of the Python version it is associated with, like 2to3
$ pyenv activate venv
(venv) $ 2to3 --help
pyenv: 2to3: command not found
The `2to3' command exists in these Python versions:
3.10.4
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
so to give a virtual environment like venv access to these executables, you add both it and the Python version that created it to pyenv global, so that pyenv will "fall back" to the Python version when an executable isn't found in the virtual environment
(venv) $ pyenv deactivate
$ pyenv global venv 3.10.4
(venv) $ pyenv global
venv
3.10.4
(venv) $ 2to3 --help | head -n 3
Usage: 2to3 [options] file|dir ...
Options:
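As a quick sanity check, pyenv which reports the full path of the executable a shim will actually run; with the global fallback configured as above, the 2to3 shim should resolve into the 3.10.4 installation (illustrative output):
(venv) $ pyenv which 2to3
${PYENV_ROOT}/versions/3.10.4/bin/2to3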
This pattern works for one virtual environment, but how do you maintain access to executables like 2to3 (or pipx, as seen below) when you have multiple virtual environments in play?
(venv) $ pyenv virtualenv 3.10.4 example && pyenv activate example
(example) $ 2to3
pyenv: 2to3: command not found
The `2to3' command exists in these Python versions:
3.10.4
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
Reproducible example
Using the following Dockerfile
FROM debian:bullseye
SHELL ["/bin/bash", "-c"]
USER root
RUN apt-get update -y && \
apt-get install --no-install-recommends -y \
make \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
libncurses5-dev \
xz-utils \
tk-dev \
libxml2-dev \
libxmlsec1-dev \
libffi-dev \
liblzma-dev \
g++ && \
apt-get install -y \
git && \
apt-get -y clean && \
apt-get -y autoremove && \
rm -rf /var/lib/apt/lists/*
# Install pyenv and pyenv-virtualenv
ENV PYENV_RELEASE_VERSION=2.3.0
RUN git clone --depth 1 https://github.com/pyenv/pyenv.git \
--branch "v${PYENV_RELEASE_VERSION}" \
--single-branch \
~/.pyenv && \
pushd ~/.pyenv && \
src/configure && \
make -C src && \
echo 'export PYENV_ROOT="${HOME}/.pyenv"' >> ~/.bashrc && \
echo 'export PATH="${PYENV_ROOT}/bin:${PATH}"' >> ~/.bashrc && \
echo 'eval "$(pyenv init -)"' >> ~/.bashrc && \
. ~/.bashrc && \
git clone --depth 1 https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv && \
echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc
# Install CPython
ENV PYTHON_VERSION=3.10.4
RUN . ~/.bashrc && \
echo "Install Python ${PYTHON_VERSION}" && \
PYTHON_MAKE_OPTS="-j8" pyenv install "${PYTHON_VERSION}"
# Make 'base' virtual environment, add it and its Python version to global for
# executables like 2to3 or pipx to be findable
# c.f. https://github.com/pyenv/pyenv-virtualenv/issues/16#issuecomment-37640961
# and then install pipx into the 'base' virtual environment and use pipx to install
# pepotron
RUN . ~/.bashrc && \
pyenv virtualenv "${PYTHON_VERSION}" base && \
echo "" && echo "Python ${PYTHON_VERSION} has additional executables..." && \
ls -lh "${PYENV_ROOT}/versions/${PYTHON_VERSION}/bin" && \
echo "" && echo "...compared to 'base' virtualenv made with Python ${PYTHON_VERSION}" && \
ls -lh "${PYENV_ROOT}/versions/base/bin" && \
echo "" && echo "...because 'base' is actually a symlink" && \
ls -lh "${PYENV_ROOT}/versions/" && \
pyenv global base "${PYTHON_VERSION}" && \
python -m pip --quiet install --upgrade pip setuptools wheel && \
python -m pip --quiet install pipx && \
python -m pipx ensurepath && \
eval "$(register-python-argcomplete pipx)" && \
pipx install pepotron
WORKDIR /home/data
built with
docker build . --file Dockerfile --tag pyenv/multiple-virtualenvs:debug
it can be run with the following to demonstrate the problem
$ docker run --rm -ti pyenv/multiple-virtualenvs:debug
(base) root@26dfa530cd82:/home/data# pyenv global
base
3.10.4
(base) root@26dfa530cd82:/home/data# 2to3 --help | head -n 3
Usage: 2to3 [options] file|dir ...
Options:
(base) root@26dfa530cd82:/home/data# pipx list
venvs are in /root/.local/pipx/venvs
apps are exposed on your $PATH at /root/.local/bin
package pepotron 0.6.0, installed using Python 3.10.4
- bpo
- pep
(base) root@26dfa530cd82:/home/data# pep 3.11
https://peps.python.org/pep-0664/
(base) root@26dfa530cd82:/home/data# pyenv virtualenv 3.10.4 example
(base) root@26dfa530cd82:/home/data# pyenv activate example
pyenv-virtualenv: prompt changing will be removed from future release. configure `export PYENV_VIRTUALENV_DISABLE_PROMPT=1' to simulate the behavior.
(example) root@26dfa530cd82:/home/data# 2to3
pyenv: 2to3: command not found
The `2to3' command exists in these Python versions:
3.10.4
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
(example) root@26dfa530cd82:/home/data# pipx
pyenv: pipx: command not found
The `pipx' command exists in these Python versions:
3.10.4/envs/base
base
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
So how can one have something like pipx, which is designed to be installed globally, work globally when it is installed in a pyenv-virtualenv virtual environment, so that nothing has to be installed with pip in the system Python?
It would seem that instead of ever using pyenv activate to activate a virtual environment, you would need to deactivate any active virtual environment and then only use pyenv global <virtual environment name> <virtual environment Python version> to effectively switch environments. I assume that can't be the only way to use Python version executables inside a virtual environment, as that would seem to remove the point of pyenv-virtualenv having a separate pyenv activate CLI API.

You can directly execute the pipx binary in the pyenv prefix; it should work correctly.
Pyenv's shims mechanism isn't really designed for global binaries like this. I naively expected that the global environment would act as a fallback when a local environment doesn't have an installed binary, but I think pyenv only looks at the system Python before falling back to $PATH.
So, if you don't install pipx into the system Python (which, if you're not installing a system pip, I doubt you're doing), then the naive fallback doesn't work.
An alternative is to run pyenv with a temporary environment, i.e.
PYENV_VERSION=my-pipx-env pyenv exec pipx.
I'd want to make this an executable, so I'd suggest adding something like this into a special PATH directory that takes precedence over the pyenv paths:
#!/usr/bin/env bash
set -eu
export PYENV_VERSION="pipx"
exec "${PYENV_ROOT}/libexec/pyenv" exec pipx "$#"
Although, I'd be tempted to forgo the whole activation logic and just directly exec the pipx binary from the environment /bin to avoid running into any shell configuration errors.
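For instance, a minimal wrapper along those lines (a sketch, assuming pipx was installed into a virtualenv named base as in the Dockerfile above):
#!/usr/bin/env bash
# Bypass pyenv's shims and activation entirely: run pipx straight from
# the virtualenv's bin directory (the env name 'base' is an assumption).
set -eu
PYENV_ROOT="${PYENV_ROOT:-$HOME/.pyenv}"
exec "${PYENV_ROOT}/versions/base/bin/pipx" "$@"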

Related

Deploying Jenkins using skaffold via GitHub Action Runner

I am deploying Jenkins using Skaffold via a GitHub Actions runner.
Skaffold is installed on top of the default GitHub runner image.
The pod is restarting due to a CrashLoopBackOff error.
I am not sure why it is happening.
When I am deploying runner over Google Kubernetes Engine my runner is failing because of following error:
A runner exists with the same name
√ Successfully replaced the runner
√ Runner connection is good
# Runner settings
√ Settings Saved.
√ Connected to GitHub
Current runner version: '2.294.0'
2022-12-01 06:03:57Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.299.1 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should be back online within 10 seconds.
Runner update process finished.
Runner listener exit because of updating, re-launch runner in 5 seconds
Restarting runner...
/home/docker/actions-runner/run-helper.sh: line 20: /home/docker/actions-runner/bin/Runner.Listener: No such file or directory
Exiting with unknown error code: 127
Exiting runner...
Following is the Dockerfile used for the runner:
FROM ubuntu:22.04
# installing skaffold
RUN apt-get update -y && apt-get upgrade -y sudo
RUN apt-get install -y curl
RUN curl -LO https://storage.googleapis.com/skaffold/releases/v2.0.2/skaffold-linux-amd64 \
&& sudo chmod +x skaffold-linux-amd64 \
&& sudo mv skaffold-linux-amd64 /usr/local/bin/skaffold
#install
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update -y && apt-get upgrade -y && useradd -m docker
RUN apt-get install -y curl jq build-essential libssl-dev libffi-dev python3 python3-venv python3-dev ca-certificates gnupg2 iputils-ping software-properties-common apt-transport-https lsb-release git zip unzip postgresql-client python3-pip npm
RUN ln -sf /usr/bin/python3 /usr/bin/python
# set the github runner version
ARG RUNNER_VERSION="2.294.0"
# cd into the user directory, download and unzip the github actions runner
RUN cd /home/docker && mkdir actions-runner && cd actions-runner \
&& curl -O -L https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz \
&& tar xzf ./actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz
# install some additional dependencies
RUN chown -R docker ~docker && /home/docker/actions-runner/bin/installdependencies.sh
#Docker
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - && \
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable" && \
apt-get update && \
apt-get -y install docker-ce
# Downloading gcloud package
RUN curl https://dl.google.com/dl/cloudsdk/release/google-cloud-sdk.tar.gz > /tmp/google-cloud-sdk.tar.gz
# Installing the package
RUN mkdir -p /usr/local/gcloud \
&& tar -C /usr/local/gcloud -xvf /tmp/google-cloud-sdk.tar.gz \
&& /usr/local/gcloud/google-cloud-sdk/install.sh
# Adding the package path to local
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin
#Install Kubectl
RUN curl -LO "https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl" \
&& chmod +x ./kubectl \
&& mv ./kubectl /usr/local/bin/kubectl
# copy over the start.sh script
COPY start.sh start.sh
# make the script executable
RUN chmod +x start.sh && mv start.sh /home/docker
# since the config and run script for actions are not allowed to be run by root,
# set the user to "docker" so all subsequent commands are run as the docker user
USER docker
# set the entrypoint to the start.sh script
ENTRYPOINT ["/home/docker/start.sh"] '''
Below is the startup script:
#!/bin/bash
SNAPTIME=`date '+%Y%m%d%H%M%S'`
echo "Started $SNAPTIME"
ORGANIZATION=$ORGANIZATION
ACCESS_TOKEN=`cat /etc/pat/pat`
GH_PROJECT=$GH_PROJECT
RUNNER_NAME="${RUNNER_NAME:-RUN$SNAPTIME}"
RUNNER_LABELS="${RUNNER_LABELS:-simple}"
REG_TOKEN=$(curl -sX POST -H "Authorization: token ${ACCESS_TOKEN}" https://api.github.com/repos/${ORGANIZATION}/$GH_PROJECT/actions/runners/registration-token | jq .token --raw-output)
# gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
cd /home/docker/actions-runner
./config.sh --name $RUNNER_NAME --labels ${RUNNER_LABELS} --url https://github.com/${ORGANIZATION}/${GH_PROJECT} --unattended --replace --token ${REG_TOKEN}
cleanup() {
echo "Removing runner..."
./config.sh remove --unattended --token ${REG_TOKEN}
}
trap 'cleanup; exit 130' INT
trap 'cleanup; exit 143' TERM
./run.sh & wait $!
The pod restarts whenever load comes into it:
runner-automation-dev-docker-595f48c7dc-k2wbz 1/2 CrashLoopBackOff 7 (67s ago) 18m
I am not sure what exactly is causing this issue.
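One pattern worth trying (an assumption based on the log above, where the crash follows the 2.294.0 → 2.299.1 self-update replacing the binaries baked into the image): register the runner with automatic updates disabled, which recent actions/runner releases support via a --disableupdate flag. A sketch of the changed config.sh call in start.sh:
# Pin the runner version by opting out of self-updates, so run-helper.sh
# never points at binaries that were swapped out mid-run.
./config.sh --name $RUNNER_NAME \
  --labels ${RUNNER_LABELS} \
  --url https://github.com/${ORGANIZATION}/${GH_PROJECT} \
  --unattended --replace --disableupdate \
  --token ${REG_TOKEN}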

AWS Lambda - Swift Operation not permitted

I am trying to compile Swift code via AWS Lambda.
Therefore I am using an Ubuntu 18.04 image as a base.
The Swift Version is 5.0.1.
When the image is executed locally, it works fine.
When I try to execute it in AWS Lambda, I get the following error:
/usr/bin/ld.gold: fatal error: /tmp/project/src/a.out: Operation not permitted
clang-7: error: linker command failed with exit code 1 (use -v to see invocation)
I think the problem is caused by the read-only AWS Lambda container, which only allows writing into the /tmp/ folder.
Do you know how to fix this error? It seems that Swift needs write access to folders it doesn't have permission for.
Dockerfile
FROM ubuntu:18.04
# install clang
RUN apt-get update
RUN apt-get install -y clang
# install wget
RUN apt-get install -y wget
# install swift dependencies
RUN apt-get install -y libcurl3 libpython2.7 libpython2.7-dev
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get install -y --no-install-recommends \
binutils \
git \
libc6-dev \
libcurl4 \
libedit2 \
libgcc-5-dev \
libpython2.7 \
libsqlite3-0 \
libstdc++-5-dev \
libxml2 \
pkg-config \
tzdata \
zlib1g-dev \
libbsd-dev
RUN apt-get install -y libicu-dev
# install swift 5.0.1
RUN wget https://swift.org/builds/swift-5.0.1-release/ubuntu1804/swift-5.0.1-RELEASE/swift-5.0.1-RELEASE-ubuntu18.04.tar.gz
RUN tar xzf swift-5.0.1-RELEASE-ubuntu18.04.tar.gz
RUN mv swift-5.0.1-RELEASE-ubuntu18.04 /usr/lib/swift
RUN echo "export PATH=/usr/lib/swift/usr/bin:$PATH" >> ~/.bashrc
RUN . ~/.bashrc
RUN chmod -R o+r /usr/lib/swift
This is the command executed in the AWS-Lambda handler function:
swiftc hello_world.swift -o a.out
hello_world.swift
print("Hello World!")
Your output must be written to the /tmp folder:
swiftc hello_world.swift -o /tmp/a.out
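Since only /tmp is writable, it can help to do the whole compile there, not just the output (a sketch of the handler's shell step; paths are illustrative):
# Copy the source under /tmp and compile with all writes confined there,
# because the rest of the Lambda filesystem is read-only.
cp hello_world.swift /tmp/
cd /tmp
swiftc hello_world.swift -o /tmp/a.out
/tmp/a.out   # prints: Hello World!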

Problem installing modules with cpanm in Perl

I am absolutely new to Perl world. I've picked up a project left behind by a former employee & am trying to get it to work. The project was originally in docker form & the requirement now is to run it in a non-docker form (don't get me started on this!) The project works like a charm in its docker form. I've made the assumption that if I were to install all the packages installed in the docker file, the script should work. In the process of installing the packages, I am bumping into issues.
While the docker image uses ubuntu, the server I am trying on now uses RHEL7
DOCKER FILE
FROM ubuntu:focal-20200703 as ubuntu
FROM ubuntu as mytool
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y -qq build-essential libssl-dev tzdata bash curl wget perl cpanminus libcrypt-ssleay-perl libnet-ssleay-perl
RUN cpanm -n Data::Printer \
Log::Log4perl \
List::Util \
Test::More \
JSON \
JSON::XS \
YAML \
YAML::XS \
GitLab::API::v4 \
IO::Socket::SSL \
HTML::HashTable \
REST::Client \
MIME::Base64 \
File::Spec \
File::Basename \
File::Path \
List::MoreUtils \
DateTime::Format::ISO8601 \
Digest::MD5
ENV TZ=Pacific/Auckland
RUN echo $TZ > /etc/timezone && ln -snf /usr/share/zoneinfo/$TZ /etc/localtime
COPY docker/context/usr/local/bin/entrypoint.sh /usr/local/bin/entrypoint.sh
COPY docker/context/usr/local/etc/mytool/mytool.env.ctmpl /usr/local/etc/mytool/mytool.env.ctmpl
RUN mkdir -p /usr/local/bin /usr/local/bin/cache.d
WORKDIR /usr/local/bin
COPY mytool /usr/local/bin/
COPY lib/ /usr/local/bin/lib
# Container pilot obligatory parameters START
ENV SERVICE_NAME=mytool
ENV SERVICE_PRE_EXEC=/bin/true
ENV SERVICE_HEALTHCHECK_EXEC="test -f /usr/local/etc/mytool/mytool.env"
ENV SERVICE_TEMPLATE_CONFIG_PAIRS=/usr/local/etc/mytool/mytool.env.ctmpl:/usr/local/etc/mytool/mytool.env
EXPOSE 8080
ENTRYPOINT [""]
CMD ["/usr/local/bin/entrypoint.sh"]
# Container pilot obligatory parameters END
I've installed build-essential libssl-dev tzdata bash curl wget perl cpanminus libcrypt-ssleay-perl libnet-ssleay-perl successfully.
The problem is with the cpanm stuff from the docker file
[root@npcrver01 home]# cpanm Log::Log4perl
! Finding Log::Log4perl on cpanmetadb failed.
! cannot open file '/root/.cpanm/sources/http%www.cpan.org/02packages.details.txt.gz': No such file or directory opening compressed index
! Couldn't find module or a distribution Log::Log4perl ()
Please could someone help me here
gosh it's the proxy setting. I had to do
export https_proxy=http://<IP>:<port>
export http_proxy=http://<IP>:<port>
export HTTPS_PROXY=http://<IP>:<port>
export HTTP_PROXY=http://<IP>:<port>
Everything is working fine now
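To make that survive new shells, one option is to persist the exports (a sketch; the proxy host and port below are placeholders):
# Append the proxy settings to the shell profile of the account running cpanm.
cat >> ~/.bashrc <<'EOF'
export http_proxy=http://proxy.example.com:3128
export https_proxy=http://proxy.example.com:3128
export HTTP_PROXY=http://proxy.example.com:3128
export HTTPS_PROXY=http://proxy.example.com:3128
EOF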

pyenv install with .python-version and .python-virtualenv fails on macOS Big Sur

This is only partly related to #1737
I have just upgraded to the new macOS Big Sur.
I have installed the Xcode 12.3 beta and configured it with Command Line Tools 12.3 beta.
If I do:
$ CFLAGS="-I$(brew --prefix openssl)/include -I$(brew --prefix bzip2)/include -I$(brew --prefix readline)/include -I$(xcrun --show-sdk-path)/usr/include" LDFLAGS="-L$(brew --prefix openssl)/lib -L$(brew --prefix readline)/lib -L$(brew --prefix zlib)/lib -L$(brew --prefix bzip2)/lib" pyenv install --patch 3.8.0 < <(curl -sSL https://github.com/python/cpython/commit/8ea6353.patch\?full_index\=1)
as per the instructions of this blog: https://dev.to/kojikanao/install-python-3-8-0-via-pyenv-on-bigsur-4oee It works.
However, I started using pyenv after finding a very attractive way of managing many python envs through automatic activation as described in this blog: https://glhuilli.github.io/virtual-environments.html
Since I upgraded, I have not been able to get this to work.
Questions:
1. When I cd into a directory with .python-version and .python-virtualenv, the script prompts me to create a new env with pyenv install. This fails with the ./Modules/pwdmodule.c error. How can I alter the above script in order to create an environment using .python-version and .python-virtualenv? I can obviously provide a different python version in the script, but what about the name of the virtual environment? How can I include that?
2. I want the new virtual environment contents to be located in the directory where pyenv is called and not /Users/username/.pyenv. How can this be done? I am sure others are facing similar issues. Will these be fixed eventually? Ideally, I would like to be able to just do pyenv install and be done...
Thanks in advance.
So, about question 1: the answer is that pyenv install will not work at the moment. However, as long as the required Python version is installed, the script will work like a charm. So you will have to install it in a different way (not with pyenv install).
Example:
Suppose you are given two files:
.python-version
.python-virtualenv
respectively encapsulating: 3.8.2 and test-venv. Then just run:
CFLAGS="-I$(brew --prefix openssl)/include -I$(brew --prefix bzip2)/include -I$(brew --prefix readline)/include -I$(xcrun --show-sdk-path)/usr/include"
LDFLAGS="-L$(brew --prefix openssl)/lib -L$(brew --prefix readline)/lib -L$(brew --prefix zlib)/lib -L$(brew --prefix bzip2)/lib"
pyenv install --patch \$(head -n 1 .python-version) < <(curl -sSL https://github.com/python/cpython/commit/8ea6353.patch\?full_index\=1)
This should successfully install a pyenv for 3.8.2.
Then just do:
pyenv virtualenv $(head -n 1 .python-virtualenv)
Then if you run:
$ pyenv virtualenvs
3.8.2/envs/test-venv (created from /Users/{your-pc-name}/.pyenv/versions/3.8.2)
test-venv (created from /Users/{your-pc-name}/.pyenv/versions/3.8.2)
you will confirm that the new env has been created.
About question 2: Here is the updated script:
# If you come from bash you might have to change your $PATH.
# export PATH=$HOME/bin:/usr/local/bin:$PATH
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
# Automatic venv activation
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
export PYENV_VIRTUALENV_DISABLE_PROMPT=1
# Undo any existing alias for `cd`
unalias cd 2>/dev/null
# Method that verifies all requirements and activates the virtualenv
hasAndSetVirtualenv() {
# .python-version is mandatory for .python-virtualenv but not vice versa
if [ -f .python-virtualenv ]; then
if [ ! -f .python-version ]; then
echo "To use .python-virtualenv you need a .python-version"
return 1
fi
fi
# Check if pyenv has the Python version needed.
# If not (or pyenv not available) exit with code 1 and the respective instructions.
if [ -f .python-version ]; then
if [ -z "`which pyenv`" ]; then
echo "Install pyenv see https://github.com/yyuu/pyenv"
return 1
elif [ -n "`pyenv versions 2>&1 | grep 'not installed'`" ]; then
# Message "not installed" is automatically generated by `pyenv versions`
echo 'run "pyenv install"'
return 1
fi
fi
# Create and activate the virtualenv if all conditions above are successful
# Also, if virtualenv is already created, then just activate it.
if [ -f .python-virtualenv ]; then
VIRTUALENV_NAME="`cat .python-virtualenv`"
PYTHON_VERSION="`cat .python-version`"
MY_ENV=$PYENV_ROOT/versions/$PYTHON_VERSION/envs/$VIRTUALENV_NAME
([ -d $MY_ENV ] || virtualenv $MY_ENV -p `which python`) && \
source $MY_ENV/bin/activate
fi
}
pythonVirtualenvCd () {
# move to a folder + run the pyenv + virtualenv script
cd "$#" && hasAndSetVirtualenv
}
# Every time you move to a folder, run the pyenv + virtualenv script
alias cd="pythonVirtualenvCd"
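With that in the shell profile, a project directory only needs the two marker files to get automatic activation; a usage sketch (names are illustrative):
# Set up a project that auto-activates its virtualenv on cd.
mkdir -p ~/projects/demo
echo "3.8.2" > ~/projects/demo/.python-version
echo "test-venv" > ~/projects/demo/.python-virtualenv
cd ~/projects/demo   # the cd alias runs hasAndSetVirtualenv and activates test-venv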

Spark-submit AWS EMR with anaconda installed python libraries

I launch an EMR cluster with boto3 from a separate ec2 instance and use a bootstrapping script that looks like this:
#!/bin/bash
############################################################################
#For all nodes including master #########
############################################################################
wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh -b -p /mnt1/anaconda3
export PATH=/mnt1/anaconda3/bin:$PATH
echo "export PATH="/mnt1/anaconda3/bin:$PATH"" >> ~/.bash_profile
sudo sed -i -e '$a\export PYSPARK_PYTHON=/mnt1/anaconda3/bin/python' /etc/spark/conf/spark-env.sh
echo "export PYSPARK_PYTHON="/mnt1/anaconda3/bin/python3"" >> ~/.bash_profile
conda install -c conda-forge -y shap
conda install -c conda-forge -y lightgbm
conda install -c anaconda -y numpy
conda install -c anaconda -y pandas
conda install -c conda-forge -y pyarrow
conda install -c anaconda -y boto3
############################################################################
#For master #########
############################################################################
if [ `grep 'isMaster' /mnt/var/lib/info/instance.json | awk -F ':' '{print $2}' | awk -F ',' '{print $1}'` = 'true' ]; then
sudo sed -i -e '$a\export PYSPARK_PYTHON=/mnt1/anaconda3/bin/python' /etc/spark/conf/spark-env.sh
echo "export PYSPARK_PYTHON="/mnt1/anaconda3/bin/python3"" >> ~/.bash_profile
sudo yum -y install git-core
conda install -c conda-forge -y jupyterlab
conda install -y jupyter
conda install -c conda-forge -y s3fs
conda install -c conda-forge -y nodejs
pip install spark-df-profiling
jupyter labextension install jupyterlab_filetree
jupyter labextension install @jupyterlab/toc
fi
Then I add a step programmatically to the running cluster using add_job_flow_steps
action = conn.add_job_flow_steps(JobFlowId=curr_cluster_id, Steps=layer_function_steps)
The step is a spark-submit that is perfectly formed.
In one of the imported python files I import boto3. The error I get is
ImportError: No module named boto3
Clearly I am installing this library. If I SSH into the master node and run
python
import boto3
it works fine. Is there some kind of issue with spark-submit using the installed libraries, since I am doing a conda install?
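One thing worth verifying first (an assumption on my part, not something the question confirms): whether the step submitted via add_job_flow_steps actually runs under the Anaconda interpreter at all, since a non-interactive spark-submit won't source ~/.bash_profile. The interpreter can be pinned on the step itself, e.g.:
# Pin driver and executor Python for this job so conda-installed libraries
# (boto3, etc.) are importable regardless of shell profiles.
spark-submit \
  --conf spark.pyspark.python=/mnt1/anaconda3/bin/python3 \
  --conf spark.pyspark.driver.python=/mnt1/anaconda3/bin/python3 \
  my_job.py   # my_job.py is a placeholder for the actual step script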
AWS has a project (AWS Data Wrangler) that helps with EMR launching.
This snippet should work to launch your cluster with Python 3 enabled:
import awswrangler as wr
cluster_id = wr.emr.create_cluster(
cluster_name="wrangler_cluster",
logging_s3_path=f"s3://BUCKET_NAME/emr-logs/",
emr_release="emr-5.28.0",
subnet_id="SUBNET_ID",
emr_ec2_role="EMR_EC2_DefaultRole",
emr_role="EMR_DefaultRole",
instance_type_master="m5.xlarge",
instance_type_core="m5.xlarge",
instance_type_task="m5.xlarge",
instance_ebs_size_master=50,
instance_ebs_size_core=50,
instance_ebs_size_task=50,
instance_num_on_demand_master=1,
instance_num_on_demand_core=1,
instance_num_on_demand_task=1,
instance_num_spot_master=0,
instance_num_spot_core=1,
instance_num_spot_task=1,
spot_bid_percentage_of_on_demand_master=100,
spot_bid_percentage_of_on_demand_core=100,
spot_bid_percentage_of_on_demand_task=100,
spot_provisioning_timeout_master=5,
spot_provisioning_timeout_core=5,
spot_provisioning_timeout_task=5,
spot_timeout_to_on_demand_master=True,
spot_timeout_to_on_demand_core=True,
spot_timeout_to_on_demand_task=True,
python3=True, # Relevant argument
spark_glue_catalog=True,
hive_glue_catalog=True,
presto_glue_catalog=True,
bootstraps_paths=["s3://BUCKET_NAME/bootstrap.sh"], # Relevant argument
debugging=True,
applications=["Hadoop", "Spark", "Ganglia", "Hive"],
visible_to_all_users=True,
key_pair_name=None,
spark_jars_path=[f"s3://...jar"],
maximize_resource_allocation=True,
keep_cluster_alive_when_no_steps=True,
termination_protected=False,
spark_pyarrow=True, # Relevant argument
tags={
"foo": "boo"
}
)
bootstrap.sh content:
#!/usr/bin/env bash
set -e
echo "Installing Python libraries..."
sudo pip-3.6 install -U awswrangler
sudo pip-3.6 install -U LIBRARY1
sudo pip-3.6 install -U LIBRARY2
...