Spark-submit AWS EMR with anaconda installed python libraries - pyspark

I launch an EMR cluster with boto3 from a separate ec2 instance and use a bootstrapping script that looks like this:
#!/bin/bash
############################################################################
#For all nodes including master #########
############################################################################
wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh -b -p /mnt1/anaconda3
export PATH=/mnt1/anaconda3/bin:$PATH
echo "export PATH="/mnt1/anaconda3/bin:$PATH"" >> ~/.bash_profile
sudo sed -i -e '$a\export PYSPARK_PYTHON=/mnt1/anaconda3/bin/python' /etc/spark/conf/spark-env.sh
echo "export PYSPARK_PYTHON="/mnt1/anaconda3/bin/python3"" >> ~/.bash_profile
conda install -c conda-forge -y shap
conda install -c conda-forge -y lightgbm
conda install -c anaconda -y numpy
conda install -c anaconda -y pandas
conda install -c conda-forge -y pyarrow
conda install -c anaconda -y boto3
############################################################################
#For master #########
############################################################################
if [ `grep 'isMaster' /mnt/var/lib/info/instance.json | awk -F ':' '{print $2}' | awk -F ',' '{print $1}'` = 'true' ]; then
sudo sed -i -e '$a\export PYSPARK_PYTHON=/mnt1/anaconda3/bin/python' /etc/spark/conf/spark-env.sh
echo "export PYSPARK_PYTHON="/mnt1/anaconda3/bin/python3"" >> ~/.bash_profile
sudo yum -y install git-core
conda install -c conda-forge -y jupyterlab
conda install -y jupyter
conda install -c conda-forge -y s3fs
conda install -c conda-forge -y nodejs
pip install spark-df-profiling
jupyter labextension install jupyterlab_filetree
jupyter labextension install @jupyterlab/toc
fi
Then I add a step programmatically to the running cluster using add_job_flow_steps:
action = conn.add_job_flow_steps(JobFlowId=curr_cluster_id, Steps=layer_function_steps)
The step is a spark-submit that is perfectly formed.
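For reference, the step list is equivalent to an AWS CLI call along these lines (the step name and script path are placeholders, not my actual job):
aws emr add-steps \
    --cluster-id "$CURR_CLUSTER_ID" \
    --steps 'Type=Spark,Name=layer_function,ActionOnFailure=CONTINUE,Args=[--deploy-mode,client,s3://my-bucket/my_job.py]'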
In one of the imported python files I import boto3. The error I get is
ImportError: No module named boto3
Clearly I am installing this library. If I SSH into the master node and run
python
import boto3
it works fine. Is there some kind of issue with spark-submit using the installed libraries, since I am doing a conda install?
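For reference, a quick way to see which interpreter spark-submit actually uses (the /tmp path is just an example):
echo 'import sys; print(sys.executable)' > /tmp/which_python.py
spark-submit /tmp/which_python.py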

AWS has a project (AWS Data Wrangler) that helps with EMR launching.
This snippet should work to launch your cluster with Python 3 enabled:
import awswrangler as wr
cluster_id = wr.emr.create_cluster(
    cluster_name="wrangler_cluster",
    logging_s3_path=f"s3://BUCKET_NAME/emr-logs/",
    emr_release="emr-5.28.0",
    subnet_id="SUBNET_ID",
    emr_ec2_role="EMR_EC2_DefaultRole",
    emr_role="EMR_DefaultRole",
    instance_type_master="m5.xlarge",
    instance_type_core="m5.xlarge",
    instance_type_task="m5.xlarge",
    instance_ebs_size_master=50,
    instance_ebs_size_core=50,
    instance_ebs_size_task=50,
    instance_num_on_demand_master=1,
    instance_num_on_demand_core=1,
    instance_num_on_demand_task=1,
    instance_num_spot_master=0,
    instance_num_spot_core=1,
    instance_num_spot_task=1,
    spot_bid_percentage_of_on_demand_master=100,
    spot_bid_percentage_of_on_demand_core=100,
    spot_bid_percentage_of_on_demand_task=100,
    spot_provisioning_timeout_master=5,
    spot_provisioning_timeout_core=5,
    spot_provisioning_timeout_task=5,
    spot_timeout_to_on_demand_master=True,
    spot_timeout_to_on_demand_core=True,
    spot_timeout_to_on_demand_task=True,
    python3=True,  # Relevant argument
    spark_glue_catalog=True,
    hive_glue_catalog=True,
    presto_glue_catalog=True,
    bootstraps_paths=["s3://BUCKET_NAME/bootstrap.sh"],  # Relevant argument
    debugging=True,
    applications=["Hadoop", "Spark", "Ganglia", "Hive"],
    visible_to_all_users=True,
    key_pair_name=None,
    spark_jars_path=[f"s3://...jar"],
    maximize_resource_allocation=True,
    keep_cluster_alive_when_no_steps=True,
    termination_protected=False,
    spark_pyarrow=True,  # Relevant argument
    tags={
        "foo": "boo"
    }
)
bootstrap.sh content:
#!/usr/bin/env bash
set -e
echo "Installing Python libraries..."
sudo pip-3.6 install -U awswrangler
sudo pip-3.6 install -U LIBRARY1
sudo pip-3.6 install -U LIBRARY2
...
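After bootstrap, a quick sanity check over SSH on the master node might look like this (assuming the libraries installed above):
python3 -c "import awswrangler, boto3; print(boto3.__version__)"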

Related

Mac M1: Docker compose fails in vscode - mongodb-database-tools not installable

I'm running this Dockerfile on a Mac M1:
Dockerfile
ARG VARIANT=16-bullseye
FROM mcr.microsoft.com/vscode/devcontainers/javascript-node:0-${VARIANT}
RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
&& apt-get -y install --no-install-recommends vim wget redis-tools
ARG MONGO_CLI_VERSION=4.4
RUN wget -qO - https://www.mongodb.org/static/pgp/server-${MONGO_CLI_VERSION}.asc | sudo apt-key add -
RUN echo "deb http://repo.mongodb.org/apt/debian buster/mongodb-org/${MONGO_CLI_VERSION} main" | tee /etc/apt/sources.list.d/mongodb-org-${MONGO_CLI_VERSION}.list
RUN apt-get update && export DEBIAN_FRONTEND=noninteractive \
&& apt-get -y install --no-install-recommends mongodb-mongosh \
&& apt-get clean -y && rm -rf /var/lib/apt/lists/*
RUN wget https://fastdl.mongodb.org/tools/db/mongodb-database-tools-debian11-x86_64-100.5.3.deb
RUN apt install ./mongodb-database-tools-*-100.5.3.deb
RUN su node -c "wget -O ~/.git-completion.bash https://raw.githubusercontent.com/git/git/master/contrib/completion/git-completion.bash"
RUN su node -c "echo -e '\n# Git Completion' >> ~/.bashrc"
RUN su node -c "echo -e 'source ~/.git-completion.bash\n' >> ~/.bashrc"
The error output is shown in the attached image.
On this line:
RUN wget https://fastdl.mongodb.org/tools/db/mongodb-database-tools-debian11-x86_64-100.5.3.deb
You're trying to install a package named mongodb-database-tools-debian11-x86_64-100.5.3.deb that is built for Intel processors (x86_64) in your ARM64 image. That's not going to work.
MongoDB doesn't seem to offer a package for Debian on ARM64 on their download page. They do offer one for Ubuntu ARM64.
There doesn't seem to be a variant of javascript-node that builds on Ubuntu 18 Bionic; however, I think you can keep using this Debian 11 Bullseye variant, because the Ubuntu 16.04 build of mongodb-database-tools seems to install just fine:
RUN wget https://fastdl.mongodb.org/tools/db/mongodb-database-tools-ubuntu1604-arm64-100.5.3.deb
Build the image and test:
docker build -t test . ; docker run --rm test mongodump --version
mongodump version: 100.5.3
git version: 139703c0587796da96c367f365473d0266f9cede
Go version: go1.17.10
os: linux
arch: arm64
compiler: gc
If you want this image to build on both x86 and ARM, check what architecture you're on before downloading the .deb:
RUN if [ "$(arch)" = "aarch64" ] || [ "$(arch)" = "arm64" ]; then\
wget https://fastdl.mongodb.org/tools/db/mongodb-database-tools-ubuntu1604-arm64-100.5.3.deb;\
else\
wget https://fastdl.mongodb.org/tools/db/mongodb-database-tools-debian11-x86_64-100.5.3.deb;\
fi;
RUN apt install ./mongodb-database-tools-*-100.5.3.deb
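If you build with BuildKit, an alternative sketch is to branch on the automatically populated TARGETARCH build argument (amd64, arm64, ...) instead of calling arch at build time (same download URLs as above):
ARG TARGETARCH
RUN if [ "$TARGETARCH" = "arm64" ]; then \
        wget https://fastdl.mongodb.org/tools/db/mongodb-database-tools-ubuntu1604-arm64-100.5.3.deb; \
    else \
        wget https://fastdl.mongodb.org/tools/db/mongodb-database-tools-debian11-x86_64-100.5.3.deb; \
    fi
RUN apt install ./mongodb-database-tools-*-100.5.3.deb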
Alternatively, try another platform, for example x86_64:
https://docs.docker.com/desktop/mac/apple-silicon/
Not all images are available for the ARM64 architecture. You can add --platform linux/amd64 to run an Intel image under emulation.
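For example, reusing the test image tag from the earlier answer:
docker build --platform linux/amd64 -t test .
docker run --rm --platform linux/amd64 test mongodump --version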

How to make Python version executables global across multiple pyenv-virtualenv virtual environments

A pyenv Python version (e.g. 3.10.4) has the "normal" expected Python executables associated with it (e.g., pip, 2to3, pydoc)
$ ls "${PYENV_ROOT}/versions/3.10.4/bin"
2to3 idle idle3.10 pip3 pydoc pydoc3.10 python-config python3-config python3.10-config
2to3-3.10 idle3 pip pip3.10 pydoc3 python python3 python3.10 python3.10-gdb.py
and a pyenv-virtualenv virtual environment has only the executables that one would expect to get inside the virtual environment directory structure
$ pyenv virtualenv 3.10.4 venv
$ ls "${PYENV_ROOT}/versions/venv"
bin include lib lib64 pyvenv.cfg
$ ls "${PYENV_ROOT}/versions/venv/bin/"
Activate.ps1 activate activate.csh activate.fish pip pip3 pip3.10 pydoc python python3 python3.10
By default, after creation, the venv virtual environment doesn't know about executables of the Python version it is associated with, like 2to3:
$ pyenv activate venv
(venv) $ 2to3 --help
pyenv: 2to3: command not found
The `2to3' command exists in these Python versions:
3.10.4
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
So, to give a virtual environment like venv access to these executables, you add both it and the Python version that created it to pyenv global, so that pyenv will "fall back" to the Python version when the executable isn't found:
(venv) $ pyenv deactivate
$ pyenv global venv 3.10.4
(venv) $ pyenv global
venv
3.10.4
(venv) $ 2to3 --help | head -n 3
Usage: 2to3 [options] file|dir ...
Options:
This pattern works for one virtual environment, but how do you maintain access to executables like 2to3 (or pipx, as seen below) when you have multiple virtual environments in play?
(venv) $ pyenv virtualenv 3.10.4 example && pyenv activate example
(example) $ 2to3
pyenv: 2to3: command not found
The `2to3' command exists in these Python versions:
3.10.4
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
Reproducible example
Using the following Dockerfile
FROM debian:bullseye
SHELL ["/bin/bash", "-c"]
USER root
RUN apt-get update -y && \
    apt-get install --no-install-recommends -y \
      make \
      build-essential \
      libssl-dev \
      zlib1g-dev \
      libbz2-dev \
      libreadline-dev \
      libsqlite3-dev \
      wget \
      curl \
      llvm \
      libncurses5-dev \
      xz-utils \
      tk-dev \
      libxml2-dev \
      libxmlsec1-dev \
      libffi-dev \
      liblzma-dev \
      g++ && \
    apt-get install -y \
      git && \
    apt-get -y clean && \
    apt-get -y autoremove && \
    rm -rf /var/lib/apt/lists/*
# Install pyenv and pyenv-virtualenv
ENV PYENV_RELEASE_VERSION=2.3.0
RUN git clone --depth 1 https://github.com/pyenv/pyenv.git \
      --branch "v${PYENV_RELEASE_VERSION}" \
      --single-branch \
      ~/.pyenv && \
    pushd ~/.pyenv && \
    src/configure && \
    make -C src && \
    echo 'export PYENV_ROOT="${HOME}/.pyenv"' >> ~/.bashrc && \
    echo 'export PATH="${PYENV_ROOT}/bin:${PATH}"' >> ~/.bashrc && \
    echo 'eval "$(pyenv init -)"' >> ~/.bashrc && \
    . ~/.bashrc && \
    git clone --depth 1 https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv && \
    echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc
# Install CPython
ENV PYTHON_VERSION=3.10.4
RUN . ~/.bashrc && \
    echo "Install Python ${PYTHON_VERSION}" && \
    PYTHON_MAKE_OPTS="-j8" pyenv install "${PYTHON_VERSION}"
# Make 'base' virtual environment, add it and its Python version to global for
# executables like 2to3 or pipx to be findable
# c.f. https://github.com/pyenv/pyenv-virtualenv/issues/16#issuecomment-37640961
# and then install pipx into the 'base' virtual environment and use pipx to install
# pepotron
RUN . ~/.bashrc && \
    pyenv virtualenv "${PYTHON_VERSION}" base && \
    echo "" && echo "Python ${PYTHON_VERSION} has additional executables..." && \
    ls -lh "${PYENV_ROOT}/versions/${PYTHON_VERSION}/bin" && \
    echo "" && echo "...compared to 'base' virtualenv made with Python ${PYTHON_VERSION}" && \
    ls -lh "${PYENV_ROOT}/versions/base/bin" && \
    echo "" && echo "...because 'base' is actually a symlink" && \
    ls -lh "${PYENV_ROOT}/versions/" && \
    pyenv global base "${PYTHON_VERSION}" && \
    python -m pip --quiet install --upgrade pip setuptools wheel && \
    python -m pip --quiet install pipx && \
    python -m pipx ensurepath && \
    eval "$(register-python-argcomplete pipx)" && \
    pipx install pepotron
WORKDIR /home/data
built with
docker build . --file Dockerfile --tag pyenv/multiple-virtualenvs:debug
it can be run with the following to demonstrate the problem
$ docker run --rm -ti pyenv/multiple-virtualenvs:debug
(base) root@26dfa530cd82:/home/data# pyenv global
base
3.10.4
(base) root@26dfa530cd82:/home/data# 2to3 --help | head -n 3
Usage: 2to3 [options] file|dir ...
Options:
(base) root@26dfa530cd82:/home/data# pipx list
venvs are in /root/.local/pipx/venvs
apps are exposed on your $PATH at /root/.local/bin
package pepotron 0.6.0, installed using Python 3.10.4
- bpo
- pep
(base) root@26dfa530cd82:/home/data# pep 3.11
https://peps.python.org/pep-0664/
(base) root@26dfa530cd82:/home/data# pyenv virtualenv 3.10.4 example
(base) root@26dfa530cd82:/home/data# pyenv activate example
pyenv-virtualenv: prompt changing will be removed from future release. configure `export PYENV_VIRTUALENV_DISABLE_PROMPT=1' to simulate the behavior.
(example) root@26dfa530cd82:/home/data# 2to3
pyenv: 2to3: command not found
The `2to3' command exists in these Python versions:
3.10.4
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
(example) root@26dfa530cd82:/home/data# pipx
pyenv: pipx: command not found
The `pipx' command exists in these Python versions:
3.10.4/envs/base
base
Note: See 'pyenv help global' for tips on allowing both
python2 and python3 to be found.
So how can something like pipx, which is designed to be installed globally, work globally when it is installed in a pyenv-virtualenv virtual environment, so that you don't have to install anything with pip into the system Python?
It would seem that, instead of ever using pyenv activate to activate a virtual environment, you would need to deactivate any virtual environments and then use only pyenv global <virtual environment name> <virtual environment Python version> to effectively switch environments. I assume that can't be the only way to use a Python version's executables inside a virtual environment, as that would seem to remove the point of pyenv-virtualenv having a separate pyenv activate CLI.
You can directly execute the pipx binary in the pyenv prefix; it should work correctly.
Pyenv's shims mechanism isn't really designed for global binaries like this. I naively expected the global environment to act as a fallback when a local environment doesn't have an installed binary, but I think pyenv only looks at the system Python before falling back to $PATH.
So, if you don't install pipx into the system Python (which, if you're not installing a system pip, I doubt you're doing), the naive fallback doesn't work.
An alternative is to run pyenv with a temporary environment, i.e.
PYENV_VERSION=my-pipx-env pyenv exec pipx.
I'd want to make this an executable, so I'd suggest adding something like this into a special PATH directory that takes precedence over the pyenv paths:
#!/usr/bin/env bash
set -eu
export PYENV_VERSION="pipx"
exec "${PYENV_ROOT}/libexec/pyenv" exec pipx "$#"
Although, I'd be tempted to forgo the whole activation logic and just directly exec the pipx binary from the environment /bin to avoid running into any shell configuration errors.
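A minimal sketch of that direct approach, assuming the pipx environment is named base as in the Dockerfile above:
# bypass the shims entirely by calling the binary inside the environment
"${PYENV_ROOT}/versions/base/bin/pipx" install pepotron
# or expose it once via a stable symlink on PATH
ln -s "${PYENV_ROOT}/versions/base/bin/pipx" ~/.local/bin/pipx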

CloudFormation - User Data run as Ubuntu user

I have the following user data in my CFN template:
UserData:
  'Fn::Base64':
    !Sub |
      #!/bin/bash
      sudo apt-get update;
      sudo apt-get upgrade -y;
      sudo apt-get -y install python-pip;
      sudo apt-get -y install gcc;
      sudo apt-get -y install gcc-c++;
      sudo apt-get install awscli -y;
      sudo apt-get install python-mysqldb;
      echo "$(pwd)" >> /home/ubuntu/current1.txt
      cd /home/ubuntu/;
      echo "$(pwd)" >> /home/ubuntu/current2.txt
      pip install apache-airflow;
      pip install celery==4.4.0;
      pip install kombu==4.5.0;
      echo "$(pwd)" >> /home/ubuntu/current3.txt
      cd /home/ubuntu/airflow/;
      echo "$(pwd)" >> /home/ubuntu/current4.txt
      mv airflow.cfg airflow.cfg.original_1;
      cd /home/ubuntu/;
      nohup airflow initdb;
      nohup airflow webserver -p 8080 >> webserver.log &
      nohup airflow scheduler >> scheduler.log &
      nohup airflow worker >> worker.log &
If I cd to /home/ubuntu and then install apache-airflow, it still gets installed under root.
I want apache-airflow installed under /home/ubuntu.
How do I install packages as the ubuntu user under /home/ubuntu?
I ran into a similar situation when automating the installation of Ghost on an Ubuntu instance. You can try switching users. I would have to test this for installing a package with pip specifically, but here is an example of how I had to run some setup commands as a non-root user:
su ghost-user << 'EOF'
cd /ghost-app/ghost
ghost install --no-setup --no-stack --dbhost 10.16.11.80 --dbuser ghost --dbpass myterribledbasepassword --dbname ghost_prod
EOF
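Adapted to this question, a sketch might look like the following (untested for pip specifically; --user makes pip install into /home/ubuntu/.local instead of a system path):
su ubuntu << 'EOF'
cd /home/ubuntu
pip install --user apache-airflow
EOF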

PowerShell Core in Debian Docker Container Error

I'm new to Docker and am trying to create a Docker image with Raspbian base and PowerShell Core installed.
EDIT: Updated the Dockerfile to include the libicu52 package, which resolved the main error (libpsl-native or its dependencies not being available). Changed the CMD parameters and now have a different error.
Here is my Dockerfile:
# Download the latest RPi3 Debian image
FROM resin/raspberrypi3-debian:latest
# Update the image and install prerequisites
RUN apt-get update && apt-get install -y \
wget \
libicu52 \
libunwind8 \
&& apt-get clean
# Grab the latest tar.gz
RUN wget https://github.com/PowerShell/PowerShell/releases/download/v6.0.0-rc.2/powershell-6.0.0-rc.2-linux-arm32.tar.gz
# Make folder to put PowerShell
RUN mkdir ~/powershell
# Unpack the tar.gz file
RUN tar -xvf ./powershell-6.0.0-rc.2-linux-arm32.tar.gz -C ~/powershell
# Run PowerShell
CMD pwsh -v
New error:
hostname: you must be root to change the host name
/bin/sh: 1: pwsh: not found
How do I resolve these errors?
Thanks in advance!
Instead of downloading the release archive and extracting it in your container, I'd recommend installing the official apt packages in your Dockerfile from Microsoft's official Debian repository, as described at:
https://learn.microsoft.com/en-us/powershell/scripting/setup/installing-powershell-core-on-macos-and-linux?view=powershell-6#debian-9
So transforming that to Dockerfile format:
# Install powershell related system components
RUN apt-get install -y \
gnupg curl apt-transport-https \
&& apt-get clean
# Import the public repository GPG keys
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
# Register the Microsoft's Debian repository
RUN sh -c 'echo "deb [arch=amd64] https://packages.microsoft.com/repos/microsoft-debian-stretch-prod stretch main" > /etc/apt/sources.list.d/microsoft.list'
# Install PowerShell
RUN apt-get update \
&& apt-get install -y \
powershell
# Start PowerShell
CMD pwsh
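To check the result, a build-and-run sketch (the image tag is arbitrary):
docker build -t pwsh-debian .
docker run --rm -it pwsh-debian pwsh -Command '$PSVersionTable.PSVersion'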
Alternatively, you can also try to start from one of the official Microsoft PowerShell Docker images, but of course then you need to solve the Raspberry Pi installation yourself:
https://hub.docker.com/r/microsoft/powershell/tags/

Not all pre-reqs install correctly for Hyperledger Composer

I've been following the Hyperledger Composer tutorial. I managed to install Ubuntu 16.04 on Hyper-V on my Windows 10 Enterprise. I then started on the following pre-req installation instructions:
https://hyperledger.github.io/composer/installing/installing-prereqs.html
I ran the prereqs-ubuntu.sh script. It ran fine with no errors. I examined the logs and saw that it had successfully installed npm 5.6.0, node 8.9.4, docker 17.12.x, docker-compose 1.13.x, and Python 2.7.12.
However, when I run $ sudo npm --version
it tells me that the npm command is not found.
Same with $ sudo node --version
Not found...?!
Why would that be, when the log clearly shows that npm and node were successfully installed?
Well, here is what I did and how I managed to get through it:
--> install nodejs and npm:
sudo snap install node --classic --channel=8
so you get the latest node 8.
--> then, to solve the "sudo" problem with node, specify the npm prefix:
npm config set prefix ~/.node_modules
add the following to .bash_profile
export PATH=$HOME/.node_modules/bin:$PATH
Now the packages will install into your user directory and no permissions will be harmed.
--> install nvm (to get exactly node 8.9 version on the next step):
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash
or
wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash
Verify:
command -v nvm
which should output 'nvm' if the installation was successful.
--> get and set node 8.9 version:
nvm install v8.9.0
nvm use 8.9.0
--> reset PATHs:
echo export PATH="$HOME/npm/bin:$PATH" >> ~/.bashrc
npm config set prefix ~/npm
echo "export NODE_PATH=$NODE_PATH:/home/$USER/npm/lib/node_modules" >> ~/.bashrc && source ~/.bashrc
--> at this stage, the previous docker setup should be destroyed:
docker kill $(docker ps -q)
docker rm $(docker ps -aq)
docker rmi $(docker images dev-* -q)
--> Installing the rest of prereqs:
sudo apt-add-repository -y ppa:git-core/ppa
sudo apt-get update
# install git
sudo apt-get install -y git
# Ensure that CA certificates are installed
sudo apt-get -y install apt-transport-https ca-certificates
# Add Docker repository key to APT keychain
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Update package lists
sudo apt-get update
# Verifies APT is pulling from the correct Repository
sudo apt-cache policy docker-ce
# Install Docker
echo "# Installing Docker"
sudo apt-get -y install docker-ce
# Add user account to the docker group
sudo usermod -aG docker $(whoami)
# Install docker compose
echo "# Installing Docker-Compose"
sudo curl -L "https://github.com/docker/compose/releases/download/1.13.0/docker-compose-$(uname -s)-$(uname -m)" \
-o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
# Install unzip, required to install hyperledger fabric.
sudo apt-get -y install unzip
--> now you can install the Fabric dev environment (assuming the rest of the prereq components are available):
npm install -g composer-cli
etc.
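Before installing the Fabric dev environment, a quick sanity check of the toolchain might look like this (version expectations taken from the tutorial logs above):
node -v                    # expect v8.9.x
npm -v
docker --version           # expect 17.12.x
docker-compose --version   # expect 1.13.x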
I think you need to log out and close the shell, then restart with a new session, as the shell stores your session state.
Also, after installation, the use of sudo is not recommended, as mentioned on the IBM Hyperledger website.