EMR PySpark "ModuleNotFoundError: No module named 'spacy'" - pyspark

I've been trying, without success, to install spaCy on my EMR cluster to run a PySpark job.
My EMR bootstrap actions look something like this:
pip install --upgrade pip
sudo conda install -c conda-forge spacy
sudo python3 -m spacy download en_core_web_sm
sudo python3 -m spacy download en
sudo python3 -m pip install -U spacy
sudo python3 -m pip install -U boto3
sudo python3 -m pip install -U pandas
sudo python3 -m spacy download en_core_web_sm
sudo python3 -m spacy download en
As you can see above, I've tried installing it via both pip and conda, but neither seems to work.
Surprisingly, it works when I use a Jupyter notebook instead of submitting my PySpark job as an EMR step.

I faced a similar problem. Some things that could help:
Check the stdout and stderr files of the bootstrap actions in EMR. Their location is given by the Log URI, listed under Configuration details in the cluster's Summary section.
Apparently, spaCy depends on Cython, which is not installed automatically. Including the following commands helped:
sudo python3 -m pip install --upgrade pip
sudo python3 -m pip install --upgrade pip setuptools
sudo python3 -m pip install wheel
sudo python3 -m pip install -U Cython
sudo python3 -m pip install -U spacy==2.3.5
sudo python3 -m spacy download en_core_web_sm
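Since the import works from a notebook but fails in a submitted step, it may also be worth checking whether the step runs under a different Python interpreter than the one the bootstrap action installed into. This is only a guess at the cause, not something confirmed above. A minimal sketch for probing it, where `probe` is a hypothetical helper (not part of PySpark) and `sc` is assumed to be an existing SparkContext:

```python
import importlib.util
import socket
import sys


def probe(_=None, module="spacy"):
    """Report the host, the running interpreter, and whether `module` is importable."""
    found = importlib.util.find_spec(module) is not None
    return (socket.gethostname(), sys.executable, found)


if __name__ == "__main__":
    # Driver-side view.
    print(probe())
    # Executor-side view (assumes a live SparkContext named `sc`; uncomment inside a job):
    # print(sc.parallelize(range(8), 8).map(probe).distinct().collect())
```

If the executors report a different `sys.executable` than the interpreter used in the bootstrap script, installing with that interpreter, or pointing `PYSPARK_PYTHON` at the one that has spaCy, is a reasonable next thing to try.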


How to import rdkit in google colab these days?

!wget -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
(The code above is from "Installing RDKit in Google Colab".)
It is one of the solutions from another Stack Overflow post on importing 'rdkit' in Google Colab, but it didn't work for me and failed with this error message:
from rdkit import Chem
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /usr/local/lib/python3.7/site-packages/rdkit/DataStructs/cDataStructs.so)
Does anybody know how to solve this ImportError: `GLIBCXX_3.4.26' not found problem?
I sincerely need help! Big thx!
The answer you linked is a little outdated now. Seems like there is also an issue with installing the latest build of the RDKit (2020.09.3) on Colab. Installing the older version (2020.09.2) seems to solve the issue:
%%bash
wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
conda config --set always_yes yes --set changeps1 no
conda install -q -y -c conda-forge python=3.7
conda install -q -y -c conda-forge rdkit==2020.09.2
Followed by:
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
from rdkit import Chem
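The `sys.path.append` step silently does nothing useful if conda ended up with a different Python version than 3.7, because the directory then does not exist. A small sketch that makes this visible, where `add_site_packages` is a hypothetical helper introduced here for illustration:

```python
import os
import sys


def add_site_packages(path):
    """Append `path` to sys.path once, and only if the directory exists."""
    if not os.path.isdir(path):
        return False
    if path not in sys.path:
        sys.path.append(path)
    return True


# The path below is the one used in the answer; adjust it if conda
# installed a different Python version.
if not add_site_packages("/usr/local/lib/python3.7/site-packages/"):
    print("site-packages directory not found; check the conda Python version")
```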
If you must install the latest build (2020.09.3), I have found a workaround by adding a few lines to the bash cell:
%%bash
add-apt-repository ppa:ubuntu-toolchain-r/test
apt-get update --fix-missing
apt-get dist-upgrade
wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
conda config --set always_yes yes --set changeps1 no
conda install -q -y -c conda-forge python=3.7
conda install -q -y -c conda-forge rdkit
For this to work, the runtime also needs to be restarted. I just add a try/except around the rdkit import to restart the runtime automatically:
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
try:
    from rdkit import Chem
    from rdkit.Chem.Draw import IPythonConsole
except ImportError:
    print('Stopping RUNTIME. Colaboratory will restart automatically. Please run cell again.')
    exit()
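To confirm whether the system libstdc++ actually provides the version tag named in the error, the tags embedded in the library file can be listed. This is a rough sketch equivalent to running `strings` on the library and filtering for GLIBCXX; `glibcxx_versions` is a hypothetical helper, and the library path is the one from the error message:

```python
import re


def glibcxx_versions(lib_path):
    """Return the sorted GLIBCXX_* version tags embedded in a shared library."""
    with open(lib_path, "rb") as fh:
        data = fh.read()
    tags = set(re.findall(rb"GLIBCXX_[0-9]+(?:\.[0-9]+)*", data))
    # Sort numerically so that 3.4.9 comes before 3.4.26.
    return sorted(
        (t.decode("ascii") for t in tags),
        key=lambda t: tuple(int(p) for p in t.split("_")[1].split(".")),
    )


if __name__ == "__main__":
    path = "/usr/lib/x86_64-linux-gnu/libstdc++.so.6"  # taken from the error message
    try:
        versions = glibcxx_versions(path)
        print("GLIBCXX_3.4.26 available:", "GLIBCXX_3.4.26" in versions)
    except OSError as exc:
        print("could not read", path, "-", exc)
```

If the tag is missing before the toolchain upgrade and present afterwards, the upgrade did what was intended.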
Colab link for the first solution: https://colab.research.google.com/drive/1vhsLgzA7A_INMcbU-hG6go4M6axvbUpi?usp=sharing
Colab link for the second solution: https://colab.research.google.com/drive/1Ix0oyUU4cA1b2rD9JfkMhy8M2z5Y_vTL?usp=sharing
I think the best way of doing that is to use condacolab by Jaime Rodríguez-Guerra and Alex Malins.
https://github.com/conda-incubator/condacolab
!pip install -q condacolab
import condacolab
condacolab.install()
then
import condacolab
condacolab.check()
Well, one simple trick is to install deepchem, which comes with rdkit:
!pip install deepchem
from rdkit import Chem

How to install scala kernel for hydrogen?

I'm looking at https://nteract.io/kernels, which explains how to install the Python kernel for Hydrogen:
python -m pip install ipykernel
python -m ipykernel install --user
But which python -m pip install [scala kernel?] command should I use to install the Scala kernel? Or is it a whole other process? Where is the documentation, or a step-by-step guide, on installing a Scala kernel for Hydrogen and working with it?
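Whichever Scala kernel is chosen, my understanding is that Hydrogen picks up kernels that are registered as Jupyter kernelspecs, so a first check is whether a kernel with language "scala" is registered at all. A minimal sketch, assuming the usual Linux kernelspec directories (other platforms use different locations); `list_kernels` is a hypothetical helper:

```python
import json
import os

# Common kernelspec locations on Linux (assumed; other platforms differ).
DEFAULT_KERNEL_DIRS = [
    os.path.expanduser("~/.local/share/jupyter/kernels"),
    "/usr/local/share/jupyter/kernels",
    "/usr/share/jupyter/kernels",
]


def list_kernels(kernel_dirs=DEFAULT_KERNEL_DIRS):
    """Map kernel name -> language by reading each kernel.json that is found."""
    kernels = {}
    for base in kernel_dirs:
        if not os.path.isdir(base):
            continue
        for name in sorted(os.listdir(base)):
            spec_file = os.path.join(base, name, "kernel.json")
            if os.path.isfile(spec_file):
                with open(spec_file) as fh:
                    spec = json.load(fh)
                # Earlier directories take precedence over later ones.
                kernels.setdefault(name, spec.get("language", "unknown"))
    return kernels


if __name__ == "__main__":
    for name, language in list_kernels().items():
        print(name, "->", language)
```

Running `jupyter kernelspec list` gives the same directory information from the command line.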

Can't access exe symlinks in centos 6.6 on docker

I'm trying to access the /proc/<pid>/exe symlink on CentOS 6.6, but I'm getting:
ls: cannot access /proc/57/exe: Permission denied
My Dockerfile is
FROM centos:6
# install base packages
RUN yum -y update
RUN yum -y groupinstall "Development Tools"
RUN yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
RUN yum -y install wget
RUN wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
RUN yum -y install tar
RUN tar -xf Python-2.7.6.tar.xz
RUN cd Python-2.7.6 && ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"
RUN cd Python-2.7.6 && make
RUN cd Python-2.7.6 && make altinstall
# First get the setup script for Setuptools:
RUN wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py
# Then install it for Python 2.7 and/or Python 3.3:
RUN python2.7 ez_setup.py
# Now install pip using the newly installed setuptools:
RUN easy_install-2.7 pip
#lxml
RUN yum -y install python-devel libxml2-devel libxslt-devel
RUN pip install lxml
RUN pip install pytest
RUN yum install -y gcc-c++
RUN pip install protobuf
RUN yum install -y rpm-build
This happens for any process.
Thanks in advance.
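To narrow this down, it can help to see exactly which error each PID returns, since EACCES (permission denied) and ENOENT (no such file, as kernel threads typically give) point in different directions. Reading this link is subject to a ptrace-style access check, so a restricted container is one possible cause, though that is an assumption here. A small diagnostic sketch, with `read_exe_link` as a hypothetical helper:

```python
import errno
import os


def read_exe_link(pid):
    """Return (target, error_name) for /proc/<pid>/exe; exactly one is None."""
    try:
        return os.readlink("/proc/%d/exe" % pid), None
    except OSError as exc:
        return None, errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    # Walk every numeric entry in /proc and report the link target or the error.
    pids = sorted(int(e) for e in os.listdir("/proc") if e.isdigit())
    for pid in pids:
        print(pid, read_exe_link(pid))
```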

Fail Jupyter Notebook installation on clean Ubuntu 14.04 LTS

How do I install the development version of Jupyter Notebook?
$ wget https://bootstrap.pypa.io/get-pip.py
$ sudo python get-pip.py
$ sudo pip install virtualenv
$ cd ~
$ virtualenv local/python/jupyter
$ source local/python/jupyter/bin/activate
$ git clone --recursive https://github.com/ipython/ipython.git
$ cd ipython
$ pip install -e ".[notebook]"
Could not find a version that satisfies the requirement jupyter-notebook (from ipython==4.0.0.dev0) (from versions: )
Some externally hosted files were ignored as access to them may be unreliable (use --allow-external jupyter-notebook to allow).
No matching distribution found for jupyter-notebook (from ipython==4.0.0.dev0)
Could it be that you now have to install from the Jupyter repo, since things have been moving around since the release of Jupyter? The next releases will not be only for IPython but also for other kernels such as Julia and Bash.
From https://github.com/jupyter/jupyter_notebook
Create a virtual env (e.g. jupyter-dev)
Ensure that you have node/npm installed (e.g. brew install node on OS X)
Clone this repo and cd into it
pip install -r requirements.txt
pip install -e .
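Before running those steps, the prerequisites can be checked from Python. The sketch below only looks for the executables on PATH; the default tool list is my reading of the steps above, and `missing_tools` is a hypothetical helper:

```python
import shutil


def missing_tools(tools=("git", "node", "npm", "pip")):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```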

Using Kivy on Eclipse Indigo, Ubuntu 10.04 & Python 2.7

I would like to use Kivy with Eclipse Indigo on Ubuntu 10.04. I understand that Python 2.7 is required (2.6 is the default on 10.04), and I have Python 2.7 installed as well. I've done lots of research but have not found an answer. Can I do this, and if so, how? I don't want to upgrade either Ubuntu or Eclipse, since that would probably destabilise existing projects.
Kivy and Eclipse are not related, and Eclipse is not necessary for running or editing Kivy programs. I can help to answer the Kivy part of your question, and will leave Eclipse to others.
Since Ubuntu 10.04 is out of support, it's hard to tell which required system packages are not available. This will probably be the most tedious part of the process. For Kivy on Ubuntu 12.04 you need:
sudo apt-get install -y build-essential mercurial git python2.7 python-dev ccache ffmpeg libsdl-image1.2-dev libsdl-mixer1.2-dev libsdl-ttf2.0-dev libsmpeg-dev libsdl1.2-dev libportmidi-dev libswscale-dev libavformat-dev libavcodec-dev zlib1g-dev unzip
Some of those packages will have different versions on Ubuntu 10.04. Hopefully they are all available in some form.
Next you need to bootstrap an up-to-date Python setuptools environment:
sudo apt-get remove --purge -y python-virtualenv python-pip python-setuptools
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | sudo python2.7
rm -f setuptools*.zip
sudo easy_install-2.7 -U pip
Now you can install an up-to-date Cython:
sudo apt-get remove --purge -y cython
sudo pip2.7 install -U cython
Next you can install an up-to-date NumPy, which is required for PyGame:
sudo apt-get remove --purge -y python-numpy
sudo pip2.7 install -U numpy
Now you can install an up-to-date PyGame:
sudo apt-get remove --purge -y python-pygame
hg clone https://bitbucket.org/pygame/pygame
cd pygame
python2.7 setup.py build
sudo python2.7 setup.py install
cd ..
sudo rm -rf pygame
Now that all of the dependencies are met, you can install an up-to-date Kivy:
sudo apt-get remove --purge -y python-kivy
sudo pip2.7 install -U kivy
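Once everything is installed, a quick way to see where the chain broke is to try each import in the order it was installed. A minimal sketch to run with python2.7 (it also works under Python 3); `check_imports` is a hypothetical helper, and the module names are the import names I would expect for the packages above:

```python
import importlib


def check_imports(names=("Cython", "numpy", "pygame", "kivy")):
    """Try each import in order; return a list of (name, version or None)."""
    results = []
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results.append((name, None))
        else:
            results.append((name, getattr(module, "__version__", "unknown")))
    return results


if __name__ == "__main__":
    for name, version in check_imports():
        print("%s: %s" % (name, version or "MISSING"))
```

The first module reported as MISSING is the step worth redoing, since the later packages build on the earlier ones.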