Installing a Python package in a SageMaker SparkMagic (PySpark) notebook - pyspark

I want to install new libraries in a running kernel (not through bootstrapping). I was able to create a SageMaker notebook that is connected to an EMR cluster, but installing packages is a headache.
I am unable to install packages in the notebook. I've tried several methods, such as installing packages via the terminal in JupyterLab.
$ conda install numba
The installation seems to work fine in the conda_pytorch_p36 notebook, but the packages are not installed in the SparkMagic (PySpark) notebook.
Error code:
An error was encountered:
No module named numba
Traceback (most recent call last):
ImportError: No module named numba
The Jupyter shell magic also fails, but only in the PySpark notebook:
!pip install keras
Error:
An error was encountered:
invalid syntax (<stdin>, line 1)
File "<stdin>", line 1
!pip install keras
^
SyntaxError: invalid syntax
Based on an answer in a GitHub post, I also tried this, but it did not work either:
from subprocess import call
call("pip install dm-sonnet".split(" "))

When you run $ conda install numba via the terminal in JupyterLab, the installation actually succeeds in your local environment. However, when you use SparkMagic as your kernel, the code in the cells always runs on the Spark cluster, not in the local notebook environment. To run the contents of a cell locally, write %%local at the beginning of the cell. Everything in that cell will then run locally, and the installed module will be available.
Otherwise, you should install the module on the remote Spark cluster.
Read more here:
https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Pyspark%20Kernel.ipynb
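A quick way to see which side a cell ran on is to print the interpreter path and check whether the module is visible to it. This is only a sketch, assuming Python 3 on both the notebook instance and the cluster; is_importable is a hypothetical helper name, not part of SparkMagic. In a SparkMagic notebook, %%local would go on the first line of the cell; it appears as a comment here so the snippet stays plain Python.

```python
# %%local   <- in a SparkMagic notebook, make this the first line of the cell
import importlib.util
import sys


def is_importable(module_name):
    """Return True if the current interpreter can find the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Shows which Python executed the cell (notebook instance or cluster).
print(sys.executable)
print(is_importable("numba"))
```

Running the cell once with %%local and once without should show two different interpreter paths, and numba should only be importable where it was actually installed.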

Related

PyAudio import error : ImportError: DLL load failed: The specified module could not be found

I ran into an import problem with PyAudio.
I have Windows 10 (64-bit) and use Anaconda and the Spyder IDE with Python 3.7.
I installed PyAudio in Anaconda, running as administrator, with these commands:
cd
conda install -c conda-forge PyAudio
The installation ran without any problems.
I then restarted both Anaconda and Spyder. PyAudio now shows up in Anaconda's list of installed packages.
When I try to import PyAudio in Spyder (IPython console), I get this error message:
[1]: import pyaudio
Could not import the PyAudio C module '_portaudio'.
Traceback (most recent call last):
File "", line 1, in <module>
import pyaudio
File "C:\ProgramData\Anaconda3\lib\site-packages\pyaudio.py", line 116, in <module>
import _portaudio as pa
ImportError: DLL load failed: The specified module could not be found.
I tried the fixes suggested in answers to similar ImportError questions, where other users had trouble importing other packages such as sklearn, but with no success.
Your problem and mine are the same. The issue, unfortunately, is the version of Python you're running (in combination with your OS).
Check out this link:
https://people.csail.mit.edu/hubert/pyaudio/#:~:text=Note%3A%20As%20of%20this%20update,4.
Under the INSTALLATION section for Windows at that link, PyAudio's latest version (0.2.11) is listed as compatible with Python 2.7, 3.4, 3.5, and 3.6.
My current Python is 3.8.5, so neither of us can use PyAudio unless compatibility is added or we revert to one of the Python versions listed above.
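To check quickly whether the interpreter you are running is one of those versions, something like the following could be used. It is a minimal sketch: the version list is copied from the answer above, and is_listed_version is a hypothetical helper name.

```python
import sys

# Versions taken from the answer above (PyAudio 0.2.11 on Windows).
LISTED_VERSIONS = {(2, 7), (3, 4), (3, 5), (3, 6)}


def is_listed_version(version_info=sys.version_info):
    """Return True if (major, minor) is one of the versions listed above."""
    return tuple(version_info[:2]) in LISTED_VERSIONS


print(sys.version)
print(is_listed_version())
```

Run it in the same console where the import fails (here, Spyder's IPython console), since that is the interpreter that matters.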
I tried to install portaudio using
conda install portaudio
but it did not seem to work as it should. However,
conda install -c anaconda portaudio
solved the problem.
See the official Anaconda page.

ImportError: No module named pytesseract in JupyterLab and VS Code, but not locally

I tried running a ProcessImage.py file, in which I import the pytesseract package, in JupyterLab and VS Code.
This is the error that appears:
import pytesseract
ImportError: No module named pytesseract
I already know that pytesseract is installed in my environment because conda list shows:
pytesseract 0.3.2 py_0 conda-forge
pytest 5.3.5 py38_0 conda-forge
However, if I run my ProcessImage.py file locally, no error is raised.
I know the error is related to paths in JupyterLab and VS Code, but I can't seem to find a solution.
Have you tried launching your Python file with the Ctrl+F5 run command?
I had this same ImportError when trying to launch my Python program using the .run extension. It persistently started Python 2.7 while my environment was using Python 3. I first thought that it was a VS Code error, but it wasn't.
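To confirm which interpreter a given launcher actually starts, you could put a couple of lines like these at the top of the script. This is a sketch only; interpreter_summary is a hypothetical helper name, and the code is written so that it also runs under Python 2.7.

```python
import sys


def interpreter_summary():
    """Describe the interpreter that is executing this script."""
    return "{} (Python {}.{})".format(
        sys.executable, sys.version_info[0], sys.version_info[1]
    )


print(interpreter_summary())
```

If the printed path is not inside the Conda environment where pytesseract is installed, the launcher is using a different Python, which would explain the ImportError.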

Import PySpark error in Anaconda venv on Dataproc

I have spun up a Dataproc cluster with Anaconda as an additional component. I created a virtual environment in Anaconda and installed RDKit inside it. My issue is that when I open a Python terminal and try this:
from pyspark import SparkContext
It throws an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'
I can install PySpark inside the Anaconda virtual environment, and then it works, but I want to use the PySpark that is pre-installed on Dataproc. How can I resolve this?
To use Dataproc's PySpark in a new Conda environment, you need to install the file:///usr/lib/spark/python package inside that environment:
conda create -c rdkit -n rdkit-env rdkit
conda activate rdkit-env
sudo "${CONDA_PREFIX}/bin/pip" install -e "file:///usr/lib/spark/python"
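After the install, one way to confirm that the environment picks up Dataproc's copy rather than a separately installed one is to look at where the module resolves from. This is a hedged sketch assuming Python 3; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(module_name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# With the editable install above, this would be expected to point
# somewhere under /usr/lib/spark/python.
print(module_origin("pyspark"))
```

A result of None means the active environment still cannot see pyspark at all.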

Import local module to the Jupyter Notebook

I have a function on my local machine that transforms my data.
For example, when I tried to import:
import com.company.area.project.area.functions
I received the error:
console: 49: error: object company is not a member of package com
import com.company.area.project.area.functions
I'm using Anaconda and pip. Reading through the web, I found this question: Is there a pip / easy_install for Scala?
So I noticed that there is no equivalent to the "pip install ." we have in Python.
So the question is: how can I import a custom function into my Jupyter notebook?
First, make sure that a Scala environment is enabled for your Jupyter notebook.
PixieDust can help you install Scala packages in your Jupyter notebook.
In your case, a concrete approach would be to deploy your project as a JAR file to a repository or another address that you can then pass to the pixiedust package.
The code should look like this:
import pixiedust
pixiedust.installPackage("https://github.com/companyrepo/jars/raw/master/dist/functions-assembly-1.0.jar")

Issue with dependencies running python file on another PC

I have a python file running perfectly in the IDE.
I want to run it on a different PC without any IDE.
I run the program from the command line: python program.py
Error message: File "program.py", line 8, in <module>
from mpl_finance import candlestick_ohlc
ModuleNotFoundError: No module named 'mpl_finance'
When trying: pip install mpl_finance (or pip install mpl_toolkits)
I get the message: No matching distribution found for mpl_finance (or mpl_toolkits)
There also seems to be a problem with matplotlib backend.
Looking for a solution please.
After many failed paths, here's what worked:
matplotlib.finance is deprecated and it is now mpl_finance.
Create 2 files named mpl_finance.py and setup.py and get their contents from here.
Then from the command-line: python setup.py install
Fixing the backend (this can save you a few days):
If the matplotlib backend was set in a file on the original PC (and not in the code), you need to do the same on the second PC.
Windows Path: C:\Program Files\Python36\Lib\site-packages\matplotlib\mpl-data\matplotlibrc
Change (probably line 38) to this: backend : Qt5Agg
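If you would rather not edit the file by hand, the same change can be sketched in a few lines of Python. This is an illustration under assumptions: set_backend is a hypothetical helper, it works on the text of a matplotlibrc file rather than on the file itself, and it treats the first line whose key is exactly backend (commented out or not) as the setting to replace.

```python
def set_backend(rc_text, backend="Qt5Agg"):
    """Return rc_text with its backend setting replaced (or appended if absent)."""
    new_line = "backend : {}".format(backend)
    lines = rc_text.splitlines()
    for i, line in enumerate(lines):
        # Accept an active or a commented-out setting such as "#backend : Agg".
        key = line.lstrip("# ").split(":", 1)[0].strip()
        if ":" in line and key == "backend":
            lines[i] = new_line
            break
    else:
        # No backend line was found, so add one at the end.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


sample = "# example matplotlibrc\nbackend : TkAgg\nbackend_fallback: True\n"
print(set_backend(sample))
```

To apply it, read the matplotlibrc at the path above, pass its contents through set_backend, and write the result back; keeping a backup copy of the original file first is a sensible precaution.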