jupyter pyspark outputs: No module named sknn.mlp

I have a Spark HDInsight cluster with one worker node. I need to use the scikit-neuralnetwork and vaderSentiment modules in PySpark on Jupyter.
I installed the libraries using the commands below:
cd /usr/bin/anaconda/bin/
export PATH=/usr/bin/anaconda/bin:$PATH
conda update matplotlib
conda install Theano
pip install scikit-neuralnetwork
pip install vaderSentiment
Next, I open a PySpark terminal and I am able to import the modules successfully.
However, when I open a Jupyter PySpark notebook, the same import fails with "No module named sknn.mlp".
Just to add, I am able to import pre-installed modules from Jupyter, like import pandas.
The installation goes to:
admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "vaderSentiment"
/usr/bin/anaconda/lib/python2.7/site-packages/vaderSentiment
/usr/local/lib/python2.7/dist-packages/vaderSentiment
For pre-installed modules:
admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "pandas"
/usr/bin/anaconda/pkgs/pandas-0.17.1-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/pandas-0.16.2-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/bokeh-0.9.0-np19py27_0/Examples/bokeh/compat/pandas
/usr/bin/anaconda/Examples/bokeh/compat/pandas
/usr/bin/anaconda/lib/python2.7/site-packages/pandas
The sys.executable path is the same in both Jupyter and the terminal:
print(sys.executable)
/usr/bin/anaconda/bin/python
Any help would be greatly appreciated.

The issue is that while you are installing the libraries on the head node (one of the VMs), you are not installing them on the other VMs (the worker nodes). When the PySpark application for Jupyter gets created, it runs in YARN cluster mode, and so the application master starts on a random worker node.
One way of installing the libraries on all worker nodes is to create a script action that runs against the worker nodes and installs the necessary libraries:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
Do note that there are two Python installations in the cluster, and you have to refer to the Anaconda installation explicitly. Installing scikit-neuralnetwork would look something like this:
sudo /usr/bin/anaconda/bin/pip install scikit-neuralnetwork
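For illustration, a minimal script-action sketch that installs everything the question needs through the Anaconda installation (the package names come from the question; treat this as a sketch, not a tested script):
#!/usr/bin/env bash
# Install the required libraries into the cluster's Anaconda environment,
# referring to its conda/pip explicitly as noted above.
set -e

sudo /usr/bin/anaconda/bin/conda install -y Theano
sudo /usr/bin/anaconda/bin/pip install scikit-neuralnetwork
sudo /usr/bin/anaconda/bin/pip install vaderSentiment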
The second way of doing this is to simply SSH into the worker nodes from the head node. First, SSH into the head node, then find the worker node IPs in Ambari at https://YOURCLUSTER.azurehdinsight.net/#/main/hosts. Then run ssh 10.0.0.# and execute the installation commands yourself on each worker node.
I did this for scikit-neuralnetwork, and while it does import correctly, it throws an error saying it cannot create a file in ~/.theano. Because YARN runs PySpark sessions as the nobody user, Theano cannot create its config file. Digging around a little, I see that there is a way to change where Theano writes/looks for its config file. Please also take care of that while doing the installation: http://deeplearning.net/software/theano/library/config.html#envvar-THEANORC
Forgot to mention: to modify an environment variable, you need to set it when creating the PySpark session. Execute this in the Jupyter notebook:
%%configure -f
{
    "conf": {
        "spark.executorEnv.THEANORC": "{YOURPATH}",
        "spark.yarn.appMasterEnv.THEANORC": "{YOURPATH}"
    }
}
Thanks!

The easy way to resolve this was:
Create a bash script:
cd /usr/bin/anaconda/bin/
export PATH=/usr/bin/anaconda/bin:$PATH
conda update matplotlib
conda install Theano
pip install scikit-neuralnetwork
pip install vaderSentiment
Copy the bash script created above to a container in your Azure storage account.
While creating the HDInsight Spark cluster, use a script action and provide the above path as the URL, e.g. https://sa-account-name.blob.core.windows.net/containername/path-of-installation-file.sh
Run it on both the head nodes and the worker nodes.
Now, open Jupyter and you should be able to import the modules.

Related

How to set the Jupyter default user for Pyspark in GCP Dataproc

In a Jupyter notebook connected to a GCP Spark cluster, the cell !pip3 install pyLDAvis==3.2.1 works, but gives a warning:
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager.
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
The warning is not unique to pyLDAvis; other packages, even numpy, give the same warning.
Running the notebook as root shouldn't be the default. How can the default user in the notebook be set to singhj rather than root? I have searched through IPython Configuration and customization for any hints.
Configuration: Fresh cluster in GCP Dataproc, default Jupyter notebook, nothing customized.
The Jupyter server in a Dataproc cluster is run by the systemd service defined in the file /usr/lib/systemd/system/jupyter.service.
If you want to change the user it runs as, then you can modify that file and replace the line saying User=root with one saying the name of the user you want (e.g. User=singhj in your example).
Then, once the file has been updated, restart the systemd service by running the following commands as root:
systemctl daemon-reload
systemctl restart jupyter
If you want to automate that, you can write an initialization action to make the change at cluster creation time.
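If you go that route, a minimal initialization-action sketch could look like the following (the username singhj comes from the question; the dataproc-role metadata check is the pattern commonly used in Dataproc initialization actions, and this assumes the jupyter.service unit already exists when the action runs):
#!/usr/bin/env bash
# Hypothetical initialization action: run the Jupyter service as singhj
# instead of root on the master node.
set -e

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == "Master" ]]; then
    sed -i 's/^User=root$/User=singhj/' /usr/lib/systemd/system/jupyter.service
    systemctl daemon-reload
    systemctl restart jupyter
fi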

Pyspark kernel error while running jupyter notebook on Amazon EMR cluster

I'm new to AWS EMR. I would like to run a Jupyter notebook on my cluster from the command line with the PySpark kernel.
To create the cluster I ran the following command:
aws emr create-cluster --release-label emr-5.32.0 --name 'spark_jupyter_2' \
  --applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway Name=Hive \
  --ec2-attributes KeyName=…………..,InstanceProfile=EMR_EC2_DefaultRole --service-role EMR_DefaultRole \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --region eu-central-1 --log-uri s3://………….../logs/ --no-termination-protected
Then I installed jupyter with sudo pip3 install jupyter
Then I signed in again with ssh -i "…………...pem" -L 8888:localhost:8888 hadoop@ec2-…………..compute.amazonaws.com, ran jupyter notebook, and went to the webpage from the link on the screen.
Up to this step everything went smoothly, but when I tried to run a notebook with the PySpark kernel I got a "kernel error".
I don't completely understand why that is. When I run pyspark from the command line there's no error.
What do I have to do to run Jupyter with the PySpark kernel without any errors?

Can sparkmagic be used outside of IPython?

I'm using a Jupyter notebook with the sparkmagic extension, but I can only access the Spark cluster by creating a PySpark kernel. The conflict is that I can't use the py3 environment (some installed Python packages) in the PySpark kernel, and I can't use the Spark context in the Python 3 kernel.
I don't know how to introduce packages in sparkmagic, so can I use the PySpark that sparkmagic actually implements from py3? Or are there any other options?
Both kernels, PySpark and the default IPython, can be used with the python3 interpreter for PySpark. It can be specified in ~/.sparkmagic/config.json. This is standard Spark configuration and is simply passed by sparkmagic to the Livy server running on the Spark master node.
"session_configs": {
"conf": {
"spark.pyspark.python":"python3"
}
}
spark.pyspark.python: Python binary executable to use for PySpark in both driver and executors.
python3 is in this case available as a command on the PATH of each node in the Spark cluster. You can also install it into a custom directory on each node and specify the full path, e.g. "spark.pyspark.python": "/Users/hadoop/python3.8/bin/python".
All Spark conf options can be passed like that.
There are two ways of importing tensorflow:
Install it on all Spark machines (master and workers) via python3 -m pip install tensorflow.
Zip it, upload it, and pass the remote path through sparkmagic via the spark.submit.pyFiles setting. This accepts a path on S3, HDFS, or the master node file system (not a path on your local machine); see the sketch after this list.
See answer about --py-files
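For illustration, a hedged sketch of how both settings might sit together in ~/.sparkmagic/config.json (the S3 bucket and zip file name are placeholders, not from the original answer):
"session_configs": {
    "conf": {
        "spark.pyspark.python": "python3",
        "spark.submit.pyFiles": "s3://your-bucket/deps/dependencies.zip"
    }
}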

using a conda virtual environment in jupyter notebook

I have read and implemented instructions from earlier posts like:
How to start an ipython shell(not notebook) within a conda or virtualenv
My goal is to use a kernel in ipython which has all conda packages from my virtual environment.
I have a Google Cloud Ubuntu 16.04 machine where I have installed Anaconda and a virtual environment in which I installed all my packages.
When I run
python -m ipykernel.kernelspec
I get the following error:
/home/admin/anaconda3/envs/py36ve/lib/python3.6/site-packages/IPython/paths.py:61: UserWarning: IPython dir
'/home/admin/.ipython' is not a writable location, using a temp
directory.
" using a temp directory.".format(ipdir))
[Errno 13] Permission denied: '/usr/local/share/jupyter/kernels/python3'
I tried running with sudo too. I created a kernel, but when I use it, it has none of the packages I installed in the virtual environment.
I have a similar issue when I submit my program to a cluster where it doesn't have access to my local directory, and it shows the same message, although I don't get the Permission denied error and everything works fine for me. I wanted to address this, looked into it, and found that paths.py (at line 62 in the IPython package), in the case where the directory is not writable, creates a temp directory like the following:
ipdir = tempfile.mkdtemp()
As the tempfile documentation says:
Creates a temporary directory in the most secure manner possible. There are no race conditions in the directory’s creation. The directory is readable, writable, and searchable only by the creating user ID.
It is strange that you get this, but if you want to make it work, find paths.py, change it to your liking, make sure it works, and replace the original with it.
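To make that fallback concrete, here is a small sketch of the logic described above (it mirrors the behaviour of IPython's paths.py; it is not the library's actual source):
import os
import tempfile

# Sketch of the described fallback: if the configured IPython directory is
# not writable, use a freshly created temp directory that only the current
# user can read, write, and search.
ipdir = os.path.expanduser("~/.ipython")
if not os.access(ipdir, os.W_OK):
    ipdir = tempfile.mkdtemp()
print(ipdir)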

ipython notebook on remote server peculiarity

I am taking my first steps with IPython notebook. I installed it successfully on a remote server of mine (over SSH) and started it using the following command:
ipython notebook --ip='*' --pylab=inline --port=7777
I then checked on http://myserver.sth:7777/ and the notebook was running just fine. I then wanted to close the SSH connection to the server and keep ipython running in the background. When I did this, I couldn't connect to myserver.sth:7777 anymore. Once I connected again to the remote server by SSH, I could connect to the notebook again. I then tried to use screen to start ipython: I created a new screen with screen -S ipy, started ipython notebook as above, and used Ctrl+A,D to detach the screen and exit to the TTY. I could still connect remotely to the notebook. I then closed the SSH connection and got a 404 NOT FOUND error when I tried to access my previously stored notebook, and I couldn't see it in the list of notebooks at http://myserver.sth:7777/. I tried to create a new notebook, but I got a 500 Internal Server Error.
I also tried running ipython notebook with and without using sudo.
Any ideas?
Rather than use screen, perhaps you could switch to an init script or supervisord to keep IPython notebook up and running.
Let's assume you go the supervisord route:
Install supervisord
Install supervisord using your package manager. On Ubuntu it's named supervisor.
apt-get install supervisor
If you decide to install supervisor through pip, you'll have to set up its init.d script yourself.
Write a supervisor configuration file for IPython
The configuration file tells supervisor what to run and how.
After you install supervisor, it should have created /etc/supervisor/supervisord.conf. These lines should exist in the file:
[include]
files = /etc/supervisor/conf.d/*.conf
If the file contains these lines, you're in good shape. I only show them to demonstrate where supervisor expects new configuration files. Your configuration file can go there, named something like /etc/supervisor/conf.d/ipynb.conf.
Here's a sample configuration that was generated by Chef from an ipython-notebook-cookbook that runs the notebook in a virtualenv:
[program:ipynb]
command=/home/ipynb/.ipyvirt/bin/ipython notebook --profile=cooked
process_name=%(program_name)s
numprocs=1
numprocs_start=0
autostart=true
autorestart=true
startsecs=1
startretries=3
exitcodes=0,2
stopsignal=QUIT
stopwaitsecs=10
user=ipynb
redirect_stderr=false
stdout_logfile=AUTO
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=10
stdout_capture_maxbytes=0
stdout_events_enabled=false
stderr_logfile=AUTO
stderr_logfile_maxbytes=50MB
stderr_logfile_backups=10
stderr_capture_maxbytes=0
stderr_events_enabled=false
environment=HOME="/home/ipynb",SHELL="/bin/bash",USER="ipynb",PATH="/home/ipynb/.ipyvirt/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games",VIRTUAL_ENV="/home/ipynb/.ipyvirt"
directory=/home/ipynb
serverurl=AUTO
The above supervisor config also relies on an IPython notebook configuration (located at /home/ipynb/.ipython/profile_cooked/ipython_notebook_config.py). This makes configuration much easier, as you can also set up your password hash and many other configurables:
c = get_config()
# Kernel config
# Make matplotlib plots inline
c.IPKernelApp.pylab = 'inline'
# The IP address the notebook server will listen on.
# If set to '*', will listen on all interfaces.
# c.NotebookApp.ip= '127.0.0.1'
c.NotebookApp.ip='*'
# Port to host on (e.g. 8888, the default)
c.NotebookApp.port = 8888 # If you want it on 80, I recommend iptables rules
# Open browser (probably want False)
c.NotebookApp.open_browser = False
Re-read and update, now that you have the configuration file
supervisorctl reread
supervisorctl update
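After the update, one way to confirm that supervisor actually started the notebook (using the ipynb program name from the sample config above):
# Shows the process state (e.g. RUNNING) and its uptime
supervisorctl status ipynb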
Reality
In reality, I used to use a Chef cookbook to do the entire installation and configuration. However, using configuration management for tiny stuff like this is a bit of overkill (unless you're orchestrating it as part of broader automation).
Nowadays I use Docker images for IPython notebook, orchestrating via JupyterHub or tmpnb.
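For example, a minimal way to try the Docker route (jupyter/base-notebook is one of the community Jupyter Docker Stacks images, named here only as an illustration, not something from the original answer):
# Start a throwaway Jupyter notebook server in a container on port 8888;
# the login token is printed in the container output.
docker run --rm -p 8888:8888 jupyter/base-notebook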