how to access pyspark from jupyter notebook - pyspark

I have been using PySpark (with Python 2.7) in an IPython notebook on Ubuntu 14.04 quite successfully by creating a special profile for Spark and starting the notebook with $ ipython notebook --profile spark. The mechanism for creating the Spark profile is described on many websites, but I used the one given here.
The file $HOME/.ipython/profile_spark/startup/00-pyspark-setup.py contains the following code:
import os
import sys
# Configure the environment
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/home/osboxes/spark16'
# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']
# Add the PySpark/py4j to the Python Path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "build"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
I have just created a new Ubuntu 16.04 VM for my students, on which I want them to run PySpark programs in an IPython notebook. Python and PySpark are working well. We are using Spark 1.6.
However I have discovered that the current versions of ipython notebook [ or jupyter notebook ] whether downloaded through Anaconda or installed with sudo pip install ipython .. DO NOT SUPPORT the --profile option and all configuration parameters have to be specified in the ~/.jupyter/jupyter_notebook_config.py file.
Can someone please help me with the config parameters that I need to put into this file? Or is there an alternative solution? I have tried the findspark approach explained here but could not make it work. findspark installed, but findspark.init() failed, possibly because it was written for Python 3.
My challenge is that everything is working just fine on my old installation of ipython on my machine but my students who are installing everything from scratch cannot get pyspark going on their VMs.

I use Spark locally, just for testing, and start the notebook through ~/apps/spark-1.6.2-bin-hadoop2.6/bin/pyspark:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/apps/spark-1.6.2-bin-hadoop2.6/bin/pyspark
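The same launch can be sketched from Python. This assumes that Spark's bin/pyspark script reads PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS as in the command above. The helper name notebook_env is hypothetical, and the real launch is left commented out because it depends on a local Spark install.

```python
import os


def notebook_env(base_env=None):
    """Return a copy of the environment that makes pyspark start Jupyter."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


env = notebook_env({"PATH": "/usr/bin"})
print(env["PYSPARK_DRIVER_PYTHON"], env["PYSPARK_DRIVER_PYTHON_OPTS"])

# To launch for real (path taken from the command above):
# import subprocess
# pyspark = os.path.expanduser("~/apps/spark-1.6.2-bin-hadoop2.6/bin/pyspark")
# subprocess.call([pyspark], env=notebook_env())
```

Building a copy of the environment keeps the two variables out of the parent shell, so a plain pyspark call elsewhere still opens the normal console.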

I have found a ridiculously simple answer to my own question by looking at the advice given in this page.
Forget about all the configuration files. Simply start the notebook with this command: $ IPYTHON_OPTS="notebook" pyspark
That's all.
Obviously the paths to Spark have to be set as given here.
If you get an error with Py4J, look at this page.
With this you are good to go. The Spark context is already available as sc, so don't create it again.

With Python 2.7.13 from Anaconda 4.3.0 and Spark 2.1.0 on Ubuntu 16.04:
$ cd
$ gedit .bashrc
Add the following lines (where "*****" is the proper path):
export SPARK_HOME=*****/spark-2.1.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$SPARK_HOME/sbin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
Save, then do:
$ *****/anaconda2/bin/pip install py4j
$ cd
$ source .bashrc
Check if it works with:
$ ipython
In [1]: import pyspark
For more details go here
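The two PYTHONPATH lines above can also be applied from inside a running interpreter. The following is a minimal sketch, not the findspark implementation. add_pyspark_to_path is a hypothetical helper, and it assumes the layout shown above, with python/ and a py4j-*-src.zip under python/lib/ inside SPARK_HOME.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put Spark's Python sources and its bundled py4j zip on sys.path."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version differs between Spark releases, so match it by pattern.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    added = []
    for path in [python_dir] + py4j_zips:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


# Usage, assuming SPARK_HOME is exported as in .bashrc above:
if "SPARK_HOME" in os.environ:
    print(add_pyspark_to_path(os.environ["SPARK_HOME"]))
```

Matching the zip by pattern avoids hard-coding py4j-0.10.4, which is specific to the Spark 2.1.0 download used here.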

Related

jupyter setup i18n on existing notebook

I have been trying to translate the Jupyter notebook interface into my native language using the existing i18n implementation. I have already created the translation files as the README advises, and now I want to add them to Jupyter.
https://github.com/jupyter/notebook/tree/master/notebook/i18n
But I can't find the /notebook/i18n/ folder on my computer (Ubuntu 16.04). Do I have to install Jupyter again, or can I just add the translation files to the existing Jupyter installation on my machine and run it?
I just reinstalled Jupyter, and this time the i18n folder is in place at:
/usr/local/lib/python3.5/dist-packages/notebook/i18n/i18n
First, find lib_path with Python:
import sys
from distutils.sysconfig import get_python_lib
print (get_python_lib())
And you will find it in
${lib_path}/notebook/i18n/
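As a sketch of the same lookup without distutils (deprecated since Python 3.10 and removed in 3.12), the package directory can be read from the import system. package_subdir is a hypothetical helper. It only assumes that the package is importable from the interpreter that runs Jupyter.

```python
import importlib.util
import os


def package_subdir(package, subdir):
    """Return the path of `subdir` inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package missing, or a single-file module
    for root in spec.submodule_search_locations:
        candidate = os.path.join(root, subdir)
        if os.path.isdir(candidate):
            return candidate
    return None


# Prints the i18n folder of the installed notebook package, or None.
print(package_subdir("notebook", "i18n"))
```

A result of None means either that notebook is not installed for this interpreter or that the installed version ships without an i18n folder.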

Can't launch PySpark in browser (windows 10)

I'm trying to launch PySpark notebook in my browser by typing in pyspark from the console, but I get the following error:
c:\Spark\bin>pyspark
python: can't open file 'notebook': [Errno 2] No such file or directory
What am I doing wrong here?
Please help?
Sounds like the jupyter notebook is either not installed or not in your path.
I prefer to use Anaconda for my python distribution and Jupyter comes standard and will install all necessary path information as well.
After that, as long as you have set PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS='notebook' correctly, you are good to go.
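A quick way to test the first point, sketched with the standard library only (Python 3.3 or later): shutil.which returns the full path of a command if it is on PATH and None otherwise. The command names checked here are assumptions about what the launcher needs, and missing_commands is a hypothetical helper.

```python
import shutil


def missing_commands(names):
    """Return the commands from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# An empty list means both launchers are visible from this shell.
print(missing_commands(["jupyter", "pyspark"]))
```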
You want to launch the Jupyter notebook when you invoke the pyspark command. To do that, add the following to your ~/.bash_profile or ~/.zshrc:
export PYSPARK_SUBMIT_ARGS="pyspark-shell"
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

jupyter notebook not using python in conda environment from which it was started

I have started the Udacity deep learning course and am setting up environments. I think the kernel that the notebook uses does not use Python from the conda environment. The following are the results of things I have tried.
Started conda environment
source activate tensorflow
With python terminal inside conda environment from linux terminal:
import sys
sys.executable
>>> '/home/username/anaconda2/envs/tensorflow/bin/python'
Also tensorflow gets imported with python shell
With an ipython terminal inside the conda environment, it shows the same executable path, and tensorflow can be imported inside the ipython shell.
However, when I execute a cell in a Jupyter notebook, the tensorflow module cannot be found. A terminal spawned from the notebook also shows the executable path of the global Python installation in the anaconda/bin directory, not that of the environment I created and started the notebook from:
'/home/username/anaconda2/bin/python'
However, the conda environment of the shell is still tensorflow:
conda info --envs
# conda environments:
#
tensorflow * /home/username/anaconda2/envs/tensorflow
root /home/username/anaconda2
Does that mean the kernel is linked to the Python installation in this location and not to the one in the conda env? How can I link it to the env?
There is some more nuance to this question that is good to clarify. Each notebook is bound to a particular kernel. With the latest 4.0 release of Anaconda we (Continuum) have bundled a Conda-environment-aware extension that will try to associate a Notebook with a particular Conda environment. If that cannot be found then the "default" environment (or "root" environment) will be used. In your case you have a Notebook that is, I am guessing, asking for the default (or "root") environment, and so Jupyter starts a kernel in that environment, and not in the environment from which the Jupyter server was started. You can change the associated kernel by going to the Kernel->Change kernel menu and picking your tensorflow environment's kernel.
Or, when you create a new Notebook, you can pick at that time which Conda environment's kernel should back the Notebook (note that one Conda environment can have multiple kernels available, e.g. Python and R).
We appreciate that this can be a common cause of confusion, especially when sharing notebooks, since the person who shared it either used the "default" kernel (probably called just "Python"), or they were using a Conda environment with a different name. We are working on ways to make this smoother and less confusing, but if you have suggestions for expected/desired behavior, please let us know (GitHub issue to https://github.com/ContinuumIO/anaconda-issues/issues/new is the best way to do this)
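To see which interpreter a given kernel will launch, one option is to read its kernel spec. Each spec directory contains a kernel.json whose argv list starts with the executable. The sketch below assumes that layout. kernel_interpreter is a hypothetical helper, and the spec path in the usage comment is only an example.

```python
import json


def kernel_interpreter(kernel_json_path):
    """Return the executable a Jupyter kernel spec launches (argv[0])."""
    with open(kernel_json_path) as handle:
        spec = json.load(handle)
    return spec["argv"][0]


# Example (path is illustrative; `jupyter kernelspec list` shows real ones):
# kernel_interpreter("/home/username/.local/share/jupyter/kernels/python2/kernel.json")
```

If the result differs from sys.executable in the activated environment, the notebook is running on a different Python than the shell it was started from.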

Problems with importing self-defined module in Jupyter notebook using PyCharm

I'm trying to import a self-defined module in a Jupyter notebook using PyCharm (2016.1). However, I always get "ImportError: No module named xxx". Importing packages like NumPy or Matplotlib works fine. The self-defined module and the notebook are in the same directory and I've tried to set the directory as sources root. How can I fix this? Thanks a lot!
If you run the following in your notebook...
import sys
sys.path
...and you don't see the path to the directory containing the packages/modules, there are a couple ways around it. I can't speculate why this might happen in this example. I have seen some discrepancies in the results of sys.path when running Jupyter locally from PyCharm on OS X vs. on a managed Linux service.
An easy if hacky workaround is to set the sys path in your notebook to reflect where the packages/modules are rooted. For example, if your notebook was in a subdirectory from where the packages or modules are and sys.path only reflects that subdirectory:
import sys
sys.path.append("../")
The point is that sys.path must include the directory the packages and modules are rooted in, so the path you append will depend on the circumstances.
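A slightly more robust variant of the append above, as a sketch: resolve the directory to an absolute path and skip it if it is already present, so that re-running the cell does not keep growing sys.path. add_import_root is a hypothetical name.

```python
import os
import sys


def add_import_root(path):
    """Add `path` to sys.path once, as an absolute path."""
    root = os.path.abspath(path)
    if root not in sys.path:
        sys.path.append(root)
    return root


# Parent of the notebook's working directory, as in the example above.
print(add_import_root(".."))
```

The relative path is resolved against the current working directory, so this still relies on the notebook being served from the directory you expect.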
Perhaps a more proper solution, if you are using a virtualenv as your project interpreter, is to create a setup.py for your project and install the project as an editable package with pip. E.g. pip install -e . Then as long as Jupyter is running from that virtualenv there shouldn't be any issues with imports.
One ugly gotcha I ran into on OS X was Jupyter referencing the wrong virtualenv when started. This should also be apparent from inspecting sys.path. I don't really know how I unintentionally managed to set this, but I presume it was due to futzing around my first time getting Jupyter working in PyCharm. Instead of starting Jupyter with the local virtualenv, it would run with the one defined in ~/Library/Jupyter/kernels/.python/kernel.json. I was able to clear it by cleaning out that directory, e.g. rm -r ~/Library/Jupyter/kernels/.python.
As stated by Thomas in the comments, make sure that your notebook serving path and project path are the same. When you start your notebook in PyCharm you should get something like this:
Serving notebooks from local directory: <path to your project root folder>

Ipython Notebook shows Import Error for Seaborn even when package is installed in Conda environment

So I am trying to use IPython notebook with Anaconda (Windows 10). I opened the Anaconda command prompt and created a new environment, TryThis. I installed Seaborn in this environment and then ran the ipython command in the same prompt.
conda create --name TryThis python=2
activate TryThis
conda install seaborn
ipython
When I run
import seaborn as sns
in this session, it executes all right.
However if I exit this and then run
ipython notebook
in the conda prompt and do the import in an IPython notebook in the browser, it throws an error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-ed9806ce3570> in <module>()
----> 1 import seaborn as sns
ImportError: No module named seaborn
I do not understand what is going wrong. If Seaborn is in this Anaconda environment, I started IPython notebook in this environment, and IPython in the console can find it, why can't the notebook?
I may be doing something blatantly incorrect, but I have only just started using Anaconda!
Type:
!conda info
in your notebook. Check what the default environment line says. It should be the same as in the session in which you can import seaborn.
First try
conda install seaborn
Restart your Jupyter notebook and see if it works.
If you have already installed Seaborn using conda, make sure that when you start Jupyter notebook, it uses the Anaconda path.
It typically prints out the path in terminal when you start Jupyter notebook.
I have run into this issue before, and the reason was that my Jupyter notebook was using the path from .graphlab (a tool by Dato/Turi/Apple). So even though I had installed Seaborn correctly with conda install seaborn, the Jupyter notebook was not able to find the library.
You may not have the exact same issue, but from what you're describing, it sounds like your issue is somewhat similar.
If you're able to import seaborn, when you run ipython from terminal; and if you're not able to import seaborn from Jupyter notebook, then follow these steps:
From your terminal, find the ipython path with
which ipython
Now start Jupyter notebook and pay attention (in your terminal) to which path your Jupyter notebook is using.
If you're not able to import seaborn in Jupyter notebook, most likely that path is different from the ipython path that you saw earlier.
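The comparison in these steps can also be made from Python itself. Running the sketch below once in the terminal ipython session and once in a notebook cell shows whether both use the same interpreter and whether that interpreter can find the module. import_report is a hypothetical helper, and the sketch assumes Python 3.4 or later for importlib.util.find_spec.

```python
import importlib.util
import sys


def import_report(module_name):
    """Describe the running interpreter and whether it can find a module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


print(import_report("seaborn"))
```

If the executable values differ between the two sessions, the notebook kernel is not running in the environment where Seaborn was installed.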
Once you have confirmed that this is the issue, all you need to do is make Jupyter use the correct path. There are various ways to do it. My way was to get rid of my installation of Anaconda entirely and install Jupyter notebook using pip:
pip install jupyter
As long as you have installed your libraries (NumPy, SciPy, Pandas, Seaborn, etc) using pip, your jupyter will be able to import these libraries. In my opinion, pip install * is the way to go for anything Python.