Can I clone a jupyterhub kernel? - pyspark

I access a JupyterHub instance with some preconfigured kernels. I know how to create a new kernel from scratch, but how can I clone an existing one? I want to customize it with my own packages.
I created a pure python kernel, but I'd like to clone our default spark kernel. The new kernel should have the spark object already created.
The JupyterHub env was set up by an admin. It uses pure Python (no Conda).

Related

The Pyspark always use the system‘s python

We know that a system has two Python:
①system's python
/usr/bin/python
②user's python
~/anaconda3/envs/Python3.6/bin/python3
Now I have a cluster with my Desktop(master) and Laptop(slave).
It's OK for different mode of PysparkShell if I set like this:
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
for both two nodes' ~/.bashrc
However,I want to configure it with jupyter notebook.So I set like this in each node's
~/.bashrc
export PYSPARK_PYTHON=~/anaconda3/envs/Python3.6/bin/python3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
then I get the log
My Spark version is:
spark-3.0.0-preview2-bin-hadoop3.2
I have read all the answers in
environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
and
different version 2.7 than that in driver 3.6 in jupyter/all-spark-notebook
But no luck.
I guess slave's python2.7 is from system's python.not from anaconda's python.
How to force spark's slave node to use anaconda's python?
Thanks~!
Jupiyter is looking for ipython, you probably only have ipython installed in your system python.
In order to use jupyter in different python version. You need to use python version manager (pyenv), and python environment manager(virtualenv), together you can choose which version of python you are going to use and which environment you are going to install jupyter, and fully isolated python versions and packages.
Install ipykernel in your chosen python environment and install jupyter.
After you finish above step. You need to make sure that the Spark worker will switch to your chosen python version and environment every time Spark ReourceManager launches a worker executor. In order to swtich python version and environment when the Spark worker executor, you need to make sure that a little script ran right after the Spark Resource Manager ssh into worker:
go to the python environment directory
source 'whatever/bin/activate'
After you have done above steps, you should have chosen python version and jupyter ran by Spark worker executor.

Different Python version between Dataproc master and worker nodes

I had created a dataproc cluster with Anaconda as optional component and created a virtual env. in that. Now when running a pyspark py file on master node I'm getting this error -
Exception: Python in worker has different version 2.7 than that in
driver 3.6, PySpark cannot run with different minor versions.Please
check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
are correctly set.
I need RDKit package inside the virtual env. and with that python 3x version gets installed. The following commands on my master node and then the python version changes.
conda create -n my-venv -c rdkit rdkit=2019.*
conda activate my-venv
conda install -c conda-forge rdkit
How can I solve this?
There's a few things here:
The 1.3 (default) image uses conda with Python 2.7. I recommend switching to 1.4 (--image-version 1.4) which uses conda with Python 3.6.
If this library will be needed on the workers you can use this initialization action to apply the change consistently to all nodes.
Pyspark does not currently support virtualenvs, but this support is coming. Currently you can run pyspark program from within a virtualenv, but this will not mean workers will run inside the virtualenv. Is it possible to apply your changes to the base conda environment without virtualenv?
Additional info can be found here https://cloud.google.com/dataproc/docs/tutorials/python-configuration

virtualenv can't execute on my system?

I'm trying to create a virtual environment to deploy a Flask app. However, when I try to create a virtual environment using virtualenv, I get this error:
Using base prefix '//anaconda'
New python executable in /Users/sydney/Desktop/ptproject/venv/bin/python
ERROR: The executable /Users/sydney/Desktop/ptproject/venv/bin/python is not functioning
ERROR: It thinks sys.prefix is '/Users/sydney/Desktop/ptproject' (should be '/Users/sydney/Desktop/ptproject/venv')
ERROR: virtualenv is not compatible with this system or executable
I think that I installed virtualenv using conda. When I use which virtualenv, I get this
//anaconda/bin/virtualenv
Is this an incorrect location for virtualenv? I can't figure out what else the problem would be. I don't understand the error log at all.
It turns out that virtualenv just doesn't work correctly with conda. For example:
https://github.com/conda/conda/issues/1367
(A workaround is proposed at the end of that thread, but it looks like you may be seeing a slightly different error, so maybe it won't work for you.)
Instead of deploying your app with virtualenv, why not just use a proper conda environment? Conda environments are more general (and powerful) than those provided by virtualenv.
For example, to create a new environment with python-2.7 and flask in it:
conda create -n my-new-env flask python=2.7

adding packages to pyspark using jupyter notebook

I am able to run jupyter with pyspark successfully using https://cloud.google.com/dataproc/tutorials/jupyter-notebook
My question is - if I had to add packages to pyspark (like spark-csv or graphframes) and use them through the notebook, what is the best practice to follow ?
I can add the package in a new pyspark job using --packages option, but how do i connect that new pyspark context to the notebook ?
To get the notebook working, you'll really want the notebook setup to pick up the right packages itself. Since the initialization action you linked works to ensure Jupyter will be using the cluster's configured Spark directories and thus pick up all the necessary YARN/filesystem/lib configurations, the best way to do this is to add the property at cluster-creation time instead of job-submission time:
gcloud dataproc clusters create \
--properties spark:spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0
Per this StackOverflow error, setting the spark-defaults.conf property spark.jars.packages is the more portable equivalent of specifying the --packages option, since --packages is just syntactic sugar in spark-shell/spark-submit/pyspark wrappers which sets the spark.jars.packages configuration entry anyways.

Vagrant Berkshelf - Shelf Path?

Is it possible to set the path where the berkshelf plugin puts the cookbooks it installs? (As in the .berkshelf folder)
I am running Windows 7.
I am currently trying to install a mysql server using an opscode cookbook to a vm and here at work they have the %HOMEDRIVE% system variable set to a network drive. So when .berkshelf starts at the beginning of the Vagrantfile, it pushes the cookbooks to the network drive and it causes it to be slow and well, its not where it should be. Is there a fix to this?
VirtualBox did this as well, but I fixed it by altering the settings. I tried looking for some sort of equivalent settings for berkshelf, but the closest I got was for the standard berkshelf (thats not a vagrant plugin), it appears you can set this environment variable:
ENV['BERKSHELF_PATH']
Found here:
http://www.rubydoc.info/github/RiotGames/berkshelf/Berkshelf#berkshelf_path-class_method
I need to be able to have the cookbooks it reads from the berksfile store to my laptops local drive instead, as in my scenario I cannot have the mobility of the VM limited to the building because of files that are stored on the network.
Any incite would be much appreciated.
Perhaps its better to use the actual berkshelf over the vagrant plugin?
Thanks.
If you want to have the portability - a full chef-repo ready for chef-solo runs, better off using standalone berkshelf instead of the vagrant-berkshelf plugin - which is NOT that flexibly.
For complex cookbooks, I prefer to use standalone berkshelf as it allows me to do berks install --path chef/cookbooks to copy all cookbooks required from ~/.berkshelf/cookbooks, then I can just tar the whole thing and transfer to other machines for the same chef-solo run. some people use capistrano automate the tar and scp/rsync over the network. I just use rysnc/scp;-)
HTH