Bind engine to the notebook session - ipython

Is it possible to run the notebook server with kernels scheduled as processes on a remote cluster (via ssh or PBS), with a common directory on NFS?
For example, I have three servers with GPUs and would like to run notebooks on them, but I would rather not start more than one notebook server. Ideally, a notebook server on a fourth machine would schedule kernels across them, automatically or manually.
I did some trials with a cluster of one engine. Using %%px in each cell is almost a solution, but introspection does not work, and the notebook code ends up depending on the cluster configuration, which is not ideal.

This is not possible with the notebook at this time. The notebook cannot use a kernel that it did not start.
You could possibly write a new KernelManager that starts kernels remotely distributed across your machines and plug that into the notebook server, but you cannot attach an Engine or other existing kernel to the notebook server.
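(For later readers: third-party tools such as remote_ikernel implement exactly this kernelspec-wrapping idea, registering kernels that are launched over ssh or a batch scheduler like PBS/SGE. A sketch with a hypothetical host name; the flags are from my recollection of the project's README, so verify with `remote_ikernel manage --help`:)

```
pip install remote_ikernel
remote_ikernel manage --add \
    --kernel_cmd="ipython kernel -f {connection_file}" \
    --name="Python (gpu-node-1)" \
    --interface=ssh \
    --host=gpu-node-1
```

Once registered, the kernel shows up in the notebook's kernel chooser like any local one, and the shared NFS home directory takes care of the files.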

Related

Attach notebook to pool or compute resource

I normally connect my Databricks notebooks to a cluster.
However, we are starting to use pools now.
Temporary "job" clusters get attached to our pool.
How can I connect a notebook to a pool (for example, when running a notebook interactively via the browser)?
I did not find an obvious answer in the documentation, such as
https://learn.microsoft.com/en-us/azure/databricks/clusters/instance-pools/
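As far as I understand the Databricks model, a notebook is never attached to a pool directly: you create a cluster that draws its nodes from the pool (the instance_pool_id field of the cluster spec) and attach the notebook to that cluster. A minimal sketch of a Clusters API payload, where the name, Spark version, and pool ID are placeholders:

```json
{
  "cluster_name": "interactive-from-pool",
  "spark_version": "11.3.x-scala2.12",
  "instance_pool_id": "0123-456789-pool00",
  "num_workers": 2
}
```

In the cluster-creation UI this corresponds to selecting the pool from the "Pool" dropdown instead of a node type.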

kernelspec not found after setting JUPYTER_PATH

I am working in Google Vertex AI, which has a two-disk system of a boot disk and a data disk, the latter of which is mounted to /home/jupyter. I am trying to expose python venv environments with kernelspec files, and then keep those environments exposed across repeated stop-start cycles.

All of the default locations for kernelspec files are on the boot disk, which is ephemeral and recreated each time the VM is started (i.e., the exposed kernels vaporize each time the VM is stopped). Conceptually, I want to use a VM start-up script to add a persistent data disk path to the JUPYTER_PATH variable, since, according to the documentation, "Jupyter uses a search path to find installable data files, such as kernelspecs and notebook extensions."

During interactive testing in the Terminal, I have not found this to be true. I have also tried setting the data directory variable, but it does not help.
export JUPYTER_PATH=/home/jupyter/envs
export JUPYTER_DATA_DIR=/home/jupyter/envs
I have a beginner's understanding of jupyter and of the important ramifications of using two-disk systems. Could someone please help me understand:
(1) Why is Jupyter failing to search for kernelspec files on the JUPYTER_PATH or in the JUPYTER_DATA_DIR?
(2) If I am mistaken about how the search paths work, what is the best strategy for maintaining virtual environment exposure when Jupyter is installed on an ephemeral boot disk? (Note: I am aware of nb_conda_kernels, which I am specifically avoiding.)
A related post focused on the start-up script can be found at this url. Here I am more interested in the general Jupyter + two-disk use case.
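For what it's worth, one detail that often causes exactly this symptom: Jupyter does not look for kernel.json files at the top level of each JUPYTER_PATH entry, but in a kernels subdirectory beneath it, one folder per kernel. Assuming a hypothetical venv called myenv, the persistent layout would be:

```
/home/jupyter/envs/               <- listed in JUPYTER_PATH
└── kernels/
    └── myenv/
        └── kernel.json           <- argv pointing at the venv's python
```

Running `jupyter --paths` and `jupyter kernelspec list` after exporting the variable shows whether the directory is actually being searched and whether the spec is picked up.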

Connect PySpark session to DataProc

I'm trying to connect a PySpark session running locally to a DataProc cluster. I want to be able to work with files on GCS without downloading them. My goal is to perform ad-hoc analyses using local Spark, then switch to a larger cluster when I'm ready to scale. I realize that DataProc runs Spark on YARN, and I've copied the yarn-site.xml locally. I've also opened an ssh tunnel from my local machine to the DataProc master node and set up port forwarding for the ports identified in the YARN xml. It doesn't seem to be working, though: when I try to create a session in a Jupyter notebook, it hangs indefinitely, with nothing in stdout or the DataProc logs that I can see. Has anyone had success with this?
For anyone interested, I eventually abandoned this approach. I'm instead running Jupyter Enterprise Gateway on the master node, setting up port forwarding, and then launching my notebooks locally to connect to kernel(s) running on the server. It works very nicely so far.
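For anyone wanting to replicate that setup, a sketch of the commands involved; the host name and port are examples, and the exact launch options should be checked against the Enterprise Gateway docs:

```
# On the DataProc master node: start Jupyter Enterprise Gateway
# (listens on port 8888 by default)
jupyter enterprisegateway --ip=0.0.0.0

# On the local machine: forward the gateway port over ssh
ssh -N -L 8888:localhost:8888 user@dataproc-master &

# Launch the local notebook server, delegating kernel launches to the gateway
jupyter notebook --gateway-url=http://localhost:8888
```

The notebook UI then runs locally while every kernel runs on the cluster, which is what makes the GCS access work without downloads.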

How to stress/load test JupyterHub for multiple users?

I followed the tutorial for setting up JupyterHub on an AWS EMR cluster at this link: https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
I got the cluster up and running, but now my question is how do I stress/load test? (i.e. simulate 100 users running through the notebooks simultaneously).
In a classroom setting, I had about 30 users ssh'ed into my cluster running through the notebook exercises, but there was a huge slowdown once more people started executing the code blocks. Some Python library imports took forever, and some exercises stopped working or just hung. CloudWatch showed a network bottleneck.
Basically what I'm asking is how can I go about debugging something like that? What's the best way to simulate multiple users sshing into the EMR cluster, opening up jupyter notebooks and running the code blocks concurrently?
You should look at (and contribute to?) projects like this one, which are meant to load-test JupyterHub and should migrate to the JupyterHub organisation once more polished.
Note that in your case you do not really want to test JupyterHub; you are testing your cluster. Just run N scripts in parallel that import your libraries and run the exercises, and you have your load test.
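The "N scripts in parallel" idea can be sketched like this; run_user_workload is a placeholder name, and the body should be swapped for the imports and code blocks your notebooks actually execute:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def run_user_workload(user_id):
    """One simulated user. Replace the body with the real work,
    e.g. the heavy imports and exercises from the notebooks."""
    start = time.time()
    sum(i * i for i in range(100_000))   # stand-in for actual work
    return user_id, time.time() - start

if __name__ == "__main__":
    n_users = 8   # raise toward 100 to match the classroom scenario
    with ProcessPoolExecutor(max_workers=n_users) as pool:
        for uid, elapsed in pool.map(run_user_workload, range(n_users)):
            print(f"user {uid}: {elapsed:.2f}s")
```

Watching CloudWatch (or dstat/iftop on the nodes) while the per-user timings climb tells you which resource saturates first.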

Should I use VirtualBox on a production server?

I just completed a Vagrant box for a product made by my company. I needed it because we're running the same product on different operating systems. I want to serve sites from inside virtual machines, and I have some questions:
Am I on the right track? Can a virtual machine be used as a production server?
If you say yes:
How should I keep VirtualBox running? Is there a script or something similar to restart it if it crashes?
What happens if somebody accidentally runs "vagrant destroy"? What should I do if I don't want to lose my database and user-uploaded files?
We have import scripts that run at the beginning of every month; sometimes they use 7 GB of RAM (running 1500 lines of MySQL code with many asynchronous instances). Could it be dangerous to run this inside VirtualBox?
Are there any case-study blog posts about this?
Vagrant is mainly for development environments. I personally recommend using a Type 1 (bare-metal) hypervisor; VirtualBox is a desktop virtualization tool (Type 2, running on top of a traditional OS) and is not recommended for production.
AWS is OK: the VMs run as Xen guests, and Xen is on bare metal ;-)
I wouldn't.
The problem with Vagrant + VirtualBox is that these are development instances. I would look at Amazon Web Services for actually deploying your project into the wild.