I normally connect my Databricks notebooks to a cluster.
However, we are now starting to use pools instead.
There are temporary "job" clusters that get attached to our pool.
How can I connect a notebook to a pool? (for example, if I am running a notebook interactively via the browser)
I did not find an obvious answer when reviewing the documentation, such as
https://learn.microsoft.com/en-us/azure/databricks/clusters/instance-pools/
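From reading around, it looks like the relevant piece is the instance_pool_id field in the cluster spec - i.e. creating an interactive cluster that draws its nodes from the pool and then attaching the notebook to that cluster - but I'm not sure whether that is the intended workflow. A rough sketch of that idea (the host, token, pool ID, and runtime version below are placeholders):

```python
# Rough sketch only: create an interactive cluster backed by an existing pool,
# then attach the notebook to that cluster in the UI.
# Hostname, token, pool ID, and runtime version are placeholders.
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi..."            # placeholder personal access token
POOL_ID = "my-pool-id"       # placeholder instance pool ID

cluster_spec = {
    "cluster_name": "interactive-from-pool",
    "spark_version": "13.3.x-scala2.12",   # assumed runtime version
    "instance_pool_id": POOL_ID,           # used instead of node_type_id
    "num_workers": 2,
    "autotermination_minutes": 60,
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())  # returns the new cluster_id
```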
Related
I'm trying to connect a PySpark session running locally to a Dataproc cluster. I want to be able to work with files on GCS without downloading them. My goal is to perform ad-hoc analyses using local Spark, then switch to a larger cluster when I'm ready to scale. I realize that Dataproc runs Spark on YARN, and I've copied yarn-site.xml locally. I've also opened an SSH tunnel from my local machine to the Dataproc master node and set up port forwarding for the ports identified in the YARN XML. It doesn't seem to be working, though: when I try to create a session in a Jupyter notebook, it hangs indefinitely, and there's nothing in stdout or the Dataproc logs that I can see. Has anyone had success with this?
For anyone interested, I eventually abandoned this approach. Instead, I'm running Jupyter Enterprise Gateway on the master node, setting up port forwarding, and launching my notebooks locally so they connect to kernel(s) running on the server. It works very nicely so far.
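Roughly, the setup looks like this (a sketch, not my exact config; the ports, host name, and user are placeholders, and the details depend on your Jupyter versions):

```python
# ~/.jupyter/jupyter_notebook_config.py -- sketch only.
# Assumes Jupyter Enterprise Gateway listens on port 8888 on the Dataproc master
# node and that an SSH tunnel forwards that port locally, e.g.:
#   gcloud compute ssh my-cluster-m -- -N -L 8888:localhost:8888   # placeholder
c = get_config()
c.GatewayClient.url = "http://localhost:8888"   # local end of the tunnel
```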
I want to get the cluster link (or the cluster ID to manually compose the link) inside a running Spark job.
This will be used to print the link in an alerting message, making it easier for engineers to reach the logs.
Is it possible to achieve that in a Spark job running in Databricks?
When a Databricks cluster starts, a number of Spark configuration properties are added. Most of them have names starting with spark.databricks. - you can find all of them in the Environment tab of the Spark UI.
The cluster ID is available as the spark.databricks.clusterUsageTags.clusterId property, and you can get it with:
spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
You can get the workspace host name via the dbutils.notebook.getContext().apiUrl.get call (in Scala), or dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() (in Python).
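Putting the two together in Python, a link could be composed roughly like this. The exact URL layout is an assumption and may vary between workspaces (Azure workspaces, for example, may also need the ?o=&lt;org-id&gt; query parameter):

```python
# Rough sketch: build a link to the cluster page for use in an alert message.
# spark and dbutils are the objects available in a Databricks notebook/job;
# the "#setting/clusters/<id>/sparkUi" URL layout is an assumption.
cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
workspace_url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()

cluster_link = f"{workspace_url}/#setting/clusters/{cluster_id}/sparkUi"
print(f"See cluster logs at: {cluster_link}")
```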
I followed the tutorial for setting up JupyterHub on an AWS EMR cluster at this link: https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
I got the cluster up and running, but now my question is: how do I stress/load test it? (i.e. simulate 100 users running through the notebooks simultaneously).
In a classroom setting, I had about 30 users SSHed into my cluster running through the notebook exercises, but there was a huge slowdown once more people started executing the code blocks in the notebooks. Some Python library imports took forever, and some exercises stopped working or just hung. CloudWatch showed that there was a network bottleneck.
Basically what I'm asking is: how can I go about debugging something like that? What's the best way to simulate multiple users SSHing into the EMR cluster, opening Jupyter notebooks, and running the code blocks concurrently?
You should look at (and contribute to?) projects like this one, which are meant to load-test JupyterHub and should migrate to the JupyterHub organisation once more polished.
Note that in your case you are not really trying to test JupyterHub; you are testing your cluster. Just run N scripts in parallel that import your library and you have your load test.
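For example, a crude version of that in Python (the imports and the sample workload are placeholders for whatever your notebook exercises actually do):

```python
# Crude load test: simulate N "users" each importing heavy libraries and doing
# a bit of work at the same time. Run this on (or against) the cluster.
import time
from concurrent.futures import ProcessPoolExecutor

N_USERS = 100

def one_user(user_id):
    start = time.time()
    import numpy as np          # stand-in for the slow library imports
    import pandas as pd
    df = pd.DataFrame(np.random.rand(100_000, 10))  # stand-in exercise
    df.describe()
    return user_id, time.time() - start

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=N_USERS) as pool:
        for uid, elapsed in pool.map(one_user, range(N_USERS)):
            print(f"user {uid}: {elapsed:.1f}s")
```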
I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
However, since I am using Azure HDInsight to deploy my YARN cluster, I don't think I can log into all of the machines and raise the ulimit on each one.
Is there any other way to solve this problem? I cannot reduce the number of reducers by much either, or my job will become much slower.
You can SSH into all nodes from the head node (the Ambari UI shows the FQDNs of all nodes).
ssh sshuser@nameofthecluster.azurehdinsight.net
You can then write a custom script action that alters the settings on the necessary nodes if you want to automate this.
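A rough sketch of automating that from the head node with Python (the node FQDNs, user name, and limit value are placeholders - take the real FQDNs from Ambari - and it assumes working SSH keys and passwordless sudo; running services also need a restart to pick up the new limit):

```python
# Sketch only: raise the open-files limit on each worker node from the head node.
import subprocess

worker_nodes = [
    "wn0-mycluster.internal.cloudapp.net",   # placeholder FQDNs from Ambari
    "wn1-mycluster.internal.cloudapp.net",
]

limits_line = "* - nofile 65536"             # placeholder limit value

for node in worker_nodes:
    remote_cmd = f"echo '{limits_line}' | sudo tee -a /etc/security/limits.conf"
    subprocess.run(["ssh", f"sshuser@{node}", remote_cmd], check=True)
    print(f"updated {node}")
```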
Is it possible to run a notebook server with kernels scheduled as processes on a remote cluster (via SSH or PBS), with a common directory on NFS?
For example, I have three servers with GPUs and would like to run a notebook on one of them, but I would rather not start more than one notebook server. It would be ideal to have the notebook server on a fourth machine which would schedule kernels automatically or manually.
I did some trials with making a cluster with one engine. Using %%px in each cell is almost a solution, but one cannot use introspection, and the notebook code ends up depending on the cluster configuration, which is not very good.
This is not possible with the notebook at this time. The notebook cannot use a kernel that it did not start.
You could possibly write a new KernelManager that starts kernels remotely distributed across your machines and plug that into the notebook server, but you cannot attach an Engine or other existing kernel to the notebook server.
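To give a flavour of what that would involve, here is a sketch of the idea only (it assumes jupyter_client's KernelManager, whose internals have changed across versions, and it glosses over the hard parts: picking a host, making the connection file visible on it, and reaching the kernel's ports):

```python
# Sketch only: a KernelManager that launches the kernel process over SSH instead
# of locally. Host selection, connection-file transfer, and port forwarding are
# NOT handled here.
from jupyter_client import KernelManager

class SSHKernelManager(KernelManager):
    remote_host = "gpu-node-1"   # placeholder; a real version would pick a host

    def _launch_kernel(self, kernel_cmd, **kw):
        # Prepend ssh so the kernel process starts on the remote machine.
        # The connection file referenced in kernel_cmd must be visible there
        # (e.g. via the shared NFS directory) and its ports reachable.
        return super()._launch_kernel(["ssh", self.remote_host] + kernel_cmd, **kw)
```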