Connect PySpark session to DataProc - pyspark

I'm trying to connect a PySpark session running locally to a DataProc cluster. I want to be able to work with files on GCS without downloading them. My goal is to perform ad-hoc analyses using local Spark, then switch to a larger cluster when I'm ready to scale. I realize that DataProc runs Spark on YARN, and I've copied the yarn-site.xml to my local machine. I've also opened an SSH tunnel from my local machine to the DataProc master node and set up port forwarding for the ports identified in the YARN XML. It doesn't seem to be working, though: when I try to create a session in a Jupyter notebook, it hangs indefinitely. I can't see anything in stdout or the DataProc logs. Has anyone had success with this?

For anyone interested, I eventually abandoned this approach. I'm instead running Jupyter Enterprise Gateway on the master node, setting up port forwarding, and then launching my notebooks locally to connect to kernel(s) running on the server. It works very nicely so far.

Related

Unable to run spark-submit on a spark cluster running in docker container

I have a Spark cluster running on Docker with the following containers:
spark-master
three spark-workers (spark-worker-1, spark-worker-2, spark-worker-3)
To set up the cluster, I followed the instructions at:
https://github.com/big-data-europe/docker-spark
Now I want to launch a Spark application that runs on this cluster. For this I am using bde2020/spark-scala-template, following the instructions at:
https://github.com/big-data-europe/docker-spark/tree/master/template/scala
But when I run the jar file, it runs on the Spark master inside the bde2020/spark-scala-template image and not on the master of my cluster, which is running in a different container.
How can I make it run on my cluster's master? I am badly stuck.

GCP Dataproc: Directly working with Spark over Yarn Cluster

I'm trying to minimize changes in my code so I'm wondering if there is a way to submit a spark-streaming job from my personal PC/VM as follows:
spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
[options] <app jar> [app options]
without using GCP SDK.
I also have to specify a directory with configuration files HADOOP_CONF_DIR which I was able to download from Ambari.
Is there a way to do the same?
Thank you
Setting up an external machine as a YARN client node is generally difficult to do and not a workflow that will work easily with Dataproc.
In a comment you mention that what you really want to do is:
Submit a Spark job to the Dataproc cluster.
Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?).
The script has dependencies that mean it cannot run inside of the Dataproc master node.
Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, you can configure your network such that the Spark driver (running within Dataproc) has access to the service/script you need to run, and then invoke it when desired.
If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.
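As a rough sketch of the "invoke it when desired" step, the driver could send a small HTTP request to the external service each time a batch finishes. Everything named here is an assumption for illustration: the service URL, the JSON field names, and the idea that the service speaks HTTP at all are not from the question. Only the Python standard library is used.

```python
import json
import urllib.request


def build_batch_notification(service_url, batch_time_ms, num_records):
    """Build (but do not send) a POST request describing one finished batch.

    The payload fields are hypothetical; use whatever your service expects.
    """
    payload = json.dumps(
        {"batchTimeMs": batch_time_ms, "numRecords": num_records}
    ).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_notification(request, timeout_seconds=5.0):
    """Send the request from the driver and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.status
```

In a streaming job you would call send_notification(build_batch_notification(...)) from whatever hook fires at the end of a batch (for example the onBatchCompleted callback mentioned above), with the URL pointing at the VM that hosts your script.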

How to stress/load test JupyterHub for multiple users?

I followed the tutorial for setting up JupyterHub on an AWS EMR cluster at this link: https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
I got the cluster up and running, but now my question is how do I stress/load test? (i.e. simulate 100 users running through the notebooks simultaneously).
In a classroom setting, I had about 30 users SSHed into my cluster running through the notebook exercises, but there was a huge slowdown when more people started executing the code blocks in the notebooks. Some Python library imports took forever, and some exercises stopped working or were just hanging. CloudWatch showed that there was a network bottleneck.
Basically what I'm asking is how can I go about debugging something like that? What's the best way to simulate multiple users sshing into the EMR cluster, opening up jupyter notebooks and running the code blocks concurrently?
You should look at (and contribute to?) projects like this one, which are meant to load-test JupyterHub and should migrate to the JupyterHub organisation once more polished.
Note that in your case you are not really testing JupyterHub; you are testing your cluster. Just run N scripts in parallel that import your library and you have your load test.
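A minimal sketch of that "N scripts in parallel" idea, assuming plain Python on one of the cluster nodes. Each simulated user is a fresh interpreter process that does nothing but import the module, and the wall-clock time of each one is recorded. The function names are made up for this example, and the module name in the usage note is a placeholder for whatever your notebooks import.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def time_import(module_name: str) -> float:
    """Start a fresh Python process that imports the module; return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module_name}"], check=True)
    return time.perf_counter() - start


def load_test(module_name: str, n_users: int) -> List[float]:
    """Run n_users imports concurrently, one interpreter process per simulated user."""
    # Threads only wait on the child processes, so the real work runs in parallel.
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        return list(pool.map(time_import, [module_name] * n_users))
```

For example, timings = load_test("pandas", 100) followed by max(timings) gives the slowest import among 100 simultaneous "users". Comparing that against load_test("pandas", 1) shows how much the import degrades under contention.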

Spark fails with too many open files on HDInsight YARN cluster

I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
But I am using Azure HDInsight to deploy my YARN cluster, and I don't think I can log into the machines and update the ulimit on all of them.
Is there any other way to solve this problem? I cannot reduce the number of reducers by too much either, or my job will become much slower.
You can SSH into all nodes from the head node (the Ambari UI shows the FQDN of every node).
ssh sshuser@nameofthecluster.azurehdinsight.net
You can then write a custom action that alters the settings on the necessary nodes if you want to automate this.
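Before and after changing anything, it can help to confirm which limit a node actually applies. This is a minimal sketch using Python's standard resource module (Unix only) that reads the open-file limits of the current process. Note that it reports the limits of the shell you run it from, which may differ from those of the YARN containers, so treat it as a sanity check rather than proof.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft} hard={hard}")
```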

Apache spark in cluster mode where to run the jobs. In Master or in worker node?

I have installed Spark in cluster mode, with 1 master and 2 workers. When I start spark-shell on the master node, it keeps running without ever giving me the Scala shell.
But when I run spark-shell on a worker node, I get the Scala shell and am able to run jobs.
val file = sc.textFile("hdfs://192.168.1.20:9000/user/1gbdata")
file.count()
For this I got the output.
So my question is: where should I actually run Spark jobs?
Is it on the worker nodes?
Based on the documentation, you need to connect your spark-shell to the master node with the following command: spark-shell --master spark://IP:PORT. This URL can be retrieved from the master's UI or log file.
You should be able to launch spark-shell on the master node (machine). Check the UI to see whether the spark-shell is actually running and whether the prompt is shown (you might need to press Enter after issuing spark-shell).
Please note that when you use spark-submit in cluster mode, the driver is launched on one of the worker nodes, whereas in client mode it runs in the client process that submitted the application. Refer to the documentation for more details.
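To make the difference between these invocations concrete, here is a small sketch that assembles the command lines. The --master, --deploy-mode and --class flags are real spark-shell/spark-submit options, and 7077 is the default port of a standalone master, but the helper names are made up for this example and any host, jar path or class name you pass in is your own; none of them come from the question.

```python
def spark_master_url(host, port=7077):
    """Build a standalone master URL; 7077 is the standalone default port."""
    return f"spark://{host}:{port}"


def spark_shell_command(master_url):
    """Arguments for an interactive shell attached to the given master."""
    return ["spark-shell", "--master", master_url]


def spark_submit_command(master_url, main_class, app_jar, deploy_mode="client"):
    """Arguments for spark-submit in either client or cluster deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--class", main_class,
        "--master", master_url,
        "--deploy-mode", deploy_mode,
        app_jar,
    ]
```

Passing the resulting list to subprocess.run on a machine with Spark installed starts the shell or submits the job. With deploy_mode="cluster" the driver ends up on a worker, while with "client" it stays in the process you launched.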