amazon emr jupyterhub and spark cluster; notebook has no autocomplete - pyspark

The pyspark3, pyspark, and spark kernels in the JupyterHub Docker container on Amazon EMR do not seem to allow autocomplete of function names or the docstring pop-up (Shift-Tab). Has anyone else noticed this behaviour?
I launched a cluster with jupyterhub and spark.
I created a new notebook for pyspark or pyspark3.
It seems to be using conda inside the Docker container. I have tried upgrading everything, but that just breaks things.

Using EMR 5.33.1, JupyterHub 1.1.0, and Spark 2.4.7, tab suggestions work for me when the pyspark kernel's notebook is set to "Trusted".
I believe tab suggestions are not enabled by default because the notebook is considered code "the user opened but did not execute": https://jupyter-notebook.readthedocs.io/en/stable/security.html
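If it helps, a notebook can also be marked as trusted programmatically rather than through the UI. A minimal sketch, assuming the nbformat package that ships with Jupyter and a made-up notebook filename (the jupyter trust <notebook>.ipynb command does the same thing):
# Sketch: sign a notebook so it is treated as trusted and completion/tooltips work again.
import nbformat
from nbformat.sign import NotebookNotary

path = "my_pyspark_notebook.ipynb"  # hypothetical filename
nb = nbformat.read(path, as_version=4)

notary = NotebookNotary()
notary.sign(nb)  # record the notebook's signature in the trust database
print("trusted:", notary.check_signature(nb))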

Related

Use pyspark on yarn cluster without creating the context

I'll do my best to explain myself. I'm using JupyterHub to connect to my university's cluster and write some code. Basically I'm using pyspark, but since I've always used the "yarn kernel" (I'm not sure of what I'm saying) I've never defined the Spark context or the Spark session myself. Now, for some reason, it doesn't work anymore, and when I try to use spark this error appears:
Code:
df = spark.read.csv('file:///%s/.....
Error:
name 'spark' is not defined
It has already happened to me before, but I solved it just by installing another version of pyspark. Now I don't know what to do.
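For what it's worth, when the kernel no longer provides a pre-built spark object, the session can be created explicitly. A minimal sketch, assuming a YARN cluster manager and placeholder app name and file path:
# Sketch: build the Spark session by hand instead of relying on the kernel to inject `spark`.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")                  # assumption: the cluster manager is YARN
    .appName("jupyterhub-notebook")  # hypothetical application name
    .getOrCreate()
)

df = spark.read.csv("file:///path/to/data.csv", header=True)  # placeholder path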

Connect PySpark session to DataProc

I'm trying to connect a PySpark session running locally to a DataProc cluster. I want to be able to work with files on GCS without downloading them. My goal is to perform ad-hoc analyses using local Spark, then switch to a larger cluster when I'm ready to scale. I realize that DataProc runs Spark on YARN, and I've copied the yarn-site.xml over locally. I've also opened an SSH tunnel from my local machine to the DataProc master node and set up port forwarding for the ports identified in the YARN XML. It doesn't seem to be working, though: when I try to create a session in a Jupyter notebook it hangs indefinitely, and there is nothing in stdout or the DataProc logs that I can see. Has anyone had success with this?
For anyone interested, I eventually abandoned this approach. I'm instead running Jupyter Enterprise Gateway on the master node, setting up port forwarding, and then launching my notebooks locally to connect to kernel(s) running on the server. It works very nicely so far.
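For reference, once a session is up (whether local or via a remote kernel), reading a GCS object directly looks roughly like the sketch below; it assumes the GCS connector is on the Spark classpath, as it is on DataProc itself, and the bucket and path are placeholders:
# Sketch: read data straight from GCS without downloading it first.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-adhoc").getOrCreate()  # hypothetical app name
df = spark.read.parquet("gs://my-bucket/events/")  # placeholder bucket/path
df.printSchema()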

Unable to run spark-submit on a spark cluster running in docker container

I have a Spark cluster set up on Docker, in which the following things are running:
spark-master
three spark-workers (spark-worker-1, spark-worker-2, spark-worker-3)
To set up the Spark cluster I followed the instructions at:
https://github.com/big-data-europe/docker-spark
Now I want to launch a Spark application that can run on this cluster. For this, I am using bde2020/spark-scala-template and following the instructions at:
https://github.com/big-data-europe/docker-spark/tree/master/template/scala
But when I try to run the jar file, it starts running against the Spark master present in the bde2020/spark-scala-template image and not against the master of my cluster running in a different container.
Please help me to do that. I'm stuck very badly.
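As a rough illustration of the idea (pointing the application at the cluster's master rather than the one baked into the template image), the sketch below shows the equivalent in PySpark; spark://spark-master:7077 is the docker-spark default service name and port and may differ in your compose file, and with the Scala template the same thing is done by passing --master spark://spark-master:7077 to spark-submit:
# Sketch: target the standalone master running in the docker-spark cluster explicitly.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # assumption: docker-spark default master URL
    .appName("cluster-submit-test")       # hypothetical application name
    .getOrCreate()
)
print(spark.sparkContext.master)  # confirm which master the app actually connected to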

Using LD_LIBRARY_PATH in Cloud Dataproc Pyspark

I've set up a highly customized virtual environment on Cloud Dataproc. Some of the libraries in this virtual environment depend on certain shared libraries, which are packaged along with the virtual environment.
For the virtual environment, I made PYSPARK_PYTHON point to the Python interpreter inside it.
However, these libraries cannot work because LD_LIBRARY_PATH is not set when I do gcloud dataproc jobs submit....
I've tried:
(1) Setting spark-env.sh on the workers and master to export LD_LIBRARY_PATH
(2) Setting spark.executorEnv.LD_LIBRARY_PATH
(3) Creating an initialization script where (1) is added during cluster initialization
However, all of these fail.
This is what finally worked:
Running the gcloud command as:
gcloud dataproc jobs submit pyspark --cluster spark-tests spark_job.py --properties spark.executorEnv.LD_LIBRARY_PATH="path1:path2"
When I tried to set spark.executorEnv.LD_LIBRARY_PATH inside the pyspark script (using the SparkConf object), it didn't work, though. I'm not sure why that is.
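For comparison, setting the same property from inside the script would look like the sketch below (the colon-separated paths are placeholders); as noted above this did not take effect for me, possibly because the property has to be in place before the context is created, so treat it as a sketch rather than a confirmed fix:
# Sketch: set spark.executorEnv.LD_LIBRARY_PATH from the PySpark script itself,
# mirroring the --properties flag that worked on the command line.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.executorEnv.LD_LIBRARY_PATH", "path1:path2")  # placeholder paths

spark = (
    SparkSession.builder
    .config(conf=conf)
    .appName("ld-library-path-test")  # hypothetical application name
    .getOrCreate()
)
print(spark.sparkContext.getConf().get("spark.executorEnv.LD_LIBRARY_PATH"))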

Apache Spark in cluster mode: where do the jobs run, on the master or on a worker node?

I have installed Spark in cluster mode, with 1 master and 2 workers. When I start spark-shell on the master node, it keeps running continuously without ever giving me the Scala shell.
But when I run spark-shell on a worker node, I get the Scala shell and am able to run jobs.
val file = sc.textFile("hdfs://192.168.1.20:9000/user/1gbdata")
file.count()
And for this I got the output.
So my doubt is where the Spark jobs should actually be run.
Is it on the worker nodes?
Based on the documentation, you need to connect your spark-shell to the master node with the following command: spark-shell --master spark://IP:PORT. This URL can be retrieved from the master's UI or log file.
You should be able to launch the spark-shell on the master node (machine), make sure to check out the UI to see if the spark-shell is effectively running and that the prompt is shown (you might need to press enter on your keyboard after issuing spark-shell).
Please note that when you are using spark-submit in cluster mode, the driver will be submitted directly from one of the worker nodes, contrary to client mode where it will run as a client process. Refer to the documentation for more details.
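The same applies when driving the cluster from PySpark rather than spark-shell; a minimal sketch, with the master URL left as the placeholder from the command above and the HDFS path taken from the question:
# Sketch: connect a PySpark driver to the standalone master, the PySpark
# equivalent of `spark-shell --master spark://IP:PORT`.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://IP:PORT")  # placeholder: take the real URL from the master's UI or logs
    .appName("count-example")   # hypothetical application name
    .getOrCreate()
)

file = spark.sparkContext.textFile("hdfs://192.168.1.20:9000/user/1gbdata")
print(file.count())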